Megatron-DeepSpeed

Megatron-DeepSpeed is a scalable, highly performant library for training large language models on any GPU¹.

In particular, it retains the core 4D parallelism² functionality of the NVIDIA / Megatron-LM library, while leveraging the microsoft / DeepSpeed library for efficient scaling and 🍋 saforem2 / ezpz for automated device + backend selection.

Getting Started

  1. Clone the argonne-lcf / Megatron-DeepSpeed repository:

    git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
    cd Megatron-DeepSpeed
    
  2. Set up your environment:

    export PBS_O_WORKDIR=$(pwd)
    source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
    ezpz_setup_env
    
    [Optional] Set up WandB

    To enable Weights & Biases (WandB) logging, install the wandb package and log in:

    python3 -m pip install wandb --upgrade
    wandb login
    

    NOTE: WandB can be disabled by setting export WANDB_DISABLED=1

    See wandb: Quickstart for additional information
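
    If you want runs grouped under a specific project, the standard WandB environment variables can also be exported before launching; the values here are purely illustrative:

    export WANDB_PROJECT=Megatron-DS-training  # example project name, not a required value
    export WANDB_MODE=offline                  # log locally, sync later with `wandb sync`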

  3. Install dependencies:

    1. 🍋 saforem2 / ezpz:
    python3 -m pip install -e "git+https://github.com/saforem2/ezpz#egg=ezpz" --require-virtualenv
    
    2. microsoft / DeepSpeed:
    python3 -m pip install deepspeed
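
    As an optional sanity check, confirm that both packages import cleanly and inspect which DeepSpeed ops are available for your accelerator:

    python3 -c 'import ezpz, deepspeed; print(deepspeed.__version__)'
    ds_report  # DeepSpeed's environment and op-compatibility report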
    
  4. Launch training:

    # Before launching, `PBS_O_WORKDIR` should point to the Megatron-DeepSpeed directory
    # and the venv inside Megatron-DeepSpeed/venv should be activated.
    TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
    

    This will launch a distributed pre-training run with:

    • NLAYERS=10: a Llama-style model with 10 layers

    • TP=2: Split across 2 Tensor Parallel groups

    • DATA_FILE_LIST: Using the Books corpus of the Dolma dataset

    Overridable Options

    This is a small subset of the overridable options.

    The full list (along with default values) can be found in ALCF / helpers.sh; an example combining several of these options follows the list below.

    • DTYPE: Data type
    • DATA_FILE_LIST: Data file list
    • FFN_HIDDEN_SIZE: Feedforward Neural Network projection size
    • GRAD_ACC_STEPS: Gradient accumulation steps
    • HEADS: Number of attention heads
    • HIDDEN: Hidden size
    • MICRO_BATCH: Micro batch size
    • NO_FLASH_ATTN: Disable Flash Attention
    • NLAYERS: Number of layers
    • NUM_KV_HEAD: Number of key-value heads
    • OPT: Optimizer
      • adam
      • adam8bit
      • adamw
      • adamwschedulefree
      • apex.adam
      • apex.sgd
      • ds.fusedlamb
      • ds.onebitlamb
      • galoreadamw
      • galoreadamw8bit
      • galoreadamw8bitperlayer
      • ipex.fusedlamb
      • ipex.lamb
      • shampoo
      • sgd
      • sgdschedulefree
      • sophiag
    • PP: Pipeline parallelism degree
    • SEQ: Sequence length
    • SP: Sequence parallelism (Ulysses) degree
    • TP: Tensor parallelism degree
    • TRAIN_TOKENS: Number of training tokens
    • TRAIN_ITERS: Number of training iterations
    • USE_ACTIVATION_CHECKPOINTING: Use activation checkpointing
    • WEIGHT_DECAY: Weight decay
    • ZERO_STAGE: ZeRO stage
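
    As a sketch, several of these can be combined on the same launch line. The values below are illustrative placeholders (not tuned recommendations); check ALCF / helpers.sh for defaults and valid choices:

    # Hypothetical example: every value here is for illustration only
    TP=4 PP=2 NLAYERS=32 HIDDEN=4096 HEADS=32 NUM_KV_HEAD=8 SEQ=4096 \
      MICRO_BATCH=1 GRAD_ACC_STEPS=8 ZERO_STAGE=1 DTYPE=bf16 \
      DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt \
      bash train_aGPT_7B.sh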

  1. Megatron-DeepSpeed is designed to work on any GPU, including NVIDIA GPUs (NCCL), AMD GPUs (RCCL), and Intel XPUs (oneCCL).

  2. 4D parallelism refers to data (DP), tensor (TP), pipeline (PP), and sequence (SP) parallelism degrees of freedom. For example, a run on 24 ranks with TP=2, PP=2, and SP=1 has 24 / (2 × 2) = 6-way data parallelism.