# Megatron-DeepSpeed
Megatron-DeepSpeed is a scalable, highly performant library for training large language models on any GPU.
In particular, it retains the core 4D parallelism functionality of the NVIDIA/Megatron-LM library, while leveraging the microsoft/DeepSpeed library for efficient scaling and 🍋 saforem2/ezpz for automated device + backend selection.
## Getting Started
1. Clone the `argonne-lcf/Megatron-DeepSpeed` repository:
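   Assuming the repository lives at its default GitHub location:

   ```bash
   # Clone the argonne-lcf fork of Megatron-DeepSpeed and enter it
   git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
   cd Megatron-DeepSpeed
   ```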
2. Set up your environment:
   ```bash
   export PBS_O_WORKDIR=$(pwd)
   source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
   ezpz_setup_env
   ```

   **[Optional] Set up WandB**
   To enable Weights & Biases (WandB) logging, we need to install it and log in:
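   A minimal sketch of that setup, assuming the standard `wandb` package from PyPI and its CLI:

   ```bash
   # Install the wandb client and authenticate (prompts for an API key)
   python3 -m pip install wandb
   wandb login
   ```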
   > **NOTE:** WandB can be disabled by setting `export WANDB_DISABLED=1`.
   > See the wandb Quickstart for additional information.

3. Install dependencies:
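   The exact installation is system specific; a sketch, assuming the dependencies are captured in a requirements file at the repository root (consult the repository for the ALCF-specific steps):

   ```bash
   # Sketch only: install Python dependencies into the active virtual environment
   python3 -m pip install --upgrade pip
   python3 -m pip install -r requirements.txt
   ```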
4. Launch training:
   ```bash
   # Before launching, `PBS_O_WORKDIR` should be set to Megatron-DeepSpeed's PATH
   # and venv inside Megatron-DeepSpeed/venv should be activated.
   TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
   ```

   This will launch a distributed pre-training run with:
   - `NLAYERS=10`: Llama-style model consisting of 10 layers
   - `TP=2`: Split across 2 Tensor Parallel groups
   - `DATA_FILE_LIST`: Using the Books corpus of the Dolma dataset
**Overridable Options**

This is a simple subset of the overridable options.
The full list (as well as their default values) can be found in `ALCF/helpers.sh`.

- `DTYPE`: Data type
- `DATA_FILE_LIST`: Data file list
- `FFN_HIDDEN_SIZE`: Feedforward Neural Network projection size
- `GRAD_ACC_STEPS`: Gradient accumulation steps
- `HEADS`: Number of attention heads
- `HIDDEN`: Hidden size
- `MICRO_BATCH`: Micro batch size
- `NO_FLASH_ATTN`: No Flash Attention
- `NLAYERS`: Number of layers
- `NUM_KV_HEAD`: Number of key-value heads
- `OPT`: Optimizer, one of:
    - `adam`
    - `adam8bit`
    - `adamw`
    - `adamwschedulefree`
    - `apex.adam`
    - `apex.sgd`
    - `ds.fusedlamb`
    - `ds.onebitlamb`
    - `galoreadamw`
    - `galoreadamw8bit`
    - `galoreadamw8bitperlayer`
    - `ipex.fusedlamb`
    - `ipex.lamb`
    - `shampoo`
    - `sgd`
    - `sgdschedulefree`
    - `sophiag`
- `PP`: Pipeline parallelism degree
- `SEQ`: Sequence length
- `SP`: Sequence parallelism (Ulysses) degree
- `TP`: Tensor parallelism degree
- `TRAIN_TOKENS`: Number of training tokens
- `TRAIN_ITERS`: Number of training iterations
- `USE_ACTIVATION_CHECKPOINTING`: Use activation checkpointing
- `WEIGHT_DECAY`: Weight decay
- `ZERO_STAGE`: ZeRO stage
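As an illustration only (the values below are hypothetical, not recommendations), several of these options can be combined on a single launch line, following the same pattern as the training command above:

```bash
# Hypothetical override values for a small experimental run
DTYPE=bf16 \
  MICRO_BATCH=2 \
  GRAD_ACC_STEPS=8 \
  SEQ=4096 \
  TP=2 \
  PP=1 \
  ZERO_STAGE=1 \
  DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt \
  bash train_aGPT_7B.sh
```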