# Megatron-DeepSpeed
Megatron-DeepSpeed is a scalable, highly performant library for training large language models on any GPU.
In particular, it retains the core 4D parallelism functionality of the NVIDIA/Megatron-LM library, while leveraging the microsoft/DeepSpeed library for efficient scaling and 🍋 saforem2/ezpz for automated device + backend selection.
## Getting Started
1. Clone the `argonne-lcf/Megatron-DeepSpeed` repository:
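   Assuming the repository lives at its default GitHub location:

   ```bash
   # Clone the argonne-lcf fork of Megatron-DeepSpeed and enter it
   git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
   cd Megatron-DeepSpeed
   ```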
2. Set up your environment:
   ```bash
   export PBS_O_WORKDIR=$(pwd)
   source <(curl -s https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh)
   ezpz_setup_env
   ```

   **[Optional] Set up WandB**
   To enable Weights & Biases (WandB) logging, we need to install it and log in:
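   A minimal sketch of that setup, assuming the standard `wandb` package from PyPI and its CLI:

   ```bash
   # Install the wandb client and authenticate (prompts for an API key)
   python3 -m pip install wandb
   wandb login
   ```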
   > **NOTE:** WandB can be disabled by setting `export WANDB_DISABLED=1`.
   > See the wandb Quickstart for additional information.

3. Install dependencies:
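   The exact installation is system specific; a sketch, assuming the dependencies are captured in a requirements file at the repository root (consult the repository for the ALCF-specific steps):

   ```bash
   # Sketch only: install Python dependencies into the active virtual environment
   python3 -m pip install --upgrade pip
   python3 -m pip install -r requirements.txt
   ```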
4. Launch training:
   ```bash
   # Before launching, `PBS_O_WORKDIR` should be set to Megatron-DeepSpeed's PATH
   # and venv inside Megatron-DeepSpeed/venv should be activated.
   TP=2 NLAYERS=10 DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt bash train_aGPT_7B.sh
   ```

   This will launch a distributed pre-training run with:
   - `NLAYERS=10`: Llama-style model consisting of 10 layers
   - `TP=2`: Split across 2 Tensor Parallel groups
   - `DATA_FILE_LIST`: Using the Books corpus of the Dolma dataset
**Overridable Options**

This is a simple subset of the overridable options.
The full list (as well as their default values) can be found in `ALCF/helpers.sh`.

- `DTYPE`: Data type
- `DATA_FILE_LIST`: Data file list
- `FFN_HIDDEN_SIZE`: Feedforward Neural Network projection size
- `GRAD_ACC_STEPS`: Gradient accumulation steps
- `HEADS`: Number of attention heads
- `HIDDEN`: Hidden size
- `MICRO_BATCH`: Micro batch size
- `NO_FLASH_ATTN`: No Flash Attention
- `NLAYERS`: Number of layers
- `NUM_KV_HEAD`: Number of key-value heads
- `OPT`: Optimizer, one of:
    - `adam`
    - `adam8bit`
    - `adamw`
    - `adamwschedulefree`
    - `apex.adam`
    - `apex.sgd`
    - `ds.fusedlamb`
    - `ds.onebitlamb`
    - `galoreadamw`
    - `galoreadamw8bit`
    - `galoreadamw8bitperlayer`
    - `ipex.fusedlamb`
    - `ipex.lamb`
    - `shampoo`
    - `sgd`
    - `sgdschedulefree`
    - `sophiag`
- `PP`: Pipeline parallelism degree
- `SEQ`: Sequence length
- `SP`: Sequence parallelism (Ulysses) degree
- `TP`: Tensor parallelism degree
- `TRAIN_TOKENS`: Number of training tokens
- `TRAIN_ITERS`: Number of training iterations
- `USE_ACTIVATION_CHECKPOINTING`: Use activation checkpointing
- `WEIGHT_DECAY`: Weight decay
- `ZERO_STAGE`: ZeRO stage
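As an illustration only (the values below are hypothetical, not recommendations), several of these options can be combined on a single launch line, following the same pattern as the training command above:

```bash
# Hypothetical override values for a small experimental run
DTYPE=bf16 \
  MICRO_BATCH=2 \
  GRAD_ACC_STEPS=8 \
  SEQ=4096 \
  TP=2 \
  PP=1 \
  ZERO_STAGE=1 \
  DATA_FILE_LIST=ALCF/data-lists/aurora/books.txt \
  bash train_aGPT_7B.sh
```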