Example Multi-Node Programs
SambaNova provides examples of some well-known AI applications under the path /opt/sambaflow/apps/starters on both SambaNova compute nodes.
Copy the starters to your personal directory structure if you have not already done so.
cd ~/
mkdir -p apps
cp -r /opt/sambaflow/apps/starters apps/starters
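The "if you have not already done so" condition can be made explicit so that the copy is safe to repeat. The following is a minimal sketch; copy_once is a hypothetical helper name and is not part of SambaFlow.

```shell
#!/bin/bash
# copy_once is a hypothetical helper, not part of SambaFlow.
# It copies a directory tree only when the destination does not exist yet.
copy_once() {
    local src="$1" dst="$2"
    if [ -e "${dst}" ]; then
        echo "Skipping: ${dst} already exists"
        return 0
    fi
    # Create the parent directory (for example ~/apps) if needed.
    mkdir -p "$(dirname "${dst}")"
    cp -r "${src}" "${dst}"
}

# Possible use on a SambaNova node:
#   copy_once /opt/sambaflow/apps/starters ~/apps/starters
```

Because an existing destination is left untouched, local changes made to the copied examples are not overwritten on a second run.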
UNet
Set-up
Copy files and change directory if you have not already done so.
cp -r /opt/sambaflow/apps/image ~/apps/image
cd ~/apps/image
cp /software/sambanova/apps/image/pytorch/unet/*.sh .
You just copied two bash scripts. They are:
- unet_all.sh - Compiles UNet and then submits a batch job to run the model.
- unet_batch.sh - Runs UNet.
Unet All
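Here is a breakdown of unet_all.sh.
The -x option on the first line tells bash to display each command as it is executed.
The second line, set -e, stops the script at the first error.
Lastly, the bash variable SECONDS is set to zero so that it can report the elapsed time at the end of the script.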
#! /bin/bash -x
set -e
#
# Usage: ./unet_all.sh 256 256
#
SECONDS=0
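To see what set -e changes, the sketch below runs the same three commands in a child bash process, once without it and once with it. It is only an illustration and is not part of unet_all.sh.

```shell
#!/bin/bash
# Without set -e, bash continues past the failing command (false).
without=$(bash -c 'echo before; false; echo after')

# With set -e, bash stops at the first failing command, so "after" is
# never printed and the child process exits with a non-zero status.
with_status=0
with=$(bash -c 'set -e; echo before; false; echo after') || with_status=$?

echo "without set -e: $(printf '%s' "${without}" | tr '\n' ' ')"
echo "with set -e:    ${with} (exit status ${with_status})"
```

In a long compile-then-submit script this matters because a failed compile would otherwise be followed by a job submission that cannot succeed.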
Set variables.
# IMage size.
IM=${1}
# Batch Size
BS=${2}
NUM_WORKERS=1
export OMP_NUM_THREADS=16
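The script reads the image size and batch size from its first two positional arguments without checking them. A minimal validation sketch follows; check_args is a hypothetical helper and does not appear in the original script.

```shell
#!/bin/bash
# check_args is a hypothetical helper, not part of the original script.
# It succeeds only when given exactly two positive integers
# (image size, then batch size).
check_args() {
    if [ "$#" -ne 2 ]; then
        echo "Usage: ./unet_all.sh <image_size> <batch_size>" >&2
        return 1
    fi
    local arg
    for arg in "$@"; do
        case "${arg}" in
            # Reject empty values, anything with a non-digit, and leading zeros.
            ''|*[!0-9]*|0*)
                echo "Not a positive integer: '${arg}'" >&2
                return 1
                ;;
        esac
    done
}

# Possible use before IM and BS are assigned:
#   check_args "$@" || exit 1
```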
Activate the virtual environment and set the UNet directory.
source /opt/sambaflow/venv/bin/activate
UNET=$(pwd)/unet
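If the activation step fails silently, later python commands may use the wrong interpreter. Standard Python venv activate scripts export VIRTUAL_ENV, so a small check is possible. This is a sketch; in_venv is a hypothetical helper name.

```shell
#!/bin/bash
# in_venv is a hypothetical helper. Standard Python venv "activate" scripts
# export VIRTUAL_ENV, so a non-empty value suggests activation succeeded.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Possible use right after the "source .../activate" line:
#   in_venv || { echo "virtual environment not active" >&2; exit 1; }
```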
Display model name and time.
echo "Model: UNET"
echo "Date: " $(date +%m/%d/%y)
echo "Time: " $(date +%H:%M)
echo "COMPILE"
This section compiles the model for multiple RDUs if the compiled PEF file does not already exist.
The compiler output is written to the log file compile_${BS}_${IM}_NN.log.
# Compile for parallel RDUs
if [ ! -e out/unet_train_${BS}_${IM}_NN/unet_train_${BS}_${IM}_NN.pef ] ; then
python ${UNET}/unet.py compile -b ${BS} --in-channels=3 --in-width=${IM} --in-height=${IM} --enable-conv-tiling --mac-v2 --compiler-configs-file ${UNET}/jsons/compiler_configs/unet_compiler_configs_no_inst.json --pef-name="unet_train_${BS}_${IM}_NN" --data-parallel -ws 2 > compile_${BS}_${IM}_NN.log 2>&1
fi
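The compile-if-missing check depends on a file name built from the batch size and image size. The sketch below rebuilds that path so the naming is easier to see; pef_path and needs_compile are hypothetical helper names.

```shell
#!/bin/bash
# pef_path is a hypothetical helper that rebuilds the path tested by the
# "if [ ! -e ... ]" check. "NN" is a literal part of the name here,
# not a shell variable.
pef_path() {
    local bs="$1" im="$2"
    local name="unet_train_${bs}_${im}_NN"
    echo "out/${name}/${name}.pef"
}

# needs_compile succeeds when the PEF is missing and a compile is required.
needs_compile() {
    [ ! -e "$(pef_path "$1" "$2")" ]
}

# Batch size 256, image size 256:
pef_path 256 256
```

For these values the helper prints out/unet_train_256_256_NN/unet_train_256_256_NN.pef, which is the path the run step later passes to --pef.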
Here, a batch job is submitted for the multi-node run.
Sbatch argument definitions:
- --gres=rdu:1 - Indicates that the model fits on a single RDU.
- --tasks-per-node 8 - All eight RDUs per node are to be used. Valid options are 1 through 8.
- --nodes 2 - The number of nodes to use. Currently there are two nodes.
- --nodelist sm-02,sm-01 - The node names to use.
- --cpus-per-task=16 - CPUs per model.
- unet_batch.sh - The bash script to be batched.
unet_batch.sh argument definitions:
- NN - Number of nodes.
- NUM_WORKERS - The value passed to the --num-workers option of the run command.
# Run Multi-Node, Data Parallel
NN=2
echo "RUN"
echo "NN=${NN}"
sbatch --gres=rdu:1 --tasks-per-node 8 --nodes 2 --nodelist sm-02,sm-01 --cpus-per-task=16 ./unet_batch.sh ${NN} ${NUM_WORKERS}
echo "Duration: " $SECONDS
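The --nodes and --tasks-per-node values together determine how many tasks Slurm starts. A short worked calculation using the values from the sbatch line:

```shell
#!/bin/bash
# Values taken from the sbatch command line in unet_all.sh.
NODES=2
TASKS_PER_NODE=8

# Slurm launches (nodes x tasks-per-node) tasks in total.
TOTAL_TASKS=$(( NODES * TASKS_PER_NODE ))
echo "Total tasks: ${TOTAL_TASKS}"
```

This prints "Total tasks: 16". With all eight RDUs used on each of the two nodes, that corresponds to 16 RDUs for the data-parallel run.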
Unet Batch
Here is a description of unet_batch.sh. This script is run automatically by unet_all.sh.
The opening block is the same as in unet_all.sh: display each command, stop on error, and reset SECONDS.
#! /bin/bash -x
set -e
#
# Usage: ./unet_batch.sh 2 1
#
SECONDS=0
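Establish variables. The batch size and image size are hard-coded to 256 here, so they match a PEF compiled with ./unet_all.sh 256 256.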
# Batch Size
BS=256
# IMage size
IM=256
NN=${1}
NUM_WORKERS=${2}
export OMP_NUM_THREADS=16
DATADIR=/software/sambanova/dataset/kaggle_3m
UNET=$(pwd)/unet
export SAMBA_CCL_USE_PCIE_TRANSPORT=0
Activate virtual environment.
source /opt/sambaflow/venv/bin/activate
Display an informative banner.
echo "Model: UNET_TRAIN"
echo "Date: " $(date +%m/%d/%y)
echo "Time: " $(date +%H:%M)
Run UNet.
srun --mpi=pmi2 python ${UNET}/unet_hook.py run --do-train --in-channels=3 --in-width=${IM} --in-height=${IM} --init-features 32 --batch-size=${BS} --epochs 2 --data-dir ${DATADIR} --log-dir log_dir_unet_${NN}_train_kaggle --pef=$(pwd)/out/unet_train_${BS}_${IM}_NN/unet_train_${BS}_${IM}_NN.pef --data-parallel --reduce-on-rdu --num-workers=${NUM_WORKERS}
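Because unet_batch.sh sets BS and IM itself, the PEF it looks for may not exist if unet_all.sh was run with other sizes. A guard placed before the srun line could report this early. This is a sketch only; require_pef is a hypothetical helper and is not part of unet_batch.sh.

```shell
#!/bin/bash
# require_pef is a hypothetical guard, not part of unet_batch.sh.
# It fails with a clear message when the compiled PEF is missing.
require_pef() {
    local pef="$1"
    if [ ! -e "${pef}" ]; then
        echo "PEF not found: ${pef}" >&2
        return 1
    fi
}

# Possible use before the srun line:
#   require_pef "$(pwd)/out/unet_train_${BS}_${IM}_NN/unet_train_${BS}_${IM}_NN.pef" || exit 1
```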
Display total execution time.
echo "Duration: " $SECONDS
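SECONDS holds a plain count of seconds, which is hard to read for long training runs. One possible way to format it as HH:MM:SS is sketched below; fmt_duration is a hypothetical helper name.

```shell
#!/bin/bash
# fmt_duration is a hypothetical helper: it converts a number of seconds
# into HH:MM:SS using integer arithmetic.
fmt_duration() {
    local total="$1"
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( (total % 3600) / 60 )) $(( total % 60 ))
}

# Possible use in the script:
#   echo "Duration: $(fmt_duration "${SECONDS}")"
fmt_duration 3725
```

For 3725 seconds this prints 01:02:05.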
Compile and Run
Change directory:
cd ~/apps/image/
Compile and run UNet:
./unet_all.sh 256 256
Use squeue to check the status of the queue.
squeue
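When sbatch accepts a job it normally prints a line of the form "Submitted batch job <id>". The sketch below extracts that ID so it can be used to follow a single job; job_id_from_sbatch is a hypothetical helper and the job ID shown is made up.

```shell
#!/bin/bash
# job_id_from_sbatch is a hypothetical helper that extracts the numeric job
# ID from sbatch's usual "Submitted batch job <id>" message.
job_id_from_sbatch() {
    echo "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Example with a made-up job ID:
job_id_from_sbatch "Submitted batch job 12345"

# On the cluster, the ID could then be used like this:
#   squeue -j "${jobid}"         # status of that job only
#   cat "slurm-${jobid}.out"     # default Slurm batch output file
```

By default, Slurm writes the batch script's output to slurm-<jobid>.out in the directory from which the job was submitted, so that file is where the output of unet_batch.sh would be expected.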