Example Programs
SambaNova provides examples of some well-known AI applications under /opt/sambaflow/apps/starters on both SambaNova compute nodes. Copy this directory to your home directory:
cd ~/
mkdir apps
cp -r /opt/sambaflow/apps/starters apps/starters
LeNet
Change directory:
cd ~/apps/starters/lenet
Common Arguments
Below are some of the common arguments used across most of the models in the example code.
Argument | Default | Help |
---|---|---|
-b | 1 | Batch size for training |
-n, --num-iterations | 100 | Number of iterations to run the pef for |
-e, --num-epochs | 1 | Number of epochs for training |
--log-path | 'checkpoints' | Log path |
--num-workers | 0 | Number of workers |
--measure-train-performance | None | Measure training performance |
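As a sketch of how these arguments combine (the values here are illustrative, and the exact flags accepted may vary per model), a run with two epochs and four data-loading workers might look like:
srun python lenet.py run --pef="pef/lenet/lenet.pef" -e 2 --num-workers 4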
LeNet Arguments
Argument | Default | Help |
---|---|---|
--lr | 0.01 | Learning rate for training |
--momentum | 0.0 | Momentum value for training |
--weight-decay | 0.01 | Weight decay for training |
--data-path | './data' | Data path |
--data-folder | 'mnist_data' | Folder containing MNIST data |
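For example, the LeNet-specific arguments could be combined with a run like this (a sketch with illustrative, untuned values):
srun python lenet.py run --pef="pef/lenet/lenet.pef" --lr 0.01 --momentum 0.9 --data-path ./data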
NOTE: If you receive an "HTTP error" message on any of the following commands, run the command again. Such errors (e.g., 503) are commonly an intermittent failure to download a dataset.
Run these commands:
srun python lenet.py compile -b=1 --pef-name="lenet" --output-folder="pef"
srun python lenet.py run --pef="pef/lenet/lenet.pef"
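As the two commands above suggest, compilation writes the PEF to <output-folder>/<pef-name>/<pef-name>.pef, which is why the run step points at pef/lenet/lenet.pef. As an optional check, you can confirm the file was produced before running:
ls pef/lenet/lenet.pef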
To use Slurm sbatch, create submit-lenet-job.sh with the following contents:
#!/bin/sh
python lenet.py compile -b=1 --pef-name="lenet" --output-folder="pef"
python lenet.py run --pef="pef/lenet/lenet.pef"
Then run:
sbatch --output=pef/lenet/output.log submit-lenet-job.sh
squeue will show the queue status:
squeue
# One may also...
watch squeue
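To show only your own jobs (a standard Slurm option, not specific to these examples):
squeue -u $USER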
The output file will look something like this:
[Info][SAMBA][Default] # Placing log files in
pef/lenet/lenet.samba.log
[Info][MAC][Default] # Placing log files in
pef/lenet/lenet.mac.log
[Warning][SAMBA][Default] #
--------------------------------------------------
Using patched version of torch.cat and torch.stack
--------------------------------------------------
[Warning][SAMBA][Default] # The dtype of "targets" to
CrossEntropyLoss is torch.int64, however only int16 is currently
supported, implicit conversion will happen
[Warning][MAC][GraphLoweringPass] # lenet__reshape skip
set_loop_to_air
[Warning][MAC][GraphLoweringPass] # lenet__reshape_bwd skip
set_loop_to_air
...
Epoch [1/1], Step [59994/60000], Loss: 0.1712
Epoch [1/1], Step [59995/60000], Loss: 0.1712
Epoch [1/1], Step [59996/60000], Loss: 0.1712
Epoch [1/1], Step [59997/60000], Loss: 0.1712
Epoch [1/1], Step [59998/60000], Loss: 0.1712
Epoch [1/1], Step [59999/60000], Loss: 0.1712
Epoch [1/1], Step [60000/60000], Loss: 0.1712
Test Accuracy: 98.06 Loss: 0.0628
2021-6-10 10:52:28 : [INFO][SC][53607]: SambaConnector: PEF File:
pef/lenet/lenet.pef
Log ID initialized to: [ALCFUserID][python][53607] at
/var/log/sambaflow/runtime/sn.log
MNIST - Feed Forward Network
Change directory:
cd ~/apps/starters/ffn_mnist/
Commands to run the MNIST example:
srun python ffn_mnist.py compile --pef-name="ffn_mnist" --output-folder="pef"
srun python ffn_mnist.py run --pef="pef/ffn_mnist/ffn_mnist.pef" --data-path mnist_data
To run the same example using Slurm sbatch, create submit-ffn_mnist-job.sh with the following contents:
#!/bin/sh
python ffn_mnist.py compile --pef-name="ffn_mnist" --output-folder="pef"
python ffn_mnist.py run --pef="pef/ffn_mnist/ffn_mnist.pef" --data-path mnist_data
Then run:
sbatch --output=pef/ffn_mnist/output.log submit-ffn_mnist-job.sh
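While the job is queued or running, you can follow its log as it is written (assuming the --output path passed to sbatch above):
tail -f pef/ffn_mnist/output.log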
Logistic Regression
Change directory:
cd ~/apps/starters/logreg
Logistic Regression Arguments
This is not an exhaustive list of arguments.
Argument | Default | Help | Step |
---|---|---|---|
--lr | 0.001 | Learning rate for training | Compile |
--momentum | 0.0 | Momentum value for training | Compile |
--weight-decay | 1e-4 | Weight decay for training | Compile |
--num-features | 784 | Number of features for training | Compile |
--num-classes | 10 | Number of classes for training | Compile |
--weight-norm | na | Enable weight normalization | Compile |
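Because the Step column marks these as compile-time arguments, a compile command with non-default values might look like this (a sketch; the values are illustrative, not tuned):
srun python logreg.py compile --pef-name="logreg" --output-folder="pef" --lr 0.005 --momentum 0.1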
Run these commands:
srun python logreg.py compile --pef-name="logreg" --output-folder="pef"
srun python logreg.py test --pef="pef/logreg/logreg.pef"
srun python logreg.py run --pef="pef/logreg/logreg.pef"
To use Slurm, create submit-logreg-job.sh with the following contents:
#!/bin/sh
python logreg.py compile --pef-name="logreg" --output-folder="pef"
python logreg.py test --pef="pef/logreg/logreg.pef"
python logreg.py run --pef="pef/logreg/logreg.pef"
Then run:
sbatch --output=pef/logreg/output.log submit-logreg-job.sh
The output, pef/logreg/output.log, will look something like this:
[Info][SAMBA][Default] # Placing log files in
pef/logreg/logreg.samba.log
[Info][MAC][Default] # Placing log files in
pef/logreg/logreg.mac.log
[Warning][SAMBA][Default] #
--------------------------------------------------
Using patched version of torch.cat and torch.stack
--------------------------------------------------
[Warning][SAMBA][Default] # The dtype of "targets" to
CrossEntropyLoss is torch.int64, however only int16 is currently
supported, implicit conversion will happen
[Warning][MAC][MemoryOpTransformPass] # Backward graph is trimmed
according to requires_grad to save computation.
[Warning][MAC][WeightShareNodeMergePass] # Backward graph is
trimmed according to requires_grad to save computation.
[Warning][MAC][ReduceCatFaninPass] # Backward graph is trimmed
according to requires_grad to save computation.
[info ] [PLASMA] Launching plasma compilation! See log file:
/home/ALCFUserID/apps/starters/pytorch/pef/logreg//logreg.plasma_compile.log
...
[Warning][SAMBA][Default] # The dtype of "targets" to
CrossEntropyLoss is torch.int64, however only int16 is currently
supported, implicit conversion will happen
Epoch [1/1], Step [10000/60000], Loss: 0.4763
Epoch [1/1], Step [20000/60000], Loss: 0.4185
Epoch [1/1], Step [30000/60000], Loss: 0.3888
Epoch [1/1], Step [40000/60000], Loss: 0.3721
Epoch [1/1], Step [50000/60000], Loss: 0.3590
Epoch [1/1], Step [60000/60000], Loss: 0.3524
Test Accuracy: 90.07 Loss: 0.3361
2021-6-11 8:38:49 : [INFO][SC][99185]: SambaConnector: PEF File:
pef/logreg/logreg.pef
Log ID initialized to: [ALCFUserID][python][99185] at
/var/log/sambaflow/runtime/sn.log
UNet
Change directory and copy files:
cp -r /opt/sambaflow/apps/image ~/apps/image
cd ~/apps/image/unet
cp /software/sambanova/apps/image/pytorch/unet/*.sh .
Export the output directory and the path to the dataset, which are required for training:
export OUTDIR=~/apps/image/unet
export DATADIR=/software/sambanova/dataset/kaggle_3m
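Optionally, sanity-check that both paths resolve before submitting a job:
ls $OUTDIR
ls $DATADIR | head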
Run these commands for training (compile + train):
sbatch unet_compile_run_inf_rl.sh compile 32 1 # Takes over 15 minutes.
sbatch unet_compile_run_inf_rl.sh test 32 1 # Very fast.
sbatch unet_compile_run_inf_rl.sh run 32 1
The output files are named slurm-<job ID>.out.
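To follow the most recent Slurm output file while a job runs (a convenience one-liner, assuming you submitted from the current directory):
tail -f "$(ls -t slurm-*.out | head -1)"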
To use Slurm sbatch, create submit-unet-job.sh with the following contents:
#!/bin/sh
export OUTDIR=~/apps/image/unet
export DATADIR=/software/sambanova/dataset/kaggle_3m
./unet_compile_run_inf_rl.sh compile 32 1
./unet_compile_run_inf_rl.sh test 32 1
./unet_compile_run_inf_rl.sh run 32 1
Then run:
sbatch submit-unet-job.sh
squeue will show the queue status:
squeue
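To cancel a queued or running job, pass the job ID reported by squeue to scancel (standard Slurm):
scancel <job ID>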