Steps to Run a Model/Program
Slurm is installed and running on all the CPU nodes; it coordinates work between the Cerebras system and the nodes in the Cerebras cluster. See the section Job Queueing and Submission for more details.
The worker nodes for the CS-2 (see the first diagram in System Overview) are cs2-02-med[2-7].
You may occasionally need to log into a specific worker node for debugging purposes.
CS_IP address of the Cerebras system:
The CS-2 system is accessed via the CS_IP environment variable. CS_IP is set by the /software/cerebras/cs2-02/envs/cs_env.sh script, and $CS_IP may be used by any user application that needs to access the CS-2 wafer.
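Because $CS_IP is exported into the environment, a user application can pick the address up at runtime rather than hard-coding it. A minimal Python sketch (the variable name comes from the script above; the helper functions here are illustrative, not part of any Cerebras API):

```python
import os

def cs_ip(default=None):
    """Return the CS-2 address from the CS_IP environment variable.

    CS_IP is exported by /software/cerebras/cs2-02/envs/cs_env.sh;
    the default applies only when that script has not been sourced.
    """
    return os.environ.get("CS_IP", default)

def train_args(max_steps):
    """Build an illustrative run.py argument list, adding --cs_ip only
    when the environment provides an address."""
    args = ["--mode", "train", "--max_steps", str(max_steps)]
    ip = cs_ip()
    if ip:  # with no --cs_ip, run.py falls back to CPU mode
        args += ["--cs_ip", ip]
    return args
```

With the environment script sourced, `train_args(100000)` yields the same flags used in the csrun_wse command below; without it, the job stays in CPU mode.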
Running Slurm jobs:
Cerebras provides two scripts for running Slurm jobs.
csrun_cpu is for running a Cerebras compilation. By default it reserves a single entire worker node.
csrun_wse is for running a job on the Wafer-Scale Engine. By default it reserves five entire worker nodes, which are used to feed the dataset to the CS-2 wafer.
csrun_cpu --help and csrun_wse --help list the available options.
See the section Job Queueing and Submission for more details.
Running a training job on the wafer
Follow these instructions to compile and train the fc_mnist TensorFlow estimator example. This model is a couple of fully connected layers plus dropout and ReLU.
cd ~/
mkdir ~/R1.1.0/
cp -r /software/cerebras/model_zoo/modelzoo-R1.1.0 ~/R1.1.0/modelzoo
cd ~/R1.1.0/modelzoo/fc_mnist/tf
csrun_wse python run.py --mode train --cs_ip 192.168.220.50 --max_steps 100000
You should see a training rate of about 1870 steps per second, and output that ends with something similar to this:
INFO:tensorflow:Training finished with 25600000 samples in 53.424 seconds, 479188.55 samples/second.
INFO:tensorflow:Loss for final step: 0.0.
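The figures in that log line are mutually consistent and can be checked with a little arithmetic. The batch size of 256 below is inferred from the sample and step counts, not stated in the log:

```python
samples = 25_600_000   # from the log line
seconds = 53.424       # from the log line
steps = 100_000        # the --max_steps value

batch_size = samples // steps                 # 256 samples per step
samples_per_sec = samples / seconds           # ~479,000, matching the log
steps_per_sec = samples_per_sec / batch_size  # ~1870, the quoted training rate
```

The small difference between the recomputed throughput and the logged 479188.55 samples/second comes from rounding in the logged wall time.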
To compile and train separately:
# delete any existing compile artifacts and checkpoints
rm -r model_dir
csrun_cpu python run.py --mode train --compile_only --cs_ip 192.168.220.50
csrun_wse python run.py --mode train --cs_ip 192.168.220.50 --max_steps 100000
Training reuses an existing compilation if nothing has changed that forces a recompile, and resumes from the newest checkpoint file if one exists. Compilations may run while another job is using the wafer.
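The "newest checkpoint" selection can be illustrated in plain Python. This is a hedged sketch of the idea (pick the most recently written checkpoint in model_dir), not the actual TensorFlow/Cerebras implementation, which keeps its own bookkeeping in a `checkpoint` file:

```python
from pathlib import Path

def newest_checkpoint(model_dir):
    """Return the most recently modified checkpoint index file, or None.

    Illustrative only: modification time stands in for TensorFlow's
    own latest-checkpoint record in model_dir.
    """
    candidates = sorted(Path(model_dir).glob("model.ckpt-*.index"),
                        key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None
```

Deleting model_dir, as in the commands above, therefore discards both the compile artifacts and any checkpoint a new run would otherwise resume from.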
See also the current Cerebras quickstart documentation, which uses a clone of Cerebras's abbreviated public "reference implementations" GitHub repository rather than the full modelzoo.
Running a training job on the CPU
The examples in the modelzoo will also run in CPU mode, either via the csrun_cpu script or in a Singularity shell as shown below.
To compile and train separately:
# delete any existing compile artifacts and checkpoints
rm -r model_dir
csrun_cpu python run.py --mode train --compile_only
csrun_cpu python run.py --mode train --max_steps 400
Note: if no --cs_ip is specified, the training run will be in CPU mode.
Set --max_steps on the training command line to something smaller than the default so that training completes in a reasonable amount of time. (CPU mode is more than two orders of magnitude slower for many examples.)
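A rough estimate of why the step count must shrink, using the wafer training rate quoted earlier. The 100x factor is only the lower bound of the ">2 orders of magnitude" claim, not a measured number:

```python
wafer_steps_per_sec = 1870  # quoted wafer training rate
slowdown = 100              # lower bound of ">2 orders of magnitude"

cpu_steps_per_sec = wafer_steps_per_sec / slowdown   # ~19 steps/s at best
cpu_seconds_400 = 400 / cpu_steps_per_sec            # ~21 s for 400 steps
cpu_hours_100k = 100_000 / cpu_steps_per_sec / 3600  # ~1.5 h for the full 100,000
```

So the 400-step CPU run above finishes in well under a minute, while the wafer's default 100,000 steps would take on the order of hours on CPU.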
Using a Singularity shell
This illustrates how to launch a Singularity container.
The -B /opt:/opt option is an illustrative example of binding a directory into the container. (By default, Singularity containers bind both your home directory and /tmp, read/write.)
cd ~/R1.1.0/modelzoo/fc_mnist/tf
singularity shell -B /opt:/opt /software/cerebras/cs2-02/container/cbcore_latest.sif
At the shell prompt for the container,
#rm -r model_dir
# compile and train on the CPUs
python run.py --mode train --max_steps 1000
python run.py --mode eval --eval_steps 1000
# validate_only is the first portion of a compile
python run.py --mode train --validate_only
# remove the existing compile and training artifacts
rm -r model_dir
# compile_only does a compile but no training
python run.py --mode train --compile_only
Type exit at the shell prompt to leave the container.