Argonne Leadership Computing Facility


Steps to Run a Model/Program

Getting Started

[This subsection is an adaptation of]


Slurm is installed and running on all the CPU nodes; it coordinates work between the Cerebras system and the nodes in the Cerebras cluster. See section Job Queuing and Submission for more details.

Worker hostnames:

The worker nodes (see the first diagram in System Overview) for the CS-2 are cs2-02-med[2-7].
You may occasionally need to log into a specific worker node for debugging purposes.
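
The bracket notation cs2-02-med[2-7] is Slurm-style host-range shorthand; written out, the worker set is:

```shell
# The worker nodes, written out individually (same set as cs2-02-med[2-7]).
for n in 2 3 4 5 6 7; do
  echo "cs2-02-med${n}"
done
```

Any of these names can be used as an ssh target when you need to debug on a specific worker.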

CS_IP address of the Cerebras system:

The CS-2 system is accessed via the IP address stored in the CS_IP environment variable.
CS_IP is set by the /software/cerebras/cs2-02/envs/ script, and $CS_IP may be used by any user application that needs to access the CS-2 wafer.
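
Because a run without a cs_ip falls back to CPU mode, it can help to confirm the variable before submitting. A minimal sketch, assuming the envs script has been sourced (check_cs_ip is an illustrative helper, not an ALCF-provided tool):

```shell
# check_cs_ip: succeed and report the target if CS_IP is set, fail otherwise.
check_cs_ip() {
  if [ -z "${CS_IP:-}" ]; then
    echo "CS_IP is not set; source the script under /software/cerebras/cs2-02/envs/ first" >&2
    return 1
  fi
  echo "CS-2 wafer will be reached at ${CS_IP}"
}
```

You might call it just before a submission, e.g. check_cs_ip && csrun_wse …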

Running Slurm jobs:

Cerebras includes two scripts for running Slurm jobs.
csrun_cpu is for running a Cerebras compilation. By default it reserves a single entire worker node.
csrun_wse is for running a job on the wafer-scale engine. By default it reserves five entire worker nodes, which are used to feed the dataset to the CS-2 wafer.
csrun_cpu --help and csrun_wse --help will list the available options.
See section Job Queuing and Submission for more details.

Running a training job on the wafer

Follow these instructions to compile and train the fc_mnist TensorFlow Estimator example. This model is a couple of fully connected layers plus dropout and ReLU.

cd ~/
mkdir ~/R1.1.0/
cp -r /software/cerebras/model_zoo/modelzoo-R1.1.0 ~/R1.1.0/modelzoo
cd ~/R1.1.0/modelzoo/fc_mnist/tf
csrun_wse python run.py --mode train --cs_ip $CS_IP --max_steps 100000

You should see a training rate of about 1870 steps per second, and output that finishes with something similar to this:

INFO:tensorflow:Training finished with 25600000 samples in 53.424 seconds, 479188.55 samples/second.
INFO:tensorflow:Loss for final step: 0.0.
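
The figures in that log line are mutually consistent, and imply a per-step batch of 256; a quick cross-check using only the numbers quoted above:

```shell
# Cross-check the throughput figures quoted in the log line above.
awk 'BEGIN {
  samples = 25600000; seconds = 53.424; steps = 100000
  printf "samples/sec: %.0f\n", samples / seconds   # ~479000, as logged
  printf "steps/sec:   %.0f\n", steps / seconds     # ~1872, i.e. "about 1870"
  printf "batch size:  %d\n",   samples / steps     # 256 samples per step
}'
```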

To separately compile and train,

# delete any existing compile artifacts and checkpoints
rm -r model_dir
csrun_cpu python run.py --mode train --compile_only --cs_ip $CS_IP
csrun_wse python run.py --mode train --cs_ip $CS_IP --max_steps 100000

Training reuses an existing compilation if nothing has changed that forces a recompile, and starts from the newest checkpoint file if one exists. Compiles may be done while another job is using the wafer.
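
TensorFlow's Estimator handles the checkpoint selection internally; as a purely illustrative sketch of the idea (latest_step and the model.ckpt-N.index file pattern here are assumptions, not the modelzoo's actual code):

```shell
# Pick the highest checkpoint step recorded in a model directory.
latest_step() {
  ls "$1"/model.ckpt-*.index 2>/dev/null \
    | sed 's/.*model\.ckpt-\([0-9]*\)\.index/\1/' \
    | sort -n \
    | tail -1
}
```

Deleting model_dir, as in the snippet above, is what forces both a fresh compile and training from step 0.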

See also the current Cerebras quickstart documentation, which uses a clone of Cerebras's abbreviated public "reference implementations" GitHub repo rather than the full modelzoo.

Running a training job on the CPU

The examples in the modelzoo will run in CPU mode, either using the csrun_cpu script or in a Singularity shell, as shown below.

Using csrun_cpu

To separately compile and train,

# delete any existing compile artifacts and checkpoints
rm -r model_dir
csrun_cpu python run.py --mode train --compile_only
csrun_cpu python run.py --mode train --max_steps 400

Note: If no cs_ip is specified, the training run executes in CPU mode.

Set --max_steps on the training command line to something smaller than the default so that training completes in a reasonable amount of time. (CPU mode is more than two orders of magnitude slower for many examples.)
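
Back-of-the-envelope, taking the ~1870 steps/s wafer rate quoted earlier and assuming an even 100x CPU slowdown (both assumptions, not measurements):

```shell
# Rough CPU-mode runtime estimate, assuming the ~1870 steps/s wafer rate
# above and an even 100x slowdown (both assumptions, not measurements).
awk 'BEGIN {
  cpu_rate = 1870 / 100                        # assumed CPU steps/sec
  printf "100000 steps: ~%.0f minutes\n", 100000 / cpu_rate / 60
  printf "400 steps:    ~%.0f seconds\n", 400 / cpu_rate
}'
```

This is why the default 100000-step run is impractical on CPU while 400 steps finishes in well under a minute.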

Using a singularity shell

This illustrates how to launch a shell inside a Singularity container. The -B /opt:/opt option is an illustrative example of how to bind a directory into the container. (By default, Singularity binds both your home directory and /tmp, read/write.)

cd ~/R1.1.0/modelzoo/fc_mnist/tf
singularity shell -B /opt:/opt /software/cerebras/cs2-02/container/cbcore_latest.sif

At the shell prompt for the container,

# rm -r model_dir
# compile and train on the CPUs
python run.py --mode train --max_steps 1000
python run.py --mode eval --eval_steps 1000
# validate_only runs the first portion of a compile
python run.py --mode train --validate_only
# remove the existing compile and training artifacts
rm -r model_dir
# compile_only does a compile but no training
python run.py --mode train --compile_only

Type exit at the shell prompt to exit the container.