Running a Model/Program¶
Getting Started¶
Job submission and queuing¶
Cerebras jobs are initiated and tracked automatically within the Python framework in cerebras.modelzoo.common.run_utils. This framework interacts with the Cerebras cluster management node.
Time limit and usage guidelines¶
We currently have access to four CS-3 systems, which are in high demand across users. To ensure fair access and smooth operation for everyone, we ask all users to be mindful of shared usage. At this stage, the system does not enforce strict scheduling policies, so it is especially important that usage remain self-regulated and considerate of others. In particular, please avoid submitting multiple concurrent jobs or queuing up a large number of jobs, as this can prevent others from accessing the system. As a general guideline, large jobs should be limited to a maximum runtime of 24 hours. To help manage this, use the `job_time_sec` parameter in your `.yaml` configuration to explicitly bound the duration of your jobs. For example:
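A minimal sketch of a 24-hour cap, assuming the legacy `runconfig` section of the config (in newer trainer-style configs the parameter may instead sit with the backend/cluster settings):

```yaml
runconfig:
  # 24 h * 3600 s/h = 86400 s; the job is stopped once this wall time is reached
  job_time_sec: 86400
```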
We appreciate your cooperation in using the CS-3 systems responsibly and respectfully, and in helping maintain a productive and fair environment for all users.

Login nodes¶
Jobs are launched from user nodes. For long-running jobs, or if you expect to lose your internet connection for any reason, we suggest logging into a specific user node and using either screen or tmux to create a persistent command-line session. For details, consult the man pages (man screen, man tmux).
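For instance, a detachable session with tmux (the session name is illustrative):

```
# start a named session and launch your job inside it, then detach with Ctrl-b d
tmux new -s cs3run

# later, from a new login to the same user node, reattach to the session
tmux attach -t cs3run
```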
Running jobs on the wafer¶
Follow these instructions to compile and train a small (111M-parameter) GPT3 model.
Cerebras virtual environments¶
First, create a PyTorch virtual environment for Cerebras. See Customizing Environments for the procedure. If the environment is made in ~/R_2.9.0/, it would be activated as follows:
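Assuming the environment directory is named venv_cerebras_pt, as used in the rest of these instructions:

```
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```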
Note: to access any external web resources from a Cerebras user node, you will need a proxy environment variable (or equivalent) set. wget, in particular, requires the lower-case proxy environment variable.
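For example, using the ALCF proxy from the clone step below; setting both capitalizations covers tools that read either form:

```shell
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128   # lower-case form needed by wget
```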
Clone the Cerebras modelzoo¶
If you have not already cloned the Cerebras modelzoo repo and checked out the Release_2.9.0 tag, do so.
mkdir ~/R_2.9.0
cd ~/R_2.9.0
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag
git checkout Release_2.9.0
Running a PyTorch sample¶
Activate your PyTorch virtual environment, and change to the working directory¶
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3
Next, copy a sample config file. This is for a small GPT3 model, modified to use a preprocessed dataset and to run for fewer steps.
cp /software/cerebras/dataset/OWT/Pytorch/111m_modified.yaml configs/Cerebras_GPT/111m_modified.yaml
Running a sample PyTorch training/validation job¶
To run the sample:
export MODEL_DIR=model_dir_gpt3_111m
# deletion of the model_dir is only needed if sample has been previously run
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
cszoo fit configs/Cerebras_GPT/111m_modified.yaml --job_labels name=gpt3_111m --model_dir $MODEL_DIR |& tee mytest.log
A successful GPT3 (111m parameters) PyTorch training/validation run should finish with output resembling the following:
2025-10-09 18:19:52,310 INFO: Beginning appliance run
2025-10-09 18:19:54,361 INFO: | Eval Device=CSX, GlobalStep=400, Batch=20, Loss=5.94325, Rate=1163.27 samples/sec, GlobalRate=1173.27 samples/sec, LoopTimeRemaining=0:00:08, TimeRemaining=0:00:08
2025-10-09 18:19:56,408 INFO: | Eval Device=CSX, GlobalStep=400, Batch=40, Loss=5.92024, Rate=1174.18 samples/sec, GlobalRate=1172.88 samples/sec, LoopTimeRemaining=0:00:06, TimeRemaining=0:00:06
2025-10-09 18:19:58,463 INFO: | Eval Device=CSX, GlobalStep=400, Batch=60, Loss=5.89623, Rate=1171.13 samples/sec, GlobalRate=1171.33 samples/sec, LoopTimeRemaining=0:00:04, TimeRemaining=0:00:04
2025-10-09 18:20:00,514 INFO: | Eval Device=CSX, GlobalStep=400, Batch=80, Loss=5.92834, Rate=1164.75 samples/sec, GlobalRate=1170.97 samples/sec, LoopTimeRemaining=0:00:02, TimeRemaining=0:00:02
2025-10-09 18:20:02,564 INFO: | Eval Device=CSX, GlobalStep=400, Batch=100, Loss=5.92817, Rate=1172.36 samples/sec, GlobalRate=1170.91 samples/sec, LoopTimeRemaining=0:00:00, TimeRemaining=0:00:00
2025-10-09 18:20:23,263 INFO: Avg Eval Loss: 5.928174624443054
2025-10-09 18:20:23,278 INFO: Evaluation metrics:
2025-10-09 18:20:23,278 INFO: - eval/lm_perplexity = 375.4686584472656
2025-10-09 18:20:23,278 INFO: - eval/accuracy = 0.16977091133594513
2025-10-09 18:20:23,279 INFO: Evaluation completed successfully!
2025-10-09 18:20:23,281 INFO: Processed 48000 training sample(s) in 820.575766695 seconds.
As the console output shows, for this sample, the run framework starts three jobs (two compiles and one execute) as part of a single workflow:
(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$ grep -B1 "Job id:" mytest.log
2025-10-30 18:10:39,460 INFO: Initiating a new compile wsjob against the cluster server.
2025-10-30 18:10:39,479 INFO: Job id: wsjob-acxb4mqan53ppiffvdaafq, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-acxb4mqan53ppiffvdaafq
--
2025-10-30 18:11:51,527 INFO: Initiating a new execute wsjob against the cluster server.
2025-10-30 18:11:51,558 INFO: Job id: wsjob-uttzzftdpppygmvqspykpr, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-uttzzftdpppygmvqspykpr
--
2025-10-30 18:21:33,099 INFO: Initiating a new compile wsjob against the cluster server.
2025-10-30 18:21:33,118 INFO: Job id: wsjob-6mvjwjqovjprbibbpi3w43, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-6mvjwjqovjprbibbpi3w43
(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$
The jobs can be seen with `csctl get jobs` from another console session on a user node. See Job Queuing and Submission for more details.
Checkpoints¶
Model training can be (re-)started from a model checkpoint, e.g. if a job stops due to an error, by adding `--checkpoint_path=path_to_mdl_file` to a `cszoo fit` command line. For example, to continue training the model above for another 400 steps after it has been trained for 400 steps, modify configs/Cerebras_GPT/111m_modified.yaml, changing the value of max_steps to 800.
Then rerun the training command, adding the `--checkpoint_path` option.
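A sketch of the continuation run; the checkpoint filename is an assumption (checkpoints are written into the model directory, typically named by global step):

```
export MODEL_DIR=model_dir_gpt3_111m
cszoo fit configs/Cerebras_GPT/111m_modified.yaml --job_labels name=gpt3_111m \
    --model_dir $MODEL_DIR --checkpoint_path $MODEL_DIR/checkpoint_400.mdl |& tee mytest_continue.log
```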
Saving a model checkpoint takes roughly 60 seconds per billion model parameters, so adjust the checkpoint frequency accordingly.
Another consideration is your disk space quota; checkpoints can quickly exceed this quota. Old checkpoints can be deleted manually, and your quota can be increased on request.
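As a rough illustration of the 60-seconds-per-billion-parameters rule of thumb above, a hypothetical helper:

```python
def ckpt_save_seconds(n_params: int, sec_per_billion: float = 60.0) -> float:
    """Estimate checkpoint save time from the ~60 s per billion parameters rule of thumb."""
    return n_params / 1e9 * sec_per_billion

# The 111M-parameter GPT3 sample above: well under a minute per checkpoint.
print(round(ckpt_save_seconds(111_000_000), 1))       # ~6.7 seconds

# A 70B-parameter model: roughly 70 minutes per checkpoint.
print(round(ckpt_save_seconds(70_000_000_000) / 60))  # ~70 minutes
```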