Running a Model/Program¶
Getting Started¶
Job submission and queuing¶
Cerebras jobs are initiated and tracked automatically within the Python framework in cerebras.modelzoo.common.run_utils. This framework interacts with the Cerebras cluster management node.
Login nodes¶
Jobs are launched from user nodes. For long-running jobs, or if you expect your internet connection to drop for any reason, we suggest logging into a specific user node and using either screen or tmux to create a persistent command-line session. For details, consult the screen or tmux man pages.
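A minimal, non-interactive sketch of the tmux idiom follows (the session name cs_demo is an arbitrary example). Interactively, you would start a session with tmux new -s cs_demo, detach with Ctrl-b d, and later re-attach with tmux attach -t cs_demo:

```shell
# Start a detached session, run a command inside it, then clean up.
# (Session name cs_demo and the output path are arbitrary examples.)
tmux new-session -d -s cs_demo
tmux send-keys -t cs_demo 'echo persistent session works > /tmp/cs_demo.out' Enter
sleep 1
# The command's output survives even though we started it in a separate
# session; in real use you would leave the session running and re-attach.
tmux kill-session -t cs_demo
cat /tmp/cs_demo.out
```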
Running jobs on the wafer¶
Follow these instructions to compile and train a small (111M-parameter) GPT3 model.
Cerebras virtual environments¶
First, make a Cerebras virtual environment for PyTorch. See Customizing Environments for the procedure for making PyTorch virtual environments for Cerebras. If the environment is created in ~/R_2.6.0/, it is activated as follows:
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
Note: to access any external web resources from a Cerebras user node, you need a proxy environment variable (or equivalent) set. wget in particular requires the lower-case proxy environment variable.
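For example, the proxy can be set as follows; the proxy URL is the one used elsewhere in this guide, and setting both upper- and lower-case variables covers tools (such as wget) that only read the lower-case form:

```shell
# Proxy setup for external web access from a Cerebras user node.
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
# wget and some other tools read only the lower-case variables.
export https_proxy=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
# Confirm the setting.
echo "$https_proxy"
```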
Clone the Cerebras modelzoo¶
If you have not already cloned the Cerebras modelzoo repo and checked out the Release_2.6.0 tag, do so:
mkdir ~/R_2.6.0
cd ~/R_2.6.0
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag
git checkout Release_2.6.0
Running a PyTorch sample¶
Activate your PyTorch virtual environment, and change to the working directory¶
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3
Next, copy a sample config file. This is for a small GPT3 model, modified to use a preprocessed dataset and to run for fewer steps.
cp /software/cerebras/dataset/OWT/Pytorch/111m_modified.yaml configs/Cerebras_GPT/111m_modified.yaml
Running a sample PyTorch training/validation job¶
To run the sample:
export MODEL_DIR=model_dir_gpt3_111m
# deletion of the model_dir is only needed if sample has been previously run
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
cszoo fit configs/Cerebras_GPT/111m_modified.yaml --job_labels name=gpt3_111m --model_dir $MODEL_DIR |& tee mytest.log
A successful GPT3 (111M parameters) PyTorch training/validation run should finish with output resembling the following:
2025-10-09 18:19:52,310 INFO: Beginning appliance run
2025-10-09 18:19:54,361 INFO: | Eval Device=CSX, GlobalStep=400, Batch=20, Loss=5.94325, Rate=1163.27 samples/sec, GlobalRate=1173.27 samples/sec, LoopTimeRemaining=0:00:08, TimeRemaining=0:00:08
2025-10-09 18:19:56,408 INFO: | Eval Device=CSX, GlobalStep=400, Batch=40, Loss=5.92024, Rate=1174.18 samples/sec, GlobalRate=1172.88 samples/sec, LoopTimeRemaining=0:00:06, TimeRemaining=0:00:06
2025-10-09 18:19:58,463 INFO: | Eval Device=CSX, GlobalStep=400, Batch=60, Loss=5.89623, Rate=1171.13 samples/sec, GlobalRate=1171.33 samples/sec, LoopTimeRemaining=0:00:04, TimeRemaining=0:00:04
2025-10-09 18:20:00,514 INFO: | Eval Device=CSX, GlobalStep=400, Batch=80, Loss=5.92834, Rate=1164.75 samples/sec, GlobalRate=1170.97 samples/sec, LoopTimeRemaining=0:00:02, TimeRemaining=0:00:02
2025-10-09 18:20:02,564 INFO: | Eval Device=CSX, GlobalStep=400, Batch=100, Loss=5.92817, Rate=1172.36 samples/sec, GlobalRate=1170.91 samples/sec, LoopTimeRemaining=0:00:00, TimeRemaining=0:00:00
2025-10-09 18:20:23,263 INFO: Avg Eval Loss: 5.928174624443054
2025-10-09 18:20:23,278 INFO: Evaluation metrics:
2025-10-09 18:20:23,278 INFO: - eval/lm_perplexity = 375.4686584472656
2025-10-09 18:20:23,278 INFO: - eval/accuracy = 0.16977091133594513
2025-10-09 18:20:23,279 INFO: Evaluation completed successfully!
2025-10-09 18:20:23,281 INFO: Processed 48000 training sample(s) in 820.575766695 seconds.
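Summary values like the average evaluation loss can be scraped from the saved log with standard tools. A small sketch, demonstrated here against one sample line copied from the output above (for a real run, point awk at mytest.log instead of sample.log):

```shell
# Create a one-line sample log using a line from the output shown above.
cat > sample.log <<'EOF'
2025-10-09 18:20:23,263 INFO: Avg Eval Loss: 5.928174624443054
EOF
# Split on the label and print what follows it, i.e. the loss value.
awk -F'Avg Eval Loss: ' '/Avg Eval Loss/ {print $2}' sample.log
```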
For this sample, the run framework starts three jobs (two compile jobs and one execute job) as part of a single workflow, as the following console output shows:
(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$ grep -B1 "Job id:" mytest.log
2025-10-30 18:10:39,460 INFO: Initiating a new compile wsjob against the cluster server.
2025-10-30 18:10:39,479 INFO: Job id: wsjob-acxb4mqan53ppiffvdaafq, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-acxb4mqan53ppiffvdaafq
--
2025-10-30 18:11:51,527 INFO: Initiating a new execute wsjob against the cluster server.
2025-10-30 18:11:51,558 INFO: Job id: wsjob-uttzzftdpppygmvqspykpr, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-uttzzftdpppygmvqspykpr
--
2025-10-30 18:21:33,099 INFO: Initiating a new compile wsjob against the cluster server.
2025-10-30 18:21:33,118 INFO: Job id: wsjob-6mvjwjqovjprbibbpi3w43, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-6mvjwjqovjprbibbpi3w43
(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$
The jobs can be seen with csctl get jobs from another console session on a user node. See Job Queuing and Submission for more details.