PyTorch on Aurora¶
PyTorch is a popular, open-source deep learning framework developed and released by Meta (formerly Facebook). See the PyTorch home page for more information. For troubleshooting on Aurora, please contact support@alcf.anl.gov.
Major changes in the frameworks module in Fall 2025¶
- The `torch_ccl` module has been removed. `import oneccl_bindings_for_pytorch as torch_ccl` is no longer needed.
- When initializing `torch.distributed`, the `backend` must be changed from `ccl` to `xccl`.
- `import intel_extension_for_pytorch as ipex` is now an optional import. The vendor is currently working on upstreaming all of the functionality from IPEX to the mainline PyTorch distribution. If you experience performance variations after removing the import, please switch back to importing it.
- `horovod` support for PyTorch has been removed.
Provided Installation¶
PyTorch is already installed on Aurora with GPU support and available through the frameworks module. To use it from a compute node, please load the following modules:
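For the provided installation, this is simply:

```bash
module load frameworks
```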
Then, you can import PyTorch in Python as usual (below showing results from the frameworks/2025.2.0 module). A simple but useful check is to use PyTorch to get device information on a compute node.
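One way to do this, as a minimal sketch (the exact calls that produced the output below are an assumption; `torch.xpu` mirrors the `torch.cuda` API):

```python
import torch

print(f"GPU availability: {torch.xpu.is_available()}")
print(f"Number of tiles = {torch.xpu.device_count()}")
print(f"Current tile = {torch.xpu.current_device()}")
print(f"Current device ID = {torch.xpu.device(torch.xpu.current_device())}")
print(f"Device properties = {torch.xpu.get_device_properties(torch.xpu.current_device())}")
```

Example output: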
```
GPU availability: True
Number of tiles = 12
Current tile = 0
Current device ID = <intel_extension_for_pytorch.xpu.device object at 0x1540a9f25790>
Device properties = _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', \
type='gpu', driver_version='1.3.30872', total_memory=65536MB, max_compute_units=448, gpu_eu_count=448, \
gpu_subslice_count=56, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, \
has_atomic64=1)
```
Each Aurora node has 6 GPUs (also called "Devices" or "cards"), and each GPU is composed of two tiles (also called "Sub-devices"). By default, each tile is mapped to one PyTorch device, giving a total of 12 devices per node in the above output.
Using GPU Devices as PyTorch devices
By default, each tile is mapped to one PyTorch device, giving a total of 12 devices per node, as seen above. To map a PyTorch device to one particular GPU Device out of the 6 available on a compute node, set the following environment variables:
```bash
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0
# or, equivalently, following the syntax `Device.Sub-device`
export ZE_AFFINITY_MASK=0.0,0.1
```
These settings expose Device 0 and its Sub-devices 0 and 1, i.e. the two tiles of GPU 0, as a single PyTorch device. This is particularly important when setting a performance benchmarking baseline. After setting the above environment variables (with the frameworks module loaded), you can check that each PyTorch device is now mapped to one full GPU.
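For example (a minimal sketch; the exact calls are an assumption):

```python
import torch

print(torch.xpu.device_count())
print(torch.xpu.get_device_properties(0))
```

Example output: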
```
1
_XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) Level-Zero', type='gpu', driver_version='1.3.30872', total_memory=131072MB, max_compute_units=896, gpu_eu_count=896, gpu_subslice_count=112, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
```
More information and details are available in the Level Zero Specification Documentation (Affinity Mask).
Code changes to run PyTorch on Aurora GPUs¶
Intel Extension for PyTorch (IPEX) is an open-source project that extends PyTorch with optimizations for extra performance boost on Intel CPUs and enables the use of Intel GPUs.
Here we list some common changes that you may need to make to your PyTorch code in order to use Intel GPUs.
Please consult Intel's IPEX Documentation for additional details and useful tutorials.
Note: Steps related to IPEX are optional.
- Import the `intel_extension_for_pytorch` module right after importing `torch`.
- All API calls involving `torch.cuda` should be replaced with `torch.xpu`.
- When moving tensors and the model to the GPU, replace `"cuda"` with `"xpu"`.
- Convert the model and loss criterion to `xpu`, and then call `ipex.optimize` for an additional performance boost. All four changes are combined in the sketch after this list.
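A combined sketch of these four changes (the model, tensor shapes, and hyperparameters are illustrative):

```python
import torch
import intel_extension_for_pytorch as ipex  # optional: import right after torch

# torch.cuda.* calls become torch.xpu.*
print(torch.xpu.device_count())

# replace "cuda" with "xpu" when moving tensors and the model to the GPU
model = torch.nn.Linear(8, 4).to("xpu")
data = torch.randn(16, 8).to("xpu")
criterion = torch.nn.MSELoss().to("xpu")

# optional: ipex.optimize for an additional performance boost
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
```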
Tip
A more portable solution to select the appropriate device is the following:
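For example, a minimal sketch that falls back through the available device types:

```python
import torch

if torch.xpu.is_available():
    device = torch.device("xpu")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```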
Example: training a PyTorch model on a single GPU tile¶
Here is a simple code example to train a dummy PyTorch model on the CPU.
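The following minimal sketch stands in for that listing (the model, data, and hyperparameters are illustrative):

```python
import torch

# dummy model and synthetic data
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(1024, 128)
y = torch.randn(1024, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```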
And here is the code to train the same model on a single GPU tile on Aurora, where new or modified lines are marked with comments.
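Again a sketch under the same assumptions:

```python
import torch
import intel_extension_for_pytorch as ipex      # new (optional)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
).to("xpu")                                     # modified: move model to xpu
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = ipex.optimize(model, optimizer=optimizer)  # new (optional)

x = torch.randn(1024, 128).to("xpu")            # modified: move data to xpu
y = torch.randn(1024, 1).to("xpu")              # modified: move data to xpu

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```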
Note: Steps related to IPEX are optional.
Here are the steps to run the above code on Aurora (the corresponding commands are sketched after this list):

- Log in to Aurora.
- Request a one-node interactive job for 30 minutes.
- Copy the above Python script into a file called `pytorch_xpu.py` and make it executable with `chmod a+x pytorch_xpu.py`.
- Load the frameworks module.
- Run the script.
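A sketch of these commands (the queue, filesystem, and project names are assumptions; adjust them to your allocation):

```bash
# 1. log in (replace <username> with your ALCF username)
ssh <username>@aurora.alcf.anl.gov

# 2. request a one-node interactive job for 30 minutes
qsub -I -l select=1 -l walltime=00:30:00 -l filesystems=home:flare -A <ProjectName> -q debug

# 4. load the frameworks module
module load frameworks

# 5. run the script
python pytorch_xpu.py
```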
PyTorch Best Practices on Aurora¶
When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use Reduced Precision. Reduced precision is available on the Intel Max 1550 and is supported with PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision (AMP) package, as described in the mixed precision documentation. In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators can provide easy tools to do this. A minimal sketch follows this list.
- PyTorch has a `JIT` module as well as backends to support op fusion, similar to TensorFlow's `tf.function` tools. See TorchScript for more information.
- `torch.compile` will be available through the next framework release.
- In order to run an application with the `TF32` precision type, one must set the following environment variable: `export IPEX_FP32_MATH_MODE=TF32`. This allows calculations to use `TF32` as opposed to the default `FP32`, and is done through the `intel_extension_for_pytorch` module.
- For convolutional neural networks, using the `channels_last` (NHWC) memory format gives better performance (also shown in the sketch below). More info here and here.
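A minimal sketch combining autocast-based mixed precision with the `channels_last` memory format (the model and shapes are illustrative; `bfloat16` autocast generally avoids manual loss scaling):

```python
import torch

# channels_last (NHWC) layout for a convolutional model
model = torch.nn.Conv2d(3, 16, kernel_size=3).to("xpu", memory_format=torch.channels_last)
x = torch.randn(8, 3, 224, 224, device="xpu").to(memory_format=torch.channels_last)

# autocast runs eligible ops in reduced precision inside the context
with torch.autocast(device_type="xpu", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```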
Distributed Training on multiple GPUs¶
Distributed training with PyTorch on Aurora is facilitated through Distributed Data Parallel (DDP). (Horovod support for PyTorch has been removed; see the module changes above.) We recommend using native PyTorch DDP to perform data-parallel training on Aurora.
Distributed Data Parallel (DDP)¶
Code changes to train on multiple GPUs using DDP¶
The key steps in performing distributed training are:
- Initialize PyTorch's `DistributedDataParallel` with `backend='xccl'`
- Use `DistributedSampler` to partition the training data among the ranks
- Pin each rank to a GPU
- Wrap the model in DDP to keep it in sync across the ranks
- Rescale the learning rate
- Use `set_epoch` for shuffling data across epochs
Here is the code to train the same dummy PyTorch model on multiple GPUs, where new or modified lines are marked with comments.
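The following is a sketch of such a script (the `mpi4py`-based rank discovery, master-address broadcast, and port number are assumptions; adapt to your launcher):

```python
import os
import socket

import torch
import torch.distributed as dist
from mpi4py import MPI                                               # new
from torch.nn.parallel import DistributedDataParallel as DDP        # new
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# new: get rank and world size from MPI, then initialize the process group
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
master_addr = socket.gethostname() if rank == 0 else None
os.environ["MASTER_ADDR"] = comm.bcast(master_addr, root=0)
os.environ["MASTER_PORT"] = "2345"                                   # arbitrary free port
dist.init_process_group(backend="xccl", rank=rank, world_size=size)  # new: xccl backend

# new: pin each rank to a GPU tile
local_rank = rank % torch.xpu.device_count()
device = torch.device(f"xpu:{local_rank}")
torch.xpu.set_device(device)

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
).to(device)
model = DDP(model)                                                   # new: wrap model in DDP
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * size)      # modified: rescale lr

dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
sampler = DistributedSampler(dataset, num_replicas=size, rank=rank)  # new: partition data
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)                                         # new: reshuffle each epoch
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

dist.destroy_process_group()
```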
Note: Steps related to IPEX are optional.
Here are the steps to run the above code on Aurora (the corresponding commands are sketched after this list):

- Log in to Aurora.
- Request an interactive job on two nodes for 30 minutes.
- Copy the above Python script into a file called `pytorch_ddp.py` and make it executable with `chmod a+x pytorch_ddp.py`.
- Load the frameworks module.
- Run the script on 24 tiles, 12 per node.
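A sketch of these commands (the queue, filesystem, and project names are assumptions):

```bash
# 1. log in (replace <username> with your ALCF username)
ssh <username>@aurora.alcf.anl.gov

# 2. request a two-node interactive job for 30 minutes
qsub -I -l select=2 -l walltime=00:30:00 -l filesystems=home:flare -A <ProjectName> -q debug

# 4. load the frameworks module
module load frameworks

# 5. 24 ranks in total, 12 per node (one per tile)
mpiexec -n 24 -ppn 12 python pytorch_ddp.py
```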
Settings for training beyond 16 nodes
Setting the CPU Affinity
The CPU affinity can be set manually through mpiexec. You can do this the following way (after having loaded all needed modules):
```bash
## Option 1
export CPU_BIND="list:4:9:14:19:20:25:56:61:66:71:74:79" # 12 ppn to 12 cores
## Option 2
export CPU_BIND="verbose,list:4-7:8-11:12-15:16-19:20-23:24-27:56-59:60-63:64-67:68-71:72-75:76-79" # 12 ppn with each rank having 4 cores

mpiexec ... --cpu-bind=${CPU_BIND}
```
These bindings should be used along with the following oneCCL and Horovod environment variable settings:
```bash
HOROVOD_THREAD_AFFINITY="4,8,12,16,20,24,56,60,64,68,72,76"
## Option 1
CCL_WORKER_AFFINITY="42,43,44,45,46,47,94,95,96,97,98,99"
## Option 2
unset CCL_WORKER_AFFINITY # Default will pick up from the last 24 cores even if you didn't specify these in the binding.
```
When running 12 ranks per node with these settings, the frameworks use 4 cores per rank, with Horovod tightly coupled to the frameworks on one of those 4 cores, and oneCCL on a separate core for better performance. For example, with rank 0, the frameworks would use cores 4-7, Horovod would use core 4, and oneCCL would use core 42.
We have provided two options for the CPU binding list. The first assigns one CPU core per rank; the second assigns 4 CPU cores per rank. In the first oneCCL worker affinity option, we pick 12 CPU cores, one per rank. Note that these cores are picked from the last 12 cores of each socket (CPU), in line with oneCCL's default core-picking strategy: cores 42-47 belong to the first socket, and 94-99 belong to the second socket. We leave a few cores free in case the user wants to run other services, such as copper or DAOS, alongside their application. The second oneCCL option delegates the task of picking cores to the system; in this case, the user should not declare or export the CCL_WORKER_AFFINITY variable.
Each workload may perform better with different settings. The criteria for choosing the CPU bindings are:
- Binding for GPU and NIC affinity – To bind the ranks to cores on the proper socket or NUMA nodes.
- Binding for cache access – This is the part that will change per application and some experimentation is needed.
Important: This setup is a work in progress and is based on observed performance. The recommended settings are likely to change with new framework releases.
Distributed Training with Multiple CCSs¶
The Intel PVC GPUs contain 4 Compute Command Streamers (CCSs) on each tile, which can be used to group Execution Units (EUs) into common pools. These pools can then be accessed by separate processes thereby enabling distributed training with multiple MPI processes per tile. This feature on PVC is similar to MPS on NVIDIA GPUs and can be beneficial for increasing computational throughput when training or performing inference with smaller models which do not require the entire memory of a PVC tile. For more information, see the section on using multiple CCSs under the Running Jobs on Aurora page.
For DDP, distributed training with multiple CCSs can be enabled programmatically within the user code by explicitly setting the xpu device in PyTorch, for example:
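A sketch (the environment variable used to derive the rank is an assumption; adapt to your launcher):

```python
import os
import torch

# with multiple ranks per tile, pin each process to its assigned device explicitly
rank = int(os.environ.get("PMI_RANK", 0))
local_rank = rank % torch.xpu.device_count()
torch.xpu.set_device(local_rank)
device = torch.device(f"xpu:{local_rank}")
```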
- PVC GPUs allow the use of 1, 2, or 4 CCSs on each tile
and then adding the proper environment variables and mpiexec settings in the run script. For example, to run distributed training with 48 MPI processes per node, exposing 4 CCSs per tile, set:
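A sketch of such a run-script fragment (treat the exact `ZEX_NUMBER_OF_CCS` values as an assumption; see the Running Jobs on Aurora page for the authoritative settings):

```bash
# expose 4 CCSs on each of the 6 GPUs
export ZEX_NUMBER_OF_CCS=0:4,1:4,2:4,3:4,4:4,5:4

# 48 ranks per node = 4 ranks per tile x 12 tiles
mpiexec -n 48 -ppn 48 python pytorch_ddp.py
```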
Alternatively, users can use the following modified GPU affinity script in their mpiexec command in order to bind multiple MPI processes to each tile by setting `ZE_AFFINITY_MASK`:
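A sketch of such a `gpu_affinity_ccs.sh` script (`PALS_LOCAL_RANKID` is set by mpiexec on Aurora; the rank-to-tile arithmetic is an assumption):

```bash
#!/bin/bash
# gpu_affinity_ccs.sh -- bind groups of local ranks to tiles via ZE_AFFINITY_MASK
num_ccs=${1:-4}   # number of CCSs exposed per tile, passed as a command-line argument
num_gpus=6        # GPUs per Aurora node
num_tiles=2       # tiles per GPU

local_rank=${PALS_LOCAL_RANKID}
tile_index=$(( (local_rank / num_ccs) % (num_gpus * num_tiles) ))
gpu_id=$(( tile_index / num_tiles ))
tile_id=$(( tile_index % num_tiles ))

export ZE_AFFINITY_MASK=${gpu_id}.${tile_id}
exec "$@"
```

It would then be invoked as, e.g., `mpiexec -n 48 -ppn 48 ./gpu_affinity_ccs.sh 4 python pytorch_ddp.py`.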
- Note that the script takes the number of CCSs exposed as a command-line argument
Checking PVC usage with xpu-smi
Users are invited to check the correct placement of the MPI ranks on the different tiles by connecting to the compute node being used and executing the following:
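For example (a sketch using `xpu-smi`'s statistics mode; check `xpu-smi --help` for the exact flags on your system):

```bash
# utilization statistics for one of the 6 GPUs on the node (GPU_ID in 0-5)
xpu-smi stats -d <GPU_ID>
```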
- In this case, `GPU_ID` refers to one of the 6 GPUs on each node, not an individual tile
and checking the GPU and memory utilization of both tiles.
Multiple CCSs and oneCCL
- When performing distributed training exposing multiple CCSs, the collective communications with the oneCCL backend are delegated to the CPU. This is done in the background by oneCCL, so no change to the users' code is required to move data between host and device; however, it may impact the performance of the collectives at scale.
- When using PyTorch DDP, the model must be offloaded to the XPU device after calling the `DDP()` wrapper on the model to avoid hangs.