PyTorch on Aurora¶
PyTorch is a popular, open-source deep learning framework developed and released by Facebook. For more information, refer to the PyTorch home page. For troubleshooting on Aurora, please contact support@alcf.anl.gov.
Major changes in the frameworks module of Spring 2026 (frameworks/2025.3.1)¶
- The `torch_ccl` module has been removed; `import oneccl_bindings_for_pytorch as torch_ccl` is no longer needed.
- When initializing `torch.distributed`, the `backend` must be changed to `xccl` from `ccl`.
- `import intel_extension_for_pytorch as ipex` is now deprecated. The vendor is upstreaming all of the functionality from IPEX to the mainline PyTorch distribution. If you experience performance variations after removing the import, please switch back to importing it.
- `horovod` support for PyTorch has been removed.
- `ONEAPI_DEVICE_SELECTOR` has been set to `"opencl:gpu;level_zero:gpu"`. If this causes any issues, please revert to Level Zero only with `export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"`.
Provided Installation¶
PyTorch is already installed on Aurora with GPU support and available through the frameworks module. To use it from a compute node, please load the following modules:
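The module listing did not survive extraction; per the affinity check later on this page, the required command is:

```shell
module load frameworks
```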
Then, you can import PyTorch in Python as usual (below showing results from the frameworks/2025.3.1 module):
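For example, with the frameworks module loaded:

```python
import torch

# The reported version depends on the loaded frameworks module
print(torch.__version__)
```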
A simple but useful check could be to use PyTorch to get device information on a compute node. You can do this the following way:
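The code listing did not survive extraction; a sketch consistent with the example output below (it requires an XPU-enabled build of PyTorch on a compute node) is:

```python
import torch

# Query the XPU runtime for device information
print(f"GPU availability: {torch.xpu.is_available()}")
print(f"Number of tiles = {torch.xpu.device_count()}")
print(f"Current tile = {torch.xpu.current_device()}")
print(f"Current device ID = {torch.xpu.device(torch.xpu.current_device())}")
print(f"Device properties = {torch.xpu.get_device_properties(torch.xpu.current_device())}")
```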
Example output:
GPU availability: True
Number of tiles = 12
Current tile = 0
Current device ID = <torch.xpu.device object at 0x154c8fad4d40>
Device properties = _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', device_id=0xBD6, uuid=d20ebf0c-4ca0-6be7-0000-000000000001, driver_version='1.6.33578+42', total_memory=65520MB, max_compute_units=448, gpu_eu_count=448, gpu_subslice_count=56, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
Tile-as-device setting for AI/ML workloads
Each Aurora node has 6 GPUs (also called "Devices" or "cards") and each GPU is composed of two tiles (also called "Sub-device"). By default, the frameworks module sets ZE_FLAT_DEVICE_HIERARCHY=FLAT, meaning that the 12 PVC tiles are exposed as devices (see more details on the Python page). This is the recommended setting for AI/ML workloads.
Using the entire PVC GPU as PyTorch devices
By default, each tile is mapped to one PyTorch device, giving a total of 12 devices per node, as seen above. To map a PyTorch device to an entire PVC GPU out of the 6 available on a compute node, set
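that is, switch the device hierarchy to composite mode:

```shell
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
```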
and mask the devices with
# To mask entire PVC GPUs
export ZE_AFFINITY_MASK=0,1
# or to mask particular tiles only (use syntax `Device.Sub-device`)
export ZE_AFFINITY_MASK=0.0,1.0
You can check that each PyTorch device is now mapped to one GPU with:
module load frameworks
ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE ZE_AFFINITY_MASK=0 python test_affinity.py
test_affinity.py:
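The listing for `test_affinity.py` did not survive extraction; a minimal script consistent with the example output below would be:

```python
import torch

# Number of visible PyTorch devices (1 when masking a single composite GPU)
print(torch.xpu.device_count())
# Properties of device 0: a full PVC GPU has 896 EUs and ~128 GB of memory
print(torch.xpu.get_device_properties(0))
```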
Example output
1
_XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', device_id=0xBD6, uuid=d20ebf0c-4ca0-6be7-0000-000000000000, driver_version='1.6.33578+42', total_memory=131040MB, max_compute_units=896, gpu_eu_count=896, gpu_subslice_count=112, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
More information and details are available through the Level Zero Specification Documentation - Affinity Mask
Code changes to run PyTorch on Aurora GPUs¶
Here we list some common changes that you may need to make to your PyTorch code in order to use Intel GPUs.
- All the API calls involving `torch.cuda` should be replaced with `torch.xpu`.
- When moving tensors and models to the GPU, replace `"cuda"` with `"xpu"`.
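For example, a hedged sketch of both changes (`model` and `x` are placeholder names; the `.to("xpu")` calls require an XPU device):

```python
import torch

x = torch.randn(4, 4)
model = torch.nn.Linear(4, 2)

# CUDA code such as:
#   torch.cuda.device_count(); x = x.to("cuda"); model.to("cuda")
# becomes:
print(torch.xpu.device_count())  # number of XPU devices (tiles in FLAT mode)
x = x.to("xpu")
model.to("xpu")
```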
Tip
A more portable solution to select the appropriate device is the following:
Example: training a PyTorch model on a single GPU tile¶
Here is a simple code to train a dummy PyTorch model on CPU:
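The original listing did not survive extraction; a minimal dummy example of this kind (the model architecture and data are illustrative) could look like:

```python
import torch

# Dummy dataset: random inputs and targets
x = torch.randn(256, 16)
y = torch.randn(256, 1)

# Simple fully connected model
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Full-batch training loop on the CPU
for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```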
And here is the code to train the same model on a single GPU tile on Aurora, with new or modified lines highlighted:
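The highlighted listing did not survive extraction; the same sketch adapted for a single tile, with the changed lines marked by comments, would be:

```python
import torch

device = torch.device("xpu")          # changed: target a single GPU tile

x = torch.randn(256, 16).to(device)   # changed: move data to the tile
y = torch.randn(256, 1).to(device)    # changed: move data to the tile

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
).to(device)                          # changed: move model to the tile
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```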
PyTorch Best Practices on Aurora¶
When running PyTorch applications, we have found the following practices to be generally, if not universally, useful and encourage you to try some of these techniques to boost performance of your own applications.
- Use Reduced Precision. Reduced precision is available on the Intel Max 1550 and is supported with PyTorch operations. In general, the way to do this is via the PyTorch Automatic Mixed Precision (AMP) package, as described in the mixed precision documentation. In PyTorch, users generally need to manage casting and loss scaling manually, though context managers and function decorators can provide easy tools to do this.
- PyTorch has a `JIT` module as well as backends to support op fusion, similar to TensorFlow's `tf.function` tools. See TorchScript for more information.
- `torch.compile` is available for the Intel Max 1550 GPU and can be used to speed up training and inference. See the PyTorch docs for more information.
- For convolutional neural networks, using the `channels_last` (NHWC) memory format gives better performance. More info here and here.
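As an illustration of the reduced-precision point above, a hedged sketch of bfloat16 autocasting on an XPU device (assumes an XPU is present; `model` and `x` are placeholders):

```python
import torch

model = torch.nn.Linear(16, 4).to("xpu")
x = torch.randn(8, 16, device="xpu")

# Autocast eligible ops to bfloat16; bf16 generally does not need loss scaling
with torch.autocast(device_type="xpu", dtype=torch.bfloat16):
    y = model(x)
```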
Distributed Training on multiple GPUs¶
Distributed training with PyTorch on Aurora is facilitated through PyTorch's Distributed Data Parallel (DDP). Horovod is no longer supported in recent frameworks modules.
Distributed Data Parallel (DDP)¶
Code changes to train on multiple GPUs using DDP¶
The key steps in performing distributed training are:

- Initialize `torch.distributed` with `backend='xccl'`
- Use `DistributedSampler` to partition the training data among the ranks
- Pin each rank to a GPU
- Wrap the model in `DDP` to keep it in sync across the ranks
- Rescale the learning rate
- Use `set_epoch` for shuffling data across epochs
Here is the code to train the same dummy PyTorch model on multiple GPUs, where new or modified lines have been highlighted:
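The highlighted listing did not survive extraction; a sketch of the DDP version of the dummy example, following the steps listed above, could look as follows. The rank environment variables (`PMI_RANK`, `PMI_SIZE`, `PALS_LOCAL_RANKID`) are those set by `mpiexec` on Aurora, and `MASTER_ADDR`/`MASTER_PORT` are assumed to be set in the job script:

```python
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Rank information from the launcher
rank = int(os.environ.get("PMI_RANK", 0))
world_size = int(os.environ.get("PMI_SIZE", 1))
local_rank = int(os.environ.get("PALS_LOCAL_RANKID", 0))

# 1. Initialize torch.distributed with the xccl backend
torch.distributed.init_process_group(backend="xccl", rank=rank, world_size=world_size)

# 3. Pin each rank to one GPU tile
device = torch.device(f"xpu:{local_rank}")
torch.xpu.set_device(device)

dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
# 2. Partition the training data among the ranks
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
).to(device)
# 4. Wrap the model in DDP to keep it in sync across the ranks
model = DDP(model)
loss_fn = torch.nn.MSELoss()
# 5. Rescale the learning rate with the number of ranks
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * world_size)

for epoch in range(10):
    # 6. Reshuffle the data across epochs
    sampler.set_epoch(epoch)
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb.to(device)), yb.to(device))
        loss.backward()
        optimizer.step()

torch.distributed.destroy_process_group()
```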
CPU bindings for best performance on Aurora
For good performance, it is important to set the appropriate CPU affinity when launching training scripts with mpiexec. When using all 12 PVC tiles on each of the nodes, the following setting is recommended
export CPU_BIND="verbose,list:4-7:8-11:12-15:16-19:20-23:24-27:56-59:60-63:64-67:68-71:72-75:76-79" # (1)!
mpiexec ... --cpu-bind=${CPU_BIND} python pytorch_ddp.py
- 12 processes per node, evenly split across the 2 CPU sockets, with each rank having 4 cores available
Distributed Training with Multiple CCSs¶
The Intel PVC GPUs contain 4 Compute Command Streamers (CCSs) on each tile, which can be used to group Execution Units (EUs) into common pools. These pools can then be accessed by separate processes, thereby enabling distributed training with multiple MPI processes per tile. This feature on PVC is similar to MPS on NVIDIA GPUs and can be beneficial for increasing computational throughput when training or performing inference with smaller models which do not require the entire memory of a PVC tile. For more information, see the section on using multiple CCSs under the Running Jobs on Aurora page.
For DDP, distributed training with multiple CCSs can be enabled programmatically within the user code by explicitly setting the XPU device in PyTorch, for example:
- PVC GPUs allow the use of 1, 2, or 4 CCSs on each tile
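A hedged sketch of such a device assignment (the launcher variable and the CCS count are illustrative):

```python
import os
import torch

num_ccs = 4  # CCSs exposed per tile: PVC allows 1, 2, or 4
local_rank = int(os.environ.get("PALS_LOCAL_RANKID", 0))  # local rank from the launcher

# With 48 ranks per node and 4 CCSs per tile, 4 consecutive ranks share one tile
device = torch.device(f"xpu:{local_rank // num_ccs}")
torch.xpu.set_device(device)
```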
and then adding the proper environment variables and mpiexec settings in the run script. For example, to run distributed training with 48 MPI processes per node exposing 4 CCSs per tile, set
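The listing did not survive extraction; a sketch, under the assumption that the Intel compute runtime's `ZEX_NUMBER_OF_CCS` variable is used as on the Running Jobs on Aurora page (`NTOTRANKS` and `CPU_BIND` are placeholder variables set elsewhere in the job script):

```shell
# Expose 4 CCSs on each of the 12 tiles (format: <device index>:<number of CCSs>)
export ZEX_NUMBER_OF_CCS=0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4
# 48 MPI processes per node
mpiexec -n ${NTOTRANKS} --ppn 48 --cpu-bind=${CPU_BIND} python pytorch_ddp.py
```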
Alternatively, users can use the following modified GPU affinity script in their mpiexec command in order to bind multiple MPI processes to each tile by setting ZE_AFFINITY_MASK
gpu_affinity_ccs.sh:
- Note that the script takes the number of CCSs exposed as a command line argument
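The original script listing did not survive extraction; a hedged sketch of what such a wrapper could look like (not the official ALCF script; it uses the `Device.Sub-device` mask syntax shown earlier on this page and the `PALS_LOCAL_RANKID` variable set by the launcher):

```shell
#!/bin/bash
# Bind groups of consecutive local ranks to the same tile via ZE_AFFINITY_MASK
num_ccs=${1:-4}                         # number of CCSs exposed per tile (command-line argument)
local_rank=${PALS_LOCAL_RANKID:-0}      # local rank set by the PALS launcher
tile=$(( (local_rank / num_ccs) % 12 )) # 12 tiles per Aurora node
gpu=$(( tile / 2 ))                     # Device index (0-5)
sub=$(( tile % 2 ))                     # Sub-device index (0-1)
export ZE_AFFINITY_MASK="${gpu}.${sub}"
echo "local rank ${local_rank} -> ZE_AFFINITY_MASK=${ZE_AFFINITY_MASK}"
# Launch the application, e.g.: gpu_affinity_ccs.sh 4 python pytorch_ddp.py
if [ "$#" -gt 1 ]; then
    shift
    exec "$@"
fi
```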
Checking PVC usage with xpu-smi
Users are invited to check the correct placement of the MPI ranks on the different tiles by connecting to the compute node being used and executing
- In this case, GPU_ID refers to one of the 6 GPUs on each node, not an individual tile
and checking the GPU and memory utilization of both tiles.
Alternatively, execute
`card0` refers to GPU 0, `card1` to GPU 1, etc.
and press 1 on the keyboard to see the utilization of the CCSs on the selected GPU.
Multiple CCSs and oneCCL
- When performing distributed training exposing multiple CCSs, the collective communications with the oneCCL backend are delegated to the CPU. This is done in the background by oneCCL, so no change to the users' code is required to move data between host and device; however, it may impact the performance of the collectives at scale.
- When using PyTorch DDP, the model must be offloaded to the XPU device after calling the `DDP()` wrapper on the model to avoid hangs.