TensorFlow on Aurora¶
TensorFlow is a popular, open-source deep learning framework developed and released by Google. The TensorFlow home page has more information about the framework. For troubleshooting on Aurora, contact support@alcf.anl.gov.
Recent major changes¶
TensorFlow now has its own module, separate from the frameworks module. This will change again in the near future, as we are testing a containerized solution that will be made available to users.
Provided installation¶
TensorFlow is preinstalled on Aurora and available through the tensorflow module. To use it on a compute node, load the module:
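For example, on a compute node:

```shell
module load tensorflow
```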
Then you can import TensorFlow as usual. The examples below use the tensorflow/2025.2.0 module.
A useful check is to verify that TensorFlow detects the available devices on a compute node:
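The check can be done with `tf.config.list_physical_devices()`:

```python
import tensorflow as tf

# List every device TensorFlow can see; on an Aurora compute node this
# includes the CPU plus one entry per GPU tile.
print(tf.config.list_physical_devices())
```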
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:1', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:2', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:3', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:4', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:5', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:6', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:7', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:8', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:9', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:10', device_type='XPU'),
PhysicalDevice(name='/physical_device:XPU:11', device_type='XPU')]
Note that tf.config returns 12 tiles across 6 GPUs (the GPU resources on an Aurora compute node), treating each tile as a separate device. You can set the environment variable ZE_FLAT_DEVICE_HIERARCHY to control this behavior, as described in the Level Zero Specification documentation. This environment variable replaces ITEX_TILE_AS_DEVICE, which is deprecated.
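For example, the per-GPU (rather than per-tile) view can be requested as follows; the value names follow the Level Zero specification, so confirm them against the documentation linked above:

```shell
# Expose each GPU as a single device (2 tiles per device)
# instead of the default per-tile view
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
```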
Intel Extension for TensorFlow is available as an open-source project on GitHub.
Consult the following resources for additional details and tutorials:
TensorFlow best practices on Aurora¶
Single device performance¶
To expose a single tile out of the 12 available (6 GPUs, each with 2 tiles) on a compute node, set the following environment variable:
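A sketch of the affinity mask, assuming the default per-tile device hierarchy (the exact mask syntax depends on the `ZE_FLAT_DEVICE_HIERARCHY` mode in use):

```shell
# Expose only tile 0 of GPU 0 to the application
export ZE_AFFINITY_MASK=0.0
```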
More details are available in the Level Zero Specification Documentation - Affinity Mask.

Single node performance¶
The following practices have been found to generally improve performance of TensorFlow applications on Aurora.
Reduced precision¶
Use reduced precision whenever your application allows it. The Intel Max 1550 GPUs support reduced precision through TensorFlow operations. The standard approach is through the tf.keras.mixed_precision policy, as described in the mixed precision documentation. Intel Extension for TensorFlow is fully compatible with the Keras mixed precision API and also provides an advanced auto mixed precision feature. You can set two environment variables to gain the performance benefits of FP16/BF16 without modifying application code:
export ITEX_AUTO_MIXED_PRECISION=1
export ITEX_AUTO_MIXED_PRECISION_DATA_TYPE="BFLOAT16" # or "FLOAT16"
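Alternatively, mixed precision can be requested in application code through the Keras API; a minimal sketch:

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# Compute in bfloat16 while keeping variables in float32
mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    layers.Dense(64, activation="relu"),
    # Keep the final layer's outputs in float32 for numerical stability
    layers.Dense(10, dtype="float32"),
])
```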
If you use a custom training loop (rather than keras.Model.fit), you will also need to apply loss scaling.

TensorFlow's graph API¶
Use TensorFlow's graph API to improve the efficiency of operations. Although TensorFlow operates in eager mode by default, the @tf.function decorator traces Python functions and replaces them with lower-level, semi-compiled TensorFlow graphs. See the tf.function documentation for details. When possible, use jit_compile. Be aware that when using tf.function, Python expressions that are not tensors are often replaced with constants in the graph, which may not be the intended behavior.
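A minimal sketch of a graph-compiled function with `jit_compile` enabled:

```python
import tensorflow as tf

# Trace the Python function into a TensorFlow graph and XLA-compile it
@tf.function(jit_compile=True)
def dense_step(x, w):
    return tf.nn.relu(tf.matmul(x, w))

x = tf.random.normal((8, 16))
w = tf.random.normal((16, 4))
print(dense_step(x, w).shape)  # (8, 4)
```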
An experimental feature enables aggressive kernel fusion through the oneDNN Graph API. Intel Extension for TensorFlow can offload performance-critical graph partitions to the oneDNN library for more aggressive graph optimizations. Enable it by setting the following environment variable:
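In recent Intel Extension for TensorFlow releases this is controlled by `ITEX_ONEDNN_GRAPH`; confirm the exact variable name against the ITEX documentation for your installed version:

```shell
export ITEX_ONEDNN_GRAPH=1
```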
This feature is experimental and actively under development.

TF32 math mode¶
The Intel Xe Matrix Extensions (Intel XMX) engines in the Intel Max 1550 Xe-HPC GPUs natively support TF32 math mode. Enable it through Intel Extension for TensorFlow by setting the following environment variable:
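In Intel Extension for TensorFlow this is controlled by `ITEX_FP32_MATH_MODE`; confirm the variable name against the ITEX documentation for your installed version:

```shell
export ITEX_FP32_MATH_MODE=TF32
```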
XLA compilation (planned/upcoming)¶
XLA (Accelerated Linear Algebra) is a compiler available in TensorFlow and central to frameworks like JAX. XLA compiles tf.Graph objects generated by tf.function and performs optimizations such as operation fusion. XLA can deliver significant performance improvements with minimal code changes: simply set the environment variable TF_XLA_FLAGS=--tf_xla_auto_jit=2. However, if your code is complex or uses dynamically sized tensors (where the shape changes each iteration), XLA can be detrimental: the compilation overhead may outweigh the performance gains. XLA is particularly effective when combined with reduced precision, yielding speedups greater than 100% in some models.
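For example:

```shell
# Enable XLA auto-clustering for all compilable operations
export TF_XLA_FLAGS=--tf_xla_auto_jit=2
```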
Intel provides initial GPU support for TensorFlow models with XLA acceleration through Intel Extension for OpenXLA. Full TensorFlow and PyTorch support is planned.
A simple example¶
The following example demonstrates how to use an Intel GPU with TensorFlow:
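A minimal sketch (the CPU fallback is added here so the snippet also runs on machines without an XPU):

```python
import tensorflow as tf

# Use the first Intel GPU tile if present, otherwise fall back to the CPU
xpus = tf.config.list_physical_devices("XPU")
device = "/XPU:0" if xpus else "/CPU:0"

with tf.device(device):
    a = tf.random.normal((1024, 1024))
    b = tf.random.normal((1024, 1024))
    c = tf.matmul(a, b)

print(device, c.shape)
```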
Multi-GPU / multi-node scale up¶
TensorFlow supports scaling across multiple GPUs per node and across multiple nodes. Good performance has been observed with Horovod in particular. For details, see the Horovod documentation. The following sections cover Aurora-specific considerations.
Environment variables¶
The following environment variables should be set in the batch submission script (PBS script) when running on more than 16 nodes.
oneCCL environment variables
We have identified a set of environment settings that typically provide better performance or address potential application hangs and crashes at large scale. This particular setup is still experimental, and it might change as the environment variable settings are refined. Users are encouraged to check this page regularly.
Among them is a minimal set that is essential for the functionality of training workloads; we have tested it at up to 1024 nodes.
Minimal set
Beyond the minimal set, applications should be tuned using the list below, which is not exhaustive.
Users of vLLM and other inference services should rely on the variables set by the frameworks module.
The effect of the following additional environment variables is application-dependent. Users are encouraged to try them and see whether they help their applications.
CPU affinity¶
CPU affinity should be set manually through mpiexec:
## Option 1
export CPU_BIND="list:4:9:14:19:20:25:56:61:66:71:74:79" # 12 ppn to 12 cores
## Option 2
export CPU_BIND="verbose,list:4-7:8-11:12-15:16-19:20-23:24-27:56-59:60-63:64-67:68-71:72-75:76-79" # 12 ppn with each rank having 4 cores
mpiexec ... --cpu-bind=${CPU_BIND}
These bindings should be used along with the following oneCCL and Horovod environment variable settings:
HOROVOD_THREAD_AFFINITY="4,8,12,16,20,24,56,60,64,68,72,76"
## Option 1
CCL_WORKER_AFFINITY="42,43,44,45,46,47,94,95,96,97,98,99"
## Option 2
unset CCL_WORKER_AFFINITY # By default, oneCCL picks worker cores from the last 24 cores, even if they are not in the binding list.
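Putting the pieces together, a launch line might look like the following; `train.py` is a placeholder for the user's training script, and the flags follow Aurora's `mpiexec` conventions:

```shell
# Hypothetical single-node launch: 12 ranks, one per GPU tile,
# 4 cores per rank (Option 2 binding above)
export CPU_BIND="list:4-7:8-11:12-15:16-19:20-23:24-27:56-59:60-63:64-67:68-71:72-75:76-79"
export HOROVOD_THREAD_AFFINITY="4,8,12,16,20,24,56,60,64,68,72,76"
export CCL_WORKER_AFFINITY="42,43,44,45,46,47,94,95,96,97,98,99"

mpiexec -n 12 -ppn 12 --cpu-bind=${CPU_BIND} python train.py
```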
When running 12 ranks per node with these settings, the framework uses 4 cores per rank, with Horovod pinned to one of those 4 cores and oneCCL using a separate core for better performance. For example, rank 0 would use cores 4-7 for the framework, core 4 for Horovod, and core 42 for oneCCL.
The CPU binding list provides two options. The first assigns one CPU core per rank; the second assigns 4 CPU cores per rank. In the first oneCCL worker affinity option, 12 CPU cores are selected (one per rank) from the last 12 cores of each socket, consistent with the oneCCL default core selection strategy: cores 42-47 belong to the first socket, and cores 94-99 belong to the second socket. A few cores are left free for other services such as Cray MPICH and DAOS. The second oneCCL option delegates core selection to the system; in this case, do not declare or export the CCL_WORKER_AFFINITY variable.
Each workload may perform better with different settings. The criteria for choosing CPU bindings are:
- GPU and NIC affinity -- Bind ranks to cores on the appropriate socket or NUMA node.
- Cache access -- The optimal binding varies by application and may require experimentation.
Note
This setup is a work in progress based on observed performance. The recommended settings are likely to change with new frameworks module releases. To learn more about CPU binding, see the Running Jobs page.
Distributed training¶
Distributed training with TensorFlow on Aurora is facilitated through Horovod, using Intel Optimization for Horovod.
The key steps for distributed training are outlined in the following example:
A detailed implementation of the same example is available here:
A suite of detailed and well-documented examples is available in the Intel Optimization for Horovod repository:
A simple job script¶
Below is a simple job script:
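A sketch of such a script; the project name, queue, filesystems, and `train.py` are placeholders to adapt to your allocation:

```shell
#!/bin/bash -l
#PBS -l select=2
#PBS -l walltime=00:30:00
#PBS -q debug
#PBS -A <your_project>
#PBS -l filesystems=home:flare

cd ${PBS_O_WORKDIR}
module load tensorflow

NNODES=$(wc -l < ${PBS_NODEFILE})
PPN=12                         # one rank per GPU tile
NRANKS=$(( NNODES * PPN ))

CPU_BIND="list:4-7:8-11:12-15:16-19:20-23:24-27:56-59:60-63:64-67:68-71:72-75:76-79"

mpiexec -n ${NRANKS} -ppn ${PPN} --cpu-bind=${CPU_BIND} python train.py
```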