Aurora Machine Overview
Aurora is a 10,624-node HPE Cray EX-based system. It has 166 racks with 21,248 CPUs and 63,744 GPUs. Each node contains two Intel Xeon CPU Max Series processors (codename Sapphire Rapids, or SPR) with on-package HBM and six Intel Data Center GPU Max Series GPUs (codename Ponte Vecchio, or PVC). Each Xeon CPU has 52 physical cores, with 2 hardware threads per core, and 64 GB of HBM. Each CPU socket also has 512 GB of DDR5 memory. The GPUs in a node are connected all-to-all with Intel Xe Link interfaces. Each node has 8 HPE Slingshot-11 NICs, and the system is connected in a Dragonfly topology. The GPUs can send messages directly to a NIC over PCIe without first copying the data into CPU memory.
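The system-wide totals quoted above follow from the per-node figures. As a quick sanity check, the sketch below recomputes them in Python; the constant names are illustrative, and the nodes-per-rack and total-NIC values are derived here rather than stated in the text.

```python
# Per-node and system figures quoted in the overview paragraph.
NODES = 10_624
RACKS = 166
CPUS_PER_NODE = 2
GPUS_PER_NODE = 6
CORES_PER_CPU = 52
THREADS_PER_CORE = 2
NICS_PER_NODE = 8

# System-wide totals.
total_cpus = NODES * CPUS_PER_NODE
total_gpus = NODES * GPUS_PER_NODE
total_cores = total_cpus * CORES_PER_CPU
total_threads = total_cores * THREADS_PER_CORE

# Derived values (not stated in the text): NIC count and rack density.
total_nics = NODES * NICS_PER_NODE
nodes_per_rack, leftover_nodes = divmod(NODES, RACKS)

print(f"CPUs: {total_cpus:,}  GPUs: {total_gpus:,}")
print(f"Cores: {total_cores:,}  Hardware threads: {total_threads:,}")
print(f"NICs: {total_nics:,}  Nodes per rack: {nodes_per_rack}")
```

The CPU and GPU totals match the 21,248 and 63,744 quoted above, and the node count divides evenly across the 166 racks.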

Figure 1: Summary of the compute, memory, and communication hardware contained within a single Aurora node.
The Intel Data Center GPU Max Series is built from Xe cores. Each Xe core consists of 8 vector engines and 8 matrix engines, with 512 KB of L1 cache that can be configured as cache or Shared Local Memory (SLM). 16 Xe cores are grouped together to form a slice. 4 slices are combined with a large L2 cache and 4 HBM2E memory controllers to form a stack (also called a tile). One or more stacks can then be combined on a socket to form a GPU. More detailed information about node architecture can be found here.
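The hierarchy described above (engines, Xe cores, slices, stacks) can be multiplied out to get per-stack and per-GPU counts. The following is a minimal sketch; the constant names are illustrative, and the value of 2 stacks per GPU is taken from the GPU architecture summary table on this page rather than from the paragraph itself.

```python
# Building blocks as described in the paragraph above.
VECTOR_ENGINES_PER_XE_CORE = 8
MATRIX_ENGINES_PER_XE_CORE = 8
XE_CORES_PER_SLICE = 16
SLICES_PER_STACK = 4

# Assumption: 2 stacks per GPU, as listed in the GPU summary table.
STACKS_PER_GPU = 2

# Multiply up the hierarchy: Xe core -> slice -> stack -> GPU.
xe_cores_per_stack = XE_CORES_PER_SLICE * SLICES_PER_STACK
vector_engines_per_stack = xe_cores_per_stack * VECTOR_ENGINES_PER_XE_CORE
matrix_engines_per_stack = xe_cores_per_stack * MATRIX_ENGINES_PER_XE_CORE
xe_cores_per_gpu = xe_cores_per_stack * STACKS_PER_GPU

print(f"Xe cores per stack:       {xe_cores_per_stack}")
print(f"Vector engines per stack: {vector_engines_per_stack}")
print(f"Matrix engines per stack: {matrix_engines_per_stack}")
print(f"Xe cores per GPU:         {xe_cores_per_gpu}")
```

These are the full hardware counts; the summary table notes that only 448 of the 512 vector and matrix engines per stack are active.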
Aurora Compute Node
| NODE COMPONENT | DESCRIPTION | PER NODE | AGGREGATE |
|---|---|---|---|
| Processor | Intel Xeon CPU Max 9470C Series | 2 | 21,248 |
| Cores/Threads | 2000 MHz | 104/208 | 1,104,896/2,209,792 |
| CPU HBM | HBM2e | 64x2 GiB | 1.297 PiB |
| CPU DRAM | DDR5 | 512x2 GiB | 10.375 PiB |
| GPUs | Intel Data Center Max 1550 Series | 6 | 63,744 |
| GPU HBM | HBM2e | 768 GiB | 7.781 PiB |
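The aggregate memory capacities can be recomputed from the per-node values in the table. The sketch below assumes strict binary prefixes (1 PiB = 1,024 TiB = 1,024² GiB); the helper name `aggregate_pib` is hypothetical. Under that convention the CPU and GPU HBM totals are 1,328 TiB and 7,968 TiB, which correspond to roughly 1.297 PiB and 7.781 PiB.

```python
NODES = 10_624
GIB_PER_PIB = 1024 ** 2  # 1 PiB = 1024 TiB = 1024 * 1024 GiB

# Per-node capacities in GiB, taken from the compute node table.
per_node_gib = {
    "CPU HBM": 64 * 2,
    "CPU DRAM": 512 * 2,
    "GPU HBM": 768,
}

def aggregate_pib(gib_per_node, nodes=NODES):
    """Return the system-wide capacity in PiB for a per-node capacity in GiB."""
    return gib_per_node * nodes / GIB_PER_PIB

for name, gib in per_node_gib.items():
    print(f"{name}: {gib} GiB/node -> {aggregate_pib(gib):.3f} PiB")
```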
Aurora GPU Architecture Summary
| GPU COMPONENT | DESCRIPTION | COUNT | CAPABILITY |
|---|---|---|---|
| Stack | a.k.a. Tile | 2 | |
| Xe Vector Engine | a.k.a. EU (execution unit) | 512 per Stack (448 active) | 8 threads, 512b SIMD |
| Xe Matrix Engine | a.k.a. systolic part of EU | 512 per Stack (448 active) | |
| Register | 512-bit register | 128 per thread | |
| Xe Core | a.k.a. subslice; unit of 8 EUs | 64 per Stack | 128 per GPU |
| L1 cache | 128 KiB | | |
| Last Level cache | a.k.a. RAMBO cache | 384 MiB per GPU | |
See Aurora Overview for more information.