ALCF Systems

ALCF resources include leadership-class supercomputers, novel AI accelerators, visualization clusters, advanced data storage systems, high-performance networking capabilities, and a wide variety of software tools and services to help facility users achieve their science goals.

Supercomputing Resources

ALCF supercomputing resources support large-scale, computationally intensive projects aimed at solving some of the world’s most complex and challenging scientific problems.

Aurora
Purpose: Science Campaigns
Architecture: HPE Cray EX
Peak Performance: 2 EF
Processors per Node: 2 Intel Xeon CPU Max Series processors
GPUs per Node: 6 Intel Data Center GPU Max Series
Nodes: 10,624 (21,248 CPUs; 63,744 GPUs)
Cores: 9,264,128
Memory: 20.4 PB
Interconnect: HPE Slingshot 11 with Dragonfly configuration
Racks: 166

Polaris
Purpose: Science Campaigns
Architecture: HPE Apollo 6500 Gen10+
Peak Performance: 34 PF; 44 PF of Tensor Core FP64 performance
Processors per Node: 3rd Gen AMD EPYC
GPUs per Node: 4 NVIDIA A100 Tensor Core
Nodes: 560
Cores: 17,920
Memory: 280 TB (DDR4); 87.5 TB (HBM)
Interconnect: HPE Slingshot 11 with Dragonfly configuration
Racks: 40

Sophia
Purpose: Science Campaigns
Architecture: NVIDIA DGX A100
Peak Performance: 3.9 PF
Processors per Node: 2 AMD EPYC 7742
GPUs per Node: 8 NVIDIA A100 Tensor Core
Nodes: 24
Cores: 3,072
Memory: 26 TB (DDR4); 8.32 TB (GPU)
Interconnect: NVIDIA HDR InfiniBand
Racks: 7
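
As a quick consistency check, Aurora's CPU and GPU totals follow directly from the per-node counts listed above. The short Python sketch below simply reproduces that arithmetic from the table's figures; it is illustrative only.

    # Reproduce Aurora's device totals from the per-node figures in the table above.
    nodes = 10_624
    cpus_per_node = 2    # Intel Xeon CPU Max Series processors per node
    gpus_per_node = 6    # Intel Data Center GPU Max Series devices per node

    print(nodes * cpus_per_node)   # 21248 CPUs, matching the table
    print(nodes * gpus_per_node)   # 63744 GPUs, matching the table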

ALCF AI Testbed

The ALCF AI Testbed provides an infrastructure of next-generation AI-accelerator machines for research campaigns at the intersection of AI and science. AI testbeds include:

Cerebras CS-2
System Size: 2 nodes (each with a Wafer-Scale Engine), including MemoryX and SwarmX
Compute Units per Accelerator: 850,000 cores
Estimated Performance of a Single Accelerator: >5,780 TFlops (FP16)
Software Stack Support: Cerebras SDK, TensorFlow, PyTorch
Interconnect: Ethernet-based

SambaNova Cardinal SN30
System Size: 64 accelerators (8 nodes, 8 accelerators per node)
Compute Units per Accelerator: 1,280 Programmable Compute Units
Estimated Performance of a Single Accelerator: >660 TFlops (BF16)
Software Stack Support: SambaFlow, PyTorch
Interconnect: Ethernet-based

GroqRack
System Size: 72 accelerators (9 nodes, 8 accelerators per node)
Compute Units per Accelerator: 5,120 Vector ALUs
Estimated Performance of a Single Accelerator: >188 TFlops (FP16); >750 (INT8)
Software Stack Support: GroqWare SDK, ONNX
Interconnect: RealScale™

Graphcore Bow Pod-64
System Size: 64 accelerators (4 nodes, 16 accelerators per node)
Compute Units per Accelerator: 1,472 Independent Processing Units
Estimated Performance of a Single Accelerator: >250 TFlops (FP16)
Software Stack Support: PopART, TensorFlow, PyTorch, ONNX
Interconnect: IPU Link

Habana Gaudi-1
System Size: 16 accelerators (2 nodes, 8 accelerators per node)
Compute Units per Accelerator: 8 TPC + GEMM engine
Estimated Performance of a Single Accelerator: >150 TFlops (FP16)
Software Stack Support: SynapseAI, TensorFlow, PyTorch
Interconnect: Ethernet-based

Data Storage Systems

ALCF disk storage systems provide intermediate-term storage for users to access, analyze, and share computational and experimental data. Tape storage is used to archive data from completed projects.

Eagle
File System: Lustre
Storage System: HPE ClusterStor E1000
Usable Capacity: 100 PB
Sustained Data Transfer Rate: 650 GB/s
Disk Drives: 8,480

Grand
File System: Lustre
Storage System: HPE ClusterStor E1000
Usable Capacity: 100 PB
Sustained Data Transfer Rate: 650 GB/s
Disk Drives: 8,480

Swift
File System: Lustre
Storage System: All-NVMe Flash Storage Array
Usable Capacity: 123 TB
Sustained Data Transfer Rate: 48 GB/s
Disk Drives: 24

Tape Storage
Storage System: LTO6 and LTO8 Tape Technology
Usable Capacity: 300 PB

Networking

Networking is the fabric that ties all of the ALCF’s computing systems together. InfiniBand enables communication between system I/O nodes and the ALCF’s various storage systems. The Production HPC SAN is built upon NVIDIA Mellanox High Data Rate (HDR) InfiniBand hardware. Two 800-port core switches provide the backbone links between 80 edge switches, yielding 1600 total available host ports, each at 200 Gbps, in a non-blocking fat-tree topology. The full bisection bandwidth of this fabric is 320 Tbps. The HPC SAN is maintained by the NVIDIA Mellanox Unified Fabric Manager (UFM), providing Adaptive Routing to avoid congestion, as well as the NVIDIA Mellanox Self-Healing Interconnect Enhancement for InteLligent Datacenters (SHIELD) resiliency system for link fault detection and recovery.
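
As a rough consistency check of these figures, the sketch below multiplies the stated host-port count by the HDR per-port line rate; it merely reproduces the 320 Tbps figure quoted above (all values are taken from the text, none are measured).

    # Aggregate host-port bandwidth of the Production HPC SAN, from the figures above.
    host_ports = 1600        # total host ports across the 80 edge switches
    port_rate_gbps = 200     # HDR InfiniBand line rate per host port (Gbps)

    aggregate_tbps = host_ports * port_rate_gbps / 1000
    print(f"{aggregate_tbps:.0f} Tbps")   # 320 Tbps, matching the stated full bisection bandwidth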

When external communications are required, Ethernet is the interconnect of choice. Remote user access, systems maintenance and management, and high-performance data transfers are all enabled by the Local Area Network (LAN) and Wide Area Network (WAN) Ethernet infrastructure.

This connectivity is built upon a combination of Extreme Networks SLX and MLXe routers and NVIDIA Mellanox Ethernet switches. ALCF systems connect to other research institutions over multiple 100 Gbps connections that link to many high-performance research networks, including regional networks like the Metropolitan Research and Education Network (MREN), as well as national and international networks like the Energy Sciences Network (ESnet) and Internet2.

Joint Laboratory for System Evaluation

Argonne’s Joint Laboratory for System Evaluation (JLSE) provides access to leading-edge testbeds for research aimed at evaluating future extreme-scale computing systems, technologies, and capabilities. The following is a partial listing of the novel technologies that make up the JLSE.