ALCF Continues to Expand AI Testbed Systems Deployed for Open Science

In 2024, the ALCF AI Testbed initiated further upgrades with the deployment of SambaNova Suite and established benchmarks for its optimized Cerebras Wafer-Scale Cluster WSE-2 that promise to make extreme-scale AI computing substantially more manageable and effective.

The testbed is a growing collection of some of the world’s most advanced AI accelerators available for open science. Designed to enable researchers to explore next-generation machine learning applications and workloads to advance AI for science, the systems are also helping the facility to gain a better understanding of how novel AI technologies can be integrated with traditional supercomputing systems powered by CPUs and GPUs. With the AI Testbed, the ALCF user community can leverage novel AI technologies for innovative research projects involving large language models (LLMs), large-scale data analysis, and the development of trustworthy AI.

The ALCF AI Testbed includes systems from Cerebras, Graphcore, Groq, and SambaNova.

The testbed’s AI accelerators are equipped with unique hardware and software features to efficiently handle a variety of AI tasks, including:

AI Model Training: Using large datasets to “teach” an AI model to detect patterns and make accurate, trustworthy predictions.
Inference: Employing a trained AI model to make predictions on new data.
Large Language Models (LLMs): AI models that are trained on large amounts of text data to understand, generate, and predict text-based content.
Computer Vision Models: AI models that are trained to understand and analyze visual data for tasks such as image classification and object recognition.
Foundation Models: Similar to LLMs, these AI models are trained on diverse datasets to perform a broad set of processing tasks. Foundation models, however, can serve as a starting point for developing more specialized AI models for specific domains or applications.

These methods are powerful tools for speeding up scientific progress. Computer vision models can help scientists automate the analysis of images generated by microscopes, x-ray light sources, and other imaging techniques. LLMs, on the other hand, are helping researchers to sift through massive amounts of published scientific data quickly to identify promising materials for medicines, batteries, and other uses.

Experimental data analysis also benefits from the lab’s AI Testbed. Researchers from Argonne’s Advanced Photon Source (APS) are exploring how different accelerators can enable fast, scalable AI model training and inference to accelerate the analysis of x-ray imaging data. Rapid data analysis methods are becoming increasingly important for the APS and other experimental facilities as data generation rates continue to grow.

Argonne's Rick Stevens discusses how modern science requires access to powerful AI systems for inference to drive scientific discovery.

ALCF's Venkat Vishwanath discusses how AI inference is becoming a large and critical workload for advancing science.

SambaNova

The ALCF AI Testbed began further expansion of its SambaNova platform with the addition of SambaNova Suite. Powered by SambaNova DataScale SN40L systems under installation, SambaNova Suite is a fully integrated hardware-software platform that enables users to train, fine-tune, and deploy AI workloads. Optimized for low-latency, high-throughput inference, the platform provides scientists with a new AI resource to accelerate scientific research.

The deployment of the DataScale SN40L system will extend advanced AI inference capabilities beyond the ALCF’s traditional user base. By making trained AI models more accessible, the platform aims to enable a wider community of researchers to explore new directions in generative and agentic AI workloads for science and engineering.

Being able to rapidly evaluate AI models and adjust parameters for improved performance is crucial for driving progress in AI-driven science across many research areas, including drug discovery, materials identification, and brain mapping.

The ALCF’s platform contains sixteen of SambaNova’s Reconfigurable Dataflow Units (RDUs). The system’s capabilities support the development of large foundation models like Argonne’s AuroraGPT, which is being built to enable autonomous scientific exploration across disciplines, including biology, chemistry, and materials science. AuroraGPT is being trained on Argonne’s Aurora exascale system.

The ability to switch between different AI models instantly and fine-tune them using domain-specific datasets can help streamline the process of testing and validating their performance.

The system also gives the lab a new platform to continue its explorations into energy-efficient technologies for next-generation supercomputers and data centers. One of the aims of the ALCF AI Testbed is to determine how novel AI accelerators like the SN40L can be integrated with future supercomputers to enhance energy efficiency.

Cerebras

The ALCF’s existing Cerebras CS-2 system was previously upgraded to a Wafer-Scale Cluster WSE-2 that includes two CS-2 engines, enabling near-perfect linear scaling of large language models (LLMs). This capability helps make extreme-scale AI substantially more manageable.
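One common way to put a number on "near-perfect linear scaling" is scaling efficiency: the measured speedup on several engines divided by the ideal speedup, which equals the number of engines. The sketch below is a minimal illustration only. The function name and the throughput figures are hypothetical and are not ALCF measurements.

```python
def scaling_efficiency(throughput_single, throughput_multi, num_engines):
    """Return measured speedup divided by the ideal (linear) speedup."""
    speedup = throughput_multi / throughput_single
    return speedup / num_engines

# Hypothetical throughputs in samples per second, not measured values.
efficiency = scaling_efficiency(
    throughput_single=1000.0,
    throughput_multi=1960.0,
    num_engines=2,
)
```

An efficiency of 1.0 would mean perfectly linear scaling, and values slightly below 1.0 correspond to near-linear scaling.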

An Argonne-led research team examined the feasibility of performing continuous-energy Monte Carlo (MC) particle transport on the Cerebras WSE-2. Such simulations have the potential to fill in crucial gaps in experimental and operational nuclear reactor data.

The researchers ported a key kernel from the MC transport algorithm to the Cerebras Software Language programming model and evaluated the performance of the kernel on the Cerebras WSE-2. The team developed an architecture-specific optimization to leverage the capabilities of the WSE-2 and a highly optimized CUDA kernel for testing on a conventional graphics processing unit (GPU), which served as a baseline to contextualize the WSE-2’s performance.
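The article does not say which kernel was ported. As a rough illustration of the kind of lookup-heavy work that appears in continuous-energy MC transport, the sketch below computes a macroscopic cross section by binary-searching each nuclide's energy grid and interpolating linearly. The function names, the two-nuclide material, and all values are hypothetical, and the code is plain Python rather than Cerebras Software Language or CUDA.

```python
from bisect import bisect_right

def micro_xs(energy, grid, xs):
    """Linearly interpolate a microscopic cross section on a sorted energy grid."""
    # Clamp to the grid bounds so out-of-range energies use the end values.
    if energy <= grid[0]:
        return xs[0]
    if energy >= grid[-1]:
        return xs[-1]
    # Binary search for the interval [grid[i], grid[i + 1]) containing energy.
    i = bisect_right(grid, energy) - 1
    frac = (energy - grid[i]) / (grid[i + 1] - grid[i])
    return xs[i] + frac * (xs[i + 1] - xs[i])

def macro_xs(energy, material):
    """Sum density-weighted microscopic cross sections over all nuclides."""
    return sum(density * micro_xs(energy, grid, xs)
               for density, grid, xs in material)

# Hypothetical material: (number density, energy grid, cross sections) per nuclide.
material = [
    (2.0, [1.0, 2.0, 4.0], [10.0, 20.0, 40.0]),
    (0.5, [1.0, 3.0, 5.0], [8.0, 4.0, 2.0]),
]
total = macro_xs(2.0, material)
```

Because each nuclide carries its own energy grid, every lookup is an independent search with data-dependent memory access, which is one reason such kernels are generally regarded as irregular workloads.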

A single WSE-2 was found to run the kernel 130 times faster than the highly optimized CUDA version on the conventional GPU used for comparison, significantly outpacing the performance increase expected from the difference in transistor counts between the two architectures. In a follow-up study, the WSE-2 achieved a 182x speedup over the GPU.

The team’s analysis suggests the potential for a wide variety of complex and irregular simulation methods to be mapped efficiently onto AI accelerators like the Cerebras WSE-2, providing users with an invaluable tool for effective scientific discovery.
