

# Developing Custom Applications at Wafer-Scale

**Leighton Wilson** 

leighton.wilson@cerebras.net

**SC24** 

© 2024 Cerebras Systems Inc. All Rights Reserved



## Cerebras Wafer-Scale Engine (WSE-3)

The Largest Chip in the World

- 900,000 cores optimized for sparse linear algebra
- 46,225 mm<sup>2</sup> silicon
- 4.0 trillion transistors
- 44 Gigabytes of on-chip memory
- 21 PByte/s memory bandwidth
- 214 Pbit/s fabric bandwidth
- 5 nm process technology

#### **Cluster-scale acceleration on a single chip**
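To put the headline figures in per-core terms, the aggregate numbers can be divided by the core count. The sketch below is a back-of-envelope calculation only: it assumes decimal units (1 GB = 10^9 bytes) and an even split across exactly 900,000 cores, neither of which the slide states. Under those assumptions the memory works out to roughly 48.9 kB per core, in line with the 48 kB of SRAM per PE quoted in the architecture overview.

```python
# Headline WSE-3 figures quoted on the slide.
CORES = 900_000
ON_CHIP_MEMORY_BYTES = 44e9        # 44 GB (decimal gigabytes assumed)
MEMORY_BANDWIDTH_BYTES_S = 21e15   # 21 PByte/s
FABRIC_BANDWIDTH_BITS_S = 214e15   # 214 Pbit/s


def per_core(total: float, cores: int = CORES) -> float:
    """Spread an aggregate figure evenly across all cores (an assumption)."""
    return total / cores


memory_per_core_kb = per_core(ON_CHIP_MEMORY_BYTES) / 1e3
mem_bw_per_core_gb_s = per_core(MEMORY_BANDWIDTH_BYTES_S) / 1e9
fabric_bw_per_core_gbit_s = per_core(FABRIC_BANDWIDTH_BITS_S) / 1e9

print(f"memory per core:            {memory_per_core_kb:.1f} kB")
print(f"memory bandwidth per core:  {mem_bw_per_core_gb_s:.1f} GB/s")
print(f"fabric bandwidth per core:  {fabric_bw_per_core_gbit_s:.1f} Gbit/s")
```

These are averages derived from rounded marketing figures, so they indicate scale rather than exact per-PE specifications.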



## Cerebras CS System

### The world's most powerful AI and HPC accelerator

- Powered by WSE
- Installs and deploys easily into a standard rack
- Programmable via our SDK or PyTorch





## **CS Architecture Basics**



Logical 2D array of individually programmable Processing Elements

#### **Flexible compute**

- ~900,000 general-purpose CPUs
- 16- and 32-bit native FP and integer data types
- **Dataflow programming**: Tasks are activated or triggered by the arrival of data packets

#### **Flexible communication**

- Programmable router
- Static or dynamic routes (colors)
- Data packets (wavelets) passed between PEs
- Single-cycle PE-to-PE communication

#### **Fast memory**

- 48 kB SRAM per PE for data and instructions
- 1 cycle read/write
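The dataflow idea above (tasks bound to colors and activated when a wavelet arrives) can be illustrated with a toy model. The sketch below is plain Python, not CSL and not the Cerebras SDK API: the `PE` and `Fabric` classes, the `bind`, `route`, `inject`, and `send` methods, and the `RED` color are all hypothetical names invented for this illustration. It models a row of four PEs that each add a locally stored value to an incoming partial sum and forward the result east.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Color = int
Coord = Tuple[int, int]


class PE:
    """Toy processing element: private memory plus tasks bound to colors."""

    def __init__(self, coord: Coord) -> None:
        self.coord = coord
        self.memory: Dict[str, float] = {}
        self.tasks: Dict[Color, Callable] = {}

    def bind(self, color: Color, task: Callable) -> None:
        """Register the task that runs when a wavelet of this color arrives."""
        self.tasks[color] = task


class Fabric:
    """Toy fabric: static routes keyed by (source PE, color)."""

    def __init__(self, width: int, height: int) -> None:
        self.pes = {(x, y): PE((x, y)) for x in range(width) for y in range(height)}
        self.routes: Dict[Tuple[Coord, Color], Coord] = {}
        self.in_flight: Deque[Tuple[Coord, Color, float]] = deque()

    def route(self, color: Color, src: Coord, dst: Coord) -> None:
        self.routes[(src, color)] = dst

    def inject(self, dst: Coord, color: Color, payload: float) -> None:
        """Place a wavelet directly at a PE, e.g. from the host."""
        self.in_flight.append((dst, color, payload))

    def send(self, src: Coord, color: Color, payload: float) -> None:
        """Send a wavelet from a PE along the static route for its color."""
        self.in_flight.append((self.routes[(src, color)], color, payload))

    def run(self) -> int:
        """Deliver wavelets until none remain; each arrival activates a task."""
        delivered = 0
        while self.in_flight:
            dst, color, payload = self.in_flight.popleft()
            pe = self.pes[dst]
            pe.tasks[color](self, pe, payload)
            delivered += 1
        return delivered


RED: Color = 0
WIDTH = 4
fabric = Fabric(WIDTH, 1)
local_values = [1.0, 2.0, 3.0, 4.0]


def make_task(is_last: bool) -> Callable:
    def accumulate(fab: Fabric, pe: PE, partial: float) -> None:
        total = partial + pe.memory["x"]
        if is_last:
            pe.memory["result"] = total      # end of the chain: store the sum
        else:
            fab.send(pe.coord, RED, total)   # forward the partial sum east
    return accumulate


for x in range(WIDTH):
    pe = fabric.pes[(x, 0)]
    pe.memory["x"] = local_values[x]
    pe.bind(RED, make_task(is_last=(x == WIDTH - 1)))
    if x < WIDTH - 1:
        fabric.route(RED, (x, 0), (x + 1, 0))

fabric.inject((0, 0), RED, 0.0)   # seed wavelet starts the chain
delivered = fabric.run()
print(fabric.pes[(WIDTH - 1, 0)].memory["result"], delivered)
```

No PE polls or waits in this model; computation happens only when data arrives, which is the property the slide calls dataflow programming. Timing, router hardware, and memory limits are deliberately not modeled.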



### Cerebras Supports Two Programming Paradigms

**For AI Users,** the Cerebras ML stack provides **familiar, high-level** programmability with popular ML frameworks and compatibility with third-party model repos and MLOps tools.

**For HPC Users,** the Cerebras SDK provides **flexible, lower-level** programmability and access to hardware performance features.



Cerebras SDK & CSL



## **Cerebras SDK**

A general-purpose parallel-computing platform and API allowing software developers to write custom programs ("kernels") for Cerebras systems.





## **SDK Example Programs Available**

**Repository:** <u>github.com/Cerebras/csl-examples</u>

- Introductory Tutorials
- GEMV
- GEMM
- Cholesky Decomposition
- 1D and 2D FFT
- 7-Point Stencil SpMV
- Power Method
- Conjugate Gradient
- Preconditioned Conjugate Gradient
- Finite Difference Stencil Computations
- Mandelbrot Set Generator
- Shift-Add Multiplication
- Hypersparse SpMV
- Histogram Computation
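As a rough idea of how a kernel such as GEMV might be spread over a 2D grid of PEs, the sketch below tiles the matrix so that each PE owns one block of `A` and the matching slice of `x`, computes a local partial product, and then contributes to a sum along its PE row. This is not the code from the csl-examples repository: it is plain Python, the function names `gemv_reference` and `gemv_tiled` are hypothetical, and the block-tiling scheme is an assumption chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gemv_reference(A: Matrix, x: Vector) -> Vector:
    """Plain y = A x, used to check the tiled version."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def gemv_tiled(A: Matrix, x: Vector, pe_rows: int, pe_cols: int) -> Vector:
    """y = A x with A split into pe_rows x pe_cols tiles, one per PE."""
    m, n = len(A), len(A[0])
    assert m % pe_rows == 0 and n % pe_cols == 0, "tiles must divide evenly"
    tile_m, tile_n = m // pe_rows, n // pe_cols

    y = [0.0] * m
    for pr in range(pe_rows):
        for pc in range(pe_cols):
            r0, c0 = pr * tile_m, pc * tile_n
            # Local work on PE (pr, pc): its tile of A times its slice of x.
            partial = [
                sum(A[r0 + i][c0 + j] * x[c0 + j] for j in range(tile_n))
                for i in range(tile_m)
            ]
            # Reduction along the PE row: partial results are summed into y.
            for i in range(tile_m):
                y[r0 + i] += partial[i]
    return y


A = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
x = [1, 0, -1, 2]

y_tiled = gemv_tiled(A, x, pe_rows=2, pe_cols=2)
y_ref = gemv_reference(A, x)
print(y_tiled, y_ref)
```

On real hardware the inner loops would run concurrently on separate PEs and the row-wise sum would travel over the fabric; here they run sequentially so the decomposition can be checked against the reference.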



## **SDK Usage and Impact**

Over the past year, the SDK has evolved from a closed tool requiring NDA access to a public platform for Wafer-Scale Computing. We're supporting more research and publications than ever.

Publications featured on this slide include:

- Scaling the "Memory Wall" for Multi-Dimensional Seismic Processing with Algebraic Compression on Cerebras CS-2
- Trackable Agent-based Evolution Models at Wafer Scale
- Near-Optimal Wafer-Scale Reduce
- Communication Collectives for the Cerebras Wafer-Scale Engine
- Matrix-Free Finite-Volume Kernels on a Dataflow Architecture
- Automated Code Generation of High-Order Stencils for a Dataflow Architecture
- CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2



## **SDK Access**

Get local access to the SDK simulator!

• Email <u>developer@cerebras.net</u> for access

Join the Cerebras Developer Community

• Forums at <u>discourse.cerebras.net</u>

View our public SDK examples GitHub repository

• See github.com/Cerebras/csl-examples

Partner systems at ANL, EPCC, PSC, LRZ, ...

Questions? <u>leighton.wilson@cerebras.net</u>






cerebras.net/developers/sdk-request

