Example Programs and Makefiles for Polaris
Several simple examples of building CPU and GPU-enabled codes on Polaris are available in the ALCF GettingStarted repo for several programming models. If building your application is problematic for some reason (e.g., the absence of a GPU), then users are encouraged to build and test applications directly on one of the Polaris compute nodes via an interactive job. The discussion below uses the NVHPC compilers in the default environment as illustrative examples. Similar examples for other compilers on Polaris are available in the ALCF GettingStarted repo.
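For example, a single-node interactive job can be requested from a login node along the following lines, where the queue and project names are placeholders that depend on your allocation.

# Request one Polaris node interactively for 30 minutes (queue/project are placeholders)
qsub -I -l select=1:system=polaris -l walltime=0:30:00 -l filesystems=home -q <queue> -A <project>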
CPU MPI+OpenMP Example
One of the first useful tasks with any new machine, scheduler, and job launcher is to ensure that MPI ranks and OpenMP threads are being bound to the host CPU as intended. A simple HelloWorld MPI+OpenMP example is available here to get started with.
The application can be straightforwardly compiled using the Cray compiler wrappers.
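As a sketch, assuming the source is in a single C++ file main.cpp (an illustrative name) and the default NVHPC environment, a compile line with the C++ wrapper might look like the following; the -mp flag enables host OpenMP with the nvhpc compilers.

CC -mp -O2 -o hello_affinity main.cpp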
The executable hello_affinity can then be launched in a job script (or directly in the shell of an interactive job) using mpiexec as discussed here.
#!/bin/sh
#PBS -l select=1:system=polaris
#PBS -l place=scatter
#PBS -l walltime=0:30:00
#PBS -l filesystems=home
# MPI example w/ 16 MPI ranks per node spread evenly across cores
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=16
NDEPTH=4
NTHREADS=1
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))
echo "NUM_OF_NODES= ${NNODES} TOTAL_NUM_RANKS= ${NTOTRANKS} RANKS_PER_NODE= ${NRANKS_PER_NODE} THREADS_PER_RANK= ${NTHREADS}"
# Change to the directory from which the job was submitted
cd ${PBS_O_WORKDIR}

mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth ./hello_affinity
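Assuming the script above is saved as submit.sh (an illustrative name), it can be submitted from a login node with qsub; depending on your allocation, a queue and project may also need to be specified with -q and -A.

qsub submit.sh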
CUDA
Several variants of C/C++ and Fortran CUDA examples, including MPI and multi-GPU examples, are available here.
One can use the Cray compiler wrappers to compile GPU-enabled applications as well. This example of simple vector addition uses the NVIDIA compilers. The craype-accel-nvidia80 module in the default environment adds the -gpu compiler flag for the nvhpc compilers along with the appropriate include directories and libraries. It is left to the user to provide an additional flag to the nvhpc compilers to select the target GPU programming model; in this case, -cuda is used to indicate compilation of CUDA code.
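For example, assuming the CUDA kernels live in a single C++ source file main.cpp (an illustrative name), a compile line with the Cray C++ wrapper might look like the following.

CC -cuda -O2 -o vecadd main.cpp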
The application can then be launched within a batch job submission script or as follows on one of the compute nodes.
$ ./vecadd
# of devices= 4
[0] Platform[ Nvidia ] Type[ GPU ] Device[ NVIDIA A100-SXM4-40GB ]
[1] Platform[ Nvidia ] Type[ GPU ] Device[ NVIDIA A100-SXM4-40GB ]
[2] Platform[ Nvidia ] Type[ GPU ] Device[ NVIDIA A100-SXM4-40GB ]
[3] Platform[ Nvidia ] Type[ GPU ] Device[ NVIDIA A100-SXM4-40GB ]
Running on GPU 0!
Using single-precision
Name= NVIDIA A100-SXM4-40GB
Locally unique identifier=
Clock Frequency(KHz)= 1410000
Compute Mode= 0
Major compute capability= 8
Minor compute capability= 0
Number of multiprocessors on device= 108
Warp size in threads= 32
Single precision performance ratio= 2
Result is CORRECT!! :)
GPU OpenACC
A simple MPI-parallel OpenACC example is available here. Compilation proceeds similarly to the above CUDA example except for the use of the -acc=gpu compiler flag to indicate compilation of OpenACC code for GPUs.
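For example, assuming an MPI-parallel OpenACC code in a single C source file main.c (an illustrative name), a compile line might look like the following.

cc -acc=gpu -O2 -o vecadd main.c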
$ mpiexec -n 4 ./vecadd
# of devices= 4
Using single-precision
Rank 0 running on GPU 0!
Rank 1 running on GPU 1!
Rank 2 running on GPU 2!
Rank 3 running on GPU 3!
Result is CORRECT!! :)
In this example, each MPI rank selects its GPU within the application code. An alternative is to set the environment variable CUDA_VISIBLE_DEVICES appropriately for each MPI rank. One example is available here where each MPI rank is similarly bound to a single GPU with round-robin assignment. The binding of MPI ranks to GPUs is discussed in more detail here.
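A minimal sketch of such a wrapper script is shown below. It assumes the launcher on Polaris sets PALS_LOCAL_RANKID and PMI_RANK in each rank's environment; the script name set_affinity_gpu.sh and its use as mpiexec ... ./set_affinity_gpu.sh ./vecadd are illustrative.

#!/bin/bash
# Illustrative wrapper: bind each MPI rank to a single GPU, round-robin by node-local rank
num_gpus=$(nvidia-smi -L | wc -l)
# PALS_LOCAL_RANKID is assumed to hold the node-local rank ID set by the launcher
gpu=$(( ${PALS_LOCAL_RANKID} % ${num_gpus} ))
export CUDA_VISIBLE_DEVICES=${gpu}
echo "RANK= ${PMI_RANK} LOCAL_RANK= ${PALS_LOCAL_RANKID} GPU= ${gpu}"
exec "$@"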
GPU OpenCL
A simple OpenCL example is available here. The OpenCL headers and library are available in the NVHPC SDK and CUDA toolkits. The environment variable NVIDIA_PATH is defined for the PrgEnv-nvhpc programming environment and can be used to locate them, as in the following link line.
CC -o vecadd -g -O3 -std=c++0x -I${NVIDIA_PATH}/cuda/include main.o -L${NVIDIA_PATH}/cuda/lib64 -lOpenCL
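The object file in the link line above could be produced with a compile step along these lines (a sketch; consult the example's Makefile for the exact source file name and flags).

CC -c -g -O3 -std=c++0x -I${NVIDIA_PATH}/cuda/include main.cpp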
This simple example can be run on a Polaris compute node as follows.
$ ./vecadd
Running on GPU!
Using single-precision
CL_DEVICE_NAME: NVIDIA A100-SXM4-40GB
CL_DEVICE_VERSION: OpenCL 3.0 CUDA
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2
CL_DEVICE_MAX_COMPUTE_UNITS: 108
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1410
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
Result is CORRECT!! :)
GPU OpenMP
A simple MPI-parallel OpenMP example is available here. Compilation proceeds similarly to the above examples except for the use of the -mp=gpu compiler flag to indicate compilation of OpenMP code for GPUs.
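A compile line analogous to the OpenACC case above, again assuming an illustrative main.c, might look like the following.

cc -mp=gpu -O2 -o vecadd main.c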
Similar to the OpenACC example above, this code binds MPI ranks to GPUs in a round-robin fashion.