The oneAPI Collective Communications Library (oneCCL) provides an efficient implementation of the communication patterns used in deep learning. oneCCL is governed by the UXL Foundation and is an implementation of the oneAPI specification.

oneCCL 2021.14 is the version currently available to users through the Aurora compute image.
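To use the system-provided oneCCL, load the frameworks module, which sets `CCL_ROOT`. A quick way to confirm which installation your environment resolves to (the listed contents are illustrative):

```bash
# Load the Aurora frameworks module, which provides the system oneCCL,
# then confirm which installation the environment points at.
module load frameworks
echo $CCL_ROOT     # root of the oneCCL installation
ls $CCL_ROOT/lib   # contains libccl.so
```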
## oneCCL environment variables
We have identified a set of environment settings that typically provide better performance or address potential application hangs and crashes at large scale. This setup is still experimental and may change as the settings are refined; users are encouraged to check this page regularly.
```bash
export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE_SCALEOUT="direct:0-1048576;rabenseifner:1048577-max"  # currently best allreduce algorithm at large scale
export CCL_BCAST=double_tree  # currently best bcast algorithm at large scale
export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600
export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024
export CCL_KVS_USE_MPI_RANKS=1
export MPI_PROVIDER=$FI_PROVIDER

unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
```
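These variables can be exported in the job script, as above, or forwarded per launch through `mpiexec --env` (as the benchmark script below does). A minimal sketch, with `./my_app` and the rank counts as placeholders:

```bash
# Forward selected oneCCL settings for a single launch instead of
# exporting them globally; values match the recommended set above.
mpiexec --env CCL_PROCESS_LAUNCHER=pmix --env CCL_ATL_TRANSPORT=mpi \
        --env CCL_KVS_MODE=mpi \
        --np ${NRANKS} -ppn ${RANKS_PER_NODE} ./my_app
```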
The following additional environment variable settings may be application-dependent. Users are encouraged to try them and see whether they help their applications.
```bash
ulimit -c unlimited

export FI_MR_ZE_CACHE_MONITOR_ENABLED=0
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_OFLOW_BUF_SIZE=8388608
export FI_CXI_DEFAULT_CQ_SIZE=1048576
export FI_CXI_CQ_FILL_PERCENT=30
export INTELGT_AUTO_ATTACH_DISABLE=1
export PALS_PING_PERIOD=240
export PALS_RPC_TIMEOUT=240
export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1  # to solve the sync send issue in Horovod seg fault
export CCL_ATL_SYNC_COLL=1  # to avoid potential hang at large scale
export CCL_OP_SYNC=1  # to avoid potential hang at large scale
```
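Because these settings are application-dependent, it helps to confirm what oneCCL actually picked up at runtime. Raising the oneCCL log level (also used in the benchmark script below) makes the library report its effective configuration at startup; a small sketch with `./my_app` as a placeholder:

```bash
# Ask oneCCL to log its effective settings at startup so you can verify
# that the exported variables were honored.
export CCL_LOG_LEVEL=info
mpiexec --np 2 -ppn 2 ./my_app 2>&1 | grep -i ccl
```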
You can compile the examples from the oneCCL Git repository and use the library from the system default instead of a local build. More information is available in the oneCCL Benchmark User Guide.
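A minimal build sketch for the benchmark, assuming the oneAPI compilers from the frameworks module; the repository URL and CMake flags follow the oneCCL README and may change between releases:

```bash
# Clone and build oneCCL with the SYCL (dpcpp) backend; by default the
# benchmark binary is installed under build/_install/examples/benchmark/.
git clone https://github.com/uxlfoundation/oneCCL.git
cd oneCCL
mkdir build && cd build
cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp
make -j install
```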
```bash
#!/bin/bash -x
# qsub -l nodes=2:ncpus=208 -q debug -l walltime=02:00:00 -l filesystems=home:flare -A <Project Name> ./pbs_job_
#PBS -A <ProjectName>
#PBS -k doe

module load frameworks
cd $PBS_O_WORKDIR
echo Jobid: $PBS_JOBID
echo Running on nodes `cat $PBS_NODEFILE`

NNODES=`wc -l < $PBS_NODEFILE`
RANKS_PER_NODE=12  # Number of MPI ranks per node
NRANKS=$(( NNODES * RANKS_PER_NODE ))
echo "NUM_OF_NODES=${NNODES} TOTAL_NUM_RANKS=${NRANKS} RANKS_PER_NODE=${RANKS_PER_NODE}"

## Option 1
export CPU_BINDING1="list:4:9:14:19:20:25:56:61:66:71:74:79"  # 12 ppn to 12 cores
## Option 2
export CPU_BINDING2="list:4-7:8-11:12-15:16-19:20-23:24-27:56-59:60-63:64-67:68-71:72-75:76-79"  # 12 ppn with each rank having 4 cores

## Option 1 for oneCCL worker affinity
export CCL_WORKER_AFFINITY=42,43,44,45,46,47,94,95,96,97,98,99
## Option 2
unset CCL_WORKER_AFFINITY  # Default will pick up from the last 24 cores even if you didn't specify these in the binding.

EXT_ENV="--env FI_CXI_DEFAULT_CQ_SIZE=1048576"
APP1=/lus/flare/projects/Aurora_deployment/kaushik/all_reduce_frameworks/gitrepos/oneCCL/build/_install/examples/benchmark/benchmark

echo $CCL_ROOT
export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH
export CPATH=$CCL_ROOT/include:$CPATH
export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH

export CCL_PROCESS_LAUNCHER=pmix
export CCL_ATL_TRANSPORT=mpi
export CCL_ALLREDUCE=topo
export CCL_ALLREDUCE_SCALEOUT=rabenseifner
export CCL_KVS_MODE=mpi
export CCL_CONFIGURATION_PATH=""
export CCL_CONFIGURATION=cpu_gpu_dpcpp
export CCL_KVS_CONNECTION_TIMEOUT=600

which python
mkdir -p ./out_${PBS_JOBID}/c_oneccl_gpu

for NNODES in 4 8 16 32 64
do
    RANKS_PER_NODE=12  # Number of MPI ranks per node
    NRANKS=$(( NNODES * RANKS_PER_NODE ))
    for BUF_SIZE in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 67108864 134217728 268435456
    do
        date
        mpiexec ${EXT_ENV} --env CCL_LOG_LEVEL=info --env CCL_PROCESS_LAUNCHER=pmix --env CCL_ATL_TRANSPORT=mpi \
            --np ${NRANKS} -ppn ${RANKS_PER_NODE} --cpu-bind $CPU_BINDING1 $APP1 \
            --elem_counts ${BUF_SIZE},${BUF_SIZE},${BUF_SIZE} \
            --coll allreduce -j off -i 1 -w 0 --backend sycl --sycl_dev_type gpu > ./out_${PBS_JOBID}/c_oneccl_gpu/${PBS_JOBID}_${NNODES}_${NRANKS}_${RANKS_PER_NODE}_${BUF_SIZE}_sycl_ccl_gpu_out_w1.txt
        date
        echo ${BUF_SIZE}
    done
done

# For CPU only, change benchmark options to: --backend host --sycl_dev_type host
```
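Following the comment at the end of the script, a CPU-only run of the same benchmark changes only the backend options; a sketch of the modified launch line:

```bash
# CPU-only variant of the launch line above: same binding and rank layout,
# but the benchmark runs on the host backend instead of the GPU.
mpiexec ${EXT_ENV} --np ${NRANKS} -ppn ${RANKS_PER_NODE} --cpu-bind $CPU_BINDING1 $APP1 \
    --elem_counts ${BUF_SIZE},${BUF_SIZE},${BUF_SIZE} \
    --coll allreduce -j off -i 1 -w 0 --backend host --sycl_dev_type host
```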
In the CPU binding lists above we provide two options. The first assigns one CPU core per rank; the second assigns four CPU cores per rank. The first oneCCL worker affinity option pins 12 CPU cores, one per rank. Note that these cores are picked from the last 12 cores of each socket (CPU), in line with oneCCL's default core-picking strategy: cores 42-47 belong to the first socket and cores 94-99 to the second. A few cores are left free in case the user wants to run other services, such as Copper or DAOS, alongside the application. The second oneCCL option delegates core selection to the system; in that case, the user should not declare or export the CCL_WORKER_AFFINITY variable.
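To sanity-check a binding list before a long run, you can print each rank's allowed cores. A minimal sketch, assuming `taskset` is available on the compute nodes and that PALS exports `PALS_RANKID`:

```bash
# Print each rank's CPU affinity mask so you can confirm that the chosen
# binding list places ranks on the intended cores.
mpiexec --np ${NRANKS} -ppn ${RANKS_PER_NODE} --cpu-bind $CPU_BINDING1 \
    bash -c 'echo "rank ${PALS_RANKID}: $(taskset -cp $$)"'
```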