DAOS Architecture
DAOS is a major file system on Aurora, with 230 PB of capacity delivering up to >30 TB/s across 1024 DAOS server storage nodes. DAOS is an open-source, software-defined object store designed for massively distributed Non-Volatile Memory (NVM) and NVMe SSDs. DAOS presents a unified storage model with a native key-array-value storage interface supporting POSIX, MPI-IO, DFS, and HDF5. Users can use DAOS for their I/O and checkpointing on Aurora. DAOS is fully integrated with the wider Aurora compute fabric, as can be seen in the overall storage architecture below.
DAOS Overview
The first step in using DAOS is to get DAOS pool space allocated for your project. Users should submit a request, as noted below, to have a DAOS pool created for their project.
DAOS Pool Allocation
A DAOS pool is a physically allocated, dedicated storage space for your project.
Email support@alcf.anl.gov to request a DAOS pool, providing the following information:
- Project Name
- ALCF user names
- Total space requested (typically 100 TB or more)
- Justification
- Preferred pool name
Note
This is an initial test DAOS configuration, and as such, any data on the DAOS system will eventually be deleted when the configuration is changed into a larger system. Warning will be given before the system is wiped to allow users time to move any important data off.
Modules
Please load the daos module when using DAOS. This should be done on the login node (UAN) or on the compute node (in the job script):
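module use /soft/modulefiles
module load daos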
Pool
A pool is a dedicated space allocated to your project. Once your pool has been allocated, confirm you are able to query it via:
daos pool query hacc
Example output:
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=640, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
Total size: 6.0 TB
Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
Total size: 200 TB
Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
DAOS Container
The container is the basic unit of storage. A POSIX container can hold hundreds of millions of files, and you can use it to store all of your data. You only need a small set of containers; perhaps just one per major unit of project work is sufficient.
There are three modes in which to operate with DAOS containers:
1. POSIX container, POSIX mode
2. POSIX container, MPI-IO mode
3. DFS container through the DAOS APIs
Create a POSIX container
$ DAOS_POOL=datascience
$ DAOS_CONT=LLM-GPT-1T
$ daos container create --type POSIX ${DAOS_POOL} ${DAOS_CONT} --properties rd_fac:1
Container UUID : 59747044-016b-41be-bb2b-22693333a380
Container Label: LLM-GPT-1T
Container Type : POSIX
Successfully created container 59747044-016b-41be-bb2b-22693333a380
If you prefer higher data protection and recovery, you can use --properties rd_fac:2; if you don't need data protection and recovery, you can remove --properties rd_fac:1. We recommend using at least --properties rd_fac:1.
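For example, a sketch of the same create command with two levels of redundancy (pool and container labels as above):
daos container create --type POSIX ${DAOS_POOL} ${DAOS_CONT} --properties rd_fac:2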
DAOS sanity checks
If any of the following commands results in an error, DAOS is currently down. The 'Out of group or member list' error is an exception and can be safely ignored; this error message will be fixed in the next DAOS release.
module use /soft/modulefiles
module load daos/base
env | grep DRPC
ps -ef | grep daos
clush --hostfile ${PBS_NODEFILE} 'ps -ef | grep agent | grep -v grep' | dshbak -c # to check on all compute nodes
export DAOS_POOL=Your_allocated_pool_name
daos pool query ${DAOS_POOL}
daos cont list ${DAOS_POOL}
export DAOS_CONT=Your_container_name
daos container get-prop ${DAOS_POOL} ${DAOS_CONT}
- Look for messages like Rebuild busy and state degraded in the daos pool query output.
- Look for messages like Health (status) : UNCLEAN in the get-prop output.
daos pool autotest ${DAOS_POOL}
daos container check --pool=${DAOS_POOL} --cont=${DAOS_CONT}
Mount a POSIX container
Currently, you must manually mount your container prior to use on any node you are working on. In the future, we hope to automate some of this via additional qsub options.
To mount a POSIX container on a login node
mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT}
start-dfuse.sh -m /tmp/${DAOS_POOL}/${DAOS_CONT} --pool ${DAOS_POOL} --cont ${DAOS_CONT} # To mount
mount | grep dfuse # To confirm it is mounted
# Mode 1
ls /tmp/${DAOS_POOL}/${DAOS_CONT}
cd /tmp/${DAOS_POOL}/${DAOS_CONT}
cp ~/temp.txt /tmp/${DAOS_POOL}/${DAOS_CONT}/
cat /tmp/${DAOS_POOL}/${DAOS_CONT}/temp.txt
fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT} # To unmount
To mount a POSIX container on Compute Nodes
You need to mount the container on all compute nodes.
launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} # launched using pdsh on all compute nodes; mounted at /tmp/<pool>/<container>
mount | grep dfuse # To confirm it is mounted
ls /tmp/${DAOS_POOL}/${DAOS_CONT}/
clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} # To unmount on all nodes
Job Submission
The -ldaos=default switch will ensure that DAOS is accessible on the compute nodes.
Job submission without requesting DAOS:
qsub -l select=1 -l walltime=01:00:00 -A Aurora_deployment -k doe -l filesystems=flare -q lustre_scaling ./pbs_script1.sh (or -I for an interactive job)
Job submission with DAOS:
qsub -l select=1 -l walltime=01:00:00 -A Aurora_deployment -k doe -l filesystems=flare -q lustre_scaling -l daos=default ./pbs_script1.sh (or -I for an interactive job)
NIC and Core Binding
Each Aurora compute node has 8 NICs, and each DAOS server node has 2 NICs. Each NIC is capable of driving 20-25 GB/s unidirectional for data transfer. Every read and write goes over the NICs, so NIC binding is key to achieving good performance.
For 12 PPN, the following binding is recommended.
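The sketch below uses the same binding list that appears in the sample job script later on this page; the application name and rank counts are placeholders.
CPU_BINDING1=list:4:9:14:19:20:25:56:61:66:71:74:79
mpiexec -np ${NRANKS} -ppn 12 --cpu-bind ${CPU_BINDING1} --no-vni -genvall ./your_app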
Interception library for POSIX containers
The interception library (IL) is a next step in improving DAOS performance. It provides kernel bypass for I/O data, leading to improved performance. The libioil IL intercepts basic read and write POSIX calls, while all metadata calls still go through dfuse. The libpil4dfs IL intercepts both data and metadata calls. The IL can provide a large performance improvement for bulk I/O as it bypasses the kernel and communicates with DAOS directly in user space. It also takes advantage of the multiple NICs on the node based on how many MPI processes are running on the node and which CPU socket they are on.
Interception library for POSIX mode
mpiexec # no interception
mpiexec --env LD_PRELOAD=/usr/lib64/libioil.so # only data is intercepted
mpiexec --env LD_PRELOAD=/usr/lib64/libpil4dfs.so # preferred - both metadata and data are intercepted; this provides close to DFS mode performance
Sample job script
Currently, --no-vni is required in the mpiexec command to use DAOS.
#!/bin/bash -x
#PBS -l select=512
#PBS -l walltime=01:00:00
#PBS -A Aurora_deployment
#PBS -q lustre_scaling
#PBS -k doe
#PBS -ldaos=default
# qsub -l select=512:ncpus=208 -l walltime=01:00:00 -A Aurora_deployment -l filesystems=flare -q lustre_scaling -ldaos=default ./pbs_script.sh (or -I for an interactive job)
# please do not forget -ldaos=default in your qsub
export TZ='/usr/share/zoneinfo/US/Central'
date
module use /soft/modulefiles
module load daos
env | grep DRPC #optional
ps -ef|grep daos #optional
clush --hostfile ${PBS_NODEFILE} 'ps -ef|grep agent|grep -v grep' | dshbak -c #optional
DAOS_POOL=datascience
DAOS_CONT=thundersvm_exp1
daos pool query ${DAOS_POOL} #optional
daos cont list ${DAOS_POOL} #optional
daos container destroy ${DAOS_POOL} ${DAOS_CONT} #optional
daos container create --type POSIX ${DAOS_POOL} ${DAOS_CONT} --properties rd_fac:1
daos container query ${DAOS_POOL} ${DAOS_CONT} #optional
daos container get-prop ${DAOS_POOL} ${DAOS_CONT} #optional
daos container list ${DAOS_POOL} #optional
launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} # To mount on a compute node
# mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT} # To mount on a login node
# start-dfuse.sh -m /tmp/${DAOS_POOL}/${DAOS_CONT} --pool ${DAOS_POOL} --cont ${DAOS_CONT} # To mount on a login node
mount|grep dfuse #optional
ls /tmp/${DAOS_POOL}/${DAOS_CONT} #optional
# cp /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thundersvm/input_data/real-sim_M100000_K25000_S0.836 /tmp/${DAOS_POOL}/${DAOS_CONT} #one time
# daos filesystem copy --src /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thundersvm/input_data/real-sim_M100000_K25000_S0.836 --dst daos://${DAOS_POOL}/${DAOS_CONT} # check https://docs.daos.io/v2.4/testing/datamover/
cd $PBS_O_WORKDIR
echo Jobid: $PBS_JOBID
echo Running on nodes `cat $PBS_NODEFILE`
NNODES=`wc -l < $PBS_NODEFILE`
RANKS_PER_NODE=12 # Number of MPI ranks per node
NRANKS=$(( NNODES * RANKS_PER_NODE ))
echo "NUM_OF_NODES=${NNODES} TOTAL_NUM_RANKS=${NRANKS} RANKS_PER_NODE=${RANKS_PER_NODE}"
CPU_BINDING1=list:4:9:14:19:20:25:56:61:66:71:74:79
export THUN_WS_PROB_SIZE=1024
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export AFFINITY_ORDERING=compact
export RANKS_PER_TILE=1
export PLATFORM_NUM_GPU=6
export PLATFORM_NUM_GPU_TILES=2
date
LD_PRELOAD=/usr/lib64/libpil4dfs.so mpiexec -np ${NRANKS} -ppn ${RANKS_PER_NODE} --cpu-bind ${CPU_BINDING1} \
--no-vni -genvall thunder/svm_mpi/run/aurora/wrapper.sh thunder/svm_mpi/build_ws1024/bin/thundersvm-train \
-s 0 -t 2 -g 1 -c 10 -o 1 /tmp/datascience/thunder_1/real-sim_M100000_K25000_S0.836
date
clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} #to unmount on compute node
# fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT} #to unmount on login node
MPI-IO Mode
Mode 2
The ROMIO MPI-IO layer provides multiple I/O backends including a custom DAOS backend.
MPI-IO can be used with dfuse and the interception library when using the ufs backend, but the daos backend will provide optimal performance. In order to use it, prefix the file names with daos:, which tells MPI-IO to use the DAOS backend.
export ROMIO_PRINT_HINTS=1
echo "cb_nodes 128" >> ${PBS_O_WORKDIR}/romio_hints
mpiexec --env ROMIO_HINTS=${PBS_O_WORKDIR}/romio_hints program daos:/mpi_io_file.data
or
mpiexec --env MPICH_MPIIO_HINTS="path_to_your_file*:cb_config_list=#*:2#:romio_cb_read=enable:romio_cb_write=enable:cb_nodes=32" program daos:/mpi_io_file.data
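For illustration, a minimal MPI-IO sketch in C: the daos: prefix routes the file through the DAOS backend, so no dfuse mount is needed. The file name mpi_io_file.data matches the commands above, and each rank writes at its own offset.
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    const char *msg = "hello from DAOS via MPI-IO\n";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The daos: prefix tells MPI-IO (ROMIO) to use its DAOS backend for this file */
    MPI_File_open(MPI_COMM_WORLD, "daos:/mpi_io_file.data",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its message at a rank-specific offset */
    MPI_File_write_at(fh, (MPI_Offset)rank * (MPI_Offset)strlen(msg),
                      msg, (int)strlen(msg), MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}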
DFS Mode
Mode 3
DFS is the user-level API for DAOS. This API is very similar to POSIX but still has many differences that would require code changes to use DFS directly. The DFS API can provide the best overall performance for any scenario other than workloads that benefit from caching.
Reference code for using DAOS through DFS mode and DAOS APIs
Full code at /soft/daos/examples/src
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <mpi.h>
#include <daos.h>
#include <daos_fs.h>
int main(int argc, char **argv)
{
    dfs_t       *dfs;
    dfs_obj_t   *obj;
    d_sg_list_t  sgl;
    d_iov_t      iov;
    char         buf[4096];
    daos_size_t  got;
    int          rank, ret;
    ret = MPI_Init(&argc, &argv);
    ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Initialize DFS and connect to the pool/container named in DAOS_POOL/DAOS_CONT */
    ret = dfs_init();
    ret = dfs_connect(getenv("DAOS_POOL"), NULL, getenv("DAOS_CONT"), O_RDWR, NULL, &dfs);

    /* Describe the I/O buffer with a scatter/gather list */
    d_iov_set(&iov, buf, sizeof(buf));
    sgl.sg_nr = 1;
    sgl.sg_nr_out = 0;
    sgl.sg_iovs = &iov;

    /* Create a file in the container root (default object class, 1 MiB chunks), write, then read back */
    ret = dfs_open(dfs, NULL, "dfs_example_file", S_IFREG|S_IRUSR|S_IWUSR, O_CREAT|O_RDWR, 0, 1048576, NULL, &obj);
    ret = dfs_write(dfs, obj, &sgl, 0, NULL);
    ret = dfs_read(dfs, obj, &sgl, 0, &got, NULL);

    /* Release the object and shut down DFS and MPI */
    ret = dfs_release(obj);
    ret = dfs_disconnect(dfs);
    ret = dfs_fini();
    ret = MPI_Finalize();
    return 0;
}
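A compile line along these lines may work once the daos module is loaded; the library names here are an assumption, so check the build files shipped with /soft/daos/examples/src:
mpicc dfs_example.c -o dfs_example -ldfs -ldaos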
DAOS Hardware
Each DAOS server node is based on the Intel Coyote Pass platform:
- (2) Xeon 5320 CPUs (Ice Lake)
- (16) 32 GB DDR4 DIMMs
- (16) 512 GB Intel Optane Persistent Memory 200
- (16) 15.3 TB Samsung PM1733 NVMe drives
- (2) HPE Slingshot NICs
Darshan profiler for DAOS
Currently, you need to install your own local darshan-daos profiler. You need to use DFS mode (3) or POSIX mode with the interception library to profile.
module use /soft/modulefiles
module load daos
module list
git clone https://github.com/darshan-hpc/darshan.git
cd darshan
git checkout snyder/dev-daos-module-3.4
./prepare.sh
mkdir /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-install
./configure --prefix=/home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-install \
--with-log-path=/home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs \
--with-jobid-env=PBS_JOBID \
CC=mpicc --enable-daos-mod
make && make install
chmod 755 ~/soft/profilers/darshan-daos/darshan/darshan-install/darshan-mk-log-dirs.pl
mkdir /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs
cd /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs
~/soft/profilers/darshan-daos/darshan/darshan-install/darshan-mk-log-dirs.pl
~/soft/profilers/darshan-daos/darshan-install/bin/darshan-config --log-path
Preload darshan first, then the DAOS interception library:
mpiexec --env LD_PRELOAD=~/soft/profilers/darshan-daos/darshan-install/lib/libdarshan.so:/usr/lib64/libpil4dfs.so \
-np 32 -ppn 16 --no-vni -genvall \
ior -a DFS --dfs.pool=datascience_ops --dfs.cont=ior_test1 \
-i 5 -t 16M -b 2048M -w -r -C -e -c -v -o /ior_2.dat
Install darshan-util on your laptop:
conda info --envs
conda activate env-non-mac-darshan-temp
source /Users/kvelusamy/Desktop/tools/spack/share/spack/setup-env.sh
spack install darshan darshan-util
export DYLD_FALLBACK_LIBRARY_PATH=/Users/kvelusamy/Desktop/tools/spack/opt/spack/darwin-ventura-m1/apple-clang-14.0.3/darshan-util-3.4.4-od752jyfljrrey3d4gjeypdcppho42k2/lib/:$DYLD_FALLBACK_LIBRARY_PATH
darshan-parser ~/Downloads/kaushikv_ior_id917110-44437_10-23-55830-632270104473632905_1.darshan
python3 -m darshan summary ~/Downloads/kaushikv_ior_id917110-44437_10-23-55830-632270104473632905_1.darshan #coming soon
Cluster Size
The DAOS cluster size is the number of available DAOS servers. While we are working toward making all 1024 DAOS server nodes available to users, a varying number of DAOS nodes may currently be up. Please check with support or run an IOR test to estimate the number of DAOS servers currently available.
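For example, a quick bandwidth check could reuse the IOR flags from the Darshan section above (pool, container, rank counts, and output path are placeholders):
mpiexec -np 32 -ppn 16 --no-vni -genvall \
  ior -a DFS --dfs.pool=${DAOS_POOL} --dfs.cont=${DAOS_CONT} \
  -i 5 -t 16M -b 2048M -w -r -C -e -c -v -o /ior_test.dat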
Best practices
Check that you requested DAOS
qsub -l daos=default
Did you load DAOS module?
module load daos
Do you have your DAOS pool allocated?
daos pool query datascience
Is DAOS client running on all your nodes?
ps -ef | grep daos
Is your container mounted on all nodes?
mount | grep dfuse
Can you ls in your container?
ls /tmp/${DAOS_POOL}/${DAOS_CONT}
Did your I/O actually fail?
What is the health property in your container?
daos container get-prop ${DAOS_POOL} ${DAOS_CONT}
Is your space full? Check the min and max values in:
daos pool query datascience
Does your query show failed targets or rebuild in process?
daos pool query datascience
daos pool autotest
daos container check