DAOS Architecture
Warning
DAOS is a scratch file system. Please note that data may be removed or become unavailable at any time.
DAOS is a major file system on Aurora, with 230 PB of capacity delivering more than 30 TB/s across 1024 DAOS server storage nodes. DAOS is an open-source, software-defined object store designed for massively distributed Non-Volatile Memory (NVM) and NVMe SSDs. It presents a unified storage model with a native key-array-value storage interface supporting POSIX, MPI-IO, DFS, and HDF5. Users can use DAOS for their I/O and checkpointing on Aurora. DAOS is fully integrated with the wider Aurora compute fabric, as can be seen in the overall storage architecture below.
DAOS Overview
The first step in using DAOS is to get DAOS pool space allocated for your project. Submit a request as described below to have a DAOS pool created for your project.
DAOS Pool Allocation
A DAOS pool is a physically allocated, dedicated storage space for your project.
Email support@alcf.anl.gov to request a DAOS pool with the following information:
- Project Name
- ALCF User Names
- Total space requested (typically 100 TB or more)
- Justification
- Preferred pool name
Note
This is an initial test DAOS configuration; any data on the DAOS system will eventually be deleted when the configuration is changed into a larger system. Warning will be given before the system is wiped, to allow users time to move any important data off.
Modules
Please load the `daos` module when using DAOS. This should be done on the login node (UAN) or on the compute node (job script):
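A minimal example, mirroring the commands used in the sanity checks below:

```bash
# On the UAN, or at the top of your job script
module use /soft/modulefiles
module load daos/base
```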
Pool
A pool is a dedicated space allocated to your project. Once your pool has been allocated, confirm you are able to query it via:
```
$ daos pool query hacc
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=640, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
    Total size: 6.0 TB
    Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
    Total size: 200 TB
    Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
```
DAOS Container
The container is the basic unit of storage. A POSIX container can hold hundreds of millions of files; you can use it to store all of your data. You only need a small set of containers; perhaps just one per major unit of project work is sufficient.
There are three modes in which DAOS containers can be used:

- POSIX container in POSIX mode
- POSIX container in MPI-IO mode
- DFS container through the DAOS APIs
Create a POSIX container
```
$ DAOS_POOL=datascience
$ DAOS_CONT=LLM-GPT-1T
$ daos container create --type POSIX ${DAOS_POOL} ${DAOS_CONT} --properties rd_fac:1
  Container UUID : 59747044-016b-41be-bb2b-22693333a380
  Container Label: LLM-GPT-1T
  Container Type : POSIX
Successfully created container 59747044-016b-41be-bb2b-22693333a380
```
If you prefer higher data protection and recovery, you can use `--properties rd_fac:2`; if you do not need data protection and recovery, you can omit `--properties rd_fac:1`. We recommend using at least `--properties rd_fac:1`.
DAOS sanity checks
If any of the following commands results in an error, then DAOS is currently down. The 'Out of group or member list' error is an exception and can be safely ignored; this error message will be fixed in the next DAOS release.
```bash
module use /soft/modulefiles
module load daos/base
env | grep DRPC
ps -ef | grep daos
clush --hostfile ${PBS_NODEFILE} "ps -ef | grep agent | grep -v grep" | dshbak -c  # check on all compute nodes
export DAOS_POOL=Your_allocated_pool_name
daos pool query ${DAOS_POOL}
daos cont list ${DAOS_POOL}
daos container get-prop ${DAOS_POOL} ${DAOS_CONT}
```
- Look for messages like `Rebuild busy` and `state degraded` in the `daos pool query` output.
- Look for messages like `Health (status) : UNCLEAN` in the `get-prop` output.
Mount a POSIX container
Currently, you must manually mount your container prior to use on any node you are working on. In the future, we hope to automate some of this via additional `qsub` options.
To mount a POSIX container on a login node
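A minimal sketch, assuming your pool and container names are in `DAOS_POOL` and `DAOS_CONT`, and using `/tmp/${DAOS_POOL}/${DAOS_CONT}` as an illustrative mount point:

```bash
# Create a mount point and mount the container with dfuse
mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT}
dfuse --pool=${DAOS_POOL} --container=${DAOS_CONT} \
      --mountpoint=/tmp/${DAOS_POOL}/${DAOS_CONT}
mount | grep dfuse   # verify the mount

# When finished, unmount with:
fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT}
```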
To mount a POSIX container on Compute Nodes
You need to mount the container on all compute nodes.
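One way to do this from within a job is to run `dfuse` on every node with `clush` (a sketch; the mount point is illustrative, and a site helper script such as `launch-dfuse.sh`, where provided, automates the same thing):

```bash
# Mount the container on every compute node in the job
clush --hostfile ${PBS_NODEFILE} \
      "mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT} && \
       dfuse --pool=${DAOS_POOL} --container=${DAOS_CONT} \
             --mountpoint=/tmp/${DAOS_POOL}/${DAOS_CONT}"

# Verify the mount on all nodes
clush --hostfile ${PBS_NODEFILE} "mount | grep dfuse" | dshbak -c
```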
Job Submission
The `-l filesystems=daos_user` and `-l daos=daos_user` switches ensure that DAOS is accessible on the compute nodes.
Job submission without requesting DAOS:
```
qsub -l select=1 -l walltime=01:00:00 -A <ProjectName> -k doe -l filesystems=flare -q debug ./pbs_script1.sh
```

(or pass `-I` for an interactive job instead of a script)
Job submission with DAOS:
```
qsub -l select=1 -l walltime=01:00:00 -A <ProjectName> -k doe -l filesystems=flare:daos_user -l daos=daos_user -q debug ./pbs_script1.sh
```

(or pass `-I` for an interactive job instead of a script)
NIC and Core Binding
Each Aurora compute node has 8 NICs, and each DAOS server node has 2 NICs. Each NIC is capable of driving 20-25 GB/s unidirectionally. Every read and write goes over a NIC, so NIC binding is the key to achieving good performance.
For 12 PPN, the following binding is recommended:
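A sketch of what such a launch might look like; the exact CPU list below is an assumption chosen to spread 12 ranks across both sockets (and hence across all 8 NICs), not an official mapping:

```bash
# 12 ranks per node, pinned so ranks are distributed across both CPU
# sockets and all 8 NICs are exercised (CPU list is illustrative)
mpiexec -np 24 -ppn 12 --no-vni -genvall \
        --cpu-bind=list:4:9:14:19:20:25:56:61:66:71:74:79 \
        ./my_app
```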
Interception library for POSIX containers
The interception library (IL) is a next step in improving DAOS performance. It provides kernel bypass for I/O data, leading to improved performance. The `libioil` IL intercepts basic read and write POSIX calls, while all metadata calls still go through dfuse. The `libpil4dfs` IL intercepts both data and metadata calls. The IL can provide a large performance improvement for bulk I/O, as it bypasses the kernel and communicates with DAOS directly in user space. It will also take advantage of the multiple NICs on the node, based on how many MPI processes are running on the node and which CPU socket they are on.
(Figure: interception library for POSIX mode)
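A sketch of enabling each IL via `LD_PRELOAD` (the application name and mount point are illustrative; `/usr/lib64/libpil4dfs.so` is the path shown in the Darshan example below, and `libioil.so` is assumed to live alongside it):

```bash
# Intercept data calls only (libioil); metadata still goes through dfuse
mpiexec -np 32 -ppn 16 --no-vni \
        --env LD_PRELOAD=/usr/lib64/libioil.so \
        ./my_posix_app /tmp/${DAOS_POOL}/${DAOS_CONT}/out.dat

# Intercept both data and metadata calls (libpil4dfs)
mpiexec -np 32 -ppn 16 --no-vni \
        --env LD_PRELOAD=/usr/lib64/libpil4dfs.so \
        ./my_posix_app /tmp/${DAOS_POOL}/${DAOS_CONT}/out.dat
```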
Sample job script
Currently, `--no-vni` is required in the `mpiexec` command to use DAOS.
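A minimal sketch of a PBS job script combining the pieces above (pool, container, and application names are placeholders):

```bash
#!/bin/bash
#PBS -l select=2
#PBS -l walltime=00:30:00
#PBS -l filesystems=flare:daos_user
#PBS -l daos=daos_user
#PBS -A <ProjectName>
#PBS -q debug
#PBS -k doe

module use /soft/modulefiles
module load daos/base

export DAOS_POOL=<pool_name>
export DAOS_CONT=<container_name>

# Mount the container on every compute node (see "Mount a POSIX container")
clush --hostfile ${PBS_NODEFILE} \
      "mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT} && \
       dfuse --pool=${DAOS_POOL} --container=${DAOS_CONT} \
             --mountpoint=/tmp/${DAOS_POOL}/${DAOS_CONT}"

# --no-vni is currently required for DAOS access
mpiexec -np 24 -ppn 12 --no-vni -genvall \
        ./my_app /tmp/${DAOS_POOL}/${DAOS_CONT}/output.dat
```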
MPI-IO Mode (Mode 2)
The ROMIO MPI-IO layer provides multiple I/O backends including a custom DAOS backend.
MPI-IO can be used with dfuse and the interception library via the `ufs` backend, but the `daos` backend provides optimal performance. To use it, prefix file names with `daos:`, which tells MPI-IO to use the DAOS backend.
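For example, with IOR's MPIIO backend (pool, container, and mount point are illustrative, and the container is assumed to be dfuse-mounted):

```bash
mpiexec -np 32 -ppn 16 --no-vni -genvall \
        ior -a MPIIO -t 16M -b 1G -w -r \
        -o daos:/tmp/${DAOS_POOL}/${DAOS_CONT}/ior_mpiio.dat
```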
DFS Mode (Mode 3)
DFS is the user-level API for DAOS. This API is very similar to POSIX but still has many differences that would require code changes to use DFS directly. The DFS API can provide the best overall performance for any scenario other than workloads that benefit from caching. Sample DFS code can be found in `/soft/daos/examples/src/`.
DAOS Hardware
Each DAOS server node is based on the Intel Coyote Pass platform:
- (2) Xeon 5320 CPU (Ice Lake)
- (16) 32GB DDR4 DIMMs
- (16) 512GB Intel Optane Persistent Memory 200
- (16) 15.3TB Samsung PM1733 NVMe
- (2) HPE Slingshot NIC
Darshan profiler for DAOS
Currently, you need to install your own local `darshan-daos` profiler, and you must use DFS mode (Mode 3) or POSIX with the interception library in order to profile. Preload Darshan first, then the DAOS interception library:
```bash
mpiexec --env LD_PRELOAD=~/soft/profilers/darshan-daos/darshan-install/lib/libdarshan.so:/usr/lib64/libpil4dfs.so \
        -np 32 -ppn 16 --no-vni -genvall \
        ior -a DFS --dfs.pool=datascience_ops --dfs.cont=ior_test1 \
        -i 5 -t 16M -b 2048M -w -r -C -e -c -v -o /ior_2.dat
```
Install `darshan-util` on your laptop:
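A sketch of a local build from a release tarball (the version number and install prefix are assumptions):

```bash
wget https://ftp.mcs.anl.gov/pub/darshan/releases/darshan-3.4.4.tar.gz
tar -xzf darshan-3.4.4.tar.gz
cd darshan-3.4.4/darshan-util
./configure --prefix=$HOME/soft/darshan-util
make && make install

# Summarize a collected log
$HOME/soft/darshan-util/bin/darshan-parser <logfile>.darshan | less
```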
Cluster Size
DAOS cluster size is the number of available DAOS servers. While we are working toward making all 1024 DAOS servers available to users, a varying number of DAOS nodes may be up at any given time. Please check with support, or run an IOR test, to estimate the number of DAOS servers currently available; theoretical peak bandwidth scales with the number of servers.
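A sketch of a quick IOR write/read test for estimating currently achievable bandwidth (pool and container names are illustrative):

```bash
mpiexec -np 64 -ppn 8 --no-vni -genvall \
        ior -a DFS --dfs.pool=${DAOS_POOL} --dfs.cont=${DAOS_CONT} \
        -i 3 -t 16M -b 1G -w -r -C -e -o /ior_bw_test.dat
```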
Best practices
- Check that you requested DAOS: `qsub -l filesystems=daos_user -l daos=daos_user`
- Did you load the DAOS module? `module load daos`
- Do you have your DAOS pool allocated? `daos pool query datascience`
- Is the DAOS client running on all your nodes? `ps -ef | grep daos`
- Is your container mounted on all nodes? `mount | grep dfuse`
- Can you `ls` in your container? `ls /tmp/${DAOS_POOL}/${DAOS_CONT}`
- Did your I/O actually fail?
- What is the health property of your container? `daos container get-prop ${DAOS_POOL} ${DAOS_CONT}`
- Is your space full? Check min and max: `daos pool query datascience`
- Does your query show failed targets or a rebuild in progress? `daos pool query datascience`
- Run the built-in diagnostics: `daos pool autotest` and `daos container check`