DAOS Architecture¶
DAOS is a major file system on Aurora, providing 230 PB of capacity and delivering up to 30 TB/s across 1024 DAOS server storage nodes. DAOS is an open-source, software-defined object store designed for massively distributed Non-Volatile Memory (NVM) and NVMe SSDs. DAOS presents a unified storage model with a native key-array-value storage interface supporting POSIX, MPI-IO, DFS, and HDF5. Users can use DAOS for their I/O and checkpointing on Aurora. DAOS is fully integrated with the wider Aurora compute fabric, as can be seen in the overall storage architecture below.
DAOS Overview¶
The first step in using DAOS is to get DAOS pool space allocated for your project. Submit a request as described below to have a DAOS pool created for your project.
DAOS Pool Allocation¶
A DAOS pool is a physically allocated, dedicated storage space for your project.
Email support@alcf.anl.gov to request a DAOS pool with the following information:
- Project Name
- ALCF User Names
- Total space requested (typically 100 TB or more)
- Justification
- Preferred pool name
Note¶
This is an initial test DAOS configuration, and as such, any data on the DAOS system will eventually be deleted when the configuration is changed into a larger system. A warning will be given before the system is wiped, to allow time for users to move any important data off.
Modules¶
Please load the `daos` module when using DAOS. This should be done on the login node (UAN) or on the compute node (in your job script):
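```bash
module load daos
```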
Pool¶
A pool is a dedicated space allocated to your project. Once a pool has been allocated for your project, confirm that you are able to query it:
```
daos pool query hacc
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=4096, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
  Total size: 6.0 TB
  Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
  Total size: 200 TB
  Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
```
POSIX Containers¶
In DAOS terms, a container is a logical space within a pool where data and metadata are stored. It is essentially a self-contained object namespace and versioning space. There are several types of containers, but this guide (and all references in it) focuses on containers of the POSIX type, used in the context of the DAOS File System (DFS). DFS is essentially a POSIX emulation layer on top of DAOS and is implemented in the libdfs library, allowing a DAOS container to be accessed as a hierarchical POSIX namespace. libdfs supports files, directories, and symbolic links, but not hard links. The official DAOS documentation on DFS can be found here.
With 1024 servers at full deployment, the user-accessible cluster named `daos_user` has 16,384 solid state drives (SSDs) and 16,384 persistent memory modules, and without some amount of data redundancy, a hardware failure on any one of them could result in the loss of your data. DAOS has several data redundancy options available, and a tradeoff must be made between data resiliency, performance, and volume. The recommended tradeoff is to specify a redundancy factor of 3 on the container for both files and directories via the `rd_fac:3` container property. By default, this means files will utilize an erasure coding algorithm with a ratio of 16 data blocks to 3 parity blocks (in DAOS file object class terms, `EC_16P3GX`), which in simplest terms means 19 blocks of erasure coding store 16 blocks of data. For directories, the default is to create 3 full duplicates (4 replicas in total) by setting the directory object class to `RP_4G1`; a directory is basically an emulation of an inode in traditional file system terms. At this redundancy level there is little performance tradeoff for directories, since a directory contains only metadata.
In the scenario with the above settings, when a server failure occurs, be it a software or hardware failure (e.g. an SSD, persistent memory module, or a networking switch failure) on up to 3 servers, a process called a rebuild occurs. During rebuild, the data on the failed servers is reconstructed to preserve data integrity, and the servers with the failures are excluded from the cluster. The servers or network can be repaired in the future so that the servers are eventually reintegrated to the cluster. The rebuild process in this scenario does not disrupt service, and the cluster does not experience any outage. If more than 3 servers are lost (say, due to a network issue) or more servers are lost during the rebuild, then the cluster will be taken offline to conduct repairs.
These parameters, along with others described below as best practices, are set at container creation.
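A sketch of such a container-creation call (pool and container labels are placeholders; the properties and object classes are those recommended above):

```bash
daos container create --type=POSIX \
  --chunk-size=2097152 \
  --properties=rd_fac:3,ec_cell_sz:131072,cksum:crc32,srv_cksum:on \
  --file-oclass=EC_16P3GX \
  --dir-oclass=RP_4G1 \
  <pool name> <container name>
```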
The chunk size of 2 MB and the `ec_cell_sz` (erasure coding cell size) of 128 KB work together to optimally stripe the data across the 16 data servers plus 3 parity servers (19 erasure coding servers), and set the maximum amount of data written to one SSD on one server by one client per transaction to the `ec_cell_sz` of 128 KB. The general rule of thumb is that the chunk size should equal the number of data servers (excluding parity servers) multiplied by the `ec_cell_sz` (here, 16 × 128 KB = 2 MB), or at least be an even multiple of it. If your application does large amounts of I/O per process, you could experiment by increasing the settings proportionately, e.g., setting the chunk size to 16 MB and the `ec_cell_sz` to 1 MB. DAOS containers also have properties for both server and client checksums, whereby the client will retry the data transfer to or from the server in the case of corruption; by default this is disabled. To enable it with the best balance of performance and accuracy, use of the CRC-32 algorithm is recommended with the above parameters: `cksum:crc32,srv_cksum:on`.
Now, the `GX` in `EC_16P3GX` tells the container to stripe the data across all servers in the pool, which is optimal if your application writes a single shared file, or at most one file per node. If instead your application writes more than one file per node (say, file per process), for best performance you should change the `GX` to `G32`, the 32 being the hard-coded number of servers the data in each file will be striped across. You can do this in one of two ways:
1. Use the `--file-oclass` parameter explicitly in the container creation; the call would look like the first sketch below.
2. Create a subdirectory in the container and set the attribute on it. For example, if your container was created with `EC_16P3GX` and you wanted a subdirectory `<dir name>` to have `EC_16P3G32`, mount the container (this is described in the POSIX Container Access via DFUSE section below) with directory `<dir name>` at `/tmp/<pool name>/<container name>`, and then set the attribute as in the second sketch below.

By default, any top-level directory created in a container will inherit the directory and file object classes from the container, and any subdirectory inherits from its parent; in this fashion you can change the default and have a mix of file object classes in the same container.
There is maintenance overhead associated with containers, so it is advisable to create just one or a few containers and partition your work using multiple directories within them.
DAOS Agent Check¶
Whether you are accessing DAOS when running a job from a compute node or managing data from a login node, the DAOS agent daemon is needed to connect the DAOS client to the DAOS server cluster, in this case `daos_user`. The DAOS agent facilitates all authentication and communication between the DAOS clients and servers. The DAOS agent daemon should always be running for the `daos_user` cluster on the UANs; on the compute nodes, however, the `daos_user` agent is only started in the PBS prologue (requested via the `-l filesystems=daos_user_fs` resource requirement) and is terminated in the PBS epilogue. To verify that it is running, first load the `daos` module:
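```bash
module load daos
```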
Then, to verify that the DAOS daemon process for `daos_user` is running, run this command:
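```bash
ps -ef | grep daos_agent | grep -v grep
```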
Additionally, on the compute nodes you can run this `clush` command to check whether the agent is running on all nodes in the job:
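```bash
clush --hostfile ${PBS_NODEFILE} 'ps -ef | grep daos_agent | grep -v grep'
```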
On the UANs there may be several agents running for different clusters, so you may get several lines of output (on a compute node you will get only one); the agent for `daos_user` is named `daos_agent_oneScratch`. Then verify that the `daos_user` agent is the one that will be used by the DAOS client.
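One hypothetical check, assuming the `daos` module exports `DAOS_AGENT_DRPC_DIR` to select the agent socket directory:

```bash
echo $DAOS_AGENT_DRPC_DIR
# The printed path should reference the daos_user (oneScratch) agent.
```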
DAOS pool and container sanity checks (is the daos_user cluster up or down?)¶
If any of the following commands results in an error, the `daos_user` cluster is likely currently down.
- Look for messages like `Rebuild busy` and state `degraded` in the `daos pool query` output.
- The 'Out of group or member list' error is an exception and can be safely ignored; this error message will be fixed in the next DAOS release.
You can also use the following commands for further diagnosis.
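For example (pool and container names are placeholders):

```bash
daos pool query <pool name>
daos container list <pool name>
daos container query <pool name> <container name>
```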
There are example programs and job scripts provided under `/soft/daos/examples/`.
POSIX Container Access via DFUSE¶
DAOS POSIX container access can be accomplished, with no application code modifications needed, through DAOS File System (DFS) dfuse mount points on both the compute nodes and UANs. Once mounted, you can access files in the container as you normally would via POSIX file system commands. Currently, this must be done manually prior to use on any node on which you are working. In the future, we hope to automate some of this via additional `qsub` options.
1. To mount a POSIX container on a UAN¶
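A sketch using the `start-dfuse.sh` helper referenced in the next subsection (the flag names are assumptions; labels are placeholders):

```bash
module load daos
mkdir -p /tmp/<pool name>/<container name>
start-dfuse.sh -m /tmp/<pool name>/<container name> \
  --pool <pool name> --cont <container name>
mount | grep dfuse   # confirm the mount
```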
2. To mount a POSIX container on Compute Nodes¶
You need to mount the container on all compute nodes. This is done via the `launch-dfuse.sh` script, which runs `start-dfuse.sh` on every node via a `clush` command:
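```bash
# The <pool>:<container> argument format is an assumption; adjust to the
# script's actual usage.
launch-dfuse.sh <pool name>:<container name>
clush --hostfile ${PBS_NODEFILE} 'mount | grep dfuse'   # confirm on all nodes
```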
DAOS data mover instructions are provided here.
'UNCLEAN' Container Status¶
If you get an error trying to access your container (such as on the dfuse container mount), your container may have a status of 'UNCLEAN'. You can check this with the following command:
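```bash
daos container get-prop <pool name> <container name>
```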
You should see output with the 'Health' property set to 'UNCLEAN'.
This 'UNCLEAN' status indicates that the DAOS system has had a temporary loss of redundancy, which may or may not have resulted in corruption of the metadata (including directory structures) or the data itself. To investigate whether there is actual metadata or data corruption, you will first need to regain access to the container by explicitly setting its status to HEALTHY:
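```bash
daos container set-prop <pool name> <container name> --properties=status:healthy
```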
To check for metadata corruption, run the DAOS filesystem check command:
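```bash
# The --flags syntax may vary by DAOS release.
daos fs check <pool name> <container name> --flags print
```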
If the metadata is OK, the check should report no failures and a 'Number of leaked OIDs in namespace' of 0; if you see failure messages or a leaked-OID count greater than 0, you have metadata corruption. Otherwise, the next step is to manually verify the data correctness yourself, by whatever means is appropriate (e.g., loading data into your simulator, loading the data into analysis programs, utilizing your own checksums, or just visually inspecting the files). If your metadata or data has been corrupted, report the corruption to ALCF Support (support@alcf.anl.gov) and someone from the DAOS team will follow up with you to investigate.
Job Submission¶
The `-l filesystems=daos_user_fs` PBS resource requirement will ensure that DAOS is accessible on the compute nodes.
Job submission without requesting DAOS:

```bash
# replace `./pbs_script1.sh` with `-I` for an interactive job
qsub -l select=1 -l walltime=01:00:00 -A <ProjectName> -k doe -l filesystems=flare -q debug ./pbs_script1.sh
```
Job submission with DAOS:

```bash
qsub -l select=1 -l walltime=01:00:00 -A <ProjectName> -k doe -l filesystems=flare:daos_user_fs -q debug ./pbs_script1.sh
```
NIC and Core Binding¶
Each Aurora compute node has 8 NICs, and each DAOS server node has 2 NICs. Each NIC is capable of driving 20-25 GB/s unidirectional for data transfer. Every read and write goes over a NIC, and hence NIC binding is the key to achieving good performance.
For 12 PPN, the following binding is recommended:
| NIC 0 | NIC 1 | NIC 2 | NIC 3 | NIC 4 | NIC 5 | NIC 6 | NIC 7 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 52 | 53 | 54 | 55 |
| 4 | 5 | 6 | 7 | 56 | 57 | 58 | 59 |
| 8 | 9 | 10 | 11 | 60 | 61 | 62 | 63 |
| 12 | 13 | 14 | 15 | 64 | 65 | 66 | 67 |
| 16 | 17 | 18 | 19 | 68 | 69 | 70 | 71 |
| 20 | 21 | 22 | 23 | 72 | 73 | 74 | 75 |
| 24 | 25 | 26 | 27 | 76 | 77 | 78 | 79 |
| 28 | 29 | 30 | 31 | 80 | 81 | 82 | 83 |
| 32 | 33 | 34 | 35 | 84 | 85 | 86 | 87 |
| 36 | 37 | 38 | 39 | 88 | 89 | 90 | 91 |
| 40 | 41 | 42 | 43 | 92 | 93 | 94 | 95 |
| 44 | 45 | 46 | 47 | 96 | 97 | 98 | 99 |
| 48 | 49 | 50 | 51 | 100 | 101 | 102 | 103 |

*Sample NIC-to-core binding*
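As a purely illustrative sketch, a 12-PPN run could bind one rank to a core from each NIC group in the table, wrapping around for the remaining four ranks; the exact core list should be tuned for your application:

```bash
# Hypothetical 12-PPN binding: ranks 0-7 take one core from each NIC group,
# ranks 8-11 wrap around to the groups for NICs 0-3.
mpiexec -np 12 -ppn 12 --no-vni \
  --cpu-bind list:0:1:2:3:52:53:54:55:4:5:6:7 \
  ./your_app
```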
Interception library for POSIX containers¶
The interception library (IL) is the next step in improving DAOS performance. It provides kernel bypass for I/O data, leading to improved performance. The `libioil` IL intercepts basic read and write POSIX calls, while all metadata calls still go through dfuse. The `libpil4dfs` IL should be used to intercept both data and metadata calls. The IL can provide a large performance improvement for bulk I/O, as it bypasses the kernel and communicates with DAOS directly in userspace. It will also take advantage of the multiple NICs on the node, based on how many MPI processes are running on the node and which CPU socket they are on.
Interception library for POSIX mode:
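A sketch (library paths are those given in the Known Issues section below; the `mpiexec` options are illustrative):

```bash
# libpil4dfs intercepts both data and metadata calls:
mpiexec -np 12 -ppn 12 --no-vni \
  --env LD_PRELOAD=/usr/lib64/libpil4dfs.so ./your_app

# libioil intercepts only the read/write data calls:
mpiexec -np 12 -ppn 12 --no-vni \
  --env LD_PRELOAD=/usr/lib64/libioil.so ./your_app
```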
Sample job script¶
Currently, `--no-vni` is required in the `mpiexec` command to use DAOS.
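A minimal sketch of such a job script (node counts, PPN, and the dfuse helper invocation are illustrative; pool and container labels are placeholders):

```bash
#!/bin/bash -l
#PBS -l select=2
#PBS -l walltime=00:30:00
#PBS -l filesystems=flare:daos_user_fs
#PBS -A <ProjectName>
#PBS -q debug

module load daos
DAOS_POOL=<pool name>
DAOS_CONT=<container name>

# Mount the POSIX container on all compute nodes
launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

# --no-vni is currently required for DAOS access
mpiexec -np 24 -ppn 12 --no-vni \
  --env LD_PRELOAD=/usr/lib64/libpil4dfs.so \
  ./your_app /tmp/${DAOS_POOL}/${DAOS_CONT}/output.dat
```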
MPI-IO Container Access¶
The MPICH MPI-IO layer on Aurora (ROMIO) provides multiple I/O backends, including one for DAOS. ROMIO can be used with dfuse and the interception library via the UFS backend, but the DAOS backend will provide optimal performance; by default, ROMIO will auto-detect DFS and use the DAOS backend. MPI-IO itself is a common backend for many I/O libraries, including HDF5 and PNetCDF.

Whether using collective MPI-IO calls directly or indirectly via an I/O library, a process called collective buffering can be performed, where data from small non-contiguous chunks across many compute nodes in the collective is aggregated into larger contiguous buffers on a few compute nodes (referred to as aggregators), from which DFS API calls are made to write to or read from DAOS. Collective buffering can improve or degrade I/O performance depending on the I/O pattern; in the case of DAOS, disabling it can even lead to I/O failures in some cases, where the I/O traffic going directly from all the compute nodes in the collective to DAOS is too stressful, in the form of extreme numbers of small non-contiguous reads and writes.

In ROMIO there are hints that should be set to either optimally enable or disable collective buffering. At this time, you should explicitly enable collective buffering in the most optimal fashion, as disabling it or allowing it to default to disabled could result in I/O failures. To optimally enable collective buffering, create a file with the following contents:
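```
romio_cb_write enable
romio_cb_read enable
```

These are the standard ROMIO hint names for enabling collective buffering on writes and reads; additional hints (e.g., `cb_nodes`) can be added and tuned for your workload.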
Then simply set the following environment variable at run time to point to it:
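```bash
export ROMIO_HINTS=/path/to/your/hints/file
```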
If you want to verify the settings, additionally set the following, which will print out all the ROMIO hints at run time:
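```bash
export ROMIO_PRINT_HINTS=1
```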
DFS Container Access¶
DFS is the user-level API for DAOS. This API is very similar to POSIX but still has many differences that would require code changes in order to use DFS directly. The DFS API can provide the best overall performance for any scenario other than workloads that benefit from caching.
The full code is available on the Aurora filesystem within `/soft/daos/examples/src/`.
Example of PyTorch integration: pydaos.daos_torch module¶
First, set up an interactive job on a compute node and initialize the environment as follows:
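```bash
# Sketch: the `frameworks` module is an assumption for the PyTorch/pydaos
# Python stack; adjust to your environment.
qsub -I -l select=1 -l walltime=01:00:00 -A <ProjectName> \
     -l filesystems=flare:daos_user_fs -q debug
module load daos
module load frameworks
```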
Notes on the example Python script:
- PyDAOS uses `dfs_write()` and read functions, which are faster than the POSIX `dfuse_write()` and read functions.
- PyDAOS uses DFS containers and Python DAOS containers.
- The path to the dataset folders inside these containers does not include `/tmp`; it starts directly from `/dataset_dir1`, which assumes a folder inside the `DAOS_POOL` and `DAOS_CONT`.
- The above build path might be upgraded with newer builds without warning.
- More examples can be found in the DAOS GitHub repo under `pydaos.torch`.
DAOS Hardware¶
Each DAOS server node is based on the Intel Coyote Pass platform:
- (2) Xeon 5320 CPU (Ice Lake)
- (16) 32GB DDR4 DIMMs
- (16) 512GB Intel Optane Persistent Memory 200
- (16) 15.3TB Samsung PM1733 NVMe
- (2) HPE Slingshot NIC
Darshan profiler for DAOS¶
Darshan is a lightweight I/O profiling tool consisting of a shared library that your application preloads at runtime (which generates a binary log file at program termination) and a suite of utilities to analyze this file. Full official documentation can be found here.
1. Darshan¶
On Aurora, Darshan has been built in the programming environment under `/soft`.
To get the Darshan utilities loaded into your programming environment, execute the following:
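```bash
# One minimal option, assuming the darshan-util modulefile described below:
module load darshan-util
```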
However, it has not yet been fully modularized, so the shared library must be manually preloaded at run time via `LD_PRELOAD`, along with the PNetCDF and HDF5 shared libraries (since support for those I/O libraries is included), and all 3 must precede any DAOS interception library, so the specification would be:
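```bash
# Sketch: the runtime library path mirrors the darshan-util install referenced
# below; the PNetCDF and HDF5 paths are placeholders for those in your environment.
export LD_PRELOAD=/soft/perftools/darshan/darshan-3.4.7/lib/libdarshan.so:<path to libpnetcdf.so>:<path to libhdf5.so>:/usr/lib64/libpil4dfs.so
```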
If your application uses `gpu_tile_compact.sh`, then this whole `LD_PRELOAD` should go in your personal copy of that Bash script via the `export` builtin command.
Run your application normally, as you would with `mpiexec` or `mpirun`.
This generates a binary log file which has two additional modules: DFS for the DAOS file system API layer, and DAOS for the underlying object store.
By default, the binary log file is stored in a central log directory, where the last 3 directories are the date the file was generated, with your user ID, job ID, and timestamp in the file name. Alternatively, at run time you can have the file saved with a specified name in a different location with the following environment variable:
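```bash
# DARSHAN_LOGFILE sets the full path (and name) of the output log file.
export DARSHAN_LOGFILE=/path/of/your/choice/my_app.darshan
```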
2. darshan-util environment module¶
`module load darshan-util` is needed for `darshan-parser` and PyDarshan:

```bash
LD_PRELOAD=/soft/perftools/darshan/darshan-3.4.7/lib/libdarshan-util.so:$LD_PRELOAD
```
3. darshan-parser utility¶
`darshan-parser` can be used on the binary log file to get a text output of all the metrics, as follows:
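```bash
darshan-parser <logfile>.darshan > darshan_metrics.txt
```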
4. PyDarshan library and Python module¶
For generating a graphical summary report, it is recommended to use the PyDarshan module on Aurora. It is a simple process of creating and activating a Python environment, installing the Darshan package, and then running the summary report generation command:
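```bash
# Sketch of the PyDarshan workflow; the PyPI package name is `darshan`.
python -m venv pydarshan-env
source pydarshan-env/bin/activate
pip install darshan
python -m darshan summary <logfile>.darshan
```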
These steps apply whether you use a custom Python build or the system build. In the future, they should be replaced by a simpler `module load pydarshan`. Running the summary command generates the `.html` Darshan report.
Cluster Size¶
DAOS cluster size is the number of available DAOS servers. While we are working towards making all 1024 DAOS servers available to users, currently a smaller, varying number of DAOS nodes may be up. Please check with support, or run an IOR test to get an estimate of the current number of DAOS servers available. The bandwidth listed in the last column of the table below is a theoretical peak bandwidth.

*Expected number of DAOS servers and approximate expected bandwidth*
| Nodes | Percentage | Throughput |
|---|---|---|
| 20 | 2% | 1 TB/s |
| 128 | 12.50% | 5 TB/s |
| 600 | 60% | 10 TB/s |
| 800 | 78% | 20 TB/s |
| 1024 | 100% | 30 TB/s |
The size of your current DAOS cluster can be found using the following formula:

    number of DAOS servers = ntarget / targets_per_node

The value of `ntarget` comes from the output of `daos pool query`; `targets_per_node=32` is fixed, given the node hardware configuration of our filesystem. An example:
```
> daos pool query hacc
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=4096, disabled=0, leader=2, version=131
```

Here `ntarget=4096`, so this pool spans 4096 / 32 = 128 DAOS servers.
Sharing containers with multiple users¶
If you'd like to create a container that includes a dataset and allows multiple users from your project team to reuse it concurrently (with simultaneous mounting and safe read/write operations, i.e., without race conditions), you can follow the procedure below. Before proceeding, ensure that all intended users have the necessary access to your project, pool, and user group.
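A hypothetical sketch using DAOS container ACLs (the entry and permission strings are placeholders; consult the DAOS documentation for the full ACL syntax):

```bash
# Grant a teammate read/write access to the container:
daos container update-acl <pool name> <container name> -e A::<username>@:rw
# Verify the resulting ACL entries:
daos container get-acl <pool name> <container name>
```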
Known issues and workarounds¶
1. Large bulk I/O write issue¶
There is a known issue with Python and `pil4dfs`:

- A fix is provided in DAOS-17499.
- The current workaround is to set the `D_IL_COMPATIBLE=1` environment variable.
- You can skip `pil4dfs` for now if this issue occurs.
2. pydaos.daos_torch disconnect and clean up¶
There is a DFS disconnect and cleanup issue. This should be fixed in the next release.
3. Libfabric endpoint creation error¶
Occasionally, at a high number of nodes and/or high PPN, an error like the following may show up in your stderr log:
```
04/02-11:03:16.60 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.708457] mercury->ctx [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:5400 na_ofi_eq_open() fi_cq_open failed, rc: -17 (File exists)
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722714] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:5191 na_ofi_basic_ep_open() Could not open event queues
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722737] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:5158 na_ofi_endpoint_open() na_ofi_basic_ep_open() failed
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722743] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:7712 na_ofi_initialize() Could not create endpoint
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722976] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na.c:879 NA_Initialize_opt2() Could not initialize plugin
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722988] mercury->cls [error] /scratchbox/daos/mschaara/io500/daos/build/external/debug/mercury/src/mercury_core.c:1347 hg_core_init() Could not initialize NA class (info_string=ofi+cxi://cxi4, listen=0)
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.723007] mercury->cls [error] /scratchbox/daos/mschaara/io500/daos/build/external/debug/mercury/src/mercury_core.c:6074 HG_Core_init_opt2() Cannot initialize core class
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.723014] mercury->cls [error] /scratchbox/daos/mschaara/io500/daos/build/external/debug/mercury/src/mercury.c:1128 HG_Init_opt2() Could not create HG core class
```
You can disregard this, as the DAOS client will simply retry the operation until it succeeds.
4. Issue with the gpu_tile_compact.sh Bash script and the DAOS interception libraries¶
There is currently a bug involving oneAPI Level Zero, the DAOS interception libraries (`/usr/lib64/libpil4dfs.so` and `/usr/lib64/libioil.so`), and the `/soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh` Bash script, where you may sporadically get an error at scale.
This issue is still under investigation. In the meantime, there is a workaround: take the `/soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh` Bash script and create your own version of it that performs the `LD_PRELOAD` of the interception library within the script. In the case of `libpil4dfs.so`, you would add the following line just before the execution of the binary:
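```bash
export LD_PRELOAD=/usr/lib64/libpil4dfs.so
```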
5. NA_HOSTUNREACH errors¶
These errors are almost always caused by a missing `--no-vni` flag or by a network issue, and are not a DAOS issue.
Best practices¶
A consolidated sketch of the commands for these checks follows the list.

- Check that you requested DAOS (`-l filesystems=...:daos_user_fs`).
- Check that you loaded the DAOS module.
- Check that you have your DAOS pool allocated.
- Check that the DAOS client agent is running on all your nodes.
- Check that your container is mounted on all nodes.
- Check that you can `ls` in your container.
- Check that your I/O actually failed.
- Check the health property of your container.
- Check whether your space is full (min and max).
- Check whether your pool query shows failed targets or a rebuild in progress.
- Run the commands in the sketch below to check the health of your DAOS pool and container.
- If you are still having issues, please submit a ticket by emailing support@alcf.anl.gov.
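A consolidated sketch of these checks (placeholders throughout; the mount point follows the dfuse convention used above):

```bash
qstat -f ${PBS_JOBID} | grep filesystems                  # DAOS requested?
module list 2>&1 | grep daos                              # daos module loaded?
daos pool query <pool name>                               # pool, free space, rebuild status
clush --hostfile ${PBS_NODEFILE} 'pgrep -fa daos_agent'   # agent running on all nodes
clush --hostfile ${PBS_NODEFILE} 'mount | grep dfuse'     # container mounted on all nodes
ls /tmp/<pool name>/<container name>                      # container reachable?
daos container get-prop <pool name> <container name>      # health property
daos container query <pool name> <container name>
```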