DAOS¶
DAOS is a major file system on Aurora, providing 230 PB of capacity and a peak bandwidth of more than 30 TB/s from 1024 DAOS server storage nodes. DAOS is an open-source, software-defined object store designed for massively distributed Non-Volatile Memory (NVM) and NVMe SSDs. DAOS presents a unified storage model with a native key-array-value interface that supports POSIX, MPI-IO, DFS, and HDF5. Users can use DAOS for I/O and checkpointing on Aurora. DAOS is fully integrated with the wider Aurora compute fabric.
This guide covers:
- DAOS pool allocation and container setup
- POSIX/DFS/MPI-IO access modes
- job submission and performance guidance
- troubleshooting and known issues
The first step in using DAOS is to get DAOS pool space allocated for your project. Submit a request as described below to have a DAOS pool created for your project.
DAOS Pool Allocation¶
A DAOS pool is physically allocated dedicated storage space for your project.
Email support@alcf.anl.gov to request a DAOS pool with the following information:
- Project Name
- ALCF User Names
- Total space requested (typically 100 TB+)
- Justification
- Preferred pool name
This is an initial test DAOS configuration, so any data on the DAOS system will eventually be deleted when the configuration is changed into a larger system. A warning will be given before the system is wiped to allow time for users to move any important data off.
Modules¶
Load the daos module when using DAOS. This can be done on the login node (UAN) or in the compute node job script:
Pool¶
A pool is a dedicated space allocated to your project. Once your pool has been allocated for your project space, confirm that you are able to query the pool:
daos pool query hacc
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=4096, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
Total size: 6.0 TB
Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
Total size: 200 TB
Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
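To make the space figures in that output easier to read at a glance, here is a minimal Python sketch that computes how full each storage tier is. It assumes the text layout matches the sample above; the helper names `to_bytes` and `tier_usage` are hypothetical and not part of any DAOS tool, and the real output may differ between DAOS versions.

```python
import re

# Sample text copied from the `daos pool query` output shown above.
SAMPLE = """\
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=4096, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
Total size: 6.0 TB
Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
Total size: 200 TB
Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
"""

# Decimal units are an assumption; only the ratio free/total matters here.
UNITS = {"GB": 1e9, "TB": 1e12, "PB": 1e15}


def to_bytes(value, unit):
    """Convert a (number, unit) pair such as ("4.4", "TB") to bytes."""
    return float(value) * UNITS[unit]


def tier_usage(text):
    """Return {tier_name: fraction_used} for each storage tier in the text."""
    usage = {}
    tier = None
    total = None
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"- Storage tier \d+ \((\w+)\):", line)
        if m:
            tier = m.group(1)
            continue
        m = re.match(r"Total size: ([\d.]+) (\w+)", line)
        if m and tier:
            total = to_bytes(*m.groups())
            continue
        m = re.match(r"Free: ([\d.]+) (\w+)", line)
        if m and tier and total:
            usage[tier] = 1.0 - to_bytes(*m.groups()) / total
            tier = total = None
    return usage


for name, fraction in tier_usage(SAMPLE).items():
    print(f"{name}: {fraction:.1%} used")
```

For the sample above this reports the SCM tier as roughly 27% used and the NVMe tier as 3% used.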
POSIX Containers¶
In DAOS terms, a container is a logical space within a pool where data and metadata are stored. For beginners, think of a container as a project directory where your files are stored. There are several container types, but this guide focuses on POSIX containers in the context of the DAOS File System (DFS). Advanced options are provided later in Advanced container creation configuration.
POSIX Container Access via DFUSE¶
DAOS POSIX container access can be done with no application code changes through DAOS filesystem (DFS) dfuse mount points on both compute and login nodes. Once mounted, you can access files in the container as you normally would with POSIX/Lustre commands.
Currently, this must be done manually before use on each node where you are working. In the future, this may be automated through additional qsub options.
Mount a POSIX container on a login node¶
Mount a POSIX container on compute nodes¶
You need to mount the container on all compute nodes. This is done via the launch-dfuse.sh script, which uses clush to run start-dfuse.sh on each node:
DAOS data mover instructions are provided here.
Job Submission¶
The -l filesystems=daos_user_fs PBS resource requirement will ensure that DAOS is accessible on the compute nodes.
Interception Library for POSIX Containers¶
The interception library (IL) is the next step for improving DAOS performance. It provides kernel bypass for I/O data. The libpil4dfs IL is recommended because it intercepts both data and metadata calls, so neither has to go through dFuse.
[Figure: Interception library for POSIX mode]
Sample Job Script¶
Currently, --no-vni is required in the mpiexec command to use DAOS.
MPI-IO Container Access¶
MPI-IO is a common backend for many I/O libraries, including HDF5 and PNetCDF. You should be able to directly use MPI-IO with DAOS: https://docs.daos.io/v2.0/user/mpi-io/
To optimally enable collective buffering, create a file with the following contents:
Then set the following environment variables at runtime to point to it:
DFS Container Access¶
DFS is the user-level API for DAOS. It is very similar to POSIX, but the differences are large enough that using DFS directly requires code changes. The DFS API can provide the best overall performance for any scenario other than workloads that benefit from caching.
The full code is available on the Aurora filesystem within /soft/daos/examples/src/
PyDAOS PyTorch access through pydaos.daos_torch module¶
First, set up an interactive job on a compute node and initialize the environment as follows:
- PyDAOS uses dfs_write() and read functions, which are faster than POSIX dfuse_write() and read functions.
- PyDAOS uses DFS containers and Python DAOS containers.
- The path to the dataset folders inside these containers does not include /tmp and just starts from /dataset_dir1, which assumes a folder inside the DAOS_POOL and DAOS_CONT.
- The above build path might be upgraded with newer builds without warning.
- More examples can be found at DAOS GitHub repo > pydaos.torch
Darshan profiler for DAOS¶
Darshan is a lightweight I/O profiling tool consisting of a shared library that your application preloads at runtime. It generates a binary log file at program termination, plus utilities to analyze that log. Full official documentation is available here.
To install your own copy of Darshan in your user space, follow the notes here.
1. Darshan¶
On Aurora, Darshan has been built in the programming environment in /soft.
To get the Darshan parser utilities loaded into your programming environment, execute the following:
To instrument your application and generate a Darshan binary log file, the shared library must be manually preloaded at run time via LD_PRELOAD, passed as an environment variable to mpiexec, along with the PNetCDF and HDF5 shared libraries (since support for those I/O libraries is included). All three must precede any DAOS interception library. The final specification would be:
If your application uses gpu_tile_compact.sh and you experience a hang or error as described in the 'Known issues and workarounds' section below, you should add the entire LD_PRELOAD definition to your personal copy of the Bash script. Do this by using the export builtin command. See here for an example:
Run your application normally as you would do with mpiexec or mpirun.
This generates a binary log file which has two additional modules: DFS for the DAOS file system API layer, and DAOS for the underlying object store.
By default, the binary log file is stored here:
where the last 3 directories are the date the file was generated, and the file name contains your user ID, job ID, and a timestamp. Alternatively, you can have the log saved under a name and location of your choosing by setting the following environment variable at run time:
Python PyDarshan Summary Report .html¶
The first step in your analysis should be to generate a graphical summary report to get a rough estimate of overall IO performance. For generating this graphical summary report, it is recommended to use the PyDarshan module on Aurora. It is a simple process of creating and activating a Python environment, installing the Darshan package, and then running the summary report generation command:
For custom build:
This should generate the .html Darshan report. Once the environment is created, reuse it for subsequent analyses by activating it again:
darshan-parser utility¶
darshan-parser can be used on the binary log file to get text output of all raw counters, which is more detailed than the Python summary .html:
Cluster Size¶
DAOS cluster size is the number of available DAOS servers. While we are working toward bringing up all 1024 DAOS servers for users, different numbers of DAOS nodes may be up at any given time. Please check with support or run an IOR test to estimate the current number of DAOS servers available. The bandwidth in the last column below is theoretical peak bandwidth.
Expected Bandwidth
Expected number of DAOS servers and their approximate expected bandwidth:
| Nodes | Percentage | Throughput |
|---|---|---|
| 20 | 2% | 1 TB/s |
| 128 | 12.50% | 5 TB/s |
| 600 | 60% | 10 TB/s |
| 800 | 78% | 20 TB/s |
| 1024 | 100% | 30 TB/s |
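The size of your current DAOS cluster (the number of DAOS servers) can be found by dividing ntarget by targets_per_node.
The value of ntarget comes from the output of daos pool query <pool name>. targets_per_node=32 is fixed given the node hardware configuration of our filesystem. An example, where ntarget=4096 gives 4096 / 32 = 128 DAOS servers: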
> daos pool query hacc
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=4096, disabled=0, leader=2, version=131
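The cluster-size calculation can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the `ntarget` field is read from the header line exactly as it appears in the sample above, and the bandwidth lookup simply reuses the rows of the expected-bandwidth table (theoretical peak values).

```python
import re

TARGETS_PER_NODE = 32  # fixed by the DAOS server hardware, per the text above

# Rows copied from the expected-bandwidth table: servers -> theoretical peak in TB/s.
EXPECTED_TBPS = {20: 1, 128: 5, 600: 10, 800: 20, 1024: 30}


def daos_server_count(pool_query_header, targets_per_node=TARGETS_PER_NODE):
    """Estimate the number of DAOS servers from the first line of `daos pool query`."""
    match = re.search(r"ntarget=(\d+)", pool_query_header)
    if match is None:
        raise ValueError("no ntarget field found in pool query output")
    return int(match.group(1)) // targets_per_node


def expected_bandwidth_tbps(servers):
    """Peak bandwidth for the largest tabulated server count not exceeding `servers`."""
    eligible = [n for n in EXPECTED_TBPS if n <= servers]
    return EXPECTED_TBPS[max(eligible)] if eligible else None


header = "Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=4096, disabled=0, leader=2, version=131"
servers = daos_server_count(header)
print(servers)                           # 4096 / 32 = 128
print(expected_bandwidth_tbps(servers))  # 5 (TB/s, theoretical peak)
```

The lookup deliberately rounds down to the nearest tabulated row rather than interpolating, since the table gives only approximate values.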
DAOS Hardware and Aurora Architecture¶
Each DAOS server node is based on the Intel Coyote Pass platform:
- (2) Xeon 5320 CPU (Ice Lake)
- (16) 32GB DDR4 DIMMs
- (16) 512GB Intel Optane Persistent Memory 200
- (16) 15.3TB Samsung PM1733 NVMe
- (2) HPE Slingshot NIC

Advanced Container Creation Configuration¶
DFS is essentially a POSIX emulation layer on top of DAOS and is implemented in the libdfs library, allowing a DAOS container to be accessed as a hierarchical POSIX namespace. libdfs supports files, directories, and symbolic links, but not hard links. The DAOS official documentation on DFS is here.
With more than 1024 servers at full deployment, the user-accessible cluster named daos_user has 16,384 solid state drives (SSDs) and 16,384 persistent memory modules. Without some amount of data redundancy, a hardware failure on any one of them could result in the loss of your data. DAOS has several data redundancy options available, and a tradeoff must be made between data resiliency, performance, and volume. The recommended tradeoff is to specify a redundancy factor of 3 on the container for both files and directories via the rd_fac:3 container property. By default, this means files use an erasure coding algorithm with a ratio of 16 data blocks to 3 parity blocks (in DAOS file object class terms, EC_16P3GX), which in simplest terms means that every 16 blocks of data are stored as 19 erasure-coded blocks. For directories, which are basically an emulation of an inode in traditional file system terms, the default is to create 3 full duplicates of the directory (4 replicas in total) by setting the directory object class to RP_4G1. There is little performance tradeoff for directories at this redundancy level, since a directory contains only metadata.
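As a rough illustration of the volume side of this tradeoff, the sketch below estimates how much raw pool space a given amount of file data consumes under 16+3 erasure coding, and compares it with 4-way replication. The function names are hypothetical, and the estimate ignores metadata and partially filled stripes, so treat it as an approximation rather than an accounting of actual pool usage.

```python
TB = 10**12


def ec_raw_bytes(data_bytes, data_cells=16, parity_cells=3):
    """Raw space used when every `data_cells` blocks carry `parity_cells` parity blocks."""
    return data_bytes * (data_cells + parity_cells) / data_cells


def replicated_raw_bytes(data_bytes, replicas=4):
    """Raw space used when the data is stored as `replicas` full copies."""
    return data_bytes * replicas


data = 16 * TB
print(ec_raw_bytes(data) / TB)          # 19.0 TB of raw space for 16 TB of data
print(replicated_raw_bytes(data) / TB)  # 64.0 TB for the same data with 4 replicas
print(round(16 / 19, 3))                # 0.842, the usable fraction under EC_16P3
```

This is why erasure coding is used for file data while full replication is reserved for the small amount of directory metadata.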
In the scenario with the above settings, when a server failure occurs, be it a software or hardware failure (e.g. an SSD, persistent memory module, or a networking switch failure) on up to 3 servers, a process called a rebuild occurs. During rebuild, the data on the failed servers is reconstructed to preserve data integrity, and the servers with the failures are excluded from the cluster. The servers or network can be repaired in the future so that the servers are eventually reintegrated to the cluster. The rebuild process in this scenario does not disrupt service, and the cluster does not experience any outage. If more than 3 servers are lost (say, due to a network issue) or more servers are lost during the rebuild, then the cluster will be taken offline to conduct repairs.
These parameters are set at container creation as follows along with others which will be described below for best practices:
The chunk-size of 2 MB and the ec_cell_sz (erasure coding cell size) of 128 KB work together to optimally stripe the data across the 16 data servers plus 3 parity servers (19 erasure coding servers), and they cap the amount of data written to one SSD on one server by one client per transaction at the ec_cell_sz of 128 KB. As a rule of thumb, the chunk-size should equal the number of data servers (excluding parity servers) multiplied by the ec_cell_sz, or at least be a whole multiple of that product. If your application does large amounts of I/O per process, you can experiment by increasing both settings proportionately, e.g. setting the chunk-size to 16 MB and the ec_cell_sz to 1 MB.
DAOS containers also have properties for server and client checksums, whereby the client retries the data transfer to or from the server if corruption is detected. Checksums are disabled by default. To enable them with the best balance of performance and accuracy, the CRC-32 algorithm is recommended, set alongside the above parameters with cksum:crc32,srv_cksum:on.
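The chunk-size rule of thumb is simple arithmetic, sketched below. The helper names are hypothetical, sizes are treated as binary units (1 KB = 1024 bytes) as an assumption, and "multiple" is interpreted as a whole-number multiple of one full stripe of data cells.

```python
KIB = 1024
MIB = 1024 * KIB


def recommended_chunk_size(ec_cell_sz, data_cells=16):
    """Chunk size that covers exactly one cell on each data server."""
    return data_cells * ec_cell_sz


def is_valid_chunk_size(chunk_size, ec_cell_sz, data_cells=16):
    """True if chunk_size is a whole multiple of one full stripe of data cells."""
    stripe = data_cells * ec_cell_sz
    return chunk_size >= stripe and chunk_size % stripe == 0


print(recommended_chunk_size(128 * KIB) // MIB)       # 2  (16 x 128 KB = 2 MB)
print(recommended_chunk_size(1 * MIB) // MIB)         # 16 (16 x 1 MB = 16 MB)
print(is_valid_chunk_size(3 * MIB, 128 * KIB))        # False: not a multiple of 2 MB
```

Both pairs quoted in the text (2 MB with 128 KB, and 16 MB with 1 MB) satisfy the rule exactly.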
The GX in EC_16P3GX tells the container to stripe the data across all servers in the pool, which is optimal if your application writes a single shared file or at most one file per node. If your application instead writes more than one file per node (say, file per process), change the GX to G32 for best performance; the 32 is the hard-coded number of servers the data in the file will be striped across. You can do this in one of two ways:
- Use the --file-oclass parameter explicitly in the container creation. The call would look like:
- Create a subdirectory in the container and set the attribute on it. For example, if your container was created with EC_16P3GX and you wanted a subdirectory <dir name> to have EC_16P3G32, mount the container (this is described in the POSIX Container Access via DFUSE section above) with directory <dir name> at /tmp/<pool name>/<container name> and then:
By default, any top-level directory created in a container inherits the directory and file object class from the container, and any subdirectory inherits from its parent, so in this fashion you can change the default and have a mix of file object classes in the same container.
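The guidance on GX versus G32 can be captured as a small decision helper. This is only a sketch of the rule stated above; the function name is hypothetical, and the two class names are the ones used in this guide.

```python
def choose_file_oclass(files_per_node):
    """Pick a file object class from the number of files each node writes.

    A single shared file, or at most one file per node, stripes across all
    servers (GX); file-per-process style workloads use the narrower G32.
    """
    if files_per_node < 0:
        raise ValueError("files_per_node must not be negative")
    return "EC_16P3GX" if files_per_node <= 1 else "EC_16P3G32"


print(choose_file_oclass(1))   # EC_16P3GX
print(choose_file_oclass(12))  # EC_16P3G32
```

In a job running 12 processes per node with one file per process, for example, the helper selects EC_16P3G32.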
There is maintenance overhead with containers, so it is advisable to create one (or a small number of) containers and then create multiple directories in those containers to partition your work.

More information: https://docs.daos.io/v2.6/overview/architecture/
More Information on MPI-IO¶
The MPICH MPI-IO layer on Aurora (ROMIO) provides multiple I/O backends, including one for DAOS. ROMIO can be used with dFuse and the interception library via the UFS backend, but the DAOS backend provides optimal performance. By default, ROMIO auto-detects DFS and uses the DAOS backend. MPI-IO itself is a common backend for many I/O libraries, including HDF5 and PNetCDF. Whether using collective I/O MPI-IO calls directly, or indirectly via an I/O library, a process called collective buffering can aggregate small non-contiguous chunks from many compute nodes into larger contiguous buffers on a subset of compute nodes (aggregators), from which DFS API calls are made to write/read data in DAOS. Collective buffering can improve or degrade I/O performance depending on the I/O pattern. In DAOS, disabling it can lead to I/O failures in some cases where all compute nodes issue extreme volumes of small non-contiguous reads/writes directly to DAOS. ROMIO hints should be set to optimally enable or disable collective buffering. At this time, you should explicitly enable collective buffering in an optimal way, as disabling it (or allowing it to default disabled) can cause I/O failures.
Fixing UNCLEAN Container Status¶
If you get an error while trying to access your container (for example, on a dfuse mount), the container may have status UNCLEAN. You can check this with:
You should see output with the Health property set to UNCLEAN:
This UNCLEAN status indicates that DAOS had a temporary loss of redundancy, which may or may not have caused metadata corruption (including directory structures) or data corruption. To investigate, first make the container accessible by explicitly setting status to HEALTHY:
To check metadata corruption, run this DAOS filesystem command:
If metadata is healthy, you should see output similar to:
If you see failure messages, or if Number of leaked OIDs in namespace is greater than 0, then metadata corruption exists. Otherwise, manually verify data correctness by appropriate means (for example: loading data into your simulator, using analysis programs, running your own checksums, or visually inspecting files). If metadata or data is corrupted, report it to ALCF Support at support@alcf.anl.gov, and someone from the DAOS team will follow up with you to investigate.
NIC and Core Binding¶
Each Aurora compute node has 8 NICs and each DAOS server node has 2 NICs. Each NIC can drive 20-25 GB/s of unidirectional data transfer. Every read and write goes over a NIC, so NIC binding is key to achieving good performance.
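To put those figures in perspective, the per-node network ceiling follows directly from the NIC count. The sketch below is plain arithmetic on the numbers quoted above; the function name is hypothetical, and real throughput depends on binding, I/O pattern, and server load.

```python
NIC_GBPS_RANGE = (20, 25)  # GB/s per NIC, unidirectional, from the text above


def node_peak_gbps(nics):
    """Return the (low, high) aggregate NIC bandwidth in GB/s for one node."""
    low, high = NIC_GBPS_RANGE
    return nics * low, nics * high


print(node_peak_gbps(8))  # (160, 200) for a compute node
print(node_peak_gbps(2))  # (40, 50) for a DAOS server node
```

A process bound to a single NIC can reach at most one eighth of the compute node's aggregate figure, which is why ranks should be spread across all NICs.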
For 12 PPN, the following binding is recommended:
| NIC 0 | NIC 1 | NIC 2 | NIC 3 | NIC 4 | NIC 5 | NIC 6 | NIC 7 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 52 | 53 | 54 | 55 |
| 4 | 5 | 6 | 7 | 56 | 57 | 58 | 59 |
| 8 | 9 | 10 | 11 | 60 | 61 | 62 | 63 |
| 12 | 13 | 14 | 15 | 64 | 65 | 66 | 67 |
| 16 | 17 | 18 | 19 | 68 | 69 | 70 | 71 |
| 20 | 21 | 22 | 23 | 72 | 73 | 74 | 75 |
| 24 | 25 | 26 | 27 | 76 | 77 | 78 | 79 |
| 28 | 29 | 30 | 31 | 80 | 81 | 82 | 83 |
| 32 | 33 | 34 | 35 | 84 | 85 | 86 | 87 |
| 36 | 37 | 38 | 39 | 88 | 89 | 90 | 91 |
| 40 | 41 | 42 | 43 | 92 | 93 | 94 | 95 |
| 44 | 45 | 46 | 47 | 96 | 97 | 98 | 99 |
| 48 | 49 | 50 | 51 | 100 | 101 | 102 | 103 |
Table: Sample NIC-to-core binding
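The table follows a regular pattern: cores 0-51 rotate across NICs 0-3 and cores 52-103 rotate across NICs 4-7. The Python sketch below reproduces that pattern so you can look up the NIC for any core. The function names are hypothetical and the constants are read off the table, not queried from the hardware.

```python
CORES_PER_SOCKET = 52  # cores 0-51 map to NICs 0-3, cores 52-103 to NICs 4-7 (per the table)
NICS_PER_SOCKET = 4


def nic_for_core(core):
    """Return the NIC column that the table lists for a given core ID."""
    if not 0 <= core < 2 * CORES_PER_SOCKET:
        raise ValueError(f"core {core} is outside the 0-103 range in the table")
    socket, offset = divmod(core, CORES_PER_SOCKET)
    return socket * NICS_PER_SOCKET + offset % NICS_PER_SOCKET


def cores_for_nic(nic):
    """Return every core ID in the table column for one NIC."""
    return [c for c in range(2 * CORES_PER_SOCKET) if nic_for_core(c) == nic]


print(nic_for_core(57))   # 5
print(cores_for_nic(0))   # [0, 4, 8, ..., 48], the first column of the table
```

When choosing a CPU binding list for a given PPN, picking cores from different columns spreads the ranks evenly over the NICs.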
DAOS Pool and Container Sanity Checks (daos_user cluster up or down)¶
If any of the following commands results in an error, the daos_user cluster may currently be down.
- Look for messages like Rebuild busy and state degraded in the daos pool query output.
- The 'Out of group or member list' error is an exception and can be safely ignored. This error message will be fixed in the next DAOS release.
You can also use the following commands for further diagnosis.
There are example programs and job scripts provided under /soft/daos/examples/.
More Information on Interception Library¶
The libioil IL will intercept basic read and write POSIX calls while all metadata calls still go through dFuse. The IL can provide a large performance improvement for bulk I/O as it bypasses the kernel and communicates with DAOS directly in user space. It will also take advantage of multiple NICs on the node based on how many MPI processes are running and which CPU socket they are on.

[Figure: Interception library for POSIX mode]
Sharing containers with multiple users¶
If you want to create a container that includes a dataset and allows multiple users in your project team to reuse it concurrently (with simultaneous mounting and safe read/write operations, i.e., without race conditions), follow the procedure below. Before proceeding, ensure all intended users have access to the project, pool, and user group.
More information: https://docs.daos.io/v2.6/user/container/#ownership
Known issues and workarounds¶
Large Bulk I/O Write Issue¶
There is a known Python issue with pil4dfs.
- Fix provided in DAOS-17499
- Current workaround is to set the D_IL_COMPATIBLE=1 environment variable.
- You can skip pil4dfs for now if that happens.
pydaos.daos_torch Disconnect and Cleanup¶
There is a DFS disconnect/cleanup issue. This should be fixed in the next release.
Libfabric Endpoint Creation Error¶
At high node counts and/or high PPN, the following error may appear in stderr:
04/02-11:03:16.60 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.708457] mercury->ctx [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:5400 na_ofi_eq_open() fi_cq_open failed, rc: -17 (File exists)
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722714] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:5191 na_ofi_basic_ep_open() Could not open event queues
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722737] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:5158 na_ofi_endpoint_open() na_ofi_basic_ep_open() failed
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722743] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na_ofi.c:7712 na_ofi_initialize() Could not create endpoint
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722976] mercury->cls [error] /builddir/build/BUILD/mercury-2.4.0/src/na/na.c:879 NA_Initialize_opt2() Could not initialize plugin
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.722988] mercury->cls [error] /scratchbox/daos/mschaara/io500/daos/build/external/debug/mercury/src/mercury_core.c:1347 hg_core_init() Could not initialize NA class (info_string=ofi+cxi://cxi4, listen=0)
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.723007] mercury->cls [error] /scratchbox/daos/mschaara/io500/daos/build/external/debug/mercury/src/mercury_core.c:6074 HG_Core_init_opt2() Cannot initialize core class
04/02-11:03:16.61 x4319c0s0b0n0 DAOS[53174/53174/0] external ERR # [1092097.723014] mercury->cls [error] /scratchbox/daos/mschaara/io500/daos/build/external/debug/mercury/src/mercury.c:1128 HG_Init_opt2() Could not create HG core class
You can disregard this, as the DAOS client will simply retry the operation until it succeeds.
Issue with gpu_tile_compact.sh and DAOS Interception Libraries¶
There is currently a bug involving oneAPI Level Zero, the DAOS interception libraries (/usr/lib64/libpil4dfs.so and /usr/lib64/libioil.so), and /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh (specifically the /usr/bin/udevadm call), where you may sporadically see a hang or an error like:
This issue is still under investigation. In the meantime, a workaround is to copy /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh and add LD_PRELOAD of the interception library within your local script. For libpil4dfs.so, add the following line after the /usr/bin/udevadm call and just before binary execution:
NA_HOSTUNREACH Errors¶
This is almost always a --no-vni issue or a network issue, not a DAOS issue.
na_ofi_mem_register Errors¶
There is a network limitation where, if your application performs heavy I/O from a severely segmented memory buffer, you may see an error like:
A workaround for this error is to set the following environment variable at runtime:
Best Practices¶
DAOS Agent Check¶
Whether you are accessing DAOS from a compute-node job or managing data from a login node, the DAOS agent daemon is required to connect the DAOS client to the DAOS server cluster (in your case, daos_user). The DAOS agent handles authentication and communication between clients and servers.
The DAOS agent daemon should always be running for the daos_user cluster on UANs. On compute nodes, the daos_user agent is started by the PBS prologue specified via -l filesystems=daos_user_fs and terminated in the PBS epilogue.
To verify it is running, first load the daos module and inspect processes:
On UANs, several agents may be running for different clusters (on compute nodes, usually one). The daos_user agent is named daos_agent_oneScratch and looks similar to:
Then verify the DAOS client is using the daos_user agent:
Quick Troubleshooting Checklist¶
Submission and Environment¶
- Check that you requested DAOS:
- Check that you loaded the DAOS module:
- Check that your DAOS pool is allocated:
Runtime Access Validation¶
- Check that the DAOS client is running on all nodes:
- Check that your container is mounted on all nodes:
- Check that you can list files in your container:
- Check that your I/O actually failed.
Pool/Container Health¶
- Check the container health property:
- Check whether your pool space is full (min/max free space):
- Check whether your pool reports failed targets or rebuild in process:
- Run explicit DAOS pool/container health checks:
Escalation¶
- If issues persist, submit a ticket to support@alcf.anl.gov.