Examples

We here list a set of example workloads. In the first example, we show the benchmarking process, including generating the dataset, running the benchmark with profiling, and processing the logs and profiling data. For the rest of the workloads, we list the YAML configure files.

UNET3D: 3D Medical Image Segmentation

Reference Implementation: https://github.com/mlcommons/training/tree/master/image_segmentation/pytorch
Framework: PyTorch
Dataset: .npz format image files containing a single sample.
Trains over multiple epochs, performs evaluation on a held-out test set periodically.

# contents of unßet3d.yaml

model: unet3d

framework: pytorch

workflow:
    generate_data: False
    train: True
    checkpoint: True

dataset:
    data_folder: data/unet3d/
    format: npz
    num_files_train: 168
    num_samples_per_file: 1
    record_length: 146600628
    record_length_stdev: 68341808
    record_length_resize: 2097152

reader:
    data_loader: pytorch
    batch_size: 4
    read_threads: 4
    file_shuffle: seed
    sample_shuffle: seed

train:
    epochs: 5
    computation_time: 1.3604

checkpoint:
    checkpoint_folder: checkpoints/unet3d
    checkpoint_after_epoch: 5
    epochs_between_checkpoints: 2
    model_size: 499153191

First, we generate the dataset with `++workload.workflow.generate=False`

mpirun -np 8 dlio_benchmark workload=unet3d ++workload.workflow.generate_data=True ++workload.workflow.train=False

Then, we run the appliation with iostat profiling

dlio_benchmark workload=unet3d ++workload.workflow.profiling=iostat

To run in data parallel mode, one can do

mpirun -np 8 dlio_benchmark workload=unet3d ++workload.workflow.profiling=iostat

This will run the benchmark and produce the following logging output:

[INFO] 2023-06-27T21:27:12.956820 Running DLIO with 8 process(es) [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:108]
[INFO] 2023-06-27T21:27:12.956967 Reading workload YAML config file 'dlio_benchmark.configs/workload/unet3d.yaml' [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:110]
[INFO] 2023-06-27T21:27:13.010843 Starting data generation [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:165]
[INFO] 2023-06-27T21:27:13.011399 Generating dataset in data/unet3d/train and data/unet3d/valid [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/data_generator/data_generator.py:73]
[INFO] 2023-06-27T21:27:13.011457 Number of files for training dataset: 168 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/data_generator/data_generator.py:74]
[INFO] 2023-06-27T21:27:13.011500 Number of files for validation dataset: 0 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/data_generator/data_generator.py:75]
[INFO] 2023-06-27T21:27:14.149995 Generating NPZ Data: [>------------------------------------------------------------] 0.6% 1 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:15.919235 Generating NPZ Data: [===>---------------------------------------------------------] 5.4% 9 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:17.240473 Generating NPZ Data: [======>------------------------------------------------------] 10.1% 17 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:18.181652 Generating NPZ Data: [=========>---------------------------------------------------] 14.9% 25 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:19.070685 Generating NPZ Data: [============>------------------------------------------------] 19.6% 33 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:19.761225 Generating NPZ Data: [===============>---------------------------------------------] 24.4% 41 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:21.772731 Generating NPZ Data: [==================>------------------------------------------] 29.2% 49 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:22.621811 Generating NPZ Data: [====================>----------------------------------------] 33.9% 57 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:23.523462 Generating NPZ Data: [=======================>-------------------------------------] 38.7% 65 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:24.455943 Generating NPZ Data: [==========================>----------------------------------] 43.5% 73 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:25.243788 Generating NPZ Data: [=============================>-------------------------------] 48.2% 81 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:25.811104 Generating NPZ Data: [================================>----------------------------] 53.0% 89 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:26.787472 Generating NPZ Data: [===================================>-------------------------] 57.7% 97 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:28.969593 Generating NPZ Data: [======================================>----------------------] 62.5% 105 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:29.958574 Generating NPZ Data: [========================================>--------------------] 67.3% 113 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:31.206116 Generating NPZ Data: [===========================================>-----------------] 72.0% 121 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:32.909674 Generating NPZ Data: [==============================================>--------------] 76.8% 129 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:34.357919 Generating NPZ Data: [=================================================>-----------] 81.5% 137 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:35.710920 Generating NPZ Data: [====================================================>--------] 86.3% 145 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:38.266190 Generating NPZ Data: [=======================================================>-----] 91.1% 153 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:39.301475 Generating NPZ Data: [==========================================================>--] 95.8% 161 of 168  [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/utility.py:108]
[INFO] 2023-06-27T21:27:39.846579 Generation done [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:170]
[INFO] 2023-06-27T21:27:39.850430 Profiling Started with iostat [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:177]
[INFO] 2023-06-27T21:27:39.888114 Max steps per epoch: 5 = 1 * 168 / 4 / 8 (samples per file * num files / batch size / comm size) [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:324]
[INFO] 2023-06-27T21:27:39.888787 Starting epoch 1: 5 steps expected [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2023-06-27T21:27:39.979028 Starting block 1 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:198]
[INFO] 2023-06-27T21:27:59.680070 Rank 0 step 1 processed 4 samples in 19.699954509735107 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.680076 Rank 1 step 1 processed 4 samples in 19.703863859176636 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.694070 Rank 3 step 1 processed 4 samples in 19.726907968521118 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.693802 Rank 4 step 1 processed 4 samples in 19.708129405975342 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.691022 Rank 2 step 1 processed 4 samples in 19.712920427322388 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.695373 Rank 6 step 1 processed 4 samples in 19.72462296485901 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.706875 Rank 5 step 1 processed 4 samples in 19.735779762268066 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:27:59.712785 Rank 7 step 1 processed 4 samples in 19.74686098098755 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.326995 Rank 0 step 2 processed 4 samples in 1.6458377838134766 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.327250 Rank 2 step 2 processed 4 samples in 1.6303155422210693 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.335634 Rank 1 step 2 processed 4 samples in 1.644171953201294 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.343710 Rank 4 step 2 processed 4 samples in 1.6453940868377686 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.355700 Rank 3 step 2 processed 4 samples in 1.6606194972991943 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.361624 Rank 5 step 2 processed 4 samples in 1.6541204452514648 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.364827 Rank 6 step 2 processed 4 samples in 1.6675446033477783 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:01.372457 Rank 7 step 2 processed 4 samples in 1.659090280532837 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.774831 Rank 0 step 3 processed 4 samples in 1.4467418193817139 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.775530 Rank 1 step 3 processed 4 samples in 1.4396388530731201 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.777924 Rank 6 step 3 processed 4 samples in 1.4070987701416016 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.778453 Rank 7 step 3 processed 4 samples in 1.4057674407958984 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.782499 Rank 2 step 3 processed 4 samples in 1.4540395736694336 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.783395 Rank 3 step 3 processed 4 samples in 1.4274392127990723 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.783894 Rank 4 step 3 processed 4 samples in 1.439401388168335 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:02.799731 Rank 5 step 3 processed 4 samples in 1.4285638332366943 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.229823 Rank 0 step 4 processed 4 samples in 1.454030990600586 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.229826 Rank 1 step 4 processed 4 samples in 1.453265905380249 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.240324 Rank 2 step 4 processed 4 samples in 1.4558677673339844 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.240330 Rank 3 step 4 processed 4 samples in 1.4567136764526367 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.245584 Rank 6 step 4 processed 4 samples in 1.4674956798553467 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.247221 Rank 4 step 4 processed 4 samples in 1.4627764225006104 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.250820 Rank 7 step 4 processed 4 samples in 1.4712388515472412 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:04.252102 Rank 5 step 4 processed 4 samples in 1.4519073963165283 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.523484 Rank 0 step 5 processed 4 samples in 9.293325901031494 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.527061 Maximum number of steps reached [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:297]
[INFO] 2023-06-27T21:28:13.527543 Rank 6 step 5 processed 4 samples in 9.281713724136353 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.523490 Rank 1 step 5 processed 4 samples in 9.28818964958191 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.527551 Rank 7 step 5 processed 4 samples in 9.267073631286621 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.539249 Rank 4 step 5 processed 4 samples in 9.291641473770142 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.546242 Rank 2 step 5 processed 4 samples in 9.305717945098877 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.545463 Rank 5 step 5 processed 4 samples in 9.277906894683838 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.548088 Rank 3 step 5 processed 4 samples in 9.307523012161255 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:13.541554 Ending block 1 - 5 steps completed in 33.56 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:216]
[INFO] 2023-06-27T21:28:13.712092 Epoch 1 - Block 1 [Training] Accelerator Utilization [AU] (%): 39.2945 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:219]
[INFO] 2023-06-27T21:28:13.713038 Epoch 1 - Block 1 [Training] Throughput (samples/second): 4.7693 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:220]
[INFO] 2023-06-27T21:28:20.379070 Ending epoch 1 - 5 steps completed in 40.49 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:159]
[INFO] 2023-06-27T21:28:20.387992 Starting epoch 2: 5 steps expected [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:128]
[INFO] 2023-06-27T21:28:20.458422 Starting block 1 [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:198]
[INFO] 2023-06-27T21:28:38.420511 Rank 0 step 1 processed 4 samples in 17.950562000274658 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.423065 Rank 2 step 1 processed 4 samples in 17.90280842781067 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.423041 Rank 4 step 1 processed 4 samples in 17.953059911727905 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.425153 Rank 6 step 1 processed 4 samples in 17.904606580734253 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.427028 Rank 1 step 1 processed 4 samples in 17.957058906555176 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.430326 Rank 3 step 1 processed 4 samples in 17.909387826919556 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.444290 Rank 5 step 1 processed 4 samples in 17.92300271987915 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:38.450703 Rank 7 step 1 processed 4 samples in 17.980567455291748 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.852909 Rank 0 step 2 processed 4 samples in 1.4301834106445312 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.860430 Rank 4 step 2 processed 4 samples in 1.437042474746704 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.864937 Rank 1 step 2 processed 4 samples in 1.4373478889465332 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.865620 Rank 5 step 2 processed 4 samples in 1.4209046363830566 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.871567 Rank 2 step 2 processed 4 samples in 1.4482154846191406 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.879498 Rank 6 step 2 processed 4 samples in 1.4534542560577393 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.888964 Rank 7 step 2 processed 4 samples in 1.437666416168213 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:39.890346 Rank 3 step 2 processed 4 samples in 1.4595756530761719 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.311217 Rank 0 step 3 processed 4 samples in 1.4581162929534912 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.312092 Rank 2 step 3 processed 4 samples in 1.4399495124816895 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.313566 Rank 5 step 3 processed 4 samples in 1.4474966526031494 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.314422 Rank 6 step 3 processed 4 samples in 1.434694528579712 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.311211 Rank 4 step 3 processed 4 samples in 1.4503426551818848 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.318728 Rank 1 step 3 processed 4 samples in 1.4535951614379883 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.323162 Rank 7 step 3 processed 4 samples in 1.4327857494354248 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:41.339936 Rank 3 step 3 processed 4 samples in 1.4491026401519775 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.749878 Rank 0 step 4 processed 4 samples in 1.4382779598236084 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.749646 Rank 1 step 4 processed 4 samples in 1.4295282363891602 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.759622 Rank 4 step 4 processed 4 samples in 1.4434914588928223 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.759677 Rank 5 step 4 processed 4 samples in 1.445906162261963 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.760392 Rank 6 step 4 processed 4 samples in 1.4456770420074463 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.762643 Rank 2 step 4 processed 4 samples in 1.450068712234497 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.767003 Rank 7 step 4 processed 4 samples in 1.4435951709747314 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:42.766916 Rank 3 step 4 processed 4 samples in 1.4258863925933838 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.486273 Rank 0 step 5 processed 4 samples in 7.736128330230713 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.489983 Maximum number of steps reached [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/main.py:297]
[INFO] 2023-06-27T21:28:50.496764 Rank 2 step 5 processed 4 samples in 7.733910799026489 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.507343 Rank 4 step 5 processed 4 samples in 7.74742317199707 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.507864 Rank 3 step 5 processed 4 samples in 7.7405922412872314 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.516752 Rank 1 step 5 processed 4 samples in 7.766550779342651 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.519272 Rank 5 step 5 processed 4 samples in 7.759366273880005 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.522207 Rank 6 step 5 processed 4 samples in 7.76110053062439 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]
[INFO] 2023-06-27T21:28:50.522231 Rank 7 step 5 processed 4 samples in 7.754213571548462 s [/usr/local/lib/python3.10/dist-packages/dlio_benchmark/utils/statscounter.py:259]

...

This will generate the logs and profiling data inside hydra_log/${model}/${data}-${time} folder.

$ hydra_log/unet3d/2023-06-27-21-27-12
0_output.json  2_output.json  4_output.json  6_output.json  dlio.log     per_epoch_stats.json
1_output.json  3_output.json  5_output.json  7_output.json  iostat.json  summary.json

BERT: Natural Language Processing Model

Reference Implementation: https://github.com/mlcommons/training/tree/master/language_model/tensorflow/bert
Framework: Tensorflow
Dataset: Multiple tfrecord files containing many samples each.
Trains in a single epoch, performs periodic checkpointing of its parameters.

model: bert

framework: tensorflow

workflow:
    generate_data: False
    train: True
    debug: False
    checkpoint: True

dataset:
    data_folder: data/bert
    format: tfrecord
    num_files_train: 500
    num_samples_per_file: 313532
    record_length: 2500
    file_prefix: part

train:
    computation_time: 0.968
    total_training_steps: 5000

reader:
    data_loader: tensorflow
    read_threads: 1
    computation_threads: 1
    transfer_size: 262144
    batch_size: 48
    file_shuffle: seed
    sample_shuffle: seed

checkpoint:
    checkpoint_folder: checkpoints/bert
    steps_between_checkpoints: 1250
    model_size: 4034713312

CosmoFlow: 3D CNN to Learn the Universe at Scale

Reference Implementation: https://github.com/mlcommons/hpc/tree/main/cosmoflow
Framework: Tensorflow Keras
Dataset: Multiple tfrecord files containing many samples each.
Trains in multiple epochs

# contents of cosmoflow.yaml
model: cosmoflow

framework: tensorflow

workflow:
    generate_data: False
    train: True

dataset:
    data_folder: ./data/cosmoflow
    num_files_train: 1024
    num_samples_per_file: 512
    record_length: 131072

reader:
    data_loader: tensorflow
    computation_threads: 8
    read_threads: 8
    batch_size: 1

train:
    epochs: 4

ResNet50: 3D Image classification

Reference Implementation: https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks
Framework: Tensorflow
Dataset: ImageNet datasets saved in tfrecords files
Trains in multiple epochs.

# contents of resnet50.yaml
model: resnet50

framework: tensorflow

workflow:
    generate_data: False
    train: True

dataset:
    num_files_train: 1024
    num_samples_per_file: 1024
    record_length: 150528
    data_folder: data/resnet50
    format: tfrecord

data_loader:
    data_loader: tensorflow
    read_threads: 8
    computation_threads: 8