Creating a Data Loader Plugin

Within DLIO Benchmark we can define custom data loader implementations. This feature allows us to extend DLIO Benchmark with new data loader implementations without changing the existing code. To achieve this, developers have to take the following main steps.

  1. Write the custom data loader.

  2. Define the workload configuration.

  3. Run the workload with the custom data loader.

Write the custom data loader.

In this section, we will describe how to write a custom data loader. To write a data loader, you need to implement the BaseDataLoader class. This data loader needs to be added under <ROOT>/dlio_benchmark/plugins/experimental/src/data_loader. Complete examples can be seen at <ROOT>/dlio_benchmark/data_loader/:

  • For PyTorch: torch_data_loader.py

  • For TensorFlow: tf_data_loader.py

  • For NVIDIA DALI: dali_data_loader.py

Say we store the custom data loader for PyTorch in <ROOT>/dlio_benchmark/plugins/experimental/src/data_loader/pytorch_custom_data_loader.py:

import torch
from torch.utils.data import DataLoader

from dlio_benchmark.common.enumerations import DataLoaderType, DatasetType
from dlio_benchmark.data_loader.base_data_loader import BaseDataLoader

# MAKE SURE the name of the class is unique
class CustomTorchDataLoader(BaseDataLoader):

    def __init__(self, format_type, dataset_type, epoch_number):
        super().__init__(format_type, dataset_type, epoch_number, DataLoaderType.PYTORCH)

    def read(self):
        batch_size = self._args.batch_size if self.dataset_type is DatasetType.TRAIN else self._args.batch_size_eval
        # Define your dataset here.
        self._dataset = DataLoader(PYTORCH_DATASET,
                                   batch_size=batch_size,
                                   sampler=PYTORCH_SAMPLER,
                                   num_workers=self._args.read_threads,
                                   pin_memory=True,
                                   drop_last=True,
                                   worker_init_fn=WORKER_INIT_FN)

    def next(self):
        # THIS PART OF THE CODE NEED NOT CHANGE.
        # It iterates over the dataset and yields one batch at a time.
        super().next()
        total = self._args.training_steps if self.dataset_type is DatasetType.TRAIN else self._args.eval_steps
        for batch in self._dataset:
            yield batch

    def finalize(self):
        # Perform any cleanup as required.
        pass

Additionally, you may need to define your own PyTorch Dataset.

import math

from torch.utils.data import Dataset

from dlio_benchmark.reader.reader_factory import ReaderFactory

# MAKE SURE the name of the class is unique
class CustomTorchDataset(Dataset):

    def __init__(self, format_type, dataset_type, epoch, num_samples, num_workers, batch_size):
        self.format_type = format_type
        self.dataset_type = dataset_type
        self.epoch_number = epoch
        self.num_samples = num_samples
        self.reader = None
        self.num_images_read = 0
        self.batch_size = batch_size
        if num_workers == 0:
            self.worker_init(-1)

    def worker_init(self, worker_id):
        # If you want to reuse an existing data reader.
        self.reader = ReaderFactory.get_reader(type=self.format_type,
                                               dataset_type=self.dataset_type,
                                               thread_index=worker_id,
                                               epoch_number=self.epoch_number)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, image_idx):
        # Example call into an existing reader.
        self.num_images_read += 1
        step = int(math.ceil(self.num_images_read / self.batch_size))
        return self.reader.read_index(image_idx, step)
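For illustration, here is a minimal sketch of how the read() method of CustomTorchDataLoader could fill in the PYTORCH_DATASET, PYTORCH_SAMPLER, and WORKER_INIT_FN placeholders using the CustomTorchDataset above. The NUM_SAMPLES placeholder and the choice of RandomSampler are assumptions for this sketch; derive the sample count and sampling strategy from your own workload configuration.

from torch.utils.data import DataLoader, RandomSampler

    def read(self):
        batch_size = self._args.batch_size if self.dataset_type is DatasetType.TRAIN else self._args.batch_size_eval
        # NUM_SAMPLES is a placeholder: compute the total sample count
        # for this dataset type from your workload configuration.
        dataset = CustomTorchDataset(self.format_type, self.dataset_type,
                                     self.epoch_number, NUM_SAMPLES,
                                     self._args.read_threads, batch_size)
        self._dataset = DataLoader(dataset,
                                   batch_size=batch_size,
                                   sampler=RandomSampler(dataset),
                                   num_workers=self._args.read_threads,
                                   pin_memory=True,
                                   drop_last=True,
                                   # Each DataLoader worker initializes its own reader.
                                   worker_init_fn=dataset.worker_init)

Wiring worker_init_fn to dataset.worker_init ensures that each DataLoader worker process creates its own reader instance instead of sharing one across processes.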

Define the workload configuration.

In this section, we will detail how to create a custom workload configuration for DLIO Benchmark. The workload configurations for plugins live in <ROOT>/dlio_benchmark/plugins/experimental. You can copy an existing configuration from <ROOT>/dlio_benchmark/configs/workload and modify it for your custom data loader. The main changes to the workload configuration are:

# The rest of the configuration remains as it is.
reader:
    data_loader_classname: dlio_benchmark.plugins.experimental.src.data_loader.pytorch_custom_data_loader.CustomTorchDataLoader
    data_loader_sampler: iterative/index # CHOOSE the correct sampler.

In the above configuration, data_loader_classname should point to the fully qualified name of the class (as resolvable on the PYTHONPATH). Also, data_loader_sampler should be set to iterative if the data loader implements iterative reading, and to index if the data loader uses index-based reading. The torch_data_loader.py is an example of an index-based data loader, and tf_data_loader.py is an example of an iterative data loader.
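Since CustomTorchDataLoader builds on an index-based PyTorch Dataset, its reader section would use index. For context, a hypothetical complete workload file, modeled on the existing files under <ROOT>/dlio_benchmark/configs/workload, might look like the following; all values outside the reader section are illustrative and should be copied from the configuration you started from.

model: custom_model

framework: pytorch

workflow:
    generate_data: True
    train: True

dataset:
    data_folder: data/custom_model
    format: npz
    num_files_train: 64
    num_samples_per_file: 1

reader:
    data_loader_classname: dlio_benchmark.plugins.experimental.src.data_loader.pytorch_custom_data_loader.CustomTorchDataLoader
    data_loader_sampler: index
    batch_size: 4
    read_threads: 4

train:
    epochs: 5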

Run the workload with the custom data loader.

To run the custom data loader, we have to define the plugin folder as the custom config folder, as described in the Running DLIO page. We need to pass the path plugins/experimental/configs as the config folder.
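Assuming DLIO Benchmark's Hydra-based command line and a workload file named pytorch_custom.yaml inside that folder (the workload name here is hypothetical), the invocation might look like:

# Hypothetical invocation; see the Running DLIO page for the exact flags.
dlio_benchmark --config-dir plugins/experimental/configs workload=pytorch_custom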