.. _custom_data_loader:

Creating a Data Loader Plugin
==============================

DLIO Benchmark lets us define custom data loader implementations. This makes it easy to extend DLIO Benchmark with new data loaders without changing existing code. To do so, developers take the following main steps.

1. Write their custom data loader.
2. Define workflow configuration.
3. Run the workload with custom data loader.

Write their custom data loader.
--------------------------------

In this section, we describe how to write the custom data loader.
To write a data loader, you need to implement the `BaseDataLoader` class.
The new data loader needs to be added to `/dlio_benchmark/plugins/experimental/src/data_loader`.
Complete examples can be seen in `/dlio_benchmark/data_loader/`:

- For PyTorch: torch_data_loader.py
- For TensorFlow: tf_data_loader.py
- For Nvidia Dali: dali_data_loader.py

Say we store the custom data loader for PyTorch in `/dlio_benchmark/plugins/experimental/src/data_loader/pytorch_custom_data_loader.py`.

.. code-block:: python

    from torch.utils.data import DataLoader

    from dlio_benchmark.common.enumerations import DataLoaderType, DatasetType
    from dlio_benchmark.data_loader.base_data_loader import BaseDataLoader

    # MAKE SURE the name of class is unique
    class CustomTorchDataLoader(BaseDataLoader):
        def __init__(self, format_type, dataset_type, epoch_number):
            super().__init__(format_type, dataset_type, epoch_number, DataLoaderType.PYTORCH)

        def read(self):
            batch_size = self._args.batch_size if self.dataset_type is DatasetType.TRAIN else self._args.batch_size_eval
            # Define your dataset definition here.
            # PYTORCH_DATASET, PYTORCH_SAMPLER and WORKER_INIT_FN are placeholders
            # that you replace with your own dataset, sampler and worker init function.
            self._dataset = DataLoader(PYTORCH_DATASET,
                                       batch_size=batch_size,
                                       sampler=PYTORCH_SAMPLER,
                                       num_workers=self._args.read_threads,
                                       pin_memory=True,
                                       drop_last=True,
                                       worker_init_fn=WORKER_INIT_FN)

        def next(self):
            # THIS PART OF CODE NEED NOT CHANGE
            # This iterates and gets the batch of images.
            super().next()
            total = self._args.training_steps if self.dataset_type is DatasetType.TRAIN else self._args.eval_steps
            for batch in self._dataset:
                yield batch

        def finalize(self):
            # Perform any cleanup as required.
            pass

Additionally, you may need to define your own PyTorch Dataset.

.. code-block:: python

    import math

    from torch.utils.data import Dataset

    # Import path assumes the standard dlio_benchmark package layout.
    from dlio_benchmark.reader.reader_factory import ReaderFactory

    # MAKE SURE the name of class is unique
    class CustomTorchDataset(Dataset):
        def __init__(self, format_type, dataset_type, epoch, num_samples, num_workers, batch_size):
            self.format_type = format_type
            self.dataset_type = dataset_type
            self.epoch_number = epoch
            self.num_samples = num_samples
            self.reader = None
            self.num_images_read = 0
            self.batch_size = batch_size
            # With no worker processes, initialize the reader in the main process.
            if num_workers == 0:
                self.worker_init(-1)

        def worker_init(self, worker_id):
            # If you want to use an existing data reader.
            self.reader = ReaderFactory.get_reader(type=self.format_type,
                                                   dataset_type=self.dataset_type,
                                                   thread_index=worker_id,
                                                   epoch_number=self.epoch_number)

        def __len__(self):
            return self.num_samples

        def __getitem__(self, image_idx):
            # Example existing reader call.
            self.num_images_read += 1
            step = int(math.ceil(self.num_images_read / self.batch_size))
            return self.reader.read_index(image_idx, step)

Define workflow configuration.
------------------------------

In this section, we detail how to create a custom workflow configuration for DLIO Benchmark.
The workload configuration for plugins exists in `/dlio_benchmark/plugins/experimental`.
You can copy an existing configuration from `/dlio_benchmark/configs/workload` and modify it for your custom data loader.
The main changes to the workflow configuration are:

.. code-block:: yaml

    # Rest remains as it is
    reader:
      data_loader_classname: dlio_benchmark.plugins.experimental.src.data_loader.pytorch_custom_data_loader.CustomTorchDataLoader
      data_loader_sampler: iterative/index # CHOOSE the correct sampler.

In the above configuration, `data_loader_classname` should point to the fully qualified name (FQN) of the class, as it resolves on the PYTHONPATH.
Also, `data_loader_sampler` should be set to `iterative` if the data loader implements iterative reading, and to `index` if the data loader uses index-based reading.
`torch_data_loader.py` is an example of an index-based data loader and `tf_data_loader.py` is an example of an iterative data loader.

Run the workload with custom data loader.
------------------------------------------

To run the custom data loader, we have to define the plugin folder as the custom config folder.
This is described in the :ref:`run` page.
We need to pass `plugins/experimental/configs` as the path.
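
Before launching the run, it can be useful to confirm that the value of `data_loader_classname` is importable from the current PYTHONPATH.
The sketch below is only an illustration and is not DLIO Benchmark's own loading code: `resolve_class` is a hypothetical helper, and a standard-library class stands in for the plugin class so that the snippet runs without DLIO Benchmark installed.

.. code-block:: python

    import importlib


    def resolve_class(fqn):
        """Import the module part of a dotted path and return the named attribute.

        Hypothetical helper for illustration; not part of DLIO Benchmark.
        """
        module_name, _, class_name = fqn.rpartition(".")
        if not module_name:
            raise ValueError(f"expected a dotted path, got {fqn!r}")
        module = importlib.import_module(module_name)
        return getattr(module, class_name)


    # Stand-in from the standard library so the sketch runs anywhere.
    cls = resolve_class("collections.OrderedDict")
    print(cls.__name__)  # prints "OrderedDict"

Passing the plugin's dotted path from the YAML file instead of the stand-in should return `CustomTorchDataLoader` when the package is on the PYTHONPATH. An `ImportError` or `AttributeError` suggests that the module path or the class name in the configuration is wrong.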