
multitask_data_loader

allennlp.data.data_loaders.multitask_data_loader

maybe_shuffle_instances

def maybe_shuffle_instances(
    loader: DataLoader,
    shuffle: bool
) -> Iterable[Instance]
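
The contract of this helper can be sketched in plain Python: materialize the loader's instances and shuffle them only when asked. The standalone `maybe_shuffle` below is an illustrative stand-in, not the library's implementation; it takes a plain iterable instead of a `DataLoader`, and eagerly builds a list (the real helper can yield lazily when not shuffling).

```python
import random
from typing import Iterable, List

def maybe_shuffle(items: Iterable, shuffle: bool) -> List:
    """Illustrative sketch: return the instances in order, or shuffled if requested."""
    result = list(items)  # materialize so we can shuffle in place
    if shuffle:
        random.shuffle(result)
    return result

ordered = maybe_shuffle(range(5), shuffle=False)   # [0, 1, 2, 3, 4]
shuffled = maybe_shuffle(range(5), shuffle=True)   # same elements, possibly reordered
```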

MultiTaskDataLoader

@DataLoader.register("multitask")
class MultiTaskDataLoader(DataLoader):
 | def __init__(
 |     self,
 |     reader: MultiTaskDatasetReader,
 |     data_path: Dict[str, str],
 |     scheduler: MultiTaskScheduler,
 |     *,
 |     sampler: MultiTaskEpochSampler = None,
 |     instances_per_epoch: int = None,
 |     num_workers: Dict[str, int] = None,
 |     max_instances_in_memory: Dict[str, int] = None,
 |     start_method: Dict[str, str] = None,
 |     instance_queue_size: Dict[str, int] = None,
 |     instance_chunk_size: Dict[str, int] = None,
 |     shuffle: bool = True,
 |     cuda_device: Optional[Union[int, str, torch.device]] = None
 | ) -> None

A DataLoader intended for multi-task learning. The basic idea is that you use a MultiTaskDatasetReader, which takes a dictionary of DatasetReaders, keyed by some name. You use those same names for various parameters here, including the data paths that get passed to each reader. We will load each dataset and iterate over instances in them using a MultiTaskEpochSampler and a MultiTaskScheduler. The EpochSampler says how much to use from each dataset at each epoch, and the Scheduler orders the instances in the epoch however you want. Both of these are designed to be used in conjunction with trainer Callbacks, if desired, to have the sampling and/or scheduling behavior be dependent on the current state of training.

While it is not necessarily required, this DataLoader was designed to be used alongside a MultiTaskModel, which can handle instances coming from different datasets. If your datasets are similar enough (say, they are all reading comprehension datasets with the same format), or your model is flexible enough, then you could feasibly use this DataLoader with a normal, non-multitask Model.

Registered as a DataLoader with name "multitask".
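
As a hedged illustration, a configuration fragment using this registered name might look like the following. The dataset names ("qa", "nli"), file paths, and batch size are hypothetical placeholders, and we assume the default scheduler is registered as "homogeneous_roundrobin"; the `reader` argument is supplied from the top-level `dataset_reader` in a full configuration rather than written here.

```jsonnet
"data_loader": {
    "type": "multitask",
    "data_path": {
        // One path per underlying dataset reader, keyed by dataset name.
        "qa": "/path/to/qa_train.jsonl",
        "nli": "/path/to/nli_train.jsonl"
    },
    "scheduler": {
        "type": "homogeneous_roundrobin",
        "batch_size": 32
    }
}
```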

Parameters

  • reader : MultiTaskDatasetReader
  • data_path : Dict[str, str]
    One file per underlying dataset reader in the MultiTaskDatasetReader, which will be passed to those readers to construct one DataLoader per dataset.
  • scheduler : MultiTaskScheduler, optional (default = HomogeneousRoundRobinScheduler)
The scheduler determines how instances are ordered within an epoch. By default, we'll select one batch of instances from each dataset in turn, trying to ensure as uniform a mix of datasets as possible. Note that if your model can handle it, using a RoundRobinScheduler is likely better than a HomogeneousRoundRobinScheduler (because it does a better job mixing gradient signals from various datasets), so you may want to consider switching. We use the homogeneous version as default because it should work for any AllenNLP model, while the non-homogeneous one might not.
  • sampler : MultiTaskEpochSampler, optional (default = None)
    Only used if instances_per_epoch is not None. If we need to select a subset of the data for an epoch, this sampler will tell us with what proportion we should sample from each dataset. For instance, we might want to focus more on datasets that are underperforming in some way, by having those datasets contribute more instances this epoch than other datasets.
  • instances_per_epoch : int, optional (default = None)
    If not None, we will use this many instances per epoch of training, drawing from the underlying datasets according to the sampler.
  • num_workers : Dict[str, int], optional (default = None)
    Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.
  • max_instances_in_memory : Dict[str, int], optional (default = None)
    Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.
  • start_method : Dict[str, str], optional (default = None)
    Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.
  • instance_queue_size : Dict[str, int], optional (default = None)
    Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.
  • instance_chunk_size : Dict[str, int], optional (default = None)
    Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.
  • shuffle : bool, optional (default = True)
If False, we will not shuffle the instances that come from each underlying data loader. You almost certainly do not want to disable shuffling except when debugging.
  • cuda_device : Optional[Union[int, str, torch.device]], optional (default = None)
    If given, batches will automatically be put on this device.

    Note

    This should typically not be set in an AllenNLP configuration file. The Trainer will automatically call set_target_device() before iterating over batches.
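
The difference between the two schedulers mentioned above can be illustrated in plain Python. This is a schematic of the ordering idea only, not AllenNLP's implementation: `homogeneous_round_robin` keeps each batch single-dataset and alternates batches between datasets, while `round_robin` interleaves individual instances so a single batch can mix datasets.

```python
from typing import Dict, List, Tuple

def homogeneous_round_robin(datasets: Dict[str, List], batch_size: int) -> List[List[Tuple[str, object]]]:
    """Each batch holds instances from a single dataset; batches alternate between datasets."""
    # Split each dataset into homogeneous batches, tagged with the dataset name.
    per_dataset = [
        [[(name, x) for x in items[i:i + batch_size]]
         for i in range(0, len(items), batch_size)]
        for name, items in datasets.items()
    ]
    # Take one batch from each dataset in turn until all are exhausted.
    batches, iters = [], [iter(b) for b in per_dataset]
    while iters:
        alive = []
        for it in iters:
            batch = next(it, None)
            if batch is not None:
                batches.append(batch)
                alive.append(it)
        iters = alive
    return batches

def round_robin(datasets: Dict[str, List]) -> List[Tuple[str, object]]:
    """Interleave individual instances, so a single batch can mix datasets."""
    result, iters = [], [iter([(name, x) for x in items]) for name, items in datasets.items()]
    while iters:
        alive = []
        for it in iters:
            item = next(it, None)
            if item is not None:
                result.append(item)
                alive.append(it)
        iters = alive
    return result

data = {"a": [1, 2, 3, 4], "b": [10, 20]}
homogeneous_round_robin(data, batch_size=2)
# → [[('a', 1), ('a', 2)], [('b', 10), ('b', 20)], [('a', 3), ('a', 4)]]
round_robin(data)
# → [('a', 1), ('b', 10), ('a', 2), ('b', 20), ('a', 3), ('a', 4)]
```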

__iter__

class MultiTaskDataLoader(DataLoader):
 | ...
 | def __iter__(self) -> Iterator[TensorDict]

iter_instances

class MultiTaskDataLoader(DataLoader):
 | ...
 | def iter_instances(self) -> Iterator[Instance]

The only external contract for this method is that it iterates over instances individually; it doesn't actually specify anything about batching or anything else. The implication is that you iterate over all instances in the dataset, in an arbitrary order. The only external uses of this method are in vocabulary construction (the MultiProcessDataLoader uses this function internally when constructing batches, but that's an implementation detail).

So, the only thing we need to do here is iterate over all instances from all datasets, and that's sufficient. We won't be using this for batching, because that requires some complex, configurable scheduling.

The underlying data loaders here could be using multiprocessing; we don't need to worry about that in this class. Caching is also handled by the underlying data loaders.
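
Under the contract described above, an equivalent sketch in plain Python is simply chaining the per-dataset iterators. The loaders here are stand-in lists; the real class delegates to its underlying MultiProcessDataLoaders.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator

def iter_instances(loaders: Dict[str, Iterable]) -> Iterator:
    """Yield every instance from every underlying loader, one at a time.

    Order across datasets is arbitrary as far as callers are concerned;
    batching and scheduling are handled elsewhere (in __iter__).
    """
    return chain.from_iterable(loaders.values())

loaders = {"qa": ["q1", "q2"], "nli": ["n1"]}
list(iter_instances(loaders))
# → ['q1', 'q2', 'n1']
```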

index_with

class MultiTaskDataLoader(DataLoader):
 | ...
 | def index_with(self, vocab: Vocabulary) -> None

set_target_device

class MultiTaskDataLoader(DataLoader):
 | ...
 | def set_target_device(self, device: torch.device) -> None