multitask_data_loader
allennlp.data.data_loaders.multitask_data_loader
maybe_shuffle_instances¶
def maybe_shuffle_instances(
    loader: DataLoader,
    shuffle: bool
) -> Iterable[Instance]
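The body is not shown in these docs; the following is a minimal sketch of what a function with this contract plausibly does, assuming a simple materialize-and-shuffle strategy (the library's actual implementation may instead shuffle within a bounded pool to limit memory use):

```python
import random
from typing import Iterable, List

from allennlp.data.data_loaders.data_loader import DataLoader
from allennlp.data.instance import Instance


def maybe_shuffle_instances(loader: DataLoader, shuffle: bool) -> Iterable[Instance]:
    if shuffle:
        # Materialize all instances so they can be shuffled as a whole.
        # (Illustrative only; a pooled/streaming shuffle is also possible.)
        instances: List[Instance] = list(loader.iter_instances())
        random.shuffle(instances)
        return instances
    else:
        return loader.iter_instances()
```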
MultiTaskDataLoader¶
@DataLoader.register("multitask")
class MultiTaskDataLoader(DataLoader):
| def __init__(
|     self,
|     reader: MultiTaskDatasetReader,
|     data_path: Dict[str, str],
|     scheduler: MultiTaskScheduler,
|     *,
|     sampler: MultiTaskEpochSampler = None,
|     instances_per_epoch: int = None,
|     num_workers: Dict[str, int] = None,
|     max_instances_in_memory: Dict[str, int] = None,
|     start_method: Dict[str, str] = None,
|     instance_queue_size: Dict[str, int] = None,
|     instance_chunk_size: Dict[str, int] = None,
|     shuffle: bool = True,
|     cuda_device: Optional[Union[int, str, torch.device]] = None
| ) -> None
A DataLoader intended for multi-task learning. The basic idea is that you use a MultiTaskDatasetReader, which takes a dictionary of DatasetReaders, keyed by some name. You use those same names for various parameters here, including the data paths that get passed to each reader. We will load each dataset and iterate over the instances in them using a MultiTaskEpochSampler and a MultiTaskScheduler. The EpochSampler says how much to use from each dataset at each epoch, and the Scheduler orders the instances in the epoch however you want. Both of these are designed to be used in conjunction with trainer Callbacks, if desired, so that the sampling and/or scheduling behavior can depend on the current state of training.

While it is not strictly required, this DataLoader was designed to be used alongside a MultiTaskModel, which can handle instances coming from different datasets. If your datasets are similar enough (say, they are all reading comprehension datasets with the same format), or your model is flexible enough, then you could feasibly use this DataLoader with a normal, non-multitask Model.

Registered as a DataLoader with name "multitask".
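For illustration, here is one way such a loader might be constructed directly in Python. The dataset names ("rc" and "nli"), the MyRcReader/MyNliReader classes, and the file paths are hypothetical placeholders, not part of the library:

```python
from allennlp.data.dataset_readers import MultiTaskDatasetReader
from allennlp.data.data_loaders import MultiTaskDataLoader
from allennlp.data.data_loaders.multitask_scheduler import HomogeneousRoundRobinScheduler

# A MultiTaskDatasetReader wraps one DatasetReader per dataset, keyed by name.
# MyRcReader / MyNliReader stand in for whatever readers your tasks actually use.
reader = MultiTaskDatasetReader(
    readers={"rc": MyRcReader(), "nli": MyNliReader()}
)

loader = MultiTaskDataLoader(
    reader=reader,
    # The same keys ("rc", "nli") select the file each underlying reader receives.
    data_path={"rc": "/data/rc/train.json", "nli": "/data/nli/train.tsv"},
    scheduler=HomogeneousRoundRobinScheduler(batch_size=16),
)
```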
Parameters¶
- reader : MultiTaskDatasetReader

- data_path : Dict[str, str]
  One file per underlying dataset reader in the MultiTaskDatasetReader, which will be passed to those readers to construct one DataLoader per dataset.

- scheduler : MultiTaskScheduler, optional (default = HomogeneousRoundRobinScheduler)
  The scheduler determines how instances are ordered within an epoch. By default, we'll select one batch of instances from each dataset in turn, trying to ensure as uniform a mix of datasets as possible. Note that if your model can handle it, using a RoundRobinScheduler is likely better than a HomogeneousRoundRobinScheduler (because it does a better job of mixing gradient signals from the various datasets), so you may want to consider switching. We use the homogeneous version as the default because it should work for any AllenNLP model, while the non-homogeneous one might not.

- sampler : MultiTaskEpochSampler, optional (default = None)
  Only used if instances_per_epoch is not None. If we need to select a subset of the data for an epoch, this sampler will tell us with what proportion we should sample from each dataset. For instance, we might want to focus more on datasets that are underperforming in some way, by having those datasets contribute more instances this epoch than other datasets.

- instances_per_epoch : int, optional (default = None)
  If not None, we will use this many instances per epoch of training, drawing from the underlying datasets according to the sampler.

- num_workers : Dict[str, int], optional (default = None)
  Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.

- max_instances_in_memory : Dict[str, int], optional (default = None)
  Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.

- start_method : Dict[str, str], optional (default = None)
  Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.

- instance_queue_size : Dict[str, int], optional (default = None)
  Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.

- instance_chunk_size : Dict[str, int], optional (default = None)
  Used when creating one MultiProcessDataLoader per dataset. If you want non-default behavior for this parameter in the DataLoader for a particular dataset, pass the corresponding value here, keyed by the dataset name.

- shuffle : bool, optional (default = True)
  If False, we will not shuffle the instances that come from each underlying data loader. You almost certainly never want to use this except when debugging.

- cuda_device : Optional[Union[int, str, torch.device]], optional (default = None)
  If given, batches will automatically be put on this device.

  Note: This should typically not be set in an AllenNLP configuration file. The Trainer will automatically call set_target_device() before iterating over batches.
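To make the per-dataset dictionary parameters concrete, here is a hedged sketch reusing the hypothetical reader from the earlier example; the dataset names, values, and the choice of UniformSampler are illustrative only:

```python
from allennlp.data.data_loaders import MultiTaskDataLoader
from allennlp.data.data_loaders.multitask_epoch_sampler import UniformSampler
from allennlp.data.data_loaders.multitask_scheduler import HomogeneousRoundRobinScheduler

loader = MultiTaskDataLoader(
    reader=reader,
    data_path={"rc": "/data/rc/train.json", "nli": "/data/nli/train.tsv"},
    scheduler=HomogeneousRoundRobinScheduler(batch_size=16),
    sampler=UniformSampler(),           # only consulted because instances_per_epoch is set
    instances_per_epoch=10_000,         # cap each epoch, drawn according to the sampler
    num_workers={"rc": 2},              # non-default workers for the "rc" loader only
    max_instances_in_memory={"rc": 5_000},  # likewise keyed by dataset name
)
```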
__iter__¶
class MultiTaskDataLoader(DataLoader):
| ...
| def __iter__(self) -> Iterator[TensorDict]
iter_instances¶
class MultiTaskDataLoader(DataLoader):
| ...
| def iter_instances(self) -> Iterator[Instance]
index_with¶
class MultiTaskDataLoader(DataLoader):
| ...
| def index_with(self, vocab: Vocabulary) -> None
set_target_device¶
class MultiTaskDataLoader(DataLoader):
| ...
| def set_target_device(self, device: torch.device) -> None
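The methods above are typically exercised in this order. A minimal sketch, continuing the hypothetical loader from the earlier examples (the vocabulary construction shown is one common pattern, not the only one):

```python
import torch
from allennlp.data import Vocabulary

# Build the vocabulary from the instances of every underlying dataset,
# then index the loader with it before iterating.
vocab = Vocabulary.from_instances(loader.iter_instances())
loader.index_with(vocab)

# The Trainer normally calls this for you; shown here only for completeness.
loader.set_target_device(torch.device("cpu"))

for batch in loader:  # __iter__ yields TensorDicts, ordered by the scheduler
    ...
```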