dataloader
[ allennlp.data.dataloader ]
TensorDict#
TensorDict = Dict[str, Union[torch.Tensor, Dict[str, torch.Tensor]]]
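For illustration (a hypothetical batch; the field names and values below are made up), a TensorDict maps field names either to tensors or, for text fields, to dicts of tensors keyed by indexer name:
import torch
from allennlp.data.dataloader import TensorDict

# A made-up batch of two instances: a padded "tokens" text field (indexed
# under the "tokens" indexer name) and a "label" field.
batch: TensorDict = {
    "tokens": {"tokens": torch.tensor([[2, 5, 7], [3, 4, 0]])},
    "label": torch.tensor([1, 0]),
}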
allennlp_collate#
def allennlp_collate(instances: List[Instance]) -> TensorDict
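As a minimal sketch (the toy tokens and labels are purely illustrative), this function turns a list of already-indexed Instances into a single padded TensorDict:
from allennlp.data import Instance, Vocabulary, allennlp_collate
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

# Build two toy instances, index them against a vocabulary, then collate.
indexers = {"tokens": SingleIdTokenIndexer()}
instances = [
    Instance({"tokens": TextField([Token("a"), Token("b")], indexers),
              "label": LabelField("pos")}),
    Instance({"tokens": TextField([Token("c")], indexers),
              "label": LabelField("neg")}),
]
vocab = Vocabulary.from_instances(instances)
for instance in instances:
    instance.index_fields(vocab)

batch = allennlp_collate(instances)  # a TensorDict with padded "tokens" and a "label" tensor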
DataLoader#
class DataLoader(Registrable)
A DataLoader is responsible for generating batches of instances from a Dataset,
or another source of data. This is essentially just an abstraction over torch.utils.data.DataLoader.
This class only has one required method, __iter__(), that creates an iterable
of TensorDicts. Additionally, this class comes with a __len__() method
that just raises a TypeError by default. When possible, this should be overridden
to return the number of batches that will be generated by the __iter__() method.
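As a rough sketch of what a subclass looks like (the class and the registered name "simple_list_loader" below are hypothetical, not part of the library), an implementation only needs __iter__ and, ideally, __len__:
from typing import Iterator, List

from allennlp.data import Instance, allennlp_collate
from allennlp.data.dataloader import DataLoader, TensorDict

@DataLoader.register("simple_list_loader")  # hypothetical name
class SimpleListLoader(DataLoader):
    """Batches a list of already-indexed Instances held in memory."""

    def __init__(self, instances: List[Instance], batch_size: int) -> None:
        self.instances = instances
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[TensorDict]:
        for start in range(0, len(self.instances), self.batch_size):
            yield allennlp_collate(self.instances[start : start + self.batch_size])

    def __len__(self) -> int:
        # The number of batches __iter__ will generate.
        return (len(self.instances) + self.batch_size - 1) // self.batch_size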
default_implementation#
class DataLoader(Registrable):
| ...
| default_implementation = "pytorch_dataloader"
PyTorchDataLoader#
@DataLoader.register("pytorch_dataloader", constructor="from_partial_objects")
class PyTorchDataLoader(data.DataLoader, DataLoader):
| def __init__(
| self,
| dataset: data.Dataset,
| batch_size: int = 1,
| shuffle: bool = False,
| sampler: Sampler = None,
| batch_sampler: BatchSampler = None,
| num_workers: int = 0,
| collate_fn=allennlp_collate,
| pin_memory: bool = False,
| drop_last: bool = False,
| timeout: int = 0,
| worker_init_fn=None,
| multiprocessing_context: str = None,
| batches_per_epoch: int = None
| )
A registrable version of the pytorch
DataLoader.
Firstly, this class exists so that we can construct a DataLoader
from a configuration file and have a different default collate_fn.
You can use this class directly in Python code, but it is identical to using
the pytorch DataLoader with allennlp's custom collate function:
from torch.utils.data import DataLoader
from allennlp.data import allennlp_collate
# Construct a dataloader directly for a dataset which contains allennlp
# Instances which have _already_ been indexed.
my_loader = DataLoader(dataset, batch_size=32, collate_fn=allennlp_collate)
Secondly, this class adds a batches_per_epoch parameter which, if given, determines the number
of batches after which an epoch ends. If this is None, then an epoch is set to be one full pass
through your data. You might use this if you have a very large dataset and want more frequent
checkpoints and evaluations on validation data, for instance.
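For example (assuming dataset already holds indexed Instances), the following treats every 100 batches as one epoch, so the trainer validates and checkpoints more often than once per full pass:
from allennlp.data.dataloader import PyTorchDataLoader

loader = PyTorchDataLoader(dataset, batch_size=32, shuffle=True, batches_per_epoch=100)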
In a typical AllenNLP configuration file, the dataset parameter does not get an entry under
"data_loader"; it gets constructed separately.
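A rough sketch of the equivalent wiring in code, using Params (the exact parameters are illustrative; when you use a config file, the trainer does this for you and supplies the dataset as an extra argument):
from allennlp.common import Params
from allennlp.data.dataloader import DataLoader

# The params contain no "dataset" key; the already-constructed dataset is
# passed in separately, mirroring how a config file is handled.
loader = DataLoader.from_params(
    Params({"type": "pytorch_dataloader", "batch_size": 32, "shuffle": True}),
    dataset=dataset,
)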
from_partial_objects#
class PyTorchDataLoader(data.DataLoader, DataLoader):
| ...
| @classmethod
| def from_partial_objects(
| cls,
| dataset: data.Dataset,
| batch_size: int = 1,
| shuffle: bool = False,
| sampler: Lazy[Sampler] = None,
| batch_sampler: Lazy[BatchSampler] = None,
| num_workers: int = 0,
| pin_memory: bool = False,
| drop_last: bool = False,
| timeout: int = 0,
| worker_init_fn=None,
| multiprocessing_context: str = None,
| batches_per_epoch: int = None
| ) -> "PyTorchDataLoader"
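The sampler and batch_sampler arguments are Lazy because they can only be finished once the dataset is available; this constructor completes them with the dataset and then delegates to __init__. A rough, hypothetical sketch of calling it directly (in practice these objects come from a configuration file):
from allennlp.common import Lazy
from allennlp.data.dataloader import PyTorchDataLoader
from allennlp.data.samplers import BucketBatchSampler

# Assumes `dataset` is an already-indexed allennlp dataset; the lambda-based
# Lazy construction here is only for illustration.
loader = PyTorchDataLoader.from_partial_objects(
    dataset=dataset,
    batch_sampler=Lazy(lambda data_source: BucketBatchSampler(data_source, batch_size=32)),
)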