allennlp.data.dataset_readers.dataset_reader
AllennlpDataset#
class AllennlpDataset(Dataset):
| def __init__(
| self,
| instances: List[Instance],
| vocab: Vocabulary = None
| )
An AllennlpDataset is created by calling .read() on a non-lazy DatasetReader. It's essentially just a thin wrapper around a list of instances.
index_with#
class AllennlpDataset(Dataset):
| ...
| def index_with(self, vocab: Vocabulary)
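A minimal usage sketch (the reader class and file path here are hypothetical, and `dataset.instances` is assumed to be the list the dataset wraps): after reading, the dataset is indexed with a vocabulary so its instances can later be converted to tensors.

```python
from allennlp.data import Vocabulary

reader = MyDatasetReader()                    # hypothetical DatasetReader subclass
dataset = reader.read("data/train.tsv")       # returns an AllennlpDataset when lazy=False
vocab = Vocabulary.from_instances(dataset.instances)
dataset.index_with(vocab)                     # instances can now be indexed/tensorized
```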
AllennlpLazyDataset#
class AllennlpLazyDataset(IterableDataset):
| def __init__(
| self,
| instance_generator: Callable[[str], Iterable[Instance]],
| file_path: str,
| vocab: Vocabulary = None
| ) -> None
An AllennlpLazyDataset is created by calling .read() on a lazy DatasetReader.
Parameters
- instance_generator : Callable[[str], Iterable[Instance]]
  A factory function that creates an iterable of Instances from a file path. This is usually just DatasetReader._instance_iterator.
- file_path : str
  The path to pass to the instance_generator function.
- vocab : Vocabulary, optional (default = None)
  An optional vocab. This can also be set later with the .index_with method.
index_with#
class AllennlpLazyDataset(IterableDataset):
| ...
| def index_with(self, vocab: Vocabulary)
DatasetReader#
class DatasetReader(Registrable):
| def __init__(
| self,
| lazy: bool = False,
| cache_directory: Optional[str] = None,
| max_instances: Optional[int] = None,
| manual_distributed_sharding: bool = False,
| manual_multi_process_sharding: bool = False,
| serialization_dir: Optional[str] = None
| ) -> None
A DatasetReader knows how to turn a file containing a dataset into a collection of Instances. To implement your own, just override the _read(file_path) method to return an Iterable of the instances. This could be a list containing the instances or a lazy generator that returns them one at a time.

All parameters necessary to _read the data apart from the file path should be passed to the constructor of the DatasetReader.
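As an illustration, a minimal reader might look something like the sketch below. The registered name, class name, and tab-separated file format are invented for this example; only the overridden methods follow the pattern described above.

```python
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField, LabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer


@DatasetReader.register("tab_separated_classification")   # hypothetical name
class TabSeparatedClassificationReader(DatasetReader):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._tokenizer = WhitespaceTokenizer()
        self._token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Yielding one Instance at a time keeps the reader usable with lazy=True.
        with open(file_path) as data_file:
            for line in data_file:
                text, label = line.strip().split("\t")
                yield self.text_to_instance(text, label)

    def text_to_instance(self, text: str, label: str = None) -> Instance:
        tokens = self._tokenizer.tokenize(text)
        fields = {"tokens": TextField(tokens, self._token_indexers)}
        if label is not None:
            fields["label"] = LabelField(label)
        return Instance(fields)
```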
Parameters
- lazy : bool, optional (default = False)
  If this is true, instances() will return an object whose __iter__ method reloads the dataset each time it's called. Otherwise, instances() returns a list.
- cache_directory : str, optional (default = None)
  If given, we will use this directory to store a cache of already-processed Instances in every file passed to read, serialized (by default, though you can override this) as one string-formatted Instance per line. If the cache file for a given file_path exists, we read the Instances from the cache instead of re-processing the data (using _instances_from_cache_file). If the cache file does not exist, we will create it on our first pass through the data (using _instances_to_cache_file).
  Note: It is the caller's responsibility to make sure that this directory is unique for any combination of code and parameters that you use. That is, if you pass a directory here, we will use any existing cache files in that directory regardless of the parameters you set for this DatasetReader!
- max_instances : int, optional (default = None)
  If given, will stop reading after this many instances. This is a useful setting for debugging. Setting this disables caching.
- manual_distributed_sharding : bool, optional (default = False)
  By default, when used in a distributed setting, DatasetReader makes sure that each worker process only receives a subset of the data. It does this by reading the whole dataset in each worker, but filtering out the instances that are not needed. If you can implement a faster mechanism that only reads part of the data, set this to True and do the sharding yourself (see the sketch after this list).
- manual_multi_process_sharding : bool, optional (default = False)
  This is similar to the manual_distributed_sharding parameter, but applies to multi-process data loading. By default, when this reader is used by a multi-process data loader (i.e. a DataLoader with num_workers > 1), each worker will filter out all but the subset of instances it needs, so that you don't end up with duplicates.
  Note: There is really no benefit to using a multi-process DataLoader unless you can implement a faster sharding mechanism within _read(). In that case you should set manual_multi_process_sharding to True.
- serialization_dir : str, optional (default = None)
  The directory to which training output is saved, or the directory from which a model is loaded.
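The following is a rough, hypothetical sketch of what manual sharding can look like: the reader computes which shard of the file belongs to the current process/worker and skips the other lines before doing any expensive processing. The class and file format are invented for illustration; only the torch.distributed and torch.utils.data helpers are standard.

```python
from typing import Iterable

import torch.distributed as dist
from torch.utils.data import get_worker_info

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer


class ManuallyShardedReader(DatasetReader):   # hypothetical reader, for illustration only
    def __init__(self, **kwargs):
        super().__init__(
            manual_distributed_sharding=True,
            manual_multi_process_sharding=True,
            **kwargs,
        )
        self._tokenizer = WhitespaceTokenizer()
        self._token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Work out which shard this process/worker owns, covering both distributed
        # training (torch.distributed) and multi-process data loading.
        shard, num_shards = 0, 1
        if dist.is_available() and dist.is_initialized():
            shard, num_shards = dist.get_rank(), dist.get_world_size()
        worker_info = get_worker_info()
        if worker_info is not None:
            shard = shard * worker_info.num_workers + worker_info.id
            num_shards *= worker_info.num_workers

        with open(file_path) as data_file:
            for i, line in enumerate(data_file):
                if i % num_shards != shard:
                    continue  # cheap skip: lines for other shards are never tokenized
                tokens = self._tokenizer.tokenize(line.strip())
                yield Instance({"tokens": TextField(tokens, self._token_indexers)})
```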
CACHE_FILE_LOCK_TIMEOUT#
class DatasetReader(Registrable):
| ...
| CACHE_FILE_LOCK_TIMEOUT: int = 10
The number of seconds to wait for the lock on a cache file to become available.
read#
class DatasetReader(Registrable):
| ...
| def read(
| self,
| file_path: Union[Path, str]
| ) -> Union[AllennlpDataset, AllennlpLazyDataset]
Returns a dataset containing all the instances that can be read from the file path.

If self.lazy is False, this eagerly reads all instances from self._read() and returns an AllennlpDataset.

If self.lazy is True, this returns an AllennlpLazyDataset, which internally relies on the generator created from self._read() to lazily produce Instances. In this case your implementation of _read() must also be lazy (that is, not load all instances into memory at once), otherwise you will get a ConfigurationError.

In either case, the returned dataset can be iterated over multiple times. It's unlikely you want to override this function, but if you do, your result should likewise be repeatedly iterable.
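A short usage sketch of the two modes (the reader class and path are hypothetical):

```python
eager_reader = MyReader(lazy=False)                  # hypothetical reader
dataset = eager_reader.read("data/train.tsv")        # AllennlpDataset: instances in memory
print(len(dataset))                                  # supports len() and indexing

lazy_reader = MyReader(lazy=True)
lazy_dataset = lazy_reader.read("data/train.tsv")    # AllennlpLazyDataset: instances are
for instance in lazy_dataset:                        # re-generated on every iteration
    pass
```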
_read#
class DatasetReader(Registrable):
| ...
| def _read(self, file_path: str) -> Iterable[Instance]
Reads the instances from the given file_path and returns them as an Iterable (which could be a list or could be a generator). You are strongly encouraged to use a generator, so that users can read a dataset in a lazy way, if they so choose.
text_to_instance#
class DatasetReader(Registrable):
| ...
| def text_to_instance(self, *inputs) -> Instance
Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
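For example, with the hypothetical reader sketched earlier, serving-time code can build an (unlabeled) Instance directly from raw text:

```python
reader = TabSeparatedClassificationReader()   # hypothetical reader from the sketch above
instance = reader.text_to_instance("a new sentence to classify")
# The unlabeled instance can now be handed to a model or Predictor for inference.
```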
serialize_instance#
class DatasetReader(Registrable):
| ...
| def serialize_instance(self, instance: Instance) -> str
Serializes an Instance to a string. We use this for caching the processed data.

The default implementation is to use jsonpickle. If you would like some other format for your pre-processed data, override this method.
deserialize_instance#
class DatasetReader(Registrable):
| ...
| def deserialize_instance(self, string: str) -> Instance
Deserializes an Instance from a string. We use this when reading processed data from a cache.

The default implementation is to use jsonpickle. If you would like some other format for your pre-processed data, override this method.
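If jsonpickle doesn't suit your data, one possible (hypothetical) pattern is to cache only the raw inputs and rebuild each Instance with text_to_instance when reading the cache back. The sketch below subclasses the hypothetical reader from earlier and assumes whitespace tokenization, so joining tokens with spaces is lossless; that assumption won't hold for every tokenizer.

```python
import json

from allennlp.data import Instance


class JsonCachingReader(TabSeparatedClassificationReader):   # hypothetical subclass
    def serialize_instance(self, instance: Instance) -> str:
        # Store just the raw inputs; text_to_instance can rebuild the rest.
        return json.dumps({
            "text": " ".join(token.text for token in instance["tokens"].tokens),
            "label": instance["label"].label if "label" in instance else None,
        })

    def deserialize_instance(self, string: str) -> Instance:
        blob = json.loads(string)
        return self.text_to_instance(blob["text"], blob["label"])
```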