allennlp.data.dataset_readers.multiprocess_dataset_reader
class allennlp.data.dataset_readers.multiprocess_dataset_reader.MultiprocessDatasetReader(base_reader: allennlp.data.dataset_readers.dataset_reader.DatasetReader, num_workers: int, epochs_per_read: int = 1, output_queue_size: int = 1000)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Wraps another dataset reader and uses it to read from multiple input files using multiple processes. Note that in this case the file_path passed to read() should be a glob, and that the dataset reader will return instances from all files matching the glob.

The order in which the files are processed is a function of NumPy's random state, up to the non-determinism caused by using multiple worker processes. This can be avoided by setting num_workers to 1.

Parameters
base_reader : DatasetReader
    Each process will use this dataset reader to read zero or more files.
num_workers : int
    How many data-reading processes to run simultaneously.
epochs_per_read : int, optional (default = 1)
    Normally a call to DatasetReader.read() returns a single epoch's worth of instances, and your DataIterator handles iteration over multiple epochs. However, in the multi-process case, you may want finished workers to continue on to the next epoch while others are still finishing the previous one. Passing a value larger than 1 allows that to happen.
output_queue_size : int, optional (default = 1000)
    The size of the queue on which read instances are placed to be yielded. You might need to increase this if you're generating instances too quickly.
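The mechanism described above can be sketched with the standard library. This is an illustrative sketch, not AllenNLP code: threads stand in for the worker processes to keep it self-contained, and the names read_worker and read_all are hypothetical. Each worker reads the shards assigned to it and pushes items onto a bounded output queue, with a sentinel to signal completion.

```python
import glob
import queue
import threading

def read_worker(file_paths, output_queue):
    # Plays the role of one worker running the wrapped base_reader
    # over the shards assigned to it.
    for path in file_paths:
        with open(path) as f:
            for line in f:
                output_queue.put(line.strip())
    output_queue.put(None)  # sentinel: this worker is finished

def read_all(pattern, num_workers=2, output_queue_size=1000):
    # The file_path is a glob; instances from all matching files are yielded.
    files = glob.glob(pattern)
    output_queue = queue.Queue(output_queue_size)
    # Round-robin the matched files across the workers.
    workers = [
        threading.Thread(target=read_worker,
                         args=(files[i::num_workers], output_queue))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    finished = 0
    while finished < num_workers:
        item = output_queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()
```

The bounded queue is why output_queue_size matters: if consumers fall behind, workers block on put() until space frees up.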
read(self, file_path: str) → Iterable[allennlp.data.instance.Instance]

Returns an Iterable containing all the instances in the specified dataset.

If self.lazy is False, this calls self._read(), ensures that the result is a list, then returns the resulting list.

If self.lazy is True, this returns an object whose __iter__ method calls self._read() each iteration. In this case your implementation of _read() must also be lazy (that is, it must not load all instances into memory at once); otherwise you will get a ConfigurationError.

In either case, the returned Iterable can be iterated over multiple times. It's unlikely you want to override this function, but if you do, your result should likewise be repeatedly iterable.
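The "repeatedly iterable" contract above can be sketched as follows. LazyInstances is an illustrative name, not the AllenNLP class: each call to __iter__ re-invokes the underlying generator function (the role of self._read), so the same object can be iterated more than once, unlike a plain generator.

```python
class LazyInstances:
    def __init__(self, instance_generator):
        # instance_generator is a zero-argument callable that returns a
        # fresh iterator each time it is called.
        self.instance_generator = instance_generator

    def __iter__(self):
        return iter(self.instance_generator())

dataset = LazyInstances(lambda: (i * i for i in range(3)))
assert list(dataset) == [0, 1, 4]
assert list(dataset) == [0, 1, 4]  # iterable again, unlike a bare generator
```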
class allennlp.data.dataset_readers.multiprocess_dataset_reader.QIterable(output_queue_size, epochs_per_read, num_workers, reader, file_path)

Bases: collections.abc.Iterable, typing.Generic

You can't set attributes on iterators, so this is just a dumb wrapper that exposes the output_queue.
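Why such a wrapper is needed: a generator object has no writable attributes, but an ordinary Iterable subclass does, so the queue can be exposed alongside iteration. QueueIterable below is a hypothetical minimal sketch, not the AllenNLP implementation.

```python
import queue
from collections.abc import Iterable

class QueueIterable(Iterable):
    def __init__(self, output_queue_size):
        # A plain class instance can carry attributes, so the queue
        # is reachable from outside (e.g. by the producer workers).
        self.output_queue = queue.Queue(output_queue_size)

    def __iter__(self):
        # Yield items until a None sentinel is seen.
        while (item := self.output_queue.get()) is not None:
            yield item
```

By contrast, `(i for i in range(3)).output_queue = ...` raises AttributeError, which is the limitation the docstring alludes to.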
class allennlp.data.dataset_readers.multiprocess_dataset_reader.logger

Bases: object

multiprocessing.log_to_stderr causes some output in the logs even when we don't use this dataset reader. This is a small hack to instantiate the stderr logger lazily, only when it's needed (which is only when using the MultiprocessDatasetReader).
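The lazy-instantiation hack can be sketched with __getattr__: the underlying logger is only created on first attribute access, so merely importing the module has no logging side effects. This is a hypothetical sketch; it substitutes logging.getLogger for the multiprocessing.log_to_stderr() call the real class defers, to keep the example itself side-effect free.

```python
import logging

class LazyLogger:
    _instance = None

    def __getattr__(self, name):
        # __getattr__ only fires when normal lookup fails, i.e. on the
        # first use of a logger method such as .info() or .warning().
        if LazyLogger._instance is None:
            # The real reader would call multiprocessing.log_to_stderr() here.
            LazyLogger._instance = logging.getLogger("multiprocess_reader")
        return getattr(LazyLogger._instance, name)

logger = LazyLogger()
logger.info("the underlying logger is created on this first access")
```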