allennlp.data.dataset_readers.multiprocess_dataset_reader
class allennlp.data.dataset_readers.multiprocess_dataset_reader.MultiprocessDatasetReader(base_reader: allennlp.data.dataset_readers.dataset_reader.DatasetReader, num_workers: int, epochs_per_read: int = 1, output_queue_size: int = 1000)

    Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

    Wraps another dataset reader and uses it to read from multiple input files using multiple processes. Note that in this case the file_path passed to read() should be a glob, and that the dataset reader will return instances from all files matching the glob.

    The order in which the files are processed is a function of Numpy's random state, up to non-determinism caused by using multiple worker processes. This can be avoided by setting num_workers to 1.

    Parameters
    base_reader : DatasetReader
        Each process will use this dataset reader to read zero or more files.
    num_workers : int
        How many data-reading processes to run simultaneously.
    epochs_per_read : int, optional (default = 1)
        Normally a call to DatasetReader.read() returns a single epoch's worth of instances, and your DataIterator handles iteration over multiple epochs. However, in the multiple-process case, you might want finished workers to continue on to the next epoch while others are still finishing the previous one. Passing a value larger than 1 allows that to happen.
    output_queue_size : int, optional (default = 1000)
        The size of the queue on which read instances are placed to be yielded. You might need to increase this if you're generating instances too quickly.
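The glob-plus-workers pattern this reader implements can be illustrated with a minimal plain-Python sketch. This is not the AllenNLP implementation — the helper names (`_worker`, `read_multiprocess`) are invented for illustration, and plain text lines stand in for Instance objects — but it shows the idea: each worker process reads its share of the files matched by the glob and puts results on a shared, bounded output queue.

```python
import glob
import multiprocessing as mp


def _worker(file_paths, queue):
    # Each worker reads its assigned files and enqueues "instances"
    # (here, plain stripped lines stand in for Instance objects).
    for path in file_paths:
        with open(path) as f:
            for line in f:
                queue.put(line.strip())
    queue.put(None)  # sentinel: this worker is done


def read_multiprocess(file_glob, num_workers, output_queue_size=1000):
    """Yield items from every file matching file_glob, read by
    num_workers processes in parallel. Yield order is nondeterministic."""
    files = glob.glob(file_glob)
    queue = mp.Queue(output_queue_size)
    # Deal the matched files out to the workers round-robin.
    workers = [
        mp.Process(target=_worker, args=(files[i::num_workers], queue))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()
    finished = 0
    while finished < num_workers:
        item = queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()
```

Because the workers run concurrently, items from different files interleave arbitrarily — which is exactly why the real reader's output order is nondeterministic unless num_workers is 1.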
    read(self, file_path: str) → Iterable[allennlp.data.instance.Instance]

        Returns an Iterable containing all the instances in the specified dataset.

        If self.lazy is False, this calls self._read(), ensures that the result is a list, then returns the resulting list.

        If self.lazy is True, this returns an object whose __iter__ method calls self._read() on each iteration. In this case your implementation of _read() must also be lazy (that is, it must not load all instances into memory at once); otherwise you will get a ConfigurationError.

        In either case, the returned Iterable can be iterated over multiple times. It's unlikely that you want to override this function, but if you do, your result should likewise be repeatedly iterable.
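The lazy/non-lazy contract above can be sketched with a toy stand-in reader. The class names here (`ToyReader`, `_LazyInstances`) are illustrative, not the AllenNLP classes; the point is that the lazy branch re-invokes _read() on every __iter__, so both branches are repeatedly iterable.

```python
from typing import Iterable, Iterator


class _LazyInstances:
    """Calls its generator function afresh on every __iter__, so the
    result can be iterated over multiple times without caching."""

    def __init__(self, instance_generator):
        self._generator = instance_generator

    def __iter__(self) -> Iterator:
        return self._generator()


class ToyReader:
    def __init__(self, lazy: bool = False):
        self.lazy = lazy

    def _read(self, file_path: str) -> Iterable[int]:
        # Stand-in for parsing a real file into Instance objects.
        yield from range(3)

    def read(self, file_path: str) -> Iterable[int]:
        if self.lazy:
            # Each iteration re-invokes self._read, so nothing is cached;
            # _read itself must therefore be lazy.
            return _LazyInstances(lambda: iter(self._read(file_path)))
        # Non-lazy: materialize the generator into a list once.
        instances = self._read(file_path)
        if not isinstance(instances, list):
            instances = list(instances)
        return instances
```

Note that a bare generator would fail the "iterated over multiple times" requirement — it is exhausted after one pass — which is why the lazy branch returns a wrapper object instead of the generator itself.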
class allennlp.data.dataset_readers.multiprocess_dataset_reader.QIterable(output_queue_size, epochs_per_read, num_workers, reader, file_path)

    Bases: collections.abc.Iterable, typing.Generic

    You can't set attributes on iterators, so this is just a thin wrapper that exposes the output_queue.
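The motivation for this wrapper is that a generator object rejects attribute assignment, while an ordinary Iterable subclass does not. A minimal sketch of the idea (illustrative, using queue.Queue and a None sentinel — not the actual QIterable internals):

```python
from collections.abc import Iterable
from queue import Queue


class QueueIterable(Iterable):
    """Iterable wrapper that exposes its output_queue as an attribute —
    something a bare generator cannot do."""

    def __init__(self, output_queue_size: int):
        self.output_queue = Queue(output_queue_size)

    def __iter__(self):
        # Drain the queue until a None sentinel is seen.
        while True:
            item = self.output_queue.get()
            if item is None:
                return
            yield item
```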
class allennlp.data.dataset_readers.multiprocess_dataset_reader.logger

    Bases: object

    multiprocessing.log_to_stderr() causes some output in the logs even when this dataset reader is not used. This is a small hack to instantiate the stderr logger lazily, only when it's needed (which is only when using the MultiprocessDatasetReader).
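One way to implement such a lazy stand-in — a sketch of the technique, not necessarily how this class is written — is to defer the multiprocessing.log_to_stderr() call until the first attribute access on the wrapper:

```python
import logging
import multiprocessing


class LazyStderrLogger:
    """Quacks like a logger but only creates the real stderr logger on
    the first attribute access (e.g. the first .info() call), so merely
    importing the module produces no log output."""

    _logger = None

    @classmethod
    def _get_logger(cls):
        if cls._logger is None:
            # This is the call that emits startup noise, so delay it.
            cls._logger = multiprocessing.log_to_stderr()
            cls._logger.setLevel(logging.INFO)
        return cls._logger

    def __getattr__(self, name):
        # Delegate everything (info, warning, ...) to the real logger,
        # creating it on demand.
        return getattr(self._get_logger(), name)
```

Constructing LazyStderrLogger() is free; the stderr handler is only installed once some code actually logs through it.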