interleaving_dataset_reader

allennlp.data.dataset_readers.interleaving_dataset_reader

InterleavingDatasetReader

@DatasetReader.register("interleaving")
class InterleavingDatasetReader(DatasetReader):
 | def __init__(
 |     self,
 |     readers: Dict[str, DatasetReader],
 |     dataset_field_name: str = "dataset",
 |     scheme: str = "round_robin",
 |     **kwargs
 | ) -> None

A DatasetReader that wraps multiple other dataset readers and interleaves their instances, adding a MetadataField to each instance to indicate its provenance.

Unlike most of our other dataset readers, the file_path passed into read() should be a JSON-serialized dictionary with one file path per wrapped dataset reader, keyed by the same names used in the readers dictionary.
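
For example, with wrapped readers keyed "a" and "b", the file_path argument could be built like this (a sketch; the keys and file paths are hypothetical):

import json

# One entry per wrapped reader, keyed exactly as in the readers dict
# passed to the constructor.
file_path = json.dumps({
    "a": "/path/to/dataset_a.tsv",
    "b": "/path/to/dataset_b.tsv",
})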

Registered as a DatasetReader with name "interleaving".

Parameters

  • readers : Dict[str, DatasetReader]
    The dataset readers to wrap. The keys of this dictionary will be used as the values in the MetadataField indicating provenance.
  • dataset_field_name : str, optional (default = "dataset")
    The name of the MetadataField indicating which dataset an instance came from.
  • scheme : str, optional (default = "round_robin")
    Indicates how to interleave instances. Currently the two options are "round_robin", which repeatedly cycles through the datasets, grabbing one instance from each; and "all_at_once", which yields all the instances from the first dataset, then all the instances from the second, and so on. One could also imagine some form of over- or under-sampling, although that has not been implemented. Both schemes are illustrated in the sketch after this list.
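
A minimal end-to-end sketch, wrapping two SequenceTaggingDatasetReader instances (the keys, paths, and scheme choice here are illustrative):

import json
from allennlp.data.dataset_readers import (
    InterleavingDatasetReader,
    SequenceTaggingDatasetReader,
)

reader = InterleavingDatasetReader(
    readers={"a": SequenceTaggingDatasetReader(), "b": SequenceTaggingDatasetReader()},
    scheme="round_robin",
)
file_path = json.dumps({"a": "/path/to/a.tsv", "b": "/path/to/b.tsv"})

# With "round_robin" the instances alternate a, b, a, b, ...;
# with "all_at_once" every instance from "a" comes before any from "b".
for instance in reader.read(file_path):
    print(instance.fields["dataset"].metadata)  # "a" or "b"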

text_to_instance

class InterleavingDatasetReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     dataset_key: str,
 |     *args,
 |     **kwargs
 | ) -> Instance
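
This forwards *args and **kwargs to the wrapped reader named by dataset_key and adds the provenance MetadataField to the resulting instance. A usage sketch, assuming "a" is the SequenceTaggingDatasetReader from the example above (its text_to_instance takes a token list and a tag list):

from allennlp.data.tokenizers import Token

instance = reader.text_to_instance("a", [Token("cats"), Token("sleep")], ["NNS", "VBP"])
assert instance.fields["dataset"].metadata == "a"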

apply_token_indexers

class InterleavingDatasetReader(DatasetReader):
 | ...
 | def apply_token_indexers(self, instance: Instance) -> None
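
Token indexers live on the wrapped readers, so this method delegates: it looks up which reader produced the instance (via its dataset MetadataField) and lets that reader attach its indexers. An illustrative sketch, not the verbatim source (the attribute names are assumptions):

def apply_token_indexers(self, instance: Instance) -> None:
    # The MetadataField added at read time records which wrapped
    # reader produced this instance; delegate to that reader.
    dataset_key = instance.fields[self._dataset_field_name].metadata
    self._readers[dataset_key].apply_token_indexers(instance)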