
sharded_dataset_reader

allennlp.data.dataset_readers.sharded_dataset_reader

ShardedDatasetReader

@DatasetReader.register("sharded")
class ShardedDatasetReader(DatasetReader):
 | def __init__(self, base_reader: DatasetReader, **kwargs) -> None

Wraps another dataset reader and uses it to read from multiple input files.

Note that the file_path passed to read() should be either a glob path or a path or URL to an archive file ('.zip' or '.tar.gz').

The dataset reader will return instances from all files matching the glob, or all files within the archive.

The files are processed in a deterministic order so that instances can be filtered according to worker rank in distributed training or multi-process data loading. In either case, the number of file shards should ideally be a multiple of the number of workers, and each file should produce roughly the same number of instances.

Registered as a DatasetReader with name "sharded".

Parameters

  • base_reader : DatasetReader
    Reader with a read method that accepts a single file.
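
A minimal usage sketch (the TextClassificationJsonReader base reader and the data/shard-*.jsonl glob below are illustrative assumptions, not requirements of this API):

from allennlp.data.dataset_readers import (
    ShardedDatasetReader,
    TextClassificationJsonReader,
)

# Wrap an ordinary reader so that a single read() call can consume many shard files.
base_reader = TextClassificationJsonReader()
reader = ShardedDatasetReader(base_reader=base_reader)

# file_path is a glob here; every matching shard is read with the base reader.
# An archive path such as "data/shards.tar.gz" would work the same way.
for instance in reader.read("data/shard-*.jsonl"):
    print(instance)

The same reader can be selected from a configuration file via the registered name "sharded", with the wrapped reader supplied as the base_reader parameter.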

text_to_instance

class ShardedDatasetReader(DatasetReader):
 | ...
 | def text_to_instance(self, *args, **kwargs) -> Instance

Delegates to the base reader's text_to_instance.
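
Continuing the sketch above, the accepted arguments are simply those of the wrapped reader (the keyword names here belong to the assumed TextClassificationJsonReader, not to ShardedDatasetReader itself):

# Forwarded verbatim to the base reader's text_to_instance.
instance = reader.text_to_instance(text="a short movie review", label="pos")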

apply_token_indexers

class ShardedDatasetReader(DatasetReader):
 | ...
 | def apply_token_indexers(self, instance: Instance) -> None
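
Delegates to the base reader's apply_token_indexers.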