allennlp.data.dataset_readers.dataset_reader
AllennlpDataset#
class AllennlpDataset(Dataset):
| def __init__(
| self,
| instances: List[Instance],
| vocab: Vocabulary = None
| )
An AllennlpDataset is created by calling .read() on a non-lazy DatasetReader. It's essentially just a thin wrapper around a list of instances.
index_with#
class AllennlpDataset(Dataset):
| ...
| def index_with(self, vocab: Vocabulary)
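A minimal usage sketch (the reader class and file path here are hypothetical, and `dataset.instances` is assumed to be the list the dataset wraps): after reading, the dataset is indexed with a vocabulary so its instances can later be converted to tensors.

```python
from allennlp.data import Vocabulary

reader = MyDatasetReader()                    # hypothetical DatasetReader subclass
dataset = reader.read("data/train.tsv")       # returns an AllennlpDataset when lazy=False
vocab = Vocabulary.from_instances(dataset.instances)
dataset.index_with(vocab)                     # instances can now be indexed/tensorized
```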
AllennlpLazyDataset#
class AllennlpLazyDataset(IterableDataset):
| def __init__(
| self,
| instance_generator: Callable[[str], Iterable[Instance]],
| file_path: str,
| vocab: Vocabulary = None
| ) -> None
An AllennlpLazyDataset is created by calling .read() on a lazy DatasetReader.
Parameters
- instance_generator : Callable[[str], Iterable[Instance]]
  A factory function that creates an iterable of Instances from a file path. This is usually just DatasetReader._instance_iterator.
- file_path : str
  The path to pass to the instance_generator function.
- vocab : Vocabulary, optional (default = None)
  An optional vocab. This can also be set later with the .index_with method.
index_with#
class AllennlpLazyDataset(IterableDataset):
| ...
| def index_with(self, vocab: Vocabulary)
DatasetReader#
class DatasetReader(Registrable):
| def __init__(
| self,
| lazy: bool = False,
| cache_directory: Optional[str] = None,
| max_instances: Optional[int] = None,
| manual_distributed_sharding: bool = False,
| manual_multi_process_sharding: bool = False,
| serialization_dir: Optional[str] = None
| ) -> None
A DatasetReader knows how to turn a file containing a dataset into a collection of Instances. To implement your own, just override the _read(file_path) method to return an Iterable of the instances. This could be a list containing the instances or a lazy generator that returns them one at a time.

All parameters necessary to _read the data apart from the file path should be passed to the constructor of the DatasetReader.
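As an illustration, a minimal reader might look something like the sketch below. The registered name, class name, and tab-separated file format are invented for this example; only the overridden methods follow the pattern described above.

```python
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField, LabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer


@DatasetReader.register("tab_separated_classification")   # hypothetical name
class TabSeparatedClassificationReader(DatasetReader):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._tokenizer = WhitespaceTokenizer()
        self._token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Yielding one Instance at a time keeps the reader usable with lazy=True.
        with open(file_path) as data_file:
            for line in data_file:
                text, label = line.strip().split("\t")
                yield self.text_to_instance(text, label)

    def text_to_instance(self, text: str, label: str = None) -> Instance:
        tokens = self._tokenizer.tokenize(text)
        fields = {"tokens": TextField(tokens, self._token_indexers)}
        if label is not None:
            fields["label"] = LabelField(label)
        return Instance(fields)
```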
Parameters
- lazy : bool, optional (default = False)
  If this is true, instances() will return an object whose __iter__ method reloads the dataset each time it's called. Otherwise, instances() returns a list.
- cache_directory : str, optional (default = None)
  If given, we will use this directory to store a cache of already-processed Instances in every file passed to read, serialized (by default, though you can override this) as one string-formatted Instance per line. If the cache file for a given file_path exists, we read the Instances from the cache instead of re-processing the data (using _instances_from_cache_file). If the cache file does not exist, we will create it on our first pass through the data (using _instances_to_cache_file).
  Note: It is the caller's responsibility to make sure that this directory is unique for any combination of code and parameters that you use. That is, if you pass a directory here, we will use any existing cache files in that directory regardless of the parameters you set for this DatasetReader!
- max_instances : int, optional (default = None)
  If given, will stop reading after this many instances. This is a useful setting for debugging. Setting this disables caching.
- manual_distributed_sharding : bool, optional (default = False)
  By default, when used in a distributed setting, DatasetReader makes sure that each worker process only receives a subset of the data. It does this by reading the whole dataset in each worker, but filtering out the instances that are not needed. If you can implement a faster mechanism that only reads part of the data, set this to True and do the sharding yourself (see the sketch after this list).
- manual_multi_process_sharding : bool, optional (default = False)
  This is similar to the manual_distributed_sharding parameter, but applies to multi-process data loading. By default, when this reader is used by a multi-process data loader (i.e. a DataLoader with num_workers > 1), each worker will filter out all but the subset of instances it needs, so that you don't end up with duplicates.
  Note: There is really no benefit to using a multi-process DataLoader unless you can implement a faster sharding mechanism within _read(). In that case you should set manual_multi_process_sharding to True.
- serialization_dir : str, optional (default = None)
  The directory to which training output is saved, or the directory from which a model is loaded.
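The following is a rough, hypothetical sketch of what manual sharding can look like: the reader computes which shard of the file belongs to the current process/worker and skips the other lines before doing any expensive processing. The class and file format are invented for illustration; only the torch.distributed and torch.utils.data helpers are standard.

```python
from typing import Iterable

import torch.distributed as dist
from torch.utils.data import get_worker_info

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer


class ManuallyShardedReader(DatasetReader):   # hypothetical reader, for illustration only
    def __init__(self, **kwargs):
        super().__init__(
            manual_distributed_sharding=True,
            manual_multi_process_sharding=True,
            **kwargs,
        )
        self._tokenizer = WhitespaceTokenizer()
        self._token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Work out which shard this process/worker owns, covering both distributed
        # training (torch.distributed) and multi-process data loading.
        shard, num_shards = 0, 1
        if dist.is_available() and dist.is_initialized():
            shard, num_shards = dist.get_rank(), dist.get_world_size()
        worker_info = get_worker_info()
        if worker_info is not None:
            shard = shard * worker_info.num_workers + worker_info.id
            num_shards *= worker_info.num_workers

        with open(file_path) as data_file:
            for i, line in enumerate(data_file):
                if i % num_shards != shard:
                    continue  # cheap skip: lines for other shards are never tokenized
                tokens = self._tokenizer.tokenize(line.strip())
                yield Instance({"tokens": TextField(tokens, self._token_indexers)})
```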
CACHE_FILE_LOCK_TIMEOUT#
class DatasetReader(Registrable):
| ...
| CACHE_FILE_LOCK_TIMEOUT: int = 10
The number of seconds to wait for the lock on a cache file to become available.
read#
class DatasetReader(Registrable):
| ...
| def read(
| self,
| file_path: Union[Path, str]
| ) -> Union[AllennlpDataset, AllennlpLazyDataset]
Returns a dataset containing all the instances that can be read from the file path.

If self.lazy is False, this eagerly reads all instances from self._read() and returns an AllennlpDataset.

If self.lazy is True, this returns an AllennlpLazyDataset, which internally relies on the generator created from self._read() to lazily produce Instances. In this case your implementation of _read() must also be lazy (that is, not load all instances into memory at once), otherwise you will get a ConfigurationError.

In either case, the returned dataset can be iterated over multiple times. It's unlikely you want to override this function, but if you do, your result should likewise be repeatedly iterable.
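A short usage sketch of the two modes (the reader class and path are hypothetical):

```python
eager_reader = MyReader(lazy=False)                  # hypothetical reader
dataset = eager_reader.read("data/train.tsv")        # AllennlpDataset: instances in memory
print(len(dataset))                                  # supports len() and indexing

lazy_reader = MyReader(lazy=True)
lazy_dataset = lazy_reader.read("data/train.tsv")    # AllennlpLazyDataset: instances are
for instance in lazy_dataset:                        # re-generated on every iteration
    pass
```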
_read#
class DatasetReader(Registrable):
| ...
| def _read(self, file_path: str) -> Iterable[Instance]
Reads the instances from the given file_path and returns them as an Iterable (which could be a list or could be a generator). You are strongly encouraged to use a generator, so that users can read a dataset in a lazy way, if they so choose.
text_to_instance#
class DatasetReader(Registrable):
| ...
| def text_to_instance(self, *inputs) -> Instance
Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
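For example, with the hypothetical reader sketched earlier, serving-time code can build an (unlabeled) Instance directly from raw text:

```python
reader = TabSeparatedClassificationReader()   # hypothetical reader from the sketch above
instance = reader.text_to_instance("a new sentence to classify")
# The unlabeled instance can now be handed to a model or Predictor for inference.
```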
serialize_instance#
class DatasetReader(Registrable):
| ...
| def serialize_instance(self, instance: Instance) -> str
Serializes an Instance to a string. We use this for caching the processed data.

The default implementation is to use jsonpickle. If you would like some other format for your pre-processed data, override this method.
deserialize_instance#
class DatasetReader(Registrable):
| ...
| def deserialize_instance(self, string: str) -> Instance
Deserializes an Instance from a string. We use this when reading processed data from a cache.

The default implementation is to use jsonpickle. If you would like some other format for your pre-processed data, override this method.
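If jsonpickle doesn't suit your data, one possible (hypothetical) pattern is to cache only the raw inputs and rebuild each Instance with text_to_instance when reading the cache back. The sketch below subclasses the hypothetical reader from earlier and assumes whitespace tokenization, so joining tokens with spaces is lossless; that assumption won't hold for every tokenizer.

```python
import json

from allennlp.data import Instance


class JsonCachingReader(TabSeparatedClassificationReader):   # hypothetical subclass
    def serialize_instance(self, instance: Instance) -> str:
        # Store just the raw inputs; text_to_instance can rebuild the rest.
        return json.dumps({
            "text": " ".join(token.text for token in instance["tokens"].tokens),
            "label": instance["label"].label if "label" in instance else None,
        })

    def deserialize_instance(self, string: str) -> Instance:
        blob = json.loads(string)
        return self.text_to_instance(blob["text"], blob["label"])
```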