dataset
allennlp.tango.dataset
AllenNLP Tango is an experimental API and parts of it might change or disappear every time we release a new version.
DatasetDict
@dataclass
class DatasetDict
This definition of a dataset combines all splits, the vocabulary, and some metadata, into one handy class.
splits
class DatasetDict:
| ...
| splits: Mapping[str, Sequence[Any]] = None
Maps the name of each split to a sequence of instances. DatasetDict does not enforce what the instances are made of, so they are typed as Sequence[Any]. The data loader, however, does care about the type of the instances and will raise an error if it encounters a type it cannot handle.
vocab
class DatasetDict:
| ...
| vocab: Optional[Vocabulary] = None
The vocabulary of this dataset.
metadata
class DatasetDict:
| ...
| metadata: Mapping[str, Any] = field(default_factory=dict)
Metadata can contain anything you need.
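To make the shape of the three fields concrete, here is a minimal sketch. It uses a stand-in dataclass named DatasetDictSketch (a hypothetical name, not part of AllenNLP) so that it runs without the library installed; the vocabulary is typed as Any here, whereas the real field is Optional[Vocabulary].

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional, Sequence


@dataclass
class DatasetDictSketch:
    """Stand-in that mirrors the three documented fields; not the real class."""

    splits: Mapping[str, Sequence[Any]]
    vocab: Optional[Any] = None  # the real field is Optional[Vocabulary]
    metadata: Mapping[str, Any] = field(default_factory=dict)


# Instances can be anything, because the container does not check their type.
dataset = DatasetDictSketch(
    splits={
        "train": ["instance 1", "instance 2", "instance 3"],
        "validation": ["instance 4"],
    },
    metadata={"source": "toy example"},
)

# Count the instances in each split.
split_sizes = {name: len(instances) for name, instances in dataset.splits.items()}
print(split_sizes)  # {'train': 3, 'validation': 1}
```

With the real class, the same keyword arguments (splits, vocab, metadata) should apply, assuming the field names shown above.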
DatasetReaderAdapterStep
@Step.register("dataset_reader_adapter")
class DatasetReaderAdapterStep(Step)
This step creates a DatasetDict from old-school dataset readers. If you're tempted to write a new DatasetReader and then use this step with it, don't. Just write a Step that creates the DatasetDict you need directly.
DETERMINISTIC
class DatasetReaderAdapterStep(Step):
| ...
| DETERMINISTIC = True
CACHEABLE
class DatasetReaderAdapterStep(Step):
| ...
| CACHEABLE = True
VERSION
class DatasetReaderAdapterStep(Step):
| ...
| VERSION = "002"
run
class DatasetReaderAdapterStep(Step):
| ...
| def run(
| self,
| reader: DatasetReader,
| splits: Dict[str, str]
| ) -> DatasetDict
reader specifies the old-school dataset reader to use.
splits maps the names of the splits to the filenames to use for the dataset reader. It might look like this:
{ "train": "/path/to/train.json", "validation": "/path/to/validation.json" }
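The role of the splits argument can be sketched as follows: each filename is handed to the reader, and the resulting instances are collected under the split's name. This is an illustration only, not the step's actual implementation. The helpers read_splits and read_json_lines are hypothetical, and a plain callable stands in for a DatasetReader so the example runs without AllenNLP.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List


def read_splits(
    read: Callable[[str], Iterable[Any]], splits: Dict[str, str]
) -> Dict[str, List[Any]]:
    """Read every file in `splits` with `read`, keyed by split name."""
    return {name: list(read(path)) for name, path in splits.items()}


def read_json_lines(path: str) -> Iterable[Any]:
    """Hypothetical reader: yields one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Write two tiny files so the example runs end to end.
tmp = Path(tempfile.mkdtemp())
(tmp / "train.json").write_text('{"text": "a"}\n{"text": "b"}\n', encoding="utf-8")
(tmp / "validation.json").write_text('{"text": "c"}\n', encoding="utf-8")

loaded = read_splits(
    read_json_lines,
    {
        "train": str(tmp / "train.json"),
        "validation": str(tmp / "validation.json"),
    },
)
print({name: len(instances) for name, instances in loaded.items()})
```

The resulting mapping has the same shape as the splits field of DatasetDict: split name to a sequence of instances.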
DatasetRemixStep
@Step.register("dataset_remix")
class DatasetRemixStep(Step)
This step can remix splits in a dataset into new splits.
DETERMINISTIC
class DatasetRemixStep(Step):
| ...
| DETERMINISTIC = True
CACHEABLE
class DatasetRemixStep(Step):
| ...
| CACHEABLE = False
VERSION
class DatasetRemixStep(Step):
| ...
| VERSION = "001"
run
class DatasetRemixStep(Step):
| ...
| def run(
| self,
| input: DatasetDict,
| new_splits: Dict[str, str],
| keep_old_splits: bool = True,
| shuffle_before: bool = False,
| shuffle_after: bool = False,
| random_seed: int = 1532637578
| ) -> DatasetDict
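The signature does not spell out how new_splits is interpreted, so the following is only a sketch of what remixing could look like on plain dictionaries. It assumes a hypothetical spec syntax in which each new split is described by old split names joined with "+". The function remix is a stand-in written for this example, not the library's implementation, and it omits shuffle_after.

```python
import random
from typing import Any, Dict, List, Sequence


def remix(
    splits: Dict[str, Sequence[Any]],
    new_splits: Dict[str, str],
    keep_old_splits: bool = True,
    shuffle_before: bool = False,
    random_seed: int = 1532637578,
) -> Dict[str, List[Any]]:
    """Build new splits by concatenating old ones (assumed "a + b" syntax)."""
    rng = random.Random(random_seed)  # fixed seed keeps the result reproducible
    source = {name: list(instances) for name, instances in splits.items()}
    if shuffle_before:
        for instances in source.values():
            rng.shuffle(instances)

    result: Dict[str, List[Any]] = dict(source) if keep_old_splits else {}
    for new_name, spec in new_splits.items():
        # Split the spec on "+" and concatenate the named old splits in order.
        old_names = [part.strip() for part in spec.split("+")]
        result[new_name] = [x for old in old_names for x in source[old]]
    return result


old = {"train": [1, 2], "validation": [3]}
mixed = remix(old, {"all": "train + validation"})
only_new = remix(old, {"all": "train + validation"}, keep_old_splits=False)
print(mixed)     # {'train': [1, 2], 'validation': [3], 'all': [1, 2, 3]}
print(only_new)  # {'all': [1, 2, 3]}
```

Seeding a local random.Random rather than the global generator is a design choice of this sketch; it keeps the shuffle deterministic without affecting other code.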