allennlp.tango.dataset

AllenNLP Tango is an experimental API and parts of it might change or disappear every time we release a new version.

DatasetDict

@dataclass
class DatasetDict

This definition of a dataset combines all splits, the vocabulary, and some metadata into one handy class.

splits

class DatasetDict:
 | ...
 | splits: Mapping[str, Sequence[Any]] = None

Maps the name of the split to a sequence of instances. DatasetDict does not enforce what the instances are made of, so they are of type Sequence[Any]. However, the data loader does care about the type of the instances and will throw an error if it encounters a type it cannot handle.

vocab

class DatasetDict:
 | ...
 | vocab: Optional[Vocabulary] = None

The vocabulary of this dataset.

metadata

class DatasetDict:
 | ...
 | metadata: Mapping[str, Any] = field(default_factory=dict)

Metadata can contain anything you need.
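
Since DatasetDict is an ordinary dataclass, you can build one directly. The sketch below is illustrative only; the import path and the toy dict-shaped instances are assumptions, and the actual instance type is up to you (and to whatever data loader eventually consumes it).

    # Minimal sketch of constructing a DatasetDict by hand (illustrative only).
    from allennlp.data import Vocabulary
    from allennlp.tango.dataset import DatasetDict

    dataset = DatasetDict(
        splits={
            "train": [{"text": "a b c", "label": "pos"}, {"text": "d e", "label": "neg"}],
            "validation": [{"text": "f g", "label": "pos"}],
        },
        vocab=Vocabulary(),           # optional; may be left as None
        metadata={"source": "toy"},   # anything you need
    )

    assert len(dataset.splits["train"]) == 2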

DatasetReaderAdapterStep

@Step.register("dataset_reader_adapter")
class DatasetReaderAdapterStep(Step)

This step creates a DatasetDict from old-school dataset readers. If you're tempted to write a new DatasetReader and then use this step with it, don't. Just write a Step that creates the DatasetDict you need directly.

DETERMINISTIC

class DatasetReaderAdapterStep(Step):
 | ...
 | DETERMINISTIC = True

CACHEABLE

class DatasetReaderAdapterStep(Step):
 | ...
 | CACHEABLE = True

VERSION

class DatasetReaderAdapterStep(Step):
 | ...
 | VERSION = "002"

run

class DatasetReaderAdapterStep(Step):
 | ...
 | def run(
 |     self,
 |     reader: DatasetReader,
 |     splits: Dict[str, str]
 | ) -> DatasetDict
  • reader specifies the old-school dataset reader to use.
  • splits maps the names of the splits to the filenames to use for the dataset reader. It might look like this:
    {
        "train": "/path/to/train.json",
        "validation": "/path/to/validation.json"
    }
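
Putting it together, the whole step might be configured roughly as follows. This is a hedged sketch written as a Python dict standing in for a config fragment; the "text_classification_json" reader is just an assumed example of an old-school DatasetReader, while the step name comes from the @Step.register("dataset_reader_adapter") decorator above.

    # Hedged sketch of configuring this step (Python dict standing in for a config fragment).
    # The "text_classification_json" reader is an assumed example of an old-school DatasetReader.
    dataset_step = {
        "type": "dataset_reader_adapter",
        "reader": {"type": "text_classification_json"},
        "splits": {
            "train": "/path/to/train.json",
            "validation": "/path/to/validation.json",
        },
    }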
    

DatasetRemixStep

@Step.register("dataset_remix")
class DatasetRemixStep(Step)

This step can remix splits in a dataset into new splits.

DETERMINISTIC

class DatasetRemixStep(Step):
 | ...
 | DETERMINISTIC = True

CACHEABLE

class DatasetRemixStep(Step):
 | ...
 | CACHEABLE = False

VERSION

class DatasetRemixStep(Step):
 | ...
 | VERSION = "001"

run

class DatasetRemixStep(Step):
 | ...
 | def run(
 |     self,
 |     input: DatasetDict,
 |     new_splits: Dict[str, str],
 |     keep_old_splits: bool = True,
 |     shuffle_before: bool = False,
 |     shuffle_after: bool = False,
 |     random_seed: int = 1532637578
 | ) -> DatasetDict
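
Reading from the signature: input is the DatasetDict to remix, new_splits maps the names of the new splits to specifications derived from the existing ones, keep_old_splits controls whether the original splits are carried over, and shuffle_before, shuffle_after, and random_seed control shuffling of instances before and after the remix. The sketch below shows one plausible use, carving a validation set out of train; the slice-style split specs and the step-reference format are assumptions here, so check the source for the exact syntax your version accepts.

    # Hedged sketch of remixing splits (Python dict standing in for a config fragment).
    # The slice-style specs and the {"type": "ref", ...} reference format are assumptions.
    remix_step = {
        "type": "dataset_remix",
        "input": {"type": "ref", "ref": "dataset"},  # output of an earlier step (hypothetical name)
        "new_splits": {
            "validation": "train[:1000]",   # carve a validation set out of train
            "train": "train[1000:]",        # keep the rest as the new train split
        },
        "keep_old_splits": False,
        "shuffle_before": True,
    }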