allennlp.tango.dataset

AllenNLP Tango is an experimental API and parts of it might change or disappear every time we release a new version.

DatasetDict

@dataclass
class DatasetDict

This definition of a dataset combines all splits, the vocabulary, and some metadata into one handy class.

splits

class DatasetDict:
 | ...
 | splits: Mapping[str, Sequence[Any]] = None

Maps the name of the split to a sequence of instances. DatasetDict does not enforce what the instances are made of, so they are of type Sequence[Any]. However, the data loader does care about the type of the instances and will throw an error if it encounters a type it cannot handle.

vocab

class DatasetDict:
 | ...
 | vocab: Optional[Vocabulary] = None

The vocabulary of this dataset.

metadata

class DatasetDict:
 | ...
 | metadata: Mapping[str, Any] = field(default_factory=dict)

Metadata can contain anything you need.
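
Since DatasetDict is an ordinary dataclass, you can build one directly. The sketch below is illustrative only; the import path and the toy dict-shaped instances are assumptions, and the actual instance type is up to you (and to whatever data loader eventually consumes it).

    # Minimal sketch of constructing a DatasetDict by hand (illustrative only).
    from allennlp.data import Vocabulary
    from allennlp.tango.dataset import DatasetDict

    dataset = DatasetDict(
        splits={
            "train": [{"text": "a b c", "label": "pos"}, {"text": "d e", "label": "neg"}],
            "validation": [{"text": "f g", "label": "pos"}],
        },
        vocab=Vocabulary(),           # optional; may be left as None
        metadata={"source": "toy"},   # anything you need
    )

    assert len(dataset.splits["train"]) == 2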

DatasetReaderAdapterStep

@Step.register("dataset_reader_adapter")
class DatasetReaderAdapterStep(Step)

This step creates a DatasetDict from old-school dataset readers. If you're tempted to write a new DatasetReader and then use this step with it, don't. Just write a Step that creates the DatasetDict you need directly.

DETERMINISTIC

class DatasetReaderAdapterStep(Step):
 | ...
 | DETERMINISTIC = True

CACHEABLE

class DatasetReaderAdapterStep(Step):
 | ...
 | CACHEABLE = True

VERSION

class DatasetReaderAdapterStep(Step):
 | ...
 | VERSION = "002"

run

class DatasetReaderAdapterStep(Step):
 | ...
 | def run(
 |     self,
 |     reader: DatasetReader,
 |     splits: Dict[str, str]
 | ) -> DatasetDict
  • reader specifies the old-school dataset reader to use.
  • splits maps the names of the splits to the filenames to use for the dataset reader. It might look like this:
    {
        "train": "/path/to/train.json",
        "validation": "/path/to/validation.json"
    }
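
Putting it together, the whole step might be configured roughly as follows. This is a hedged sketch written as a Python dict standing in for a config fragment; the "text_classification_json" reader is just an assumed example of an old-school DatasetReader, while the step name comes from the @Step.register("dataset_reader_adapter") decorator above.

    # Hedged sketch of configuring this step (Python dict standing in for a config fragment).
    # The "text_classification_json" reader is an assumed example of an old-school DatasetReader.
    dataset_step = {
        "type": "dataset_reader_adapter",
        "reader": {"type": "text_classification_json"},
        "splits": {
            "train": "/path/to/train.json",
            "validation": "/path/to/validation.json",
        },
    }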
    

DatasetRemixStep

@Step.register("dataset_remix")
class DatasetRemixStep(Step)

This step can remix splits in a dataset into new splits.

DETERMINISTIC

class DatasetRemixStep(Step):
 | ...
 | DETERMINISTIC = True

CACHEABLE

class DatasetRemixStep(Step):
 | ...
 | CACHEABLE = False

VERSION

class DatasetRemixStep(Step):
 | ...
 | VERSION = "001"

run

class DatasetRemixStep(Step):
 | ...
 | def run(
 |     self,
 |     input: DatasetDict,
 |     new_splits: Dict[str, str],
 |     keep_old_splits: bool = True,
 |     shuffle_before: bool = False,
 |     shuffle_after: bool = False,
 |     random_seed: int = 1532637578
 | ) -> DatasetDict
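
Reading from the signature: input is the DatasetDict to remix, new_splits maps the names of the new splits to specifications derived from the existing ones, keep_old_splits controls whether the original splits are carried over, and shuffle_before, shuffle_after, and random_seed control shuffling of instances before and after the remix. The sketch below shows one plausible use, carving a validation set out of train; the slice-style split specs and the step-reference format are assumptions here, so check the source for the exact syntax your version accepts.

    # Hedged sketch of remixing splits (Python dict standing in for a config fragment).
    # The slice-style specs and the {"type": "ref", ...} reference format are assumptions.
    remix_step = {
        "type": "dataset_remix",
        "input": {"type": "ref", "ref": "dataset"},  # output of an earlier step (hypothetical name)
        "new_splits": {
            "validation": "train[:1000]",   # carve a validation set out of train
            "train": "train[1000:]",        # keep the rest as the new train split
        },
        "keep_old_splits": False,
        "shuffle_before": True,
    }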