allennlp.data.iterators

The various DataIterator subclasses can be used to iterate over datasets with different batching and padding schemes.

class allennlp.data.iterators.data_iterator.DataIterator(batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None, cache_instances: bool = False, track_epoch: bool = False, maximum_samples_per_batch: Tuple[str, int] = None)[source]

Bases: allennlp.common.registrable.Registrable

An abstract DataIterator class. DataIterators must override _create_batches().

Parameters
batch_size : int, optional (default = 32)

The size of each batch of instances yielded when calling the iterator.

instances_per_epoch : int, optional (default = None)

If specified, each epoch will consist of precisely this many instances. If not specified, each epoch will consist of a single pass through the dataset.

max_instances_in_memory : int, optional (default = None)

If specified, the iterator will load this many instances at a time into an in-memory list and then produce batches from one such list at a time. This could be useful if your instances are read lazily from disk.

cache_instances : bool, optional (default = False)

If true, the iterator will cache the tensorized instances in memory. If false, it will do the tensorization anew each iteration.

track_epoch : bool, optional (default = False)

If true, each instance will get a MetadataField containing the epoch number.

maximum_samples_per_batch : Tuple[str, int], optional (default = None)

If specified, this is a tuple (padding_key, limit), and we will ensure that every batch satisfies batch_size * sequence_length <= limit, where sequence_length is given by the padding_key. This is done by moving excess instances to the next batch (as opposed to dividing a large batch evenly) and should result in fairly tight packing.

default_implementation: str = 'bucket'

get_num_batches(self, instances: Iterable[allennlp.data.instance.Instance]) → int[source]

Returns the number of batches that the dataset will be split into; if you want to track progress through the batches with the generator produced by __call__, this could be useful.
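For example, the batch count can drive a progress bar. A minimal sketch, assuming instances is a list of Instance objects and iterator is a concrete DataIterator that has already had index_with(vocab) called:

    from tqdm import tqdm

    num_batches = iterator.get_num_batches(instances)
    for tensor_dict in tqdm(iterator(instances, num_epochs=1), total=num_batches):
        pass  # each tensor_dict is one batch of tensors, ready for a model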

index_with(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

allennlp.data.iterators.data_iterator.add_epoch_number(batch: allennlp.data.dataset.Batch, epoch: int) → allennlp.data.dataset.Batch[source]

Add the epoch number to the batch instances as a MetadataField.
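A minimal sketch of how the epoch number surfaces when track_epoch is enabled; the metadata key "epoch_num" and the use of BasicIterator (documented below) are assumptions of this sketch:

    iterator = BasicIterator(batch_size=32, track_epoch=True)
    iterator.index_with(vocab)
    for tensor_dict in iterator(instances, num_epochs=2):
        # assumed key: each instance carries the epoch it was produced in
        print(tensor_dict["epoch_num"])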

class allennlp.data.iterators.basic_iterator.BasicIterator(batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None, cache_instances: bool = False, track_epoch: bool = False, maximum_samples_per_batch: Tuple[str, int] = None)[source]

Bases: allennlp.data.iterators.data_iterator.DataIterator

A very basic iterator that takes a dataset, possibly shuffles it, and creates fixed-size batches.

It takes the same parameters as allennlp.data.iterators.DataIterator.
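A minimal end-to-end sketch; the dataset reader and file path are hypothetical placeholders:

    from allennlp.data.iterators import BasicIterator
    from allennlp.data.vocabulary import Vocabulary

    instances = list(reader.read("path/to/train.txt"))  # hypothetical reader
    vocab = Vocabulary.from_instances(instances)

    iterator = BasicIterator(batch_size=32)
    iterator.index_with(vocab)  # required so instances can be tensorized
    for tensor_dict in iterator(instances, num_epochs=1, shuffle=True):
        pass  # tensor_dict maps field names to padded tensors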

class allennlp.data.iterators.bucket_iterator.BucketIterator(sorting_keys: List[Tuple[str, str]], padding_noise: float = 0.1, biggest_batch_first: bool = False, batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None, cache_instances: bool = False, track_epoch: bool = False, maximum_samples_per_batch: Tuple[str, int] = None, skip_smaller_batches: bool = False)[source]

Bases: allennlp.data.iterators.data_iterator.DataIterator

An iterator which, by default, pads batches with respect to the maximum input lengths per batch. Additionally, you can provide a list of field names and padding keys by which the dataset will be sorted before batching, causing inputs with similar lengths to be batched together and making computation more efficient (as less time is wasted on padded elements of the batch). A construction sketch follows the parameter list below.

Parameters
sorting_keys : List[Tuple[str, str]]

To bucket inputs into batches, we want to group the instances by padding length, so that we minimize the amount of padding necessary per batch. In order to do this, we need to know which fields need what type of padding, and in what order.

For example, [("sentence1", "num_tokens"), ("sentence2", "num_tokens"), ("sentence1", "num_token_characters")] would sort a dataset first by the “num_tokens” of the “sentence1” field, then by the “num_tokens” of the “sentence2” field, and finally by the “num_token_characters” of the “sentence1” field. TODO(mattg): we should have some documentation somewhere that gives the standard padding keys used by different fields.

padding_noise : float, optional (default = 0.1)

When sorting by padding length, we add a bit of noise to the lengths, so that the sorting isn’t deterministic. This parameter determines how much noise we add, as a percentage of the actual padding value for each instance.

biggest_batch_first : bool, optional (default = False)

This is largely for testing, to see how large of a batch you can safely use with your GPU. This will let you try out the largest batch that you have in the data first, so that if you’re going to run out of memory, you know it early, instead of waiting through the whole epoch to find out at the end that you’re going to crash.

Note that if you specify max_instances_in_memory, the first batch will only be the biggest from among the first “max instances in memory” instances.

batch_size : int, optional (default = 32)

The size of each batch of instances yielded when calling the iterator.

instances_per_epoch : int, optional (default = None)

See BasicIterator.

max_instances_in_memory : int, optional (default = None)

See BasicIterator.

maximum_samples_per_batch : Tuple[str, int], optional (default = None)

See BasicIterator.

skip_smaller_batches : bool, optional (default = False)

When the number of data samples is not divisible by batch_size, some batches might be smaller than batch_size. If set to True, those smaller batches will be discarded.
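A minimal construction sketch, assuming instances contain a TextField named "tokens" (the field name is a placeholder), with vocab and instances built as in the BasicIterator sketch above:

    from allennlp.data.iterators import BucketIterator

    iterator = BucketIterator(
        sorting_keys=[("tokens", "num_tokens")],
        batch_size=32,
        padding_noise=0.1,
        # optional cap: ensure batch_size * num_tokens <= 3200 per batch
        maximum_samples_per_batch=("num_tokens", 3200),
    )
    iterator.index_with(vocab)
    for tensor_dict in iterator(instances, num_epochs=1):
        pass  # batches of similarly-sized instances, padded per batch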

allennlp.data.iterators.bucket_iterator.sort_by_padding(instances: List[allennlp.data.instance.Instance], sorting_keys: List[Tuple[str, str]], vocab: allennlp.data.vocabulary.Vocabulary, padding_noise: float = 0.0) → List[allennlp.data.instance.Instance][source]

Sorts the instances by their padding lengths, using the keys in sorting_keys (in the order in which they are provided). sorting_keys is a list of (field_name, padding_key) tuples.
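A sketch of calling it directly; the field name "tokens" is a placeholder, and the vocabulary is used to index the fields so padding lengths can be computed:

    from allennlp.data.iterators.bucket_iterator import sort_by_padding

    sorted_instances = sort_by_padding(
        instances,
        sorting_keys=[("tokens", "num_tokens")],
        vocab=vocab,
        padding_noise=0.0,
    )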

class allennlp.data.iterators.multiprocess_iterator.MultiprocessIterator(base_iterator: allennlp.data.iterators.data_iterator.DataIterator, num_workers: int = 1, output_queue_size: int = 1000)[source]

Bases: allennlp.data.iterators.data_iterator.DataIterator

Wraps another DataIterator and uses it to generate tensor dicts using multiple processes. A usage sketch follows the parameter list below.

Parameters
base_iterator : DataIterator

The DataIterator for generating tensor dicts. It will be shared among processes, so it should not be stateful in any way.

num_workers : int, optional (default = 1)

The number of processes used for generating tensor dicts.

output_queue_size : int, optional (default = 1000)

The size of the output queue on which tensor dicts are placed to be consumed. You might need to increase this if you’re generating tensor dicts too quickly.
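A minimal sketch of wrapping a base iterator; the worker count and sorting key are illustrative:

    from allennlp.data.iterators import BucketIterator, MultiprocessIterator

    base = BucketIterator(sorting_keys=[("tokens", "num_tokens")], batch_size=32)
    iterator = MultiprocessIterator(base, num_workers=4, output_queue_size=1000)
    iterator.index_with(vocab)  # the vocabulary is applied to the wrapped iterator
    for tensor_dict in iterator(instances, num_epochs=1):
        pass  # tensor dicts produced by the worker processes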

index_with(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

class allennlp.data.iterators.homogeneous_batch_iterator.HomogeneousBatchIterator(batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None, cache_instances: bool = False, track_epoch: bool = False, partition_key: str = 'dataset', skip_smaller_batches: bool = False)[source]

Bases: allennlp.data.iterators.data_iterator.DataIterator

This iterator takes a dataset of potentially heterogeneous instances and yields back homogeneous batches. It assumes that each instance has some MetadataField indicating what "type" of instance it is and bases its notion of homogeneity on that (and, in particular, not on inspecting the "field signature" of the instance). A construction sketch follows the parameter list below.

Parameters
batch_size : int, optional (default = 32)

The size of each batch of instances yielded when calling the iterator.

instances_per_epoch : int, optional (default = None)

If specified, each epoch will consist of precisely this many instances. If not specified, each epoch will consist of a single pass through the dataset.

max_instances_in_memory : int, optional (default = None)

If specified, the iterator will load this many instances at a time into an in-memory list and then produce batches from one such list at a time. This could be useful if your instances are read lazily from disk.

cache_instances : bool, optional (default = False)

If true, the iterator will cache the tensorized instances in memory. If false, it will do the tensorization anew each iteration.

track_epoch : bool, optional (default = False)

If true, each instance will get a MetadataField containing the epoch number.

partition_key : str, optional (default = "dataset")

The key of the MetadataField indicating what “type” of instance this is.

skip_smaller_batches : bool, optional (default = False)

When the number of data samples is not divisible by batch_size, some batches might be smaller than batch_size. If set to True, those smaller batches will be discarded.
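A minimal sketch, assuming instance is an existing Instance and "squad" is a hypothetical type tag:

    from allennlp.data.fields import MetadataField
    from allennlp.data.iterators import HomogeneousBatchIterator

    # Tag each instance with its "type" under the partition key; its other
    # fields may differ freely between types.
    instance.add_field("dataset", MetadataField("squad"))

    iterator = HomogeneousBatchIterator(batch_size=32, partition_key="dataset")
    iterator.index_with(vocab)
    for tensor_dict in iterator(instances, num_epochs=1):
        pass  # every batch contains instances of a single type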

class allennlp.data.iterators.same_language_iterator.SameLanguageIterator(sorting_keys: List[Tuple[str, str]], padding_noise: float = 0.1, biggest_batch_first: bool = False, batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None, cache_instances: bool = False, track_epoch: bool = False, maximum_samples_per_batch: Tuple[str, int] = None, skip_smaller_batches: bool = False)[source]

Bases: allennlp.data.iterators.bucket_iterator.BucketIterator

Splits batches into batches containing the same language. The language of each instance is determined by looking at the ‘lang’ value in the metadata.

It takes the same parameters as allennlp.data.iterators.BucketIterator.
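A minimal sketch, assuming each instance carries a MetadataField whose value includes a "lang" entry; the sorting key "words" is a placeholder:

    from allennlp.data.iterators import SameLanguageIterator

    iterator = SameLanguageIterator(
        sorting_keys=[("words", "num_tokens")], batch_size=32
    )
    iterator.index_with(vocab)
    for tensor_dict in iterator(instances, num_epochs=1):
        pass  # every batch contains instances from a single language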

allennlp.data.iterators.same_language_iterator.split_by_language(instance_list)[source]

class allennlp.data.iterators.pass_through_iterator.PassThroughIterator[source]

Bases: allennlp.data.iterators.data_iterator.DataIterator

An iterator which performs no batching or shuffling of instances, only tensorization; i.e., instances are effectively passed 'straight through' the iterator.

This is essentially the same as a BasicIterator with shuffling disabled, the batch size set to 1, and maximum samples per batch disabled. The only difference is that this iterator removes the batch dimension. This can be useful for rare situations where batching is best performed within the dataset reader (e.g. for contiguous language modeling, or for other problems where state is shared across batches).
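A minimal usage sketch; shuffling is disabled explicitly since this iterator does not shuffle:

    from allennlp.data.iterators import PassThroughIterator

    iterator = PassThroughIterator()
    iterator.index_with(vocab)
    for tensor_dict in iterator(instances, num_epochs=1, shuffle=False):
        pass  # one tensorized instance at a time, no leading batch dimension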