allennlp.data.dataset_readers.penn_tree_bank

class allennlp.data.dataset_readers.penn_tree_bank.PennTreeBankConstituencySpanDatasetReader(token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, use_pos_tags: bool = True, lazy: bool = False, label_namespace_prefix: str = '', pos_label_namespace: str = 'pos')[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads constituency parses from the WSJ part of the Penn Tree Bank from the LDC. This DatasetReader is designed for use with a span labelling model, so it enumerates all possible spans in the sentence and returns them, along with gold labels for the relevant spans present in a gold tree, if provided.

Parameters
token_indexersDict[str, TokenIndexer], optional (default=``{“tokens”: SingleIdTokenIndexer()}``)

We use this to define the input representation for the text. See TokenIndexer. Note that the output tags will always correspond to single token IDs based on how they are pre-tokenised in the data file.

use_pos_tagsbool, optional, (default = True)

Whether or not the instance should contain gold POS tags as a field.

lazybool, optional, (default = False)

Whether or not instances can be consumed lazily.

label_namespace_prefixstr, optional, (default = "")

Prefix used for the label namespace. The span_labels will use namespace label_namespace_prefix + 'labels', and if using POS tags their namespace is label_namespace_prefix + pos_label_namespace.

pos_label_namespacestr, optional, (default = "pos")

The POS tag namespace is label_namespace_prefix + pos_label_namespace.

text_to_instance(self, tokens: List[str], pos_tags: List[str] = None, gold_tree: nltk.tree.Tree = None) → allennlp.data.instance.Instance[source]

We take pre-tokenized input here, because we don’t have a tokenizer in this class.

Parameters
tokensList[str], required.

The tokens in a given sentence.

pos_tagsList[str], optional, (default = None).

The POS tags for the words in the sentence.

gold_treeTree, optional (default = None).

The gold parse tree to create span labels from.

Returns
An Instance containing the following fields:
tokensTextField

The tokens in the sentence.

pos_tagsSequenceLabelField

The POS tags of the words in the sentence. Only returned if use_pos_tags is True

spansListField[SpanField]

A ListField containing all possible subspans of the sentence.

span_labelsSequenceLabelField, optional.

The constituency tags for each of the possible spans, with respect to a gold parse tree. If a span is not contained within the tree, a span will have a NO-LABEL label.

gold_treeMetadataField(Tree)

The gold NLTK parse tree for use in evaluation.