allennlp.data.dataset_readers.dataset_utils
class allennlp.data.dataset_readers.dataset_utils.ontonotes.Ontonotes
Bases: object

This DatasetReader is designed to read in the English OntoNotes v5.0 data in the format used by the CoNLL 2011/2012 shared tasks. In order to use this Reader, you must follow the instructions provided here (v12 release), which will allow you to download the CoNLL-style annotations for the OntoNotes v5.0 release – LDC2013T19.tgz, obtained from the LDC.
Once you have run the scripts on the extracted data, you will have a folder structured as follows:
    conll-formatted-ontonotes-5.0/
    └── data
        ├── development
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc ├── bn ├── mz ├── nw ├── pt ├── tc └── wb
        ├── test
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc ├── bn ├── mz ├── nw ├── pt ├── tc └── wb
        └── train
            └── data
                └── english
                    └── annotations
                        ├── bc ├── bn ├── mz ├── nw ├── pt ├── tc └── wb
The file path provided to this class can then be any of the train, test or development directories (or the top-level data directory, if you are not utilizing the splits).
The data has the following format, ordered by column.
- 1. Document ID (str)
  This is a variation on the document filename.
- 2. Part number (int)
  Some files are divided into multiple parts, numbered as 000, 001, 002, etc.
- 3. Word number (int)
  This is the word index of the word in that sentence.
- 4. Word (str)
  This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release.
- 5. POS tag (str)
  This is the Penn Treebank style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
- 6. Parse bit (str)
  This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced with a *. When the parse information is missing, the first word of a sentence is tagged as (TOP*, the last word is tagged as *), and all intermediate words are tagged with a *.
- 7. Predicate lemma (str)
  The predicate lemma is mentioned for the rows for which we have semantic role information or word sense information. All other rows are marked with a "-".
- 8. Predicate Frameset ID (int)
  The PropBank frameset ID of the predicate in column 7.
- 9. Word sense (float)
  This is the word sense of the word in column 3.
- 10. Speaker/Author (str)
  This is the speaker or author name, where available. Mostly present in Broadcast Conversation and Web Log data. When not available, the rows are marked with a "-".
- 11. Named entities (str)
  This column identifies the spans representing various named entities. For documents which do not have named entity annotation, each line is represented with a *.
- 12+. Predicate arguments (str)
  There is one column of predicate argument structure information for each predicate mentioned in column 7. If there are no predicates tagged in a sentence, this is a single column with all rows marked with a *.
- -1. Coreference (str)
  The last column. Coreference chain information encoded in a parenthesis structure. For documents that do not have coreference annotations, each line is represented with a "-".
dataset_document_iterator(self, file_path: str) → Iterator[List[allennlp.data.dataset_readers.dataset_utils.ontonotes.OntonotesSentence]]
An iterator over CoNLL-formatted files which yields documents, regardless of the number of document annotations in a particular file. This is useful for CoNLL data which has been preprocessed, such as the preprocessing which takes place for the 2012 CoNLL Coreference Resolution task.

dataset_iterator(self, file_path: str) → Iterator[allennlp.data.dataset_readers.dataset_utils.ontonotes.OntonotesSentence]
An iterator over the entire dataset, yielding all sentences processed.
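For example, a minimal usage sketch (the directory path is hypothetical; point it at one of your train, test or development directories):

    from allennlp.data.dataset_readers.dataset_utils.ontonotes import Ontonotes

    ontonotes_reader = Ontonotes()
    # Hypothetical path to a CoNLL-formatted OntoNotes split.
    for sentence in ontonotes_reader.dataset_iterator("conll-formatted-ontonotes-5.0/data/train"):
        print(sentence.words, sentence.pos_tags)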
class allennlp.data.dataset_readers.dataset_utils.ontonotes.OntonotesSentence(document_id: str, sentence_id: int, words: List[str], pos_tags: List[str], parse_tree: Optional[nltk.tree.Tree], predicate_lemmas: List[Optional[str]], predicate_framenet_ids: List[Optional[str]], word_senses: List[Optional[float]], speakers: List[Optional[str]], named_entities: List[str], srl_frames: List[Tuple[str, List[str]]], coref_spans: Set[Tuple[int, Tuple[int, int]]])
Bases: object

A class representing the annotations available for a single CoNLL-formatted sentence.
Parameters
- document_id : str
  This is a variation on the document filename.
- sentence_id : int
  The integer ID of the sentence within a document.
- words : List[str]
  These are the tokens as segmented/tokenized in the Treebank.
- pos_tags : List[str]
  This is the Penn-Treebank-style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
- parse_tree : nltk.Tree
  An nltk Tree representing the parse. It includes POS tags as pre-terminal nodes. When the parse information is missing, the parse will be None.
- predicate_lemmas : List[Optional[str]]
  The predicate lemma of the words for which we have semantic role information or word sense information. All other indices are None.
- predicate_framenet_ids : List[Optional[int]]
  The PropBank frameset ID of the lemmas in predicate_lemmas, or None.
- word_senses : List[Optional[float]]
  The word senses for the words in the sentence, or None. These are floats because the word sense can have values after the decimal, like 1.1.
- speakers : List[Optional[str]]
  The speaker information for the words in the sentence, if present, or None. This is the speaker or author name where available, mostly in Broadcast Conversation and Web Log data. When not available, the rows are marked with a "-".
- named_entities : List[str]
  The BIO tags for named entities in the sentence.
- srl_frames : List[Tuple[str, List[str]]]
  A list of tuples, one per verb in the sentence, pairing each verb with its PropBank frame labels in BIO format.
- coref_spans : Set[TypedSpan]
  The spans for entity mentions involved in coreference resolution within the sentence. Each element is a tuple composed of (cluster_id, (start_index, end_index)). Indices are inclusive.
exception allennlp.data.dataset_readers.dataset_utils.span_utils.InvalidTagSequence(tag_sequence=None)
Bases: Exception

allennlp.data.dataset_readers.dataset_utils.span_utils.bio_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to BIO tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are also included (i.e. those which do not start with a "B-LABEL"), as otherwise it is possible to get a perfect precision score whilst still predicting ill-formed spans in addition to the correct spans. This function works properly when the spans are unlabeled (i.e., your labels are simply "B", "I", and "O").

Parameters
- tag_sequence : List[str], required.
  The string class labels for the sequence.
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)). Note that the label does not contain any BIO tag prefixes.
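For example, a quick sketch (the ordering of the returned spans is not guaranteed):

    from allennlp.data.dataset_readers.dataset_utils.span_utils import bio_tags_to_spans

    tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
    print(bio_tags_to_spans(tags))
    # e.g. [('PER', (0, 1)), ('LOC', (3, 3))]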
allennlp.data.dataset_readers.dataset_utils.span_utils.bioul_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to BIOUL tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are not allowed and will raise InvalidTagSequence. This function works properly when the spans are unlabeled (i.e., your labels are simply "B", "I", "O", "U", and "L").

Parameters
- tag_sequence : List[str], required.
  The tag sequence encoded in BIOUL, e.g. ["B-PER", "L-PER", "O"].
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)).
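For example, a quick sketch with a well-formed BIOUL sequence:

    from allennlp.data.dataset_readers.dataset_utils.span_utils import bioul_tags_to_spans

    print(bioul_tags_to_spans(["B-PER", "L-PER", "O", "U-LOC"]))
    # e.g. [('PER', (0, 1)), ('LOC', (3, 3))]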
allennlp.data.dataset_readers.dataset_utils.span_utils.bmes_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to BMES tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are also included (i.e. those which do not start with a "B-LABEL"), as otherwise it is possible to get a perfect precision score whilst still predicting ill-formed spans in addition to the correct spans. This function works properly when the spans are unlabeled (i.e., your labels are simply "B", "M", "E" and "S").

Parameters
- tag_sequence : List[str], required.
  The string class labels for the sequence.
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)). Note that the label does not contain any BIO tag prefixes.
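For example, a quick sketch (span ordering in the output is not guaranteed):

    from allennlp.data.dataset_readers.dataset_utils.span_utils import bmes_tags_to_spans

    print(bmes_tags_to_spans(["B-PER", "M-PER", "E-PER", "S-LOC"]))
    # e.g. [('PER', (0, 2)), ('LOC', (3, 3))]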
allennlp.data.dataset_readers.dataset_utils.span_utils.enumerate_spans(sentence: List[~T], offset: int = 0, max_span_width: int = None, min_span_width: int = 1, filter_function: Callable[[List[~T]], bool] = None) → List[Tuple[int, int]]
Given a sentence, return all token spans within the sentence. Spans are inclusive. Additionally, you can provide a maximum and minimum span width, which will be used to exclude spans outside of this range.

Finally, you can provide a function mapping List[T] -> bool, which will be applied to every span to decide whether that span should be included. This allows filtering by length, regex matches, pos tags or any Spacy Token attributes, for example.

Parameters
- sentence : List[T], required.
  The sentence to generate spans for. The type is generic, as this function can be used with strings, or Spacy Tokens, or other sequences.
- offset : int, optional (default = 0)
  A numeric offset to add to all span start and end indices. This is helpful if the sentence is part of a larger structure, such as a document, which the indices need to respect.
- max_span_width : int, optional (default = None)
  The maximum length of spans which should be included. Defaults to len(sentence).
- min_span_width : int, optional (default = 1)
  The minimum length of spans which should be included. Defaults to 1.
- filter_function : Callable[[List[T]], bool], optional (default = None)
  A function mapping sequences of the passed type T to a boolean value. If True, the span is included in the returned spans from the sentence; otherwise it is excluded.
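For example, a quick sketch enumerating all spans of width at most 2:

    from allennlp.data.dataset_readers.dataset_utils.span_utils import enumerate_spans

    print(enumerate_spans(["The", "quick", "brown", "fox"], max_span_width=2))
    # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]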
allennlp.data.dataset_readers.dataset_utils.span_utils.iob1_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to IOB1 tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are also included (i.e., those where "B-LABEL" is not preceded by "I-LABEL" or "B-LABEL").

Parameters
- tag_sequence : List[str], required.
  The string class labels for the sequence.
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)). Note that the label does not contain any BIO tag prefixes.
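For example, a quick sketch (in IOB1, an I tag starts a new span when it is not preceded by a tag of the same label):

    from allennlp.data.dataset_readers.dataset_utils.span_utils import iob1_tags_to_spans

    print(iob1_tags_to_spans(["I-ORG", "I-ORG", "O", "I-PER"]))
    # e.g. [('ORG', (0, 1)), ('PER', (3, 3))]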
allennlp.data.dataset_readers.dataset_utils.span_utils.iob1_to_bioul(tag_sequence: List[str]) → List[str]
allennlp.data.dataset_readers.dataset_utils.span_utils.to_bioul(tag_sequence: List[str], encoding: str = 'IOB1') → List[str]
Given a tag sequence encoded with IOB1 labels, recode to BIOUL.

In the IOB1 scheme, I is a token inside a span, O is a token outside a span and B is the beginning of a span immediately following another span of the same type.

In the BIO scheme, I is a token inside a span, O is a token outside a span and B is the beginning of a span.

Parameters
- tag_sequence : List[str], required.
  The tag sequence encoded in IOB1, e.g. ["I-PER", "I-PER", "O"].
- encoding : str, optional (default = 'IOB1').
  The encoding type to convert from. Must be either "IOB1" or "BIO".

Returns
- bioul_sequence : List[str]
  The tag sequence encoded in BIOUL, e.g. ["B-PER", "L-PER", "O"].
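For example, a quick sketch converting from both supported encodings:

    from allennlp.data.dataset_readers.dataset_utils.span_utils import to_bioul

    print(to_bioul(["I-PER", "I-PER", "O"], encoding="IOB1"))
    # ['B-PER', 'L-PER', 'O']
    print(to_bioul(["B-PER", "I-PER", "O"], encoding="BIO"))
    # ['B-PER', 'L-PER', 'O']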
Utility functions for reading the standardised text2sql datasets presented in "Improving Text to SQL Evaluation Methodology".
class allennlp.data.dataset_readers.dataset_utils.text2sql_utils.SqlData
Bases: tuple

A utility class for reading in text2sql data.
Parameters
- text : List[str]
  The tokens in the text of the query.
- text_with_variables : List[str]
  The tokens in the text of the query with variables mapped to table names/abstract variables.
- variable_tags : List[str]
  Labels for each word in text which correspond to which variable in the sql the token is linked to. "O" is used to denote no tag.
- sql : List[str]
  The tokens in the SQL query which corresponds to the text.
- text_variables : Dict[str, str]
  A dictionary of variables associated with the text, e.g. {"city_name0": "san francisco"}.
- sql_variables : Dict[str, Dict[str, str]]
  A dictionary of variables and column references associated with the sql query.
property sql
  Alias for field number 3

property sql_variables
  Alias for field number 5

property text
  Alias for field number 0

property text_variables
  Alias for field number 4

property text_with_variables
  Alias for field number 1

property variable_tags
  Alias for field number 2
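Since SqlData is a named tuple, instances can be constructed directly; a minimal sketch with field values invented for illustration (the inner keys of sql_variables are illustrative only):

    from allennlp.data.dataset_readers.dataset_utils.text2sql_utils import SqlData

    # All field values below are invented for illustration.
    example = SqlData(
        text=["show", "me", "restaurants", "in", "san", "francisco"],
        text_with_variables=["show", "me", "restaurants", "in", "city_name0"],
        variable_tags=["O", "O", "O", "O", "city_name0"],
        sql=["SELECT", "*", "FROM", "RESTAURANT", "WHERE", "CITY_NAME", "=", "'city_name0'"],
        text_variables={"city_name0": "san francisco"},
        sql_variables={"city_name0": {"example": "san francisco"}},  # inner keys illustrative
    )
    print(example.text, example.sql)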
class allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn(name, column_type, is_primary_key)
Bases: tuple

property column_type
  Alias for field number 1

property is_primary_key
  Alias for field number 2

property name
  Alias for field number 0
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.clean_and_split_sql(sql: str) → List[str]
Cleans up and unifies a SQL query. This involves unifying quoted strings and splitting brackets which aren't formatted consistently in the data.
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.clean_unneeded_aliases(sql_tokens: List[str]) → List[str]

allennlp.data.dataset_readers.dataset_utils.text2sql_utils.column_has_numeric_type(column: allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn) → bool

allennlp.data.dataset_readers.dataset_utils.text2sql_utils.column_has_string_type(column: allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn) → bool
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.process_sql_data(data: List[Dict[str, Any]], use_all_sql: bool = False, use_all_queries: bool = False, remove_unneeded_aliases: bool = False, schema: Dict[str, List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]] = None) → Iterable[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.SqlData]
A utility function for reading in text2sql data. The blob is the result of loading the json from a file produced by the script scripts/reformat_text2sql_data.py.

Parameters
- data : JsonDict
- use_all_sql : bool, optional (default = False)
  Whether to use all of the sql queries which have identical semantics, or whether to just use the first one.
- use_all_queries : bool, optional (default = False)
  Whether or not to enforce query sentence uniqueness. If False, duplicated queries will occur in the dataset as separate instances, since for a given SQL query there are not only multiple queries with the same template, but also outright duplicate queries.
- remove_unneeded_aliases : bool, optional (default = False)
  The text2sql data by default creates alias names for all tables, regardless of whether the table is derived or whether it is identical to the original (e.g. SELECT TABLEalias0.COLUMN FROM TABLE AS TABLEalias0). This is unnecessary and makes the action sequence and grammar manipulation much harder in a grammar-based decoder. Note that this does not remove aliases which are legitimately required, such as when a new table is formed by performing operations on the original table.
- schema : Dict[str, List[TableColumn]], optional (default = None)
  A schema to resolve primary keys against. Converts 'ID' column names to their actual name with respect to the Primary Key for the table in the schema.
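A usage sketch (the file path is hypothetical; the file is assumed to be the JSON output of scripts/reformat_text2sql_data.py):

    import json

    from allennlp.data.dataset_readers.dataset_utils import text2sql_utils

    # Hypothetical path to a reformatted text2sql JSON file.
    with open("data/text2sql/restaurants.json") as data_file:
        data = json.load(data_file)

    for sql_data in text2sql_utils.process_sql_data(data, use_all_sql=False):
        print(sql_data.text, sql_data.sql)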
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.read_dataset_schema(schema_path: str) → Dict[str, List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]]
Reads a schema from the text2sql data, returning a dictionary mapping table names to their columns and respective types. This handles columns in an arbitrary order and also allows either {Table, Field} or {Table, Field} Name as headers, because both appear in the data. It also uppercases table and column names if they are not already uppercase.

Parameters
- schema_path : str, required.
  The path to the csv schema.

Returns
- A dictionary mapping table names to typed columns.
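A usage sketch (the csv path is hypothetical):

    from allennlp.data.dataset_readers.dataset_utils.text2sql_utils import read_dataset_schema

    # Hypothetical path to a text2sql schema csv.
    schema = read_dataset_schema("data/text2sql/restaurants-schema.csv")
    for table_name, columns in schema.items():
        print(table_name, [(c.name, c.column_type, c.is_primary_key) for c in columns])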
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.replace_variables(sentence: List[str], sentence_variables: Dict[str, str]) → Tuple[List[str], List[str]]
Replaces abstract variables in text with their concrete counterparts.
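A sketch of the expected behaviour, assuming multi-word variable values are split into individual tokens, each tagged with the variable it came from:

    from allennlp.data.dataset_readers.dataset_utils.text2sql_utils import replace_variables

    tokens, tags = replace_variables(
        ["what", "is", "city_name0"], {"city_name0": "san francisco"}
    )
    # tokens: ['what', 'is', 'san', 'francisco']
    # tags:   ['O', 'O', 'city_name0', 'city_name0']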
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.resolve_primary_keys_in_schema(sql_tokens: List[str], schema: Dict[str, List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]]) → List[str]
Some examples in the text2sql datasets use ID as a column reference to the column of a table which has a primary key. This causes problems if you are trying to constrain a grammar to only produce the column names directly, because you don't know what ID refers to. So instead of dealing with that, we just replace it.