allennlp.data.dataset_readers.dataset_utils
class allennlp.data.dataset_readers.dataset_utils.ontonotes.Ontonotes
Bases: object

This DatasetReader is designed to read in the English OntoNotes v5.0 data in the format used by the CoNLL 2011/2012 shared tasks. In order to use this Reader, you must follow the instructions provided here (v12 release), which will allow you to download the CoNLL-style annotations for the OntoNotes v5.0 release – LDC2013T19.tgz, obtained from the LDC.
Once you have run the scripts on the extracted data, you will have a folder structured as follows:
    conll-formatted-ontonotes-5.0/
    └── data
        ├── development
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc ├── bn ├── mz ├── nw ├── pt ├── tc └── wb
        ├── test
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc ├── bn ├── mz ├── nw ├── pt ├── tc └── wb
        └── train
            └── data
                └── english
                    └── annotations
                        ├── bc ├── bn ├── mz ├── nw ├── pt ├── tc └── wb
The file path provided to this class can then be any of the train, test or development directories (or the top-level data directory, if you are not utilizing the splits).
The data has the following format, ordered by column.
- 1. Document ID (str)
  This is a variation on the document filename.
- 2. Part number (int)
  Some files are divided into multiple parts, numbered as 000, 001, 002, etc.
- 3. Word number (int)
  This is the word index of the word in that sentence.
- 4. Word (str)
  This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release.
- 5. POS tag (str)
  This is the Penn Treebank style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
- 6. Parse bit (str)
  This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced with a *. When the parse information is missing, the first word of a sentence is tagged as (TOP*, the last word is tagged as *), and all intermediate words are tagged with a *.
- 7. Predicate lemma (str)
  The predicate lemma is mentioned for the rows for which we have semantic role information or word sense information. All other rows are marked with a "-".
- 8. Predicate Frameset ID (int)
  The PropBank frameset ID of the predicate in column 7.
- 9. Word sense (float)
  This is the word sense of the word in column 3.
- 10. Speaker/Author (str)
  This is the speaker or author name, where available. Mostly present in Broadcast Conversation and Web Log data. When not available, the rows are marked with a "-".
- 11. Named entities (str)
  This column identifies the spans representing various named entities. For documents which do not have named entity annotation, each line is represented with a *.
- 12+. Predicate arguments (str)
  There is one column of predicate argument structure information for each predicate mentioned in column 7. If there are no predicates tagged in a sentence, this is a single column with all rows marked with a *.
- -1. Coreference (str)
  The last column. Coreference chain information encoded in a parenthesis structure. For documents that do not have coreference annotations, each line is represented with a "-".
dataset_document_iterator(self, file_path: str) → Iterator[List[allennlp.data.dataset_readers.dataset_utils.ontonotes.OntonotesSentence]]
An iterator over CoNLL-formatted files which yields documents, regardless of the number of document annotations in a particular file. This is useful for CoNLL data which has been preprocessed, such as the preprocessing which takes place for the 2012 CoNLL Coreference Resolution task.

dataset_iterator(self, file_path: str) → Iterator[allennlp.data.dataset_readers.dataset_utils.ontonotes.OntonotesSentence]
An iterator over the entire dataset, yielding all sentences processed.
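For example, a minimal usage sketch (the directory path is hypothetical; point it at one of your train, test or development directories):

    from allennlp.data.dataset_readers.dataset_utils.ontonotes import Ontonotes

    ontonotes_reader = Ontonotes()
    # Hypothetical path to a CoNLL-formatted OntoNotes split.
    for sentence in ontonotes_reader.dataset_iterator("conll-formatted-ontonotes-5.0/data/train"):
        print(sentence.words, sentence.pos_tags)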
class allennlp.data.dataset_readers.dataset_utils.ontonotes.OntonotesSentence(document_id: str, sentence_id: int, words: List[str], pos_tags: List[str], parse_tree: Optional[nltk.tree.Tree], predicate_lemmas: List[Optional[str]], predicate_framenet_ids: List[Optional[str]], word_senses: List[Optional[float]], speakers: List[Optional[str]], named_entities: List[str], srl_frames: List[Tuple[str, List[str]]], coref_spans: Set[Tuple[int, Tuple[int, int]]])
Bases: object

A class representing the annotations available for a single CoNLL-formatted sentence.
Parameters
- document_id : str
  This is a variation on the document filename.
- sentence_id : int
  The integer ID of the sentence within a document.
- words : List[str]
  These are the tokens as segmented/tokenized in the Treebank.
- pos_tags : List[str]
  This is the Penn-Treebank-style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
- parse_tree : nltk.Tree
  An nltk Tree representing the parse. It includes POS tags as pre-terminal nodes. When the parse information is missing, the parse will be None.
- predicate_lemmas : List[Optional[str]]
  The predicate lemma of the words for which we have semantic role information or word sense information. All other indices are None.
- predicate_framenet_ids : List[Optional[int]]
  The PropBank frameset ID of the lemmas in predicate_lemmas, or None.
- word_senses : List[Optional[float]]
  The word senses for the words in the sentence, or None. These are floats because the word sense can have values after the decimal, like 1.1.
- speakers : List[Optional[str]]
  The speaker information for the words in the sentence, if present, or None. This is the speaker or author name where available, mostly in Broadcast Conversation and Web Log data. When not available, the rows are marked with a "-".
- named_entities : List[str]
  The BIO tags for named entities in the sentence.
- srl_frames : List[Tuple[str, List[str]]]
  A list of tuples, one per verb in the sentence, pairing each verb with its PropBank frame labels in BIO format.
- coref_spans : Set[TypedSpan]
  The spans for entity mentions involved in coreference resolution within the sentence. Each element is a tuple composed of (cluster_id, (start_index, end_index)). Indices are inclusive.
exception allennlp.data.dataset_readers.dataset_utils.span_utils.InvalidTagSequence(tag_sequence=None)
Bases: Exception

allennlp.data.dataset_readers.dataset_utils.span_utils.bio_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to BIO tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are also included (i.e. those which do not start with a "B-LABEL"), as otherwise it is possible to get a perfect precision score whilst still predicting ill-formed spans in addition to the correct spans. This function works properly when the spans are unlabeled (i.e., your labels are simply "B", "I", and "O").

Parameters
- tag_sequence : List[str], required.
  The string class labels for the sequence.
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)). Note that the label does not contain any BIO tag prefixes.
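For example, a quick sketch (the ordering of the returned spans is not guaranteed):

    from allennlp.data.dataset_readers.dataset_utils.span_utils import bio_tags_to_spans

    tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
    print(bio_tags_to_spans(tags))
    # e.g. [('PER', (0, 1)), ('LOC', (3, 3))]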
allennlp.data.dataset_readers.dataset_utils.span_utils.bioul_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to BIOUL tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are not allowed and will raise InvalidTagSequence. This function works properly when the spans are unlabeled (i.e., your labels are simply "B", "I", "O", "U", and "L").

Parameters
- tag_sequence : List[str], required.
  The tag sequence encoded in BIOUL, e.g. ["B-PER", "L-PER", "O"].
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)).
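For example, a quick sketch with a well-formed BIOUL sequence:

    from allennlp.data.dataset_readers.dataset_utils.span_utils import bioul_tags_to_spans

    print(bioul_tags_to_spans(["B-PER", "L-PER", "O", "U-LOC"]))
    # e.g. [('PER', (0, 1)), ('LOC', (3, 3))]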
allennlp.data.dataset_readers.dataset_utils.span_utils.bmes_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to BMES tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are also included (i.e. those which do not start with a "B-LABEL"), as otherwise it is possible to get a perfect precision score whilst still predicting ill-formed spans in addition to the correct spans. This function works properly when the spans are unlabeled (i.e., your labels are simply "B", "M", "E" and "S").

Parameters
- tag_sequence : List[str], required.
  The string class labels for the sequence.
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)). Note that the label does not contain any BIO tag prefixes.
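For example, a quick sketch (span ordering in the output is not guaranteed):

    from allennlp.data.dataset_readers.dataset_utils.span_utils import bmes_tags_to_spans

    print(bmes_tags_to_spans(["B-PER", "M-PER", "E-PER", "S-LOC"]))
    # e.g. [('PER', (0, 2)), ('LOC', (3, 3))]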
allennlp.data.dataset_readers.dataset_utils.span_utils.enumerate_spans(sentence: List[~T], offset: int = 0, max_span_width: int = None, min_span_width: int = 1, filter_function: Callable[[List[~T]], bool] = None) → List[Tuple[int, int]]
Given a sentence, return all token spans within the sentence. Spans are inclusive. Additionally, you can provide a maximum and minimum span width, which will be used to exclude spans outside of this range.

Finally, you can provide a function mapping List[T] -> bool, which will be applied to every span to decide whether that span should be included. This allows filtering by length, regex matches, pos tags or any Spacy Token attributes, for example.

Parameters
- sentence : List[T], required.
  The sentence to generate spans for. The type is generic, as this function can be used with strings, or Spacy Tokens, or other sequences.
- offset : int, optional (default = 0)
  A numeric offset to add to all span start and end indices. This is helpful if the sentence is part of a larger structure, such as a document, which the indices need to respect.
- max_span_width : int, optional (default = None)
  The maximum length of spans which should be included. Defaults to len(sentence).
- min_span_width : int, optional (default = 1)
  The minimum length of spans which should be included. Defaults to 1.
- filter_function : Callable[[List[T]], bool], optional (default = None)
  A function mapping sequences of the passed type T to a boolean value. If True, the span is included in the returned spans from the sentence; otherwise it is excluded.
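For example, a quick sketch enumerating all spans of width at most 2:

    from allennlp.data.dataset_readers.dataset_utils.span_utils import enumerate_spans

    print(enumerate_spans(["The", "quick", "brown", "fox"], max_span_width=2))
    # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]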
allennlp.data.dataset_readers.dataset_utils.span_utils.iob1_tags_to_spans(tag_sequence: List[str], classes_to_ignore: List[str] = None) → List[Tuple[str, Tuple[int, int]]]
Given a sequence corresponding to IOB1 tags, extracts spans. Spans are inclusive and can be of zero length, representing a single word span. Ill-formed spans are also included (i.e., those where "B-LABEL" is not preceded by "I-LABEL" or "B-LABEL").

Parameters
- tag_sequence : List[str], required.
  The string class labels for the sequence.
- classes_to_ignore : List[str], optional (default = None).
  A list of string class labels excluding the bio tag which should be ignored when extracting spans.

Returns
- spans : List[TypedStringSpan]
  The typed, extracted spans from the sequence, in the format (label, (span_start, span_end)). Note that the label does not contain any BIO tag prefixes.
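For example, a quick sketch (in IOB1, an I tag starts a new span when it is not preceded by a tag of the same label):

    from allennlp.data.dataset_readers.dataset_utils.span_utils import iob1_tags_to_spans

    print(iob1_tags_to_spans(["I-ORG", "I-ORG", "O", "I-PER"]))
    # e.g. [('ORG', (0, 1)), ('PER', (3, 3))]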
allennlp.data.dataset_readers.dataset_utils.span_utils.iob1_to_bioul(tag_sequence: List[str]) → List[str]
allennlp.data.dataset_readers.dataset_utils.span_utils.to_bioul(tag_sequence: List[str], encoding: str = 'IOB1') → List[str]
Given a tag sequence encoded with IOB1 labels, recode to BIOUL.

In the IOB1 scheme, I is a token inside a span, O is a token outside a span and B is the beginning of a span immediately following another span of the same type.

In the BIO scheme, I is a token inside a span, O is a token outside a span and B is the beginning of a span.

Parameters
- tag_sequence : List[str], required.
  The tag sequence encoded in IOB1, e.g. ["I-PER", "I-PER", "O"].
- encoding : str, optional (default = 'IOB1').
  The encoding type to convert from. Must be either "IOB1" or "BIO".

Returns
- bioul_sequence : List[str]
  The tag sequence encoded in BIOUL, e.g. ["B-PER", "L-PER", "O"].
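For example, a quick sketch converting from both supported encodings:

    from allennlp.data.dataset_readers.dataset_utils.span_utils import to_bioul

    print(to_bioul(["I-PER", "I-PER", "O"], encoding="IOB1"))
    # ['B-PER', 'L-PER', 'O']
    print(to_bioul(["B-PER", "I-PER", "O"], encoding="BIO"))
    # ['B-PER', 'L-PER', 'O']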
Utility functions for reading the standardised text2sql datasets presented in "Improving Text to SQL Evaluation Methodology".
class allennlp.data.dataset_readers.dataset_utils.text2sql_utils.SqlData
Bases: tuple

A utility class for reading in text2sql data.
Parameters
- text : List[str]
  The tokens in the text of the query.
- text_with_variables : List[str]
  The tokens in the text of the query with variables mapped to table names/abstract variables.
- variable_tags : List[str]
  Labels for each word in text which correspond to which variable in the sql the token is linked to. "O" is used to denote no tag.
- sql : List[str]
  The tokens in the SQL query which corresponds to the text.
- text_variables : Dict[str, str]
  A dictionary of variables associated with the text, e.g. {"city_name0": "san francisco"}.
- sql_variables : Dict[str, Dict[str, str]]
  A dictionary of variables and column references associated with the sql query.
property sql
  Alias for field number 3

property sql_variables
  Alias for field number 5

property text
  Alias for field number 0

property text_variables
  Alias for field number 4

property text_with_variables
  Alias for field number 1

property variable_tags
  Alias for field number 2
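Since SqlData is a named tuple, instances can be constructed directly; a minimal sketch with field values invented for illustration (the inner keys of sql_variables are illustrative only):

    from allennlp.data.dataset_readers.dataset_utils.text2sql_utils import SqlData

    # All field values below are invented for illustration.
    example = SqlData(
        text=["show", "me", "restaurants", "in", "san", "francisco"],
        text_with_variables=["show", "me", "restaurants", "in", "city_name0"],
        variable_tags=["O", "O", "O", "O", "city_name0"],
        sql=["SELECT", "*", "FROM", "RESTAURANT", "WHERE", "CITY_NAME", "=", "'city_name0'"],
        text_variables={"city_name0": "san francisco"},
        sql_variables={"city_name0": {"example": "san francisco"}},  # inner keys illustrative
    )
    print(example.text, example.sql)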
class allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn(name, column_type, is_primary_key)
Bases: tuple

property column_type
  Alias for field number 1

property is_primary_key
  Alias for field number 2

property name
  Alias for field number 0
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.clean_and_split_sql(sql: str) → List[str]
Cleans up and unifies a SQL query. This involves unifying quoted strings and splitting brackets which aren't formatted consistently in the data.
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.clean_unneeded_aliases(sql_tokens: List[str]) → List[str]

allennlp.data.dataset_readers.dataset_utils.text2sql_utils.column_has_numeric_type(column: allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn) → bool

allennlp.data.dataset_readers.dataset_utils.text2sql_utils.column_has_string_type(column: allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn) → bool
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.process_sql_data(data: List[Dict[str, Any]], use_all_sql: bool = False, use_all_queries: bool = False, remove_unneeded_aliases: bool = False, schema: Dict[str, List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]] = None) → Iterable[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.SqlData]
A utility function for reading in text2sql data. The blob is the result of loading the json from a file produced by the script scripts/reformat_text2sql_data.py.

Parameters
- data : JsonDict
- use_all_sql : bool, optional (default = False)
  Whether to use all of the sql queries which have identical semantics, or whether to just use the first one.
- use_all_queries : bool, optional (default = False)
  Whether or not to enforce query sentence uniqueness. If False, duplicated queries will occur in the dataset as separate instances, since for a given SQL query there are not only multiple queries with the same template, but also outright duplicate queries.
- remove_unneeded_aliases : bool, optional (default = False)
  The text2sql data by default creates alias names for all tables, regardless of whether the table is derived or whether it is identical to the original (e.g. SELECT TABLEalias0.COLUMN FROM TABLE AS TABLEalias0). This is unnecessary and makes the action sequence and grammar manipulation much harder in a grammar-based decoder. Note that this does not remove aliases which are legitimately required, such as when a new table is formed by performing operations on the original table.
- schema : Dict[str, List[TableColumn]], optional (default = None)
  A schema to resolve primary keys against. Converts 'ID' column names to their actual name with respect to the Primary Key for the table in the schema.
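A usage sketch (the file path is hypothetical; the file is assumed to be the JSON output of scripts/reformat_text2sql_data.py):

    import json

    from allennlp.data.dataset_readers.dataset_utils import text2sql_utils

    # Hypothetical path to a reformatted text2sql JSON file.
    with open("data/text2sql/restaurants.json") as data_file:
        data = json.load(data_file)

    for sql_data in text2sql_utils.process_sql_data(data, use_all_sql=False):
        print(sql_data.text, sql_data.sql)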
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.read_dataset_schema(schema_path: str) → Dict[str, List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]]
Reads a schema from the text2sql data, returning a dictionary mapping table names to their columns and respective types. This handles columns in an arbitrary order and also allows either {Table, Field} or {Table, Field} Name as headers, because both appear in the data. It also uppercases table and column names if they are not already uppercase.

Parameters
- schema_path : str, required.
  The path to the csv schema.

Returns
- A dictionary mapping table names to typed columns.
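A usage sketch (the csv path is hypothetical):

    from allennlp.data.dataset_readers.dataset_utils.text2sql_utils import read_dataset_schema

    # Hypothetical path to a text2sql schema csv.
    schema = read_dataset_schema("data/text2sql/restaurants-schema.csv")
    for table_name, columns in schema.items():
        print(table_name, [(c.name, c.column_type, c.is_primary_key) for c in columns])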
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.replace_variables(sentence: List[str], sentence_variables: Dict[str, str]) → Tuple[List[str], List[str]]
Replaces abstract variables in text with their concrete counterparts.
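A sketch of the expected behaviour, assuming multi-word variable values are split into individual tokens, each tagged with the variable it came from:

    from allennlp.data.dataset_readers.dataset_utils.text2sql_utils import replace_variables

    tokens, tags = replace_variables(
        ["what", "is", "city_name0"], {"city_name0": "san francisco"}
    )
    # tokens: ['what', 'is', 'san', 'francisco']
    # tags:   ['O', 'O', 'city_name0', 'city_name0']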
allennlp.data.dataset_readers.dataset_utils.text2sql_utils.resolve_primary_keys_in_schema(sql_tokens: List[str], schema: Dict[str, List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]]) → List[str]
Some examples in the text2sql datasets use ID as a column reference to the column of a table which has a primary key. This causes problems if you are trying to constrain a grammar to only produce the column names directly, because you don't know what ID refers to. So instead of dealing with that, we just replace it.