allennlp.data.dataset_readers.coreference_resolution
class allennlp.data.dataset_readers.coreference_resolution.conll.ConllCorefReader(max_span_width: int, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)
Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads a single CoNLL-formatted file. This is the same file format as used in the SrlReader, but is preprocessed to dump all documents into a single file per train, dev and test split. See scripts/compile_coref_data.sh for more details of how to pre-process the Ontonotes 5.0 data into the correct format.
Returns a Dataset where the Instances have four fields: text, a TextField containing the full document text, spans, a ListField[SpanField] of inclusive start and end indices for span candidates, and metadata, a MetadataField that stores the instance’s original text. For data with gold cluster labels, we also include the original clusters (a list of lists of index pairs) and a SequenceLabelField of cluster ids for every span candidate.
- Parameters
- max_span_width
int, required. The maximum width of candidate spans to consider.
- token_indexers
Dict[str, TokenIndexer], optional. This is used to index the words in the document. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
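To make the effect of max_span_width concrete, here is a minimal pure-Python sketch of how candidate spans with inclusive start and end indices could be enumerated. The function name enumerate_candidate_spans is hypothetical, and the choice to keep every span inside a single sentence is an assumption of this sketch, not something this page states about the reader.

```python
from typing import List, Tuple


def enumerate_candidate_spans(
    sentences: List[List[str]], max_span_width: int
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) document-level indices for every span
    of at most ``max_span_width`` tokens (hypothetical helper).

    Assumption: spans never cross a sentence boundary.
    """
    spans = []
    offset = 0  # document index of the current sentence's first token
    for sentence in sentences:
        for start in range(len(sentence)):
            # Exclusive upper bound on the end index for this start position.
            end_bound = min(start + max_span_width, len(sentence))
            for end in range(start, end_bound):
                spans.append((offset + start, offset + end))
        offset += len(sentence)
    return spans


sentences = [["Mary", "left"], ["She", "was", "tired"]]
spans = enumerate_candidate_spans(sentences, max_span_width=2)
```

With max_span_width=2, a five-token document split into two sentences yields eight candidates here; a larger width grows the candidate list quickly, which is why the parameter is required.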
text_to_instance(self, sentences: List[List[str]], gold_clusters: Union[List[List[Tuple[int, int]]], NoneType] = None) → allennlp.data.instance.Instance
- Parameters
- sentences
List[List[str]], required. A list of lists representing the tokenised words and sentences in the document.
- gold_clusters
Optional[List[List[Tuple[int, int]]]], optional (default = None). A list of all clusters in the document, represented as word spans. Each cluster contains some number of spans, which can be nested and overlap, but will never exactly match between clusters.
- Returns
An Instance containing the following Fields:
- text
TextField. The text of the full document.
- spans
ListField[SpanField]. A ListField containing the spans represented as SpanFields with respect to the document text.
- span_labels
SequenceLabelField, optional. The id of the cluster which each possible span belongs to, or -1 if it does not belong to a cluster. As these labels have variable length (it depends on how many spans we are considering), we represent this as a SequenceLabelField with respect to the spans ListField.
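The span_labels description can be illustrated with a short sketch: each candidate span receives the id of the gold cluster that contains it, or -1 otherwise. The helper label_spans is hypothetical, and using a cluster's position in gold_clusters as its id is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def label_spans(spans: List[Span], gold_clusters: List[List[Span]]) -> List[int]:
    """Return one label per candidate span: its cluster id, or -1 if the
    span is not a gold mention (hypothetical helper)."""
    cluster_of: Dict[Span, int] = {}
    for cluster_id, cluster in enumerate(gold_clusters):
        for mention in cluster:
            # Spans never repeat across clusters, so this mapping is unambiguous.
            cluster_of[tuple(mention)] = cluster_id
    return [cluster_of.get(span, -1) for span in spans]


candidate_spans = [(0, 0), (0, 1), (1, 1), (2, 2)]
gold_clusters = [[(0, 0), (2, 2)]]
labels = label_spans(candidate_spans, gold_clusters)
```

The label list always has the same length as the span list, which is why the labels are expressed relative to the spans rather than to the tokens.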
allennlp.data.dataset_readers.coreference_resolution.conll.canonicalize_clusters(clusters: DefaultDict[int, List[Tuple[int, int]]]) → List[List[Tuple[int, int]]]
The CoNLL 2012 data includes 2 annotated spans which are identical, but have different ids. This checks all clusters for spans which are identical, and if it finds any, merges the clusters containing the identical spans.
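The merging behaviour described above could be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the library's own code: the name merge_clusters_sharing_spans is hypothetical, and the ordering of the returned clusters is an artifact of this sketch.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Span = Tuple[int, int]


def merge_clusters_sharing_spans(
    clusters: DefaultDict[int, List[Span]]
) -> List[List[Span]]:
    """Merge any clusters that contain an identical span (hypothetical helper)."""
    merged: List[Set[Span]] = []
    for mentions in clusters.values():
        current = set(mentions)
        untouched = []
        for existing in merged:
            if existing & current:
                # A shared span means both ids refer to the same entity.
                current |= existing
            else:
                untouched.append(existing)
        untouched.append(current)
        merged = untouched
    # Sort mentions inside each cluster so the output is deterministic.
    return [sorted(cluster) for cluster in merged]


clusters: DefaultDict[int, List[Span]] = defaultdict(list)
clusters[0] = [(0, 1), (5, 5)]
clusters[1] = [(8, 9)]
clusters[2] = [(5, 5), (12, 13)]  # shares (5, 5) with cluster 0
canonical = merge_clusters_sharing_spans(clusters)
```

After merging, the original cluster ids are discarded; only the grouping of spans survives.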
class allennlp.data.dataset_readers.coreference_resolution.winobias.WinobiasReader(max_span_width: int, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)
Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
TODO(Mark): Add paper reference.
Winobias is a dataset for analysing gender bias in coreference resolution. It contains simple sentences with pro- and anti-stereotypical gender associations, which are used to measure the bias of a coreference system trained on another corpus. It is effectively a toy dataset and, as such, uses very simplistic language; it has little use outside of evaluating a model for bias.
The dataset is formatted with a single sentence per line, with a maximum of 2 non-nested coreference clusters annotated using either square or round brackets. For example:
[The salesperson] sold (some books) to the librarian because [she] was trying to sell (them).
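As a rough illustration of the bracket format, the sketch below splits such a line on whitespace and recovers one cluster per bracket type as inclusive word-index spans. The function parse_winobias_line is hypothetical, and the whitespace tokenisation (which leaves trailing punctuation attached to the last word) is an assumption of this sketch rather than a description of the reader's own parsing.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]
OPENERS = {"[": "square", "(": "round"}
CLOSERS = {"]": "square", ")": "round"}


def parse_winobias_line(line: str) -> Tuple[List[str], List[List[Span]]]:
    """Return the bracket-free words and up to two clusters of inclusive
    (start, end) word spans (hypothetical helper)."""
    words: List[str] = []
    clusters: Dict[str, List[Span]] = {"square": [], "round": []}
    open_start: Dict[str, int] = {}
    for index, raw in enumerate(line.split()):
        # An opening bracket at the front of a word starts a mention here.
        if raw[0] in OPENERS:
            open_start[OPENERS[raw[0]]] = index
            raw = raw[1:]
        # A closing bracket anywhere in the word ends the open mention.
        for closer, kind in CLOSERS.items():
            if closer in raw:
                clusters[kind].append((open_start.pop(kind), index))
                raw = raw.replace(closer, "")
        words.append(raw)
    return words, [spans for spans in clusters.values() if spans]


line = (
    "[The salesperson] sold (some books) to the librarian "
    "because [she] was trying to sell (them)."
)
words, gold_clusters = parse_winobias_line(line)
```

Here the square brackets group "The salesperson" with "she", and the round brackets group "some books" with "them".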
Returns a list of Instances which have four fields: text, a TextField containing the full sentence text, spans, a ListField[SpanField] of inclusive start and end indices for span candidates, and metadata, a MetadataField that stores the instance’s original text. For data with gold cluster labels, we also include the original clusters (a list of lists of index pairs) and a SequenceLabelField of cluster ids for every span candidate in the metadata.
- Parameters
- max_span_width
int, required. The maximum width of candidate spans to consider.
- token_indexers
Dict[str, TokenIndexer], optional. This is used to index the words in the sentence. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
text_to_instance(self, sentence: List[allennlp.data.tokenizers.token.Token], gold_clusters: Union[List[List[Tuple[int, int]]], NoneType] = None) → allennlp.data.instance.Instance
- Parameters
- sentence
List[Token], required. The already tokenised sentence to analyse.
- gold_clusters
Optional[List[List[Tuple[int, int]]]], optional (default = None). A list of all clusters in the sentence, represented as word spans. Each cluster contains some number of spans, which can be nested and overlap, but will never exactly match between clusters.
- Returns
An Instance containing the following Fields:
- text
TextField. The text of the full sentence.
- spans
ListField[SpanField]. A ListField containing the spans represented as SpanFields with respect to the sentence text.
- span_labels
SequenceLabelField, optional. The id of the cluster which each possible span belongs to, or -1 if it does not belong to a cluster. As these labels have variable length (it depends on how many spans we are considering), we represent this as a SequenceLabelField with respect to the spans ListField.