
conll

allennlp_models.coref.dataset_readers.conll

ConllCorefReader#

@DatasetReader.register("coref")
class ConllCorefReader(DatasetReader):
 | def __init__(
 |     self,
 |     max_span_width: int,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     wordpiece_modeling_tokenizer: Optional[PretrainedTransformerTokenizer] = None,
 |     max_sentences: int = None,
 |     remove_singleton_clusters: bool = False,
 |     **kwargs
 | ) -> None

Reads a single CoNLL-formatted file. This is the same file format as used by allennlp.data.dataset_readers.semantic_role_labelling.SrlReader, but preprocessed to dump all documents into a single file per train, dev, and test split. See scripts/compile_coref_data.sh for details on how to preprocess the Ontonotes 5.0 data into the correct format.

Returns a Dataset whose Instances have four fields: text, a TextField containing the full document text; spans, a ListField[SpanField] of inclusive start and end indices for span candidates; and metadata, a MetadataField that stores the instance's original text. For data with gold cluster labels, we also include the original clusters (a list of lists of index pairs) and span_labels, a SequenceLabelField of cluster ids for every span candidate.
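For gold-annotated data, each cluster is a list of inclusive (start, end) token index pairs, and every span candidate is labeled with the id of the cluster containing it, or -1 if it is not a gold mention. A minimal pure-Python sketch of that mapping (the helper name is illustrative, not the reader's actual code):

```python
from typing import Dict, List, Tuple

def span_cluster_ids(
    candidate_spans: List[Tuple[int, int]],
    gold_clusters: List[List[Tuple[int, int]]],
) -> List[int]:
    """Label each candidate span with its gold cluster id, or -1 for non-mentions."""
    # Invert the cluster structure into a span -> cluster-id lookup.
    span_to_cluster: Dict[Tuple[int, int], int] = {
        span: cluster_id
        for cluster_id, cluster in enumerate(gold_clusters)
        for span in cluster
    }
    return [span_to_cluster.get(span, -1) for span in candidate_spans]

clusters = [[(0, 0), (4, 5)], [(2, 3)]]        # two gold clusters of inclusive spans
candidates = [(0, 0), (1, 2), (2, 3), (4, 5)]  # span candidates
span_cluster_ids(candidates, clusters)         # → [0, -1, 1, 0]
```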

Parameters

  • max_span_width : int
    The maximum width of candidate spans to consider.
  • token_indexers : Dict[str, TokenIndexer], optional
    This is used to index the words in the document. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
  • wordpiece_modeling_tokenizer : PretrainedTransformerTokenizer, optional (default = None)
If not None, this dataset reader performs subword tokenization using the supplied tokenizer and distributes the labels to the resulting wordpieces. All modeling will then be based on wordpieces. If this is None (default), the user is expected to use PretrainedTransformerMismatchedIndexer and PretrainedTransformerMismatchedEmbedder, and the modeling will be at the word level.
  • max_sentences : int, optional (default = None)
    The maximum number of sentences in each document to keep. By default keeps all sentences.
  • remove_singleton_clusters : bool, optional (default = False)
Some datasets contain clusters that are singletons (i.e., have no coreferents). This option allows removing them. Ontonotes shouldn't have these, and this option should be used for testing only.
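max_span_width bounds the candidate spans the reader enumerates: every contiguous span of at most that many tokens becomes a SpanField. A hedged pure-Python sketch of that enumeration (illustrative; the reader's actual implementation lives in AllenNLP's span utilities):

```python
from typing import List, Tuple

def enumerate_candidate_spans(num_tokens: int, max_span_width: int) -> List[Tuple[int, int]]:
    """All inclusive (start, end) spans of width <= max_span_width."""
    return [
        (start, end)
        for start in range(num_tokens)
        # `end` is inclusive, so width end - start + 1 stays <= max_span_width.
        for end in range(start, min(start + max_span_width, num_tokens))
    ]

enumerate_candidate_spans(4, 2)
# → [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]
```

Note that the number of candidates grows roughly linearly in document length times max_span_width, which is why the width is capped rather than enumerating all O(n²) spans.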

text_to_instance#

class ConllCorefReader(DatasetReader):
 | ...
 | @overrides
 | def text_to_instance(
 |     self,
 |     sentences: List[List[str]],
 |     gold_clusters: Optional[List[List[Tuple[int, int]]]] = None
 | ) -> Instance
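Here, sentences is a list of tokenized sentences, while the (start, end) pairs in gold_clusters index into the flattened document, not into individual sentences. A small sketch of inputs in that shape (the example sentences and clusters are invented for illustration):

```python
from typing import List, Tuple

# A two-sentence "document", already tokenized.
sentences: List[List[str]] = [
    ["John", "saw", "Mary", "."],
    ["He", "waved", "."],
]

# Gold spans use document-level token indices across sentence boundaries:
# "John" is token 0 and "He" is token 4 in the flattened document.
gold_clusters: List[List[Tuple[int, int]]] = [[(0, 0), (4, 4)]]

flat = [token for sentence in sentences for token in sentence]
assert [flat[start] for start, end in gold_clusters[0]] == ["John", "He"]
```

These are exactly the structures the reader builds from the CoNLL file before calling text_to_instance, so the method can also be used directly for custom pre-tokenized documents.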