class PrecoReader(DatasetReader):
 | def __init__(
 |     self,
 |     max_span_width: int,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     wordpiece_modeling_tokenizer: Optional[PretrainedTransformerTokenizer] = None,
 |     max_sentences: int = None,
 |     remove_singleton_clusters: bool = True,
 |     **kwargs
 | ) -> None

Reads a single JSON-lines file for the PreCo dataset. Each line contains a "sentences" key for a list of sentences and a "mention_clusters" key for the clusters.
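As a rough illustration of the input format described above, the following sketch parses one JSON-lines record with the standard library. The example record is made up, and the layout of each mention as three integers is an assumption taken from the `Tuple[int, int, int]` type in the `text_to_instance` signature; `parse_preco_line` is a hypothetical helper, not part of the reader.

```python
import json
from typing import List, Tuple

# Illustrative record only: the tokens and cluster indices are invented.
line = json.dumps({
    "sentences": [["Alice", "saw", "Bob", "."], ["She", "waved", "."]],
    "mention_clusters": [[[0, 0, 1], [1, 0, 1]], [[0, 2, 3]]],
})


def parse_preco_line(raw: str) -> Tuple[List[List[str]], List[List[List[int]]]]:
    """Return (sentences, mention_clusters) from one JSON-lines record."""
    record = json.loads(raw)
    return record["sentences"], record["mention_clusters"]


sentences, mention_clusters = parse_preco_line(line)
print(len(sentences), "sentences,", len(mention_clusters), "clusters")
```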

Returns a Dataset where the Instances have up to four fields: text, a TextField containing the full document text; spans, a ListField[SpanField] of inclusive start and end indices for span candidates; and metadata, a MetadataField that stores the instance's original text. For data with gold cluster labels, we also include the original clusters (a list of lists of index pairs) and a SequenceLabelField of cluster ids for every span candidate.
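To make "inclusive start and end indices for span candidates" concrete, here is a minimal sketch of enumerating candidates up to a maximum width. It is not the reader's own code: `enumerate_candidate_spans` and its `offset` parameter are hypothetical names, and the sketch assumes that a span's width is counted in tokens and that candidates do not cross the token range passed in.

```python
from typing import List, Tuple


def enumerate_candidate_spans(
    num_tokens: int, max_span_width: int, offset: int = 0
) -> List[Tuple[int, int]]:
    """List every (start, end) pair, both inclusive, spanning at most
    max_span_width tokens. `offset` shifts indices, e.g. to place a
    sentence inside a longer document."""
    spans = []
    for start in range(num_tokens):
        # `end` is inclusive, so width = end - start + 1 <= max_span_width.
        last_end = min(start + max_span_width, num_tokens)
        for end in range(start, last_end):
            spans.append((offset + start, offset + end))
    return spans


spans = enumerate_candidate_spans(num_tokens=4, max_span_width=2)
print(spans)
```

The number of candidates grows roughly as the token count times max_span_width, which is why the width is capped.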


  • max_span_width : int
    The maximum width of candidate spans to consider.
  • token_indexers : Dict[str, TokenIndexer], optional
    This is used to index the words in the document. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
  • wordpiece_modeling_tokenizer : PretrainedTransformerTokenizer, optional (default = None)
    If not None, this dataset reader does subword tokenization using the supplied tokenizer and distributes the labels to the resulting wordpieces. All the modeling will be based on wordpieces. If this is None (default), the user is expected to use PretrainedTransformerMismatchedIndexer and PretrainedTransformerMismatchedEmbedder, and the modeling will be on the word level.
  • max_sentences : int, optional (default = None)
    The maximum number of sentences in each document to keep. By default keeps all sentences.
  • remove_singleton_clusters : bool, optional (default = True)
    Some datasets contain singleton clusters (i.e. mentions with no coreferents). If True, these clusters are removed.
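The effect of max_sentences and remove_singleton_clusters can be sketched with plain lists. This is only an approximation under stated assumptions: `truncate_sentences` and `drop_singleton_clusters` are hypothetical helpers, and the sketch does not model how the reader handles mentions that fall in sentences removed by truncation.

```python
from typing import List, Optional, Sequence


def truncate_sentences(
    sentences: List[List[str]], max_sentences: Optional[int] = None
) -> List[List[str]]:
    """Keep the first max_sentences sentences, or all of them when None."""
    if max_sentences is None:
        return sentences
    return sentences[:max_sentences]


def drop_singleton_clusters(clusters: List[Sequence]) -> List[Sequence]:
    """Keep only clusters with at least two mentions."""
    return [cluster for cluster in clusters if len(cluster) > 1]


# Illustrative data only.
doc = [["Alice", "saw", "Bob", "."], ["She", "waved", "."], ["He", "left", "."]]
clusters = [[(0, 0), (4, 4)], [(2, 2)]]

kept_sentences = truncate_sentences(doc, max_sentences=2)
kept_clusters = drop_singleton_clusters(clusters)
print(len(kept_sentences), kept_clusters)
```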


class PrecoReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     sentences: List[List[str]],
 |     gold_clusters: Optional[List[List[Tuple[int, int, int]]]] = None
 | ) -> Instance
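The signature takes mentions as integer triples, while the instance stores clusters as index pairs, so some conversion is implied. The sketch below shows one way this could work. It assumes each triple is (sentence index, start, exclusive end) relative to its sentence and that the stored pairs are document-level inclusive indices; both are assumptions, and `to_document_spans` is a hypothetical helper rather than the reader's method.

```python
from typing import List, Tuple


def to_document_spans(
    sentences: List[List[str]],
    gold_clusters: List[List[Tuple[int, int, int]]],
) -> List[List[Tuple[int, int]]]:
    """Convert (sentence index, start, exclusive end) triples into
    document-level (start, inclusive end) pairs."""
    # offsets[i] is the document index of the first token of sentence i.
    offsets = [0]
    for sentence in sentences[:-1]:
        offsets.append(offsets[-1] + len(sentence))
    return [
        [(offsets[sent] + start, offsets[sent] + end - 1) for sent, start, end in cluster]
        for cluster in gold_clusters
    ]


# Illustrative data only.
sentences = [["Alice", "saw", "Bob", "."], ["She", "waved", "."]]
gold_clusters = [[(0, 0, 1), (1, 0, 1)]]
document_clusters = to_document_spans(sentences, gold_clusters)
print(document_clusters)
```

Under these assumptions, "Alice" (sentence 0, token 0) and "She" (sentence 1, token 0) map to document positions 0 and 4.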