preco
allennlp_models.coref.dataset_readers.preco
PrecoReader
@DatasetReader.register("preco")
class PrecoReader(DatasetReader):
| def __init__(
| self,
| max_span_width: int,
| token_indexers: Dict[str, TokenIndexer] = None,
| wordpiece_modeling_tokenizer: Optional[PretrainedTransformerTokenizer] = None,
| max_sentences: int = None,
| remove_singleton_clusters: bool = True,
| **kwargs
| ) -> None
Reads a single JSON-lines file for the PreCo dataset. Each line contains a "sentences" key for a list of sentences and a "mention_clusters" key for the clusters.
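As a rough illustration of the input format described above, the sketch below parses one JSON line with the standard library only. The sample document is made up, and reading each mention as a three-integer list is an assumption based on the gold_clusters type in the text_to_instance signature.

```python
import json

# A made-up line in the shape described above (not real PreCo data).
line = json.dumps({
    "sentences": [
        ["Mary", "saw", "a", "dog", "."],
        ["She", "petted", "it", "."],
    ],
    # Assumption: each mention is a list of three integers.
    "mention_clusters": [
        [[0, 0, 1], [1, 0, 1]],
        [[0, 2, 4], [1, 2, 3]],
    ],
})

record = json.loads(line)
sentences = record["sentences"]        # list of tokenized sentences
clusters = record["mention_clusters"]  # list of clusters, each a list of mentions

# Total number of tokens in the document.
num_tokens = sum(len(sentence) for sentence in sentences)
```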
Returns a Dataset where the Instances have the following fields: text, a TextField containing the full document text; spans, a ListField[SpanField] of inclusive start and end indices for span candidates; and metadata, a MetadataField that stores the instance's original text. For data with gold cluster labels, we also include the original clusters (a list of lists of index pairs) and a SequenceLabelField of cluster ids for every span candidate.
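To make the span and label fields more concrete, here is a minimal plain-Python sketch of how candidate spans with inclusive indices could be enumerated and labeled. It is not the reader's implementation: the helper names are hypothetical, and two details are assumptions, namely that candidates never cross sentence boundaries and that -1 marks a span belonging to no cluster.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, document-level


def enumerate_candidate_spans(sentences: List[List[str]], max_span_width: int) -> List[Span]:
    """Hypothetical helper: list every span of at most max_span_width tokens."""
    spans = []
    offset = 0  # index of the first token of the current sentence
    for sentence in sentences:
        for start in range(len(sentence)):
            # Assumption: spans stay inside their sentence.
            last = min(start + max_span_width, len(sentence))
            for end in range(start, last):
                spans.append((offset + start, offset + end))
        offset += len(sentence)
    return spans


def label_spans(spans: List[Span], clusters: List[List[Span]]) -> List[int]:
    """Hypothetical helper: cluster id per span, -1 (assumed) for unclustered spans."""
    span_to_cluster = {
        mention: cluster_id
        for cluster_id, cluster in enumerate(clusters)
        for mention in cluster
    }
    return [span_to_cluster.get(span, -1) for span in spans]


sentences = [["Mary", "saw", "a", "dog", "."], ["She", "petted", "it", "."]]
clusters = [[(0, 0), (5, 5)], [(2, 3), (7, 7)]]
spans = enumerate_candidate_spans(sentences, max_span_width=2)
labels = label_spans(spans, clusters)
```

The labels list is aligned with spans, one cluster id per candidate, which is the shape a SequenceLabelField over the span list expects.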
Parameters
- max_span_width : int
  The maximum width of candidate spans to consider.
- token_indexers : Dict[str, TokenIndexer], optional
  This is used to index the words in the document. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
- wordpiece_modeling_tokenizer : PretrainedTransformerTokenizer, optional (default = None)
  If not None, this dataset reader does subword tokenization using the supplied tokenizer and distributes the labels to the resulting wordpieces. All the modeling will be based on wordpieces. If this is set to None (default), the user is expected to use PretrainedTransformerMismatchedIndexer and PretrainedTransformerMismatchedEmbedder, and the modeling will be on the word-level.
- max_sentences : int, optional (default = None)
  The maximum number of sentences in each document to keep. By default keeps all sentences.
- remove_singleton_clusters : bool, optional (default = True)
  Some datasets contain clusters that are singletons (i.e. no coreferents). This option allows the removal of them.
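The effect of max_sentences and remove_singleton_clusters can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the reader's code: the function name is hypothetical, clusters are assumed to hold document-level (start, end) pairs with inclusive ends, and dropping mentions that fall outside the kept sentences is an assumed behaviour.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, document-level


def truncate_and_filter(
    sentences: List[List[str]],
    clusters: List[List[Span]],
    max_sentences: Optional[int] = None,
    remove_singleton_clusters: bool = True,
) -> Tuple[List[List[str]], List[List[Span]]]:
    """Hypothetical helper mirroring the two options described above."""
    if max_sentences is not None:
        sentences = sentences[:max_sentences]
        total_length = sum(len(sentence) for sentence in sentences)
        # Assumption: mentions ending beyond the kept text are discarded.
        clusters = [
            [(start, end) for start, end in cluster if end < total_length]
            for cluster in clusters
        ]
        clusters = [cluster for cluster in clusters if cluster]
    if remove_singleton_clusters:
        # A singleton cluster has one mention and therefore no coreferents.
        clusters = [cluster for cluster in clusters if len(cluster) > 1]
    return sentences, clusters


sentences = [["Mary", "saw", "a", "dog", "."], ["She", "petted", "it", "."]]
clusters = [[(0, 0), (5, 5)], [(2, 3), (7, 7)], [(1, 1)]]

_, default_clusters = truncate_and_filter(sentences, clusters)
kept, truncated = truncate_and_filter(sentences, clusters, max_sentences=1)
_, with_singletons = truncate_and_filter(
    sentences, clusters, max_sentences=1, remove_singleton_clusters=False
)
```

Note that truncation can turn a multi-mention cluster into a singleton, so in this sketch the singleton filter runs after the sentence limit is applied.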
text_to_instance
class PrecoReader(DatasetReader):
| ...
| def text_to_instance(
| self,
| sentences: List[List[str]],
| gold_clusters: Optional[List[List[Tuple[int, int, int]]]] = None
| ) -> Instance
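The gold_clusters argument uses three integers per mention, whereas the spans field uses document-level index pairs. The sketch below shows one way the conversion could look. It assumes each mention is (sentence index, start, exclusive end) relative to its sentence; the signature above only fixes the three-integer type, so that reading and the helper name are assumptions.

```python
from typing import List, Tuple


def to_document_spans(
    sentences: List[List[str]],
    gold_clusters: List[List[Tuple[int, int, int]]],
) -> List[List[Tuple[int, int]]]:
    """Hypothetical helper: sentence-relative mentions to inclusive document spans."""
    # offsets[i] is the document index of the first token of sentence i.
    offsets = []
    total = 0
    for sentence in sentences:
        offsets.append(total)
        total += len(sentence)
    return [
        [
            # Assumption: end is exclusive, so subtract 1 for an inclusive end.
            (offsets[sent_idx] + start, offsets[sent_idx] + end - 1)
            for sent_idx, start, end in cluster
        ]
        for cluster in gold_clusters
    ]


sentences = [["Mary", "saw", "a", "dog", "."], ["She", "petted", "it", "."]]
gold_clusters = [[(0, 0, 1), (1, 0, 1)], [(0, 2, 4), (1, 2, 3)]]
document_clusters = to_document_spans(sentences, gold_clusters)
```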