conll
ConllCorefReader
class ConllCorefReader(DatasetReader):
| def __init__(
| self,
| max_span_width: int,
| token_indexers: Dict[str, TokenIndexer] = None,
| wordpiece_modeling_tokenizer: Optional[PretrainedTransformerTokenizer] = None,
| max_sentences: int = None,
| remove_singleton_clusters: bool = False,
| **kwargs
| ) -> None
Reads a single CoNLL-formatted file. This is the same file format as used in the SrlReader, but the data is preprocessed so that all documents are dumped into a single file for each of the train, dev and test splits. See scripts/compile_coref_data.sh for more details of how to pre-process the Ontonotes 5.0 data into the correct format.
Returns a Dataset where the Instances have up to four fields: text, a TextField containing the full document text; spans, a ListField[SpanField] of inclusive start and end indices for span candidates; and metadata, a MetadataField that stores the instance's original text. For data with gold cluster labels, we also include the original clusters (a list of lists of index pairs) and a fourth field, a SequenceLabelField of cluster ids for every span candidate.
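The span and label fields described above can be illustrated with a small pure-Python sketch. This is not the reader's own code: the helper names enumerate_candidate_spans and label_spans are hypothetical, and the sketch assumes that spans never cross sentence boundaries, that indices are offsets into the flattened document, and that spans outside every gold cluster get the label -1.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, document-level token offsets


def enumerate_candidate_spans(sentences: List[List[str]], max_span_width: int) -> List[Span]:
    """List every span of at most max_span_width tokens that stays inside one sentence."""
    spans: List[Span] = []
    offset = 0  # position of the current sentence's first token in the flattened document
    for sentence in sentences:
        for start in range(len(sentence)):
            # Exclusive upper bound on the end index, clipped to the sentence length.
            last_end = min(start + max_span_width, len(sentence))
            for end in range(start, last_end):
                spans.append((offset + start, offset + end))
        offset += len(sentence)
    return spans


def label_spans(
    spans: List[Span],
    gold_clusters: List[List[Span]],
    no_cluster_label: int = -1,  # assumption: label used for spans in no cluster
) -> List[int]:
    """Give each candidate span the id of its gold cluster, or no_cluster_label."""
    cluster_of: Dict[Span, int] = {}
    for cluster_id, cluster in enumerate(gold_clusters):
        for mention in cluster:
            cluster_of[tuple(mention)] = cluster_id
    return [cluster_of.get(span, no_cluster_label) for span in spans]


sentences = [["Mary", "saw", "John"], ["She", "waved"]]
gold_clusters = [[(0, 0), (3, 3)]]  # "Mary" and "She" corefer

spans = enumerate_candidate_spans(sentences, max_span_width=2)
labels = label_spans(spans, gold_clusters)
```

With max_span_width=2, only one-token and two-token spans are produced, so the number of candidates grows linearly with document length rather than quadratically.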
Parameters
- max_span_width : int
  The maximum width of candidate spans to consider.
- token_indexers : Dict[str, TokenIndexer], optional
  This is used to index the words in the document. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
- wordpiece_modeling_tokenizer : PretrainedTransformerTokenizer, optional (default = None)
  If not None, this dataset reader does subword tokenization using the supplied tokenizer and distributes the labels to the resulting wordpieces. All the modeling will be based on wordpieces. If this is set to None (default), the user is expected to use PretrainedTransformerMismatchedIndexer and PretrainedTransformerMismatchedEmbedder, and the modeling will be on the word level.
- max_sentences : int, optional (default = None)
  The maximum number of sentences in each document to keep. By default keeps all sentences.
- remove_singleton_clusters : bool, optional (default = False)
  Some datasets contain clusters that are singletons (i.e. no coreferents). This option allows the removal of them. Ontonotes shouldn't have these, and this option should be used for testing only.
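The effect of max_sentences and remove_singleton_clusters can be sketched in plain Python. This is a hedged illustration, not the reader's implementation: truncate_document and drop_singletons are hypothetical helpers, and the sketch assumes that mentions ending past the last kept token are discarded and that singleton removal runs after truncation.

```python
from typing import List, Optional, Tuple

Mention = Tuple[int, int]  # inclusive (start, end) offsets into the flattened document
Cluster = List[Mention]


def truncate_document(
    sentences: List[List[str]],
    clusters: List[Cluster],
    max_sentences: Optional[int],
) -> Tuple[List[List[str]], List[Cluster]]:
    """Keep the first max_sentences sentences and the mentions that still fit."""
    if max_sentences is None or len(sentences) <= max_sentences:
        return sentences, clusters
    kept = sentences[:max_sentences]
    num_tokens = sum(len(sentence) for sentence in kept)
    trimmed: List[Cluster] = []
    for cluster in clusters:
        # Assumption: a mention survives only if its end index is inside the kept text.
        surviving = [mention for mention in cluster if mention[1] < num_tokens]
        if surviving:
            trimmed.append(surviving)
    return kept, trimmed


def drop_singletons(clusters: List[Cluster]) -> List[Cluster]:
    """Remove clusters that contain a single mention (no coreferents)."""
    return [cluster for cluster in clusters if len(cluster) > 1]


sentences = [["Mary", "saw", "John"], ["She", "waved"], ["He", "smiled"]]
clusters = [[(0, 0), (3, 3)], [(2, 2), (5, 5)]]  # Mary/She and John/He

kept_sentences, kept_clusters = truncate_document(sentences, clusters, max_sentences=2)
final_clusters = drop_singletons(kept_clusters)
```

In this toy document, truncating to two sentences cuts "He" out of the second cluster and leaves "John" as a singleton, which the second helper then removes.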
text_to_instance
class ConllCorefReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| sentences: List[List[str]],
| gold_clusters: Optional[List[List[Tuple[int, int]]]] = None
| ) -> Instance
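The argument shapes in this signature can be made concrete with a small example. The data below is invented for illustration, and the sketch assumes that each index pair in gold_clusters is an inclusive (start, end) offset into the document's tokens after the sentences are concatenated. The final call is left as a comment because it needs an already constructed reader.

```python
from typing import List, Tuple

# A document is a list of sentences; each sentence is a list of tokens.
sentences: List[List[str]] = [
    ["Mary", "saw", "John", "."],
    ["She", "waved", "at", "him", "."],
]

# Each cluster is a list of inclusive (start, end) pairs.
# Assumption: offsets count tokens across the whole document, not per sentence.
gold_clusters: List[List[Tuple[int, int]]] = [
    [(0, 0), (4, 4)],  # "Mary", "She"
    [(2, 2), (7, 7)],  # "John", "him"
]

flat_tokens = [token for sentence in sentences for token in sentence]


def mention_text(span: Tuple[int, int]) -> str:
    """Return the surface text of an inclusive span over the flattened tokens."""
    start, end = span
    return " ".join(flat_tokens[start : end + 1])


# Sanity check: resolve every index pair back to its text before building an instance.
resolved = [[mention_text(mention) for mention in cluster] for cluster in gold_clusters]

# With a reader constructed elsewhere (hypothetical variable name):
# instance = reader.text_to_instance(sentences, gold_clusters)
```

Resolving the index pairs back to text before calling the reader is a cheap way to catch off-by-one errors between inclusive and exclusive end indices.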