ontonotes_ner

allennlp_models.tagging.dataset_readers.ontonotes_ner


OntonotesNamedEntityRecognition

@DatasetReader.register("ontonotes_ner")
class OntonotesNamedEntityRecognition(DatasetReader):
 | def __init__(
 |     self,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     domain_identifier: str = None,
 |     coding_scheme: str = "BIO",
 |     **kwargs
 | ) -> None

This DatasetReader is designed to read in the English OntoNotes v5.0 data for fine-grained named entity recognition. It returns a dataset of instances with the following fields:

  • tokens : TextField
    The tokens in the sentence.
  • tags : SequenceLabelField
    A sequence of BIO tags for the NER classes.

Note that the "/pt/" directory of the OntoNotes dataset, which contains annotations on the New and Old Testaments of the Bible, is excluded because it does not contain NER annotations.
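The file selection described above, together with the optional domain_identifier from the constructor signature, can be sketched in plain Python. This is only an illustration and not the reader's actual code: the helper name select_conll_paths and the example paths are hypothetical, and matching the identifier as a whole directory component ("/bn/") is an assumption about how "paths containing this domain identifier" is interpreted.

```python
from typing import Iterable, Iterator, Optional


def select_conll_paths(
    paths: Iterable[str], domain_identifier: Optional[str] = None
) -> Iterator[str]:
    """Yield the paths that survive the filtering rules described in the text.

    Hypothetical helper: the input is assumed to be a list of CoNLL file paths.
    """
    for path in paths:
        # The Bible annotations under "/pt/" carry no NER labels, so skip them.
        if "/pt/" in path:
            continue
        # Assumption: the identifier must appear as a directory component.
        if domain_identifier is not None and f"/{domain_identifier}/" not in path:
            continue
        yield path


# Made-up paths, used only to exercise the filter.
EXAMPLE_PATHS = [
    "data/annotations/bn/cnn/00/cnn_0001.gold_conll",
    "data/annotations/pt/nt/40/nt_4001.gold_conll",
    "data/annotations/nw/wsj/00/wsj_0001.gold_conll",
]

all_domains = list(select_conll_paths(EXAMPLE_PATHS))
broadcast_news_only = list(select_conll_paths(EXAMPLE_PATHS, domain_identifier="bn"))
```

With these inputs, all_domains keeps the "bn" and "nw" files, and broadcast_news_only keeps only the "bn" file. Requesting "pt" yields nothing, since that directory is always skipped.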

Parameters

  • token_indexers : Dict[str, TokenIndexer], optional
    Determines how the tokens are indexed. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
  • domain_identifier : str, optional (default = None)
    A string denoting a sub-domain of the OntoNotes 5.0 dataset to use. If present, only CoNLL files under paths containing this domain identifier are processed.
  • coding_scheme : str, optional (default = "BIO")
    The coding scheme to use for the NER labels. Valid options are "BIO" or "BIOUL".
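To make the difference between the two coding schemes concrete, here is a minimal standalone sketch that converts a BIO sequence to BIOUL. It is not the library's own conversion routine; the function name bio_to_bioul is hypothetical, and the input is assumed to be a well-formed BIO sequence (every I- tag continues a span with the same label).

```python
from typing import List


def bio_to_bioul(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO tag sequence to BIOUL (illustrative only)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A span that starts and ends on one token becomes U (unit).
            converted.append(f"B-{label}" if continues else f"U-{label}")
        else:
            # The final token of a multi-token span becomes L (last).
            converted.append(f"I-{label}" if continues else f"L-{label}")
    return converted


bioul = bio_to_bioul(["B-PERSON", "I-PERSON", "O", "B-GPE"])
```

Here bioul is ["B-PERSON", "L-PERSON", "O", "U-GPE"]: the two-token span ends in L, and the single-token span becomes U.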

Returns

  • A Dataset of Instances for Fine-Grained NER.

text_to_instance

class OntonotesNamedEntityRecognition(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     tokens: List[Token],
 |     ner_tags: List[str] = None
 | ) -> Instance

We take pre-tokenized input here, because we don't have a tokenizer in this class.
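Because the method expects input that is already tokenized, the caller is responsible for keeping words and tags aligned. The sketch below shows one way to check that alignment before building an instance. The helper align_words_and_tags is hypothetical and works on plain strings; in real use each word would presumably be wrapped in a Token before being passed to text_to_instance.

```python
from typing import List, Optional, Tuple


def align_words_and_tags(
    words: List[str], ner_tags: Optional[List[str]] = None
) -> List[Tuple[str, Optional[str]]]:
    """Pair each pre-tokenized word with its tag, or with None if untagged."""
    if ner_tags is None:
        # No gold tags, e.g. at prediction time.
        return [(word, None) for word in words]
    if len(words) != len(ner_tags):
        raise ValueError(
            f"Expected one tag per word, got {len(words)} words "
            f"and {len(ner_tags)} tags."
        )
    return list(zip(words, ner_tags))


# The caller splits the sentence; no tokenizer is applied for them.
words = ["Barack", "Obama", "visited", "Paris", "."]
tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "O"]
pairs = align_words_and_tags(words, tags)
```

Leaving ner_tags as None mirrors the optional second argument in the signature above, which allows untagged sentences to be passed in.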