allennlp.data.dataset_readers.conll2003

class allennlp.data.dataset_readers.conll2003.Conll2003DatasetReader(token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, tag_label: str = 'ner', feature_labels: Sequence[str] = (), lazy: bool = False, coding_scheme: str = 'IOB1', label_namespace: str = 'labels')

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads instances from a pretokenised file where each line is in the following format:
WORD POS-TAG CHUNK-TAG NER-TAG
with a blank line indicating the end of each sentence and ``-DOCSTART- -X- -X- O`` marking the boundary between articles, and converts it into a ``Dataset`` suitable for sequence tagging.

Each ``Instance`` contains the words in the ``"tokens"`` ``TextField``. The values corresponding to the ``tag_label`` will be loaded into the ``"tags"`` ``SequenceLabelField``. If you specify any ``feature_labels`` (you probably shouldn’t), the corresponding values will be loaded into their own ``SequenceLabelField``s.

This dataset reader ignores the “article” divisions and simply treats each sentence as an independent ``Instance``. (Technically, the reader splits sentences on any combination of blank lines and “DOCSTART” tags; in particular, it does the right thing on well-formed inputs.)

Parameters
- token_indexers: ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)
  We use this to define the input representation for the text. See ``TokenIndexer``.
- tag_label: ``str``, optional (default=``ner``)
  Specify ``ner``, ``pos``, or ``chunk`` to have that tag loaded into the instance field ``tags``.
- feature_labels: ``Sequence[str]``, optional (default=``()``)
  These labels will be loaded as features into the corresponding instance fields: ``pos`` -> ``pos_tags``, ``chunk`` -> ``chunk_tags``, ``ner`` -> ``ner_tags``. Each will have its own namespace: ``pos_tags``, ``chunk_tags``, ``ner_tags``. If you want to use one of the tags as a feature in your model, it should be specified here.
- coding_scheme: ``str``, optional (default=``IOB1``)
  Specifies the coding scheme for ``ner_labels`` and ``chunk_labels``. Valid options are ``IOB1`` and ``BIOUL``. The ``IOB1`` default maintains the original IOB1 scheme in the CoNLL 2003 NER data. In the IOB1 scheme, I is a token inside a span, O is a token outside a span, and B is the beginning of a span immediately following another span of the same type.
- label_namespace: ``str``, optional (default=``labels``)
  Specifies the namespace for the chosen ``tag_label``.
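
The file format described above (four whitespace-separated columns, with blank lines and ``-DOCSTART-`` lines acting as dividers) can be illustrated with a small standalone sketch. It is not the reader's actual implementation: the helper names ``is_divider`` and ``read_sentences`` and the sample data are assumptions made for illustration, and the sketch assumes every non-divider line has exactly four columns.

```python
import itertools
from typing import Dict, Iterable, Iterator, List


def is_divider(line: str) -> bool:
    """Return True for blank lines and for -DOCSTART- lines (hypothetical helper)."""
    stripped = line.strip()
    if not stripped:
        return True
    return stripped.split()[0] == "-DOCSTART-"


def read_sentences(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict of parallel column lists per sentence (hypothetical helper)."""
    # groupby yields alternating runs of divider and non-divider lines, so any
    # combination of blank lines and DOCSTART lines collapses into one divider run.
    for divider, group in itertools.groupby(lines, is_divider):
        if divider:
            continue
        rows = [line.strip().split() for line in group]
        # Transpose rows into the four columns: WORD POS-TAG CHUNK-TAG NER-TAG.
        tokens, pos_tags, chunk_tags, ner_tags = (list(col) for col in zip(*rows))
        yield {"tokens": tokens, "pos": pos_tags, "chunk": chunk_tags, "ner": ner_tags}


# Made-up sample in the documented format.
sample = """-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = list(read_sentences(sample.splitlines()))
for sentence in sentences:
    print(sentence["tokens"], sentence["ner"])
```

Each yielded dict corresponds roughly to what the reader would turn into one ``Instance``: the ``tokens`` column feeds the text field, and the column selected by ``tag_label`` feeds the ``tags`` field.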
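
To make the difference between the two ``coding_scheme`` options concrete, the following sketch converts an IOB1 tag sequence to BIOUL (B = beginning, I = inside, O = outside, U = unit-length span, L = last token of a span). The function name ``iob1_to_bioul`` is an assumption for illustration and is not claimed to match the library's own conversion code; it assumes every tag is either ``O`` or of the form ``B-TYPE`` / ``I-TYPE``.

```python
from typing import List, Optional, Tuple


def iob1_to_bioul(tags: List[str]) -> List[str]:
    """Convert well-formed IOB1 tags to BIOUL (hypothetical helper)."""
    # Step 1: recover spans as (start, end_inclusive, label).
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag == "O":
            if current is not None:
                spans.append((start, i - 1, current))
            start, current = None, None
            continue
        prefix, label = tag.split("-", 1)
        # In IOB1 a new span starts on an explicit B- tag, or whenever the
        # label differs from the span currently being read.
        if prefix == "B" or label != current:
            if current is not None:
                spans.append((start, i - 1, current))
            start, current = i, label
    if current is not None:
        spans.append((start, len(tags) - 1, current))

    # Step 2: re-encode each span in BIOUL.
    bioul = ["O"] * len(tags)
    for span_start, span_end, label in spans:
        if span_start == span_end:
            bioul[span_start] = "U-" + label
        else:
            bioul[span_start] = "B-" + label
            for j in range(span_start + 1, span_end):
                bioul[j] = "I-" + label
            bioul[span_end] = "L-" + label
    return bioul


# Two adjacent LOC spans: IOB1 needs the B- tag only to separate them.
print(iob1_to_bioul(["I-LOC", "B-LOC", "I-LOC", "O"]))
```

In IOB1 the first span above is tagged ``I-LOC`` because nothing of the same type precedes it, whereas BIOUL marks every span boundary explicitly.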