allennlp.data.dataset_readers.conll2000¶
-
class
allennlp.data.dataset_readers.conll2000.Conll2000DatasetReader(token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, tag_label: str = 'chunk', feature_labels: Sequence[str] = (), lazy: bool = False, coding_scheme: str = 'BIO', label_namespace: str = 'labels')[source]¶ Bases:
allennlp.data.dataset_readers.dataset_reader.DatasetReaderReads instances from a pretokenised file where each line is in the following format:
WORD POS-TAG CHUNK-TAG
with a blank line indicating the end of each sentence and converts it into a
Datasetsuitable for sequence tagging.Each
Instancecontains the words in the"tokens"TextField. The values corresponding to thetag_labelvalues will get loaded into the"tags"SequenceLabelField. And if you specify anyfeature_labels(you probably shouldn’t), the corresponding values will get loaded into their ownSequenceLabelFields.- Parameters
- token_indexers
Dict[str, TokenIndexer], optional (default=``{“tokens”: SingleIdTokenIndexer()}``) We use this to define the input representation for the text. See
TokenIndexer.- tag_label: ``str``, optional (default=``chunk``)
Specify pos, or chunk to have that tag loaded into the instance field tag.
- feature_labels: ``Sequence[str]``, optional (default=``()``)
These labels will be loaded as features into the corresponding instance fields:
pos->pos_tagsorchunk->chunk_tags. Each will have its own namespace:pos_tagsorchunk_tags. If you want to use one of the tags as a feature in your model, it should be specified here.- coding_scheme: ``str``, optional (default=``BIO``)
Specifies the coding scheme for
chunk_labels. Valid options areBIOandBIOUL. TheBIOdefault maintains the original BIO scheme in the CoNLL 2000 chunking data. In the BIO scheme, B is a token starting a span, I is a token continuing a span, and O is a token outside of a span.- label_namespace: ``str``, optional (default=``labels``)
Specifies the namespace for the chosen
tag_label.
- token_indexers