conll2000
allennlp_models.tagging.dataset_readers.conll2000
Conll2000DatasetReader
@DatasetReader.register("conll2000")
class Conll2000DatasetReader(DatasetReader):
| def __init__(
| self,
| token_indexers: Dict[str, TokenIndexer] = None,
| tag_label: str = "chunk",
| feature_labels: Sequence[str] = (),
| convert_to_coding_scheme: Optional[str] = None,
| label_namespace: str = "labels",
| **kwargs
| ) -> None
Reads instances from a pretokenised file where each line is in the following format:
WORD POS-TAG CHUNK-TAG
with a blank line indicating the end of each sentence
and converts it into a Dataset suitable for sequence tagging.
Each Instance contains the words in the "tokens" TextField.
The values corresponding to the tag_label
will be loaded into the "tags" SequenceLabelField.
If you specify any feature_labels (you probably shouldn't),
the corresponding values will be loaded into their own SequenceLabelFields.
Registered as a DatasetReader with name "conll2000".
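The line format described above can be illustrated with a small pure-Python sketch. This is not the reader's actual implementation: `read_conll2000_sentences` is a hypothetical helper, the sample text is made up for illustration, and the sketch assumes exactly three whitespace-separated columns per non-blank line.

```python
from typing import Iterable, Iterator, List, Tuple

# (words, pos_tags, chunk_tags) for one sentence
Sentence = Tuple[List[str], List[str], List[str]]


def read_conll2000_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Hypothetical helper: group WORD POS-TAG CHUNK-TAG lines into sentences."""
    words: List[str] = []
    pos_tags: List[str] = []
    chunk_tags: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line marks the end of the current sentence.
            if words:
                yield words, pos_tags, chunk_tags
                words, pos_tags, chunk_tags = [], [], []
            continue
        # Assumption: every non-blank line has exactly three columns.
        word, pos, chunk = line.split()
        words.append(word)
        pos_tags.append(pos)
        chunk_tags.append(chunk)
    # Emit the last sentence if the input does not end with a blank line.
    if words:
        yield words, pos_tags, chunk_tags


# Made-up sample in the three-column format.
sample = (
    "He PRP B-NP\n"
    "reckons VBZ B-VP\n"
    "\n"
    "It PRP B-NP\n"
    "narrowed VBD B-VP\n"
    ". . O\n"
)

sentences = list(read_conll2000_sentences(sample.splitlines()))
for words, pos_tags, chunk_tags in sentences:
    print(words, pos_tags, chunk_tags)
```

With the default tag_label of chunk, the third column is what would end up in the "tags" field; the second column would only be used if pos were chosen as tag_label or listed in feature_labels.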
Parameters
- token_indexers :
Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
We use this to define the input representation for the text. See TokenIndexer.
- tag_label :
str, optional (default = chunk)
Specify pos or chunk to have that tag loaded into the instance field tags.
- feature_labels :
Sequence[str], optional (default = ())
These labels will be loaded as features into the corresponding instance fields: pos -> pos_tags or chunk -> chunk_tags. Each will have its own namespace: pos_tags or chunk_tags. If you want to use one of the tags as a feature in your model, it should be specified here.
- convert_to_coding_scheme :
str, optional (default = None)
Specifies the coding scheme for chunk_labels. Conll2000DatasetReader assumes that the input data uses the BIO coding scheme. Valid options are None and BIOUL. The None default maintains the original BIO scheme in the CoNLL 2000 chunking data. In the BIO scheme, B is a token starting a span, I is a token continuing a span, and O is a token outside of a span.
- coding_scheme :
str, optional (default = BIO)
This parameter is deprecated. If you set coding_scheme to BIO, consider simply removing it or setting convert_to_coding_scheme to None. If you want to specify BIOUL for coding_scheme, replace it with convert_to_coding_scheme.
- label_namespace :
str, optional (default = labels)
Specifies the namespace for the chosen tag_label.
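To make the difference between the BIO and BIOUL schemes concrete, here is a minimal sketch of a BIO-to-BIOUL conversion. It is an illustration only, not the code the reader runs: `bio_to_bioul` is a hypothetical helper, and it assumes well-formed BIO input in which every span starts with a B- tag. In BIOUL, U marks a single-token span and L marks the last token of a multi-token span.

```python
from typing import List


def bio_to_bioul(tags: List[str]) -> List[str]:
    """Hypothetical helper: convert well-formed BIO tags to BIOUL tags."""
    converted: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The span continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- tag with no continuation is a single-token (unit) span.
            converted.append(tag if continues else f"U-{label}")
        elif prefix == "I":
            # An I- tag with no continuation is the last token of its span.
            converted.append(tag if continues else f"L-{label}")
        else:
            raise ValueError(f"Unexpected tag for BIO input: {tag}")
    return converted


bio = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O"]
bioul = bio_to_bioul(bio)
print(bioul)
```

Leaving convert_to_coding_scheme at None keeps the tags exactly as they appear in the data file, so no conversion of this kind takes place.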
text_to_instance
class Conll2000DatasetReader(DatasetReader):
| ...
| def text_to_instance(
| self,
| tokens: List[Token],
| pos_tags: List[str] = None,
| chunk_tags: List[str] = None
| ) -> Instance
We take pre-tokenized input here, because we don't have a tokenizer in this class.