conll2003
allennlp.data.dataset_readers.conll2003
Conll2003DatasetReader
@DatasetReader.register("conll2003")
class Conll2003DatasetReader(DatasetReader):
| def __init__(
|     self,
|     token_indexers: Dict[str, TokenIndexer] = None,
|     tag_label: str = "ner",
|     feature_labels: Sequence[str] = (),
|     convert_to_coding_scheme: Optional[str] = None,
|     label_namespace: str = "labels",
|     **kwargs
| ) -> None
Reads instances from a pre-tokenized file where each line is in the following format:
WORD POS-TAG CHUNK-TAG NER-TAG
with a blank line indicating the end of each sentence and -DOCSTART- -X- -X- O marking the start of each article, and converts it into a Dataset suitable for sequence tagging.
Each Instance contains the words in the "tokens" TextField. The values corresponding to the tag_label values will get loaded into the "tags" SequenceLabelField. And if you specify any feature_labels (you probably shouldn't), the corresponding values will get loaded into their own SequenceLabelFields.
This dataset reader ignores the "article" divisions and simply treats each sentence as an independent Instance. (Technically the reader splits sentences on any combination of blank lines and "DOCSTART" tags; in particular, it does the right thing on well-formed inputs.)
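The splitting behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the reader's actual code: `is_divider` and `read_sentences` are hypothetical names, and the sample rows are illustrative data in the four-column layout.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple

Columns = Tuple[List[str], List[str], List[str], List[str]]


def is_divider(line: str) -> bool:
    """Return True for blank lines and for -DOCSTART- marker lines."""
    stripped = line.strip()
    if not stripped:
        return True
    return stripped.split()[0] == "-DOCSTART-"


def read_sentences(lines: Iterable[str]) -> Iterator[Columns]:
    """Yield (words, pos_tags, chunk_tags, ner_tags) for each sentence."""
    # groupby merges any run of consecutive divider lines into one group,
    # so "any combination of blank lines and DOCSTART tags" is handled.
    for divider, group in itertools.groupby(lines, is_divider):
        if divider:
            continue
        rows = [line.strip().split() for line in group]
        words, pos_tags, chunk_tags, ner_tags = (list(col) for col in zip(*rows))
        yield words, pos_tags, chunk_tags, ner_tags


sample = """-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = list(read_sentences(sample))
for words, _, _, ner_tags in sentences:
    print(list(zip(words, ner_tags)))
```

Each yielded tuple corresponds to one sentence; article boundaries leave no trace in the output, matching the behaviour described above.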
Registered as a DatasetReader with name "conll2003".
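Because the reader is registered by name, it can typically be selected from a configuration file through a "type" key. The sketch below builds such a fragment as a Python dict and prints it as JSON. The surrounding "dataset_reader" key and the "single_id" indexer name are assumptions about the usual AllenNLP configuration layout, not something this page states; the remaining keys mirror the constructor parameters.

```python
import json

# Keys other than "type" mirror the constructor arguments of the reader.
reader_config = {
    "type": "conll2003",                  # the registered name
    "tag_label": "ner",
    "feature_labels": [],
    "convert_to_coding_scheme": "BIOUL",
    "label_namespace": "labels",
    # Assumption: "single_id" is the registered name of SingleIdTokenIndexer.
    "token_indexers": {"tokens": {"type": "single_id"}},
}

# Assumption: the reader block sits under a top-level "dataset_reader" key.
config_fragment = {"dataset_reader": reader_config}

print(json.dumps(config_fragment, indent=2))
```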
Parameters
- token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
  We use this to define the input representation for the text. See TokenIndexer.
- tag_label : str, optional (default = ner)
  Specify ner, pos, or chunk to have that tag loaded into the instance field tags.
- feature_labels : Sequence[str], optional (default = ())
  These labels will be loaded as features into the corresponding instance fields: pos -> pos_tags, chunk -> chunk_tags, ner -> ner_tags. Each will have its own namespace: pos_tags, chunk_tags, ner_tags. If you want to use one of the tags as a feature in your model, it should be specified here.
- convert_to_coding_scheme : str, optional (default = None)
  Specifies the coding scheme for ner_labels and chunk_labels. Conll2003DatasetReader assumes the input data uses the IOB1 coding scheme. Valid options are None and BIOUL. The None default maintains the original IOB1 scheme in the CoNLL 2003 NER data. In the IOB1 scheme, I is a token inside a span, O is a token outside a span, and B is the beginning of a span immediately following another span of the same type.
- coding_scheme : str, optional (default = IOB1)
  This parameter is deprecated. If you set coding_scheme to IOB1, consider simply removing it or setting convert_to_coding_scheme to None. If you want BIOUL for coding_scheme, use convert_to_coding_scheme instead.
- label_namespace : str, optional (default = labels)
  Specifies the namespace for the chosen tag_label.
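To make the two coding schemes concrete, here is a minimal sketch of an IOB1-to-BIOUL conversion. It is an illustration written for this page, not the library's own conversion routine, and `iob1_to_bioul` is a hypothetical name. In BIOUL, B, I and L mark the first, inner and last tokens of a multi-token span, and U marks a single-token span.

```python
from typing import List


def iob1_to_bioul(tags: List[str]) -> List[str]:
    """Convert a sequence of IOB1 tags to BIOUL tags (illustrative sketch)."""

    def starts_span(i: int) -> bool:
        tag = tags[i]
        if tag.startswith("B-") or i == 0:
            return True
        prev = tags[i - 1]
        # An I- tag opens a span when it follows O or a span of another type.
        return prev == "O" or prev[2:] != tag[2:]

    def ends_span(i: int) -> bool:
        if i == len(tags) - 1:
            return True
        nxt = tags[i + 1]
        # A span closes before O, before an explicit B-, or before a new type.
        return nxt == "O" or nxt.startswith("B-") or nxt[2:] != tags[i][2:]

    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append("O")
            continue
        start, end = starts_span(i), ends_span(i)
        if start and end:
            prefix = "U"
        elif start:
            prefix = "B"
        elif end:
            prefix = "L"
        else:
            prefix = "I"
        converted.append(f"{prefix}-{tag[2:]}")
    return converted


iob1 = ["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-LOC", "O", "I-ORG"]
print(iob1_to_bioul(iob1))
```

Note how the B-LOC tag in the input separates two adjacent LOC spans, which is the only situation where IOB1 uses B.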
text_to_instance
class Conll2003DatasetReader(DatasetReader):
| ...
| def text_to_instance(
|     self,
|     tokens: List[Token],
|     pos_tags: List[str] = None,
|     chunk_tags: List[str] = None,
|     ner_tags: List[str] = None
| ) -> Instance
We take pre-tokenized input here, because we don't have a tokenizer in this class.
apply_token_indexers
class Conll2003DatasetReader(DatasetReader):
| ...
| def apply_token_indexers(self, instance: Instance) -> None