conll2000

allennlp_models.tagging.dataset_readers.conll2000

Conll2000DatasetReader#

@DatasetReader.register("conll2000")
class Conll2000DatasetReader(DatasetReader):
 | def __init__(
 |     self,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     tag_label: str = "chunk",
 |     feature_labels: Sequence[str] = (),
 |     coding_scheme: str = "BIO",
 |     label_namespace: str = "labels",
 |     **kwargs
 | ) -> None

Reads instances from a pretokenised file where each line is in the following format:

WORD POS-TAG CHUNK-TAG

with a blank line indicating the end of each sentence and converts it into a Dataset suitable for sequence tagging.

Each Instance contains the words in the "tokens" TextField. The values corresponding to the tag_label values will get loaded into the "tags" SequenceLabelField. And if you specify any feature_labels (you probably shouldn't), the corresponding values will get loaded into their own SequenceLabelField s.

Registered as a DatasetReader with name "conll2000".

Parameters¶

token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
We use this to define the input representation for the text. See TokenIndexer.
tag_label : str, optional (default = chunk)
Specify pos, or chunk to have that tag loaded into the instance field tag.
feature_labels : Sequence[str], optional (default = ())
These labels will be loaded as features into the corresponding instance fields: pos -> pos_tags or chunk -> chunk_tags. Each will have its own namespace : pos_tags or chunk_tags. If you want to use one of the tags as a feature in your model, it should be specified here.
coding_scheme : str, optional (default = BIO)
Specifies the coding scheme for chunk_labels. Valid options are BIO and BIOUL. The BIO default maintains the original BIO scheme in the CoNLL 2000 chunking data. In the BIO scheme, B is a token starting a span, I is a token continuing a span, and O is a token outside of a span.
label_namespace : str, optional (default = labels)
Specifies the namespace for the chosen tag_label.

text_to_instance#

class Conll2000DatasetReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     tokens: List[Token],
 |     pos_tags: List[str] = None,
 |     chunk_tags: List[str] = None
 | ) -> Instance

We take pre-tokenized input here, because we don't have a tokenizer in this class.