conll2000

allennlp_models.tagging.dataset_readers.conll2000


Conll2000DatasetReader#

@DatasetReader.register("conll2000")
class Conll2000DatasetReader(DatasetReader):
 | def __init__(
 |     self,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     tag_label: str = "chunk",
 |     feature_labels: Sequence[str] = (),
 |     convert_to_coding_scheme: Optional[str] = None,
 |     label_namespace: str = "labels",
 |     **kwargs
 | ) -> None

Reads instances from a pre-tokenized file in which each line has the following format:

WORD POS-TAG CHUNK-TAG

A blank line marks the end of each sentence. The reader converts the file into a Dataset suitable for sequence tagging.
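As a rough illustration of the file format described above, the following standalone sketch groups whitespace-separated lines into sentences at blank lines. It is not the reader's actual implementation: the function name `read_conll2000` and the sample sentences are hypothetical, and it assumes every non-blank line has exactly three columns.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

# (words, pos_tags, chunk_tags) for one sentence
Sentence = Tuple[List[str], List[str], List[str]]


def read_conll2000(lines: Iterable[str]) -> Iterator[Sentence]:
    """Hypothetical helper: yield one tuple of columns per blank-line-delimited sentence."""
    # groupby splits the stream into alternating runs of blank and non-blank lines.
    for is_blank, group in groupby(lines, key=lambda line: line.strip() == ""):
        if is_blank:
            continue
        # Each non-blank line is assumed to be "WORD POS-TAG CHUNK-TAG".
        rows = [line.split() for line in group]
        words, pos_tags, chunk_tags = (list(column) for column in zip(*rows))
        yield words, pos_tags, chunk_tags


# Made-up sample data in the three-column format.
sample = """\
He PRP B-NP
reckons VBZ B-VP

The DT B-NP
deficit NN I-NP
"""

sentences = list(read_conll2000(sample.splitlines()))
for words, pos_tags, chunk_tags in sentences:
    print(words, pos_tags, chunk_tags)
```

With the default tag_label of "chunk", the third column is what would end up in the "tags" field, and the first column in "tokens".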

Each Instance contains the words in the "tokens" TextField. The values for the column selected by tag_label are loaded into the "tags" SequenceLabelField. If you specify any feature_labels (you probably shouldn't), the corresponding values are loaded into their own SequenceLabelFields.

Registered as a DatasetReader with name "conll2000".

Parameters

  • token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
    We use this to define the input representation for the text. See TokenIndexer.
  • tag_label : str, optional (default = chunk)
    Specify pos or chunk to have that tag loaded into the instance field tags.
  • feature_labels : Sequence[str], optional (default = ())
    These labels will be loaded as features into the corresponding instance fields: pos -> pos_tags or chunk -> chunk_tags. Each will have its own namespace: pos_tags or chunk_tags. If you want to use one of the tags as a feature in your model, specify it here.
  • convert_to_coding_scheme : str, optional (default = None)
    Specifies the coding scheme for the chunk labels. Conll2000DatasetReader assumes that the input data uses the BIO coding scheme. Valid options are None and BIOUL. The None default keeps the original BIO scheme of the CoNLL 2000 chunking data. In the BIO scheme, B is a token starting a span, I is a token continuing a span, and O is a token outside of a span.
  • coding_scheme : str, optional (default = BIO)
    This parameter is deprecated. If you currently set coding_scheme to BIO, either remove it or set convert_to_coding_scheme to None. If you currently set coding_scheme to BIOUL, set convert_to_coding_scheme to BIOUL instead.
  • label_namespace : str, optional (default = labels)
    Specifies the namespace for the chosen tag_label.
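To make the effect of convert_to_coding_scheme more concrete, here is a minimal standalone sketch of a BIO to BIOUL conversion. It is an illustration only, not the code the reader runs: the function name `bio_to_bioul` is hypothetical, and the sketch assumes well-formed BIO input in which every tag is either O or has the form B-LABEL or I-LABEL. In BIOUL, L marks the last token of a multi-token span and U marks a single-token span.

```python
from typing import List


def bio_to_bioul(tags: List[str]) -> List[str]:
    """Hypothetical helper: convert a well-formed BIO tag sequence to BIOUL."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The current span continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B tag with no continuation is a single-token span (U).
            converted.append(f"B-{label}" if continues else f"U-{label}")
        else:
            # An I tag with no continuation is the last token of its span (L).
            converted.append(f"I-{label}" if continues else f"L-{label}")
    return converted


bio = ["B-NP", "I-NP", "I-NP", "B-VP", "O", "B-NP", "I-NP"]
bioul = bio_to_bioul(bio)
print(bioul)
```

Leaving convert_to_coding_scheme at None corresponds to skipping a step like this and keeping the tags exactly as they appear in the file.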

text_to_instance#

class Conll2000DatasetReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     tokens: List[Token],
 |     pos_tags: List[str] = None,
 |     chunk_tags: List[str] = None
 | ) -> Instance

We take pre-tokenized input here, because we don't have a tokenizer in this class.