
sequence_tagging

allennlp.data.dataset_readers.sequence_tagging



DEFAULT_WORD_TAG_DELIMITER

DEFAULT_WORD_TAG_DELIMITER = "###"

SequenceTaggingDatasetReader

@DatasetReader.register("sequence_tagging")
class SequenceTaggingDatasetReader(DatasetReader):
 | def __init__(
 |     self,
 |     word_tag_delimiter: str = DEFAULT_WORD_TAG_DELIMITER,
 |     token_delimiter: str = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     **kwargs
 | ) -> None

Reads instances from a pre-tokenized file in which each line has the following format:

WORD###TAG [TAB] WORD###TAG [TAB] ..... \n

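and converts it into a Dataset suitable for sequence tagging. By default, WORD-TAG pairs are split on any whitespace (not only tabs) and each WORD is separated from its TAG by "###". Both delimiters can be changed in the constructor.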
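The line format above can be illustrated with a minimal, library-free sketch. The function name `parse_tagged_line` is hypothetical and not part of AllenNLP; splitting each pair from the right on the last delimiter is an assumption chosen so that a word containing the delimiter text stays intact. It only mirrors the described format and is not the reader's actual implementation.

```python
from typing import List, Optional, Tuple


def parse_tagged_line(
    line: str,
    word_tag_delimiter: str = "###",
    token_delimiter: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Split one line of WORD###TAG pairs into parallel word and tag lists.

    Hypothetical helper for illustration only.
    """
    # With token_delimiter=None, str.split splits on any run of whitespace.
    pairs = line.strip("\n").split(token_delimiter)
    words: List[str] = []
    tags: List[str] = []
    for pair in pairs:
        # Assumption: split on the LAST occurrence of the delimiter.
        word, tag = pair.rsplit(word_tag_delimiter, 1)
        words.append(word)
        tags.append(tag)
    return words, tags


words, tags = parse_tagged_line("The###DET\tcat###NOUN\tsat###VERB\n")
```

Here `words` and `tags` are parallel lists, one entry per WORD-TAG pair, which is the shape a sequence tagging model expects.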

Registered as a DatasetReader with name "sequence_tagging".

Parameters

  • word_tag_delimiter : str, optional (default = "###")
    The text that separates each WORD from its TAG.
  • token_delimiter : str, optional (default = None)
    The text that separates each WORD-TAG pair from the next pair. If None, the line is split on whitespace.
  • token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
    We use this to define the input representation for the text. See TokenIndexer. Note that the output tags will always correspond to single token IDs based on how they are pre-tokenised in the data file.
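As a rough illustration of how the two delimiter parameters interact, the sketch below uses plain Python string methods on a made-up sample line. It assumes the delimiters behave like arguments to `str.split`, which is an assumption about the reader rather than something this page states; the sample data and variable names are hypothetical.

```python
# Hypothetical sample line: "|" between pairs, "/" between word and tag.
line = "New York/LOC|is/O|big/O"

# An explicit token delimiter lets a "word" contain spaces.
pairs = [pair.rsplit("/", 1) for pair in line.split("|")]
words = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]

# Difference between the None default and an explicit single space,
# shown with Python's own str.split semantics:
spaced = "a/X  b/Y"  # note the two spaces between pairs
default_split = spaced.split(None)  # runs of whitespace are collapsed
explicit_split = spaced.split(" ")  # yields an empty string between the spaces
```

Under this assumption, leaving `token_delimiter` as None is the more forgiving choice for files with irregular spacing, while an explicit delimiter is needed when words themselves contain whitespace.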

text_to_instance

class SequenceTaggingDatasetReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     tokens: List[Token],
 |     tags: List[str] = None
 | ) -> Instance

We take pre-tokenized input here, because we don't have a tokenizer in this class.