sequence_tagging
allennlp.data.dataset_readers.sequence_tagging
DEFAULT_WORD_TAG_DELIMITER
DEFAULT_WORD_TAG_DELIMITER = "###"
SequenceTaggingDatasetReader
@DatasetReader.register("sequence_tagging")
class SequenceTaggingDatasetReader(DatasetReader):
| def __init__(
| self,
| word_tag_delimiter: str = DEFAULT_WORD_TAG_DELIMITER,
| token_delimiter: str = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| **kwargs
| ) -> None
Reads instances from a pre-tokenised file where each line is in the following format:

WORD###TAG [TAB] WORD###TAG [TAB] ..... \n

and converts it into a Dataset suitable for sequence tagging. You can also
specify alternative delimiters in the constructor.

Registered as a DatasetReader with name "sequence_tagging".
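
For concreteness, here is a minimal sketch of reading such a file with the default delimiters; the file name and its contents are invented for illustration:

from allennlp.data.dataset_readers import SequenceTaggingDatasetReader

# Each line holds WORD###TAG pairs separated by whitespace.
with open("tiny_tagged.txt", "w") as f:
    f.write("The###DET dog###NN barked###VBD\n")
    f.write("Dogs###NNS bark###VBP\n")

reader = SequenceTaggingDatasetReader()
for instance in reader.read("tiny_tagged.txt"):
    print(instance)  # an Instance with "tokens" and "tags" fields

Each resulting Instance pairs a TextField of tokens with a SequenceLabelField of tags, one tag per token.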
Parameters

- word_tag_delimiter : str, optional (default = "###")
  The text that separates each WORD from its TAG.
- token_delimiter : str, optional (default = None)
  The text that separates each WORD-TAG pair from the next pair. If None,
  the line will just be split on whitespace.
- token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
  We use this to define the input representation for the text. See TokenIndexer.
  Note that the output tags will always correspond to single token IDs based on
  how they are pre-tokenised in the data file.
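
As a sketch of how the two delimiter parameters interact, a file whose lines look like The/DET[TAB]dog/NN could be read with hypothetical, non-default delimiters:

from allennlp.data.dataset_readers import SequenceTaggingDatasetReader

# Hypothetical delimiters for "WORD/TAG" pairs separated by tabs.
reader = SequenceTaggingDatasetReader(
    word_tag_delimiter="/",  # separates WORD from TAG within a pair
    token_delimiter="\t",    # separates one WORD/TAG pair from the next
)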
text_to_instance
class SequenceTaggingDatasetReader(DatasetReader):
| ...
| def text_to_instance(
| self,
| tokens: List[Token],
| tags: List[str] = None
| ) -> Instance
We take pre-tokenised input here because we don't have a tokenizer in this class.
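
A minimal sketch of calling text_to_instance directly with pre-tokenised input; the example tokens and tags are invented:

from allennlp.data.dataset_readers import SequenceTaggingDatasetReader
from allennlp.data.tokenizers import Token

reader = SequenceTaggingDatasetReader()
instance = reader.text_to_instance(
    tokens=[Token("The"), Token("dog"), Token("barked")],
    tags=["DET", "NN", "VBD"],
)

If tags is omitted (e.g. at prediction time), the returned Instance contains only the tokens field.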