sentence_splitter
allennlp.data.tokenizers.sentence_splitter
SentenceSplitter#
class SentenceSplitter(Registrable)
A SentenceSplitter splits strings into sentences.
default_implementation#
class SentenceSplitter(Registrable):
| ...
| default_implementation = "spacy"
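As an illustration of what a default implementation name is for, here is a minimal sketch of a name-to-class registry that falls back to a default when no name is given. It is not AllenNLP's Registrable: the names SketchRegistrable, SketchSplitter, FakeSpacySplitter, and resolve are all hypothetical and exist only for this example.

```python
from typing import Callable, Dict, Optional


class SketchRegistrable:
    """Hypothetical stand-in for a name-to-class registry (not AllenNLP's Registrable)."""

    _registry: Dict[str, type] = {}
    default_implementation: Optional[str] = None

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        # Returns a class decorator that records the subclass under `name`.
        def decorator(subclass: type) -> type:
            cls._registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def resolve(cls, name: Optional[str] = None) -> type:
        # Hypothetical helper: use the default name when none is supplied.
        key = name if name is not None else cls.default_implementation
        if key is None or key not in cls._registry:
            raise KeyError(f"no implementation registered under {key!r}")
        return cls._registry[key]


class SketchSplitter(SketchRegistrable):
    _registry: Dict[str, type] = {}  # separate registry for this base class
    default_implementation = "spacy"


@SketchSplitter.register("spacy")
class FakeSpacySplitter(SketchSplitter):
    pass


default_cls = SketchSplitter.resolve()          # no name: falls back to "spacy"
explicit_cls = SketchSplitter.resolve("spacy")  # explicit name
```

In this sketch both lookups return the same class, which is the effect a default implementation name is meant to have when a caller does not specify which splitter to use.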
split_sentences#
class SentenceSplitter(Registrable):
| ...
| def split_sentences(self, text: str) -> List[str]
Splits a paragraph of text (a str) into a list of str, where each element is a sentence.
batch_split_sentences#
class SentenceSplitter(Registrable):
| ...
| def batch_split_sentences(self, texts: List[str]) -> List[List[str]]
The default implementation simply iterates over the texts and calls split_sentences on each one.
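The relationship between the two methods can be sketched without any dependencies. The classes below are hypothetical stand-ins that only mirror the interface described here (SketchSentenceSplitter and NewlineSplitter are not part of AllenNLP); the point is that a subclass only has to provide split_sentences to get a working batch method.

```python
from typing import List


class SketchSentenceSplitter:
    """Hypothetical stand-in mirroring the two-method interface."""

    def split_sentences(self, text: str) -> List[str]:
        raise NotImplementedError

    def batch_split_sentences(self, texts: List[str]) -> List[List[str]]:
        # Default behaviour: one split_sentences call per input text.
        return [self.split_sentences(text) for text in texts]


class NewlineSplitter(SketchSentenceSplitter):
    """Toy subclass that treats each non-empty line as one sentence."""

    def split_sentences(self, text: str) -> List[str]:
        return [line.strip() for line in text.splitlines() if line.strip()]


splitter = NewlineSplitter()
batches = splitter.batch_split_sentences(["One.\nTwo.", "Three."])
print(batches)  # [['One.', 'Two.'], ['Three.']]
```

A subclass overrides batch_split_sentences only when it can process many texts more efficiently than one call per text.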
SpacySentenceSplitter#
@SentenceSplitter.register("spacy")
class SpacySentenceSplitter(SentenceSplitter):
| def __init__(
| self,
| language: str = "en_core_web_sm",
| rule_based: bool = False
| ) -> None
A SentenceSplitter that uses spaCy's built-in sentence boundary detection.
spaCy's default sentence splitter uses a dependency parse to detect sentence boundaries, so it is slow but accurate.
Another option is rule-based sentence boundary detection. It is fast and has a small memory footprint because it relies on punctuation to detect sentence boundaries. It can be activated with the rule_based flag.
By default, SpacySentenceSplitter calls the default spaCy boundary detector.
Registered as a SentenceSplitter with name "spacy".
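To make the trade-off behind the rule-based option concrete, here is a deliberately simplified, pure-Python illustration of punctuation-based splitting. It is not spaCy's algorithm and makes no claim about spaCy's output; punctuation_split is a hypothetical helper that only shows why punctuation rules are cheap and where they can go wrong.

```python
import re
from typing import List

# Split on whitespace that directly follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def punctuation_split(text: str) -> List[str]:
    """Toy splitter: a single regex pass, with no model and no parse."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


clean = punctuation_split("It rained. We stayed in! Did you?")
print(clean)   # ['It rained.', 'We stayed in!', 'Did you?']

# An abbreviation ends in a period, so this naive rule splits too early.
tricky = punctuation_split("Dr. Smith arrived. He sat down.")
print(tricky)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second call shows the kind of error that purely punctuation-driven rules can make, which is the accuracy cost traded for speed and a small memory footprint.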
split_sentences#
class SpacySentenceSplitter(SentenceSplitter):
| ...
| @overrides
| def split_sentences(self, text: str) -> List[str]
batch_split_sentences#
class SpacySentenceSplitter(SentenceSplitter):
| ...
| @overrides
| def batch_split_sentences(self, texts: List[str]) -> List[List[str]]
This method lets you take advantage of spaCy's batch processing.