sentence_splitter

allennlp.data.tokenizers.sentence_splitter

SentenceSplitter

class SentenceSplitter(Registrable)

A SentenceSplitter splits strings into sentences.

default_implementation

class SentenceSplitter(Registrable):
 | ...
 | default_implementation = "spacy"

split_sentences

class SentenceSplitter(Registrable):
 | ...
 | def split_sentences(self, text: str) -> List[str]

Splits a paragraph of text into a list of strings, each of which is a sentence.

batch_split_sentences

class SentenceSplitter(Registrable):
 | ...
 | def batch_split_sentences(self, texts: List[str]) -> List[List[str]]

The default implementation simply iterates over the texts and calls split_sentences on each.
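The base-class contract can be sketched in plain Python. This is a minimal illustration, not AllenNLP's actual code: the real base class also inherits from Registrable, and PeriodSplitter is a hypothetical toy subclass used only to demonstrate the default batch behavior.

```python
from typing import List


class SentenceSplitter:
    """Minimal sketch of the base class (the real one also inherits Registrable)."""

    def split_sentences(self, text: str) -> List[str]:
        raise NotImplementedError

    def batch_split_sentences(self, texts: List[str]) -> List[List[str]]:
        # Default: just call split_sentences on each text in turn.
        return [self.split_sentences(text) for text in texts]


class PeriodSplitter(SentenceSplitter):
    """Hypothetical toy subclass that splits on '. ' for demonstration only."""

    def split_sentences(self, text: str) -> List[str]:
        return [s.strip() for s in text.split(". ") if s.strip()]


splitter = PeriodSplitter()
print(splitter.batch_split_sentences(["One. Two.", "Three."]))
```

A subclass only has to implement split_sentences; batch_split_sentences works automatically, and implementations with real batching support (like the spaCy one below) can override it.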

SpacySentenceSplitter

@SentenceSplitter.register("spacy")
class SpacySentenceSplitter(SentenceSplitter):
 | def __init__(
 |     self,
 |     language: str = "en_core_web_sm",
 |     rule_based: bool = False
 | ) -> None

A SentenceSplitter that uses spaCy's built-in sentence boundary detection.

spaCy's default sentence splitter uses a dependency parse to detect sentence boundaries, so it is slow but accurate.

Another option is rule-based sentence boundary detection, activated with the rule_based flag. Because it relies only on punctuation, it is fast and has a small memory footprint.
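The idea behind punctuation-based splitting can be illustrated without spaCy at all. The function below is a toy sketch of the rule-based approach, not spaCy's actual sentencizer: it simply splits after sentence-final punctuation followed by whitespace.

```python
import re
from typing import List


def rule_based_split(text: str) -> List[str]:
    """Toy punctuation-based sentence splitter (illustrative; not spaCy's sentencizer)."""
    # Split at whitespace that follows '.', '!', or '?', keeping the punctuation
    # attached to the preceding sentence via a lookbehind.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


print(rule_based_split("Hello there! How are you? Fine."))
```

Real rule-based splitters handle abbreviations, ellipses, and quotes far more carefully; this sketch only shows why the approach is cheap compared to running a dependency parser.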

By default, SpacySentenceSplitter uses spaCy's parse-based boundary detector.

Registered as a SentenceSplitter with name "spacy".
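Registration means the class can be looked up by name rather than imported directly. The snippet below is a simplified, self-contained sketch of how a Registrable-style registry plausibly works; the method names register and by_name mirror AllenNLP's, but the implementation here is illustrative only.

```python
from typing import Callable, Dict, List, Type


class Registrable:
    """Toy sketch of a name-based class registry (illustrative, not AllenNLP's)."""

    _registry: Dict[type, Dict[str, type]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[type], type]:
        def decorator(subclass: type) -> type:
            # Record the subclass under `name`, keyed by the base class.
            Registrable._registry.setdefault(cls, {})[name] = subclass
            return subclass
        return decorator

    @classmethod
    def by_name(cls, name: str) -> type:
        return Registrable._registry[cls][name]


class SentenceSplitter(Registrable):
    default_implementation = "spacy"


@SentenceSplitter.register("spacy")
class SpacySentenceSplitter(SentenceSplitter):
    def __init__(self, language: str = "en_core_web_sm", rule_based: bool = False) -> None:
        self.language = language
        self.rule_based = rule_based


# Look the implementation up by its registered name.
splitter_cls = SentenceSplitter.by_name("spacy")
splitter = splitter_cls(rule_based=True)
print(splitter.rule_based)
```

This name-based lookup is what lets configuration files refer to the splitter as simply "spacy".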

split_sentences

class SpacySentenceSplitter(SentenceSplitter):
 | ...
 | def split_sentences(self, text: str) -> List[str]

batch_split_sentences

class SpacySentenceSplitter(SentenceSplitter):
 | ...
 | def batch_split_sentences(self, texts: List[str]) -> List[List[str]]

This method lets you take advantage of spaCy's batch processing.
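The benefit of overriding the batch method can be sketched abstractly: instead of analyzing each text in a separate call, all texts are streamed through the pipeline in one pass (spaCy does this with nlp.pipe). The class below is a hypothetical stand-in, not the real SpacySentenceSplitter.

```python
from typing import Iterator, List


class ToyBatchSplitter:
    """Hypothetical splitter showing a batch override (not spaCy itself)."""

    def _analyze_many(self, texts: List[str]) -> Iterator[List[str]]:
        # Stand-in for spaCy's nlp.pipe, which streams documents through the
        # pipeline in batches rather than paying setup costs per document.
        for text in texts:
            yield [s.strip() for s in text.split(". ") if s.strip()]

    def split_sentences(self, text: str) -> List[str]:
        return next(self._analyze_many([text]))

    def batch_split_sentences(self, texts: List[str]) -> List[List[str]]:
        # Override the default one-at-a-time loop with a single pipelined pass.
        return list(self._analyze_many(texts))


print(ToyBatchSplitter().batch_split_sentences(["A. B.", "C."]))
```

With the real class, passing many paragraphs to batch_split_sentences is noticeably faster than calling split_sentences in a loop, because spaCy amortizes its per-call overhead across the batch.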