@Metric.register("bleu") class BLEU(Metric): | def __init__( | self, | ngram_weights: Iterable[float] = (0.25, 0.25, 0.25, 0.25), | exclude_indices: Set[int] = None | ) -> None
Bilingual Evaluation Understudy (BLEU).
BLEU is a common metric used for evaluating the quality of machine translations against a set of reference translations. See Papineni et al., "BLEU: a method for automatic evaluation of machine translation", 2002.
- ngram_weights : Iterable[float], optional (default = (0.25, 0.25, 0.25, 0.25))
Weights to assign to scores for each ngram size.
- exclude_indices : Set[int], optional (default = None)
Indices to exclude when calculating ngrams. This should usually include the indices of the start, end, and pad tokens.
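To illustrate what excluding indices means for n-gram counting, here is a minimal pure-Python sketch. It is not the library's code: the helper name `count_ngrams`, the example vocabulary, and the rule that an n-gram is dropped when it contains any excluded index are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence, Set


def count_ngrams(tokens: Sequence[int], n: int, exclude_indices: Set[int]) -> Counter:
    """Count n-grams of size n, skipping any that touch an excluded index."""
    counts: Counter = Counter()
    for start in range(len(tokens) - n + 1):
        ngram = tuple(tokens[start:start + n])
        # Assumption: an n-gram is dropped if it contains ANY excluded index.
        if any(token in exclude_indices for token in ngram):
            continue
        counts[ngram] += 1
    return counts


# Hypothetical vocabulary: 0 = pad, 1 = start, 2 = end.
sequence = [1, 5, 6, 5, 6, 2, 0, 0]
bigrams = count_ngrams(sequence, 2, exclude_indices={0, 1, 2})
```

Without the exclusion set, bigrams such as `(0, 0)` from padding would be counted as if they were real content, which is why the start, end, and pad indices are usually excluded.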
We chose to implement this from scratch instead of wrapping an existing implementation
(such as nltk.translate.bleu_score) for two reasons. First, so that we could
pass tensors directly to this metric instead of first converting the tensors to lists of strings.
And second, because the functions in such implementations are
meant to be called once over the entire corpus, whereas it is more efficient
in our use case to update the running precision counts every batch.
This implementation only considers a reference set of size 1, i.e. a single gold target sequence for each predicted sequence.
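The idea of updating running precision counts every batch, with a single reference per prediction, can be sketched roughly as follows. The class `RunningPrecision` and its attribute names are hypothetical and operate on plain lists of token indices rather than tensors; the sketch only shows the accumulation pattern, not the library's internals.

```python
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[int], n: int) -> Counter:
    """Count every contiguous n-gram of size n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class RunningPrecision:
    """Hypothetical accumulator for clipped n-gram matches across batches."""

    def __init__(self, max_n: int = 4) -> None:
        self.max_n = max_n
        self.matches: Counter = Counter()  # n -> clipped matches so far
        self.totals: Counter = Counter()   # n -> predicted n-grams so far

    def update(self, predictions: List[List[int]], gold_targets: List[List[int]]) -> None:
        # One gold sequence per prediction (reference set of size 1).
        for pred, gold in zip(predictions, gold_targets):
            for n in range(1, self.max_n + 1):
                pred_counts = ngram_counts(pred, n)
                gold_counts = ngram_counts(gold, n)
                # Clipping: a predicted n-gram is credited at most as many
                # times as it occurs in the reference.
                self.matches[n] += sum(
                    min(count, gold_counts[ngram]) for ngram, count in pred_counts.items()
                )
                self.totals[n] += sum(pred_counts.values())


running = RunningPrecision(max_n=2)
running.update([[4, 5, 6, 7]], [[4, 5, 6, 8]])  # first batch
running.update([[9, 9, 9]], [[9, 3, 3]])        # second batch
```

Because only integer counts are stored, each batch can be processed and discarded, and the final score can be computed from the totals at any point.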
class BLEU(Metric): | ... | def reset(self) -> None
class BLEU(Metric): | ... | def __call__( | self, | predictions: torch.LongTensor, | gold_targets: torch.LongTensor, | mask: Optional[torch.BoolTensor] = None | ) -> None
Update precision counts.
- predictions :
Batched predicted tokens of shape
- gold_targets :
Batched reference (gold) translations with shape
class BLEU(Metric): | ... | def get_metric(self, reset: bool = False) -> Dict[str, float]
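As a rough sketch of how a final score can be derived from accumulated counts, the function below applies the standard BLEU definition from Papineni et al.: a brevity penalty multiplied by the weighted geometric mean of the n-gram precisions. The function name `bleu_from_counts`, its arguments, the `"BLEU"` dictionary key, and the choice to return 0.0 when any precision is zero (no smoothing) are assumptions for illustration, not a statement of what `get_metric` does internally.

```python
import math
from typing import Dict, Sequence


def bleu_from_counts(
    matches: Dict[int, int],
    totals: Dict[int, int],
    prediction_length: int,
    reference_length: int,
    ngram_weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),
) -> Dict[str, float]:
    """Combine running counts into a BLEU score (hypothetical helper)."""
    if prediction_length == 0:
        return {"BLEU": 0.0}

    # Brevity penalty: 1 if predictions are longer than references,
    # otherwise exp(1 - r / c).
    if prediction_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - reference_length / prediction_length)

    log_precision_sum = 0.0
    for n, weight in enumerate(ngram_weights, start=1):
        if weight == 0:
            continue  # this n-gram size does not contribute
        if matches.get(n, 0) == 0 or totals.get(n, 0) == 0:
            # Assumption: without smoothing, a zero precision gives a zero score.
            return {"BLEU": 0.0}
        log_precision_sum += weight * math.log(matches[n] / totals[n])

    return {"BLEU": brevity_penalty * math.exp(log_precision_sum)}


# Unigram and bigram precision are both 1/2 and the lengths are equal.
result = bleu_from_counts(
    matches={1: 4, 2: 2},
    totals={1: 8, 2: 4},
    prediction_length=8,
    reference_length=8,
    ngram_weights=(0.5, 0.5),
)
```

In this example the brevity penalty is 1 and both precisions are 0.5, so the score works out to 0.5 up to floating-point rounding.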