Skip to content




class BLEU(Metric):
 | def __init__(
 |     self,
 |     ngram_weights: Iterable[float] = (0.25, 0.25, 0.25, 0.25),
 |     exclude_indices: Set[int] = None
 | ) -> None

Bilingual Evaluation Understudy (BLEU).

BLEU is a common metric used for evaluating the quality of machine translations against a set of reference translations. See Papineni et. al., "BLEU: a method for automatic evaluation of machine translation", 2002.


  • ngram_weights : Iterable[float], optional (default = (0.25, 0.25, 0.25, 0.25))
    Weights to assign to scores for each ngram size.
  • exclude_indices : Set[int], optional (default = None)
    Indices to exclude when calculating ngrams. This should usually include the indices of the start, end, and pad tokens.


We chose to implement this from scratch instead of wrapping an existing implementation (such as nltk.translate.bleu_score) for a two reasons. First, so that we could pass tensors directly to this metric instead of first converting the tensors to lists of strings. And second, because functions like nltk.translate.bleu_score.corpus_bleu() are meant to be called once over the entire corpus, whereas it is more efficient in our use case to update the running precision counts every batch.

This implementation only considers a reference set of size 1, i.e. a single gold target sequence for each predicted sequence.


class BLEU(Metric):
 | ...
 | def reset(self) -> None


class BLEU(Metric):
 | ...
 | def __call__(
 |     self,
 |     predictions: torch.LongTensor,
 |     gold_targets: torch.LongTensor,
 |     mask: Optional[torch.BoolTensor] = None
 | ) -> None

Update precision counts.


  • predictions : torch.LongTensor
    Batched predicted tokens of shape (batch_size, max_sequence_length).
  • references : torch.LongTensor
    Batched reference (gold) translations with shape (batch_size, max_gold_sequence_length).


  • None


class BLEU(Metric):
 | ...
 | def get_metric(self, reset: bool = False) -> Dict[str, float]