allennlp.training.metrics¶
A Metric is some quantity or quantities that can be accumulated during training or evaluation; for example, accuracy or F1 score.
-
class
allennlp.training.metrics.metric.
Metric
[source]¶ Bases:
allennlp.common.registrable.Registrable
A very general abstract class representing a metric which can be accumulated.
-
get_metric
(self, reset: bool) → Union[float, Tuple[float, ...], Dict[str, float], Dict[str, List[float]]][source]¶ Compute and return the metric. Optionally also call self.reset().
-
static
unwrap_to_tensors
(*tensors: torch.Tensor)[source]¶ If you actually passed gradient-tracking Tensors to a Metric, there will be a huge memory leak, because it will prevent garbage collection for the computation graph. This method ensures that you’re using tensors directly and that they are on the CPU.
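The accumulate-then-report pattern described above can be sketched in plain Python. This is a minimal illustration, not AllenNLP code: the class name RunningAccuracy is hypothetical, and plain lists stand in for tensors so the example has no dependencies.

```python
class RunningAccuracy:
    """Hypothetical stand-in for a Metric subclass (not part of AllenNLP)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def __call__(self, predictions, gold_labels):
        # Accumulate counts for one batch.
        for predicted, gold in zip(predictions, gold_labels):
            self.correct += int(predicted == gold)
            self.total += 1

    def get_metric(self, reset=False):
        # Compute the value from the accumulated state, optionally resetting.
        value = self.correct / self.total if self.total else 0.0
        if reset:
            self.reset()
        return value

    def reset(self):
        self.correct = 0
        self.total = 0


metric = RunningAccuracy()
metric([1, 0, 2], [1, 1, 2])  # first batch: 2 of 3 correct
metric([0, 0], [0, 0])        # second batch: 2 of 2 correct
epoch_accuracy = metric.get_metric(reset=True)  # 4 of 5 correct overall
```

Calling get_metric(reset=True) at the end of an epoch returns the accumulated value and clears the state for the next epoch.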
-
-
class
allennlp.training.metrics.attachment_scores.
AttachmentScores
(ignore_classes: List[int] = None)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Computes labeled and unlabeled attachment scores for a dependency parse, as well as sentence level exact match for both labeled and unlabeled trees. Note that the input to this metric is the sampled predictions, not the distribution itself.
- Parameters
- ignore_classes: List[int], optional (default = None). A list of label ids to ignore when computing metrics.
-
class
allennlp.training.metrics.auc.
Auc
(positive_label=1)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
The AUC Metric measures the area under the receiver-operating characteristic (ROC) curve for binary classification problems.
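As an illustration of what the area under the ROC curve measures, the sketch below uses the pairwise-ranking interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The function name roc_auc is hypothetical and this is not how the Auc class is implemented.

```python
def roc_auc(scores, labels, positive_label=1):
    """Pairwise-ranking estimate of ROC AUC for binary labels."""
    positives = [s for s, y in zip(scores, labels) if y == positive_label]
    negatives = [s for s, y in zip(scores, labels) if y != positive_label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The quadratic loop is chosen for readability; a sort-based version computes the same quantity faster.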
-
class
allennlp.training.metrics.average.
Average
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This Metric breaks with the typical Metric API and just stores values that were computed in some fashion outside of a Metric. If you have some external code that computes the metric for you, for instance, you can use this to report the average result using our Metric API.
-
class
allennlp.training.metrics.boolean_accuracy.
BooleanAccuracy
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Just checks batch-equality of two tensors and computes an accuracy metric based on that. That is, if your prediction has shape (batch_size, dim_1, …, dim_n), this metric considers that as a set of batch_size predictions and checks that each is entirely correct across the remaining dims. This means the denominator in the accuracy computation is batch_size, with the caveat that predictions that are totally masked are ignored (in which case the denominator is the number of predictions that have at least one unmasked element).
This is similar to CategoricalAccuracy, if you’ve already done a .max() on your predictions. If you have categorical output, though, you should typically just use CategoricalAccuracy. The reason you might want to use this instead is if you’ve done some kind of constrained inference and don’t have a prediction tensor that matches the API of CategoricalAccuracy, which assumes a final dimension of size num_classes.
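The batch-equality rule and the masking caveat can be sketched as follows. This is a plain-Python illustration under the assumption of two-dimensional inputs given as lists of lists; the function name boolean_accuracy is hypothetical and the real class operates on tensors.

```python
def boolean_accuracy(predictions, gold, mask=None):
    """Fraction of instances whose unmasked positions all match the gold values."""
    correct = 0
    counted = 0
    for i, (pred_row, gold_row) in enumerate(zip(predictions, gold)):
        row_mask = mask[i] if mask is not None else [1] * len(pred_row)
        if not any(row_mask):
            continue  # totally masked predictions do not enter the denominator
        counted += 1
        if all(p == g for p, g, m in zip(pred_row, gold_row, row_mask) if m):
            correct += 1
    return correct / counted if counted else 0.0


predictions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
gold = [[1, 2, 3], [4, 0, 6], [0, 0, 0]]
mask = [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
accuracy = boolean_accuracy(predictions, gold, mask)
```

Here the first instance is entirely correct, the second has one wrong element, and the third is fully masked, so the denominator is 2 rather than 3.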
-
class
allennlp.training.metrics.bleu.
BLEU
(ngram_weights: Iterable[float] = (0.25, 0.25, 0.25, 0.25), exclude_indices: Set[int] = None)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Bilingual Evaluation Understudy (BLEU).
BLEU is a common metric used for evaluating the quality of machine translations against a set of reference translations. See Papineni et al., “BLEU: a method for automatic evaluation of machine translation”, 2002.
- Parameters
- ngram_weights: Iterable[float], optional (default = (0.25, 0.25, 0.25, 0.25)). Weights to assign to scores for each ngram size.
- exclude_indices: Set[int], optional (default = None). Indices to exclude when calculating ngrams. This should usually include the indices of the start, end, and pad tokens.
Notes
We chose to implement this from scratch instead of wrapping an existing implementation (such as nltk.translate.bleu_score) for two reasons. First, we wanted to pass tensors directly to this metric instead of first converting them to lists of strings. Second, functions like nltk.translate.bleu_score.corpus_bleu() are meant to be called once over the entire corpus, whereas it is more efficient in our use case to update the running precision counts every batch.
This implementation only considers a reference set of size 1, i.e. a single gold target sequence for each predicted sequence.
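The idea of updating running precision counts every batch, with a single reference per prediction, can be sketched in plain Python. The class RunningBleu and its method names are hypothetical. As a simplifying assumption, excluded indices are dropped from each sequence before n-grams are counted, which may differ from how the library treats them.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of size n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


class RunningBleu:
    """Hypothetical illustration of per-batch BLEU accumulation."""

    def __init__(self, ngram_weights=(0.25, 0.25, 0.25, 0.25), exclude_indices=None):
        self.weights = tuple(ngram_weights)
        self.exclude = exclude_indices or set()
        self.matches = Counter()  # clipped n-gram matches, keyed by n
        self.totals = Counter()   # predicted n-grams, keyed by n
        self.prediction_length = 0
        self.reference_length = 0

    def update(self, predictions, references):
        for pred, ref in zip(predictions, references):
            # Assumption: excluded tokens are simply removed first.
            pred = [t for t in pred if t not in self.exclude]
            ref = [t for t in ref if t not in self.exclude]
            self.prediction_length += len(pred)
            self.reference_length += len(ref)
            for n in range(1, len(self.weights) + 1):
                pred_counts = ngram_counts(pred, n)
                ref_counts = ngram_counts(ref, n)
                # Counter intersection clips each match at the reference count.
                self.matches[n] += sum((pred_counts & ref_counts).values())
                self.totals[n] += sum(pred_counts.values())

    def score(self):
        orders = range(1, len(self.weights) + 1)
        # Simplification: any n-gram order with no matches gives a score of 0.
        if self.prediction_length == 0 or any(self.matches[n] == 0 for n in orders):
            return 0.0
        log_precision = sum(
            w * math.log(self.matches[n] / self.totals[n])
            for n, w in zip(orders, self.weights)
        )
        if self.prediction_length > self.reference_length:
            brevity_penalty = 1.0
        else:
            brevity_penalty = math.exp(1.0 - self.reference_length / self.prediction_length)
        return brevity_penalty * math.exp(log_precision)


bleu = RunningBleu(exclude_indices={0})
bleu.update([[1, 2, 3, 4, 5, 0]], [[1, 2, 3, 4, 5, 0, 0]])   # batch 1
bleu.update([[6, 7, 8, 9, 0, 0]], [[6, 7, 8, 9, 0, 0, 0]])   # batch 2
perfect_score = bleu.score()
```

Only integer counters are stored between batches, so the corpus-level score can be read out at any point without revisiting earlier batches.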
-
class
allennlp.training.metrics.categorical_accuracy.
CategoricalAccuracy
(top_k: int = 1, tie_break: bool = False)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Categorical Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class. Tie break enables equal distribution of scores among the classes with same maximum predicted scores.
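Both behaviours can be illustrated with a small plain-Python sketch. The function names are hypothetical, scores are given as lists of per-class values, and the tie-break rule shown (a tied top prediction earns 1 / number-of-tied-classes credit when the gold label is among them) is an assumption about what "equal distribution of scores" means here.

```python
def top_k_accuracy(scores, gold_labels, top_k=1):
    """Fraction of items whose gold label is among the top_k scored classes."""
    correct = 0
    for row, gold in zip(scores, gold_labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        correct += int(gold in ranked[:top_k])
    return correct / len(gold_labels)


def top_1_accuracy_with_tie_break(scores, gold_labels):
    """Top-1 accuracy where tied maximum classes share the credit equally."""
    credit = 0.0
    for row, gold in zip(scores, gold_labels):
        best = max(row)
        tied = [c for c, s in enumerate(row) if s == best]
        if gold in tied:
            credit += 1.0 / len(tied)
    return credit / len(gold_labels)


scores = [[0.1, 0.7, 0.2], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
gold_labels = [1, 0, 0]
top_1 = top_k_accuracy(scores, gold_labels, top_k=1)
top_2 = top_k_accuracy(scores, gold_labels, top_k=2)
tie_broken = top_1_accuracy_with_tie_break(scores, gold_labels)
```

In the second row two classes tie for the maximum, so without tie breaking the result depends on which tied class happens to be ranked first, while with tie breaking the item contributes half credit.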
-
class
allennlp.training.metrics.conll_coref_scores.
ConllCorefScores
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
-
get_metric
(self, reset: bool = False) → Tuple[float, float, float][source]¶ Compute and return the metric. Optionally also call self.reset().
-
-
class
allennlp.training.metrics.conll_coref_scores.
Scorer
(metric)[source]¶ Bases:
object
Mostly borrowed from <https://github.com/clarkkev/deep-coref/blob/master/evaluation.py>
-
static
b_cubed
(clusters, mention_to_gold)[source]¶ Averaged per-mention precision and recall. <https://pdfs.semanticscholar.org/cfe3/c24695f1c14b78a5b8e95bcbd1c666140fd1.pdf>
-
static
ceafe
(clusters, gold_clusters)[source]¶ Computes the Constrained Entity-Alignment F-Measure (CEAF) for evaluating coreference. Gold and predicted mentions are aligned into clusterings which maximise a metric - in this case, the F measure between gold and predicted clusters.
-
static
muc
(clusters, mention_to_gold)[source]¶ Counts the mentions in each predicted cluster which need to be re-allocated in order for each predicted cluster to be contained by the respective gold cluster. <https://aclweb.org/anthology/M/M95/M95-1005.pdf>
-
static
-
class
allennlp.training.metrics.covariance.
Covariance
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This Metric calculates the unbiased sample covariance between two tensors. Each element in the two tensors is assumed to be a different observation of the variable (i.e., the input tensors are implicitly flattened into vectors and the covariance is calculated between the vectors).
This implementation is mostly modeled after the streaming_covariance function in Tensorflow. See: https://github.com/tensorflow/tensorflow/blob/v1.10.1/tensorflow/contrib/metrics/python/ops/metric_ops.py#L3127
The following is copied from the Tensorflow documentation:
The algorithm used for this online computation is described in https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online . Specifically, the formula used to combine two sample comoments is
C_AB = C_A + C_B + (E[x_A] - E[x_B]) * (E[y_A] - E[y_B]) * n_A * n_B / n_AB
The comoment for a single batch of data is simply sum((x - E[x]) * (y - E[y])), optionally masked.
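The comoment-combination formula can be sketched as a small streaming accumulator in plain Python. The class name StreamingCovariance is hypothetical, masking is omitted, and flat lists stand in for flattened tensors.

```python
class StreamingCovariance:
    """Hypothetical accumulator applying the C_AB combination formula per batch."""

    def __init__(self):
        self.count = 0       # n_A: observations seen so far
        self.mean_x = 0.0    # E[x_A]
        self.mean_y = 0.0    # E[y_A]
        self.comoment = 0.0  # C_A

    def update(self, xs, ys):
        n_b = len(xs)
        if n_b == 0:
            return
        mean_xb = sum(xs) / n_b
        mean_yb = sum(ys) / n_b
        # Comoment of the new batch on its own: sum((x - E[x]) * (y - E[y])).
        comoment_b = sum((x - mean_xb) * (y - mean_yb) for x, y in zip(xs, ys))
        n_ab = self.count + n_b
        # C_AB = C_A + C_B + (E[x_A] - E[x_B]) * (E[y_A] - E[y_B]) * n_A * n_B / n_AB
        self.comoment += comoment_b + (
            (self.mean_x - mean_xb) * (self.mean_y - mean_yb) * self.count * n_b / n_ab
        )
        self.mean_x = (self.mean_x * self.count + mean_xb * n_b) / n_ab
        self.mean_y = (self.mean_y * self.count + mean_yb * n_b) / n_ab
        self.count = n_ab

    def value(self):
        # Unbiased sample covariance divides the comoment by n - 1.
        return self.comoment / (self.count - 1)


streaming = StreamingCovariance()
streaming.update([1.0, 2.0], [2.0, 4.0])
streaming.update([3.0, 4.0], [6.0, 9.0])
covariance = streaming.value()
```

Processing the four observations in two batches gives the same result as computing the sample covariance over all of them at once.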
-
class
allennlp.training.metrics.drop_em_and_f1.
DropEmAndF1
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 score using the official DROP evaluator (which has special handling for numbers and for questions with multiple answer spans, among other things).
-
class
allennlp.training.metrics.evalb_bracketing_scorer.
EvalbBracketingScorer
(evalb_directory_path: str = '/Users/michael/hack/allenai/allennlp/allennlp/tools/EVALB', evalb_param_filename: str = 'COLLINS.prm')[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This class uses the external EVALB software for computing a broad range of metrics on parse trees. Here, we use it to compute the Precision, Recall and F1 metrics. You can download the source for EVALB from here: <https://nlp.cs.nyu.edu/evalb/>.
Note that this software is 20 years old. In order to compile it on modern hardware, you may need to remove an include <malloc.h> statement in evalb.c before it will compile.
AllenNLP contains the EVALB software, but you will need to compile it yourself before using it because the binary it generates is system dependent. To build it, run make inside the allennlp/tools/EVALB directory.
Note that this metric reads and writes from disk quite a bit. You probably don’t want to include it in your training loop; instead, you should calculate this on a validation set only.
- Parameters
- evalb_directory_path: str, required. The directory containing the EVALB executable.
- evalb_param_filename: str, optional (default = “COLLINS.prm”). The relative name of the EVALB configuration file used when scoring the trees. By default, this uses the COLLINS.prm configuration file which comes with EVALB. This configuration ignores POS tags and some punctuation labels.
-
static
clean_evalb
(evalb_directory_path: str = '/Users/michael/hack/allenai/allennlp/allennlp/tools/EVALB')[source]¶
-
class
allennlp.training.metrics.fbeta_measure.
FBetaMeasure
(beta: float = 1.0, average: str = None, labels: List[int] = None)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Compute precision, recall, F-measure and support for each class.
The precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
If we have precision and recall, the F-beta score is simply:
F-beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.
The support is the number of occurrences of each class in y_true.
- Parameters
- beta: float, optional (default = 1.0). The strength of recall versus precision in the F-score.
- average: string, one of [None (default), ‘micro’, ‘macro’]. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
  - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
  - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- labels: list, optional. The set of labels to include and their order if average is None. Labels present in the data can be excluded, for example to calculate a multi-class average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average.
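The precision, recall and F-beta formulas, together with the difference between 'micro' and 'macro' averaging, can be worked through in plain Python. The helper name and the per-class counts below are made up for illustration. Returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Apply the precision, recall and F-beta formulas to raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, fbeta


# Illustrative (tp, fp, fn) counts for two classes.
counts = {0: (8, 2, 2), 1: (1, 1, 3)}

# average=None: one score per class.
per_class = {label: precision_recall_fbeta(*c) for label, c in counts.items()}

# 'macro': unweighted mean of the per-class scores.
macro_f1 = sum(scores[2] for scores in per_class.values()) / len(per_class)

# 'micro': pool the counts over all classes, then apply the formulas once.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_precision, micro_recall, micro_f1 = precision_recall_fbeta(total_tp, total_fp, total_fn)
```

With these counts the frequent class dominates the micro average, while the macro average is pulled down by the weaker score on the rare class.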
-
class
allennlp.training.metrics.f1_measure.
F1Measure
(positive_label: int)[source]¶ Bases:
allennlp.training.metrics.fbeta_measure.FBetaMeasure
Computes Precision, Recall and F1 with respect to a given positive_label. For example, for a BIO tagging scheme, you would pass the classification index of the tag you are interested in, resulting in the Precision, Recall and F1 score being calculated for this tag only.
-
class
allennlp.training.metrics.mean_absolute_error.
MeanAbsoluteError
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This Metric calculates the mean absolute error (MAE) between two tensors.
-
class
allennlp.training.metrics.mention_recall.
MentionRecall
[source]¶
-
class
allennlp.training.metrics.pearson_correlation.
PearsonCorrelation
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This Metric calculates the sample Pearson correlation coefficient (r) between two tensors. Each element in the two tensors is assumed to be a different observation of the variable (i.e., the input tensors are implicitly flattened into vectors and the correlation is calculated between the vectors).
This implementation is mostly modeled after the streaming_pearson_correlation function in Tensorflow. See https://github.com/tensorflow/tensorflow/blob/v1.10.1/tensorflow/contrib/metrics/python/ops/metric_ops.py#L3267
This metric delegates to the Covariance metric the tracking of three [co]variances:
- covariance(predictions, labels), i.e. covariance
- covariance(predictions, predictions), i.e. variance of predictions
- covariance(labels, labels), i.e. variance of labels
If we have these values, the sample Pearson correlation coefficient is simply:
r = covariance / (sqrt(predictions_variance) * sqrt(labels_variance))
If predictions_variance or labels_variance is 0, r is 0.
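The relationship between the three [co]variances and r can be sketched in plain Python. The function names are hypothetical, and for brevity the covariances are computed in one pass over lists rather than streamed batch by batch.

```python
import math


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(predictions, labels):
    """r = covariance / (sqrt(predictions_variance) * sqrt(labels_variance))."""
    covariance = sample_covariance(predictions, labels)
    predictions_variance = sample_covariance(predictions, predictions)
    labels_variance = sample_covariance(labels, labels)
    if predictions_variance == 0 or labels_variance == 0:
        return 0.0  # r is defined as 0 when either variance is 0
    return covariance / (math.sqrt(predictions_variance) * math.sqrt(labels_variance))


r_linear = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_inverse = pearson_r([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
r_constant = pearson_r([1.0, 2.0, 3.0, 4.0], [5.0, 5.0, 5.0, 5.0])
```

The constant-labels case shows the zero-variance rule: the division is skipped and r is reported as 0.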
-
class
allennlp.training.metrics.perplexity.
Perplexity
[source]¶ Bases:
allennlp.training.metrics.average.Average
Perplexity is a common metric used for evaluating how well a language model predicts a sample.
Notes
Assumes negative log likelihood loss of each batch (base e). Provides the average perplexity of the batches.
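As a rough illustration, the sketch below averages per-batch negative log likelihood losses (base e) and exponentiates the result. Exponentiating the mean loss is an assumption about how the value is derived from the accumulated average. The function name is hypothetical.

```python
import math


def perplexity_from_batch_losses(batch_losses):
    """Exponentiate the mean of per-batch negative log likelihood losses (base e)."""
    if not batch_losses:
        return 0.0
    average_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(average_loss)


# Two batches whose losses correspond to perplexities of 2 and 8.
perplexity = perplexity_from_batch_losses([math.log(2.0), math.log(8.0)])
```

Because the average is taken in log space, the result here is the geometric mean of the two per-batch perplexities rather than their arithmetic mean.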
-
class
allennlp.training.metrics.sequence_accuracy.
SequenceAccuracy
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Sequence Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class.
-
class
allennlp.training.metrics.span_based_f1_measure.
SpanBasedF1Measure
(vocabulary: allennlp.data.vocabulary.Vocabulary, tag_namespace: str = 'tags', ignore_classes: List[str] = None, label_encoding: Optional[str] = 'BIO', tags_to_spans_function: Optional[Callable[[List[str], Optional[List[str]]], List[Tuple[str, Tuple[int, int]]]]] = None)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
The CoNLL SRL metrics are based on exact span matching. This metric implements span-based precision and recall metrics for a BIO tagging scheme. It will produce precision, recall and F1 measures per tag, as well as overall statistics. Note that the implementation of this metric is not exactly the same as the Perl script used to evaluate the CoNLL 2005 data - particularly, it does not consider continuations or reference spans as constituents of the original span. However, it is a close proxy, which can be helpful for judging model performance during training. This metric works properly when the spans are unlabeled (i.e., your labels are simply “B”, “I”, “O” if using the “BIO” label encoding).
-
class
allennlp.training.metrics.squad_em_and_f1.
SquadEmAndF1
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 score using the official SQuAD evaluation script.
-
class
allennlp.training.metrics.srl_eval_scorer.
SrlEvalScorer
(srl_eval_path: str = '/Users/michael/hack/allenai/allennlp/allennlp/tools/srl-eval.pl', ignore_classes: List[str] = None)[source]¶ Bases:
allennlp.training.metrics.metric.Metric
This class uses the external srl-eval.pl script for computing the CoNLL SRL metrics.
AllenNLP contains the srl-eval.pl script, but you will need perl 5.x.
Note that this metric reads and writes from disk quite a bit. In particular, it writes and subsequently reads two files per __call__, which is typically invoked once per batch. You probably don’t want to include it in your training loop; instead, you should calculate this on a validation set only.
- Parameters
- srl_eval_path: str, optional. The path to the srl-eval.pl script.
- ignore_classes: List[str], optional (default = None). A list of classes to ignore.
-
class
allennlp.training.metrics.unigram_recall.
UnigramRecall
[source]¶ Bases:
allennlp.training.metrics.metric.Metric
Unigram top-K recall. This does not take word order into account. Assumes integer labels, with each item to be classified having a single correct class.