This evaluation script relies heavily on the one for DROP (allennlp/tools/drop_eval.py). We need a separate script for Quoref only because the data formats are slightly different.
def evaluate_json(annotations: Dict[str, Any], predicted_answers: Dict[str, Any]) -> Tuple[float, float]
Takes gold annotations and predicted answers and evaluates the predictions for each question in the gold annotations. Both JSON dictionaries must have query_id keys, which are used to match predictions to gold annotations.
The predicted_answers JSON must be a dictionary keyed by query id, where each value is the answer: either a single string or a list of strings.
The annotations are assumed to be in either the format of the dev set in the Quoref data release or the same format as the predicted answers file.
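To make the matching and scoring behavior concrete, here is a minimal sketch of how such a function could work. It is not the actual implementation (the real script delegates EM/F1 computation to drop_eval), and the simplified annotations format used here (query id mapped directly to an answer) is an assumption for illustration, not the Quoref dev-set schema.

```python
from typing import Any, Dict, List, Tuple


def _candidates(answer: Any) -> List[str]:
    """Accept either a single string or a list of strings."""
    return [answer] if isinstance(answer, str) else list(answer)


def _token_f1(predicted: str, gold: str) -> float:
    """Bag-of-tokens F1: a simplification of the DROP-style metric."""
    pred_tokens, gold_tokens = predicted.lower().split(), gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = set(pred_tokens) & set(gold_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate_json_sketch(annotations: Dict[str, Any],
                         predicted_answers: Dict[str, Any]) -> Tuple[float, float]:
    """Match predictions to gold by query id and return (global_em, global_f1)."""
    em_total = f1_total = 0.0
    for query_id, gold in annotations.items():
        gold_str = " ".join(_candidates(gold))
        # A missing prediction scores zero against a non-empty gold answer.
        pred_str = " ".join(_candidates(predicted_answers.get(query_id, "")))
        em_total += float(pred_str.lower() == gold_str.lower())
        f1_total += _token_f1(pred_str, gold_str)
    count = len(annotations) or 1
    return em_total / count, f1_total / count
```

Note that the gold dictionary drives the loop, so extra predictions without a matching query id are ignored, while missing predictions count against the scores.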
def evaluate_prediction_file(prediction_path: str, gold_path: str, output_path: Optional[str] = None) -> Tuple[float, float]
Takes a prediction file and a gold file and evaluates the predictions for each question in the gold file. Both files must be JSON-formatted and must have query_id keys, which are used to match predictions to gold annotations. Writes a JSON file containing global_em and global_f1 metrics to the specified output path, unless None is passed as the output path.
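The file-level wrapper can be sketched as plain JSON plumbing around the scoring step. This is a simplified, self-contained illustration under assumed formats (both files are flat query-id-to-answer dictionaries); the name evaluate_prediction_file_sketch and the placeholder F1 are assumptions, since the real script computes token-level F1 via drop_eval.

```python
import json
from typing import Optional, Tuple


def _as_text(answer: object) -> str:
    # Answers may be a single string or a list of strings.
    return " ".join([answer] if isinstance(answer, str) else answer).lower()


def evaluate_prediction_file_sketch(prediction_path: str, gold_path: str,
                                    output_path: Optional[str] = None) -> Tuple[float, float]:
    """Load both JSON files, score predictions, optionally write metrics."""
    with open(prediction_path) as f:
        predicted = json.load(f)
    with open(gold_path) as f:
        gold = json.load(f)
    # Score each gold question against the matching prediction (by query id).
    scores = [float(_as_text(predicted.get(qid, "")) == _as_text(answer))
              for qid, answer in gold.items()]
    global_em = sum(scores) / (len(scores) or 1)
    # Placeholder: the real script computes a separate token-level F1.
    global_f1 = global_em
    if output_path is not None:
        with open(output_path, "w") as f:
            json.dump({"global_em": global_em, "global_f1": global_f1}, f)
    return global_em, global_f1
```

Passing None as the output path skips the write entirely, which is convenient when the metrics are only needed programmatically.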