task_suite
allennlp.confidence_checks.task_checklists.task_suite
TaskSuite¶
class TaskSuite(Registrable):
| def __init__(
| self,
| suite: Optional[TestSuite] = None,
| add_default_tests: bool = True,
| data: Optional[List[Any]] = None,
| num_test_cases: int = 100,
| **kwargs
| )
Base class for various task test suites.
This is a wrapper class around the CheckList toolkit introduced in the paper Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Ribeiro et al.).
Note
To use the checklist integration you should install allennlp with the "checklist" extra (e.g. conda install allennlp-checklist or pip install allennlp[checklist]), or just install checklist after the fact.
Task suites are intended to be used as a form of behavioral testing for NLP models, to check for robustness across several general linguistic capabilities, e.g. Vocabulary, SRL, Negation, etc.
An example of the entire checklist process can be found at: https://github.com/marcotcr/checklist/blob/master/notebooks/tutorials/.
A task suite should contain tests that check general capabilities, including but not limited to:
- Vocabulary + POS : Important words/word types for the task
- Taxonomy : Synonyms/antonyms, etc.
- Robustness : To typos, irrelevant changes, etc.
- NER : Appropriately understanding named entities.
- Temporal : Understanding the order of events.
- Negation
- Coreference
- Semantic Role Labeling : Understanding roles such as agents and objects.
- Logic : Ability to handle symmetry, consistency, and conjunctions.
- Fairness
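As a rough end-to-end sketch of how a suite is typically used (the registered suite name "sentiment-analysis" and the model archive path below are illustrative assumptions, not guaranteed parts of this API):

```python
from allennlp.predictors import Predictor
from allennlp.confidence_checks.task_checklists.task_suite import TaskSuite

# Load any trained AllenNLP model as a Predictor (archive path is hypothetical).
predictor = Predictor.from_path("path/to/model.tar.gz")

# Build a suite for a registered task; "sentiment-analysis" is an assumed
# registered name for one of the TaskSuite subclasses.
suite = TaskSuite.constructor(name="sentiment-analysis")

# Run every test against the predictor and print a per-capability summary.
suite.run(predictor)
suite.summary()
```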
Parameters¶
- suite : checklist.test_suite.TestSuite, optional (default = None)
  Pass in an existing test suite.
- add_default_tests : bool, optional (default = True)
  Whether to add default checklist tests for the task.
- data : List[Any], optional (default = None)
  If the data is provided, and add_default_tests is True, tests that perturb the data are also added. For instance, if the task is sentiment analysis and a list of sentences is passed, it will add tests that check a model's robustness to typos, etc.
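For example (a minimal sketch, assuming the sentiment-analysis subclass SentimentAnalysisSuite and some made-up sentences), passing data together with add_default_tests=True adds perturbation-based tests over those sentences:

```python
from allennlp.confidence_checks.task_checklists.sentiment_analysis_suite import (
    SentimentAnalysisSuite,
)

# Made-up sentences; with data provided, default tests that perturb them
# (typos, contractions, punctuation, etc.) are added to the suite.
sentences = ["The movie was great.", "I did not enjoy the food."]
suite = SentimentAnalysisSuite(add_default_tests=True, data=sentences)

suite.describe()  # print what the suite now contains
```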
describe¶
class TaskSuite(Registrable):
| ...
| def describe(self)
Gives a description of the test suite. This is intended as a utility for examining the test suite.
summary¶
class TaskSuite(Registrable):
| ...
| def summary(
| self,
| capabilities: Optional[List[str]] = None,
| file: TextIO = sys.stdout,
| **kwargs
| )
Prints a summary of the test results.
Parameters¶
- capabilities : List[str], optional (default = None)
  If not None, will only show tests with these capabilities.
- **kwargs
  Will be passed as arguments to each test.summary() call.
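A hedged sketch of the keyword forwarding; n is an assumption about the underlying checklist Test.summary() signature (number of failing examples to print), not something defined by this class:

```python
# Forward `n` to each underlying checklist test's summary().
suite.summary(n=3)
```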
run¶
class TaskSuite(Registrable):
| ...
| def run(
| self,
| predictor: Predictor,
| capabilities: Optional[List[str]] = None,
| max_examples: Optional[int] = None
| )
Runs the predictor on the test suite data.
Parameters¶
- predictor : Predictor
  The predictor object.
- capabilities : List[str], optional (default = None)
  If not None, will only run tests with these capabilities.
- max_examples : int, optional (default = None)
  Maximum number of examples to run. If None, all examples will be run.
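A brief sketch (the capability name is illustrative); limiting max_examples can be handy as a quick check during development:

```python
# Run only Robustness tests, with at most 20 examples per test.
suite.run(predictor, capabilities=["Robustness"], max_examples=20)
```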
constructor¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def constructor(
| cls,
| name: Optional[str] = None,
| suite_file: Optional[str] = None,
| extra_args: Optional[Dict[str, Any]] = None
| ) -> "TaskSuite"
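The classmethod above carries no description here; as a hedged sketch, it appears to build a suite either from a registered suite name or from a previously saved suite file (the name and path below are illustrative assumptions):

```python
# By registered name (assumed name of a TaskSuite subclass).
suite = TaskSuite.constructor(name="sentiment-analysis")

# From a previously saved suite file (path is hypothetical).
suite = TaskSuite.constructor(suite_file="path/to/saved_suite.pkl")
```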
save_suite¶
class TaskSuite(Registrable):
| ...
| def save_suite(self, suite_file: str)
Saves the suite to a file.
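A minimal sketch with a hypothetical file name; a saved suite can presumably be restored later via constructor(suite_file=...):

```python
suite.save_suite("my_task_suite.pkl")  # hypothetical file name
restored = TaskSuite.constructor(suite_file="my_task_suite.pkl")
```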
contractions¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def contractions(cls) -> Callable
This returns a function which adds/removes contractions in relevant str inputs of a task's inputs. For instance, "isn't" will be changed to "is not", and "will not" will be changed to "won't".
Expected arguments for this function: (example, *args, **kwargs), where example is an instance of some task. It can be of any type. For example, for a sentiment analysis task, it will be a str (the sentence for which we want to predict the sentiment). For a textual entailment task, it can be a tuple or a Dict, etc.
Expected output of this function is a list of instances for the task, of the same type as example.
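As a rough sketch of how the returned callable behaves for a simple str example (the sentence is made up, and the exact outputs depend on the checklist library):

```python
contraction_fn = TaskSuite.contractions()

# Returns a list of variants of the input with contractions toggled,
# e.g. "isn't" expanded to "is not" (exact outputs depend on checklist).
variants = contraction_fn("The plot isn't bad, but the acting is not great.")
print(variants)
```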
typos¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def typos(cls) -> Callable
This returns a function which adds simple typos in relevant str inputs of a task's inputs.
Expected arguments for this function: (example, *args, **kwargs), where example is an instance of some task. It can be of any type. For example, for a sentiment analysis task, it will be a str (the sentence for which we want to predict the sentiment). For a textual entailment task, it can be a tuple or a Dict, etc.
Expected output of this function is a list of instances for the task, of the same type as example.
punctuation¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def punctuation(cls) -> Callable
This returns a function which adds/removes punctuation in relevant str inputs of a task's inputs; for instance, a trailing period may be added or removed.
Expected arguments for this function: (example, *args, **kwargs), where example is an instance of some task. It can be of any type. For example, for a sentiment analysis task, it will be a str (the sentence for which we want to predict the sentiment). For a textual entailment task, it can be a tuple or a Dict, etc.
Expected output of this function is a list of instances for the task, of the same type as example.
add_test¶
class TaskSuite(Registrable):
| ...
| def add_test(self, test: Union[MFT, INV, DIR])
Adds a fully specified checklist test to the suite. The tests can be of the following types:
- MFT: A minimum functionality test. It checks if the predicted output matches the expected output. For example, for a sentiment analysis task, a simple MFT can check if the model always predicts a positive sentiment for very positive words. The test's data contains the input and the expected output.
- INV: An invariance test. It checks if the predicted output is invariant to some change in the input. For example, for a sentiment analysis task, an INV test can check if the prediction stays consistent if simple typos are added. The test's data contains the pairs (input, modified input).
- DIR: A directional expectation test. It checks if the predicted output changes in some specific way in response to the change in input. For example, for a sentiment analysis task, a DIR test can check if adding a reducer (e.g. "good" -> "somewhat good") causes the prediction's positive confidence score to decrease (or at least not increase). The test's data contains single inputs or pairs (input, modified input).
Please refer to the paper for more details and examples.
Note: test needs to be fully specified, with a name, capability, and description.
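A hedged sketch of adding tests built with the checklist types (the sentences, labels, and capability names below are made up for illustration):

```python
from checklist.test_types import MFT, INV
from checklist.perturb import Perturb

# MFT: inputs with known expected outputs (labels here are illustrative).
mft = MFT(
    ["I absolutely loved it.", "This was dreadful."],
    labels=[1, 0],
    name="Simple polarity words",
    capability="Vocabulary",
    description="Clearly positive/negative sentences get the expected label.",
)
suite.add_test(mft)

# INV: predictions should not change when simple typos are introduced.
typo_data = Perturb.perturb(["The service was excellent."], TaskSuite.typos())
inv = INV(
    typo_data.data,
    name="Robustness to typos",
    capability="Robustness",
    description="Adding simple typos should not change the prediction.",
)
suite.add_test(inv)
```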