task_suite
allennlp.confidence_checks.task_checklists.task_suite
TaskSuite¶
class TaskSuite(Registrable):
| def __init__(
| self,
| suite: Optional[TestSuite] = None,
| add_default_tests: bool = True,
| data: Optional[List[Any]] = None,
| num_test_cases: int = 100,
| **kwargs
| )
Base class for various task test suites.
This is a wrapper class around the CheckList toolkit introduced in the paper Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Ribeiro et al.).
Note
To use the checklist integration you should install allennlp with the "checklist" extra (e.g. conda install allennlp-checklist or pip install allennlp[checklist]), or just install checklist after the fact.
Task suites are intended to be used as a form of behavioral testing for NLP models, to check for robustness across several general linguistic capabilities, e.g. Vocabulary, SRL, Negation, etc.
An example of the entire checklist process can be found at: https://github.com/marcotcr/checklist/blob/master/notebooks/tutorials/.
A task suite should contain tests that check general capabilities, including but not limited to:
- Vocabulary + POS : Important words/word types for the task
- Taxonomy : Synonyms/antonyms, etc.
- Robustness : To typos, irrelevant changes, etc.
- NER : Appropriately understanding named entities.
- Temporal : Understanding the order of events.
- Negation
- Coreference
- Semantic Role Labeling : Understanding roles such as agents and objects.
- Logic : Ability to handle symmetry, consistency, and conjunctions.
- Fairness
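As a rough end-to-end sketch of how a suite is typically used (the registered suite name "sentiment-analysis" and the model archive path below are illustrative assumptions, not guaranteed parts of this API):

```python
from allennlp.predictors import Predictor
from allennlp.confidence_checks.task_checklists.task_suite import TaskSuite

# Load any trained AllenNLP model as a Predictor (archive path is hypothetical).
predictor = Predictor.from_path("path/to/model.tar.gz")

# Build a suite for a registered task; "sentiment-analysis" is an assumed
# registered name for one of the TaskSuite subclasses.
suite = TaskSuite.constructor(name="sentiment-analysis")

# Run every test against the predictor and print a per-capability summary.
suite.run(predictor)
suite.summary()
```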
Parameters¶
- suite : checklist.test_suite.TestSuite, optional (default = None)
  Pass in an existing test suite.
- add_default_tests : bool, optional (default = True)
  Whether to add default checklist tests for the task.
- data : List[Any], optional (default = None)
  If the data is provided, and add_default_tests is True, tests that perturb the data are also added. For instance, if the task is sentiment analysis and a list of sentences is passed, it will add tests that check a model's robustness to typos, etc.
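For example (a minimal sketch, assuming the sentiment-analysis subclass SentimentAnalysisSuite and some made-up sentences), passing data together with add_default_tests=True adds perturbation-based tests over those sentences:

```python
from allennlp.confidence_checks.task_checklists.sentiment_analysis_suite import (
    SentimentAnalysisSuite,
)

# Made-up sentences; with data provided, default tests that perturb them
# (typos, contractions, punctuation, etc.) are added to the suite.
sentences = ["The movie was great.", "I did not enjoy the food."]
suite = SentimentAnalysisSuite(add_default_tests=True, data=sentences)

suite.describe()  # print what the suite now contains
```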
describe¶
class TaskSuite(Registrable):
| ...
| def describe(self)
Gives a description of the test suite. This is intended as a utility for examining the test suite.
summary¶
class TaskSuite(Registrable):
| ...
| def summary(
| self,
| capabilities: Optional[List[str]] = None,
| file: TextIO = sys.stdout,
| **kwargs
| )
Prints a summary of the test results.
Parameters¶
- capabilities : List[str], optional (default = None)
  If not None, will only show tests with these capabilities.
- **kwargs
  Will be passed as arguments to each test.summary() call.
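A hedged sketch of the keyword forwarding; n is an assumption about the underlying checklist Test.summary() signature (number of failing examples to print), not something defined by this class:

```python
# Forward `n` to each underlying checklist test's summary().
suite.summary(n=3)
```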
run¶
class TaskSuite(Registrable):
| ...
| def run(
| self,
| predictor: Predictor,
| capabilities: Optional[List[str]] = None,
| max_examples: Optional[int] = None
| )
Runs the predictor on the test suite data.
Parameters¶
- predictor : Predictor
  The predictor object.
- capabilities : List[str], optional (default = None)
  If not None, will only run tests with these capabilities.
- max_examples : int, optional (default = None)
  Maximum number of examples to run. If None, all examples will be run.
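A brief sketch (the capability name is illustrative); limiting max_examples can be handy as a quick check during development:

```python
# Run only Robustness tests, with at most 20 examples per test.
suite.run(predictor, capabilities=["Robustness"], max_examples=20)
```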
constructor¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def constructor(
| cls,
| name: Optional[str] = None,
| suite_file: Optional[str] = None,
| extra_args: Optional[Dict[str, Any]] = None
| ) -> "TaskSuite"
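The classmethod above carries no description here; as a hedged sketch, it appears to build a suite either from a registered suite name or from a previously saved suite file (the name and path below are illustrative assumptions):

```python
# By registered name (assumed name of a TaskSuite subclass).
suite = TaskSuite.constructor(name="sentiment-analysis")

# From a previously saved suite file (path is hypothetical).
suite = TaskSuite.constructor(suite_file="path/to/saved_suite.pkl")
```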
save_suite¶
class TaskSuite(Registrable):
| ...
| def save_suite(self, suite_file: str)
Saves the suite to a file.
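A minimal sketch with a hypothetical file name; a saved suite can presumably be restored later via constructor(suite_file=...):

```python
suite.save_suite("my_task_suite.pkl")  # hypothetical file name
restored = TaskSuite.constructor(suite_file="my_task_suite.pkl")
```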
contractions¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def contractions(cls) -> Callable
This returns a function which adds/removes contractions in relevant str inputs of a task's inputs. For instance, "isn't" will be changed to "is not", and "will not" will be changed to "won't".
Expected arguments for this function: (example, *args, **kwargs), where example is an instance of some task. It can be of any type. For example, for a sentiment analysis task, it will be a str (the sentence for which we want to predict the sentiment). For a textual entailment task, it can be a tuple or a Dict, etc.
Expected output of this function is a list of instances for the task, of the same type as example.
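As a rough sketch of how the returned callable behaves for a simple str example (the sentence is made up, and the exact outputs depend on the checklist library):

```python
contraction_fn = TaskSuite.contractions()

# Returns a list of variants of the input with contractions toggled,
# e.g. "isn't" expanded to "is not" (exact outputs depend on checklist).
variants = contraction_fn("The plot isn't bad, but the acting is not great.")
print(variants)
```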
typos¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def typos(cls) -> Callable
This returns a function which adds simple typos in relevant str inputs of a task's inputs.
Expected arguments for this function: (example, *args, **kwargs), where example is an instance of some task. It can be of any type. For example, for a sentiment analysis task, it will be a str (the sentence for which we want to predict the sentiment). For a textual entailment task, it can be a tuple or a Dict, etc.
Expected output of this function is a list of instances for the task, of the same type as example.
punctuation¶
class TaskSuite(Registrable):
| ...
| @classmethod
| def punctuation(cls) -> Callable
This returns a function which adds/removes punctuation in relevant str inputs of a task's inputs; for instance, a trailing period may be added or removed.
Expected arguments for this function: (example, *args, **kwargs), where example is an instance of some task. It can be of any type. For example, for a sentiment analysis task, it will be a str (the sentence for which we want to predict the sentiment). For a textual entailment task, it can be a tuple or a Dict, etc.
Expected output of this function is a list of instances for the task, of the same type as example.
add_test¶
class TaskSuite(Registrable):
| ...
| def add_test(self, test: Union[MFT, INV, DIR])
Adds a fully specified checklist test to the suite. The tests can be of the following types:
- MFT: A minimum functionality test. It checks if the predicted output matches the expected output. For example, for a sentiment analysis task, a simple MFT can check if the model always predicts a positive sentiment for very positive words. The test's data contains the input and the expected output.
- INV: An invariance test. It checks if the predicted output is invariant to some change in the input. For example, for a sentiment analysis task, an INV test can check if the prediction stays consistent if simple typos are added. The test's data contains the pairs (input, modified input).
- DIR: A directional expectation test. It checks if the predicted output changes in some specific way in response to the change in input. For example, for a sentiment analysis task, a DIR test can check if adding a reducer (e.g. "good" -> "somewhat good") causes the prediction's positive confidence score to decrease (or at least not increase). The test's data contains single inputs or pairs (input, modified input).
Please refer to the paper for more details and examples.
Note: test needs to be fully specified, with a name, capability, and description.
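A hedged sketch of adding tests built with the checklist types (the sentences, labels, and capability names below are made up for illustration):

```python
from checklist.test_types import MFT, INV
from checklist.perturb import Perturb

# MFT: inputs with known expected outputs (labels here are illustrative).
mft = MFT(
    ["I absolutely loved it.", "This was dreadful."],
    labels=[1, 0],
    name="Simple polarity words",
    capability="Vocabulary",
    description="Clearly positive/negative sentences get the expected label.",
)
suite.add_test(mft)

# INV: predictions should not change when simple typos are introduced.
typo_data = Perturb.perturb(["The service was excellent."], TaskSuite.typos())
inv = INV(
    typo_data.data,
    name="Robustness to typos",
    capability="Robustness",
    description="Adding simple typos should not change the prediction.",
)
suite.add_test(inv)
```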