allennlp.interpret.attackers

class allennlp.interpret.attackers.attacker.Attacker(predictor: allennlp.predictors.predictor.Predictor)

Bases: allennlp.common.registrable.Registrable

An Attacker will modify an input (e.g., add or delete tokens) to try to change an AllenNLP Predictor’s output in a desired manner (e.g., make it incorrect).

attack_from_json(self, inputs: Dict[str, Any], input_field_to_attack: str, grad_input_field: str, ignore_tokens: List[str], target: Dict[str, Any]) → Dict[str, Any]

This function finds a modification to the input text that would change the model’s prediction in some desired manner (e.g., an adversarial attack).

Parameters
inputs : JsonDict

The input you want to attack (the same as the argument to a Predictor, e.g., predict_json()).

input_field_to_attack : str

The key in the inputs JsonDict you want to attack, e.g., tokens.

grad_input_field : str

The field in the gradients dictionary that contains the input gradients. For example, grad_input_1 will be the field for single input tasks. See get_gradients() in Predictor for more information on field names.

target : JsonDict

If given, this is a targeted attack, trying to change the prediction to a particular value, instead of just changing it from its original prediction. Subclasses are not required to accept this argument, as not all attacks make sense as targeted attacks. Perhaps that means we should make the API more crisp, but adding another class is not worth it.

Returns
reduced_input : JsonDict

Contains the final, sanitized input after adversarial modification.

initialize(self)

Initializes any components of the Attacker that are expensive to compute, so that they are not created in __init__(). The default implementation does nothing.
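The deferred-initialization pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: LazyAttacker, embedding_table, and attack() are hypothetical names, not part of AllenNLP, and the "expensive" step is a toy stand-in.

```python
class LazyAttacker:
    """Hypothetical stand-in for an Attacker-like class (not the AllenNLP class)."""

    def __init__(self, vocabulary):
        # Cheap: only store what was passed in; no heavy computation here.
        self.vocabulary = list(vocabulary)
        self.embedding_table = None

    def initialize(self):
        # The expensive component is built only when explicitly requested.
        if self.embedding_table is None:
            self.embedding_table = {
                word: [float(ord(ch)) for ch in word] for word in self.vocabulary
            }

    def attack(self, token):
        # Guard against use before initialize() has been called.
        if self.embedding_table is None:
            raise RuntimeError("call initialize() before attack()")
        return self.embedding_table.get(token)


attacker = LazyAttacker(["cat", "dog"])
assert attacker.embedding_table is None  # nothing expensive has run yet
attacker.initialize()
```

Constructing the object stays cheap, and callers pay the setup cost once, at a point they choose.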

class allennlp.interpret.attackers.hotflip.Hotflip(predictor: allennlp.predictors.predictor.Predictor, vocab_namespace: str = 'tokens', max_tokens: int = 5000)

Bases: allennlp.interpret.attackers.attacker.Attacker

Runs the HotFlip-style attack at the word level (https://arxiv.org/abs/1712.06751). We use the first-order Taylor approximation described in https://arxiv.org/abs/1903.06620, implemented in the function _first_order_taylor().

We try to reuse the embedding matrix from the model when deciding what other words to flip a token to. For a large class of models, this is straightforward. When there is a character-level encoder, however (e.g., ELMo or any char-CNN), or a combination of encoders (e.g., ELMo + GloVe), we need to construct a fake embedding matrix that we can use in _first_order_taylor(). We do this by getting a list of words from the model’s vocabulary and embedding them using the encoder. This can be expensive in both time and memory, so we take a max_tokens parameter to limit the size of this fake embedding matrix. This also requires the model to have a token vocabulary in the first place, which can be problematic for models that only have character vocabularies.

Parameters
predictor : Predictor

The model (inside a Predictor) that we’re attacking. We use this to get gradients and predictions.

vocab_namespace : str, optional (default='tokens')

We use this to know three things: (1) which tokens we should ignore when producing flips (we don’t consider non-alphanumeric tokens); (2) what the string value is of the token that we produced, so we can show something human-readable to the user; and (3) if we need to construct a fake embedding matrix, we use the tokens in the vocabulary as flip candidates.

max_tokens : int, optional (default=5000)

This is only used when we need to construct a fake embedding matrix. That matrix can take a lot of memory when the vocab size is large. This parameter puts a cap on the number of tokens to use, so the fake embedding matrix doesn’t take as much memory.
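As a rough illustration of how max_tokens bounds the fake embedding matrix, the sketch below embeds only a capped slice of the vocabulary. It is an assumption-laden toy: build_fake_embedding_matrix and toy_encoder are hypothetical names, the encoder is a placeholder for the model's real one, and keeping the first max_tokens vocabulary entries is an assumed selection rule.

```python
from typing import Callable, Dict, List


def build_fake_embedding_matrix(
    vocab: List[str],
    encode_word: Callable[[str], List[float]],
    max_tokens: int = 5000,
) -> Dict[str, List[float]]:
    """Embed at most max_tokens vocabulary words with the given encoder."""
    # Assumption: the cap keeps the first max_tokens entries of the vocabulary.
    candidates = vocab[:max_tokens]
    return {word: encode_word(word) for word in candidates}


def toy_encoder(word: str) -> List[float]:
    # Placeholder for a character-level encoder; not a real model.
    return [float(len(word)), float(sum(ord(c) for c in word) % 7)]


vocab = ["the", "movie", "was", "great", "terrible"]
matrix = build_fake_embedding_matrix(vocab, toy_encoder, max_tokens=3)
```

Memory grows with the number of rows kept, so a smaller max_tokens trades flip-candidate coverage for a smaller matrix.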

attack_from_json(self, inputs: Dict[str, Any], input_field_to_attack: str = 'tokens', grad_input_field: str = 'grad_input_1', ignore_tokens: List[str] = None, target: Dict[str, Any] = None) → Dict[str, Any]

Replaces one token at a time from the input until the model’s prediction changes. input_field_to_attack names the input field to attack (for example, tokens). grad_input_field is a key into the gradients dictionary (for example, grad_input_1).

The method computes the gradient w.r.t. the tokens, finds the token with the maximum gradient (by L2 norm), and replaces it with another token based on the first-order Taylor approximation of the loss. This process is iteratively repeated until the prediction changes. Once a token is replaced, it is not flipped again.
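One iteration of the procedure above can be sketched in plain Python. This is a minimal sketch, not the library's implementation: pick_token_to_flip and first_order_flip are hypothetical helpers, embeddings are plain lists, and the sign convention assumes an untargeted attack that tries to increase the loss.

```python
import math
from typing import List, Set


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def pick_token_to_flip(grads: List[List[float]], already_flipped: Set[int]) -> int:
    """Position of the not-yet-flipped token whose gradient has the largest L2 norm."""
    best_index, best_norm = -1, -1.0
    for index, grad in enumerate(grads):
        if index in already_flipped:
            continue  # a token that was replaced is not flipped again
        norm = math.sqrt(dot(grad, grad))
        if norm > best_norm:
            best_index, best_norm = index, norm
    return best_index


def first_order_flip(
    grad: List[float],
    current_embedding: List[float],
    embedding_matrix: List[List[float]],
) -> int:
    """Row index maximizing the first-order estimate (e_new - e_old) . grad."""
    baseline = dot(grad, current_embedding)
    scores = [dot(grad, row) - baseline for row in embedding_matrix]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: three vocabulary entries, a two-token input.
embedding_matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
token_ids = [0, 1]
grads = [[0.1, 0.0], [0.5, -2.0]]  # gradient of the loss w.r.t. each token embedding

position = pick_token_to_flip(grads, already_flipped=set())
new_id = first_order_flip(
    grads[position], embedding_matrix[token_ids[position]], embedding_matrix
)
```

In a full attack, this step would be repeated with fresh gradients after each replacement, adding position to already_flipped, until the prediction changes.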

Parameters
inputs : JsonDict

The model inputs, the same as what is passed to a Predictor.

input_field_to_attack : str, optional (default='tokens')

The field that has the tokens that we’re going to be flipping. This must be a TextField.

grad_input_field : str, optional (default='grad_input_1')

If there is more than one field that gets embedded in your model (e.g., a question and a passage, or a premise and a hypothesis), this tells us the key to use to get the correct gradients. This selects from the output of Predictor.get_gradients().

ignore_tokens : List[str], optional (default=DEFAULT_IGNORE_TOKENS)

These tokens will not be flipped. The default list includes some simple punctuation, OOV and padding tokens, and common control tokens for BERT, etc.

target : JsonDict, optional (default=None)

If given, this will be a targeted hotflip attack, where instead of just trying to change a model’s prediction from what it is currently predicting, we try to change it to a specific target value. This is a JsonDict because it needs to specify the field name and target value. For example, for a masked LM, this would be something like {"words": ["she"]}, because "words" is the field name, there is one mask token (hence the list of length one), and we want to change the prediction from whatever it was to "she".

initialize(self)

Call this function before running attack_from_json(). We put the call to _construct_embedding_matrix() in this function to prevent a large amount of compute being done when __init__() is called.

class allennlp.interpret.attackers.input_reduction.InputReduction(predictor: allennlp.predictors.predictor.Predictor, beam_size: int = 3)

Bases: allennlp.interpret.attackers.attacker.Attacker

Runs the input reduction method from Pathologies of Neural Models Make Interpretations Difficult, which removes as many words as possible from the input without changing the model’s prediction.

The functions on this class handle a special case for NER by looking for a field called “tags”. This check is brittle, i.e., the code could break if the name of this field has changed, or if a non-NER model has a field called “tags”.
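The core idea of input reduction, removing words while the prediction stays the same, can be sketched as a greedy loop. This is a simplified illustration rather than the class's actual algorithm: reduce_input, toy_predict, and toy_importance are hypothetical, importance scores stand in for gradient magnitudes, and beam search (beam_size) and the NER special case are omitted.

```python
from typing import Callable, List


def reduce_input(
    tokens: List[str],
    predict: Callable[[List[str]], str],
    importance: Callable[[List[str]], List[float]],
) -> List[str]:
    """Greedily drop tokens, least important first, while the prediction is unchanged."""
    original_prediction = predict(tokens)
    current = list(tokens)
    while len(current) > 1:
        scores = importance(current)
        # Try removal candidates from least to most important.
        order = sorted(range(len(current)), key=scores.__getitem__)
        for index in order:
            candidate = current[:index] + current[index + 1:]
            if predict(candidate) == original_prediction:
                current = candidate
                break
        else:
            # No single removal preserves the prediction; stop.
            break
    return current


def toy_predict(tokens: List[str]) -> str:
    # Placeholder classifier: positive if the word "great" is present.
    return "positive" if "great" in tokens else "negative"


def toy_importance(tokens: List[str]) -> List[float]:
    # Placeholder for gradient-based importance scores.
    return [1.0 if token == "great" else 0.0 for token in tokens]


reduced = reduce_input(["the", "movie", "was", "great"], toy_predict, toy_importance)
```

With this toy classifier, only the word that drives the prediction survives, which mirrors the kind of reduced input the method is meant to surface.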

attack_from_json(self, inputs: Dict[str, Any] = None, input_field_to_attack: str = 'tokens', grad_input_field: str = 'grad_input_1', ignore_tokens: List[str] = None, target: Dict[str, Any] = None)

This function finds a modification to the input text that would change the model’s prediction in some desired manner (e.g., an adversarial attack).

Parameters
inputs : JsonDict

The input you want to attack (the same as the argument to a Predictor, e.g., predict_json()).

input_field_to_attack : str

The key in the inputs JsonDict you want to attack, e.g., tokens.

grad_input_field : str

The field in the gradients dictionary that contains the input gradients. For example, grad_input_1 will be the field for single input tasks. See get_gradients() in Predictor for more information on field names.

target : JsonDict

If given, this is a targeted attack, trying to change the prediction to a particular value, instead of just changing it from its original prediction. Subclasses are not required to accept this argument, as not all attacks make sense as targeted attacks. Perhaps that means we should make the API more crisp, but adding another class is not worth it.

Returns
reduced_input : JsonDict

Contains the final, sanitized input after adversarial modification.

allennlp.interpret.attackers.utils.get_fields_to_compare(inputs: Dict[str, Any], instance: allennlp.data.instance.Instance, input_field_to_attack: str) → Dict[str, Any]

Gets the fields that should be checked for equality after an attack is performed.

Parameters
inputs : JsonDict

The input you want to attack, similar to the argument to a Predictor, e.g., predict_json().

instance : Instance

A labeled instance that is output from json_to_labeled_instances().

input_field_to_attack : str

The key in the inputs JsonDict you want to attack, e.g., tokens.

Returns
fields : JsonDict

The fields that must be compared for equality.

allennlp.interpret.attackers.utils.instance_has_changed(instance: allennlp.data.instance.Instance, fields_to_compare: Dict[str, Any])