
hotflip

allennlp.interpret.attackers.hotflip



DEFAULT_IGNORE_TOKENS

DEFAULT_IGNORE_TOKENS = ["@@NULL@@", ".", ",", ";", "!", "?", "[MASK]", "[SEP]", "[CLS]"]

Hotflip

@Attacker.register("hotflip")
class Hotflip(Attacker):
 | def __init__(
 |     self,
 |     predictor: Predictor,
 |     vocab_namespace: str = "tokens",
 |     max_tokens: int = 5000
 | ) -> None

Runs the HotFlip-style attack at the word level (https://arxiv.org/abs/1712.06751). We use the first-order Taylor approximation described in https://arxiv.org/abs/1903.06620, implemented in the function _first_order_taylor().
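Here is the first-order Taylor step in concrete form. Replacing the embedding e_i at one position with a candidate embedding e_w changes the loss by roughly (e_w - e_i) · g, where g is the gradient of the loss with respect to e_i. The pure-Python sketch below scores every row of an embedding matrix this way and returns the best candidate. It is our own illustration, not AllenNLP's _first_order_taylor(). The names first_order_taylor_scores and best_flip are hypothetical. The sign convention is also an assumption of the sketch: untargeted attacks maximize the estimated loss increase, and targeted attacks maximize the estimated decrease of the loss toward the target.

```python
from typing import List, Optional, Sequence, Set


def first_order_taylor_scores(
    grad: Sequence[float],
    current_embedding: Sequence[float],
    embedding_matrix: Sequence[Sequence[float]],
    targeted: bool = False,
) -> List[float]:
    """Estimated loss change for swapping each vocabulary row into this position.

    Assumed convention: untargeted -> score = (e_w - e_i) . g (maximize loss increase);
    targeted -> score = (e_i - e_w) . g (maximize decrease of loss toward the target).
    """
    sign = 1.0 if targeted else -1.0
    prev_dot_grad = sum(c * g for c, g in zip(current_embedding, grad))
    scores = []
    for row in embedding_matrix:
        new_dot_grad = sum(r * g for r, g in zip(row, grad))
        scores.append(sign * (prev_dot_grad - new_dot_grad))
    return scores


def best_flip(
    grad: Sequence[float],
    current_embedding: Sequence[float],
    embedding_matrix: Sequence[Sequence[float]],
    invalid_ids: Optional[Set[int]] = None,
    targeted: bool = False,
) -> int:
    """Index of the highest-scoring replacement, skipping ids that must never be used."""
    scores = first_order_taylor_scores(grad, current_embedding, embedding_matrix, targeted)
    for idx in invalid_ids or set():
        scores[idx] = float("-inf")  # e.g. padding or non-alphanumeric tokens
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the approximation is linear, scoring the whole vocabulary costs one dot product per row, which is what makes this cheaper than re-running the model for every candidate.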

We try to reuse the embedding matrix from the model when deciding which other words to flip a token to. For a large class of models, this is straightforward. When there is a character-level encoder, however (e.g., ELMo or any char-CNN), or a combination of encoders (e.g., ELMo + GloVe), we need to construct a fake embedding matrix that we can use in _first_order_taylor(). We do this by getting a list of words from the model's vocabulary and embedding them using the encoder. This can be expensive in both time and memory, so we take a max_tokens parameter to limit the size of this fake embedding matrix. It also requires the model to have a token vocabulary in the first place, which can be problematic for models that only have character vocabularies.
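The sketch below shows how such a fake embedding matrix might be built. Every encoder in it is a hypothetical toy: toy_char_encoder stands in for a char-CNN, and make_lookup_encoder stands in for a GloVe-style table. It also assumes that a combination of encoders is represented by concatenating their outputs, and that max_tokens simply keeps the first entries of the vocabulary list. Both are choices of this sketch, not documented library behavior.

```python
from typing import Callable, Dict, List, Tuple

Encoder = Callable[[str], List[float]]


def toy_char_encoder(token: str, dim: int = 3) -> List[float]:
    """Hypothetical stand-in for a character-level encoder such as a char-CNN."""
    vec = [0.0] * dim
    for position, char in enumerate(token):
        # Toy feature: bucket each character's code point by its position.
        vec[position % dim] += ord(char) / 100.0
    return vec


def make_lookup_encoder(table: Dict[str, List[float]], dim: int) -> Encoder:
    """Hypothetical word-level lookup (GloVe-like); unknown words map to zeros."""
    return lambda token: list(table.get(token, [0.0] * dim))


def concat_encoders(*encoders: Encoder) -> Encoder:
    """Combine several encoders by concatenating their outputs (an assumption)."""
    def combined(token: str) -> List[float]:
        vec: List[float] = []
        for encoder in encoders:
            vec.extend(encoder(token))
        return vec
    return combined


def construct_fake_embedding_matrix(
    vocab_tokens: List[str], encoder: Encoder, max_tokens: int = 5000
) -> Tuple[List[str], List[List[float]]]:
    # Cap the candidate list first: time and memory grow with the number of rows.
    candidates = vocab_tokens[:max_tokens]
    # Row k is the encoding of candidates[k], so a chosen row maps back to a string.
    matrix = [encoder(token) for token in candidates]
    return candidates, matrix
```

For example, construct_fake_embedding_matrix(["the", "cat", "sat", "dog"], concat_encoders(toy_char_encoder, make_lookup_encoder({"cat": [1.0, 1.0]}, dim=2)), max_tokens=3) encodes only the first three words, each as a 5-dimensional vector.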

Registered as an Attacker with name "hotflip".

Parameters

  • predictor : Predictor
    The model (inside a Predictor) that we're attacking. We use this to get gradients and predictions.
  • vocab_namespace : str, optional (default = 'tokens')
    We use this to know three things: (1) which tokens to ignore when producing flips (we don't consider non-alphanumeric tokens); (2) the string value of the token we produced, so we can show something human-readable to the user; and (3) if we need to construct a fake embedding matrix, which tokens in the vocabulary to use as flip candidates.
  • max_tokens : int, optional (default = 5000)
    This is only used when we need to construct a fake embedding matrix, which can take a lot of memory when the vocabulary is large. This parameter caps the number of tokens used, which limits the memory the fake embedding matrix takes.
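The first two uses of vocab_namespace are sketched below. A plain id-to-token dictionary stands in for the vocabulary namespace; this is our assumption, and AllenNLP's Vocabulary object is not modeled. invalid_replacement_ids and apply_flip are hypothetical helpers. The code uses Python's str.isalnum() as its own proxy for "alphanumeric".

```python
from typing import Dict, List, Set


def invalid_replacement_ids(index_to_token: Dict[int, str]) -> Set[int]:
    """Ids that should never be proposed as flips: anything not purely alphanumeric."""
    # str.isalnum() is this sketch's test; it also rejects "@@PADDING@@" and "!".
    return {idx for idx, token in index_to_token.items() if not token.isalnum()}


def apply_flip(
    tokens: List[str], position: int, new_id: int, index_to_token: Dict[int, str]
) -> List[str]:
    """Return a copy of tokens with the flip at position, as human-readable strings."""
    flipped = list(tokens)
    flipped[position] = index_to_token[new_id]
    return flipped
```

With index_to_token = {0: "@@PADDING@@", 1: "@@UNKNOWN@@", 2: "good", 3: "bad", 4: "!", 5: "x2"}, ids 0, 1 and 4 are excluded, and flipping position 1 of ["a", "good", "movie"] to id 3 yields ["a", "bad", "movie"].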

initialize

class Hotflip(Attacker):
 | ...
 | def initialize(self)

Call this function before running attack_from_json(). We put the call to _construct_embedding_matrix() here, rather than in __init__(), so that the potentially expensive construction is not done when the attacker is created.
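The deferred-setup pattern described here can be illustrated with a small class. LazyMatrixAttacker, build_calls, and the length-based "embedding" are hypothetical; only the idea of keeping the constructor cheap and doing the heavy work in initialize() comes from the text. We also chose to make initialize() idempotent, so an accidental second call does not rebuild the matrix.

```python
from typing import List, Optional


class LazyMatrixAttacker:
    """Hypothetical attacker that defers its expensive setup to initialize()."""

    def __init__(self, vocab_tokens: List[str], max_tokens: int = 5000) -> None:
        # The constructor only stores configuration; nothing is embedded yet.
        self.vocab_tokens = vocab_tokens
        self.max_tokens = max_tokens
        self.embedding_matrix: Optional[List[List[float]]] = None
        self.build_calls = 0  # counts how often the expensive step actually ran

    def _construct_embedding_matrix(self) -> List[List[float]]:
        # Stand-in for the costly step: "encode" up to max_tokens vocabulary entries.
        self.build_calls += 1
        return [[float(len(token))] for token in self.vocab_tokens[: self.max_tokens]]

    def initialize(self) -> None:
        # Idempotent: a second call reuses the matrix instead of rebuilding it.
        if self.embedding_matrix is None:
            self.embedding_matrix = self._construct_embedding_matrix()
```

Creating LazyMatrixAttacker(["a", "bb", "ccc"], max_tokens=2) does no encoding; the first initialize() builds a two-row matrix, and later calls leave it untouched.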

attack_from_json

class Hotflip(Attacker):
 | ...
 | def attack_from_json(
 |     self,
 |     inputs: JsonDict,
 |     input_field_to_attack: str = "tokens",
 |     grad_input_field: str = "grad_input_1",
 |     ignore_tokens: List[str] = None,
 |     target: JsonDict = None
 | ) -> JsonDict

Replaces one token at a time from the input until the model's prediction changes. input_field_to_attack names the input field whose tokens are flipped (for example, tokens). grad_input_field is a key into the gradients dictionary (for example, grad_input_1).

The method computes the gradient w.r.t. the tokens, finds the token with the maximum gradient (by L2 norm), and replaces it with another token based on the first-order Taylor approximation of the loss. This process is iteratively repeated until the prediction changes. Once a token is replaced, it is not flipped again.
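The loop described above can be sketched in plain Python. Everything here is hypothetical. hotflip_attack and taylor_best_flip are our own names. The toy linear model's hand-derived gradients stand in for Predictor.get_gradients. The sketch handles only the untargeted case and assumes gradients are taken of the loss for the model's original prediction.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def taylor_best_flip(grad: Vector, current: Vector, matrix: List[Vector]) -> int:
    """Untargeted choice: the row w maximizing (e_w - e_i) . grad."""
    prev = dot(current, grad)
    return max(range(len(matrix)), key=lambda w: dot(matrix[w], grad) - prev)


def hotflip_attack(
    ids: List[int],
    predict: Callable[[List[int]], int],
    gradients: Callable[[List[int], int], List[Vector]],
    matrix: List[Vector],
    ignore_positions: Sequence[int] = (),
) -> List[int]:
    original = predict(ids)
    ids = list(ids)
    used = set(ignore_positions)  # ignored or already-flipped positions
    while predict(ids) == original:
        # Gradients of the loss for the ORIGINAL prediction, one vector per token.
        grads = gradients(ids, original)
        candidates = [i for i in range(len(ids)) if i not in used]
        if not candidates:
            break  # nothing left to flip: the attack failed
        # Attack the position whose gradient has the largest L2 norm.
        pos = max(candidates, key=lambda i: math.sqrt(dot(grads[i], grads[i])))
        ids[pos] = taylor_best_flip(grads[pos], matrix[ids[pos]], matrix)
        used.add(pos)  # a replaced token is never flipped again
    return ids


# Toy linear model: score = sum_i A[i] * (W . E[id_i]); label 1 if score > 0.
E = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
W = [1.0, 0.0]
A = [1.0, 2.0, 1.5]


def toy_predict(ids: List[int]) -> int:
    return 1 if sum(a * dot(W, E[t]) for a, t in zip(A, ids)) > 0 else 0


def toy_gradients(ids: List[int], label: int) -> List[Vector]:
    # With loss = -y * score (y = +1 or -1), d loss / d e_i = -y * A[i] * W.
    y = 1.0 if label == 1 else -1.0
    return [[-y * a * w for w in W] for a in A]


print(hotflip_attack([1, 1, 1], toy_predict, toy_gradients, E))  # prints [1, 2, 2]
```

In this toy run, position 1 (largest gradient norm) is flipped first, but the prediction only changes after position 2 is flipped as well; position 1 is never revisited.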

Parameters

  • inputs : JsonDict
    The model inputs, the same as what is passed to a Predictor.
  • input_field_to_attack : str, optional (default = 'tokens')
    The field that has the tokens that we're going to be flipping. This must be a TextField.
  • grad_input_field : str, optional (default = 'grad_input_1')
    If there is more than one field that gets embedded in your model (e.g., a question and a passage, or a premise and a hypothesis), this tells us the key to use to get the correct gradients. This selects from the output of Predictor.get_gradients.
  • ignore_tokens : List[str], optional (default = DEFAULT_IGNORE_TOKENS)
    These tokens will not be flipped. The default list includes some simple punctuation, OOV and padding tokens, and common control tokens for BERT, etc.
  • target : JsonDict, optional (default = None)
    If given, this will be a targeted hotflip attack: instead of just trying to change the model's prediction from what it is currently predicting, we try to change it to a specific target value. This is a JsonDict because it needs to specify both the field name and the target value. For example, for a masked LM, this would be something like {"words": ["she"]}, because "words" is the field name, there is one mask token (hence the list of length one), and we want to change the prediction from whatever it was to "she". By default, the output_dict from the forward pass is given to Predictor.predictions_to_labeled_instances, where the target has to be extracted manually according to the logits.
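Two of these parameters can be illustrated in isolation. ignore_tokens=None falls back to DEFAULT_IGNORE_TOKENS, and grad_input_field selects one entry from a gradients dictionary. In the sketch below, flippable_positions and select_grads are hypothetical helpers. The gradients dictionary is a toy (one small vector per token) rather than real output of Predictor.get_gradients, and which key corresponds to which input field depends on the model.

```python
from typing import Dict, List, Optional, Sequence

# Copied from the DEFAULT_IGNORE_TOKENS list documented above.
DEFAULT_IGNORE_TOKENS = ["@@NULL@@", ".", ",", ";", "!", "?", "[MASK]", "[SEP]", "[CLS]"]

Grads = Dict[str, List[List[float]]]


def flippable_positions(
    tokens: Sequence[str], ignore_tokens: Optional[List[str]] = None
) -> List[int]:
    """Positions that may be flipped; None falls back to DEFAULT_IGNORE_TOKENS."""
    ignore = set(DEFAULT_IGNORE_TOKENS if ignore_tokens is None else ignore_tokens)
    return [i for i, token in enumerate(tokens) if token not in ignore]


def select_grads(grads: Grads, grad_input_field: str = "grad_input_1") -> List[List[float]]:
    """Pick the per-token gradients for the field being attacked."""
    if grad_input_field not in grads:
        raise KeyError(f"{grad_input_field!r} not found; available keys: {sorted(grads)}")
    return grads[grad_input_field]
```

For ["[CLS]", "a", "great", "movie", "!", "[SEP]"], only positions 1, 2 and 3 are flippable under the defaults, while passing ignore_tokens=[] makes every position eligible. Note that an explicit empty list is not treated like None.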

attack_instance

class Hotflip(Attacker):
 | ...
 | def attack_instance(
 |     self,
 |     instance: Instance,
 |     inputs: JsonDict,
 |     input_field_to_attack: str = "tokens",
 |     grad_input_field: str = "grad_input_1",
 |     ignore_tokens: List[str] = None,
 |     target: JsonDict = None
 | ) -> Tuple[List[Token], JsonDict]