
bias_utils

allennlp.fairness.bias_utils

load_words

def load_words(
    fname: Union[str, PathLike],
    tokenizer: Tokenizer,
    vocab: Optional[Vocabulary] = None,
    namespace: str = "tokens",
    all_cases: bool = True
) -> List[torch.Tensor]

This function loads a list of words from a file, tokenizes each word into subword tokens, and converts the tokens into IDs.

Parameters

  • fname : Union[str, PathLike]
    Name of file containing list of words to load.
  • tokenizer : Tokenizer
    Tokenizer to tokenize words in file.
  • vocab : Vocabulary, optional (default = None)
    Vocabulary of tokenizer. If None, assumes tokenizer is of type PreTrainedTokenizer and uses tokenizer's vocab attribute.
  • namespace : str, optional (default = "tokens")
    Namespace of vocab to use when tokenizing.
  • all_cases : bool, optional (default = True)
    Whether to tokenize lower, title, and upper cases of each word.

Returns

  • word_ids : List[torch.Tensor]
    List of tensors containing the IDs of subword tokens for each word in the file.
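
A minimal usage sketch, assuming a plain-text word list with one word per line; the file name gender_words.txt is hypothetical.

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.fairness.bias_utils import load_words

# Subword tokenizer; with vocab=None, load_words is assumed to take token IDs
# from the underlying HuggingFace PreTrainedTokenizer's vocabulary.
tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")

# "gender_words.txt" is a hypothetical file with one word per line.
word_ids = load_words("gender_words.txt", tokenizer)

# One tensor of subword-token IDs per loaded word; with all_cases=True the
# lower, title, and upper case variants of each word are all tokenized.
for ids in word_ids[:3]:
    print(ids)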

load_word_pairs

def load_word_pairs(
    fname: Union[str, PathLike],
    tokenizer: Tokenizer,
    vocab: Optional[Vocabulary] = None,
    namespace: str = "token",
    all_cases: bool = True
) -> Tuple[List[torch.Tensor], List[torch.Tensor]]

This function loads a list of pairs of words from a file, tokenizes each word into subword tokens, and converts the tokens into IDs.

Parameters

  • fname : Union[str, PathLike]
    Name of file containing list of pairs of words to load.
  • tokenizer : Tokenizer
    Tokenizer to tokenize words in file.
  • vocab : Vocabulary, optional (default = None)
    Vocabulary of tokenizer. If None, assumes tokenizer is of type PreTrainedTokenizer and uses tokenizer's vocab attribute.
  • namespace : str, optional (default = "token")
    Namespace of vocab to use when tokenizing.
  • all_cases : bool, optional (default = True)
    Whether to tokenize lower, title, and upper cases of each word.

Returns

  • word_ids : Tuple[List[torch.Tensor], List[torch.Tensor]]
    Pair of lists of tensors containing the IDs of subword tokens for the first and second words of each pair in the file, respectively.
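
A similar sketch for word pairs, assuming the file lists one pair per line (e.g. "he she"); the file name gender_pairs.txt is hypothetical.

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.fairness.bias_utils import load_word_pairs

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")

# "gender_pairs.txt" is a hypothetical file, assumed to hold one word pair per line.
ids_a, ids_b = load_word_pairs("gender_pairs.txt", tokenizer)

# ids_a and ids_b are parallel lists: entry i of each holds the subword-token
# IDs for one side of the i-th loaded pair.
print(len(ids_a), len(ids_b))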