allennlp.modules.token_embedders.embedding#

Embedding#

Embedding(
    self,
    embedding_dim: int,
    num_embeddings: int = None,
    projection_dim: int = None,
    weight: torch.FloatTensor = None,
    padding_index: int = None,
    trainable: bool = True,
    max_norm: float = None,
    norm_type: float = 2.0,
    scale_grad_by_freq: bool = False,
    sparse: bool = False,
    vocab_namespace: str = 'tokens',
    pretrained_file: str = None,
    vocab: allennlp.data.vocabulary.Vocabulary = None,
) -> None

A more featureful embedding module than the default in Pytorch. Adds the ability to:

1. embed higher-order inputs
2. pre-specify the weight matrix
3. use a non-trainable embedding
4. project the resultant embeddings to some other dimension (which only makes sense with
   non-trainable embeddings).

Note that if you are using our data API and are trying to embed a TextField, you should use a TextFieldEmbedder instead of using this directly.

Registered as a TokenEmbedder with name "embedding".

Parameters

  • num_embeddings : int Size of the dictionary of embeddings (vocabulary size).
  • embedding_dim : int The size of each embedding vector.
  • projection_dim : int, (optional, default=None) If given, we add a projection layer after the embedding layer. This really only makes sense if trainable is False.
  • weight : torch.FloatTensor, (optional, default=None) A pre-initialised weight matrix for the embedding lookup, allowing the use of pretrained vectors.
  • padding_index : int, (optional, default=None) If given, pads the output with zeros whenever it encounters the index.
  • trainable : bool, (optional, default=True) Whether or not to optimize the embedding parameters.
  • max_norm : float, (optional, default=None) If given, embeddings will be renormalized to always have a norm less than this value.
  • norm_type : float, (optional, default=2) The p of the p-norm to compute for the max_norm option.
  • scale_grad_by_freq : bool, (optional, default=False) If given, this will scale gradients by the frequency of the words in the mini-batch.
  • sparse : bool, (optional, default=False) Whether or not the Pytorch backend should use a sparse representation of the embedding weight.
  • vocab_namespace : str, (optional, default='tokens') In case of fine-tuning/transfer learning, the model's embedding matrix needs to be extended according to the size of the extended vocabulary. To know how much to extend the embedding matrix, it is necessary to know which vocab_namespace was used to construct it during the original training. We store the vocab_namespace used during the original training as an attribute, so that it can be retrieved during fine-tuning.
  • pretrained_file : str, (optional, default=None) Path to a file of word vectors to initialize the embedding matrix. It can be the path to a local file or a URL of a (cached) remote file. Two formats are supported: an hdf5 file containing an embedding matrix in the form of a torch.Tensor, and a utf-8 encoded text file with space-separated fields.
  • vocab : Vocabulary, (optional, default=None) Used to construct an embedding from a pretrained file.

Returns

An Embedding module.
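For example, one might wrap a pre-computed weight matrix in a frozen embedding and project it down to a smaller dimension. This is a minimal sketch; the random tensor stands in for real pretrained vectors.

```python
import torch
from allennlp.modules.token_embedders.embedding import Embedding

# Stand-in for a real pretrained matrix (e.g. loaded GloVe vectors).
pretrained_weight = torch.rand(10000, 300)

embedding = Embedding(
    embedding_dim=300,
    num_embeddings=10000,
    weight=pretrained_weight,
    trainable=False,      # keep the pretrained vectors fixed
    projection_dim=50,    # learn a projection on top of the frozen vectors
    padding_index=0,      # index 0 always embeds to a zero vector
)
```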

extend_vocab#

Embedding.extend_vocab(
    self,
    extended_vocab: allennlp.data.vocabulary.Vocabulary,
    vocab_namespace: str = None,
    extension_pretrained_file: str = None,
    model_path: str = None,
)

Extends the embedding matrix according to the extended vocabulary. If extension_pretrained_file is available, it will be used to initialize the embeddings of the new words in the extended vocabulary; otherwise we check whether the _pretrained_file attribute is already available. If neither is available, the new embeddings are initialized with Xavier uniform.

Parameters

  • extended_vocab : Vocabulary Vocabulary extended from original vocabulary used to construct this Embedding.
  • vocab_namespace : str, (optional, default=None) In case you know what vocab_namespace should be used for extension, you can pass it. If not passed, it will check if vocab_namespace used at the time of Embedding construction is available. If so, this namespace will be used or else extend_vocab will be a no-op.
  • extension_pretrained_file : str, (optional, default=None) A file containing pretrained embeddings can be specified here. It can be the path to a local file or a URL of a (cached) remote file. Check format details in from_params of the Embedding class.
  • model_path : str, (optional, default=None) Path traversing the model attributes up to this embedding module. E.g. "_text_field_embedder.token_embedder_tokens". This is only useful to give a helpful error message when extend_vocab is implicitly called by train or any other command.
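A hedged sketch of how vocabulary extension might look during fine-tuning (the toy vocabularies and sizes are purely illustrative):

```python
from allennlp.data import Vocabulary
from allennlp.modules.token_embedders.embedding import Embedding

# Vocabulary used for the original training run.
original_vocab = Vocabulary()
original_vocab.add_tokens_to_namespace(["cat", "dog"], namespace="tokens")

embedding = Embedding(
    embedding_dim=10,
    num_embeddings=original_vocab.get_vocab_size("tokens"),
    vocab_namespace="tokens",
)

# Fine-tuning data introduces a new token, so the vocabulary grows.
extended_vocab = Vocabulary()
extended_vocab.add_tokens_to_namespace(["cat", "dog", "axolotl"], namespace="tokens")

# Rows for the new tokens are initialized from a pretrained file if one is
# available; otherwise with Xavier uniform.
embedding.extend_vocab(extended_vocab, vocab_namespace="tokens")
```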

forward#

Embedding.forward(self, tokens: torch.Tensor) -> torch.Tensor

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

get_output_dim#

Embedding.get_output_dim(self) -> int

Returns the final output dimension that this TokenEmbedder uses to represent each token. This is not the shape of the returned tensor, but the last element of that shape.
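A small end-to-end sketch (shapes are arbitrary). Note that the module instance is called directly rather than forward, as the note above recommends:

```python
import torch
from allennlp.modules.token_embedders.embedding import Embedding

embedding = Embedding(embedding_dim=8, num_embeddings=100)

# A batch of token ids of shape (batch_size, sequence_length).
token_ids = torch.randint(0, 100, (2, 5))

# Calling the module (rather than .forward) runs any registered hooks.
embedded = embedding(token_ids)

assert embedded.shape == (2, 5, 8)
assert embedding.get_output_dim() == 8  # the last dimension of the output
```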

EmbeddingsFileURI#

EmbeddingsFileURI(main_file_uri, path_inside_archive)

A named tuple identifying an embeddings file: the URI of the main file (or archive), and, for multi-file archives, the path of the embeddings file inside it.

main_file_uri#

Alias for field number 0

path_inside_archive#

Alias for field number 1

EmbeddingsTextFile#

EmbeddingsTextFile(
    self,
    file_uri: str,
    encoding: str = 'utf-8',
    cache_dir: str = None,
) -> None

Utility class for opening embeddings text files. Handles various compression formats, as well as context management.

Parameters

  • file_uri : str It can be:

    • a file system path or a URL of a (possibly compressed) text file, or of a zip/tar archive containing a single file.
    • a URI of the form (archive_path_or_url)#file_path_inside_archive if the text file is contained in a multi-file archive.
  • encoding : str, (optional, default='utf-8')
  • cache_dir : str, (optional, default=None)
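A hedged sketch with placeholder paths, reading space-separated word vectors through the context manager:

```python
from allennlp.modules.token_embedders.embedding import EmbeddingsTextFile

# Placeholder URI; for a file inside a multi-file archive, use the "#" form,
# e.g. "/path/to/embeddings.zip#glove.6B.100d.txt".
uri = "/path/to/glove.6B.100d.txt.gz"

vectors = {}
with EmbeddingsTextFile(uri, encoding="utf-8") as embeddings_file:
    for line in embeddings_file:
        token, *fields = line.rstrip().split(" ")
        vectors[token] = [float(x) for x in fields]
```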

DEFAULT_ENCODING#

The default character encoding ("utf-8") used when reading an embeddings text file.