
embedding

allennlp.modules.token_embedders.embedding



Embedding

@TokenEmbedder.register("embedding")
class Embedding(TokenEmbedder):
 | def __init__(
 |     self,
 |     embedding_dim: int,
 |     num_embeddings: int = None,
 |     projection_dim: int = None,
 |     weight: torch.FloatTensor = None,
 |     padding_index: int = None,
 |     trainable: bool = True,
 |     max_norm: float = None,
 |     norm_type: float = 2.0,
 |     scale_grad_by_freq: bool = False,
 |     sparse: bool = False,
 |     vocab_namespace: str = "tokens",
 |     pretrained_file: str = None,
 |     vocab: Vocabulary = None
 | ) -> None

A more featureful embedding module than the default in Pytorch. Adds the ability to:

1. embed higher-order inputs
2. pre-specify the weight matrix
3. use a non-trainable embedding
4. project the resultant embeddings to some other dimension (which only makes sense with
   non-trainable embeddings).

Note that if you are using our data API and are trying to embed a TextField, you should use a TextFieldEmbedder instead of using this directly.

Registered as a TokenEmbedder with name "embedding".

Parameters

  • num_embeddings : int
    Size of the dictionary of embeddings (vocabulary size).
  • embedding_dim : int
    The size of each embedding vector.
  • projection_dim : int, optional (default = None)
    If given, we add a projection layer after the embedding layer. This really only makes sense if trainable is False.
  • weight : torch.FloatTensor, optional (default = None)
    A pre-initialised weight matrix for the embedding lookup, allowing the use of pretrained vectors.
  • padding_index : int, optional (default = None)
    If given, pads the output with zeros whenever it encounters the index.
  • trainable : bool, optional (default = True)
    Whether or not to optimize the embedding parameters.
  • max_norm : float, optional (default = None)
    If given, will renormalize the embeddings to always have a norm less than this value.
  • norm_type : float, optional (default = 2)
    The p of the p-norm to compute for the max_norm option.
  • scale_grad_by_freq : bool, optional (default = False)
    If given, this will scale gradients by the frequency of the words in the mini-batch.
  • sparse : bool, optional (default = False)
    Whether or not the Pytorch backend should use a sparse representation of the embedding weight.
  • vocab_namespace : str, optional (default = "tokens")
    In case of fine-tuning/transfer learning, the model's embedding matrix needs to be extended according to the size of the extended vocabulary. To know how much to extend the embedding matrix, it is necessary to know which vocab_namespace was used to construct it during the original training. We store the vocab_namespace used during the original training as an attribute, so that it can be retrieved during fine-tuning.
  • pretrained_file : str, optional (default = None)
    Path to a file of word vectors to initialize the embedding matrix. It can be the path to a local file or a URL of a (cached) remote file. Two formats are supported:

    • hdf5 file - containing an embedding matrix in the form of a torch.Tensor;
    • text file - a UTF-8 encoded text file with space-separated fields.
  • vocab : Vocabulary, optional (default = None)
    Used to construct an embedding from a pretrained file.

    In a typical AllenNLP configuration file, this parameter does not get an entry under "embedding"; it is specified as a top-level parameter and then passed in to this module separately.

Returns

  • An Embedding module.
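
For example, the following minimal sketch (the weight values are made up for illustration) pre-specifies a frozen weight matrix and projects the resulting embeddings to another dimension:

import torch
from allennlp.modules.token_embedders.embedding import Embedding

# A made-up 5-word vocabulary with 3-dimensional "pretrained" vectors.
pretrained_weight = torch.rand(5, 3)

embedding = Embedding(
    embedding_dim=3,
    num_embeddings=5,
    weight=pretrained_weight,
    trainable=False,   # keep the pretrained vectors frozen
    projection_dim=8,  # project the frozen 3-d vectors to 8 dimensions
)
assert embedding.get_output_dim() == 8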

get_output_dim

class Embedding(TokenEmbedder):
 | ...
 | def get_output_dim(self) -> int

forward

class Embedding(TokenEmbedder):
 | ...
 | def forward(self, tokens: torch.Tensor) -> torch.Tensor
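
As a rough sketch of the "higher-order inputs" behaviour described above, forward accepts token-id tensors of any shape and appends an embedding dimension (the shapes below are arbitrary):

import torch
from allennlp.modules.token_embedders.embedding import Embedding

embedding = Embedding(embedding_dim=4, num_embeddings=10)

# A (batch_size, num_sentences, sequence_length) tensor of token ids.
token_ids = torch.randint(0, 10, (2, 3, 7))

# The output gains a trailing embedding_dim axis: (2, 3, 7, 4).
assert embedding(token_ids).shape == (2, 3, 7, 4)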

extend_vocab

class Embedding(TokenEmbedder):
 | ...
 | def extend_vocab(
 |     self,
 |     extended_vocab: Vocabulary,
 |     vocab_namespace: str = None,
 |     extension_pretrained_file: str = None,
 |     model_path: str = None
 | )

Extends the embedding matrix according to the extended vocabulary. If extension_pretrained_file is available, it will be used to initialize the embeddings of new words in the extended vocabulary; otherwise we check whether a _pretrained_file attribute is already available. If neither is available, the new embeddings are initialized with Xavier uniform.

Parameters

  • extended_vocab : Vocabulary
    Vocabulary extended from original vocabulary used to construct this Embedding.
  • vocab_namespace : str, optional (default = None)
    If you know which vocab_namespace should be used for the extension, you can pass it. If not passed, we check whether the vocab_namespace used at the time of Embedding construction is available; if so, that namespace is used, otherwise extend_vocab is a no-op.
  • extension_pretrained_file : str, optional (default = None)
    A file containing pretrained embeddings can be specified here. It can be the path to a local file or a URL of a (cached) remote file. See from_params of the Embedding class for format details.
  • model_path : str, optional (default = None)
    Path traversing the model attributes up to this embedding module, e.g. "_text_field_embedder.token_embedder_tokens". This is only used to give a helpful error message when extend_vocab is implicitly called by train or any other command.
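
A minimal sketch of the typical flow, assuming the Vocabulary.add_tokens_to_namespace API and the default "tokens" namespace (no pretrained file is given, so the new rows fall back to Xavier uniform initialization):

from allennlp.data import Vocabulary
from allennlp.modules.token_embedders.embedding import Embedding

# Original vocabulary and embedding, both using the "tokens" namespace.
vocab = Vocabulary()
vocab.add_tokens_to_namespace(["cat", "dog"], namespace="tokens")
embedding = Embedding(embedding_dim=4, num_embeddings=vocab.get_vocab_size("tokens"))

# Later (e.g. fine-tuning on new data) the vocabulary grows.
vocab.add_tokens_to_namespace(["axolotl"], namespace="tokens")

# Grow the weight matrix to match; the new rows are Xavier-uniform
# initialized because no pretrained file is available here.
embedding.extend_vocab(vocab, vocab_namespace="tokens")
assert embedding.weight.size(0) == vocab.get_vocab_size("tokens")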

format_embeddings_file_uri

def format_embeddings_file_uri(
    main_file_path_or_url: str,
    path_inside_archive: Optional[str] = None
) -> str

EmbeddingsFileURI

class EmbeddingsFileURI(NamedTuple)

main_file_uri

class EmbeddingsFileURI(NamedTuple):
 | ...
 | main_file_uri: str = None

path_inside_archive

class EmbeddingsFileURI(NamedTuple):
 | ...
 | path_inside_archive: Optional[str] = None

parse_embeddings_file_uri

def parse_embeddings_file_uri(uri: str) -> "EmbeddingsFileURI"
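
A short sketch of how these two helpers round-trip the (archive_path_or_url)#file_path_inside_archive form (the paths are placeholders):

from allennlp.modules.token_embedders.embedding import (
    format_embeddings_file_uri,
    parse_embeddings_file_uri,
)

# Point at a single text file inside a multi-file archive.
uri = format_embeddings_file_uri("/path/to/embeddings.zip", "glove/vectors.txt")
# uri == "(/path/to/embeddings.zip)#glove/vectors.txt"

parsed = parse_embeddings_file_uri(uri)
assert parsed.main_file_uri == "/path/to/embeddings.zip"
assert parsed.path_inside_archive == "glove/vectors.txt"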

EmbeddingsTextFile

class EmbeddingsTextFile(Iterator[str]):
 | def __init__(
 |     self,
 |     file_uri: str,
 |     encoding: str = DEFAULT_ENCODING,
 |     cache_dir: str = None
 | ) -> None

Utility class for opening embeddings text files. Handles various compression formats, as well as context management.

Parameters

  • file_uri : str
    It can be:

    • a file system path or a URL of a possibly compressed text file or a zip/tar archive containing a single file.
    • URI of the type (archive_path_or_url)#file_path_inside_archive if the text file is contained in a multi-file archive.
  • encoding : str

  • cache_dir : str

DEFAULT_ENCODING

class EmbeddingsTextFile(Iterator[str]):
 | ...
 | DEFAULT_ENCODING = "utf-8"

read

class EmbeddingsTextFile(Iterator[str]):
 | ...
 | def read(self) -> str

readline

class EmbeddingsTextFile(Iterator[str]):
 | ...
 | def readline(self) -> str

close

class EmbeddingsTextFile(Iterator[str]):
 | ...
 | def close(self) -> None

__iter__

class EmbeddingsTextFile(Iterator[str]):
 | ...
 | def __iter__(self) -> "EmbeddingsTextFile"
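
As a usage sketch (the file path is a placeholder), the class can be used both as a context manager and as an iterator over the lines of the embeddings file:

from allennlp.modules.token_embedders.embedding import EmbeddingsTextFile

# Works with plain, compressed, or archived text files; the path is illustrative.
with EmbeddingsTextFile("/path/to/glove.6B.100d.txt.gz") as embeddings_file:
    for line in embeddings_file:
        token, *values = line.rstrip().split(" ")
        vector = [float(v) for v in values]
        break  # just peek at the first entry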