embedding
allennlp.modules.token_embedders.embedding
Embedding#
@TokenEmbedder.register("embedding")
class Embedding(TokenEmbedder):
| def __init__(
| self,
| embedding_dim: int,
| num_embeddings: int = None,
| projection_dim: int = None,
| weight: torch.FloatTensor = None,
| padding_index: int = None,
| trainable: bool = True,
| max_norm: float = None,
| norm_type: float = 2.0,
| scale_grad_by_freq: bool = False,
| sparse: bool = False,
| vocab_namespace: str = "tokens",
| pretrained_file: str = None,
| vocab: Vocabulary = None
| ) -> None
A more featureful embedding module than the default in Pytorch. Adds the ability to:
1. embed higher-order inputs
2. pre-specify the weight matrix
3. use a non-trainable embedding
4. project the resultant embeddings to some other dimension (which only makes sense with non-trainable embeddings).

Note that if you are using our data API and are trying to embed a TextField, you should use a TextFieldEmbedder instead of using this directly.

Registered as a TokenEmbedder with name "embedding".
Parameters

- num_embeddings : int
  Size of the dictionary of embeddings (vocabulary size).
- embedding_dim : int
  The size of each embedding vector.
- projection_dim : int, optional (default = None)
  If given, we add a projection layer after the embedding layer. This really only makes sense if trainable is False.
- weight : torch.FloatTensor, optional (default = None)
  A pre-initialised weight matrix for the embedding lookup, allowing the use of pretrained vectors.
- padding_index : int, optional (default = None)
  If given, pads the output with zeros whenever it encounters the index.
- trainable : bool, optional (default = True)
  Whether or not to optimize the embedding parameters.
- max_norm : float, optional (default = None)
  If given, will renormalize the embeddings to always have a norm less than this.
- norm_type : float, optional (default = 2)
  The p of the p-norm to compute for the max_norm option.
- scale_grad_by_freq : bool, optional (default = False)
  If given, this will scale gradients by the frequency of the words in the mini-batch.
- sparse : bool, optional (default = False)
  Whether or not the Pytorch backend should use a sparse representation of the embedding weight.
- vocab_namespace : str, optional (default = "tokens")
  In case of fine-tuning/transfer learning, the model's embedding matrix needs to be extended according to the size of the extended vocabulary. To know how much to extend the embedding matrix, it is necessary to know which vocab_namespace was used to construct it in the original training. We store the vocab_namespace used during the original training as an attribute, so that it can be retrieved during fine-tuning.
- pretrained_file : str, optional (default = None)
  Path to a file of word vectors to initialize the embedding matrix. It can be the path to a local file or a URL of a (cached) remote file. Two formats are supported:
  - hdf5 file - containing an embedding matrix in the form of a torch.Tensor;
  - text file - a utf-8 encoded text file with space separated fields.
- vocab : Vocabulary, optional (default = None)
  Used to construct an embedding from a pretrained file. In a typical AllenNLP configuration file, this parameter does not get an entry under the "embedding"; it gets specified as a top-level parameter, then is passed in to this module separately.

Returns

- An Embedding module.
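Below is a brief, illustrative sketch (not taken from the library documentation) of constructing this module directly with a pre-specified, non-trainable weight matrix and a projection layer; the sizes and values are arbitrary placeholders.

```python
import torch
from allennlp.modules.token_embedders import Embedding

# Arbitrary pre-initialised weight matrix of shape (num_embeddings, embedding_dim).
weight = torch.FloatTensor(100, 300).uniform_(-0.05, 0.05)

embedder = Embedding(
    embedding_dim=300,
    num_embeddings=100,
    weight=weight,
    trainable=False,     # keep the pre-specified vectors fixed
    projection_dim=128,  # only really makes sense when trainable is False
)

# With a projection layer, the output dimension is projection_dim.
assert embedder.get_output_dim() == 128
```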
get_output_dim#
class Embedding(TokenEmbedder):
| ...
| @overrides
| def get_output_dim(self) -> int
forward#
class Embedding(TokenEmbedder):
| ...
| @overrides
| def forward(self, tokens: torch.Tensor) -> torch.Tensor
tokens may have extra dimensions (batch_size, d1, ..., dn, sequence_length), but the underlying embedding expects (batch_size, sequence_length), so tokens are passed through util.combine_initial_dims (which is a no-op if there are no extra dimensions), and the original leading dimensions are restored on the output.
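As a quick, informal illustration of the shape handling described above (not an excerpt from the docs), a higher-order token-id tensor comes back with its original leading dimensions plus the embedding dimension:

```python
import torch
from allennlp.modules.token_embedders import Embedding

embedder = Embedding(embedding_dim=16, num_embeddings=50)

# Token ids with extra dimensions, e.g. (batch_size=2, num_words=3, num_wordpieces=4).
token_ids = torch.randint(0, 50, (2, 3, 4))

embedded = embedder(token_ids)
assert embedded.shape == (2, 3, 4, 16)  # original dims + embedding_dim
```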
extend_vocab#
class Embedding(TokenEmbedder):
| ...
| def extend_vocab(
| self,
| extended_vocab: Vocabulary,
| vocab_namespace: str = None,
| extension_pretrained_file: str = None,
| model_path: str = None
| )
Extends the embedding matrix according to the extended vocabulary. If extension_pretrained_file is available, it will be used for initializing the new words' embeddings in the extended vocabulary; otherwise we will check if the _pretrained_file attribute is already available. If neither is available, the new embeddings will be initialized with xavier uniform.
Parameters

- extended_vocab : Vocabulary
  Vocabulary extended from the original vocabulary used to construct this Embedding.
- vocab_namespace : str, optional (default = None)
  In case you know which vocab_namespace should be used for extension, you can pass it. If not passed, it will check if the vocab_namespace used at the time of Embedding construction is available. If so, this namespace will be used; otherwise extend_vocab will be a no-op.
- extension_pretrained_file : str, optional (default = None)
  A file containing pretrained embeddings can be specified here. It can be the path to a local file or a URL of a (cached) remote file. Check format details in from_params of the Embedding class.
- model_path : str, optional (default = None)
  Path traversing the model attributes up to this embedding module. E.g. "_text_field_embedder.token_embedder_tokens". This is only useful to give a helpful error message when extend_vocab is implicitly called by train or any other command.
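The following sketch illustrates the intended fine-tuning workflow under simple assumptions (a small vocabulary built by hand, no pretrained file); it is illustrative rather than an excerpt from the library:

```python
from allennlp.data import Vocabulary
from allennlp.modules.token_embedders import Embedding

vocab = Vocabulary()
vocab.add_tokens_to_namespace(["the", "cat"], namespace="tokens")

embedder = Embedding(
    embedding_dim=8,
    num_embeddings=vocab.get_vocab_size("tokens"),
    vocab_namespace="tokens",
)

# Later, fine-tuning data adds new tokens to the same namespace.
vocab.add_tokens_to_namespace(["dog"], namespace="tokens")

# Extend the embedding matrix to cover the enlarged vocabulary; with no
# pretrained file given here, the new rows are initialized with xavier uniform.
embedder.extend_vocab(vocab, vocab_namespace="tokens")
assert embedder.weight.size(0) == vocab.get_vocab_size("tokens")
```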
format_embeddings_file_uri#
def format_embeddings_file_uri(
main_file_path_or_url: str,
path_inside_archive: Optional[str] = None
) -> str
EmbeddingsFileURI#
class EmbeddingsFileURI(NamedTuple)
main_file_uri#
class EmbeddingsFileURI(NamedTuple):
| ...
| main_file_uri: str = None
path_inside_archive#
class EmbeddingsFileURI(NamedTuple):
| ...
| path_inside_archive: Optional[str] = None
parse_embeddings_file_uri#
def parse_embeddings_file_uri(uri: str) -> "EmbeddingsFileURI"
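A small round-trip example of the two URI helpers and the EmbeddingsFileURI tuple they produce (paths are placeholders):

```python
from allennlp.modules.token_embedders.embedding import (
    format_embeddings_file_uri,
    parse_embeddings_file_uri,
)

uri = format_embeddings_file_uri("/path/to/embeddings.zip", "folder/fasttext.vec")
# uri == "(/path/to/embeddings.zip)#folder/fasttext.vec"

parsed = parse_embeddings_file_uri(uri)
assert parsed.main_file_uri == "/path/to/embeddings.zip"
assert parsed.path_inside_archive == "folder/fasttext.vec"

# A plain path or URL passes through unchanged, with no inner archive path.
assert parse_embeddings_file_uri("glove.txt").path_inside_archive is None
```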
EmbeddingsTextFile#
class EmbeddingsTextFile(Iterator[str]):
| def __init__(
| self,
| file_uri: str,
| encoding: str = DEFAULT_ENCODING,
| cache_dir: str = None
| ) -> None
Utility class for opening embeddings text files. Handles various compression formats, as well as context management.
Parameters

- file_uri : str
  It can be:
  - a file system path or a URL of a possibly compressed text file, or a zip/tar archive containing a single file.
  - a URI of the type (archive_path_or_url)#file_path_inside_archive if the text file is contained in a multi-file archive.
- encoding : str
- cache_dir : str
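A possible usage pattern, reading a (compressed) embeddings file as a context manager; the path is a placeholder and the per-line parsing is left as a sketch:

```python
from allennlp.modules.token_embedders.embedding import EmbeddingsTextFile

with EmbeddingsTextFile("/path/to/glove.840B.300d.txt.gz") as embeddings_file:
    for line in embeddings_file:
        fields = line.rstrip().split(" ")
        token, values = fields[0], fields[1:]
        # ... convert `values` to floats and store the vector for `token` ...
        break  # only peek at the first entry in this sketch
```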
DEFAULT_ENCODING#
class EmbeddingsTextFile(Iterator[str]):
| ...
| DEFAULT_ENCODING = "utf-8"
read#
class EmbeddingsTextFile(Iterator[str]):
| ...
| def read(self) -> str
readline#
class EmbeddingsTextFile(Iterator[str]):
| ...
| def readline(self) -> str
close#
class EmbeddingsTextFile(Iterator[str]):
| ...
| def close(self) -> None