
Training Transformer ELMo

This document describes how to train and use a transformer-based version of ELMo with allennlp. The model is a port of the one described in Dissecting Contextual Word Embeddings: Architecture and Representation by Peters et al. You can find a pretrained version of this model here.


  1. Obtain training data. The paths below assume the benchmark has been extracted to 1-billion-word-language-modeling-benchmark-r13output in the current directory.

     ```
     export BIDIRECTIONAL_LM_DATA_PATH=$PWD'/1-billion-word-language-modeling-benchmark-r13output'
     export BIDIRECTIONAL_LM_TRAIN_PATH=$BIDIRECTIONAL_LM_DATA_PATH'/training-monolingual.tokenized.shuffled/*'
     ```

  2. Obtain vocab.

     ```
     pip install --user awscli
     mkdir vocabulary
     export BIDIRECTIONAL_LM_VOCAB_PATH=$PWD'/vocabulary'
     cd $BIDIRECTIONAL_LM_VOCAB_PATH
     aws --no-sign-request s3 cp s3://allennlp/models/elmo/vocab-2016-09-10.txt .
     cat vocab-2016-09-10.txt | sed 's/<UNK>/@@UNKNOWN@@/' > tokens.txt
     # Avoid creating garbage namespace.
     rm vocab-2016-09-10.txt
     # printf expands \n in every shell; bash's built-in echo would write it literally.
     printf '*labels\n*tags\n' > non_padded_namespaces.txt
     ```

  3. Run training. Note: training_config refers to this directory.

     ```
     # The multiprocess dataset reader and iterator use many file descriptors,
     # so we increase the relevant ulimit here to help.
     # See
     #
     # for a description of the underlying issue.
     ulimit -n 4096
     # Location of repo for training_config.
     cd allennlp
     allennlp train training_config/bidirectional_language_model.jsonnet --serialization-dir output_path
     ```

  4. Wait. This will take days. (Example results here are from a model trained for just 4 epochs.)

  5. Evaluate. There is one gotcha here: we discard 3 sentences for being too long (otherwise we'd exhaust GPU memory). If we wanted to report this number formally (in a paper or similar), we'd need to handle this differently.

     ```
     allennlp evaluate --cuda-device 0 -o '{"iterator": {"base_iterator": {"maximum_samples_per_batch": ["num_tokens", 500] }}}' output_path/model.tar.gz $BIDIRECTIONAL_LM_DATA_PATH/heldout-monolingual.tokenized.shuffled/news.en-00000-of-00100
     ```

     A model trained for 4 epochs gives:

     ```
     2018-12-12 05:42:53,711 - INFO - allennlp.commands.evaluate - loss: 3.745238332322373

     ipython
     In [1]: import math; math.exp(3.745238332322373)  # To compute perplexity
     Out[1]: 42.3190920245054
     ```
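The vocabulary preparation in step 2 can also be done without sed. The following is a minimal Python sketch, not part of allennlp: prepare_vocabulary is a hypothetical helper name, and it assumes vocab-2016-09-10.txt has already been downloaded into the vocabulary directory.

```python
from pathlib import Path


def prepare_vocabulary(vocab_dir, source_name="vocab-2016-09-10.txt"):
    """Hypothetical helper mirroring the shell commands in step 2."""
    vocab_dir = Path(vocab_dir)
    source = vocab_dir / source_name
    target = vocab_dir / "tokens.txt"

    # Same substitution as sed 's/<UNK>/@@UNKNOWN@@/': first match on each line.
    # newline="\n" keeps lines split on "\n" only and avoids newline translation.
    with source.open(encoding="utf-8", newline="\n") as src, \
            target.open("w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            dst.write(line.replace("<UNK>", "@@UNKNOWN@@", 1))

    # Remove the downloaded file so it is not picked up as an extra namespace.
    source.unlink()

    # One namespace pattern per line.
    with (vocab_dir / "non_padded_namespaces.txt").open(
            "w", encoding="utf-8", newline="\n") as out:
        out.write("*labels\n*tags\n")

    return target
```

Calling prepare_vocabulary("vocabulary") from the directory that contains the vocabulary folder should leave the same two files as the shell version: tokens.txt and non_padded_namespaces.txt.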

Using Transformer ELMo with existing allennlp models

Using Transformer ELMo is essentially the same as using regular ELMo. See this documentation for details on how to do that.

The one exception is that inside the text_field_embedder block in your training config you should replace

"text_field_embedder": {
  "token_embedders": {
    "elmo": {
      "type": "elmo_token_embedder",
      "options_file": "",
      "weight_file": "",
      "do_layer_norm": false,
      "dropout": 0.5
    }
  }
}

with

"text_field_embedder": {
  "token_embedders": {
    "elmo": {
      "type": "bidirectional_lm_token_embedder",
      "archive_file": std.extVar('BIDIRECTIONAL_LM_ARCHIVE_PATH'),
      "dropout": 0.2,
      "bos_eos_tokens": ["<S>", "</S>"],
      "remove_bos_eos": true,
      "requires_grad": false
    }
  }
}

For an example of this see the config for a Transformer ELMo augmented constituency parser and compare with the original ELMo augmented constituency parser.
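If the config is handled as a Python dictionary (for example, after loading a plain-JSON version of it), the same swap can be expressed programmatically. This is only a sketch under assumptions: use_transformer_elmo is a hypothetical helper rather than an allennlp function, the archive path is passed as a plain string instead of going through std.extVar, and the dictionary passed in is assumed to be the block that directly contains text_field_embedder (in a full training config this is typically the model block).

```python
import copy


def use_transformer_elmo(config, archive_file):
    """Hypothetical helper: return a copy of `config` whose "elmo" token
    embedder is the bidirectional_lm_token_embedder block from this document."""
    new_config = copy.deepcopy(config)
    embedders = new_config["text_field_embedder"]["token_embedders"]
    embedders["elmo"] = {
        "type": "bidirectional_lm_token_embedder",
        "archive_file": archive_file,
        "dropout": 0.2,
        "bos_eos_tokens": ["<S>", "</S>"],
        "remove_bos_eos": True,
        "requires_grad": False,
    }
    return new_config


# Stand-in for the block that used regular ELMo.
original = {
    "text_field_embedder": {
        "token_embedders": {
            "elmo": {
                "type": "elmo_token_embedder",
                "options_file": "",
                "weight_file": "",
                "do_layer_norm": False,
                "dropout": 0.5,
            }
        }
    }
}

updated = use_transformer_elmo(original, "output_path/model.tar.gz")
print(updated["text_field_embedder"]["token_embedders"]["elmo"]["type"])
```

The helper works on a deep copy, so the original dictionary is left untouched and both variants can be compared side by side.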

Calling the BidirectionalLanguageModelTokenEmbedder directly

Of course, you can also directly call the embedder in your programs:

from allennlp.modules.token_embedders.bidirectional_language_model_token_embedder import BidirectionalLanguageModelTokenEmbedder
from allennlp.data.token_indexers import ELMoTokenCharactersIndexer
from allennlp.data.tokenizers import Token
import torch

lm_model_file = "output_path/model.tar.gz"

sentence = "It is raining in Seattle ."
tokens = [Token(word) for word in sentence.split()]

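# Same settings as in the config shown earlier.
lm_embedder = BidirectionalLanguageModelTokenEmbedder(
    archive_file=lm_model_file,
    dropout=0.2,
    bos_eos_tokens=["<S>", "</S>"],
    remove_bos_eos=True,
    requires_grad=False
)
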
indexer = ELMoTokenCharactersIndexer()
vocab = lm_embedder._lm.vocab
character_indices = indexer.tokens_to_indices(tokens, vocab, "elmo")["elmo"]

# Batch of size 1
indices_tensor = torch.LongTensor([character_indices])

# Embed and extract the single element from the batch.
embeddings = lm_embedder(indices_tensor)[0]

for word_embedding in embeddings:
    print(word_embedding)

Note: This sidesteps our data loading and batching mechanisms for brevity. See our main tutorial for an exposition of how they function.
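The bos_eos_tokens and remove_bos_eos arguments can be pictured with plain lists. The following is a conceptual illustration only, assuming the language model sees the sentence wrapped in the two boundary tokens and that remove_bos_eos drops those two positions from the output again. It does not reproduce the embedder's tensor code, and both function names are hypothetical.

```python
def add_boundaries(tokens, bos_eos_tokens=("<S>", "</S>")):
    """Hypothetical helper: wrap a token list in the BOS/EOS tokens."""
    bos, eos = bos_eos_tokens
    return [bos, *tokens, eos]


def strip_boundaries(per_position_outputs):
    """Hypothetical helper: drop the outputs at the BOS and EOS positions."""
    return per_position_outputs[1:-1]


tokens = "It is raining in Seattle .".split()
wrapped = add_boundaries(tokens)

# Stand-in for the contextual embeddings: one placeholder per input position.
outputs = [f"vec({token})" for token in wrapped]
kept = strip_boundaries(outputs)

# After stripping, there is exactly one output per original token.
print(len(tokens), len(wrapped), len(kept))
```

With the boundary positions removed, the result lines up one-to-one with the original tokens, which is what downstream models that consume per-token embeddings expect.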