Paper Dissected: “Deep Contextualized Word Representations” Explained

“Deep Contextualized Word Representations” was a paper that gained a lot of interest even before it was officially published at NAACL this year. The originality and high impact of the paper earned it the Outstanding Paper award at NAACL, which has only further cemented the sense that Embeddings from Language Models (or “ELMo”, as the authors have creatively named the method) might be one of the great breakthroughs and a staple in NLP for years to come. Despite reminding readers of a red furry mascot, this isn’t a paper to be taken lightly.


In this post, I dissect this influential paper and distill the essence that is necessary for practitioners to use this method in their own work. I also point to implementations of this method that you can use to experiment with, without incurring the cost of training an entire language model.


  • The meaning of a word is context-dependent; its embedding should also take context into account
  • Embeddings from Language Models (ELMos) use language models to obtain embeddings for individual words while taking the entire sentence or paragraph into account.
  • Concretely, ELMos use a pre-trained, multi-layer, bi-directional, LSTM-based language model and extract the hidden state of each layer for the input sequence of words. Then, they compute a weighted sum of those hidden states to obtain an embedding for each word.
  • The weight of each hidden state is task-dependent and is learned.
  • ELMo improves the performance of models across a wide range of tasks, spanning from question answering and sentiment analysis to named entity recognition.

Why do we need contextualized representations?

As an illustrative example, take the following two sentences:

“The bank on the other end of the street was robbed”

“We had a picnic on the bank of the river”

Both sentences use the word “bank”, but the meaning of the word differs completely between them. This phenomenon where two identical words change meaning depending on the context is known as “polysemy”, and has been an issue in the NLP deep learning community ever since word embeddings really took off. Most current neural networks are bad at handling polysemy because they use a single vector to represent the meaning of the word “bank”, regardless of the context. In reality, the vector representing any word should change depending on the words around it.

This is where ELMo comes in. ELMo attempts to resolve this problem by computing the vector representation of the meaning of a word while taking the surrounding context into account. So in the above examples, ELMo would be able to take words like “robbed” and “river” as input and better disambiguate the meaning of the word “bank”. Note that this isn’t just useful for handling polysemy. Any word has subtly different nuances and meanings depending on how they are used, which means taking context into account can be useful for all sorts of tasks.


The method

In this section, we’ll go over a high-level, intuitive overview of the method.

The overview of ELMo is as follows:

1. Train an LSTM-based language model on some large corpus

2. Use the hidden states of the LSTM for each token to compute a vector representation of each word

We’ll go over the details of these steps one by one.

LSTM-based language model

In case you are unfamiliar with language models: a language model is simply a model that predicts how “likely” a certain sequence of words is to be a real piece of text. This is generally done by training the model to take part of a sentence (say, the first n words) and predict the next word – or, more precisely, to output the probability of each word in the vocabulary being the next word. (In this post, we’ll focus on LSTM-based language models, which are the focus of this paper.) For instance, given the input sequence:

“The cat sat on the”

we would want our model to output high probabilities for words that are likely to come next (e.g. “mat”, “floor”) and output low probabilities for words that probably will not come next (e.g. “the”, “cat”). Language models are trained using cross-entropy loss and gradient descent-based methods, just like RNNs for other tasks.
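To make the training objective concrete, here is a toy sketch of what a language model outputs at a single step. The vocabulary and scores below are made up for illustration; a real model would produce the logits by running an LSTM over “The cat sat on the”.

```python
import math

# Hypothetical scores ("logits") a model might assign to candidate next words.
vocab = ["mat", "floor", "the", "cat"]
logits = [3.2, 2.1, -1.5, -2.0]  # higher score = more likely next word

# Softmax turns the scores into a probability distribution over the vocabulary.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Cross-entropy loss: negative log-probability assigned to the true next word.
loss = -math.log(probs[vocab.index("mat")])
```

Gradient descent then nudges the model's parameters so that this loss shrinks, i.e. so that the probability of the true next word grows.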

One trick that this paper uses is to additionally train a language model on reversed sentences, which the authors call the “backward” language model. For anyone familiar with using deep learning for NLP, this is the same idea as using bidirectional LSTMs for sentence classification. So for the above example, assuming the real next word is “mat”, the backward model would take the sequence

“mat the on sat cat”

as input and be trained to predict the word “The”. In contrast to the backward language model, the normal language model is called the “forward” language model.
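The forward/backward training pairs for the running example can be sketched with plain token lists (a real model would of course operate on word IDs):

```python
# The full sentence for the running example.
sentence = ["The", "cat", "sat", "on", "the", "mat"]

# Forward LM: read left to right, predict the next word.
forward_input, forward_target = sentence[:-1], sentence[-1]

# Backward LM: the same objective on the reversed sequence.
reversed_tokens = list(reversed(sentence))
backward_input, backward_target = reversed_tokens[:-1], reversed_tokens[-1]
```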

Furthermore, instead of using a single-layer LSTM, this paper uses a stacked, multi-layer LSTM. Whereas a single-layer LSTM would take the sequence of words as input, a multi-layer LSTM trains multiple LSTMs to take the output sequence of the LSTM in the previous layer as input (of course, the first layer takes the sequence of words as input). This is best illustrated in the following illustration:

[Figure: a stacked (multi-layer) LSTM, where each layer feeds its output sequence to the layer above]

Putting aside the detailed architecture (don’t worry, we’ll cover this later), by training an L-layer LSTM-based forward and backward language model, we are able to obtain 2L different representations for each word. If we add the original word vectors, we have 2L + 1 vectors that can be used to compute the context representation of every word.
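The bookkeeping above can be made concrete with a tiny sketch; the vectors here are placeholders, not real activations:

```python
# Counting the representations an L-layer biLM yields per token.
L, dim = 2, 4  # tiny illustrative sizes

token_embedding = [0.0] * dim                        # 1 context-independent vector
forward_states = [[0.1] * dim for _ in range(L)]     # L forward hidden states
backward_states = [[0.2] * dim for _ in range(L)]    # L backward hidden states

representations = [token_embedding] + forward_states + backward_states
```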

You may be wondering what training a language model has to do with a task like sentiment analysis. Though you may not think about it that much, traditional word embeddings like word2vec and fasttext are all weights of very simple language models, since they are trained to predict words from their context. Intuitively, all these methods force the model to learn basic properties of the language it is processing. You can’t predict that the word “fish” comes after the sentence “The cat ate the ” without knowing that “fish” is a noun, is edible, and is craved by cats (at least in pop culture). These properties can be invaluable in downstream tasks.
Now, the only question is how to combine these to obtain a single representation.


The above language model can be trained in a completely task-agnostic and unsupervised manner. In ELMo, the part that is task specific is the combination of the task-agnostic representations.
Concretely, in ELMo, the word representation is computed with the following equation:
ELMo_k = \gamma\sum_{j}s_jh_{k,j}
(I’ve simplified the notation from the paper for the sake of readability.)

Let’s pick this equation apart. The indices k and j correspond to the index of the word and the index of the layer the feature is being extracted from, respectively. More specifically, h_{k,j} is the output of the j-th LSTM layer for word k, and s_j is the weight of h_{k,j} in computing the representation for k.

The weight is learned for each task and normalized using the softmax function. The parameter \gamma is a task-dependent value that allows for scaling the entire vector, which is important during optimization.
The reason this simple, learning-based method works is that most NLP tasks that use deep learning take embeddings of discrete tokens (usually words) as input, meaning ELMos can be used alongside preexisting embeddings or can replace them and be trained end-to-end. Though the hidden states are fixed, the model still has the flexibility to utilize various levels of abstraction depending on the task at hand.
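The weighted sum can be sketched directly from the equation above. The hidden states, raw layer weights, and gamma below are made-up stand-ins for the trained values:

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical hidden states for one word k, one row per layer (3 layers, dim 2).
h_k = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
s_raw = [0.1, 0.3, -0.2]  # learned, task-specific scalars (invented here)
gamma = 0.8               # learned, task-specific scaling factor

s = softmax(s_raw)  # normalize the layer weights so they sum to 1
elmo_k = [gamma * sum(s[j] * h_k[j][d] for j in range(len(h_k)))
          for d in range(len(h_k[0]))]
```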

What makes ELMo so great is that it can be used seamlessly with almost any model while barely changing it, meaning it works in harmony with most other advances in the field.

Applying ELMo to specific tasks

Now we have an overall idea of how to obtain ELMo representations. The next question is how to specifically incorporate ELMo into other tasks. There is a relatively large space of possible designs here. ELMos could be swapped with existing word embeddings, they could be added, multiplied, etc. Within this design space, the authors recommend the following configuration:

1. Concatenate ELMos with context-independent word embeddings instead of replacing them

The authors recommend concatenating ELMos with other word embeddings like GloVe, fasttext, or character-based embeddings before inputting them into the task-specific model.

2. Optionally, concatenate ELMos with the output as well

For tasks like question answering, the authors found that concatenating ELMos with the outputs of the task-specific model led to further improvements in performance. This did not always improve performance, though, so whether to take this option should be decided based on experiments.
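Recommendation 1 is just vector concatenation; a minimal sketch with made-up vectors (real GloVe and ELMo vectors might be 300- and 1024-dimensional, respectively):

```python
# Concatenate ELMo with a context-independent embedding before the task model.
glove_vec = [0.2, -0.1, 0.4]  # e.g. a GloVe embedding of "bank" (made up)
elmo_vec = [0.9, 0.3]         # the ELMo representation of the same token (made up)

model_input = glove_vec + elmo_vec  # what the task-specific model consumes
```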


Tips and details

In this section, we’ll go over the nitty-gritty details that are necessary to make this method work as well as reported in the paper.

The architecture of the language model

  • LSTM language model

The authors used 2 bi-LSTM layers with 4096 units and 512-dimensional projections. They also applied residual connections between the LSTM layers, which is an effective method of improving gradient flow.

  • Word representations

Language models need context-independent word embeddings to be trained, which creates a sort of chicken-and-egg problem. Since we can’t train a language model on ELMos, the authors used a character-CNN based model to obtain word embeddings instead. Concretely, they trained a 2048 channel char-ngram CNN followed by two highway layers and a linear projection down to 512 dimensions.
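The idea of a character-level CNN encoder can be sketched as follows. This is only a toy: tiny dimensions instead of the paper's 2048 filters, random weights instead of trained ones, and no highway layers or projection.

```python
import random

random.seed(0)

# Toy character-CNN word encoder (all weights are random, for illustration).
char_dim, n_filters, width = 4, 3, 2
char_emb = {c: [random.uniform(-1, 1) for _ in range(char_dim)]
            for c in "abcdefghijklmnopqrstuvwxyz"}
filters = [[[random.uniform(-1, 1) for _ in range(char_dim)]
            for _ in range(width)] for _ in range(n_filters)]

def encode(word):
    chars = [char_emb[c] for c in word]
    feats = []
    for f in filters:
        # Slide the filter over character windows, then max-pool over positions.
        scores = [sum(f[i][d] * chars[p + i][d]
                      for i in range(width) for d in range(char_dim))
                  for p in range(len(chars) - width + 1)]
        feats.append(max(scores))
    return feats  # a fixed-size vector regardless of word length
```

The key property is that any word, even one never seen during training, maps to a fixed-size vector built from its characters.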

Training the language model

Training language models isn’t easy, and since they are costly to train, you want to set yourself up for as much success as possible beforehand. Here are a few tricks that were introduced by the authors:

  • Tie the weights for the forward and backward language models

Though the forward and backward language models perform different tasks, we expect that there is a great deal of knowledge that can be shared. To combat overfitting and reduce the memory overhead of the model, the authors shared the token representations and softmax weights between the models. This is similar to how this paper ties the softmax weights and input token weights when training language models.
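Weight tying between input and output embeddings can be sketched like this (the vocabulary and numbers are made up): the same vectors that embed input tokens also score candidate output tokens, so no separate softmax matrix needs to be stored.

```python
# Shared embedding table (illustrative values).
embeddings = {"cat": [1.0, 0.0], "mat": [0.0, 1.0], "the": [0.5, 0.5]}

def next_word_logits(hidden):
    # Dot product of the hidden state with each (shared) embedding vector;
    # the input lookup table doubles as the output projection.
    return {w: sum(h * e for h, e in zip(hidden, vec))
            for w, vec in embeddings.items()}

logits = next_word_logits([0.2, 0.9])
```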

  • Fine-tuning the language model

In some cases, fine-tuning the language model on the domain-specific data improved performance. This is essentially a form of domain transfer, making the language model more suited for the specific domain.

Training the ELMo weights

  • Applying layer normalization

ELMos combine the activations of various different LSTM layers. Since the activations for each intermediate layer might differ, the authors found it beneficial in some circumstances to apply layer normalization to the outputs of each LSTM layer.
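Layer normalization itself is simple enough to sketch in a few lines; this is a bare-bones version applied to a single activation vector, not the authors' exact setup:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one layer's activations to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

normalized = layer_norm([2.0, 4.0, 6.0, 8.0])
```

Applying this to each layer's output puts the activations from different layers on a comparable scale before they are summed.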

  • Apply dropout and (optionally) weight decay

Both of these methods prevent the ELMo-specific parameters from causing overfitting. Weight decay also has the effect of encouraging the model to use all the intermediate representations, since the L2 regularization penalty is smallest when all the weights are small, leading to a more even distribution.
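The even-distribution effect is easy to verify numerically: among layer weights summing to 1, the uniform setting minimizes the sum of squares.

```python
# L2 penalty on a set of layer weights.
def l2_penalty(weights):
    return sum(w * w for w in weights)

even = [1 / 3, 1 / 3, 1 / 3]     # uniform attention to all layers
peaked = [0.9, 0.05, 0.05]       # nearly all weight on one layer
```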

Experimental Results

The impressive thing about ELMo is how generally applicable it is and how consistently it improves performance.
Here are the tasks that ELMo achieves state-of-the-art performance on:

  • Question answering

The dataset used was the Stanford Question Answering Dataset (SQuAD), where the answers to the questions are spans of text in Wikipedia paragraphs.

  • Textual entailment

In textual entailment, the model is tasked with determining whether a statement entails another statement, i.e. whether a statement is true given a premise.

  • Semantic role labeling

In semantic role labeling, the system must determine the predicate-argument structure of a sentence.

  • Coreference resolution

Coreference resolution is a task where the model must identify which pieces of text refer to the same entity.

  • Named entity extraction

In named entity extraction, the model classifies various named entities into a set of different classes such as person and location.

  • Sentiment analysis

Sentiment analysis is a simple text classification task where the label corresponds to the degree of positivity/negativity in the text.

As you can see, the range of tasks is as diverse as it gets. Impressively, ELMo leads to a solid improvement in performance across this entire range, reducing relative error rates by 6 – 20% over strong baselines. I refer the reader to the original paper for the details regarding the experimental setup and baseline models. The following table summarizes the results:

Implementations of ELMo

The authors of the paper have published their code in PyTorch and TensorFlow on their homepage. The PyTorch implementation is incorporated into their custom framework allennlp, which makes it very easy to experiment with. On the downside, retraining the language model isn’t exactly easy in PyTorch, so training in TensorFlow, dumping the weights, and then porting them to PyTorch might be the best course of action available right now.

Conclusion and Further Readings

ELMo is an important step forward in transfer learning for NLP and will likely spawn many important papers in this field for years to come. I hope this post has made this method more accessible to you.
Here are a few readings that might interest you and deepen your understanding further:

The original paper

The homepage of the authors