An In-Depth Tutorial to AllenNLP (From Basics to ELMo and BERT)

In this post, I will be introducing AllenNLP, a framework for (you guessed it) deep learning in NLP that I’ve come to really love over the past few weeks of working with it.

To better explain AllenNLP and the concepts underlying the framework, I will first go through an actual example using AllenNLP to train a simple text classifier. Then I will show how you can swap those features out for more advanced models like ELMo and BERT. If you’re just here for ELMo and BERT, skip ahead to the later sections. I’ve uploaded all the code that goes along with this post here.

Finally, I’ll give my two cents on whether you should use AllenNLP or torchtext, another NLP library for PyTorch which I blogged about in the past.

Update: I found a couple of bugs in my previous code for using ELMo and BERT and fixed them. If you copied/referenced my previous code before this update, please reference the new versions on GitHub or in this post!

Update 2: I added a section on generating predictions!

A Basic Example

In my opinion, all good tutorials start with a top-down example that shows the big picture. The example I will use here is a text classifier for the toxic comment classification challenge. Don’t worry about understanding the code: just try to get an overall feel for what is going on and we’ll get to the details later.

You can see the code here as well.


Let’s start dissecting the code I wrote above. AllenNLP is – at its core – a framework for constructing NLP pipelines for training models. The pipeline is composed of distinct elements which are loosely coupled yet work together in wonderful harmony. We’ll go through an overview first, then dissect each element in more depth.


If you are familiar with PyTorch, the overall framework of AllenNLP will feel familiar to you. There are a couple of important differences but I will mention them later on. The basic AllenNLP pipeline is composed of the following elements:

  • DatasetReader: Extracts necessary information from data into a list of Instance objects
  • Model: The model to be trained (with some caveats!)
  • Iterator: Batches the data
  • Trainer: Handles training and metric recording
  • (Predictor: Generates predictions from raw strings)

Each of these elements is loosely coupled, meaning it is easy to swap different models and DatasetReaders in without having to change other parts of your code. Despite this, these parts all work very well together. To take full advantage of all the features available to you though, you’ll need to understand what each component is responsible for and what protocols it must respect. This is what we will discuss in the following sections, starting with the DatasetReader.

The DatasetReader

The DatasetReader is perhaps the most boring – but arguably the most important – piece in the pipeline. If you’re using any non-standard dataset, this is probably where you will need to write the most code, so you will want to understand this component well.

The DatasetReader is responsible for the following:

  1. Reading the data from disk
  2. Extracting relevant information from the data
  3. Converting the data into a list of Instances (we’ll discuss Instances in a second)

You may be surprised to hear that there is no Dataset class in AllenNLP, unlike traditional PyTorch. DatasetReaders are different from Datasets in that they are not a collection of data themselves: they are a schema for converting data on disk into lists of instances. You’ll understand this better after actually reading the code:

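from typing import Callable, Dict, Iterator, List, Optional

import numpy as np
import pandas as pd
from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField, MetadataField, ArrayField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

# `config` and `label_cols` are defined elsewhere in the full example code

class JigsawDatasetReader(DatasetReader):
    def __init__(self, tokenizer: Callable[[str], List[str]]=lambda x: x.split(),
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_seq_len: Optional[int]=config.max_seq_len) -> None:
        super().__init__(lazy=False)  # read everything into memory up front
        self.tokenizer = tokenizer
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
        self.max_seq_len = max_seq_len

    def text_to_instance(self, tokens: List[Token], id: str=None,
                         labels: np.ndarray=None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"tokens": sentence_field}
        id_field = MetadataField(id)
        fields["id"] = id_field
        # labels are optional so the same method can be used at prediction time
        if labels is None:
            labels = np.zeros(len(label_cols))
        label_field = ArrayField(array=labels)
        fields["label"] = label_field

        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        df = pd.read_csv(file_path)
        if config.testing: df = df.head(1000)
        for i, row in df.iterrows():
            yield self.text_to_instance(
                [Token(x) for x in self.tokenizer(row["comment_text"])],
                row["id"], row[label_cols].values,
            )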

As you will probably already have guessed, the _read method is responsible for 1: reading the data from disk into memory.

Side note: You may be worried about datasets that don’t fit into memory. Don’t worry: AllenNLP can lazily load the data (only read the data into memory when you actually need it). This does impose some additional complexity and runtime overhead, so I won’t be delving into this functionality in this post though.
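To make the eager-versus-lazy distinction concrete, here is a minimal sketch in plain Python (it does not use AllenNLP at all). The function name read_instances, the dict that stands in for an Instance, and the tiny inline CSV are all made up for illustration. Because the reader is a generator, nothing is parsed until you ask for it; wrapping it in list() gives the eager behavior used in this post.

```python
import csv
import io
from typing import Dict, Iterator

# A tiny inline stand-in for train.csv (made-up rows, for illustration only).
RAW_CSV = "id,comment_text,toxic\n1,you are great,0\n2,you are awful,1\n"


def read_instances(handle) -> Iterator[Dict[str, object]]:
    """Yield one dict per CSV row; a dict stands in for an AllenNLP Instance."""
    for row in csv.DictReader(handle):
        yield {
            "id": row["id"],
            "tokens": row["comment_text"].split(),
            "label": int(row["toxic"]),
        }


# Lazy: rows are parsed one at a time, only when requested.
lazy = read_instances(io.StringIO(RAW_CSV))
first = next(lazy)
print(first["tokens"])  # ['you', 'are', 'great']

# Eager: materialize everything into a list.
eager = list(read_instances(io.StringIO(RAW_CSV)))
print(len(eager))  # 2
```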

The second central method for the DatasetReader is the text_to_instance method. The name is slightly misleading: the method handles not only text but also labels, metadata, and anything else that your model will need later on.

    def text_to_instance(self, tokens: List[Token], id: str=None,
                         labels: np.ndarray=None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"tokens": sentence_field}
        id_field = MetadataField(id)
        fields["id"] = id_field
        if labels is None:
            labels = np.zeros(len(label_cols))
        label_field = ArrayField(array=labels)
        fields["label"] = label_field

        return Instance(fields)

The essence of this method is simple: take the data for a single example and pack it into an Instance object. Here, we’re passing the labels and ids of each example (we keep them optional so that we can use AllenNLP’s predictors: I’ll touch on this later). Instance objects are very similar to dictionaries, and all you need to know about them in practice is that they are instantiated with a dictionary mapping field names to “Field”s, which are our next topic.


Field objects in AllenNLP correspond to inputs to a model or fields in a batch that is fed into a model, depending on how you look at it. For each Field, the model will receive a single input (you can take a look at the forward method in the BaselineModel class in the example code to confirm). Each field handles converting the data into tensors, so if you need to do some fancy processing on your data when converting it into tensor form, you should probably write your own custom Field class.

There are several types of fields that you will find useful, but the one that will probably be the most important is the TextField.


The TextField does what all good NLP libraries do: it converts a sequence of tokens into integers. Be careful here though, since this is all the TextField does. It doesn’t clean the text, tokenize the text, etc. You’ll need to do that yourself.

The TextField takes an additional argument on init: the token indexer. Though the TextField handles converting tokens to integers, you need to tell it how to do this. Why? Because you might want to use a character level model instead of a word-level model or do some even funkier splitting of tokens (like splitting on morphemes). Instead of specifying these attributes in the TextField, AllenNLP has you pass a separate object that handles these decisions instead. This is the principle of composition, and you’ll see how this makes modifying your code easy later.
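If the idea of composition feels abstract, the following toy sketch may help. It is plain Python, not AllenNLP code: WordIndexer, CharIndexer, and ToyTextField are hypothetical stand-ins I made up to show the shape of the design. The field never decides how tokens become integers; it simply delegates to whichever indexer it was handed, so swapping the indexer changes the output without touching the field.

```python
from typing import Dict, List


class WordIndexer:
    """Hypothetical word-level indexer: one integer id per token."""

    def __init__(self) -> None:
        self.ids: Dict[str, int] = {}

    def index(self, tokens: List[str]) -> List[int]:
        # Assign ids in order of first appearance.
        return [self.ids.setdefault(tok, len(self.ids)) for tok in tokens]


class CharIndexer:
    """Hypothetical character-level indexer: a list of ids per token."""

    def index(self, tokens: List[str]) -> List[List[int]]:
        # Use the Unicode code point of each character as its id.
        return [[ord(ch) for ch in tok] for tok in tokens]


class ToyTextField:
    """Holds tokens and delegates the token-to-integer decision to the indexer."""

    def __init__(self, tokens: List[str], indexer) -> None:
        self.tokens = tokens
        self.indexer = indexer

    def as_ids(self):
        return self.indexer.index(self.tokens)


tokens = "the cat saw the dog".split()
word_ids = ToyTextField(tokens, WordIndexer()).as_ids()
char_ids = ToyTextField(tokens, CharIndexer()).as_ids()
print(word_ids)     # [0, 1, 2, 0, 3]
print(char_ids[1])  # [99, 97, 116]
```

The only thing that changed between the two calls is the object passed in, which is the same move we make later when switching to ELMo and BERT.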

For now, we’ll use a simple word-level model so we use the standard SingleIdTokenIndexer. We’ll look at how to modify this to use a character-level model later.

The other fields here are the MetadataField which takes data that is not supposed to be tensorized and the ArrayField which converts numpy arrays into tensors. There isn’t much to be said here but if you want to know more you can consult the documentation.

We went down a bit of a rabbit hole here, so let’s recap: DatasetReaders read data from disk and return a list of Instances. Instances are composed of Fields which specify both the data in the instance and how to process it.

Now, let’s put our DatasetReader into action:

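# `reader` is a JigsawDatasetReader instance; DATA_ROOT is an assumed name for a pathlib.Path pointing at the data directory
train_ds, test_ds = (reader.read(DATA_ROOT / fname) for fname in ["train.csv", "test_proced.csv"])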

The output is simply a list of instances:

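>>> train_ds
[<allennlp.data.instance.Instance at 0x1a2048a2e8>,
 <allennlp.data.instance.Instance at 0x1a2046b320>,
 <allennlp.data.instance.Instance at 0x1a20444860>,
 <allennlp.data.instance.Instance at 0x1a203e7940>,
 <allennlp.data.instance.Instance at 0x1a203de5c0>,
 ...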

Let’s take a look at the text field of one of the Instances.

>>> vars(train_ds[0].fields["tokens"])
{'tokens': [Explanation,   Why,   the,   edits,   made,   under,   my,   username,   Hardcore,   Metallica,   Fan, ...

Wait, aren’t the fields supposed to convert my data into tensors?

This is one of the gotchas of text processing for deep learning: you can only convert fields into tensors after you know what the vocabulary is. To build the vocabulary, you need to pass through all the text. To build a vocabulary over the training examples, just run the following code:

vocab = Vocabulary.from_instances(train_ds, max_vocab_size=config.max_vocab_size)
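To give a feel for what building a vocabulary involves, here is a rough sketch in plain Python. It is not AllenNLP's implementation: build_vocab and to_ids are hypothetical helpers, and the toy sentences are made up. The sketch reserves index 0 for padding and index 1 for out-of-vocabulary tokens, which, as far as I understand, mirrors AllenNLP's defaults and is why the embedding matrix later in this post has max_vocab_size + 2 rows.

```python
from collections import Counter
from typing import Dict, Iterable, List

PAD, OOV = "@@PADDING@@", "@@UNKNOWN@@"


def build_vocab(sentences: Iterable[List[str]], max_vocab_size: int) -> Dict[str, int]:
    """Keep the max_vocab_size most frequent tokens, after the two reserved entries."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {PAD: 0, OOV: 1}
    for tok, _ in counts.most_common(max_vocab_size):
        vocab[tok] = len(vocab)
    return vocab


def to_ids(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, falling back to the OOV id for unseen tokens."""
    return [vocab.get(tok, vocab[OOV]) for tok in tokens]


train = [["you", "are", "great"], ["you", "are", "not"], ["you", "rock"]]
toy_vocab = build_vocab(train, max_vocab_size=2)
print(toy_vocab)  # {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'you': 2, 'are': 3}
print(to_ids(["you", "are", "awesome"], toy_vocab))  # [2, 3, 1]
```

This also shows why tensors can only be produced after a full pass over the text: the id of a token depends on how frequent it is across the whole training set.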

Where do we tell the fields to use this vocabulary? This is not immediately intuitive, but the answer is the Iterator – which nicely leads us to our next topic: DataIterators.

Side note: If you’re interested in learning more, AllenNLP also provides implementations of readers for most famous datasets.

The Data Iterator

Neural networks in PyTorch are trained on mini batches of tensors, not lists of data. Therefore, datasets need to be batched and converted to tensors.

This seems trivial at first glance, but there is a lot of subtlety here. To list just a few things we have to consider:

  • Sequences of different lengths need to be padded
  • To minimize padding, sequences of similar lengths can be put in the same batch
  • Tensors need to be sent to the GPU if using the GPU
  • Data needs to be shuffled at the end of each epoch during training, but we don’t want to shuffle in the midst of an epoch in order to cover all examples evenly

Thankfully, AllenNLP has several convenient iterators that will take care of all of these problems behind the scenes. Therefore, you will rarely have to implement your own Iterators from scratch (unless you are doing something really tricky during batching).

Here’s some basic code to use a convenient iterator in AllenNLP: the BucketIterator:

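from allennlp.data.iterators import BucketIterator

iterator = BucketIterator(batch_size=config.batch_size,
                          sorting_keys=[("tokens", "num_tokens")])
# tell the iterator which vocabulary to use when converting tokens to integers
iterator.index_with(vocab)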

The BucketIterator batches sequences of similar lengths together to minimize padding. To prevent the batches from becoming deterministic, a small amount of noise is added to the lengths. The sorting_keys keyword argument tells the iterator which field to reference when determining the text length of each instance.
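Here is a small sketch of the bucketing idea in plain Python, so you can see how much padding it saves. This is my own simplified version, not the BucketIterator source: the helper names, the noise formula (a multiplicative perturbation of the length), and the toy data are assumptions for illustration.

```python
import random
from typing import List


def pad_batch(chunk: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence in the chunk to the length of the longest one."""
    width = max(len(seq) for seq in chunk)
    return [seq + [pad_id] * (width - len(seq)) for seq in chunk]


def naive_batches(seqs: List[List[int]], batch_size: int) -> List[List[List[int]]]:
    """Batch sequences in their original order."""
    return [pad_batch(seqs[i:i + batch_size]) for i in range(0, len(seqs), batch_size)]


def bucket_batches(seqs, batch_size, noise=0.1, seed=0):
    """Batch sequences of similar (noisy) length together, then shuffle the batches."""
    rng = random.Random(seed)
    noisy_sorted = sorted(seqs, key=lambda seq: len(seq) * (1 + rng.uniform(-noise, noise)))
    batches = naive_batches(noisy_sorted, batch_size)
    rng.shuffle(batches)
    return batches


def count_padding(batches, pad_id: int = 0) -> int:
    return sum(row.count(pad_id) for batch in batches for row in batch)


# Token ids are all 1 here, so a 0 can only be padding.
lengths = [1, 9, 2, 8, 1, 9, 2, 8]
seqs = [[1] * n for n in lengths]
naive = naive_batches(seqs, batch_size=2)
bucketed = bucket_batches(seqs, batch_size=2)
print(count_padding(naive))     # 28
print(count_padding(bucketed))  # far fewer (at most 2 with this data)
```

The noise only matters when lengths are close to each other: it lets near-equal sequences trade places between runs without ever pairing a very short sequence with a very long one.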

Remember, Iterators are responsible for numericalizing the text fields. We pass the vocabulary we built earlier so that the Iterator knows how to map the words to integers. This step is easy to forget, so be careful!

Important Tip: Don’t forget to run iterator.index_with(vocab)!

You may have noticed that the iterator does not take datasets as an argument. This is an important distinction between general iterators in PyTorch and iterators in AllenNLP. Whereas iterators are direct sources of batches in PyTorch, in AllenNLP, iterators are a schema for how to convert lists of Instances into mini batches of tensors. Therefore, you can’t directly iterate over a DataIterator in AllenNLP!

Now we turn to the aspect of AllenNLP that – in my opinion – is what makes it stand out among many other frameworks: the Models.


AllenNLP models are mostly just simple PyTorch models. The key difference is that AllenNLP models are required to return a dictionary for every forward pass and compute the loss function within the forward method during training.

    def forward(self, tokens: Dict[str, torch.Tensor],
                id: Any, label: torch.Tensor) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(tokens)
        embeddings = self.word_embeddings(tokens)
        state = self.encoder(embeddings, mask)
        class_logits = self.projection(state)
        output = {"class_logits": class_logits}
        # the loss is computed here and returned alongside the predictions
        output["loss"] = self.loss(class_logits, label)

        return output

This may seem a bit unusual, but this restriction allows you to use all sorts of creative methods of computing the loss while taking advantage of the AllenNLP Trainer (which we will get to later). For instance, you can apply masks to your loss function, weight the losses of different classes adaptively, etc.
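As a rough illustration of why computing the loss inside forward is handy, here is a toy model in plain Python (no PyTorch, no AllenNLP). ToyClassifier and its loss_mask argument are hypothetical names I invented for this sketch. The point is that the model alone decides how the loss is computed, here by masking out some examples, and whoever calls it only needs to read the "loss" key.

```python
import math
from typing import Dict, List, Optional


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


class ToyClassifier:
    def __init__(self, weight: float, bias: float) -> None:
        self.weight, self.bias = weight, bias

    def forward(self, feature: List[float],
                label: Optional[List[float]] = None,
                loss_mask: Optional[List[float]] = None) -> Dict[str, object]:
        logits = [self.weight * x + self.bias for x in feature]
        output: Dict[str, object] = {"class_logits": logits}
        # Only compute the loss when labels are available (i.e. during training).
        if label is not None:
            mask = [1.0] * len(logits) if loss_mask is None else loss_mask
            losses = [m * bce_with_logits(z, y) for z, y, m in zip(logits, label, mask)]
            output["loss"] = sum(losses) / max(sum(mask), 1.0)
        return output


toy_model = ToyClassifier(weight=1.0, bias=0.0)
plain = toy_model.forward([0.0, 2.0], label=[1.0, 0.0])
masked = toy_model.forward([0.0, 2.0], label=[1.0, 0.0], loss_mask=[1.0, 0.0])
unlabeled = toy_model.forward([0.0, 2.0])
print(round(masked["loss"], 4))  # 0.6931
print("loss" in unlabeled)       # False
```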

One amazing aspect of AllenNLP is that it has a whole host of convenient tools for constructing models for NLP. To utilize these components fully, AllenNLP models are generally composed from the following components:

  • A token embedder
  • An encoder
  • (For seq-to-seq models) A decoder

Therefore, at a high level our model can be written very simply as

from typing import Any, Dict

import torch
import torch.nn as nn
from allennlp.nn.util import get_text_field_mask
from allennlp.models import Model
from allennlp.modules.seq2vec_encoders import Seq2VecEncoder
from allennlp.modules.text_field_embedders import TextFieldEmbedder

class BaselineModel(Model):
    def __init__(self, word_embeddings: TextFieldEmbedder,
                 encoder: Seq2VecEncoder,
                 out_sz: int=len(label_cols)):
        # Model needs the vocabulary; this uses the `vocab` we built earlier
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        self.projection = nn.Linear(self.encoder.get_output_dim(), out_sz)
        self.loss = nn.BCEWithLogitsLoss()

    def forward(self, tokens: Dict[str, torch.Tensor],
                id: Any, label: torch.Tensor) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(tokens)
        embeddings = self.word_embeddings(tokens)
        state = self.encoder(embeddings, mask)
        class_logits = self.projection(state)
        output = {"class_logits": class_logits}
        output["loss"] = self.loss(class_logits, label)

        return output

This compartmentalization enables AllenNLP to switch embedding methods and model details easily. Now, let’s look at each component separately.

The Embedder

The embedder maps a sequence of token ids (or character ids) into a sequence of tensors. In this example, we’ll use a simple embedding matrix.

from allennlp.modules.token_embedders import Embedding
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder

token_embedding = Embedding(num_embeddings=config.max_vocab_size + 2,
                            embedding_dim=300, padding_index=0)
# the embedder maps the input tokens to the appropriate embedding matrix
word_embeddings: TextFieldEmbedder = BasicTextFieldEmbedder({"tokens": token_embedding})

You’ll notice that there are two classes here for handling embeddings: the Embedding class and the BasicTextFieldEmbedder class. This is slightly clumsy but is necessary to map the fields of a batch to the appropriate embedding mechanism.

The Encoder

To classify each sentence, we need to convert the sequence of embeddings into a single vector. In AllenNLP, the model that handles this is referred to as a Seq2VecEncoder: a mapping from sequences to a single vector.

Though AllenNLP provides many Seq2VecEncoders out of the box, for this example we’ll use a simple bidirectional LSTM. Don’t remember the semantics of LSTMs in PyTorch? Don’t worry: AllenNLP has you covered. AllenNLP provides a handy wrapper called the PytorchSeq2VecWrapper that wraps the LSTM so that it takes a sequence as input and returns the final hidden state, converting it into a Seq2VecEncoder.

from allennlp.modules.seq2vec_encoders import PytorchSeq2VecWrapper

encoder: Seq2VecEncoder = PytorchSeq2VecWrapper(nn.LSTM(300, config.hidden_sz, bidirectional=True, batch_first=True))
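To see what the Seq2VecEncoder contract amounts to, here is a deliberately simple stand-in in plain Python. Note that I am swapping the LSTM for masked mean pooling, purely because it fits in a few lines; masked_mean_pool is a hypothetical helper, not an AllenNLP function. The contract is the same either way: a batch of sequences plus a mask goes in, one vector per sequence comes out, and padded positions must not influence the result.

```python
from typing import List


def masked_mean_pool(embeddings: List[List[List[float]]],
                     mask: List[List[int]]) -> List[List[float]]:
    """Average the vectors of the real (mask == 1) tokens in each sequence.

    embeddings: batch x seq_len x dim, mask: batch x seq_len.
    """
    pooled = []
    for seq, seq_mask in zip(embeddings, mask):
        dim = len(seq[0])
        n_real = max(sum(seq_mask), 1)  # avoid dividing by zero on empty sequences
        pooled.append([
            sum(vec[d] * m for vec, m in zip(seq, seq_mask)) / n_real
            for d in range(dim)
        ])
    return pooled


embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
    [[5.0, 6.0], [7.0, 8.0], [1.0, 1.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
pooled = masked_mean_pool(embeddings, mask)
print(pooled[0])  # [2.0, 3.0]
```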

Now, we can build our model in 3 simple lines of code! (or 4 lines depending on how you count it).

model = BaselineModel(
    word_embeddings,
    encoder,
)

Side note: When you think about it, you’ll notice how virtually any important NLP model can be written like the above. For seq2seq models you’ll probably need an additional decoder, but that is simply adding another component. This is the beauty of AllenNLP: it is built on abstractions that capture the essence of current deep learning in NLP.


Now we have all the necessary parts to start training our model. In my opinion, one of the largest pain points in PyTorch has been training: unlike frameworks like Keras, there was no shared framework and often you had to write a lot of boilerplate code just to get a simple training loop built.

AllenNLP – thanks to the light restrictions it puts on its models and iterators – provides a Trainer class that removes the necessity of boilerplate code and gives us all sorts of functionality, including access to Tensorboard, one of the best visualization/debugging tools for training neural networks.

The code for preparing a trainer is very simple:

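from allennlp.training.trainer import Trainer

# `optimizer` is assumed to be a torch.optim optimizer built over model.parameters()
trainer = Trainer(
    model=model, optimizer=optimizer,
    iterator=iterator, train_dataset=train_ds,
    cuda_device=0 if USE_GPU else -1,
)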

With this, we can train our model in one method call:

metrics = trainer.train()

The reason we are able to train with such simple code is because of how the components of AllenNLP work together so well. The Instances contain the information necessary for Iterators to generate batches of data, the model specifies which fields in each batch get mapped to what and returns the loss, which the Trainer uses to update the model. At each step, we could have used a different Iterator or model, as long as we adhered to some basic protocols.
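To show how little a training loop needs to know once the model follows this protocol, here is a toy sketch in plain Python. Everything in it is made up for illustration: ToyRegressor, the train function, and the data. The real Trainer relies on PyTorch autograd and an optimizer; since this sketch avoids PyTorch, I estimate the gradient with a central finite difference instead. What matters is that the loop only ever calls the model with the batch's fields as keyword arguments and reads the "loss" key.

```python
from typing import Dict, List


class ToyRegressor:
    """One-parameter model that returns a dict containing the loss."""

    def __init__(self) -> None:
        self.params: List[float] = [0.0]

    def forward(self, x: List[float], y: List[float]) -> Dict[str, float]:
        preds = [self.params[0] * xi for xi in x]
        loss = sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)
        return {"loss": loss}


def train(model, batches, num_epochs: int, lr: float = 0.05, eps: float = 1e-4) -> float:
    """Gradient descent driven purely by the 'loss' entry of the output dict."""
    loss = float("nan")
    for _ in range(num_epochs):
        for batch in batches:
            for i in range(len(model.params)):
                original = model.params[i]
                # Central finite difference stands in for autograd here.
                model.params[i] = original + eps
                loss_plus = model.forward(**batch)["loss"]
                model.params[i] = original - eps
                loss_minus = model.forward(**batch)["loss"]
                grad = (loss_plus - loss_minus) / (2 * eps)
                model.params[i] = original - lr * grad
            loss = model.forward(**batch)["loss"]
    return loss


# Field names in each batch match the argument names of forward.
batches = [{"x": [1.0, 2.0], "y": [3.0, 6.0]}, {"x": [3.0], "y": [9.0]}]
toy_regressor = ToyRegressor()
final_loss = train(toy_regressor, batches, num_epochs=50)
print(round(toy_regressor.params[0], 3))  # 3.0
```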

Side note: I do wish the Trainer had a bit more customizability. For example, I wish it supported callbacks and implemented functionality like logging to Tensorboard through callbacks instead of directly writing the code in the Trainer class. The training code is one aspect that I think the fastai library truly excels in, and I hope many of the features there get imported into AllenNLP.

AllenNLP Predictors

Here’s my honest opinion: AllenNLP’s predictors aren’t very easy to use and don’t feel as polished as other parts of the API. You’ll see why in a second. First, let’s actually try and use them. Here’s the code:

from allennlp.predictors.sentence_tagger import SentenceTaggerPredictor
tagger = SentenceTaggerPredictor(model, reader)
tagger.predict("this tutorial was great!")

Although our model isn’t exactly doing sequence tagging, the SentenceTaggerPredictor is the only predictor (as far as I know) that extracts the raw output dicts. This feels pretty clumsy to me.

Now, here’s the question: how do we take advantage of the datasets we’ve already read in? How do we ensure their ordering is consistent with our predictions? Do we extract the text and vocabulary again?

Instead of toiling through the predictor API in AllenNLP, I propose a simpler solution: let’s write our own predictor. Thanks to the great tools in AllenNLP this is pretty easy and instructive!

Writing our Own Predictor

Our predictor will simply extract the model logits from each batch and concatenate them to form a single matrix containing predictions for all the Instances in the dataset. Here’s the code:

from typing import Iterable

import numpy as np
import torch
from allennlp.data import Instance
from allennlp.data.iterators import DataIterator
from allennlp.models import Model
from allennlp.nn import util as nn_util
from tqdm import tqdm
from scipy.special import expit # the sigmoid function

def tonp(tsr): return tsr.detach().cpu().numpy()

class Predictor:
    def __init__(self, model: Model, iterator: DataIterator,
                 cuda_device: int=-1) -> None:
        self.model = model
        self.iterator = iterator
        self.cuda_device = cuda_device

    def _extract_data(self, batch) -> np.ndarray:
        out_dict = self.model(**batch)
        return expit(tonp(out_dict["class_logits"]))

    def predict(self, ds: Iterable[Instance]) -> np.ndarray:
        pred_generator = self.iterator(ds, num_epochs=1, shuffle=False)
        self.model.eval()  # switch off dropout and friends while predicting
        pred_generator_tqdm = tqdm(pred_generator,
                                   total=self.iterator.get_num_batches(ds))
        preds = []
        with torch.no_grad():
            for batch in pred_generator_tqdm:
                batch = nn_util.move_to_device(batch, self.cuda_device)
                preds.append(self._extract_data(batch))
        return np.concatenate(preds, axis=0)

As you can see, we’re taking advantage of the AllenNLP ecosystem: we’re using iterators to batch our data easily and exploiting the semantics of the model output.

Now, just run the following code to generate predictions:

from allennlp.data.iterators import BasicIterator
# iterate over the dataset without changing its order
seq_iterator = BasicIterator(batch_size=64)
seq_iterator.index_with(vocab)

predictor = Predictor(model, seq_iterator, cuda_device=0 if USE_GPU else -1)
train_preds = predictor.predict(train_ds) 
test_preds = predictor.predict(test_ds) 

Much simpler, don’t you think? Now all we need to do is either write these predictions to disk, evaluate our model, or do whatever downstream tasks we need to do. I’ll leave that up to the reader.
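If you want a starting point for the "write to disk" part, here is a minimal sketch using only the standard library. The write_predictions helper is hypothetical, the ids and probabilities are made up, and I am assuming that label_cols holds the six label columns of the toxic comment dataset. It writes to an in-memory buffer here; pass an open file instead to actually save the predictions.

```python
import csv
import io
from typing import Sequence

# Assumed to match the label columns used throughout this post.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def write_predictions(ids: Sequence[str], preds: Sequence[Sequence[float]], out_file) -> None:
    """Write one row per example: the id followed by one probability per label."""
    if len(ids) != len(preds):
        raise ValueError("need exactly one row of predictions per id")
    writer = csv.writer(out_file)
    writer.writerow(["id"] + label_cols)
    for row_id, row in zip(ids, preds):
        writer.writerow([row_id] + [f"{p:.4f}" for p in row])


buffer = io.StringIO()
write_predictions(["a1", "b2"], [[0.9, 0.1, 0.8, 0.0, 0.7, 0.05], [0.01] * 6], buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])  # id,toxic,severe_toxic,obscene,threat,insult,identity_hate
```

Since our predictor never shuffles, row i of the prediction matrix lines up with instance i of the dataset, so the ids can be read straight from the instances' metadata fields.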

Training classifiers is pretty fun, but now we’ll do something much more exciting: let’s examine how we can use state-of-the-art transfer learning methods in NLP with very small changes to our code above!

Practical Example 1: How to Switch to ELMo

Simply building a single NLP pipeline to train one model is easy. Writing the pipeline so that we can iterate over multiple configurations, swap components in and out, and implement crazy architectures without making our codebase explode is much harder.

This is where the true value in using AllenNLP lies. Not only does AllenNLP provide great built-in components for getting NLP models running quickly, but it also forces your code to be written in a modular manner, meaning you can easily switch new components in.

Here, I’ll demonstrate how you can use ELMo to train your model with minimal changes to your code. ELMo is a recently developed method for text embedding in NLP that takes contextual information into account and achieved state-of-the-art results in many NLP tasks (If you want to learn more about ELMo, please refer to this blog post I wrote in the past explaining the method – sorry for the shameless plug).

To incorporate ELMo, we’ll need to change two things:

  1. The token indexer
  2. The embedder

ELMo uses character-level features so we’ll need to change the token indexer from a word-level indexer to a character-level indexer. In addition to converting characters to integers, we’re using a pre-trained model so we need to ensure that the mapping we use is the same as the mapping that was used to train ELMo. This seems like a lot of work, but in AllenNLP, all you need to do is use the ELMoTokenCharactersIndexer:

from allennlp.data.token_indexers.elmo_indexer import ELMoTokenCharactersIndexer

# the token indexer is responsible for mapping tokens to integers
token_indexer = ELMoTokenCharactersIndexer()

Wait, is that it? you may ask. What about the DatasetReader? Surely if we use a different indexer, we’ll need to change the way we read the dataset? Well, not in AllenNLP. This is where composition shines; since we delegate all the decisions regarding how to convert raw text into integers to the token indexer, we get to reuse all the remaining code simply by swapping in a new token indexer.

One thing to note is that the ELMoTokenCharactersIndexer handles the mapping from characters to indices for you (you need to use the same mappings as the pretrained model for ELMo to have any benefit). Therefore, the code for initializing the Vocabulary is as follows:

vocab = Vocabulary() # No need to build the vocabulary

Now, to change the embeddings to ELMo, you can simply follow a similar process:

from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import ElmoTokenEmbedder

options_file = ''
weight_file = ''

elmo_embedder = ElmoTokenEmbedder(options_file, weight_file)
word_embeddings = BasicTextFieldEmbedder({"tokens": elmo_embedder})

We want to use a pretrained model, so we’ll specify where to get the data and the settings from. AllenNLP takes care of all the rest for us.

You can see the entire notebook here.

Practical Example 2: How to Switch to BERT

BERT is another transfer learning method that has gained a lot of attention due to its impressive performance across a wide range of tasks (I’ve written a blog post on this topic here in case you want to learn more).

You’re probably thinking that switching to BERT is mostly the same as above. Well, you’re right – mostly. BERT has a few quirks that make it slightly different from your traditional model. One quirk is that BERT uses wordpiece embeddings so we need to use a special tokenizer. We can access this functionality with the following code:

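from allennlp.data.token_indexers import PretrainedBertIndexer

token_indexer = PretrainedBertIndexer(
    pretrained_model="bert-base-uncased",  # assumed model name; use whichever BERT model you want
)

def tokenizer(s: str):
    # truncate, leaving room for the two special tokens the indexer adds
    return token_indexer.wordpiece_tokenizer(s)[:config.max_seq_len - 2]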

Similar to ELMo, the pretrained BERT model has its own embedding matrix. We will need to use the same mappings from wordpiece to index, which is handled by the PretrainedBertIndexer. Therefore, we won’t be building the Vocabulary here either.
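In case wordpieces are new to you, here is a toy version of greedy longest-match-first wordpiece tokenization in plain Python. It is a sketch of the general technique, not the code that the PretrainedBertIndexer runs: the vocabulary is made up, and wordpiece and encode are hypothetical helpers. The encode function also shows why we truncate to max_seq_len - 2 above: presumably to leave room for the [CLS] and [SEP] tokens that get added around the sequence.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(sentence: str, vocab: Set[str], max_seq_len: int) -> List[str]:
    """Tokenize, truncate, and wrap in the special start and end tokens."""
    pieces = [p for word in sentence.split() for p in wordpiece(word, vocab)]
    return ["[CLS]"] + pieces[:max_seq_len - 2] + ["[SEP]"]


toy_pieces = {"un", "##aff", "##able", "play", "##ing", "##s"}
print(wordpiece("unaffable", toy_pieces))  # ['un', '##aff', '##able']
print(wordpiece("xyz", toy_pieces))        # ['[UNK]']
encoded = encode("playing unaffable", toy_pieces, max_seq_len=5)
print(encoded)  # ['[CLS]', 'play', '##ing', 'un', '[SEP]']
```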

vocab = Vocabulary()

Accessing the BERT encoder is mostly the same as using the ELMo encoder. BERT doesn’t handle masking though, so we do need to tell the embedder to ignore additional fields.

from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder, TextFieldEmbedder
from allennlp.modules.token_embedders.bert_token_embedder import PretrainedBertEmbedder

bert_embedder = PretrainedBertEmbedder(
    pretrained_model="bert-base-uncased",  # assumed name; must match the indexer's model
    top_layer_only=True,  # conserve memory
)
word_embeddings: TextFieldEmbedder = BasicTextFieldEmbedder(
    {"tokens": bert_embedder},
    # we'll be ignoring masks so we'll need to set this to True
    allow_unmatched_keys=True,
)

Another thing to be careful of is that when training sentence classification models on BERT we only use the embedding corresponding to the first token in the sentence.

BERT_DIM = word_embeddings.get_output_dim()

class BertSentencePooler(Seq2VecEncoder):
    def forward(self, embs: torch.Tensor,
                mask: torch.Tensor=None) -> torch.Tensor:
        # extract first token tensor
        return embs[:, 0]

    def get_output_dim(self) -> int:
        return BERT_DIM

# the pooler has no parameters and does not need the vocabulary
encoder = BertSentencePooler()

This is all we need to change though: we can reuse all the remaining code as is! You can see the full code here.

Torchtext vs. AllenNLP

I’ve personally contributed to torchtext and really love it as a framework. That being said, in many cases I would recommend AllenNLP for those just getting started.

Torchtext is a very lightweight framework that is completely agnostic to how the model is defined or trained. All it handles is the conversion of text files into batches of data that can be fed into models (which it does very well). Therefore, it is a great choice if you already have custom training code and model code that you want to use as-is. Torchtext also has a lot less code so is much more transparent when you really want to know what is going on behind the scenes.

On the other hand, AllenNLP is more of an all-or-nothing framework: you either use all the features or use none of them. AllenNLP models are expected to be defined in a certain way. Of course, you can selectively use pieces but then you lose a great portion of the power of the framework. On the flip side, this means that you can take advantage of many more features.

The decisive factor that made me switch to AllenNLP was its extensive support for contextual representations like ELMo. Contextual representations are exactly the kind of feature that requires coordination between the model, data loader, and data iterator.

Side note: Another great framework for PyTorch is fastai, but I haven’t used it enough to give an educated opinion on it and I also feel that fastai and AllenNLP have different use cases with AllenNLP being slightly more flexible due to its composite nature. Everything feels more tightly integrated in fastai since a lot of the functionality is shared using inheritance. I may be wrong here though and would really love to hear different opinions on this issue!

Conclusion and Further Readings

AllenNLP is a truly wonderful piece of software. It is easy to use, easy to customize, and improves the quality of the code you write yourself. Based on reading Kaggle kernels and research code on GitHub, I feel that there is a lack of appreciation for good coding standards in the data science community. AllenNLP is a nice exception to this rule: the function and method names are descriptive, type annotations and documentation make the code easy to interpret and use, and helpful error messages and comments make debugging easy.

The best way to learn more is to actually apply AllenNLP to some problem you want to solve. The documentation is a great source of information, but I feel that sometimes reading the code is much faster if you want to dive deeper. AllenNLP’s code is heavily annotated with type hints so reading and understanding the code is not as hard as it may seem.

Thanks for reading, and if you have any feedback please leave it in the comments below!