Language modeling tutorial in torchtext (Practical Torchtext part 2)

In a previous article, I wrote an introductory tutorial to torchtext using text classification as an example.

In this post, I will outline how to use torchtext for training a language model. We’ll also take a look at some more practical features of torchtext that you might want to use when training your own practical models. Specifically, we’ll cover

  • Using a built-in dataset
  • Using a custom tokenizer
  • Using pretrained word embeddings

The full code is available here. As a word of caution, if you’re running the code in this tutorial, I assume that you have access to a GPU for the sake of training speed. If you don’t have a GPU, you can still follow along, but the training will be very slow.

1. What is Language Modeling?

Language modeling is a task where we build a model that can take a sequence of words as input and determine how likely that sequence is to be actual human language. For instance, we would want our model to predict “This is a sentence” to be a likely sequence and “cold his book her” to be unlikely.

Though language models may seem uninteresting on their own, they can be used as an unsupervised pretraining method or the basis of other tasks like chat generation. In any case, language modeling is one of the most basic tasks in deep learning for NLP, so it’s a good idea to learn language modeling as the basis of other, more complicated tasks (like machine translation).

The way we generally train language models is by training them to predict the next word given all previous words in a sentence (or multiple sentences). Therefore, all we need for language modeling is a large amount of text. In this tutorial, we’ll be using the famous WikiText2 dataset, which is a built-in dataset provided by torchtext.

 

2. Preparing the Data

To use the WikiText2 dataset, we’ll need to prepare the field that handles the tokenization and numericalization of the text. This time, we’ll try using a custom tokenizer: the spacy tokenizer. Spacy is a framework that handles many natural language processing tasks, and torchtext is designed to work closely with it. Using the spacy tokenizer is easy with torchtext: all we have to do is pass in the tokenizer function!


import torchtext
from torchtext import data
import spacy

from spacy.symbols import ORTH
my_tok = spacy.load('en')

def spacy_tok(x):
    return [tok.text for tok in my_tok.tokenizer(x)]

TEXT = data.Field(lower=True, tokenize=spacy_tok)
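
If you want to see what the tokenizer actually does, you can call it directly. The output below is roughly what spacy’s default English tokenizer should produce, so treat it as illustrative:

>>> spacy_tok("A language model should assign real sentences high probability.")
['A', 'language', 'model', 'should', 'assign', 'real', 'sentences', 'high', 'probability', '.']

Note that lowercasing happens inside the Field (lower=True), not in spacy_tok itself, which is why the “A” is still capitalized here.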

 

The ORTH import above is there for spacy’s add_special_case method, which simply tells the tokenizer to parse a certain string in a certain way. The list after the special case string represents how we want the string to be tokenized.

For example, if we wanted to tokenize “don’t” into “do” and “n’t”, we would write


my_tok.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't"}])
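
With the special case registered, tokenizing a string containing “don’t” splits it accordingly (again, illustrative output):

>>> spacy_tok("I don't know")
['I', 'do', "n't", 'know']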

Now, we’re ready to load the WikiText2 dataset. There are two effective ways of using these built-in datasets: one is loading them as a Dataset split into the train, validation, and test sets, and the other is loading them directly as an Iterator. The Dataset approach offers more flexibility, so that’s what we’ll use here.


from torchtext.datasets import WikiText2

train, valid, test = WikiText2.splits(TEXT) # loading built-in datasets only requires passing in the field(s), nothing else

Let’s take a quick look inside. Remember, datasets behave largely like normal lists, so we can measure the length using the len function.

>>> len(train)
1

Only one training example?! Did we do something wrong? Turns out not. It’s just that the entire corpus of the dataset is contained within a single example. We’ll see how this example gets batched and processed later.
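
We can peek inside that single example to convince ourselves that it really is just one long stream of tokens. I’m not reproducing exact output here since the token count and the first few tokens depend on the tokenizer and the dataset version:


example = train.examples[0]
len(example.text)   # one long list of tokens: roughly two million for WikiText2
example.text[:10]   # the first handful of tokens in the corpus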

Now that we have our data, let’s build the vocabulary. This time, let’s try using precomputed word embeddings: we’ll use GloVe vectors with 200 dimensions. Torchtext offers various other pretrained word embeddings (including GloVe vectors with 100 and 300 dimensions), which can be loaded in mostly the same way.


TEXT.build_vocab(train, vectors="glove.6B.200d")
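
It’s worth a quick sanity check that the vectors were actually attached to the vocabulary. The vocabulary size will depend on your tokenizer, but the vectors should have 200 dimensions since we asked for glove.6B.200d:


len(TEXT.vocab)            # vocabulary size (tokenizer-dependent)
TEXT.vocab.vectors.size()  # (vocab_size, 200) -- one pretrained vector per vocabulary entry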

Great! We’ve prepared our dataset in only 3 lines of code (excluding imports and the tokenizer). Now we move on to building the Iterator, which will handle batching and moving the data to GPU for us.

This is the climax of our tutorial and shows why torchtext is so convenient for language modeling. It turns out that torchtext has an iterator that does most of the heavy lifting for us: the BPTTIterator. The BPTTIterator does the following for us:

  • Divides the corpus into sequences of length bptt_len

For instance, suppose we have the following corpus:

“Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed.”

Though this sentence is short, the actual corpus is millions of words long, so we can’t possibly feed it in all at once. We’ll want to divide the corpus into sequences of a shorter length. In the above example, if we wanted to divide the corpus into sequences of length 5, we would get the following:

[“Machine”, “learning”, “is”, “a”, “field”],

[“of”, “computer”, “science”, “that”, “gives”],

[“computers”, “the”, “ability”, “to”, “learn”],

[“without”, “being”, “explicitly”, “programmed”, EOS]

 

  • Generates target sequences that are the input sequences offset by one

In language modeling, the supervision signal is the next word in the sequence. We therefore want to generate target sequences that are the input sequences offset by one. In the above example, these are the sequences we would train the model to predict (a plain-Python sketch of this chunk-and-offset scheme follows the examples below):

[“learning”, “is”, “a”, “field”, “of”],

[“computer”, “science”, “that”, “gives”, “computers”],

[“the”, “ability”, “to”, “learn”, “without”],

[“being”, “explicitly”, “programmed”, EOS, EOS]
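
To make the chunk-and-offset scheme concrete, here is a toy, plain-Python sketch of the same idea. This is purely illustrative and is not how the BPTTIterator is implemented internally (in particular, it ignores batching the corpus into multiple columns):

corpus = ("Machine learning is a field of computer science that gives computers "
          "the ability to learn without being explicitly programmed .").split()
bptt_len = 5

# inputs are consecutive chunks of length bptt_len; targets are the same chunks shifted by one
inputs  = [corpus[i:i + bptt_len] for i in range(0, len(corpus) - 1, bptt_len)]
targets = [corpus[i + 1:i + 1 + bptt_len] for i in range(0, len(corpus) - 1, bptt_len)]

print(inputs[0])   # ['Machine', 'learning', 'is', 'a', 'field']
print(targets[0])  # ['learning', 'is', 'a', 'field', 'of']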

Here’s the code for creating the iterator:

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test),
    batch_size=32,
    bptt_len=30, # this is where we specify the sequence length
    device=0,
    repeat=False)

As always, it’s a good idea to take a look into what is actually happening behind the scenes.

>>> b = next(iter(train_iter)); vars(b).keys()
dict_keys(['batch_size', 'dataset', 'train', 'text', 'target'])

 

We see that we have an attribute we never explicitly asked for: target. Let’s hope it’s the target sequence.

>>> b.text[:5, :3]
Variable containing:
     9    953      0
    10    324   5909
     9     11  20014
    12   5906     27
  3872  10434      2

>>> b.target[:5, :3]
Variable containing:
    10    324   5909
     9     11  20014
    12   5906     27
  3872  10434      2
  3892      3  10780

Be careful: the first dimension of text and target is the sequence (time) dimension, and the second is the batch dimension. We see that the target is indeed the original text offset by 1 (shifted down by one row), which means we have all we need to start training a language model!
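
If you’d rather double-check with actual words than with indices, you can map the first column of the batch back through the vocabulary (a quick throwaway check; the exact words depend on how the corpus was batched):

seq = [TEXT.vocab.itos[i] for i in b.text[:, 0].data.cpu()]
tgt = [TEXT.vocab.itos[i] for i in b.target[:, 0].data.cpu()]
seq[:5], tgt[:5]  # tgt should be seq shifted forward by one word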

3. Training the Language Model

With the above iterators, training the language model is easy.

First, we need to prepare the model. We’ll be borrowing and customizing the model from the word-level language modeling example in the PyTorch examples repository.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable as V

class RNNModel(nn.Module):
    def __init__(self, ntoken, ninp,
                 nhid, nlayers, bsz,
                 dropout=0.5, tie_weights=True):
        super(RNNModel, self).__init__()
        self.nhid, self.nlayers, self.bsz = nhid, nlayers, bsz
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)
        self.init_weights()
        self.hidden = self.init_hidden(bsz) # the input is a batched consecutive corpus
                                            # therefore, we retain the hidden state across batches

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, input):
        emb = self.drop(self.encoder(input))
        output, self.hidden = self.rnn(emb, self.hidden)
        output = self.drop(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1))

    def init_hidden(self, bsz):
        weight = next(self.parameters()).data
        return (V(weight.new(self.nlayers, bsz, self.nhid).zero_().cuda()),
                V(weight.new(self.nlayers, bsz, self.nhid).zero_().cuda()))
 
    def reset_history(self):
        self.hidden = tuple(V(v.data) for v in self.hidden)

The language model itself is simple: it takes a sequence of word tokens, embeds them, puts them through an LSTM, then emits a probability distribution over the next word for each input word. We’ve made slight modifications, like saving the hidden state in the model object and adding a reset_history method. The reason we retain the hidden state is that the entire dataset is one continuous corpus, so we want to carry the hidden state across consecutive batches. Of course, we can’t retain the entire computation history (it would be too costly), so we periodically reset it during training.
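
Re-wrapping the data in fresh Variables is just a way of cutting the autograd graph (the same effect as calling .detach()), so backpropagation stops at the start of the current batch. A minimal, standalone illustration of the idea:

v = V(torch.zeros(3), requires_grad=True)
w = v * 2                  # w is connected to v through the autograd graph
truncated = V(w.data)      # same values, but the history is gone
print(w.grad_fn)           # a graph node
print(truncated.grad_fn)   # None -- gradients can no longer flow back to v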

To use the precomputed word embeddings, we’ll need to pass the initial weights of the embedding matrix explicitly. The weights are contained in the vectors attribute of the vocabulary.

BATCH_SIZE = 32  # must match the batch_size we gave the BPTTIterator
weight_matrix = TEXT.vocab.vectors
model = RNNModel(weight_matrix.size(0), weight_matrix.size(1), 200, 1, BATCH_SIZE)

model.encoder.weight.data.copy_(weight_matrix)
model.cuda()
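
A quick, optional sanity check that the pretrained vectors actually made it into the embedding layer:

torch.equal(model.encoder.weight.data.cpu(), weight_matrix)  # should be True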

Now we can begin training the language model. We’ll use the Adam optimizer here. For the loss, we’ll use the nn.CrossEntropyLoss function, which takes the index of the correct class as the ground truth instead of a one-hot vector. Unfortunately, it only accepts inputs with 2 dimensions (or 4, for images), so we’ll need to do a bit of reshaping.

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.7, 0.99))
n_tokens = weight_matrix.size(0)
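
To see concretely what that reshaping looks like, here’s a throwaway sketch on random data. The shapes mirror our bptt_len of 30 and batch size of 32; none of this is needed for training, it just shows the view calls we’ll use in the loop below:

dummy_pred = V(torch.randn(30, 32, n_tokens))                    # (seq_len, batch_size, n_tokens)
dummy_target = V(torch.LongTensor(30, 32).random_(0, n_tokens))  # (seq_len, batch_size)

flat_pred = dummy_pred.view(-1, n_tokens)  # (seq_len * batch_size, n_tokens)
flat_target = dummy_target.view(-1)        # (seq_len * batch_size,)
criterion(flat_pred, flat_target)          # a single scalar loss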

Next, we’ll write the training loop:

from tqdm import tqdm

def train_epoch(epoch):
    """One epoch of a training loop"""
    model.train()  # make sure dropout is on (the validation pass below switches the model to eval mode)
    epoch_loss = 0
    for batch in tqdm(train_iter):
        # reset the hidden state, or else the model will try to backpropagate to the
        # beginning of the dataset, requiring lots of time and a lot of memory
        model.reset_history()

        optimizer.zero_grad()

        text, targets = batch.text, batch.target
        prediction = model(text)
        # pytorch currently only supports cross entropy loss for inputs of 2 or 4 dimensions.
        # we therefore flatten the predictions out across the batch axis so that they become
        # shape (batch_size * sequence_length, n_tokens)
        # and reshape the targets accordingly to shape (batch_size * sequence_length)
        loss = criterion(prediction.view(-1, n_tokens), targets.view(-1))
        loss.backward()

        optimizer.step()

        # weight the batch loss by the number of tokens in the batch
        epoch_loss += loss.data[0] * prediction.size(0) * prediction.size(1)

    epoch_loss /= len(train.examples[0].text)

    # monitor the loss on the validation set
    val_loss = 0
    model.eval()
    for batch in valid_iter:
        model.reset_history()
        text, targets = batch.text, batch.target
        prediction = model(text)
        loss = criterion(prediction.view(-1, n_tokens), targets.view(-1))
        val_loss += loss.data[0] * text.size(0) * text.size(1)  # weight by tokens, as above
    val_loss /= len(valid.examples[0].text)

    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))

and we’re set to go!

n_epochs = 2 
for epoch in range(1, n_epochs + 1):
    train_epoch(epoch)

Understanding the correspondence between the loss and the quality of the language model is very difficult, so it’s a good idea to check the outputs of the language model periodically. This can be done by writing a bit of custom code to map integers back into words based on the vocab:

import numpy as np

def word_ids_to_sentence(id_tensor, vocab, join=None):
    """Converts a sequence of word ids to a sentence"""
    if isinstance(id_tensor, torch.LongTensor):
        ids = id_tensor.transpose(0, 1).contiguous().view(-1)
    elif isinstance(id_tensor, np.ndarray):
        ids = id_tensor.transpose().reshape(-1)
    batch = [vocab.itos[ind] for ind in ids] # denumericalize
    if join is None:
        return batch
    else:
        return join.join(batch)

which can be run like this:

arrs = model(b.text).cpu().data.numpy()
word_ids_to_sentence(np.argmax(arrs, axis=2), TEXT.vocab, join=' ')

Limiting the results to the first few words, we get results like the following:

'<unk>   <eos> = = ( <eos>   <eos>   = = ( <unk> as the <unk> @-@ ( <unk> species , <unk> a <unk> of the <unk> ( the <eos> was <unk> <unk> <unk> to the the a of the first " , the , <eos>   <eos> reviewers were t'

It’s hard to assess the quality, but it’s pretty clear that we’ll need to do more work (and more training) to get the language model producing sensible text.

 

4. Conclusion

Hopefully, this tutorial provided basic insight into how to use torchtext for language modeling, as well as some of the more advanced features of torchtext like built-in datasets, custom tokenizers, and pretrained word embeddings.

In this tutorial, we used a very basic language model, but there are many best practices that can improve performance significantly. In a future post, I’ll discuss best practices for language modeling along with implementations.