Paper summary: GPT 1 — Improving Language Understanding by Generative Pre-Training

Sanna Persson
5 min read · May 23, 2021


The first GPT paper by OpenAI is to this day one of the most ground-breaking papers in NLP. It popularized semi-supervised training of large transformer models for language understanding: generative pre-training on unlabelled text followed by supervised fine-tuning. By fine-tuning their model on specific tasks they achieved state-of-the-art results on several competitive benchmarks. The ideas from the first GPT model later evolved into GPT-2 and GPT-3, and NLP as it was known pre-GPT is now history.

The purpose of this paper summary is to give insight into the key points of the GPT model.

Key ideas

The main idea of GPT is quite simple: build the architecture from the decoder layers of the original transformer and pre-train the model on a large corpus of text consisting of books. The model is then fine-tuned on language understanding tasks, including classification, question answering and semantic similarity.

Model architecture and pre-training

The GPT model consists of 12 decoder layers of the original transformer stacked on top of each other. In the GPT paper, the architecture of each decoder layer is detailed in the figure below. If you are observant you will notice a slight difference from the original transformer paper’s figure of the decoder: the second multi-head attention step (the cross-attention over the encoder output) has been removed in GPT, which makes sense since there is no encoder. The paper does not discuss this difference anywhere else, so if you were to implement the model I would recommend checking out their code on GitHub or someone else’s implementation.

Figure from GPT paper
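
To make this concrete, below is a minimal PyTorch sketch of such a decoder-only stack, not the authors’ code: each block contains only masked self-attention followed by a position-wise feed-forward network, and the cross-attention sub-layer of the original decoder is simply absent since there is no encoder. The hyperparameters (12 layers, 768-dimensional states, 12 heads, 512-token context, GELU activations) follow the paper; the class and variable names are my own.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style decoder block: masked self-attention + feed-forward."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)      # residual connection + layer norm
        x = self.ln2(x + self.ff(x))
        return x

class GPT(nn.Module):
    """Token + learned position embeddings, 12 decoder blocks, LM head."""
    def __init__(self, vocab_size, max_len=512, n_layers=12, d_model=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([DecoderBlock(d_model) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        # idx: (batch, seq_len) tensor of token ids.
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)          # logits over the vocabulary at every position
```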

They pre-train the model on a large dataset of books on the task of predicting the next word given a fixed number of previous words. In the GPT paper the pre-training is described as unsupervised since no human labelling is required; labels still exist in the sense that the next word in the book is always known. The pre-training objective is known as language modelling, and formally it means maximizing the likelihood of each word u_i given the context of the k previous words and the model parameters Θ:

L1(U) = Σ_i log P(u_i | u_{i-k}, …, u_{i-1}; Θ)

The pre-training step is the most computationally expensive part of training the GPT model, and it is what builds the model’s underlying language understanding.
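
As a sketch of what this objective looks like in practice (my own simplified code, not the paper’s), the next-word prediction loss is an ordinary cross-entropy between the model’s logits at position i and the token actually observed at position i + 1; minimizing it maximizes the L1 objective above.

```python
import torch.nn.functional as F

def language_modelling_loss(model, tokens):
    """Next-word cross-entropy for a batch of token ids of shape (batch, seq_len)."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token i+1 from tokens up to i
    logits = model(inputs)                            # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Usage with the GPT sketch above:
# loss = language_modelling_loss(gpt, batch_of_token_ids); loss.backward()
```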

Zero-shot learning

Even before fine-tuning the GPT model on specific tasks, they tested it on those tasks. This is what is called zero-shot learning, since the model has not been trained on labelled data for the specific task before testing. The graph below shows GPT’s performance relative to state-of-the-art models as a function of the number of pre-training steps performed. Essentially, this is a measure of how well the language modelling learned during pre-training generalizes to other tasks.

Graph from GPT paper

It is clear that the GPT model is nowhere near state-of-the-art here, but remember that this is before any training on the specific tasks. The results showed great promise for the ideas of zero-shot learning and of pre-training on large language datasets, ideas that were later expanded upon in GPT-2 and GPT-3.

Fine-tuning on language understanding tasks

As we know, fine-tuning GPT achieved state-of-the-art results on many NLP benchmarks, but how did they fine-tune the model? The figure below shows a few examples of how the model can be trained for different specific NLP tasks.

Figure from GPT paper

The general idea of the GPT paper is to make minimal changes to the model for fine-tuning. They therefore structure the data for the fine-tuning task in a way that is compatible with the architecture.

In the case of multiple-choice questions, they combine the context with each answer alternative into a separate text sequence, and each sequence is fed through the GPT model independently. The transformer outputs for the different alternatives are then passed to an additional linear layer, and a softmax over the alternatives gives the probability of each answer being correct. In this way the entire GPT model is left intact; they have only added the linear layer at the end, plus a special delimiter token between context and answer and an extract token at the end of each sequence. In fact, only the transformer output at the extract token is fed to the linear layer, since it contains the context of all previous words.
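
A rough sketch of this setup (my own simplified code, assuming a transformer that returns hidden states of shape (n_choices, seq_len, d_model) and sequences that end with the extract token) could look like this:

```python
import torch
import torch.nn as nn

class MultipleChoiceHead(nn.Module):
    """Scores each "[start] context [delim] answer [extract]" sequence with one linear layer."""
    def __init__(self, transformer, d_model=768):
        super().__init__()
        self.transformer = transformer   # assumed to return hidden states, not vocabulary logits
        self.score = nn.Linear(d_model, 1)

    def forward(self, choice_tokens):
        # choice_tokens: (n_choices, seq_len), one sequence per answer alternative,
        # each run through the transformer independently.
        hidden = self.transformer(choice_tokens)          # (n_choices, seq_len, d_model)
        extract_state = hidden[:, -1, :]                  # state at the final [extract] token
        logits = self.score(extract_state).squeeze(-1)    # one scalar per alternative
        return torch.softmax(logits, dim=-1)              # probability of each answer
```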

During fine-tuning they train with a classification loss (or a similar task-specific loss) on the output from the linear layer; however, they also add the language modelling loss L1 specified above, multiplied by a constant.
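
Concretely, if L2 denotes the task-specific loss, the combined fine-tuning objective from the paper is

L3 = L2 + λ · L1

where λ is set to 0.5 in their experiments. Keeping the auxiliary language modelling term during fine-tuning improves generalization and speeds up convergence.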

Conclusion

The original GPT model is essentially a stack of decoder layers from the original transformer, pre-trained on a large dataset of books on the task of predicting the next word given the previous words. Together, these simple ideas of a big transformer and big data revolutionized the practice of NLP.

Hopefully this article has given you some insight into how the GPT model works. Let me know if you think I have missed anything important!
