Types of Transformers-II: GPT

Suraj
2 min read · Apr 28, 2021
Figure 1: Basic GPT Architecture

Hi there! Welcome to the second leg of our blog series on the two primary types of transformers. In this blog, we will discuss the GPT architecture, which is built by stacking the decoder modules of the Transformer architecture one over another. GPT stands for Generative Pre-Training of a language model. If you are not aware of the differences between generative and discriminative models, you can have a look at this intuitive explanation here.
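To make the "stacking the decoder modules" idea concrete, here is a minimal sketch of a decoder-only Transformer in PyTorch. It is an illustration rather than the exact GPT implementation: the class names (DecoderBlock, MiniGPT) are my own, though the default sizes (12 layers, 768-dimensional states, 12 heads, 512-token context) follow the GPT paper. The key ingredient is the causal mask, which stops every position from attending to tokens that come after it.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder module: masked self-attention followed by a feed-forward net."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Each position may only attend to itself and earlier positions.
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.ff(x))
        return x

class MiniGPT(nn.Module):
    """Decoder blocks stacked one over another, topped by a next-token head."""
    def __init__(self, vocab_size, n_layers=12, d_model=768, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([DecoderBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        # Boolean mask: True marks future positions that must not be attended to.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=token_ids.device), diagonal=1)
        for block in self.blocks:
            x = block(x, mask)
        return self.head(x)  # logits over the vocabulary at every position

# Quick shape check (GPT used a byte-pair vocabulary of roughly 40,000 merges).
logits = MiniGPT(vocab_size=40000)(torch.randint(0, 40000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 40000])
```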

Training a GPT model doesn't need labelled data: the model only has to predict the next language token given the previous ones, so the labels come for free from the text itself, and such data already exists as written text across the entire internet in a humongous amount. Some of the amazing use cases of a GPT model can be viewed here.
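A tiny illustration of why no labelling is needed: the targets for next-token prediction are simply the same token sequence shifted by one position. The toy token list below is made up for the example.

```python
# Toy example: labels for next-token prediction come directly from raw text.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

inputs  = tokens[:-1]   # what the model sees at each step
targets = tokens[1:]    # what it must predict: the sequence shifted by one

for x, y in zip(inputs, targets):
    print(f"given '{x}' -> predict '{y}'")
```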

Similar to BERT, GPT also has two stages: pre-training and fine-tuning. In the unsupervised pre-training stage, the model is trained on a corpus (BookCorpus) on the task of predicting the next language token given the previous ones. Training continues until the error between the predicted and the actual next token is minimized; this next-token objective is what the HuggingFace glossary refers to as causal language modelling. Input to GPT is tokenized and encoded in a similar way to BERT and fed to the model as vectors.
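As a rough sketch of this objective, the snippet below uses the HuggingFace `transformers` library with the publicly released GPT-2 weights (a later, larger GPT, but trained with the same next-token objective). Passing the input ids as the labels makes the library shift them internally and compute the cross-entropy loss that pre-training minimizes.

```python
from transformers import GPT2TokenizerFast, GPT2LMHeadModel

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Tokenize a raw sentence; no separate labels are needed.
enc = tokenizer("GPT is trained to predict the next token", return_tensors="pt")

# Passing input_ids as labels tells the library to shift them internally and
# compute cross-entropy between each predicted and actual next token.
out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])

print(out.loss)          # scalar pre-training loss to be minimized
print(out.logits.shape)  # (batch, sequence_length, vocab_size)
```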

In the fine-tuning stage, the last few layers are fine-tuned in a manner similar to BERT for one of the downstream tasks, such as natural language inference.
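Below is a hedged sketch of such fine-tuning, again assuming the HuggingFace `transformers` library and GPT-2 weights; the three-way label set, the text delimiter, and the choice to unfreeze only the top two decoder blocks are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import GPT2TokenizerFast, GPT2ForSequenceClassification

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

# Hypothetical 3-way NLI label set: 0=entailment, 1=neutral, 2=contradiction.
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id

# Freeze the whole pre-trained stack, then unfreeze only the top two
# decoder blocks and the freshly added classification head.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.score.parameters():
    param.requires_grad = True

premise_and_hypothesis = "A man is sleeping. </s> The man is wide awake."
enc = tokenizer(premise_and_hypothesis, return_tensors="pt")
out = model(**enc, labels=torch.tensor([2]))  # 2 = contradiction in our made-up mapping
out.loss.backward()                           # gradients reach only the unfrozen layers
```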

For more details, refer to the paper.

