Types of Transformers - I: BERT

Suraj
3 min read · Apr 27, 2021

Hi there! Welcome to this blog, where we will be discussing two types of Transformers: BERT and GPT. If you lack an intuition about the Transformer architecture, you may want to spend some time reading this blog first.

I hope by now we are well versed with the encoder and decoder blocks of the Transformer architecture. In the machine translation task, the encoder block helps us understand the source language and its context, whereas the decoder block maps the input language to the desired output language using a similar language-understanding scheme. Each of these blocks is capable of understanding language on its own, and thus we can use them separately for different language tasks.

If we stack the encoders, we get BERT, whereas if we stack the decoders, we get the GPT architecture; these are the two primary types of Transformer architectures, and most recent advancements build on one of them. In this blog, we will primarily cover the BERT model.

BERT can be used for tasks like machine translation, sentiment analysis, text summarization, question answering and many others. This is usually done in two steps, known as the pre-training stage and the fine-tuning stage. In the pre-training stage, the model learns language and context, whereas in the second stage, aka the fine-tuning stage, it is adapted to one of the aforementioned tasks. During pre-training, the model is trained on two unsupervised learning tasks: Masked Language Modelling and Next Sentence Prediction.

In the Masked Language Modelling task, BERT takes in sentences in which some tokens have been randomly replaced with [MASK], and the goal of the model is to predict the masked tokens. This helps BERT learn bi-directional representations and context within sentences.
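To get a feel for what masked-token prediction looks like in practice, here is a minimal sketch using the Hugging Face transformers library (mentioned later in this blog); the model name and example sentence are just illustrative choices, not something from the original post:

```python
# pip install transformers torch
from transformers import pipeline

# "fill-mask" loads a BERT model together with its masked-language-modelling head
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```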

In the Next Sentence Prediction task, BERT takes in two sentences and checks whether the second sentence actually follows the first in the original text or is just a random sentence. This helps BERT understand context across sentences.
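Here is a similar sketch for Next Sentence Prediction, again assuming the Hugging Face transformers library; the sentence pair is made up purely for illustration:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "She opened the fridge."
sentence_b = "There was nothing left to eat."

# the tokenizer packs both sentences into one input: [CLS] A [SEP] B [SEP]
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# index 0: sentence B follows sentence A, index 1: sentence B is a random sentence
print(torch.softmax(logits, dim=-1))
```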

After learning both of these tasks together, BERT understands language better. Then comes the fine-tuning stage, where the pre-trained BERT model is adapted to one of the desired tasks mentioned above by changing the last few layers. Training in the fine-tuning stage is usually done via supervised learning on a labelled dataset.

Figure 1: Input to BERT (image clipped from the original paper: https://arxiv.org/abs/1810.04805)

Now, let’s dive further into the complete flow. Let’s start with the input to BERT. For every word in the input text, we get token embeddings from pre-trained WordPiece embeddings. For more details, refer here. These are fused with positional and segment embeddings to retain the ordering of the inputs. For more details on these embeddings, refer here. The same is illustrated in Figure 1. Once the input to BERT is ready, it is packed so as to represent masked sentence pairs and fed to the stack of Transformer encoders, i.e. the BERT model, which outputs word vectors (T1 … Tm) for Masked Language Modelling along with a binary value C for Next Sentence Prediction. The word vectors are then passed into a final softmax layer with a categorical cross-entropy loss to train the model to understand language. Refer to Figure 2 for a representation of the same.
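If you want to see what these inputs look like concretely, here is a small sketch with the Hugging Face BERT tokenizer; the sentences are arbitrary, and this is just one possible way to prepare BERT inputs:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# two sentences are packed into a single input: [CLS] A [SEP] B [SEP]
encoded = tokenizer("How are you?", "I am doing great.", return_tensors="pt")

print(encoded["input_ids"])       # WordPiece token ids; token embeddings are looked up from these
print(encoded["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
# position embeddings are added inside the model based on each token's index in the sequence
```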

Figure 2: Pre-training phase (image clipped from the original paper: https://arxiv.org/abs/1810.04805)

Once the pre-training stage is complete, we can hop to the fine-tuning phase, where a labelled dataset is used to fine-tune the last few layers of the pre-trained BERT model for the desired task.

You can find pre-trained BERT model variants on TensorFlow Hub and Hugging Face and experiment with them for your own use case with ease.

You can explore the following Google Colab notebook to check out how BERT can be used for sentiment detection in text. Similarly, it can be used for Named Entity Recognition, Question Answering, etc.

Google Colab Notebook for Sentiment detection using BERT
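As a complement to the notebook, here is a minimal fine-tuning sketch for binary sentiment classification, assuming the Hugging Face transformers and PyTorch libraries; the texts, labels and hyperparameters are purely illustrative:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# a fresh classification head is added on top of the pre-trained encoder
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# a tiny illustrative batch; a real run would loop over a labelled dataset for several epochs
texts = ["I loved this movie!", "The service was terrible."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss is computed internally
outputs.loss.backward()
optimizer.step()
```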

I hope this blog gave you an intuition about BERT. Until next time!

