The Transformers.

Suraj
3 min read · Apr 27, 2021

Hi there! Welcome to this blog, where we will look at an intuitive explanation of the Transformer architecture. Before we dive into dissecting the architecture of transformers, it is important to know the factors that led to their existence in the era of deep neural networks.

Figure 1: Transformers Architecture.

This can be traced back to the time when CNN (Convolutional Neural Network) based architectures dominated the state of the art in computer vision, while the domain of NLP was still waiting for an equivalent architecture to bloom. RNN (Recurrent Neural Network) and LSTM based architectures were actively deployed for tasks such as machine translation, where the input sentences were processed sequentially, word by word, by an encoder RNN to form a vector that was assumed to summarize the meaning of the entire sequence. This vector was then passed to a decoder RNN, which produced the translated output sequentially as well. The sequential processing made training significantly slow, and thus the need for a better architecture arose within the NLP community.

The need was soon fulfilled by the Transformer, which used a similar encoder-decoder strategy but processed the entire sequence at once, bringing the power of parallelization to the training procedure. The real breakthrough came when the Transformer showed that attention mechanisms alone, without CNNs or RNNs, could deliver high performance on NLP (and later CV) tasks.

Now that we know the factors that led to the birth of the Transformer, let's dive further into exploring its architecture. As mentioned above, the Transformer is an encoder-decoder network where both the encoder and the decoder are composed of multiple similar blocks stacked on top of each other. The motivation behind stacking layers is described in detail here. Briefly, the analogy is similar to how stacked CNN layers extract coarse-to-fine features (edges in the first few layers, up to denser representations of the region of interest in the final layers): the first few encoder blocks capture surface-level features such as POS tags, while the later blocks extract more detailed semantic representations. Let's dive further into deciphering the working and architecture of the encoder and decoder blocks within the Transformer.

A standard encoder block comprises a self-attention layer followed by a feed-forward neural network layer. The input text sequence is first encoded with a word-embedding method (GloVe, etc.) to generate a vector representation for each word. Positional encodings are then generated using sine-cosine functions and summed with these vectors to preserve the ordering of the words within the sequence. (Note that summation, rather than concatenation, is used to save model parameters.)
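Here is a minimal NumPy sketch of how the sin-cos positional encodings can be built and summed with the word embeddings. The sequence length, model dimension, and the random word_embeddings matrix are placeholders for illustration, not values from a real model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the sin-cos positional encoding matrix of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, np.newaxis]                 # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                              # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions use cosine
    return pe

seq_len, d_model = 6, 16
word_embeddings = np.random.randn(seq_len, d_model)               # stand-in for GloVe-style vectors
encoder_input = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)  # summed, not concatenated
```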

The resulting vector representations then go into the self-attention layer, which is good at modelling dependencies between different positions within the sequence. The Transformer employs a more sophisticated form of self-attention, multi-headed attention, in which several attention heads run in parallel. To read more about the attention mechanism, you can check out this blog.
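To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention for a single head in NumPy. The projection matrices are random placeholders rather than learned weights; a multi-headed version would simply run several such heads in parallel and concatenate their outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # project into queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])              # how much each word attends to every other word
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    return weights @ v                                    # weighted sum of the values

seq_len, d_model, d_head = 6, 16, 8
x = np.random.randn(seq_len, d_model)                     # encoder input (embeddings + positional encodings)
w_q, w_k, w_v = (np.random.randn(d_model, d_head) for _ in range(3))
attended = self_attention(x, w_q, w_k, w_v)               # shape (seq_len, d_head)
```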

The vectors produced by the multi-headed attention module are then fed into the feed-forward neural network. Each of these sub-layers is wrapped in a residual connection so that maximum information is retained as it flows through the encoder stack, and the output of one encoder block becomes the input to the next. The final output of the last encoder block is fed into the decoder, which is explained below.
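Putting the pieces together, the sketch below shows one encoder block: self-attention followed by a position-wise feed-forward network, each wrapped in a residual (skip) connection. It reuses self_attention from the previous snippet, all weights are random placeholders, layer normalization is omitted for brevity, and the projections are kept at d_model so the residual additions line up.

```python
def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network applied to each token independently."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2           # ReLU between two linear layers

def encoder_block(x, p):
    # Self-attention sub-layer with a residual connection
    x = x + self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    # Feed-forward sub-layer with a residual connection
    x = x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"])
    return x                                              # becomes the input to the next encoder block

d_model, d_ff, seq_len = 16, 64, 6
params = {
    "w_q": np.random.randn(d_model, d_model),
    "w_k": np.random.randn(d_model, d_model),
    "w_v": np.random.randn(d_model, d_model),
    "w1": np.random.randn(d_model, d_ff), "b1": np.zeros(d_ff),
    "w2": np.random.randn(d_ff, d_model), "b2": np.zeros(d_model),
}
x = np.random.randn(seq_len, d_model)
for _ in range(6):                                        # a stack of 6 encoder blocks
    x = encoder_block(x, params)                          # (a real model gives each block its own weights)
```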

A standard decoder block comprises an encoder-decoder attention layer sandwiched between a self-attention layer and a feed-forward neural network layer. In the encoder-decoder attention layer, the decoder attends over the output of the final encoder block using the output of its own previous sub-layer, and this is repeated in every decoder block. The last decoder block is followed by a linear projection and a softmax layer, which produce a probability distribution over the vocabulary for the next word. Generation continues until the end-of-sentence token is produced, at which point we can assume the Transformer has generated a translation of the fed input sequence. During training, this procedure is repeated over a dataset containing translations between two languages. Refer to the lego-block architecture shown in Figure 1 to get a visual intuition of the language-translation task.
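The generation loop itself can be sketched as greedy decoding: starting from a start-of-sentence token, the decoder stack predicts one word at a time until it emits the end-of-sentence token. In this sketch, decoder_stack and the token ids are hypothetical placeholders (it stands in for the full decoder with its self-attention, encoder-decoder attention over the final encoder output, and feed-forward layers), and softmax is reused from the earlier snippet.

```python
def greedy_decode(encoder_output, decoder_stack, sos_id=1, eos_id=2, max_len=50):
    """Generate a translation one token at a time until the end-of-sentence token appears."""
    output_ids = [sos_id]                                  # start with the <s> token
    for _ in range(max_len):
        # decoder_stack: self-attention over output_ids, then attention over encoder_output
        logits = decoder_stack(output_ids, encoder_output)  # scores for every word in the vocabulary
        probs = softmax(logits)                             # softmax layer over the vocabulary
        next_id = int(np.argmax(probs))                     # greedily pick the most likely next word
        output_ids.append(next_id)
        if next_id == eos_id:                               # stop once </s> is generated
            break
    return output_ids
```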

I hope this introduction gave you an intuition for the Transformer architecture.

Until next time!
