Implementing Transformer — Attention is all you need!

Jieun Jeon
Oct 26, 2021

I am going to write a series of articles covering the process of reading and understanding the paper, working through its architecture and details, and then implementing it myself.

The paper to be implemented is Attention is all you need!

This paper made a great contribution to Neural Machine Translation. We will split the process into steps: studying the background knowledge you need before reading the paper, reviewing the paper and its details, and then implementing it. I will blog each step along the way.

This is the series of posting:

  1. Seq2seq and attention model (this article)
  2. Paper Review: Model Architecture & Implementation Details
  3. Paper Implementation: Transformer Layer & Performance Experiments and Improvements

Before we dive into the paper, let’s look at the RNN-based seq2seq models that were mainly used before the concept of attention. We can take a look at the problems of the existing seq2seq approach and how attention solved them.

Problems of the existing RNN-based seq2seq

You can’t cram the meaning of a whole %&!$ing sentence into a single $&!*ing vector!

Ray Mooney

In an RNN-based encoder-decoder seq2seq model, the encoder compresses the input sequence into a single fixed-size vector representation called the context vector, and the decoder generates the output sequence from this context vector.
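As a rough illustration of this bottleneck, here is a minimal PyTorch sketch of an RNN encoder-decoder (the module names and sizes are my own, not code from this series): everything the decoder knows about the source sentence has to pass through the single context tensor.

```python
import torch
import torch.nn as nn

# Minimal RNN encoder-decoder sketch (illustrative sizes, not the code from this series).
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder compresses the whole source sentence into one fixed-size vector.
        _, context = self.encoder(self.src_emb(src_ids))         # (1, batch, hid_dim)
        # The decoder has to generate every output word from that single context vector.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_out)                                 # (batch, tgt_len, tgt_vocab)
```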

However, this RNN-based seq2seq model has two major problems.

First, we are trying to compress all the information into one fixed-size vector, resulting in information loss.
Second, there is the problem of vanishing gradient, which is a chronic problem of RNNs.

In other words, in machine translation the quality of the translation deteriorates when the input sentence is long. As an alternative, attention was introduced: a technique that compensates for the loss of accuracy in the output sequence when the input sequence is long.

  • It is quite unreasonable to compress an entire sentence into a single context vector of fixed length and pass it on.
  • The existing problem with RNNs is that information loss can occur as the sequence gets longer.
  • The attention mechanism was introduced to compensate for these shortcomings.

Attention

The basic idea of attention is that at every time step where the decoder predicts an output word, it consults the entire input sentence at the encoder once again. However, instead of attending to the whole input sentence equally, the decoder pays more attention to the parts of the input that are related to the word being predicted at that time step.

  • Encode each word in the sentence into a vector
  • When decoding, perform a linear combination of these vectors, weighted by “attention weights”
  • Use this combination in picking the next word (see the sketch under “Calculating Attention” below)

Calculating Attention

There are various types of attention that can be used in the seq2seq + attention model; what distinguishes them is the intermediate formula, namely the attention score function.

1. Use “query” vector and “key” vectors

  • Use “query” vector (decoder state) and “key” vectors (all encoder states)
  • For each query-key pair, calculate weight
  • Normalize to add to one using softmax

2. Combine together value vectors by taking the weighted sum

  • Combine together value vectors (usually encoder states, like key vectors) by taking the weighted sum
  • Use this in any part of the model you like
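Putting the two steps together, here is a minimal PyTorch sketch of this calculation with a simple dot-product score (the function and variable names are my own, and the encoder states serve as both keys and values):

```python
import torch

def attention(query, keys, values):
    """query: (d,) decoder state; keys, values: (src_len, d) encoder states."""
    scores = keys @ query                    # one score per query-key pair (dot-product score)
    weights = torch.softmax(scores, dim=-1)  # normalize so the weights add to one
    context = weights @ values               # weighted sum of the value vectors
    return context, weights

# Toy example: 4 source positions, hidden size 8.
enc_states = torch.randn(4, 8)
dec_state = torch.randn(8)
context, weights = attention(dec_state, enc_states, enc_states)
print(weights.sum())  # tensor(1.)
```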

Attention Score Functions

There are also several ways to compute the attention score. In all of them, q is the query and k is the key.

Multi-layer Perceptron (Bahdanau et al. 2015)

  • Flexible, often very good with large data

Bilinear (Luong et al. 2015)

Dot Product (Luong et al. 2015)

  • No parameters! But requires sizes to be the same.

Scaled Dot Product (Vaswani et al. 2017)

  • Problem: the scale of the dot product increases as the dimensions get larger
  • Fix: scale by the size of the vector
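The formulas for these score functions were shown as figures in the original sources; as an implementation-oriented reference, here is a minimal PyTorch sketch of their standard forms (the parameter names and dimensions are my own, chosen for illustration):

```python
import torch
import torch.nn as nn

d_q = d_k = 8      # illustrative sizes; dot-product scores require d_q == d_k
d_hid = 16

# Multi-layer perceptron score (Bahdanau et al. 2015): learned and flexible.
W1 = nn.Linear(d_q + d_k, d_hid, bias=False)
w2 = nn.Linear(d_hid, 1, bias=False)
def mlp_score(q, k):
    return w2(torch.tanh(W1(torch.cat([q, k], dim=-1)))).squeeze(-1)

# Bilinear score (Luong et al. 2015): a single learned matrix between q and k.
W = nn.Linear(d_k, d_q, bias=False)
def bilinear_score(q, k):
    return (q * W(k)).sum(dim=-1)

# Dot product (Luong et al. 2015): no parameters, but q and k must be the same size.
def dot_score(q, k):
    return (q * k).sum(dim=-1)

# Scaled dot product (Vaswani et al. 2017): divide by sqrt(d_k) so the score's
# scale does not grow with the dimensionality.
def scaled_dot_score(q, k):
    return (q * k).sum(dim=-1) / (k.size(-1) ** 0.5)

q, k = torch.randn(d_q), torch.randn(d_k)
for fn in (mlp_score, bilinear_score, dot_score, scaled_dot_score):
    print(fn.__name__, fn(q, k).item())
```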

In this article, we looked at RNN-based seq2seq models and the attention mechanism that addresses their shortcomings, which is the basis of the Transformer.
The “Transformer” model is built entirely on self-attention mechanisms, without using a sequence-aligned recurrent architecture.

In the next article, we will review the Transformer paper, Attention is all you need.

References:

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

https://www.cs.cmu.edu/~bhiksha/courses/deeplearning/Fall.2015/slides/lec14.neubig.seq_to_seq.pdf

https://docs.likejazz.com/attention/
