People have been working on many tasks in natural language processing, such as parsing, translation, and finding entities, all aimed at making machines understand text at least a little.

As you saw in the previous post, the main thing these models do is generate the next word: they look at the past using an RNN, or an LSTM cell, to generate the next word of the output.

These architectures have some problems, and we thought for a long time about the gist of the problem, what is not so great about them. It turns out that these neural networks translate sentence by sentence.

Long-Range Language Models

You do it on the level of tokens or words, so a sentence may have 40 words, a long one maybe 60. But when we speak, we think about context that goes way beyond that: when you start talking to your friend about things you talked about 20 years ago, you immediately recall what is needed. Handling that kind of context seems to be one really important thing to tackle to make these networks understand more.

If we train the network with long range, it can look back thousands of steps and remember them. Imagine we just train it to generate text, to write something, and since we want it to be longer-range, let's say we train it on Wikipedia articles. That is called a language model: something that just generates the next word. Since it is a model, it is probabilistic, a probability distribution over various articles.

An RNN or LSTM has a problem: if you try to generate 2,000 words, the states and the gating in the LSTM start to make the gradient vanish. It is called long short-term memory, but it is not that long.

If you use an RNN, it goes one word at a time: to compute the cell for the last word, you need to have computed all the cells before it. People started trying to solve this with convolutions: you take a long sequence and apply convolutions over it.

Attention (Reference by Content)

It was introduced already alongside LSTMs. Attention is a mechanism where you make a query with a vector and then look at similar things in your past.


Convolutions are quite a positional thing: you can picture every position as having a different color. Attention, instead, looks at everything but retrieves the things that are similar. That is the general idea. When you retrieve by similarity, you can look at a very long context.


Imagine you are producing the name of a group member: you can now attend however many steps back. You can see that earlier it said Jackson, so you say it again. With a CNN you could do this too, but you would need to know the exact position and compress the data very much. With attention, since the querying is content-based, you can do this quickly.

3 Ways of Attention

What I showed you is a language model: it just decodes, it has no inputs. But for translation you need an input, so you need attention that goes from the output and attends to the input.

Encoder-Decoder Attention

The output attends to the input: each decoder position can look at every encoder position.

Encoder Self-Attention

Within the input, every position can attend everywhere, forwards and backwards.

Masked Decoder Self-Attention

Within the output, each position only attends to the positions before it.
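These three patterns can be pictured as boolean masks over (query position, key position) pairs. A small sketch, with made-up sequence lengths for illustration:

```python
import numpy as np

n_in, n_out = 4, 3  # encoder (input) length, decoder (output) length

# Encoder self-attention: every input position may attend everywhere.
enc_self = np.ones((n_in, n_in), dtype=bool)

# Encoder-decoder attention: every output position may attend to every input.
enc_dec = np.ones((n_out, n_in), dtype=bool)

# Masked decoder self-attention: output position i attends only to j <= i.
dec_self = np.tril(np.ones((n_out, n_out), dtype=bool))
```

Only the decoder's self-attention needs the triangular mask; the other two patterns are unrestricted.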

The Transformer Model

It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Transformer model architecture

We have some inputs, let's say the English sentence, and then there is a multi-head attention layer, and then a feed-forward layer in which every word is processed. That is the processing of the input.

Masked Attention

When we start generating output, we need this masked attention. In an RNN it is very clear: you go from one word to the next. In the more standard attention mechanisms that have been around for a while, you attend to everything, both forwards and backwards. But if you want to make a language model, you cannot do this, because in the end you are predicting the output, and attending forwards would let the model see the very words it is supposed to predict.

Masked multi-head attention

In the Transformer model, the outputs come in and go through the masked attention, which is a single matrix multiply with a mask. All of this is done in parallel: a thousand words come in as one tensor that is a thousand by something, and every word i needs to attend to all the words before it, which is done by multiplying everyone with everyone and masking out half of the result with zeroes.

Masking step: this step does attention for, say, 2,000 words with 2,000 words, everyone to everyone, as a single matrix multiply, and GPU hardware is extremely optimized for large matrix multiplies, so this happens very quickly. Then you do it again, but multiplying with the encoder output. Then come the feed-forward layers and the predictions. In the sense of neural network architecture, this is a very simple few-layer feed-forward architecture; in a way it is much simpler than an RNN, since it has no recurrence and no hidden state to carry along. Everything happens in a bunch of masked matrix multiplies.
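A minimal NumPy sketch of this masking step, assuming for simplicity that the queries, keys, and values are all the raw word vectors (a real Transformer uses learned projections). Every word is scored against every word in one matrix multiply, and future positions are masked out before the softmax:

```python
import numpy as np

def causal_mask(n):
    # Lower-triangular mask: position i may attend to positions 0..i only.
    return np.tril(np.ones((n, n)))

def masked_self_attention(X, mask):
    # One big matrix multiply scores every word against every word in parallel.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    # Masked-out (future) positions get a huge negative score,
    # so the softmax gives them weight ~0.
    scores = np.where(mask == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

X = np.random.randn(5, 8)  # 5 words, 8-dimensional vectors
out = masked_self_attention(X, causal_mask(5))
```

Note that the first position can only attend to itself, so its output equals its input vector; every later position mixes in the words before it.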

Dot-Product Attention (Key-Value Pairs)

In attention, there is a query, which is a vector, and then there are key and value matrices, which are your memory.

Dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The current word (Q) is what I am operating on, and (K, V) are all the past words generated before; the keys and values can even be the same thing. What you want to do in attention is take the query, find the most similar keys (as I said, it is a similarity operation), and then get the values that correspond to those similar keys.

You take the query and multiply it by the transposed keys; that is a matrix multiply. Then you take a softmax, which is an exponentiation and a normalization.

This gives you a mask, a probability distribution over keys that is peaked at the ones similar to the query. You then matrix-multiply this mask by the values, which is the same as summing over the values weighted by the mask.

You have done two matrix multiplies and one softmax operation, which is very fast. You also need to normalize the scores a little bit, scaling them by the square root of the key dimension, to train well.
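The two matrix multiplies and the softmax can be sketched in a few lines of NumPy; the √d_k scaling is the small normalization mentioned above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(Q, K, V):
    # Matrix multiply 1: compare the query against every key at once.
    scores = Q @ K.T
    # Scale by sqrt(d_k) so the softmax is not too peaked.
    weights = softmax(scores / np.sqrt(K.shape[-1]))
    # Matrix multiply 2: weighted sum over the values.
    return weights @ V

# Memory of 10 past words; keys and values can be the same thing:
K = V = np.random.randn(10, 16)
q = np.random.randn(1, 16)  # one query vector
out = dot_product_attention(q, K, V)
```

Because the softmax weights sum to one, the result is always a weighted average of the value vectors, peaked at the values whose keys match the query.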

Multi-Head Attention

One problem with self-attention is that it is just a similarity measure: it would work the same if the input were just a set of words. It has no idea that one word comes after another, because it just retrieves the most similar ones. But the order of words is not arbitrary: you cannot reorder them and hope the model will still translate well or generate the right next word. You need to add a timing signal, some positional information.

Multi-Head Attention

In multi-head attention with positional signals, there are multiple attention heads looking at different words, and they know the positions.
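A minimal sketch of multi-head attention with a sinusoidal timing signal, with random matrices standing in for the learned per-head projections (in a trained model these are parameters, not random):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def positional_signal(n, d):
    # Sinusoidal timing signal: each position gets a distinct pattern of
    # sines and cosines, so attention can tell word order apart.
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def multi_head_attention(X, heads, rng):
    n, d = X.shape
    dh = d // heads  # each head works in a smaller subspace
    outputs = []
    for _ in range(heads):
        # Random stand-ins for the learned query/key/value projections.
        Wq, Wk, Wv = (rng.standard_normal((d, dh)) / np.sqrt(d)
                      for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        outputs.append(softmax(Q @ K.T / np.sqrt(dh)) @ V)
    # Concatenate the heads back to the model dimension.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
# 6 words, 16-dimensional, with the timing signal added to the embeddings:
X = rng.standard_normal((6, 16)) + positional_signal(6, 16)
out = multi_head_attention(X, heads=4, rng=rng)
```

Each head can attend to different words, and because the timing signal is baked into the vectors, similarity comparisons can also be sensitive to position.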


Google built this model with multiple layers of these attention blocks and trained it on translation. Deep learning is a trade: you need to apply all the tricks to make things actually work well. You need dropout in the right places, the Adam optimizer (with some thought about the learning rate and how to decay it), and label smoothing in the softmax; and when you start decoding from the model, you need to pay attention to what you are doing. There is a bunch of technical items, and it is very hard to get all of this right on your own. So the Google Brain team decided to make a library where you could at least look at the code and get these things right.

The Tensor2Tensor library is a framework and a system built on top of TensorFlow. It has all the tricks we were talking about: the optimizer, label smoothing, and so on.