People have been doing a lot of work in natural language processing, like parsing, translation, and finding entities, all of it towards making machines understand text a little bit.
As you've seen in the previous post, the main thing these models do is generate the next word. They look at the past using an RNN, or an LSTM cell, to generate the next word of the output.
These architectures have some problems, and for a long time we were thinking about what the gist of the problem is, what is not so great about them.
Long-range language model
You do it on the level of tokens or words, so a sentence may have 40 words, a long one maybe 60. But when we speak, we think about context that goes way beyond that: when you start talking to a friend about things you talked about 20 years ago, you immediately recall what's needed. That seems to be one really important thing to solve and tackle to make these networks understand more.
If we train the network with a long range, it can look back thousands of steps and remember them. Imagine we just trained it to generate text, and since we want it to be longer range, let's say we train it on Wikipedia articles. That's called a language model: something that just generates the next word, and since it's a model it's probabilistic, so it defines a probability distribution over articles.
An RNN or LSTM has the problem that if you try to generate 2,000 words, its states and the gating in the LSTM make the gradient vanish. It's called long short-term memory, but it's not that long.
And if you use an RNN, it needs to go one word at a time to get to the words far in the past, so there is no way to parallelize over the sequence.
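To make that sequential dependence concrete, here is a minimal toy sketch (my own illustration in NumPy, not any real model): each step needs the previous hidden state, so reaching a word 2,000 positions back means 2,000 sequential updates.

```python
import numpy as np

# Toy recurrent step: the hidden state is updated one word at a time,
# so information from 2,000 steps back has to survive 2,000 updates.
def rnn_step(h, x, W_h, W_x):
    return np.tanh(h @ W_h + x @ W_x)

rng = np.random.default_rng(0)
d = 8                                   # hidden size (illustrative)
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
words = rng.normal(size=(2000, d))      # stand-in embeddings for 2,000 words

for x in words:                         # strictly sequential: no way to parallelize over time
    h = rnn_step(h, x, W_h, W_x)
```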
Attention (Reference by Content)
It was already introduced with LSTMs. Attention is something where you make a query with a vector and then basically look at similar things in your past.

Convolutions are quite a positional thing: you have a different color for every position. Attention looks at everything but retrieves the things that are similar. That is the general idea, and because you retrieve by similarity, you can look at a very long context.
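A rough sketch of the difference (illustrative NumPy; all names and sizes here are made up): a convolution-style lookup indexes the past by position, while content-based attention scores every past vector against the query and retrieves a similarity-weighted mixture, no matter how far back it sits.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
memory = rng.normal(size=(1000, 16))             # 1,000 past word vectors
query = memory[3] + 0.01 * rng.normal(size=16)   # we are "asking about" something said long ago

# Positional lookup (convolution-style): you must know the offset in advance.
by_position = memory[-5]                         # "the word 5 steps back"

# Content-based lookup (attention): score everything, retrieve what is similar.
weights = softmax(memory @ query)                # peaked near position 3, however far back it is
by_content = weights @ memory                    # similarity-weighted mixture of the past
```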
Self-Attention
Imagine you're producing the name of a group member: now you can attend however many steps back. You can see that earlier I said it was Jackson, so let me say it again. With a CNN you could do this, but you would need to know the exact position, and you would need to compress the data very much. With attention, since it's content-based querying, you can do this quickly.
3 Ways of Attention
What I showed you is a language model, meaning it just decodes; it has no inputs. But for translation you need an input, so you need attention that comes from the output and attends to the input.

You also need attention on the input, which can attend everywhere.

And the attention on the output only attends to the things before it.
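As a rough illustration (my own sketch, not the paper's code), the three kinds of attention differ only in which positions are allowed to be seen:

```python
import numpy as np

T_in, T_out = 6, 5          # input (encoder) and output (decoder) lengths, illustrative

# 1. Attention on the input (encoder self-attention): every input position
#    may look at every input position.
enc_self_mask = np.ones((T_in, T_in), dtype=bool)

# 2. Attention from the output to the input (encoder-decoder attention):
#    every output position may look at every input position.
enc_dec_mask = np.ones((T_out, T_in), dtype=bool)

# 3. Attention on the output (decoder self-attention): an output position may
#    only look at earlier outputs, otherwise it would see what it must predict.
dec_self_mask = np.tril(np.ones((T_out, T_out), dtype=bool))
```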

The Transformer Model
It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

We have some inputs, let's say the English sentence, and then there's a multi-head attention layer. Then there's a feed-forward layer, where every word is processed on its own, and that's the processing of the input.
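Here is a hedged sketch of that feed-forward part (illustrative NumPy; the sizes 512 and 2048 come from the paper's base model, the rest is made up): the same small two-layer network is applied to every word independently, which is why it parallelizes so well.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward: the same two-layer net applied to every word."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two layers

rng = np.random.default_rng(2)
d_model, d_ff, T = 512, 2048, 10                  # base-model sizes; T words in the sentence
x = rng.normal(size=(T, d_model))                 # T words after the attention sub-layer
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
out = feed_forward(x, W1, b1, W2, b2)             # shape (T, d_model), one vector per word
```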
Masked Attention
When we start generating output, we need this masked attention. In an RNN it's very clear that you go from one word to the next. In the more standard attention mechanisms that have been around for a while, you attend to everything, both backward and forward. But if you want to make a language model, you cannot do this, because the model would see the very words it is supposed to predict.
Masked Multi-Head Attention
In the Transformer model, the outputs come in, and then there is the masked attention, which is a single matrix multiply with a mask. All of this is done in parallel: you have a thousand words that come in as one tensor that's a thousand by something, and every word needs to attend to all the words before it, which is done by multiplying everyone with everyone and masking out half of the result with zeroes.
Masking Step: this step does attention for maybe 2,000 words with 2,000 words; everyone-to-everyone is a single matrix multiply, and GPU hardware is extremely optimized for large matrix multiplies, so this happens very quickly. Then you do this again, but attending to the encoder. Then come the feed-forward layer and the predictions. In the sense of neural network architecture this is a very simple few-layer feed-forward architecture; in a way it's much simpler than the RNN, since it has no recurrence. Everything happens in a bunch of masked matrix multiplies.
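In rough NumPy terms (a sketch, not the actual implementation), the masking step looks like this: one matrix multiply scores every word against every word, and the future part of the score matrix is pushed to a large negative value before the softmax, so those positions get zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
T, d = 2000, 64                                   # 2,000 words, key dimension 64
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)                     # everyone-to-everyone: one big matmul
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = future positions
scores = np.where(mask, -1e9, scores)             # mask out the future before the softmax
out = softmax(scores) @ V                         # each word only mixes in the words before it
```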
Dot-Product Attention (Key-Value Pairs)

The current word (Q) is what I'm operating on, and these (K, V) are all the words in the past that I can look at.
You take the query and multiply it by the transposed keys, which is a matrix multiply. Then you take a softmax, which is exponentiation and normalization.

This gives you a mask, a probability distribution over the keys that is peaked at the ones similar to the query. You then matrix-multiply this mask by the values, which is the same as summing the values weighted by the mask.
You've done two matrix multiplies and one softmax, which is very fast. You also need to normalize it a little bit (scaling by the square root of the key dimension) to train well.
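Putting it together, a minimal sketch of the whole operation (illustrative NumPy; the division by √d_k is that little normalization):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Two matmuls and a softmax: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # matmul 1: compare each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: peaked at similar keys
    return weights @ V                       # matmul 2: sum the values under that distribution

rng = np.random.default_rng(4)
Q = rng.normal(size=(5, 64))                 # 5 queries (current words)
K = rng.normal(size=(80, 64))                # 80 keys   (the words we can look at)
V = rng.normal(size=(80, 64))                # 80 values (what we actually retrieve)
out = dot_product_attention(Q, K, V)         # shape (5, 64)
```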
Multi-Head Attention
One problem with self-attention is that it's just a similarity measure: it would work the same if this were just a set of words. It has no idea that one word comes after another, because it only retrieves the most similar ones. But order really matters; the order of words is not arbitrary, and you cannot just reorder them and hope it will translate well or generate the right next word. So you need to add some timing signal, a little bit of positional information.

In multi-head attention with positional signals, there are multiple attention heads looking at different words, and they know the positions.
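A compact sketch of both ideas together (illustrative NumPy; the sinusoidal timing signal follows the paper, but the per-head slicing here just stands in for the learned projections a real implementation would use):

```python
import numpy as np

def positional_signal(T, d_model):
    """Sinusoidal timing signal added to the embeddings so word order is not lost."""
    pos = np.arange(T)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    signal = np.zeros((T, d_model))
    signal[:, 0::2] = np.sin(angles)
    signal[:, 1::2] = np.cos(angles)
    return signal

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads=8):
    """Each head attends over a smaller slice, so different heads can look at different words."""
    T, d_model = x.shape
    d_head = d_model // heads
    outputs = []
    for h in range(heads):                      # a real model uses learned Q/K/V projections per head
        sl = slice(h * d_head, (h + 1) * d_head)
        Q = K = V = x[:, sl]                    # toy "projection": just a slice of the embedding
        scores = Q @ K.T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ V)
    return np.concatenate(outputs, axis=-1)     # heads are concatenated back to d_model

rng = np.random.default_rng(5)
T, d_model = 20, 512
x = rng.normal(size=(T, d_model)) + positional_signal(T, d_model)
out = multi_head_attention(x)                   # shape (20, 512)
```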
Conclusion
Google built this model with multiple layers of these attentions and trained it on translation. Deep learning is a trade: you need all the tricks to make things actually work well. You need dropout in the right places, the Adam optimizer, some thought about your learning rate and how to decay it, and label smoothing in the softmax; and then, when you actually start decoding from the model, you need to pay some attention to what you're doing. There are a bunch of technical items, and it's very hard to get them all right on your own. So the Google Brain team decided to make a library where you could at least look at the code and get these things right.
The Tensor2Tensor library is a framework and a system built on top of TensorFlow.