BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations: a general-purpose "language understanding" model is trained on a large text corpus such as Wikipedia.

It builds deep bidirectional representations by conditioning on both left and right context in all layers. These representations can be fine-tuned with just one additional output layer to create NLP models without task-specific architecture modifications.

Limitations of the Standard Language Model

The major limitation of standard language models is that they are unidirectional, which restricts the choice of architectures that can be used during pre-training. For example, OpenAI GPT uses a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer.
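To make the contrast concrete, here is a minimal PyTorch sketch (not taken from either implementation) of the two attention patterns: a causal mask that hides future positions, as in a left-to-right model, versus the all-ones pattern of a bidirectional encoder.

```python
# Illustrative only: boolean attention masks for a 5-token sequence.
import torch

seq_len = 5

# Left-to-right (GPT-style): position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Bidirectional (BERT-style encoder): every position attends to every position.
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(bidirectional_mask.int())
```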

The Disadvantages of RNNs and LSTMs

RNNs, LSTMs, and GRUs rely mainly on sequential processing: long-term information has to travel sequentially through every cell before reaching the current one, so it can easily be corrupted along the way. This is the cause of vanishing gradients.

[Figure: RNN time-series input]

LSTMs remove some of the vanishing-gradient problem, but not all of it: there is still a sequential path from older cells to the current one. In fact, that path is now even more complicated because it has forget branches attached to it. LSTMs can remember sequences of hundreds of steps, but not thousands or tens of thousands, and they are also not hardware friendly.
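The sketch below (illustrative only, using PyTorch's RNNCell) shows why this processing is inherently sequential: the hidden state at step t cannot be computed until step t-1 has finished, so information from early tokens must survive every intermediate update.

```python
# Illustrative: recurrent processing of a 100-step sequence, one step at a time.
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=16, hidden_size=32)
inputs = torch.randn(100, 1, 16)   # 100 time steps, batch of 1
h = torch.zeros(1, 32)             # initial hidden state

for x_t in inputs:                 # strictly one step after another
    h = rnn_cell(x_t, h)           # all long-term information must pass through h

# A Transformer layer, by contrast, processes all positions in parallel and
# connects any two positions directly through self-attention.
```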

BERT Model Architecture

BERT is a multi-layer bidirectional Transformer encoder; the BERT Transformer uses bidirectional self-attention.
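A rough sketch of such an encoder stack using PyTorch building blocks is shown below (the official implementation is in TensorFlow). The hyperparameters are the published BERT-Base values: 12 layers, hidden size 768, 12 attention heads.

```python
# A simplified stand-in for the BERT encoder stack, not the official model.
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=768,            # hidden size H
    nhead=12,               # attention heads A
    dim_feedforward=3072,   # feed-forward size (4 * H)
    activation="gelu",
    batch_first=True,
)
bert_like_encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # L = 12

# No causal mask is passed, so self-attention is bidirectional: every token
# attends to both its left and right context in every layer.
```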


The Masked Language Model

BERT addresses the unidirectional constraint by proposing a new pre-training objective, the "masked language model" (MLM). It randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context.

Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon

The MLM objective allows the representation to fuse the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer.
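A simplified sketch of how such training examples could be prepared is shown below. The 15% masking rate and the 80/10/10 replacement split are the values reported in the paper; the helper function itself is only illustrative.

```python
# Illustrative MLM data preparation (not the official preprocessing code).
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                        # predict the original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(tok)                    # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(None)                       # not predicted
    return inputs, labels

tokens = "the man went to the store .".split()
print(mask_tokens(tokens, vocab=tokens))
```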

The next sentence prediction (NSP) task jointly pre-trains text-pair representations, as in the examples below.

Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence


Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
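Illustratively, sentence pairs could be packed for NSP as in the sketch below: 50% of the time B is the actual next sentence and 50% of the time it is a random sentence from the corpus. The [CLS]/[SEP] packing follows the paper; the helper function is hypothetical.

```python
# Illustrative NSP example construction.
import random

def make_nsp_example(sent_a, next_sent, random_sent):
    if random.random() < 0.5:
        sent_b, label = next_sent, "IsNextSentence"
    else:
        sent_b, label = random_sent, "NotNextSentence"
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    return tokens, label

tokens, label = make_nsp_example(
    "the man went to the store .",
    "he bought a gallon of milk .",
    "penguins are flightless .",
)
print(tokens, label)
```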

In order to train a deep bidirectional representation, BERT takes the straightforward approach of masking some percentage of the input tokens at random and then predicting only those masked tokens. The paper refers to this procedure as a "masked LM". The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, as in a standard LM.
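A minimal sketch of that output head follows; the hidden size and vocabulary size here are illustrative assumptions, not the published configuration.

```python
# Illustrative masked-LM output head: project each masked position's final
# hidden vector to vocabulary size and apply a softmax, as in a standard LM.
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30000              # assumed sizes for the sketch
mlm_head = nn.Linear(hidden_size, vocab_size)

final_hidden = torch.randn(2, hidden_size)        # hidden vectors of 2 masked tokens
logits = mlm_head(final_hidden)                   # shape: (2, vocab_size)
probs = torch.softmax(logits, dim=-1)             # distribution over the vocabulary
predicted_ids = probs.argmax(dim=-1)              # most likely original tokens
```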

References:

1. Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

2. Open source TensorFlow implementation.