BERT (Bidirectional Encoder Representations from Transformers) is a method of pre-training language representations: a general-purpose “language understanding” model trained on a large text corpus such as Wikipedia. It learns deep bidirectional representations by conditioning on both left and right context in all layers. The pre-trained representations can be fine-tuned with just one additional output layer to create NLP models without task-specific architecture modifications.
Limitation of the Standard Language Model
The major limitation is that standard language models are unidirectional, which restricts the choice of architectures that can be used during pre-training. For example, OpenAI GPT uses a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer.
The Disadvantages of RNNs and LSTMs
RNNs, LSTMs, and GRUs rely mainly on sequential processing. Long-term information has to travel sequentially through all cells before reaching the last one, so it can easily be corrupted along the way. This long sequential path is also why these networks suffer from vanishing gradients.
LSTMs mitigate part of the vanishing-gradient problem, but not all of it: there is still a sequential path from older past cells to the current one. That path is even more complicated because forget branches are attached to it. In practice, LSTMs can remember sequences of hundreds of steps, not thousands or tens of thousands, and they are not hardware friendly.
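The sequential bottleneck described above can be seen in a minimal sketch. The toy recurrence below (a hypothetical single-unit RNN, not any real library API) shows how the first token's influence on the hidden state is repeatedly squashed at every step before reaching the end of the sequence:

```python
import math

def rnn_forward(inputs, w_h=0.5, w_x=1.0):
    """Toy one-unit RNN: each step depends on the previous hidden state,
    so early-token information must survive every intermediate step."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)  # strictly sequential update
    return h

# Only the FIRST token carries a signal; the rest are zeros.
final_state = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
# After just six steps, final_state has shrunk to roughly 0.02:
# the first token's contribution is nearly gone.
```

Each extra timestep multiplies the signal by another `tanh`/`w_h` factor, which is exactly the mechanism behind vanishing gradients over long sequences.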
BERT Model Architecture
It is a multi-layer bidirectional Transformer encoder. The BERT Transformer uses bidirectional self-attention.
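Bidirectional self-attention can be sketched in a few lines of NumPy. This is a simplified illustration, not BERT's actual implementation: it uses the input matrix directly as queries, keys, and values (real Transformers apply learned projections). The key point is that the softmax runs over all positions, so every token attends to both its left and right context:

```python
import numpy as np

def self_attention(X):
    """Unmasked (bidirectional) scaled dot-product self-attention.
    Every row of the attention matrix spans ALL positions, so each
    token sees both left and right context (no causal mask)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # token-to-token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all positions
    return weights @ X                              # context-mixed representations

X = np.random.default_rng(0).normal(size=(4, 8))    # 4 tokens, hidden dim 8
out = self_attention(X)                             # same shape as the input
```

A left-to-right model such as GPT would instead mask the upper triangle of `scores` so each token could only attend to earlier positions.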
BERT addresses the unidirectionality constraint by proposing a new pre-training objective: the “masked language model” (MLM). It randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of each masked word based only on its context.
Input: the man went to the [MASK1] . he bought a [MASK2] of milk.
Labels: [MASK1] = store; [MASK2] = gallon
The MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.
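The masking step can be sketched as follows. This is a simplified illustration with assumed names (`mask_tokens`, `mask_prob`); the actual BERT recipe also replaces some selected tokens with random words or leaves them unchanged rather than always using [MASK]:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Randomly replace ~15% of tokens with [MASK], recording the
    originals as labels; the model must predict them from both-side
    context. (Simplified: real BERT masks 80%/10%/10%.)"""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok            # remember the original token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, labels

tokens = "the man went to the store he bought a gallon of milk".split()
masked, labels = mask_tokens(tokens)
```

Because only the masked positions contribute to the loss, the encoder is free to condition on every other token, left and right, without the prediction becoming trivial.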
BERT is additionally pre-trained on a “next sentence prediction” task that jointly pre-trains text-pair representations.
Sentence A: the man went to the store .
Sentence B: he bought a gallon of milk .
Label: IsNextSentence

Sentence A: the man went to the store .
Sentence B: penguins are flightless .
Label: NotNextSentence
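Building such training pairs can be sketched as below. The helper name `make_nsp_pairs` is assumed for illustration, and in this toy version the randomly drawn “NotNext” sentence can occasionally coincide with the true next sentence, which a real pipeline would avoid:

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (A, B, label) pairs: B is the true next sentence about
    half the time, and a randomly drawn sentence otherwise."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            j = rng.randrange(len(sentences))       # random distractor
            pairs.append((sentences[i], sentences[j], "NotNext"))
    return pairs

corpus = [
    "the man went to the store .",
    "he bought a gallon of milk .",
    "penguins are flightless .",
]
pairs = make_nsp_pairs(corpus)
```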
In order to train a deep bidirectional representation, BERT takes the straightforward approach of masking some percentage of the input tokens at random and then predicting only those masked tokens. This procedure is referred to as a “masked LM”: the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM.
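The output step above can be sketched with NumPy. This is a toy-sized illustration with assumed names (`predict_masked`, a 10-word vocabulary): the final hidden vector at a masked position is projected onto the vocabulary and normalized with a softmax:

```python
import numpy as np

def predict_masked(hidden, output_embeddings):
    """Project a [MASK] position's final hidden vector onto the
    vocabulary and apply a softmax, as in a standard LM head."""
    logits = output_embeddings @ hidden            # (vocab_size,)
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    return probs / probs.sum()

rng = np.random.default_rng(0)
vocab_size, d = 10, 8
E = rng.normal(size=(vocab_size, d))  # toy output embedding matrix
h = rng.normal(size=d)                # hidden vector at a masked position
probs = predict_masked(h, E)          # distribution over the 10-word vocab
```

Training then maximizes the probability of the original token at each masked position, while unmasked positions contribute nothing to the loss.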