Pre-trained word embeddings are an integral part of modern NLP systems. Its offering significant improvements over embeddings learned from scratch. The major limitation of word embeddings is unidirectional.

Bidirectional Encoder Representations from Transformers(BERT) is a new language representation model. It is designed to pre-train bidirectional representations from the unlabeled text.

The pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks without substantial task-specific architecture modifications.

In this tutorial, we demonstrated how to integrate BERT embeddings as a Keras layer to simplify model prototyping using the TensorFlow hub.

Install packages 

Install the BERT tokenizer from the BERT python module (bert-for-tf2).

!pip install bert-for-tf2
!pip install sentencepiece

We will use the latest TensorFlow (2.0+) and TensorFlow Hub (0.7+), therefore, it might need an upgrade. For the model creation, we use the high-level Keras API Model class.

%tensorflow_version 2.x

import tensorflow as tf
import tensorflow_hub as hub
import bert
from tensorflow.keras.models import  Model
from tqdm import tqdm
import numpy as np
from collections import namedtuple
print("TensorFlow Version:",tf.__version__)
print("Hub version: ",hub.__version__)

BERT Embedding Layer

For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.


The same pre-trained model parameters are used to initialize models for different down-stream tasks Apart from output layers. During fine-tuning, all parameters are fine-tuned.

input_word_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32,
input_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32,
segment_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32,
  • input token ids is tokenizer converts tokens using vocab file.
  • input masks are either 0 or 1. 1 for useful tokens, 0 for padding.
  • segment ids are either 0 or 1. For 2 text training: 0 for the first one, 1 for the second one.
def get_masks(tokens, max_seq_length):
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))

def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    segments = []
    current_segment_id = 0
    for token in tokens:
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

pooled_output representations the entire input sequences and sequence_output representations each input token in the context.


During any text data preprocessing, there is a tokenization phase involved. The tokenizer available with the BERT package is very powerful. We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.  Create the tokenizer with the BERT layer and import it tokenizer using the original vocab file.





def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens,)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids

Prepare Training Data 

The dataset for this article can be downloaded from this Kaggle link. Our BERT embedding layer will need three types of input tokens: word_ids, input_mask, segment_ids

Applying the tokenizer to converting into words into ids. These are some functions that will be used to preprocess the raw text data into useable Bert inputs.

import os
os.environ['KAGGLE_USERNAME'] = "username" # username from the json file
os.environ['KAGGLE_KEY'] = "520d35e7sfwe2323112xzf636f66a8034b0c5bd3f" # key from the json file

!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
import pandas as pd


df = df.sample(frac=1)

train_sentences = df["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
train_y = df[list_classes].values

The first token of every sequence is always a special classification token ([CLS]). It is a special symbol added in front of every input example and [SEP] is a special separator token is added at the end of every input example.

def create_single_input(sentence,MAX_LEN):
  stokens = tokenizer.tokenize(sentence)
  stokens = stokens[:MAX_LEN]
  stokens = ["[CLS]"] + stokens + ["[SEP]"]
  ids = get_ids(stokens, tokenizer, MAX_SEQ_LEN)
  masks = get_masks(stokens, MAX_SEQ_LEN)
  segments = get_segments(stokens, MAX_SEQ_LEN)

  return ids,masks,segments

def create_input_array(sentences):

  input_ids, input_masks, input_segments = [], [], []

  for sentence in tqdm(sentences,position=0, leave=True):


  return [np.asarray(input_ids, dtype=np.int32), 
            np.asarray(input_masks, dtype=np.int32), 
            np.asarray(input_segments, dtype=np.int32)]

Create and Train model

Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. A simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task.

x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
x = tf.keras.layers.Dropout(0.2)(x)
out = tf.keras.layers.Dense(6, activation="sigmoid", name="dense_output")(x)

model = tf.keras.models.Model(
      inputs=[input_word_ids, input_mask, segment_ids], outputs=out)


It’s simple, just taking the sequence_output of the bert_layer and pass it to an AveragePooling layer and finally to an output layer of 6 units (6 classes that we have to predict.



To predict new text data, first, we need to convert into BERT input after that you can use predict() on the model.


test_sentences = test_df["comment_text"].fillna("CVxTz").values




BERT reduces the need for many heavily-engineered task-specific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

Run this code in Google Colab