BERT achieved state-of-the-art results on a wide variety of NLP tasks, including question answering and natural language inference. It has been pre-trained on Wikipedia and BooksCorpus, so it requires only task-specific fine-tuning.

It is a single model, pre-trained on a large unlabelled dataset, that achieved state-of-the-art results on 11 individual NLP tasks. BERT inspired many recent NLP architectures, training approaches, and language models, such as Google's Transformer-XL, OpenAI's GPT-2, XLNet, ERNIE 2.0, and RoBERTa.

BERT is based on the Transformer architecture. It is essentially a stack of Transformer encoder layers.
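To make the "stack of encoders" idea concrete, here is a rough back-of-the-envelope sketch in plain Python that counts the parameters of a BERT-base-sized encoder. The function names are my own, and the sizes (12 layers, hidden size 768, feed-forward size 3072, 512 positions, a 28,996-token cased vocabulary) are the BERT-base configuration as I understand it, so treat the result as an estimate rather than an official figure.

```python
def encoder_layer_params(hidden, ffn):
    """Parameters in one Transformer encoder layer (weights + biases)."""
    attention = 4 * (hidden * hidden + hidden)   # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)               # two LayerNorms, each with scale and bias
    return attention + feed_forward + layer_norms


def bert_params(vocab, hidden=768, layers=12, ffn=3072, positions=512, segments=2):
    """Rough count: embeddings + identical stacked encoder layers + pooler."""
    embeddings = (vocab + positions + segments) * hidden + 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + layers * encoder_layer_params(hidden, ffn) + pooler


# Assumed sizes for a cased BERT-base model (see the note above)
total = bert_params(vocab=28996)
print(f"approx. {total / 1e6:.0f}M parameters")
```

Most of the parameters sit in the 12 identical encoder layers; the embeddings account for most of the rest.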

For modeling and training, I use the transformers library. It makes it easy to try out different architectures such as XLNet and RoBERTa, each of which comes pre-trained with several sets of weights. These models obtain state-of-the-art results on a variety of NLP tasks such as text classification, information extraction, question answering, and text generation.


The library has recently been ported to TensorFlow 2.0 and offers an API that works with Keras' fit API. This tutorial uses the Transformers library with TensorFlow and the Keras API to fine-tune a state-of-the-art Transformer model. Getting started with Transformers only requires installing the pip package:

!pip install transformers

Download Dataset

We’ll use the IMDB dataset that contains the text of 50,000 movie reviews. You can download it from Kaggle.

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from transformers import BertTokenizer, TFBertForSequenceClassification

# Load the reviews; the code below reads the 'review' and 'sentiment' columns
data = pd.read_csv('IMDB Dataset.csv')

Next, let's encode the labels. Each label becomes an integer, either 0 or 1, where 0 is a negative review and 1 is a positive review.

label_encoder = preprocessing.LabelEncoder()
data['sentiment'] = label_encoder.fit_transform(data['sentiment'])

X = (np.array(data['review']))
y = (np.array(data['sentiment']))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

print("Train dataset shape: {0}, \nTest dataset shape: {1}".format(X_train.shape, X_test.shape))
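To see why negative ends up as 0 and positive as 1, here is a small plain-Python sketch of the mapping that the label encoding step produces. It assumes the sentiment column holds the strings "positive" and "negative"; the helper name `fit_label_mapping` is my own and only mimics the sorted-class behaviour of scikit-learn's LabelEncoder.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, assigning ids in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


sentiments = ["positive", "negative", "negative", "positive"]  # toy data
mapping = fit_label_mapping(sentiments)
encoded = [mapping[s] for s in sentiments]

print(mapping)  # {'negative': 0, 'positive': 1}
print(encoded)  # [1, 0, 0, 1]
```

Because the classes are sorted alphabetically, "negative" comes first and receives 0.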

BERT Tokenizer

BERT expects its input in a specific tokenized form. The Transformers tokenizer class takes care of converting strings into arrays of integers.

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
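BERT's tokenizer splits words into WordPiece subwords before mapping them to integers. The following is a minimal sketch of the greedy longest-match-first idea, not the library's actual implementation. The vocabulary is a tiny hypothetical one, and the function name is my own.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}  # hypothetical vocabulary
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

The real tokenizer then looks up each piece in its vocabulary file to obtain the integer ids.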

Preprocess Input

BERT expects text to be represented in a specific way before it is fed into the model. The following function preprocesses the raw text data into usable BERT inputs.


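# Assumed settings: the original values are not shown, so adjust them as needed
pad_token = 0             # id used to pad input_ids
pad_token_segment_id = 0  # id used to pad token_type_ids
max_length = 128          # every sequence is truncated or padded to this length

def convert_to_input(reviews):
  input_ids, attention_masks, token_type_ids = [], [], []

  for x in tqdm(reviews,position=0, leave=True):
    # truncation=True is needed on recent versions of transformers to enforce max_length
    inputs = bert_tokenizer.encode_plus(x,add_special_tokens=True, max_length=max_length, truncation=True)
    i, t = inputs["input_ids"], inputs["token_type_ids"]
    m = [1] * len(i)  # attention mask: 1 for real tokens

    padding_length = max_length - len(i)

    i = i + ([pad_token] * padding_length)
    m = m + ([0] * padding_length)  # 0 marks padding positions
    t = t + ([pad_token_segment_id] * padding_length)

    input_ids.append(i)
    attention_masks.append(m)
    token_type_ids.append(t)

  return [np.asarray(input_ids),
          np.asarray(attention_masks),
          np.asarray(token_type_ids)]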


This function uses the tokenizer to tokenize the input and add special tokens at the beginning and end of each sequence ([CLS] and [SEP], for instance) when the model requires them. It then pads every sequence to max_length and builds the matching attention mask.
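The padding logic is easier to follow on a tiny example. This plain-Python sketch builds the three arrays for a single sequence. The special-token ids and the `build_inputs` helper are illustrative assumptions; in practice you would read the ids from the tokenizer.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # illustrative ids, not read from a real tokenizer


def build_inputs(token_ids, max_length):
    """Wrap ids in [CLS]/[SEP], truncate if needed, then pad to max_length."""
    body = token_ids[: max_length - 2]      # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding         # 0 = padding, ignored by attention
    token_type_ids = [0] * max_length       # single-sentence input uses one segment
    return input_ids, attention_mask, token_type_ids


ids, mask, types = build_inputs([7, 8, 9], max_length=8)
print(ids)    # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)   # [1, 1, 1, 1, 1, 0, 0, 0]
print(types)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists have the same length, which is what lets them be stacked into fixed-shape arrays.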

We can then shuffle this dataset and group it into batches of 32 examples using standard tf.data methods.

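# Tokenize and pad both splits; each call returns [input_ids, attention_masks, token_type_ids]
X_train_input = convert_to_input(X_train)
X_test_input = convert_to_input(X_test)

def example_to_features(input_ids,attention_masks,token_type_ids,y):
  # Pack the three input arrays into the dict the model expects, with the label alongside
  return {"input_ids": input_ids,
          "attention_mask": attention_masks,
          "token_type_ids": token_type_ids},y

train_ds = tf.data.Dataset.from_tensor_slices((X_train_input[0],X_train_input[1],X_train_input[2],y_train)).map(example_to_features).shuffle(100).batch(32).repeat(5)
test_ds = tf.data.Dataset.from_tensor_slices((X_test_input[0],X_test_input[1],X_test_input[2],y_test)).map(example_to_features).batch(64)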
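A shuffle with a buffer of 100 only mixes examples locally: each output element is drawn from a window of 100 candidates, not from the whole dataset. The sketch below imitates that behaviour and the batching step in plain Python. It is an approximation of the idea under my own function names, not tf.data's implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]          # emit a random buffered element...
        buffer[index] = item         # ...and put the new item in its slot
    rng.shuffle(buffer)
    yield from buffer                # drain what is left at the end


def batched(items, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                  # the final partial batch is kept


batches = list(batched(buffered_shuffle(range(10), buffer_size=3), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

With a buffer of 3, the first element emitted can only be 0, 1, or 2. If the CSV happened to be sorted by label, a small buffer would leave long runs of the same class, so a larger buffer may be worth trying.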

Create a Model

The Transformers library is built around pre-trained transformer models. These models come in different shapes, sizes, and architectures. A TensorFlow model from the library inherits from tf.keras.Model (which is itself a tf.keras.layers.Layer), so it can be trained very simply with the Keras fit API or with a custom training loop and GradientTape.

Loading a pre-trained model can be done in a few lines of code. Here is an example that loads the TensorFlow BERT model for sequence classification.

bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-cased")

The model class holds the neural network modeling logic itself. The weights are downloaded from HuggingFace's S3 bucket and cached locally on your machine. The pre-trained encoder is ready to use, but the classification layer on top is newly initialized, so the model needs fine-tuning before its predictions are meaningful.

Fine Tuning Transformer model with Keras’ fit method

Now that the input pipeline is set up and the classification model is defined, we can call the Keras fit method with our dataset.

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

bert_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
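The loss is configured with from_logits=True because the model outputs raw scores rather than probabilities. As a rough illustration, the plain-Python sketch below computes the same two quantities for a toy batch: softmax cross-entropy against an integer label, and accuracy via the largest logit. The function names and the logits are made up for the example.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy of one example from raw logits and an integer label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]  # equals -log(softmax(logits)[label])


def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose largest logit matches the label."""
    hits = 0
    for row, y in zip(batch_logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        hits += predicted == y
    return hits / len(labels)


batch_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]  # toy [negative, positive] scores
labels = [0, 1, 0]
losses = [sparse_categorical_crossentropy(r, y) for r, y in zip(batch_logits, labels)]
print(sum(losses) / len(losses))
print(sparse_categorical_accuracy(batch_logits, labels))
```

When both logits are equal, the loss is log(2), which is the value you would expect from an untrained binary classifier.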

bert_history = bert_model.fit(train_ds, epochs=3, validation_data=test_ds)
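One detail worth checking is how much training this actually performs. The training dataset is repeated 5 times and fit runs for 3 epochs, so each Keras epoch covers the data several times. Here is a quick calculation, assuming all 50,000 reviews are loaded and 80% of them go to training:

```python
import math

train_examples = 40_000  # 80% of 50,000 reviews (test_size=0.2)
batch_size = 32
repeats = 5              # from .repeat(5) on the training dataset
epochs = 3               # from fit(epochs=3)

batches_per_pass = math.ceil(train_examples / batch_size)
steps_per_epoch = batches_per_pass * repeats  # an epoch runs until the dataset is exhausted
total_steps = steps_per_epoch * epochs
passes_over_data = repeats * epochs

print(batches_per_pass, steps_per_epoch, total_steps, passes_over_data)
# 1250 6250 18750 15
```

If you would rather have one pass over the data per epoch, one option is to drop the repeat call or lower the number of epochs.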
