A huge amount of unstructured text data is available today, and it becomes a rich source of information once it is given structure. Named Entity Recognition (NER) builds knowledge from unstructured text by extracting important pieces of information such as email addresses, phone numbers, degree titles, location names, organizations, and times.
NER has a wide variety of use cases. For example, when you mention a time in an email or refer to an attachment, Gmail offers to set a calendar notification or reminds you to attach the file if you are about to send the email without one. Other examples include extracting key information from legal, financial, and medical documents, classifying content for news providers, and improving search algorithms.
A new paradigm in natural language processing (NLP) is to select a pre-trained model and then fine-tune it with new data from your specific task.
BERT is a powerful general-purpose language model trained with a masked language modeling objective that can be leveraged for text-based machine learning tasks.
Transformers
Because of BERT's popularity, pre-trained implementations already exist for TensorFlow. I leveraged the popular transformers library while building out this project.
!pip install transformers
First, install the transformers package by Hugging Face. This library contains state-of-the-art pre-trained models for natural language processing (NLP) such as BERT, GPT, and XLNet.
Now you have access to the pre-trained BERT models and the TensorFlow wrappers we will use here.
Download Dataset
We are going to use a dataset from Kaggle. The sentences are annotated with the BIO scheme, in which each token is labeled as the beginning of an entity (B-), inside an entity (I-), or outside any entity (O).
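To make the scheme concrete, here is a small made-up sentence as a list of (word, tag) pairs. It only illustrates the B-/I-/O convention; it is not a row taken from the dataset, and the exact tag names in the dataset may differ.
# Illustrative BIO-tagged sentence (not from the dataset):
example = [
    ("George", "B-per"),      # beginning of a person entity
    ("Washington", "I-per"),  # continuation of the same person entity
    ("visited", "O"),         # not part of any entity
    ("Washington", "B-geo"),  # beginning of a separate location entity
    (".", "O"),
]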
import pandas as pd

# Load the word-level annotations; forward-fill the "Sentence #" column,
# which is only filled in on the first word of each sentence
df_data = pd.read_csv("ner_dataset.csv", sep=",", encoding="latin1").fillna(method="ffill")
df_data.shape
Now we split the dataset, keeping 20% aside to validate the model.
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(df_data, test_size=0.20, shuffle=False)
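The groupby calls below rely on an agg_func helper that the original excerpt does not define. Assuming the standard columns of this Kaggle dataset (Word and Tag), one possible definition is:
# Assumed helper: collect each sentence's words and tags into a list of (word, tag) pairs.
# The column names "Word" and "Tag" are the ones used by the Kaggle dataset.
agg_func = lambda s: [
    (w, t) for w, t in zip(s["Word"].values.tolist(), s["Tag"].values.tolist())
]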
x_train_grouped = x_train.groupby("Sentence #").apply(agg_func)
x_test_grouped = x_test.groupby("Sentence #").apply(agg_func)
Tokenizer
The BERT implementation comes with a pre-trained tokenizer and a defined vocabulary. We load the tokenizer for the case-sensitive base model, “bert-base-cased”.
MAX_LENGTH = 128            # maximum sequence length after tokenization
BERT_MODEL = "bert-base-cased"
BATCH_SIZE = 32
pad_token = 0               # id used to pad input_ids
pad_token_segment_id = 0    # segment id used for padding positions
sequence_a_segment_id = 0   # segment id for a single-sentence input
import tensorflow as tf

from transformers import (
    TF2_WEIGHTS_NAME,
    BertConfig,
    BertTokenizer,
    TFBertForTokenClassification,
    create_optimizer,
)

MODEL_CLASSES = {"bert": (BertConfig, TFBertForTokenClassification, BertTokenizer)}
# Load the tokenizer that matches the pre-trained checkpoint; keep casing, since the model is cased
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=False)
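As a quick sanity check (not part of the original code), you can tokenize a single word to see how WordPiece may split it into sub-word pieces, which matters later when we align the labels with the tokens:
# A word missing from the vocabulary is split into sub-word pieces;
# the exact split depends on the "bert-base-cased" vocabulary.
print(tokenizer.tokenize("Kaggle"))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Kaggle")))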
Create a Model
NER is a token-level multi-class classification problem where the words are our input and the tags are our labels. The transformers package provides the TFBertForTokenClassification class for token-level predictions. It is a fine-tuning model that wraps BertModel and adds a token-level classifier on top of it. We load the pre-trained “bert-base-cased” model and provide the number of possible labels.
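The configuration below needs num_labels, and the label-alignment snippet in the next section uses label_map and pad_token_label_id; none of these are defined in the excerpts shown here. A minimal sketch of one way to set them up from the dataset's Tag column (the variable names and the choice to reserve id 0 for padding are my own assumptions):
# Assumed label vocabulary setup (illustrative, not from the original code)
labels = sorted(df_data["Tag"].unique())                     # all distinct NER tags in the data
label_map = {label: i for i, label in enumerate(labels, 1)}  # real tags get ids 1..N
num_labels = len(labels) + 1                                 # id 0 is reserved for padded positions
pad_token_label_id = 0                                       # label id for padding and extra sub-tokens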
config_class, model_class, tokenizer_class = MODEL_CLASSES["bert"]
config = config_class.from_pretrained(BERT_MODEL, num_labels=num_labels)
model = model_class.from_pretrained(
    BERT_MODEL,
    from_pt=bool(".bin" in BERT_MODEL),  # from_pt=True is only needed when loading PyTorch (.bin) weights
    config=config,
)
# Replace the classifier's linear activation with softmax so the model outputs probabilities
model.layers[-1].activation = tf.keras.activations.softmax
# A small learning rate is the usual choice when fine-tuning BERT
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
# The outputs are already probabilities (softmax above), so from_logits must be False
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Prepare Input
Before we can start fine-tuning the model, we have to prepare the dataset for use with BERT. We need to convert the text into three kinds of inputs: token ids, token type (segment) ids, and an attention mask.
Token IDs
To build the token inputs, we tokenize each word with the WordPiece tokenizer and map the resulting sub-word tokens to their vocabulary ids.
# x holds the words of one sentence and y the corresponding tags
tokens, label_ids = [], []
for word, label in zip(x, y):
    # WordPiece may split one word into several sub-word tokens
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # Use the real label id for the first sub-token of the word,
    # and the padding label id for the remaining sub-tokens
    label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
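The next snippet reads input_ids and token_type_ids from an inputs dictionary that is never constructed in the excerpts shown here. It would typically come from the tokenizer; a sketch under that assumption (the exact call in the original code is not shown):
# Assumed bridge: wrap the sub-word tokens with the special [CLS]/[SEP] tokens
# and map them to vocabulary ids.
inputs = tokenizer.encode_plus(tokens, add_special_tokens=True)
# The special tokens also need label entries so the labels stay aligned with input_ids
label_ids = [pad_token_label_id] + label_ids + [pad_token_label_id]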
Attention Mask
To build the attention mask, we use 1 to mark a real token and 0 to mark a padding token.
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
attention_masks = [1] * len(input_ids)
The token_type_ids indicate whether a token belongs to the first or the second sequence; since we feed single sentences here, they are all zero.
Next, we truncate and pad the token and label sequences to our desired length.
from tensorflow.keras.preprocessing.sequence import pad_sequences

input_ids_train = pad_sequences(input_ids_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
token_ids_train = pad_sequences(token_ids_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
attention_masks_train = pad_sequences(attention_masks_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
label_ids_train = pad_sequences(label_ids_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
The last step is to build the tf.data.Dataset objects. We shuffle the data at training time; at test time the examples are simply passed through sequentially.
train_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_train, attention_masks_train, token_ids_train, label_ids_train)
).map(example_to_features).shuffle(1000).batch(BATCH_SIZE)
test_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_test, attention_masks_test, token_ids_test, label_ids_test)
).map(example_to_features).batch(1)
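The example_to_features map function is not defined in the excerpts above. Assuming it simply packs the four tensors into the (inputs, labels) format that model.fit expects, it could look like this:
# Assumed helper: pack the tensors into the dictionary of inputs the model expects,
# with the label ids as the training target.
def example_to_features(input_ids, attention_mask, token_type_ids, label_ids):
    return (
        {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
        },
        label_ids,
    )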
Train Model
Train the model for 3 epochs in mini-batches of 32 samples, i.e. 3 passes over all samples in train_ds. While training, monitor the model's loss and accuracy on test_ds, which serves here as the validation set.
history = model.fit(train_ds, epochs=3, validation_data=test_ds)