A huge amount of unstructured text data is available today, and it becomes a rich source of information once it is given structure. Named Entity Recognition (NER) builds knowledge from unstructured text by extracting important pieces of information such as email addresses, phone numbers, degree titles, location names, organizations, and times.
NER has a wide variety of use cases. For example, when you mention a time in an email or refer to an attachment, Gmail offers to set a calendar notification or reminds you to attach the file if you are about to send the email without one. Other examples include extracting key information from legal, financial, and medical documents, classifying content for news providers, and improving search algorithms.
A new paradigm in natural language processing (NLP) is to select a pre-trained model and then fine-tune it with new data from your specific task.
BERT is a powerful general-purpose language model trained with a masked language modeling objective that can be leveraged for text-based machine learning tasks.
Transformers
Because of BERT's popularity, pre-trained implementations already exist for TensorFlow. I leveraged the popular transformers library while building out this project.
!pip install transformers
First, install the transformers package by Hugging Face. This library contains state-of-the-art pre-trained models for natural language processing (NLP) such as BERT, GPT, and XLNet.
Now you have access to the pre-trained BERT models and the TensorFlow wrappers we will use here.
Download Dataset
We are going to use a dataset from Kaggle. The sentences are annotated with the BIO scheme, in which each token is labeled as the beginning of an entity (B-), inside an entity (I-), or outside any entity (O).
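To make the scheme concrete, here is a small made-up sentence as a list of (word, tag) pairs. It only illustrates the B-/I-/O convention; it is not a row taken from the dataset, and the exact tag names in the dataset may differ.
# Illustrative BIO-tagged sentence (not from the dataset):
example = [
    ("George", "B-per"),      # beginning of a person entity
    ("Washington", "I-per"),  # continuation of the same person entity
    ("visited", "O"),         # not part of any entity
    ("Washington", "B-geo"),  # beginning of a separate location entity
    (".", "O"),
]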
import pandas as pd

# Load the word-level annotations; forward-fill the "Sentence #" column,
# which is only filled in on the first word of each sentence
df_data = pd.read_csv("ner_dataset.csv", sep=",", encoding="latin1").fillna(method="ffill")
df_data.shape
Now we split the dataset, keeping 20% aside to validate the model.
from sklearn.model_selection import train_test_split
x_train, x_test = train_test_split(df_data, test_size=0.20, shuffle=False)
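The groupby calls below rely on an agg_func helper that the original excerpt does not define. Assuming the standard columns of this Kaggle dataset (Word and Tag), one possible definition is:
# Assumed helper: collect each sentence's words and tags into a list of (word, tag) pairs.
# The column names "Word" and "Tag" are the ones used by the Kaggle dataset.
agg_func = lambda s: [
    (w, t) for w, t in zip(s["Word"].values.tolist(), s["Tag"].values.tolist())
]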
x_train_grouped = x_train.groupby("Sentence #").apply(agg_func)
x_test_grouped = x_test.groupby("Sentence #").apply(agg_func)
Tokenizer
The BERT implementation comes with a pre-trained tokenizer and a defined vocabulary. We load the tokenizer for the case-sensitive base model, “bert-base-cased”.
MAX_LENGTH = 128            # maximum sequence length after tokenization
BERT_MODEL = "bert-base-cased"
BATCH_SIZE = 32
pad_token = 0               # id used to pad input_ids
pad_token_segment_id = 0    # segment id used for padding positions
sequence_a_segment_id = 0   # segment id for a single-sentence input
import tensorflow as tf

from transformers import (
    TF2_WEIGHTS_NAME,
    BertConfig,
    BertTokenizer,
    TFBertForTokenClassification,
    create_optimizer,
)

MODEL_CLASSES = {"bert": (BertConfig, TFBertForTokenClassification, BertTokenizer)}
# Load the tokenizer that matches the pre-trained checkpoint; keep casing, since the model is cased
tokenizer = BertTokenizer.from_pretrained(BERT_MODEL, do_lower_case=False)
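As a quick sanity check (not part of the original code), you can tokenize a single word to see how WordPiece may split it into sub-word pieces, which matters later when we align the labels with the tokens:
# A word missing from the vocabulary is split into sub-word pieces;
# the exact split depends on the "bert-base-cased" vocabulary.
print(tokenizer.tokenize("Kaggle"))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Kaggle")))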
Create a Model
NER is a token-level multi-class classification problem where the words are our input and the tags are our labels. The transformers package provides the TFBertForTokenClassification class for token-level predictions. It is a fine-tuning model that wraps BertModel and adds a token-level classifier on top of it. We load the pre-trained “bert-base-cased” model and provide the number of possible labels.
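The configuration below needs num_labels, and the label-alignment snippet in the next section uses label_map and pad_token_label_id; none of these are defined in the excerpts shown here. A minimal sketch of one way to set them up from the dataset's Tag column (the variable names and the choice to reserve id 0 for padding are my own assumptions):
# Assumed label vocabulary setup (illustrative, not from the original code)
labels = sorted(df_data["Tag"].unique())                     # all distinct NER tags in the data
label_map = {label: i for i, label in enumerate(labels, 1)}  # real tags get ids 1..N
num_labels = len(labels) + 1                                 # id 0 is reserved for padded positions
pad_token_label_id = 0                                       # label id for padding and extra sub-tokens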
config_class, model_class, tokenizer_class = MODEL_CLASSES["bert"]
config = config_class.from_pretrained(BERT_MODEL, num_labels=num_labels)
model = model_class.from_pretrained(
    BERT_MODEL,
    from_pt=bool(".bin" in BERT_MODEL),  # from_pt=True is only needed when loading PyTorch (.bin) weights
    config=config,
)
# Replace the classifier's linear activation with softmax so the model outputs probabilities
model.layers[-1].activation = tf.keras.activations.softmax
# A small learning rate is the usual choice when fine-tuning BERT
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
# The outputs are already probabilities (softmax above), so from_logits must be False
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Prepare Input
Before we can start fine-tuning the model, we have to prepare the dataset for use with BERT. We need to convert the text into three kinds of inputs: token ids, token type (segment) ids, and an attention mask.
Token IDs
To build the token inputs, we tokenize each word with the WordPiece tokenizer and map the resulting sub-word tokens to their vocabulary ids.
# x holds the words of one sentence and y the corresponding tags
tokens, label_ids = [], []
for word, label in zip(x, y):
    # WordPiece may split one word into several sub-word tokens
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # Use the real label id for the first sub-token of the word,
    # and the padding label id for the remaining sub-tokens
    label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
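The next snippet reads input_ids and token_type_ids from an inputs dictionary that is never constructed in the excerpts shown here. It would typically come from the tokenizer; a sketch under that assumption (the exact call in the original code is not shown):
# Assumed bridge: wrap the sub-word tokens with the special [CLS]/[SEP] tokens
# and map them to vocabulary ids.
inputs = tokenizer.encode_plus(tokens, add_special_tokens=True)
# The special tokens also need label entries so the labels stay aligned with input_ids
label_ids = [pad_token_label_id] + label_ids + [pad_token_label_id]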
Attention Mask
To build the attention mask, we use 1 to mark a real token and 0 to mark a padding token.
input_ids, token_type_ids = inputs["input_ids"], inputs["token_type_ids"]
attention_masks = [1] * len(input_ids)
The token_type_ids indicate whether a token belongs to the first or the second sequence; since we feed single sentences here, they are all zero.
Next, we truncate and pad the token and label sequences to our desired length.
from tensorflow.keras.preprocessing.sequence import pad_sequences

input_ids_train = pad_sequences(input_ids_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
token_ids_train = pad_sequences(token_ids_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
attention_masks_train = pad_sequences(attention_masks_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
label_ids_train = pad_sequences(label_ids_train, maxlen=MAX_LENGTH, dtype="long", truncating="post", padding="post")
The last step is to build the tf.data.Dataset objects. We shuffle the data at training time; at test time the examples are simply passed through sequentially.
train_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_train, attention_masks_train, token_ids_train, label_ids_train)
).map(example_to_features).shuffle(1000).batch(BATCH_SIZE)
test_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_test, attention_masks_test, token_ids_test, label_ids_test)
).map(example_to_features).batch(1)
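The example_to_features map function is not defined in the excerpts above. Assuming it simply packs the four tensors into the (inputs, labels) format that model.fit expects, it could look like this:
# Assumed helper: pack the tensors into the dictionary of inputs the model expects,
# with the label ids as the training target.
def example_to_features(input_ids, attention_mask, token_type_ids, label_ids):
    return (
        {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
        },
        label_ids,
    )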
Train Model
Train the model for 3 epochs in mini-batches of 32 samples, i.e. 3 passes over all samples in train_ds. While training, monitor the model's loss and accuracy on test_ds, which serves here as the validation set.
history = model.fit(train_ds, epochs=3, validation_data=test_ds)