In this tutorial, we build text classification models in Keras that use an attention mechanism to provide insight into how classification decisions are being made.
1. Prepare the Dataset
We’ll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. The IMDB dataset comes packaged with Keras. It has already been preprocessed such that the sequences of words have been converted to sequences of integers, where each integer represents a specific word in a dictionary.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Input
from tensorflow.keras.layers import Concatenate
from tensorflow.keras.preprocessing import sequence

# Special token ids and the offset applied to the raw word index.
vocab_size = 10000
pad_id = 0
start_id = 1
oov_id = 2
index_offset = 2

# Load the IMDB reviews, already encoded as sequences of word ids.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=vocab_size,
    start_char=start_id,
    oov_char=oov_id,
    index_from=index_offset)

# Build an id-to-word lookup so encoded reviews can be decoded back to text.
word2idx = tf.keras.datasets.imdb.get_word_index()
idx2word = {v + index_offset: k for k, v in word2idx.items()}
idx2word[pad_id] = '<pad>'
idx2word[start_id] = '<start>'
idx2word[oov_id] = '<oov>'

max_len = 200
rnn_cell_size = 128

# Truncate or pad every review to exactly max_len tokens.
x_train = sequence.pad_sequences(x_train,
                                 maxlen=max_len,
                                 truncating='post',
                                 padding='post',
                                 value=pad_id)
x_test = sequence.pad_sequences(x_test,
                                maxlen=max_len,
                                truncating='post',
                                padding='post',
                                value=pad_id)
Keras provides the pad_sequences utility, which we use above to truncate or pad every review to exactly 200 tokens so the reviews can be batched together.
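As a quick sanity check, we can decode one of the padded training reviews back into words with the idx2word lookup built above; this is just an illustration using the variables defined in the snippet:
# Decode the first training review (ids missing from the lookup fall back to <oov>).
print(' '.join(idx2word.get(i, '<oov>') for i in x_train[0]))
print('label:', y_train[0])  # 1 = positive review, 0 = negative review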
2. Create an Attention Layer
You can use the final encoded state of a recurrent neural network for prediction, but that single vector has to compress everything about the review. An attention layer instead looks at every encoder output and learns how much each time step should contribute to the prediction, which is also what lets us inspect the decision afterwards.
class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: encoder outputs, shape (batch, time_steps, hidden_size)
        # hidden:   last hidden state, shape (batch, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # Additive (Bahdanau-style) score for every time step.
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # Normalize the scores across time so the weights sum to 1.
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # Weighted sum of the encoder outputs.
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
We compute these attention weights by building a small fully connected network on top of each encoded state. The final layer of this network has a single unit, which corresponds to the attention weight assigned to that state.

The attention function is very simple: it is just a couple of dense layers followed by a softmax, so essentially a small three-layer neural network.
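To see what the layer produces, here is a quick shape check on random tensors; the batch size of 4 and the 10 scoring units are arbitrary choices for illustration:
# Dummy encoder outputs and hidden state, just to inspect the output shapes.
features = tf.random.normal((4, max_len, 2 * rnn_cell_size))  # (batch, time, hidden)
hidden = tf.random.normal((4, 2 * rnn_cell_size))

context_vector, attention_weights = Attention(10)(features, hidden)
print(context_vector.shape)     # (4, 256): one summary vector per review
print(attention_weights.shape)  # (4, 200, 1): one weight per time step, summing to 1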
3. Embedding Layer
Neural networks are compositions of linear-algebra operators and non-linear activation functions. In order to perform these computations on our input sentences, we must first embed them as vectors of numbers. There are two main approaches to this embedding: using pre-trained embeddings such as Word2Vec or GloVe, or initializing the embedding randomly and training it with the rest of the model.
In this tutorial we use random initialization. To perform the embedding we use the Embedding layer from keras.layers; its parameters are then trained together with the rest of the graph.
sequence_input = Input(shape=(max_len,), dtype='int32')
embedded_sequences = keras.layers.Embedding(vocab_size, 128, input_length=max_len)(sequence_input)
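If you preferred the pre-trained route instead, a common pattern is to fill an embedding matrix from downloaded GloVe vectors and use it to initialize the layer. The sketch below assumes you have already parsed a GloVe file into a glove_vectors dict mapping words to NumPy arrays; that dict and the 100-dimension choice are assumptions, not part of this tutorial:
import numpy as np

embedding_dim = 100  # must match the dimensionality of the GloVe file you downloaded
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word2idx.items():
    shifted = idx + index_offset            # align with the ids produced by load_data
    if shifted < vocab_size and word in glove_vectors:
        embedding_matrix[shifted] = glove_vectors[word]

pretrained_embedding = keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False)  # optionally freeze the pre-trained vectors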
4. Bi-directional RNN
We will use a bi-directional RNN. This is simply two RNNs, one reading the sequence forwards and one reading it backwards; their outputs are concatenated at each time step, so every position sees both its left and right context.
# First bi-LSTM returns the full output sequence for the next layer;
# its final states are not needed, so return_state is left off.
lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(rnn_cell_size,
                         dropout=0.3,
                         return_sequences=True,
                         recurrent_activation='relu',
                         recurrent_initializer='glorot_uniform'),
    name="bi_lstm_0")(embedded_sequences)

# Second bi-LSTM also returns its final hidden and cell states,
# which the attention layer will use as the query.
lstm, forward_h, forward_c, backward_h, backward_c = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(rnn_cell_size,
                         dropout=0.2,
                         return_sequences=True,
                         return_state=True,
                         recurrent_activation='relu',
                         recurrent_initializer='glorot_uniform'),
    name="bi_lstm_1")(lstm)
Because our model uses a bi-directional RNN, we first concatenate the final hidden states from the forward and backward passes before computing the attention weights and applying the weighted sum.
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])
# Instantiate the attention layer; the number of scoring units is a free choice.
attention = Attention(rnn_cell_size)
context_vector, attention_weights = attention(lstm, state_h)
output = keras.layers.Dense(1, activation='sigmoid')(context_vector)
model = keras.Model(inputs=sequence_input, outputs=output)
# summarize layers
model.summary()
The last layer is densely connected with a single output node. With the sigmoid activation function, its output is a float between 0 and 1, representing a probability, or confidence level.
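Once the model has been compiled and trained (sections 5 and 6 below), that probability can be turned into a class label with a simple threshold; 0.5 is a conventional cutoff, not something the model requires. A minimal sketch:
# Probabilities above 0.5 are treated as positive reviews.
probs = model.predict(x_test[:5])
labels = (probs > 0.5).astype('int32')
print(probs.ravel(), labels.ravel())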
5. Compile Model
A model needs a loss function and an optimizer for training. Ours is a binary classification problem and the model outputs a probability, so we'll use the binary_crossentropy loss.
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
early_stopping_callback = keras.callbacks.EarlyStopping(monitor='val_loss',
                                                        min_delta=0,
                                                        patience=1,
                                                        verbose=0,
                                                        mode='auto')
6. Train Model
Train the model for 10 epochs in mini-batches of 200 samples. This is 10 passes over all the samples in x_train and y_train, unless the early-stopping callback defined above halts training sooner.
history = model.fit(x_train,
                    y_train,
                    epochs=10,
                    batch_size=200,
                    validation_split=.3,
                    verbose=1,
                    callbacks=[early_stopping_callback])
7. Evaluate Model
Let’s see how the model performs on the test set. Two values are returned: the loss and the accuracy.
result = model.evaluate(x_test, y_test)
print(result)
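Since the whole point of the attention layer is interpretability, one way to look inside is a second functional model that shares the trained layers but outputs the attention weights. The sketch below assumes the sequence_input and attention_weights tensors from the code above are still in scope:
# A read-only companion model that exposes the attention weights.
attention_model = keras.Model(inputs=sequence_input, outputs=attention_weights)

# Weights for the first test review: shape (1, max_len, 1), one weight per token.
weights = attention_model.predict(x_test[:1])[0, :, 0]

# Show the ten tokens the model attended to most.
top = weights.argsort()[-10:][::-1]
for pos in top:
    print(idx2word.get(x_test[0][pos], '<oov>'), round(float(weights[pos]), 4))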