In this tutorial, we’re going to build a recurrent neural network that classifies reviews. Classifiers like this can be used to improve online conversation; today we’ll focus on building one that labels movie reviews as positive or negative.

If you’ve spent any time in online comment sections or on social websites, you’ve probably run into this kind of classification. In this tutorial, we’ll use an attention mechanism to focus on the words that are most useful for classification.


We’ll use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced: they contain an equal number of positive and negative reviews.

The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed such that the sequences of words have been converted to sequences of integers, where each integer represents a specific word in a dictionary.
The following code downloads the IMDB dataset to your machine or uses a cached copy if you’ve already downloaded it:

import os

import numpy as np
import tensorflow as tf


vocab_size = 20000
sentence_size = 256

batch_size = 128
rnn_cell_size = 128

embedding_size = 100

attention_size = 32
attention_depth = 2

# Number of output classes (positive or negative), used by the prediction layer.
MAX_LABEL = 2


pad_id = 0
start_id = 1
oov_id = 2
index_offset = 2

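# Load the reviews, keeping only the vocab_size most frequent words.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(
    num_words=vocab_size, start_char=start_id,
    oov_char=oov_id, index_from=index_offset)

# Map each word to its frequency rank (1 = most frequent).
word2idx = tf.keras.datasets.imdb.get_word_index()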

idx2word = {v + index_offset: k for k, v in word2idx.items()}

idx2word[pad_id] = '<PAD>'
idx2word[start_id] = '<START>'
idx2word[oov_id] = '<OOV>'

The argument num_words=vocab_size (20,000 here) keeps the most frequently occurring words in the training data. Rarer words are replaced by the out-of-vocabulary id to keep the size of the data manageable.
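To see how the integer encoding and the reserved ids fit together, here is a minimal sketch that decodes a sequence back into text. The tiny `toy_word_index` and the `decode` helper are hypothetical and only stand in for the real word index; the reserved ids mirror the constants defined above.

```python
# Reserved ids, mirroring the constants defined above.
pad_id, start_id, oov_id, index_offset = 0, 1, 2, 2

# Hypothetical stand-in for the real word index (word -> frequency rank, starting at 1).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Shift every rank by index_offset so that ids 0, 1 and 2 stay reserved.
toy_idx2word = {rank + index_offset: word for word, rank in toy_word_index.items()}
toy_idx2word[pad_id] = "<PAD>"
toy_idx2word[start_id] = "<START>"
toy_idx2word[oov_id] = "<OOV>"

def decode(ids, idx2word):
    """Map a sequence of integer ids back to a space-separated string."""
    return " ".join(idx2word.get(i, "<OOV>") for i in ids)

encoded = [start_id, 3, 4, 5, oov_id, 6, pad_id, pad_id]
decoded = decode(encoded, toy_idx2word)
print(decoded)
```

Decoding a few real reviews this way is a quick sanity check that the offsets and the reserved ids line up with what the model will see.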

Prepare Input Dataset

Since the movie reviews must be the same length, we will use the pad_sequences function to standardize the lengths:

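# Pad (or truncate) every review to sentence_size ids.
# NOTE: the arguments after maxlen were cut off in the source;
# padding='post' and value=pad_id are assumptions.
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=sentence_size,
                                                        padding='post', value=pad_id)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=sentence_size,
                                                       padding='post', value=pad_id)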

def parse(x, y):
    features = {"x": x}
    return features, y

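def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train))
    dataset = dataset.repeat(None)  # repeat indefinitely; `steps` bounds training
    dataset = dataset.batch(batch_size=batch_size)
    dataset = dataset.map(parse)  # wrap the features in the {"x": ...} dict
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(parse)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()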
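The padding step at the start of this section can be pictured with a small pure-Python sketch. The helper `pad_to_length` is hypothetical and is not how Keras implements it. It pads and truncates at the end of each review, which is an assumption: `pad_sequences` pads and truncates at the start by default unless told otherwise.

```python
def pad_to_length(sequences, maxlen, pad_value=0):
    """Truncate or right-pad each list of ids to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:maxlen])                       # drop anything past maxlen
        clipped += [pad_value] * (maxlen - len(clipped))   # fill the rest with pad_value
        padded.append(clipped)
    return padded

reviews = [[1, 14, 22], [1, 5, 9, 31, 7, 12]]
batch = pad_to_length(reviews, maxlen=4)
print(batch)
```

Once every row has the same length, the reviews can be stacked into one rectangular array and sliced into batches.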

Build Model

The framework we’re going to use is the Embed, Encode, Attend and Predict framework introduced by Matthew Honnibal. It’s a way of encapsulating some of the most common techniques in natural language processing into reusable blocks that play well with each other.

Embed is how you transform words into vectors. To do deep learning on text, you need to turn words into numbers that you can compute with.

The raw vectors you get from embedding might not be the most useful. Encoding, which is itself done by a neural network, turns them into more useful ones. Not all of the encoded vectors are going to be equally useful, so attention lets you focus on the ones that matter most. Finally, the network predicts a label from the result.
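To make the Embed step concrete, here is a minimal NumPy sketch of an embedding lookup. The table is random and the sizes are hypothetical toy values; in the model itself the table comes from pre-trained vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocab_size, toy_embedding_size = 6, 4   # hypothetical toy sizes
embedding_table = rng.normal(size=(toy_vocab_size, toy_embedding_size))

# A batch of two padded "sentences" of word ids.
word_ids = np.array([[1, 3, 5, 0],
                     [1, 2, 0, 0]])

# Integer indexing picks one row of the table per word id.
embedded = embedding_table[word_ids]
print(embedded.shape)  # (2, 4, 4): batch, words per sentence, embedding size
```

Each word id is simply replaced by its row in the table, so a batch of id sequences becomes a batch of vector sequences.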

1. Embed: Pre-Trained Word Embeddings

Any time you do deep learning on natural language, you generate some embeddings along the way, and these embeddings can be useful in other problems. Two widely used sets of pre-trained embeddings are Word2Vec from Google and GloVe from Stanford. In this tutorial, we’ll use GloVe embeddings.

def load_glove_embeddings(path):
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            w = values[0]
            vectors = np.asarray(values[1:], dtype='float32')
            embeddings[w] = vectors

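    # Words without a GloVe vector (and the reserved ids) keep a random initialization.
    embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_size))
    for w, i in word2idx.items():
        idx = i + index_offset  # same shift as idx2word, so rows match the ids in x_train
        v = embeddings.get(w)
        if v is not None and idx < vocab_size:
            embedding_matrix[idx] = v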

    embedding_matrix = embedding_matrix.astype(np.float32)

    return embedding_matrix

embedding_matrix = load_glove_embeddings('/home/manu/PycharmProjects/image_caption/glove.6B.100d.txt')

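# RNN model function
def rnn_model(features, labels, mode, params):
    # Embed: look up a vector for every word id, initialized from the GloVe matrix.
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])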

2. Encode: Bidirectional RNN

When you use a regular RNN, you only learn the dependencies of each word in one direction: each word can learn from the words before it but not from the words after it. In some use cases it’s useful to have the whole context of a word, in both directions. There’s a simple trick for this: use two RNNs, one that goes forward and one that goes backward. Together they give you context in both directions, so you obtain two output vectors.

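    # Encode (still inside rnn_model): one GRU reads forward, the other backward.
    rnn_fw_cell = tf.nn.rnn_cell.GRUCell(rnn_cell_size)
    rnn_bw_cell = tf.nn.rnn_cell.GRUCell(rnn_cell_size)

    outputs, state = tf.nn.bidirectional_dynamic_rnn(rnn_fw_cell, rnn_bw_cell,
                                                     input_layer, dtype=tf.float32)
    # outputs is a (forward, backward) pair; concatenate along the feature axis.
    bi_rnn_outputs = tf.concat(outputs, axis=2)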

You now have two output vectors per word. You can simply glue them together by concatenation to get one vector again, and use that for prediction.
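The forward/backward trick can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: it uses a plain tanh RNN rather than the GRU cells in the model, random weights, and hypothetical toy sizes.

```python
import numpy as np

def simple_rnn(inputs, w_x, w_h):
    """Run a plain tanh RNN over inputs of shape (time, features)."""
    hidden = np.zeros(w_h.shape[0])
    outputs = []
    for x_t in inputs:
        hidden = np.tanh(x_t @ w_x + hidden @ w_h)
        outputs.append(hidden)
    return np.stack(outputs)          # (time, hidden_size)

rng = np.random.default_rng(0)
time_steps, features, hidden_size = 5, 3, 4   # hypothetical toy sizes
sentence = rng.normal(size=(time_steps, features))

# Separate weights for each direction.
fw_wx = rng.normal(size=(features, hidden_size))
fw_wh = rng.normal(size=(hidden_size, hidden_size))
bw_wx = rng.normal(size=(features, hidden_size))
bw_wh = rng.normal(size=(hidden_size, hidden_size))

forward = simple_rnn(sentence, fw_wx, fw_wh)
# Feed the sentence reversed, then flip the outputs back so step t lines up.
backward = simple_rnn(sentence[::-1], bw_wx, bw_wh)[::-1]

bi_outputs = np.concatenate([forward, backward], axis=-1)
print(bi_outputs.shape)  # (5, 8)
```

After concatenation, the vector at each position has seen the words before it (first half) and the words after it (second half).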

3. Attention

If you stopped before attention, you would still have a classifier, and it should still work. But it would only use the output from the last cell as input to the predictor. There is still information in the other cells, and for long sentences this information might get lost by the time it reaches the last cell, so you’d like to somehow keep it.


You have the outputs of all the cells, so you can do something simple, like taking the straight average of these vectors and feeding that to your predictor. But not all of this information is equally important, so you might want to use a weighted average instead.

You want weights that tell you which words are the most important and which are less important. The question is how to come up with these weights. You could try to learn them from scratch, but that’s not so good: you probably want to use some information about each of the words when computing them.

So train a very small neural network on the output of each cell, whose whole job is to vote on how important that word is. This little network gives you a weight, alpha, that tells you how important the cell is. Then you take the weighted sum and feed that into your predictor. That’s how you do attention.

Attention is just how big these weights are: that’s how much attention you pay to each of the words, as decided by this mini neural network.

Normalize your attention weights.
That’s the last important trick, and it’s just a technical one. When the alphas (a1 for output h1, and so on) first come out of the neural network, they can be anything: they can be negative or they can be positive. So we’re going to use softmax, which turns them into numbers between 0 and 1 that add up to one. That makes for a nice weighted average.
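Written out, the pieces described above fit together as follows. The notation is our own shorthand rather than something taken from the original post: h_t is the encoder output for word t, f is the little scoring network, e_t is its raw vote, alpha_t is the normalized weight, T is the sentence length, and c is the weighted average passed to the predictor.

```latex
e_t = f(h_t), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, h_t
```

Because the softmax makes every alpha_t non-negative and forces them to sum to one, c is a proper weighted average of the h_t.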

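    # Attend (still inside rnn_model): score every time step with a small dense network.
    sequence_length = bi_rnn_outputs.shape[1].value
    final_layer_size = bi_rnn_outputs.shape[2].value

    # Flatten to (batch * time, units) so the dense layers score each word independently.
    x = tf.reshape(bi_rnn_outputs, [-1, final_layer_size])

    for _ in range(attention_depth - 1):
        x = tf.layers.dense(x, attention_size, activation=tf.nn.relu)

    x = tf.layers.dense(x, 1, activation=None)

    # Back to (batch, time, 1), then normalize over the time axis.
    attention_logits = tf.reshape(x, [-1, sequence_length, 1])
    alphas = tf.nn.softmax(attention_logits, dim=1)

    # Weighted sum of the RNN outputs over time: (batch, units).
    encoding = tf.reduce_sum(bi_rnn_outputs * alphas, 1)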

The attention function is very simple: it’s just dense layers back to back, followed by a little reshaping and a softmax. With attention_depth = 2, it is basically a two-layer dense neural network.
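For readers who want to check the shapes without building a TensorFlow graph, here is a minimal NumPy sketch of the same idea: a two-layer scoring network, a softmax over time, and a weighted sum. The weights are random and the sizes are hypothetical toy values, so this illustrates the computation rather than reproducing the trained layer.

```python
import numpy as np

def softmax(scores, axis):
    shifted = scores - scores.max(axis=axis, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention_pool(outputs, w1, b1, w2, b2):
    """outputs: (batch, time, units). Returns (encoding, alphas)."""
    hidden = np.maximum(outputs @ w1 + b1, 0.0)   # dense + ReLU, applied per time step
    scores = hidden @ w2 + b2                     # (batch, time, 1): one vote per word
    alphas = softmax(scores, axis=1)              # normalize over the time axis
    encoding = (outputs * alphas).sum(axis=1)     # weighted sum -> (batch, units)
    return encoding, alphas

rng = np.random.default_rng(0)
batch, time_steps, units, attn_size = 2, 5, 8, 3   # hypothetical toy sizes
rnn_outputs = rng.normal(size=(batch, time_steps, units))
w1, b1 = rng.normal(size=(units, attn_size)), np.zeros(attn_size)
w2, b2 = rng.normal(size=(attn_size, 1)), np.zeros(1)

encoding, alphas = attention_pool(rnn_outputs, w1, b1, w2, b2)
print(encoding.shape, alphas.shape)
```

The weights for each sentence sum to one, so `encoding` stays on the same scale as the individual RNN outputs regardless of sentence length.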

4. Prediction

The final step, prediction, is again just one, usually very small, dense layer. It has two outputs because we have two classes, positive and negative. The prediction is whichever of these two outputs is bigger, which is what argmax computes. To compute the loss and train the network, we use softmax cross-entropy.
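As a quick illustration of those two operations, here is a minimal NumPy sketch of argmax prediction and softmax cross-entropy on made-up logits. The helper `softmax_cross_entropy` is our own and averages over the batch, which is an assumption about the reduction rather than a statement about the TensorFlow loss.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.5],    # leans towards class 0 (negative)
                   [0.1, 1.5]])   # leans towards class 1 (positive)
labels = np.array([0, 1])

predicted_classes = logits.argmax(axis=1)
loss = softmax_cross_entropy(logits, labels)
print(predicted_classes, loss)
```

The loss shrinks as the logit of the correct class pulls ahead of the other one, which is exactly what training pushes the network towards.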

    logits = tf.layers.dense(encoding, MAX_LABEL, activation=None)

    predicted_classes = tf.argmax(logits, 1)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={
                'class': predicted_classes,
                'prob': tf.nn.softmax(logits),
                'attention': alphas
            })

    onehot_labels = tf.one_hot(labels, MAX_LABEL, 1, 0)

    loss = tf.losses.softmax_cross_entropy(
        onehot_labels=onehot_labels, logits=logits)

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
        train_op = optimizer.minimize(loss,
                                      global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    eval_metric_ops = {
        'accuracy': tf.metrics.accuracy(
            labels=labels, predictions=predicted_classes),
        'auc': tf.metrics.auc(
            labels=labels, predictions=predicted_classes),
    }
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

Create the Estimator

Next, let’s create an Estimator, a TensorFlow class for performing high-level model training, evaluation, and inference, for our model.

model_dir = os.path.join('./tf_model')

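def my_initializer(shape=None, dtype=tf.float32, partition_info=None):
    # TensorFlow calls variable initializers with (shape, dtype, partition_info).
    assert dtype is tf.float32
    return embedding_matrix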

params = {'embedding_initializer': my_initializer}

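# NOTE: the arguments after save_checkpoints_secs were cut off in the source;
# passing model_dir here is an assumption.
run_config = tf.estimator.RunConfig(save_checkpoints_secs=500,
                                    model_dir=model_dir)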

classifier = tf.estimator.Estimator(model_fn=rnn_model, config=run_config, params=params)

The model_fn argument specifies the model function to use for training, evaluation, and prediction.

Train the Model

Now we’re ready to train our model, which we can do by calling train() on the classifier.

classifier.train(input_fn=train_input_fn, steps=2000)

Evaluate the Model

Once training is complete, we want to evaluate our model to determine its accuracy on the test set. We call the evaluate() method, which computes the metrics we defined in eval_metric_ops.

scores = classifier.evaluate(input_fn=eval_input_fn)
print('Accuracy: {0:f}'.format(scores['accuracy']))
print('AUC: {0:f}'.format(scores['auc']))

predictions = list(classifier.predict(input_fn=eval_input_fn))
for p, l in zip(predictions, y_test):
    print(p['class'], l)
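Since each prediction also carries the attention weights, it can be interesting to look at which words the model focused on. Here is a minimal sketch with made-up values: `top_attended_words` is a hypothetical helper, and the toy ids and weights stand in for one padded review and its `p['attention']` entry, assumed to be flattened to one weight per position.

```python
def top_attended_words(word_ids, attention, idx2word, pad_id=0, k=3):
    """Return the k non-padding words with the largest attention weight."""
    scored = [(float(weight), idx2word.get(int(word_id), "<OOV>"))
              for word_id, weight in zip(word_ids, attention)
              if word_id != pad_id]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

# Hypothetical values standing in for one review and its attention weights.
toy_idx2word = {0: "<PAD>", 1: "<START>", 3: "a", 4: "truly", 5: "wonderful", 6: "film"}
toy_ids = [1, 3, 4, 5, 6, 0, 0]
toy_attention = [0.02, 0.03, 0.20, 0.55, 0.15, 0.03, 0.02]

top_words = top_attended_words(toy_ids, toy_attention, toy_idx2word, k=2)
print(top_words)
```

With the real model, passing a row of x_test, the matching flattened attention weights, and idx2word gives a rough view of which words drove each prediction.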