In this tutorial, we use Keras, TensorFlow's high-level API, to build an encoder-decoder architecture for image captioning. We also use the TensorFlow Dataset API to build an input pipeline that feeds data into the Keras model.

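Image captioning models combine a convolutional neural network (CNN), which encodes the image, with a recurrent network such as a Long Short-Term Memory (LSTM) network, which generates the caption. In this tutorial the recurrent part is built from GRU layers. Once trained, the model can create captions for your own images.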

Download Dataset

In this tutorial, we use the Flickr8k dataset. It contains 8,000 images, each paired with five different captions that describe the image. For simplicity, we use only one caption per image.



!unzip -d all_images
!unzip -d all_captions
import tensorflow as tf

import numpy as np
import matplotlib.pyplot as plt

import pickle
from tqdm import tqdm
import os

from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.preprocessing import image
from tensorflow.keras.layers import LSTM, GRU, Embedding, Input, Dense, Activation, Flatten, RepeatVector, TimeDistributed

Load Data

First, we load the images and their captions.

image_dir = 'all_images/Flicker8k_Dataset/'
token_file = 'all_captions/Flickr8k.token.txt'
captions = open(token_file, 'r').read().strip().split('\n')

start='<start> '
end=' <end>'

image_caption_mapping = {}

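for row in captions:
    # Each row has the form "<image file>#<caption number>\t<caption text>".
    row = row.split('\t')
    image_id = row[0][:-2]  # drop the trailing "#<caption number>"
    # Keep only the first caption of each image that exists on disk.
    if image_id not in image_caption_mapping:
        if os.path.isfile(image_dir + image_id):
            # The body of this branch is missing from the text; storing the
            # caption wrapped in the start/end tokens is an assumption.
            image_caption_mapping[image_id] = start + row[1] + end

# Assumed definitions: later code uses these names without defining them.
all_images = list(image_caption_mapping.keys())
all_captions = list(image_caption_mapping.values())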


We use the strings ‘<start>’ and ‘<end>’ to mark the start and end of each caption. These tokens are added to the captions as they are loaded. It is important to do this before we encode the text so that the tokens are encoded along with the other words.
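The token handling described above can be sketched on a single row. This is a minimal illustration, assuming each row of the token file has the form `<image file>#<caption number>`, a tab, then the caption text; `parse_row` and the sample row are hypothetical and not part of the tutorial code.

```python
START, END = '<start> ', ' <end>'

def parse_row(row):
    """Split one token-file row into (image_id, caption with start/end tokens)."""
    image_ref, caption = row.split('\t', 1)
    # Drop the "#<caption number>" suffix to recover the image file name.
    image_id = image_ref.rsplit('#', 1)[0]
    return image_id, START + caption.strip() + END

# Hypothetical sample row, written in the assumed format.
sample_row = 'example_image.jpg#0\tA dog runs across the grass .'
image_id, caption = parse_row(sample_row)
print(image_id)
print(caption)
```

Because the tokens are part of the caption string before tokenization, they receive vocabulary indices like any other word.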

Encode images using InceptionV3

Next, we use InceptionV3 (pre-trained on ImageNet) to encode each image. We extract the features from the layer just before the final classification layer, which gives one 2048-element vector per image.

First, we need to convert the images into the format InceptionV3 expects: resize each image to (299, 299), then use the preprocessing method to scale the pixels to the range -1 to 1 (to match the format of the images used to train InceptionV3).

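def preprocess(image_path):
    # Load the image and resize it to the 299x299 input size InceptionV3 expects.
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    # Add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3).
    x = np.expand_dims(x, axis=0)
    # Scale the pixels from [0, 255] to [-1, 1]; dividing by 255 alone would
    # only reach [0, 1].
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return x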
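The pixel scaling can be checked in isolation with NumPy. This is a small sketch of the arithmetic only, assuming 8-bit pixel values in [0, 255]; `scale_to_unit_range` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    pixels = np.asarray(pixels, dtype=np.float32)
    return pixels / 127.5 - 1.0

# The darkest, middle, and brightest pixel values.
sample = np.array([0.0, 127.5, 255.0])
scaled = scale_to_unit_range(sample)
print(scaled)
```

Black maps to -1, mid-grey to 0, and white to 1, which is the range the prose above asks for.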

Next, we initialize InceptionV3 and load the pre-trained ImageNet weights. We create a tf.keras model whose output is the second-to-last layer of the InceptionV3 architecture, that is, the pooled features that feed the final classification layer.

inception_model = tf.keras.applications.InceptionV3(weights='imagenet')

model_input = inception_model.input
hidden_layer = inception_model.layers[-2].output

inception_model = tf.keras.Model(model_input, hidden_layer)

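def encode(img_id):
    # Run one image through InceptionV3 and return its feature vector.
    img = preprocess(image_dir + img_id)
    encoding = inception_model.predict(img)
    encoding = np.reshape(encoding, encoding.shape[1])
    return encoding

# Map each image name to its feature vector.
encode_image = {}
for img_id in tqdm(all_images):
    encode_image[img_id] = encode(img_id)

# Save the encodings to disk so they only need to be computed once.
with open("image_encoding.p", "wb") as encoded_pickle:
    pickle.dump(encode_image, encoded_pickle)

with open("image_encoding.p", "rb") as encoded_pickle:
    encode_image = pickle.load(encoded_pickle)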

Each image is forwarded through the network and the vector that we get at the end is stored in a dictionary (image_name –> feature_vector). After all the images are passed through the network, we pickle the dictionary and save it to disk.
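Since encoding every image is slow, the pickled dictionary can double as a cache. The sketch below shows one way to do that under stated assumptions: `load_or_compute` and `fake_encode_all` are hypothetical names, and the tiny dictionary stands in for the real image encodings.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the pickled result if it exists, otherwise compute and store it."""
    if os.path.isfile(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []

def fake_encode_all():
    # Stand-in for the expensive pass over all images.
    calls.append(1)
    return {'a.jpg': [0.1, 0.2], 'b.jpg': [0.3, 0.4]}

cache_file = os.path.join(tempfile.mkdtemp(), 'image_encoding.p')
first = load_or_compute(cache_file, fake_encode_all)   # computes and saves
second = load_or_compute(cache_file, fake_encode_all)  # loads from disk
print(len(calls))
```

The second call reads the file instead of recomputing, so the expensive step runs only once.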

Tokenize Captions

First, we’ll tokenize the captions by splitting them into words. This gives us a vocabulary of all the unique words in the data.

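tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
# Build the vocabulary from the captions. This call is missing from the text as
# given, but `word_index` stays empty without it.
tokenizer.fit_on_texts(all_captions)

# Reserve index 0 for padding.
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'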

token_start = tokenizer.word_index[start.strip()]
token_end = tokenizer.word_index[end.strip()]

all_captions_seq = tokenizer.texts_to_sequences(all_captions)
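To make the tokenization step concrete, here is a plain-Python sketch of a word-to-index vocabulary. It is not the Keras implementation; `build_vocab` and `to_sequence` are hypothetical helpers that only mimic the idea of reserving 0 for `<pad>` and mapping unknown words to `<unk>`.

```python
from collections import Counter

def build_vocab(captions, oov_token='<unk>'):
    """Assign an integer index to every word, most frequent words first."""
    counts = Counter(word for c in captions for word in c.lower().split())
    word_index = {'<pad>': 0, oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index)
    return word_index

def to_sequence(caption, word_index, oov_token='<unk>'):
    """Replace each word with its index, using the OOV index for unseen words."""
    return [word_index.get(w, word_index[oov_token]) for w in caption.lower().split()]

captions = ['<start> a dog runs <end>', '<start> a cat sleeps <end>']
word_index = build_vocab(captions)
seq = to_sequence('<start> a bird runs <end>', word_index)
print(seq)
```

The word "bird" never appears in the sample captions, so it is encoded with the `<unk>` index. The tutorial's filter string leaves out `<` and `>`, which is what lets the `<start>` and `<end>` tokens survive tokenization.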

We then pad all sequences to the length of the longest one and split the data into training and validation sets.

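from sklearn.model_selection import train_test_split

all_captions_seq = tf.keras.preprocessing.sequence.pad_sequences(all_captions_seq, padding='post')

# The arguments after `all_images` are cut off in the text. Passing the padded
# captions is implied by the names on the left; any split-size argument is
# unknown, so scikit-learn's defaults apply here.
img_train, img_val, cap_train, cap_val = train_test_split(all_images,
                                                          all_captions_seq)

def get_image_encoding(image_ids):
    # Look up the stored InceptionV3 encoding of each image. The loop body is
    # missing from the text; this lookup is an assumption.
    encoding = []
    for idx in image_ids:
        encoding.append(encode_image[idx])
    return np.array(encoding)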
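Post-padding is simple enough to show by hand. The sketch below is a plain-Python stand-in for what `pad_sequences(..., padding='post')` does to integer sequences; `pad_post` is a hypothetical helper and the token ids are made up.

```python
def pad_post(sequences, pad_value=0):
    """Pad every sequence at the end so all have the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]

# Two captions of different lengths, already converted to token ids.
padded = pad_post([[2, 5, 3], [2, 7, 8, 9, 3]])
print(padded)
```

Padding at the end keeps the start token in the first position of every caption, which the decoder relies on.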


Create Dataset

Here we use the tf.data Dataset API to feed data into the model.

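def create_dataset(data, labels, batch_size):
  def map_func(img_encode, cap):
    # The decoder input is the caption without its last token; the target is
    # the same caption shifted one step to the left.
    x = {'decoder_input': cap[0:-1], 'encoder_input': img_encode}
    y = {'decoder_output': cap[1:]}
    return x, y
  dataset = tf.data.Dataset.from_tensor_slices((data, labels))
  dataset = dataset.map(map_func)
  dataset = dataset.repeat()
  dataset = dataset.shuffle(data.shape[0]).batch(batch_size)
  # The prefetch buffer size is cut off in the text; 1 is an assumption.
  dataset = dataset.prefetch(1)

  return dataset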
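The slicing in `map_func` sets up next-word prediction: at each position the decoder sees one token and must predict the following one. A small sketch with made-up token ids (2 for `<start>`, 3 for `<end>`, 0 for `<pad>`) shows how the two slices line up.

```python
# Hypothetical padded caption: <start> w1 w2 w3 <end> <pad> <pad>
caption = [2, 7, 8, 9, 3, 0, 0]

decoder_input = caption[:-1]   # everything except the last token
decoder_output = caption[1:]   # the same caption shifted left by one

# Each pair is (token the decoder sees, token it should predict next).
pairs = list(zip(decoder_input, decoder_output))
print(pairs)
```

The first pair asks the model to predict the first word from `<start>`, and the pair at the end of the real caption asks it to predict `<end>`.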

Create Model

We have already encoded the images with InceptionV3 (without its output layer), so the model takes these extracted feature vectors as its input.

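embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1

# BATCH_SIZE is not defined in the text as given; set it before this line.
steps_per_epoch = int(len(img_train) / BATCH_SIZE)

# The surrounding text gives these sizes: 2048-element image encodings mapped
# to GRU states of 512 elements (the same value as `units`).
encoder_shape = 2048
state_size = units

encoder_input = Input(shape=(encoder_shape,), name='encoder_input')

encoder_dense = Dense(state_size, activation='tanh', name='encoder_dense')(encoder_input)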

We use the encoded images to initialize the internal states of the GRU units. This informs the GRU units of the contents of the images. The encoded image values are vectors of length 2048, but the internal state of the GRU units has only 512 elements, so we use a fully-connected layer to map the vectors from 2048 to 512 elements.
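The mapping from 2048 to 512 elements is a single matrix multiplication followed by tanh. The NumPy sketch below illustrates the shapes only, with random weights standing in for the trained layer; the variable names are illustrative and no Keras code is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_shape, state_size = 2048, 512

# Random stand-ins for the trained weights and bias of the dense layer.
W = rng.normal(scale=0.01, size=(encoder_shape, state_size))
b = np.zeros(state_size)

# One fake image encoding with a batch dimension of 1.
image_encoding = rng.random((1, encoder_shape))

# Dense layer with tanh activation: (1, 2048) @ (2048, 512) -> (1, 512).
initial_state = np.tanh(image_encoding @ W + b)
print(initial_state.shape)
```

The tanh activation keeps every element between -1 and 1, the same range as the values a GRU state takes.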

Decoder Model

The decoder model takes sequences of integer tokens, which are fed into an Embedding layer. This is followed by three GRU layers with 512 memory units each.

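decoder_input = Input(shape=(None, ), name='decoder_input')

# The remaining arguments of the layers below are cut off in the text. The
# completions are assumptions based on the names and sizes defined earlier.
decoder_embedding = Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              name='decoder_embedding')

# return_sequences=True makes each GRU emit one output per time step.
decoder_gru1 = GRU(units, name='decoder_gru1',
                   return_sequences=True)
decoder_gru2 = GRU(units, name='decoder_gru2',
                   return_sequences=True)
decoder_gru3 = GRU(units, name='decoder_gru3',
                   return_sequences=True)

# One score per vocabulary word at every time step.
decoder_dense = Dense(vocab_size,
                      name='decoder_output')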

The Decoder model merges the vectors from both input models using an addition operation.

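from tensorflow.keras.models import Model

# `decoder_output` (the tensor produced by `decoder_dense`) is not defined in
# the text as given; it must be built by chaining the decoder layers first.
model = Model(inputs=[encoder_input, decoder_input],
              outputs=[decoder_output])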

Image captioning model summary

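Finally, we compile the model with an optimizer and a loss function. The loss is the sparse cross-entropy between the predicted scores and the integer tokens of the true caption.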

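def sparse_cross_entropy(y_true, y_pred):
    # y_true holds integer tokens; y_pred holds unnormalized scores (logits).
    # The `logits` argument is cut off in the text; y_pred is the only candidate.
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true,
                                                          logits=y_pred)
    # Average over all time steps and all captions in the batch.
    loss_mean = tf.reduce_mean(loss)

    return loss_mean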
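To see what this loss computes, the same quantity can be written in NumPy. This is an illustrative sketch, not the TensorFlow implementation; `sparse_cross_entropy_np` is a hypothetical name, and the inputs are a toy batch of one caption with two time steps and four vocabulary words.

```python
import numpy as np

def sparse_cross_entropy_np(y_true, logits):
    """Mean negative log-probability of the true tokens, computed from logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Log-softmax over the vocabulary axis, shifted for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the true token at each step.
    picked = np.take_along_axis(log_probs, np.asarray(y_true)[..., None], axis=-1)
    return float(-picked.mean())

logits = np.zeros((1, 2, 4))   # equal scores for all four words
y_true = np.array([[1, 3]])    # true tokens at the two time steps
loss = sparse_cross_entropy_np(y_true, logits)
print(loss)
```

With equal scores every word has probability 1/4, so the loss equals ln(4); training lowers it by raising the score of the true next word.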

target_tensor = tf.placeholder(dtype='int32', shape=(None, None))


Callback Functions

We want to save checkpoints and log the progress for TensorBoard, so we create the appropriate callbacks for Keras. This is the callback for writing checkpoints during training.


checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

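callbacks = [
  tf.keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
  # Save the weights after every epoch. Any further arguments are cut off in
  # the text; with none given, ModelCheckpoint saves once per epoch.
  tf.keras.callbacks.ModelCheckpoint(checkpoint_path, verbose=1, save_weights_only=True),
]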
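The `{epoch:04d}` placeholder in `checkpoint_path` is filled in with the epoch number each time a checkpoint is written. The snippet below only demonstrates the string formatting; the epoch numbers are arbitrary examples.

```python
import os

checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"

# Format the path for a few example epochs, zero-padded to four digits.
paths = [checkpoint_path.format(epoch=e) for e in (1, 2, 10)]
checkpoint_dir = os.path.dirname(checkpoint_path)

print(paths)
print(checkpoint_dir)
```

Zero-padding keeps the checkpoint files in training order when they are sorted by name.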

Train the Model

Now we will train the model so it can map encoded values from the image model to sequences of integer tokens for the captions of the images. We train for 10 epochs.
Training the image captioning model

Generate Captions

This function looks up the stored encoding of an image and generates a caption using the model we have trained.

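def generate_caption(image_id, true_caption, max_tokens=30):
    # Look up the stored encoding and add a batch dimension.
    encoder_input = encode_image[image_id]
    encoder_input = np.expand_dims(encoder_input, axis=0)

    # Holds the tokens generated so far; unused positions stay 0 (<pad>).
    shape = (1, max_tokens)
    # The dtype argument is cut off in the text; an integer type is assumed.
    decoder_input = np.zeros(shape=shape, dtype=np.int32)

    token_id = token_start
    output = []  # predicted token ids; the original never defined this list
    count_tokens = 0

    while token_id != token_end and count_tokens < max_tokens:
        decoder_input[0, count_tokens] = token_id

        input_data = {'encoder_input': encoder_input, 'decoder_input': decoder_input}
        predict = model.predict(input_data)
        # Greedy decoding: take the highest-scoring word at the current step.
        token_id = np.argmax(predict[0, count_tokens, :])
        output.append(token_id)
        count_tokens += 1

    print('Predicted caption', tokenizer.sequences_to_texts([output]))
    print('True captions', tokenizer.sequences_to_texts([true_caption]))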
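The decoding loop can be exercised without a trained model. In the sketch below, `stub_predict` is a hypothetical stand-in for `model.predict` that always favours a fixed script of tokens, and `greedy_decode` follows the same greedy loop as above; the token ids (2 for start, 3 for end) are made up.

```python
import numpy as np

def greedy_decode(predict_fn, token_start, token_end, max_tokens=30):
    """Feed the tokens generated so far back in until the end token appears."""
    decoder_input = np.zeros((1, max_tokens), dtype=np.int32)
    token_id = token_start
    output = []
    count_tokens = 0
    while token_id != token_end and count_tokens < max_tokens:
        decoder_input[0, count_tokens] = token_id
        scores = predict_fn(decoder_input)
        # Pick the highest-scoring word at the current time step.
        token_id = int(np.argmax(scores[0, count_tokens, :]))
        output.append(token_id)
        count_tokens += 1
    return output

script = [5, 6, 7, 3]  # tokens the stub "predicts", ending with the end token
vocab = 8

def stub_predict(decoder_input):
    # Scores of shape (batch, time steps, vocabulary) that favour the script.
    scores = np.zeros((1, decoder_input.shape[1], vocab))
    for t, tok in enumerate(script):
        scores[0, t, tok] = 1.0
    return scores

decoded = greedy_decode(stub_predict, token_start=2, token_end=3, max_tokens=10)
print(decoded)
```

The loop stops as soon as the end token is produced, or after `max_tokens` steps if it never appears. Greedy decoding keeps only the single best word at each step, which is simple but can miss better captions that a wider search would find.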