In this tutorial, we use Keras, TensorFlow's high-level API, to build an encoder-decoder architecture for image captioning. We also use the TensorFlow Dataset API to build a simple input pipeline that feeds data into the Keras model.
Image captioning models combine a convolutional neural network (CNN), which encodes the image, with a recurrent network (such as an LSTM or GRU), which generates a caption for your own images.
Download Dataset
In this tutorial, we use the Flickr8k dataset. It contains 8,000 images, each paired with five different captions that provide clear descriptions of the image. Although the dataset has multiple descriptions per image, for simplicity we use only one description per image.
!wget https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
!wget https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip
!unzip Flickr8k_Dataset.zip -d all_images
!unzip Flickr8k_text.zip -d all_captions
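It can help to glance at the caption file we just unzipped. Assuming the standard Flickr8k layout, each line pairs an image file name plus a '#n' caption index with one caption, separated by a tab:
# Optional peek at the raw annotation format: "<image_name>#<caption_index>\t<caption>"
with open('all_captions/Flickr8k.token.txt') as f:
    for line in f.readlines()[:3]:
        print(line.rstrip())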
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pickle
from tqdm import tqdm
import os
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.layers import LSTM, GRU, Embedding, Input, Dense, Activation, Flatten, RepeatVector, TimeDistributed
Load Data
First, we load the image and text data so that we can use it in tf.data.
image_dir = 'all_images/Flicker8k_Dataset/'
token_file = 'all_captions/Flickr8k.token.txt'
captions = open(token_file, 'r').read().strip().split('\n')

start = '<start> '
end = ' <end>'

image_caption_mapping = {}
for i, row in enumerate(captions):
    row = row.split('\t')
    # Drop the '#n' caption index from the image id, e.g. 'xyz.jpg#0' -> 'xyz.jpg'
    image_id = row[0][:len(row[0])-2]
    # Keep only the first caption for each image that actually exists on disk
    if image_id not in image_caption_mapping:
        if os.path.isfile(image_dir + image_id):
            image_caption_mapping[image_id] = start + row[1] + end

all_images = list(image_caption_mapping.keys())
all_captions = list(image_caption_mapping.values())
We use the special tokens '<start>' and '<end>' to mark the beginning and end of each caption sequence. These tokens are added as the descriptions are loaded. It is important to do this now, before we encode the text, so that the tokens are encoded along with the rest of the words.
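As a quick check (purely illustrative), printing one of the loaded captions shows the wrapping tokens in place:
# Every caption should now start with '<start>' and finish with '<end>'
print(all_captions[0])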
Encode images using InceptionV3
Next, we will use InceptionV3 (pre-trained on ImageNet) to encode each image. We will extract the 2048-element feature vector produced by the layer just before the final classification layer.
First, we need to convert the images into the format InceptionV3 expects: resize each image to 299 x 299 pixels and scale the pixel values into the range -1 to 1 (to match the format of the images used to train InceptionV3). The preprocess function below handles both steps.
def preprocess(image_path):
    # Load the image and resize it to the 299 x 299 input size of InceptionV3
    img = image.load_img(image_path, target_size=(299, 299))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    # Scale pixel values into the range -1 to 1, as InceptionV3 expects
    x = preprocess_input(x)
    return x
Next, we will initialize InceptionV3 and load the pre-trained ImageNet weights. We'll create a tf.keras model whose output is the 2048-element vector from the layer just before the classification layer in the InceptionV3 architecture.
inception_model = tf.keras.applications.InceptionV3(weights='imagenet')
model_input = inception_model.input
hidden_layer = inception_model.layers[-2].output
inception_model = tf.keras.Model(model_input, hidden_layer)
def encode(img_id):
    img = preprocess(image_dir + img_id)
    encoding = inception_model.predict(img)
    # Flatten the (1, 2048) prediction into a 2048-element vector
    encoding = np.reshape(encoding, encoding.shape[1])
    return encoding

# Encode every image once and cache the feature vectors in a dictionary
encode_image = {}
for img_id in tqdm(all_images):
    encode_image[img_id] = encode(img_id)

# Pickle the encodings so this slow step does not have to be repeated
with open("image_encoding.p", "wb") as encoded_pickle:
    pickle.dump(encode_image, encoded_pickle)

encode_image = pickle.load(open('image_encoding.p', 'rb'))
Each image is forwarded through the network once, and the resulting feature vector is stored in a dictionary (image_name -> feature_vector). After all the images have been processed, we pickle the dictionary and save it to disk.
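As a quick sanity check (illustrative only), each stored encoding should be a single 2048-element feature vector:
print(encode_image[all_images[0]].shape)  # expected: (2048,)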
Tokenize Captions
First, we'll tokenize the captions by splitting them into words. This gives us a vocabulary of all the unique words in the data.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(all_captions)

# Reserve index 0 for the padding token
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

token_start = tokenizer.word_index[start.strip()]
token_end = tokenizer.word_index[end.strip()]
all_captions_seq = tokenizer.texts_to_sequences(all_captions)
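A quick look at one caption and its integer encoding (illustrative only) shows how each word, including the start and end tokens, maps to a token ID:
print(all_captions[0])
print(all_captions_seq[0])
print('start token id:', token_start, ' end token id:', token_end)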
We then pad all sequences to the same length (that of the longest caption) and split the images and captions into training and validation sets.
from sklearn.model_selection import train_test_split

all_captions_seq = tf.keras.preprocessing.sequence.pad_sequences(all_captions_seq, padding='post')

img_train, img_val, cap_train, cap_val = train_test_split(all_images,
                                                           all_captions_seq,
                                                           test_size=0.2,
                                                           random_state=0)

def get_image_encoding(image_ids):
    encoding = []
    for idx in image_ids:
        encoding.append(encode_image[idx])
    return np.array(encoding)

encode_train = get_image_encoding(img_train)
encode_val = get_image_encoding(img_val)
Create Dataset
Here we use the tf.data Dataset API to feed data into the model. For each example, the decoder input is the caption without its last token, and the decoder target is the caption shifted one step ahead (without its first token), so the model always learns to predict the next word.
def create_dataset(data, labels, batch_size):
    def map_func(img_encode, cap):
        # Decoder input: caption without its last token; target: caption shifted left by one
        x = {'decoder_input': cap[0:-1], 'encoder_input': img_encode}
        y = {'decoder_output': cap[1:]}
        return x, y

    dataset = tf.data.Dataset.from_tensor_slices((data, labels))
    dataset = dataset.map(map_func)
    dataset = dataset.repeat()
    dataset = dataset.shuffle(data.shape[0]).batch(batch_size)
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset
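As a quick sanity check (assuming eager execution, e.g. TensorFlow 2.x), we can pull one small batch and confirm the shapes the model will receive:
sample_ds = create_dataset(encode_train, cap_train, 4)
for x_batch, y_batch in sample_ds.take(1):
    print(x_batch['encoder_input'].shape)   # (4, 2048) image feature vectors
    print(x_batch['decoder_input'].shape)   # (4, caption_length - 1) input tokens
    print(y_batch['decoder_output'].shape)  # (4, caption_length - 1) target tokens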
Create Model
We have already pre-processed each image with InceptionV3 (without its output layer), so the encoder receives the extracted 2048-element feature vector rather than the raw image.
BATCH_SIZE = 64
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
steps_per_epoch = int(len(img_train) / BATCH_SIZE)
encoder_shape = 2048
encoder_input = Input(shape=(encoder_shape,), name='encoder_input')
# Map the 2048-element image vector down to the GRU state size
encoder_dense = Dense(units, activation='tanh', name='encoder_dense')(encoder_input)
We use the encoded image to initialize the internal states of the GRU units. This informs the GRU units of the content of the image. The encoded image values are vectors of length 2048, but the internal state of each GRU layer has only 512 elements, so we use a fully-connected layer to map the vectors from 2048 to 512 elements.
Decoder Model
The decoder model takes input sequences of token IDs, which are fed into an Embedding layer. This is followed by three GRU layers, whose initial state is the mapped image vector, and a final Dense layer that outputs one logit per word in the vocabulary.
decoder_input = Input(shape=(None,), name='decoder_input')
decoder_embedding = Embedding(input_dim=vocab_size,
                              output_dim=embedding_dim,
                              name='decoder_embedding')(decoder_input)

# The mapped image vector initializes the state of each GRU layer
decoder_gru1 = GRU(units, name='decoder_gru1',
                   return_sequences=True)(decoder_embedding, initial_state=encoder_dense)
decoder_gru2 = GRU(units, name='decoder_gru2',
                   return_sequences=True)(decoder_gru1, initial_state=encoder_dense)
decoder_gru3 = GRU(units, name='decoder_gru3',
                   return_sequences=True)(decoder_gru2, initial_state=encoder_dense)

decoder_dense = Dense(vocab_size,
                      activation='linear',
                      name='decoder_output')(decoder_gru3)
The full model connects the two branches in a single Keras Model: the image branch supplies the initial GRU state, and the caption branch supplies the input sequence.
from tensorflow.keras.models import Model

model = Model(inputs=[encoder_input, decoder_input],
              outputs=[decoder_dense])
model.summary()

Finally, we compile the model with an optimizer and a loss function. Because the decoder outputs raw logits and the targets are integer token IDs, we use a sparse softmax cross-entropy loss.
def sparse_cross_entropy(y_true, y_pred):
    # y_true holds integer token IDs, y_pred holds logits over the vocabulary
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true,
                                                          logits=y_pred)
    loss_mean = tf.reduce_mean(loss)
    return loss_mean

# Keras target placeholder for the integer token IDs (note: tf.placeholder is TF 1.x graph mode only)
target_tensor = tf.placeholder(dtype='int32', shape=(None, None))

model.compile(optimizer=RMSprop(lr=1e-3),
              loss=sparse_cross_entropy,
              target_tensors=[target_tensor])
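As a small illustration (assuming eager execution so the result prints directly), the loss compares integer token IDs against logits with no one-hot encoding needed; here the largest logit at each step matches the true token, so the loss is small:
dummy_true = tf.constant([[1, 3]])                   # true token IDs for two time steps
dummy_logits = tf.constant([[[0.1, 2.0, 0.1, 0.1],   # logits over a 4-word vocabulary
                             [0.1, 0.1, 0.1, 2.0]]])
print(sparse_cross_entropy(dummy_true, dummy_logits))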
Callback Functions
We want to save checkpoints during training, log the progress for TensorBoard, and stop early if the validation loss stops improving.
train_dataset = create_dataset(encode_train, cap_train, BATCH_SIZE)
val_dataset = create_dataset(encode_val, cap_val, BATCH_SIZE)

checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(checkpoint_path, verbose=1,
                                       save_weights_only=True,
                                       # Save the weights every epoch
                                       period=1)
]
Train the Model
Now we train the model so that it learns to map the encoded image vectors to sequences of integer tokens representing the captions of the images.
model.fit(train_dataset, epochs=10,
          validation_data=val_dataset,
          steps_per_epoch=steps_per_epoch,
          callbacks=callbacks,
          validation_steps=3)
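If training is interrupted, the weights saved by the ModelCheckpoint callback can be reloaded before generating captions (a minimal sketch, assuming the checkpoint files above exist):
# Restore the most recent saved weights, if any
latest = tf.train.latest_checkpoint(checkpoint_dir)
if latest:
    model.load_weights(latest)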

Generate Captions
This function takes an image's cached encoding and generates a caption one token at a time using the model we have trained, then displays the image along with the predicted and true captions.
def generate_caption(image_id, true_caption, max_tokens=30):
    # Cached 2048-element image encoding, with a batch dimension added
    encoder_input = encode_image[image_id]
    encoder_input = np.expand_dims(encoder_input, axis=0)

    shape = (1, max_tokens)
    decoder_input = np.zeros(shape=shape, dtype=int)

    token_id = token_start
    output = []
    count_tokens = 0
    while token_id != token_end and count_tokens < max_tokens:
        decoder_input[0, count_tokens] = token_id
        input_data = {'encoder_input': encoder_input, 'decoder_input': decoder_input}
        predict = model.predict(input_data)
        # Greedily pick the most likely next token
        token_id = np.argmax(predict[0, count_tokens, :])
        output.append(token_id)
        count_tokens += 1

    print('Predicted caption:', tokenizer.sequences_to_texts([output]))
    print('True caption:', tokenizer.sequences_to_texts([true_caption]))
    plt.imshow(image.load_img(image_dir + image_id))

generate_caption(img_val[100], cap_val[100])
