One of the most important tasks in object detection is labeling the objects in your images. Several tools are available that let you load images and localize the objects using bounding boxes. This information is stored in annotation files.

For the purpose of this tutorial, we will show you how to prepare your image dataset in the Pascal VOC annotation format and convert it to the TFRecord file format.

The Pascal VOC format uses XML files to store details of the objects in each of your images. To generate these XML files easily, we will use LabelImg, which allows you to:

  • draw bounding boxes around the objects in your images
  • automatically save the XML files for your images

Figure: LabelImg for object detection

An XML file containing the box annotations is saved for each image in the “annotations” folder. See the XML code below.
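
As a rough illustration of the annotation layout, the sketch below embeds a sample Pascal VOC file and reads back the fields used later in this tutorial. The file name, path, class name, and coordinates are made-up values, and the snippet uses Python's standard-library xml.etree.ElementTree instead of untangle so that it runs without extra dependencies; `read_voc_annotation` is a hypothetical helper, not part of LabelImg or TensorFlow.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout; every value here is
# illustrative and not taken from a real dataset.
SAMPLE_XML = """
<annotation>
    <folder>images</folder>
    <filename>dog_001.jpg</filename>
    <path>/path/to/images/dog_001.jpg</path>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>dog</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>64</xmin>
            <ymin>96</ymin>
            <xmax>320</xmax>
            <ymax>360</ymax>
        </bndbox>
    </object>
</annotation>
"""


def read_voc_annotation(xml_text):
    """Return the file name, image size, and labeled boxes as plain Python data."""
    root = ET.fromstring(xml_text.strip())
    size = root.find('size')
    objects = []
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        objects.append({
            'name': obj.find('name').text,
            'truncated': int(obj.find('truncated').text),
            # Pixel coordinates in (xmin, ymin, xmax, ymax) order.
            'bbox': tuple(int(box.find(tag).text)
                          for tag in ('xmin', 'ymin', 'xmax', 'ymax')),
        })
    return {
        'filename': root.find('filename').text,
        'width': int(size.find('width').text),
        'height': int(size.find('height').text),
        'objects': objects,
    }


annotation = read_voc_annotation(SAMPLE_XML)
print(annotation['filename'], annotation['width'], annotation['height'])
for obj in annotation['objects']:
    print(obj['name'], obj['bbox'])
```

Each labeled object gets its own `<object>` element, so an image with several boxes simply repeats that element.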


Once you are done annotating your image dataset in the Pascal VOC format, you must convert it into the TFRecord file format, because the TensorFlow Object Detection API expects TFRecord input when you train on your own dataset.

In this tutorial, we use the untangle XML parsing library to convert the annotations into Python objects.

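import hashlib
import io
import os

import PIL.Image
import tensorflow as tf
import untangle

if __name__ == '__main__':

    data_dir = '/home/manu/Desktop/OD_DATASET/'
    tfrecord_path = '/home/manu/Desktop/OD_DATASET/train.tfrecord'

    # Open a writer for the output TFRecord file
    # (older TF 1.x code uses tf.python_io.TFRecordWriter instead).
    writer = tf.io.TFRecordWriter(tfrecord_path)

    annotations_dir = os.path.join(data_dir, 'annotations')
    examples_list = os.listdir(annotations_dir)
    for idx, example in enumerate(examples_list):
        if example.endswith('.xml'):
            # Parse one Pascal VOC annotation file into a Python object.
            path = os.path.join(annotations_dir, example)
            xml_obj = untangle.parse(path)
            # Convert it to a tf.train.Example and append it to the TFRecord file.
            tf_example = xml_to_tf_example(xml_obj)
            writer.write(tf_example.SerializeToString())

    writer.close()
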

A TFRecord file contains tf.train.Example protocol buffers, each of which holds a Features field. We can generate a tf.train.Example proto for an image using the following code.

import PIL.Image

# The feature helpers are assumed to come from the TensorFlow Object Detection API.
from object_detection.utils.dataset_util import (
    bytes_feature, bytes_list_feature, float_list_feature,
    int64_feature, int64_list_feature)


def xml_to_tf_example(xml_obj):
    # Map each class name to its integer id.
    label_map_dict = {'dog': 1, 'cat': 2}

    full_path = xml_obj.annotation.path.cdata
    filename = xml_obj.annotation.filename.cdata
    # Read the raw image bytes and check that the file really is a JPEG.
    with tf.io.gfile.GFile(full_path, 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = PIL.Image.open(encoded_jpg_io)
    if image.format != 'JPEG':
        raise ValueError('Image format not JPEG')
    key = hashlib.sha256(encoded_jpg).hexdigest()

    width = int(xml_obj.annotation.size.width.cdata)
    height = int(xml_obj.annotation.size.height.cdata)

    xmin = []
    ymin = []
    xmax = []
    ymax = []

    classes = []
    classes_text = []
    truncated = []

    for obj in xml_obj.annotation.object:
        # Normalize the pixel coordinates to the [0, 1] range.
        xmin.append(float(obj.bndbox.xmin.cdata) / width)
        ymin.append(float(obj.bndbox.ymin.cdata) / height)
        xmax.append(float(obj.bndbox.xmax.cdata) / width)
        ymax.append(float(obj.bndbox.ymax.cdata) / height)

        # Record the class name, its integer id, and the truncation flag.
        classes_text.append(obj.name.cdata.encode('utf8'))
        classes.append(label_map_dict[obj.name.cdata])
        truncated.append(int(obj.truncated.cdata))

    example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': int64_feature(height),
        'image/width': int64_feature(width),
        'image/filename': bytes_feature(filename.encode('utf8')),
        'image/source_id': bytes_feature(filename.encode('utf8')),
        'image/key/sha256': bytes_feature(key.encode('utf8')),
        'image/encoded': bytes_feature(encoded_jpg),
        'image/format': bytes_feature('jpeg'.encode('utf8')),
        'image/object/bbox/xmin': float_list_feature(xmin),
        'image/object/bbox/xmax': float_list_feature(xmax),
        'image/object/bbox/ymin': float_list_feature(ymin),
        'image/object/bbox/ymax': float_list_feature(ymax),
        'image/object/class/text': bytes_list_feature(classes_text),
        'image/object/class/label': int64_list_feature(classes),
        'image/object/truncated': int64_list_feature(truncated),
    }))
    return example

A bounding box is defined by four floating-point numbers [ymin, xmin, ymax, xmax], with the origin in the top-left corner of the image. We store the normalized coordinates (x / width, y / height) in the TFRecord dataset.
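
To make the normalization concrete, here is a minimal sketch in plain Python. The function name `normalize_box` and the sample image size and coordinates are assumptions chosen for illustration; they do not come from the TensorFlow API or from a real dataset.

```python
def normalize_box(xmin, ymin, xmax, ymax, width, height):
    """Scale pixel coordinates to [0, 1] and return [ymin, xmin, ymax, xmax]."""
    if width <= 0 or height <= 0:
        raise ValueError('width and height must be positive')
    # x values are divided by the image width, y values by the image height.
    return [ymin / height, xmin / width, ymax / height, xmax / width]


# Hypothetical 640x480 image with one box given in pixels.
box = normalize_box(xmin=64, ymin=96, xmax=320, ymax=360, width=640, height=480)
print(box)
```

Because the values are fractions of the image size, they stay valid if the image is later resized, as long as the aspect ratio of the box and image change together.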