A Keras model cannot process raw data directly. The data must first be converted into a format the model can interpret; for example, images must be converted to floating-point tensors.
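As a small illustration of that conversion, the sketch below turns a made-up 2×2 RGB image with 8-bit pixel values into a float32 tensor scaled to the range 0 to 1. The pixel values and variable names are invented for the example, and scaling by 255 is one common convention rather than a requirement.

```python
import numpy as np
import tensorflow as tf

# A made-up 2x2 RGB image with 8-bit pixel values (0-255),
# standing in for a decoded image file.
raw_image = np.array(
    [[[0, 128, 255], [255, 255, 255]],
     [[64, 64, 64], [10, 20, 30]]],
    dtype=np.uint8,
)

# Cast to float32 and scale the values to the [0, 1] range.
image_tensor = tf.cast(raw_image, tf.float32) / 255.0

print(image_tensor.dtype)
print(image_tensor.shape)
```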

In this tutorial, you will learn how to load a Kaggle dataset and create training and validation datasets as input for deep learning models. You will load the images with the Keras preprocessing utility tf.keras.utils.image_dataset_from_directory(), which reads a directory of images on disk. Two cases are covered:

  • image_dataset_from_directory() with a label list
  • image_dataset_from_directory() without a label list

With Label List

The Dog Breed Identification dataset provides a training set and a test set of dog images. We will use only the training set to learn how to load a dataset from a directory. The folder structure of the image data is:

Dog breed dataset

Label List

All training images are located in one folder, and the target labels are in a CSV file. Instead of inferring the classes from the directory structure, we can pass a list of labels, one for each image file in the directory. The labels must be sorted in the alphanumeric order of the image file paths.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the CSV file that maps each image to its breed
label_df = pd.read_csv('/content/labels.csv')

print('Training set: {}'.format(label_df.shape))

# Encode each breed name as an integer label
label_df['label'] = LabelEncoder().fit_transform(label_df.breed)

# Create an index-to-breed dictionary
dict_df = label_df[['label','breed']].copy()
dict_df.drop_duplicates(inplace=True)
dict_df.set_index('label',drop=True,inplace=True)

index_to_breed = dict_df.to_dict()['breed']
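When labels are passed as a list, they are matched to the image files by position, so the order of the list matters. The sketch below shows one way to build a label list that follows the sorted file names. The file names, the id_to_label mapping, and the temporary folder are all hypothetical stand-ins; with the real dataset the mapping would come from the CSV file, assuming it has a column that identifies each image.

```python
import os
import tempfile

# Hypothetical mapping from image id (file name without extension)
# to an integer label.
id_to_label = {"b_dog": 1, "a_dog": 0, "c_dog": 1}

with tempfile.TemporaryDirectory() as image_dir:
    # Create empty placeholder files so the sketch runs without the real dataset.
    for image_id in id_to_label:
        open(os.path.join(image_dir, image_id + ".jpg"), "w").close()

    # Sort the file names so the label list follows the same order as the files.
    file_names = sorted(os.listdir(image_dir))
    labels = [id_to_label[os.path.splitext(name)[0]] for name in file_names]

print(file_names)
print(labels)
```

If the rows of the CSV file are already in the same order as the sorted file names, the label column can be passed as it is.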

Define some parameters for the loader:

import tensorflow as tf

batch_size = 32
img_height = 180
img_width = 180

data_dir='/content/dogs/'

It’s good practice to use a validation split when developing your model. We will use 80% of the images for training and 20% for validation.

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  seed=123,
  labels=label_df['label'].to_numpy().tolist(),
  label_mode='int',
  validation_split=0.2,
  subset="training",
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  seed=123,
  labels=label_df['label'].to_numpy().tolist(),
  label_mode='int',
  validation_split=0.2,
  subset="validation",
  image_size=(img_height, img_width),
  batch_size=batch_size)

Validation Split

validation_split is a float between 0 and 1 that sets the fraction of the images to reserve for validation. Note that I am loading both the training and the validation dataset from the same folder, so both calls must use the same seed; otherwise the two subsets could overlap. The file list is shuffled with that seed, and the last 20% of it becomes the validation set.

This is different from the validation_split argument of Model.fit(), where the model sets apart the last samples of the x and y data provided, before shuffling, does not train on them, and evaluates the loss and any model metrics on them at the end of each epoch.
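To see why the same seed matters when both subsets come from one folder, here is a minimal pure-Python sketch of a seeded shuffle followed by a split. It illustrates the idea only and is not the Keras implementation; the split_files helper and the file names are hypothetical.

```python
import random

def split_files(file_names, validation_split, subset, seed):
    """Shuffle with a fixed seed, then hold out the last fraction for validation."""
    shuffled = sorted(file_names)
    random.Random(seed).shuffle(shuffled)
    num_val = int(validation_split * len(shuffled))
    if subset == "training":
        return shuffled[:-num_val] if num_val else shuffled
    return shuffled[-num_val:] if num_val else []

files = [f"img_{i:02d}.jpg" for i in range(10)]

# The same seed gives the same shuffled order in both calls,
# so the two subsets cannot share a file.
train_files = split_files(files, 0.2, "training", seed=123)
val_files = split_files(files, 0.2, "validation", seed=123)

print(len(train_files), len(val_files))
print(set(train_files) & set(val_files))
```

With different seeds, the two calls would shuffle the files differently, and some images could land in both subsets.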

Visualize the data

Here are nine images from the first batch of the training dataset.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
  lbl=labels.numpy()
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(index_to_breed[lbl[i]])
    plt.axis("off")

Without Label List

The 10 Monkey Species dataset consists of two folders, training and validation. Each folder contains 10 subfolders labeled n0 to n9, each corresponding to a monkey species. The images (almost 1,400 in total) are 400×300 px or larger and in JPEG format.

10 monkey species

Each subfolder contains images of one monkey species. To have the labels inferred, the data directory should have the following structure:

path/to/image_dir/
  label1/  
      l1.png
      l2.png
      l3.png
  label2/
      l1.png
      l2.png
      l3.png

With this folder structure, set labels to 'inferred' so that each subfolder name becomes a class.

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  labels='inferred',
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

Label

If you set labels to 'inferred', the labels are generated from the directory structure. If you set it to None, no labels are produced. You can also pass a list or tuple of integer labels of the same size as the number of image files found in the directory.

You can find the class names in the class_names attribute on these datasets.

class_names = train_ds.class_names
print(class_names)
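The following sketch shows roughly how class names and integer labels can be derived from subfolder names: the names are sorted alphanumerically, and the integer label of a class is its position in that list. The folder names and the temporary directory are hypothetical, and the code is an illustration rather than what Keras runs internally.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as image_dir:
    # Hypothetical class folders, created out of order on purpose.
    for folder in ["n2", "n0", "n1"]:
        os.makedirs(os.path.join(image_dir, folder))

    # Subfolder names in alphanumeric order become the class names.
    class_names = sorted(
        entry for entry in os.listdir(image_dir)
        if os.path.isdir(os.path.join(image_dir, entry))
    )

# The integer label of a class is its position in class_names.
class_to_index = {name: index for index, name in enumerate(class_names)}
print(class_to_index)
```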

ImageDataGenerator

ImageDataGenerator is deprecated and is not recommended for new code. Prefer loading images with image_dataset_from_directory and transforming the output tf.data.Dataset with preprocessing layers.

Data Augmentation

You can also use the Keras preprocessing layers for data augmentation, such as RandomFlip and RandomRotation. Let’s create a few preprocessing layers and apply them to the training dataset.

data_augmentation = tf.keras.Sequential([
  tf.keras.layers.RandomFlip("horizontal_and_vertical"),
  tf.keras.layers.RandomRotation(0.2),
])

aug_ds = train_ds.map(
  lambda x, y: (data_augmentation(x, training=True), y))

With this approach, you use Dataset.map to create a dataset that yields batches of augmented images. In this case, data augmentation will happen asynchronously on the CPU, and is non-blocking. You can overlap the training of your model on the GPU with data preprocessing, using Dataset.prefetch.
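A minimal sketch of that overlap is shown below. It uses a small synthetic dataset named demo_ds in place of train_ds so that it runs without any image files. The batch size of 4 and the use of tf.data.AUTOTUNE are choices made for this example only.

```python
import tensorflow as tf

# Synthetic stand-in for train_ds: 8 random 180x180 RGB images with integer labels.
images = tf.random.uniform((8, 180, 180, 3), maxval=255.0)
labels = tf.range(8)
demo_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2),
])

# Augment batches in parallel, then prefetch so that upcoming batches are
# prepared while the current one is being consumed.
aug_ds = demo_ds.map(
    lambda x, y: (data_augmentation(x, training=True), y),
    num_parallel_calls=tf.data.AUTOTUNE,
).prefetch(tf.data.AUTOTUNE)

for batch_images, batch_labels in aug_ds.take(1):
    print(batch_images.shape, batch_labels.shape)
```

tf.data.AUTOTUNE lets tf.data pick the degree of parallelism and the buffer size at runtime, which is usually a reasonable starting point before tuning by hand.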
