A Keras model cannot process raw data directly. The data has to be converted into a format the model can interpret; for example, images have to be converted to floating-point tensors.
In this tutorial, you will learn how to load a Kaggle image dataset and create training and validation datasets as input for deep learning models. You will load the data with the Keras preprocessing utility tf.keras.utils.image_dataset_from_directory(), which reads a directory of images on disk.
- image_dataset_from_directory() with Label List
- image_dataset_from_directory() without Label List
With Label List
The Dog Breed Identification dataset provides a training set and a test set of images of dogs. We will use only the training set to learn how to load a dataset from a directory. The folder structure of the image data is described below.

Label List
All the training images are located in a single folder, and the target labels are in a CSV file. image_dataset_from_directory() makes it possible to pass a list of labels instead of inferring the classes from the directory structure: we have a list of labels whose length matches the number of image files in the directory. Per the Keras documentation, such a list must be sorted according to the alphanumeric order of the image file paths.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

label_df = pd.read_csv('/content/labels.csv')
print('Training set: {}'.format(label_df.shape))

# Encode the breed names into integer labels
label_df['label'] = LabelEncoder().fit_transform(label_df.breed)

# Build a label-to-breed dictionary for decoding labels back to breed names
dict_df = label_df[['label', 'breed']].copy()
dict_df.drop_duplicates(inplace=True)
dict_df.set_index('label', drop=True, inplace=True)
index_to_breed = dict_df.to_dict()['breed']
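As a quick sanity check, you can print a few entries of the mapping. A minimal sketch (the exact breed names depend on your labels.csv):

# Show the first three label-to-breed pairs
for idx in sorted(index_to_breed)[:3]:
    print(idx, '->', index_to_breed[idx])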
Define some parameters for the loader:
import tensorflow as tf

batch_size = 32
img_height = 180
img_width = 180
data_dir = '/content/dogs/'
It’s good practice to use a validation split when developing your model. We will use 80% of the images for training and 20% for validation.
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    seed=123,
    labels=label_df['label'].to_numpy().tolist(),
    label_mode='int',
    validation_split=0.2,
    subset="training",
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    seed=123,
    labels=label_df['label'].to_numpy().tolist(),
    label_mode='int',
    validation_split=0.2,
    subset="validation",
    image_size=(img_height, img_width),
    batch_size=batch_size)
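Before going further, it is worth confirming that the dataset yields what you expect. A minimal sketch that inspects the shape of one batch:

# Images arrive as (batch, height, width, channels); labels as (batch,)
for image_batch, labels_batch in train_ds.take(1):
    print(image_batch.shape)   # e.g. (32, 180, 180, 3)
    print(labels_batch.shape)  # e.g. (32,)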
Validation Split
validation_split is a float between 0 and 1, the fraction of the data to reserve for validation. The loader sets this fraction apart; the model never trains on it, and you evaluate the loss and any model metrics on it at the end of each epoch. Note that both the training and validation datasets are loaded from the same folder with the same validation_split; because the split always takes the last fraction of the (seed-shuffled) file list, you must pass the same seed to both calls so the two subsets do not overlap.
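You can verify the 80/20 split by counting the batches in each dataset. A minimal sanity check using tf.data.experimental.cardinality:

# Count the batches reserved for training vs. validation
print(tf.data.experimental.cardinality(train_ds).numpy())
print(tf.data.experimental.cardinality(val_ds).numpy())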
Visualize the data
Here are the first nine images from the training dataset.
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    lbl = labels.numpy()
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(index_to_breed[lbl[i]])
        plt.axis("off")

Without Label List
The 10 Monkey Species dataset consists of two folders, training and validation. Each folder contains 10 subfolders labeled n0 through n9, each corresponding to a monkey species. Images are 400×300 px or larger, in JPEG format (almost 1,400 images in total).

Each subdirectory contains images of one species of monkey. To use inferred labels, the data directory should have the following structure:
path/to/image_dir/
    label1/
        l1.png
        l2.png
        l3.png
    label2/
        l1.png
        l2.png
        l3.png
If your folder structure looks like this, Keras can generate the labels for you when you pass labels='inferred':
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    labels='inferred',
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)
Label
If labels is set to 'inferred', the labels are generated from the directory structure; if set to None, no labels are returned; otherwise, pass a list/tuple of integer labels of the same size as the number of image files found in the directory.
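For example, if you only need the images and no labels at all (say, for an inference-only set), you can pass labels=None. A minimal sketch, where unlabeled_dir is a hypothetical directory of images:

# Load images only; the dataset yields image batches without labels
test_ds = tf.keras.utils.image_dataset_from_directory(
    unlabeled_dir,  # hypothetical path, not part of this dataset
    labels=None,
    image_size=(img_height, img_width),
    batch_size=batch_size)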
You can find the class names in the class_names attribute of these datasets (available when labels are inferred).
class_names = train_ds.class_names
print(class_names)
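Since label_mode defaults to 'int', each label in a batch is an index into class_names. A small sketch that decodes a few labels from one batch:

# Decode the first five integer labels back to folder (class) names
for images, labels in train_ds.take(1):
    print([class_names[i] for i in labels.numpy()[:5]])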
ImageDataGenerator
ImageDataGenerator is deprecated and not recommended for new code. Prefer loading images with image_dataset_from_directory and transforming the output tf.data.Dataset with preprocessing layers.
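For example, where you might previously have used ImageDataGenerator(rescale=1./255), you can normalize pixel values with a Rescaling layer applied through Dataset.map. A minimal sketch:

# Scale pixel values from [0, 255] to [0, 1]
normalization_layer = tf.keras.layers.Rescaling(1./255)
normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))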
Data Augmentation
You can also use Keras preprocessing layers for data augmentation, such as RandomFlip and RandomRotation. Let's create a few preprocessing layers and apply them to the training images.
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2),
])

aug_ds = train_ds.map(
    lambda x, y: (data_augmentation(x, training=True), y))
With this approach, you use Dataset.map to create a dataset that yields batches of augmented images. Data augmentation happens asynchronously on the CPU and is non-blocking, so you can overlap training on the GPU with data preprocessing using Dataset.prefetch.
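A common way to finish the input pipeline is to add caching and prefetching so data loading keeps pace with training. A minimal sketch:

# Cache decoded images, shuffle the training data, and prefetch batches
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)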
Related Posts
- Split dataset into Train and Test set using Keras image_dataset_from_directory/folder.
- How to set steps_per_epoch, validation_steps, and validation_split in Keras's fit method?
- Split an imbalanced dataset using sklearn Stratified train_test_split().