The only way to know how well a model will generalize to new cases is to try it out on a new dataset. One way to do that is to put your model in production and monitor how well it performs.

A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set and test it using the test set. The error rate on new cases is called the generalization error, and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.

It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will have 100,000 instances, probably more than enough to get a reasonable estimate of the generalization error.
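As a concrete illustration, an 80/20 split can be sketched in plain Python (the function name and seed here are illustrative):

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into train and test subsets."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters: if the data is sorted (for example, by class), a naive head/tail split would put entire classes in only one of the two sets.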

There is a standard way to structure an image dataset on disk that makes it fast and efficient to load when training and evaluating deep learning models. Once the data is structured this way, you can use utilities such as tf.keras.utils.image_dataset_from_directory from the Keras deep learning library to load the images from their folders automatically and split them into train and test datasets.

In this tutorial, we will download a dataset from Kaggle, split it into train and test sets, and create a tf.data.Dataset to load and preprocess it efficiently. You can then build and train an image classification model using the train dataset.

Download the Weather dataset, which contains 6,862 images of different types of weather, organized in directories. The images are divided into 11 classes: dew, fog/smog, frost, glaze, hail, lightning, rain, rainbow, rime, sandstorm, and snow.

Weather Dataset

The weather dataset contains 11 sub-directories, one per class. Note that I load both the training and validation sets from the same folder and use the validation_split argument to divide them.
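As a sketch of the layout the loader expects (one sub-directory per class under a single root), the snippet below creates that structure; the root name `weather_dataset` and the `fog_smog` directory name are assumptions, so match them to your actual download:

```python
import pathlib

# The sub-directory names double as class labels for the loader.
classes = ["dew", "fog_smog", "frost", "glaze", "hail", "lightning",
           "rain", "rainbow", "rime", "sandstorm", "snow"]

root = pathlib.Path("weather_dataset")  # assumed root folder name
for name in classes:
    (root / name).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in root.iterdir()))
```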

After downloading (615 MB), you should have a copy of the weather photos available: 6,862 images in total, with each directory containing images of one type of weather.

import pathlib
import PIL.Image

data_dir = pathlib.Path("weather_dataset")  # adjust to your download location

# Open the first image in the "rain" class directory
rain = list(data_dir.glob('rain/*'))
PIL.Image.open(str(rain[0]))
Keras image from folder

Load data from the Folder

Next, load these images off disk using the helpful tf.keras.utils.image_dataset_from_directory utility. This will take you from a directory of images on disk to a tf.data.Dataset in just a couple lines of code.

import tensorflow as tf

batch_size = 32
img_height = 150
img_width = 150

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

subset: One of “training” or “validation”. Only used if validation_split is set.

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

image_dataset_from_directory generates a tf.data.Dataset from image files in a directory. It yields batches of images read from the class sub-directories of the main directory, with each image's label inferred from the name of the sub-directory it sits in.
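Because the result is an ordinary tf.data.Dataset, you can also tune it for input-pipeline performance with the standard cache, shuffle, and prefetch transformations. A minimal sketch; the helper name and buffer sizes are illustrative choices, not requirements:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

def configure_for_performance(ds, shuffle=False):
    """Cache elements after the first epoch and prefetch upcoming batches."""
    if shuffle:
        ds = ds.shuffle(1000)
    return ds.cache().prefetch(buffer_size=AUTOTUNE)

# Usage with the datasets above would look like:
# train_ds = configure_for_performance(train_ds, shuffle=True)
# val_ds = configure_for_performance(val_ds)
```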

validation_split: a float between 0 and 1, the fraction of the data to reserve for validation. Keras sets apart this fraction of the files; pass the same seed and validation_split to both calls so the training and validation subsets are consistent and do not overlap.
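To see that the two calls carve out complementary subsets deterministically, here is a small self-contained experiment on a synthetic directory of JPEGs (the directory name, image sizes, and class names are all illustrative):

```python
import pathlib
import numpy as np
import tensorflow as tf

# Build a tiny synthetic dataset on disk: 2 classes x 10 images each.
root = pathlib.Path("tiny_dataset")
for cls in ["a", "b"]:
    (root / cls).mkdir(parents=True, exist_ok=True)
    for i in range(10):
        img = tf.constant(np.random.randint(0, 255, (32, 32, 3), dtype=np.uint8))
        tf.io.write_file(str(root / cls / f"{i}.jpg"), tf.io.encode_jpeg(img))

# Same validation_split and seed in both calls -> complementary subsets.
common = dict(validation_split=0.2, seed=123, image_size=(32, 32), batch_size=4)
train = tf.keras.utils.image_dataset_from_directory(root, subset="training", **common)
val = tf.keras.utils.image_dataset_from_directory(root, subset="validation", **common)

n_train = sum(int(images.shape[0]) for images, labels in train)
n_val = sum(int(images.shape[0]) for images, labels in val)
print(n_train, n_val)  # 16 4
```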

Visualize the data

Here are the first six images from the training dataset.

import matplotlib.pyplot as plt

class_names = train_ds.class_names

plt.figure(figsize=(12, 7))
for images, labels in train_ds.take(1):
  for i in range(6):
    ax = plt.subplot(2, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")
Keras Split Dataset from Folder

You can train a model using these datasets by passing them to model.fit().
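For example, here is a minimal model sketch that accepts the 150x150 images produced above; the architecture is illustrative and not tuned for this dataset:

```python
import tensorflow as tf

num_classes = 11  # one per weather category

# A small CNN sketch: rescale pixels, two conv/pool stages, then classify.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(150, 150, 3)),
    tf.keras.layers.Rescaling(1.0 / 255),           # scale pixels to [0, 1]
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes),             # logits, one per class
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```

SparseCategoricalCrossentropy matches the integer labels that image_dataset_from_directory yields by default (label_mode="int").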

Run this code in Google Colab.