VGG investigates the effect of depth in convolutional networks for image recognition, increasing depth by using very small (3 × 3) convolution filters in all layers. In this tutorial, we present the details of the VGG16 network configuration and of the image augmentation used for training and evaluation.
VGG16 Architecture
The VGG16 ConvNet configuration differs from earlier architectures: rather than using relatively large convolutional filters in the first conv. layers (e.g. 11×11 with stride 4, or 7×7 with stride 2), VGG uses very small 3×3 filters throughout the whole network, convolved with the input at every pixel (stride 1). For instance, a stack of three 3×3 conv. layers replaces a single 7×7 layer: it covers the same receptive field with fewer parameters and more non-linearities. All ConvNet layers are designed using the same principles.
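To see why stacks of small filters are attractive, compare the weights: for C input and C output channels, three stacked 3×3 layers use 3·(3·3·C·C) = 27C² weights, while a single 7×7 layer covering the same receptive field uses 49C². A quick sketch of this arithmetic (the channel count C is chosen arbitrarily for illustration):
# weights of three stacked 3x3 conv layers vs one 7x7 layer,
# both mapping C channels to C channels (biases ignored)
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2 = 1,769,472
single_7x7 = 7 * 7 * C * C          # 49 * C^2 = 3,211,264
print(stacked_3x3, single_7x7)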

During training, the input to our ConvNet is a fixed-size 224 × 224 × 3 RGB image.
import numpy as np
import tensorflow as tf
from tensorflow import keras

IMG_SIZE = 224  # fixed-size 224 x 224 x 3 RGB input, as described above
input_shape = [IMG_SIZE, IMG_SIZE, 3]
img_input = keras.layers.Input(shape=input_shape)
# Block 1
# Note: BatchNormalization and the intra-block Dropout layers are additions over the
# original VGG16, which used neither inside the convolutional blocks.
x = keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)
# Block 2
x = keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x)
# Block 3
x = keras.layers.Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv1')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(256, (3, 3), activation='relu', padding='same', name='block3_conv2')(x)
x = keras.layers.BatchNormalization()(x)
# The 1x1 convolutions used as the third conv of blocks 3-5 follow VGG configuration C;
# the canonical VGG16 (configuration D) uses 3x3 filters here as well.
x = keras.layers.Conv2D(256, (1, 1), activation='relu', padding='same', name='block3_conv3')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.MaxPooling2D((2, 2), strides=(2, 2), name='block3_pool')(x)
# Block 4
x = keras.layers.Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv1')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(512, (3, 3), activation='relu', padding='same', name='block4_conv2')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(512, (1, 1), activation='relu', padding='same', name='block4_conv3')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.MaxPooling2D((2, 2), strides=(2, 2), name='block4_pool')(x)
# Block 5
x = keras.layers.Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv1')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(512, (3, 3), activation='relu', padding='same', name='block5_conv2')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Conv2D(512, (1, 1), activation='relu', padding='same', name='block5_conv3')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dropout(0.3)(x)
x = keras.layers.MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(x)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.5)(x)
# Classification block
x = keras.layers.Flatten(name='flatten')(x)
x = keras.layers.Dense(4096, activation='relu', name='fc1')(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(4096, activation='relu', name='fc2')(x)
x = keras.layers.Dropout(0.5)(x)
# CLASS_NAMES (one entry per class) is defined below in the Prepare Dataset section
x = keras.layers.Dense(len(CLASS_NAMES), activation='softmax', name='predictions')(x)
model = keras.models.Model(img_input, x, name='vgg16')
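To check the resulting layer shapes and parameter counts, you can print a summary of the assembled model (assuming the code above has been executed):
model.summary()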
Convolutional Layer
The width of the network (the number of channels) is rather small, starting from 64 in the first layer and then doubling after each max-pooling layer until it reaches 512.
The image is passed through a stack of convolutional layers with 3×3 filters, the smallest size that can capture the notion of left/right, up/down, and center.
The convolution stride is fixed to 1 pixel.
Conv. layer inputs use "same" padding, which preserves the spatial resolution after convolution.
Pooling Layer
Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not every conv. layer is followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window with stride 2.
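With "same"-padded convolutions and five 2×2, stride-2 pooling layers, the spatial resolution of a 224×224 input halves after each pool: 224 → 112 → 56 → 28 → 14 → 7. A minimal sketch of this arithmetic:
size = 224
for pool in range(5):  # five max-pooling layers, stride 2
    size //= 2
    print(size)        # 112, 56, 28, 14, 7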
Fully Connected Layer
The stack of convolutional layers is followed by three fully-connected (FC) layers: the first two have 4096 channels each, and the final layer is a soft-max layer over the classes. Dropout regularisation with rate 0.5 is applied after the first two fully-connected layers. In the implementation above, a global average pooling layer precedes the FC stack.
Prepare Dataset
In this tutorial, we use the Fruits 360 dataset, which contains images of fruits and vegetables and is available for download from GitHub.
import pathlib
import glob

training_path = pathlib.Path("fruits-360_dataset/fruits-360/Training/")
testing_path = pathlib.Path("fruits-360_dataset/fruits-360/Test/")
CLASS_NAMES = np.array([item.name for item in testing_path.glob('*')])
CLASS_NAMES
train_images = glob.glob(str(training_path/'*/*.jpg'))
test_images = glob.glob(str(testing_path/'*/*.jpg'))
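As a quick sanity check, you can print the number of classes and images found (the exact counts depend on which version of Fruits 360 you downloaded):
print("classes:", len(CLASS_NAMES))
print("training images:", len(train_images))
print("test images:", len(test_images))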
Image Augmentation
To obtain fixed-size 224×224 ConvNet input images, training images are first rescaled and then randomly cropped. To further augment the training set, the crops undergo random horizontal flipping. The rescaling of training images is explained below.
VGG_MEAN = [123.68, 116.78, 103.94]
def train_augment(image, label):
    # random IMG_SIZE x IMG_SIZE crop, then random horizontal flip
    crop_image = tf.image.random_crop(image, [IMG_SIZE, IMG_SIZE, 3])
    flip_image = tf.image.random_flip_left_right(crop_image)
    # subtract the per-channel VGG mean from every pixel
    means = tf.reshape(tf.constant(VGG_MEAN), [1, 1, 3])
    centered_image = flip_image - means
    return centered_image, tf.cast(label, tf.float32)
We also subtract the mean RGB value (VGG_MEAN) from each pixel.
def validation_augment(image, label):
    # deterministic central crop (or pad) to IMG_SIZE, then mean subtraction
    crop_image = tf.image.resize_with_crop_or_pad(image, IMG_SIZE, IMG_SIZE)
    means = tf.reshape(tf.constant(VGG_MEAN), [1, 1, 3])
    centered_image = crop_image - means
    return centered_image, tf.cast(label, tf.float32)
At test time we apply the network to a single central crop of each rescaled image; there is no need to sample multiple crops, which would be less efficient as it requires re-computing the network for each crop.
def parse_image(filename):
    # the class label is the parent folder name, one-hot encoded against CLASS_NAMES
    parts = tf.strings.split(filename, '/')
    label = parts[-2] == CLASS_NAMES
    image = tf.io.read_file(filename)
    image = tf.image.decode_jpeg(image, channels=3)
    # keep pixel values on a 0-255 scale so that subtracting VGG_MEAN (also 0-255) is meaningful
    image = tf.cast(image, tf.float32)
    # isotropically rescale so that the smallest side equals smallest_side
    # (note: smallest_side must be at least IMG_SIZE for the later random crop to succeed)
    smallest_side = 125.0
    height, width = tf.shape(image)[0], tf.shape(image)[1]
    height = tf.cast(height, tf.float32)
    width = tf.cast(width, tf.float32)
    scale = tf.cond(tf.greater(height, width),
                    lambda: smallest_side / width,
                    lambda: smallest_side / height)
    new_height = tf.cast(height * scale, tf.int32)
    new_width = tf.cast(width * scale, tf.int32)
    image = tf.image.resize(image, [new_height, new_width])
    return image, label
def create_dataset(file_list, is_training=False, shuffle_buffer_size=1000):
    ds = tf.data.Dataset.from_tensor_slices(file_list)
    ds = ds.shuffle(buffer_size=len(file_list))
    ds = ds.map(parse_image, num_parallel_calls=AUTOTUNE)
    if is_training:
        ds = ds.map(train_augment, num_parallel_calls=AUTOTUNE)
    else:
        ds = ds.map(validation_augment, num_parallel_calls=AUTOTUNE)
    ds = ds.repeat()
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds
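create_dataset relies on two constants that must be defined beforehand. AUTOTUNE lets tf.data pick the degree of parallelism; the batch size below is an assumed value, not one prescribed by this tutorial, so adjust it to your GPU memory:
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 32  # assumed value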
Compile Model
You must compile the model before training it. Since there are 120 classes and parse_image produces one-hot label vectors, use a categorical_crossentropy loss.
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
Train Model
Train the model by passing the Dataset object to the model's fit function, and set the number of epochs.
train_ds = create_dataset(train_images, is_training=True)
test_ds = create_dataset(test_images)
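Because the datasets repeat indefinitely, fit also needs steps_per_epoch and validation_steps. A reasonable choice, assuming the BATCH_SIZE defined earlier and one full pass over each file list per epoch, is:
steps_per_epoch = len(train_images) // BATCH_SIZE
validation_steps = len(test_images) // BATCH_SIZE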
history = model.fit(train_ds,
                    epochs=10,
                    steps_per_epoch=steps_per_epoch,
                    validation_steps=validation_steps,
                    validation_data=test_ds)
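Optionally, you can plot the training curves recorded in history (a minimal sketch, assuming matplotlib is installed and TensorFlow 2 metric names):
import matplotlib.pyplot as plt
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()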
Evaluate Model
Let's see how the model performs on the test set. Two values are returned: the loss (a number representing the error; lower is better) and the accuracy.
loss, accuracy = model.evaluate(test_ds, steps=validation_steps)
print("loss: {:.2f}".format(loss))
print("accuracy: {:.2f}".format(accuracy))
Run this code in Google Colab
References
Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.