Convolutional layers are capable of extracting different features from an image, such as edges, textures, objects, and scenes. A convolutional layer contains a set of filters whose parameters need to be learned. It computes a dot product between each filter and a small region of the input image. This may result in an output volume such as [28x28x10] if we decide to use 10 filters.

What is a Filter

The convolutional layer applies the convolution operation to the input image, using filters to extract features; each filter scans the entire image. The filter is slid across the width and height of the input, and the dot product between the filter and the input is computed at every position. The output of a convolution is referred to as a feature map.
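To make the sliding dot product concrete, here is a minimal NumPy sketch. It assumes a single-channel image, no padding, and a stride of 1. The function name slide_filter and the hand-written edge filter are illustrative choices, not part of any library. As in most deep learning frameworks, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over a 2D `image` and take a dot product at each position.

    Assumes no padding and a stride of 1, so the output is smaller than the input.
    """
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # Small region of the input currently under the filter
            region = image[row:row + k_h, col:col + k_w]
            feature_map[row, col] = np.sum(region * kernel)
    return feature_map

# A 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# A hand-written filter that responds to dark-to-bright transitions.
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = slide_filter(image, kernel)
print(feature_map)  # every row is [3, 3, 0]
```

The feature map is large where the filter overlaps the edge and zero over the uniform bright region, which is what we mean when we say a filter "detects" a feature.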

Each filter is convolved with the inputs to compute an activation map. The output volume of the convolutional layer is obtained by stacking the activation maps of all filters along the depth dimension. 
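The stacking step can be sketched in NumPy as follows. This is only an illustration under a few assumptions: no padding, a stride of 1, no bias term, and random filter values standing in for learned ones. The function name conv_layer_forward is hypothetical.

```python
import numpy as np

def conv_layer_forward(image, filters):
    """Apply every filter to the image and stack the activation maps along depth.

    image:   array of shape (H, W, C)
    filters: array of shape (K, f, f, C), i.e. K filters of size f x f x C
    returns: array of shape (H - f + 1, W - f + 1, K)
    """
    img_h, img_w, img_c = image.shape
    num_filters, f_h, f_w, f_c = filters.shape
    if f_c != img_c:
        raise ValueError("filter depth must match input depth")
    out_h, out_w = img_h - f_h + 1, img_w - f_w + 1
    activation_maps = []
    for k in range(num_filters):
        activation_map = np.zeros((out_h, out_w))
        for row in range(out_h):
            for col in range(out_w):
                region = image[row:row + f_h, col:col + f_w, :]
                activation_map[row, col] = np.sum(region * filters[k])
        activation_maps.append(activation_map)
    # One activation map per filter, stacked along the last (depth) axis
    return np.stack(activation_maps, axis=-1)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                # one 32x32 RGB image
filters = rng.standard_normal((10, 5, 5, 3))   # ten 5x5x3 filters
output_volume = conv_layer_forward(image, filters)
print(output_volume.shape)  # (28, 28, 10)
```

With these assumed sizes, ten 5x5 filters on a 32x32x3 input give a 28x28x10 volume: one 28x28 activation map per filter.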

For example, the following animation shows a filter scanning an image for a feature and computing the activation map.

2D Convolution Animation

A filter is also sometimes called a window or a kernel; with respect to convolutional neural networks, these terms all refer to the same thing.

We can scan the image using multiple filters to generate multiple feature maps of the image. Each feature map will reveal the parts of the image which express the given feature defined by the parameters of our filter.

Each convolutional layer consists of several filters. In practice, the number of filters is a value such as 32, 64, 128, 256, or 512. This number is equal to the number of channels in the output of the convolutional layer.

Define Filter 

In practice, we don’t explicitly define the filters that our convolutional layer will use. Instead, we parameterize the filters and let the network learn the best filters to use during training. What we do need to define is how many filters we’ll use at each layer.

During training, the values in the filters are optimized by gradient descent with respect to a loss function, with the gradients computed by backpropagation.
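As a rough illustration of what "learning a filter" means, the sketch below fits a single 3x3 filter with plain gradient descent. Everything here is a toy setup chosen for the example: the random 8x8 image, the squared-error loss, the learning rate, the number of steps, and the target_kernel used to generate the desired output. A real network computes the gradients automatically through backpropagation; here the gradient is written out by hand.

```python
import numpy as np

def slide_filter(image, kernel):
    """Sliding dot product of `kernel` over `image` (no padding, stride 1)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + k_h, c:c + k_w] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))

# Toy setup: the desired feature map is produced by a known filter.
target_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
target_map = slide_filter(image, target_kernel)

kernel = np.zeros((3, 3))   # learnable filter values, arbitrary starting point
learning_rate = 0.005       # assumed value, small enough to keep the updates stable
losses = []
for step in range(2000):
    error = slide_filter(image, kernel) - target_map
    losses.append(0.5 * np.sum(error ** 2))   # squared-error loss
    # Gradient of the loss w.r.t. each filter value:
    # d(loss)/d(kernel[i, j]) = sum over r, c of error[r, c] * image[r + i, c + j]
    gradient = slide_filter(image, error)
    kernel -= learning_rate * gradient

print("first loss:", losses[0])
print("last loss: ", losses[-1])
print(np.round(kernel, 2))
```

The loss shrinks as the filter values move toward ones that reproduce the target feature map. How closely they match after a fixed number of steps depends on the data and the learning rate.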

Kernel Size

Each filter has a defined width and height, and the width and height of the filter (kernel) are smaller than those of the input volume.

A filter has the same number of dimensions as the input image, and its depth matches the depth of the input, but its width and height are smaller. As an example, for a 3D image of shape [32, 32, 3], an acceptable filter size is f × f × 3, where f = 3, 5, 7, and so on.
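The f × f × 3 rule also tells us how many weights a layer has to learn. The following sketch works that out. The helper names filter_shape and count_weights are made up for this example, and the count leaves out bias terms, which real layers usually add (one per filter).

```python
def filter_shape(f, input_shape):
    """Shape of one f x f filter for an input of shape (height, width, depth)."""
    height, width, depth = input_shape
    if f > height or f > width:
        raise ValueError("the filter must not be larger than the input")
    # The filter always spans the full depth of the input.
    return (f, f, depth)

def count_weights(f, input_shape, num_filters):
    """Total number of filter weights in the layer (bias terms not included)."""
    f_h, f_w, f_d = filter_shape(f, input_shape)
    return f_h * f_w * f_d * num_filters

input_shape = (32, 32, 3)
for f in (3, 5, 7):
    shape = filter_shape(f, input_shape)
    total = count_weights(f, input_shape, num_filters=10)
    print(f, shape, total)
# 3 (3, 3, 3) 270
# 5 (5, 5, 3) 750
# 7 (7, 7, 3) 1470
```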

kernel_size: the size of these convolution filters. In practice, it takes values such as 1×1, 3×3, or 5×5. Because filters are mostly square in practice, these can be abbreviated as 1, 3, or 5.
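The abbreviation and its effect on the output size can be sketched as below. Both helpers, normalize_kernel_size and output_size, are hypothetical names written for this example and not library functions. The size formula assumes no padding and a stride of 1.

```python
def normalize_kernel_size(kernel_size):
    """Expand an int such as 3 into (3, 3); pass a (height, width) pair through."""
    if isinstance(kernel_size, int):
        return (kernel_size, kernel_size)
    height, width = kernel_size
    return (int(height), int(width))

def output_size(input_size, kernel_size):
    """Output (height, width) of a convolution with no padding and stride 1."""
    k_h, k_w = normalize_kernel_size(kernel_size)
    in_h, in_w = input_size
    return (in_h - k_h + 1, in_w - k_w + 1)

for kernel_size in (1, 3, 5, (5, 5)):
    print(kernel_size, normalize_kernel_size(kernel_size), output_size((32, 32), kernel_size))
# 1 (1, 1) (32, 32)
# 3 (3, 3) (30, 30)
# 5 (5, 5) (28, 28)
# (5, 5) (5, 5) (28, 28)
```

Under these assumptions, a 5×5 kernel on a 32×32 input yields a 28×28 map, and a 1×1 kernel leaves the width and height unchanged.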

Input Layer

The input layer is conceptually different from other layers. It will hold the raw pixel values of the image. In Keras, the input layer itself is not a layer, but a tensor. It’s the starting tensor you send to the first hidden layer. This tensor must have the same shape as your training data.

Input Shape

The input shape is the only shape you must define, because your model cannot infer it. It is based on your training data. All the other shapes are calculated automatically based on the units and particularities of each layer.

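For example, if you have 100 images of 32x32 pixels with 3 channels, the shape of your input data is (100, 32, 32, 3), and your input tensor must be compatible with this shape. Note that Keras leaves the number of samples out of the declared input shape, so the shape you actually pass is (32, 32, 3).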
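A quick way to read these shapes off the data is sketched below with NumPy. The array of zeros is only a placeholder standing in for real training images, and the variable names are illustrative.

```python
import numpy as np

# Placeholder data standing in for 100 RGB images of 32x32 pixels.
images = np.zeros((100, 32, 32, 3), dtype=np.float32)

num_samples = images.shape[0]        # how many images we have
per_image_shape = images.shape[1:]   # (height, width, channels) of one image

print(images.shape)      # (100, 32, 32, 3)
print(num_samples)       # 100
print(per_image_shape)   # (32, 32, 3)
```

The per-image shape is the part that describes a single sample. It is what a Keras input declaration such as keras.Input(shape=...) expects, since the batch size is not included there.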