A ConvNet is a sequence of layers, and every layer transforms one volume of activations into another through a differentiable function. We use three main types of layers to build ConvNet architectures: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer. We stack these layers to form a full ConvNet architecture. The layers of a ConvNet have neurons arranged in 3 dimensions: width, height, and depth. The depth here refers to the third dimension of an activation volume.
The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting. Three hyperparameters control the size of the output volume: depth, stride, and zero-padding.
The depth corresponds to the number of filters we would like to use, each of which looks for something different in the input. For example, if the first Convolutional Layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of variously oriented edges or blobs of color.
Second, we must specify the stride with which we slide the filter. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2 then the filters jump 2 pixels at a time as we slide them around. This will produce smaller output volumes spatially.
Sometimes it is convenient to pad the input volume with zeros around the border. The size of this zero-padding is a hyperparameter. The nice feature of zero-padding is that it allows us to control the spatial size of the output volumes. Most commonly it is used to exactly preserve the spatial size of the input volume, so that the input and output width and height are the same.
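To make the effect of stride and zero-padding concrete, here is a minimal sketch on a 1D input of width 5 with a window of size 3. The sizes and the helper name `window_starts` are assumptions chosen for illustration only. It lists where the window can be placed with and without one zero of padding on each side.

```python
import numpy as np

def window_starts(width, f, stride):
    """Start indices at which a window of size f fits inside `width`."""
    return list(range(0, width - f + 1, stride))

x = np.arange(1, 6)                     # 1D input of width 5: [1 2 3 4 5]
padded = np.pad(x, 1, mode="constant")  # one zero per side: [0 1 2 3 4 5 0]

no_pad_s1 = window_starts(len(x), 3, 1)    # stride 1, no padding
no_pad_s2 = window_starts(len(x), 3, 2)    # stride 2, no padding
pad_s1 = window_starts(len(padded), 3, 1)  # stride 1, padding 1

print(no_pad_s1)  # [0, 1, 2]        -> output width 3
print(no_pad_s2)  # [0, 2]           -> output width 2
print(pad_s1)     # [0, 1, 2, 3, 4]  -> output width 5, same as the input
```

With a window of size 3 and stride 1, a padding of 1 keeps the output as wide as the input, while a stride of 2 without padding shrinks it.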
Let’s go through a concrete example of a deep convolutional network. Let’s say you have an image and you want to do image classification or image recognition. We want to take an image X as input and decide whether it is a cat or a dog. For the sake of this example, I’m going to use a fairly small image, [32x32x3].
Let’s say the first layer uses a set of 3×3 filters to detect features, so F = 3, and let’s say we use a stride of 2, no padding, and 10 filters.
The input image [32x32x3] holds the raw pixel values of the image, in this case an image of width 32, height 32, and three color channels R, G, B. A filter in the first layer of a ConvNet might have size 3×3 spatially; it always extends through the full depth of its input, here 3 channels.
The activations in the next layer of the neural network will be [15x15x10]. The 10 comes from the fact that you use 10 filters (the number of filters in a layer becomes the depth of the activation volume it produces), and the 15 comes from the following formula. Notice that because you are using a stride of 2, the spatial dimensions shrink much faster than they would with a stride of 1.
The Conv layer:
Accepts a volume of size W1×H1×D1.
Requires four hyperparameters: the number of filters K, their spatial extent F, the stride S, and the amount of zero padding P.
Produces a volume of size W2×H2×D2, where:
W2 = (W1 − F + 2P)/S + 1
H2 = (H1 − F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
D2 = K
When W1 − F + 2P is not divisible by S, the division is commonly rounded down.
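The formula can be written as a small helper. This is a minimal sketch: the function names are hypothetical, and the use of floor division for the case where the stride does not divide evenly is an assumption (it matches the common convention of rounding down).

```python
def conv_output_size(w, f, p, s):
    """Output size along one axis: (W - F + 2P) / S + 1.

    w: input width or height, f: filter extent,
    p: zero padding per side, s: stride.
    Floor division is an assumption for the non-divisible case.
    """
    if f > w + 2 * p:
        raise ValueError("filter does not fit in the padded input")
    return (w - f + 2 * p) // s + 1

def conv_output_volume(w1, h1, d1, k, f, s, p):
    """Return (W2, H2, D2). D2 equals the number of filters K.

    d1 does not affect the output shape; it only sets the filter depth.
    """
    return conv_output_size(w1, f, p, s), conv_output_size(h1, f, p, s), k

print(conv_output_volume(32, 32, 3, k=10, f=3, s=2, p=0))  # (15, 15, 10)
print(conv_output_volume(32, 32, 3, k=10, f=3, s=1, p=1))  # (32, 32, 10)
```

The second call shows the size-preserving case: with F = 3, S = 1, and P = 1 the output width and height equal the input's.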
In this example, each neuron has a receptive field of size F = 3, the input size is W = 32, the zero padding is P = 0, and the filter is strided across the input with S = 2. Since (32 − 3 + 0)/2 = 14.5 is not a whole number, it is rounded down to 14, giving an output size of 14 + 1 = 15. This is a valid convolution (no padding), and because we are using 10 filters, the number of channels is now 10.
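To check the [15x15x10] shape numerically, here is a naive NumPy sketch of a convolutional forward pass. It is written for clarity rather than speed, it omits biases and the nonlinearity, and the random image and filter values are placeholders, not learned weights. The function name `conv_forward_naive` is an assumption for illustration.

```python
import numpy as np

def conv_forward_naive(x, filters, stride=1, pad=0):
    """x: (H, W, D) input volume; filters: (K, F, F, D).

    Returns an (H2, W2, K) activation volume (no bias, no nonlinearity).
    """
    k, f, _, d = filters.shape
    assert x.shape[2] == d, "filter depth must match input depth"
    # Zero-pad height and width only, never the depth axis.
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    h2 = (x.shape[0] - f) // stride + 1
    w2 = (x.shape[1] - f) // stride + 1
    out = np.zeros((h2, w2, k))
    for i in range(h2):
        for j in range(w2):
            r, c = i * stride, j * stride
            patch = x[r:r + f, c:c + f, :]  # (F, F, D) receptive field
            # Dot product of the patch with each of the K filters.
            out[i, j, :] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # placeholder image
filters = rng.standard_normal((10, 3, 3, 3))  # 10 placeholder 3x3x3 filters
activations = conv_forward_naive(image, filters, stride=2, pad=0)
print(activations.shape)  # (15, 15, 10)
```

Each of the 10 filters produces one 15×15 slice, and stacking those slices along the depth axis gives the output volume.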
It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the number of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation.
The most common form is a pooling layer with filters of size 2×2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. The depth dimension remains unchanged. More generally, a pooling layer with spatial extent F and stride S maps an input width W1 to W2 = (W1 − F)/S + 1, and likewise for the height.
Suppose an input volume has size [15x15x10] and a 2×2 pooling window is applied with a stride of 2 to each of the 10 depth slices. Pooling has no learned filters, so the depth stays at 10. The output spatial size is (15 − 2)/2 + 1 with the division rounded down, i.e. 6 + 1 = 7, so the output volume is [7x7x10].
Padding is very rarely used in pooling layers, so by far the most common value is P = 0.
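The pooling step can be sketched in NumPy as follows. This is a minimal, loop-based illustration with placeholder random activations; the function name `max_pool` is an assumption, and the last row and column of the 15×15 input are dropped because the window does not fit there (the rounding-down convention).

```python
import numpy as np

def max_pool(x, f=2, stride=2):
    """Max pooling over an (H, W, D) volume, applied to each depth slice."""
    h, w, d = x.shape
    h2 = (h - f) // stride + 1
    w2 = (w - f) // stride + 1
    out = np.empty((h2, w2, d), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            r, c = i * stride, j * stride
            window = x[r:r + f, c:c + f, :]         # (F, F, D) window
            out[i, j, :] = window.max(axis=(0, 1))  # max per depth slice
    return out

rng = np.random.default_rng(0)
volume = rng.random((15, 15, 10))  # placeholder activations
pooled = max_pool(volume)
print(pooled.shape)  # (7, 7, 10)
```

The depth of 10 is untouched because the maximum is taken separately within each depth slice.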
Fully Connected layer
Finally, what’s commonly done is to take the [7x7x10] volume, which contains 7 × 7 × 10 = 490 numbers, and flatten (unroll) it into a vector of 490 units. This vector is then fed to a logistic regression unit or a softmax unit, depending on whether you are making a binary decision or trying to recognize any one of k different objects.
This last step just takes all 490 numbers and unrolls them into one long vector, which you can feed into a softmax or logistic regression unit to make the prediction for the final output.
As you go deeper in a neural network, you typically start off with larger images, here [32x32x3], and the height and width gradually trend down, whereas the number of channels generally increases. You see this general trend in many other convolutional neural networks.
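Here is a minimal sketch of the flatten-then-classify step in NumPy. The weight matrix `W`, the bias `b`, and the choice of k = 2 classes (cat versus dog) are assumptions for illustration; the weights are random placeholders rather than trained values, so the probabilities are not meaningful.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector of scores."""
    z = z - z.max()  # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
pooled = rng.random((7, 7, 10))  # placeholder [7x7x10] volume

flat = pooled.reshape(-1)        # unroll into one long vector
print(flat.shape)                # (490,)

k = 2                                           # assumed number of classes
W = rng.standard_normal((k, flat.size)) * 0.01  # placeholder weights
b = np.zeros(k)                                 # placeholder biases
probs = softmax(W @ flat + b)                   # one probability per class
```

For a two-class problem a single logistic regression unit would serve equally well; the softmax form generalizes directly to k classes.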
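The shapes in this example can be traced end to end with a short script. This is a sketch under the same assumptions as the walkthrough (one conv layer with F = 3, S = 2, P = 0, K = 10, followed by 2×2 pooling with stride 2, rounding down where the stride does not divide evenly); the layer list format is hypothetical.

```python
def out_size(size, f, s, p=0):
    """Output size along one axis, rounding the division down."""
    return (size - f + 2 * p) // s + 1

# Hypothetical layer description mirroring the example in the text.
layers = [
    ("conv", {"f": 3, "s": 2, "p": 0, "k": 10}),
    ("pool", {"f": 2, "s": 2}),
]

shape = (32, 32, 3)  # input image: height, width, channels
trace = [shape]
for kind, cfg in layers:
    h, w, d = shape
    p = cfg.get("p", 0)
    depth = cfg["k"] if kind == "conv" else d  # pooling keeps the depth
    shape = (out_size(h, cfg["f"], cfg["s"], p),
             out_size(w, cfg["f"], cfg["s"], p),
             depth)
    trace.append(shape)

for step in trace:
    print(step)
# (32, 32, 3)
# (15, 15, 10)
# (7, 7, 10)

flat_units = shape[0] * shape[1] * shape[2]
print(flat_units)  # 490
```

Height and width fall from 32 to 15 to 7 while the channel count rises from 3 to 10, which is the trend described above.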