Convolutional Neural Networks (CNNs) marked a major leap in deep learning history thanks to their power in image classification. As the name suggests, convolution layers are the core of a CNN.
Although many resources cover this area, a new student may still find the architecture of a CNN difficult to understand. This short blog post presents a simple method to determine and understand the number of trainable parameters and the shape of the output after a convolution operation.
The example is taken from a popular CNN architecture, AlexNet, the winner of the 2012 ImageNet challenge. With this diagram, I will explain a simple formula that applies to all of your convolution layers.
Note: h = height of input, w = width of input, d = depth of input, m = filter width, l = filter height (here m = l = filter size), s = stride, p = width of output, q = height of output, k = depth of output
No image padding is assumed. If padding is applied, the same concept can be extended assuming padded input as the new input.
Determining the Output Shape
The input to the convolution operation is a 227x227x3 block. This is the RGB input image, which is why the depth is 3. The kernel (filter) is 11x11x3 and there are 96 such kernels (filters). The stride is 4, meaning the filter moves 4 pixels at a time.
Notice that each kernel has a depth of 3, equal to the depth of the input. This is necessary because each 11×11 kernel must be applied across all 3 layers of the input.
Normally, kernels are square (equal width and height) with odd side lengths (3, 5, 7, 11, etc.). If we slide one 11×11 matrix over one 227×227 layer without skipping pixels, the output size is:
(227-11+1) x (227-11+1) = 217×217
This can be represented in general terms with our symbolic definition as:
(w-m+1) x (h-l+1)
You may still be wondering what the intuition behind this calculation is. See the picture below:
For example, this is one 5×5 layer of input to a convolution layer, and the filter size is 3×3. When we slide the filter over the image, it can be applied only at the pixels enclosed by the red line (3×3). After the convolution operation, the output is a 3×3 matrix:
(5-3+1) x (5-3+1) = 3×3
See, it’s simple.
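If the formula still feels abstract, counting the valid filter positions by brute force gives the same answer. This is only an illustrative sketch in Python; the helper name count_positions is made up for this post, and it looks at one dimension at a time with stride 1 and no padding.

```python
def count_positions(input_size, filter_size):
    """Count the start offsets at which the filter fits entirely inside the input."""
    positions = [
        start
        for start in range(input_size)
        if start + filter_size <= input_size
    ]
    return len(positions)

small = count_positions(5, 3)      # 5x5 input, 3x3 filter
large = count_positions(227, 11)   # 227x227 input, 11x11 filter
print(small, large)                # 3 217
```

Each count matches (input size - filter size + 1), which is exactly the term used for the width and the height above.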
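Let’s go back to our original example. Assuming no skipping of pixels (stride = 1), our output is 217×217, but in the example the stride is 4.
With a stride of 4, the filter lands on only every fourth position, so each dimension becomes 217/4 = 54.25, rounded up to 55. This gives us a 55×55 output. In more generalized terms, where the square brackets mean rounding up to the next whole number: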
[(w-m+1)/s] x [(h-l+1)/s]
Now we have a 55×55 output matrix from convolving with one filter. There are 96 such filters, so we can easily determine the output size: 55×55×96.
In our symbolic notation, it is:
[(w-m+1)/s] x [(h-l+1)/s] x k
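As a quick check, the whole output shape can be computed with a small Python function. This is a sketch under the same assumptions as the text (no padding); the function name conv_output_shape is hypothetical. It uses integer division, (w - m) // s + 1, which gives the same result as rounding (w-m+1)/s up to the next whole number.

```python
def conv_output_shape(w, h, m, l, s, k):
    """Return (p, q, k): output width, height and depth, assuming no padding."""
    p = (w - m) // s + 1   # same as (w - m + 1) / s rounded up
    q = (h - l) // s + 1   # same as (h - l + 1) / s rounded up
    return p, q, k

shape = conv_output_shape(w=227, h=227, m=11, l=11, s=4, k=96)
print(shape)  # (55, 55, 96)
```

Note that the input depth d does not appear here: it affects the filter depth, not the output shape.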
I know you may still have a doubt. Each filter has one slice for each input layer (depth = 3). What happened to those?
This is a particular feature of convolution in CNNs: we sum the convolution outputs of all 3 layers to build one layer of output. That means each 55×55 slice you see is the result of not just one input layer but all 3.
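The summing step can be seen on a toy example. The sketch below uses plain Python lists and made-up data (three 4×4 input layers filled with 1s, 2s and 3s, and an all-ones 3×3×3 filter), so the numbers are easy to verify by hand. The helper name slide is hypothetical, and it assumes square inputs, stride 1 and no padding.

```python
def slide(channel, kernel_2d):
    """Slide a 2-D kernel over a 2-D channel (stride 1, no padding)."""
    size = len(kernel_2d)
    out_size = len(channel) - size + 1
    return [
        [
            sum(
                channel[i + a][j + b] * kernel_2d[a][b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]

# Toy data: input layer c is a 4x4 grid filled with the value c + 1.
channels = [[[c + 1] * 4 for _ in range(4)] for c in range(3)]
# One filter with three 3x3 slices, all weights set to 1.
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]

# One 2x2 result per input layer: all 9s, all 18s, all 27s.
per_channel = [slide(channels[c], kernel[c]) for c in range(3)]

# Summing the three results element-wise gives a single 2x2 output slice.
feature_map = [
    [sum(m[i][j] for m in per_channel) for j in range(2)]
    for i in range(2)
]
print(feature_map)  # [[54, 54], [54, 54]]
```

Three per-layer results go in, but only one output slice comes out, which is why the output depth depends on the number of filters and not on the input depth.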
Number of Trainable Parameters
Calculating the number of trainable parameters is shorter than our previous calculation of the output shape.
Let’s get into it.
There are 96 filters, each 11×11, applied across 3 input layers. In addition, each output slice (we have 96 of them) has one bias term.
Then the total number of trainable parameters is:
(11x11x3+1) x 96 = 34,944
If we generalize this into our symbolic terms:
(m x l x d+1) x k
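The same formula as a one-line Python function, in case you want to check your own layers. The function name conv_param_count is made up for illustration, and it assumes one bias per filter as described above.

```python
def conv_param_count(m, l, d, k):
    """Weights per filter (m * l * d) plus one bias, multiplied by k filters."""
    return (m * l * d + 1) * k

print(conv_param_count(m=11, l=11, d=3, k=96))  # 34944
```

The stride and the input width and height play no part here: they change the output shape but not the number of parameters.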
We have reached the end of this article. From here onwards, you can use this method for all of your CNN's convolution-related parameter calculations.