*By Shashika Munasinghe*

Convolutional Neural Networks (CNNs) marked a huge leap in deep learning history thanks to their power in image classification. As the name suggests, convolution layers are a major part of a CNN.

Although there are many resources on this topic, a new student may still find the architecture of a CNN difficult to understand. This short blog presents a simple method to determine and understand the number of trainable parameters and the shape of the output after a convolution operation.

The example is extracted from a popular CNN architecture, AlexNet, the winner of the 2012 ImageNet challenge. With this diagram, I will explain a simple formula that you can apply to all of your convolution layers.

*Note: h = height of input, w = width of input, d = depth of input, m = l = filter size, s = stride, p = width of output, q = height of output, k = depth of output*

*No image padding is assumed. If padding is applied, the same concept can be extended by treating the padded input as the new input.*

**Determining the Output Shape**

The input to the convolution operation is a 227x227x3 block. This is the RGB image (that's why the depth is 3) that is set as input. The kernel (filter) is 11x11x3 and there are 96 such kernels (filters). The stride is 4 (the filter moves 4 pixels at a time).

Notice that each kernel has a depth of 3, matching the depth of the input. This is necessary because the 11×11 kernel must be applied to all 3 layers of the input.

Normally, kernels are square (width and height the same) and of odd lengths (3, 5, 7, 11, etc.). If we slide one 11×11 matrix over one layer of the 227×227 matrix without skipping pixels, we get the output size as:

(227-11+1) x (227-11+1) = 217×217

This can be represented in general terms with our symbolic definition as:

**(w-m+1) x (h-l+1)**

You may still be wondering what the intuition behind this calculation is. See the picture below:

For example, take one layer of input to a convolution layer of size 5×5 and a filter of size 3×3. When we slide the filter over the image, it can be applied only at the pixels outlined in red (a 3×3 region), because anywhere else it would hang over the edge. After the convolution operation, the output is a 3×3 matrix:

(5-3+1) x (5-3+1) = 3×3
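To make the sliding intuition concrete, here is a minimal Python sketch that lists every starting offset where a filter fits fully inside one row of the input. The function name `valid_positions` is a hypothetical helper chosen for illustration; it assumes no padding and a stride of 1.

```python
def valid_positions(input_size: int, filter_size: int) -> list:
    """Return every start offset where the filter fits fully inside the input."""
    return [start for start in range(input_size)
            if start + filter_size <= input_size]


positions = valid_positions(5, 3)
print(positions)       # [0, 1, 2]
print(len(positions))  # 3, which equals 5 - 3 + 1
```

The same count applies along the height, which is why the 5×5 input with a 3×3 filter gives a 3×3 output.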

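See, it's simple.

Let's go back to our original example. Assuming no skipping of pixels (stride=1) our output is 217×217, but in the example, the stride is given as 4.

With a stride of 4, the filter lands on only every fourth position, so we divide by 4 and round up: 217/4 = 54.25, which gives 55, and the output is 55×55. In more generalized terms, where the square brackets mean rounding up to the next whole number: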

**[(w-m+1)/s] x [(h-l+1)/s]**

Now we have a 55×55 output matrix from convolving with one filter. There are 96 such filters, so we can easily determine the output size:

55x55x96

In our symbolic notation, it is:

**[(w-m+1)/s] x [(h-l+1)/s] x k**
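The formula can be checked with a few lines of Python. This is a minimal sketch, assuming the square brackets mean rounding up and that no padding is used; `conv_output_shape` is a hypothetical helper name, and its arguments follow the symbols from the note at the top.

```python
import math


def conv_output_shape(w, h, m, l, s, k):
    """Output shape (p, q, k) of a convolution layer without padding."""
    p = math.ceil((w - m + 1) / s)  # width of output
    q = math.ceil((h - l + 1) / s)  # height of output
    return p, q, k


print(conv_output_shape(227, 227, 11, 11, 4, 96))  # (55, 55, 96)

# Rounding up (w - m + 1) / s gives the same result as the
# commonly quoted form floor((w - m) / s) + 1.
assert math.ceil((227 - 11 + 1) / 4) == (227 - 11) // 4 + 1
```

With a stride of 1, the same function returns the 217×217 result from earlier.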

*I know that you may still have a doubt. There are filters for each input layer (depth=3). What happened to those?*

This is a special property of convolution in CNNs: we sum the convolution outputs of all 3 layers to build one layer of output. That means each 55×55 slice you see is not the result of one input layer but of all 3 layers combined.
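The summing across input layers can be sketched in plain Python on a toy example. This is only an illustration under stated assumptions: the input is square, there is no padding, and the names `conv2d_single` and `conv2d_multi` are hypothetical helpers, not part of any library.

```python
def conv2d_single(layer, kernel, stride=1):
    """Slide one 2-D kernel over one square 2-D input layer (no padding)."""
    size, m = len(layer), len(kernel)
    starts = range(0, size - m + 1, stride)
    return [[sum(layer[r + i][c + j] * kernel[i][j]
                 for i in range(m) for j in range(m))
             for c in starts]
            for r in starts]


def conv2d_multi(layers, kernels, bias=0.0, stride=1):
    """Convolve each input layer with its own kernel slice, then sum the results."""
    per_layer = [conv2d_single(x, k, stride) for x, k in zip(layers, kernels)]
    rows, cols = len(per_layer[0]), len(per_layer[0][0])
    return [[sum(out[r][c] for out in per_layer) + bias
             for c in range(cols)]
            for r in range(rows)]


# Toy input: 3 layers of 5x5, all ones; one 3x3x3 filter, all ones.
layers = [[[1] * 5 for _ in range(5)] for _ in range(3)]
kernels = [[[1] * 3 for _ in range(3)] for _ in range(3)]
output = conv2d_multi(layers, kernels)
print(len(output), len(output[0]))  # 3 3
print(output[0][0])                 # 27.0 (9 from each of the 3 layers)
```

One filter with depth 3 therefore produces a single output slice; repeating this for each of the k filters gives the k output slices.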

**Number of Trainable Parameters**

Calculating the number of trainable parameters is shorter than our previous calculation of the output shape.

Let's get into it.

There are 96 filters, each 11×11, applied across the 3 input layers. Apart from these weights, there is one bias term for each output slice (we have 96 output slices).

Then, the total number of trainable parameters is:

(11x11x3+1) x 96 = 34,944

If we generalize this into our symbolic terms:

**(m x l x d+1) x k**
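As a quick check, the formula translates directly into a small Python function. The name `conv_trainable_params` is a hypothetical helper; it assumes one bias per filter, as described above.

```python
def conv_trainable_params(m, l, d, k):
    """Trainable parameters of a convolution layer: weights plus one bias per filter."""
    weights_per_filter = m * l * d
    return (weights_per_filter + 1) * k


print(conv_trainable_params(11, 11, 3, 96))  # 34944
```

Note that the result does not depend on the input width, height, or stride, only on the filter size, the input depth, and the number of filters.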

We have reached the end of this article. From here onwards, you can simply use this method for all convolution-related parameter calculations in CNNs.