Convolutional Neural Network


Convolutional Neural Networks are a kind of network inspired by the cat’s visual cortex.

A cat’s visual cortex is made of 2 distinct types of cells:

  • simple cells, which specialize in edge detection.
  • complex cells, which have larger receptive fields, are sensitive to a small region of the visual field, and are less sensitive to the exact position of edges within it.

Convolutional neural networks are inspired by the latter type of cell: each neuron is sensitive to a small region of the input data and less sensitive to the exact position of a pattern within it.

This makes the network more robust to image translations, for instance. Let’s say we’re building a classifier that recognises kitten pictures. It doesn’t matter where the kitten is in the image; we’re only interested in whether a kitten is present. A convnet (convolutional network) makes the kitten easier to detect by being less sensitive to its exact position.


Recall from the fully-connected MLP that each input point is linked to each neuron in the next layer.

This time we want a small region of the input data to be linked to a single neuron in the next layer.

Therefore convolutional networks work well with grid-like input data such as images or time series (1-d grid).

We need a way to convert a small region (i.e. several datapoints located close to each other) into a single value. We call this operation convolution and, by extension, every neural net using this operation in at least one of its layers is called a convolutional network.

Instead of a regular matrix multiplication, the convolution operation computes a dot product between each sub-region of the input matrix and a second matrix called a kernel. The kernel size defines the size of the sub-regions, and the kernel’s values are the weights used to compute each sub-region’s value.

S(i,j) = (I * K)(i,j) = \sum_{m}\sum_{n}I(i+m, j+n)K(m,n)

The output of the convolution is called a feature map.
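To make the formula above concrete, here is a minimal sketch in plain Python (no libraries). It assumes a stride of 1 and no padding; the toy image and the 2 x 2 kernel values are made up for illustration only.

```python
def conv2d(image, kernel):
    """Apply S(i, j) = sum_m sum_n I(i + m, j + n) K(m, n) at every valid (i, j)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Dot product between the kernel and the sub-region whose
            # top-left corner sits at (i, j).
            total = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(k_rows)
                for n in range(k_cols)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4 x 4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Illustrative kernel that responds to a left-to-right increase in intensity.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the kernel straddles the edge, and the same 4 kernel weights were used at all 9 positions.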

Convolution operation

A convolution layer is actually made of several operations:

  • convolution: the convolution operation itself
  • detector: the activation function (e.g. ReLU)
  • pooling: a downsampling operation where we keep only a single value for each sub-region

Operations involved in a convolution layer
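The detector stage is the simplest of the three. As a small sketch, assuming ReLU is the chosen activation, it just replaces every negative value of the feature map with zero:

```python
def relu(feature_map):
    """Detector stage: keep positive activations, clamp negative ones to 0."""
    return [[max(0, value) for value in row] for row in feature_map]


activated = relu([[-2, 3], [0, -1]])
# activated == [[0, 3], [0, 0]]
```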

As we have seen, the convolution operation is applied at many different places over the input data. Re-using the same kernel at every location reduces the number of parameters; this is known as parameter sharing.

Using a different kernel for each convolution operation would lead to an intractable number of parameters. Moreover, it makes sense to re-use the same kernel: if an operation is useful at one location, it is probably useful at many other locations of the input data.
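A quick back-of-the-envelope count shows how much parameter sharing saves. The sketch below assumes a square single-channel input, a stride of 1, no padding, and ignores bias terms; the function names are made up for this illustration.

```python
def shared_kernel_params(kernel_size):
    """Weights of one kernel that is re-used at every location."""
    return kernel_size * kernel_size


def unshared_kernel_params(input_size, kernel_size):
    """Weights needed if every kernel position had its own kernel."""
    positions = (input_size - kernel_size + 1) ** 2
    return positions * kernel_size * kernel_size


def fully_connected_params(input_size, output_size):
    """Weights needed if every input point were linked to every output neuron."""
    return (input_size ** 2) * (output_size ** 2)


# 28 x 28 input, 5 x 5 kernel, hence a 24 x 24 output.
shared = shared_kernel_params(5)            # 25
unshared = unshared_kernel_params(28, 5)    # 14400
dense = fully_connected_params(28, 24)      # 451584
```

Even on a tiny 28 x 28 image the gap is several orders of magnitude, and it grows quickly with the image size.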

Using a small kernel limits how many neurons of the next layer a given input neuron influences. In a fully connected network, by contrast, each neuron of the input layer influences all neurons of the next layer.

Output neurons influenced by a single input neuron in a convolutional layer

Influence of a single input neuron on the output layer in a fully connected network

What’s interesting to note is that, as we add more layers, each input neuron gets a chance to influence the final outcome: a neuron in a deep layer indirectly depends on a growing region of the input.

Influence of the inputs on a single output neuron
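We can quantify this growing influence with the size of the receptive field, i.e. how many input points (along one axis) a single neuron ends up depending on. The sketch below uses the standard recurrence for stacked layers; the function name and the example layer lists are illustrative assumptions.

```python
def receptive_field(layers):
    """Input region (along one axis) seen by a single neuron of the last layer.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    size = 1  # a neuron of the input only sees itself
    jump = 1  # distance, in input points, between two adjacent neurons
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size


one_layer = receptive_field([(3, 1)])               # 3
three_layers = receptive_field([(3, 1)] * 3)        # 7
# A 5-wide convolution, a 2-wide pooling with stride 2, then another 5-wide convolution.
with_pooling = receptive_field([(5, 1), (2, 2), (5, 1)])  # 14
```

Each extra layer widens the region of the input that can reach a given output neuron, and strided or pooling layers make it grow faster.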

Meta parameters

A CNN is somewhat similar to a classic MLP (Multi Layer Perceptron) but requires additional meta-parameters for the convolution.

Kernel size

We need to define a kernel size. This represents the size of the sub-regions we’d like to consider for convolution.


Stride

The stride defines how we move the kernel over the input data. Do we move one data point at a time (a stride of 1, with heavily overlapping sub-regions), or do we avoid overlap and move by the kernel width?

Striding (size 2) in a convolution layer
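As a small illustration of the two options, this sketch lists where the kernel starts along one axis for a given stride. It assumes no padding, and the helper name is made up for the example.

```python
def window_starts(input_size, kernel_size, stride):
    """Start index of every kernel position along one axis (no padding)."""
    return list(range(0, input_size - kernel_size + 1, stride))


overlapping = window_starts(6, 2, 1)      # [0, 1, 2, 3, 4]
non_overlapping = window_starts(6, 2, 2)  # [0, 2, 4]
```

With a stride equal to the kernel width, every input point belongs to exactly one sub-region and the output is correspondingly smaller.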


Padding

If we don’t use padding, each convolution output is smaller than its input, which limits the number of convolution layers we can stack.

To avoid reducing the size of the layers we add extra padding around the input data.
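Kernel size, stride and padding together determine the size of the output. Here is a minimal sketch of the usual formula, floor((n + 2p - k) / s) + 1, for one axis; the function name is an assumption, and `padding` is the number of points added on each side.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Number of kernel positions along one axis."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


no_padding = conv_output_size(28, 5)               # 24: the output shrinks
same_size = conv_output_size(28, 5, padding=2)     # 28: padding restores the size
halved = conv_output_size(28, 2, stride=2)         # 14: non-overlapping 2-wide windows
```

For a stride of 1 and an odd kernel size k, padding each side with (k - 1) / 2 points keeps the output the same size as the input.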

Depth

As with an MLP, we need to define the number of neurons in every layer. For a convolution layer this is the number of kernels applied to the input, called the depth; each kernel produces its own feature map.


Pooling

As already mentioned, pooling is a downsampling operation that keeps a single value for each convolved sub-region. Standard pooling functions are max pooling (keeps only the maximum value) or, less frequently, the L2 or L1 norm.

It’s the pooling function that makes convolutional networks insensitive to small translations. For a given sub-region, the maximum value is likely to remain the same if the image moves slightly in any direction.
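Here is a minimal sketch of max pooling over non-overlapping blocks, together with a toy example of the translation argument. The feature maps are made up for illustration: the second one is the first one shifted by one pixel to the right.

```python
def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [
                feature_map[i + m][j + n]
                for m in range(size)
                for n in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


original = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]
# Same activations, shifted one pixel to the right.
shifted = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]

pooled_original = max_pool(original)  # [[9, 0], [0, 5]]
pooled_shifted = max_pool(shifted)    # [[9, 0], [0, 5]]
```

Both inputs pool to the same 2 x 2 output because each activation stayed inside its block. A larger shift that crosses a block boundary would change the result, which is why the insensitivity only holds for small translations.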

Concrete example

Now that we’ve covered all the theory let’s go through a real world example to see how everything fits together.

Let’s consider the well-known case of image classification of hand-written digits where the input data is a grayscale image and the output is the digit written in the image.

The input data is a grayscale image of size 28 x 28, so the input tensor has dimensions 28 x 28 x 1 (depth 1 because there is only one grayscale channel; for RGB images we would have a depth of 3).

We want to transform this input data into a 1 x 1 x 10 tensor that indicates which digit the image represents.

We’re going to achieve this by defining several convolutional layers in our network.

Layers of convolution network used to classify hand-written digits

Our first layer will work on 5 x 5 sub-regions with a stride of 1 pixel. We’re going to use zero padding around the input data so that the convolution output keeps the same size as the input. Finally we apply a ReLU activation function followed by a max pooling operation on 2 x 2 blocks.

Our kernel dimensions are therefore 5 x 5 x 1 x 32:

  • the first 2 dimensions define the size of the subregion
  • the third dimension is the depth of the input data
  • the last dimension is the depth of the output

and the output dimensions are 14 x 14 x 32. The dimensions are reduced from 28 x 28 to 14 x 14 by the pooling operation over 2 x 2 blocks.

Our second convolution layer is similar to the first one: it contains the same operations (5 x 5 convolution with ReLU and max pooling) but transforms the 14 x 14 x 32 tensor into a 7 x 7 x 64 tensor.

The next step is a fully connected layer that will transform our 7 x 7 x 64 tensor into 1024 data points. We do it by simply reshaping our tensor into a flat vector of 3136 values (7 x 7 x 64 = 3136), then applying a standard matrix multiplication and ReLU activation.

We can also add a dropout operation, where we randomly drop some of the outputs in order to reduce overfitting.
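As a rough sketch of what dropout does at training time, the function below zeroes each value with probability 1 - keep_prob. Rescaling the surviving values by 1 / keep_prob (so-called inverted dropout) is an assumption made here, not something the example above specifies, and the function name and signature are illustrative.

```python
import random


def dropout(values, keep_prob, rng=random):
    """Zero each value with probability 1 - keep_prob and rescale the survivors."""
    return [
        value / keep_prob if rng.random() < keep_prob else 0.0
        for value in values
    ]


rng = random.Random(0)  # seeded so the example is reproducible
activations = [1.0] * 8
dropped = dropout(activations, keep_prob=0.5, rng=rng)
# Every entry of `dropped` is either 0.0 (dropped) or 2.0 (kept and rescaled).
```

The rescaling keeps the expected value of each output unchanged, so the layer can simply be left out at prediction time.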

Finally, our last layer is a classic neural net layer that transforms our 1024 data points into 10 output values using a softmax activation.
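To check that everything fits together, we can trace the tensor shapes through the network and count the weights of each layer. This is a plain-Python sketch, not the tutorial's code. It assumes that the second kernel has dimensions 5 x 5 x 32 x 64 (following the same pattern as the first one) and that every layer has one bias per output channel or neuron.

```python
def conv_layer_shape(height, width, out_depth, pool=2):
    """Shape after a size-preserving convolution followed by pool x pool max pooling."""
    return (height // pool, width // pool, out_depth)


input_shape = (28, 28, 1)
conv1 = conv_layer_shape(28, 28, out_depth=32)                # (14, 14, 32)
conv2 = conv_layer_shape(conv1[0], conv1[1], out_depth=64)    # (7, 7, 64)
flat_size = conv2[0] * conv2[1] * conv2[2]                    # 3136

# Weights + biases per layer (biases are an assumption, see above).
params = {
    "conv1": 5 * 5 * 1 * 32 + 32,
    "conv2": 5 * 5 * 32 * 64 + 64,
    "fully_connected": flat_size * 1024 + 1024,
    "output": 1024 * 10 + 10,
}
total_params = sum(params.values())  # 3274634
```

Under these assumptions the fully connected layer holds the vast majority of the weights (about 3.2 million), while the two convolution layers together need roughly 52,000 thanks to parameter sharing.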

An implementation of this example with TensorFlow can be found in the TensorFlow CNN tutorial.