Neural network design

Today I continue my neural network post series with some considerations on neural network implementation.

So far we have covered what a neural network is and how it works, but we are still left with numerous choices regarding its design.

How many layers should we use, how many units (neurons) in each layer, which activation functions, which cost function, … ? There are so many questions and choices to make that it has bothered me for quite some time now.

If you search the web you may find some advice on these questions. But that is all you will find: advice, as there are no clear answers. It’s just trial and error, so you’d better try for yourself and see how different designs perform on your problem.

However it is still tough to know how to start. Fortunately there is some consensus on what performs better, what generalises best, …

So in this article I try to summarise what I have read here and there and give some hints so that you have an idea of which direction to look in.

Neural network training can be summarised as a function optimisation problem: we need to find a minimum of our cost function, ideally the global one. However, cost functions are often very complex (thousands or millions of variables), which makes the optimisation hard to analyse; it is still an active research subject.

Learning algorithm

So far stochastic or mini-batch gradient descent and their variants remain the simplest and most efficient learning algorithms in use today.

Gradient descent is rarely run over the whole dataset at once. Instead it is run over a small sample (a mini-batch) randomly chosen from the dataset. Randomly sampling the dataset can be seen as adding noise into the training process, which helps our network to generalise better (less prone to overfitting).
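To make the idea concrete, here is a minimal sketch of mini-batch gradient descent. It assumes a toy linear model trained with a mean squared error rather than a real neural network, and the function name `minibatch_sgd`, the learning rate and the batch size are all illustrative choices of mine.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=16, epochs=200, seed=0):
    """Fit y ~ X @ w with mini-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)            # reshuffle the dataset every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]     # one randomly chosen mini-batch
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of the batch MSE
            w -= lr * grad
    return w

# Toy data (assumption): noiseless targets generated from known weights.
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
w = minibatch_sgd(X, y)
```

Each update only looks at 16 samples, so the gradient is a noisy estimate of the full-dataset gradient, yet on this toy problem the weights still end up very close to `true_w`.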

Activation function

The choice of the activation function is also key. Today’s consensus seems to be to use ReLU activation for the hidden layers and softmax or sigmoid for the output layer.

The advantage of the ReLU activation is that it is easy to differentiate: its derivative is either 0 or 1 depending on whether x < 0 or x > 0, and it is defined almost everywhere (everywhere except at x = 0). Previously people tended to prefer the sigmoid function to avoid the discontinuity of the derivative, but in practice it turns out this is not a problem and ReLU usually performs better than the sigmoid.
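As a small illustration, ReLU and its derivative can be sketched in a couple of lines. The value returned at exactly x = 0 is a convention I picked here (0); the derivative is not defined at that point.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 0 for x < 0 and 1 for x > 0; at x == 0 we pick 0 by convention.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
activations = relu(x)
gradients = relu_grad(x)
```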

For the output layer the softmax is a nice choice when the prediction is a probability distribution over different outcomes (classification problem). The outputs of the softmax function sum to 1, which allows us to treat each output as a probability. If there is a single output the sigmoid function plays the same role and tends to avoid values in the middle (around 0.5) thanks to its S-shape.

The advantage of the softmax and sigmoid functions is that they are built from exponentials and therefore play nicely with the cross-entropy cost function. The cross-entropy is log-based, which allows it to “cancel” the exponential of the sigmoid and softmax functions.
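One way to see this “cancellation” is to look at the gradient of the cross-entropy with respect to the inputs of the softmax: it reduces to the predicted probabilities minus the target. The sketch below checks this numerically; the input values `z` and the one-hot `target` are made up for the example.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(z, target):
    """Cross-entropy between softmax(z) and a one-hot target."""
    return -np.sum(target * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
target = np.array([0.0, 0.0, 1.0])

analytic_grad = softmax(z) - target   # the log "cancels" the exponential

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric_grad = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric_grad[i] = (cross_entropy(z + step, target)
                       - cross_entropy(z - step, target)) / (2 * eps)
```

The two gradients agree, and no exponential or logarithm is left in the analytic expression.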

Cost function

As we’ve seen, cross-entropy seems to be the cost function of choice for neural networks, especially when softmax or sigmoid functions are used in the output layer.

The cross-entropy seems to perform better than the mean squared error.
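One commonly cited reason is the size of the gradient when a sigmoid output is confidently wrong. The sketch below compares the gradients of both costs with respect to the sigmoid input z; the values z = -6 and y = 1 are arbitrary and only serve to put the sigmoid in its flat region.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cross_entropy(z, y):
    # d/dz of -[y log a + (1 - y) log(1 - a)] with a = sigmoid(z)
    return sigmoid(z) - y

def grad_squared_error(z, y):
    # d/dz of 0.5 * (a - y)^2 with a = sigmoid(z)
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

z, y = -6.0, 1.0          # confidently wrong prediction: a is close to 0, target is 1
g_ce = grad_cross_entropy(z, y)
g_mse = grad_squared_error(z, y)
```

With the cross-entropy the gradient stays close to -1, while with the squared error it is multiplied by the tiny slope of the sigmoid and almost vanishes, so learning from this mistake would be much slower.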

Network architecture

There is no good answer on this topic. As guidance we can say that shallow and wide networks (few layers with many nodes) are easier to train but achieve lower performance on the validation set. They tend to overfit the training dataset and fail to generalise.

On the other hand a deep and narrow network achieves better generalisation but requires a bigger dataset for training. The training is also more costly, but the forward propagation is faster to compute as the network contains fewer units overall than a shallow and wide network.
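To get a feeling for the sizes involved, we can count the hidden units and parameters of two fully connected networks. The layer sizes below are purely hypothetical; they are only there to compare one wide hidden layer with four narrow ones.

```python
def count_units_and_weights(layer_sizes):
    """Return (hidden units, weights + biases) for a fully connected network."""
    hidden_units = sum(layer_sizes[1:-1])
    params = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    return hidden_units, params

shallow_wide = [100, 1024, 10]            # one hidden layer of 1024 units
deep_narrow = [100, 64, 64, 64, 64, 10]   # four hidden layers of 64 units

wide_units, wide_params = count_units_and_weights(shallow_wide)
deep_units, deep_params = count_units_and_weights(deep_narrow)
```

With these sizes the deep network has 256 hidden units and 19,594 parameters, against 1,024 hidden units and 113,674 parameters for the wide one.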

Regularisation strategies

Data augmentation

Data augmentation consists in generating new samples and using them for training. Depending on your dataset it might be more or less easy. E.g. for image classification it’s possible to generate a new image by applying a translation to an existing image (or a rotation, zoom, etc.). This will help your network to generalise by being able to recognise an object whatever its position in the picture.
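Here is a minimal sketch of the translation case, assuming a greyscale image stored as a 2-D NumPy array. The `translate` helper is mine, and filling the uncovered pixels with zeros is just one possible choice.

```python
import numpy as np

def translate(image, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling uncovered pixels with 0."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Part of the original image that stays visible after the shift.
    src = image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

image = np.zeros((5, 5))
image[2, 2] = 1.0                          # a single bright pixel in the centre

# Nine variants of the same image: every shift of -1, 0 or +1 pixel.
augmented = [translate(image, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
```

All nine images keep the label of the original, so one labelled sample has turned into nine.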

Noise addition

This idea is similar to data augmentation. Here we add some noise into the dataset to help our network to generalise better. The noise makes our dataset less specific and helps generalisation. Noise can be added to the dataset but also to the hidden layers.
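For the input case, a simple sketch is to add Gaussian noise to each batch before feeding it to the network. The Gaussian distribution and the standard deviation of 0.1 are assumptions for the example; the right amount of noise depends on your data.

```python
import numpy as np

def add_gaussian_noise(batch, std, rng):
    """Return a noisy copy of the batch; the original data is left untouched."""
    return batch + rng.normal(loc=0.0, scale=std, size=batch.shape)

rng = np.random.default_rng(0)
batch = np.ones((4, 3))
noisy = add_gaussian_noise(batch, std=0.1, rng=rng)
```

Drawing fresh noise every time a batch is used means the network never sees exactly the same sample twice.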


Dropout

Dropout is a technique where we randomly drop the output of a node. It makes sure that essential information finds its way through the network and doesn’t rely on a specific unit.

It has proven to be a very effective technique to improve generalisation. The probability of keeping a unit is defined for each layer and is one of the model’s hyperparameters. Typical values are 0.8 for the input layer and 0.5 for the hidden layers (i.e. drop probabilities of 0.2 and 0.5).

In some sense this is also related to noise addition into the hidden layers.
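A minimal sketch of dropout on one layer’s activations could look like this. It uses the “inverted dropout” variant, which rescales the surviving units during training so that nothing needs to change at test time; `keep_prob` is the probability of keeping a unit, and the function name is mine.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training:
        return activations                      # no units are dropped at test time
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
hidden = np.ones((1000, 50))
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
```

About half of the activations are set to zero and the others are doubled, so the average activation stays close to 1.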

Weight decay

We already explained weight decay in more detail. The main idea is to add a term to the cost function to limit the values of the weights. It favours smaller weight values.

Weight decay tends to keep the parameters closer to the origin

If the cost function reaches its minimum at some given point, weight decay makes us choose a point close to it but also closer to the origin (the closer to the origin, the smaller the weights).
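This pull towards the origin can be checked on a toy problem. The sketch below assumes an L2 penalty and a simple quadratic cost whose minimum `w_star` is known; the decay strength of 0.5 is exaggerated on purpose to make the effect visible.

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, steps=500, weight_decay=0.0):
    """Plain gradient descent with an optional L2 penalty (weight_decay / 2) * ||w||^2."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * (grad_fn(w) + weight_decay * w)   # the extra term shrinks the weights
    return w

w_star = np.array([3.0, -2.0])            # minimum of the unregularised toy cost

def grad(w):
    return w - w_star                     # gradient of 0.5 * ||w - w_star||^2

w_plain = gradient_descent(grad, [0.0, 0.0])
w_decay = gradient_descent(grad, [0.0, 0.0], weight_decay=0.5)
```

Without weight decay we land on `w_star`; with it we land on `w_star / 1.5`, which points in the same direction but sits closer to the origin.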

Early stop

In early stopping we don’t wait for the learning algorithm (e.g. SGD) to find a minimum but stop after a given number of steps.

Early stopping stops the gradient descent before reaching a minimum achieving similar effect to weight decay

Assuming we start close to the origin, its effect is similar to weight decay as it prevents the weight values from going too far from the origin.
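The same kind of toy problem illustrates this. In the sketch below gradient descent starts at the origin and is simply cut off after `max_steps` updates; the quadratic cost, its curvatures and the step counts are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr, max_steps):
    """Run exactly max_steps updates; a small max_steps acts as early stopping."""
    w = np.array(w0, dtype=float)
    for _ in range(max_steps):
        w -= lr * grad_fn(w)
    return w

w_star = np.array([3.0, -2.0])            # minimum of the toy cost
curvature = np.array([1.0, 0.1])          # the second direction is learned slowly

def grad(w):
    return curvature * (w - w_star)       # gradient of 0.5 * sum(c * (w - w_star)^2)

w_early = gradient_descent(grad, [0.0, 0.0], lr=0.5, max_steps=5)
w_full = gradient_descent(grad, [0.0, 0.0], lr=0.5, max_steps=2000)
```

After 5 steps every weight is still smaller in magnitude than at the minimum, especially along the flat direction, which is the same kind of shrinkage that weight decay produces. In practice the stopping point is often chosen by watching the error on a validation set rather than fixed in advance.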

Parameter sharing

In parameter sharing we share the weights between units. For example, for a neural network that classifies images we might want to share parameters for pixels close to each other. Instead of learning a different weight for each pixel we reuse the same small set of weights over every 3×3 pixel region.

This is typically what a CNN (Convolutional Neural Network) does. An RNN (Recurrent Neural Network) can also be seen as sharing parameters over time.
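As a rough sketch of the convolutional case, the loop below slides one 3×3 kernel over a small image. The 6×6 image and the averaging kernel are made up for the example, and as in most deep learning libraries the operation is technically a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.full((3, 3), 1.0 / 9.0)       # hypothetical 3x3 averaging kernel
feature_map = conv2d_valid(image, kernel)

shared_weights = kernel.size                       # weights with sharing
dense_weights = image.size * feature_map.size      # weights of a dense layer with the same output size
```

Producing the 4×4 output takes 9 shared weights here, whereas a fully connected layer mapping the 36 pixels to 16 outputs would need 576.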


Bagging

In bagging we train several different models and then average their outputs to make a decision.
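Here is a minimal sketch of the idea. To keep it short it makes two assumptions: the models are simple least-squares linear models instead of neural networks, and each one is made “different” by training it on a bootstrap sample (the training set resampled with replacement). The data is synthetic.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit; returns the weight vector."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def bagging_predict(X_train, y_train, X_test, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: sample the training set with replacement.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        w = fit_linear(X_train[idx], y_train[idx])
        predictions.append(X_test @ w)
    return np.mean(predictions, axis=0)     # average the models' outputs

# Synthetic data (assumption): linear targets with a little noise.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
y_train = X_train @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
X_test = np.array([[1.0, 1.0], [0.0, 2.0]])
y_pred = bagging_predict(X_train, y_train, X_test)
```

For a classifier the same averaging would be applied to the predicted probabilities of each model.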

Here we have only scratched the surface and there is still more to cover. However it should give you a good starting point to figure out in which direction to dig to improve your model. For more information on these topics I recommend the excellent book on deep learning from MIT (freely accessible online).