Why dropout prevents overfitting in Deep Neural Networks

Vivek Yadav
4 min read · Nov 6, 2016


Here I will illustrate the effectiveness of dropout layers with a simple example. Dropout layers provide a simple way to avoid overfitting (Srivastava et al., JMLR 2014: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf). The primary idea is to randomly drop components (outputs) of a layer of the neural network during training. As a result, at each layer more neurons are forced to learn multiple characteristics of the data.

The true strength of dropout comes when we have multiple layers and many neurons in each layer. For a simple case, if a network has 2 layers and 4 neurons in each layer, then over the training process we are making sure that 4C2 X 4C2 = 36 different models learn the same relation, and during prediction we are taking the average of predictions from those 36 models. The strength comes from the fact that when we have many hidden layers and hidden neurons, we end up with a situation where (NC(N/2))^h models learn the relation between data and target, which has the effect of taking an ensemble over (NC(N/2))^h models. For a 2-layer model with 100 neurons in each layer, this results in a scenario where we are taking the average over (100C50)^2, on the order of 10^58 possible models.
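To make this counting concrete, here is a quick Python sketch (assuming, as in the examples above, a 50% dropout rate so each of the h layers keeps exactly N/2 of its N neurons; the function name is mine):

```python
# Count the distinct "thinned" sub-networks under 50% dropout:
# each of the h layers keeps N/2 of its N neurons, giving (N choose N/2)^h.
from math import comb

def num_submodels(n_neurons, n_layers):
    return comb(n_neurons, n_neurons // 2) ** n_layers

print(num_submodels(4, 2))    # 36, matching the 4C2 X 4C2 example
print(num_submodels(10, 2))   # 63504
print(num_submodels(100, 2))  # about 1.02e58
```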

Simple example 1:

Say you have 2 neurons, whose values are A and B, and we randomly drop 1 of them during training.

So the possible outputs during training after the dropout layer are:

1- 2A (if B is dropped),

2- 2B (if A is dropped).

The factor of 2 comes from scaling. If a neuron is kept with probability p, then we multiply its value by 1/p; here the keep probability is 0.5, so we multiply by 2, and if only a quarter of the neurons were kept we would multiply by 4. This scaling comes from the expected-value argument in probability: it keeps the expected output of the layer the same as it would be without dropout.
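A minimal NumPy sketch of this scaling (often called inverted dropout; note that standard implementations drop each unit independently, so keeping both neurons, or neither, is also possible):

```python
import numpy as np

def dropout_forward(x, keep_prob=0.5, rng=None):
    # Keep each unit independently with probability keep_prob,
    # then scale the survivors by 1/keep_prob so the expected output equals x.
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

a = np.array([1.0, 2.0])   # neurons A and B from the example
print(dropout_forward(a))  # e.g. [2. 0.] (B dropped) or [0. 4.] (A dropped)
```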

The main idea of dropout is to have both neuron A and neuron B learn something about the data, so that the network does not rely on any single neuron. This has the effect of developing redundant representations of the data for prediction. However, we have no idea which neuron is better, so in the testing phase we compute the output from all the neurons and average them. In effect, this means we compute the output without dropout and simply add the results, without scaling.

If we were to average over outputs with dropout, the output would be (2 N_A A + 2 N_B B)/(N_A + N_B), where N_A and N_B represent the number of times A or B was kept. Over a sufficiently large number of samples, we can assume N_A ≈ N_B (as the dropout rate is 50%). Therefore, the final expected value is A + B. Note that N_A/(N_A + N_B) and N_B/(N_A + N_B) are nothing but the observed probabilities p_A and p_B of keeping A or B. Ideally we would run a large number of experiments and average the values. Mathematically,

(2 N_A A + 2 N_B B)/(N_A + N_B) = 2 (N_A/(N_A + N_B)) A + 2 (N_B/(N_A + N_B)) B

= 2 p_A A + 2 p_B B ≈ A + B, since p_A ≈ p_B ≈ 0.5.

So, by using probability theory, we can simply skip dropout and skip scaling the outputs of the neurons during testing. This is also illustrated below.

Illustration of dropout (0.5) in a 1-layer network
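The same averaging argument can be checked numerically. A quick Monte Carlo sketch (A = 1 and B = 2 are illustrative values, not from the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 1.0, 2.0
keep = rng.random((100_000, 2)) < 0.5                # independent 50% keep masks
outputs = 2 * (keep * np.array([A, B])).sum(axis=1)  # scaled (inverted) dropout
print(outputs.mean())  # ~= 3.0, i.e. A + B, the test-time output without scaling
```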

Example 2 (Multiple layers):

Consider the case when there are 2 hidden layers, with neurons A and B in the first and C and D in the second. During training, we want AC, AD, BC and BD all to learn the relation between input and output, so we have 2C1 X 2C1 = 4 models learning the same relation. This number increases exponentially as we add more layers and more neurons. If we had 4 neurons in each of 2 layers, the number of possible models would be 4C2 X 4C2 = 36, so we would be taking the average over 36 different models. The general formula is (NC(N/2))^h. For a 2-layer network with 10 neurons in each layer, this number is (10C5)^2 = 63,504, so we end up taking the average of 63,504 different models during testing, and we train an equally large number of models during training. For a 2-layer model with 100 neurons in each layer, this results in a scenario where we are taking the average over (100C50)^2, on the order of 10^58 possible models. As a result, the tendency to overfit is significantly reduced.
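For completeness, here is a minimal Keras sketch of such a 2-hidden-layer network with 0.5 dropout after each hidden layer (the layer sizes and the 10-feature input are illustrative assumptions, not from the original post):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(100, activation='relu', input_shape=(10,)),  # hidden layer 1
    Dropout(0.5),  # randomly zero half the activations during training
    Dense(100, activation='relu'),                     # hidden layer 2
    Dropout(0.5),
    Dense(1),      # output layer; no dropout after the output
])
model.compile(optimizer='adam', loss='mse')
```

Keras applies the inverted-dropout scaling during training and disables dropout automatically at prediction time, which matches the "use all neurons, no scaling" test-time behavior described above.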

Illustration of dropout when there are 2 hidden layers with 2 neurons each.


Vivek Yadav

Staff Software Engineer at Lockheed Martin-Autonomous System, with research interests in control and machine learning/AI. Lifelong learner with a glassblowing problem.