The Code Compass

Dropout: A Simple Solution to a Complex Problem

Learn about dropout, its variants and how to apply them in your next project

CodeCompass
Apr 10, 2024

Get a list of personally curated and freely accessible ML, NLP, and computer vision resources for FREE on newsletter sign-up.

To read more on this topic, see the references section at the bottom. Consider sharing this post with someone who wants to know more about machine learning.


0. Introduction

Overfitting [10] is a major obstacle in training deep neural networks. It occurs when the network “memorizes” the training data too well, leading to poor performance on unseen examples. Dropout, a simple and elegant technique introduced in 2012, offers a powerful solution to this problem.

Overfitting is a common problem where the model performs well on the training data but fails to generalize to unseen data (the “validation” split).

We delve into the technical details of (the original) Dropout and its variants, explore their successful applications, and provide practical advice for incorporating them into your own neural network architectures.

1. Dropout: Tackling Overfitting Head-On

First Introduced

Dropout was introduced by Hinton et al. in their 2012 paper “Improving neural networks by preventing co-adaptation of feature detectors” [2, 5]. It is one of the de facto regularizers [9] in the neural network world.

Usefulness

Dropout prevents overfitting by randomly deactivating neurons during training, encouraging the network to learn robust features from different subsets of neurons. It operates on the activations, not the weights, and is commonly applied after fully connected layers.

A model with no dropout vs. when dropout is applied.
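As a minimal sketch of this idea (PyTorch assumed; the tensor shapes are arbitrary), dropout is applied to the activations produced by a fully connected layer, while the layer’s weights are left untouched:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

fc = torch.nn.Linear(8, 8)                    # a fully connected layer
x = torch.randn(2, 8)

h = fc(x)                                     # activations produced by the layer
h_train = F.dropout(h, p=0.5, training=True)  # each activation zeroed with prob 0.5,
                                              # survivors scaled by 1 / (1 - 0.5)

print((h_train == 0).float().mean())          # roughly half the activations are zero
print(fc.weight)                              # the weights themselves are never masked
```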

Technical Details

Dropout introduces randomness during training by probabilistically dropping out neurons, effectively creating an ensemble of thinned networks.

During each training iteration, a fraction of the input units (neurons) is set to zero with a specified probability, typically between 0.2 and 0.5. This stochastic process encourages the network to learn robust representations by preventing reliance on specific neurons and promoting feature generalization.

The weight matrix W, without dropout and with dropout. W maps input neurons to output neurons, i.e. y = Wx. Dropped input neurons correspond to entire columns of the matrix being masked out.

During training, to compensate for the deactivated neurons, the remaining activations are scaled up by a factor of 1 / (1 - p), ensuring the expected average activation remains unchanged.
Let’s assume we set p = 0.1 during training. Across many forward passes, each neuron is masked 10% of the time, i.e. on average only 90% of the signal is passed on to the next layer. Scaling the surviving activations by 1/(1 - 0.1) = 1/0.9 offsets this, so the expected signal reaching the next layer is unchanged.
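A minimal sketch of this “inverted dropout” scaling (NumPy; the function name and the constant-activation example are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(activations, p=0.1):
    """Training-time dropout: zero each unit with probability p,
    then scale the survivors by 1 / (1 - p) so the expected
    activation is unchanged."""
    mask = rng.random(activations.shape) >= p   # keep a unit with probability 1 - p
    return activations * mask / (1.0 - p)

a = np.ones(10_000)                             # constant activations, for illustration
out = inverted_dropout(a, p=0.1)

print(out.mean())   # ~1.0: the 1/(1 - p) scaling offsets the ~10% of units that were zeroed
```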

During inference (deployment), dropout behaves as an identity function, i.e. the output is simply the input.
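This behaviour is easy to check with PyTorch’s built-in dropout module (a short sketch, assuming torch.nn.Dropout):

```python
import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.randn(4, 4)

drop.train()                      # training mode: units are zeroed, survivors rescaled by 1/(1 - p)
print(torch.equal(drop(x), x))    # False (with overwhelming probability)

drop.eval()                       # inference mode: dropout is a no-op
print(torch.equal(drop(x), x))    # True: the output is exactly the input
```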

Difference from Other Dropout Variants

Standard dropout sets individual activations to zero rather than the weights themselves, and it is applicable to any layer type.

Effective for regularization [9] and preventing overfitting in various neural network architectures. This technique has been instrumental in the success of numerous deep learning architectures, including the architecture that won the 2012 ImageNet competition, a breakthrough in computer vision (Krizhevsky et al., 2012 [4]).

“We use dropout in the first two fully-connected layers ... Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.” [4]. AlexNet architecture from [4].
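As a rough sketch of what that looks like in code (PyTorch assumed; the layer sizes follow [4], while the exact placement of the dropout layers follows common reimplementations rather than the paper’s original training code):

```python
import torch.nn as nn

# Dropout (p = 0.5) around the first two fully connected layers, as in [4]:
# 4096-unit hidden layers and 1000 ImageNet classes.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),   # flattened convolutional features -> FC1
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),          # FC2
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),          # output layer: no dropout here
)
```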

Practical Advice

Start without any dropout and get the training pipeline working. Once it is in place, a training loss that falls well below the validation loss signals overfitting; that is a good cue to add dropout.

Begin with a dropout rate between 0.2 and 0.5. Monitor model performance closely during training to detect any signs of underfitting or overfitting, and adjust dropout rates accordingly.

Be cautious when applying dropout to small datasets or networks with fewer parameters, as aggressive dropout rates may lead to worse generalization.
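A small sketch of this workflow (PyTorch assumed; make_model, the layer sizes, and the 0.0 → 0.3 bump are purely illustrative):

```python
import torch.nn as nn

def make_model(dropout_p: float = 0.0) -> nn.Module:
    """Same architecture throughout; only the dropout rate changes."""
    return nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=dropout_p),   # a no-op when dropout_p == 0.0
        nn.Linear(256, 10),
    )

# 1) Get the pipeline working with dropout disabled.
model = make_model(dropout_p=0.0)

# 2) If the training loss drops well below the validation loss (overfitting),
#    retrain with a rate in the suggested 0.2-0.5 range and adjust from there.
model = make_model(dropout_p=0.3)
```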



2. Channel Dropout: Spatial Regularization for CNNs
