Stochastic gradient descent

With gradient descent we try to optimise a function that runs over the entire dataset. \(f\) represents the “cost” over the entire dataset.

When working with big datasets this yield to complex function optimisation and slow computation time.

This is also a problem when dealing with streaming data as we need to wait for the stream to end (or to select a big enough batch of data from the stream) to run gradient descent.

Stochastic gradient descent is a variation of gradient descent where gradient descent is run over every single data point. For each entry in the dataset the parameters are updated.

In practice stochastic gradient descent gets “close” to the minimum much faster than regular gradient descent. However it may never converge and keep oscillating near the minimum. This is usually not a problem as long as it remains “close enough” to the minimum.

Stochastic gradient descent may be computed over a “mini-batch” of samples from the dataset. The mini-batch size typically varies between 1 and a few hundreds samples. The idea remains the same: save computation time while converging to the same minimum as the regular gradient descent run over the full dataset.

In particular it is worth noting that the computation time for one SGD iteration no longer depends on the dataset size but only on the mini-batch size.

For these reasons stochastic gradient descent is often preferred in machine learning algorithms, especially in deep learning where the datasets are usually big.