k-means clustering

k-means is a clustering algorithm which divides space into k different clusters.

Each cluster is represented by its centre of mass (i.e. barycentre) and data points are assigned to the cluster with the nearest barycentre.


The learning algorithm starts by choosing k random points. Each of these is the centre of mass of a cluster. Then we iterate over a sequence of assignation phases and an update phases until we reach stability (i.e. the clusters’ barycentres stop moving).

Assignation phase

During the assignation phase we iterate through all the data points and assign each points to the nearest centre of mass.

Refinement phase

During the refinement phase we recompute the clusters’ barycentres using all the points belonging to the cluster.

By Agor153 - Own work, CC BY-SA 3.0
By Agor153 – Own work, CC BY-SA 3.0

After a few iterations most of the data points stay in the same cluster and there and recomputing all these distances all over again seems useless.

We suffer the same problem when adding new entries in the dataset. We probably don’t want to recompute all the distances just to add a few points to the dataset.

From a performance perspective, the prediction phase is quite fast as we only need to compute k distances for each point we want to predict. Insertion is much slower as we potentially need to re-iterate over the learning phase as the centres of mass might move.