Categories
Machine learning

k-Nearest Neighbours

The k-Nearest Neighbours is based on a simple idea: similar points tend to have similar outcomes.

Therefore the idea is to memorise all the points in the dataset. The prediction for a new entry is made by finding the closest point in the dataset. Then the prediction for the new entry is simply the same outcome as the value associated to its closest point.

If 2 points are close enough so should be their outcomes.

The name k-NN comes from the fact that you can look for the k closest points and compute (e.g. average) the outcome of the new point from the outcomes of the k-nearest points.

This algorithm is quite simple however as the dataset grows the computation power increases and becomes particularly difficult to handle for large dataset.

The naive and brute force approach is to compute the distance between the new entry and every points in the dataset and keep only the k closest points.

Another approach is to use the points in the dataset to split the space into different regions. Then for each new entry we need to determine in which region it belongs to and from there we can find the k nearest points.

In this approach the points of the dataset are stored in a kD-tree (k-dimensional tree) and each node in the tree splits the space in 2.

By KiwiSunset at the English language Wikipedia, CC BY-SA 3.0
By KiwiSunset at the English language Wikipedia, CC BY-SA 3.0

This is computationally better than the brute-force approach but still performs poorly as k grows bigger (k > 20).

To overcome this limitation the ball-tree algorithm has been developed. The idea is to divide the space into nested hyper-spheres. The invariant is that every point in the tree  lies within the hypersphere of its parent nodes. The centroid radius and a single distance calculation between a new entry and a node centroid is sufficient to determine a lower and upper bound on the distance between the new entry and every point within the node’s hypersphere.

The kNN algorithm is trivial to understand and yield to very good results on problems where similar entries yield similar outcomes. However it is not trivial to implement over large dataset as the computation power needed depends directly of the number of entries in the dataset.

One more issue is that as the dimensionality increases more and more points are needed to split the space into a sufficient number of regions to keep accurate predictions. This is generally not the case for high dimensions as the number of required points grows exponentially.