PCA: Principal Component Analysis

PCA stands for Principal Component Analysis. It is a mathematical technique that I am not going to explain in great detail here, as there are already plenty of books on the subject. Instead, I would like to give a practical feel for what it does and when to use it.

The idea behind PCA is that we represent the data using a different set of axes. For example, let’s imagine that we are dealing with accelerometer data from a smart watch sensor. This data comes in the form of (x, y, z) coordinates computed every 20 ms.

Depending on how you move your arm, the (x, y, z) values change over time. In a 10 s interval, 500 (x, y, z) coordinates are computed, and each axis holds some of the variation in the data.

With PCA we are going to use different axes, e.g. (p, q, r). However, (p, q, r) are not chosen randomly. Instead, we want p to be the axis along which we observe the most variation in the data, i.e. the axis that holds the most information.

Then q is the axis orthogonal to p along which we observe the most remaining variation, and so forth. This means that r is the axis containing the least variation. In fact, we can probably reduce the number of dimensions (i.e. keep only p and q) and lose very little information.
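
To make the (p, q, r) idea concrete, here is a minimal sketch that finds the new axes by hand using only NumPy. The synthetic samples and all variable names are made up for illustration; they stand in for real accelerometer readings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 10 s of accelerometer data: 500 (x, y, z) samples.
# Most of the motion happens along one hidden direction, plus a little noise.
t = rng.normal(size=500)
direction = np.array([2.0, 1.0, 0.5])
samples = np.outer(t, direction) + 0.1 * rng.normal(size=(500, 3))

# Centre the data, then diagonalise its covariance matrix.
centred = samples - samples.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centred, rowvar=False))

# eigh returns eigenvalues in ascending order; reorder so that p comes first.
order = np.argsort(eigenvalues)[::-1]
variances = eigenvalues[order]
axes = eigenvectors[:, order]  # columns are the p, q and r axes

# Share of the total variation carried by each new axis.
variance_share = variances / variances.sum()
print(variance_share)
```

The columns of `axes` are orthogonal to each other, and `variance_share` shows how little is lost when the last axis is dropped.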

By Nicoguaro - Own work, CC BY 4.0

If you look at this diagram, you can see that if we project all the points onto the big arrow we keep quite a lot of information and still get a sense of how the data is distributed.
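
The projection can be sketched in a few lines of NumPy. The point cloud below is randomly generated and only resembles the diagram in spirit; `big_arrow` and the other names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D cloud stretched along a diagonal.
cov = np.array([[3.0, 2.0], [2.0, 2.0]])
points = rng.multivariate_normal(mean=[1.0, 3.0], cov=cov, size=1000)

mean = points.mean(axis=0)
centred = points - mean

# Rows of vt are the principal directions, from most to least variation.
_, _, vt = np.linalg.svd(centred, full_matrices=False)
big_arrow = vt[0]

# One number per point: its position along the big arrow.
coords = centred @ big_arrow

# Map the 1D coordinates back into 2D to see what was kept.
flattened = mean + np.outer(coords, big_arrow)

# Fraction of the total variation that survives the projection.
kept = coords.var() / centred.var(axis=0).sum()
print(f"share of variation kept: {kept:.2f}")
```

Each point is now described by a single number in `coords`, and `flattened` shows where those points land on the arrow.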

Similarly, let’s consider an image encoded in RGB. Each channel holds some information, but when converted to HSL we still have the same image (just stored as different numbers on the computer). If we drop the saturation, we can still see the image without colour (in grayscale), so we can still tell what’s in the image. With RGB, each channel may hold different data, so if you drop one channel you might lose quite a lot of information (e.g. if you remove the blue channel from a picture of the sky or the sea).
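
The colour analogy can be tried on a single pixel with Python's standard `colorsys` module. The pixel value is an arbitrary example, and note that `colorsys` orders the components as hue, lightness, saturation.

```python
import colorsys

# A hypothetical sky-blue pixel, with channels in the 0..1 range.
r, g, b = 0.35, 0.60, 0.90

# Same pixel, different axes.
hue, lightness, saturation = colorsys.rgb_to_hls(r, g, b)

# Drop the saturation: every channel collapses to the lightness value,
# which gives a grey pixel that still reflects how bright the original was.
grey = colorsys.hls_to_rgb(hue, lightness, 0.0)

# Drop the blue channel instead: most of what made this pixel "sky" is gone.
no_blue = (r, g, 0.0)

print(grey)
print(no_blue)
```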

Now that we have a sense of what PCA is, let’s see how to use it. It turns out it’s just a few lines of Python code.

import numpy as np
from sklearn.decomposition import PCA

# load x,y,z data from csv file
data = np.genfromtxt('data.csv', delimiter=',')

# keep only the first principal component
pca = PCA(n_components=1)
p = pca.fit_transform(data)  # shape: (n_samples, 1)

# append the principal component as a fourth column: x, y, z, p
data = np.append(data, p, axis=1)

# save the new csv file
np.savetxt('datapc.csv', data, delimiter=',')
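
Before dropping components, it can help to check how much of the variation each one carries. scikit-learn exposes this through the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses randomly generated data as a stand-in for `data.csv`, so the numbers it prints are only illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for the x, y, z columns of data.csv:
# 500 samples with very different spreads along each axis.
data = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])

# Fit all three components to see how the variation is distributed.
pca = PCA(n_components=3)
pca.fit(data)

# Fraction of the total variance carried by each component, largest first.
ratio = pca.explained_variance_ratio_
print(ratio)

# Running total: how much is kept when using the first 1, 2 or 3 components.
print(np.cumsum(ratio))
```

If the first entry of the running total is already close to 1, keeping only the first component loses very little.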