When you train several models over a dataset you need a way to compare their performance and choose the one that best suits your needs.
As we will see, there are different ways to compare the results and then pick the best one.
Let’s start with what scores we can get out of the training process. Assuming we are running a classification model with 2 possible outcomes, the model performance can be summarised with 4 figures known as the confusion matrix.
These 4 figures are:
- TP – True positives: The number of samples correctly marked as positive
- TN – True negatives: The number of samples correctly marked as negative
- FP – False positives: The number of samples incorrectly marked as positive (aka type 1 error)
- FN – False negatives: The number of samples incorrectly marked as negative (aka type 2 error)
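As a minimal sketch, the four figures can be counted directly from a list of actual labels and a list of predicted labels. The function name `confusion_counts`, the sample data, and the convention that 1 means "positive" are assumptions made for this illustration.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # correctly marked as positive
        elif p == 0 and a == 0:
            tn += 1  # correctly marked as negative
        elif p == 1 and a == 0:
            fp += 1  # incorrectly marked as positive (type 1 error)
        else:
            fn += 1  # incorrectly marked as negative (type 2 error)
    return tp, tn, fp, fn


# Made-up labels, purely for illustration.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

The four counts always add up to the number of samples, which is a handy sanity check.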
Continue reading “Confusion matrix”
The k-Nearest Neighbours algorithm is based on a simple idea: similar points tend to have similar outcomes.
Therefore the idea is to memorise all the points in the dataset. The prediction for a new entry is made by finding the closest point in the dataset. Then the prediction for the new entry is simply the outcome associated with its closest point.
If 2 points are close enough, so should be their outcomes.
The name k-NN comes from the fact that you can look for the k closest points and compute (e.g. average) the outcome of the new point from the outcomes of the k-nearest points.
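Here is a small sketch of that idea for classification, where the outcomes of the k nearest points are combined by majority vote (for a numeric outcome you would average instead). The function name `knn_predict`, the use of Euclidean distance, and the sample points are assumptions for this example.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # "Training" is just memorising the points; all the work happens here.
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two made-up clusters, purely for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (6.0, 5.0)))
```

With `k=1` this reduces to the "closest point" rule described above.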
Continue reading “k-Nearest Neighbours”
In machine learning it is pretty obvious to me that you need to split your dataset into 2 parts:
- a training set that you can use to train your model and find optimal parameters
- a test set that you can use to test your trained model and see how well it generalises.
It is important that the test data is never used during the training phase. Using “unseen” data is what allows us to test how well our model generalises. It is how you find out whether your model overfits the training data.
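A minimal sketch of such a split, assuming the rows can simply be shuffled and cut (which is not appropriate for time-ordered data). The function name `split_dataset`, the 20% test ratio, and the fixed seed are choices made for this example.

```python
import random


def split_dataset(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and return (training set, test set)."""
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for real rows
train, test = split_dataset(data)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, so nothing in the test set has been seen during training.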
Continue reading “How to split a dataset”
Most machine learning techniques follow a similar strategy:
- Get the best possible model on the training dataset
- Generalise by testing the model on the test dataset
The test dataset consists of data that are never used during training, which allows us to test how the algorithm will perform on “not seen before” data.
Continue reading “Weight decay regularisation”
With gradient descent we try to optimise a function \(f\) that represents the “cost” over the entire dataset.
When working with big datasets this leads to a complex function to optimise and slow computation.
This is also a problem when dealing with streaming data as we need to wait for the stream to end (or to select a big enough batch of data from the stream) to run gradient descent.
Stochastic gradient descent is a variation of gradient descent where gradient descent is run over every single data point. For each entry in the dataset the parameters are updated.
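To make the per-entry update concrete, here is a sketch that fits a single weight `w` in the model `y = w * x` using a squared-error cost. The model, the data, the learning rate, and the number of passes are all assumptions chosen for this illustration.

```python
import random

# Made-up data that follows y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0  # parameter to learn
learning_rate = 0.01
rng = random.Random(0)  # seeded so the run is reproducible

for epoch in range(50):
    indices = list(range(len(xs)))
    rng.shuffle(indices)  # visit the entries in a random order
    for i in indices:
        error = w * xs[i] - ys[i]
        gradient = 2 * error * xs[i]  # derivative of (w*x - y)^2 w.r.t. w
        w -= learning_rate * gradient  # update after every single entry

print(round(w, 4))
```

Note that the parameter moves after each data point rather than once per pass over the dataset, which is also why this approach suits streaming data.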
Continue reading “Stochastic gradient descent”
If you want to predict something from your data, you need to put a strategy in place. I mean you need a way to measure how good your predictions are … and then try to make the best ones.
This is usually done by taking some data for which you already know the outcome and then measuring the difference between what your system predicts and the actual outcome.
This difference is often referred to as the “cost function”. Once we have such a function our machine learning problem comes down to minimising our cost function.
One very simple way to find the minimum value(s) is called gradient descent. The basic idea is to make small steps in the direction opposite to the gradient (the derivative of the function), that is downhill, until we reach a minimum.
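As a toy sketch, here is gradient descent on the one-dimensional cost \(f(x) = (x - 3)^2\), whose minimum is at \(x = 3\). The cost function, the starting point, the step size, and the number of steps are assumptions picked for this example.

```python
def cost(x):
    return (x - 3) ** 2


def gradient(x):
    return 2 * (x - 3)  # derivative of the cost


x = 0.0  # arbitrary starting point
learning_rate = 0.1  # size of each small step

for step in range(100):
    x -= learning_rate * gradient(x)  # step downhill, against the gradient

print(round(x, 4), round(cost(x), 4))
```

If the learning rate is too large the steps overshoot the minimum and may diverge, and if it is too small the descent takes a long time.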
Continue reading “Gradient descent”
PCA stands for Principal Component Analysis. It is a mathematical concept which I am not going to explain in great detail here as there are already plenty of books on the subject. Rather I would like to give a practical feeling of what it does and when to use it.
The idea behind PCA is that we represent the data using different axes. For example let’s imagine that we are dealing with accelerometer data from a smart watch sensor. This data comes in the form of (x, y, z) coordinates computed every 20ms.
Depending on how you move your arm the (x,y,z) values will change over time. In a 10s interval 500 (x, y, z) coordinates are computed and each axis holds some variations of data.
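To give a feeling for what "different axes" means, the sketch below generates 500 synthetic (x, y, z) samples that mostly vary along the x = y diagonal and then looks for the single direction that captures most of the variation. The data is made up, and power iteration on the covariance matrix is a simple technique chosen for this illustration; libraries typically compute all components with an eigendecomposition or SVD.

```python
import math
import random


def first_principal_component(samples, iterations=100):
    """Approximate the direction of greatest variance using power iteration."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    centered = [[s[d] - means[d] for d in range(dims)] for s in samples]
    # Sample covariance matrix (dims x dims).
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    vec = [1.0] * dims  # arbitrary starting direction
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]  # keep it a unit vector
    return vec


# Synthetic "accelerometer" samples: the motion is mostly along x = y.
rng = random.Random(0)
samples = []
for _ in range(500):
    t = rng.gauss(0, 1)
    samples.append((t + rng.gauss(0, 0.05), t + rng.gauss(0, 0.05), rng.gauss(0, 0.05)))

axis = first_principal_component(samples)
print([round(v, 2) for v in axis])
```

The resulting axis points roughly along the diagonal of the x and y axes with almost no z component, so a single new coordinate along it describes most of the movement.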
Continue reading “PCA: Principal Component Analysis”
If you ever want to get serious about data science, sooner or later you’re going to get your hands on some Python code.
If you are like me – coming from the JVM world – you probably think “yeah, Python … should be cool!!”. Everybody is using it, the syntax looks concise, and the machine learning ecosystem is pretty dense in Python: theano, neon, scikit-learn, …
So yeah, let’s get started… and if you’ve never written any Python code before, I’ll tell you it’s not going to be that fun.
Continue reading “Python … wtf !!?”
“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
As you can see there are 2 worlds out there: the world of math and statistics and the world of software engineering.
Each of these worlds thinks it is better than the other (which is true in a sense) but the truth is that they also need each other to achieve good results.
One cannot harvest huge amounts of data without a proper system, and the other, who knows how to build such systems, doesn’t know how to extract valuable information from so much data.
Coming from a software engineering background I intend to publish some articles as I go along through this journey.