## TensorFlow introduction

Following my previous post on neural networks, I thought it would be nice to see how to implement these concepts with TensorFlow.

TensorFlow is a new library developed by Google. It is aimed at building fast and efficient machine learning pipelines.

It is based on the computation graph that we discussed earlier.

It provides C++ and Python interfaces and can run on CPU or GPU (Linux only).

## Neural Network

Machine learning applications become more widespread every day in many domains. One of today’s most powerful techniques is the neural network. It is employed in many applications such as image recognition, speech analysis and translation, self-driving cars, etc.

In fact such learning algorithms have been known for decades. They have only recently become mainstream, supported by the increase in computation power (GPU) and fast data access (SSD), which allow us to run these algorithms over billions of samples.

Neural networks can represent a wide range of complex functions, making them an algorithm of choice in many domains. However training them is complex, and it’s only the recent increase in computation power and fast data access that has made it possible to exploit the full potential of this technique.

## k-means clustering

k-means is a clustering algorithm which divides space into k different clusters.

Each cluster is represented by its centre of mass (i.e. barycentre) and data points are assigned to the cluster with the nearest barycentre.

##### Algorithm

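The learning algorithm starts by choosing k random points. Each of these is the centre of mass of a cluster. Then we alternate between an assignment phase (each data point is assigned to the cluster with the nearest barycentre) and an update phase (each barycentre is recomputed from the points assigned to it) until we reach stability (i.e. the clusters’ barycentres stop moving).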
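
The two phases can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the function names (`k_means`, `centre_of_mass`, `squared_distance`), the `max_iterations` safety limit and the choice of picking the k starting points among the data points are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centre_of_mass(points):
    """Barycentre of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Assumption: the k initial centres are drawn among the data points.
    centres = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment phase: each point joins the cluster with the nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centres[i]))
            clusters[nearest].append(p)
        # Update phase: each centre moves to the barycentre of its cluster
        # (an empty cluster keeps its previous centre).
        new_centres = [
            centre_of_mass(c) if c else centres[i] for i, c in enumerate(clusters)
        ]
        if new_centres == centres:  # stability: the barycentres stopped moving
            break
        centres = new_centres
    return centres, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, clusters = k_means(points, k=2)
```

On this toy dataset the two groups are far apart, so the algorithm separates them whatever the starting points are. On real data the result depends on the initial centres, which is why k-means is often run several times with different seeds.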

## Confusion matrix

When you train several models over a dataset you need a way to compare their performance and choose the one that best suits your needs.

As we will see there are different ways to compare the results and then pick the best one.

Let’s start with what scores we can get out of the training process. Assuming we are running a classification model with 2 possible outcomes, then the model performance can be summarised with 4 figures known as the confusion matrix.

These 4 figures are:

• TP – True positives: the number of samples correctly marked as positive
• TN – True negatives: the number of samples correctly marked as negative
• FP – False positives: the number of samples incorrectly marked as positive (aka type 1 error)
• FN – False negatives: the number of samples incorrectly marked as negative (aka type 2 error)
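
As an illustration, these 4 figures can be counted directly from the actual and predicted labels. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative); the function name `confusion_matrix` and the sample data are made up for this example.

```python
def confusion_matrix(actual, predicted):
    """Count TP, TN, FP and FN for a 2-class problem (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1  # predicted positive, actually positive
        elif not p and not a:
            tn += 1  # predicted negative, actually negative
        elif p and not a:
            fp += 1  # predicted positive, actually negative (type 1 error)
        else:
            fn += 1  # predicted negative, actually positive (type 2 error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_matrix(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

The 4 counts always add up to the number of samples, which is a handy sanity check.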

## k-Nearest Neighbours

The k-Nearest Neighbours is based on a simple idea: similar points tend to have similar outcomes.

Therefore the idea is to memorise all the points in the dataset. The prediction for a new entry is made by finding the closest point in the dataset: the new entry simply gets the same outcome as the value associated with its closest point.

If 2 points are close enough, their outcomes should be close too.

The name k-NN comes from the fact that you can look for the k closest points and compute (e.g. average) the outcome of the new point from the outcomes of the k-nearest points.
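
Here is a minimal sketch of the idea, assuming the dataset is a list of `(point, outcome)` pairs and that closeness is measured with the Euclidean distance. Averaging the outcomes follows the description above; the majority vote used for class labels is an assumption added for this example, and all function names are hypothetical.

```python
import math
from collections import Counter


def k_nearest(dataset, query, k):
    """Return the k (point, outcome) pairs whose points are closest to query."""
    return sorted(dataset, key=lambda item: math.dist(item[0], query))[:k]


def knn_regress(dataset, query, k=1):
    """Predict a numeric outcome by averaging the outcomes of the k nearest points."""
    neighbours = k_nearest(dataset, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)


def knn_classify(dataset, query, k=1):
    """Predict a label by majority vote among the k nearest points (assumption)."""
    votes = Counter(label for _, label in k_nearest(dataset, query, k))
    return votes.most_common(1)[0][0]


labelled = [((0, 0), "a"), ((0, 1), "a"),
            ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((10.0,), 100.0)]

label = knn_classify(labelled, (0.2, 0.1), k=3)
value = knn_regress(numeric, (1.4,), k=2)
```

Note that there is no training step: all the work happens at prediction time, where every stored point is compared with the new entry. This becomes slow on large datasets.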

## How to split a dataset

In machine learning it is pretty obvious to me that you need to split your dataset into 2 parts:

• a training set that you can use to train your model and find optimal parameters
• a test set that you can use to test your trained model and see how well it generalises.

It is important that the test data is never used during the training phase. Using “unseen” data is what allows us to test how well our model generalises. It is how you can check that your model doesn’t overfit.
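
A simple way to get the two parts is to shuffle the samples and cut the list in two. The sketch below is only an illustration: the function name `split_dataset` and the 20% test fraction are assumptions, not a recommendation.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))
train, test = split_dataset(samples)
```

Shuffling first matters when the dataset is ordered (e.g. by date or by class), otherwise the test set might not look like the training set at all.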

## Weight decay regularisation

Most machine learning techniques follow a similar strategy:

2. Generalise by testing the model on the test dataset

The test dataset consists of data that is never used during training. It lets us test how the algorithm will perform over “not seen before” data.

With gradient descent we try to optimise a function $$f$$ that runs over the entire dataset: $$f$$ represents the “cost” over all the samples.

When working with big datasets this leads to the optimisation of a complex function and to slow computation times.

This is also a problem when dealing with streaming data as we need to wait for the stream to end (or to select a big enough batch of data from the stream) to run gradient descent.

Stochastic gradient descent is a variation of gradient descent where each update is computed from a single data point instead of the entire dataset: the parameters are updated for each entry in the dataset.
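
To make the per-sample update concrete, here is a minimal sketch that fits a line $$y = w x + b$$ with stochastic gradient descent. The model, the squared-error cost per sample, the learning rate, the number of passes and the function name `sgd_linear_fit` are all assumptions chosen for this example.

```python
import random


def sgd_linear_fit(samples, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w * x + b by updating (w, b) after every single sample."""
    rng = random.Random(seed)
    data = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order on each pass
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of the per-sample cost 0.5 * error**2 w.r.t. w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Toy data generated from y = 2x + 1, without noise.
samples = [(float(x), 2.0 * x + 1.0) for x in range(5)]
w, b = sgd_linear_fit(samples)
```

Each update only needs one sample, so the same loop works on a stream: there is no need to wait for the whole dataset before starting to learn. The price to pay is a noisier path towards the minimum, and a learning rate that is too large can make the updates diverge.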