Nd4j – Numpy for the JVM

I have spent years programming in Java, and one thing (among others) that I found frustrating is the lack of mathematical libraries (not to mention machine learning frameworks) on the JVM.

In fact, if you're at all interested in machine learning you'll notice that all the cool stuff is written in C++ (for performance reasons) and most often comes with a Python wrapper (because who wants to program in C++ anyway?).
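
As a teaser, here is roughly what numpy-style array code looks like with ND4J. This is a minimal sketch of my own (not code from the full post), and it assumes an ND4J backend such as nd4j-native-platform is on the classpath:

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class Nd4jTeaser {
        public static void main(String[] args) {
            // a 2x3 matrix, similar to np.array([[1, 2, 3], [4, 5, 6]])
            INDArray a = Nd4j.create(new double[][]{{1, 2, 3}, {4, 5, 6}});
            // element-wise operations, similar to a + 1 and a * 2 in numpy
            INDArray b = a.add(1);
            INDArray c = a.mul(2);
            // matrix product with the transpose: (2x3) x (3x2) = 2x2
            INDArray d = b.mmul(c.transpose());
            System.out.println(d);
        }
    }
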
Continue reading “Nd4j – Numpy for the JVM”

TF-IDF

The idea for this blog post came after finishing the TF-IDF lab of the edX Spark specialisation courses.

edX - CS110x - Big Data Analysis with Spark

In this course the labs follow a step-by-step approach where you write a few lines of code at every step. The lab is very detailed and easy to follow. However, I found that by focusing on a single step at a time I was missing the big picture of what happens overall.
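
To keep that big picture in mind: TF-IDF weighs a term t in a document d as tf(t, d) * idf(t), where idf(t) grows when the term appears in few documents (a common choice is idf(t) = log(N / df(t)), with N the number of documents and df(t) the number of documents containing t). The lab itself is written in PySpark; purely as an illustration of the formula, here is a plain-Java sketch over a toy in-memory corpus (my own example, not the lab's code):

    import java.util.*;

    public class TfIdfSketch {
        public static void main(String[] args) {
            List<List<String>> corpus = Arrays.asList(
                    Arrays.asList("spark", "makes", "big", "data", "simple"),
                    Arrays.asList("big", "data", "needs", "big", "clusters"),
                    Arrays.asList("spark", "runs", "on", "clusters"));

            // document frequency: in how many documents does each term appear?
            Map<String, Integer> df = new HashMap<>();
            for (List<String> doc : corpus) {
                for (String term : new HashSet<>(doc)) {
                    df.merge(term, 1, Integer::sum);
                }
            }

            int n = corpus.size();
            List<String> doc = corpus.get(0);
            // term frequency in the first document: count / document length
            Map<String, Long> counts = new HashMap<>();
            for (String term : doc) counts.merge(term, 1L, Long::sum);

            for (Map.Entry<String, Long> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) n / df.get(e.getKey()));
                System.out.printf("%s -> tf-idf = %.4f%n", e.getKey(), tf * idf);
            }
        }
    }
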
Continue reading “TF-IDF”

Neural network hardware considerations

This post presents some principles to consider when choosing the hardware that will run neural network computations, either for training models or for making predictions with an existing model (i.e. inference).

Need more cores

CPUs are usually considered not to perform well enough when it comes to neural network computations, and they are now outperformed by GPUs.

CPU cores run at higher clock speeds than GPU cores, but they are not designed to perform many parallel operations simultaneously, which is precisely what GPUs are made for.
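
To see why this matters, remember that most of a neural network's forward and backward passes boil down to matrix multiplications, and every cell of the result can be computed independently of the others. The toy Java sketch below (my illustration, not code from the post) makes that independence explicit by computing the rows in parallel; a GPU pushes the same idea across thousands of lightweight cores:

    import java.util.stream.IntStream;

    public class ParallelMatMul {
        // C = A x B, where c[i][j] depends only on row i of A and column j of B
        static double[][] multiply(double[][] a, double[][] b) {
            int n = a.length, m = b[0].length, k = b.length;
            double[][] c = new double[n][m];
            // rows of the result are independent, so they can be computed in parallel
            IntStream.range(0, n).parallel().forEach(i -> {
                for (int j = 0; j < m; j++) {
                    double sum = 0;
                    for (int p = 0; p < k; p++) {
                        sum += a[i][p] * b[p][j];
                    }
                    c[i][j] = sum;
                }
            });
            return c;
        }

        public static void main(String[] args) {
            double[][] a = {{1, 2}, {3, 4}};
            double[][] b = {{5, 6}, {7, 8}};
            double[][] c = multiply(a, b);
            System.out.println(c[0][0] + " " + c[0][1]); // 19.0 22.0
            System.out.println(c[1][0] + " " + c[1][1]); // 43.0 50.0
        }
    }
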
Continue reading “Neural network hardware considerations”

Kafka streams

Stream computing is one of the hot topics at the moment. It's not just hype: it is a more general abstraction that unifies classical request/response processing with batch processing.

Stream paradigm

Request/response is a 1-1 scheme: one request gives one response. Batch processing, on the other hand, is an all-all scheme: all the requests are processed at once and all the responses are given back together.

Stream processing lies in between: some requests give some responses. Depending on how you configure the stream processing, you end up closer to one end or the other.
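
Kafka Streams embodies this paradigm: records are processed one by one as they arrive on a topic, while grouping, windowing and aggregation let you slide toward the batch end of the spectrum. To give a flavour of the API, here is a minimal word-count-style topology of my own (not code from the post), assuming a local broker on localhost:9092 and hypothetical "text-input" and "word-counts" topics:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;

    public class WordCountTopology {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input");
            lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                 .groupBy((key, word) -> word)   // re-key each record by the word itself
                 .count()                        // continuously updated count per word
                 .toStream()
                 .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            // records are processed one by one as they arrive on the input topic
            new KafkaStreams(builder.build(), props).start();
        }
    }
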
Continue reading “Kafka streams”

Introduction to Alluxio

Continuing my tour of the Spark ecosystem, today's focus is on Alluxio, a distributed storage system that integrates nicely with many compute engines, including Spark.
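
To give an idea of what that integration looks like, a Spark job can read data through Alluxio simply by using an alluxio:// URI. The sketch below is my own illustration (not code from the post) and assumes the Alluxio client jar is on Spark's classpath, an Alluxio master running locally on its default port 19998, and a hypothetical /data/input.txt file already stored in Alluxio:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.SparkSession;

    public class AlluxioRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("alluxio-read-example")
                    .getOrCreate();

            // the alluxio:// scheme makes Spark read through Alluxio
            // instead of going directly to HDFS or the local file system
            Dataset<String> lines = spark.read().textFile("alluxio://localhost:19998/data/input.txt");
            System.out.println("number of lines: " + lines.count());

            spark.stop();
        }
    }
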

What is Alluxio?

The official definition of Alluxio is (or at least that's how one of its authors presents it):

Alluxio is an open source memory speed virtual distributed storage

Let’s see what each of these terms actually means:
Continue reading “Introduction to Alluxio”

Neural network implementation guidelines

Today, to conclude my series on neural networks, I am going to write down some guidelines and a methodology for developing, testing and debugging a neural network.

As we will see (or as you may have already experienced), implementing a neural network is tricky, and there is often a thin line between failure and success: between something that works great and something that makes absurd predictions.

The number of parameters we need to adjust is huge: from choosing the right algorithm, to tuning the model hyper-parameters, to improving the data, and so on.

In fact we need a good methodology and a solid understanding of how our model works and of the impact of each of its parameters.
Continue reading “Neural network implementation guidelines”

Recurrent Neural Network

After introducing convolutional neural networks, I continue my series on neural networks with another kind of specialised network: the recurrent neural network.

Principle

The recurrent neural network is a kind of neural network that specialises in sequential input data.

With a traditional neural network, sequential data (e.g. a time series) is split into fixed-size windows, and only the data points inside the window can influence the outcome at time t.

With a recurrent neural network, the network can remember data points much further in the past than a typical window size allows.
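
Concretely, this memory comes from a hidden state that is carried over from one time step to the next, h_t = tanh(W_x * x_t + W_h * h_{t-1} + b), so the output at time t can depend, indirectly, on every input seen so far. Here is a toy Java sketch of that recurrence (scalar inputs and made-up weights of my own, just to show the unrolling over time):

    public class SimpleRnnCell {
        public static void main(String[] args) {
            // toy scalar parameters (in a real network these are learned matrices)
            double wx = 0.5;   // weight applied to the current input x_t
            double wh = 0.8;   // weight applied to the previous hidden state h_{t-1}
            double b = 0.1;    // bias

            double[] inputs = {1.0, 0.5, -0.3, 0.9};  // a short input sequence
            double h = 0.0;                           // initial hidden state

            for (int t = 0; t < inputs.length; t++) {
                // h_t = tanh(wx * x_t + wh * h_{t-1} + b): the state carries information
                // from all previous time steps, not just a fixed-size window
                h = Math.tanh(wx * inputs[t] + wh * h + b);
                System.out.printf("t=%d  x=%.2f  h=%.4f%n", t, inputs[t], h);
            }
        }
    }
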
Continue reading “Recurrent Neural Network”