Categories
Computing

Kafka streams

Stream computing is one of the hot topic at the moment. It’s not just hype but actually a more generic abstraction that unifies the classical request/response processing with batch processing.

Stream paradigm

The request/response is a 1-1 scheme: 1 request gives 1 response. On the other hand the batch processing is an all-all scheme: all requests are processed at once and gives all response back.

Stream processing lies in between where some requests gives some responses. Depending on how you configure the stream processing you lie closer to one end than the other.

More over stream processing fits nicely with the micro service architecture.

Kafka is able to capture live data in real time at large scale and therefore acts a message bus for all the micro-service to communicate.

However as soon as you enter this architecture you realise that processing data from Kafka is not a trivial thing. The problems are in fact inherent to any distributed system where one have to deal with things such as:

  • message ordering
  • data and process partitioning
  • scalability
  • fault-tolerance
  • data reprocessing
  • state

Actually there is an alternative: use a stream processing framework There are many stream processing frameworks around – Apache Spark, Apache Samza, Apache Flink, …  (to be honest I am not familiar with any of them but Spark)

The aim of a stream processing framework is to manage a distributed cluster for you and run your application code on the cluster. To achieve this goal the framework (at least Spark streaming – I haven’t check the other ones) imposes quite heavy constraints on the way you must package your application. This is so in order for the framework to be able to deploy your application over the cluster.

I am not a big fan of running my application code on Spark streaming. I think it’s great to perform analytics – even real-time analytics – but I don’t feel confortable running my core business code with it.

For this matter I’d rather rely on the actor model and the Akka implementation. However in this case we still have to handle these issues ourselves (and yes AkkaStream doesn’t help with that matter) but the actor model makes it easier to solve it.

And this is where Kafka Streams kicks in. It aims at leveraging the Kafka properties to make distributed computing easier while remaining a simple library to be included in the application code.

Let’s see how Kafka Streams solves these problems:

  • message ordering: Kafka maintains a sort of append only log where it stores all the messages, Each message  has a sequence id also known as its offset. The offset is used to indicate the position of a message in the log. Kafka streams uses these message offsets to maintain ordering.
  • partitioning: Kafka splits a topic into partitions and each partition is replicated among different brokers. The partitioning allows to spread the load and replication makes the application fault-tolerant (if a broker is down the data are still available). That’s good for data partitioning but we also need to distribute the processes in a similar way. Kafka Streams uses the processor topology that relies on Kafka group management. This is the same group management that is used by the Kafka consumer to distribute load evenly among brokers (This work is mainly managed by the brokers).
  • Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built-in as it redistributes the workload among remaining live broker instances.
  • State management: Kafka streams provides a local storage backed up by a kafka change-log topic which uses log compaction (keeps only latest value for a given key).Kafka log compaction
  • Reprocessing: When starting a new version of the app, we can reprocess the logs from the start to compute new state then redirect the traffic the new instance and shutdown old application.
  • Time management: “Stream data is never complete and can always arrive out-of-order” therefore one must distinguish the
    event time vs processed time and handle it correctly.

State management in distributed environment is quite tricky and using a Kafka topic to maintain the state is an elegant solution. Using this change-log topic Kafka Stream is able to maintain a “table view” of the application state. In fact if we consider how a database maintains state inside a table using an internal commit log it seems logical to build a “table view” backed by the kafka change-log topic.

Stream / table duality

Kafka Streams should even provide a queryable state backed by this “table view” anytime soon.

Kafka Streams provides 2 API for the developer:

  • processor API (based on callback)
  • higher level DSL (named kStreams) which provide a more functional API with function like map, filter, ….

What I like with Kafka Streams is that it’s just a simple library that you choose to integrate within your project. It doesn’t impose much constraints for an application already built on Kafka and solves the most common problems.

System architecture based on Kafka

Kafka Stream is great to deal with data already in Kafka. It doesn’t address how to put data in Kafka in the first place. This problem should be addressed using one of the available Kafka connectors.