Stream processing has been around for a while and encompasses a great number of applications:
- HTTP servers handling a stream of incoming HTTP requests
- Message streams: the Twitter firehose, user posts, …
- Time-series messaging: streams from IoT sensors
- Database querying: a result set contains a stream of records
- ….
Most interestingly, reactive streams have gained traction over the past few years. They bring back-pressure into the game in order to avoid flooding the destination stream with messages from the source stream.
This post focuses on Akka Streams, a reactive stream implementation based on Akka actors. Unlike actors, which are untyped, Akka Streams provides type safety at every stage of the stream pipeline and comes with a nice, fluent API. However, the documentation is sometimes lacking or hard to search when you need to implement common patterns. This post tries to cover the most common ones in a clear and concise way.
Flattening a stream
This is a common pattern where a stream of A is turned into a stream of Iterable[B] using a function A => Iterable[B]. However, we don’t want a stream of Iterable[B] but just a stream of B.
On regular collections this is the well-known flatMap operation. On streams it’s called mapConcat because it’s not intended to be used in for-comprehensions (and it doesn’t follow the monad laws).
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
import scala.concurrent.Future

implicit val sys = ActorSystem("akka-stream-patterns")
implicit val mat = ActorMaterializer()

Source('A' to 'E')
  .mapConcat(letter => (1 to 3).map(index => s"$letter$index"))
  .runForeach(println)
It can also be used to flatten a stream. Let’s say I have a query to a database that returns a Future[Iterable[Row]] and I want to turn it into a stream of Row.
Source
  .fromFuture(Future.successful(1 to 10))
  .mapConcat(identity)
  .runForeach(println)
Also note that mapConcat requires a strict (immutable) collection (e.g. it doesn’t work with Scala Streams), so if the future contains a scala.collection.Iterable (and not a scala.collection.immutable.Iterable) the mapConcat(identity) won’t work.
To solve this case we can write a toImmutable function and use it in place of the identity function:
def toImmutable[A](elements: Iterable[A]) =
  new scala.collection.immutable.Iterable[A] {
    override def iterator: Iterator[A] = elements.toIterator
  }

Source
  .fromFuture(Future.successful(Stream.range(1, 10)))
  .mapConcat(toImmutable)
  .runForeach(println)
or we can use the flatMapConcat operator:
Source
  .fromFuture(Future.successful(Stream.range(1, 10)))
  .flatMapConcat(Source.apply)
  .runForeach(println)
Interestingly, flatMapConcat is a way to merge substreams by emitting all elements from each substream in sequence.
Source('A' to 'E')
  .flatMapConcat(letter => Source(1 to 3).map(index => s"$letter$index"))
  .runForeach(println)
As expected, it prints A1, A2, A3, B1, B2, B3, C1, C2, C3, ...
A more general method is flatMapMerge, where we can specify how many substreams to consume simultaneously:
Source('A' to 'E')
  .flatMapMerge(5, letter => Source(1 to 3).map(index => s"$letter$index"))
  .runForeach(println)
Note that the order of the substreams is not guaranteed. We get all the ones first, then all the twos, but there is no consistency among the letters. It might print B1, C1, D1, A1, E1, C2, D2, E2, A2, B2, D3, C3, ...
As we’ve seen with Akka Persistence, Akka Streams is used to implement persistence queries. Akka Persistence provides 3 different streams:
- the stream of all persistence ids
- the stream of events for a given persistence id
- the stream of events for a given tag
As you can see there is no stream with all the events. Using flatMapMerge we can create a single stream of all the events:
import akka.persistence.query.PersistenceQuery
import akka.persistence.cassandra.query.scaladsl.CassandraReadJournal

val journal = PersistenceQuery(sys)
  .readJournalFor[CassandraReadJournal](CassandraReadJournal.Identifier)

journal
  .allPersistenceIds()
  .flatMapMerge(Int.MaxValue, { persistenceId =>
    journal.eventsByPersistenceId(persistenceId, 0L, Long.MaxValue)
  })
Batching
The opposite operation is called batching. Basically you have a stream of elements and you need to group them together. It proves to be handy when you need to perform an operation that is more efficient in batches (e.g. writing them to disk or a database, …).
This operation is called grouped (same as on regular collections).
Source(1 to 100)
  .grouped(10)
  .runForeach(println)
Another interesting alternative is groupedWithin. This operation takes a maximum number of elements and a duration, and emits a batch as soon as either the given number of elements has been received or the duration has elapsed, whichever comes first.
import scala.concurrent.duration._

Source
  .tick(0.millis, 10.millis, ())
  .groupedWithin(100, 100.millis)
  .map { batch => println(s"Processing batch of ${batch.size} elements"); batch }
  .runWith(Sink.ignore)
Asynchronous computations
Writing the batch to a database is most likely an asynchronous operation which returns a Future.
In this case we have the mapAsync operation at our disposal:
import sys.dispatcher

def writeBatchToDatabase(batch: Seq[Int]): Future[Unit] =
  Future {
    println(s"Writing batch of $batch to database by ${Thread.currentThread().getName}")
  }

Source(1 to 1000000)
  .grouped(10)
  .mapAsync(10)(writeBatchToDatabase)
  .runWith(Sink.ignore)
mapAsync takes a parallelism parameter and a function returning a Future.
The parallelism parameter allows us to specify how many simultaneous operations are allowed. If the upstream sends more requests than mapAsync can handle, back pressure kicks in and slows down the upstream, without buffering elements in memory and risking an OutOfMemoryError.
In fact the above code generates a large stream of integers at a rather high rate, but no faster than the downstream can handle.
Note that mapAsync delivers the elements in order. However it doesn’t necessarily mean that they are going to be written to the database in order, as several threads (10 here) are running concurrently. You can even see that the batches are not printed in order in the console output.
If the ordering is not required, there is a similar operation: mapAsyncUnordered.
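For illustration, here is a minimal sketch of the same batch-writing pipeline using mapAsyncUnordered (reusing the writeBatchToDatabase function defined above); the batches may complete in any order:

Source(1 to 1000000)
  .grouped(10)
  .mapAsyncUnordered(10)(writeBatchToDatabase) // same parallelism, but no ordering guarantee downstream
  .runWith(Sink.ignore)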
Concurrency
This brings us to the concurrency model used in Akka Streams. By default a stream runs on a single thread. This is a reasonable default as it avoids the overhead of crossing asynchronous boundaries. However, to maximise performance we need to perform some operations in parallel. Choosing which stages can be performed in parallel requires a good understanding of the different operations performed in the pipeline. Executing a stage in parallel is very simple as it just needs to be marked with async.
import akka.NotUsed

def stage(name: String): Flow[Int, Int, NotUsed] =
  Flow[Int].map { index =>
    println(s"Stage $name processing $index by ${Thread.currentThread().getName}")
    index
  }

Source(1 to 1000000)
  .via(stage("A"))
  .via(stage("B"))
  .via(stage("C"))
  .runWith(Sink.ignore)
Here each element goes through each stage in sequence: 1 goes through stages A, B, C, then 2 goes through A, B, C, then 3, … using only a single thread. You may think that this pipeline takes a thread from the pool and runs forever (no async boundary), but that’s not entirely true. Akka Streams’ engine is able to suspend the stream and put the thread back into the pool. It maintains the illusion of single-threaded execution: several threads might be used at different times, but overall the elements are processed sequentially (as if they were on a single thread).
Now let’s execute the 3 stages in parallel:
Source(1 to 1000000)
  .via(stage("A")).async
  .via(stage("B")).async
  .via(stage("C")).async
  .runWith(Sink.ignore)
The elements still go through stages A, B and C sequentially, but while one element is in stage B, A may already be processing the next one. In fact A, B and C don’t have their own threads but share a common pool. It’s worth noting that the ordering is maintained through the different stages: the elements leave C in order.
Terminating a stream
In case you haven’t noticed, the program doesn’t stop when the stream completes. That’s because the stream doesn’t execute on the same thread as the main program. Therefore we need to terminate the underlying actor system when the stream completes for the program to end. You can use the Future returned by runWith to terminate the actor system.
import sys.dispatcher

Source
  .single(1)
  .runWith(Sink.ignore)             // returns a Future[Done]
  .onComplete(_ => sys.terminate()) // onComplete callback of the future
Akka Streams also provides a watchTermination method that can be used to monitor stream termination, both for success and failure cases. It’s a good place to add logging messages or trigger follow-up actions.
import scala.util.{ Failure, Success }

Source
  .single(1)
  .watchTermination() { (_, done) =>
    done.onComplete {
      case Success(_)     => println("Stream completed successfully")
      case Failure(error) => println(s"Stream failed with error ${error.getMessage}")
    }
  }
  .runWith(Sink.ignore)
Throttling
When communicating with a remote system there is often a limit on the request rate it can handle without degrading its performance. Enforcing such a rate limit is another easy operation with Akka Streams as it provides the throttle operator. The nice thing is that it automatically applies back pressure to the upstream components.
This also proves useful to limit the requests to a database, …
Source(1 to 1000000)
  .grouped(10)
  .throttle(elements = 10, per = 1.second, maximumBurst = 10, ThrottleMode.shaping)
  .mapAsync(10)(writeBatchToDatabase)
  .runWith(Sink.ignore)
Idle timeouts
Still, when communicating with other services it might be useful to detect when a service stops sending requests or emits messages more slowly than expected. It’s often better to fail the stream than to keep these problems silent, which often leads to more subtle issues.
import scala.concurrent.TimeoutException

Source
  .tick(0.millis, 1.minute, ())
  .idleTimeout(30.seconds)
  .runWith(Sink.ignore)
  .recover {
    case _: TimeoutException => println("No messages received for 30 seconds")
  }
Error handling and recovery
Akka Streams being implemented on top of actors, it’s no surprise to see that the error handling strategies follow similar patterns.
There are 3 strategies to choose from:
- Stop – completes the stream with failure
- Resume – The faulty element is dropped and the stream continues
- Restart – The faulty element is dropped and the stream continues after restarting the stage
The recovery strategy can be defined in the ActorMaterializer itself:
implicit val mat = ActorMaterializer(
  ActorMaterializerSettings(sys)
    .withSupervisionStrategy(Supervision.restartingDecider)
)
or for any operator of the stream:
Source(1 to 5)
  .map {
    case 3 => throw new Exception("3 is bad")
    case n => n
  }
  .withAttributes(ActorAttributes.supervisionStrategy(Supervision.restartingDecider))
  .runForeach(println)
It’s possible to define more fine-grained strategies by choosing a strategy based on the exception raised.
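For example, a decider can pattern-match on the exception type. Here is a minimal sketch (the ArithmeticException case is just an illustration, not from the examples above) that resumes on that specific exception and stops the stream on any other failure:

// hypothetical decider: resume on ArithmeticException, stop on anything else
val decider: Supervision.Decider = {
  case _: ArithmeticException => Supervision.Resume
  case _                      => Supervision.Stop
}

Source(0 to 5)
  .map(n => 100 / n) // fails with an ArithmeticException when n is 0
  .withAttributes(ActorAttributes.supervisionStrategy(decider))
  .runForeach(println)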
The difference between Resume and Restart is somewhat subtle. To understand it we need a stage that maintains state. fold has a state (the current value depends on the previous elements). On the other hand map has no state. Therefore for a map operation the resuming and restarting strategies have the same effect.
Source(1 to 5)
  .fold(0) { case (total, element) =>
    if (element == 3) throw new Exception("I don't like 3")
    else total + element
  }
  .withAttributes(ActorAttributes.supervisionStrategy(Supervision.restartingDecider))
  .runForeach(println)
Here the fold operation is going to fail for element 3, so 3 will be dropped. However, when reaching element 3, elements 1 and 2 have already been processed, so the internal state of the stage is 3 (0 + 1 + 2). If we use a resumingDecider, 3 is dropped and the stream continues with elements 4 and 5, which yields a total of 12 (0 + 1 + 2 + 4 + 5). If we use a restartingDecider, the accumulated state is lost when failing on element 3, so the processing starts again with a state of 0 for elements 4 and 5, which yields a total of 9 (0 + 4 + 5).
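To see the difference in action, here is the same fold with a resumingDecider, a minimal variation of the snippet above; it should print 12 instead of 9:

Source(1 to 5)
  .fold(0) { case (total, element) =>
    if (element == 3) throw new Exception("I don't like 3")
    else total + element
  }
  .withAttributes(ActorAttributes.supervisionStrategy(Supervision.resumingDecider))
  .runForeach(println)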
Of course all these strategies apply to the failing stage and not to the whole pipeline, which means there is no way to restart a whole pipeline. That makes sense, as it might not be possible to restart a source: imagine that your source is a source of HTTP requests; it’s not possible to replay all the HTTP requests when your Akka Streams pipeline fails.
This is certainly true, but there is an interesting method: recoverWithRetries. This method takes a number of retries (-1 for infinite retries) and a partial function Throwable => Graph. It means we can use it to restart our pipeline (if our source allows it).
def pipeline = Source(1 to 5).map {
  case 3 => throw new Exception("three fails")
  case n => n
}

pipeline
  .recoverWithRetries(2, { case _ => pipeline.initialDelay(2.seconds) })
  .runForeach(println)
Restarting with backoff
Akka Streams now provides exponential backoff recovery. It’s the same functionality that is available for Akka actors but adapted to the stream API. There are RestartSource, RestartFlow and RestartSink.
This is especially useful when the messages in the stream come from a remote server where the connection can be lost. In this case you’ll want to restart the source, but not straight away, to give the server some time to recover.
RestartSource
  .withBackoff(
    minBackoff = 2.seconds,
    maxBackoff = 30.seconds, // retries after 2, 4, 8, 16, 30 seconds
    randomFactor = 0.2       // adds 20% "noise" to vary the intervals slightly
  ) { () =>
    Source(1 to 5).map {
      case 3 => throw new Exception("connection lost!")
      case n => n
    }
  }
  .runForeach(println)
This code prints 1 2 after 0, 2, 4, 8, 16 and 30 seconds, and keeps restarting every 30 seconds.
Conclusion
Akka Streams’ building blocks are simple concepts and components in isolation, yet they would be quite tricky to implement correctly yourself. The great power of Akka Streams lies in combining these building blocks to build more complex applications. Hopefully we have covered some of the most useful patterns in this post, and you should now be ready to use them in your own projects.
In case I forgot a pattern that you’d like to share, please leave a comment below with a little explanation!