A Recipe for Kafka Lag Monitoring

by Mark Kelly - 5 Dec 2017

A closer look at the ingredients needed for ultimate stability

This is part of a series of posts on Kafka. See Ranking Websites in Real-time with Apache Kafka’s Streams API for the first post in the series.

Remora is a small application for monitoring Kafka. Because many teams were deploying it to their production environments, open sourcing it made sense. Here, I’ll go through what streaming is, why it is useful, and the motivation behind this monitoring application.

Streaming architectures have become an important architecture pattern at Zalando. Good streaming infrastructure and monitoring are key to building fast, highly available and scalable systems that process data across numerous teams and layers.

Without streaming, a system cannot react to changes as they happen. An older style would manage changes through incremental batches. Batches, such as cron jobs, can be hard to manage because you have to keep track of a large number of jobs, each taking care of a shard of data; they can also be hard to scale. Having a streaming component, such as Kafka, provides a centralised, scalable resource that makes your architecture reactive.

Some teams use cloud infrastructure such as AWS Kinesis or SQS. Alternatively, Zalando chooses Kafka to reach better throughput and to use frameworks like Akka Streams.

Monitoring lag is important; without it we don’t know where the consumer is relative to the size of the queue. An analogy might be piloting a plane without knowing how many more miles are left on your journey. Zalando has trialled:

We needed a simple, independent application that could scrape metrics from a URL and place them into our metrics system. But what to include in this application? I put on my apron and played chef with the range of ingredients available to us.
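To make the idea of lag concrete: for each partition, lag is the log end offset (the newest message written) minus the offset the consumer group has committed. The Python sketch below is only an illustration of that arithmetic, not Remora's code. The function names and the offset numbers are hypothetical, and treating a partition with no commit as offset 0 is a simplifying assumption.

```python
from typing import Dict


def partition_lag(log_end_offsets: Dict[int, int],
                  committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return lag per partition: log end offset minus committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset counts as offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at 0 so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end_offset - committed, 0)
    return lag


def total_lag(lag_by_partition: Dict[int, int]) -> int:
    """Sum the per-partition lag into one number for a dashboard or alert."""
    return sum(lag_by_partition.values())


# Hypothetical offsets for a three-partition topic.
log_end = {0: 1500, 1: 980, 2: 2100}
committed = {0: 1450, 1: 980}  # partition 2 has no commit yet

lags = partition_lag(log_end, committed)
print(lags)             # {0: 50, 1: 0, 2: 2100}
print(total_lag(lags))  # 2150
```

A single total is handy for alerting, while the per-partition view helps spot one stuck consumer hiding behind otherwise healthy numbers.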

Scala is big at Zalando; a majority of the developers know Akka or Play. At the time, our team was designing systems using Akka HTTP with an actor pattern. It is a very light, stable and asynchronous framework. Could we just wrap the command line tools in a light Scala framework? Potentially, it could take less time and be more stable. Sounds reasonable, we thought; let’s do that.
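As a rough picture of what "wrapping" a command line tool can mean, the sketch below turns a whitespace-separated table with a header row into structured records that an HTTP endpoint could serve. It is written in Python for brevity and is not how Remora itself is implemented. The sample text, its column names, and the function names are assumptions for illustration; real tool output differs between Kafka versions.

```python
from typing import Dict, List

# Hypothetical sample; not captured from a real tool run.
SAMPLE_OUTPUT = """
TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
orders  0          1450            1500            50
orders  1          980             980             0
"""


def parse_table(text: str) -> List[Dict[str, str]]:
    """Parse a header-led, whitespace-separated table into dict rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        if len(cells) != len(header):
            continue  # skip lines that do not match the header shape
        rows.append(dict(zip(header, cells)))
    return rows


def to_metrics(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Keep only the fields a metrics system needs, with numeric types."""
    metrics = []
    for row in rows:
        if not row.get("LAG", "").isdigit():
            continue  # assumption: non-numeric lag means "unknown", so skip it
        metrics.append({
            "topic": row["TOPIC"],
            "partition": int(row["PARTITION"]),
            "lag": int(row["LAG"]),
        })
    return metrics


print(to_metrics(parse_table(SAMPLE_OUTPUT)))
```

Reading the header instead of hard-coding column positions keeps the parser tolerant of extra or reordered columns, which matters when the underlying tool changes between releases.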

The ingredients for ultimate stability were as follows: a handful of Kafka Java command line tools with a pinch of Akka HTTP and a hint of an actor design. Leave to code for a few days and take out of the oven. Lightly garnish with a performance test to ensure stability, and throw in some Docker. Deploy, monitor, alert, and top it off with a beautiful graph to impress. Present Remora to your friends so that everyone may have a piece of the cake, no matter where in the world you are!

Bon app-etit!
