Recently, I was developing a small stream processing application using Apache Flink. Zalando uses Kubernetes as the default deployment target, so naturally I wanted to deploy Flink and the developed job to our Kubernetes cluster. I learned a lot about Flink and Kubernetes along the way, which I want to share in this article.
Compliance - At Zalando, all code running in production has to be reviewed by at least two people and all deployed artifacts have to be traceable to a git commit. The default way of deploying Flink Jobs is to upload a JAR containing the Job with any other required dependencies to a running Flink cluster. This is not compatible with our internal compliance guidelines.
Container Orchestration Readiness - One of the key selling points of Flink is to do fault tolerant stream processing. However - as will be outlined in the next section - the reliability features were not designed with container orchestration systems in mind, which makes operating a Flink cluster on Kubernetes not as straightforward as it could be.
Fragmented Documentation - Both Flink and Kubernetes are evolving quickly, rendering some documentation (especially blog posts like this one and forum/newsgroup posts) out of date. Unfortunately, the official documentation currently does not provide every information that is needed to run Flink in a reliable way on Kubernetes.
Flink Architecture & Deployment Patterns
In order to understand how to deploy Flink on a Kubernetes cluster, a basic understanding of the architecture and deployment patterns is required. Feel free to skip this section if you are already familiar with Flink.
Flink consists of two components, Job Manager and Task Manager. The Job Manager coordinates the stream processing job, manages job submission and its lifecycle and allocates work to Task Managers. Task Managers execute the actual stream processing logic. There should always be exactly one active Job Manager and there can be n Task Managers.
In order to enable resilient, stateful, stream processing, Flink uses Checkpointing to periodically store the state of the various stream processing operators on durable storage. When recovering from a failure, the stream processing job can resume from the latest checkpoint. Checkpointing is coordinated by the Job Manager - notably, the Job Manager knows the location of the latest completed checkpoint which will get important later on.
Flink Clusters can be run in two distinct modes: The first mode, called Standalone or Session Cluster, is a single cluster that is running multiple stream processing jobs. Task Managers are shared between jobs. The second mode is called Job Cluster and is dedicated to run a single stream processing job.
A Flink Cluster can be run in HA mode. In this mode, multiple Job Manager instances are running and one is elected as a leader. If the leader fails, leadership is transferred to one of the other running Job Managers. Flink uses ZooKeeper for handling Leader Election.
Out of the two modes described in the previous section, we chose to run Flink as a Job Cluster. Two reasons drove the decision: The first reason is that the Docker image for Job Clusters needs to include the JAR with the Flink Job. This neatly solves the compliance problem as we can re-use the same workflow as we are using for regular JVM applications. The second advantage is that this deployment model allows to scale Task Managers independently for each Flink Job.
The Job Manager is modeled as a Deployment with one replica, Task Managers as a Deployment with n replicas. The Task Manager discovers the Job Manager via a Kubernetes Service. This setup deviates from the official documentation that recommends running the Job Manager of a Job Cluster as a Kubernetes Job. We think that using a Deployment is the more reliable option in this case (which is a never-ending streaming job) as the Deployment will make sure that one pod is always running whereas a Job could complete, leaving the cluster without any Job Manager. This is why our setup resembles the one describing a session cluster in the documentation.
Failures of Job Manager pods are handled by the Deployment Controller which will take care of spawning a new Job Manager. Since this is usually a relatively fast operation, this frees us from the need to maintain multiple Job Managers in hot-standby, which would increase the complexity of the deployment. Task Managers address the Job Manager with a Kubernetes Service.
As outlined above, the Job Manager keeps some state related to checkpointing in it’s memory. This state would be lost on Job Manager crashes, which is why this state is persisted in ZooKeeper. This means that even though there is no real need for the leader election and -discovery part of Flink’s HA mode (as is this handled natively by Kubernetes), it still needs to be enabled just for storing the checkpoint state.
As we already had an etcd cluster and etcd-operator deployed in our Kubernetes cluster, we did not want to introduce another distributed coordination system. We gave zetcd a try which is a ZooKeeper API backed by etcdv3. This setup works fine, so we decided to stick with it.
One other issue we faced with this setup was that the Job Manager sometimes got stuck in an unhealthy state that only could be fixed by restarting the Job Manager. This is done by a livenessProbe which checks if the Job Manager is still healthy and the job is still running.
It is also noteworthy that this setup only works correctly with Flink > 1.6.1 as there was this bug that prevented resuming from checkpoints in job clusters.
The above setup is now running in production for a couple of months and is serving our use case well. This is showing that it is possible to reliably run Flink on Kubernetes, even though there are some small roadblocks on the way.
- “Flink in Containerland” by Patrick Lucas - main inspiration of the points of this post
- “Redesigning Flink’s Distributed Architecture” by Till Rohrmann