Scaling operational excellence
At Zalando, we are iterating on our production incident handling process. The previous process had a dedicated Tier One 24/7 team that coordinated the incident response communication while escalating to the service-owning Tier Two team(s) for a resolution. That has been rationalized so that the service-owning teams handle their own incident flow, from alert to post-mortem, to reduce time to resolution.
Zalando’s Customer Data Platform (CDP) is composed of several teams co-located in our Dublin office, each focused on solving specific insights or core data problems to improve the fashion experience for our customers. Team sizes range from two to nine people, depending on the needs of each problem space and the team’s life cycle stage. Their autonomy manifests as variation in the technology stacks deployed by each team.
The people who build a service are in the best position to support it. While small teams can be very focused and effective, the reality is that operating an on-call rota within a newly formed team of perhaps two people is likely not sustainable and, I suggest, unnecessary.
This post will describe one method we use to increase the pool of people creating production services who are also comfortable with sharing the on-call support responsibility. I also address how we establish a positive feedback loop to build, document and operate better systems.
Being on-call every second week, either as primary or secondary support, can compromise work-life balance in subtle ways, e.g. scheduling leave, illness cover, family events, or meeting up with that friend who’s back in town for just one night.
As “vertically scaling” an individual is not always practical, we can try to horizontally scale the pool of individuals available to join the on-call rota.
The individual members within a cross-functional team naturally have varying degrees of SRE/SysOp skills, experience, and confidence in being on-call. We have found the range to be somewhat more pronounced when teams are composed of a mix of data scientists and engineers who have diverse backgrounds, and thus have different expectations for what supporting a production service and responding to an incident entails.
Add to this the challenge that every individual participating in the on-call rota can support other teams’ services and you may see furrowed brows and a drop in confidence on your colleagues’ faces as they wonder how that might pan out at 3am.
We learned that Game Days help calibrate expectations, raise confidence, and scale the pool of on-call support.
What Is a Game Day?
In three words: Exercising operational excellence.
In more words: practicing your incident handling process, dogfooding your documented playbooks, auditing your checks and alerts coverage, literally testing your backups (at least once a year), providing a fire drill for your service.
Take your pick and add your own.
Typically, a Game Day is a window during work hours when failure modes are artificially induced within one or more system components in a controlled manner. The supporting team then responds.
The specifics can vary by the maturity of the teams, services, and cultures involved. The window may be a recurring calendar invite or could have the element of surprise. The failure may be created on a sacrificial, staging, or production environment. The facilitator may be a rotating team member or an external actor.
The purpose is the same regardless: to provide the team with an opportunity to experiment and learn more about their systems’ failure modes, validate documented recovery steps, and develop mitigation strategies for that 3am alert–whoever is on-call.
I imagine you don’t want yourself or a colleague left frustrated, having to escalate a production incident because a trivial but non-obvious recovery step is missing from the 24/7 playbook, and the incident is the first time anyone other than the original author has tested it.
How We Started
As part of a wider operational excellence effort in CDP to prepare for scaling the number of services offered, we started small within two of the teams. Within my team of two engineers and two data scientists, we joined the on-call rota, but some of us felt underprepared, apprehensive even, when shadowing others.
With buy-in from the team, we drafted a simple template document listing inputs, a rough execution plan, and outputs to record successes and opportunities we expected to discover along the way. I scheduled Monday afternoon in our calendar so the team knew the window, but intentionally omitted further details about how it might unfold. This was intended to inject some realism into the scenario and maintain an air of mystery and suspense.
We wanted to bootstrap the process in a safe and efficient manner, so we used what we already had available where possible, and as this was the first iteration, opted to use our staging environment to avoid impacting other teams. We aim to be able to safely run Game Days on our production environment in coordination with stakeholders to surface brittle dependencies on our services and prevent cascading failures.
We cloned our production ZMON alerts, tagged them as Game Day-specific, and targeted them at the staging environment, so that they can be quickly toggled before and after future Game Days. The alerts were configured to page our CDP 24/7 pilot rota through OpsGenie at the time. Our staging environment writes logs into Scalyr, where we had some saved queries.
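To illustrate the clone, tag, and toggle idea, here is a minimal Python sketch that treats alert definitions as plain dictionaries. The field names (`name`, `environment`, `tags`, `active`) and the `game-day` tag are hypothetical stand-ins for illustration; they are not ZMON's actual schema or API.

```python
import copy
from typing import Any, Dict, List

GAME_DAY_TAG = "game-day"  # hypothetical tag name


def clone_for_game_day(alert: Dict[str, Any], environment: str = "staging") -> Dict[str, Any]:
    """Copy a production alert, retarget it, tag it, and leave it switched off."""
    clone = copy.deepcopy(alert)
    clone["name"] = f"[Game Day] {alert['name']}"
    clone["environment"] = environment
    clone["tags"] = sorted(set(alert.get("tags", [])) | {GAME_DAY_TAG})
    clone["active"] = False
    return clone


def set_game_day_alerts(alerts: List[Dict[str, Any]], active: bool) -> int:
    """Toggle every alert carrying the Game Day tag; return how many changed."""
    changed = 0
    for alert in alerts:
        if GAME_DAY_TAG in alert.get("tags", []) and alert.get("active") != active:
            alert["active"] = active
            changed += 1
    return changed
```

Keeping the clones inactive by default means a forgotten toggle fails safe: nobody gets paged by a Game Day alert outside the window.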
We use Gatling to drive load test scenarios and employ this in two forms:
- A low-intensity 1 req/s “trace”, running throughout the Game Day window, to give higher resolution than we typically get with our 60-second-interval ZMON checks.
- A variety of relatively high-intensity load tests which would run for a shorter period, e.g. 2000 req/s for 30 minutes for the particular service under test.
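Gatling does the pacing for us, but the idea behind the low-intensity trace can be sketched in a few lines of Python. This is not Gatling, only a single-threaded illustration; `send` is a hypothetical callable standing in for one request against the service under test, and the `clock` and `sleep` parameters exist so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def run_constant_rate(
    send: Callable[[], None],
    rate_per_s: float,
    duration_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() at a fixed rate for duration_s seconds; return the call count."""
    interval = 1.0 / rate_per_s
    start = clock()
    sent = 0
    while True:
        due = start + sent * interval  # scheduled time of the next request
        if due - start >= duration_s:
            return sent
        delay = due - clock()
        if delay > 0:
            sleep(delay)  # wait until the next slot; skip waiting if we are late
        send()
        sent += 1
```

At 1 req/s this yields roughly 60 data points for every 60-second check interval. A single-threaded loop like this would not reach the higher-intensity rates; that is the job of a dedicated load testing tool.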
We had previously put effort into maintaining our section of the joint CDP Playbooks, written with GitBook, but they could use some shaking out.
We planned two phases to exercise some first-principles root cause analysis: the first a failure not covered by the playbook, and the second simulating human error and misconfiguration.
Our API stack, a Play app in front of DynamoDB tables that also makes calls to a handful of other services, is relatively simple and allows some creativity in breaking it in new and convincing ways to keep future Game Days exciting.
For the first scenario, I opted to simply kill the API EC2 instances. Not having more sophisticated tooling prepared, I ran a rudimentary Bash for loop from my machine which lists and kills running instances in the stack. It is, admittedly, a contrived yet possible and effective method of inducing a failure.
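The original was a throwaway Bash loop. A roughly equivalent Python sketch might look like the following, assuming a boto3 EC2 client is passed in and that the instances carry the `aws:cloudformation:stack-name` tag; both are assumptions for illustration. It defaults to a dry run, and it reads only the first page of results, which should be enough for a small stack.

```python
from typing import Any, List


def running_instance_ids(ec2: Any, stack_name: str) -> List[str]:
    """Return IDs of running instances tagged with the given stack name."""
    response = ec2.describe_instances(
        Filters=[
            # Assumption: instances carry the CloudFormation stack-name tag.
            {"Name": "tag:aws:cloudformation:stack-name", "Values": [stack_name]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]


def kill_stack_instances(ec2: Any, stack_name: str, dry_run: bool = True) -> List[str]:
    """Terminate every running instance in the stack; return the targeted IDs."""
    instance_ids = running_instance_ids(ec2, stack_name)
    if instance_ids and not dry_run:
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids
```

With boto3 installed and credentials configured, `kill_stack_instances(boto3.client("ec2"), "my-staging-api", dry_run=False)` would terminate the instances; `my-staging-api` is a made-up stack name.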
For the second, I planned to break our Dynamic DynamoDB autoscaler, which we had been using before AWS AutoScaling was available. We had wrapped it in a Docker image to add some heartbeat keep-alive functionality and to pull the configuration from an S3 bucket on startup. This config is under version control and peer review, but it is infrequently changed and manually deployed. That presented an opportunity to demonstrate the effect on production systems of a regex typo that is only visible under load.
First Game Day
As you might guess, it did not go as planned.
Phase One alerts manifested as zero active members on the ELB and a spike in errors for which there was no playbook. We worked through the problem, with occasional coaching from me: checking Scalyr logs, examining the AutoScaler history, finding the EC2 instance console logs, determining that the instances were being gracefully terminated, and using CloudTrail to discover the source of the API calls. This took 56 minutes from the first alert to service restoration, with vital gains for the team in experience and confidence.
After some reflection and a write-up of the post-mortem document, we triggered Phase Two by launching the more intensive load test to expose the DynamoDB autoscaler configuration typo (an ‘n’ replaced with an ‘m’). This went more smoothly: the proverbial ice had been broken and the team followed the playbook steps to mitigate the issue by manually scaling the table before tracing the root cause.
While our playbook for high DynamoDB latency and throttled requests did mention Dynamic DynamoDB autoscaling failure as a possible cause, there was no specific playbook addressing it. Resorting to first principles, the logs revealed no activity relating to the affected tables, which narrowed it to the config, and redeploying the autoscaler restored service. This time it was 27 minutes from first alert to service restoration.
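A typo of this kind could, in principle, be caught before deployment. Here is a minimal sketch of such a check, assuming the autoscaler config selects tables by regular expression and that the list of existing table names is available; the table names and patterns below are hypothetical.

```python
import re
from typing import Iterable, List


def unmatched_patterns(patterns: Iterable[str], table_names: Iterable[str]) -> List[str]:
    """Return the configured patterns that match none of the existing tables."""
    tables = list(table_names)
    return [
        pattern
        for pattern in patterns
        if not any(re.search(pattern, name) for name in tables)
    ]


# Hypothetical tables and patterns; the second pattern has an 'n' typed as 'm'.
existing_tables = ["customer-events", "customer-profiles"]
configured = ["^customer-events$", "^customer-evemts$"]

print(unmatched_patterns(configured, existing_tables))
```

Any pattern the check reports would leave a table silently unmanaged by the autoscaler, which is exactly the kind of fault that only shows up under load.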
Even aside from the value of discovering what did and did not work well from a technical perspective, the roughly 52% reduction in restoration time (56 minutes down to 27) made it a worthwhile exercise. From our discussion afterwards, we were happy to have gained confidence, established a baseline of experience, and calibrated our expectations of what handling an incident actually involves.
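The improvement figure follows from the two restoration times reported above, 56 and 27 minutes. A small worked calculation, as a sketch:

```python
def restoration_improvement(before_min: float, after_min: float) -> float:
    """Percentage reduction in time from first alert to service restoration."""
    return (before_min - after_min) / before_min * 100


phase_one = 56  # minutes, first scenario
phase_two = 27  # minutes, second scenario
print(f"{restoration_improvement(phase_one, phase_two):.1f}% faster")
```

The two scenarios were different failures, so this is a rough indicator of growing confidence, not a controlled comparison.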
Since this first iteration, we have run two more Game Days within our team and made them monthly recurring events. We are also working with the other teams in our group to bootstrap, peer-review, and improve the process, and to build more sophisticated tooling. Through this and other efforts, we are seeing improvements in the ratio of production services to the people supporting them. Faster resolutions and collaborative learning have also emerged in our happily less frequent post-mortems.
Want to know more about Zalando Tech, Dublin? Check out this video!