At Zalando, teams are autonomous and involved in the entire software development process, from gathering stakeholder requirements to design, implementation, testing and deployment. For me, this was one of the greatest challenges and opportunities of joining Zalando, and it allowed me to grow in many dimensions of software development, one of them being DevOps.
Before joining Zalando I had focused only on software development, and I was eager to understand how my software should be deployed and operated.
As part of the autonomy mindset, each team is given an AWS account where they can deploy their services. There is common infrastructure based on STUPS (fully open source by the way) that provides a common way to handle logging, monitoring and deployment concerns. Today we are actively moving to a Kubernetes based setup and a fully integrated continuous delivery platform.
There are three main topics that I faced while doing DevOps: Monitoring or Visibility, Reliability, and Software Delivery. Let’s focus on each one individually and how learning about it improved the solutions I bring to production.
Monitoring / Visibility
For a period of time, we did not know how our application was behaving. We could not tell whether our users were seeing errors, nor what the latencies of our backends for frontends were.
This problem became apparent when one endpoint started returning errors and we only learned about it when end users notified us. This was a personal wake-up call to better understand how the applications the team owns should be operated.
We started by measuring the four golden signals:
- Latency - We gathered the latency perceived by our application on the various endpoints and from the load balancer’s perspective. Differences between these two signals can for example showcase long Garbage Collector pauses that may not be visible in internal application metrics.
- Request rate - Abnormal variations should be investigated, especially during a deployment. One can also learn about the saturation point in terms of requests by monitoring this signal during load tests.
- Saturation - Here we included CPU and memory consumption, as well as TCP connection stats such as new connections, total connections, and connections in the TIME_WAIT and CLOSE_WAIT states.
- Error and Success rate - Like the latency, we measure this inside the application on the various endpoints and on the load balancer level. Inconsistencies between these two could be explained by misconfiguration of timeouts on the LB level or other abnormal scenarios like the application refusing new connections.
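Comparing the same signal from two vantage points, as described for latency and error rate above, can be sketched in a few lines. This is only an illustration: the function names, the use of the mean, and the 50 ms tolerance are assumptions of mine, not the values or tooling the team used.

```python
from statistics import mean
from typing import Sequence


def latency_gap_ms(app_latencies_ms: Sequence[float],
                   lb_latencies_ms: Sequence[float]) -> float:
    """Average time spent outside the application's own measurement.

    The load balancer sees the full request, while the application only
    sees the part it times itself, so the difference approximates time
    lost to things like queuing or runtime pauses.
    """
    return mean(lb_latencies_ms) - mean(app_latencies_ms)


def gap_needs_investigation(app_latencies_ms: Sequence[float],
                            lb_latencies_ms: Sequence[float],
                            tolerance_ms: float = 50.0) -> bool:
    """Flag a window where the two vantage points disagree too much.

    tolerance_ms is a hypothetical threshold chosen for this example.
    """
    return latency_gap_ms(app_latencies_ms, lb_latencies_ms) > tolerance_ms
```

A small, steady gap is expected because of network hops; a gap that grows suddenly is the kind of inconsistency worth investigating.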
We chose not to alert on saturation signals and to alert only on latency and error rate, since these are the metrics that affect the end-user experience. If there is no impact on latency and error rate, having the CPU at 99% is completely acceptable and actually a sign of good design, since it would mean that the application requires very little slack.
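To make this alerting policy concrete, here is a minimal sketch. The `Request` type, the nearest-rank percentile, and both thresholds (a 500 ms p99 and a 1% error rate) are assumptions made for the example, not our actual alert configuration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(requests: List[Request]) -> float:
    """Fraction of requests answered with a 5xx status."""
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def should_alert(requests: List[Request],
                 cpu_utilization: float,
                 p99_limit_ms: float = 500.0,
                 error_rate_limit: float = 0.01) -> bool:
    """Alert only on user-facing signals.

    cpu_utilization is accepted to show that saturation is observed,
    but it is deliberately not part of the alert condition.
    """
    if not requests:
        return False
    p99 = percentile([r.latency_ms for r in requests], 99)
    return p99 > p99_limit_ms or error_rate(requests) > error_rate_limit
```

With this rule, a window of fast, successful requests does not page anyone even at 99% CPU, while a window with elevated errors or a slow tail does.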
These monitoring capabilities give us an understanding of how our system is behaving in real time and information about application usage patterns, and they help us foresee possible problems. Now, when we are developing a new service, we do not go to production without having good monitoring in place beforehand.
Reliability
Once the monitoring was improved, we saw a lot of inefficiencies that were introduced by our backend for frontend. We expected our latencies to closely match the backend metrics, but this was not the case. Upon further investigation, we discovered that our authentication strategy was introducing significant unnecessary latency.
We also looked at where the state of the system was being stored and added a Redis deployment to hold the session data. Previously, our users had to log in again after every deployment, which meant that releases had to be aligned with them.
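The idea of moving session data out of the service can be sketched as follows. Everything here is hypothetical: the `SessionStore` interface, the `Frontend` class, and the in-memory store, which only stands in for an external datastore such as Redis so that the example runs without dependencies.

```python
import secrets
from typing import Dict, Optional, Protocol


class SessionStore(Protocol):
    """What the service needs from a session datastore."""

    def get(self, session_id: str) -> Optional[str]: ...

    def set(self, session_id: str, user: str) -> None: ...


class InMemoryStore:
    """Stand-in for an external store; it outlives service instances."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, session_id: str) -> Optional[str]:
        return self._data.get(session_id)

    def set(self, session_id: str, user: str) -> None:
        self._data[session_id] = user


class Frontend:
    """A service instance that keeps no session state of its own."""

    def __init__(self, store: SessionStore) -> None:
        self._store = store

    def login(self, user: str) -> str:
        session_id = secrets.token_hex(16)
        self._store.set(session_id, user)
        return session_id

    def current_user(self, session_id: str) -> Optional[str]:
        return self._store.get(session_id)
```

Because the instance holds no sessions itself, replacing it during a deployment does not log anyone out: a new `Frontend` built on the same store still resolves session IDs issued by the old one.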
Our work on reliability highlighted that we had not properly considered how different components interacted when designing the system. Now, thinking about which components can fail and how their failure can affect the system is a common exercise when building new services or even refactoring current ones.
Software Delivery
The last topic we focused on was improving the way we deliver the software the team develops. Initially, our releases, which were based on Docker images, were manual and done from developer machines, which was not even compliant with our internal policies. The first attempt at improving the situation focused on producing these Docker images with a Jenkins job, which improved our compliance status. The second iteration moved the team to a continuous delivery workflow using Kubernetes and an internal Continuous Delivery Platform. To enable this without reducing the quality of delivery, we introduced end-to-end testing (you can read about it here). These tests run on the staging and production deployments before traffic is switched to the new deployment. If the tests fail, the deployment is aborted and we are notified via instant messaging. I am happy with the current state of our delivery process, but I continue to learn and look for improvements.
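The gate described above, where end-to-end tests run before the traffic switch and a failure aborts the deployment, can be sketched like this. The `release` function and its callbacks are hypothetical names for illustration; the real pipeline runs on our internal Continuous Delivery Platform and is not shown here.

```python
from typing import Callable, List, Tuple

# A named end-to-end test: returns True when it passes.
EndToEndTest = Tuple[str, Callable[[], bool]]


def release(tests: List[EndToEndTest],
            switch_traffic: Callable[[], None],
            notify: Callable[[str], None]) -> bool:
    """Switch traffic only if every end-to-end test passes.

    A test that raises is treated as a failure. On any failure the
    deployment is aborted and a notification is sent instead.
    """
    failures: List[str] = []
    for name, test in tests:
        try:
            passed = test()
        except Exception:
            passed = False
        if not passed:
            failures.append(name)

    if failures:
        notify("Deployment aborted, failing tests: " + ", ".join(failures))
        return False

    switch_traffic()
    return True
```

In this sketch `notify` would post to a chat channel and `switch_traffic` would move users to the new deployment; both are passed in so the gate itself stays easy to test.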
Having a continuous delivery workflow reduced the operational needs of the team and allowed us to deliver faster to our stakeholders.
With our migration to Kubernetes we have also improved our application architecture. We have simplified it to just one service in the frontend and we moved all the stateful components to a Redis datastore. In the image below you can see the architecture before and after the Kubernetes migration.
System architecture before migration to Kubernetes
System architecture after migration to Kubernetes
As a software engineer, getting involved with DevOps helped me better understand how our applications are delivered to our customers and gave me crucial knowledge to investigate and fix issues autonomously. In my experience, gaining DevOps knowledge as a software engineer has greatly improved my ability to have an impact.