Zalando is using an event-driven approach for its new Fashion Platform. Conor Clifford examines why.
In a recent post, I wrote about how we built the core Article services and applications of Zalando’s new Fashion Platform with a strong event-first focus. The platform as a whole also has a strong event-driven focus, rather than a more “traditional” service-oriented approach.
The concept of “event-driven” is not a new one; indeed, it has been quite well covered in recent years.
In this post, we look at why we are using an event-driven approach to build the new Fashion Platform in Zalando.
A “traditional” service/microservice architecture will be composed of many individual services, each with different responsibilities. Each service will likely have several, probably many, clients, each interacting with the service to fetch data as needed. These clients may in turn be services to other clients, and so on.
Various clients will have different requirements for the data they are fetching, for example:
- Regular high frequency individual primary key fetches
- Intermittent, yet regular, large bulk fetches
- Non-primary key based queries/searches, also with varieties of frequencies and volumes
- All the above with differing expectations/requirements around response times, and request throughputs
Over time, as such systems grow and become more complex, the demands on each of these services grow, both in terms of new functional requirements, as well as operational demands. From generally easy beginnings, growth can lead to ever-increasing complexity of the overall system, resulting in systems that are difficult to operate, scale, maintain and improve over time.
While there are excellent tools and techniques for dealing with and managing these complexities, these target the symptoms, not the underlying root causes.
Perhaps there is a better way.
Inversion of flow
The basic underlying concept here is to invert this traditional flow of information: to change from a top-down, request-oriented system to one where data flows from the bottom up, with changes to data causing new snapshot events to be published. These changes propagate upwards through the system, being handled appropriately by a variety of client subsystems on their way.
Rather than fetching data on demand, clients requiring the data in question can process it appropriately for their own needs, at their own pace. That can mean transforming and merging the data and producing new events, or building an appropriate local persisted projection of the data, e.g. a high-speed key-value store for fast lookups, populating an analytical database, maintaining a search cluster, or even maintaining a corpus of data for various data science/machine learning activities. In fact, there can and will be clients that do a combination of such activities around the event data.
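To make the idea of a local projection concrete, here is a minimal sketch in Python. It is an illustration only, not the Fashion Platform's actual implementation: the `SnapshotEvent` shape, its `version` field, and the in-memory dictionary standing in for a key-value store are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class SnapshotEvent:
    """Hypothetical snapshot event: the full state of one entity after a change."""
    key: str
    version: int  # assumed to increase with every change to the same key
    payload: dict


def identity(payload: dict) -> dict:
    """Default transform: store the payload unchanged."""
    return payload


class LocalProjection:
    """In-memory stand-in for a client's local key-value store."""

    def __init__(self, transform: Callable[[dict], dict] = identity) -> None:
        self._transform = transform
        self._store: Dict[str, dict] = {}
        self._versions: Dict[str, int] = {}

    def apply(self, event: SnapshotEvent) -> bool:
        """Apply one event; skip it if an equal or newer version is already stored."""
        if event.version <= self._versions.get(event.key, -1):
            return False
        self._store[event.key] = self._transform(event.payload)
        self._versions[event.key] = event.version
        return True

    def get(self, key: str) -> Optional[dict]:
        """Fast local lookup; no call to the originating service is needed."""
        return self._store.get(key)


def consume(events: Iterable[SnapshotEvent], projection: LocalProjection) -> int:
    """Feed events into the projection and return how many were applied."""
    return sum(1 for event in events if projection.apply(event))
```

Because each event carries a full snapshot, the client only ever overwrites the stored value. The version check is one possible way to tolerate duplicate or out-of-order delivery; whether it is needed depends on the guarantees of the event stream in use.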
On-Demand Requests are Easy
Building a client that pulls data on demand from a service would appear the easier thing to do, with clients being free to just fetch data directly as needed. From the client perspective, it can even appear that there is no obvious benefit to an event-driven approach.
However, with a view to the wider platform ecology (many clients, many services, lots of data, etc.), the traditional “pull-based” approach will lead to much more complex and problematic systems, leading to a variety of challenges:
- Operation and Maintenance - core services in pull-based systems grow to serve more and more clients over time; clients with different access requirements (PK fetches, batch fetches, periodic "fetch the world" cycles, etc.). As the number and types of such clients grow, operating and maintaining such core services becomes ever more complex and difficult.
- Software Delivery - as clients of core services grow, so too will the list of requirements around different access patterns and capabilities of the underlying data services (e.g. inclusion of batch-fetches, alternative indexing, growing request loads, competing prioritizations, etc.). This workload has a strong tendency to ultimately swamp the delivery teams of core services, to the detriment of delivering new business value. In addition to the service teams, the client teams themselves would also be dependent on new/changed functionality in the services to allow them to move forward.
- Runtime Complexity - Outages and other such incidents in "pull" based environments can have dramatic impacts. A core service outage would essentially break any client fetching data "on demand". Multiple dependent applications can be brought down by an outage in a single underlying service. There can also be interesting dynamics on recovery of such services, with potential thundering herds, etc., causing repeated outages during this recovery, prolonging, or worse, further degrading, the impact of the original incident. Even without outages, the complexity of systems built around a request/response approach makes forecasting load growth difficult: it requires modelling the interplay of many different clients, each with different request patterns, and forecasting growth for each of them becomes a real challenge.
Evolving to an event-driven system brings many advantages in these and other areas:
- Operationally - since clients receive information about change, the clients can react instantly and appropriately themselves. As the throughput of data is driving the system, the performance/load characteristics are much more predictable (and testable). There is no non-determinism caused by the interplay of multiple clients, etc. In general, handing data over event streams allows for much looser coupling of clients and services, which results in simpler systems.
- Delivery - With the ability to access complete data from the event streams, clients are no longer blocked by the service team's delivery backlog; they are completely free to move forward themselves. Similarly, service delivery team backlogs are not overloaded by requests for serving modifications/alterations, etc., and the teams are as such freed up to directly deliver new business value.
- Outages - With clients receiving data changes, and handling these locally, an outage of the originating service essentially means clients working with some stale data (the data that would have been changed during that outage), typically a much less invasive and problematic issue. In many cases, where clients depend on data that changes infrequently, if at all, once established, it’s not an issue.
- Greater Runtime Simplicity - with data passing through event streams, and clients consuming these streams as they need, the overall dynamic of the system becomes more predictable/less complicated.
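The outage behaviour described above can be shown with a toy simulation. This is a sketch under simplifying assumptions: the class names are hypothetical, the event stream is a plain Python list, and the "outage" is a boolean flag. It only contrasts the two failure modes; it does not model real infrastructure.

```python
from typing import Dict, List, Tuple


class ServiceUnavailable(Exception):
    """Raised by the toy service while it is down."""


class CoreService:
    """Toy originating service: answers fetches and publishes snapshot events."""

    def __init__(self, stream: List[Tuple[str, str]]) -> None:
        self.up = True
        self._data: Dict[str, str] = {}
        self._stream = stream  # stands in for separately operated event infrastructure

    def update(self, key: str, value: str) -> None:
        """Change data and publish the new snapshot to the stream."""
        self._data[key] = value
        self._stream.append((key, value))

    def fetch(self, key: str) -> str:
        """Synchronous on-demand fetch; fails while the service is down."""
        if not self.up:
            raise ServiceUnavailable(key)
        return self._data[key]


class PullClient:
    """Fetches on demand, so every lookup depends on the service being up."""

    def __init__(self, service: CoreService) -> None:
        self._service = service

    def lookup(self, key: str) -> str:
        return self._service.fetch(key)


class EventClient:
    """Consumes the stream at its own pace and answers lookups locally."""

    def __init__(self, stream: List[Tuple[str, str]]) -> None:
        self._stream = stream
        self._offset = 0
        self._local: Dict[str, str] = {}

    def poll(self) -> None:
        """Apply any events published since the last poll."""
        for key, value in self._stream[self._offset:]:
            self._local[key] = value
        self._offset = len(self._stream)

    def lookup(self, key: str) -> str:
        return self._local[key]
```

With `service.up = False`, `PullClient.lookup` raises `ServiceUnavailable`, while `EventClient.lookup` keeps returning the last value it consumed. Once the service recovers and publishes again, the next `poll` brings the local copy up to date.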
“Time is an illusion. Lunchtime doubly so.” - Douglas Adams
There's no such thing as a free lunch. There’s likely more work up front in establishing such an architectural shift, as well as other concerns:
- Client Burden - There will be an additional burden on clients in a purely event-driven system, with those clients having to implement and operate local persisted projections of the event stream(s) for their purposes. However, a non-trivial part of this extra work is offset by removing work (development and operational) around integrating with a synchronous API and all the details that entails: dealing with authentication, rate limiting, circuit breakers, outage mitigations, etc. Clients also avoid the work of adapting to an API that is not 100% purpose-built for their needs. In addition, the logic to maintain such local snapshot projections is straightforward (e.g. write an optionally transformed value to a key-value store for fast local lookups).
- Source of Truth - A common concern with having locally managed projections of the source data is that there is a loss of the single source of truth that the originating service represents in the traditional model. By following an “Event First” approach, with the event stream itself being the primary source of truth, and by allowing only changes from the event stream itself to cause changes to any of the local projections, the source of truth is kept true.
- Traditional Clients - there may be clients that are not in a position to deal with a local data projection (e.g. clients that require only a few requests to be processed, clients that facilitate certain types of custom/external integrations, etc.). In such cases, there may be a need to provide a more traditional “request-response” interface. These, though, can be built using the same foundations, i.e. a custom data projection, and a dedicated new service using this to address these clients’ needs. We need to ensure that any clients looking to fit the “traditional” model are appropriate candidates to do so. Care should be taken to resist the temptation to implement the “easy” solution, rather than the correct solution.
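The last two points can be sketched together: a projection that changes only by applying events from the stream, and a small request-response facade built on top of it for "traditional" clients. This is a hedged illustration, not the platform's design. The `(key, snapshot)` event shape, the use of `None` as a deletion marker, and the `LookupFacade` name with its HTTP-style status codes are assumptions chosen for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical event shape: (key, snapshot); a snapshot of None marks a deletion.
Event = Tuple[str, Optional[dict]]


def apply_event(projection: Dict[str, dict], event: Event) -> None:
    """The only code path allowed to change a projection."""
    key, snapshot = event
    if snapshot is None:
        projection.pop(key, None)
    else:
        projection[key] = snapshot


def rebuild(log: List[Event]) -> Dict[str, dict]:
    """Recreate a projection from scratch by replaying the whole event log."""
    projection: Dict[str, dict] = {}
    for event in log:
        apply_event(projection, event)
    return projection


class LookupFacade:
    """Request-response interface served purely from a local projection."""

    def __init__(self, log: List[Event]) -> None:
        self._log = log  # stands in for a subscription to the event stream
        self._offset = 0
        self._projection: Dict[str, dict] = {}

    def catch_up(self) -> int:
        """Apply events not yet seen and return how many were applied."""
        new_events = self._log[self._offset:]
        for event in new_events:
            apply_event(self._projection, event)
        self._offset += len(new_events)
        return len(new_events)

    def handle_get(self, key: str) -> Tuple[int, Optional[dict]]:
        """Answer a lookup with an HTTP-style status code and the stored snapshot."""
        snapshot = self._projection.get(key)
        return (200, snapshot) if snapshot is not None else (404, None)

    def view(self) -> Dict[str, dict]:
        """Copy of the current projection, e.g. for comparison with a full replay."""
        return dict(self._projection)
```

Since nothing other than `apply_event` writes to the projection, a projection that has caught up is always equal to `rebuild(log)`. That property is what keeps the event stream the source of truth: a lost or corrupted local copy can be discarded and replayed.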
In the modern era of building growing systems, with many hundreds of engineers and dozens of teams all trying to move fast and deliver excellent software with key business value, there is a need for less fragile solutions.
In this post, we have looked at moving away from a more regular “service” oriented architecture, towards one driven by event streams. There are many advantages, but they come with their own set of challenges. And of course, there is no such thing as a silver bullet.
In the next post, I will look at some of the lessons we have learned building the system, and some practices that should be encouraged.
If you are interested in working with these types of systems and challenges, join us. We’re hiring!