Shortly after joining Zalando, I, along with a small number of other new colleagues (in a newly opened Dublin office), was entrusted with the task of building an important part of the new Fashion Platform - in particular, the core services around the Article data of Zalando. This task came with several interesting challenges, not least of which was ensuring that the new platform provided not just sufficient capacity and throughput for existing workloads, but also room for longer term growth - not just in data volumes and throughput, but also in the number, and types, of users of that data. The aim here was the democratization of data for all potential users on the new platform.
It was decided that this new platform would be primarily an event driven system - with data changes being streamed to consumers. These consumers would subscribe, receive, and process the data appropriately for their own needs - essentially inverting the flow of data, from the traditional “pull” based architectures, to a “push” based approach. With this, we were looking to strongly promote wide adoption of a “third generation microservices” architecture.
In an event driven system it is important that the outbound events themselves have at least equal importance to the data being managed by the system. The primary responsibility of the system is not just to manage the data, but also ensure a fully correct, and efficient, outbound event stream, as it is this event stream that is the primary source of data for the majority of clients of this system.
Starting with an API First approach, the event structure and definition were treated as much a part of the system’s API as the more traditional HTTP API being designed. Beyond just the definition of the events (as part of the API), key focus was placed on ensuring both correctness of the events (compared to any stored data, in addition to the sequence of changes made to that data), as well as efficient publishing of the stream of events. This Event First approach meant that any decisions around design or implementation were taken always with correctness, and efficiency, of the outbound event stream in primary focus.
Initially, we built a quick prototype of the data services - primitive CRUD-type services, with synchronous HTTP APIs, each interacting directly with a simple (dedicated) PostgreSQL database as the operational store for the data. Outbound events were generated after completion of DB updates.
For this prototype, a very simple HTTP-based mockup of an event delivery system was used, while we decided on the real eventing infrastructure that would be used.
Not only did this prototype allow us to quickly exercise the APIs (in particular the event definitions) as they were being constructed, it also allowed us to quickly identify several shortfalls with this type of synchronous service model, including:
- Dealing with multiple networked systems, especially around ensuring correct delivery of outbound events for every completed data change
- Ensuring concurrent modifications to the same data entities are correctly sequenced, so that the outbound events are delivered in the correct order
- Effectively supporting a variety of data-providing client types, from live low latency clients through to high volume bulk-type clients.
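The first of these shortfalls can be illustrated with a minimal sketch. It is not the prototype's actual code: `FlakyPublisher`, `update_article`, and the in-memory `db` dictionary are hypothetical stand-ins for the event delivery mockup, the CRUD service, and the PostgreSQL store. It only shows how a DB update followed by a separate publish call can leave the event stream behind the stored data.

```python
class FlakyPublisher:
    """Hypothetical stand-in for a networked event delivery system."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)  # call numbers that should fail
        self.calls = 0
        self.delivered = []

    def publish(self, event):
        self.calls += 1
        if self.calls in self.fail_on:
            raise ConnectionError("event delivery unavailable")
        self.delivered.append(event)


def update_article(db, publisher, article_id, data):
    """Synchronous CRUD-style write: store first, then publish."""
    db[article_id] = data                  # step 1: the DB update completes
    publisher.publish((article_id, data))  # step 2: can fail independently


db = {}  # in-memory stand-in for the operational store
publisher = FlakyPublisher(fail_on=[2])

update_article(db, publisher, "sku-1", {"color": "red"})
try:
    update_article(db, publisher, "sku-1", {"color": "blue"})
except ConnectionError:
    pass  # the caller sees an error, but the DB change is already stored

# The store and the event stream now disagree about the latest state.
latest_event = publisher.delivered[-1][1]
```

After the second call, the store holds the "blue" change while consumers have only ever seen "red". Closing that gap inside a synchronous service needs extra machinery (retries, transactions, or queue tables), which is part of what motivated the move described next.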
Throw away your prototypes
With these limitations in mind, we worked at moving from this synchronous service approach to an asynchronous approach, processing data using an Event Sourcing model. At the same time, we progressed in our selection of an eventing platform, and were looking strongly at Apache Kafka - its high throughput, guaranteed ordering (within a partition), at least once delivery semantics, strong partitioning, natural backpressure handling, and log compaction capability made a winning combination for dealing with the outbound events.
With this selection of Kafka as the outbound event platform, it was also a natural selection for the inbound data processing. Using Kafka for the inbound event source, the logic for processing the data became a relatively simple event processing engine. Much of the feature set that was valuable for outbound event processing was equally valuable for the inbound processing:
- High throughput allowing for fast data ingestion - HTTP submissions getting transformed to inbound events published to an internal topic - even with high acknowledge settings for publishing these events, submission times are generally in the order of single digit milliseconds per submitted event. By allowing clients to submit data, with fast, guaranteed, accepted responses, clients can safely proceed through their workload promptly - allowing for greater flow of information in general through the wider system.
- Guaranteed ordering - moving processing to event processing on a guaranteed ordered topic removed a lot of complexity around concurrent modifications, as well as cross-entity validations, etc.
- At least once delivery - With any network-oriented service, modelling data changes to be idempotent is an important best practice - it allows the same request/event to be reprocessed safely (in cases of retries or, with at least once delivery, repeated delivery). Having this semantic in place for both the inbound event source and the outbound event topic allowed the event processing logic to use coarse grained retries around various activities (e.g. database manipulations, accessing remote validation engines, audit trail generation, and of course, outbound event delivery). Removing the need for complex transaction handling allowed for much simpler logic, and as such, higher throughput in the nominal case.
- Natural Backpressure handling - with Kafka’s “pull” based semantics, clients process data at their own rate - there are no complex feedback/throttling interactions for clients to implement.
- Partitioning - using Kafka’s partitioning capabilities, the internal event source topics can be subdivided logically - some careful thought to select an appropriate partitioning key was required for some data services (especially those with interesting cross-entity validation requirements), but once partitioned, it allowed the processing logic of the application to be scaled effectively horizontally, as each partition can be processed without any involvement with any data in the other partitions.
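To make the ordering, at least once, and partitioning points above more concrete, here is a small self-contained sketch. It simulates partitions with plain lists rather than using a Kafka client. The `version` field, the CRC32-based `partition_for` function, and the partition count are assumptions made for illustration; the original services may achieve idempotency and key selection quite differently.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for this sketch


def partition_for(key: str) -> int:
    """Map a key to a partition so all events for one key share a partition.
    CRC32 is an illustrative stand-in for a real client's hash function."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def apply_event(store: dict, event: dict) -> bool:
    """Idempotent apply: skip events that are not newer than the stored version."""
    current = store.get(event["key"])
    if current is not None and current["version"] >= event["version"]:
        return False  # duplicate or stale delivery, safe to ignore
    store[event["key"]] = {"version": event["version"], "data": event["data"]}
    return True


events = [
    {"key": "sku-1", "version": 1, "data": {"color": "red"}},
    {"key": "sku-2", "version": 1, "data": {"color": "green"}},
    {"key": "sku-1", "version": 2, "data": {"color": "blue"}},
]

# Append each event to the partition chosen by its key.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for event in events:
    partitions[partition_for(event["key"])].append(event)

# Each partition is processed independently with its own store.
stores = [{} for _ in range(NUM_PARTITIONS)]
applied = 0
for log, store in zip(partitions, stores):
    # Replaying the whole partition simulates redelivery after a restart
    # that happened before the consumer's position was committed.
    for event in log + log:
        if apply_event(store, event):
            applied += 1

final_state = {key: value for store in stores for key, value in store.items()}
```

Because both "sku-1" events land in the same partition, they are applied in order, and the replayed copies are skipped by the version check. This is why a coarse grained "retry the whole step" policy is safe here, and why each partition can be handled by a separate worker.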
There were also several additional benefits to the use of Kafka for the event sources, including:
- As it was already a selected platform for the outbound events, there was no additional technology required for Event Source processing - the one tool was more than sufficient for both tasks - immediately reducing operational burden by avoiding different technologies for the two cases.
- Using the same technology for Event Source processing as well as Outbound Event Delivery led to a highly composable architecture - one application’s Outbound event stream became another application’s inbound Event Source. In conjunction with judicious use of Kafka’s Log Compacted Topics, to act as a complete snapshot, bringing in new applications “later” was not a problem.
- By building a suite of asynchronous services and applications all around an event sourcing and delivery data model, identifying bottlenecks in applications became much simpler - monitoring the lag of any given application in processing its event source makes bottlenecks much clearer - allowing us to quickly direct efforts to the hotspots.
- Coordinating event processing, retries, etc. - it was possible to minimise the interaction with underlying operational databases to just the data being processed - no large transactional handling, no additional advisory (or otherwise) locking, no secondary “messaging” queue tables, etc. This allowed much simpler optimisation of these datastores for the key operational nature of the services in question.
- Processing applications could be, and several already have been, refactored opaquely to process Batches of events - allowing for many efficiencies that come with batch processing (e.g. bulk operations within databases, reduced network costs, etc.) - this could be done naturally with Kafka as the client model directly supports event batches. Adding batch processing in this way ensures all applications get the benefits of batch processing without impacting client APIs (forcing clients to create batches), and also without loss of low latency under “bursty” loads.
- Separation of client data submissions from data processing allows for (temporary) disabling of the processing engines without interrupting client data requests - this allows for a far less intrusive operational model for these applications.
- A coarse grained event sourcing model is much more amenable to a heterogeneous technology ecosystem - using “the right tool for the job” - for example, PostgreSQL for operational datastores, Solr/ElasticSearch for search/exploratory accesses, S3/DynamoDB for additional event archival/snapshotting, etc. - all primed from the single eventing platform.
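Two of the ideas in this list, log compacted topics as a snapshot for late-joining applications and lag as a bottleneck signal, can be sketched in a few lines. This is a simplified model and not Kafka's implementation: `compact` and `lag` are hypothetical helper names, the log is a plain list, and delete markers are dropped immediately, whereas a real broker compacts in the background and keeps them for a configurable period.

```python
def compact(log):
    """Keep only the latest record per key, in offset order.
    A record whose value is None acts as a delete marker."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    return sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )


def lag(log_end_offset, committed_offset):
    """Number of records a consumer has not yet processed."""
    return log_end_offset - committed_offset


log = [
    ("sku-1", "red"),
    ("sku-2", "green"),
    ("sku-1", "blue"),
    ("sku-2", None),   # sku-2 deleted
    ("sku-3", "black"),
]

# A new application can rebuild current state from the compacted view
# without reading every historical change.
snapshot = compact(log)
new_app_state = {key: value for _, key, value in snapshot}

# A consumer that has committed offset 3 still has two records to process.
current_lag = lag(log_end_offset=len(log), committed_offset=3)
```

A steadily growing lag for one application, while others keep up on the same topic, points at that application as the hotspot.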
Today, and Moving Forward
Today, we have a suite of close to a dozen loosely coupled event driven services and applications - all processing data asynchronously, composed via event streams. These applications and services, built on a standard set of patterns, are readily operated, enhanced, and further developed by anyone in our larger, and still growing, team. As new requirements and opportunities come up around these applications, and the very data itself, we have strong confidence and capability in growing this system as appropriate.
If you find the topics in this post interesting, and would enjoy these types of challenges, come join us - we're hiring!