Introduction
In this post, I’ll share some of the lessons learned from developing an event-driven architecture serving millions of users. I’ll also share some of the challenges that I or my teams have faced and how we overcame them.
Lessons Learned
The following are a set of recommendations for implementing an EDA, drawn from first-hand experience. While many of these recommendations are AWS-centric, the principles apply to any event-driven architecture and you can substitute your cloud vendor of choice.
Event Schema Versioning
Enable schema versioning/validation on EventBridge to ensure you are sending and receiving valid payloads.
Versioning your event payloads allows you to provide proper backwards compatibility. Keep in mind that given this is a decoupled architecture, you may never truly know who has subscribed to your events and thus you cannot be certain that making a change wouldn’t break existing implementations. Always version your changes, attempt to be backwards compatible, and enforce schema versioning.
Ensure the event payload itself carries a reference to the version of the schema it conforms to (I think EventBridge supports this out of the box).
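As a rough sketch, here is how a versioned payload might be published with boto3; the bus name (“orders-bus”), source, event name, and fields are assumptions for illustration, not a prescribed schema.

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical "order.created" event; the "version" field inside the detail
# lets consumers branch on schema changes without breaking.
detail = {
    "version": "2",               # schema version of this payload
    "orderId": "9f1c2d",          # lightweight identifiers, not full objects
    "status": "CREATED",
    "occurredAt": "2024-01-01T12:00:00Z",
}

events.put_events(
    Entries=[
        {
            "EventBusName": "orders-bus",     # assumed bus name
            "Source": "com.example.orders",   # assumed source
            "DetailType": "order.created",
            "Detail": json.dumps(detail),
        }
    ]
)
```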
Record your schema payloads somewhere in a GitHub repo for future reference. It will make it easier for the teams to work with the event bus as they can identify existing events they may want to subscribe to, and likewise immediately find opportunities to contribute to the event bus.
For each event you intend to emit, I’d recommend creating a README or similar text file that outlines the purpose of that event, e.g. when it should be emitted.
Ensure the example specifies which version is currently in use, and keep the prior versions for reference.
Most importantly, do not tailor your payloads/event schemas to meet the demands of specific services! It’s fine to add more detail to help a downstream consumer with a known use case, but avoid implementing specific payloads, fields, or configurations for specific downstream consumers. Otherwise, your number of events and schemas will balloon and you will find it nearly unmanageable.
Avoid any tight coupling e.g. do not specify a downstream target or consumer in any of your messages. You may sometimes be tempted to route messages this way, but I would instead recommend using EventBridge rules and basing the routing on the properties of the payloads.
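For illustration, here is a minimal boto3 sketch of content-based routing: the rule matches on properties of the payload rather than naming a consumer. The rule name, pattern fields, and queue ARN are hypothetical.

```python
import json
import boto3

events = boto3.client("events")

# Route on what the event *is*, not on who should receive it.
events.put_rule(
    Name="created-orders",
    EventBusName="orders-bus",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail-type": ["order.created"],
        "detail": {"status": ["CREATED"]},
    }),
)

# Attach a consumer-owned queue as a target of the rule.
events.put_targets(
    Rule="created-orders",
    EventBusName="orders-bus",
    Targets=[{"Id": "billing", "Arn": "arn:aws:sqs:us-east-1:123456789012:billing-order-created"}],
)
```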
Preferably, have a single service own the emitting of events for a given domain e.g. a third party service would own all of the events that are emitted for third parties.
Fan out messages to separate SQS queues or SNS topics per consumer so that a backlog in one consumer’s queue does not cause congestion for the others.
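A hedged sketch of that fan-out, assuming one SQS queue per consuming team (rule name and ARNs are placeholders):

```python
import boto3

events = boto3.client("events")

# The same event goes to one queue per consuming service, so a slow
# consumer cannot back up everyone else.
events.put_targets(
    Rule="created-orders",
    EventBusName="orders-bus",
    Targets=[
        {"Id": "billing", "Arn": "arn:aws:sqs:us-east-1:123456789012:billing-order-created"},
        {"Id": "shipping", "Arn": "arn:aws:sqs:us-east-1:123456789012:shipping-order-created"},
    ],
)
```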
Set alerts for queue depth, i.e. the number of messages sitting in a queue at a given time, based on your expected throughput and quotas. Queue depth can be potentially “lethal” in production if not monitored.
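As an example of such an alert, a CloudWatch alarm on ApproximateNumberOfMessagesVisible might look like the following; the threshold, period, queue name, and SNS topic are assumptions to tune for your own workload.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when more than 1,000 messages sit in the queue across two 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="billing-order-created-depth",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "billing-order-created"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=1000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```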
Always put Dead Letter Queues (DLQs) in place to handle messages that fail to be processed or delivered.
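A minimal sketch of attaching a DLQ to an existing queue via a redrive policy; the maxReceiveCount of 5 and the queue URL/ARN are illustrative.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Messages that fail 5 receives are parked in the DLQ for inspection
# instead of being retried forever.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/billing-order-created",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:billing-order-created-dlq",
            "maxReceiveCount": "5",
        })
    },
)
```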
Always design for and assume failure - your systems will fail and your services will go offline. At a minimum, you will need to replace systems over time in production, and you cannot guarantee zero downtime.
Have consumers track the messages they have consumed, e.g. using DynamoDB. This tracker is sometimes known as a bookmark, cursor, or pointer. Effectively it says “the most recent message I successfully processed has this id.”
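One possible shape for such a bookmark, assuming a DynamoDB table named consumer-bookmarks keyed by consumer name; the conditional write keeps the pointer from moving backwards if an older message is redelivered.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
bookmarks = dynamodb.Table("consumer-bookmarks")  # assumed table name

def record_processed(consumer: str, event_id: str, occurred_at: str) -> None:
    """Store the last successfully processed event for this consumer."""
    bookmarks.update_item(
        Key={"consumer": consumer},
        UpdateExpression="SET lastEventId = :id, lastOccurredAt = :ts",
        # Only advance the bookmark; a ConditionalCheckFailedException here
        # simply means we already saw a newer event.
        ConditionExpression="attribute_not_exists(lastOccurredAt) OR lastOccurredAt <= :ts",
        ExpressionAttributeValues={":id": event_id, ":ts": occurred_at},
    )
```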
Consider enabling event replay on your EventBridge instance. Event replay (if you have implemented the other steps prior to this) will allow you to re-run all of your events through your downstream consumers in order to catch them up in case of failure. I strongly encourage you to implement the cursor/pointer pattern for consumers if you intend on resending events in the scenario of a failure.
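Roughly, archiving and replay with boto3 might look like this; the archive name, retention period, time window, and ARNs are all placeholders.

```python
from datetime import datetime, timezone
import boto3

events = boto3.client("events")

# 1) Archive everything that flows through the bus.
events.create_archive(
    ArchiveName="orders-bus-archive",
    EventSourceArn="arn:aws:events:us-east-1:123456789012:event-bus/orders-bus",
    RetentionDays=30,
)

# 2) Later, replay a time window back onto the bus so consumers
#    (armed with their bookmarks) can catch up after an outage.
events.start_replay(
    ReplayName="orders-outage-2024-01-01",
    EventSourceArn="arn:aws:events:us-east-1:123456789012:archive/orders-bus-archive",
    EventStartTime=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    EventEndTime=datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc),
    Destination={"Arn": "arn:aws:events:us-east-1:123456789012:event-bus/orders-bus"},
)
```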
Plan ahead and agree on your message/event retention periods, DLQ retry counts, queue depths, etc. You will want to have guiding principles for these items before you proceed with a mass rollout or you will find inconsistencies quickly grow.
Likewise, plan ahead and outline the approach for how new events and new versions of event schemas will be raised to the department. Be sure to highlight and segregate the responsibilities of each service and its associated payloads.
Keep your payloads (your event schemas) as lightweight as possible - at scale you will sometimes find yourself sending millions of messages within just a few hours, which can quickly cause congestion or, at a minimum, weigh down your logging. Keeping message schemas light also simplifies debugging, as you won’t have massive, overly verbose log statements to read through.
Consider the sensitivity of the data in your events and, wherever possible, limit the sensitive data provided. Keep in mind that, just like with your logs, if your event messages contain PII and you have a queue or retention policy with an extended lifespan, that sensitive data will be present for quite some time.
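To make the last two points concrete, here is a before/after sketch of the same hypothetical order event: the lean version carries identifiers and just enough context to route and correlate, and consumers fetch full (potentially sensitive) details from the owning service when they need them.

```python
# Heavier payload - duplicates data owned elsewhere and carries PII:
verbose_detail = {
    "version": "2",
    "customer": {"name": "Jane Doe", "email": "jane@example.com"},
    "items": [{"sku": "A-100", "qty": 2}, {"sku": "B-200", "qty": 1}],
}

# Leaner payload - identifiers plus routing/correlation context only;
# consumers call the owning service for full details if required.
lean_detail = {
    "version": "2",
    "orderId": "9f1c2d",
    "customerId": "c-184",
    "status": "CREATED",
    "occurredAt": "2024-01-01T12:00:00Z",
}
```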
Implement a distributed logging/debugging/tracing tool like X-Ray or Jaeger in order to properly troubleshoot EDA-oriented systems. Without tracing, you will find your engineering teams (particularly if they own only a subset of services) spending hours upon hours troubleshooting communication breakdowns.
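As a small example of wiring up tracing in a Python consumer (assuming X-Ray rather than Jaeger, the aws-xray-sdk package, and active tracing enabled on the function):

```python
import boto3
from aws_xray_sdk.core import patch_all, xray_recorder

# Patch the AWS SDK so downstream calls (SQS, DynamoDB, etc.) appear as
# segments in the X-Ray trace.
patch_all()

table = boto3.resource("dynamodb").Table("consumer-bookmarks")  # assumed table

def handler(event, context):
    # Custom subsegments make the business-logic steps visible in the trace timeline.
    with xray_recorder.in_subsegment("process-order-created"):
        pass  # ... process the message, update the bookmark, etc.
```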
I would highly recommend that your consumers have dedicated Lambdas/API endpoints for consuming messages and processing them within their own context. If you implement this abstraction layer per message, the consumer can capture and transform the incoming payload to fit its desired context, further decoupling the applications and creating a sort of API middleware between the event bus and the business logic of your app.
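A sketch of what that per-message abstraction layer might look like in an SQS-triggered consumer Lambda; the OrderCreated model and field names are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass
class OrderCreated:
    """The consumer's internal view of the event - only the fields it needs."""
    order_id: str
    occurred_at: str

def translate(detail: dict) -> OrderCreated:
    # Map the bus-level schema onto the internal model; if the schema version
    # changes, only this translation layer needs to know.
    return OrderCreated(order_id=detail["orderId"], occurred_at=detail["occurredAt"])

def handler(event, context):
    # An SQS-triggered Lambda receives a batch of records; each body holds
    # the EventBridge envelope, whose "detail" is our payload.
    for record in event["Records"]:
        envelope = json.loads(record["body"])
        order = translate(envelope["detail"])
        # ... hand `order` to the service's own business logic
```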
Please enable First In First Out (FIFO) ordering in order to ensure proper chronological processing of data. Yes, enabling FIFO will incur a small amount of latency, but the majority of workflows (use cases) benefit from having it enabled. FIFO also helps to ensure data consistency across services - otherwise you cannot guarantee that updates are being applied in the correct order - and if you ever need to replay messages, having FIFO enabled will be a life saver.
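For instance, with SQS the FIFO behaviour comes from the .fifo queue type plus a MessageGroupId, so grouping by (say) order id keeps each order’s updates sequential; the names below are illustrative.

```python
import boto3

sqs = boto3.client("sqs")

# FIFO queues require the .fifo suffix; content-based deduplication lets SQS
# drop accidental duplicates within its deduplication window.
queue = sqs.create_queue(
    QueueName="billing-order-created.fifo",
    Attributes={"FifoQueue": "true", "ContentBasedDeduplication": "true"},
)

# Messages sharing a MessageGroupId are delivered strictly in order,
# so grouping by order id keeps each order's updates sequential.
sqs.send_message(
    QueueUrl=queue["QueueUrl"],
    MessageBody='{"version": "2", "orderId": "9f1c2d", "status": "CREATED"}',
    MessageGroupId="order-9f1c2d",
)
```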
Explore the topic of event sourcing for data persistence. Event Sourcing is an intermediate to advanced topic and many teams entering into the realm of EDA struggle with it initially. A possible alternative is to simply use the details from the last event processed.
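Purely as a toy illustration of the idea (not a production event store): current state is rebuilt by folding the stored events, in order, into a view.

```python
from dataclasses import dataclass, field

@dataclass
class OrderState:
    status: str = "NEW"
    history: list = field(default_factory=list)

def apply(state: OrderState, event: dict) -> OrderState:
    """Fold one event into the current state; the event log is the source of truth."""
    state.history.append(event)
    if event["type"] == "order.created":
        state.status = "CREATED"
    elif event["type"] == "order.shipped":
        state.status = "SHIPPED"
    return state

# Rebuild state by replaying the stored events in order.
stored_events = [
    {"type": "order.created", "orderId": "9f1c2d"},
    {"type": "order.shipped", "orderId": "9f1c2d"},
]
state = OrderState()
for e in stored_events:
    state = apply(state, e)
print(state.status)  # SHIPPED
```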