In the last post we said that all our services publish events. What does that mean?
It means that when something changes in the domain of one of our services, an event is published to notify all the other services interested in that specific type of change.
Ok, and what’s an event-driven architecture?
It’s a software architecture where the different services of our software solution produce events when something “noticeable” happens, and also consume events produced by other services.
In an event-driven architecture we have two concepts:
- Publisher (or Producer): a service which publishes an event after something happens in its domain
- Subscriber (or Consumer): a service which is interested in a specific event and subscribes to it
Let’s look at an example. In our domain a customer has an identity document and she can update it. When the customer updates her identity document, we want to sync some data with Intercom and Mailchimp.
What happens here is that when the customer updates her identity document, Service1 publishes the event CustomerUpdatedIdentityDocument. Intercom Sync Manager and Mailchimp Sync Manager are two services subscribed to that event, and they take care of communicating with Intercom and Mailchimp respectively.
How are the events passed around?
In our case Azure Service Bus takes care of this dirty job, but you can use any other messaging system that guarantees events are not lost when there is a failure.
Azure Service Bus provides all the gears to build an event-driven architecture, but you still need to write quite a lot of complicated code on top of its SDK to put those gears together and get the best out of its features. We did not write any of that code ourselves: we let Rebus take care of all the complicated stuff.
Azure Service Bus has queues and topics, which Rebus uses to provide the publisher/subscriber model. To keep it simple, let’s just say that when an event is published by a service, a copy of that event is placed in the queue of every service subscribed to that event.
The queue stores the event, so if the subscriber is down or not running, the event is not lost. When the subscriber is up and running again, it processes all the events waiting in its queue.
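As a minimal sketch of what this wiring looks like (assuming the Rebus and Rebus.AzureServiceBus NuGet packages; the connection string, queue name and event type below are made up, and the exact API may vary between Rebus versions):

```csharp
using System;
using Rebus.Activation;
using Rebus.Config;

// Placeholders: the real connection string and queue name are omitted.
var activator = new BuiltinHandlerActivator();

var bus = Configure.With(activator)
    .Transport(t => t.UseAzureServiceBus(
        "Endpoint=sb://...",           // Azure Service Bus connection string
        "mailchimp-sync-manager"))     // this service's input queue
    .Start();

// Ask for a copy of every CustomerUpdatedIdentityDocument event in our queue.
await bus.Subscribe<CustomerUpdatedIdentityDocument>();

public record CustomerUpdatedIdentityDocument(Guid CustomerId);
```

Behind the scenes, the transport uses a topic per event type and forwards a copy of each published event to every subscribed queue.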
What’s so cool about having an event-driven architecture?
- Asynchronous communication
- Loose coupling
Let’s use the example from before to go into the details of each benefit. Without an event-driven architecture, we would probably have something like this to sync the updated data with Intercom and Mailchimp:
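Something along these lines, for example (all the repository and client types here are illustrative, not our real code):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative types: the real service obviously looks different.
public record IdentityDocument(string Number, DateTime ExpiryDate);

public interface ICustomerRepository
{
    Task SaveIdentityDocument(Guid customerId, IdentityDocument document);
}

public interface IIntercomClient
{
    Task SyncIdentityDocument(Guid customerId, IdentityDocument document);
}

public interface IMailchimpClient
{
    Task SyncIdentityDocument(Guid customerId, IdentityDocument document);
}

public class CustomerService
{
    private readonly ICustomerRepository _repository;
    private readonly IIntercomClient _intercom;
    private readonly IMailchimpClient _mailchimp;

    public CustomerService(ICustomerRepository repository,
        IIntercomClient intercom, IMailchimpClient mailchimp)
    {
        _repository = repository;
        _intercom = intercom;
        _mailchimp = mailchimp;
    }

    public async Task UpdateCustomerIdentityDocument(Guid customerId, IdentityDocument document)
    {
        await _repository.SaveIdentityDocument(customerId, document);

        // The method has to know about (and wait for) every downstream system,
        // and handle every partial-failure combination itself.
        await _intercom.SyncIdentityDocument(customerId, document);
        await _mailchimp.SyncIdentityDocument(customerId, document);
    }
}
```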
instead of just this:
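For example (again with illustrative types; `IBus` is Rebus's bus interface):

```csharp
using System;
using System.Threading.Tasks;
using Rebus.Bus; // Rebus's IBus

// Illustrative types.
public record IdentityDocument(string Number, DateTime ExpiryDate);
public record CustomerUpdatedIdentityDocument(Guid CustomerId);

public interface ICustomerRepository
{
    Task SaveIdentityDocument(Guid customerId, IdentityDocument document);
}

public class CustomerService
{
    private readonly ICustomerRepository _repository;
    private readonly IBus _bus;

    public CustomerService(ICustomerRepository repository, IBus bus)
    {
        _repository = repository;
        _bus = bus;
    }

    public async Task UpdateCustomerIdentityDocument(Guid customerId, IdentityDocument document)
    {
        await _repository.SaveIdentityDocument(customerId, document);

        // Fire and forget: we don't know or care who is subscribed.
        await _bus.Publish(new CustomerUpdatedIdentityDocument(customerId));
    }
}
```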
Without events, UpdateCustomerIdentityDocument would have to:
- Save the data
- Sync the new data with Intercom
- Sync the new data with Mailchimp
What are we going to do if Intercom returns an error, or the Mailchimp API is under high load and the request times out? And what if we manage to sync the new data with Intercom but not with Mailchimp? We would have to handle all these cases, and trust me, it’s a mess.
Using events, once the method saves the data to the database it publishes an event, and it does not care about who is going to handle that event or what the subscribers are going to do with it. The service just does what it needs to do (save the new data), then it fires an event and forgets about it.
Let’s say that we want to send an SMS to the customer every time she updates her identity document (not sure why we would want to do that, but it works for the sake of the example).
If the UpdateCustomerIdentityDocument method did not publish an event, we would have to update the method and add a call to the Twilio API to send the SMS.
Does it make sense for UpdateCustomerIdentityDocument to know it needs to send an SMS to the customer? And what if the customer did not provide her phone number, but only her email address? Should it handle all this logic? No, not really...
UpdateCustomerIdentityDocument just does its job: it stores the data in the database. That’s it. It doesn’t care that part of the customer’s identity document data is used somewhere else, or that updating the customer’s identity document triggers a hundred other actions.
It just does its job, nothing more.
How do we send the SMS to the customer? Simple: we create a new handler for the CustomerUpdatedIdentityDocument event and let the handler take care of all the nitty-gritty logic.
A handler in our case (as we said, we use Rebus) is just a class which implements an interface specifying which event that class wants to handle. That’s it: we expanded our system without touching UpdateCustomerIdentityDocument.
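With Rebus that interface is `IHandleMessages<TEvent>`. A hedged sketch of what the SMS handler could look like (the event, repository and SMS sender types are illustrative, and the SMS sender would wrap the Twilio API):

```csharp
using System;
using System.Threading.Tasks;
using Rebus.Handlers; // Rebus's IHandleMessages<T>

// Illustrative types.
public record CustomerUpdatedIdentityDocument(Guid CustomerId);
public record Customer(Guid Id, string? PhoneNumber, string Email);
public interface ICustomerRepository { Task<Customer> Get(Guid customerId); }
public interface ISmsSender { Task Send(string phoneNumber, string text); }

public class SendSmsOnIdentityDocumentUpdated : IHandleMessages<CustomerUpdatedIdentityDocument>
{
    private readonly ICustomerRepository _repository;
    private readonly ISmsSender _sms; // e.g. a thin wrapper around the Twilio API

    public SendSmsOnIdentityDocumentUpdated(ICustomerRepository repository, ISmsSender sms)
    {
        _repository = repository;
        _sms = sms;
    }

    // Rebus calls this for every CustomerUpdatedIdentityDocument in our queue.
    public async Task Handle(CustomerUpdatedIdentityDocument message)
    {
        var customer = await _repository.Get(message.CustomerId);

        // The "does she even have a phone number?" logic lives here,
        // not in UpdateCustomerIdentityDocument.
        if (customer.PhoneNumber != null)
            await _sms.Send(customer.PhoneNumber, "Your identity document was updated.");
    }
}
```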
Ok, this really depends on how we deploy our services, but let’s say that Service1, Intercom Sync Manager and Mailchimp Sync Manager are deployed as 3 different Web Apps which can be scaled independently, each with its own queue.
Now, let’s say that our Marketing department planned to launch a new Facebook campaign tomorrow, and they expect around X thousand users to sign up and add or update their identity documents.
We can decide what to scale. If Service1 is not able to handle the load, we can just scale that Web App out/up (unless the problem is the database…).
If every call we make to Mailchimp takes 1 second, but we can send multiple requests at the same time and we want Mailchimp up to date as soon as possible, we can scale out the Web App hosting Mailchimp Sync Manager and end up with two instances processing the events placed in the same queue. Not enough? Mailchimp can take more load? Add another instance!
Code fails for whatever reason, and when it fails somebody should be informed about it and maybe do something.
As I wrote before, when we implement an event-driven architecture we should have a messaging service which does not lose events unless we ask it to. In our case, when a message fails it is moved to a queue called error, and it stays there until we take an action.
We can decide to retry the failed event (by placing it back in the queue of the service where it failed) or archive/delete it.
The fact that the failed event is not lost is amazing. It gives us time to fix the bug that caused the failure (if it was caused by a bug) and then retry the event.
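As an illustrative sketch (option and method names vary between Rebus versions, and the connection string and queue names here are placeholders), the number of delivery attempts before a message lands in the error queue can be configured when setting up the bus:

```csharp
using Rebus.Activation;
using Rebus.Config;
using Rebus.Retry.Simple;

// Sketch: assumes the Rebus and Rebus.AzureServiceBus packages.
var activator = new BuiltinHandlerActivator();

var bus = Configure.With(activator)
    .Transport(t => t.UseAzureServiceBus("Endpoint=sb://...", "service1"))
    .Options(o => o.SimpleRetryStrategy(
        maxDeliveryAttempts: 5,       // how many times to try a message
        errorQueueAddress: "error"))  // where failed messages end up
    .Start();
```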
What are the cons?
The benefits are pretty cool, but not all that glitters is gold. Here is a list of things you should keep in mind if you decide to go for an event-driven architecture.
- Events will fail and you need to be informed about it. Active monitoring is REALLY important
- All your handlers should be idempotent. A handler could process the same event multiple times (because the event failed and was retried, or just because of a glitch)
- The messaging service is the backbone of your system. You need to be sure it’s highly available and fast
- The messaging service could be expensive
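On the idempotency point, one common approach (a sketch, not necessarily what we did) is to track the IDs of already-processed messages and skip duplicates; in a real system the "seen" set would live in the database, inside the same transaction as the side effect:

```csharp
using System;
using System.Collections.Generic;

// Minimal idempotency sketch: remember which message IDs were already handled.
public class IdempotentHandler
{
    private readonly HashSet<Guid> _processed = new HashSet<Guid>();

    public int SideEffectCount { get; private set; }

    public void Handle(Guid messageId)
    {
        // Already seen? This is a retry or a duplicate delivery: do nothing.
        if (!_processed.Add(messageId))
            return;

        SideEffectCount++; // the actual work (API call, email, ...) goes here
    }
}
```

Processing the same message twice then leaves the system in the same state as processing it once.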
Was it a good choice?
I have to admit we had some hard times:
- Bugs (in our code) which caused events to be lost
- Non-idempotent handlers which caused some weird bugs
- The Azure Service Bus basic plan was unreliable and went down multiple times (we upgraded to the standard plan)
- Not enough monitoring: we found out about some failures two days after they happened
But, hell yes, it was a good choice:
- Thanks to the events we publish, we were able to develop some features in a really short amount of time
- Our customers are not affected much by failures: whatever can be handled later is asynchronous, and we have fallbacks for the stuff that fails badly
- We can easily test new ideas: we just need to subscribe to a few events and that’s it