Software failures are normal, and everybody should know it.
As a developer, ITOps, DevOps, … you should be informed of a failure before a herd of angry customers calls your company's customer support, complaining that they are losing money because your software does not work.
That’s one of the reasons why monitoring is really important. Another pretty good reason is to know when your infrastructure is on fire, but I would say that monitoring the infrastructure where your code runs is a pretty obvious thing to do.
As we wrote in the last post, at Stamp we have an event-driven architecture with quite a few events flying around, so we need to be sure that all the events are processed correctly.
We also take advantage of the published events to monitor features, so we can see whether they are working correctly and whether our customers are using them. This way we avoid running manual queries every few hours or constantly watching a dashboard.
Monitoring event failures
With lots of events flying around, guess what? Sometimes those events fail, or rather, the handlers processing them fail, and the event is placed in the error queue.
When an event is placed in the error queue somebody should be informed.
As we said, our event-driven architecture uses Azure Service Bus as the transport and Rebus as the framework that takes care of the event-driven details.
One of the first things we did when we started building our systems was to write an Azure Function that reads the events in the Azure Service Bus error queue. This function runs every couple of minutes, and if there is any event in the queue it sends a notification to a Slack channel. A couple of years later we also integrated OpsGenie, so if something is going really wrong, we can page people.
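To make the idea concrete, here is a minimal Python sketch of what one run of such a check could look like. It is not our actual Azure Function: the queue reader, the Slack notifier and the pager are passed in as plain callables, and the names (`FailedEvent`, `check_error_queue`) as well as the paging threshold are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FailedEvent:
    event_type: str  # e.g. "OrderPlaced"
    error: str       # error message recorded with the failed event


def check_error_queue(
    peek_failed_events: Callable[[], List[FailedEvent]],
    notify_channel: Callable[[str], None],
    page_on_call: Callable[[str], None],
    paging_threshold: int = 10,  # hypothetical value, not taken from the post
) -> Optional[str]:
    """One run of the scheduled check. Returns the message sent, if any."""
    failed = peek_failed_events()
    if not failed:
        return None  # empty error queue: stay quiet

    event_types = sorted({event.event_type for event in failed})
    message = f"{len(failed)} event(s) in the error queue: {', '.join(event_types)}"
    notify_channel(message)

    # Assumption: "really wrong" is modelled here as "many failed events".
    if len(failed) >= paging_threshold:
        page_on_call(message)
    return message
```

Passing the integrations in as callables keeps the decision logic (when to notify, when to page) easy to test without touching Slack or a pager.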
Every event which goes through our system is also sent to Fleet Manager, which allows us to retry or archive an event.
When an event is retried, it is placed back in the queue where the failure originated. All the handlers will process the event again. If one of the handlers fails again, the event is placed in the error queue and sent to Fleet Manager once more.
If the event is not part of a critical, high-availability flow, we can take our time to fix the problem that caused the failure and then retry the event.
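The retry behaviour can be illustrated with a toy in-memory model. This is a simplified sketch, not how Rebus or Fleet Manager are implemented: `TinyBus`, `dispatch` and `retry_all` are hypothetical names, and the "queue" is just a Python deque.

```python
from collections import deque
from typing import Callable, Deque, List

Handler = Callable[[dict], None]


class TinyBus:
    """Toy model: one event type, several handlers, one error queue."""

    def __init__(self, handlers: List[Handler]) -> None:
        self.handlers = handlers
        self.error_queue: Deque[dict] = deque()

    def dispatch(self, event: dict) -> bool:
        """Run the handlers in order; park the event if any of them fails."""
        try:
            for handler in self.handlers:
                handler(event)
        except Exception:
            self.error_queue.append(event)
            return False
        return True

    def retry_all(self) -> int:
        """Re-dispatch each parked event once; return how many succeeded."""
        succeeded = 0
        for _ in range(len(self.error_queue)):
            event = self.error_queue.popleft()
            if self.dispatch(event):
                succeeded += 1
        return succeeded
```

Note that `retry_all` goes through `dispatch` again, so a handler that already succeeded the first time is executed a second time on retry.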
Pretty cool stuff, but... let’s read this sentence again: “All the handlers will process the event again.” It means that all our handlers should be idempotent. If a handler is not idempotent, the service it calls should be.
Let’s take a simple example: we charge a customer when the OrderPlaced event is published. If neither the handler that makes the charge nor the external Payment Gateway is idempotent, your customer will be charged twice. Not cool.
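One common way to protect such a handler, sketched here in Python, is to remember which events have already been processed and to skip repeats. This is only an illustration under assumptions: it presumes every event carries a unique `event_id`, the gateway is an in-memory fake, and all class names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class OrderPlaced:
    event_id: str      # assumption: every event carries a unique id
    customer_id: str
    amount_cents: int


@dataclass
class FakePaymentGateway:
    """In-memory stand-in for a payment gateway (hypothetical)."""
    charges: Dict[str, int] = field(default_factory=dict)

    def charge(self, customer_id: str, amount_cents: int) -> None:
        self.charges[customer_id] = self.charges.get(customer_id, 0) + amount_cents


@dataclass
class ChargeCustomerHandler:
    gateway: FakePaymentGateway
    # A real system would keep this in durable storage and update it
    # atomically with the charge; a set is enough for the sketch.
    processed_ids: Set[str] = field(default_factory=set)

    def handle(self, event: OrderPlaced) -> bool:
        """Charge once per event id. Returns True if a charge was made."""
        if event.event_id in self.processed_ids:
            return False  # retried event that was already handled: do nothing
        self.gateway.charge(event.customer_id, event.amount_cents)
        self.processed_ids.add(event.event_id)
        return True
```

Handling the same `OrderPlaced` event twice then results in a single charge:

```python
handler = ChargeCustomerHandler(FakePaymentGateway())
event = OrderPlaced("evt-1", "cust-1", 500)
handler.handle(event)  # charges the customer
handler.handle(event)  # skipped, the event id is already known
```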
How you ensure that your handler is idempotent really depends on what that handler does. Maybe a topic for another post.
When we build a feature we usually have some kind of event being published because something happened in our systems.
We can use those events to keep an eye on how a feature is working, and whether it is really working.
For example, in our platform, when a customer reaches a specific level we send a notification to a Slack channel.
This notification helped us identify quite a few bugs, for example when we charged a customer twice (idempotency is really important 🙃).
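A feature monitor of this kind can be very small. The Python sketch below shows the general shape, under assumptions: the event name `LevelReached`, its fields and the `notify` callable are hypothetical, and the real handler would post to Slack instead of calling a plain function.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LevelReached:  # hypothetical event name and fields
    customer_id: str
    level: int


def notify_on_level(
    event: LevelReached,
    watched_level: int,
    notify: Callable[[str], None],
) -> Optional[str]:
    """Send a notification only when the watched level is reached."""
    if event.level != watched_level:
        return None  # not the level we care about: stay quiet
    message = f"Customer {event.customer_id} reached level {event.level}"
    notify(message)
    return message
```

Because the handler only subscribes to an event that is already being published, adding this kind of monitoring does not require changes to the feature itself.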