Incident Management
Incident. Scary word, huh? You'd think so, but not at Infobip. Over the years we have come to see incidents as learning opportunities and a source of vast knowledge about the system we have engineered.
Taking a couple of steps back: As a software engineer, you get a task to develop a new product feature. You do it. You test it in the staging environment. It works as intended and you deploy it to production. Some time later it goes belly up. What now, panic-mode on? No, that's not how we do it.
Our Incident Management Process
If it sounds like there's some black magic behind this and that it's too complicated, it's not. In big systems like Infobip's, there are always two big dependencies: a butterfly (effect) and Murphy's (law). I'd bet you know them both, as well as their effects. Because of them, we created our Incident Management process. It's about sending a message (pun intended) as well as about building a transparent and healthy culture of reporting incidents, where the goal is to improve and grow.
It's not about hiding problems, because by hiding problems we're also hiding solutions, and that's where Murphy and the butterfly effect are waiting to bite us in the SLA.
A Straightforward Process
Now that we know why, it's time for the how:
Detect the incident -> Contact the troubleshooter of the affected service/product -> Mitigate the impact of the incident
Straightforward, right?
So straightforward that you might wonder why you should bother with a process at all. Because it's not only about the process. It's about solving the incident as soon as possible and providing a reliable service to our customers. Having a defined process speeds things up. Slow in incident detection? Improve your observability. Slow in contacting the dedicated troubleshooter? Improve your on-call schedules and escalation process. Slow in mitigating the impact and solving the incident? Improve your rollback and failover options.
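A simple way to find out which of those three steps is slowing you down is to timestamp them and look at the gaps. A minimal sketch of that idea in Python; the field names and timestamps are purely illustrative, not part of our actual tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Key timestamps of a single incident (illustrative field names)."""
    started_at: datetime    # when the problem actually began
    detected_at: datetime   # when an alert or a human noticed it
    engaged_at: datetime    # when the troubleshooter was reached
    mitigated_at: datetime  # when customer impact stopped

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_engage(self) -> timedelta:
        return self.engaged_at - self.detected_at

    @property
    def time_to_mitigate(self) -> timedelta:
        return self.mitigated_at - self.engaged_at


# Hypothetical incident: broke in the evening, noticed the next morning.
timeline = IncidentTimeline(
    started_at=datetime(2023, 5, 10, 18, 0),
    detected_at=datetime(2023, 5, 11, 9, 0),
    engaged_at=datetime(2023, 5, 11, 9, 8),
    mitigated_at=datetime(2023, 5, 11, 9, 34),
)
print(timeline.time_to_detect)    # big gap here -> observability problem
print(timeline.time_to_engage)    # big gap here -> on-call / escalation problem
print(timeline.time_to_mitigate)  # big gap here -> rollback / failover problem
```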
Overwhelmed? Don't be; we've got you covered. The SRE (Site Reliability Engineering) team can guide you through your post-incident reviews and assist with analyzing the incident. Suggesting improvements to make the process as smooth as possible and the provided service more reliable is also part of the deal. Knowing you have the SRE team as a form of safety net gives you some peace of mind and the ability to focus on fixing stuff. Once you know what's broken and how, it's (relatively) easy to fix it.
With big systems like Infobip's, the catch is not in fixing, it's in finding what to fix.
SRE can help with experience in (war-room) coordination and impact assessment, and with sharing best practices on how to address the issues. Regarding post-incident reviews, there's one quite simple technique which children use all the time, while adults keep forgetting about it or are annoyed by it: the five whys. Just keep asking 'why did that happen?' (usually five times will do the trick) until you have enough information to paint a picture of what happened and why. With that, we can all work towards improvements and learn from incidents, which is THE most important thing.
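To keep those answers from evaporating after the call, the chain of whys can be written down as plain data in the post-incident review document. A tiny illustrative sketch, loosely based on the RabbitMQ story later in this chapter (the real analysis lives in the review document itself):

```python
# The starting problem statement, followed by the answers to repeated "why?"
problem = "Traffic in one data center dropped to zero for about 15 hours."
whys = [
    "The RabbitMQ cluster died and did not recover on its own.",
    "The out-of-the-box RabbitMQ alerts never fired.",
    "The cluster's own metrics died together with the cluster.",
    "The independent SRE alert had no escalation policy attached.",
    "That alert was still in its testing phase.",
]

print(problem)
for depth, answer in enumerate(whys, start=1):
    print(f"{'  ' * depth}why #{depth}: {answer}")
```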
Sure, bureaucracy probably isn't on most of our top priority lists; however, it's bureaucracy that turns incidents into investments. Investments into us becoming better engineers. Investments into a more resilient system. Investments into a quality product. Investments into customer happiness. Take your pick.
Incident Management Process at Scale
Remember that straightforward process from earlier? I might have misled you a bit there, sorry, not sorry... It's not so straightforward when you're working at the scale Infobip works at. To refresh your memory from The Scale of Our Systems: we have 1600+ applications and 100+ teams owning them. Knowing their dependencies and ownership by heart takes years of Infobip experience, and we cannot rely on that. Couple that with hopping from one team's Slack channel to another during an ongoing incident, asking who owns some service, and you have a never-ending nightmare. Sure, we have documentation, an inventory and other ways to find the team owning a service, but that's not the (whole) point here. We didn't stop at solving only that; we also needed a way of streamlining communication during the incident. We not only needed to know the owner of a service, but also the team's Slack tag and Opsgenie schedule rotation, so we know who is on call from that team.
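Concretely, the question we want answered in one step is: given a service name, who owns it, what's the team's Slack tag, and who is on call right now? A rough sketch of such a lookup, assuming a hypothetical in-house service inventory (the hard-coded dict and names below are made up) and Opsgenie's who-is-on-call endpoint:

```python
import requests

# Hypothetical inventory entry; in reality this would come from the
# service inventory, not a hard-coded dict.
SERVICE_INVENTORY = {
    "sms-routing": {
        "team": "Team Messaging",
        "slack_tag": "@team-messaging",
        "opsgenie_schedule": "team-messaging_schedule",
    },
}

OPSGENIE_API = "https://api.opsgenie.com/v2"


def who_do_i_page(service: str, opsgenie_api_key: str) -> dict:
    """Resolve a service name to its owners and whoever is currently on call."""
    entry = SERVICE_INVENTORY[service]

    # Opsgenie's "who is on call" endpoint for a schedule, identified by name.
    response = requests.get(
        f"{OPSGENIE_API}/schedules/{entry['opsgenie_schedule']}/on-calls",
        params={"scheduleIdentifierType": "name"},
        headers={"Authorization": f"GenieKey {opsgenie_api_key}"},
        timeout=10,
    )
    response.raise_for_status()
    participants = response.json()["data"]["onCallParticipants"]

    return {
        "team": entry["team"],
        "slack_tag": entry["slack_tag"],
        "on_call": [p["name"] for p in participants],
    }
```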
Enter "SRE Cop" incident management bot. Yeah, as in "police officer", since the bot kind of does similar job. Coincidentally, we have SRE Cop as our Slack channel regarding SRE Community Of Practice topics. Before SRE Cop bot, we had a single incident management channel in which a single thread was dedicated for a single incident instance. Yes, sure, that’s perfect from perspective of having everyone at the same place, but the more people are involved, the communication flow gets tricky to follow. Spoiler alert, we’re still doing all of that, as we’re still gradually rolling out bot usage.
So, what do we gain by introducing a bot to an already fine-tuned process? Why the change, you may ask? Well, even though the incident management process was highly tuned and working like a charm, we don't like maintaining the status quo. Being content with the status quo means stagnating, and stagnating means things will get worse over time.
By introducing the bot, we're tackling, at a minimum, the following:
Automating the incident response steps as much as possible, i.e. all the steps that do not require human intelligence
Stakeholders can easily track incident status without reading the entire conversation
Slow in service owner identification -> it's easy and quick now: with a single command, the bot gives you all the information you need for the service you asked about. It does the searching for us.
Slow in contacting the on-call engineer -> coupled with the above, there's also a Slack tag provided based on the Opsgenie schedule rotation
Complex communication in a single Slack thread -> incidents are now reported via the bot
You didn't see that last one coming, did you? Yes, with a couple of clicks and, of course, some typing of a summary of what's going on to kickstart the incident management process, (all of) the following can happen (sketched in code right after this list):
A new Slack channel dedicated to the specific incident is created
The previous incident management channel receives summarized updates from the bot based on the dedicated channel
Various threads categorized by topic are created by default in the new channel
The incident reporter, the dedicated incident management support team and the on-call SRE are automatically added to the new channel
Incident metadata is stored by the bot, instead of being collected manually
And more to come…
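Under the hood, most of those steps boil down to a handful of Slack API calls. A minimal sketch of the kickoff flow using slack_sdk, with hypothetical channel and user identifiers; the real bot does quite a bit more, including storing the incident metadata:

```python
from datetime import datetime, timezone

from slack_sdk import WebClient

# Hypothetical default topics for the per-topic threads.
TOPIC_THREADS = ["Impact", "Timeline", "Mitigation", "Communication"]


def kickstart_incident(client: WebClient, summary: str, reporter_id: str,
                       support_ids: list[str], sre_oncall_id: str) -> str:
    """Create the dedicated incident channel and wire up the basics."""
    # 1. Dedicated channel for this specific incident.
    name = f"incident-{datetime.now(timezone.utc):%Y%m%d-%H%M}"
    channel = client.conversations_create(name=name)["channel"]["id"]

    # 2. Pull in the reporter, the incident management support team
    #    and the on-call SRE.
    client.conversations_invite(channel=channel,
                                users=[reporter_id, *support_ids, sre_oncall_id])

    # 3. Post the summary and seed the per-topic threads.
    client.chat_postMessage(channel=channel, text=f":rotating_light: {summary}")
    for topic in TOPIC_THREADS:
        client.chat_postMessage(channel=channel, text=f"Thread: {topic}")

    # 4. Let the previous, company-wide incident channel (name is made up here)
    #    know where to follow along.
    client.chat_postMessage(channel="#incident-management",
                            text=f"New incident reported: {summary} -> #{name}")
    return channel
```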
Based on a Real Incident at Infobip
Talking about incident management without some horror stories feels kind of unsatisfying, so here's one from the perspective of an SRE troubleshooter:
The working day was nearing its end. My focus was long gone, but I was trying to wrap up some tasks and close up shop for the day. Everything was fine.
Next morning: I haven't even finished my first coffee and I notice that one of our most important products in one of our biggest data centers looks fishy. All other products are fine; it's just this one that's weird. Like everything there stopped. Zero processed requests. Since yesterday. A zero in a big data center is never good; something is broken somewhere for sure. As I was only half awake at the time, I didn't trust myself, so I immediately contacted the team owning the service to check it out. Eight minutes later: "Yeah, our RabbitMQ cluster is dead". RabbitMQ did try to recover on its own, but failed due to Mnesia table issues at the same time. Let's just say that I no longer needed my 2nd or 3rd coffee like I usually do.
Talk about a wake-up call: we had stopped processing all traffic in one data center for about 15 hours and nobody noticed.
34 minutes after the incident was noticed, functionality was restored. The team manually redeployed the whole RabbitMQ cluster. Even our CTO went "Oouf".
This is where the horror story stops and the post-incident review fairy tale begins.
The whole team, an A-team (Architects) member and an SRE member got on a post-incident review call and investigated this down to the tiniest detail.
How come we missed such a big deal? Actually, the incident was caught in time by one of the SRE alerts. However, no escalation policy was set on it because it was still in the testing phase; it just sat there in some Slack channel until the next morning. Ouch, but thankfully an easy and quick fix: write a runbook for it, add an escalation policy and put it in production. Alert testing: successful.
What about other alerts already in production? How come the out-of-the-box RabbitMQ alerts didn't trigger? As it turns out, when the cluster died, so did its metrics. Note to self: don't rely on a system to tell you itself that it's unhealthy. Or do, but have a backup for it.
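That backup can be as simple as an external check that fires when the metrics themselves disappear. A rough sketch against the standard Prometheus HTTP API, with an illustrative rabbitmq_up metric, label and address; PromQL's absent() exists for exactly this "it stopped reporting" case:

```python
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # hypothetical address


def metrics_have_vanished(query: str = 'absent(rabbitmq_up{dc="dc1"})') -> bool:
    """Return True if the queried metric is no longer being reported at all."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query},
        timeout=10,
    )
    response.raise_for_status()
    # absent() returns a single series (value 1) only when the metric is missing.
    return bool(response.json()["data"]["result"])


if metrics_have_vanished():
    # In real life this would page someone via an escalation policy,
    # not print to stdout.
    print("RabbitMQ stopped reporting metrics - the cluster may be dead.")
```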
Then comes the big question: why did the RabbitMQ cluster die? Unfortunately, to this day, we don't know, but we prepared as if it's going to happen again. Alerts were improved. Runbooks and documentation written. Additional (Prometheus) metrics added. Procedures expanded to include Support teams, enabling faster reaction.
And it works. We're getting faster at detecting incidents as well as at mitigating their impact. Knock on wood.
Wrapping up with a quote from our SRE team: We don't know who you are. We don't know what you want. If you are looking for magic answers, we don't have any... But what we do have are very particular sets of skills, skills we have acquired over very long careers, skills that make us a nightmare for problems and incidents. If you have no problem now, that'll be the end of it. We will not look for you. We will not pursue you. But if you do, we will look for you, we will find you, and we will help you.
Point taken?
Sample template of our Post-Incident Review document.