Best Practices for Your Incident Management Process
If you work in IT Ops, you’ve probably been on the receiving end of a tsunami of Nagios alerts. It’s not pleasant.
What happens when an IT outage is followed by hundreds of Nagios alerts?
During these alert floods, alerting becomes practically useless. Many operations teams go as far as ignoring their Nagios alerts entirely during major outages, focusing instead on the one alert that appears most urgent at that moment.
Frustrated with noisy alerts, many people these days are shopping for an alternative to Nagios: something that will give them deep, granular visibility across their infrastructure without all the noise. However, the switching costs are very high, and most Nagios alternatives don't offer a fundamental improvement along the spectrum of granular visibility versus noisy alerts.
But it is actually possible to eliminate alert floods altogether, without ripping out Nagios. More and more companies now use alert correlation on top of Nagios to fight alert overload and improve their production health.
Why do alert floods happen at all? And why are things only getting worse every year?
The roots of the problem actually lie in a positive aspect of alerting: automation. It is unthinkable to manually parse through hundreds of thousands of metrics looking for unhealthy values. Nobody expects their infrastructure operators to stare at charts and point out things such as “CPU load on host #472 is awfully high” or “access time to our billing app from France appears to be a bit off.”
By configuring thresholds, we delegate this work to Nagios. Nagios goes through all of our checks, looking for metrics that pass thresholds, and alerts us when necessary. As stated, this is a good thing that helps us scale effectively.
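In Nagios terms, this delegation is just a service definition with warning and critical thresholds. A minimal sketch, modeled on the stock sample configuration (the host name and threshold values here are illustrative):

```
define service {
    use                   generic-service
    host_name             db-01
    service_description   CPU Load
    check_command         check_local_load!5.0,4.0,3.0!10.0,8.0,6.0
}
```

Here `check_local_load` wraps the standard `check_load` plugin: Nagios raises a WARNING when the 1/5/15-minute load averages exceed 5.0/4.0/3.0 and a CRITICAL when they exceed 10.0/8.0/6.0, with no human watching the charts.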
Why, then, does such a big time-saver suddenly become a productivity-killer during major outages? The answer lies in the architecture of modern applications. Services are now distributed across many hosts that scale elastically, and every host and service carries its own set of checks. When something breaks, a single root cause trips thresholds on dozens of interdependent checks at once, and Nagios dutifully fires an alert for each one of them.
Of course, the evolution described above is a great thing. The goal of alert correlation is never to undo or slow down these trends. Instead, it seeks new, better ways to handle alerts, ways that work with this new reality rather than against it.
What is Alert Correlation and how can it help? The best way to understand it is to look at an example.
Consider a MySQL cluster with 25 hosts. Some of these hosts have been experiencing high page-fault rates, and a few others have complained about low free memory. In 30 minutes, more than 20 individual alerts came in. Your Nagios dashboard now looks like a circus, and your email inbox looks even worse.
There is a better way to look at these alerts. In this particular case, we would have preferred to see just a single incident, one that groups together all of the cluster's memory and page-fault alerts, allowing us to stay in control even during the alert flood.
Furthermore, by correlating these alerts together, we can easily distinguish between the alerts belonging to this incident and other, similar alerts, such as storage issues on the MySQL nodes or a global connectivity issue in the datacenter. Alerts such as these often drown in the alert flood.
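The grouping described above can be sketched in a few lines. This is not BigPanda's actual algorithm, just a minimal illustration of the idea, assuming each alert is a dict with hypothetical `cluster`, `metric`, and `time` fields, and that related metrics and the 30-minute window are chosen up front:

```python
# A toy time-based alert correlator: alerts that share a cluster and a
# related metric family, and arrive close together, join one incident.
from datetime import datetime, timedelta

# Metrics we treat as symptoms of the same underlying problem
# (an illustrative assumption, not a BigPanda configuration).
RELATED = {"page_fault_rate": "memory", "free_memory": "memory"}

def correlate(alerts, window=timedelta(minutes=30)):
    """Group alerts into incidents keyed by cluster and metric family."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        family = RELATED.get(alert["metric"], alert["metric"])
        for inc in incidents:
            # Join an existing incident if it matches and is still fresh.
            if (inc["cluster"] == alert["cluster"]
                    and inc["family"] == family
                    and alert["time"] - inc["alerts"][-1]["time"] <= window):
                inc["alerts"].append(alert)
                break
        else:
            incidents.append({"cluster": alert["cluster"],
                              "family": family,
                              "alerts": [alert]})
    return incidents
```

Feeding the cluster's 20-plus memory and page-fault alerts through `correlate` yields one incident, while an unrelated storage alert on the same nodes lands in an incident of its own instead of drowning in the flood.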
Alert correlation is a method of grouping highly-related alerts into one high-level incident. To do this, it addresses three main parameters:
Are there other ways to combat alert floods? One method commonly attempted by companies is alert filtering. Monitoring engineers define custom dashboards limited to a small set of alerts, designated as high-severity or sev-1 alerts. Such a dashboard is expected to be considerably less noisy than a full dashboard.
However, there are two major problems with alert filtering. First, it introduces a blind spot into your operational visibility. Low-severity alerts are often precursors to high-severity ones: a CPU load issue might quickly evolve into a full outage, and by ignoring low-severity issues you risk reacting to alerts only after they are already impacting your production. The second problem with filtering is that filtered dashboards become very noisy, very quickly. Looking at the MySQL example above, you would probably want to see all of the page-fault rate alerts in your high-severity dashboard. So even after eliminating the low-memory alerts, you are still stuck with thirteen alerts in your new dashboard.
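The second problem is easy to see in miniature. In this sketch, the alert counts and field names are made up for illustration, echoing the MySQL example above:

```python
# Illustrative alert flood: 13 high-severity page-fault alerts and
# 8 low-severity free-memory alerts (the counts are hypothetical).
alerts = ([{"metric": "page_fault_rate", "severity": "high"}] * 13
          + [{"metric": "free_memory", "severity": "low"}] * 8)

# A severity filter drops the low-severity alerts...
filtered = [a for a in alerts if a["severity"] == "high"]

# ...but every high-severity alert still lands on the "quiet" dashboard.
print(len(filtered))  # prints 13
```

Filtering shrinks the flood but does not group it: thirteen rows describing one underlying problem is still a flood.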
In contrast, with alert correlation you avoid alert floods without losing visibility. Once a company adopts alert correlation, it no longer needs a high-severity dashboard at all.
BigPanda is an alert correlation platform optimized for Nagios. It consumes your Nagios alerts in real time and uses an intelligent algorithm to process and correlate them. The BigPanda dashboard is a cloud-based application that presents all of your Nagios alerts grouped together into high-level incidents.
Among the benefits of using BigPanda are:
Learn more and try it free at bigpanda.io.