Stop Managing Ops Incidents with Jira or Zendesk
In many ways, incident management for DevOps resembles typical issue tracking: it helps teams coordinate and collaborate on daily tasks. For this reason, tools such as Jira, Zendesk, and even email are often used for incident management. But incident management faces one challenge that sets it apart from other issue tracking processes: in addition to human-operated workflows, it relies heavily on machine-driven workflows. Unfortunately, traditional issue trackers and ticketing systems cannot accommodate this with their current product mechanics.
For example, consider the process of initiating an issue or a ticket. Unlike most engineering tasks, the vast majority of production incidents are triggered automatically. Companies depend on tools such as Nagios, New Relic, Splunk, and Pingdom to detect problematic symptoms. These symptoms are then analyzed, correlated, and grouped into incidents. Incidents are dynamic and short-lived, and they can occur dozens or hundreds of times a day. It is easy to see why opening each one by hand is too slow and error-prone.
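To make the grouping step concrete, here is a minimal sketch in Python. It assumes a very simple correlation rule: alerts on the same host that arrive within a few minutes of each other belong to the same incident. The names (Alert, Incident, group_alerts) and the five-minute window are hypothetical, chosen for illustration; real monitoring tools and incident platforms use their own data models and far richer correlation logic.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert, e.g. "nagios"
    host: str         # machine the symptom was observed on
    check: str        # name of the failing check
    timestamp: float  # seconds since the epoch


@dataclass
class Incident:
    host: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        # Timestamp of the most recent alert attached to this incident.
        return self.alerts[-1].timestamp


def group_alerts(alerts, window_seconds=300):
    """Group alerts into incidents (hypothetical rule: same host, close in time)."""
    incidents = []
    open_by_host = {}  # most recent incident per host
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = open_by_host.get(alert.host)
        # Start a new incident if the host has none yet, or if the last
        # alert on that host is older than the correlation window.
        if incident is None or alert.timestamp - incident.last_seen > window_seconds:
            incident = Incident(host=alert.host)
            incidents.append(incident)
            open_by_host[alert.host] = incident
        incident.alerts.append(alert)
    return incidents
```

Even a toy rule like this turns a stream of raw alerts into a much smaller set of incidents without anyone filing a ticket, which is the step a manual issue tracker cannot perform on its own.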
Another good example is operational insight. To correctly prioritize and route an issue, some context is required. What deployments and config changes occurred around the same time? When did the same issue occur in the past, and how often? Were critical applications or websites impacted? These questions, sadly, cannot be answered by a generic issue tracker.
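The first two questions can be sketched as simple lookups, assuming the platform already has a log of changes and a history of past incidents. Everything here is hypothetical: the function names, the dictionary shape of a change record, the (key, time) shape of a history entry, and the 30-minute window are illustrative assumptions, not any product's API.

```python
from datetime import datetime, timedelta


def changes_near(incident_time, changes, window=timedelta(minutes=30)):
    """Return changes (deployments, config edits) within `window` of the incident.

    Each change is assumed to be a dict with a "time" key holding a datetime.
    """
    return [c for c in changes if abs(c["time"] - incident_time) <= window]


def past_occurrences(incident_key, history):
    """Return the sorted times at which the same issue was seen before.

    `history` is assumed to be an iterable of (incident_key, datetime) pairs.
    """
    return sorted(t for key, t in history if key == incident_key)
```

The point of the sketch is that both answers come from data sources outside the ticket itself, which is why a generic issue tracker has nothing to query.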
A good incident management platform integrates intelligently with the company’s monitoring stack, providing real-time context for every incident. It consumes and leverages information residing in various infrastructure data hubs (e.g., a configuration management system such as Puppet or Chef). It delivers automation where appropriate, replacing critical yet inefficient manual processes.
It is not unusual to see companies implement their own incident management tools. Some start from scratch; others adapt existing issue trackers, augmenting them with scripts and hacking their code. In the next few years, I expect to see more and more companies migrate to out-of-the-box solutions specifically tailored to cope with the complexities of incident management.