Get everyone on the same page… literally
Whether we run traditional operations with a 24x7 NOC and well-documented runbooks, or embrace DevOps-style practices with cross-functional teams and highly iterative methodologies, we all face one problem: a growing disconnect between our monitoring systems, the alerts they fire, and the processes we use to handle operational issues. We log incidents in a ticket, but are the people working on that ticket aware of the real-time status of the underlying incident?
This doesn’t matter much for help desk tasks like resetting passwords, ordering replacement hardware, or fixing a user’s phone. But in today’s complex environments with multi-level monitoring stacks, keeping teams in sync with service issues as they unfold is a real challenge.
Take a typical operations workflow as an example. A critical alert comes in as an email, a text message, or perhaps a dashboard indicator, and we begin the defined or ad hoc process for handling that situation: probably opening a ticket and then investigating the event directly. We might refer to a playbook, open a terminal session, inspect some graphs, or run specific diagnostic tools, whatever the situation calls for. If we can’t handle it ourselves, we bring in other people through a routing or escalation path.
But the underlying incident keeps unfolding, and because we use quite a few different monitoring systems, more alerts for the condition we’re already investigating keep flooding our inbox or blowing up our phone. One customer told me recently that his iPhone, set to vibrate, would practically buzz itself off his desk while he was trying to investigate the event in real time, even though he was already using a third-party alert suppression service.
BigPanda came up with a simple and elegant solution to this problem. As alerts come in, we already group them into a higher-level container we call an “incident”, which is the core of how BigPanda works. Once you’ve started working on the actual issue, the incoming alerts are all symptoms of the same outage that’s already being handled, so we collect them into a real-time status page that serves as a trusty aide throughout investigation and resolution. This page doesn’t require a BigPanda login, so you can go straight to it with no intervening clicks or forms. Keep it open on your mobile device or in a browser window as your reference point for the latest information on the issue, and watch the status of the alerts change as you work on resolving the disruption. It brings a bit of order to the chaos of consoles, dashboards, log viewers, and other resources you might otherwise have scattered around. That frustrating feeling of “Yes, I’m already working on it!” fades nicely, and so does the stress.
After the incident is resolved, BigPanda preserves all of this activity in the Incident Timeline. What a great way to do a postmortem on the event! All of the alerts that affected your systems during the disruption are laid out clearly. The story of what went down, including every alert and its state changes for the entire duration, is there in one chart. That removes the need to gather and organize all the emails and other assets you would normally have to collect for a retrospective.
The Incident Details page is essentially an extension of BigPanda’s main Incident Management Console, devoted to a single incident. Whenever you share an incident, whether by email, text message, service desk ticket, or chat channel, the share includes a link to the Incident Details page, so everybody stays in sync.
Pretty cool stuff, no? So feel free to close all those windows you don’t need while you focus on getting things back to normal. Just make a nice little sidebar out of the BigPanda Incident Details page and devote your attention to what you do best: restoring order and getting the business back on track!