Automating Incident Management

Posts about the incident management workflow: detection, triage, root cause analysis, escalation, and resolution.

Getting Started with BigPanda – The Incident Feed

May 4th, 2015 | Blog

BigPanda is an incident management platform for IT, NOC, and DevOps teams. Organize, prioritize and triage your incidents faster and more intelligently than ever before. Vastly improve your team's collaboration around Ops alerts and events. The following guide is the first in our series on getting started with BigPanda's incident feed. This BigPanda product introduction will help you to get up and running quickly so you can get back to fixing the world's broken stuff.

The new Alerts REST API from BigPanda

September 4th, 2014 | Blog

CONNECT ALL THE THINGS! Here at BigPanda we are constantly working on adding new monitoring systems to our arsenal of out-of-the-box integrations. We already provide integration with all of the most popular monitoring systems & services. Nagios, Zabbix, Zenoss, New Relic, AppDynamics, CloudWatch, and Pingdom are all there. And there are many more – this list gets longer with every week that passes. These out-of-the-box integrations from BigPanda have many advantages:

How to Use the 80/20 Rule to Turn Noisy Alerts into Actionable Intelligence

October 26th, 2015 | Blog

If you work in tech, you’ve probably heard of the Pareto principle, or, as it’s more commonly called, the 80/20 rule. According to the 80/20 rule, for many events, 80 percent of the results are generated by 20 percent of the inputs.

A little background: back in the late 1800s the Italian economist Vilfredo Pareto noticed that approximately 80 percent of the land in Italy was owned by 20 percent of the population. Not long after, Pareto also observed that 20 percent of the peapods in his garden generated 80 percent of the crop’s yield – and thus the 80/20 principle was born.

Key takeaways from DevOpsDays Silicon Valley

November 12th, 2015 | Blog

In between sessions at last weekend’s DevOpsDays Silicon Valley, scores of attendees filled the halls, amplifying the Computer History Museum with chatter and turning it into something more akin to a high school cafeteria than a conference venue. As crowds formed to share their stories and insights with one another, a common theme quickly emerged: It just isn’t as easy as we thought it would be.

Part 1 of 2: The reason why Nagios is so noisy – and what you can do about it

December 1st, 2015 | Blog

If you’re struggling with a flood of Nagios alerts, this two-part blog series is for you. We’ll take a close look at the complicated relationship that IT and Ops professionals have with the monitoring tool, explain why Nagios is so noisy, and discuss a simple way to take charge of your alerts and get the most out of Nagios.

How alert correlation helps Dev and Ops work better together

April 28th, 2016 | Blog

This post was recently published as a guest blog by our friends at Jira Service Desk. You can find the original post here.

We all need to move fast in order to stay competitive. But the faster things move, the faster things break.

While many companies have made great strides towards automating application release and infrastructure management, automation for service assurance has been sorely lacking. That’s left Dev and Ops with a problem: how to effectively service alerts that have grown by orders of magnitude.

Not all alert correlation platforms are created equal

May 23rd, 2016 | Blog

Ask yourself these questions to find the right fit in an alert correlation platform.

To maintain operational visibility in modern IT environments, companies are abandoning monolithic monitoring solutions from legacy vendors in favor of a modern set of “best of breed” monitoring tools. Today’s average IT monitoring stack consists of about 6-8 tools, including at least one from each of the following categories: systems monitoring, end user monitoring, application performance monitoring (APM), error detection, log analytics, chat, and ticketing. When service disruptions occur, operations engineers face a flood of alerts across different layers of the IT stack, with no fast way to figure out what’s really going on. Customers are left stranded while IT professionals struggle to detect, triage, and remediate urgent issues. Downtime abounds, which hurts revenue, performance, and brand loyalty.

Three key themes from ServiceNow Knowledge16

May 26th, 2016 | Blog

Decompressing from an exhausting, inspirational few days at Knowledge16, the annual ServiceNow event...

From humble beginnings (my first Knowledge was a few hundred attendees in a tent in San Diego), Knowledge has become a global tour de force. This year, Mandalay Bay could barely contain the more than 11,000 customers and partners (and the expo hall could barely contain more than 100 decibels of the tech equivalent of Queensryche). Getting into the keynote felt like rush hour on the subway in midtown Manhattan.