Mean Time To Resolution

Also known as MTTR. Posts about the full incident discovery and resolution process.

What is MTTR? Or why not to feed the baby cognac.

July 31st, 2015 | Blog

What is MTTR? Don’t answer with what it stands for or how you use it. The question is more philosophical than literal. For too long we’ve measured operational performance by the number of minutes it takes to resolve an incident. When the almighty trend line slopes down, we gulp milk from the jug of IT-inflated ego like NASCAR drivers drunk on Nagios exhaust fumes.

Like the Zen riddle about one hand clapping, it’s important to first ask:

  1. What’s an incident?
  2. What does it mean to resolve one? …and (the ever-blasphemous)
  3. Is it unequivocally better to resolve them quickly?

My answers...

Service Health Analytics: make better IT ops decisions faster

September 29th, 2015 | Blog

We’re proud to be unveiling a new concept we pioneered in the den that finally moves beyond dashboards as eye candy to a new place where IT analytics can be used to make better ops decisions. It’s called Service Health Analytics and it exposes all data from all monitoring sources in the form of configurable dashboards that can be customized, saved, and shared.

Until DevOps becomes NoOps, there’s Service Health Analytics

October 5th, 2015 | Blog

We’re adjusting to the new reality that DevOps is a compelling layover on the journey between legacy ops and self-healing infrastructure. Eliminating the cultural gap between developers and operations, the now-clichéd state of IT nirvana called “DevOps”, is by no means the end goal. The goal is reliable system performance and availability without human intervention: the panacea called “NoOps”.

Why DDoS attacks aren’t just a security problem… and monitoring traffic isn’t the solution – Part One

October 16th, 2015 | Blog

Every company’s a target, every customer’s at risk. But the now-cliched threat of data breaches from Distributed Denial of Service (DDoS) attacks obscures a bigger threat: outages that impact not just data integrity but also profitability, brand equity, and customer retention. 

The volume of attacks is growing and so is the impact of downtime. According to Akamai’s most recent State of the Internet report, DDoS attacks are a bigger threat than ever before: “The number of DDoS attacks continued to increase substantially in Q2 2015, more than doubling the number observed in Q2 2014.”

15 hours of down time… avoided: part two of a two-part series

October 31st, 2015 | Blog

This is part two of a two-part post about using event correlation to thwart DDoS attacks. Channeling Mark Twain: it would have been shorter if I’d had more time. In the last post I described why DDoS attacks on SaaS providers are no different from the performance and availability issues experienced in other domains like healthcare, finance, or retail. In this post I’ll share a customer story about a security breach that never happened… thanks to a savvy DevOps team and data science.