What is MTTR? Or why not to feed the baby cognac.
What is MTTR? Don’t answer with what it stands for or how you use it. The question is more philosophical than literal. For too long we’ve measured operational performance by the number of minutes it takes to resolve an incident. When the almighty trend line slopes down, we gulp milk from the jug of IT-inflated ego like NASCAR drivers drunk on Nagios exhaust fumes.
Like the Zen riddle about one hand clapping, it’s important to first ask:
- What’s an incident?
- What does it mean to resolve one? …and (the ever-blasphemous)
- Is it unequivocally better to resolve them quickly?
What’s an incident?
An incident is an issue triggered by some unexpected behavior that has an adverse impact on people, process, or things. Incidents are often symptoms of larger problems and can frequently be remediated by routine tasks (reboot… reconnect… restart).
Our goal in IT, however, isn’t getting credit for fixing issues we created… it’s managing healthy infrastructure that doesn’t suffer from a high volume of incidents. MTTR-driven ops management often misreads a large number of quickly resolved incidents as the sign of a productive team when in fact it more frequently indicates fragile infrastructure.
What does it mean to resolve one?
Resolving incidents is considered positive… when in fact resolving them the right way the first time is what should be valued. MTTR rewards turning red to green. Other metrics like MTBF (mean time between failures) are better indicators of infrastructure that remains consistently healthy.
Is it always better to resolve incidents quickly?
Measuring reduced downtime alone is the IT equivalent of dipping the pacifier in cognac. The kid stops crying quickly but dad (mom would *never* exercise such bad judgment) may end up in prison. Reward thoroughness. Reward quality. Reward service. Don’t reward the cognac solution.
So what is MTTR?
It’s the starting point for a discussion about operational excellence. Its value varies from organization to organization and it’s one of many indicators of healthy process and infrastructure. It’s best calculated as the sum of all periods during which each incident was in a state other than “resolved,” divided by the total number of incidents. Durations should be calculated from machine timestamps in monitoring data (vs. operator-supplied status changes), and frequently reopened (or flapping) incidents should be treated as a single incident.
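For the spreadsheet-averse, here is a minimal Python sketch of that calculation. It is an illustration under stated assumptions, not a reference implementation: the `(check_key, opened_at, resolved_at)` record shape, the function name `mean_time_to_resolve`, and the 30-minute `flap_window` are all hypothetical, and the timestamps are assumed to come from your monitoring system rather than from operator status changes.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(periods, flap_window=timedelta(minutes=30)):
    """Sum of unresolved time divided by the number of incidents.

    periods: iterable of (check_key, opened_at, resolved_at) tuples built
    from monitoring timestamps (hypothetical record shape).
    flap_window: assumed threshold; a reopen on the same check within this
    window of the previous resolve counts as the same incident.
    """
    by_key = {}
    for key, opened, resolved in periods:
        by_key.setdefault(key, []).append((opened, resolved))

    total_unresolved = timedelta(0)
    incident_count = 0
    for spans in by_key.values():
        spans.sort()
        last_resolved = None
        for opened, resolved in spans:
            # Only count a new incident if this is not a quick reopen (flap).
            if last_resolved is None or opened - last_resolved > flap_window:
                incident_count += 1
            # Every unresolved period still adds to the numerator.
            total_unresolved += resolved - opened
            last_resolved = resolved

    if incident_count == 0:
        raise ValueError("no incidents to average")
    return total_unresolved / incident_count


t0 = datetime(2024, 1, 1, 9, 0)
periods = [
    ("web-01/http", t0, t0 + timedelta(minutes=10)),
    # Reopened 5 minutes after being "resolved": a flap, same incident.
    ("web-01/http", t0 + timedelta(minutes=15), t0 + timedelta(minutes=25)),
    ("db-01/disk", t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=40)),
]

mttr_merged = mean_time_to_resolve(periods)
mttr_naive = mean_time_to_resolve(periods, flap_window=timedelta(0))
print(mttr_merged, mttr_naive)
```

With the sample data, counting the flapping web check as one incident gives 60 unresolved minutes over 2 incidents (30 minutes), while counting every reopen separately gives 60 minutes over 3 incidents (20 minutes). The quick-fix-that-didn’t-stick makes the naive number look better, which is exactly the cognac problem.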
Consider this less an unprovoked assault on IT doctrine and more an invitation to spend 30 minutes with your team evaluating whether or not MTTR reduction is the metric best aligned with business value.