What is MTTR? Don’t answer with what it stands for or how you use it. The question is more philosophical than literal. For too long we’ve measured operational performance by the number of minutes it takes to resolve an incident. The almighty trend line slopes down, and then we gulp milk from the jug of IT-inflated ego like NASCAR drivers drunk on Nagios exhaust fumes.
Like the Zen riddle about one hand clapping, it’s important to first ask:
Enterprise application and computing environments have changed radically over the past fifteen years. Anyone who has spent even a day in an IT role can tell you that. What gets less attention, however, is how those changes undermine the ability of operations teams to do their jobs. The problem is that while computing and application environments have changed dramatically, workflows and org charts have not.
We engineers love measuring stuff. Whether it helps us solve an immediate problem, gets us ready for a bad day, or just satisfies the information junkie in most of us, we love keeping track of metrics. The spectrum of what can be measured is very wide. It can include data from every part of our system: from technical metrics such as disk space or RPM, through UI metrics like page load times, to business KPIs such as revenue, conversion rates and so on. When choosing which metrics to collect, we usually start with the obvious ones: those that reflect the current state of the system (e.g., CPU, memory and load). There are quite a few articles and blog posts about these metrics, so I’m not going to discuss them here. Rather, I would like to focus on metrics that reflect the user experience.
Here are the four metrics that we at BigPanda see as the most important in this category: