How to prepare for, deal with, and recover from IT outages
The average cost of an IT outage is $12,900—per minute.
And when it comes to a “significant outage,” organizations reported the average overall cost was a whopping $1,477,800.
On the That’s Great IT podcast I spoke with Scott Lee, AVP for infrastructure and ITOps at Arch Mortgage Insurance Company, part of Arch Capital Group, about how organizations can best navigate IT outages. From preparation to real-time response and then the aftermath, there are many things IT operators and leadership can do to minimize an outage’s impact.
And the impacts go much deeper than the monetary cost. In BigPanda’s recent ebook, The modern IT outage: costs, causes, and “cures,” we asked organizations what was most important to them when calculating the cost of an outage. A majority said business disruption and impact on employee productivity. That was followed by data breach/government regulatory exposure, lost revenue, reputation, and finally, the impact on DevOps/SRE productivity.
How to prepare for an IT outage
To start, we have to go back to before the beginning: how you prepare for an IT outage is just as important as how you respond to one. You’ll notice that I didn’t say “prevent,” but rather, “prepare.” That’s because outages are, unfortunately, inevitable. You could be part of the best-run ship on the planet or a one-person shop supporting IT operations single-handedly, but no matter what, we all deal with outages.
That doesn’t mean they have to be catastrophic, though. There are many ways to minimize the impact before an outage even occurs.
The first step is to create a recovery plan. This is when you should really sit down and consider all the factors that could cause an outage. That may include regional factors like people or infrastructure, but it also probably includes things that are impossible to control, such as the weather, utility providers, mass power outages, and geopolitical activities.
“It can be difficult to think about because you have to think about the small [problems, like], ‘Hey, this one server has gone offline, but it’s causing an outage of a service or an application. So how do I deal with that? How do I know that’s happening? How do I isolate that one server to understand [what’s] happening?’” Lee explained. “But it can also be something that’s regional in nature. I’ve got folks in Manila, for example, and they get hit by all kinds of [severe] weather and stuff over there. What do I do if the whole island goes offline? And that can be very challenging to overcome a regional disaster like that.”
Lee advised that organizations start small and expand their plans from there. And oftentimes, that preparation starts with implementing redundancy and avoiding single-point-of-failure scenarios.
For some companies, that might mean having a redundant pair of servers in an IT closet, while for others that might mean relying on the cloud. And if at all possible, organizations ideally should not have everything in the same location.
“How many companies went out of business on 9/11 because everything they had was in the Trade Centers?” Lee brought up. “That’s hard, hard, hard to overcome. And thinking about those things, what’s your tolerance level from a company perspective? And then work with your business units to try and identify that and then build your infrastructure to support it.”
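To make the redundancy idea concrete at a small scale, here is a minimal sketch, in Python, of a client that fails over from a primary endpoint to a secondary one hosted somewhere else. The URLs, health-check path, and timeout are hypothetical placeholders, not a recommendation for any particular setup.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints: a primary site and a secondary site in another location.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://secondary.example.com/health",
]

def fetch_with_failover(endpoints, timeout=5):
    """Try each endpoint in order and return the first successful response body."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            # This copy is unreachable or erroring: note it and try the next one.
            last_error = err
    raise RuntimeError(f"All endpoints failed; last error: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover(ENDPOINTS))
```

The same try-the-next-copy logic shows up at every layer, from a redundant pair of servers in an IT closet to multi-region cloud deployments.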
Of course, there is a cost to creating redundancy. However, there’s a very good financial case for building it into the budget. While I noted that the average IT outage costs an organization $12,900 per minute, that number can vary widely depending on an organization’s size, with larger companies facing much higher costs.
In our ebook we noted that, when broken down by organization size, the average cost per minute of an outage is:
- 1,000-2,499 employees: $1,850
- 2,500-4,999 employees: $4,542
- 5,000-9,999 employees: $8,424
- 10,000-20,000 employees: $24,347
- More than 20,000 employees: $25,402
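To see how quickly those per-minute figures add up, here is a quick back-of-the-envelope calculation using the numbers above; the 45-minute duration is just an illustrative choice.

```python
# Per-minute outage cost by organization size, from the figures above.
COST_PER_MINUTE = {
    "1,000-2,499 employees": 1850,
    "2,500-4,999 employees": 4542,
    "5,000-9,999 employees": 8424,
    "10,000-20,000 employees": 24347,
    "More than 20,000 employees": 25402,
}

def outage_cost(org_size: str, minutes: int) -> int:
    """Estimated total cost of an outage lasting `minutes` for a given org size."""
    return COST_PER_MINUTE[org_size] * minutes

# Example: a 45-minute outage at a 5,000-9,999 employee organization.
print(f"${outage_cost('5,000-9,999 employees', 45):,}")  # -> $379,080
```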
Lee said he often thinks about the WIRED article about NotPetya. The article described how the 2017 attack, which used a modified variant of the Petya malware, crippled the systems of Ukrainian organizations, including banks, ministries, newspapers, and electricity firms.
“They described an instance of IT people running down the halls, yelling at everyone, ‘Unplug your computer! Unplug your computer!’ And I never want to be in that situation,” he said. “That’s a horrible way to be. That’s why I plan to do the things that I do today. And again, it’s about the money, right? How much money will we lose, and therefore, let’s put a return on the investment of the [data recovery] plan and infrastructure to make sure that we’re not going to lose more than we need to.”
Beyond lost revenue, there’s also the potential cost of violating regulatory compliance.
“Especially in financial services, if people don’t have access to their money, they get a little squirrely—and the government doesn’t like that either,” Lee reminded. “So what’s the tolerance of the government before the fines start to happen and before you start losing customers and things like that?”
After organizations look over the possible ramifications of an outage, they should determine their tolerance level for each one, and how that applies to their various services and applications. Most companies do this by ranking services and applications into tiers, where Tier 0 is the most critical infrastructure.
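One lightweight way to make that ranking usable during an incident is to write it down in a form the whole team can query. The sketch below is a hypothetical Python service catalog with made-up service names, tiers, and recovery time objectives (RTOs); the values are illustrative, not recommendations.

```python
# Hypothetical service catalog: Tier 0 is the most critical infrastructure.
SERVICE_TIERS = {
    "payments-api":    {"tier": 0, "rto_minutes": 15},   # customer-facing, revenue-critical
    "customer-portal": {"tier": 1, "rto_minutes": 60},
    "internal-wiki":   {"tier": 2, "rto_minutes": 480},
    "report-batch":    {"tier": 3, "rto_minutes": 1440},
}

def recovery_order(services: dict) -> list[str]:
    """Return service names sorted by tier, i.e., the order to restore them in."""
    return sorted(services, key=lambda name: services[name]["tier"])

print(recovery_order(SERVICE_TIERS))
# -> ['payments-api', 'customer-portal', 'internal-wiki', 'report-batch']
```

Keeping the tiers in one agreed-upon place means the recovery order is decided before the outage, not in the middle of it.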
“IT people, a lot of times, don’t understand that, right?” Lee said. “Sometimes they’ll know that their app is really important, it’s a Tier 1 or Tier 2, but they won’t understand the business impact. And this is where the leadership comes in to help bridge that gap.”
How to deal with IT outages
Once organizations understand which resources they need in various situations, they can make specific plans for recovering from outages.
The first practical thing organizations should do is implement an on-call rotation. Even if an IT team has offshore support, those working domestically need to be able to jump on a call and help in an emergency. Teams need to know that when there’s an emergency, it’s all hands on deck.
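A rotation doesn’t need heavyweight tooling to get started. As a purely illustrative sketch, the Python snippet below picks a primary and a backup engineer for the current ISO week from a fixed list; the names and weekly cadence are invented, and most teams would eventually move this into a proper paging tool.

```python
from datetime import date

# Hypothetical rotation: one primary and one backup per ISO week.
ENGINEERS = ["amara", "bo", "chris", "dana"]

def on_call(today: date, engineers: list[str]) -> tuple[str, str]:
    """Return (primary, backup) for the ISO week containing `today`."""
    week = today.isocalendar()[1]
    primary = engineers[week % len(engineers)]
    backup = engineers[(week + 1) % len(engineers)]
    return primary, backup

primary, backup = on_call(date.today(), ENGINEERS)
print(f"Primary on call: {primary}, backup: {backup}")
```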
“I had an instance in my past where a full-on datacenter went down. We had three core data centers, and one of the three core data centers went completely offline from a perceived fire,” Lee described. “It was smoke, [and] there was no fire in this case. [But] that data center happened to house all of the virtual hosts for our offshore people. So they were completely offline, they couldn’t help, and it went down in the middle of the night. So U.S.-based people had to hop on the call [and] had to start recovering to the other data centers.”
There also should be a pre-planned hierarchy of command, so that IT operators aren’t overwhelmed with conflicting demands.
“You have to have a point of contact that’s going to help the IT folks prioritize as they’re doing their work so that they don’t have to make those decisions,” Lee explained.
During the actual outage, Lee said, everyone needs to come together on a bridge call that includes both management and IT operators. They should discuss what everyone is seeing, what exactly is out, and what the priorities are.
“Management can’t make the priority until management knows what’s wrong. And only the tech people are going to really dive in and find out what’s wrong,” Lee said. “So we have to work together as a team to really understand how to proceed down the road.”
How to recover from an outage
After an outage is resolved, there’s still more work to be done. That’s when it’s time for the post-mortem. Organizations should figure out what happened, identify the root cause, determine when the incident started having an impact and whether they caught all of it, and decide how to prevent similar outages.
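A post-mortem is easier to run consistently when its findings are captured in the same shape every time. Below is a minimal sketch of such a record as a Python dataclass; the fields mirror the questions above, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PostMortem:
    """Minimal post-mortem record mirroring the questions above."""
    summary: str
    root_cause: str
    impact_started: datetime
    detected: datetime
    resolved: datetime
    action_items: list[str] = field(default_factory=list)

    @property
    def time_to_detect_minutes(self) -> float:
        return (self.detected - self.impact_started).total_seconds() / 60

# Invented example values for illustration only.
pm = PostMortem(
    summary="Job ran against production instead of nonproduction",
    root_cause="No guardrail distinguishing production hosts from test hosts",
    impact_started=datetime(2024, 3, 1, 2, 10),
    detected=datetime(2024, 3, 1, 2, 35),
    resolved=datetime(2024, 3, 1, 4, 0),
    action_items=["Add environment banners to shells", "Require a change ticket for production jobs"],
)
print(f"Time to detect: {pm.time_to_detect_minutes:.0f} minutes")
```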
It can be tempting to let this discussion turn into a “blame game,” but that’s a mistake. Managers especially shouldn’t be too harsh on staff.
“If it’s unintentional, they didn’t mean to do it, there has to be some tolerance level,” Lee asserted. “We have to create a safe space for people to be able to absorb mistakes and learn from them and move forward.”
Lee recalled one server admin who was given a task to run on a nonproduction server, but he accidentally logged into a production server, ran the task, and it brought that server down—which then brought down a pretty important service. The manager immediately fired him.
“Everyone else now is going to be really afraid to make a mistake, and they’re going to be afraid to jump in and help when the time comes because they’re going to be afraid of making that mistake,” Lee explained.
In contrast, when he faced a similar situation as a manager, he made the choice to respond differently. He recalled a person on his team who was supposed to run a job on a nonproduction machine but ran it on production instead, taking the server down and causing a costly service outage. Despite senior leadership wanting someone held accountable, Lee refused to fire the person.
“[I said], ‘Look, yes, it’s my team’s fault. It’s not this person’s fault. We had something in our process that allowed this person to make that mistake, and we are going to learn from that and make sure it doesn’t happen again,’” Lee recalled. “But that is how you deal with it after the fact. You don’t go and chastise that person.”
The goal should be to figure out what happened, and how you can be better prepared next time. That usually starts with finding the root cause, something a lot of companies end up skipping over. While they may get close, they often stop short of actually figuring it out, Lee explained. They say they’re going to fix it later, but then they never do. Predictably, they then run into the same problems over and over again.
“Get to that root cause, fix the problem that caused it in the first place,” Lee said. “And then make sure you’ve got enough monitors and policies and practices in place to detect it again earlier so that you don’t have to go through all the troubleshooting to dive down in.”
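Earlier detection doesn’t have to start with a big platform. The snippet below is a minimal sketch of a health-check poller that flags a service the moment it stops responding; the URL, the polling interval, and the print statement standing in for a real pager are all placeholders.

```python
import time
import urllib.request
import urllib.error

# Hypothetical service endpoint and polling interval.
HEALTH_URL = "https://service.example.com/health"
INTERVAL_SECONDS = 30

def check_health(url: str, timeout: int = 5) -> bool:
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def monitor(url: str, interval: int) -> None:
    """Poll the endpoint and raise an alert (here, just a print) on failure."""
    while True:
        if not check_health(url):
            # In a real setup this would page the on-call engineer.
            print(f"ALERT: {url} failed its health check")
        time.sleep(interval)

if __name__ == "__main__":
    monitor(HEALTH_URL, INTERVAL_SECONDS)
```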
In the end, Lee insisted organizations need to practice how they play.
“I used to be a volunteer firefighter, and we practiced burning down buildings, we practiced cutting through a roof with a chainsaw to ventilate, we practiced all those things before we had to go out and do those things when it was life safety,” Lee said. “Do the same thing. Understand the business, do all the things that you gotta do to put in place.”
He said to run your teams through what are essentially fire drills, and practice responding to a data recovery event. Organizations should even take things offline when it’s appropriate for the business, and practice responding to that situation.
“Make sure you understand what’s going to happen, because things are going to come up as you recover,” Lee imparted. “And if you’ve never done it before, it’s going to take you a lot longer to figure out.”
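One low-stakes way to practice is to time how long recovery actually takes during a planned drill. The sketch below assumes a hypothetical staging endpoint and simply measures the minutes from the start of the drill until the service reports healthy again.

```python
import time
import urllib.request
import urllib.error

# Hypothetical endpoint used for the drill (a nonproduction copy of a service).
DRILL_URL = "https://staging.example.com/health"

def is_healthy(url: str, timeout: int = 5) -> bool:
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def time_recovery(url: str, poll_seconds: int = 10) -> float:
    """Return minutes elapsed from the start of the drill until the service is healthy."""
    start = time.monotonic()
    while not is_healthy(url):
        time.sleep(poll_seconds)
    return (time.monotonic() - start) / 60

if __name__ == "__main__":
    print(f"Recovered in {time_recovery(DRILL_URL):.1f} minutes")
```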
Download our ebook, The modern IT outage: costs, causes, and “cures,” to learn more about IT outages, including:
- How organizations view the relationship between outage cost and outage duration.
- What constitutes a significant outage and the costs and frequencies associated with one.
- How organizations are seeing outage-busting benefits from AIOps.
- The high correlation between AIOps implementation and outstanding IT service quality.
For further insights into ITOps, AIOps, and tech in general, check out our podcast, That’s Great IT. We explore timely topics in the tech industry in a fun way and host some incredible speakers, so make sure to follow the podcast on your favorite podcast platform!