Managing IT operations during a crisis
As work environments for entire industries continue to shift among on-site, remote, and hybrid models, the performance of IT operations (ITOps) teams is more critical than ever. If you need proof, just remember the global impact of the CrowdStrike outage. Operations teams must monitor, triage, communicate, and manage incidents 24×7 across all services.
SaaS, legacy on-premises, and homegrown tools and systems are all stretching to meet business demand. Customer expectations are ever-increasing. Maintaining, or gaining, a lead on competitors requires constant innovation and solid reliability.
Meanwhile, it’s essential to expect the unexpected. Organizations must prepare to address any crisis, whether that’s a natural disaster, technical issue, or other disruption.
Here are six ways CIOs and their ITOps teams can adapt their organizations to survive crisis-related challenges and be well-positioned for the long term.
Priority 1: Check on your people
For large enterprises, the average cost of a significant IT outage has increased by 17% since 2022. These increased costs put massive pressure on your people, including your network operations center (NOC), ITOps, service management, observability, and incident response teams.
Taking care of your teams throughout a crisis is paramount. As the situation evolves, leaders need to rise above the noise. The approach is one part “Keep Calm and Carry On” and one part “Here’s what we can do right now to improve our situation.” Leaders help most by keeping their teams focused and looking beyond the immediate horizon.
Priority 2: Maintain focus on the big picture
ITOps teams often have the organization’s best (and least-leveraged) situational awareness. They can access real-time alerts and reports to quickly find, prioritize, and fix problems across their systems. Their bird’s-eye view of the business is critical to leaders and individual contributors. Understanding the health of all employee-facing and customer-facing services provides precious visibility and enables the organization to respond with agility.
Too often, this visibility remains siloed within the ITOps team. Only when issues become critical does the awareness spread throughout the organization. ITOps should proactively promote the visibility of service performance, team performance, usage, incident trends, and everything they can see from their privileged position. Even reporting “situation normal, all systems up” is valuable information.
From high-level periodic reports through service-status dashboards to very tactical live incident-status pages, the objective is to create shared awareness. Breaking down silos allows interdependent teams to make decisions more effectively and focus their efforts on a shared understanding of the truth.
Priority 3: Maintain IT change awareness
All services require routine hygiene, from application updates, database maintenance, and server OS updates to security fixes and network configuration changes. Change velocity has ramped up significantly over the years, and a faster pace of change is harder to coordinate across distributed teams.
Changes always present risks, so most ITOps teams track them in order to correlate them with service impacts. When team members make changes while working remotely, collective awareness matters even more. Teams need a centralized change process and information hub that includes the critical what, when, why, and who. Such a unified hub helps teams quickly resolve conflicting changes, minimize risks to the business, and correlate changes with impacts, reducing recovery times.
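To make the idea of a change hub concrete, here is a minimal Python sketch. It is an illustration only, not a description of any particular product: the ChangeRecord fields, the function names, the four-hour lookback, and the 30-minute conflict window are all assumptions chosen for the example. It records the what, when, why, and who of each change, lists the changes made to a service shortly before an incident, and flags changes to the same service made close together by different people.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in a hypothetical central change log."""
    what: str
    when: datetime
    why: str
    who: str
    service: str  # assumed extra field: the service the change touches


def changes_before_incident(changes, service, incident_start,
                            lookback=timedelta(hours=4)):
    """Return changes to `service` made within `lookback` before the
    incident started, most recent first (the likeliest suspects)."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes
        if c.service == service and window_start <= c.when <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.when, reverse=True)


def conflicting_changes(changes, window=timedelta(minutes=30)):
    """Return pairs of changes to the same service, made by different
    people, whose timestamps fall within `window` of each other."""
    ordered = sorted(changes, key=lambda c: c.when)
    conflicts = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.when - first.when > window:
                break  # later entries are even further apart
            if first.service == second.service and first.who != second.who:
                conflicts.append((first, second))
    return conflicts


# Made-up sample data for illustration.
CHANGES = [
    ChangeRecord("Rotate TLS certificate", datetime(2024, 5, 1, 9, 0),
                 "Certificate expiring", "alice", "checkout"),
    ChangeRecord("Apply OS security patch", datetime(2024, 5, 1, 13, 30),
                 "Security fix", "bob", "checkout"),
    ChangeRecord("Update firewall rule", datetime(2024, 5, 1, 13, 45),
                 "New partner IP range", "carol", "checkout"),
    ChangeRecord("Reindex database", datetime(2024, 5, 1, 13, 50),
                 "Routine maintenance", "dave", "search"),
]

suspects = changes_before_incident(CHANGES, "checkout",
                                   datetime(2024, 5, 1, 14, 0))
for change in suspects:
    print(change.when, change.who, change.what)

for first, second in conflicting_changes(CHANGES):
    print("Possible conflict:", first.what, "/", second.what)
```

In practice the records would come from a ticketing or change-management system rather than a hard-coded list, and time proximity is only a starting point for investigation, not proof that a change caused an incident.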
Priority 4: Proactively measure, analyze, and report
ITOps teams use multiple metrics and KPIs to report on their processes and service performance. From Daily Active Users to MTTx to service availability, these numbers usually stay within predictable ranges and don’t require constant tracking.
However, extraordinary circumstances might push them beyond those predictable ranges. Tracking well-defined KPIs against their normal ranges helps identify trends and outliers in service usage and team performance. CIOs can then pinpoint issues and more effectively inform other C-level executives. By using well-defined metrics, you can accelerate your Observe, Orient, Decide, and Act (OODA) loop and gain enhanced, real-time insights into business performance.
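One simple way to define a "predictable range" is the historical mean plus or minus a few standard deviations. The Python sketch below assumes that approach; the function names, the threshold of three standard deviations, and the sample MTTR figures are illustrative assumptions, and real KPIs with seasonality or skew may need a more robust method.

```python
from statistics import mean, stdev


def predictable_range(history, k=3.0):
    """Return (low, high) as mean +/- k sample standard deviations.
    `history` needs at least two data points."""
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def out_of_range(history, latest, k=3.0):
    """True if `latest` falls outside the range implied by `history`."""
    low, high = predictable_range(history, k)
    return not (low <= latest <= high)


# Made-up sample: mean time to resolve (minutes) over recent weeks.
MTTR_HISTORY = [42, 38, 45, 40, 41, 39, 44]

print(out_of_range(MTTR_HISTORY, 43))  # a typical week
print(out_of_range(MTTR_HISTORY, 95))  # a crisis week
```

A check like this can run on a schedule so that a KPI is surfaced only when it leaves its normal range, which keeps routine reporting light while still catching outliers early.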
Priority 5: Reduce tool sprawl
Solution sprawl is a constant and growing challenge. Many CIOs struggle to support multiple solutions in the same space, including:
- Chat
- Project management
- CI/CD
- Orchestration
- Data visualization
- Monitoring solutions
- Ticketing systems
- ERPs
- Public clouds
You may be running several products in each of these categories for different teams, making it hard for ITOps to track them all. Supporting these tools requires significant resources and can challenge your organization’s ability to maintain a high-availability, secure network environment.
The need for organizational agility is the best argument against sprawl, and that need is visible across nearly all industries. Balancing each unique solution’s costs against its value is essential when evaluating how to reduce tool sprawl. Allow an exception only when a tool offers a real differentiator. Drive teams to best-in-class, cloud-based SaaS solutions that natively support hybrid teams. Consolidation can also reduce operating expenses.
Examine team workflows and identify where multiple tools create delays due to “mental switching costs,” where operators must check various tools to get an answer. Streamline work through integrations and aggregation of related data into a unified view. ITOps teams will have less to manage and better tools to do their jobs.
Priority 6: Invest in GenAI solutions for ITOps
Every day, ITOps teams have to deal with all the problems, large and small, that can occur within an ever-evolving service technology stack. They have to wage this never-ending battle while attempting to prevent or limit the impacts on users and customers.
Advances in generative AI offer solutions to these challenges. By effectively incorporating GenAI into their IT operations, organizations can automate time-consuming, manual processes such as alert analysis, incident correlation, and ticket creation. Automating these tasks lightens operator workload, reduces MTTR, and improves system reliability.
GenAI can also expedite root-cause analysis and reduce escalations to L2/L3 teams. By analyzing systems in real time and presenting the findings in plain language, AI can give teams actionable insights so they can resolve incidents faster and make better decisions.
Waiting for a crisis to trigger investment in ITOps people, processes, and technology is a high-risk approach. ITOps teams provide the agility and scalability to safeguard your enterprise’s future. To ensure they remain effective, these teams must be supported with investments in GenAI solutions that make them faster and more productive. Ensure your teams have CIO-level support in the form of organizational focus and executive sponsorship so that the ITOps function can evolve to keep pace with the business.
To get more insights on how ITOps organizations will use AI to innovate and transform over the next 12-18 months, check out our webinar on the Top AIOps predictions for 2025.