5 ways teams used BigPanda during the CrowdStrike outage
In the weeks since the CrowdStrike outage brought millions of systems to a halt, countless articles have been written about the cause of the outage, its impact, and the costs companies incur during service disruptions.
Nearly every large company had hosts offline due to the faulty update in CrowdStrike’s Falcon software. BigPanda customers were no exception. On July 19, between 04:00 and 07:00 UTC, the BigPanda systems logged an increase in shared incidents. The increase peaked at 290% of normal, indicating something big was happening throughout our customer base.
At BigPanda, we’re privileged to support teams that keep the digital world running. When faced with the massive scale of the CrowdStrike outage, those teams used BigPanda in innovative ways to speed their response and recovery times. They shared their approaches with us, and we wanted to share some of their best practices with the wider community.
Best practice 1: Control the flood of outage-related alerts
At the time of the faulty update, some BigPanda users had hundreds or thousands of hosts fall offline. Alerts came from impacted systems and interrelated systems — plus the online systems that handled the resulting additional load. Response teams faced a flood of alert noise and the challenge of sifting through that noise to find relevant and actionable information.
Organizations that had previously deployed robust alert correlation had a distinct advantage. These teams used BigPanda to correlate the flood of “Host not Reporting” and related service alerts into fewer, clearly articulated incidents. Our customers shared that the BigPanda alert filtering and correlation were instrumental in managing the volume of requests, providing critical data, and helping teams prioritize and resolve issues efficiently.
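To make the idea of correlation concrete, here is a minimal Python sketch of how a flood of alerts can collapse into a handful of incidents. It is an illustration only, not BigPanda's actual correlation logic: the Alert fields, the correlate function, and the 15-minute window are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape, used only for this illustration.
    host: str
    check: str
    timestamp: int  # seconds since an arbitrary start


def correlate(alerts, window_seconds=900):
    """Group alerts that share a check name into incidents.

    Within one check, a new incident starts whenever the gap to the
    previous alert exceeds window_seconds (an assumed 15 minutes).
    """
    by_check = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_check[alert.check].append(alert)

    incidents = []
    for group in by_check.values():
        current = [group[0]]
        for alert in group[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                incidents.append(current)
                current = [alert]
        incidents.append(current)
    return incidents


sample_alerts = [
    Alert("web-01", "Host not Reporting", 0),
    Alert("web-02", "Host not Reporting", 60),
    Alert("db-01", "Host not Reporting", 120),
    Alert("lb-01", "High CPU", 200),
    Alert("web-03", "Host not Reporting", 5000),
]
sample_incidents = correlate(sample_alerts)
```

In this toy data set, five alerts become three incidents: one covering the three hosts that stopped reporting together, one for the later straggler, and one for the unrelated CPU alert. Real correlation patterns typically combine several tags and time windows, but the reduction in volume works on the same principle.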
Best practice 2: Rapidly identify incidents tied to impacted hosts
Response teams faced the challenge of identifying which incidents were related to the CrowdStrike update and which were not. Due to the volume of incidents associated with the outage, there was a risk of losing others in the chaos.
BigPanda enriched incidents with key details, such as auto-generated titles and summaries and, notably, suspected root causes. With this detail, responders could immediately identify which incidents were related to the outage and address them appropriately. Just as important, non–CrowdStrike-related incidents remained visible and actionable even when other incidents threatened to overwhelm operations teams.
Best practice 3: Create a list of all the impacted hosts
Once Microsoft identified the steps to recover impacted systems, some of our customers used BigPanda to produce a consolidated list of hosts showing outages. They did this by searching for incidents by enrichment tag in Unified Search and by using Analytics, with assistance from BigPanda employees who jumped into action.
The lists were essential for teams to restore and reboot servers. By establishing the full scope of impact in one place, they saved teams hours of work. Some customers also used real-time dashboards in BigPanda to track the resolution of related incidents.
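The list-building step can be pictured with a short Python sketch. It assumes incidents have already been exported as plain dictionaries with "tags" and "hosts" keys; that shape, the impacted_hosts function, and the "suspected_cause" tag name are hypothetical and do not represent a BigPanda API or export format.

```python
def impacted_hosts(incidents, tag_key, tag_value):
    """Return a sorted, de-duplicated list of hosts from incidents
    whose enrichment tag tag_key equals tag_value."""
    hosts = set()
    for incident in incidents:
        if incident.get("tags", {}).get(tag_key) == tag_value:
            hosts.update(incident.get("hosts", []))
    return sorted(hosts)


# Hypothetical exported incidents, for illustration only.
exported_incidents = [
    {"tags": {"suspected_cause": "crowdstrike"}, "hosts": ["web-02", "web-01"]},
    {"tags": {"suspected_cause": "crowdstrike"}, "hosts": ["web-01", "db-01"]},
    {"tags": {"suspected_cause": "disk-full"}, "hosts": ["batch-07"]},
    {"hosts": ["untagged-01"]},
]
recovery_list = impacted_hosts(exported_incidents, "suspected_cause", "crowdstrike")
```

Filtering on a single enrichment tag and de-duplicating hosts across incidents yields one recovery list, while unrelated or untagged incidents stay out of it.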
Best practice 4: Manage the volume of created tickets
Robust correlation made the difference, sharply reducing the number of tickets and eliminating considerable wasted effort. BigPanda customers that use integrated Workflow Automation had the best experience. Workflow Automation automatically creates tickets from correlated incidents, promptly routing them to the right teams with the necessary context to speed remediation.
The CrowdStrike incident also served as a pressure test. For BigPanda customers, it proved the ability of their systems to handle a massive spike in ticket volume across the entire ops process. In cases where systems were overtaxed, teams are working post-outage to increase scalability and resilience. Many customers used their findings to reevaluate their BigPanda correlation patterns, improving them in ways that will make the next outage easier to handle.
Best practice 5: Perform post-event analysis
A common response to the outage was to turn to BigPanda for additional insights and reporting. These teams used Unified Search and Unified Analytics to extract the vital information and create the necessary reports. Savvy customers continue to use these tools to collect and customize the data and reports they need to assess the outage’s final impact.
Our internal teams and early adopters alike used the Open Analytics Hub to access the BigPanda standard data model. The model provides access to the information stored in BigPanda for use in other tools, including business intelligence, analytics, and reporting systems.
Thank you
The BigPanda team wants to thank everyone who shared their outage stories with us. We were impressed not only with your professionalism but also with the speed of your responses to the sudden workload increase. It was quite a week for many teams, and we know you were extremely busy.
Like you, many BigPanda employees are current or former ITOps, infrastructure, observability, or service management professionals. We’ve felt your pain and are humbled by your creativity and dedication to keeping your services running under extraordinary circumstances.