15 hours of downtime… avoided: part two of a two-part series
This is part two of a two-part post about using event correlation to thwart DDoS attacks. Channeling Mark Twain: it would have been shorter if I’d had more time. In the last post I described why DDoS attacks on SaaS providers are no different from the performance and availability issues experienced in other domains like healthcare, finance, or retail. In this post I’ll share a customer story about a security breach that never happened… thanks to a savvy DevOps team and data science.
A few weeks back, on a Thursday morning, I was on hold listening to elevator music while waiting for a customer to start our weekly status call. Fifteen minutes passed, and then I got a cryptic Slack post canceling the call. Zero notice. No explanation. I was annoyed.
An hour later I got a follow-up post explaining that the team had spent the past hour playing DevOps Space Invaders to repel an attempted DDoS attack. This customer operates one of the largest hosted security services in the world. It’s the multi-billion-dollar division of an even larger multi-billion-dollar Fortune 100 technology provider.
They have a target on their back the size of Beijing, and in recent months they have experienced an alarming escalation in the volume of DDoS attempts. Every attempt threatens to erode years of customer trust, and the bad guys only need to succeed once. Competitors are a click away. SaaS margins are razor thin. Board pressure to guarantee 100% availability is tighter than J Lo’s skinny pants.
Back to my elevator music… Alert volume spiked to 30,000 events per minute from an alphabet soup of monitoring tools: Nagios, AppDynamics, Pingdom, SolarWinds, and Dynatrace. Canceling our weekly call was the least of their worries. In the past, DDoS attacks took down all services for an average of 15 hours: three hours to detect, three to quarantine, three to diagnose, three to modify DNS tables and re-route traffic, and three for the change advisory board (CAB) to approve re-activating service. The cost of a typical attack extends way beyond 15 hours of downtime. It includes erosion of trust and goodwill, customer defections, social media backlash, and penalties for missed SLAs.
This time, the net impact of the attack was a canceled project status call. This time, defenses were in place. This time, three BigPanda incidents clustered more than 30,000 related alerts and indicated the source of the attack and what to do about it. BigPanda incident timelines provided a treasure map showing when each alert was generated and why, how they changed states, and how similar issues were resolved in the past.
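For readers curious what "clustering related alerts into incidents" can look like in code, here is a minimal Python sketch. It is not BigPanda's algorithm. It assumes a much simpler rule: alerts that share a key (here, the host) and arrive within a time window of each other belong to the same incident. The `Alert` and `Incident` classes, the `correlate` function, the five-minute window, and the host and check names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert
    host: str         # hypothetical host name
    check: str        # hypothetical check name
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    key: str
    alerts: list = field(default_factory=list)

    @property
    def last_seen(self) -> float:
        # Alerts are appended in time order, so the last one is the newest.
        return self.alerts[-1].timestamp


def correlate(alerts, key_fn, window_seconds=300):
    """Group alerts that share a key into incidents.

    An alert joins the open incident for its key unless more than
    `window_seconds` have passed since that incident's latest alert,
    in which case a new incident is started. This is an assumed,
    simplified rule, not any vendor's actual logic.
    """
    open_incidents = {}  # key -> most recent incident for that key
    incidents = []       # all incidents, in order of first alert
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = key_fn(alert)
        incident = open_incidents.get(key)
        if incident is None or alert.timestamp - incident.last_seen > window_seconds:
            incident = Incident(key=key)
            open_incidents[key] = incident
            incidents.append(incident)
        incident.alerts.append(alert)
    return incidents


# Hypothetical sample data: five alerts from four tools across two hosts.
sample = [
    Alert("Nagios", "edge-01", "http_latency", 0),
    Alert("Pingdom", "edge-01", "uptime", 30),
    Alert("AppDynamics", "api-07", "error_rate", 45),
    Alert("Nagios", "edge-01", "http_latency", 60),
    Alert("SolarWinds", "edge-01", "bandwidth", 1000),
]

incidents = correlate(sample, key_fn=lambda a: a.host)
for incident in incidents:
    print(incident.key, len(incident.alerts))
```

Swapping `key_fn` for something like `lambda a: a.check` groups by failing check instead of by host. A real correlation engine would presumably weigh many more signals, but the shape of the idea is the same: many alerts in, a handful of incidents out.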
Each incident was automatically shared to Slack channels and ServiceNow to provide instant and ongoing visibility to all service stakeholders. Service Health Analytics reports showed the pattern of impacted hosts and failed checks visually. This time, the whole service health lifecycle, from detection to remediation, fit inside the first 20 minutes of what used to be a three-hour detection window… all while I endured Yanni.
Less downtime from thwarted DDoS attacks is just one demonstration of the value they’re receiving. They report that over the past six months of using BigPanda they’re generating two-thirds fewer ServiceNow tickets, MTTR is down 45%, they’re spending a day less per week babysitting monitoring tools, and preparing uptime reports for board meetings now takes thirty minutes a month, down from ten hours.
Lawrence and team, you’re forgiven for being late. My only request: please play Floyd or Zeppelin next time you’re attacked during our status call.
Missed Part One? Find out why DDoS attacks aren’t just a security problem.