Golden Age of Developers = Nightmare for Ops
The last ten years have brought enormous changes to production environments, driven by a best-of-breed approach to production infrastructure enabled by open source and cloud. This has been a boon for developers in terms of flexibility and productivity, but it’s also placed a new set of challenges and expectations on Ops.
The Golden Age of Developers
In the past, a developer’s toolbox was limited to a few monolithic solutions from legacy vendors (think Oracle, IBM, HP). These solutions were slow to integrate, slow to evolve, and expensive. Whether they suited your needs well or not, once your company bought them you had damn well better use them.
Today, the abundance of open source and cloud solutions has liberated developers from their reliance on legacy tools and enabled a best-of-breed approach to infrastructure. Developers can choose exactly the right tool for the right job. These tools are free or cheap to try, fast to integrate, and can scale with your needs. These days, you’ll often see a company use seven different databases (Redis for caching, Elasticsearch for search, MySQL, etc.), rather than being locked into a single deployment from a large vendor. The same goes for monitoring tools, compute environments, application frameworks and so on. For developers, it’s like being a kid in a candy store.
Continuous Deployment has been another boon for developers. No longer bound to monthly or quarterly schedules, release cycles have dramatically accelerated. This has enabled developers to move faster and stop being a bottleneck for the product and business folks. Development is doing more, faster, and with less. It’s the golden age for developers.
The Ops Nightmare
At the same time, these advancements are creating an “Ops nightmare” by burdening Ops (DevOps, SREs, IT admins) with a whole new set of challenges and expectations. Much of this has found expression under the #monitoringsucks hashtag.
Pace of change: There’s been a significant increase in the volume of production incidents that require monitoring and response. Why? Because the vast majority of production issues originate from internal code deploys and infrastructure changes. When major changes were limited to monthly or quarterly cycles, issues arrived at roughly that same pace. With continuous deployment (not to mention virtualization and infrastructure-as-code), the pace of change has increased dramatically, and with it the number of things that can go wrong. One can argue that continuous deployment actually reduces the potential for catastrophic failures, because changes are small and incremental. Even so, contending with a constant flood of alerts (most of which are noise, some of which are urgent) is a real challenge and a source of frustration.
Moving parts: The best-of-breed approach to modern infrastructure has raised the table stakes for Ops. There are many more moving parts, many more shifting dependencies, and many more monitoring solutions driving many more alerts. Troubleshooting in such environments has become an unending process of triage: filtering through constant alert storms to understand, prioritize, and respond to potential incidents. In short: alert fatigue. It’s not uncommon to hear Ops folks complain that 50-70% of their time is consumed by responding to alerts, a huge disruption that keeps them from their core responsibility: building business-enabling infrastructure.
The bottom line is that Ops teams are still in critical need of tools and workflows to deal with alert fatigue. A good starting point is for companies to find a way to organize noisy alerts into high-level incidents, get quick access to the insight they need in order to triage, and collaborate efficiently among stakeholders.
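To make the first of those steps concrete, here is a minimal sketch of what "organizing noisy alerts into high-level incidents" can look like: raw alerts that share a deduplication key (service and check name, in this hypothetical scheme) and arrive within a time window are folded into a single incident. All names, fields, and the window size are illustrative assumptions, not the API of any particular monitoring product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical alert and incident shapes; real monitoring tools will
# carry far more metadata (severity, tags, runbook links, etc.).

@dataclass
class Alert:
    service: str
    check: str
    timestamp: float  # seconds since epoch
    message: str

@dataclass
class Incident:
    key: Tuple[str, str]
    alerts: List[Alert] = field(default_factory=list)

def group_alerts(alerts: List[Alert], window: float = 300.0) -> List[Incident]:
    """Fold a noisy alert stream into incidents: alerts sharing the same
    (service, check) key within `window` seconds join one incident."""
    incidents: List[Incident] = []
    open_incidents: Dict[Tuple[str, str], Incident] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        inc = open_incidents.get(key)
        if inc and alert.timestamp - inc.alerts[-1].timestamp <= window:
            # Same symptom, close in time: deduplicate into the open incident.
            inc.alerts.append(alert)
        else:
            # New symptom (or the old incident has gone quiet): open a new one.
            inc = Incident(key=key, alerts=[alert])
            open_incidents[key] = inc
            incidents.append(inc)
    return incidents

# Example: three raw alerts collapse into two incidents, because the two
# "db / cpu_high" alerts land within the same five-minute window.
raw = [
    Alert("db", "cpu_high", 0, "CPU 95%"),
    Alert("db", "cpu_high", 60, "CPU 97%"),
    Alert("web", "latency", 30, "p99 > 2s"),
]
print(len(group_alerts(raw)))  # 2
```

Even this toy version shows why the grouping step matters: the on-call engineer sees two actionable incidents instead of an undifferentiated stream of three pages, and the compression ratio only improves as alert volume grows.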