Those of us lucky enough to have attended the August MonitoringScape meetup at BigPanda with Adrian Cockcroft enjoyed the master of performance tuning as he waxed philosophical about why monitoring isn’t a solved problem after twenty-plus years of honing our craft. Spoiler alert: it’s difficult to solve a problem that keeps changing.
SREcon16 is a wrap, and our team had a blast at this year’s event! Both days were non-stop action: demos, discussions, and - of course - handing out our fair share of panda swag. Between the buzz on the floor and in the sessions, what topics were top of mind at this year’s show? Here are our three key takeaways:
For many IT and Ops teams, Nagios is both a blessing and a curse. On the one hand, Nagios gives you near real-time visibility into the inner workings of your IT infrastructure. But on the other hand, Nagios can generate so many alerts that it’s impossible for any single person (or even any team) to keep up.
If you’re struggling with a flood of Nagios alerts, this two-part blog series is for you. We’ll take a close look at the complicated relationship that IT and Ops professionals have with the monitoring tool, explain why Nagios is so noisy, and discuss a simple way to take charge of your alerts and get the most out of Nagios.
It was a whirlwind couple of days, but FutureStack15 is in the books! I’m sure I speak for most of the BigPanda team when I say that a weekend of rest was welcome after the long (but exciting!) days at the show. Between demos, conversations with clients and prospects, and even a surprise visit from Weird Al, there was hardly a moment of downtime. But in our world, who likes downtime? (Excuse the terrible pun.)
In case you missed it, here’s a recap of some of the key themes discussed at the show:
In between sessions at last weekend’s DevOpsDays Silicon Valley, scores of attendees filled the halls, flooding the Computer History Museum with chatter and turning it into something more akin to a high school cafeteria than a conference venue. As crowds formed to share their stories and insights with one another, a common theme quickly emerged: It just isn’t as easy as we thought it would be.
In the last two decades, with the emergence of cloud infrastructure and SaaS delivery models, the monitoring ecosystem has changed dramatically to include over 100 monitoring solutions. The upside of that change is that monitoring infrastructure can be implemented rapidly, but the unintended consequence is that the tools themselves decide what IT measures.
Rishi is too humble to be the CIO of a Fortune 100 bank, too busy to be the father of four, too accomplished to blog about ice cream, and too educated to love John Gray. Mostly, he's too unpredictable to fit stereotypes and too passionate about everything he does to do anything at less than full throttle.
I met Rishi this week at the Pacific Crest Global Technology Leadership Forum in Vail, where he was presenting and I was lucky enough to be in the audience. We spent an hour together before his talk that inspired me to rescue Nepalese orphans... and eat more ice cream.
Rishi's been an IT leader since before we called it that. He has helped organizations grow and shrink and grow again. He's more scared about the state of IT today than he has ever been.
Here are excerpts from the discussion...
The last ten years have brought enormous changes to production environments, driven by a best-of-breed approach to production infrastructure enabled by open source and cloud. This has been a boon for developers in terms of flexibility and productivity, but it’s also placed a new set of challenges and expectations on Ops.
We engineers love measuring stuff. Whether it helps us solve an immediate problem, prepares us for a bad day, or simply satisfies the information junkie in most of us, we love keeping track of metrics. The spectrum of what can be measured is very wide. It can include data from every part of our system: from technical metrics such as disk space or RPM, through UI metrics like page load times, to business KPIs such as revenue, conversion rates and so on. When choosing which metrics to collect, we usually start with the obvious ones: those that reflect the current state of the system (e.g., CPU, memory and load). There are quite a few articles and blog posts about these metrics, so I’m not going to discuss them here. Rather, I would like to focus on metrics that reflect the user experience.
Here are the four metrics that we at BigPanda see as the most important in this category:
One of the first things we do right after installing Nagios is set up email notifications. Without them, how would we know when something goes wrong?
In many ways, incident management for devops is similar to typical issue tracking processes: it facilitates coordination and collaboration on daily tasks. For this reason, tools such as Jira, Zendesk, and even email are often used as solutions for incident management. But incident management faces one unique challenge that makes it different from other issue tracking processes. In addition to human-operated workflows, incident management also relies heavily on machine-driven workflows. Unfortunately, traditional issue trackers and ticketing systems cannot accommodate this with their current product mechanics.
Few things damage productivity as much as waiting. Waiting forces us to context switch, disrupts our creative momentum and eliminates our ability to experiment. Whether we are deploying a new service or troubleshooting a problem, waiting puts a heavy tax on efficient work.
It’s well known in IT operations that things don't break on their own. Close to 80% of production outages occur because of changes made by developers or someone in IT. However, this fact often eludes us when it comes to actually resolving production issues.