RESOLVE ’22: Behind the scenes
What do a sinking ship and an improperly equipped data center have in common? For Dell Senior Director of Global Network and Datacenter Services Paul Beninati, quite a lot.
At least, that is the case from the perspective of company proactivity and ITOps performance goals.
The comparison might sound dramatic, but it holds water under scrutiny: “Someone saying ‘I’ve got a network problem in a data center’ with little context is like saying ‘there’s a ship sinking in the Atlantic Ocean,’” he said at our RESOLVE ’22 event titled Behind the scenes.
The panel discussed the importance of technological proactivity and put heavy emphasis on the value of site reliability engineering (SRE). This practice uses automation to save human time on rote ITOps tasks, but the change required to get there isn’t all technological, as our panelists discussed.
Making the switch for proactive response
Event moderator and BigPanda Regional VP of Professional Services Sales Jordan Gamble carried the sinking ship analogy a step further. “With the right outlook and tools in place, we can start to understand that the ship might not be sinking, but it has a hole. And that hole might get bigger.”
For digitally maturing companies, the objective is clear. “We need to be able to react before the ship ever starts sinking.”
Dell Senior Director of Reliability Enablement Neha Wadhwani brought another outlook on proactivity to the table. Leading a “monolithic IT Operations group of 1,600 people” that leadership viewed as a cost center was a day-to-day challenge, she said.
“And from that, we’ve really transformed to a nimble, product-oriented model where we’ve adopted SRE and DevOps,” she said. It has been “a tremendous transformation… and we’ve seen many benefits in terms of our outage time, the health of our production system and our ability to launch new products.”
Our third presenter, Alex Meyer, senior DevOps engineer at Machinify, offered a growing company’s perspective on proactivity. Instead of struggling with problems on the scale of a Dell, his team found itself mired in communication issues.
When “it was DevOps supporting the entire tech stack, including a lot of the applications,” he said, a lot of data that should have gone straight to engineers was routed to the DevOps team for triage instead.
“We’ve really shifted our focus to send that data directly to our engineers so they can make data-driven decisions and react in quicker time,” he said. “And the result of that is more ownership, quicker mean time to resolve [MTTR], and more observability in general into our application stack. Which has been wonderful.”
Better observability has cultural requirements, too
Much of the discussion focused on something Paul called the “SRE mindset”: an outlook that emphasizes observability and quick response.
The emphasis applies to both technology and the attitude teams must bring to the table. In that regard, the SRE mindset may be more accurately described as an SRE culture. All three panelists stressed the importance of removing finger-pointing from the company’s problem-solving process. And reaching that point, Paul said, requires companies to align broad needs with departmental and individual functions.
“Let’s say I break something, right? The application teams that get on there and say ‘Okay, I need to get away from the problem, stop the bleeding, get my company running…’ are the ones embracing the SRE mentality,” he said. “Because, to them, it’s: ‘Let’s get this up and down in minutes.’”
“There are three primary levers that I can pull as a DevOps engineer: people, process and technology,” Alex replied. “And the people component is lost a lot. Understanding how people work and how your teams work with each other is very important.”
Alex spoke to the concepts of siloization and institutional knowledge. He said when he first started at Machinify, the company “just wasn’t tracking its incidents properly.” In many cases, incidents were being resolved through direct messages between engineers rather than being exposed and analyzed, as they should have been.
Somewhat humorously, Alex’s team saw a turning point when they implemented a tool called Blameless.
“I realized during our retrospectives that there was a lot of: ‘Well, the server team did this; SRE team didn’t do that.’ And I worked with each of the teams individually to find different ways to say that same thing but with a less finger-pointing mentality.”
It’s an approach most companies can emulate, regardless of their current tech stack or roadmap. And it’s a solid reminder that pulling the “people lever” Alex references can have all the impact of a strong new tool integration or technology overhaul.
“There’s no SRE for Dummies.”
That quote (credited to Jordan) points to a very real problem in organizations ranging from Dell to Visionworks and beyond. While SRE and its core tenets have gained more mindshare of late, there’s no single direct line a company can take to fully adopt it.
There is a commonly accepted starting point, however. Neha said companies looking at SRE must “start small and do a proof of concept with an important area. Learn from that and incorporate what you learn into other areas. Don’t try to do it all in a big-bang fashion.”
Likewise, Paul said companies can realize big results by “looking at your incidents and looking at where your biggest points of pain are first. Lay them out on a whiteboard. One or two of those points is going to be an opportunity where you can pick up a whole bunch of time relatively quickly.”
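For teams that want to try Paul’s whiteboard exercise with data, here is a minimal Python sketch. It assumes incidents can be exported as simple (category, minutes-to-resolve) pairs; the sample records, category names and the `rank_pain_points` helper are hypothetical illustrations, not anything the panelists described using.

```python
from collections import defaultdict

# Hypothetical incident export: (category, minutes_to_resolve).
# Real data would come from whatever incident tracker a team uses.
INCIDENTS = [
    ("network", 120),
    ("database", 45),
    ("network", 90),
    ("deploy", 30),
    ("database", 60),
    ("network", 150),
]


def rank_pain_points(incidents):
    """Return (category, total_minutes, count) tuples, largest total first."""
    totals = defaultdict(lambda: [0, 0])  # category -> [total minutes, count]
    for category, minutes in incidents:
        totals[category][0] += minutes
        totals[category][1] += 1
    ranked = [(cat, total, count) for cat, (total, count) in totals.items()]
    # Sort by total time lost so the biggest pain point comes first.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


if __name__ == "__main__":
    for category, total, count in rank_pain_points(INCIDENTS):
        print(f"{category}: {total} min across {count} incidents")
```

Ranking by total time lost, rather than by incident count alone, is one way to surface the one or two areas where a team might recover the most time quickly.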
If there is “no SRE for Dummies,” there are still ways companies can learn from the collective knowledge of others. Neha recommended that every team member affected by the change read The Site Reliability Workbook and added that another book of practical use cases (also from Google) can deepen that understanding.
“Don’t reinvent the wheel,” she said. “It’s important to [come in from the mindset of] getting started.”
Alex echoed the other panelists: “The journey’s different for every user and organization. Identifying pain points before implementing solutions is very important. It’s important to figure out where you are falling apart and how you can patch those holes.”
Learning to be excited to fail
The panelists repeatedly touched on the topic of cultural/professional “safe spaces” and their importance in building a true culture of blamelessness.
“No organization is taking giant leaps from the DevOps or SRE mentality,” Alex said. Instead, creating iterative improvements is critical.
Alex specifically recommended building redundancies and safeties into systems that allow for safe failure instead of the stressful, reactive kind: “If you have a VP of sales yelling at you because some component is failing in their demo, that’s very stressful, and it doesn’t help to solve the problem.”
An SRE mindset is focused on taking away the situations that make people yell from stress—a production system safely switching to failover and giving response teams a chance to learn at a more leisurely pace instead of scrambling to repair, for example.
Though it’s unwise to call any technology or mindset a panacea, Neha did bring up the positive, self-reinforcing aspects of SRE. Because SRE-enabled teams tend to achieve drastic time-based results, the value to decision-makers becomes “very clear.”
Behind the scenes with Dell, Machinify and BigPanda
Our RESOLVE ’22 panelists have a lot more thought leadership to offer. We invite readers to view the full webinar at the following link.