SREs can help enterprise IT in its digital transformation
Updated: February 7, 2020
Category: Ops and Teams
Author: Stefan Apitz
Digital transformation of IT Ops
When I think of the term IT Ops I immediately think of Enterprise IT and the traditional attributes that make up this function – many of which are in the middle of an industry-wide disruption – and its associated impact.
At LinkedIn, when we first looked at business process support, shadow IT and non-accounted-for IT spend, about 10 years ago, it was a bit of a revelation to me how the landscape had already changed by then. No longer were business owners forced to use technologies implemented and managed by an internal IT organization – they were able to leverage what they thought they needed to get their business done, in a way that made sense to them.
Along with that came the need for IT to change its approach, as common capabilities, such as security and compliance, tooling, vendor management, etc., still had to be managed and coordinated in a centralized function – but now, as services provided back to the business units.
For us it was important to ensure we re-positioned ourselves as an enabling and support function that partnered with the business process owners, providing advice and capabilities to collectively move the business forward.
Today the picture is even more severe, especially for large, well-established enterprises. For these enterprises, the concern is not solely about inward-focused enterprise business processes and services. More importantly it is now about services and products that make up the company’s business itself, which – as we know – is being threatened by small digitally-native disruptors and nimble competitors, forcing incumbents to transform themselves rapidly.
This means that transforming IT Ops is more than an exercise in merely understanding internal dynamics and gaining efficiencies, as it was for us at LinkedIn. Rather, it has become essential for IT Ops to evolve into a key stakeholder and enabling function that allows the business to move at the speed it needs to and compete with an ever-expanding list of competitors.
This type of massive change, however, is very hard for traditional enterprise companies. That’s because it affects how services are delivered and how products are designed, architected and deployed, which means that it also impacts the organizational structures and processes of the IT and Engineering functions themselves.
Learn from hyper-scale service providers
In contrast, consider service providers such as Facebook, Netflix, Google, and more recently, Spotify, Uber, Hulu, etc., which have successfully overcome the problems of handling rapid and massive user growth, and the hyper-scale challenges that come with that.
I recall the time at LinkedIn when we were establishing performance, stability and security as the key priorities for our Dev and Prod Ops organizations. Because the business was focused on growth and engagement at the same time, it used to seem like we were continuously having trade-off discussions.
Of course, we understood at an intellectual level how our goals and the goals of the business were deeply correlated. In practice, however, in Production Operations, running a set of services at scale, while engineering was rapidly developing new features and service designs to meet growth and engagement goals, was a challenge.
And it forced us to collectively figure out how to do both together, while maintaining our performance and resiliency characteristics.
It was hard, and it prompted a significant rethinking of our operational capabilities and approaches. Traditional models of segregating Dev and Ops, ITSM best practices, and other enterprise-y approaches and frameworks were no longer working well enough for us to achieve our goals.
We had to revise ownership models and implement the associated cultural changes, in addition to implementing revised tooling and architectural design approaches. This was non-trivial, both from a technology point of view, and from a cultural point of view. But we had the advantages of being a relatively young company with innovative ideals and a relatively homogeneous services architecture, that had in principle – not entirely in practice – moved away from monolithic design principles.
Service ownership as an enabler
I actually think about Service Ownership quite a bit, as a key factor in a company’s ability to transform itself. By that, I mean the implemented principles and incentive systems regarding the development and operation of services in a production context. An enterprise technology company, for instance, may have delivered on-prem, packaged tech in the past that was implemented and supported by distinct organizations. An engineer there would typically have worked on a project team developing, building and testing a certain revision or application feature-set – and when that was complete, move on to the next project.
When you now move into a service-centric world, as was the case at Oracle for example, engineers now had to be able to develop, build and run “their” service in production. This was a complicated paradigm shift.
All of a sudden, developers had to design for production operations concepts like resiliency, observability and scalability, which they ideally established and shared with the IT Ops team, rather than simply handing their service off to be run by someone else.
Other enterprises that are also transitioning from a traditional development approach to a more agile model, in order to release changes to production faster, must make cultural and organizational changes to support this capability. In the process, they will also go through the shifts in culture and technology / service ownership that we did, at Oracle.
For companies that are most likely to succeed in achieving this, the 2018 State of DevOps survey by DORA has some very interesting results and takeaways. The survey report states that a cohort of elite performers (enterprises) are able to achieve 25x faster deployments and 7x lower change failure rates by implementing capabilities such as Shared Observability, Loosely Coupled Services, Automated Infrastructure and Service Deployments, etc., compared to those enterprises that aren’t in the elite cohort.
SREs help bridge the gap
Site Reliability Engineering has been a pivotal function to enable these capabilities in the companies I have worked for. Initially, we implemented these capabilities as a group of Ops engineers who were passionate about automating as much as possible around infrastructure and service management. Eventually we allocated them to a team of developers and tasked them with co-owning a set of services. The benefits of incentivizing smart engineers to maximize service uptime and performance – when those engineers also have a history of carrying pagers and jumping on outage bridge calls, and working side by side with the folks who actually write the code for these services – should be obvious.
SREs understand operational concerns, have a good sense of infrastructure, topology and constraints, and are well versed in the dynamics of production environments, such as usage and traffic patterns, seasonality, resiliency and failover procedures. Getting developers to internalize these concepts around their services, and other concepts around the system as a whole, is invaluable in my view. As such, this capability has to organically evolve over time across the development organization. SREs provide this capability, while themselves learning about new services and designs, which they can help influence directly.
At LinkedIn, the SREs eventually became part of development teams. They were probably the most sought-after resources, especially those that had been with the company for a while and had developed an understanding of the dynamics of complex, interdependent environments. Once SREs and developers partnered on service ownership, they were able to work together on a number of fronts: implementing capabilities that made their services more resilient to infrastructure failures, implementing the instrumentation necessary to share metrics across teams, and handling rollouts and rollbacks more gracefully.
Sharing Access and Knowledge
So when I look at a traditional IT Ops department inside a company that is looking to transform itself and become more agile, I think it is essential that it embrace the concepts that large scale service providers had to learn a decade ago. Specifically, that means building a culture of shared ownership, and incentivizing Dev and Ops teams on the things that matter the most to the business. In most cases, that translates into shared ownership and incentives around deployment frequency, uptime, performance, change error rates and security capabilities.
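As a rough illustration of what a shared Dev and Ops scorecard might compute, the sketch below derives deployment frequency, change failure rate and uptime from a list of deployment records. The `Deployment` record, its fields and the function names are hypothetical, chosen purely for illustration; they are not taken from any tooling at LinkedIn or Oracle.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    service: str
    day: date
    caused_incident: bool  # True if the change was rolled back or hot-fixed


def deployment_frequency(deployments, period_days):
    """Average number of deployments per day over the given period."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return len(deployments) / period_days


def change_failure_rate(deployments):
    """Fraction of deployments that led to an incident (0.0 if there were none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_incident)
    return failed / len(deployments)


def uptime_fraction(total_minutes, downtime_minutes):
    """Share of the period during which the service was available."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    # Made-up sample data for one service over one week.
    week = [
        Deployment("profile-api", date(2020, 2, 3), False),
        Deployment("profile-api", date(2020, 2, 4), True),
        Deployment("profile-api", date(2020, 2, 5), False),
        Deployment("profile-api", date(2020, 2, 7), False),
    ]
    print("deployments per day:", deployment_frequency(week, 7))
    print("change failure rate:", change_failure_rate(week))
    print("uptime:", uptime_fraction(7 * 24 * 60, 10))
```

The point of a sketch like this is less the arithmetic than the fact that both teams read the same numbers from the same data, which is what makes shared incentives workable.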
In my experience, it is essential to leverage the skills one has in an IT Ops organization with respect to understanding how complex systems work, and to marry those with developers’ knowledge.
There are almost always certain employees who simply know more than anyone else in the organization about specific aspects of the overall system. These folks are often found in a classic IT Ops team and are essential for helping developers assimilate the context in which their services operate, so that they can rebuild these services over time and accomplish the company’s business goals.
One very real example from the early days at LinkedIn was the story of inGraphs and how it was conceived. The democratization of production telemetry that allowed engineers to understand how their service was doing in real-time was game changing. It was initially developed and implemented by the SRE team and eventually became core to driving the ownership of production services back into development.
Observability in general is one such capability that I would describe as fundamental. Only when you enable developers to actually observe and learn how their services are doing in a production context can you even begin a discussion around ownership and accountability. This learning begins with the sharing of data, principles and designs that drive towards collectively owned goals.
Summary and Takeaways
Large scale service providers had to learn or invent new practices and methodologies in order to build and grow their businesses. Several of these practices have now become established, as either organizational capabilities or tools, and have found their way into traditional enterprises. Today, these principles help traditional enterprises accelerate their digital transformation.
SREs are the glue between traditional IT Ops knowledge and capabilities, and the newer engineering-driven world. It is important to embrace the skills and experience IT Ops has developed and gained over time, combine them with the newer principles, and passionately foster sharing without fearing change.
Do this and your enterprise can, and will, be well positioned for the future.