AIOps as a psychologist: getting teams to work together in crisis
Updated: July 1, 2021
Category: Ops and Teams
Author: BigPanda
As a unifying force, AIOps plays a key role in building collaborative relationships, which ultimately translate into productivity and efficiency.
On the face of it, AIOps is a technology aimed at solving operational challenges – combining big data and Artificial Intelligence/Machine Learning to automate and improve IT operations.
But there is another highly beneficial side to it that should not be underestimated, a psychological one: helping to create a collaborative atmosphere that brings people together within and across teams to solve IT Ops challenges as a ‘tribe’.
Incident response is a highly dynamic function that, by definition, relies on coordination between teams and individuals, often encompassing dozens of stakeholders. All these stakeholders, who don’t necessarily know each other personally, need to come together in real time as soon as they are alerted or notified, to make smart decisions under pressure. And they are required to do so while taking into account a number of different perspectives, monitoring multiple tech stacks, catering to the different priorities of all parties involved, and keeping track of how the incident itself is progressing.
And it’s here that psychology really kicks in.
Mean Time to Innocence
When there are many cooks in the kitchen, personalities and egos reign supreme. In large enterprise environments, when many teams are involved in troubleshooting an incident or an outage and each team is responsible for a separate tech stack, process, or service, they’re understandably eager to establish their “innocence”, often driving efforts in different directions.
This problem can even plague start-up environments with small teams where most team members know each other well and are experienced at working together. The issues surface when they deal with their enterprise customers, where large-scale dynamics rule. In such cases, the challenge lies in ensuring that the client’s team is on board and cooperating with the incident response process, as they may not even be aware that something on their end is actually causing the problem. Additionally, they tend to assume that if there is a problem, it’s the service provider that is at fault.
The key to getting the incident response process off the personal track and onto the practical one, focused on solving the incident, is trust, and trust in turn is achieved through unified, objective visibility.
It’s here that AIOps can really shine.
Ego, Id and AIOps
A lack of trust breaks down collaboration. When different teams can see only their own perspective on the same incident, and none of these perspectives provides a complete picture, the technology or tool being used is often the least of their worries. It’s the subjective opinions of different team members that pose the biggest challenge.
Operations executives will readily agree that the proverbial “bridge calls from hell” that follow severe service disruptions are heavily affected by disarray and distrust among the participants. Discussions often center on who knows best, who’s to blame, what really should be done at this stage, and more.
In these situations, providing comprehensive visibility – driven by AIOps – can make the difference.
AIOps provides all stakeholders with a consistent, unified picture of what is going on: which alerts are coming in, which services are being impacted, which dependencies are associated with the point of impact, and how the incident has progressed over time. In addition to providing this incontrovertible data-driven view, AIOps also provides actionable intelligence so that teams can identify precisely where the problem lies, and what can be done to solve it.
This objective data is the common language for all IT teams, the connective tissue around which they can bond. It’s not about one team member disagreeing with another’s thoughts that are based on partial information; it’s not a matter of another team having a different opinion; and it’s not about competing intuitions.
Suddenly, with AIOps, it becomes clear that the point of failure is ‘here’, based on actual data showing that 15 minutes ago the process was working fine but 14 minutes ago it slowed down, right when a certain change happened. There you go: ‘that’ is the source of the problem. It’s no longer a matter of finger-pointing or conjecture; it’s a matter of fact – and that’s something everyone can agree on.
Now, everyone is working together, going through various scenarios to find the solution to the problem.
By providing this visibility, AIOps creates a ‘tribe’ of people with a common cause. People who feel a bond with each other are more naturally open to working together, and this can also include external connections, between service providers and their customers.
And now that we have a tribe, as any psychologist or sociologist will tell you, we need a chief.
The incident chief
Once everyone is on the same page, collaboration is easier, but ultimately, there still needs to be a unified incident commander, a chief. Someone who is responsible for keeping the tribe together during the challenging time ahead.
Unified command is more than a function, more than an ‘incident owner’. It is a way of unifying not only the decision-making landscape, but also communications, and the way teams interact with each other. It’s impartial, with no favoritism involved. Its sole goal is to be the focal point that brings all the information about the outage and its impact together in one central repository, so that the incident can be handled effectively and the outage resolved quickly. AIOps is imperative for facilitating this, as it is the basis for any process put in place and provides the focal point for troubleshooting and decision making.
Ensuring efficient incident response is about managing people and processes, as much as it is about technology, and AIOps is a cornerstone for both.
What are the fundamentals of unified incident command? How do you implement them in your organization? To learn more, we invite you to watch our “Successfully Leading Critical IT Functions During Incident Response” panel with Blackrock 3.