AIOps: Beyond the hype – it’s not Hollywood AI
Updated: June 15, 2021
Category: The CTO Perspective
Author: BigPanda
Many AIOps initiatives experience difficulties due to unrealistic expectations and a lack of a clear AIOps strategy. What is the reality beyond the AI hype, and how do we make these initiatives a success?
Join us in this CTO Perspective discussion with Jason Walker, Field CTO at BigPanda, to find out.
Read the skinny for a brief summary, then either lean back and watch the interview, or, if you prefer to continue reading, take a few minutes to read the transcript. It's been lightly edited for readability. Enjoy!
The Skinny
Often, our initial expectation of AI is that it will think and act like a human, similar to what we see in Hollywood movies. In IT operations, this means that we expect AIOps (AI/ML-enabled IT Ops) to connect to our IT environment, understand it, see the alerts and immediately "know" what is wrong and how to fix things. But that's just hype. The reality is that AI in IT is algorithmic, and is based on alert normalization, enrichment and correlation patterns that we "humans" have to first set up for it to work properly. Basically, this means that we first need to translate our organization's tribal knowledge into AI and ML logic, and only then can it start assisting us and improving on this initial logic by locating patterns that we may have missed due to the sheer amount of data and its complexity. And throughout this learning process, the AI/ML also needs to be reviewed and improved by operators to work well. In short, the reality is that AI and ML are only as good as the "humans" that set them up and continue to improve them. How do we do all this? Watch the interview to find out.
The Interview
The Transcript
Yoram: Hello and welcome to the CTO Perspective, where we discuss unique perspectives about the most current issues in IT operations with the most current people in IT operations.
Here today with us is Jason Walker, former director of IT operations at Blizzard Entertainment, and currently BigPanda Field CTO. Hi Jason.
Jason: Good morning. Good to see you.
Yoram: Good to see you as well. It’s great having you here and talking to you. Let me just dive directly into the opening question: In my former years as an avid science fiction fan, when someone mentioned artificial intelligence, what came to my mind was a super-intelligence or sentient being, that knows a lot more than we do and can do a lot more than we can. We have no way of understanding what it does. We cannot participate in what it does, let alone control it. I have to say that even though I’ve matured since then, and the industry that we deal with is very educated and very technologically oriented, I sometimes still get the sense that when we talk to people about AIOps, in the back of their minds they’re expecting to get something similar. Am I mistaken?
Jason: No, I think you’re absolutely right, and there are two things going on here. One is the typical hype cycle around a new technology, and AI definitely falls into that category.
People have built it up in sales cycles so much that the expectations are very unrealistic. The other thing, though, is that when people look at AI, they tend to think it will come into their environment and be used in a human-like way. And so they anthropomorphize it and then give it superhuman capabilities.
The combination of these two things, the hype and the thinking of it as a very powerful and knowledgeable person, creates these inflated expectations about what it can do (especially from day one) in an organization that maybe doesn’t have any AI experience or capability yet.
Yoram: And that’s where the disappointment comes in, right? There are a lot of stories, from back when we started using AI and ML in IT operations, about them not working properly. I guess that’s where they come from.
Jason: Yes. And it’s very easy to demonstrate it in a POC or POV. It plugs into your ecosystem of tools very easily. But then the actual utility has to come next. And generally, both the organization (the teams) and the AIOps program have to work together to realize the potential. And oftentimes there’s a mismatch: those missed expectations in the beginning lead to insufficient setup and preparation, and that ends up dooming those initial efforts at that company.
Yoram: So, there is no magical Hollywood AI, is there? It’s something a little bit different. What is it based on? What’s the right way to do it?
Jason: It’s a science. What you are doing is feeding an algorithmic program, and a sophisticated one at that, a normalized set of data. You’re giving it some sort of framework to make decisions, and then you’re tasking it with certain actions.
And just like anything you do in software engineering, you have to be very careful about how you do that. Data preparation is probably the place where most organizations fall over.
Yoram: What do you mean by “starting out with normalized data”?
Jason: If you think about a simple term like “server” in an organization, a human will look at the word “server” and say: “OK, that’s the box that has some CPU and some memory, and a disk. And it does things for my IT applications”. You can call this box by about eight different names: “server” or “host” or “device”, or maybe by the manufacturer name. There are a lot of different ways to refer to it. And humans will instantly generalize and know all that. The AI doesn’t. If you have seven different data sources and they each call a “host” by different names, you will have to normalize that word and make sure all items are called the same, otherwise the AI will not recognize them as similar.
Yoram: So the AI doesn’t come in and totally understand this on its own. First of all, you have to somehow apply a common taxonomy throughout everything that you’re doing, for the AI to initially understand what it’s looking at.
Jason: Yes. And multiple sources make that much more complex because there is a real lack of standards across monitoring tools, and in the IT field in general. We have a lot of manufacturer specific language out there.
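To make that idea a little more concrete, here is a minimal sketch of what that kind of field normalization could look like. The source field names and the alias mapping below are hypothetical, purely to illustrate folding vendor-specific vocabulary into one common taxonomy before anything else happens.

```python
# Hypothetical example: map vendor-specific field names onto one common taxonomy
# so that downstream logic always sees the same key for the same concept.
FIELD_ALIASES = {
    "hostname": "host",
    "server": "host",
    "device": "host",
    "node": "host",
    "svc": "service",
    "service_name": "service",
}

def normalize_alert(raw_alert: dict) -> dict:
    """Rename known aliases to canonical field names; keep unknown fields as-is."""
    normalized = {}
    for key, value in raw_alert.items():
        canonical = FIELD_ALIASES.get(key.lower(), key.lower())
        normalized[canonical] = value
    return normalized

# Two monitoring sources describing the same box in different vocabularies:
print(normalize_alert({"Hostname": "db-01", "svc": "billing"}))
print(normalize_alert({"device": "db-01", "service_name": "billing"}))
# Both now carry "host" and "service", so later steps can treat them alike.
```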
Yoram: OK. And then what would be the second step? Once we’ve normalized everything? What do we do then?
Jason: Then you convert the data to some sort of map that the AI can use in its tasking. An enrichment map is what some companies call it, or key-value pairs: this host is related to this application, this application is related to this business service, this source system is related to this location from a network perspective, and so on. You just establish those relationships and deliver those maps of your environment to the AI, one by one. Now, all of your alerts and events that come in can go through those maps and be enriched with additional attributes, so the AI can use those in the next step, which is correlation.
Imagine an alert with two attributes, that is then enriched with 15 additional ones, and now all those fields are available for correlation patterns, i.e. grouping alerts on the basis of shared attributes.
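Here is a small, hypothetical sketch of that enrichment step: a couple of relationship maps and a function that walks them to attach extra attributes to a normalized alert. The host, application and service names are invented for illustration.

```python
# Hypothetical enrichment maps: key-value relationships that describe the environment.
HOST_TO_APP = {"db-01": "billing-db", "web-03": "storefront"}
APP_TO_SERVICE = {"billing-db": "Payments", "storefront": "E-commerce"}

def enrich_alert(alert: dict) -> dict:
    """Attach additional attributes to an alert by walking the relationship maps."""
    enriched = dict(alert)
    app = HOST_TO_APP.get(alert.get("host"))
    if app:
        enriched["application"] = app
        service = APP_TO_SERVICE.get(app)
        if service:
            enriched["business_service"] = service
    return enriched

alert = {"host": "db-01", "check": "disk_usage", "severity": "critical"}
print(enrich_alert(alert))
# The alert arrives with a couple of attributes and leaves with more,
# all of which are now available as dimensions for correlation.
```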
Yoram: I assume that once you do that, that is where the AI advantage kicks in? Now there is a common language and there are all these connections. The machine learning can start searching for patterns.
Jason: Exactly. And it’s going to apply instantaneously. Again, we don’t want to anthropomorphize it. It isn’t searching; it’s instantaneously looking at every event that comes in. It has a time window, it sees what matches the criteria it has for attributes, and it groups those alerts together. It’s very systematic, and it’s very consistent and reliable, but it is not intelligent. It will never make a cognitive leap to say “maybe that network alert is related to that application alert” if it doesn’t know anything about those two systems.
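A rough sketch of that kind of systematic grouping might look like the following. The attribute used for matching and the length of the time window are assumptions for illustration; the point is simply that alerts sharing the same attribute value within the same window end up in the same group, with no inference involved.

```python
from collections import defaultdict

# Hypothetical correlation pattern: group alerts that share the same
# business_service within a fixed time window (here, 15 minutes).
WINDOW_SECONDS = 15 * 60

def correlate(alerts: list[dict], key: str = "business_service") -> list[list[dict]]:
    """Bucket alerts by a shared attribute and a time window; no inference, just matching."""
    groups = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        window = alert["timestamp"] // WINDOW_SECONDS
        groups[(alert.get(key), window)].append(alert)
    return list(groups.values())

alerts = [
    {"timestamp": 1000, "business_service": "Payments", "check": "disk_usage"},
    {"timestamp": 1200, "business_service": "Payments", "check": "db_latency"},
    {"timestamp": 1300, "business_service": "E-commerce", "check": "http_5xx"},
]
for group in correlate(alerts):
    print([a["check"] for a in group])
# The two Payments alerts land together; the unrelated alert stays on its own.
```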
Yoram: OK. So AI and ML are only as good as the people that set them up to begin with.
Jason: Yes. And we’re not the first ones to say that. But, you know, good data going in will get you good results, and a good application of your existing tribal knowledge in configuring the AI will definitely benefit you, but you have to learn how to do it well.
Yoram: And then when it starts learning, once again, you have to be there to tell it if it’s doing it right.
Jason: Absolutely. And that means putting a human in the loop and giving a verification pass to every bit of output that the AI or ML is producing.
If you put a human in there who can validate, then the AI is going to pick that up and be able to use it going forward.
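One hypothetical way to picture that verification pass in code: record each operator's accept or reject verdict on a suggested grouping, and only treat a correlation pattern as trusted once operators have consistently confirmed it. The names and thresholds below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical feedback loop: track operator verdicts per correlation pattern
# and only auto-apply patterns that operators have consistently accepted.
verdicts = defaultdict(lambda: {"accepted": 0, "rejected": 0})

def record_verdict(pattern_id: str, accepted: bool) -> None:
    """Store a human reviewer's accept/reject decision for a suggested grouping."""
    key = "accepted" if accepted else "rejected"
    verdicts[pattern_id][key] += 1

def is_trusted(pattern_id: str, min_reviews: int = 5, min_rate: float = 0.8) -> bool:
    """A pattern is trusted once enough reviews exist and the acceptance rate is high."""
    stats = verdicts[pattern_id]
    total = stats["accepted"] + stats["rejected"]
    return total >= min_reviews and stats["accepted"] / total >= min_rate

for _ in range(5):
    record_verdict("same-business-service-15m", accepted=True)
print(is_trusted("same-business-service-15m"))  # True: operators keep confirming it.
```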
Yoram: This is what is commonly known as the black box phenomenon in AI. If you let AI correlate on its own, it is not going to be able to know if it is doing something right or wrong. And if you’re putting the human in the loop, then you’re making it better.
Jason: Right. And that’s a key component of almost all AI. I always point to Google spam filters. Google has good spam filters, with a baseline configuration that’s very sophisticated at detecting spam and helping us all out. But it does give you the choice of classifying something as spam. And that’s key, because all too often things get lost in that folder. And so, you have to put a person in the middle to prevent that from happening. The same is true in IT Ops: your team has to trust what the AI is producing.
And that trust comes from explainability: “I understand why the AI put these alerts together and now I can either accept that or I can change the configuration until it makes sense to me or produces the desired result”.
Yoram: So it’s not just about making the results better by telling the AI and the ML what it’s doing right or wrong. It’s also about gaining trust in the organization, because if you don’t trust AI, it’s obviously not going to work well. And then it sort of snowballs.
Jason: Right. You can have the best AI in the world, but if nobody understands why it’s grouping alerts together, why it took these two seemingly unrelated events and said these two are part of the same incident, then your team is going to reject it out of hand, even if it is correct.
And that’s kind of the interesting thing, because an AI might, by looking at hundreds of different dimensions simultaneously, find a relationship that humans didn’t. And even though that is very useful knowledge, if the humans don’t understand why it did that and how it did that, then they won’t be able to act on that.
Yoram: So us humans are important throughout the whole process of AI and ML, we’re not redundant, happily enough…
Jason: I always told my team that if they could just manage the ML and the AI, just run their bot army, then they would always have a job, because that’s the new job in IT Ops.
Yoram: I’m assuming that’s the philosophy of BigPanda’s Open Box Machine Learning.
Jason: Yes. BigPanda has taken a very pragmatic approach to it and said: Hey, I’m an organization. I’m looking at all the IT Ops knowledge that is out there, all that tribal knowledge. I’m going to take that in first, convert it to enrichment maps, and then I’m going to take the existing correlation patterns that probably most IT Ops teams recognize – this change is related to this alert or will cause these alerts, this network alert almost always impacts that database or that application, and so on. We’ll start with those as the initial correlation patterns. Then we will surface up suggestions as the ML on the back end crunches through all of that data and starts to see patterns that maybe people didn’t.
Yoram: And what was your experience in Blizzard with Open Box Machine Learning?
Jason: For me, there was a light bulb moment. We had fed in several different enrichment maps, and one of the things we did was take our troubleshooting guides and our run books, which had a whole bunch of condition IDs, and the associated service they were diagnosing as a failure. Those had been put together over a period of years by engineers from multiple different teams. They were phrased like “you may see these alerts when this type of failure occurs”, and we had almost a thousand of them. We converted those to enrichment maps, not thinking that they would ever be used for anything in particular. And within the first week in the POV, BigPanda started surfacing incidents that said: hey, all of these alerts were in the same troubleshooting guide and we think they’re related. And it was correct. And it was something that we had never looked at, as a potential correlation pattern. But it didn’t skip that step. It didn’t overlook that.
That’s, again, one of the advantages of good ML that is set up well: it will not miss anything that you may have missed as a human. It was an instant win for the team because the trust was established right then and there.
Yoram: Cool! So there is a bit of magic… but it’s always based on the way that you set it up.
Jason: Absolutely. And it’s incremental magic, not “switch it on and it takes over your IT operations function entirely”. It’s very much: “Oh, wow, I didn’t know it could do that”. In that respect it’s much better than many of the other tools I’ve seen: you build on it, and you get better at using it over time.
Yoram: Great. I think that’s a high note that we can end on. Thank you so much for talking to me today.
Jason: That was a good talk. I really appreciate the time.
Yoram: I appreciate it as well. And if you want to learn more about BigPanda AI or about the BigPanda platform, just view our other CTO Perspectives videos or visit us at BigPanda.io. See you next time.