Sony Interactive Entertainment drives better ITOps


Sony Interactive Entertainment (SIE) is a multinational video game and digital entertainment company owned by global conglomerate Sony. SIE primarily operates the PlayStation brand of video game consoles and products.

Challenge

  • The high volume of noisy, low-quality alerts surpassed the network operations center (NOC) team’s capacity to manually manage every incident.
  • Siloed monitoring and observability tools forced the NOC to waste time switching between several tools and consoles in an effort to detect and manually troubleshoot a single IT incident.
  • Without alerting standards, critical context and metadata were missing from alerts, hindering the NOC’s ability to interpret alert priority and next steps for incident resolution.

Sony Interactive Entertainment’s NOC was flooded daily with event noise from across the IT environment. To investigate the root cause of a single incident, the NOC often had to switch manually between four to five siloed monitoring dashboards. Even then, operators could not easily interpret the significance and priority of each alert related to an incident, which prolonged mean time to resolve (MTTR).

“When you have very low-quality alerts that are just making noise, you cannot act on them. You get lost in all that noise because you can’t make sense of what’s actually happening within the environment or how critical the alert is,” said Priscilliano Flores, staff software systems engineer at SIE. “If a service were to go down or become intermittent, that could have a direct impact on our gamers and our partners.”

Solution

  • Single pane of glass for end-to-end visibility across varying monitoring tools
  • Alert quality standards to promptly identify priority actions to speed incident resolution
  • Analytics dashboards that show the impact of alert-quality standards on cross-functional incident response workflows

Single pane of glass for end-to-end visibility of IT operations

“As we modernized our observability and monitoring systems, our operations and engineering teams integrated their monitoring and sent all alerts and events into BigPanda. Our NOC now only needed to look at BigPanda’s single pane of glass for full visibility,” said Flores.

The BigPanda Alert Intelligence service gave SIE a comprehensive view of its IT environment, eliminating the need for the NOC to switch between multiple monitoring dashboards when responding to an incident. The service reduced fragmented alert noise from each monitoring tool and enriched alerts to meet Sony’s requirements for actionability, allowing its teams to prioritize and act on incident alerts much more efficiently.

Alert quality standards ensure actionable alerts and priority resolution

Alerts are considered actionable only after they have been enriched with the minimum technical context (location, host, or affected service) and business context (priority, responsible team, and so on) that support teams need to act on an incident. With affected services, relevant runbooks, and historical incidents readily available, L1 operators can resolve more incidents without escalation.

Standardizing alert quality requires cross-functional NOC, engineering, and operations teams to agree on common definitions of actionability. Alerts must be enriched with the minimum technical and business context before the NOC can act on them. For Sony’s NOC to support a team, that team’s alerts must first comply with the agreed requirements for the details sent to BigPanda.
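To make the idea of a minimum-context standard concrete, here is a minimal sketch of an actionability check. It is an illustration only, not SIE's rules or BigPanda's payload specification: the field names (`location`, `host`, `service`, `priority`, `team`) and the rule that one technical field plus every business field is required are assumptions made for this example.

```python
# Hypothetical field names, chosen for illustration only.
# They are not taken from BigPanda's payload specification.
TECHNICAL_FIELDS = ("location", "host", "service")
BUSINESS_FIELDS = ("priority", "team")


def missing_context(alert: dict) -> list[str]:
    """Return a description of each piece of required context the alert lacks."""
    missing = []
    # Assumption: at least one technical field must be present.
    if not any(alert.get(field) for field in TECHNICAL_FIELDS):
        missing.append("technical context (location, host, or service)")
    # Assumption: every business field must be present.
    for field in BUSINESS_FIELDS:
        if not alert.get(field):
            missing.append(field)
    return missing


def is_actionable(alert: dict) -> bool:
    """An alert is actionable when no required context is missing."""
    return not missing_context(alert)
```

Returning the list of missing items, rather than only a yes/no answer, gives the sending team something specific to fix in its alert payload.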

“If you have very loose or no standardization around what data should be sent into your event management tool, then it is inevitable that important information will be missing, gets lost, or you won’t be able to identify critical impact before it is too late and an outage has already occurred,” said Flores. “Actionable alerts prevent this manual toil.”

Unified Analytics monitors and evaluates incident management performance in real time

SIE measures and tracks live incident management performance to see which teams are properly enriching their alert payloads and which are not. This feedback loop is critical to managing compliance as the number of teams and the volume of alerts grow.

“We use Unified Analytics for visibility into which outlier teams are either compliant or those that are not following our defined BigPanda payload specification for alert quality standardization,” explains Flores. “Because we have visibility into these gaps, we can communicate with these teams and create a plan to properly onboard them onto BigPanda and convert the remaining low-quality noise into actionable alerts that enable the right incident response team to be notified at the right time.”
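The kind of per-team compliance view described above could be approximated as follows. This is a rough sketch under stated assumptions, not how Unified Analytics computes anything: the required field names and the `source_team` key identifying the sending team are hypothetical.

```python
from collections import defaultdict

# Hypothetical required fields; a real payload specification would define its own.
REQUIRED_FIELDS = ("host", "priority", "owner")


def compliance_by_team(alerts: list[dict]) -> dict[str, float]:
    """Return, for each sending team, the fraction of its alerts with all required fields."""
    totals = defaultdict(int)
    compliant = defaultdict(int)
    for alert in alerts:
        # "source_team" is an assumed key naming the team that sent the alert.
        team = alert.get("source_team", "unknown")
        totals[team] += 1
        if all(alert.get(field) for field in REQUIRED_FIELDS):
            compliant[team] += 1
    return {team: compliant[team] / totals[team] for team in totals}
```

Teams with a low ratio are the ones to contact first about onboarding and payload fixes.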

SIE values high-quality alerts and a good night’s sleep

SIE now prioritizes consistently high-quality alert data coming in from multiple monitoring tools, allowing them to be far more efficient with escalation and incident resolution. This fosters a positive internal culture shift around AIOps adoption as SIE expands this practice to additional teams.

“As we walked this journey with our stakeholders, we knew BigPanda was reducing alert noise—but it went further than that. We saw a different attitude within our operations teams. They weren’t getting woken up in the middle of the night for something that wasn’t really an issue. They started seeing the potential of using BigPanda and not only embraced it but also evangelized it across other teams.

“BigPanda is bringing work-life balance to the entire organization. Everyone strives for that, but this has really helped us to achieve that,” said Flores. “We are seeing fewer outages because our NOC can now truly get ahead of them. What that means for us, our partners, and especially our gamers, is everybody’s just happier. The gamers are able to continue their gaming, partners are able to continue to create content, and our engineering teams are able to have a good night’s sleep.”