Building a Fast Ops Incident Dashboard
Few things damage productivity as much as waiting. Waiting forces us to context-switch, breaks our creative momentum and discourages experimentation. Whether we are deploying a new service or troubleshooting a problem, waiting imposes a heavy tax on efficient work.
Nothing is as delightful, then, as things that don’t make us wait: scripts that finish in under a second, queries that return instantly, and web pages that load without delay. Why do sysadmins love vim so much? Because you can open a file, run a search-and-replace, save the file and get back to the shell before your average IDE has even finished starting up. Speed is invaluable to operations people, and we at BigPanda are well aware of this. In fact, we are so passionate about speed that we are committed to making BigPanda the fastest incident dashboard in the world. Here are a few architectural choices we’ve made to speed up our platform:
- Low-Latency Data Pipeline – Our data processing pipeline is composed of fast, message-driven micro-services. We measure and alert on the latency of each of those micro-services, so that no single service can quietly become a performance bottleneck. A full end-to-end event lifecycle (which includes, at a minimum, normalization, correlation, persistence and publishing) takes less than 100 ms on average.
- Realtime Front-End – Older-generation monitoring frontends relied on full page reloads to update. Newer tools use AJAX-based polling. This is an improvement, of course, but it still carries the delay inherent in periodic updates: a change that arrives just after one poll isn’t displayed until the next. By contrast, BigPanda’s frontend keeps a WebSocket connection open to our backend, and new events and status updates are pushed to it the moment they are processed. This means that a Nagios-generated alert is likely to appear in BigPanda before it shows up in the Nagios dashboard!
- Reactive UI – Data-heavy web pages are often sluggish and clunky. To avoid this, our awesome web team implemented a set of UI performance optimizations, including virtual scrolling, fluid flexbox layouts, SVG visualizations and manipulation buffering. We’ve tested our UI with tens of thousands of concurrent incidents, and we’re happy to report next-to-zero impact on overall performance and responsiveness.
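The post doesn't say how per-service latency is tracked. Below is a minimal sketch, assuming each pipeline stage is timed in-process and compared against a per-stage latency budget. The stage names come from the event lifecycle described above. The budget values and the `LatencyTracker` class are hypothetical, and the budgets were chosen only so they add up to the 100 ms average quoted in the post. A production setup would more likely export these timings to a metrics system and alert from there.

```python
import time
from contextlib import contextmanager
from statistics import mean

# Hypothetical per-stage latency budgets in milliseconds. Stage names follow the
# event lifecycle described in the post; the numbers are illustrative and sum to 100 ms.
STAGE_BUDGET_MS = {
    "normalize": 20.0,
    "correlate": 40.0,
    "persist": 25.0,
    "publish": 15.0,
}


class LatencyTracker:
    """Collects latency samples per pipeline stage and flags slow stages."""

    def __init__(self, budgets):
        self.budgets = dict(budgets)
        self.samples = {stage: [] for stage in self.budgets}

    def record(self, stage, elapsed_ms):
        self.samples[stage].append(elapsed_ms)

    @contextmanager
    def measure(self, stage):
        # Time the enclosed block with a monotonic, high-resolution clock.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(stage, (time.perf_counter() - start) * 1000.0)

    def average_ms(self, stage):
        values = self.samples[stage]
        return mean(values) if values else 0.0

    def end_to_end_ms(self):
        # Rough end-to-end figure: the sum of each stage's average latency.
        return sum(self.average_ms(stage) for stage in self.budgets)

    def alerts(self):
        """Stages whose average latency exceeds their budget, sorted by name."""
        return sorted(
            stage
            for stage in self.budgets
            if self.average_ms(stage) > self.budgets[stage]
        )


if __name__ == "__main__":
    tracker = LatencyTracker(STAGE_BUDGET_MS)
    with tracker.measure("normalize"):
        event = {"host": "db01", "check": "disk_usage"}  # stand-in for real work
    tracker.record("correlate", 55.0)  # simulate a slow correlation step
    print(tracker.alerts())  # ['correlate']
```

Alerting on each stage separately, rather than only on the end-to-end total, shows which service is responsible when the overall latency starts to drift.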
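Here is why push beats polling. Suppose a client polls every T seconds and updates arrive at random moments. On average, an update then waits T/2 before anyone sees it. With push, the update is delivered as soon as the server has it. The following is a minimal, in-process illustration of the push side, not BigPanda's implementation. Each `asyncio.Queue` stands in for one browser's WebSocket connection. `PushHub`, `demo` and `expected_polling_delay` are hypothetical names.

```python
import asyncio


def expected_polling_delay(interval_s):
    """Average wait before a poller sees an update, assuming updates arrive
    at uniformly random times between polls: half the polling interval."""
    return interval_s / 2


class PushHub:
    """Pushes every published update to all connected subscribers at once.

    Each subscriber's asyncio.Queue stands in for one browser's WebSocket
    connection; a real server would write to the socket instead.
    """

    def __init__(self):
        self._subscribers = set()

    def subscribe(self):
        queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def publish(self, update):
        # Delivered immediately: no client waits for its next polling cycle.
        for queue in self._subscribers:
            queue.put_nowait(update)


async def demo():
    hub = PushHub()
    dashboard_a = hub.subscribe()
    dashboard_b = hub.subscribe()
    hub.publish({"incident": 42, "status": "critical"})
    return await dashboard_a.get(), await dashboard_b.get()


if __name__ == "__main__":
    print(asyncio.run(demo()))
    print(expected_polling_delay(30))  # 15.0
```

A real deployment would also need to handle disconnects (hence `unsubscribe`) and slow consumers. For example, it could bound each queue with `asyncio.Queue(maxsize=...)` so one stalled browser cannot grow memory without limit.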
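Virtual scrolling is one of the UI optimizations named above. Its core is a small piece of arithmetic: render only the rows that intersect the viewport, plus a few "overscan" rows on either side, and let empty space stand in for everything else. Here is a minimal sketch, assuming fixed-height rows and integer pixel values. `visible_window`, `Window` and the `overscan` default are hypothetical, and in a browser this logic would run in JavaScript on each scroll event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    first: int      # index of the first row to render
    last: int       # index one past the last row to render
    offset_px: int  # top spacer height that places the rendered rows correctly


def visible_window(scroll_top, viewport_height, row_height, total_rows, overscan=5):
    """Return the slice of rows a virtual-scrolling list should render.

    Assumes fixed-height rows and integer pixel values. Rows outside the
    returned range are never created; empty space stands in for them.
    """
    if row_height <= 0:
        raise ValueError("row_height must be positive")
    first_visible = scroll_top // row_height
    # Ceiling division, so a row partly visible at the bottom edge is included.
    last_visible = -(-(scroll_top + viewport_height) // row_height)
    # Overscan renders a few extra rows on each side to hide gaps while scrolling.
    first = max(0, first_visible - overscan)
    last = min(total_rows, last_visible + overscan)
    return Window(first, last, first * row_height)


if __name__ == "__main__":
    w = visible_window(scroll_top=32000, viewport_height=640, row_height=32, total_rows=50000)
    print(w)                 # Window(first=995, last=1025, offset_px=31840)
    print(w.last - w.first)  # 30
```

Take 50,000 rows, a 640 px viewport and 32 px rows. Only about 30 rows exist in the page at any scroll position, so the rendering cost stays roughly constant as the incident count grows.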
Additionally, our roadmap includes keyboard shortcuts for instantly assigning, starring and snoozing incidents. There’s definitely more work to be done, but we think BigPanda already provides a satisfying experience for any speed-addicted DevOps engineer or NOC team. Give us a try, and please let us know how else we can make BigPanda faster for you!
Comic courtesy of xkcd.com