MonitoringScape

Introduction

The past decade has seen a dramatic shift in how we build applications: clouds, containers and microservices have displaced the old paradigm of static, monolithic infrastructure. The need for operational visibility has grown tenfold. Thankfully, the monitoring landscape has kept up with the times.

We now have a choice of over 100 monitoring tools that provide excellent visibility into every nook and cranny of our IT stack. The modern monitoring landscape has something for everyone: on-prem installations, SaaS applications, open-source tools and high-priced enterprise monitoring suites. However, with so many tools to choose from, the monitoring landscape can be difficult to navigate.

MonitoringScape is your guide to the new, exciting world of modern monitoring. Keep in mind that this is a community resource, so your comments and suggestions are very welcome.

Time-Series Databases

Use time-series databases to store and visualize your performance metrics. Common metrics include system & network performance (e.g. CPU Load), application performance (e.g. Transaction Latency) and business KPIs (e.g. Ad Impressions). Time-series databases are optimized for scale & performance and are capable of consuming millions of samples per second in most cases.

Common Features

  • Configurable data granularity and retention
  • Aggregation functions (e.g. sum, mean)
  • API & language wrappers
  • Dashboards / integration with dashboarding tools
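
As a rough illustration of how data granularity and aggregation functions fit together, the sketch below rolls raw samples up into fixed-width time buckets and applies an aggregation to each bucket. The `downsample` function and the sample data are hypothetical and are not taken from any of the tools listed here.

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, granularity, aggregate=mean):
    """Roll (timestamp, value) samples up into buckets `granularity` seconds wide."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each timestamp to the start of the bucket it falls into.
        buckets[timestamp - timestamp % granularity].append(value)
    return {start: aggregate(values) for start, values in sorted(buckets.items())}

# Hypothetical CPU load samples as (seconds, load) pairs.
cpu_load = [(0, 0.5), (10, 1.0), (20, 1.5), (60, 1.0), (70, 2.0)]
per_minute_mean = downsample(cpu_load, granularity=60)
per_minute_max = downsample(cpu_load, granularity=60, aggregate=max)
```

Storing only the per-minute values instead of the raw samples is one way a coarser granularity can keep long retention periods affordable.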

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| Blueflood | 2013 | On Prem | Open Source |
| Cacti | 2001 | On Prem | Open Source |
| Circonus | 2010 | On Prem & SaaS | |
| Cube | 2011 | On Prem | Open Source |
| Druid | 2012 | On Prem | Open Source |
| Graphite | 2008 | On Prem & SaaS | Open Source |
| InfluxData | 2013 | On Prem & SaaS | Open Source |
| Instrumental | 2011 | SaaS | |
| KairosDB | 2013 | On Prem | Open Source |
| Librato | 2011 | SaaS | Acquired by SolarWinds |
| OpenTSDB | 2010 | On Prem | Open Source |
| Prometheus | 2012 | On Prem | Open Source |
| RRDtool | 1999 | On Prem | Open Source |
| SignalFX | 2013 | SaaS | |
| StatHat | 2011 | SaaS | |

System Monitoring

For most operations teams, system monitoring tools constitute the central hub of visibility into the status of their production environments. Use these tools to detect and investigate hardware, network and software problems. This is a broad definition that captures many flavors of tools, and thorough research is required before adopting a tool for production use.

System monitoring tools frequently employ a plugin architecture, making it easy to monitor the health of various types of infrastructure. Note that often breadth comes at the expense of depth: as your company grows, you will likely choose to adopt additional tools from other categories to augment your system monitoring solution.

Common Features

  • Status dashboard (i.e. “red/yellow/green” infrastructure overview)
  • Alerting via email, SMS, etc.
  • Agents for periodic execution of health checks
  • Built-in collectors for servers & networks
  • Plugin architecture that supports many types of infrastructure
  • Check hierarchy / dependency mapping
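
To make the plugin idea more concrete, here is a minimal sketch of a check registry that produces a red/yellow/green overview. The status codes follow the widely used Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL). Everything else, including the `register` decorator, the metric names and the thresholds, is a hypothetical illustration rather than the API of any listed tool.

```python
# Status codes follow the widely used Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2
COLORS = {OK: "green", WARNING: "yellow", CRITICAL: "red"}

CHECKS = {}  # plugin registry: check name -> check function

def register(name):
    """Decorator that adds a check function to the registry."""
    def decorator(func):
        CHECKS[name] = func
        return func
    return decorator

def threshold(value, warn, crit):
    """Map a numeric reading to a status code."""
    if value >= crit:
        return CRITICAL
    return WARNING if value >= warn else OK

@register("disk")
def check_disk(metrics):
    return threshold(metrics["disk_used_pct"], warn=80, crit=90)

@register("load")
def check_load(metrics):
    return threshold(metrics["load_avg"], warn=4.0, crit=8.0)

def run_checks(metrics):
    """Run every registered check and return a name -> color overview."""
    return {name: COLORS[check(metrics)] for name, check in CHECKS.items()}

# Hypothetical readings, as an agent might collect them from one host.
overview = run_checks({"disk_used_pct": 85, "load_avg": 1.2})
```

Supporting a new type of infrastructure then only requires registering another check function, which is roughly why plugin-based tools can cover so much ground.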

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| Centreon | 2003 | On Prem | |
| Check_MK | 2009 | On Prem | Open Source, Uses Nagios Core |
| collectd | 2005 | On Prem | Open Source, Collection only |
| Datadog | 2010 | SaaS | |
| Dataloop.io | 2014 | SaaS | |
| Ganglia | 2001 | On Prem | Open Source |
| Icinga | 2009 | On Prem | Open Source, Forked from Nagios |
| LogicMonitor | 2009 | SaaS | |
| mackerel | 2015 | SaaS | |
| Munin | 2006 | On Prem | Open Source |
| Nagios | 1999 | On Prem | Open Source |
| OpenNMS | 2000 | On Prem | Open Source |
| OpsView | 2003 | On Prem | Open Source, Forked from Nagios |
| PRTG | 1997 | On Prem | |
| Scout | 2008 | SaaS | |
| Sensu | 2011 | On Prem | Open Source |
| Server Density | 2009 | SaaS | Server and web monitoring |
| Shinken | 2010 | SaaS | Open Source, Nagios Core rewrite in Python |
| Spiceworks | 2014 | On Prem | |
| statsd | 2012 | On Prem | Open Source, Collection only |
| statsite | 2012 | On Prem | Open Source, statsd-inspired collector |
| Zabbix | 1998 | On Prem | Open Source |
| Zenoss | 2002 | On Prem | Open Source |

Log Management

Virtually all software writes log files. Logs provide low-level visibility into application behavior; they are extremely useful for debugging and can help you track recurring errors.

The rise of distributed systems caused an explosion in the number of log files and log lines, and locating an individual transaction in this ocean of logs became nearly impossible. Log management tools were invented to address this issue. Much as Google crawls and indexes web pages, log management tools collect and index all your log data. This allows you to quickly search for specific messages, errors and patterns across all your log files.
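
The indexing idea can be sketched with a tiny inverted index: each token maps to the set of log lines that contain it, so a search only intersects sets instead of scanning every line. This is a simplified illustration under assumed names (`build_index`, `search`) and made-up log lines. Real log management tools are far more elaborate.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9_]+")

def build_index(log_lines):
    """Map each lowercase token to the set of line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(log_lines):
        for token in TOKEN.findall(line.lower()):
            index[token].add(line_no)
    return index

def search(index, log_lines, query):
    """Return the lines that contain every token in the query (AND semantics)."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return [log_lines[i] for i in sorted(matches)]

# Hypothetical log lines from two services.
logs = [
    "12:00:01 INFO checkout order=1001 status=ok",
    "12:00:02 ERROR checkout order=1002 timeout contacting payment gateway",
    "12:00:03 ERROR search timeout contacting index shard",
]
index = build_index(logs)
timeouts = search(index, logs, "error timeout")
```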

Common Features

  • Query language for searching logs
  • Timelines & histograms
  • Automatic alerts
  • Customizable dashboards
  • Aggregation functions & analytics

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| ELK Stack | 2011 | On Prem | Elasticsearch, Logstash & Kibana; Open Source |
| Graylog | 2010 | On Prem | Open Source |
| Logentries | 2010 | SaaS | |
| Loggly | 2009 | SaaS | |
| Papertrail | 2011 | SaaS | Acquired by SolarWinds |
| Splunk | 2003 | On Prem | |
| SumoLogic | 2010 | SaaS | |
| Logscape | 2011 | On Prem & SaaS | |

APM

Application Performance Monitoring tools (APMs) monitor the behavior of your applications by tracking transaction flow, starting with the client, and working down the stack, through the backend and database. They measure performance metrics such as latency, throughput and error rate. Use them to detect and debug user experience issues.

APMs provide important visibility that would be very hard to achieve otherwise. APM agents perform code-level instrumentation and therefore require language-specific implementations. Your applications might incur small performance penalties when monitored using APMs.
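
As a minimal sketch of what code-level instrumentation can look like, the decorator below wraps a function and records call count, error count and total latency. The names `instrument`, `STATS` and `checkout` are hypothetical. Commercial APM agents typically hook into runtimes and frameworks automatically rather than requiring explicit decorators.

```python
import time
from functools import wraps

STATS = {}  # transaction name -> {"calls", "errors", "total_seconds"}

def instrument(name):
    """Record call count, error count and total latency for the wrapped function."""
    def decorator(func):
        stats = STATS.setdefault(name, {"calls": 0, "errors": 0, "total_seconds": 0.0})

        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                # Runs on success and on failure, so every call is timed.
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator

@instrument("checkout")
def checkout(cart_total):
    if cart_total <= 0:
        raise ValueError("empty cart")
    return f"charged {cart_total}"

checkout(25)
try:
    checkout(0)
except ValueError:
    pass

error_rate = STATS["checkout"]["errors"] / STATS["checkout"]["calls"]
```

The extra timing work inside `wrapper` is a small-scale version of the performance penalty mentioned above.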

Common Features

  • Latency, throughput & error-rate measurement
  • Geography-based segmentation
  • Common errors & occurrence frequency
  • Database query performance
  • Correlation of performance metrics with code deployments
  • Alerting via email, SMS, etc.

Tool Overview

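| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| AppDynamics | 2008 | On Prem & SaaS | |
| AppNeta | 2010 | SaaS | |
| CorrelSense | 2005 | SaaS | |
| Dynatrace | 1993 | On Prem | |
| DripStat | 2015 | SaaS | |
| New Relic | 2008 | SaaS | |
| perfino | 2014 | On Prem | |
| Ruxit | 2014 | SaaS | |
| Stackify | 2012 | SaaS | |
| Takipi | 2011 | SaaS | |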

Web & User Monitoring

Web & user monitoring tools measure how your application performs “from the outside.” They simulate traffic to your application from various geographies and alert you on failures and timeouts (Synthetic Monitoring). Additionally, they can be embedded into your web frontends or mobile applications in order to track real failures arising in your users’ clients (Real User Monitoring).

Unlike monitoring tools that track technical performance metrics, web & user monitoring metrics are tied directly to actual user experience. Web & user monitoring alerts almost always indicate that you have a real issue that must be resolved promptly. However, these alerts can’t provide much context as to what is causing the problem. It is recommended to complement web & user monitoring tools with System Monitoring and Application Performance Monitoring tools.

Common Features

  • Monitor HTTP / HTTPS / SSH / generic TCP endpoints
  • Uptime & SLA tests
  • Geographical segmentation
  • Alerting via email, SMS, etc.
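
A synthetic check against a generic TCP endpoint can be sketched with the Python standard library alone. The `probe_tcp` helper below is a hypothetical illustration: it attempts a connection, treats any failure or timeout as "down" and reports how long the attempt took.

```python
import socket
import time

def probe_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection; return (is_up, seconds_elapsed)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            is_up = True
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        is_up = False
    return is_up, time.perf_counter() - start
```

A monitoring service would presumably run a probe like this on a schedule from several geographies and raise an alert when `is_up` is false or the elapsed time exceeds an agreed limit.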

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| Apica | 2005 | SaaS | |
| CatchPoint | 2008 | SaaS & On Prem | |
| dotcom | 1998 | SaaS | |
| Gomez | 1997 | SaaS | Acquired by Compuware |
| Keynote | 1995 | SaaS | |
| mPulse | 2012 | SaaS | |
| Panopta | 2007 | SaaS | |
| Pingdom | 2007 | SaaS | Acquired by SolarWinds |
| Rigor | 2010 | SaaS | |
| Site24x7 | 2006 | SaaS | |
| StatusCake | 2012 | SaaS | |
| UptimeRobot | 2010 | SaaS | |

On-Call Management

More and more companies are transitioning away from a strict tier-based operations model. In these companies, developers and infrastructure engineers respond to alerts directly rather than handing them to a tier-1 team. This bypasses the traditional manual escalation process and significantly reduces overall resolution time.

On-call management tools enable this methodology. They consume alerts from your monitoring stack and automatically route them to the person who is currently on call, normally via a mobile notification. If that person doesn’t respond within a predefined SLA, the alert is automatically escalated to a second on-call person.
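
The routing and escalation logic described above can be sketched in a few lines. The `Shift` structure, the names in the schedule and the 15-minute acknowledgement window are all assumptions made for illustration. They do not reflect how any particular product models schedules.

```python
from dataclasses import dataclass

@dataclass
class Shift:
    start_hour: int         # inclusive, 0-23
    end_hour: int           # exclusive
    escalation_chain: list  # people to page, in order

def on_call_chain(schedule, hour):
    """Find the escalation chain for the shift covering `hour`."""
    for shift in schedule:
        if shift.start_hour <= hour < shift.end_hour:
            return shift.escalation_chain
    raise LookupError(f"no shift covers hour {hour}")

def who_to_page(chain, minutes_unacknowledged, ack_sla_minutes=15):
    """Move one step down the chain for every missed acknowledgement window."""
    level = min(minutes_unacknowledged // ack_sla_minutes, len(chain) - 1)
    return chain[level]

# Hypothetical two-shift schedule.
schedule = [
    Shift(0, 12, ["alice", "bob"]),
    Shift(12, 24, ["carol", "dave"]),
]
first_responder = who_to_page(on_call_chain(schedule, hour=9), minutes_unacknowledged=0)
escalated_to = who_to_page(on_call_chain(schedule, hour=9), minutes_unacknowledged=20)
```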

Common Features

  • Manage on-call schedules
  • Automatic routing based on schedule
  • Configurable, automatic escalation policies
  • Notification via SMS, phone call or push notification
  • Acknowledge, resolve & add a comment to an alert
  • Mobile apps

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| OpsGenie | 2012 | SaaS | |
| PagerDuty | 2010 | SaaS | |
| VictorOps | 2012 | SaaS | |
| xMatters | 2000 | SaaS | |

Event Processing

The growth in scale & complexity of modern production environments resulted in an explosion in the amount of data we have to process to make operational decisions. Manual processing of events is becoming harder and harder.

Event processing tools help you automate large parts of the incident resolution process. They consume alerts from your monitoring tools, and run them through a series of processing steps: Correlation (matching related alerts), Enrichment (adding insight & context to events), Noise Suppression (removing unnecessary events) and Routing (funneling events to specific stakeholders). Use event processing tools to boost your service uptime and team productivity.
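
Here is a small sketch of those four processing steps applied to a batch of alerts. The lookup tables, field names and the "same host means related" correlation rule are simplifying assumptions. The sketch also applies noise suppression first so that ignored alerts are never enriched, which is only one possible ordering.

```python
from collections import defaultdict

# Hypothetical lookup tables; a real deployment might pull these from a CMDB or config.
SERVICE_OWNERS = {"db-1": "database-team", "web-1": "web-team"}
IGNORED_CHECKS = {"ntp_drift"}

def process(alerts):
    """Suppress noise, enrich, correlate by host, then route to the owning team."""
    incidents = defaultdict(list)
    for alert in alerts:
        if alert["check"] in IGNORED_CHECKS:  # noise suppression
            continue
        owner = SERVICE_OWNERS.get(alert["host"], "ops")
        enriched = dict(alert, owner=owner)   # enrichment
        incidents[alert["host"]].append(enriched)  # correlation: same host

    routed = defaultdict(list)
    for host, related in incidents.items():   # routing
        summary = {"host": host, "checks": [a["check"] for a in related]}
        routed[related[0]["owner"]].append(summary)
    return dict(routed)

alerts = [
    {"host": "db-1", "check": "disk_full"},
    {"host": "db-1", "check": "replication_lag"},
    {"host": "web-1", "check": "ntp_drift"},
    {"host": "web-1", "check": "http_5xx"},
]
routed = process(alerts)
```

Four raw alerts collapse into two incidents here, one per team, which hints at how correlation and suppression can cut down the volume people have to look at.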

Common Features

  • Event correlation & enrichment
  • Alert routing
  • Alert analytics
  • Integration with collaboration platforms (e.g. JIRA, Slack, ServiceNow)
  • Consolidated event dashboard

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| BigPanda | 2012 | SaaS | |
| Bosun | 2013 | On Prem | Open Source |
| MoogSoft | 2011 | On Prem | |
| Riemann | 2012 | On Prem | Open Source |

Mobile APM

Mobile apps include large quantities of native code whose performance is directly tied to revenue. And yet too often operations teams dismiss the importance of the reliability of native code, perhaps due to the fact that it resides outside of the datacenter. In fact, we should monitor our mobile apps with the same level of diligence given to backend infrastructure.

Mobile APMs are embedded into mobile apps and provide real-time visibility on their performance. Use them to track crashes and measure app speed. Debug issues by segmenting them according to device, operating system or geography.

Common Features

  • Crash reports
  • Impact analysis of external services
  • Client-backend communication monitoring
  • Device, OS, carrier network & geo segmentation
  • Uncaught exception tracking

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| Crittercism | 2011 | SaaS | |
| Fabric | 2011 | SaaS | Previously Crashlytics (Acquired by Twitter) |
| New Relic Mobile | 2013 | SaaS | |
| Splunk>MINT | 2011 | SaaS | Previously BugSense (Acquired by Splunk) |

Error Tracking

No matter how much you test, realistically your applications are going to have bugs. How you respond to these bugs once they occur is the key to reliability. Error tracking tools capture exceptions in your runtime code and provide context to help you prioritize and investigate them.

Log files provide general-purpose visibility, but too often errors pass by unnoticed or unhandled. Error tracking tools focus on actionability. They bubble up frequent errors, alert you in real time on new error types, and help you collaborate on their resolution.

Common Features

  • Monitor exceptions in backend & frontend code
  • Sort errors by frequency and severity
  • Automatically group duplicate exceptions
  • Alerting via email, SMS, etc.
  • Assign and track error resolution
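
Grouping duplicate exceptions usually comes down to computing a fingerprint for each error and counting occurrences per fingerprint. The sketch below uses the exception type plus the name of the function that raised it as the fingerprint. That choice, along with the `capture` and `parse_quantity` names, is an assumption for illustration. Real error trackers generally derive fingerprints from the full stack trace.

```python
import traceback
from collections import Counter

occurrences = Counter()  # fingerprint -> number of times seen

def fingerprint(exc):
    """Identify an error by its type and the function where it was raised."""
    innermost = traceback.extract_tb(exc.__traceback__)[-1]
    return (type(exc).__name__, innermost.name)

def capture(exc):
    """Record one occurrence of the exception under its fingerprint."""
    occurrences[fingerprint(exc)] += 1

def parse_quantity(raw):
    return int(raw)

# Hypothetical inputs: two raise ValueError, one raises TypeError.
for raw in ["3", "abc", "x1", "7", None]:
    try:
        parse_quantity(raw)
    except Exception as exc:
        capture(exc)

most_frequent, count = occurrences.most_common(1)[0]
```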

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| AirBrake | 2008 | SaaS | |
| BugSnag | 2012 | SaaS | |
| Honeybadger | 2012 | SaaS | |
| Raygun | 2013 | SaaS | |
| Rollbar | 2012 | SaaS | |
| Sentry | 2010 | SaaS & On Prem | Open Source |

Anomaly Detection

Application load tends to have a certain rhythm: it rises during the day and falls at night. And yet our monitoring alerts rely almost exclusively on static thresholds, resulting in many inaccuracies. For example, consider an application bug causing high Disk IO. At night, this bug will likely go unnoticed because the low baseline load keeps the metric under the threshold (False Negative). During the day, we will receive an alert but then ignore it, as we’ll already be flooded by many other unnecessary alerts caused by healthy high traffic (False Positive).

Anomaly detection tools address this problem. They analyze your system’s behavior over time and calculate an adaptive baseline representing the system’s “normal” behavior. Then, when your system behaves abnormally, they capture the anomaly and alert on it.
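
One simple way to picture an adaptive baseline is a per-hour mean and standard deviation, with a reading flagged when it strays more than a few standard deviations from the mean for that hour. This is a deliberately naive sketch with made-up Disk IO numbers and an assumed static threshold of 80 for comparison. The tools below use considerably more sophisticated models.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(history):
    """history: (hour_of_day, value) pairs. Returns hour -> (mean, stdev)."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two readings per hour.
    return {hour: (mean(vals), stdev(vals)) for hour, vals in by_hour.items()}

def is_anomalous(baseline, hour, value, max_sigmas=3.0):
    """Flag values more than `max_sigmas` standard deviations from that hour's mean."""
    avg, sigma = baseline[hour]
    return abs(value - avg) > max_sigmas * sigma

# Hypothetical Disk IO history: quiet at 03:00, busy at 14:00.
history = [(3, 10), (3, 12), (3, 11), (3, 9),
           (14, 100), (14, 110), (14, 90), (14, 105)]
baseline = build_baseline(history)

STATIC_THRESHOLD = 80  # assumed static alert threshold, for comparison only
night_bug = is_anomalous(baseline, hour=3, value=40)       # below 80, yet abnormal
busy_afternoon = is_anomalous(baseline, hour=14, value=110)  # above 80, yet normal
```

With these numbers the static threshold would miss the night-time reading of 40 and fire on the healthy afternoon reading of 110, while the per-hour baseline flags only the former.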

Common Features

  • Consume time-series or log data
  • Detect & alert on anomalous behavior
  • Automatic context for root cause analysis

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| Anodot | 2014 | SaaS | Time-series data |
| Anomaly Detective | 2013 | On Prem | Log data |
| Grok | 2014 | On Prem | Time-series data |
| Skyline | 2013 | On Prem | Time-series data, Open Source |

Specialized

As the saying goes, do one thing and do it well. This category includes monitoring tools that specialize in specific use-cases or specific infrastructure vendors.

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| Cachet | 2014 | SaaS | Public status pages, Open Source |
| CloudWatch | 2007 | SaaS | AWS monitoring, part of AWS |
| Google Cloud Monitoring | 2014 | SaaS | Google Compute Engine monitoring, part of Google Cloud |
| Opsmatic | 2013 | SaaS | Change monitoring |
| opvizor | 2012 | SaaS | VMware vSphere monitoring |
| Rackspace Monitoring | 2013 | SaaS | Rackspace monitoring |
| Runscope | 2013 | SaaS | API monitoring |
| StackDriver | 2012 | SaaS | AWS monitoring, acquired by Google |
| StatusPage.io | 2013 | SaaS | Public status pages |
| ThousandEyes | 2010 | SaaS | Organizational network monitoring |
| vRealize Operations | 2013 | On Prem | VMware hybrid-cloud monitoring |

Enterprise Suites

Before the monitoring boom, companies relied on a fairly small set of vendors to monitor their environments. These vendors built large monitoring suites providing holistic workflows and end-to-end visibility. However, the rapid proliferation of SaaS and open-source tools has significantly reduced their market share in recent years.

Tool Overview

| Tool | Started | Deployment | Notes |
| --- | --- | --- | --- |
| BMC TrueSight | 1990s | On Prem | |
| IBM Tivoli | 1990s | On Prem | |
| HP Operations Management Solutions | 1990s | On Prem | |
| Microsoft SCOM | 2000 | On Prem | |
| SolarWinds | 1999 | On Prem | |