The past decade has seen a dramatic shift in how we build applications: clouds, containers and micro-services have displaced the old paradigm of static, monolithic infrastructure. The need for operational visibility has grown tenfold. Thankfully, the monitoring landscape has kept up with the times.
We now have a choice of over 100 monitoring tools that provide excellent visibility into every nook and cranny of our IT stack. The modern monitoring landscape has something for everyone: on-prem installations, SaaS applications, open-source tools and high-priced enterprise monitoring suites. However, with so many tools to choose from, the monitoring landscape can be difficult to navigate.
MonitoringScape is your guide to the new, exciting world of modern monitoring. Keep in mind that this is a community resource, so your comments and suggestions are very welcome.
Time-Series Databases
Use time-series databases to store and visualize your performance metrics. Common metrics include system & network performance (e.g. CPU load), application performance (e.g. transaction latency) and business KPIs (e.g. ad impressions). Time-series databases are optimized for scale & performance, and many can ingest millions of samples per second.
- Configurable data granularity and retention
- Aggregation functions (e.g. sum, mean)
- API & language wrappers
- Dashboards / integration with dashboarding tools
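As a rough illustration of the granularity, retention and aggregation features listed above, the sketch below rolls raw samples up into fixed-width buckets and drops samples that fall outside a retention window. It does not reflect the API of any tool in this category; the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds, aggregate=mean):
    """Group (timestamp, value) samples into fixed-width time buckets
    and reduce each bucket with an aggregation function (mean, sum, ...)."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


def apply_retention(samples, now, max_age_seconds):
    """Keep only the samples that are still inside the retention window."""
    return [(t, v) for t, v in samples if now - t <= max_age_seconds]


# Hypothetical CPU-load samples: (seconds since start, load)
cpu_load = [(0, 1.0), (10, 3.0), (60, 2.0), (70, 4.0), (130, 5.0)]

per_minute_mean = downsample(cpu_load, bucket_seconds=60)
per_minute_sum = downsample(cpu_load, bucket_seconds=60, aggregate=sum)
last_minute = apply_retention(cpu_load, now=130, max_age_seconds=60)
```

Coarser buckets trade detail for storage, which is why these tools typically let you keep fine-grained data for a short period and rolled-up data for longer.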
|Blueflood||2013||On Prem||Open Source|
|Cacti||2001||On Prem||Open Source|
|Circonus||2010||On Prem & SaaS|
|Cube||2011||On Prem||Open Source|
|Druid||2012||On Prem||Open Source|
|Graphite||2008||On Prem & SaaS||Open Source|
|InfluxData||2013||On Prem & SaaS||Open Source|
|KairosDB||2013||On Prem||Open Source|
|Librato||2011||SaaS||Acquired by SolarWinds|
|OpenTSDB||2010||On Prem||Open Source|
|Prometheus||2012||On Prem||Open Source|
|RRDtool||1999||On Prem||Open Source|
System Monitoring
For most operations teams, system monitoring tools constitute the central hub of visibility into the status of their production environments. Use these tools to detect and investigate hardware, network and software problems. This is a broad definition that captures many flavors of tools, and thorough research is required before adopting a tool for production use.
System monitoring tools frequently employ a plugin architecture, making it easy to monitor the health of various types of infrastructure. Note that breadth often comes at the expense of depth: as your company grows, you will likely choose to adopt additional tools from other categories to augment your system monitoring solution.
- Status dashboard (i.e. “red/yellow/green” infrastructure overview)
- Alerting via email, SMS, etc.
- Agents for periodic execution of health checks
- Built-in collectors for servers & networks
- Plugin architecture that supports many types of infrastructure
- Check hierarchy / dependency mapping
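The features above can be sketched in a few lines: pluggable check functions, a red/yellow/green status, and a dependency rule that skips child checks when a parent is already down. This is a simplified model under assumed names (`Check`, `run_checks`, `threshold_status`), not the plugin interface of Nagios or any other tool listed here.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Dict, List, Optional


class Status(IntEnum):
    GREEN = 0   # healthy
    YELLOW = 1  # warning
    RED = 2     # critical


@dataclass
class Check:
    name: str
    probe: Callable[[], Status]  # plugin function that performs the health check
    depends_on: List[str] = field(default_factory=list)


def threshold_status(value: float, warn: float, crit: float) -> Status:
    """Map a measurement (e.g. disk usage in percent) to a status color."""
    if value >= crit:
        return Status.RED
    if value >= warn:
        return Status.YELLOW
    return Status.GREEN


def run_checks(checks: List[Check]) -> Dict[str, Optional[Status]]:
    """Run checks in order (parents must come first). A check whose parent
    is RED, or was itself skipped, is not probed and is recorded as None."""
    results: Dict[str, Optional[Status]] = {}
    for check in checks:
        parent_down = any(
            results.get(dep, Status.GREEN) in (Status.RED, None)
            for dep in check.depends_on
        )
        results[check.name] = None if parent_down else check.probe()
    return results


def overall_status(results: Dict[str, Optional[Status]]) -> Status:
    """The dashboard color is the worst status among the checks that ran."""
    return max((s for s in results.values() if s is not None), default=Status.GREEN)


# Hypothetical checks: the HTTP check depends on the host being reachable.
checks = [
    Check("host-ping", lambda: Status.RED),
    Check("http", lambda: Status.GREEN, depends_on=["host-ping"]),
    Check("db-disk", lambda: threshold_status(85.0, warn=80.0, crit=95.0)),
]
results = run_checks(checks)
dashboard = overall_status(results)
```

Skipping dependent checks is one common way to keep a single host failure from producing a flood of secondary alerts.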
|Check_MK||2009||On Prem||Open Source, Uses Nagios Core|
|collectd||2005||On Prem||Open Source, Collection only|
|Ganglia||2001||On Prem||Open Source|
|Icinga||2009||On Prem||Open Source, Forked from Nagios|
|Munin||2006||On Prem||Open Source|
|Nagios||1999||On Prem||Open Source|
|OpenNMS||2000||On Prem||Open Source|
|Opsview||2003||On Prem||Open Source, Forked from Nagios|
|Sensu||2011||On Prem||Open Source|
|Server Density||2009||SaaS||Server and web monitoring|
|Shinken||2010||On Prem||Open Source, Nagios Core rewrite in Python|
|statsd||2012||On Prem||Open Source, Collection Only|
|statsite||2012||On Prem||Open Source, statsd-inspired collector|
|Zabbix||1998||On Prem||Open Source|
|Zenoss||2002||On Prem||Open Source|
Log Management
Virtually all software writes log files. Logs provide low-level visibility into application behavior; they are extremely useful for debugging, and can help with tracking recurring errors.
The rise of distributed systems resulted in an explosion in the number of log files and log lines. Locating an individual transaction in the ocean of log files became impractical. Log management tools were invented to address this issue. Much as Google crawls and indexes web pages, log management tools collect and index all your log data. This allows you to quickly search for specific messages, errors and patterns across all your log files.
- Query language for searching logs
- Timelines & histograms
- Automatic alerts
- Customizable dashboards
- Aggregation functions & analytics
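To make the "collect and index" idea concrete, here is a minimal sketch of an inverted index over log lines: each token maps to the set of lines that contain it, so a query becomes a set intersection instead of a scan over every file. The `LogIndex` class, file names and log lines are hypothetical, and real log management tools use far more elaborate indexing and query languages.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9_]+")


class LogIndex:
    def __init__(self):
        self.lines = []                   # (source, line) in arrival order
        self.postings = defaultdict(set)  # token -> ids of lines containing it

    def add(self, source, line):
        """Store a log line and index every distinct token in it."""
        line_id = len(self.lines)
        self.lines.append((source, line))
        for token in set(TOKEN.findall(line.lower())):
            self.postings[token].add(line_id)

    def search(self, query):
        """Return the lines that contain all tokens of the query."""
        tokens = TOKEN.findall(query.lower())
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.lines[i] for i in sorted(matches)]


index = LogIndex()
index.add("web-1.log", "ERROR timeout calling payment service txn=42")
index.add("web-2.log", "INFO request served txn=43")
index.add("db-1.log", "ERROR deadlock detected txn=42")

# Follow one transaction across files from different hosts.
txn_errors = index.search("error txn 42")
```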
|ELK Stack||2011||On Prem||Elasticsearch, Logstash & Kibana. Open Source|
|Graylog||2010||On Prem||Open Source|
|Papertrail||2011||SaaS||Acquired by SolarWinds|
|Logscape||2011||On Prem & SaaS|
Application Performance Monitoring
Application Performance Monitoring tools (APMs) monitor the behavior of your applications by tracking transaction flow, starting with the client, and working down the stack, through the backend and database. They measure performance metrics such as latency, throughput and error rate. Use them to detect and debug user experience issues.
APMs provide important visibility that would be very hard to achieve otherwise. APM agents perform code-level instrumentation and therefore require language-specific implementations. Your applications might incur small performance penalties when monitored using APMs.
- Latency, throughput & error-rate measurement
- Geography-based segmentation
- Common errors & occurrence frequency
- Database query performance
- Correlation of performance metrics with code deployments
- Alerting via email, SMS, etc.
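Code-level instrumentation can be pictured as a wrapper around each transaction that records how long it took and whether it failed. The sketch below uses a Python decorator for this; `TransactionStats` and `checkout` are hypothetical names, and commercial APM agents instrument code automatically rather than through hand-placed decorators.

```python
import time
from collections import defaultdict
from functools import wraps


class TransactionStats:
    def __init__(self):
        self.latencies = defaultdict(list)  # transaction name -> seconds per call
        self.errors = defaultdict(int)      # transaction name -> failed calls

    def instrument(self, name):
        """Decorator that records latency and errors for a transaction."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                started = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.errors[name] += 1
                    raise
                finally:
                    self.latencies[name].append(time.perf_counter() - started)
            return wrapper
        return decorator

    def summary(self, name, window_s=60.0):
        """Latency, throughput and error rate, assuming all recorded calls
        happened within the last window_s seconds."""
        samples = self.latencies[name]
        calls = len(samples)
        return {
            "calls": calls,
            "throughput_per_s": calls / window_s,
            "error_rate": self.errors[name] / calls if calls else 0.0,
            "mean_latency_s": sum(samples) / calls if calls else 0.0,
        }


apm = TransactionStats()


@apm.instrument("checkout")
def checkout(cart):
    if not cart:
        raise ValueError("empty cart")
    return sum(cart)


for cart in ([5, 10], [20], [], [1]):
    try:
        checkout(cart)
    except ValueError:
        pass  # the failure is already counted by the wrapper

report = apm.summary("checkout")
```

The extra timing and bookkeeping on every call is also where the small performance penalty mentioned above comes from.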
|AppDynamics||2008||On Prem & SaaS|
Web & User Monitoring
Web & user monitoring tools measure how your application performs “from the outside.” They simulate traffic to your application from various geographies and alert you on failures and timeouts (Synthetic Monitoring). Additionally, they can be embedded into your web frontends or mobile applications in order to track real failures arising in your users’ clients (Real User Monitoring).
Unlike monitoring tools that track technical performance metrics, web & user monitoring metrics are tied directly to actual user experience. Web & user monitoring alerts almost always indicate that you have a real issue that must be resolved promptly. However, these alerts can’t provide much context as to what is causing the problem. It is recommended to complement web & user monitoring tools with System Monitoring and Application Performance Monitoring tools.
- Monitor HTTP / HTTPS / SSH / generic TCP endpoints
- Uptime & SLA tests
- Geographical segmentation
- Alerting via email, SMS, etc.
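A synthetic check boils down to connecting to an endpoint from the outside with a timeout and recording success and latency; uptime and SLA figures are then computed over the history of results. The sketch below shows a generic TCP probe using the standard `socket` module. The names `ProbeResult`, `probe_tcp`, `uptime_percent` and `meets_sla` are assumptions made for this example.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    error: str = ""


def probe_tcp(host, port, timeout_s=3.0):
    """Try to open a TCP connection; refusals, DNS failures and timeouts
    all count as a failed probe."""
    started = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return ProbeResult(True, time.perf_counter() - started)
    except OSError as exc:
        return ProbeResult(False, time.perf_counter() - started, str(exc))


def uptime_percent(results):
    """Share of successful probes, in percent."""
    if not results:
        return 0.0
    return 100.0 * sum(r.ok for r in results) / len(results)


def meets_sla(results, target_percent=99.9):
    return uptime_percent(results) >= target_percent


# Example usage (hypothetical endpoint, not executed here):
#   result = probe_tcp("example.com", 443)
#   if not result.ok:
#       print("ALERT:", result.error)
```

Running the same probe from several geographic locations, as the tools in this category do, helps separate regional network problems from a genuine outage.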
|Catchpoint||2008||SaaS & On Prem|
|Gomez||1997||SaaS||Acquired by Compuware|
|Pingdom||2007||SaaS||Acquired by SolarWinds|
On Call Management
More and more companies are transitioning away from a strict tier-based operations model. In these companies, developers and infrastructure engineers respond to alerts directly, instead of a tier-1 team. This shortcuts the traditional, manual-escalation process and reduces overall resolution time significantly.
On Call Management tools enable this methodology. They consume alerts from your monitoring stack, and route the alerts automatically to the person who is currently on call. The alert is normally communicated to the on-call person via a mobile notification. If the person doesn’t respond within the confines of a pre-defined SLA, the alert is automatically escalated to a second on-call person.
- Manage on-call schedules
- Automatic routing based on schedule
- Configurable, automatic escalation policies
- Notification via SMS, phone call or push notification
- Acknowledge, resolve & add a comment to an alert
- Mobile apps
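The routing and escalation behavior described above can be sketched as two small functions: one that looks up who is on call for a given hour, and one that walks an escalation chain as the alert stays unacknowledged past the SLA. The schedule, names and the 15-minute SLA are hypothetical, and the model ignores time zones, overrides and multi-day rotations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shift:
    person: str
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive, 1-24


def on_call_at(schedule: List[Shift], hour: int) -> Optional[str]:
    """Return the person whose shift covers the given hour, if any."""
    for shift in schedule:
        if shift.start_hour <= hour < shift.end_hour:
            return shift.person
    return None


def notify_target(chain: List[str], minutes_unacknowledged: int, ack_sla_minutes: int) -> str:
    """Move one step up the escalation chain for every SLA period that
    passes without an acknowledgement, stopping at the last person."""
    level = min(minutes_unacknowledged // ack_sla_minutes, len(chain) - 1)
    return chain[level]


schedule = [Shift("alice", 0, 12), Shift("bob", 12, 24)]

# An alert fires at 14:00: the primary is whoever is on call, carol is the backup.
chain = [on_call_at(schedule, hour=14), "carol"]
first = notify_target(chain, minutes_unacknowledged=0, ack_sla_minutes=15)
escalated = notify_target(chain, minutes_unacknowledged=20, ack_sla_minutes=15)
```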
Event Processing
The growth in scale & complexity of modern production environments has resulted in an explosion in the amount of data we have to process to make operational decisions. Manual processing of events is becoming harder and harder.
Event processing tools help you automate large parts of the incident resolution process. They consume alerts from your monitoring tools, and run them through a series of processing steps: Correlation (matching related alerts), Enrichment (adding insight & context to events), Noise Suppression (removing unnecessary events) and Routing (funneling events to specific stakeholders). Use event processing tools to boost your service uptime and team productivity.
- Event correlation & enrichment
- Alert routing
- Alert analytics
- Integration with collaboration platforms (e.g. JIRA, Slack, ServiceNow, etc.)
- Consolidated event dashboard
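The four processing steps named above (noise suppression, enrichment, correlation and routing) can be pictured as a simple pipeline of functions over a list of events. The sketch below is only a toy model: the event fields, the severity scale and the `SERVICE_OWNERS` mapping are all hypothetical, and real tools correlate on much richer signals than the service name.

```python
from collections import defaultdict

# Hypothetical mapping used for enrichment: service -> owning team.
SERVICE_OWNERS = {"checkout": "payments-team", "search": "search-team"}


def suppress_noise(events, min_severity=2):
    """Drop low-severity events that nobody needs to act on."""
    return [e for e in events if e["severity"] >= min_severity]


def enrich(events, owners):
    """Add context to each event, here the team that owns the service."""
    return [{**e, "owner": owners.get(e["service"], "ops")} for e in events]


def correlate(events):
    """Group related events into incidents, keyed by service and owner."""
    incidents = defaultdict(list)
    for event in events:
        incidents[(event["service"], event["owner"])].append(event)
    return dict(incidents)


def route(incidents):
    """Funnel one summary per incident to the owning team's queue."""
    queues = defaultdict(list)
    for (service, owner), related in incidents.items():
        queues[owner].append({
            "service": service,
            "alert_count": len(related),
            "max_severity": max(e["severity"] for e in related),
        })
    return dict(queues)


events = [
    {"service": "checkout", "severity": 3, "message": "latency high"},
    {"service": "checkout", "severity": 3, "message": "error rate high"},
    {"service": "search", "severity": 1, "message": "cache miss ratio up"},
    {"service": "billing", "severity": 2, "message": "disk 85% full"},
]

queues = route(correlate(enrich(suppress_noise(events), SERVICE_OWNERS)))
```

In this example four raw alerts become two routed incidents: the two checkout alerts collapse into one item for the payments team, the low-severity search alert is suppressed, and the billing alert falls back to a default queue.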
|Bosun||2013||On Prem||Open Source|
|Riemann||2012||On Prem||Open Source|
Mobile APM
Mobile apps include large quantities of native code whose performance is directly tied to revenue. And yet too often operations teams dismiss the importance of the reliability of native code, perhaps because it resides outside of the datacenter. In fact, we should monitor our mobile apps with the same level of diligence given to backend infrastructure.
Mobile APMs are embedded into mobile apps and provide real-time visibility into their performance. Use them to track crashes and measure app speed. Debug issues by segmenting them according to device, operating system or geography.
- Crash reports
- Impact analysis of external services
- Client-backend communication monitoring
- Device, OS, carrier network & geo segmentation
- Uncaught exceptions tracking
|Fabric||2011||SaaS||Previously Crashlytics (Acquired by Twitter)|
|Splunk>MINT||2011||SaaS||Previously BugSense (Acquired by Splunk)|
Error Tracking
No matter how much you test, realistically your applications are going to have bugs. How you respond to these bugs once they occur is the key to reliability. Error tracking tools capture exceptions in your runtime code and provide context to help you prioritize and investigate them.
Log files provide general-purpose visibility, but too often errors pass by unnoticed or unhandled. Error tracking tools focus on actionability. They bubble up frequent errors, alert you in real time on new error types, and help you collaborate on their resolution.
- Monitor exceptions in backend & frontend code
- Sort errors by frequency and severity
- Automatically group duplicate exceptions
- Alerting via email, SMS, etc.
- Assign and track error resolution
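Grouping duplicate exceptions usually relies on some kind of fingerprint that ignores details which vary per occurrence, such as IDs in the message. The sketch below assumes a very simple fingerprint (exception type plus the function that raised it) built with the standard `traceback` module; `ErrorTracker`, `load_user` and `parse_price` are hypothetical, and real error trackers use more robust grouping rules.

```python
import traceback
from collections import Counter


def fingerprint(exc):
    """Identify an error by its type and the innermost function that raised it,
    so that differing messages still land in the same group."""
    frames = traceback.extract_tb(exc.__traceback__)
    location = frames[-1].name if frames else "unknown"
    return f"{type(exc).__name__} in {location}"


class ErrorTracker:
    def __init__(self):
        self.counts = Counter()

    def capture(self, exc):
        """Record an exception; return True when this error type is new,
        which is the moment a real tool would send an alert."""
        key = fingerprint(exc)
        is_new = key not in self.counts
        self.counts[key] += 1
        return is_new

    def most_frequent(self, limit=10):
        return self.counts.most_common(limit)


def load_user(user_id):
    raise KeyError(f"user {user_id} not found")


def parse_price(text):
    return float(text)


tracker = ErrorTracker()
new_flags = []
for action in (lambda: load_user(1), lambda: load_user(2), lambda: parse_price("abc")):
    try:
        action()
    except Exception as exc:
        new_flags.append(tracker.capture(exc))

top_errors = tracker.most_frequent()
```

The two `load_user` failures carry different messages but share one group, so the tracker reports a count of two instead of two separate errors.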
|Sentry||2010||SaaS & On Prem||Open Source|
Anomaly Detection
Application load tends to have a certain rhythm: it goes up during the day and down at night. And yet our monitoring alerts rely almost exclusively on static thresholds, resulting in many inaccuracies. For example, consider an application bug causing high disk I/O. At night, this bug will likely go unnoticed because of the low baseline load (false negative). During the day, we will receive an alert but then ignore it, as we’ll already be flooded by many other unnecessary alerts caused by healthy high traffic (false positives).
Anomaly detection tools address this problem. They analyze your system’s behavior over time and calculate an adaptive baseline representing the system’s “normal” behavior. Then, when your system behaves abnormally, they capture the anomaly and alert on it.
- Consume time-series or log data
- Detect & alert on anomalous behavior
- Automatic context for root cause analysis
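The difference between a static threshold and an adaptive baseline can be shown with a small numeric sketch. It assumes one very simple baseline (mean and standard deviation per hour of day, with an alert when a value deviates by more than three standard deviations); the tools in this category use more sophisticated models, and the disk I/O numbers below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """history: iterable of (hour_of_day, value).
    Returns {hour: (mean, standard deviation)} as the 'normal' behavior."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(values), pstdev(values)) for hour, values in by_hour.items()}


def is_anomalous(baseline, hour, value, tolerance=3.0, min_stdev=1.0):
    """Flag values that deviate from that hour's mean by more than
    `tolerance` standard deviations (with a floor to avoid zero-width bands)."""
    avg, stdev = baseline[hour]
    return abs(value - avg) > tolerance * max(stdev, min_stdev)


# Hypothetical disk I/O history: quiet at 03:00, busy at 14:00.
history = [(3, 10), (3, 12), (3, 11), (3, 9),
           (14, 90), (14, 100), (14, 110), (14, 95)]
baseline = build_baseline(history)

STATIC_THRESHOLD = 100

# A bug pushes night-time I/O to 60: far above normal, yet below the static threshold.
night_static = 60 > STATIC_THRESHOLD
night_adaptive = is_anomalous(baseline, hour=3, value=60)

# Healthy daytime traffic reaches 115: above the static threshold, yet within normal range.
day_static = 115 > STATIC_THRESHOLD
day_adaptive = is_anomalous(baseline, hour=14, value=115)
```

With these numbers the static threshold misses the night-time bug and fires on healthy daytime load, while the per-hour baseline does the opposite in both cases.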
|Anomaly Detective||2013||On Prem||Log data|
|Grok||2014||On Prem||Time-series data|
|Skyline||2013||On Prem||Time-series data, Open Source|
Specialized Monitoring Tools
As the saying goes, do one thing and do it well. This category includes monitoring tools that specialize in specific use-cases or specific infrastructure vendors.
|Cachet||2014||SaaS||Public status pages, Open Source|
|CloudWatch||2007||SaaS||AWS monitoring, part of AWS|
|Google Cloud Monitoring||2014||SaaS||Google Compute Engine monitoring, part of Google Cloud|
|opvizor||2012||SaaS||VMware vSphere monitoring|
|Rackspace Monitoring||2013||SaaS||Rackspace monitoring|
|Stackdriver||2012||SaaS||AWS monitoring, acquired by Google|
|StatusPage.io||2013||SaaS||Public status pages|
|ThousandEyes||2010||SaaS||Organizational network monitoring|
|vRealize Operations||2013||On Prem||VMware hybrid-cloud monitoring|
Enterprise Monitoring Suites
Before the monitoring boom, companies relied on a fairly small set of vendors to monitor their environments. These vendors built large monitoring suites providing holistic workflows and end-to-end visibility. However, the rapid proliferation of SaaS and open-source tools has significantly reduced their market share in recent years.
|BMC TrueSight||1990s||On Prem|
|IBM Tivoli||1990s||On Prem|
|HP Operations Management Solutions||1990s||On Prem|
|Microsoft SCOM||2000||On Prem|