The Definitive Guide to IT Monitoring

MonitoringScape is the definitive guide to the ever-changing landscape of IT monitoring. As a community resource, it welcomes your submissions and feedback.


WHAT IS MONITORINGSCAPE?

Modern monitoring solutions offer something for everyone: on-prem and SaaS deployment options, open source and commercial licenses, purpose-built best-of-breed products and full-featured enterprise suites. With so many tools to choose from, navigating the landscape can be difficult.

Covering everything from system monitoring and APM, to anomaly detection and error tracking, to log monitoring, enterprise suites, and more, MonitoringScape is designed to help IT professionals explore the hundreds of monitoring tools available today, providing you with a go-to source for discovering and researching new solutions.


CONTRIBUTE

MonitoringScape is a community resource, so your submissions and feedback are very welcome.

Monitoring Tools


Time-Series Databases

Use time-series databases to store and visualize your performance metrics. Common metrics include system & network performance (e.g. CPU load), application performance (e.g. transaction latency) and business KPIs (e.g. ad impressions). Time-series databases are optimized for scale & performance and are often capable of ingesting millions of samples per second.

Common Features

  • Configurable data granularity and retention
  • Aggregation functions (e.g. sum, mean)
  • API & language wrappers
  • Dashboards / integration with dashboarding tools
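To make configurable granularity, retention and aggregation functions concrete, here is a minimal in-memory sketch in Python. It assumes samples arrive as (timestamp, value) pairs with timestamps in epoch seconds; `downsample`, `apply_retention` and `AGGREGATORS` are hypothetical names chosen for illustration, not the API of any database listed below.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable

# Hypothetical registry of aggregation functions available when rolling up data.
AGGREGATORS: dict[str, Callable[[list[float]], float]] = {
    "sum": sum,
    "mean": mean,
    "min": min,
    "max": max,
}


def downsample(samples: Iterable[tuple[int, float]], granularity_s: int,
               agg: str = "mean") -> list[tuple[int, float]]:
    """Roll raw (timestamp, value) samples up into fixed-width time buckets.

    Each bucket is keyed by its start time: the sample timestamp rounded down
    to a multiple of `granularity_s`. Buckets are returned in time order.
    """
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % granularity_s].append(value)
    fn = AGGREGATORS[agg]
    return [(start, fn(values)) for start, values in sorted(buckets.items())]


def apply_retention(points: list[tuple[int, float]], now: int,
                    retention_s: int) -> list[tuple[int, float]]:
    """Drop points older than the retention window."""
    cutoff = now - retention_s
    return [(ts, v) for ts, v in points if ts >= cutoff]


if __name__ == "__main__":
    # CPU load sampled every 10 s for two minutes.
    raw = [(t, float(t % 60)) for t in range(0, 120, 10)]
    print(downsample(raw, granularity_s=60, agg="mean"))  # [(0, 25.0), (60, 25.0)]
```

Rolling raw 10-second samples up into one point per minute, and keeping only the coarser points for older data, is a common way to make long retention affordable.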

System Monitoring

For most operations teams, system monitoring tools constitute the central hub of visibility into the status of their production environments. Use these tools to detect and investigate hardware, network and software problems. This is a broad definition that captures many flavors of tools, and thorough research is required before adopting a tool for production use.

System monitoring tools frequently employ a plugin architecture, making it easy to monitor the health of various types of infrastructure. Note that breadth often comes at the expense of depth: as your company grows, you will likely adopt additional tools from other categories to augment your system monitoring solution.

Common Features

  • Status dashboard (i.e. “red/yellow/green” infrastructure overview)
  • Alerting via email, SMS, etc.
  • Agents for periodic execution of health checks
  • Built-in collectors for servers & networks
  • Plugin architecture that supports many types of infrastructure
  • Check hierarchy / dependency mapping
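The plugin architecture and check hierarchy can be sketched in a few lines of Python. This is a hypothetical design, not the internals of any listed tool: each check is a plugin callable returning a status (the numeric values follow the OK/WARNING/CRITICAL/UNKNOWN exit-code convention used by Nagios-style plugins), and a failing check is not alerted on when the check it depends on is also failing. `Agent` and `Check` are illustrative names.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, Optional


class Status(IntEnum):
    # Numeric values follow the Nagios-style plugin exit-code convention.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


@dataclass
class Check:
    name: str
    run: Callable[[], Status]      # the "plugin": any callable returning a Status
    parent: Optional[str] = None   # dependency, e.g. a web server behind a switch


@dataclass
class Agent:
    checks: dict[str, Check] = field(default_factory=dict)

    def register(self, check: Check) -> None:
        self.checks[check.name] = check

    def run_all(self) -> dict[str, Status]:
        """Execute every check once; a real agent would call this on a timer."""
        results: dict[str, Status] = {}
        for name, check in self.checks.items():
            try:
                results[name] = check.run()
            except Exception:
                results[name] = Status.UNKNOWN  # a crashing plugin is not a pass
        return results

    def alerts(self, results: dict[str, Status]) -> list[str]:
        """Alert on non-OK checks, skipping those whose parent is also non-OK."""
        out: list[str] = []
        for name, status in results.items():
            if status == Status.OK:
                continue
            parent = self.checks[name].parent
            if parent is not None and results.get(parent, Status.OK) != Status.OK:
                continue  # symptom of an upstream failure; the parent alert covers it
            out.append(f"{status.name}: {name}")
        return out


if __name__ == "__main__":
    agent = Agent()
    agent.register(Check("core-switch", lambda: Status.CRITICAL))
    agent.register(Check("web-01", lambda: Status.CRITICAL, parent="core-switch"))
    agent.register(Check("disk-db-01", lambda: Status.WARNING))
    print(agent.alerts(agent.run_all()))  # ['CRITICAL: core-switch', 'WARNING: disk-db-01']
```

With the core switch down, only the switch and the unrelated disk warning are reported; the web server alert is treated as a symptom of the upstream failure.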

Anomaly Detection

Application load tends to have a daily rhythm: it rises during the day and falls at night. And yet our monitoring alerts rely almost exclusively on static thresholds, resulting in many inaccurate alerts. For example, consider an application bug causing high disk IO. At night, this bug will likely go unnoticed, because the low baseline load keeps IO below a threshold tuned for daytime traffic (false negative). During the day, we will receive an alert, but then ignore it, because we will already be flooded by unnecessary alerts caused by healthy high traffic (false positive).

Anomaly detection tools address this problem. They analyze your system’s behavior over time and calculate an adaptive baseline representing the system’s “normal” behavior. Then, when your system behaves abnormally, they detect the anomaly and alert on it.

Common Features

  • Consume time-series or log data
  • Detect & alert on anomalous behavior
  • Automatic context for root cause analysis
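The adaptive-baseline idea can be illustrated with a deliberately simple model: learn the mean and standard deviation of a metric separately for each hour of the day, then flag values more than `k` standard deviations away from that hour's mean. This z-score baseline is an assumption chosen for clarity; the tools listed below use their own, generally more sophisticated, models. `HourlyBaseline` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean, pstdev


class HourlyBaseline:
    """Per-hour-of-day baseline: a simple stand-in for an adaptive model."""

    def __init__(self, k: float = 3.0, min_history: int = 2, min_std: float = 1e-9):
        self.k = k                      # how many standard deviations count as anomalous
        self.min_history = min_history  # samples needed before judging an hour
        self.min_std = min_std          # avoids division by zero on perfectly flat history
        self.history: dict[int, list[float]] = defaultdict(list)

    def train(self, hour: int, value: float) -> None:
        self.history[hour].append(value)

    def is_anomalous(self, hour: int, value: float) -> bool:
        past = self.history.get(hour, [])
        if len(past) < self.min_history:
            return False  # not enough data for this hour yet
        mu = mean(past)
        sigma = max(pstdev(past), self.min_std)
        return abs(value - mu) / sigma > self.k


if __name__ == "__main__":
    baseline = HourlyBaseline(k=3.0)
    for io in [9, 10, 11, 10, 10]:          # disk IO history at 03:00
        baseline.train(3, io)
    for io in [90, 110, 100, 95, 105]:      # disk IO history at 14:00
        baseline.train(14, io)
    print(baseline.is_anomalous(3, 60))     # True: night-time spike from the bug
    print(baseline.is_anomalous(14, 115))   # False: busy but normal afternoon
```

A static threshold tuned for daytime traffic (say, 150) would miss the night-time reading of 60, while the per-hour baseline flags it and still tolerates ordinary afternoon variation.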

Log Monitoring

Virtually all software writes log files. Logs provide low-level visibility into application behavior; they are extremely useful for debugging and can help with tracking recurring errors.

The rise of distributed systems led to an explosion in the number of log files and log lines, and locating an individual transaction in that ocean of log files became practically impossible. Log management tools were invented to address this issue. Much as Google crawls and indexes web pages, log management tools collect and index all your log data. This allows you to quickly search for specific messages, errors and patterns across all your log files.

Common Features

  • Query language for searching logs
  • Timelines & histograms
  • Automatic alerts
  • Customizable dashboards
  • Aggregation functions & analytics
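The collect-and-index approach can be sketched with an inverted index, which maps each token to the set of log lines containing it, so a search intersects small sets instead of scanning every file. This sketch rests on simplifying assumptions: tokenization is a single regular expression, queries are AND-only, and `LogIndex` is a hypothetical class; production tools add field extraction, time ranges and distributed storage.

```python
import re
from collections import defaultdict

# Tokens: lowercase words, keeping "-", "_" and "." so IDs like txn-42 stay whole.
TOKEN = re.compile(r"[a-z0-9_.-]+")


class LogIndex:
    def __init__(self) -> None:
        self.lines: list[str] = []
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, line: str) -> None:
        """Index one log line under every token it contains."""
        line_id = len(self.lines)
        self.lines.append(line)
        for token in TOKEN.findall(line.lower()):
            self.postings[token].add(line_id)

    def search(self, query: str) -> list[str]:
        """Lines containing every query term (AND semantics), in ingestion order."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.lines[i] for i in sorted(ids)]

    def count(self, term: str) -> int:
        """Number of lines containing `term`, e.g. as input to a histogram or alert rule."""
        return len(self.postings.get(term.lower(), set()))


if __name__ == "__main__":
    index = LogIndex()
    index.add("web-01 INFO txn-41 payment ok")
    index.add("web-02 ERROR txn-42 payment timeout")
    index.add("db-01 ERROR txn-42 lock wait timeout")
    print(index.search("error txn-42"))  # the two txn-42 ERROR lines
```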

APM

Application Performance Monitoring tools (APMs) monitor the behavior of your applications by tracking transaction flow, starting with the client, and working down the stack, through the backend and database. They measure performance metrics such as latency, throughput and error rate. Use them to detect and debug user experience issues.

APMs provide important visibility that would be very hard to achieve otherwise. APM agents perform code-level instrumentation and therefore require language-specific implementations. Because this instrumentation runs inside your applications, monitored applications may incur a small performance overhead.

Common Features

  • Latency, throughput & error-rate measurement
  • Geography-based segmentation
  • Common errors & occurrence frequency
  • Database query performance
  • Correlation of performance metrics with code deployments
  • Alerting via email, SMS, etc.
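Code-level instrumentation can be approximated in Python with a decorator that wraps a function and records its call count, error count and latency. The `instrument` decorator and `METRICS` store are hypothetical; APM agents typically hook into frameworks and libraries automatically rather than requiring manual decoration. The timing code that runs on every call also illustrates where the small performance overhead of instrumentation comes from.

```python
import functools
import math
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Stats:
    calls: int = 0    # throughput = calls divided by the length of the measurement window
    errors: int = 0
    latencies_ms: list[float] = field(default_factory=list)

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile of the recorded latencies, in milliseconds."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


# Hypothetical in-process metric store; a real agent would ship these to a backend.
METRICS: dict[str, Stats] = defaultdict(Stats)


def instrument(name: str):
    """Decorator recording calls, errors and latency for the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                METRICS[name].errors += 1
                raise  # never swallow the application's exception
            finally:
                stats = METRICS[name]
                stats.calls += 1
                stats.latencies_ms.append((time.perf_counter() - start) * 1000)
        return wrapper
    return decorator


if __name__ == "__main__":
    @instrument("checkout")
    def checkout(amount: int) -> int:
        if amount < 0:
            raise ValueError("negative amount")
        return amount

    for a in (10, 20, -1, 30):
        try:
            checkout(a)
        except ValueError:
            pass
    s = METRICS["checkout"]
    print(s.calls, s.errors, s.error_rate)  # 4 1 0.25
```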

Web & User Monitoring

Web & user monitoring tools measure how your application performs “from the outside.” They simulate traffic to your application from various geographies and alert you to failures and timeouts (Synthetic Monitoring). Additionally, they can be embedded into your web frontends or mobile applications to track real failures occurring in your users’ clients (Real User Monitoring).

Unlike monitoring tools that track technical performance metrics, web & user monitoring metrics are tied directly to actual user experience. Web & user monitoring alerts almost always indicate a real issue that must be resolved promptly. However, these alerts provide little context about what is causing the problem, so we recommend complementing web & user monitoring tools with System Monitoring and Application Performance Monitoring tools.

Common Features

  • Monitor HTTP / HTTPS / SSH / generic TCP endpoints
  • Uptime & SLA tests
  • Geographical segmentation
  • Alerting via email, SMS, etc.
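A generic TCP endpoint check, plus the uptime arithmetic behind SLA tests and geographical segmentation, might look like the following sketch. The `region` field is only a label here (a real synthetic monitoring service runs the probe from machines located in each region), the 99.9% target is an illustrative value, and all function names are hypothetical. An HTTP check would additionally inspect the status code and response body.

```python
import socket
import time
from dataclasses import dataclass


@dataclass
class ProbeResult:
    region: str        # label only in this sketch
    ok: bool
    latency_s: float


def tcp_probe(region: str, host: str, port: int, timeout: float = 3.0) -> ProbeResult:
    """Generic TCP check: the endpoint is up if a connection opens within `timeout`."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # connection refused, unreachable host, DNS failure or timeout
        ok = False
    return ProbeResult(region, ok, time.monotonic() - start)


def uptime_pct(results: list[ProbeResult]) -> float:
    if not results:
        return 100.0  # no probes yet: nothing has been observed failing
    return 100.0 * sum(r.ok for r in results) / len(results)


def uptime_by_region(results: list[ProbeResult]) -> dict[str, float]:
    by_region: dict[str, list[ProbeResult]] = {}
    for r in results:
        by_region.setdefault(r.region, []).append(r)
    return {region: uptime_pct(rs) for region, rs in by_region.items()}


def sla_breached(results: list[ProbeResult], target_pct: float = 99.9) -> bool:
    return uptime_pct(results) < target_pct


if __name__ == "__main__":
    results = [
        ProbeResult("eu-west", True, 0.08),
        ProbeResult("eu-west", False, 3.0),   # timeout
        ProbeResult("us-east", True, 0.12),
        ProbeResult("us-east", True, 0.11),
    ]
    print(uptime_by_region(results))  # {'eu-west': 50.0, 'us-east': 100.0}
    print(sla_breached(results))      # True: 75% is below 99.9%
```

Segmenting by region shows that the outage is confined to one geography, which a single global uptime figure would hide.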

On-Call Notification

More and more companies are moving away from a strict tier-based operations model. In these companies, developers and infrastructure engineers respond to alerts directly, rather than through a tier-1 team. This bypasses the traditional manual escalation process and significantly reduces overall resolution time.

On-call management tools enable this methodology. They consume alerts from your monitoring stack and automatically route them to the person currently on call, normally via a mobile notification. If that person doesn’t respond within a predefined SLA, the alert is automatically escalated to a second on-call person.

Common Features

  • Manage on-call schedules
  • Automatic routing based on schedule
  • Configurable, automatic escalation policies
  • Notification via SMS, phone call or push notification
  • Acknowledge, resolve & add a comment to an alert
  • Mobile apps
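Schedule-based routing and escalation can be sketched as follows, assuming a simple weekly round-robin rotation in which the backup is the next engineer in line and the acknowledgement SLA is 15 minutes. `Rotation` and `escalation_targets` are hypothetical; on-call tools support far richer schedules and multi-step policies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Rotation:
    engineers: list[str]
    start: datetime
    shift: timedelta = timedelta(days=7)   # weekly round-robin (assumption)

    def on_call(self, at: datetime, offset: int = 0) -> str:
        """Engineer on shift at `at`; offset=1 returns the next one in line (the backup)."""
        index = (at - self.start) // self.shift   # timedelta // timedelta -> int
        return self.engineers[(index + offset) % len(self.engineers)]


def escalation_targets(rotation: Rotation, fired_at: datetime, now: datetime,
                       ack_timeout: timedelta = timedelta(minutes=15)) -> list[str]:
    """Who should have been paged so far for an alert that is still unacknowledged.

    The primary is paged immediately; once `ack_timeout` passes without an
    acknowledgement, the alert escalates to the backup.
    """
    targets = [rotation.on_call(fired_at)]
    if now - fired_at >= ack_timeout:
        targets.append(rotation.on_call(fired_at, offset=1))
    return targets


if __name__ == "__main__":
    rotation = Rotation(["alice", "bob", "carol"], start=datetime(2024, 1, 1))
    fired = datetime(2024, 1, 9, 3, 0)  # second week: bob is primary
    print(escalation_targets(rotation, fired, fired + timedelta(minutes=5)))   # ['bob']
    print(escalation_targets(rotation, fired, fired + timedelta(minutes=20)))  # ['bob', 'carol']
```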

Event Correlation

The growing scale & complexity of modern production environments has caused an explosion in the amount of data we must process to make operational decisions. Processing events manually is becoming increasingly impractical.

Event processing tools help you automate large parts of the incident resolution process. They consume alerts from your monitoring tools and run them through a series of processing steps: Correlation (matching related alerts), Enrichment (adding insight & context to events), Noise Suppression (removing unnecessary events) and Routing (funneling events to specific stakeholders). Use event processing tools to improve service uptime and team productivity.

Common Features

  • Event correlation & enrichment
  • Alert routing
  • Alert analytics
  • Integration with collaboration platforms (e.g. JIRA, Slack, ServiceNow)
  • Consolidated event dashboard
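The four processing steps can be chained into a small pipeline. This sketch makes several assumptions: alerts on the same host within a 5-minute window belong to one incident, ownership comes from a hard-coded dictionary standing in for an inventory lookup, and noise suppression simply drops informational alerts and repeated host/check pairs within the batch. Noise suppression runs first here so duplicates never reach correlation; all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    ts: int         # epoch seconds
    host: str
    check: str
    severity: str   # "info", "warning" or "critical"


@dataclass
class Incident:
    host: str
    alerts: list[Alert] = field(default_factory=list)
    team: str = "unassigned"


# Hypothetical ownership data standing in for an inventory lookup.
OWNERS = {"db-01": "dba", "web-01": "web"}


def suppress_noise(alerts: list[Alert]) -> list[Alert]:
    """Noise suppression: drop informational alerts and repeats of a host/check pair."""
    seen: set[tuple[str, str]] = set()
    kept: list[Alert] = []
    for a in alerts:
        key = (a.host, a.check)
        if a.severity == "info" or key in seen:
            continue
        seen.add(key)
        kept.append(a)
    return kept


def correlate(alerts: list[Alert], window_s: int = 300) -> list[Incident]:
    """Correlation: same host, and within `window_s` of that incident's latest alert."""
    incidents: list[Incident] = []
    latest: dict[str, Incident] = {}
    for a in sorted(alerts, key=lambda a: a.ts):
        inc = latest.get(a.host)
        if inc is None or a.ts - inc.alerts[-1].ts > window_s:
            inc = Incident(a.host)
            incidents.append(inc)
            latest[a.host] = inc
        inc.alerts.append(a)
    return incidents


def enrich(incidents: list[Incident]) -> list[Incident]:
    """Enrichment: attach the owning team to each incident."""
    for inc in incidents:
        inc.team = OWNERS.get(inc.host, "unassigned")
    return incidents


def route(incidents: list[Incident]) -> dict[str, list[Incident]]:
    """Routing: one queue per team."""
    queues: dict[str, list[Incident]] = {}
    for inc in incidents:
        queues.setdefault(inc.team, []).append(inc)
    return queues


def process(alerts: list[Alert], window_s: int = 300) -> dict[str, list[Incident]]:
    return route(enrich(correlate(suppress_noise(alerts), window_s)))


if __name__ == "__main__":
    alerts = [
        Alert(0, "db-01", "disk_io", "critical"),
        Alert(30, "db-01", "replication_lag", "warning"),
        Alert(40, "db-01", "disk_io", "critical"),   # repeat: suppressed
        Alert(50, "web-01", "http_5xx", "critical"),
        Alert(60, "web-01", "cpu", "info"),          # informational: suppressed
    ]
    for team, incidents in process(alerts).items():
        print(team, [[a.check for a in inc.alerts] for inc in incidents])
```

Five raw alerts become two incidents, each already labeled with the team that should handle it.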

Mobile APM

Mobile apps include large quantities of native code whose performance is directly tied to revenue. And yet operations teams too often dismiss the importance of native code reliability, perhaps because that code runs outside the datacenter. We should monitor our mobile apps with the same diligence we give to backend infrastructure.

Mobile APMs are embedded into mobile apps and provide real-time visibility into their performance. Use them to track crashes and measure app speed. Debug issues by segmenting them according to device, operating system or geography.

Common Features

  • Crash reports
  • Impact analysis of external services
  • Client-backend communication monitoring
  • Device, OS, carrier network & geo segmentation
  • Uncaught exceptions tracking
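Segmenting crashes by device, OS, carrier or geography usually means comparing crash rates rather than raw crash counts, since segments differ widely in size. The sketch below assumes each app session is reported as a `Session` record; this schema and `crash_rate_by` are hypothetical, and mobile APM SDKs collect such attributes automatically.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    device: str
    os: str
    carrier: str
    country: str
    crashed: bool


def crash_rate_by(sessions: list[Session], dimension: str) -> dict[str, float]:
    """Crash rate (crashed sessions / total sessions) for each value of `dimension`."""
    totals: Counter = Counter()
    crashes: Counter = Counter()
    for s in sessions:
        key = getattr(s, dimension)  # "device", "os", "carrier" or "country"
        totals[key] += 1
        crashes[key] += int(s.crashed)
    return {key: crashes[key] / totals[key] for key in totals}


if __name__ == "__main__":
    sessions = [
        Session("Pixel 7", "Android 14", "CarrierA", "DE", crashed=True),
        Session("Pixel 7", "Android 14", "CarrierA", "DE", crashed=False),
        Session("iPhone 13", "iOS 17", "CarrierB", "US", crashed=False),
        Session("iPhone 13", "iOS 17", "CarrierB", "US", crashed=False),
    ]
    print(crash_rate_by(sessions, "os"))  # {'Android 14': 0.5, 'iOS 17': 0.0}
```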

Error Tracking

No matter how much you test, realistically your applications are going to have bugs. How you respond to these bugs once they occur is the key to reliability. Error tracking tools capture exceptions in your runtime code and provide context to help you prioritize and investigate them.

Log files provide general-purpose visibility, but errors too often go unnoticed or unhandled. Error tracking tools focus on actionability: they surface frequent errors, alert you in real time on new error types, and help you collaborate on their resolution.

Common Features

  • Monitor exceptions in backend & frontend code
  • Sort errors by frequency and severity
  • Automatically group duplicate exceptions
  • Alerting via email, SMS, etc.
  • Assign and track error resolution
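Automatic grouping of duplicate exceptions usually relies on a fingerprint that ignores the parts that vary between occurrences. The sketch below assumes a fingerprint built from the exception type and the function names on its traceback, leaving out the message and line numbers; `fingerprint` and `ErrorTracker` are hypothetical names, and commercial tools apply their own, more elaborate grouping rules.

```python
import traceback
from collections import Counter


def fingerprint(exc: BaseException) -> str:
    """Group key: exception type plus the function names on its traceback.

    The message and line numbers are deliberately left out, so the same bug
    triggered by different inputs lands in the same group.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    path = " > ".join(frame.name for frame in frames)
    return f"{type(exc).__name__}@{path}"


class ErrorTracker:
    def __init__(self) -> None:
        self.counts: Counter = Counter()  # occurrences per group
        self.samples: dict = {}           # first message seen per group

    def capture(self, exc: BaseException) -> bool:
        """Record an exception; return True if it starts a new group (alert-worthy)."""
        key = fingerprint(exc)
        is_new = key not in self.counts
        self.counts[key] += 1
        self.samples.setdefault(key, str(exc))
        return is_new

    def top(self, n: int = 10) -> list:
        """Most frequent error groups first."""
        return self.counts.most_common(n)


if __name__ == "__main__":
    def parse_amount(raw: str) -> int:
        return int(raw)

    tracker = ErrorTracker()
    for raw in ["10", "abc", "x1"]:
        try:
            parse_amount(raw)
        except ValueError as exc:
            print(tracker.capture(exc))  # True for "abc" (new group), then False for "x1"
    print(tracker.top())
```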

Specialized

As the saying goes, do one thing and do it well. This category includes monitoring tools that specialize in particular use cases or in specific infrastructure vendors.

Enterprise Suites

Before the monitoring boom, companies relied on a fairly small set of vendors to monitor their environments. These vendors built large monitoring suites providing holistic workflows and end-to-end visibility. However, the rapid proliferation of SaaS and open-source tools has significantly eroded their market share in recent years.

Chat

Network Performance Monitoring

Synthetic Monitoring

Ticketing

  • Cacti (Time-Series Databases)
    Deployment: On Prem
  • Circonus (System Monitoring)
    Deployment: On Prem & SaaS
  • Graphite (Time-Series Databases)
    Deployment: On Prem
  • InfluxData (Time-Series Databases)
    Deployment: On Prem & SaaS
  • Librato (Time-Series Databases)
    Deployment: SaaS
  • OpenTSDB (Time-Series Databases)
    Deployment: On Prem
  • RRDtool (Time-Series Databases)
    Deployment: On Prem
  • SignalFX (Time-Series Databases)
    Deployment: SaaS
  • Collectd (Time-Series Databases)
    Deployment: On Prem
  • Ganglia (System Monitoring)
    Deployment: On Prem
  • Icinga (System Monitoring)
    Deployment: On Prem
  • Munin (Time-Series Databases)
    Deployment: On Prem
  • Nagios (System Monitoring)
    Deployment: On Prem
  • Scout (APM)
    Deployment: SaaS
  • Zabbix (System Monitoring)
    Deployment: On Prem
  • Zenoss (System Monitoring)
    Deployment: On Prem & SaaS
  • Elastic (Log Monitoring)
    Deployment: On Prem & SaaS
  • Graylog (Log Monitoring)
    Deployment: On Prem
  • Logentries (Log Monitoring)
    Deployment: SaaS
  • Papertrail (Log Monitoring)
    Deployment: SaaS
  • Splunk (Log Monitoring)
    Deployment: On Prem
  • SumoLogic (Log Monitoring)
    Deployment: SaaS
  • AppDynamics (APM)
    Deployment: On Prem & SaaS
  • New Relic (APM)
    Deployment: SaaS
  • Apica (System Monitoring)
    Deployment: SaaS
  • Keynote (Synthetic Monitoring)
    Deployment: SaaS
  • Panopta (Synthetic Monitoring)
    Deployment: SaaS
  • OpsGenie (On-Call Notification)
    Deployment: SaaS
  • PagerDuty (On-Call Notification)
    Deployment: SaaS
  • VictorOps (On-Call Notification)
    Deployment: SaaS
  • BigPanda (Event Correlation)
    Deployment: SaaS
  • MoogSoft (Event Correlation)
    Deployment: On Prem
  • NewRelic Mobile (Mobile APM)
    Deployment: SaaS
  • AirBrake (Error Tracking)
    Deployment: SaaS
  • BugSnag (Error Tracking)
    Deployment: SaaS
  • Honeybadger (Error Tracking)
    Deployment: SaaS
  • Prometheus (Time-Series Databases)
    Deployment: On Prem
  • Datadog (System Monitoring)
    Deployment: SaaS
  • LogicMonitor (System Monitoring)
    Deployment: SaaS
  • Sensu (System Monitoring)
    Deployment: On Prem
  • Logscape (Log Monitoring)
    Deployment: On Prem
  • StatsD (Time-Series Databases)
    Deployment: On Prem
  • Dynatrace (APM)
    Deployment: On Prem
  • Ruxit (APM)
    Deployment: SaaS
  • Stackify (APM)
    Deployment: SaaS
  • Gomez (Synthetic Monitoring)
    Deployment: SaaS
  • Pingdom (Synthetic Monitoring)
    Deployment: SaaS
  • xMatters (On-Call Notification)
    Deployment: On Prem & SaaS
  • Splunk>MINT (Mobile APM)
    Deployment: On Prem & SaaS
  • Raygun (Error Tracking)
    Deployment: SaaS
  • Rollbar (Error Tracking)
    Deployment: SaaS
  • Grok (Anomaly Detection)
    Deployment: On Prem
  • EtsySkyline (Anomaly Detection)
    Deployment: On Prem
  • CloudWatch (Specialized)
    Deployment: SaaS
  • StackDriver (Specialized)
    Deployment: SaaS
  • ThousandEyes (Synthetic Monitoring)
    Deployment: SaaS
  • BMC (Enterprise Suites)
    Deployment: On Prem
  • IBM (Enterprise Suites)
    Deployment: On Prem
  • HP (Enterprise Suites)
    Deployment: On Prem
  • Microsoft SCOM (Enterprise Suites)
    Deployment: On Prem
  • SolarWinds (Enterprise Suites)
    Deployment: On Prem
  • AppDynamics Mobile (Mobile APM)
    Deployment: On Prem & SaaS
  • AWS X-Ray (APM)
    Deployment: SaaS
  • CA Wily (APM)
    Deployment: On Prem
  • Corvil (Network Performance Monitoring)
    Deployment: On Prem
  • Crashlytics (Mobile APM)
    Deployment: SaaS
  • Dynatrace Mobile (Mobile APM)
    Deployment: On Prem & SaaS
  • ExtraHop (Network Performance Monitoring)
    Deployment: On Prem
  • FluentD (Log Monitoring)
    Deployment: On Prem
  • Grafana (Time-Series Databases)
    Deployment: On Prem
  • Instana (APM)
    Deployment: On Prem & SaaS
  • MRTG (Time-Series Databases)
    Deployment: On Prem
  • Netscout (Network Performance Monitoring)
    Deployment: On Prem
  • Netuitive (Anomaly Detection)
    Deployment: On Prem
  • OverOps (Error Tracking)
    Deployment: On Prem & SaaS
  • Prelert (Anomaly Detection)
    Deployment: On Prem
  • Riverbed (Network Performance Monitoring)
    Deployment: On Prem
  • ScienceLogic (System Monitoring)
    Deployment: On Prem
  • SevOne (Network Performance Monitoring)
    Deployment: On Prem
  • SmartBear (Synthetic Monitoring)
    Deployment: On Prem & SaaS
  • Soasta (Synthetic Monitoring)
    Deployment: SaaS
  • Sysdig (Specialized)
    Deployment: On Prem & SaaS
  • VMware (Specialized)
    Deployment: On Prem
  • Viavi (Network Performance Monitoring)
    Deployment: On Prem
  • Wavefront (Time-Series Databases)
    Deployment: SaaS
  • XPLG (Log Monitoring)
    Deployment: On Prem
  • Apteligent (Mobile APM)
    Deployment: SaaS
  • HipChat (Chat)
    Deployment: On Prem & SaaS
  • Logz.io (Log Monitoring)
    Deployment: SaaS
  • ServiceNow (Event Correlation)
    Deployment: SaaS
  • Outlyer (System Monitoring)
    Deployment: SaaS