MonitoringScape is the definitive guide to the ever-changing landscape of IT monitoring. As a community resource, we welcome your submissions and feedback.
Use time-series databases to store and visualize your performance metrics. Common metrics include system & network performance (e.g. CPU Load), application performance (e.g. Transaction Latency) and business KPIs (e.g. Ad Impressions). Time-series databases are optimized for scale & performance, and many can ingest millions of samples per second.
For most operations teams, system monitoring tools constitute the central hub of visibility into the status of their production environments. Use these tools to detect and investigate hardware, network and software problems. This is a broad definition that captures many flavors of tools, and thorough research is required before adopting a tool for production use.
System monitoring tools frequently employ a plugin architecture, making it easy to monitor the health of various types of infrastructure. Note that often breadth comes at the expense of depth: as your company grows, you will likely choose to adopt additional tools from other categories to augment your system monitoring solution.
Application load tends to follow a rhythm: it rises during the day and falls at night. Yet our monitoring alerts rely almost exclusively on static thresholds, which makes them inaccurate. For example, consider an application bug that causes high disk IO. At night, the bug will likely go unnoticed, because the low baseline load keeps disk IO under the threshold (a false negative). During the day, we will receive an alert but probably ignore it, as we’ll already be flooded with unnecessary alerts triggered by healthy high traffic (false positives).
Anomaly detection tools address this problem. They analyze your system’s behavior over time and calculate an adaptive baseline representing the system’s “normal” behavior. Then, when your system behaves abnormally, they capture the anomaly and alert on it. You can read more about how anomaly detection helps DevOps teams here.
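To make the contrast between a static threshold and an adaptive baseline concrete, here is a minimal Python sketch. It keeps a separate mean and standard deviation for each hour of the day and flags any sample that strays too far from that hour's norm. The class name, the three-sigma cutoff, the static threshold of 500 and the sample values are all assumptions made for illustration; real anomaly detection tools use considerably more sophisticated models.

```python
from collections import defaultdict
from statistics import mean, stdev


class HourlyBaseline:
    """Hypothetical adaptive baseline: one mean/stddev per hour of day."""

    def __init__(self, sigmas=3.0):
        self.sigmas = sigmas              # assumed cutoff, not a standard value
        self.history = defaultdict(list)  # hour of day -> past samples

    def observe(self, hour, value):
        """Record a sample that is assumed to be healthy."""
        self.history[hour].append(value)

    def is_anomaly(self, hour, value):
        """True if value deviates from this hour's baseline by more than sigmas."""
        samples = self.history[hour]
        if len(samples) < 2:
            return False                  # not enough data to judge
        mu, sd = mean(samples), stdev(samples)
        if sd == 0:
            return value != mu
        return abs(value - mu) > self.sigmas * sd


def static_alert(value, threshold=500):
    """Static threshold for comparison (the threshold is an assumed number)."""
    return value > threshold


baseline = HourlyBaseline()
# Illustrative disk IO samples: quiet at 03:00, busy at 14:00.
for v in (40, 45, 50, 42, 48):
    baseline.observe(3, v)
for v in (600, 650, 620, 640, 610):
    baseline.observe(14, v)

night_bug = 200   # abnormal for 03:00, yet below the static threshold
busy_day = 630    # normal for 14:00, yet above the static threshold

print(static_alert(night_bug), baseline.is_anomaly(3, night_bug))   # False True
print(static_alert(busy_day), baseline.is_anomaly(14, busy_day))    # True False
```

In this toy data the static threshold misses the night-time bug and fires on healthy daytime traffic, while the per-hour baseline does the opposite in both cases.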
Essentially all kinds of software output log files. Logs provide low-level visibility on application behavior; they are extremely useful for debugging, and can help with tracking recurring errors.
The rise of distributed systems resulted in an explosion in the number of log files and log lines. Locating an individual transaction in the ocean of log files became impossible. Log management tools were invented to address this issue. Much as Google crawls and indexes webpages, log management tools collect and index all your log data. This allows you to quickly search for specific messages, errors and patterns across all your log files.
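The collect-and-index idea can be sketched with a tiny in-memory inverted index. This is only an illustration of the principle, not how any particular log management product works; the class name, the tokenizer and the sample log lines are hypothetical.

```python
import re
from collections import defaultdict


class LogIndex:
    """Hypothetical in-memory inverted index over log lines."""

    def __init__(self):
        self.lines = []                    # (source, text) for each indexed line
        self.postings = defaultdict(set)   # token -> ids of lines containing it

    @staticmethod
    def tokenize(text):
        """Lowercase the text and split it into alphanumeric tokens."""
        return re.findall(r"[a-z0-9_]+", text.lower())

    def add(self, source, text):
        line_id = len(self.lines)
        self.lines.append((source, text))
        for token in self.tokenize(text):
            self.postings[token].add(line_id)

    def search(self, query):
        """Return lines containing every token of the query, in insertion order."""
        tokens = self.tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.lines[i] for i in sorted(ids)]


index = LogIndex()
index.add("web-1.log", "ERROR txn=8841 payment timeout")
index.add("db-1.log", "WARN slow query txn=8841")
index.add("web-2.log", "INFO txn=9002 ok")

hits = index.search("txn=8841")
for source, text in hits:
    print(source, text)
```

A search touches only the posting sets for the query tokens rather than scanning every file, which is what makes locating one transaction across many logs practical.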
Application Performance Monitoring tools (APMs) monitor the behavior of your applications by tracking transaction flow, starting with the client, and working down the stack, through the backend and database. They measure performance metrics such as latency, throughput and error rate. Use them to detect and debug user experience issues.
APMs provide important visibility that would be very hard to achieve otherwise. APM agents perform code-level instrumentation and therefore require language-specific implementations. Your applications might incur small performance penalties when monitored using APMs.
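As a rough picture of what instrumentation involves, the Python sketch below wraps a function in a decorator that records call count, error count and elapsed time. Real APM agents typically hook in automatically and at a deeper level; the names `TransactionStats`, `traced` and `checkout` are hypothetical and exist only for this example.

```python
import time
from functools import wraps


class TransactionStats:
    """Hypothetical per-transaction counters."""

    def __init__(self):
        self.calls = 0
        self.errors = 0
        self.total_seconds = 0.0

    def error_rate(self):
        return self.errors / self.calls if self.calls else 0.0

    def mean_latency(self):
        return self.total_seconds / self.calls if self.calls else 0.0


def traced(stats):
    """Decorator that records latency and errors for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats.errors += 1
                raise                     # never swallow the application's error
            finally:
                stats.calls += 1
                stats.total_seconds += time.perf_counter() - start
        return wrapper
    return decorator


checkout_stats = TransactionStats()


@traced(checkout_stats)
def checkout(cart_total):
    if cart_total <= 0:
        raise ValueError("empty cart")
    return "ok"


checkout(25)
try:
    checkout(0)
except ValueError:
    pass

print(checkout_stats.calls, checkout_stats.errors, checkout_stats.error_rate())
```

The extra timing and bookkeeping around every call is also where the small performance penalty comes from.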
Web & user monitoring tools measure how your application performs “from the outside.” They simulate traffic to your application from various geographies and alert you on failures and timeouts (Synthetic Monitoring). Additionally, they can be embedded into your web frontends or mobile applications in order to track real failures arising in your users’ clients (Real User Monitoring).
Unlike monitoring tools that track technical performance metrics, web & user monitoring metrics are tied directly to actual user experience. Web & user monitoring alerts almost always indicate that you have a real issue that must be resolved promptly. However, these alerts can’t provide much context as to what is causing the problem. It is recommended to complement web & user monitoring tools with System Monitoring and Application Performance Monitoring tools.
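A single synthetic check, as described above, might look like the following Python sketch. It uses the standard library's `urllib.request.urlopen`; the `probe` function, its result fields and the example URL are assumptions for illustration. The fetch function is injectable so the example can simulate a healthy response and a timeout without touching the network.

```python
import time
import urllib.request


def http_fetch(url, timeout):
    """Fetch a URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, fetch=http_fetch, timeout=5.0):
    """Run one synthetic check and return a result record."""
    start = time.perf_counter()
    try:
        status = fetch(url, timeout)
        ok, error = 200 <= status < 400, None
    except Exception as exc:   # timeouts, DNS failures, HTTP error responses, ...
        status, ok, error = None, False, type(exc).__name__
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "error": error,
        "seconds": time.perf_counter() - start,
    }


# Simulated fetchers so the example runs offline (hypothetical stand-ins).
def fake_ok(url, timeout):
    return 200


def fake_timeout(url, timeout):
    raise TimeoutError("simulated timeout")


up = probe("https://example.invalid/health", fetch=fake_ok)
down = probe("https://example.invalid/health", fetch=fake_timeout)
print(up["ok"], down["ok"], down["error"])
```

A real tool would run such probes on a schedule from several regions and alert when a check fails. Note that the result says only that the check failed, not why, which is the missing context mentioned above.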
More and more companies are transitioning away from a strict tier-based operations model. In these companies, developers and infrastructure engineers respond to alerts directly, instead of a tier-1 team. This bypasses the traditional manual escalation process and reduces overall resolution time significantly.
On-call management tools enable this methodology. They consume alerts from your monitoring stack and automatically route them to the person who is currently on call. The alert is normally communicated to the on-call person via a mobile notification. If the person doesn’t respond within a predefined SLA, the alert is automatically escalated to a second on-call person.
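The escalation rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `Alert` fields, the `current_responder` function, the 300-second acknowledgement window and the names in the rotation are all hypothetical, and the rotation list is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    message: str
    raised_at: float                   # seconds on any consistent clock
    acked_at: Optional[float] = None   # set when someone acknowledges


def current_responder(alert, rotation, now, ack_sla_seconds=300):
    """Return who should be notified at time `now`, or None once acknowledged.

    `rotation` is ordered: primary on-call first, then each escalation level.
    """
    if alert.acked_at is not None and alert.acked_at <= now:
        return None
    elapsed = now - alert.raised_at
    level = int(elapsed // ack_sla_seconds)   # one level per missed SLA window
    return rotation[min(level, len(rotation) - 1)]


rotation = ["alice", "bob"]
alert = Alert("disk full on db-1", raised_at=0.0)

print(current_responder(alert, rotation, now=60))    # alice
print(current_responder(alert, rotation, now=301))   # bob

alert.acked_at = 320.0
print(current_responder(alert, rotation, now=400))   # None
```

Once the last person in the rotation has been reached, this sketch keeps notifying them; a real tool might instead loop back to the start or notify a manager.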
The growth in scale & complexity of modern production environments resulted in an explosion in the amount of data we have to process to make operational decisions. Manual processing of events is becoming harder and harder.
Event processing tools help you automate large parts of the incident resolution process. They consume alerts from your monitoring tools, and run them through a series of processing steps: Correlation (matching related alerts), Enrichment (adding insight & context to events), Noise Suppression (removing unnecessary events) and Routing (funneling events to specific stakeholders). Use event processing tools to boost your service uptime and team productivity.
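The four processing steps can be sketched as a small Python pipeline. Everything here is a hypothetical simplification: the alert fields, the `OWNERS` lookup table, correlating by host and the severity cutoff are assumptions chosen to keep the example short, not rules taken from any real tool.

```python
from collections import defaultdict

# Hypothetical ownership table used for enrichment and routing.
OWNERS = {"db-1": "database-team", "web-1": "web-team"}


def correlate(alerts):
    """Correlation: group alerts that share a host into one event."""
    by_host = defaultdict(list)
    for alert in alerts:
        by_host[alert["host"]].append(alert)
    return [{"host": host, "alerts": group} for host, group in by_host.items()]


def enrich(event):
    """Enrichment: add the owning team and the highest severity in the group."""
    event["owner"] = OWNERS.get(event["host"], "ops")
    event["severity"] = max(a["severity"] for a in event["alerts"])
    return event


def is_noise(event, min_severity=2):
    """Noise suppression: treat low-severity events as unnecessary."""
    return event["severity"] < min_severity


def route(events):
    """Routing: funnel each event to its owner's queue."""
    queues = defaultdict(list)
    for event in events:
        queues[event["owner"]].append(event)
    return dict(queues)


def process(alerts):
    events = [enrich(e) for e in correlate(alerts)]
    return route(e for e in events if not is_noise(e))


alerts = [
    {"host": "db-1", "check": "disk_io", "severity": 3},
    {"host": "db-1", "check": "latency", "severity": 2},
    {"host": "web-1", "check": "cert_expiry_30d", "severity": 1},
]
queues = process(alerts)
for owner, events in queues.items():
    print(owner, [(e["host"], e["severity"], len(e["alerts"])) for e in events])
```

Three raw alerts become a single event for the database team: the two db-1 alerts are merged, and the low-severity web-1 alert is suppressed.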
Mobile apps include large quantities of native code whose performance is directly tied to revenue. And yet too often operations teams dismiss the importance of the reliability of native code, perhaps due to the fact that it resides outside of the datacenter. In fact, we should monitor our mobile apps with the same level of diligence given to backend infrastructure.
Mobile APMs are embedded into mobile apps and provide real-time visibility on their performance. Use them to track crashes and measure app speed. Debug issues by segmenting them according to device, operating system or geography.
No matter how much you test, realistically your applications are going to have bugs. How you respond to these bugs once they occur is the key to reliability. Error tracking tools capture exceptions in your runtime code and provide context to help you prioritize and investigate them.
Log files provide general-purpose visibility, but too often errors pass by unnoticed or unhandled. Error tracking tools focus on actionability. They surface the most frequent errors, alert you in real time on new error types, and help you collaborate on their resolution.
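Grouping is what turns a stream of raw exceptions into something actionable. The Python sketch below fingerprints each exception by its type and the function that raised it, counts occurrences, and reports whether an error type has been seen before. The `ErrorTracker` class and its fingerprint scheme are assumptions for illustration; real tools use richer grouping rules and capture far more context.

```python
import traceback
from collections import Counter


class ErrorTracker:
    """Hypothetical tracker that groups exceptions by a simple fingerprint."""

    def __init__(self):
        self.counts = Counter()

    @staticmethod
    def fingerprint(exc):
        """Group by exception type plus the function that raised it.

        Line numbers are left out on purpose: they shift between releases.
        """
        frames = traceback.extract_tb(exc.__traceback__)
        where = frames[-1].name if frames else "unknown"
        return f"{type(exc).__name__}@{where}"

    def capture(self, exc):
        """Record an exception; return True if this error type is new."""
        key = self.fingerprint(exc)
        is_new = key not in self.counts
        self.counts[key] += 1
        return is_new

    def top(self, n=5):
        """Most frequent error groups first."""
        return self.counts.most_common(n)


tracker = ErrorTracker()


def parse_quantity(raw):
    return int(raw)


new_flags = []
for raw in ("3", "x", "y", "7"):
    try:
        parse_quantity(raw)
    except ValueError as exc:
        new_flags.append(tracker.capture(exc))

print(new_flags)        # [True, False]
print(tracker.top(1))   # [('ValueError@parse_quantity', 2)]
```

Only the first failure is flagged as new, so an alert would fire once, while the running count shows which errors deserve attention first.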
As the saying goes, do one thing and do it well. This category includes monitoring tools that specialize in specific use-cases or specific infrastructure vendors.
Before the monitoring boom, companies relied on a fairly small set of vendors to monitor their environments. These vendors built large monitoring suites providing holistic workflows and end-to-end visibility. However, the rapid proliferation of SaaS and open-source tools has significantly reduced their market share in recent years.