Making the USE method of monitoring useful

Mistakes happen. Things will go wrong. It’s not a matter of if – it’s a matter of when. But understanding that fact in advance can help us take steps to prepare for the inevitable. By having a way to quickly identify contributing factors, we can address them faster. This translates into less downtime, which makes everyone happier.

However, knowing that you should prepare for problems is not the same as having a strategy for identifying them. If you want to rule things out quickly and systematically, you need to know what those things are and what their acceptable thresholds are.

The USE method

Think of the USE method as an emergency checklist for all your critical assets. For each resource in the list, check for one or more of:

  • Utilization
  • Saturation
  • Errors

When performance issues arise, the USE method can help identify system bottlenecks.

First, let’s define a resource. In this case, a resource is a functional server component. These can be physical elements such as disks, CPUs, network connections or buses, as well as certain software components.

The three USE criteria can mean different things depending on the context. Let’s define them for the USE method.

  • Utilization: The average time the resource was busy servicing work. We usually display utilization as a percentage over time.
  • Saturation: The amount of work the resource cannot service yet. We usually represent this metric as a queue length.
  • Errors: The number of error events. We usually display errors as a cumulative count.
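To make these definitions concrete, here is a minimal sketch that samples one utilization, saturation, and errors signal on a Linux host. It assumes Python with the psutil package, which the article does not mention; treat it as an illustration, not a prescribed implementation.

    # A minimal USE snapshot for a Linux host. Assumes the psutil
    # package is installed (pip install psutil); illustrative only.
    import os
    import psutil

    def use_snapshot():
        # Utilization: percent of time the CPUs were busy over a 1 s sample.
        utilization = psutil.cpu_percent(interval=1)

        # Saturation: runnable work beyond what the cores can service,
        # approximated by the 1-minute load average minus the core count.
        load_1m, _, _ = os.getloadavg()
        saturation = max(0.0, load_1m - psutil.cpu_count())

        # Errors: cumulative NIC receive/transmit error counts.
        nic = psutil.net_io_counters()
        errors = nic.errin + nic.errout

        return {"cpu_utilization_pct": utilization,
                "cpu_saturation_queue": saturation,
                "net_error_total": errors}

    print(use_snapshot())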

It’s important to remember that utilization and saturation are time series statistics, so it may take some trial and error to find the optimal monitoring interval.

For example, a long time interval can show high saturation levels alongside low utilization levels. Shortening the interval may reveal utilization peaks. You may want dashboards at a few different time intervals to get a clearer picture of performance trends.
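A tiny numeric illustration (plain Python, with made-up queue-length samples) shows the effect:

    # Hypothetical one-second queue-length readings with a brief spike.
    samples = [0, 0, 1, 0, 9, 8, 0, 1, 0, 0, 0, 1]

    # One long window reports a modest average and hides the spike.
    long_window_avg = sum(samples) / len(samples)   # ~1.67

    # Shorter windows expose the saturated interval.
    short_window_avgs = [sum(samples[i:i + 4]) / 4
                         for i in range(0, len(samples), 4)]  # [0.25, 4.5, 0.25]

    print(long_window_avg, short_window_avgs)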

The above example also illustrates the value of high-quality time series data storage. InfluxDB, for example, lets you ingest highly granular data and slice it in multiple ways, so you can answer several different questions about the same part of the system from a single dataset.
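As a sketch of what that looks like in practice, assuming InfluxDB 2.x and its influxdb-client Python package (the URL, token, org, and bucket names below are placeholders), the same raw points can be written once and then queried at whatever resolution a question demands:

    # Assumes InfluxDB 2.x and the influxdb-client package
    # (pip install influxdb-client). Connection details are placeholders.
    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    client = InfluxDBClient(url="http://localhost:8086",
                            token="my-token", org="my-org")

    # Write one high-resolution utilization point.
    write_api = client.write_api(write_options=SYNCHRONOUS)
    point = Point("cpu").tag("host", "web-01").field("utilization_pct", 87.5)
    write_api.write(bucket="monitoring", record=point)

    # Query the same raw data at a coarse resolution for trend spotting;
    # changing "every" re-slices it without re-ingesting anything.
    flux = '''
    from(bucket: "monitoring")
      |> range(start: -1h)
      |> filter(fn: (r) => r._measurement == "cpu")
      |> aggregateWindow(every: 10m, fn: mean)
    '''
    tables = client.query_api().query(flux)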

Make a checklist

Think about all the different resources your system uses and how you want to measure them. Some resources can cause bottlenecks in more than one way. For example, a network interconnect can suffer from both I/O throughput problems and CPU overhead. Create a separate checklist entry for each failure mode to make the identification process faster and more thorough.
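One lightweight way to capture such a checklist is as plain data, with one entry per resource and failure mode. The resource names and thresholds below are hypothetical placeholders:

    # A USE checklist as data: one entry per resource/failure-mode pair.
    USE_CHECKLIST = [
        {"resource": "cpu",     "metric": "utilization", "warn_at": "90% busy"},
        {"resource": "cpu",     "metric": "saturation",  "warn_at": "run queue > cores"},
        {"resource": "disk",    "metric": "utilization", "warn_at": "80% busy"},
        {"resource": "disk",    "metric": "errors",      "warn_at": "any increase"},
        {"resource": "network", "metric": "utilization", "warn_at": "70% of link"},
        {"resource": "network", "metric": "errors",      "warn_at": "any increase"},
    ]

    for entry in USE_CHECKLIST:
        print(f"check {entry['resource']} {entry['metric']} ({entry['warn_at']})")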

The USE method works best on resources that experience performance degradation with heavy use. It doesn’t work well on resources that use caching, because caching improves resource performance under heavy usage.

Building a monitoring system

As the caveat about caching suggests, the USE method is not a panacea. To get the most out of it, combine it with other monitoring methods and processes. Be prepared to spend significant time planning and tuning your monitoring system.

This is the approach we use at InfluxData:

  1. Before configuring dashboards, we work out our thresholds as measured by our Service Level Indicators (SLIs). This is a critical step: it lets us avoid alert fatigue by alerting only when a metric actually crosses an established threshold, and it lets us track problems as they develop rather than having them appear out of nowhere. The more we understand our systems in terms of acceptable performance and expected scale, the more predictable our pain points become. In other words, we use data to prevent unnecessary alerts from being generated in the first place (a simplified sketch of this threshold check appears below).
  2. We’ve set up alerts so that we are immediately aware of any issues.
  3. We’ve built USE and RED dashboards and use them as input to our SLIs to see if the alert indicates a current or potential issue. These dashboards also act as a troubleshooting tool to locate factors that could contribute to a failure or incident.
  4. We use SLO dashboards to measure availability and to help determine if we need to invest in more availability or features to address the issue.
  5. Finally, we created a wide variety of custom dashboards that we use to investigate and diagnose issues if the USE/RED dashboards indicate a valid issue.
[Screenshots: USE dashboards (01 and 02) showing intake data]
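As a simplified sketch of steps 1, 2, and 4 above, the threshold check and the error-budget arithmetic behind an SLO look roughly like this. The SLI names, threshold values, and availability target are hypothetical, not InfluxData’s actual numbers:

    # Simplified SLI threshold check and SLO error-budget math.
    # All names and numbers are hypothetical placeholders.
    SLI_THRESHOLDS = {
        "request_latency_p99_ms": 250.0,  # alert if exceeded
        "error_rate_pct": 1.0,            # alert if exceeded
    }

    def breached(sli_values):
        """Return the SLIs whose current value crosses its threshold."""
        return [name for name, limit in SLI_THRESHOLDS.items()
                if sli_values.get(name, 0.0) > limit]

    # Error budget for a hypothetical 99.9% monthly availability SLO.
    slo_target = 0.999
    minutes_per_month = 30 * 24 * 60
    error_budget_minutes = (1 - slo_target) * minutes_per_month  # ~43.2 min

    current = {"request_latency_p99_ms": 310.0, "error_rate_pct": 0.4}
    for name in breached(current):
        print(f"ALERT: {name} over threshold")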

The goal is to identify and resolve issues early before they affect users or system performance. If nothing else, hopefully our system illustrates how to think about performance issues and the interconnectedness of different monitoring methods.

Tim Yocum is director of operations at InfluxData, where he is responsible for site reliability engineering and operations for InfluxData’s multi-cloud infrastructure. He has held leadership roles at startups and enterprises for the past 20 years, with an emphasis on the human factors behind SRE team excellence.

The New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld’s readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2021 IDG Communications, Inc.