The RED Method: A New Strategy for Monitoring Microservices





Monitoring an application is critical to providing users with a quality product and experience. But simply collecting a bunch of application statistics doesn’t solve the real problem. What software companies need is a way to extract actionable insights from their metrics so they can quickly resolve any issues their users may face.

Enter the RED method.

RED method origin

The RED method is a monitoring method devised by Tom Wilkie based on what he learned while working at Google. RED is derived from the “Four Golden Signals,” a set of best practices developed by Google’s SRE team.

The primary rationale behind RED is that previous monitoring philosophies and methodologies such as the USE method did not fully align with the goals of software companies and modern software architectures. USE is more applicable to hardware and infrastructure, while the RED method aims to focus on what users of an application actually experience.

The aim of the RED method is to ensure that the software application functions well, especially for end users. In the modern era of microservice architectures, containers, and cloud infrastructure, hardware metrics are not nearly as important as long as your service level objectives (SLOs) are met.

RED method explained

RED stands for rate, errors, and duration. These represent the three most important metrics you want to monitor for each service in your architecture:

  • Rate – The number of requests the service processes per second.
  • Errors – The number of failed requests per second.
  • Duration – The amount of time each request takes.
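As a rough illustration of the three definitions above, the following Python sketch computes rate, errors, and duration for one service over a single time window. The `Request` record, the `red_summary` function, and the choice of nearest-rank percentiles to summarize duration are assumptions made for this example; they are not part of the RED method itself or of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape for this sketch)."""
    duration_s: float  # time taken to serve the request, in seconds
    failed: bool       # True if the request ended in an error


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_s):
    """Summarize Rate, Errors, and Duration for one window of requests."""
    if not requests:
        return {"rate": 0.0, "errors": 0.0, "p50_s": None, "p99_s": None}
    durations = sorted(r.duration_s for r in requests)
    failures = sum(1 for r in requests if r.failed)
    return {
        "rate": len(requests) / window_s,   # requests per second
        "errors": failures / window_s,      # failed requests per second
        "p50_s": percentile(durations, 50),  # typical request duration
        "p99_s": percentile(durations, 99),  # slowest-tail request duration
    }


# Ten made-up requests observed over a five-second window, two of them failed.
sample = [
    Request(duration_s=d, failed=(i < 2))
    for i, d in enumerate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
]
summary = red_summary(sample, window_s=5)
print(summary)
```

Duration is reported here as percentiles rather than an average because a handful of slow requests can hide behind a healthy-looking mean; that is a design choice of this sketch, and other summaries are possible.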

Using these three metrics, you can get a good understanding of how your services are performing. The number of requests gives you a baseline for how much traffic is going to your service. The portion of those requests that are errors lets you know if a service is functioning within your SLO. Finally, the amount of time it takes for your service to process each request gives you insight into the overall user experience of your application.

Advantages of the RED method

The first benefit of the RED method is that it reduces the cognitive load on technicians who need to determine why a service is having problems. RED abstracts the internal details of each service into the same few metrics, which can be understood consistently across the entire architecture. Not only does this mean issues can be resolved faster, but it also makes it easier to scale up an operations team, as members can now be on call for services they haven’t written themselves.

The RED abstraction makes it easy to understand what is going wrong and determine how to fix it. Even if the service is basically a black box whose internals the technician doesn’t understand, the technician can review telemetry data and determine the best action to improve the user experience. Because the same metrics are used for each service, the amount of training time and service-specific knowledge required is also reduced.

Another advantage of the RED method is that it is more in line with what users want and with the general objectives of the company. Users don’t care about your infrastructure. They don’t care about your CPU usage, your memory usage, or any other hardware stats. They do care if they get errors when they use your app, and they care if pages on your website take a long time to load. The RED method makes it very clear when a service isn’t meeting your SLO and your users are having a bad experience.

A final benefit of the RED method is that it makes it easier to automate tasks and alerts for all your services. Automating repetitive tasks is easier and safer because all services are treated the same. You can also standardize things like dashboard layouts for different services because the same three metrics are used.
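To make the point about uniform automation concrete, here is a minimal Python sketch in which a single alert rule is applied to every service in the same way. The service names, the snapshot values, the `red_alerts` function, and both thresholds are hypothetical and chosen only for illustration; a real deployment would take its thresholds from its own SLOs.

```python
# Hypothetical per-service RED snapshots; names and numbers are made up.
# "rate" and "errors" are per second, "p99_s" is 99th-percentile duration.
SNAPSHOTS = {
    "checkout": {"rate": 120.0, "errors": 0.6, "p99_s": 0.40},
    "search": {"rate": 300.0, "errors": 9.0, "p99_s": 0.25},
    "profile": {"rate": 40.0, "errors": 0.0, "p99_s": 1.80},
}


def red_alerts(snapshots, max_error_ratio=0.01, max_p99_s=1.0):
    """Apply the same two rules to every service and collect alert messages."""
    alerts = []
    for service, metrics in sorted(snapshots.items()):
        # Share of requests that failed; zero when there is no traffic.
        if metrics["rate"] > 0:
            ratio = metrics["errors"] / metrics["rate"]
        else:
            ratio = 0.0
        if ratio > max_error_ratio:
            alerts.append(
                f"{service}: error ratio {ratio:.1%} exceeds {max_error_ratio:.1%}"
            )
        if metrics["p99_s"] > max_p99_s:
            alerts.append(
                f"{service}: p99 duration {metrics['p99_s']:.2f}s exceeds {max_p99_s:.2f}s"
            )
    return alerts


alerts = red_alerts(SNAPSHOTS)
for line in alerts:
    print(line)
```

Because every service exposes the same three metrics, adding a new service to this kind of check only means adding another snapshot; the rule itself does not change.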

[Figures: RED dashboard 01 and RED dashboard 02]

Limitations of the RED method

These advantages do not mean that the RED method is perfect. The RED method is primarily designed for request-driven applications, so for use cases involving batch processing or streaming, it may not provide the insight you need.

A second drawback is that the “external” view of RED means it can be difficult to know how close a service is to failure. A slight increase in traffic may cause your response time to increase and you may not have internal application stats to determine why. Using the RED method means that your metrics can be interpreted differently depending on multiple factors, so it requires careful implementation.

The good news is that the RED method was never intended to cover all aspects of monitoring. Tom Wilkie recommends using the RED monitoring methodology in conjunction with other monitoring methods such as USE to give teams complete monitoring of their application.

Tim Yocum is director of operations at InfluxData, where he is responsible for site reliability engineering and operations for InfluxData’s multi-cloud infrastructure. He has held leadership roles at startups and enterprises for the past 20 years, emphasizing the human factor in SRE team excellence.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2021 IDG Communications, Inc.



