Slack recently described how it implemented the Circuit Breaker pattern to improve the availability of its CI/CD pipeline. Before this project, peak request volumes in internal tooling caused cascading failures in dependent systems. Since the project’s completion, engineers have seen increased service availability and fewer poor developer experiences, such as flakiness from failing services.
Frank Chen, a senior staff software engineer at Slack, highlights the impact that the project had on Slack developers:
Since […] infrastructure and dependent service breakers were introduced, we have reduced the surface area for cascading failures (by test deferral) and smoothed out the throughput for test executions (by load shedding).
The result has been a significantly improved developer experience—zero cascading failure incidents in internal tooling over the last two years—and a significantly reduced load for critical services that benefited the CI/CD user experience.
Another benefit of implementing visible circuit breakers is that engineers now get feedback from the CI/CD orchestrator via Slack when it defers their tests until the system recovers. Before circuit breakers, these tests would flake or fail due to some overloaded downstream system. “Deferring tests overall led to fewer flakes and test executions that were less relevant.”
According to Chen, internal tooling and services struggled to keep up with 10% month-over-month growth in CI/CD requests. “Development across Slack slowed due to these failures, leaving internal tooling and infrastructure engineers scrambling to restore service.” Short-term solutions such as vertical scaling to the largest VM available and horizontal scaling for specific services only worked until Slack reached a new peak load in other internal services.
Slack uses an internal platform named Checkpoint to orchestrate code builds, tests, deploys and releases. Checkpoint works by receiving webhook calls from GitHub on new commits and orchestrating Jenkins test executors across available test environments.
The Circuit Breaker is a design pattern, borrowed conceptually from electrical engineering, that is meant to prevent catastrophic cascading failures. Chen explains the reasoning for implementing Circuit Breakers at the orchestration level:
We had a hypothesis that circuit breakers could minimize cascading failures and provide high leverage for programmatic metric queries for multiple services, instead of individual client- or service-based approaches. Unlike traditional circuit breakers in individual services, circuit breakers at the orchestration-level system could regulate the interface of requests between systems.
By implementing circuit breakers, Slack could defer test jobs when Checkpoint and Jenkins queues reach a certain threshold or when all Slack test environments are busy. In addition, it could shed executions of lesser importance, such as test executions for older commits on a branch or test retries for any suite with consistent failures.
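The defer-or-shed decision described above can be illustrated with a minimal Python sketch. This is not Slack's or Checkpoint's code. The names (`SystemMetrics`, `JobRequest`, `decide`), the queue threshold value, and the choice to shed low-priority work only while the breaker is open are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RUN = "run"      # schedule the test job now
    DEFER = "defer"  # hold the job until the system recovers
    SHED = "shed"    # drop the job; it is unlikely to be useful


@dataclass
class SystemMetrics:
    queue_depth: int         # pending jobs across orchestrator and executor queues
    busy_environments: int
    total_environments: int


@dataclass
class JobRequest:
    branch: str
    commit_seq: int          # position of the commit on its branch (higher = newer)
    is_retry: bool = False
    suite_consistently_failing: bool = False


# Hypothetical threshold; the article only says "a certain threshold".
QUEUE_DEPTH_LIMIT = 500


def breaker_open(metrics: SystemMetrics) -> bool:
    """The breaker opens when queues are too deep or no environment is free."""
    queues_full = metrics.queue_depth >= QUEUE_DEPTH_LIMIT
    all_busy = metrics.busy_environments >= metrics.total_environments
    return queues_full or all_busy


def is_low_priority(req: JobRequest, latest_seq_by_branch: dict) -> bool:
    """Older commits on a branch and retries of consistently failing suites."""
    latest = latest_seq_by_branch.get(req.branch, req.commit_seq)
    stale_commit = req.commit_seq < latest
    futile_retry = req.is_retry and req.suite_consistently_failing
    return stale_commit or futile_retry


def decide(req: JobRequest, metrics: SystemMetrics, latest_seq_by_branch: dict) -> Decision:
    # Assumption: while the system is healthy, every job runs.
    if not breaker_open(metrics):
        return Decision.RUN
    # Under load, low-priority work is shed and everything else is deferred.
    if is_low_priority(req, latest_seq_by_branch):
        return Decision.SHED
    return Decision.DEFER
```

For example, with a queue depth of 600 against the assumed limit of 500, a job for the newest commit on a branch would be deferred, while a job for an older commit on the same branch would be shed. A real orchestrator would obtain the metrics from programmatic queries against each dependent service rather than from a hand-built object.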