The running topic for the last few posts is the basics of distributed systems. For today’s post, let’s start with a short story.
The system was humming along just fine. The occasional failure—even pretty big network splits—got handled, and services recovered reasonably quickly. Updating the various parts of the system would go off without a hitch.
Then a transformer exploded, and the whole data center went down for a moment.
Never fear, right? The power is back, all the machines come back up, everything should recover. The occasional machine here and there with a new physical failure shouldn’t pose a problem. The system has handled such problems before.
But it’s not coming back.
After a bit of investigation, it is discovered that the databases are under extreme load. Every query arrives, waits to be processed, and times out. Almost nothing is actually getting through.
After checking on the caches, the engineers start to get a sinking feeling. The caches are empty. No query can get far enough successfully that the cache even gets populated with a single entry. The caches are taking no load off the databases, and the databases are so overloaded that nothing actually progresses anywhere. The system could stay down indefinitely.
The co-workers whose responsibility this isn’t have started playing Homestar Runner music. “The system is down! Do-do do-doo-doo!” It’s not helping.
Taking the load off an overloaded system
There are two very basic strategies for taking the load off of an overloaded system.
- Exponential back-off.
The first strategy is all you really need for a closed system. All retry mechanisms back off exponentially (retry in 5s, then 10s, then 20s…) exactly to reduce load on what might be an overloaded system.
If some part of the system becomes overloaded, then requests will either fail with errors or time out.
If we are in the scenario at the beginning of this post, where N requests per second were causing the system to fail to make any progress, we'll soon be getting N/2 requests per second, then N/4, then N/8, and so on.
At some point, load will drop to the point where queries succeed, caches get populated, more of the overall load is taken off the databases, and the system recovers.
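The retry loop described above can be sketched as follows. The function names and parameters are illustrative, not from any particular library, and a little randomized jitter (a common refinement) is added so that clients that failed together don't all retry at the same instant:

```python
import random
import time

def backoff_delay(attempt, base=5.0):
    """Delay before retry number `attempt` (0-indexed): 5s, 10s, 20s, ..."""
    return base * (2 ** attempt)

def call_with_backoff(request, max_retries=5, sleep=time.sleep):
    """Retry a zero-argument callable with exponential back-off.

    Jitter spreads retries out, so a fleet of clients recovering from a
    shared outage doesn't hammer the backend in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: propagate the failure (backpressure!)
            sleep(backoff_delay(attempt) * random.uniform(0.5, 1.5))
```

Note that the final failure is re-raised, not swallowed: that error is exactly the backpressure signal discussed later in this post.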
But this isn’t a complete picture when you have an open system—one where your load is originating with a large number of uncontrolled outside parties. Getting users to slow down their requests doesn’t help much when you’re just dealing with too many users.
That’s where circuit-breakers come in.
The fundamental problem we have here is that clients don't have enough information. There are just too many of them. (Or… maybe we just don't trust the malicious little jerks.) So we need a more centralized part of the system to “know things” on their behalf.
A circuit-breaker is exactly that bit of knowledge and policy. If an API’s overall error rate grows high enough, a circuit-breaker is meant to cut off load. The circuit breaker trips, and incoming client requests are failed with errors immediately, without attempting to transmit them on to the rest of the system.
The key thing a circuit-breaker is doing here is correlating multiple requests together. If the last thousand clients all timed out, there’s no reason for this client to expect differently. This client just doesn’t know, and so would be eagerly starting a request exactly as if the system were functioning. So the circuit-breaker protects part of a distributed system from load coming from far too many sources.
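A minimal circuit-breaker might look like the following sketch. The class shape, thresholds, and trip-on-consecutive-failures policy are assumptions chosen to keep the example short; production implementations typically track error rates over a sliding window instead:

```python
import time

class CircuitBreaker:
    """Sketch of a circuit-breaker (names and thresholds are illustrative).

    After `max_failures` consecutive failures the breaker "trips": calls
    fail immediately without touching the backend, shedding load. After
    `reset_after` seconds, one trial call is let through; if it succeeds,
    the breaker closes again.
    """
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed (traffic flows normally)

    def call(self, request):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = request()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip: start shedding load
            raise
        self.failures = 0  # success closes the breaker
        return result
```

The key design point is that the breaker fails fast *without* forwarding the request, which is precisely how it takes correlated load off the struggling backend.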
Both of these strategies rely on the same thing, however: the retrying node and the circuit-breaking node both require information about the state of the rest of the system. To get that information, we need backpressure.
Exerting backpressure is why we don’t retry except at the ends of a system. If we’re going to fail, we want to propagate that failure information to where it can be used. The most important place that failure information needs to go is to the end (typically, the client) making the original request to the system. But it equally needs to propagate through the nodes implementing circuit-breaking. Any attempt to put retrying strategies in the middle of the system begins hiding this information from the nodes where it’s needed.
As long as the system propagates backpressure, the load-reduction strategies we discussed have the information they need to work.
Backpressure is a compositional strategy. Any node of a larger system that correctly handles and exerts backpressure can be put together with other nodes that do likewise, and you end up with a whole system that handles and exerts backpressure.
(Compositional properties are great, when we can get them. Take a moment to consider idempotence—it’s not necessarily compositional. We have to think about each operation in a systemic way.)
So, besides following the end-to-end principle and not trying to hide failures in the middle of a system, how can we end up screwing up backpressure?
It’s always queues, isn’t it?
The first law of queues is: Every queue should have a maximum size. Queues must not grow unbounded.
This is the most common failure in correctly handling backpressure. An unbounded queue will just continue to accept input from producers, even though the consumer side has been completely overloaded. Eventually, the queue system will die (after all, RAM and disk are an implicit upper bound…) but in the meantime, backpressure is eaten. The queue pretends everything is fine while the system burns behind it.
Backpressure is compositional, which is great when it means that we can build a whole from its parts. But it does have a flip side: for the system to correctly handle backpressure, each part must do so. Sticking an unbounded queue in the middle messes things up for the whole system pretty effectively.
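A bounded queue makes the first law concrete: when the consumer falls behind, the producer gets an immediate “no” instead of silent absorption. This sketch uses Python's standard `queue` module; the bound and the function name are illustrative:

```python
import queue

def submit(q, item):
    """Enqueue work, or fail fast so the caller can back off or shed load.

    put_nowait raises queue.Full when the bounded queue is at capacity;
    returning False here is the backpressure signal to the producer.
    """
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # backpressure: tell the producer "not now"

work = queue.Queue(maxsize=100)  # every queue gets a maximum size
```

The producer seeing `False` can then apply exponential back-off, and a circuit-breaker in front of `submit` can trip on a run of rejections, so the information keeps flowing outward.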
The second law of queues is: Have you considered a maximum size of 0? A queue of size zero in a distributed system is pretty much a load balancer. The “queue’s” job is just directly connecting a producer with a consumer (a client with a server). (Of course, those servers have TCP accept queues… there’s always a queue involved. But those also have bounds!)
It’s surprising how often queues are used when there’s actually little reason to do so. The primary reason a queue makes sense is to switch from synchronous to asynchronous requests. There are other times to reach for one, of course, but queues are too often viewed as something helpful. They should be viewed as something potentially dangerous.
There’s nothing better than a queue for smoothing over problems today, only to make them explode bigger in the future.
- Excellent talk on queues, with a bit of simulation and queueing theory thrown in: Zach Tellman - Everything Will Flow. Pay attention to those long tails!