The end-to-end principle in distributed systems

The theme for the last couple of weeks has been basic design considerations for distributed systems. The name of the game is reliability in the face of failure. Last week the topic was retrying failed requests and idempotence, and I made an odd claim: usually, we don’t retry; instead, we just propagate small failures into larger failures. Isn’t that a strange thing to do, considering we’re trying to be resilient to failure?

Not at all! Distributed systems are complicated, and our goal is simplicity. We can’t make a system reliable unless we can wrap our heads around its behavior.

The end-to-end principle in networking

The end-to-end principle is originally all about designing networks.

The general idea is, to be resilient to failures, the end-points of the network need to handle failure anyway. So… what’s the point of adding resiliency features to every node in the intermediate network? All you’re doing is complicating all the intermediate parts, to accomplish the same thing that the end points could have just handled themselves anyway. And building more complicated intermediate parts can make them more failure-prone!

This principle not only makes packet-switched networks more attractive, it drives us towards a particularly “dumb network” design space. We can get more reliable systems by building simpler, dumber intermediate parts, and having the ends of the system take responsibility for creating reliability from unreliable parts.

Bufferbloat was a nasty illness the internet began suffering from about a decade ago. (Don’t worry, it’s largely cured now.) The problem had a simple cause: the intermediate nodes of the internet, the routers, had casually begun to develop very large buffers. Technology had advanced, RAM was cheap, why not try to make them more reliable?

The end result was a less reliable, higher-latency, slower internet. The trouble? Routers could absorb higher rates of load before they actually started to drop packets (because the buffers were bigger, and so the queues could grow longer), and dropped packets were the signal to the endpoints that they should all back off a bit. So routers didn’t tell the ends to back off quickly enough, and you quickly got the equivalent of a freeway traffic jam.

One path back to a higher reliability, low latency, higher bandwidth internet? Program routers to randomly and deliberately drop packets when their buffers start to grow. More on this concept in a future post!
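To make the idea concrete, here is a minimal sketch of probabilistic early dropping, in the spirit of what is usually called random early detection. It is deliberately simplified: it looks at the instantaneous buffer length rather than a smoothed average, and the names `drop_probability`, `should_drop`, and the threshold parameters are made up for illustration, not taken from any real router.

```python
import random

def drop_probability(queue_len: int, min_threshold: int, max_threshold: int,
                     max_probability: float = 0.1) -> float:
    """Chance of dropping an arriving packet, given how full the buffer is."""
    if queue_len < min_threshold:
        return 0.0          # buffer is nearly empty: never drop
    if queue_len >= max_threshold:
        return 1.0          # buffer is effectively full: always drop
    # In between, the drop chance ramps up linearly as the buffer grows.
    fraction = (queue_len - min_threshold) / (max_threshold - min_threshold)
    return fraction * max_probability

def should_drop(queue_len: int, min_threshold: int, max_threshold: int,
                rng=random.random) -> bool:
    """Decide the fate of one arriving packet (rng is injectable for testing)."""
    return rng() < drop_probability(queue_len, min_threshold, max_threshold)
```

The point of the ramp is that the endpoints start hearing “back off” while the buffer is still short, instead of only after it has filled up.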

But we’re not building networks…

The original articulation of the end-to-end principle is clearly about networking, but what does that tell us about distributed system design? After all, we’re usually already committed to HTTP, never mind TCP/IP, so what does this mean for us?

As general principles, two things:

Keep system behavior simple! The responsibility for retrying on failure should only fall to the originator (the ends!) of the request. If an intermediate request fails, the system should generally just propagate failure outwards, not initiate a retry by itself.
Don’t accidentally start implementing a database! It sounds like absurdly easy-to-follow advice, but this is actually the most common violation of the end-to-end principle in distributed system design. If a client requests a write (of any form), and success is reported back before the action has actually been committed to an appropriately reliable database, you’re gonna regret it. We want simplicity: we want to use a database, we don’t want our system as a whole to start being a database.

Only the client retries

Here’s where we get to my suggestion last week: a sub-request failure should just propagate failure. So most of the time when we encounter an error, we just error in turn. Only the client, upon being notified of failure (or upon timing out), should actually attempt a re-try.
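As a rough sketch of that shape, assuming a toy in-process “server” and “database” (every name here, including `UpstreamError`, `server_handle`, and `client_request`, is hypothetical): the middle has no retry loop at all, and the only retry logic lives in the client.

```python
import time

class UpstreamError(Exception):
    """Raised when a request fails anywhere along the chain."""

def database_write(record, db):
    # Stand-in for a real database call; fails when the store is unavailable.
    if not db["healthy"]:
        raise UpstreamError("database unavailable")
    db["rows"].append(record)

def server_handle(record, db):
    # Intermediate node: no retry loop and no swallowed errors.
    # A failed sub-request simply propagates as a failed request.
    database_write(record, db)
    return "ok"

def client_request(record, db, attempts=3, base_delay=0.01, sleep=time.sleep):
    # Only the originator retries, backing off exponentially between attempts.
    for attempt in range(attempts):
        try:
            return server_handle(record, db)
        except UpstreamError:
            if attempt == attempts - 1:
                raise  # out of attempts: the client reports the failure
            sleep(base_delay * 2 ** attempt)
```

Retrying from the client like this is only safe if the request is idempotent, which is why that was last week’s topic.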

Breaking this rule just complicates the system… for no benefit. The system as a whole is not likely to become noticeably more reliable if a retry happens in the middle than if the client is notified of failure and attempts a retry from the end.

But two nasty things can happen:

The system and its behavior can become more complicated. Introducing intermediate retry logic can, in the aggregate (especially if not carefully and empirically measured), cause the system to become less reliable.
The intermediate nodes of the system don’t have all the context. The client at the end is reliably executing its own code in making this request happen. It’s in a more informed position to take care of implementing backpressure (which will be the subject of a future post).

This is one way in which spurious retries can reduce reliability. Retries happening too often, because they’re uncoordinated, can create load amplification that increases the chance of the system falling over or staying fallen down. If the database is overloaded, the last thing you want is retry logic that sends the database even more requests in response.
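A toy simulation shows how quickly this multiplies. It assumes the worst case, where the database fails every call and each layer retries independently; `call_through` and the `counter` dictionary are hypothetical names for this sketch.

```python
def call_through(layers_attempts, counter):
    """Send one request through nested layers while the database is down.

    layers_attempts[i] is how many attempts layer i makes (1 means no retry);
    the first entry is the client. counter["db"] records database load.
    """
    if not layers_attempts:
        counter["db"] += 1   # the overloaded database sees one more request
        return False         # and fails it
    for _ in range(layers_attempts[0]):
        if call_through(layers_attempts[1:], counter):
            return True
    return False

# Client, server, and a data-access layer each make up to 3 attempts.
everyone_retries = {"db": 0}
call_through([3, 3, 3], everyone_retries)

# Only the client retries; the middle layers just propagate failure.
client_only = {"db": 0}
call_through([3, 1, 1], client_only)
```

Under these assumptions, one client request turns into 27 database calls when every layer retries, against 3 when only the client does. The amplification is the product of the attempts at each layer, so it grows exponentially with depth.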

No accidental databases

A sequence of events:

The client sends a request to the server.

The server replies, claiming success.

The server *then* attempts to write to the database in service of that request.

The write fails or times out, and the server happens to fall over.

The result? Claimed success, but the attempted action simply disappears into the void.

In the sequence above, we have a typical “accidental database” scenario. The trouble is that the system (consisting of a server and its backing database, together) has effectively become a database on its own.

The successful response by the server essentially placed database-like responsibilities on that system. Here is a situation where we might start by thinking “oh, well, if the write to the database fails, now my server has to retry it because I already told the client we succeeded!” But this retry attempt is a false sense of security: what if the server itself now fails? To handle that, we start needing persistence and replication and… wait a second.

The actual root cause of the problem here was claiming success without hitting the database first. This system should be end-to-end instead: the request must go from client to database, and the successful response should come from database back to the client. We should not have the middle of the system start taking responsibilities like this.
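Here is a small sketch of the end-to-end version, using Python’s built-in `sqlite3` module as a stand-in for “an appropriately reliable database.” The `orders` table, the `handle_write` function, and the status codes are assumptions made up for this example.

```python
import sqlite3

def handle_write(conn: sqlite3.Connection, order_id: str) -> dict:
    """Acknowledge success only after the database has committed the write."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if the block raises.
        with conn:
            conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
    except sqlite3.Error as exc:
        # No local retry and no "we'll write it later": just report failure
        # and let the client decide whether to try again.
        return {"status": 503, "error": str(exc)}
    return {"status": 200}
```

The success response is produced strictly after the commit, so the server never holds a promise that it alone is responsible for keeping.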

So retries, outside of the client, are potentially a “smell.” Why do we need to retry here? Why can’t we just fail, and let the client retry? If there’s a reason, does that reason amount to “oops, we started implementing a database and didn’t realize it?”

Queues: it’s just always queues, isn’t it?

Again, we might think this seems obvious. Telling a client the request succeeded before we’ve attempted it does seem like an odd thing to do, right?

But the most common case of this works like so:

The client sends a request.
The server puts that request into a queue, and responds with success to the client.
The queue falls over, and was never, ever properly treated as a database. (“It’s just a queue!”)
The request disappears into the ether, and tech support reminds the customer that computers are just haunted and do that sometimes.

Developers seem to routinely just stick queues in the system, and then don’t take them seriously. Queues should be treated as databases by default. A typical queue:

Needs persistence. (The power goes out, how will this request end up serviced later?)
Needs replication. (This machine unrecoverably dies, how will this request end up serviced?)
Takes end-to-end responsibilities. (This request gets started, but the downstream system then fails. Who will retry it? The queue is now also effectively a client, but worse because it can’t fail! So we need to track downstream status, in a database, to potentially allow another server to initiate retries, after this one fails.)

And more! (Queues are totally going to show up in another future post. Here’s a spoiler: every queue should have a maximum size.)
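For a sense of what “treat the queue as a database” can look like at its very simplest, here is a sketch of a bounded queue backed by `sqlite3`. It only addresses persistence and the maximum size; replication and tracking of downstream retries are out of scope. The `DurableQueue` class and its `jobs` table are hypothetical.

```python
import sqlite3

class DurableQueue:
    """A bounded queue whose enqueue only succeeds once the job is committed."""

    def __init__(self, path: str, max_size: int):
        self.conn = sqlite3.connect(path)
        self.max_size = max_size
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS jobs ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                "payload TEXT NOT NULL, "
                "done INTEGER NOT NULL DEFAULT 0)")

    def enqueue(self, payload: str) -> bool:
        """Return True only after the job is committed; False if the queue is full."""
        with self.conn:
            (pending,) = self.conn.execute(
                "SELECT COUNT(*) FROM jobs WHERE done = 0").fetchone()
            if pending >= self.max_size:
                return False  # full: the caller should propagate failure
            self.conn.execute(
                "INSERT INTO jobs (payload) VALUES (?)", (payload,))
        return True

    def next_pending(self):
        """Oldest unfinished job as (id, payload), or None if there is none."""
        return self.conn.execute(
            "SELECT id, payload FROM jobs WHERE done = 0 "
            "ORDER BY id LIMIT 1").fetchone()

    def mark_done(self, job_id: int) -> None:
        """Record completion only after downstream work has really finished."""
        with self.conn:
            self.conn.execute(
                "UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
```

A server would report success to the client only when `enqueue` returns `True`, and would fail the request when the queue is full. A job stays pending until `mark_done` is called, so a worker that crashes halfway leaves the job visible for someone else to pick up.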

Are these hard rules?

No! This is an introduction to distributed systems that’s all about keeping you on a happy, simple path. The rules in this thread can easily become inapplicable, but usually only because the complexity of the system has really started to demand it.

You might be implementing a database! It’s not an accident, then.
For scalability reasons, a full-on database-based queue might not be able to support the load it’s under, and/or there might be application-specific reasons why occasionally failing to honor a request is okay.

But distributed systems get complicated fast, and one of the easiest ways to avoid the pain is to keep as close to the end-to-end principle as possible. Clients on one end, real databases on the other end, and don’t let responsibilities leak into the middle.

End notes

Kyle Kingsbury’s notes on distributed systems are a hoot. My favorite advice on queues:

Queues do not improve end-to-end latency

Queues do not improve mean throughput

Queues do not provide total event ordering

Queues can offer at-most-once or at-least-once delivery

Queues do improve burst throughput

Distributed queues also improve fault tolerance (if they don’t lose data)

Every queue is a place for things to go horribly, horribly wrong

Everybody just loves to accidentally start implementing a database, and it’s always just queues, isn’t it?

(Added 2019/3/4) End-to-end arguments in system design (1984. PDF available here) is the classic paper on this subject.