Back in the fall, I did a short series on concurrency. The idea was relatively simple: figure out how to think clearly about the topic, then look into handling the “common case” instead of the general case. These subjects can be infinitely deep, and this is a general design book, not a specialist one. That’s why I ended up focusing on async/await, as it’s especially well-suited to the kind of “internally stateless request-response” (i.e. HTTP) servers we write today.

We end up with a sort of progression of complexity here:

  1. Sequential programming is a useful fiction. No program is ever really sequential, but we can usefully pretend some bits of code are, such as anything that doesn’t do I/O.

  2. Concurrent programming is concerned with asynchronous, non-deterministic events. Any kind of I/O automatically means we must handle such things, somehow, and so every program is a concurrent program, however secretly. But some kinds of programming require us to confront reality more directly, especially if we want to get good performance.

  3. Parallel programs are also inherently concurrent, although some parallel programming models allow us to ignore the concurrency aspects in some situations, which is fantastic (when it works). (I haven’t really talked about designing for parallelism, and I’m trying to decide if there is a simple “common case” to look at. Perhaps I’ll draft a post about parallel map and my running “use data” theme that I have going…)

  4. Distributed programs are parallel, concurrent programs that must also cope with failure. Machines, storage, networks, and software can fall over completely or become corrupt at any point, and our distributed programs have to be able to handle these failures gracefully.

The now-classic simple model for distributed programming is map/reduce. I think everyone is probably sick of hearing about map/reduce (it got hyped way beyond its actual value), so I won’t go too far into the details, but map/reduce followed a particularly decent line of reasoning:

  1. Moving large amounts of data around is more difficult than moving a little code around, so let’s send the code to the data instead of the other way around.

  2. The larger the system of machines, the more frequently it will experience failures, so let’s make sure the model is resilient to failures.

A map/reduce system sends the code (the mapper in particular) to the machines that hold the data of interest, where that data is processed locally. The (hopefully much smaller) results are then sent across the network to the appropriate reducer machines, where some aggregation can be done.

Even though the map/reduce system as a whole has a lot of state, the programming model is stateless. In fact, the beauty of the design is that the programmer can to some extent get away with pretending they’re writing a simple sequential program. A mapper or reducer can be run multiple times with no ill effect, as it is in essence just a pure function. As a result, if any machine fails or goes out of contact with the rest of the system, its part of the job can simply be restarted elsewhere. (It’s simply assumed that the data you want to process is already stored redundantly.)
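
To make that statelessness concrete, here’s a minimal sketch in Python of the contract the model asks of the programmer. The word-count mapper and reducer, and the toy single-machine driver, are just stand-ins I made up; a real system adds partitioning, shuffling, and fault handling around functions shaped like these.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# A mapper is a pure function: document -> stream of (key, value) pairs.
# Re-running it on the same input always yields the same output, which is
# what lets the framework restart it anywhere after a failure.
def mapper(document: str) -> Iterator[tuple[str, int]]:
    for word in document.split():
        yield (word, 1)

# A reducer is likewise pure: (key, all values for that key) -> result.
def reducer(key: str, values: Iterable[int]) -> tuple[str, int]:
    return (key, sum(values))

# A toy, single-machine "framework" showing how the pieces compose. The
# real system shuffles the grouped pairs across the network to the reduce
# machines; the programmer never sees that part.
def map_reduce(documents: list[str]) -> dict[str, int]:
    groups: dict[str, list[int]] = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(map_reduce(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```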

Map/reduce is an incredibly restrictive programming model, but it pretty much had to be: in exchange for those restrictions, it’s easy to use and automatically handles all the complexity of a distributed system.

Aside Q: What’s the difference between a data center and a supercomputer?

Generally, two things:

  1. The network topology is usually different, with many more connections. Supercomputers often connect nodes together in a torus shape, creating an arrangement of nodes with direct connections to each other that’s well suited to the communication patterns of computations over spaces, e.g. tensor fields. Data centers generally connect nodes to switches, and then arrange switches in comparatively cost-efficient ways.
  2. Supercomputers use more expensive and reliable hardware. Data centers tend to run distributed systems, while supercomputers run parallel programs. That is, the programs for supercomputers often aren’t written to be resilient to failure. (A typical concession to the possibility of hardware failure on a supercomputer is simply to checkpoint every so often: if a failure happens, the program crashes, the machine is repaired, and the program can be manually restarted from the checkpoint, losing only a bounded amount of work. A rough sketch of this pattern follows the list.)
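
As a rough illustration of that checkpoint-and-restart pattern, here’s a minimal sketch in Python. The file name, the “step” counter standing in for program state, and the checkpoint interval are all made up for the example; real codes save far larger state, but the shape is the same.

```python
import json
import os

CHECKPOINT = "state.json"  # hypothetical checkpoint file name

def compute_one_step(step: int) -> None:
    pass  # placeholder for the actual computation

def load_checkpoint() -> int:
    # Resume from the last saved step, or start from scratch.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    # Write to a temporary file and rename it, so a crash mid-write
    # can't corrupt the previous checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CHECKPOINT)

def run(total_steps: int, checkpoint_every: int = 100) -> None:
    step = load_checkpoint()
    while step < total_steps:
        compute_one_step(step)
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(step)
    # If the machine dies, rerunning run() repeats at most
    # checkpoint_every steps of work.
```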

On the fundamental nature of time to distributed systems

My favorite result in distributed systems is sometimes called the FLP theorem. It’s a pretty straightforward result: under some assumptions that we’ll get to, it is impossible to guarantee that a distributed system will reach consensus about anything if even a single process might fail.

Impossibility proofs are fantastic, but they’re usually not to be interpreted as something actually being impossible. Instead, we should look carefully at the assumptions. The FLP theorem rests on two critical assumptions:

  1. We have no clocks available (the model is fully asynchronous, with no bounds on how long messages can take).
  2. Later work also showed there was an implicit assumption that we cannot generate random numbers. (Which is to say, a probabilistic consensus algorithm does exist; let’s not get into complexity theory…)

So the way we should interpret this result instead is simply that clocks (and random numbers) are essential features of distributed systems.

Gödel’s incompleteness theorems are another good example of examining the assumptions of an impossibility result. When does the proof apply? When is a logic necessarily incomplete? It continues to perplex me that people give bad answers like “a sufficiently powerful logic with arithmetic.” What’s “powerful?” Why arithmetic?

The simplest presentation of Gödel’s incompleteness results comes largely from Kleene: the proof applies to any logic capable of encoding a Turing machine. It turns out it’s just a restatement of the halting problem, using the terminology of “logics” instead of “computers.”

We have to use timeouts when making requests; there’s basically no other option. (We can add optimizations that attempt to discover failures more quickly, but we must always have the backstop of a timeout.) When a request has been sent and no response has been received yet, anything could be happening (a minimal sketch of this ambiguity follows the list):

  • The response is still on its way.
  • The request disappeared into the network without ever being received.
  • The remote service is gone (hardware or software failure).
  • The request was received, but the service crashed before replying (before, during, or after actually handling the request).
  • The request was handled, but the response was lost to the network.
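
Here’s that ambiguity in code: a minimal sketch using Python’s asyncio, where the host name, port, and timeout value are placeholders I made up. Every one of the cases above surfaces as the same TimeoutError, and nothing in the exception tells us which one actually happened.

```python
import asyncio

async def call_service(request: bytes) -> bytes:
    # Stand-in for a real network call (e.g. an HTTP request to another
    # service). From the caller's point of view it's just an awaitable
    # that may or may not ever complete.
    reader, writer = await asyncio.open_connection("billing.internal", 8080)
    writer.write(request)
    await writer.drain()
    response = await reader.read(-1)  # read until the server closes
    writer.close()
    await writer.wait_closed()
    return response

async def call_with_timeout(request: bytes, timeout: float = 2.0) -> bytes | None:
    try:
        return await asyncio.wait_for(call_service(request), timeout)
    except asyncio.TimeoutError:
        # All we know is that *we* gave up waiting. The request may never
        # have arrived, the service may have crashed partway through, or
        # it may have succeeded and the reply was lost. Every case above
        # looks exactly like this one.
        return None
```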

So when we send a request and it times out, we could be in one of three major situations (one common mitigation is sketched after the list):

  • The request succeeded, but we never actually heard back before the timeout expired. What do we do next? Retransmission may not be correct!
  • The request failed, and so retransmission is appropriate.
  • The service is overloaded, and retransmission could make the problem worse.
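
The usual mitigation here, which is a widely used pattern rather than anything specific to this post, is to retry with exponential backoff and jitter, and only for requests we know are idempotent. Here’s a minimal sketch building on the call_with_timeout function above; the attempt count and delays are made-up parameters.

```python
import asyncio
import random

async def call_with_retries(request: bytes, attempts: int = 3) -> bytes | None:
    # Only safe if the request is idempotent: in the "it actually
    # succeeded but the reply was lost" case, we will run it again.
    delay = 0.1
    for _ in range(attempts):
        response = await call_with_timeout(request)
        if response is not None:
            return response
        # Back off with jitter so a fleet of retrying clients doesn't
        # hammer an already-overloaded service in lockstep.
        await asyncio.sleep(delay * random.random())
        delay *= 2
    return None  # give up and surface the failure to our own caller
```

Note that backing off only limits the damage in the overloaded case; it does nothing about the ambiguity itself, which is the real design problem.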

The running theme for the next few posts on this blog will be: how do we actually go about handling this in practice? Something happened, we have to figure out how to recover, and we don’t really know what state we’re in anymore.

Tricky problem. Tricky design problem, since the solutions mostly involve making slight changes to the problem so that it becomes easy to solve.

Conclusion

I had modest goals for today’s post:

  1. Distributed systems are about resilience to failures.
  2. We can’t reliably detect failures without resorting to timeouts.
  3. Timing out leaves the system as a whole in an ambiguous state.

The goal for the next 4-ish posts is to try to address a simple audience: someone writing an internally-stateless service (i.e. HTTP) that might interact with caches, queues, databases, and other services. How should we understand this system? The goal is not to address the kind of audience that needs to build one of those databases, however.

For the audience that’s more interested, I hear the best things about these two books:

End notes

  • Time plays other critical roles in distributed systems, but those are more specific tools than general principles. The classic example here is Lamport (or vector) clocks, which allow constructing a partial ordering of events in a distributed system. We might still have to say “these events may be simultaneous or ambiguous,” but they’re still quite good at frequently giving definite before/after answers. (A minimal sketch of a Lamport clock follows this note.)
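
Here’s that sketch: a Lamport clock is just a counter that a node bumps on every local event and merges on every message it receives. This is my own toy Python illustration of the idea, not taken from any particular library.

```python
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        # Every local event advances our notion of time.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Stamp an outgoing message with our current time.
        return self.local_event()

    def receive(self, message_time: int) -> int:
        # On receipt, jump ahead of anything the sender had seen.
        self.time = max(self.time, message_time) + 1
        return self.time

# If event A happened-before event B, A's timestamp is strictly smaller.
# The converse doesn't hold: two incomparable (possibly concurrent) events
# can still get ordered-looking timestamps, which is the ambiguity vector
# clocks tighten up.
```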

  • Also interesting is altering the design of a service to incorporate time. Google’s Spanner database changes queries to specify a time they should be accurate for. This allows the node queried to reply as soon as it knows it has all updates that could possibly have occurred before that time. (And as a result, Google has immense interest in getting accurate atomic clocks to reduce query latencies, as the more accurate the clock, the less time they need to wait to be sure.)
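
To illustrate the shape of that idea (this is a very loose sketch of reading at a timestamp, not Spanner’s actual design or API; the names and numbers here are made up): a replica tracks the latest timestamp it knows it is fully caught up to, and holds a read until that point passes the requested read time.

```python
import time

class ReplicaSketch:
    def __init__(self) -> None:
        # Latest timestamp we know we've applied every update up to.
        # (A real system also advances this via heartbeats when idle.)
        self.safe_time = 0.0
        self.data: dict[str, str] = {}

    def apply_update(self, key: str, value: str, timestamp: float) -> None:
        self.data[key] = value
        self.safe_time = max(self.safe_time, timestamp)

    def read_at(self, key: str, read_timestamp: float) -> str | None:
        # Wait until we know every update that could have happened before
        # read_timestamp has been applied. The tighter the clock
        # uncertainty, the sooner safe_time catches up and the shorter
        # this wait is. (A real system would also keep versions so the
        # read reflects exactly read_timestamp; that's omitted here.)
        while self.safe_time < read_timestamp:
            time.sleep(0.001)
        return self.data.get(key)
```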