We’re in the middle of a run of concurrency posts on this blog. A quick recap:
- We have to start with the difference between concurrency and parallelism. Everything’s concurrent, but not everything is parallel. I’m focusing on concurrency models, not parallel programming models.
- Next, we looked at how to go about embracing concurrency (event loops) and how to program asynchronously (futures/promises).
- Last week, we looked at async/await, a language feature helping us construct those futures/promises.
This week, I’d like to get into a few different approaches to concurrency, and why I emphasize asynchronous event loops and promises.
The synchronous threaded model
Traditionally, we’re exposed to a lot of synchronous APIs from our operating systems. These APIs are the root of all evil here. If we’re not terribly interested in I/O, and we’re really focused on compute (i.e. parallelism more than concurrency), then the threaded model is fine. But once we start to depend on those synchronous APIs, we start to suffer problems.
The most interesting case for the synchronous model is mixing compute and I/O. Once given a workload, the only tunable parameter we have is thread count. In the figure above, I try to sketch a waste curve, showing the costs of thread count choices. There is no ideal choice here. We can try to tune for minimal waste, but the synchronous threaded model simply imposes waste, no matter what.
But this model has further warts. The heavier the I/O workload, the more it starts to break down. Part of the trouble is that submitting more synchronous I/O requests requires more threads. And more threads means more memory use. The classic “C10K” problem is rooted in this conundrum: if each additional concurrent connection requires another thread (or even process) and each takes 10MB, then 10K concurrent connections start at 100GB of memory overhead!
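To make this concrete, here’s a minimal sketch of the thread-per-request model. The handler and its workload are hypothetical; `time.sleep` stands in for a blocking read or write on a socket. The point is that every in-flight request ties up a whole thread, stack and all, even while it does nothing but wait.

```python
# Sketch of the synchronous thread-per-request model.
# time.sleep stands in for blocking I/O (a socket read, say).
import threading
import time

def handle_request(request_id, results):
    time.sleep(0.1)  # blocking I/O: this thread is parked, its stack wasted
    results.append(request_id)

results = []
threads = [
    threading.Thread(target=handle_request, args=(i, results))
    for i in range(100)
]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# 100 concurrent requests required 100 threads. Each idle thread still
# holds a full stack -- the memory cost behind C10K.
print(len(results))
```

The requests overlap, so wall-clock time stays near 0.1s, but only because we paid for one thread per connection.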
The asynchronous event loop model
The immediate, dramatic benefit of embracing event loops is that this inherent trade-off goes away. We can actually hit the “ideal” point in that optimization curve above, and we don’t even have to work at it; there’s no tuning necessary.
This is possible because the event loop will aggressively do as much compute work as possible, dispatching as many I/O requests as it gets to. Multiple concurrent paused requests generally take less memory than whole threads, so this architecture is generally able to scale to an absurd degree. (The event loop model has obsoleted C10K as a real challenge, and some have proposed C10M as the replacement challenge. Three orders of magnitude difference.)
The tuning that happens with this model isn’t so much to achieve higher performance, but to limit memory usage, if that proves to be the limiting factor. It is better to serve requests less aggressively than to crash from lack of memory. Better still to install more RAM in your servers, though.
One remarkable part of this new concurrency model is that it frequently manifested itself as a single-threaded model. Applications like Nginx and Node.js are largely single-threaded (at least in terms of compute). The task of fully utilizing the machine’s CPU cores was simply delegated to running multiple independent application processes, one per core. This is made easier by Linux kernel features that allow multiple processes to listen on the same port, easily balancing incoming request workload between the application instances.
Can we abstract away from asynchrony?
One of the classic alternative options to embracing event loops is to hide that event loop behind an abstraction layer. The idea is for the programming language runtime to take care of pumping the event loop, and instead let us continue to believe we’re executing in the synchronous single-threaded model, augmented by “green threads.” (My preferred name. These are Erlang processes, or goroutines.)
Different approaches to green threads nevertheless all function in generally the same way: the programmer believes they’re writing synchronous sequential code using cheap “threads,” but secretly the language runtime does not actually block when a thread blocks. Much as with an await, the thread being executed gets put on hold, and the runtime switches to a different thread, selected from the event loop. In many ways, you can imagine green threads as asking: “what if every function was async, and every await was just implicit?”
There are a few drawbacks to this programming model:
- It requires a heavier runtime. Rust ruled it out for this reason.
- It easily lends itself to erroneous conflation with real threads. We all know what’s supposed to happen when you block, sure, but what happens when you enter a tight numerical loop for a few seconds? Green threads are almost always co-operatively scheduled, which means this loop will in effect block execution of other tasks on this thread (which might be the only thread). This might be fine, or it might be a serious problem, or even a serious problem that’s subtly hidden until you reach a critical mass of “blocked” threads. It all depends on whether you’re using green threads for concurrency (ok) or believe they’re giving you parallelism (maybe uh oh).
- Green threads are a leaky abstraction. Lots of system calls block; many things like file I/O only present synchronous APIs, after all. But in practice, we get blocking behavior from even more sources. FFI calls, for instance, might internally make blocking system calls, so even if we cleverly wrap language-internal system calls, we can still run aground.
- There’s no indication of blocking or non-blocking behavior. A function call is just a function call. It might hard-block in FFI, it might soft-block co-operatively, or it might just run to completion normally. To find out which… run it?
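This hazard is easy to demonstrate in any cooperatively scheduled system. Here’s a sketch using `asyncio`, where `time.sleep` plays the role of a hard-blocking call (an FFI call, a tight compute loop): while it runs, every other task on the loop is frozen.

```python
# Sketch of the "hidden blocking" problem in a cooperatively scheduled loop.
# time.sleep hard-blocks the whole thread; asyncio.sleep yields to the loop.
import asyncio
import time

ticks = []

async def ticker():
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)  # cooperative: other tasks run during this

async def bad_neighbor():
    time.sleep(0.2)  # hard-blocks: the ticker cannot run during this
    await asyncio.sleep(0)

async def main():
    await asyncio.gather(ticker(), bad_neighbor())

asyncio.run(main())

# Ticks should be ~10ms apart, but one gap absorbs the whole 200ms block.
gaps = [b - a for a, b in zip(ticks, ticks[1:])]
print(max(gaps) > 0.15)
```

Nothing in the signature of `bad_neighbor` warns you it will stall the loop, which is exactly the “run it to find out” problem.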
I believe this combination of drawbacks is what motivated the much more recent development of async/await. (Promises, of course, go back a long time too, although not entirely in their modern form.) In contrast to the above, here is what Rust appears to be getting (once the design process is finalized) with its version of explicit asynchronous programming:
- A minimal and low-overhead runtime. In fact, likely even multiple runtime implementations. I haven’t investigated in depth yet, but I believe that while the initial design efforts have focused on single-threaded concurrency, supporting parallel work-stealing thread pools may be a library-level change. The async programming model isn’t tightly tied to the implementation technique.
- Since it’s clearly about asynchronous programming, and not easily confused with parallelism, you know what happens if you get into long-running compute loops. Nothing is hidden.
- If a function is going to do I/O, then if it returns a result, it’s synchronous, and if it returns a promise/future, it’s asynchronous. You can more easily understand your program just by reading it, instead of running it and investigating the result.
- When faced with a synchronous API, it’s possible to build mechanisms (yourself, or simply using a library) to use a sacrificial thread pool to “async-ize” the call.
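That last trick can be sketched in a few lines with `asyncio`, whose `to_thread` helper runs a blocking function on a worker thread from a pool and hands back an awaitable. The blocking function here is hypothetical; `time.sleep` stands in for a synchronous-only system call.

```python
# Sketch of "async-izing" a blocking call via a sacrificial thread pool.
import asyncio
import time

def blocking_read():
    time.sleep(0.1)  # stand-in for a synchronous-only API call
    return "data"

async def main():
    # Both run concurrently: the blocking call on a pool thread,
    # the asyncio.sleep on the event loop itself.
    blocking_result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_read),
        asyncio.sleep(0.1),
    )
    return blocking_result

result = asyncio.run(main())
print(result)
```

The event loop never blocks; only the sacrificial pool thread does, and the result comes back as an ordinary awaitable.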
In the end, I think the async/await/promise approach gets us the same benefits as green threads, with a different mix of drawbacks that, on balance, comes out in favor of the explicit asynchronous model. Pedagogically, I’m also in favor of actually exposing programmers to asynchrony. The confusion that arises from a wholly synchronous world-view is real. Especially since everything is actually asynchronous.
Coming up next week are my thoughts on “function coloring,” and then I might be done with concurrency for the moment. If you want my thoughts on a specific topic here, ask away! Patreon or Twitter or messenger pigeon, whatever floats your boat.