System boundaries and the Linux kernel

Linus Torvalds, the current maintainer-in-chief of the Linux kernel, has an (in)famous reputation for getting upset whenever developers write patches that make breaking changes to user space. This is because the system call interface is a very, very hard system boundary.

Like many platforms, Linux derives the vast majority of its value not from the platform itself, but from the infrastructure that gets built on top of it. Python is valued mostly because of its libraries. (Hence, the immense difficulty of the Python 2 to 3 transition, which broke many of them.) Vim and Emacs are valued because of their plug-ins, and the infinite variety of existing, working user configurations. Linux, likewise, is valuable because of the applications that run on it. So the boundary between the kernel and userspace is an extremely important one: breaking those applications is destroying the value of the platform.

But almost equally famously, the Linux kernel maintains no equally hard system boundaries internally. This mostly plagues vendors producing drivers that they do not open source or merge upstream, like Nvidia (and phone SoC vendors like Qualcomm). Kernel developers are free to routinely make changes that break “internal system boundaries” so long as they accompany these changes with patches that fix all in-tree users. Users of these APIs outside of the mainline kernel tree can just deal with it.

On the one hand, this seems inconsistent: surely if userspace “just working” is important to maintain, hardware “just working” via device drivers is also important to maintain? But that’s just the thing: the kernel does have a policy of maintaining those drivers… just only the ones merged into the mainline kernel. It’s only out-of-tree drivers that suffer.

The trade-off here is that by ensuring that all users of an API are available (or at least, by declaring that we only care about the available users), we regain the ability to make “breaking” changes. Design is always iterative, and design constraints can change over time. The best way to structure the kernel internals is almost certainly not the way they currently are, and even if it were, that may change in 10 years.

But by far my own favorite bit of justification for this policy comes from Greg KH:

Security issues are also very important for Linux. When a security issue is found, it is fixed in a very short amount of time. A number of times this has caused internal kernel interfaces to be reworked to prevent the security problem from occurring.

Security problems are mistakes, but even better than fixing mistakes is to make design changes that ensure such mistakes could never happen in the first place. Those kinds of design changes are virtually always breaking changes.
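To make that concrete, here is a minimal sketch in Python rather than kernel C. Every name in it (`resolve_v1`, `resolve_v2`, `CheckedName`) is hypothetical and invented for illustration; none of it comes from the kernel. The old interface relies on every caller remembering to validate its input. The new one moves validation into a type, which changes the signature and therefore breaks every existing caller.

```python
import posixpath
from dataclasses import dataclass


def resolve_v1(root: str, name: str) -> str:
    """Old interface: each caller must remember to validate `name` itself."""
    return posixpath.normpath(posixpath.join(root, name))


@dataclass(frozen=True)
class CheckedName:
    """A relative name that has been validated.

    By convention it is only built through `parse`. Python does not enforce
    that, but a static type checker will flag a plain `str` passed to
    `resolve_v2`.
    """

    value: str

    @staticmethod
    def parse(name: str) -> "CheckedName":
        normalized = posixpath.normpath(name)
        escapes = (
            posixpath.isabs(normalized)
            or normalized == ".."
            or normalized.startswith("../")
        )
        if escapes:
            raise ValueError(f"name escapes root: {name!r}")
        return CheckedName(normalized)


def resolve_v2(root: str, name: CheckedName) -> str:
    """New interface: the parameter type changed, so old callers must be updated."""
    return posixpath.join(root, name.value)


# The old API happily walks out of the root if a caller forgets the check.
print(resolve_v1("/srv/data", "../../etc/passwd"))

# The new API only accepts names that already went through validation.
print(resolve_v2("/srv/data", CheckedName.parse("a/../b")))
```

If all the callers live in the same tree, switching from `resolve_v1` to `resolve_v2` is one sweeping patch. If they live elsewhere, it is a breaking change you may never be allowed to make.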

Beware extensibility and plug-ins

One of the observations I made almost exactly a year ago is that plug-in systems often cause designs to stagnate. Part of this is simply that plug-in systems create new internal system boundaries. But that’s compounded by the “spidering” of those system boundaries out through the public dependencies of the modules in the boundaries.

And for object-oriented designs (not to be confused with merely object types in contrast to data or ADTs), the problem can be exacerbated even more. Implementation inheritance introduces all the problems of open recursion, unexpected variance, and fragile base classes, and the end result is that what looks like a fully-encapsulated change can have visible effects outside that class. This can create de facto system boundaries, and obscure where they even are!
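A small sketch of the fragile base class problem, in Python. The class names are hypothetical: `CollectionV1` and `CollectionV2` stand in for two releases of the same library class, and `make_counting` plays the role of a plug-in author who subclasses it. The only difference between the two releases is how `add_all` is implemented internally, which looks like a fully encapsulated change.

```python
class CollectionV1:
    """Release 1 of a hypothetical library class: add_all is built on add."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def add_all(self, items):
        for item in items:
            self.add(item)  # open recursion: dispatches to any subclass override


class CollectionV2:
    """Release 2: same public methods, but add_all no longer calls add."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def add_all(self, items):
        self.items.extend(items)  # an "internal" optimization


def make_counting(base):
    """Plug-in code: a subclass that counts additions by overriding add."""

    class Counting(base):
        def __init__(self):
            super().__init__()
            self.added = 0

        def add(self, item):
            self.added += 1
            super().add(item)

    return Counting


def count_after_add_all(base):
    collection = make_counting(base)()
    collection.add_all(["a", "b", "c"])
    return collection.added, len(collection.items)


print(count_after_add_all(CollectionV1))  # the counter sees all three items
print(count_after_add_all(CollectionV2))  # the counter silently sees none
```

Nothing in the public interface changed between the two releases, yet the subclass broke. Whether `add_all` calls `add` has quietly become part of the boundary, and nobody wrote that down anywhere.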

The Linux kernel developers are trying to avoid the design stagnation that comes from such widespread hard system boundaries. To even support plug-ins for just one kind of driver, such as graphics drivers, is basically hopeless. To create a graphics driver, you need to expose a device file to userspace. You need to talk to the PCIe subsystem to get to the device at all, and you’ve got to configure IOMMUs, and get DMA-able memory from the virtual memory subsystem, and so on. Try to punch one hole here, and suddenly most of the kernel’s internal design becomes a system boundary and thus must be frozen to avoid breaking plug-ins. It’s pretty all or nothing.

“Responsibility”

One of the other observations from my original system boundaries post was that “mono-repos” are an occasionally effective tool for avoiding the creation of unnecessary system boundaries. With a mono-repo, a “breaking change” to a library dependency can be made along with sweeping fixes to downstream users, all in the same commit.

Since design is iterative, one of our biggest goals is to support evolution. We design software better by allowing ourselves the option of changing designs for the better. The creation of system boundaries unnecessarily is thus a significant impediment: we’re making it harder to improve the software’s design… unnecessarily. Who wants that?

But one of the counter-arguments against mono-repos is that they seem “irresponsible.” As if the company adopting the practice is abdicating the responsibilities inherent to maintainership. “Surely,” this argument goes, “it’s maturity that leads to maintaining stable interfaces, and these damn kids just don’t know how to design software!”

Of course, this argument is totally specious. It could only make sense if there were no benefits to mono-repos, or if the costs would someday come to outweigh the benefits. But when we look at the purely technical effects on design here… mono-repos impose slightly higher immediate costs, in favor of big longer-term benefits! With a mono-repo, everyone is forced to pay a continuing low-grade cost to keep up with cross-cutting changes across the repo, and in the long run, designs are able to evolve for the better. It’s the exact opposite of the supposed effect that the “maturity” argument presumes.

The Linux kernel’s internal stability policy sees some similar criticism. “Obviously the big boys are capable of maintaining stable APIs, not like those Linux kids!” Or even better, the pure assumption that the reason the kernel allows internal breaking changes is entirely rooted in free software political activism: an attempt to force drivers open-source.

Again, fascinatingly, this gets it exactly backwards. The idea that there should be a stable API for drivers is born of necessity for a proprietary kernel. That’s a technical decision driven by a “political” constraint! With the Linux kernel, as an open source project, it could go either way. But there are technical advantages to being able to evolve kernel internals, and so the approach of open sourcing and upstreaming drivers gets taken instead. That’s a “political” constraint driven by a technical decision, not the other way around!

Disadvantages?

Of course, there are reasons this ain’t all peaches and sunshine. I’ll ignore the kernel-specific aspects, like getting hardware vendors to come around to developing drivers in the open.

The major problem this approach creates is deprecation. If you eschew “plug-ins” for just upstreaming everything, you end up with a monotonically growing pile of code. The Linux kernel is now noticeably starting to contain broken drivers for hardware that nobody uses anymore.

So the “upstream everything” approach creates a new problem to be solved: when do we remove such a driver?

The trouble here is that the existence of these drivers isn’t free. Upstreaming everything avoids the creation of hard system boundaries, but we can end up creating rather firm system boundaries when an API is used in hundreds of thousands of lines of code. Nobody wants to do all that work to accommodate a design change. Every additional driver is additional friction that gets in the way of making design changes.

Eventually everything is going to grind to a halt.

The obvious solution is to get rid of things nobody’s using anymore, but the upstreaming policy creates a new political problem to be solved. Now, instead of old drivers simply bitrotting away into the void, somebody has to take the action of removing them from the kernel. And by taking that action, you get to be the target of ire from anybody who still wanted that driver around. So that’s a bit of a problem.

So far as I’m aware, Linux developers have only just started to propose removing drivers that are visibly broken—proving that no one is actively using them. And to solve the political problem, they insist the drivers can stay if only someone steps up to fix them.

So either the driver gets fixed up, or it turns out the people who’d get upset about the driver going away are also people who aren’t going to lift a finger to save it. That rather neatly lets everyone just get on with things.

End notes

The driver removal problem rather reminds me of other open source political fights. The librsvg developers decided to adopt Rust, but Rust only has a compiler for certain architectures, and distributions like Debian support more architectures than that. So newer versions of the library don’t build for some supported architectures… what’s a Debian to do?