One of the funny things that’s happened in the last decade or two is that performance optimization got turned on its head. Once upon a time, CPU performance meant optimizing instructions. When I learned C, over 20 years ago, the books I read were quite particular about (y << 8) + (y << 6) being a much more efficient way to write y * 320 (it works because 320 = 256 + 64, so the multiplication becomes two shifts and an add). These days, of course, compilers do that for you, if it’s even still a good idea (and CPUs have changed enough that it might not be).

On the one hand, this is evidence in favor of what your friendly local algorithms teacher always told you: machine details aren’t the important part, focus on the algorithms. Big O is your friend. Sometimes this advice has been scoffed at (“Constant factors matter!”), but in light of how much things have changed, it seems pretty true.

Except in one way: cache.

It’s less that the “machine details and constant factors” faction was wrong, and more that the details that matter have changed. And… it doesn’t look like they’re going to change again, at least until CPUs are no longer a thing, if that’s even possible. Multiple factors have converged:

  1. We got a lot more data to process, in general, and so memory is a bigger part of the performance picture.
  2. CPUs got so much faster, relative to access latencies from memory, that instructions matter far less than cache misses. CPUs spend a lot of time these days doing nothing, waiting for RAM to reply.
  3. Compilers have gotten so much better at optimizing code that we’re far more likely to be bottlenecked on what compilers aren’t optimizing: memory layout.

Why can’t compilers optimize data structures?

From a high-level perspective, I think the answer is obvious: compilers can’t invent red-black trees or hash tables to replace your code’s association lists.

But there are much lower-level questions here. Compilers can’t replace linear scans with binary search, either, but that hasn’t stopped code optimizations from having a pretty significant impact on performance. Why can’t compilers do smaller-scale optimizations on data representation, just as they do for code?

The problem compilers face is that data representation is a system boundary, from the compiler’s perspective. If a function takes a certain format of data, the compiler is not free to just change that. It doesn’t know everywhere that function may be called from. Even in situations like static functions in C, where the compiler can see all call sites, compilers still do not really take advantage of that fact to alter signatures. The trouble is that functions are still taken to be a fundamental abstraction: intra-procedural and inter-procedural optimizations are different things, and inter-procedural optimizations generally take each function’s signature as a given.
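To make that concrete, here’s a minimal sketch (the struct and function are made up for illustration): once a particular layout shows up in a signature, every caller bakes it in, and the compiler has no license to quietly reorder or split the fields behind their backs.

// Hypothetical example: the layout of struct Config is effectively part of
// parse_flags's contract. Even if splitting the hot fields away from that big
// buffer would be faster, the compiler can't change the layout without also
// rewriting every caller that builds a struct Config.
struct Config {
  int verbosity;
  char log_path[256];
  double threshold;
};

void parse_flags(struct Config *cfg); // callers elsewhere depend on this exact layout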

In fact there is one way in which we can see compilers optimize data representation, to an extent. But this is only possible when two criteria are met:

  1. All of the relevant functions have to be inlined. Inlined functions don’t have a signature that has to remain fixed anymore, so their implementation can float.
  2. The transformations have to be semantically reasonable. (This is hard to characterize and varies by language, but a reasonable heuristic is “the data has to be on the stack.” It doesn’t have to be written that way, but it’s often the case that if the compiler can play with the representation at all, it could also have chosen to place the data on the stack, and often does.)

At this point, an aggregate data structure can be “exploded” into its component parts, and then optimized as if they were all local variables.
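Here’s a minimal C sketch of the kind of case where this happens (my example, assuming a typical optimizing compiler): with length2 inlined and the struct never escaping the function, the optimizer is free to treat the three fields as independent scalars and forget the Vec3 layout ever existed.

struct Vec3 { float x, y, z; };

static inline float length2(struct Vec3 v) {
  return v.x * v.x + v.y * v.y + v.z * v.z;
}

float demo(float a, float b, float c) {
  struct Vec3 v = { a, b, c }; // local, never escapes this function
  return length2(v);           // after inlining: plain arithmetic on a, b, c,
                               // with no Vec3 layout left in the generated code
}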

Haskell relies on this extensively. Haskellers sometimes remark that “lists in Haskell aren’t exactly like lists in other languages, they’re more akin to for loops.” It’s extremely common for a list to be produced in one function, only to be consumed by another function. GHC Haskell inlines functions extremely aggressively, in part so this creation of an intermediate, temporary list can be completely eliminated. (And that’s before we get to “fusion” or “deforestation” optimizations that try to do this even more aggressively, even when there’s too much to fully inline.)

So what are we to do?

The inflexibility of data structures means that—when performance is a concern—we need to think about representation early. The design of the code involved can’t be easily changed later, because the code is written around the data structure choice.

But we’re also told to avoid “premature optimization,” so which is it?

On the one hand, not all early optimization is premature. On the other hand, prototypes are valuable for figuring out what to do, before you figure out how to do it well. So perhaps we’re really talking about planning for a partial rewrite.

But I’m going to punt on this question until maybe a later post. There’s a different question I want to think about first.

Wasn’t object-oriented programming supposed to help?

Part of the argument for object-oriented programming is that encapsulation is supposed to hide the choice of data structure. We’re supposed to build against an interface, with a concrete implementation that can be swapped out. So does OO help us swap out our data structure choice at a later date?

Well, not so much. Or at least, there’s nothing special about OO here. Part of what helped the OO crowd make their case at first is that it was advocated at a time when non-modular procedural programming was dominant. Obviously, any amount of modularity is a win, and OO did bring that compared to the “plain C” practices of the day.

But the only major benefit OO (i.e. modular programming of any kind) brings here is “in the large.” It’s the idea that some modules should not depend on other modules. Limiting dependencies limits the scope of impact that a change in data structures has. That’s the primary mechanism that helps facilitate data structure changes, but we can do that with or without OO.

“In the small,” when we’re just considering code within a module, or a couple of related modules, OO does essentially nothing to make data structure changes less painful. Of course, it’s not without dogma to hand out, despite this. This is how we’ve ended up with prescriptions that classes be littered with getters/setters, despite the frequent absence of any benefit over plain data.

One interesting thing about getters/setters is that they do sometimes have a benefit, but it rarely gets talked about, because it’s not relevant when the rule is “just do it always.” For instance, an object with getters/setters is a way to remain ABI-compatible with future additions of more fields to the data structure in question. This is the primary benefit of the “Builder” pattern, compared to a constructor that just takes some configuration data. The builder allows more options to be added to the configuration in the future without breaking existing code that uses the builder.
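The same idea shows up outside of class-based languages, too. Here’s a rough C sketch (all the names are hypothetical): because callers only ever go through functions, the library can grow the underlying struct later without breaking them.

struct server_config; // opaque; the layout lives in the library, not the header

struct server_config *server_config_new(void);
void server_config_set_port(struct server_config *cfg, int port);
void server_config_set_threads(struct server_config *cfg, int threads);
// Adding server_config_set_timeout() later is backward-compatible;
// handing callers the raw struct definition would not be.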

Of course, one could also just wish for a record data type that’s extensible in an ABI-compatible way, too. Other benefits exist: setters can do validation, for instance. But this still leaves us wondering why getters/setters are so prevalent in other situations, even when there’s no discernable benefit.

The theory backing up getters/setters is that, by depending on an interface (these methods) instead of an implementation (the fields), users keep working even if the actual representation changes in the future. This is supposed to make it easier to change the representation, removing the need to rewrite all the users of that class.

In practice, this supposed benefit is often non-existent, and for performance-related changes it nearly always is. If we’re changing representation for performance reasons, “adapters” sitting between users and the actual representation generally aren’t going to help with that performance. In fact, a data structure change often requires changing access patterns in the algorithms, too.

But more insidious is that data structure changes for performance reasons are unlikely to respect our present object-oriented decomposition.

Aggregate data

We think compositionally about how we represent things. If we have widgets, we create a widget type. If we have a collection of widgets, we might reach for the “array of widgets” type.

But performance doesn’t always respect this compositional thinking. The classic example here is “array of structs” (i.e. our traditional, compositional approach) versus “struct of arrays.”

// We have points, so let's represent a point:

struct Point { float x, y, z; };

void update(struct Point p[]) {
  // p[0] identifies a point
  // p[0].x accesses one coordinate
}

Or, alternatively,

struct Points {
  float *x;
  float *y;
  float *z;
};

void update(struct Points p) {
  // We don't have a way to refer to a point by itself, only the aggregate
  // p.x[0] accesses one coordinate
}

The benefits of this transformation can be huge:

  1. For numerical data, the “SoA” approach can much more easily make use of SIMD instructions, to efficiently perform the same transformation on all elements of the array.
  2. For any data, access to just some of the data can be done in a much more cache-efficient manner. (Slightly contrived for the above example, but let’s imagine we only care about the y coordinate, to apply the effects of gravity or something; there’s a sketch of this just after the list. The second representation means the cache would only be filled with y data, and not any of the unnecessary z or x data. This trick can be helpful even in general-purpose databases, where it’s called a column-oriented store.)
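Here’s that crude gravity sketch, against the SoA layout above (the constant and the dt parameter are just for illustration; the physics isn’t the point, the access pattern is):

void apply_gravity(struct Points p, int n, float dt) {
  // Only the y array is streamed through the cache; x and z are never touched,
  // and the loop is trivially vectorizable.
  for (int i = 0; i < n; i++) {
    p.y[i] -= 9.81f * dt;
  }
}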

But consider trying to make this data structure change in an object-oriented program. Point would be an object at first, but now we have to explode it. In fact, we don’t even have a Point object anymore; it’s gone entirely, and we’re just writing all our functions over the data structure Points instead. This isn’t even an object-oriented way of thinking anymore.

This kind of transformation is antithetical to object-oriented programming. And when it comes to cache-friendly representations, this kind of thing is common. (Here’s a RustConf closing keynote about “OO” designs vs entity component systems. Note to self: a future post about ECS might be an interesting case study.)

So another reason object-oriented rules aren’t helping us cope with changing data structures is that the necessary changes have no real reason to be confined to the internals of any given class. They are easily cross-cutting.

Another interesting manifestation of this tension is the trouble that object-relational mappers (ORMs) run into. The relational data model (even with its limitations) treats data in the aggregate; that’s a necessary part of what makes it a good and useful model. But mapping that back onto classes and objects often runs into problems. Joins and transactions in particular often present unnecessary complications when attempted from an ORM. These things aren’t that complicated in themselves, but trying to fit them into an OO worldview makes them more complicated.

Opposite land

What’s the easiest way to make an invasive change to data representation?

The basic premise of the object-oriented style here is that the easiest way is to let current code that uses the class continue to work just fine. That is, make an internal change, but keep external users unchanged, because the interface doesn’t change. I’ve challenged above whether this approach really works, but now I’d like to take it a step further: I don’t think this is even the best way.

The best approach to making invasive changes like this that I’ve found is to:

  1. Use a language with a good type system and strong support for data. (The best languages I have used on this front are OCaml, Haskell, Scala, and Rust, and it looks like Swift qualifies as well, though I haven’t actually used that one yet.)

  2. Make a “loud” breaking change to the types. Deliberately ensure that compiler errors do arise as a result of the change. (There’s a sketch of what this can look like, in terms of the earlier example, just after this list.)

  3. Chill out for a moderately mindless session of “go to next compiler error, fix error, repeat.” (Starting to have more than a hundred? Many changes are probably similar, maybe it’s time to learn vim macros…)

  4. It just works?
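For a concrete (if hypothetical) version of step 2, in terms of the earlier Point/Points example: delete the old representation outright rather than leaving it around as a deprecated alias, so the compiler hands you the complete list of users.

// The old AoS declarations are removed entirely:
//
//   struct Point { float x, y, z; };   // deleted
//   void update(struct Point p[]);     // deleted
//
// Only the SoA version remains:
void update(struct Points p);
// Every leftover pts[i].x and every call to the old update() is now a
// compile error, and step 3 is just a matter of working through them.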

I’ve mentioned this kind of refactoring before on the blog. If you’ve never experienced it, you should try it someday. But there seem to be a few major benefits:

  • Since you’re changing all the users, there are no longer any performance concerns about compatibility shims or the like.
  • You’re directly changing data representation from one to another, so you don’t run afoul of any enforced dogma about class organization (like what happens with aggregate data).
  • Since you’re fully converting the representation and all users, there’s no accumulation of technical debt whatsoever. It’s very clean.
  • You get to survey how the data structure gets used, which can sometimes help inform you about whether you’re making the right change.

Conclusion

All this leaves us with the question: if it’s so difficult to make invasive changes to data representation later, what do we do about it? I’m trying to put some thoughts together on what prototyping really means when it comes to software development, but that’ll have to wait for another time.

But one tool that does work is one that object-oriented programming did help bring into common practice: controlling dependencies. Encapsulation might not always matter as much at the level of individual classes, but the idea of limiting the scope of the changes needed to accommodate a representation change is a good one.

And of course, the most important way that applies: if you think a data representation might need to change, don’t leak it into a system boundary. There are “easier to change” and “harder to change” designs, but that mistake takes us straight to “impossible to fix.”

We’ve got enough of those.