Programming languages and compilers used to have a reputation for being one of the most complex topics in computer science. I think that reputation persists, but only because the topic is pretty much infinitely deep, especially where programming languages are concerned. The shallow waters here—the basics of how to design and implement a compiler—are pretty well solved. Compilers are still big, but we know how to break them down into easily understood pieces. I’d like to share some of those insights with you, because they have implications beyond writing compilers.

There is one key idea that makes compiler structure manageable to understand: a compiler consists of a pipeline of relatively simple transformations from one form of data to the next. The key idea here is data. We can completely segment one component of the compiler from another by putting the right data structure between them. To build that data structure, you don’t need to know anything about what happens to it afterwards; you just need to know what it means.

Once we have those data structures, a component of the compiler is just a function from one input type to another output type. As black boxes go, this is the easiest possible thing in the world to reason about. Finding a bug in the whole compiler is just a matter of finding an input that does not produce the expected output, and then tracking that input through the pipeline to the black box that produces the first unexpected output. The internals of that box may still be complex, but by narrowing down the scope of the problem so much, we’re already off to a very good start with debugging.
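
To make that concrete, here is a minimal sketch in Haskell (the stage names and types are invented for illustration, not taken from any real compiler): each stage is an ordinary function between intermediate representations, and the whole compiler is just their composition.

    -- A minimal sketch; every name here is hypothetical.
    data SourceFile = SourceFile FilePath String
    data Ast        = Ast        -- details elided
    data Hir        = Hir        -- details elided
    data ObjectCode = ObjectCode -- details elided

    parseFile :: SourceFile -> Ast
    parseFile = undefined        -- stage internals elided

    lowerToHir :: Ast -> Hir
    lowerToHir = undefined

    codegen :: Hir -> ObjectCode
    codegen = undefined

    -- The whole compiler is the composition of its stages. Debugging means
    -- dumping each intermediate value and finding the first one that looks wrong.
    compile :: SourceFile -> ObjectCode
    compile = codegen . lowerToHir . parseFile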

A high-level diagram of a compiler pipeline, with emphasis on the front-end.

This diagram shows a general structure for how to design a compiler front-end. The general tasks look something like this:

  1. A file is parsed into an abstract syntax tree (AST) that represents the program structure in tree form. (Or graph form, but that doesn’t seem to stop us from calling it an AST.)
  2. This AST is analyzed for semantic errors, before being transformed into a high-level intermediate representation (HIR).
  3. This HIR is suitable for applying certain language-specific optimizations before translating down to a much more general-purpose IR.
  4. Many compilers these days just use LLVM, which has its own suite of IRs to perform optimizations, before ultimately generating machine code.

This kind of design is traditionally justified by the front-end/back-end split. If we tried to build a compiler for every language and every architecture, we’d end up having to build N*M compilers. But with a common IR in the middle, we only have to build N front-ends plus M back-ends. This is an obvious win.

But I think this justification sells the design short. It doesn’t explain why we’re so incredibly enamored with creating more IRs. My high-level diagram is, to some extent, misleading. One of the nice things about pipelines is that composition is associative: we can take a series of transformations and pretend they’re just one big transformation, and I sure took that liberty above. Real compilers can have many more IRs than the uninitiated would ever guess. GHC (the Haskell compiler) has three separate IRs (CoreExpr, STG, and Cmm) and several different normalizations of those IRs, all before it ever gets to something like the LLVM IR in the back-end.

In fact, just parsing to an AST? Let’s zoom and enhance:

Just going from a file to an AST can involve several intermediate representations.

In this diagram we see a common traditional compiler design:

  1. We first handle the problem of lexing an input file into a sequence of tokens: if, (, x, ==, and so on.
  2. Given a token stream, we can then parse this into a tree structure according to the grammar of the language.
  3. For many languages, there is often some transformation step from the structure we construct while parsing to the structure we actually use as the AST (which is to say, the thing we do name lookups and type checking on); a sketch follows this list. I’ve called this “desugaring” here for lack of an all-encompassing word, but these days you usually find a step called “desugaring” after error checking (i.e. after the AST). We want to give nice error messages about the code you actually wrote, and “desugaring” often implies transforming the code as written into something simpler.
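
Here is a toy sketch of those three structures, with everything invented for illustration: a token type, the tree that falls out of the grammar, and the slightly simpler tree that later phases actually analyze.

    -- All types here are hypothetical; real ASTs carry source locations,
    -- names, and much more.
    data Token
      = TIf | TLParen | TRParen | TIdent String | TEqEq
      deriving (Show, Eq)

    -- The structure we build while parsing, sugar and all.
    data ParseTree
      = PIf ParseTree ParseTree (Maybe ParseTree)   -- if/then with an optional else
      | PVar String
      | PEq ParseTree ParseTree
      deriving Show

    -- The structure we actually do name lookups and type checking on.
    data Ast
      = If Ast Ast Ast
      | Var String
      | Eq Ast Ast
      | Unit                                        -- what a missing else becomes
      deriving Show

    -- The "desugaring" step: from the tree we parsed to the tree we analyze.
    desugar :: ParseTree -> Ast
    desugar (PIf c t Nothing)  = If (desugar c) (desugar t) Unit
    desugar (PIf c t (Just e)) = If (desugar c) (desugar t) (desugar e)
    desugar (PVar x)           = Var x
    desugar (PEq a b)          = Eq (desugar a) (desugar b)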

A quick example: Haskell has user-definable operators like >>=, so during parsing a Haskell compiler likely needs to construct an intermediate list-like structure, for example [a, '+', b, '*', c]. During parsing, it may not have seen all operator precedence declarations yet, so it doesn’t know how to arrange this sequence of operations into a tree. Once it has finished parsing, it can go back and figure out how to arrange those lists into trees, constructing something closer to its real AST. You may be unsurprised to hear that GHC has even more intermediate representations before it gets there.
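
A rough sketch of what that looks like (this is an illustration, not GHC’s actual representation): parsing leaves behind a flat operand/operator sequence, and once the fixities are known a small pass rebuilds it into a tree.

    -- Hypothetical types: a flat "operand, operator, operand, ..." sequence,
    -- as it might look right after parsing, before fixities are known.
    data Expr
      = Var String
      | BinOp String Expr Expr
      deriving Show

    type OpSeq = (Expr, [(String, Expr)])   -- a + b * c  ~>  (Var "a", [("+", Var "b"), ("*", Var "c")])

    -- Once every operator's precedence is in scope, rebuild the tree.
    -- This toy version treats every operator as left-associative.
    regroup :: (String -> Int) -> OpSeq -> Expr
    regroup prec (first, rest) = fst (go first rest 0)
      where
        -- Consume operators whose precedence is at least minPrec, building a
        -- left-nested tree and recursing for tighter-binding runs on the right.
        go lhs ops minPrec = case ops of
          (op, rhs0) : more | prec op >= minPrec ->
            let (rhs, more') = climb rhs0 more (prec op)
            in go (BinOp op lhs rhs) more' minPrec
          _ -> (lhs, ops)
        climb rhs ops opPrec = case ops of
          (op2, _) : _ | prec op2 > opPrec ->
            let (rhs', ops') = go rhs ops (opPrec + 1)
            in climb rhs' ops' opPrec
          _ -> (rhs, ops)

    -- regroup (\op -> if op == "*" then 7 else 6) (Var "a", [("+", Var "b"), ("*", Var "c")])
    --   ==> BinOp "+" (Var "a") (BinOp "*" (Var "b") (Var "c"))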

Why so many IRs?

The pipeline style of design is pretty much the distilled essence of the idea of breaking a large problem down into small pieces. We start with a big problem: a compiler is a function from an input source file to an output object file. We drop an IR in the middle, and now we have two smaller problems: build a front-end and a back-end. Drop another IR in the middle of the front-end and now we again have two smaller problems: parse an input file to an AST, then translate the AST to the IR. Rinse, repeat. Now the problem is tiny and totally self-contained. Crush it.

We see this show up in other popular approaches to programming. Shell scripting languages like Bash involve not only the obvious pipe (|) operator for composing commands together, but also composition operators like $(...). Instead of breaking problems down into ones small enough to solve, here we break problems down into ones small enough to already be solved by some tool we have lying around, like grep. (I really have to get around to writing that post about composition someday.)

We also see this in what is probably the most eagerly and widely adopted functional programming language feature. We can write programs by operating on streams of data, using map, filter, fold, flatMap, groupBy, reduce, comprehensions, and so on. This can reduce big problems into ones small enough to solve with a little lambda, or allow us to compose together functions we already have available. And even if the function is not yet written, we’ve still reduced the size of the problem it has to solve.
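
A throwaway Haskell example of the idea (nothing to do with compilers): count word frequencies by composing a handful of functions we already have lying around, with the one genuinely new bit small enough to write as a little lambda.

    import Data.Char (isAlpha, toLower)
    import Data.List (group, sort, sortOn)

    -- The n most frequent words in a string: every step is an
    -- already-solved problem, glued together into a pipeline.
    topWords :: Int -> String -> [(String, Int)]
    topWords n =
        take n
        . sortOn (negate . snd)                    -- most frequent first
        . map (\ws -> (head ws, length ws))        -- the little lambda
        . group
        . sort
        . words
        . map (\c -> if isAlpha c then toLower c else ' ')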

The traditional object-oriented design aesthetic puts a lot of emphasis on encapsulation to achieve loose coupling. The actual representation of data is to be hidden away, so that it can change. Interfaces necessarily hide data members, because you don’t know which implementation of that interface you might get. Hiding away data representation often gets sold as the key thing that makes OO good at breaking down large problems into smaller pieces.

But here we are, looking at how compilers are designed, and we’re achieving loose coupling between components by exposing a data schema, publicly committing to all its representational details. Nothing is encapsulated at all.

“But what if we need to change that representation?” one might ask. But this is no real objection. You can make breaking changes to interfaces, too. If it looks like you want to make a breaking change to an interface, you either make the breaking change, or you define a new version of the interface next to it. Likewise with data like this. Data can be an interface.

The fact that data can be an interface should be kind of obvious (file formats? DB schemas? protocols?), but it seems to get lost in some OO ideology. It’s a natural tool to use between systems, yet it often gets forgotten when we’re designing a single program in an OO language. Instead of passing data from one part of the system to another, we often end up passing references to objects, which creates a dependency between those parts.
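
As a sketch of what that looks like inside a single program (module and type names invented here): the only thing the two sides share is a plain, fully public data type, so neither side holds a reference to the other.

    -- Imagine these three chunks living in three separate modules; the data
    -- type in the middle is the entire interface between the other two.

    -- The shared IR: every constructor is public. The data is the contract.
    data Expr = IntLit Int | Add Expr Expr
      deriving (Show, Eq)

    -- Producer: builds Expr values, knows nothing about any consumer.
    example :: Expr
    example = Add (IntLit 1) (IntLit 2)

    -- Consumer: pattern-matches on Expr, knows nothing about any producer.
    eval :: Expr -> Int
    eval (IntLit n) = n
    eval (Add a b)  = eval a + eval b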

So this ends my musing about design in general. Now let me tell you a few related war stories about compiler design.

Story time: breaking the rules poorly

The advantage of the pipeline design is that it flexibly “zooms” in or out on parts of the problem. All of that goes to hell as soon as you back-feed outputs of a later stage into inputs of an earlier stage. Now you have one monolithic block of code where you’ve semi-pointlessly drawn some boxes inside of it to pretend it’s modular like the rest of the pipeline, but it’s not. You can’t understand it without understanding the whole thing.

Breaking the design like this usually isn’t the fault of the compiler; it’s the fault of the language it’s compiling. The most famous example of this is The Lexer Hack. That’s literally the title of the Wikipedia page. Hah. Not just any lexer hack, the lexer hack.

The C language screwed up its design back in the day. Among other problems, an ambiguity was discovered in its grammar when typedef declarations were introduced. The trouble was how to parse something like x * y;. Multiplication of two variables? Or a declaration of y as a pointer to type x?

The official answer is: look up x in the environment while lexing, and tokenize it differently depending on the result of that lookup. Either it’s one token (identifier, and so this is a multiplication) or it’s another token (typedef-name, and so this is a declaration). The problem: name binding analysis is most naturally computed on the abstract syntax tree, so we’re having to back-feed information from a later stage to an earlier stage.
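
A cartoon of what that back-feeding amounts to (this is not how any real C compiler is written, just the shape of the problem): the lexer can’t classify an identifier without consulting a name environment that properly belongs to a later stage.

    import qualified Data.Set as Set

    data Token = Identifier String | TypedefName String | Star | Semi
      deriving Show

    -- The set of typedef'd names is information from name binding analysis,
    -- yet the lexer needs it just to decide what kind of token "x" is.
    classify :: Set.Set String -> String -> Token
    classify typedefs name
      | name `Set.member` typedefs = TypedefName name   -- so "x * y;" is a declaration
      | otherwise                  = Identifier name    -- so "x * y;" is a multiplication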

The way a compiler like Clang copes with this problem is to box the whole AST construction effort into a “compilation unit” object that contains a whole lot of mutable state for us. We then run the whole ball of code statefully inside this context, mutating away and transitioning back and forth between lexing, parsing, name binding, and even type checking. (We have C++ to thank for that last one, as it complicates parsing even further by making it type-dependent, and also hands us a Turing-complete type system just to make things even more interesting.)

Clang tries to keep things separate, mostly. There’s still an AST representation that’s getting constructed, it’s just one that’s already had name analysis and type checking information filled in. The code in the Parse module tries to call over to a separate Sema module that does name binding and type checking. So it’s not completely mixed up. But it is partially conflated by necessity: all the logic about where scopes open and close has to live in Parse because that’s the code that’s driving everything.

This is a pretty decent compiler design for trying to cope with a language design decision that never should have happened. If you’ve ever wondered why Java tooling is so much better than C/C++ tooling, this design mistake is part of the answer. (The preprocessor is the bigger problem by far, though.)

Story time: breaking the rules well

This is not to say that back-feeding information is always terrible. The problem with C’s lexer hack is that it adds a lot of complexity: it’s hard to reason about what’s going on, and everything we’re trying to do gets more complicated. But it’s also possible to make the same mistake in a different direction.

I’ve described traditional compilers as separating lexing from parsing, but this separation is likely a bad decision, in retrospect. If you’ve ever tried to use a lexer & parser generator, you’ve probably gotten an error message like Syntax error and marveled at how it manages to not even have any clue where the syntax error was, much less what the darned problem is. What we’re suffering from here is over-abstraction making things harder for ourselves.

There’s nothing about parsing that requires lexing to be a separate stage. By separating these things, we run into problems like our parser not having any notion of where it is within the file it’s parsing. (It’s parsing an abstract token stream; it has no notion of a file!) By separating these things, we end up manually re-implementing something like location tracking, because ostensibly that token stream might have come from something other than the lexer, and these tools are trying not to couple with each other.

Indeed, if we stop and look at what a language’s grammar is at a high level, we see that not only are a lexer and a parser implicit parts of the specification, but so is a data structure: the “concrete syntax tree.” (Basically, look at the grammar as a set of algebraic data type declarations.) We might start to get suspicious of our design when our single high-level description of a language’s grammar gets cut up into three artificially separated components that create more work for us.
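
To make the “grammar as data type declarations” reading concrete, here is a tiny invented grammar fragment written directly as Haskell types; the concrete syntax tree is sitting right there in the grammar, no separate design step required.

    -- A tiny grammar, read off directly as algebraic data types:
    --   expr ::= expr '+' term | term
    --   term ::= NUMBER | IDENT | '(' expr ')'
    data Expr
      = Plus Expr Term
      | Single Term
      deriving Show

    data Term
      = Number Int
      | Ident String
      | Parens Expr
      deriving Show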

And there’s potential benefit to back-feeding information, too. Layout and parsing context information can inform the lexer and make it possible to parse grammars you otherwise couldn’t (e.g., context-aware scanning).
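
For a small, hedged illustration of that kind of back-feeding (the token names are invented, and the C++ detail is simplified): the parser tells the lexer what context it is in, and the same characters lex differently in each context.

    -- ">>" is two closing angle brackets inside template arguments,
    -- but a single shift operator in an ordinary expression.
    data Mode  = InTemplateArgs | InExpression
    data Token = CloseAngle | ShiftRight | Other Char
      deriving Show

    lexAngles :: Mode -> String -> [Token]
    lexAngles InTemplateArgs ('>' : '>' : rest) = CloseAngle : CloseAngle : lexAngles InTemplateArgs rest
    lexAngles InExpression   ('>' : '>' : rest) = ShiftRight : lexAngles InExpression rest
    lexAngles mode           (c : rest)         = Other c : lexAngles mode rest
    lexAngles _              []                 = []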

So in this particular instance, perhaps we went too far in breaking things down into pipelined steps. More intermediate representations aren’t necessarily better.

Post-publication note, March 16: Several readers have come away from the above thinking I am advocating for “lexerless” approaches to parsing. Oops. Lexerless is certainly one approach you can take with an integrated view of parsing (e.g. SGLR, or monadic parsing). However, I merely meant to argue that the traditional “pipeline-like” separation between these phases was unnecessary and there were benefits to integrating them.

I didn’t mean to suggest lexers were a bad idea, merely that we may have gone too far by implementing a lexer as if we had no idea what the token stream would be used for, and then implementing a parser as if we had no idea where the token stream came from. Most of our lexer and parser generators only loosely work together. There were reasons for this, but I don’t think they were very good ones, and that’s another story from what this post was meant to be about. :)