We’ve seen people advocate not just test-driven development, but test-driven design. Does testing improve software design? Eh… maybe. It definitely can, and probably most frequently does, but there are things we should be careful of, because testing can make our design worse, too.
What do we test?
Let’s start with the positive. I’ve talked before about the concept of a system boundary, essentially the interface/API that we’re setting in stone because we have external users we don’t want to annoy with breaking changes. The first thing we can be confident of is that almost any amount of testing is a positive for a system boundary. A positive for correctness, perhaps even for documentation, and frequently a positive for design as well.
At boundaries, the test suite pretty much just plays the role of another external user. This has two important corollaries.
First, because it’s just another external user, and because we’re already loath to make breaking changes, we should never find ourselves making changes to our code that require us to also change our test suite. Such changes would potentially break other external users too. As a result, our test suite should never become “fragile” and get in the way of our development. It might even be a helpful reminder that a change would be a breaking one. After all, external users are, well, external and not always easily visible.
Second, one of the things that helps us design code well is having a good sense of how it will be used. One of the biggest rules of thumb we hear (and I agree with) is usually some kind of “rule of three.” That is, once you have three users of a potential abstraction, you can start to see what that abstraction should really look like. Testing gives us a second user. (Or, in those instances where we’re writing something bottom-up, testing might give us our first user!) This is a big part of the potential that testing has for improving the design of code. However, it doesn’t necessarily do so, just as any old additional user of an abstraction doesn’t necessarily tell us anything helpful about our design.
The danger: fragile test design
System boundaries are not always hard distinctions. Harder system boundaries are actually exposed to external users. But very light system boundaries might simply be the public methods of a class internal to an application. In my experience, the general usefulness of a test suite (for design purposes) is proportional to the hardness of the system boundary it is testing. The reason isn’t so much that test suites are more effective on system boundaries, but rather that (a) system boundaries are more important and (b) system boundaries are less fragile.
I’ve mentioned before the idea of wizardry vs engineering. (Recap: engineering is about maintenance while wizardry is about getting things done, and we’re bad at transitioning spells into well-engineered code.) One of our tendencies is to resort to over-testing as a means to insulate ourselves from the capriciousness of wizarding code. So much of what’s supposed to happen gets left implicit (which is what typically makes wizardry what it is) that in order to recover some understanding of what the code should do, we have to resort to tests. The tests serve as machine-checked documentation and examples from which we can try to recover an understanding of the system.
The principal way test design can cause us problems is by getting in the way—by being fragile. We want to make a change, and this change is a definite improvement, but it breaks a bunch of tests, and what we have to do in response is go fix the tests. This pits the test suite against our design sensibility. It becomes an obstacle to improving the design of the code. The tests are no longer doing what they’re supposed to: alerting us to bugs in the code, and helping us code.
On hard system boundaries, this problem never arises because we don’t want breaking changes anyway, but internally, this is a serious flaw. But that’s not the only potential problem.
Does designing for testing improve design generally?
I’d like to use an example, and the example that comes to mind involves a small amount of programming language theory, so please indulge me.
When we’re trying to describe precisely how a programming language behaves, there are a couple of very common styles. One of these styles is called “small-step operational semantics” and consists of writing a function with a type like this:
smallStep :: (Instructions, Heap) -> (Instructions, Heap)
This is intended to take one “small step” in evaluation.
For instance, we might see that 1 + 2 * 3 evaluates to 1 + 6 with an unchanged heap.
In other words, this is a very “low-level” machine-like description of how the language works.
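To make this concrete, here is a minimal sketch for a hypothetical toy language of integer literals, +, and * (no heap is modeled, so the step function is simply Expr -> Expr rather than a pair of instructions and heap):

```haskell
-- Hypothetical toy expression language: integer literals, +, and *.
-- No heap is modeled, so one small step is just Expr -> Expr.
data Expr
  = Lit Int
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Take one "small step": reduce the leftmost reducible subexpression.
smallStep :: Expr -> Expr
smallStep (Add (Lit a) (Lit b)) = Lit (a + b)
smallStep (Add (Lit a) e)       = Add (Lit a) (smallStep e)
smallStep (Add e1 e2)           = Add (smallStep e1) e2
smallStep (Mul (Lit a) (Lit b)) = Lit (a * b)
smallStep (Mul (Lit a) e)       = Mul (Lit a) (smallStep e)
smallStep (Mul e1 e2)           = Mul (smallStep e1) e2
smallStep e                     = e  -- a Lit is already a value
```

One call takes 1 + 2 * 3 to 1 + 6, exactly the single step described above; nothing ever runs to completion unless the caller keeps stepping.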
Another approach we can take is “big-step” operational semantics, and write a function like this:
bigStep :: Program -> Value
In other words, an interpreter.
An expression like 1 + 2 * 3 evaluates straight to 7.
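For a hypothetical toy language of integer literals, +, and *, the big-step version is just a plain recursive interpreter (a sketch, not any real language’s semantics):

```haskell
-- Hypothetical toy language again, now in big-step style: one plain
-- recursive function straight from expression to final value.
data Expr
  = Lit Int
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

bigStep :: Expr -> Int
bigStep (Lit n)     = n
bigStep (Add e1 e2) = bigStep e1 + bigStep e2
bigStep (Mul e1 e2) = bigStep e1 * bigStep e2
```

Here 1 + 2 * 3 goes straight to 7 in one call, with no intermediate program states visible to the caller.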
Which of these is easier to unit test? Which of these would you rather use?
If you think the way I do, you probably see the small-step style as easier to test. Here, you can set up any program state you want, run one step, and inspect anything about the resulting program state. This lets you easily check that each element of the language works as intended: we can see exactly what + does directly.
Likewise, you probably want the big-step style if you’re a user. The smallStep function is an annoying interface. If I want to evaluate an expression, I do… what? Call it again and again until it results in a value, then stop? I just want an interpreter; why did you design something this way?
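That driver loop every caller is forced to write looks something like this, for a hypothetical toy language of integer literals, +, and * (runToValue is my name for it, not anything standard):

```haskell
data Expr = Lit Int | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

smallStep :: Expr -> Expr
smallStep (Add (Lit a) (Lit b)) = Lit (a + b)
smallStep (Add (Lit a) e)       = Add (Lit a) (smallStep e)
smallStep (Add e1 e2)           = Add (smallStep e1) e2
smallStep (Mul (Lit a) (Lit b)) = Lit (a * b)
smallStep (Mul (Lit a) e)       = Mul (Lit a) (smallStep e)
smallStep (Mul e1 e2)           = Mul (smallStep e1) e2
smallStep e                     = e

-- The driver the user has to write themselves: keep stepping until
-- the expression stops changing, i.e. until it is a value.
runToValue :: Expr -> Expr
runToValue e =
  let e' = smallStep e
  in if e' == e then e else runToValue e'
```

The interpreter the user actually wanted is this loop; the small-step function alone leaves the most important part as an exercise for every caller.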
This is the fundamental danger I see in the idea that designing for testing necessarily leads you to better designs.
Fun fact: compilers rarely unit test. At least, not the way most programmers think of unit testing. For all that compilers consist of smaller transformation steps pipelined together, we almost always write tests as input files we run the compiler against.
Clang has a unittests directory next to test, but besides being a lot smaller, it almost exclusively consists of tests for things not traditionally part of what a compiler does. In unittests/AST/, for example, almost half the code is about parsing comments (presumably for the documentation generator), nothing involved in the task of compilation. It seems like unittests is for things that can’t be tested by just running the compiler on some input. This is by design: the language is a system boundary, and we want to concentrate tests there, where they won’t ever get in the way.
The “big-step” style is what we actually want to commit ourselves to. Testing the internal steps of the compiler can be done, but… why? Other than an ideological commitment to a certain style of unit tests, it’s pretty much all downsides.
Can designing for tests be an improvement?
Yes! I’m just more preoccupied with the potential for drawbacks, because today I think we sometimes overdo the adherence to strict rules about testing.
One of the simplest things that designing for testing can do is decouple code. It’s harder to test things that implicitly tie in to a lot of other application state. The drive for more testable code has also driven adoption of more functional styles of programming. What’s easier to test than a pure function after all? (Even one as big as a compiler!)
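As a sketch of that last point (a hypothetical mini-language, nothing to do with any real compiler): a whole evaluator can be a single pure function from source text to result, at which point end-to-end tests are just input/output pairs:

```haskell
-- Hypothetical example: a complete "language implementation" as one
-- pure function. Source text in, result out, no other state touched.
-- The language is a tiny postfix (RPN) calculator over Ints.
evalRpn :: String -> Maybe Int
evalRpn = go [] . words
  where
    go [v] []       = Just v          -- exactly one value left: done
    go stack (t:ts) = case t of
      "+" -> binOp (+) stack ts
      "*" -> binOp (*) stack ts
      _   -> case reads t of          -- try to read an integer literal
               [(n, "")] -> go (n : stack) ts
               _         -> Nothing
    go _ _          = Nothing         -- leftover or missing operands

    binOp f (a : b : rest) ts = go (f b a : rest) ts
    binOp _ _ _               = Nothing
```

Because it’s pure, testing it end-to-end needs no setup, fixtures, or mocking: evalRpn "1 2 3 * +" should give Just 7, and a malformed program like "1 +" should give Nothing.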
Possibly I should have more to say about this (and perhaps I will say more in the future), but today I’m more interested in what causes this to go wrong. I suspect most of my readers are all-in on testing being a good thing in principle. So I ponder what can sometimes make it more negative in practice.
What about “test-driven development?”
A well-designed code base is one that has good tests, too. For the most part, the ideas originally articulated for TDD are good. They’re honestly mostly about process, not especially relevant to design. If I were to summarize the ideas behind TDD, I’d go with:
- Write tests first, because tests are useful from the very beginning to give you feedback as you code, and because ensuring the test fails before you start means you’ve minimally “tested the tests.” Besides, most people want to try their code a little when writing anyway, so just “automate your repl.”
- A fast testing feedback loop lets you be more confident as you change code, especially when refactoring.
- We definitely want to ensure test suites exist, and writing them too long afterwards can be more difficult and too easily skipped or skimped.
I feel TDD only goes very wrong when it encounters consultants and thought leaders who want to hand down rules to be followed unquestioningly. Perhaps they mean well. People do hate change, and sometimes some arm twisting seems necessary. Forcing past illegitimate objections while paying attention to legitimate ones is really difficult, though.
So, to conclude, here are some ideas I have about how to ensure TDD doesn’t go wrong, if that’s the kind of methodology you’d like to follow:
- Try to concentrate tests on hard system boundaries. Just as compilers are tested largely by input programs, focusing tests in a place that shouldn’t involve breaking changes means the maintenance burden of these tests is essentially zero.
- Try to minimize tests on light or non-system boundaries. I’ve heard that sometimes people struggle with wanting to write tests for private methods on classes and don’t know what to do about it. Sometimes the recommendation I see is to move the private method to another class as a public method. But I really have to question: couldn’t the test be written against a public method? Really? We can test whole compilers by just writing different input programs, but you can’t test your single class’s private methods via public ones? I don’t know. I’m reluctant to say anything definite about this, since I’ve never really found myself in this situation, and it’s difficult to say why. But I wonder if it’s not a real concern, just a case of thoughtless rule following: “I felt the impulse to try this method out during development, and I’m supposed to automate that as tests, right?” Tests are designed; not every impulse to try things out should be preserved as a test.
- Unless we’re on a system boundary, we should probably regard the common organizational tactic of writing FooTest.java as the place to put tests for Foo.java as a form of mild technical debt. If you’re practicing TDD, you want those tests somewhere, right away, but if it’s internal code, you don’t really want to be coding a lot of tests against that class. You may later decide to rewrite these tests against a different interface, testing indirectly. An insistence on this organization can result in poor design. You may find yourself obligated to write a lot of tests against non-system boundaries, following rules directly into a hell of fragile test suites that get in your way.
- Outside of system boundaries, we should like to delete tests, much in the same way we like to delete any code.
- Likewise, tests are a valid target of refactoring, same as any other code. Tests have a function, and if you can find a better way of doing things that performs that function, it’s probably worth the effort to clean up.
- Sometimes when you find yourself really, really wanting to write tests against a particular interface, that’s evidence you should regard that interface as a system boundary, even if not a particularly hard one. Consider whether you should also write up quality documentation about that API, and whether it’s something you might regard as a frozen interface you don’t want to make breaking changes to (even if not necessarily exposed.)
- One way to get really good testing with really small tests is to use randomized property testing. I’ll have more to say about this soon.
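As a taste of that last idea, here it is hand-rolled for a hypothetical toy language of integer literals, +, and * (a real project would reach for a library like QuickCheck instead of this homemade generator): generate random expressions and check that running the small-step evaluator to completion always agrees with the big-step interpreter.

```haskell
data Expr = Lit Int | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

smallStep :: Expr -> Expr
smallStep (Add (Lit a) (Lit b)) = Lit (a + b)
smallStep (Add (Lit a) e)       = Add (Lit a) (smallStep e)
smallStep (Add e1 e2)           = Add (smallStep e1) e2
smallStep (Mul (Lit a) (Lit b)) = Lit (a * b)
smallStep (Mul (Lit a) e)       = Mul (Lit a) (smallStep e)
smallStep (Mul e1 e2)           = Mul (smallStep e1) e2
smallStep e                     = e

bigStep :: Expr -> Int
bigStep (Lit n)     = n
bigStep (Add e1 e2) = bigStep e1 + bigStep e2
bigStep (Mul e1 e2) = bigStep e1 * bigStep e2

runToValue :: Expr -> Expr
runToValue e = let e' = smallStep e
               in if e' == e then e else runToValue e'

-- A tiny deterministic pseudo-random generator, standing in for a
-- real property-testing library's generators.
lcg :: Int -> Int
lcg s = (s * 1103515245 + 12345) `mod` 2147483648

-- Build a random expression of bounded depth from a seed, returning
-- the next seed.
genExpr :: Int -> Int -> (Expr, Int)
genExpr 0 s = (Lit (s `mod` 10), lcg s)
genExpr d s = case s `mod` 3 of
  0 -> (Lit (s `mod` 10), lcg s)
  1 -> node Add
  _ -> node Mul
  where
    node op = let (l, s1) = genExpr (d - 1) (lcg s)
                  (r, s2) = genExpr (d - 1) s1
              in (op l r, s2)

-- The property: small-step run to completion == big-step, checked
-- against n randomly generated expressions.
propStepsAgree :: Int -> Bool
propStepsAgree n = all ok (take n (iterate lcg 42))
  where
    ok s = let (e, _) = genExpr 4 s
           in runToValue e == Lit (bigStep e)
```

One short property like this exercises far more of the language than any hand-enumerated list of example expressions would.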
Just to talk about process for a bit more… I’ve personally never really practiced TDD, but I have had a lot of success with encouraging every commit to come with associated test changes. A commit that’s invisible to tests is either a refactoring, or it should get a great deal of scrutiny. Seeing the tests can be a simple explanation of what’s actually happening, making it a lot easier to evaluate a commit from just a diff. The more commits that can be peer reviewed effectively from glancing at a diff, without needing to check it out and explore in more detail, the faster and easier code review is.
I certainly have experienced my fair share of tests that turned out to be fairly annoying and non-helpful. Some of these actually turned out to be my own naive attempts to unit test the internals of a compiler. Doh. Turns out non-system boundaries change, and then what? Probably just delete those tests. In my case, they seemed remarkably ineffective at catching problems anyway.