When it comes to designing software well, testing is an important consideration. It’s not just about correctness. Good design with good testing can improve our ability to quickly make changes, too. The more we can be fearless about making changes (because the computer is checking our work), the faster it goes.

Of course, it doesn’t always work out. There is such a thing as bad design, bad testing, and even test-induced design damage.

So far on this blog, I’ve had a few posts about testing, and I’ll just review them quickly:

  • We write software in two major contexts: the “just accomplish something” mode that I called “wizarding”, versus actual engineering. When a quick hack starts to persist, requires maintenance, and becomes something we wish we’d built in a more engineering mode, tests can accumulate as just another hack to keep the Rube Goldberg machine going.
  • One of the critical problems that can arise from testing is that tests can impede making changes to your design. Concentrating tests on system boundaries alleviates this concern. One classic example is compilers: tests are generally written not against the individual units, but against the compiler as a whole. This is specifically to avoid tests becoming an impediment to changing the compiler’s internal design.
  • Next I had a two-part series about the usefulness of property testing. It’s hard to overstate its benefits over traditional testing: you can write fewer tests, they’re easier to maintain in the face of changes, they’re more effective at spotting bugs, and writing them makes you think in terms of design properties.
  • I got a little more explicit about how types and tests are complementary “materials” for keeping our software working. Many of the ways in which tests can have drawbacks are ways types have strengths instead. (Possibly I should expand on the “vice versa” aspect of this… note to self, for the future.)
  • And one of the background ideas in my post on how interfaces can cut off dependencies is that many programmers are nowadays suspicious of interfaces with a single implementation, because of over-mocking.

Over-mocking is a difficult topic, because some people swear by the benefits of mocking. I wanted to do some very broad musing on the subject.

With apologies to anyone who prefers more specific terminology, I’m going to use the word “mock” to mean any old interposing of a test-specific behavior, instead of the real behavior of actually calling into another module.

On the nature of induction

Remember my blog post about how abstractions have two sides? The interesting thing about this two-sidedness is that it appears not just in programming but in mathematics, too.

Mathematical induction (by convention, on the natural numbers) states that to prove something (P) for all natural numbers, you can show:

  1. Base case: P(0)
  2. Inductive case: for every n, P(n) -> P(n+1)

Intuitively, this works because for any n, you can start from 0 and repeatedly apply the inductive step until you reach n. The idea is deeply related to recursion.
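To make the connection to recursion concrete, here’s a minimal sketch (my own illustration, not from the earlier post): the shape of the function is exactly the shape of the induction.

```python
def sum_up_to(n: int) -> int:
    """Sum 0 + 1 + ... + n, written to mirror induction on n."""
    if n == 0:
        return 0                      # base case: establishes P(0)
    return sum_up_to(n - 1) + n       # inductive case: relies on the result for n - 1
```

The recursive call is exactly where we “assume P(n-1)”: the function is only correct if the smaller call already is.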

The interesting thing here is the inductive step. We’re doing two things when showing the inductive case:

  1. Obviously, we’re showing that our target property holds at n+1, i.e. P(n+1), under some assumptions. This is the thing we’re most interested in, since it’s the conclusion!
  2. But we’re also verifying that we’ve made the right assumptions about P(n). If we’re actually doing induction and not just proving something outright, this assumption is critical to writing our proof.

When proving things by induction, it is frequently the case that a property cannot be proven, but a stronger variant of the same property is easily proven! This is counter-intuitive: shouldn’t it be harder, not easier, to prove stronger claims? But no! It’s because of that second part—the other side of the abstraction coin—that the proof can become easier overall. A stronger claim also gives us a stronger assumption.
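A classic illustration (my example here, not from the earlier post): try to prove by induction that the partial sums of 1/i² stay below 2. The inductive step gets stuck, because “below 2” leaves no guaranteed room to add the next term. Strengthen the claim and the step goes through:

```latex
% Weaker claim (the inductive step gets stuck):
%   \sum_{i=1}^{n} \frac{1}{i^2} < 2
% Stronger claim (provable by induction, and it implies the weaker one):
%   \sum_{i=1}^{n} \frac{1}{i^2} \le 2 - \frac{1}{n}
% Inductive step, using the stronger assumption:
\sum_{i=1}^{n+1} \frac{1}{i^2}
  \le \Bigl(2 - \frac{1}{n}\Bigr) + \frac{1}{(n+1)^2}
  \le 2 - \frac{1}{n+1}
```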

What is testing doing?

Sometimes people make the claim that testing is about philosophical induction. We make a few specific observations (the tests pass), and try to conclude something broader (the code works).

I don’t agree with this assessment. I think testing software works differently.

We’re not physicists conducting experiments to figure out the immutable laws of physics. We control the tests and the software under test.

I think it’s closer to the truth to instead recognize programming as an informal kind of (mathematical) inductive reasoning. We’re already thinking our code through. It’s not like we bashed the keyboard randomly, forgot everything, and now we’re trying to understand the result only by writing tests.

And the problem we succumb to usually isn’t an inability to show P(n+1). Most of our bugs come from casually mutating our understanding of P when we look at P(n). That’s the consequence of the informality of our thinking: we’re sloppy like that.

Our functions don’t usually have mistakes because we just did the wrong thing. (Or, if they do, that’s usually the sort of thing we can spot and fix pretty quickly. Typos happen.) Usually, our functions have mistakes because we make mistaken assumptions.

  • We assume something about inputs, and oops.
  • We assume something about state, and oops.
  • We assume something about the behavior of other functions that our function under test calls… and, oops.

So the thing about tests is that we’re not doing the reasoning with the tests. We did the reasoning when we wrote the code. We’re using tests to formally spot check our informal reasoning.

How can testing go wrong?

Thinking about testing as being “how you reason about code” leads us astray. If we could only understand our code through tests, then we’d have to write tests for every possible combination of inputs.

Writing tests for every possible combination is a waste of time.

There are plenty of ways to convince ourselves of this. If that’s the way to reason, we should wonder how people managed to write programs before test suites became a standard thing. If that’s the way to reason, we should wonder how humans managed to write the code at all! After all, the combinations often grow exponentially, so we should be stuck paralyzed thinking through every line of code.

People have a point when they talk about the combinatorial complexity of a design. Certainly it can speak to how difficult the code is to understand. But it doesn’t necessarily.

You can directly experience this for yourself if you do randomized property testing. Generating a large number of random cases isn’t necessary. Just a few are enough to spot the problems. If a test isn’t finding any bugs, but you know they’re there, then you want to adjust the property you’re testing, tune the generator to produce more interesting cases, or instrument your code more deeply with assertions. It’s not a problem of generating too few instances.
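Here’s a minimal sketch of what I mean, using Python’s hypothesis library (the buggy mean function is a made-up example): a small handful of generated cases is enough to catch the mistaken assumption about inputs.

```python
from hypothesis import given, settings, strategies as st

def mean(xs):
    # Mistaken assumption: xs is never empty.
    return sum(xs) / len(xs)

@settings(max_examples=25)          # a handful of cases is plenty
@given(st.lists(st.integers()))
def test_mean_is_bounded(xs):
    m = mean(xs)
    assert min(xs) <= m <= max(xs)  # fails almost immediately, on xs == []
```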

If I haven’t gushed about Jepsen enough, let me use it as an example again. Finding concurrency bugs sounds like an extraordinarily hard problem, right? These are supposed to be about as complicated as bugs come! And yet, Jepsen can use property testing to find bugs in mature databases that are usually exposed by about 5 operations. They’re still shallow!

What are the effects of mocking?

The most obvious thing that mocking does is speed up the running of your tests. Because the test (generally speaking) doesn’t need to call out to a database or disk or some other I/O, it can run much faster.

Another benefit I can’t argue with is fault injection, or dealing with nondeterministic results. It’s much easier for a mock to say “what if the cache has returned stale data” than to try to put a real cache service into a state where it has returned something that should have been invalidated.
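For example (a hypothetical cache-plus-database lookup, sketched with Python’s unittest.mock): injecting the “stale data” fault is one line, where provoking it from a real cache service would be a project.

```python
from unittest.mock import Mock

# Hypothetical code under test: fall back to the database when the cache is stale.
def get_profile(user_id, cache, db):
    cached = cache.get(user_id)
    if cached is not None and not cached.get("stale"):
        return cached
    return db.load(user_id)

def test_stale_cache_falls_back_to_db():
    cache = Mock()
    cache.get.return_value = {"stale": True, "name": "old"}    # injected fault
    db = Mock()
    db.load.return_value = {"stale": False, "name": "fresh"}
    assert get_profile(42, cache, db)["name"] == "fresh"
```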

Finally, sometimes you’re writing code against some expensive service. You probably don’t want your code that interacts with S3 to always have to talk to the real S3 to run its tests.

Another frequently claimed benefit of mocking is spotting the root cause of a problem in test failures. If you mock out a dependency, then the unit’s tests will continue to pass, even if that dependency has a newly introduced bug. Hopefully, a unit test on the dependency itself will spot the bug instead, directing the programmer towards the root cause of the problem.

However, I think this benefit is illusory. This supposed benefit really speaks to missing features in our test runners. We should be able to say “make sure the test suite for component X passes before running this test suite” and surprisingly that’s generally not a standard feature. (Actually, we should probably be able to infer an order to run test suites just by looking at module dependencies, no need to repeat that information to the test runner.) Mocks aren’t the best solution here.

And after this, I think it’s all downhill.

One of the things I mentioned last week is that a mock is one-sided. It says “this service will behave like this”, but doesn’t actually test whether that service behaves that way. (In fact, this is part of what people are claiming as a benefit of mocking when they say mocking leads you to the root cause faster: the service doesn’t behave that way, but in this scenario, that’s because the service was wrong, not the mock! How convenient.)

But as we try to make a mock less one-sided, we start to drown the supposed benefits. If our test suite both returns fake results to some unit tests and elsewhere calls out to the external service to verify that it returns the expected results, what have we actually saved?

Mockists

But something even worse starts to happen. Not only can we return fake results, we can start to record what interactions happened, and write tests about that. “Did this method call that method?” And that starts to seem like a good idea.
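Concretely, this is the style I mean (a made-up example, using unittest.mock’s call recording): the assertions are about which calls happened, not about any result.

```python
from unittest.mock import Mock

def register_user(name, repo, mailer):
    user = repo.save({"name": name})
    mailer.send_welcome(user)
    return user

def test_register_user_interactions():
    repo, mailer = Mock(), Mock()
    register_user("Ada", repo, mailer)
    # "Did this method call that method?"
    repo.save.assert_called_once_with({"name": "Ada"})
    mailer.send_welcome.assert_called_once()
```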

Okay, let me play devil’s advocate for a second. The best-case scenario for testing is a pure function. We give inputs, it gives outputs, we’re just testing its behavior, everything is great.

With state, everything gets harder. We have to set up initial state, then call the function, then inspect the resulting state to see if everything’s good. Even just the last step—inspecting the resulting state—is tricky. It’s hard to be sure just exactly what to check, sometimes. Maybe something unexpectedly funky happened to the state?

So if we can intercept some interaction between components, and look at the behavior again, isn’t that better? More like testing the pure function? Instead of inspecting state after the fact? Doesn’t that let us take a stateful function and talk about what it “does” again, instead of trying to inspect the state we’re left with afterwards to see if it looks like it did the right thing?

No.

If you like that idea, I encourage you to go back and read my earlier post “Using data to mutate state”. This describes a good technique for actually turning these kinds of stateful functions pure, and being able to reap the testing benefits. But it also describes a serious drawback: you have to actually have a meaningful data type that can describe those stateful changes. If your data type starts to degenerate (such as in the described “command object” case from that post), the benefits of doing this disappear.
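In miniature, the technique looks something like this (the names here are my own, hypothetical ones): the logic returns data describing the state change, and a thin interpreter applies it.

```python
from dataclasses import dataclass

@dataclass
class Deposit:              # a meaningful data type describing the change
    account_id: str
    amount: int

def plan_deposit(account_id: str, amount: int) -> Deposit:
    # Pure: unit test it by inspecting the returned value, no state involved.
    if amount <= 0:
        raise ValueError("amount must be positive")
    return Deposit(account_id, amount)

def apply_change(change: Deposit, accounts: dict) -> None:
    # The thin, stateful part, kept separate from the logic above.
    accounts[change.account_id] = accounts.get(change.account_id, 0) + change.amount
```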

Mocks are pretty much only ever the degenerate case. They have to be, because you’re not actually designing a data type.

What are we to do?

So if we’re talking to an external service, let’s say trying to test some code that sends email, what should we do?

I’d like to start with two absolute facts:

  1. You’re going to have to test in production, in some part, however small.
  2. It’s okay to not unit test some code.

Obviously, the more code we have unit and integration tests for, the better. We want to minimize the amount of untested code. My point here is just: take a deep breath. It’s okay. You’re not going to turn into a bad person for not testing something! This is not a moral failing.

Email is an especially good example, because no amount of unit testing is going to tell you the credentials for your email server are wrong. Or that your server is in a spam-block black hole and nothing will ever get through to anyone. At the end, you have to actually send some test emails out with the real system, and see if it works.

It’s also a good example because almost everything about sending email is unit testable, except the actual sending of the email. If you practice a more “functional programming” sort of style, you’re going to have the emailing module largely consist of something like pure functions, with one little “actually send the email” function that’s about 3 statements long.
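Something like this sketch (hypothetical names, and assuming Python’s standard smtplib): everything above the last function is pure and easily unit tested, and the part that isn’t is tiny.

```python
import smtplib
from email.message import EmailMessage

def compose_welcome_email(to_addr: str, name: str) -> EmailMessage:
    # Pure: builds a value, does no I/O, easy to unit test.
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["From"] = "noreply@example.com"
    msg["Subject"] = f"Welcome, {name}!"
    msg.set_content(f"Hi {name}, thanks for signing up.")
    return msg

def send(msg: EmailMessage, host: str) -> None:
    # The one little "actually send the email" function.
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```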

So sure, to unit test other code that might want to send an email, mock it out to just return success and do nothing. Or even better, just have a global flag you can turn off (after all, you already need some configuration or credentials to send that email); there’s a sketch of this after the list below. But:

  • You don’t have to check to make sure it got called.
  • You don’t have to inspect the arguments it got called with. You should be able to unit test the pure functions that produced those arguments.
  • It’s fine.
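As a sketch of what I mean (hypothetical names again): the switch can be as dumb as a module-level flag, and the test for the calling code asserts on its result, not on whether the send happened.

```python
EMAIL_ENABLED = False            # flipped on by real configuration in production

def send_welcome(to_addr: str) -> None:
    if not EMAIL_ENABLED:
        return                   # harmless no-op under unit tests
    # ...the tiny "actually send" function would run here

def sign_up(to_addr: str) -> dict:
    user = {"email": to_addr, "active": True}
    send_welcome(to_addr)        # fire and forget; nothing to assert about this call
    return user

def test_sign_up():
    user = sign_up("ada@example.com")
    assert user["active"]        # check the result, not the interaction
```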

A summary of ideas

  • More unit-testable code is generally designed better. Pure functions are the platonic ideal of unit-testable.
  • Mocks don’t really make code more unit testable. They’re just “integration tests, but faster/easier/cheaper.” This is good for running things as unit tests, but it doesn’t mean you’ve changed your design to be more unit-testable.
  • Changing code to be more amenable to mocks may not be good design. It has drawbacks, so the advantages have to be big enough.
  • Mocks may encourage brittleness: they generally do make it more difficult to make changes to your designs in the future.
  • Most of the reasoning we do about code, we do when writing it, not testing it. If a design change makes something more “testable” but makes it harder to reason about, be suspicious.
  • Letting yourself not test code is okay. Especially if it encourages you to make the untested code as small and clear as possible.

End notes

  • Martin Fowler has a good discussion called “Mocks Aren’t Stubs” which is less about mocks/stubs, and more about the two different styles of testing. My article today is coming down strong on one side of this debate. I’d like to hear counter-arguments if you think I’m not addressing something.
  • This week’s blog post was prompted by something sent to me by @Confusionist on Twitter. Thanks!