Here is a very common situation in programming: you begin by writing something relatively quickly, validate what you’re doing, and then have to come back and refactor or even rewrite small pieces of it. Perhaps the cause is performance problems, a naive choice of data structure or algorithm, design errors, or just heaps of technical debt. Perhaps it was always a prototype to begin with. It doesn’t really matter what the problem was. This situation happens all the time, especially since design is iterative. We can’t help but write programs this way, at least in some ways.

There’s a perfect testing technique for this situation called “model-based testing.” You take two implementations solving the same problem, and you compare their behavior. They’re supposed to accomplish the same things, so their behavior should be identical. Try both, and take note of any differences.
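To make that concrete, here is a minimal sketch of the comparison in Python. Everything in it is an assumption for illustration: `dedupe_old` stands in for the quick first version, `dedupe_new` for the rewrite, and `find_difference` is a hypothetical helper that feeds both the same random inputs and reports the first one where they disagree.

```python
import random

def dedupe_old(items):
    """Hypothetical original: quadratic, but simple enough to trust as the model."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result

def dedupe_new(items):
    """Hypothetical rewrite: linear time, tracks seen items in a set."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

def find_difference(old, new, trials=1000, seed=0):
    """Return the first random input on which old and new disagree, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Small value range so duplicates are common.
        items = [rng.randint(-5, 5) for _ in range(rng.randint(0, 20))]
        if old(items) != new(items):
            return items
    return None
```

If `find_difference(dedupe_old, dedupe_new)` returns `None`, no difference was found in those trials. That is evidence, not proof, but a returned list is a concrete input worth investigating.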

The name “model-based testing” isn’t perfect for the technique I’m describing. It connotes that there’s a “model” and a “system under test,” which is true… when you’re trying to build a test suite in this way. But the most common situation where model-based testing would apply (literally everyday programming) is temporary. We’re not constructing an enduring specification of a correct model; we’re just temporarily using the old implementation as an ad hoc model, to be thrown away later. This is ephemeral model-based testing.

Another term in common use is “differential testing.” However, I believe this term connotes comparing two separate implementations of the same thing with unrelated histories. For example, GCC and Clang are both “supposed” to compile C code with the same behavior, and the places where these compilers produce different results are interesting.

I’ve decided to coin a term (adding “ephemeral”) because I’m not aware of a good term in common use for this specific technique: comparing the old and new implementations. Tweet at me if I’m wrong.

With ephemeral model-based testing, you accomplish three major things:

  1. You find bugs in your new implementation. Any time you spot behavior that differs from the old implementation, that’s a strong clue you may have done something wrong with the fresh, relatively green code. This approach can be much more effective than a traditional example-based test suite.

  2. You can find bugs in your old implementation. The new implementation may treat some cases (especially edge cases) differently, and some of those differences may turn out to be problems with the old implementation. Models aren’t always perfect! The inputs that expose such problems are, however, perfect candidates for adding to the test suite.

  3. You can be more confident no behavior changes sneak through in your refactoring, especially on system boundaries where these would be breaking changes. Even with well-specified and tested interfaces, there can be dark corners with specific behavior that users (those wascally wabbits) manage to depend upon. Directly comparing the behavior of two implementations can expose these changes before they get as far as being released and breaking dependent code, especially code from external users.

This technique is widely applicable. We have adopted this word “refactoring” specifically for changes that shouldn’t result in behavior differences. We make changes where this would be useful all the time. And this approach is very effective: even minimal property testing is effective at finding bugs. Actually having an “oracle” around to tell us what the correct behavior should be is even more powerful.

Why is this technique so rare?

Despite its utility, and its appropriateness for common programming tasks, this approach to testing is rarely used. I can see a few reasons why that is.

You must be applying property testing.

A significant part of the reason this technique is rare is that property testing is rare. To meaningfully compare two implementations, you have to be generating random inputs and comparing their behaviors. That’s a property test!

In truth, we already do use an extremely degenerate version of this technique. The whole idea behind “refactoring should keep your tests green” is just this technique… except that you’re using an example-based test suite. That test suite serves as an impoverished version of the model: a sparser specification of what the behavior should be. Example-based testing falls far short of what is possible here. I feel like I’m failing to come up with a good metaphor for just how much more effective a model is in this situation. It’s the actual landscape itself, instead of a handful of elevation samples.
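Here is a small sketch of that gap, in Python. The functions are hypothetical: `round_old` is an imagined original that rounds non-negative numbers by adding 0.5 and truncating, and `round_new` is a “refactoring” to the built-in `round`, which rounds halves to the nearest even integer. The example suite (also made up) stays green for both, while comparing against the old code on random inputs turns up a disagreement.

```python
import random

def round_old(x):
    """Hypothetical original for non-negative x: halves always round up."""
    return int(x + 0.5)

def round_new(x):
    """Hypothetical refactoring: built-in round() sends halves to the even neighbor."""
    return round(x)

# A handful of "elevation samples": both versions pass every one.
EXAMPLES = [(0.0, 0), (1.2, 1), (2.7, 3), (3.5, 4)]

def examples_pass(func):
    """Run the example-based suite against one implementation."""
    return all(func(x) == expected for x, expected in EXAMPLES)

def first_disagreement(old, new, trials=500, seed=0):
    """Compare old and new on random multiples of 0.25; return a differing input."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(0, 1000) / 4  # exactly representable as a float
        if old(x) != new(x):
            return x
    return None
```

Both `examples_pass(round_old)` and `examples_pass(round_new)` are true, yet the two functions disagree on inputs such as 2.5, where the old code gives 3 and the new code gives 2. Whether that difference matters is a judgment call, but only the comparison against the old code surfaces it.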

Non-ephemeral models look hard to create.

Many tools and techniques are adopted by trying them out lightly at first. Traditional presentations of model-based testing are not easy to adopt that way. As a result, I think many people find it hard to get started and understand the benefits.

In order to test against a model, you first have to create a model. That can look like a lot of up-front work, before you can even begin to benefit at all. And worse, it looks like silly work when trying to apply the technique to a toy problem, to get a feel for it. “I’ll implement a queue and then… uh… implement another queue, I guess? To test against?” It’s easy to decide this isn’t a promising approach.

I think the ephemeral approach is important here, because that’s an extremely common situation with a perfectly good model already right there! There’s no more up-front work to do before you can start to apply the technique and see the benefits.

Mutating the code can make it hard to test.

The common case here is changing existing code. Writing tests against two different implementations is hard when you’re changing an existing implementation. You don’t have two separate things to compare anymore: the old one got turned into the new one. This is a serious impediment to adopting the ephemeral approach.

I think this could be solved with tools. We should be able to write property tests against past versions of the code we’re changing. A novel testing harness could allow us to write a test like:

module.func(input) == original.module.func(input)

And the tooling could take care of figuring out how to run both the new and original code. (For instance, by keeping around a build of the last commit, or by obtaining a specified version from an artifact server.) Unfortunately, I’m aware of no such tools.
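Short of real tooling, a rough approximation is possible by hand for the simplest case. The sketch below is Python and rests on several assumptions: the code under change is a single-file module tracked in git, the test runs from inside the work tree, and the module’s own imports have not changed between versions. The helper names `module_from_source` and `load_original` are hypothetical, not part of any existing tool.

```python
import subprocess
import types

def module_from_source(name, source):
    """Build a throwaway module object from a string of Python source."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module

def load_original(path, rev="HEAD"):
    """Load a single-file module as it existed at a git revision.

    Assumes the current directory is inside a git work tree and that
    `path` is given relative to the repository root.
    """
    source = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return module_from_source(f"original_at_{rev}", source)
```

With something like `original = load_original("mypkg/module.py")` (a made-up path), a property test can assert `module.func(x) == original.func(x)` on generated inputs. This falls well short of the harness described above: it handles neither multi-file changes nor compiled languages.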

We don’t like the ephemeral nature.

To write tests against an old implementation, the old implementation has to be around. Even if we wrote the new code side-by-side instead of modifying the old code, we generally want to do away with that old thing when we’re done. So to remove that old code, we’d have to remove the tests!

I don’t think people like that. It’s nice to be able to point to your commits and say “that’s the work I did.” It’s not so nice to wave your hands about and say “I did a lot of ephemeral testing there, and I have nothing to show for it but less buggy code! Take my word for it!” This is a big part of the reason I think test suites have caught on so well: not only is automated testing great engineering practice, but you also get to see and show off the work you’ve done.

This is again an area where I think better tooling can help. We could commit that test we wrote above, and let the tooling figure out what original.module.func is. This would allow us to show what we’ve accomplished.


Model-based testing is an effective form of property testing, and we can get models for free by using the past versions of the code we’re changing or replacing. This approach isn’t applied as often as it should be, perhaps because:

  1. Property testing isn’t done as much as it should be. (Learn it!)
  2. Non-ephemeral models take up-front work to create.
  3. It’s hard to test against the old version of changed code.
  4. We prefer enduring testing over ephemeral work.

Better tooling could solve these last two problems.

But if you’re replacing rather than changing code, and in a way where you aren’t throwing away the replaced code, you probably should be applying this technique today. Perhaps you’re developing a library as a replacement for another existing library. In this situation, you should be able to write property tests to ensure that behavior between the two libraries is consistent.

End notes

  • This is not the first technique I’ve discussed where I’ve lamented the lack of good tools to apply it. On the one hand, it irritates me that I’m telling you “this is a good approach, too bad you don’t get to use it!” On the other hand, maybe it’s a good sign. If I’m giving good design advice, maybe I should be running into missing tools… otherwise this stuff would already have been widely known and applied, right?