Last week we looked at how property testing can impact design. The biggest impact is simply ensuring that we think about abstractions in terms of their properties. Of course, it also gives us more powerful test suites that involve quite a bit less code. (And that can matter a fair bit, since test suites against non-system boundaries can be considered minor technical debt, so it’s especially good to reduce their size.)

But last week, we largely looked at property testing from a functional programming perspective. This week, I’d like to convince you it applies equally well even to imperative designs.

Diving right in with an example

I happen to have a tiny little C API I’m working on right now:

int buffer_init(struct buffer* buf,
                int alignment,
                unsigned long long initial_size_bytes);
void buffer_destroy(struct buffer* buf);
int buffer_alloc(struct buffer* buf,
                 unsigned long long size_bytes,
                 void **out_ptr,
                 unsigned *out_index);
void *buffer_index(struct buffer *buf, unsigned index);

The idea behind this API is pretty much just to talk about allocated space in terms of indexes into a buffer, rather than pointers to arbitrary memory. Above we see functions to create/destroy a buffer, as well as to allocate space within it and index into it.

This API is small enough that I’m unlikely to discover (from property testing) any flaws with how I’ve designed the above function prototypes. That is something that can happen with larger APIs, especially if it’s the first time you’re really thinking about them in terms of their properties. (Re-examining these prototypes just now, we might question whether out_ptr should be there, but we’re unlikely to answer that question merely from property testing; that’s a question about how it is most commonly used.)

But there’s still enough here that we could discover flaws with its design. The probable value we’ll get here is checking assumptions: what can be or not be null, how big/small/negative of values are acceptable? And the biggest of all: error checking.

Well, and we’ll test the API. That’s nice too.

Someday, I’ll have to get around to talking about foreign function interfaces, but for now here’s a Python blob to talk to this API that I’m going to drop here partially unexplained:

from ctypes import *

lib = cdll.LoadLibrary("./libbuffers.so")
lib.buffer_index.restype = c_void_p

class Buffer:
    """Wrapper for the C buffer API. Can use ``with`` python syntax."""
    
    def __init__(self, alignment, initial_size):
        self.buf_ptr = create_string_buffer(24) # struct 24 bytes, fixme
        ret = lib.buffer_init(self.buf_ptr, alignment, c_ulonglong(initial_size))
        assert ret == 0
    
    def alloc(self, size_bytes):
        """Allocate a new block of at least ``size_bytes`` bytes."""
        ptr = c_void_p()
        idx = c_uint()
        size = c_ulonglong(size_bytes)
        ret = lib.buffer_alloc(self.buf_ptr, size, byref(ptr), byref(idx))
        assert ret == 0
        return (ptr.value, idx.value)
    
    def get_index(self, index):
        """Assuming a 64-bit integer at ``index``, read a value."""
        p = lib.buffer_index(self.buf_ptr, index)
        v = cast(p, POINTER(c_longlong))
        return v[0]
    
    def set_index(self, index, value):
        """Assuming a 64-bit integer at ``index``, write ``value``."""
        p = lib.buffer_index(self.buf_ptr, index)
        v = cast(p, POINTER(c_longlong))
        v[0] = value
    
    def __enter__(self):
        return self
    
    def __exit__(self, type, value, traceback):
        lib.buffer_destroy(self.buf_ptr)
        self.buf_ptr = None

This blob allows us to write Python code interacting with my C API quite nicely:

with Buffer(3, 0) as buf: # 3 means 8 byte aligned, 0 is default buffer size
    (_, i) = buf.alloc(8) # 8 bytes is one 64-bit integer
    buf.set_index(i, 23)
    print(buf.get_index(i)) # 23

This capability is important, because the property testing library I’d like to use is Hypothesis for Python. I’ve generally found this technique (writing property tests in Python for C code) to be pretty approachable. (I have previously tried writing property tests in Haskell for C code, and this also worked well but was not as friendly. Also, I think Hypothesis might be the best property testing library I’ve used. I have heard very good things about property testing C code from Erlang, but I haven’t tried it myself. And honestly, Hypothesis has impressed me enough I might not get around to trying anything else for awhile.)

Now we can write our first property test:

from hypothesis import given
import hypothesis.strategies as st
import unittest

class Test1(unittest.TestCase):
    # 0-12 are all the acceptable alignment values (1 byte to 4096 bytes)
    # initial_size is anything
    # repetitions under 10,000 because that takes time to run
    # total_size under 2.8 GB because this VM only gets given 6 GB
    #   also, that's bigger than 2147483648 = 2 GiB,
    #   which is negative for a 32-bit integer, a boundary
    @given(st.integers(min_value = 0, max_value = 12),
           st.integers(min_value = 0, max_value = 1000000),
           st.integers(min_value = 1, max_value = 10000),
           st.integers(min_value = 1, max_value = 2800000000))
    def test_block_repetition(self, alignment, initial_size, repetitions, total_size):
        if total_size < repetitions:
            total_size = repetitions
        block_size = total_size / repetitions
        with Buffer(alignment, initial_size) as buf:
            for i in xrange(0, repetitions):
                buf.alloc(block_size)

This property test looks pathetic. It’s not.

Let’s pause for a second to explain what’s going on here. This class is pretty much just an ordinary Python unit test. Hypothesis lets us annotate a test function with given, where we specify how to randomly generate values to call that function with. So test_block_repetition gets called a few times with random values, and if there are no assertion failures, we’ve passed. And all our function does is… create a buffer and repeatedly call alloc a few times.

It looks like we’re testing nothing, but that’s because we’re not seeing all the asserts that are littered throughout the code, checking invariants every step of the way. And that’s quite powerful. Let me put it this way: I had unit tested this code manually before creating this property test. This property test found at least 7 new problems. Some were bugs, some were design errors.

But I didn’t start with this property. The usual process is to start with something trivial, and slowly augment it. My first property looked something like this:

class Test1(unittest.TestCase):
    @given(st.integers(min_value = 0, max_value = 100))
    def test_block_repetition(self, repetitions):
        with Buffer(3, 128) as buf:
            for i in xrange(0, repetitions):
                buf.alloc(8)

This was pretty simple, and served to ensure I got everything working. If you imagine all the ways an API might be used as if it’s a space (or a map of some territory), this is a very tiny dot in that space. It’s a bigger dot than a single unit test, but not by much. What we want to do next is “expand that dot” outwards. Or in other words, exercise the API in more ways.

One neat and surprisingly helpful technique is to make tweaks that will cause failures. This helps ensure we’re able to spot those failures successfully with this test. It also tests us: we ensure our reasoning is correct, and a tweak we expect to fail, in fact does. Both testing the test, and testing the tester, as it were.

Starting with that simpler property test, here’s a rough timeline of what happened while I property tested this API:

  1. After ranging over all possible alignments, I discovered an allocation issue that was caused by a calculation that was just wrong in some cases.
  2. After ranging over initial sizes, I realized that my code accepted 0 and infinite looped. Missing error check. And thinking on it more, I added a minimum size, and decided that 0 should not actually be an error, just a shorthand for the default minimum size.
  3. I also discovered some magic initial sizes that caused the buffer growth calculation to go pathological with an off-by-one error.
  4. I discovered several erroneous asserts. I have an unfortunate tendency to error on the side of asserting instead of returning an error code. During property testing there were at least 2 of these that I converted to return error codes instead of crashing.
  5. After allowing very large buffers (over 2 GB), I discovered an overflow error in my C code, and a sign-extension error in the test suite. (A 32-bit int went negative and then got sign-extended into a very, very large unsigned long long.) While diagnosing the bug in the test suite, I realized this was also a missing assert in the C code. With that check added, my code is now better able to catch another common class of error.
  6. While deliberately making the property test do bad things to see it crash, I discovered several more missing asserts that would better help report erroneous behavior, instead of just crashing in weird ways.
  7. Adding in those asserts on one occasion discovered another bug.

Overall, I’d say this exercise taught me a few things:

  • The synergy between property testing and invariants (assert) cannot be overstated. It’s powerful. A trivial exercise of the API found many problems.
    • The interplay went back and forth. Sometimes I’d expand the space of inputs to test more. Sometimes I’ve leave the property alone and add more asserts to test more.
  • Every time the property test caught a bug, it was found 100% of the time I ran the test.
    • You don’t find new bugs by generating more instances, usually, you refine the test so it’s able to exercise more of the interaction space.
  • In retrospect, some of the design mistakes I caught could have been caught by writing up careful documentation, too.

This approach also had caveats:

  • Given the synergy with asserts, I wanted to enable compiler sanitizers that can spot undefined behavior and pointer mis-use. Unfortunately, these can’t be enabled for just a shared library, the main executable has to be built with them, which is hard when that’s… Python.
  • When my C code would assert, it would just crash. This meant that fairly often Hypothesis’s fancy test-case minimization was useless. But, at least it was still all very easily debugged since gdb python took me straight to my assertion failure. And maybe there’s a solution to that problem, I just didn’t look yet.

Ah, well.

State machine testing

So, that (single!) property test was pretty effective, but it also didn’t let us really check the behavior of the API too deeply. It still looked a bit functional: we just generate inputs to a single function, and all we checked was that it didn’t raise an exception. That function happened to do something stateful inside, but what about directly testing stateful APIs?

We can do that too, and pretty easily too, thanks to Hypothesis’s support for rule-base state machine testing. Open that up in a background tab, and after you’re done reading this post, if I’ve piqued your interest, read that page over too. There’s a lot of neat things in there.

from hypothesis.stateful import Bundle, RuleBasedStateMachine, rule, invariant

# mildly surprised Hypothesis doesn't have something like this already?
st_int64 = st.integers(min_value=-9223372036854775808, max_value=9223372036854775807)

class StatefulTest(RuleBasedStateMachine):
    def __init__(self):
        super(StatefulTest, self).__init__()
        self.buf = Buffer(3, 0)
        self.model = {}
    
    indexes = Bundle('indexes')
    
    @rule(target=indexes)
    def alloc(self):
        (_, i) = self.buf.alloc(8)
        self.buf.set_index(i, 0)
        self.model[i] = 0
        return i
    
    @rule(i=indexes, v=st_int64)
    def assign(self, i, v):
        self.buf.set_index(i, v)
        self.model[i] = v
    
    @invariant()
    def equal_values(self):
        for i in self.model.keys():
            assert self.model[i] == self.buf.get_index(i)
    
    def teardown(self):
        self.buf.__exit__(None,None,None)

This property test does something really cool. First, it operates on two things simultaneously: the system under test, and a model that specifies the correct behavior. Here, we’ve narrowed our conception of the C API to just consider reading/storing 64-bit integers. So we compare doing that on the C API with a simple model using a Python dictionary.

To run this test, Hypothesis generates sequences of alloc and assign calls (because these are annotated as rules), and in between each call it checks the equal_values invariant. To generate sensible calls, we make use of Bundles. What we see happen above is that every time alloc is called, we get a new value in indexes that can be validly used in assign later on. This mechanism allows us to thread values from earlier allocs to later assigns.

So now we’re doing something much more involved than the previous test, and actually storing/reading data from the buffer. Fun fact though: the previous property test was so effective, this one found nothing new for me. Yet.

Homework: I was hoping to have my code in shape to release, but not yet. So, just reading the above test to yourself (and not being able to run it), how might you attempt to expand it to better exercise more of the API?

I can come up with 4 different, simple ideas to improve this property test. And there’s probably more.

That word “engineering”

If you’ve never seen Jepsen before, I recommend checking it out. Perhaps the analysis of Cassandra 2.0.0.

Jepsen is a very sophisticated form of the rule-based state machine testing we saw in the last section. It seems pretty specialized to testing databases, and focuses on checking the kinds of properties that distributed systems are supposed to have. Like not claiming writes were successful when they weren’t. There are at least two major innovations beyond ordinary property testing:

  1. Fault injection into the distributed system under test. Interrupting networks, killing processes, and so forth.
  2. Awareness of the re-ordering possible with concurrent operations. Essentially, testing to see if any reasonable ordering of the actions taken match the observed output. Concurrent systems can have different valid inter-leavings.

The amazing thing about Jepsen is that analyses routinely found very short interactions with real systems that reliably caused data loss and corruption. Like, data corruption in databases you probably have used. Are using. Really scary stuff, demonstrated in really simple tests, found by randomized property testing, in supposedly stable and production ready systems. Oops.

There’s sometimes a psychological complex that software developers suffer from related to the word “engineering.” (Likewise, jokes from physicists about fields with “science” in their names.) Depending on whether you’re “half-full” or “half-empty” sort of person, this is either a fantastic example of how real databases that we rely on are commonly broken, or a fantastic example of the kind of testing we’re able to subject real databases to. I prefer the latter point of view. Sure, we have to do better at getting these practices adopted, but darn it, we do know how to do it. And they are impressive feats of engineering.

Property testing is some real good stuff.

End notes

Today, I ended up with a few notes to self about things I should write in future posts:

  1. Design and assertions, in more depth.
  2. Foreign function interfaces, because cross-language programming might be under-used as an overarching design technique.
  3. Last week we saw amazing synergy with types and properties. This week, invariants and properties. Someday I’d like to see if I can find a good example of putting all three together.

I meant to open source this code, but it’s tangled up a mess of library that’s not in releasable shape. I intend to make this available so the code shown in this article is runnable, er… someday.

Someone is writing what looks to be a good book about property testing in Erlang, you might be interested in. I kinda hope they’ll add a section about using the Erlang FFI to talk to C code.

It’d be nice to point to a good-quality, real-world, open-source example of property testing stateful APIs. I know these exist, but the ones I know about aren’t public.