Safely repeating failed calls

Last week we looked at the inherent need for timeouts when two parts of a distributed system are communicating. The possibility of failure unfortunately leads us to an ambiguous state: did the action happen, or not? What do we do next?

Generally, this failure happens “in the small.” That is, we are perhaps trying to accomplish some overarching request that the system as a whole should respond to, but we’ve experienced a failure of some sub-request that one piece of the system is sending to another piece in pursuit of that larger goal. We then have to figure out how to get the pieces to behave in a manner that leaves the system as a whole behaving in a sensible manner.

In a way, this is one of the hardest problems in computer science. Or at least, it is in some parts of the system. Fortunately, many of us aren’t tasked with designing a database, we’re tasked with using a database. For us, the problem is a bit simpler.

In the simple case, a failure can generally be responded to in three different ways:

Throw a wrench in whatever’s going on, and raise an even larger failure. This can be less extreme than it might sound, because instead of the system as a whole falling over, we might just be propagating a failure of a sub-request to a failure of the overall request being handled. Perhaps paradoxically (considering we’re trying to be resilient to failures) this can be the right thing to do most of the time! (More on this, next week!)
Graceful degradation. Sometimes, a sub-request isn’t actually essential, and so the failure can be ignored. A classic example is something like Google search: if the sub-request for ads fails, you do always have the option to just display search results without ads.
Repeat and retry the request. But… in order to safely repeat the request, don’t we need to know something about the state of the system? How do we know that’s the right course of action?

Idempotence

An idempotent function is one that can be applied multiple times to no extra effect: f(f(x)) = f(x). The classic example is absolute value: ||x|| = |x|.

On the one hand, this gives us an idea: if the request is idempotent, we can always safely retry! Even if we’ve already applied such an operation (who knows? Not us, our earlier request timed out!), then applying it again will safely lead to the same outcome.

But on the other hand, this concept should (initially) also leave us a bit confused. Our functions have state, so what does idempotence even mean? We don’t have f(x) = y, we are looking at f(x, s) = (y, s') and we don’t even control the value of s when we make a call!

Well in truth, we’re waving our hands a tiny bit here. The usual meaning of “idempotent” on stateful functions is that we take f(x, _) as the operation being performed, and we’re ignoring the actual return value y, focusing only on this operation’s effect on state.

So an idempotent operation is one that leaves system state unmodified in the situation where that operation has already happened. Let’s look at what this means for a typical CRUD app:

Create operations are nearly always non-idempotent. We’re going to talk about this some more below, because obviously we’d like to fix that somehow. Otherwise, we don’t know how to handle a timed-out request.
Read operations are always idempotent, by virtue of their never affecting state at all.
Update operations are usually considered idempotent, but this too is a place where we might have concerns… let’s come back to that.
Delete operations are (almost) always considered idempotent.

We do need to be sure that ids/keys aren’t going to get re-used to consider a “delete” operation idempotent. For modern databases, this can generally be assured, as 64-bit ids or UUIDs won’t be re-used, and even in cases of 32-bit ids, the database is usually designed to start raising errors about key exhaustion before it tries to re-use an id.

But cases do exist where id re-use is a concern. Process IDs on Linux, for example, are limited to 32768 by default, and so PIDs get re-used on the regular. Fortunately, there’s an easy solution: accept that sometimes things will just randomly break for no good reason, as computers are haunted by angry ghosts.

I mean, you can try to raise the PID limit (up to a mere 4 million, with an m), but you’re probably just going to discover why the default is what it is. See above, re: haunting.

They… they probably didn’t tell you about the ghosts, did they? This is fine.

Delete operations also raise an interesting problem with this classification. If we classify operations as idempotent solely based on their effect on state, what about the operation’s actual return value? An initial delete operation will return success, but a retried delete operation (that was previously secretly successful) may return an error about the key id not being found!

So on the one hand, we called this operation idempotent for its ultimate effect on system state, but on the other hand, the effect of re-trying may actually not get us the exact same response we would have gotten the first time. But… at least the system didn’t move into an indeterminate state. And “retried delete operation returns non-existent key error” is actually just one special case to handle in your framework. That is, if you don’t just decide that trying to delete something that doesn’t exist is just… successful.

Dealing with creation

One amusing part of the early internet was the little bit of text that appeared next to the “submit” button (of course it was just labeled “submit”) when ordering from an online store: “Don’t click submit more than once!!”

One of the problems these early websites were dealing with was the non-idempotence of creation. Creating a new order could happen multiple times if you clicked multiple times. Don’t want to deal with angry customers getting multiple shipments and getting charged multiple times? Obviously, just admonish them not to click mistakenly, of course!

Since the 90s, everyone (okay, everyone sensible, because I’m sure somewhere out there…) has learned to handle this problem. This is the “shopping cart” model for placing orders.

The trick to the shopping cart model is that the cart itself is just an order that’s in a “not yet actually ordered” state. Actually ordering the contents of the cart involves merely updating the order’s state.

update orders set state = 'ordered' where id = 123 and state = 'cart'

I’m simplifying (because when is anything this simple), but the idea is the always same: a cart can’t be ordered more than once by construction. So if we receive a request that says “okay, order cart 123” multiple times, then after the first one, we can reply with “yep, that cart was ordered, you’re good to go.”

Now the only non-idempotent action involved is initial customer arrival. If we don’t have a cart, we need to create one. But despite the lack of idempotence, this is a completely safe action: if we’ve secretly created a fresh cart and didn’t know it, who cares if we just replace it with a new fresh cart?

Are updates always idempotent?

The basic form of update in a CRUD app is generally considered idempotent:

update table set value = $newvalue where id = $id

Repeated runs of this operation will result in the same outcome. However… not all updates are this straightforward.

1. We may be doing some kind of atomic operation.

update table set counter = counter + 1 where id = $id

In this situation, retries may be inappropriate, as this kind of update is not idempotent.

2. We might worry about what other operations have been interleaved.

Sure, we can set value = 1, but what if we do that, time out, someone else set value = 3, and then we re-try our set value = 1 operation. We just clobbered their update. Isn’t this a bad outcome?

Maybe! This is actually domain-specific. If you’re updating your preferences from multiple windows simultaneously, what do you expect to happen? If you’re doing something more complicated, then perhaps there should have been a transaction involved…

3. We might be doing “mini-transactional” updates.

Not necessarily part of a large database transaction, but by verifying expected state before making a change. We see an example in my shopping cart query earlier: and state = 'cart'. This means our update only has effect if we’re in the state we expect, and having taken effect, are no longer in that state. This can help make potentially-fraught update operations safely idempotent. If our state machine cannot get back to 'cart' after transitioning to 'ordered' then we’re good to go.

So the long story short here is “no.” We have to think about our update operations, and decide how to handle retries in pretty much every case. And, we’ve only really considered updates to a single “entity” here: as soon as we have to coordinate changes to multiple entities (calling for transactions, probably), updates are more likely to be non-idempotent.

Synthetic idempotence

Part of the reason idempotence gets such special attention for handling retries is that essentially anything can be made idempotent. The trick is to track a generated “request ID” of some sort, and then reply accordingly if request ID is repeated. With this, even relatively “unsafe” creation requests can be forced into idempotence:

POST /users

request_id=403926033d001b5279df37cbbe5287b7c7c267fa&name=tedinski

If a request is repeated with the same ID, the creation action isn’t taken, and perhaps a duplicate “success” reponse can be put together. In a request like the above, we have enough information to find the tedinski user and reply however we find appropriate.

Sometimes there are natural notions of a request ID: the shopping cart model is an example. So are most blog posts. When I make posts on Patreon, the post is created immediately in a draft state before I’ve typed a single letter. The actual act of publishing a draft likewise becomes an update instead of a creation.

But if all else fails, you can create a totally synthetic request ID, as I did with my mini user creation example above. It might be annoying to have to remember what IDs have been used, but you can just tie their lifetime to user sessions. When every request is idempotent, every time out can be retried.

End notes

Next week, the plan is to address why I think “just fail” is more common than “retry,” despite our desire for resiliency.