Last week I kicked off thinking about designing for security by looking at many of the common ways people can be misled. A big part of good design (I’m realizing as these ideas take more concrete shape) is fighting against our own less-than-helpful instincts and biases. This week, I want to focus more on the positive side: what can we actually do during the design process to improve software security?

And it is a process. Things like security shouldn’t be an afterthought, but I think it’s a mistake to think of this as “up front” design. We’re always doing design as we go in building and refactoring systems.

Security is an economic affair

One of the points I made last week is that security is economic. To build secure software, our goal is to:

  1. Increase the cost or risk associated with making a successful attack.
  2. Reduce the value or reward for making a successful attack.

Almost everything follows from this general idea. I emphasize this in part because this is a mistake I used to make all the time. If we think about security the wrong way, it’s easy to end up lost in the weeds. The last post consists largely of questions I used to struggle with, or that I’ve seen other people struggle with. I constantly see people object to a security practice because it’s not literally perfect. There are a lot of places where we can almost — allllmost — but not quite get something like a perfect guarantee. Maybe the concern is compromised CPUs or backdoored compilers (as in the famous “Trusting Trust” paper). It seems like a really cool thing to think about, and maybe really important!

And it is cool, but as a realistic possibility, it’s a waste of time to plan against. These exotic possibilities are really expensive to pull off and have enormous risk associated with actually using them (as that may lead to their discovery). Not to mention the continuing passive potential of discovery, and the political and economic blow-back big backdoors might cause. To date, we think maybe the NSA deliberately weakened an older crypto standard in a way we still haven’t figured out how to actually use (as far as I can tell). The result has been a huge loss of trust in anything the NSA says, deeply compromising the part of their mission that tries to actually improve the state of software and national security. I’d bet they regard this as a fuckup on their part.

When we think about a system from a security perspective, we want to think in terms of risk. A system consists of a set of isolated parts. Each part has a value to an attacker, as well as a cost or risk, which I generally think of as a multiplier on the value. High-risk parts of the system have high multipliers; hardened parts have low multipliers. Our total risk is the sum of the parts… assuming we got the parts right.

risk = sum(x.value * x.cost_factor for x in system)

Actually putting numbers into this equation is silly. This is just about modeling: understanding how risk changes. Usually we care about risk at some threshold: at some point, it’s money wasted, just buy insurance. And sometimes there are terms with negligible cost_factor: most of us just trust Microsoft, for instance, because their security team is at least as good as ours, and if they actively ship malicious code to us, we’ll just sue them. The legal system usually forms a good baseline of what to worry about first.
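To make the model concrete, here’s how splitting out and hardening a high-value part moves the sum. All numbers are invented; the attribute names match the equation above.

```python
from types import SimpleNamespace as Part

def risk(system):
    # Same formula as above: total risk is the sum over the parts.
    return sum(x.value * x.cost_factor for x in system)

# One monolithic part: all the value behind one weak multiplier.
monolith = [Part(value=100, cost_factor=0.5)]

# Split out the high-value piece and harden it (low multiplier),
# leaving the rest as a lower-value part with the old weak links.
split = [
    Part(value=80, cost_factor=0.05),  # isolated, hardened part
    Part(value=20, cost_factor=0.5),   # everything else
]

print(risk(monolith))  # 50.0
print(risk(split))     # 14.0
```

The absolute numbers mean nothing; the direction of the change is the whole point.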

From this extremely simplistic model, there are a few more bits of understanding we can glean:

  • The risk/cost factor for any part in the system is the weakest vulnerability in that part. If you put 10 deadbolts on the front door, but leave the back door unlocked, you’ve changed the cost not at all.

  • If the system is just one part (i.e. one bug and everything is exposed), then you’re playing a losing game. As a software-involved product or service becomes more successful, there’s generally both more value in compromising it and more code… which due to the “weakest link” rule means the cost of compromise is likely going down. So both value and cost_factor are going up at the same time. Yikes.
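The weakest-link rule slots straight into the toy model: a part’s cost_factor is effectively the minimum over its individual entry points. Numbers invented, purely for illustration.

```python
# Each entry is the relative cost of breaking one "link" (entry point)
# into a part. The part is only as strong as the cheapest link, so the
# deadbolts don't help while the back door stands open.
def part_cost_factor(link_costs):
    return min(link_costs)

front_door = [10] * 10   # ten deadbolts
back_door = [0.1]        # unlocked

print(part_cost_factor(front_door + back_door))  # 0.1
```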

There are three general strategies for managing this risk equation.

  1. Find a high-value piece of the system and break it out as its own separate part, isolated from the rest of the system, so it becomes another term in the equation. Then find a way to reduce its scope and attack surface, and harden it. This approach takes a high-value target and mitigates it with high confidence that the module is secure and difficult to attack. (i.e. A low cost_factor multiplier on that value.)

  2. Find a high-vulnerability piece of the system and break it out as its own separate part (and term in the equation). Because the cost factor of a module is based on the weakest link, isolating suspected weak links in low-value parts mitigates those vulnerabilities. For example, font parsing should be done in a locked down container, not in the kernel. (Thankfully now true in Windows 10…)

  3. Delete value. Consider the balance of value to you, and value to attackers. If something has modest value to you, but the aggregate is a pile of treasure to attackers, you’ve got a good candidate to just get rid of it. One way to delete value while retaining it in practice is to use encryption. TLS or SSH encrypt all your network traffic, and suddenly the value of a lot of in-between systems falls to nearly zero. In general, encrypt data in motion (network) and data at rest (disk), and you’ve eliminated the value of many tertiary pieces of the system for attackers. This is also the basis of good password hashing: the original high-value passwords are essentially destroyed, but we’re able to preserve 99% of their utility to us (authentication) in their hashed forms.
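Password hashing, the last example above, can be sketched with nothing but the standard library. PBKDF2 is used here purely for illustration; memory-hard functions like scrypt or Argon2 are the stronger modern choices.

```python
import hashlib
import hmac
import os

def hash_password(password: str):
    # A per-user random salt makes precomputed (rainbow) tables useless.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("hunter2")
print(verify_password("hunter2", salt, digest))  # True
print(verify_password("guess", salt, digest))    # False
```

The stored digest is worthless to steal on its own, yet authentication still works: the value was deleted, the utility preserved.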

In general, the first two strategies are trying to employ privilege separation. In order to use strategy #1, you have to give that part of the system privileges that no other part of the system has. Those privileges are what make that part high-value. In order to use strategy #2, you have to be able to take away as many privileges as possible, so the value of that part is correspondingly low.

All of these strategies are about the design of the system.


Sandboxing

In my opinion, the most significant change in software security in the last 20 years is sandboxing. When it comes to shedding privileges, the most basic form of separation we had in The Bad Old Days was between root and users (and of course between users themselves).

The modern era has brought a lot more, and the associated benefits. Firmware and TPMs are able to protect themselves better from malicious software. They’re even able to protect against physical access, which used to be regarded as a no-win scenario. (Sure, arbitrarily expensive attacks are still possible, but remember: economics. Cost. Apple pretty much stopped a smartphone theft epidemic in its tracks by tethering hardware to an iCloud account. Steal a phone? Maybe you can sell a couple of its parts. Otherwise? Useless.)

But by far the most important innovation has been sandboxing. The ability to further drop privileges to the point where arbitrary running code poses no threat (even to the user it is running as) has had a profound effect on the design of secure software. It’s probably no exaggeration to say it’s the foundation of modern software security. Without sandboxing, strategy #2 cannot even be employed, except in certain specialized situations. (e.g. Some distributed systems, or cases where you can run the software as a separate user account, such as httpd.)

The trouble is that in the Bad Old Days, root/user separation was often seen as a major advantage. But today, especially for desktop security, just compromising a user account is actually game over. Who cares if you can’t get further access when all the juicy stuff is owned by the user? Ransomware exemplifies this perfectly.

Once upon a time, it was obvious Linux distributions offered better desktop security than Windows. That may well no longer be true, or if not yet, then soon.

What’s changed? Windows 10 is bringing aggressive adoption of sandboxing to many of the dangerous activities of desktop users—especially media file handling (fonts, video, audio, etc). Linux has moved on this… very little, as far as I can tell. The kernel has support, of course, but actual use of that support is lagging. The end result is vulnerabilities in desktop Linux that can’t happen on Windows 10. That “vulnerable in ways the other option isn’t” comparison used to always point the other way. Times change.

Here’s an LWN article for more reading on the topic.

Sandboxing can do a lot of things. Very, very light sandboxing is a great tool for just creating more layers between an application and the user account. If the application removes its own ability to write files outside its own space, then that application can no longer be a vector for ransomware, full stop. Layers are effective at both reducing value and increasing costs. Almost every security-sensitive application should probably be using some form of sandboxing today.

Sandboxing to an extreme can even allow actively malicious code to be run harmlessly. Web browsers employ this heavily. Not only are JavaScript VMs meant to prevent arbitrary machine code execution, but even if they’re vulnerable and exploited, the sandbox doesn’t have many more capabilities than the JavaScript already had. Actual exploitation requires chaining together multiple exploits across multiple layers of the system to get to anything valuable. This raises both costs and risks, since identification of an actively used exploit will result in fixes for every link in the chain, making a lot of work go poof.

Hardening a high-value part

When we employ strategy #1, there are a few techniques that work well:

  1. Reduce the attack surface as much as possible. The smaller and simpler the part of the system holding all the value is, the easier it will be to secure, and the less likely you’ll overlook a major weak link.

  2. Eliminate entire classes of vulnerabilities. Fixing bugs is essential, but it’s not a security strategy. The only thing we can show actually works here is making mistakes literally impossible.

Those are the basics. After that come some more aggressive approaches, with more effort and less payoff. Most notable among those: anything that parses should be fuzz-tested. For parsing code, fuzz testing has a proven track record of finding mistakes. If nothing else, if you don’t do it, your adversaries will.
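As a sketch of the idea only: real fuzzers like AFL or libFuzzer are coverage-guided and far more effective, but even a naive random loop shows the shape.

```python
import json
import random

def fuzz(parse, runs=2000, seed=0):
    # Throw random bytes at a parser. A clean result, or the parser's
    # documented error (ValueError here), is fine; any other exception
    # is a finding worth investigating.
    rng = random.Random(seed)
    findings = []
    for _ in range(runs):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse(data)
        except ValueError:
            pass  # expected failure mode for malformed input
        except Exception as exc:
            findings.append((data, exc))
    return findings

print(len(fuzz(json.loads)))  # json is a mature, heavily fuzzed parser
```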

There are a lot of other approaches to hardening the most sensitive code. Usually the best of these reduce, in some way, to trying to eliminate a class of vulnerability. For instance, outright banning certain dangerous functions (strncpy etc.) can be an effective way to eliminate some kinds of vulnerabilities… if safe alternatives are actually used instead. But usually these are just close approximations.

Actually eliminating a source of vulnerability is the most effective technique we have. This can involve using an appropriate programming language (Rust, Go, Python, etc. vs C means no buffer overflow vulnerabilities). But it also comes down to the design of the libraries we use and of the application itself.

By far the most commonplace use of design to eliminate vulnerabilities is in querying databases with SQL. SQL injection plagued the internet (er… and apparently still does, somehow) because we gave programmers tools to write SQL queries with string concatenation.

What on earth did we expect?

Various ORMs don’t allow this in the first place, and even ordinary SQL querying libraries today decouple a query from the values that should be substituted into it (e.g. the prepared query style select * from t where k = ?). The end result of this stylistic difference in the design of the library used to access a database is complete elimination of a whole class of vulnerabilities.
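Here’s that stylistic difference with Python’s built-in sqlite3 module; the toy table and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k TEXT, v TEXT)")
conn.execute("INSERT INTO t VALUES ('admin', 'secret')")

k = "x' OR '1'='1"  # attacker-controlled input

# String concatenation: the input is parsed as SQL and dumps every row.
leaked = conn.execute("SELECT v FROM t WHERE k = '" + k + "'").fetchall()

# Prepared-query style: the input is only ever treated as a value.
safe = conn.execute("SELECT v FROM t WHERE k = ?", (k,)).fetchall()

print(leaked)  # [('secret',)]: the injection succeeded
print(safe)    # []: no row has that literal key
```

Same query, same input; the library design alone decides whether injection is even possible.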

Another common problem is command injection vulnerabilities. Sure enough, we have C library functions like system that expect you to concatenate strings instead of building structured data. And worse, shells like bash actually offer no way to directly call execv on an array of arguments, for example, which might have offered a style you could adopt to eliminate the problem. Even worserer, some kinds of commands like bash -c 'command "one arg"' require collapsing any structured list of arguments into a properly quoted string in a single argument, so we can’t even universally adopt a reasonable convention of using more structured data. At least, not easily.

The best approach is probably just to ensure untrusted data never makes it to a command in the first place.
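The same structured-vs-string contrast shows up in Python’s subprocess module (the filename here is an invented example): an argument list goes to an execv-style API, while shell=True hands a string to a shell parser.

```python
import subprocess
import sys

untrusted = "file.txt; rm -rf /"  # attacker-controlled (invented example)

# Dangerous: subprocess.run("cat " + untrusted, shell=True) would hand
# the whole string to a shell, which parses the ';' as a second command.

# Safer: with an argument list, the untrusted string can only ever be a
# single argv entry, never shell syntax.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # the ';' arrives as data, not as syntax
```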

The effect of poor design

Let’s pick a couple more examples of poor design leading to vulnerabilities.

Curl had the CURLOPT_SSL_VERIFYHOST=2 problem. For one thing, it was odd that this option had to be touched by anyone at all. The default should have been (and I think was?) secure, so no one should have had reason to change it. It may be a documentation problem that it got touched so often. (Or perhaps I’m wrong and Curl used to not verify by default… a major mistake.) But there’s no real excusing a seemingly boolean setting whose only acceptable answer is a strange… 2. Multiple languages let people pass in true and translate it to 1, which had completely broken behavior. Modern Curl now raises an error on 1, making this value a real boolean parameter… with acceptable values 0 or 2. Huh.
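For contrast, a hypothetical sketch (none of these names are Curl’s actual API) of how a typed enum removes the 1-vs-2 ambiguity entirely:

```python
from enum import Enum

# Hypothetical API sketch, not Curl's. An enum-typed option can't be
# mis-set by passing true and hoping for the best.
class HostVerification(Enum):
    OFF = "off"
    FULL = "full"  # no magic 1-vs-2 distinction to trip over

def set_verify_host(mode):
    if not isinstance(mode, HostVerification):
        raise TypeError("pass a HostVerification member, not a number")
    return mode

print(set_verify_host(HostVerification.FULL))
# set_verify_host(True) would raise TypeError instead of silently
# translating into a broken intermediate setting.
```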

GPG recently suffered from some serious flaws. For one thing, it would happily decrypt a message but report an authentication error along with the secret text. This allowed attackers to take a secret message they couldn’t decrypt, modify it, send it along to their target, and exploit mail clients that decrypted the data, failed to properly notice the lack of a valid signature, and could be made to exfiltrate the decrypted text. This was partially a failure of mail clients, but it was a double failure in the design of GPG. Decrypted text should never have been passed along when the signature was invalid. That’s like saying “I know you’re relying on me not giving you radioactive waste, but here’s radioactive waste along with a note saying ‘this is radioactive waste.’ Obviously what you do with it is your fault!” But a secondary flaw here is that finding that note about the signature failure doesn’t involve checking a return value or something simple. It involves parsing text. Text that’s printed along with attacker-controlled text. I don’t know what to say. GPG is not well designed.
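A fail-closed alternative can be sketched like this (a hypothetical API with toy stand-ins for the crypto; nothing here is GPG’s real interface):

```python
# Hypothetical sketch: the point is the fail-closed shape, where
# unauthenticated plaintext never escapes to the caller.
class SignatureError(Exception):
    pass

def decrypt_and_verify(ciphertext, decrypt, verify):
    plaintext = decrypt(ciphertext)
    if not verify(plaintext):
        # No "here's the secret, plus a note that it's unauthenticated";
        # an invalid signature means the caller gets nothing at all.
        raise SignatureError("message authentication failed")
    return plaintext

# Toy stand-ins: "decryption" is identity, "verification" checks a tag.
ok = decrypt_and_verify(b"tag:hello", lambda c: c, lambda p: p.startswith(b"tag:"))
print(ok)  # b'tag:hello'

try:
    decrypt_and_verify(b"tampered", lambda c: c, lambda p: p.startswith(b"tag:"))
except SignatureError:
    print("refused to release plaintext")
```

The failure signal is an exception, not text to parse alongside attacker-controlled output.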

General hardening

While above are some good things you can do to especially high-value parts of the system, there are also general things you can employ all around:

  • Identify compromises quickly. The less time an attacker has in the system, or the fewer machines they’re able to exploit, the less reward they reap. A good tool here is logging, and finding ways to usefully sift those logs (a whole book, probably, unto itself).

  • Fix and deploy security vulnerabilities quickly. Depending on the kind of software, this is either about cost or value. Equifax got hacked because of an Apache Struts vulnerability they hadn’t patched. For Equifax, this made their cost_factor high, since they were easier to breach. For Struts, this made the value of a vulnerability high, since juicy users didn’t deploy updated versions quickly enough.

  • Reduce privileges wherever you can; use layering and decomposition. All too often, everything is still structured so that a single bug gives away the kingdom. Just doing some decomposition is better than nothing.

  • Use secure defaults. I still can’t get over Tesla’s completely open Kubernetes admin panel. It’s not that Tesla is stupid. That thing should never have an unauthenticated mode. Jenkins is guilty of this too.

  • Whitelist; don’t blacklist. Blacklists are almost always incomplete, sometimes inherently so. It’s much safer and simpler to go from the small set you need and grow, than to hope you haven’t missed something huge.

  • Use mitigations. Many of these are nearly free to adopt. (Perhaps some minor performance loss.) ASLR isn’t quite the vulnerability-class eliminator it was originally hoped it would be, but it’s still pretty effective at raising the costs of an attack.

  • Code review; defensive style. The cost_factor of any target is its weakest link. Just culturally enforcing general good practices across the board can help remove some of those weaker links. Pay attention to those return values.
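The whitelist point above, as a minimal sketch (the format names are invented):

```python
# Allowlisting: start from the small set you actually need.
ALLOWED_FORMATS = {"csv", "json"}

def export(fmt):
    if fmt not in ALLOWED_FORMATS:
        # Unknown inputs are rejected by default, including inputs
        # that didn't exist when this code was written.
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"exporting as {fmt}"

print(export("csv"))  # exporting as csv
# A blacklist ({"exe"}? {"exe", "dll", "sh"}?) has to enumerate every
# bad input, including future ones; it starts out incomplete.
```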

This is hardly a text on software security

There is so much I’ve said almost nothing about, and many things I’ve only vaguely mentioned. I wanted to give a shot to making the biggest points that most affect general software design. I have missed many things, I’m sure. If you’re presently mad I didn’t include something important, @ me yo.

  • (2018-7-26) Google Testing Blog had a post about making interfaces harder to misuse. Somewhat relevant.
  • (2018-9-2) One important concept I missed is that a security bug (and maybe bugs in general) is rarely something that happens in total isolation. It’s usually an indication of a larger problem. It’s not my goal to talk much about process, but still… if you’ve fixed a scary bug, it’s usually a good idea to do a postmortem and consider how the bug happened, why it wasn’t initially caught, and whether there could be more, similar bugs elsewhere.