What people said about an OpenZFS bug
A few days ago I wrote about a OpenZFS bug I found and why it went unnoticed. It generated a surprising amount of chatter (at least for me!) which I found quite interesting, and did actually help clarify my thinking on a few things.
For the interested, the main conversations I was reading were on this lobste.rs thread and in the replies to @cliffle’s fedi post. I’ll be summarising some of the more interesting (to me) thoughts and shapes here.
I will say, I really appreciate the quality and especially the humility on display. Granted, I’m very selective about where I hang out online and I don’t generally go anywhere that annoys me more than a very small amount, but I will still pleasantly surprised by how little “git gud” there was in response. So thank you everyone! And if you’re still waiting for a reply from me (including in private email), I promise I will get back to you!
SPONSOR MY WORK
Use OpenZFS? Like this post? Help out by sponsoring my OpenZFS work!
The human side
When I wrote the post it wasn’t totally clear to me what point I was trying to make. I have a much better idea now, mostly because of my own reaction to the things people chose to focus on.
Most of the responses were focused on the bug itself - how it could have happened, what tools or processes or techniques would have helped or hindered. Initially I was annoyed by these; I wanted to yell “ugh, you’re missing the point!”. That in itself was clarifying, because it helped me to peel off the things I thought weren’t the most important thing.
Comparatively few were focused on the fact that it was experienced and competent humans that missed the bug, both in authorship and in review, which in my gut was the most important part. It properly came into view for me upon reading this comment:
I have an aviation background and I agree emphatically with your assertion that “just get good” is not a reasonable answer. If a competent aircrew has a disaster or a near miss, “pilot error” is never the whole cause. There’s a deeper issue somewhere that needs mitigation. Any system that involves humans but can’t tolerate human mistakes is guaranteed to fail.
Now this is not a new idea to me, but it reminded me of how to draw the lines. Modern computers are ridiculously complicated, and getting the most out of them is more than one any human can handle alone. We build tools and abstractions and even whole languages to make this challenge tractable. Those improvements are not evenly distributed though, and when there’s a gap it’s on a person to fill in the gap.
I absolutely agree that we should be using this as an opportunity to improve the system, and I felt much better about the shop talk that followed after that. But still, this fundamentally is why just admonishing people to get better at C instead of at least considering better tools is unhelpful at best (though as I said, I doubt there’s much good faith behind this sort of nonsense).
Improving readability
Before we go into improving the tech, its worth looking at what we could do to just improve readability.
A few people noted that asize
and psize
have maybe smallest difference in appearance possible, and almost anything else would have improved the situation. Some suggested spelling it out longform as allocated_size
and physical_size
, but shorter variations like alloc_sz
and phys_sz
might help too.
I suspect that in isolation this might help, but it brings its own challenges. I am genuinely not sure how often I would spot the difference if they were named differently, because I know when reading code I am usually focused either logic or structure, and this is not strongly either. But someone did point out that at least it would give the pattern matching part of the brain a bit more to consider, and that might create a bit more of a different feel at that point, and I think that is quite possible.
It was noted that there’s also a sweet spot to hit for names; some brains struggle with names that are too long as well. I did also wonder to what extent your native language or even alphabet changes this – are these two symbols more or less similar if, for example, there’s not an implied “split” between a
/p
and size
?
The other challenge is that “asize”, “psize” and “lsize” are terms of art in OpenZFS: they appear everywhere, including baked into the on-disk format, in documentation, in the UI, books, and programmers’ and sysadmins’ heads. That doesn’t mean they can’t or shouldn’t be changed, but rather that such a change may have much more far-reaching consequences than is immediately obvious. Definitely one to tread carefully on!
Wrapper types
In my post I had suggested disambiguating the two values through the type system, since I knew that at least in Rust, that’s something that would have caught it:
struct PhysicalSize(u64);
struct AllocatedSize(u64);
A couple of people suggested doing the same thing in C, namely:
typedef struct { uint64_t a_size; } asize_t;
typedef struct { uint64_t p_size; } psize_t;
This can be done, but it makes them very difficult to work with. For example, initialising them becomes awkward:
psize_t psize = 4096; // error: invalid initializer
psize_t psize = (psize_t)4096; // error: conversion to non-scalar type requested
psize_t psize = { 4096 }; // compiles ok.
They can be initialised on the fly, but you have to write the whole type out to get an initialiser in an expression position (a “compound literal”):
psize_t asize_to_psize(asize_t a);
psize_t psize = asize_to_psize({ 8192 }); // error: expected expression before ‘{’ token
psize_t psize = asize_to_psize((asize_t){ 8192 }); // compiles ok.
Similarly on return:
psize_t asize_to_psize(asize_t a) {
return { 8192 }; // error: expected expression before ‘{’ token
return (psize_t){ 8192 }; // compiles ok.
}
They’re hard to work with, since we no longer get math ops:
asize_t asize = { 8192 };
psize_t psize = { asize << 1 }; // error: invalid operands to binary << (have ‘asize_t’ and ‘int’)
asize_t asize1 = { 8192 };
asize_t asize2 = { 4096 };
asize_t total = asize1 + asize2; // error: invalid operands to binary + (have ‘asize_t’ and ‘asize+t’)
return (asize1 == asize2); // error: invalid operands to binary == (have ‘asize_t’ and ‘asize_t’)
Still, they do solve the problem to a degree, in that they can’t be accidentally used in place of each other:
asize_t asize = { 8192 };
psize_t psize = { asize.a_size << 1 };
return (asize == psize); // error: invalid operands to binary == (have ‘asize_t’ and ‘psize_t’)
psize_t asize_to_psize(asize_t a) {
psize_t p = { a.a_size >> 1 };
return a; // error: incompatible types when returning type ‘asize_t’ but ‘psize_t’ was expected
}
In reality, we would write macros to overcome most of the messy syntax. Would it be worth it to make most of their use more difficult to read and to write, to avoid this shape of bug? It’s hard to say. Definitely maybe, but in deciding these things I imagine myself trying to persuade my colleagues that the tradeoffs line up, and on this one I feel like I would have a very difficult time being convincing.
C typedefs actually don’t define types
Another suggestion was just to make new scalar types:
typedef uint64_t asize_t;
typedef uint64_t psize_t;
Unfortunately, typedef
doesn’t actually create types, but rather, creates a name for an existing type.
asize_t asize = 8192;
psize_t psize = 4096;
return (asize == psize); // compiles ok.
In fact, what we often think of as “core” types like uint64_t
are themselves aliases to some other type, as we can see if we try to cast to a different fundamental type:
uint64_t n;
char *x = &n;
// warning: initialization of ‘char *’ from incompatible pointer type ‘uint64_t *’ {aka ‘long unsigned int *’} [-Wincompatible-pointer-types]
So unfortunately this one is just no help at all.
Detecting unused things
I mentioned that -Wunused
did not detect this, but a few people reported otherwise:
uint64_t asize_to_psize(uint64_t asize) {
uint64_t psize = asize << 1;
return asize; // warning: unused variable ‘psize’ [-Wunused-variable]
}
The thing is, the original function is more complicated, and the variable is actually used. It’s more like this, which compiles cleanly:
uint64_t asize_to_psize(uint64_t asize) {
uint64_t psize = asize << 1;
psize <<= 1;
psize *= asize + 3;
return asize;
}
The actual thing that static analysers find here is a “dead store”; we store a value that is never read.
To me it feels like this should be something a normal compiler warning could point out, but apparently to detect this, you need to do data flow analysis, which compilers generally don’t want to put in the “normal” warnings because they can be slower and less accurate than people would expect. It sounds plausible, but it sure is annoying! I would turn on a warning flag for this immediately if it existed.
Unit testing
A few people pointed out that a unit test would have caught this. The answer, as with everything in this whole saga, is a definite “maybe”.
Unit testing in C is traditionally quite difficult. At least some of that is that unit testing as we understand it arrived long after C had reached maturity, so there isn’t a culture of that style of testing. But I also think a lot of it has to do with how a large C program is constructed and the environments it often lives in.
In my experience, C software is very “visible”: there are few tools for encapsulation and data hiding as we understand them from other languages. Usually to protect from other things messing with the internals of a struct, we have to entirely hide the struct definition inside one .c
file, and then expose functions to access it. A function call has overhead, both in the call setup and cleanup but sometimes in like surrounding things that come from that work: argument copies, register spills, CPU cache changes, stack use, etc. Its usually small, but not nothing, and some optimisation techniques aren’t available when you have to cross file boundaries. In high-performance code (which kernel code often is), that overhead can actually matter. But it’s also true that the language is simply much less work to use if you just pass struct pointers around instead of buliding out accessor methods, or just reaching out to some other component or subsystem to do what you need. And so, over time, you end up with code that doesn’t always have clean lines of separation.
For unit testing, that matters. It becomes quite difficult to build a function without having half the world come in with it. In the case of our function, the result is extremely dependent on the internal state of the vdev_t
arg, and also the history of that state (the txg
parameter can be thought of as a timestamp here). So to test it thoroughly you have to mock up the internal state of a vdev_t
, and that can get complicated because disk layouts are dependent on all sorts of weird things.
Which isn’t to say it can’t be done. Consider the raidz_test
tool, which is a unit test runner for the raidz math functions. It works by building up different layout “maps” and then blasting data into it and making sure both the columns and the data within them are generated correctly. But, since in the real world the “map” is entirely based on the incoming user IO, it builds up fake IO objects as inputs to the map generation function.
It’s also not all-or-nothing. We can still be sensible and try to make sure functions keep to themselves, and that they can not take a whole struct as an arg if they only need a few fields from it. It’s not like good API design is impossible here, it’s just that the language and tools don’t always guide us in the right direction.
None of this is to say we can’t make unit testing more prevalent. It’s just a bunch of up-front work to get a structure in place that makes it easy to add and run unit tests.
Static analysis
A few people rightly mentioned static analysis, and of course that’s the right path forward, but even then it can be trickier than you might think!
I’d been playing a little with Infer that week anyway, and after I had shipped the fix I threw it in there, and it had no trouble detecting the dead store:
module/zfs/vdev_raidz.c:2261: error: Dead Store
The value written to `&psize` is never used.
2259. psize = (asize >> ashift);
2260. psize -= nparity * DIV_ROUND_UP(psize, cols);
2261. psize <<= ashift;
^
2262.
2263. return (asize);
I have at various times run some or all of OpenZFS through Coverity, Clang Analyzer and Cppcheck as well, so I gave them all a rerun, and they all found it. So in that respect, it’s both a little embarrassing and more than a little frustrating.
I would love to turn on a static analyser and run it regularly, ideally every build. Unfortunately, the false-positive list is far too high to do that just yet. At least, there too many places where memory leaks are reported because they don’t fully understand how some of our memory allocation functions work, and because they don’t always see us initialising some data via out params. I expect some of that is to do with the lack of insight into how our concurrency models are assembled, but I can also see not all of it is that.
And unfortunately, not all of these things can actually be defanged through annotations. The only open-source analyser for C that has a way of signalling that a parameter is an output is Frama-C, and having just spent half a day fiddling with it I can see I’m in for weeks of work to get it all properly wired up. If we’re happy to go not open-source, I know Coverity has the ability to model this, but it’s also slow to run and difficult to work with the results.
And this is the issue with static analysis for the moment. Yes, it would have caught this. Would I have noticed it in among all the false positives?
I will be spending a little time with our existing linting configuration to see if I can add a few “easy” things, and I’m planning to experiment a little with a compiler plugin to add some annotations that might help a little, but I’m afraid that an aggressive and comprehensive static analysis regime is too much for now.
LLMs
Multiple people chimed in with their results of pasting the function into their LLM of choice and pointing out that it noticed the bug. If we ignore all of the AI Discourse for a moment and just consider an LLM as a tool that can read the codebase and point out possible problems, then I think they are in exactly the same category as static analyzers: they’ll point out some real problems, probably also a bunch of non-problems, and they’ll be difficult to properly integrate into everyone’s standard workflow.
… and Rust is here too
I used an example of how I might have done it in Rust because I think a lot about types even when I don’t have the ability to use them right now. This produced a good amount of chatter about different ways that we might approach something like this in Rust - techniques, crates, it’s approach to integer handling, etc. I’m not going to into it here as this post is already getting too long and it’s not directly relevant, but have been following them with great interest.
I hope to get the opportunity to write Rust more often than the occasional dabbling on a rare free evening, but in the meantime I find that just listening in on these discussions gives me ideas and techniques that I can apply in C to make my code better, and that’s a pretty good thing.
The reality of community-driven open-source work
A great many comments expressed surprise that a bug like this would get through, or that we weren’t using static analysis tools regularly, or that we didn’t have unit tests, and that we didn’t just drop everything and re-tool. A few expressed concern that this might be reflective of our commitment to quality or just general seriousness about our work.
I have a lot to say on this topic, but it’s really another few thousand words on another blog post, and I’ve been trying to stay focused. But, there was a great question wondering about the lack of discussion on my bugfix PR:
Do you feel any type of way about the response? Unless there was communication in some other channel, it just looks like an “approved” with no comment. This is the sort of thing I would expect a bit of a postmortem after, but maybe not.
In a proper (and for-profit!) product organisation, I think it’d be worth a pause to check if we need to update our processes, or down tools and build out some new test infrastructure, or whatever. I have pushed for and participated this many times in the past when I’ve been in “normal” jobs: once the flames are out, take a breath, pull everyone into a room, make sure everyone is ok, and work out what the hell just happened, and what do we have to do now to make sure it doesn’t happen again in the future?
In a volunteer- & contract-driven open-source project though, that sort of time and space just doesn’t exist. Unlike a lot of open source, OpenZFS doesn’t have a “corporate owner”. It isn’t a “product” as such. It’s best to think of it as a group of volunteers, working on their own time on the projects that interest them. Things like building out new test infrastructure is a big job that requires care, and it can be hard to find the mental energy at the end of a 9-5 to focus on that sort of thing and do a good job of it. Things like a nomenclature change can be even tougher – no programmer is going to push back on more testing, but changing names of long-established terms of art requires a lot more effort to persuade. So, these sort of changes tend to move very slowly.
So yes, I do sigh a little bit, because these are solved problems in the abstract. But no, I think the response is fine, because there really isn’t much to say. There’s nothing new here that we didn’t already know. We know C is tricky, and we know our testing regime is not as good as we would like. But, it is pretty good, we improve it a little but all the time, and we try to build “general maintenance and targets of opportunity” into everything we do. Sometimes you just have to play the long game.
I will write a lot more about this sometime soon, but for now I’ll just say that if you have a burning desire to see OpenZFS’ testing regime improved and an amount of cash burning a hole in your pocket, you can easily help offset the cost of doing this work. My tip jar is linked at the bottom of this page; while if you’re a business and would like a more professional relationship, Klara can work with you on support and development projects. Alternatively, you can support products and services that are build on top of OpenZFS and let them know why you have chosen them. Alternatively, if you’ve got an interest or skill with the kinds of testing and tools we’ve been talking about and are looking around for somewhere to spend your time, we could really use your help. Feel free to contact me via email for any of this; I’d be more than happy to put you in touch with the right person, or provide advice or mentoring.
Now what?
So that’s a lot of words about why it’s hard to do a lot of things, but I’ve also said a lot of times that it’s important to me that I do something. So what am I looking at?
In the immediate, static analysis. As I said above, it’s currently a bit of a mixed bag, but that doesn’t mean we can’t do anything at all. It turns out we already do have a little bit of CI-driven linting based on Cppcheck but it covers very little of the codebase currently because most of the kernel-specific code is difficult for other tools to work with, so it wasn’t fully integrated yet (to the extent that I didn’t even know about it). I’ve had a look at it and it doesn’t look very complicated to get running regularly against more of code, so that’s the place to start.
After that, I think its just the continuing refactoring and cleanup that I’ve been doing for a couple of years already. I have some huge plans and dreams for OpenZFS, but most of them are stymied by the code not being particularly modular, so difficult to use in other contexts. Some kinds of testing, like unit testing, are difficult for us for the same reason, so continuing that work will make that easier too.
And that’s where I’ll leave this. It’s been a really interesting weekend chatting about all this, thanks everyone 💚
SPONSOR MY WORK
Use OpenZFS? Like this post? Help out by sponsoring my OpenZFS work!