A few nice things in OpenZFS 2.3
OpenZFS 2.3.0-rc1 was just tagged. This was the first full development cycle I was involved in and there’s lots I’m excited about beyond the headline items. Here’s a few less obvious things that I think are very nice!
- Better userspace debugging
- Better task queue stats
- The endless kernel update treadmill
- Linux IO submission rewrite
- Credit where it’s due
- JSON everywhere
Most of this is stuff I had a hand in myself, because a lot of what I do is scratching my own itches, and I like to talk excitedly about my work. Let’s go!
Better userspace debugging 🔗
A decent chunk of my day job is helping customers recover data from damaged pools. We do this using tools built on top of libzpool, which is the OpenZFS core compiled for userspace, such that no kernel support is needed at all. This includes the standard zdb pool exploration tool, as well as some in-house tools.
Since libzpool is the same OpenZFS core code, it generally follows the same rules as the in-kernel OpenZFS, which is to say that it prefers to crash in the face of bad data rather than plough on regardless. A lot of recovery work involves considering each crash, making a small code adjustment, recompiling and moving on to the next one, until we’ve worked around enough bad data that we start to see useful results. As such, recovery work is also development work to a degree, and that means needing good development tools.
In 2.2, when something like zdb crashed or tripped an assertion, it would produce the most minimal of output:
$ ./zdb -d crayon/ 0
"oh no" == 0 (0x7ff68c1e96c0 == 0)
ASSERT at module/zfs/arc.c:4463:arc_evict_cb_check()Aborted
It’s better than nothing, but only just barely. In practice though, you want to know more about what was happening at the time and how it got here.
So, I extended this to give more information, along the lines of what a kernel crash will give you: information about the thread that crashed, and a stack backtrace. If available, libunwind will be used to get an even better backtrace, but if not, most libcs have a backtrace() function that is good enough.
$ ./zdb -d crayon/ 0
ASSERT at module/zfs/arc.c:4510:arc_evict_cb_check()
"oh no" == 0 (0x7fdc8ebafd40 == 0)
PID: 3594698 COMM: zdb
TID: 3594782 NAME: arc_evict
Call trace:
[0x7fdc8eba3b85] libspl_assertf+0x135 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x2ccb85)
[0x7fdc8e95babc] arc_evict_cb_check+0x3c (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x84abc)
[0x7fdc8eb1801f] zthr_procedure+0x9f (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x24101f)
[0x7fdc8e930ed8] zk_thread_wrapper+0x18 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x59ed8)
[0x7fdc8e2b5144] start_thread+0x2d4 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x89144)
[0x7fdc8e3357dc] __clone3+0x2c (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x1097dc)
Call trace:
[0x55698b61dd81] sig_handler+0x21 (in /home/robn/code/zfs-2.3/.libs/zdb +0x11d81)
[0x7fdc8e268050] +0x7fdc8e268050 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x3c050)
[0x7fdc8e2b6e3c] __pthread_kill_implementation+0x10c (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x8ae3c)
[0x7fdc8e267fb2] __GI_raise+0x12 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x3bfb2)
[0x7fdc8e252472] __GI_abort+0xd3 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x26472)
[0x7fdc8e92febb] libspl_assertf.cold+0x05 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x58ebb)
[0x7fdc8e95babc] arc_evict_cb_check+0x3c (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x84abc)
[0x7fdc8eb1801f] zthr_procedure+0x9f (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x24101f)
[0x7fdc8e930ed8] zk_thread_wrapper+0x18 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x59ed8)
[0x7fdc8e2b5144] start_thread+0x2d4 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x89144)
[0x7fdc8e3357dc] __clone3+0x2c (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x1097dc)
Aborted
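If you’re curious about the fallback path, it’s roughly the classic glibc backtrace()/backtrace_symbols() pattern. Here’s a minimal standalone sketch of that shape (not the actual libzpool code, which also wires up libunwind and prints more detail):

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

static void
print_backtrace(void)
{
    void *frames[64];
    int nframes = backtrace(frames, 64);              /* capture return addresses */
    char **syms = backtrace_symbols(frames, nframes); /* best-effort symbol names */

    fprintf(stderr, "Call trace:\n");
    for (int i = 0; i < nframes; i++)
        fprintf(stderr, "  %s\n", syms[i]);
    free(syms);
}

int
main(void)
{
    print_backtrace();
    return (0);
}

Build with -rdynamic so symbols in the main binary actually resolve; otherwise you mostly get raw addresses.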
We now also set thread names, so we can see what’s happening in process inspection tools while it’s running. Before, every thread was just named for the parent program:
$ top -H -b -n 1 -p $(pgrep zdb)
top - 11:42:58 up 25 days, 20:59, 1 user, load average: 0.83, 1.35, 1.18
Threads: 229 total, 4 running, 225 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.0 us, 25.0 sy, 0.0 ni, 50.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31757.8 total, 6487.5 free, 24664.6 used, 9994.4 buff/cache
MiB Swap: 32751.0 total, 27034.8 free, 5716.2 used. 7093.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3604049 robn 20 0 6655196 403900 5056 R 99.9 1.2 0:32.12 zdb
3604167 robn 20 0 6655196 404032 5056 S 13.3 1.2 0:02.03 zdb
3604168 robn 20 0 6655196 404032 5056 S 13.3 1.2 0:02.03 zdb
3604132 robn 20 0 6655196 403900 5056 R 6.7 1.2 0:00.29 zdb
3604141 robn 20 0 6655332 403900 5056 S 6.7 1.2 0:00.18 zdb
3604142 robn 20 0 6655332 403900 5056 S 6.7 1.2 0:00.18 zdb
3604147 robn 20 0 6655332 404160 5056 S 6.7 1.2 0:00.18 zdb
3604149 robn 20 0 6655332 404160 5056 S 6.7 1.2 0:00.18 zdb
3604162 robn 20 0 6655332 404160 5056 R 6.7 1.2 0:02.20 zdb
3604163 robn 20 0 6655332 404032 5056 S 6.7 1.2 0:02.01 zdb
3604164 robn 20 0 6655196 404032 5056 S 6.7 1.2 0:02.03 zdb
3604165 robn 20 0 6655196 404032 5056 S 6.7 1.2 0:02.04 zdb
3604166 robn 20 0 6655196 404032 5056 R 6.7 1.2 0:02.02 zdb
...
In 2.3, threads are given their actual names, so we can tell at a glance who is doing the work:
$ top -H -b -n 1 -p $(pgrep zdb)
top - 11:44:03 up 25 days, 21:01, 1 user, load average: 1.36, 1.43, 1.22
Threads: 213 total, 7 running, 206 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31757.8 total, 6339.6 free, 24812.6 used, 10010.3 buff/cache
MiB Swap: 32751.0 total, 27034.8 free, 5716.2 used. 6945.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3605076 robn 20 0 6755820 527492 5256 R 99.9 1.6 0:10.10 zdb
3605166 robn 20 0 6755820 527492 5256 R 13.3 1.6 0:00.06 z_vdev_file
3605190 robn 20 0 6756364 528016 5256 R 13.3 1.6 0:00.61 z_rd_int_0
3605193 robn 20 0 6756364 528016 5256 S 13.3 1.6 0:00.53 z_rd_int_1
3605196 robn 20 0 6756364 528016 5256 S 13.3 1.6 0:00.53 z_rd_int_1
3605175 robn 20 0 6755956 527492 5256 S 6.7 1.6 0:00.06 z_vdev_file
3605176 robn 20 0 6755956 527492 5256 R 6.7 1.6 0:00.06 z_vdev_file
3605189 robn 20 0 6756364 528016 5256 R 6.7 1.6 0:00.60 z_rd_int_0
3605191 robn 20 0 6756364 528016 5256 R 6.7 1.6 0:00.56 z_rd_int_0
3605192 robn 20 0 6756364 528016 5256 S 6.7 1.6 0:00.57 z_rd_int_0
3605194 robn 20 0 6756364 528016 5256 S 6.7 1.6 0:00.52 z_rd_int_1
3605090 robn 20 0 6755820 527492 5256 S 0.0 1.6 0:00.00 system_taskq
3605091 robn 20 0 6755820 527492 5256 S 0.0 1.6 0:00.00 system_taskq
...
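The mechanism itself is tiny: on Linux it comes down to pthread_setname_np(). A rough sketch of the idea (the real change goes through OpenZFS’s own thread wrappers, so this is just the shape of it):

#define _GNU_SOURCE
#include <pthread.h>

static void *
worker(void *arg)
{
    /* Linux truncates thread names to 15 characters plus the NUL */
    pthread_setname_np(pthread_self(), "z_rd_int_0");
    /* ... do work; the name now shows up in top -H, ps and /proc ... */
    return (NULL);
}

int
main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return (0);
}

(Compile with -pthread.)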
It’s not glamorous, but this stuff is crucial when trying to debug problems on customer pools, especially if we don’t have access to the system but are trying to guide them through a remote debugging session.
Better task queue stats 🔗
The bulk of OpenZFS’ concurrency model is based around “task queues”. They’re pretty simple in concept but have some truly devilish details, and being able to understand what they’re up to is critical for understanding performance problems.
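If you haven’t met them before, a task queue is just a pool of worker threads pulling work items off a queue. A rough sketch of what the consumer side of the API looks like inside the OpenZFS tree (simplified, error handling omitted, not buildable standalone):

#include <sys/taskq.h>

static void
my_work(void *arg)
{
    /* runs later, on one of the queue's worker threads */
}

static void
example(void *arg)
{
    /* four worker threads at default priority, entries preallocated */
    taskq_t *tq = taskq_create("my_taskq", 4, defclsyspri,
        4, INT_MAX, TASKQ_PREPOPULATE);

    /* hand the work off; this returns as soon as the task is queued */
    VERIFY(taskq_dispatch(tq, my_work, arg, TQ_SLEEP) != TASKQID_INVALID);

    taskq_wait(tq);     /* block until every queued task has run */
    taskq_destroy(tq);
}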
Up until the 2.2 series, some task queue stats were available, but they showed a lot of things that weren’t at all interesting and barely anything useful for understanding how the queues were actually being used.
There have always been two stats nodes, taskq and taskq-all. taskq-all shows all queues in the system with a whole bunch of attributes about them:
$ cat /proc/spl/taskq-all
taskq act nthr spwn maxt pri mina maxa cura flags
spl_system_taskq/0 0 1 0 64 100 6 2147483647 6 80000005
spl_delay_taskq/0 0 1 0 6 100 6 2147483647 6 80000005
spl_dynamic_taskq/0 0 1 0 1 100 6 2147483647 6 80000001
spl_kmem_cache/0 0 2 0 4 100 32 2147483647 32 80000005
zvol/0 0 1 0 32 100 32 2147483647 32 80000005
arc_prune/0 0 1 0 1 120 100 2147483647 100 80000005
dbu_evict/0 0 1 0 1 120 0 0 0 80000000
z_vdev_file/0 0 1 0 16 139 6 2147483647 0 80000004
z_null_iss/0 0 1 0 1 100 50 2147483647 0 80000004
...
Most of this information is uninteresting. act is the number of running tasks at this moment; nthr and spwn are about threads that exist to service the queue and new threads being spawned right now. The rest is largely static configuration that can be got from the source code, and so is mostly a waste of space.
When there’s stuff on the queues, it gets more bizarre:
$ cat /proc/spl/taskq-all
taskq act nthr spwn maxt pri mina maxa cura flags
spl_system_taskq/0 0 1 0 64 100 6 2147483647 6 80000005
spl_delay_taskq/0 0 1 0 6 100 6 2147483647 6 80000005
delay: (%pf?)(0xffff9bdc041a4000)
spl_dynamic_taskq/0 0 1 0 1 100 6 2147483647 6 80000001
...
z_wr_iss/0 1 4 0 4 101 50 2147483647 0 8000000c
active: [3871](%pf?)(0xffff9bdc270db700)
pend: (%pf?)(0xffff9bdc2ec88f00)
z_wr_iss_h/0 0 5 0 5 100 50 2147483647 0 80000004
z_wr_int/0 0 4 0 4 100 50 2147483647 0 8000000c
pend: (%pf?)(0xffff9bdc3572b200)
z_wr_int_h/0 0 5 0 5 100 50 2147483647 0 80000004
z_fr_iss/0 0 1 0 4 100 50 2147483647 0 8000000c
...
This is actually listing, in full, the tasks in various states. The memory address is the in-kernel memory location of the task parameters. %pf is a long-broken format specifier that is supposed to show the name of the task function. Even unbroken, that would rarely be useful, because while task queues can run any task you like, in practice a given task queue only handles one kind of task.
As for /proc/spl/taskq, it shows exactly the same thing, but only for queues that have tasks on them at all. So it ranges from no output to a wall of hex numbers and misery.
So I ripped all this out and replaced it with something more useful. Now there is a regular, conventional kstat node for every task queue, showing instantaneous and accumulated values for a wide variety of internal task queue operational state:
$ ls /proc/spl/kstat/taskq
arc_prune.0 z_cl_iss.0 z_rd_int_0.0 zvol_tq-2.0
dbu_evict.0 z_fr_int.0 z_rd_int_1.0 z_wr_int_0.0
dp_sync_taskq.0 z_fr_iss_0.0 z_rd_iss.0 z_wr_int_1.0
dp_zil_clean_taskq.0 z_fr_iss_1.0 z_trim_int.0 z_wr_int_h.0
spl_delay_taskq.0 z_ioctl_int.0 z_trim_iss.0 z_wr_iss.0
spl_dynamic_taskq.0 z_ioctl_iss.0 z_unlinked_drain.0 z_wr_iss_h.0
spl_kmem_cache.0 z_metaslab.0 z_upgrade.0 z_zrele.0
spl_system_taskq.0 z_null_int.0 z_vdev_file.0 z_zvol.0
summary z_null_iss.0 zvol_tq-0.0
z_cl_int.0 z_prefetch.0 zvol_tq-1.0
$ cat /proc/spl/kstat/taskq/z_wr_iss.0
114 1 0x01 23 6256 61967657786 132922594033
name type data
threads_max 4 4
entry_pool_min 4 50
entry_pool_max 4 2147483647
threads_active 4 0
threads_idle 4 4
threads_total 4 4
tasks_pending 4 0
tasks_priority 4 0
tasks_total 4 0
tasks_delayed 4 0
entries_free 4 0
threads_created 4 5
threads_destroyed 4 1
tasks_dispatched 4 266
tasks_dispatched_delayed 4 0
tasks_executed_normal 4 263
tasks_executed_priority 4 3
tasks_executed 4 266
tasks_delayed_requeued 4 0
tasks_cancelled 4 0
thread_wakeups 4 145
thread_wakeups_nowork 4 23
thread_sleeps 4 149
And then there’s a summary view that shows the most useful at-a-glance stuff for the entire system:
$ cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 6060658890 2005804238964098
| threads | tasks on queue
taskq name | tot [act idl] max | pend [ norm high] dly
-------------------- | ----------------- | -----------------------
z_unlinked_drain.0 | 1 [ 0 1] 12 | 0 [ 0 0] 0
z_zrele.0 | 1 [ 0 1] 12 | 0 [ 0 0] 0
dp_zil_clean_taskq.0 | 12 [ 0 12] 12 | 0 [ 0 0] 0
dp_sync_taskq.0 | 9 [ 1 8] 9 | 0 [ 0 0] 0
z_upgrade.0 | 1 [ 0 1] 12 | 0 [ 0 0] 0
...
z_wr_iss.0 | 8 [ 0 8] 9 | 0 [ 0 0] 0
z_rd_int_1.0 | 4 [ 4 0] 4 | 2 [ 2 0] 0
z_rd_int_0.0 | 4 [ 4 0] 4 | 2 [ 2 0] 0
z_rd_iss.0 | 1 [ 0 1] 8 | 0 [ 0 0] 0
...
The original PR #16171 actually had a bit more in it, showing rolling averages for each point in a task’s journey through the queue:
enqueue_time_min 4 0
enqueue_time_max 4 38
enqueue_time_avg_10 4 5
enqueue_time_avg_100 4 5
enqueue_time_avg_1000 4 4
enqueue_time_avg_10000 4 0
wait_time_min 4 0
wait_time_max 4 42347
wait_time_avg_10 4 2
wait_time_avg_100 4 10
wait_time_avg_1000 4 1487
wait_time_avg_10000 4 3102
dequeue_time_min 4 0
dequeue_time_max 4 45
dequeue_time_avg_10 4 0
dequeue_time_avg_100 4 0
dequeue_time_avg_1000 4 1
dequeue_time_avg_10000 4 0
execute_time_min 4 0
execute_time_max 4 44618
execute_time_avg_10 4 0
execute_time_avg_100 4 2
execute_time_avg_1000 4 207
execute_time_avg_10000 4 441
task_time_min 4 2
task_time_max 4 44694
task_time_avg_10 4 8
task_time_avg_100 4 18
task_time_avg_1000 4 1698
task_time_avg_10000 4 3544
| threads (current) | task on queue (current) | task time (µs) (last 100 tasks)
taskq name | tot [act idl] max | pend [ norm high] dly | avg [enq wait deq exec]
-------------------- | ----------------- | ----------------------- | --------------------------------
z_unlinked_drain.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 0 [ 0 0 0 0]
z_zrele.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 0 [ 0 0 0 0]
dp_zil_clean_taskq.0 | 4 [ 0 4] 4 | 0 [ 0 0] 0 | 10 [ 4 4 0 1]
dp_sync_taskq.0 | 1 [ 0 1] 1 | 0 [ 0 0] 0 | 154 [ 2 44 0 108]
z_upgrade.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 87 [ 0 0 0 86]
z_prefetch.0 | 1 [ 0 1] 4 | 0 [ 0 0] 0 | 0 [ 0 0 0 0]
z_metaslab.0 | 2 [ 0 2] 2 | 0 [ 0 0] 0 | 30 [ 2 27 0 0]
...
I ended up removing this, as the overhead of taking so many accurate time samples was substantial on very busy queues. I would like to come back to it, as I think it’s kind of the killer feature here, but even so, what’s there is still light-years ahead of the original.
The endless kernel update treadmill 🔗
Pull requests:
Linux churns the crap out of their internal APIs, and people really like to run OpenZFS on the newest kernel releases on day one. I have multiple opinions about all of this which are for another time, but regardless, OpenZFS needs to be updated to make it work. For me it’s become a kind of pleasant routine evening task every couple of months.
It’s of course not as simple as just applying the changes in Linux to OpenZFS, because we support kernels back to 4.18 (to cover RHEL8). Instead, we have to study the changes, work out the feature checks needed to tell old and new kernels apart, and then build appropriate abstractions to ensure the same behaviour across years of kernel internal change.
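The shape of those abstractions is usually a configure-time compile test that sets a HAVE_* macro, plus a small wrapper that hides the difference. A purely illustrative sketch, using the Linux 5.18 bio_alloc() signature change as the example (the helper and macro names here are mine for illustration, not necessarily what’s in the tree):

#include <linux/bio.h>

static inline struct bio *
my_bio_alloc(struct block_device *bdev, unsigned short nr_vecs)
{
#ifdef HAVE_BIO_ALLOC_4ARG
    /* Linux 5.18+: the target bdev and op flags are supplied at allocation */
    return (bio_alloc(bdev, nr_vecs, 0, GFP_NOIO));
#else
    /* older kernels: allocate first, then attach the device */
    struct bio *bio = bio_alloc(GFP_NOIO, nr_vecs);
    if (bio != NULL)
        bio_set_dev(bio, bdev);
    return (bio);
#endif
}

The important part is that the HAVE_* symbol comes from compiling a small test program against the kernel headers at configure time, not from comparing version numbers, because distribution kernels backport things all the time.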
Of course, the world moves on at the other end too. There are old Linux versions that we don’t support anymore; most recently, the hybrid 3.10 kernel that came with RHEL7, which is now out of support. That means removing code, which has been lovely. Linux still changes a lot, but much of the deep structural stuff was mostly settled long ago, so there’s plenty of compatibility code we can get rid of.
But we mustn’t leave FreeBSD out. We could, because they tend not to break stuff very often, but as FreeBSD 12 went to EOL, there was plenty of old code we could remove. But it’s not like they stay still, so we still do stuff to wire up improvements, which is why you can now TRIM your file-backed pools.
Keeping up with Linux updates led to a surprising issue though. If OpenZFS doesn’t work with a new Linux version on the day of its release, people file bug reports. But of course OpenZFS doesn’t release as often as Linux, and also, until the day of release we can’t be sure that we have everything working properly for the new kernel anyway - our stability and durability guarantees matter! So instead, we started shipping experimental support for new kernels in point releases, without raising the “officially supported” version. That would yield a different kind of bug report, when something didn’t work properly because our patches weren’t right yet.
To try and split the difference, we now detect the kernel version we’re compiling against, and abort if it’s newer than our maximum supported version. But now you can pass --enable-linux-experimental to configure, which will try to compile it anyway, but will add scary warnings:
[ 3.124287] ZFS: Loaded module v2.3.0-rc1, ZFS pool version 5000, ZFS filesystem version 5
[ 3.124513] ZFS: Using ZFS with kernel 6.12-rc1 is EXPERIMENTAL and SERIOUS DATA LOSS may occur!
[ 3.124640] ZFS: Please report your results at: https://github.com/openzfs/zfs/issues/new
Will it help people know what they’re getting? We’ll see!
More importantly though, this does show some of the effort involved in keeping OpenZFS running smoothly.
Linux IO submission rewrite 🔗
This one took me a full year to solve properly. It turns out that pretty much since the first port of ZFS to Linux, it’s been passing IO buffers down to the block layer a bit wrong, in a way that, in particularly rare situations, could result in misaligned buffers being passed down to the disk driver, which would then reject them. Obviously it was so rare that it had barely been noticed in over a decade.
This resulted in a rewrite of the IO submission layer that is a lot smarter, and also does a lot less work than the old one, to the extent that as well as fixing the alignment issue, it’s some 10% faster on heavy sync writes and as much as 150% faster on async writes. This wasn’t the purpose, but it was a cool and surprising thing to find out.
Some of the new code also enables some interesting “zero copy” facilities, which with a little more work should be able to give us very efficient implementations of things like sendfile(). So I’m looking forward to exploring that further in the next development series.
Credit where it’s due 🔗
One of the real joys of working on OpenZFS for me is that it’s not a “corporate” open source project. Yes, many of the contributors do things on their employers’ time, but the point is the individuals involved, not the companies. And we keep a nice list of everyone in the AUTHORS file, no matter how big or small the contribution. I love this, because it’s no bullshit, no pecking order, just everyone ever.
The first time I went to add my own name, I noticed just how many people were missing. So I kinda got really stuck into it one evening and wrote a whole big program to mine the git history and update the AUTHORS file, and now it seems like I’m the guy that presses that button every now and again. Which I’m ok with, because I love a team effort 🏆
JSON everywhere 🔗
Pull request: #16217
And here’s the coolest one, and the one I had nothing to do with.
Since the beginning of time, grumpy sysadmins have lost their minds trying to parse output from the ZFS tools and feed it into other things. A lot of the output is basically tabular, and so not hard to feed to grep, cut, awk and all the usual friends.
The challenge has always been zpool status, which has useful device health info in it that you can’t easily get elsewhere:
$ zpool status shed
pool: shed
state: ONLINE
scan: scrub repaired 0B in 04:45:02 with 0 errors on Fri Sep 27 12:03:32 2024
config:
NAME STATE READ WRITE CKSUM
shed ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
da0 ONLINE 0 0 0
da1 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
da2 ONLINE 0 0 0
da3 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
da4 ONLINE 0 0 0
da5 ONLINE 0 0 0
errors: No known data errors
But now most of the zpool and zfs subcommands take a -j switch to get JSON output. So now I can do this:
$ ./zpool status -j shed | jq -r '.pools[] | recurse(.vdevs[]?) | select(.vdev_type == "disk") | [.name,.path,.state] | @tsv'
da0 /dev/da0 ONLINE
da1 /dev/da1 ONLINE
da2 /dev/da2 ONLINE
da3 /dev/da3 ONLINE
da4 /dev/da4 ONLINE
da5 /dev/da5 ONLINE
I probably won’t do a whole lot with it myself, now that I’m out of the sysadmin game, but I’m really curious to see what kind of tools people make out of this.
And more! 🔗
I promised myself I’d post this today, so I won’t write more right now, but maybe I’ll do another. Regardless, there’s still loads more cool stuff inside OpenZFS 2.3, and you should try it out!