A few nice things in OpenZFS 2.3

OpenZFS 2.3.0-rc1 was just tagged. This was the first full development cycle I was involved in and there’s lots I’m excited about beyond the headline items. Here’s a few less obvious things that I think are very nice!

Most of this is stuff I had a hand in myself, because a lot of what I do is scratching my own itches, and I like to talk excitedly about my work. Let’s go!

Better userspace debugging 🔗

Pull requests: #16140 #16181

A decent chunk of my day job is helping customers recover data from damaged pools. We do this using tools built on top of libzpool, which is the OpenZFS core compiled for userspace, such that no kernel support is needed at all. This includes the standard zdb pool exploration tool, as well as some in-house tools.

Since libzpool is the same OpenZFS core code, it generally follows the same rules as the in-kernel OpenZFS, which is to say that it prefers to crash in the face of bad data rather than plough on regardless. A lot of recovery work involves considering each crash, making a small code adjustment, recompiling and moving on to the next one, until we’ve worked around enough bad data that we start to see useful results. As such, recovery work is also development work to a degree, and that means needing good development tools.

In 2.2, when something like zdb crashed or tripped an assertion, it would produce the most minimal of output:

$ ./zdb -d crayon/ 0
"oh no" == 0 (0x7ff68c1e96c0 == 0)
ASSERT at module/zfs/arc.c:4463:arc_evict_cb_check()Aborted

It’s better than nothing, but only barely. In practice, you want to know more about what was happening at the time and how it got there.

So, I extended this to give more information, along the lines of what a kernel crash gives you: details about the thread that crashed, and a stack backtrace. If libunwind is available, it’s used to get an even better backtrace; if not, most libcs have a backtrace() function that is good enough.

$ ./zdb -d crayon/ 0
ASSERT at module/zfs/arc.c:4510:arc_evict_cb_check()
"oh no" == 0 (0x7fdc8ebafd40 == 0)
  PID: 3594698   COMM: zdb
  TID: 3594782   NAME: arc_evict
Call trace:
  [0x7fdc8eba3b85] libspl_assertf+0x135 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x2ccb85)
  [0x7fdc8e95babc] arc_evict_cb_check+0x3c (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x84abc)
  [0x7fdc8eb1801f] zthr_procedure+0x9f (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x24101f)
  [0x7fdc8e930ed8] zk_thread_wrapper+0x18 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x59ed8)
  [0x7fdc8e2b5144] start_thread+0x2d4 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x89144)
  [0x7fdc8e3357dc] __clone3+0x2c (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x1097dc)
Call trace:
  [0x55698b61dd81] sig_handler+0x21 (in /home/robn/code/zfs-2.3/.libs/zdb +0x11d81)
  [0x7fdc8e268050] +0x7fdc8e268050 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x3c050)
  [0x7fdc8e2b6e3c] __pthread_kill_implementation+0x10c (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x8ae3c)
  [0x7fdc8e267fb2] __GI_raise+0x12 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x3bfb2)
  [0x7fdc8e252472] __GI_abort+0xd3 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x26472)
  [0x7fdc8e92febb] libspl_assertf.cold+0x05 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x58ebb)
  [0x7fdc8e95babc] arc_evict_cb_check+0x3c (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x84abc)
  [0x7fdc8eb1801f] zthr_procedure+0x9f (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x24101f)
  [0x7fdc8e930ed8] zk_thread_wrapper+0x18 (in /home/robn/code/zfs-2.3/.libs/libzpool.so.5.0.0 +0x59ed8)
  [0x7fdc8e2b5144] start_thread+0x2d4 (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x89144)
  [0x7fdc8e3357dc] __clone3+0x2c (in /usr/lib/x86_64-linux-gnu/libc.so.6 +0x1097dc)
Aborted
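
Under the hood, the libc fallback is just the classic backtrace()/backtrace_symbols_fd() pattern from <execinfo.h>. Here’s a minimal sketch of the idea (illustrative only, not the actual libspl code, which handles more cases and prefers libunwind when it’s available):

/*
 * Minimal sketch of a libc-only backtrace fallback; illustrative only.
 */
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
#include <unistd.h>     /* STDERR_FILENO */

static void
print_backtrace(void)
{
    void *frames[64];
    int nframes;

    /* Capture up to 64 return addresses from the current stack... */
    nframes = backtrace(frames, 64);

    /*
     * ...and write symbolic frame descriptions straight to stderr.
     * Unlike backtrace_symbols(), this variant doesn't allocate, so
     * it's safe enough to call from an assertion or signal handler.
     */
    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
}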

We now also set thread names, so we can see what’s happening from process inspection tools while it’s running. Before, every thread was just named for the parent program:

$ top -H -b -n 1 -p $(pgrep zdb)
top - 11:42:58 up 25 days, 20:59,  1 user,  load average: 0.83, 1.35, 1.18
Threads: 229 total,   4 running, 225 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.0 us, 25.0 sy,  0.0 ni, 50.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31757.8 total,   6487.5 free,  24664.6 used,   9994.4 buff/cache
MiB Swap:  32751.0 total,  27034.8 free,   5716.2 used.   7093.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3604049 robn      20   0 6655196 403900   5056 R  99.9   1.2   0:32.12 zdb
3604167 robn      20   0 6655196 404032   5056 S  13.3   1.2   0:02.03 zdb
3604168 robn      20   0 6655196 404032   5056 S  13.3   1.2   0:02.03 zdb
3604132 robn      20   0 6655196 403900   5056 R   6.7   1.2   0:00.29 zdb
3604141 robn      20   0 6655332 403900   5056 S   6.7   1.2   0:00.18 zdb
3604142 robn      20   0 6655332 403900   5056 S   6.7   1.2   0:00.18 zdb
3604147 robn      20   0 6655332 404160   5056 S   6.7   1.2   0:00.18 zdb
3604149 robn      20   0 6655332 404160   5056 S   6.7   1.2   0:00.18 zdb
3604162 robn      20   0 6655332 404160   5056 R   6.7   1.2   0:02.20 zdb
3604163 robn      20   0 6655332 404032   5056 S   6.7   1.2   0:02.01 zdb
3604164 robn      20   0 6655196 404032   5056 S   6.7   1.2   0:02.03 zdb
3604165 robn      20   0 6655196 404032   5056 S   6.7   1.2   0:02.04 zdb
3604166 robn      20   0 6655196 404032   5056 R   6.7   1.2   0:02.02 zdb
...

In 2.3, threads are given their actual names, so we can tell at a glance who is doing the work:

$ top -H -b -n 1 -p $(pgrep zdb)
top - 11:44:03 up 25 days, 21:01,  1 user,  load average: 1.36, 1.43, 1.22
Threads: 213 total,   7 running, 206 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31757.8 total,   6339.6 free,  24812.6 used,  10010.3 buff/cache
MiB Swap:  32751.0 total,  27034.8 free,   5716.2 used.   6945.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3605076 robn      20   0 6755820 527492   5256 R  99.9   1.6   0:10.10 zdb
3605166 robn      20   0 6755820 527492   5256 R  13.3   1.6   0:00.06 z_vdev_file
3605190 robn      20   0 6756364 528016   5256 R  13.3   1.6   0:00.61 z_rd_int_0
3605193 robn      20   0 6756364 528016   5256 S  13.3   1.6   0:00.53 z_rd_int_1
3605196 robn      20   0 6756364 528016   5256 S  13.3   1.6   0:00.53 z_rd_int_1
3605175 robn      20   0 6755956 527492   5256 S   6.7   1.6   0:00.06 z_vdev_file
3605176 robn      20   0 6755956 527492   5256 R   6.7   1.6   0:00.06 z_vdev_file
3605189 robn      20   0 6756364 528016   5256 R   6.7   1.6   0:00.60 z_rd_int_0
3605191 robn      20   0 6756364 528016   5256 R   6.7   1.6   0:00.56 z_rd_int_0
3605192 robn      20   0 6756364 528016   5256 S   6.7   1.6   0:00.57 z_rd_int_0
3605194 robn      20   0 6756364 528016   5256 S   6.7   1.6   0:00.52 z_rd_int_1
3605090 robn      20   0 6755820 527492   5256 S   0.0   1.6   0:00.00 system_taskq
3605091 robn      20   0 6755820 527492   5256 S   0.0   1.6   0:00.00 system_taskq
...
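
For what it’s worth, the mechanism on Linux is just pthread_setname_np(). A sketch of the general idea (the real thread wrapper in libzpool may differ in detail):

#define _GNU_SOURCE     /* pthread_setname_np() is a GNU extension */
#include <pthread.h>
#include <string.h>

/*
 * Name the calling thread so tools like top -H and /proc/<pid>/task show
 * something meaningful. Linux limits the name to 15 characters plus NUL,
 * so longer names get truncated.
 */
static void
set_thread_name(const char *name)
{
    char buf[16];

    (void) strncpy(buf, name, sizeof (buf) - 1);
    buf[sizeof (buf) - 1] = '\0';
    (void) pthread_setname_np(pthread_self(), buf);
}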

It’s not glamorous, but this stuff is crucial when trying to debug problems on customer pools, especially if we don’t have access to the system but are trying to guide them through a remote debugging session.

Better task queue stats 🔗

Pull requests: #16171 #16473

The bulk of OpenZFS’ concurrency model is based around “task queues”. They’re pretty simple in concept but have some truly devilish details, and being able to understand what they’re up to is critical for understanding performance problems.
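
The interface itself is essentially the Solaris-derived taskq API: create a named queue with some worker threads and a priority, then dispatch functions onto it. Here’s a rough sketch of how a caller uses it (the queue name, callback and parameter choices are invented for illustration, not taken from a real call site):

#include <sys/taskq.h>

static void
my_work_func(void *arg)
{
    /* ...do the actual work for one task... */
}

static void
example(void *arg)
{
    taskq_t *tq;

    /* Eight worker threads, running at maximum system-class priority. */
    tq = taskq_create("example_taskq", 8, maxclsyspri, 8, INT_MAX,
        TASKQ_PREPOPULATE);

    /* Queue a task; TQ_SLEEP allows the dispatch to block for memory. */
    (void) taskq_dispatch(tq, my_work_func, arg, TQ_SLEEP);

    taskq_wait(tq);     /* wait for everything dispatched so far */
    taskq_destroy(tq);
}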

Up until the 2.2 series, some task queue stats were available, but they showed a lot of things that weren’t at all interesting and barely anything useful for understanding how the queues were actually being used.

There have always been two stats nodes, taskq and taskq-all. taskq-all shows all queues in the system with a whole bunch of attributes about them:

$ cat /proc/spl/taskq-all
taskq                       act  nthr  spwn  maxt   pri  mina         maxa  cura      flags
spl_system_taskq/0            0     1     0    64   100     6   2147483647     6   80000005
spl_delay_taskq/0             0     1     0     6   100     6   2147483647     6   80000005
spl_dynamic_taskq/0           0     1     0     1   100     6   2147483647     6   80000001
spl_kmem_cache/0              0     2     0     4   100    32   2147483647    32   80000005
zvol/0                        0     1     0    32   100    32   2147483647    32   80000005
arc_prune/0                   0     1     0     1   120   100   2147483647   100   80000005
dbu_evict/0                   0     1     0     1   120     0            0     0   80000000
z_vdev_file/0                 0     1     0    16   139     6   2147483647     0   80000004
z_null_iss/0                  0     1     0     1   100    50   2147483647     0   80000004
...

Most of this information is uninteresting. act is the number of tasks running at this moment; nthr and spwn cover the threads that exist to service the queue and any new threads being spawned right now. The rest is largely static configuration that can be found in the source code, and so is mostly a waste of space.

When there’s stuff on the queues, it gets more bizarre:

$ cat /proc/spl/taskq-all
taskq                       act  nthr  spwn  maxt   pri  mina         maxa  cura      flags
spl_system_taskq/0            0     1     0    64   100     6   2147483647     6   80000005
spl_delay_taskq/0             0     1     0     6   100     6   2147483647     6   80000005
	delay: (%pf?)(0xffff9bdc041a4000)
spl_dynamic_taskq/0           0     1     0     1   100     6   2147483647     6   80000001
...
z_wr_iss/0                    1     4     0     4   101    50   2147483647     0   8000000c
	active: [3871](%pf?)(0xffff9bdc270db700)
	pend: (%pf?)(0xffff9bdc2ec88f00)
z_wr_iss_h/0                  0     5     0     5   100    50   2147483647     0   80000004
z_wr_int/0                    0     4     0     4   100    50   2147483647     0   8000000c
	pend: (%pf?)(0xffff9bdc3572b200)
z_wr_int_h/0                  0     5     0     5   100    50   2147483647     0   80000004
z_fr_iss/0                    0     1     0     4   100    50   2147483647     0   8000000c
...

This is actually a full listing of the tasks in their various states. The memory address is the in-kernel location of the task parameters. %pf is a long-broken format specifier that is supposed to show the name of the task function. Even unbroken, that would rarely be useful, because while a task queue can run any task you like, in practice each queue only handles one kind of task.

As for /proc/spl/taskq, it shows exactly the same thing, but only for queues that have tasks on them. So it ranges from no output at all to a wall of hex numbers and misery.

So I ripped all this out and replaced it with something more useful. Now there is a conventional kstat node for every task queue, showing instantaneous and accumulated values for a wide variety of internal operational state:

$ ls /proc/spl/kstat/taskq
arc_prune.0           z_cl_iss.0     z_rd_int_0.0        zvol_tq-2.0
dbu_evict.0           z_fr_int.0     z_rd_int_1.0        z_wr_int_0.0
dp_sync_taskq.0       z_fr_iss_0.0   z_rd_iss.0          z_wr_int_1.0
dp_zil_clean_taskq.0  z_fr_iss_1.0   z_trim_int.0        z_wr_int_h.0
spl_delay_taskq.0     z_ioctl_int.0  z_trim_iss.0        z_wr_iss.0
spl_dynamic_taskq.0   z_ioctl_iss.0  z_unlinked_drain.0  z_wr_iss_h.0
spl_kmem_cache.0      z_metaslab.0   z_upgrade.0         z_zrele.0
spl_system_taskq.0    z_null_int.0   z_vdev_file.0       z_zvol.0
summary               z_null_iss.0   zvol_tq-0.0
z_cl_int.0            z_prefetch.0   zvol_tq-1.0

$ cat /proc/spl/kstat/taskq/z_wr_iss.0
114 1 0x01 23 6256 61967657786 132922594033
name                            type data
threads_max                     4    4
entry_pool_min                  4    50
entry_pool_max                  4    2147483647
threads_active                  4    0
threads_idle                    4    4
threads_total                   4    4
tasks_pending                   4    0
tasks_priority                  4    0
tasks_total                     4    0
tasks_delayed                   4    0
entries_free                    4    0
threads_created                 4    5
threads_destroyed               4    1
tasks_dispatched                4    266
tasks_dispatched_delayed        4    0
tasks_executed_normal           4    263
tasks_executed_priority         4    3
tasks_executed                  4    266
tasks_delayed_requeued          4    0
tasks_cancelled                 4    0
thread_wakeups                  4    145
thread_wakeups_nowork           4    23
thread_sleeps                   4    149

And then there’s a summary view that shows the most useful at-a-glance stuff for the entire system:

$ cat /proc/spl/kstat/taskq/summary
3 0 0x01 1 0 6060658890 2005804238964098
                     | threads           | tasks on queue
taskq name           | tot [act idl] max |  pend [ norm  high] dly
-------------------- | ----------------- | -----------------------
z_unlinked_drain.0   |   1 [  0   1]  12 |     0 [    0     0]   0
z_zrele.0            |   1 [  0   1]  12 |     0 [    0     0]   0
dp_zil_clean_taskq.0 |  12 [  0  12]  12 |     0 [    0     0]   0
dp_sync_taskq.0      |   9 [  1   8]   9 |     0 [    0     0]   0
z_upgrade.0          |   1 [  0   1]  12 |     0 [    0     0]   0
...
z_wr_iss.0           |   8 [  0   8]   9 |     0 [    0     0]   0
z_rd_int_1.0         |   4 [  4   0]   4 |     2 [    2     0]   0
z_rd_int_0.0         |   4 [  4   0]   4 |     2 [    2     0]   0
z_rd_iss.0           |   1 [  0   1]   8 |     0 [    0     0]   0
...

The original PR #16171 actually had a bit more in it, showing rolling averages for each point in a task’s journey through the queue:

enqueue_time_min                4    0
enqueue_time_max                4    38
enqueue_time_avg_10             4    5
enqueue_time_avg_100            4    5
enqueue_time_avg_1000           4    4
enqueue_time_avg_10000          4    0
wait_time_min                   4    0
wait_time_max                   4    42347
wait_time_avg_10                4    2
wait_time_avg_100               4    10
wait_time_avg_1000              4    1487
wait_time_avg_10000             4    3102
dequeue_time_min                4    0
dequeue_time_max                4    45
dequeue_time_avg_10             4    0
dequeue_time_avg_100            4    0
dequeue_time_avg_1000           4    1
dequeue_time_avg_10000          4    0
execute_time_min                4    0
execute_time_max                4    44618
execute_time_avg_10             4    0
execute_time_avg_100            4    2
execute_time_avg_1000           4    207
execute_time_avg_10000          4    441
task_time_min                   4    2
task_time_max                   4    44694
task_time_avg_10                4    8
task_time_avg_100               4    18
task_time_avg_1000              4    1698
task_time_avg_10000             4    3544
                     | threads (current) | task on queue (current) | task time (µs)  (last 100 tasks)
taskq name           | tot [act idl] max |  pend [ norm  high] dly | avg    [enq    wait  deq   exec]
-------------------- | ----------------- | ----------------------- | --------------------------------
z_unlinked_drain.0   |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
z_zrele.0            |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
dp_zil_clean_taskq.0 |   4 [  0   4]   4 |     0 [    0     0]   0 |     10 [   4      4    0      1]
dp_sync_taskq.0      |   1 [  0   1]   1 |     0 [    0     0]   0 |    154 [   2     44    0    108]
z_upgrade.0          |   1 [  0   1]   4 |     0 [    0     0]   0 |     87 [   0      0    0     86]
z_prefetch.0         |   1 [  0   1]   4 |     0 [    0     0]   0 |      0 [   0      0    0      0]
z_metaslab.0         |   2 [  0   2]   2 |     0 [    0     0]   0 |     30 [   2     27    0      0]
...

I ended up removing this, as the overhead of taking so many accurate time samples was substantial on very busy queues. I would like to come back to it, as I think it’s kind of the killer feature here, but even so, what’s there is still light-years ahead of the original.
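
I read the avg_N stats above as rolling averages over roughly the last N tasks. Just to illustrate the shape of such a thing (this is not the code from the PR, only a sketch of the general idea):

#include <stdint.h>

/*
 * A fixed-window rolling average: keep the last WINDOW samples in a ring
 * buffer alongside their running sum, so updating and reading the average
 * are both O(1).
 */
#define WINDOW  100

typedef struct {
    uint64_t samples[WINDOW];   /* ring buffer of recent samples */
    uint64_t sum;               /* sum of the filled slots */
    uint32_t next;              /* next slot to overwrite */
    uint32_t count;             /* how many slots are filled */
} rolling_avg_t;

static void
rolling_avg_add(rolling_avg_t *ra, uint64_t sample)
{
    if (ra->count == WINDOW)
        ra->sum -= ra->samples[ra->next];   /* evict the oldest sample */
    else
        ra->count++;
    ra->samples[ra->next] = sample;
    ra->sum += sample;
    ra->next = (ra->next + 1) % WINDOW;
}

static uint64_t
rolling_avg_get(const rolling_avg_t *ra)
{
    return (ra->count == 0 ? 0 : ra->sum / ra->count);
}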

The endless kernel update treadmill 🔗

Pull requests:

Linux churns the crap out of its internal APIs, and people really like to run OpenZFS on the newest kernel releases on day one. I have multiple opinions about all of this, which are for another time, but regardless, OpenZFS needs to be updated to make it work. For me it’s become a kind of pleasant routine evening task every couple of months.

It’s of course not as simple as just applying the changes in Linux to OpenZFS, because we support kernels back to 4.18 (to cover RHEL8). Instead, we have to study the changes, work out the feature checks needed to tell the difference, and then build appropriate abstractions to ensure the same behaviour across years of kernel internal change.
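
In practice that usually means a configure-time probe paired with a tiny shim. Here’s an entirely hypothetical example of the shape of it (the HAVE_* macro and the frob functions are made up; the real checks live in the build system’s kernel compatibility tests):

/*
 * Hypothetical compat shim: paper over a renamed kernel helper so the rest
 * of OpenZFS can call a single name on every supported kernel. The
 * HAVE_NEW_FROB_API macro would be set by a configure-time compile test;
 * both it and frob()/frob_new() are invented for illustration.
 */
#ifdef HAVE_NEW_FROB_API
#define zfs_frob(bdev)      frob_new(bdev, 0)
#else
#define zfs_frob(bdev)      frob(bdev)
#endif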

Of course, the world moves on at the other end too. There are old Linux versions that we don’t support anymore. The oldest used to be the hybrid 3.10 kernel that came with RHEL7, which is now out of support. That means removing code, which has been lovely. Linux still changes a lot, but much of the deep structural stuff was mostly settled long ago, so there’s plenty of compatibility code we can get rid of.

But we mustn’t leave FreeBSD out. We could, because they tend not to break things very often, but with FreeBSD 12 going EOL there was plenty of old code we could remove there too. And it’s not like they stand still either, so there’s still work to do wiring up improvements, which is why you can now TRIM your file-backed pools.

Keeping up with Linux updates led to a surprising issue though. If OpenZFS doesn’t work with a new Linux version on the day it’s released, people file bug reports. But of course OpenZFS doesn’t release as often as Linux, and also, until the day of release we can’t be sure we have everything working properly for the new kernel anyway - our stability and durability guarantees matter! So instead, we started shipping experimental support for new kernels in point releases, without raising the “officially supported” version. That yielded a different kind of bug report, when something didn’t work properly because our patches weren’t quite right yet.

To try and split the difference, we now detect the kernel version we’re compiling against, and abort if it’s newer than our maximum supported version. But you can pass --enable-linux-experimental to configure, which will try to compile it anyway, and will add scary warnings:

[    3.124287] ZFS: Loaded module v2.3.0-rc1, ZFS pool version 5000, ZFS filesystem version 5
[    3.124513] ZFS: Using ZFS with kernel 6.12-rc1 is EXPERIMENTAL and SERIOUS DATA LOSS may occur!
[    3.124640] ZFS: Please report your results at: https://github.com/openzfs/zfs/issues/new

Will it help people know what they’re getting? We’ll see!

More importantly though, this does show some of the effort involved in keeping OpenZFS running smoothly.

Linux IO submission rewrite 🔗

Pull requests: #15588 #16108

This one took me a full year to solve properly. It turns out that pretty much since the first port of ZFS to Linux, it’s been passing IO buffers down to the block layer a bit wrong, such that in particularly rare situations a misaligned buffer could be passed down to the disk driver, which would then reject it. It was rare enough that it had barely been noticed in over a decade.

This resulted in a rewrite of the IO submission layer that is a lot smarter and does a lot less work than the old one, to the extent that as well as fixing the alignment issue, it’s some 10% faster on heavy sync writes and as much as 150% faster on async writes. That wasn’t the purpose, but it was a cool and surprising thing to find out.

The new code also enables some interesting “zero copy” facilities, which with a little more work should be able to give us very efficient implementations of things like sendfile(). So I’m looking forward to exploring that further in the next development series.

Credit where it’s due 🔗

Pull requests: #15374 #16079 #16362 #16563

One of the real joys of working on OpenZFS for me is that it’s not a “corporate” open source project. Yes, many of the contributors do things on their employers’ time, but the point is the individuals involved, not the companies. And we keep a nice list of everyone in the AUTHORS file, no matter how big or small the contribution. I love this, because it’s no bullshit, no pecking order, just everyone ever.

The first time I went to add my own name, I noticed just how many people were missing. So I kinda got really stuck into it one evening and wrote a whole big program to mine the git history and update the AUTHORS file, and now it seems like I’m the guy who presses that button every now and again. Which I’m ok with, because I love a team effort 🏆

JSON everywhere 🔗

Pull request: #16217

And here’s the coolest one, and the one I had nothing to do with.

Since the beginning of time, grumpy sysadmins have lost their minds trying to parse output from the ZFS tools and feed it into other things. A lot of the output is basically tabular, and so not hard to feed to grep, cut, awk and all the usual friends.

The challenge has always been zpool status, which has useful device health info in it that you can’t easily get elsewhere:

$ zpool status shed
  pool: shed
 state: ONLINE
  scan: scrub repaired 0B in 04:45:02 with 0 errors on Fri Sep 27 12:03:32 2024
config:

	NAME        STATE     READ WRITE CKSUM
	shed        ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    da0     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	  mirror-2  ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0

errors: No known data errors

But now most of the zpool and zfs subcommands take a -j switch to get JSON output, so I can do this:

$ ./zpool status -j shed | jq -r '.pools[] | recurse(.vdevs[]?) | select(.vdev_type == "disk") | [.name,.path,.state] | @tsv'
da0	/dev/da0	ONLINE
da1	/dev/da1	ONLINE
da2	/dev/da2	ONLINE
da3	/dev/da3	ONLINE
da4	/dev/da4	ONLINE
da5	/dev/da5	ONLINE

I probably won’t do a whole lot with it myself, now that I’m out of the sysadmin game, but I’m really curious to see what kind of tools people make out of this.

And more! 🔗

I promised myself I’d post this today, so I won’t write more right now, but maybe I’ll do another. Regardless, there’s still loads more cool stuff inside OpenZFS 2.3, and you should try it out!