We should improve libzfs somewhat
OpenZFS has some extremely nice tools and you can do a lot with them, but they start to struggle once you need to do more complicated things with your storage, or scale your OpenZFS installation out to tens or hundreds of pools, or use it as a component of a larger product. That’s usually when people turn up looking for better ways to “program” OpenZFS, and it’s usually not long before they’re disappointed, horrified or both.
Recently in the weekly OpenZFS Production User Call we’ve been trying to pin down some common use cases and work out how to get from where we are to the point where someone can turn up looking for a useful programmatic interface to OpenZFS and actually find one. That conversation is ongoing, but we all agree the bare minimum should be to make sure that what we have already is usable, even if not particularly useful. At least, that makes it easier for people to experiment and get a feel for the shape of the problem, and helps us figure out what’s next.
BSD Fund have generously sponsored me to do a little bit of exploratory work around libzfs and its younger sibling libzfs_core, and make some progress. This fits in well with my personal mission of making OpenZFS more accessible to developers of all kinds (which I will try to write about soon). There’s some nice low-hanging fruit to be plucked here!
This post will go over what we’re trying to achieve, why it’s difficult and what I’m currently looking at.
There’s a lot here, and I’m not sure how coherent it is, so there’s every chance you’ll give up halfway through. So I’ll say the most important part up front: if you have an application you’d like to integrate more tightly with OpenZFS, please do drop in on the production user call and introduce yourself. Or, if you’re the quiet type, feel free to email me directly. And if you’re the type that has money to spare and would just be happy to see nice things, please throw some towards BSD Fund. There are a lot of really good and important ideas, wants and needs floating around, and more than anything they just need a bit of time for someone to sit down and think about them. You can help!
The kernel interface: ioctl()
Most of the time, OpenZFS lives inside the kernel. As with anything in the kernel, a program talks to it through system calls. And, as with any syscall that isn’t a “core” function of the kernel, that’s done by creating a special “control node” in /dev, then sending it “custom” commands via the catch-all ioctl() syscall.
If you haven’t seen ioctl() before, the idea is that sometimes you need to send commands that don’t “fit” into the old “everything is a file” model, where everything is representable as operations on a data stream. Sometimes you just want to say “turn off the computer” or “eject the CD” or “hang up the modem”. Over time, we realised that actually, most things aren’t data streams, and we also have this small problem that the kernel has opinions about how the “core” syscalls should work. ioctl() though is mostly passed straight through to the “device driver”, because the kernel has no idea that we aren’t actually a CD drive. So we can generally do anything we want through ioctl().
ioctl() has basically no structure. A more conventional syscall like read() has a set number of arguments, with known sizes and meanings:

ssize_t read(int fd, void *buf, size_t count);
ioctl(), on the other hand, is just:

int ioctl(int fd, unsigned long request, char *argp);

That is, an open handle on some device node, some arbitrary request number or id, and some arbitrary payload. The in-kernel receiver does something, and can return an error code if it wants. That’s all of it.
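To make that concrete, here’s a minimal, Linux-specific sketch of a classic ioctl() call: “eject the CD”. The device path is whatever your system calls its CD drive; CDROMEJECT is a real request number from the kernel headers.

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/cdrom.h>    /* CDROMEJECT */

int
main(void)
{
    /* an open handle on a device node... */
    int fd = open("/dev/cdrom", O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return (1);

    /* ...and a request number the driver understands; no payload here */
    int err = ioctl(fd, CDROMEJECT);

    close(fd);
    return (err != 0);
}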
OpenZFS has around 100 possible request numbers, which are all used to instruct OpenZFS to do things outside of its normal storage operations (which are handled through the regular data syscalls). Some of them are very simple and obviously map to zpool or zfs commands (eg ZFS_IOC_HOLD). Some are simple commands but need a very complicated payload (eg ZFS_IOC_SEND). Some have complicated functions under them (ZFS_IOC_DIFF), some are a small component of a larger function (eg ZFS_IOC_POOL_IMPORT, ZFS_IOC_POOL_CONFIGS, ZFS_IOC_POOL_TRYIMPORT and ZFS_IOC_POOL_STATS, which are all used during import), and some are tiny utility functions that are used by many functions (eg ZFS_IOC_OBJ_TO_PATH).
The payload format, zfs_cmd_t, is kinda wild. Traditionally ioctl() payloads are some small binary blob that both sides know how to interpret. As more and more commands were added, this structure grew to accommodate all the possible arguments they could take, making it big and messy and weird. More recent commands have taken steps to address this by passing an “nvlist” of args to the kernel, and receiving results back in another nvlist.
What’s an nvlist? Broadly, it’s an in-memory serialised key-value or dictionary structure, a bit like CBOR or MessagePack. It was used all over the place in old Solaris to pass structured data around, including onto disk and over the network, and so OpenZFS inherited it. It’s about what you’d expect. I don’t love it, but it’s fit for purpose and I don’t think about it most of the time. It’s fine.
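For a feel of what working with one looks like, here’s a tiny sketch using libnvpair (error checking omitted for brevity):

#include <stdio.h>
#include <libnvpair.h>

int
main(void)
{
    nvlist_t *nvl;

    /* allocate a list that enforces unique key names */
    nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0);

    /* typed key/value pairs, much like any dictionary */
    nvlist_add_string(nvl, "name", "crayon");
    nvlist_add_uint64(nvl, "version", 5000);

    /* read one back */
    uint64_t version;
    if (nvlist_lookup_uint64(nvl, "version", &version) == 0)
        printf("version: %llu\n", (unsigned long long)version);

    nvlist_free(nvl);
    return (0);
}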
Anyway, this whole edifice is kind of what you get when something is good enough for 20 years - it quietly accumulates cruft but doesn’t really get in anyone’s way. It’s not supposed to matter though, because this is not the application interface to OpenZFS. That’s something else, called libzfs.
The application interface: libzfs
libzfs is the bridge between userspace applications and the ioctl() interface into the kernel module. Its two biggest and most important users are the zpool and zfs tools that OpenZFS ships with … and it shows.
It’s certainly true that libzfs obviates the need to use ioctl() directly, and it definitely provides lots of useful facilities. But, it probably provides too many facilities, with some being too high-level, and others not high-level enough.
I wanted to put some examples in here but frankly, it’s intensely complicated to do even simple things, I’m not sure if I’ve got it right, and it’s really hard to know what I’ve missed. And this is my entire point!
Just now, I was looking at the code that implements zfs list. It’s a maze of callbacks, but the rough shape is:
- initialise libzfs
- create a property list for the wanted properties (including the name)
- open a handle to the root filesystem
- call the filesystem iterator on that handle, which loops over the immediate children of the handle and calls a callback function for each
- in the callback function, “expand” the property list for the filesystem in question
- loop over each property in the list, and depending on its type and source (property inheritance), call the right function to get its value
- print the property name and value
- call the filesystem iterator on this filesystem, to recurse into the child filesystems
- close the root filesystem handle
- destroy the property list
- shut down libzfs
Now that isn’t too bad considering that this needs to handle tens of thousands of datasets and all sorts of combinations of properties, sort order, filtering, etc. It’s certainly more than a casual programmer turning up wanting a list of datasets should have to deal with though.
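For the morbidly curious, here’s a stripped-down sketch of roughly that shape, with the property list, sorting, filtering and error handling all omitted. Note that the iterator functions have grown extra arguments in recent releases (there are now _v2 variants taking flags), so the exact signatures depend on your version:

#include <stdio.h>
#include <libzfs.h>

static int
print_cb(zfs_handle_t *zhp, void *data)
{
    char used[64];

    /* fetch the "used" property, pre-formatted for humans */
    if (zfs_prop_get(zhp, ZFS_PROP_USED, used, sizeof (used),
        NULL, NULL, 0, B_FALSE) == 0)
        printf("%s\t%s\n", zfs_get_name(zhp), used);

    /* recurse into the child filesystems */
    (void) zfs_iter_filesystems(zhp, print_cb, data);

    zfs_close(zhp);
    return (0);
}

int
main(void)
{
    libzfs_handle_t *hdl = libzfs_init();
    zfs_handle_t *root = zfs_open(hdl, "crayon", ZFS_TYPE_FILESYSTEM);

    print_cb(root, NULL);
    libzfs_fini(hdl);
    return (0);
}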
The thing is, libzfs also has lots of functions for sorting by field names, helping with table layouts (including property display widths!), managing certain kinds of user interactions (including signals), doing localisation, building error strings, and so on. To the extent that it’s “designed”, it’s designed with the needs of zpool and zfs in mind.
To be clear, that’s ok! I mean yes, the abstractions are a mess and it’s hard to use, but if it’s just a utility library for programs that OpenZFS ships, well, we can deal with that, and slowly clean it up over time. Unfortunately, we’re here as an outsider looking for something to use in our own program, and for that it’s really hard.
It has deeper problems though. One is documentation. Consider this prototype from the header:
_LIBZFS_H int zpool_import(libzfs_handle_t *, nvlist_t *, const char *,
    char *altroot);
Now sure, that’s a weak criticism, since those fields are easy to name in the header. It’d take a couple of hours to fill them all out and send a PR.
If we go to the implementation to see what’s up, we can get to the heart of the problem with libzfs as a general-purpose library:
int
zpool_import(libzfs_handle_t *hdl, nvlist_t *config, const char *newname,
    char *altroot)
{
    ...
}
hdl is obvious, and newname and altroot are understandable if you’ve ever used zpool import before. config however is the tricky one that shows the whole issue.
This is a pool configuration, and is (roughly) what zpool_import() is expecting to receive in its second arg:
$ zdb -C crayon

MOS Configuration:
        version: 5000
        name: 'crayon'
        state: 0
        txg: 4775908
        pool_guid: 16973926705435575469
        errata: 0
        hostid: 12235020
        hostname: '(none)'
        com.delphix:has_per_vdev_zaps
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 16973926705435575469
            create_txg: 4
            com.klarasystems:vdev_zap_root: 114
            children[0]:
                type: 'disk'
                id: 0
                guid: 719478801560942130
                path: '/dev/nvme0n1p2'
                whole_disk: 0
                metaslab_array: 256
                metaslab_shift: 32
                ashift: 12
                asize: 477207724032
                is_log: 0
                DTL: 118
                create_txg: 4
                com.delphix:vdev_zap_leaf: 129
                com.delphix:vdev_zap_top: 130
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
            com.klarasystems:vdev_zaps_v2
If you’ve been around OpenZFS long enough, you’ll roughly recognise this format. This is the “pretty-print” output of an nvlist_t structure, the key/value or dictionary thing we mentioned above.
Where do we get one of these? Well, there’s a few possible places. The famous zpool.cache is just this nvlist structure dumped to a file. This config is also stored on all the drives in the pool (in the “label” areas), and is how zpool import finds pools if you don’t pass it a cachefile. Or you could just create one, if you knew the right things to put in it.
How to find those? Mostly reading the source code and trying things.
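At least peeking inside an existing config is easy enough. Here’s a throwaway sketch that unpacks a zpool.cache file (the path here is the usual Linux location; adjust for your platform) and pretty-prints it with libnvpair:

#include <stdio.h>
#include <stdlib.h>
#include <libnvpair.h>

int
main(void)
{
    FILE *f = fopen("/etc/zfs/zpool.cache", "rb");
    if (f == NULL)
        return (1);

    /* slurp the whole file into memory */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *buf = malloc(size);
    fread(buf, 1, size, f);
    fclose(f);

    /* the cache file is just a packed nvlist; unpack and pretty-print */
    nvlist_t *nvl;
    if (nvlist_unpack(buf, size, &nvl, 0) == 0) {
        nvlist_print(stdout, nvl);
        nvlist_free(nvl);
    }

    free(buf);
    return (0);
}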
Now I’ll admit I’m sort of being a dick here by choosing an import function, because import is maybe the most complicated single process anywhere in OpenZFS, and it’s possibly unfair to expect any API to it to be simple. However, I do think this mostly demonstrates that libzfs is probably not the general-purpose library that people hope it would be.
There are other difficulties. libzfs has a lot of compatibility code to make sure it can continue to work against older kernel modules, at least for a short time, so that one can upgrade the OpenZFS userspace first, then reload the kernel module afterwards. It also has the problem that its header files are quite deeply entangled with OpenZFS internal headers, which can make it difficult or impossible to actually link an out-of-tree program with it (on Linux you basically can’t right now due to mismatches in struct stat64, while on FreeBSD libzfs.h is in base but the additional headers are in the system development package, so even if you have a program you can’t build it). There are some things we can do to make it at least possible to mess with it (see way back at the start), but it’s still not likely to be a good general-purpose API, or at least not for many years.
And so our curious programmer rapidly discards libzfs as an option, and if they stick around and dig a little deeper, they find libzfs_core.
The ✨NEW✨ application interface: libzfs_core
libzfs_core was started back in 2013 as an answer to all of these problems with libzfs. Matt Ahrens did a nice presentation at an Illumos meetup in 2012 [video, slides, text] outlining what was wrong with libzfs and what libzfs_core would try to bring to the table.
Basically, its purpose is to be a light shim over ioctl(). Each function within it should simply massage its args into the right shape, call the relevant ioctl(), and return the results, and that’s all. In addition, it would be a committed interface. libzfs has to change for all the reasons above, but since (in theory) the kernel interfaces would never change, only be added to, libzfs_core’s API and even ABI would never need to change.
I think the ideas behind it are right, but I think it misses the mark in a couple of important ways.
It doesn’t implement quite a few useful ioctl() commands. In particular, it’s almost all aimed at dataset management; there’s precious little in there about pool management. For example, there’s no general way to access pool properties. I think at least part of this is that it’s geared towards the newer nvlist-based commands; the older ones (like pool management) still operate on the old binary data. Of course, it wouldn’t be any issue to implement wrappers over these, but I suspect it was assumed they would be converted to the newer structure before doing that.
The real problem is more fundamental, in that I think it still doesn’t draw the lines in the right place. A 1:1 mapping between API calls and ioctl() seems nice, and gives nice properties like easy atomicity, but as I’ve said above, some of those calls do too little, and this just shifts the burden onto the application programmer.
It also means that too many implementation details leak through when things get complicated, in particular falling back to nvlists:
_LIBZFS_CORE_H int lzc_snapshot(nvlist_t *, nvlist_t *, nvlist_t **);
Again imagining we documented this, it’s actually:
int lzc_snapshot(nvlist_t *snaps, nvlist_t *props, nvlist_t **errlist)
Now don’t get me wrong, this looks like a very useful function. It can take multiple snapshots in the same call. props is a set of properties to set on each snapshot, and errlist has the result of each snapshot. Batching is generally pretty nice!
My main criticism here, again, is the overuse of nvlists. It’s not entirely bad; if you want a batching interface, you need some notion of a “batch” to add things to. But these are generic dictionary-style objects. They can’t check types, enforce args, or do much of anything really, and they’re hard to debug when you get them wrong and the kernel rejects them.
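To show what that looks like in practice, here’s a sketch of taking two snapshots in one batch. The pool and dataset names are made up, I’m assuming a NULL props list is acceptable, and most error handling is omitted:

#include <stdio.h>
#include <libnvpair.h>
#include <libzfs_core.h>

int
main(void)
{
    nvlist_t *snaps, *errlist = NULL;

    if (libzfs_core_init() != 0)
        return (1);

    /* the "batch": one boolean-flag entry per snapshot to create */
    nvlist_alloc(&snaps, NV_UNIQUE_NAME, 0);
    nvlist_add_boolean(snaps, "crayon/home@backup");
    nvlist_add_boolean(snaps, "crayon/projects@backup");

    /* one ioctl(); snapshots in the same pool are created atomically */
    int err = lzc_snapshot(snaps, NULL, &errlist);
    if (err != 0)
        fprintf(stderr, "lzc_snapshot: error %d\n", err);

    nvlist_free(snaps);
    if (errlist != NULL)
        nvlist_free(errlist);
    libzfs_core_fini();
    return (err != 0);
}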
Of the two, I’d take libzfs_core over libzfs most days, because it’s small and only has two or three core concepts. But honestly, I don’t love either of them. The specifics are different but the gist is that they make me think too hard and worry that I’m holding them wrong, and when I’m busy thinking about the actual problem I’m working on, I don’t want to be worried that I’m using my storage platform incorrectly.
The extras
There’s a handful of other programmatic interfaces to OpenZFS, and all are interesting and have ideas to bring to the table, but for now I think they’re off to the side.
pyzfs

OpenZFS ships with Python bindings to libzfs_core. I have no idea how it works, if it works. I do know that normal programmers get excited when they see it, and then disappointed when they find out it can’t do anything that libzfs_core can’t, which is many things.
channel programs

OpenZFS has a Lua interpreter inside it, with Lua APIs to a handful of internal functions. The main advantage of this mechanism is to be able to avoid the round-trip penalty when doing bulk dataset operations, as they can all be loaded onto the same transaction. I am fascinated by what might be possible under this model in the future, but in their current form they’re barely usable, and most people who find them don’t get very far with them. I have a lot of thoughts about channel programs, but that’s for another time. See zfs-program(8) for more info.
JSON output

Since 2.3 many of the zpool and zfs commands can produce output in JSON instead of the usual text/tab format. This isn’t enough for what we’re talking about here, but I’ll definitely agree that it makes a lot of things that were previously awkward much easier, with a uniform output format that can be more easily transformed and consumed by other programs. If it had existed when I was putting in my first major OpenZFS installation in 2020, it’s quite plausible that I would not have gone very far into the code and possibly never have switched careers to do OpenZFS dev full time. I’ll let you decide what kind of missed opportunity that is.
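For a taste (assuming the -j flag from 2.3 and jq on hand):

$ zfs list -j | jq .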
If not libzfs, then what?
It’s sorta complicated.
If we were starting from scratch, I at least know what I’d prototype. It would be a small, standalone C library (no libzfs or libzfs_core dependency) with a conventional noun+verb style API, because I think those are the two biggest problems we have: it’s not clear how to set up the thing, and it’s not clear how to actually use it. (If I’m honest, that’s a whole-of-OpenZFS problem, but I’m not going into that whole thing today either!)
So, a little off the top of my head, maybe a zfs list-alike might look like:
/* open the pool */
zest_pool_t *pool = zest_pool_open("crayon");

/* loop over the datasets and print out their names and usage */
zest_dataset_list_t *dslist = zest_pool_dataset_list(pool);
for (zest_dataset_t *ds = zest_dataset_list_first(dslist); ds != NULL;
    ds = zest_dataset_list_next(dslist, ds)) {
    const char *name = zest_dataset_name(ds);
    uint64_t referenced =
        zest_dataset_prop_value_u64(ds, "referenced");
    printf("%s\t%" PRIu64 "\n", name, referenced);
}
Or maybe a part of some provisioning service would create new project datasets from a snapshot template:
/* create a new user project from the latest approved template */
zest_dataset_t *ds =
    zest_pool_dataset_find(pool, "product/template/nice@latest");
zest_snapshot_t *snap = zest_snapshot_from_dataset(ds);
zest_dataset_t *clone =
    zest_snapshot_clone(snap, "product/users/sam/project");
I’m not tied to the name “zest”, so don’t bikeshed on that too hard. The better questions to ask are why I would use C at all, and why is this particular C so verbose?
As we’ve seen above, actually talking to OpenZFS at any level is difficult. Writing and maintaining a library of this kind is going to be a reasonable amount of work for someone, and that work is not likely to be repeated for every language that anyone might want to use. The answer to that, then, is bindings, and C is a lowest common denominator that every language can hook up to. More importantly, if the library is simple and uncomplicated C with clearly named types, explicit conversions, and understandable lifetimes, it’s relatively easy to create bindings without needing to jump through too many hoops.
If I did this, I’d be thinking about maintaining it outside of OpenZFS proper, tracking new versions as they come out but providing some amount of API (if not ABI) stability across updates. I’d want it to be somewhat insulated from changes to OpenZFS proper, and to work properly across all platforms that OpenZFS supports. I don’t see anything here that isn’t achievable; most of it isn’t even that hard.
However, we’re not starting from scratch. libzfs_core was born seeking to replace libzfs, and now we have both of them. Channel programs can do some things that libzfs_core might have once imagined, and now we have those too. Starting a new thing feels like a big commitment, because the worst outcome would be leaving yet another thing in a half-working state.
Now what?
For the moment, I’m working on shoring up libzfs_core. OpenZFS Production User Call regular Jan B has an actual application he’s working on (a FreeBSD jail manager) that would benefit from being able to manage snapshots and clones programmatically, and he has spent a little time working out the smallest number of headers and defines required to get it to compile. I’m trying to either remove those dependencies or make them explicit, and once that’s done, ensure that “base” OpenZFS packaging includes everything needed. #17066 was recently merged and is the first small step along this road.
I’d like to then get libzfs_core.h documented. Even if it can’t do a lot of things, it should at least be not too much effort to get something simple up and running from just reading the documentation. I would also like to write a couple of sample programs, just to show how to do basic things.
Beyond that, I’m not sure. I mean, I have a thousand great ideas I’d like to play with, and never enough time. What I’d really love is for someone else to start building something and talk about what’s working and not working for them. Maybe that will help us see how to improve what we have further, or maybe it’ll be the thing that forces a decision on making a new thing. Or maybe something else, who knows!
(っᵔ◡ᵔ)っ
There we go, 3500+ words to tell you “I am fixing some headers” 😅
But seriously. Maybe this is the opening move for a new generation of OpenZFS-enabled applications. Or maybe it’ll just make life a bit easier for those storage operators responsible for taking care of the ever-increasing amount of data in their care.
So again, if you’ve got an idea, or a use case, or a whole project proposal, or just like to pay for nice people trying to do nice things, you’d be very welcome at the weekly OpenZFS Production User Call, or you could say hello to the nice folk at BSD Fund, or just hit me up for a chat. Cheers!