Objects and object sets
Objects are OpenZFS’ conception of an arbitrarily-large chunk of data. Objects are what the “upper” levels of OpenZFS deal with; those levels use them to assemble more complicated constructs, like filesystems, volumes, snapshots, property lists, and so on. If it helps, you can imagine an object to be a “file” in OpenZFS’ internal filesystem (the DMU).
Like a file, an object has a name, which is simply a uint64_t. Also like files in a filesystem, objects can be grouped together into a named collection; that collection is called an “object set”, and is itself an object. An object set forms a unique namespace for its objects, which is to say, two different object sets can each have an object 1, and they are totally unrelated objects.
So, the full “path” of an object is the “object set id”, then the “object id”.
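As a rough mental model only (this is not how OpenZFS stores anything), the two-part “path” can be sketched as a lookup keyed by a pair of ids. The pool dictionary, the lookup helper and the payload strings are all hypothetical names made up for the illustration.

```python
# Toy model: a pool as a mapping from (object set id, object id) to an object.
# The ids and payload strings are illustrative, not read from a real pool.
pool = {
    (0, 1): "object 1 of object set 0",
    (151, 1): "object 1 of object set 151",
}


def lookup(objset_id: int, object_id: int) -> str:
    """Resolve an object by its full path: object set first, then object."""
    return pool[(objset_id, object_id)]


# Same object id, different object sets: two unrelated objects.
print(lookup(0, 1))
print(lookup(151, 1))
```

The point of the pair is that the object id alone is never enough to identify an object.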
A typical OpenZFS pool has lots of object sets. Some of these are visible to the user, and are called “datasets”. You might know these better as “filesystems”, “snapshots” and “volumes”. There’s nothing special about those, they’re just collections of objects assembled in such a way as to provide useful and familiar facilities to the user.
Every object has a description, called a “dnode”. This contains all the information about the object, most importantly, its size, type, and the location of its first block.
Now we know enough to start looking around.
Every pool has a single top-level object set, called the meta-object set or MOS. The objects in the MOS store all the housekeeping information about the pool as a whole.
Object 0 of any object set contains the table of objects within it, known as the metadnode. The data of this object is an array of dnode_phys_t. So to find object 238, OpenZFS will calculate the offset into the metadnode of that object’s dnode (238 * sizeof (dnode_phys_t)). From there we can just look at the dnode itself to see the object’s metadata, or we can follow the block pointer down to the object’s data. However, we’re not going to look too closely at the construction of a dnode here.
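The offset arithmetic can be worked through in a few lines. This is a sketch under two assumptions taken from the zdb output rather than from the source code: that a dnode_phys_t occupies 512 bytes (the dnsize column) and that the metadnode uses 16K data blocks (the dblk column for object 0). It also ignores dnodes that take more than one slot. The function name is made up for the example.

```python
DNODE_SIZE = 512               # assumed sizeof(dnode_phys_t), per the dnsize column
METADNODE_BLOCK = 16 * 1024    # assumed metadnode data block size, per dblk = 16K


def dnode_location(object_id):
    """Return (byte offset into the metadnode, block id, offset inside that block)."""
    offset = object_id * DNODE_SIZE
    block_id, offset_in_block = divmod(offset, METADNODE_BLOCK)
    return offset, block_id, offset_in_block


offset, block_id, within = dnode_location(238)
print(offset, block_id, within)   # 121856 7 7168
```

Under those assumptions, object 238’s dnode sits 7168 bytes into the eighth block (block id 7) of the metadnode’s data, which is slot 14 of the 32 dnodes that fit in a 16K block.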
# zpool create tank loop0
# zdb -dddd -N tank 0
Dataset mos [META], ID 0, cr_txg 4, 90K, 59 objects, rootbp DVA[0]=<0:22000:200> DVA[1]=<0:1022000:200> DVA[2]=<0:2012600:200>[L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=11L/11P fill=59 cksum=0000000b2d6035a7:0000048dc6bd54d5:0000f11304f3c243:0021bf3fee4fa5bc
Object lvl iblk dblk dsize dnsize lsize %full type
0 2 128K 16K 18K 512 80K 36.88 DMU dnode
dnode flags: USED_BYTES
dnode maxblkid: 4
By convention, object 1 is the start of the housekeeping information for the object set. In the MOS, that’s the “object directory”.
# zdb -dddd -N tank 1
Dataset mos [META], ID 0, cr_txg 4, 81K, 46 objects, rootbp DVA[0]=<0:ee00:200> DVA[1]=<0:100ee00:200> DVA[2]=<0:200a000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=8L/8P fill=46 cksum=00000008f52ba0bd:000003b2b818e2c7:0000c6e9eb051939:001c5bff8746c363
Object lvl iblk dblk dsize dnsize lsize %full type
1 1 128K 16K 13.5K 512 32K 100.00 object directory
dnode flags: USED_BYTES
dnode maxblkid: 1
...
features_for_read = 51
history = 60
com.delphix:vdev_zap_map = 128
root_dataset = 32
sync_bplist = 62
deflate = 1
features_for_write = 52
config = 61
feature_enabled_txg = 63
creation_version = 5000
feature_descriptions = 53
com.delphix:log_spacemap_zap = 133
free_bpobj = 41
org.illumos:checksum_salt = 62e56f861a38def0434b91c79c82826720475618395f98a3636b93621980eab9
This object, like many non-raw-data objects, is a ZAP, which is a kind of dictionary/map/directory/KV store inside a single OpenZFS object. ZAPs are a whole topic for discussion; for now it’s enough to know that they’re a list of names and values.
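To make “a list of names and values” concrete, here is a small sketch that turns a few of the “name = value” lines zdb printed above into a dictionary. It only parses zdb’s text output; it says nothing about how a ZAP is laid out on disk, and parse_zap_listing is a hypothetical helper written for this example.

```python
# A few entries copied from the object directory listing above.
ZDB_LINES = """\
features_for_read = 51
root_dataset = 32
deflate = 1
config = 61
creation_version = 5000
"""


def parse_zap_listing(text):
    """Parse 'name = value' lines into a dict, converting decimal values to int."""
    entries = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # not a name/value line
        value = value.strip()
        entries[name.strip()] = int(value) if value.isdigit() else value
    return entries


zap = parse_zap_listing(ZDB_LINES)
print(zap["features_for_read"])   # 51
```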
So here we see a few basic things needed for OpenZFS to understand how to read the pool and find things in it. Some of these, like creation_version, deflate or org.illumos:checksum_salt, are simple data items. Most, though, are object ids for other things. For example, features_for_read points to the object that describes the features OpenZFS must have available to be able to read the pool:
# zdb -dddd -N tank 51
Dataset mos [META], ID 0, cr_txg 4, 81K, 46 objects, rootbp DVA[0]=<0:ee00:200> DVA[1]=<0:100ee00:200> DVA[2]=<0:200a000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=8L/8P fill=46 cksum=00000008f52ba0bd:000003b2b818e2c7:0000c6e9eb051939:001c5bff8746c363
Object lvl iblk dblk dsize dnsize lsize %full type
51 1 128K 1.50K 1.50K 512 1.50K 100.00 zap
dnode flags: USED_BYTES
dnode maxblkid: 0
microzap: 1536 bytes, 21 entries
org.illumos:sha512 = 0
com.klarasystems:vdev_zaps_v2 = 1
com.delphix:head_errlog = 1
org.zfsonlinux:large_dnode = 0
org.freebsd:zstd_compress = 0
org.illumos:edonr = 0
com.delphix:bookmark_written = 0
com.delphix:redacted_datasets = 0
org.illumos:lz4_compress = 1
com.delphix:device_removal = 0
com.delphix:hole_birth = 1
com.datto:encryption = 0
com.delphix:extensible_dataset = 1
org.openzfs:draid = 0
com.delphix:redaction_bookmarks = 0
org.illumos:skein = 0
com.joyent:multi_vdev_crash_dump = 0
org.openzfs:blake3 = 0
org.open-zfs:large_blocks = 0
com.datto:bookmark_v2 = 0
com.delphix:embedded_data = 1
Yes, another ZAP. Anyway, pool metadata is not why we’re here either.
As noted, all user-facing datasets are object sets. We can find their ids by asking for the objsetid property.
# zfs create tank/foo
# zfs create tank/bar
# zfs list -o name,objsetid
NAME OBJSETID
tank 54
tank/bar 151
tank/foo 143
An object set id has two purposes. For one, it’s the id of a real object in the MOS:
# zdb -dddd -N tank 151
Dataset mos [META], ID 0, cr_txg 4, 91.5K, 59 objects, rootbp DVA[0]=<0:1f400:200> DVA[1]=<0:101f400:200> DVA[2]=<0:200fa00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=10L/10P fill=59 cksum=00000008e57f2b1e:000003ac2d4bfc30:0000c57ef5b7f24a:001c25f704ebd178
Object lvl iblk dblk dsize dnsize lsize %full type
151 1 128K 512 0 512 512 100.00 zap
320 bonus DSL dataset
dnode flags: USED_BYTES
dnode maxblkid: 0
dir_obj = 148
prev_snap_obj = 48
prev_snap_txg = 1
next_snap_obj = 0
snapnames_zapobj = 152
num_children = 0
userrefs_obj = 0
creation_time = Fri Jun 16 10:36:49 2023
creation_txg = 9
deadlist_obj = 153
used_bytes = 24K
compressed_bytes = 12K
uncompressed_bytes = 12K
unique = 24K
fsid_guid = 13616134740256949
guid = 7246029045750260381
flags = 4
next_clones_obj = 0
props_obj = 0
bp = DVA[0]=<0:1a600:200> DVA[1]=<0:101a600:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=9L/9P fill=6 cksum=0000000dc6a38361:0000050166658386:0000f3824c02cbcd:002035533cc05adc
microzap: 512 bytes, 2 entries
org.zfsonlinux:project_quota = 0
org.zfsonlinux:userobj_accounting = 0
This object has various “global” metadata items about the dataset, like pointers to snapshots, and also the block pointer (bp) that leads to the dataset’s own object 0.
From there, we can start to look around the dataset proper. Like the MOS, the object table is in object 0, and object 1 has the top-level metadata:
# zdb -dddd -N tank/151 1
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 24K, 6 objects, rootbp DVA[0]=<0:1a600:200> DVA[1]=<0:101a600:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=9L/9P fill=6 cksum=0000000dc6a38361:0000050166658386:0000f3824c02cbcd:002035533cc05adc
Object lvl iblk dblk dsize dnsize lsize %full type
1 1 128K 512 1K 512 512 100.00 ZFS master node
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
microzap: 512 bytes, 7 entries
utf8only = 0
normalization = 0
DELETE_QUEUE = 33
casesensitivity = 0
VERSION = 5
ROOT = 34
SA_ATTRS = 32
ROOT is the root dir, but before we go there, we need some files on disk!
# mkdir /tank/bar/somdir
# ls -l /tank/bar
total 1
drwxr-xr-x 2 root root 2 Jun 16 10:58 somdir
-rw-r--r-- 1 root root 0 Jun 16 10:58 somefile
And now we look at the root:
# zdb -dddd -N tank/151 34
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 24K, 8 objects, rootbp DVA[0]=<0:3016800:200> DVA[1]=<0:4016800:200> [L0 DMU objset]fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=268L/268P fill=8 cksum=00000010fcd80fdf:0000062e53728927:00012c01c062907c:00276a0b253a0b4e
Object lvl iblk dblk dsize dnsize lsize %full type
34 1 128K 512 0 512 512 100.00 ZFS directory
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
uid 0
gid 0
atime Fri Jun 16 10:58:52 2023
mtime Fri Jun 16 10:58:49 2023
ctime Fri Jun 16 10:58:49 2023
crtime Fri Jun 16 10:36:49 2023
gen 9
mode 40755
size 4
parent 34
links 3
pflags 840800000144
microzap: 512 bytes, 2 entries
somdir = 3 (type: Directory)
somefile = 2 (type: Regular File)
Objects have a type; this one has the “ZFS directory” type, so zdb can show us more information about it. Directories are just ZAPs again, so we get the file list and the object ids of the entries. The file attributes are held in the “bonus buffer”, which is some spare space in the dnode that can hold a bit of extra data.
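One of those attributes is easy to cross-check against the ls output. Assuming zdb prints mode in octal (the values suggest it does), Python’s standard stat module can decode it. The directory value is the one shown for object 34; the regular-file value 100644 is what zdb reports for somefile further on.

```python
import stat

dir_mode = 0o40755     # "mode 40755" reported for the root directory, object 34
file_mode = 0o100644   # "mode 100644" reported for somefile

# The high bits carry the file type, the low bits the permissions.
print(stat.S_ISDIR(dir_mode), stat.filemode(dir_mode))    # True drwxr-xr-x
print(stat.S_ISREG(file_mode), stat.filemode(file_mode))  # True -rw-r--r--
```

The decoded strings match the permission columns ls printed for the directory and the file.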
So finally we’re getting to real files, but let’s get some data on disk first.
# dd if=/dev/random of=/tank/bar/somefile bs=10K count=1
1+0 records in
1+0 records out
10240 bytes (10 kB, 10 KiB) copied, 0.000787502 s, 13.0 MB/s
Oh, and here’s a little fun fact: the object id is exposed to the stat() system call as st_ino:
# stat -c %i /tank/bar/somefile
2
# zdb -ddddd -N tank/151 2
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 34K, 8 objects, rootbp DVA[0]=<0:301d000:200> DVA[1]=<0:401a800:200> [L0 DMU objset]fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=337L/337P fill=8 cksum=0000000f0938eeb9:000005bd88a01f1d:0001229e43516c8a:0027967698e37bf5
Object lvl iblk dblk dsize dnsize lsize %full type
2 1 128K 10K 10K 512 10K 100.00 ZFS plain file
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 0
path /somefile
uid 0
gid 0
atime Fri Jun 16 10:58:44 2023
mtime Fri Jun 16 11:04:44 2023
ctime Fri Jun 16 11:04:44 2023
crtime Fri Jun 16 10:58:44 2023
gen 266
mode 100644
size 10240
parent 34
links 1
pflags 840800000004
Indirect blocks:
0 L0 0:3019000:2800 2800L/2800P F=1 B=337/337 cksum=0000050bfb2a9f12:001962adad83a241:55148c3aa6669101:8c4bdbff5fbf271c
segment [0000000000000000, 0000000000002800) size 10K
With an extra -d to zdb, we get the block list as well, and we can start to compare the dnode with the file attributes. The dnode is listing the block size (dblk) as 10K, which of course matches the file size. This is how OpenZFS implements the well-known “variable-size first block” - it just sets the object’s block size to whatever the file size is.
If we add enough data to span more than one block though, we see:
# dd if=/dev/random of=/tank/bar/somefile bs=129K count=1
1+0 records in
1+0 records out
132096 bytes (132 kB, 129 KiB) copied, 0.00144645 s, 91.3 MB/s
# zdb -ddddd -N tank/151 2
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 34K, 8 objects, rootbp DVA[0]=<0:301d000:200> DVA[1]=<0:401a800:200> [L0 DMU objset]fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=337L/337P fill=8 cksum=0000000f0938eeb9:000005bd88a01f1d:0001229e43516c8a:0027967698e37bf5
Object lvl iblk dblk dsize dnsize lsize %full type
2 2 128K 128K 132K 512 256K 100.00 ZFS plain file
176 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
dnode maxblkid: 1
path /somefile
uid 0
gid 0
atime Fri Jun 16 11:11:41 2023
mtime Fri Jun 16 11:11:41 2023
ctime Fri Jun 16 11:11:41 2023
crtime Fri Jun 16 11:11:41 2023
gen 16
mode 100644
size 132096
parent 34
links 1
pflags 840800000004
Indirect blocks:
0 L1 0:42a00:400 20000L/400P F=2 B=16/16 cksum=0000008a16d22a53:00005941257622e3:001ed55dc790b913:0785670864a2b64e
0 L0 0:22200:20000 20000L/20000P F=1 B=16/16 cksum=00003fb7d3da9921:0fe929a605532995:461bf6a18c49152f:c9f65eecc99f59f4
20000 L0 0:42200:800 20000L/800P F=1 B=16/16 cksum=000001038bd5bfd0:0001219fabd78a82:00bb071da3cd67c3:5a1d0311cf25b298
segment [0000000000000000, 0000000000040000) size 256K
This dataset has the default recordsize=128K, so a 129K file spans two blocks. Both blocks have a logical size of 128K, so the dnode reports an lsize of 256K. This is an important observation: objects are sequences of blocks, not of bytes. There’s no in-between.
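The two cases seen so far (one block sized to the file, then whole recordsize blocks) can be summarised in a small calculation. This is a simplified sketch for non-empty files, not the actual allocation logic: the function name is invented, and rounding the single block up to a multiple of 512 bytes is an assumption that happens not to matter for the 10K example.

```python
MIN_BLOCK = 512   # assumption: block sizes are multiples of 512 bytes


def object_layout(file_size, recordsize=128 * 1024):
    """Return (block size, number of blocks, lsize) for a non-empty file."""
    if file_size <= recordsize:
        # Single block, sized to the file (rounded up to the assumed minimum unit).
        block_size = -(-file_size // MIN_BLOCK) * MIN_BLOCK
        return block_size, 1, block_size
    # Past one record, every block is a full recordsize block.
    blocks = -(-file_size // recordsize)   # ceiling division
    return recordsize, blocks, blocks * recordsize


print(object_layout(10 * 1024))    # (10240, 1, 10240)
print(object_layout(129 * 1024))   # (131072, 2, 262144)
```

The second result lines up with the zdb output: dblk of 128K, two L0 blocks, and an lsize of 256K for a file whose size attribute is 132096 bytes.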
The file size is just a regular file attribute, which is an “application specific” thing. The object store doesn’t know or care about it; it’s the filesystem layer (ZPL) that will use it to get the right amount of data from the object and give that back to userspace.
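A toy version of that division of labour: the “object” below is just a list of equal-sized blocks, and a stand-in for the filesystem layer trims the result to the size attribute. zpl_read and the zero padding of the last block are illustrative assumptions, not the real ZPL read path.

```python
def zpl_read(blocks, file_size):
    """Join an object's blocks and hand back only file_size bytes."""
    return b"".join(blocks)[:file_size]


block_size = 128 * 1024
file_size = 129 * 1024

# 129K of payload stored in two 128K blocks; the rest of the last block is padding.
payload = bytes(range(256)) * (file_size // 256)
padded = payload + bytes(2 * block_size - file_size)
blocks = [padded[:block_size], padded[block_size:]]

data = zpl_read(blocks, file_size)
print(len(b"".join(blocks)), len(data))   # 262144 132096
```

The object holds 256K of block data either way; only the layer that knows the size attribute can tell where the file ends.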
And that’s about all there is to say about objects really. All the interesting stuff is about how they’re interpreted, which is all worked out from what’s in the dnode.