Objects and object sets

Objects are OpenZFS’ conception of an arbitrarily-large chunk of data. Objects are what the “upper” levels of OpenZFS deal with, using them to assemble more complicated constructs, like filesystems, volumes, snapshots, property lists, and so on. If it helps, you can imagine an object to be a “file” in OpenZFS’ internal filesystem (the DMU).

Like a file, an object has a name, which is simply a uint64_t. Also like files, objects can be grouped together into a named collection; that collection is called an “object set”, and is itself an object. An object set forms a unique namespace for its objects, which is to say, two different object sets can each have an object 1, and those are totally unrelated objects.

So, the full “path” of an object is the “object set id”, then the “object id”.

A typical OpenZFS pool has lots of object sets. Some of these are visible to the user, and are called “datasets”. You might know these better as “filesystems”, “snapshots” and “volumes”. There’s nothing special about those; they’re just collections of objects assembled in such a way as to provide useful and familiar facilities to the user.

Every object has a description, called a “dnode”. This contains all the information about the object, most importantly, its size, type, and the location of its first block.
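
To make that a little more concrete, here’s a heavily trimmed sketch of the kind of thing a dnode carries. It’s loosely modelled on dnode_phys_t from the OpenZFS source, but most fields are omitted and the layout here is not the real one, so treat it as a picture rather than a reference:

#include <stdint.h>

/*
 * Illustrative sketch only; the real on-disk structure is dnode_phys_t
 * in the OpenZFS source, and it has many more fields than this.
 */
typedef struct {
    uint8_t  dn_type;         /* what kind of object this is */
    uint8_t  dn_nlevels;      /* height of the object's block tree */
    uint8_t  dn_nblkptr;      /* how many block pointers follow */
    uint16_t dn_datablkszsec; /* data block size, in 512-byte sectors */
    uint64_t dn_maxblkid;     /* id of the last block in the object */
    uint64_t dn_used;         /* space consumed on disk */
    /* ... followed by the block pointer(s) locating the object's data ... */
} dnode_sketch_t;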

Now we know enough to start looking around.

Every pool has a single top-level object set, called the meta-object set or MOS. The objects in the MOS store all the housekeeping information about the pool as a whole.

Object 0 of any object set contains the table of objects within it, known as the metadnode. The data of this object is an array of dnode_phys_t. So to find object 238, OpenZFS will calculate the offset into the metadnode of that object’s dnode (238 * sizeof (dnode_phys_t)). From there we can just look at the dnode itself to see the object’s metadata, or we can follow the block pointer down to the object’s data. However, we’re not going to look too closely at the construction of a dnode here.
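
As a rough sketch of that arithmetic (assuming the traditional 512-byte dnode; pools using the large_dnode feature can have bigger ones), locating object 238’s dnode inside object 0 looks something like this:

#include <stdint.h>
#include <stdio.h>

/* Assumed here: the classic 512-byte dnode_phys_t. */
#define DNODE_SIZE 512

int
main(void)
{
    uint64_t object = 238;
    uint64_t offset = object * DNODE_SIZE;  /* byte offset into object 0's data */
    uint64_t dblk = 16384;                  /* 16K metadnode blocks, as zdb shows below */

    printf("object %llu: byte offset %llu = block %llu, slot %llu\n",
        (unsigned long long)object, (unsigned long long)offset,
        (unsigned long long)(offset / dblk),
        (unsigned long long)((offset % dblk) / DNODE_SIZE));
    return (0);
}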

# zpool create tank loop0
# zdb -dddd -N tank 0
Dataset mos [META], ID 0, cr_txg 4, 90K, 59 objects, rootbp DVA[0]=<0:22000:200> DVA[1]=<0:1022000:200> DVA[2]=<0:2012600:200>[L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=11L/11P fill=59 cksum=0000000b2d6035a7:0000048dc6bd54d5:0000f11304f3c243:0021bf3fee4fa5bc

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    2   128K    16K    18K     512    80K   36.88  DMU dnode
	dnode flags: USED_BYTES
	dnode maxblkid: 4

By convention, object 1 is the start of the housekeeping information for the object set. In the MOS, that’s the “object directory”.

# zdb -dddd -N tank 1
Dataset mos [META], ID 0, cr_txg 4, 81K, 46 objects, rootbp DVA[0]=<0:ee00:200> DVA[1]=<0:100ee00:200> DVA[2]=<0:200a000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=8L/8P fill=46 cksum=00000008f52ba0bd:000003b2b818e2c7:0000c6e9eb051939:001c5bff8746c363

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    1   128K    16K  13.5K     512    32K  100.00  object directory
	dnode flags: USED_BYTES
	dnode maxblkid: 1
...
		features_for_read = 51
		history = 60
		com.delphix:vdev_zap_map = 128
		root_dataset = 32
		sync_bplist = 62
		deflate = 1
		features_for_write = 52
		config = 61
		feature_enabled_txg = 63
		creation_version = 5000
		feature_descriptions = 53
		com.delphix:log_spacemap_zap = 133
		free_bpobj = 41
		org.illumos:checksum_salt = 62e56f861a38def0434b91c79c82826720475618395f98a3636b93621980eab9

This object, like many non-raw-data objects, is a ZAP, which is a kind of dictionary/map/directory/KV store inside a single OpenZFS object. ZAPs are a whole topic of discussion in their own right; for now it’s enough to know that they’re a list of names and values.
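
To give a rough feel for what “a list of names and values” looks like on disk, here’s a sketch of a single entry in the simplest ZAP form, the “microzap” that zdb reports below. It’s loosely modelled on mzap_ent_phys_t; the names and sizes here are illustrative, and larger “fat” ZAPs are laid out quite differently:

#include <stdint.h>

/* Sketch of one microzap entry; not the authoritative definition. */
typedef struct {
    uint64_t mze_value;    /* the value: often another object id */
    uint32_t mze_cd;       /* collision differentiator for the hashed name */
    uint16_t mze_pad;
    char     mze_name[50]; /* the name, e.g. "root_dataset" */
} mzap_ent_sketch_t;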

So here we see a few basic things needed for OpenZFS to understand how to read the pool and find things in it. Some of these, like creation_version, deflate or org.illumos:checksum_salt, are simple data items. Most, though, are the object ids of other things. For example, features_for_read points to the object that describes the features OpenZFS must have available to be able to read the pool:

# zdb -dddd -N tank 51
Dataset mos [META], ID 0, cr_txg 4, 81K, 46 objects, rootbp DVA[0]=<0:ee00:200> DVA[1]=<0:100ee00:200> DVA[2]=<0:200a000:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=8L/8P fill=46 cksum=00000008f52ba0bd:000003b2b818e2c7:0000c6e9eb051939:001c5bff8746c363

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        51    1   128K  1.50K  1.50K     512  1.50K  100.00  zap
	dnode flags: USED_BYTES
	dnode maxblkid: 0
	microzap: 1536 bytes, 21 entries

		org.illumos:sha512 = 0
		com.klarasystems:vdev_zaps_v2 = 1
		com.delphix:head_errlog = 1
		org.zfsonlinux:large_dnode = 0
		org.freebsd:zstd_compress = 0
		org.illumos:edonr = 0
		com.delphix:bookmark_written = 0
		com.delphix:redacted_datasets = 0
		org.illumos:lz4_compress = 1
		com.delphix:device_removal = 0
		com.delphix:hole_birth = 1
		com.datto:encryption = 0
		com.delphix:extensible_dataset = 1
		org.openzfs:draid = 0
		com.delphix:redaction_bookmarks = 0
		org.illumos:skein = 0
		com.joyent:multi_vdev_crash_dump = 0
		org.openzfs:blake3 = 0
		org.open-zfs:large_blocks = 0
		com.datto:bookmark_v2 = 0
		com.delphix:embedded_data = 1

Yes, another ZAP. Anyway, pool metadata is not why we’re here either.

As noted, all user-facing datasets are object sets. We can find their ids by asking for the objsetid property.

# zfs create tank/foo
# zfs create tank/bar
# zfs list -o name,objsetid
NAME      OBJSETID
tank            54
tank/bar       151
tank/foo       143

These object set ids do double duty. First, each one is the id of a real object in the MOS:

# zdb -dddd -N tank 151
Dataset mos [META], ID 0, cr_txg 4, 91.5K, 59 objects, rootbp DVA[0]=<0:1f400:200> DVA[1]=<0:101f400:200> DVA[2]=<0:200fa00:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=10L/10P fill=59 cksum=00000008e57f2b1e:000003ac2d4bfc30:0000c57ef5b7f24a:001c25f704ebd178

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       151    1   128K    512      0     512    512  100.00  zap
                                               320   bonus  DSL dataset
	dnode flags: USED_BYTES
	dnode maxblkid: 0
		dir_obj = 148
		prev_snap_obj = 48
		prev_snap_txg = 1
		next_snap_obj = 0
		snapnames_zapobj = 152
		num_children = 0
		userrefs_obj = 0
		creation_time = Fri Jun 16 10:36:49 2023
		creation_txg = 9
		deadlist_obj = 153
		used_bytes = 24K
		compressed_bytes = 12K
		uncompressed_bytes = 12K
		unique = 24K
		fsid_guid = 13616134740256949
		guid = 7246029045750260381
		flags = 4
		next_clones_obj = 0
		props_obj = 0
		bp = DVA[0]=<0:1a600:200> DVA[1]=<0:101a600:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=9L/9P fill=6 cksum=0000000dc6a38361:0000050166658386:0000f3824c02cbcd:002035533cc05adc
	microzap: 512 bytes, 2 entries

		org.zfsonlinux:project_quota = 0
		org.zfsonlinux:userobj_accounting = 0

This object holds various “global” metadata items about the dataset, like pointers to snapshots, and also, in bp, the location of the dataset’s own object set (which is where its object 0 lives).
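
For a feel of what’s in that bonus buffer, here’s a trimmed-down sketch loosely modelled on dsl_dataset_phys_t, with field names matching the zdb output above. Most fields are omitted and this is not the real layout:

#include <stdint.h>

/* Illustrative sketch only; the real structure is dsl_dataset_phys_t. */
typedef struct {
    uint64_t ds_dir_obj;          /* "dir_obj": the owning DSL directory */
    uint64_t ds_prev_snap_obj;    /* "prev_snap_obj": most recent snapshot */
    uint64_t ds_snapnames_zapobj; /* "snapnames_zapobj": snapshot name ZAP */
    uint64_t ds_creation_txg;     /* "creation_txg" */
    uint64_t ds_deadlist_obj;     /* "deadlist_obj" */
    uint64_t ds_guid;             /* "guid" */
    /* ... and a block pointer ("bp") locating the dataset's object set ... */
} dsl_dataset_sketch_t;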

From there, we can start to look around the dataset proper; this is the second job of that id, naming the object set itself. As in the MOS, the object table is in object 0, and object 1 has the top-level metadata:

# zdb -dddd -N tank/151 1
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 24K, 6 objects, rootbp DVA[0]=<0:1a600:200> DVA[1]=<0:101a600:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=9L/9P fill=6 cksum=0000000dc6a38361:0000050166658386:0000f3824c02cbcd:002035533cc05adc

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    1   128K    512     1K     512    512  100.00  ZFS master node
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
	dnode maxblkid: 0
	microzap: 512 bytes, 7 entries

		utf8only = 0
		normalization = 0
		DELETE_QUEUE = 33
		casesensitivity = 0
		VERSION = 5
		ROOT = 34
		SA_ATTRS = 32

ROOT is the root dir, but before we go there, we need some files on disk!

# mkdir /tank/bar/somdir
# touch /tank/bar/somefile
# ls -l /tank/bar
total 1
drwxr-xr-x 2 root root 2 Jun 16 10:58 somdir
-rw-r--r-- 1 root root 0 Jun 16 10:58 somefile

And now we look at the root:

# zdb -dddd -N tank/151 34
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 24K, 8 objects, rootbp DVA[0]=<0:3016800:200> DVA[1]=<0:4016800:200> [L0 DMU objset]fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=268L/268P fill=8 cksum=00000010fcd80fdf:0000062e53728927:00012c01c062907c:00276a0b253a0b4e

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        34    1   128K    512      0     512    512  100.00  ZFS directory
                                               176   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
	dnode maxblkid: 0
	uid     0
	gid     0
	atime	Fri Jun 16 10:58:52 2023
	mtime	Fri Jun 16 10:58:49 2023
	ctime	Fri Jun 16 10:58:49 2023
	crtime	Fri Jun 16 10:36:49 2023
	gen	9
	mode	40755
	size	4
	parent	34
	links	3
	pflags	840800000144
	microzap: 512 bytes, 2 entries

		somdir = 3 (type: Directory)
		somefile = 2 (type: Regular File)

Objects have a type; this one has the “ZFS directory” type, so zdb can show us more information about it. Directories are just ZAPs again, so we get the file list and their object ids. The file attributes are held in the “bonus buffer”, which is some spare space in the dnode that can hold a bit of extra data.
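
Those directory ZAP values pack both pieces of information into a single 64-bit integer. Here’s a small sketch of the decoding; the split mirrors the ZFS_DIRENT_OBJ and ZFS_DIRENT_TYPE macros in the OpenZFS source, and the sample value is constructed for illustration rather than read from this pool:

#include <stdint.h>
#include <stdio.h>

/* Low 48 bits: object id. Top 4 bits: a type code. */
static uint64_t
dirent_obj(uint64_t de)
{
    return (de & ((1ULL << 48) - 1));
}

static unsigned
dirent_type(uint64_t de)
{
    return ((unsigned)(de >> 60) & 0xf);
}

int
main(void)
{
    /* hypothetical entry: a directory (type 4) pointing at object 3 */
    uint64_t de = ((uint64_t)4 << 60) | 3;

    printf("object %llu, type %u\n",
        (unsigned long long)dirent_obj(de), dirent_type(de));
    return (0);
}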

So finally we’re getting to real files, but let’s get some data on disk first.

# dd if=/dev/random of=/tank/bar/somefile bs=10K count=1
1+0 records in
1+0 records out
10240 bytes (10 kB, 10 KiB) copied, 0.000787502 s, 13.0 MB/s

Oh, and here’s a little fun fact: the object id is exposed to the stat() system call as st_ino:

# stat -c %i /tank/bar/somefile
2
# zdb -ddddd -N tank/151 2
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 34K, 8 objects, rootbp DVA[0]=<0:301d000:200> DVA[1]=<0:401a800:200> [L0 DMU objset]fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=337L/337P fill=8 cksum=0000000f0938eeb9:000005bd88a01f1d:0001229e43516c8a:0027967698e37bf5

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K    10K    10K     512    10K  100.00  ZFS plain file
                                               176   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
	dnode maxblkid: 0
	path	/somefile
	uid     0
	gid     0
	atime	Fri Jun 16 10:58:44 2023
	mtime	Fri Jun 16 11:04:44 2023
	ctime	Fri Jun 16 11:04:44 2023
	crtime	Fri Jun 16 10:58:44 2023
	gen	266
	mode	100644
	size	10240
	parent	34
	links	1
	pflags	840800000004
Indirect blocks:
               0 L0 0:3019000:2800 2800L/2800P F=1 B=337/337 cksum=0000050bfb2a9f12:001962adad83a241:55148c3aa6669101:8c4bdbff5fbf271c

		segment [0000000000000000, 0000000000002800) size   10K

With an extra -d to zdb, we get the block list as well, and we can start to compare the dnode with the file attributes. The dnode lists the block size (dblk) as 10K, which of course matches the file size. This is how OpenZFS implements the well-known “variable-size first block”: it just sets the object’s block size to whatever the file size is, at least while the file still fits in a single block.
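
Here’s a sketch of that rule. This is my own simplification, assuming the block size is rounded up to the 512-byte minimum block size and capped at the dataset’s recordsize; the real logic lives in the DMU:

#include <stdint.h>
#include <stdio.h>

/* Sketch: block size for a file that may still fit in one block. */
static uint64_t
first_block_size(uint64_t file_size, uint64_t recordsize)
{
    uint64_t bs = (file_size + 511) & ~511ULL; /* round up to 512 bytes */

    if (bs < 512)
        bs = 512;
    if (bs > recordsize)
        bs = recordsize; /* past this, the file spans multiple blocks */
    return (bs);
}

int
main(void)
{
    /* the 10K file above gets a single 10K block */
    printf("10240 byte file -> %llu byte block\n",
        (unsigned long long)first_block_size(10240, 131072));
    return (0);
}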

If we add enough data to span more than one block though, we see:

# dd if=/dev/random of=/tank/bar/somefile bs=129K count=1
1+0 records in
1+0 records out
132096 bytes (132 kB, 129 KiB) copied, 0.00144645 s, 91.3 MB/s
# zdb -ddddd -N tank/151 2
Dataset tank/bar [ZPL], ID 151, cr_txg 9, 34K, 8 objects, rootbp DVA[0]=<0:301d000:200> DVA[1]=<0:401a800:200> [L0 DMU objset]fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=337L/337P fill=8 cksum=0000000f0938eeb9:000005bd88a01f1d:0001229e43516c8a:0027967698e37bf5

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K   132K     512   256K  100.00  ZFS plain file
                                               176   bonus  System attributes
	dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
	dnode maxblkid: 1
	path	/somefile
	uid     0
	gid     0
	atime	Fri Jun 16 11:11:41 2023
	mtime	Fri Jun 16 11:11:41 2023
	ctime	Fri Jun 16 11:11:41 2023
	crtime	Fri Jun 16 11:11:41 2023
	gen	16
	mode	100644
	size	132096
	parent	34
	links	1
	pflags	840800000004
Indirect blocks:
               0 L1  0:42a00:400 20000L/400P F=2 B=16/16 cksum=0000008a16d22a53:00005941257622e3:001ed55dc790b913:0785670864a2b64e
               0  L0 0:22200:20000 20000L/20000P F=1 B=16/16 cksum=00003fb7d3da9921:0fe929a605532995:461bf6a18c49152f:c9f65eecc99f59f4
           20000  L0 0:42200:800 20000L/800P F=1 B=16/16 cksum=000001038bd5bfd0:0001219fabd78a82:00bb071da3cd67c3:5a1d0311cf25b298

		segment [0000000000000000, 0000000000040000) size  256K

This dataset has the default recordsize=128K, so a 129K file spans two blocks. Both blocks have a logical size of 128K, so the dnode reports an lsize of 256K. This is an important observation: objects are sequences of blocks, not of bytes. There’s no in-between.

The file size is just a regular file attribute, an “application specific” thing as far as the object store is concerned. The object store doesn’t know or care about it; it’s the filesystem layer (ZPL) that uses it to take the right amount of data from the object and give it back to userspace.
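
To put numbers on that, here’s a quick sketch using the figures from the zdb output above; the variable names are just local to the example, not anything from OpenZFS:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t blocksize = 128 * 1024; /* dblk: the 128K recordsize */
    uint64_t maxblkid = 1;           /* dnode maxblkid: 1, so two blocks */
    uint64_t file_size = 132096;     /* the "size" attribute, known to the ZPL */

    uint64_t lsize = (maxblkid + 1) * blocksize;      /* 256K of block space */
    uint64_t tail = file_size - maxblkid * blocksize; /* bytes used in the last block */

    printf("object holds %llu bytes of blocks; the file uses %llu bytes of the last one\n",
        (unsigned long long)lsize, (unsigned long long)tail);
    return (0);
}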

And that’s about all there is to say about objects really. All the interesting stuff is about how they’re interpreted, which is all worked out from what’s in the dnode.