27 October, 2024

Rob Norris

OpenZFS deduplication is good now and you shouldn't use it

OpenZFS 2.3.0 will be released any day now, and it includes the new “Fast Dedup” feature. My team at Klara spent many months in 2023 and 2024 working on it, and we reckon it’s pretty good, a huge step up from the old dedup as well as being a solid base for further improvements.

I’ve been watching various forums and mailing lists since it was announced, and the thing I kept seeing was people saying something like “it has the same problems as the old dedup; needs too much memory, nukes your performance”. While that was true (ish), and is now significantly less true, the real problem is that this just repeating the same old non-information that they probably heard from someone else repeating it.

I don’t blame anyone really; it is true that dedup has been extremely challenging to get the best out of, it’s very difficult to find good information about using it well, and “don’t use it” was and remains almost certainly the right answer. But, with this being the first time in almost two decades that dedup has been worth even considering, I want to get some fresh information out there about what dedup is, how it worked traditionally and why it was usually bad, what we changed with fast dedup, and why it’s still probably not the thing you want.

What even is dedup?

Dedup can be easily described in a sentence.

When OpenZFS prepares to write some data to disk, if that data is already on disk, don’t do the write but instead, add a reference to the existing copy.

The challenge is all in how you determine whether or not the data is already on disk, and knowing where on disk it is. The reason it’s challenging is that that information has to be stored and retrieved, which is additional IO that we didn’t have to do before, and that IO can add surprising amounts of overhead!

This stored information is the “dedup table”. Conceptually, it’s hashtable, with the data checksum as the “key” and the on-disk location and refcount as the “value”. It’s stored in the pool as part of the pool metadata, that is, it’s considered “structural” pool data, not user data.

How does dedup work?

When dedup is enabled, the “write” IO path is modified. As normal, a data block is prepared by the DMU and handed to the SPA to be written to disk. Encryption and compression are performed as normal and then the checksum is calculated.

Without dedup, the metaslab allocator is called to request space on the pool to store the block, and the locations (DVAs) are returned and copied into the block pointer. When dedup is enabled, OpenZFS instead looks up the checksum in the dedup table. If it doesn’t find it, it calls out to the metaslab allocator as normal, gets fresh DVAs, fills the block and lets the IO through to be written to disks as normal, and then creates a new dedup table entry with the checksum, DVAs and the refcount set to 1. On the other hand, if it does find it, it copies the DVAs from the the value into the block pointer and returns the writing IO as “completed” and then increments the refcount.

Blocks allocated with dedup enabled have a special D flag set on the block pointer. This is to assist when it comes time to free the block. The “free” IO path is similarly modified to check for the D flag. If it exists, the same dedup table lookup happens, and the refcount is decremented. If the refcount is non-zero, the IO is returned as “completed”, but if it reaches zero, then the last “copy” of the block is being freed, so the dedup table entry is deleted and the metaslab allocator is called to deallocate the space.

So all this is working, in that OpenZFS is avoiding writing multiple copies of the same data. The downside is that every single write and free operation requires a lookup and a then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.

It should be clear then that any dedup system worth using needs to save more in “true” space and IO than it spends on the overhead of managing the table. And this is the fundamental issue with traditional dedup: these overheads are so outrageous that you are unlikely to ever get them back except on rare and specific workloads.

Why is traditional dedup so bad?

All of the detail of dedup is in how the table is stored, and how it interacts with the IO pipeline. There’s three main categories of problem with the traditional setup:

the construction and storage of the dedup table itself
the overheads required to accumulate and stage changes to the dedup table
the problem of “unique” entries in the table

The dedup table

Traditional dedup implemented the dedup table in probably the simplest way that might work: it just hooked up the standard OpenZFS on-disk hashtable object and called it a day. This object type is a “ZAP”, and it’s used throughout OpenZFS for file directories, property lists and internal housekeeping. It’s an entirely reasonable choice. It’s also really not well suited to an application like dedup.

A ZAP is a fairly complicated structure, and I’m not going to get into it here. For our purposes, it’s enough to know that each data block in a ZAP object is an array of fixed-size “chunks”, with a single key/value consuming as many chunks as needed to hold the key, the data, and a header describing how the chunks are being used.

A dedup entry has a 40-byte key. The value part can be up to 256 bytes, however this is compressed before storing it, so let’s assume a common case of 64 bytes to actually stored. Each chunk inside the ZAP is 24 bytes, and can contain either the header, or up to 21 bytes of key or value data. All together, we’re looking at ceil(40/21) + ceil(64/21) + 1 == 7 chunks per entry. A typical dedup ZAP block is 32K, which has space for 1320 chunks (ZAP blocks themselves have their own header describing the chunks). So a single dedup block has space for 1320/7 = 188 “typical” entries.

We could certainly create a better format tailored to storing dedup entries, but the format is not the immediate issue here. The real problem is one present throughout OpenZFS wherever a data block is carrying an array of unrelated items: amplification. OpenZFS never writes partial blocks, and it never overwrites a block in place. So if we want to update a single dedup entry, we need to load the entire block from disk, modify just the bit we want, and write it back out, in full, as a brand-new block. And then the new block pointer needs to be written to an indirect block, and its new block pointer to another indirect, or the dnode, and so on up and up to the top of the tree. This is of course no different to any other OpenZFS write, and further up the tree we go the more that overhead is amortised across writes from the entire pool. But within the dedup ZAP, it’s a read-modify-write cycle for every single block written, because at minimum we have to bump a refcount.

That’s just a single entry update. If two writes were done within the same transaction, then that’s almost certain to be two different ZAP blocks we need to do that read-copy-write change on. Dedup mandates a cryptographically-strong and collision-resistant checksum to use as the key, which means the chance of any two arbitrary checksums falling close enough to land in the same ZAP block is small.

This is where the old recommendations suggesting that dedup requires enormous amounts of RAM ultimately come from. Reading a dedup table is like reading any other data in OpenZFS: it gets cached in the ARC. If you have enough RAM such that the ARC never needs to evict any part of the dedup table, then you can largely cut out the entire read part of the table update.

This is also where the rarely-seen dedup vdev class can help. If you add a sufficiently large and fast dedup vdev to your pool, then you may be able to reduce your memory requirements a little. This still ends up being a challenging build, because at scales that make dedup worthwhile, you really need to build that vdev out of something large enough to hold the entire table, and fast enough the overhead of not being memory is still workable. Multi-terabyte NVMe devices are great if you can afford them, but it’s not for the faint of heart or light of wallet.

The live entry list

There is another significant memory use in traditional dedup, that isn’t as well known and has no good way to balance it against other factors.

Every write in OpenZFS is assigned to a “transaction”, identified by a numerical “transaction id”. Data is written out to disk as it becomes ready, then every so often that transaction is closed and all the metadata for the transaction, which includes dedup table updates, all those block pointer updates mentioned above, and various other bits of housekeeping are all written down. By default this happens no more than 5 seconds since the last time, but in practice it’s when there’s a gap in userspace activity.

Imagine you wrote, say, five instances of the same data at the same moment, to a dataset with dedup enabled. Imagine also that this is brand-new data, not currently in the dedup table. You would want this to only be written once, and a dedup table entry for this block to have a refcount of 5.

Because these are data writes, they begin “immediately”. But if you recall the IO path above, each one needs to look up the dedup table first to decide if it should really be written, or just bump the refcount. The dedup table lookup function will be called five times, all would end up trying to read the relevant part of the dedup ZAP (though the ARC would reduce it to one physical read), and all would discover that the entry doesn’t exist, and so let the writes through. Finally, all would try to create a new entry with refcount 1. We end up with some distinctly un-duplicated data, and a dedup table that has no reflection on reality.

So instead, OpenZFS keeps an in-memory list of “live” entries. These are held entirely in memory, and keep track of entries created or modified on this transaction. The dedup table lookup function starts by checking this list for the requested entry. If it’s there, the live entry refcount is bumped and it is returned. If it’s not there, then it will create a new live entry, flag it as “in progress”, then it will call down to the ZAP layer to get the dedup entry proper. When that entry comes back, it unpacks it into the live entry, flags it as “ready” and then returns it. Meanwhile, the other write threads will arrive, look up the live entry, see that it’s “in progress”, set up to be woken when it’s ready, and go to sleep. When they get woken, they will see that the flag has changed to “ready”, bump the refcount and return it.

Then, at the end of transaction, the live entry list is walked, and the relevant details are copied into the dedup ZAP. Because there’s one and only one live entry for every checksum, each dedup ZAP entry only gets one update. We’re also applying all of the changes all at once, which is our best chance to be updating multiple entries within the same block.

Overall, this is a reasonable model, and matches the rough model everywhere else in OpenZFS: do all the data work during the transaction, and when the transaction closes, resolve all the associated metadata changes and write them all out. If you’ve ever heard an OpenZFS developer talk about “open” and “syncing” contexts, this is what they’re talking about.

The problem? These live entries are enormous: 424 bytes each. The raw entry that we load from and save to the ZAP is 296 bytes (bigger than the 40+~64 bytes stored in the ZAP, because this version is uncompressed), which is a problem on its own. However there is also 128 bytes of housekeeping stuff (like the “in progress” and “ready” flags and associated locks). It doesn’t take long for this list to grow very large, and although it’s cleared out at the end of every transaction, the peaks can be pretty high.

And this is kernel slab memory, not ARC memory. It can’t be reclaimed when the system is under pressure. There’s nothing you can tune or configure to bring this down.

It is a little more situational, as it will only grow in proportion to the amount of different things you’re writing each transaction. That’s little comfort though, as you’re only using dedup if you have a lot to write! So in the end, you still need a ton of memory if you want to actually make use of your dedup table.

Unique entries

The biggest drag on all the dedup table is the space required to track unique entries. For dedup to work, we have to track everything we have stored on the disk, but, we only get any benefit when the refcount goes greater than 1. Any block that we have only one copy of is just consuming space in the dedup table, waiting for the day that something writes exactly the same data. If that never happens, then it’s a cost that we can never claw back.

And, since dedup is performed on the data after encryption and compression, and on the block level, then it’s not just the same data, but the same compression method, encryption keys, and alignment within the file. And this is why dedup is worse than useless on general purpose workloads, because there is just so little data that is truly “the same”.

How does fast dedup fix all this?

What we call “fast dedup” is a suite of changes that together try to tackle the above problems. Put simply, the goal is to reduce the amount we store in the table, and when we must store something, be much smarter about how we accumulate and stage those changes, and then provide tools to allow the operator to limit and manage the table contents.

Making the live entry list smaller

The first place we started was to reduce the memory footprint of the live entry list. The dedup table is a regular stored object, accessed through the ARC, so for that we have many more optimisation options available to us. The live entry list however is exactly that: a simple list, recording every entry touched on this transaction, pinned in memory. We can’t get rid of it within the current architecture, so we just had to make it smaller.

The live entry type ddt_entry_t was not very well laid out. First, it used a few large numeric types to store simple flags; those were replaced with a bitfield. Next, some of the entry synchronisation fields (recall the “write five” example above) were simplified. Finally, it carried 40 bytes of state that is only needed when a dedup’d data block was first being written, or is in need of a repair write (the OpenZFS “self-healing” stuff). Once an entry is created, this state is never used, so it was lifted out to a separate “IO state” object that we create when we need it, and toss when we’re done. This was all fairly small stuff though.

The big fish is in the stored part of the entry. The key is 40 bytes, 32 of which are the checksum. It’s effectively a block pointer fragment, so can’t really be modified further without implications for the block pointer structure proper, which is well outside the scope of this work. More importantly though, we actually wanted to keep the key the same for compatibility with existing dedup tables. Not strictly necessary, but there’s no gains to be made that are significant enough to warrant the complexity of converting back to an old format when needed.

The value part is a different story. In traditional dedup, an entry contains four “physical” entries, which look like this:

typedef struct ddt_phys {
    dva_t       ddp_dva[SPA_DVAS_PER_BP];
    uint64_t    ddp_refcnt;
    uint64_t    ddp_phys_birth;
} ddt_phys_t;

That’s three 128-bit DVAs, the refcount for this entry, and the birth time (transaction id) for the entry. This is all pretty reasonable, and you can see how it can be combined with the key to produce an almost-complete block pointer (good enough to scrub the block anyway).

But wait, four? Why four? Well.

OpenZFS datasets have a copies property, that says how many copies of the block data should be written out. If you recall above I said that during a write, the metaslab allocator is called to allocate space to store the block data, and can return multiple DVAs. That’s how copies works: the allocator allocates one, two or three regions on the disk of the right size, the data is written to all of them, and that many DVAs go in the block pointer.

The thing is, you can change this property live, and it does what most property changes do, which is to affect all future writes, while leaving existing data on disk.

Consider what that means for dedup. Say you have a dataset with copies=1 (the default), and you write a block. The block pointer gets one DVA (the other two are all-zero). You copy it a few times, which reuses the DVA and bumps the dedup refcount. Then you change to copies=2 and copy the block again. The entry is looked up in the dedup table, but only has one DVA when the write policy is requesting two. What do we do?

Traditional dedup’s answer is to treat this as a brand-new write. It goes through to the allocator, two DVAs are allocated and written. Then the dedup entry is updated, but instead of bumping the refcount for the 1-copy “physical” entry (at dde_phys[1]), it instead copies the DVAs into the 2-copy entry (at dde_phys[2]), and sets the refcount there to 1. And from then on, those two variants of the block are treated effectively as separate dedup entries with a common key.

The thing is, it’s very unusual for an operator to modify copies= on an existing dataset, and also ill-advised where dedup is in play, because it effectively invalidates your existing dedup table for new writes: they all start back at 1, including using more space! So most of the time, the other “physical” entries are going to be filled with zeroes, unused. This isn’t too much of an issue for entries stored in the dedup ZAP, as those are compressed and long runs of zeroes compress reasonably well. But, while they’re in memory on the live list, they’re just sitting there, at least 192 bytes of zeroes that will never be needed.

(Astute readers that know that a block pointer can only contain up to three DVAs, and that copies= can be set to 1, 2 or 3, might now be wondering what the fourth entry is for. The answer, these days, is nothing. It used to be where the dedupditto feature lived, storing extra DVAs “just in case”, but it was buggy and not really useful, and was removed years ago. It’s still supported for very old pools, but modern OpenZFS can only read them, not write them).

After some experimentation, we realised that if we receive a write for a block that is already on the dedup table, but has too few DVAs, all we really have to do is allocate and write enough additional copies of the data to fulfill the request (up to 3), and add those to the dedup entry when we bump the refcount. This does mean we now have blocks on disk with the old number of DVAs, but that’s ok, as that’s the same guarantee as before. It does make the “lookup entry” part of the IO pipeline more complex of course, and it does introduce some subtleties when freeing the block when the refcount reaches zero, but that’s fine - clever and tricky things are ok, for a good cause.

And a good cause this is! New dedup tables created with fast dedup enabled have the “value” part of the entry at just 72 bytes, rather than 256 in the traditional version (the extra 8 is for an additional 64-bit integer to support the pruning feature, see below).

Pull this all together, and a single entry in the live list is now 216 bytes, almost half the original 424 in a traditional entry. “Half the memory” is a pretty nice thing to be able to put on the brochure, especially for slab memory that can’t be easily reclaimed.

The stored entry is technically also smaller, but it’s still going to hit that ballpark of ~64 bytes after compression. It starts a lot smaller, but is a lot less compressible because it’s not full of long runs of zeroes that are easy to reduce. Once we take ZAP chunk overheads into account there’s very little gain made here. But that’s ok, because reducing the size of the dedup ZAP entries was never really in our plans.

The dedup log

As we’ve discussed, the live entry list has a record for every deduplicated modified on the current transaction. At the end of the transaction, these updated entries are written back to the dedup ZAPs. Each entry is surrounded by 187 other entries in the same dedup block, which as the dedup table gets larger, are less and less likely to have been touched on this transaction. So in order to update the entry, we have potentially had to load a full block of entries, which is additional IO or, in the better-but-still-bad scenario, “loaded” the block from the ARC instead. And then, once all the entries are updated, the live entry list is cleared.

After considering some customer workloads, and doing some of our own experiments, we decided that if a block is to be duplicated, it’s more likely to be one that was “recently” created or duplicated. Or, put another way, the longer it’s been since a block was touched, the less likely it is that it will be duplicated or freed in the future. Without thinking very hard this intuitively makes sense; we tend to work with the same bit of data a lot, and then we don’t touch it again for a while, or ever. If this is true, then it means throwing away the live entry list at the end of the transaction ends up being extremely wasteful, because there’s a good chance we’re going to need some or even most of those entries in the very near future!

The thing is, the changes represented by the live entry list “belong” to the transaction. They must be written down with that transaction, otherwise rolling back to that transaction (eg during crash recovery) will have stale information in the dedup table. So we started thinking about where else we could possibly record these changes such that we could get them back quickly in subsequent transactions, and in a rollback, without wearing the full cost of updating the dedup table proper every time.

The answer of course is the same as it’s been any time any storage system has wanted to defer work into the future: add a “journal” or “log” describing the changes, replay the log during crash recovery, and during normal operation, slowly write the logged changes out to their final resting place. The fundamentals of the dedup architecture however, make things a little more complicated, as we’ll see when we imagine building up this system.

So let’s imagine the simplest thing that might work. At the end of transaction, instead of updating the dedup ZAP, we just dump the entire live entry list as an array of fixed-size entries onto the end of some object, which we declare to be the log. Since the same entry might have been updated on two or more consecutive transactions, and we’re only appending to the log, this means that the log might contain the same entry more than once. That’s fine, it just means we should only use the last we see. Every so often, we blast through the log, add the last instance of each entry to the dedup ZAP, and then zero the log. Job done.

This actually works very well for what it is. The log is stored in a regular object, and so is bound to the same transaction as the data changes associated with its entries. Multiple changes to the same entries end up being amortised somewhat; if we only write the log back to the ZAP every five transactions, an entry that changes every five transactions will only be written once. So great, that’s the write overhead taken care of.

The critical flaw here is on the lookup side. At any point, the write pipeline is going to come asking for an entry. If it hasn’t been used this transaction (ie it’s not on the live entry list), it will go to the dedup ZAP and get the entry from there. But if that entry is on the log, then the entry in the ZAP is stale, and we can’t use it, both because it may be wrong, and because it’s going to be overwritten when the log is written out.

Our only option then is to search the log to find out if it has a more recent version of the entry. The problem with that is that most of the time it’s even worse than reading from the ZAP, as the log has no useful ordering, and duplicate entries. We potentially have to read the entire log, however enormous it may be, only to find the entry wasn’t even there and still have to do a lookup in the ZAP.

What we need is an index of the log, that gives us a fast way to look up any entry that might exist in it. As it turns out, once you remove the overheads of a “live” object, a single “logged” entry held in memory is only 144 bytes, for the entire entry. This is small enough that it’s feasible to keep the entire log in memory, as well as on disk. And then, when doing a lookup, if the entry we want is not on the live entry list, we then check the in-memory log, and if it’s not there, we go to the dedup ZAP. And then, at end of transaction, we save the updated version of the entry to both the in-memory log and to the on-disk log.

On disk as well? Yes. We still need crash safety. But when we take these two version of the log together, the on-disk log becomes write-only, while all regular activity goes to the in-memory log, that is, the two contain alternate representations of the same data. In the case of rollback or crash recovery (both of which happen at pool import), we simply load the in-memory log from the on-disk log, and move on with life.

Incremental log flushing

Of course, we’ve now introduced some new complexity to help with managing some existing complexity, which naturally means we have to do some work to manage the complexity as well.

We’ve reduced the per-transaction IO overhead of the dedup table to about the smallest it can be, at the cost of the memory required to carry a copy of the log. It’s smaller on average compared to the ARC overhead, but it’s not nothing, and we have to keep it in check.

Our earliest version just watched the size of the in-memory tree and when it grew “too big”, at the end of transaction, we just wrote the entire log out to the dedup ZAP and cleared it. This is less IO in total than if we’d written those updates to the dedup ZAP at the end of each transaction, but at least that IO was spread across multiple transactions. In our testing, we could easily causes substantial flushing pauses with only a few thousand entries on the list, long before any real memory pressure was felt.

So instead, we changed things so that some amount of the log was written out to the ZAP every transaction. We monitor the log flush rate against the amount of time spent on real IO, so that we write less in busy periods, and more in quiet periods. There’s some extra consideration there too, like, we may accelerate when the in-memory log is too large and causing memory pressure, and a few other things.

Incremental flushing brought back an older problem though. We want to be able to zero the on-disk log. We know which are the “most recent” entries on the log, because they’re the only ones on the in-memory log. But, we don’t know where in the on-disk log those versions of those entries are, and, because new entries are being added to the log on the same transactions as we are writing them out, we don’t know which entries on the in-memory log have been written out. Under this model, we cannot zero the on-disk log until and unless the in-memory log is empty, and the in-memory log will only be empty if no updated entries have occurred, which ultimately means that eventually, we have to stop and flush the remaining log. And, because the on-disk log is only appended to, the longer we drag out the flushing, the larger it gets.

To handle this, we actually have two logs, each with an in-memory and on-disk version. One of these is only being flushed, the other is only being updated. In this way, we can accumulate new updates on the “active” log, while the “flushing” log is being emptied. Once it’s empty, the on-disk flushing log is zeroed, and the two logs are swapped: the old “active” is now “flushing” and begins being written out, and new changes are written to active. We get our wish that the log needs to be stopped before it can be flushed, without needing to stop the world.

Of course, this adds further complications to the lookup stage, as we now have to look for an entry on the “active” log list first, and then on the “flushing” log list, before finally going to the ZAP. For obscure reasons, that then means that entries “loaded” from the flushing list then ”saved” to the active list need a little bit of extra finesse, because we can end up with entries on the on-disk “flushing” tree that were never flushed before they were reused. It’s nothing to worry about; just slightly more complicated dance moves.

There’s also a “log checkpoint”. Since checksums are really just very large numbers, they have a very comfortable ordering. So, when we finish flushing on this transaction, we write the last checksum we wrote to the disk (actually to the bonus buffer of the flushing log object). This is there to make import faster; we have to reload both log lists. We can use the checkpoint when reading the flushing log to know which entries have already been flushed and not even bother putting them in the in-memory list.

Finally, there’s some interesting interactions with pool scans (ie scrub and resilver). Normally, when dedup is enabled, a scan begins by walking the dedup table, reading every block pointer (and taking the wanted action) from every entry within it, and then moving on to the rest of the pool. Every so often, the scan process will record a position in the pool, so that the scan can be resumed from that point.

The problem with the dedup log is that there is no useful notion of “position” within it to record. The on-disk log has no natural layout as we know, while the in-memory log uses a common structure in OpenZFS called an AVL tree, which does not have a “stable cursor”; that is, there’s nothing you can store to describe a logical position in the tree that would carry over to a different tree with the same structure.

We tried a lot of things to synthesize an AVL cursor, and it is sort of possible, but not within the constraints of the “position” data we need to save (for those playing along at home, we need to add a 40-byte key to scn_ddt_bookmark within dsl_scan_phys_t). In the end, we take something of a coward’s way out: when a scrub is requested, we accelerate log flushing; all the log is flushed out to the dedup ZAP, and then do it the old way. Scans set the current transaction as the “end of scan” point, so we don’t need to worry about changes that come in after, and the dedup ZAP will have everything from before the flush. It does mean that after a crash, the log needs to be re-flushed before the scan can continue, but the expectation is always that the size of the dedup ZAP and the pool data as a whole is always going to dwarf the size of the log.

Unique entries

So that was a lot about the logs! If you’re still reading, well done!

There was one other issue with traditional dedup, and that was the difficulties caused by unique entries vastly inflating the size of the table with entries that never get used. There’s some new tools to help the operator manage the table size generally, and unique entries specifically.

The big help for unique entries is the new zpool ddtprune command. It will remove some amount of unique entries from all dedup tables on the system, specified by age or percentage. The age option works particularly well with our ideal workload where more recently used data is more likely to be deduplicated. This sort of usage pattern results in a long and aging tail of unique entries that will never be deduplicated. Now you can wholesale get rid of them, and with the new “ZAP shrink” enhancement, dedup ZAP blocks that end up entirely empty as a result of this operation will simply be removed.

Of course, this does mean that if a block whose dedup entry has been removed does later get copied, it will be a new block with a new allocation; there will be no deduplication. That said, if a very old unique block is suddenly copied a dozen times, that will be a dozen references to a single new block, and you’ll have two copies instead of the 13 you’d have without dedup at all. So you do need to tune the behaviour of ddtprune to match your workload, but it may not be a total disaster if you were to prune too much.

Meanwhile, the pool property dedup_table_quota lets you set a maximum possible size for the dedup tables on your pool. If creating a new entry would take the dedup tables over that limit, the entry won’t be created and the write will just be a regular non-dedup’d write. This is good to use in conjunction with a dedicated dedup device where you want it to not spill out to the main device if it gets full.

That’s a lot! Anything else?

Just a handful of operational improvements.

zpool prefetch -t ddt will preload the dedup tables into the ARC, which can help with performance immediately after pool import. In traditional dedup it’s obvious how this helps, but even in fast dedup, entries not on the log still need to be loaded from the ZAPs, and flushing still needs the ZAPs nearby to write too, so having them in the ARC still helps.

There’s a new collection of kstats, in /proc/spl/kstat/zfs/<pool>/ddt_stats_<checksum> on Linux or kstat.zfs.<pool>.misc.ddt_stats_<checksum> on FreeBSD. These will show various live stats for the dedup subsystem, including counts of lookups, hit rates on the live and log lists and the dedup ZAPs, and the log flushing rates.

There’s also a new collection of tuneables, /sys/modules/zfs/parameters/zfs_dedup_log_* on Linux or vfs.zfs.dedup.log_* on FreeBSD. These control various inputs into how much log flushing occurs each transaction. As usual, the defaults are carefully considered and should be fine for anyone, but when they’re not, having some knobs to adjust is a game changer.

And then the existing dedup-aware things (zpool status -D, zdb -D, zdb -S, etc) have been updated to understand all this new stuff.

Nice! Can’t wait to use this with my existing dedup table

😬

So almost all of the above requires on-disk format changes that your existing dedup tables don’t have. This was unfortunate, but intentional: the project brief explicitly excluded a migration path, because it’s complicated (read: expensive) and there’s very few dedup installations out there of sufficient size and complexity to require it.

However, we didn’t go out of our way to make it not work, or to prevent it from being possible in the future.

For existing tables, anything that doesn’t require an on-disk format change should work:

Table quotas (dedup_table_quota pool property)
Table prefetch (zpool prefetch -t ddt)
Lookup and hit counts (ddt_stats_* kstats)
ZAP “shrink” (collapsing whole empty blocks; this is a general ZAP enhancement in 2.3)

The dedup log feature should be straightforward to make work with traditional tables. Nothing in it cares about the size of the entries (in fact, “log” and “flat entry” are two separate subfeatures). The only missing piece is something to set up the new “container” object for an existing table, which would be fairly easy to do. Of course, you would not get the smaller live and log list entries in memory or on disk, so some of the sizing tuning would be different.

Unique entry pruning (zpool ddtprune) should be straightforward to add for only the “percentage of uniques” mode. The “age” mode is not possible, as it requires data in the new entry format which doesn’t exist in the traditional format.

Converting your old tables is not currently possible. In the simplest case, where copies= has never been changed, it would be as straightforward as creating a new ZAP, walking over the existing ZAP, converting the entry and copying it in. Doing this online would be complicated as we’d need to either be reading from both old and new ZAPs, or writing down to both ZAPs and then switch over at the end. Doing it offline would be easier, and could be done through a userspace tool, but of course requires taking the pool offline.

If copies= has been changed and there are existing entries carrying both kinds, then a complete conversion is not possible, as the whole method by which existing “variants” of a block are upgraded is different and there just isn’t room in a new entry to store it all. The most fortunate cases would be if one and only one of the variants has a refcount greater than 1, as those are uniques that could be “pruned”. Otherwise there’s nothing we can do with those (though I have just had a strange thought involving the BRT).

And of course, the usual “trick” of sending the deduplicated dataset to another pool where the new dedup is available will certainly do the job.

If you are one of those people with an enormous dedup table and you’re interested in funding development of one or more of these options, Klara would love to talk to you.

Is deduplication really good now?

I think it really is good enough to at least play with. The overheads should at least be reduced enough to make it useful in more marginal situations.

Is it really good though? Probably not, not yet, but “not yet” is part of the point of all this. We’ve taken perhaps the most unloved OpenZFS feature, given it a substantial upgrade that hopefully will get more people taking a look at it, or maybe even taking a second look at it. Meanwhile, the code is now better structured, commented and understood by a lot more people, and has a lot more obvious points to add and change behaviour. It can finally join all the other features in the OpenZFS toolbox, and from there, who knows what it can become.

I don’t get it. After all this, if it’s good enough, why shouldn’t I enable dedup everywhere?

If you’re like most people, you’re thinking about transparent deduplication because you have a general-purpose workload, that is, some “big ball of miscellaneous files” setup, like a local desktop or laptop. Or maybe a bigger version of that, for example, you provide exactly that service for all the people in your organisation. And good disks cost money, and times are tough, and you’re thinking well, if I can turn this thing on and it saves a bit of data, wouldn’t that be worth it?

I actually agree with this in theory.

As we’ve seen from the last 7000+ words, the overheads are not trivial. Even with all these changes, you still need to have a lot of deduplicated blocks to offset the weight of all the unique entries in your dedup table.

But, what might surprise you is how rare it is to find blocks eligible for deduplication are on most general purpose workloads.

Consider a simulated dedup run on my laptop. This is the machine I use for everything, home and work, easily 12 hours every day.

$ zpool list crayon
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
crayon   444G   397G  47.3G        -         -    72%    89%  1.00x    ONLINE  -

$ zdb -S crayon
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    11.7M    708G    362G    373G    11.7M    708G    362G    373G
     2    12.2K    666M    284M    293M    25.1K   1.32G    582M    602M
     4      294   4.44M    962K   1.55M    1.44K   22.0M   4.67M   7.76M
     8        5     99K     41K     44K       52   1.06M    440K    464K
    16        1   7.50K   3.50K      4K       26    195K     91K    104K
 Total    11.7M    708G    362G    373G    11.7M    709G    362G    373G

dedup = 1.00, compress = 1.96, copies = 1.03, dedup * compress / copies = 1.90

So for a table of 11.7M entries, the number of those that represent something we actually managed to deduplicate is a literal rounding error. It’s pretty much entirely uniques, pure overhead. Turning on dedup would just add IO and memory pressure for almost nothing.

But the real reason you probably don’t want dedup these days is because since OpenZFS 2.2 we have the BRT (aka “block cloning” aka “reflinks”). (I acknowledge that it had a shaky start, which has been written and presented on extensively, so I won’t do that again, but lets just say it’s all good now).

You may recall, way back at the top of this post, we asked “what even is dedup?”, and we defined it as:

When OpenZFS prepares to write some data to disk, if that data is already on disk, don’t do the write but instead, add a reference to the existing copy.

The dedup table and its entourage all exist to answer “is this data already on disk?”, but in one quite niche situation: when you don’t have any other knowledge or context about the data being written.

The thing is, it’s actually pretty rare these days that you have a write operation coming from some kind of copy operation, but you don’t know that came from a copy operation. In the old days, a client program would read the source data and write to the destination data, and the storage system would see these as two unrelated operations. These days though, “copy offloading” is readily available, where instead of reading and writing, the program will tell the storage system “copy this source to that destination” and the storage system is free to do that however it wants. A naive implementation will just do the same read and write as the client would, but a smarter system could do something different, for example, not doing the write and instead just reusing the existing data and bumping a refcount.

For Linux and FreeBSD filesystems, this “offload” facility is the copy_file_range() syscall. Most systems have an equivalent; macOS calls it copyfile(), Windows calls it FSCTL_SRV_COPYCHUNK. NFS and CIFS support something like it, OS block device drivers are getting equivalents, even disk protocols have something like it (eg SCSI EXTENDED COPY or NVMe Copy).

If you put all this together, you end up in a place where so long as the client program (like /bin/cp) can issue the right copy offload call, and all the layers in between can translate it (eg the Window application does FSCTL_SRV_COPYCHUNK, which Samba converts to copy_file_range() and ships down to OpenZFS). And again, because there’s that clear and unambiguous signal that the data already exists and also it’s right there, OpenZFS can just bump the refcount in the BRT.

Most importantly is the space difference. If a block is never cloned, then we never pay for it, and if it is cloned, the BRT entry is only 16 bytes.

On my pool, where the two major users of copy_file_range() (that I know about) are cp and ccache, my BRT stats are rather nicer:

$ zdb -TTT crayon
BRT: used 292M; saved 309M; ratio 2.05x
BRT: vdev 0: refcnt 12.2K; used 292M; saved 309M

BRT: vdev 0: DVAs with 2^n refcnts:
			 1:  11788 ****************************************
			 2:    645 ***
			 3:     40 *
			 4:      3 *
			 5:      1 *

If you compare to the dedup simulation, I’m not saving as much raw data as dedup would get me, though it’s pretty close. But I’m not spending a fortune tracking all those uncloned and forgotten blocks.

Now yes, this is not plumbed through everywhere. zvols don’t use the BRT yet. Samba has only just gotten support for OpenZFS very recently. Offloading in Windows is only relatively new. The situation is only going to get better, but maybe it’s not good enough yet. So maybe you might be tempted to try dedup anyway, but for mine, I can’t see how the gains would be worth it even without block cloning.

And this is why I say you probably don’t want it. Unless you have a very very specific workload where data is heavily duplicated and clients can’t or won’t give direct “copy me!” signal, then just using block cloning is likely to get you a good chunk of the gain without the outsized amount of pain.

In summary

Dedup is about balancing IO throughput, memory usage and dedup table size. Traditional dedup has a very tiny “sweet spot” where these factors balance nicely, while being ruinous if you fall out of it. Fast dedup improves all three, making it far easier to balance these factors and rather less of a disaster if it doesn’t work out. However, it is still only of benefit if you have a truly enormous amount of data, that gets copied a lot, and aren’t able to take advantage of other “zero-copy” options within OpenZFS, like block cloning or snapshot clones.

Extra congratulations if you got this far. I hope this was of interest!

Thanks to @mattcen and @adavis for numerous grammar and spelling fixes. Much obliged! 💚

George Melikov has translated this post to Russian: Дедупликация в OpenZFS теперь хороша, но использовать её не стоит. Thank you George!