Why fsync() on OpenZFS can’t fail (and what happens when it does)

This presentation was given at AsiaBSDCon 2024.

Abstract

On OpenZFS, fsync() cannot fail: it waits until the application’s changes are on disk before it returns. If a problem such as a hardware failure causes the pool to suspend, it may wait forever. This feels strange, but is acceptable according to the API contract: fsync() never returned success, so the application has no reason to believe its data is on disk.

However, OpenZFS pools can recover if the fault is repaired, and so fsync() can still return. As it turns out though, it’s possible in rare situations for the pool to return to service but not have actually put the data on disk. fsync() returns success, because it cannot fail, and the application has been lied to.

This paper describes the path taken from the fsync() call, through the ZFS Intent Log, the transaction machinery, the pool failure system and the IO pipeline to understand what happens to IO when disks fail, and why OpenZFS believed that writes had succeeded when they had not. It goes on to describe changes to make OpenZFS understand that something had gone wrong and respond appropriately, such that fsync() once again cannot fail.
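To make the application's side of this contract concrete, here is a minimal sketch in C of a write followed by fsync(). The helper name `write_durably` is hypothetical and is not part of OpenZFS or of the paper; it uses only standard POSIX calls. The point is that the caller treats the data as durable only after fsync() returns 0, which is exactly the assumption that breaks if fsync() reports success without the data being on disk.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not OpenZFS code): write `len`
 * bytes from `buf` to `path`, and return only after fsync() has
 * reported success. Returns 0 on success or a negative errno value.
 */
int
write_durably(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-errno);

	/* write() may be short or interrupted, so loop until done. */
	const char *p = buf;
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			int err = errno;
			close(fd);
			return (-err);
		}
		p += n;
		len -= (size_t)n;
	}

	/*
	 * The application has no reason to believe the data is on disk
	 * until this call returns 0. On a suspended pool it may simply
	 * not return; portable code still checks for an error result.
	 */
	if (fsync(fd) != 0) {
		int err = errno;
		close(fd);
		return (-err);
	}

	if (close(fd) != 0)
		return (-errno);
	return (0);
}
```

A caller would invoke something like `write_durably("/pool/data/file", buf, len)` (the path is an assumption) and acknowledge the write to its own clients only when the result is 0.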

Resources

Paper: norris_openzfs-fsync-failure_asiabsdcon-2024.pdf [PDF 134K]
Slides: norris_openzfs-fsync-failure_asiabsdcon-2024_slides.pdf [PDF 5.8M]
Video: coming soon!

Status

Last updated: 2024-06-30

This series is in production at a customer site, and is now being upstreamed. Unlinked items are commits not yet made public, mostly because they depend on earlier changes that are not yet available. This list will be updated as PRs are merged and new ones are posted.

Test suite support:
- zinject: show more device fault fields openzfs/zfs#15953 ✅
- zinject: inject device errors into ioctls openzfs/zfs#16061 ✅
- zinject: “no-op” error injection openzfs/zfs#16085 ✅
- zil: add stats for commit failure/fallback openzfs/zfs#16315

Failing test case:
- zts: test for correct fsync() response to ZIL flush failure openzfs/zfs#16314

Implementation:
- zio: rename “ioctl” to “flush”; remove zio_ioctl() openzfs/zfs#16064 ✅
- zio_flush: propagate flush errors to the ZIL openzfs/zfs#16314
- zio: add vdev tracing machinery
- zio: expose trace node alloc/free/compare
- zio: function to issue flushes by trace tree
- zil: only flush leaf vdevs that were actually written to

Acknowledgements

Thanks to Klara, Inc. and Wasabi Technology, Inc. for sponsoring this work.

Thanks to the AsiaBSDCon 2024 committee and sponsors for supporting my travel to Taipei to present this work.