Why fsync() on OpenZFS can’t fail (and what happens when it does)

This presentation was given at AsiaBSDCon 2024.

Abstract

On OpenZFS, fsync() cannot fail - it will wait until the application’s changes are on disk before it returns. If there is a problem, such as a hardware failure, that causes the pool to suspend, then it may wait forever. This feels strange, but is acceptable according to the API contract: fsync() never returned success, so the application has no reason to believe its data is on disk.

However, OpenZFS pools can recover if the fault is repaired, and so fsync() can still return. As it turns out though, it’s possible in rare situations for the pool to return to service but not have actually put the data on disk. fsync() returns success, because it cannot fail, and the application has been lied to.

This paper describes the path taken from the fsync() call, through the ZFS Intent Log, the transaction machinery, the pool failure system and the IO pipeline to understand what happens to IO when disks fail, and why OpenZFS believed that writes had succeeded when they had not. It goes on to describe changes to make OpenZFS understand that something had gone wrong and respond appropriately, such that fsync() once again cannot fail.
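To make the contract described above concrete, here is a minimal sketch in C of how an application typically relies on fsync(). The helper name durable_write is hypothetical and is not taken from the talk or from OpenZFS. It treats the data as durable only after fsync() returns 0. On a suspended pool the fsync() call would simply not return; the failure the abstract describes is the case where it returns 0 even though the data never reached disk.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not OpenZFS code): write len bytes
 * to path and return 0 only after fsync() has reported success.
 * Returns -1 on any error.
 */
int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be partial or interrupted, so loop until all bytes are queued. */
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * The application only believes its data is on disk once this returns 0.
     * If the call blocks (for example on a suspended pool), no promise has
     * been made yet. Other filesystems may instead return -1 here.
     */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    return close(fd) == 0 ? 0 : -1;
}
```

A caller would treat a return of 0 as the point at which it may acknowledge the write to its own clients. The sketch covers file contents only; it does not fsync() the parent directory, which a real application creating new files may also need to do.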

Resources

Status

Last updated: 2024-04-04

This series is in production at a customer site, and is now being upstreamed. Unlinked items are commits not yet made public, mostly because they require earlier changes not yet available. This list will be updated as PRs are merged and new ones posted.

Acknowledgements

Thanks to Klara, Inc. and Wasabi Technology, Inc. for sponsoring this work.

Thanks to the AsiaBSDCon 2024 committee and sponsors for supporting my travel to Taipei to present this work.