Why fsync() on OpenZFS can’t fail, and what happens when it does
This presentation was given at BSDCan 2024.
Abstract
On OpenZFS, fsync() cannot fail: it waits until the application's changes are on disk before it returns. If a problem such as a hardware failure causes the pool to suspend, fsync() blocks until the pool returns. This could be seconds, hours, or never, depending on the nature of the failure.
Modern distributed systems can often cope with this type of failure by redirecting requests to another node, but they can only do this if fsync() returns an error instead of blocking.
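As a rough illustration of why a returned error matters, here is a minimal sketch of an application that tries the next node when a write or fsync() reports failure. The names `durable_append` and `store` are hypothetical, and plain local file descriptors stand in for what would be remote nodes in a real system; none of this is taken from the talk or from OpenZFS itself.

```python
import os


def durable_append(fd, data):
    """Write data to fd and fsync it; return False if either step reports an error.

    Hypothetical helper for illustration. If fsync() blocks instead of
    failing, this function never returns and the caller cannot react.
    """
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
        return True
    except OSError:
        return False


def store(record, node_fds):
    """Try each node in order; return the index of the first node that
    durably stored the record, or -1 if every node reported an error.

    Assumption: each "node" is just a file descriptor here.
    """
    for index, fd in enumerate(node_fds):
        if durable_append(fd, record):
            return index
    return -1
```

The failover in `store` only works because `durable_append` returns. If fsync() blocks for as long as the pool is suspended, the loop never reaches the second node.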
In this talk I describe how OpenZFS implements fsync() and why it blocks when the pool fails. I then discuss a series of changes made to make it possible for fsync() to return failure, and what it means for applications when it does.
Resources
- Slides: norris_openzfs-fsync-zil_bsdcan-2024_slides.pdf [PDF 2.1M]
- Video: https://www.youtube.com/watch?v=dfH0I6D9ZAA
Further reading
fsyncgate
- Mirror of the postgresql-dev thread where it was first reported.
- LWN coverage:
Papers
- Can Applications Recover from fsync Failures? [Rebello et al.]
Implementations
FreeBSD
- 1999 (4.x) change to retain failed pages, rather than invalidate them
- 2017 (12.x) change to invalidate failed pages if the device was removed
Linux
- VFS documentation on handling errors during writeback
- 2017 (4.13) changes to codify the behaviour and make it consistent
Acknowledgements
Thanks to Klara, Inc. and Wasabi Technology, Inc. for sponsoring this work.
Thanks to the BSDCan 2024 committee and sponsors for supporting my travel to Ottawa to present this work.