fsync() after open() is an elaborate no-op

I have spent the last couple of years of my life trying to make sense of fsync() and bringing OpenZFS up to code. I’ve read a lot of horror stories about this apparently-simple syscall in that time, usually written by people who tried very hard to get it right but still ended up losing data in one way or another. I hesitate to say I enjoy reading these things, because they usually start with some catastrophic data loss situation, and that’s just miserably unfair. Still, I think they’re important reads, and I’m always glad to see another story of fsync() done right, or done wrong.

A few days ago a friend pointed me at a new one of these by the team working on CouchDB: How CouchDB Prevents Data Corruption: fsync. I was a passenger on a long car trip at the time, so while my wife listened to a podcast I read through it, and quite enjoyed it! As I read it I was happy to find nothing horrifying; everything about it seemed right and normal. It’s even got some cool animations, check it out!

But then I got to the very last paragraph:

However, CouchDB is not susceptible to the sad path. Because it issues one more fsync: when opening the database. That fsync causes the footer page to be flushed to storage and only if that is successful, CouchDB allows access to the data in the database file (and page cache) because now it knows all data to be safely on disk.

fsync() after open()? What even?

That’s the question I’ve been mulling over for days, because I don’t see how this action can make any particular guarantees about durability, at least not in any portable way. I’ve gone back over my notes from my BSDCan 2024 presentation, and in turn checked back over some of the research papers, blog posts and code that informed them. If I’m missing something I’m not seeing it.

Scenario 🔗

Here’s a summary of the scenario as I understand it:

- A program write()s some data to a file. The write lands in the OS page cache as a dirty page.
- The program crashes before it can call fsync(). The OS keeps running, so the dirty page is still sitting in the cache.
- The program is restarted. It open()s the file again, and immediately calls fsync() on the new file descriptor.
- The fsync() returns success, and the program concludes that everything in the file, including that earlier write, is now safely on disk.
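In code, the pattern looks something like this. To be clear, this is my sketch of the idea, not CouchDB’s actual code; the path and the error handling are made up for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* a sketch of the idea, not CouchDB's actual code; the path and
     * the error handling here are made up for illustration */
    int fd = open("/var/db/mydb.couch", O_RDWR);
    if (fd < 0 || fsync(fd) != 0) {
        perror("open/fsync");
        return 1;
    }
    /* the assumption under test: everything anyone ever wrote to this
     * file is now durably on disk, so its contents can be trusted */
    return 0;
}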

If I’ve misunderstood the situation, then the rest of this post might be meaningless. If not though, then I’m pretty sure that successful fsync() has not made any particular guarantee about that original write.

POSIX 1003.1-2024 is very thin in its description of fsync() – just two paragraphs. My understanding is that the first paragraph is about durability, while the second is about ordering. However, there’s very little in this description that is concrete, so we can’t rely on much.

Let’s go over both.

Durability 🔗

Durability is what most people think of when they think about fsync() – being able to reliably find out if your writes made it to storage (at least to the extent that the filesystem, OS and hardware can make that guarantee, but that’s a whole different issue).

The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.

The key part here is the emphasis on the file descriptor. What it’s saying is that if you call fsync(fd) on a given fd, and it returns success, you are guaranteed that the writes made on that file descriptor are on disk. That is to say, if you have a different file descriptor open on the same file, and you have made writes there too, the first fsync() does not guarantee those writes.

Or, put another way:

#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    char data[4096] = { 0 };    /* contents don't matter for this example */

    int fd1 = open("/tmp/myfile", O_WRONLY|O_CREAT, S_IRUSR|S_IWUSR);
    int fd2 = open("/tmp/myfile", O_WRONLY|O_CREAT, S_IRUSR|S_IWUSR);

    pwrite(fd1, data, sizeof(data), 0);             // 4K at start of file
    pwrite(fd2, data, sizeof(data), 0x3ffff000);    // last 4K up to 1G

    assert(fsync(fd1) == 0);    // first 4K definitely on disk, last 4K unknown
    assert(fsync(fd2) == 0);    // last 4K on disk too
    return 0;
}

It should be obvious that you wouldn’t want this any other way. If you were a database server, and you have a multi-terabyte file under you, and hundreds of concurrent writers operating on different parts of the file, you don’t want to durably write (that is, make recoverable after a crash) all of them just because one decided it had finished its work.

Note that there is nothing in this that says what state the data is in if fsync() returns an error. The assumption among application programmers (including myself) was that if fsync() failed, and then on retry it succeeded, then your data had made it to disk. The “fsyncgate” email thread is where the real behaviour finally entered mainstream understanding, and it covered both the misunderstanding of what fsync() returning an error means, and of its behaviour when called on different file descriptors. The broken retry pattern looks something like the sketch below.
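For illustration, here’s roughly what that (incorrect!) pattern looks like. This is a sketch of the assumption, not any particular program’s code:

#include <errno.h>
#include <unistd.h>

/* the retry-until-success pattern that fsyncgate showed to be unsafe:
 * on some systems a failed fsync() discards the error state (and maybe
 * the dirty pages too), so a later success says nothing about the
 * writes that originally failed */
int unsafe_fsync_retry(int fd) {
    while (fsync(fd) != 0) {
        if (errno == EINTR)
            continue;
        sleep(1);   /* WRONG: assumes the data is still queued, and a
                       later success means it eventually hit the disk */
    }
    return 0;
}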

Ordering 🔗

The second paragraph is equally vague:

If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion.

To the best of my understanding, what this is trying to say is that a call to fsync() also creates a write barrier over the entire file. That is, all writes to the file before fsync() is called will be completed before any writes to the file after the call. On the surface, this appears to be without reference to the file descriptor used for the call to fsync().
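To make that concrete, here’s a sketch of the barrier reading. Everything here is assumption: fd1 and fd2 are taken to be two descriptors open on the same file, and the comments describe what this interpretation would require, not what any given OS promises.

#include <unistd.h>

/* fd1 and fd2 are assumed to be two descriptors open on the same file */
void barrier_sketch(int fd1, int fd2, const char *buf, size_t len) {
    pwrite(fd2, buf, len, 0);      /* queued before the fsync() ...        */
    fsync(fd1);                    /* ... so, under this reading, it must
                                      complete before ...                  */
    pwrite(fd2, buf, len, len);    /* ... this write, issued after, even
                                      though both writes went through fd2,
                                      not the descriptor we synced         */
}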

POSIX has a whole chapter on Definitions where “Synchronized I/O completion” and “Synchronized I/O file integrity completion” are defined. However, those are also ambiguous. I won’t paste them all in here, but if you do care to look, you’ll see that “I/O operation” is not clearly defined, and “queued I/O operation” is not defined at all. Further, “I/O completion” is defined as “successfully transferred or diagnosed as unsuccessful”, but then in the description of write (and read!) data integrity completion it repeats “successful or unsuccessful” and then immediately says “actually, only successful”. So it’s very unclear what’s actually supposed to happen here, but the barrier seems the most likely.

However, regardless of reading, this paragraph makes no claim about when those “before” writes should be written, nor how errors on them should be reported, nor how any of this interacts with reads. So for understanding where any given write is, we’re pretty much dependent on the first guarantee.

Guarantees 🔗

From this, I believe there are only two conclusions you can draw from the result of an fsync(fd) call for a given fd:

- If it returns success, then the writes you made through fd are on disk.
- If it returns an error, then some or all of the writes you made through fd may not be on disk, and you have no way to find out which, nor to get them there.

That last point is crucial. On error, you don’t know if your write made it to disk, and you can neither find out the current state nor induce a retry or any other behaviour, at least not through any standard means. There is no way to call fsync() again to get a guarantee. Reading the data back through the same file will not get you a guarantee. You might be able to do clever things “outside” of the normal POSIX APIs (flush some OS caches, read a raw block device, etc) but those are not guarantees provided by fsync().
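If you want something actionable from that, the only portable response I know of is the pessimistic one, sketched below. The abort() is illustrative; “crash and run your recovery path” is the general shape PostgreSQL adopted after fsyncgate.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* treat fsync() failure as data loss: crash and let recovery (log
 * replay, recomputation, an error to the user) deal with it, rather
 * than retrying and pretending the data survived */
void write_durably_or_die(int fd) {
    if (fsync(fd) != 0) {
        perror("fsync");
        abort();    /* don't retry; don't trust the page cache */
    }
}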

This isn’t just spec lawyering either. In the absence of a clear direction, different operating systems have handled errors in different ways. I’ll point you back to my BSDCan presentation for more details, since this post isn’t quite about error responses. What I’ll say is that they vary in quite interesting ways, and all make sense from certain perspectives.

If fsync() after open() doesn’t work, then why did it work? 🔗

I’ve rattled on for a while now but there’s a small problem: at least one project included calling fsync() after reopening the file in their recovery strategy. I have to assume it wasn’t just a cargo-cult, and it actually solved a real problem for them. So why did it work? I think we can make a good guess.

If you recall the descriptions from the CouchDB post, the write() call would have gone into the page cache. The user program crashed without fsync() being called, so that’s just a normal dirty page, sitting there, waiting for the OS to come along and write it out as part of its normal page writeback behaviour. When that happens is very dependent on OS, filesystem and configuration, but it could depend on how many other dirty pages are waiting to be written out, or it could be some scheduled or timeboxed process, or it could be when the OS is under memory pressure and starts reclaiming pages.
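As one concrete example (Linux-specific, and the values shown are just the usual defaults, so treat this as illustrative), the kernel’s background writeback is shaped by a handful of sysctls:

vm.dirty_writeback_centisecs = 500    # how often the writeback threads wake up
vm.dirty_expire_centisecs = 3000      # how long a page may stay dirty before
                                      # it's due to be written out
vm.dirty_background_ratio = 10        # start background writeback once dirty
                                      # pages pass this % of available memory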

I’m willing to bet that, being an important server program, it was restarted almost immediately by a process supervisor, and so the page was actually still dirty in memory, not yet written down. So, it looks good when you read it.

And why did the fsync() work? There are two possible reasons I can think of, but the post doesn’t have enough information for us to know which.

One possibility is that it actually didn’t do anything. fsync() looked, saw no dirty writes associated with the file descriptor and returned success. The program merrily continued on its way, and within the next couple of seconds, the normal page writeback came through and completed the write successfully.

The other possibility is that operating systems and filesystems are allowed to do more than POSIX permits. Using OpenZFS as an example (hey, it’s what I know), fsync() always flushes anything outstanding for the underlying object, regardless of where the writes came from. That’s obviously fine; fsync() returning success just means that the writes on the same descriptor are done. If OpenZFS has others lying around, it’s totally allowed to write those too. Just so long as the caller doesn’t assume that more happened than did, everything will be fine.

I don’t know enough about other filesystems, but it wouldn’t surprise me if at least some of them held a list of dirty pages on the file, or even the filesystem as a whole. No one has expressed an interest in them (they would have called fsync() if so!) and there are good batching opportunities available in holding them for a while and then writing them out all at once.

This is all by the by though. The real problem is actually what happens if that page fails to write out. Where does the error get reported? Does it get reported at all? And what does the OS do after that? You’d like to think that if there’s no one to report the error to, the page would either remain dirty (and so still be readable) or become invalid (and the next read causes a fault and a read from disk, likely also failing).

Most do one of those. Linux with ext4, however, does not (or at least, didn’t used to). It would both clear the error state and mark the page clean, but valid. In all respects it looked like a page freshly read from disk. But that data was not on disk (the write failed!), and any attempt to flush it would do nothing (it’s not dirty). And because it’s a clean page, it’s eligible for eviction, resulting in the weird situation that you could read data from it and it “worked”, then read it again later and it faults. This is all perfectly legal, and I imagine more efficient in some situations: if it’s clean then there’s no I/O to do, and if it’s valid there’s no fault to process.
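As a timeline, the program-visible effect is something like this sketch. The I/O error comes from the failing disk; nothing here induces it, and the function and variable names are mine:

#include <unistd.h>

/* what a program could observe under the old Linux/ext4 behaviour */
void ext4_sad_path(int fd, char *buf, char *chk, size_t len) {
    write(fd, buf, len);        /* lands in the page cache as a dirty page  */
    fsync(fd);                  /* fails (EIO): the write-out hit an error  */
    pread(fd, chk, len, 0);     /* "works": the page is now clean but valid */
    /* ... later, memory pressure evicts the clean page ... */
    pread(fd, chk, len, 0);     /* faults back in from disk: stale data, or
                                   another I/O error                        */
}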

And our program would never know, because it misunderstood what a successful fsync() meant.

Summary 🔗

If you need a single-paragraph summary, try this:

For portable programs, fsync() can only tell you about writes that you did on the same file descriptor. If it succeeds, those writes are on disk. If it fails, assume they are lost and gone forever. It has nothing to say about any other I/O anywhere else in the system.

Though if I’ve got it all wrong, please let me know. fsync() has thrown up so many subtle surprises over so many years that I can’t rule out that I’m just looking at yet another thing I didn’t know about.

And, for the avoidance of doubt, understand that I’m definitely not dunking on CouchDB at all. It was just their post that prompted me to write this. Almost every serious data storage system (including the one I work on) has messed up something around durability and been found out in the last decade or so. It’s almost a rite of passage these days.

That’s all I’ve got. Good luck out there 🍀