Which network file systems do not implement locking correctly?

Documentation often cites, as a common cause of file corruption, "NFS that doesn't implement file-level locking correctly" or something similar; e.g., for SQLite:

How To Corrupt An SQLite Database File, section 2.1

2.1 Filesystems with broken or missing lock implementations

SQLite depends on the underlying filesystem to do locking as the documentation says it will. But some filesystems contain bugs in their locking logic such that the locks do not always behave as advertised. This is especially true of network filesystems and NFS in particular. If SQLite is used on a filesystem where the locking primitives contain bugs, and if two or more threads or processes try to access the same database at the same time, then database corruption might result.

This, or something to this effect, has been frequently mentioned for over a decade, commonly in mixed Windows/Unix environments. However, I've never found any indication of which network file systems (or client/server combinations) are actually at risk.
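For reference, the "locking primitives" the SQLite documentation refers to are, on Unix-like systems, POSIX byte-range locks taken with fcntl(). A minimal sketch of that call (not SQLite's actual code; the file name is made up):

    /* Sketch of the POSIX byte-range locking primitive that SQLite's
     * Unix VFS relies on.  On a filesystem with a broken lock
     * implementation, fcntl() may report success here without actually
     * excluding writers on other machines. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("test.db", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock lk = {0};
        lk.l_type   = F_WRLCK;   /* exclusive (write) lock        */
        lk.l_whence = SEEK_SET;
        lk.l_start  = 0;         /* lock a byte range ...         */
        lk.l_len    = 0;         /* ... 0 = to the end of file    */

        if (fcntl(fd, F_SETLK, &lk) == -1)
            perror("fcntl(F_SETLK)");   /* lock held elsewhere     */
        else
            puts("write lock acquired (or so the filesystem claims)");

        close(fd);               /* closing the fd drops the lock */
        return 0;
    }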

What can I tell my customers?

  • It's not so much about the filesystem as it is the sharing protocol (NFS is not a filesystem, despite the name). Samba does perform write-locking. Commented Jun 27, 2013 at 12:53
  • so it would be an issue between certain clients/servers?
    – peterchen
    Commented Jun 27, 2013 at 18:21
  • Well, it's a server issue, but you're mostly right. NFS should inform whatever underlying file system is present to implement a lock when a file is opened for writing. That filesystem might be EXT, BTRFS, MurderFS, NTFS, or whatever else. All filesystems have to deal with write locking, but the sharing protocol is responsible for brokering between the network request and the filesystem-specific semantics. Unfortunately, NFS doesn't appear to do this well. I'm not sure why, but I would guess it's a problem with the way it abstracts away the actual FS operations. Commented Jun 27, 2013 at 18:46

1 Answer


Which network file systems do not implement locking correctly?

I believe that the correct answer is, in short: "All of them".

This should be qualified: the problem arises when multiple processes on multiple computers access the same database over the network. There should be no problem when all the processes accessing the database run on the same computer, provided the database engine (SQLite) uses exclusion tools such as semaphores or mutexes (and uses them correctly...).
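To illustrate the same-computer case, here is a hedged sketch (the semaphore name "/mydb-gate" is hypothetical): a named POSIX semaphore provides correct cross-process exclusion precisely because a single kernel arbitrates it, which is exactly the property that is lost once a second computer is involved.

    /* Same-machine exclusion sketch: every process on this computer
     * opens the same kernel-managed semaphore object, so one kernel
     * can arbitrate between all of them.  Compile with -pthread. */
    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>

    int main(void)
    {
        sem_t *db_gate = sem_open("/mydb-gate", O_CREAT, 0644, 1);
        if (db_gate == SEM_FAILED) { perror("sem_open"); return 1; }

        sem_wait(db_gate);   /* enter the critical section */
        puts("exclusive access to the database file");
        sem_post(db_gate);   /* leave the critical section */

        sem_close(db_gate);
        return 0;
    }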

The reason is that lock information is usually kept in the form "process X locks Y". This works very well when all the processes execute on the same computer, but not when they run on different computers.

When a process on one computer accesses a file on another computer, the operating system hosting the file substitutes itself for the remote process, in effect acting as an agent for a process it knows nothing about. It may even have difficulty distinguishing between two different processes executing on the same remote computer, and mistake one for the other.

Correctly locking files, whether the entire file or a section thereof, would really require a single operating system running on all of the involved computers, with a centralized repository of file locks. Unfortunately, neither Linux nor Windows does that in the general case.

You will find a better treatment of the subject for SQLite in the "How To Corrupt Your Database Files" section of the older article File Locking And Concurrency In SQLite Version 3. It details the system calls SQLite uses to ensure concurrency on Windows and Linux, both for locking file sections and for flushing updated data from the computer to the database. All the listed functions work well only in the context of a single computer. Appropriately, the article contains the sentence "Your best defense is to not use SQLite for files on a network filesystem".

Note that the problem of flushing data to the database is especially worrisome. The operating system usually delays writes, so you can end up in a situation where the updating process has flushed its data and released its lock, but the new data has not yet arrived in the database file when another process reads it, which easily results in a corrupted database.
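As a sketch of the update sequence just described (illustrative only, not SQLite's actual code): even when the writer performs every step in the correct order, an NFS client on another machine may still serve stale cached pages after step 4.

    /* Lock -> write -> flush -> unlock, in the order a careful writer
     * would use.  On a local disk, fsync() makes the data durable
     * before the lock is released; over NFS, a different client may
     * nevertheless read stale data from its own cache afterwards.
     * (Error paths leave the lock to be dropped when fd is closed.) */
    #include <fcntl.h>
    #include <unistd.h>

    int update_record(int fd, const char *buf, size_t n, off_t off)
    {
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        if (fcntl(fd, F_SETLKW, &lk) == -1) return -1;        /* 1. lock   */
        if (pwrite(fd, buf, n, off) != (ssize_t)n) return -1; /* 2. write  */
        if (fsync(fd) == -1) return -1;                       /* 3. flush  */
        lk.l_type = F_UNLCK;
        return fcntl(fd, F_SETLK, &lk);                       /* 4. unlock */
    }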

More information can be found in the SQLite article Atomic Commit In SQLite, section "Things That Can Go Wrong". It adds the case where different processes and computers use different locking mechanisms, such that one does not block the other.
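One concrete instance of such a mismatch (a hedged sketch; the file name is hypothetical): on local Linux filesystems, BSD flock() locks and POSIX fcntl() locks do not see each other, so two programs that each lock "correctly" can still overwrite each other's changes.

    /* Run one copy as "./demo bsd" and another as "./demo": on a local
     * Linux filesystem BOTH acquire an "exclusive" lock on the same
     * file, because flock() and fcntl() locks are independent there.
     * (Over NFS the interaction differs again, which is the point.) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open("shared.db", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (argc > 1) {                     /* BSD-style lock   */
            if (flock(fd, LOCK_EX) == 0)
                puts("flock(): exclusive lock acquired");
        } else {                            /* POSIX-style lock */
            struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
            if (fcntl(fd, F_SETLK, &lk) == 0)
                puts("fcntl(): exclusive lock acquired");
        }
        pause();                            /* hold the lock until Ctrl-C */
        return 0;
    }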

Database management systems that have solved this problem, such as Oracle or SQL Server, do so by designating one specific computer as the only one that may update the database, which greatly simplifies locking. Other networked systems, such as Apache Hadoop, supply their own locking mechanisms to solve these problems.

Another article you would find interesting is On the Brokenness of File Locking.

  • You beat me to it, saying "All of them". And your links are very informative. A much more extensive answer than I would have produced myself.
    – Tonny
    Commented May 16, 2019 at 14:56
  • It is possible to make this work... but you need something really heavy, like a distributed Paxos locking service. Not the sort of thing the average consumer-grade networked filesystem will provide, certainly.
    – Kevin
    Commented May 17, 2019 at 4:37
