32

If this question is too programmer oriented, let me know. Is anyone here familiar with the O_DIRECT flag for the open() system call on Linux 2.6? Linus disparages its use, yet advice on high-performance file writing seems to call for it. I would like to hear about any real-world experience and recommendations.

More info: The application that I am using maintains its own cache, and in doing so attains an average speedup of 5x or more. When writing to a file, the contents of that cache must first be copied into the filesystem cache, which seems redundant and a performance concern.

6 Answers

24

OK, you ask for experiences. This makes the question a little subjective and argumentative, but passable.

Linus was referring to the uses that people usually attribute to O_DIRECT, and for those uses, IMO, Linus is mostly correct. Even with direct I/O, data cannot move between the device and your program's variables directly: you need a buffer that is filled (by the program or by the device) and handed to the other end through a system call. Also, for efficiency, you do not want to reread something you have just read in case you need it again. So you need some sort of cache... and that is exactly what the kernel provides without O_DIRECT: the page cache! Why not use that? It also brings benefits when several processes access the same file concurrently, which would be a disaster with O_DIRECT.

Having said that, O_DIRECT has its uses, for example if for some reason you need to get data directly from the block device. That has nothing to do with performance.

People using O_DIRECT for performance usually come from systems with bad page cache algorithms or without POSIX advice mechanisms, or they are mindlessly repeating what other people have said. On those systems, O_DIRECT was a way around these problems. Linux, OTOH, has the philosophy that you should fix the real underlying problem, and the underlying problem was OSes that did a bad job with page caching.

I used O_DIRECT for a simple implementation of cat to find a memory error in my machine. This is one valid use for O_DIRECT. That had nothing to do with performance.
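To illustrate the point about buffers, here is a minimal sketch of what such an O_DIRECT cat might look like. It is not the code from this answer. The names direct_cat, DIRECT_ALIGN and DIRECT_CHUNK are made up for illustration, and the 4096-byte alignment is an assumption that covers common 512-byte and 4 KiB logical block sizes; the real requirement depends on the device and filesystem.

```c
#define _GNU_SOURCE            /* glibc only exposes O_DIRECT with this defined */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed values, not taken from the answer above. */
#define DIRECT_ALIGN 4096u              /* buffer alignment in bytes */
#define DIRECT_CHUNK (1024u * 1024u)    /* read size, a multiple of DIRECT_ALIGN */

/* Copy the file at `path` to `out_fd`, reading it with O_DIRECT.
 * Returns the number of bytes copied, or -1 with errno set. */
long long direct_cat(const char *path, int out_fd)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;                      /* EINVAL if the filesystem refuses O_DIRECT */

    /* O_DIRECT generally needs an aligned buffer; plain malloc() is not enough. */
    void *buf = NULL;
    int rc = posix_memalign(&buf, DIRECT_ALIGN, DIRECT_CHUNK);
    if (rc != 0) {
        close(fd);
        errno = rc;
        return -1;
    }
    assert((uintptr_t)buf % DIRECT_ALIGN == 0);

    long long total = 0;
    int failed = 0;
    for (;;) {
        ssize_t n = read(fd, buf, DIRECT_CHUNK);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0) {
            failed = 1;
            break;
        }

        /* Ordinary buffered write of whatever was read. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, (char *)buf + off, (size_t)(n - off));
            if (w < 0 && errno == EINTR)
                continue;
            if (w < 0) {
                failed = 1;
                break;
            }
            off += w;
        }
        if (failed)
            break;
        total += n;

        /* A short read from a regular file means end of file. Stopping here
         * also avoids issuing another read at an unaligned offset. */
        if ((size_t)n < DIRECT_CHUNK)
            break;
    }

    int saved_errno = errno;
    free(buf);
    close(fd);
    errno = saved_errno;
    return failed ? -1 : total;
}
```

Calling direct_cat("/path/to/file", STDOUT_FILENO) would behave like a cat that bypasses the page cache on the read side. Even here, a user-space buffer sits between the device and the program.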

3
  • Thanks for the info, it is appreciated. I have updated my question with the specific conditions of the app that prompted this question. If you have more details on the POSIX advice mechanisms for writing files, that would be appreciated, too. Commented Jan 28, 2011 at 5:42
  • 4
    O_DIRECT might also be useful in a system where the developer wants to provide a caching mechanism at the application layer (think databases).
    – Jmoney38
    Commented Feb 3, 2014 at 17:14
  • "It has nothing to do with performance." That's not always true, especially when accessing a high-speed device where the I/O rates rival memory bandwidth, or even just a significant percentage of memory bandwidth. In that case, skipping the extra copy to/from the page cache can have significant performance benefits. Commented Oct 14, 2018 at 16:19
22

Actually, O_DIRECT is needed to avoid either of the following:

  • cache pollution: sometimes you know that caching is pointless overhead, e.g. when dealing with really large files, say 64 GiB when there is only 2 GiB of RAM. A 32 GiB torrent download that a user decides to verify is not a good candidate for caching. Caching it is just extra activity with its own overhead, and it can push genuinely useful data out of the cache.
  • double caching: e.g. some RDBMSes (MySQL, to name one) let you configure their own cache. Databases supposedly know better what to cache and how than the kernel's Virtual Memory subsystem, which knows nothing about SQL query planning and so on.

Neither of these is desirable. Note that O_DIRECT is not meant to make things faster, and often it does not.

3
  • 13
    posix_fadvise can take care of the cache pollution problem.
    – psusi
    Commented Jun 14, 2012 at 14:36
  • I do not think Virtual Memory has anything to do with it; it merely maps memory addresses. Buffer Cache/Page Cache is what you mean.
    – ArekBulski
    Commented Oct 26, 2015 at 21:51
  • Caching is part of the VM subsystem in UNIX, as far as I can tell; that's why I used this term. Thanks for the edit. :)
    – poige
    Commented Oct 26, 2015 at 23:58
10

It has lots to do with performance.

An interesting example is MongoDB with the mmap storage engine. O_DIRECT is best used, as others have stated, where the data is unlikely to be read again for some time. In MongoDB, the database journal is written using O_DIRECT, while data and index writes are handled by the page cache mechanism (pdflush). Although O_DIRECT offers less bandwidth, it also means less latency, and hence reduces data loss in the event of an unexpected outage (kernel panic, disk or power failure). Note that there is still buffering before an O_DIRECT write is committed to non-volatile storage; O_DIRECT only narrows the window for data loss.

Another important feature of O_DIRECT is that it provides more control over the sequence of writes. Again, it does not guarantee the order of writes (unless you have a non-volatile caching disk controller and are using the FIFO scheduler, but these have their own complications). Hence, although MySQL uses O_DIRECT for its data and indexes as well as for journalling, it can expect that the latter will usually be committed first.

But it's important to remember that O_DIRECT breaks fairness in resource allocation. One of the reasons your application is sped up is that it is slowing down other stuff.

2
  • You say it has a lot to do with performance, yet, you provide an example where it is used either to decrease latency or order writes. But I do agree that it affects performance. Fair point about fairness.
    – ArekBulski
    Commented Oct 27, 2015 at 0:42
  • Can you provide more references explaining when it is unfair?
    – ACyclic
    Commented Nov 5, 2015 at 3:38
9

Note that using O_DIRECT is liable to fail in newer kernels with newer file systems. See this bug report for example. So not only is the use often dubious, it will likely not work at all in the coming generation of Linux distributions. So I would not bet the performance of my code on it, even if you happen to be able to prove that it might have a benefit.
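If you do use O_DIRECT, one way to hedge against this is to treat it as optional. The sketch below assumes the documented open(2) behaviour of failing with EINVAL when the filesystem does not support the flag, and retries without it in that case. The helper name open_prefer_direct is hypothetical, and whether a silent fallback is acceptable depends on why the application wanted direct I/O in the first place.

```c
#define _GNU_SOURCE            /* glibc only exposes O_DIRECT with this defined */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` with O_DIRECT and fall back to
 * ordinary buffered I/O if the filesystem rejects the flag.
 * On success returns the descriptor and sets *is_direct accordingly.
 * On failure returns -1 with errno set.
 * Caveat: avoid combining this with O_CREAT | O_EXCL, because a failed first
 * attempt may already have created the file on some filesystems. */
int open_prefer_direct(const char *path, int flags, mode_t mode, bool *is_direct)
{
    assert(is_direct != NULL);

    int fd = open(path, flags | O_DIRECT, mode);
    if (fd >= 0) {
        *is_direct = true;
        return fd;
    }
    if (errno != EINVAL)
        return -1;             /* a real error: do not mask it with a retry */

    /* EINVAL: assume the filesystem does not support O_DIRECT here. */
    *is_direct = false;
    return open(path, flags, mode);
}
```

The caller can use is_direct to decide whether its buffers and transfer sizes need to be block-aligned, since that only matters on the direct path.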

1
  • 1
    The bug report actually discusses the use of filesystems mounted with the data=journal option. This option is directly opposite in effect to the O_DIRECT flag. Most ext3 and ext4 filesystems do not have this option set, and if they do, turning it off will permit opening the file with O_DIRECT. Commented Jan 29, 2011 at 17:57
6

Relating to what @Juliano has already said.

Check out posix_fadvise. If the real problem is misbehaviour of the underlying filesystem's cache algorithm, you can try giving it advice about how you are going to use the file. For a nicely implemented filesystem, this should give a performance boost. (Here is a link to another topic touching on similar considerations: https://stackoverflow.com/a/3755818/544721 )
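For the write-heavy case in the question, a rough sketch of the advice approach could look like the following. It uses ordinary buffered writes, flushes them with fdatasync, and then passes POSIX_FADV_DONTNEED so the kernel may drop the now-clean pages. The function name write_then_drop_cache is made up for illustration. The advice is only a hint, so this sketch deliberately ignores its result, and no performance gain is implied.

```c
#define _POSIX_C_SOURCE 200809L   /* for posix_fadvise() and fdatasync() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes at the current file offset, force
 * them to storage, then tell the kernel the cached pages are not needed.
 * Returns the number of bytes written, or -1 with errno set. */
ssize_t write_then_drop_cache(int fd, const void *data, size_t len)
{
    assert(data != NULL || len == 0);

    const char *p = data;
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0)
            return -1;
        done += (size_t)n;
    }

    /* Pages that are still dirty are not discarded, so flush them first. */
    if (fdatasync(fd) != 0)
        return -1;

    /* Offset 0 with length 0 means "the whole file". posix_fadvise() returns
     * an error number rather than setting errno; it is ignored here because
     * the call is purely advisory. */
    int advice_rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    (void)advice_rc;

    return (ssize_t)done;
}
```

Calling fdatasync after every chunk is expensive. A real writer would more likely flush and advise once per large batch, or limit the advice to the byte range it has just written.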

1
  • 1
    It looks like posix_fadvise changes the readahead algorithms used by the kernel. The critical factor with the code in the question is write performance. The problem is that writing out the buffer fills the Linux caches first, which the kernel then has to dump when it runs out of memory. This is a waste of effort; the output in this case should be minimally buffered on the way to disk. Commented Oct 28, 2015 at 2:05
3

I'm mostly a DBA/sysadmin/network admin, but I'm also a C/C++ programmer. I wrote my own app to mass copy/move files using O_DIRECT and Linux native AIO. The result is a reasonable increase in throughput without disturbing any apps that need the cache.
As a DBA I like to use direct I/O whenever possible. Many RDBMSes offer O_DIRECT support, but none offer to use POSIX/Linux native calls to free up the cache after using it.
In the end, it seems to me that Linus's position makes sense from his kernel developer standpoint, but it's not very practical for an RDBMS developer who must support other OSes too.
Back to my app: using O_DIRECT with 8 or 16 MB buffers makes a substantial improvement in performance, and the app barely uses any CPU at all.
