
I'd like to make ls -laR /media/myfs on Linux as fast as possible. I'll have 1 million files on the filesystem, 2TB of total file size, and some directories containing as many as 10000 files. Which filesystem should I use, and how should I configure it?

As far as I understand, the reason why ls -laR is slow is that it has to stat(2) each inode (i.e. 1 million stat(2)s), and since inodes are distributed randomly on the disk, each stat(2) needs one disk seek.
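
As a quick sanity check (a diagnostic sketch only, assuming strace is available), the syscall counts confirm where the time goes:

strace -c ls -laR /media/myfs > /dev/null
# the summary printed on stderr is dominated by lstat/stat and getdents calls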

Here are some solutions I had in mind, none of which I am satisfied with:

  • Create the filesystem on an SSD, because the seek operations on SSDs are fast. This wouldn't work, because a 2TB SSD doesn't exist, or it's prohibitively expensive.

  • Create a filesystem which spans two block devices: an SSD and a disk; the disk contains file data, and the SSD contains all the metadata (including directory entries, inodes and POSIX extended attributes). Is there a filesystem which supports this? Would it survive a system crash (power outage)?

  • Use find /media/myfs on ext2, ext3 or ext4, instead of ls -laR /media/myfs, because the former can take advantage of the d_type field (see the getdents(2) man page), so it doesn't have to stat. Unfortunately, this doesn't meet my requirements, because I need all the file sizes as well, which find /media/myfs doesn't print (but see the find -printf sketch after this list).

  • Use a filesystem, such as VFAT, which stores inodes in the directory entries. I'd love this one, but VFAT is not reliable and flexible enough for me, and I don't know of any other filesystem which does that. Do you? Of course, storing inodes in the directory entries wouldn't work for files with a link count more than 1, but that's not a problem since I have only a few dozen such files in my use case.

  • Adjust some settings in /proc or sysctl so that inodes are locked to system memory forever. This would not speed up the first ls -laR /media/myfs, but it would make all subsequent invocations amazingly fast. How can I do this? I don't like this idea, because it doesn't speed up the first invocation, which currently takes 30 minutes. Also I'd like to lock the POSIX extended attributes in memory as well. What do I have to do for that?

  • Use a filesystem which has an online defragmentation tool, which can be instructed to relocate inodes to the beginning of the block device. Once the relocation is done, I can run dd if=/dev/sdb of=/dev/null bs=1M count=256 to get the beginning of the block device fetched into the kernel's in-memory cache without seeking, and then the stat(2) operations would be fast, because they read from the cache. Is there a way to lock those inodes and/or blocks into memory once they have been read? Which filesystem has such a defragmentation tool?
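
(Regarding the find alternative above: GNU find's -printf can emit an ls -l-style line including the size, though it still has to stat each file, so it doesn't avoid the seeks. A rough sketch:)

find /media/myfs -printf '%M %n %u %g %s %TY-%Tm-%Td %TH:%TM %p\n'
# %M = mode, %n = link count, %u/%g = owner/group, %s = size in bytes, %T... = mtime, %p = path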

  • You say "on Linux as fast as possible". Do you really mean that? What is your budget for hardware? It can be very fast if you have a legitimate reason for needing things to be so fast.
    – deltaray
    Commented Jan 9, 2011 at 18:20
  • If your version of find has -printf (and on Linux it should), you can output the same information as ls -l (in addition to the fact that find has -ls). However, it's still going to have to do a stat to get that information. Have you considered using locate or a similar scheme?
    – Dennis Williamson
    Commented Jan 9, 2011 at 20:29
  • @deltaray: My budget is the price of a 32GB SSD drive + my time to set things up. I don't have time to develop custom software.
    – pts
    Commented Jan 9, 2011 at 23:22
  • @Dennis Williamson: I know about locate, and it can be a useful workaround. But I'm still interested in getting an answer to my question.
    – pts
    Commented Jan 9, 2011 at 23:24

5 Answers

I'll trade you my answer to your question for your answer to mine: What knobs have to be fiddled in /proc or /sys to keep all the inodes in memory?

Now for my answer to your question:

I'm struggling with a similar-ish issue, where I'm trying to get ls -l to work quickly over NFS for a directory with a few thousand files when the server is heavily loaded.

A NetApp performs the task brilliantly; everything else I've tried so far doesn't.

Researching this, I've found a few filesystems that separate metadata from data, but they all have some shortcomings:

  • dualfs: Has some patches available for 2.4.19 but not much else.
  • lustre: ls -l is a worst-case scenario because all the metadata except the file size is stored on the metadata server.
  • QFS for Solaris, StorNext/Xsan: Not known for great metadata performance without a substantial investment.

So that won't help (unless you can revive dualfs).

The best answer in your case is to increase your spindle count as much as possible. The ugliest, but cheapest and most practical, way to do this is to get an enterprise-class JBOD (or two) and a Fibre Channel card off eBay that are a few years old. If you look hard, you should be able to keep your costs under $500 or so. The search terms "146gb" and "73gb" will be of great help. You should be able to convince a seller to make a deal on something like this, since they've got a bunch of them sitting around and hardly any interested buyers:

http://cgi.ebay.ca/StorageTek-Fibre-Channel-2TB-14-Bay-HDD-Array-JBOD-NAS-/120654381562?pt=UK_Computing_Networking_SM&hash=item1c178fc1fa#ht_2805wt_1056

Set up a RAID-0 stripe across all the drives. Back up your data religiously, because one or two of the drives will inevitably fail. Use tar for the backup instead of cp or rsync so that the receiving single drive won't have to deal with the millions of inodes.
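
(A minimal sketch of such a stripe with mdadm, assuming fourteen drives show up as /dev/sdb through /dev/sdo; adjust the device names and the filesystem to taste:)

mdadm --create /dev/md0 --level=0 --raid-devices=14 /dev/sd[b-o]
mkfs.ext4 /dev/md0            # or whichever filesystem you settle on
mount /dev/md0 /media/myfs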

This is the single cheapest way I've found (at this particular historical moment, anyway) to increase IOPS for filesystems in the 2-4TB range.

Hope that helps - or is at least interesting!

  • About keeping inodes in memory on Linux: try sudo sysctl -w vm.vfs_cache_pressure=1 (or =0). I got this hint from a forum; I've never tried it.
    – pts
    Commented Apr 8, 2011 at 19:39
  • @pts: vfs_cache_pressure=1 is pretty extreme. I use vfs_cache_pressure=60 on my desktop instead of the default of 100, which biases the kernel somewhat towards keeping metadata in cache, but I haven't measured to see how much difference it makes. Commented Feb 2, 2016 at 4:54
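
(For reference, a hedged sketch of setting that knob and making it survive a reboot; the value itself is a trade-off you would have to tune:)

sysctl -w vm.vfs_cache_pressure=60
echo 'vm.vfs_cache_pressure = 60' >> /etc/sysctl.conf   # re-applied at boot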

the disk contains file data, and the SSD contains all the metadata ... Is there a filesystem which supports this?

btrfs supports this to some extent; see the btrfs wiki. One can specify raid1 for the metadata (and raid0 for the data; most data will end up on the large HDD) so that the SSD will always have a copy of the metadata for reading (I have no idea how clever btrfs is in selecting the source for reading metadata). I have not seen any benchmarks for such a setup.
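
(A minimal sketch of the layout described above, with hypothetical device names: /dev/sdb is the SSD and /dev/sdc the HDD. Note that raid0 data will still put some data on the SSD:)

mkfs.btrfs -m raid1 -d raid0 /dev/sdb /dev/sdc
mount /dev/sdb /media/myfs    # mounting either member device mounts the whole filesystem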

  • Thank you for mentioning btrfs. Unfortunately I couldn't find the feature you mentioned. spinics.net/lists/linux-btrfs/msg05617.html says that in July 2010 this was not possible. Could you please cite a source stating otherwise?
    – pts
    Commented Jan 24, 2013 at 7:37

No answer, unfortunately, although I did google for one for the last half hour.

Create a filesystem which spans two block devices: an SSD and a disk; the disk contains file data, and the SSD contains all the metadata (including directory entries, inodes and POSIX extended attributes). Is there a filesystem which supports this? Would it survive a system crash (power outage)?

Exactly what I also would like.

For the links, see this pastebin, because I'm not allowed to post more than one link...

http://www.notehub.org/2014/10/2/external-metadata-more-information

Multi-device support from btrfs is discussed here:

Btrfs: Working with multiple devices, by Jonathan Corbet, December 30, 2013 (LWN), [link][1]

But although you can mirror the metadata (-m raid1) to an SSD, you are then forced to also use the SSD for data (-d raid0) storage, at least partially.

The good news is that there is work being done:

Dedicated metadata drives (Jan Schmidt and Arne Jansen; not in the kernel yet): "We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs." [link][2]

If you are willing to use IBM's proprietary General Parallel File System (GPFS), then this is already possible, it seems. Read "How to migrate all GPFS filesystem metadata to SSDs": [link][3]

I would just use ext4 and make sure that you have dir_index set. You can check for that flag by running this:

dumpe2fs /dev/drivepartition | grep "Filesystem features:"
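
If the flag turns out to be missing, it can be enabled afterwards; a sketch, reusing the /dev/drivepartition placeholder (the filesystem must be unmounted for the e2fsck step):

tune2fs -O dir_index /dev/drivepartition
e2fsck -fD /dev/drivepartition    # -D rebuilds/optimizes the directory indexes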

The biggest problem you'll run into is just the overall number of files on the filesystem. Any operation you run across the filesystem will have to look at each file. This is the case with any filesystem. 10,000 files in a directory may seem like a lot, but I find that filesystems don't get slow until you get to 40,000 files or more, and that's really a symptom of older filesystems like ext2.

It sounds like you're trying to do something specific rather than just have a general purpose filesystem. If you can explain what you're trying to do, we can probably suggest a way to optimize your data. For instance, a database.

  • Thank you for composing your answer. I know about the dir_index flag, and it doesn't help me (i.e. it's still too slow with the flag enabled), because it doesn't prevent the stat(2) calls and the disk seeks.
    – pts
    Commented Jan 10, 2011 at 9:36
  • My use case is interactive: I go to directories in Midnight Commander, then I view, copy and move some files. If the directory contains 10000 files, sometimes it takes 50 seconds to refresh it in Midnight Commander (because of the stat(2) calls).
    – pts
    Commented Jan 10, 2011 at 9:38
  • Again I ask, why do you have 10,000 files in a directory?
    – deltaray
    Commented Jan 10, 2011 at 15:38
  • Some of my scripts and programs create files into the same directory. Every week I open those directories manually, and move the files away, possibly to smaller directories. However, in this question, I am not interested in alternative designs. I am only interested in what I asked.
    – pts
    Commented Jan 10, 2011 at 18:56
  • If your budget is limited, then you'd better start looking into alternative designs. You can try XFS or JFS and see if you see an improvement, but I think that if your budget is limited, you should start thinking of a better design for your data.
    – deltaray
    Commented Jan 10, 2011 at 22:06

Create a filesystem which spans two block devices: an SSD and a disk; the disk contains file data, and the SSD contains all the metadata (including directory entries, inodes and POSIX extended attributes). Is there a filesystem which supports this? Would it survive a system crash (power outage)?

ZFS and Bcachefs support this. ZFS has the Special VDEV class:

In ZFS 0.8 and later, it is possible to configure a Special VDEV class to preferentially store filesystem metadata, and optionally the Data Deduplication Table (DDT) and small filesystem blocks. This allows one, for example, to create a Special VDEV on fast solid-state storage to store the metadata, while the regular file data is stored on spinning disks. This speeds up metadata-intensive operations such as filesystem traversal, scrub, and resilver, without the expense of storing the entire filesystem on solid-state storage.
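
(A minimal sketch of such a pool, with hypothetical device names: four HDDs for data and two SSDs mirrored as the special vdev:)

zpool create tank raidz2 sda sdb sdc sdd special mirror nvme0n1 nvme1n1
zfs set special_small_blocks=32K tank    # optionally route small data blocks to the SSDs as well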

And in Bcachefs there are target options:

Four target options exist which may be set at the filesystem level (at format time, at mount time, or at runtime via sysfs), or on a particular file or directory:

  • foreground target: normal foreground data writes, and metadata if metadata target is not set
  • metadata target: btree writes
  • background target: If set, user data (not metadata) will be moved to this target in the background
  • promote target: If set, a cached copy will be added to this target on read, if none exists

See Linux filesystem with metadata on SSD and file data on HDD
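
(A hedged sketch of such a bcachefs setup, with hypothetical devices: /dev/nvme0n1 as the SSD and /dev/sda as the HDD; the target options follow the list above:)

bcachefs format \
  --label=ssd.ssd1 /dev/nvme0n1 \
  --label=hdd.hdd1 /dev/sda \
  --metadata_target=ssd --foreground_target=ssd \
  --promote_target=ssd --background_target=hdd
mount -t bcachefs /dev/nvme0n1:/dev/sda /media/myfs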


Use a filesystem, such as VFAT, which stores inodes in the directory entries. I'd love this one, but VFAT is not reliable and flexible enough for me, and I don't know of any other filesystem which does that. Do you? Of course, storing inodes in the directory entries wouldn't work for files with a link count more than 1, but that's not a problem since I have only a few dozen such files in my use case.

VFAT doesn't have inodes at all. Only NTFS (and probably also ReFS) has a similar feature. The equivalent of the inode struct is the MFT record, which is stored in the MFT and has a field called File ID that is the analog of inode numbers in *nix filesystems. The FAT entry is similar to an inode, but it doesn't have a fixed ID field. In fact, many modern *nix filesystems like Btrfs don't even have on-disk inodes and instead generate inode numbers in a different way when necessary.

The MFT record in NTFS is large, hence it can contain lots of information such as filenames, links, or even data streams... In the case of folders, the contents are filenames, so they can also be stored directly inside the MFT record if they're not too large.

Although hard links use the same MFT record (inode), which records file metadata such as file size, modification date, and attributes, NTFS also caches this data in the directory entry as a performance enhancement. This means that when listing the contents of a directory using the FindFirstFile/FindNextFile family of APIs (equivalent to the POSIX opendir/readdir APIs), you will also receive this cached information, in addition to the name and inode.

https://en.wikipedia.org/wiki/NTFS#Hard_links

You can run fsutil file layout to see what kind of metadata has been stored inside the MFT record itself. It will be shown as Resident | No clusters allocated.

In fact, the resident-file feature (files whose data streams live inside the MFT record, with no blocks allocated for data) was borrowed by many other filesystems, like ext4 (where it's called inline files). However, the inode is smaller than an MFT record, so less data can be stored in it. Btrfs has a similar feature that's controlled by the max_inline=bytes mount option.
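
(A sketch of those knobs, with hypothetical device names; ext4 needs inline_data at mkfs time, and larger inodes leave more room for inline data, while for Btrfs max_inline is a mount option:)

mkfs.ext4 -I 512 -O inline_data /dev/sdX
mount -o max_inline=2048 /dev/sdY /mnt    # Btrfs: inline files up to 2048 bytes in the metadata tree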
