
I need a file storage solution that provides read/write access to files for a collection of web servers. The space demands are modest: about 2 TiB right now, probably growing to twice that. NFS is what is used now, and it looked fine until I saw that almost all the files are in one single directory. With about 15 million files right now, and a total that could grow to 20 or 30 million, I am worried that a Linux filesystem might have a problem with that many files in one place.

I proposed that the application be modified to split the files up across several subdirectories, but the powers-that-be say "no" to that. That seems to leave me with two options:

  1. NFS. This would be the simplest, but I am not sure how well it can handle the number of files in the directory.

  2. Cloud storage -- here that means Azure. I don't know enough about cloud storage to have an opinion on expected performance, and I also do not know what kind of rewriting will be necessary. Can object storage in the cloud be made to appear as part of the local file system, the way I can do with NFS?


2 Answers


I just realized I never posted what I finally did to "solve" this.

I built a GlusterFS cluster consisting of four servers. Servers 1 and 2 mirror each other, and servers 3 and 4 mirror each other. New files are written alternately to the 1/2 pair and the 3/4 pair. Sort of like RAID 10 for file storage. I believe the GlusterFS folks call this a 2x2 (distributed-replicated) cluster.

The underlying volumes are managed by LVM and formatted as XFS.

So far it has held up well. We just passed the 25 million file mark and performance is still acceptable. It takes a while (about 3 hours) to get a full listing, but I only have to do that once per day for statistical purposes. According to df we are using about 5.2T of 8.0T total, though bear in mind the actual storage consumed is twice that because of the mirroring.
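
For reference, a daily statistics pass like that can be done without holding the whole listing in memory. This is only a rough sketch, assuming Python is available on one of the clients; the mount point below is made up for illustration:

    # Rough sketch (not from the original setup): stream a huge directory with
    # os.scandir() instead of "ls", so nothing is sorted or held in memory.
    # The mount point below is a made-up example.
    import os

    MOUNT = "/mnt/glustervol/files"  # hypothetical GlusterFS mount point

    def directory_stats(path):
        count = 0
        total_bytes = 0
        for entry in os.scandir(path):  # yields entries lazily, one at a time
            if entry.is_file(follow_symlinks=False):
                count += 1
                total_bytes += entry.stat(follow_symlinks=False).st_size
        return count, total_bytes

    if __name__ == "__main__":
        files, size = directory_stats(MOUNT)
        print(f"{files} files, {size / 2**40:.2f} TiB")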

Delayed thanks to all who answered. It helped me arrive at a compromise that should hold us for a while.


He seems to think this is a systems problem, and he is not entirely wrong.

To some extent, yes: older filesystems were really bad at handling millions of files, while newer ones handle it differently. For example, ext2 and FAT keep directory entries in simple linear lists, so the scalability problems really were problems with ext2 and FAT; their successors improved on this with HTree indexes in ext3/ext4 and B+trees in NTFS.

(Eventually, however, the design of the filesystem can only do so much – I suspect it's not easy to optimize a general-purpose filesystem to handle billions of files per directory on a server and still remain usable for tens of files per directory on a desktop computer without too much overhead...)

But the way you use the filesystem also matters a lot. Even if you have millions of files on e.g. XFS, chances are that direct lookups by exact path will remain reasonably fast, as they only involve reading a small part of the directory data; but trying to list the directory will be much slower in comparison. So your program should be designed to never need to list the entire directory, but to know exactly what files it needs.
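
As a rough sketch of the difference (assuming Python; the storage path and helper names below are made up), only the second function forces the filesystem to walk the whole directory:

    # Sketch of the two access patterns; the path and helper names are made up.
    import os

    STORE = "/srv/files"  # hypothetical single large directory

    def read_by_exact_path(name):
        # Direct lookup: the filesystem only resolves one name in its directory
        # index, which stays reasonably fast even with millions of entries.
        with open(os.path.join(STORE, name), "rb") as f:
            return f.read()

    def find_by_scanning(predicate):
        # Full listing: every entry in the directory has to be enumerated.
        # This is the pattern that gets slow and should be designed away.
        return [e.name for e in os.scandir(STORE) if predicate(e.name)]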

(As an analogy, if you use a SQL database, you already know that the correct way to search for data is to let server-side queries of "SELECT WHERE this=that" do the job – you don't usually try to retrieve the entire table every single time and then blame it on your network being too slow.)

"NFS. This would be the simplest, but I am not sure how well it can handle the number of files in the directory."

NFS doesn't store directory lists on its own; it only provides access to the capabilities of the remote "storage" filesystem.

So if all your operations deal only with exact paths (i.e. read this specific file, write that specific file), then the capabilities of NFS itself should be largely irrelevant to your problem, as NFS never needs to look at the complete list of files – it only forwards the exact requested paths to the file server, where the NFS server's on-disk filesystem (e.g. ZFS or ext4) has to worry about handling the whole directory.

In other words, you're only shifting the problem to a different machine, but it remains the exact same problem there. (The NFS file server certainly could use a filesystem that handles many files better than the one used on the web servers, but you could do that locally as well.)

"Any strategy I can devise to break up the files between several directories would require code changes, and the project manager is not willing to do anything beyond the most trivial of changes."

The most trivial change would be to use part of the file name itself as the subdirectory name. This makes it easy to find the files later – just apply the same transformation to the file name when reading it as you did when storing it.

Take a look at how .git/objects/ works. It can accumulate many object files (especially if you travel back in time to when Git didn't yet have packfiles), so they are separated into subdirectories based on the first two hex characters of the object ID.

For example, the Git object c813a148564a5.. is found at objects/c8/13a148564a5.., using one level of subdirectories with an 8-bit prefix – there are 256 possible subdirectories, and the number of files within each subdirectory is reduced roughly 256-fold (e.g. only ~40k files per directory in a 10-million-object repository) – and the software knows exactly where to find each object from its name alone.
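
As a rough sketch of that path computation (assuming Python; the base directory and function name below are made up):

    # Sketch of git-style sharding: the first two hex characters of the name
    # become the subdirectory. The base directory and function name are made up.
    import os

    BASE = "objects"  # e.g. .git/objects/

    def shard_path(object_id):
        return os.path.join(BASE, object_id[:2], object_id[2:])

    # shard_path("c813a148564a5..") -> "objects/c8/13a148564a5.."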

If you want to spread the files out even more, you can use longer subdirectory names (e.g. a 12-bit prefix for 1/4096 of the files per directory) or even add a second level of subdirectories.

This works best if the names are evenly distributed, like hash-based names usually are. If your file names tend to start with the same text over and over, you should hash the names to avoid that (and store the mapping of real name to hash name in a database).
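
A rough sketch of that variant, assuming Python with SHA-256 for the name hash and a two-character prefix per level; the storage root and function name are illustrative only:

    # Sketch of the hashed-name variant: hash the original file name so the
    # prefix is evenly distributed, then keep the real-name -> hashed-name
    # mapping elsewhere (e.g. a database table, as suggested above).
    # The storage root, hash choice and function name are assumptions.
    import hashlib
    import os

    BASE = "/srv/files"  # hypothetical storage root

    def stored_path(original_name, levels=1, chars=2):
        digest = hashlib.sha256(original_name.encode("utf-8")).hexdigest()
        parts = [digest[i * chars:(i + 1) * chars] for i in range(levels)]
        return os.path.join(BASE, *parts, digest)

    # stored_path("invoice-2021.pdf") gives something like
    # "/srv/files/3f/3f9a..." (one level, two-character prefix).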
