He seems to think this is a systems problem, and he is not entirely wrong.
To some extent, yes: older filesystems were really bad at handling millions of files, and newer ones do better. For example, ext2 and FAT use simple linear lists of directory entries, so the scalability problems were indeed real problems with ext2 and FAT – ones that their successors addressed with HTree directory indexes (ext4) and B+trees (NTFS).
(Eventually, however, the design of the filesystem can only do so much – I suspect it's not easy to optimize a general-purpose filesystem to handle billions of files per directory on a server and still remain usable for tens of files per directory on a desktop computer without too much overhead...)
But the way you use the filesystem also matters a lot. Even if you have millions of files on e.g. XFS, chances are that direct lookups by exact path will remain reasonably fast, as they only involve reading a small part of the directory data; but trying to list the directory will be much slower in comparison. So your program should be designed to never need to list the entire directory, but to know exactly what files it needs.
(As an analogy, if you use a SQL database, you already know that the correct way to search for data is to let server-side queries of "SELECT WHERE this=that" do the job – you don't usually try to retrieve the entire table every single time and then blame it on your network being too slow.)
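To make the difference concrete, here's a minimal sketch (the store root and function names are mine, purely for illustration): a direct open of an exact path only has to walk the directory index, while any "scan the directory" style of lookup pays for every entry in it.

```python
import os

def fetch(store: str, name: str) -> bytes:
    # Direct lookup: the kernel resolves the exact path through the
    # directory index (HTree/B+tree), touching only a few blocks.
    with open(os.path.join(store, name), "rb") as f:
        return f.read()

def find_by_suffix(store: str, suffix: str) -> list[str]:
    # Anti-pattern: listing reads the whole directory, so the cost
    # grows with the total number of entries, not with what you need.
    return [n for n in os.listdir(store) if n.endswith(suffix)]
```

With millions of entries, `fetch()` stays fast while `find_by_suffix()` degrades linearly – which is exactly why the program should always know the exact name it wants.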
NFS. This would be the simplest but I am not sure how well it can handle the number of files in the directory.
NFS doesn't store directory lists of its own; it only provides access to the capabilities of the remote "storage" filesystem.
So if all your operations deal only with exact paths (i.e. read this specific file, write this file), then the capabilities of NFS itself should be completely irrelevant to your problem: NFS never needs to look at the complete list of files – it only forwards the exact requested paths to the file server, where the NFS server's on-disk filesystem (e.g. ZFS or ext4) has to worry about handling the whole directory.
In other words, you're only shifting the problem to a different machine, but it still remains the exact same problem there. (Though the NFS file server certainly could use a filesystem that handles many files better than the one used on the web server, but you can do that locally as well.)
Any strategy I can devise to break up the files between several directories would require code changes, and the project manager is not willing to do anything beyond the most trivial of changes.
The most trivial change would be to use part of the file name itself as the subdirectory name, as this makes it easy to find the files later – just apply the same transformation to the name when reading as you did when storing it.
Take a look at how .git/objects/ works. It can accumulate many object files (especially if you travel back in time to when Git didn't yet have packfiles), so they are separated into subdirectories based on the first 2 hex digits of the object ID.
For example, the Git object c813a148564a5.. is found at objects/c8/13a148564a5.., using one level of subdirectories with an 8-bit prefix – there are 256 possible subdirectories, and the number of files within each is reduced roughly 256-fold (e.g. only ~40k files per directory in a 10-million-object repository) – and the software knows exactly where to find each object from its name alone.
If you want to spread files out even more, you can use longer subdirectory names (e.g. a 12-bit prefix for 4096 subdirectories) or even add a second level of subdirectories.
This works best if the names are evenly distributed, like hash-based names usually are. If your file names tend to start with the same text over and over, you should hash the names to avoid that (and store the mapping of real name to hash name in a database).
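A sketch of that hashing step, git-style (SHA-256 and the parameter names are my choices for illustration – any stable hash works, as long as you keep the real-name-to-hash mapping in your database):

```python
import hashlib
import os

def hashed_path(root: str, name: str, levels: int = 1, width: int = 2) -> str:
    # Hash the real name so that skewed names (e.g. all starting with
    # "invoice-") still spread evenly across the subdirectories.
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # Take `levels` prefix slices of `width` hex chars each, like
    # git's objects/c8/13a1... layout (levels=1, width=2).
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, digest)
```

The path is fully determined by the name, so lookups never need a directory listing; the only extra state is the name-to-digest mapping, which belongs in the database anyway.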
From the comments:

"Could you use a YYYYMMDD time stamp as the subdirectory name, etc.? Or is that the application-level code copying files to the NFS that you're saying the developers won't change?"

"Note that with ext4, simply having had millions of files in a directory in the past can make it unusable: unix.stackexchange.com/questions/679176/…. Granted, you can use other filesystems, but that would mean a lot of work benchmarking each filesystem's behaviour and its pros and cons."