
We have a huge number of queries hitting our API, each requesting a small or large extract of some huge files lying on our mounted hard drives. The data needs to be extracted from the files and processed before being handed back to the customer. My goal is to make this operation as blazingly fast as possible.

To piece together the puzzle I need some answers to the following questions:

  1. Is it possible to keep the memory bus from becoming the bottleneck by using (lightly) compressed data on reads?
  2. If so, which decompression algorithm is fit for multithreaded decompression? I read that the official xz library doesn't support multithreaded decompression, so what can I use instead?
  3. Is it a good idea to use mmap() to map the whole file into memory at once, or should I use normal file handles? Our data is stored on either NVMe or JBOD drives, but I could opt in to non-sequential reads depending on the exact case.

Also, maybe there are other ideas around on how to accomplish this in the most efficient manner. Everything is probably more efficient than what we have now.

4
  • You could always blink more slowly.
    – JimmyJames
    Commented Jan 22 at 16:27
  • More seriously, I've worked with a database system that used compression to speed up reads from platters, so there's definitely something to that idea.
    – JimmyJames
    Commented Jan 22 at 16:30
  • @JimmyJames We're also using disks (JBOD), so I imagine it would make sense. But I'll have to measure. What compression algorithm did you use?
    – glades
    Commented Jan 23 at 6:57
  • I'm referring to an old exadata appliance. You might be able to find out what they used but it might be a secret.
    – JimmyJames
    Commented Jan 23 at 15:13

5 Answers

9

profiling measurements

It sounds like you do not yet know where the CPU cycles go, or, more importantly, the sources of I/O + network delay.

I didn't see any numeric figures, let alone 98th percentile SLA targets.

To "blaze", or to go "twice as fast as blazing", does not a performance target make. Pick a particular transaction of interest to the business. Identify where it spends its time, and what it would cost to make one component of that delay, or another component, go twice as fast. Ask the business whether it's worth it, whether the checks will be issued.

  1. compressed data on reads?

Yes! Google advocates using the Snappy algorithm for exactly this use case. Compressed data is cache-friendly at all levels of the memory hierarchy, and the server endpoint's fast CPU doesn't become the bottleneck. (But measure, to verify.)
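
If you want a feel for what that looks like in code, here is a minimal sketch of round-tripping one block through Snappy's C++ API; the 1 MiB dummy block is a placeholder for whatever unit your extraction works in, and it assumes the google/snappy library is installed (build with -lsnappy):

    // Minimal sketch: round-trip one block through Snappy.
    // Assumption: google/snappy is installed; build with -lsnappy.
    #include <snappy.h>
    #include <iostream>
    #include <string>

    int main() {
        // Placeholder for one block extracted from a large file.
        std::string block(1 << 20, 'x');            // 1 MiB, very compressible

        std::string compressed;
        snappy::Compress(block.data(), block.size(), &compressed);

        std::string restored;
        if (!snappy::Uncompress(compressed.data(), compressed.size(), &restored)) {
            std::cerr << "corrupt block\n";
            return 1;
        }
        std::cout << "compression ratio: "
                  << static_cast<double>(block.size()) / compressed.size() << "\n";
        return 0;
    }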

  2. which decompression algorithm is fit for multithreaded decompression?

You're asking the wrong question.

Ideally you would use LZMA (xz) or DEFLATE (gzip) once when storing a document, and then make the web clients worry about expanding those compressed bits. Much better to offer compressed data to a TCP channel.

If you want to burn all of your webserver's cores, then you will not only want multithreading (which the LZMA algorithm is definitely compatible with, even if one library implementation isn't), but you will want to consume only a modest number of cycles on decompression. Which is where algorithms like Snappy win; the compression ratio isn't as high as possible, but that's balanced against decompression needing very little effort.

If you're still set on burning all cores with xz decompression, well, there's nothing stopping you from asking nginx to fork off N worker processes for N cores. The OP didn't mention anything that strongly motivates threads over processes.

  3. Is it a good idea to use mmap() to map the whole file into memory?

You're asking the wrong question. You should first try to characterize your production workload, in terms of hot versus cold stored data, and elephants (giant files) versus mice (small ones). How much RAM would you need to buy for all the hot spots to be memory resident? And would the business write the checks for that? Maybe outsourcing the cache eviction strategy to the FreeBSD or Linux kernel is the right thing to do. Or maybe your business logic knows things that the kernel can't, and so the app should be responsible for some of the eviction policy. What cache hit ratios are you currently experiencing, and what would you pay in order to see them improve?

Then ask if you should even be concerned with such details, or whether it makes sense to outsource the problem to Akamai or a similar edge hosting service that is already good at this.

Finally, for a given OS and version, implement both the mmap() and file handle approaches (perhaps in a micro-benchmark) and see which performs better under your workload.
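
As a starting point, a rough micro-benchmark skeleton along those lines might look like the sketch below (POSIX-only, error handling trimmed; the byte-summing loop merely stands in for your real extraction work, and remember that the page cache will favour whichever variant runs last unless you drop caches or repeat each run several times):

    // Rough micro-benchmark skeleton: read() vs mmap() over one file.
    // POSIX-only, error handling trimmed; byte-summing stands in for real work.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Variant A: plain file handle, read() into a reusable buffer.
    static uint64_t sum_with_read(const char* path) {
        int fd = open(path, O_RDONLY);
        std::vector<unsigned char> buf(1 << 20);          // 1 MiB chunks
        uint64_t sum = 0;
        ssize_t n;
        while ((n = read(fd, buf.data(), buf.size())) > 0)
            for (ssize_t i = 0; i < n; ++i) sum += buf[i];
        close(fd);
        return sum;
    }

    // Variant B: map the whole file and touch the pages directly.
    static uint64_t sum_with_mmap(const char* path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        auto* p = static_cast<unsigned char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        uint64_t sum = 0;
        for (off_t i = 0; i < st.st_size; ++i) sum += p[i];
        munmap(p, st.st_size);
        close(fd);
        return sum;
    }

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        auto time_it = [&](uint64_t (*fn)(const char*)) {
            auto t0 = std::chrono::steady_clock::now();
            volatile uint64_t s = fn(argv[1]);
            (void)s;
            return std::chrono::duration<double>(
                std::chrono::steady_clock::now() - t0).count();
        };
        // Run each variant several times; the page cache favours later runs.
        std::printf("read(): %.3f s   mmap(): %.3f s\n",
                    time_it(sum_with_read), time_it(sum_with_mmap));
        return 0;
    }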

1
  • Thanks for the input about Snappy, I just checked it out and it looks promising! I guess my question is not so much about the business cost of implementing an idea, but about getting an overview of the ideas that are in use right now in other industries. Our team will decide this once we're done evaluating.
    – glades
    Commented Jan 22 at 6:01
4

I assume you are a junior programmer because there are a lot of red flags in your question that show you don't know what you are doing. You need to profile first.

In particular, you need to find out:

  • how much processing is done CPU-wise,
  • how much I/O is done per request,
  • what the access pattern to the files is (random or sequential),
  • which data is "hot", as there are lots of caches involved,
  • how many requests you have,
  • what process/threading structure you have for the requests,
  • whether you want to optimize latency or throughput.

Only then start to think about the details.

Just one comment about mmap: it's not for performance, and because it loads pages on page faults instead of prefetching (unless optimized and hinted) it is always slower. BUT it will help to use the disk cache efficiently in a system that runs multiple server processes. That's why databases use mmap.
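
To make the "optimized and hinted" part concrete, here is a small sketch using posix_madvise() on a read-only mapping; the helper name map_for_sequential_scan is made up for illustration, and the hints are purely advisory, so the kernel is free to ignore them:

    // Sketch of the "optimized and hinted" case: map read-only and ask the
    // kernel to read ahead. Hints are advisory; the kernel may ignore them.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    void* map_for_sequential_scan(const char* path, size_t* len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        fstat(fd, &st);
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                      // the mapping keeps the file referenced
        if (p == MAP_FAILED) return nullptr;

        // Front-to-back access: let the VM read ahead instead of taking a
        // page fault for every 4 KiB page, and start paging in right away.
        posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);
        posix_madvise(p, st.st_size, POSIX_MADV_WILLNEED);

        *len_out = static_cast<size_t>(st.st_size);
        return p;
    }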

1
  • Very valuable questions, thank you! I can't really answer them all, as our software is planned to move to a more microservice-oriented architecture, which will break a lot of the assumptions we have right now, especially about the distribution of memory bus / processing throughput. I wanted to make sure we're considering the right things from the start.
    – glades
    Commented Jan 22 at 6:06
3

First:

  • Set goals

  • Profile

  • Identify bottlenecks

  • Decide what to do

Is it possible to keep the memory bus from becoming the bottleneck by using (lightly) compressed data on reads?

If by "memory bus" you mean RAM, no.

However, if your bottleneck is disk read throughput, compression can alleviate that, simply because the disks will need to read less data once it is compressed.

Compression also lets the OS disk cache hold more actual data.

Basically, the size of your cache and your disk throughput are multiplied by the compression ratio.

You need compression algorithms that can decompress faster than the disks can read: zstd, snappy, lz4 for example.
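
A quick way to check that on your own hardware is a one-shot zstd round trip like the sketch below; the zero-filled buffer is only a placeholder, so substitute a real sample of your data before trusting the numbers (build with -lzstd):

    // Sanity check: can zstd decompress faster than your disks can read?
    // The zero-filled buffer is only a placeholder; use a real data sample.
    #include <zstd.h>
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::string input(64 << 20, '\0');           // 64 MiB placeholder

        std::vector<char> comp(ZSTD_compressBound(input.size()));
        size_t csize = ZSTD_compress(comp.data(), comp.size(),
                                     input.data(), input.size(), /*level=*/3);
        if (ZSTD_isError(csize)) return 1;

        std::vector<char> out(input.size());
        auto t0 = std::chrono::steady_clock::now();
        size_t dsize = ZSTD_decompress(out.data(), out.size(),
                                       comp.data(), csize);
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        if (ZSTD_isError(dsize)) return 1;

        std::printf("ratio %.1fx, decompression %.0f MB/s\n",
                    double(input.size()) / csize, input.size() / 1e6 / secs);
        return 0;
    }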

There are drawbacks though.

Your files will likely be compressed in blocks. For example, each file is sliced into 64 kB blocks, and each block is compressed independently. This means that if you need to do a random access to read one 4 kB page, the entire block has to be decompressed. This makes random reads slower.

Random writes will also require decompression and re-compression of an entire block. This takes time and results in fragmentation.

Thus there is a compromise: large blocks compress better, but they slow down writes and small random reads.

Basically the ideal use case is large sequential reads on read-only data.

Note there is no such thing as "multithreaded decompression". All data compression algos are sequential. If your file is compressed block by block, then decompressing each block can be parallelized. But... if the blocks are too large, then maybe you won't have enough blocks to parallelize efficiently. That's another tradeoff.
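
For illustration, here is a sketch of that per-block parallelism, assuming every block was compressed independently (Snappy here) and simply partitioning the blocks statically over a handful of threads; how the block boundaries are stored is up to your file format:

    // Sketch: decompress independently compressed blocks in parallel.
    // Assumes each block was compressed on its own (Snappy here), so blocks
    // have no dependencies. Build with -lsnappy.
    #include <snappy.h>
    #include <string>
    #include <thread>
    #include <vector>

    std::vector<std::string> decompress_blocks(
            const std::vector<std::string>& compressed,
            unsigned n_threads = std::thread::hardware_concurrency()) {
        if (n_threads == 0) n_threads = 1;
        std::vector<std::string> out(compressed.size());
        std::vector<std::thread> workers;

        // Static partitioning: thread t takes blocks t, t+n, t+2n, ...
        for (unsigned t = 0; t < n_threads; ++t) {
            workers.emplace_back([&, t] {
                for (size_t i = t; i < compressed.size(); i += n_threads) {
                    snappy::Uncompress(compressed[i].data(),
                                       compressed[i].size(), &out[i]);
                }
            });
        }
        for (auto& w : workers) w.join();
        return out;
    }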

I would avoid reinventing the wheel, and instead use a filesystem that supports compression, like btrfs or zfs.

However, there is not much information in your question. You don't even say if your data can be compressed with a good ratio or not, so you can't make an informed decision...

Another thing to consider would be using a specialized "data warehouse" type database with columnar storage, like ClickHouse. I'll give an example.

Say you store time series data with columns like mqtt_topic, timestamp, value. You can store that in JSON format in a text file and compress it; that should be quite effective, and you can expect around a 2x compression ratio. Maybe better, if topic strings take a lot of space and repeat often enough that the compressor can do some good work on them. Timestamps in ISO format also compress well: if rows are inserted in time order, most characters in the timestamp are probably the same as in the previous row. Labels in the JSON object are the same on each row, so those compress well too. Numeric data in text form usually has high entropy, so it doesn't compress well.

I'm explaining that to highlight how to "eyeball" how much you'll save with compression.
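
If you'd rather measure than eyeball, a small sketch like the following compresses some generated JSON-ish rows with zstd and prints the ratio; the row format is invented purely for illustration, so feed it a sample of your real files instead (build with -lzstd):

    // Measuring instead of eyeballing: compress generated JSON-ish rows with
    // zstd and print the ratio. The row format is invented for illustration.
    #include <zstd.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::string rows;
        for (int i = 0; i < 100000; ++i) {
            char line[160];
            std::snprintf(line, sizeof line,
                "{\"mqtt_topic\":\"plant/line4/sensor%02d\","
                "\"timestamp\":\"2025-01-22T16:%02d:%02d\",\"value\":%.3f}\n",
                i % 16, (i / 60) % 60, i % 60, 20.0 + (i % 977) * 0.013);
            rows += line;
        }

        std::vector<char> comp(ZSTD_compressBound(rows.size()));
        size_t csize = ZSTD_compress(comp.data(), comp.size(),
                                     rows.data(), rows.size(), 3);
        if (ZSTD_isError(csize)) return 1;

        std::printf("%.1fx compression on %zu bytes of sample rows\n",
                    double(rows.size()) / csize, rows.size());
        return 0;
    }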

But if you put it in a columnar database, you can use several tricks to organize your data, like partitioning and ordering, which open up opportunities for better compression.

I'm getting a 12x compression ratio on my time series data in ClickHouse, for example. Also, the data is already stored in the database, so contrary to the JSON example it doesn't have to be parsed.

If your flat files contain binary data (not text, not JSON) it will be much faster to parse than text, but usually it won't compress well, because there is higher entropy and much less repetition. That's why column-storage databases rearrange the data to allow tricks like delta and prefix encoding.
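
As a concrete example of such a rearrangement, here is a sketch of delta-encoding a sorted timestamp column before handing it to a compressor; the monotonic values turn into small, repetitive deltas that a generic compressor handles much better:

    // Sketch: delta-encode a sorted timestamp column before compressing it.
    // Monotonic timestamps turn into small, repetitive deltas, which generic
    // compressors (zstd, lz4, ...) handle far better than the raw values.
    #include <cstdint>
    #include <vector>

    std::vector<int64_t> delta_encode(const std::vector<int64_t>& ts) {
        std::vector<int64_t> deltas(ts.size());
        int64_t prev = 0;
        for (size_t i = 0; i < ts.size(); ++i) {
            deltas[i] = ts[i] - prev;   // first entry keeps its absolute value
            prev = ts[i];
        }
        return deltas;
    }

    std::vector<int64_t> delta_decode(const std::vector<int64_t>& deltas) {
        std::vector<int64_t> ts(deltas.size());
        int64_t acc = 0;
        for (size_t i = 0; i < deltas.size(); ++i) {
            acc += deltas[i];
            ts[i] = acc;
        }
        return ts;
    }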

Again, no way to say if this would be helpful in your case without knowing more.

4
  • We're using binary time series data, where there are lots of zeros. I'll have to measure what compression I get with different algorithms; I'm always happy for any hints in that direction. I'll investigate the compression filesystems you mentioned, these seem promising.
    – glades
    Commented Jan 23 at 7:13
  • I investigated xfs and zufs, and it seems we're already using the right thing: xfs supports fast I/O operations on huge files using multiple threads. As an educated guess, I think we're best off using compression algorithms together with fast I/O.
    – glades
    Commented Jan 23 at 7:45
  • For time series, try ClickHouse. Much easier than reinventing the wheel. But if you insist on reinventing the wheel, and your time series have lots of zeros, try transposing the blocks, i.e. store them in column-major order. That way all values for one column are contiguous in a block, and a long string of zeros will compress very well.
    – bobflux
    Commented Jan 23 at 12:30
  • Good thinking, I think we already do that though. But it's worth keeping in mind to make it consistent for the whole loading operation!
    – glades
    Commented Jan 24 at 7:44
3

Read files once, put all useful information in a database of your choice (consider storing the file timestamp or other data identification along with the information). Use (indexed and cached) database queries to respond to users.

Basically, do not reinvent the wheel - the situation you describe matches exactly a pattern of accessing a large database, which is a solved problem.

1

mmap vs read/write: Measure it. It depends on the implementations. And these files, how large is “large”? Any chance everything fits into RAM?

With the right methods, decompression can be faster than reading the data. But you should also check the cost of replacing your storage with a faster one, which would save you a lot of development work.

And consider if there is a way to pre-process the data to make actual extraction faster.

4
  • Never seen a performance benchmark where mmap wins for I/O.
    – Lothar
    Commented Jan 18 at 18:25
  • Some should fit into RAM; we have huge amounts of it. The files are already preprocessed; they're basically binary dumps of the memory content, so there should be almost no processing involved if I were not to compress them.
    – glades
    Commented Jan 22 at 6:07
  • @Lothar I have actually seen benchmarks in which mmap is faster (here, and this one seems to indicate that it depends on file size). I have also seen some YouTube talks about the drawbacks of mmap, and it seems to just be more erratic and unreliable than read(). However, it seems that for random access it will be faster, and hence I wonder what happens if you do a multithreaded read on several NUMA cores. Will have to measure...
    – glades
    Commented Jan 22 at 6:25
  • @glades Those look to be isolated performance tests likely run on otherwise-quiescent systems. Which isn't going to be happening in the real world. There's only one VM system doing virtual-to-physical mappings, and on a quiescent system the benchmark process has that pretty much all to itself. On a real-world production system mmap() calls will be contending with everything else going on to get their virtual-to-physical mappings done.
    Commented Jan 23 at 12:45
