
I read here and there that a small stripe size is bad for software (and maybe hardware) RAID 5 and 6 on Linux. The rare benchmarks I have seen fully agree with that.
But the explanation everybody gives is that it induces more head movements, and I just don't understand how a small stripe size leads to more head movements.

Let's say we have a RAID 6 setup with 4 local SAS drives.

Case 1: we write 1 GB of sequential data.
The program asks the kernel to write the data; the kernel then divides it to match the stripe size and computes each chunk (data and/or parity) to be written to each disk.
The kernel is able to write to the 4 disks at the same time (with a proper disk controller).
If the written data is not fully aligned with the stripes, the kernel only has to read the first and last stripes before computing the result. All other stripes are simply overwritten, with no need to care about their previous content.
Because this computation is done much faster than the disks' throughput, each chunk is written right after the previous one on each disk, without pause. So this is basically a sequential write across 4 disks.
How could a small stripe size slow this down?
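
To make the arithmetic concrete, here is a rough sketch of case 1 in Python (the 512 kB chunk size is an assumption; only the layout of 2 data + 2 parity chunks per stripe follows from RAID 6 on 4 disks):

    # Rough model of case 1: a 1 GB sequential write on 4-disk RAID 6.
    # The 512 kB chunk size is an assumed value, not a requirement.
    CHUNK = 512 * 1024                 # chunk written to each disk, bytes
    DATA_DISKS = 4 - 2                 # RAID 6 on 4 disks: 2 data + 2 parity
    STRIPE_DATA = CHUNK * DATA_DISKS   # user data carried by one full stripe

    write_size = 1024**3               # 1 GB sequential write

    full_stripes, tail = divmod(write_size, STRIPE_DATA)

    # Assuming the write starts on a stripe boundary, only a partial tail
    # would need a read-modify-write; every full stripe is written blindly.
    print(f"user data per full stripe     : {STRIPE_DATA // 1024} kB")
    print(f"full-stripe (no-read) writes  : {full_stripes}")
    print(f"partial stripes needing a read: {1 if tail else 0}")

With these numbers the whole 1 GB is covered by full-stripe writes, which is why the stripe size should barely matter for this case.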

Case 2: we write 1,000,000 × 1 kB of data at random places.
1 kB is smaller than the stripe size (a common stripe size today is 512 kB).
The program asks the kernel to write some data, then some other data, and so on. For each write, the kernel has to read the current data on disk, compute the new content, and write it back. Then the heads move elsewhere and the operation is repeated 999,999 more times.
The smaller the stripe size, the faster each read/compute/write cycle finishes. Ideally a stripe size of 4 kB should be the best with modern disks (if correctly aligned).

So once again, how could a small stripe size slow this down?
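
As a worked example of the read/compute/write cycle described above, here is a small sketch (the chunk sizes and the "read the whole stripe, write back data plus parity" model are taken from the reasoning in this question, not from any particular md code path):

    # I/O moved per 1 kB random write under the read/compute/write model
    # described above (read every chunk of the stripe, write back the
    # modified data chunk plus both parity chunks). Illustrative only.
    def io_per_small_write(chunk, disks=4, parity=2):
        read_bytes = chunk * disks              # one chunk from each disk
        written_bytes = chunk * (1 + parity)    # new data chunk + parities
        return read_bytes, written_bytes

    for chunk in (4 * 1024, 64 * 1024, 512 * 1024):
        read, written = io_per_small_write(chunk)
        amplification = (read + written) / 1024   # vs. the 1 kB user write
        print(f"chunk {chunk // 1024:>4} kB: read {read // 1024:>5} kB, "
              f"write {written // 1024:>5} kB, {amplification:.0f}x the user data")

Under this model a 4 kB chunk moves roughly two orders of magnitude less data per random 1 kB write than a 512 kB chunk, which is the intuition behind the question.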

3 Answers


I am talking about Linux software RAID. When you look into the code, you see that the md driver is not fully optimized: when multiple contiguous requests are made, the md driver doesn't merge them into a bigger one. This leads to massive overhead in some common situations.

Big reads or writes are optimized: they are cut down into several requests equal to the stripe size and handled optimally.

If the read or write spans 2 stripes, the md driver does the job correctly: everything is handled in one operation.

With small reads there is no problem, because the data is in the kernel cache after the first read. So several contiguous reads only generate a small overhead in CPU and memory compared to the slow disk bandwidth.
For example, if I read 1 GB of data 100 bytes at a time, the kernel will first turn that into a 512 kB read, because that is the minimum I/O size (if the stripe size is 512 kB). So the next 100 bytes will already be in the kernel cache. This is exactly the same as reading from a non-RAID partition.
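
A toy simulation of that read pattern (all sizes are the ones assumed in this answer):

    # Toy model of the read path above: the application reads 100 bytes at
    # a time, the device is always read in 512 kB units, and a unit stays
    # in the page cache after its first read.
    IO_UNIT = 512 * 1024
    TOTAL = 1024**3          # 1 GB read 100 bytes at a time
    REQ = 100

    cached = set()
    device_reads = 0
    offset = 0
    while offset < TOTAL:
        unit = offset // IO_UNIT
        if unit not in cached:        # only the first touch hits the disk
            cached.add(unit)
            device_reads += 1
        offset += REQ

    print(f"application read requests: {TOTAL // REQ}")
    print(f"device reads: {device_reads} x {IO_UNIT // 1024} kB")

Roughly 10.7 million application reads collapse into 2048 device reads of 512 kB each, exactly as they would on a plain partition.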

With writes smaller than the stripe size, the md driver first reads the full stripe into memory, then overwrites it in memory with the new data, then computes the result if parity is used (mostly RAID 5 and 6), then writes it to the disks.
For example, if I write 1 GB of data 100 bytes at a time, the kernel will first read the 512 kB stripe, overwrite the required parts in memory, compute the result if parity is involved, then write it to the disks. When writing the next 100 bytes, only the "read the 512 kB stripe" step is avoided, because the data is already in the kernel cache. So we have a small overhead for overwriting in memory and computing parity, but a huge overhead because the data is written again and again to the same stripe. The kernel code here is not optimized.
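
To put numbers on that write amplification (this assumes the worst case described above, where the stripe is flushed for every request; real behaviour depends on how much caching coalesces the writes):

    # Device I/O for writing 1 GB in 100-byte requests, comparing the
    # "write the stripe again for every request" behaviour described
    # above with an ideal writer that flushes each stripe once.
    STRIPE = 512 * 1024
    TOTAL = 1024**3
    REQ = 100

    requests = TOTAL // REQ
    uncoalesced = requests * STRIPE            # stripe rewritten per request
    coalesced = (TOTAL // STRIPE) * STRIPE     # each stripe written once

    print(f"user data written      : {TOTAL // 2**20} MB")
    print(f"uncoalesced device I/O : {uncoalesced // 2**30} GB")
    print(f"coalesced device I/O   : {coalesced // 2**20} MB")
    print(f"write amplification    : {uncoalesced // TOTAL}x")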

I haven't dug deep enough to understand why those repeated writes are not correctly cached, with the data flushed to disk only after several seconds (so only once per stripe). If they were cached, the overhead would only be some CPU and memory, but my own benchmarks show that CPU stays under 10% and that I/O is the bottleneck.

If the writes were optimized, then the minimum stripe size would always be the best: RAID 6 with 4 disks with 4 kB sectors would lead to 8 kB stripes, and that would give the best read and write throughput for every possible load.
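
For reference, that "minimum stripe" is tiny (the 4 kB chunk matching the 4k sector size is the assumption made above):

    # RAID 6 on 4 disks leaves 2 data chunks per stripe, so a 4 kB chunk
    # gives 8 kB of user data per full stripe; any aligned write that is
    # a multiple of 8 kB would then need no read at all.
    disks, parity, chunk_kb = 4, 2, 4
    print(f"full-stripe data size: {(disks - parity) * chunk_kb} kB")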

  • This answer is completely wrong. MD not merging requests has nothing to do with it not being optimized. It's not MD's job to merge small requests into bigger ones. Linux specifically has a component of the block device driver whose job is to do exactly that. It would be stupid and unoptimized for MD to repeat what other parts of the kernel are already doing, especially when it is in a worse position to do so.
    – Circus Cat
    Commented Jun 19, 2015 at 9:32
  • Small writes can't be optimized.
    – Skaperen
    Commented Jun 21, 2015 at 8:44
  • @CircusCat Those optimisations in the block device driver are not perfect for large mdadm chunk sizes: I created a RAID5 with 4 disks and chunk size 4M and copied 37.5 GB from /dev/zero to the md device (dd with bs=384k count=102400). Expected behaviour: No data is read from the disks. Observed behaviour: Data being read from all disks at heavily fluctuating speeds up to 10MB/s (while writing at 110MB/s) on each disk. Commented May 21 at 11:42
  • Is it true that the same issue exists for both read and write? Documentation from Red Hat suggests performance is asymmetric, with much faster reads than writes. This would hint that parity is not checked for reading and only checked when scrubbing. Commented Jul 9 at 6:59

As far as I know, the issue has never had anything to do with head movements; it is simply down to more overhead. For a given sequential read or write, a 4 KB stripe size results in sixteen times more operations than a 64 KB stripe size: more CPU time, more memory bandwidth, more context switches, more I/Os, more work for the kernel I/O scheduler, more merges to compute, and so forth, and ultimately more latency per I/O.

Remember that a lot of applications issue I/Os with a queue depth of 1, so you may not always be able to merge sixteen 4 KB sequential requests into one 64 KB request to the disk.

Also if you look at a typical ATTO disk benchmark such as this one:

[ATTO disk benchmark chart: throughput vs. block size]

You can see that the disk cannot even read sequentially at full speed until the reads are done in blocks of 128KB or larger.
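
The shape of that curve falls out of a simple per-request-overhead model (the 0.2 ms fixed cost and 150 MB/s streaming speed below are made-up illustrative numbers, not measurements of any particular drive):

    # Effective sequential throughput vs. request size when every request
    # pays a fixed overhead on top of the raw transfer time.
    OVERHEAD_S = 0.0002        # assumed fixed cost per request (0.2 ms)
    BANDWIDTH = 150e6          # assumed streaming speed, bytes/s

    for size_kb in (4, 16, 64, 128, 512, 1024):
        size = size_kb * 1024
        per_request = OVERHEAD_S + size / BANDWIDTH
        print(f"{size_kb:>5} kB requests: {size / per_request / 1e6:6.1f} MB/s")

With those assumptions, 4 kB requests reach only about 18 MB/s while 128 kB requests get within roughly 80% of the drive's streaming speed, which is the same qualitative picture the ATTO chart shows.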

Tom's Hardware has a fairly comprehensive review of the effects of stripe size here:

http://www.tomshardware.com/reviews/RAID-SCALING-CHARTS,1735.html

  • I was going to write something similar, but slightly different. The thing to focus on is how many disk writes happen per request. You want there to be one, which suggests increasing the stripe size so that this covers more cases.
    – rocky
    Commented Jun 15, 2015 at 12:20
  • The major problem with large stripe sizes is that there is a huge penalty for write operations smaller than the stripe size, as the stripe needs to be read, modified, and rewritten in its entirety - at least logically, since some of the actual physical I/O operations can be optimized out or handled in cache. You need to match your stripe size to your expected I/O pattern. Random 4 kB writes will have horrendous performance on a RAID 6 array with a 1 MB stripe size, for example, while sequential 4 kB writes may work just fine on the same array because of write coalescing and caching. Commented Jun 16, 2015 at 22:53
  • With stripe sizes even larger than 1 MB (I tested mdadm RAID 5 chunk sizes from 64 KB to 4 MB with 4 disks, i.e. stripe sizes from 192 KB to 12 MB), I also observed substantial read activity when writing large files (presumably to update parity), up to a quarter of the write speed (peak over 12 s), usually around 1% of the write speed. Commented May 21 at 11:26

As with all things, there's a happy medium. But I would suggest having a look at RAID 2 and RAID 3 - both types that are rarely used - to get an understanding of the nature of the problem.

However, it basically boils down to the latency of I/O versus concurrent data transfer. Every read I/O operation carries an overhead of several milliseconds for the heads to seek and the drive to rotate.

If we have larger chunks of data, we pay this penalty less often. It's much like a cruder form of prefetching: because of this overhead, it's generally a good idea to prefetch multiple chunks of data when one is requested, simply because it's statistically likely that you'll need them anyway.

But primarily, this is a performance-tuning exercise rather than a hard rule: you should set your chunk size based on whatever workload you're sending to the disk. If your workload is mixed or random, that becomes increasingly hard to do. Larger chunks mean more throughput with fewer I/O operations, and since it's usually the number of I/O operations that limits your drive speed, it is usually beneficial to issue larger requests.
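
The same kind of back-of-the-envelope model shows how badly per-I/O latency dominates for small random chunks (the 12 ms combined seek-plus-rotation figure and the 150 MB/s transfer rate are illustrative assumptions for a single spinning disk):

    # Random-access throughput when every chunk costs a seek plus rotation
    # before the data can be transferred.
    ACCESS_S = 0.012           # assumed seek + rotational latency, seconds
    BANDWIDTH = 150e6          # assumed media transfer rate, bytes/s

    for chunk_kb in (4, 64, 256, 1024):
        chunk = chunk_kb * 1024
        service_time = ACCESS_S + chunk / BANDWIDTH
        print(f"{chunk_kb:>5} kB chunks: {chunk / service_time / 1e6:5.1f} MB/s")

Under those assumptions, random 4 kB accesses deliver well under 1 MB/s, while 1 MB chunks exceed 50 MB/s, simply because the fixed access cost is paid far less often per byte.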

For specific use cases (like databases!) this may well not apply though.
