
Am I correct that chunk size in the context of RAID is essentially the same thing as a cluster in the file-system context? In other words, is the chunk size the smallest unit of data that can be written to a member of a RAID array? For example, if I have a chunk size of 64 KiB, I need to write a 4 KiB file, and the cluster size of the file system is also 4 KiB, is it true that I will use one 64 KiB chunk and basically waste 60 KiB?

3 Answers

6

Given that chunks can be quite big and that the parity information is a simple XOR (i.e. it does not affect data before or after the piece in question), the assumption that only complete chunks can be written does not make sense to me.

Chunks are the unit in which data is spread over the volumes: one chunk of contiguous data is written to one volume, the next chunk to another.

Both with file systems and with RAID this is an optimization issue: in a file system, blocks/clusters that are too small cause metadata overhead, while blocks that are too big waste space (as most file systems can use a given block for only a single file).

With RAID it is similar: if you have tiny chunks, then even very small files (or other data) require accesses to several disks. In most cases the extra latency of whichever drive happens to be slowest for that access costs more time than reading everything from one drive alone would. This does not hold for SSDs, but they are not the dominant technology for RAID.

If you have very big chunks, then even accesses that could clearly be sped up by spreading them over several drives go to only one drive.
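
A rough way to see that tradeoff is to count how many member drives a single contiguous read touches for a given chunk size. The following Python sketch is an idealization, not a description of any particular RAID implementation: it assumes plain striping (RAID-0 style, no parity) and reads that start exactly on a chunk boundary.

    # Idealized: plain striping, reads aligned to a chunk boundary, no parity.
    import math

    def drives_touched(read_size, chunk_size, n_drives):
        return min(math.ceil(read_size / chunk_size), n_drives)

    KiB = 1024
    for chunk in (4 * KiB, 64 * KiB, 1024 * KiB):
        small = drives_touched(8 * KiB, chunk, 4)
        large = drives_touched(2048 * KiB, chunk, 4)
        print(f"chunk {chunk // KiB:>4} KiB: 8 KiB read hits {small} drive(s), "
              f"2 MiB read hits {large} drive(s)")
    # chunk    4 KiB: 8 KiB read hits 2 drive(s), 2 MiB read hits 4 drive(s)
    # chunk   64 KiB: 8 KiB read hits 1 drive(s), 2 MiB read hits 4 drive(s)
    # chunk 1024 KiB: 8 KiB read hits 1 drive(s), 2 MiB read hits 2 drive(s)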

1

The answer to the OP's question is: Yes. In a RAID, a "chunk" is the minimum amount of data read or written to each data disk in the array during a single read/write operation.

In your example, you won't necessarily "waste" 60 KiB, as you put it. That depends on the combination of the file system structure and the underlying RAID structure. However, you are raising a very important point, which is that it is ideal if the file system configuration aligns with the RAID configuration with regard to their units of storage.

Continuing with your hypothetical example: if you had a RAID with a 64-KiB chunk size and an overlaid file system using 64-KiB block sizes, then yes, a 4-KiB file would use an entire 64-KiB area of storage space in the file system all by itself, and at the same time it would eat up an entire 64-KiB chunk in the RAID all by itself. However, that would be because the file system was set up with 64-KiB blocks. Those blocks are the smallest unit of storage in that file system, and any file smaller than a block still uses one full block of file system storage space.

My point is that your second question is actually about the file system, not the RAID.

Continuing with my example above, if your RAID used 16-KiB chunks and your file system on top of the RAID used 64-KiB blocks, then each block written to the RAID would require 4 of those chunks (64/16=4).

Now reverse that thought process. What if you had 64-KiB RAID chunks and 16-KiB file system blocks? Now each file system block uses only a quarter of a RAID chunk. That means 1) your 4-KiB file takes up 16 KiB in the file system; and 2) the RAID will perform a read/modify/write operation when writing that 4-KiB file / 16-KiB block, because the RAID's smallest unit of storage is 64 KiB. So your file system is more efficient, but your RAID is less efficient (for that particular file operation).
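
To put rough numbers on both directions, here is a small Python sketch. It is purely illustrative: the sizes are the hypothetical ones from this answer, and the read/modify/write check is a simplification that ignores stripe alignment and parity details.

    # Illustrative only: hypothetical sizes from the answer above; the RMW check
    # ignores stripe alignment and parity details.
    KiB = 1024

    def fs_space_used(file_size, fs_block):
        """Space a file occupies in the file system (whole blocks only)."""
        blocks = -(-file_size // fs_block)   # ceiling division
        return blocks * fs_block

    # Case 1: 64-KiB file-system blocks on a RAID with 16-KiB chunks.
    print(fs_space_used(4 * KiB, 64 * KiB) // KiB, "KiB used for a 4-KiB file")  # 64
    print((64 * KiB) // (16 * KiB), "chunks needed per block")                   # 4

    # Case 2 (reversed): 16-KiB file-system blocks on a RAID with 64-KiB chunks.
    print(fs_space_used(4 * KiB, 16 * KiB) // KiB, "KiB used for a 4-KiB file")  # 16
    partial = (16 * KiB) % (64 * KiB) != 0
    print("read/modify/write likely" if partial else "whole chunks written")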

1

The chunk size only defines where the bytes are located. A 512 KB chunk size doesn't require the system to write, say, 512 KB for every 4 KB write, or to read 512 KB of device surface for a 4 KB application read.

The chunk size is basically a tradeoff between the device's first-byte latency and the device's bandwidth for a continuous read after the first byte has arrived. For an HDD the first-byte latency is really bad, and trying to keep multiple devices in sync is so slow that you cannot benefit from a small chunk size.

Imagine a situation where you have a 4 KB chunk size and a 4-device RAID-0 setup. If an app opens a file and does a 16 KB read, the system has to move the read head of every device in the RAID near the start of the file and read data. Because the rotation of the platters is not synchronized between separate HDDs, you should assume the disks are offset by 180 degrees on average, so getting every device to read the same location requires worst-case HDD latency: at least one of the devices will always have a bad offset between the current platter position and the data position, so that device must wait nearly a full rotation for the correct platter position to come under the read head. This is also why enterprise HDDs have such high rotation speeds, but nowadays an SSD is typically already cheaper than a high-RPM HDD.

A single HDD can typically deliver about 150 MB/s, i.e. 16 KB in roughly 0.1 ms. The typical read-head latency for an HDD is in the range of 6–11 ms.

Trying to move the read heads on multiple devices doesn't make sense for short reads, because reading the rest of the file after getting the first byte is practically free. On the other hand, a SATA 6 Gbit/s SSD may have a first-byte latency in the range 0.1–1.0 ms and a bandwidth around 400–500 MB/s, so the situation is quite different there.
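
As a back-of-the-envelope check of those figures (the latency and throughput numbers below are assumed round values, not measurements), compare the time to transfer 16 KB from one device with its first-byte latency:

    # Assumed round figures, not measurements.
    def transfer_ms(size_bytes, throughput_mb_s):
        return size_bytes / (throughput_mb_s * 1e6) * 1e3

    for name, latency_ms, mb_s in [("HDD", 8.0, 150.0), ("SATA SSD", 0.3, 450.0)]:
        xfer = transfer_ms(16 * 1024, mb_s)
        print(f"{name}: first-byte latency ~{latency_ms} ms, 16 KB transfer ~{xfer:.2f} ms")
    # HDD: the transfer (~0.11 ms) is tiny compared with ~8 ms of latency, so
    # splitting a 16 KB read over four spindles saves almost nothing and risks
    # waiting on the slowest head.
    # SSD: the two figures are of the same order, so striping small reads can pay off.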

And it also makes a huge difference how much parallelism your workload has. If you have only one process or thread accessing the whole RAID setup at a time, using a smaller chunk size may improve throughput, but typically the chunk size should be set according to the following logic:

chunk_size = typical_first_byte_latency ⨯ single_device_throughput

That is, if your latency is 10 ms and your throughput is 150 MB/s (a typical HDD), the chunk size should be set around 1.5 MB (!) for optimal throughput. In practice the OS often prefers power-of-two sizes, so sensible options are 1 MB or 2 MB. It's typically safe to go with a smaller chunk size, so 1 MB should be used in practice.

On the other hand, a SATA SSD might have a typical latency around 0.2 ms and a throughput of 450 MB/s, in which case the optimal chunk size would be around 90 KB, or 64 KB in practice. Notice how critical the latency is here: if the typical latency is actually 0.1 ms, the optimal chunk size is already down to about 45 KB, or 32 KB in practice.
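
A minimal sketch of that rule of thumb, using the assumed latency and throughput figures above and rounding down to a power of two as the practical choice:

    # chunk_size = typical_first_byte_latency x single_device_throughput,
    # then rounded down to a power of two; example figures as assumed above.
    def suggested_chunk(latency_ms, throughput_mb_s):
        raw = (latency_ms / 1e3) * throughput_mb_s * 1e6   # bytes
        pow2 = 1
        while pow2 * 2 <= raw:
            pow2 *= 2
        return raw, pow2

    for name, lat, bw in [("HDD", 10.0, 150.0), ("SATA SSD", 0.2, 450.0)]:
        raw, pow2 = suggested_chunk(lat, bw)
        print(f"{name}: raw ~{raw / 1e6:.2f} MB -> use {pow2 // 1024} KiB")
    # HDD: raw ~1.50 MB -> use 1024 KiB
    # SATA SSD: raw ~0.09 MB -> use 64 KiB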

And note that for an SSD you have to consider the typical random 4 KB QD1 latency when choosing the chunk size, not the latency amortized over lots of reads under a QD32 load. This is because chunked data MUST be read from multiple devices if the chunk is smaller than the read operation, so the slowest of all the parallel operations limits the response time.
