
We just purchased a new 24-drive RAID array and an LSI 9285-8e RAID controller. We are seeing two things that seem strange to us.

  1. Write speeds are faster than read speeds (with either ext4 or xfs filesystems).

  2. There is a knee in the read speed: when the read size (with dd) is greater than 128 kB, performance drops by about 30%.

Here are the latest test results with a 512 kB RAID stripe size and an xfs filesystem:

dd bs=1024k if=junk of=/dev/null        => 9.11s = 1.4 GB/s
dd bs=512k if=junk of=/dev/null         => 9.38s = 1.3 GB/s
dd bs=256k if=junk of=/dev/null         => 9.78s = 1.3 GB/s
dd bs=128k if=junk of=/dev/null         => 7.03s = 1.8 GB/s
dd bs=64k if=junk of=/dev/null          => 6.77s = 1.9 GB/s
dd bs=32k if=junk of=/dev/null          => 6.79s = 1.9 GB/s
dd bs=16k if=junk of=/dev/null          => 6.49s = 1.9 GB/s
dd bs=8k if=junk of=/dev/null           => 6.91s = 1.8 GB/s
dd bs=4k if=junk of=/dev/null           => 6.46s = 1.9 GB/s

(FYI, for all sizes shown above, write speed is 2.2 GB/s)
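For anyone reproducing the sequence above, it can be scripted. A sketch: the file name junk matches the question, the creation step and the 16 MB size are only there so the script runs standalone (a real test file should be much larger than RAM), and dropping the page cache between passes needs root, so that line is skipped silently otherwise.

```shell
#!/bin/sh
# Sketch: repeat the sequential-read test across block sizes.
# Create a small stand-in file if "junk" is missing, so the
# script runs standalone; size it well beyond RAM for real tests.
[ -f junk ] || dd if=/dev/zero of=junk bs=1M count=16 2>/dev/null

for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1024k; do
    sync
    # Drop the page cache so each pass really reads from the disks
    # (needs root; silently skipped when not writable).
    [ -w /proc/sys/vm/drop_caches ] && echo 3 > /proc/sys/vm/drop_caches
    dd bs="$bs" if=junk of=/dev/null
done
```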

I'm currently using RAID0, but I had almost identical results with RAID6.

This is a freshly installed server; no other apps are running and there is no network connection causing interrupts. The install is OpenSuSE 11.4. We could do random-read tests, but since our intent is to stream video (e.g. 4K 3D, or 8K), we are really only worried about sequential reads.

Any ideas how to speed up the read speed?

  • What CPU are you using?
    – sblair
    Commented Nov 2, 2011 at 0:18
  • The CPU we are using is a Xeon X5687.
    – Donald McLachlan
    Commented Nov 2, 2011 at 12:34

1 Answer

The RAID card is specified (PDF) as having faster write speeds than read speeds, so nothing unusual is happening there. My guess is that the 1GB of onboard cache memory is used as a write buffer to help smooth out any seeking delays caused by the relatively slow hard drives. But when reading, you will obviously always have to wait for the disks to get data.
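One way to probe how much caching contributes is a buffered-versus-direct comparison: iflag=direct asks GNU dd to bypass the OS page cache (the controller's onboard cache stays in the path, so any remaining write advantage would point at the card). A sketch using a hypothetical scratch file; direct I/O is not supported on every filesystem, hence the guard:

```shell
#!/bin/sh
# Compare buffered vs direct reads. Direct I/O (O_DIRECT) bypasses
# the OS page cache; the controller's onboard cache is still in play.
dd if=/dev/zero of=scratch bs=1M count=16 2>/dev/null

dd bs=1M if=scratch of=/dev/null               # buffered read
dd bs=1M if=scratch iflag=direct of=/dev/null \
    || echo "O_DIRECT not supported on this filesystem"

rm -f scratch
```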

As for the knee in the read speeds, this may be due to filling up a cache somewhere along the chain. Intel CPU L2 cache is typically 256 kB (per core), with a larger L3 cache shared between cores. Suppose the working set of the test (the dd executable plus whatever else the OS touches during the run) stays somewhere under about 128 kB. Then the tests with block sizes of 128 kB or less run largely unimpeded by the CPU, while a 256 kB block size may take a significant hit from the latency of frequent L3 cache (or main memory) lookups. It might be coincidence, but it fits with your test data.
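If the knee really is CPU/cache-related, it should survive even with the disks taken out of the picture. A quick check along those lines (a sketch; cached_junk is a throwaway file name, and the 16 MB size is only illustrative):

```shell
#!/bin/sh
# Sketch: re-run the block-size comparison on a file that is already
# in the page cache, so raw disk speed drops out of the comparison.
# If the ~128k knee persists here, it points at the CPU/memory side
# rather than at the RAID array.
dd if=/dev/zero of=cached_junk bs=1M count=16 2>/dev/null
cat cached_junk > /dev/null       # warm the page cache

for bs in 64k 128k 256k 1024k; do
    dd bs="$bs" if=cached_junk of=/dev/null
done

rm -f cached_junk
```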

  • Interesting idea. If that were the case, wouldn't it also limit dd reading from /dev/zero? When reading from /dev/zero, dd sees no such knee: dd bs=1024k if=/dev/zero of=/dev/null count=120000 gets 14.6 GB/s, and dd bs=128k if=/dev/zero of=/dev/null count=960000 gets 14.4 GB/s. If it matters, the CPU is a Xeon X5687.
    Commented Nov 2, 2011 at 13:05
  • @DonaldMcLachlan Perhaps setting the input file to /dev/zero, being a special file created by the OS, results in an optimisation. It might always point to a single byte of memory that equals 0x00, which never leaves (and never fills) the CPU cache. But with if=junk, there must be a real file called "junk" which has to be read from disk and streamed into the CPU cache during the transfer. Again, this may be a huge simplification, or just plain wrong. For example, there might also be a DMA device involved, rather than a full CPU core.
    – sblair
    Commented Nov 2, 2011 at 17:54
