
Summary:

I have set up a RAIDZ array of 4 HDDs with two SSD cache devices, and I am not getting the expected results in cache behavior and general performance. Also, some things don't seem to add up.

Background and configuration:

I am setting up an analysis workstation for research: Ryzen 7 1800X, 64GB ECC RAM, GTX 1080 Ti, Tesla K40 (thanks for that, NVIDIA). It is meant to be general purpose: there will be CPU and GPU computations, and some of the datasets consist of very big files (50-100 files, 10-30GB each). Due to parallelization, several of them will sometimes be accessed at the same time. Some jobs are RAM-intensive, but not all of them, so there are situations where ZFS will have plenty of RAM available, but not all the time (5-10GB for the 500GB l2arc described below would be fine, though).

I have 2 ✕ 2TB SSDs (Samsung 850 EVO) and 4 ✕ 8TB HDDs (WD Red). 3.5TB of the SSDs will be a RAID0; the remaining 2 ✕ 250GB may be used as cache for ZFS. For a first test, I have added them as two cache devices for a RAIDZ over the 4 HDDs.

Here is the layout:

# zpool status -v
[sudo] password for administrator: 
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME                                                 STATE     READ WRITE CKSUM
        data                                                 ONLINE       0     0     0
          raidz1-0                                           ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VJGSE7NX                ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VJGSDP4X                ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VJGSBYHX                ONLINE       0     0     0
            ata-WDC_WD80EFZX-68UW8N0_VJGSDDAX                ONLINE       0     0     0
        cache
          ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00789R-part1  ONLINE       0     0     0
          ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00792H-part1  ONLINE       0     0     0

Measurements and command outputs:

I generated a random file (to rule out compression effects), which was written with quite nice performance:

# dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=filename bs=1M count=100000 iflag=fullblock
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 199,795 s, 525 MB/s

Now, what I expected was that this file would end up in my cache (the l2arc, AFAIU) if it is accessed often. However, that doesn't really happen, at least not very efficiently:

# for i in 1 2 3 4; do dd if=filename of=/dev/null bs=1M iflag=fullblock; done
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 252,751 s, 415 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 215,769 s, 486 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 227,668 s, 461 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 224,469 s, 467 MB/s
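
For reference, a rough way to check whether these reads come from the ARC, the L2ARC, or the spinning disks would be to compare the ARC counters around one read pass. This is only a sketch and assumes the ZFS-on-Linux kstat interface:

# before: note hits (reads served from RAM), l2_hits (ARC misses served from the SSD cache)
# and l2_misses (reads that fell through to the HDDs)
grep -E '^(hits|misses|l2_hits|l2_misses) ' /proc/spl/kstat/zfs/arcstats

# one read pass of the test file
dd if=filename of=/dev/null bs=1M iflag=fullblock

# after: whichever counters grew tell you where the data actually came from
grep -E '^(hits|misses|l2_hits|l2_misses) ' /proc/spl/kstat/zfs/arcstats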

Questions:

  1. Why do I get lower read than write performance? Shouldn't write converge to the speed of 3 disks and read to the speed of 4 disks, like a RAID5?

  2. Why doesn't the l2arc kick in? After multiple reads with no other data being read, I would have expected read performance similar to the 1GB/s of the SSD RAID0.

  3. Why does zpool iostat report such low read bandwidth for the individual devices? I ran this multiple times (this is from the last run), and it was always similar. The four hard drives only add up to ~160MB/s, while dd reports more than 400MB/s:

# zpool iostat -v
                                                        capacity     operations    bandwidth
pool                                                 alloc   free   read  write   read  write
---------------------------------------------------  -----  -----  -----  -----  -----  -----
data                                                  136G  28,9T  1,31K    152   167M  14,9M
  raidz1                                              136G  28,9T  1,31K    152   167M  14,9M
    ata-WDC_WD80EFZX-68UW8N0_VJGSE7NX                    -      -    571     66  46,0M  5,18M
    ata-WDC_WD80EFZX-68UW8N0_VJGSDP4X                    -      -    445     59  44,9M  5,18M
    ata-WDC_WD80EFZX-68UW8N0_VJGSBYHX                    -      -    503     66  40,2M  5,18M
    ata-WDC_WD80EFZX-68UW8N0_VJGSDDAX                    -      -    419     62  39,4M  5,18M
cache                                                    -      -      -      -      -      -
  ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00789R-part1  34,0G   216G      1    266  8,23K  33,1M
  ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00792H-part1  34,0G   216G      1    266  7,80K  33,0M
---------------------------------------------------  -----  -----  -----  -----  -----  -----
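
One thing I am not sure about: as far as I understand, zpool iostat without an interval reports averages since the pool was imported, so a repeated sample might be more representative of the current load (the interval and count below are arbitrary):

# report per-device throughput every 5 seconds, 10 times, instead of the since-import average
zpool iostat -v data 5 10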

Is something fundamentally wrong here, or did I misunderstand something? Should I use part of the SSDs for the ZIL? I could also spare a few dozen GB from the OS M.2 SSD for that, at least if I can add it as an LVM device, since right now it is fully claimed by the Kubuntu installation. I have not done that yet because I understood this would only help with small, synced writes, which I don't really expect. Mostly, bigger data will be written back serially.

Why does cache look like a separate pool named cache, rather than something that belongs to the pool data? I used:

zpool add data cache [devices]

so it should belong to the data pool, shouldn't it?

  • Please show us the specific layout of your pool. Output of zpool status with the pool imported will do nicely.
    – user
    Commented May 16, 2017 at 12:23
  • I have added it. Out of curiosity, since I don't see anything new: what does it show about the layout that iostat does not? I am still confused that cache is not visibly associated with the data pool, even though the add command suggests it should be. But oh well, "data" is also written in the first line, so I guess all of this output would be duplicated for another pool?
    – user23563
    Commented May 16, 2017 at 12:27
  • Update regarding zpool iostat: iostat from the sysstat package ALSO reports those low values, while the KDE disk bandwidth widget reports realistic bandwidth ... strange.
    – user23563
    Commented May 17, 2017 at 14:07

1 Answer


RAIDZ1 performance vs. conventional RAID5

Why do I get lower read than write performance? Shouldn't write converge to the speed of 3 disks and read to the speed of 4 disks, like a RAID5?

See this thread on ServerFault:

RAIDZ with one parity drive will give you a single disk's IOPS performance, but n-1 times aggregate bandwidth of a single disk.

And this comment:

I have a significant amount of experience with this, and can confirm for you that in most situations, RAIDZ is NOT going to outperform the same number of disks thrown into a traditional RAID5/6 equivalent array.

Your disks can sustain about 145 MB/s sequentially, so with 3 data disks your theoretical result should be roughly 3 × 145 MB/s ≈ 435 MB/s. I would say that matches your measurements pretty closely.


L2ARC cache for sequential reads

Why doesn't the l2arc kick in? After multiple reads with no other data being read, I would have expected read performance similar to the 1GB/s of the SSD RAID0.

Have a look at this mailing list post:

Is ARC satisfying the caching needs?

and

Post by Marty Scholes: "Are some of the reads sequential? Sequential reads don't go to L2ARC."

So, your main reasons are:

  • Your (random) load is already served from ARC and L2ARC is not needed (because your data was always the same and can stay in ARC completely). The idea behind that: ARC is much faster than L2ARC (RAM vs. SSD), so ARC is always your first choice for reads; you only need L2ARC when your active data is too big for memory but random access on spinning disks is too slow.
  • Your benchmark was sequential in nature and thus not served from L2ARC. The idea behind that: sequential reads would poison the cache, because a single big file read would fill the cache completely and evict millions of small blocks from other users (ZFS is optimized for concurrent random access by many users), while not speeding up the first read at all. The second read would be sped up, but normally you do not read large files twice. Maybe you can modify the behavior with ZFS tunables (see the sketch below).
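
If you want to experiment, here is a rough sketch of what to look at on ZFS on Linux (the module parameters and the kstat path below are ZoL-specific; adjust for your platform). By default l2arc_noprefetch=1 keeps prefetched, i.e. sequential, reads out of the L2ARC, and l2arc_write_max throttles how fast the L2ARC fills, so even eligible data only trickles in:

# current setting: 1 = prefetched (sequential) reads are excluded from L2ARC
cat /sys/module/zfs/parameters/l2arc_noprefetch

# as root: also admit prefetched buffers into the L2ARC (revert with echo 1)
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

# how many bytes per fill cycle the L2ARC accepts (deliberately small by default)
cat /sys/module/zfs/parameters/l2arc_write_max

# watch whether the cache is actually being hit while re-running the read test
grep -E '^(l2_hits|l2_misses|l2_size) ' /proc/spl/kstat/zfs/arcstats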

Various questions

Should I use part of the SSDs for ZIL?

A separate SLOG device will only help for random synchronized writes, nothing else. Testing this is quite simple: set the sync property of your benchmark file system to disabled (zfs set sync=disabled pool/fs), then benchmark again. If your performance is suddenly great, you will benefit from a SLOG; if it does not change much, you won't.
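
As a sketch, assuming your benchmark file system is data/benchmark (substitute your actual dataset):

# disable synchronous write semantics for the test
zfs set sync=disabled data/benchmark

# ... re-run the write benchmark here ...

# restore the default behavior afterwards
zfs set sync=standard data/benchmark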

PS: Why does cache look like a pool named cache, not something that belongs to the pool data?

I think it is that way because those extra devices (spares, caches, slog devices) can also consist of multiple vdevs. For example, if you have a mirrored slog device, you would have the same three levels as with your normal disks (log - mirror - disk1/disk2).
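
Concretely, a mirrored log vdev could be added like this (the device names are made up), and it then shows up under its own top-level log label in zpool status, just like cache does:

# attach a mirrored SLOG to the pool; the two device paths are placeholders
zpool add data log mirror /dev/disk/by-id/ata-SSD_LOG_1 /dev/disk/by-id/ata-SSD_LOG_2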

  • Well, that doesn't exactly answer my questions. Still a great link, thanks! I was more surprised that writing was faster than reading; the ServerFault answer doesn't seem to differentiate there. But OK, the things that puzzle me more are: 1) Why does the l2arc seem to have no effect? I read the file a hundred times overnight. Now each cache drive has 50GB allocated, so it could be striped, but I still read at 450MB/s, although the SSDs could supply 1GB/s (and do, as an mdRAID0). 2) Why is the zpool iostat so low?
    – user23563
    Commented May 17, 2017 at 11:02
  • @user23563 Please see my updated answer, I hope it is helpful.
    – user121391
    Commented May 17, 2017 at 13:45
  • Yes, this helps a lot; I'll mark this as the answer. Thanks. Now just one, probably cosmetic, question remains open (though it might be an indicator of issues?): Why does zpool iostat report about 160MB/s from the vdev, while I actually pull 450MB/s? See the added comment :)
    – user23563
    Commented May 17, 2017 at 14:06
  • 1
    @user23563 Writing in general is going to be faster than reading when using caching devices. Write calls can return the moment all the data is copied into the cache - whatever that cache may be. Read calls can only return when all the data has been copied into the caller's read buffer - from disk if necessary. And when sequentially reading large amounts of data, odds are the data will have to come from disk. Commented May 18, 2017 at 13:34
  • Well, on the other hand, 500MB/s for a write to the double-SSD cache is a little slow ... And since it is a 100GB file, this also cannot be about disk caches.
    – user23563
    Commented May 19, 2017 at 7:34
