Summary:
I have set up a RAIDZ pool of 4 HDDs with two SSD cache devices, but I am not seeing the expected cache boost or overall performance. Some of the numbers also don't seem to add up.
Background and configuration:
I am setting up an analysis workstation for research: Ryzen 7 1800X, 64GB ECC RAM, GTX 1080 Ti, Tesla K40 (thanks for that, NVIDIA). It is meant to be general purpose: there will be CPU and GPU computations, and some of the datasets consist of very big files (50-100 files of 10-30GB each). Due to parallelization, several of them will sometimes be accessed at the same time. Some jobs are RAM intensive, but not all, so there are phases where ZFS has plenty of RAM available, but not always (5-10GB for the 500GB L2ARC described below would be fine, though).
I have 2 × 2TB SSDs (Samsung 850 EVO) and 4 × 8TB HDDs (WD Red). 3.5TB of the SSDs will be a RAID0; the remaining 2 × 250GB may be used as cache for ZFS. For a first test, I have added them as two cache devices to a RAIDZ over the 4 HDDs.
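For reference, the pool was created roughly along these lines (a reconstructed sketch, not the exact commands from my shell history; the by-id paths are the ones visible in the status output below):

# RAIDZ over the four HDDs (sketch)
zpool create data raidz \
    /dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VJGSE7NX \
    /dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VJGSDP4X \
    /dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VJGSBYHX \
    /dev/disk/by-id/ata-WDC_WD80EFZX-68UW8N0_VJGSDDAX
# the 250GB partitions of the two SSDs added as cache devices (sketch)
zpool add data cache \
    /dev/disk/by-id/ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00789R-part1 \
    /dev/disk/by-id/ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00792H-part1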
Here is the layout:
# zpool status -v
  pool: data
 state: ONLINE
  scan: none requested
config:

    NAME                                                   STATE     READ WRITE CKSUM
    data                                                   ONLINE       0     0     0
      raidz1-0                                             ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJGSE7NX                  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJGSDP4X                  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJGSBYHX                  ONLINE       0     0     0
        ata-WDC_WD80EFZX-68UW8N0_VJGSDDAX                  ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00789R-part1    ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00792H-part1    ONLINE       0     0     0
Measurements and command outputs:
I generated a random file (to get around compression issues) with quite nice performance:
# dd if=<(openssl enc -aes-256-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero) of=filename bs=1M count=100000 iflag=fullblock
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 199,795 s, 525 MB/s
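To double-check that compression is not skewing these numbers, the compression setting and the achieved ratio can be queried (a minimal sketch, assuming the file lives directly on the pool's root dataset data):

# compression setting and achieved ratio for the dataset holding the test file
zfs get compression,compressratio data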
Now, what I expected was that this file would end up in my cache (the L2ARC, as far as I understand) once it is accessed repeatedly. However, that doesn't really seem to happen, at least not very effectively:
for i in 1 2 3 4; do dd if=filename of=/dev/null bs=1M iflag=fullblock; done
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 252,751 s, 415 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 215,769 s, 486 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 227,668 s, 461 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB, 98 GiB) copied, 224,469 s, 467 MB/s
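For completeness, this is how I would check whether the repeated reads actually hit the L2ARC (a sketch, assuming ZFS on Linux, where the ARC counters are exposed under /proc/spl/kstat/zfs):

# L2ARC size and hit/miss counters since the zfs module was loaded
grep '^l2_' /proc/spl/kstat/zfs/arcstats
# arc_summary (or arc_summary.py), if available, presents the same counters more readably
arc_summary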
Questions:
Why do I get lower read performance than write performance? Shouldn't writes converge to the speed of 3 disks and reads to the speed of 4 disks, as with a RAID5?
Why doesn't the L2ARC kick in? After multiple reads with no other data being read, I would have expected read performance close to the ~1GB/s of the SSD RAID0.
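In case it is relevant to this question, here is how I would inspect the caching-related settings, assuming ZFS on Linux (a sketch, not a claim that anything is misconfigured):

# which data may be cached in ARC / L2ARC for this pool
zfs get primarycache,secondarycache data
# L2ARC feed throttles and prefetch handling (module parameters)
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_write_boost
cat /sys/module/zfs/parameters/l2arc_noprefetch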
Why does zpool iostat report such low read bandwidth for the individual devices? I ran this multiple times (this is from the last run), and it was always similar. The four hard drives only add up to ~160MB/s, while dd reports more than 400MB/s:
# zpool iostat -v
                                                       capacity     operations     bandwidth
pool                                                 alloc   free   read  write   read  write
---------------------------------------------------  -----  -----  -----  -----  -----  -----
data                                                  136G  28,9T  1,31K    152   167M  14,9M
  raidz1                                              136G  28,9T  1,31K    152   167M  14,9M
    ata-WDC_WD80EFZX-68UW8N0_VJGSE7NX                    -      -    571     66  46,0M  5,18M
    ata-WDC_WD80EFZX-68UW8N0_VJGSDP4X                    -      -    445     59  44,9M  5,18M
    ata-WDC_WD80EFZX-68UW8N0_VJGSBYHX                    -      -    503     66  40,2M  5,18M
    ata-WDC_WD80EFZX-68UW8N0_VJGSDDAX                    -      -    419     62  39,4M  5,18M
cache                                                    -      -      -      -      -      -
  ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00789R-part1   34,0G   216G      1    266  8,23K  33,1M
  ata-Samsung_SSD_850_EVO_2TB_S2RMNX0HC00792H-part1   34,0G   216G      1    266  7,80K  33,0M
---------------------------------------------------  -----  -----  -----  -----  -----  -----
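As far as I understand, zpool iostat without an interval reports averages since the pool was imported, so if needed I can re-run it with an interval to get current rates, e.g.:

# sample per-device bandwidth over 5-second intervals instead of the since-import average
zpool iostat -v data 5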
Is something fundamentally wrong here, or did I misunderstand something? Should I use part of the SSDs for a ZIL (SLOG)? I could also spare a few dozen GB of the OS M.2 SSD for that, at least if ZFS accepts an LVM device, since the whole drive is currently claimed by the Kubuntu installation. I haven't done that yet because I understood it would only help with small synchronous writes, which I don't really expect; mostly, larger data will be written back sequentially.
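If a separate log device turns out to be worthwhile after all, I assume adding one would look roughly like this (the partition path below is only a placeholder; I have not done this):

# hypothetical: add a spare SSD/M.2 partition as a dedicated ZIL (SLOG) device
zpool add data log /dev/disk/by-id/<m2-ssd-partition>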
Why does the cache show up like a separate pool named cache, rather than as something that belongs to the pool data? I used:
zpool add data cache [devices]
so it should belong to the data pool, shouldn't it?