
I need a storage server with at least 400TB of total disk space and a write speed preferably well above 2GB/s for a scientific experiment which will transfer files of roughly 10GB each via a network share. I decided to go for a Dell PowerEdge R740xd2 with 26 drives of 20TB each in a RAID 0 configuration.

The RAID controller is a Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02) and the drives are DELLEMC Exos X20 (20TB, 512e, 7.2K RPM, SAS 12Gbps). According to the specs, the Exos X20 can sustain 272MB/s (285MB/s max.).

Benchmarks, e.g. from hardwareluxx.de, report 265.4MB/s write speed and even 281.1MB/s for sequential writes (Benchmark for Exos X20).

So in principle, these 26 drives in RAID 0 should be able to yield around 26 × 270MB/s ≈ 7GB/s of write throughput.

I set the stripe size to 1MB in the hardware RAID configuration, which was the highest possible value. On top of the RAID volume, a single LVM logical volume was created and formatted with ext4.
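For reference, the volume was set up roughly along the following lines. This is a reconstruction from the device names visible in the outputs further down (/dev/sdb as the RAID virtual disk, volume group data-vg, logical volume data), not a verbatim command history, and the mount point is illustrative:

# pvcreate /dev/sdb
# vgcreate data-vg /dev/sdb
# lvcreate -l 100%FREE -n data data-vg
# mkfs.ext4 /dev/data-vg/data
# mount /dev/data-vg/data /data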

The read speed check with hdparm is already quite disappointing (2.6GB/s):

# hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   19172 MB in  2.00 seconds = 9595.81 MB/sec
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 Timing buffered disk reads: 6966 MB in  3.00 seconds = 2321.66 MB/sec

The sequential write test with fio is ridiculously low (400MiB/s):

# fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=1M --iodepth=64 --size=10G --readwrite=write
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=382MiB/s][w=382 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4485: Fri Jan 13 11:02:30 2023
  write: IOPS=401, BW=402MiB/s (421MB/s)(10.0GiB/25485msec); 0 zone resets
   bw (  KiB/s): min=272384, max=485376, per=99.80%, avg=410624.00, stdev=32440.49, samples=50
   iops        : min=  266, max=  474, avg=401.00, stdev=31.68, samples=50
  cpu          : usr=1.95%, sys=2.67%, ctx=963, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,10240,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=402MiB/s (421MB/s), 402MiB/s-402MiB/s (421MB/s-421MB/s), io=10.0GiB (10.7GB), run=25485-25485msec

Disk stats (read/write):
    dm-0: ios=0/10370, merge=0/0, ticks=0/1605452, in_queue=1605452, util=99.30%, aggrios=0/10349, aggrmerge=0/21, aggrticks=0/1601612, aggrin_queue=1581044, aggrutil=99.23%
  sdb: ios=0/10349, merge=0/21, ticks=0/1601612, in_queue=1581044, util=99.23%
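For comparison, a multi-job variant such as the following (not run here; the job count and the /data mount point are only illustrative) would show whether a single submitting thread or a single target file is the limiting factor:

# fio --name=test-multi --ioengine=libaio --direct=1 --gtod_reduce=1 \
      --bs=1M --iodepth=32 --numjobs=4 --size=10G --rw=write \
      --directory=/data --group_reporting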

I am wondering what's wrong here; I am very likely missing something obvious. Any ideas? I have not tuned things like the ext4 stride size etc., but I thought that the defaults should already give acceptable performance.
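For completeness, the stripe geometry could be passed to ext4 roughly like this, assuming a 1MiB strip per drive, 4KiB filesystem blocks and 26 data drives, i.e. stride = 1MiB / 4KiB = 256 and stripe_width = 256 × 26 = 6656 (whether this matters much for large sequential writes is unclear):

At creation time:
# mkfs.ext4 -E stride=256,stripe_width=6656 /dev/data-vg/data
Or on the existing filesystem:
# tune2fs -E stride=256,stripe_width=6656 /dev/mapper/data--vg-data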

A bit more information about the system:

The RAID controller:

18:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
        DeviceName: Integrated RAID
        Subsystem: Dell PERC H730P Mini
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 34
        NUMA node: 0
        Region 0: I/O ports at 4000 [size=256]
        Region 1: Memory at 9d900000 (64-bit, non-prefetchable) [size=64K]
        Region 3: Memory at 9d800000 (64-bit, non-prefetchable) [size=1M]
        Expansion ROM at <ignored> [disabled]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x8 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, LTR-
                         10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-, TPHComp-, ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                         AtomicOpsCtl: ReqEn-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
                Address: 0000000000000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 04000001 1710000f 18080000 b9620497
        Capabilities: [1e0 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
                LaneErrStat: 0
        Capabilities: [1c0 v1] Power Budgeting <?>
        Capabilities: [148 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Kernel driver in use: megaraid_sas
        Kernel modules: megaraid_sas
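For convenience, the relevant PCIe link lines can be pulled out directly (the controller sits at PCI address 18:00.0, as shown above):

# lspci -s 18:00.0 -vv | grep -E 'LnkCap:|LnkSta:'
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <2us
                LnkSta: Speed 8GT/s (ok), Width x8 (ok)

A PCIe 3.0 x8 link is good for roughly 7.9GB/s raw, so the link itself does not explain the 400MiB/s write result.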

The latency (measured with ioping) is below 300µs, which is fine:

# ioping -c 10 .
4 KiB <<< . (ext4 /dev/dm-0): request=1 time=575.2 us (warmup)
4 KiB <<< . (ext4 /dev/dm-0): request=2 time=235.1 us
4 KiB <<< . (ext4 /dev/dm-0): request=3 time=257.0 us
4 KiB <<< . (ext4 /dev/dm-0): request=4 time=269.3 us
4 KiB <<< . (ext4 /dev/dm-0): request=5 time=288.5 us
4 KiB <<< . (ext4 /dev/dm-0): request=6 time=284.8 us
4 KiB <<< . (ext4 /dev/dm-0): request=7 time=272.8 us

Here is the output of tune2fs -l:

# tune2fs -l /dev/mapper/data--vg-data
tune2fs 1.45.5 (07-Jan-2020)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          57c19b70-c1f3-4af9-85ec-ae3ac191c7a7
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              335544320
Block count:              5368709120
Reserved block count:     268435456
Free blocks:              5347083745
Free inodes:              335544309
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
Flex block group size:    16
Filesystem created:       Fri Jan 13 10:51:35 2023
Last mount time:          Fri Jan 13 10:51:37 2023
Last write time:          Fri Jan 13 10:51:37 2023
Mount count:              1
Maximum mount count:      -1
Last checked:             Fri Jan 13 10:51:35 2023
Check interval:           0 (<none>)
Lifetime writes:          41 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      285a1147-264c-4e28-87b8-22e27d407d98
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0x41f0d8c7
  • I agree that this should be much faster. Personally, I don't have very good experience with hardware RAID, so I'd go a different direction. You could reconfigure the controller to JBOD and then create a striped RAID 0 setup in software (see the mdadm sketch after these comments). Then you eliminate the controller's logic altogether, and the overhead on the CPU is negligible.
    – mtak
    Commented Jan 13, 2023 at 10:24
  • Thanks @mtak that's probably a good idea. I'll try that if there are no other hints ;)
    – tamasgal
    Commented Jan 13, 2023 at 10:29
  • The 3108 is a PCIe 3.0 x8 card from what I can find. That would set an upper limit of somewhere around 8GB/s maximum, though somewhat less with overheads. lspci -vv might show how many lanes are allocated.
    – Mokubai
    Commented Jan 13, 2023 at 10:34
    @Mokubai I just added the output of lspci
    – tamasgal
    Commented Jan 13, 2023 at 10:42
  • @mtak I tried the software RAID configuration but can only reach 1.4GB/s in sequential writing.
    – tamasgal
    Commented Jan 18, 2023 at 14:14
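For reference, the JBOD + software RAID 0 route suggested above would look roughly like the following minimal sketch (device names are placeholders and must be checked with lsblk first; the chunk size mirrors the 1MiB hardware stripe):

After switching the controller (or the individual disks) to JBOD / non-RAID mode:
# mdadm --create /dev/md0 --level=0 --raid-devices=26 --chunk=1024 /dev/sd[b-z] /dev/sdaa
# mkfs.ext4 /dev/md0
# mount /dev/md0 /data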

1 Answer


I think it is limited by the RAID controller/backplane itself.

I have a 4U storage server that is also fitted with an LSI MegaRAID SAS-3 3108 with 4GB of cache. I configured different volumes with different RAID levels and ran many tests. I found that the maximum write speed is also about 2.6GB/s, exactly the same as yours!

In the end, I gave up and attribute it to a hardware limit.

Here is a brief summary of my test results:

Four RAID 6 volumes are configured, each consisting of eight 18TB Western Digital hard drives. Writing to one volume reaches at most about 1.6GB/s, which is limited by the 6 data drives (a RAID 6 of 8 drives has only 6 data-bearing drives) times roughly 260MB/s per drive. But writing to two volumes at the same time only achieves about 2.6GB/s in total, and writing to three or four volumes together also tops out at 2.6GB/s.
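Put as rough numbers (taking ~260MB/s per drive, as above):

1 volume:       6 data drives × 260MB/s ≈ 1.56GB/s  (matches the observation)
2 volumes:      ~3.1GB/s expected (2 × 1.56GB/s), but only ~2.6GB/s measured
3 or 4 volumes: still ~2.6GB/s, i.e. a ceiling set by the controller rather than the drives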
