We have four identical Linux servers, each with a large (5 TB) hard-disk partition. They run Scientific Linux with this kernel:

Linux s3.law.di.unimi.it 2.6.32-358.18.1.el6.x86_64 #1 SMP Tue Aug 27 14:23:09 CDT 2013 x86_64 x86_64 x86_64 GNU/Linux

The servers are identically installed and configured, but one, and only one, of them is ridiculously slow when writing to ext4. If I start a dd and then check the size of the output file twice, about thirty seconds apart (the shell prompt shows the timestamp), I see:

dd if=/dev/zero of=/mnt/big/analysis/test

l -tr
total 11M
-rw-r--r-- 1 root    root  11M Apr 20 10:01 test
10:01:42 [s3] /mnt/big/analysis
l -tr
total 16M
-rw-r--r-- 1 root    root  16M Apr 20 10:02 test
10:02:13 [s3] /mnt/big/analysis

So: 5 MB in about 30 seconds. All the other servers write more than an order of magnitude faster.
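
For reference, a more controlled version of this test (the block size and count are arbitrary) would rule out syscall overhead and separate page-cache writes from actual device throughput:

# Write 1 GiB in 1 MiB blocks and flush it before dd reports the rate.
dd if=/dev/zero of=/mnt/big/analysis/test bs=1M count=1024 conv=fdatasync
# Bypass the page cache entirely to measure the device rather than writeback behaviour.
dd if=/dev/zero of=/mnt/big/analysis/test bs=1M count=1024 oflag=direct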

The machines have 64 GB of RAM and 32 cores, and they show no I/O or CPU activity, although 90% of the memory is occupied by a large Java process that is doing nothing. Only one machine writes slowly.

SMART says everything is OK:

# smartctl -H /dev/sdb
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-358.18.1.el6.x86_64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK
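
(smartctl -H only prints the overall verdict; the full attribute and error-log dump, which would show things like reallocated or pending sectors, can be obtained with the commands below. Output omitted here.)

smartctl -a /dev/sdb        # all SMART information: attributes, error log, self-test log
smartctl -l error /dev/sdb  # just the device error log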

hdparm reads with no problems:

# hdparm -t /dev/sdb

/dev/sdb:
 Timing buffered disk reads:  1356 MB in  3.00 seconds = 451.60 MB/sec

The partition is mounted as follows:

/dev/sdb1 on /mnt/big type ext4 (rw,noatime,data=writeback)

and we set

tune2fs -o journal_data_writeback /dev/sdb1

for performance.
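
To double-check that the writeback journal mode is really in effect, the superblock defaults and the live mount options can be inspected, for example:

tune2fs -l /dev/sdb1 | grep -i 'journal\|mount options'   # superblock default mount options
grep /mnt/big /proc/mounts                                # options the kernel is actually using
dmesg | grep -i 'ext4.*sdb1'                              # ext4 logs the data mode at mount time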

I have tried to find anything that could explain why this specific server writes so slowly (that is, any difference in the output of diagnostic tools), with no results.

Just to complete the picture: we started a crawl on all the servers, with the partitions essentially empty. The crawl created a number of files on the partition, in particular a 156 GB file (the store of the crawl). The crawl started fine, but after a few hours we saw slowdowns (apparently as the store was growing). When we checked, we noticed that writing to disk was getting slower, and slower, and slower.

We stopped everything (no CPU activity, no I/O), but dd was still showing the behaviour above. The other three servers, under identical conditions, with the same files, etc., work perfectly, both during the crawl and with dd.

Frankly, I don't even know where to look. Does this ring a bell for anyone? Which diagnostic tools could I use to understand what is happening, or which tests should I try?

Update

Besides the link that was posted, I thought it would be a good idea to run the same tests on two of the servers running the crawler. The results are interesting. For instance, vmstat 10 gives:

Good

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0  68692 9009864  70832 29853400    0    0    15   214    9    2 11  1 88  0  0 
10  0  68692 8985620  70824 29883460    0    0    48  7402 79465 62898 12  1 86  0  0   
11  0  68692 8936780  70824 29928696    0    0    54  6842 81500 66665 15  1 83  0  0   
10  2  68692 8867280  70840 30000104    0    0    65 36578 80252 66272 14  1 85  0  0   
15  0  68692 8842960  70840 30026724    0    0    61  3667 81245 65161 14  1 85  0  0   

Bad

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
13  0   8840 14015100  92972 25683108    0    0    11   104    3    9  4  1 94  0  0    
 2  0   8840 14015800  92984 25696108    0    0    49 16835 38619 54204  2  2 96  0  0  
 1  0   8840 14026004  93004 25699940    0    0    33  4152 25914 43530  0  2 98  0  0  
 1  0   8840 14032272  93012 25703716    0    0    30  1164 25062 43230  0  2 98  0  0  
 2  0   8840 14029632  93020 25709448    0    0    24  5619 23475 40080  0  2 98  0  0  

And iostat -x -k 5 gives:

Good

Linux 2.6.32-358.18.1.el6.x86_64 (s0.law.di.unimi.it)   04/21/2014  _x86_64_    (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.65    0.00    1.02    0.11    0.00   88.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.11  3338.25   17.98   56.52   903.55 13579.19   388.79     7.32   98.30   1.23   9.18
sda               0.39     0.72    0.49    0.76    11.68     5.90    28.25     0.01   11.25   3.41   0.43

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.86    0.00    1.33    0.03    0.00   82.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  1106.20    9.20   31.00    36.80  4549.60   228.18     0.41   10.09   0.39   1.58
sda               0.00     2.20    0.80    3.00     4.80    20.80    13.47     0.04   10.53   3.21   1.22

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.42    0.00    1.23    0.01    0.00   83.34

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  1205.40    8.00   33.60    40.80  4956.00   240.23     0.39    9.43   0.33   1.38
sda               0.00     0.60    0.00    1.00     0.00     6.40    12.80     0.01    5.20   4.20   0.42

Bad

Linux 2.6.32-358.18.1.el6.x86_64 (s2.law.di.unimi.it)   04/21/2014  _x86_64_    (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.37    0.00    1.41    0.06    0.00   94.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.06  1599.70   13.76   38.23   699.27  6551.73   278.96     3.12   59.96   0.99   5.16
sda               0.46     3.17    1.07    0.78    22.51    15.85    41.26     0.03   16.10   2.70   0.50

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.93    0.00    2.99    0.60    0.00   84.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 14885.40   13.60  141.20    54.40 60106.40   777.27    34.90  225.45   1.95  30.14
sda               0.00     0.40    0.00    0.80     0.00     4.80    12.00     0.01    7.00   3.25   0.26

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.61    0.00    2.51    0.16    0.00   85.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  2245.40   10.60   51.20    42.40  9187.20   298.69     3.51   56.80   2.04  12.58
sda               0.00     0.40    0.00    0.80     0.00     4.80    12.00     0.01    6.25   3.25   0.26

So (if I understand the output correctly) it appears that, yes, as was already apparent from the JVM stack traces, the slow server is taking forever to do I/O. It remains to understand why :(.
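
Since the slowdown builds up over hours of writing and disappears after the disk sits idle (see the comments), the state of the dirty-page writeback machinery seems worth watching while the problem is occurring; for example:

grep -i -e dirty -e writeback /proc/meminfo   # how much dirty data is waiting to be flushed
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_expire_centisecs /proc/sys/vm/dirty_writeback_centisecs
iotop -o -d 5                                 # which threads (e.g. flush-8:16 for sdb) are actually doing I/O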

I also ran strace -c ls -R /. I didn't realize it had to run for a while, so the previous data was not very meaningful. The command was run while the crawler was active, so with massive I/O going on.

Good

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.62   14.344825         114    126027           getdents
  0.25    0.036219           1     61812        12 open
  0.07    0.009891           0     61802           close
  0.06    0.007975           0     61801           fstat
  0.01    0.000775           0      8436           write
  0.00    0.000043          22         2           rt_sigaction
  0.00    0.000000           0        12           read
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         3         1 lstat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        15           brk
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   14.399728                319982        18 total

Bad

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.81   24.210936         181    133618           getdents
  0.14    0.032755           1     63183        14 open
  0.03    0.006637           0     63171           close
  0.02    0.005410           0     63170           fstat
  0.00    0.000434           0     15002           write
  0.00    0.000000           0        12           read
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         4         1 lstat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        25           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   24.256172                338259        20 total
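
A complementary check (just a sketch, not run yet) would be to time the individual write() calls of the dd itself, since strace can record per-syscall latency with -T:

strace -T -e trace=write -o /tmp/dd-write.trace \
    dd if=/dev/zero of=/mnt/big/analysis/test bs=1M count=256
tail -n 5 /tmp/dd-write.trace   # the time spent inside each write() is shown in angle brackets
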
  • Always add bs=1M or something to dd, otherwise you're measuring syscall overhead. smartctl -H is useless. Check the kernel logs, the full SMART data, partition alignment, tune2fs, benchmark the disks individually/directly without a filesystem, ... and if nothing comes up, replace the cables, and the disk anyway.
    Commented Apr 20, 2014 at 15:30
  • bs=1M did not change the result (the test was indeed running faster on the other servers), but thanks for the pointer. UPDATE: after leaving the disk idle for 10 hours, THE SPEED IS HIGH AGAIN. The dd test copies hundreds of megabytes per second, like on all the other servers. So it appears that something "happens" to the file system incrementally: after a few hours things get slower and slower, but if we stop all activity and wait a few hours, things go back to normal. I guess this has something to do with delayed writes, but frankly I don't know what I should change.
    – seba
    Commented Apr 20, 2014 at 17:15
  • Which scheduler do you use? noop, cfq, or deadline?
    – UnX
    Commented Apr 20, 2014 at 18:31
  • cfq. Thanks, really, for pointing this out; I didn't even know the scheduler was settable. I think deadline would be more appropriate for what we are doing: we need the system to be as dumb as possible, since we alternate phases in which we make large writes to different files. (See the sketch after these comments for checking and switching the scheduler.)
    – seba
    Commented Apr 20, 2014 at 18:58
  • For the time being, I'm trying to disable journalling altogether.
    – seba
    Commented Apr 20, 2014 at 18:59
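
A sketch related to the last two comments (untested on these machines): the active elevator can be checked and switched per device at run time, and dropping the ext4 journal requires unmounting the filesystem first.

# The scheduler in brackets is the active one.
cat /sys/block/sdb/queue/scheduler
# Switch to deadline at run time; add elevator=deadline to the kernel line
# in grub.conf to make it persistent across reboots.
echo deadline > /sys/block/sdb/queue/scheduler

# Removing the ext4 journal entirely (the filesystem must be unmounted,
# and a forced fsck afterwards is recommended).
umount /mnt/big
tune2fs -O ^has_journal /dev/sdb1
e2fsck -f /dev/sdb1
mount /mnt/big   # assumes the mount point is listed in /etc/fstab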

2 Answers


It was the kernel. We were using 2.6.32, which is quite old, but it is the one supported by Red Hat Enterprise Linux and Scientific Linux.

Today I had lunch with a friend (Giuseppe Ottaviano) who had had a similar experience while tuning high-performance indexing algorithms. After upgrading everything (compiler, libraries, etc.) to the newest version, he switched the kernel to the 3.10 line and suddenly everything worked fine.

It worked for us, too. With the 3.10 kernel (courtesy of http://elrepo.org), all problems vanished.
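
For reference, pulling the 3.10 long-term kernel from ELRepo on EL6 goes roughly like this (the exact elrepo-release package version has to be taken from the ELRepo site; at the time, kernel-lt tracked the 3.10 branch):

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
# Install the elrepo-release RPM for EL6 from http://elrepo.org, then:
yum --enablerepo=elrepo-kernel install kernel-lt
# Finally, make the new kernel the default entry in /boot/grub/grub.conf and reboot.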

Giuseppe suspects a pernicious interaction between NUMA and kernel paging, which leads the kernel to load and save the same data like crazy, slowing the machine almost to a halt.
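
For anyone wanting to check whether their machine is in a similar state, the NUMA-related settings that are often implicated on large-memory EL6 boxes running 2.6.32 can be inspected without rebooting; a sketch:

numactl --hardware                  # NUMA node layout and free memory per node
numastat                            # per-node allocation hit/miss counters
cat /proc/sys/vm/zone_reclaim_mode  # 1 makes the kernel reclaim aggressively instead of using remote nodes
cat /sys/kernel/mm/redhat_transparent_hugepage/enabled   # transparent hugepage setting (RHEL 6 path)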


Regarding the posted output, it all seems "normal" for all the machines: low await (disk latency), in fact rather lower (= better) for the "bad" machines; the request queue is not too bad, and wa (I/O wait time in vmstat) is 0.

But you went from 5 MB in 30 seconds to 130 MB/s; what's up?

  • OK... I don't know. I think that eliminating the journal changed the setup; I updated the tests after that because I wanted more than one output from iostat. But the problem remains. At the end of a crawl we move some data using the transfer() facility of Java NIO, which means that we move 30 GB from file to file with just one method call. It has been going on for days (literally) instead of a few minutes, copying a few hundred KB per second. The weird thing is that while that copy is going so slowly, a dd on the same disk now does 130 MB/s.
    – seba
    Commented Apr 22, 2014 at 8:37
  • Another thing that happened is that a second server started to behave similarly, and we discovered that the two bad servers have an old BIOS (2.8.0) compared to the good ones (3.3.2). So we're going to update and retry.
    – seba
    Commented Apr 22, 2014 at 8:38
  • And... no, it's not the BIOS. At this point I'm thinking of some very low-level hardware problem, like interrupts timing out or something. Our next try will be reformatting the large partitions and testing again.
    – seba
    Commented Apr 22, 2014 at 10:45
