Skip to main content
edited tags
Link
Braiam
  • 36.3k
  • 28
  • 111
  • 173
Tweeted twitter.com/#!/StackUnix/status/458342383818932225
added 3986 characters in body
Source Link
seba
  • 251
  • 1
  • 2
  • 6

I also ran a strace -c ls -R /. I didn't realize it had to run for a while, so the previous data is not very meaningful. The command was run while the crawler was running, so with massive I/O ongoing.

Good

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.62   14.344825         114    126027           getdents
  0.25    0.036219           1     61812        12 open
  0.07    0.009891           0     61802           close
  0.06    0.007975           0     61801           fstat
  0.01    0.000775           0      8436           write
  0.00    0.000043          22         2           rt_sigaction
  0.00    0.000000           0        12           read
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         3         1 lstat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        15           brk
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   14.399728                319982        18 total

Bad

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.81   24.210936         181    133618           getdents
  0.14    0.032755           1     63183        14 open
  0.03    0.006637           0     63171           close
  0.02    0.005410           0     63170           fstat
  0.00    0.000434           0     15002           write
  0.00    0.000000           0        12           read
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         4         1 lstat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        25           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   24.256172                338259        20 total

I also ran a strace -c ls -R /. I didn't realize it had to run for a while, so the previous data is not very meaningful. The command was run while the crawler was running, so with massive I/O ongoing.

Good

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.62   14.344825         114    126027           getdents
  0.25    0.036219           1     61812        12 open
  0.07    0.009891           0     61802           close
  0.06    0.007975           0     61801           fstat
  0.01    0.000775           0      8436           write
  0.00    0.000043          22         2           rt_sigaction
  0.00    0.000000           0        12           read
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         3         1 lstat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        15           brk
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   14.399728                319982        18 total

Bad

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.81   24.210936         181    133618           getdents
  0.14    0.032755           1     63183        14 open
  0.03    0.006637           0     63171           close
  0.02    0.005410           0     63170           fstat
  0.00    0.000434           0     15002           write
  0.00    0.000000           0        12           read
  0.00    0.000000           0         1           stat
  0.00    0.000000           0         4         1 lstat
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         4           munmap
  0.00    0.000000           0        25           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         3           mremap
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00   24.256172                338259        20 total
added 4701 characters in body
Source Link
seba
  • 251
  • 1
  • 2
  • 6

Update

Beside the link posted, I thought it would have been a good idea to run the same tests on two servers running the crawler. It's interesting. For instance, vmstat 10 gives

Good

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0  68692 9009864  70832 29853400    0    0    15   214    9    2 11  1 88  0  0 
10  0  68692 8985620  70824 29883460    0    0    48  7402 79465 62898 12  1 86  0  0   
11  0  68692 8936780  70824 29928696    0    0    54  6842 81500 66665 15  1 83  0  0   
10  2  68692 8867280  70840 30000104    0    0    65 36578 80252 66272 14  1 85  0  0   
15  0  68692 8842960  70840 30026724    0    0    61  3667 81245 65161 14  1 85  0  0   

Bad

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
13  0   8840 14015100  92972 25683108    0    0    11   104    3    9  4  1 94  0  0    
 2  0   8840 14015800  92984 25696108    0    0    49 16835 38619 54204  2  2 96  0  0  
 1  0   8840 14026004  93004 25699940    0    0    33  4152 25914 43530  0  2 98  0  0  
 1  0   8840 14032272  93012 25703716    0    0    30  1164 25062 43230  0  2 98  0  0  
 2  0   8840 14029632  93020 25709448    0    0    24  5619 23475 40080  0  2 98  0  0  

And iostat -x -k 5

Good

Linux 2.6.32-358.18.1.el6.x86_64 (s0.law.di.unimi.it)   04/21/2014  _x86_64_    (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.65    0.00    1.02    0.11    0.00   88.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.11  3338.25   17.98   56.52   903.55 13579.19   388.79     7.32   98.30   1.23   9.18
sda               0.39     0.72    0.49    0.76    11.68     5.90    28.25     0.01   11.25   3.41   0.43

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.86    0.00    1.33    0.03    0.00   82.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  1106.20    9.20   31.00    36.80  4549.60   228.18     0.41   10.09   0.39   1.58
sda               0.00     2.20    0.80    3.00     4.80    20.80    13.47     0.04   10.53   3.21   1.22

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.42    0.00    1.23    0.01    0.00   83.34

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  1205.40    8.00   33.60    40.80  4956.00   240.23     0.39    9.43   0.33   1.38
sda               0.00     0.60    0.00    1.00     0.00     6.40    12.80     0.01    5.20   4.20   0.42

Bad

Linux 2.6.32-358.18.1.el6.x86_64 (s2.law.di.unimi.it)   04/21/2014  _x86_64_    (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.37    0.00    1.41    0.06    0.00   94.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.06  1599.70   13.76   38.23   699.27  6551.73   278.96     3.12   59.96   0.99   5.16
sda               0.46     3.17    1.07    0.78    22.51    15.85    41.26     0.03   16.10   2.70   0.50

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.93    0.00    2.99    0.60    0.00   84.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 14885.40   13.60  141.20    54.40 60106.40   777.27    34.90  225.45   1.95  30.14
sda               0.00     0.40    0.00    0.80     0.00     4.80    12.00     0.01    7.00   3.25   0.26

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.61    0.00    2.51    0.16    0.00   85.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  2245.40   10.60   51.20    42.40  9187.20   298.69     3.51   56.80   2.04  12.58
sda               0.00     0.40    0.00    0.80     0.00     4.80    12.00     0.01    6.25   3.25   0.26

So (if I understand correctly the output) it appears that, yes, as it was apparent from the JVM stack traces the slow server is taking forever to do I/O. It remains to understand why :(.

Update

Beside the link posted, I thought it would have been a good idea to run the same tests on two servers running the crawler. It's interesting. For instance, vmstat 10 gives

Good

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0  68692 9009864  70832 29853400    0    0    15   214    9    2 11  1 88  0  0 
10  0  68692 8985620  70824 29883460    0    0    48  7402 79465 62898 12  1 86  0  0   
11  0  68692 8936780  70824 29928696    0    0    54  6842 81500 66665 15  1 83  0  0   
10  2  68692 8867280  70840 30000104    0    0    65 36578 80252 66272 14  1 85  0  0   
15  0  68692 8842960  70840 30026724    0    0    61  3667 81245 65161 14  1 85  0  0   

Bad

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
13  0   8840 14015100  92972 25683108    0    0    11   104    3    9  4  1 94  0  0    
 2  0   8840 14015800  92984 25696108    0    0    49 16835 38619 54204  2  2 96  0  0  
 1  0   8840 14026004  93004 25699940    0    0    33  4152 25914 43530  0  2 98  0  0  
 1  0   8840 14032272  93012 25703716    0    0    30  1164 25062 43230  0  2 98  0  0  
 2  0   8840 14029632  93020 25709448    0    0    24  5619 23475 40080  0  2 98  0  0  

And iostat -x -k 5

Good

Linux 2.6.32-358.18.1.el6.x86_64 (s0.law.di.unimi.it)   04/21/2014  _x86_64_    (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.65    0.00    1.02    0.11    0.00   88.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.11  3338.25   17.98   56.52   903.55 13579.19   388.79     7.32   98.30   1.23   9.18
sda               0.39     0.72    0.49    0.76    11.68     5.90    28.25     0.01   11.25   3.41   0.43

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.86    0.00    1.33    0.03    0.00   82.78

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  1106.20    9.20   31.00    36.80  4549.60   228.18     0.41   10.09   0.39   1.58
sda               0.00     2.20    0.80    3.00     4.80    20.80    13.47     0.04   10.53   3.21   1.22

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.42    0.00    1.23    0.01    0.00   83.34

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  1205.40    8.00   33.60    40.80  4956.00   240.23     0.39    9.43   0.33   1.38
sda               0.00     0.60    0.00    1.00     0.00     6.40    12.80     0.01    5.20   4.20   0.42

Bad

Linux 2.6.32-358.18.1.el6.x86_64 (s2.law.di.unimi.it)   04/21/2014  _x86_64_    (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.37    0.00    1.41    0.06    0.00   94.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.06  1599.70   13.76   38.23   699.27  6551.73   278.96     3.12   59.96   0.99   5.16
sda               0.46     3.17    1.07    0.78    22.51    15.85    41.26     0.03   16.10   2.70   0.50

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.93    0.00    2.99    0.60    0.00   84.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00 14885.40   13.60  141.20    54.40 60106.40   777.27    34.90  225.45   1.95  30.14
sda               0.00     0.40    0.00    0.80     0.00     4.80    12.00     0.01    7.00   3.25   0.26

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.61    0.00    2.51    0.16    0.00   85.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00  2245.40   10.60   51.20    42.40  9187.20   298.69     3.51   56.80   2.04  12.58
sda               0.00     0.40    0.00    0.80     0.00     4.80    12.00     0.01    6.25   3.25   0.26

So (if I understand correctly the output) it appears that, yes, as it was apparent from the JVM stack traces the slow server is taking forever to do I/O. It remains to understand why :(.

Source Link
seba
  • 251
  • 1
  • 2
  • 6
Loading