I am observing very high IOWAIT time on the CPUs of my server:
top - 14:24:20 up 846 days, 14:14, 2 users, load average: 14.42, 14.33, 14.57
Tasks: 345 total, 1 running, 341 sleeping, 3 stopped, 0 zombie
%Cpu0 : 0.9 us, 0.9 sy, 0.0 ni, 0.0 id, 98.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.4 us, 4.2 sy, 0.0 ni, 0.0 id, 94.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.9 us, 0.9 sy, 0.0 ni, 0.0 id, 98.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.5 sy, 0.0 ni, 0.0 id, 99.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.0 us, 0.5 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.5 us, 0.5 sy, 0.0 ni, 0.0 id, 99.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 0.0 us, 0.0 sy, 0.0 ni, 0.0 id,100.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 0.5 sy, 0.0 ni, 0.0 id, 99.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65577636 total, 62258200 free, 2361520 used, 957916 buff/cache
KiB Swap: 33524732 total, 33394684 free, 130048 used. 62668248 avail Mem
After checking iostat, iotop, sar, and so on, everything reports no disk activity. I finally noticed that sar reports 11 blocked tasks. Indeed, I get:
[root@ttllkk ~]# ps aux | awk '$8 ~ /D/'
root 1875 0.0 0.0 108056 344 ? D May05 0:00 sync
root 6503 0.0 0.0 72348 876 ? Ds Jan27 0:08 /usr/libexec/openssh/sftp-server
root 6515 0.0 0.0 72348 812 ? Ds Jan27 0:08 /usr/libexec/openssh/sftp-server
root 7737 0.0 0.0 72348 996 ? Ds Jan27 0:00 /usr/libexec/openssh/sftp-server
root 8065 0.0 0.0 0 0 ? D Jan27 0:00 [kworker/1:103]
root 8147 0.0 0.0 0 0 ? D Jan27 0:00 [kworker/1:185]
root 8294 0.0 0.0 72348 988 ? Ds Jan27 0:00 /usr/libexec/openssh/sftp-server
root 9163 0.0 0.0 0 0 ? D Jan27 0:00 [kworker/3:98]
root 9166 0.0 0.0 0 0 ? D Jan27 0:00 [kworker/3:101]
root 11406 0.0 0.0 108056 0 ? D Feb08 0:00 sync
root 15693 0.0 0.0 72348 992 ? Ds Jan27 0:04 /usr/libexec/openssh/sftp-server
root 17318 0.0 0.0 0 0 ? D Jan27 0:00 [kworker/1:1]
root 27112 0.0 0.0 108056 72 ? D Mar04 0:00 sync
root 30440 0.0 0.0 108056 320 ? D Apr10 0:00 sync
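For reference, an equivalent way to spot these blocked tasks also shows *where* each one is stuck, by asking ps for the wait channel (WCHAN) — a minimal sketch; the column width is just a readability choice, and WCHAN may display as "-" on kernels that hide it:

```shell
# List tasks in uninterruptible sleep (state code starting with D),
# together with the kernel function they are sleeping in (WCHAN).
# The header line (NR == 1) is kept for readability.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

On this box it would list the same sync, sftp-server, and kworker PIDs as the awk filter above, but with the blocking kernel symbol next to each.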
I am pretty sure that all of these are related to the folder /box, which I had mounted over CIFS from a remote server. I tried umount -f /box, which failed (the mount was busy), and then umount -l /box, so that folder no longer shows up in the mount output. However, the processes are still there, and my CPU problem remains. Trying to kill the processes with kill -9 does nothing.
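That kill -9 has no effect is expected for D-state tasks: uninterruptible sleep means the kernel queues the signal but cannot deliver it until the blocking I/O returns. You can confirm the SIGKILL is pending but undelivered from /proc — a minimal sketch (substitute one of the hung PIDs from the listing above; "self" is just a runnable placeholder):

```shell
# SigPnd/ShdPnd in /proc/<pid>/status are hex bitmasks of queued but
# not-yet-delivered signals; bit 9 (mask 0x100) set means a SIGKILL
# is pending that cannot be delivered while the task sleeps in D state.
pid=self   # placeholder: replace with a hung PID, e.g. 1875
grep -E '^(State|SigPnd|ShdPnd)' "/proc/$pid/status"
```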
I see some cifs related processes:
[root@ttllkk ~]# lsof 2> /dev/null | grep cifs 2> /dev/null
cifsd 2704 root cwd DIR 9,2 4096 2 /
cifsd 2704 root rtd DIR 9,2 4096 2 /
cifsd 2704 root txt unknown /proc/2704/exe
cifsiod 16258 root cwd DIR 9,2 4096 2 /
cifsiod 16258 root rtd DIR 9,2 4096 2 /
cifsiod 16258 root txt unknown /proc/16258/exe
cifsoploc 16259 root cwd DIR 9,2 4096 2 /
cifsoploc 16259 root rtd DIR 9,2 4096 2 /
cifsoploc 16259 root txt unknown /proc/16259/exe
Restarting the machine is NOT an option.
Is there a way I can deal with this and get these processes to end?
Edit:
Some more info about these processes. They are indeed hung on CIFS-related work:
[root@ttllkk ~]# for pid in $(cat hang_pids.txt); do cat /proc/$pid/stack; echo "---"; echo ""; done
[<ffffffffa1a7f3df>] sync_inodes_sb+0xdf/0x3d0
[<ffffffffa1a840b9>] sync_inodes_one_sb+0x19/0x20
[<ffffffffa1a52933>] iterate_supers+0xc3/0x120
[<ffffffffa1a84394>] sys_sync+0x44/0xb0
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19bd584>] filemap_fdatawait_range+0x14/0x30
[<ffffffffa19bd5c7>] filemap_fdatawait+0x27/0x30
[<ffffffffa19bfeac>] filemap_write_and_wait+0x4c/0x80
[<ffffffffc05173c9>] cifs_flush+0x49/0x90 [cifs]
[<ffffffffa1a4ba77>] filp_close+0x37/0x90
[<ffffffffa1a6fa2c>] __close_fd+0x8c/0xb0
[<ffffffffa1a4d5a3>] SyS_close+0x23/0x50
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19bd584>] filemap_fdatawait_range+0x14/0x30
[<ffffffffa19bd5c7>] filemap_fdatawait+0x27/0x30
[<ffffffffa19bfeac>] filemap_write_and_wait+0x4c/0x80
[<ffffffffc05173c9>] cifs_flush+0x49/0x90 [cifs]
[<ffffffffa1a4ba77>] filp_close+0x37/0x90
[<ffffffffa1a6fa2c>] __close_fd+0x8c/0xb0
[<ffffffffa1a4d5a3>] SyS_close+0x23/0x50
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19ced5d>] invalidate_inode_pages2_range+0x26d/0x460
[<ffffffffa19cef67>] invalidate_inode_pages2+0x17/0x20
[<ffffffffc051cc65>] cifs_invalidate_mapping+0x35/0x60 [cifs]
[<ffffffffc051cd20>] cifs_revalidate_mapping+0x90/0xa0 [cifs]
[<ffffffffc051d06f>] cifs_revalidate_dentry+0x1f/0x30 [cifs]
[<ffffffffc050c7b2>] cifs_d_revalidate+0x42/0xf0 [cifs]
[<ffffffffa1a5a6ba>] lookup_fast+0x1da/0x230
[<ffffffffa1a5d0bd>] path_lookupat+0x16d/0x8d0
[<ffffffffa1a5d84b>] filename_lookup+0x2b/0xc0
[<ffffffffa1a61557>] user_path_at_empty+0x67/0xc0
[<ffffffffa1a615c1>] user_path_at+0x11/0x20
[<ffffffffa1a54003>] vfs_fstatat+0x63/0xc0
[<ffffffffa1a54421>] SYSC_newlstat+0x31/0x60
[<ffffffffa1a5488e>] SyS_newlstat+0xe/0x10
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19bd584>] filemap_fdatawait_range+0x14/0x30
[<ffffffffa19bd5c7>] filemap_fdatawait+0x27/0x30
[<ffffffffc05141f9>] cifs_oplock_break+0x369/0x3b0 [cifs]
[<ffffffffa18bde8f>] process_one_work+0x17f/0x440
[<ffffffffa18befa6>] worker_thread+0x126/0x3c0
[<ffffffffa18c5e61>] kthread+0xd1/0xe0
[<ffffffffa1f95ddd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19bd584>] filemap_fdatawait_range+0x14/0x30
[<ffffffffa19bd5c7>] filemap_fdatawait+0x27/0x30
[<ffffffffc05141f9>] cifs_oplock_break+0x369/0x3b0 [cifs]
[<ffffffffa18bde8f>] process_one_work+0x17f/0x440
[<ffffffffa18befa6>] worker_thread+0x126/0x3c0
[<ffffffffa18c5e61>] kthread+0xd1/0xe0
[<ffffffffa1f95ddd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19ced5d>] invalidate_inode_pages2_range+0x26d/0x460
[<ffffffffa19cef67>] invalidate_inode_pages2+0x17/0x20
[<ffffffffc051cc65>] cifs_invalidate_mapping+0x35/0x60 [cifs]
[<ffffffffc051cd20>] cifs_revalidate_mapping+0x90/0xa0 [cifs]
[<ffffffffc051d06f>] cifs_revalidate_dentry+0x1f/0x30 [cifs]
[<ffffffffc050c7b2>] cifs_d_revalidate+0x42/0xf0 [cifs]
[<ffffffffa1a5a6ba>] lookup_fast+0x1da/0x230
[<ffffffffa1a5d0bd>] path_lookupat+0x16d/0x8d0
[<ffffffffa1a5d84b>] filename_lookup+0x2b/0xc0
[<ffffffffa1a61557>] user_path_at_empty+0x67/0xc0
[<ffffffffa1a615c1>] user_path_at+0x11/0x20
[<ffffffffa1a54003>] vfs_fstatat+0x63/0xc0
[<ffffffffa1a54421>] SYSC_newlstat+0x31/0x60
[<ffffffffa1a5488e>] SyS_newlstat+0xe/0x10
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd7a4>] __lock_page+0x74/0x90
[<ffffffffc04fa26d>] cifs_writev_complete+0x43d/0x480 [cifs]
[<ffffffffa18bde8f>] process_one_work+0x17f/0x440
[<ffffffffa18befa6>] worker_thread+0x126/0x3c0
[<ffffffffa18c5e61>] kthread+0xd1/0xe0
[<ffffffffa1f95ddd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd7a4>] __lock_page+0x74/0x90
[<ffffffffc04fa26d>] cifs_writev_complete+0x43d/0x480 [cifs]
[<ffffffffa18bde8f>] process_one_work+0x17f/0x440
[<ffffffffa18befa6>] worker_thread+0x126/0x3c0
[<ffffffffa18c5e61>] kthread+0xd1/0xe0
[<ffffffffa1f95ddd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19c0bc7>] filemap_fdatawait_keep_errors+0x27/0x30
[<ffffffffa1a7f4e5>] sync_inodes_sb+0x1e5/0x3d0
[<ffffffffa1a840b9>] sync_inodes_one_sb+0x19/0x20
[<ffffffffa1a52933>] iterate_supers+0xc3/0x120
[<ffffffffa1a84394>] sys_sync+0x44/0xb0
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19bd584>] filemap_fdatawait_range+0x14/0x30
[<ffffffffa19bd5c7>] filemap_fdatawait+0x27/0x30
[<ffffffffa19bfeac>] filemap_write_and_wait+0x4c/0x80
[<ffffffffc05173c9>] cifs_flush+0x49/0x90 [cifs]
[<ffffffffa1a4ba77>] filp_close+0x37/0x90
[<ffffffffa1a6fa2c>] __close_fd+0x8c/0xb0
[<ffffffffa1a4d5a3>] SyS_close+0x23/0x50
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa19bd3c1>] wait_on_page_bit+0x81/0xa0
[<ffffffffa19bd4f1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffffa19bd584>] filemap_fdatawait_range+0x14/0x30
[<ffffffffa19bd5c7>] filemap_fdatawait+0x27/0x30
[<ffffffffc05141f9>] cifs_oplock_break+0x369/0x3b0 [cifs]
[<ffffffffa18bde8f>] process_one_work+0x17f/0x440
[<ffffffffa18befa6>] worker_thread+0x126/0x3c0
[<ffffffffa18c5e61>] kthread+0xd1/0xe0
[<ffffffffa1f95ddd>] ret_from_fork_nospec_begin+0x7/0x21
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa1a7f3df>] sync_inodes_sb+0xdf/0x3d0
[<ffffffffa1a840b9>] sync_inodes_one_sb+0x19/0x20
[<ffffffffa1a52933>] iterate_supers+0xc3/0x120
[<ffffffffa1a84394>] sys_sync+0x44/0xb0
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
[<ffffffffa1a7f3df>] sync_inodes_sb+0xdf/0x3d0
[<ffffffffa1a840b9>] sync_inodes_one_sb+0x19/0x20
[<ffffffffa1a52933>] iterate_supers+0xc3/0x120
[<ffffffffa1a84394>] sys_sync+0x44/0xb0
[<ffffffffa1f95f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
---
Did you try a lazy unmount? For example, like this: sudo umount -t cifs -l /box
The force unmount will fail if you have an open connection, but a lazy unmount will detach the filesystem from the file hierarchy now and clean up all the references as soon as they are no longer in use. A warning from man umount: remounts of the share will not be possible!
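Putting that suggestion into a checkable sequence — a minimal sketch (the /box path is from the question; on a share that was already lazily detached, as here, both unmount calls are effectively no-ops):

```shell
# Force unmount fails while file handles on the share are still open:
umount -f /box 2>/dev/null || true
# Lazy unmount detaches /box from the hierarchy immediately; the kernel
# keeps the superblock alive internally until all references are dropped.
umount -l /box 2>/dev/null || true
# Verify the share is gone from the visible mount table:
grep -w /box /proc/mounts || echo "/box no longer mounted"
```

Note that a lazy detach only hides the mount; as the stack traces in the question show, tasks already blocked inside the CIFS superblock stay blocked until the server responds or the references are released.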