
I have a massive problem with the Linux page cache, which slows down IO. For instance, if I copy an LVM partition with dd, Linux caches the data in buffers or cache (visible in free -m). That in itself is not the problem, but once the buffer reaches a certain value, the copy process stops or slows down to a few MB/s or even KB/s. I have done many tests writing to disk or to /dev/null; the problem has nothing to do with the source drive or the destination.

In Detail:

  • There are two nearly identical servers, both running CentOS 6.5 with the same kernel. They have the same disks, the same setup, the same hardware in every other respect. The only difference is that one server has 2 CPUs and 64 GB RAM and the other has 1 CPU and 32 GB RAM.
  • Here is also an image of the copy process described below: https://i.sstatic.net/tYlym.jpg
  • Here is a newer version that also shows meminfo. The meminfo is from a different run, so the values are not identical, but it is exactly the same behavior: https://i.sstatic.net/4SIJG.jpg
  • Start a copy with dd or another filesystem copy program.
  • Buffers/cache start to fill. All is fine.
  • Buffers/cache reach a maximum value (on the 64 GB RAM server a value like 32 GB or 17 GB; on the 32 GB RAM server all free memory).
  • On the 64 GB RAM server the copy process now stops or is limited to a few MB/s. On the 32 GB RAM server all is fine.
  • On the 64 GB RAM server I can solve the problem for a short moment by flushing the cache with "sync; echo 3 > /proc/sys/vm/drop_caches". But of course the buffer starts to grow again instantly and the problem occurs again. (A rough sketch of how I reproduce and observe this follows after this list.)
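
Roughly what I run to reproduce and observe it; the source volume and the target below are just placeholders for my actual setup:

    # Source LV and target are placeholders; substitute real paths.
    SRC=/dev/vg0/lv_example
    DST=/dev/null                 # or a real target disk/file

    # Start the copy in the background.
    dd if="$SRC" of="$DST" bs=1M &

    # Watch buffers/cache and dirty pages grow until the copy stalls.
    watch -n 1 'free -m; grep -E "^(Dirty|Writeback):" /proc/meminfo'

    # Temporary "fix": flush the page cache (as root).
    sync; echo 3 > /proc/sys/vm/drop_caches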

Conclusion:

The problem has something to do either with the second CPU or with the total amount of memory. I have the "feeling" that the cause could be that each CPU has its own 32 GB of RAM and the copy process runs on only one CPU. So eventually the copy process increases the buffers/cache to nearly 32 GB, or to the unused memory of the other CPU, and then Linux thinks "hey, there is still memory, so let's increase the buffer further", but the hardware below can't access that memory, or something like that.

Does anybody have an idea or a solution? Sure, I can use dd with the direct flag, but that doesn't solve the problem, because there is also external access via Samba and so on.
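
For reference, by "direct flag" I mean dd's O_DIRECT mode, which bypasses the page cache for that one copy (the device paths here are placeholders); it obviously does nothing for the Samba clients:

    # Bypass the page cache on both source and destination (paths are placeholders).
    dd if=/dev/vg0/lv_example of=/dev/sdX bs=1M iflag=direct oflag=direct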

EDIT1:

Here is also the /proc/zoneinfo from the 64 GB RAM server: 1. http://pastebin.com/uSnpQbeD (before dd starts) 2. http://pastebin.com/18YVTfdb (when dd stops working)

EDIT2:

  • VM settings: http://pastebin.com/U9E9KkFS
  • /proc/sys/vm/zone_reclaim_mode was 0 on the 32 GB RAM server and 1 on the 64 GB RAM server. I never touched these values; the installer set them. I temporarily changed it to 0 and retried the test. Now all memory is used for buffers and cache, so it looks great and behaves like the other server. But then it instantly starts swapping at full speed... I set swappiness to 0. That helps, but it still swaps a few MB per second, and it increases the buffers every second. So it doesn't swap the buffers, it swaps the memory of the VMs to get more memory to grow the buffers... crazy. But maybe this is normal!? (The commands I used for this test are sketched below.)
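
What I changed for that test, roughly (run as root):

    # Check the current values on both servers.
    cat /proc/sys/vm/zone_reclaim_mode
    cat /proc/sys/vm/swappiness

    # Temporarily switch off per-node reclaim and lower swappiness.
    echo 0 > /proc/sys/vm/zone_reclaim_mode
    echo 0 > /proc/sys/vm/swappiness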

EDIT3:

/proc/buddyinfo and numactl --hardware: http://pastebin.com/0PmXxxin

FINAL RESULT

  • Setting /proc/sys/vm/zone_reclaim_mode to 0 is for sure the technically right way, but the machine did not work very well after it. For instance: if I copy a disk, Linux now uses 100% of the free memory for buffers (not, as before, only X GB and then stopping). But the second the last free memory is used for buffers, Linux starts swapping out VM memory to increase the total amount of buffers and cache. Swap is normally not needed on my system, so the swap space is on the same disk as some VMs. As a result, if I make a backup of these VMs, Linux writes the swap at the same time as I read from the disk for the backup. So it is bad that it swaps out the VMs, but it is even worse that Linux destroys my backup read speed... So setting /proc/sys/vm/zone_reclaim_mode to 0 doesn't solve the whole problem... Currently I run, in a screen session, a script that syncs and flushes the cache every 10 seconds (a sketch of the script follows after this list). Not nice, but it works much better for me. I have no webserver or normal file server on the system; I only run VMs, make backups and store backups via Samba. I don't like the solution.
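
Roughly the workaround script I run in the screen session (the 10-second interval is just what happens to work here):

    #!/bin/bash
    # Crude workaround: flush the page cache every 10 seconds so the
    # buffers never grow far enough to push VM memory into swap (run as root).
    while true; do
        sync
        echo 3 > /proc/sys/vm/drop_caches
        sleep 10
    done
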
  • Run these tests on both hosts again, and provide the output of /proc/meminfo. Commented May 16, 2014 at 17:28
  • Oh, and a cursory view suggests the 64GB server is using at least 28G of memory for other things, such as active programs. Commented May 16, 2014 at 17:31
  • Here is an updated screenshot: i.sstatic.net/4SIJG.jpg The new values are from a new run, but it is still the same problem. Yes, the 64 GB server runs several KVM VMs, but that's not the point. Nobody is working at the moment, and I also ran tests with all VMs shut down.
    – user219392
    Commented May 16, 2014 at 18:41
  • Can you provide the output of /proc/zoneinfo too? Be warned, it's quite a large amount of text. Commented May 16, 2014 at 18:55
  • pastebin.com/uSnpQbeD (before dd starts) pastebin.com/18YVTfdb (when dd stops working)
    – user219392
    Commented May 16, 2014 at 19:04

1 Answer


The behaviour you are seeing is due to the way that Linux allocates memory on a NUMA system.

I am assuming (without knowing) that the 32 GB system is non-NUMA, or not NUMA enough for Linux to care.

How NUMA is dealt with is dictated by the /proc/sys/vm/zone_reclaim_mode option. By default, Linux will detect whether you are using a NUMA system and change the reclaim flags if it feels that would give better performance.

Memory is split up into zones; on a NUMA system there is a zone for the first CPU socket and a zone for the second. These come up as node0 and node1. You can see them if you cat /proc/buddyinfo.
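
You can inspect the node layout with something like the following (node numbers and sizes will of course differ on your hardware):

    # Per-node, per-zone free page counts.
    cat /proc/buddyinfo

    # NUMA topology: which CPUs and how much memory belong to each node.
    numactl --hardware

    # Per-node memory statistics for the first node.
    cat /sys/devices/system/node/node0/meminfo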

When zone reclaim mode is set to 1, an allocation from the first CPU socket will cause reclaim to occur on the memory zone associated with that CPU, because it is more efficient in terms of performance to reclaim from the local NUMA node. Reclaim in this sense means dropping pages, such as clearing the cache, or swapping stuff out on that node.

Setting the value to 0 causes no reclaim to occur when the zone is filling up; instead, memory is allocated from the foreign NUMA zones. This comes at the cost of a brief lock on the other CPU to gain exclusive access to that memory zone.
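
If you want to try it, something along these lines should do (as root); making it persistent in /etc/sysctl.conf is optional:

    # Change at runtime.
    echo 0 > /proc/sys/vm/zone_reclaim_mode
    # ...or equivalently:
    sysctl -w vm.zone_reclaim_mode=0

    # To keep it across reboots, add this line to /etc/sysctl.conf:
    #   vm.zone_reclaim_mode = 0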

But then it instantly starts swapping! After a few seconds: Mem: 66004536k total, 65733796k used, 270740k free, 34250384k buffers Swap: 10239992k total, 1178820k used, 9061172k free, 91388k cached

Swapping behaviour and when to swap is determined by a few factors, one being how active the pages are that have been allocated to applications. If they are not very active, they will be swapped out in favour of the busier work occurring in the cache. I assume that the pages in your VMs don't get activated very often.
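
You can watch when this kicks in by keeping an eye on the swap-in/swap-out rates while the copy runs, for example:

    # si/so columns show memory swapped in/out per second.
    vmstat 1

    # Overall swap usage at a glance.
    grep -i swap /proc/meminfo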

  • Thanks for your detailed answer! I would upvote it! I think I understand your text. I built the server myself, so I know that 2x16 GB DIMMs belong to each CPU socket on the mainboard. The other server has the same mainboard, but only one CPU. I understand the split and locality idea. But I have to say that the NUMA optimizer doesn't work very well, so I can disable it, okay. Maybe now all is normal. I can't try it at the moment, because I have no VMs on the other server. Yes, the pages don't get activated often, because the VMs have a lot of memory and are idle. But even with swappiness 0 the system swaps for buffers.
    – user219392
    Commented May 16, 2014 at 22:40
  • Setting swappiness to 0 doesn't disable swapping. It makes it much less likely to swap anonymous (normal) memory, and when it does swap, far less of it. Commented May 16, 2014 at 22:42
  • Yes, I know, but I can't do more. Maybe all is fine now and I just don't like the swapping for buffers, which makes no sense if you make a backup on a VM server. But sure, I can use direct mode for that. Thanks! I will watch both servers over the next few days and see what happens. This is for sure much better than the complete IO stop of the whole server ;-)
    – user219392
    Commented May 16, 2014 at 22:47
  • @MatthewIfe, you mentioned that if zone reclaim mode is set to 0 and the zone is filling up, instead of reclaiming it will allocate from a foreign NUMA zone. But what happens if the foreign NUMA zones are also full? Would it then try to reclaim from its local zone?
    – Sobhan
    Commented Jul 27, 2018 at 12:37

