You are quite possibly asking more of the hardware than it can deliver.
Too many CPUs are simultaneously trying to queue and to handle work,
more than the physical hardware can actually service.
Your heavily multi-threaded software is spending its time waiting for ... itself,
because its threads contend for shared resources. These might be shared memory,
a shared server, or even the disk.
native_queued_spin_lock_slowpath is the slow path of the Linux kernel's
queued spin-lock: the code a CPU runs when the lock it wants is contended.
Such a lock should "spin" only briefly and only occasionally, but yours
are doing it a lot.
CPU time spent "spinning" is entirely wasted: the CPU is busy, but it does no useful work.
You only need to dedicate enough CPUs to handle the task.
Dedicating more accomplishes nothing of value if the extra CPUs merely
wait for one another, especially since here they are literally burning
CPU time in a spin-lock when they should be doing something useful.
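
As a rough illustration of "only enough CPUs", here is a minimal Python sketch that clamps a requested worker count to the number of CPUs the process may actually run on. The helper names usable_cpu_count and bounded_workers are hypothetical, and the thread pool merely stands in for whatever worker mechanism your software really uses.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def usable_cpu_count():
    """Return how many CPUs this process is allowed to run on."""
    # os.sched_getaffinity is not available on every platform,
    # so fall back to the total CPU count when it is missing.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def bounded_workers(requested, cap=None):
    """Clamp a requested worker count to the range [1, cap]."""
    if cap is None:
        cap = usable_cpu_count()
    return max(1, min(requested, cap))


if __name__ == "__main__":
    # Asking for 256 workers still yields at most one per usable CPU.
    workers = bounded_workers(256)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda n: n * n, range(8)))
    print(workers, results)
```

The right cap may well be lower than the CPU count if the workers all contend for the same lock, so treat the number as something to measure and tune rather than a fixed rule.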
You should reduce the number of CPUs that you use. If you really need
the performance, you can use "affinity" rules to control which CPUs
each piece of work is allowed to run on.
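
One way to set such an affinity rule from inside a program, sketched in Python under the assumption that you are on Linux (os.sched_setaffinity does not exist on every platform). The helper name pin_to_cpus is hypothetical, and the CPU indices in the usage comment are only an example.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict process `pid` (0 means the caller) to the given CPU indices.

    Returns the affinity set that is in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not supported on this platform")
    # Only keep CPUs the process is currently allowed to use; asking for
    # others would make the underlying system call fail.
    allowed = os.sched_getaffinity(pid)
    wanted = set(cpus) & allowed
    if not wanted:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, wanted)
    return os.sched_getaffinity(pid)


# Example (assumes CPUs 0 and 1 exist and are available to the process):
# pin_to_cpus([0, 1])
```

Affinity only limits where the work may run. It does not remove the contention itself, so measure before and after to confirm that the time spent spinning actually drops.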
See also the post "Why having more and faster cores makes my multithreaded software slower?"