It really depends on the implementation.
A very important part of optimization is to utilize the caches efficiently. While L1 and L2 are typically per core, L3 is often shared. In the absolute worst case, the increased number of memory requests could cause L3 cache lines to be evicted just before they are needed again.
A well-optimized implementation tries to load a chunk of data that fits into the L1 cache and do as much computation as possible on it before loading the next chunk, and ideally does the same at the L2 level. This typically results in almost unreadable code, but it can give huge performance gains.
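Just to illustrate the idea, here is a minimal sketch of cache blocking in Python/NumPy. Real BLAS libraries do this in hand-tuned C and assembly, and the `block` size below is an arbitrary placeholder you would have to tune per cache level; the point is only that each tile gets reused many times before the loops move on to new memory.

```python
import numpy as np

def blocked_matmul(a, b, block=256):
    """Compute a @ b one tile at a time (illustration of cache blocking)."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Each small tile product reuses the data just loaded
                # many times before the loops touch new memory.
                c[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return c
```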
So the first thing I would do is check the library. I would expect common and popular libraries like numpy, or libraries whose main selling point is performance, like Intel Performance Primitives, to perform well. I would expect internal or smaller libraries to do worse, since this kind of optimization requires a huge time investment.
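For numpy specifically, it is worth checking which BLAS/LAPACK backend it was built against, since that backend is what actually does the matrix multiplication:

```python
import numpy as np

# Prints the BLAS/LAPACK backend numpy was built against
# (e.g. OpenBLAS or MKL, which are well tuned, vs. a plain
# reference BLAS, which is not).
np.show_config()
```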
Workstation-class processors typically also scale the amount of cache and the number of memory channels with the core count. If you have a 40-core processor, you likely have far more bandwidth available than a single core will use; 4-8 cores per memory channel seems fairly common today.
A well-optimized library should make good use of the resources available. In some cases manual tuning parameters might help, but that will likely be a trial-and-error process. But multiplying huge matrices is just fundamentally slow, so tuning will likely not solve your problem of poor scaling.
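One cheap trial-and-error experiment, assuming you are on numpy: sweep the number of BLAS threads with the third-party threadpoolctl package and see where scaling flattens out on your machine. The matrix size and thread counts below are arbitrary placeholders.

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

# Arbitrary test matrices; pick a size representative of your workload.
a = np.random.rand(4096, 4096)
b = np.random.rand(4096, 4096)

for n_threads in (1, 2, 4, 8, 16):
    # Temporarily cap the threads used by numpy's BLAS backend.
    with threadpool_limits(limits=n_threads):
        t0 = time.perf_counter()
        a @ b
        dt = time.perf_counter() - t0
    print(f"{n_threads:2d} threads: {dt:.2f} s")
```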
It can also help to watch numastat (on Linux) and whatever perf counters your OS/CPU offer, to get an idea of what changes as your dataset grows.