Fast single-thread performance and very high multi-thread throughput are exactly what you get with a CPU like Intel's Xeon E5-2699v4.
It's a 22-core Broadwell. The sustained clock speed is 2.2GHz with all cores active (e.g. video encoding), but the single-core max turbo is 3.6GHz.
So while running a parallel task, it spends its 145W power budget on 22 cores at roughly 6.6W each. But while running a task with only a few threads, that same power budget lets a few cores turbo up to 3.6GHz. (The lower single-core memory and L3-cache bandwidth in a big Xeon means it might not run as fast as a desktop quad-core at 3.6GHz, though. A single core in a desktop Intel CPU can use a lot more of the total memory bandwidth.)
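As a back-of-the-envelope sketch (assuming the TDP is simply divided evenly among active cores, which is only an approximation of how real power management works):

```python
# Rough per-core power budget for a 145W, 22-core chip.
# Assumes an even split across active cores, which is an
# illustrative simplification, not Intel's actual power model.
tdp_watts = 145
total_cores = 22

per_core_all_active = tdp_watts / total_cores
print(f"{per_core_all_active:.1f} W per core with all {total_cores} active")
# → 6.6 W per core with all 22 active

# With only 2 cores active, each one can burn far more power,
# which is what lets them turbo from 2.2 GHz toward 3.6 GHz.
per_core_two_active = tdp_watts / 2
print(f"{per_core_two_active:.1f} W per core with 2 active")
# → 72.5 W per core with 2 active
```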
The 2.2GHz rated clock speed is that low because of thermal limits. The more cores a CPU has, the slower they have to run when they're all active. This effect isn't very big in the 4 and 8 core CPUs you mention in the question, because 8 isn't that many cores, and they have very high power budgets. Even enthusiast desktop CPUs noticeably show this effect: Intel's Skylake-X i9-7900X is a 10c20t part with base 3.3GHz, max turbo 4.5GHz. That's much more single-core turbo headroom than i7-6700k (4.0GHz sustained / 4.2GHz turbo without overclocking).
Frequency/voltage scaling (DVFS) allows the same core to operate over a wide range of the performance / efficiency curve. See also this IDF2015 presentation on Skylake power management, with lots of interesting details about what CPUs can do efficiently, and trading off performance vs. efficiency both statically at design time, and on the fly with DVFS.
At the other end of the spectrum, Intel Core-M CPUs have very low sustained frequency, like 1.2GHz at 4.5W, but can turbo up to 2.9GHz. With multiple cores active, they'll run their cores at a more efficient clock-speed, just like the giant Xeons.
You don't need a heterogeneous big.LITTLE style architecture to get most of the benefit. The small cores in ARM big.LITTLE are pretty crappy in-order cores that aren't good for compute work. The point is just to run a UI with very low power. Lots of them would not be great for video encoding or other serious number crunching. (@Lưu Vĩnh Phúc found some discussions about why x86 doesn't have big.LITTLE. Basically, spending extra silicon on a very-low-power extra-slow core wouldn't be worth it for typical desktop/laptop usage.)
> whereas applications like video editing are determined by number of cores. [Wouldn't 2x 4.0 GHz + 4x 2.0 GHz be better at multi-threaded workloads than 4x 4GHz?]
This is your key misunderstanding. You seem to be thinking that the same number of total clock ticks per second is more useful if spread over more cores. That's never the case. It's more like

    cores * perf_per_core * (scaling efficiency)^cores

(`perf_per_core` is not the same thing as clock speed, because a 3GHz Pentium 4 will get a lot less work per clock cycle than a 3GHz Skylake.)
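To make that concrete, here's a toy version of the model above. The per-core scaling efficiency of 0.95 is a made-up number for illustration, and perf_per_core is taken as proportional to clock speed, which only holds for identical microarchitectures:

```python
# Toy throughput model: total per-core performance times a scaling
# penalty that compounds with core count. Efficiency 0.95 is an
# illustrative assumption, not a measured value for any workload.
def throughput(core_ghz, eff=0.95):
    """core_ghz: list of per-core clock speeds in GHz."""
    n = len(core_ghz)
    return sum(core_ghz) * eff ** n

four_fast = throughput([4.0] * 4)                 # 4x 4.0 GHz
mixed = throughput([4.0] * 2 + [2.0] * 4)         # 2x 4.0 + 4x 2.0 GHz

# Same 16 GHz of total clock ticks, but the 4-core config wins
# because the scaling penalty compounds over fewer cores.
print(f"{four_fast:.1f} vs {mixed:.1f}")  # → 13.0 vs 11.8
```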
More importantly, it's very rare for the efficiency to be 1.0. Some embarrassingly parallel tasks do scale almost linearly (e.g. compiling multiple source files). But video encoding is not like that. For x264, scaling is very good up to a few cores, but gets worse with more cores: going from 1 to 2 cores will almost double the speed, but going from 32 to 64 cores will help much, much less for a typical 1080p encode. The point at which speed plateaus depends on the settings. (`-preset veryslow` does more analysis on each frame, and can keep more cores busy than `-preset fast`.)
With lots of very slow cores, the single-threaded parts of x264 would become bottlenecks. (e.g. the final CABAC bitstream encoding. It's h.264's equivalent of gzip, and doesn't parallelize.) Having a few fast cores would solve that, if the OS knew how to schedule for it (or if x264 pinned the appropriate threads to fast cores).
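An Amdahl's-law sketch shows why a serial stage like that caps the benefit of piling on slow cores. The 10% serial fraction here is an assumption for illustration, not a measured x264 number:

```python
# Amdahl's law: if a fraction of the work is serial, speedup is
# capped no matter how many cores you add. The 10% serial fraction
# is an illustrative assumption, not a measured x264 figure.
def speedup(parallel_frac, n_cores):
    serial_frac = 1 - parallel_frac
    return 1 / (serial_frac + parallel_frac / n_cores)

for n in (2, 4, 16, 64):
    print(n, round(speedup(0.90, n), 2))
# Even with 64 cores, speedup tops out around 8.8x, because the
# serial stage runs no faster. A few fast cores attack exactly
# that bottleneck.
```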
x265 can take advantage of more cores than x264, since it has more analysis to do, and h.265's Wavefront Parallel Processing (WPP) design allows more encode and decode parallelism. But even for 1080p, you run out of parallelism to exploit at some point.
If you have multiple videos to encode, doing multiple videos in parallel scales well, except for competition for shared resources like L3 cache capacity and bandwidth, and memory bandwidth. Fewer faster cores could get more benefit from the same amount of L3 cache, since they wouldn't need to work on so many different parts of the problem at once.