Good question, or at least one with an interesting answer. Part of this answer imagines a world where CPUs could scale efficiently in width instead of with multiple separate cores. Licensing / price models would be different!
The rest explains why they can't. Summary:
The cost of multiple cores scales close to linearly (in die area / power). Efficient interconnects between cores get harder as you add more cores, needing more aggregate bandwidth to use them all usefully. And they're all part of the same cache-coherency domain.
The cost of widening 1 core's superscalar pipeline scales ~quadratically. This is doable with enough brute force, up to a point anyway. Single-threaded performance is very important for interactive use (end-to-end latency matters, not just throughput), so current big-core high-end CPUs pay that price. e.g. Skylake (4-wide), Ryzen (5 or 6-wide), and Apple's A12 (7-wide for the big cores, 3-wide for the small energy-efficient cores).
Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are still hard problems, and stall the whole pipeline.
You didn't mention frequency, just IPC, but scaling frequency is hard too. Higher frequency requires higher voltage, so power scales roughly with frequency cubed: one factor from frequency directly, and two from voltage, since dynamic power is roughly C·V^2·f. (Capacitor stored energy scales with V^2, and most of the dynamic power beyond leakage current is from pumping charge into the capacitive loads of FET gates + wires.)
Performance = frequency times IPC, within the same ISA. (Wider SIMD lets you get the same work done with fewer instructions, and some ISAs are denser than others; e.g. MIPS often takes more instructions to do the same work than x86 or AArch64.)
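To make those two scaling rules concrete, here's a tiny back-of-envelope sketch in C (all numbers are made up for illustration, not measurements of any real CPU):

```c
// Back-of-envelope sketch with made-up numbers (not measurements):
//   performance   ~ frequency * IPC
//   dynamic power ~ C * V^2 * f, and V has to rise roughly with f, so ~f^3
#include <stdio.h>
#include <math.h>

int main(void) {
    double base_ghz = 3.0, ipc = 3.0;              // hypothetical baseline core
    for (double scale = 1.0; scale <= 1.51; scale += 0.25) {
        double ghz   = base_ghz * scale;
        double gips  = ghz * ipc;                  // billions of instructions per second
        double power = pow(scale, 3.0);            // dynamic power relative to baseline
        printf("%.2f GHz: %4.1f GIPS, ~%.2fx baseline dynamic power\n",
               ghz, gips, power);
    }
    // e.g. +50% frequency buys +50% performance for ~3.4x the dynamic power.
    return 0;
}
```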
Costs are in die-area (manufacturing cost) and/or power (which indirectly limits frequency because cooling is hard). Also, low power and high performance per watt are goals in themselves, especially for mobile (battery) and servers (power density / cooling costs / electricity costs).
Before multi-core per socket was a thing, you did have multi-socket systems for high-end use-cases where you wanted more throughput than a single manufacturable CPU could deliver, so those were the only SMP systems (servers, high-end workstations).
If a single core could scale as efficiently as you wished, we'd have systems with 1 physical core per socket, and SMT (e.g. HyperThreading) to let them act as multiple logical cores. Typical desktops / laptops would only have 1 physical core, and we wouldn't struggle to parallelize things that don't scale linearly with more cores, e.g. make -j4 to take advantage of multi-socket servers, and/or to hide I/O latency on a desktop. (Or maybe we'd still try to parallelize a lot if pipeline width scaled easily but IPC didn't, so we had to use more SMT threads.) Your OS kernel would still need to run across all logical cores, unless the way the CPU presented SMT to the OS was very different, so parallel scheduling algorithms and locking would still be needed there.
Donald Knuth said in a 2008 interview:
I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!
Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them, with dual-socket systems only when it was worth paying much more for more throughput (not single-threaded performance).
Multiple CPUs reduce context-switch costs when multiple programs are running (by letting them really run in parallel instead of rapidly switching between them); pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.
Physically it would be single core (for a simple cache hierarchy with no interconnects between cores) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.
So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.
But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.
Single-thread performance scaling
I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.
Yes, that's true. If it were possible to build such a CPU at all, it would be amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.
Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.
If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.
Specifically, the More Cores or Wider Cores? section covers the necessary background to understand this answer: it starts simple with how in-order pipelined CPUs work, then superscalar (multiple instructions per clock), then explains how we hit the power wall right around the P4 era, leading to the end of easy frequency scaling, leaving mostly just IPC and getting more work done per instruction (e.g. SIMD) as the path forward, even with smaller transistors.
Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power: wider parallel dependency checking (hazard detection), a wider out-of-order scheduler to find ready instructions to run, and more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).
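To see roughly where the width-squared term comes from, here's a back-of-envelope sketch (not a model of any real core; the operand counts are illustrative assumptions): each instruction in an issue group has to check its sources against the destinations of every older instruction in the same group, and a full bypass network needs a forwarding path from every result to every input.

```c
// Rough back-of-envelope: how intra-group hazard checks and bypass paths
// grow with issue width W. Illustrative only, not a model of any real core.
#include <stdio.h>

int main(void) {
    const int srcs_per_insn = 2;   // assumed average source operands per instruction
    for (int w = 2; w <= 16; w *= 2) {
        // each of the W instructions compares its sources against the
        // destinations of all older instructions in the same issue group
        int hazard_compares = srcs_per_insn * w * (w - 1) / 2;
        // full bypass network: every execution-port result can forward to
        // every execution-port input -> ~W^2 paths
        int bypass_paths = w * w;
        printf("width %2d: ~%3d hazard comparators, ~%3d bypass paths\n",
               w, hazard_compares, bypass_paths);
    }
    return 0;
}
```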
There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)
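As an illustration of why extra width goes unused when there's a dependency chain, here's a sketch in C (hypothetical functions, not from any real codebase): the serial reduction forms one long chain of adds, so IPC is capped by FP-add latency no matter how wide the core is, while splitting it into independent accumulators exposes ILP that out-of-order execution can actually use.

```c
#include <stddef.h>

// One long dependency chain: every add waits for the previous one,
// so throughput is ~1 add per FP-add latency regardless of core width.
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Four independent chains: a wide out-of-order core can keep 4 adds
// in flight per FP-add latency, so the extra width actually gets used.
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      // scalar cleanup for the leftover elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

(FP addition isn't associative, so the two versions can give slightly different results; that's exactly why compilers won't make this transformation for you without something like -ffast-math.)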
Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.
To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. (Not sure the math truly works like that, and it depends how branchy the workload is.)
But anyway, cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate. (No current x86-64 designs can handle more than 1 taken branch per clock, and even back to back taken branches are a bottleneck for the predictors in recent Intel, unless it's the same loop branch running repeatedly. So branchy code is likely to be a problem, vs. finding more ILP in blocks of straight-line code, e.g. in unrolled loops.)
Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, and its scheduler for non-executed uops holds 97 unfused-domain uops. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer.
So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.
And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.
If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.
Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!
Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.
But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.
Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.
But we can't have magical super-wide CPUs
So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Compilers generally aren't great at knowing when/how to use threads, other than for simple cases like very big loops (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
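For the easy "very big loops" case, here's a minimal sketch of what that coarse-grained parallelism looks like (assuming OpenMP; build with something like gcc -O3 -fopenmp):

```c
// Minimal OpenMP sketch: coarse-grained TLP over a big, easily-split loop.
// Hypothetical example; build with e.g.: gcc -O3 -fopenmp saxpy.c
void saxpy(float *restrict y, const float *restrict x, float a, long n) {
    // Each thread gets its own chunk of iterations; there's no
    // inter-thread communication inside the loop, which is exactly
    // what makes this case easy for a compiler / runtime to handle.
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The hard part is everything that doesn't look like this: loop-carried dependencies, work too small to amortize thread startup, or tasks that need frequent communication between threads.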
TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.
CPUs aimed at interactive workloads (like Intel / AMD x86, and Apple / ARM AArch64 high-end cores) definitely do push into the diminishing returns of IPC scaling, because single-threaded performance is still so valuable when latency matters, not just throughput for massively parallel problems.
Being able to run 8 copies of a game in parallel at 15fps each is much less valuable than being able to run one copy at 45fps. CPU vendors know this, and that's why modern CPUs do use out-of-order execution even though it costs significant power and die-area. (But GPUs don't because their workload is already massively parallel).
Intel's many-core Xeon Phi hardware (Knights Landing / Knights Mill) is an interesting half-way point: very limited out-of-order execution and SMT to keep 2-wide cores fed with AVX512 SIMD instructions to crunch numbers. The cores are based on Intel's low-power Silvermont architecture. (Out-of-order exec but with a small reordering window, much smaller than big-core Sandybridge-family. And a narrower pipeline.)
Update: hybrid / big.LITTLE designs, such as ARM's, or more recently Intel Alder Lake, have some performance cores for non-parallel work, and some efficiency cores (throughput-optimized, and cheap to wake up for easy but frequent stuff like decoding the next few frames of an audio file). See my answer on What are performance and efficiency cores in Intel's 12th Generation Alder lake CPU Line?
BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.
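As a sketch of what "more work per instruction" looks like, here's a hypothetical AVX version of the same kind of loop as the saxpy example above (assumes an AVX-capable x86 CPU; build with e.g. gcc -O3 -mavx):

```c
// SIMD = more work per instruction: AVX handles 8 floats per add/mul,
// so this retires ~1/8 as many arithmetic instructions for the same work.
#include <immintrin.h>
#include <stddef.h>

void saxpy_avx(float *y, const float *x, float a, size_t n) {
    __m256 va = _mm256_set1_ps(a);          // broadcast 'a' to all 8 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);   // 8 results at once
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                      // scalar cleanup for the tail
        y[i] = a * x[i] + y[i];
}
```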
Pricing models
Software pricing models are predicated on the current landscape of hardware.
Per-core licensing models became more widespread (and relevant even to single-socket desktops) with the advent of multi-core CPUs. Before that, it was only relevant for servers and big workstations.
If software didn't need multiple cores to run at top speed, there wouldn't really be a way to sell it cheaper to people who aren't getting as much benefit out of it because they run it on a weaker CPU. Unless maybe the software/hardware ecosystem evolved controls on "SMT channels" that let you configure a maximum execution width for code running on that logical core. (Again imagining a world where CPUs scale in pipeline width instead of multiple separate cores.)