Good question, or at least one with an interesting answer. Part of this answer imagines a world where CPUs could scale efficiently in width instead of with multiple separate cores. Licensing / price models would be different!
The rest explains why they can't. Summary:
The cost of multiple cores scales close to linearly (in die area / power). Efficient interconnects between cores get harder as you add more cores, needing more aggregate bandwidth to use them all usefully. And they're all part of the same cache-coherency domain.
The cost of widening 1 core's superscalar pipeline scales ~quadratically. This is doable with enough brute force, up to a point anyway. Single-threaded performance is very important for interactive use (end-to-end latency matters, not just throughput), so current big-core high-end CPUs pay that price. e.g. Skylake (4-wide), Ryzen (5 or 6-wide), and Apple's A12 (7-wide for the big cores, 3-wide for the small energy-efficient cores).
Serious diminishing IPC returns from just widening the pipeline beyond 3 or 4-wide, even with out-of-order execution to find the ILP. Branch misses and cache misses are still hard problems, and stall the whole pipeline.
You didn't mention frequency, just IPC, but scaling frequency is hard too. Higher frequency requires higher voltage, so power scales roughly with frequency cubed: one factor from frequency directly, and two from voltage, since dynamic power is roughly C·V^2·f. (Capacitor stored energy scales with V^2, and most of the dynamic power beyond leakage current is from pumping charge into the capacitive loads of FET gates + wires.)
Performance = frequency times IPC, within the same ISA. (Wider SIMD lets you get the same work done with fewer instructions, and some ISAs are denser than others; e.g. MIPS often takes more instructions to do the same work than x86 or AArch64.)
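To make those two scaling rules concrete, here's a tiny back-of-envelope sketch in C (all numbers are made up for illustration, not measurements of any real CPU):

```c
// Back-of-envelope sketch with made-up numbers (not measurements):
//   performance   ~ frequency * IPC
//   dynamic power ~ C * V^2 * f, and V has to rise roughly with f, so ~f^3
#include <stdio.h>
#include <math.h>

int main(void) {
    double base_ghz = 3.0, ipc = 3.0;              // hypothetical baseline core
    for (double scale = 1.0; scale <= 1.51; scale += 0.25) {
        double ghz   = base_ghz * scale;
        double gips  = ghz * ipc;                  // billions of instructions per second
        double power = pow(scale, 3.0);            // dynamic power relative to baseline
        printf("%.2f GHz: %4.1f GIPS, ~%.2fx baseline dynamic power\n",
               ghz, gips, power);
    }
    // e.g. +50% frequency buys +50% performance for ~3.4x the dynamic power.
    return 0;
}
```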
Costs are in die-area (manufacturing cost) and/or power (which indirectly limits frequency because cooling is hard). Also, low power and high performance per watt are goals in themselves, especially for mobile (battery) and servers (power density / cooling costs / electricity costs).
Before multi-core per socket was a thing, you did have multi-socket systems for high-end use-cases where you wanted more throughput than a single manufacturable CPU could deliver, so those were the only SMP systems (servers, high-end workstations).
If a single core could scale as efficiently as you wished, we'd have systems with 1 physical core per socket, and SMT (e.g. HyperThreading) to let them act as multiple logical cores. Typical desktops / laptops would only have 1 physical core, and we wouldn't struggle to parallelize things that don't scale linearly with more cores, e.g. make -j4 to take advantage of multi-socket servers, and/or to hide I/O latency on a desktop. (Or maybe we'd still try to parallelize a lot if pipeline width scaled easily but IPC didn't, so we had to use more SMT threads.) Your OS kernel would still need to run across all logical cores, unless the way the CPU presented SMT to the OS was very different, so parallel scheduling algorithms and locking would still be needed there.
Donald Knuth said in a 2008 interview:
I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks!
Yes, if we could have miracle single-core CPUs with 8x the throughput on real programs, we'd probably still be using them, with dual-socket systems only when it was worth paying much more for more throughput (not single-threaded performance).
Multiple CPUs reduce context-switch costs when multiple programs are running (by letting them really run in parallel instead of rapidly switching between them); pre-emptive multitasking interrupting the massive out-of-order machinery such a CPU would require would probably hurt even more than it does now.
Physically it would be single core (for a simple cache hierarchy with no interconnects between cores) but support SMT (e.g. Intel's HyperThreading) so software could use it as 8 logical cores that dynamically compete for throughput resources. Or when only 1 thread is running / not stalled, it would get the full benefit.
So you'd use multiple threads when that was actually easier/natural (e.g. separate processes running at once), or for easily-parallelized problems with dependency chains that would prevent maxing out the IPC of this beast.
But unfortunately it's wishful thinking on Knuth's part that multi-core CPUs will ever stop being a thing at this point.
Single-thread performance scaling
I think if they made a 1 core equivalent of an 8 core CPU, that one core would have 800% increase in IPC so you would get the full performance in all programs, not just those that are optimized for multiple cores.
Yes, that's true. If it were possible to build such a CPU at all, it would be amazing. But I think it's literally impossible on the same semiconductor manufacturing process (i.e. same quality / efficiency of transistors). It's certainly not possible with the same power budget and die area as an 8-core CPU, even though you'd save on logic to glue cores together, and wouldn't need as much space for per-core private caches.
Even if you allow frequency increases (since the real criterion is work per second, not work per clock), making even a 2x faster CPU would be a huge challenge.
If it were possible at anywhere near the same power and die-area budget (thus manufacturing cost) to build such a CPU, yes CPU vendors would already be building them that way.
Specifically, the More Cores or Wider Cores? section covers the necessary background to understand this answer: it starts simple with how in-order pipelined CPUs work, then superscalar (multiple instructions per clock), then explains how we hit the power wall right around the P4 era, leading to the end of easy frequency scaling, leaving mostly just IPC and getting more work done per instruction (e.g. SIMD) as the path forward, even with smaller transistors.
Making a pipeline wider (max instructions per clock) typically scales in cost as width-squared. That cost is measured in die area and/or power: wider parallel dependency checking (hazard detection), a wider out-of-order scheduler to find ready instructions to run, and more read / write ports on your register file and cache if you want to run instructions other than nop. Especially if you have 3-input instructions like FMA or add-with-carry (2 registers + flags).
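To see roughly where the width-squared term comes from, here's a back-of-envelope sketch (not a model of any real core; the operand counts are illustrative assumptions): each instruction in an issue group has to check its sources against the destinations of every older instruction in the same group, and a full bypass network needs a forwarding path from every result to every input.

```c
// Rough back-of-envelope: how intra-group hazard checks and bypass paths
// grow with issue width W. Illustrative only, not a model of any real core.
#include <stdio.h>

int main(void) {
    const int srcs_per_insn = 2;   // assumed average source operands per instruction
    for (int w = 2; w <= 16; w *= 2) {
        // each of the W instructions compares its sources against the
        // destinations of all older instructions in the same issue group
        int hazard_compares = srcs_per_insn * w * (w - 1) / 2;
        // full bypass network: every execution-port result can forward to
        // every execution-port input -> ~W^2 paths
        int bypass_paths = w * w;
        printf("width %2d: ~%3d hazard comparators, ~%3d bypass paths\n",
               w, hazard_compares, bypass_paths);
    }
    return 0;
}
```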
There are also diminishing IPC returns for making CPUs wider; most workloads have limited small-scale / short-range ILP (Instruction-Level Parallelism) for CPUs to exploit, so making the core wider doesn't increase IPC (instructions per clock) if IPC is already limited to less than the width of the core by dependency chains, branch misses, cache misses, or other stalls. Sure you'd get a speedup in some unrolled loops with independent iterations, but that's not what most code spends most of its time doing. Compare/branch instructions make up 20% of the instruction mix in "typical" code, IIRC. (I think I've read numbers from 15 to 25% for various data sets.)
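As an illustration of why extra width goes unused when there's a dependency chain, here's a sketch in C (hypothetical functions, not from any real codebase): the serial reduction forms one long chain of adds, so IPC is capped by FP-add latency no matter how wide the core is, while splitting it into independent accumulators exposes ILP that out-of-order execution can actually use.

```c
#include <stddef.h>

// One long dependency chain: every add waits for the previous one,
// so throughput is ~1 add per FP-add latency regardless of core width.
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Four independent chains: a wide out-of-order core can keep 4 adds
// in flight per FP-add latency, so the extra width actually gets used.
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      // scalar cleanup for the leftover elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

(FP addition isn't associative, so the two versions can give slightly different results; that's exactly why compilers won't make this transformation for you without something like -ffast-math.)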
Also, a cache miss that stalls all dependent instructions (and then everything once ROB capacity is reached) costs more for a wider CPU. (The opportunity cost of leaving more execution units idle; more potential work not getting done.) Or a branch miss similarly causes a bubble.
To get 8x the IPC, we'd need at least an 8x improvement in branch-prediction accuracy and in cache hit rates. (Not sure the math truly works like that, and it depends how branchy the workload is.)
But anyway, cache hit rates don't scale well with cache capacity past a certain point for most workloads. And HW prefetching is smart, but can't be that smart. And at 8x the IPC, the branch predictors need to produce 8x as many predictions per cycle as well as having them be more accurate. (No current x86-64 designs can handle more than 1 taken branch per clock, and even back to back taken branches are a bottleneck for the predictors in recent Intel, unless it's the same loop branch running repeatedly. So branchy code is likely to be a problem, vs. finding more ILP in blocks of straight-line code, e.g. in unrolled loops.)
Current techniques for building out-of-order execution CPUs can only find ILP over short ranges. For example, Skylake's ROB size is 224 fused-domain uops, and its scheduler for non-executed uops holds 97 unfused-domain uops. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for a case where scheduler size is the limiting factor in extracting ILP from 2 long chains of instructions, if they get too long. And/or see this more general and introductory answer.
So finding ILP between two separate long loops is not something we can do with hardware. Dynamic binary-recompilation for loop fusion could be possible in some cases, but hard and not something CPUs can really do unless they go the Transmeta Crusoe route. (x86 emulation layer on top of a different internal ISA; in that case VLIW). But standard modern x86 designs with uop caches and powerful decoders aren't easy to beat for most code.
And outside of x86, all ISAs still in use are relatively easy to decode, so there's no motivation for dynamic-recompilation other than long-distance optimizations. TL:DR: hoping for magic compilers that can expose more ILP to the hardware didn't work out for Itanium IA-64, and is unlikely to work for a super-wide CPU for any existing ISA with a serial model of execution.
If you did have a super-wide CPU, you'd definitely want it to support SMT so you can keep it fed with work to do by running multiple low-ILP threads.
Since Skylake is currently 4 uops wide (and achieves a real IPC of 2 to 3 uops per clock, or even closer to 4 in high-throughput code), a hypothetical 8x wider CPU would be 32-wide!
Being able to carve that back into 8 or 16 logical CPUs that dynamically share those execution resources would be fantastic: non-stalled threads get all the front-end bandwidth and back-end throughput.
But with 8 separate cores, when a thread stalls there's nothing else to keep the execution units fed; the other threads don't benefit.
Execution is often bursty: it stalls waiting for a cache miss load, then once that arrives many instructions in parallel can use that result. With a super-wide CPU, that burst can go faster, and it can actually help with SMT.
But we can't have magical super-wide CPUs
So to gain throughput we instead have to expose parallelism to the hardware in the form of thread-level parallelism. Compilers generally aren't great at knowing when/how to use threads, other than for simple cases like very big loops (OpenMP, or gcc's -ftree-parallelize-loops). It still takes human cleverness to rework code to efficiently get useful work done in parallel, because inter-thread communication is expensive, and so is thread startup.
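For the easy "very big loops" case, here's a minimal sketch of what that coarse-grained parallelism looks like (assuming OpenMP; build with something like gcc -O3 -fopenmp):

```c
// Minimal OpenMP sketch: coarse-grained TLP over a big, easily-split loop.
// Hypothetical example; build with e.g.: gcc -O3 -fopenmp saxpy.c
void saxpy(float *restrict y, const float *restrict x, float a, long n) {
    // Each thread gets its own chunk of iterations; there's no
    // inter-thread communication inside the loop, which is exactly
    // what makes this case easy for a compiler / runtime to handle.
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The hard part is everything that doesn't look like this: loop-carried dependencies, work too small to amortize thread startup, or tasks that need frequent communication between threads.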
TLP is coarse-grained parallelism, unlike the fine-grained ILP within a single thread of execution which HW can exploit.
CPUs aimed at interactive workloads (like Intel / AMD x86, and Apple / ARM AArch64 high-end cores) definitely do push into the diminishing returns of IPC scaling, because single-threaded performance is still so valuable when latency matters, not just throughput for massively parallel problems.
Being able to run 8 copies of a game in parallel at 15fps each is much less valuable than being able to run one copy at 45fps. CPU vendors know this, and that's why modern CPUs do use out-of-order execution even though it costs significant power and die-area. (But GPUs don't because their workload is already massively parallel).
Intel's many-core Xeon Phi hardware (Knights Landing / Knights Mill) is an interesting half-way point: very limited out-of-order execution and SMT to keep 2-wide cores fed with AVX512 SIMD instructions to crunch numbers. The cores are based on Intel's low-power Silvermont architecture. (Out-of-order exec but with a small reordering window, much smaller than big-core Sandybridge-family. And a narrower pipeline.)
Update: hybrid / big.LITTLE designs, such as ARM's, or more recently Intel Alder Lake, have some performance cores for non-parallel work, and some efficiency cores (throughput-optimized, and cheap to wake up for easy but frequent stuff like decoding the next few frames of an audio file). See my answer on What are performance and efficiency cores in Intel's 12th Generation Alder lake CPU Line?
BTW, all this is orthogonal to SIMD. Getting more work done per instruction always helps, if it's possible for your problem.
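As a sketch of what "more work per instruction" looks like, here's a hypothetical AVX version of the same kind of loop as the saxpy example above (assumes an AVX-capable x86 CPU; build with e.g. gcc -O3 -mavx):

```c
// SIMD = more work per instruction: AVX handles 8 floats per add/mul,
// so this retires ~1/8 as many arithmetic instructions for the same work.
#include <immintrin.h>
#include <stddef.h>

void saxpy_avx(float *y, const float *x, float a, size_t n) {
    __m256 va = _mm256_set1_ps(a);          // broadcast 'a' to all 8 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);   // 8 results at once
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                      // scalar cleanup for the tail
        y[i] = a * x[i] + y[i];
}
```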
Pricing models
Software pricing models are predicated on the current landscape of hardware.
Per-core licensing models became more widespread (and relevant even to single-socket desktops) with the advent of multi-core CPUs. Before that, it was only relevant for servers and big workstations.
If software didn't need multiple cores to run at top speed, there wouldn't really be a way to sell it cheaper to people who aren't getting as much benefit out of it because they run it on a weaker CPU. Unless maybe the software/hardware ecosystem evolved controls on "SMT channels" that let you configure a maximum execution width for code running on that logical core. (Again imagining a world where CPUs scale in pipeline width instead of multiple separate cores.)