Fast single-thread performance and very high multi-thread throughput are exactly what you get with a CPU like Intel's Xeon E5-2699v4.
It's a 22-core Broadwell. The sustained clock speed is 2.2GHz with all cores active (e.g. video encoding), but the single-core max turbo is 3.6GHz.
So while running a parallel task, it spends its 145W power budget on 22 cores at roughly 6.6W each. But while running a task with only a few threads, that same power budget lets a few cores turbo up to 3.6GHz. (The lower single-core memory and L3-cache bandwidth in a big Xeon means it might not run as fast as a desktop quad-core at 3.6GHz, though. A single core in a desktop Intel CPU can use a lot more of the total memory bandwidth.)
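As a back-of-the-envelope sketch (assuming the TDP is simply divided evenly among active cores, which is only an approximation of how real power management works):

```python
# Rough per-core power budget for a 145W, 22-core chip.
# Assumes an even split across active cores, which is an
# illustrative simplification, not Intel's actual power model.
tdp_watts = 145
total_cores = 22

per_core_all_active = tdp_watts / total_cores
print(f"{per_core_all_active:.1f} W per core with all {total_cores} active")
# → 6.6 W per core with all 22 active

# With only 2 cores active, each one can burn far more power,
# which is what lets them turbo from 2.2 GHz toward 3.6 GHz.
per_core_two_active = tdp_watts / 2
print(f"{per_core_two_active:.1f} W per core with 2 active")
# → 72.5 W per core with 2 active
```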
The 2.2GHz rated clock speed is that low because of thermal limits. The more cores a CPU has, the slower they have to run when they're all active. This effect isn't very big in the 4 and 8 core CPUs you mention in the question, because 8 isn't that many cores, and they have very high power budgets. Even enthusiast desktop CPUs noticeably show this effect: Intel's Skylake-X i9-7900X is a 10c20t part with base 3.3GHz, max turbo 4.5GHz. That's much more single-core turbo headroom than i7-6700k (4.0GHz sustained / 4.2GHz turbo without overclocking).
Frequency/voltage scaling (DVFS) allows the same core to operate over a wide range of the performance / efficiency curve. See also this IDF2015 presentation on Skylake power management, with lots of interesting details about what CPUs can do efficiently, and trading off performance vs. efficiency both statically at design time, and on the fly with DVFS.
At the other end of the spectrum, Intel Core-M CPUs have very low sustained frequency, like 1.2GHz at 4.5W, but can turbo up to 2.9GHz. With multiple cores active, they'll run their cores at a more efficient clock-speed, just like the giant Xeons.
You don't need a heterogeneous big.LITTLE style architecture to get most of the benefit. The small cores in ARM big.LITTLE are pretty crappy in-order cores that aren't good for compute work. The point is just to run a UI with very low power. Lots of them would not be great for video encoding or other serious number crunching. (@Lưu Vĩnh Phúc found some discussions about why x86 doesn't have big.LITTLE. Basically, spending extra silicon on a very-low-power extra-slow core wouldn't be worth it for typical desktop/laptop usage.)
> whereas applications like video editing are determined by number of cores. [Wouldn't 2x 4.0 GHz + 4x 2.0 GHz be better at multi-threaded workloads than 4x 4GHz?]
This is your key misunderstanding. You seem to be thinking that the same number of total clock ticks per second is more useful if spread over more cores. That's never the case. It's more like

    cores * perf_per_core * (scaling efficiency)^cores

(`perf_per_core` is not the same thing as clock speed, because a 3GHz Pentium 4 will get a lot less work per clock cycle than a 3GHz Skylake.)
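To make that concrete, here's a toy version of the model above. The per-core scaling efficiency of 0.95 is a made-up number for illustration, and perf_per_core is taken as proportional to clock speed, which only holds for identical microarchitectures:

```python
# Toy throughput model: total per-core performance times a scaling
# penalty that compounds with core count. Efficiency 0.95 is an
# illustrative assumption, not a measured value for any workload.
def throughput(core_ghz, eff=0.95):
    """core_ghz: list of per-core clock speeds in GHz."""
    n = len(core_ghz)
    return sum(core_ghz) * eff ** n

four_fast = throughput([4.0] * 4)                 # 4x 4.0 GHz
mixed = throughput([4.0] * 2 + [2.0] * 4)         # 2x 4.0 + 4x 2.0 GHz

# Same 16 GHz of total clock ticks, but the 4-core config wins
# because the scaling penalty compounds over fewer cores.
print(f"{four_fast:.1f} vs {mixed:.1f}")  # → 13.0 vs 11.8
```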
More importantly, it's very rare for the efficiency to be 1.0. Some embarrassingly parallel tasks do scale almost linearly (e.g. compiling multiple source files). But video encoding is not like that. For x264, scaling is very good up to a few cores, but gets worse with more cores: going from 1 to 2 cores will almost double the speed, but going from 32 to 64 cores will help much, much less for a typical 1080p encode. The point at which speed plateaus depends on the settings. (`-preset veryslow` does more analysis on each frame, and can keep more cores busy than `-preset fast`.)
With lots of very slow cores, the single-threaded parts of x264 would become bottlenecks. (e.g. the final CABAC bitstream encoding. It's h.264's equivalent of gzip, and doesn't parallelize.) Having a few fast cores would solve that, if the OS knew how to schedule for it (or if x264 pinned the appropriate threads to fast cores).
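An Amdahl's-law sketch shows why a serial stage like that caps the benefit of piling on slow cores. The 10% serial fraction here is an assumption for illustration, not a measured x264 number:

```python
# Amdahl's law: if a fraction of the work is serial, speedup is
# capped no matter how many cores you add. The 10% serial fraction
# is an illustrative assumption, not a measured x264 figure.
def speedup(parallel_frac, n_cores):
    serial_frac = 1 - parallel_frac
    return 1 / (serial_frac + parallel_frac / n_cores)

for n in (2, 4, 16, 64):
    print(n, round(speedup(0.90, n), 2))
# Even with 64 cores, speedup tops out around 8.8x, because the
# serial stage runs no faster. A few fast cores attack exactly
# that bottleneck.
```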
x265 can take advantage of more cores than x264, since it has more analysis to do, and h.265's Wavefront Parallel Processing (WPP) design allows more encode and decode parallelism. But even for 1080p, you run out of parallelism to exploit at some point.
If you have multiple videos to encode, doing multiple videos in parallel scales well, except for competition for shared resources like L3 cache capacity and bandwidth, and memory bandwidth. Fewer faster cores could get more benefit from the same amount of L3 cache, since they wouldn't need to work on so many different parts of the problem at once.