
When considering how multi-thread-friendly our program must be, my team puzzled over whether there's anything that absolutely cannot be done on a single-core CPU. I posited that graphics processing requires massively parallel processing, but they argued that things like DOOM were done on single-core CPUs without GPUs.

Is there anything that must be done on a multi-core processor?

Assume there is infinite time for both development and running.

  • While the answers below seem to largely be "no", there are historically systems that literally couldn't have worked without a co-processor handling some tasks. One strong example I know of is the Nintendo DS, which includes a 67 MHz ARM9 CPU and a 33 MHz ARM7 CPU (also used for backwards compatibility when playing GBA games). For DS games, the ARM7 handles audio playback and Wi-Fi communication, because the ARM9 can't process and draw anything of note to the screen while also keeping up with feeding audio to the sound chip directly. So, as @jmite says, "under what constraints": lack of speed can require multiple CPUs. Commented Mar 5, 2016 at 3:15
  • At my job we use multicore Xeons and the Xenomai real-time Linux extensions to do low-latency audio processing. We have a three-stage audio processing pipeline; each stage gets its own dedicated core, of which it uses about 70% of the cycles. Non-real-time tasks get the fourth core, plus whatever cycles are left over on the first three. This would only be possible on a single-core CPU if that core were 3+ times faster than a core on a current 4-core CPU; given that the current CPU runs at 2 GHz, that might be difficult to achieve. Commented Mar 5, 2016 at 3:38
  • Software on a single-core CPU can emulate a multi-core CPU. The difference is almost entirely speed. Commented Mar 5, 2016 at 7:53
  • One thing that must be done on a multi-core system is testing multithreaded software, because some defects will (almost) never happen on a single-core system. I'm not sure that qualifies as an answer, though...
    – nikie
    Commented Mar 5, 2016 at 17:34
  • @nikie A single-core system can emulate memory ordering and stale caches too, but I imagine this would be extremely inefficient (like a 10× slowdown).
    – Nayuki
    Commented Mar 5, 2016 at 21:10

7 Answers


If you don't care about the running time, anything you can do on a multi-core machine, you can do on a single-core machine. A multi-core machine is just a way of speeding up some kinds of computations.

If you can solve a problem in time $T$ on a multi-core machine with $n$ cores, then you can solve it in time $\sim Tn$ (or less; see Amdahl's law) on a single-core machine. The single-core machine can emulate a multi-core machine using time-slicing / time-sharing.
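The time-slicing idea can be made concrete with a toy sketch (illustrative only; a real OS scheduler pre-empts at the instruction level rather than at cooperative yield points):

```python
def worker(name, steps):
    # Each worker is a generator; every yield marks the end of one
    # "time slice", i.e. a point where a context switch may occur.
    for i in range(steps):
        yield f"{name} step {i}"

def single_core_scheduler(workers):
    """Round-robin over all workers until every one has finished."""
    trace = []
    while workers:
        for w in list(workers):
            try:
                trace.append(next(w))
            except StopIteration:
                workers.remove(w)
    return trace

trace = single_core_scheduler([worker("A", 2), worker("B", 2)])
print(trace)   # ['A step 0', 'B step 0', 'A step 1', 'B step 1']
```

One core produces exactly the interleaved execution that two cores would; only the wall-clock time differs.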

  • I am not entirely sure that is absolutely correct. I don't think memory consistency bugs are possible to generate on a single core (yes, one could emulate a multi-cache system on a unicore, but such indirection is kind of cheating). (Perhaps an equivalent of implementing register swap by move operations in a VLIW, exploiting guaranteed parallelism?) I suppose that even on a single-threaded core it would still be possible to extract entropy from multithreaded timing variability, but the amount of entropy would be smaller per unit time (which is really just a matter of performance like the other differences).
    – user4577
    Commented Mar 5, 2016 at 1:33
  • @PaulA.Clayton Memory consistency bugs are usually unwanted, and well-written software should not exhibit them. However, if you really wanted to, you could emulate them on a single CPU (although it might be slow). Commented Mar 5, 2016 at 7:54
  • Sometimes the time on a single core will be more than $n$ times longer than on an $n$-core machine, for instance when searching with random restarts, or when the pieces fit into cache on the multiple cores but not on the single core. Commented Mar 5, 2016 at 9:50
  • "The single-core machine can emulate a multi-core machine using time-slicing / time-sharing." And indeed it has done so since the dawn of the "modern" operating system. Commented Mar 5, 2016 at 18:19
  • @PaulA.Clayton I think you could get memory consistency problems (like a non-atomic increment) if you had two different processes that both modified the same shared memory. You just need pre-emptive multi-tasking. Of course, this is generally why modern OSes don't have processes sharing the same writable memory unless they explicitly ask to.
    – Patrick M
    Commented Mar 6, 2016 at 18:09

The question is: under what constraints?

There are certainly problems where, if we ask the question "can we solve this problem on hardware X in the given amount of time", the answer will be no.

But this is not a "future-proof" answer: things which in the past could not be done fast enough on a single core probably can be now, and we can't predict what future hardware will be capable of.

In terms of computability, we know that a single-tape Turing Machine is capable of computing all the same functions as a single or multi-core computer, so, runtime aside, there are no problems that a multi-core computer can solve that a single-core can't.

In terms of something like graphics, literally everything that is on the GPU could be done on the CPU... if you are willing to wait long enough.

  • @JanDvorak I would actually say that this is not done by the GPU at all ;)
    – TomTom
    Commented Mar 6, 2016 at 7:17
  • If time is not a constraint, you could do all calculations by hand, with pen and paper. Commented Mar 6, 2016 at 9:21
  • @mathreadler Yes, because the brain is Turing complete. Something that turned into a lengthy debate on Physics Stack Exchange.
    – JBentley
    Commented Mar 6, 2016 at 13:14
  • Actually, @JanDvorak, generating VGA is quite simple and can be done in software on a lowly 16 MHz microcontroller, as this project shows: pyroelectro.com/tutorials/arduino_basic_vga
    – axello
    Commented Mar 6, 2016 at 20:54
  • @mathreadler That is actually a more complicated question than it first appears. A short answer might be "yes", because a specialized machine can construct a computer without requiring any Turing-complete tools to do so. A longer answer might be "no", because the ability to construct a Turing machine may imply that one has a larger Turing machine that is in an "initialization" state where it constructs the rest of the state machine. The full answer is even more complicated, because we have never constructed a Turing-complete device. We have developed abstract ideas for machines that are...
    – Cort Ammon
    Commented Mar 7, 2016 at 15:40

As other answers have pointed out, a single CPU can always emulate multiple CPUs by slicing time and playing the role of each virtual CPU. This emulation will certainly calculate the correct answers.

In the real world, execution time may be important. It could mean the difference between a mediocre frame rate and a stellar visual experience. Or the difference between profit and loss in trading.

One pathological situation where a multiprocessor is vastly faster than a uniprocessor is where the processing is a data pipeline, context switching is expensive, and the machine code for each pipeline stage just barely fits in a CPU's cache.

Let me illustrate with some numbers. Suppose you have a data pipeline (3D rendering, etc.) that has 4 processing stages, each stage has 256 KiB of program code, and you conveniently have 4 CPUs with 256 KiB of L2 cache each. If you try to run this processing on a single CPU, then switching between the 4 tasks will be expensive and involve heavy cache misses. On the other hand, if you run it on a 4-core system, the calculation could potentially be very smooth, cache misses are minimal, and context switches are non-existent. (As a side note, this is related to the notion of pinning certain applications to certain cores, e.g. only doing OS kernel operations on one core, or TCP/IP handling, etc.)
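The core-pinning mentioned above can be done explicitly. Here is a minimal sketch using Python's `os.sched_setaffinity` (Linux-specific; a real pipeline would fork one process per stage and pin each to a different core):

```python
import os

# Pin the current process to a single core, as one might do for a
# dedicated pipeline stage so its code and data stay hot in that
# core's private cache.
allowed = sorted(os.sched_getaffinity(0))   # cores this process may run on
os.sched_setaffinity(0, {allowed[0]})       # pin to exactly one of them
print(os.sched_getaffinity(0))              # now a single-element set
```

The equivalent from the shell is `taskset -c 0 ./stage0`, and the same idea underlies IRQ affinity for dedicating a core to TCP/IP handling.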


It's much harder to develop really nefarious data races with a single CPU. I mean, sure, you can pull off tearing between words if you interrupt a single CPU, but can you build exotic scenarios where there is no single interleaving of threads which does what you want?

Okay, maybe making insidious bugs doesn't count as a valid use of multi-core advancements. As it turns out, there's not much that multi-core can do that a single core cannot, given time. The reason is simple. If you try to avoid those evil data races, you have to have synchronization points in your code. If you model your code as a lattice of computations, where each one's inputs must be complete and synchronized before you can calculate and produce outputs, it's easy to see that a single CPU can simply work its way along the lattice, calculating the next available block of work.

In fact, if you can demonstrate that your algorithm can be solved by a Turing machine (which is virtually every algorithm we care about), it can be proven that the algorithm can be done not only by a single-core CPU, but in fact by a state machine with a very long piece of tape for memory!

The CHESS race detector actually leverages this to find race conditions. It runs everything single-threaded and systematically explores all possible interleavings between threads, trying to find cases where a test fails because of a race. CHESS depends on the fact that you can run any multithreaded application on a single core.
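A sketch of that idea (not CHESS itself): model each thread's non-atomic increment as two explicit steps, a load and then a store of the incremented local value, and enumerate every legal interleaving on a single core.

```python
from itertools import permutations

def run(schedule):
    shared = 0
    local = {}
    for tid, op in schedule:
        if op == "load":
            local[tid] = shared          # thread reads the shared counter
        else:                            # "store"
            shared = local[tid] + 1      # thread writes back local value + 1
    return shared

steps = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]
# Keep only interleavings that preserve each thread's own program order:
schedules = [s for s in permutations(steps)
             if s.index((0, "load")) < s.index((0, "store"))
             and s.index((1, "load")) < s.index((1, "store"))]
results = {run(s) for s in schedules}
print(len(schedules), results)   # 6 {1, 2}: some interleavings lose an update
```

Of the six legal schedules, the ones where both loads happen before either store produce 1 instead of 2, which is exactly the lost-update race a detector is hunting for.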

The cases where you need multi-core appear when you start stretching the limits of hardware. The obvious one is when you have time constraints. Some problems with real-time constraints are impossible to do on a single core, because they simply can't drive a single core's clock fast enough. There's a reason CPUs climbed up to 4 GHz and then settled down a bit, preferring more cores at lower speeds.

A more exotic version of this timing constraint is in hard-real time systems. In some hard real time systems, the service of interrupts is so demanding that you actually have to pick a multi-core CPU that lets you divvy the interrupts up across the cores, or you run into timing limitations.

Another limit arises with data busses. Consider the Blue Gene/P as an example. JUGENE, a particular Blue Gene/P supercomputer, has 144 terabytes of memory. They simply don't make single CPU computers that can access all of that memory.

  • Re "they simply don't make single CPU computers that can access [that much] memory": "don't" is not the same as "can't". You could design and build a uniprocessor with 144 terabytes or more of main memory. The only reason people don't is diminishing returns: the incremental, practical value of adding more memory to a uniprocessor design reaches a peak at some point and then drops off as the memory size grows, while the incremental cost remains constant. Commented Mar 7, 2016 at 21:18
  • @jameslarge That would be why that sentence came in the part of my answer discussing real-life practical hardware, and why it did not appear in the first 2/3 of the answer, which discussed the theoretical capacities.
    – Cort Ammon
    Commented Mar 7, 2016 at 21:54
  • "Don't" vs. "can't" is illustrated by two systems in my basement. If I could physically add that much memory into their hardware configurations, their CPUs "could" access each byte. But I can't, so they "can't". The capabilities of the CPUs are beyond practicality. Commented Mar 8, 2016 at 1:07
  • I was thinking something like this answer. It seems that race conditions would be impossible (or happen 100% of the time) in a single-core environment. As for a practical application, I theorize that a software developer could engineer a unique form of copy protection by coding some weird race-condition test that would always pass on the specific target hardware but would fail on emulated hardware run by a single core. In this case, emulation by a multi-core system would probably pass sometimes, but unreliably. Commented Mar 9, 2016 at 21:15

If you need to observe a process running on a single processing element without disturbing its real-time behavior (or as little as possible), like for benchmarking or activity logging, you'll probably need a separate processing resource.

  • Nice, concise example of something that would require precise emulation if not multiple processors.
    – Ky -
    Commented Mar 7, 2016 at 15:59

The other answers adhere to the limited view of parallelism as "distributed concurrency". This gives some answers: in a clean model of computation à la Turing, multiple cores do not offer an advantage; the only advantage you may get is efficiency.

There is one thing multiple processing units (PUs) can do that a single one cannot, though: execute operations in parallel, that is, at the same time.

That is very useful if you run multiple programs at the same time. Granted, you only rarely absolutely need more than concurrent execution, and most uses come down to increased efficiency. But there is this difference.

Say you need to process sensor data from multiple sources in real time. Whatever that means precisely in your application, one PU can only handle so many input streams concurrently without violating its response-time limit. So you need multiple PUs once you have too many sensors for your current PU generation.

In the more classical realm, one potentially convincing example is portfolio algorithms. Say you have a problem for which you have multiple (say $k$) algorithms with orthogonal costs; good cases for one are bad cases for others. You cannot quickly tell which one is best for a given input, though.

You can run all algorithms in parallel and abort once one finishes. If you have at least $k$ PUs, you get the minimum running time among all $k$ algorithms in the portfolio. With only one PU, you'd get $k$ times that, assuming a fair scheduler, plus all the overhead.
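The run-all-and-take-the-first-finisher pattern can be sketched with `concurrent.futures` (the two "algorithms" here are hypothetical stand-ins that just sleep for different durations):

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def algo_a():           # fast on this particular input
    time.sleep(0.01)
    return "a"

def algo_b():           # slow on this particular input
    time.sleep(0.5)
    return "b"

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(algo_a), pool.submit(algo_b)]
    # Block only until the first algorithm in the portfolio finishes.
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    answer = next(iter(done)).result()
    for f in pending:
        f.cancel()      # best effort: an already-running thread still finishes

print(answer)           # "a"
```

With one PU per algorithm, the portfolio's running time is the minimum over the $k$ algorithms; time-slicing the same portfolio on one PU inflates that minimum by roughly a factor of $k$.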


From a CS point of view, "multicore" is not that much different in theory from "distributed computing". The basic concept is "independent computing elements that compute in parallel". So slightly rephrasing the question ("multicore" is not really exactly a theoretical concept in CS) leads to some other possibilities. As pointed out in other answers, sequential programming is equivalent to parallel programming from a CS point of view. This goes back to the definition of the theoretical system for computing, namely a Turing machine. Theoretical analysis of CS performance is ultimately in terms of TMs, where the distinction of parallel vs. sequential does not really apply (although there is some rough analogy with multitape TMs).

But considering this question less abstractly, distributed computing is indeed superior, or possibly even nearly required, for some problems involving fault tolerance. In this area there is a concept that applies when/where the independent computing elements are taken to have some degree of unreliability (this is not really a universally applicable assumption for all contexts). Here are several cases where fault tolerance is improved with, or even requires, independent computing elements.

  • Consider that each processor has an independent "[x]%" chance of failing during computation. A system can be devised whereby, through communication, the overall fault tolerance of the system is superior to that of the individual components. This was applied many decades ago, e.g. in Space Shuttle systems. More recently there are basic protocols designed to utilize it, e.g. Paxos, which solves the so-called consensus problem. A more down-to-earth example is Google, which has a lot of proprietary algorithms to essentially build their supercomputer(s) out of individually unreliable elements coupled with fault-tolerant algorithms.

  • Bitcoin involves distributed transactions to compute the ledger, and that is not merely due to sheer processing-load issues. The algorithm is carefully designed to thwart corrupt nodes. In short, it "solves"/implements the Byzantine generals problem, which is not merely about maximizing parallel performance; it involves independent entities "checking" each other and "algorithmically/cryptographically/securely" rejecting invalid computations, aka a sort of "cheating" or "corruption".

  • A classic analysis of parallelism concludes there are about 7 "fundamental" problem pattern types that decompose into particular parallel execution breakdowns; see The Landscape of Parallel Computing Research: A View from Berkeley.

  • There is some element of an open theoretical question here wrt the performance considerations addressed in most other answers. The question of whether there are any problems that are "inherently faster" in parallel than sequentially is known roughly as the $P \stackrel{?}{=} NC$ problem, where NC is considered the class of "efficiently parallelizable" algorithms and P the class of "efficient [sequential] algorithms".

  • I love this answer! I learned a lot from your examples :D
    – Ky -
    Commented Mar 8, 2016 at 14:45
  • +1 for fault tolerance in mission-critical environments with radiation, −1 for lack of caps and redundancy. Commented Mar 10, 2016 at 12:09
