23
\$\begingroup\$

Say one transistor fails and causes the entire computation to be wrong: how does a computer check to see if it is correct? My only guess is that it performs the computation multiple times across different units. I tried looking for information online, but I don't really know what to look for, and since complex programs go through billions of cycles, my intuition tells me that there must be a way to prevent calculation errors.

I am looking for a list of jumping-off points where I can start learning neighboring fields, terms and textbooks; I am not very well versed in this field but I have taken quantum 1 and 2 and I have general knowledge about electrical engineering including first-order logic.

\$\endgroup\$
7
  • 1
    \$\begingroup\$ Are you asking this only from the perspective that there is a hardware malfunction like a broken transistor, or does it also include the hardware being perfectly fine but the computations having too much error, because unexpected input numbers caused unexpected things in the calculations and thus the outputs are also unexpected? \$\endgroup\$
    – Justme
    Commented May 5 at 12:35
  • \$\begingroup\$ It would be very unusual for a single transistor to fail. Failure is usually handled at a different 'level'. \$\endgroup\$
    – copper.hat
    Commented May 6 at 0:23
  • 4
    \$\begingroup\$ Generally they don't. Computation is never protected, only memory. So when you need computers to control important things like airplanes and nuclear power plants, you use multiple computers to check each other. But each individual computer is not designed to self-detect computational errors. \$\endgroup\$
    – slebetman
    Commented May 6 at 6:33
  • 3
    \$\begingroup\$ Probably the keyword is reliability and you can find scholarly articles if you search for "reliability theory." This was a hot topic in the 1950's when components were much less ... reliable. Generally modern hardware is much more trustworthy, with the possible exception of space, so problems are solved in software. I asked a related question, there are a bunch of sources I found (mostly from the 50's) at cstheory.stackexchange.com/q/40571/47163 \$\endgroup\$
    – Ben Burns
    Commented May 6 at 17:06
  • \$\begingroup\$ There are software-implemented reliability systems in many applications, like database, networking, and security software. Some network protocols include checksums, parity bits, etc. to improve reliability, but again these checks are generally implemented in software. \$\endgroup\$ Commented May 7 at 2:31

8 Answers

38
\$\begingroup\$

how does a computer check to see if it is correct?

Most won't. Your average desktop PC isn't mission-critical, so there are only a few hardware checks, for example signed firmware and TPM checks; but if your CPU can't count correctly anymore, it will simply crash or fail to boot. Failures after the manufacturing stage are so rare that this is not a concern for most users during a PC's lifetime.

If you take one step up from a desktop PC to a server, ECC memory comes into play. It will detect errors in the memory and correct them on the fly, but still assumes the CPU counts correctly.

One step further up would be an industrial (high-end) control PC with redundancy. There, you may have a watchdog or something more elaborate to disable the entire PC: if such a counting error happens, the faulty PC fails to update the watchdog, it gets disabled, and the redundant PC takes over.

Furthest up the ladder I've heard about is SpaceX, whose rockets use elaborate 2-out-of-3 voting schemes and redundant hypervisors built from x86 hardware. The system detects any deviation and stops listening to the deviating CPU, and there is enough redundancy to tolerate several CPUs malfunctioning at once.
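
To illustrate the voting idea, here is a minimal 2-out-of-3 majority voter sketched in C. This is an illustration of the general technique only, not SpaceX's actual implementation: the value two channels agree on is accepted, and a dissenting channel is flagged so the system can stop listening to it.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical 2-out-of-3 voter: each redundant computer reports a
     * result; any value that at least two of them agree on is accepted,
     * and a dissenting channel is flagged as suspect. */
    typedef struct {
        uint64_t value;   /* voted result, valid only if ok is true            */
        bool     ok;      /* false if all three results disagree               */
        int      faulty;  /* index (0..2) of the out-voted channel, -1 if none */
    } vote_result_t;

    static vote_result_t vote_2oo3(uint64_t a, uint64_t b, uint64_t c)
    {
        vote_result_t r = { 0, false, -1 };
        if (a == b)      { r.value = a; r.ok = true; r.faulty = (b == c) ? -1 : 2; }
        else if (a == c) { r.value = a; r.ok = true; r.faulty = 1; }
        else if (b == c) { r.value = b; r.ok = true; r.faulty = 0; }
        /* else: all three disagree, no majority, so fail safe */
        return r;
    }

A real system votes continuously on outputs (often in hardware), and the out-voted channel is masked rather than merely reported.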

\$\endgroup\$
10
  • 8
    \$\begingroup\$ Fly-by-wire aviation systems have a moderate amount of design info around on how redundancy was achieved. \$\endgroup\$ Commented May 5 at 8:00
  • 2
    \$\begingroup\$ So if my understanding is correct, the transistors and logic gates up to the whole CPU have enough consistency in their design that the probability of an individual component failing within a trillion cycles is near zero; however, the chance of failure for other components like memory is significantly higher, and that is where the errors come from \$\endgroup\$
    – Waterbloo
    Commented May 5 at 8:58
  • 2
    \$\begingroup\$ @Waterbloo Memory yes (different technology, more things at stake). CPU no. Defects are tested during production and unusable cores are turned off by microcode or laser burning of traces. Same with L2 cache. One bit of one block unusable = turn off entire block and sell at reduced price. Any faults after that are not captured. The CPU simply fails. \$\endgroup\$
    – winny
    Commented May 5 at 9:59
  • 9
    \$\begingroup\$ The space shuttle had quadruple redundancy with majority rule. www.nasa.gov \$\endgroup\$
    – AkselA
    Commented May 5 at 16:10
  • 6
    \$\begingroup\$ Re, "...something...to disable the entire PC." Disabling the PC is nice, but you also may want to ensure that the machine that the PC was controlling goes to some safe state. In one complex system that I worked on, there was a "safety interlock circuit." It was a current loop that went through various different components. IDK which component provided the current, but every component could sense the current, and if any component broke the loop, then everything would shut down, very quickly, and no software could override it. \$\endgroup\$ Commented May 5 at 22:33
21
\$\begingroup\$

Triple Modular Redundancy (TMR)

For computers designed to operate in environments which may be affected by Single Event Upsets (SEU) due to radiation, e.g. for space or particle accelerators, Triple Modular Redundancy (TMR) may be used in the design to add protection. E.g. from Gaisler Research:

  1. A Portable and Fault-Tolerant Microprocessor Based on the SPARC V8 Architecture which has the abstract:

    The architecture and implementation of the LEON-FT processor is presented. LEON-FT is a fault-tolerant 32-bit processor based on the SPARC V8 instruction set. The processors tolerates transient SEU errors by using techniques such as TMR registers, on-chip EDAC, parity, pipeline restart, and forced cache miss. The first prototypes were manufactured on the Atmel ATC35 0.35 μm CMOS process, and subjected to heavy-ion fault-injection at the Louvain Cyclotron. The heavy-ion tests showed that all of the injected errors (> 100,000) were successfully corrected without timing or software impact. The device SEU threshold was measured to be below 6 MeV while ion energy-levels of up to 110 MeV were used for error injection.

  2. Functional Triple Modular Redundancy (FTMR), a report for the European Space Agency (ESA), which has the following scope:

    This document discusses the use of Triple Modular Redundancy (TMR) for the protection of combinatorial and sequential logic in reprogrammable logic devices. A VHDL approach has been developed for automatic TMR insertion and a demonstration design has been developed. The approach is called “Functional Triple Modular Redundancy (FTMR)”.

    This document addresses the protection of random sequential and combinatorial logic. This document does not address the protection of inputs and outputs, the usage of on-chip block memories or dedicated shift-registers etc. It assumes a good knowledge of the Xilinx architecture. For detailed information on Xilinx FPGAs and mitigation techniques such as configuration memory scrubbing, see [RD7].

Cortex-R5 for functional safety

The ARM Cortex-R5 processor has support for functional safety, which includes features such as:

  1. ECC on some internal data which can correct some errors. From About the processor:

    Error Checking and Correction (ECC) is used on the Cortex-R5 processor ports and in Level 1 (L1) memories to provide improved reliability and address safety-critical applications.

  2. A dual-redundant safety mode. From Split/lock:

    The Cortex-R5 processor can be configured so that it can be switched, under reset, between a twin-CPU performance mode and a dual-redundant safety mode.

    The dual-redundant safety mode checks that both CPUs produce the same output. If there is a difference, the system can be put into a safe state by hardware. I.e. unlike triple modular redundancy, a majority vote can't determine the correct output; the system can only detect that an error has occurred.

See the TMS570LC43x 16/32-Bit RISC Flash Microcontroller Technical Reference Manual for an example of a microcontroller implemented using a Cortex-R5. Searching that TRM shows:

  1. Lockstep safety protection has been applied to the Vector Interrupt Module (VIM) as well as the Cortex-R5 CPUs.
  2. ECC has been applied to:
    • Level 1 cache memories
    • Level 2 SRAM
    • Flash memories of the R5F core
    • The memory used for some peripherals. E.g. the Controller Area Network (DCAN) module

For learning about the Cortex-R5 features, kits such as the LAUNCHXL2-570LC43 Hercules TMS570LC43x LaunchPad Development Kit are available relatively cheaply for experimenting on actual hardware. There is an ERR LED which indicates that an error has occurred, e.g. a core-compare error between the dual lockstep CPUs.

Diverse redundancy

The What does 'triple redundant closed-loop digital avionics system' mean? answer by David Hammen on the Space Exploration Stack Exchange mentions issues with common failure modes:

Finally, triple redundancy accomplishes nothing if every one of those triply-redundant systems exhibits the same common error. A good number of mishaps in space have been attributed to bad flight software or to bad commands issued to the flight software. It doesn't matter if there are one hundred flight computers if each and every one of them has the same bad code or is given the same bad command. Something bad will happen. Common mode failure is the thing that scares safety engineers the most.

Carbon-copy redundancy offers zero protection against common mode failures. The Space Shuttle used quad redundancy in its primary system to address the problem of two fault tolerance (to some failures). To combat the problem of common mode failures, the Space Shuttle had a fifth Backup Flight System. The Shuttle BFS software was built by a completely different contractor than that responsible for building the primary avionics software system. The job of the BFS (which was never used) was to bring the vehicle back to Earth. While the mission would have been a failure, the astronauts would still have been alive.

Diverse Redundant Systems for Reliable Space Life Support mentions that diverse redundancy can mitigate common cause failures (CCF), albeit that report is at a high system level and doesn't give details of the supporting electronics design. I think the 'diverse' part means implementations from different design teams, which hopefully won't make the same systematic errors.
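
As a toy software-level illustration of the idea (a sketch only, not taken from the cited report): the same quantity is computed by two independently written routines and cross-checked before it is trusted. Real diverse systems go much further, with different teams, toolchains and CPU architectures.

    #include <stdint.h>
    #include <stdlib.h>

    /* Implementation A: straightforward loop. */
    static uint64_t sum_loop(uint32_t n)
    {
        uint64_t s = 0;
        for (uint32_t i = 1; i <= n; i++) s += i;
        return s;
    }

    /* Implementation B: closed-form formula, deliberately a different method. */
    static uint64_t sum_formula(uint32_t n)
    {
        return (uint64_t)n * ((uint64_t)n + 1) / 2;
    }

    uint64_t checked_sum(uint32_t n)
    {
        uint64_t a = sum_loop(n);
        uint64_t b = sum_formula(n);
        if (a != b)
            abort();   /* disagreement: fail safe rather than return garbage */
        return a;
    }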

\$\endgroup\$
2
  • 4
    \$\begingroup\$ Safety critical systems (such as flight control computers) not only have triple redundancy, each of the redundant paths has a different architecture. This means, in practice, that all the computing elements are different - one might be a power architecture, another X86 and so on. This protects against such things as microcode bugs. All the computing elements do the same thing in the same amount of time. The probability of all 3 having the same type of bug at the same point in the calculation is very low (the normal acceptance rule is no more than 1E-9 faults per flight hour). \$\endgroup\$ Commented May 7 at 8:38
  • \$\begingroup\$ In addition to my above, the 3 computing elements are galvanically isolated. \$\endgroup\$ Commented May 7 at 8:39
8
\$\begingroup\$

Mainframes: Lockstepping

Chester Gillon's answer has brought up lockstep safety in an embedded CPU. This concept also exists on the other end of the computing spectrum, in mainframes.

The Wikipedia article on mainframes sums it up:

Mainframes also have execution integrity characteristics for fault tolerant computing. For example, z900, z990, System z9, and System z10 servers effectively execute result-oriented instructions twice, compare results, arbitrate between any differences (through instruction retry and failure isolation), then shift workloads "in flight" to functioning processors, including spares, without any impact to operating systems, applications, or users. This hardware-level feature, also found in HP's NonStop systems, is known as lock-stepping, because both processors take their "steps" (i.e. instructions) together. Not all applications absolutely need the assured integrity that these systems provide, but many do, such as financial transaction processing.
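
A crude software caricature of the same idea (a sketch only; real lockstep runs at the instruction level, in hardware, transparently to software) is to run the work twice, compare, and retry or escalate on a mismatch:

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint64_t (*work_fn)(uint64_t);

    /* Execute the same work twice and compare; retry on mismatch
     * (instruction retry), give up to a spare on repeated mismatch. */
    uint64_t run_checked(work_fn f, uint64_t input, int max_retries)
    {
        for (int attempt = 0; attempt <= max_retries; attempt++) {
            uint64_t r1 = f(input);      /* "first pipeline"  */
            uint64_t r2 = f(input);      /* "second pipeline" */
            if (r1 == r2)
                return r1;               /* results agree: accept */
            /* mismatch: suspected transient fault, retry */
        }
        abort();  /* persistent mismatch: isolate the fault and shift work to a spare */
    }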

Software runtime checks

Even if you don't have hardware redundancy, you can run tests in software to compare actual computational results with expected reference data.

Wikipedia says this about Prime95, a program that searches for Mersenne primes:

[...] due to the high precision requirements of primality testing, the program is very sensitive to computation errors and proactively reports them. These factors make it a commonly used tool among overclockers to check the stability of a particular configuration.
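
The underlying pattern is easy to sketch: compute things the program already knows the answers to and refuse to continue if the hardware disagrees. This is a minimal illustration of the general idea of checking against reference data, not Prime95's actual test.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compare freshly computed results against reference values that were
     * precomputed on trusted hardware.  The values here are trivially small;
     * a real self-test uses long-running computations and large tables. */
    static bool self_test(void)
    {
        static const struct { uint32_t a, b; uint64_t product; } ref[] = {
            {   255,   255,      65025ULL },
            { 65535, 65535, 4294836225ULL },
        };
        for (size_t i = 0; i < sizeof ref / sizeof ref[0]; i++) {
            volatile uint64_t a = ref[i].a, b = ref[i].b; /* defeat constant folding */
            if (a * b != ref[i].product)
                return false;   /* hardware (or the build) is not trustworthy */
        }
        return true;
    }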

\$\endgroup\$
1
4
\$\begingroup\$

Existing answers focus on how a computer checks if its computations are correct. But computers don't just compute; they also store and communicate information.

The CPU in your average PC does not perform much checking. As noted in other answers, more advanced computers in critical applications often have triple redundant CPUs.

The transistors on memory chips are much smaller and more numerous, and here error checking is commonplace, especially on servers.

Back in the 1990s a 32-bit PC would use (for example) four 8-bit-wide, 1-megabyte memory modules, each with nine 1-megabit memory chips on it. The 9th bit was used for a parity check: it recorded whether the total number of 1 bits in the other 8 bits was even or odd. In this way, if a single bit of any given byte was in error, the computer could detect it. This feature was included in virtually all PCs at that time. The response wasn't very useful though: the machine would typically announce it had a parity error and stop working, which would force the user to buy new memory if it happened often.
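
The per-byte parity scheme is simple enough to sketch in a few lines of C (illustrative only; on those machines the memory controller did this in hardware):

    #include <stdbool.h>
    #include <stdint.h>

    /* Even parity over one byte: the stored 9th bit makes the total number
     * of 1s even, so any single flipped bit is detected, but it cannot be
     * located, let alone corrected. */
    static uint8_t parity_bit(uint8_t byte)
    {
        uint8_t p = 0;
        for (int i = 0; i < 8; i++)
            p ^= (byte >> i) & 1;
        return p;              /* 1 if the byte contains an odd number of 1s */
    }

    static bool parity_ok(uint8_t byte, uint8_t stored_parity)
    {
        return parity_bit(byte) == stored_parity;  /* mismatch means a parity error */
    }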

With the switch to 64-bit computers, memory modules have increased their bus width to 64 bits, but often have a 72-bit-wide internal architecture. The additional 8 bits can be used for a more advanced scheme called error checking and correction (ECC), which is able to correct single-bit errors (usually when the memory is read) and to detect, but not correct, errors in a larger number of bits. Note that this is achieved with the same amount of redundancy (9 bits for every 8) as the old parity check, but because the bits are checked in groups of 64+8 instead of 8+1, a more advanced algorithm can be used, which enables the correction.
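
To show how the extra bits enable correction rather than just detection, here is a toy-sized single-error-correcting, double-error-detecting (SECDED) Hamming code over 4 data bits. DIMMs use a 64+8-bit code of the same family; this sketch uses the GCC/Clang __builtin_parity intrinsic.

    #include <stdint.h>

    /* Encode 4 data bits as a Hamming(7,4) codeword in bits 0..6 plus an
     * overall parity bit in bit 7 (SECDED).  Codeword positions are numbered
     * 1..7; positions 1, 2 and 4 hold parity, the rest hold data. */
    static uint8_t secded_encode(uint8_t d)   /* d = 4 data bits */
    {
        uint8_t d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
        uint8_t p1 = d0 ^ d1 ^ d3;            /* covers positions 3,5,7 */
        uint8_t p2 = d0 ^ d2 ^ d3;            /* covers positions 3,6,7 */
        uint8_t p4 = d1 ^ d2 ^ d3;            /* covers positions 5,6,7 */
        uint8_t c  = (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                               (d1 << 4) | (d2 << 5) | (d3 << 6));
        return c | (uint8_t)(__builtin_parity(c) << 7);   /* overall parity bit */
    }

    /* Returns 0 = clean, 1 = single-bit error corrected, -1 = double-bit
     * error detected (uncorrectable).  *out receives the (corrected) data. */
    static int secded_decode(uint8_t cw, uint8_t *out)
    {
        uint8_t b[8] = {0};
        for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
        int syndrome = (b[1] ^ b[3] ^ b[5] ^ b[7])
                     | (b[2] ^ b[3] ^ b[6] ^ b[7]) << 1
                     | (b[4] ^ b[5] ^ b[6] ^ b[7]) << 2;
        int overall_even = (__builtin_parity(cw) == 0);   /* parity over all 8 bits */
        if (syndrome != 0 && overall_even)
            return -1;            /* two bits flipped: detect only */
        if (syndrome != 0)
            b[syndrome] ^= 1;     /* one bit flipped: the syndrome is its position */
        *out = (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
        return syndrome != 0;
    }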

In a similar way, file storage and communication use checksums. A similar concept, check digits, is used to detect errors in human data entry, such as in account numbers.
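
Check digits work the same way at human scale. For example, the Luhn algorithm used on payment card numbers fits in a few lines (a sketch):

    #include <stdbool.h>
    #include <string.h>

    /* Luhn check: from the right, double every second digit (subtracting 9
     * if the result exceeds 9) and sum everything; a valid number's sum is
     * a multiple of 10.  Catches any single-digit typo and most adjacent
     * transpositions. */
    static bool luhn_valid(const char *digits)
    {
        int sum = 0, dbl = 0;
        for (int i = (int)strlen(digits) - 1; i >= 0; i--) {
            int d = digits[i] - '0';
            if (dbl) { d *= 2; if (d > 9) d -= 9; }
            sum += d;
            dbl = !dbl;
        }
        return sum % 10 == 0;
    }

    /* e.g. luhn_valid("79927398713") is true; changing any one digit makes it false. */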

\$\endgroup\$
5
  • 3
    \$\begingroup\$ ECC memory modules existed long before 64-bit computers. The width of the memory interface isn't related to the width of general-purpose registers in the CPU. \$\endgroup\$
    – Nayuki
    Commented May 6 at 13:56
  • 1
    \$\begingroup\$ Note that ECC RAM has never been common in consumer computers; it's mostly used in industrial control equipment and servers, where RAM errors can cause more severe issues. Your personal laptop, desktop, tablet, smartphone, etc almost certainly does not use ECC RAM. \$\endgroup\$
    – Hearth
    Commented May 6 at 15:49
  • 1
    \$\begingroup\$ @Hearth: Almost all flash storage devices, however, have been using ECC for ages, since it's much cheaper to construct a flash where the probability of two or more bits going bad within a page is acceptably low than to construct one where the probability of even a single bit going bad is acceptably low. \$\endgroup\$
    – supercat
    Commented May 6 at 19:35
  • \$\begingroup\$ @Nayuki Agreed ECC memory modules existed before 64-bit processors, but my main point is the overhead for error correction increases (typically, depending on the algorithm) as only the log of the size of the chunk of bits. Hence it's possible to do ECC with a 64+8 arrangement but not with an 8+1 arrangement. Also my examples are typical ones from PCs and servers, which commonly had/have the architecture described (though there are plenty of other architectures that have been used on different types of computers.) \$\endgroup\$ Commented May 7 at 18:12
  • 1
    \$\begingroup\$ Note that error detecting/correcting codes were used even at the inter-chip level on some mainframes. There was a time when a third of the hardware in an IBM machine might be dedicated to checking the other two thirds, correcting any single-bit error, detecting the vast majority of multiple bit errors, reporting the fault location down to a single chip for fast repair, and supporting hardware diagnostics both during manufacture and in the field. That is not an exaggeration. (And is one of the reasons IBM sneered at the Cray, which had nowhere near that level of accuracy guarantee.) \$\endgroup\$
    – keshlam
    Commented May 7 at 20:28
2
\$\begingroup\$

As I understand it, critical aviation systems include:

  • Triple systems with voting and failsafe operation, as referenced in winny's answer.
  • Systems using multiple architectures and separately developed software. Imagine writing Microsoft Word or another large application for an Intel-based Windows system and an ARM-based Mac. But in addition to targeting different CPU architectures, have two separate teams develop the software with no shared source code, both writing to an extremely detailed specification. That's how the most critical systems are developed. (Except with real-time operating systems, not Windows or Mac OS.) This provides not only redundancy if a transistor, chip, power supply or other component (large or small) fails, but also provides protection for coding errors that would be hidden if multiple systems all ran the same code, as well as architectural issues (perhaps some unknown edge case with caches and memory access) specific to a particular CPU line.
\$\endgroup\$
2
\$\begingroup\$

As many other posters have mentioned, typical CPUs do not have hardware controls for invalid computations. Many of them would simply lock up in case of problems. Sometimes a hardware reset (wiping volatile memory) of the whole computer or certain components clears the issue. Computers as larger systems - comprising memory, storage, wiring/slots/contacts, further controllers, etc. - do have such controls. Mission-critical ones are redundant: they do the same work several times and compare the outcomes.

In software, important things are covered by "assertions" - that is, verification that insane results do not appear in critical parts of code, like a = 2 + 2; assert(a == 4); - otherwise the program crashes/restarts by design, as the inputs/logic are not trustworthy. Generally it is safer to let programmers or sysadmins figure out the problem in real time than to corrupt data on a "garbage in, garbage out" basis if the work were to proceed.
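
In C that pattern looks like this (a trivial sketch; real assertions guard less obvious invariants such as pointer validity, list consistency or value ranges):

    #include <assert.h>
    #include <stdint.h>

    /* Compute, then assert the invariant before the result is used further;
     * if it does not hold, the program stops instead of propagating garbage. */
    static uint32_t checked_add(uint32_t x, uint32_t y)
    {
        uint32_t sum = x + y;
        assert(sum >= x && sum >= y);   /* trips on unsigned wrap-around (or a flipped bit) */
        return sum;
    }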

There are programs designed to stress different parts of the computer, such as calculating Pi to the greatest precision possible, or rendering difficult scenes on the GPU (and comparing the rendered image to a stored reference to count the artifacts), or just writing known patterns into RAM and checking that they are read back successfully. This is most often employed by overclockers, who push their hardware to speed (and thermal) limits above what the vendor specified as suitable for stable work. In fact, vendors do the same - testing, say, one chip out of a hundred from the same production batch and marking the whole batch based on the worst results (stable at lower frequencies/temperatures), so lucky buyers have a chance of getting individual devices that are more capable than what is written on them. This is also done as part of troubleshooting - Brownian motion does gnaw away at the chips and their longevity, so eventually they become a mess of atoms that are no longer arranged as transistors. That can be tested with methods like this, whether by software or as part of the firmware, on demand or as a background maintenance job done by the device itself.
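
A stripped-down version of the RAM pattern test idea (a sketch only; real memory testers also have to defeat CPU caches and walk addresses in particular orders to provoke specific failure modes):

    #include <stddef.h>
    #include <stdint.h>

    /* Write known patterns into a buffer and read them back, counting
     * mismatches.  'volatile' keeps the compiler from optimising the
     * reads and writes away. */
    static size_t pattern_test(volatile uint64_t *buf, size_t words)
    {
        static const uint64_t patterns[] = {
            0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL,
            0xAAAAAAAAAAAAAAAAULL, 0x5555555555555555ULL,
        };
        size_t errors = 0;
        for (size_t p = 0; p < sizeof patterns / sizeof patterns[0]; p++) {
            for (size_t i = 0; i < words; i++)
                buf[i] = patterns[p];
            for (size_t i = 0; i < words; i++)
                if (buf[i] != patterns[p])
                    errors++;                 /* a stuck or flipped bit */
        }
        return errors;
    }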

As I have dabbled with the ZFS filesystem almost from its beginning, I remember the pitches from its engineers about why they went for an integrated construct instead of the separate layers (disks, virtual pool management/redundancy, logical volumes, filesystems...) practiced for decades before that. One crucial thing this allowed was end-to-end checksumming and redundancy managed holistically.

Remember I've mentioned wires and numerous controllers as parts of computers? "Passive" components do act up; it is a matter of time and statistics: cables are antennas for EMI noise from fans and other motors (in the olden days, you could even hear an HDD or CD-ROM spin up in your wired headphones); sockets get dusty and corroded over the years, causing ambiguous voltage levels (was it a 0 or a 1?) or micro-sparks (unplug/plug cycles do help to scratch the bad top layer off the contact and let it work through a few more years of oxidation); bits in RAM or on disk (or nowadays in the tiny electron cages of SSD cells) get randomized by external energy (space radiation, general heat and Brownian motion) or quantum effects (tunnelling through a nominally impenetrable potential barrier just because we are a wave and can sometimes do that). Typical DRAM also holds its information for only a short time, being quick to read, write and forget, so the hardware constantly refreshes the chips just to keep the memories fresh (which causes periodic inaccessibility for other work, which can be noticeable in real-time applications and may require those devices to use a different memory technology altogether).

This is largely why optical cabling is used - even in storage systems (to the extent that Fibre Channel disks were connected by optical cable right from the HDD to the backplane of their rack box and beyond, all the way to a server that used them, or part of them, in a NAS/SAN setup) - to eliminate EMI.

Semi-active components like an undersized PSU stretched close to its limits can also be a problem (when there's a burst of load, its voltage momentarily drops, and some 0s and 1s become hard to distinguish), even before we count the EMI noise it can produce or its dried-up capacitors acting up.

Active components are even worse: way too many chipset manufacturers cut corners because some part of the spec is too complicated to implement/test and too few everyday users hit any problems - and hey, it is cheaper (more margin/bonus for those who save a buck)! A glaring example is "reset" support on USB hub devices, which is supposed to programmatically recycle the connection to just one device but often resets the whole hub or bus, disrupting everyone on it.

The same things did in fact happen with expensive storage backplanes (almost-passive boards into which your 48 disks are plugged and then distributed to smart storage controllers), and more so with the less-expensive ones (IIRC SATA did not really have a concept of single-device/port reset... or just nobody implemented it... but my memory is vague on this). So the ZFS community did in practice have to work around situations like "That one HDD did not respond for 30 sec, let's reset its link. Oops, there go all the other disks becoming AWOL... and coming back after a while!" or sometimes reset storms ("...a disk went AWOL, let's reset it!" times 48 or so) causing outages too long to gloss over.

Thinking of the capacitors mentioned earlier - those also have a role to play in storage (or not, if absent), such as safely flushing buffered writes into the HDD or SSD when external power disappears. For efficiency, randomly targeted writes are typically "cached": they live in the device's little RAM chip for a while and are then burst out sequentially when a large enough block has been collected. A loss of power can leave either partially written data committed to long-term storage, or worse - random garbage in random places. Bummer if that was your filesystem structure information.

And regarding hardware whose makers cut corners - controllers capable of caching do add (and typically implement) commands to "flush" queued writes or otherwise fence them, so that software consumers like the operating system can "guarantee" that one transaction lands on long-term storage before starting another operation. This is critical, for example, for filesystem metadata writes, where the obvious performance hit of un-cached writes is the lesser evil. Some controllers do lie (e.g. to win benchmarks) and report a flushed write as completed as soon as the request is received, while it is still cached and subject to getting lost or written out of order.
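
From application or filesystem code, the fence typically looks like a flush request between dependent writes. A minimal POSIX sketch (assuming the controller honours the flush rather than lying about it):

    #include <unistd.h>

    /* Write data, then ask the OS (and, through it, the drive) to flush its
     * caches before we write metadata that refers to that data.  If the
     * controller lies about the flush, this ordering guarantee evaporates.
     * (Partial-write handling is omitted to keep the sketch short.) */
    int write_then_fence(int fd, const void *data, size_t len)
    {
        if (write(fd, data, len) != (ssize_t)len)
            return -1;
        if (fsync(fd) != 0)          /* the "flush / fence" request */
            return -1;
        /* only now is it reasonably safe to write the referring metadata */
        return 0;
    }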

Redundancy comes in RAM (from ECC adding a few bits per byte to protect from single or dual bit flips, to whole RAID-like setups allowing a complete RAM module to die off), in storage (many HDDs storing same info), in controllers (those HDDs connected to different boards, so the whole board and its 8 or so disks can disappear without immediately fatal results) and of course in independent computers spread across the globe (what if a data center has a fire and all its devices get fried?)

Checksumming, at least as used in ZFS (the example I know best), means a hash of each content block is computed as soon as it is queued for writing and stored in the data tree (so this hash becomes part of the content of a metadata block, and up the tree this goes until the "uber-block", which describes the current state of the storage system). This does assume that the RAM (or even the CPU cache) and the CPU logic were trustworthy in the split second that the content block was assembled and its checksum was calculated, but the rest of the system is effectively expected to lie sometimes (or "bit-rot" over time). Law of large numbers, bro.

The content and metadata blocks are scheduled separately and pass over the untrustworthy cabling, sockets and controllers into the disks. Metadata is typically written at least twice, at LBA addresses distant from each other and from the contents - so on rotating disks, where this meant physically distant areas, a disk head crashing into the surface or some other failure mode would hopefully damage only one copy. A completed ZFS transaction involves writing four uber-blocks in different parts of the device/partition (with checksums and locations of further metadata), so you can almost always find the newest reliable state of your data tree - if those metadata writes do happen as sequentially as they are queued and "flushed".

Whenever a read of a ZFS metadata or content block happens, all the way through the tree, we know the expected checksum of the expected bit-length of the content (in a larger "cluster" if one is not fully populated by your smaller write) - and so the checksum of the read block is calculated and compared to the expectation. If they do not match, you instantly know something was corrupted and is no longer trustworthy. If you have redundancy (e.g. mirroring, as enforced for the metadata part of the tree), you can read the other copy and hope (and check!) that it is intact - in that case, overwrite the bad copy with the good one. If not, you can at least report an error and not have the user trust that the garbage they received is what they originally stored and expected to read back. Maybe they have a backup?..
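
The read-side logic, reduced to its bare bones, might look like the sketch below. The hash function and the two in-memory "copies" are stand-ins for the real machinery (ZFS stores fletcher4 or SHA-256 checksums in the parent block, and the copies live on different devices), but the verify-then-repair structure is the point.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* 64-bit FNV-1a hash, standing in for the real block checksum. */
    static uint64_t block_hash(const uint8_t *p, size_t n)
    {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 0x100000001b3ULL; }
        return h;
    }

    /* Read a mirrored block: 'expected' comes from the parent metadata.
     * A copy that fails the check is repaired from the one that passes;
     * if both fail, report an error instead of returning garbage. */
    static int read_verified(uint8_t *copy0, uint8_t *copy1, size_t n,
                             uint64_t expected, uint8_t *out)
    {
        if (block_hash(copy0, n) == expected) {
            memcpy(out, copy0, n);
            if (block_hash(copy1, n) != expected)
                memcpy(copy1, copy0, n);      /* self-heal the bad mirror */
            return 0;
        }
        if (block_hash(copy1, n) == expected) {
            memcpy(out, copy1, n);
            memcpy(copy0, copy1, n);          /* self-heal */
            return 0;
        }
        return -1;   /* both copies corrupt: surface the error */
    }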

Systems like this also practice regular "scrubbing" - e.g. weekly reads of the data pool just to confirm that all checksums match, and to repair whatever possible if they do not. (Similar to refreshes that RAM chips do many times a second - with repairs possible in the ECC and/or RAID-like setups).

\$\endgroup\$
1
\$\begingroup\$

In modern general-purpose computers, the main thing that keeps most computational errors from going unnoticed is the profound, multi-layer complexity of modern general-purpose software itself.

Such a house of cards is profoundly sensitive to data corruption: every circuit and every basic routine is exercised in a number of different roles, many times per second. E.g. a single bit in memory could be part of a pointer, part of some data, or a state-machine indicator at any particular moment.

If, for whatever reason, a single bit in memory flips once in a while, it will sooner or later flip in a way that makes some important data structure self-inconsistent. This ultimately ends with the user being served one of the various "screens of death".

Such a computer is naturally considered untrustworthy and is subject to repair or disposal.

The same "selection pressure" generally applies to the basic operating system layer and the basic software library routines (e.g. the process scheduler, memory allocation, device drivers, file systems and the like).

Things that are exercised hundreds or thousands of times per second in various use cases (including many corner cases) are either sane and consistent, or fail in an obvious manner.

\$\endgroup\$
0
\$\begingroup\$

On your personal or work computer: get a computer from a reputable company that doesn't build hardware that runs out of spec to be one percent faster than the competition and becomes unstable. (Intel currently has lots of problems with motherboard makers who intentionally exceed design specs, and then you get a computer that works fine for an hour but not longer.) Make sure you have decent cooling, including keeping the computer away from the radiators in your home. No voltage spikes or similar. No super-strong magnets attached to the case. If your roof is leaking, fix it. Keep hot drinks away from your computer, and even more so sugary drinks. Get RAM and hard drives / SSDs from a reputable supplier. Just common-sense things really, and your computer will run just fine.

A single transistor not working correctly will very likely either have no effect whatsoever or make your computer crash. Say you move a 64-bit pointer from register A to register B. Flipping a single bit in that pointer during the move - if the pointer is used - has a good chance of crashing your computer. And a crash means no computational errors :-(

(I once worked at a place where, out of 7 computers, four had incorrect RAM supplied. The effect was that everything was fine as long as you kept working; if you got up, got a cup of coffee and came back five minutes later, it crashed as soon as you touched the keyboard. It was very annoying and time-consuming but didn't cause any real problems.)

But for this kind of machine, errors are prevented by using quality parts with very, very low failure rates on the manufacturer's side and by providing a decent environment for the computer on your side. There are other answers in case this isn't good enough.

\$\endgroup\$
