19

According to https://en.wikipedia.org/wiki/Cray-1:

The Cray-1 was built as a 64-bit system, a departure from the 7600/6600, which were 60-bit machines (a change was also planned for the 8600). Addressing was 24-bit, with a maximum of 1,048,576 64-bit words (1 megaword) of main memory, where each word also had 8 parity bits for a total of 72 bits per word.[10] There were 64 data bits and 8 check bits.

It seems to me by the nature of parity, it should suffice to have one bit of overhead per word, rather than eight. I can understand on something like an 8088/87, you might be stuck with 1/8 because the memory system deals in eight bits at a time, but why is it that way on a 64-bit machine?

3
  • Every parity bit you add halves the error rate. Hence 8 bits divide it by 256. (Though as error correction was used as well, the improvement is not so good.) Commented Mar 6, 2019 at 17:24
  • 7
    8/64 = 1/8. Guess how many parity bits modern computers use for parity on bytes??
    – RonJohn
    Commented Mar 6, 2019 at 20:29
  • 1
    Isn't this the most common configuration of ECC? Even today, ECC DIMMs are also 64+8. Commented Mar 30, 2020 at 3:42

4 Answers

29

There were 64 data bits and 8 check bits.

It seems to me by the nature of parity, it should suffice to have one bit of overhead per word, rather than eight. [...]

What you refer to here is simple single-bit parity: one extra bit recording whether the count of ones in the word is even (even parity) or odd (odd parity). Such a mechanism can only detect an odd number of bit flips (1, 3, 5, ... bits flipping). An even number of flips cancels out, goes undetected, and results in silent computing errors.
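To illustrate, here is a minimal sketch (my own illustration, not anything Cray-specific) of single-bit even parity over a 64-bit word, showing that a double flip slips through undetected:

```python
# Sketch: single-bit even parity over a 64-bit word.
def parity_bit(word: int) -> int:
    """Even parity: 1 if the number of set bits is odd, so that
    word-plus-parity-bit always carries an even number of ones."""
    return bin(word & (2**64 - 1)).count("1") & 1

word = 0xDEADBEEFCAFEF00D
p = parity_bit(word)

# A single bit flip is detected: stored parity no longer matches.
flipped_once = word ^ (1 << 17)
assert parity_bit(flipped_once) != p

# But flipping any *two* bits restores the parity -- invisible error.
flipped_twice = word ^ (1 << 17) ^ (1 << 42)
assert parity_bit(flipped_twice) == p
```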

What the Cray uses is a parity system based on Hamming encoding. Encoding parity this way allows detection of multi-bit errors within a word and even correction of some of them on the fly. The 8-bit code used could correct single-bit errors (SEC) and detect double-bit errors (DED).

So while a machine with single-bit parity can detect a single flip, it will always miss a double flip. Further, even when an error is detected, the only option is to halt the program. With SEC-DED, a detected single-bit error is corrected on the fly (at the cost of maybe a few cycles), and only a multi-bit error halts the machine.
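For a hands-on feel, here is a toy SEC-DED Hamming code over 8 data bits (4 Hamming check bits plus 1 overall-parity bit). It demonstrates the same principle the Cray applied at 64+8 scale; the bit layout and function names are my own illustration, not the Cray's actual code:

```python
# Toy SEC-DED Hamming code: 8 data bits, 5 check bits (positions 0..12).
# Powers of two hold Hamming check bits; position 0 holds overall parity.

DATA_POS = [p for p in range(1, 13) if p & (p - 1)]  # 3,5,6,7,9,10,11,12

def encode(data: int) -> list[int]:
    bits = [0] * 13
    for i, p in enumerate(DATA_POS):
        bits[p] = (data >> i) & 1
    for c in (1, 2, 4, 8):                 # each check bit covers positions
        bits[c] = sum(bits[p] for p in range(1, 13) if p & c) & 1
    bits[0] = sum(bits) & 1                # overall parity enables DED
    return bits

def decode(bits: list[int]):
    syndrome = 0                           # nonzero syndrome = error position
    for c in (1, 2, 4, 8):
        if sum(bits[p] for p in range(1, 13) if p & c) & 1:
            syndrome |= c
    overall = sum(bits) & 1
    if syndrome and overall:               # single error: correct it
        bits[syndrome] ^= 1
        status = "corrected"
    elif overall:                          # overall parity bit itself flipped
        status = "corrected"
    elif syndrome:                         # double error: detect only
        return None, "double error"
    else:
        status = "ok"
    data = sum(bits[p] << i for i, p in enumerate(DATA_POS))
    return data, status
```

A single flipped bit decodes back to the original data with status "corrected"; two flipped bits come back as "double error" rather than silently wrong data.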

I can understand on something like an 8088/87, you might be stuck with 1/8 because the memory system deals in eight bits at a time, but why is it that way on a 64-bit machine?

Because it's still just 1/8th, but now with improved flavour :))

Considering the quite important function of invisible error correction, the question is rather why only 8. Longer codes would allow detection of wider errors and multi-bit correction. With the 1 Ki × 1 RAMs used (Fairchild 10415FC), any width could have been built. Then again, the Cray-1 architecture marks the switch to the 'new' standard of 8-bit units, so using 8 parity bits comes naturally. Doesn't it?
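In fact, 8 is also exactly what the Hamming bound demands: an SEC code over k data bits needs r check bits with 2^r ≥ k + r + 1, plus one more bit for DED. A short sketch (the helper name is mine):

```python
# Minimum check bits for SEC-DED over k data bits:
# find smallest r with 2**r >= k + r + 1 (Hamming bound for SEC),
# then add 1 overall-parity bit for double-error detection.
def secded_check_bits(k: int) -> int:
    r = 1
    while 2**r < k + r + 1:
        r += 1
    return r + 1

for k in (8, 16, 32, 64):
    print(k, "data bits ->", secded_check_bits(k), "check bits")
```

For 64 data bits this comes out to exactly 8 check bits, so the 64+8 layout is the minimum for SEC-DED, not just a nod to 8-bit units.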


Remark#1

Eventually it's the same development the PC took, except that instead of going from 9-bit memory (SIMM) over 36-bit (PS/2) to today's 72-bit DIMMs, the Cray-1 leapfrogged all of this and started with 72 bits right away.


Remark#2

Seymour Cray is known to have said that 'Parity is for Farmers' when designing the 6600. While this quote famously inspired the reply 'Farmers buy Computers' when parity was introduced with the 7600, not many know what he was referring to on an implied level: the Doctrine of Parity, a US policy to make farming profitable again during and after the Great Depression - a policy that to some degree still results in higher food prices in the US than in most other countries.


Remark#3

The Cray Y-MP of 1990 even went a step further and added parity to (most) registers. Also the code was changed to enable double-bit correction and multi-bit detection.

15
  • 4
    Cray certainly resisted parity and error checking hardware in the Cray-1, because it was a performance hit. AFAIK one (the first production?) Cray-1 was built without parity and delivered to a US government agency (can't remember exactly where), and it did have better benchmarked performance than any of the later production machines.
    – alephzero
    Commented Mar 6, 2019 at 12:16
  • 2
    @alephzero: Would parity have required a performance hit if its sole function was to sound an alarm in case of parity fault to notify the user that the output from the current job should not be trusted, as opposed to trying to prevent erroneous computations? Even if parity-validation logic wouldn't be able to indicate whether a fetch had received valid data until long after the data had already been used, it could still provide an extremely valuable pass-fail indication of whether the output from a job should be trusted.
    – supercat
    Commented Mar 6, 2019 at 19:09
  • 1
    @EdwardBarnard: You're saying the 10 cycle duration was for parity but not SECDED? If so, then unless there was some faster mode without any sort of parity protection, it sounds like you're saying there was only a performance hit if one needed to be able to recover from parity errors (as opposed to merely sounding an alarm).
    – supercat
    Commented Mar 27, 2019 at 3:13
  • 2
    @supercat: Memory access was either "vector mode" or "scalar mode", with access time a bit faster for vector mode - but still 1 clock period faster for Serial 1. There's a third mode, instruction fetch, not relevant here. This was literally wired into the hardware; no option to turn on or off. There WAS an option as to whether or not generate a hardware interrupt to report single, double, or both, but the single-bit-error-correction happened regardless of interrupt settings. I never worked with Serial 1 personally but did other CRAY-1's inside the operating system. Commented Mar 27, 2019 at 15:31
  • 1
    @EdwardBarnard: If parity were only needed for the purposes of sounding an alarm, I don't see why it should need to have any performance impact at all, given that one could have a "parity storage and monitoring" circuit which took the current address, data, and read/write status as inputs, and had an alarm output, but didn't influence main system behavior in any way whatsoever beyond minimal capacitive loading on the address and data bus lines. Even if computing parity from a word that was being read or written would take multiple cycles, so what? Use clocked registers to perform...
    – supercat
    Commented Oct 19, 2023 at 17:57
12

After the first Cray-1 was built, some calculation determined that the time between failures would be greatly extended by adding single-error-correction, double-error-detection (SECDED) without much cost in speed. The point is that with a large memory, random single-bit errors occur every few hours; with SECDED, failures come every few years or so.
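A back-of-envelope sketch of that kind of calculation, using a purely hypothetical per-bit flip rate (the real figures are not in this answer; only the memory geometry comes from the question):

```python
# Illustrative only: hypothetical flip rate, real Cray-1 memory geometry.
mem_words = 1_048_576          # 1 megaword, per the Cray-1 spec
bits_per_word = 72             # 64 data + 8 check bits
flips_per_bit_hour = 1e-9      # hypothetical flips per bit per hour

flips_per_hour = mem_words * bits_per_word * flips_per_bit_hour
print(f"single-bit flips: about one every {1/flips_per_hour:.1f} hours")

# An *uncorrectable* SECDED event needs a second flip landing in the
# same 72-bit word as an existing one -- roughly a factor of mem_words
# less likely, stretching the mean time between failures from hours
# toward years.
```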

3
  • Yes. Mean time between failure was a significant consideration. Multi-day runs for a single program were not uncommon. SECDED, allowing the machine to ride through flipped memory bits, was one of the factors enabling long runs without hardware failure. Commented Mar 27, 2019 at 0:21
  • 2
    Some time in 1977 or so the X1 register on the CDC6400 at Northwestern's Vogelback computer center failed in one bit. Gobs of files got corrupted. The on-site CDC engineer was able to repair the machine (don't know if he replaced the transistor or the module). Unfortunately the backup system had been misconfigured, so files couldn't be recovered. The center was shut down for a day while backups had to be rerun. If parity had been in place, the hardware breakdown wouldn't have caused such an issue.
    – kd4ttc
    Commented Oct 1, 2020 at 20:50
  • I also recall a program run that produced a faulty result. Reran the job and it worked without any change. Only time in my life I had a single-bit error affect a computer program.
    – kd4ttc
    Commented Oct 1, 2020 at 20:50
6

The extra bits are used to allow for error detection and correction (EDAC).

This scheme is described in detail in the Cray 1 Hardware Reference Manual, page 5-5 (~168).

The use of EDAC in the Cray-1 is rather ironic given that Seymour Cray is (in)famous for once saying

Parity is for farmers.

This, I think, is a reference to farm subsidies in Europe.

1
  • 6
    "Farm income parity" was a policy in 20th century US agriculture, probably topical in the 60s and 70s, so I suppose Cray was referring to that.
    – dave
    Commented Mar 7, 2019 at 23:56
1

SECDED means single-error correction, double-error detection. There were enough memory bit picks or drops to really benefit from the single-error correction, and if a module failed two bits in a word, it flagged an error. The performance hit was, say, 2 clock cycles, maybe 4, for the first operand set through the channel, but every clock after that an answer comes out - or so it did then. We built it into the first RAID arrays we sold to NASA around '88 or '89. That was a big win for the big data to come.
