How did restarts resolve parity errors triggered by broken core-rope wires?

Question

In his account of the history of the Apollo Guidance Computer, Don Eyles describes the role of a parity bit as follows (p. 80):

... if one of those hair thin wires in our woven core-rope memory happened to break, a parity failure might occur. Each of our fifteen-bit words of memory actually had an additional, sixteenth bit, called the parity bit, which the assembler set to a one or a zero to force the total number of ones among the 16 bits to be odd. When the computer accessed a word of memory it counted the bits that were set to one and if the result was not an odd number it would trigger a restart.

I understand how parity is working here to detect the broken wire; but it's not clear to me what restarting accomplishes. If the cause of the parity error was a broken wire, won't that wire still be broken after the restart, triggering another, and so on?

The quote doesn't seem to me to imply that the reset is curative. Is there more to the source that makes it seem so? — Wayne Conrad, Commented Jul 31, 2019 at 16:25
@WayneConrad Yes, the paragraph occurs in the context of a discussion of the way in which restarts of the AGC provided a means for continuing to operate in the face of a variety of error conditions (e.g. overloads). — orome, Commented Jul 31, 2019 at 16:28
I wouldn't think a permanent break would be fixed by resetting, but if, e.g., a mechanical jolt caused a connection to momentarily fail (either self-curing, or fixable by removing and re-inserting a module), resetting could restore the computer to operation. — supercat, Commented Jul 31, 2019 at 16:36
@supercat I agree. I also suppose that if the break is in a program you can simply not use, then you could continue to use the remaining, unbroken programs in the AGC. — Wayne Conrad, Commented Jul 31, 2019 at 17:04
@orome Many of the AGC programs were under pilot control. They only ran when called for. — Wayne Conrad, Commented Jul 31, 2019 at 19:43

DrSheldon · Accepted Answer · 2019-08-01 15:52:17Z

The Apollo Guidance Computer had fixed and erasable memory. The former (which is the focus of this question) contained instructions and constants, and was functionally equivalent to ROM. The latter was where data was stored, and was functionally equivalent to RAM. The response to parity errors in either memory was the same.
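For reference, the odd-parity rule in the Eyles quote can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not AGC code: the function names and the way the parity bit is kept separate from the 15 data bits are my own assumptions.

```python
WORD_MASK = 0x7FFF  # 15 data bits per word, as in the Eyles quote


def parity_bit(data: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) odd."""
    ones = bin(data & WORD_MASK).count("1")
    return 1 if ones % 2 == 0 else 0


def parity_ok(data: int, parity: int) -> bool:
    """True if the 16 bits read back contain an odd number of ones."""
    total = bin(data & WORD_MASK).count("1") + parity
    return total % 2 == 1


# Hypothetical word with two ones, so the stored parity bit must be 1.
word = 0b000000000000110
stored_parity = parity_bit(word)

# Simulate one bit reading back as 0 (e.g. through a broken wire).
damaged = word & ~0b10
```

Any single wrong bit changes the count of ones by one, so the check fails and a restart would be triggered. Two wrong bits in the same word would cancel out and go undetected.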

The effect of a failed memory bit is highly variable. In the worst cases, it could lock up the computer or branch to the wrong code. The most likely outcome would be an incorrect computation, with the computer otherwise proceeding normally. It's also possible that a bit error may be in code that would not be used, or in branch cases that did not get taken.

The AGC had several dozen programs, which the astronauts could select and run. The computer's response to a parity error was to reset the AGC to a state where the astronauts could select a new program (possibly even the same program they had been running). Data was preserved unless clearing it was absolutely necessary to reach this state. So the astronauts could try the same program, a different program, manual control (if applicable), or (for the lunar module) use the Abort Guidance System instead.

Would the same problem recur? Again, it depends heavily on the location of the bad bit. Some subroutines are used in virtually every program, and the problem would manifest itself again; other parts of the code execute only under very specific circumstances and may never be touched again. In the worst case, there were procedures for operating without the AGC, and the astronauts and ground crews practiced various contingencies in simulators before launch.
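A toy model may make the "it depends on the location" point concrete. Everything here is invented for illustration: the program names, the addresses, and the idea of describing a program as the set of fixed-memory addresses it reads are assumptions, not how the AGC was actually organized.

```python
# Hypothetical address of the one word whose wire is broken.
BROKEN = {0o2345}

# Hypothetical programs, each described by the addresses it reads.
PROGRAMS = {
    "P_A": {0o1000, 0o1001, 0o2345},  # touches the broken word
    "P_B": {0o1000, 0o1001, 0o3000},  # shares code with P_A but avoids it
}


def run(program: str, broken: set = BROKEN) -> str:
    """Return 'restart' if the program reads a word that fails parity, else 'ok'."""
    for addr in sorted(PROGRAMS[program]):
        if addr in broken:
            return "restart"
    return "ok"
```

In this model, `run("P_A")` restarts every time it is selected, while `run("P_B")` keeps working after the restart because it never reads the damaged word. If the broken word were at a shared address such as `0o1000`, both programs would fail.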

Although the computer was in the cabin, it was not practical to repair it during flight. Both types of memory were woven by hand under a microscope. The question posits a broken wire, which would require unweaving and then re-weaving that line of memory. Even if the astronauts had the tools and materials available, they would be more likely to cause more damage or to get the pattern wrong.

Patching the code was not an option. The fixed memory was physically permanent. Even if you could write code to the erasable memory, there was no mechanism to execute code at an arbitrary address.

No parity error actually occurred during the entire Apollo Program. A linked Space.SE question provides more information on parity errors.

So the answer is that the wire would (unless the problem was transient) still be broken and the memory location thus unusable. What happens next would depend on where the break was. But I don't really see a practical way to proceed. Could the crew, during a mission, locate the fault and avoid it? Can a single address be avoided, or would whole banks of memory need to be taken out of commission? — orome, Commented Aug 1, 2019 at 14:18
I have a paper on transient errors of the AGC, which I did not quote because the question was on broken core-rope. It and other sources indicate that parity errors were expected to be either mis-reads of a memory location, or a flipped bit of the erasable memory. Either is temporary and recoverable. Damage to the fixed memory would be permanent, and although the AGC was in the cabin, attempting to repair it would likely cause more damage. The alternative was to use something else that does not use the affected memory. — DrSheldon, Commented Aug 1, 2019 at 15:17
Patching the code was not an option. The fixed memory was physically permanent. Even if you could write code to the erasable memory, there was no mechanism to execute code at an arbitrary address. There was no power-on-self-test, because you need to be able to reboot the computer and try something else, without the computer getting in the way of that. — DrSheldon, Commented Aug 1, 2019 at 15:24
Could you add something to the effect that "Damage to the fixed memory would be permanent, and although the AGC was in the cabin, attempting to repair it would likely cause more damage" to the answer? — orome, Commented Aug 1, 2019 at 15:27
Would there have been any particular difficulty with having certain parts of the code check whether a certain word holds a particular "magic value" and--if so--branching to a small otherwise-unused section of RAM? That would create some failure modes, but would seem like it would increase the range of problems that could be fixed "on the fly". — supercat, Commented Aug 1, 2019 at 16:22
