$\begingroup$

The Discover Magazine article How New Horizons Survived the 40-Year-Glitch and Made it to Pluto is a little confusing. It wraps NASA history together with several different space missions to arrive at the "40-year" figure, even though New Horizons itself was launched only about eleven years ago.

The engineers have become so good at fixing problems that most of the time the public has no idea what they are up against—until something goes wrong, as happened to New Horizons last weekend, when a software glitch caused the probe to shut down into “safe” mode. For a moment, this was a news story. Then, once again, the engineers stepped up and solved the problem (caused by an obscure timing flaw in a command sequence sent to the probe in preparation for flyby). Within three days, all was back to normal.

In the linked NASA News item NASA’s New Horizons Plans July 7 Return to Normal Science Operations, the issue is also mentioned:

The investigation into the anomaly that caused New Horizons to enter “safe mode” on July 4 has concluded that no hardware or software fault occurred on the spacecraft. The underlying cause of the incident was a hard-to-detect timing flaw in the spacecraft command sequence that occurred during an operation to prepare for the close flyby. No similar operations are planned for the remainder of the Pluto encounter.

Q: Is the glitch described in any more detail anywhere? Was it purely a software/computing timing issue or did the timing involve comms or something mechanical as well?

$\endgroup$

1 Answer

$\begingroup$

I found a bit more detail by searching for "New Horizons Anomaly Review Board Report". The best writeup I found was here:

By 4 p.m., the mission's Anomaly Review Board had convened to be briefed on what had transpired and to discuss the best way forward. Midway through the spacecraft's recovery, they determined there was no fault in the hardware or software. But there had been a conflict when the spacecraft tried to commit to flash memory the complete command sequence for the nine days of the flyby, which it had just received from Earth, while at the same time compressing science data that had been gathered by its instruments. All that simultaneous activity had triggered an overload of the main computer, prompting the autonomy software to switch to the backup processor, point New Horizons toward Earth, and put the craft into safe mode.

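It sounds like the ground sent up a lengthy command sequence while the system already had a heavy workload compressing science data, and the combined load overwhelmed the main computer, which tripped the autonomy software into switching to the backup processor and entering safe mode.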

So the so-called "timing flaw" sounds like it was essentially a poor choice of when to send that command sequence.
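
To make the "bad timing" reading concrete, here is a toy sketch in Python. It is not the spacecraft's flight software, and every number in it is invented for illustration: the `Task` class, the load fractions, the time steps, and the `capacity` threshold are all assumptions. It only shows how two activities that are each fine on their own can exceed a processor budget when they overlap, and how moving one of them avoids the overload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical unit of onboard work (all values are made up)."""
    name: str
    start: int    # first time step the task runs (inclusive)
    end: int      # time step at which the task has finished (exclusive)
    load: float   # assumed fraction of processor capacity used while running


def first_overload(tasks, capacity=1.0):
    """Return the first time step where the summed load exceeds capacity, else None."""
    if not tasks:
        return None
    first = min(task.start for task in tasks)
    last = max(task.end for task in tasks)
    for t in range(first, last):
        total = sum(task.load for task in tasks if task.start <= t < task.end)
        if total > capacity:
            return t
    return None


def outcome(tasks, capacity=1.0):
    """Describe what a very simple autonomy rule would do with this schedule."""
    t = first_overload(tasks, capacity)
    return "nominal" if t is None else f"safe mode at t={t}"


compress = Task("compress science data", start=0, end=10, load=0.7)

# Flash write scheduled while compression is still running.
overlapping = [compress, Task("write command sequence to flash", 5, 8, 0.6)]

# Same flash write, scheduled after compression has finished.
sequential = [compress, Task("write command sequence to flash", 10, 13, 0.6)]

print(outcome(overlapping))
print(outcome(sequential))
```

In this sketch neither task is faulty in isolation, which matches the review board's wording that there was no hardware or software fault. Only the schedule differs between the two cases.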

$\endgroup$
  • 1
    $\begingroup$ OK great - this is a very clear and concise description of what happened. I guess "no fault of the hardware or software" means it showed no unexpected or erroneous behavior, but maybe it could be called an unanticipated or untested or un-modeled scenario? Thank you for finding this! $\endgroup$
    – uhoh
    Commented Nov 17, 2016 at 4:49
  • 2
    $\begingroup$ Yes, it sounds like the operating system could have been better designed to handle overloads - as the Apollo LEM computer was, during the famous '1202' alarms during the Apollo 11 landing. $\endgroup$ Commented Nov 17, 2016 at 14:40
