39

I've had a fear of flying for the past several years and one of the ways this irrational phobia manifests is concern over the airplane power-cycling in mid-air: engines cutting out, control equipment powering off, etc.

Occasionally a computer or smartphone will crash and reboot itself, or in a worse case, the power supply or battery will fail and the device will not turn back on.

Does this ever happen with commercial jets? I expect that they are electronically wired up in a fundamentally different way from computers, so that one system failure does not affect the rest, but I've never asked a pilot.


7 Answers

67

I'm a programmer and private pilot, so maybe I can help dispel some of those fears.

  1. The computers that run a commercial airplane are conceptually much simpler than the one that runs your phone. This means far less chance of a bug in the software, just because there's less for the programmer to keep track of.

  2. If your phone restarts, it doesn't imperil anybody's life. So, the testing and Q.A. for such a device is basically whatever the company wants to do. On the other hand, computers in aviation are much more thoroughly tested before they can be certified to fly.

  3. Similarly to point 1, the computers on an airplane each have only one job. A lot of the crashes on a typical PC or smartphone come from different apps stepping on each other's toes. (The operating system is supposed to keep each app apart so they can't do that, but point 2 applies to operating systems as well.)

  4. The airplane isn't one single computer, like your phone. Yes, your phone probably has multiple processors, each with multiple cores, but they're tightly coupled together in order to form one computer. The computers on an airplane are networked together, but they are separate computers. If your phone crashes, even if the guy sitting next to you is on the same network, it doesn't affect him, does it? Similarly, if the FADEC for one engine fails (a vanishingly rare occurrence, because FADECs are conceptually among the simplest of computers), all that will happen is that one engine will shut down, with no effect on the rest of the plane.

  5. On the other hand, the really important computers (such as fly-by-wire controllers) have multiple redundancies. So, if one fails, the rest of them can pick up the slack. Even the pilot wouldn't notice, except for the warning light that would show in the cockpit. (A simplified sketch of this idea follows below.)
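
To make point 5 a little more concrete, here is a minimal sketch in C of that kind of redundancy. Everything here is invented for illustration - the channel count, the healthy flag, and the function names are assumptions, not how any real fly-by-wire computer is written - but it shows the idea: a failed channel is simply skipped and a warning is latched for the crew.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_CHANNELS 3   /* three identical flight-control channels (assumption) */

typedef struct {
    bool healthy;        /* set false when the channel's self-test fails */
    double command;      /* control-surface command computed this cycle  */
} Channel;

/* Pick the first healthy channel; latch a warning if any channel is down. */
static double select_command(const Channel ch[], bool *warning_light)
{
    for (int i = 0; i < NUM_CHANNELS; i++) {
        if (!ch[i].healthy)
            *warning_light = true;   /* crew sees a caution, nothing else changes */
        else
            return ch[i].command;    /* first healthy channel flies the surface   */
    }
    return 0.0;                      /* all channels lost: hold neutral (sketch only) */
}

int main(void)
{
    bool warning = false;
    Channel ch[NUM_CHANNELS] = {
        { .healthy = false, .command = 1.2 },   /* channel 1 has failed its self-test */
        { .healthy = true,  .command = 1.3 },
        { .healthy = true,  .command = 1.3 },
    };

    double cmd = select_command(ch, &warning);
    printf("surface command %.2f, warning light %s\n", cmd, warning ? "ON" : "off");
    return 0;
}
```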

If you want to see what actually happens when things fail on an airplane, take a look at the following videos:

The pilot flew the airplane by hand while he shut down and restarted the avionics. The flight then proceeded normally.

The pilot actually let the autopilot have it for a few seconds, just out of curiosity as to where it would take them. But once it started banking more than it was supposed to, he disengaged the autopilot and took over manual control with no issue.

Engine stayed running despite losing all electrical power to the entire plane. He landed safely.

Edit: Well, just two hours ago as I type this, another video of an in-flight failure just happened to pop up in my YouTube feed:

He switched to his backup generator, and other than an annoying whine in the headphones, had no other issue with the flight.

10
  • 14
    Any reason why you included spoilers in your answer?
    – Valay_17
    Commented Jan 3, 2020 at 8:25
  • 34
    I, for one, like the suspense of not knowing how each of those would turn out! Commented Jan 3, 2020 at 14:24
  • 26
    @Valay_17 Just in case they wanted to watch the videos themselves without knowing what happens. Commented Jan 3, 2020 at 14:36
  • 7
    Not only are the fly-by-wire computers redundant, they're built using different architectures and programmed by different teams to minimize the chance of a problem affecting all the computers at once. The A320, for example, has two Intel FBW computers and two Motorola FBW computers.
    – Mark
    Commented Jan 3, 2020 at 22:05
  • 11
    @AlphaCentauri - here on Earth we don't need to poke into some other process's space to "step on each other's toes" - ask another app to open a file with bad data or a bad file name, launch the browser on a site with a 0-day exploit, or push an accessibility API too far... There are plenty of official IPC channels between processes that can cause issues even unintentionally. And the cost of testing to aviation (or any life-critical) standards is far too prohibitive for any general-purpose OS to qualify (at least in its default consumer configuration). Commented Jan 3, 2020 at 22:53
27

Before we start, it's important to say that your concern is not irrational. If this were to happen, or if your plane's control systems were to otherwise malfunction in a dangerous manner, your life would genuinely be in danger.

You aren't the first person to have thought of this, though. For this reason, we have a category of control systems we describe technically as safety related, and there is an entire branch of engineering called safety engineering dedicated to formally assessing these systems and trying to prevent accidents. This includes airplane control systems, but also anti-lock brakes, medical devices, and any other system where people could be harmed by it going wrong. The degree to which people can be harmed by this is formally assessed as a Safety Integrity Level based on risk. The risk is a combination of how likely the event is, how bad the outcome will be, and whether the people involved can take any mitigating action, and it is assessed for every way a safety related system can misbehave.

Note that this assessment may not be as intuitive as you'd think. I once worked on a chaff and flare dispenser system for military aircraft. You would think that the risk of failing to fire countermeasures and the pilot being shot down would be your major risk - but the safety assessment (we used an FMEA) showed that the pilot had other mitigating options such as armour and an ejector seat, getting shot down is a chance they'd already accepted when they took the job, and the risk of a crashing plane hitting buildings was minuscule and something that had already been institutionally accepted as part of having an air force. The most serious risk was actually that the system would misfire whilst an armourer was reloading it, because then they'd get a volley of 36 shotgun shells to the head at close range. The armourer did not sign up to taking that chance, and there was no practical way to protect them. As a result, our system had to default to not firing if there were any discrepancies.
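
Purely as an illustration of how severity, likelihood and controllability combine into an integrity level, here is a toy sketch in C. The scales, thresholds and the multiplication are all invented for this example - real standards such as IEC 61508 or ARP4761 define their own categories and tables - but it mirrors the dispenser assessment described above.

```c
#include <stdio.h>

/* Toy scales - real standards define their own categories and tables. */
typedef enum { MINOR = 1, MAJOR = 2, HAZARDOUS = 3, CATASTROPHIC = 4 } Severity;
typedef enum { REMOTE = 1, OCCASIONAL = 2, FREQUENT = 3 } Likelihood;
typedef enum { EASILY_MITIGATED = 1, HARD_TO_MITIGATE = 2, NO_MITIGATION = 3 } Controllability;

/* Higher level -> more rigorous development and verification required. */
static int integrity_level(Severity s, Likelihood l, Controllability c)
{
    int score = (int)s * (int)l * (int)c;
    if (score >= 24) return 4;   /* most demanding level  */
    if (score >= 12) return 3;
    if (score >= 6)  return 2;
    return 1;                    /* least demanding level */
}

int main(void)
{
    /* The dispenser example above: a misfire during reloading is catastrophic
       for the armourer, plausible, and they cannot mitigate it at all. */
    printf("misfire while reloading -> level %d\n",
           integrity_level(CATASTROPHIC, OCCASIONAL, NO_MITIGATION));

    /* Failing to fire in combat: severe, but the pilot has other mitigations. */
    printf("failure to fire         -> level %d\n",
           integrity_level(HAZARDOUS, REMOTE, EASILY_MITIGATED));
    return 0;
}
```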

There are many ways to ensure reliability. Redundancy is the most popular one. You can have multiple sensors in multiple locations, so the system can always work out what's going on if one (or more, perhaps) should fail. There are usually multiple actuators for important flight surfaces, or multiple flight surfaces where the aircraft can remain in control if one or more are damaged. Passenger planes generally have multiple engines too, and multiple fuel tanks which can be isolated from each other in case of damage. In a number of cases there may be multiple control systems which "vote" on the right action, so one malfunctioning unit will be ignored. In the extreme case, each control system may even have been programmed by a different software team, so that a bug in one team's software is extremely unlikely to be present in another team's software. And there may be other backup systems in place such as mechanical controls.
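
A minimal sketch in C of the "voting" idea mentioned above, assuming three sensors and simple median selection (the function names and values are invented; real systems use far more elaborate monitoring and fault isolation): a single faulty input is simply outvoted.

```c
#include <stdio.h>

/* Median-of-three voter: one faulty sensor cannot win the vote. */
static double vote(double a, double b, double c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

int main(void)
{
    /* Two airspeed sensors agree; one has failed and reads garbage. */
    double airspeed = vote(253.1, 9999.0, 252.8);
    printf("voted airspeed: %.1f knots\n", airspeed);   /* 253.1 - the bad reading is ignored */
    return 0;
}
```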

Another good mitigating method is training. It's perfectly acceptable for things to go wrong if the people operating the system are able to deal with the failure and keep going. It's important not to underestimate how good people can be. People can and do also cause failures, so training can also be a case of telling them "don't do that". Large aircraft are relatively slow to respond to controls, so it's common for pilots to overcorrect and make things worse. For some commercial aircraft, the standard response taught to pilots in case of instability is to let go of the stick and allow the aircraft to correct itself.

It's worth noting that both these factors are why the Boeing 737 MAX disasters are so bad, to the extent that there should be criminal charges brought against the people individually and the organisation collectively. The system concerned did not use redundant inputs, even though they were available; the impact of the system failing to respond correctly was neither assessed nor mitigated; and the crew were given no training in how to deal with its failure, nor even told that it existed. In the UK, the crime of "corporate manslaughter" exists to prosecute exactly this kind of failure.

The other element to all this though is quality, so that you try to make sure the systems don't go wrong in the first place. The reliability of software is almost entirely dependent on the quantity of reviewing and testing that takes place. I'm currently working on software for scientific equipment, and I reckon to spend around 10-20% of my development time on testing. PCs and mobile phones will be about the same. When I worked on automotive and aerospace systems, this was entirely reversed - we reckoned to spend around 5-10% of our time on coding, 10-20% of our time on design, and the rest of our time went on reviewing and testing.

Change control is also radically more locked down. Microsoft may release an upgrade and then do damage control on the few cases where it misbehaves, and sneak in a few extra features at the same time. In safety-related development though, you don't change a single line of code without formal sign-off that (a) everyone understands what that change will do, (b) that this change fixes this bug and does not change anything else, and (c) that this change is even needed. Many bug triage sessions involve us spotting bugs where we eventually decide that the impact of the bug is tiny (perhaps we're 10ms later turning on a warning light for example), but the risk of trying to fix the bug could potentially be high if we happened to get it wrong, so it is safer for this trivial bug to stick around.

As the Boeing 737 MAX case shows us, all these processes are only worth a damn if people follow them. The processes do exist though, and they are best practice in an industry of tens of thousands of engineers worldwide, with plenty of international formal standards to establish them. Failing to follow these standards is almost by definition gross negligence, and most countries have laws which allow prosecution of people and companies who are negligent to this degree. Most engineers would like to do a good job anyway; but the laws ensure an organisation as a whole stays honest and doesn't cut corners.

7
  • 1
    Very good answer. You could, if you have time, elaborate a sentence or two on how the very processes of software and hardware development for safety-critical systems are strictly defined to ensure the safety of the finished product, e.g. as in IEC 61508 and then domain-specific norms. To actually have a proven, systematic, documented process which ensures that requirements are fulfilled, specifications are followed, reviews and testing are ensured, etc. is the most crucial difference to general product development. Commented Jan 4, 2020 at 9:39
  • @Peter-ReinstateMonica Thanks. All very true, of course. I was worried I'd gone on too long as it was though! :)
    – Graham
    Commented Jan 5, 2020 at 21:33
  • It might be worth noting that the test / quality process requires independence from the team that actually implemented the code in many cases. (Certainly true for DO-178 and DO-254.) Commented Jan 6, 2020 at 10:30
  • "The most serious risk was actually that the system would misfire whilst an armourer was reloading it, because then they'd get a volley of 36 shotgun shells to the head at close range" - Why was the armourer standing in the flare dispenser's line of fire while reloading it?
    – Vikki
    Commented Jul 15, 2021 at 2:04
  • 1
    @Vikki The design of the dispenser is basically a 6x6 grid of tubes in a rectangular frame. A flare or chaff cartridge goes in each hole, and then the armourer plugs the loaded frame onto the mounting panel which has the firing contacts. Inevitably this means he has 36 armed tubes facing him as he slides it home.
    – Graham
    Commented Jul 15, 2021 at 7:38
10

Your concerns are reasonable and justified. A mid-air shutdown or reboot would be catastrophic for an airliner, which is why engineers design the systems so that this scenario is practically impossible.

Electrical power

An airliner has multiple electrical power sources. Each jet engine has a built-in generator: when the turbine spins, electricity is generated. Each generator can be independently turned off should a problem arise. Most airliners also have an Auxiliary Power Unit, or APU. The APU can be started in an emergency to provide backup electrical and hydraulic power to the airplane, as was done in the famous Hudson River landing.

If everything fails (for example if the airplane runs out of fuel), limited electrical power can be provided by windmilling, either using the Ram Air Turbine (e.g. Boeing 777) or by windmilling the turbines themselves (e.g. Boeing 747), as the airplane slowly glides towards a landing spot.

Then there is the battery, of course, which is charged at all times. It can provide limited power in case of emergencies.

Computers

All airliners come with multiple flight control computers. The units are built by different manufacturers, on different CPU architectures, and run different source code. The chance of all units failing at the same time due to a bug or defect is very low. In the unlikely event that one of the units fails, the pilots can disconnect that unit from the rest of the system.

For example, the Airbus A320 has 2 Elevator Aileron Computers, 3 Spoiler Elevator Computers and 2 Flight Augmentation Computers. Each unit can be disabled should it malfunction.

Mechanical linkage

In the extraordinarily unlikely event that electrical power is completely lost, certain flight controls are linked to the cockpit via mechanical means and can be operated with human force. For example, the emergency procedure for a complete flight computer failure in the Airbus A320 is to land the aircraft using nothing but the rudder pedals, the elevator trim wheel and the throttles. This has never happened in history.

1
  • Might I also add that long oceanic flights such as London to New York would have to be ETOPS certified (plane, crew, airline, etc.). This ensures there is always a place for an emergency landing at any point in the flight if something still goes wrong. Commented Jan 3, 2020 at 8:13
5

As somebody who did software tests on an unimportant (class D, will explain soon) system for an airplane to be approved to land at civilian airports: in an airplane there is a strict hierarchy of how critical each software function is; the levels are defined in DO-178B.

  • Class A systems are assumed to be "failure free"; they are extremely well tested. These systems are not meant to reboot, and they typically will not turn off or take any other additional action upon an error condition (e.g. when an engine controller loses the connection to the flight deck, it will just remain at its last engine setting). Class A systems are developed under a high level of testing and scrutiny.
  • ...
  • For the system (class D) which I tested, the main logic was "if there is an error, send an error message, then get off the network, halt and wait for a reset from the cockpit". Even class D system tests include test procedures which are rarely used elsewhere (e.g. white-box tests with hardware emulators). This is still the best-tested software I have seen in my life.

The logic here is that a crash of an unimportant system will never hamper the functioning of the important systems (the pilot can choose when to reset the unimportant ones). Most systems in an airplane are doubly redundant. The microcontrollers and hardware architecture used are designed so that simple failures on a board are limited in impact. The main network (e.g. AFDX) is also redundant, and measures are taken at the interface so that software running wild cannot exceed its allotted use of the buses.
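
Here is a hedged sketch in C of the "report, get off the network, halt and wait for a reset" behaviour described for the class D box. The function names and the simulated reset poll are invented for the example; it only illustrates the fail-silent pattern, not any real unit's code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Placeholder I/O for the sketch - a real unit would use its avionics bus
   driver and a cockpit discrete input; these names are invented. */
static void send_error_message(const char *msg) { printf("ERROR REPORT: %s\n", msg); }
static void leave_network(void)                 { printf("transmitter disabled\n"); }

static bool reset_requested(void)
{
    static int polls = 0;
    return ++polls > 3;    /* simulated here; the real box would poll the cockpit reset line */
}

/* "If there is an error: report it, get off the network, halt, wait for a reset." */
static void fail_silent(const char *reason)
{
    send_error_message(reason);
    leave_network();
    while (!reset_requested())
        ;                  /* do nothing further - better silent than sending suspect data */
    printf("reset received - restarting from a known-good state\n");
}

int main(void)
{
    bool self_test_ok = false;          /* pretend the built-in self-test failed */
    if (!self_test_ok)
        fail_silent("self-test failed");
    return 0;
}
```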

Normal reset procedures are safe in the sense that they always leave the plane in a controllable state. An example of a wrong reset procedure was AirAsia Flight 8501: the pilot, unhappy with the results of the standard way of resetting the computers, powered down both flight control computers - which is not allowed while in the air.

4

Occasionally a computer or smartphone will crash and reboot itself

This can happen for two reasons: a software error or a hardware error. Either can cause the CPU to stop processing new instructions (i.e. a "hang") or cause the machine to reboot itself. The latter is close to a hang: the operating system detects that it cannot continue operating normally and issues a hardware restart.

The possible causes are endless, but the results come down to the same thing: the processor cannot execute any new operations and therefore cannot continue operating normally. This is suboptimal if human lives depend on its continuous functioning.

Hardware errors can be caused by degraded functionality, through damage or wear. For example a power supply that cannot deliver the required power at all times, or a memory module that is damaged through electrostatic discharge, causing random bits to "flip" (a 1 unintentionally being read as a 0 or vice versa).
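
One common defensive trick against exactly this kind of single-bit corruption is to store a critical value together with its bitwise complement and check the two copies on every read. This is a hedged sketch in C, not any particular avionics codebase; real systems also rely on ECC memory, CRCs and hardware monitoring.

```c
#include <stdint.h>
#include <stdio.h>

/* A critical value stored twice: once normally, once inverted. */
typedef struct {
    uint32_t value;
    uint32_t check;   /* always kept equal to ~value */
} guarded_u32;

static void guarded_write(guarded_u32 *g, uint32_t v)
{
    g->value = v;
    g->check = ~v;
}

/* Returns 1 and fills *out if the two copies agree, 0 if memory was corrupted. */
static int guarded_read(const guarded_u32 *g, uint32_t *out)
{
    if (g->value != (uint32_t)~g->check)
        return 0;          /* a flipped bit in either copy is detected here */
    *out = g->value;
    return 1;
}

int main(void)
{
    guarded_u32 target_altitude;
    guarded_write(&target_altitude, 35000);

    target_altitude.value ^= (1u << 7);   /* simulate a single bit flip */

    uint32_t alt;
    if (!guarded_read(&target_altitude, &alt))
        printf("corruption detected - fall back to a safe default or a redundant copy\n");
    return 0;
}
```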

Software errors are caused by programmer errors or installation errors. A clean operating system installation (Windows, Linux, macOS, ...) on your computer or smartphone will not crash, provided the hardware is working properly, is supported by the OS, and has the appropriate drivers installed so the OS can communicate with it. Sure, decades ago some OSes were prone to crashing after being up for a certain amount of time, but it's 2020 now; those issues have been ironed out of modern operating systems.

The problem with consumer-grade hardware and software is that it's not life-critical, not redundant, and people want to be able to install random applications on their devices, distributed by random software developers. You won't see a pilot opening the App Store on your Airbus mid-air and install the new Christmas Lights app to let the cabin lights blink festively, which just happens to stop the fuel pumps because the developer never tested it in flight.

So how do airplane manufacturers stop this from happening?

  • Redundancy: when one system stops (or its outputs lie outside valid values), there's a reserve system to take over.
  • Specialization: as opposed to general-purpose computers, the devices in an airplane have a very specific goal, and are built, installed, configured and tested for that goal.
  • Testing: you might bring your computer or phone to the shop for maintenance when it starts behaving erratically. That might be too late for that hardware, but usually they'll be able to recover your pictures. Buy new hardware and/or reinstall the OS and you're good to go. Planes are checked more regularly.
5
  • 4
    I love reading this Stack Exchange, but I don't know much about planes. I do know a thing or two about computers though, so I hope this answer is useful and on-topic.
    – CodeCaster
    Commented Jan 3, 2020 at 12:12
  • 3
    Modern operating systems like Windows are not used in flight computers. Linux might be used, but I think it's rare (because of the modifications needed to meet avionics standards). More common would be LynxOS or VxWorks, which are highly specialized systems designed to run on high-reliability avionics and other industrial systems, and aren't traditional OSes like Windows or Linux.
    – Ron Beyer
    Commented Jan 3, 2020 at 13:03
  • @Ron that was absolutely not what I wanted to imply, sorry if so! "Rebooting engine control for Windows Update..."
    – CodeCaster
    Commented Jan 3, 2020 at 13:05
  • 2
    @RonBeyer - apart from VxWorks Cert, Green Hills INTEGRITY is also used in safety-critical avionics applications. VxWorks is also very popular in avionics for mission-critical applications. Commented Jan 3, 2020 at 13:28
  • A desktop tends to hang on a crash so the user / developer can see that and maybe gain some info about why it crashed (or at least the fact that it did crash). When you have other requirements that take priority (keep operating), as many embedded systems do, you include a watchdog timer that reboots the system if the OS doesn't poke it every millisecond or whatever. Embedded systems typically don't take long to boot up. A classic example of something similar is: Did the 1202 error and associated reboot prevent disaster on Apollo 11 landing? Commented Jan 5, 2020 at 3:24
2

Regarding power cycling specifically (as opposed to software-based failures in general which other answers have detailed very well), it's not a problem at all. Unlike a typical PC or smartphone which takes time to turn back on and can lose data when power is lost, the control systems on an airplane are typically designed such that they will resume complete operation the second the power is restored.

Think of it more like your refrigerator internal temperature monitor than your personal computer. If you power cycle it, it will immediately begin functioning again as if nothing happened.
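
A hedged sketch in C of what such a "stateless" controller looks like, using the refrigerator analogy (the sensor and actuator functions are stand-ins invented for the example): every cycle reads its inputs fresh and computes its output from scratch, so a power interruption costs nothing but the cycles that were skipped.

```c
#include <stdio.h>

/* Stand-ins for real sensor and actuator I/O - invented for the sketch. */
static double read_compartment_temperature(void) { return 6.0; }   /* degrees C */
static void   set_compressor(int on)             { printf("compressor %s\n", on ? "ON" : "off"); }

int main(void)
{
    const double setpoint_c = 4.0;

    /* Each cycle depends only on what the sensor says *right now*.
       If power drops and comes back, the next cycle is just as valid as if
       nothing had happened - there is no accumulated state to lose. */
    for (int cycle = 0; cycle < 3; cycle++) {
        double temp_c = read_compartment_temperature();
        set_compressor(temp_c > setpoint_c);
    }
    return 0;
}
```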

2
  • 1
    In particular, apart from the auto-pilot and the navigation system, pretty much none of the systems are actually reliant on any form of storage, not even RAM. They deal with real-time sensor data. If they reset, no data is being lost, because they only work with data that is being captured in real-time anyway. And a misbehaving auto-pilot or NAV doesn't make the plane fall out of the sky, it just means that responsibility for where you are going and where you are now moves from the computer to the pilot. Commented Jan 5, 2020 at 18:51
  • $\begingroup$ @JörgWMittag Well strictly speaking, they all have RAM, even if it's only a few kilobytes. While it's possible to run a program without any RAM, it's too limited even for these purposes. Doing calculations on real-time sensor data, or even non-trivial arithmetic, requires more than just a few general purpose registers. $\endgroup$
    – forest
    Commented Jan 6, 2020 at 1:50
0

Reboots can also sometimes be acceptable if you can reboot quickly without losing critical information. This occurred as part of the Apollo 11 landing when they had the 1202 alarm:

He realized that the 1202 was a code meaning that the guidance computer on-board the landing craft was getting overloaded with tasks. The programmers had anticipated this overloading might someday occur, and so had established a system internal aspect that would automatically do a fast reboot and then a memory restore to try and get the computer back underway.
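
A heavily simplified sketch in C of that restart idea - this is not the AGC's actual mechanism, and the structure and names are invented - showing the general pattern: periodically checkpoint the few values that cannot be recomputed, and after a reboot restore them and carry on.

```c
#include <stdio.h>
#include <string.h>

/* The few values that cannot simply be re-read from sensors (assumption). */
typedef struct {
    int    phase;            /* which phase of the procedure we were in */
    double target_altitude;  /* commanded target                        */
} Checkpoint;

static Checkpoint saved;     /* stands in for protected or non-volatile memory */

static void save_checkpoint(const Checkpoint *c)  { saved = *c; }
static void load_checkpoint(Checkpoint *c)        { *c = saved; }

int main(void)
{
    Checkpoint state = { .phase = 3, .target_altitude = 150.0 };
    save_checkpoint(&state);                 /* done periodically during normal work */

    /* --- simulated overload: discard everything and "reboot" --- */
    memset(&state, 0, sizeof state);

    load_checkpoint(&state);                 /* restore and keep going */
    printf("resumed in phase %d, target %.0f\n", state.phase, state.target_altitude);
    return 0;
}
```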

5
    but this would not at all apply to the fly-by-Ethernet controls that operate the actual control surfaces, surely?
    – Fattie
    Commented Jan 3, 2020 at 17:25
  • If an error is detected, it is better in many cases for the system to be inoperative temporarily while it reboots rather than be stuck in a faulted state. Commented Jan 3, 2020 at 17:36
  • @Fattie sure it would. Better to do nothing for a moment, then do the right thing, than to do the wrong thing immediately.
    – hobbs
    Commented Jan 4, 2020 at 1:49
  • 2
    This seems more like a comment, since there's really no comparison between an ancient computer with magnetic core memory and rope memory and exceptionally limited computing requirements, and a complex modern machine with hundreds of microprocessors with incredibly complex ISAs and millions of lines of code written in multiple programming languages.
    – forest
    Commented Jan 4, 2020 at 11:42
  • 1
    @forest: Modern embedded systems do still have a watchdog timer that would reboot if the system locked up, instead of just sitting there like a desktop so a user can write down the error code (or notice that it crashed in the first place). But yes, the quote in this answer makes it sound like a primitive garbage-collection scheme. However, the Apollo 11 AGC did have a watchdog timer (they called it the "Nightwatchman" :P), and it was one of a few things that could trigger a reboot: How did the Apollo guidance computer handle parity bit errors? Commented Jan 5, 2020 at 3:30
