HAB Software Woes
John Graham-Cumming
September 2012

Or “My capsule didn't crash but my software did”
Background
 • > 30 years of programming experience
 • One HAB flight
    ◦ GAGA-1
http://blog.jgc.org/2011/04/gaga-1-flight.html
https://github.com/jgrahamc/gaga
Where's your flight's complexity?
 • Example: GAGA-1
    ◦ One balloon, parachute, polystyrene box
    ◦ Many metres of cord attached with knots
    ◦ An off-the-shelf camera
    ◦ 2,836 lines of code
    ◦ Common to see defect rates of 2 to 4 per KLOC (thousand lines of code)
    ◦ So GAGA-1 likely has 5 to 10 errors in it
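The estimate above is simple arithmetic. A minimal sketch, using only the two figures quoted on the slide (the function name is made up for illustration):

```python
LINES_OF_CODE = 2836          # figure quoted on the slide
DEFECTS_PER_KLOC = (2, 4)     # range quoted on the slide


def expected_defects(lines, rate_per_kloc):
    """Estimated number of defects in a code base of `lines` lines."""
    return lines / 1000 * rate_per_kloc


low = expected_defects(LINES_OF_CODE, DEFECTS_PER_KLOC[0])
high = expected_defects(LINES_OF_CODE, DEFECTS_PER_KLOC[1])
print(f"{low:.1f} to {high:.1f} expected defects")
```

Unrounded, this comes out slightly above the slide's "5 to 10", which is the same ballpark.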
Real Stuff Seen on HAB flights
 • Complete computer crash
 • Altitude going negative
 • Latitude and longitude garbled
 • Cutdown triggered in back of car
 • Long periods of no transmission
 • Not setting the GPS up before launch
 • Not turning the camera on
 • Running out of camera disk space
 • Altitude jumping around rhythmically
The Curse and Joy of Determinism
 • Computers do what you tell them to
    ◦ Precisely what you tell them to
    ◦ Not what you think you told them to do
 • A Curse
    ◦ Will do things you don't expect
    ◦ Will process bogus input without complaint
 • The Joy
    ◦ Easy to test that it does what's expected
HAB Is A Harsh Environment
 • Cold
 • Vibration
 • Stuff breaks in flight

 • Software needs to be able to cope with failing hardware
 • Very important to think about failure modes
 • YOUR CODE IS ON ITS OWN OUT THERE
Deadly Sins
 • The “It works!” Fallacy
 • The Last Minute Change
 • Being Far Too Clever
 • Overlooking Odd Behaviour
 • Copying Other People's Code
 • Assuming Finding A Bug Solves The Problem
The “It works!” Fallacy
 • If you're an inexperienced (and sometimes experienced) programmer…
    ◦ You hack some code together
    ◦ It works once
    ◦ You assume it will always work

 • Only solution to this is
    ◦ Testing
    ◦ Paranoia
The Last Minute Change
 • Never, ever change anything in code at the last minute, no matter how simple.
 • Example: HABE 1
    ◦ Complete camera failure
    ◦ Maximum integer size in uBASIC on CHDK is 999,999
    ◦ Last-minute change of an integer from 600,000 to 1,000,000 caused total failure
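A cheap pre-flight guard against this kind of failure is to scan a script for constants the interpreter cannot hold. The sketch below is only an illustration: the 999,999 limit is the one quoted on the slide, while the function name and the two script lines are hypothetical (the slide does not say which statement was changed).

```python
import re

UBASIC_MAX_INT = 999_999  # limit quoted on the slide for uBASIC on CHDK


def oversized_literals(script_text, limit=UBASIC_MAX_INT):
    """Return every integer literal in script_text that exceeds limit."""
    literals = (int(tok) for tok in re.findall(r"\b\d+\b", script_text))
    return [n for n in literals if n > limit]


before = "sleep 600000"    # hypothetical script line
after = "sleep 1000000"    # hypothetical last-minute edit

print(oversized_literals(before))
print(oversized_literals(after))
```

Running a check like this on every edit, however small, turns "no matter how simple" into something a machine enforces.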
Being Far Too Clever
 • Example: GAGA-1
    ◦ Entered the wrong value of 2 * pi in code to do GPS position conversion from radians to degrees

    ◦ Caught before flight because I verified the location of my own back garden

    ◦ Note to self: 2 * pi != 6.2818.


https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/gps.cpp#L113
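To get a feel for how much a mistyped constant matters, here is a small sketch. It is not the GAGA-1 code: the conversion function, the test latitude and the rough figure of 111 km per degree of latitude are assumptions chosen for illustration.

```python
import math

WRONG_TWO_PI = 6.2818   # the mistyped constant from the slide


def radians_to_degrees(rad, two_pi=2 * math.pi):
    """Convert radians to degrees using the given value of 2*pi."""
    return rad * 360.0 / two_pi


lat_rad = math.radians(51.5)             # hypothetical test latitude
good = radians_to_degrees(lat_rad)
bad = radians_to_degrees(lat_rad, WRONG_TWO_PI)
error_km = abs(bad - good) * 111.0       # roughly 111 km per degree of latitude

print(f"position error of about {error_km:.2f} km")
```

With these assumptions the error is on the order of a kilometre: too small to look obviously broken in a log, but easy to spot when checked against a location you know, such as your own back garden.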
Overlooking Odd Behaviour
 • Example: GAGA-1
    ◦ In tests RTTY output was fine some of the time, garbled at other times
    ◦ Turned out to be interrupts from the GPS messing up the RTTY timing
    ◦ Solution: disable GPS serial interface while sending RTTY string

 • ALWAYS BE HONEST WITH YOURSELF ABOUT YOUR CODE
 • EXPECT THE SPANISH INQUISITION!

https://github.com/jgrahamc/gaga/blob/master/gaga-1/flight/gaga1/tsip.cpp#L229
Copying Other People's Code
 • Don't do this, you have no idea what you are copying or who they copied it from
 • Better practice is to look at other people's code and…
    ◦ Write your own version
    ◦ That you understand
    ◦ That you are able to test
    ◦ Example: GAGA-1
       ▪ Read lots of people's RTTY code, wrote my own
https://github.com/jgrahamc/gaga/blob/master/gaga-
APRS Tracker using copied code

   If the altitude in metres contained an 8 or a 9, the altitude reported would be wrong

http://sharon.esrac.ele.tue.nl/users/pe1rxq/aprstracker/aprstracker.html
Assuming Finding The Bug Solves The Problem
 • Just because you've found A bug doesn't mean it was THE bug
 • Lots of research in computer science shows bugs tend to cluster
 • Example: CLOUD1, CLOUD2
    ◦ Three bugs in printing latitude, longitude and altitude
    ◦ One fixed on CLOUD1, …
“The One Thing I Didn't Test”




 http://ukhas.org.uk/guides:common_coding_errors_payload_testing
Common problems with microcontrollers (uC)
 • Lack of floating point support
 • Small integers
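Both problems tend to show up together: without floating point you scale values into integers, and the scaled values then overflow a small integer type. The sketch below simulates a 16-bit signed integer in Python. The helper names are made up, and the factor of 3.3 feet per metre is inferred from the expected values on the Unit Test Example slide.

```python
def to_int16(value):
    """Wrap value the way a 16-bit signed integer would."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value


def scaled_feet_int16(m):
    """m * 33 (that is, feet * 10) held in a 16-bit signed integer."""
    return to_int16(m * 33)


print(scaled_feet_int16(900))    # 29700 still fits in 16 bits
print(scaled_feet_int16(1000))   # 33000 does not fit, so the result wraps
```

The second result is negative, so an altitude of only 1,000 m already breaks this calculation if the intermediate value lives in a 16-bit integer.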
You might never be a great programmer…

… but you can be a paranoid tester!
Good Things To Do
 • No infinite loops
 • Self-Checking
 • Unexpected Error Handling
 • Handle Exceptions
 • Simulation
 • Simplify, Simplify, Simplify
 • Unit Test
 • Write Log Files
No Infinite Loops
 • Never sit in a loop waiting forever
 • Example: ATLAS 3
while (1) {
  // Make sure data is available to read
  if (Serial.available()) {
    b = Serial.read();

    if (bytePos == 8) {
      navmode = b;
      return true;
    }

    bytePos++;
  }
  // Timeout if no valid response in 3 seconds
  if (millis() - startTime > 3000) {
    navmode = 0;
    return false;
  }
}
}  // end of the enclosing function (its header is not shown on the slide)
             https://github.com/jamescoxon/Atlas-Flight-Computer/blob/master/Atlas3/Atlas3_3.pde#L
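The same pattern carries over to any language: every wait loop gets a deadline. A minimal Python sketch of the idea follows. The function name and the `read_byte` callable are hypothetical stand-ins for whatever non-blocking serial read is available, and the clock is injectable so the timeout path can be tested without waiting.

```python
import time


def read_nth_byte(read_byte, n, timeout_s=3.0, clock=time.monotonic):
    """Return the n-th byte (0-based) from read_byte(), or None on timeout.

    read_byte() is a hypothetical non-blocking reader that returns an int,
    or None when no data is waiting.
    """
    start = clock()
    pos = 0
    while clock() - start <= timeout_s:
        b = read_byte()
        if b is None:
            continue          # nothing waiting yet; the deadline still applies
        if pos == n:
            return b
        pos += 1
    return None               # timed out: the caller must handle this case
```

Returning a distinct "no answer" value forces the caller to decide what to do when the hardware stays silent, instead of hanging the whole flight computer.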
Self-Checking
  -- Now enter a self-check of the manual mode settings

  log( "Self-check started" )

  assert_prop( 49, -32764, "Not in manual mode" )
  assert_prop( 5,     0, "AF Assist Beam should be Off" )
  assert_prop( 6,     0, "Focus Mode should be Normal" )
  assert_prop( 8,     0, "AiAF Mode should be On" )
  assert_prop( 21,     0, "Auto Rotate should be Off" )
  assert_prop( 29,     0, "Bracket Mode should be None" )
  assert_prop( 57,     0, "Picture Mode should be Superfine" )
  assert_prop( 66,     0, "Date Stamp should be Off" )
  assert_prop( 95,     0, "Digital Zoom should be None" )
  assert_prop( 102,     0, "Drive Mode should be Single" )
  assert_prop( 133,     0, "Manual Focus Mode should be Off" )
  assert_prop( 143,     2, "Flash Mode should be Off" )
  assert_prop( 149, 100, "ISO Mode should be 100" )
  assert_prop( 218,     0, "Picture Size should be L" )
  assert_prop( 268,     0, "White Balance Mode should be Auto" )
  assert_gt( get_time("Y"), 2009, "Unexpected year" )
  assert_gt( get_time("h"), 6, "Hour appears too early" )
  assert_lt( get_time("h"), 20, "Hour appears too late" )
  assert_gt( get_vbatt(), 3000, "Batteries seem low" )
  assert_gt( get_jpg_count(), ns, "Insufficient card space" )
https://github.com/jgrahamc/gaga/blob/master/gaga-1/camera/gaga-1.lua#L96
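The same self-check idea, sketched in Python. This is not a port of the Lua script: `run_self_check` and the check list are hypothetical, and the sketch collects every failure rather than stopping at the first, which is one possible design among several.

```python
def run_self_check(checks, log=print):
    """Run (description, predicate) pairs; log and return the failures."""
    failures = []
    for description, predicate in checks:
        try:
            ok = predicate()
        except Exception as exc:      # a check that crashes counts as failed
            ok = False
            description = f"{description} (check raised {exc!r})"
        if not ok:
            log(f"SELF-CHECK FAILED: {description}")
            failures.append(description)
    return failures


# Hypothetical readings standing in for real sensor and camera queries.
checks = [
    ("Batteries seem low", lambda: 3100 > 3000),
    ("Insufficient card space", lambda: 10 > 500),
]
failures = run_self_check(checks)
```

A launch checklist can then refuse to proceed unless `failures` is empty.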
Self-Checking
 • Example: ATLAS 3
 • Makes sure uBlox GPS will work at high altitude; fixes it if not
    if ((count % 10) == 0) {
      digitalWrite(6, LOW);
      checkNAV();
      delay(1000);
      if (navmode != 6) {
        setupGPS();
        delay(1000);
      }
      checkNAV();
      delay(1000);
      digitalWrite(6, HIGH);
    }


https://github.com/jamescoxon/Atlas-Flight-Computer/blob/master/Atlas3/Atlas3_3.pde#L3
Unexpected Error Handling
    def temperature():
      t = at.cmd( 'AT#TEMPMON=1' )

      # Command returns something like:
      #
      # #TEMPMEAS: 0,28
      #
      # OK
      #
      # So split on whitespace first to isolate the 0,28 field
      # and then split on comma to get the temperature

      w = t.split()
      if len(w) < 2:
          logger.log( "Temperature read returned %s" % t )
          return -1000

      m = w[1].split(',')
      if len(m) != 2:
          logger.log( "Temperature read returned %s" % t )
          return -1000
      else:
          return int(m[1])


https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/util.py
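Separating the parsing from the modem call makes it easy to throw bogus replies at it on a desk. The sketch below is a variation on the code above, not the GAGA-1 source: `parse_temperature` is a hypothetical name, and the extra `ValueError` guard for a non-numeric field is an addition that the slide's version does not have.

```python
BAD_TEMPERATURE = -1000   # sentinel value used by the slide's code


def parse_temperature(reply):
    """Extract the temperature from a '#TEMPMEAS: 0,28' style reply."""
    words = reply.split()
    if len(words) < 2:
        return BAD_TEMPERATURE
    fields = words[1].split(",")
    if len(fields) != 2:
        return BAD_TEMPERATURE
    try:
        return int(fields[1])
    except ValueError:            # e.g. '#TEMPMEAS: 0,??'
        return BAD_TEMPERATURE


for reply in ["#TEMPMEAS: 0,28\r\n\r\nOK\r\n", "", "ERROR", "#TEMPMEAS: 0,??"]:
    print(repr(reply), "->", parse_temperature(reply))
```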
Handle Exceptions
 • If your language can generate exceptions then you'd better handle them!
 • Example: GAGA-1
    ◦ Recovery computer used Python
    ◦ Exception could have killed it
    ◦ Global exception handler
    except:
        logger.log( "Caught exception in main loop: %s" % sys.exc_info()[1] )

 • Bonus: What's wrong with that code?
https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/gaga-1.py#L144
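The slide leaves the bonus question open. One possible reading, offered here as an assumption rather than the author's answer, is that a bare `except:` also swallows `KeyboardInterrupt` and `SystemExit`, and that the logging call inside the handler can itself raise. A sketch of a main loop that guards against both (`run_forever`, `step` and `log` are hypothetical names):

```python
import traceback


def run_forever(step, log, max_iterations=None):
    """Call step() repeatedly; log any exception and keep going.

    max_iterations exists only so the sketch can be exercised in a test.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        done += 1
        try:
            step()
        except Exception:     # KeyboardInterrupt and SystemExit pass through
            try:
                log("Caught exception in main loop:\n" + traceback.format_exc())
            except Exception:
                pass          # a failing logger must never kill the loop
```

Logging the full traceback, not just the exception value, also makes post-flight debugging much easier.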
Simulation
 • Simulate a flight
 • Example: UKHAS wiki has an example of using a PC as a fake GPS
http://www.ukhas.org.uk/guides:common_coding_errors_payload_testing

 • Example: GAGA-1
    ◦ To test the embedded Telit module, I wrote modules that faked the entire Telit Python interface.
https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/GPS.py
https://github.com/jgrahamc/gaga/blob/master/gaga-1/recovery/MDM.py
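A fake can be very small and still exercise the flight logic from launch to landing. The class below is a hypothetical illustration, not the interface of the real Telit modules or of the GAGA-1 fakes: it replays a simple ascent to a burst altitude followed by a descent.

```python
class FakeGPS:
    """Stand-in GPS that replays a simple ascent followed by a descent."""

    def __init__(self, burst_altitude_m=30000, step_m=5000):
        up = list(range(0, burst_altitude_m + 1, step_m))
        self._profile = up + up[-2::-1]   # climb, then the same steps back down
        self._index = 0

    def get_altitude(self):
        """Return the next altitude in metres; hold the last value at the end."""
        alt = self._profile[self._index]
        if self._index < len(self._profile) - 1:
            self._index += 1
        return alt


gps = FakeGPS()
altitudes = [gps.get_altitude() for _ in range(15)]
print(altitudes)
```

Feeding a profile like this into the code that decides when to transmit, cut down or send a recovery message lets a whole flight be rehearsed on the bench.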
Simplify, Simplify, Simplify
 • Make your code as simple as possible
 • Never have 'duplicated' or 'copy and paste' code
 • Break it up into small functions that you understand
 • Make sure you understand the limitations of the functions you call
Unit Test
 • Break your program up into small, separate functions
 • Write tests that call each function and make sure it does what you expect.
 • Lots of ways to do this
    ◦ Use something like cpptest
    ◦ ArduinoUnit
    ◦ Write your own test program
Unit Test Example
 • In the bad APRS program
 • Turn the metres-to-feet code into a separate function: int m_to_f(int m)
    assertEquals(m_to_f(1000),3300)
    assertEquals(m_to_f(2000),6600)
    assertEquals(m_to_f(3000),9900)
    assertEquals(m_to_f(4000),13200)
    assertEquals(m_to_f(5000),16500)
    assertEquals(m_to_f(6000),19800)
    assertEquals(m_to_f(7000),23100)
    assertEquals(m_to_f(8000),26400)
    assertEquals(m_to_f(9000),29700)
    assertEquals(m_to_f(10000),33000)
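The same table of expected values works in any test framework. Here is a sketch using Python's `unittest`. The `m_to_f` body is an assumption: it uses the 3.3 feet-per-metre factor that the expected values above imply, and it is not the APRS tracker's code.

```python
import unittest


def m_to_f(m):
    """Metres to feet using integer arithmetic (factor 3.3, see lead-in)."""
    return m * 33 // 10


# Expected values copied from the slide; note the rows for 8000 and 9000,
# the digits that the copied APRS code got wrong.
EXPECTED = {
    1000: 3300, 2000: 6600, 3000: 9900, 4000: 13200, 5000: 16500,
    6000: 19800, 7000: 23100, 8000: 26400, 9000: 29700, 10000: 33000,
}


class MToFTest(unittest.TestCase):
    def test_table(self):
        for metres, feet in EXPECTED.items():
            self.assertEqual(m_to_f(metres), feet)
```

Saved in a file, the tests can be run with `python -m unittest <file>`.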
Write Log Files
 • Write detailed log files to non-volatile memory for post-flight debugging
 • Data sent via RTTY or APRS is limited
 • Log exceptions and errors in detail
 • Make sure you have a timestamp
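A minimal sketch of a logger that follows these points, assuming a Python-capable flight computer with an ordinary file system (the function name and the file name are hypothetical). Each line gets a UTC timestamp and is pushed out to storage straight away, so a crash or power loss costs as little of the log as possible.

```python
import os
import time


def append_log(path, message, now=time.time):
    """Append one timestamped line and force it out to storage."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now()))
    with open(path, "a") as f:
        f.write(f"{stamp} {message}\n")
        f.flush()                 # empty Python's own buffer
        os.fsync(f.fileno())      # ask the OS to write it to the device
```

For example, `append_log("flight.log", "GPS lock acquired")` adds one line to `flight.log`. Opening and syncing on every call is slow, which is usually an acceptable trade for a flight log.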
Perform system testing
 • Test your entire system before flight
    ◦ Put your tracker in the garden
    ◦ Get a GPS lock
    ◦ Listen to the RTTY on your radio
    ◦ Look at the decoded RTTY on your computer
    ◦ Test uploaded data on the tracker*

    ◦ *I didn't do that step; on the day, people had to fix the tracker for me.
