The day a software bug almost killed the Spirit rover
The Spirit rover’s Mars mission almost ended before it really got going because of a DOS-related software bug that went uncaught under a rushed development schedule.
This Friday, January 25th, marks the ninth anniversary of the successful landing of the Opportunity rover on Mars, where it’s still rolling, digging and exploring to this day. However, this past Monday, January 21st, marked the ninth anniversary of a less happy event on Mars: the near-premature end to the mission of Opportunity’s twin rover, Spirit, due to a software bug, the “flash memory management anomaly.”
Spirit, like Opportunity, had landed and deployed successfully on Mars a few weeks earlier, on January 4th, 2004, and had spent those first couple of weeks happily tooling around and exploring the red planet. However, on January 21st, Sol (solar day) 18 of the mission (one solar day on Mars is a little over 24 and a half hours), the mission team at NASA’s Jet Propulsion Laboratory didn’t get a signal from Spirit as expected.
After some initial testing, the operations team was able to get a response from the rover, but nothing more than a beep confirming it was alive. While they were able to rule out problems with the rover’s antenna, they were unable to get any telemetry data from Spirit. By the end of the day, they knew the rover was alive, and they had narrowed the problem down to either the interface card to the radios or the flight software (FSW). “Panic started to set in for the operations team,” wrote Glenn Reeves and Tracy Neilson in an official JPL report on the incident.
Over the next two solar days, the operations team was able to finally coax some diagnostic data from the rover and figure out what was happening, if not why. Basically, the flight software was stuck in a continuous reboot cycle. Each time it tried to restart, it was encountering an error, which would trigger another restart. They suspected the problem was with the rover’s flash memory, where the DOS-based file system was stored. By the end of Sol 20, while the operations team didn’t know the root cause of the problem, they knew that, since the rover couldn’t properly shut down, as it was meant to do nightly, its battery power was getting low and it was in danger of overheating - and ending the mission before it even really began.
Finally, on Sol 21, the ground team was able to get the rover out of its reboot cycle by putting Spirit into “crippled” mode. This allowed the FSW to start up without accessing the flash memory, instead using the system RAM as a simulated file system. By doing this, they could finally tell the rover to shut down in order to recharge its batteries and buy themselves time to diagnose the root cause and come up with a solution.
The team ultimately figured out that the problem was a combination of a DOS library design flaw, a bug in some third-party software, and several configuration errors. Basically, the third-party code was mirroring the flash memory to RAM, of which there was only half as much. Because of the way DOS managed the file system, the file system's footprint kept growing even when files were deleted. Eventually, when the FSW tried to reboot, new memory could not be allocated, which triggered an error, which then triggered another reboot, and so on. On Sol 32, the operations team was able to fix the bug, reformat the flash memory and bring Spirit fully back to life. Spirit then continued to operate until March 22, 2010 (Sol 2210), well beyond its initial planned lifespan of 90 solar days.
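To make the failure mode a little more concrete, here is a toy Python sketch of the mechanism described above. It is not the rover's actual flight code, and every name and number in it (FlashFileSystem, RAM_BUDGET_ENTRIES, the three retry attempts) is a hypothetical stand-in. It only assumes what the account above says: deleted files keep occupying file system bookkeeping space, that bookkeeping gets mirrored into a smaller pool of RAM at boot, and a failed allocation triggers another reboot.

```python
class FlashFileSystem:
    """Toy model: deleting a file marks its entry but never reclaims the slot."""

    def __init__(self):
        self.entries = []  # each entry is [name, deleted_flag]

    def create(self, name):
        self.entries.append([name, False])

    def delete(self, name):
        for entry in self.entries:
            if entry[0] == name and not entry[1]:
                entry[1] = True  # the slot stays in the directory forever
                return
        raise FileNotFoundError(name)

    def live_files(self):
        return [name for name, deleted in self.entries if not deleted]

    def reformat(self):
        self.entries.clear()


# Hypothetical limit on how many directory entries fit in the RAM mirror.
RAM_BUDGET_ENTRIES = 8


def boot(fs, use_flash=True):
    """Return the mode the software comes up in, or raise if the mirror won't fit."""
    if not use_flash:
        return "crippled"  # skip flash entirely, as in the Sol 21 workaround
    if len(fs.entries) > RAM_BUDGET_ENTRIES:
        raise MemoryError("cannot mirror directory into RAM")
    return "nominal"


def boot_with_retries(fs, max_attempts=3):
    """Each failed boot triggers another one; give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return boot(fs), attempt
        except MemoryError:
            continue  # the error handler's only response is to reboot
    return "reboot loop", max_attempts


if __name__ == "__main__":
    fs = FlashFileSystem()
    for i in range(10):
        name = f"data_{i}.dat"
        fs.create(name)
        fs.delete(name)  # the file is gone, but its entry is not

    print(len(fs.live_files()), len(fs.entries))  # 0 10
    print(boot_with_retries(fs))                  # ('reboot loop', 3)
    print(boot(fs, use_flash=False))              # crippled
    fs.reformat()
    print(boot_with_retries(fs))                  # ('nominal', 1)
```

Note that the flash holds no live files at all when the boot starts failing. The trouble comes entirely from leftover bookkeeping, which is why reformatting the flash clears it in this sketch.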
In the end, it was concluded that the problem could have been avoided. The "compressed" schedule, three years from concept to launch, was deemed a contributing factor, since it led to incomplete development and testing. While the potential issues with the DOS file system were known at the time of development, fully fleshing them out and determining the correct settings for the associated configuration parameters was deemed a low priority and so wasn’t done.
This kind of problem sound familiar to anyone? That is, having to rush the development of something and thereby not fully develop or test the system? Nope, doesn’t sound familiar to me either...