Finally, on Sol 21, the ground team was able to get the rover out of its reboot cycle by putting Spirit into “crippled” mode. This mode allowed the FSW to start up without accessing the flash memory, instead using the system RAM as a simulated file system. By doing this, they could finally tell the rover to shut down in order to recharge its batteries, buying themselves time to diagnose the root cause and come up with a solution.
The team ultimately figured out that the problem was a combination of a DOS library design flaw, a bug in some third-party software, and several configuration errors. Basically, the third-party code was mirroring the flash memory to RAM, of which there was only half as much. Because of the way DOS managed the file system, the file system kept growing in size even when files were deleted. Eventually, when the FSW tried to reboot, new memory could not be allocated, which triggered an error, which then triggered another reboot, and so on. On Sol 32 the operations team was able to fix the bug, reformat the flash memory, and bring Spirit fully back to life. Spirit then continued to operate until March 22, 2010 (Sol 2210), well beyond its initial planned lifespan of 90 solar days.
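To make that failure mode a little more concrete, here is a toy Python sketch of the idea. It is not the flight software or the real DOS library; the class, the function names, and the numbers are all made up for illustration. The only things it borrows from the story are that deleted files did not shrink the file system's bookkeeping, that the bookkeeping had to fit in a smaller amount of RAM at boot, and that “crippled” mode skipped the flash entirely.

```python
class ToyFlashFileSystem:
    """Hypothetical model: deleting a file frees its slot but never shrinks the table."""

    def __init__(self):
        self.entries = []  # a file name, or None for a deleted (tombstoned) slot

    def create(self, name):
        self.entries.append(name)

    def delete(self, name):
        # The slot is marked as deleted, but the table keeps its length.
        self.entries[self.entries.index(name)] = None

    def live_files(self):
        return sum(1 for entry in self.entries if entry is not None)

    def table_size(self):
        return len(self.entries)

    def reformat(self):
        self.entries = []


def boot(fs, ram_budget, crippled=False):
    """Return "ok" if startup succeeds, "reboot" if mirroring the table overflows RAM."""
    if crippled:
        # Assumed behavior: skip the flash and use a RAM-only file system.
        return "ok"
    if fs.table_size() > ram_budget:
        # Allocation fails, which raises an error, which triggers another reboot.
        return "reboot"
    return "ok"


def simulate(sols, files_per_sol):
    """Create and then delete a batch of files every sol."""
    fs = ToyFlashFileSystem()
    for sol in range(1, sols + 1):
        for i in range(files_per_sol):
            name = f"sol{sol}_file{i}"
            fs.create(name)
            fs.delete(name)
    return fs


if __name__ == "__main__":
    RAM_BUDGET = 100  # made-up limit on how many table entries fit in RAM
    fs = simulate(sols=11, files_per_sol=10)
    print(fs.live_files(), fs.table_size())        # no live files, but a large table
    print(boot(fs, RAM_BUDGET))                    # normal boot fails
    print(boot(fs, RAM_BUDGET, crippled=True))     # crippled mode gets through
    fs.reformat()
    print(boot(fs, RAM_BUDGET))                    # boots normally after a reformat
```

Even with zero files left on the "flash", the table in this sketch has grown past what the RAM budget allows, so a normal boot can never succeed until the flash is reformatted. That is the shape of the trap, even if the real details were far messier.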
In the end, it was concluded that the problem could have been avoided. The “compressed” schedule, three years from concept to launch, was deemed a contributing factor, since it led to incomplete development and testing. While the potential issues with the DOS file system were known at the time of development, fully fleshing them out and determining the correct settings for the associated configuration parameters was considered a low priority, and so it wasn’t done.
Does this kind of problem sound familiar to anyone? That is, having to rush the development of something and, as a result, not fully developing or testing the system? Nope, doesn’t sound familiar to me either...
Related reading: Mariner 1's $135 million software bug