Andy: Yeah, I can definitely see that. I’ve seen that in an embryonic sense already, and I can just see that as systems get more and more powerful and memory gets cheaper and cheaper and more dense, I could see having -- on a small, modest computer -- 20, 50, 100 different setups, a couple of different QA ones with different configurations, testing the networking in between them in the box but pretending it’s a large grid cluster, whatever. Yeah, I could see that really becoming much more prevalent. I especially like the idea of extending version control to the operating system image level. This is a very powerful idea at Internet service providers (ISPs): You roll out some new bit of the stack or upgrade the operating system, and if it doesn’t work out, you just roll back to the previous image, and in a matter of minutes, you’re back to where you were, and then you can go straighten it out again.
Look at some of these massive outages -- airline reservation systems or air traffic control or customs, most recently, I think, was a big problem out in L.A. -- any of these kinds of headline-making outages because some software update failed somewhere and took the system down…it could certainly help ameliorate that sort of thing. And even on the developer’s desk, just having the freedom to say, “Well, let’s try it with all 14 different versions of this.” Yes, I think it would help a great deal.
Ed: Can you give me an example of one of the really hard problems in computing today?
Andy: I think this is a two-fold thing. One problem is a cross between a cultural issue and a computing issue. The hard problem in computing, I don’t think, is stuff like facial recognition, voice recognition, trying to emulate these aspects of human senses. Yeah, they’re a real pain, and a lot of researchers have spent a lot of time and a lot of energy trying to work it out, and they’ll figure it out someday, somehow. They’ve made great strides. These aren’t areas that I’m expert in, [and] they’re very hard, but they’re not the real hard ones.
The real hard one, to me, is getting any kind of a computer system to exhibit situational awareness and actual judgment. Getting some sort of a system that has any kind of situational awareness, I think, is the really hard part, because the danger you run into now as the computer becomes ubiquitous is you end up with an entire class of computer workers, not programmers, but the folks who work in fast food or banking or a call center, where they are genuinely slaves to the machine.
How many times have you called up your credit card company or a utility or ISP or phone service and there’s some problem with your account and the person on the other end of the phone says, “I’m sorry. The computer won’t let me” or “I can’t do this ’cause the computer won’t let me” or “It’s not showing on the computer.” “It’s all the computer’s fault.” Whatever the problem is, something has gone wrong and the person has no capability of correcting it and the computer has no capability of correcting it. This is the stuff of science fiction fear-mongering: you get to the point where nobody knows how to fix it. Your civilization is some thousand years in the future and you’re all slaves to some computer that can’t fix itself until Kirk and Spock come along and make it blow itself up.
Ed: Right, kinda like the Nomad machine in that Star Trek: The Original Series episode, “The Changeling.”
Andy: It’s a legitimate gripe: When something out of the ordinary happens, software in general is not designed to deal well with that. It’s designed to deal with the average case and the normal situation, and as soon as something happens outside the norm, our software is not sophisticated enough to learn from that, to realize that there are other venues, other possibilities.
Ed: Okay. Now to me that is essentially a complexity problem. As an analogy, consider the goal of software quality. One proven solution to achieving that goal is to use a fine-grain level of unit testing. So you keep doing this unit testing and you break it down small enough that you can achieve quality by having enough unit testing throughout the breadth of the system.
Now is there some analogous thing here to address the notion of exceptional cases and the permutations and combinations of these exceptional cases that lead to the computer operator being unable to take any effective action?
Andy: I don’t think so because I think the combinatorial explosion of everything that’s possible would just simply be overwhelming even for the fastest quantum-based computer. I think that’s kinda the wrong way to go.
The hard problem basically comes down to having a system in the largest sense of the word that can actually be aware and learn, because you cannot necessarily teach it what to do when something happens a priori. We don’t do that with human education and human training. You can’t prepare your children for every single eventuality that’ll hit them in life. You hit the high points to give them the tools to make their judgment calls when the time comes. I think that’s the big chasm that we have to cross from the very literal, almost naïve, approach of, “teach the computer these 12 steps and it will do them forever,” to having a system that can actually learn and apply the basics, the principles you’ve given it, to novel situations.
Ed: Jumping from a really hard problem…You might find my next question easy, but I’m quite certain our readers would still like to hear your answer. When you’re working on an existing system as a maintenance programmer and you make performance optimizations to the code to make it go faster, for example, what are some effective ways to avoid introducing bugs and fostering maintainability?
Andy: That’s a simple answer compared to the Star Trek computer thing. That’s the combination of what we’ve always called a safety net. [It consists of] having the basic technical practices in place of version control so that you’ve got an ability to roll back changes to be able to compare and contrast and test the system at any point in time from before your changes, after your changes, months and months before any of that even started, when it was peak load, whatever, to be able to just dial the old time machine to any point in time and work with the system as it existed then.
And that’s a little bit beyond what most people can do with version control. You roll the system back a certain amount. Suddenly, you don’t have the right libraries to work with that version. You don’t have that same compiler anymore, so there are some potential issues there, but, ideally, what you want is to be able to re-create the system as it existed at any point in time. That’s on the one hand.
[On] the second hand, you need the fairly comprehensive unit tests so that you can prove this is exactly how it functioned then. For example, you can say, “I’ve made these changes, and guess what, it still functions exactly the same or these things have changed but we can migrate that and have a plan for that.” So you have to have version control, you have to have unit testing, and you have to have automatic artifact creation.
If you’re in a compiled language, building object files, linking, installing, slamming a WAR [Java web archive] file somewhere or a JAR [Java archive] file somewhere else or whatever your particular platform demands -- however you actually construct and deploy the software needs to be completely automated, and those instructions for that automation need to likewise be under version control so that there’s really nothing left to chance.
The production process is ironclad and subject to version control. The unit tests are fairly complete so you can test anything that you’ve introduced for good or for bad, and it may be acceptable. You may decide to change things as a result of it, but at least you know if you’ve introduced anything that makes a material change to the functioning of the software.
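As a minimal sketch, the unit-test half of that safety net might look like this in Python (the function, its pricing rules, and the test names are invented for illustration, not taken from Andy’s examples):

```python
# A regression test that pins down current behavior before you optimize
# or refactor. If a later change alters the result, an assertion trips
# instead of the change slipping silently into production.

def shipping_cost(weight_kg: float) -> float:
    """Flat rate up to 1 kg, then a per-kilogram surcharge (invented rules)."""
    if weight_kg <= 1.0:
        return 5.00
    return 5.00 + (weight_kg - 1.0) * 2.50

# Small, focused tests: each is only a few lines, so a reviewer can
# glance at one and confirm it is reasonable.
def test_flat_rate():
    assert shipping_cost(0.5) == 5.00

def test_surcharge():
    assert shipping_cost(3.0) == 10.00

test_flat_rate()
test_surcharge()
```

With tests like these under version control alongside the code, “it still functions exactly the same” becomes something you can demonstrate rather than assert.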
With these three low-level technical practices in place, it’s pretty safe to do almost anything to the code base. You can optimize it for speed. You can add functionality, remove outdated functionality, change things depending on changing business needs, refactor the design, because now you’ve got to hand it over to maintenance programmers who aren’t as up-to-date on the techniques you used originally.
Ed: Speaking of maintenance programmers, how do you hone your sense of smell for where a bug might be?
Andy: The best place to look for where a bug might be is quite near the last one you found. That’s my number one tip. You can look this up in the ACM papers—there are studies that show bugs tend to clump. [They’re] not uniformly distributed at all. They come in clumps. So when you’re doing code review of other people’s code or you find something you did horribly wrong, the odds are pretty high it’s not sitting there by itself and it’s gonna have something else real nearby. So that’s always a good starting point.
Ed: Assuming they’ve already tried that and it didn’t work, do you have any advice to help people track down hard-to-find bugs?
Andy: The number one thing to do when you’re stuck on a problem [such as finding a bug] is to step away from the keyboard. Take a break. Go walk around the parking lot. Go get a soda. Go get a beer if you’re so inclined. Whatever your environment is, it’s to remove yourself from that immediate left-brain track. And this is one of the interesting things I discovered from the research [I did for the “Refactoring Your Wetware” talk]. You know, when you’re stuck on a hard problem, sitting at the computer is literally the worst place to be. I’ve given talks -- many dozens of times now -- and I’ve had so many people come up to me to corroborate the anecdote that you’re sitting there and you’re debugging or it’s a design problem that you just, you can’t get the ends to meet, and how are you gonna work this out…. And they’ll sit there and sweat bullets for some arbitrary amount of time and then, in disgust, go walk off to the bathroom, the parking lot, go home, whatnot, and halfway through the parking lot, bang, the answer hits them. You know? Or, worst case, it will be in the shower the next morning or on the commute or whatever it is, and it just asynchronously pops into their head.
That brilliant idea won’t pop into your head most of the time when you’re sitting there pounding on the keyboard in frustration. You know, that just blocks you from getting to it. So the number one advice I have when stuck is stop. You know, take a deep breath, literally take a deep breath, ’cause that actually does help re-oxygenate you and get things kicking along a little bit better. Mom was right when she said, “Stop, take a deep breath, and count to 10.” There are actual real physiological reasons you should do that.
Ed: Okay. Well, what about reproducing the bug or writing a test case, an automated test case, adding a test case to the test suite?
Andy: Oh, yeah, yeah. Yeah, you do all that stuff. That’s the currently approved way to handle a bug: have a test case that will demonstrate it conclusively first. Before you touch it in code, make sure you can reproduce the bug via a test case. Then go in, fix the code, and then re-run the test case to make sure that it’s fixed. But those are the easy ones. That’s Agile canon and certainly the best way to go about it. But what do you do when the bug is more elusive than that? You know, it’s non-deterministic. You can’t reproduce it well. There is a race condition somewhere, some area deep down somewhere, and you really don’t know all the constraints that apply to it. That’s where it gets a lot more interesting, and that’s where you need to try what you can and then, when you’re just not figuring it out, walk away from it for a little while.
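That reproduce-first workflow can be sketched in Python (the `median` bug here is invented for illustration; the point is the order of steps, not this particular function):

```python
# Step 1: write a test that reproduces the reported bug. Before the fix
# below, median([]) raised an IndexError; the test pinned that down.
# Step 2: fix the code. Step 3: re-run the test -- and keep it in the
# suite so the bug can never silently return.

def median(values):
    """Return the median of a list of numbers, or None if the list is empty."""
    if not values:  # the fix: guard the empty-list case
        return None
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def test_median_of_empty_list():
    # The reproducing test case written before the fix was made.
    assert median([]) is None

def test_median_of_odd_list():
    assert median([3, 1, 2]) == 2

test_median_of_empty_list()
test_median_of_odd_list()
```

For a deterministic bug like this, the failing test both proves the diagnosis and verifies the fix; it is the non-deterministic cases Andy describes next that resist this recipe.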
Because then, when you step away from it, you’ll find, “Well now. I never considered that this would be something in the cookie on the user’s machine” or this or that or some other factor that comes into play that you might not have thought of.
Ed: Since you are one of the pioneers and leading proponents of Agile software development practices, let’s talk a bit about test-driven development. Specifically, how do you deal with bugs in the tests and the time it takes to debug the bugs in the tests? Some of the Agile detractors will point that out and say the test code is just as likely to be buggy as the production code.
Andy: Yeah, there are folks [who] say that. I don’t really buy that as an argument. Certainly from a philosophical point of view, that’s true.
You know, code is code, and you can write bugs in test code just as easily as you can write bugs in normal code. However, test code by nature tends to be pretty simple stuff. You’re setting up some parameters and you’re calling something, and it’s not rocket science. I can make a typo in “hello world.” I’m not throwing stones here. I can certainly introduce bugs in the simplest of circumstances, as can everyone. But on the whole, properly written test code is not some big monstrous morass that’s hard to figure out. It’s very simple, small methods, four or five lines of code each. Pretty easy to take a look at and say, “Yes, this is reasonable,” or, “Yeah, it’s a little bit suspicious.” On the one hand, I don’t buy [their argument], but on the other hand, you’ve got a nice validation mechanism. If you are suspicious of your test code, it’s pretty easy to go in and deliberately introduce bugs into the real code and make sure that the test code catches them.
I don’t really buy their argument. The parallel argument, “It takes a lot of time to write the test code,” [is equally specious]. No [it doesn’t]. It’s like an investment strategy. You’re spending an incremental amount of time to write test code, but if you get nailed with a hard bug, it may take exponential time to solve it.
It’s really kind of a false argument. They only say that because it’s easier to measure the amount of time that you’re spending on test code.
Ed: Is it ever appropriate to not write tests?