May 20, 2011, 3:46 PM — When it comes to debugging hard-to-diagnose software and operating-system problems, there is no set recipe. Rather debugging is all about "having the right tools and knowing how to use them," advised Microsoft technical fellow Mark Russinovich at the close of the Microsoft TechEd conference, held this week in Atlanta.
Among the highlights of each year's TechEd conference are the technology demonstrations. Smart Microsoft and partner engineers walk attendees through how to use some new technology in a step-by-step process, making it seem easy -- or even fun -- to deploy.
And one of the most popular demonstrations over the past few years has been Russinovich's "Cases of the Unexplained," in which he shows how he and others tracked down hard-to-pinpoint errors in Windows deployments.
This year, of course, was no exception. Before a packed auditorium, Russinovich debugged a number of tricky problems using only a handful of free tools, many created by Russinovich himself, including Process Explorer and Process Monitor. He borrowed many examples in his presentation from his blog, where he collects user stories of tough problems.
In the cases Russinovich demonstrated, the root causes of the misbehaving systems were not readily obvious. This was especially true of software that, he noted, when it crashes, offers little instruction about its downfall. "Programs do a bad job of telling what went wrong," he said. Yet he showed that it is possible to carefully track the symptom of the problem back to the cause.
One example Russinovich dubbed "the case of the slow website." This example was submitted to Russinovich by a system administrator from an unnamed company. The organization's users were complaining of slow performance of some internal Web pages. The admin tracked all the Web pages to a single server, then ran Process Explorer, which shows all the processes on a server, and how much memory and CPU resources each thread of a process is consuming.
The admin identified one thread that was hogging more than a quarter of the server's resources. Doing a Web search, he found that the related process belonged to a Windows management driver that, in turn, communicated with the server chassis' management controller provided by the server manufacturer. The two components were having difficulty in communicating, so the communication between them spiked.
The difficulty turned out to be that the blade server was not slotted into the rack appropriately. The user reseated the server chassis and the server quickly returned to delivering its Web pages speedily.
Another problem came not from misbehaving equipment or software, but rather from user behavior. "This case came into the Microsoft Exchange support team," Russinovich said.
Users complained that Microsoft Exchange would periodically delay responding for up to 30 seconds. Microsoft requested the customer to log the server performance using Performance Monitor, which showed periodic spikes in CPU utilization. Using ProcDump, a Microsoft engineer created a script that would capture all the process information whenever processor usage went above a certain threshold.
Looking through the results, the engineer found an Exchange search function was consuming many of the cycles. The sluggishness was caused by the fact that a number of users had humongous mailboxes, which when they searched through them, would spike the server load. The admin instructed the users to reduce the size of their mailboxes, or organize them better. As a result of such tidying up on the users' part, server performance improved.
In a third case, Russinovich's wife had complained that the Windows Photo Gallery would hang after showing a movie. The bug was particularly annoying to her, as she was showing friends some home movies. A friend of hers even quipped, "This never happens with a Mac."
Russinovich reran the Photo Gallery software while capturing all the processes in Process Monitor. "When in doubt run Process Monitor," he advised. He matched the time of the hang with all the processes running at that time. While most of the processes were routine, he found an unusual system call, ironically enough, to an Apple QuickTime object, which was the source of program hang. "Sure enough, it was from that company that doesn't know how to write Windows software," he joked.














