March 16, 2001, 11:05 AM
A recent experience reminded me that many systems are run by part-time systems administrators, novice administrators, or non-administrators drafted into managing (or at least living with) Sun systems. This column is for part-timers who get thrown into a nasty situation without understanding the fundamental operations of Solaris or the tools available.
The following is an analysis of a debugging session. It should be informative for those who do not have the privilege (that is, the bad luck) of doing this type of work every day, and it might even interest more experienced administrators.
The programmers and Q&A staff at a client site were struggling with a performance problem. It was described as a memory leak, but they could not find the problem's source.
A Web-based application was being enhanced with several new features and ported from Solaris 2.6 to Solaris 7 to gain performance. Preliminary tests showed that this network-intensive application's performance would increase by 300 percent simply by changing the operating system release. Current production servers were running out of throughput, so testing this port was a high priority. The Q&A group performed long-term testing of the application and reported that after 7 hours of testing on the Solaris 7 server (an Ultra 10), throughput fell from 500 connections served per second to around 20.
The Q&A folks theorized that the problem was a memory leak. Further digging showed that all the pieces were in place for a rapid diagnosis. Luckily, many similar machines running both Solaris 2.6 and Solaris 7 were available, and the problem was easily reproducible (though only after several hours of testing). These machines were not in production, so invasive debugging techniques (reboots and software reloads) were possible. Finally, the program was written in C, and developers were available to answer questions.
The initial analysis
After further experimentation, the Q&A group determined that the problem occurred on both Solaris 2.6 and Solaris 7, and on multiple machines running each operating system. Because the problem followed the application across operating system releases and hardware, it seemed on the surface to be a memory leak introduced by the new code changes rather than a problem at the system level. Time to fire off some commands and debug the problem.
Back to basics
The last scenario they tried when debugging the problem turned out to be the simplest, as it often is.