Debugging a performance problem

Unix Insider –

A recent experience reminded me that many systems are run by part-time systems administrators, novice administrators, or non-administrators drafted into managing (or at least living with) Sun systems. This column is for part-timers who get thrown into a nasty situation without understanding the fundamental operations of Solaris or the tools available.

The following is an analysis of a debugging session. It should be informative for those who do not have the privilege (i.e. bad luck) of doing this type of work every day, and it might even interest more experienced administrators.

The scenario

The programmers and Q&A staff at a client site were struggling with a performance problem. It was described as a memory leak, but they could not find the problem's source.

A Web-based application was being enhanced with several new features and ported from Solaris 2.6 to Solaris 7 to gain performance. Preliminary tests showed that this network-intensive application's performance would increase by 300 percent simply by changing the operating system release. Current production servers were running out of throughput, so testing this port was a high priority. The Q&A group performed long-term testing of the application and reported that after 7 hours of testing on the Solaris 7 server (an Ultra 10), throughput dropped from 500 connections satisfied per second to around 20.

The Q&A folks theorized that the problem was a memory leak; further delving revealed that all pieces were in place for rapid problem diagnosis. Luckily, many similar machines running both Solaris 2.6 and Solaris 7 were available, and the problem was easily reproducible (after several hours). These machines were not in production, so invasive debugging techniques (reboots and software reloads) were possible. Finally, the program was written in C and developers were available to answer questions.

The initial analysis

After further experimentation, the Q&A group determined that the problem occurred on both the Solaris 2.6 and Solaris 7 machines and on multiple machines running each operating system. On the surface, it seemed the problem was a memory leak resulting from new code changes, rather than a problem at the system level. Time to fire off some commands and debug the problem.

Back to basics

The last scenario they tried when debugging the problem turned out to be the simplest, as it often is. However, it is helpful to examine the entire process that led to the solution.

There were two Solaris 7 machines to compare: one (the happy Sun) examined at the start of a test run, and one (the unhappy Sun) examined while it was experiencing the problem.

Before starting a debugging session (or any work you may want to review later), run the <font face="Courier">script</font> command. <font face="Courier">script</font> stores the input and output of all shell-level actions in a file, and is very useful for postmortems or reviews of the tasks performed.
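
For example (a minimal sketch -- the output filename here is just an illustrative choice), a debugging session can be wrapped like this, with everything typed and printed between <font face="Courier">script</font> and <font face="Courier">exit</font> captured in the named file:

<font face="Courier">
# script /tmp/debug-session.txt
# uptime
# vmstat 10 10
# exit
</font>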

I like to get a feel for the machine in question, so I ran a few status commands on the unhappy Sun in order to eliminate variables. Because the problem was seen on several machines, I expected to find an otherwise healthy system.

<font face="Courier">prtdiag -v</font> confirmed that the hardware was healthy:

<font face="Courier">
# /usr/platform/sun4u/sbin/prtdiag -v
System Configuration:  Sun Microsystems  sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 440MHz)
System clock frequency: 110 MHz
Memory size: 256 Megabytes

========================= CPUs =========================

                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      440     2.0   12       9.1

========================= IO Cards =========================

     Bus#  Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 0   PCI-1  33     1   ebus                                                    
 0   PCI-1  33     1   network-SUNW,hme                                        
 0   PCI-1  33     2   SUNW,m64B                         ATY,GT-C              
 0   PCI-1  33     3   ide-pci1095,646                                         

     Bus   Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 0   UPA   110    30   FFB, Double Buffered              SUNW,501-4788         

No failures found in System
===========================

========================= HW Revisions =========================

ASIC Revisions:
---------------
Cheerio: ebus Rev 1

FFB Hardware Configuration:
-----------------------------------
        Board rev: 2
        FBC version: 0x3241906d
        DAC: Brooktree 9070, version 1
        3DRAM: Mitsubishi 130b, version 2

System PROM revisions:
----------------------
  OBP 3.25.1 2000/01/14 13:40   POST 3.0.8 2000/01/14 16:05

</font>

Although "no failures found" indicates a healthy system, a quick scan of the <font face="Courier">/var/adm/messages.*</font> files, where the system reports its errors, revealed many "NOTICE: alloc: /: file system full" error messages, meaning the disk filled up at some point.
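
A quick way to check for those messages (a sketch; the exact message text can vary by release) is to search the log files directly:

<font face="Courier">
# grep "file system full" /var/adm/messages*
</font>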

The <font face="Courier">uptime</font> command shows the current load average -- the number of threads wanting to run, on average, over the last 1, 5, and 15 minutes (as a quick view of the manual page via <font face="Courier">man uptime</font> shows). This single-CPU system was using all of its CPU, and had been for at least the past 15 minutes:

<font face="Courier">
# uptime       
 10:46pm  up 22 day(s), 14:34,  2 users,  load average: 1.00, 1.00, 1.00</font>

When I asked the Q&A folks why they suspected a memory leak, they pointed to a copy of the <font face="Courier">vmstat</font> output they had made the day before:

<font face="Courier">
# vmstat 10 10
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr dd dd f0 s0   in   sy   cs us sy id
 0 0 0 2161544 140192  9  98 42  4  4  0  0  8  0  0  0  171 1501  140  3  3 94
 0 0 0 2123808 151200 284 8445 759 0 0 0  0 96  0  0  0  320 5560  905 20 75  6
 0 0 0 2123808 149120 284 8446 759 0 0 0  0 95  0  0  0  320 5555  904 17 74  9
 0 0 0 2123832 123991 295 8294 744 0 0 0  0 97  0  0  0  324 5624  913 29 67  4
 0 0 0 2144640  98592 293 8363 823 0 0 0  0 10  0  0  0  505 3427  545 29 75  6
 0 0 0 2142488  83373 234 8152 557 0 0 0  0  9  0  0  0  506 3577  558 20 74  6
 0 0 0 2139968  73328 233 8116 554 0 0 0  0 12  0  0  0  478 3608  558 29 65  5
 0 0 0 2138232  68832 332 8194 773 0 0 0  0  2  0  0  0  465 3431  524 21 75  4
 0 0 0 2136456  54448 232 8495 733 0 0 0  0  2  0  0  0  492 3431  540 28 79  3
 0 0 0 2134584  43344 221 8213 755 0 0 0  0 13  0  0  0  485 3548  580 20 74  6
</font>

The Q&A staff had asked for 10 samples of output, each showing 10-second averages of the system's memory and CPU use. The first line of output from every <font face="Courier">*stat</font> command should be ignored: it is the average since the system booted, not the average over a 10-second period. They thought the decreasing available memory under the free column meant something was draining the system's memory, and they spent a few days tracking down the leak.

The night before I arrived, the Q&A folks searched on Unix Insider to learn more about <font face="Courier">vmstat</font>, and correctly determined that a decreasing amount of free memory is not a problem indicator. Generally, Solaris uses memory as needed for caching and program execution -- it does not attempt to free up used memory until there is a shortage. As a result, there is a downward trend in free memory in <font face="Courier">vmstat</font> until Solaris starts freeing up used memory and the free memory level stabilizes.

The other indicator of a system memory shortage is the scan rate (sr) column. The larger its value, the more desperate the system is for memory. Here the value was zero, meaning no memory shortage. So, assuming the problem was not a memory leak, it was time to determine the true cause of the performance loss.

The CPU use columns in <font face="Courier">vmstat</font> produced interesting results: roughly 25 percent of the CPU went to user-level processing (non-system calls), while about 75 percent was spent on system calls. My experience with this command led me to conclude there was a programming problem (inefficient use of threading or locking, for instance) or an I/O problem. I/O problems usually show up in the id column in <font face="Courier">vmstat</font>, though, because I/O wait time is shown as idle time.

To sort out this issue, we looked at the <font face="Courier">mpstat</font> command, which shows more CPU-use details:

<font face="Courier">
# mpstat 10 10
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  166   4    0   273  154  147   21    0    0    0  1535    3   3   3  91
  0 8294  92    0   422  225  911  103    0    0    0  5566   21  68  11   0
  0 8421  94    0   420  225  904  102    0    1    0  5542   22  68  10   0
^C
</font>

Now this was interesting: there was some I/O wait state (wt column), but not much. It still looked like a program problem, because of the large system CPU use component (sys column).

More details on I/O are generated with the <font face="Courier">iostat</font> command, which we ran next:

<font face="Courier">
# iostat -xn 10 10
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    4.1    3.6   31.9  277.6  0.3  0.1   34.3   10.0   2   5 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    6.8   0   0 c0t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    1.3   83.9   0   0 c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 sol:vold(pid307)
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   92.7    7.4  741.7   52.1  0.5  0.1    4.9    1.3   3  10 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 sol:vold(pid307)
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   95.3    0.6  762.4    4.4  0.0  0.1    0.0    0.7   0   7 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 sol:vold(pid307)
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   94.2    0.3  753.6    2.2  0.0  0.1    0.0    0.7   0   7 c0t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t1d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 fd0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c0t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 sol:vold(pid307)
^C
</font>

The root disk was definitely busy, as shown by the c0t0d0 rows: 94 reads per second is hefty, as is 753 KB read per second. So we had some evidence of an I/O problem (the disk was being used heavily), but also evidence that the problem was with the program.

A quick scan of these commands' output on the happy Sun revealed no I/O, 95 percent of the CPU used for user level and only 5 percent for system calls, and no memory shortage. Curious! It's usually around this time in a problem-solving exercise that I say, "I've never seen this before!" It seems to make people more comfortable.

Time to see what the application was up to. The <font face="Courier">truss</font> command shows the system calls that a process is executing. We used <font face="Courier">ps</font> to find the process ID (PID) of the application in question:

<font face="Courier">
# ps -elfy  
 S      UID   PID  PPID  C PRI NI   RSS     SZ    WCHAN    STIME TTY      TIME CMD
 T     root     0     0  0   0 SY     0      0            Jul 12 ?        0:12 sched
 S     root     1     0  0  41 20   144    776        ?   Jul 12 ?        0:00 /etc/init -
 S     root     2     0  0   0 SY     0      0        ?   Jul 12 ?        0:00 pageout
 S     root     3     0  0   0 SY     0      0        ?   Jul 12 ?        0:07 fsflush
. . . 
 S     test   715   714  0  84 20  6576   8840        ? 21:38:18 ?        37:21 test
 S     test   714   502  0  40 20  2360   3112        ? 21:38:18 ?        0:00  dtexec
 O     root  4837  4468  1  51 20  1072   1880          23:29:26 pts/6    0:00 ps -elfy
</font>

The process name was "test", so the PID was 715. <font face="Courier">truss</font> is very verbose, so we sent its output to a file and used <font face="Courier">vi</font> to poke around in it. This is what the output showed:

<font face="Courier">
$ truss -o /tmp/problem-log.txt -p 715
$ vi /tmp/problem-log.txt
. . .
munmap(0xFF390000, 8192)			= 0
open64("/opt/log/transactions.log", O_RDWR|O_CREAT|O_TRUNC, 01600) = 3
llseek(3, 1023, SEEK_SET)			= 1023
write(3, "\0", 1)				Err#28 ENOSPC
llseek(0, 0, SEEK_CUR)				= 1776
. . .
</font>

The ENOSPC error indicated that the program was trying to write to file descriptor 3, which it had just opened on <font face="Courier">/opt/log/transactions.log</font>. Running <font face="Courier">man -s 2 intro</font> (which displays the manual page describing the system error codes) revealed the following enlightening paragraph:

<font face="Courier">
  28 ENOSPC
           No space left on device

           While writing an ordinary file or creating a directory
           entry,  there  is no free space left on the device. In
           the fcntl(2) function,  the  setting  or  removing  of
           record  locks on a file cannot be accomplished because
           there are no more record entries left on the system.
</font>

The write failed because the disk was full. Looking through the <font face="Courier">truss</font> output, we found many occurrences of ENOSPC. The application was trying to log every action it performed to a full log file, and it was not catching the error; it simply continued on and tried again after the next activity.
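
As a rough measure of how often the write was failing, the ENOSPC lines in the captured <font face="Courier">truss</font> output can simply be counted (a small sketch using the log file created above):

<font face="Courier">
$ grep -c ENOSPC /tmp/problem-log.txt
</font>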

Hmm, this could get embarrassing. We took a look at the system's disk space use:

<font face="Courier">
# df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0t0d0s0    3007086 3002579       0   100%    /
/proc                      0       0       0     0%    /proc
fd                         0       0       0     0%    /dev/fd
mnttab                     0       0       0     0%    /etc/mnttab
swap                 2122128     344 2121784     1%    /tmp
/dev/dsk/c0t1d0s2    8759116       9 8671516     1%    /big
/dev/dsk/c0t0d0s7    3718430  135234 3546012     4%    /export/home
</font>

Notice that <font face="Courier">/opt</font> is not a separate partition, so it must be part of <font face="Courier">/dev/dsk/c0t0d0s0</font>, the busy root partition. Sure enough, that partition was full. In fact, it had probably filled up after 7 or so hours of testing, causing performance to drop precipitously. Performance stayed poor for the rest of the run because the disk remained full. When the testers restarted the test, they removed the log file, so the problem did not show up again for several more hours.
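
To confirm what was consuming the root partition, a quick look at the directories involved would have done it (a sketch; the path comes from the <font face="Courier">truss</font> output above):

<font face="Courier">
# du -sk /opt/* | sort -n
# ls -l /opt/log/transactions.log
</font>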

We had been given a clue early on, in the <font face="Courier">/var/adm/messages</font> log file. We ignored it, figuring it was a symptom of an unrelated problem -- it was not. The most interesting part of this exercise was that a full disk could cause such a system-wide performance problem. One might assume that a failed write to a full disk would take no longer than a normal disk write; in fact, disk-full errors incur high overhead.
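
One way to see that overhead (a sketch -- we did not run this during the session) is to ask <font face="Courier">truss</font> for a summary of system call counts and times while attached to the process; interrupting it prints the summary, where the repeated failing <font face="Courier">write</font> calls and the time they consume stand out:

<font face="Courier">
# truss -c -p 715
^C
</font>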

Although the developers and Q&A folks smiled sheepishly and apologized for wasted time, it was not a waste. You can always learn more about how the system works (and how it doesn't).

Updating a classic

The first edition of Building Internet Firewalls was the kind of book you should keep nearby. It had hands-on information about planning, implementing, and managing firewalls. The second edition, by Zwicky, Cooper, and Chapman, just arrived; it includes information on Windows NT and Linux, as well as updated Unix information. The security strategies chapter is especially useful, and the section on building firewalls includes technology, architecture, design, and details on types. The Internet services section describes every major networking protocol (including NT) and its security ramifications. Building Internet Firewalls also includes chapters on security policies, maintenance, security incidents, resources, and tools. Like the first, this edition is filled with quality information that firewall builders need.
