From: www.itworld.com
April 18, 2008 —
When tracking down an unusual problem on a Unix system, you may encounter many turns and dead ends on the path toward the solution as some problems may have little connection to what first appears to be troubling your system. We're going to follow an unusually winding path from first appearance of a problem to its eventual solution.
What initially brought this particular problem to the attention of the sysadmins was that the collection of performance statistics (via sar) had stopped accumulating. The previous day's sar report had been fetched from all the other servers and stored on the server used to analyze activity across the network. This one server had dropped from sight.
"Why would sar have suddenly stopped collecting stats?", we wondered. One day it's working, the next it's not. We saw no evidence of anyone logging in on the day data collection had stopped nor any previous days going back a long way. In addition, nothing about the sar setup seemed to have changed.
We noticed right away that something was amiss with the sys user account -- the one used to collect performance stats. If we attempted to switch users to sys, we saw these errors:
fermion# su sys su: Invalid GID fermion# su - sys su: Invalid project ID
These were errors I had never encountered before. In fact, I found it hard to imagine what would be required for a GID to be considered "invalid" and wasn't sure what "project ID" referred to.
As it turned out, the su problem was a separate issue. For some reason, the su executable was owned by an ordinary user instead of by root.
fermion:/var/adm/sa # ls -l /usr/bin/su -r-sr-xr-x 1 jaydoe staff 17328 May 2 2001 /usr/bin/su
The invalid GID and project ID errors, therefore, did not refer to the user that I was trying to switch to, but to the ownership of the su executable.
When we tried running the /etc/rc2.dS21perf script, we ran into a second problem. It appeared that there was a disk space issue.
fermion:/ # /etc/rc2.d/S21perf start sadc: creat failed: No space left on device
Examining disk space on the system, however, this didn't appear to be the case.
fermion:/ # df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c1t0d0s0 140991 53111 73781 42% /
/dev/dsk/c1t0d0s6 1985367 701002 1224804 37% /usr
/dev/dsk/c1t0d0p0:boot
12127 1743 10384 15% /boot
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
/dev/dsk/c1t0d0s3 1985367 1184764 741042 62% /var
swap 2915816 8 2915808 1% /var/run
swap 2916188 380 2915808 1% /tmp
/dev/dsk/c1t0d0s5 1985367 1223 1924583 1% /opt
/dev/dsk/c1t0d0s7 61387645 34852018 25921751 58% /export/home
/dev/dsk/c1t0d0s1 387183 202055 146410 58% /usr/openwin
Using the touch command, we verified that, indeed, we could not write
to /var.
# cd /var/adm/sa # touch huh touch: huh cannot create
We switched user to sys, no longer a problem, and touched a file in /tmp, proving that there was nothing wrong with the sys account or with the touch command.
% touch /tmp/huh % ls -l /tmp/huh -rw-r--r-- 1 sys sys 0 Apr 9 11:31 /tmp/huh
I also verified that there was nothing wrong with permissions on /var.
% cd /var % ls -ldn drwxrwxr-x 2 4 3 512 Apr 7 19:00 . % ls -ld . drwxrwxrwx 2 adm sys 512 Apr 7 19:00 .
Even so, as root, I was unable to create even an empty file in /var. The /var file system was behaving as though it was mounted read-only though we could see that this was not the case.
# mount | grep var /var on /dev/dsk/c1t0d0s3 read/write/setuid/intr/largefiles/onerror=panic/dev=740003 on Thu May 31 10:37:01 2007 /var/run on swap read/write/setuid/dev=1 on Thu May 31 10:37:01 2007
It finally dawned on me that there was a second reason for "No space left on device" errors. In addition to being out of file storage space, one can exceed the capacity of a file system by running out of inodes. I quickly verified that I could rename files even though I could not create new ones. A check on the inode capacity of the affected file system revealed that none were free.
# df -t /var
/var (/dev/dsk/c1t0d0s3 ): 1600912 blocks 0 files
total: 3970734 blocks 485440 files
Problem solved? Of course not. While I'd uncovered the reason why sar had stopped building data files, we now had to track down why there were 485,440 files in /var. That's a lot of files! And, considering that Solaris file systems are generally built in such a way that almost no one runs out of inodes when disk space is still available, most of these had to be small files.
Lots of small files in /var?
We soon found the problem in /var/spool/mqueue -- a backed up mail queue containing 473,147 files! In fact, there were far too many files even for us to use simple rm commands to remove them. We had to remove groups of files with commands like rm *44, so as to get past "arg list too long" problems. Of course, we first examined some of the files and saved a few representative examples for use in the next phase of the problem -- figuring out why these files had amassed in the mqueue directory in the first place.
# rm * /usr/bin/rm: arg list too long # rm 4* /usr/bin/rm: arg list too long # rm [1-4]4*
This meant that we had two problems yet to solve -- why the email wasn't going anywhere and why these messages were being generated in the first place.
The messages were generated by a script intended to check whether a license manager was running on the system. Since the file system containing the script was mounted from another system, we had to log in elsewhere to modify it. The script contained a command that looked like this:
pgrep lmgrd.ste || /etc/rc2.d/S85lmgrd start
The script was run through cron and the output of the script was sent to /dev/null with the commonly used "2>&1 >/dev/null" syntax. Even so, each piece of email seemed to contain the process ID of the lmgrd.ste process and the destination address of a system that probably hadn't active on the network for years.
We modified the script to send the output of the pgrep command to /dev/null. After all, the only thing we need from the pgrep command is a yay or nay regarding whether the lmgrd.ste process is running in the form of a zero (yes) or non-zero (no) return code.
pgrep lmgrd.ste >/dev/null || /etc/rc2.d/S85lmgrd start
The cron job, which had apparently been checking the license manager every 5 minutes for years, did away with any output from the start script, but the pgrep command output was still being generated and emailed.
# keep license manager up 0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/bin/chkLicMgr 2>&1 /dev/null
Running every five minutes for years, it's no wonder the files accumulating in /var/spool/mqueue eventually consumed every inode in the file system.
Once the chkLicMgr script was modified, no new email was being generated. One part of the problem was resolved.
Next, we needed to address the problem of the missing destination host. For that, a quick review of the sendmail.cf file showed a line that looked something like this. The "spooky" system (the actual name and domain have been altered) had been shut down years earlier.
fermion:/etc/mail # grep spooky sendmail.cf DRspooky.particles.com 297
With the sendmail.cf file edited, sendmail restarted and the unintended output of the chkLicMgr script squelched, the system was once again behaving itself and once again collecting performance data. The fact that the system could have gone years without the messages piling up in /var/spool/mqueue being noticed is a mystery of its own. The system must have played an important role on the network some years earlier, but had since been mostly ignored in favor of faster systems and newer software.
With all the checking, the license manager on the system seemed to have been running reliably for years, though its users and someone who once thought it important to verify every five minutes that it was still up had all since moved on. What a strange turn of events.
ITworld