From: www.itworld.com
March 15, 2005 —
Today's column is another troubleshooting tale - a real story from the pits of systems administration. And, although the solution to the particular problem encountered turned out to be trivial, the process of tracking it down involved some interesting turns and insights into little known details about the mail command works on a Unix system.
While the subject of today's troubleshooting tale relates to email, the problem cropped up on a server that is anything but a mail server. In fact, any mail stored on the box is exclusively generated on the box and comprises the output of scripts that are run through cron to assist in monitoring various processes and applications. For this reason, the problem didn't get much notice and persisted for a long period of time before anyone thought to complain.
The problem itself was simple, but annoying. Specifically, anytime users on the problematic server quit the mail command after both reading and deleting messages, they would be faced with the message "mail: Cannot open savefile". Afterwards, their inboxes remained intact; in other words, deleted messages had not been removed. Since most of these users weren't interested in the particular cron job output email after perusing it once, they would remove the entire mail file and then go through the same procedure the next day.
Eventually, someone decided to ask why the mail command was giving them such a hard time and, after verifying the nature of the problem, attention was focussed on the troublesome server.
Checking the Server
The /var/mail/directory on the server looked more or less as anyone would have expected it to look. Each user's mail file was owned by the particular user and the group associated with all of the mail files was "mail".
drwxrwxr-x 2 root mail 512 Mar 18 12:34 :saved -rw-rw---- 1 carolt mail 3929 Mar 18 07:11 carolt -rw-rw---- 1 flyguy mail 1860 Mar 18 10:19 flyguy -rw-rw---- 1 jdoe mail 504 Mar 17 13:05 jdoe -rw-rw---- 1 zippo mail 1336 Mar 18 09:00 zippo
Even so, it was easy to confirm that any of these users encountered the same error any time they tried to delete their mail within the purview of the mail command.
What's Different?
The old trick of comparing a "broken" system with a working system didn't lead to any revelations. Comparing the /var/mail directory with the same directory on another server (exhibiting no such mail problems), it was clear that file permissions weren't the issue. Even the :saved directory appeared to be identically configured.
Checking the Web
As is usually the case with problems of this sort, a small population of sysadmins had encountered the same or similar problems and posted their troubles and suggestions on the web. The string "cannot chown savefile" yielded only a few hits, but our "cannot open savefile" error located 210 documents, many of which related to the problem at hand. Most of these posters cautioned about proper permissions on the /var/mail/:saved directory which, again, were correct in my case.
Running a Trace
One particular sun-managers discussion suggested using the truss command to capture the exact nature of the exhibited error. This poster was able to determine that his "Cannot chown savefile" error came about because the issued chown command was unable to change the group ownership of the savefile to that for "mail" -- as shown in this line:
chown("/var/mail/:saved/ramakant", 238, 12) Err#1 EPERM
Note that this particular user's UID, 238 and the GID of the mail group, 12, appear as arguments in this truss output. The error, EPERM, means "operation not permitted".
To reproduce what this individual had done, I sent myself a pile of messages using "echo testing | mail myself" commands and then issued the mail command.
boson(~henrystocker) $ mail From henrystocker@particles.com Fri Mar 4 13:05:15 2005 Date: Fri, 4 Mar 2005 13:05:15 -0500 (EST) From: Sandra Henry-StockerMessage-Id: <200503041805.j24I5FCh018905@boson.particles.com> Content-Length: 9 testing ?
Then, in another ssh session on the same system, I ran a truss on the mail process running in the first.
First, I grabbed the PID for the mail command:
boson(~root) # ps -ef | grep mail
henrysto 18909 16920 0 13:05:18 pts/7 0:00 mail
root 16557 1 0 09:40:06 ? 0:00 /usr/lib/sendmail -bd -q15m
Then I used the PID as an argument to truss:
boson(~root) # truss -p 18909
*** SGID: rgid/egid/sgid = 1011 / 0 / 0 ***
read(0, 0xDFB450BC, 1024) (sleeping...)
Once I hit a "d" to remove a message, the previously sleeping process woke up,
read and displayed the next message and then went back to sleep.
read(0, " d\n", 1024) = 2 llseek(1, 0, SEEK_CUR) = 318014 lseek(4, 3864, SEEK_SET) = 3864 read(4, " F r o m h e n r y s t".., 8192) = 1288 write(1, " F r o m h e n r y s t".., 58) = 58 write(1, " D a t e : F r i , 4".., 43) = 43 write(1, " F r o m : S a n d r a".., 57) = 57 write(1, " M e s s a g e - I d : ".., 63) = 63 write(1, " C o n t e n t - L e n g".., 18) = 18 write(1, "\n", 1) = 1 write(1, " t e s t i n g\n", 8) = 8 write(1, "\n", 1) = 1 write(1, "\n", 1) = 1 xstat(2, "/var/mail/henrystocker", 0x08047B14) = 0 write(1, " ? ", 2) = 2 read(0, 0xDFB450BC, 1024) (sleeping...)
No problems yet. It wasn't until I typed the q to quit the mail program that the problematic error message appeared.
? q mail: Cannot open savefile You have new mail in /var/mail/henrystocker boson(~henrystocker) $
This left me with an intact mailbox. In the second window, however, I could see that what the mail command had been struggling with:
open("/var/mail/:saved/henrystocker", O_WRONLY|O_CREAT|O_TRUNC, 0666) Err#13 EACCES
write(2, " m a i l", 4) = 4
write(2, " : ", 2) = 2
write(2, " C a n n o t o p e n ".., 20) = 20
I was failing on a similar system call to the one mentioned in the post, one to "open" or create a file in the :saved directory. When I changed permissions on :saved to 777 so that I would be able to open the file, the error changed to the chown error reported in the posting:
open("/var/mail/:saved/henrystocker", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
chown("/var/mail/:saved/henrystocker", 1111, 1) Err#1 EPERM
write(2, " m a i l", 4) = 4
write(2, " : ", 2) = 2
write(2, " C a n n o t c h o w n".., 21) = 21
In this case, the required file was created in the :saved directory (the "= 5" indicates the file handle that was used), but the subsequent chown call failed. It seemed that I was getting a little closer to a solution, but with a configuration that no longer matched my other servers.
The chown command being called by mail is, of course, not /usr/bin/chown but the chown system call. The syntax that we see in the truss output follows this form of the call (from the man page), indicating that the arguments are path, UID and GID:
int chown(const char *path, uid_t owner, gid_t group);
Chown What?
So, two questions arise. 1) What is this file we are trying to chown? and 2) Why are we running into trouble?
Not much was going on in the file system until I pressed q to quit the mail program. At this point, a number of things seemed to be happening and my truss output was the best source of insights into these activities. From this output, I could tell that the mail program was attempting to reconstruct my new mailfile (i.e., my mailfile without the messages that I had elected to delete) in the :saved folder and was then trying to recreate /var/mail/henrystocker with the proper ownership (user and group). Tracing the mail program on a properly working system, I might have seen the following sequence of actions. Here, we see the original mail file being erased ("unlinked"), the temporary file having its permissions set and being moved back to the original file's location and a lock file being removed.
unlink("/var/mail/henrystocker") = 0
chmod("/var/mail/:saved/henrystocker", 0100660) = 0
rename("/var/mail/:saved/henrystocker", "/var/mail/henrystocker") = 0
lstat64("/var/mail/henrystocker.lock", 0x08046D24) = 0
unlink("/var/mail/henrystocker.lock") = 0
In other words, for the mail program to work for me, two things need to happen. First, I need for the mail program as I run it to be able to create the temporary file in the /var/mail/:saved directory, even though it doesn't do this until I have typed "q" for quit. Second, I need to be able to switch the group on my reconstructed mail file to mail and move it back into /var/mail. If all of the file permissions are correct, this all happens as it should.
While it was easy to see that my file permissions appeared to be correct when I first examined them, it was odd that making them non-standard inched me toward a solution until I considered one additional aspect of the mail program. That is, though the program was run by me (and only needed world execute for this), it required the setgid bit to change the group on the reconstructed mail file. This turned out to be the crux of the problem. While /usr/bin/mail did indeed have the setgid bit set, the group associated with the file was incorrectly set to "other" instead of "mail". Because of this, mail first had trouble writing to the :saved directory ("cannot open savefile") and then, after I'd changed permissions on this directory to 777, had problems trying to change ownership ("can't chown savefile") of the new file to mail. Changing the group on /usr/bin/mail to mail fixed the problem once and for all.
The corrected /usr/bin/mail looks like this:
-r-x--s--x 1 root mail 69488 Nov 15 2001 /usr/bin/mail
A good change detection tool might have picked up this group ownership issue long before the first complaint finally wafted in. But, as they say, hindsight is 20/20.
Interestingly, the mailx command turned out to work even when the mail program would not.
ITworld.com