October 17, 2001, 5:32 PM
Opera is something I do not appreciate fully. The
costumes are exquisite, the music is emotional, but without
understanding Italian the plots are hard to follow. Pavarotti could be
performing a free-form exploration of the UUCP source code and I would
have trouble distinguishing it from Madama Butterfly. Bugs
Bunny making a fruit salad on Elmer Fudd's head is the most
comprehensible opera I've witnessed. "Wait," my cultured friends tell
me, "use the libretto to grasp the story." While it's not quite a set
of Cliff Notes, the libretto (text of the opera) helps you build a
framework for understanding the action on stage.
What does this have to do with the world of system administration?
If the error messages, user questions, system-call errors, and other
cryptic failures you encounter sometimes make as much sense as La
Traviata, then you need a libretto -- a framework for
understanding what the system is trying to tell you. We'll look at the
various ways in which system calls fail, and the symptoms by which
those failures manifest themselves. Starting with general file
permission issues, we'll then dive down into NFS failures, and close
with some comments on the importance of vigilance in enforcing system
programming guidelines. You may not understand Puccini any better than
before, but such help is easier to find.
Trap defense
System calls represent the boundary between user processes and
operating system (kernel) services. When a process executes a system
call, the associated wrapper in the libc.so library is called
to perform some basic argument checking. If the call is syntactically
acceptable, the wrapper executes a privileged instruction to force a
trap into the kernel. From there, the operating system takes over by
copying arguments, performing extensive checking, and completing the
service request. If you dump out the code for a system call in
libc.so, you'll see a "ta 8" instruction to issue trap 0x08,
which is a system call (see /usr/include/sys/trap.h for trap
types):
    huey% adb /usr/lib/libc.so.*
    _read,4?ia
    _read:        st    %o0, [%sp + 0x44]
    read+4:       mov   0x3, %g1
    read+8:       ta    0x8
    read+0xc:     bgeu  read + 0x40
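To see the wrapper and the trap as two routes to the same kernel service, here is a minimal sketch that asks for the process ID both ways. It assumes a platform that provides the generic syscall() entry point and the SYS_getpid constant in sys/syscall.h (true of Linux and of Solaris, but not guaranteed everywhere); the function name wrapper_matches_raw_trap is made up for illustration.

```c
#define _DEFAULT_SOURCE      /* assumption: asks glibc to declare syscall(); harmless elsewhere */
#include <sys/types.h>
#include <sys/syscall.h>     /* SYS_getpid: the system call number */
#include <unistd.h>          /* getpid(), syscall() */

/*
 * Hypothetical helper: returns 1 if the libc wrapper and the
 * generic syscall() entry point report the same process ID.
 */
int wrapper_matches_raw_trap(void)
{
    pid_t via_wrapper = getpid();            /* normal route: libc wrapper */
    long  via_trap    = syscall(SYS_getpid); /* same service, by call number */

    return via_trap == (long)via_wrapper;
}
```

Both routes end in the same trap into the kernel; the wrapper simply hides the call number and the register setup that the adb listing exposes.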
Nearly every system call returns a single value, ranging from a pointer
or an address, such as from shmat() or brk(),
to the size of a data transfer from read() and
write(), to a standard system type like a UID returned by
getuid(). System calls that return integers usually flag a failure
with a return value of -1, but calls that return addresses are less
uniform: some report failure with a NULL pointer, while others, such as
shmat(), return -1 cast to a pointer. Simple, inconsistent indicators of success or failure don't
give you (and your process) enough information to determine what went
wrong and how to repair the situation, so the system call return value
is supplemented by the error number, or errno value.
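A short sketch of how the two pieces of information fit together: the return value says that something failed, and errno says what. The helper name read_errno is hypothetical, and the sketch assumes a POSIX-style read().

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt a small read on fd.
 * Returns 0 if read() succeeded, otherwise the errno value it left behind.
 */
int read_errno(int fd)
{
    char buf[16];
    ssize_t n = read(fd, buf, sizeof buf);

    if (n < 0) {
        int saved = errno;   /* copy it before another call can overwrite it */
        fprintf(stderr, "read(%d) failed: %s (errno %d)\n",
                fd, strerror(saved), saved);
        return saved;
    }
    return 0;
}
```

Calling read_errno(-1) hands read() a descriptor that cannot be valid, so the return value is -1 and errno identifies the cause as EBADF.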
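If an exception is encountered while processing the system call,
errno is set to one of the values in /usr/include/sys/errno.h.
A successful call generally leaves errno untouched rather than
resetting it to zero, so its value is meaningful only immediately after
a call has reported failure. Most applications include the
errno.h header file, which defines the possible values of errno.
Insert an extern int errno; in your code (or rely on the declaration
that errno.h supplies), and it is accessible as an integer variable.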
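The values in errno.h are symbolic constants, and it helps to see a name next to the system's own description of it. The sketch below covers only a handful of values chosen for illustration; errno_name and describe_errno are hypothetical helpers, and the description text comes from strerror(), so its wording varies from system to system.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a few errno values to their symbolic names. */
static const char *errno_name(int e)
{
    switch (e) {
    case EPERM:  return "EPERM";
    case ENOENT: return "ENOENT";
    case EACCES: return "EACCES";
    case EBADF:  return "EBADF";
    case ESTALE: return "ESTALE";
    default:     return "(not in this table)";
    }
}

/*
 * Hypothetical helper: write "NAME: description" into buf.
 * Returns whatever snprintf() returns.
 */
int describe_errno(int e, char *buf, size_t len)
{
    return snprintf(buf, len, "%s: %s", errno_name(e), strerror(e));
}
```

A full table would be generated from the header itself; this one is deliberately partial.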
In theory, your code should check the return value of every system
call, and consult errno whenever it signals failure, including for
calls that should "never" fail like
close(), because these system calls can report
failures deferred from other requests -- a topic we'll visit later. Of
course, not all code does such paranoid checking, and you can't modify
commercial applications to make them fit your quality standards ex
post facto. So, how do you start tracking down a user issue when
all you have is an error message?
Trace amounts
The first thing to do is to become familiar with the various kinds of
errors reported back through the errno mechanism. Your best source of
information is the introduction to section 2 of the manual pages:













