From: www.itworld.com

A file by any other name

by Hal Stern

October 17, 2001

 

Despite all of the work that has been done with
object-oriented systems and user interfaces, the Unix filesystem is
the primary mechanism we use to locate and interact with our data. The
hierarchical namespace is simple and relatively easy to use -- until
you begin adding symbolic links, NFS mounts, removable media, and
other wormholes that take you from one disk to another with little
warning. Large Unix environments stress the filesystem in new and
creative ways. Processes can go file-crazy and run into system-resource
limitations. Users who have yet to discover the wonder of directories
complain of terrible system performance. Just when you think your list
of headaches is complete, you try to unmount a CD-ROM but continually
get "filesystem busy" error messages, or you promise to remove the
mailbox of a user who is hanging you up -- as soon as you determine
his or her identity.

This month, we'll sort through some filesystem navigation issues.
We'll look at open() to see how processes find files, and
examine some performance issues. From there, it's off to the links --
hard and symbolic -- to see how they alter paths through the
filesystem. Finally, we'll look at the tools available to find the
user associated with an open file or directory. While we may not offer
any solutions to the deep-versus-wide directory layout argument, we'll
try to make sure that no matter what you call your files, you'll get the
right bits.

Open duplicity

We see the filesystem as a tree of directories and filenames. These
physical names aren't used inside of a process. Logical names
known as file descriptors, or file handles, are used to
identify a file for reading or writing. Unix file descriptors are
integers returned by the open() or dup()
system calls. Pass a filesystem pathname to open() along
with the read or write permissions you want, and open()
returns a file descriptor or an error:

int fd;
fd = open("/home/stern/cols/sep95.doc", O_RDONLY);
if (fd < 0) {
	fprintf(stderr, "cannot open file\n");
	exit(1);
}

You can open special (raw) devices or special files like Unix domain
sockets using a similar code fragment. File descriptors are maintained
on a per-process basis, in the same kernel data structures that keep
track of the stack, address space, and signal handlers. Every process
starts out with file descriptors 0, 1, and 2 assigned to the standard
input, output, and error streams, respectively. Each subsequent call
to open() returns the lowest-numbered available file handle. When
you call close() on a file descriptor, that handle goes back into the
pool and is reused by open() as soon as it is the lowest free one. The
integer is really just an
index into a table of per-process file descriptors that point to the
system open file table. The system file table points to other
file-specific information like inodes, NFS server addresses, or
protocol-control blocks for file descriptors associated with sockets.

What's the point of dup() if open() is the
primary mechanism for converting pathnames to file handles? When you
want to use the same file for two output streams, such as
stdout and stderr, use dup() to
copy one descriptor into another. The following code segment closes
the default stdout and stderr streams and pumps both into a new file:

int fd;
close(1);
close(2);
fd = open("/home/stern/log/errors", O_WRONLY|O_CREAT, 0644);  /* lands on descriptor 1 */
dup(fd);                                                       /* copy lands on descriptor 2 */

Because the file descriptors for stdout and stderr were closed before
the calls to open() and dup(), these handles
are re-used.

Open file descriptors are preserved when new processes are created
with fork() and vfork(), so any process that
gets started via exec() inherits the open file
descriptors left by the calling process. Here's a simple example of
using dup() and fork() to set up a pipe
between two processes:

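int fd[2], pid;
pipe(fd);      /* fd[0] is the reading end, fd[1] the writing end */
close(0);
dup(fd[0]);
if ((pid = fork()) == 0) {
	close(1);
	close(2);
	dup(fd[1]);  /* connect stdout */
	dup(fd[1]);  /* and stderr */
	close(fd[0]);  /* drop the spare pipe descriptors */
	close(fd[1]);
	execlp("writer", "writer", (char *)0);
	_exit(127);  /* only reached if the exec fails */
}
close(fd[0]);
close(fd[1]);  /* so the reader sees EOF when the writer exits */
execlp("reader", "reader", (char *)0);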

First the call to pipe() creates a pipe and returns file
descriptors for the reading and writing ends. Standard input is
closed, and the reading end of the pipe is connected via
dup(). A new process is created using
fork(), and its stdout and stderr are fitted into the
writing end of the pipe. The child process execs the
writer and the parent process becomes the reader. This is a slimmed-down version of what happens inside the shell when you execute
command_a | command_b from the command line.

The per-process file descriptor table has a default maximum size of 20
in SunOS 4.1.x and 64 in Solaris. If you exceed the default size, the
next call to open() or dup() fails and sets errno to EMFILE.
Most processes don't touch that many files, but system processes or
connection managers such as inetd may open many file
descriptors. Relax the limit using the unlimit command
from the shell. You can even start inetd in a subshell
that has the file descriptor limit removed:

luey% ( unlimit descriptors; inetd & )

Setting the file descriptor ceiling to "unlimited" really means 128 in
SunOS 4.1.x, and 1,024 in Solaris. The smaller limit in SunOS is due
to the standard I/O library (and others) using a signed
char for the file descriptor. If you are porting code
from BSD platforms to Solaris, be on the lookout for snippets that use
8-bit signed file descriptors and test for return values less than
zero: Solaris will return file descriptors of 128 and above, which look
like failures but are valid file handles. Solaris also dynamically
allocates space in the per-process file descriptor table, so removing
the file descriptor limit won't make your processes bloat with unused
table space.
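Both points can be sketched in a few lines of C. This is an illustration under stated assumptions, not code from any of the libraries mentioned: raise_descriptor_limit() and mistaken_for_failure() are hypothetical names. The first uses the standard getrlimit() and setrlimit() calls to do roughly what the shell's unlimit does. The second mimics old code that stores a descriptor in a signed char and assumes a two's-complement machine where out-of-range values wrap.

```c
#include <sys/resource.h>

/*
 * Raise the soft descriptor limit to the hard limit, roughly what
 * "unlimit descriptors" does from the shell.
 * Returns 0 on success, -1 on error.
 */
int raise_descriptor_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;      /* soft limit up to the ceiling */
    return setrlimit(RLIMIT_NOFILE, &rl);
}

/*
 * Mimic old code that squeezed a descriptor into an 8-bit signed
 * field and then tested it for failure. Returns 1 if a valid
 * descriptor would be mistaken for an error, 0 otherwise.
 */
int mistaken_for_failure(int fd)
{
    /* Assumption: out-of-range values wrap, as they do on
       two's-complement machines. */
    signed char stored = (signed char)fd;

    return fd >= 0 && stored < 0;
}
```

With these definitions, mistaken_for_failure(127) is 0 while mistaken_for_failure(128) is 1, which is the porting trap in miniature.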

Open system call

Peering inside the open() system call lets us learn more
about the performance implications of excessively deep or wide
directory structures. When open() converts a pathname
into a file handle, it starts by walking down the path one component
at a time. For example, to look up /home/stern/log/error,
first home is found in the root directory, then
stern is found under home, and so on. To locate a
filename component, a linear search is performed on the directory -- Unix
doesn't keep the directory entries sorted on disk, so each one must be
examined to find a match or determine that the file or directory
doesn't exist.
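A user-level sketch of that linear scan, using the standard opendir() and readdir() calls, makes the cost visible. The function find_entry() is a hypothetical name, and the kernel's own search works on directory blocks rather than through this library interface, so treat this as an approximation of the idea only.

```c
#include <dirent.h>
#include <string.h>

/*
 * Scan 'dir' one entry at a time looking for 'name'.
 * Returns the number of entries examined up to and including the
 * match, 0 if the name is not present, or -1 if the directory
 * cannot be opened.
 */
int find_entry(const char *dir, const char *name)
{
    DIR *dp;
    struct dirent *de;
    int examined = 0;
    int found = 0;

    dp = opendir(dir);
    if (dp == NULL)
        return -1;
    while ((de = readdir(dp)) != NULL) {
        examined++;                 /* every entry costs a compare */
        if (strcmp(de->d_name, name) == 0) {
            found = 1;
            break;
        }
    }
    closedir(dp);
    return found ? examined : 0;
}
```

A miss is the expensive case: every entry in the directory has to be examined before the search can give up.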

Why is the lookup done one component at a time? Any of those
directories could be a mount point or a symbolic link, taking the
pathname resolution to another filesystem or another machine in the
case of an NFS mount. If /home/stern is NFS-mounted, the
first two lookups occur on the local disk, then the request to find
log in stern is sent over the network to the NFS
server. These pathname resolutions show up as NFS lookup requests and
can account for as much as 40% of your NFS traffic.
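Since each component costs one lookup, counting components gives a rough estimate of the work behind a single open(). The sketch below is deliberately naive: count_lookups() is a made-up helper that ignores symbolic links, mount points, and any caching, all of which change the real number.

```c
/*
 * Count the components in a pathname. Under the simplifying
 * assumptions above, each one stands for a directory lookup.
 */
int count_lookups(const char *path)
{
    int n = 0;

    while (*path != '\0') {
        while (*path == '/')        /* skip separators */
            path++;
        if (*path == '\0')
            break;
        n++;                        /* found another component */
        while (*path != '\0' && *path != '/')
            path++;
    }
    return n;
}
```

For /home/stern/log/error this counts four lookups; with /home/stern NFS-mounted, the last two of those would travel over the network.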

Pathname-to-file-descriptor parsing can be a disk-, network-, and
server-intensive business. If the directory entries must be searched,
they have to be read from disk, incurring file-access overhead
before the file is opened for normal I/O. When you are using
NFS-mounted files, you may be performing directory searches on the
server, accruing network round-trip and disk I/O penalties.









So how does the operating system accelerate this process? Recently
completed lookups are kept in the Directory Name Lookup Cache (DNLC),
an association of directories and component names that is checked
before performing a directory search. DNLC statistics are accessible
through vmstat -s:

luey% vmstat -s
     ...
  462936 total name lookups (cache hits 91%)
     214 toolong

Ideally, your hit rate should be over 90%, and as close to 99% as
possible. However, several things conspire to make the DNLC less
efficient:

Sometimes vmstat -s reports bizarre statistics, usually
when its internal counters overflow and its math precision becomes
non-existent. Kimberley Brown, co-author of Panic! Unix System Crash Dump Analysis,
provides a detailed adb script to dump out all of the name cache (nc)
statistics. It shows you the number of hits, misses, too-long names,
and purges.

Deep and wide

Given that directory searches occur in linear fashion, but each
pathname component requires another lookup operation, is it better to
have deep or wide directories? Do you do better having all of your
files in just a few directories, minimizing the number of lookups, or
do you go for more lookups but make each one go faster because the
directories have only a few entries?

The answer, as usual, is to avoid the extremes. Excessively deep
directory structures generate large numbers of lookup requests
and can frustrate users trying to navigate up and down half
a dozen levels of directories. Perhaps the worst file
naming convention is to take a serial number and use each digit
as a directory, that is, document 51269 becomes /docs/5/1/2/6/9.
While it's trivial to find any document, it's painful to consume
so much of the disk with directories. Use a hash table to find
documents quickly, or group them with several dozen documents in a
single directory.
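One way to get the middle ground is to hash the serial number into a fixed set of bucket directories. The sketch below is only one possible layout: the doc_path() name, the /docs/NN/serial naming scheme, and the choice of 64 buckets are all assumptions made for illustration, and the right bucket count depends on how many documents you expect.

```c
#include <stdio.h>

#define DOC_BUCKETS 64  /* assumed bucket count, tune to taste */

/*
 * Build the pathname for a document from its serial number,
 * spreading documents across DOC_BUCKETS directories.
 * Returns 0 on success, -1 if 'buf' is too small.
 */
int doc_path(char *buf, size_t len, unsigned long serial)
{
    int n = snprintf(buf, len, "/docs/%02lu/%lu",
                     serial % DOC_BUCKETS, serial);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Under this scheme document 51269 lands in /docs/05/51269, which takes three lookups to reach instead of the six needed for /docs/5/1/2/6/9.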

Large, flat directories are equally bad. When a user complains that
ls takes a long time, try ls | wc -l and see
how many files are in the current directory. In addition to the disk
hits required to search the directory, you're also paying a CPU
time penalty to sort the list.

Link to La-La-Land

Pathname resolution takes a turn when a mount point is encountered,
switching from one local disk to another or to an NFS server. Symbolic
links cause a similar detour. First, some background on links and
their implementation. Filesystem links come in two flavors: hard and
symbolic. Hard links are merely multiple names for the same file. They
exist as duplicate directory entries, pointing to the same inode and
therefore the same blocks on disk. Create hard links using the ln command:

luey% ln test1 test2

Because they refer to a set of disk blocks, hard links cannot span
filesystem boundaries. If you're wondering why hard links are useful,
consider the mv command. You can rename files by copying
them, and removing the old name. However, this assumes you have disk
space for two copies of the file while the operation is in progress,
and it may upset any careful tuning you've done of the disks. Instead of copying the file, mv uses a hard link to
create the new name, then it removes the old name. The disk blocks
never move, unless you're moving the file across filesystems, in which
case mv copies the file.
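The link-then-unlink rename can be sketched with the link(), unlink(), and stat() system calls. The helper names rename_by_link() and same_file() are hypothetical, and this is a simplified picture of the technique the text describes, not the source of any particular mv.

```c
#include <sys/stat.h>
#include <unistd.h>

/*
 * Rename a file by adding a second name and dropping the first.
 * The data blocks never move. link() fails (errno EXDEV) if the
 * two names are on different filesystems.
 * Returns 0 on success, -1 on error.
 */
int rename_by_link(const char *from, const char *to)
{
    if (link(from, to) != 0)
        return -1;
    return unlink(from);
}

/*
 * Return 1 if two pathnames refer to the same file, judged by
 * device and inode number, 0 if not, -1 if either stat() fails.
 */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After the ln test1 test2 example above, same_file("test1", "test2") would return 1, since both names lead to one inode.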

Symbolic links are pointers to filesystem pathnames, not filesystem
data blocks. They're also created with the ln command,
using the -s flag:

luey% ln -s /var/log/stern/errors/sep95 /home/stern/log

The link target is the first argument, and the link is the second
argument. When open() hits a symbolic link, it follows
the link by substituting its value for the current pathname component.
In the above example, /home/stern/log is a link to
/var/log/stern/errors/sep95 so the lookup of log in
stern turns into a lookup of var in the root
directory. Encountering a symbolic link lengthens the pathname lookup
if the link has many components, and it may force a disk access to
read the link. All symbolic links in SunOS are read from the disk,
while only those with more than 32 characters in the target are read
from disk in Solaris -- shorter links are kept in memory with the
link's inode.
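From a program, the value stored in a symbolic link is read with readlink(), and lstat() tells you whether a name is a link without following it. The wrappers below, is_symlink() and link_value(), are hypothetical helpers sketched for illustration.

```c
#define _XOPEN_SOURCE 700   /* ask for the POSIX.1-2008 declarations */
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if 'path' is itself a symbolic link, 0 if not, -1 on error. */
int is_symlink(const char *path)
{
    struct stat sb;

    if (lstat(path, &sb) != 0)      /* lstat() does not follow the link */
        return -1;
    return S_ISLNK(sb.st_mode) ? 1 : 0;
}

/*
 * Copy the link's value into 'buf' as a C string.
 * Returns the number of bytes stored (the value is silently
 * truncated if 'buf' is too small), or -1 on error.
 */
long link_value(const char *path, char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;
    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';                  /* readlink() does not terminate */
    return (long)n;
}
```

For the link created above, link_value("/home/stern/log", buf, sizeof buf) would fill buf with /var/log/stern/errors/sep95.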

Absolute links point to a pathname that begins with a /, while
relative links assume the current directory as a starting point.
For example:

luey% ln -s ../logs/sep95 error

creates a link named error in the current directory, pointing
to ../logs/sep95. Using symbolic links successfully means
enforcing consistency in filesystem access on all of your clients. If
you use absolute links, you must be sure that all of the pathname
components in the link are accessible to processes that hit the link.
With relative links, you run the risk of ending up somewhere
unexpected when you walk up the directory tree.

One of the most common uses of symbolic links is the creation of a
consistent name space built from CPU-, OS-, or release-specific
components. For example, /usr/local/bin contains binaries
specific to a CPU architecture and an OS release. Many administrators
use symbolic links to point to the "right" version of
/usr/local/bin:

luey% ln -s /usr/local/sun4.solaris2.4/bin /usr/local/bin

Maintaining the links gets more difficult as the number of special
cases grows, and it makes performing an upgrade less than pleasant.
Another solution is to use the hierarchical mount feature of the
automounter, letting it build up the right combination of generic and
specific filesystems on the fly. Here's a hierarchical automounter map
entry for /usr/local:

/usr/local	\
	/		servera:/usr/local
	/bin		servera:/usr/local/bin.$ARCH.$OS
	/lib		serverb:/usr/local/lib.$ARCH.$OS

First the generic /usr/local template is mounted, giving you
machine-independent directories like /usr/local/share and
/usr/local/man. Then the bin and lib
directories are dropped on top, using the variants named by the
automounter variables. Invoke the automounter with the appropriate
variable definitions on each machine:

luey% automount -D OS=solaris2.4 -D ARCH=sun4

When you upgrade the OS, change this line to reflect the new OS
variable value, and all of your machine dependencies are resolved.

Virtual Boston

Wise use of the automounter and symbolic links can keep your users
from falling off the edge of a filesystem into previously uncharted
net-surf. But what if you want to explicitly prevent prowling? How can
you create a restricted environment from which scripts, users, and
not-so-well-intentioned visitors cannot escape? The magic system call
is chroot() -- change root. When chroot() is
called with a pathname, that directory becomes the new virtual root of
the filesystem. The calling process and all of its descendants only
see files in the virtual root and below.

Anonymous ftp servers use chroot() to keep you corralled
in the public ftp area. Webmasters would sleep better at night if CGI
(common gateway interface) scripts ran with chroot() so
that file damage or exposure caused by script failures could be
contained to a selected subset of the server's filesystems.

Once you call chroot(), anything you'll need has to
appear in the truncated filesystem. If you are using anonymous ftp, and you
expect users to generate listings, you'll need a version of
/bin/ls as well as the dynamic linker ld.so
and the essential C runtime libraries from /lib. Let's say
you want to use /export/pub/ftp as the root of your subset.
You'll need to create the following directories and populate them:
/export/pub/ftp/bin for binaries such as ls and
ld.so, /export/pub/ftp/lib for dynamically
linked libraries, and /export/pub/ftp/dev for special devices needed
at runtime, like /dev/zero.

If you use chroot() in other places, for example,
creating electronically padded cells for scripts, be careful about
using symbolic links. Absolute links were written with the old root in
mind, but they are resolved against the new one. If you perform
chroot("/home/cell"), and then try to resolve a link that
points to /var/log/errors, the actual pathname created
(relative to the true root of the filesystem) is
/home/cell/var/log/errors. Relative links are a safer bet for
portability when using chroot(), but again be sure that
you don't try to walk above the root of the truncated filesystem. An
arbitrarily long chain of .. components still only hits the subset
root, not the true filesystem root.
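The clamping of .. at the root is easy to check from a program, because the same rule applies to the ordinary root directory. The sketch below uses a made-up function, dotdot_stays_at_root(), which compares the device and inode numbers of / with those reached through a chain of .. components.

```c
#include <sys/stat.h>

/*
 * Return 1 if walking up from the root with ".." lands back on the
 * root, 0 if it somehow does not, and -1 if stat() fails.
 */
int dotdot_stays_at_root(void)
{
    struct stat root, up;

    if (stat("/", &root) != 0 || stat("/../../..", &up) != 0)
        return -1;
    return root.st_dev == up.st_dev && root.st_ino == up.st_ino;
}
```

Run inside a chroot() cell, the same comparison is made against the cell's root, which is why a script cannot climb out with .. alone.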

You must be root to call chroot(), so it is usually
called by daemons started by root just before they exec()
another image or change their effective user ID to something
non-privileged. Changing your filesystem-centric viewpoint to prune
unrelated directories gives you a Virtual Boston -- the "hub of the
universe."
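The usual daemon sequence can be sketched as follows. This is a minimal outline under stated assumptions, not a hardened implementation: enter_cell() is a hypothetical name, the caller is assumed to be running as root, and a real daemon would also shed its group IDs and check its open descriptors before going further.

```c
#define _DEFAULT_SOURCE     /* exposes chroot() on glibc in strict modes */
#include <sys/types.h>
#include <unistd.h>

/*
 * Confine the calling process to 'dir' and then give up root.
 * Returns 0 on success, -1 if any step fails (for example because
 * the directory is missing or the caller is not root).
 */
int enter_cell(const char *dir, uid_t uid)
{
    if (chroot(dir) != 0)           /* 'dir' becomes the virtual root */
        return -1;
    if (chdir("/") != 0)            /* make sure cwd is inside the cell */
        return -1;
    if (setuid(uid) != 0)           /* drop root so chroot() can't be redone */
        return -1;
    return 0;
}
```

A daemon would call something like enter_cell("/home/cell", uid) with the ID of an unprivileged account, then exec the untrusted program. Anything that program needs must already exist under /home/cell.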

User descriptors

By now you can help your users get to files, create shortcuts, and
maybe even eliminate the odd complaint about the terrible performance
of File Manager on that home directory with a thousand files in it.
The true test of courage is to find out what files are in use without
having to send mail to everyone in the building. Fortunately, there is
a tool triumvirate to let you walk from the system open file table
back to associated processes.

The first is a built-in tool called
fuser, short for file user.
Give it a filename, directory name, or a special device, and it dumps
out the process IDs that have open file descriptors pointing to the
argument. fuser's output shows you whether the process
has the file open as its current working directory, as a normal file,
or as its root directory:

luey# fuser /var/spool/lp
/var/spool/lp: 150c 

The c indicates that the directory is used as a current working
directory; exploring the process ID with ps yields:

luey# ps -p 150
PID TTY        TIME COMD
150 ?          0:00 lpNet

Because fuser goes into the kernel to scan the system
file table, it must be run as root. One of the more powerful
applications of fuser is finding the process that's got
an open file or working directory on the CD device, preventing the
CD-ROM caddy from ejecting with "file busy" messages:

luey# fuser /dev/rdsk/c0t6d0s0
/dev/rdsk/c0t6d0s0:      166o

Process 166 has a file open on the CD-ROM; killing it will let you
eject away. Note that fuser writes process IDs to stdout,
but the open status letters go to stderr. If you want a list of
process IDs to hand to kill or another command, catch the standard
output of fuser and ignore the commentary on standard
error.

A similar tool for SunOS 4.1.x is
ofiles.
It takes a filename or device and tells you the processes that are
using it and the type of use, including read or write locks on the
file. ofiles merges the file descriptor table walk with
some of the functionality of ps, making it a simpler tool
to use:

luey% ofiles /var
/var/	/var/ (mount point)
USER        PID  TYPE    FD  CMD           
root         59  file/x   7  ypbind        
root        186  cwd         cron          
root        186  file     3  cron          
root         95  file     9  syslogd       
root        103  cwd         sendmail      
root        194  file/x   3  lpd           

ofiles has the added advantage of understanding protocol
control blocks (PCB) associated with sockets. Again, this only works
under SunOS 4.1.x, but you can answer nasty questions like "Who owns
the socket on port 139?" First use netstat -A to dump out
the port number and PCB table:

luey% netstat -A 
Active Internet connections
PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
f8984400 tcp        0      0  duck.139           pcserver.3411     TIME_WAIT

Then feed the PCB address to ofiles to locate the owner of the socket
endpoint:

luey% ofiles -n f8984400
USER        PID  TYPE    FD  CMD
daemon     4559  sock     4  lanserver

ofiles should be run as root or made a setgid executable
owned by group kmem so that it too can read the kernel's memory.

The third tool of the trio is lsof,
which produces a list of open files. Consider it a version of
ls that shows no regard for privacy:

luey% lsof /dev/rdsk/c0t6d0s0
COMMAND     PID     USER   FD   TYPE     DEVICE   SIZE/OFF  INODE/NAME
vold        166     root   10r  VCHR    32,  48        0x0   7869 /dev/rdsk/...

lsof is a superset of ofiles and
fuser. It understands how to convert socket descriptors
into process IDs, and can handle specifications in terms of filenames,
filesystem names, or even TCP/IP addresses and service port numbers.
lsof also lets you build in a secure mode, so that users
can only get information on their own open files. Besides coming to
the rescue of users held captive by CD-ROMs that won't unmount, what
else can you do with lsof? Use it to generate "hot lists"
of files and directories to feed capacity planning exercises and drive
disk space allocation. After all, it doesn't matter by what name your
users call their files -- as long as they don't have to call you to
access them.