A file by any other name

Unix Insider –

Despite all of the work that has been done with object-oriented systems and user interfaces, the Unix filesystem is the primary mechanism we use to locate and interact with our data. The hierarchical namespace is simple and relatively easy to use -- until you begin adding symbolic links, NFS mounts, removable media, and other wormholes that take you from one disk to another with little warning. Large Unix environments stress the filesystem in new and creative ways: Processes can go file-crazy and run into system-resource limitations. Users who have yet to discover the wonder of directories complain of terrible system performance. Just when you think your list of headaches is complete, you try to unmount a CD-ROM but continually get "filesystem busy" error messages, or you promise to remove the mailbox of a user who is hanging you up -- as soon as you determine his or her identity.

This month, we'll sort through some filesystem navigation issues. We'll look at open() to see how processes find files, and examine some performance issues. From there, it's off to the links -- hard and symbolic -- to see how they alter paths through the filesystem. Finally, we'll look at the tools available to find the user associated with an open file or directory. While we may not offer any solutions to the deep-versus-wide directory layout argument, we'll try to make sure that no matter what you call your files, you'll get the right bits.

Open duplicity

We see the filesystem as a tree of directories and filenames. These physical names aren't used inside of a process. Logical names known as file descriptors, or file handles, are used to identify a file for reading or writing. Unix file descriptors are integers returned by the open() or dup() system calls. Pass a filesystem pathname to open() along with the read or write permissions you want, and open() returns a file descriptor or an error:

int fd;
fd = open("/home/stern/cols/sep95.doc", O_RDONLY);
if (fd < 0) {
	fprintf(stderr, "cannot open file\n");
	exit(1);
}

You can open special (raw) devices or special files like Unix domain sockets using a similar code fragment. File descriptors are maintained on a per-process basis, in the same kernel data structures that keep track of the stack, address space, and signal handlers. Every process starts out with file descriptors 0, 1, and 2 assigned to the standard input, output, and error streams, respectively. Each subsequent call to open() returns the lowest-numbered available file handle. When you call close() on a file descriptor, that handle becomes available again and may be handed out by the next open(). The integer is really just an index into a table of per-process file descriptors that point to the system open file table. The system file table points to other file-specific information like inodes, NFS server addresses, or protocol-control blocks for file descriptors associated with sockets.
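
Here's a minimal sketch of the lowest-available rule in action (the /tmp/fdtest scratch file is made up for illustration): with descriptors 0, 1, and 2 already in use, the first open() normally returns 3, and after a close() the same slot is handed out again.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int a, b;

	/* 0, 1, and 2 are taken, so this is normally descriptor 3 */
	a = open("/tmp/fdtest", O_WRONLY | O_CREAT, 0644);
	printf("first open returned %d\n", a);

	close(a);	/* descriptor 3 is free again */

	/* the freed slot is the lowest one available, so we get it back */
	b = open("/tmp/fdtest", O_RDONLY);
	printf("second open returned %d\n", b);

	close(b);
	return 0;
}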

What's the point of dup() if open() is the primary mechanism for converting pathnames to file handles? When you want to use the same file for two output streams, such as stdout and stderr, use dup() to copy one descriptor into another. The following code segment closes the default stdout and stderr streams and pumps both into a new file:

int fd;
close(1);	/* free up stdout... */
close(2);	/* ...and stderr */
fd = open("/home/stern/log/errors", O_WRONLY|O_CREAT, 0644);	/* re-uses descriptor 1 */
dup(fd);	/* the copy lands in descriptor 2 */

Because the file descriptors for stdout and stderr were closed before the calls to open() and dup(), these handles are re-used.
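
If you would rather not depend on the order in which descriptors are handed out, the dup2() call lets you name the target descriptor explicitly. Here's a brief sketch of the same redirection; the redirect_output() helper and its logfile argument are made up for illustration:

#include <fcntl.h>
#include <unistd.h>

/* send stdout and stderr to a log file without relying on the
   lowest-available-descriptor rule */
int
redirect_output(const char *logfile)
{
	int fd = open(logfile, O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0)
		return -1;
	dup2(fd, 1);		/* stdout */
	dup2(fd, 2);		/* stderr */
	if (fd > 2)
		close(fd);	/* the extra copy is no longer needed */
	return 0;
}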

Open file descriptors are preserved when new processes are created with fork() and vfork(), so any process that gets started via exec() inherits the open file descriptors left by the calling process. Here's a simple example of using dup() and fork() to set up a pipe between two processes:

int fd[2], pid;
pipe(fd);	/* fd[0] is the reading end, fd[1] the writing end */
close(0);
dup(fd[0]);	/* reading end becomes stdin */
if ((pid = fork()) == 0) {
	close(1);
	close(2);
	dup(fd[1]);  /* connect stdout */
	dup(fd[1]);  /* and stderr */
	execlp("writer", "writer", (char *)0);
}
execlp("reader", "reader", (char *)0);

First the call to pipe() creates a pipe and returns file descriptors for the reading and writing ends. Standard input is closed, and the reading end of the pipe is connected via dup(). A new process is created using fork(), and its stdout and stderr are fitted into the writing end of the pipe. The child process execs the writer and the parent process becomes the reader. This is a slimmed-down version of what happens inside the shell when you execute command_a | command_b from the command line.

The per-process file descriptor table has a default maximum size of 20 in SunOS 4.1.x and 64 in Solaris. If you exceed the default size, the next call to open() or dup() fails with the error EMFILE. Most processes don't touch that many files, but system processes or connection managers such as inetd may open many file descriptors. Relax the limit using the unlimit command from the shell. You can even start inetd in a subshell that has the file descriptor limit removed:

luey% ( unlimit descriptors; inetd & )

Setting the file descriptor ceiling to "unlimited" really means 128 in SunOS 4.1.x and 1,024 in Solaris. The smaller limit in SunOS is due to the standard I/O library (and others) using a signed char for the file descriptor. If you are porting code from BSD platforms to Solaris, be on the lookout for snippets that use 8-bit signed file descriptors and test for return values less than zero: Solaris will return file descriptors of 128 or more, which look like failures when stuffed into a signed char but are valid file handles. Solaris also dynamically allocates space in the per-process file descriptor table, so removing the file descriptor limit won't make your processes bloat with unused table space.
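
A process can also raise its own ceiling without help from the shell, using the getrlimit() and setrlimit() calls. A minimal sketch (the raise_fd_limit() helper name is made up for illustration):

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

/* raise this process's descriptor limit as far as the hard limit allows */
int
raise_fd_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
		return -1;
	rl.rlim_cur = rl.rlim_max;	/* push the soft limit up to the hard ceiling */
	if (setrlimit(RLIMIT_NOFILE, &rl) < 0)
		return -1;
	printf("descriptor limit is now %ld\n", (long)rl.rlim_cur);
	return 0;
}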

Open system call

Peering inside the open() system call lets us learn more about the performance implications of excessively deep or wide directory structures. When open() converts a pathname into a file handle, it starts by walking down the path one component at a time. For example, to look up /home/stern/log/errors, first home is found in the root directory, then stern is found under home, and so on. To locate a filename component, a linear search is performed on the directory -- Unix doesn't keep the directory entries sorted on disk, so each one must be examined to find a match or determine that the file or directory doesn't exist.

Why is the lookup done one component at a time? Any of those directories could be a mount point or a symbolic link, taking the pathname resolution to another filesystem or another machine in the case of an NFS mount. If /home/stern is NFS-mounted, the first two lookups occur on the local disk, then the request to find log in stern is sent over the network to the NFS server. These pathname resolutions show up as NFS lookup requests and can account for as much as 40% of your NFS traffic.

Pathname-to-file-descriptor parsing can be a disk-, network-, and server-intensive business. If the directory entries must be searched, they have to be read from disk, incurring file-access overhead before the file is opened for normal I/O. When you are using NFS-mounted files, you may be performing directory searches on the server, accruing network round-trip and disk I/O penalties.

So how does the operating system accelerate this process? Recently completed lookups are kept in the Directory Name Lookup Cache (DNLC), an association of directories and component names that is checked before performing a directory search. DNLC statistics are accessible through vmstat -s:

luey% vmstat -s
     ...
  462936 total name lookups (cache hits 91%)
     214 toolong

Ideally, your hit rate should be over 90%, and as close to 99% as possible. However, several things conspire to make the DNLC less efficient:

  • Opening a file for the first time adds its entry to the DNLC, displacing an older entry once the cache is full. If you perform many browsing-type operations, touching large numbers of files just once, you'll decrease the DNLC's efficiency.

  • When a file or directory is removed, the DNLC entry is purged. Create the file with the same name again, and a new entry is inserted. Large volumes of file-creation activity, such as compilations, reduce the hit rate as well.

  • Having a DNLC that is too small, particularly on an NFS server, will cripple your performance by sending each DNLC miss off to disk for a directory search. Under Solaris, the DNLC is sized dynamically, but in SunOS, you need to increase your maxusers kernel parameter to crank up the cache size.

  • Long filenames can't be cached. In SunOS 4.1.x, any component over 14 characters long doesn't fit in the DNLC, and in Solaris, anything with 32 characters or more isn't inserted.

  • Unmounting a filesystem purges its cached entries. When the automounter decides a filesystem is idle and unmounts it, any entries for files on that filesystem are removed.

Sometimes vmstat -s reports bizarre statistics, usually when its internal counters overflow and its math precision becomes non-existent. Kimberley Brown, co-author of Panic! Unix System Crash Dump Analysis, provides a detailed adb script to dump out all of the name cache (nc) statistics. It shows you the number of hits, misses, too-long names, and purges.

Deep and wide

Given that directory searches occur in linear fashion and each pathname component requires another lookup operation, is it better to have deep or wide directories? Do you do better having all of your files in just a few directories, minimizing the number of lookups, or do you go for more lookups but make each one go faster because the directories have only a few entries?

The answer, as usual, is to avoid the extremes. Excessively deep directory structures generate large numbers of lookup requests and can frustrate users trying to navigate up and down half a dozen levels of directories. Perhaps the worst file naming convention is to take a serial number and use each digit as a directory, that is, document 51269 becomes /docs/5/1/2/6/9. While it's trivial to find any document, it's painful to consume so much of the disk with directories. Use a hash table to find documents quickly, or group them with several dozen documents in a single directory.
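
Here's a sketch of the grouping approach; the /docs prefix, the bucket count, and the doc_path() helper are all made up for illustration, but the idea is simply to hash the serial number so a few dozen documents share each directory:

#include <stdio.h>

/* hash a document serial number into one of a fixed number of buckets,
   so document 51269 lands in /docs/69/51269 instead of /docs/5/1/2/6/9 */
#define DOC_BUCKETS 512

void
doc_path(unsigned long serial, char *buf)
{
	sprintf(buf, "/docs/%lu/%lu", serial % DOC_BUCKETS, serial);
}

Document 51269 then lives in /docs/69/51269: three pathname components instead of six, and each directory stays small enough to search quickly.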

Large, flat directories are equally bad. When a user complains that ls takes a long time, try ls | wc and see how many files are in the current directory. In addition to the disk hits required to search the directory, you're also paying a CPU time penalty to sort the list.
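
The same count can be had from a few lines of C using the opendir() and readdir() library routines; the loop below also shows why a huge directory hurts -- every entry has to be read and examined one at a time. This is a sketch only (the count_entries() name is made up), not how ls is implemented:

#include <stdio.h>
#include <dirent.h>

/* count directory entries the same way a lookup has to scan them:
   one at a time, in whatever order they happen to sit on disk
   (the count includes the . and .. entries) */
long
count_entries(const char *dir)
{
	DIR *dp = opendir(dir);
	long n = 0;

	if (dp == NULL)
		return -1;
	while (readdir(dp) != NULL)
		n++;
	closedir(dp);
	return n;
}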

Link to La-La-Land

Pathname resolution takes a turn when a mount point is encountered, switching from one local disk to another or to an NFS server. Symbolic links cause a similar detour. First, some background on links and their implementation. Filesystem links come in two flavors: hard and symbolic. Hard links are merely multiple names for the same file. They exist as duplicate directory entries, pointing to the same inode (and therefore the same blocks on disk). Create hard links using the ln command:

luey% ln test1 test2

Because both names refer to the same inode on a single filesystem, hard links cannot span filesystem boundaries. If you're wondering why hard links are useful, consider the mv command. You can rename files by copying them and removing the old name. However, this assumes you have disk space for two copies of the file while the operation is in progress, and it may upset any careful tuning you've done of the disks. Instead of copying the file, mv uses a hard link to create the new name, then it removes the old name. The disk blocks never move, unless you're moving the file across filesystems, in which case mv copies the file.
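
Here's a sketch of that two-step rename from C, using the link() and unlink() system calls. The rename_in_place() name is made up; modern code would simply call rename(), but the two-step version shows what mv does within one filesystem:

#include <stdio.h>
#include <unistd.h>

/* rename a file within one filesystem the way mv does: make a second
   directory entry for the same inode, then remove the old name */
int
rename_in_place(const char *from, const char *to)
{
	if (link(from, to) < 0) {	/* fails across filesystem boundaries */
		perror("link");
		return -1;
	}
	if (unlink(from) < 0) {
		perror("unlink");
		return -1;
	}
	return 0;
}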

Symbolic links are pointers to filesystem pathnames, not filesystem data blocks. They're also created with the ln command, using the -s flag:

luey% ln -s /var/log/stern/errors/sep95 /home/stern/log

The link target is the first argument, and the link is the second argument. When open() hits a symbolic link, it follows the link by substituting its value for the current pathname component. In the above example, /home/stern/log is a link to /var/log/stern/errors/sep95, so the lookup of log in stern turns into a lookup of var in the root directory. Encountering a symbolic link lengthens the pathname lookup if the link has many components, and it may force a disk access to read the link. All symbolic links in SunOS are read from the disk, while in Solaris only those with more than 32 characters in the target are read from disk -- shorter links are kept in memory with the link's inode.
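
From a program, symbolic links are created and read back with the symlink() and readlink() calls. A short sketch, reusing the pathnames from the example above:

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	char target[1024];
	int len;

	/* same effect as: ln -s /var/log/stern/errors/sep95 /home/stern/log */
	if (symlink("/var/log/stern/errors/sep95", "/home/stern/log") < 0)
		perror("symlink");

	/* readlink() hands back the link's value, without a terminating NUL */
	len = readlink("/home/stern/log", target, sizeof(target) - 1);
	if (len > 0) {
		target[len] = '\0';
		printf("link points to %s\n", target);
	}
	return 0;
}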

Absolute links point to a pathname that begins with a /, while relative links are resolved starting from the directory that contains the link. For example:

luey% ln -s ../logs/sep95 error

creates a link named error in the current directory, pointing to ../logs/sep95. Using symbolic links successfully means enforcing consistency in filesystem access on all of your clients. If you use absolute links, you must be sure that all of the pathname components in the link are accessible to processes that hit the link. With relative links, you run the risk of ending up somewhere unexpected when you walk up the directory tree.

One of the most common uses of symbolic links is the creation of a consistent name space built from CPU-, OS-, or release-specific components. For example, /usr/local/bin contains binaries specific to a CPU architecture and an OS release. Many administrators use symbolic links to point to the "right" version of /usr/local/bin:

luey% ln -s /usr/local/sun4.solaris2.4/bin /usr/local/bin

Maintaining the links gets more difficult as the number of special cases grows, and it makes performing an upgrade less than pleasant. Another solution is to use the hierarchical mount feature of the automounter, letting it build up the right combination of generic and specific filesystems on the fly. Here's a hierarchical automounter map entry for /usr/local:

/usr/local	\
	/		servera:/usr/local
	/bin		servera:/usr/local/bin.$ARCH.$OS
	/lib		serverb:/usr/local/lib.$ARCH.$OS

First the generic /usr/local template is mounted, giving you machine-independent directories like /usr/local/share and /usr/local/man. Then the bin and lib directories are dropped on top, using the variants named by the automounter variables. Invoke the automounter with the appropriate variable definitions on each machine:

luey% automount OS=solaris2.4 ARCH=sun4

When you upgrade the OS, change this line to reflect the new OS variable value, and all of your machine dependencies are resolved.

Virtual Boston

Wise use of the automounter and symbolic links can keep your users from falling off the edge of a filesystem into previously uncharted net-surf. But what if you want to explicitly prevent prowling? How can you create a restricted environment from which scripts, users, and not-so-well-intentioned visitors cannot escape? The magic system call is chroot() -- change root. When chroot() is called with a pathname, that directory becomes the new virtual root of the filesystem. The calling process and all of its descendants only see files in the virtual root and below.

Anonymous ftp servers use chroot() to keep you corralled in the public ftp area. Webmasters would sleep better at night if CGI (common gateway interface) scripts ran with chroot() so that file damage or exposure caused by script failures could be contained to a selected subset of the server's filesystems.

Once you call chroot(), anything you'll need has to appear in the truncated filesystem. If you are using anonymous ftp, and you expect users to generate listings, you'll need a version of /bin/ls as well as the dynamic linker ld.so and the essential C runtime libraries from /lib. Let's say you want to use /export/pub/ftp as the root of your subset. You'll need to create the following directories and populate them: /export/pub/ftp/bin for binaries such as ls and ld.so, /export/pub/ftp/lib for dynamically linked libraries, and /export/pub/ftp/dev for special devices needed at runtime, like /dev/zero.

If you use chroot() in other places, for example, creating electronically padded cells for scripts, be careful about using symbolic links. Absolute links were written with the old root of the filesystem in mind, but they are resolved relative to the new one. If you perform chroot("/home/cell"), and then try to resolve a link that points to /var/log/errors, the actual pathname created (relative to the true root of the filesystem) is /home/cell/var/log/errors. Relative links are a safer bet for portability when using chroot(), but again be sure that you don't try to walk above the root of the truncated filesystem. An arbitrarily long chain of .. components still only hits the subset root, not the true filesystem root.

You must be root to call chroot(), so it is usually called by daemons started by root just before they exec() another image or change their effective user ID to something non-privileged. Changing your filesystem-centric viewpoint to prune unrelated directories gives you a Virtual Boston -- the "hub of the universe."
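
Here's a sketch of that sequence for a hypothetical daemon; the enter_jail() helper, the jail path, the unprivileged user ID, and /bin/server are placeholders. Note that chroot() is followed immediately by chdir("/"), so the process doesn't keep a working directory outside the new root:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* confine the process to a jail directory, then drop root privileges
   before running the real server -- chroot() must come first, while
   we still have root, and setuid() makes the change one-way */
void
enter_jail(const char *jail, int uid)
{
	if (chroot(jail) < 0 || chdir("/") < 0) {
		perror("chroot");
		exit(1);
	}
	if (setuid(uid) < 0) {
		perror("setuid");
		exit(1);
	}
	execl("/bin/server", "server", (char *)0);	/* path is resolved inside the jail */
	perror("exec");
	exit(1);
}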

User descriptors

By now you can help your users get to files, create shortcuts, and maybe even eliminate the odd complaint about the terrible performance of File Manager on that home directory with a thousand files in it. The true test of courage is to find out what files are in use without having to send mail to everyone in the building. Fortunately, there is a tool triumvirate to let you walk from the system open file table back to associated processes.

The first is a built-in tool called fuser, short for file user. Give it a filename, directory name, or a special device, and it dumps out the process IDs that have open file descriptors pointing to the argument. fuser's output shows you whether the process has the file open as its current working directory, as a normal file, or as its root directory:

luey# fuser /var/spool/lp
/var/spool/lp: 150c

The c indicates that the directory is used as a current working directory; exploring the process ID with ps yields:

luey# ps -p 150
PID TTY        TIME COMD
150 ?          0:00 lpNet

Because fuser goes into the kernel to scan the system file table, it must be run as root. One of the more powerful applications of fuser is finding the process that's got an open file or working directory on the CD device, preventing the CD-ROM caddy from ejecting with "file busy" messages:

luey# fuser /dev/rdsk/c0t6d0s0
/dev/rdsk/c0t6d0s0:      166o

Process 166 has a file open on the CD-ROM; killing it will let you eject away. Note that fuser writes process IDs to stdout, but the open status letters go to stderr. If you want a list of process IDs to hand to kill or another command, catch the standard output of fuser and ignore the commentary on standard error.

A similar tool for SunOS 4.1.x is ofiles. It takes a filename or device and tells you the processes that are using it and the type of use, including read or write locks on the file. ofiles merges the file descriptor table walk with some of the functionality of ps, making it a simpler tool to use:

luey% ofiles /var
/var/	/var/ (mount point)
USER        PID  TYPE    FD  CMD
root         59  file/x   7  ypbind
root        186  cwd         cron
root        186  file     3  cron
root         95  file     9  syslogd
root        103  cwd         sendmail
root        194  file/x   3  lpd

ofiles has the added advantage of understanding protocol control blocks (PCB) associated with sockets. Again, this only works under SunOS 4.1.x, but you can answer nasty questions like "Who owns the socket on port 139?" First use netstat -A to dump out the port number and PCB table:

luey% netstat -A
Active Internet connections
PCB      Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
f8984400 tcp        0      0  duck.139           pcserver.3411     TIME_WAIT

Then feed the PCB address to ofiles to locate the owner of the socket endpoint:

luey% ofiles -n f8984400
USER        PID  TYPE    FD  CMD
daemon     4559  sock     4  lanserver

ofiles should be run as root or made a setgid executable owned by group kmem so that it too can read the kernel's memory.

The third tool of the trio is lsof, which produces a list of open files. Consider it a version of ls that shows no regard for privacy:

luey% lsof /dev/rdsk/c0t6d0s0
COMMAND     PID     USER   FD   TYPE     DEVICE   SIZE/OFF  INODE/NAME
vold        166     root   10r  VCHR    32,  48        0x0   7869 /dev/rdsk/...

lsof is a superset of ofiles and fuser. It understands how to convert socket descriptors into process IDs, and can handle specifications in terms of filenames, filesystem names, or even TCP/IP addresses and service port numbers. lsof also lets you build in a secure mode, so that users can only get information on their own open files. Besides coming to the rescue of users held captive by CD-ROMs that won't unmount, what else can you do with lsof? Use it to generate "hot lists" of files and directories to feed capacity planning exercises and drive disk space allocation. After all, it doesn't matter by what name your users call their files -- as long as they don't have to call you to access them.
