A file by any other name

Unix Insider –

Despite all of the work that has been done with object-oriented systems and user interfaces, the Unix filesystem is the primary mechanism we use to locate and interact with our data. The hierarchical namespace is simple and relatively easy to use -- until you begin adding symbolic links, NFS mounts, removable media, and other wormholes that take you from one disk to another with little warning. Large Unix environments stress the filesystem in new and creative ways. Processes can go file-crazy and run into system-resource limitations. Users who have yet to discover the wonder of directories complain of terrible system performance. Just when you think your list of headaches is complete, you try to unmount a CD-ROM but continually get "filesystem busy" error messages. Or you promise to remove the mailbox of a user who is hanging you up -- as soon as you determine his or her identity.

This month, we'll sort through some filesystem navigation issues. We'll look at

<font face="Courier">open()</font>
to see how processes find files, and examine some performance issues. From there, it's off to the links -- hard and symbolic -- to see how they alter paths through the filesystem. Finally, we'll look at the tools available to find the user associated with an open file or directory. While we may not offer any solutions to the deep-versus-wide directory layout argument, we'll try to make sure that no matter what you call your files, you'll get the right bits.

Open duplicity

We see the filesystem as a tree of directories and filenames. These physical names aren't used inside of a process. Logical names known as file descriptors, or file handles, are used to identify a file for reading or writing. Unix file descriptors are integers returned by the

<font face="Courier">open()</font>
or
<font face="Courier">dup()</font>
system calls. Pass a filesystem pathname to
<font face="Courier">open()</font>
along with the read or write permissions you want, and
<font face="Courier">open()</font>
returns a file descriptor or an error:

<font face="Courier">int fd;
fd = open("/home/stern/cols/sep95.doc", O_RDONLY);
if (fd < 0) {
	fprintf(stderr, "cannot open file\n");
	exit(1);
}
</font>

You can open special (raw) devices or special files like Unix domain sockets using a similar code fragment. File descriptors are maintained on a per-process basis, in the same kernel data structures that keep track of the stack, address space, and signal handlers. Every process starts out with file descriptors 0, 1, and 2 assigned to the standard input, output, and error streams, respectively. Each subsequent call to

<font face="Courier">open()</font>
returns the lowest-numbered file handle that is free. When you call
<font face="Courier">close()</font>
on a file descriptor, that handle goes back into the pool and, if it is the lowest free slot, is the next one returned by
<font face="Courier">open()</font>
. The integer is really just an index into a table of per-process file descriptors that point to the system open file table. The system file table points to other file-specific information like inodes, NFS server addresses, or protocol-control blocks for file descriptors associated with sockets.

What's the point of

<font face="Courier">dup()</font>
if
<font face="Courier">open()</font>
is the primary mechanism for converting pathnames to file handles? When you want to use the same file for two output streams, such as
<font face="Courier">stdout</font>
and
<font face="Courier">stderr</font>
, use
<font face="Courier">dup()</font>
to copy one descriptor into another. The following code segment closes the default stdout and stderr streams and pumps both into a new file:

<font face="Courier">int fd;
close(1);
close(2);
fd = open("/home/stern/log/errors", O_WRONLY|O_CREAT, 0644);
dup(fd);
</font>

Because the file descriptors for stdout and stderr were closed before the calls to

<font face="Courier">open()</font>
and
<font face="Courier">dup()</font>
, these handles are re-used.

Open file descriptors are preserved when new processes are created with

<font face="Courier">fork()</font>
and
<font face="Courier">vfork()</font>
, so any process that gets started via
<font face="Courier">exec()</font>
inherits the open file descriptors left by the calling process. Here's a simple example of using
<font face="Courier">dup()</font>
and
<font face="Courier">fork()</font>
to set up a pipe between two processes:

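<font face="Courier">int fd[2], pid;
pipe(fd);                /* fd[0] is the reading end, fd[1] the writing end */
close(0);
dup(fd[0]);              /* stdin now reads from the pipe */
if ((pid = fork()) == 0) {
	close(1);
	close(2);
	dup(fd[1]);  /* connect stdout */
	dup(fd[1]);  /* and stderr */
	close(fd[0]);        /* the original pipe handles are no longer needed */
	close(fd[1]);
	execlp("writer", "writer", (char *)0);
} 
close(fd[0]);
close(fd[1]);            /* otherwise the reader never sees end-of-file */
execlp("reader", "reader", (char *)0);
</font>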

First the call to

<font face="Courier">pipe()</font>
creates a pipe and returns file descriptors for the reading and writing ends. Standard input is closed, and the reading end of the pipe is connected via
<font face="Courier">dup()</font>
. A new process is created using
<font face="Courier">fork()</font>
, and its stdout and stderr are fitted into the writing end of the pipe. The child process
<font face="Courier">exec</font>
s the writer and the parent process becomes the reader. This is a slimmed-down version of what happens inside the shell when you execute
<font face="Courier">command_a | command_b</font>
from the command line.

The per-process file descriptor table has a default maximum size of 20 in SunOS 4.1.x and 64 in Solaris. If you exceed the default size, the next call to

<font face="Courier">open()</font>
or
<font face="Courier">dup()</font>
returns -1 and sets errno to EMFILE. Most processes don't touch that many files, but system processes or connection managers such as
<font face="Courier">inetd</font>
may open many file descriptors. Relax the limit using the
<font face="Courier">unlimit</font>
command from the shell. You can even start
<font face="Courier">inetd</font>
in a subshell that has the file descriptor limit removed:

<font face="Courier">luey% ( unlimit descriptors; inetd & )
</font>

Setting the file descriptor ceiling to "unlimited" really means 128 in SunOS 4.1.x, and 1,024 in Solaris. The smaller limit in SunOS is due to the standard I/O library (and others) using a signed

<font face="Courier">char</font>
for the file descriptor. If you are porting code from BSD platforms to Solaris, be on the lookout for snippets that store file descriptors in 8-bit signed variables and test for return values less than zero: Solaris will return file descriptors of 128 and above, which wrap to negative values in a signed char and look like failures even though they are valid file handles. Solaris also dynamically allocates space in the per-process file descriptor table, so removing the file descriptor limit won't make your processes bloat with unused table space.

Open system call

Peering inside the

<font face="Courier">open()</font>
system call lets us learn more about the performance implications of excessively deep or wide directory structures. When
<font face="Courier">open()</font>
converts a pathname into a file handle, it starts by walking down the path one component at a time. For example, to look up /home/stern/log/error, first home is found in the root directory, then stern is found under home, and so on. To locate a filename component, a linear search is performed on the directory -- Unix doesn't keep the directory entries sorted on disk, so each one must be examined to find a match or determine that the file or directory doesn't exist.

Why is the lookup done one component at a time? Any of those directories could be a mount point or a symbolic link, taking the pathname resolution to another filesystem or another machine in the case of an NFS mount. If /home/stern is NFS-mounted, the first two lookups occur on the local disk, then the request to find log in stern is sent over the network to the NFS server. These pathname resolutions show up as NFS lookup requests and can account for as much as 40% of your NFS traffic.

Pathname-to-file-descriptor parsing can be a disk-, network-, and server-intensive business. If the directory entries must be searched, they have to be read from disk, incurring file-access overhead before the file is opened for normal I/O. When you are using NFS-mounted files, you may be performing directory searches on the server, accruing network round-trip and disk I/O penalties.

So how does the operating system accelerate this process? Recently completed lookups are kept in the Directory Name Lookup Cache (DNLC), an association of directories and component names that is checked before performing a directory search. DNLC statistics are accessible through

<font face="Courier">vmstat -s</font>
:

<font face="Courier">luey% vmstat -s
     ...
  462936 total name lookups (cache hits 91%)
     214 toolong
</font>

Ideally, your hit rate should be over 90%, and as close to 99% as possible. However, several things conspire to make the DNLC less efficient:

  • Opening a file for the first time adds its entry to the DNLC. If you perform many browsing-type operations, touching lots of files only once, you'll decrease the DNLC's efficiency.

  • When a file or directory is removed, the DNLC entry is purged. Create the file with the same name again, and a new entry is inserted. Large volumes of file-creation activity, such as compilations, reduce the hit rate as well.

  • Having a DNLC that is too small, particularly on an NFS server, will cripple your performance by sending each DNLC miss off to disk for a directory search. Under Solaris, the DNLC is sized dynamically but in SunOS, you need to increase your

    <font face="Courier">maxusers</font>
    kernel parameter to crank up the cache size.

  • Long filenames can't be cached. In SunOS 4.1.x, any component over 14 characters long doesn't fit in the DNLC, and in Solaris, anything with 32 characters or more isn't inserted.

  • Unmounting a filesystem purges its cached entries. When the automounter decides a filesystem is idle and unmounts it, any entries for files on that filesystem are removed.

Sometimes

<font face="Courier">vmstat -s</font>
reports bizarre statistics, usually when its internal counters overflow and its math precision becomes non-existent. Kimberley Brown, co-author of Panic! Unix System Crash Dump Analysis, provides a detailed adb script to dump out all of the name cache (nc) statistics. It shows you the number of hits, misses, too-long names, and purges.

Deep and wide

Given that directory searches occur in linear fashion, but each pathname component requires another lookup operation, is it better to have deep or wide directories? Do you do better having all of your files in just a few directories, minimizing the number of lookups, or do you go for more lookups but make each one go faster because the directories have only a few entries?

The answer, as usual, is to avoid the extremes. Excessively deep directory structures generate large numbers of lookup requests and can frustrate users trying to navigate up and down half a dozen levels of directories. Perhaps the worst file naming convention is to take a serial number and use each digit as a directory, that is, document 51269 becomes /docs/5/1/2/6/9. While it's trivial to find any document, it's painful to consume so much of the disk with directories. Use a hash table to find documents quickly, or group them with several dozen documents in a single directory.

Large, flat directories are equally bad. When a user complains that

<font face="Courier">ls</font>
takes a long time, try
<font face="Courier">ls | wc</font>
and see how many files are in the current directory. In addition to the disk hits required to search the directory, you're also paying a CPU time penalty to sort the list.

Link to La-La-Land

Pathname resolution takes a turn when a mount point is encountered, switching from one local disk to another or to an NFS server. Symbolic links cause a similar detour. First, some background on links and their implementation. Filesystem links come in two flavors: hard and symbolic. Hard links are merely multiple names for the same file. They exist as duplicate directory entries, pointing to the same blocks on disk. Create hard links using the

<font face="Courier">ln</font>
command:

<font face="Courier">luey% ln test1 test2
</font>