Too small to keep, too big to throw back

Unix Insider –

In my March '98 article on file compression, I asked the question: How big is a file, anyway? This month I am going to expand that question to: How big is a directory, anyway?

If the directory only contains files, it's easy enough to issue an <font face="Courier">ls -ls</font> command and get the sizes of files in bytes and blocks.

<font face="Courier">$ ls -ls
total 6
   2 -rw-r--r--   1 mjb     group        3 Feb 04 23:31 minutes.txt
   4 -rw-r--r--   1 mjb     group     1201 Feb 04 23:25 note.txt
</font>

The first column contains the size of the file in 512-byte blocks, and the sixth column gives the size of the file in bytes. Files in this directory consume 6 blocks, containing only 1204 bytes. In the March column, I discussed allocation units -- the minimum space allocated by the operating system for a file. You should review that article for more details, but here's a brief explanation of how allocation units work.

This method is used in all major operating systems in one form or another. Some convenient number of bytes is selected as the minimum amount that can be allocated to a file. This amount is an allocation unit. If the file doesn't use all the space in an allocation unit, it's recorded at the beginning of the unit, with the remaining space set aside to accommodate further expansion of that file.

As you add to the file, the new data is stored in the empty reserved space on the disk, so long as it doesn't exceed the number of bytes permitted in an allocation unit. Once the file has used all available space, another allocation unit is grabbed and reserved. Any spillover from the first allocation unit is tucked in at the start of the second allocation unit, and so on.

Early Unix systems used an allocation unit of 512 bytes, which came to be known as a block. As disk sizes grew, the basic allocation unit was increased to 1024 bytes on most systems (larger on some), but many utilities, such as <font face="Courier">ls</font> above, still report file sizes or disk use in 512-byte blocks. So the 3-byte file occupies one 1024-byte allocation unit, which is reported as 2 blocks.
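The arithmetic behind those block counts can be sketched in a few lines of shell. This is only an illustration: blocks_for is a made-up helper name, and the 1024-byte allocation unit is an assumption taken from the example system, not something every filesystem shares.

```shell
#!/bin/sh
# Hypothetical helper: blocks_for BYTES [UNIT]
# Rounds BYTES up to a whole number of allocation units (UNIT bytes each,
# assumed to be 1024 here) and reports the result in 512-byte blocks.
blocks_for() {
    bytes=$1
    unit=${2:-1024}
    units=$(( (bytes + unit - 1) / unit ))   # round up to whole units
    echo $(( units * unit / 512 ))           # convert units to 512-byte blocks
}

blocks_for 3      # the 3-byte minutes.txt: prints 2
blocks_for 1201   # the 1201-byte note.txt: prints 4
```

Both results match the first column of the listing above. A real filesystem may also count indirect blocks or use a different unit, so treat the numbers as a model rather than a guarantee.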

In the following example, the directory in question includes a subdirectory, perl. The 2 blocks allocated for the perl directory are the blocks used only by the directory itself, not those used by the files in the directory.

<font face="Courier">$ ls -ls
total 6
   2 -rw-r--r--   1 mjb     group        3 Feb 04 23:31 minutes.txt
   4 -rw-r--r--   1 mjb     group     1201 Feb 04 23:25 note.txt
   2 drwxr-xr-x   2 mjb     group      128 Jan 29 18:53 perl
</font>

We could figure out the sizes by doing an <font face="Courier">ls -ls perl</font>, but suppose there's another directory under perl? And what if there were a third directory beneath that one?

How do you du?

The solution to this dilemma is the Unix utility <font face="Courier">du</font>. This little utility will recurse through all subdirectories and display all the blocks being used. In the display below, the directory being processed contains a perl subdirectory, which in turn contains a src subdirectory. The src directory contains files totaling 1540 blocks. The perl directory count includes all the blocks in src plus the blocks used by files in perl. Finally, the top level includes all blocks below it, plus blocks used by files in the current directory.

<font face="Courier">$ du
1540 ./perl/src
5648 ./perl
5654 .
</font>

The <font face="Courier">-a</font> option displays the details for each file.

<font face="Courier">$ du -a
1500 ./perl/src/big.prl
40   ./perl/src/prog.prl
1540 ./perl/src
4108 ./perl/perl.tar
5648 ./perl
2    ./minutes.txt
4    ./note.txt
5654 .
</font>

The <font face="Courier">du</font> command will cut through a lot of <font face="Courier">ls</font> commands. It provides size information as well as a reasonable display of the directory tree.
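Two other standard options are worth trying alongside -a: -s prints only the total for the directory named, and -k reports sizes in 1024-byte units. The sketch below builds a throwaway tree so it can be run anywhere. The file names are invented for the demonstration, and mktemp is assumed to be available (it is common but not universal).

```shell
#!/bin/sh
# Build a small scratch tree; every name here is made up for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/perl/src"
echo "print 1;" > "$demo/perl/src/prog.prl"
echo "notes"    > "$demo/note.txt"

# One line per directory, each total including everything beneath it.
(cd "$demo" && du)

# -s: only the grand total for the directory named.
(cd "$demo" && du -s .)

# -k: the same total, reported in 1024-byte units.
(cd "$demo" && du -sk .)

# Clean up the scratch tree.
rm -rf "$demo"
```

The exact numbers depend on the filesystem. Note also that some implementations, GNU du among them, report 1024-byte units by default rather than 512-byte blocks, so the -k output may match the plain output on those systems.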

Switching things around with tr

The <font face="Courier">tr</font> utility translates one set of characters into another. It reads only standard input, so a file has to be redirected into it. The command <font face="Courier">tr abc def < test.txt</font> will read test.txt and translate the letter a to d, the letter b to e and the letter c to f. At first glance this doesn't seem very useful, unless you want to practice amateur cryptography, but <font face="Courier">tr</font> has additional options that make it much more powerful. Two examples should give you a feel for the command.

The characters to be translated can be expressed as a range. In the command below, a directory listing is piped through <font face="Courier">tr</font>, which translates a to A, b to B and so on -- converting everything from lowercase to uppercase.

<font face="Courier">$ ls -ls | tr '[a-z]' '[A-Z]'
TOTAL 6
   2 -RW-R--R--   1 MJB     GROUP        3 FEB 04 23:31 MINUTES.TXT
   4 -RW-R--R--   1 MJB     GROUP     1201 FEB 04 23:25 NOTE.TXT
   2 DRWXR-XR-X   2 MJB     GROUP      128 JAN 29 18:53 PERL
</font>
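A note on quoting, with a small sketch. Left unquoted, a set such as [a-z] is also a shell file name pattern, so a one-letter file name in the current directory could replace it before tr ever sees it. Quoting the sets avoids that. The brackets themselves are kept here because, as far as I know, some older System V versions of tr required them around ranges; POSIX character classes are an alternative on current systems.

```shell
#!/bin/sh
# Quoted ranges: the shell hands both sets to tr untouched.
echo "minutes.txt" | tr '[a-z]' '[A-Z]'          # prints MINUTES.TXT

# POSIX character classes do the same job and follow the current locale.
echo "minutes.txt" | tr '[:lower:]' '[:upper:]'  # prints MINUTES.TXT
```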

Using tr in the real world

Among other things, case conversion solves a problem created by some utilities that copy MS-DOS files onto a system. They copy the files using the uppercase convention of MS-DOS, and the file names need to be converted to lowercase to work correctly. Assuming a directory full of files named in uppercase, the following command will rename all the files to lowercase versions. The command takes each file name and echoes it through a pipe using <font face="Courier">tr</font> to change uppercase to lowercase. The result is used as the target of a <font face="Courier">mv</font> command.

<font face="Courier">$ for name in *
> do
> mv "$name" "$(echo "$name" | tr '[A-Z]' '[a-z]')"
> done
$
</font>
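The loop above can be made a little more defensive. The following is a sketch only, and lowercase_names is a hypothetical function name rather than a standard command. It quotes every expansion so names containing spaces survive, skips names that are already lowercase, and refuses to overwrite an existing file.

```shell
#!/bin/sh
# lowercase_names: hypothetical helper that renames every entry in the
# current directory to its lowercase form.
lowercase_names() {
    for name in *
    do
        # In an empty directory the pattern * is left as a literal "*".
        [ -e "$name" ] || continue
        lower=$(printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]')
        # Nothing to do if the name is already lowercase.
        [ "$name" = "$lower" ] && continue
        # Do not clobber a file that already has the lowercase name.
        if [ -e "$lower" ]; then
            echo "skipping $name: $lower already exists" >&2
            continue
        fi
        mv -- "$name" "$lower"
    done
}
```

To use it, change into the directory holding the copied files and call lowercase_names. On a file system that ignores case, the existence check will treat the uppercase and lowercase names as the same file and skip it, so the sketch is aimed at ordinary case-sensitive Unix file systems.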

<font face="Courier">tr</font> includes the <font face="Courier">-s</font> switch, which squeezes repeating instances of the output characters to one instance. In the following example, the file test.txt contains one line with several spaces between the words. The <font face="Courier">tr</font> command translates each space into another space, but the <font face="Courier">-s</font> option compacts multiple spaces into a single output space. The resulting file, test2.txt, has a single space between each word.

<font face="Courier">$ cat test.txt
How     are   you           today?
$ tr -s " " " " < test.txt > test2.txt
$ cat test2.txt
How are you today?
</font>
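Two variations are sketched below, using printf to supply the input so no test file is needed. With -s and only one set, tr squeezes runs of those characters without translating anything. With two sets, it translates first and then squeezes the output characters, which makes it easy to turn a mix of tabs and spaces into single spaces.

```shell
#!/bin/sh
# One set: squeeze runs of spaces, no translation needed.
printf 'How     are   you           today?\n' | tr -s ' '
# prints: How are you today?

# Two sets: turn each tab into a space, then squeeze repeated spaces.
printf 'a\t\tb   c\n' | tr -s '\t' ' '
# prints: a b c
```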

Fancy line numbering with nl

The <font face="Courier">nl</font> utility adds line numbers to a file. Although this would seem like a simple task, <font face="Courier">nl</font> has a great number of options. To illustrate some of these options, we're going to undertake the old-fashioned task of adding line numbers to a Cobol program. I chose this example because it's a great way of illustrating many of the features of <font face="Courier">nl</font>. The following listing is hello.txt, a Cobol program with missing line numbers.

<font face="Courier">$ cat hello.txt
IDENTIFICATION DIVISION.
PROGRAM-ID. HELLO.
ENVIRONMENT DIVISION.
DATA DIVISION.
PROCEDURE DIVISION.

PROGRAM-BEGIN.
    DISPLAY "Hello world."

PROGRAM-DONE.
    STOP RUN.
</font>

The first pass at this is simply to add line numbers, as in the following listing. The output has several problems.

The numbers in this listing start at one and rise in increments of one. Cobol usually operates in increments of 10 or 100, although one is valid. Cobol numbering also includes leading zeroes, which this listing doesn't display. Blank lines should be numbered but aren't. Finally, <font face="Courier">nl</font>'s default behavior is to add a tab separator after the number and before the original line. Though the tabs are not visible in this listing, many Cobol compilers can't handle them at all.

<font face="Courier">$ nl < hello.txt > hello.cbl
$ cat hello.cbl
     1  IDENTIFICATION DIVISION.
     2  PROGRAM-ID. HELLO.
     3  ENVIRONMENT DIVISION.
     4  DATA DIVISION.
     5  PROCEDURE DIVISION.

     6  PROGRAM-BEGIN.
     7      DISPLAY "Hello world."

     8  PROGRAM-DONE.
     9      STOP RUN.
</font>
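If you want to see the tab for yourself, one way (a suggestion of mine, not part of the numbering task) is to pass the output through the l command of sed, which prints non-printing characters as escapes and marks the end of each line with a dollar sign.

```shell
#!/bin/sh
# Number a single line with nl's defaults, then make the separator visible.
# sed -n l writes the tab as \t and ends the line with $.
printf 'HELLO\n' | nl | sed -n l
```

The \t between the number and the text in the result is the default separator that the listing above cannot show.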

Let's tackle these problems one at a time. The separator character can be specified as an ordinary space using the <font face="Courier">-s</font> switch (as in <font face="Courier">-s" "</font>). The first modified version of the command is shown below.

<font face="Courier">$ nl -s" " <hello.txt >hello.cbl
</font>

The format for the number itself is controlled by several options. The <font face="Courier">-w</font> option specifies the width of the number. For Cobol, this width is six. The default for <font face="Courier">nl</font> happens to be six, but I'll include the option to be thorough. The <font face="Courier">-v</font> option lets you specify the starting number, and <font face="Courier">-i</font> lets you specify the increment. In the listing below, I've specified a space separator, and a width of 6 digits, starting at 100 and going up in increments of 100.

<font face="Courier">$ nl -s" " -w6 -v100 -i100 < hello.txt > hello.cbl
$ cat hello.cbl
   100 IDENTIFICATION DIVISION.
   200 PROGRAM-ID. HELLO.
   300 ENVIRONMENT DIVISION.
   400 DATA DIVISION.
   500 PROCEDURE DIVISION.

   600 PROGRAM-BEGIN.
   700     DISPLAY "Hello world."

   800 PROGRAM-DONE.
   900     STOP RUN.
</font>

This is closer, but it still needs work. The number format is controlled by the <font face="Courier">-n</font> option. There are three formats. Left-justified with leading zeroes suppressed is represented as <font face="Courier">-nln</font>. Right-justified with leading zeroes suppressed is <font face="Courier">-nrn</font>. (This is the default.) Right-justified with leading zeroes kept is <font face="Courier">-nrz</font>. I use <font face="Courier">-nrz</font> in the following listing:

<font face="Courier">$ nl -s" " -w6 -v100 -i100 -nrz < hello.txt > hello.cbl
$ cat hello.cbl
000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLO.
000300 ENVIRONMENT DIVISION.
000400 DATA DIVISION.
000500 PROCEDURE DIVISION.

000600 PROGRAM-BEGIN.
000700     DISPLAY "Hello world."

000800 PROGRAM-DONE.
000900     STOP RUN.
</font>

The default behavior of <font face="Courier">nl</font> is to skip blank lines, as shown above. The treatment of blank lines can be modified with the <font face="Courier">-b</font> switch. Some <font face="Courier">-b</font> options are <font face="Courier">-ba</font> (number all lines), <font face="Courier">-bt</font> (number only text lines -- the default behavior), and <font face="Courier">-bpstring</font> (number only lines matching the regular expression "string"; a plain word matches any line that contains it). This last option is interesting. An artificial example of this is shown in the following listing. Here, only lines containing the word PROGRAM are numbered.

<font face="Courier">$ nl -s" " -w6 -v100 -i100 -nrz -bpPROGRAM < hello.txt > hello.cbl
$ cat hello.cbl
IDENTIFICATION DIVISION.
000100 PROGRAM-ID. HELLO.
ENVIRONMENT DIVISION.
DATA DIVISION.
PROCEDURE DIVISION.

000200 PROGRAM-BEGIN.
    DISPLAY "Hello world."

000300 PROGRAM-DONE.
    STOP RUN.
</font>
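Because the text after -bp is treated as a basic regular expression, the match can be anchored. The sketch below uses two sample lines of my own to show the difference: with ^PROGRAM, only a line that begins with PROGRAM is numbered, while a line that merely mentions the word is left alone.

```shell
#!/bin/sh
# Two sample lines: the first starts with PROGRAM, the second only contains it.
# With the anchored pattern ^PROGRAM, only the first line receives a number.
printf 'PROGRAM-ID. HELLO.\n    DISPLAY "PROGRAM"\n' |
    nl -s' ' -nrz -bp'^PROGRAM'
```

How the unnumbered line is padded varies between implementations, so the exact spacing of the second line may differ from system to system.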

But what we really want is the <font face="Courier">-ba</font> option to number all lines. In the following listing we have the final version of the command, and the result.

<font face="Courier">$ nl -s" " -w6 -v100 -i100 -nrz -ba < hello.txt > hello.cbl
$ cat hello.cbl
000100 IDENTIFICATION DIVISION.
000200 PROGRAM-ID. HELLO.
000300 ENVIRONMENT DIVISION.
000400 DATA DIVISION.
000500 PROCEDURE DIVISION.
000600 
000700 PROGRAM-BEGIN.
000800     DISPLAY "Hello world."
000900 
001000 PROGRAM-DONE.
001100     STOP RUN.
</font>
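The same set of options works in a pipeline, which is a handy way to try them without creating a file. This is a small sketch with three lines of sample input (the middle one blank), not a replacement for the command above.

```shell
#!/bin/sh
# Number every line, blank ones included, Cobol style:
#   -s' '  space separator      -w6    six-digit numbers
#   -v100  start at 100         -i100  step by 100
#   -nrz   keep leading zeroes  -ba    number all lines
printf 'PROGRAM-BEGIN.\n\nPROGRAM-DONE.\n' |
    nl -s' ' -w6 -v100 -i100 -nrz -ba
```

The first line comes out as 000100 PROGRAM-BEGIN. and the third as 000300 PROGRAM-DONE., with the blank line in between taking number 000200.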

The <font face="Courier">nl</font> program sounds deceptively simple at first, but it performs a wide range of numbering tasks. It also includes switches for recognizing the start of new pages, for numbering pages, and for restarting the count so that the lines on each page can start at one.
