Processing files with awk, part two

One more piece of awk syntax will make it an even more useful tool. I said in last month's column that awk treats the spaces in a record as a field separator. It is possible to change the field separator to another value.

Figure 1 is an example of a passwd file. The password itself in this example is replaced with a single exclamation mark. This file has several separate fields in it, but the field separator is a colon (:) rather than spaces.

Figure 1

<font face="Courier">root:!:0:1:Super User:/:
daemon:!:1:1:System Daemons:/etc
lbw:!:209:200:Lavinia Bowder Washinton:/home/lbw:/bin/csh
bob:!:210:200:Robbie Cramer:/home/bob:/bin/ksh
joann:!:213:200:Jo Ann Batson:/home/joann:/bin/ksh
jlan:!:214:200:Jack Landon:/home/jlan:/bin/ksh
jank:!:215:200:Jan Kingly:/home/jank:/bin/ksh
ljn:!:216:200:Laura Nugent:/home/ljn:/bin/ksh
mjb:!:220:200:Mo Budlong:/home/mjb:/bin/ksh
bda:!:235:500:Basic Development Accnt:/home/bda:/bin/ksh
obrero:!:245:500::/home/obrero:/bin/ksh
guest1:!:501:500:Guest1 Account:/disk2/guest1:/bin/ksh
guest2:!:502:500:Guest2 Account:/disk2/guest2:/bin/ksh
guest3:!:503:500:Guest3 Account:/disk2/guest3:/bin/ksh
beb:!:248:202:Becky E Brown  :/home/beb:/bin/ksh
</font>

A passwd file can be used as the input file for an awk report by changing the field separator. Figure 2 is a short example. There are two points to notice.

The first is the logic in BEGIN{FS=":"}. In awk, FS is a pre-defined variable that contains the field separator. If you make no changes to it, FS is a single space, which awk treats as a request to split fields on runs of spaces and tabs. In this listing, the BEGIN logic sets FS to a colon (:), so the field separator is changed before the first record is read. This allows the passwd records to be broken into fields at the colons.

The second point to notice is on line 3 of Figure 2. In all previous examples, the input was piped into awk, as in "ls -l|awk etc." In this example, the file is named explicitly by placing it on the command line after the closing single quote at the end of the awk commands. Awk can take its input from a pipe, as in previous examples, or from one or more explicitly named files, as in Figure 2.

Remember to type a TAB wherever you see the ^ mark.

Figure 2

<font face="Courier">awk '
BEGIN{FS=":"}
{print $1 "  ^" $5}' /etc/passwd
</font>

Unless you are in the C shell, the closing quote ends multiline input so be sure to type the closing quote, followed by a space and followed by the name of the file.

Figure 3 is the same example using C shell continuation characters. In other shells, the example shown in Figure 2 works correctly as it stands. Figure 4 gives you two further examples, one version that works and another that does not.

Figure 3

<font face="Courier">awk ' \
BEGIN{FS=":"} \
{print $1 "  ^" $5}' /etc/passwd
</font>

Figure 4

<font face="Courier">awk '
BEGIN{FS=":"}
{print $1 "  ^" $5}
' /etc/passwd        < this works as multiline input is still active

awk '
BEGIN{FS=":"}
{print $1 "  ^" $5}' < multiline input ends here
/etc/passwd          < this won't work because multiline input
                      ended on the previous line
</font>

Figure 5 is sample output from Figure 2, or from Figure 3 for the C shell. The awk script selects field $1, which is the user id, and field $5, which is the user name, and prints them with a tab between them.

Figure 5

<font face="Courier">root         Super User
daemon       System Daemons
lbw          Lavinia Bowder Washinton
bob          Robbie Cramer
joann        Jo Ann Batson
jlan         Jack Landon
jank         Jan Kingly
ljn          Laura Nugent
mjb          Mo Budlong
bda          Basic Development Accnt
obrero       
guest1       Guest1 Account
guest2       Guest2 Account
guest3       Guest3 Account
beb          Becky E Brown
</font>
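If you would rather experiment without touching the real /etc/passwd, here is a minimal sketch of the same technique. It assumes a POSIX shell and awk. The temporary sample file and the variable names `users` and `users_f` are illustrative only, and the three records are copied from Figure 1. It also leans on two things not used in the figures: the -F command-line option, which sets FS before the first record is read, and the "\t" escape, which awk expands to a tab so that no literal TAB has to be typed.

```shell
# Build a small passwd-style sample file (records copied from Figure 1).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:!:0:1:Super User:/:
lbw:!:209:200:Lavinia Bowder Washinton:/home/lbw:/bin/csh
obrero:!:245:500::/home/obrero:/bin/ksh
EOF

# Version 1: set FS in BEGIN, as in Figure 2; "\t" stands in for a typed TAB.
users=$(awk 'BEGIN{FS=":"} {print $1 "\t" $5}' "$sample")

# Version 2: the -F option sets the same separator from the command line.
users_f=$(awk -F: '{print $1 "\t" $5}' "$sample")

printf '%s\n' "$users"
rm -f "$sample"
```

Both versions should produce the same report. Which one to use is largely a matter of taste: BEGIN keeps everything inside the script, while -F is shorter for one-liners.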

Awk has a number of pre-defined variables. You have already seen FS. Another useful one is NR, which contains the number of the current record. It is incremented by 1 as each record is read. You may use it to number the output records as in Figure 6, the output of which would look like Figure 7.

Figure 6

<font face="Courier">awk '
BEGIN{FS=":"}
{print NR ".   ^" $1 "  ^" $5}' /etc/passwd
</font>

Figure 7

<font face="Courier">1.     root         Super User
2.     daemon       System Daemons
3.     lbw          Lavinia Bowder Washinton
4.     bob          Robbie Cramer
5.     joann        Jo Ann Batson
6.     jlan         Jack Landon
7.     jank         Jan Kingly
8.     ljn          Laura Nugent
9.     mjb          Mo Budlong
10.    bda          Basic Development Accnt
11.    obrero       
12.    guest1       Guest1 Account
13.    guest2       Guest2 Account
14.    guest3       Guest3 Account
15.    beb          Becky E Brown
</font>
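Here is a small self-contained sketch of record numbering with NR, assuming a POSIX shell and awk. The three-record sample file and the variable name `numbered` are illustrative, and "\t" is used in place of a typed TAB.

```shell
# Three records copied from Figure 1 serve as the input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:!:0:1:Super User:/:
daemon:!:1:1:System Daemons:/etc
bob:!:210:200:Robbie Cramer:/home/bob:/bin/ksh
EOF

# NR holds the number of the record just read, so the lines come out
# numbered 1, 2, 3 with no counter of our own.
numbered=$(awk 'BEGIN{FS=":"} {print NR ".\t" $1 "\t" $5}' "$sample")

printf '%s\n' "$numbered"
rm -f "$sample"
```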

You may also use NR in the END logic. After the last record is read, NR is left set to the number of the last record, which is also the total number of records read. Figure 8 would produce output that looks like Figure 9.

Figure 8

<font face="Courier">awk '
BEGIN{FS=":"}
{print $1 "  ^" $5}
END{print "Total users = " NR}' /etc/passwd
</font>

Figure 9

<font face="Courier">root         Super User
daemon       System Daemons
lbw          Lavinia Bowder Washinton
bob          Robbie Cramer
joann        Jo Ann Batson
jlan         Jack Landon
jank         Jan Kingly
ljn          Laura Nugent
mjb          Mo Budlong
bda          Basic Development Accnt
obrero       
guest1       Guest1 Account
guest2       Guest2 Account
guest3       Guest3 Account
beb          Becky E Brown  
Total users = 15
</font>
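The END behaviour can be checked on its own with a sketch like the following, assuming a POSIX shell and awk. The sample file and the variable name `total` are illustrative. The script has no main block at all, so nothing is printed per record and only the END logic produces output.

```shell
# Three records copied from Figure 1 serve as the input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:!:0:1:Super User:/:
daemon:!:1:1:System Daemons:/etc
bob:!:210:200:Robbie Cramer:/home/bob:/bin/ksh
EOF

# awk still reads every record; in END, NR holds the number of the last one.
total=$(awk 'END{print "Total users = " NR}' "$sample")

printf '%s\n' "$total"
rm -f "$sample"
```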

Complex reporting: using printf to make it look right

The awk print command is good enough for a lot of reporting, but when it comes to more complex or longer print layouts involving tidy columns of information, you need something more powerful. The intent of Figure 10 is to print four columns of information from the /etc/passwd file -- user id, name, home path, and login shell. The columns are separated by tabs. The actual output looks something like Figure 11. A single tab is not enough to produce decent alignment when the fields are of substantially varying lengths.

Figure 10

<font face="Courier">awk '
BEGIN{FS=":";print "User  ^Name  ^Home  ^Shell"}
{print $1 "  ^" $5 " ^" $6 "  ^" $7}
END{print "Total users = " NR}' /etc/passwd
</font>

Figure 11

<font face="Courier">User  Name  Home  Shell
root  Super User  /
daemon      System Daemons    /etc
lbw   Lavinia Bowder Washinton     /home/lbw   /bin/csh
bob   Robbie Cramer     /home/bob   /bin/ksh
joann Jo Ann Batson     /home/joann /bin/ksh
jlan  Jack Landon /home/jlan  /bin/ksh
jank  Jan Kingly  /home/jank  /bin/ksh
ljn   Laura Nugent      /home/ljn   /bin/ksh
mjb   Mo Budlong  /home/mjb   /bin/ksh
bda   Basic Development Accnt /home/bda    /bin/ksh
obrero            /home/obrero      /bin/ksh
guest1      Guest1 Account    /disk2/guest1     /bin/ksh
guest2      Guest2 Account    /disk2/guest2     /bin/ksh
guest3      Guest3 Account    /disk2/guest3     /bin/ksh
beb   Becky E Brown       /home/beb   /bin/ksh
Total users = 15
</font>

To handle this, it is necessary to use awk's other print command, printf (print formatted). It is similar to the printf function of the C programming language, but a simplified explanation of the command is in order for those who do not know C.

The printf command is executed by providing a format string and a list of the values to be printed using the format string. These are separated by commas as in:

<font face="Courier">
printf "format_string", $1, $3, $6, $7
</font>

Some versions of awk require parentheses around the arguments as in:

<font face="Courier">
printf("format_string", $1, $3, $6, $7)
</font>

It is always safe to include the parentheses.

The values that can be used in a format string are very extensive and can format data in all sorts of ways, but for simple reports, the most useful format is the fixed width string.

A fixed width string field starts with a percent sign (%). If a minus sign (-) follows, then the printed data is left-justified within the fixed width of the field. Most string data is left-justified, so you should usually include the minus sign. The next part of the format is the width of the field, and finally an "s" ends the format. An example of this would be "%-30s", which is a field containing 30 left-justified characters. Using this format string with printf would look something like:

<font face="Courier">printf("%-30s",$1)
</font>

This would print field $1 in a left-justified, 30-character field space.

If field $1 does not contain 30 characters, then the field is padded with spaces until 30 character spaces are filled. One big advantage of a format string is that you can force a field to always print with a certain width by filling unused portions of the field with spaces. You may combine multiple format fields in a format string as in:

<font face="Courier">printf("%-20s%-30s", $1, $2)
</font>

This example will take field $1 and place it, left-justified, into the first printing position. The field will be padded until it is 20 characters long. Then field $2 will be appended and padded out to 30 characters. This guarantees that columns will line up under one another, provided no value is wider than its field. The width given for each field should therefore be large enough to accommodate the longest value that will be placed in it.
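To see the padding for yourself, a sketch like this one may help. It assumes a POSIX shell and awk. The square brackets are only there to make the spaces visible and are not part of the format syntax; the widths 10, 6 and 3 and the variable names `padded` and `overflow` are chosen purely for illustration.

```shell
# A BEGIN-only awk program needs no input file.
# "bob" is padded on the right to 10 characters and "ksh" to 6.
padded=$(awk 'BEGIN{printf("[%-10s][%-6s]\n", "bob", "ksh")}')

# The width is a minimum, not a limit: a value longer than the width is
# printed in full, which is what pushes later columns out of line.
overflow=$(awk 'BEGIN{printf("[%-3s]\n", "daemon")}')

printf '%s\n%s\n' "$padded" "$overflow"
```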

There is one small hitch in printf. The print command automatically prints a newline at the end of each print statement. The printf command does not, so you must explicitly end the format string with a newline "\n".
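The difference is easy to demonstrate with a two-line input. This is a minimal sketch assuming a POSIX shell and awk; the variable names `joined` and `separate` are illustrative.

```shell
# Without "\n" in the format, printf runs the records together on one line.
joined=$(printf 'a\nb\n' | awk '{printf("%s", $0)}')

# With "\n", each record ends its own line, just as print would do.
separate=$(printf 'a\nb\n' | awk '{printf("%s\n", $0)}')

printf '%s\n' "$joined"
printf '%s\n' "$separate"
```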

Using these rules, let's create a format string for the four fields that we want to print from the /etc/passwd file. In Figure 12 I have taken the four fields, found the longest example, made a guess as to a safe width to use, and then created a format string that is one character longer than the safe width. This allows for a minimum of a single space between fields.

Figure 12

<font face="Courier">Field     Longest   Safe Width   Format
User id   6         10           "%-11s"
Name      24        30           "%-31s"
Home      13        15           "%-16s"
Shell     8         15           "%-16s"
</font>
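Rather than hunting for the longest value by eye, you can let awk measure the fields. The following is a sketch, not part of the original procedure: it assumes a POSIX shell and awk, uses awk's standard length() function, and the sample file and the variable names (`widths`, `user`, `name`, `home`, `shell`) are illustrative. The three records are copied from Figure 1.

```shell
# Three records copied from Figure 1 serve as the input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:!:0:1:Super User:/:
lbw:!:209:200:Lavinia Bowder Washinton:/home/lbw:/bin/csh
guest1:!:501:500:Guest1 Account:/disk2/guest1:/bin/ksh
EOF

# Track the longest value seen in fields 1, 5, 6 and 7, then report in END.
widths=$(awk '
BEGIN{FS=":"; user=0; name=0; home=0; shell=0}
{
  if (length($1) > user)  user  = length($1)
  if (length($5) > name)  name  = length($5)
  if (length($6) > home)  home  = length($6)
  if (length($7) > shell) shell = length($7)
}
END{print user, name, home, shell}' "$sample")

printf '%s\n' "$widths"
rm -f "$sample"
```

Run against your own passwd file, the four numbers give you the longest values to start from. You would still add some slack by hand to arrive at a safe width.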

The next step is to combine all of the fields into one long format string and append a newline.

<font face="Courier">printf("%-11s%-31s%-16s%-16s\n")
</font>

Finally list the fields to be printed with separating commas.

<font face="Courier">printf("%-11s%-31s%-16s%-16s\n",$1,$5,$6,$7)
</font>

In your version of awk, the format string and list of values after printf may not need to be enclosed in parentheses, as in:

<font face="Courier">printf "%-11s%-31s%-16s%-16s\n",$1,$5,$6,$7
</font>

It is always safe to use the parentheses, but in many versions of awk you do not need them.

Figure 13 is the first version of the awk script using printf. It does not include column titles.

Figure 13

<font face="Courier">awk '
BEGIN{FS=":"}
{printf("%-11s%-31s%-16s%-16s\n",$1,$5,$6,$7)}
END{print "Total users = " NR}' /etc/passwd
</font>

Figure 14 is the C shell version of the same listing.

Figure 14

<font face="Courier">awk ' \
BEGIN{FS=":"} \
{printf("%-11s%-31s%-16s%-16s\n",$1,$5,$6,$7)} \
END{print "Total users = " NR}' /etc/passwd
</font>

Adding column titles involves ensuring that the column titles actually line up with the fields in the format string. Figure 15 uses a simple trick to ensure that the column titles do align. The values used by printf to fill a format string when printing do not need to be variables. They can also be strings. The header or title line can be created by using the same format string that was used in the body of the report.

Figure 15

<font face="Courier">awk '
BEGIN{FS=":";
printf("%-11s%-31s%-16s%-16s\n","User","Name","Home","Shell")}
{printf("%-11s%-31s%-16s%-16s\n",$1,$5,$6,$7)}
END{print "Total users = " NR}' /etc/passwd
</font>

The output from Figure 15 is shown in Figure 16 -- it's a much more readable and useful output.

Figure 16

<font face="Courier">User       Name                           Home            Shell
root       Super User                     /
daemon     System Daemons                 /etc
lbw        Lavinia Bowder Washinton       /home/lbw       /bin/csh
bob        Robbie Cramer                  /home/bob       /bin/ksh
joann      Jo Ann Batson                  /home/joann     /bin/ksh
jlan       Jack Landon                    /home/jlan      /bin/ksh
jank       Jan Kingly                     /home/jank      /bin/ksh
ljn        Laura Nugent                   /home/ljn       /bin/ksh
mjb        Mo Budlong                     /home/mjb       /bin/ksh
bda        Basic Development Accnt        /home/bda       /bin/ksh
obrero                                    /home/obrero    /bin/ksh
guest1     Guest1 Account                 /disk2/guest1   /bin/ksh
guest2     Guest2 Account                 /disk2/guest2   /bin/ksh
guest3     Guest3 Account                 /disk2/guest3   /bin/ksh
beb        Becky E Brown                  /home/beb       /bin/ksh
Total users = 15
</font>
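One possible refinement of the same trick is to keep the format string in an awk variable, so that the header and the body cannot drift apart when a width is changed. This is a sketch under assumptions rather than part of the original listing: it assumes a POSIX shell and an awk that accepts a variable as the printf format, and the names `fmt`, `report` and the two-record sample file are illustrative.

```shell
# Two records copied from Figure 1 serve as the input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
bob:!:210:200:Robbie Cramer:/home/bob:/bin/ksh
mjb:!:220:200:Mo Budlong:/home/mjb:/bin/ksh
EOF

report=$(awk '
BEGIN{
  FS=":"
  fmt="%-11s%-31s%-16s%-16s\n"   # one format shared by header and rows
  printf(fmt, "User", "Name", "Home", "Shell")
}
{printf(fmt, $1, $5, $6, $7)}
END{print "Total users = " NR}' "$sample")

printf '%s\n' "$report"
rm -f "$sample"
```

With the format defined once, changing a column width means editing a single line.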

In case you're offended by figure 15
