Standard input and output
Standard input and output refer to the default places from which a program will take input and to which it will write output. The standard input (stdin) for a program running interactively at the command line is the keyboard; the standard output (stdout) is the terminal screen.
Redirection
With input/output redirection, a program can take input or send output someplace other than standard input or output -- to or from a file, for instance. Redirection of stdin is accomplished using the < symbol, and redirection of stdout by the > symbol. For example:
$ ls > list
redirects the output of the ls command, which would normally go to the terminal, into a file called list. Similarly,
$ cat < list
redirects the input for cat, which, in the absence of a filename argument, would be expected to come from the keyboard, so that it comes from the file list instead. The effect is to print the contents of that file on screen.
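Both kinds of redirection can appear on the same command line. The following is a minimal sketch; the scratch directory and the file names alpha, beta, and copy are made up for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch alpha beta

# Send the directory listing to a file instead of the terminal.
# The shell creates "list" before ls runs, so it appears in its own listing.
ls > list

# Read standard input from list and write standard output to copy.
cat < list > copy
```

Afterwards, copy holds the same three names as list: alpha, beta, and list.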
Pipes
Pipes connect programs through I/O redirection and are denoted by the | symbol. For example:
$ ls | less
is a common way of comfortably viewing the output from a directory listing where there are more files than will fit on the screen.
grep
The principle of grep is very simple: search the input for a pattern, and then output every line that contains it. Here's an example:
$ grep 'Linus Torvalds' *
This searches all the files in the current directory for the name Linus Torvalds.
Various command-line switches can modify grep's behavior. For example, if we aren't sure about case, we can write:
$ grep -y 'linus torvalds'
The -y switch tells grep to match without considering case. If you use uppercase letters in the pattern, however, they'll still match only uppercase. (This is broken in GNU grep, which ignores case when given the -y switch -- that's what the -i switch is for).
Given even this much grep, it's easy to construct a practical application. Store name and address details in a file and you've got a searchable address book.
$ grep -y [search arg] ~/lib/phone-book
Put the command above in a text file called filename and make it executable:
$ chmod +x filename
But suppose we want to find all the occurrences of a text string that could be a reference to Linus Torvalds. Searching for Linus Torvalds won't find Linus or Torvalds individually. We need a way of saying, "This or this or this." Here's where egrep (extended grep) comes in. This handy program modifies standard grep to provide just such a syntax.
$ egrep 'Linus Torvalds|L\. Torvalds|L\. T\.|Mr\. Torvalds'
This will find most ways of naming the inventor of Linux. Note the backslashes to escape the full stop; because that's a special character in regular expressions, when we want to use it as itself, we must tell egrep not to interpret it as a magic character.
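To see the alternation at work, here is a small sketch run against a made-up file of notes. It uses grep -E, the POSIX spelling of egrep; the sample text and the file name notes are assumptions for illustration only.

```shell
tmp=$(mktemp -d)
cat > "$tmp/notes" <<'EOF'
Linus Torvalds wrote the first kernel.
Mail from L. Torvalds arrived.
Mr. Torvalds gave a talk.
Mrs Torvaldsen is someone else.
EOF

# Each alternative is tried in turn; the escaped dots match only a literal full stop.
matches=$(grep -E 'Linus Torvalds|L\. Torvalds|L\. T\.|Mr\. Torvalds' "$tmp/notes")
printf '%s\n' "$matches"
```

The first three lines are printed. The fourth is not, because "Mrs" has no full stop after "Mr".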
Regular expressions
grep, egrep, and many other Unix filters support regular expressions. A regular expression (regexp, for short) is a description of a text pattern, written in a small language designed for that purpose. The special symbols are as follows:
.        any one character
*        zero or more of the preceding character
^        beginning of line
$        end of line
[a-z]    a set of characters; [a-z] is the whole lowercase alphabet
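A quick way to get a feel for these symbols is to try them with grep on a short word list. This is only a sketch: the word list is invented, and LC_ALL=C is set on the assumption that you want [a-z] to mean exactly the 26 lowercase ASCII letters.

```shell
# Use the C locale so that [a-z] covers only lowercase ASCII letters.
LC_ALL=C
export LC_ALL

words=$(printf '%s\n' cat cot cart ct coat Cat scat)

printf '%s\n' "$words" | grep '^c.t$'       # . is any one character: cat, cot
printf '%s\n' "$words" | grep '^ca*t$'      # a* is zero or more a's: cat, ct
printf '%s\n' "$words" | grep '^[a-z]at$'   # one lowercase letter, then "at": cat
```

The ^ and $ anchors matter here: without them, a pattern such as c.t would also match inside a longer word like scat.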
tr
tr is perhaps the epitome of the filter. Short for translate, tr changes a character or set of characters to another character or set of characters by mapping input characters to output characters. Here's an example:
$ tr A-Z a-z
This changes uppercase letters to lowercase.
A more complicated example applies rot13, an old cipher. Each letter of the alphabet is changed to the letter 13 characters ahead of it in the alphabetic sequence. Letters in the second half of the alphabet are wrapped around.
tr '[a-m][n-z][A-M][N-Z]' '[n-z][a-m][N-Z][A-M]'
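Because the alphabet has 26 letters, applying rot13 twice gives back the original text. The sketch below wraps the command above in a shell function (the name rot13 is chosen here for convenience) and checks the round trip.

```shell
# Wrap the tr command in a function so that it can be reused in a pipeline.
rot13() {
    tr '[a-m][n-z][A-M][N-Z]' '[n-z][a-m][N-Z][A-M]'
}

echo 'Hello, World' | rot13            # prints: Uryyb, Jbeyq
echo 'Hello, World' | rot13 | rot13    # prints: Hello, World
```

Punctuation and spaces are not in either set, so tr passes them through unchanged.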
sort(1)
Sorting is a very basic computer operation commonly used on text to get lists in alphabetical order, or to sort a numbered list. Unix has a powerful filter for sorting called, logically enough, sort(1).
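The difference between alphabetical and numeric sorting is easiest to see on a list of numbers. A minimal sketch, using made-up input:

```shell
# Default sort compares lines as text, character by character.
printf '%s\n' 10 9 100 1 | sort       # 1, 10, 100, 9

# -n compares by numeric value instead.
printf '%s\n' 10 9 100 1 | sort -n    # 1, 9, 10, 100

# -r reverses whichever ordering is in effect.
printf '%s\n' 10 9 100 1 | sort -nr   # 100, 10, 9, 1
```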
head and tail
These are two very simple filters with a variety of uses. As their names suggest, head(1) shows the beginning of a file, while tail(1) shows the end. By default, both show the first or last 10 lines, respectively, but tail in particular has other useful options.
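Combined, the two filters can pick out any range of lines. The sketch below uses a small helper, numbers (a name invented here), to generate 25 numbered lines to work on.

```shell
# Hypothetical helper: print the integers 1 to 25, one per line.
numbers() {
    awk 'BEGIN { for (i = 1; i <= 25; i++) print i }'
}

numbers | head                      # first 10 lines: 1 to 10
numbers | tail -n 3                 # last 3 lines: 23, 24, 25
numbers | head -n 15 | tail -n 5    # lines 11 to 15
```

The last pipeline takes the first 15 lines and then keeps only the final 5 of those.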
sed
sed, the stream editor, is a filter used to operate on lines of text as an alternative to an interactive editor. There are times when firing up vi and making a change, whether manually or using vi/ex commands, is not appropriate. What if you have to make the same changes to 50 files? What if you need to change a string, but aren't sure exactly what files it occurs in?
As is common in the Unix world where tools are often duplicated, sed can do most things that grep can. Here's a simple grep in sed:
sed -n '/Linus Torvalds/p'
All this does is read standard input and print the lines containing the string Linus Torvalds.
sed's default behavior is to pass standard input to standard output unchanged. To make it do anything useful, give it instructions. In the example above, we searched for the string by enclosing it in // and told sed to print (p) any line with that string in it. The -n switch made sure it didn't print any lines that didn't match the pattern. Remember, the default behavior is to print everything.
If this was all sed could do, we'd be better off sticking with grep. sed's forte is changing text files according to rules you supply.
Let's take a simple example.
$ sed 's/Torvuls/Torvalds/g'
This uses the sed substitute (s) command and applies it globally (g). It looks for every occurrence of Torvuls and changes each one to Torvalds. Without the g flag at the end, it would change only the first occurrence of Torvuls on each line.
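The effect of the g flag shows up as soon as a line contains the misspelling twice. A short sketch with a made-up input line:

```shell
line='Torvuls met Torvuls'

# Without g, only the first match on the line is replaced.
echo "$line" | sed 's/Torvuls/Torvalds/'     # prints: Torvalds met Torvuls

# With g, every match on the line is replaced.
echo "$line" | sed 's/Torvuls/Torvalds/g'    # prints: Torvalds met Torvalds
```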
sed '/^From /,/^$/d'
This searches the standard input for the word From at the beginning of a line followed by a space, and deletes all the lines from the line containing that pattern up to and including the first blank line. The blank line is represented by ^$: a beginning of line (^) followed immediately by an end of line ($). In plain English, it strips out the header from a Usenet posting that you've saved in a file.
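Here is a small sketch of that range deletion in action. The saved posting is invented for illustration; only the sed command comes from the text above.

```shell
tmp=$(mktemp -d)
cat > "$tmp/post" <<'EOF'
From paul Mon Jan  1 00:00:00 2001
Subject: filters

Body line one.
Body line two.
EOF

# Delete from the "From " line through the first blank line; keep the rest.
body=$(sed '/^From /,/^$/d' "$tmp/post")
printf '%s\n' "$body"
```

Only the two body lines are printed.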
Making a text file double-spaced takes just one command:
#!/usr/bin/sed -f
G
According to the manual page, all that does is "append the contents of the hold space to the current text buffer." That means for each line we output the contents of a buffer that sed uses to store text. Because we haven't put anything in there, it's empty. But in sed, appending this buffer adds a newline, regardless of whether there is anything in the buffer. The effect is to add an extra newline to each line, double-spacing the output.
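The same command can be given directly on the command line rather than in a script file. A minimal sketch with two lines of made-up input:

```shell
# G appends a newline plus the (empty) hold space to every input line.
printf 'one\ntwo\n' | sed G
```

Two lines go in and four come out: each original line is followed by an empty one.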
Now, something more complex: I publish my configuration file for vi, .exrc, on the Web. I want it to look nice in people's browsers, so I run it through sed to turn it into a simple HTML document.
#!/bin/sed -f
#filter-exrc: turn .exrc into html
1i\
<html>\
<head>\
<title>Paul Dunne's .exrc file<\/title>\
<\/head>\
<body>\
<pre>\
<code>
$a\
<\/code>\
<\/pre>\
<\/body>\
<\/html>
First, we give sed the address of the first line (1) and tell it to insert (i) all the given text up to the next unescaped newline. Escaping each newline with a backslash lets us insert several lines of text at once. Then we wait until the end of the file ($) before appending (a) some additional lines of HTML. Remember, although we gave no instructions save for the first and last lines, sed has been sending all the lines of the .exrc file to standard output. Our result is the original .exrc file, bracketed with extra lines that make it an HTML document.
awk
Another useful filter is the awk programming language. The name awk comes from the initials of Aho, Weinberger, and Kernighan -- the three writers of the language.
Here's another way to do a grep:
$ awk '/Linus Torvalds/'
Like grep and sed, awk can search for text patterns. As is the case with sed, each pattern can be associated with an action. If no action is supplied, the default action is to print each line in which the pattern is matched. Conversely, if no pattern is supplied, the action is applied to every line.
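The three combinations (pattern only, action only, and pattern with action) can be sketched as follows. The three names used as input are sample data.

```shell
input=$(printf '%s\n' 'Linus Torvalds' 'Ken Thompson' 'Linus Pauling')

# Pattern only: the default action prints each matching line.
printf '%s\n' "$input" | awk '/Linus/'

# Action only: with no pattern, the action runs on every line.
printf '%s\n' "$input" | awk '{ print NR ": " $0 }'

# Pattern and action: count matching lines, then report at the end.
printf '%s\n' "$input" | awk '/Linus/ { n++ } END { print n }'    # prints: 2
```

NR is awk's built-in count of input lines read so far, and the END block runs once after the last line.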
Here's a longer example, a script that centres lines:
#!/usr/bin/awk -f
#centre: centre lines in file(s) or stdin
#usage: centre [filenames]
BEGIN {
    linelength = 80
}
{
    # reset the padding for each line; otherwise it grows from line to line
    spaces = ""
    for (i = 1; i < (linelength - length($0)) / 2; i++)
        spaces = spaces " "
    print spaces $0
}
Of course, this isn't the only filter for centering text. We could write it in sed:
sed -n '
# remove leading and trailing blanks
s/^[ ]*\(.*[^ ]\).*$/\1/
# append 80 spaces
s/$/                                                                                /
# chop character 80 onwards
s/^\(.\{80\}\).*/\1/
# prefix string with half the trailing spaces
s/^\(.*[^ ]\)\( *\)\(\2\)/\2\1/
p
'
One strength of awk is its ability to treat data as tabular -- that is, to arrange it in rows and columns. awk automatically splits each input line into fields. The default field separator is white space (blanks and tabs), but you can set it to any character you want. Many Unix utilities produce this sort of tabular output. In the next section, we'll see how the output of such utilities can be sent as input to awk, using pipes.
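Field splitting can be sketched with two one-liners. Both input lines are sample data; the -F option is how awk is told to use a different field separator.

```shell
# Default separator: fields are split on runs of blanks and tabs.
echo 'alice 30 london' | awk '{ print $2 }'                          # prints: 30

# -F: makes the colon the separator, as in a passwd-style record.
echo 'root:x:0:0:root:/root:/bin/sh' | awk -F: '{ print $1, $7 }'    # prints: root /bin/sh
```

The comma in the second print statement makes awk put its output field separator, a single space by default, between the two fields.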
Pipes
The pipe (|) is a junction that allows us to connect the standard output of one program with the standard input of another. We can build quite complex programs on the command line or in a shell script simply by stringing filters together.
If we look again at the humble wc filter, we see its default output is in four columns. An alternative way of specifying the -c switch (i.e., to count only characters) would be:
$ wc sw.filters | awk '{ print $3 }'
16826
This takes the whole output of wc:
491 3011 16826 sw.filters
and filters it to get the third column, the character count. To print the whole input line, simply use $0.
We know we can see hidden files using ls -a, but how do we see just hidden files? A simple filtering of ls -a output makes it easy.
$ ls -a | grep '^[.].*'
ls output often needs filtering. To see what programs I've been working on recently, I might run:
$ ls -tr ~/bin | tail -80 | 3
Of course, pipes greatly increase the power of programmable filters, such as sed and awk. Here's a script to calculate the last Friday of any given month.
#!/bin/sh
/usr/bin/cal $1 $2 |
awk '{ lasta = a; a = $6; if (a == "") a=lasta } END { print a }'
It's often useful to store data in simple ASCII tables, and awk is a great tool for manipulating such data. Consider this weights and measures converter. We have a simple text file of conversions:
From To Rate
--- --- ----
kg lb 2.20
lb kg 0.4536
st lb 14
lb st 0.07
kg st 0.15
st kg 6.35
in cm 2.54
cm in 0.394
The script below reads a weight, the unit it's measured in, the unit we wish to convert to, and gives us the result.
$ weightconv 100 kg lb
220
$
#!/bin/sh
#weightconv: weights & measures converter
table=/usr/local/lib/weights_and_measures
case $# in
0|1) echo "weightconv: usage weightconv amount from [to]" 1>&2; exit 1;;
esac
amount=$1
from=$2
to=$3
rate=`grep "^$from $to" $table|awk '{print $3}'`
case $rate in
"") echo "weightconv: no rate found for $from to $to" 1>&2; exit 2;;
esac
echo $amount $rate | awk '{print $1*$2}'
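The lookup and the multiplication can also be done in a single awk invocation. This is an alternative sketch, not the script above: the function name convert and the two-row scratch table are assumptions made for illustration, and the values are passed in with awk's -v option.

```shell
tmp=$(mktemp -d)
cat > "$tmp/table" <<'EOF'
kg lb 2.20
lb kg 0.4536
EOF

# convert AMOUNT FROM TO: print the converted amount, or exit with status 2
# if the table has no row for that pair of units.
convert() {
    awk -v amount="$1" -v from="$2" -v to="$3" '
        $1 == from && $2 == to { print amount * $3; found = 1 }
        END { if (!found) exit 2 }
    ' "$tmp/table"
}

convert 100 kg lb    # prints: 220
```

Comparing $1 and $2 as whole fields avoids the partial matches that a grep pattern anchored only at the start of the line can produce.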
Power filters
This classic example of what one might call filtered pipelines is from The Unix Programming Environment:
cat $* |
tr -sc A-Za-z '\012' |
sort |
uniq -c |
sort -n |
tail
Let's take it line by line. First, we concatenate the input using cat $*, because this command will be run from a shell script.
Next, we put each word on a separate line using tr: the -s squeezes, and the -c says to take the complement of the pattern given. Together, these switches strip out all characters that don't make up words and replace each with a newline. This puts each word on a separate line.
Then we sort the words, so that identical words end up on adjacent lines, and feed the result into uniq, which strips out duplicates, and, with the -c argument, prints a count of the number of times each word was found.
We then sort numerically (-n), giving us a list of words ordered by frequency.
Finally, we print the last 10 lines of the output. We now have a simple word frequency counter. For any text input, it will output a list of the 10 most frequently used words.
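To try the pipeline without creating a script file, it can be wrapped in a shell function. In this sketch the name wordfreq is invented, and "$@" stands in for $* so that filenames containing spaces survive; with no arguments, cat simply reads standard input.

```shell
wordfreq() {
    cat "$@" |
    tr -sc A-Za-z '\012' |
    sort |
    uniq -c |
    sort -n |
    tail
}

printf 'the cat and the dog and the bird\n' | wordfreq
```

The most frequent word comes last: for this input the final line of output is the count 3 followed by "the".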
Conclusion
The combination of filters and pipes is very powerful; you can use these tools to break down tasks and pick the best approach for each one. Many jobs that would have to be handled in a programming language in another computing environment can be done under Unix by stringing together a few simple filters on the command line. With filters and pipes, working with your Unix box is both easier and more productive.