From: www.itworld.com
March 26, 2001
Chief among the unsung heroes of Linux and Unix is m4. Unsung? Well, for
instance, although m4 has been a standard part of Unix since version 7,
no mention is made of it in that great O'Reilly & Associates book, Unix
Power Tools. What is it
about m4 that makes it so useful, and yet so overlooked? m4 -- a macro
processor -- unfortunately has a dry name that disguises a great utility. A
macro processor is basically a program that scans text and looks for
defined symbols, which it replaces with other text or other symbols.
Thus, m4 is a powerful general-purpose utility that can be used to
automate many tasks people often end up doing in sed,
awk, perl, and
even their favorite text editor. Even so, a macro processor may not seem
like that big a deal.
Unix developers already have a built-in macro processor, in the form
of the C preprocessor, in their compiler. Perhaps this is
what accounts for m4's relative neglect. Whatever the case may be, this
article
will show Linux users the power and usefulness of this software tool.
What is m4?
What is macro processing, and what is it good for? In their seminal work,
Software Tools, Kernighan and Plauger have a succinct
definition:
"Macros are used to extend some underlying language -- to perform a
translation from one language to another."
Thus, symbolic constants may be defined so that subsequent occurrences
of the name can be replaced by the defining string of characters,
regardless of the contents of the definition or its context. Such a
definition is called a macro, the replacement process is called
macro expansion, and the program for the process is called a macro
processor. The task performed by any macro processor is
the replacement of text by other text. A macro is defined either by
the m4 program (a built-in) or by the user. In addition to doing macro
expansion, m4 -- with functions that include other files, perform integer
arithmetic, manipulate text, and so forth -- is a perfect example of the
power
of the Unix filter concept.
The contemporary implementation of m4 on a Linux system is GNU
m4, which follows System V Release 3 m4, with extensions. I am
aware of no other version of m4 that has been ported to Linux.
m4 implementations on BSD may differ slightly. However, m4 is m4, and this
article should be useful for other Unix users, too. The latest version is
1.4, which was released in October 1994.
The scanning process
As m4 reads its input, it separates it into tokens. A token is
either a name, a quoted string, or any single character
that is not a part of either a name or a string. Each name is then
checked against the table of defined macros. This scanning process is recursive:
when a macro is recognized, it is replaced by its expansion, and the expansion
is itself scanned again, so scanning continues until no more macros are recognized. The transformed
input is written to the output. Macros can be built in or user-defined. A
list of built-in macros follows later in the article.
Defining macros
The most important of the built-in macros is define(), which
allows users to define their own macros. For example,
define(author, Paul Dunne)
defines a macro "author" -- any occurrence of which will
expand to the string "Paul Dunne". m4 expands macro names into
their defining text as soon as it possibly can.
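To see that definition at work, here is a minimal sketch. It assumes GNU m4 is installed and on your path; the shell function name expand_author is only there for illustration.

```shell
# Feed a definition and one line of text to m4 on standard input.
expand_author() {
  m4 <<'EOF'
define(author, Paul Dunne)dnl
This article was written by author.
EOF
}

expand_author
# prints: This article was written by Paul Dunne.
```

The dnl after the definition swallows the rest of that line, so the definition itself leaves no blank line in the output.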
Quoting
The m4 quote characters are ` and '. For example,
`this is quoted'. It is often best to quote both macro name and
substitution text in a definition. This avoids any unwanted side effects,
such as an early expansion of another macro name. m4 uses commas as argument
separators; therefore, any definition that includes commas must be quoted.
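A short sketch of both points, again assuming GNU m4 on your path. The macro names (greeting, site, early, late) and the shell function name are made up for the example.

```shell
# Quoting protects commas and delays expansion of names in the body.
quoting_demo() {
  m4 <<'EOF'
define(`greeting', `Hello, world')dnl
define(`site', `old.example')dnl
define(`early', site)dnl
define(`late', `site')dnl
define(`site', `new.example')dnl
greeting
early late
EOF
}

quoting_demo
# prints:
# Hello, world
# old.example new.example
```

The unquoted body of early is expanded at definition time, so it keeps the old value. The quoted body of late is expanded only when late is used, so it picks up the later redefinition of site.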
Arguments
Built-in functions
m4 provides a small set of useful built-in functions. We may group
them under the following headings:
Flow control functions
m4 provides the classic "if-then" programming construct, in two
related forms.
ifdef(a,b)
expands to b if a is defined (an optional third argument is used instead when a is not defined), and
ifelse(a,b,c,d)
compares the strings a and b. If they match, string c is returned
as the function value; if not, string d. Actually, ifelse is
not
limited to four arguments; it can take any greater number, and it thus
provides a limited multiway decision-making capability. For example,
ifelse(a,b,c,d,e,f,g)
means that if a matches b, then c; else if d matches e, then f; else g.
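The two forms can be sketched as follows, assuming GNU m4. The macro grade and the flag names _linux and _index are illustrative choices, as is the shell function name.

```shell
# ifdef tests whether a name is defined; ifelse compares strings.
flow_demo() {
  m4 <<'EOF'
define(`_linux')dnl
ifdef(`_linux', `plain text', `a link')
ifdef(`_index', `plain text', `a link')
define(`grade', `ifelse($1, `a', `excellent', $1, `b', `good', `unknown')')dnl
grade(`a') grade(`b') grade(`z')
EOF
}

flow_demo
# prints:
# plain text
# a link
# excellent good unknown
```

The seven-argument ifelse in grade is the multiway form: compare with a, else compare with b, else fall through to the last argument.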
Arithmetic functions
There are three arithmetic built-ins.
incr, which increments its numeric argument by one.
decr, which decrements its numeric argument by one.
eval, which performs arbitrary integer arithmetic.
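Its operators are:
unary + and -        unary plus and minus
** or ^              exponentiation
* / %                multiplication, division, modulus
+ -                  addition, subtraction
== != < <= > >=      equal, not equal, less than, less than or equal to, greater than, greater than or equal to
!                    not
& or &&              logical and
| or ||              logical or
Note that GNU m4 treats ^, & and | as bitwise operators (exclusive or, and, or); there, use ** for exponentiation and && and || for the logical forms.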
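A few of these in action, as a small sketch that assumes GNU m4 (the shell function name is arbitrary). The eval expressions are quoted so that parentheses and operators reach eval untouched.

```shell
# incr, decr and eval all work on integers.
arith_demo() {
  m4 <<'EOF'
incr(41)
decr(10)
eval(`2 ** 10')
eval(`(7 + 3) * 4')
eval(`3 < 5 && 5 < 7')
EOF
}

arith_demo
# prints:
# 42
# 9
# 1024
# 40
# 1
```

Comparisons and logical operators yield 1 for true and 0 for false, which is what makes eval usable inside ifelse.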
String functions
There are three functions for simple operations on strings of characters.
len(a)
returns the length of the string "a".
substr(s, m, n)
returns a substring from the string "s", starting at position m
(counting from zero), and continuing for n characters.
As a more complicated example than those we've had so far, consider
this combination of ifelse, eval, and
substr.
define(len,`ifelse($1,,0,`eval(1+len(substr($1,1)))')')
Well now, what does this do? It is an implementation of the m4
built-in len in terms of other m4 built-ins! Note the two
layers
of quotes. The outer layer prevents all initial evaluation. We want
len defined as exactly what's in the second argument. The inner
layer protects the eval built-in from being evaluated while the
arguments for the ifelse are collected.
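Here is a runnable variant of the same idea, offered as a sketch that assumes GNU m4. It uses the name my_len (my own choice, so the real len built-in stays available for comparison) and quotes $1. Because substr counts positions from zero, an offset of 1 is what drops the first character.

```shell
# A recursive string length built from ifelse, eval and substr.
len_demo() {
  m4 <<'EOF'
define(`my_len', `ifelse(`$1', `', `0', `eval(1 + my_len(substr(`$1', 1)))')')dnl
substr(`macro processor', 6, 4)
my_len(`hello') len(`hello')
EOF
}

len_demo
# prints:
# proc
# 5 5
```

This toy version is only safe for plain words: the substring handed back by substr is rescanned, so text containing commas or macro names would need more careful quoting.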
translit(s, f, t)
returns the string "s" with all occurrences of the characters
listed in "f" replaced by those listed in "t". It functions
as a simpler version of the Linux command `tr'. For example,
translit(s,abcdefghijklmnopqrstuvwxyz,nopqrstuvwxyzabcdefghijklm)
is the well-known rot-13, or Caesar, cipher.
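Wrapped in a macro, the rot-13 translation can be tried directly. This is a sketch assuming GNU m4; the macro name rot13 and the shell function name are illustrative.

```shell
# rot-13 via translit; applying it twice returns the original text.
rot13_demo() {
  m4 <<'EOF'
define(`rot13', `translit(`$1', `abcdefghijklmnopqrstuvwxyz', `nopqrstuvwxyzabcdefghijklm')')dnl
rot13(`hello')
rot13(rot13(`hello'))
EOF
}

rot13_demo
# prints:
# uryyb
# hello
```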
File functions
File functions, as the name suggests, are used for working with files.
include(filename)
includes the contents of "filename" at the point in the input
stream at which it occurs. This is useful if we have a central collection
of standard m4 macros, which we can then use in another file by simply
creating an appropriate include macro.
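A minimal sketch of that arrangement, assuming GNU m4 and a mktemp that supports -d. The file name macros.m4 and the temporary directory are invented for the example; -I adds the directory to the include search path.

```shell
# Keep shared definitions in one file and pull them in with include.
include_demo() {
  dir=$(mktemp -d)
  cat > "$dir/macros.m4" <<'EOF'
define(`author', `Paul Dunne')dnl
EOF
  m4 -I "$dir" <<'EOF'
include(`macros.m4')dnl
Page maintained by author.
EOF
  rm -rf "$dir"
}

include_demo
# prints: Page maintained by Paul Dunne.
```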
divert(n)
This is used to divert the output to an internal
file with the given number. File number -1 is equivalent to discarding the text,
file number 0 is the normal output stream, and the other, positive-numbered files are usually
used for temporary storage.
divert(-1) is most commonly used to get rid of the extraneous
white space that is often generated by m4. For example,
divert(-1)
...
definitions
...
divert
ensures that no output is performed while the various definitions
between the ellipses are performed (the ellipses are not part of
m4 syntax). Otherwise, we would end up with a pack of newlines in our output.
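The following sketch, which assumes GNU m4, shows both uses: definitions made silently under divert(-1), and a numbered diversion held back until the end of the input. The macro names and the text are invented for illustration.

```shell
# Diversion -1 discards output; diversion 1 is flushed at end of input.
divert_demo() {
  m4 <<'EOF'
divert(-1)
define(`site', `example.org')
define(`owner', `webmaster')
divert(1)dnl
Footer: maintained by owner.
divert(0)dnl
Welcome to site.
EOF
}

divert_demo
# prints:
# Welcome to example.org.
# Footer: maintained by webmaster.
```

The footer line appears in the input before the welcome line, but because it was sent to diversion 1 it is written only when m4 reaches the end of its input.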
dnl
It's hard to categorize this one, so I've put it here. dnl is "delete
to newline." It was used as a comment character in the original m4. As the
name suggests, all characters up to and including the next newline are deleted
and never reach the output stream. GNU m4 also allows use of # as a
comment
character, with the difference that such comments are passed to the
output stream exactly as they are: any macro calls or definitions between
the # and the end of the line are not expanded.
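The difference between the two is easy to see in a small sketch (GNU m4 assumed; the macro name tool is made up):

```shell
# dnl text vanishes; a # comment is copied through without expansion.
comment_demo() {
  m4 <<'EOF'
define(`tool', `awk')dnl this text and the newline are deleted
# tool is not expanded inside a comment
tool is expanded here
EOF
}

comment_demo
# prints:
# # tool is not expanded inside a comment
# awk is expanded here
```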
System functions
There is one system function -- that is, one that communicates with the
underlying operating system.
esyscmd
passes a command to the system interpreter, usually the Unix shell,
for execution. For example, esyscmd(date) returns today's date.
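Since the output of date changes from run to run, this sketch uses printf instead so the result is predictable. It assumes GNU m4 and a POSIX shell for m4 to call.

```shell
# esyscmd runs a shell command and expands to whatever it prints.
esyscmd_demo() {
  m4 <<'EOF'
Shell says: esyscmd(`printf hello')
EOF
}

esyscmd_demo
# prints: Shell says: hello
```

Commands such as date end their output with a newline, which becomes part of the expansion; follow the call with dnl if the extra line break is unwanted.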
There are also some miscellaneous functions that have been added to
the original m4 function set:
changecom
changes the m4 comment character (normally #).
traceon/off
turns tracing on and off. This is useful for debugging.
Usage
m4 is invoked the normal way, by simply typing m4. It works as
a classic Unix filter, reading from standard input if no filename
is given on the command line and writing to standard output.
Both input and output may be redirected in the shell or by commands
in the input file.
A full summary of m4 usage, available by typing m4 --help, provides:
Usage: m4 [OPTION]... [FILE]...
Mandatory or optional arguments attached to long options are mandatory and
optional for short options, too.
Operation modes:
      --help                   display this help and exit
      --version                output version information and exit
  -e, --interactive            unbuffer output, ignore interrupts
  -E, --fatal-warnings         stop execution after first warning
  -Q, --quiet, --silent        suppress some warnings for built-ins
  -P, --prefix-builtins        force a `m4_' prefix to all built-ins
Preprocessor features:
  -I, --include=DIRECTORY      search this directory second for includes
  -D, --define=NAME[=VALUE]    enter NAME as having VALUE, or empty
  -U, --undefine=NAME          delete built-in NAME
  -s, --synclines              generate `#line NO "FILE"' lines
Limits control:
  -G, --traditional            suppress all GNU extensions
  -H, --hashsize=PRIME         set symbol lookup hash table size
  -L, --nesting-limit=NUMBER   change artificial nesting limit
Frozen state files:
  -F, --freeze-state=FILE      produce a frozen state on FILE at end
  -R, --reload-state=FILE      reload a frozen state from FILE at start
Debugging:
  -d, --debug=[FLAGS]          set debug level (no FLAGS implies `aeq')
  -t, --trace=NAME             trace NAME when it will be defined
  -l, --arglength=NUM          restrict macro tracing size
  -o, --error-output=FILE      redirect debug and trace output
FLAGS is any of:
  t   trace for all macro calls, not only traceon'ed
  a   show actual arguments
  e   show expansion
  q   quote values as necessary, with a or e flag
  c   show before collect, after collect, and after call
  x   add a unique macro call ID, useful with c flag
  f   say current input file name
  l   say current input line number
  p   show results of path searches
  i   show changes in input files
  V   shorthand for all of the above flags
If no FILE or if FILE is `-', standard input is read.
This is a formidable list of options. But we need use only a few.
In fact, most often m4 is run as just m4, with perhaps the -P flag
to specify that built-ins are preceded by m4_, e.g., m4_include
rather than include. Below is an example of a line I use in a makefile to
generate my html pages:
cat $*.m4 | htmlize | m4 -P > $*.html
m4 at work
So, we've had an overview of m4. Now, let's take a look at how it can be used
to do useful work.
Example: Generating HTML
I use m4, among other Linux software tools, to maintain my Web pages.
Rather than marking each page up in HTML -- a tiresome chore -- I have
written a set of definitions that translates m4 macros into HTML.
As well as being easier on the eye and simpler to write than HTML,
this has other advantages. For example, an often seen feature on
Websites is the navigational button bar, which has links to the
main parts of a site. Obviously, it is nicer not to have a link
from the button bar to our Linux page if that's where we already are,
for example. This can be automated using m4, so that the correct HTML
code is generated. The definition I use is as follows:
m4_define(
`_button_bar',
`<HR>
<P ALIGN="center">
m4_ifdef(`_index',[Home],_link(index.html, [Home]))
m4_ifdef(`_linux',[Linux],_link(linux.html, [Linux]))
m4_ifdef(`_writing',[Writing],_link(writing.html, [Writing]))
m4_ifdef(`_bookshop',[Bookshop],_link(bookshop/index.html, [Bookstore]))
</P>
<HR>'
)
Then, in the source file for linux.html, the macro _linux is defined, and so
when _button_bar is referenced later in that file, the button bar
code generated has no link to the Linux page; the Linux entry appears as
plain text instead.
Similarly, you can define your email address in the master file. Then,
if you should change your email address, there is no need to do a global search-and-replace through all the files that constitute the site. A simple make
updates everything -- but that's the subject of another article.
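A cut-down, self-contained sketch of the same technique follows. It assumes GNU m4 run with -P. The _link and _button definitions are hypothetical stand-ins, since the site's actual macros are not reproduced here.

```shell
# Emit a link for every page except the one currently being built.
button_demo() {
  m4 -P <<'EOF'
m4_define(`_link', `<A HREF="$1">$2</A>')m4_dnl
m4_define(`_button', `m4_ifdef(`$1', `$3', `_link(`$2', `$3')')')m4_dnl
m4_define(`_linux')m4_dnl
_button(`_index', `index.html', `Home')
_button(`_linux', `linux.html', `Linux')
EOF
}

button_demo
# prints:
# <A HREF="index.html">Home</A>
# Linux
```

Only _linux is defined, as it would be in the source for the Linux page, so Home comes out as a link while Linux comes out as plain text.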
Example: A Linux key-map
Maintenance of Linux keymap files is an interesting and imaginative use of
m4.
I don't do this myself, since hacking an existing file is simplest
for me. We don't have the space to examine the file in any depth here. If you
take a look at /usr/lib/kbd/keymaps/i386/qwerty/hypermap.m4 on
your Linux system, you will see how using m4 makes defining a complicated
keymap quite a bit simpler and makes it easier to maintain.
Example: Sendmail config
m4's best-known application helps to demystify sendmail configuration
files. The sendmail source distribution comes with m4 macros that are
sufficient to generate a sendmail.cf for almost any site. At most, a
little tweaking of the resulting sendmail.cf file (whose syntax has
been memorably and justly compared to line noise) may be required.
For anyone who has tried to write a sendmail config file from scratch --
in the days before the m4 macros -- this is a godsend.
Differences between m4 versions
Inevitably, there are different versions of m4. This is not an
issue for the Linux user, as you will invariably be using GNU m4.
The main difference is that System V m4 supports multiple arguments for
`defn'. Since the usefulness of this is unclear to GNU m4's
maintainer (and indeed to me), this feature is not in GNU m4.
There are several other incompatibilities (which shouldn't surprise
anyone who's tried to use GNU make and then BSD's pmake, or vice
versa). None are too important, but those interested can read the relevant
info page (alas, no man page has been provided). As this article is about
m4 rather than GNU m4, I won't mention the various extensions
implemented in the GNU version -- those curious can see the list in the info
page.
Things to watch
Quoting can be cantankerous on occasion. Quoting problems can
usually be solved by changequote. For example, to
include one of the quote characters in a macro definition, using
changequote([[,]])
and then
define([[sample]],[[`a quoted macro']])
will keep the quote characters in the definition of the macro (here given the placeholder name sample). Note that
` and ' can't be escaped, so we have to do it this
way.
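As a runnable sketch of this trick (GNU m4 assumed; the macro name sample is a placeholder of my own):

```shell
# Switch the quote strings to [[ and ]] so ` and ' become ordinary text.
changequote_demo() {
  m4 <<'EOF'
changequote([[,]])dnl
define([[sample]], [[`a quoted macro']])dnl
sample
EOF
}

changequote_demo
# prints: `a quoted macro'
```

The macro name must not also appear in its own body, or the rescan of the expansion would call the macro again and never finish.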
Another thing to watch out for is that if you have the name of an m4 built-in in
your text, m4 will interpret such names as calls to that function, which is
presumably not what you want.
This can be avoided by quoting, but that is inconvenient. GNU m4
offers us a better way. The -P command-line switch requires
all built-ins to be prefaced with the string m4_, much as the C preprocessor
marks its directives with the # character.
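A sketch of the effect, assuming GNU m4. The sentence is contrived so that it contains two built-in names, divert and include, as ordinary words.

```shell
# With -P only m4_-prefixed names are built-ins, so plain words are safe.
prefix_demo() {
  m4 -P <<'EOF'
m4_define(`site', `example.org')m4_dnl
Please divert all mail for site; do not include attachments.
EOF
}

prefix_demo
# prints: Please divert all mail for example.org; do not include attachments.
```

Without -P, GNU m4 would treat the bare word divert as a call to the built-in and drop it from the output. User-defined macros such as site are unaffected by the switch.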
Limitations
Sadly, there is no man page. However, there is an info page.
m4 is a useful tool, but it can be overstrained. Although it can be
made to do most things with ingenuity, m4 is at its best when used
for straightforward text substitution, as with our HTML example.
In Software Tools, Kernighan and Plauger sum it up nicely:
"The main thing is to ensure that any operation -- macro call,
definition, other built-in -- can occur in the middle of any other one.
If this is possible, then in principle the macro processor is capable
of doing any computation, although it may well be hard to express.... In principle, macro [i.e. m4] is capable of performing any computing
task, but it is all too easy to write incomprehensible macros."
This article has been an introduction to an often overlooked Linux
program. Hopefully, you'll now be able to go off and do some m4ing
yourselves.
LinuxWorld.com