A regular expression (regex) describes a pattern of text, rather than
merely a literal substring, that we can use for matching, extracting, or
replacing text. We build such patterns from the features of the regex
language: mostly literal characters (alphanumerics and a few others)
that stand for themselves, plus several special characters or character
sequences that carry particular meanings within a regex pattern.
In this first part of the tutorial we will outline the five basic
concepts needed to understand regular expressions:
1) Concatenation: We can create larger, more complex patterns simply
by writing simpler patterns one after another; no operator is
needed. For example, m/f/ is a pattern that matches the
character 'f', while m/o/ is a pattern that matches the
character 'o'. Combining them as m/foo/ gives a pattern that
matches the character sequence 'foo'.
2) Alternation: The '|' character is a meta-character inside a
regular expression. It acts as an operator allowing us to
specify two or more alternative subpatterns. For example, the
pattern m/ab|cd/ will match either 'ab' or 'cd'.
3) Iteration: The '*' meta-character is an iterative operator (a
quantifier) meaning to match zero-or-more of the previous
element. For example, the pattern m/a*b/ would match 'ab',
'aab', 'aaab', and so on, or even just 'b' (i.e., zero-or-more
'a' characters followed by a 'b' character).
4) Grouping: Parentheses supply a way to create subexpressions
treated as a unit. If, for example, we want to match zero-or-
more occurrences of the substring 'foo', then we could specify
our pattern as: m/(foo)*/. Placing * outside of the parentheses
applies it to the whole parenthesized subexpression. Parentheses
also govern the scope of alternation: the pattern m/ab|cd/ means
match either 'ab' or 'cd', but the pattern m/a(b|c)d/ means
match an 'a', then either a 'b' or a 'c', and finally a 'd'.
5) Wildcard: The dot . is the wildcard character. It matches any
character other than a newline character (the /s modifier changes
this so that the dot matches a newline as well). Thus, the
pattern m/f.*bar/ will match an 'f' followed by zero-or-more
characters of any kind, followed by 'bar'.
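The five concepts above can be tried out directly. The following is a minimal sketch rather than a definitive test suite: the helper name matches and the sample strings are made up for illustration, and the patterns are the ones used in the list. Note that a pattern matches if it is found anywhere in the string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches the pattern text, else 0.
sub matches {
    my ($pattern, $string) = @_;
    return $string =~ /$pattern/ ? 1 : 0;
}

# Each entry: [ concept, pattern, sample string ] (samples are illustrative).
my @checks = (
    [ 'concatenation', 'foo',      'seafood' ],  # 'f', 'o', 'o' in sequence
    [ 'alternation',   'ab|cd',    'abacus'  ],  # contains 'ab'
    [ 'alternation',   'ab|cd',    'xyz'     ],  # neither 'ab' nor 'cd'
    [ 'iteration',     'a*b',      'b'       ],  # zero 'a' characters, then 'b'
    [ 'iteration',     'a*b',      'aaab'    ],  # three 'a' characters, then 'b'
    [ 'grouping',      'a(b|c)d',  'acd'     ],  # 'a', then 'c', then 'd'
    [ 'grouping',      'a(b|c)d',  'abcd'    ],  # 'b' AND 'c' between: fails
    [ 'wildcard',      'f.*bar',   'foo bar' ],  # anything between 'f' and 'bar'
);

for my $check (@checks) {
    my ($concept, $pattern, $string) = @$check;
    my $verdict = matches($pattern, $string) ? 'match' : 'no match';
    printf "%-13s m/%s/ on '%s': %s\n", $concept, $pattern, $string, $verdict;
}

# The dot does not cross a newline, so this reports 'no match'.
print "m/f.*bar/ across a newline: ",
    (matches('f.*bar', "f\nbar") ? 'match' : 'no match'), "\n";
```

Storing the pattern text in a variable and interpolating it into the match is only a convenience for looping here; writing each pattern out literally, as in the list, behaves the same way.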
Those are the primary concepts for regular expressions; although many
more meta-characters and concepts exist, many derive from these basics.
Let's consider a couple of simple examples.
If we want to read in a file line-by-line and print out only lines
containing a 'foo' followed somewhere on the same line by 'bar', we
could use this pattern:
    print if /foo.*bar/;
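To see that statement in context, here is a small sketch that applies the same pattern to a list of lines. The helper name foo_then_bar and the sample lines are assumptions made for illustration; an in-memory list stands in for the file so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that contain 'foo' followed
# somewhere later on the same line by 'bar'.
sub foo_then_bar {
    my @lines = @_;
    return grep { /foo.*bar/ } @lines;
}

# Sample data standing in for the lines of a file (made up for illustration).
my @sample = (
    "foo and then bar\n",
    "bar and then foo\n",
    "foobar\n",
    "nothing here\n",
);

# In a real script the same test usually sits in a read loop:
#   while (<>) { print if /foo.*bar/; }
print foo_then_bar(@sample);
```

Only the first and third sample lines are printed: in the second line 'bar' comes before 'foo', and the pattern requires the opposite order.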
However, if we want to print lines that match either 'foodbar'
or 'footbar', we could do this:
    print if /foo(d|t)bar/;
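The parentheses matter in that pattern. As a rough comparison, the sketch below checks a few words against the grouped pattern and against the same characters with the parentheses removed. The helper names grouped and ungrouped and the word list are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two patterns so they can be compared.
# With parentheses: 'foo', then 'd' or 't', then 'bar'.
sub grouped   { my ($s) = @_; return $s =~ /foo(d|t)bar/ ? 1 : 0; }
# Without parentheses the alternation splits the whole pattern:
# either 'food' or 'tbar'.
sub ungrouped { my ($s) = @_; return $s =~ /food|tbar/ ? 1 : 0; }

for my $word (qw(foodbar footbar foolbar food)) {
    printf "%-8s grouped: %d  ungrouped: %d\n",
        $word, grouped($word), ungrouped($word);
}
```

The word 'food' on its own satisfies the ungrouped pattern but not the grouped one, which is the scoping effect described under Grouping.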
We can write quite complicated regular expressions using just the above
concepts, but they would very quickly become unmanageable. For example,
if we wanted to print out lines containing two digits together, we could
write:
    print if /(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)/;
As you can see, this is possible, but it would not be a pleasant task
to match something like an 'f' followed by any digit followed by any
alphabetical character (regardless of case) using only alternation, as
in the above example.
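To get a feel for how long such a pattern becomes, this sketch builds it programmatically instead of typing it out. The variable names and the helper f_digit_letter are assumptions made for illustration; the pattern itself uses only concatenation, alternation, and grouping.

```perl
use strict;
use warnings;

# Build "(0|1|...|9)" and "(a|b|...|z|A|B|...|Z)" from alternation alone.
my $digit   = '(' . join('|', 0 .. 9) . ')';
my $letter  = '(' . join('|', 'a' .. 'z', 'A' .. 'Z') . ')';

# 'f', then any digit, then any letter of either case.
my $pattern = "f$digit$letter";

# Hypothetical helper: returns 1 if $string matches the built pattern, else 0.
sub f_digit_letter {
    my ($string) = @_;
    return $string =~ /$pattern/ ? 1 : 0;
}

print "pattern length: ", length($pattern), " characters\n";
for my $sample (qw(f7Q f7q f77 fxQ)) {
    printf "%-4s %s\n", $sample, f_digit_letter($sample) ? 'match' : 'no match';
}
```

The pattern works, but at well over a hundred characters for a three-character match it is hardly something we would want to write or read by hand.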
Next week we'll look at the character class and several shortcut
sequences to make such tasks a great deal simpler (not to mention a lot
shorter as well).