December 07, 2000, 12:00 AM
A regular expression (regex) describes a pattern of text, rather than
merely a literal substring, that we can use to match, extract, or
replace text. We build such patterns from the features of the regex
language: mostly literal characters (alphanumerics and a few others)
that stand for themselves, plus several special characters or
character sequences that carry particular meanings within a regex
pattern.
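To make the three uses concrete, here is a minimal Perl sketch of
matching, extracting, and replacing with one small pattern. The sample
text and the variable names are invented for illustration, and the
pattern uses the dot wildcard, which is described later in this part.

```perl
use strict;
use warnings;

# Hypothetical sample text, made up for this illustration.
my $text = "the cat sat";

# Matching: does the pattern occur anywhere in the text?
my $matched = ($text =~ m/c.t/) ? "yes" : "no";

# Extracting: parentheses capture the matched text, and a match in
# list context returns the captured pieces.
my ($word) = $text =~ m/(c.t)/;

# Replacing: s/// substitutes the first match with new text.
# Working on a copy leaves $text itself unchanged.
(my $changed = $text) =~ s/c.t/dog/;

print "matched:   $matched\n";
print "extracted: $word\n";
print "replaced:  $changed\n";
```

Here the pattern finds 'cat', so the extracted word is 'cat' and the
replaced text reads 'the dog sat'.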
In this first part of the tutorial we will outline the five basic
concepts needed to understand regular expressions:
1) Concatenation: An implicit operation, meaning simply that we can
create larger, more complex patterns by writing simpler patterns
one after another. For example, m/f/ is a pattern that matches the
character 'f', while m/o/ is a pattern that matches the
character 'o'. Combining these into m/foo/ gives a pattern that
matches the character sequence 'foo'.
2) Alternation: The '|' character is a meta-character inside a
regular expression. It acts as an operator allowing us to
specify two or more alternative subpatterns. For example, the
pattern m/ab|cd/ will match either 'ab' or 'cd'.
3) Iteration: The '*' meta-character is an iteration operator, or
quantifier, meaning match zero-or-more of the previous element.
For example, the pattern m/a*b/ would match 'ab', 'aab', 'aaab',
and so on, or even just 'b' (i.e., zero-or-more 'a' characters
followed by a 'b' character).
4) Grouping: Parentheses supply a way to create subexpressions
treated as a unit. If, for example, we want to match zero-or-
more occurrences of the substring 'foo', then we could specify
our pattern as: m/(foo)*/. Placing * outside of the parentheses
applies it to the whole parenthesized subexpression. Parentheses
also govern the scope of alternation: the pattern m/ab|cd/ means
match either 'ab' or 'cd', but the pattern m/a(b|c)d/ means
match an 'a', then either a 'b' or a 'c', and finally a 'd'.
5) Wildcard: The dot . is the wildcard character. It matches any
character other than a newline character (the /s modifier changes
this so that the dot matches a newline as well). Thus, the pattern
m/f.*bar/ will match an 'f', followed by zero-or-more characters
of any kind, followed by 'bar'.
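The five concepts can be tried out with a short Perl sketch. The
helper matched_text and the sample strings are hypothetical, chosen
only for this illustration; the patterns are the ones used in the
list above. The helper returns the text the pattern actually matched
(Perl keeps it in the special variable $&), which makes it easy to see
how much of each string a pattern consumes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text matched by $pattern in $string,
# or undef when the pattern does not match at all.
sub matched_text {
    my ($pattern, $string) = @_;
    return $string =~ $pattern ? $& : undef;
}

# Each case: [ concept, pattern, sample string ]
my @cases = (
    [ 'concatenation', qr/foo/,     'seafood'   ],  # m/foo/
    [ 'alternation',   qr/ab|cd/,   'xcdx'      ],  # m/ab|cd/
    [ 'iteration',     qr/a*b/,     'caaab'     ],  # m/a*b/
    [ 'iteration',     qr/a*b/,     'b'         ],  # zero 'a' characters
    [ 'grouping',      qr/(foo)*/,  'foofoobar' ],  # * applies to (foo)
    [ 'grouping',      qr/a(b|c)d/, 'acd'       ],  # m/a(b|c)d/
    [ 'grouping',      qr/a(b|c)d/, 'ab'        ],  # no 'd', so no match
    [ 'wildcard',      qr/f.*bar/,  'foo bar'   ],  # m/f.*bar/
    [ 'wildcard',      qr/f.*bar/,  "f\nbar"    ],  # dot stops at newline
    [ 'wildcard /s',   qr/f.*bar/s, "f\nbar"    ],  # /s lets dot match it
);

for my $case (@cases) {
    my ($concept, $pattern, $string) = @$case;
    my $match = matched_text($pattern, $string);
    my $shown = defined $match ? "matched '$match'" : 'no match';

    # Show newlines as \n so each result stays on one line.
    $shown =~ s/\n/\\n/g;
    (my $input = $string) =~ s/\n/\\n/g;

    print "$concept: '$input' -> $shown\n";
}
```

Note that these patterns are not anchored, so they succeed wherever
the pattern occurs inside the string: m/foo/ finds 'foo' inside
'seafood', and m/(foo)*/ would match the empty string at the start of
any text that does not begin with 'foo'.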
Those are the primary concepts of regular expressions. Many more
meta-characters and concepts exist, but a good number of them derive
from these basics.
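As an illustration of how other meta-characters can be seen as
shorthand for the basics, the sketch below compares three Perl
constructs that are not covered above (the + quantifier, the ?
quantifier, and a bracketed character class) with patterns spelled out
using only concatenation, alternation, iteration, and grouping. The
choice of these three constructs and the sample strings are
assumptions made for this example.

```perl
use strict;
use warnings;

# Each pair: [ shorthand pattern, same idea using only the basics ]
my @pairs = (
    [ qr/a+b/,     qr/aa*b/      ],  # +     : one-or-more
    [ qr/ab?c/,    qr/a(b|)c/    ],  # ?     : zero-or-one
    [ qr/x[abc]y/, qr/x(a|b|c)y/ ],  # [abc] : any one of a, b, c
);

# Hypothetical sample strings to try both versions against.
my @samples = ('b', 'ab', 'aaab', 'ac', 'abc', 'abbc',
               'xay', 'xcy', 'xdy', 'xy');

my $disagreements = 0;
for my $pair (@pairs) {
    my ($short, $long) = @$pair;
    for my $s (@samples) {
        # $& holds the text matched by the most recent successful match.
        my $m1 = $s =~ $short ? $& : undef;
        my $m2 = $s =~ $long  ? $& : undef;

        # The two patterns agree if both fail, or both match the same text.
        my $same = defined($m1)
                 ? (defined($m2) && $m1 eq $m2)
                 : !defined($m2);
        $disagreements++ unless $same;
    }
}

print "disagreements: $disagreements\n";
```

On these samples the shorthand and spelled-out patterns agree in every
case, which is only a spot check, not a proof of equivalence.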