Regular Expressions Tutorial -- Part 1: Back to Basics

A regular expression (regex) describes a pattern of text, rather than
merely a literal substring, that we can use for matching, extracting, or
replacing text. We build such patterns from the features of the regex
language: mostly literal characters (alphanumerics and a few others)
that stand for themselves, plus several special characters or character
sequences that carry particular meanings within a pattern.

In this first part of the tutorial we will outline the five basic
concepts needed to understand regular expressions:

1) Concatenation: An implicit operation that simply means we can
   create larger, more complex patterns by combining simpler
   patterns. For example, m/f/ is a pattern that matches the
   character 'f', while m/o/ is a pattern that matches the
   character 'o'. Combining these into m/foo/ gives a pattern that
   matches the character sequence 'foo'.

2) Alternation: The '|' character is a meta-character inside a
   regular expression. It acts as an operator allowing us to
   specify two or more alternative subpatterns. For example, the
   pattern m/ab|cd/ will match either 'ab' or 'cd'.

3) Iteration: The '*' meta-character is an iterative, or
   quantifying, operator meaning match zero-or-more of the
   previous element. For example, the pattern m/a*b/ would
   match 'ab', 'aab', 'aaab', and so on, or even just 'b' (i.e.,
   zero-or-more 'a' characters followed by a 'b' character).

4) Grouping: Parentheses supply a way to create subexpressions
   treated as a unit. If, for example, we want to match zero-or-more
   occurrences of the substring 'foo', then we could specify
   our pattern as: m/(foo)*/. Placing * outside of the parentheses
   applies it to the whole parenthesized subexpression. Parentheses
   also govern the scope of alternation: the pattern m/ab|cd/ means
   match either 'ab' or 'cd', but the pattern m/a(b|c)d/ means
   match an 'a', then either a 'b' or a 'c', and finally a 'd'.

5) Wildcard: The dot '.' is the wildcard character. It matches any
   character other than a newline (the /s modifier changes this so
   that the dot matches a newline as well). Thus, the pattern
   m/f.*bar/ will match an 'f' followed by zero-or-more characters
   of any kind, followed by 'bar'.
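The five concepts above can be tried out with a short script. This is only a sketch: the helper name matches() and the sample strings are made up for illustration and are not part of the tutorial's own examples.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, otherwise 0.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? 1 : 0;
}

# Each entry: [ pattern, sample text, concept being shown ]
my @checks = (
    [ qr/foo/,     'seafood',  'concatenation' ],
    [ qr/ab|cd/,   'acdc',     'alternation' ],
    [ qr/a*b/,     'aaab',     'iteration' ],
    [ qr/a*b/,     'b',        'iteration with zero a characters' ],
    [ qr/(foo)*/,  'bar',      'grouping: zero occurrences still match' ],
    [ qr/a(b|c)d/, 'acd',      'grouping scopes the alternation' ],
    [ qr/a(b|c)d/, 'ab',       'grouping: the trailing d is required' ],
    [ qr/f.*bar/,  'foo bar',  'wildcard' ],
);

for my $check (@checks) {
    my ($pattern, $text, $concept) = @$check;
    my $result = matches($pattern, $text) ? 'match' : 'no match';
    print "$concept: '$text' => $result\n";
}
```

Note that m/(foo)*/ matches 'bar' because zero occurrences of 'foo' is an acceptable match, whereas m/a(b|c)d/ does not match 'ab' because the final 'd' lies outside the parentheses and is always required.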

Those are the primary concepts for regular expressions. Many more
meta-characters and concepts exist, but many of them derive from these
basics.

Let's consider a couple of simple examples.

If we want to read in a file line-by-line and print out only lines
containing a 'foo' followed somewhere on the same line by 'bar', we
could use this pattern:

    # Read input line by line; print lines where 'foo' is later followed by 'bar'.
    while (<>) {
        print if /foo.*bar/;
    }

If instead we want to print only lines that contain either 'foodbar'
or 'footbar', we could do this:

    # Print lines containing 'foo', then a 'd' or a 't', then 'bar'.
    while (<>) {
        print if /foo(d|t)bar/;
    }
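To compare the two filters without preparing an input file, both patterns can be run over the same in-memory list. This is a minimal sketch: the sample lines are invented for illustration, and grep is used here in place of the while(<>) loop only so the script is self-contained.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines of a file.
my @lines = (
    "a foodbar in the pantry\n",
    "footbar is not a word\n",
    "foo and then bar\n",
    "bar comes before foo here\n",
    "foobar\n",
);

# Lines containing 'foo' followed somewhere later by 'bar'.
my @foo_then_bar = grep { /foo.*bar/ } @lines;

# Lines containing 'foodbar' or 'footbar' specifically.
my @food_or_foot = grep { /foo(d|t)bar/ } @lines;

print "foo.*bar selects:\n", @foo_then_bar;
print "foo(d|t)bar selects:\n", @food_or_foot;
```

The second pattern is the stricter of the two: every line it selects is also selected by the first, but not the other way around.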

We can write quite complicated regular expressions using just the above
concepts, but they would very quickly become unmanageable. For example,
if we wanted to print out lines containing two consecutive digits, we
could write:

    while (<>) {
        print if /(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)/;
    }

As you can see, while it is possible, it would not be a pleasant task
to match something like an 'f' followed by any digit followed by any
alphabetical character (regardless of case) using only alternation as
in the above example.
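To get a feel for how large that alternation-only pattern becomes, here is a rough sketch that assembles it rather than typing it out. Building the alternatives with join and interpolating the variables $digit and $letter into the pattern are conveniences assumed for this illustration; the resulting regex still uses nothing beyond concatenation, alternation, and grouping.

```perl
use strict;
use warnings;

# Join every digit, and every letter in both cases, with the '|' operator.
my $digit  = join '|', 0 .. 9;
my $letter = join '|', 'a' .. 'z', 'A' .. 'Z';

# An 'f', then any digit, then any letter; the parentheses scope each alternation.
my $pattern = qr/f($digit)($letter)/;

print "digit alternatives:  $digit\n";
print "letter alternatives: $letter\n";

# Hypothetical sample strings to test against the assembled pattern.
for my $text ('f7x', 'f7', 'fax', 'xf0Qz') {
    my $result = $text =~ $pattern ? 'match' : 'no match';
    print "$text: $result\n";
}
```

Printing $letter shows all 52 alternatives that would otherwise have to be written by hand, which is the unpleasantness described above.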

Next week we'll look at the character class and several shortcut
sequences that make such tasks a great deal simpler (not to mention a
lot shorter).

Next Week: Character Classes
