Regular Expressions Tutorial, Part 3: Anchoring Matches

In the last couple of weeks we have covered what I refer to as the five
basic concepts (concatenation, alternation, quantification, grouping,
and wildcards) and we have introduced character classes. This week, I
introduce a new kind of regex element: an anchor.

An anchor specifies a position in the target string that has particular
properties. The caret '^' and the dollar '$' symbols represent the two
main anchors, which refer to the start and end of the target string
respectively. Thus the pattern /foo/ will match if the target string
contains 'foo' anywhere, but the pattern /^foo/ will only match if
'foo' occurs at the very beginning of the string. The pattern reads:
match the start of the string followed by an 'f' followed by an 'o'
followed by an 'o'. Similarly, the pattern /foo$/ will match if the
target ends with 'foo' (a single trailing newline is allowed after
it). The precise meaning of these two anchors can be changed with the
/m modifier, which causes them to match at the beginning and end of
each line within the string rather than just at the beginning and end
of the entire string. The \A and \Z anchors are similar to ^ and $
respectively, but they are unaffected by /m: they match only at the
beginning and end of the string, never at internal line boundaries.
Like $, \Z also matches just before a newline that ends the string;
the related \z anchor matches only at the very end.
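
To make the differences concrete, here is a small sketch that counts
how often each anchored pattern matches in one multi-line string. The
helper name count_matches and the sample text are invented for this
illustration; only the anchors and the /m modifier come from the
discussion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how many times a compiled pattern matches in a string.
sub count_matches {
    my ($re, $text) = @_;
    my $n = () = $text =~ /$re/g;   # list-context match, then count it
    return $n;
}

# Three lines, each ending in a newline (sample data for this sketch).
my $text = "foo one\ntwo foo\nfoo three foo\n";

my @cases = (
    [ '^foo',     qr/^foo/    ],   # start of string only
    [ 'foo$',     qr/foo$/    ],   # end of string, before the final newline
    [ '^foo /m',  qr/^foo/m   ],   # start of every line
    [ 'foo$ /m',  qr/foo$/m   ],   # end of every line
    [ '\Afoo /m', qr/\Afoo/m  ],   # \A ignores /m
    [ 'foo\Z /m', qr/foo\Z/m  ],   # \Z ignores /m
    [ 'foo\z',    qr/foo\z/   ],   # \z: the string ends in "\n", not 'foo'
);

for my $case (@cases) {
    my ($label, $re) = @$case;
    printf "%-10s %d\n", $label, count_matches($re, $text);
}
```

With this sample text the counts come out as 1, 1, 2, 2, 1, 1 and 0:
only the /m versions of ^ and $ see the internal line boundaries.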

If we wanted a script to count the lines of code (LOC) in a Perl
program and ignore comment lines, blank lines, and lines after the
__END__ or __DATA__ tags, we could write it as:

    #!/usr/bin/perl -w
    use strict;

    my $count = 0;
    while (<>) {
        next if /^\s*#/;                # ignore comment lines
        next if /^\s*$/;                # ignore blank lines
        last if /^(__END__|__DATA__)/;  # stop
        $count++;
    }
    print "There are $count lines of code\n";

It isn't perfect (there could be POD markup anywhere within the program
and not just after the __END__ or __DATA__ tokens), but it provides a
reasonable measure. The first regex /^\s*#/ matches any line starting
with optional whitespace and a # character (a line containing only a
comment); the second regex /^\s*$/ matches a line containing only
optional whitespace (blank lines); the final regex matches lines
beginning with one of the two program-ending tokens. If you are putting
your subroutines after such a token for autoloading purposes, then you
would want to leave that last test out of your LOC counting program,
so that the code after the token is still counted.

Another anchor is the \b metacharacter, which matches what is often
called a word boundary: a position in the target string between a \w
character and a \W character (in either order), or between the start
or end of the string and a \w character. The \G anchor matches the
point where the previous m//g match left off (that is, at the current
pos() for the target string).
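
A short sketch may help here. The two helper subroutines below
(has_word and leading_numbers) are hypothetical names made up for
this example; the first uses \b to match a whole word, and the second
uses \G with m//gc so that each match must begin exactly where the
previous one ended.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \b: true only if $word appears as a whole word in $text.
# \Q...\E quotes the word so it is matched literally.
sub has_word {
    my ($word, $text) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# \G: collect the whitespace-separated numbers at the front of $text.
# Each attempt is anchored at pos($text), so the scan cannot skip
# ahead over non-matching text; /c keeps pos() on failure.
sub leading_numbers {
    my ($text) = @_;
    my @nums;
    while ($text =~ /\G\s*(\d+)/gc) {
        push @nums, $1;
    }
    return @nums;
}

print has_word('cat', 'the cat sat'), "\n";                 # prints 1
print has_word('cat', 'concatenate'), "\n";                 # prints 0
print join(',', leading_numbers('10 20 30 x 40')), "\n";    # prints 10,20,30
```

Without the \G, the loop in leading_numbers would happily skip over
the 'x' and pick up the 40 as well.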

Anchors are also referred to as zero-width assertions because they
match a position in the string and do not consume any characters in the
string. Thus, other zero-width constructs, such as positive and
negative look-ahead assertions, can also be thought of as anchors. A
positive look-ahead is written as (?=some pattern) and matches the
current position in the string (without consuming anything) only
if 'some pattern' could match at this point. A negative look-ahead is
written as (?!some pattern) and matches the current position only if
'some pattern' fails to match at this point in the target string.
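
As a minimal illustration, the sketch below tests 'foo' followed (or
not followed) by 'bar'. The subroutine names are invented for this
example. The last helper returns the text the match actually consumed,
which shows that the look-ahead itself takes up no characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if some 'foo' is immediately followed by 'bar'.
sub foo_before_bar     { return $_[0] =~ /foo(?=bar)/ ? 1 : 0 }

# True if some 'foo' is NOT immediately followed by 'bar'.
sub foo_not_before_bar { return $_[0] =~ /foo(?!bar)/ ? 1 : 0 }

# Text consumed by /foo(?=bar)/, or undef if there is no match.
sub consumed_text {
    my ($s) = @_;
    return $s =~ /(foo(?=bar))/ ? $1 : undef;
}

for my $s ('foobar', 'foobaz') {
    printf "%-7s (?=bar): %d  (?!bar): %d\n",
        $s, foo_before_bar($s), foo_not_before_bar($s);
}

# The 'bar' had to be present, but it is not part of the match.
print "consumed: '", consumed_text('foobar'), "'\n";   # consumed: 'foo'
```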

Consider a case where we want to print any line containing all search
terms in any order. If you know how many terms you'll have in advance
(say 3), you could do something along the lines of:

    print if /foo/ && /bar/ && /baz/;

However, the following construct of multiple look-ahead assertions can
be useful in other cases:

    print if /^(?=.*foo)(?=.*bar)(?=.*baz)/;

This works because none of the look-ahead assertions consume any of the
target string, so that each assertion is tested from the beginning of
the string in turn.
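
The following sketch checks that claim against a few sample strings
(the strings and subroutine names are made up for illustration). It
runs the single look-ahead pattern next to the three-match version so
the two can be compared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One regex: three look-aheads, all tested from the start of the string.
sub has_all_lookahead {
    return $_[0] =~ /^(?=.*foo)(?=.*bar)(?=.*baz)/ ? 1 : 0;
}

# Three separate matches joined with &&, for comparison.
sub has_all_and {
    my ($s) = @_;
    return ($s =~ /foo/ && $s =~ /bar/ && $s =~ /baz/) ? 1 : 0;
}

for my $s ('foo bar baz', 'baz bar foo', 'xbazxfooxbarx', 'foo and bar only') {
    printf "%-18s lookahead: %d  &&: %d\n",
        $s, has_all_lookahead($s), has_all_and($s);
}
```

The first three strings match both ways regardless of the order of the
terms; the last one fails both because 'baz' is missing.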

Consider a program that accepts as its first argument a string of
space-separated search terms. You do not know how many terms you'll
get, but you want to print any line containing all of them in any
order:

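    #!/usr/bin/perl -w
    use strict;

    my $search = shift @ARGV;
    defined $search or die "Usage: $0 'term1 term2 ...' [files]\n";

    # Build one look-ahead per term. The terms are interpolated as regex
    # fragments, so metacharacters in them keep their special meaning.
    my $pattern = join('', map { "(?=.*$_)" } split " ", $search);

    while (<>) {
        print if /^$pattern/o;  # /o: compile the pattern only once
    }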

Here we have constructed the multiple look-ahead pattern by splitting
the first argument into component search terms and wrapping each inside
of a look-ahead assertion.
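
If the search terms are meant to be plain text rather than regex
fragments, one possible variation is to quote each term before
wrapping it. The sketch below is only that, a variation under the
assumption that literal matching is wanted; the subroutine name
build_pattern is hypothetical. It uses quotemeta to escape each term
and qr// to compile the pattern once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn "term1 term2 ..." into a compiled pattern that matches any
# string containing every term, in any order, as literal text.
sub build_pattern {
    my ($search) = @_;
    my $lookaheads = join '',
        map { '(?=.*' . quotemeta($_) . ')' }   # escape metacharacters
        split ' ', $search;
    return qr/^$lookaheads/;
}

my $re = build_pattern('cat dog');
for my $line ('dog chases cat', 'cat only') {
    printf "%-16s %s\n", $line, ($line =~ $re ? 'match' : 'no match');
}

# With quoting, the '.' in 'a.c' matches only a literal dot.
my $dot = build_pattern('a.c');
print 'abc' =~ $dot ? "abc matches\n" : "abc does not match\n";
print 'a.c' =~ $dot ? "a.c matches\n" : "a.c does not match\n";
```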

Next Week: More Quantifiers
