Regular Expressions Tutorial, Part 3: Anchoring Matches

By Andrew Johnson, ITworld |  How-to

In the last couple of weeks we have covered what I refer to as the five
basic concepts (concatenation, alternation, quantification, grouping,
and wildcards) and we have introduced character classes. This week, I
introduce a new kind of regex element: an anchor.

An anchor specifies a position in the target string that has particular
properties. The caret '^' and the dollar '$' symbols represent the two
main anchors, which refer to the start and end of the target string
respectively. Thus the pattern /foo/ will match if the target string
contains 'foo' anywhere, but the pattern /^foo/ will only match if the
target string contains 'foo' at the beginning of the string. The
pattern reads: match the start of the string followed by an 'f'
followed by an 'o' followed by an 'o'. Similarly, the pattern /foo$/
will match if the target contains 'foo' at the end of the string (or
just before a newline that ends the string). The precise meaning of
these two anchors can be changed with the /m modifier, which causes
them to match at the beginning and end of each line within the string
rather than just at the beginning and end of the entire string. The \A
and \Z anchors are similar to ^ and $ respectively, but they are
unaffected by /m: they match only at the beginning and end of the
string and never at internal line boundaries. Like $, \Z still allows a
single trailing newline; \z matches only at the very end of the string.
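
As a rough illustration of the differences described above, the
following sketch runs the anchors against a small two-line string. The
sample text and variable names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two lines, ending with a newline.
my $text = "foo bar\nfoo baz\n";

# Without /m, ^ matches only at the start of the whole string.
my @plain = $text =~ /^foo/g;
# With /m, ^ also matches just after each internal newline.
my @multi = $text =~ /^foo/mg;
# \A ignores /m and still matches only at the very start.
my $a_multi = () = $text =~ /\Afoo/mg;

# $ and \Z both tolerate one trailing newline; \z does not.
my $dollar  = $text =~ /baz$/  ? 1 : 0;
my $big_z   = $text =~ /baz\Z/ ? 1 : 0;
my $small_z = $text =~ /baz\z/ ? 1 : 0;

printf "plain ^: %d, with /m: %d, \\A with /m: %d\n",
    scalar @plain, scalar @multi, $a_multi;
print "end anchors: dollar=$dollar Z=$big_z z=$small_z\n";
```

Here /^foo/ finds one match without /m and two with it, while \Afoo
finds one either way. Only \z refuses the string that ends in a newline.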

If we wanted a script to count the lines of code (LOC) in a Perl
program and ignore comment lines, blank lines, and lines after the
__END__ or __DATA__ tags, we could write it as:

#!/usr/bin/perl -w
use strict;
my $count = 0;
while (<>) {
    next if /^\s*#/;                 # ignore comment lines
    next if /^\s*$/;                 # ignore blank lines
    last if /^(__END__|__DATA__)/;   # stop at a program-ending token
    $count++;
}
print "There are $count lines of code\n";

It isn't perfect (there could be POD markup anywhere within the program
and not just after the __END__ or __DATA__ tokens), but it provides a
reasonable measure. The first regex /^\s*#/ matches any line starting
with optional whitespace and a # character (a line containing only a
comment); the second regex /^\s*$/ matches a line containing only
optional whitespace (blank lines); the final regex matches lines
beginning with one of the two program-ending tokens. If you are putting
your subroutines after such a token for autoloading purposes, then you
would want to leave the third test (the 'last' line) out of your LOC
counting program so that those subroutines are still counted.
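
One way to address the POD caveat is sketched below. It is only a
sketch: the count_loc subroutine and the sample lines are hypothetical,
and it assumes a POD block starts at any line beginning with '=' plus a
letter and ends at a line beginning with '=cut'.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_loc is a hypothetical helper, not part of the original script.
# It takes a list of source lines and returns the number of code lines.
sub count_loc {
    my @lines  = @_;
    my $count  = 0;
    my $in_pod = 0;
    for (@lines) {
        if ($in_pod) {
            $in_pod = 0 if /^=cut\b/;    # =cut closes the POD block
            next;
        }
        if (/^=[a-zA-Z]/) {              # a POD directive opens a block
            $in_pod = 1;
            next;
        }
        next if /^\s*#/;                 # ignore comment lines
        next if /^\s*$/;                 # ignore blank lines
        last if /^(__END__|__DATA__)/;   # stop at a program-ending token
        $count++;
    }
    return $count;
}

# Made-up sample input for illustration.
my @sample = (
    "#!/usr/bin/perl\n",
    "use strict;\n",
    "\n",
    "=head1 NAME\n",
    "\n",
    "demo - a sample\n",
    "\n",
    "=cut\n",
    "my \$x = 1;   # a trailing comment still counts as code\n",
    "   # indented comment\n",
    "print \$x;\n",
    "__END__\n",
    "print 'never counted';\n",
);
print "There are ", count_loc(@sample), " lines of code\n";
```

For the sample above this reports 3 lines of code: the POD block, the
comments, the blank lines, and everything from __END__ on are skipped.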

Another anchor is the \b metacharacter, which matches what is often
called a word boundary: a position in the target string between a \w
and a \W character, or between the start or end of the string and a \w
character. The \G anchor matches the point where the previous m//g
match left off (that is, at the current pos() for the target string).
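
The short sketch below shows both anchors in use. The sample strings
and variable names are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "cat concatenate cat's bobcat";

# Without \b, 'cat' is found inside longer words as well.
my @loose = $text =~ /cat/g;
# With \b on both sides, only the standalone word matches
# (the apostrophe in "cat's" is a \W character, so it counts).
my @words = $text =~ /\bcat\b/g;

my $digits = "123abc456";

# Without \G, the search skips over 'abc' and finds every digit.
my @all = $digits =~ /(\d)/g;
# With \G, each match must start where the previous one ended,
# so matching stops at the first non-digit.
my @lead = $digits =~ /\G(\d)/g;

print "loose: ", scalar(@loose), ", whole words: ", scalar(@words), "\n";
print "all digits: @all\n";
print "leading digits: @lead\n";
```

The \G version returns only 1, 2 and 3, because the 'a' at the current
pos() breaks the chain of adjacent matches.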

Anchors are also referred to as zero-width assertions because they
match a position in the string and do not consume any characters in the
string.
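
A quick way to see the zero-width behaviour is to substitute at an
anchor: text is inserted at each matching position and nothing is
replaced. This is a small sketch with made-up sample strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert a marker at every word boundary; no characters are removed.
my $marked = "one two";
$marked =~ s/\b/|/g;

# Prefix every line by substituting at ^ under /m.
my $quoted = "a\nb";
$quoted =~ s/^/> /mg;

# The matched span of a bare anchor has length zero:
# $-[0] and $+[0] hold the start and end offsets of the last match.
my $width = -1;
$width = $+[0] - $-[0] if "foo" =~ /^/;

print "$marked\n";
print "$quoted\n";
print "width of ^ match: $width\n";
```

The first substitution yields '|one| |two|': four insertions, with
every original character still in place.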
