Regular Expression Tutorial Part 5: Greedy and Non-Greedy Quantification

By Andrew Johnson, ITworld |  How-to

Last week, I described the standard quantifiers as greedy. This week,
we will look at non-greedy forms of quantifiers, but first let's
discuss just what it means to be greedy.

my \$string = 'bcdabdcbabcd';
\$string =~ m/^(.*)ab/;
print "\$1\n"; # prints: bcdabdcb

The * is greedy; therefore, the .* portion of the regex will match as
much as it can and still allow the remainder of the regex to match. In
this case, it will match everything up to the last 'ab'. Actually,
the .* will match right to the end of the string, and then start
backing up until it can match an 'ab' (this is called backtracking).

To make the quantifier non-greedy you simply follow it with a '?'
symbol:

my \$string = 'bcdabdcbabcd';
\$string =~ m/^(.*?)ab/;
print "\$1\n"; # prints: bcd

In this case the .*? portion attempts to match the least amount of data
while allowing the remainder of the regex to match. Here the regex
engine will match the beginning of the string, then it will try to
match zero of anything and check to see if the rest can match (that
fails). Next, it will match the 'b' and then check again if the 'ab'
can match (still fails). This continues until the the .*? has matched
the first 3 characters and then the following 'ab' is matched.

You can make any of the standard quantifiers that aren't exact non-
greedy by appending a '?' symbol to them: *?, +?, ??, {n,m}?, and {n,}?.

One thing to watch out for: given a pattern such as /^(.*?)%(.*?)/ one
could match and extract the first two fields of a like of % separated
data:

#!/usr/bin/perl -w
use strict;
\$_ = 'Johnson%Andrew%AX321%37';
m/^(.*?)%(.*?)%/;
print "\$2 \$1\n";

And one can easily begin to think of each subexpression as
meaning 'match up to the next % symbol', but that isn't exactly what it
means. Let's say that the third field represents an ID tag and we want
to extract only those names of people with ID tags starting with 'A'.
We might be tempted to do this:

#!/usr/bin/perl -w
use strict;
while () {
print "\$2 \$1\n" if m/^(.*?)%(.*?)%A/;
}
__DATA__
Johnson%Andrew%AX321%Manitoba
Smith%John%BC142%Alberta

This would print out:

Andrew Johnson
John%BC142 Smith

But that isn't what we wanted at all -- what happened? Well, the second
half of the regex does not say match up to the next % symbol and then
match an 'A', it says, match up to the next % symbol that is followed
by an 'A'. The pattern '(.*?)' part is not prevented from matching and
proceeding past a % character if that is what is necessary to cause the
whole regex to succeed.