$_ = '123456abc789';
my $pattern = '\d\d\d';
while ( m/($pattern)/g ) {
    print "$1\n";
}
The above matches each sequence of three digits in turn and runs the
loop body once per match, printing '123', '456', and '789'. Each string
has a positional marker associated with it that records where the last
/g match ended, which is how the regex engine knows where to continue
searching from in the string. You can read or set this marker directly
with the pos() function. When the pattern can no longer be found, the
match operator returns false (ending the while loop in this case) and
the positional marker is reset, so the next search starts at the
beginning of the string (pos() then returns undef rather than 0).
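To watch the marker move, here is a small sketch that records pos()
after each match, checks it again after the failing match, and then
sets it by hand. The variable names ($str, @ends, $after_failure,
$from_nine) are made up for this illustration.

```perl
use strict;
use warnings;

my $str = '123456abc789';
my @ends;    # value of pos() after each successful match

while ( $str =~ m/(\d\d\d)/g ) {
    push @ends, pos($str);
    print "matched '$1', pos() is now ", pos($str), "\n";
}

# The match that ended the loop cleared the marker: pos() is undef,
# which means the next /g search starts at the beginning again.
my $after_failure = pos($str);
print "after the failed match, pos() is ",
    ( defined $after_failure ? $after_failure : 'undef' ), "\n";

# pos() can also be assigned to: start the next search at offset 9.
pos($str) = 9;
my $from_nine;
if ( $str =~ m/(\d\d\d)/g ) {
    $from_nine = $1;
}
print "searching from offset 9 found '$from_nine'\n";
```

The loop reports positions 3, 6, and 12; the jump from 6 to 12 is the
unanchored search stepping over 'abc'.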
One thing to notice is that the above snippet will skip over the 'abc'
part of the string -- that is, on the third attempt to match, we start
trying to match at position 6 (right before the 'a') but we aren't
forced to actually match at that point. To force each match to begin
exactly where the last one left off, we anchor the pattern with \G:
$_ = '123456abc789';
my $pattern = '\d\d\d';
while ( m/\G($pattern)/g ) {
    print "$1\n";
}
In this case, each occurrence of $pattern *must* be found immediately
following the positional marker (either the beginning of the string, or
wherever the last successful match left off). Thus this snippet finds
and prints only '123' and '456'; the third attempt fails at position 6
because 'abc' is not a sequence of three digits.
What if we wanted to match different patterns while stepping through
the string (say, sequences of three digits or three lowercase letters)?
We could set up an alternation pattern and then test the captured
results:
$_ = '123456abc789';
my $pattern = '\d\d\d|[a-z]{3}';
while ( m/\G($pattern)/g ) {
    my $result = $1;
    if ($result =~ /\d/) {
        print "We got 3 digits\n";
    } else {
        print "We got 3 letters\n";
    }
}
That's not horrible, though we needed to test for digits twice (once
in the original pattern, and once in the if test). This gets more
cumbersome as we add more choices to distinguish, and it can get slower
too, because alternations can slow a regex down.
The /c modifier allows a /g match to fail without resetting the
positional marker so we can try another match:
$_ = '123XYZ456abc789';
while (1) {
    print "Got digits ($1)\n" and next if m/\G(\d\d\d)/gc;
    print "Got UCase ($1)\n" and next if m/\G([A-Z]{3})/gc;
    print "Got LCase ($1)\n" and next if m/\G([a-z]{3})/gc;
    print "End of Parsing\n" and last if m/\G$/gc;
    print "Parse Error at position: ", pos(), "\n" and last;
}
Now, we never skip over any data that we haven't accounted for, yet
when any regex fails we simply try the next regex from the same
position. Our parse of the string only fails if all of the regexen fail
and we hit the last line of the loop. The above parses the whole
string successfully, but if you try $_ = '123ABC456ab789'; you'll get a
parse error message at position 9. Without the /c modifier this would
not work: as soon as the first regex failed, it would reset the
positional marker to the beginning of the string, so the next regex
would not start where you wanted.
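The same idea scales to more token types if the patterns live in a
table. The following is only a sketch of one way to arrange it: the
tokenize() sub, the @rules table, and the token names are invented for
this example and are not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical rule table: [ name, pattern ], tried in this order.
my @rules = (
    [ DIGITS => qr/\d{3}/ ],
    [ UCASE  => qr/[A-Z]{3}/ ],
    [ LCASE  => qr/[a-z]{3}/ ],
);

# Returns a list of [ name, text ] pairs, or dies at the first
# position where none of the rules match.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;
  TOKEN:
    while ( pos($text) < length($text) ) {
        for my $rule (@rules) {
            my ( $name, $re ) = @$rule;

            # \G anchors at the marker; /c leaves the marker alone
            # when this rule fails, so the next rule starts there too.
            if ( $text =~ m/\G($re)/gc ) {
                push @tokens, [ $name, $1 ];
                next TOKEN;
            }
        }
        die "Parse error at position " . pos($text) . "\n";
    }
    return @tokens;
}

my @ok = tokenize('123XYZ456abc789');
print join( ' ', map { "$_->[0]($_->[1])" } @ok ), "\n";

# The malformed string from the text stops at position 9.
my $err = '';
eval { tokenize('123ABC456ab789'); 1 } or $err = $@;
print $err;
```

Adding a token type then means adding one row to the table rather than
another line to the loop, at the cost of a little indirection.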