Groping through big data with grep

The grep command has a lot more options and "flavors" than the casual command line pioneer might expect, and there are some quirks and limitations that you should know about when you're working with big data files.

There are times you might not get what you are looking for when using the grep command. Sometimes you end up with far more matches than you expect and sometimes you get far fewer. If you're working with small files, getting a few extra lines might not be a big deal. When you're working with big data files, on the other hand, you might have to be a lot more precise in your queries and know when to abandon grep for a more accommodating tool.

First, understand that I have no gripes with grep. Grep is not just a nice tool for grabbing lines containing specific text from arbitrary files. It's one of the cornerstones of Unix and it works with regular expressions -- thus its name, which stands for "globally search a regular expression and print". The fact that it works so well with regular expressions is why it comes in so handy for so many routine tasks. With grep, we can find text literally (e.g., a name, a label) or we can find it by expressing what it looks like (e.g., dates, addresses, phone numbers). Because grep uses regular expressions, it can accommodate a wide range of text patterns and it can anchor our searches to the beginnings or ends of lines when this is important. We can find a string irrespective of its case. We can look for numbers that have a certain number of digits or numbers that look like 888-456-7890 or 123-45-6789.

The grep command is one of the tools that makes the Unix command line powerful. But what do you do when it does something you weren't expecting? Let's take a look at some of the problems you might run into.
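
First, though, a couple of quick examples of the kind of routine searches grep makes easy. The file name contacts and its contents here are hypothetical; the first command finds a name regardless of case and the second looks for anything shaped like a phone number:

$ grep -i taylor contacts
$ grep -E '[0-9]{3}-[0-9]{3}-[0-9]{4}' contacts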

Finding just what you want

In the example below, we can see how grep selects from a list of bugs all those that start with "bee" and then all those for which the third letter is an "a". This is pretty basic when it comes to using regular expressions.

$ grep ^bee bugs
bees
beetles
$ grep ^..a bugs
roaches
grasshoppers

We can use similar logic to find all lines that start with four (or more) digits:

$ grep ^[0-9][0-9][0-9][0-9] textfile
2010
2013 is turning out to be a better year
12345 Taylor Avenue

In that command, the ^ identifies the beginning of the line and each [0-9] represents a single digit. We can do the same thing with a little less typing by using an extended (-E) regular expression.

$ grep -E "^[0-9]{4}" textfile
2010
2013 is turning out to be a better year
12345 Taylor Avenue

In that command, the -E says we want to use extended regular expressions, the [0-9] says we want to match a digit and the {4} says we want to repeat the preceding pattern (matching a digit) four times. Thus, four digits. Of course, any line that starts with five digits also starts with four digits and, were we to insert a space character after the }, we would miss picking up lines that contain nothing but a four-digit number. So we're getting a bit more than we were looking for.

A better option is this. Here we are using the "whole word" feature (i.e., \< and \>) to select lines containing numbers with four digits and no more.

$ grep '\<[0-9]\{4\}\>' textfile
2010
2013 is turning out to be a better year

We can also loosen our search so that it selects lines containing numbers that have between three and five digits like this:

$ grep '\<[0-9]\{3,5\}\>' textfile
2010
2013 is turning out to be a better year
12345 Taylor Avenue
127.0.0.1

Well, almost! As you can see, we also picked up 127.0.0.1. Why? Because it starts with a three-digit number that is followed by a character (the dot) which doesn't violate the "end of word" definition.

Picking out IP addresses from unstructured data introduces quite a bit of complexity of its own. If you're looking to find a particular IP address within your big data files, you're going to discover quickly that the dots in any IP address you provide will also match any single character, so you might get lines that don't necessarily match the criteria you had in mind.

$ grep 12.45.78.0 text
1234567890
12.45.78.0

If what you really want is to match the address literally, you have a couple of options. You can use grep's -F (fixed string) option to turn off the ". is a wild card" behavior or you can use the fgrep command, which does basically the same thing. Either command turns off the special meaning of . as a wild card.

$ grep -F 12.45.78.0 text
12.45.78.0
$ fgrep 12.45.78.0 text
12.45.78.0 

Alternatively, you could use escape characters to tell grep to interpret the dots in the IP address as literal dots, not wild cards:

$ grep "10\.20\.30\.40" textfile
Don't use the 10.20.30.40 address unless you first talk to Pete.

We can also use a fairly complicated regular expression to match on anything that looks like an IPv4 address:

$ grep -w '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' textfile
127.0.0.1
Don't use the 10.20.30.40 address unless you first talk to Pete.
Refer to Section 11.2.3.4.20 for the process guidelines.

Of course, here we're matching on any string that has four sets of one to three digits separated by dots. But, once again, we find one line that matches this pattern and then extends it with another dot and another number. If you know anything about the context in which your IP addresses appear -- whether they sit at the beginnings or ends of lines, or are surrounded by white space or colons -- you can construct an expression that will be much more precise and maybe give you just what you want and nothing more.
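
For example, if you know that the addresses you're looking for always stand alone -- bounded by white space or the ends of lines -- a pattern like the one below should drop the Section 11.2.3.4.20 line while keeping the genuine addresses. (This is still just a sketch rather than a full IPv4 validator; it would happily accept octets like 999.)

$ grep -E '(^|[[:space:]])[0-9]{1,3}(\.[0-9]{1,3}){3}([[:space:]]|$)' textfile
127.0.0.1
Don't use the 10.20.30.40 address unless you first talk to Pete.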

Very big files!

The other issue that you might not be expecting is that grep can give up when a file reaches a certain size and complexity. In the example below, I added text to the end of a very large file and then was unable to find it using grep.

$ echo "THE END" >> BigFile
$ grep "THE END" BigFile
$

What's going on? I had run this test after noticing that results I had expected to see in a search appeared to be missing from my output. Since the file I was working with was far too huge for browsing, I needed an easy test to determine whether my assumption (that grep couldn't handle the file because it was so large) was correct. This particular file, BigFile, just happens to be more than 20 GB in size.

When you are working with very large files, you may be better off using perl or some other scripting language that won't sweat when files are extremely large. Your searches will still take a considerable amount of time to complete (unless maybe you're working on a supercomputer), but they will likely find your text. Looking for "THE END" with perl is as simple as this. Running this script and seeing "THE END", I know that perl has successfully made it to the end of my file.

#!/usr/bin/perl -w

$logfile=$ARGV[0];

open LOG,"<$logfile" or die "cannot open log: $logfile";

while ( <LOG> ) {
    next if ( ! /THE\sEND/ );   # skip lines that don't contain "THE END"
    print;                      # print matching lines to standard output
}

close LOG;
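
Saved as a script -- the name findend.pl below is just a placeholder -- and given the file to search as an argument, it would be run like this:

$ chmod +x findend.pl
$ ./findend.pl BigFile
THE END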

By the way, an even larger file did not have the same problem with grep, so I don't have a rule of thumb to tell you when files are too big for grep to manage. Just be wary that grep can quietly stop providing results without any indication that it has done so. There were no errors to be seen, just incomplete results -- and incomplete results can be very hard to recognize when your data files are huge and you don't know what to expect.
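
One quick sanity check in a situation like this -- assuming, as above, that the string you're looking for was appended to the very end of the file -- is to bypass grep entirely and look at the file's last line:

$ tail -n 1 BigFile
THE END

If the text is plainly there but grep comes back empty, you know it's the search, not your data, that is falling short.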

