Groping through big data with grep

The grep command has a lot more options and "flavors" than the casual command line pioneer might expect, but there are some options and limitations that you should know about when you're working with big data files.


There are times you might not get what you are looking for using the grep command. Sometimes you end up with far more matches than you expect and sometimes you get far fewer. If you're working with small files, getting a few extra lines might not be a big deal. When you're working with big data files, on the other hand, you might have to be a lot more precise in your queries and know when to abandon grep for a more accommodating tool.

First, understand that I have no gripes with grep. Grep is not just a nice tool for grabbing lines containing specific text from arbitrary files. It's one of the cornerstones of Unix and it works with regular expressions -- thus, its name. Grep stands for "globally search a regular expression and print". The fact that it works so well with regular expressions is why it comes in so handy for so many routine tasks. With grep, we can find text literally (e.g., a name, a label) or we can find it by expressing what it looks like (e.g., dates, addresses, phone numbers).

Because grep uses regular expressions, it can accommodate a wide range of text patterns and it can anchor our searches to the beginnings or endings of lines when this is important. We can find a string irrespective of its case. We can look for numbers that have a certain number of digits or numbers that look like 888-456-7890 or 123-45-6789.
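To make those ideas concrete, here is a minimal sketch. The sample file and its contents are invented for illustration, so treat the names and numbers as assumptions rather than real data.

```shell
# Hypothetical sample data; the contents are made up for this sketch.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Call Sandra at 888-456-7890
sandra's SSN is 123-45-6789
No numbers on this line
EOF

# -i ignores case, so both "Sandra" and "sandra" match.
names=$(grep -i 'sandra' "$sample")

# $ anchors the pattern to the end of the line.
ends_with_line=$(grep 'line$' "$sample")

# Three digits, a hyphen, three digits, a hyphen, four digits.
phones=$(grep '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' "$sample")

# Three digits, a hyphen, two digits, a hyphen, four digits.
ssns=$(grep '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]' "$sample")

printf '%s\n' "$names" "$ends_with_line" "$phones" "$ssns"
rm -f "$sample"
```

Each pattern is wrapped in single quotes so the shell passes it to grep untouched.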

The grep command is one of the tools that makes the Unix command line powerful. But what do you do when it does something you weren't expecting? Let's take a look at some of the problems you might run into.

Finding just what you want

In the example below, we can see how grep selects from a list of bugs all those that start with "bee" and then all those for which the third letter is an "a". This is pretty basic when it comes to using regular expressions.

$ grep '^bee' bugs
$ grep '^..a' bugs
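The contents of the bugs file aren't shown here, so the sketch below assumes a small made-up list to show what those two patterns would select.

```shell
# Hypothetical "bugs" file; the real file's contents are not given in the text.
bugs=$(mktemp)
printf '%s\n' ant bee beetle cranefly flea gnat > "$bugs"

# Lines that begin with "bee".
starts_with_bee=$(grep '^bee' "$bugs")

# Lines whose third character is "a": ^ anchors to the start of the line
# and each . matches any single character.
third_is_a=$(grep '^..a' "$bugs")

printf '%s\n' "$starts_with_bee" "$third_is_a"
rm -f "$bugs"
```

With this sample list, the first search picks up "bee" and "beetle" and the second picks up "cranefly" and "gnat".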

We can use similar logic to find all lines that start with four (or more) digits:

$ grep '^[0-9][0-9][0-9][0-9]' textfile
2013 is turning out to be a better year
12345 Taylor Avenue

In that command, the ^ identifies the beginning of the line and each [0-9] represents a single digit. We can do the same thing with a little less typing by using an extended (-E) regular expression.

$ grep -E "^[0-9]{4}" textfile
2013 is turning out to be a better year
12345 Taylor Avenue

In that command, the -E says we want to use extended regular expressions, the [0-9] says we want to match a digit and the {4} says we want to repeat the preceding pattern (matching a digit) four times. Thus, four digits. Of course, any line that starts with five digits also starts with four digits and, were we to insert a space character after the }, we would miss picking up lines that contain only a four-digit number.
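Here is a rough sketch of that trade-off. The sample file is an assumption: it reuses the two lines shown above and adds a line containing only "1999". The last pattern is one possible way to ask for exactly four leading digits, not the only one.

```shell
# Hypothetical sample file: the two lines from the example plus a bare four-digit line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2013 is turning out to be a better year
12345 Taylor Avenue
1999
EOF

# Four or more leading digits: all three lines match.
at_least_four=$(grep -E '^[0-9]{4}' "$sample")

# Four digits followed by a space: drops the five-digit line,
# but also drops the line that contains only "1999".
with_space=$(grep -E '^[0-9]{4} ' "$sample")

# Four digits followed by a non-digit or the end of the line:
# keeps "1999" and still rejects "12345".
exactly_four=$(grep -E '^[0-9]{4}([^0-9]|$)' "$sample")

printf '%s\n' "$at_least_four" "---" "$with_space" "---" "$exactly_four"
rm -f "$sample"
```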
