Groping through big data with grep

The grep command has a lot more options and "flavors" than the casual command line pioneer might expect, but there are some options and limitations that you should know about when you're working with big data files.


So we're getting a bit more than we were looking for.

A better option is this. Here we are using the "whole word" markers (\< and \>) to select lines containing numbers with exactly four digits.

$ grep '\<[0-9]\{4\}\>' textfile
2010
2013 is turning out to be a better year

We can also loosen our search so that it selects lines containing numbers that have between three and five digits like this:

$ grep '\<[0-9]\{3,5\}\>' textfile
2010
2013 is turning out to be a better year
12345 Taylor Avenue
127.0.0.1

Well, almost! As you can see, we also picked up 127.0.0.1. Why? Because it starts with a three-digit number, and the dot that follows is not a word character, so the number still satisfies the "end of word" definition.
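One way to keep strings like 127.0.0.1 out of the results is to spell out what may surround the number instead of relying on word boundaries. The sketch below is only one possible approach. It assumes a sample file rebuilt from the output lines shown above (the temporary file is made up for illustration) and switches to extended regular expressions with grep -E. It treats a dot on either side of the number as disqualifying, which also means a number that ends a sentence would be skipped.

```shell
# Build a small sample file (contents assumed from the output shown above).
sample=$(mktemp)
printf '%s\n' \
  '2010' \
  '2013 is turning out to be a better year' \
  '12345 Taylor Avenue' \
  '127.0.0.1' > "$sample"

# Match a run of 3 to 5 digits that is neither preceded nor followed
# by another digit or a dot.
grep -E '(^|[^0-9.])[0-9]{3,5}([^0-9.]|$)' "$sample"
```

With this sample data, the 127.0.0.1 line should drop out while the other three lines remain.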

Picking out IP addresses from unstructured data also introduces quite a bit of complexity. If you're looking to find a particular IP address within your big data files, you're going to quickly discover that the dots in any IP address that you provide will also match any single character. So you might get lines that don't necessarily match the criteria that you had in mind.

$ grep 12.45.78.0 text
1234567890
12.45.78.0

If you want to find one particular address and nothing else, you have a couple of options. You can use grep's -F (fixed string) option to turn off the ". is a wild card" behavior, or you can use the fgrep command, which does basically the same thing (recent GNU grep releases treat fgrep as obsolescent and recommend grep -F instead).

$ grep -F 12.45.78.0 text
12.45.78.0
$ fgrep 12.45.78.0 text
12.45.78.0 

Alternatively, you could escape the dots with backslashes to tell grep to interpret them as literal dots, not wild cards:

$ grep "10\.20\.30\.40" textfile
Don't use the 10.20.30.40 address unless you first talk to Pete.
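Keep in mind that turning off the wild card does not stop the address from matching inside a longer one. As a small illustration (the second line of the sample data is invented for this example), adding -w to the fixed-string search requires the match to stand as a whole word, so 110.20.30.40 is no longer picked up:

```shell
# Hypothetical sample data: the target address plus a near miss.
sample=$(mktemp)
printf '%s\n' \
  "Don't use the 10.20.30.40 address unless you first talk to Pete." \
  'Traffic from 110.20.30.40 was blocked.' > "$sample"

# -F alone matches both lines, because 10.20.30.40 is a substring
# of 110.20.30.40.
grep -F '10.20.30.40' "$sample"

# Adding -w keeps only the line where the address stands on its own.
grep -wF '10.20.30.40' "$sample"
```

Since a dot is not a word character, -w would still accept an address that is followed by another dot and number, so this narrows the search rather than making it airtight.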

We can also use a fairly complicated regular expression to match on anything that looks like an IPv4 address:

$ grep -w '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' textfile
127.0.0.1
Don't use the 10.20.30.40 address unless you first talk to Pete.
Refer to Section 11.2.3.4.20 for the process guidelines.

Here we're matching any string that has four sets of one to three digits separated by dots. Once again, though, one of the lines we get back contains this pattern but extends it with another dot and another number.
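If the section number is unwanted, one possible refinement is to insist that the four groups are not preceded or followed by another digit or dot. This is a sketch, not a full IPv4 validator: it uses grep -E, assumes a sample file built from the three output lines above, and does not check that each group is 255 or less. An address followed directly by a sentence-ending period would also be skipped.

```shell
# Sample file assumed from the output lines shown above.
sample=$(mktemp)
printf '%s\n' \
  '127.0.0.1' \
  "Don't use the 10.20.30.40 address unless you first talk to Pete." \
  'Refer to Section 11.2.3.4.20 for the process guidelines.' > "$sample"

# Four dot-separated groups of 1 to 3 digits, with no digit or dot
# immediately before or after the whole thing.
grep -E '(^|[^0-9.])([0-9]{1,3}\.){3}[0-9]{1,3}([^0-9.]|$)' "$sample"
```

Against this sample, the 127.0.0.1 and 10.20.30.40 lines should still match, while the line with 11.2.3.4.20 should not.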

Of course, if you know anything about the context in which your IP addresses exist, whether they are at the beginning or ends of lines, surrounded by white space or colons, you can construct an expression that will be much more precise and maybe give you just what you want and nothing more.
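As an example of using context, suppose the addresses you care about always appear at the start of a line and are followed by a space, as they do in many log formats. The log lines below are invented purely for illustration; the point is only that anchoring the pattern with ^ and a trailing space rules out addresses and look-alikes that appear mid-line.

```shell
# Hypothetical log-style data: only the first line starts with an address.
sample=$(mktemp)
printf '%s\n' \
  '192.168.1.10 GET /index.html' \
  'Connection from 10.20.30.40 refused' \
  'Refer to Section 11.2.3.4.20 for the process guidelines.' > "$sample"

# Anchor at the start of the line and require a space after the
# fourth group of digits.
grep -E '^([0-9]{1,3}\.){3}[0-9]{1,3} ' "$sample"
```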

Very big files!

The other issue that you might not be expecting is that grep gives up when a file reaches a certain size and complexity.

