I might grab the top 1,000 lines and the bottom 1,000 lines from a huge file, append one to the other and use the result as my sample -- in case there are fundamental differences between the start and end of a data set.
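A minimal Python sketch of that head-plus-tail sample follows. The function name `head_tail_sample` and the default of 1,000 lines are my own choices for illustration; the idea is simply to read the file once and keep only the first and last `n` lines.

```python
from collections import deque
from itertools import islice


def head_tail_sample(lines, n=1000):
    """Return the first n and the last n items of an iterable of lines.

    The input is consumed once, so this also works on files that are too
    large to hold in memory. If the input has 2*n lines or fewer, every
    line is returned exactly once (no duplicates).
    """
    it = iter(lines)
    head = list(islice(it, n))   # the first n lines
    tail = deque(it, maxlen=n)   # only the last n of whatever remains
    return head + list(tail)


if __name__ == "__main__":
    # Stand-in for a huge file: 10,000 numbered lines (hypothetical data).
    fake_file = (f"line {i}\n" for i in range(10_000))
    sample = head_tail_sample(fake_file, n=1000)
    print(len(sample), "lines in the sample")
```

To sample a real file, pass an open file object instead of the generator, for example `head_tail_sample(open("big.log", errors="replace"))`, where `big.log` is a placeholder name.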
One thing I often do is remove or ignore the portion of each line in my data file that makes the lines unique -- e.g., the time of day or the connection identifier. I might remove the time, but not the date, from entries in a log file so that I can easily count how many times particular errors are occurring each day. If I want to know how many times serious errors are happening, I can then grep and count the lines that now look the same. Alternatively, I can mask (or ignore) the data that makes every line unique and preserve for my analysis only the data that helps to answer my questions.
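Here is one way the masking step might look in Python. The log format (a date, then an `HH:MM:SS` time, then the message) is an assumption made for the example, and `count_masked` and `TIME_RE` are hypothetical names; a real log would need a pattern that matches its own layout.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format: "YYYY-MM-DD HH:MM:SS message text"
TIME_RE = re.compile(r"\b\d{2}:\d{2}:\d{2}\b")


def count_masked(lines, pattern=TIME_RE, mask=""):
    """Count lines that become identical once `pattern` is masked out."""
    counts = Counter()
    for line in lines:
        masked = pattern.sub(mask, line.rstrip("\n"))
        masked = " ".join(masked.split())  # collapse the gap the mask leaves
        counts[masked] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2013-05-01 08:00:01 ERROR disk full",
        "2013-05-01 09:15:42 ERROR disk full",
        "2013-05-02 10:00:00 ERROR disk full",
    ]
    # Lines that differed only by time of day now share one key per date.
    for line, n in count_masked(sample).most_common():
        print(n, line)
```

Because the date is left in place, the counts come out per day. Masking a connection identifier instead would only require passing a different `pattern`.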
From my experiences with big data, I'd suggest that when you're starting out on a big data analysis problem, you should:
1) Make sure you have enough disk space to work. Remember that, when you uncompress a 100 MB file, the resultant file might be many times as large.
2) Grab a representative sample. Make it large enough to reflect the variety in the full data set, for example by drawing from both the start and the end of the file.
3) Craft your tools. I often stick a couple comments in my scripts with sample data to remind me what the data I'm processing looks like while I'm coding.
4) Test your tools. Try out your tools on a small sample of data so that you can get a quick response. Work your way up to larger samples.
5) Run your first trial. Try your tool on a real file, however big, to see how long it takes to process and whether it gets to the end without running into problems.
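Steps 4 and 5 can be sketched as a small timing harness that runs a tool on progressively larger samples before you commit to the full file. This is only an illustration: `time_on_growing_samples`, the `open_lines` callback and the sample sizes are all hypothetical, and `process` stands in for whatever your own tool does with a line.

```python
import time
from itertools import islice


def time_on_growing_samples(open_lines, process, sizes=(1_000, 10_000, 100_000)):
    """Run `process` on the first N lines for each N in `sizes`.

    `open_lines` must return a fresh iterator over the data each time it is
    called (e.g. by reopening the file). Returns a list of
    (requested_size, lines_processed, seconds) tuples.
    """
    results = []
    for size in sizes:
        start = time.perf_counter()
        seen = 0
        for line in islice(open_lines(), size):
            process(line)
            seen += 1
        results.append((size, seen, time.perf_counter() - start))
        if seen < size:
            break  # ran out of data; larger samples would just repeat this run
    return results


if __name__ == "__main__":
    # Stand-in data source and tool (both hypothetical).
    def fake_lines():
        return (f"2013-05-01 ERROR event {i}\n" for i in range(25_000))

    for size, seen, secs in time_on_growing_samples(fake_lines, str.split):
        print(f"asked for {size}, processed {seen} lines in {secs:.4f}s")
```

If the time per line stays roughly constant as the sample grows, you can estimate how long the full file will take before you start the real run.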
As you refine your tools, think about how you might make the things that you discover understandable. How do you present your findings so that they draw attention to the things that matter? I was influenced many years ago by "The Visual Display of Quantitative Information" by Edward R. Tufte. Graphs might not always be an option, but presenting data in a way that makes sense to viewers is essential if you want them to understand your results.
It's the things that you don't know to look for that will bite you. But, if you can rule out everything that follows the patterns that you expect to see, maybe you can highlight those that are new and unusual. At least Unix offers you a lot of tools to get you on your way.
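The "rule out what you expect" idea is what `grep -v` does on the command line. As a rough Python sketch, with `unexpected_lines` and the sample patterns being hypothetical names and data, it might look like this:

```python
import re


def unexpected_lines(lines, expected_patterns):
    """Yield only the lines that match none of the expected patterns."""
    compiled = [re.compile(p) for p in expected_patterns]
    for line in lines:
        if not any(rx.search(line) for rx in compiled):
            yield line


if __name__ == "__main__":
    log = [
        "session opened for user alice",
        "session closed for user alice",
        "kernel: unexpected fault in module xyz",
    ]
    # Patterns describing the routine entries you already understand.
    routine = [r"session (opened|closed) for user \w+"]
    for line in unexpected_lines(log, routine):
        print(line)
```

As you learn to recognize more of the routine entries, you add their patterns to the list, and whatever still gets through is what deserves a closer look.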
Read more of Sandra Henry-Stocker's Unix as a Second Language blog and follow the latest IT news (http://www.itworld.com/news) at ITworld, Twitter and Facebook.
flickr / Marius B