I see a lot of headlines these days about "big data" and immediately identify. I think of all the big data problems that I've grappled with over the years and imagine what the key problems with managing big data will be as the data stores get larger and more diverse -- storage, manipulation, characterization, extraction of meaning (or "intelligence") ... There's a lot more to looking for needles in haystacks than folks might imagine. First, you've got to know what kinds of needles you're looking for and second, you need to know what kind of tools might help you find them.

In a sense, I've been working with big data for decades -- cumbersome system logs that sometimes went on for months, and crazy big logs from big web sites (e.g., the web logs from magazines like SunWorld and JavaWorld back in the late '90s). I might be analyzing tens or hundreds of gigabytes worth of data to answer important questions. The tools that I use won't be much different than those I use to work with files that are tens of kilobytes in size, but the techniques vary considerably.

When I work with modest files, I have a chance to review their contents and to refine my analysis step by step. First, I extract records to see what the results look like. Then, I refine my strategy so that the results look just like what I want. When I'm working with huge data sets, I have to do some guessing about what I'm going to encounter. I may not have time to start over if my first pass isn't successful, or I may lose critical insights if my search techniques are too generic.

No matter how much auditors insist to the contrary, I will never have time to review all of my logs, but I can make time to review summaries or insightful extractions ... and, if they pick up the right stuff, I'll probably be able to see problems before they've overwhelmed me. At least most of the time.

So what is "big data"? Or, more properly, what are "big data"?
Some people, when describing big data, talk about the three V's -- volume, variety and velocity -- to describe the growing challenges of huge data collections. Data stores are often measured in terabytes. In fact, the last external USB drive that I purchased for my personal use fits in my pocket and holds 2 TB. The variety of data is growing as well. And the rate of change is accelerating. So we have not only a lot of data to store and analyze, but also the challenge of maintaining tools to help us make use of it.

We can save logs, compress logs, and back up logs. But what if we want to actually get some value out of them? What if we want to find out how many people are reading this sysadmin column or looking at the ads we've placed on every page? How do we pull the important statistics from all sorts of logs of various sizes and content? And how do we do this kind of analysis for a moving, changing and growing set of data?

You might buy tools that help analyze particular types of data files. There are some great tools out there for analyzing web traffic and syslog contents, but do they help you notice when people are trying to compromise your site? You might build scripts that allow you to customize your analysis of various data sources. But you probably won't always be looking for the same things. How do you manipulate a pile of data and find essential information without knowing ahead of time what you might find, and without necessarily knowing what you're looking for? These are the most challenging of the big data questions from my point of view. They say hindsight is 20/20. How do we sharpen the focus of foresight?

Fortunately for those of us working on Unix systems, there are a lot of powerful built-in tools that give us a leg up when it comes to extracting essential insights from even the largest data files. Unix is, out of the box, well equipped to help us approach huge globs of data without knowing ahead of time what we ought to be looking for.
What's different? What's worrisome? How do we recognize an emerging threat or a new problem when we don't yet know what it looks like?

Top among these tools are:
- regular expressions
- scripting languages (bash, perl, etc.)
- tools to compress and uncompress files (gzip, bzip2)
- tools like grep, fgrep, cut and awk for finding and selecting text
- tr and sed for masking or changing text
- sort and uniq for summarization
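
As a rough sketch of how several of the tools in this list chain together, the pipeline below counts how often each value appears in one column of a file. The comma-separated layout, the names and the state codes are all made-up sample data, not records from any real system.

```shell
#!/bin/sh
# Hypothetical sample data: comma-separated records with a state in field 5.
data=$(mktemp)
cat > "$data" <<'EOF'
101,Ann,Lee,2013-02-01,MD
102,Bob,Ray,2013-02-01,VA
103,Cy,Fox,2013-02-02,MD
104,Di,Orr,2013-02-03,PA
105,Ed,Kim,2013-02-03,MD
EOF

# cut selects field 5, sort groups identical values together,
# uniq -c counts each group, and sort -rn lists the largest counts first.
state_counts=$(cut -d, -f5 "$data" | sort | uniq -c | sort -rn)
echo "$state_counts"

rm -f "$data"
```

For whitespace-separated data, `awk '{print $5}'` could stand in for the `cut` step.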
You might use these tools for:
- looking at just the fifth column in a file to count up how many people in your data set are from particular states or countries
- counting lines by date to create an activity graph to show how many messages were processed or how many times users logged in
- looking for evidence that either someone is having trouble logging in or someone is trying to break into their account
- summing expenses by month to determine which months were the most costly for your project
- looking for signs of people attempting to hack into your web site by processing your web traffic logs
- looking through your firewall logs for evidence that systems have been compromised
- looking through phone records to find out who Sandra Henry-Stocker is calling to be reminded about what she needs to pick up at the grocery store on her way home from work (OK, just kidding)

I've often had to look through many gigabytes of data to find the tiny portion, far less than 1% of the content, that I need to answer critical questions. I usually prototype my tools by first grabbing a data sample and making sure everything works as planned before I move on to a larger data set. For this, the head and tail commands are invaluable. I might grab the top 1,000 lines and the bottom 1,000 lines from a huge file, append one to the other and use the result as my sample -- in case there are fundamental differences between the start and end of a data set.

One thing I often do is remove or ignore the portion of each line in my data file that makes the lines unique -- e.g., the time of day or the connection identifier. I might remove the time, but not the date, from entries in a log file so that I can easily count how many times particular errors are occurring each day. If I want to know how many times serious errors are happening, I can then grep and count the lines that now look the same.
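
The sampling and masking steps described above might look something like the following sketch. The log layout (date, time, severity, message) and the function name count_daily_errors are assumptions made up for illustration; a real log would need its own sed pattern.

```shell
#!/bin/sh
# Hypothetical log: date, time, severity, message. Real layouts will differ.
log=$(mktemp)
sample=$(mktemp)
cat > "$log" <<'EOF'
2013-03-01 08:15:02 ERROR disk full on /var
2013-03-01 08:15:09 ERROR disk full on /var
2013-03-01 09:40:51 WARN  login failed for jdoe
2013-03-02 01:02:03 ERROR disk full on /var
2013-03-02 11:30:00 ERROR db connection lost
2013-03-02 11:30:07 ERROR db connection lost
EOF

# Build a sample from the start and the end of the file.
# (Use 1000 on a real file; 2 here because the demo log is tiny.)
n=2
head -n "$n" "$log" >  "$sample"
tail -n "$n" "$log" >> "$sample"
sample_lines=$(awk 'END {print NR}' "$sample")

# Drop the time but keep the date, so identical errors on the same day
# become identical lines; then keep the errors and count the duplicates.
count_daily_errors() {
    sed 's/^\([0-9-]*\) [0-9:]* /\1 /' "$1" | grep ' ERROR ' | sort | uniq -c
}

# Prototype on the sample first, then run on the whole file.
sample_counts=$(count_daily_errors "$sample")
full_counts=$(count_daily_errors "$log")
echo "$full_counts"

rm -f "$log" "$sample"
```

Each output line is a count followed by the date and the error text, which is easy to scan or to feed into a graph.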
Alternately, I can mask (or ignore) the data that makes every line unique and preserve for my analysis only the data that helps to answer my questions.

From my experience with big data, I'd suggest that when you're starting out on a big data analysis problem, you should:
1) Make sure you have enough disk space to work. Remember that, when you uncompress a 100 MB file, the resultant file might be many times as large.
2) Grab a representative sample. Try to get a reasonably large and representative data sample.
3) Craft your tools. I often stick a couple comments in my scripts with sample data to remind me what the data I'm processing looks like while I'm coding.
4) Test your tools. Try out your tools on a small sample of data so that you can get a quick response. Work your way up to larger samples.
5) Run your first trial. Try your tool on a real file, however big, to see how long it takes to process and whether it gets to the end without running into problems.
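
One way these five steps could play out on a gzip-compressed log is sketched below. The file name app.log.gz, the generated log lines and the sample size of 100 are placeholders invented for the example, and the 500-line file is only a stand-in for something far larger.

```shell
#!/bin/sh
# Stand-in for a huge compressed log: 500 hypothetical lines, gzipped.
workdir=$(mktemp -d)
big="$workdir/app.log.gz"
i=1
while [ "$i" -le 500 ]; do
    echo "2013-03-01 00:00:00 INFO request $i served"
    i=$((i + 1))
done | gzip > "$big"

# 1) Disk space: stream the uncompressed data through wc so nothing is
#    written to disk, then compare the size with the free space (in KB).
compressed_bytes=$(wc -c < "$big" | tr -d ' ')
uncompressed_bytes=$(gzip -dc "$big" | wc -c | tr -d ' ')
free_kb=$(df -Pk "$workdir" | awk 'NR == 2 {print $4}')
needed_kb=$(( uncompressed_bytes / 1024 + 1 ))
if [ "$needed_kb" -ge "$free_kb" ]; then
    echo "not enough room to uncompress $big" >&2
fi

# 2) Sample: take the first 100 lines straight from the compressed stream.
gzip -dc "$big" | head -n 100 > "$workdir/sample.log"
sample_lines=$(awk 'END {print NR}' "$workdir/sample.log")

# 3-4) Craft the tool and test it on the sample for a quick answer.
sample_hits=$(grep -c ' INFO ' "$workdir/sample.log")

# 5) First full run, timed in whole seconds.
start=$(date +%s)
full_hits=$(gzip -dc "$big" | grep -c ' INFO ')
elapsed=$(( $(date +%s) - start ))
echo "sample: $sample_hits of $sample_lines lines; full: $full_hits in ${elapsed}s"

rm -rf "$workdir"
```

Reading from `gzip -dc` rather than uncompressing in place keeps the original file intact and avoids needing room for a second, much larger copy.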
As you refine your tools, think about how you might make the things that you discover understandable. How do you present your findings so that they draw attention to the things that matter? I was influenced many years ago by "The Visual Display of Quantitative Information" by Edward R. Tufte. Graphs might not always be an option, but presenting data in a way that makes sense to viewers is essential if you want them to understand your results.

It's the things that you don't know to look for that will bite you. But, if you can rule out everything that follows the patterns that you expect to see, maybe you can highlight those that are new and unusual. At least Unix offers you a lot of tools to get you on your way.
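
Ruling out the expected so that the unusual stands out could be sketched with grep's pattern-file and invert options. The syslog-style lines and the two "routine" patterns below are invented for the example; a real pattern list would grow as you learn what normal looks like on your systems.

```shell
#!/bin/sh
# Hypothetical syslog-style entries.
log=$(mktemp)
expected=$(mktemp)
cat > "$log" <<'EOF'
Mar  1 08:00:01 host1 CRON[311]: session opened for user root
Mar  1 08:00:02 host1 CRON[311]: session closed for user root
Mar  1 08:05:12 host1 sshd[402]: Accepted publickey for shs
Mar  1 08:07:45 host1 kernel: eth0: link down
Mar  1 08:09:00 host1 CRON[318]: session opened for user root
EOF

# One extended regular expression per line for entries considered routine.
cat > "$expected" <<'EOF'
CRON\[[0-9]+\]: session (opened|closed) for user
sshd\[[0-9]+\]: Accepted publickey for
EOF

# -f reads the patterns from a file, -E treats them as extended regular
# expressions, and -v prints only the lines that match none of them.
unexpected=$(grep -E -v -f "$expected" "$log")
echo "$unexpected"

rm -f "$log" "$expected"
```

With this sample data, only the kernel "link down" entry is left once the routine cron and ssh lines are filtered out.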
Read more of Sandra Henry-Stocker's Unix as a Second Language blog and follow the latest IT news at ITworld, Twitter and Facebook.