June 30, 2013, 2:58 PM — I see a lot of headlines these days about "big data" and immediately identify. I think of all the big data problems that I've grappled with over the years and imagine what the key problems with managing big data will be as the data stores get larger and more diverse -- storage, manipulation, characterization, extraction of meaning (or "intelligence") ... There's a lot more to looking for needles in haystacks than folks might imagine. First, you've got to know what kinds of needles you're looking for and second, you need to know what kind of tools might help you find them.
In a sense, I've been working with big data for decades -- cumbersome system logs that sometimes went on for months, and crazy big logs from big web sites (e.g., the web logs from magazines like SunWorld and JavaWorld back in the late '90s). I might be analyzing tens or hundreds of gigabytes worth of data to answer important questions. The tools that I use won't be much different than those I use to work with files that are tens of kilobytes in size, but the techniques vary considerably. When I work with modest files, I have a chance to review their contents and to refine my analysis step by step. First, I extract records to see what the results look like. Then, I refine my strategy so that the results look just like what I want. When I'm working with huge data sets, I have to do some guessing about what I'm going to encounter. I may not have time to start over if my first pass isn't successful, or I may lose critical insights if my search techniques are too generic.
No matter how much auditors insist to the contrary, I will never have time to review all of my logs, but I can make time to review summaries or insightful extractions ... and, if they pick up the right stuff, I'll probably be able to see problems before they've overwhelmed me. At least most of the time.
So what is "big data"? Or, more properly, what are "big data"? Some people, when describing big data, talk about the three V's -- volume, variety and velocity -- to describe the growing challenges of huge data collections. Data stores often measure in the terabytes. In fact, the last external USB drive that I purchased for my personal use fits in my pocket and holds 2 TB. The variety of data is growing as well. And the rate of change is accelerating. So, we not only have a lot of data to store and analyze, but also the challenge of maintaining tools to help us make use of it.
We can save logs, compress logs, and back up logs. But what if we want to actually get some value out of them? What if we want to find out how many people are reading this sysadmin column or looking at the ads we've placed on every page? How do we pull the important statistics from all sorts of logs of various sizes and content? And how do we do this kind of analysis for a moving, changing and growing set of data?
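To make the page-count question concrete, here is a minimal sketch in Python. It assumes the web server writes something like the Common Log Format; the sample lines, the paths and the function name count_page_hits are all made up for illustration and do not come from any real log. The point is only that the log is read one line at a time, so the same approach works whether the file is kilobytes or gigabytes.

```python
import re
from collections import Counter

# Assumption: requests appear as "METHOD /path PROTOCOL" followed by a
# three-digit status code, as in the Common Log Format.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3})')

def count_page_hits(lines):
    """Count successful (status 200) requests per path, one line at a time."""
    hits = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            hits[match.group(1)] += 1
    return hits

# Hypothetical sample records, invented for this sketch.
sample = [
    '10.0.0.1 - - [30/Jun/2013:14:58:00 -0400] "GET /columns/sysadmin.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [30/Jun/2013:14:58:03 -0400] "GET /ads/banner.gif HTTP/1.1" 200 2048',
    '10.0.0.3 - - [30/Jun/2013:14:58:07 -0400] "GET /missing.html HTTP/1.1" 404 312',
    '10.0.0.4 - - [30/Jun/2013:14:58:09 -0400] "GET /columns/sysadmin.html HTTP/1.1" 200 5120',
]

hits = count_page_hits(sample)
for path, count in hits.most_common():
    print(count, path)
```

For a real log, the same function could be handed an open file object instead of a list, since iterating over a file also yields one line at a time. The status filter and the pattern would almost certainly need adjusting to match whatever the server actually records.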
You might buy tools that help analyze particular types of data files.