There are some great tools out there for analyzing web traffic and syslog contents, but do they help you notice when people are trying to compromise your site?
You might build scripts that allow you to customize your analysis of various data sources. But you probably won't always be looking for the same things. How do you manipulate a pile of data and find essential information without knowing ahead of time what you might find, and without necessarily knowing what you're looking for? These are the most challenging of the big data questions from my point of view. They say hindsight is 20/20. How do we sharpen the focus of foresight?
Fortunately for those of us working on Unix systems, there are a lot of powerful built-in tools that give us a leg up when it comes to extracting essential insights from even the largest data files.
Unix is, out of the box, well equipped to help us approach huge globs of data without knowing ahead of time what we ought to be looking for. What's different? What's worrisome? How do we recognize an emerging threat or a new problem when we don't yet know what it looks like? Top among these tools are:
- regular expressions
- scripting languages (bash, perl, etc.)
- tools to compress and uncompress files (gzip, bzip2)
- tools like grep, fgrep, cut, and awk for finding and selecting text
- tr and sed for masking or changing text
- sort and uniq for summarization
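To make this concrete, here's a minimal sketch of how a few of these tools chain together. The file name, location, and three-column layout (date, user, state) are all invented for the example:

```shell
# Build a tiny sample file (hypothetical format: date user state)
cat > /tmp/sample.log <<'EOF'
2024-01-02 alice PA
2024-01-02 bob NJ
2024-01-03 alice PA
2024-01-03 carol NY
2024-01-03 alice PA
EOF

# Pull out the third column, then count occurrences of each value,
# most frequent first -- cut selects, sort/uniq summarize
cut -d' ' -f3 /tmp/sample.log | sort | uniq -c | sort -rn
```

The `sort | uniq -c | sort -rn` idiom at the end is the workhorse here: it turns any stream of repeated values into a frequency table, so "PA" tops the list with a count of 3.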
You might use these tools for:
- looking at just the fifth column in a file to count up how many people in your data set are from particular states or countries
- counting lines by date to create an activity graph showing how many messages were processed or how many times users logged in
- looking for evidence that either someone is having trouble logging in or someone is trying to break into their account
- summing expenses by month to determine which months were the most costly for your project
- looking for signs of people attempting to hack into your web site by processing your web traffic logs
- looking through your firewall logs for evidence that systems have been compromised
- looking through phone records to find out who Sandra Henry-Stocker is calling to be reminded about what she needs to pick up at the grocery store on her way home from work (OK, just kidding)
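The break-in-attempt case is a good illustration. Assuming an sshd-style auth log (the file path and log lines below are fabricated for the sketch), you can surface repeat offenders in one pipeline:

```shell
# Hypothetical auth log excerpt
cat > /tmp/auth.log <<'EOF'
Jan 03 10:01:11 host sshd[101]: Failed password for root from 203.0.113.9
Jan 03 10:01:15 host sshd[102]: Failed password for root from 203.0.113.9
Jan 03 10:02:02 host sshd[103]: Accepted password for alice from 198.51.100.7
Jan 03 10:03:44 host sshd[104]: Failed password for invalid user admin from 203.0.113.9
EOF

# Keep only failed logins, grab the source address (last field),
# and count attempts per address, worst first
grep 'Failed password' /tmp/auth.log | awk '{print $NF}' | sort | uniq -c | sort -rn
```

A single failure is probably a typo; dozens from one address in a short window is something worth a second look, and that's exactly the kind of pattern this summary makes obvious.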
I've often had to look through many gigabytes of data to find the far less than 1% of its content that I need to answer critical questions. I usually prototype my tools by first grabbing a data sample and making sure everything works as planned before moving on to the full data set. For this, the head and tail commands are invaluable.
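A quick sketch of that sampling habit (the "big" file here is a stand-in built with seq so the example runs anywhere):

```shell
# Build a stand-in "big" file of 5000 numbered lines
seq 1 5000 | sed 's/^/line /' > /tmp/big.log

# Prototype against a small slice: the first and last 100 lines,
# so you see both the oldest and the most recent records
head -100 /tmp/big.log  >  /tmp/slice.log
tail -100 /tmp/big.log  >> /tmp/slice.log

wc -l < /tmp/slice.log
```

Once a pipeline behaves on the 200-line slice, pointing it at the real multi-gigabyte file is just a change of file name.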