Digesting log data

April 22, 2009, 08:26 AM —  ITworld — 

Reducing voluminous log data to a size that can be read and understood in a matter of minutes can make the difference between systems administrators having the time to review log data on a routine basis and only reviewing it when a problem has become so noticeable that an analysis is unavoidable. 

I insist on two criteria when digesting log data. The first is that no messages are omitted. If I only look for messages that I know to be potential problems (like those that include the word "warning"), I may easily overlook many other problems of immediate or emerging importance. The second is to include a count of how many times each message has appeared. This gives me a sense of the severity of each problem.

I've written scripts to digest log data at various times in my career, using a variety of tools, but my most recent attempt in Perl has some advantages. One advantage is that it works where similarly constructed shell scripts fail for lack of resources. Another advantage is that the code itself is surprisingly simple.

The reason I turned to Perl is easy to explain. When I attempted to digest a particularly large log file on the command line using standard Unix utilities, my system balked with a complaint that no space was left on the device, even though I had used the tersest, most lightweight command I could conjure. I had sorted the file and passed it to the uniq command with a -c argument intended to give me the number of times each pattern occurred. This is what I got:

$ sort logfile | uniq -c
sort: write error while sorting: No space left on device

While this modest little Unix command will work for most files most of the time, my file was more than 800,000 lines long, and sort most likely ran out of room for the temporary files it writes when its input is too large to hold in memory. When I replaced this command with a Perl script, I had my results (on repeated runs) in anywhere from 12 to 20 seconds. A typical messages file takes only a few seconds.

The primary "trick" to this Perl script is making good use of arrays.

The first thing we do in this script is to check for the existence of the log file name on the command line and assign the name provided to a variable.

#!/bin/perl

if ( $#ARGV != 0 ) {
    print "usage: $0 <logfile>\n";
    exit;
}
$logf=$ARGV[0];

In the following lines, we read the log file into an array. We then change each run of digits to a single pound sign to reduce the uniqueness of our data and increase the level of compression. This reduces dates and times, for example, to strings that all look the same (e.g., "Nov # #:#:# boson su: 'su root' failed for demian on /dev/pts/#"). We also count how many times each pattern appears. For this part of the process, we use an associative array (a hash), an array for which the index is a string value rather than a simple numeric sequence. At the end of this section, we have a single hash element for each message type. The index is the string itself and the value is the count.

@logf=( `cat $logf` ); # read log f into an array

foreach $line ( @logf ) {
    $line=~s/\d+/#/g;    # digits to # signs
    $count{$line}++;     # count repeats
}
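To see what the hash holds once this loop has run, here is a minimal, self-contained sketch that applies the same substitution and count to a few made-up log lines. The sub name digest_lines and the sample messages are hypothetical, chosen only for illustration; they are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of digits to '#' and count
# how often each resulting pattern occurs. Returns a hash reference.
sub digest_lines {
    my @lines = @_;            # work on a copy, leave the caller's data alone
    my %count;
    foreach my $line (@lines) {
        $line =~ s/\d+/#/g;    # "Nov 12 08:15:01" becomes "Nov # #:#:#"
        $count{$line}++;       # the pattern itself is the hash key
    }
    return \%count;
}

# Made-up sample data (an assumption, not taken from a real log)
my @sample = (
    "Nov 12 08:15:01 boson su: 'su root' failed for demian on /dev/pts/3\n",
    "Nov 12 09:42:17 boson su: 'su root' failed for demian on /dev/pts/5\n",
    "Nov 13 10:01:44 boson cvs[2211]: login refused for 10/cvsroot\n",
);

my $count = digest_lines(@sample);
foreach my $pattern (sort keys %$count) {
    print "$count->{$pattern}: $pattern";
}
```

The two su failures differ only in their digits, so they collapse into one key with a count of 2, while the cvs message keeps a count of 1.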

We then sort the array, copying it into a new array.

@alpha=sort @logf; # sort the errors

In the next phase, we remove duplicates from the sorted array by copying it to yet another array with Perl's grep function, which keeps a line only when it differs from the line before it.

$prev = 'null'; # remove duplicates
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);
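Because every distinct pattern is already a key of the count hash, the same sorted, duplicate-free list can also be taken straight from the hash. The sketch below is an alternative to the grep step, not the script's own method; it avoids the extra sorted and de-duplicated copies of the full log, which may matter on very large files. The sub name unique_patterns and the sample hash contents are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct patterns in sorted order.
sub unique_patterns {
    my ($count) = @_;            # hash ref: pattern => number of occurrences
    return sort keys %$count;    # hash keys are unique by definition
}

# Made-up counts (an assumption for illustration)
my %count = (
    "Nov # #:#:# boson su: 'su root' failed for demian on /dev/pts/#\n" => 12,
    "Nov # #:#:# boson cvs[#]: login refused for #/cvsroot\n"           => 2,
);

print "$count{$_}: $_" foreach unique_patterns(\%count);
```

Sorting a few dozen hash keys is far cheaper than sorting every line of the log, since the hash holds one entry per pattern rather than one per message.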

In the last phase, we print the count associated with each pattern and then the line itself.

foreach $line (@uniq) {    # uniq lines w counts
    print "$count{$line}: ";
    print "$line";
}

The final output looks like this:

24: Nov # #:#:# boson cvs[#]: login failure (for /cvsroot)
2: Nov # #:#:# boson cvs[#]: login refused for #/cvsroot
1: Nov # #:#:# boson last message repeated # time
12: Nov # #:#:# boson su: 'su root' failed for demian on /dev/pts/#
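The whole digest can also be produced in a single pass. The following is a sketch of a streaming variant, assuming you would rather read the log one line at a time than load it into an array; it is not the script described above. The sub name digest_fh and the sample messages are hypothetical, and the demo reads from an in-memory string so that it runs without a real log file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read lines from an open filehandle one at a time,
# so only the distinct patterns are held in memory, not the whole log.
sub digest_fh {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        $line =~ s/\d+/#/g;    # collapse each run of digits to '#'
        $count{$line}++;
    }
    # One "count: pattern" entry per distinct pattern, in sorted order
    return map { $count{$_} . ": " . $_ } sort keys %count;
}

# Demo on an in-memory "file" of made-up messages (assumption for
# illustration); for a real log, open the file by name instead.
my $text = <<'END';
Nov 12 08:15:01 boson su: 'su root' failed for demian on /dev/pts/3
Nov 12 08:15:09 boson su: 'su root' failed for demian on /dev/pts/3
Nov 13 10:01:44 boson cvs[2211]: login refused for 10/cvsroot
END

open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
print digest_fh($fh);
close($fh);
```

The output has the same "count: pattern" shape as the listing above, with the cvs line first because the patterns are sorted as strings.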

This script uses a number of arrays, but each operation is fairly quick and the entire script with blanks and all is only 24 lines long. My 800,000+ line log file ended up as roughly 24 lines of text. I can peruse that much data before I've consumed my first half cup of coffee each morning.
