Digesting log data

Reducing voluminous log data to a size that can be read and understood in a matter of minutes can make the difference between systems administrators reviewing log data on a routine basis and reviewing it only when a problem has become so noticeable that an analysis is unavoidable.

I insist on two criteria when digesting log data. The first is that no messages are omitted. If I only look for messages that I know to be potential problems (like those that include the word "warning"), I may easily overlook many other problems of immediate or emerging importance. The second is to include a count of how many times each message has appeared; this gives me a sense of the severity of each problem.

Though I've created scripts to digest log data at various times in my career, using a variety of tools, my most recent attempt, written in Perl, has some advantages. One is that it works where similarly constructed shell scripts fail for lack of resources. Another is that the code itself is surprisingly simple.

The reason I turned to Perl is easy to explain. When I attempted to digest a particularly large log file on the command line using standard Unix utilities, my system balked with a complaint that no space was left on the device - despite the fact that I used the most terse and lightweight command that I could conjure. I had sorted the file and passed it to the uniq command with a -c argument intended to give me the number of times each pattern occurred. This is what I got:

$ sort logfile | uniq -c
sort: write error while sorting: No space left on device

While this modest little Unix command will work for most files most of the time, my file was more than 800,000 lines long. When I replaced this command with a Perl script, I had my results (on repeated runs) in anywhere from 12 to 20 seconds. A typical messages file takes only a few seconds.

The primary "trick" to this Perl script is making good use of arrays.

The first thing we do in this script is check that a log file name has been provided on the command line and assign that name to a variable.

#!/bin/perl

if ( $#ARGV != 0 ) {
    print "usage: $0 logfile\n";
    exit;
}
$logf=$ARGV[0];

In the following lines, we read the log file into an array. We then change all occurrences of digits to single pound signs to reduce the uniqueness of our data and increase the level of compression. This reduces dates and times, for example, to strings that all look the same (e.g., "Nov # #:#:# boson su: 'su root' failed for demian on /dev/pts/#"). We also count how many times each of the resulting patterns appears. For this part of the process, we use an associative array - an array indexed by a string value rather than a simple numeric sequence. At the end of this section, we have a single array element for each message type: the index is the message string itself and the value is the count.

@logf=( `cat $logf` );          # read log file into an array

foreach $line ( @logf ) {
    $line=~s/\d+/#/g;           # digits to # signs
    $count{$line}++;            # count repeats
}

We then sort the array, copying it into a new array.

@alpha=sort @logf; # sort the errors

In the next phase, we remove duplicates from the sorted array, using Perl's grep function to copy only the unique lines into yet another array.

$prev = 'null'; # remove duplicates
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);

In the last phase, we print the count associated with each pattern and then the line itself.

foreach $line (@uniq) {         # uniq lines w counts
    print "$count{$line}: ";
    print "$line";
}

The final output looks like this:

24: Nov # #:#:# boson cvs[#]: login failure (for /cvsroot)
2: Nov # #:#:# boson cvs[#]: login refused for #/cvsroot
1: Nov # #:#:# boson last message repeated # time
12: Nov # #:#:# boson su: 'su root' failed for demian on /dev/pts/#

This script uses a number of arrays, but each operation is fairly quick, and the entire script, blank lines and all, is only 24 lines long. My 800,000+ line log file ended up as roughly 24 lines of text. I can peruse that much data before I've consumed my first half cup of coffee each morning.
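For reference, here is the complete script assembled from the pieces above (the word "logfile" in the usage message is a reconstruction of text lost in formatting; the rest is as shown earlier), coming to 24 lines with the blank lines included:

#!/bin/perl

if ( $#ARGV != 0 ) {
    print "usage: $0 logfile\n";
    exit;
}
$logf=$ARGV[0];

@logf=( `cat $logf` );          # read log file into an array

foreach $line ( @logf ) {
    $line=~s/\d+/#/g;           # digits to # signs
    $count{$line}++;            # count repeats
}

@alpha=sort @logf;              # sort the errors

$prev = 'null';                 # remove duplicates
@uniq = grep($_ ne $prev && ($prev = $_), @alpha);

foreach $line (@uniq) {         # uniq lines w counts
    print "$count{$line}: ";
    print "$line";
}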
