Digesting log data
Reducing voluminous log data to something that can be read and understood in a matter of minutes can make the difference between systems administrators reviewing their logs routinely and reviewing them only when a problem has become so noticeable that an analysis is unavoidable.
I insist on two criteria when digesting log data. The first is that no messages are omitted. If I only look for messages that I know to be potential problems (like those that include the word "warning"), I may easily overlook many other problems of an immediate or emerging importance. The second is to include a count of how many times each message has appeared. This gives me a sense of the severity of each problem.
I've written scripts to digest log data at various times in my career, using a variety of tools, but my most recent attempt, in Perl, has some advantages. One is that it works where similarly constructed shell scripts fail for lack of resources. Another is that the code itself is surprisingly simple.
The reason I turned to Perl is easy to explain. When I attempted to digest a particularly large log file on the command line using standard Unix utilities, my system balked with a complaint that no space was left on the device, even though I used the tersest and most lightweight command that I could conjure. I had sorted the file and passed it to the uniq command with the -c option, which prefixes each distinct line with the number of times it occurred. This is what I got:
$ sort logfile | uniq -c
sort: write error while sorting: No space left on device
While this modest little Unix command will work for most files most of the time, my file was more than 800,000 lines long, and sort ran out of room for the temporary files it writes while sorting. When I replaced this command with a Perl script, I had my results (on repeated runs) in anywhere from 12 to 20 seconds. A typical messages file takes only a few seconds.
The primary "trick" in this Perl script is making good use of arrays, in particular an associative array.
The first thing we do in this script is check that exactly one argument, the log file name, was supplied on the command line, and assign that name to a variable.
#!/bin/perl
if ( $#ARGV != 0 ) {
    print "usage: $0 <logfile>\n";
    exit;
}
$logf = $ARGV[0];
In the following lines, we read the log file into an array. We then change all occurrences of digits to single pound signs to reduce the uniqueness of our data and increase the level of compression. This reduces dates and times, for example, to strings that all look the same (e.g., "12: Nov # #:#:# boson su: 'su root' failed for demian on /dev/pts/#"). We also count how many times each of the resulting patterns appears. For this part of the process, we use an associative array: an array for which the index is a string value rather than a simple numeric sequence. At the end of this section, we have a single array element for each message type. The index is the string itself and the value is the count.
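The steps just described can be sketched roughly as follows. This is a minimal illustration, not the original script: the subroutine names normalize and digest are hypothetical, the sketch reads the file one line at a time instead of loading it into an array first, and the "count: message" output format is an assumption based loosely on the example string above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collapse every run of digits to a single "#" so that lines differing
# only in dates, times, or device numbers share one pattern.
# (normalize is a hypothetical helper name, not from the original script.)
sub normalize {
    my ($line) = @_;
    chomp $line;
    $line =~ s/\d+/#/g;
    return $line;
}

# Read the named file and count each normalized pattern in an
# associative array (hash). Returns a reference to that hash:
# pattern => number of occurrences.
sub digest {
    my ($logf) = @_;
    my %count;
    open(my $fh, '<', $logf) or die "cannot open $logf: $!\n";
    while (my $line = <$fh>) {
        $count{ normalize($line) }++;
    }
    close($fh);
    return \%count;
}

# Run only when invoked with exactly one file name.
if (@ARGV == 1) {
    my $count = digest($ARGV[0]);
    # Most frequent patterns first; ties broken alphabetically.
    for my $msg (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
        print "$count->{$msg}: $msg\n";
    }
}
```

Because only one hash entry is kept per distinct pattern, memory use grows with the number of message types rather than the number of lines, and nothing is written to temporary files. Every line in the log contributes to some entry, so no message is omitted from the digest.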