Unix tip: Monitoring network switches

While helping a friend with a problem on one of his Cisco switches, I noticed that all of his switches were logging to one of his admin servers. The log they were all writing to was not only very large, it also contained several years' worth of messages, including a large number of errors. Centralizing log files is certainly a good idea, since it gives you a way to check the status of all of your switches at once, but you have to remember to check the centralized log periodically for evidence of problems. Otherwise, you lose all the benefit it could provide in monitoring the health of your LAN.

Many of the errors in the lengthy log file were surprisingly vague. They reported only that various ports on some of the switches were "experiencing errors" and failed to mention the nature of the errors. Here's an example:

Feb 02 11:10:12 switch11-2 241: 000238: Feb 02 11:10:11 UTC: %LINK-4-ERROR:
FastEthernet0/7 is experiencing errors

When we checked the switches and the systems attached to the ports for which these errors were generated, we found mismatched duplex settings. Either the switch port was set to full duplex and the network adaptor on the server was set to half duplex, or vice versa.

In case these terms aren't in the lingo you sling in your day-to-day life, let's quickly review what they mean. If a network interface is operating in half duplex, systems can communicate through that interface in either direction, but in only one direction at a time. If that same network interface is operating in full duplex, the systems can communicate in both directions at the same time.

While the systems for which these complaints were being logged were all still operational, network performance is reduced when duplex settings are not the same on the switch port and the connected system. In addition, intermittent connectivity can occur. So, we decided to attack the problems systematically. First, we generated a list of the systems showing errors by extracting and summarizing the errors in the log file like this:

boson# grep "is experiencing errors" cisco.log | awk '{print $4,$12}' | sort | uniq -c
   635 switch11-2 FastEthernet0/16
   117 switch4-10 FastEthernet0/7
   ...
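
The same tally can also be done in a single awk pass that counts in an associative array, so the order of the log entries doesn't matter, and then lists the noisiest ports first. This is only a sketch: the function name summarize_port_errors is hypothetical, and the field positions ($4 for the switch, $12 for the port) are assumed from the sample log line shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: count "is experiencing errors" messages per switch port.
# Assumes the log layout of the sample line above, where field 4 is the
# switch name and field 12 is the port.
summarize_port_errors() {
    awk '/is experiencing errors/ { count[$4 " " $12]++ }
         END { for (key in count) printf "%7d %s\n", count[key], key }' "$1" |
        sort -rn    # highest counts first
}

# Example: summarize_port_errors cisco.log
```

If your switches log in a different layout, the two field numbers are the only things that should need adjusting.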

We then compared the settings on each switch port with the settings on the network interface on the Solaris servers. To check the duplex settings of a network interface, we used the ndd command. For example, you can tell from the output below that the network interface on boson is set to advertise 100 Mbps full duplex (fdx) and not half duplex (hdx). The adv_ parameters show what the interface is configured to offer; the read-only link_mode parameter reports the mode actually in use.

boson# ndd -get /dev/dmfe1 adv_100fdx_cap
1
boson# ndd -get /dev/dmfe1 adv_100hdx_cap
0

We then compared the ndd output with the output of the "show interfaces" command on the Cisco switches. If the network adaptor on the Solaris system was running in half duplex mode, the corresponding "ndd -set" commands would fix the problem. To find out all the parameters available for your particular interface, use an ndd command like that shown below:

bash-2.03# ndd /dev/dmfe1 \?
?                             (read only)
link_status                   (read only)
link_speed                    (read only)
link_mode                     (read only)
adv_autoneg_cap               (read and write)
adv_100T4_cap                 (read and write)
adv_100fdx_cap                (read and write)
adv_100hdx_cap                (read and write)
adv_10fdx_cap                 (read and write)
adv_10hdx_cap                 (read and write)
autoneg_cap                   (read only)
100T4_cap                     (read only)
100fdx_cap                    (read only)
100hdx_cap                    (read only)
10fdx_cap                     (read only)
10hdx_cap                     (read only)
lp_autoneg_cap                (read only)
lp_100T4_cap                  (read only)
lp_100fdx_cap                 (read only)
lp_100hdx_cap                 (read only)
lp_10fdx_cap                  (read only)
lp_10hdx_cap                  (read only)
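
Using the writable parameters listed above, forcing the Solaris side to 100 Mbps full duplex might look like the sketch below. The function name force_100fdx and the DRY_RUN switch are illustrative assumptions, not part of ndd, and setting adv_autoneg_cap last is an assumed ordering that follows common practice for these drivers.

```shell
#!/bin/sh
# Sketch: force a Solaris interface to 100 Mbps full duplex with ndd.
# Parameter names come from the listing above; force_100fdx and DRY_RUN
# are hypothetical names used only for this example.
force_100fdx() {
    dev=$1                  # e.g. /dev/dmfe1
    run=${DRY_RUN:+echo}    # with DRY_RUN set, print the commands instead

    # Advertise only 100 Mbps full duplex ...
    $run ndd -set "$dev" adv_100fdx_cap 1
    $run ndd -set "$dev" adv_100T4_cap 0
    $run ndd -set "$dev" adv_100hdx_cap 0
    $run ndd -set "$dev" adv_10fdx_cap 0
    $run ndd -set "$dev" adv_10hdx_cap 0
    # ... then disable autonegotiation (assumed to go last).
    $run ndd -set "$dev" adv_autoneg_cap 0
}

# Preview the commands without touching the interface:
DRY_RUN=1 force_100fdx /dev/dmfe1
```

Settings made with ndd do not survive a reboot, so they would also need to go into a startup script or similar. The switch port has to be set to match, or the mismatch simply moves to the other end of the cable.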

Once the switch port and server in each instance of a mismatch were both running in full duplex mode, the "is experiencing errors" messages stopped showing up in the log files. To make sure we wouldn't fail to notice if problems of this type were to show up again, we also added a simple script to the log server to send us daily email with a summary of the recent messages.

In generating a log summary, we needed to include the date in our analysis as well as the switch names and ports for which the problems were reported. Since we wanted to see only the last couple of days' worth of errors, we first generated a file containing the last two dates as they would appear in the file. We then generated summaries of messages by date:

#!/bin/bash

LOG=/var/log/cisco.log

# generate list of last two dates in the log (first six characters of each line)
awk '{print substr($0,1,6)}' "$LOG" | uniq | tail -2 > /tmp/cisco$$

# generate summaries of messages by date
while read -r dt
do
    # quote the date so a space-padded day such as "Feb  5" prints unchanged
    echo "$dt"
    # count messages per switch and last word, skipping the daily [Start] tags
    grep "^$dt" "$LOG" | awk '{print $4,$NF}' | grep -v Start | sort | uniq -c
done < /tmp/cisco$$

rm /tmp/cisco$$

This simple script prints only the date, the switch name and the last word of each line. This is hardly enough to tell us what's wrong, but it's just enough to tell us whether or not we should go look at the log file. Most days since it was implemented, we have seen only the dates and a couple of summary lines.

To make this work, however, we need to make sure that the most recent dates appear in the file whether or not errors or other messages were logged for that day. Otherwise we could get the same data, day after day, even though it was no longer current. To avoid this, we run a cron job at the beginning of every day to add a tag such as "Feb 6 [Start]" to the log file.

# Check on Switches
0 0 * * * /usr/local/bin/dtCiscoLog
5 13 * * * /usr/local/bin/ckCiscoLog | mailx -s "Cisco Log analysis" david@oops.org
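
The contents of dtCiscoLog aren't shown here, but a script of that kind could be as small as the sketch below. The function name add_start_tag is hypothetical, and the date format is an assumption: it pads the day with a space, the way the summary output prints it. If your log zero-pads the day, as the sample error line does, use %d in place of %e.

```shell
#!/bin/sh
# Sketch of what a script like dtCiscoLog might do (an assumption; the
# original script is not shown). Appends a tag such as "Feb  6 [Start]".
add_start_tag() {
    log=$1
    # %b = abbreviated month, %e = day of month padded with a space.
    # LC_ALL=C keeps the month name in English regardless of locale.
    LC_ALL=C date '+%b %e [Start]' >> "$log"
}

# Example: add_start_tag /var/log/cisco.log
```

Because the tag starts with the same six-character date prefix as the real log entries, the summary script picks it up when it collects the last two dates, and its "grep -v Start" keeps the tag itself out of the counts.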

When the ckCiscoLog summary script runs, we get something like this:

Feb  5
   1 switch17-3 errors
Feb  6

The cron job that runs the script once a day sends this output to us. If there's nothing to report, the email will include only the dates, as you see above for Feb 6. If any errors appear in the log, we'd see a single line for each switch and error type, including a count of how many times that particular error occurred on that switch.
