Unix tip: Monitoring network switches

By Sandra Henry-Stocker

While helping a friend with a problem on one of his Cisco switches, I noticed that all of his switches were logging to one of his admin servers. The log they were all writing to was not only very large; it contained several years' worth of messages, including a large number of errors. Centralizing log files is certainly a good idea, since it gives you one place to check on the status of all of your switches at once, but you have to remember to check the centralized log periodically for evidence of problems. Otherwise, you lose all the benefit it could be providing in monitoring the health of your LAN.

Many of the errors in the lengthy log file were surprisingly vague. They reported only that various ports on some of the switches were "experiencing errors" and failed to mention the nature of the errors. Here's an example:

Feb 02 11:10:12 switch11-2 241: 000238: Feb 02 11:10:11 UTC: %LINK-4-ERROR:
FastEthernet0/7 is experiencing errors

When we checked the switches and the systems attached to the ports for which these errors were generated, we found that there was a mismatch between the duplex settings. Either the switch port was set to full duplex and the network adaptor on the server was set to half duplex or vice versa.

In case these terms aren't in the lingo you sling in your day-to-day life, let's quickly review what they mean. If a network interface is operating in half duplex, systems can communicate through that interface in either direction, but in only one direction at a time. If that same network interface is operating in full duplex, the systems can communicate in both directions at the same time.

While the systems for which these complaints were being logged were all still operational, network performance is reduced when duplex settings are not the same on the switch port and the connected system. In addition, intermittent connectivity can occur. So, we decided to systematically attack the problems. First, we generated a list of the systems showing errors by extracting and summarizing the errors in the log file like this:

boson# grep "is experiencing errors" cisco.log | awk '{print $4,$12}' | sort | uniq -c
   635 switch11-2 FastEthernet0/16
   117 switch4-10 FastEthernet0/7
   ...
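
Because uniq -c only collapses adjacent duplicate lines, sorting before counting gives one total per switch port. As a rough sketch under that assumption, the pipeline can be wrapped in a function (summarize_port_errors is a name made up for this example) that also lists the noisiest ports first. The field positions assume the log format in the example line shown earlier.

```shell
#!/bin/bash
# Sketch: count "is experiencing errors" messages per switch port,
# busiest ports first. Field positions ($4 = switch, $12 = port)
# assume the log format shown in the example line earlier.

summarize_port_errors() {
    # $1 is the path to the centralized log file
    grep "is experiencing errors" "$1" |
        awk '{print $4, $12}' |
        sort | uniq -c | sort -rn
}

# Run against a log file only when one is named on the command line.
if [ $# -gt 0 ]; then
    summarize_port_errors "$1"
fi
```

Sorting by count puts the ports most in need of attention at the top of the list.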

We then compared the settings on each switch port with the settings on the network interface of each Solaris server. To check the duplex settings on a network interface, we used the ndd command. For example, the output below shows that the network interface on boson is set to advertise 100 Mbps full duplex (fdx) and not 100 Mbps half duplex (hdx).

boson# ndd -get /dev/dmfe1 adv_100fdx_cap
1
boson# ndd -get /dev/dmfe1 adv_100hdx_cap
0

We then compared the ndd output with the output of the "show interfaces" command on the Cisco switches. If the network adaptor on the Solaris system was running in half duplex mode, the corresponding "ndd -set" commands would fix the problem.
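
As an illustration only, here is a sketch of what those "ndd -set" commands might look like when forcing an interface to 100 Mbps full duplex. The parameter names come from the ndd listing for the dmfe driver shown in this article; the function name print_force_100fdx and the order of the commands are assumptions. The sketch only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/bash
# Sketch: print the ndd commands that would force an interface to
# 100 Mbps full duplex. Nothing is executed here; review the output,
# then run it as root on the Solaris host.

print_force_100fdx() {
    local dev=$1
    # Advertise 100 Mbps full duplex ...
    echo "ndd -set $dev adv_100fdx_cap 1"
    # ... and none of the other modes
    for cap in adv_100T4_cap adv_100hdx_cap adv_10fdx_cap adv_10hdx_cap
    do
        echo "ndd -set $dev $cap 0"
    done
    # Turn off autonegotiation last (ordering is an assumption)
    echo "ndd -set $dev adv_autoneg_cap 0"
}

# /dev/dmfe1 is only an example device name
print_force_100fdx "${1:-/dev/dmfe1}"
```

Settings made with ndd do not survive a reboot, and the switch port needs to be hard-set to full duplex as well, since forcing only one end of the link is a common cause of exactly this kind of mismatch.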

To find out all the parameters available for your particular interface, use an ndd command like that shown below:

bash-2.03# ndd /dev/dmfe1 \?
?                             (read only)
link_status                   (read only)
link_speed                    (read only)
link_mode                     (read only)
adv_autoneg_cap               (read and write)
adv_100T4_cap                 (read and write)
adv_100fdx_cap                (read and write)
adv_100hdx_cap                (read and write)
adv_10fdx_cap                 (read and write)
adv_10hdx_cap                 (read and write)
autoneg_cap                   (read only)
100T4_cap                     (read only)
100fdx_cap                    (read only)
100hdx_cap                    (read only)
10fdx_cap                     (read only)
10hdx_cap                     (read only)
lp_autoneg_cap                (read only)
lp_100T4_cap                  (read only)
lp_100fdx_cap                 (read only)
lp_100hdx_cap                 (read only)
lp_10fdx_cap                  (read only)
lp_10hdx_cap                  (read only)
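
If you only care about the settings you can change, a small filter can pick out the read and write parameters. This is a sketch; writable_params is a hypothetical helper name, and it assumes the listing format shown above.

```shell
#!/bin/bash
# Sketch: list only the tunable (read and write) parameters from
# "ndd <device> \?" output supplied on standard input.

writable_params() {
    awk '/\(read and write\)/ {print $1}'
}

# On a Solaris host (the device name is only an example):
#   ndd /dev/dmfe1 \? | writable_params
#
# To show the current value of each tunable parameter:
#   for p in $(ndd /dev/dmfe1 \? | writable_params); do
#       echo "$p = $(ndd -get /dev/dmfe1 $p)"
#   done
```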

Once the switch port and server in each instance of a mismatch were both running in full duplex mode, the "is experiencing errors" messages stopped showing up in the log files. To make sure we would notice if problems of this type showed up again, we also added a simple script to the log server to send us a daily email with a summary of the recent messages.

In generating a log summary, we needed to include the date in our analysis as well as the switch names and ports for which the problems were reported.

Since we wanted to see only the last couple of days' worth of errors, we first generated a file containing the last two dates as they appear in the log file.

We then generated summaries of messages by date:

#!/bin/bash

LOG=/var/log/cisco.log
DATES=/tmp/cisco$$

# generate list of last two dates in the log
awk '{print substr($0,1,6)}' "$LOG" | uniq | tail -2 > "$DATES"

# generate summaries of messages by date
while read -r dt
do
    echo "$dt"
    # match the date only at the start of the line, where syslog puts it
    grep "^$dt" "$LOG" | awk '{print $4,$NF}' | grep -v Start | sort | uniq -c
done < "$DATES"

rm -f "$DATES"

This simple script prints only the date, the switch name and the last word of each line. This is hardly enough to tell us what's wrong, but it's just enough to tell us whether or not we should go look at the log file. Most days since it was implemented, we have seen only the dates and a couple of summary lines.

To make this work, however, we need to make sure that the most recent dates appear in the file whether or not errors or other messages were logged for that day. Otherwise we could get the same data, day after day, even though it was no longer current. To avoid this, we run a cron job at the beginning of every day to add a tag such as "Feb 6 [Start]" to the log file.

# Check on Switches
0 0 * * * /usr/local/bin/dtCiscoLog
5 13 * * * /usr/local/bin/ckCiscoLog | mailx -s "Cisco Log analysis" david@oops.org
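
The dtCiscoLog script is not shown here. A minimal sketch of what such a script might contain follows; the function name append_start_tag is made up for this example, and the date format is an assumption that must match the one your syslog daemon writes at the start of each line.

```shell
#!/bin/bash
# Hypothetical sketch of a dtCiscoLog-style script: append a dated
# "[Start]" marker so every day appears in the log at least once.

append_start_tag() {
    # $1 is the log file to tag.
    # %b is the abbreviated month and %e the day padded with a space,
    # giving a line such as "Feb  6 [Start]". Use %d instead of %e if
    # your log zero-pads the day. LC_ALL=C keeps month names in English.
    echo "$(LC_ALL=C date '+%b %e') [Start]" >> "$1"
}

# In the cron-driven script the last line would be:
#   append_start_tag /var/log/cisco.log
```

Since the marker's last word is "[Start]", the "grep -v Start" in the summary script keeps it out of the counts while the date itself still shows up.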

When we run the script, we then get something like this:

Feb  5
   1 switch17-3 errors
Feb  6

The cron job that runs the script once a day sends this output to us. If there's nothing to report, the email will include only the dates, as you see above for Feb 6. If any errors appear in the log, we see a single line for each switch and error type, including a count of how many times that error occurred on that switch.

Comments

    Anonymous 2 years ago
    If you do not have access to the Cisco logs, you can run "netstat -i"; if Ierrs and Oerrs are not equal to 0, you probably have a problem. Your network performance will also be terrible. On Solaris 10, you can run "dladm show-dev" as root to show the speed and duplex of your network interfaces. I still use "ndd" to modify the network interface settings, but you might be able to use "dladm". I wrote a simple script that uses a combination of "ifconfig -a", "netstat -i", "netstat -r", "netstat -rav" and "dladm show-dev" to give me all the info I need.
    Anonymous 3 years ago
    On a Linux system, as root you can run "ethtool eth0", and it will tell you your speed and whether you are running in full or half duplex mode.
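
The "netstat -i" check suggested in the comments might be sketched like this. It assumes Solaris-style column headings (Name, Ierrs, Oerrs); the function name report_if_errors is hypothetical, and other systems label these columns differently.

```shell
#!/bin/bash
# Sketch: read "netstat -i" output on standard input and report any
# interface whose input or output error counter is not zero.

report_if_errors() {
    awk '
        # Header line: remember which columns hold Ierrs and Oerrs
        $1 == "Name" {
            for (i = 1; i <= NF; i++) {
                if ($i == "Ierrs") ic = i
                if ($i == "Oerrs") oc = i
            }
            next
        }
        # Data lines: print interfaces with a nonzero error counter
        ic && oc && ($ic + 0 > 0 || $oc + 0 > 0) {
            print $1, "Ierrs=" $ic, "Oerrs=" $oc
        }
    '
}

# Typical use:
#   netstat -i | report_if_errors
```

Lines with a missing column would shift the counters, so treat the result as a prompt to look at the full netstat output rather than as a final answer.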
