Unix Tip: Debugging tales: SSH command failure

In a typical work week, a Unix systems administrator is likely to have at least one small mystery to solve -- one "huh?", one "that doesn't make any sense" or one "I've never seen this before". Most of the time when I find myself baffled by something on one of the systems I manage, it's because I've overlooked some element of the problem. Soon afterwards, I'm usually saying "oh, yes, of course!", having pulled the missing piece into focus during my review of how things are supposed to work. In this week's column, we'll follow my train of thought as I poked through one such small mystery.

The onset of this particular puzzle was noticing that I could not successfully run a remote shell command from one of two servers that I use as launch pads in managing many other servers. On one such server, the command retrieved the requested information. On the other, the same command to the same system failed with an unexpected error.

Logged into the first server, remote commands worked just fine:

    # ssh beanybaby date
    Mon Nov  8 08:26:17 EST 2004

From the other server, I got the error shown here:

    # ssh beanybaby date
    ssh: connect to host beanybaby port 22: Connection timed out

I was in the process of verifying that a relatively large number of systems will each accept a superuser command run from either of the two secured systems. This configuration will come in very handy if ever I need to shut them all down, remove an account from all of them in short order or install an important patch.

Immediately, I began to run through a list of some things that might be wrong and what I'd expect to see in each case. For example, if the hostname wasn't resolving on one of the servers, I'd be getting a response like this:

    ssh: beanybaby: host/servname not known

If the remote host wasn't configured to allow password-free SSH commands, I'd be prompted to enter a password:

    root@beanybaby's password:
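
Password-free commands like these rely on public-key authentication having been set up in advance. A minimal sketch of that setup, assuming typical OpenSSH defaults (the key type and paths here are illustrative, not details from this environment):

```shell
# Generate a key pair with an empty passphrase on the launch-pad server,
# then install the public half on the target host.  ssh-copy-id appends
# it to the remote account's ~/.ssh/authorized_keys.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id root@beanybaby
```

Once that is done, "ssh beanybaby date" runs without stopping at a password prompt.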

If the remote system's fingerprint wasn't in the local known_hosts file (~/.ssh/known_hosts), I'd be getting this:

    The authenticity of host 'beanybaby (10.9.8.7)' can't be established.
    DSA key fingerprint is 6a:7f:a0:ac:bc:28:3a:7f:10:38:83:e1:0b:27:95:6f.
    Are you sure you want to continue connecting (yes/no)? 

While the particular error generated by my ssh command suggests a problem connecting to port 22 (the SSH port) on the target system, the problem clearly could not lie with the sshd process on that system, because the same connection request worked properly from the first server.
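
The exact wording of the error narrows things down more than it might appear (an observation of mine, not from the original diagnosis): "Connection refused" means the host answered but nothing was listening on the port, while "Connection timed out" means the packets went to an address that never answered at all -- typically a wrong IP address, a firewall, or a machine that is simply gone. One quick way to see the distinction, using bash's /dev/tcp device:

```shell
# Port 9 (discard) is almost always closed, so this connection is refused
# immediately; an unreachable address would instead hang until the
# two-second timeout fired.  Purely an illustration, not part of the
# original troubleshooting session.
if timeout 2 bash -c 'exec 3<>/dev/tcp/127.0.0.1/9' 2>/dev/null; then
    echo "open"
else
    echo "closed or filtered"
fi
```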

Same Target?

The next thing that I questioned was whether the two servers were actually connecting to the same system when I issued my "ssh beanybaby" command. Using nslookup, I was quickly able to determine that they were both pulling the same information from DNS. Both servers responded with the output shown below.

    > nslookup beanybaby
    Server:  ns1
    Address:  127.0.0.1

    Name:    beanybaby.example.org
    Address:  10.10.2.11
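
One wrinkle worth remembering here (my aside, not part of the original checks): nslookup queries the DNS server directly and never consults /etc/hosts, while getent hosts follows the lookup order in /etc/nsswitch.conf and so reports the answer that ssh will actually use. Using localhost, which appears in /etc/hosts on virtually every system:

```shell
# getent honors nsswitch.conf (typically "hosts: files dns"), so a local
# /etc/hosts entry wins here even when DNS disagrees -- precisely the
# kind of discrepancy that nslookup alone cannot reveal.
getent hosts localhost
```
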

Knowing that the local hosts file is generally consulted before DNS, however, I then checked the /etc/hosts file on each system. On one of the servers, I noticed this:

    # grep beanybaby /etc/hosts
    #10.10.2.11     beanybaby        beanybaby.example.org
    10.9.1.90       beanybaby        beanybaby.example.org

Aha! On one of the servers (the working one), someone had commented out the host entry for beanybaby and replaced it with another. This was clearly the source of the discrepancy.

As it turned out, the change made to the /etc/hosts file on the first of the two servers corresponded to a redeployment of the particular system on a different subnet. The problem I ran into came about because the change was made locally on one server and wasn't folded into the zone file on the DNS server. The "Connection timed out" message came about because the hostname resolved, but the resultant IP address was no longer valid.
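
When local overrides like this are a possibility, it can pay to scan for them explicitly. Here is a small sketch of mine (the file path and hostname are just the ones from this story; the script itself is not part of the original procedure):

```shell
#!/bin/sh
# Print active and commented-out entries matching a name in a
# hosts-format file, so silent local overrides like the one above
# are easy to spot.
file=${1:-/etc/hosts}
name=${2:-beanybaby}
awk -v n="$name" '$0 ~ n {
    if ($1 ~ /^#/) print "commented:", $0
    else           print "active:   ", $0
}' "$file"
```

Run against the hosts file shown above, it would flag both the commented-out 10.10.2.11 entry and the active 10.9.1.90 one.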

After determining why my ssh command had failed, I updated the DNS zone file to reflect the new location of the system in question and gave a little thought to the process of maintaining proper records in an environment in which systems move from one subnet to another with some frequency. I also gave some thought to the troubleshooting process. The two approaches that seem most popular are asking what has changed recently that might account for the problem we have run into and comparing similar systems to determine what is different between the two. Either approach would have been useful in this case, but the first would require that I know more about what other sysadmins are doing.

One trick that I use to verify that a system is properly registered in DNS is to ask it to reflect its hostname back to me with a command like this:

    # ssh beanybaby uname -n
    beanybaby

Since a fairly large number of the systems I manage are configured to respond to such commands, I can run this command against a collection of them with a simple loop on the command line:

    # for server in `cat server_list`
    > do
    >     echo $server
    >     ssh $server "uname -n"
    > done
    spongebob
    spongebob
    barbie
    barbie
    powerranger
    powerranger
    ...

If any of the hosts doesn't respond, I jot down its name and check it separately to determine whether it has been moved or its configuration has changed. Inaccessible systems slow the loop down, but not enough to waste much of my time.
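
One refinement I might add (my own variant, using standard OpenSSH client options, not something from the original loop): ConnectTimeout caps how long a dead host can stall the loop, and BatchMode makes ssh fail outright instead of stopping at a password prompt when key authentication is broken.

```shell
# server_list holds one hostname per line, as in the loop above; the
# sample names here stand in for real hosts and will simply fail fast.
printf '%s\n' spongebob barbie powerranger > server_list

while read -r server
do
    echo "$server"
    ssh -o ConnectTimeout=5 -o BatchMode=yes "$server" uname -n 2>/dev/null \
        || echo "    (no response)"
done < server_list
```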

There are eight million stories in the Unix sysadmin's bag of tricks; this has been one of them.
