We're likely all familiar with the phrase "treating the symptoms". We don't want our doctors prescribing painkillers any time we suffer from some kind of discomfort. We want them to uncover and address the underlying health issues. Even so, we may not apply this same kind of thinking when we're responding to system problems. Rebooting a system that goes down once a week is not a good solution unless you're also looking into the possible causes of the system's crashes. If you reboot a system that crashes in the middle of a workday without ever knowing why, you won't have any control over when it might happen again or whether the symptoms will be worse next time.
Root cause analysis is the process of figuring out the ultimate cause of a particular problem. A technique for ensuring that you don't stop too early was developed by Sakichi Toyoda and is called the "5 Whys". In this technique, you ask "Why?" up to five times, each time getting closer to the real cause. Applied to my missing a weekly staff meeting, it might look like this:
- I missed the morning meeting -- Why?
- I didn't get to the office on time -- Why?
- My car broke down -- Why?
- I ran out of gas -- Why?
- The gas gauge on my car is unreliable -- Why?
- I ignored the recall notice that I got six months ago
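I once tracked the cause of a system going down in the middle of the day to a cron job that someone had used to shut the system down a year earlier. The shutdown had been scheduled so that the system would go down cleanly before a planned power outage, and he'd forgotten to remove the entry afterward. The following year, the date fell on a weekday. One way to avoid this kind of problem is to tie the cron job to the weekday as well as the date. June 9th last year was a Sunday; this year, it's a Monday. Be careful, though: in standard cron, when both the day-of-month and day-of-week fields are restricted, the job runs when either one matches, so an entry like "0 12 9 6 0" would fire on June 9th and on every Sunday in June. To require both, leave the day-of-week field as * and test the weekday in the command itself (note that % must be escaped in a crontab). A cron job like this one would fire only in years when June 9th falls on a Sunday (once every several years) and maybe would have caught someone's eye before it did.
0 12 9 6 * [ "$(date +\%w)" -eq 0 ] && /bin/shutdown -h now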
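If you want to check how often a particular date lands on a particular weekday, you can ask date directly. This is only a sketch: it assumes GNU date (the -d option is not available in every date implementation), and sunday_years is a made-up helper name.

```shell
# Print each year from START to END in which June 9th is a Sunday.
# Assumes GNU date: -d parses the date string, +%w prints 0 for Sunday.
sunday_years() {
    year=$1
    while [ "$year" -le "$2" ]; do
        if [ "$(date -d "$year-06-09" +%w)" -eq 0 ]; then
            echo "$year"
        fi
        year=$((year + 1))
    done
}
```

Running "sunday_years 2013 2030" prints 2013, 2019, 2024 and 2030, one per line, so a date-plus-weekday job for June 9th would stay quiet for five or six years at a stretch.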
Better yet, to avoid unintentional repeats of cron jobs, you might consider using the at command. A job scheduled with at fires once and never again.
It might not take five rounds of asking why to get to the core of your problem, but getting into the habit of asking yourself "Is there a reason for the reason?" will help you get there. Your system didn't just go down because of a cron job, but because someone failed to remove the cron job after it ran. And maybe they failed to remove the cron job because they were fired the next day or because they lost root access to the server or because someone else was supposed to take care of that or because no one really monitors cron jobs.
You might not want to bother with a formal root cause analysis for every little problem you run into. Sometimes it isn't worth the time. But problems that repeat or seriously interfere with productivity should be analyzed and reviewed, if only because doing so adds to your checklist of issues to watch out for. Here's another example of (up to) 5 whys:
- I had to reboot the system -- Why?
- Because it had crashed -- Why?
- Because a very odd process used all available resources -- Why?
- Because it was poorly written or because someone hacked into the system -- Why?
- Because we hired a careless sysadmin and/or the security on our system is weak -- Why?
- Because money is tight and everyone is too busy to test their code or review system security.
Whenever you find yourself fixing the same problem over and over again, you should be collecting evidence that might lead you to the underlying cause. When is the problem occurring? What symptoms are associated with the problem? Can you pick out the characteristics that seem to go along with it? Does it occur every time some other problem occurs? Every time certain processes are running? Every time certain users are logged in? Ask yourself what you know about the problem. Collect any information that might be relevant. This might include:
- the content of any error messages you see (e.g., unusual messages in your log files)
- a list of the processes that are running
- a display of disk space and memory usage
- a listing of active network connections
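Most of the items above can be captured in one go with a small script. The following is a minimal sketch, not a finished tool: take_snapshot and the default directory /var/tmp/rca-snapshots are hypothetical names, free is assumed to exist only on Linux (vmstat is the fallback), and ss stands in where netstat isn't installed. Log messages are left out because their location varies from system to system.

```shell
# Write one timestamped snapshot of system state to a file and print its path.
# The first argument is the directory to write to (hypothetical default below).
take_snapshot() {
    snap_dir=${1:-/var/tmp/rca-snapshots}
    mkdir -p "$snap_dir" || return 1
    out="$snap_dir/snapshot-$(date +%Y%m%d-%H%M%S).txt"
    {
        echo "== date ==";      date
        echo "== who ==";       who
        echo "== processes =="; ps -ef
        echo "== disk ==";      df -k
        echo "== memory ==";    free -m 2>/dev/null || vmstat
        echo "== network ==";   netstat -an 2>/dev/null || ss -an
    } > "$out" 2>&1
    echo "$out"
}

# Example: take_snapshot /var/tmp/rca-snapshots
```

Because each snapshot lands in its own timestamped file, you can compare one taken during a problem with one taken when the system was healthy.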
If a problem occurs fairly often, you might want to collect performance data (e.g., sar output) along with ps and netstat output every 5-10 minutes, and examine it after a problem has occurred to see if anything looked unusual when the problem emerged. Also, make sure that you determine as soon as you can whether the problem affects all users or just one or two. If it affects more than one person but not everyone, what do those people have in common?
One of the most effective techniques when diagnosing a problem is to see if you can reproduce it. If a problem seems to occur every time more than 20 users are logged in, log in as more than 20 users and see if it does. Questions that might help include:
- What is happening?
- Does it happen regularly?
- What processes are running at that time?
- Who is logged in?
- What are the symptoms/effects?
- Is there any evidence of the problem in your log files?
- How long has the problem been occurring? When did we first see it?
- Are there other problems that may be related?
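For the questions about log evidence and timing, a rough count of suspicious log lines per hour can show whether a problem clusters at certain times and when it first appeared. This is a sketch under assumptions: errors_per_hour is a made-up name, the keyword list is only illustrative, and the log is assumed to use traditional syslog timestamps ("Jun  9 12:00:01 ..."), which not every system does.

```shell
# Count log lines that mention common trouble keywords, grouped by hour.
# Assumes traditional syslog timestamps: month, day, HH:MM:SS in fields 1-3.
errors_per_hour() {
    awk 'tolower($0) ~ /error|fail|panic|out of memory/ {
             key = $1 " " $2 " " substr($3, 1, 2) ":00"
             if (!(key in count)) order[++n] = key   # remember first-seen order
             count[key]++
         }
         END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }' "$1"
}

# Example (the log path is an assumption): errors_per_hour /var/log/messages
```

Each output line is a count followed by the month, day and hour, in the order the hours first appear in the log.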
Drill down to whatever flaw in procedure or the system itself is ultimately responsible, but avoid getting down to "because no one is infallible" or "because crap happens". Stop when you've identified a problem that you can address. ;-)
Always consider what you can do to remove or alleviate the cause. Maybe you need a more rigorous procedure for testing applications and scripts before they are implemented on your production systems. Maybe your servers need more memory or you need a cluster configured so that any one server going down has no observable effect on overall availability.
Some root causes might not be worth addressing. An application that fails once a year but would cost a quarter of a million dollars to replace should maybe just go down once a year, especially if you can choose when it goes down. Always consider whether the cure is something you or your organization can afford. How does the cost of the problem compare to the cost of the fix?
Digging down to a root cause can be very gratifying if you can pin down the problem and overcome it, but just thinking through the possible causes will keep you on your toes and alert to anything unusual happening on your systems.
Read more of Sandra Henry-Stocker's Unix as a Second Language blog.