Unix: Root cause analysis

But problems that repeat or seriously interfere with productivity should be analyzed and reviewed if only because doing so adds to your checklist of issues to watch out for. Here's another example of (up to) 5 whys:

  • I had to reboot the system -- Why?
  • Because it had crashed -- Why?
  • Because a very odd process used all available resources -- Why?
  • Because it was poorly written or because someone hacked into the system -- Why?
  • Because we hired a careless sysadmin and/or the security on our system is weak -- Why?
  • Because money is tight and everyone is too busy to test their code or review system security.

Whenever you find yourself fixing the same problem over and over again, you should be collecting evidence that might lead you to the underlying cause. When is the problem occurring? What symptoms are associated with the problem? Can you pick out those characteristics that seem to go along with it? Does it occur every time some other problem occurs? Every time certain processes are running? Every time certain users are logged in? Ask yourself what you know about the problem.
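
One way to answer the "every time certain processes are running?" question is to compare observations taken when the problem occurred with observations taken when it did not. The sketch below is only an illustration: the function name `common_factors`, the shape of the input, and the sample process names are all made up for this example. It assumes you have already noted, for each observation, whether the problem was present and which processes (or users) were active.

```python
def common_factors(observations):
    """Return items seen in every problem observation and in no healthy one.

    `observations` is a list of (had_problem, items) pairs, where `items`
    is a set of process names, user names, or anything else you recorded.
    """
    problem_sets = [set(items) for had_problem, items in observations if had_problem]
    healthy_sets = [set(items) for had_problem, items in observations if not had_problem]
    if not problem_sets:
        return set()
    # Items present every time the problem occurred.
    always_present = set.intersection(*problem_sets)
    # Drop anything that also shows up when the system is healthy.
    seen_when_healthy = set().union(*healthy_sets)
    return always_present - seen_when_healthy


# Hypothetical observations: (problem seen?, processes running at the time)
observations = [
    (True, {"sshd", "cron", "nightly_report"}),
    (False, {"sshd", "cron"}),
    (True, {"sshd", "nightly_report", "httpd"}),
    (False, {"sshd", "httpd"}),
]
print(common_factors(observations))  # prints {'nightly_report'}
```

A result like this is a lead, not a verdict. With only a handful of observations, something can go along with the problem by coincidence, so treat whatever comes back as the next thing to investigate.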

Collect any information which might be relevant. This might include:

  • the content of any error messages you see -- e.g., unusual messages in your log files
  • a list of the processes that are running
  • a display of disk space and memory usage
  • a listing of active network connections

If a problem occurs fairly often, you might want to collect performance data (e.g., with sar) along with ps and netstat output every 5-10 minutes. After the problem has occurred, examine what you collected to see whether anything looked unusual around the time it emerged.
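
A collection routine along these lines might look like the following minimal sketch. The function name `collect_snapshot`, the output layout, and the exact commands and flags in `DEFAULT_COMMANDS` are assumptions made for illustration; option support for ps, df, vmstat and netstat varies between Unix flavors, so adjust the list to whatever your system provides.

```python
import subprocess
import time
from pathlib import Path

# Assumed commands and flags; edit to match the tools on your system.
DEFAULT_COMMANDS = {
    "ps": ["ps", "-ef"],            # running processes
    "df": ["df", "-k"],             # disk space
    "vmstat": ["vmstat"],           # memory and CPU summary
    "netstat": ["netstat", "-an"],  # network connections
}


def collect_snapshot(out_dir, commands=None):
    """Run each command once and save its output in a timestamped directory."""
    commands = DEFAULT_COMMANDS if commands is None else commands
    stamp = time.strftime("%Y%m%d-%H%M%S")
    snap_dir = Path(out_dir) / stamp
    snap_dir.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, errors="replace", timeout=60
            )
            output = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure and carry on; one missing tool should not
            # cost you the rest of the snapshot.
            output = f"could not run {argv!r}: {exc}\n"
        (snap_dir / f"{name}.txt").write_text(output)
    return snap_dir
```

Calling `collect_snapshot()` with a directory of your choice from a small script that cron runs every 5-10 minutes would leave one directory per run. After an incident, you can compare the snapshots taken just before it with ones from a quiet period.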

Also, make sure that you determine as soon as you can whether the problem affects all users or just one or two. If it affects more than one person but not everyone, what do those people have in common?

One of the most effective techniques when diagnosing a problem is to see if you can reproduce it. If a problem seems to occur every time more than 20 users are logged in, log in as more than 20 users and see if it does.

Questions that might help include:

  • What is happening?
  • When?
  • Does it happen regularly?
  • What processes are running at that time?
  • Who is logged in?
  • What are the symptoms/effects?
  • Is there any evidence of the problem in your log files?
  • How long has the problem been occurring? When did we first see it?
  • Are there other problems that may be related?
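
For the log file question in the list above, it can help to pull out only the entries written around the time of the incident. The sketch below is a rough illustration, not a general log parser: the function name `lines_near` is made up, the sample lines are invented, and it assumes each line begins with an ISO 8601 timestamp with no time zone, followed by a space. Many syslog formats differ, so the parsing step would need adapting.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, incident_time, minutes=10):
    """Yield log lines stamped within `minutes` of `incident_time`.

    Assumes lines look like "2024-03-01T02:14:07 message ...".
    Lines whose first field is not a parseable timestamp are skipped.
    """
    window = timedelta(minutes=minutes)
    for line in log_lines:
        stamp, _, _rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue
        if abs(when - incident_time) <= window:
            yield line


# Invented sample data for illustration only.
sample = [
    "2024-03-01T02:05:00 app: nightly job started",
    "2024-03-01T02:14:07 app: error writing output",
    "2024-03-01T09:30:00 app: user session opened",
    "a line with no timestamp",
]
for line in lines_near(sample, datetime(2024, 3, 1, 2, 15)):
    print(line)
```

With the sample above and the default 10-minute window, the first two lines are printed and the others are ignored.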

Drill down to whatever flaw in procedure or the system itself is ultimately responsible, but avoid getting down to "because no one is infallible" or "because crap happens". Stop when you've identified a problem that you can address. ;-)

Always consider what you can do to remove or alleviate the cause. Maybe you need a more rigorous procedure for testing applications and scripts before they are implemented on your production systems.
