June 08, 2014, 3:35 PM
We're likely all familiar with the phrase "treating the symptoms". We don't want our doctors prescribing painkillers any time we suffer from some kind of discomfort. We want them to uncover and address the underlying health issues. Even so, we may not apply this same kind of thinking when we're responding to system problems. Rebooting a system that goes down once a week is not a good solution unless you're also looking into the possible causes of the system's crashes.
If you reboot a system that crashes in the middle of a workday without ever finding out why it crashed, you won't have any control over when it might happen again or whether the symptoms will be worse next time.
Root cause analysis is the process of figuring out the ultimate cause of a particular problem. One technique for ensuring that you don't stop too early was developed by Sakichi Toyoda and is called the "5 Whys". In this technique, you ask "Why?" up to five times, each answer getting you closer to the real cause. Applied to my missing a weekly staff meeting, it might look like this:
- I missed the morning meeting -- Why?
- I didn't get to the office on time -- Why?
- My car broke down -- Why?
- I ran out of gas -- Why?
- The gas gauge on my car is unreliable -- Why?
- I ignored the recall notice that I got six months ago
I once tracked the cause of a system going down in the middle of the day to a cron job that someone had set up a year earlier. The job was intended to shut the system down cleanly prior to a planned power outage, and he'd forgotten to remove it afterward. The following year, the same date fell on a weekday. June 9th last year was a Sunday; this year, it's a Monday. One way to avoid this kind of problem is to tie the cron job to the weekday as well as the date. Be careful how you do it, though. When both the day-of-month and day-of-week fields are restricted, standard cron runs the job when either one matches, so an entry beginning "0 12 9 6 0" would fire on June 9th and on every Sunday in June. The weekday test has to go in the command itself. A cron job that checks the weekday like this one wouldn't have fired off very often (every seven years or so, on average) and maybe would have caught someone's eye before it did.
0 12 9 6 * [ "$(date +\%w)" -eq 0 ] && /bin/shutdown -h now
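To see how rarely June 9th actually lands on a Sunday, a quick shell loop can list the matching years. This is only a sketch: it assumes GNU date (for the -d option), and the function name and year range are made up for illustration.

```shell
#!/bin/sh
# List the years in a range in which June 9th falls on a Sunday.
# Assumes GNU date, whose -d option parses a YYYY-MM-DD string.
# june9_sundays is a hypothetical helper name.
june9_sundays() {
    year=$1
    while [ "$year" -le "$2" ]; do
        # %w prints the weekday as 0-6, with 0 meaning Sunday
        if [ "$(date -d "$year-06-09" +%w)" -eq 0 ]; then
            echo "$year"
        fi
        year=$((year + 1))
    done
}

june9_sundays 2013 2041
```

For 2013 through 2041, this prints 2013, 2019, 2024, 2030, and 2041, so the gap between matches ranges from five to eleven years.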
Better yet, to avoid unintentional repeats of cron jobs, you might consider using an at command. It would fire once and never again.
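As a rough sketch of what that could look like, the one-off shutdown might be queued with at instead of cron. The helper name, the time, and the date here are assumptions for illustration. Scheduling a shutdown this way also requires root and a running at daemon.

```shell
#!/bin/sh
# run_once is a hypothetical wrapper: the first argument is the command to
# run, and the remaining arguments are passed to at(1) as the time to run it.
run_once() {
    cmd=$1
    shift
    # at reads the commands to execute from standard input
    printf '%s\n' "$cmd" | at "$@"
}

# Example use (as root), with an assumed time and date:
#   run_once "/bin/shutdown -h now" noon Jun 9
#   atq    # shows jobs still waiting to run
```

Once the job has run, it leaves the queue, so there is nothing left behind to clean up.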
It might not take asking why five times to get to the core of your problem, but getting into the habit of asking yourself "Is there a reason for the reason?" will help you get there. Your system didn't just go down because of a cron job, but because someone failed to remove the cron job after it ran. And maybe they failed to remove the cron job because they were fired the next day, or because they lost root access to the server, or because someone else was supposed to take care of that, or because no one really monitors cron jobs.
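If the answer to the last "why" turns out to be that no one reviews cron jobs, one small step is to scan crontabs now and then for commands that take a system down. The sketch below is only an illustration: the function name and the list of commands to flag are assumptions, and it reads crontab text from standard input.

```shell
#!/bin/sh
# flag_downtime_jobs is a hypothetical helper that reads crontab text on
# standard input and prints entries that invoke a shutdown-style command.
# The command list is an assumption; adjust it for your systems.
flag_downtime_jobs() {
    # Drop comment lines first, then match the command names as whole words
    grep -v '^[[:space:]]*#' | grep -wE 'shutdown|reboot|halt|poweroff'
}

# Example use against the current user's crontab:
#   crontab -l | flag_downtime_jobs
```

Anything it prints is worth a second look, especially entries pinned to a single date.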
You might not want to bother with a formal root cause analysis for every little problem you run into. Sometimes it isn't worth the time.