Monday, 28 September 2009

Problem Solving: Five Whys: Getting To The Root

Being a software developer isn't just about releasing great new software. Sometimes we need to fix problems that occur in the live system (or in testing or development). When investigating production failures we need to be sure they don't happen again, which means getting to the root cause and fixing that.

Six Sigma is a disciplined, data-driven approach and methodology for eliminating defects. It can be seen as being heavily statistics based (driving towards six standard deviations between the mean and the nearest specification limit), but many of its tools and techniques involve no statistics and are easily adopted.

Six Sigma's basic approach to problem solving, including root cause analysis, is called DMAIC; it stands for define, measure, analyse, improve and control. In other words, define your problem, measure it (so you can understand it and subsequently see that a change has been made), analyse the problem (and find its root cause), improve the process by implementing a solution to that root cause, and control and monitor the ongoing production process to be sure it is now functioning well and continues to do so.

One of the Six Sigma techniques offered in the analysis phase is "5 Whys". Somewhat reminiscent of conversations with my kids when they were younger, 5 Whys teaches us that by repeatedly asking "why" we can peel away layers of the problem until we get to the root cause. Asking "why" five times is usually sufficient, but the general rule is to keep asking until the root cause is identified.

Note that the technique is intended to offer a structured route to help teams establish root cause. It doesn't work well when wrongly used to emphasise the person or blame, turning the 5 Whys into the Five Whos!

Here's an example I recently encountered. Our daily production review revealed that one of our input files had failed to load last night. Why? Because it didn't match the data structure expected by our data loader (an extra column had been added to the right-hand side of the file). Why? Because the group that regularly supplies the data file had changed the structure. At this point the knee-jerk reaction was to assume we had some unexpected emergency coding to do in order to get our data loader to accept the new structure, but we continued with 5 Whys. Why was the data structure changed, and why were we not told? Because (the supplying group told us) the change had been tested with all systems that used the file and they weren't aware that we used the file. Why weren't they aware we used the file? We'd informed them, but there had been staff changes and they didn't keep formal records.
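
The first "why" in this example, a file whose structure no longer matches what the loader expects, is the kind of failure a loader can report in plain terms, which makes the next "why" easier to ask. Here is a minimal sketch in Python, assuming a CSV input with a header row. The column names and the check_header function are hypothetical illustrations, not our actual loader.

```python
import csv
import io

# Hypothetical column layout the loader expects; illustrative only.
EXPECTED_COLUMNS = ["trade_id", "trade_date", "amount"]


def check_header(stream, expected=EXPECTED_COLUMNS):
    """Return a list of human-readable problems with the file's header row."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    missing = [c for c in expected if c not in header]
    extra = [c for c in header if c not in expected]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    # Same set of columns but in a different order is still a mismatch.
    if not problems and header != list(expected):
        problems.append("columns are in a different order")
    return problems


# A file with one extra column on the right-hand side, as in the example above.
sample = io.StringIO("trade_id,trade_date,amount,currency\n1,2009-09-25,100.00,GBP\n")
for problem in check_header(sample):
    print(problem)
```

A check like this only describes the symptom more clearly. It does not explain why the structure changed, which is what the remaining "whys" are for.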

The negotiated resolution was to a) temporarily supply two files (the old structure and the new structure) until we had time to plan and schedule a change to our data loader, and b) create a more formal process for recording consumers of the data file.
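
Part b) of that resolution can start very small. As a rough sketch only, assuming nothing about the process the two groups actually agreed, a record of data file consumers might look like the following in Python. The ConsumerRegistry class, the file name, the team names and the addresses are all hypothetical.

```python
from collections import defaultdict


class ConsumerRegistry:
    """Records which teams consume each data file and how to contact them."""

    def __init__(self):
        # Maps file name -> {team name: contact address}.
        self._consumers = defaultdict(dict)

    def register(self, file_name, team, contact):
        """Record (or update) a team as a consumer of the given file."""
        self._consumers[file_name][team] = contact

    def consumers_of(self, file_name):
        """Return a copy of the team -> contact mapping for the given file."""
        return dict(self._consumers.get(file_name, {}))


# Hypothetical entries, for illustration only.
registry = ConsumerRegistry()
registry.register("daily_positions.csv", "risk-reporting", "risk-team@example.com")
registry.register("daily_positions.csv", "settlements", "settlements@example.com")

# Before changing the file's structure, the supplier lists everyone to notify.
for team, contact in sorted(registry.consumers_of("daily_positions.csv").items()):
    print(f"notify {team} via {contact}")
```

The point is that the record lives somewhere that survives staff changes, rather than in someone's memory. The storage itself could just as well be a shared document or a database table.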

The purpose of 5 Whys is to find the root cause and to avoid assuming that a symptom is the cause. The objective is to find THE problem rather than a problem. Used thoughtfully, 5 Whys can be tremendously powerful in helping you identify and resolve production problems (and problems during testing and development phases too).

If you're not sure what to do, just ask your kids!