I've mentioned my use of Six Sigma techniques for debugging a couple of times before (5 Whys and Ishikawa diagrams), but a mention of the Six Sigma approach to end-to-end problem resolution (and process improvement) is long overdue.
Known as DMAIC (pronounced Duh-May-Ick), it's an abbreviation for Define, Measure, Analyse, Improve, and Control. Used in combination, the five steps in the DMAIC process can help you identify and resolve your most complex issues. As part of the Six Sigma approach, DMAIC can also be used to improve existing programs and applications (proactively fixing them before they break!).
The five steps lead a team logically through the process by defining the problem clearly, implementing solutions linked to root causes, and taking steps to ensure the solutions remain in place. You can try skipping some of the steps, but in my experience they each contribute a valuable part of the lasting solution. The basics of each step are:
DEFINE - First of all, make sure you have a clear and precise definition of the issue. This might include a list of symptoms, plus some information about the negative impact on users and/or the system. Make sure you understand priorities and objectives. All in all, this is an exercise in pulling together what the team already knows, and making sure that each member of the team is heading off to solve the same issue.
MEASURE - Use the measurement phase to gather additional information about the issue and its impact. The aim is to gather as much information as is practical, for the purpose of exposing the root cause or causes of the issue. It may be appropriate to capture temporary/intermediate files or network traffic, or to look for patterns in the surrounding activity in the system. Remember that this is a data capture phase; you're not trying to solve the issue at this point.
ANALYSE - In the analysis phase, we want to pinpoint and verify root causes in line with our priorities and objectives. Always be prepared to distrust the evidence or the means by which it was collected. Be careful to avoid "analysis paralysis", i.e. wasting time by reviewing data that isn't moving the investigation forward. Remember that the aim is to identify the root cause (or causes), not to define one or more solutions.
IMPROVE - In this phase, identify potential solutions, then prioritise them and select one. It may be appropriate to implement a pilot of one or two solutions prior to finalising your decision and implementing the definitive solution. Your choice of solution may be influenced by impact analysis that shows the full knock-on effect of implementing each potential solution. Be sure to gain proper confirmation of success from appropriate stakeholders before implementing the definitive solution.
CONTROL - Often overlooked, the control phase comes after the fix or improvement has been implemented. Use the control phase to make sure that lessons were learned and that neither the problem nor others of a similar nature recur. Add test steps to your regression testing suite and put additional control & validation steps in your application. If it's not practical to stop all potential recurrences, give thought to how the issue could have been picked up sooner and dealt with more quickly.
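To make the control phase a little more concrete, here is a minimal Python sketch of a validation step paired with a regression test. The scenario is purely hypothetical (a batch import that used to accept records with no customer ID), and the names validate_record, import_records and ValidationError are invented for illustration; your own controls will depend on the root cause you found.

```python
# Hypothetical scenario: the resolved issue was a batch import that
# silently accepted records with a missing customer ID.


class ValidationError(ValueError):
    """Raised when a record fails a control check."""


def validate_record(record):
    """Control step: reject bad input early instead of letting it propagate."""
    customer_id = record.get("customer_id")
    if customer_id is None or str(customer_id).strip() == "":
        raise ValidationError(f"record {record.get('id', '?')} has no customer_id")
    return record


def import_records(records):
    """Accept valid records and collect rejection messages.

    Returning the rejections (rather than discarding them) is what lets
    a recurrence be picked up sooner.
    """
    accepted, rejected = [], []
    for record in records:
        try:
            accepted.append(validate_record(record))
        except ValidationError as exc:
            rejected.append(str(exc))
    return accepted, rejected


def test_missing_customer_id_is_rejected():
    """Regression test added to the suite once the fix is in place."""
    accepted, rejected = import_records([
        {"id": 1, "customer_id": "C-100"},
        {"id": 2, "customer_id": ""},
        {"id": 3},
    ])
    assert [r["id"] for r in accepted] == [1]
    assert len(rejected) == 2
```

The validation function guards against the original problem, while the test guards against the fix being quietly undone later.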
The five steps seem like common sense, but in the heat of a high-priority, complex issue it is easy to forget some of these basics. I find it helps me a great deal to have the five steps at the back of my mind as my team and I investigate issues. I mentally tick off each phase and make sure that we've got all the information we need to proceed to the next step.
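If a mental checklist isn't enough, the same ticking-off can be sketched in a few lines of Python. This is only an illustration, not part of Six Sigma itself: the DmaicTracker class and its rule that each phase must be ticked off in order, with a note of the evidence gathered, are assumptions made for this example.

```python
from dataclasses import dataclass, field

PHASES = ["Define", "Measure", "Analyse", "Improve", "Control"]


@dataclass
class DmaicTracker:
    """Hypothetical helper that records which DMAIC phases are done."""

    completed: list = field(default_factory=list)

    def current_phase(self):
        """Return the next phase to work on, or None when all are done."""
        if len(self.completed) < len(PHASES):
            return PHASES[len(self.completed)]
        return None

    def tick_off(self, phase, evidence):
        """Mark a phase complete, refusing to skip ahead or accept no evidence."""
        expected = self.current_phase()
        if expected is None:
            raise ValueError("all phases are already complete")
        if phase != expected:
            raise ValueError(f"cannot tick off {phase!r}; {expected!r} comes first")
        if not evidence.strip():
            raise ValueError(f"{phase!r} needs a note of what was established")
        self.completed.append((phase, evidence))


tracker = DmaicTracker()
tracker.tick_off("Define", "Nightly import fails for about 2% of records")
tracker.tick_off("Measure", "Captured intermediate files from three failed runs")
print(tracker.current_phase())  # prints "Analyse"
```

Trying to tick off "Improve" at this point raises a ValueError, which is the code equivalent of being reminded not to jump to solutions before the root cause is known.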
I hope you don't get any complex issues, but if you do (we all do, right?) then you will find DMAIC to be a useful guide.