Root cause analysis

In science and engineering, root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems.[1] It is widely used in IT operations, manufacturing, telecommunications, industrial process control, accident analysis (e.g., in aviation,[2] rail transport, or nuclear plants), medicine (for medical diagnosis), and the healthcare industry (e.g., for epidemiology). Root cause analysis combines inductive inference (first forming a theory about the root cause from empirical evidence) with deductive inference (then testing that theory, i.e. the underlying causal mechanisms, against empirical data).

RCA can be decomposed into four steps:

  • Identify and describe the problem clearly
  • Establish a timeline from the normal situation until the problem occurs
  • Distinguish between the root cause and other causal factors (e.g., using event correlation)
  • Establish a causal graph between the root cause and the problem

RCA generally serves as input to a remediation process whereby corrective actions are taken to prevent the problem from recurring. The name of this process varies from one application domain to another. According to ISO/IEC 31010, RCA may include the techniques Five whys, Failure mode and effects analysis (FMEA), Fault tree analysis, Ishikawa diagram, and Pareto analysis.
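
As an illustration of one of the techniques listed above, Pareto analysis ranks cause categories by how often they occur and keeps the few that account for most occurrences. The following is a minimal sketch only: the function name, the incident categories, their counts, and the 80% cutoff are all hypothetical choices, not part of any standard.

```python
from collections import Counter

def pareto(causes, cutoff=0.8):
    """Return the most frequent causes that together cover `cutoff` of all occurrences.

    `causes` is a list with one entry per recorded incident (hypothetical data).
    """
    counts = Counter(causes)
    total = sum(counts.values())
    selected, covered = [], 0
    for cause, n in counts.most_common():
        if covered / total >= cutoff:
            break  # the causes selected so far already reach the cutoff
        selected.append(cause)
        covered += n
    return selected

# Hypothetical incident log: one entry per incident, labelled by cause category.
log = ["lubrication"] * 6 + ["operator error"] * 2 + ["power"] + ["sensor"]
vital_few = pareto(log)
```

With this made-up log, the two most frequent categories cover 8 of the 10 incidents, so they would be the first candidates for deeper analysis.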

Definitions

There are essentially two ways of repairing faults and solving problems in science and engineering.

Reactive management

Reactive management consists of reacting quickly after the problem occurs, by treating the symptoms. This type of management is implemented by reactive systems,[3][4] self-adaptive systems,[5] self-organized systems, and complex adaptive systems. The goal here is to react quickly and alleviate the effects of the problem as soon as possible.

Proactive management

Proactive management, conversely, consists of preventing problems from occurring. Many techniques can be used for this purpose, ranging from good practices in design to analyzing in detail problems that have already occurred and taking actions to make sure they never recur. Speed is not as important here as the accuracy and precision of the diagnosis. The focus is on addressing the real cause of the problem rather than its effects.

Root cause analysis is often used in proactive management to identify the root cause of a problem, that is, the factor that was the leading cause. It is customary to refer to the "root cause" in singular form, but one or several factors may constitute the root cause(s) of the problem under study.

A factor is considered the "root cause" of a problem if removing it prevents the problem from recurring. Conversely, a "causal factor" is a contributing action that affects the outcome of an incident or event but is not the root cause. Although removing a causal factor can improve the outcome, it does not prevent the problem from recurring with certainty.
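
The distinction can be sketched as a counterfactual test, assuming (unrealistically) that the problem can be written as a Boolean function of the factors present. The function `classify`, the factor names, and the toy model `problem_occurs` below are hypothetical; real investigations rarely have such a complete model.

```python
from itertools import combinations

def classify(factor, factors, problem_occurs):
    """Classify `factor` by removing it under every combination of the other factors.

    `problem_occurs` takes the set of factors present and returns True or False.
    """
    others = [f for f in factors if f != factor]
    prevents_always, matters = True, False
    for r in range(len(others) + 1):
        for combo in combinations(others, r):
            without = problem_occurs(set(combo))
            with_factor = problem_occurs(set(combo) | {factor})
            if without:
                prevents_always = False  # problem can occur even without the factor
            if with_factor != without:
                matters = True           # the factor changes the outcome somewhere
    if prevents_always and matters:
        return "root cause"
    return "causal factor" if matters else "non-causal factor"

# Hypothetical model: the fault needs a missing filter plus at least one stressor.
def problem_occurs(present):
    return "no filter" in present and ("high load" in present or "old bearing" in present)

factors = ["no filter", "high load", "old bearing", "paint colour"]
labels = {f: classify(f, factors, problem_occurs) for f in factors}
```

In this toy model, removing "no filter" always prevents the fault, whereas removing "high load" helps only when the bearing is not also old, which matches the definitions above.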

Example

Imagine an investigation into a machine that stopped because it was overloaded and the fuse blew.[6] Investigation shows that the machine was overloaded because it had a bearing that was not being sufficiently lubricated. The investigation proceeds further and finds that the automatic lubrication mechanism had a pump that was not pumping sufficiently, hence the lack of lubrication. Investigation of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there is not an adequate mechanism to prevent metal scrap getting into the pump. This enabled scrap to get into the pump, and damage it.

The apparent root cause of the problem is that metal scrap can contaminate the lubrication system. Fixing this problem ought to prevent the whole sequence of events recurring. The real root cause could be a design issue if there is no filter to prevent the metal scrap getting into the system. Or if it has a filter that was blocked due to lack of routine inspection, then the real root cause is a maintenance issue.
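
The chain of "why?" questions in this example can be written down as a simple mapping from each effect to the cause found for it. This is only a sketch: the dictionary encodes the investigation described above in paraphrased labels, and the helper name `trace_to_root` is hypothetical.

```python
# Each observed effect maps to the cause the investigation found for it.
cause_of = {
    "machine stopped": "fuse blew from overload",
    "fuse blew from overload": "bearing insufficiently lubricated",
    "bearing insufficiently lubricated": "lubrication pump not pumping sufficiently",
    "lubrication pump not pumping sufficiently": "pump shaft worn",
    "pump shaft worn": "metal scrap entered the pump",
}

def trace_to_root(problem, cause_of):
    """Follow the effect-to-cause links until no further cause is recorded."""
    chain = [problem]
    while chain[-1] in cause_of:
        next_cause = cause_of[chain[-1]]
        if next_cause in chain:
            raise ValueError("circular causal chain")
        chain.append(next_cause)
    return chain

chain = trace_to_root("machine stopped", cause_of)
apparent_root = chain[-1]  # last link for which no deeper cause is recorded
```

The trace stops wherever the recorded investigation stops, so the last element is only the apparent root cause; asking "why?" once more (no filter, or a blocked filter) would extend the chain.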

Compare this with an investigation that does not find the root cause: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply recur, until the root cause is dealt with.

The above does not include cost/benefit analysis: does the cost of replacing one or more machines exceed the cost of downtime until the fuse is replaced? This situation is sometimes referred to as the cure being worse than the disease.[7][8]

As an unrelated example of the conclusions that can be drawn in the absence of a cost/benefit analysis, consider the tradeoff between some claimed benefits of population decline: in the short term there will be fewer payers into pension/retirement systems, whereas halting the population decline will require higher taxes to cover the cost of building more schools. This can help explain the problem of the cure being worse than the disease.[9]

The costs to consider go beyond finances and include the personnel who operate the machinery. Ultimately, the goal is to prevent downtime and, more importantly, catastrophic injuries. Prevention begins with being proactive.

Application domains

Root cause analysis is used in many application domains.

Manufacturing and industrial process control

The example above illustrates how RCA can be used in manufacturing. RCA is also routinely used in industrial process control, e.g. to control the production of chemicals (quality control).

RCA is also used for failure analysis in engineering and maintenance.

IT and telecommunications

Root cause analysis is frequently used in IT and telecommunications to detect the root causes of serious problems. For example, in the ITIL service management framework, the goal of incident management is to resume a faulty IT service as soon as possible (reactive management), whereas problem management deals with solving recurring problems for good by addressing their root causes (proactive management).

Another example is the computer security incident management process, where root-cause analysis is often used to investigate security breaches.[10]

RCA is also used in conjunction with business activity monitoring and complex event processing to analyze faults in business processes.

Its use in the IT industry cannot always be compared to its use in safety-critical industries, since RCA in the IT industry is normally not supported by pre-existing fault trees or other design specifications. Instead, the analysis normally relies on a mixture of debugging, event-based detection, and monitoring systems (where the services are individually modelled). Training and supporting tools, such as simulation or in-depth runbooks for all expected scenarios, usually do not exist in advance; they are created after the fact, based on issues seen as 'worthy'. As a result, the analysis is often limited to those things that have monitoring or observation interfaces, rather than covering the actual planned or observed function with a focus on verifying inputs and outputs. Hence, the saying "there is no root cause" has become common in the IT industry.

Health and safety

In the domains of health and safety, RCA is routinely used in medicine (diagnosis) and epidemiology (e.g., to identify the source of an infectious disease), where causal inference methods often require both clinical and statistical expertise to make sense of the complexities of the processes.[11]

RCA is used in environmental science (e.g., to analyze environmental disasters), accident analysis (aviation and rail industry), and occupational safety and health.[12] In the manufacture of medical devices,[13] pharmaceuticals,[14] food,[15] and dietary supplements,[16] root cause analysis is a regulatory requirement.

Systems analysis

RCA is also used in change management, risk management, and systems analysis.

General principles

Despite the different approaches among the various schools of root cause analysis and the specifics of each application domain, RCA generally follows the same four steps:

  1. Identification and description: Effective problem statements and event descriptions (as failures, for example) are helpful and usually required to ensure the execution of appropriate root cause analyses.
  2. Chronology: RCA should establish a sequence of events or timeline for understanding the relationships between contributory (causal) factors, the root cause, and the problem under investigation.
  3. Differentiation: By correlating this sequence of events with the nature, the magnitude, the location, and the timing of the problem, and possibly also with a library of previously analyzed problems, RCA should enable the investigator(s) to distinguish between the root cause, causal factors, and non-causal factors. One way to trace down root causes consists in using hierarchical clustering and data-mining solutions (such as graph-theory-based data mining). Another consists in comparing the situation under investigation with past situations stored in case libraries, using case-based reasoning tools.
  4. Causal graphing: Finally, the investigator should be able to extract from the sequences of events a subsequence of key events that explain the problem, and convert it into a causal graph.
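
Steps 2 and 4 above can be sketched as follows, assuming the investigator has already recorded timestamped events and candidate cause-effect links. The event names, timestamps, and helper functions are hypothetical, and the sketch does not show how the links themselves are discovered.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    time: int   # hypothetical timestamp, e.g. minutes since the first observation
    name: str

def causal_subgraph(edges, problem):
    """Keep only the (cause, effect) edges that lie on a path leading to `problem`."""
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, []).append(cause)
    kept, stack, seen = [], [problem], {problem}
    while stack:
        node = stack.pop()
        for cause in parents.get(node, []):
            kept.append((cause, node))
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return kept

def roots(edges):
    """Nodes that appear as a cause but never as an effect."""
    return {c for c, _ in edges} - {e for _, e in edges}

# Step 2: chronology (events may arrive out of order).
events = [Event(30, "fuse blew"), Event(0, "scrap enters pump"),
          Event(10, "shaft worn"), Event(5, "shift change"), Event(20, "bearing dry")]
timeline = sorted(events, key=lambda e: e.time)

# Step 4: causal graph restricted to what explains the problem.
edges = [("scrap enters pump", "shaft worn"), ("shaft worn", "bearing dry"),
         ("bearing dry", "fuse blew"), ("shift change", "coffee break")]
graph = causal_subgraph(edges, "fuse blew")
root_causes = roots(graph)
```

The unrelated "shift change" link is dropped because it does not lead to the problem, which corresponds to the differentiation in step 3; `roots` returns a set because more than one root cause may exist.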

To be effective, root cause analysis must be performed systematically, which reduces the chance of missing important details. A team effort is typically required, and ideally all persons involved should arrive at the same conclusion. In aircraft accident analyses, for example, the conclusions of the investigation and the root causes that are identified must be backed up by documented evidence.[17]

Transition to corrective actions

The goal of RCA is to identify the root cause of the problem with the intent to stop the problem from recurring or worsening. The next step is to trigger long-term corrective actions to address the root cause identified during RCA, and make sure that the problem does not resurface. Correcting a problem is not formally part of RCA, however; these are different steps in a problem-solving process known as fault management in IT and telecommunications, repair in engineering, remediation in aviation, environmental remediation in ecology, therapy in medicine, etc.

Challenges

Without delving into the idiosyncrasies of specific problems, several general conditions can make RCA more difficult than it may appear at first sight.

First, important information is often missing because it is generally not possible, in practice, to monitor everything and store all monitoring data for a long time.

Second, gathering data and evidence, and classifying them along a timeline of events leading to the final problem, can be nontrivial. In telecommunications, for instance, distributed monitoring systems typically manage between a million and a billion events per day. Finding a few relevant events in such a mass of irrelevant events is like looking for the proverbial needle in a haystack.
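
One common first step in narrowing such a mass of events is to keep only those that occurred shortly before the problem and on components plausibly related to it. The sketch below assumes events are `(timestamp, component, message)` tuples; the function name, the window size, and the sample data are hypothetical.

```python
def candidate_events(events, problem_time, window, components):
    """Keep events within `window` time units before the problem on relevant components."""
    start = problem_time - window
    return [
        event for event in events
        if start <= event[0] <= problem_time and event[1] in components
    ]

# Hypothetical monitoring data: (timestamp, component, message).
events = [
    (100, "pump", "pressure low"),
    (950, "billing", "invoice sent"),
    (970, "pump", "shaft vibration"),
    (990, "bearing", "temperature high"),
    (1000, "fuse", "blown"),
    (1010, "pump", "restart"),
]
shortlist = candidate_events(events, problem_time=1000, window=60,
                             components={"pump", "bearing", "fuse"})
```

Such a filter only reduces the haystack. It can also discard the real cause if the window is too short or the component list is incomplete, so both are assumptions to revisit during the investigation.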

Third, there may be more than one root cause for a given problem, and this multiplicity can make the causal graph very difficult to establish.

Fourth, causal graphs often have many levels, and root-cause analysis terminates at a level that is "root" in the eyes of the investigator. Looking again at the example above in industrial process control, a deeper investigation could reveal that the maintenance procedures at the plant included periodic inspection of the lubrication subsystem every two years, while the current lubrication subsystem vendor's product specified a six-month inspection period. Switching vendors may have been due to management's desire to save money, combined with a failure to consult engineering staff on the implications of the change for maintenance procedures. Thus, while addressing the "root cause" shown above may have prevented the quoted recurrence, it would not have prevented other, perhaps more severe, failures affecting other machines.

Notes

  1. See Wilson, Dell & Anderson 1993, pp. 8–17.
  2. See IATA 2016 and Sofema 2017.
  3. See Manna & Pnueli 1995.
  4. See Lewerentz & Lindner 1995.
  5. See Babaoglu et al. 2005.
  6. See Ohno 1988.
  7. "The Cure Worse Than the Disease". The New York Times. 5 November 1927.
  8. Andrew C. Revkin (7 December 2000). "Dredging River's PCB's Could Be a Cure Worse Than the disease, G. E. insists". The New York Times.
  9. Phillip Longman (9 June 2004). "The Global Baby Bust". The New York Times.
  10. See Abubakar et al. 2016.
  11. Landsittel, Douglas; Srivastava, Avantika; Kropf, Kristin (2020). "A Narrative Review of Methods for Causal Inference and Associated Educational Resources". Quality Management in Health Care. 29 (4): 260–269. doi:10.1097/QMH.0000000000000276. ISSN 1063-8628. PMID 32991545. S2CID 222146291.
  12. See OSHA 2019.
  13. Office of Regulatory Affairs (26 December 2019). "Corrective and Preventive Actions (CAPA)". FDA.
  14. US-FDA. "CURRENT GOOD MANUFACTURING PRACTICE FOR FINISHED PHARMACEUTICALS". Electronic Code of Federal Regulations (eCFR). Retrieved 28 December 2020.
  15. US-FDA. "CURRENT GOOD MANUFACTURING PRACTICE, HAZARD ANALYSIS, AND RISK-BASED PREVENTIVE CONTROLS FOR HUMAN FOOD". Electronic Code of Federal Regulations (eCFR). Retrieved 28 December 2020.
  16. US-FDA. "CURRENT GOOD MANUFACTURING PRACTICE IN MANUFACTURING, PACKAGING, LABELING, OR HOLDING OPERATIONS FOR DIETARY SUPPLEMENTS". Electronic Code of Federal Regulations (eCFR). Retrieved 28 December 2020.
  17. See IATA 2016.

References

  • Abubakar, Aisha; Bagheri Zadeh, Pooneh; Janicke, Helge; Howley, Richard (2016). "Root cause analysis (RCA) as a preliminary tool into the investigation of identity theft". Proc. 2016 International Conference On Cyber Security And Protection Of Digital Services (Cyber Security).
  • Babaoglu, O.; Jelasity, M.; Montresor, A.; Fetzer, C.; Leonardi, S.; van Moorsel, A.; van Steen, M., eds. (2005). Self-star Properties in Complex Information Systems; Conceptual and Practical Foundations. LNCS. Vol. 3460. Springer.
  • IATA (8 April 2016). "Root Cause Analysis for Civil Aviation Authorities and Air Navigation Service Providers". International Air Transport Association. Archived from the original on 8 April 2016. Retrieved 17 November 2017. Key steps to conducting an effective root cause analysis, which tools to use for root cause identification, and how to develop effective corrective actions plans.
  • Lewerentz, Claus; Lindner, Thomas, eds. (1995). Formal Development of Reactive Systems; Case Study Production Cell. LNCS. Vol. 891. Springer.
  • Manna, Zohar; Pnueli, Amir (1995). Temporal Verification of Reactive Systems: Safety. Springer. ISBN 978-0387944593.
  • Ohno, Taiichi (1988). Toyota Production System: Beyond Large-Scale Production. Portland, Oregon: Productivity Press. p. 17. ISBN 0-915299-14-3.
  • OSHA; EPA. "FactSheet: The Importance of Root Cause Analysis During Incident Investigation" (PDF). Occupational Safety and Health Administration. Retrieved 22 March 2019.
  • Sofema (17 November 2017). "Root Cause Analysis for Safety Management Practitioners & Business Area Owners". Sofema Aviation Services. Archived from the original on 17 November 2017. Retrieved 17 November 2017. Identify best practice techniques and behaviours to perform effective Root Cause Analysis (RCA)
  • Wilson, Paul F.; Dell, Larry D.; Anderson, Gaylord F. (1993). Root Cause Analysis: A Tool for Total Quality Management. Milwaukee, Wisconsin: ASQ Quality Press. ISBN 0-87389-163-5.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.