Intermittent fault
An intermittent fault, often called simply an "intermittent" (or anecdotally "interfailing"), is a malfunction of a device or system that occurs at intervals, usually irregular, while the device or system functions normally at other times. Intermittent faults are common to all branches of technology, including computer software. An intermittent fault is caused by several contributing factors occurring simultaneously, some of which may be effectively random. The more complex the system or mechanism involved, the greater the likelihood of an intermittent fault.
Intermittent faults are not easily repeatable because of their complicated behavioral patterns. They are also sometimes referred to as “soft” failures, since they do not manifest themselves all the time and disappear in an unpredictable manner. In contrast, “hard” failures are permanent failures that develop over a period of time (or are sometimes instantaneous). Hard failures have a specific failure site (location of failure), mode (how the failure manifests itself), and mechanism, and the failed system does not recover unpredictably. Since intermittent faults are not easily repeatable, it is more difficult to conduct a failure analysis for them, understand their root causes, or isolate their failure site than it is for permanent failures.[1]
Intermittent failures can be a cause of no-fault-found (NFF) occurrences in electronic products and systems. NFF implies that a failure (fault) occurred or was reported to have occurred during a product’s use, but when the product was analyzed or tested to confirm the failure, no failure or fault could be found. A common example of the NFF phenomenon occurs when a computer “hangs up”. Clearly, a “failure” has occurred. However, if the computer is rebooted, it often works again. The impact of NFF and intermittent failures can be profound. Because such failures are hard to reproduce, manufacturers may assume one or more causes rather than spend the time and cost to determine a root cause. For example, a hard drive supplier claimed NFFs were not failures and allowed all NFF products to be returned to the field. Later it was determined that these products had a significantly higher return rate, suggesting that the NFF condition was actually a result of intermittent failures in the product. The result was increased maintenance costs, decreased equipment availability, increased customer inconvenience, reduced customer confidence, damaged company reputation, and in some cases potential safety hazards.[2]
A simple example of an effectively random cause in a physical system is a borderline electrical connection in the wiring or a component of a circuit, where (cause 1, the cause that must be identified and rectified) two conductors may touch subject to (cause 2, which need not be identified) a minor change in temperature, vibration, orientation, voltage, etc. (Sometimes this is described as an "intermittent connection" rather than "fault".) In computer software a program may (cause 1) fail to initialise a variable which is required to be initially zero; if the program is run in circumstances such that memory is almost always clear before it starts, it will malfunction on the rare occasions that (cause 2) the memory where the variable is stored happens to be non-zero beforehand.
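The software example above can be imitated with a small simulation. This is only a sketch under stated assumptions: Python does not expose uninitialised memory, so a bytearray stands in for the memory the program inherits, and the names run_program and count_failures, the address 4, and the probabilities used are all hypothetical.

```python
import random

def run_program(memory: bytearray, addr: int, initialise: bool) -> int:
    """Count three events, using the byte at `addr` as the counter.

    With initialise=False the counter starts from whatever value the byte
    already holds (cause 1: the missing initialisation).
    """
    if initialise:
        memory[addr] = 0
    for _ in range(3):
        memory[addr] = (memory[addr] + 1) % 256
    return memory[addr]

def count_failures(runs: int, dirty_probability: float,
                   initialise: bool, seed: int = 0) -> int:
    """Return how many of `runs` executions produce a wrong count."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        memory = bytearray(16)  # memory is usually clear before the program starts
        if rng.random() < dirty_probability:
            # cause 2: on rare occasions the byte holds leftover non-zero data
            memory[4] = rng.randrange(1, 256)
        if run_program(memory, 4, initialise) != 3:
            failures += 1
    return failures

# The buggy program fails only on the rare "dirty" runs; the fixed one never fails.
print("buggy:", count_failures(10_000, 0.001, initialise=False))
print("fixed:", count_failures(10_000, 0.001, initialise=True))
```

Removing cause 1 (adding the initialisation) eliminates the fault regardless of how often cause 2 occurs, which is why only cause 1 needs to be identified and rectified.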
Intermittent faults are notoriously difficult to identify and repair ("troubleshoot") because each individual factor does not create the problem alone, so the factors can only be identified while the malfunction is actually occurring. The person capable of identifying and solving the problem is seldom the usual operator. Because the timing of the malfunction is unpredictable, and both device or system downtime and engineers' time incur cost, the fault is often simply tolerated if it is not too frequent, unless it causes unacceptable problems or dangers. For example, an intermittent fault in critical equipment such as medical life support equipment could kill a patient, and in aeronautics such a fault could cause a flight to be aborted or, in some cases, a crash.
If an intermittent fault occurs for long enough during troubleshooting, it can be identified and resolved in the usual way.
Recent efforts in U.S. military weapon system testing use a technique known as Certification Test Protocols (CTP). The U.S. Army has been implementing 4-wire Kelvin ohm measurements that stimulate the wiring paths with current using a set decade method. With automated testing, a test event takes seconds to minutes for multiple wiring paths. The results are then compared against each other to locate the degraded condition. Based on over six years of use across the Department of Defense, the technique has been able to detect the root cause of intermittent system faults effectively. The CTP measurement method does not require environmental chambers or vibration of the weapon system under test.
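The idea of comparing measurements from several wiring paths against each other can be illustrated with a minimal sketch. It is not the actual CTP procedure: the function name flag_degraded_paths, the 20% tolerance, the use of the group median as the reference, and the resistance values are all assumptions made for illustration.

```python
from statistics import median

def flag_degraded_paths(readings, tolerance=0.2):
    """Return the names of paths whose resistance deviates from the group
    median by more than `tolerance` (a fraction) at any stimulus current.

    readings: {path_name: [resistance in ohms at each current level]}
    All paths are assumed to be measured at the same current levels.
    """
    if not readings:
        return []
    names = sorted(readings)
    levels = len(readings[names[0]])
    flagged = set()
    for i in range(levels):
        reference = median(readings[name][i] for name in names)
        for name in names:
            if abs(readings[name][i] - reference) > tolerance * reference:
                flagged.add(name)
    return sorted(flagged)

# Hypothetical 4-wire readings for four companion wires at three current levels.
readings = {
    "W1": [0.050, 0.050, 0.051],
    "W2": [0.051, 0.050, 0.050],
    "W3": [0.050, 0.062, 0.090],  # resistance rises with current
    "W4": [0.049, 0.050, 0.050],
}
print(flag_degraded_paths(readings))  # ['W3']
```

Comparing each path with its companions, rather than with a fixed limit, means the check needs no absolute specification for the wiring, only the assumption that most paths are healthy.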
Troubleshooting techniques
Some techniques to resolve intermittent faults are:
- Automatic logging of relevant parameters over a long enough time for the fault to manifest can help; parameter values at the time of the fault may identify the cause so that appropriate remedial action can be taken.
- Changing operating circumstances while the fault is present to see if the fault temporarily clears or changes, for example by tapping components, cooling them with freezer spray, or heating them. Striking the cabinet may temporarily clear the fault.
- Consulting a database of similar faults which have been resolved in identical or similar equipment.[3]
- Making precautionary changes without attempting to pinpoint the fault. For example, electrolytic capacitors subject to high ripple currents can be changed as a routine measure, without bothering to troubleshoot the fault at all. Connectors can be disconnected and reseated. This is sometimes a measure of desperation; things are changed until the fault stops happening, and it is hoped that it is actually resolved rather than dormant.
- In electrical systems and cable systems, time domain reflectometry techniques can be used: pulses are sent down electric wiring and the pulses reflected back are examined for anomalies, for example intermittent leakage during the stresses of aircraft operation. This can only be done for one test channel at a time and is generally limited to intermittent faults lasting longer than 100 milliseconds.[4]
- In complex, multiple-channel systems, where the fault or faults might be in an interconnection, the ideal method of finding an intermittent fault is to monitor, detect and isolate all channels or electrical paths continuously and simultaneously. This methodology gives the system under test continuous and complete test coverage while any environmental stressing of the system is performed. This type of testing cannot be performed by scanning test technology; it requires some form of electronic neural network that can perform these tests without any scanning or digital averaging. This testing regime is covered by the DoD's MIL-PRF-32516, published in March 2015, which calls for test technology to operate in the Class 1 category in order to combat intermittent faults effectively.[5]
- Three main methodologies to mitigate intermittent behavior in integrated circuits are dynamic instruction delaying, core frequency scaling, and thread migration. When the processor takes longer than expected to execute a process, delays and timing violations occur. This fault may be avoided by using techniques such as dynamic instruction delaying, a type of algorithm that calculates the scheduling priorities during the execution of the system. The objective is to respond dynamically to changing conditions and form a self-sustained, optimized configuration. Another approach to mitigating delay is core frequency scaling, which scales the CPU down to a lower frequency when less performance is needed and up to a higher frequency when more is needed. Thread migration is another technique used to overcome intermittent failure. A thread is an ordered set of instructions that tells a computer exactly what to do. When a specific thread encounters failures, the content of the thread within the faulty computer core is transferred to another thread within an idle core, where the problem is addressed and solved.[1]
- Automatic testing using Certification Test Protocols (CTP) is effective in detecting precursors to intermittent failure modes in an Electrical Wiring Interconnect System (EWIS). CTP uses a circuit analyzer to apply multiple current stimuli to EWIS companion wiring automatically and compares the resulting measurements to find anomalies. CTP does not require flight emulation, shake/vibration, or physical movement to be successful, which makes it less costly than the methods set out in MIL-PRF-32516.[6]
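The first technique in the list, automatic logging of relevant parameters, might be sketched as follows. This is a minimal illustration rather than a prescribed design: the class name FaultLogger, the history length, and the parameter name temperature_c are hypothetical. Only the most recent samples are kept, and a snapshot is saved whenever a fault is observed, so the parameter values around the time of the fault can be examined afterwards.

```python
from collections import deque

class FaultLogger:
    """Keep the last `history` parameter samples and snapshot them on a fault."""

    def __init__(self, history=100):
        self._recent = deque(maxlen=history)  # oldest samples drop off automatically
        self.snapshots = []                   # one list of samples per observed fault

    def record(self, timestamp, params, fault=False):
        """Store one sample; if a fault is present, save the recent history."""
        self._recent.append((timestamp, dict(params)))
        if fault:
            self.snapshots.append(list(self._recent))

# Usage: log five samples, with a fault observed on the last one.
logger = FaultLogger(history=3)
for t in range(5):
    logger.record(t, {"temperature_c": 40 + t}, fault=(t == 4))

for timestamp, params in logger.snapshots[0]:
    print(timestamp, params)
```

A bounded buffer keeps memory use constant however long the system runs before the fault appears, at the cost of losing samples older than the chosen history.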
References
- Bakhshi, Roozbeh; Kunche, Surya; Pecht, Michael (2014-02-18). "Intermittent Failures in Hardware and Software". Journal of Electronic Packaging. 136 (1): 011014. doi:10.1115/1.4026639. ISSN 1043-7398.
- Qi, H.; Ganesan, S.; Pecht, M. (May 2008). "No-fault-found and Intermittent Failures in Electronic Products". Microelectronics Reliability. 48 (5): 663–674. doi:10.1016/j.microrel.2008.02.003.
- Example of an intermittent TV fault in a database: "Highlandelectrix PANASONI.TV". Archived from the original on 2009-04-13. Retrieved 2010-07-19. "Z3T CHASSIS - NO START UP - INTERMITTENT. D1124 (5.1V) ZENER LEAKY".
- Furse, Cynthia; Smith, Paul (December 2005). "Spread Spectrum Time Domain Reflectometry for Locating Intermittent Faults". IEEE Sensors Journal. 5 (6). Archived 2010-05-01 at archive.today.
- Khan, Samir; Phillips, Paul; Hockley, Chris; Jennions, Ian. "No Fault Found, Retest OK, Cannot Duplicate or Fault Not Found? - Towards a standardised taxonomy".