Equipment Uptime Systems

Diagnostics

Why Intermittent Failures Are So Difficult to Diagnose — and How to Stop Chasing Them

A systematic approach to the faults that disappear the moment you arrive.

Topic: Fault Diagnosis

Published: 2026 Edition

Publisher: Equipment Uptime Systems

Section 1

Why Intermittent Faults Defeat Normal Diagnostic Approaches

Standard troubleshooting assumes the fault is present when you arrive. With intermittent faults, that assumption is wrong, and everything built on it fails. Understanding why normal approaches break down is the first step toward a method that actually works.

Most diagnostic training — formal or informal — teaches a process that depends on one thing: being able to observe the fault. You arrive, you measure, you compare against specification, you identify the deviation, you trace it to a cause. This works reliably when the fault is stable and present. It fails completely when the fault is gone by the time you get there.

Intermittent faults break the diagnostic loop at step one. The equipment is running. Everything measures within spec. The event log shows no active alarms. A technician with good instincts will walk the floor, listen to the machine, feel for vibration, check a few parameters — and find nothing, because there is nothing to find in that moment. So they leave. The fault returns two days later.

Why the Standard Approach Breaks Down

The conventional troubleshooting sequence (observe the symptom, isolate the subsystem, test the components) relies on reproducibility. If you can reproduce the fault on demand, or at least make it appear with a known stimulus, you can work through it systematically. Intermittent faults resist reproduction by definition. Three characteristics make them nearly impossible to catch this way:

The fault self-clears. Temperature drops, load reduces, the vibration stops, the moisture evaporates. The physical condition that caused the failure is no longer present.
The fault is condition-dependent. It only appears under a specific combination of operating conditions — load, temperature, sequence, timing — that may not exist when you are observing the system.
The fault leaves no trace. Unlike a burned component or a cracked shaft, many intermittent faults produce no physical evidence between events.

The result is a diagnostic situation where the technician is working from memory and verbal description rather than direct observation. "It hesitated for a second and then came back" is what you have to work with. That is a hard fault to catch with a multimeter and a visual inspection.

The Default Response: Replace Likely Suspects

When a fault cannot be reproduced and nothing measures wrong, most technicians eventually land on the same strategy: replace the parts most likely to cause that type of symptom. A PLC input card gets swapped. A proximity sensor gets replaced. A connector gets cleaned and re-seated. The machine runs fine for a week. The team declares victory.

Then it happens again. Or it does not happen again, and nobody knows whether the repair was actually correct or whether the fault is still present, waiting for the right conditions to return.

Parts replacement without confirmation is not diagnosis — it is guessing with a parts budget. On low-cost components, this is sometimes a reasonable tradeoff. On expensive assemblies, on equipment with lead times, or on machines where the fault has safety implications, it is a problem that compounds over time as the real cause goes unaddressed.

The Core Problem

Intermittent faults require a different diagnostic posture than stable faults. Instead of observing and measuring, you need to monitor and record — capturing the fault when it occurs rather than trying to find it when it is not there.

Section 2

The Four Categories of Intermittent Failure

Not all intermittent faults are the same. Grouping them by mechanism helps you design a monitoring approach that will actually catch them — different categories require different logging strategies, different triggers, and different evidence types.

Category	Typical Mechanism	When It Fails	When It Clears	Key Evidence
Thermal	Expansion, resistance change, solder joint fatigue	After 30–60 min of operation; hot ambient	After cooldown (20–40 min)	Temperature at fault, ambient temp, duty cycle log
Vibration / Mechanical	Loose connections, worn contacts, fretting corrosion	Under load, high RPM, or specific motion	Reduced load, stopped, or after vibration ceases	Load profile, vibration log, connection resistance at fault time
Contamination	Moisture, particulates, corrosion on contacts	High humidity, condensation, after washdown, dusty conditions	Dries out, clears mechanically	Humidity log, ambient conditions, inspection of suspect contacts
Software / Logic	Timing races, buffer overflow, firmware edge cases	Specific input sequences, high throughput, edge conditions	Power cycle, timeout, state reset	Event log with timestamps, sequence recording, PLC scan time monitoring

Thermal Failures

Thermal intermittent faults are among the most common in industrial electronics and control systems. The classic presentation: the machine runs fine for the first hour of the shift, then starts producing errors around mid-morning, then clears up when the operator takes a break and the machine sits idle for 20 minutes. This cycle can continue for days or weeks before anyone connects the timing pattern.

The mechanism is usually expansion-related. A solder joint that was marginal at room temperature loses contact as the board heats. A connector pin with micro-corrosion passes current reliably when cold but not once its contact resistance rises with temperature. An electrolytic capacitor with degraded electrolyte can filter poorly at temperature extremes. Its ESR is typically highest when cold, so this one often shows the reverse pattern: a fault at cold start that clears once the equipment warms up.

The diagnostic clue is timing: thermal faults almost always correlate with warmup time or ambient temperature. Log the ambient temperature and the equipment surface temperature along with the fault events, and you will usually see the pattern within two or three fault cycles.
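As a minimal sketch of that correlation check, the Python below summarizes a small hand-entered log. The `Sample` fields, the `thermal_summary` helper, and the example readings are all hypothetical and chosen only for illustration; a real log would come from whatever logger or HMI trend export your system provides.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    minutes_since_start: float  # time since the machine was started
    ambient_c: float            # ambient temperature, degrees C
    surface_c: float            # equipment surface temperature, degrees C
    fault: bool                 # True if the fault was active at this sample


def thermal_summary(samples):
    """Compare temperatures at fault time against normal operation."""
    faults = [s for s in samples if s.fault]
    normal = [s for s in samples if not s.fault]
    if not faults or not normal:
        return None  # nothing to compare yet
    return {
        "fault_count": len(faults),
        "mean_surface_c_at_fault": mean(s.surface_c for s in faults),
        "mean_surface_c_normal": mean(s.surface_c for s in normal),
        "min_surface_c_at_fault": min(s.surface_c for s in faults),
        "earliest_fault_min": min(s.minutes_since_start for s in faults),
    }


# Illustrative readings only, not measured data.
log = [
    Sample(0, 22, 25, False),
    Sample(30, 23, 41, False),
    Sample(60, 24, 58, True),
    Sample(90, 24, 61, True),
    Sample(120, 23, 38, False),
]
print(thermal_summary(log))
```

If the surface temperature at fault time sits well above the normal-operation average across several events, the thermal hypothesis gains support. If the two are similar, another category is the better starting point.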

Vibration and Mechanical Faults

Loose terminals, worn edge connectors, and cracked PCB traces are the mechanical intermittent faults most commonly seen in industrial equipment. They tend to appear under load — when the machine is producing vibration, when a conveyor is running at full speed, or when a motor draws starting current. Under these conditions, a connection that makes perfectly good contact at rest develops enough movement to create a momentary open or high-resistance path.

Fretting corrosion deserves specific attention. In connectors and terminal blocks subject to small cyclical motion, the movement repeatedly breaks the oxide film on the contact surfaces, exposes fresh metal, and lets it re-oxidize, so insulating oxide debris accumulates at the contact point. The result is a contact that looks clean and tests fine with a static measurement but fails under the micro-movement of operating vibration. Visual inspection will not find it. A resistance measurement at rest will not find it. You need to measure under operating conditions.

Contamination Faults

Moisture-related faults tend to follow environmental cycles — they appear at a predictable time of year, after a washdown procedure, or after overnight condensation in a facility that cools significantly at night. Particulate contamination is more random, but tends to worsen progressively over time as buildup increases.
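Whether condensation is plausible at a given moment can be estimated from logged air temperature and relative humidity. The sketch below uses the Magnus approximation for dew point with one commonly published coefficient pair (17.62 and 243.12 °C). The function names and the 2 °C margin are assumptions for illustration, not values taken from a standard or from this article.

```python
import math

# Magnus approximation coefficients (one commonly published pair).
MAGNUS_A = 17.62    # dimensionless
MAGNUS_B = 243.12   # degrees C


def dew_point_c(air_temp_c, rel_humidity_pct):
    """Approximate dew point in degrees C from air temperature and RH (%)."""
    gamma = (math.log(rel_humidity_pct / 100.0)
             + MAGNUS_A * air_temp_c / (MAGNUS_B + air_temp_c))
    return MAGNUS_B * gamma / (MAGNUS_A - gamma)


def condensation_risk(surface_temp_c, air_temp_c, rel_humidity_pct, margin_c=2.0):
    """True if a surface is at or near the dew point (margin is an assumed buffer)."""
    return surface_temp_c <= dew_point_c(air_temp_c, rel_humidity_pct) + margin_c


# Hypothetical overnight reading: 25 C air at 60 % RH, enclosure wall at 15 C.
print(round(dew_point_c(25.0, 60.0), 1))
print(condensation_risk(surface_temp_c=15.0, air_temp_c=25.0, rel_humidity_pct=60.0))
```

Logging the dew point alongside the fault state makes it easy to see whether fault events cluster around the periods when a cold surface drops to or below it.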

The diagnostic complication is that cleaning "as a diagnostic step" is common and often counterproductive. If you clean an enclosure and the fault goes away, you have learned that contamination was probably involved, but you have destroyed the evidence before establishing exactly what it was and where it was causing the problem. The fault returns when conditions recreate the contamination, and you are back to zero.

Software and Logic Faults

Timing-dependent faults in PLCs and control systems are the hardest category to diagnose because they leave the least physical evidence and require the most context to interpret. A PLC scan time that usually runs at 8 ms occasionally spikes to 45 ms under a specific loading condition, causing a sensor input to be missed. A serial communication buffer that is never quite full under normal conditions hits overflow during a specific sequence of operations that only occurs when two production events happen within a narrow time window.

These faults can appear to be hardware problems — a sensor that "misses" intermittently, a drive fault that clears itself — when the actual issue is in the control logic. Firmware version history and any recent parameter changes are essential context when investigating this category.
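The missed-input case above comes down to a comparison: a pulse shorter than the scan that happens to be running can fall between two input reads. The sketch below flags those scans in a logged scan-time series. It assumes the input is read once per scan with no latching or pulse stretching, and the `risky_scans` name, the log values, and the 20 ms pulse width are hypothetical.

```python
def risky_scans(scan_times_ms, pulse_width_ms):
    """Return (index, scan_time_ms) for scans long enough to miss the pulse.

    Assumes the input is read once per scan with no latching or pulse
    stretching, so a pulse shorter than the scan can fall between reads.
    """
    return [(i, t) for i, t in enumerate(scan_times_ms) if t > pulse_width_ms]


# Hypothetical scan-time log (ms) and a hypothetical 20 ms sensor pulse.
scan_log = [8, 8, 9, 45, 8, 8]
for index, scan_ms in risky_scans(scan_log, pulse_width_ms=20):
    print(f"scan {index}: {scan_ms} ms exceeds the 20 ms pulse width")
```

If the flagged scans line up in time with the "missed" sensor events, the evidence points at the control logic rather than the sensor.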

Diagnostic Starting Point

Before selecting a monitoring approach, commit to a category hypothesis. Ask: when does this fault occur, and when does it clear? The answers will usually point to one of these four categories, and each category has a different evidence collection strategy.

Section 3

How to Capture a Fault You Cannot Reproduce

If you cannot reproduce the fault on demand, you need to be ready to capture it when it reproduces itself. This requires setting up a monitoring configuration before the next event — not arriving after the fact and looking for evidence that is already gone.

The Data Logging Strategy

The goal of data logging on an intermittent fault is to capture enough context around the fault event to identify the cause — not just confirm that the fault occurred. Most PLC event logs and alarm histories tell you that a fault happened at a specific time. What they rarely tell you is what the system state was for the 30 seconds before the fault, which is usually where the cause lives.

A useful intermittent fault log captures three windows:

Pre-fault window: 30–120 seconds of data before the fault event, depending on how quickly the fault develops. This is where you see the anomaly that preceded the visible symptom.
Fault event: The moment the fault is detected, with all relevant parameter values at that instant.
Post-fault window: 15–60 seconds after the fault clears, to capture recovery behavior and confirm the fault has fully resolved.
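One common way to get all three windows is a rolling pre-trigger buffer: keep the most recent samples in memory at all times and freeze them when the fault bit goes active. The sketch below is a minimal Python illustration of that idea, not a description of any particular logger. The `FaultCapture` class and its parameters are hypothetical, and window lengths are given in samples (seconds multiplied by sample rate).

```python
from collections import deque


class FaultCapture:
    """Collects pre-fault, fault, and post-fault samples around each event."""

    def __init__(self, pre_samples, post_samples):
        self.pre = deque(maxlen=pre_samples)  # rolling history, oldest dropped
        self.post_samples = post_samples      # use 1 or more
        self.current = None                   # capture in progress, if any

    def feed(self, sample, fault_active):
        """Add one sample; return a finished capture dict, or None."""
        cap = self.current
        if cap is None:
            if fault_active:
                # Fault just appeared: freeze the history as the pre-fault window.
                self.current = {"pre": list(self.pre), "fault": [sample], "post": []}
            else:
                self.pre.append(sample)
            return None
        if fault_active and not cap["post"]:
            cap["fault"].append(sample)       # fault still active
            return None
        # Fault has cleared. A re-trigger inside the post window is kept
        # in the same capture rather than starting a new one.
        cap["post"].append(sample)
        self.pre.append(sample)               # keep the rolling history current
        if len(cap["post"]) >= self.post_samples:
            self.current = None
            return cap
        return None


# Hypothetical stream: samples 0..9, fault active on samples 5 and 6.
capture = FaultCapture(pre_samples=3, post_samples=2)
for value in range(10):
    finished = capture.feed(value, fault_active=value in (5, 6))
    if finished:
        print(finished)
```

The pre-fault buffer length is the setting that matters most: at 10 samples per second, a 120-second pre-fault window needs `pre_samples=1200`.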

What to Log and at What Sample Rate

Log everything that could be related to the fault category you have hypothesized. For a thermal fault, that means temperatures. For a suspected power quality issue, that means supply voltage and current. For a suspected communication fault, that means network traffic and response times. Breadth of coverage around the fault event matters more than the exact parameter list: a single critical parameter left out of the log costs you another full monitoring cycle.

Fault Category	Recommended Sample Rate	Key Parameters	Minimum Log Duration
Thermal	1 sample / 30 seconds	Ambient temp, component temp, load %, fault state	Full shift cycle (8–12 hours)
Vibration / Mechanical	10–100 samples / second	Vibration (X/Y/Z), load current, fault state	Full production run
Contamination	1 sample / minute	Humidity, temperature (for dew point calculation), fault state	24–48 hours minimum
Software / Logic	Every PLC scan (scan time typically 1–10 ms)	All relevant I/O states, scan time, sequence flags	Until fault occurs

Sample rate selection has a practical constraint: storage and retrieval. A high-frequency log running for 72 hours generates a large file. Size it so you can actually retrieve and review the data. For most fault categories, the sample rate in the table above is sufficient to identify the cause without producing an unmanageable dataset.
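The storage constraint is easy to estimate before the logger is configured. The sketch below multiplies sample rate, channel count, and duration. The 4 bytes per stored value and the channel counts are assumptions for illustration; actual file sizes depend on the logger's format, timestamps, and any compression.

```python
def log_size_bytes(rate_hz, channels, hours, bytes_per_value=4):
    """Rough raw size of a log; ignores timestamps, headers, and compression."""
    return round(rate_hz * channels * hours * 3600 * bytes_per_value)


# Vibration-style log: 100 samples/s, 4 channels (X, Y, Z, load current), 72 h.
vibration = log_size_bytes(rate_hz=100, channels=4, hours=72)

# Thermal-style log: 1 sample per 30 s, 4 channels, 12 h shift.
thermal = log_size_bytes(rate_hz=1 / 30, channels=4, hours=12)

print(f"vibration log: {vibration / 1e6:.1f} MB")
print(f"thermal log:   {thermal / 1e3:.1f} kB")
```

Under these assumptions the two logs differ by more than four orders of magnitude, which is why the high-rate categories are usually paired with a triggered capture rather than continuous recording.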

Capturing Symptoms vs. Capturing Causes

The most common data logging mistake is logging the symptom channel and nothing else. The fault appears on Output 3, so you log Output 3. But Output 3 failed because of a temperature rise in the drive cabinet, which happened because the cabinet fan stopped running four hours earlier. The fault cause — the failed fan — is visible only in the cabinet temperature trend, which you did not log.

This is why baselining matters. Before the fault occurs again, spend time logging the system in normal operation. Capture what "good" looks like across all relevant parameters. When the fault occurs, the deviation from baseline will be visible, and it will often point to a parameter you would not have thought to monitor directly.

Baselining Normal Behavior

A baseline log is a recording of normal operating parameters across a complete production cycle — ideally including startup, steady-state production, load variations, and shutdown. It takes time to collect, but it dramatically accelerates fault diagnosis when a deviation occurs.

Specifically, a baseline lets you answer: what is different now compared to when it was running well? Without a baseline, you are comparing current readings to memory and specification — both of which are unreliable references for subtle deviations. With a baseline, you can overlay the fault-period data on the normal-operation data and see exactly where the divergence began.
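The overlay can also be done numerically. As a rough sketch, the function below walks a baseline series and a fault-period series sample by sample and reports where the difference first stays outside a tolerance. It assumes both series are already aligned to the same point in the production cycle and sampled at the same rate. The `first_divergence` name, the tolerance, and the three-sample persistence rule are illustrative choices.

```python
def first_divergence(baseline, observed, tolerance, min_run=3):
    """Index where observed first departs from baseline for min_run samples in a row.

    Returns None if no sustained departure is found. Single-sample
    excursions are ignored so that noise does not mark the start.
    """
    run_start = None
    run_len = 0
    for i, (b, o) in enumerate(zip(baseline, observed)):
        if abs(o - b) > tolerance:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_len = 0
    return None


# Hypothetical cabinet temperatures (degrees C), one value per log interval.
baseline_temp = [40, 40, 40, 40, 40, 40, 40, 40]
fault_day_temp = [40, 41, 40, 44, 40, 45, 46, 48]
print(first_divergence(baseline_temp, fault_day_temp, tolerance=3))
```

Running the same check across every logged parameter shows which one departed from baseline first, and that parameter is usually closer to the cause than the symptom channel is.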

Practical Note

PLC data logging capability varies widely. Some controllers support on-board trend recording at adequate sample rates; others require an external data logger or HMI-based trending. Verify what your system can actually capture before the next fault occurs — not during it.

Section 4

The Most Common Diagnostic Mistakes on Intermittent Faults

Experience with intermittent faults is hard to accumulate because each event is infrequent and the feedback loop is slow. These are the mistakes that appear repeatedly — not because technicians are careless, but because the normal instincts of good troubleshooters work against you on this type of fault.

Replacing Parts Without Capturing Evidence

A motor drive faults out intermittently. The fault code points to an overcurrent condition. The technician replaces the drive. The machine runs for three weeks with no faults. Case closed, until it happens again: same fault code, same symptom. The replacement drive is only three weeks old, but with nothing else measuring wrong, the technician replaces it again. It runs for five weeks.

What actually happened: the motor winding has an insulation defect that is developing slowly over time. Under normal operating temperatures it holds up, but after 90 minutes of operation under load it softens enough to produce momentary overcurrent events. The drive sees this as an overcurrent fault and shuts down to protect itself. The drive was never the problem. Two drives have now been consumed by the actual root cause, which remains unaddressed.

Parts replacement without confirming the mechanism creates two bad outcomes: the direct cost of replaced components, and the delayed discovery of the real cause, which continues to degrade in the meantime.

"Cleaning Everything" as a Diagnostic Step

Compressed air on the boards, contact cleaner on the connectors, a wipe-down of the cabinet. The machine runs fine afterward. This is presented as a success, but diagnostically it is a failure — you have removed the evidence without identifying what the evidence was pointing to.

If cleaning resolves the fault, the correct response is to document that contamination was the cause and then determine: what contaminant, where did it accumulate, and why is that location vulnerable? Answering those questions may lead to an enclosure seal improvement, a condensation drain modification, or a contamination guard — a permanent fix rather than a repeated cleaning cycle.

Cleaning as a maintenance step is valid. Cleaning as a diagnostic step, without capturing what you found before you cleaned it, is an information loss.

Assuming the Fault Is Gone Because It Has Not Recurred

An intermittent fault that has not occurred in three weeks has not necessarily been fixed. The conditions required to trigger it may simply not have been present: the right ambient temperature, the right production sequence, the right load profile. This is especially true for thermally triggered faults in facilities where ambient temperatures vary seasonally, and for contamination faults that correlate with specific production runs or cleaning cycles.

The correct stance: a fault is resolved only when you can explain the mechanism, demonstrate that the mechanism has been addressed, and observe that the relevant parameters remain within normal limits. "It hasn't happened again" is a data point, not a resolution.

Not Looking at What Changed Before the Fault Started

Intermittent faults that appear without an obvious trigger almost always have a change event in the preceding days or weeks: a firmware update, a parameter adjustment, a repair on an adjacent system, a seasonal temperature shift, or a new product with different run characteristics. The fault did not appear from nowhere; something changed the conditions that the system had been tolerating.

Make it a standard practice to document what changed before the fault started. Ask: was there a software update in the past 30 days? Was any work done on this equipment or the equipment upstream or downstream? Did ambient conditions change? Has the production recipe changed? The answer to one of these questions will frequently point directly to the cause.
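If changes are recorded with dates, the 30-day question becomes a simple filter. The sketch below assumes a change log kept as (date, description) pairs. The entries, the dates, and the `changes_before_fault` helper are hypothetical.

```python
from datetime import date, timedelta


def changes_before_fault(changes, first_fault, lookback_days=30):
    """Return changes made in the lookback window before the first fault.

    changes is a list of (date, description) pairs; the result is
    ordered most recent first.
    """
    window_start = first_fault - timedelta(days=lookback_days)
    recent = [(d, desc) for d, desc in changes if window_start <= d <= first_fault]
    return sorted(recent, reverse=True)


# Hypothetical change log for one machine.
change_log = [
    (date(2026, 1, 10), "Replaced conveyor gearbox upstream"),
    (date(2026, 3, 2), "Drive parameter adjustment"),
    (date(2026, 3, 20), "PLC firmware update"),
]
for when, what in changes_before_fault(change_log, first_fault=date(2026, 3, 25)):
    print(when.isoformat(), what)
```

The filter is only as good as the log behind it, so work on adjacent equipment and ambient changes need to be recorded alongside software and parameter changes.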

The Most Expensive Mistake

The most expensive mistake is replacing a major assembly (a PLC processor, a servo drive, a vision system controller) on the strength of intermittent fault symptoms, without capturing evidence during a fault event, only to find that the replacement exhibits the same fault. This happens regularly and is preventable. No assembly replacement for an intermittent fault should be approved without logged evidence that the assembly was the cause.

Section 5

Building a Fault Hypothesis and Testing It Systematically

Good intermittent fault diagnosis is hypothesis-driven. You form a specific, testable explanation for the fault mechanism, design a test that would disprove that explanation if it were wrong, and execute the test. This is slower to start but far faster to resolution than the alternatives.

The Elimination Approach vs. the Confirmation Approach

Most technicians approach intermittent faults using elimination: rule out everything that probably is not the cause until what remains must be the cause. This is the correct logic in theory. In practice, on intermittent faults with long cycle times between events, it is extremely slow. If each potential cause takes two weeks to rule out (because the fault recurs infrequently), and you have six candidates, you are looking at three months to reach a diagnosis by elimination.

The confirmation approach works differently: instead of ruling out wrong answers, you design a test that will produce specific, observable evidence if your hypothesis is correct. You are not waiting for the fault to recur under normal conditions — you are creating conditions that will force the fault to reveal itself, if your hypothesis about the mechanism is right.

Elimination Approach

Replace suspect A and wait
If fault recurs, replace suspect B
Continue until fault stops
Slow — one cycle per candidate
Never confirms the actual cause
May exhaust parts budget before finding cause

Confirmation Approach

Form a specific mechanism hypothesis
Design a test that would produce evidence for or against it
Execute the test; interpret results
One cycle to confirm or refute the hypothesis
Produces understanding, not just a resolved symptom
Evidence-based; defensible
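The timing in the six-candidate example is simple arithmetic, sketched below. The worst case matches the figure in the text: twelve weeks, or roughly three months. The average figure rests on an added assumption that every candidate is equally likely to be the cause, and the `elimination_weeks` name is hypothetical.

```python
def elimination_weeks(candidates, weeks_per_cycle):
    """Worst-case and average wait when candidates are ruled out one at a time."""
    worst = candidates * weeks_per_cycle
    # Average assumes each candidate is equally likely to be the cause
    # (an assumption made here, not a claim from the article).
    average = (candidates + 1) / 2 * weeks_per_cycle
    return worst, average


worst, average = elimination_weeks(candidates=6, weeks_per_cycle=2)
print(f"worst case {worst} weeks, average {average} weeks")
```

A confirmation test that settles one hypothesis in a single cycle costs one such wait per hypothesis, and often less when the test can induce the fault instead of waiting for it.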

How to Form a Good Fault Hypothesis

A useful fault hypothesis has three components: a proposed cause, a proposed mechanism, and a predicted observation. Without all three, the hypothesis is not testable.

Example of a weak hypothesis: "The PLC might be causing the problem."

Example of a strong hypothesis: "The PLC analog input card is drifting due to temperature-related offset error. If this is the cause, I expect to see the measured value on channel 4 shift by more than 0.3V from its baseline reading when the cabinet temperature exceeds 45°C, and this shift will correlate with the fault events in the alarm log."

The strong hypothesis specifies what to measure (channel 4 voltage), what the expected deviation is (more than 0.3V), what the triggering condition is (cabinet temp over 45°C), and what the correlation to fault events should look like. This is a test you can actually run and interpret.
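A hypothesis stated this precisely can be checked directly against logged data. The sketch below counts how often the predicted shift appears when the cabinet is hot and how many fault events coincide with it. The `Reading` fields, the sample values, and the rule that every hot reading and every fault must show the shift are assumptions for illustration; the 0.3 V and 45 °C limits come from the example hypothesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    cabinet_c: float   # cabinet temperature, degrees C
    ch4_volts: float   # measured value on analog channel 4
    fault: bool        # True if the alarm log shows a fault at this time


def evaluate(readings, baseline_volts, shift_limit=0.3, temp_limit=45.0):
    """Count how well the logged data matches the predicted observation."""
    def shifted(r):
        return abs(r.ch4_volts - baseline_volts) > shift_limit

    hot = [r for r in readings if r.cabinet_c > temp_limit]
    faults = [r for r in readings if r.fault]
    result = {
        "hot_readings": len(hot),
        "hot_and_shifted": sum(1 for r in hot if shifted(r)),
        "faults": len(faults),
        "faults_hot_and_shifted": sum(
            1 for r in faults if r.cabinet_c > temp_limit and shifted(r)
        ),
    }
    # One possible pass rule, fixed before looking at the data:
    # every hot reading is shifted and every fault occurs while hot and shifted.
    result["supported"] = (
        len(hot) > 0
        and len(faults) > 0
        and result["hot_and_shifted"] == len(hot)
        and result["faults_hot_and_shifted"] == len(faults)
    )
    return result


# Illustrative readings only, with an assumed 5.00 V baseline on channel 4.
data = [
    Reading(38.0, 5.01, False),
    Reading(44.0, 5.05, False),
    Reading(47.0, 5.38, True),
    Reading(49.0, 5.44, True),
]
print(evaluate(data, baseline_volts=5.00))
```

A result that does not meet the rule is still useful: it says the drift hypothesis does not explain the fault events, and the next candidate mechanism can be tested the same way.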

Designing a Test That Proves or Disproves

A well-designed confirmation test has a clear pass and fail condition defined before the test runs — not after. This prevents confirmation bias, where ambiguous results get interpreted as support for whichever outcome the technician expected.

For thermal fault hypotheses: monitor the suspect parameter while deliberately inducing a temperature rise (longer run without cooling breaks, blocked airflow in a controlled test, heat gun applied to a specific board with appropriate safety precautions). Does the fault occur when temperature reaches your predicted threshold? Does it clear when temperature drops below it?

For vibration/mechanical hypotheses: monitor contact resistance or signal continuity while applying a mechanical stimulus (gentle tap test, load variation, vibration applied to the suspect connector). Does the resistance increase correlate with the fault symptom?

For software/logic hypotheses: capture PLC scan times and I/O state sequences during a high-load period. Does the timing deviation align with your predicted edge condition? Can you induce the fault by manually creating the sequence that your hypothesis predicts will trigger it?

When the Test Fails

If your test does not produce the predicted evidence, that is valuable information. It means your hypothesis was wrong, and you can eliminate that mechanism from your candidate list. Revise the hypothesis — what other mechanism would produce the same symptom pattern? — and test again. A disproved hypothesis is not a failed diagnosis; it is a narrowed problem space.

The confirmation approach requires more preparation time upfront than parts replacement does. But it produces a diagnosis rather than a bet, and the knowledge it generates, namely the actual failure mechanism, is directly usable for building a preventive maintenance (PM) task that prevents the next occurrence.

Preventive Maintenance Playbook

Intermittent faults are a diagnostic problem, but they are also a PM program problem. The majority of intermittent failures in industrial equipment are caused by degradation that a well-designed PM program would have caught before it produced symptoms — insulation breakdown, connector fretting, thermal management failures, contamination ingress through deteriorated seals.

The Preventive Maintenance Playbook ($99) covers how to build PM tasks that catch degradation at the mechanism level — not just perform a visual inspection and call it done. It includes failure mode templates for common fault categories, frequency selection criteria based on failure history, and the monitoring checkpoints that precede intermittent faults in most equipment types.
