The decision framework that determines where your maintenance budget actually goes
The direct cost of a reactive repair — parts plus labor — is the number that gets tracked. It is also the smallest part of the cost. The compounding effects of a reactive-dominant maintenance program are harder to see but more damaging over time.
The most predictable consequence of reactive maintenance, and the one most often ignored, is the growth of deferred PM work. Every time an unplanned failure pulls technicians off scheduled PM work, those PMs do not disappear; they accumulate as deferred work. A maintenance team spending 70 percent of its hours on reactive work cannot also execute a PM schedule at 90 percent compliance. The hours are not there.
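The labor arithmetic behind that claim can be sketched in a few lines. This is a rough illustration, not a planning tool: the function name, the 400 technician hours per week, and the 150 scheduled PM hours are hypothetical, and it generously assumes every non-reactive hour goes to PM.

```python
def pm_compliance_ceiling(total_hours: float,
                          reactive_share: float,
                          scheduled_pm_hours: float) -> float:
    """Best-case PM compliance when reactive work consumes a share of labor.

    Assumes (optimistically) that every hour not spent on reactive work
    is available for scheduled PM.
    """
    if total_hours <= 0 or scheduled_pm_hours <= 0:
        raise ValueError("hours must be positive")
    if not 0.0 <= reactive_share <= 1.0:
        raise ValueError("reactive_share must be between 0 and 1")
    hours_left = total_hours * (1.0 - reactive_share)
    return min(1.0, hours_left / scheduled_pm_hours)


# Hypothetical week: 400 technician hours, 70 percent reactive,
# 150 hours of PM on the schedule.
ceiling = pm_compliance_ceiling(total_hours=400,
                                reactive_share=0.70,
                                scheduled_pm_hours=150)
print(f"Best-case PM compliance: {ceiling:.0%}")
```

Even in this best case the ceiling sits below 90 percent, and every week at the ceiling adds the shortfall to the backlog.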
What grows instead is a PM backlog: work that should have been done but was not, sitting in the queue, aging. A PM deferred by two weeks is a minor inconvenience. A PM deferred four times over eight months is a failure waiting to happen on the asset that was supposed to be the beneficiary of that PM. The backlog is not neutral; it is an active reliability risk that compounds with each passing week.
The perverse part: the longer the PM backlog grows, the more reactive failures occur — because the PMs that would have caught and prevented those failures are in the backlog. The team becomes increasingly reactive, which further depresses PM compliance, which generates more reactive failures. This is the maintenance death spiral, and most operations that are stuck in it cannot see the mechanism clearly because they are entirely focused on today's emergency.
A maintenance team that is predominantly reactive develops skill in the specific failure modes that occur most frequently — because those are the repairs they repeat most often. What they do not develop, and what they may actually lose over time, is diagnostic skill on complex equipment that fails infrequently.
Complex equipment — process control systems, high-speed machining centers, servo-driven automation — requires a different kind of diagnostic capability than routine mechanical repairs. Technicians stay sharp on complex systems by working on them regularly: doing PM inspections, calibrations, functional checks, and subsystem tests. These tasks build the mental model of "how this machine behaves when it is healthy" that is prerequisite to diagnosing it accurately when it is not.
In a reactive-dominant environment, technicians only touch complex equipment when it fails. They do not develop that healthy-state mental model. As a result, diagnostic time on complex equipment is longer, more uncertain, and more likely to result in parts replacement rather than root cause identification. Mean time between failures on that equipment degrades, not because the equipment is getting worse, but because the maintenance team's ability to maintain it is eroding.
Mean time between failures is a lagging indicator — it shows you where reliability has been, not where it is going. In a reactive maintenance environment, MTBF on critical assets tends to degrade slowly and then rapidly: slowly during the early period when the effects of deferred PMs are not yet visible, then rapidly once the first failures that those PMs would have prevented begin to occur.
The degradation is not even across the asset base. It concentrates on the assets with the most sensitivity to PM interval — equipment with wear components, fluid-lubricated systems, equipment running in harsh environments. These are typically also the highest-criticality assets, so the MTBF degradation occurs precisely where it is most expensive.
By the time a reactive maintenance pattern has been running for 18 to 24 months, MTBF degradation on critical assets is typically 30 to 50 percent compared to the period when PM was being executed consistently. Reversing that degradation requires sustained PM execution for a similar period — the recovery curve is not faster than the degradation curve.
Reactive maintenance feels manageable in the short run because MTBF degradation is not immediately visible. Equipment that should be failing does not start failing immediately when PM lapses — it fails 6 to 18 months later, depending on the asset. By the time the failure rate increase is obvious, the PM deficit that caused it is already 18 months old, and the recovery will take another 12 to 18 months. The decision to defer PM today costs money 18 months from now, in a way that is difficult to trace back to its origin.
The instinct to swing from reactive to heavily preventive is understandable. It is also wrong as a blanket strategy. Over-maintained equipment has its own failure modes — and an over-PM'd maintenance program has its own sustainability problems.
Every time a technician opens a gearbox, removes a bearing, or disassembles a pump for inspection, there is a nonzero probability that the equipment comes back from that maintenance event in worse condition than it went in. Seal damage during reassembly. Contaminants introduced during inspection. Misalignment after reinstallation. Over-torqued fasteners that fatigue the housing. Incorrect reassembly of a valve or coupling that was fine before it was touched.
This is not a hypothetical risk — it is documented in reliability engineering literature as "infant mortality" after maintenance: the elevated failure rate in the hours and days immediately following a maintenance event. Studies in aviation maintenance, where this phenomenon has been studied most rigorously, show that 7 to 15 percent of all maintenance-related accidents involve equipment that was working before the maintenance was performed.
The practical implication: the more frequently you intervene in running equipment, the more often you are accepting this risk. On equipment where failure is rare and the consequences of PM-induced failure are significant, reducing PM frequency to the minimum necessary — and improving execution quality for each PM that is done — is often the right move.
A fixed-interval replacement program — "replace the filter every 90 days," "replace the seal every 6 months," "replace the drive belt annually" — is designed around worst-case assumptions about degradation rate. The interval is set short enough to catch most failures before they occur, even on equipment in the harshest operating conditions.
For equipment in moderate or favorable operating conditions, that interval means you are regularly replacing components that still have significant useful life remaining. That is a direct cost: the wasted parts, the labor for an unnecessary PM, and the acceptance of PM-induced failure risk for a component that was not close to failing.
Across a large asset base, the waste from unnecessary interval-based replacement is substantial. A plant with 200 pieces of equipment on fixed-interval PM schedules, where 30 percent of replacement tasks are premature, is spending roughly 30 percent of its PM parts budget on components that did not need replacing, assuming part costs are similar across tasks. That budget would produce more reliability improvement if invested in condition-monitoring infrastructure or in improving PM execution on the assets where interval-based PM is genuinely needed.
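To put a number on this for your own plant, a minimal sketch might look like the following. It assumes you record, at each replacement, whether the removed component still had significant life left. The `Replacement` record, the `premature` flag, and the sample costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replacement:
    component: str
    parts_cost: float
    premature: bool  # technician judged the removed part to have significant life left


def premature_spend_share(replacements: list[Replacement]) -> float:
    """Fraction of PM parts spend that went to premature replacements."""
    total = sum(r.parts_cost for r in replacements)
    if total == 0:
        return 0.0
    wasted = sum(r.parts_cost for r in replacements if r.premature)
    return wasted / total


# Hypothetical replacement history for one quarter.
history = [
    Replacement("filter", 40.0, False),
    Replacement("seal", 120.0, True),
    Replacement("drive belt", 180.0, False),
    Replacement("filter", 40.0, True),
    Replacement("seal", 120.0, False),
]
share = premature_spend_share(history)
print(f"Share of parts spend on premature replacements: {share:.0%}")
```

Note that the share of tasks and the share of spend can differ: here two of five tasks are premature, but they account for a smaller fraction of the parts budget because the most expensive replacement was justified.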
The framing of "preventive vs. reactive" is a false choice. Both extremes are wrong. The right strategy is different for different assets, and the goal of maintenance program design is to match the maintenance approach to each asset's failure behavior, criticality, and the economics of intervention. That is harder than picking a side — but it is the only approach that actually optimizes reliability and cost simultaneously.
There are three legitimate maintenance strategies, and all three are correct — for the right assets. The mistake is applying a single strategy to an entire asset base.
Run-to-failure (RTF) is not a failure of maintenance discipline. It is the correct strategy for assets that meet all of the following conditions: failure is not a safety risk, failure does not stop production or has an immediate workaround, repair is fast and inexpensive, and the failure mode is not one that cascades to damage other equipment when it occurs.
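Because all of the conditions must hold at once, the RTF test is naturally expressed as a single boolean check. The sketch below is one way to encode it; the `FailureProfile` fields are hypothetical names for the conditions listed above, not fields from any particular CMMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureProfile:
    safety_risk: bool
    stops_production: bool
    has_immediate_workaround: bool
    repair_fast_and_cheap: bool
    cascades_to_other_equipment: bool


def is_rtf_candidate(p: FailureProfile) -> bool:
    """True only if every run-to-failure condition is met."""
    production_ok = (not p.stops_production) or p.has_immediate_workaround
    return (
        not p.safety_risk
        and production_ok
        and p.repair_fast_and_cheap
        and not p.cascades_to_other_equipment
    )


# A redundant blower: failure stops one unit, but the standby takes over.
blower = FailureProfile(False, True, True, True, False)
# A gearbox whose failure damages the driven equipment.
gearbox = FailureProfile(False, True, False, False, True)

print(is_rtf_candidate(blower))   # True
print(is_rtf_candidate(gearbox))  # False
```

A single failed condition disqualifies the asset, which is the point: RTF is assigned by passing every test, not by scoring well on most of them.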
Typical RTF candidates: light bulbs and other illumination components (obvious), small fans and blowers with redundant units, non-critical conveyors with multiple lanes, office equipment. Less obvious RTF candidates: certain packaging components, secondary utility runs with backup feeds, low-speed low-load drive components with large safety margins.
The key discipline with RTF is intentionality: the decision to run to failure must be a deliberate one made after evaluating the asset, not a passive outcome of not getting around to PM. An asset that is "accidentally" run to failure because no one set up PM is a different situation than one that has been consciously assigned RTF status. The first is a management gap; the second is an optimization.
Time-based or interval-based PM is the right strategy for assets with predictable wear patterns where failure modes are well understood and the cost of failure significantly exceeds the cost of planned maintenance. The classic examples: automotive timing belts (known service life, catastrophic failure mode, cheap to replace on schedule), filter elements with predictable loading rates, lubrication intervals for equipment in stable operating conditions.
The essential condition for time-based PM to work as intended: the interval must be set at a value that actually catches most failures before they occur, based on the failure mode's actual behavior — not just the OEM recommendation, which is typically conservative and designed for general conditions. An interval that is too long generates failures; an interval that is too short wastes resources. Finding the right interval requires some failure history data and willingness to adjust based on what the PMs actually find.
Condition-based maintenance (CBM) is triggered not by a calendar but by a measured indicator of equipment health. It is the right strategy for high-value assets with detectable degradation signals — equipment where failure is expensive, the failure mode develops gradually rather than suddenly, and there is a measurable parameter that changes in the period before failure occurs.
The critical requirement for CBM is that the degradation signal must have enough lead time to allow a planned response. A bearing that goes from normal vibration to catastrophic failure in four hours gives you no useful CBM window. A bearing that develops an elevated vibration signature six to eight weeks before failure gives you a reasonable window to plan and execute a replacement during a scheduled maintenance window rather than as an emergency.
Lead time is the determining factor. Without sufficient lead time between detectable degradation and failure, CBM degrades into a monitoring program that confirms failures rather than preventing them.
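The lead-time test can be written as a simple comparison. This sketch assumes the worst case, where degradation begins just after an inspection and goes unnoticed until the next one, so the usable window is the lead time minus the inspection interval. The function name, the 14-day inspection interval, and the 10-day planning time are hypothetical.

```python
def cbm_is_viable(lead_time_days: float,
                  inspection_interval_days: float,
                  planning_days: float) -> bool:
    """True if a planned response fits between detection and failure.

    Worst case: degradation starts right after an inspection and is
    only detected at the next one.
    """
    usable_window = lead_time_days - inspection_interval_days
    return usable_window >= planning_days


# Bearing with a six-week warning, inspected every two weeks,
# with ten days needed to plan and schedule the replacement.
slow_bearing = cbm_is_viable(lead_time_days=42,
                             inspection_interval_days=14,
                             planning_days=10)

# Bearing that fails four hours after the first detectable change.
# Even continuous monitoring (interval 0) leaves no time to plan.
fast_bearing = cbm_is_viable(lead_time_days=4 / 24,
                             inspection_interval_days=0,
                             planning_days=10)

print(slow_bearing, fast_bearing)  # True False
```

Shortening the inspection interval widens the usable window, but it cannot help when the lead time itself is shorter than the time needed to respond.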
| Mode | Trigger | Best Condition | Main Risk | Cost Profile |
|---|---|---|---|---|
| Run-to-Failure | Asset failure | Non-critical, redundant, cheap to fix | Misclassifying a critical asset as RTF | Low planned cost; unpredictable reactive cost |
| Time-Based PM | Calendar interval | Predictable wear; known failure modes | Wrong interval; premature or late replacement | Predictable; may include unnecessary work |
| Condition-Based | Measured parameter exceeds threshold | Gradual degradation with detectable signal and sufficient lead time | Inadequate lead time; sensor reliability | Higher setup cost; lower ongoing parts waste |
The decision process does not require sophisticated analysis tools. It requires four questions, answered honestly for each asset. The answers will point you to the right maintenance mode in almost every case.
The first question is what happens when the asset fails. If failure stops production, creates a safety risk, or causes significant financial damage, the asset is critical and run-to-failure is disqualified. If failure is cosmetic, causes minor inconvenience, or has an immediate workaround, RTF becomes a candidate.
Be honest about "workarounds." A workaround that requires rerouting production through a bottleneck is not a workaround — it is a degraded state that costs money. A workaround that has genuinely zero operational impact qualifies. The distinction matters because misclassifying an asset as having a workaround is one of the most common ways RTF gets assigned incorrectly.
The second question is whether the failure mode is predictable. Predictable failure modes follow a pattern: they occur at characteristic intervals, they show warning signs, they follow a wear curve. Unpredictable failure modes occur randomly, without pattern, often as a result of external events (power surges, process upsets, operator error) rather than wear.
Time-based PM is only effective on predictable failure modes. Applying a time-based PM interval to a failure mode that is fundamentally random — like electronic component failure from voltage spikes — produces compliance numbers without reliability improvement, because the PM tasks are not aligned with the actual failure mechanism.
The third question, whether a detectable signal precedes failure, determines whether condition-based maintenance is feasible. If there is no parameter that changes detectably before failure, CBM is not available, and you are limited to time-based PM or RTF. If there is a detectable signal but the lead time is short (less than one PM planning cycle), CBM requires continuous or near-continuous monitoring to be useful.
For most rotating equipment, the answer is yes — vibration, temperature, and oil analysis all provide detectable signals with meaningful lead times for most bearing and lubrication failure modes. For electrical components, the answer is often no — electronic failures frequently provide no advance warning. This is why electrical components are often correctly assigned to RTF or time-based PM despite being on critical equipment.
The fourth question is the economic filter: does prevention cost less than failure? Even for critical assets with predictable failure modes and detectable signals, the right maintenance mode depends on whether the cost of prevention justifies the cost of failure. For a $200 component on a non-critical system with a $500 repair cost, the economics may favor RTF even if the failure mode is predictable. For a $500 component on a critical system with a $50,000 failure cost, the economics strongly favor PM.
Calculate this explicitly for your highest-cost PM tasks and your highest-frequency failures. The ratio of PM cost to failure cost is the single most useful number in maintenance budgeting — it tells you which PM investments are economically justified and which ones are consuming budget without proportionate benefit.
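One way to make the calculation explicit is to annualize both sides. The sketch below uses the two dollar examples from this section, but the PM frequencies and the failure rates are hypothetical, and it makes the simplifying assumption that the PM prevents all of the failures it targets.

```python
def pm_to_failure_cost_ratio(pm_cost: float,
                             pms_per_year: float,
                             failure_cost: float,
                             failures_per_year_without_pm: float) -> float:
    """Annual PM cost divided by the annual failure cost it is meant to avoid.

    Below 1.0 the PM pays for itself; well above 1.0 it is consuming
    budget without proportionate benefit. Assumes PM prevents all
    targeted failures, which overstates the benefit.
    """
    annual_pm = pm_cost * pms_per_year
    annual_failure = failure_cost * failures_per_year_without_pm
    if annual_failure <= 0:
        raise ValueError("expected annual failure cost must be positive")
    return annual_pm / annual_failure


# $200 component, replaced quarterly; a failure costs $500 and would
# occur about once every two years without PM (assumed rate).
low_stakes = pm_to_failure_cost_ratio(200, 4, 500, 0.5)

# $500 component, replaced twice a year; a failure costs $50,000 and
# would occur about once every two years without PM (assumed rate).
high_stakes = pm_to_failure_cost_ratio(500, 2, 50_000, 0.5)

print(f"low-stakes ratio:  {low_stakes:.2f}")
print(f"high-stakes ratio: {high_stakes:.2f}")
```

The first ratio comes out well above 1, which points toward RTF; the second is a small fraction of 1, which is the signature of a PM that is clearly worth doing.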
Work through the four questions for each asset on your critical equipment list. An asset that answers consequence = high, failure predictable = yes, detectable signal = yes, and PM cost < failure cost = clearly yes belongs on a condition-based or time-based PM program. An asset that answers consequence = low, predictable = no, signal = no, and PM cost > failure cost is an RTF candidate. Most assets will fall somewhere between those extremes, and the questions will point you to the right trade-off.
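The four answers can be combined into a first-pass classification. The sketch below is one possible encoding, not a definitive rule set. Preferring condition-based over time-based when both are feasible is an assumption, and so is the `REVIEW` outcome for critical assets where no mode fits cleanly.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    RTF = "run-to-failure"
    TIME_BASED = "time-based PM"
    CONDITION_BASED = "condition-based"
    REVIEW = "needs case-by-case review"  # hypothetical catch-all


@dataclass(frozen=True)
class Answers:
    high_consequence: bool         # Question 1
    predictable: bool              # Question 2
    detectable_signal: bool        # Question 3 (with usable lead time)
    pm_cheaper_than_failure: bool  # Question 4


def choose_mode(a: Answers) -> Mode:
    """First-pass maintenance mode from the four answers."""
    fallback = Mode.REVIEW if a.high_consequence else Mode.RTF
    if not a.pm_cheaper_than_failure:
        return fallback
    if a.detectable_signal:
        return Mode.CONDITION_BASED  # assumed tie-break over time-based
    if a.predictable:
        return Mode.TIME_BASED
    return fallback


# The two extremes described above.
print(choose_mode(Answers(True, True, True, True)).value)
print(choose_mode(Answers(False, False, False, False)).value)
```

Assets that land on `REVIEW` are the in-between cases, for example a critical asset with a random failure mode and no detectable signal, where options such as redundancy or redesign may need to be weighed instead.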
The four-question framework will tell you which maintenance mode is right for each asset. But knowing the right mode is not enough — and most maintenance programs that fail after a thoughtful design phase are failing in execution, not in strategy. The most common disconnect is one that most managers do not see until they have been burned by it.
The most widespread gap in operational maintenance programs is PM tasks that do not match the failure modes they are meant to prevent, and it is invisible until you look for it deliberately. A PM schedule can be executing at 95 percent compliance, and the PM tasks can be genuinely well-intentioned, and the program can still fail to prevent the failures it was designed to prevent, because the tasks are not aligned with the actual failure modes of the equipment.
The most common version of this problem: a PM procedure that was copied from the OEM manual or from a similar asset, without being verified against the actual failure history of this specific equipment in this specific operating environment. The OEM manual says to inspect the coupling annually. But the coupling on this particular machine, in this particular application, fails from fatigue cracking at the keyway — a failure mode that is not visible during a routine visual inspection and that would require a dye-penetrant test to catch in its early stages. The inspection is being done; it is not catching the failure it needs to catch.
A second version: PM tasks focused on the component that fails rather than the condition that causes the component to fail. Replacing a filter is not PM — it is reactive maintenance on a slow-developing failure. PM on the system that the filter protects means investigating why the filter is loading faster than expected, what contaminants are present, and whether an upstream process change is introducing contamination that was not there originally. Component replacement is necessary, but it is the last step of PM execution, not the whole of it.
The clearest indicator that PM tasks are not matching failure modes is a high PM compliance rate combined with a flat or worsening MTBF on the assets covered by the PM program. If you are executing 90 percent of your PMs and the equipment is still failing at the same rate, your PMs are not aligned with your failure modes. The activity is happening; the outcome is not.
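This indicator is easy to compute once you have operating hours, failure counts, and PM compliance for two comparable periods. The sketch below is a minimal version. The 90 percent compliance floor comes from the paragraph above, while treating less than 5 percent MTBF improvement as "flat" is an assumed threshold.

```python
def mtbf(operating_hours: float, failures: int) -> float:
    """Mean time between failures; infinite if no failures occurred."""
    if failures == 0:
        return float("inf")
    return operating_hours / failures


def pm_misaligned(compliance: float,
                  mtbf_previous: float,
                  mtbf_current: float,
                  compliance_floor: float = 0.90,
                  min_improvement: float = 0.05) -> bool:
    """Flag high PM compliance paired with flat or worsening MTBF."""
    if compliance < compliance_floor:
        return False  # low compliance is a different problem
    return mtbf_current < mtbf_previous * (1.0 + min_improvement)


# Hypothetical asset: 4,000 operating hours and 5 failures in each period.
previous = mtbf(4000, 5)
current = mtbf(4000, 5)

print(pm_misaligned(0.93, previous, current))  # True: PMs done, no improvement
print(pm_misaligned(0.60, previous, current))  # False: compliance is the issue
```

Because MTBF is a lagging indicator, compare periods long enough to contain several failures; a single quarter on a reliable asset will mostly show noise.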
When you see this pattern, the right response is not to increase PM frequency. It is to pull the failure history for the assets in question and compare the actual failure modes that have occurred against the PM tasks that are supposed to prevent them. In most cases, you will find a mismatch: failures from failure modes that the PM tasks either do not address or address inadequately.
Alignment between PM tasks and failure modes requires knowing what the failure modes actually are — which means failure history data, not just OEM recommendations. For every critical asset on a PM program, answer this question: in the last 24 months, what did this asset actually fail from? Then look at the PM task list and ask: which of these tasks directly addresses that failure mode? If the answer is "none" or "indirectly," the PM task list needs revision.
This review does not need to be exhaustive or formal. A one-hour conversation between the maintenance manager and the technician most familiar with each critical asset will surface the most significant mismatches. The technicians who work on equipment every day almost always know that the PM tasks miss the things that actually break — they just have not been asked in a way that leads to a program revision.
For your three highest-frequency failures in the last 12 months, pull the PM task list for each asset and ask whether any task specifically addresses the failure mode that produced those failures. If the answer is no for any of them, you have identified an execution gap that no amount of PM compliance improvement will fix. The task list needs to change — not the compliance rate.
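If your failure history and PM task lists are available as data, this check takes only a few lines. The sketch below assumes each PM task has been tagged with the failure modes it is meant to address. The asset names, task names, and failure log are hypothetical.

```python
from collections import Counter

# (asset, failure mode) for each failure in the last 12 months.
failure_log = [
    ("pump-07", "keyway fatigue crack"),
    ("pump-07", "keyway fatigue crack"),
    ("conveyor-02", "bearing wear"),
    ("conveyor-02", "bearing wear"),
    ("conveyor-02", "bearing wear"),
    ("mixer-01", "seal leak"),
    ("mixer-01", "seal leak"),
    ("fan-11", "belt slip"),
]

# For each asset: PM task -> failure modes that task addresses.
pm_tasks = {
    "pump-07": {"visual coupling inspection": {"misalignment"}},
    "conveyor-02": {"vibration check": {"bearing wear"}},
    "mixer-01": {"lubrication": {"bearing wear"}},
}


def uncovered_top_failures(log, tasks, top_n=3):
    """Top-N failure modes that no PM task on that asset addresses."""
    gaps = []
    for (asset, mode), count in Counter(log).most_common(top_n):
        addressed = set().union(*tasks.get(asset, {}).values())
        if mode not in addressed:
            gaps.append((asset, mode, count))
    return gaps


gaps = uncovered_top_failures(failure_log, pm_tasks)
for asset, mode, count in gaps:
    print(f"{asset}: '{mode}' failed {count}x, no PM task addresses it")
```

The tagging step is where the real work is. It is the structured version of asking the technician which tasks actually target what breaks.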
This article gives you the conceptual framework for choosing the right maintenance mode and identifying execution gaps. What it does not give you is the structured implementation system: the criticality scoring worksheets, the PM task templates built around failure modes, the scheduling framework that actually works in a multi-shift operation, and the KPI structure that tells you whether your program is working or just active.
The Preventive Maintenance Playbook is that system. It takes the framework in this article and gives you the tools to implement it — step by step, starting with your highest-priority assets, without a CMMS rollout or a six-month project plan required to begin.