System Reliability
Why Reliability Theory?
Every physical system eventually fails. The engineering question is not whether failure will occur, but when and how likely. Reliability theory gives us a probability model for this question: each component operates correctly with probability $p_i$ and fails with probability $1 - p_i$, independently of all other components. The system is reliable if and only if a sufficient subset of components is operational.
In wireless communications, this model applies to redundant base station deployments, multi-path routing in mesh networks, and redundant power amplifiers in phased-array transmitters. Understanding how component-level reliabilities combine into system-level reliability is an indispensable engineering skill.
Definition: Component Reliability
Component Reliability
A component $i$ is a binary device: it is either working (event $W_i$) or failed (event $W_i^c$). The reliability of component $i$ is $p_i = P(W_i)$. Components are assumed independent: knowledge that component $i$ has failed gives no information about whether component $j$ has failed.
Independence is a modeling assumption, not a physical law. Components powered by the same power bus are not independent: their failures are positively correlated. In that case the formulas below give overly optimistic reliability estimates.
Reliability
The probability that a component or system performs its intended function over a specified period under specified operating conditions. Written $p_i$ for component $i$ and $R_s$ for a system.
Related: Series System, Parallel System, General Inclusion-Exclusion Formula
Definition: Series System
Series System
A series system of $n$ components works if and only if all components work: $R_s = P(W_1 \cap W_2 \cap \cdots \cap W_n) = \prod_{i=1}^{n} p_i.$ Since each $p_i \le 1$, the product is at most $\min_i p_i$, and adding components to a series system can only decrease reliability.
A series system models scenarios where every single component is critical: a single break in a power line, a failed decoder stage, or a missing link in a store-and-forward relay chain.
Series System
A system that works if and only if all components work. Reliability: $R_s = \prod_{i=1}^{n} p_i$.
Related: Reliability, Parallel System
Definition: Parallel System
Parallel System
A parallel system of $n$ components works if and only if at least one component works: $R_s = P(W_1 \cup W_2 \cup \cdots \cup W_n) = 1 - \prod_{i=1}^{n} (1 - p_i).$ Adding components to a parallel system can only increase reliability.
The derivation uses independence and De Morgan's law: $P\left(\bigcup_{i=1}^{n} W_i\right) = 1 - P\left(\bigcap_{i=1}^{n} W_i^c\right) = 1 - \prod_{i=1}^{n} (1 - p_i)$.
Parallel System
A system that works if and only if at least one of its components works. Reliability: $R_s = 1 - \prod_{i=1}^{n} (1 - p_i)$.
Related: Reliability, Series System
Example: Series vs. Parallel: A Numerical Comparison
You have $n = 5$ components each with reliability $p = 0.9$. Compute the system reliability when the components are connected (a) in series and (b) in parallel.
Series system
$R_{\text{series}} = p^n = 0.9^5 \approx 0.590.$ Even with 90%-reliable components, a 5-stage series system is barely 59% reliable: every additional stage degrades the system.
Parallel system
$R_{\text{parallel}} = 1 - (1 - p)^n = 1 - 0.1^5 = 1 - 10^{-5} = 0.99999.$ Redundancy drives the failure probability down to $10^{-5}$ at the system level.
Takeaway
Series connections multiply success probabilities: the system is less reliable than any single component. Parallel connections multiply failure probabilities in the failure domain: the system failure probability shrinks exponentially with redundancy.
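The two formulas are a few lines of code. This sketch (the function names are illustrative, not from the text) reproduces the numbers in the worked example above:

```python
def series_reliability(p: float, n: int) -> float:
    """All n components must work: R = p**n."""
    return p ** n

def parallel_reliability(p: float, n: int) -> float:
    """At least one of n components must work: R = 1 - (1 - p)**n."""
    return 1 - (1 - p) ** n

p, n = 0.9, 5
print(round(series_reliability(p, n), 5))    # 0.59049
print(round(parallel_reliability(p, n), 5))  # 0.99999
```

The symmetry is visible in the code: series multiplies success probabilities, parallel multiplies failure probabilities and complements.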
Series vs. Parallel System Reliability
Explore how system reliability depends on component count and individual reliability for both series and parallel configurations. Observe that parallel systems improve dramatically with redundancy while series systems deteriorate.
Theorem: Inclusion-Exclusion for System Reliability
Let $W_1, \dots, W_n$ be the events that components $1, \dots, n$ are working. For any system whose working condition is captured by the union $W_1 \cup \cdots \cup W_n$, the probability satisfies $P\left(\bigcup_{i=1}^{n} W_i\right) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{i_1 < \cdots < i_k} P(W_{i_1} \cap \cdots \cap W_{i_k}).$ Under independence, $P(W_{i_1} \cap \cdots \cap W_{i_k}) = p_{i_1} p_{i_2} \cdots p_{i_k}$, and the formula simplifies to a signed sum of products of component reliabilities.
Direct counting over-counts outcomes where multiple components work. Inclusion-exclusion corrects by alternately adding and subtracting the higher-order intersection terms. Each inner sum accounts for all $\binom{n}{k}$ subsets of size $k$.
For two components, verify directly: $P(W_1 \cup W_2) = P(W_1) + P(W_2) - P(W_1 \cap W_2) = p_1 + p_2 - p_1 p_2$.
For three components, identify all terms $p_i$, $p_i p_j$, and $p_1 p_2 p_3$, and which get $+$ vs. $-$ signs.
The general formula follows from the indicator identity $\mathbf{1}_{W_1 \cup \cdots \cup W_n} = 1 - \prod_{i=1}^{n} (1 - \mathbf{1}_{W_i})$, then taking expectations.
Indicator identity
For any events $W_1, \dots, W_n$, the indicator of their union satisfies $\mathbf{1}_{\bigcup_i W_i} = 1 - \prod_{i=1}^{n} (1 - \mathbf{1}_{W_i}).$ This is a deterministic identity: both sides equal 1 if at least one $W_i$ occurs, and 0 if none occur.
Expand the product
Distributing the product over all subsets $S \subseteq \{1, \dots, n\}$: $\prod_{i=1}^{n} (1 - \mathbf{1}_{W_i}) = \sum_{S} (-1)^{|S|} \prod_{i \in S} \mathbf{1}_{W_i},$ so $\mathbf{1}_{\bigcup_i W_i} = \sum_{S \neq \emptyset} (-1)^{|S|+1} \prod_{i \in S} \mathbf{1}_{W_i}$.
Take expectations
Applying $E[\mathbf{1}_A] = P(A)$ and using linearity of expectation: $P\left(\bigcup_i W_i\right) = \sum_{S \neq \emptyset} (-1)^{|S|+1} P\left(\bigcap_{i \in S} W_i\right)$.
Key Takeaway
Inclusion-exclusion converts a union-of-events probability into a signed sum of intersection probabilities, which under independence factor into products of component reliabilities. This is the workhorse for analyzing complex networks that are neither purely series nor purely parallel.
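This recipe mechanizes directly. As an illustration (the function name and interface are my own, not from the text), the following sketch runs inclusion-exclusion over a list of minimal path sets, using the fact that an intersection of path events requires every component in the union of those paths to work:

```python
from itertools import combinations

def union_reliability(path_sets, p):
    """Inclusion-exclusion over minimal path sets.

    path_sets : list of sets of component indices (0-based)
    p         : list of component reliabilities, p[i] for component i

    Each nonempty subset S of the path sets contributes
    (-1)^(|S|+1) * prod(p[i] for i in the union of paths in S),
    by independence of the components.
    """
    total = 0.0
    for r in range(1, len(path_sets) + 1):
        for subset in combinations(path_sets, r):
            comps = set().union(*subset)
            term = 1.0
            for i in comps:
                term *= p[i]
            total += (-1) ** (r + 1) * term
    return total

# Two components in parallel: path sets {0} and {1};
# P(W1 ∪ W2) = p1 + p2 - p1*p2 = 0.9 + 0.8 - 0.72 = 0.98
print(union_reliability([{0}, {1}], [0.9, 0.8]))
```

The runtime is exponential in the number of path sets, which is fine for small networks like the bridge below but motivates the Bonferroni bounds later in this section.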
Definition: Bridge Network
Bridge Network
A bridge network (also called a Wheatstone bridge topology) has five components arranged so that no simple series-parallel reduction applies. Labeling the components 1–5, where component 5 is the "bridge" link, the network works if and only if at least one of the following minimal path sets is operational: $\{1,2\}$, $\{3,4\}$, $\{1,5,4\}$, $\{3,5,2\}$. The system reliability requires inclusion-exclusion over these four paths.
The bridge network is the canonical example demonstrating that inclusion-exclusion is necessary: no series-parallel simplification can reduce it. It appears in relay networks and multi-hop wireless routing problems.
Example: Bridge Network Reliability via Inclusion-Exclusion
Compute the reliability of the bridge network with five independent components each having reliability $p$. Apply inclusion-exclusion over the four minimal path sets $P_1 = \{1,2\}$, $P_2 = \{3,4\}$, $P_3 = \{1,5,4\}$, $P_4 = \{3,5,2\}$.
Probabilities of individual path sets working
Let $A_k$ be the event that all components in path $P_k$ work. Then $P(A_1) = P(A_2) = p^2$ and $P(A_3) = P(A_4) = p^3$.
Pairwise intersections
The system works iff $A_1 \cup A_2 \cup A_3 \cup A_4$ occurs. Pairwise intersections (all components in both paths must work): five of the six pairs cover exactly four distinct components, contributing $p^4$ each, while $A_3 \cap A_4$ covers all five components, contributing $p^5$. The pairwise sum is $5p^4 + p^5$.
Triple and quadruple intersections
Every triple intersection and the quadruple intersection cover all five components: $P(A_i \cap A_j \cap A_k) = p^5$ for each of the four triples, and $P(A_1 \cap A_2 \cap A_3 \cap A_4) = p^5$.
Apply inclusion-exclusion
$R_s = (2p^2 + 2p^3) - (5p^4 + p^5) + 4p^5 - p^5 = 2p^2 + 2p^3 - 5p^4 + 2p^5.$ Sanity checks: $R_s(1) = 2 + 2 - 5 + 2 = 1$ and $R_s(0) = 0$. $\blacksquare$
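As an independent sanity check on the algebra, one can enumerate all $2^5$ component states directly and compare against the polynomial (the path labeling follows the definition above; the helper names are my own):

```python
from itertools import product

# Minimal path sets of the bridge, with component 5 as the cross-link
PATHS = [{1, 2}, {3, 4}, {1, 5, 4}, {3, 5, 2}]

def bridge_reliability(p: float) -> float:
    """Exact reliability by summing over all 2^5 component states."""
    total = 0.0
    for bits in product((True, False), repeat=5):
        state = dict(zip(range(1, 6), bits))
        # System works iff some path has all of its components up
        if any(all(state[i] for i in path) for path in PATHS):
            prob = 1.0
            for i in range(1, 6):
                prob *= p if state[i] else 1 - p
            total += prob
    return total

p = 0.9
poly = 2*p**2 + 2*p**3 - 5*p**4 + 2*p**5
print(round(bridge_reliability(p), 5), round(poly, 5))  # 0.97848 0.97848
```

Brute-force enumeration scales as $2^n$, so it is only a verification tool, but for five components it confirms the inclusion-exclusion result at every $p$.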
Bridge Network Reliability
The bridge network computed via inclusion-exclusion. Compare with the lower bounds from individual path sets and upper bounds from the union bound. The bridge link (component 5) can be toggled off (setting $p_5 = 0$) to see what happens when the cross-link fails.
Historical Note: Origins of Reliability Theory
1950s–1965: Modern reliability theory emerged from the post-World War II U.S. military effort to improve the dependability of complex electronics. The 1950s saw the failure rate of airborne electronics during missions rise alarmingly; the U.S. Department of Defense commissioned a study (1957) that established the field. Richard Barlow and Frank Proschan's 1965 textbook Mathematical Theory of Reliability provided the rigorous probabilistic foundations, including the concept of coherent systems and the role of inclusion-exclusion in structure function analysis.
Definition: Coherent System
Coherent System
A system is coherent if:
- Replacing a failed component by a working one can never cause a working system to fail (monotone structure function).
- Every component is relevant: there exist states of the other components such that changing component $i$ from failed to working changes the system state.
Formally, the structure function $\phi: \{0,1\}^n \to \{0,1\}$ (where $\phi(\mathbf{x}) = 1$ means the system works given component state vector $\mathbf{x}$) must be non-decreasing in each argument. Series and parallel systems are both coherent; the bridge network is coherent.
Coherence rules out pathological systems where adding a redundant component can somehow cause failure. Every physically meaningful system design should be coherent.
Theorem: Bonferroni Bounds for System Reliability
Let $W_1, \dots, W_n$ be the events that $n$ independent components with reliabilities $p_1, \dots, p_n$ are working, and let $S_k = \sum_{i_1 < \cdots < i_k} p_{i_1} \cdots p_{i_k} = e_k(p_1, \dots, p_n)$ be the $k$-th elementary symmetric polynomial. The inclusion-exclusion partial sums $\sum_{k=1}^{m} (-1)^{k+1} S_k$ alternate around the true probability $P\left(\bigcup_i W_i\right)$: truncating at odd $m$ gives an upper bound, at even $m$ a lower bound. In particular, the union bound (first-order approximation) gives $P\left(\bigcup_i W_i\right) \le S_1 = \sum_i p_i$, and the second-order lower bound is $P\left(\bigcup_i W_i\right) \ge S_1 - S_2$.
The Bonferroni inequalities are the truncations of inclusion-exclusion. Stopping at an odd term gives an upper bound; stopping at an even term gives a lower bound. This is useful when exact computation is expensive but two-sided bounds suffice.
Consider the indicator identity for $\mathbf{1}_{\bigcup_i W_i}$ and truncate the polynomial expansion.
The truncation error at order $m$ has the same sign as the first omitted term, whose sign is $(-1)^{m+2}$.
Indicator expansion
From the proof of the inclusion-exclusion theorem, $\mathbf{1}_{\bigcup_i W_i} = \sum_{k=1}^{n} (-1)^{k+1} \sum_{i_1 < \cdots < i_k} \mathbf{1}_{W_{i_1}} \cdots \mathbf{1}_{W_{i_k}}.$ This is an exact equality of random variables.
Truncation and sign of error
Truncating at order $m$, the remainder has the same sign as the next, $(m+1)$-th, term, whose sign is $(-1)^{m+2}$. For $m$ odd this is negative, so the truncation overestimates: an upper bound. For $m$ even it is positive, so the truncation underestimates: a lower bound.
Take expectations
Taking $E[\cdot]$ preserves the inequalities and converts indicators into probabilities: $\sum_{k=1}^{m} (-1)^{k+1} S_k$ bounds $P\left(\bigcup_i W_i\right)$ from above for odd $m$ and from below for even $m$.
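For $n$ identical components, $S_k = \binom{n}{k} p^k$, and the alternating bounds are easy to tabulate numerically. This sketch (the function name and the values $p = 0.3$, $n = 4$ are my own illustrative choices) compares each truncation against the exact parallel-system probability:

```python
from math import comb

def bonferroni_partial_sums(p: float, n: int) -> list:
    """Partial sums sum_{k<=m} (-1)^(k+1) C(n,k) p^k for m = 1..n.

    With n identical independent components, S_k = C(n,k) p^k; odd-m
    truncations over-shoot P(at least one works), even-m truncations
    under-shoot, and m = n recovers it exactly.
    """
    sums, partial = [], 0.0
    for k in range(1, n + 1):
        partial += (-1) ** (k + 1) * comb(n, k) * p ** k
        sums.append(partial)
    return sums

p, n = 0.3, 4
exact = 1 - (1 - p) ** n  # parallel-system reliability, = 0.7599
for m, b in enumerate(bonferroni_partial_sums(p, n), start=1):
    kind = "upper" if m % 2 == 1 else "lower"
    print(f"m={m}: {b:.4f} ({kind} bound vs exact {exact:.4f})")
```

Note how the bounds tighten as $m$ grows and coincide with the exact value at $m = n$.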
Why This Matters: Wireless Network Availability and Diversity
In a multi-hop wireless relay network with $n$ hops on the path from source to destination, successful delivery requires all hops to succeed: a series system. If each hop succeeds with probability $p$ (determined by the fading margin and coding scheme), the end-to-end success probability is $p^n$, degrading rapidly with path length.
Spatial diversity combats this: a network with $L$ independent paths (frequency diversity, spatial diversity from multiple antennas, or route diversity) acts as a parallel system. End-to-end failure probability drops to $(1 - p)^L$. At high SNR, where $1 - p \ll 1$, diversity order $L$ suppresses the outage probability from order $1 - p$ to order $(1 - p)^L$.
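A quick numerical sketch of this trade-off (the hop count, path count, and per-hop success probability below are illustrative values, not from the text):

```python
def chain_success(p: float, n: int) -> float:
    """n-hop series relay chain: every hop must succeed."""
    return p ** n

def diversity_outage(p: float, L: int) -> float:
    """L independent end-to-end paths: outage only if all L fail."""
    return (1 - p) ** L

p = 0.9
print(chain_success(p, 10))           # ~0.349: long chains decay fast
for L in (1, 2, 3):
    print(L, diversity_outage(p, L))  # outage shrinks as (1-p)^L
```

Even modest redundancy (three paths) pushes outage to roughly $10^{-3}$ at $p = 0.9$, whereas lengthening a series chain only makes things worse.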
Quick Check
Three independent components each have reliability $p$. You connect them in a 2-out-of-3 majority system: the system works if at least 2 of the 3 components work. What is the system reliability?
The system works when exactly 2 components work (probability $\binom{3}{2} p^2 (1 - p) = 3p^2(1 - p)$) or all 3 work ($p^3$). Total: $R_s = 3p^2(1 - p) + p^3 = 3p^2 - 2p^3$.
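The majority-vote pattern generalizes to $k$-out-of-$n$ systems via the binomial distribution. A small helper (the function name is my own) evaluates it; the check uses $p = 0.9$ as an illustrative value:

```python
from math import comb

def k_out_of_n_reliability(p: float, k: int, n: int) -> float:
    """System works iff at least k of n identical independent components work."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# 2-out-of-3 majority at p = 0.9: 3*(0.9^2)*(0.1) + 0.9^3 = 0.243 + 0.729 = 0.972
print(round(k_out_of_n_reliability(0.9, 2, 3), 3))  # 0.972
```

Note the special cases: $k = n$ recovers the series formula $p^n$ and $k = 1$ the parallel formula $1 - (1-p)^n$.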
Redundancy vs. Cost Trade-off in Wireless Systems
Parallel redundancy (adding backup components) increases reliability at the cost of additional hardware, power, and management overhead. In 5G base stations, the 3GPP standard requires 99.999% availability (five 9s) over the air interface. Achieving this with 99%-reliable power amplifiers requires three parallel amplifiers (each has failure probability $0.01$; three in parallel give failure probability $0.01^3 = 10^{-6}$, exceeding the five-nines requirement).
- 3GPP TS 22.261 mandates 99.999% availability for Ultra-Reliable Low Latency Communication (URLLC)
- Each additional parallel unit roughly doubles the hardware and power cost
- Active standby (hot spare) achieves faster failover than passive standby at higher steady-state cost
Common Mistake: Confusing Series and Parallel Reliability Formulas
Mistake:
A common mistake is to apply the parallel formula to a series system or vice versa. The two formulas look structurally similar and are easily swapped when working quickly.
Correction:
Remember the logic: series = AND (all must work), parallel = OR (at least one must work). The series formula is the probability of an intersection; the parallel formula uses De Morgan to convert a union to a complement of an intersection. Always check: does the system require every component, or just one?
Series vs. Parallel System Properties
| Property | Series System | Parallel System |
|---|---|---|
| Logic | ALL components must work (AND) | AT LEAST ONE must work (OR) |
| Reliability formula | $R_s = \prod_{i=1}^{n} p_i$ | $R_s = 1 - \prod_{i=1}^{n} (1 - p_i)$ |
| Effect of adding components | Decreases | Increases |
| Bottleneck | Weakest component dominates | Strongest component dominates |
| Wireless analog | Multi-hop relay chain | Spatial diversity combining |
| Asymptotic ($n \to \infty$) | $R_s \to 0$ (even if every $p_i$ is close to 1, as long as $p_i < 1$) | $R_s \to 1$ (even if every $p_i$ is small, as long as $p_i > 0$) |
Series and Parallel System Block Diagrams
Series/Parallel Reliability: Block Diagram Animation
Common Mistake: Independence Is an Assumption, Not a Fact
Mistake:
Applying the independence formula when the components are actually correlated leads to systematically optimistic reliability estimates.
Correction:
In practice, components may share a power supply, a common mode of failure (e.g., an earthquake), or be manufactured by the same defective production batch. These common-cause failures violate independence. The correct analysis uses the law of total probability: condition on whether the common-cause event occurs.