System Reliability

Why Reliability Theory?

Every physical system eventually fails. The engineering question is not whether failure will occur, but when and how likely. Reliability theory gives us a probability model for this question: each component operates correctly with probability $R_i$ and fails with probability $1 - R_i$, independently of all other components. The system is reliable if and only if a sufficient subset of components is operational.

In wireless communications, this model applies to redundant base station deployments, multi-path routing in mesh networks, and redundant power amplifiers in phased-array transmitters. Understanding how component-level reliabilities combine into system-level reliability is an indispensable engineering skill.

Definition:

Component Reliability

A component is a binary device: it is either working (event $A_i$) or failed (event $F_i = A_i^c$). The reliability of component $i$ is

$$R_i = \mathbb{P}(A_i) \in [0, 1].$$

Components are assumed independent: knowledge that component $j$ has failed gives no information about whether component $i$ has failed.

Independence is a modeling assumption, not a physical law. Components powered by the same power bus are not independent: their failures are positively correlated. In that case the formulas below for redundant (parallel) configurations give overly optimistic reliability estimates.

Reliability

The probability that a component or system performs its intended function over a specified period under specified operating conditions. Written $R_i$ for component $i$ and $R_s$ for a system.

Related: Series System, Parallel System, General Inclusion-Exclusion Formula

Definition:

Series System

A series system of $n$ components works if and only if all components work:

$$R_s^{\text{series}} = \mathbb{P}(A_1 \cap A_2 \cap \cdots \cap A_n) = \prod_{i=1}^n R_i.$$

Since $R_i \leq 1$, the product is at most $\min_i R_i$, and adding components to a series system can only decrease reliability.

A series system models scenarios where every single component is critical: a single break in a power line, a failed decoder stage, or a missing link in a store-and-forward relay chain.

Series System

A system that works if and only if all $n$ components work. Reliability: $R_s = \prod_{i=1}^n R_i$.

Related: Reliability, Parallel System

Definition:

Parallel System

A parallel system of $n$ components works if and only if at least one component works:

$$R_s^{\text{parallel}} = \mathbb{P}(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \prod_{i=1}^n (1 - R_i).$$

Adding components to a parallel system can only increase reliability.

The derivation uses independence and De Morgan's law: $\mathbb{P}(A_1 \cup \cdots \cup A_n) = 1 - \mathbb{P}(F_1 \cap \cdots \cap F_n) = 1 - \prod_{i=1}^n (1 - R_i)$.
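The series and parallel formulas can be sketched in a few lines of Python. This is a minimal illustration assuming independent components; the function names `series_reliability` and `parallel_reliability` are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

def series_reliability(reliabilities):
    """P(all components work), assuming independent components."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """P(at least one component works) = 1 - P(all fail), assuming independence."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Illustrative component reliabilities (not taken from the text)
rs = [0.9, 0.95, 0.99]
print("series:  ", series_reliability(rs))
print("parallel:", parallel_reliability(rs))
```

The series value can never exceed the smallest entry of `rs`, and the parallel value can never fall below the largest one, which matches the monotonicity remarks in the two definitions.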

Parallel System

A system that works if and only if at least one of its $n$ components works. Reliability: $R_s = 1 - \prod_{i=1}^n (1 - R_i)$.

Related: Reliability, Series System

Example: Series vs. Parallel: A Numerical Comparison

You have $n = 5$ components each with reliability $R_i = 0.9$. Compute the system reliability when the components are connected (a) in series and (b) in parallel.
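A short worked computation, applying the series and parallel formulas from the definitions directly and assuming independent components:

```latex
\begin{align*}
\text{(a) series:}\quad   R_s &= \prod_{i=1}^{5} R_i = 0.9^5 = 0.59049, \\
\text{(b) parallel:}\quad R_s &= 1 - \prod_{i=1}^{5} (1 - R_i) = 1 - 0.1^5 = 0.99999.
\end{align*}
```

The same five components give roughly 59% reliability in series and five-nines reliability in parallel.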

Series vs. Parallel System Reliability (interactive figure)

Explore how system reliability depends on component count $n$ and individual reliability $p$ for both series and parallel configurations. Observe that parallel systems improve dramatically with redundancy while series systems deteriorate.

Parameters: $p = 0.9$, $n = 10$

Theorem: Inclusion-Exclusion for System Reliability

Let $A_1, \ldots, A_n$ be the events that components $1, \ldots, n$ are working. For any system whose working condition is captured by $A = A_1 \cup A_2 \cup \cdots \cup A_n$, the probability satisfies

$$\mathbb{P}(A) = \sum_{k=1}^n (-1)^{k-1} \sum_{1 \leq i_1 < \cdots < i_k \leq n} \mathbb{P}(A_{i_1} \cap \cdots \cap A_{i_k}).$$

Under independence, $\mathbb{P}(A_{i_1} \cap \cdots \cap A_{i_k}) = R_{i_1} \cdots R_{i_k}$, and the formula simplifies to

$$\mathbb{P}(A) = \sum_{k=1}^n (-1)^{k-1} \sum_{1 \leq i_1 < \cdots < i_k \leq n} \prod_{j=1}^k R_{i_j}.$$

Direct counting over-counts outcomes where multiple components work. Inclusion-exclusion corrects by alternately adding and subtracting the higher-order intersection terms. Each term accounts for all $\binom{n}{k}$ subsets of size $k$.
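As a sanity check, the signed sum can be evaluated by brute force and compared against the closed-form parallel formula. The sketch below assumes independent components; `union_reliability_ie` is a hypothetical helper name, and the reliabilities are illustrative values.

```python
from itertools import combinations
from math import prod

def union_reliability_ie(reliabilities):
    """P(A_1 or ... or A_n) by inclusion-exclusion, assuming independent A_i."""
    n = len(reliabilities)
    total = 0.0
    for k in range(1, n + 1):
        sign = (-1) ** (k - 1)
        # Sum of products over all C(n, k) subsets of size k
        s_k = sum(prod(subset) for subset in combinations(reliabilities, k))
        total += sign * s_k
    return total

rs = [0.9, 0.8, 0.7]
closed_form = 1.0 - prod(1.0 - r for r in rs)
print(union_reliability_ie(rs), closed_form)
```

The loop visits all $2^n - 1$ non-empty subsets, so this direct evaluation is only practical for small $n$.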

Key Takeaway

Inclusion-exclusion converts a union-of-events probability into a signed sum of intersection probabilities, which under independence factor into products of component reliabilities. This is the workhorse for analyzing complex networks that are neither purely series nor purely parallel.

Definition:

Bridge Network

A bridge network (also called a Wheatstone bridge topology) has five components arranged so that no simple series-parallel reduction applies. Labeling the components 1–5, where component 5 is the "bridge" link, the network works if and only if at least one of the following minimal path sets is operational:

$$\{1,4\}, \quad \{2,3\}, \quad \{1,5,3\}, \quad \{2,5,4\}.$$

The system reliability requires inclusion-exclusion over these four paths.

The bridge network is the canonical example demonstrating that inclusion-exclusion is necessary: no series-parallel simplification can reduce it. It appears in relay networks and multi-hop wireless routing problems.

Example: Bridge Network Reliability via Inclusion-Exclusion

Compute the reliability of the bridge network with five independent components each having reliability $p$. Apply inclusion-exclusion over the four minimal path sets $P_1 = \{1,4\}$, $P_2 = \{2,3\}$, $P_3 = \{1,5,3\}$, $P_4 = \{2,5,4\}$.
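One way to carry out this computation is sketched below. A collection of path sets is simultaneously operational exactly when every component in their union works, which under independence has probability $p$ raised to the size of that union. The names `PATH_SETS`, `bridge_reliability`, and `bridge_polynomial` are hypothetical; the polynomial is the closed form $2p^2 + 2p^3 - 5p^4 + 2p^5$ quoted for this network.

```python
from itertools import combinations

# Minimal path sets of the bridge network (component 5 is the bridge link)
PATH_SETS = [{1, 4}, {2, 3}, {1, 5, 3}, {2, 5, 4}]

def bridge_reliability(p, path_sets=PATH_SETS):
    """Inclusion-exclusion over path-set events; all components have reliability p."""
    total = 0.0
    for k in range(1, len(path_sets) + 1):
        for chosen in combinations(path_sets, k):
            # All chosen paths work iff every component in their union works
            needed = set().union(*chosen)
            total += (-1) ** (k - 1) * p ** len(needed)
    return total

def bridge_polynomial(p):
    """Closed-form bridge reliability for identical components."""
    return 2 * p**2 + 2 * p**3 - 5 * p**4 + 2 * p**5

for p in (0.5, 0.9, 0.99):
    print(p, bridge_reliability(p), bridge_polynomial(p))
```

Collecting terms by hand gives $2p^2 + 2p^3$ from the single paths, $-(5p^4 + p^5)$ from the six pairs, $+4p^5$ from the four triples, and $-p^5$ from the full intersection, which sums to the closed form.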

Bridge Network Reliability (interactive figure)

The bridge network reliability $R_s = 2p^2 + 2p^3 - 5p^4 + 2p^5$, computed via inclusion-exclusion. Compare with the lower bounds from individual path sets and the upper bound from the union bound. The bridge link (component 5) can be toggled off (setting $R_5 = 0$) to see what happens when the cross-link fails.

Historical Note: Origins of Reliability Theory

1950s–1965

Modern reliability theory emerged from the post-World War II U.S. military effort to improve the dependability of complex electronics. The 1950s saw the failure rate of airborne electronics during missions rise alarmingly; the U.S. Department of Defense commissioned a study (1957) that established the field. Richard Barlow and Frank Proschan's 1965 textbook Mathematical Theory of Reliability provided the rigorous probabilistic foundations, including the concept of coherent systems and the role of inclusion-exclusion in structural function analysis.

Definition:

Coherent System

A system is coherent if:

  1. Replacing a failed component by a working one can never cause a working system to fail (monotone structure function).
  2. Every component is relevant: there exist states of the other components such that changing component $i$ from failed to working changes the system state.

Formally, the structure function $\phi: \{0,1\}^n \to \{0,1\}$ (where $\phi(\mathbf{x}) = 1$ means the system works given component state vector $\mathbf{x}$) must be non-decreasing in each argument. Series and parallel systems are both coherent; the bridge network is coherent.

Coherence rules out pathological systems where adding a redundant component can somehow cause failure. Every physically meaningful system design should be coherent.
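For small systems, both coherence conditions can be checked by enumerating all $2^n$ state vectors. The sketch below is a brute-force illustration; `is_coherent` and the three example structure functions are hypothetical names, and the bridge function encodes the path sets $\{1,4\}, \{2,3\}, \{1,5,3\}, \{2,5,4\}$.

```python
from itertools import product

def is_coherent(phi, n):
    """Brute-force check that phi is monotone and every component is relevant."""
    relevant = [False] * n
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]   # repair component i
                if phi(y) < phi(x):
                    return False               # a repair broke the system: not monotone
                if phi(y) > phi(x):
                    relevant[i] = True         # component i matters in this state
    return all(relevant)

series = lambda x: int(all(x))
parallel = lambda x: int(any(x))

def bridge(x):
    # x[0..4] hold the states of components 1..5
    c1, c2, c3, c4, c5 = x
    return int((c1 and c4) or (c2 and c3) or (c1 and c5 and c3) or (c2 and c5 and c4))

print(is_coherent(series, 3), is_coherent(parallel, 3), is_coherent(bridge, 5))
```

A structure function that ignores one of its inputs, or one that fails when a component is repaired, is rejected by the same check.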

Theorem: Bonferroni Bounds for System Reliability

For a coherent system whose working condition is the union $A = A_1 \cup \cdots \cup A_n$, with independent component reliabilities $R_1, \ldots, R_n$, let $S_k = \sum_{|J|=k} \prod_{i \in J} R_i$ be the $k$-th elementary symmetric polynomial. The inclusion-exclusion partial sums alternate around the true system reliability:

$$\sum_{k=1}^{2m}(-1)^{k-1} S_k \;\leq\; R_s \;\leq\; \sum_{k=1}^{2m-1}(-1)^{k-1} S_k, \quad m = 1, 2, \ldots,$$

with the convention $S_k = 0$ for $k > n$. In particular, the union bound (first-order truncation) gives $R_s \leq \sum_{i=1}^n R_i$, and the second-order truncation gives the lower bound $R_s \geq \sum_{i=1}^n R_i - \sum_{i<j} R_i R_j$.

The Bonferroni inequalities are the truncations of inclusion-exclusion. Stopping after an odd number of terms gives an upper bound; stopping after an even number of terms gives a lower bound. This is useful when exact computation is expensive but two-sided bounds suffice.
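A numerical sketch of the truncated sums for a small parallel system, printing each partial sum next to the exact reliability so the alternation can be inspected directly. The helper name `bonferroni_partial_sums` and the reliabilities are illustrative assumptions, and independence is assumed throughout.

```python
from itertools import combinations
from math import prod

def bonferroni_partial_sums(reliabilities):
    """Partial sums of inclusion-exclusion for P(A_1 or ... or A_n), independent A_i."""
    n = len(reliabilities)
    partial, sums = 0.0, []
    for k in range(1, n + 1):
        s_k = sum(prod(c) for c in combinations(reliabilities, k))
        partial += (-1) ** (k - 1) * s_k
        sums.append(partial)
    return sums

rs = [0.3, 0.2, 0.1]
exact = 1.0 - prod(1.0 - r for r in rs)
for k, value in enumerate(bonferroni_partial_sums(rs), start=1):
    # Odd truncations over-count (upper bound); even truncations under-count (lower bound)
    kind = "upper" if k % 2 == 1 else "lower"
    print(f"terms={k}: {value:.4f} ({kind} bound), exact={exact:.4f}")
```

The truncations are informative mainly when the individual probabilities are small; with values near 1 the first-order sum already exceeds 1.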

Why This Matters: Wireless Network Availability and Diversity

In a multi-hop wireless relay network with $L$ hops on the path from source to destination, successful delivery requires all hops to succeed, which makes the path a series system. If each hop succeeds with probability $p$ (determined by the fading margin and coding scheme), the end-to-end success probability is $p^L$, which decays geometrically with path length.

Diversity combats this: a network with $D$ independent paths (frequency diversity, spatial diversity from multiple antennas, or route diversity) acts as a parallel system. End-to-end failure probability drops to $(1-p^L)^D$. At high SNR, where $p \approx 1 - \epsilon$ and hence $1 - p^L \approx L\epsilon$, diversity order $D$ suppresses the outage probability from $O(\epsilon)$ to $O(\epsilon^D)$.
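The combined series-then-parallel structure can be evaluated directly. The sketch below assumes identical, independent hops and independent paths; `relay_outage` and the parameter values are illustrative, not taken from any standard.

```python
def relay_outage(p, hops, paths):
    """Outage probability for `paths` independent routes, each a chain of `hops` hops."""
    path_success = p ** hops               # series: every hop on a route must succeed
    return (1.0 - path_success) ** paths   # parallel: outage only if all routes fail

# Illustrative values: per-hop success 0.99, four hops per route
for d in (1, 2, 3):
    print(f"D={d}: outage={relay_outage(p=0.99, hops=4, paths=d):.3e}")
```

Each added independent route multiplies the outage probability by the single-route outage $1 - p^L$.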

Quick Check

Three independent components each have reliability $p = 0.8$. You connect them in a 2-out-of-3 majority system: the system works if at least 2 of the 3 components work. What is the system reliability?

$3 \times 0.8^2 \times 0.2 + 0.8^3$

$0.8^3 = 0.512$

$1 - 0.2^3 = 0.992$

$1 - 3 \times 0.2^2 \times 0.8$
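To check a candidate answer numerically, a $k$-out-of-$n$ system with identical independent components can be evaluated as a binomial tail sum. The function name `k_out_of_n_reliability` is a hypothetical helper, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def k_out_of_n_reliability(k, n, p):
    """P(at least k of n independent components work), each working with probability p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Special cases: k = n is a series system, k = 1 is a parallel system
print(k_out_of_n_reliability(3, 3, 0.8))
print(k_out_of_n_reliability(2, 3, 0.8))
print(k_out_of_n_reliability(1, 3, 0.8))
```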

⚠️Engineering Note

Redundancy vs. Cost Trade-off in Wireless Systems

Parallel redundancy (adding backup components) increases reliability at the cost of additional hardware, power, and management overhead. In 5G base stations, the 3GPP standard requires 99.999% availability (five 9s) over the air interface. Achieving this with 99%-reliable power amplifiers requires $\lceil \log(10^{-5})/\log(0.01) \rceil = 3$ parallel amplifiers (each has failure probability $0.01$; three in parallel give a failure probability of $0.01^3 = 10^{-6}$, which exceeds the five-nines requirement).
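The sizing calculation can be sketched as a small search for the smallest unit count that meets a target. This assumes identical, independent units in a pure parallel arrangement; `min_parallel_units` is a hypothetical helper. A loop is used instead of the logarithm formula to sidestep floating-point rounding at exact boundaries.

```python
def min_parallel_units(unit_reliability, target_availability):
    """Smallest n with 1 - (1 - R)^n >= target, assuming independent identical units."""
    if not 0.0 < unit_reliability < 1.0:
        raise ValueError("unit_reliability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be strictly between 0 and 1")
    unit_failure = 1.0 - unit_reliability
    n, system_failure = 1, unit_failure
    while 1.0 - system_failure < target_availability:
        n += 1
        system_failure *= unit_failure   # one more unit must also fail
    return n

# 99%-reliable amplifiers against a five-nines target
print(min_parallel_units(0.99, 0.99999))
```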

Practical Constraints

  • 3GPP TS 22.261 mandates 99.999% availability for Ultra-Reliable Low Latency Communication (URLLC)
  • Each additional parallel unit roughly doubles the hardware and power cost
  • Active standby (hot spare) achieves faster failover than passive standby at higher steady-state cost

📋 Ref: 3GPP TS 22.261, Table 7.2.1-1

Common Mistake: Confusing Series and Parallel Reliability Formulas

Mistake:

A common mistake is to apply the parallel formula $R_s = 1 - \prod (1 - R_i)$ to a series system or vice versa. The two formulas look structurally similar and are easily swapped when working quickly.

Correction:

Remember the logic: series = AND (all must work), parallel = OR (at least one must work). The series formula $R_s = \prod R_i$ is the probability of an intersection; the parallel formula $R_s = 1 - \prod (1 - R_i)$ uses De Morgan to convert a union to a complement of an intersection. Always check: does the system require every component, or just one?

Series vs. Parallel System Properties

| Property | Series System | Parallel System |
|---|---|---|
| Logic | ALL components must work (AND) | AT LEAST ONE must work (OR) |
| Reliability formula | $R_s = \prod_{i=1}^n R_i$ | $R_s = 1 - \prod_{i=1}^n (1 - R_i)$ |
| Effect of adding components | Decreases $R_s$ | Increases $R_s$ |
| Bottleneck | Weakest component dominates | Strongest component dominates |
| Wireless analog | Multi-hop relay chain | Spatial diversity combining |
| Asymptotic $n \to \infty$ | $R_s \to 0$ (even if $R_i > 0$) | $R_s \to 1$ (even if $R_i < 1$) |

Series and Parallel System Block Diagrams

Left: series system (components 1 through $n$ in a chain; all must succeed for signal flow). Right: parallel system ($n$ parallel paths; any one suffices). The bridge network (center) cannot be reduced to either form.

Series/Parallel Reliability: Block Diagram Animation

An animated walkthrough showing how a system fails under random component failures. Components light up green (working) or red (failed) in real time, and the system-level status updates as each component state changes.
Each component fails with probability $1 - R_i = 0.1$. The series system fails as soon as any one component fails; the parallel system survives until the last path is cut.

Common Mistake: Independence Is an Assumption, Not a Fact

Mistake:

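Applying the independence product rule, such as $\mathbb{P}(A_1 \cap \cdots \cap A_n) = \prod R_i$ or its failure-event counterpart $\mathbb{P}(F_1 \cap \cdots \cap F_n) = \prod (1 - R_i)$, when the components are actually correlated gives biased estimates. For redundant systems with positively correlated failures, the resulting reliability estimates are systematically optimistic.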

Correction:

In practice, components may share a power supply, be exposed to a common mode of failure (e.g., an earthquake), or come from the same defective production batch. These common-cause failures violate independence. The correct analysis uses the law of total probability: condition on whether the common-cause event occurs.
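A minimal sketch of the conditioning argument, under a deliberately simple assumed model: a common-cause event $C$ occurs with probability `q_common` and, when it occurs, fails every component at once; otherwise the components fail independently. Both the model and the name `parallel_with_common_cause` are illustrative assumptions, not a general treatment of correlated failures.

```python
from math import prod

def parallel_with_common_cause(reliabilities, q_common):
    """Parallel reliability when a common-cause event (probability q_common)
    fails all components at once; otherwise components fail independently."""
    works_given_no_cause = 1.0 - prod(1.0 - r for r in reliabilities)
    works_given_cause = 0.0   # assumed model: the common cause takes out every component
    # Law of total probability: P(works) = P(works | C) P(C) + P(works | not C) P(not C)
    return works_given_cause * q_common + works_given_no_cause * (1.0 - q_common)

rs = [0.99, 0.99, 0.99]
print("independent only: ", parallel_with_common_cause(rs, 0.0))
print("with common cause:", parallel_with_common_cause(rs, 1e-3))
```

In this model the system failure probability can never drop below `q_common`, no matter how many redundant units are added, which is why the independence formula looks optimistic by comparison.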