Wald's Sequential Probability Ratio Test (SPRT)

Why Stop Early?

In the fixed-sample-size detection framework of Chapter 1, the statistician collects $n$ samples, forms the cumulative log-likelihood ratio $\ell^{(n)} = \sum_{i=1}^n \ell(y_i)$, and compares it to a threshold. The sample size $n$ is chosen in advance to meet target false-alarm and miss probabilities $(\alpha, \beta)$.

Two observations motivate a sequential approach. First, when the true hypothesis is "easy" β€” for example, the LLR increments are all strongly positive β€” the first $k$ samples may already settle the question, and the remaining $n - k$ observations are wasted. Second, when the true hypothesis is "hard" β€” the LLR walks near zero β€” any fixed $n$ chosen in advance may prove too small for reliable detection.

Wald's sequential probability ratio test (SPRT) resolves both issues. The sample size $N$ becomes a random variable, determined by the data: observe one sample at a time, update the cumulative LLR, and decide as soon as it crosses an upper or lower threshold. For a given target pair $(\alpha, \beta)$, the SPRT uses on average the smallest possible number of samples among all tests achieving those error levels β€” this is the Wald-Wolfowitz optimality theorem.

This section develops the SPRT from its threshold design through Wald's identity for the average sample number (ASN), and closes with its connection to early stopping in iterative decoders, where the decision statistic is monitored between iterations.

Definition: Sequential Probability Ratio Test (SPRT)

Let $Y_1, Y_2, \ldots$ be i.i.d. observations under either $\mathcal{H}_0$ (density $f_0$) or $\mathcal{H}_1$ (density $f_1$). Define the per-sample log-likelihood ratio $\ell(y) = \log \frac{f_1(y)}{f_0(y)}$ and the cumulative LLR $\ell^{(n)} = \sum_{i=1}^n \ell(Y_i)$.

Fix two thresholds $B < 0 < A$. The SPRT is the pair $(g_{\mathrm{SPRT}}, N)$ defined by the stopping rule

$$N = \inf\{n \geq 1 : \ell^{(n)} \notin (B, A)\}$$

and the terminal decision

$$g_{\mathrm{SPRT}}(Y_1, \ldots, Y_N) = \begin{cases} 1 & \text{if } \ell^{(N)} \geq A, \\ 0 & \text{if } \ell^{(N)} \leq B. \end{cases}$$

The test continues (takes another sample) while $\ell^{(n)} \in (B, A)$.

The thresholds $A$ and $B$ are the only design parameters. Choosing them properly yields a test with prescribed error probabilities β€” the content of Wald's inequalities below.

SPRT (Sequential Probability Ratio Test)

A sequential hypothesis test that accumulates log-likelihood ratios one sample at a time and stops as soon as the cumulative LLR exits the interval $(B, A)$. Optimal in ASN among tests with the same error probabilities.

Related: Average Sample Number (ASN), Wald's Identity

Average Sample Number (ASN)

The expected stopping time $\mathbb{E}_i[N]$ of a sequential test under hypothesis $\mathcal{H}_i$. Also called the expected sample size. For the SPRT it is computed via Wald's identity.

Related: SPRT (Sequential Probability Ratio Test)

Wald's Identity

For a random walk $S_n = \sum_{i=1}^n X_i$ with i.i.d. increments and a stopping time $N$ with $\mathbb{E}[N] < \infty$, $\mathbb{E}[S_N] = \mathbb{E}[N] \cdot \mathbb{E}[X_1]$.

Related: Average Sample Number (ASN)

Theorem: Wald's Threshold Approximation

Let $P_f = \Pr(g_{\mathrm{SPRT}} = 1 \mid \mathcal{H}_0)$ and $P_d = \Pr(g_{\mathrm{SPRT}} = 1 \mid \mathcal{H}_1)$ denote the false-alarm and detection probabilities of the SPRT with thresholds $(B, A)$. Ignoring the excess over the boundary at the stopping time,

$$e^{A} \approx \frac{P_d}{P_f}, \qquad e^{B} \approx \frac{1 - P_d}{1 - P_f}.$$

Consequently, to achieve targets $P_f \leq \alpha$ and $1 - P_d \leq \beta$, Wald's choice is

$$A = \log \frac{1 - \beta}{\alpha}, \qquad B = \log \frac{\beta}{1 - \alpha}.$$

The thresholds depend only on the target error probabilities, not on the densities $f_0, f_1$ or the SNR. Every SPRT targeting false-alarm $\alpha = 10^{-3}$ and miss $\beta = 10^{-3}$ uses the same thresholds $A \approx 6.9$, $B \approx -6.9$ nats, regardless of the signal model.
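In code, the threshold design is a one-liner each way (a minimal Python sketch; the function name is illustrative):

```python
import math

def wald_thresholds(alpha: float, beta: float) -> tuple[float, float]:
    """Wald's model-free SPRT thresholds, in nats."""
    A = math.log((1 - beta) / alpha)   # upper (decide H1) threshold
    B = math.log(beta / (1 - alpha))   # lower (decide H0) threshold
    return A, B

print(wald_thresholds(1e-3, 1e-3))  # (6.907..., -6.907...), as in the text
```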


Key Takeaway

The SPRT thresholds $A = \log\frac{1-\beta}{\alpha}$, $B = \log\frac{\beta}{1-\alpha}$ are model-free: they depend only on the target error probabilities, not on the densities $f_0, f_1$. This is the clean separation between detection requirements and signal structure that makes the SPRT a universal sequential framework.

Theorem: Wald's Identity

Let $X_1, X_2, \ldots$ be i.i.d. with $\mathbb{E}[|X_1|] < \infty$ and mean $\mu = \mathbb{E}[X_1]$, and let $S_n = \sum_{i=1}^n X_i$. If $N$ is a stopping time with $\mathbb{E}[N] < \infty$, then $\mathbb{E}[S_N] = \mu \cdot \mathbb{E}[N]$.

The expected "total distance" a random walk covers by a stopping time is the product of the expected number of steps and the expected step size. It is the stopping-time version of linearity of expectation.
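A quick Monte Carlo illustration of the identity, using a positive-drift Gaussian walk stopped on exit from an interval (an illustrative sketch; the numeric values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, A, B = 0.5, 6.907, -6.907          # drift and exit boundaries
sums, steps = [], []
for _ in range(2000):
    s, n = 0.0, 0
    while B < s < A:                   # SPRT-style stopping time
        s += rng.normal(mu, 1.0)       # i.i.d. increment with mean mu
        n += 1
    sums.append(s)
    steps.append(n)
print(np.mean(sums), mu * np.mean(steps))  # the two sides nearly coincide
```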


Theorem: Average Sample Number of the SPRT

Under hypothesis $\mathcal{H}_i$, let $D_i = \mathbb{E}_i[\ell(Y_1)]$ denote the expected per-sample LLR, so that $D_1 = D(f_1 \| f_0) > 0$ and $D_0 = -D(f_0 \| f_1) < 0$ (assuming $f_0 \neq f_1$). Neglecting the excess over the boundary, the SPRT with Wald thresholds satisfies

$$\mathbb{E}_0[N] \approx \frac{(1-\alpha) B + \alpha A}{D_0}, \qquad \mathbb{E}_1[N] \approx \frac{\beta B + (1-\beta) A}{D_1}.$$

Under $\mathcal{H}_1$ the LLR random walk drifts upward at rate $D(f_1 \| f_0)$ per step, so reaching $A$ takes about $A / D(f_1 \| f_0)$ steps. The exact formula adjusts for the rare case in which the walk exits through the wrong boundary.
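These two approximations are easy to tabulate (a minimal Python sketch; the drift values in the last line anticipate the Gaussian example below, where $D_1 = \mathrm{SNR}/2$):

```python
import math

def wald_asn(alpha: float, beta: float, D0: float, D1: float):
    """Wald's ASN approximations under H0 and H1 (requires D0 < 0 < D1)."""
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    E0 = ((1 - alpha) * B + alpha * A) / D0
    E1 = (beta * B + (1 - beta) * A) / D1
    return E0, E1

# Gaussian shift-in-mean at 0 dB: D1 = SNR/2 = 0.5, D0 = -0.5
print(wald_asn(1e-3, 1e-3, -0.5, 0.5))  # about (13.8, 13.8) samples
```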


Example: SPRT for a Gaussian Shift-in-Mean

Under $\mathcal{H}_0$, $Y_i \sim \mathcal{N}(0, \sigma^2)$; under $\mathcal{H}_1$, $Y_i \sim \mathcal{N}(\mu, \sigma^2)$, with $\mu > 0$ known. Derive the SPRT explicitly and compute the ASN for $\alpha = \beta = 10^{-3}$ at $\mathrm{SNR} = \mu^2/\sigma^2 = 1$ (0 dB).
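A worked solution, combining the threshold and ASN theorems above. The per-sample LLR is
$$\ell(y) = \log\frac{f_1(y)}{f_0(y)} = \frac{\mu y}{\sigma^2} - \frac{\mu^2}{2\sigma^2},$$
so the drifts are $D_1 = \mathbb{E}_1[\ell(Y_1)] = \frac{\mu^2}{2\sigma^2} = \mathrm{SNR}/2$ and, by symmetry, $D_0 = -\mathrm{SNR}/2$. With $\alpha = \beta = 10^{-3}$, the Wald thresholds are $A = -B = \log 999 \approx 6.91$ nats, and
$$\mathbb{E}_1[N] \approx \frac{\beta B + (1-\beta)A}{D_1} = \frac{0.998 \times 6.91}{0.5} \approx 13.8$$
samples at 0 dB, with $\mathbb{E}_0[N]$ identical by symmetry.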

SPRT Sample Path and Decision Boundaries

Simulate sample paths of the cumulative LLR $\ell^{(n)}$ under $\mathcal{H}_0$ and $\mathcal{H}_1$ for a Gaussian shift-in-mean problem. The SPRT stops as soon as the path exits the strip $(B, A)$. Vary the SNR and the target error levels $(\alpha, \beta)$ to see how the thresholds move and how the stopping time changes.

Parameters: SNR $\mu^2/\sigma^2$ (default 0 dB), target false-alarm probability $\alpha$ (default $10^{-3}$), target miss probability $\beta$ (default $10^{-3}$).
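A minimal simulation sketch of such a sample path (Python; assumes the Gaussian shift-in-mean model with $\sigma = 1$ and the LLR increment derived in the example above):

```python
import math
import numpy as np

def sprt_path(snr_db: float, alpha: float, beta: float,
              h1: bool = True, seed: int = 0):
    """Simulate one SPRT sample path; return (decision, N, path)."""
    snr = 10 ** (snr_db / 10)
    mu = math.sqrt(snr)                  # take sigma = 1 without loss
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    rng = np.random.default_rng(seed)
    ell, path = 0.0, [0.0]
    while B < ell < A:
        y = rng.normal(mu if h1 else 0.0, 1.0)
        ell += mu * y - mu**2 / 2        # per-sample LLR increment
        path.append(ell)
    return int(ell >= A), len(path) - 1, path

decision, N, _ = sprt_path(0.0, 1e-3, 1e-3, h1=True)
print(decision, N)  # typically decides H1 after roughly 14 samples at 0 dB
```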

Average Sample Number vs. SNR

Compare the SPRT's ASN to the fixed-sample-size Neyman-Pearson test's required sample count as SNR varies. The ASN curve reveals the Wald-Wolfowitz optimality in operational terms: roughly a factor of two to three reduction at the same error levels.

Parameters: $\alpha = \beta = 10^{-3}$.
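A sketch of the comparison (Python; uses the standard fixed-sample-size formula $n = (Q^{-1}(\alpha) + Q^{-1}(\beta))^2/\mathrm{SNR}$ for the Gaussian shift-in-mean, and Wald's ASN approximation from above):

```python
import math
from scipy.stats import norm

def np_fixed_n(alpha: float, beta: float, snr: float) -> float:
    """Fixed-sample-size Neyman-Pearson sample count, Gaussian shift-in-mean."""
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return z**2 / snr

def sprt_asn_h1(alpha: float, beta: float, snr: float) -> float:
    """Wald ASN under H1; drift D1 = SNR/2, excess over boundary ignored."""
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    return (beta * B + (1 - beta) * A) / (snr / 2)

for snr in (0.25, 0.5, 1.0, 2.0):
    n_fix, n_seq = np_fixed_n(1e-3, 1e-3, snr), sprt_asn_h1(1e-3, 1e-3, snr)
    print(f"SNR={snr:4.2f}: fixed n={n_fix:5.1f}, ASN={n_seq:5.1f}, "
          f"ratio={n_fix / n_seq:.2f}")   # ratio near 2.8 at these targets
```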

Random Walk of the LLR and SPRT Boundaries

Animation of the cumulative LLR under $\mathcal{H}_1$, with the Wald boundaries $A$ and $B$ drawn as horizontal lines. Watch the drift, the boundary hit, and the resulting decision. Gaussian shift-in-mean, $\mathrm{SNR} = 0$ dB, $\alpha = \beta = 10^{-3}$.

Theorem: Wald-Wolfowitz Optimality of the SPRT

Among all sequential tests (stopping rule + terminal decision) of $\mathcal{H}_0$ vs. $\mathcal{H}_1$ based on i.i.d. observations with $\Pr(\text{reject } \mathcal{H}_0 \mid \mathcal{H}_0) \leq \alpha$ and $\Pr(\text{accept } \mathcal{H}_0 \mid \mathcal{H}_1) \leq \beta$, the SPRT with Wald thresholds minimizes both $\mathbb{E}_0[N]$ and $\mathbb{E}_1[N]$ simultaneously.

No other test β€” deterministic or randomized, fixed or variable sample size β€” with the same error guarantees can terminate sooner on average under either hypothesis. The SPRT is a rare example of a doubly-optimal procedure.


SPRT Implementation

Complexity: $O(1)$ per sample; expected total $O(\mathbb{E}[N])$
Input: densities $f_0, f_1$; targets $\alpha, \beta \in (0, 1/2)$
Output: decision $d \in \{0, 1\}$ and stopping time $N$
1. $A \leftarrow \log \frac{1-\beta}{\alpha}$
2. $B \leftarrow \log \frac{\beta}{1-\alpha}$
3. $\ell \leftarrow 0$
4. $n \leftarrow 0$
5. loop
6.   $n \leftarrow n + 1$
7.   observe $Y_n$
8.   $\ell \leftarrow \ell + \log(f_1(Y_n) / f_0(Y_n))$
9.   if $\ell \geq A$ then return $(d, N) \leftarrow (1, n)$
10.  if $\ell \leq B$ then return $(d, N) \leftarrow (0, n)$
11. end loop

Truncation (capping $n$ at $N_{\max}$) is common in practice to bound worst-case latency. The resulting test is no longer strictly optimal but retains the ASN advantage for most sample paths.
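A direct Python transcription of the pseudocode, with optional truncation folded in (a sketch; the iterator-based input interface is an illustrative choice):

```python
import math
from typing import Iterator, Optional, Tuple

import numpy as np

def sprt(llr_stream: Iterator[float], alpha: float, beta: float,
         n_max: Optional[int] = None) -> Tuple[int, int]:
    """Run the SPRT on a stream of per-sample LLR increments.

    Returns (decision, N). With n_max set, the test is truncated:
    at n = n_max the sign of the cumulative LLR forces the decision.
    """
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    ell, n = 0.0, 0
    for llr in llr_stream:
        n += 1
        ell += llr
        if ell >= A:
            return 1, n
        if ell <= B:
            return 0, n
        if n_max is not None and n >= n_max:
            return int(ell >= 0), n   # midpoint-threshold truncation
    raise RuntimeError("LLR stream exhausted before a decision")

# Example: Gaussian shift-in-mean increments (mu = sigma = 1) under H1,
# so each LLR increment is y - 1/2
rng = np.random.default_rng(0)
stream = (y - 0.5 for y in iter(lambda: rng.normal(1.0, 1.0), None))
print(sprt(stream, 1e-3, 1e-3))  # e.g., (1, ~14)
```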

Common Mistake: Excess Over the Boundary

Mistake:

Students apply the Wald threshold formulas and assume the error probabilities are exactly $\alpha$ and $\beta$.

Correction:

Wald's formulas ignore the excess $\ell^{(N)} - A$ (or $B - \ell^{(N)}$) at the stopping time. The actual errors are typically smaller than $(\alpha, \beta)$ β€” the SPRT over-delivers on the error guarantees. For continuous increments with bounded variance, the overshoot adds an $O(1)$ excess in nats at each boundary, so the effective exit levels lie outside $(B, A)$ and the realized errors fall below target. Siegmund's corrected asymptotic expansions account for this.
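A small simulation makes the over-delivery visible (an illustrative sketch for the Gaussian shift-in-mean with $\mu = \sigma = 1$, with relaxed targets so the false-alarm event is easy to count):

```python
import math
import numpy as np

alpha = beta = 1e-2
A = math.log((1 - beta) / alpha)
B = math.log(beta / (1 - alpha))
rng = np.random.default_rng(1)
trials, false_alarms = 50_000, 0
for _ in range(trials):
    ell = 0.0
    while B < ell < A:
        ell += rng.normal(0.0, 1.0) - 0.5   # LLR increment under H0
    false_alarms += ell >= A
print(false_alarms / trials)  # typically lands below the 1e-2 target
```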

Common Mistake: Non-i.i.d. Observations Break the SPRT

Mistake:

Applying the SPRT to dependent observations (e.g., correlated noise, adaptive measurements) using the same thresholds.

Correction:

The Wald inequalities rely on the martingale property of the likelihood ratio under $\mathcal{H}_0$, which in turn requires the per-sample LLRs to form a martingale difference sequence. For correlated data, the LLR must be computed for the joint density, and the thresholds must be reinterpreted. A mechanical i.i.d. implementation can produce actual errors orders of magnitude above target.

Historical Note: Wald and Wartime Statistics

1943-1947

Abraham Wald developed the SPRT during World War II as part of the Statistical Research Group at Columbia University (1943). The problem that motivated the work was munitions quality control: how to decide whether a batch of shells met specifications while testing as few shells as possible. Sequential testing had been informally proposed by others (notably Dodge and Romig in 1929), but Wald established the rigorous optimality theory.

The SPRT was classified until 1947 because of its potential military applications. When Wald published "Sequential Analysis" that year, it became the founding text of sequential statistical inference. Wald died in a plane crash in 1950 at age 48, cutting short one of the most prolific careers in 20th-century statistics.

Historical Note: The Wald-Wolfowitz Optimality Theorem

1948

In 1948, Wald and his collaborator Jacob Wolfowitz proved that the SPRT minimizes both $\mathbb{E}_0[N]$ and $\mathbb{E}_1[N]$ among all sequential tests with given error probabilities β€” a doubly-optimal property that has no analogue in the fixed-sample-size theory, where the Neyman-Pearson lemma only minimizes one error probability for a fixed $n$.

The proof uses a Bayesian reduction that was considered unusual at the time, since many statisticians viewed Bayes and frequentist methods as philosophically incompatible. Wald's willingness to use Bayesian tools instrumentally anticipated the modern decision-theoretic viewpoint.

Why This Matters: Early Stopping in Iterative Decoders

Turbo and LDPC decoders execute a fixed maximum number of iterations (typically 8-50) regardless of channel quality. In good channels most codewords decode correctly in 3-5 iterations; the remainder of the iteration budget is wasted power and latency.

Early-stopping criteria are sequential tests: monitor a decision statistic between iterations and stop as soon as confidence exceeds a threshold. Common statistics include the minimum absolute extrinsic LLR, the change in the hard-decision vector, and the syndrome check; a sketch of the syndrome-check rule follows below. Each is a proxy for the cumulative LLR of an SPRT on the hypothesis "the current hard decision is the true codeword." Well-designed early stopping achieves 50-70% iteration savings at no BER cost in the waterfall regime.
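A minimal sketch of the syndrome-check rule (Python; the decoder interface `update_fn` and the parity-check matrix `H` are illustrative assumptions, not any standard's API):

```python
import numpy as np

def decode_with_early_stop(H, llr, update_fn, max_iters=50):
    """Iterative soft decoding with a syndrome-check stopping rule.

    H         : binary parity-check matrix, shape (m, n), dtype uint8
    llr       : length-n array of bit LLRs (positive favors bit = 0)
    update_fn : one decoder iteration, mapping llr -> refined llr
    Returns (hard_decision, iterations_used).
    """
    for it in range(1, max_iters + 1):
        llr = update_fn(llr)
        hard = (llr < 0).astype(np.uint8)   # hard decision per bit
        if not ((H @ hard) % 2).any():      # zero syndrome: valid codeword
            return hard, it                 # stop early
    return hard, max_iters
```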

⚠️ Engineering Note

Truncated SPRT in Wireless Systems

Pure SPRT has unbounded stopping time. Any real receiver must impose a maximum sample count $N_{\max}$ β€” a hard deadline driven by latency budgets (e.g., the LDPC decoder must produce a bit within one OFDM symbol). The truncated SPRT forces a decision at $n = N_{\max}$ by comparing $\ell^{(N_{\max})}$ to a midpoint threshold (typically 0) if the walk is still in the continuation region.

The truncation introduces a small additional error term that must be budgeted alongside $(\alpha, \beta)$. For $N_{\max} \geq 2\,\mathbb{E}_1[N]$ the effect is usually negligible. For tight deadlines, the thresholds must be redesigned via direct numerical optimization over the truncated problem.
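One quick way to budget the truncation error is simulation (an illustrative, self-contained sketch for the Gaussian shift-in-mean at 0 dB):

```python
import math
import numpy as np

def truncated_miss_rate(alpha, beta, snr, n_max, trials=20_000, seed=2):
    """Estimate the H1 miss rate of a truncated Gaussian SPRT by simulation."""
    mu = math.sqrt(snr)
    A = math.log((1 - beta) / alpha)
    B = math.log(beta / (1 - alpha))
    rng = np.random.default_rng(seed)
    misses = 0
    for _ in range(trials):
        ell, n = 0.0, 0
        while B < ell < A and n < n_max:
            ell += mu * rng.normal(mu, 1.0) - mu**2 / 2
            n += 1
        # Boundary exit decides normally; otherwise the deadline forces
        # the midpoint-threshold decision
        decision = int(ell >= A) if not (B < ell < A) else int(ell >= 0)
        misses += decision == 0
    return misses / trials

# Deadline of about 2 E_1[N] (~28 samples at 0 dB); compare to the target
print(truncated_miss_rate(1e-3, 1e-3, 1.0, n_max=28))
```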

Practical Constraints
β€’ 3GPP NR turbo/LDPC decoders: $N_{\max}$ = 8-12 iterations per TTI
β€’ LDPC hardware decoders allocate an iteration budget in clock cycles
β€’ Early termination contributes directly to UE battery life

πŸ“‹ Ref: 3GPP TS 38.212, Section 5.3.2 (LDPC base graphs)

Quick Check

For a sequential test with target errors $\alpha = 10^{-2}$, $\beta = 10^{-2}$, what is $A$ (in nats)?

$\log(0.99/0.01) \approx 4.60$

$\log(0.01) \approx -4.60$

$\log(0.01/0.99) \approx -4.60$

$\log(100) \approx 4.60$ (same as option 1, but coincidental)

Quick Check

In the Gaussian shift-in-mean SPRT with $\alpha = \beta$, how does the ASN under $\mathcal{H}_1$ scale with SNR in the low-SNR regime?

$\mathbb{E}_1[N] \propto 1/\mathrm{SNR}$

$\mathbb{E}_1[N] \propto 1/\sqrt{\mathrm{SNR}}$

$\mathbb{E}_1[N] \propto 1/\mathrm{SNR}^2$

$\mathbb{E}_1[N]$ does not depend on SNR

Quick Check

Which statement about the Wald-Wolfowitz theorem is correct?

The SPRT minimizes $\mathbb{E}_0[N]$ only.

The SPRT minimizes $\mathbb{E}_1[N]$ only.

The SPRT minimizes both $\mathbb{E}_0[N]$ and $\mathbb{E}_1[N]$ among tests with the same error levels.

The SPRT minimizes $\max(\mathbb{E}_0[N], \mathbb{E}_1[N])$ only.