Hypothesis Testing and Confidence Intervals

Why Hypothesis Testing Matters for Simulation

When you run a Monte Carlo BER simulation and get $\hat{P}_b = 1.2 \times 10^{-3}$, how confident are you in that number? Could the true BER be $10^{-4}$? Hypothesis testing and confidence intervals answer these questions rigorously. Without them, you cannot distinguish a real performance difference from statistical noise, a critical issue when comparing two receiver algorithms or validating against theory.

Definition:

Hypothesis Test

A hypothesis test consists of:

  • Null hypothesis $H_0$: the default assumption (e.g., "the data comes from distribution $F_0$")
  • Alternative hypothesis $H_1$: what we accept if $H_0$ is rejected
  • Test statistic $T$: a function of the data
  • p-value: $P(T \ge t_{\mathrm{obs}} \mid H_0)$, the probability of seeing a result at least as extreme under $H_0$
  • Significance level $\alpha$: reject $H_0$ if $p < \alpha$ (typically $\alpha = 0.05$)
from scipy.stats import ttest_1samp

# samples: array of observations; H0 is that their population mean is 0
stat, p_value = ttest_1samp(samples, popmean=0.0)
if p_value < 0.05:
    print("Reject H0 at 5% significance level")

Definition:

Type I and Type II Errors

| Decision | $H_0$ True | $H_0$ False |
| --- | --- | --- |
| Reject $H_0$ | Type I error ($\alpha$) | Correct (Power $1-\beta$) |
| Accept $H_0$ | Correct | Type II error ($\beta$) |

  • Type I error rate = $\alpha$ = probability of false alarm
  • Type II error rate = $\beta$ = probability of missed detection
  • Power = $1 - \beta$ = probability of correctly rejecting a false $H_0$

In simulation: Type I error means claiming an algorithm is better when it is not (false positive); Type II means missing a real improvement.
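
These rates can be checked empirically. Below is a minimal Monte Carlo sketch (the sample size, number of experiments, and the effect size of 0.5 are arbitrary illustrative choices) that estimates the Type I error rate and the power of a one-sample t-test:

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n_experiments, N, alpha = 2000, 30, 0.05

# Type I error rate: simulate under H0 (true mean really is 0)
false_alarms = sum(
    ttest_1samp(rng.normal(0.0, 1.0, N), popmean=0.0).pvalue < alpha
    for _ in range(n_experiments)
)
print(f"Empirical Type I error rate: {false_alarms / n_experiments:.3f}")  # ~0.05

# Power: simulate under H1 (true mean is actually 0.5)
detections = sum(
    ttest_1samp(rng.normal(0.5, 1.0, N), popmean=0.0).pvalue < alpha
    for _ in range(n_experiments)
)
print(f"Empirical power: {detections / n_experiments:.3f}")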

Definition:

Student's t-test

The one-sample t-test tests whether the population mean equals a hypothesized value $\mu_0$. The test statistic is:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{N}}$$

where $s$ is the sample standard deviation and $N$ is the sample size. Under $H_0$, $t \sim t_{N-1}$ (Student's $t$ distribution with $N-1$ degrees of freedom).

The two-sample t-test compares means of two groups:

from scipy.stats import ttest_ind

# Default assumes equal variances; pass equal_var=False for Welch's t-test
stat, p = ttest_ind(ber_algorithm_A, ber_algorithm_B)

Definition:

Confidence Interval

A $(1-\alpha)$ confidence interval for parameter $\theta$ is a random interval $[L, U]$ such that:

$$P(L \le \theta \le U) = 1 - \alpha$$

For the mean of a Gaussian with unknown variance, the CI is:

$$\bar{x} \pm t_{\alpha/2,\,N-1} \cdot \frac{s}{\sqrt{N}}$$

import numpy as np
from scipy.stats import t

# x_bar, s: sample mean and standard deviation of N observations
ci_half = t.ppf(1 - alpha/2, df=N-1) * s / np.sqrt(N)
ci = (x_bar - ci_half, x_bar + ci_half)

Definition:

Kolmogorov-Smirnov (KS) Test

The KS test is a nonparametric test for whether data follows a specified distribution. It compares the empirical CDF to the reference:

$$D_N = \sup_x |\hat{F}_N(x) - F_0(x)|$$

from scipy.stats import kstest, norm

# mu_hat, sigma_hat: parameters of the reference normal (see the caveat below)
D, p = kstest(data, 'norm', args=(mu_hat, sigma_hat))

Two-sample KS test compares two empirical distributions:

from scipy.stats import ks_2samp
D, p = ks_2samp(data_A, data_B)

The KS test is distribution-free under $H_0$: the critical values do not depend on the reference distribution. However, if you estimate the parameters from the same data, the p-values are conservative (too large). Use the Lilliefors test for this case, as sketched below.
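
For the estimated-parameter case, one option is the lilliefors function from statsmodels (assuming that package is available); a minimal sketch:

from statsmodels.stats.diagnostic import lilliefors

# KS-type normality test with mean and variance estimated from the data itself
D, p = lilliefors(data, dist='norm')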

Definition:

Bootstrap Confidence Intervals

The bootstrap estimates the sampling distribution of a statistic by resampling the data with replacement:

  1. From $N$ observations, draw $B$ bootstrap samples (each of size $N$, with replacement)
  2. Compute the statistic $\hat{\theta}^{(b)}$ for each bootstrap sample
  3. Use the $\alpha/2$ and $1-\alpha/2$ quantiles of $\{\hat{\theta}^{(b)}\}$ as confidence bounds
import numpy as np

rng = np.random.default_rng(42)
B = 10000
# Resample the data with replacement B times, recomputing the mean each time
boot_stats = np.array([
    np.mean(rng.choice(data, size=len(data), replace=True))
    for _ in range(B)
])
ci = np.percentile(boot_stats, [2.5, 97.5])

The bootstrap is especially useful for statistics without closed-form distributions, like the median BER across fading realizations.

Theorem: Confidence Interval Width Scales as $1/\sqrt{N}$

For a $(1-\alpha)$ confidence interval of the mean, the half-width is:

$$w = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{N}}$$

To halve the confidence interval width, you need $4\times$ as many samples. For a Monte Carlo BER estimate with target relative precision $\epsilon$, the required number of trials is:

$$N_s \ge \frac{z_{\alpha/2}^2 \,(1 - P_b)}{P_b\, \epsilon^2} \approx \frac{z_{\alpha/2}^2}{P_b\, \epsilon^2}$$

The $1/\sqrt{N}$ scaling is a fundamental law of statistics. To estimate a BER of $10^{-6}$ with 10% relative precision at 95% confidence, you need at least $N_s \approx 4 \times 10^8$ trials.
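
The required-trials bound translates directly into code; a small sketch (the function name is an illustrative choice):

import numpy as np
from scipy.stats import norm

def required_trials(p_b, rel_precision, confidence=0.95):
    """Trials needed so the normal-approximation CI half-width
    equals rel_precision * p_b at the given confidence level."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return int(np.ceil(z**2 * (1 - p_b) / (p_b * rel_precision**2)))

print(f"{required_trials(1e-6, 0.10):.2e}")  # ~3.84e8 trials, matching the text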

Theorem: Bootstrap Consistency

For a "smooth" statistic TN=g(XˉN)T_N = g(\bar{X}_N) with gg differentiable:

sup⁑t∣Pβˆ—(TNβˆ—β‰€t)βˆ’P(TN≀t)βˆ£β†’a.s.0\sup_t \left| P^*(T_N^* \le t) - P(T_N \le t) \right| \xrightarrow{a.s.} 0

where Pβˆ—P^* denotes the bootstrap distribution. The bootstrap confidence interval has asymptotically correct coverage.

The empirical distribution converges to the true distribution (Glivenko-Cantelli), and resampling from it mimics the true sampling process.

Theorem: Exact Confidence Interval for BER

Let $k$ be the number of errors in $N$ trials. Since $k \sim \mathrm{Binomial}(N, P_b)$, the Clopper-Pearson exact confidence interval for $P_b$ is:

$$\left[ B^{-1}(\alpha/2;\, k,\, N-k+1),\; B^{-1}(1-\alpha/2;\, k+1,\, N-k) \right]$$

where $B^{-1}$ is the inverse beta CDF. For large $N$ and moderate $k$, the normal approximation gives:

$$\hat{P}_b \pm z_{\alpha/2} \sqrt{\frac{\hat{P}_b(1-\hat{P}_b)}{N}}$$

The BER is a proportion, so its confidence interval comes from the binomial distribution. The Clopper-Pearson interval is conservative (coverage $\ge 1-\alpha$); the normal approximation works well when $k \ge 30$.
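
Both intervals are a few lines with scipy; a sketch (function names are illustrative, and the $k = 0$ and $k = N$ edge cases use the standard one-sided conventions):

import numpy as np
from scipy.stats import beta, norm

def clopper_pearson(k, N, alpha=0.05):
    """Exact (conservative) CI for a binomial proportion via the inverse beta CDF."""
    lo = beta.ppf(alpha / 2, k, N - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, N - k) if k < N else 1.0
    return lo, hi

def normal_approx_ci(k, N, alpha=0.05):
    """Large-sample normal approximation for the same proportion."""
    p_hat = k / N
    half = norm.ppf(1 - alpha / 2) * np.sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - half, p_hat + half

For a library route, scipy.stats.binomtest(k, N).proportion_ci(method='exact') (SciPy 1.7+) computes the same Clopper-Pearson bounds.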

Example: Comparing Two Receiver Algorithms via t-test

You run 50 independent BER trials for Algorithm A and Algorithm B. Determine whether there is a statistically significant difference in their average BER at the 5% level.
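
A sketch of how this comparison might go (the array names are placeholders; Welch's variant is used to avoid the equal-variance assumption):

from scipy.stats import ttest_ind

# ber_A, ber_B: arrays of 50 per-trial BER estimates for each algorithm
stat, p = ttest_ind(ber_A, ber_B, equal_var=False)  # Welch's t-test
if p < 0.05:
    print("Mean BERs differ significantly at the 5% level")
else:
    print("No significant difference detected")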

Example: Bootstrap Confidence Interval for Median BER

Compute a 95% bootstrap confidence interval for the median BER from 100 Monte Carlo trials.
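
A sketch following the percentile-bootstrap recipe above, with ber_trials as a placeholder for the 100 per-trial BER values:

import numpy as np

rng = np.random.default_rng(1)
B = 10000
# Resample with replacement, recomputing the median each time
boot_medians = np.array([
    np.median(rng.choice(ber_trials, size=len(ber_trials), replace=True))
    for _ in range(B)
])
ci = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap CI for median BER: [{ci[0]:.2e}, {ci[1]:.2e}]")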

Example: Verifying Rayleigh Fading with the KS Test

Generate fading samples and use the KS test to verify they follow a Rayleigh distribution.
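
One possible approach: generate unit-power complex Gaussian fading, whose envelope is Rayleigh with scale $1/\sqrt{2}$, then test the envelope (sample size and seed are arbitrary):

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(7)
N = 5000
h = (rng.normal(size=N) + 1j * rng.normal(size=N)) / np.sqrt(2)  # CN(0, 1)
envelope = np.abs(h)  # should be Rayleigh with scale 1/sqrt(2)

D, p = kstest(envelope, 'rayleigh', args=(0, 1 / np.sqrt(2)))  # args = (loc, scale)
print(f"D = {D:.4f}, p = {p:.3f}")  # large p: consistent with Rayleigh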

Hypothesis Test Visualizer

Visualize the null distribution, test statistic, critical region, and p-value for a one-sample t-test. Adjust the true mean and sample size to see how power changes.


Hypothesis Test Decision Regions

Anatomy of a two-sided hypothesis test: null distribution, critical values at $\pm z_{\alpha/2}$, rejection regions (shaded), and the relationship between significance level, p-value, and test statistic.

Python Code Supplement

Complete code for this section (t-test, KS test, bootstrap confidence intervals, BER CI): ch09/python/hypothesis_testing.py

Quick Check

A p-value of 0.03 means:

  • There is a 3% probability that $H_0$ is true
  • If $H_0$ is true, there is a 3% chance of getting a test statistic at least as extreme as observed
  • The experiment has a 3% error rate
  • We should always reject $H_0$

Quick Check

You need a confidence interval for BER that is half as wide. How many times more Monte Carlo trials do you need?

  • 2x
  • 4x
  • 8x
  • sqrt(2)x

Common Mistake: Multiple Testing Without Correction

Mistake:

Running 20 t-tests at $\alpha = 0.05$ to compare algorithm variants and reporting any significant result. By chance, you expect $20 \times 0.05 = 1$ false positive.

Correction:

Apply the Bonferroni correction ($\alpha' = \alpha / m$ for $m$ tests) or use scipy.stats.false_discovery_control() for the Benjamini-Hochberg procedure when performing multiple tests, as sketched below.
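
A sketch of both corrections applied to a list of p-values (p_values is a placeholder; false_discovery_control requires SciPy 1.11 or newer):

import numpy as np
from scipy.stats import false_discovery_control

# p_values: p-values from the m individual tests
m = len(p_values)
bonferroni_reject = np.asarray(p_values) < 0.05 / m   # Bonferroni at family level 0.05
p_adjusted = false_discovery_control(p_values)        # Benjamini-Hochberg adjustment
bh_reject = p_adjusted < 0.05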

Key Takeaway

Always report confidence intervals with your Monte Carlo results. A bare BER number without a CI is scientifically meaningless. Use the Clopper-Pearson exact interval for small error counts ($k < 30$) and the normal approximation for large counts.

Key Takeaway

The bootstrap is your universal confidence interval tool. It works for any statistic (median, percentile, ratio of BERs) without requiring closed-form distributions. Use $B \ge 10000$ bootstrap resamples for reliable intervals.

Why This Matters: Statistical Rigor in BER Simulation

In wireless research, the standard practice is to count at least 100 errors before declaring a BER measurement valid. This rule of thumb comes from the $1/\sqrt{N}$ scaling of the CI width: with $k = 100$ errors, the 95% CI is approximately $\hat{P}_b \pm 20\%$. The exact Clopper-Pearson interval from the theorem above makes this precision quantitative.
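
The 20% figure falls straight out of the normal approximation: the relative half-width is roughly $z_{\alpha/2}/\sqrt{k}$, which a two-line check confirms:

import numpy as np
from scipy.stats import norm

k = 100  # error count from the rule of thumb
print(f"Relative 95% CI half-width: {norm.ppf(0.975) / np.sqrt(k):.1%}")  # ~19.6%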

Statistical Tests for Simulation Validation

| Test | Null Hypothesis | Assumptions | scipy Function |
| --- | --- | --- | --- |
| One-sample t-test | $\mu = \mu_0$ | Normal data or large $N$ | ttest_1samp |
| Two-sample t-test | $\mu_A = \mu_B$ | Independent samples, normal or large $N$ | ttest_ind |
| Paired t-test | $\mu_{A-B} = 0$ | Paired observations | ttest_rel |
| KS test | $X \sim F_0$ | Continuous distribution | kstest |
| Two-sample KS | $F_A = F_B$ | Independent continuous samples | ks_2samp |
| Chi-squared | Observed = Expected | Categorical data, $n_i \ge 5$ | chisquare |

p-value

The probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true.

Confidence Interval

A random interval $[L, U]$ that contains the true parameter with probability $1-\alpha$ over repeated experiments.

Related: Bootstrap

Bootstrap

A resampling method that estimates the sampling distribution of a statistic by drawing samples with replacement from the observed data.

Related: Confidence Interval