Hypothesis Testing and Confidence Intervals
Why Hypothesis Testing Matters for Simulation
When you run a Monte Carlo BER simulation and obtain an estimate $\hat{p}$, how confident are you in that number? Could the true BER be substantially higher? Hypothesis testing and confidence intervals answer these questions rigorously. Without them, you cannot distinguish a real performance difference from statistical noise, a critical issue when comparing two receiver algorithms or validating against theory.
Definition: Hypothesis Test
A hypothesis test consists of:
- Null hypothesis $H_0$: the default assumption (e.g., "the data come from distribution $F_0$")
- Alternative hypothesis $H_1$: what we accept if $H_0$ is rejected
- Test statistic $T$: a function of the data
- p-value: $P(T \ge t_{\mathrm{obs}} \mid H_0)$, the probability of seeing a result at least as extreme under $H_0$
- Significance level $\alpha$: reject $H_0$ if $p < \alpha$ (typically $\alpha = 0.05$)
from scipy.stats import ttest_1samp
# samples: 1-D array of observations (e.g., per-trial metric values)
stat, p_value = ttest_1samp(samples, popmean=0.0)
if p_value < 0.05:
    print("Reject H0 at 5% significance level")
Definition: Type I and Type II Errors
| Decision | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$) | Correct (power $1-\beta$) |
| Accept $H_0$ | Correct | Type II error ($\beta$) |
- Type I error rate $= \alpha =$ probability of false alarm
- Type II error rate $= \beta =$ probability of missed detection
- Power $= 1 - \beta =$ probability of correctly rejecting a false $H_0$
In simulation: Type I error means claiming an algorithm is better when it is not (false positive); Type II means missing a real improvement.
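The trade-off between $\alpha$ and power can be checked directly by simulation. The sketch below is not from the original text; the sample size and effect size are illustrative choices.

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n, alpha, n_trials = 30, 0.05, 5000   # illustrative settings

# Type I error rate: data actually follow H0 (true mean 0)
false_alarms = sum(
    ttest_1samp(rng.standard_normal(n), popmean=0.0).pvalue < alpha
    for _ in range(n_trials)
)

# Power: data follow H1 (true mean 0.5)
detections = sum(
    ttest_1samp(0.5 + rng.standard_normal(n), popmean=0.0).pvalue < alpha
    for _ in range(n_trials)
)

print(f"Estimated Type I error rate: {false_alarms / n_trials:.3f}")  # close to alpha
print(f"Estimated power:             {detections / n_trials:.3f}")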
Definition: Student's t-test
The one-sample t-test tests whether the population mean equals a hypothesized value $\mu_0$. The test statistic is:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
where $s$ is the sample standard deviation and $n$ is the sample size. Under $H_0$, $t \sim t_{n-1}$ (Student's $t$ distribution with $n-1$ degrees of freedom).
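As a sanity check, the statistic and two-sided p-value can be computed directly from this formula and compared with scipy; the data below are illustrative.

import numpy as np
from scipy.stats import t as t_dist, ttest_1samp

rng = np.random.default_rng(0)
samples = 0.2 + rng.standard_normal(40)           # illustrative data
mu0 = 0.0

n = len(samples)
x_bar = samples.mean()
s = samples.std(ddof=1)                           # sample standard deviation
t_stat = (x_bar - mu0) / (s / np.sqrt(n))
p_manual = 2 * t_dist.sf(abs(t_stat), df=n - 1)   # two-sided p-value

t_scipy, p_scipy = ttest_1samp(samples, popmean=mu0)
print(t_stat, p_manual)                           # matches ttest_1samp
print(t_scipy, p_scipy)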
The two-sample t-test compares means of two groups:
from scipy.stats import ttest_ind
stat, p = ttest_ind(ber_algorithm_A, ber_algorithm_B)
Definition: Confidence Interval
A $100(1-\alpha)\%$ confidence interval for parameter $\theta$ is a random interval $[L, U]$, computed from the data, such that:
$$P(L \le \theta \le U) = 1 - \alpha$$
For the mean of a Gaussian with unknown variance, the CI is:
$$\bar{x} \pm t_{1-\alpha/2,\,N-1}\,\frac{s}{\sqrt{N}}$$
import numpy as np
from scipy.stats import t

# x_bar: sample mean, s: sample standard deviation, N: number of samples
ci_half = t.ppf(1 - alpha/2, df=N-1) * s / np.sqrt(N)
ci = (x_bar - ci_half, x_bar + ci_half)
Definition: Kolmogorov-Smirnov (KS) Test
The KS test is a nonparametric test for whether data follow a specified distribution. It compares the empirical CDF $\hat{F}_n$ to the reference CDF $F_0$:
$$D_n = \sup_x \left|\hat{F}_n(x) - F_0(x)\right|$$
from scipy.stats import kstest, norm

# mu_hat, sigma_hat: Gaussian parameters estimated from the data
D, p = kstest(data, 'norm', args=(mu_hat, sigma_hat))
Two-sample KS test compares two empirical distributions:
from scipy.stats import ks_2samp
D, p = ks_2samp(data_A, data_B)
The KS test is distribution-free under $H_0$: the critical values do not depend on the reference distribution. However, if you estimate the parameters from the same data, the $p$-values are conservative (too large). Use the Lilliefors test (or a parametric bootstrap, as sketched below) for this case.
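If a Lilliefors routine is not at hand, one option is to calibrate the p-value with a parametric bootstrap that re-estimates the parameters on every synthetic sample. This is a sketch under illustrative settings, not a method prescribed by the original text.

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
data = rng.standard_normal(200)            # illustrative data

mu_hat, sigma_hat = data.mean(), data.std(ddof=1)
D_obs, _ = kstest(data, 'norm', args=(mu_hat, sigma_hat))

# Simulate the null distribution of D with parameters re-estimated each time
n_sim = 2000
D_null = np.empty(n_sim)
for i in range(n_sim):
    sim = rng.normal(mu_hat, sigma_hat, size=len(data))
    D_null[i], _ = kstest(sim, 'norm', args=(sim.mean(), sim.std(ddof=1)))

p_boot = np.mean(D_null >= D_obs)          # calibrated p-value
print(f"KS statistic: {D_obs:.4f}, bootstrap-calibrated p-value: {p_boot:.3f}")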
Definition: Bootstrap Confidence Intervals
The bootstrap estimates the sampling distribution of a statistic by resampling the data with replacement:
- From $N$ observations, draw $B$ bootstrap samples (each of size $N$, with replacement)
- Compute the statistic $\hat{\theta}^*_b$ for each bootstrap sample $b = 1, \dots, B$
- Use the $\alpha/2$ and $1-\alpha/2$ quantiles of $\{\hat{\theta}^*_b\}$ as confidence bounds
import numpy as np

# data: 1-D array of observed values (e.g., per-trial BER measurements)
rng = np.random.default_rng(42)
B = 10000
boot_stats = np.array([
    np.mean(rng.choice(data, size=len(data), replace=True))
    for _ in range(B)
])
ci = np.percentile(boot_stats, [2.5, 97.5])
The bootstrap is especially useful for statistics without closed-form distributions, like the median BER across fading realizations.
Theorem: Confidence Interval Width Scales as $1/\sqrt{N}$
For a $100(1-\alpha)\%$ confidence interval of the mean, the half-width is:
$$w = z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{N}}$$
To halve the confidence interval width, you need $4\times$ as many samples. For a Monte Carlo BER estimate of $p$ with target relative precision $\epsilon$ (half-width at most $\epsilon p$), the required number of trials is:
$$N \gtrsim \frac{z_{1-\alpha/2}^2\,(1-p)}{\epsilon^2\,p}$$
The $1/\sqrt{N}$ scaling is a fundamental law of statistics. To estimate a BER of $p$ with 10% relative precision at 95% confidence, you need roughly $1.96^2/(0.1^2\,p) \approx 400/p$ trials, i.e., on the order of 400 observed errors.
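A small helper makes the rule concrete; the function name and the target BER values below are illustrative, not from the original text.

import numpy as np
from scipy.stats import norm

def required_trials(p, rel_precision=0.1, confidence=0.95):
    # Normal-approximation rule: N >= z^2 * (1 - p) / (eps^2 * p)
    z = norm.ppf(0.5 + confidence / 2)
    return int(np.ceil(z**2 * (1 - p) / (rel_precision**2 * p)))

for ber in (1e-3, 1e-4, 1e-5):            # illustrative target BERs
    print(f"BER {ber:.0e}: need ~{required_trials(ber):.2e} trials")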
Theorem: Bootstrap Consistency
For a "smooth" statistic with differentiable:
where denotes the bootstrap distribution. The bootstrap confidence interval has asymptotically correct coverage.
The empirical distribution $\hat{F}_n$ converges to the true distribution $F$ (Glivenko-Cantelli), and resampling from it mimics the true sampling process.
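The coverage claim can be spot-checked numerically. The sketch below uses an illustrative skewed distribution, sample size, and repetition counts to estimate the empirical coverage of the percentile bootstrap CI for the mean.

import numpy as np

rng = np.random.default_rng(0)
n, B, M = 50, 400, 500                 # sample size, resamples, repetitions
true_mean = np.exp(0.5)                # mean of lognormal(0, 1)
covered = 0
for _ in range(M):
    x = rng.lognormal(0.0, 1.0, size=n)
    idx = rng.integers(0, n, size=(B, n))   # B bootstrap resamples at once
    boot = x[idx].mean(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    covered += (lo <= true_mean <= hi)

print(f"Empirical coverage: {covered / M:.3f}  (nominal 0.95)")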
Theorem: Exact Confidence Interval for BER
Let $k$ errors be observed in $N$ trials. Since $k \sim \mathrm{Binomial}(N, p)$, the Clopper-Pearson exact $100(1-\alpha)\%$ confidence interval for $p$ is:
$$p_{\mathrm{low}} = B^{-1}\!\left(\tfrac{\alpha}{2};\, k,\, N-k+1\right), \qquad p_{\mathrm{high}} = B^{-1}\!\left(1-\tfrac{\alpha}{2};\, k+1,\, N-k\right)$$
where $B^{-1}$ is the inverse beta CDF. For large $N$ and moderate $\hat{p} = k/N$, the normal approximation gives:
$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$$
The BER is a proportion, so its confidence interval comes from the binomial distribution. The Clopper-Pearson interval is conservative (coverage $\ge 1-\alpha$); the normal approximation works well when the error count is large (as a rule of thumb, $N\hat{p} \gtrsim 10$).
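Both intervals can be computed in a few lines with scipy's beta and normal quantile functions; the error and trial counts below are illustrative.

import numpy as np
from scipy.stats import beta, norm

k, N, alpha = 20, 200_000, 0.05          # illustrative error count / trials
p_hat = k / N

# Exact (Clopper-Pearson) interval from inverse beta CDFs
lo = beta.ppf(alpha / 2, k, N - k + 1) if k > 0 else 0.0
hi = beta.ppf(1 - alpha / 2, k + 1, N - k)

# Normal approximation
z = norm.ppf(1 - alpha / 2)
half = z * np.sqrt(p_hat * (1 - p_hat) / N)

print(f"Estimate:        {p_hat:.2e}")
print(f"Clopper-Pearson: [{lo:.2e}, {hi:.2e}]")
print(f"Normal approx:   [{p_hat - half:.2e}, {p_hat + half:.2e}]")

Newer scipy versions also expose the same exact interval via scipy.stats.binomtest(k, N).proportion_ci().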
Example: Comparing Two Receiver Algorithms via t-test
You run 50 independent BER trials for Algorithm A and Algorithm B. Determine whether there is a statistically significant difference in their average BER at the 5% level.
Simulate BER data
import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(42)
ber_A = 1e-3 + 2e-4 * rng.standard_normal(50)
ber_B = 1.2e-3 + 2e-4 * rng.standard_normal(50)
Run t-test
stat, p = ttest_ind(ber_A, ber_B)
print(f"t-statistic: {stat:.4f}")
print(f"p-value: {p:.6f}")
print(f"Significant at 5%: {p < 0.05}")
Interpret
If $p < 0.05$, we reject $H_0$ and conclude that Algorithm B has a significantly different (higher) average BER.
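If the two algorithms' BER variances cannot be assumed equal, a common variant (not required by this example, but worth knowing) is Welch's t-test, enabled via equal_var=False.

# Welch's t-test does not assume equal variances in the two groups;
# ber_A and ber_B are the arrays generated in the example above.
stat_w, p_w = ttest_ind(ber_A, ber_B, equal_var=False)
print(f"Welch p-value: {p_w:.6f}")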
Example: Bootstrap Confidence Interval for Median BER
Compute a 95% bootstrap confidence interval for the median BER from 100 Monte Carlo trials.
Generate data and bootstrap
import numpy as np
rng = np.random.default_rng(42)
ber_trials = 1e-3 * np.exp(0.5 * rng.standard_normal(100))
B = 10000
boot_medians = np.array([
np.median(rng.choice(ber_trials, size=100, replace=True))
for _ in range(B)
])
ci = np.percentile(boot_medians, [2.5, 97.5])
print(f"Median BER: {np.median(ber_trials):.4e}")
print(f"95% Bootstrap CI: [{ci[0]:.4e}, {ci[1]:.4e}]")
Example: Verifying Rayleigh Fading with the KS Test
Generate fading samples and use the KS test to verify they follow a Rayleigh distribution.
Generate and test
from scipy.stats import kstest, rayleigh
import numpy as np
rng = np.random.default_rng(42)
h = (rng.standard_normal(5000)
+ 1j * rng.standard_normal(5000)) / np.sqrt(2)
envelope = np.abs(h)
# The envelope of CN(0,1) has Rayleigh scale 1/sqrt(2), so pass it explicitly;
# against scipy's default scale=1 the test would (correctly) reject.
D, p = kstest(envelope, 'rayleigh', args=(0, 1/np.sqrt(2)))
print(f"KS statistic: {D:.4f}")
print(f"p-value: {p:.4f}")
# Large p-value => cannot reject the Rayleigh hypothesis
Hypothesis Test Visualizer
Visualize the null distribution, test statistic, critical region, and p-value for a one-sample t-test. Adjust the true mean and sample size to see how power changes.
Hypothesis Test Decision Regions
# Code from: ch09/python/hypothesis_testing.py
Quick Check
A p-value of 0.03 means:
There is a 3% probability that H0 is true
If H0 is true, there is a 3% chance of getting a test statistic at least as extreme as observed
The experiment has a 3% error rate
We should always reject H0
The p-value is P(T >= t_obs | H0), the probability of the observed data (or more extreme) under the null hypothesis.
Quick Check
You need a confidence interval for BER that is half as wide. How many times more Monte Carlo trials do you need?
2x
4x
8x
sqrt(2)x
Width proportional to 1/sqrt(N), so halving width requires 4x the samples.
Common Mistake: Multiple Testing Without Correction
Mistake:
Running 20 t-tests at $\alpha = 0.05$ to compare algorithm variants and reporting any significant result. By chance, you expect about $20 \times 0.05 = 1$ false positive.
Correction:
Apply the Bonferroni correction (test each comparison at $\alpha/m = 0.05/20 = 0.0025$) or use scipy.stats.false_discovery_control() for the Benjamini-Hochberg procedure when performing multiple tests, as sketched below.
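A minimal sketch of both corrections on illustrative p-values; scipy.stats.false_discovery_control requires SciPy 1.11 or newer.

import numpy as np
from scipy.stats import false_discovery_control

rng = np.random.default_rng(0)
p_values = rng.uniform(0, 1, size=20)    # stand-in for 20 t-test p-values
alpha, m = 0.05, len(p_values)

# Bonferroni: test each hypothesis at alpha / m
bonferroni_significant = p_values < alpha / m

# Benjamini-Hochberg: control the false discovery rate instead
p_adjusted = false_discovery_control(p_values)   # BH-adjusted p-values
bh_significant = p_adjusted < alpha

print(f"Bonferroni rejections: {bonferroni_significant.sum()}")
print(f"BH rejections:         {bh_significant.sum()}")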
Key Takeaway
Always report confidence intervals with your Monte Carlo results. A bare BER number without a CI is scientifically meaningless. Use the Clopper-Pearson exact interval for small error counts and the normal approximation for large counts.
Key Takeaway
The bootstrap is your universal confidence interval tool. It works for any statistic (median, percentile, ratio of BERs) without requiring closed-form distributions. Use a large number of bootstrap resamples ($B = 10^4$ in the examples above) for reliable intervals.
Why This Matters: Statistical Rigor in BER Simulation
In wireless research, the standard practice is to count at least 100 errors before declaring a BER measurement valid. This rule of thumb comes from the $1/\sqrt{k}$ scaling of the relative CI width: with $k = 100$ errors, the 95% CI is approximately $\hat{p}\,(1 \pm 0.2)$. The exact Clopper-Pearson interval from Theorem 3 makes this precision quantitative.
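The sketch below verifies the rule using the Clopper-Pearson interval; the trial count is an arbitrary illustrative choice, since only the error count matters for the relative width.

import numpy as np
from scipy.stats import beta

N = 10_000_000                            # illustrative number of trials
for k in (10, 100, 1000):                 # observed error counts
    p_hat = k / N
    lo = beta.ppf(0.025, k, N - k + 1)
    hi = beta.ppf(0.975, k + 1, N - k)
    rel = (hi - lo) / (2 * p_hat)         # half-width relative to the estimate
    print(f"k = {k:4d}: relative 95% CI half-width ~ {rel:.0%}")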
Statistical Tests for Simulation Validation
| Test | Null Hypothesis | Assumptions | scipy Function |
|---|---|---|---|
| One-sample t-test | $\mu = \mu_0$ | Normal data or large $n$ | ttest_1samp |
| Two-sample t-test | $\mu_A = \mu_B$ | Independent samples, normal or large $n$ | ttest_ind |
| Paired t-test | Mean difference $= 0$ | Paired observations | ttest_rel |
| KS test | Data follow $F_0$ | Continuous distribution | kstest |
| Two-sample KS | $F_A = F_B$ | Independent continuous samples | ks_2samp |
| Chi-squared | Observed = expected frequencies | Categorical data, expected counts $\ge 5$ per cell | chisquare |
p-value
The probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis is true.
Confidence Interval
A random interval that contains the true parameter with probability $1-\alpha$ over repeated experiments.
Related: Bootstrap
Bootstrap
A resampling method that estimates the sampling distribution of a statistic by drawing samples with replacement from the observed data.
Related: Confidence Interval