Variance Reduction Techniques

Why Variance Reduction Can Save Days of Simulation Time

At $P_b = 10^{-6}$, a standard Monte Carlo simulation needs $\sim 10^8$ trials for 10% relative precision (the relative error scales as $1/\sqrt{N P_b}$, so $N \approx 100/P_b$). Variance reduction techniques can reduce this by factors of 10-1000, cutting simulation time from days to minutes. The price is additional mathematical complexity in designing the estimator, but the payoff is enormous for low-BER simulations.

Definition:

Importance Sampling

Instead of sampling from the original distribution $f(x)$, draw from a proposal distribution $g(x)$ that concentrates samples in the important region:

$$\theta = E_f[h(X)] = \int h(x)\,\frac{f(x)}{g(x)}\,g(x)\,dx = E_g\!\left[h(X)\,\frac{f(X)}{g(X)}\right]$$

The IS estimator is:

$$\hat{\theta}_{\mathrm{IS}} = \frac{1}{N}\sum_{i=1}^{N} h(X_i)\,w(X_i), \qquad w(X_i) = \frac{f(X_i)}{g(X_i)}$$

where $w(X_i)$ is the likelihood ratio (importance weight).

# IS for BPSK BER: shift noise mean toward decision boundary
import numpy as np
rng = np.random.default_rng(0)
N = 10_000
ebno = 10 ** (9 / 10)                 # Eb/N0 = 9 dB (linear), assumed for illustration
mu_shift = np.sqrt(2 * ebno)          # optimal shift = decision boundary
noise_is = mu_shift + rng.standard_normal(N)
weights = np.exp(-mu_shift * noise_is + mu_shift**2 / 2)
errors_is = noise_is > np.sqrt(2 * ebno)
ber_is = np.mean(errors_is * weights)

The key challenge is choosing $g(x)$. A poor choice can increase variance (even to infinity). For AWGN BER estimation, the optimal choice is to shift the noise mean to the decision boundary.

Definition:

Optimal Importance Sampling Distribution

The zero-variance IS estimator uses the proposal:

$$g^*(x) = \frac{|h(x)|\,f(x)}{\int |h(x)|\,f(x)\,dx} = \frac{|h(x)|\,f(x)}{\theta} \quad (\text{for } h \ge 0)$$

This is impractical (it requires knowing $\theta$), but it guides the design of good proposals. For BER estimation:

  • The error event is $h(x) = \mathbf{1}[x > d]$ where $d$ is the decision boundary
  • The optimal shift concentrates samples near $d$
  • A practical choice: shift the noise mean by $d$ (exponential tilting)
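
For the AWGN case, the tilted proposal and its weight can be written out explicitly. The following short derivation (standard exponential-tilting algebra, consistent with the definitions above) shows where the weight expression used in the earlier code snippet comes from:

```latex
% Tilted proposal: the N(0,1) density shifted to the decision boundary d
g(x) = \phi(x - d), \qquad \phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}

% Importance weight: ratio of original to tilted density
w(x) = \frac{\phi(x)}{\phi(x - d)}
     = \exp\!\left( \frac{(x - d)^2 - x^2}{2} \right)
     = e^{-d x + d^2/2}
```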

Definition:

Antithetic Variates

The antithetic variates technique uses negatively correlated pairs to reduce variance:

$$\hat{\theta}_{\mathrm{AV}} = \frac{1}{2N}\sum_{i=1}^{N} \left[g(U_i) + g(1-U_i)\right]$$

where $U_i \sim \mathrm{Uniform}(0,1)$ and $1-U_i$ is the antithetic sample. If $g$ is monotonic, $\mathrm{Cov}[g(U), g(1-U)] < 0$, reducing the variance.

import numpy as np
from scipy.stats import norm
rng = np.random.default_rng(0)
N = 10_000
g = np.exp                # example monotone integrand, assumed for illustration
u = rng.uniform(size=N)
x1 = norm.ppf(u)          # original samples
x2 = norm.ppf(1 - u)      # antithetic samples
estimate = 0.5 * (g(x1) + g(x2))
theta_hat = np.mean(estimate)

Antithetic variates are essentially free in computational cost: you reuse the same random numbers twice. The variance reduction is typically 2-5x for smooth functions.

Definition:

Control Variates

If $Z$ is a random variable with known mean $E[Z] = \mu_Z$, the control variate estimator for $\theta = E[g(X)]$ is:

$$\hat{\theta}_{\mathrm{CV}} = \frac{1}{N}\sum_{i=1}^N g(X_i) - c\left(\frac{1}{N}\sum_{i=1}^N Z_i - \mu_Z\right)$$

The optimal coefficient is $c^* = \mathrm{Cov}[g(X), Z]/\mathrm{Var}[Z]$, giving variance reduction factor:

$$\frac{\mathrm{Var}[\hat{\theta}_{\mathrm{CV}}]}{\mathrm{Var}[\hat{\theta}]} = 1 - \rho_{gZ}^2$$

where $\rho_{gZ}$ is the correlation between $g(X)$ and $Z$.

# Control variate: use SNR as control
g_samples = compute_ber_per_block(...)
z_samples = compute_snr_per_block(...)
mu_z = theoretical_snr
c_star = np.cov(g_samples, z_samples)[0, 1] / np.var(z_samples, ddof=1)  # ddof=1 matches np.cov
theta_cv = np.mean(g_samples) - c_star * (np.mean(z_samples) - mu_z)

Definition:

Stratified Sampling

Divide the sample space into $K$ non-overlapping strata $S_1, \dots, S_K$ with probabilities $p_k = P(X \in S_k)$. Sample $n_k$ points from each stratum:

$$\hat{\theta}_{\mathrm{SS}} = \sum_{k=1}^{K} p_k \cdot \frac{1}{n_k}\sum_{i=1}^{n_k} g(X_{ki})$$

With proportional allocation ($n_k \propto p_k$), stratified sampling never increases variance relative to simple random sampling: it removes the between-strata variance component.

For BER simulation: stratify on the channel realization $|h|^2$ to ensure coverage of both deep fades and strong channels.
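
A concrete sketch of this idea (assumptions: BPSK over Rayleigh fading with average SNR $\bar\gamma = 10$ in linear scale, and equal-probability strata on $|h|^2 \sim \mathrm{Exp}(1)$; parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
gbar = 10.0           # average Eb/N0 (linear), assumed for illustration
K, n_k = 20, 500      # number of strata, samples per stratum

est = 0.0
for k in range(K):
    # Equal-probability strata: U restricted to (k/K, (k+1)/K), so p_k = 1/K
    u = rng.uniform(k / K, (k + 1) / K, size=n_k)
    h2 = -np.log1p(-u)                          # inverse CDF of Exp(1) gives |h|^2
    ber_cond = norm.sf(np.sqrt(2 * gbar * h2))  # instantaneous BPSK BER Q(sqrt(2*gamma))
    est += (1 / K) * ber_cond.mean()

ber_theory = 0.5 * (1 - np.sqrt(gbar / (1 + gbar)))  # exact Rayleigh-BPSK BER
```

Every stratum is guaranteed samples, so deep fades (strata near $u = 0$) and strong channels (near $u = 1$) are both covered by construction.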

Theorem: Importance Sampling Variance

The IS estimator with proposal gg has variance:

$$\mathrm{Var}_g[\hat{\theta}_{\mathrm{IS}}] = \frac{1}{N}\left(E_g\!\left[h^2(X)\,\frac{f^2(X)}{g^2(X)}\right] - \theta^2\right)$$

The variance is minimized when $g(x) \propto |h(x)|\,f(x)$. A poor choice of $g$ (thin tails relative to $f$) can make the IS variance larger than crude Monte Carlo.

If $g$ does not cover the tails of $f$ where $h$ is nonzero, some samples will have extremely large weights $f/g$, causing high variance. This "weight degeneracy" problem is the main practical challenge of IS.

Theorem: Antithetic Variates Variance Reduction

For antithetic variates with gg monotonic:

$$\mathrm{Var}[\hat{\theta}_{\mathrm{AV}}] = \frac{1}{N}\left(\frac{\sigma^2}{2} + \frac{\mathrm{Cov}[g(U), g(1-U)]}{2}\right)$$

Since $\mathrm{Cov}[g(U), g(1-U)] \le 0$ for monotonic $g$:

$$\mathrm{Var}[\hat{\theta}_{\mathrm{AV}}] \le \frac{\sigma^2}{2N} \le \mathrm{Var}[\hat{\theta}_{\mathrm{MC}}]$$

The factor-of-2 comes for free; additional reduction depends on the strength of the negative correlation.

When $U$ gives a high value of $g$, $1-U$ tends to give a low value (and vice versa). The pair average is less variable than two independent samples.

Theorem: Exponential Tilting for Rare-Event BER

For estimating $P_b = P(N > d)$ where $N \sim \mathcal{N}(0, 1)$ and $d = \sqrt{2 E_b/N_0}$, the optimal IS shift is $\mu^* = d$. The IS variance satisfies:

$$\mathrm{Var}[\hat{P}_{b,\mathrm{IS}}] \approx \frac{P_b^2}{N}\left(e^{-d^2}\,P_b^{-2}\cdot\frac{1}{\sqrt{8\pi d^2}} - 1\right) \approx \frac{P_b^2}{N}\left(d\sqrt{\pi/2} - 1\right)$$

This is exponentially smaller than the crude MC variance $P_b(1-P_b)/N \approx P_b/N$ for large $d$.

By shifting the noise distribution so errors become likely events, IS converts a rare-event problem into an ordinary estimation problem. The importance weights correct for the biased sampling.

Example: Importance Sampling for Low-BER BPSK

Estimate the BER of BPSK at $E_b/N_0 = 12$ dB using importance sampling with only 10000 samples, and compare with crude MC.
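
A minimal sketch of this comparison (assumptions: unit-variance noise, a fixed seed, and `scipy.stats.norm.sf` as the exact reference; these choices are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 10_000
ebno = 10 ** (12 / 10)        # 12 dB in linear scale
d = np.sqrt(2 * ebno)         # decision boundary, about 5.63

# Crude MC: the true BER is around 9e-9, so 10^4 samples almost surely see no errors
ber_mc = np.mean(rng.standard_normal(N) > d)

# Importance sampling: tilt the noise mean to d, reweight by the likelihood ratio
noise_is = d + rng.standard_normal(N)
w = np.exp(-d * noise_is + d**2 / 2)
ber_is = np.mean((noise_is > d) * w)

ber_true = norm.sf(d)         # exact tail probability Q(d)
```

With the same budget, the crude estimate is typically exactly zero while the IS estimate lands within a few percent of $Q(d)$.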

Example: Antithetic Variates for BER Estimation

Apply antithetic variates to reduce BER estimation variance.
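
A sketch of antithetic BER estimation. On the raw high-SNR error indicator the pair correlation is nearly zero, so this sketch instead averages the conditional BER over Rayleigh fading, where the integrand is smooth and monotone (assumed average SNR $\bar\gamma = 10$; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
gbar, N = 10.0, 5_000              # average SNR (linear), number of antithetic pairs

def g(u):
    """Conditional BPSK BER given |h|^2 = -ln(1-u); monotone in u."""
    return norm.sf(np.sqrt(2 * gbar * -np.log1p(-u)))

u = rng.uniform(size=N)
ber_av = np.mean(0.5 * (g(u) + g(1 - u)))     # antithetic pair averages
rho = np.corrcoef(g(u), g(1 - u))[0, 1]       # negative, since g is monotone
ber_theory = 0.5 * (1 - np.sqrt(gbar / (1 + gbar)))
```

The negative correlation between pair members is what shrinks the variance relative to $2N$ independent draws.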

Example: Control Variates for Fading Channel BER

Use the channel power $|h|^2$ as a control variate when estimating BER over a Rayleigh fading channel.
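
A sketch of this control-variate setup (assumptions: BPSK, average SNR $\bar\gamma = 10$, the per-realization conditional BER as the estimand, and the exactly known mean $E[|h|^2] = 1$ as $\mu_Z$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
gbar, N = 10.0, 20_000                   # average SNR (linear), sample count

h2 = rng.exponential(size=N)             # Rayleigh channel power |h|^2 ~ Exp(1)
g_s = norm.sf(np.sqrt(2 * gbar * h2))    # conditional BPSK BER per realization
mu_z = 1.0                               # known control mean E[|h|^2]

c_star = np.cov(g_s, h2)[0, 1] / np.var(h2, ddof=1)   # negative: more power, fewer errors
ber_cv = g_s.mean() - c_star * (h2.mean() - mu_z)
ber_theory = 0.5 * (1 - np.sqrt(gbar / (1 + gbar)))   # exact reference
```

Since BER falls as channel power rises, $c^*$ comes out negative; the estimator corrects the plain mean by the observed deviation of the sample channel power from 1.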

Variance Reduction Comparison

Compare crude Monte Carlo, importance sampling, and antithetic variates for BPSK BER estimation. See how variance reduction improves estimation accuracy at high SNR.


Variance Reduction Techniques

python
Importance sampling, antithetic variates, control variates for BER.
# Code from: ch09/python/variance_reduction.py

Quick Check

In importance sampling for BER, what happens if the proposal distribution $g(x)$ has thinner tails than the original $f(x)$?

The estimator becomes biased

The variance may become infinite

The estimator converges faster

No effect β€” all proposals give the same variance

Common Mistake: Importance Sampling with Wrong Tail Behavior

Mistake:

Using a proposal distribution $g(x)$ that does not cover the tails of $f(x) \cdot h(x)$. For example, when estimating Gaussian tail probabilities, a narrow uniform proposal misses part of the tail entirely, and a thin-tailed Gaussian proposal makes the weights $f/g$ blow up in the tail.

Correction:

Ensure $g(x) > 0$ wherever $f(x)\,h(x) \neq 0$, and use exponential tilting (shift the mean) rather than arbitrary proposals. Monitor the effective sample size $N_{\mathrm{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$.
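
That monitoring takes only a few lines; a minimal sketch (the tilt $d$ and sample size are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 10_000, 1.5                      # sample size and tilt, assumed for illustration

x = d + rng.standard_normal(N)          # draws from the tilted proposal g
w = np.exp(-d * x + d**2 / 2)           # importance weights f/g

n_eff = w.sum() ** 2 / np.sum(w**2)     # effective sample size
# n_eff / N near 1: weights are well behaved; near 0: weight degeneracy
```

For a Gaussian mean shift the expected ratio $N_{\mathrm{eff}}/N$ is roughly $e^{-d^2}$, so even a moderate tilt concentrates weight on relatively few samples; this is why the check belongs in every IS run.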

Key Takeaway

For BER below $10^{-4}$, always consider variance reduction. Importance sampling with exponential tilting is the most powerful technique (exponential speedup). Antithetic variates are free and give up to a 2x improvement. Control variates work well when a correlated quantity with known mean is available.

Why This Matters: Importance Sampling in 5G Simulation

5G NR targets BLER of $10^{-5}$ for URLLC (Ultra-Reliable Low Latency Communication). Simulating this with crude MC requires $\sim 10^7$ codewords, which is prohibitive for complex LDPC/Polar coded systems. Industry link-level simulators use importance sampling to speed up low-BLER verification by 100-1000x.

Variance Reduction Methods Comparison

| Method | Variance Reduction | Complexity | Best For |
| --- | --- | --- | --- |
| Crude MC | 1x (baseline) | None | Moderate BER ($>10^{-3}$) |
| Antithetic Variates | Up to 2x | Minimal | Any smooth estimator |
| Control Variates | $(1-\rho^2)^{-1}$x | Needs a known-mean correlate | When a good control is available |
| Importance Sampling | Up to $10^3$x+ | Proposal design | Rare events, low BER |
| Stratified Sampling | Moderate | Stratum design | Heterogeneous domains |

Historical Note: Importance Sampling Origins

1950s-1980s

Importance sampling was introduced by Herman Kahn and Andy Marshall at RAND Corporation in 1953 for neutron transport simulations. The technique was independently developed for telecommunications by Jeruchim, Balaban, and Shanmugan in the 1980s, who applied it to BER estimation of digital communication systems with rare error events.

Advanced Importance Sampling

python
Optimal IS for BPSK/QPSK, adaptive IS, effective sample size monitoring.
# Code from: ch09/python/importance_sampling.py

Importance Sampling

A variance reduction technique that samples from a proposal distribution $g(x)$ and corrects with likelihood ratios $f(x)/g(x)$.

Related: Likelihood Ratio

Likelihood Ratio

The importance weight $w(x) = f(x)/g(x)$, the ratio of the original density to the proposal density.

Related: Importance Sampling

Antithetic Variates

A variance reduction technique using negatively correlated sample pairs $(U, 1-U)$ to reduce estimator variance.

Control Variate

A random variable with known mean used to reduce the variance of a Monte Carlo estimator via correlation.