The Blessing and Curse of High Dimensions

Why Classical Asymptotics Are Not Enough

Classical estimation theory, which underpins the CRLB machinery of Chapter 4, was built around an asymptotic regime in which the dimension $N$ of the parameter is held fixed and the number of observations $M$ grows without bound. In that regime the MLE is consistent, asymptotically unbiased, and efficient; the sample covariance converges to the true covariance; linear regression is well conditioned.

This regime is no longer the one we live in. A base station with $N_t=256$ antennas estimating its channel from $M=64$ pilots, a radar system forming a covariance estimate from fewer snapshots than it has array elements, a genomicist fitting a regression with more genes than patients — all operate in the proportional asymptotic regime, where $N$ and $M$ are both large but their ratio $\gamma := N/M$ is a $\Theta(1)$ constant.

The behaviour of estimators in this regime is qualitatively different. Eigenvalues of sample covariance matrices do not concentrate on the true eigenvalues. The MLE, if it exists, can have strictly larger risk than a biased alternative. Regularization, which looks like a statistical crutch from the classical viewpoint, becomes essential — and its optimal amount depends on $\gamma$.

Definition: Proportional Asymptotic Regime

Let $(\mathbf{A}_M)_{M\geq 1}$ be a sequence of measurement matrices with $\mathbf{A}_M\in\mathbb{R}^{M\times N_M}$. The proportional asymptotic regime is the joint limit

$$M\to\infty,\quad N_M\to\infty,\quad \frac{N_M}{M}\to\gamma\in(0,\infty).$$

A statistic $T_M$ is said to have a deterministic equivalent $T(\gamma)$ in this regime if $T_M\to T(\gamma)$ almost surely (or in probability) as $M\to\infty$. Results expressed in terms of deterministic equivalents are the natural high-dimensional analogue of classical large-sample limits.

The aspect ratio $\gamma$ plays the role that "number of samples" plays in classical asymptotics. Everything interesting in this chapter is a function of $\gamma$.

Definition: Gaussian Linear Observation Model

Throughout the chapter we work with the canonical model

$$\mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{w},\qquad \mathbf{A}\in\mathbb{R}^{M\times N},\ \mathbf{w}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_M),$$

where $\mathbf{x}\in\mathbb{R}^N$ is the unknown parameter, $\mathbf{y}\in\mathbb{R}^M$ is the observation, and $\mathbf{w}$ is independent additive Gaussian noise. The entries of $\mathbf{A}$ are typically i.i.d. $\mathcal{N}(0,1)$ (so that $\mathbb{E}[\tfrac{1}{M}\mathbf{A}^{T}\mathbf{A}]=\mathbf{I}_N$), which is the standard normalisation for the Marchenko–Pastur regime.

Normalising the Gram matrix by $M$ (rather than $N$) is a convention choice. With this normalisation the largest eigenvalue of $\tfrac{1}{M}\mathbf{A}^{T}\mathbf{A}$ concentrates at $(1+\sqrt{\gamma})^2$ and its minimum eigenvalue at $(1-\sqrt{\gamma})^2$ when $\gamma<1$.
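A quick numerical sanity check of these edge predictions; the matrix size and seed below are illustrative assumptions, not values fixed by the text:

```python
# Sanity check: extreme eigenvalues of (1/M) A^T A versus the predicted
# Marchenko-Pastur edges (1 +/- sqrt(gamma))^2.  Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
M, N = 2000, 1000                         # aspect ratio gamma = 0.5
gamma = N / M

A = rng.standard_normal((M, N))           # i.i.d. N(0, 1) entries
eigs = np.linalg.eigvalsh(A.T @ A / M)    # spectrum of the normalised Gram matrix

print(f"empirical [min, max] = [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"predicted [min, max] = [{(1 - np.sqrt(gamma))**2:.3f}, "
      f"{(1 + np.sqrt(gamma))**2:.3f}]")
```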

Theorem: Breakdown of MLE in the Proportional Regime

Consider the Gaussian linear observation model above with $\mathbf{A}$ having i.i.d. $\mathcal{N}(0,1)$ entries and $\gamma<1$. The ordinary least-squares (OLS) estimator, which coincides with the MLE under Gaussian noise, is $\hat{\mathbf{x}}_{\text{OLS}}=(\mathbf{A}^{T}\mathbf{A})^{-1}\mathbf{A}^{T}\mathbf{y}$. In the proportional asymptotic regime its squared error satisfies

$$\|\hat{\mathbf{x}}_{\text{OLS}}-\mathbf{x}\|^2\;\xrightarrow{\text{a.s.}}\;\frac{\gamma}{1-\gamma}\,\sigma^2.$$

In particular, the risk blows up as $\gamma\uparrow 1$, and OLS is not defined for $\gamma\geq 1$.

Classical CRLB analysis would predict a per-coordinate variance of $\sigma^2/(M-N)$, which is $\sigma^2/(M(1-\gamma))$. Summing over the $N$ coordinates gives exactly $\gamma\sigma^2/(1-\gamma)$. The surprise is not the formula — it is that the CRLB itself blows up as $\gamma\to 1$. The MLE is optimal and disastrous.
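A minimal Monte Carlo check of this limit, assuming the i.i.d. $\mathcal{N}(0,1)$ entries of the theorem; the problem sizes and trial count are illustrative choices only:

```python
# Monte Carlo check of the OLS risk limit gamma * sigma^2 / (1 - gamma).
# Problem sizes and trial count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
M, N, sigma2, trials = 400, 200, 0.1, 200   # gamma = 0.5
gamma = N / M

errs = []
for _ in range(trials):
    A = rng.standard_normal((M, N))                      # i.i.d. N(0, 1) design
    x = rng.standard_normal(N)                           # arbitrary ground truth
    y = A @ x + np.sqrt(sigma2) * rng.standard_normal(M)
    x_hat = np.linalg.lstsq(A, y, rcond=None)[0]         # OLS solution
    errs.append(np.sum((x_hat - x) ** 2))

print(f"Monte Carlo risk        : {np.mean(errs):.4f}")
print(f"gamma*sigma^2/(1-gamma) : {gamma * sigma2 / (1 - gamma):.4f}")
```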

Key Takeaway

In the proportional regime the MLE does not fail because the estimator is wrong — it fails because the problem itself becomes ill-conditioned as $\gamma\to 1$. The cure is not a cleverer estimator but a shift of perspective: one must give up unbiasedness.

Marchenko–Pastur Eigenvalue Density

Plot the limiting eigenvalue density of $\tfrac{1}{M}\mathbf{A}^{T}\mathbf{A}$ as a function of the aspect ratio $\gamma=N/M$, with the empirical histogram from a finite draw overlaid. Watch the support shift and the left edge approach zero as $\gamma\to 1$.

Parameters: aspect ratio $\gamma = 0.5$; number of rows (empirical sample size) $M = 400$.
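The demo can be sketched in a few lines, assuming the default parameters above ($\gamma = 0.5$, $M = 400$) and the standard form of the Marchenko–Pastur density on its bulk $[\lambda_-,\lambda_+]$ with $\lambda_\pm=(1\pm\sqrt{\gamma})^2$, valid for $\gamma\leq 1$:

```python
# Sketch of the demo: histogram of the spectrum of (1/M) A^T A overlaid on
# the Marchenko-Pastur density (gamma <= 1 case).  Defaults mirror the widget.
import numpy as np
import matplotlib.pyplot as plt

gamma, M = 0.5, 400
N = int(gamma * M)

rng = np.random.default_rng(2)
A = rng.standard_normal((M, N))
eigs = np.linalg.eigvalsh(A.T @ A / M)

lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
lam = np.linspace(lo, hi, 400)
mp_density = np.sqrt((hi - lam) * (lam - lo)) / (2 * np.pi * gamma * lam)

plt.hist(eigs, bins=40, density=True, alpha=0.5, label="empirical spectrum")
plt.plot(lam, mp_density, label="Marchenko-Pastur density")
plt.xlabel("eigenvalue")
plt.legend()
plt.show()
```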

OLS Risk as a Function of $\gamma$

Compare the theoretical OLS risk $\gamma/(1-\gamma)\cdot\sigma^2$ against Monte Carlo simulation for varying aspect ratio. The blow-up at $\gamma=1$ is the proportional-regime analogue of the classical identifiability boundary.

Parameters: noise variance $\sigma^2 = 0.1$; number of rows $M = 200$.
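A rough sketch of the comparison the demo performs, using the defaults above ($\sigma^2 = 0.1$, $M = 200$); the grid of aspect ratios and the trial count are arbitrary choices:

```python
# Sketch of the demo: Monte Carlo OLS risk versus aspect ratio gamma,
# compared with gamma * sigma^2 / (1 - gamma).  Defaults mirror the widget.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
M, sigma2, trials = 200, 0.1, 50
gammas = np.arange(0.1, 0.91, 0.1)

mc_risk = []
for gamma in gammas:
    N = int(gamma * M)
    errs = []
    for _ in range(trials):
        A = rng.standard_normal((M, N))
        x = rng.standard_normal(N)
        y = A @ x + np.sqrt(sigma2) * rng.standard_normal(M)
        x_hat = np.linalg.lstsq(A, y, rcond=None)[0]
        errs.append(np.sum((x_hat - x) ** 2))
    mc_risk.append(np.mean(errs))

plt.plot(gammas, mc_risk, "o", label="Monte Carlo")
plt.plot(gammas, gammas * sigma2 / (1 - gammas), label=r"$\gamma\sigma^2/(1-\gamma)$")
plt.xlabel(r"$\gamma = N/M$")
plt.ylabel("OLS risk")
plt.legend()
plt.show()
```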

Example: Channel Estimation with Too Few Pilots

A massive-MIMO base station with $N_t=128$ antennas estimates its downlink channel from $M=64$ orthogonal pilot symbols using OLS. The per-antenna noise variance is $\sigma^2=0.01$. What per-entry MSE does the OLS estimate attain, and what would classical CRLB thinking predict?
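Before computing anything, note that $\gamma = N_t/M = 128/64 = 2 > 1$, so by the theorem above OLS is not even defined: $\mathbf{A}^{T}\mathbf{A}$ is a $128\times 128$ matrix of rank at most $64$. A short numerical confirmation (the Gaussian pilot matrix below is an illustrative stand-in for the actual pilot design):

```python
# The example's punchline: with M = 64 pilots and N_t = 128 antennas,
# gamma = 2 > 1, so A^T A (128 x 128) has rank at most 64 and OLS is undefined.
import numpy as np

rng = np.random.default_rng(4)
M, N = 64, 128

A = rng.standard_normal((M, N))
gram = A.T @ A                                        # rank <= min(M, N) = 64

print("rank(A^T A) =", np.linalg.matrix_rank(gram))   # prints 64
print("invertible? ", np.linalg.matrix_rank(gram) == N)  # prints False
```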

⚠️ Engineering Note

Sample Covariance is Biased High and Low

A practical consequence of the Marchenko–Pastur law: a sample covariance matrix $\hat{\boldsymbol{\Sigma}}=\tfrac{1}{M}\sum_{i=1}^M\mathbf{x}_i\mathbf{x}_i^T$ computed from $M$ samples of $\mathcal{N}(\mathbf{0},\mathbf{I}_N)$ does not have eigenvalues concentrated at $1$ — they spread over $[(1-\sqrt\gamma)^2,(1+\sqrt\gamma)^2]$. Plugging the sample covariance into any whitening or beamforming routine produces systematic errors that do not go away as $M\to\infty$ if $\gamma$ is held fixed.

Practical Constraints
  • For massive-MIMO covariance estimation, calibration campaigns must either collect $M\gg N$ snapshots or use shrinkage (see Ledoit and Wolf, 2004).

  • Eigenvalue clipping / linear-shrinkage estimators ($\hat{\boldsymbol{\Sigma}}_{\text{shrink}}=\alpha\hat{\boldsymbol{\Sigma}}+(1-\alpha)\mathbf{I}$) are standard in portfolio optimisation and array processing; a minimal sketch follows this list.
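A minimal sketch of the linear-shrinkage estimator from the second bullet; the fixed weight $\alpha$ is a user-chosen assumption here, not the data-driven optimum of Ledoit and Wolf (2004):

```python
# Linear shrinkage toward the identity: Sigma_shrink = alpha*S + (1-alpha)*I.
# alpha is fixed by hand here; Ledoit-Wolf choose it from the data.
import numpy as np

def shrink_covariance(X, alpha):
    """X: (M, N) zero-mean data matrix; returns alpha*S + (1-alpha)*I."""
    M, N = X.shape
    S = X.T @ X / M                     # sample covariance
    return alpha * S + (1.0 - alpha) * np.eye(N)

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 100))     # gamma = 0.5, true covariance = I

# Shrinking pulls the Marchenko-Pastur spread [(1-sqrt(g))^2, (1+sqrt(g))^2]
# back toward the true eigenvalue 1.
for alpha in (1.0, 0.5):
    eigs = np.linalg.eigvalsh(shrink_covariance(X, alpha))
    print(f"alpha = {alpha}: eigenvalues in [{eigs.min():.2f}, {eigs.max():.2f}]")
```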

Historical Note: Marchenko and Pastur (1967)


Vladimir Marchenko and Leonid Pastur, working in Kharkov, derived their eponymous law in 1967 as a curiosity about the spectra of large random matrices. It lay outside mainstream statistics for three decades and was rediscovered by the statistics and wireless communications communities in the late 1990s, when massive antenna arrays and genome-scale regression made the proportional regime unavoidable. The law now underpins the analysis of every high-dimensional estimator we use.

Common Mistake: Plugging $N$ into the Classical CRLB

Mistake:

A common reflex is to compute the CRLB for a model with $N$ unknowns and declare the resulting number a lower bound on the MSE, regardless of $M$.

Correction:

The classical CRLB is a lower bound on the variance of unbiased estimators. When $N/M$ is not negligible the bound itself is enormous, and when $N>M$ no unbiased estimator of $\mathbf{x}$ exists at all, so the bound is vacuous. Either use the Bayesian CRLB (if a prior is available) or switch to the minimax framework of Section 22.4.

Quick Check

For $\gamma=0.25$, the Marchenko–Pastur distribution of $\tfrac{1}{M}\mathbf{A}^{T}\mathbf{A}$ is supported on which interval?

$[0.25,\,2.25]$

$[0.25,\,1.75]$

$[0,\,1]$

$[1-\gamma,\,1+\gamma]=[0.75,\,1.25]$

Proportional Asymptotic Regime

The joint limit in which both the dimension $N$ and the sample size $M$ tend to infinity with the ratio $\gamma=N/M$ held fixed. This is the natural scaling for modern statistical problems in which the parameter dimension grows with the data.

Related: Marchenko and Pastur (1967), Random Matrix Theory