The Cramer--Rao Lower Bound

How Good Can an Estimator Be?

Given a bias specification (say, unbiased), we ask the engineering question: how small can the variance of any such estimator be, for a given sample size and noise level? The answer is the Cramer--Rao lower bound (CRB). Unlike a performance bound for a particular algorithm, the CRB is a hard limit imposed by the statistical model itself --- nothing you can build will do better. This is why, when you plot the MSE of a real estimator alongside the CRB, the gap between the two is a receipt you can take to the designer: it says exactly how much slack is left and whether closing it is worth the complexity.

Definition: Score Function and Regularity

Suppose $\theta \in \Lambda \subseteq \mathbb{R}$ is a scalar, $\Lambda$ is an open interval, and the support of $f(\mathbf{y};\theta) = f_\theta(\mathbf{y})$ does not depend on $\theta$. The score function is
$$s(\mathbf{y};\theta) \triangleq \frac{\partial}{\partial\theta}\log f_\theta(\mathbf{y}) = \frac{1}{f_\theta(\mathbf{y})}\,\frac{\partial f_\theta(\mathbf{y})}{\partial\theta}.$$
The family $\{f_\theta\}$ is regular if differentiation under the integral sign is permitted, so that
$$\mathbb{E}_\theta\!\left[s(\mathbf{Y};\theta)\right] = \int \frac{\partial f_\theta(\mathbf{y})}{\partial\theta}\,d\mathbf{y} = \frac{\partial}{\partial\theta}\int f_\theta(\mathbf{y})\,d\mathbf{y} = 0.$$
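As a concrete instance (a standard computation, using the same Gaussian model as the examples later in this section), take a single sample $Y \sim \mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known:
$$\log f_\theta(y) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(y-\theta)^2}{2\sigma^2}, \qquad s(y;\theta) = \frac{y-\theta}{\sigma^2},$$
and indeed $\mathbb{E}_\theta[s(Y;\theta)] = (\mathbb{E}_\theta[Y]-\theta)/\sigma^2 = 0$.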

Regularity fails in sharp ways for distributions whose support moves with $\theta$ (e.g., $\mathrm{Unif}[0,\theta]$). The CRB is not applicable in such cases --- faster-than-$1/n$ variance decay becomes possible.
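For a concrete sense of the speed-up (a standard calculation, not worked in the original text): for $\mathrm{Unif}[0,\theta]$ the unbiased estimator $\hat{\theta} = \frac{n+1}{n}\max_i Y_i$ has $\text{Var}_\theta(\hat{\theta}) = \frac{\theta^2}{n(n+2)}$, which decays like $1/n^2$ rather than the $1/n$ rate that a CRB would enforce in a regular family.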

Definition: Fisher Information

Under the regularity conditions of the Score Function and Regularity definition, the Fisher information (scalar case) is
$$J(\theta) \triangleq \text{Var}_\theta\bigl(s(\mathbf{Y};\theta)\bigr) = \mathbb{E}_\theta\!\left[\left(\frac{\partial \log f_\theta(\mathbf{Y})}{\partial\theta}\right)^2\right].$$
If in addition the density is twice differentiable and the second derivative can be interchanged with the integral, then
$$J(\theta) = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2 \log f_\theta(\mathbf{Y})}{\partial\theta^2}\right].$$
For independent observations $Y_1,\ldots,Y_n$ the log-likelihood is a sum, and $J(\theta) = \sum_i J_i(\theta)$. In the i.i.d. case, $J(\theta) = n\,J_1(\theta)$.

The two expressions --- variance of the score and negative expected curvature --- are equal only on average. Pointwise they differ. The equivalence is a consequence of regularity and plays the same role here that the second-moment manipulation plays in the proof of the variance decomposition.

Theorem: Equivalence of the Two Fisher-Information Expressions

Under the regularity conditions of the Score Function and Regularity and Fisher Information definitions,
$$\mathbb{E}_\theta\!\left[\left(\frac{\partial \log f_\theta(\mathbf{Y})}{\partial\theta}\right)^2\right] = -\,\mathbb{E}_\theta\!\left[\frac{\partial^2 \log f_\theta(\mathbf{Y})}{\partial\theta^2}\right].$$
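A one-line sketch of why regularity gives the equivalence (the standard argument, filling in the step the theorem statement omits): differentiate the identity $\int \frac{\partial f_\theta}{\partial\theta}\,d\mathbf{y} = 0$ once more in $\theta$, exchange derivative and integral, and use $\frac{\partial^2 f_\theta}{\partial\theta^2} = f_\theta\left[\frac{\partial^2 \log f_\theta}{\partial\theta^2} + \left(\frac{\partial \log f_\theta}{\partial\theta}\right)^2\right]$:
$$0 = \frac{\partial^2}{\partial\theta^2}\int f_\theta(\mathbf{y})\,d\mathbf{y} = \int \frac{\partial^2 f_\theta(\mathbf{y})}{\partial\theta^2}\,d\mathbf{y} = \mathbb{E}_\theta\!\left[\frac{\partial^2 \log f_\theta(\mathbf{Y})}{\partial\theta^2}\right] + \mathbb{E}_\theta\!\left[\left(\frac{\partial \log f_\theta(\mathbf{Y})}{\partial\theta}\right)^2\right].$$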

Theorem: Cramer--Rao Lower Bound (Scalar)

Let $\{f_\theta : \theta \in \Lambda\}$ be a regular family with scalar $\theta$. Then for any unbiased estimator $\hat{\theta}(\mathbf{Y})$,
$$\boxed{\;\text{Var}_\theta\bigl(\hat{\theta}(\mathbf{Y})\bigr) \;\geq\; \frac{1}{J(\theta)}\;} \qquad \forall\,\theta \in \Lambda.$$
Equality holds for every $\theta \in \Lambda$ if and only if
$$\frac{\partial}{\partial\theta}\log f_\theta(\mathbf{y}) = J(\theta)\bigl(\hat{\theta}(\mathbf{y}) - \theta\bigr),$$
i.e., the score is an affine function of the estimator. In that case $\hat{\theta}$ is called efficient, and it is the MVUE.

The inequality comes from Cauchy--Schwarz applied to two random variables: the centered estimator $\hat{\theta}(\mathbf{Y}) - \theta$ and the score $s(\mathbf{Y};\theta)$. Differentiating the unbiasedness identity $\mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})] = \theta$ in $\theta$ forces their cross-moment $\mathbb{E}_\theta\bigl[(\hat{\theta}(\mathbf{Y})-\theta)\,s(\mathbf{Y};\theta)\bigr]$ to equal $1$, while the variance of the score is $J(\theta)$. Cauchy--Schwarz then lower-bounds the variance of the estimator by the reciprocal of the Fisher information. This is the CRB proof pattern: every CRB you will see in this book is an instance of it --- vector, curved, Bayesian, functional --- with the same two-random-variable Cauchy--Schwarz at its core.
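Spelled out (the standard chain of steps behind the prose above):
$$1 = \frac{\partial}{\partial\theta}\,\mathbb{E}_\theta[\hat{\theta}(\mathbf{Y})] = \int \hat{\theta}(\mathbf{y})\,\frac{\partial f_\theta(\mathbf{y})}{\partial\theta}\,d\mathbf{y} = \mathbb{E}_\theta\bigl[\hat{\theta}(\mathbf{Y})\,s(\mathbf{Y};\theta)\bigr] = \mathbb{E}_\theta\bigl[(\hat{\theta}(\mathbf{Y})-\theta)\,s(\mathbf{Y};\theta)\bigr],$$
where the last equality uses $\mathbb{E}_\theta[s] = 0$. Cauchy--Schwarz then gives
$$1 = \Bigl(\mathbb{E}_\theta\bigl[(\hat{\theta}-\theta)\,s\bigr]\Bigr)^2 \leq \text{Var}_\theta(\hat{\theta})\,\mathbb{E}_\theta[s^2] = \text{Var}_\theta(\hat{\theta})\,J(\theta),$$
which rearranges to the bound, with equality iff $\hat{\theta}-\theta$ and $s$ are proportional.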


Historical Note: Cramer (1946), Rao (1945), and a Near-Simultaneous Discovery

1943--1946

The inequality was derived independently by C. R. Rao (Calcutta, 1945) and H. Cramer (Stockholm, 1946), and also by M. Frechet (Paris, 1943) in a less general form. Rao's derivation used the method that became the standard textbook proof; Cramer's 1946 monograph brought it to a wide audience. The name of A. Bhattacharyya is also attached to this circle of results, through the extension to higher-order derivatives of the likelihood that he published in 1946--1948. In modern practice the two-name "Cramer--Rao" label prevails. The near-simultaneous discovery is not an accident: the ingredients (likelihood, Fisher information, Cauchy--Schwarz) were all in the air in the 1940s statistics community, waiting for someone to assemble them.

Example: Efficient Estimator: Mean of a Gaussian

Let $Y_1,\ldots,Y_n$ be i.i.d. $\mathcal{N}(\theta,\sigma^2)$ with $\sigma^2$ known. Compute the Fisher information and show that the sample mean $\bar{Y}$ is an efficient estimator of $\theta$.
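A sketch of the computation (the standard argument; the example itself leaves the details as an exercise): the log-likelihood is $\sum_i \log f_\theta(Y_i)$, so
$$s(\mathbf{Y};\theta) = \sum_{i=1}^n \frac{Y_i-\theta}{\sigma^2} = \frac{n}{\sigma^2}\bigl(\bar{Y}-\theta\bigr), \qquad J(\theta) = n\,J_1(\theta) = \frac{n}{\sigma^2}.$$
The score is exactly $J(\theta)(\bar{Y}-\theta)$, so the attainment condition of the scalar CRB theorem holds: $\bar{Y}$ is unbiased with $\text{Var}_\theta(\bar{Y}) = \sigma^2/n = 1/J(\theta)$, hence efficient.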

Example: Amplitude Estimation in AWGN

Observe $Y_i = A s_i + Z_i$ for $i = 1,\ldots,n$, with $s_i$ known and $Z_i$ i.i.d. $\mathcal{N}(0,\sigma^2)$. The unknown parameter is the amplitude $A \in \mathbb{R}$. Compute the CRB and the efficient estimator.
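A sketch of the answer (standard, and consistent with the matched-filter simulation later in this section): the score is
$$s(\mathbf{Y};A) = \sum_{i=1}^n \frac{(Y_i - A s_i)\,s_i}{\sigma^2} = \frac{\|\mathbf{s}\|^2}{\sigma^2}\left(\frac{\mathbf{s}^T\mathbf{Y}}{\|\mathbf{s}\|^2} - A\right), \qquad J(A) = \frac{\|\mathbf{s}\|^2}{\sigma^2},$$
so the CRB is $\sigma^2/\|\mathbf{s}\|^2$, and the matched-filter estimator $\hat{A} = \mathbf{s}^T\mathbf{Y}/\|\mathbf{s}\|^2$ is unbiased and attains it.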

Definition: Fisher Information Matrix

Let $\boldsymbol{\theta} = (\theta_1,\ldots,\theta_m)^T \in \Lambda \subseteq \mathbb{R}^m$. Under vector regularity (support independent of $\boldsymbol{\theta}$ and derivatives exchangeable with integration), the Fisher information matrix is the $m \times m$ matrix
$$[\mathbf{J}(\boldsymbol{\theta})]_{ij} \triangleq \mathbb{E}_{\boldsymbol{\theta}}\!\left[\frac{\partial \log f_{\boldsymbol{\theta}}(\mathbf{Y})}{\partial\theta_i}\cdot\frac{\partial \log f_{\boldsymbol{\theta}}(\mathbf{Y})}{\partial\theta_j}\right] = -\,\mathbb{E}_{\boldsymbol{\theta}}\!\left[\frac{\partial^2 \log f_{\boldsymbol{\theta}}(\mathbf{Y})}{\partial\theta_i\,\partial\theta_j}\right].$$
As the covariance of the score vector, $\mathbf{J}(\boldsymbol{\theta}) \succeq 0$; it is strictly positive definite iff no direction in the parameter space is (locally) unidentifiable, i.e., no nontrivial linear combination of the score components is degenerate.

Theorem: Cramer--Rao Lower Bound (Vector Parameter)

For any unbiased estimator $\hat{\boldsymbol{\theta}}(\mathbf{Y})$ of $\boldsymbol{\theta} \in \mathbb{R}^m$ with positive-definite FIM $\mathbf{J}(\boldsymbol{\theta})$,
$$\text{Cov}_{\boldsymbol{\theta}}\bigl(\hat{\boldsymbol{\theta}}(\mathbf{Y})\bigr) \;\succeq\; \mathbf{J}(\boldsymbol{\theta})^{-1}.$$
Componentwise, $\text{Var}_{\boldsymbol{\theta}}(\hat{\theta}_i) \geq [\mathbf{J}(\boldsymbol{\theta})^{-1}]_{ii}$. For any (Frechet-differentiable) reparameterization $\boldsymbol{\alpha}(\boldsymbol{\theta}) : \mathbb{R}^m \to \mathbb{R}^r$ with Jacobian $\mathbf{D}(\boldsymbol{\theta}) = \partial\boldsymbol{\alpha}/\partial\boldsymbol{\theta}^T$, any unbiased estimator $\hat{\boldsymbol{\alpha}}(\mathbf{Y})$ of $\boldsymbol{\alpha}(\boldsymbol{\theta})$ satisfies
$$\text{Cov}_{\boldsymbol{\theta}}\bigl(\hat{\boldsymbol{\alpha}}(\mathbf{Y})\bigr) \;\succeq\; \mathbf{D}(\boldsymbol{\theta})\,\mathbf{J}(\boldsymbol{\theta})^{-1}\,\mathbf{D}(\boldsymbol{\theta})^T.$$

Read $[\mathbf{J}^{-1}]_{ii}$ as the CRB on the $i$-th component, not as $1/[\mathbf{J}]_{ii}$. The difference is caused by cross-terms: when another parameter is also being estimated, it steals information from $\theta_i$. This is why estimating amplitude and phase jointly is harder than estimating either alone.
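The cross-term effect can be made quantitative (a standard block-matrix identity, added here for reference): separating $\theta_i$ from the remaining parameters $\boldsymbol{\theta}_{-i}$ in the FIM,
$$[\mathbf{J}^{-1}]_{ii} = \frac{1}{J_{ii} - \mathbf{J}_{i,-i}\,\mathbf{J}_{-i,-i}^{-1}\,\mathbf{J}_{-i,i}} \;\geq\; \frac{1}{J_{ii}},$$
so the bound is inflated exactly by the information the coupling terms $\mathbf{J}_{i,-i}$ drain away; it collapses to $1/J_{ii}$ only when $\theta_i$ is Fisher-decoupled from the rest.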


Example: Joint CRB: Mean and Variance of a Gaussian

Let $Y_1,\ldots,Y_n$ be i.i.d. $\mathcal{N}(\mu,\sigma^2)$ with both $\mu$ and $\sigma^2$ unknown, $\boldsymbol{\theta} = (\mu,\sigma^2)^T$. Compute $\mathbf{J}(\boldsymbol{\theta})$ and the resulting componentwise CRBs.
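A sketch of the result (the standard computation; the example leaves the details as an exercise):
$$\mathbf{J}(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\[4pt] 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}, \qquad \text{Var}(\hat{\mu}) \geq \frac{\sigma^2}{n}, \quad \text{Var}(\widehat{\sigma^2}) \geq \frac{2\sigma^4}{n}.$$
Because this FIM happens to be diagonal, the joint CRB on $\mu$ equals the known-$\sigma^2$ CRB: the Gaussian mean and variance are Fisher-decoupled.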

Scalar CRB vs. Vector CRB

| Aspect | Scalar ($\theta \in \mathbb{R}$) | Vector ($\boldsymbol{\theta} \in \mathbb{R}^m$) |
| --- | --- | --- |
| Bound object | Variance | Covariance matrix (PSD ordering $\succeq$) |
| Information | $J(\theta)$ (scalar) | $\mathbf{J}(\boldsymbol{\theta})$ ($m \times m$ PSD) |
| Componentwise bound | $\text{Var}(\hat{\theta}) \geq 1/J(\theta)$ | $\text{Var}(\hat{\theta}_i) \geq [\mathbf{J}^{-1}]_{ii}$, NOT $1/[\mathbf{J}]_{ii}$ |
| Attainment condition | Score is affine in $\hat{\theta}$ | Score is affine in $\hat{\boldsymbol{\theta}}$ (simultaneously) |
| Reparameterization $\alpha = u(\theta)$ | $\text{Var}(\hat{\alpha}) \geq u'(\theta)^2/J(\theta)$ | $\text{Cov}(\hat{\boldsymbol{\alpha}}) \succeq \mathbf{D}\,\mathbf{J}^{-1}\mathbf{D}^T$ |

Common Mistake: $[\mathbf{J}^{-1}]_{ii} \neq 1/[\mathbf{J}]_{ii}$

Mistake:

When computing the CRB on a single component of a vector parameter, it is easy to write $1/[\mathbf{J}(\boldsymbol{\theta})]_{ii}$ --- the "scalar formula applied to the $i$-th diagonal entry".

Correction:

The correct bound is $[\mathbf{J}(\boldsymbol{\theta})^{-1}]_{ii}$, which is always $\geq 1/[\mathbf{J}(\boldsymbol{\theta})]_{ii}$, with equality exactly when $\theta_i$ is Fisher-decoupled from the other parameters (in particular, whenever the FIM is diagonal). The inflation factor quantifies the price of estimating $\theta_i$ jointly with the nuisance parameters.
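A two-parameter numerical sanity check of the inflation (a minimal NumPy sketch; the matrix values are illustrative, not taken from the text):

```python
import numpy as np

# A hypothetical 2x2 Fisher information matrix with cross-coupling.
J = np.array([[2.0, 1.0],
              [1.0, 1.0]])

J_inv = np.linalg.inv(J)

crb_correct = J_inv[0, 0]      # correct CRB on theta_1: [J^{-1}]_{11}
crb_naive = 1.0 / J[0, 0]      # tempting but wrong: 1/[J]_{11}

print(f"[J^-1]_11 = {crb_correct:.3f}")   # 1.000
print(f"1/[J]_11  = {crb_naive:.3f}")     # 0.500
# The correct bound is twice the naive one: the off-diagonal coupling
# with theta_2 makes theta_1 harder to estimate jointly.
```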

Common Mistake: CRB Applies to Unbiased Estimators

Mistake:

"My estimator has MSE below 1/J(θ)1/J(\theta) --- I beat the CRB!"

Correction:

The CRB bounds the variance of unbiased estimators. A biased estimator can have smaller variance --- and smaller MSE --- than the CRB. For biased $\hat{\theta}$ with $\mathbb{E}_\theta[\hat{\theta}] = \theta + b(\theta)$, the correct Cramer--Rao-type inequality is $\text{Var}_\theta(\hat{\theta}) \geq (1 + b'(\theta))^2/J(\theta)$. The CRB should be compared against the variance of unbiased competitors; MSE comparisons need a different bound (e.g., Bayesian Cramer--Rao, van Trees).
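As a concrete instance (a standard illustration, reusing the Gaussian-mean example above): the shrinkage estimator $\hat{\theta} = a\bar{Y}$ with $0 < a < 1$ has bias $b(\theta) = (a-1)\theta$ and variance $a^2\sigma^2/n$, below the unbiased CRB $\sigma^2/n$; it exactly attains the biased bound $(1+b'(\theta))^2/J(\theta) = a^2\sigma^2/n$, and for small enough $|\theta|$ its MSE also beats $\sigma^2/n$. Nothing has been violated --- the comparison set changed.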

CRB vs. Monte Carlo for Amplitude Estimation in AWGN

Compare the empirical variance of the matched-filter estimator $\hat{A}_{\text{MF}} = \mathbf{s}^T\mathbf{y}/\|\mathbf{s}\|^2$ against the CRB $\sigma^2/\|\mathbf{s}\|^2$ as you sweep the SNR and the number of samples. The estimator sits exactly on the CRB, consistent with its efficiency.
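A minimal offline version of this comparison (a sketch, assuming a constant known pilot and Gaussian noise; parameter values and the SNR definition are illustrative, not the demo's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

n, A, snr_db = 32, 1.0, 10.0                 # samples, true amplitude, SNR in dB (illustrative)
s = np.ones(n)                               # assumed known signal shape
sigma2 = A**2 * np.sum(s**2) / (n * 10**(snr_db / 10))  # noise variance from the assumed SNR definition

trials = 20000
Y = A * s + rng.normal(scale=np.sqrt(sigma2), size=(trials, n))
A_hat = Y @ s / np.dot(s, s)                 # matched-filter estimate for each trial

emp_var = A_hat.var()                        # empirical variance of the estimator
crb = sigma2 / np.dot(s, s)                  # CRB = sigma^2 / ||s||^2

print(f"empirical variance: {emp_var:.3e}")
print(f"CRB:                {crb:.3e}")
# The two numbers agree up to Monte Carlo error, as expected for an efficient estimator.
```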


Fisher Information as Curvature of the Log-Likelihood

For a single Gaussian sample $Y \sim \mathcal{N}(\theta,\sigma^2)$, view the log-likelihood $\ell_\theta(y) = \log f_\theta(y)$ as a function of $\theta$ and watch how its curvature at the peak --- that is, $-\ell''_\theta(y) = 1/\sigma^2$ --- rises as $\sigma$ shrinks. Averaging this curvature over $Y$ gives the Fisher information.

⚠️ Engineering Note

Channel Estimation: Pilot SNR and the CRB

In a pilot-based channel estimator with $T_p$ orthogonal pilot symbols, the FIM for the complex channel coefficient scales as $T_p \cdot E_p/N_0$, and hence the CRB on the real and imaginary parts scales as $\sigma^2/T_p$. This is why 3GPP NR allocates a fraction of OFDM REs as DMRS: the pilot overhead directly buys CRB reduction. Increasing $T_p$ squeezes the bound linearly; the catch is that it also linearly decreases the throughput available for data. Every channel-estimator design is negotiating this trade.
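A minimal version of the scaling claim, under an assumed flat-fading pilot model (the notation $p_t$, $E_p$, $N_0$ is illustrative, not drawn verbatim from the standard): observe $Y_t = h\,p_t + Z_t$ for $t = 1,\ldots,T_p$, with known pilots satisfying $\sum_t |p_t|^2 = T_p E_p$ and $Z_t$ i.i.d. $\mathcal{CN}(0, N_0)$. The least-squares/ML estimate and its variance are
$$\hat{h} = \frac{\sum_t p_t^*\,Y_t}{\sum_t |p_t|^2}, \qquad \text{Var}(\hat{h}) = \frac{N_0}{T_p E_p},$$
which shrinks linearly in $T_p$, matching the CRB scaling quoted above.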

Practical Constraints
  • 5G NR DMRS density: 1 per 6 subcarriers (frequency), 1--4 OFDM symbols per slot (time), per TS 38.211

  • CRB on phase noise estimation scales as $1/(T \cdot \text{SNR})$ — motivates PTRS insertion at high FR2 frequencies

  • For narrowband IoT, long pilot sequences trade latency for CRB improvement

📋 Ref: 3GPP TS 38.211, Section 6.4
🎓 CommIT Contribution (2023)

CRB as One Pillar of the Sensing--Communication Tradeoff

F. Liu, G. Caire, IEEE Trans. Information Theory, vol. 69, no. 9

Integrated sensing and communication (ISAC) systems reuse a single waveform to both convey data and estimate target parameters. The resulting performance region has two axes: communication rate and sensing accuracy, where the latter is quantified via a CRB-type matrix on the target parameters (range, angle, Doppler). The work of Liu and Caire shows that the frontier of this region traces a Pareto-optimal tradeoff between the capacity expression from ITA Chapter 18 and the CRB derived exactly as in this chapter. The CRB is the operational "distortion" in the sensing rate--distortion formulation of ISAC.


Quick Check

An unbiased estimator of a scalar $\theta$ attains the CRB. What can you say about the score function?

It is zero everywhere

It equals $J(\theta)(\hat{\theta}(\mathbf{y}) - \theta)$

It depends only on θ\theta

Its mean is θ\theta

The CRB as Cauchy--Schwarz: Score and Estimator Co-linear at Efficiency

A visual proof of the scalar CRB: we treat the centered estimator $\hat{\theta}(\mathbf{Y}) - \theta$ and the score $s(\mathbf{Y};\theta)$ as vectors in $L^2(f_\theta)$, show that their inner product equals one, and read off the Cauchy--Schwarz lower bound on $\|\hat{\theta} - \theta\|^2$.
Efficiency is the statement that the estimator lies on the line spanned by the score.