Linear Prediction and Kolmogorov-Szegő

Predicting the Future From the Past

Prediction is the special case of the causal Wiener problem in which the target is the observation itself, shifted forward: $X_n = Y_{n+d}$ for some prediction horizon $d \geq 1$. The one-step predictor ($d = 1$) is foundational: it is the engine inside the innovations representation, and its MSE is a fundamental invariant of the observation process. That invariant has a closed-form expression due to Kolmogorov and Szegő, and it is one of the most beautiful formulas in linear estimation theory.

Definition: One-Step Linear Predictor

Let $\{Y_n\}$ be zero-mean WSS with PSD $P_y(f)$ satisfying the Paley-Wiener condition. The one-step linear predictor is the MMSE estimator of $Y_n$ based on the strict past $\{Y_m : m < n\}$:
$$\widehat{Y}_{n|n-1} = \sum_{k \geq 1} p[k]\, Y_{n-k}.$$
The prediction error $J_n = Y_n - \widehat{Y}_{n|n-1}$ is exactly the innovations process of Section 9.3, up to a scalar normalization.

Theorem: Kolmogorov-Szegő Formula

Under the Paley-Wiener condition, the one-step prediction MMSE is
$$\boxed{\;\sigma_p^2 = \exp\left(\int_{-1/2}^{1/2} \log P_y(f)\, df\right).\;}$$
This is the geometric mean of the PSD over one period.

The formula is striking: an integral over the whole spectrum collapses to a single number via the logarithm. It says the unpredictability of a process is captured by a geometric (not arithmetic) average of its spectral content. If $P_y(f)$ is flat (white noise), the geometric mean equals the arithmetic mean, which equals the variance, and prediction is impossible: each sample is truly fresh. If $P_y(f)$ is concentrated on a narrow band, the geometric mean is much smaller than the arithmetic mean, meaning most of the signal's power is predictable and only a small residual is unpredictable.
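To see the gap numerically, here is a minimal Python sketch (the grid size and the AR(1)-shaped PSD with pole $a = 0.9$ are illustrative choices, not from the text). Both spectra are normalized to unit power, so the arithmetic means agree and only the geometric means differ:

```python
import numpy as np

N = 4096
f = np.arange(N) / N - 0.5            # uniform grid over one period, spacing 1/N

def spectral_means(P):
    """Arithmetic mean of the PSD (= r_yy[0]) vs. geometric mean (= sigma_p^2)."""
    arithmetic = P.mean()                 # Riemann sum of P_y(f) over a width-1 interval
    geometric = np.exp(np.log(P).mean())  # Kolmogorov-Szego value
    return arithmetic, geometric

a = 0.9                                   # illustrative pole location (assumed)
flat = np.ones_like(f)                    # white noise, unit variance
peaky = (1 - a**2) / (1 - 2 * a * np.cos(2 * np.pi * f) + a**2)  # AR(1)-shaped

for name, P in [("flat ", flat), ("peaky", peaky)]:
    r0, sp2 = spectral_means(P)
    print(f"{name}: r_yy[0] = {r0:.3f}, sigma_p^2 = {sp2:.3f}")
# flat : r_yy[0] = 1.000, sigma_p^2 = 1.000  -> nothing is predictable
# peaky: r_yy[0] = 1.000, sigma_p^2 = 0.190  -> ~81% of the power is predictable
```

The flat spectrum gives $\sigma_p^2 = r_{yy}[0]$, while the peaked one leaves only a fraction $1 - a^2$ of its power unpredictable.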


Example: Kolmogorov-Szegő for AR(1)

Let $Y_n = a Y_{n-1} + U_n$ with $U_n$ white of variance $\sigma_u^2$ and $|a| < 1$. Verify the Kolmogorov-Szegő formula by computing both sides.
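A sketch of the verification (the computations are standard; included for completeness). The AR(1) PSD is $P_y(f) = \sigma_u^2 / |1 - a e^{-j2\pi f}|^2$, so
$$\int_{-1/2}^{1/2} \log P_y(f)\, df = \log \sigma_u^2 - \int_{-1/2}^{1/2} \log\left|1 - a e^{-j2\pi f}\right|^2 df = \log \sigma_u^2,$$
since expanding $\log(1 - a e^{-j2\pi f}) = -\sum_{k \geq 1} (a^k/k)\, e^{-j2\pi f k}$ shows that every term integrates to zero over one period when $|a| < 1$. The right side of the formula is therefore $\exp(\log \sigma_u^2) = \sigma_u^2$. For the left side, the predictor $\widehat{Y}_{n|n-1} = a Y_{n-1}$ leaves error $Y_n - a Y_{n-1} = U_n$, which is orthogonal to the entire past, so it is the MMSE predictor and $\sigma_p^2 = \sigma_u^2$. The two sides agree.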

Theorem: $d$-Step Prediction MMSE

The MMSE of the $d$-step predictor $\widehat{Y}_{n+d|n}$ using $\{Y_m : m \leq n\}$ is
$$\sigma_{p,d}^2 = \sum_{m=0}^{d-1} |q[m]|^2, \qquad q[m] = \int_{-1/2}^{1/2} P_y^+(f)\, e^{j 2\pi f m}\, df,$$
where $\{q[m]\}$ are the inverse DTFT coefficients of $P_y^+(f)$. In particular, $\sigma_{p,1}^2 = |q[0]|^2 = \sigma_p^2$ recovers the Kolmogorov-Szegő formula.

The $d$-step prediction error is the sum of the first $d$ innovations that will occur after time $n$, weighted by the MA coefficients $q[m] = p^+[m]$ of the minimum-phase representation. As $d \to \infty$, $\sigma_{p,d}^2 \to r_{yy}[0]$ (the signal variance): predicting the far future becomes the same as estimating the marginal mean, which for zero-mean processes gives the variance.
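Continuing the AR(1) example as a quick check (using its minimum-phase factor $P_y^+(f) = \sigma_u / (1 - a e^{-j2\pi f})$, whose inverse DTFT gives $q[m] = \sigma_u a^m$):
$$\sigma_{p,d}^2 = \sigma_u^2 \sum_{m=0}^{d-1} a^{2m} = \sigma_u^2\, \frac{1 - a^{2d}}{1 - a^2} \;\longrightarrow\; \frac{\sigma_u^2}{1 - a^2} = r_{yy}[0] \quad \text{as } d \to \infty.$$
Both limits check out: $d = 1$ recovers $\sigma_u^2$, and $d \to \infty$ recovers the AR(1) variance.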

Prediction Is a Special Case of Causal Wiener Filtering

In the causal Wiener framework, the $d$-step prediction problem corresponds to taking $X_n = Y_{n+d}$. The cross-PSD is $P_{xy}(f) = P_y(f)\, e^{j 2\pi f d}$ (a frequency-shifted version of the PSD). Substituting into the causal Wiener filter formula of Section 9.4 gives
$$\check{h}_{d\text{-step}}(f) = \frac{1}{P_y^+(f)} \left[e^{j 2\pi f d}\, P_y^+(f)\right]_+,$$
which is the formula on page 126 of Kailath's Linear Estimation. So every result in this section is a corollary of Section 9.4: a comforting sanity check that the machinery is consistent.
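As a concrete instance, the AR(1) example specializes cleanly (a sketch: with $P_y^+(f) = \sigma_u \sum_{k \geq 0} a^k e^{-j2\pi f k}$, multiplying by $e^{j2\pi f d}$ shifts each term to lag $k - d$, and the causal bracket $[\cdot]_+$ keeps only the terms with $k \geq d$):
$$\left[e^{j2\pi f d}\, P_y^+(f)\right]_+ = \sigma_u \sum_{k \geq d} a^k e^{-j2\pi f (k-d)} = a^d\, P_y^+(f), \qquad \text{so} \quad \check{h}_{d\text{-step}}(f) = a^d.$$
The optimal $d$-step predictor is simply $\widehat{Y}_{n+d|n} = a^d Y_n$, with MSE $r_{yy}[0](1 - a^{2d})$, matching $\sigma_{p,d}^2$ above.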

Why This Matters: Channel Prediction in Wireless Systems

Wireless channel coefficients are themselves WSS processes (approximately, over the coherence time), typically modeled as Rayleigh or Ricean fading with a Jakes-like Doppler spectrum. Predicting the channel one or more symbol periods ahead is essential for closed-loop beamforming, adaptive modulation, and link adaptation in 5G/6G systems. The Kolmogorov-Szegő formula gives the fundamental limit on how well the channel can be predicted: for a heavily Doppler-spread channel the geometric mean of the PSD is close to the arithmetic mean and prediction is difficult; for a narrowband Doppler channel the geometric mean is much smaller and several steps of prediction are feasible. This prediction gap drives the choice of feedback rate in multiuser MIMO systems.
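A rough numerical illustration of this limit (a sketch, not a system model: the normalized Doppler spreads and the small noise floor, added so the log stays integrable, are assumptions; a strictly bandlimited Jakes spectrum violates Paley-Wiener and would drive the geometric mean to zero):

```python
import numpy as np

N = 8192
f = np.arange(N) / N - 0.5                     # normalized frequency, one period

def jakes_psd(f, f_d, floor=1e-4):
    """Jakes-like Doppler PSD plus an assumed noise floor, scaled to unit power."""
    P = np.full_like(f, floor)
    band = np.abs(f) < f_d
    P[band] += 1.0 / (np.pi * f_d * np.sqrt(1.0 - (f[band] / f_d) ** 2))
    return P / P.mean()

for f_d in (0.5, 0.05):                        # heavy vs. light Doppler spread
    P = jakes_psd(f, f_d)
    sigma_p2 = np.exp(np.log(P).mean())        # Kolmogorov-Szego geometric mean
    print(f"f_D = {f_d:4.2f}: sigma_p^2 = {sigma_p2:.4f}  (unit channel variance)")
# f_D = 0.50: sigma_p^2 ~ 0.87  -> heavy Doppler spread, little predictability
# f_D = 0.05: sigma_p^2 ~ 3e-4  -> narrowband Doppler, highly predictable
```

In the second case the result is dominated by the assumed noise floor: with a narrow Doppler spectrum, almost all of the channel power is predictable.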

Historical Note: Szegő's Formula Predates the Filter


The formula $\exp\left(\int \log P_y(f)\, df\right)$ appeared in pure mathematics long before it entered signal processing. Gábor Szegő (1895-1985) published it in 1920 as a limit theorem for determinants of Toeplitz matrices: $(\det \mathbf{T}_N)^{1/N} \to \exp\left(\int \log T(f)\, df\right)$ as $N \to \infty$, where $\mathbf{T}_N$ is the $N \times N$ Toeplitz matrix with symbol $T(f)$. Kolmogorov recognized in 1941 that this same expression is the asymptotic one-step prediction variance. The bridge is Szegő's theorem on the asymptotic distribution of Toeplitz eigenvalues, which says the eigenvalues of $\mathbf{T}_N$ are asymptotically distributed according to the symbol $T(f)$. The normalized determinant, being the $N$-th root of the product of the eigenvalues, then converges to $\exp\left(\int \log T(f)\, df\right)$.

Common Mistake: K-S for White Noise Gives the Variance, Not Zero

Mistake: Assuming that because white noise is unpredictable, the Kolmogorov-Szegő formula returns zero.

Correction: If $P_y(f) = \sigma^2$ is constant, then $\int \log \sigma^2\, df = \log \sigma^2$, so $\sigma_p^2 = \sigma^2$: the full variance, not zero. The interpretation: white noise is maximally unpredictable, so its prediction MSE equals its total variance. This is consistent: the best you can do is predict zero, and the MSE is the variance.

Key Takeaway

The one-step prediction MMSE equals the geometric mean of the PSD: $\sigma_p^2 = \exp\int \log P_y(f)\, df$. Predictability is captured by how far the geometric mean falls below the arithmetic mean (the variance). A process is predictable to exactly the extent that its spectrum is peaky.

Quick Check

For a WSS process with PSD $P_y(f) = 2 + \cos(2\pi f)$, what can you say about the one-step prediction MMSE $\sigma_p^2$?

$\sigma_p^2 < r_{yy}[0]$ strictly.

$\sigma_p^2 = r_{yy}[0] = 2$ because the process is WSS.

$\sigma_p^2 = 0$ because $P_y$ is a smooth function.