Linear Prediction and Kolmogorov-Szegő

Predicting the Future From the Past

Prediction is the special case of the causal Wiener problem in which the target is the observation itself, shifted forward: $X_n = Y_{n+d}$ for some prediction horizon $d \geq 1$. The one-step predictor ($d = 1$) is foundational: it is the engine inside the innovations representation, and its MSE is a fundamental invariant of the observation process. That invariant has a closed-form expression due to Kolmogorov and Szegő, and it is one of the most beautiful formulas in linear estimation theory.

Definition: One-Step Linear Predictor

Let $\{Y_n\}$ be zero-mean WSS with PSD $P_y(f)$ satisfying the Paley-Wiener condition. The one-step linear predictor is the MMSE estimator of $Y_n$ based on the strict past $\{Y_m : m < n\}$:
$$\widehat{Y}_{n|n-1} = \sum_{k \geq 1} p[k]\, Y_{n-k}.$$
The prediction error $J_n = Y_n - \widehat{Y}_{n|n-1}$ is exactly the innovations process of Section 9.3, up to a scalar normalization.

Theorem: Kolmogorov-Szegő Formula

Under the Paley-Wiener condition, the one-step prediction MMSE is
$$\boxed{\;\sigma_p^2 = \exp\left(\int_{-1/2}^{1/2} \log P_y(f)\, df\right).\;}$$
This is the geometric mean of the PSD over one period.

The formula is striking: an integral over the whole spectrum collapses to a single number via the logarithm. It says the unpredictability of a process is captured by a geometric (not arithmetic) average of its spectral content. If $P_y(f)$ is flat (white noise), the geometric mean equals the arithmetic mean, which equals the variance, and prediction is impossible: each sample is truly fresh. If $P_y(f)$ is concentrated on a narrow band, the geometric mean is much smaller than the arithmetic mean, meaning most of the signal's power is predictable and only a small residual is unpredictable.
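To see the gap numerically, here is a minimal Python sketch (the grid size and the AR(1)-shaped PSD with pole $a = 0.9$ are illustrative choices, not from the text). Both spectra are normalized to unit power, so the arithmetic means agree and only the geometric means differ:

```python
import numpy as np

N = 4096
f = np.arange(N) / N - 0.5            # uniform grid over one period, spacing 1/N

def spectral_means(P):
    """Arithmetic mean of the PSD (= r_yy[0]) vs. geometric mean (= sigma_p^2)."""
    arithmetic = P.mean()                 # Riemann sum of P_y(f) over a width-1 interval
    geometric = np.exp(np.log(P).mean())  # Kolmogorov-Szego value
    return arithmetic, geometric

a = 0.9                                   # illustrative pole location (assumed)
flat = np.ones_like(f)                    # white noise, unit variance
peaky = (1 - a**2) / (1 - 2 * a * np.cos(2 * np.pi * f) + a**2)  # AR(1)-shaped

for name, P in [("flat ", flat), ("peaky", peaky)]:
    r0, sp2 = spectral_means(P)
    print(f"{name}: r_yy[0] = {r0:.3f}, sigma_p^2 = {sp2:.3f}")
# flat : r_yy[0] = 1.000, sigma_p^2 = 1.000  -> nothing is predictable
# peaky: r_yy[0] = 1.000, sigma_p^2 = 0.190  -> ~81% of the power is predictable
```

The flat spectrum gives $\sigma_p^2 = r_{yy}[0]$, while the peaked one leaves only a fraction $1 - a^2$ of its power unpredictable.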


Example: Kolmogorov-Szegő for AR(1)

Let $Y_n = a Y_{n-1} + U_n$ with $U_n$ white of variance $\sigma_u^2$ and $|a| < 1$. Verify the Kolmogorov-Szegő formula by computing both sides.
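A sketch of the verification (the computations are standard; included for completeness). The AR(1) PSD is $P_y(f) = \sigma_u^2 / |1 - a e^{-j2\pi f}|^2$, so
$$\int_{-1/2}^{1/2} \log P_y(f)\, df = \log \sigma_u^2 - \int_{-1/2}^{1/2} \log\left|1 - a e^{-j2\pi f}\right|^2 df = \log \sigma_u^2,$$
since expanding $\log(1 - a e^{-j2\pi f}) = -\sum_{k \geq 1} (a^k/k)\, e^{-j2\pi f k}$ shows that every term integrates to zero over one period when $|a| < 1$. The right side of the formula is therefore $\exp(\log \sigma_u^2) = \sigma_u^2$. For the left side, the predictor $\widehat{Y}_{n|n-1} = a Y_{n-1}$ leaves error $Y_n - a Y_{n-1} = U_n$, which is orthogonal to the entire past, so it is the MMSE predictor and $\sigma_p^2 = \sigma_u^2$. The two sides agree.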

Theorem: $d$-Step Prediction MMSE

The MMSE of the $d$-step predictor $\widehat{Y}_{n+d|n}$ using $\{Y_m : m \leq n\}$ is
$$\sigma_{p,d}^2 = \sum_{m=0}^{d-1} |q[m]|^2, \qquad q[m] = \int_{-1/2}^{1/2} P_y^+(f)\, e^{j 2\pi f m}\, df,$$
where $\{q[m]\}$ are the inverse DTFT coefficients of $P_y^+(f)$. In particular, $\sigma_{p,1}^2 = |q[0]|^2 = \sigma_p^2$ recovers the Kolmogorov-Szegő formula.

The $d$-step prediction error is the sum of the first $d$ innovations that will occur after time $n$, weighted by the MA coefficients $q[m] = p^+[m]$ of the minimum-phase representation. As $d \to \infty$, $\sigma_{p,d}^2 \to r_{yy}[0]$ (the signal variance): predicting the far future becomes the same as estimating the marginal mean, which for zero-mean processes gives the variance.
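Continuing the AR(1) example as a quick check (using its minimum-phase factor $P_y^+(f) = \sigma_u / (1 - a e^{-j2\pi f})$, whose inverse DTFT gives $q[m] = \sigma_u a^m$):
$$\sigma_{p,d}^2 = \sigma_u^2 \sum_{m=0}^{d-1} a^{2m} = \sigma_u^2\, \frac{1 - a^{2d}}{1 - a^2} \;\longrightarrow\; \frac{\sigma_u^2}{1 - a^2} = r_{yy}[0] \quad \text{as } d \to \infty.$$
Both limits check out: $d = 1$ recovers $\sigma_u^2$, and $d \to \infty$ recovers the AR(1) variance.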

Prediction Is a Special Case of Causal Wiener Filtering

In the causal Wiener framework, the $d$-step prediction problem corresponds to taking $X_n = Y_{n+d}$. The cross-PSD is $P_{xy}(f) = P_y(f)\, e^{j 2\pi f d}$ (a frequency-shifted version of the PSD). Substituting into the causal Wiener filter formula of Section 9.4 gives
$$\check{h}_{d\text{-step}}(f) = \frac{1}{P_y^+(f)} \left[e^{j 2\pi f d}\, P_y^+(f)\right]_+,$$
which is the formula on page 126 of Kailath's Linear Estimation. So every result in this section is a corollary of Section 9.4: a comforting sanity check that the machinery is consistent.
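As a concrete instance, the AR(1) example specializes cleanly (a sketch: with $P_y^+(f) = \sigma_u \sum_{k \geq 0} a^k e^{-j2\pi f k}$, multiplying by $e^{j2\pi f d}$ shifts each term to lag $k - d$, and the causal bracket $[\cdot]_+$ keeps only the terms with $k \geq d$):
$$\left[e^{j2\pi f d}\, P_y^+(f)\right]_+ = \sigma_u \sum_{k \geq d} a^k e^{-j2\pi f (k-d)} = a^d\, P_y^+(f), \qquad \text{so} \quad \check{h}_{d\text{-step}}(f) = a^d.$$
The optimal $d$-step predictor is simply $\widehat{Y}_{n+d|n} = a^d Y_n$, with MSE $r_{yy}[0](1 - a^{2d})$, matching $\sigma_{p,d}^2$ above.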

Why This Matters: Channel Prediction in Wireless Systems

Wireless channel coefficients are themselves WSS processes (approximately, over the coherence time), typically modeled as Rayleigh or Ricean fading with a Jakes-like Doppler spectrum. Predicting the channel one or more symbol periods ahead is essential for closed-loop beamforming, adaptive modulation, and link adaptation in 5G/6G systems. The Kolmogorov-Szegő formula gives the fundamental limit on how well the channel can be predicted: for a heavily Doppler-spread channel the geometric mean of the PSD is close to the arithmetic mean and prediction is difficult; for a narrowband Doppler channel the geometric mean is much smaller and several steps of prediction are feasible. This prediction gap drives the choice of feedback rate in multiuser MIMO systems.
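A rough numerical illustration of this limit (a sketch, not a system model: the normalized Doppler spreads and the small noise floor, added so the log stays integrable, are assumptions; a strictly bandlimited Jakes spectrum violates Paley-Wiener and would drive the geometric mean to zero):

```python
import numpy as np

N = 8192
f = np.arange(N) / N - 0.5                     # normalized frequency, one period

def jakes_psd(f, f_d, floor=1e-4):
    """Jakes-like Doppler PSD plus an assumed noise floor, scaled to unit power."""
    P = np.full_like(f, floor)
    band = np.abs(f) < f_d
    P[band] += 1.0 / (np.pi * f_d * np.sqrt(1.0 - (f[band] / f_d) ** 2))
    return P / P.mean()

for f_d in (0.5, 0.05):                        # heavy vs. light Doppler spread
    P = jakes_psd(f, f_d)
    sigma_p2 = np.exp(np.log(P).mean())        # Kolmogorov-Szego geometric mean
    print(f"f_D = {f_d:4.2f}: sigma_p^2 = {sigma_p2:.4f}  (unit channel variance)")
# f_D = 0.50: sigma_p^2 ~ 0.87  -> heavy Doppler spread, little predictability
# f_D = 0.05: sigma_p^2 ~ 3e-4  -> narrowband Doppler, highly predictable
```

In the second case the result is dominated by the assumed noise floor: with a narrow Doppler spectrum, almost all of the channel power is predictable.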

Historical Note: Szegő's Formula Predates the Filter


The formula $\exp\left(\int \log P_y(f)\, df\right)$ appeared in pure mathematics long before it entered signal processing. Gábor Szegő (1895-1985) published it in 1920 as a limit theorem for determinants of Toeplitz matrices: $(\det \mathbf{T}_N)^{1/N} \to \exp\left(\int \log T(f)\, df\right)$ as $N \to \infty$, where $\mathbf{T}_N$ is the $N \times N$ Toeplitz matrix with symbol $T(f)$. Kolmogorov recognized in 1941 that this same expression is the asymptotic one-step prediction variance. The bridge is Szegő's theorem on the asymptotic distribution of Toeplitz eigenvalues, which says the eigenvalues of $\mathbf{T}_N$ are asymptotically distributed according to the symbol $T(f)$. The normalized determinant, being the $N$-th root of the product of the eigenvalues, then converges to $\exp\left(\int \log T(f)\, df\right)$.

Common Mistake: K-S for White Noise Gives the Variance, Not Zero

Mistake: Assuming that because white noise is unpredictable, the Kolmogorov-Szegő formula returns zero.

Correction: If $P_y(f) = \sigma^2$ is constant, then $\int \log \sigma^2\, df = \log \sigma^2$, so $\sigma_p^2 = \sigma^2$: the full variance, not zero. The interpretation: white noise is maximally unpredictable, so its prediction MSE equals its total variance. This is consistent: the best you can do is predict zero, and the MSE is the variance.

Key Takeaway

The one-step prediction MMSE equals the geometric mean of the PSD: $\sigma_p^2 = \exp\int \log P_y(f)\, df$. Predictability is captured by how far the geometric mean falls below the arithmetic mean (the variance). A process is predictable to exactly the extent that its spectrum is peaky.

Quick Check

For a WSS process with PSD $P_y(f) = 2 + \cos(2\pi f)$, what can you say about the one-step prediction MMSE $\sigma_p^2$?

$\sigma_p^2 < r_{yy}[0]$ strictly.

$\sigma_p^2 = r_{yy}[0] = 2$ because the process is WSS.

$\sigma_p^2 = 0$ because $P_y$ is a smooth function.