Kernel Methods and Non-Parametric Regression

When We Don't Know the Model

Every estimator we have built so far starts from a parametric model: the noise is Gaussian, the channel is Rayleigh, the parameter lives in $\mathbb{R}^d$ and we know $d$. Non-parametric methods refuse to commit to any fixed parameterization; they let the complexity grow with the data. Kernel density estimation asks, how dense are the data near this point? Nadaraya–Watson asks, what is the average response of nearby inputs? Gaussian processes ask, what function-space prior is consistent with what I have seen so far?

This flexibility is not free. Non-parametric estimators converge more slowly than parametric ones ($n^{-4/(4+d)}$ vs. $n^{-1}$ in MSE), and they pay a curse of dimensionality. But when the true model is genuinely unknown or strongly non-Gaussian, they often beat a mis-specified parametric estimator by a wide margin. The engineering question is never "parametric or non-parametric?" but rather "at what point does parametric bias exceed non-parametric variance?"

Definition:

Kernel Density Estimator

Given i.i.d. samples $x_1, \dots, x_n$ from an unknown density $f$ on $\mathbb{R}$, the kernel density estimator (KDE) with kernel $K$ and bandwidth $h > 0$ is
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right).$$
The kernel $K$ is a non-negative function satisfying $\int K(u)\,du = 1$, $\int u K(u)\,du = 0$, and $\int u^2 K(u)\,du = \sigma_K^2 < \infty$.

Two special cases: a rectangular window $K$ recovers the histogram (after binning); the Gaussian kernel $K(u) = (2\pi)^{-1/2} e^{-u^2/2}$ gives an estimator $\hat{f}_h$ that is $C^\infty$.
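A minimal NumPy sketch of the Gaussian-kernel KDE (function and variable names are illustrative, not a reference implementation):

```python
import numpy as np

def kde_gaussian(samples, x, h):
    """Evaluate the Gaussian-kernel KDE f_hat_h at query points x."""
    u = np.subtract.outer(x, samples) / h            # (x - x_i) / h for all pairs
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel values
    return K.sum(axis=1) / (len(samples) * h)        # (1 / nh) * sum_i K(...)

# toy usage: bimodal mixture, as in the demo below
rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 300)])
grid = np.linspace(-4, 4, 9)
print(kde_gaussian(data, grid, h=0.3))               # density estimate on the grid
```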

Kernel Density Estimator

A smooth non-parametric estimator $\hat{f}_h(x) = \tfrac{1}{nh}\sum_i K((x - x_i)/h)$ of an unknown density. The bandwidth $h$ trades variance against bias: small $h$ gives low bias but high variance, large $h$ gives high bias but low variance. Introduced by Rosenblatt (1956) and Parzen (1962).

Related: Nadaraya–Watson Estimator, Bandwidth

Theorem: Mean Integrated Squared Error of the KDE

Assume $f$ is twice continuously differentiable with $\int f''(x)^2\,dx < \infty$, and $K$ satisfies the standard regularity conditions above. Then the pointwise bias and variance of $\hat{f}_h$ are
$$\mathbb{E}[\hat{f}_h(x)] - f(x) = \tfrac{1}{2} h^2 \sigma_K^2 f''(x) + o(h^2),$$
$$\mathrm{Var}(\hat{f}_h(x)) = \frac{f(x) \int K(u)^2\,du}{nh} + o\!\left(\frac{1}{nh}\right),$$
and the asymptotic MISE is minimized by
$$h^\star = \left( \frac{\int K^2}{n\, \sigma_K^4 \int (f'')^2} \right)^{1/5} = O(n^{-1/5}),$$
at which $\mathrm{MISE}(\hat{f}_{h^\star}) = O(n^{-4/5})$.

Bias grows as $h^2$ (we over-smooth); variance shrinks as $1/(nh)$ (wider windows average more samples). The optimal bandwidth balances the two, and the resulting rate $n^{-4/5}$ is the non-parametric price: slower than the parametric $n^{-1}$ rate.
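In practice $\int (f'')^2$ is unknown, so $h^\star$ cannot be computed directly. Silverman's rule of thumb plugs in a Gaussian reference density, giving $h = 0.9 \min(\hat\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$ for a Gaussian kernel. A minimal sketch, assuming NumPy:

```python
import numpy as np

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    n = len(samples)
    sigma = np.std(samples, ddof=1)                       # sample standard deviation
    iqr = np.subtract(*np.percentile(samples, [75, 25]))  # robust spread estimate
    return 0.9 * min(sigma, iqr / 1.34) * n ** (-0.2)

print(silverman_bandwidth(np.random.default_rng(7).normal(size=500)))
```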

Bandwidth Selection in Kernel Density Estimation

Estimate the density of a bimodal Gaussian mixture from a finite sample with a Gaussian kernel. Sweep the bandwidth $h$ from under- to over-smoothed and watch the bias–variance tradeoff in action. The optimal bandwidth (plug-in or Silverman's rule) is marked for reference.


Historical Note: Rosenblatt and Parzen

1956–1962

Murray Rosenblatt introduced the kernel density estimator in a 1956 note in the Annals of Mathematical Statistics, where he gave the basic construction and proved consistency. Emanuel Parzen extended it in 1962 with a general asymptotic theory and popularized the tool under the name "Parzen window." In parallel, Elizbar Nadaraya (1964) and Geoffrey Watson (1964) independently proposed the kernel regression estimator that now bears both their names. These four papers established non-parametric estimation as a proper statistical discipline rather than a collection of heuristics.

Definition:

Nadaraya–Watson Estimator

Given i.i.d. pairs $(x_1, y_1), \dots, (x_n, y_n)$ from the joint distribution of $(X, Y)$, the Nadaraya–Watson estimator of the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ is
$$\hat{m}_h(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^n K\!\left(\frac{x - x_i}{h}\right)} = \sum_{i=1}^n w_i(x)\, y_i,$$
where the weights $w_i(x) = K((x - x_i)/h) / \sum_j K((x - x_j)/h)$ are non-negative and sum to one. The estimator is a locally weighted average of the responses.
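A minimal NumPy sketch of the estimator with a Gaussian kernel (names are illustrative):

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h):
    """Locally weighted average: m_hat(x) = sum_i w_i(x) y_i."""
    u = np.subtract.outer(x_query, x_train) / h   # (x - x_i) / h for all pairs
    W = np.exp(-0.5 * u ** 2)                     # unnormalized weights K((x - x_i)/h)
    return (W @ y_train) / W.sum(axis=1)          # normalize so weights sum to one

# toy usage: noisy sine on [0, 1]
rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)
print(nadaraya_watson(x, y, np.array([0.25, 0.5]), h=0.05))  # approx [1.0, 0.0]
```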

Nadaraya–Watson Estimator

A non-parametric estimator of the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ obtained as a kernel-weighted average of observed responses. Equivalent to locally constant regression; a first-order generalization is local linear regression, which reduces boundary bias.

Related: Kernel Density Estimator, Bandwidth

Example: Boundary Bias of Nadaraya–Watson

Suppose the $x_i$ are uniform on $[0, 1]$ and we estimate $m(x)$ at $x = 0$ with a Gaussian kernel of bandwidth $h$. Why does $\hat{m}_h(0)$ exhibit bias of order $O(h)$ rather than $O(h^2)$?
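A sketch of the answer, under the usual Taylor expansion: at an interior point the kernel sees data symmetrically on both sides, so the first-order term proportional to $m'(x)\int u K(u)\,du$ vanishes by the zero-mean property of $K$. At $x = 0$ there are no design points to the left, so the effective kernel is truncated to half its support and its first moment no longer vanishes:

$$\mathbb{E}[\hat{m}_h(0)] - m(0) \approx m'(0)\, h \cdot \frac{\int_0^\infty u K(u)\,du}{\int_0^\infty K(u)\,du} = O(h),$$

which dominates the interior $O(h^2)$ term. Local linear regression removes this first-moment imbalance by construction, restoring $O(h^2)$ at the boundary.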

Definition:

Reproducing Kernel Hilbert Space

A reproducing kernel Hilbert space (RKHS) on a set $\mathcal{X}$ is a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}$ for which every evaluation functional $L_x: f \mapsto f(x)$ is bounded. By the Riesz representation theorem there exists a reproducing kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying:

  1. $K(\cdot, x) \in \mathcal{H}$ for every $x \in \mathcal{X}$,
  2. $\langle f, K(\cdot, x) \rangle_{\mathcal{H}} = f(x)$ for every $f \in \mathcal{H}$, $x \in \mathcal{X}$ (reproducing property).

The kernel is symmetric and positive definite: for any $x_1, \dots, x_n$ and $c_1, \dots, c_n \in \mathbb{R}$, $\sum_{i,j} c_i c_j K(x_i, x_j) \geq 0$.

Conversely, every symmetric positive-definite kernel generates a unique RKHS (Moore–Aronszajn theorem). The RKHS is the "function space" in which penalized regression, Gaussian process regression, and support vector machines all live.
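Positive definiteness is easy to sanity-check numerically: the Gram matrix of a kernel on any point set must have no negative eigenvalues, up to floating-point noise. A minimal sketch, assuming NumPy and a Gaussian kernel (names are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, ell=1.0):
    """Squared-exponential kernel K(x, y) = exp(-(x - y)^2 / (2 ell^2))."""
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)     # arbitrary point set
G = gaussian_kernel(x, x)           # Gram matrix G_ij = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(G)     # symmetric matrix -> real spectrum
print(eigvals.min())                # nonnegative up to floating-point error
```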

Reproducing Kernel Hilbert Space (RKHS)

A Hilbert space of functions in which point evaluation is a bounded linear functional; equivalently, one generated by a positive-definite kernel $K$ through the reproducing property $f(x) = \langle f, K(\cdot, x)\rangle$. Provides the geometric foundation for kernel methods, Gaussian processes, and SVMs.

Related: Gaussian Process, Kernel Ridge Regression

Theorem: Representer Theorem

Let $\mathcal{H}$ be an RKHS with kernel $K$. For any strictly increasing $g: \mathbb{R}_{\geq 0} \to \mathbb{R}$ and any loss $\ell$, the minimizer of the regularized empirical risk
$$f^\star = \arg\min_{f \in \mathcal{H}} \; \sum_{i=1}^n \ell(y_i, f(x_i)) + g(\|f\|_{\mathcal{H}}^2)$$
admits the finite-dimensional representation $f^\star(x) = \sum_{i=1}^n \alpha_i K(x_i, x)$ for some $\boldsymbol{\alpha} \in \mathbb{R}^n$.

Even though $\mathcal{H}$ may be infinite-dimensional (for a Gaussian kernel it is), the optimizer is always a finite linear combination of the $n$ kernel functions centered at the training points. The representer theorem is what makes kernel methods computationally tractable: we optimize over $\mathbb{R}^n$, not over an infinite-dimensional function space.
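As a concrete instance, take squared loss and $g(t) = \lambda t$: the optimum is kernel ridge regression, and the representer coefficients solve the single $n \times n$ linear system $(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$. A minimal sketch, assuming NumPy and a squared-exponential kernel (names are illustrative):

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """Squared-exponential Gram matrix between 1-D point sets A and B."""
    return np.exp(-np.subtract.outer(A, B) ** 2 / (2 * ell ** 2))

def kernel_ridge_fit(x, y, lam=1e-2, ell=1.0):
    """Solve (K + lam * I) alpha = y for the representer coefficients."""
    K = se_kernel(x, x, ell)
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def kernel_ridge_predict(x_train, alpha, x_test, ell=1.0):
    """f*(x) = sum_i alpha_i K(x_i, x)."""
    return se_kernel(x_test, x_train, ell) @ alpha

# toy usage: fit a noisy sine
rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(40)
alpha = kernel_ridge_fit(x, y)
print(kernel_ridge_predict(x, alpha, np.array([np.pi / 2])))  # approx 1.0
```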

Definition:

Gaussian Process Regression

A Gaussian process (GP) is a distribution over functions $f: \mathcal{X} \to \mathbb{R}$ such that every finite collection $(f(x_1), \dots, f(x_n))$ is jointly Gaussian. It is specified by a mean function $\mu(x)$ and a covariance function $k(x, x')$: $f \sim \mathcal{GP}(\mu, k)$.

Given training data $\mathbf{y} = \mathbf{f}(\mathbf{X}) + \mathbf{w}$ with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, the posterior mean and posterior variance at a test point $x^\star$ are
$$\mu^\star(x^\star) = \mathbf{k}_\star^T (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y},$$
$$\sigma_\star^2(x^\star) = k(x^\star, x^\star) - \mathbf{k}_\star^T (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_\star,$$
where $\mathbf{K}_{ij} = k(x_i, x_j)$ and $\mathbf{k}_\star = (k(x_1, x^\star), \dots, k(x_n, x^\star))^T$.

Gaussian Process

A stochastic process such that any finite collection of its values is jointly Gaussian. Specified entirely by its mean and covariance functions. Used as a flexible Bayesian prior over functions; its posterior delivers both a mean prediction and a calibrated predictive variance.

Related: Reproducing Kernel Hilbert Space, Kernel Ridge Regression

GP Posterior Mean = Kernel Ridge Regression

The Gaussian-process posterior mean $\mu^\star(x^\star) = \mathbf{k}_\star^T (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$ is algebraically identical to the kernel ridge regression estimator obtained from the representer theorem with $\ell$ the squared loss and regularizer $\sigma^2 \|f\|_{\mathcal{H}}^2$. The two perspectives, Bayesian function prior vs. regularized risk minimization, produce the same point estimator. The Gaussian process view adds an error bar: the posterior variance $\sigma_\star^2$ quantifies epistemic uncertainty about $f(x^\star)$, which kernel ridge regression does not deliver directly.
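A numerical sketch of both formulas, assuming NumPy (names are illustrative): the coefficient vector $(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$ is exactly what kernel ridge regression produces with $\lambda = \sigma^2$, and the GP additionally returns the posterior variance.

```python
import numpy as np

def se_kernel(A, B, ell=0.8, sf2=1.0):
    """Squared-exponential kernel sf2 * exp(-(a - b)^2 / (2 ell^2))."""
    return sf2 * np.exp(-np.subtract.outer(A, B) ** 2 / (2 * ell ** 2))

def gp_posterior(x, y, x_star, noise=0.01):
    """GP posterior mean and variance at test points x_star (zero prior mean)."""
    A = se_kernel(x, x) + noise * np.eye(len(x))   # K + sigma^2 I
    k_star = se_kernel(x_star, x)                  # cross-covariances, shape (m, n)
    alpha = np.linalg.solve(A, y)                  # same alpha as KRR with lam = sigma^2
    mean = k_star @ alpha
    v = np.linalg.solve(A, k_star.T)               # A^{-1} k_*^T
    var = se_kernel(x_star, x_star).diagonal() - np.sum(k_star * v.T, axis=1)
    return mean, var                               # prior variance minus explained part

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 15)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(15)
mean, var = gp_posterior(x, y, np.linspace(-4, 4, 5))
print(mean, var)   # variance grows toward the data-free edges
```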

Gaussian Process Regression with Uncertainty Bands

Fit a GP with a squared-exponential kernel to a small noisy sample from an unknown smooth function. Adjust the length-scale $\ell$, output variance $\sigma_f^2$, and noise $\sigma^2$, and watch the posterior mean and 95% credible bands change. Notice how the bands widen in regions with no data and shrink near training points.


Example: Extrapolation Limits of a GP

Using a squared-exponential kernel with length-scale $\ell$, show that far from the training data the posterior mean reverts to the prior mean and the posterior variance reverts to $k(x^\star, x^\star)$.
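A sketch of the answer: every entry of $\mathbf{k}_\star$ is $\sigma_f^2 \exp(-(x^\star - x_i)^2/(2\ell^2))$, which vanishes once $\min_i |x^\star - x_i| \gg \ell$. Hence

$$\mu^\star(x^\star) = \mathbf{k}_\star^T (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y} \to 0 \quad\text{and}\quad \sigma_\star^2(x^\star) = k(x^\star, x^\star) - \mathbf{k}_\star^T (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_\star \to k(x^\star, x^\star),$$

since the data-dependent terms are linear and quadratic in $\mathbf{k}_\star$ respectively, while $(\mathbf{K} + \sigma^2\mathbf{I})^{-1}$ is fixed by the training set. The GP extrapolates to its prior, not to the trend of the data.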

Common Mistake: GPs Do Not Scale to Large $n$

Mistake:

Naïvely applying Gaussian process regression to a dataset with $n = 10^5$ or more training points.

Correction:

The GP posterior requires inverting (or solving a system with) the $n \times n$ Gram matrix $\mathbf{K} + \sigma^2 \mathbf{I}$, which costs $O(n^3)$ time and $O(n^2)$ memory. For $n$ beyond a few thousand, exact inference becomes infeasible. The standard remedies are inducing-point approximations (SoR, FITC, DTC), variational sparse GPs (e.g., Titsias 2009), and random feature approximations (Rahimi–Recht). Each trades exactness for scalability, and the right trade-off depends on the data regime and the smoothness of the target function.
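A minimal sketch of the Rahimi–Recht random-feature idea, assuming NumPy (names and parameter values are illustrative): draw $D$ random frequencies from the spectral density of the squared-exponential kernel, map inputs to $D$-dimensional cosine features, and run ordinary ridge regression in $O(nD^2)$ instead of $O(n^3)$.

```python
import numpy as np

def rff_features(x, D=200, ell=0.8, seed=0):
    """Random Fourier features approximating the SE kernel with length-scale ell."""
    rng = np.random.default_rng(seed)              # fixed seed: same features at train/test
    omega = rng.standard_normal(D) / ell           # spectral samples of the SE kernel
    b = rng.uniform(0, 2 * np.pi, D)               # random phases
    return np.sqrt(2.0 / D) * np.cos(np.outer(x, omega) + b)

def rff_ridge(x, y, x_star, D=200, lam=0.01):
    """Approximate GP/KRR posterior mean via ridge regression on D features."""
    Z = rff_features(x, D)                         # n x D design matrix
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)   # O(n D^2 + D^3)
    return rff_features(x_star, D) @ w

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 10_000)                     # n large enough that exact GP is costly
y = np.sin(2 * x) + 0.1 * rng.standard_normal(x.size)
print(rff_ridge(x, y, np.array([0.0, 1.0])))       # approx [sin(0), sin(2)]
```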

Parametric vs. Non-Parametric Estimation

| Aspect | Parametric | Non-Parametric |
| --- | --- | --- |
| Model complexity | Fixed $d$ parameters | Grows with $n$ |
| Convergence rate (MSE) | $O(1/n)$ | $O(n^{-4/(4+d)})$ |
| Curse of dimensionality | No | Yes (rate worsens with $d$) |
| Robustness to misspecification | Fails catastrophically | Graceful |
| Computational cost | Typically $O(n)$ or $O(n \log n)$ | $O(n^2)$–$O(n^3)$ (GP, kernel methods) |
| Interpretability | High (parameters have meaning) | Low (function-level) |
| Best regime | Known model, small data | Unknown model, moderate data |

Definition:

Support Vector Machine (SVM)

For binary classification with labels $y_i \in \{-1, +1\}$, the (soft-margin) support vector machine solves
$$\min_{f \in \mathcal{H}_K,\, b \in \mathbb{R}} \; \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(f(x_i) + b)) + \lambda \|f\|_{\mathcal{H}_K}^2,$$
where the first term is the hinge loss and the second is the RKHS regularizer. By the representer theorem, the solution is $f^\star(x) = \sum_i \alpha_i K(x_i, x)$, and the dual is a quadratic program in $\boldsymbol{\alpha}$. The observations with nonzero $\alpha_i$ are the support vectors: they lie on or inside the margin and alone determine the decision boundary.

The SVM is the workhorse kernel method of the 1990s. It connects the RKHS framework to concrete convex optimization and, through the kernel trick, extends linear classification to nonlinear decision boundaries without ever computing feature vectors explicitly.
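A brief usage sketch with scikit-learn's `SVC` (the library choice is an assumption; any convex QP solver for the dual would do):

```python
import numpy as np
from sklearn.svm import SVC

# toy nonlinear problem: labels given by a ring, not linearly separable
rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(300, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.2, 1, -1)

# RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2); C controls the soft margin
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

print(clf.n_support_)   # support vectors per class: the only points that matter
print(clf.score(X, y))  # training accuracy of the nonlinear boundary
```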

🔧Engineering Note

Choosing a Kernel

The kernel encodes your prior assumption about function smoothness. Common choices on $\mathbb{R}^d$:

  • Squared exponential $K(x,x')=\sigma_f^2 e^{-\|x-x'\|^2/(2\ell^2)}$: samples are $C^\infty$ (strongly smooth, possibly too smooth).
  • Matérn family: samples are $\lceil \nu \rceil - 1$ times differentiable; $\nu = 3/2$ and $\nu = 5/2$ are standard choices for moderate smoothness.
  • Polynomial $K(x,x')=(x^T x' + c)^d$: finite-dimensional feature maps, useful for low-order interactions.
  • Spectral mixture (Wilson–Adams 2013): explicitly models quasi-periodic structure.

The length-scale $\ell$ is typically learned by maximizing the GP marginal likelihood (type-II MLE) or by cross-validation. (The first two kernels are sketched in code at the end of this note.)
Practical Constraints
  • Squared exponential over-smooths rough or step-like data

  • Matérn-$3/2$ is a safer default in most applied settings

  • Automatic relevance determination uses one length-scale per input dimension
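Minimal NumPy sketches of the two most common choices, written as functions of the distance $r = \|x - x'\|$ (names are illustrative):

```python
import numpy as np

def se_kernel(r, ell=1.0, sf2=1.0):
    """Squared-exponential kernel: sf2 * exp(-r^2 / (2 ell^2))."""
    return sf2 * np.exp(-r ** 2 / (2 * ell ** 2))

def matern32_kernel(r, ell=1.0, sf2=1.0):
    """Matern nu = 3/2: sf2 * (1 + sqrt(3) r / ell) * exp(-sqrt(3) r / ell)."""
    a = np.sqrt(3.0) * np.abs(r) / ell
    return sf2 * (1.0 + a) * np.exp(-a)

r = np.linspace(0, 3, 4)
print(se_kernel(r))        # decays smoothly: C-infinity samples
print(matern32_kernel(r))  # kinked at r = 0: once-differentiable samples
```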

Why This Matters: GPs for Channel Interpolation

In mmWave and massive-MIMO systems, channel state information is acquired on a sparse grid of pilot symbols and must be interpolated to payload symbols. Gaussian process regression with a kernel that encodes the channel's delay and Doppler correlations (Matérn in frequency, exponentially decaying in time) gives a principled way to interpolate and, crucially, to report confidence bounds on the interpolated channel. These confidence bounds feed directly into robust precoder design: beamforming vectors can be chosen to hedge against the GP's posterior uncertainty rather than to blindly trust a point estimate.

See full treatment in Estimation in ISAC Systems

Quick Check

Let $K(x,x') = x x'$ on $\mathbb{R}$. Which statement is TRUE about the RKHS generated by $K$?

It is infinite-dimensional

It equals the linear functions $f(x) = \alpha x$, with $\|f\|_{\mathcal{H}}^2 = \alpha^2$

It contains all polynomials

It contains all $C^\infty$ functions

Quick Check

You double the sample size $n$ in a KDE problem. By how much should you (optimally) scale the bandwidth $h$?

Double it

Halve it

Multiply by $2^{-1/5} \approx 0.87$

Leave it unchanged

Key Takeaway

Non-parametric and kernel methods build estimators whose complexity grows with the data. They pay a slower convergence rate ($n^{-4/5}$ instead of $n^{-1}$ in one dimension, worsening with dimension) and a computational price ($O(n^3)$ for exact GP), but they gain the freedom to track functions the modeler could never have guessed. The RKHS framework unifies kernel density estimation, kernel ridge regression, Gaussian processes, and SVMs under a single geometric lens: every solution lives in the span of kernel functions centered at the data, and the regularizer is an RKHS norm.