Kernel Methods and Non-Parametric Regression
When We Don't Know the Model
Every estimator we have built so far starts from a parametric model: the noise is Gaussian, the channel is Rayleigh, the parameter lives in $\mathbb{R}^d$ and we know $d$. Non-parametric methods refuse to commit to any fixed parameterization; they let the complexity grow with the data. Kernel density estimation asks: how dense are the data near this point? Nadaraya–Watson asks: what is the average response of nearby inputs? Gaussian processes ask: what function-space prior is consistent with what I have seen so far?
This flexibility is not free. Non-parametric estimators converge more slowly than parametric ones ($O(n^{-4/5})$ vs. $O(n^{-1})$ in mean squared error), and they pay a curse of dimensionality. But when the true model is genuinely unknown or strongly non-Gaussian, they often beat a mis-specified parametric estimator by a wide margin. The engineering question is never "parametric or non-parametric?" but rather "at what point does parametric bias exceed non-parametric variance?"
Definition: Kernel Density Estimator
Kernel Density Estimator
Given i.i.d. samples $X_1, \dots, X_n$ from an unknown density $f$ on $\mathbb{R}$, the kernel density estimator (KDE) with kernel $K$ and bandwidth $h > 0$ is
$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).$$
The kernel is a non-negative function satisfying $\int K(u)\,du = 1$, $\int u\,K(u)\,du = 0$, and $\mu_2(K) = \int u^2 K(u)\,du < \infty$.
Two special cases: the rectangular window $K(u) = \tfrac{1}{2}\mathbf{1}\{|u| \le 1\}$ gives the histogram (after binning); the Gaussian kernel $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ gives a smooth estimator whose roughness $R(K) = \int K^2(u)\,du$ is $\frac{1}{2\sqrt{\pi}}$.
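To make the definition concrete, here is a minimal numpy sketch of a Gaussian-kernel KDE; the function name, the toy sample, and the constants are our own illustration, not part of the text:

```python
import numpy as np

def gaussian_kde(x_grid, samples, h):
    """KDE: f_hat(x) = (1/(n h)) * sum_i K((x - X_i) / h) with a Gaussian K."""
    u = (x_grid[:, None] - samples[None, :]) / h       # (n_grid, n) standardized distances
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)       # Gaussian kernel values
    return K.mean(axis=1) / h                          # average over samples, scale by 1/h

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=500)               # toy data from a standard normal
x_grid = np.linspace(-4, 4, 201)
f_hat = gaussian_kde(x_grid, samples, h=0.3)           # try h = 0.05 and h = 1.5 to see the tradeoff
```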
Kernel Density Estimator
A smooth non-parametric estimator of an unknown density. The bandwidth $h$ trades bias (small $h$: low bias, high variance) against smoothness (large $h$: high bias, low variance). Introduced by Rosenblatt (1956) and Parzen (1962).
Related: Nadaraya–Watson Estimator, Bandwidth
Theorem: Mean Integrated Squared Error of the KDE
Assume $f$ is twice continuously differentiable with $\int (f''(x))^2\,dx < \infty$, and $K$ satisfies the standard regularity conditions above. Then the pointwise bias and variance of $\hat f_h$ are
$$\mathrm{Bias}[\hat f_h(x)] = \tfrac{1}{2} h^2 \mu_2(K) f''(x) + o(h^2), \qquad \mathrm{Var}[\hat f_h(x)] = \frac{R(K)\, f(x)}{nh} + o\!\left(\tfrac{1}{nh}\right),$$
and the asymptotic MISE
$$\mathrm{AMISE}(h) = \frac{h^4 \mu_2(K)^2}{4} \int (f'')^2 + \frac{R(K)}{nh}$$
is minimized by
$$h^\star = \left( \frac{R(K)}{\mu_2(K)^2 \int (f'')^2} \right)^{1/5} n^{-1/5},$$
at which $\mathrm{AMISE}(h^\star) = \tfrac{5}{4}\left(\mu_2(K)^2 \int (f'')^2\right)^{1/5} R(K)^{4/5}\, n^{-4/5}$.
Bias grows with $h$ (we over-smooth); variance shrinks with $h$ (wider windows average more samples). The optimal bandwidth $h^\star \propto n^{-1/5}$ balances these, and the resulting rate $n^{-4/5}$ is the non-parametric price: slower than the parametric rate $n^{-1}$.
Expand $f$ around $x$ to second order and apply the kernel moment conditions.
The variance calculation is $\mathrm{Var}[\hat f_h(x)] = \tfrac{1}{n} \mathrm{Var}[K_h(x - X_1)]$, where $K_h(u) = h^{-1} K(u/h)$.
Differentiate the AMISE with respect to $h$ and set the derivative to zero.
Bias expansion
By the change of variables $u = (x - y)/h$, $\mathbb{E}[\hat f_h(x)] = \int K(u)\, f(x - hu)\,du$. Taylor-expanding $f(x - hu) = f(x) - hu\,f'(x) + \tfrac{1}{2} h^2 u^2 f''(x) + o(h^2)$ and using $\int K = 1$, $\int uK = 0$, $\int u^2 K = \mu_2(K)$: $\mathbb{E}[\hat f_h(x)] - f(x) = \tfrac{1}{2} h^2 \mu_2(K) f''(x) + o(h^2)$.
Variance expansion
Let $Y_i = K_h(x - X_i)$. Then $\hat f_h(x) = \tfrac{1}{n}\sum_i Y_i$ and $\mathrm{Var}[\hat f_h(x)] = \tfrac{1}{n}\mathrm{Var}[Y_1]$. Computing $\mathbb{E}[Y_1^2] = \tfrac{1}{h}\int K^2(u)\, f(x - hu)\,du = \tfrac{1}{h} R(K) f(x) + o(h^{-1})$ while $(\mathbb{E}[Y_1])^2 = f(x)^2 + o(1)$, the leading term of $\mathrm{Var}[Y_1]$ is $\tfrac{1}{h} R(K) f(x)$, so $\mathrm{Var}[\hat f_h(x)] = \frac{R(K) f(x)}{nh} + o\!\left(\tfrac{1}{nh}\right)$.
Optimize over $h$
Integrate squared bias plus variance to get $\mathrm{AMISE}(h) = \frac{h^4 \mu_2(K)^2}{4}\int (f'')^2 + \frac{R(K)}{nh}$. Differentiating and solving $\frac{d}{dh}\mathrm{AMISE}(h) = 0$ yields $h^\star = \left(\frac{R(K)}{\mu_2(K)^2 \int (f'')^2}\right)^{1/5} n^{-1/5}$, and plugging back gives $\mathrm{AMISE}(h^\star) = O(n^{-4/5})$.
Bandwidth Selection in Kernel Density Estimation
Estimate the density of a bimodal Gaussian mixture from a finite sample with a Gaussian kernel. Sweep the bandwidth $h$ from under- to over-smoothed and watch the bias–variance tradeoff in action. The optimal bandwidth (plug-in or Silverman's rule) is marked for reference.
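The demo can be reproduced offline. The sketch below is our own construction: it sweeps $h$ on a bimodal mixture and computes Silverman's rule-of-thumb bandwidth $h = 0.9 \min(\hat\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
mix = rng.random(n) < 0.5                               # bimodal Gaussian mixture, as in the demo
samples = np.where(mix, rng.normal(-2.0, 0.6, n), rng.normal(1.5, 0.9, n))

def gaussian_kde(x_grid, s, h):
    u = (x_grid[:, None] - s[None, :]) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

# Silverman's rule of thumb: h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5)
iqr = np.subtract(*np.percentile(samples, [75, 25]))
h_silverman = 0.9 * min(samples.std(ddof=1), iqr / 1.34) * n ** (-1 / 5)

x_grid = np.linspace(-5, 5, 201)
for h in (0.05, h_silverman, 2.0):                      # under-smoothed, rule-of-thumb, over-smoothed
    f_hat = gaussian_kde(x_grid, samples, h)
    print(f"h = {h:.3f}: peak density {f_hat.max():.3f}")
```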
Historical Note: Rosenblatt and Parzen
1956–1962. Murray Rosenblatt introduced the kernel density estimator in a 1956 note in the Annals of Mathematical Statistics, where he gave the basic construction and a proof of consistency. Emanuel Parzen extended it in 1962 with a general asymptotic theory and popularized the tool under the name "Parzen window." In parallel, Élizbar Nadaraya (1964) and Geoffrey Watson (1964) independently proposed the kernel regression estimator that now bears both their names. These four papers established non-parametric estimation as a proper statistical discipline rather than a collection of heuristics.
Definition: Nadaraya–Watson Estimator
Nadaraya–Watson Estimator
Given i.i.d. pairs $(X_1, Y_1), \dots, (X_n, Y_n)$ from the joint distribution of $(X, Y)$, the Nadaraya–Watson estimator of the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ is
$$\hat m_h(x) = \frac{\sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{j=1}^n K\!\left(\frac{x - X_j}{h}\right)} = \sum_{i=1}^n w_i(x)\, Y_i,$$
where the weights $w_i(x) = K\!\left(\frac{x - X_i}{h}\right) \big/ \sum_j K\!\left(\frac{x - X_j}{h}\right)$ are non-negative and sum to one. The estimator is a locally weighted average of the responses.
Nadaraya–Watson Estimator
A non-parametric estimator of the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ obtained as a kernel-weighted average of observed responses. Equivalent to locally constant regression; a first-order generalization is local linear regression, which reduces boundary bias.
Related: Kernel Density Estimator, Bandwidth
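A direct numpy transcription of the locally weighted average (the names and the toy data are our own):

```python
import numpy as np

def nadaraya_watson(x_grid, X, Y, h):
    """m_hat(x) = sum_i w_i(x) Y_i with w_i(x) proportional to K((x - X_i)/h)."""
    u = (x_grid[:, None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2)                    # unnormalized Gaussian kernel suffices:
    W = K / K.sum(axis=1, keepdims=True)       # the kernel's normalizer cancels in the weights
    return W @ Y

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, 200)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=200)   # noisy toy regression data
m_hat = nadaraya_watson(np.linspace(0, 1, 101), X, Y, h=0.05)
```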
Example: Boundary Bias of Nadaraya–Watson
Suppose the $X_i$ are uniform on $[0, 1]$ and we estimate $m(x) = \mathbb{E}[Y \mid X = x]$ at $x = 0$ with a Gaussian kernel of bandwidth $h$. Why does $\hat m_h(0)$ exhibit bias of order $h$ rather than $h^2$?
Effective kernel is asymmetric
At an interior point $x$, the kernel integrates symmetrically over both sides of $x$, so the first-moment condition $\int u K(u)\,du = 0$ kills the $O(h)$ bias term. At $x = 0$, only positive $X_i$ contribute, and the effective (one-sided) kernel has a nonzero first moment.
Compute the leading-order bias
Write $m(X_i) = m(0) + m'(0) X_i + O(X_i^2)$ and change variables $u = X_i / h$. The numerator of $\hat m_h(0)$ is asymptotically $\int_0^\infty K(u)\, m(hu)\,du$, which expands as $m(0) \int_0^\infty K(u)\,du + h\, m'(0) \int_0^\infty u K(u)\,du + O(h^2)$. After dividing by the normalizer $\int_0^\infty K(u)\,du$, the $O(h)$ term survives because $\int_0^\infty u K(u)\,du > 0$, giving $\mathrm{Bias}[\hat m_h(0)] = h\, m'(0)\, \frac{\int_0^\infty u K(u)\,du}{\int_0^\infty K(u)\,du} + O(h^2)$.
Remedy
Local linear regression removes this boundary bias by fitting a line instead of a constant in each local window. In practice, this is why modern non-parametric regression libraries default to local-linear rather than Nadaraya–Watson.
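To see the remedy at work, here is a minimal local linear fit (our own sketch, assuming the uniform-design setup of the example): on a linear truth $m(x) = 1 + 2x$, Nadaraya–Watson is biased at $x = 0$, while the local linear estimate is not, to first order.

```python
import numpy as np

def local_linear(x_grid, X, Y, h):
    """At each x0, fit y ~ a + b (X_i - x0) by kernel-weighted least squares; return a."""
    out = np.empty_like(x_grid)
    for j, x0 in enumerate(x_grid):
        w = np.exp(-0.5 * ((X - x0) / h) ** 2)          # Gaussian kernel weights
        A = np.column_stack([np.ones_like(X), X - x0])  # design matrix for the local line
        beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * Y))
        out[j] = beta[0]                                # intercept = fitted value at x0
    return out

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, 300)
Y = 1.0 + 2.0 * X + 0.1 * rng.normal(size=300)          # linear truth: m(0) = 1
print(local_linear(np.array([0.0]), X, Y, h=0.1))       # close to 1.0; NW at 0 overshoots by O(h)
```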
Definition: Reproducing Kernel Hilbert Space
Reproducing Kernel Hilbert Space
A reproducing kernel Hilbert space (RKHS) on a set $\mathcal{X}$ is a Hilbert space $\mathcal{H}$ of functions $f : \mathcal{X} \to \mathbb{R}$ for which every evaluation functional $f \mapsto f(x)$ is bounded. By the Riesz representation theorem there exists a reproducing kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying:
- $k(\cdot, x) \in \mathcal{H}$ for every $x \in \mathcal{X}$,
- $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$ for every $f \in \mathcal{H}$ and every $x \in \mathcal{X}$ (reproducing property). The kernel is symmetric and positive definite: for any $x_1, \dots, x_n \in \mathcal{X}$ and $c \in \mathbb{R}^n$, $\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0$.
Conversely, every symmetric positive-definite kernel generates a unique RKHS (Moore–Aronszajn theorem). The RKHS is the "function space" in which penalized regression, Gaussian process regression, and support vector machines all live.
Reproducing Kernel Hilbert Space (RKHS)
A Hilbert space of functions in which point evaluation is a bounded linear functional; equivalently, one generated by a positive-definite kernel through the reproducing property $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. Provides the geometric foundation for kernel methods, Gaussian processes, and SVMs.
Related: Gaussian Process, Kernel Ridge Regression
Theorem: Representer Theorem
Let $\mathcal{H}$ be an RKHS with kernel $k$. For any strictly increasing $\Omega : [0, \infty) \to \mathbb{R}$ and any loss $L : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ of the fitted values, the minimizer of the regularized empirical risk
$$\hat f = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; L\big(f(x_1), \dots, f(x_n)\big) + \Omega\big(\lVert f \rVert_{\mathcal{H}}\big)$$
admits the finite-dimensional representation
$$\hat f(\cdot) = \sum_{i=1}^n \alpha_i\, k(\cdot, x_i)$$
for some $\alpha \in \mathbb{R}^n$.
Even though $\mathcal{H}$ may be infinite-dimensional (for a Gaussian kernel it is), the optimizer is always a finite linear combination of the kernel functions centered at the training points. The representer theorem is what makes kernel methods computationally tractable: we optimize over $\alpha \in \mathbb{R}^n$, not over an infinite-dimensional function space.
Decompose $f = f_\parallel + f_\perp$, where $f_\parallel \in \mathrm{span}\{k(\cdot, x_i)\}$ and $f_\perp$ is orthogonal to that span.
Show $f_\perp$ does not affect the data-fit term: $f(x_i) = f_\parallel(x_i)$.
Show $f_\perp \neq 0$ strictly increases the regularization term.
Orthogonal decomposition
Let $V = \mathrm{span}\{k(\cdot, x_1), \dots, k(\cdot, x_n)\}$, a finite-dimensional subspace of $\mathcal{H}$. Write $f = f_\parallel + f_\perp$ with $f_\parallel \in V$ and $f_\perp \in V^\perp$.
Data-fit term depends only on $f_\parallel$
By the reproducing property, $f(x_i) = \langle f, k(\cdot, x_i) \rangle = \langle f_\parallel, k(\cdot, x_i) \rangle = f_\parallel(x_i)$ because $k(\cdot, x_i) \in V$ and $\langle f_\perp, k(\cdot, x_i) \rangle = 0$. So the loss depends only on $f_\parallel$.
Regularizer strictly penalizes $f_\perp$
By Pythagoras, $\lVert f \rVert_{\mathcal{H}}^2 = \lVert f_\parallel \rVert_{\mathcal{H}}^2 + \lVert f_\perp \rVert_{\mathcal{H}}^2$, and $\Omega$ is strictly increasing. Therefore any $f_\perp \neq 0$ strictly increases the regularizer without changing the data-fit term. At the optimum, $f_\perp = 0$ and $\hat f \in V$. Writing out a basis, $\hat f = \sum_{i=1}^n \alpha_i\, k(\cdot, x_i)$.
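The theorem translates directly into code: under squared loss with $\Omega(t) = \lambda t^2$, the coefficients solve the $n \times n$ linear system $(K + \lambda I)\alpha = \mathbf{y}$ (kernel ridge regression). A minimal sketch, assuming a squared-exponential kernel and toy data of our choosing:

```python
import numpy as np

def rbf_kernel(A, B, ell=0.3):
    """Gram matrix k(a, b) = exp(-(a - b)^2 / (2 ell^2)) for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

rng = np.random.default_rng(4)
X = rng.uniform(-3.0, 3.0, 50)
y = np.sinc(X) + 0.1 * rng.normal(size=50)

lam = 1e-2
alpha = np.linalg.solve(rbf_kernel(X, X) + lam * np.eye(len(X)), y)   # (K + lam I) alpha = y

x_test = np.linspace(-3, 3, 101)
f_hat = rbf_kernel(x_test, X) @ alpha   # f_hat = sum_i alpha_i k(., x_i): n numbers, not a function space
```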
Definition: Gaussian Process Regression
Gaussian Process Regression
A Gaussian process (GP) is a distribution over functions $f$ such that every finite collection $(f(x_1), \dots, f(x_m))$ is jointly Gaussian. It is specified by a mean function $m(x) = \mathbb{E}[f(x)]$ and a covariance function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$; we write $f \sim \mathcal{GP}(m, k)$.
Given training data $\{(x_i, y_i)\}_{i=1}^n$ with $y_i = f(x_i) + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, and taking $m \equiv 0$ for notational simplicity, the posterior mean and posterior variance at a test point $x_\star$ are
$$\mu_\star = \mathbf{k}_\star^\top (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad \sigma_\star^2 = k(x_\star, x_\star) - \mathbf{k}_\star^\top (K + \sigma^2 I)^{-1} \mathbf{k}_\star,$$
where $[K]_{ij} = k(x_i, x_j)$ and $[\mathbf{k}_\star]_i = k(x_\star, x_i)$.
Gaussian Process
A stochastic process such that any finite collection of its values is jointly Gaussian. Specified entirely by its mean and covariance functions. Used as a flexible Bayesian prior over functions; its posterior delivers both a mean prediction and a calibrated predictive variance.
Related: Reproducing Kernel Hilbert Space, Kernel Ridge Regression
GP Posterior Mean = Kernel Ridge Regression
The Gaussian-process posterior mean is algebraically identical to the kernel ridge regression estimator obtained from the representer theorem with the squared loss and regularizer $\lambda \lVert f \rVert_{\mathcal{H}}^2$, under the identification $\lambda = \sigma^2$. The two perspectives, Bayesian function prior and regularized risk minimization, produce the same point estimator. The Gaussian process view adds an error bar: the posterior variance $\sigma_\star^2$ quantifies epistemic uncertainty about $f(x_\star)$, which kernel ridge regression does not deliver directly.
Gaussian Process Regression with Uncertainty Bands
Fit a GP with a squared-exponential kernel to a small noisy sample from an unknown smooth function. Adjust the length-scale $\ell$, output variance $\sigma_f^2$, and noise $\sigma^2$, and watch the posterior mean and 95% credible bands change. Notice how the bands widen in regions with no data and shrink near training points.
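A numpy sketch of the posterior formulas above, using a Cholesky factorization for numerical stability; the names and toy data are ours, and `ell`, `sf2`, `noise` play the roles of $\ell$, $\sigma_f^2$, $\sigma^2$ from the demo:

```python
import numpy as np

def sqexp(A, B, ell, sf2):
    return sf2 * np.exp(-0.5 * ((A[:, None] - B[None, :]) / ell) ** 2)

def gp_posterior(X, y, x_star, ell=0.5, sf2=1.0, noise=0.05):
    """Zero-mean GP posterior mean and variance at the test points x_star."""
    K = sqexp(X, X, ell, sf2) + noise * np.eye(len(X))    # K + sigma^2 I
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma^2 I)^{-1} y
    Ks = sqexp(x_star, X, ell, sf2)                       # cross-covariances k_star
    v = np.linalg.solve(L, Ks.T)
    return Ks @ alpha, sf2 - np.sum(v**2, axis=0)         # mu_star, sigma_star^2

rng = np.random.default_rng(5)
X = rng.uniform(-2.0, 2.0, 12)
y = np.sin(3 * X) + 0.1 * rng.normal(size=12)
xs = np.linspace(-4, 4, 161)
mu, var = gp_posterior(X, y, xs)
band = 1.96 * np.sqrt(np.maximum(var, 0.0))   # 95% credible band: widens away from the data
```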
Example: Extrapolation Limits of a GP
Using a squared-exponential kernel $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right)$ with length-scale $\ell$, show that far from the training data the posterior mean reverts to the prior mean and the posterior variance reverts to $\sigma_f^2$.
Prior covariance decays
For $|x_\star - x_i| \gg \ell$, each entry $[\mathbf{k}_\star]_i = \sigma_f^2 e^{-(x_\star - x_i)^2 / (2\ell^2)} \to 0$ as the distance grows, so $\mathbf{k}_\star \to \mathbf{0}$.
Posterior mean reverts
$\mu_\star = \mathbf{k}_\star^\top (K + \sigma^2 I)^{-1} \mathbf{y} \to 0$ (the prior mean).
Posterior variance reverts
$\sigma_\star^2 = k(x_\star, x_\star) - \mathbf{k}_\star^\top (K + \sigma^2 I)^{-1} \mathbf{k}_\star \to k(x_\star, x_\star) = \sigma_f^2$ (the prior variance). So far from the data, the GP knows it knows nothing: a built-in honesty that contrasts with, say, polynomial extrapolation. The length-scale $\ell$ determines the range over which training data "informs" the prediction.
Common Mistake: GPs Do Not Scale to Large $n$
Mistake:
Naïvely applying exact Gaussian process regression to a dataset with $10^5$ or more training points.
Correction:
The GP posterior requires inverting (or solving a linear system with) the Gram matrix $K + \sigma^2 I$, which costs $O(n^3)$ time and $O(n^2)$ memory. For $n$ beyond a few thousand, exact inference becomes infeasible. The standard remedies are inducing-point approximations (SoR, FITC, DTC), variational sparse GPs (e.g., Titsias 2009), and random feature approximations (Rahimi–Recht). Each trades exactness for scalability, and the right trade-off depends on the data regime and the smoothness of the target function.
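As one concrete remedy, here is a minimal random Fourier features sketch for the squared-exponential kernel in one dimension (in the spirit of Rahimi–Recht): sampling $\omega \sim \mathcal{N}(0, 1/\ell^2)$ and $b \sim \mathrm{Unif}[0, 2\pi)$ gives features with $\mathbb{E}[\phi(x)^\top \phi(x')] = k(x, x')$. All names and sizes are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n, D, ell = 50_000, 200, 0.5
X = rng.uniform(-3.0, 3.0, n)
y = np.sin(2 * X) + 0.1 * rng.normal(size=n)    # n far too large for an n x n Gram matrix

omega = rng.normal(0.0, 1.0 / ell, D)           # spectral samples for the SE kernel
b = rng.uniform(0.0, 2 * np.pi, D)
Phi = np.sqrt(2.0 / D) * np.cos(X[:, None] * omega[None, :] + b)   # (n, D) feature matrix

# Ridge regression in feature space: O(n D^2) instead of O(n^3)
w = np.linalg.solve(Phi.T @ Phi + 1e-2 * np.eye(D), Phi.T @ y)
x_test = np.linspace(-3, 3, 5)
Phi_test = np.sqrt(2.0 / D) * np.cos(x_test[:, None] * omega[None, :] + b)
print(Phi_test @ w)                              # approximates the exact kernel ridge fit
```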
Parametric vs. Non-Parametric Estimation
| Aspect | Parametric | Non-Parametric |
|---|---|---|
| Model complexity | Fixed ($d$ parameters) | Grows with $n$ |
| Convergence rate (MSE) | $O(n^{-1})$ | $O(n^{-4/5})$ in 1D |
| Curse of dimensionality | No | Yes (rate worsens with $d$) |
| Robustness to misspecification | Fails catastrophically | Graceful |
| Computational cost | Typically $O(n)$ or $O(n \log n)$ | $O(n^2)$–$O(n^3)$ (GP, kernel methods) |
| Interpretability | High (parameters have meaning) | Low (function-level) |
| Best regime | Known model, small data | Unknown model, moderate data |
Definition: Support Vector Machine (SVM)
Support Vector Machine (SVM)
For binary classification with labels $y_i \in \{-1, +1\}$, the (soft-margin) support vector machine solves
$$\min_{f \in \mathcal{H}} \; C \sum_{i=1}^n \max\{0,\, 1 - y_i f(x_i)\} + \tfrac{1}{2} \lVert f \rVert_{\mathcal{H}}^2,$$
where the first term is the hinge loss and the second is the RKHS regularizer. By the representer theorem, the solution is $f(x) = \sum_{i=1}^n \alpha_i y_i k(x, x_i)$, and the dual is a quadratic program in $\alpha \in [0, C]^n$. The observations with nonzero $\alpha_i$ are the support vectors; they lie on or inside the margin and alone determine the decision boundary.
The SVM is the workhorse kernel method of the 1990s. It connects the RKHS framework to concrete convex optimization and, through the kernel trick, extends linear classification to nonlinear decision boundaries without ever computing feature vectors explicitly.
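A short illustration with scikit-learn (our choice of library, not the text's), showing that only a subset of the observations end up as support vectors with nonzero dual coefficients:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)   # circular boundary: not linearly separable

clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y) # kernel trick: no explicit feature map
print(f"{len(clf.support_)} of {len(X)} points are support vectors")
print(clf.dual_coef_.shape)                          # the nonzero alphas, one per support vector
```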
Choosing a Kernel
The kernel encodes your prior assumption about function smoothness. Common choices on $\mathbb{R}^d$:
- Squared exponential $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)$: samples are $C^\infty$; strongly smooth, possibly too smooth.
- Matérn family: samples are $\lceil \nu \rceil - 1$ times differentiable; $\nu = 3/2$ and $\nu = 5/2$ are standard choices for moderate smoothness.
- Polynomial $k(x, x') = (x^\top x' + c)^p$: finite-dimensional feature maps, useful for low-order interactions.
- Spectral mixture (Wilson–Adams 2013): explicitly models quasi-periodic structure.
The length-scale $\ell$ is typically learned by maximizing the GP marginal likelihood (type-II MLE) or by cross-validation; a minimal sketch follows the list below.
- Squared exponential over-smooths rough or step-like data.
- A Matérn kernel ($\nu = 3/2$ or $5/2$) is a safer default in most applied settings.
- Automatic relevance determination (ARD) uses one length-scale per input dimension.
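The promised sketch of type-II MLE, by grid search over $\ell$ (our own construction; production libraries optimize this objective with gradients instead):

```python
import numpy as np

def gp_log_marginal(X, y, ell, sf2=1.0, noise=0.05):
    """log p(y | X, ell) = -y^T K^{-1} y / 2 - log|K| / 2 - n log(2 pi) / 2."""
    n = len(X)
    K = sf2 * np.exp(-0.5 * ((X[:, None] - X[None, :]) / ell) ** 2) + noise * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(8)
X = rng.uniform(-2.0, 2.0, 40)
y = np.sin(3 * X) + 0.1 * rng.normal(size=40)

grid = np.logspace(-2, 1, 30)                        # candidate length-scales
lml = [gp_log_marginal(X, y, ell) for ell in grid]
print(f"type-II MLE length-scale: {grid[np.argmax(lml)]:.3f}")
```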
Why This Matters: GPs for Channel Interpolation
In mmWave and massive-MIMO systems, channel state information is acquired on a sparse grid of pilot symbols and must be interpolated to payload symbols. A Gaussian process regression, with a kernel that encodes the channel's delay and Doppler correlations (Matérn in frequency, exponentially decaying in time), gives a principled way to interpolate and — crucially — to report confidence bounds on the interpolated channel. These confidence bounds feed directly into robust precoder design: beamforming vectors can be chosen to hedge against the GP's posterior uncertainty rather than to blindly trust a point estimate.
See full treatment in Estimation in ISAC Systems
Quick Check
Let $k(x, x') = x x'$ on $\mathbb{R}$. Which statement is TRUE about the RKHS $\mathcal{H}_k$ generated by $k$?
It is infinite-dimensional
It equals the linear functions $f(x) = \beta x$ with $\beta \in \mathbb{R}$
It contains all polynomials
It contains all functions
A polynomial kernel of degree 1 corresponds to the feature map $\phi(x) = x$, and the RKHS is spanned by $\{k(\cdot, x') : x' \in \mathbb{R}\}$, i.e., linear functions through the origin. The norm is $\lVert f \rVert_{\mathcal{H}} = |\beta|$ for $f(x) = \beta x$. Higher-degree polynomial kernels give higher-dimensional (but still finite-dimensional) RKHSs.
Quick Check
You double the sample size $n$ in a KDE problem. By how much should you (optimally) scale the bandwidth $h$?
Double it
Halve it
Multiply by $2^{-1/5} \approx 0.87$
Leave it unchanged
The MISE-optimal bandwidth scales as $h^\star \propto n^{-1/5}$, so doubling $n$ multiplies the optimal $h$ by $2^{-1/5} \approx 0.87$, a modest decrease. This slow scaling is why bandwidth selection is hard: the optimum is not highly peaked and is sensitive to the unknown curvature $\int (f'')^2$.
Key Takeaway
Non-parametric and kernel methods build estimators whose complexity grows with the data. They pay a slower convergence rate ($n^{-4/5}$ instead of $n^{-1}$ in one dimension, worsening with dimension) and a computational price ($O(n^3)$ for exact GP inference), but they gain the freedom to track functions the modeler could never have guessed. The RKHS framework unifies kernel density estimation, kernel ridge regression, Gaussian processes, and SVMs under a single geometric lens: every solution lives in the span of kernel functions centered at the data, and the regularizer is an RKHS norm.