Deep Learning for Estimation

From Likelihoods to Learned Maps

Classical estimation says: write down the likelihood, maximize it. When that likelihood is unavailable (because the forward model is a neural rendering engine, a finite-element solver, or a proprietary hardware chain), we can no longer compute the MLE in closed form. Simulation-based and data-driven estimation asks a different question: given pairs $(\mathbf{x}_i, \mathbf{y}_i)$ of true parameters and their noisy observations, what function $g: \mathbf{y} \mapsto \hat{\mathbf{x}}$ makes the smallest error?

Answering that question with a neural network turns the estimator into a learned map. The payoffs are dramatic where the likelihood is intractable. The costs are real: we need training data, we inherit whatever bias the data generator has, we lose interpretability, and we give up the error bars that a Bayesian analysis would have produced. Deep unfolding is the engineering compromise: keep the skeleton of a model-based iterative algorithm, but let a few parameters adapt to the data.

Definition:

Neural-Network Estimator

Given a family of neural networks $\{g_{\boldsymbol{\theta}}: \boldsymbol{\theta} \in \Theta\}$ and training samples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$ drawn from the joint distribution of parameter and observation, the end-to-end learned estimator is $g_{\hat{\boldsymbol{\theta}}}$ with
$$\hat{\boldsymbol{\theta}} \in \arg\min_{\boldsymbol{\theta} \in \Theta} \; \mathcal{L}(\boldsymbol{\theta}), \qquad \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \ell(\mathbf{x}_i, g_{\boldsymbol{\theta}}(\mathbf{y}_i)),$$
where $\ell$ is a per-sample loss (typically squared error for continuous parameters, cross-entropy for discrete labels). Training proceeds by stochastic gradient descent on $\mathcal{L}(\boldsymbol{\theta})$, with gradients computed by backpropagation.
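As a concrete instance of this recipe, here is a minimal PyTorch sketch (not from the source; the linear-Gaussian data generator, network width, and optimizer settings are illustrative assumptions): an MLP $g_{\boldsymbol{\theta}}$ is trained with squared error on simulated pairs, so it approximates the posterior mean discussed next.

```python
# Minimal end-to-end learned estimator (illustrative sketch, assumed toy model).
# Data: x ~ N(0, I_d), y = A x + noise; g_theta is a small MLP trained with
# squared error, so it approximates the posterior mean E[X | Y = y].
import torch
import torch.nn as nn

d, m, N = 8, 16, 10_000                    # parameter dim, observation dim, sample count
A = torch.randn(m, d) / m ** 0.5           # forward map known only to the data generator

x_train = torch.randn(N, d)
y_train = x_train @ A.T + 0.1 * torch.randn(N, m)

g = nn.Sequential(nn.Linear(m, 64), nn.ReLU(),
                  nn.Linear(64, 64), nn.ReLU(),
                  nn.Linear(64, d))
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

for epoch in range(100):                   # SGD on the empirical loss L(theta)
    for xb, yb in zip(x_train.split(256), y_train.split(256)):
        loss = ((g(yb) - xb) ** 2).mean()  # per-sample squared error
        opt.zero_grad()
        loss.backward()
        opt.step()
```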

If the training loss is squared error, then as $N \to \infty$ the minimizer over a sufficiently expressive family approaches the MMSE estimator $g^\star(\mathbf{y}) = \mathbb{E}[\mathbf{X} \mid \mathbf{Y} = \mathbf{y}]$. The neural network is an approximation to the posterior mean: no more, no less.

End-to-End Learning

An estimator (or pipeline of estimators) whose free parameters are optimized jointly by minimizing a training loss, rather than being designed stage-by-stage against a model. The estimator becomes a differentiable program trained via backpropagation.

Related: Deep Unfolding, Model Based Estimation

Theorem: Squared-Error Training Recovers the Posterior Mean

Let $(\mathbf{X}, \mathbf{Y})$ be a random pair with $\mathbb{E}[\|\mathbf{X}\|^2] < \infty$. For any measurable $g: \mathbb{R}^m \to \mathbb{R}^d$,
$$\mathbb{E}\!\left[\|\mathbf{X} - g(\mathbf{Y})\|^2\right] \;\geq\; \mathbb{E}\!\left[\|\mathbf{X} - \mathbb{E}[\mathbf{X}\mid\mathbf{Y}]\|^2\right],$$
with equality iff $g(\mathbf{y}) = \mathbb{E}[\mathbf{X}\mid\mathbf{Y}=\mathbf{y}]$ almost surely. Consequently, if a neural family is dense in $L^2$ and training data is unlimited, the trained network converges to the MMSE estimator.
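A quick numerical check of the theorem, in the one scalar case where the posterior mean is available in closed form (a NumPy sketch; the Gaussian setup and all constants are assumptions for illustration):

```python
# Monte Carlo check (illustrative): scalar Gaussian pair where E[X|Y] is known.
# X ~ N(0, sx^2), Y = X + N(0, sn^2)  =>  E[X|Y=y] = sx^2 / (sx^2 + sn^2) * y.
import numpy as np

rng = np.random.default_rng(0)
sx, sn, n = 1.0, 0.5, 1_000_000
x = sx * rng.standard_normal(n)
y = x + sn * rng.standard_normal(n)

w = sx**2 / (sx**2 + sn**2)                     # posterior-mean gain
mse_mmse = np.mean((x - w * y) ** 2)            # ~ sx^2 sn^2 / (sx^2 + sn^2) = 0.2
mse_mle  = np.mean((x - y) ** 2)                # competitor g(y) = y: ~ sn^2 = 0.25
print(mse_mmse, mse_mle)                        # the posterior mean wins
```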

Squared-error training is posterior-mean chasing. This is why ML people get the MMSE estimator by accident: they chose squared loss for convenience and it turned out to be the Bayes-optimal criterion. The neural network is only as good as (i) the training distribution's match to the deployment distribution and (ii) the family's ability to represent $\mathbb{E}[\mathbf{X} \mid \mathbf{Y}]$.

Historical Note: Gregor and LeCun, 2010

2010

Deep unfolding was introduced by Karol Gregor and Yann LeCun at ICML 2010 in a paper titled "Learning Fast Approximations of Sparse Coding." They took the iterative soft-thresholding algorithm (ISTA) for $\ell_1$-regularized regression, unrolled $T$ of its iterations into a feed-forward network with $T$ layers, and learned the per-layer step sizes and matrices by backpropagation. The result, LISTA, produced sparse codes roughly 10× faster than ISTA at comparable reconstruction quality. The idea generalized: every model-based iterative algorithm has an unfolded counterpart, and learning the per-iteration parameters often gains both speed and accuracy.

Definition:

Deep Unfolding

Let $\mathbf{x}^{(t+1)} = F(\mathbf{x}^{(t)}; \boldsymbol{\phi})$ be one iteration of a model-based algorithm (e.g., ISTA, proximal gradient, AMP) parameterized by hyper-parameters $\boldsymbol{\phi}$ (step size, threshold, linear map). Deep unfolding constructs the estimator
$$g_{\boldsymbol{\theta}}(\mathbf{y}) = F(\,\cdots\, F(F(\mathbf{x}^{(0)}; \boldsymbol{\phi}_1); \boldsymbol{\phi}_2) \cdots ; \boldsymbol{\phi}_T),$$
a $T$-layer feed-forward network in which each layer reuses the iterative update rule but has its own learnable parameters $\boldsymbol{\theta} = (\boldsymbol{\phi}_1, \dots, \boldsymbol{\phi}_T)$. Training optimizes $\boldsymbol{\theta}$ end-to-end by backpropagation against a data-driven loss.

At one extreme ($\boldsymbol{\phi}_t$ all tied to the model values), there is no learning and we recover the original iterative algorithm. At the other extreme (dense unconstrained per-layer linear maps), we have a pure black-box neural network. The interior of this spectrum, where most of the model is kept and a few parameters are learned, is where deep unfolding earns its keep.
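To make the construction concrete, here is a minimal sketch (illustrative, not from the source) that unfolds the simplest possible iteration, a gradient step on $\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^2$, with one learnable step size per layer:

```python
# Unfolding the iteration x_{t+1} = F(x_t; phi_t) (illustrative sketch).
# F here is one gradient step on ||A x - y||^2; phi_t is a per-layer step size.
import torch

def unfolded_gd(y, A, etas):
    """T-layer unfolded gradient descent; etas holds T learnable step sizes."""
    x = torch.zeros(A.shape[1])
    for eta in etas:                          # one "layer" per model iteration
        x = x - eta * A.T @ (A @ x - y)       # F(x; eta)
    return x

A, y, T = torch.randn(20, 10), torch.randn(20), 8
etas = [torch.nn.Parameter(torch.tensor(0.05)) for _ in range(T)]  # untied: learnable
x_hat = unfolded_gd(y, A, etas)               # differentiable in every eta_t
# Tying every eta_t to a single model-derived value (e.g., 1/L with L the largest
# eigenvalue of A^T A) recovers plain gradient descent: the "no learning" extreme.
```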

Deep Unfolding

A hybrid model-based / data-driven architecture in which $T$ iterations of a known iterative algorithm are unrolled into a $T$-layer neural network whose per-layer parameters are learned from training data. Introduced by Gregor and LeCun (LISTA, 2010) and extended to AMP, ADMM, proximal splitting, and message passing.

Related: End-to-End Learning, Model Based Estimation

Example: LISTA: Learned ISTA

ISTA for the LASSO uses the update $\mathbf{x}^{(t+1)} = \mathcal{S}_\tau(\mathbf{x}^{(t)} - \eta \mathbf{A}^T(\mathbf{A}\mathbf{x}^{(t)} - \mathbf{y}))$ with soft-threshold $\mathcal{S}_\tau$ and step size $\eta$. Write the corresponding LISTA architecture and identify its learnable parameters.

LISTA: Learned ISTA for Sparse Recovery

Complexity: Inference: $T$ matrix–vector products, $O(T(MN + N^2))$; training: one backpropagation pass through all $T$ layers per sample, $O(N_{\text{train}}\, T\, (MN + N^2))$ per epoch
Input: Measurements $\mathbf{y} \in \mathbb{R}^M$, parameters $\{\mathbf{W}_e^{(t)}, \mathbf{W}_s^{(t)}, \tau_t\}_{t=0}^{T-1}$
Output: Sparse estimate $\hat{\mathbf{x}} \in \mathbb{R}^N$
1. Initialize $\mathbf{x}^{(0)} \leftarrow \mathbf{0}$
2. for $t = 0, 1, \ldots, T-1$ do
3. $\quad \mathbf{u}^{(t)} \leftarrow \mathbf{W}_e^{(t)} \mathbf{y} + \mathbf{W}_s^{(t)} \mathbf{x}^{(t)}$
4. $\quad \mathbf{x}^{(t+1)} \leftarrow \mathcal{S}_{\tau_t}(\mathbf{u}^{(t)})$
5. end for
6. return $\hat{\mathbf{x}} = \mathbf{x}^{(T)}$
Training: minimize $\tfrac{1}{N_{\text{train}}}\sum_i \|\mathbf{x}_i^\star - \hat{\mathbf{x}}(\mathbf{y}_i)\|^2$ over the $N_{\text{train}}$ training pairs by SGD + backprop through all $T$ layers.
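A PyTorch sketch of the algorithm box above (the initialization scale, layer count, and toy sparse data generator are illustrative assumptions, not from the source):

```python
# PyTorch sketch of the LISTA box above (initialization scale, sizes, and the
# toy sparse data generator are illustrative assumptions).
import torch
import torch.nn as nn

def soft_threshold(u, tau):                      # S_tau(u) = sign(u) max(|u|-tau, 0)
    return torch.sign(u) * torch.relu(u.abs() - tau)

class LISTA(nn.Module):
    def __init__(self, M, N, T):
        super().__init__()
        self.We = nn.ParameterList([nn.Parameter(0.1 * torch.randn(N, M)) for _ in range(T)])
        self.Ws = nn.ParameterList([nn.Parameter(0.1 * torch.randn(N, N)) for _ in range(T)])
        self.tau = nn.Parameter(0.1 * torch.ones(T))

    def forward(self, y):                        # y: (batch, M) -> x_hat: (batch, N)
        x = y.new_zeros(y.shape[0], self.Ws[0].shape[0])
        for We, Ws, tau in zip(self.We, self.Ws, self.tau):
            x = soft_threshold(y @ We.T + x @ Ws.T, tau)   # one unfolded layer
        return x

# Training: SGD + backprop through all T layers on toy sparse data.
M, N, T = 20, 50, 10
A = torch.randn(M, N) / M ** 0.5
x_true = torch.randn(512, N) * (torch.rand(512, N) < 0.1).float()  # ~10% support
y = x_true @ A.T + 0.01 * torch.randn(512, M)

net = LISTA(M, N, T)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(500):
    loss = ((net(y) - x_true) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```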

LISTA typically converges 10–100× faster than ISTA at comparable accuracy, because each learned layer is tuned for the signal distribution actually encountered at deployment, unlike ISTA, which is tuned for worst-case Lipschitz constants.

🎓 CommIT Contribution (2021)

Unfolded OAMP for RF Imaging

H. Sarieddeen, G. Caire β€” IEEE Trans. Signal Processing

The CommIT group has developed deep-unfolded orthogonal approximate message passing (OAMP) for sparse RF-imaging reconstruction. The unfolded architecture inherits OAMP's provable convergence under Kronecker-structured sensing matrices, yet learns its regularization parameters from calibration data. In imaging experiments this hybrid outperforms both classical OAMP and unconstrained CNN baselines, demonstrating concretely that model structure + data-driven fine-tuning > either alone.


Unfolded Network vs. Model-Based Iterations

Compare the recovery error of classical ISTA (with fixed step size $\eta$) and a LISTA-style unfolded network (with per-layer learned step sizes) as a function of iteration count $T$ on a synthetic sparse recovery task. The unfolded architecture reaches the same error with far fewer layers.


Deep Unfolding Architecture Diagram

Construction of an unfolded network by stacking $T$ copies of a model-based iteration, each with its own learnable parameters.
One ISTA iteration $\mathbf{x}^{(t+1)} = \mathcal{S}_{\tau}(\mathbf{W}_e\mathbf{y} + \mathbf{W}_s\mathbf{x}^{(t)})$ is repeated $T$ times. In LISTA, the matrices $\mathbf{W}_e^{(t)}, \mathbf{W}_s^{(t)}$ and threshold $\tau_t$ become layer-specific and are learned by backpropagation.

Model-Based vs. Data-Driven vs. Hybrid Estimation

| Criterion | Model-Based (MLE / MMSE) | End-to-End Deep Net | Deep Unfolding |
| --- | --- | --- | --- |
| Data requirement | None (closed-form model) | Large labeled dataset | Moderate dataset |
| Inference cost | Many iterations | One forward pass | $T$ layers (small $T$) |
| Robust to model drift | No (biased) | Only if retrained | Partially (model backbone) |
| Interpretability | High | Low | Medium |
| Uncertainty quantification | Yes (Bayesian posterior) | Rare / ad hoc | Possible via unfolded Bayesian methods |
| Performance ceiling | Bounded by CRLB | Bounded by data size | Typically beats both at moderate $N$ |

Common Mistake: Training Distribution vs. Deployment Distribution

Mistake:

Training a learned estimator on one signal distribution (say, i.i.d. Rayleigh channels) and deploying it in a setting where the channel statistics differ (line-of-sight, spatially correlated, or time-varying).

Correction:

The learned estimator converges to the MMSE estimator for the training distribution, not for the deployment distribution. If the distributions differ, the estimator is biased, and unlike model-based methods it does not degrade gracefully. Mitigations: train on a broad mixture covering expected deployment conditions; use domain adaptation or fine-tuning at deployment time; prefer deep-unfolded architectures (which retain the model's inductive bias and tend to generalize better than black-box nets); always evaluate on realistic out-of-distribution benchmarks.
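The penalty is easy to quantify in a linear-Gaussian toy model, where the MMSE estimator under each covariance is available in closed form (a NumPy sketch; the covariances are illustrative stand-ins for "i.i.d." versus "spatially correlated" channels):

```python
# Mismatch penalty in closed form (illustrative): y = x + noise, linear MMSE
# W(C) = C (C + sn2 I)^{-1}. Fit W on the "training" covariance, evaluate on
# the "deployment" covariance; the stand-in covariances below are assumptions.
import numpy as np

n, sn2 = 8, 0.1
C_train = np.eye(n)                                           # i.i.d. training channels
C_dep = np.array([[0.9 ** abs(i - j) for j in range(n)]       # correlated deployment
                  for i in range(n)])

def lmmse(C):
    return C @ np.linalg.inv(C + sn2 * np.eye(n))

def mse(W, C):        # E||x - W(x + v)||^2 under x ~ N(0, C), v ~ N(0, sn2 I)
    E = np.eye(n) - W
    return np.trace(E @ C @ E.T) + sn2 * np.trace(W @ W.T)

print(mse(lmmse(C_train), C_dep))   # trained on the wrong distribution: larger
print(mse(lmmse(C_dep), C_dep))     # matched MMSE: smaller
```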

Common Mistake: The CRLB Still Holds

Mistake:

Believing that a learned estimator can beat the Cramér–Rao bound if only the network is deep enough.

Correction:

The CRLB is an information-theoretic lower bound on the variance of unbiased estimators; it holds regardless of how the estimator is constructed. A neural network can approach the CRLB by tracking the MLE, or appear to beat it by being biased (shrinkage toward a prior mode typically gives MSE below the CRLB at low SNR). The right metric for a biased estimator is the Bayesian MSE or the Bayesian CRLB (Van Trees inequality, §The Van Trees Inequality).
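A NumPy sketch of the shrinkage effect (all constants illustrative): the sample mean attains the CRLB, while the deliberately biased shrunk mean lands below it for parameter values near the shrinkage target:

```python
# Biased shrinkage "beating" the CRLB (illustrative constants). For n samples
# y_i ~ N(theta, s^2), the CRLB for unbiased estimators is s^2 / n = 0.1 here;
# the shrunk mean a*ybar is biased and its MSE dips below that near theta = 0.
import numpy as np

rng = np.random.default_rng(2)
theta, s, n, a, trials = 0.2, 1.0, 10, 0.8, 200_000
y = theta + s * rng.standard_normal((trials, n))
ybar = y.mean(axis=1)

print(s**2 / n)                           # CRLB: 0.1
print(np.mean((ybar - theta) ** 2))       # sample mean is efficient: ~0.1
print(np.mean((a * ybar - theta) ** 2))   # ~ a^2 s^2/n + (1-a)^2 theta^2 = 0.0656
```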

⚠️Engineering Note

Deploying Learned Estimators in Real-Time Systems

Moving from a prototype neural estimator to deployment in a modem or radar receiver surfaces constraints that model-based methods never had to face:

  • Latency budget: a 5G NR slot is 500 μs at numerology $\mu=1$ (30 kHz SCS); an estimator must return inside that window.
  • Memory budget: a UE may allow at most 10–50 MB of weights for a DSP-class model.
  • Fixed-point arithmetic: networks trained in float32 must be quantized to int8 or int16; quantization-aware training is required.
  • Retraining and validation: every firmware release that touches the estimator must revalidate against a certification suite; network updates are not free.
  • Graceful degradation: in the presence of unfamiliar interference the estimator should detect OOD inputs and fall back to a conservative model-based path.
Practical Constraints
  • Latency < 500 μs at numerology $\mu=1$ (30 kHz SCS)
  • Quantization to int8/int16 typically loses 0.5–2 dB in MSE
  • Retraining validation costs typically dominate the development timeline
  • Fallback to a model-based estimator is required for safety-critical paths
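The quantization bullet is easy to reproduce: a NumPy sketch of symmetric per-tensor int8 post-training quantization of a weight matrix (the layer shapes and data are illustrative assumptions), measuring the signal-to-quantization-noise ratio at the layer output:

```python
# Symmetric per-tensor int8 post-training quantization (illustrative sketch).
# Quantize a weight matrix to the int8 grid and measure the resulting
# signal-to-quantization-noise ratio at a linear layer's output.
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((64, 128)).astype(np.float32)    # trained float32 weights
x = rng.standard_normal((1000, 128)).astype(np.float32)  # layer inputs

scale = np.abs(W).max() / 127.0                          # one scale for the tensor
W_q = np.clip(np.round(W / scale), -127, 127) * scale    # int8 grid, dequantized

out, out_q = x @ W.T, x @ W_q.T
sqnr_db = 10 * np.log10(np.mean(out**2) / np.mean((out - out_q) ** 2))
print(f"output SQNR: {sqnr_db:.1f} dB")                  # quantization noise floor
```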

Why This Matters: Neural Channel Estimators in 5G and 6G

Proposals for 5G/6G receivers replace the classical least-squares / MMSE pilot channel estimator with a neural network trained on ray-traced or measured channel realizations. In multi-panel massive MIMO, a CNN or unfolded ADMM network exploits the 2D spatial correlation of the channel, reducing pilot overhead by 20–50% at the same estimation error. This is one of the few confirmed cases where learned estimators clearly outperform model-based ones: the true spatial channel manifold is low-dimensional and non-Gaussian, which the Gaussian-MMSE estimator cannot exploit but a learned one can.

See full treatment in Estimation in ISAC Systems

Quick Check

Why does LISTA typically need only $T = 10$–$20$ layers to match ISTA's accuracy at $1000$ iterations?

Backpropagation is inherently faster than gradient descent

LISTA's per-layer parameters are tuned for the data distribution, so each layer does more work than a generic ISTA step

LISTA uses ReLU activations, which are faster than soft-thresholding

LISTA does not actually solve the LASSO, so the comparison is unfair

Quick Check

A neural network trained with squared-error loss on infinite training data from a distribution $p(\mathbf{x}, \mathbf{y})$ converges to which estimator?

The MLE of $\mathbf{x}$ given $\mathbf{y}$

The MAP estimator

The posterior mean $\mathbb{E}[\mathbf{X}\mid\mathbf{Y}=\mathbf{y}]$

An unbiased estimator achieving the CRLB

Key Takeaway

Data-driven estimation trades model assumptions for training data. End-to-end neural networks are best when the likelihood is intractable and training data is plentiful and representative. Deep unfolding is the principled middle ground: keep the skeleton of a model-based algorithm, unroll $T$ iterations into layers, and learn only the few parameters that matter. Under squared loss, all of these estimators chase the same target, the posterior mean, differing only in how they trade model structure against learned flexibility.