Deep Learning for Estimation

From Likelihoods to Learned Maps

Classical estimation says: write down the likelihood, maximize it. When that likelihood is unavailable (because the forward model is a neural rendering engine, a finite-element solver, or a proprietary hardware chain), we can no longer compute the MLE in closed form. Simulation-based and data-driven estimation asks a different question: given pairs $(\mathbf{x}_i, \mathbf{y}_i)$ of true parameters and their noisy observations, what function $g: \mathbf{y} \mapsto \hat{\mathbf{x}}$ makes the smallest error?

Answering that question with a neural network turns the estimator into a learned map. The payoffs are dramatic where the likelihood is intractable. The costs are real: we need training data, we inherit whatever bias the data generator has, we lose interpretability, and we give up the error bars that a Bayesian analysis would have produced. Deep unfolding is the engineering compromise: keep the skeleton of a model-based iterative algorithm, but let a few parameters adapt to the data.

Definition:

Neural-Network Estimator

Given a family of neural networks $\{g_{\boldsymbol{\theta}}: \boldsymbol{\theta} \in \Theta\}$ and training samples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$ drawn from the joint distribution of parameter and observation, the end-to-end learned estimator is $g_{\hat{\boldsymbol{\theta}}}$ with
$$\hat{\boldsymbol{\theta}} \in \arg\min_{\boldsymbol{\theta} \in \Theta} \; \mathcal{L}(\boldsymbol{\theta}), \qquad \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \ell(\mathbf{x}_i, g_{\boldsymbol{\theta}}(\mathbf{y}_i)),$$
where $\ell$ is a per-sample loss (typically squared error for continuous parameters, cross-entropy for discrete labels). Training proceeds by stochastic gradient descent on $\mathcal{L}(\boldsymbol{\theta})$, with gradients computed by backpropagation.
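As a concrete instance of this recipe, here is a minimal PyTorch sketch (not from the source; the linear-Gaussian data generator, network width, and optimizer settings are illustrative assumptions): an MLP $g_{\boldsymbol{\theta}}$ is trained with squared error on simulated pairs, so it approximates the posterior mean discussed next.

```python
# Minimal end-to-end learned estimator (illustrative sketch, assumed toy model).
# Data: x ~ N(0, I_d), y = A x + noise; g_theta is a small MLP trained with
# squared error, so it approximates the posterior mean E[X | Y = y].
import torch
import torch.nn as nn

d, m, N = 8, 16, 10_000                    # parameter dim, observation dim, sample count
A = torch.randn(m, d) / m ** 0.5           # forward map known only to the data generator

x_train = torch.randn(N, d)
y_train = x_train @ A.T + 0.1 * torch.randn(N, m)

g = nn.Sequential(nn.Linear(m, 64), nn.ReLU(),
                  nn.Linear(64, 64), nn.ReLU(),
                  nn.Linear(64, d))
opt = torch.optim.Adam(g.parameters(), lr=1e-3)

for epoch in range(100):                   # SGD on the empirical loss L(theta)
    for xb, yb in zip(x_train.split(256), y_train.split(256)):
        loss = ((g(yb) - xb) ** 2).mean()  # per-sample squared error
        opt.zero_grad()
        loss.backward()
        opt.step()
```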

If the training loss is squared error, then as $N \to \infty$ the minimizer over a sufficiently expressive family approaches the MMSE estimator $g^\star(\mathbf{y}) = \mathbb{E}[\mathbf{X} \mid \mathbf{Y} = \mathbf{y}]$. The neural network is an approximation to the posterior mean: no more, no less.

End-to-End Learning

An estimator (or pipeline of estimators) whose free parameters are optimized jointly by minimizing a training loss, rather than being designed stage-by-stage against a model. The estimator becomes a differentiable program trained via backpropagation.

Related: Deep Unfolding, Model Based Estimation

Theorem: Squared-Error Training Recovers the Posterior Mean

Let $(\mathbf{X}, \mathbf{Y})$ be a random pair with $\mathbb{E}[\|\mathbf{X}\|^2] < \infty$. For any measurable $g: \mathbb{R}^m \to \mathbb{R}^d$,
$$\mathbb{E}\!\left[\|\mathbf{X} - g(\mathbf{Y})\|^2\right] \;\geq\; \mathbb{E}\!\left[\|\mathbf{X} - \mathbb{E}[\mathbf{X}\mid\mathbf{Y}]\|^2\right],$$
with equality iff $g(\mathbf{y}) = \mathbb{E}[\mathbf{X}\mid\mathbf{Y}=\mathbf{y}]$ almost surely. Consequently, if a neural family is dense in $L^2$ and training data is unlimited, the trained network converges to the MMSE estimator.
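A quick numerical check of the theorem, in the one scalar case where the posterior mean is available in closed form (a NumPy sketch; the Gaussian setup and all constants are assumptions for illustration):

```python
# Monte Carlo check (illustrative): scalar Gaussian pair where E[X|Y] is known.
# X ~ N(0, sx^2), Y = X + N(0, sn^2)  =>  E[X|Y=y] = sx^2 / (sx^2 + sn^2) * y.
import numpy as np

rng = np.random.default_rng(0)
sx, sn, n = 1.0, 0.5, 1_000_000
x = sx * rng.standard_normal(n)
y = x + sn * rng.standard_normal(n)

w = sx**2 / (sx**2 + sn**2)                     # posterior-mean gain
mse_mmse = np.mean((x - w * y) ** 2)            # ~ sx^2 sn^2 / (sx^2 + sn^2) = 0.2
mse_mle  = np.mean((x - y) ** 2)                # competitor g(y) = y: ~ sn^2 = 0.25
print(mse_mmse, mse_mle)                        # the posterior mean wins
```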

Squared-error training is posterior-mean chasing. This is why ML people get the MMSE estimator by accident: they chose squared loss for convenience and it turned out to be the Bayes-optimal criterion. The neural network is only as good as (i) the training distribution's match to the deployment distribution and (ii) the family's ability to represent $\mathbb{E}[\mathbf{X} \mid \mathbf{Y}]$.

Historical Note: Gregor and LeCun, 2010

2010

Deep unfolding was introduced by Karol Gregor and Yann LeCun at ICML 2010 in a paper titled "Learning Fast Approximations of Sparse Coding." They took the iterative soft-thresholding algorithm (ISTA) for $\ell_1$-regularized regression, unrolled $T$ of its iterations into a feed-forward network with $T$ layers, and learned the per-layer step sizes and matrices by backpropagation. The result, LISTA, produced sparse codes roughly 10× faster than ISTA at comparable reconstruction quality. The idea generalized: every model-based iterative algorithm has an unfolded counterpart, and learning the per-iteration parameters often gains both speed and accuracy.

Definition:

Deep Unfolding

Let $\mathbf{x}^{(t+1)} = F(\mathbf{x}^{(t)}; \boldsymbol{\phi})$ be one iteration of a model-based algorithm (e.g., ISTA, proximal gradient, AMP) parameterized by hyper-parameters $\boldsymbol{\phi}$ (step size, threshold, linear map). Deep unfolding constructs the estimator
$$g_{\boldsymbol{\theta}}(\mathbf{y}) = F(\,\cdots\, F(F(\mathbf{x}^{(0)}; \boldsymbol{\phi}_1); \boldsymbol{\phi}_2) \cdots ; \boldsymbol{\phi}_T),$$
a $T$-layer feed-forward network in which each layer reuses the iterative update rule but has its own learnable parameters $\boldsymbol{\theta} = (\boldsymbol{\phi}_1, \dots, \boldsymbol{\phi}_T)$. Training optimizes $\boldsymbol{\theta}$ end-to-end by backpropagation against a data-driven loss.

At one extreme ($\boldsymbol{\phi}_t$ all tied to the model values), there is no learning and we recover the original iterative algorithm. At the other extreme (dense unconstrained per-layer linear maps), we have a pure black-box neural network. The interior of this spectrum, where most of the model is kept and a few parameters are learned, is where deep unfolding earns its keep.
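To make the construction concrete, here is a minimal sketch (illustrative, not from the source) that unfolds the simplest possible iteration, a gradient step on $\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^2$, with one learnable step size per layer:

```python
# Unfolding the iteration x_{t+1} = F(x_t; phi_t) (illustrative sketch).
# F here is one gradient step on ||A x - y||^2; phi_t is a per-layer step size.
import torch

def unfolded_gd(y, A, etas):
    """T-layer unfolded gradient descent; etas holds T learnable step sizes."""
    x = torch.zeros(A.shape[1])
    for eta in etas:                          # one "layer" per model iteration
        x = x - eta * A.T @ (A @ x - y)       # F(x; eta)
    return x

A, y, T = torch.randn(20, 10), torch.randn(20), 8
etas = [torch.nn.Parameter(torch.tensor(0.05)) for _ in range(T)]  # untied: learnable
x_hat = unfolded_gd(y, A, etas)               # differentiable in every eta_t
# Tying every eta_t to a single model-derived value (e.g., 1/L with L the largest
# eigenvalue of A^T A) recovers plain gradient descent: the "no learning" extreme.
```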

Deep Unfolding

A hybrid model-based / data-driven architecture in which $T$ iterations of a known iterative algorithm are unrolled into a $T$-layer neural network whose per-layer parameters are learned from training data. Introduced by Gregor and LeCun (LISTA, 2010) and extended to AMP, ADMM, proximal splitting, and message passing.

Related: End-to-End Learning, Model Based Estimation

Example: LISTA: Learned ISTA

ISTA for the LASSO uses the update $\mathbf{x}^{(t+1)} = \mathcal{S}_\tau(\mathbf{x}^{(t)} - \eta \mathbf{A}^T(\mathbf{A}\mathbf{x}^{(t)} - \mathbf{y}))$ with soft-threshold $\mathcal{S}_\tau$ and step size $\eta$. Write the corresponding LISTA architecture and identify its learnable parameters.

LISTA: Learned ISTA for Sparse Recovery

Complexity: Inference: $T$ matrix–vector products, $O(T(MN + N^2))$; training: one backpropagation pass through all $T$ layers per sample, $O(N_{\text{train}}\, T\, (MN + N^2))$ per epoch
Input: Measurements $\mathbf{y} \in \mathbb{R}^M$, parameters $\{\mathbf{W}_e^{(t)}, \mathbf{W}_s^{(t)}, \tau_t\}_{t=0}^{T-1}$
Output: Sparse estimate $\hat{\mathbf{x}} \in \mathbb{R}^N$
1. Initialize $\mathbf{x}^{(0)} \leftarrow \mathbf{0}$
2. for $t = 0, 1, \ldots, T-1$ do
3. $\quad \mathbf{u}^{(t)} \leftarrow \mathbf{W}_e^{(t)} \mathbf{y} + \mathbf{W}_s^{(t)} \mathbf{x}^{(t)}$
4. $\quad \mathbf{x}^{(t+1)} \leftarrow \mathcal{S}_{\tau_t}(\mathbf{u}^{(t)})$
5. end for
6. return $\hat{\mathbf{x}} = \mathbf{x}^{(T)}$
Training: minimize $\tfrac{1}{N_{\text{train}}}\sum_i \|\mathbf{x}_i^\star - \hat{\mathbf{x}}(\mathbf{y}_i)\|^2$ over the $N_{\text{train}}$ training pairs by SGD + backprop through all $T$ layers.
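A PyTorch sketch of the algorithm box above (the initialization scale, layer count, and toy sparse data generator are illustrative assumptions, not from the source):

```python
# PyTorch sketch of the LISTA box above (initialization scale, sizes, and the
# toy sparse data generator are illustrative assumptions).
import torch
import torch.nn as nn

def soft_threshold(u, tau):                      # S_tau(u) = sign(u) max(|u|-tau, 0)
    return torch.sign(u) * torch.relu(u.abs() - tau)

class LISTA(nn.Module):
    def __init__(self, M, N, T):
        super().__init__()
        self.We = nn.ParameterList([nn.Parameter(0.1 * torch.randn(N, M)) for _ in range(T)])
        self.Ws = nn.ParameterList([nn.Parameter(0.1 * torch.randn(N, N)) for _ in range(T)])
        self.tau = nn.Parameter(0.1 * torch.ones(T))

    def forward(self, y):                        # y: (batch, M) -> x_hat: (batch, N)
        x = y.new_zeros(y.shape[0], self.Ws[0].shape[0])
        for We, Ws, tau in zip(self.We, self.Ws, self.tau):
            x = soft_threshold(y @ We.T + x @ Ws.T, tau)   # one unfolded layer
        return x

# Training: SGD + backprop through all T layers on toy sparse data.
M, N, T = 20, 50, 10
A = torch.randn(M, N) / M ** 0.5
x_true = torch.randn(512, N) * (torch.rand(512, N) < 0.1).float()  # ~10% support
y = x_true @ A.T + 0.01 * torch.randn(512, M)

net = LISTA(M, N, T)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(500):
    loss = ((net(y) - x_true) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```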

LISTA typically converges 10–100× faster than ISTA at comparable accuracy, because each learned layer is tuned for the signal distribution actually encountered at deployment, unlike ISTA, which is tuned for worst-case Lipschitz constants.

🎓 CommIT Contribution (2021)

Unfolded OAMP for RF Imaging

H. Sarieddeen, G. Caire β€” IEEE Trans. Signal Processing

The CommIT group has developed deep-unfolded orthogonal approximate message passing (OAMP) for sparse RF-imaging reconstruction. The unfolded architecture inherits OAMP's provable convergence under Kronecker-structured sensing matrices, yet learns its regularization parameters from calibration data. In imaging experiments this hybrid outperforms both classical OAMP and unconstrained CNN baselines, demonstrating concretely that model structure + data-driven fine-tuning > either alone.


Unfolded Network vs. Model-Based Iterations

Compare the recovery error of classical ISTA (with fixed step size $\eta$) and a LISTA-style unfolded network (with per-layer learned step sizes) as a function of iteration count $T$ on a synthetic sparse recovery task. The unfolded architecture reaches the same error with far fewer layers.


Deep Unfolding Architecture Diagram

Construction of an unfolded network by stacking $T$ copies of a model-based iteration, each with its own learnable parameters.
One ISTA iteration $\mathbf{x}^{(t+1)} = \mathcal{S}_{\tau}(\mathbf{W}_e\mathbf{y} + \mathbf{W}_s\mathbf{x}^{(t)})$ is repeated $T$ times. In LISTA, the matrices $\mathbf{W}_e^{(t)}, \mathbf{W}_s^{(t)}$ and threshold $\tau_t$ become layer-specific and are learned by backpropagation.

Model-Based vs. Data-Driven vs. Hybrid Estimation

| Criterion | Model-Based (MLE / MMSE) | End-to-End Deep Net | Deep Unfolding |
| --- | --- | --- | --- |
| Data requirement | None (closed-form model) | Large labeled dataset | Moderate dataset |
| Inference cost | Many iterations | One forward pass | $T$ layers (small $T$) |
| Robust to model drift | No (biased) | Only if retrained | Partially (model backbone) |
| Interpretability | High | Low | Medium |
| Uncertainty quantification | Yes (Bayesian posterior) | Rare / ad hoc | Possible via unfolded Bayesian methods |
| Performance ceiling | Bounded by CRLB | Bounded by data size | Typically beats both at moderate $N$ |

Common Mistake: Training Distribution vs. Deployment Distribution

Mistake:

Training a learned estimator on one signal distribution (say, i.i.d. Rayleigh channels) and deploying it in a setting where the channel statistics differ (line-of-sight, spatially correlated, or time-varying).

Correction:

The learned estimator converges to the MMSE estimator for the training distribution, not for the deployment distribution. If the distributions differ, the estimator is biased, and unlike model-based methods it does not degrade gracefully. Mitigations: train on a broad mixture covering expected deployment conditions; use domain adaptation or fine-tuning at deployment time; prefer deep-unfolded architectures (which retain the model's inductive bias and tend to generalize better than black-box nets); always evaluate on realistic out-of-distribution benchmarks.
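The penalty is easy to quantify in a linear-Gaussian toy model, where the MMSE estimator under each covariance is available in closed form (a NumPy sketch; the covariances are illustrative stand-ins for "i.i.d." versus "spatially correlated" channels):

```python
# Mismatch penalty in closed form (illustrative): y = x + noise, linear MMSE
# W(C) = C (C + sn2 I)^{-1}. Fit W on the "training" covariance, evaluate on
# the "deployment" covariance; the stand-in covariances below are assumptions.
import numpy as np

n, sn2 = 8, 0.1
C_train = np.eye(n)                                           # i.i.d. training channels
C_dep = np.array([[0.9 ** abs(i - j) for j in range(n)]       # correlated deployment
                  for i in range(n)])

def lmmse(C):
    return C @ np.linalg.inv(C + sn2 * np.eye(n))

def mse(W, C):        # E||x - W(x + v)||^2 under x ~ N(0, C), v ~ N(0, sn2 I)
    E = np.eye(n) - W
    return np.trace(E @ C @ E.T) + sn2 * np.trace(W @ W.T)

print(mse(lmmse(C_train), C_dep))   # trained on the wrong distribution: larger
print(mse(lmmse(C_dep), C_dep))     # matched MMSE: smaller
```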

Common Mistake: The CRLB Still Holds

Mistake:

Believing that a learned estimator can beat the Cramér–Rao bound if only the network is deep enough.

Correction:

The CRLB is an information-theoretic lower bound on the variance of unbiased estimators; it holds regardless of how the estimator is constructed. A neural network can approach the CRLB by tracking the MLE, or appear to beat it by being biased (shrinkage toward a prior mode typically gives MSE below the CRLB at low SNR). The right metric for a biased estimator is the Bayesian MSE or the Bayesian CRLB (Van Trees inequality, §The Van Trees Inequality).
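A NumPy sketch of the shrinkage effect (all constants illustrative): the sample mean attains the CRLB, while the deliberately biased shrunk mean lands below it for parameter values near the shrinkage target:

```python
# Biased shrinkage "beating" the CRLB (illustrative constants). For n samples
# y_i ~ N(theta, s^2), the CRLB for unbiased estimators is s^2 / n = 0.1 here;
# the shrunk mean a*ybar is biased and its MSE dips below that near theta = 0.
import numpy as np

rng = np.random.default_rng(2)
theta, s, n, a, trials = 0.2, 1.0, 10, 0.8, 200_000
y = theta + s * rng.standard_normal((trials, n))
ybar = y.mean(axis=1)

print(s**2 / n)                           # CRLB: 0.1
print(np.mean((ybar - theta) ** 2))       # sample mean is efficient: ~0.1
print(np.mean((a * ybar - theta) ** 2))   # ~ a^2 s^2/n + (1-a)^2 theta^2 = 0.0656
```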

⚠️Engineering Note

Deploying Learned Estimators in Real-Time Systems

Moving from a prototype neural estimator to deployment in a modem or radar receiver surfaces constraints that model-based methods never had to face:

  • Latency budget: a 5G NR slot is 500 μs at numerology $\mu=1$ (30 kHz SCS); an estimator must return inside that window.
  • Memory budget: a UE may allow at most 10–50 MB of weights for a DSP-class model.
  • Fixed-point arithmetic: networks trained in float32 must be quantized to int8 or int16; quantization-aware training is required.
  • Retraining and validation: every firmware release that touches the estimator must revalidate against a certification suite; network updates are not free.
  • Graceful degradation: in the presence of unfamiliar interference the estimator should detect OOD inputs and fall back to a conservative model-based path.
Practical Constraints
  • Latency < 500 μs at numerology $\mu=1$ (30 kHz SCS)
  • Quantization to int8/int16 typically loses 0.5–2 dB in MSE
  • Retraining validation costs typically dominate the development timeline
  • Fallback to a model-based estimator is required for safety-critical paths
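The quantization bullet is easy to reproduce: a NumPy sketch of symmetric per-tensor int8 post-training quantization of a weight matrix (the layer shapes and data are illustrative assumptions), measuring the signal-to-quantization-noise ratio at the layer output:

```python
# Symmetric per-tensor int8 post-training quantization (illustrative sketch).
# Quantize a weight matrix to the int8 grid and measure the resulting
# signal-to-quantization-noise ratio at a linear layer's output.
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((64, 128)).astype(np.float32)    # trained float32 weights
x = rng.standard_normal((1000, 128)).astype(np.float32)  # layer inputs

scale = np.abs(W).max() / 127.0                          # one scale for the tensor
W_q = np.clip(np.round(W / scale), -127, 127) * scale    # int8 grid, dequantized

out, out_q = x @ W.T, x @ W_q.T
sqnr_db = 10 * np.log10(np.mean(out**2) / np.mean((out - out_q) ** 2))
print(f"output SQNR: {sqnr_db:.1f} dB")                  # quantization noise floor
```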

Why This Matters: Neural Channel Estimators in 5G and 6G

Proposals for 5G/6G receivers replace the classical least-squares / MMSE pilot channel estimator with a neural network trained on ray-traced or measured channel realizations. In multi-panel massive MIMO, a CNN or unfolded ADMM network exploits the 2D spatial correlation of the channel, reducing pilot overhead by 20–50% at the same estimation error. This is one of the few confirmed cases where learned estimators clearly outperform model-based ones: the true spatial channel manifold is low-dimensional and non-Gaussian, which the Gaussian-MMSE estimator cannot exploit but a learned one can.

See full treatment in Estimation in ISAC Systems

Quick Check

Why does LISTA typically need only $T = 10$–$20$ layers to match ISTA's accuracy at $1000$ iterations?

Backpropagation is inherently faster than gradient descent

LISTA's per-layer parameters are tuned for the data distribution, so each layer does more work than a generic ISTA step

LISTA uses ReLU activations, which are faster than soft-thresholding

LISTA does not actually solve the LASSO, so the comparison is unfair

Quick Check

A neural network trained with squared-error loss on infinite training data from a distribution $p(\mathbf{x}, \mathbf{y})$ converges to which estimator?

The MLE of $\mathbf{x}$ given $\mathbf{y}$

The MAP estimator

The posterior mean $\mathbb{E}[\mathbf{X}\mid\mathbf{Y}=\mathbf{y}]$

An unbiased estimator achieving the CRLB

Key Takeaway

Data-driven estimation trades model assumptions for training data. End-to-end neural networks are best when the likelihood is intractable and training data is plentiful and representative. Deep unfolding is the principled middle ground: keep the skeleton of a model-based algorithm, unroll $T$ iterations into layers, and learn only the few parameters that matter. Under squared loss, all of these estimators chase the same target, the posterior mean, differing only in how they trade model structure against learned flexibility.