Supervised Learning for the Physical Layer

Why Machine Learning at the Physical Layer?

Classical physical-layer algorithms --- MMSE channel estimation, Viterbi detection, Turbo decoding --- are derived from explicit mathematical models of the channel, noise, and modulation. They are optimal when the model is correct, but real-world channels exhibit hardware impairments (non-linear power amplifiers, I/Q imbalance, phase noise), imperfect CSI, and complex propagation effects that defy closed-form treatment.

Supervised learning offers a complementary paradigm: given labelled training pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$, learn a mapping $f_\theta$ that minimises an empirical loss without requiring a closed-form channel model. This section develops two flagship applications --- neural-network channel estimation and end-to-end autoencoder learning --- and establishes the mathematical foundations that connect supervised learning to classical estimation.

A word of caution: all simulations in this chapter use tiny pure-numpy networks ($< 1000$ parameters, training in under a second) for pedagogical illustration. Production systems use GPU-accelerated frameworks with far larger models, but the principles are identical.

Definition: Supervised Learning Framework for the Physical Layer

Let $\mathbf{y} \in \mathbb{R}^m$ denote the received observation (e.g., pilot observations) and $\mathbf{x} \in \mathbb{R}^n$ the quantity to be estimated (e.g., channel coefficients, transmitted symbols). A supervised learning approach trains a parametric function $f_\theta : \mathbb{R}^m \to \mathbb{R}^n$ by minimising the empirical risk:

$$\hat{\theta} = \arg\min_\theta \; \frac{1}{N}\sum_{i=1}^{N} \ell\bigl(f_\theta(\mathbf{y}_i),\, \mathbf{x}_i\bigr)$$

where $\{(\mathbf{y}_i, \mathbf{x}_i)\}_{i=1}^N$ is the training dataset and $\ell(\cdot, \cdot)$ is the loss function. Common choices:

  • MSE loss (regression): $\ell(\hat{\mathbf{x}}, \mathbf{x}) = \|\hat{\mathbf{x}} - \mathbf{x}\|^2$
  • Cross-entropy loss (classification): $\ell(\hat{\mathbf{p}}, c) = -\log \hat{p}_c$ where $\hat{\mathbf{p}} = \mathrm{softmax}(f_\theta(\mathbf{y}))$

The function $f_\theta$ is typically a feed-forward neural network (fully connected, convolutional, or residual) trained via stochastic gradient descent (SGD) or Adam:

ΞΈβ†ΞΈβˆ’Ξ·β€‰βˆ‡ΞΈβ„“(fΞΈ(y),x)\theta \leftarrow \theta - \eta \, \nabla_\theta \ell\bigl(f_\theta(\mathbf{y}), \mathbf{x}\bigr)

where $\eta > 0$ is the learning rate.
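To ground the definition, here is a minimal numpy sketch of this empirical-risk minimisation for a linear $f_\theta$ under the MSE loss; the toy data, dimensions, and learning rate are illustrative assumptions, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 4, 2, 256                            # observation dim, target dim, dataset size
Y = rng.normal(size=(N, m))                    # toy observations y_i
X = Y[:, :n] + 0.1 * rng.normal(size=(N, n))   # toy targets x_i, correlated with y_i
W = np.zeros((n, m))                           # parameters theta of the linear map f(y) = W y

eta = 0.1                                      # learning rate
for epoch in range(200):
    Xhat = Y @ W.T                             # forward pass: f_theta(y_i) for all i
    grad = 2 * (Xhat - X).T @ Y / N            # gradient of the empirical MSE w.r.t. W
    W -= eta * grad                            # gradient step (full batch for simplicity)

print(np.mean((Y @ W.T - X) ** 2))             # final empirical risk
```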

The training data can be generated in three ways: (1) from a known channel model (simulation-based), (2) from over-the-air pilot transmissions with known ground truth, or (3) from a combination (transfer learning). Approach (1) is most common in research; approach (2) enables adaptation to real hardware.

Definition: Neural-Network Channel Estimation

Consider an OFDM system with $N_c$ subcarriers. Pilots are inserted at $N_p$ known subcarrier positions $\mathcal{P}$. The received pilot observations are:

$$\mathbf{y}_p = \mathrm{diag}(\mathbf{x}_p)\,\mathbf{h}_p + \mathbf{n}_p$$

where $\mathbf{h}_p = [H(k)]_{k \in \mathcal{P}}$ is the channel at the pilot subcarriers, $\mathbf{x}_p$ are the known pilot symbols, and $\mathbf{n}_p \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$.

The LS estimate at the pilots is $\hat{\mathbf{h}}_p^{\mathrm{LS}} = \mathrm{diag}(\mathbf{x}_p)^{-1}\mathbf{y}_p$, and the full channel is recovered by interpolation.
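A minimal numpy sketch of this LS + interpolation baseline, assuming unit-power pilots and the exponential power delay profile used later in this section:

```python
import numpy as np

rng = np.random.default_rng(1)
Nc, Np, L, snr_db = 64, 8, 8, 15
pilots = np.arange(0, Nc, Nc // Np)              # uniformly spaced pilot positions

# One channel realisation: L complex taps with an exponential power delay profile
pdp = np.exp(-np.arange(L) / 3); pdp /= pdp.sum()
taps = np.sqrt(pdp / 2) * (rng.normal(size=L) + 1j * rng.normal(size=L))
h = np.fft.fft(taps, Nc)                         # frequency response on Nc subcarriers

x_p = np.ones(Np, dtype=complex)                 # known pilot symbols (unit power)
sigma = np.sqrt(10 ** (-snr_db / 10) / 2)        # noise std per real dimension
y_p = x_p * h[pilots] + sigma * (rng.normal(size=Np) + 1j * rng.normal(size=Np))

h_ls = y_p / x_p                                 # LS estimate at the pilot positions
# Recover the full channel by linearly interpolating real and imaginary parts
# (np.interp holds the edge values constant beyond the last pilot)
h_hat = np.interp(np.arange(Nc), pilots, h_ls.real) \
      + 1j * np.interp(np.arange(Nc), pilots, h_ls.imag)
print("LS + interpolation MSE:", np.mean(np.abs(h_hat - h) ** 2))
```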

A neural-network estimator replaces the interpolation by a learned mapping:

$$\hat{\mathbf{h}} = f_\theta\!\left( [\operatorname{Re}(\hat{\mathbf{h}}_p^{\mathrm{LS}}),\; \operatorname{Im}(\hat{\mathbf{h}}_p^{\mathrm{LS}})] \right)$$

where $f_\theta$ is a two-layer fully connected network:

$$f_\theta(\mathbf{z}) = \mathbf{W}_2\, \sigma(\mathbf{W}_1\mathbf{z} + \mathbf{b}_1) + \mathbf{b}_2$$

with ReLU activation $\sigma(\cdot) = \max(0, \cdot)$ and parameters $\theta = \{\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2\}$. The network is trained to minimise the MSE:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \|\hat{\mathbf{h}}_i - \mathbf{h}_i\|^2$$

where $\mathbf{h}_i$ is the true channel (available during training from the simulator or calibration).

The NN estimator implicitly learns the channel statistics (power delay profile, correlation across subcarriers) from data. This is analogous to how MMSE estimation uses the channel correlation matrix $\mathbf{R}_{HH}$, but without requiring explicit knowledge of $\mathbf{R}_{HH}$. When the channel model is well matched, MMSE is near-optimal; the NN advantage emerges under model mismatch.
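The following pure-numpy sketch trains exactly this two-layer estimator on simulated channels with manual backpropagation; the dimensions follow the worked example below, while the initialisation scale, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
Nc, Np, L, H, snr_db = 64, 8, 8, 32, 15
pilots = np.arange(0, Nc, Nc // Np)
pdp = np.exp(-np.arange(L) / 3); pdp /= pdp.sum()   # exponential power delay profile
sigma = np.sqrt(10 ** (-snr_db / 10) / 2)           # noise std per real dimension

def sample(batch):
    """Simulate true channels and the corresponding noisy LS pilot estimates."""
    taps = np.sqrt(pdp / 2) * (rng.normal(size=(batch, L))
                               + 1j * rng.normal(size=(batch, L)))
    h = np.fft.fft(taps, Nc, axis=1)                # true frequency response
    w = sigma * (rng.normal(size=(batch, Np)) + 1j * rng.normal(size=(batch, Np)))
    h_ls = h[:, pilots] + w                         # unit pilots: LS estimate = h_p + noise
    z = np.hstack([h_ls.real, h_ls.imag])           # NN input, dimension 2*Np = 16
    t = np.hstack([h.real, h.imag])                 # NN target, dimension 2*Nc = 128
    return z, t

# Two-layer network: 16 -> 32 (ReLU) -> 128
W1 = rng.normal(size=(2 * Np, H)) * 0.3; b1 = np.zeros(H)
W2 = rng.normal(size=(H, 2 * Nc)) * 0.3; b2 = np.zeros(2 * Nc)

z, t = sample(500)                                  # fixed training set: 500 realisations
eta = 0.01
for step in range(5000):
    a = np.maximum(z @ W1 + b1, 0)                  # hidden activations (ReLU)
    out = a @ W2 + b2                               # estimate, real/imag stacked
    g = 2 * (out - t) / len(z)                      # MSE gradient at the output
    gW2, gb2 = a.T @ g, g.sum(0)
    ga = (g @ W2.T) * (a > 0)                       # backprop through the ReLU
    gW1, gb1 = z.T @ ga, ga.sum(0)
    W1 -= eta * gW1; b1 -= eta * gb1; W2 -= eta * gW2; b2 -= eta * gb2

z_te, t_te = sample(2000)                           # fresh channels for evaluation
mse = np.mean((np.maximum(z_te @ W1 + b1, 0) @ W2 + b2 - t_te) ** 2)
print("NN estimator test MSE:", mse)
```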

Theorem: Universal Approximation (Relevance to PHY)

A feed-forward network with a single hidden layer and sufficiently many units can approximate any continuous function on a compact set to arbitrary accuracy (Cybenko, 1989; Hornik, 1991). For the physical layer, this means the two-layer estimator above can in principle approximate the MMSE estimator $\mathbb{E}[\mathbf{h} \mid \mathbf{y}]$, which is generally a non-linear function of the observation. The theorem guarantees expressive power only: it says nothing about whether SGD finds the approximating weights or how much training data is needed.

Example: Comparing NN and LS Channel Estimation

An OFDM system has $N_c = 64$ subcarriers and $N_p = 8$ uniformly spaced pilots. The channel has $L = 8$ taps with an exponential power delay profile $\sigma_l^2 \propto e^{-l/3}$. A two-layer neural network with 32 hidden units is trained on 500 channel realisations at SNR $= 15$ dB.

(a) Write the dimensions of the NN weight matrices.

(b) Count the total number of trainable parameters.

(c) Explain why the NN can outperform LS + interpolation.
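One worked solution, assuming the real/imaginary stacking convention from the definition above: (a) the input is the $2N_p = 16$-dimensional real representation of the LS pilot estimates and the output is the $2N_c = 128$-dimensional full channel, so $\mathbf{W}_1 \in \mathbb{R}^{32 \times 16}$ and $\mathbf{W}_2 \in \mathbb{R}^{128 \times 32}$. (b) Counting weights and biases: $32 \cdot 16 + 32 + 128 \cdot 32 + 128 = 4768$ trainable parameters. (c) LS + interpolation uses no prior knowledge of the channel, whereas the NN implicitly learns the power delay profile and the resulting correlation across subcarriers from the training data, so it can both denoise the pilot estimates and interpolate along the learned correlation structure.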

NN vs LS Channel Estimation MSE

Compare a two-layer neural-network channel estimator with LS + linear interpolation across a range of SNR values. The NN is trained at the specified SNR on 500 channel realisations with an exponential power delay profile. Observe how the NN provides a larger gain at low SNR (where denoising matters most) and with fewer pilots (where interpolation is harder). The tiny numpy NN has only ~5000 parameters --- a production system would use a deeper architecture with skip connections for even better performance.

Parameters: SNR $= 15$ dB, $N_p = 8$ pilots

Definition: End-to-End Autoencoder for Communication

The autoencoder approach treats the entire communication system --- encoder (transmitter), channel, and decoder (receiver) --- as a single neural network that is trained end-to-end to minimise the symbol error probability.

Architecture. For an $M$-ary modulation scheme over $n$ channel uses:

  1. Encoder $f_{\theta_{\text{enc}}} : \{1, \ldots, M\} \to \mathbb{R}^{2n}$: Maps each message $s \in \{1, \ldots, M\}$ (represented as a one-hot vector) to a transmitted signal $\mathbf{x} = f_{\theta_{\text{enc}}}(s)$, subject to a power constraint $\frac{1}{M}\sum_{s=1}^M \|\mathbf{x}_s\|^2 \leq 1$.

  2. Channel $p(\mathbf{y}|\mathbf{x})$: The physical channel (AWGN, Rayleigh fading, etc.) adds noise. For AWGN: $\mathbf{y} = \mathbf{x} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.

  3. Decoder $g_{\theta_{\text{dec}}} : \mathbb{R}^{2n} \to \Delta^{M-1}$: Maps the received signal to a probability distribution over messages via softmax; the estimated message is $\hat{s} = \arg\max_s \, g_{\theta_{\text{dec}}}(\mathbf{y})_s$.

Training. The system is trained by minimising the categorical cross-entropy:

$$\mathcal{L}(\theta_{\text{enc}}, \theta_{\text{dec}}) = -\frac{1}{N}\sum_{i=1}^N \log g_{\theta_{\text{dec}}}(\mathbf{y}_i)_{s_i}$$

Gradients pass through the channel layer: for AWGN, the channel is a simple addition node with a deterministic gradient of $\partial\mathbf{y}/\partial\mathbf{x} = \mathbf{I}$.
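A minimal pure-numpy realisation of this training loop for $M = 4$ with 2D transmitted signals; the decoder width, learning rate, and the projected-gradient handling of the power constraint are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, H, snr_db = 4, 16, 10                    # messages, decoder hidden width, training SNR
sigma = np.sqrt(10 ** (-snr_db / 10) / 2)   # noise std per real dimension (unit signal power)

E = rng.normal(size=(M, 2))                 # encoder: one learnable 2D point per message
W1 = rng.normal(size=(2, H)) * 0.5; b1 = np.zeros(H)
W2 = rng.normal(size=(H, M)) * 0.5; b2 = np.zeros(M)

eta, batch = 0.05, 256
for step in range(2000):
    s = rng.integers(M, size=batch)         # random training messages
    x = E[s]                                # encoder output
    y = x + sigma * rng.normal(size=x.shape)   # AWGN channel

    a = np.maximum(y @ W1 + b1, 0)          # decoder hidden layer (ReLU)
    logits = a @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)       # softmax over the M messages

    g = p.copy(); g[np.arange(batch), s] -= 1; g /= batch   # cross-entropy gradient
    gW2, gb2 = a.T @ g, g.sum(0)
    ga = (g @ W2.T) * (a > 0)
    gW1, gb1 = y.T @ ga, ga.sum(0)
    gy = ga @ W1.T                          # gradient w.r.t. the received signal
    # The AWGN channel is an addition node, so dL/dx = dL/dy reaches the encoder:
    np.add.at(E, s, -eta * gy)
    W1 -= eta * gW1; b1 -= eta * gb1; W2 -= eta * gW2; b2 -= eta * gb2
    E /= np.sqrt(np.mean(np.sum(E ** 2, axis=1)))   # re-project onto the power constraint

print(np.round(E, 2))                       # expect a QPSK-like arrangement of the 4 points
```

Rather than differentiating through the power normalisation, this sketch simply rescales the constellation after each update (projected gradient), a common simplification.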

The key insight of the autoencoder approach (O'Shea and Hoydis, 2017) is that the encoder learns the optimal constellation geometry jointly with the decoder, without being constrained to classical formats (PSK, QAM). The learned constellations often resemble known optimal packings, but can discover novel arrangements for non-standard channels.

Autoencoder Discovers QPSK Constellation

Animated training of an end-to-end autoencoder for 4-ary signalling. Starting from random initialisation, the constellation points converge to a QPSK-like arrangement as the encoder and decoder are jointly optimised.
The autoencoder independently discovers that QPSK is optimal for 4-ary signalling over AWGN --- validating the end-to-end approach.

End-to-End Autoencoder Architecture

Block diagram of the communication autoencoder. The encoder maps each message $s$ (one-hot encoded) through a dense layer + ReLU + dense layer + normalisation to produce a 2D constellation point $\mathbf{x}_s$. The channel adds noise $\mathbf{n}$. The decoder processes the received signal $\mathbf{y}$ through a dense layer + ReLU + dense layer + softmax to produce a probability distribution over messages. The entire system is trained end-to-end by backpropagating the cross-entropy loss through the channel.

Backpropagation Through the Channel

A natural question is: how can we backpropagate through a stochastic channel?

For AWGN, $\mathbf{y} = \mathbf{x} + \mathbf{n}$ where $\mathbf{n}$ is independent of $\mathbf{x}$. The gradient is simply:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}$$

The noise acts as a stochastic regulariser during training (analogous to dropout or Gaussian noise injection in standard deep learning), but does not block gradient flow.

For non-differentiable channels (e.g., quantised feedback, discrete fading states), one must use techniques such as:

  • Straight-through estimator (STE): Replace the non-differentiable operation with an identity in the backward pass (sketched after this list).
  • Generative adversarial network (GAN) channel: Train a differentiable neural network to approximate the channel $p(\mathbf{y}|\mathbf{x})$ and backpropagate through the surrogate.
  • REINFORCE / policy gradient: Treat the encoder output as a stochastic policy and estimate gradients via Monte Carlo sampling.
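As an illustration of the first option, a hypothetical uniform quantiser with a straight-through backward pass can be written in numpy as:

```python
import numpy as np

def quantise_forward(x, step=0.25):
    """Forward pass: uniform quantisation (derivative is zero almost everywhere)."""
    return step * np.round(x / step)

def quantise_backward_ste(grad_y):
    """Backward pass with the straight-through estimator: treat the quantiser
    as the identity, so the upstream gradient passes through unchanged."""
    return grad_y

x = np.array([0.13, -0.42, 0.91])
y = quantise_forward(x)                          # used in the forward pass / loss
grad_x = quantise_backward_ste(np.ones_like(x))  # biased but usable surrogate gradient
print(y, grad_x)
```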

Autoencoder Learned Constellation

Train an end-to-end autoencoder with $M$ constellation points over 2 channel uses (so each point is 2D) under AWGN at the specified SNR. The left panel shows the learned constellation points (circles) alongside the classical $M$-PSK reference (diamonds). The right panel shows the training loss convergence. Try $M = 4$ (compare with QPSK) and $M = 16$ (compare with 16-PSK). At high SNR the learned constellation approaches a circle; at low SNR it may discover arrangements with a larger minimum distance. Note: This tiny numpy autoencoder has only ~100 parameters and trains in under a second.

Parameters: $M = 4$, SNR $= 10$ dB

What the Autoencoder Learns

The autoencoder framework yields several insights about optimal signal design:

  1. AWGN channel, $n = 1$ channel use per symbol: The learned constellation converges to uniform spacing on a line segment (essentially $M$-PAM), which is indeed optimal for 1D AWGN.

  2. AWGN channel, $n = 2$ channel uses: For $M = 4$, the learned constellation closely resembles QPSK (4 points on a circle). For $M = 8$, it discovers a rotated 8-PSK or occasionally a (1,7) arrangement (1 point at the origin, 7 on a circle) --- which has a slightly better minimum-distance packing.

  3. Rayleigh fading: Under fading without CSI at the transmitter, the autoencoder learns constellations that are more robust to amplitude variations, often placing points at varying radii rather than all on a single circle.

  4. Non-linear channels: When a non-linear power amplifier (PA) model is included in the channel, the autoencoder jointly learns pre-distortion (at the encoder) and equalisation (at the decoder).

These results validate the autoencoder approach: it rediscovers known optimal solutions when the theory is well-established, and discovers novel solutions when the channel is too complex for closed-form analysis.

Common Mistake: NN Trained at One SNR Degrades at Others

Mistake:

Training a neural-network channel estimator or detector at a single SNR value (e.g., 20 dB) and deploying it across all operating conditions without retraining or SNR conditioning.

Correction:

A network trained at 20 dB learns denoising behaviour appropriate for that noise level. At 5 dB the noise power is 15 dB higher, and the network under-regularises (it smooths too little for the stronger noise), resulting in worse performance than a classical MMSE estimator tuned for 5 dB. Two solutions:

  1. Train across a range of SNR values by sampling the training SNR uniformly from [0, 30] dB.
  2. Condition on SNR: Feed the estimated SNR as an auxiliary input to the network, allowing it to adapt its behaviour.

Empirically, SNR-conditioned networks match or exceed per-SNR specialists, with a single model serving all operating points.
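A sketch of the SNR-conditioning option; the helper name and the feature scaling are illustrative assumptions:

```python
import numpy as np

def make_nn_input(h_ls, snr_db, snr_max=30.0):
    """Append the operating SNR (normalised to roughly [0, 1]) to the usual
    real/imaginary LS-estimate features, so a single network can adapt its
    denoising behaviour across the whole SNR range."""
    snr_feat = np.full((h_ls.shape[0], 1), snr_db / snr_max)
    return np.hstack([h_ls.real, h_ls.imag, snr_feat])

# During training, draw snr_db ~ Uniform(0, 30) per sample and generate the
# pilot noise accordingly; at deployment, feed the receiver's SNR estimate.
```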

Common Mistake: Autoencoder Fails with Non-Differentiable Channel

Mistake:

Attempting to train an end-to-end autoencoder with backpropagation when the channel model includes non-differentiable operations (quantisation, discrete feedback, hard detection).

Correction:

Standard backpropagation requires a differentiable path from decoder loss to encoder parameters. Non-differentiable operations block gradient flow. Use:

  • Straight-through estimator (STE) for quantisation layers
  • GAN-based surrogate channel that is differentiable
  • REINFORCE / policy gradient for discrete actions
  • Gumbel-Softmax relaxation for discrete selections

Each approach has trade-offs: STE introduces gradient bias, GAN surrogates add training complexity, and REINFORCE has high variance. The choice depends on the specific non-linearity.

Historical Note: The Autoencoder Revolution: O'Shea and Hoydis (2017)

2017

Timothy O'Shea and Jakob Hoydis introduced the end-to-end autoencoder concept for communication systems in their landmark 2017 IEEE TCCN paper. By treating the entire transmitter-channel-receiver chain as a single neural network and training it to minimise classification error, they demonstrated that the network independently discovers classical constellation geometries (QPSK, 8-PSK) for AWGN channels --- without any knowledge of modulation theory. The paper ignited a wave of research in "learning to communicate" and has been cited over 3,000 times. The autoencoder framework showed that deep learning could be applied not just as a tool within existing systems but as a fundamentally new approach to system design.

Historical Note: Gregor and LeCun: LISTA and the Birth of Deep Unfolding (2010)

2010

Karol Gregor and Yann LeCun introduced LISTA (Learned ISTA) at ICML 2010, establishing the deep unfolding paradigm. The key insight was deceptively simple: take a well-known iterative algorithm (ISTA), unfold a fixed number of iterations into layers, make the per-iteration parameters learnable, and train end-to-end. The result --- 10 LISTA layers matching 100+ ISTA iterations --- demonstrated that algorithm structure is a powerful inductive bias. This idea has since been applied to ADMM, belief propagation, WMMSE, and dozens of other algorithms across signal processing, computer vision, and wireless communications.

⚠️ Engineering Note

Neural Network Inference Latency in Real-Time PHY

Physical-layer processing must complete within strict time budgets:

  • OFDM symbol duration: 66.7 μs (5G NR, 15 kHz SCS)
  • Slot duration: 0.5 ms (5G NR, 30 kHz SCS)
  • HARQ feedback deadline: 4--8 slots (2--4 ms)

A neural-network channel estimator or detector must complete inference within a fraction of the symbol duration. Practical constraints:

  • FPGA inference: A 2-layer FC network with 5000 parameters achieves ~1 μs latency on a Xilinx Zynq UltraScale+ at a 200 MHz clock. This fits comfortably within the time budget.
  • GPU inference: Batch processing of 1000 subframes takes ~0.5 ms on an NVIDIA A100, but the per-subframe latency is dominated by kernel-launch overhead (~10 μs).
  • Quantisation: INT8 quantisation reduces model size by 4× and doubles throughput on FPGA, with <0.5 dB performance loss for small networks. INT4 is feasible for inference-only.
  • LISTA layers: Each LISTA layer involves a matrix-vector product ($N^2$ MACs) and soft-thresholding ($N$ ops). For $N = 64$, $L = 10$: 40,960 MACs in total, achievable in <1 μs on DSP hardware (a quick check follows below).
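A quick sanity check of the LISTA arithmetic above; the assumed DSP throughput is illustrative:

```python
N, L = 64, 10
macs = L * N * N                      # one N x N matrix-vector product per layer
print(macs)                           # 40960 MACs, matching the figure quoted above
# At an assumed 100 GMAC/s of FPGA DSP throughput this takes ~0.41 us,
# comfortably inside the 66.7 us OFDM symbol budget.
print(macs / 100e9 * 1e6, "us")
```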
Practical Constraints

  • 5G NR symbol: 66.7 μs (15 kHz SCS)
  • FPGA inference: ~1 μs for a 5000-parameter FC network
  • INT8 quantisation: <0.5 dB loss, 4× size reduction

Supervised Learning

A machine learning paradigm in which a model $f_\theta$ is trained on labelled pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}$ to minimise an empirical loss. In wireless PHY, the labels are typically channel coefficients (for estimation) or transmitted symbols (for detection), obtained from simulation or calibration.

Related: Deep Unfolding (Algorithm Unrolling), Communication Autoencoder

Communication Autoencoder

A neural network architecture that jointly optimises encoder (transmitter) and decoder (receiver) end-to-end, treating the communication chain as a single autoencoder. The encoder learns optimal constellation geometry and the decoder learns the corresponding detection rule.

Related: Supervised Learning

Quick Check

An end-to-end autoencoder for 16-ary signalling is trained using categorical cross-entropy loss. If the decoder outputs a uniform distribution $\hat{p}_s = 1/16$ for all messages $s$ (i.e., it has not learned anything), what is the training loss per sample?

  • $\log_2(16) = 4$ bits
  • $\ln(16) \approx 2.77$ nats
  • $16 \times \ln(16) \approx 44.4$ nats
  • $1/16 \approx 0.0625$

Quick Check

A neural-network channel estimator is trained at SNR $= 20$ dB and deployed at SNR $= 5$ dB. Which statement is most accurate?

  • The NN will perform optimally at 5 dB since neural networks generalise perfectly across SNR
  • The NN will likely perform worse than an MMSE estimator tuned for 5 dB, because it learned denoising behaviour appropriate for 20 dB noise levels
  • The NN will refuse to produce an estimate at a mismatched SNR
  • The performance will be identical because the LS input is the same