Supervised Learning for the Physical Layer

Why Machine Learning at the Physical Layer?

Classical physical-layer algorithms --- MMSE channel estimation, Viterbi detection, Turbo decoding --- are derived from explicit mathematical models of the channel, noise, and modulation. They are optimal when the model is correct, but real-world channels exhibit hardware impairments (non-linear power amplifiers, I/Q imbalance, phase noise), imperfect CSI, and complex propagation effects that defy closed-form treatment.

Supervised learning offers a complementary paradigm: given labelled training pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^N$, learn a mapping $f_\theta$ that minimises an empirical loss without requiring a closed-form channel model. This section develops two flagship applications --- neural-network channel estimation and end-to-end autoencoder learning --- and establishes the mathematical foundations that connect supervised learning to classical estimation.

A word of caution: all simulations in this chapter use tiny pure-numpy networks ($< 1000$ parameters, training in under a second) for pedagogical illustration. Production systems use GPU-accelerated frameworks with far larger models, but the principles are identical.

Definition: Supervised Learning Framework for the Physical Layer

Let $\mathbf{y} \in \mathbb{R}^m$ denote the received observation (e.g., pilot observations) and $\mathbf{x} \in \mathbb{R}^n$ the quantity to be estimated (e.g., channel coefficients, transmitted symbols). A supervised learning approach trains a parametric function $f_\theta : \mathbb{R}^m \to \mathbb{R}^n$ by minimising the empirical risk:

$$\hat{\theta} = \arg\min_\theta \; \frac{1}{N}\sum_{i=1}^{N} \ell\bigl(f_\theta(\mathbf{y}_i),\, \mathbf{x}_i\bigr)$$

where $\{(\mathbf{y}_i, \mathbf{x}_i)\}_{i=1}^N$ is the training dataset and $\ell(\cdot, \cdot)$ is the loss function. Common choices:

  • MSE loss (regression): $\ell(\hat{\mathbf{x}}, \mathbf{x}) = \|\hat{\mathbf{x}} - \mathbf{x}\|^2$
  • Cross-entropy loss (classification): $\ell(\hat{\mathbf{p}}, c) = -\log \hat{p}_c$ where $\hat{\mathbf{p}} = \mathrm{softmax}(f_\theta(\mathbf{y}))$

The function $f_\theta$ is typically a feed-forward neural network (fully connected, convolutional, or residual) trained via stochastic gradient descent (SGD) or Adam:

ΞΈβ†ΞΈβˆ’Ξ·β€‰βˆ‡ΞΈβ„“(fΞΈ(y),x)\theta \leftarrow \theta - \eta \, \nabla_\theta \ell\bigl(f_\theta(\mathbf{y}), \mathbf{x}\bigr)

where $\eta > 0$ is the learning rate.
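To ground the definition, here is a minimal numpy sketch of this empirical-risk minimisation for a linear $f_\theta$ under the MSE loss; the toy data, dimensions, and learning rate are illustrative assumptions, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 4, 2, 256                            # observation dim, target dim, dataset size
Y = rng.normal(size=(N, m))                    # toy observations y_i
X = Y[:, :n] + 0.1 * rng.normal(size=(N, n))   # toy targets x_i, correlated with y_i
W = np.zeros((n, m))                           # parameters theta of the linear map f(y) = W y

eta = 0.1                                      # learning rate
for epoch in range(200):
    Xhat = Y @ W.T                             # forward pass: f_theta(y_i) for all i
    grad = 2 * (Xhat - X).T @ Y / N            # gradient of the empirical MSE w.r.t. W
    W -= eta * grad                            # gradient step (full batch for simplicity)

print(np.mean((Y @ W.T - X) ** 2))             # final empirical risk
```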

The training data can be generated in three ways: (1) from a known channel model (simulation-based), (2) from over-the-air pilot transmissions with known ground truth, or (3) from a combination (transfer learning). Approach (1) is most common in research; approach (2) enables adaptation to real hardware.

Definition: Neural-Network Channel Estimation

Consider an OFDM system with $N_c$ subcarriers. Pilots are inserted at $N_p$ known subcarrier positions $\mathcal{P}$. The received pilot observations are:

$$\mathbf{y}_p = \mathrm{diag}(\mathbf{x}_p)\,\mathbf{h}_p + \mathbf{n}_p$$

where $\mathbf{h}_p = [H(k)]_{k \in \mathcal{P}}$ is the channel at the pilot subcarriers, $\mathbf{x}_p$ are the known pilot symbols, and $\mathbf{n}_p \sim \mathcal{CN}(\mathbf{0}, \sigma^2\mathbf{I})$.

The LS estimate at the pilots is $\hat{\mathbf{h}}_p^{\mathrm{LS}} = \mathrm{diag}(\mathbf{x}_p)^{-1}\mathbf{y}_p$, and the full channel is recovered by interpolation.
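A minimal numpy sketch of this LS + interpolation baseline, assuming unit-power pilots and the exponential power delay profile used later in this section:

```python
import numpy as np

rng = np.random.default_rng(1)
Nc, Np, L, snr_db = 64, 8, 8, 15
pilots = np.arange(0, Nc, Nc // Np)              # uniformly spaced pilot positions

# One channel realisation: L complex taps with an exponential power delay profile
pdp = np.exp(-np.arange(L) / 3); pdp /= pdp.sum()
taps = np.sqrt(pdp / 2) * (rng.normal(size=L) + 1j * rng.normal(size=L))
h = np.fft.fft(taps, Nc)                         # frequency response on Nc subcarriers

x_p = np.ones(Np, dtype=complex)                 # known pilot symbols (unit power)
sigma = np.sqrt(10 ** (-snr_db / 10) / 2)        # noise std per real dimension
y_p = x_p * h[pilots] + sigma * (rng.normal(size=Np) + 1j * rng.normal(size=Np))

h_ls = y_p / x_p                                 # LS estimate at the pilot positions
# Recover the full channel by linearly interpolating real and imaginary parts
# (np.interp holds the edge values constant beyond the last pilot)
h_hat = np.interp(np.arange(Nc), pilots, h_ls.real) \
      + 1j * np.interp(np.arange(Nc), pilots, h_ls.imag)
print("LS + interpolation MSE:", np.mean(np.abs(h_hat - h) ** 2))
```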

A neural-network estimator replaces the interpolation by a learned mapping:

$$\hat{\mathbf{h}} = f_\theta\!\left( [\operatorname{Re}(\hat{\mathbf{h}}_p^{\mathrm{LS}}),\; \operatorname{Im}(\hat{\mathbf{h}}_p^{\mathrm{LS}})] \right)$$

where $f_\theta$ is a two-layer fully connected network:

$$f_\theta(\mathbf{z}) = \mathbf{W}_2\, \sigma(\mathbf{W}_1\mathbf{z} + \mathbf{b}_1) + \mathbf{b}_2$$

with ReLU activation $\sigma(\cdot) = \max(0, \cdot)$ and parameters $\theta = \{\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2\}$. The network is trained to minimise the MSE:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \|\hat{\mathbf{h}}_i - \mathbf{h}_i\|^2$$

where $\mathbf{h}_i$ is the true channel (available during training from the simulator or calibration).

The NN estimator implicitly learns the channel statistics (power delay profile, correlation across subcarriers) from data. This is analogous to how MMSE estimation uses the channel correlation matrix $\mathbf{R}_{HH}$, but without requiring explicit knowledge of $\mathbf{R}_{HH}$. When the channel model is well matched, MMSE is near-optimal; the NN advantage emerges under model mismatch.
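The following pure-numpy sketch trains exactly this two-layer estimator on simulated channels with manual backpropagation; the dimensions follow the worked example below, while the initialisation scale, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
Nc, Np, L, H, snr_db = 64, 8, 8, 32, 15
pilots = np.arange(0, Nc, Nc // Np)
pdp = np.exp(-np.arange(L) / 3); pdp /= pdp.sum()   # exponential power delay profile
sigma = np.sqrt(10 ** (-snr_db / 10) / 2)           # noise std per real dimension

def sample(batch):
    """Simulate true channels and the corresponding noisy LS pilot estimates."""
    taps = np.sqrt(pdp / 2) * (rng.normal(size=(batch, L))
                               + 1j * rng.normal(size=(batch, L)))
    h = np.fft.fft(taps, Nc, axis=1)                # true frequency response
    w = sigma * (rng.normal(size=(batch, Np)) + 1j * rng.normal(size=(batch, Np)))
    h_ls = h[:, pilots] + w                         # unit pilots: LS estimate = h_p + noise
    z = np.hstack([h_ls.real, h_ls.imag])           # NN input, dimension 2*Np = 16
    t = np.hstack([h.real, h.imag])                 # NN target, dimension 2*Nc = 128
    return z, t

# Two-layer network: 16 -> 32 (ReLU) -> 128
W1 = rng.normal(size=(2 * Np, H)) * 0.3; b1 = np.zeros(H)
W2 = rng.normal(size=(H, 2 * Nc)) * 0.3; b2 = np.zeros(2 * Nc)

z, t = sample(500)                                  # fixed training set: 500 realisations
eta = 0.01
for step in range(5000):
    a = np.maximum(z @ W1 + b1, 0)                  # hidden activations (ReLU)
    out = a @ W2 + b2                               # estimate, real/imag stacked
    g = 2 * (out - t) / len(z)                      # MSE gradient at the output
    gW2, gb2 = a.T @ g, g.sum(0)
    ga = (g @ W2.T) * (a > 0)                       # backprop through the ReLU
    gW1, gb1 = z.T @ ga, ga.sum(0)
    W1 -= eta * gW1; b1 -= eta * gb1; W2 -= eta * gW2; b2 -= eta * gb2

z_te, t_te = sample(2000)                           # fresh channels for evaluation
mse = np.mean((np.maximum(z_te @ W1 + b1, 0) @ W2 + b2 - t_te) ** 2)
print("NN estimator test MSE:", mse)
```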

Theorem: Universal Approximation (Relevance to PHY)

A feed-forward network with a single hidden layer and sufficiently many units can approximate any continuous function on a compact set to arbitrary accuracy (Cybenko, 1989; Hornik, 1991). For the physical layer, this means the two-layer estimator above can in principle approximate the MMSE estimator $\mathbb{E}[\mathbf{h} \mid \mathbf{y}]$, which is generally a non-linear function of the observation. The theorem guarantees expressive power only: it says nothing about whether SGD finds the approximating weights or how much training data is needed.

Example: Comparing NN and LS Channel Estimation

An OFDM system has $N_c = 64$ subcarriers and $N_p = 8$ uniformly spaced pilots. The channel has $L = 8$ taps with an exponential power delay profile $\sigma_l^2 \propto e^{-l/3}$. A two-layer neural network with 32 hidden units is trained on 500 channel realisations at SNR $= 15$ dB.

(a) Write the dimensions of the NN weight matrices.

(b) Count the total number of trainable parameters.

(c) Explain why the NN can outperform LS + interpolation.
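One worked solution, assuming the real/imaginary stacking convention from the definition above: (a) the input is the $2N_p = 16$-dimensional real representation of the LS pilot estimates and the output is the $2N_c = 128$-dimensional full channel, so $\mathbf{W}_1 \in \mathbb{R}^{32 \times 16}$ and $\mathbf{W}_2 \in \mathbb{R}^{128 \times 32}$. (b) Counting weights and biases: $32 \cdot 16 + 32 + 128 \cdot 32 + 128 = 4768$ trainable parameters. (c) LS + interpolation uses no prior knowledge of the channel, whereas the NN implicitly learns the power delay profile and the resulting correlation across subcarriers from the training data, so it can both denoise the pilot estimates and interpolate along the learned correlation structure.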

NN vs LS Channel Estimation MSE

Compare a two-layer neural-network channel estimator with LS + linear interpolation across a range of SNR values. The NN is trained at the specified SNR on 500 channel realisations with an exponential power delay profile. Observe how the NN provides a larger gain at low SNR (where denoising matters most) and with fewer pilots (where interpolation is harder). The tiny numpy NN has only ~5000 parameters --- a production system would use a deeper architecture with skip connections for even better performance.

Parameters: SNR $= 15$ dB, $N_p = 8$ pilots

Definition: End-to-End Autoencoder for Communication

The autoencoder approach treats the entire communication system --- encoder (transmitter), channel, and decoder (receiver) --- as a single neural network that is trained end-to-end to minimise the symbol error probability.

Architecture. For an $M$-ary modulation scheme over $n$ channel uses:

  1. Encoder $f_{\theta_{\text{enc}}} : \{1, \ldots, M\} \to \mathbb{R}^{2n}$: Maps each message $s \in \{1, \ldots, M\}$ (represented as a one-hot vector) to a transmitted signal $\mathbf{x} = f_{\theta_{\text{enc}}}(s)$, subject to a power constraint $\frac{1}{M}\sum_{s=1}^M \|\mathbf{x}_s\|^2 \leq 1$.

  2. Channel $p(\mathbf{y}|\mathbf{x})$: The physical channel (AWGN, Rayleigh fading, etc.) adds noise. For AWGN: $\mathbf{y} = \mathbf{x} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$.

  3. Decoder $g_{\theta_{\text{dec}}} : \mathbb{R}^{2n} \to \Delta^{M-1}$: Maps the received signal to a probability distribution over messages via softmax; the estimated message is $\hat{s} = \arg\max_s \, g_{\theta_{\text{dec}}}(\mathbf{y})_s$.

Training. The system is trained by minimising the categorical cross-entropy:

$$\mathcal{L}(\theta_{\text{enc}}, \theta_{\text{dec}}) = -\frac{1}{N}\sum_{i=1}^N \log g_{\theta_{\text{dec}}}(\mathbf{y}_i)_{s_i}$$

Gradients pass through the channel layer: for AWGN, the channel is a simple addition node with a deterministic gradient of $\partial\mathbf{y}/\partial\mathbf{x} = \mathbf{I}$.
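A minimal pure-numpy realisation of this training loop for $M = 4$ with 2D transmitted signals; the decoder width, learning rate, and the projected-gradient handling of the power constraint are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, H, snr_db = 4, 16, 10                    # messages, decoder hidden width, training SNR
sigma = np.sqrt(10 ** (-snr_db / 10) / 2)   # noise std per real dimension (unit signal power)

E = rng.normal(size=(M, 2))                 # encoder: one learnable 2D point per message
W1 = rng.normal(size=(2, H)) * 0.5; b1 = np.zeros(H)
W2 = rng.normal(size=(H, M)) * 0.5; b2 = np.zeros(M)

eta, batch = 0.05, 256
for step in range(2000):
    s = rng.integers(M, size=batch)         # random training messages
    x = E[s]                                # encoder output
    y = x + sigma * rng.normal(size=x.shape)   # AWGN channel

    a = np.maximum(y @ W1 + b1, 0)          # decoder hidden layer (ReLU)
    logits = a @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)       # softmax over the M messages

    g = p.copy(); g[np.arange(batch), s] -= 1; g /= batch   # cross-entropy gradient
    gW2, gb2 = a.T @ g, g.sum(0)
    ga = (g @ W2.T) * (a > 0)
    gW1, gb1 = y.T @ ga, ga.sum(0)
    gy = ga @ W1.T                          # gradient w.r.t. the received signal
    # The AWGN channel is an addition node, so dL/dx = dL/dy reaches the encoder:
    np.add.at(E, s, -eta * gy)
    W1 -= eta * gW1; b1 -= eta * gb1; W2 -= eta * gW2; b2 -= eta * gb2
    E /= np.sqrt(np.mean(np.sum(E ** 2, axis=1)))   # re-project onto the power constraint

print(np.round(E, 2))                       # expect a QPSK-like arrangement of the 4 points
```

Rather than differentiating through the power normalisation, this sketch simply rescales the constellation after each update (projected gradient), a common simplification.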

The key insight of the autoencoder approach (O'Shea and Hoydis, 2017) is that the encoder learns the optimal constellation geometry jointly with the decoder, without being constrained to classical formats (PSK, QAM). The learned constellations often resemble known optimal packings, but can discover novel arrangements for non-standard channels.

Autoencoder Discovers QPSK Constellation

Animated training of an end-to-end autoencoder for 4-ary signalling. Starting from random initialisation, the constellation points converge to a QPSK-like arrangement as the encoder and decoder are jointly optimised.
The autoencoder independently discovers that QPSK is optimal for 4-ary signalling over AWGN --- validating the end-to-end approach.

End-to-End Autoencoder Architecture

Block diagram of the communication autoencoder. The encoder maps each message $s$ (one-hot encoded) through a dense layer + ReLU + dense layer + normalisation to produce a 2D constellation point $\mathbf{x}_s$. The channel adds noise $\mathbf{n}$. The decoder processes the received signal $\mathbf{y}$ through a dense layer + ReLU + dense layer + softmax to produce a probability distribution over messages. The entire system is trained end-to-end by backpropagating the cross-entropy loss through the channel.

Backpropagation Through the Channel

A natural question is: how can we backpropagate through a stochastic channel?

For AWGN, $\mathbf{y} = \mathbf{x} + \mathbf{n}$ where $\mathbf{n}$ is independent of $\mathbf{x}$. The gradient is simply:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}$$

The noise acts as a stochastic regulariser during training (analogous to dropout or Gaussian noise injection in standard deep learning), but does not block gradient flow.

For non-differentiable channels (e.g., quantised feedback, discrete fading states), one must use techniques such as:

  • Straight-through estimator (STE): Replace the non-differentiable operation with an identity in the backward pass (sketched after this list).
  • Generative adversarial network (GAN) channel: Train a differentiable neural network to approximate the channel $p(\mathbf{y}|\mathbf{x})$ and backpropagate through the surrogate.
  • REINFORCE / policy gradient: Treat the encoder output as a stochastic policy and estimate gradients via Monte Carlo sampling.
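As an illustration of the first option, a hypothetical uniform quantiser with a straight-through backward pass can be written in numpy as:

```python
import numpy as np

def quantise_forward(x, step=0.25):
    """Forward pass: uniform quantisation (derivative is zero almost everywhere)."""
    return step * np.round(x / step)

def quantise_backward_ste(grad_y):
    """Backward pass with the straight-through estimator: treat the quantiser
    as the identity, so the upstream gradient passes through unchanged."""
    return grad_y

x = np.array([0.13, -0.42, 0.91])
y = quantise_forward(x)                          # used in the forward pass / loss
grad_x = quantise_backward_ste(np.ones_like(x))  # biased but usable surrogate gradient
print(y, grad_x)
```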

Autoencoder Learned Constellation

Train an end-to-end autoencoder with $M$ constellation points over 2 channel uses (so each point is 2D) under AWGN at the specified SNR. The left panel shows the learned constellation points (circles) alongside the classical $M$-PSK reference (diamonds). The right panel shows the training loss convergence. Try $M = 4$ (compare with QPSK) and $M = 16$ (compare with 16-PSK). At high SNR the learned constellation approaches a circle; at low SNR it may discover arrangements with a larger minimum distance. Note: This tiny numpy autoencoder has only ~100 parameters and trains in under a second.

Parameters: $M = 4$, SNR $= 10$ dB

What the Autoencoder Learns

The autoencoder framework yields several insights about optimal signal design:

  1. AWGN channel, $n = 1$ channel use per symbol: The learned constellation converges to uniform spacing on a line segment (essentially $M$-PAM), which is indeed optimal for 1D AWGN.

  2. AWGN channel, $n = 2$ channel uses: For $M = 4$, the learned constellation closely resembles QPSK (4 points on a circle). For $M = 8$, it discovers a rotated 8-PSK or occasionally a (1,7) arrangement (1 point at the origin, 7 on a circle) --- which has a slightly better minimum-distance packing.

  3. Rayleigh fading: Under fading without CSI at the transmitter, the autoencoder learns constellations that are more robust to amplitude variations, often placing points at varying radii rather than all on a single circle.

  4. Non-linear channels: When a non-linear power amplifier (PA) model is included in the channel, the autoencoder jointly learns pre-distortion (at the encoder) and equalisation (at the decoder).

These results validate the autoencoder approach: it rediscovers known optimal solutions when the theory is well-established, and discovers novel solutions when the channel is too complex for closed-form analysis.

Common Mistake: NN Trained at One SNR Degrades at Others

Mistake:

Training a neural-network channel estimator or detector at a single SNR value (e.g., 20 dB) and deploying it across all operating conditions without retraining or SNR conditioning.

Correction:

A network trained at 20 dB learns denoising behaviour appropriate for that noise level. At 5 dB the noise power is 15 dB higher, and the network under-regularises (it smooths too little for the stronger noise), resulting in worse performance than a classical MMSE estimator tuned for 5 dB. Two solutions:

  1. Train across a range of SNR values by sampling the training SNR uniformly from [0, 30] dB.
  2. Condition on SNR: Feed the estimated SNR as an auxiliary input to the network, allowing it to adapt its behaviour.

Empirically, SNR-conditioned networks match or exceed per-SNR specialists, with a single model serving all operating points.
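A sketch of the SNR-conditioning option; the helper name and the feature scaling are illustrative assumptions:

```python
import numpy as np

def make_nn_input(h_ls, snr_db, snr_max=30.0):
    """Append the operating SNR (normalised to roughly [0, 1]) to the usual
    real/imaginary LS-estimate features, so a single network can adapt its
    denoising behaviour across the whole SNR range."""
    snr_feat = np.full((h_ls.shape[0], 1), snr_db / snr_max)
    return np.hstack([h_ls.real, h_ls.imag, snr_feat])

# During training, draw snr_db ~ Uniform(0, 30) per sample and generate the
# pilot noise accordingly; at deployment, feed the receiver's SNR estimate.
```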

Common Mistake: Autoencoder Fails with Non-Differentiable Channel

Mistake:

Attempting to train an end-to-end autoencoder with backpropagation when the channel model includes non-differentiable operations (quantisation, discrete feedback, hard detection).

Correction:

Standard backpropagation requires a differentiable path from decoder loss to encoder parameters. Non-differentiable operations block gradient flow. Use:

  • Straight-through estimator (STE) for quantisation layers
  • GAN-based surrogate channel that is differentiable
  • REINFORCE / policy gradient for discrete actions
  • Gumbel-Softmax relaxation for discrete selections

Each approach has trade-offs: STE introduces gradient bias, GAN surrogates add training complexity, and REINFORCE has high variance. The choice depends on the specific non-linearity.

Historical Note: The Autoencoder Revolution: O'Shea and Hoydis (2017)

2017

Timothy O'Shea and Jakob Hoydis introduced the end-to-end autoencoder concept for communication systems in their landmark 2017 IEEE TCCN paper. By treating the entire transmitter-channel-receiver chain as a single neural network and training it to minimise classification error, they demonstrated that the network independently discovers classical constellation geometries (QPSK, 8-PSK) for AWGN channels --- without any knowledge of modulation theory. The paper ignited a wave of research in "learning to communicate" and has been cited over 3,000 times. The autoencoder framework showed that deep learning could be applied not just as a tool within existing systems but as a fundamentally new approach to system design.

Historical Note: Gregor and LeCun: LISTA and the Birth of Deep Unfolding (2010)

2010

Karol Gregor and Yann LeCun introduced LISTA (Learned ISTA) at ICML 2010, establishing the deep unfolding paradigm. The key insight was deceptively simple: take a well-known iterative algorithm (ISTA), unfold a fixed number of iterations into layers, make the per-iteration parameters learnable, and train end-to-end. The result --- 10 LISTA layers matching 100+ ISTA iterations --- demonstrated that algorithm structure is a powerful inductive bias. This idea has since been applied to ADMM, belief propagation, WMMSE, and dozens of other algorithms across signal processing, computer vision, and wireless communications.

⚠️ Engineering Note

Neural Network Inference Latency in Real-Time PHY

Physical-layer processing must complete within strict time budgets:

  • OFDM symbol duration: 66.7 μs (5G NR, 15 kHz SCS)
  • Slot duration: 0.5 ms (5G NR, 30 kHz SCS)
  • HARQ feedback deadline: 4--8 slots (2--4 ms)

A neural-network channel estimator or detector must complete inference within a fraction of the symbol duration. Practical constraints:

  • FPGA inference: A 2-layer FC network with 5000 parameters achieves ~1 μs latency on a Xilinx Zynq UltraScale+ at a 200 MHz clock. This fits comfortably within the time budget.
  • GPU inference: Batch processing of 1000 subframes takes ~0.5 ms on an NVIDIA A100, but the per-subframe latency is dominated by kernel-launch overhead (~10 μs).
  • Quantisation: INT8 quantisation reduces model size by 4× and doubles throughput on FPGA, with <0.5 dB performance loss for small networks. INT4 is feasible for inference-only.
  • LISTA layers: Each LISTA layer involves a matrix-vector product ($N^2$ MACs) and soft-thresholding ($N$ ops). For $N = 64$, $L = 10$: 40,960 MACs in total, achievable in <1 μs on DSP hardware (a quick check follows below).
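A quick sanity check of the LISTA arithmetic above; the assumed DSP throughput is illustrative:

```python
N, L = 64, 10
macs = L * N * N                      # one N x N matrix-vector product per layer
print(macs)                           # 40960 MACs, matching the figure quoted above
# At an assumed 100 GMAC/s of FPGA DSP throughput this takes ~0.41 us,
# comfortably inside the 66.7 us OFDM symbol budget.
print(macs / 100e9 * 1e6, "us")
```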
Practical Constraints

  • 5G NR symbol: 66.7 μs (15 kHz SCS)
  • FPGA inference: ~1 μs for a 5000-parameter FC network
  • INT8 quantisation: <0.5 dB loss, 4× size reduction

Supervised Learning

A machine learning paradigm in which a model $f_\theta$ is trained on labelled pairs $\{(\mathbf{x}_i, \mathbf{y}_i)\}$ to minimise an empirical loss. In wireless PHY, the labels are typically channel coefficients (for estimation) or transmitted symbols (for detection), obtained from simulation or calibration.

Related: Deep Unfolding (Algorithm Unrolling), Communication Autoencoder

Communication Autoencoder

A neural network architecture that jointly optimises encoder (transmitter) and decoder (receiver) end-to-end, treating the communication chain as a single autoencoder. The encoder learns optimal constellation geometry and the decoder learns the corresponding detection rule.

Related: Supervised Learning

Quick Check

An end-to-end autoencoder for 16-ary signalling is trained using categorical cross-entropy loss. If the decoder outputs a uniform distribution $\hat{p}_s = 1/16$ for all messages $s$ (i.e., it has not learned anything), what is the training loss per sample?

  • $\log_2(16) = 4$ bits
  • $\ln(16) \approx 2.77$ nats
  • $16 \times \ln(16) \approx 44.4$ nats
  • $1/16 \approx 0.0625$

Quick Check

A neural-network channel estimator is trained at SNR $= 20$ dB and deployed at SNR $= 5$ dB. Which statement is most accurate?

  • The NN will perform optimally at 5 dB since neural networks generalise perfectly across SNR
  • The NN will likely perform worse than an MMSE estimator tuned for 5 dB, because it learned denoising behaviour appropriate for 20 dB noise levels
  • The NN will refuse to produce an estimate at a mismatched SNR
  • The performance will be identical because the LS input is the same