Supervised Learning for the Physical Layer
Why Machine Learning at the Physical Layer?
Classical physical-layer algorithms --- MMSE channel estimation, Viterbi detection, Turbo decoding --- are derived from explicit mathematical models of the channel, noise, and modulation. They are optimal when the model is correct, but real-world channels exhibit hardware impairments (non-linear power amplifiers, I/Q imbalance, phase noise), imperfect CSI, and complex propagation effects that defy closed-form treatment.
Supervised learning offers a complementary paradigm: given labelled training pairs $(\mathbf{y}_i, \mathbf{x}_i)$, learn a mapping $f_\theta: \mathbf{y} \mapsto \hat{\mathbf{x}}$ that minimises an empirical loss without requiring a closed-form channel model. This section develops two flagship applications --- neural-network channel estimation and end-to-end autoencoder learning --- and establishes the mathematical foundations that connect supervised learning to classical estimation.
A word of caution: all simulations in this chapter use tiny pure-numpy networks (at most a few thousand parameters, training in under a second) for pedagogical illustration. Production systems use GPU-accelerated frameworks with far larger models, but the principles are identical.
Definition: Supervised Learning Framework for the Physical Layer
Let $\mathbf{y}$ denote the received observation (e.g., pilot observations) and $\mathbf{x}$ the quantity to be estimated (e.g., channel coefficients, transmitted symbols). A supervised learning approach trains a parametric function $f_\theta$ by minimising the empirical risk:

$$\hat{\theta} = \arg\min_{\theta} \; \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{y}_i,\,\mathbf{x}_i) \in \mathcal{D}} \ell\big(f_\theta(\mathbf{y}_i),\, \mathbf{x}_i\big)$$

where $\mathcal{D} = \{(\mathbf{y}_i, \mathbf{x}_i)\}$ is the training dataset and $\ell(\cdot,\cdot)$ is the loss function. Common choices:
- MSE loss (regression): $\ell(\hat{\mathbf{x}}, \mathbf{x}) = \|\hat{\mathbf{x}} - \mathbf{x}\|^2$
- Cross-entropy loss (classification): $\ell(\hat{\mathbf{p}}, m) = -\log \hat{p}_m$, where $\hat{\mathbf{p}} = \operatorname{softmax}(f_\theta(\mathbf{y}))$
The function $f_\theta$ is typically a feed-forward neural network (fully connected, convolutional, or residual) trained via stochastic gradient descent (SGD) or Adam:

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate.
The training data can be generated in three ways: (1) from a known channel model (simulation-based), (2) from over-the-air pilot transmissions with known ground truth, or (3) from a combination (transfer learning). Approach (1) is most common in research; approach (2) enables adaptation to real hardware.
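To make the framework concrete, the sketch below trains the simplest possible parametric model --- a single weight matrix --- by minimising the MSE empirical risk with gradient descent on simulation-generated pairs. The toy problem, array sizes, and hyperparameters are assumptions chosen for illustration, not values taken from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (assumed for illustration): estimate x from y = x + noise.
N_train, dim = 500, 4
x_train = rng.standard_normal((N_train, dim))                    # labels
y_train = x_train + 0.3 * rng.standard_normal((N_train, dim))    # observations

# Simplest possible parametric model: f_theta(y) = W y.
W = 0.1 * rng.standard_normal((dim, dim))

eta = 0.05   # learning rate
for epoch in range(200):
    pred = y_train @ W.T                       # f_theta applied to every training input
    err = pred - x_train
    loss = np.mean(np.sum(err ** 2, axis=1))   # empirical risk with the MSE loss

    grad_W = 2.0 / N_train * err.T @ y_train   # gradient of the empirical risk
    W -= eta * grad_W                          # (full-batch) gradient step

print(f"final training MSE: {loss:.4f}")
```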
Definition: Neural-Network Channel Estimation
Consider an OFDM system with $N$ subcarriers. Pilots are inserted at known subcarrier positions $\mathcal{P} \subset \{0, \ldots, N-1\}$, with $|\mathcal{P}| = N_p$. The received pilot observations are:

$$y_p = H_p x_p + n_p, \qquad p \in \mathcal{P}$$

where $H_p$ is the channel at pilot subcarrier $p$, $x_p$ is the known pilot symbol, and $n_p \sim \mathcal{CN}(0, \sigma^2)$.
The LS estimate at the pilots is $\hat{H}^{\mathrm{LS}}_p = y_p / x_p$, and the full channel is recovered by interpolation.
A neural-network estimator replaces the interpolation by a learned mapping:

$$\hat{\mathbf{H}} = f_\theta\big(\hat{\mathbf{H}}^{\mathrm{LS}}\big)$$

where $f_\theta$ is a two-layer fully connected network:

$$f_\theta(\mathbf{z}) = \mathbf{W}_2\, \operatorname{ReLU}(\mathbf{W}_1 \mathbf{z} + \mathbf{b}_1) + \mathbf{b}_2$$

with ReLU activation and parameters $\theta = \{\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2\}$. The network is trained to minimise the MSE:

$$\mathcal{L}(\theta) = \frac{1}{N_{\text{train}}} \sum_{i} \big\| f_\theta\big(\hat{\mathbf{H}}^{\mathrm{LS}}_i\big) - \mathbf{H}_i \big\|^2$$

where $\mathbf{H}_i$ is the true channel (available during training from the simulator or calibration).
The NN estimator implicitly learns the channel statistics (power delay profile, correlation across subcarriers) from data. This is analogous to how MMSE estimation uses the channel correlation matrix $\mathbf{R}_{\mathbf{HH}}$, but without requiring explicit knowledge of $\mathbf{R}_{\mathbf{HH}}$. When the channel model is well matched, MMSE is near-optimal; the NN advantage emerges under model mismatch.
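The following numpy sketch assembles the whole pipeline: simulate exponential-PDP channels, form LS estimates at the pilots, and train the two-layer network $f_\theta$ with the MSE loss. The subcarrier count, pilot spacing, delay-profile decay, hidden width, and training hyperparameters below are assumed values chosen for illustration, not necessarily those used in the demos later in this section.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Assumed toy configuration (illustrative, not the chapter's exact numbers) ---
N, N_p, L, hidden = 64, 16, 8, 32        # subcarriers, pilots, channel taps, hidden units
pilot_idx = np.arange(0, N, N // N_p)    # uniformly spaced pilots
snr_db = 10.0
sigma = 10 ** (-snr_db / 20)

def sample_channel(batch):
    """Random L-tap channels with an exponential power delay profile,
    returned as frequency responses on the N subcarriers."""
    pdp = np.exp(-np.arange(L) / 2.0)
    pdp /= pdp.sum()
    taps = rng.standard_normal((batch, L)) + 1j * rng.standard_normal((batch, L))
    taps *= np.sqrt(pdp / 2)
    return np.fft.fft(taps, n=N, axis=1)

def ls_pilot_estimate(H):
    """Noisy LS estimates at the pilot subcarriers (unit-power pilots assumed)."""
    shape = H[:, pilot_idx].shape
    noise = sigma / np.sqrt(2) * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
    return H[:, pilot_idx] + noise

def to_real(z):
    """Stack real and imaginary parts so the network operates on real vectors."""
    return np.concatenate([z.real, z.imag], axis=1)

# Two-layer network: W2 ReLU(W1 z + b1) + b2
W1 = 0.1 * rng.standard_normal((hidden, 2 * N_p)); b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((2 * N, hidden));   b2 = np.zeros(2 * N)

eta, batch = 0.01, 64
for step in range(2000):
    H = sample_channel(batch)
    z, target = to_real(ls_pilot_estimate(H)), to_real(H)

    a = z @ W1.T + b1              # hidden pre-activation
    h = np.maximum(a, 0.0)         # ReLU
    out = h @ W2.T + b2            # estimated channel, real/imag stacked

    err = out - target             # gradient of 0.5*||out - target||^2 w.r.t. out
    gW2 = err.T @ h / batch;  gb2 = err.mean(axis=0)
    dh = (err @ W2) * (a > 0)      # backprop through the ReLU
    gW1 = dh.T @ z / batch;   gb1 = dh.mean(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= eta * g               # in-place SGD step

print(f"final per-entry training MSE: {np.mean((out - target) ** 2):.4f}")
```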
Theorem: Universal Approximation (Relevance to PHY)
A feed-forward network with a single hidden layer and sufficiently many neurons can approximate any continuous function on a compact set to arbitrary accuracy (Cybenko, 1989; Hornik, 1991). For the physical layer, this means that mappings such as the MMSE channel estimator or the optimal detector can in principle be represented by a neural network --- though the theorem says nothing about the network size, the amount of training data, or the optimisation effort required.
Example: Comparing NN and LS Channel Estimation
An OFDM system has $N$ subcarriers and $N_p$ uniformly spaced pilots. The channel has $L$ taps with an exponential power delay profile. A two-layer neural network with 32 hidden units is trained on 500 channel realisations at a fixed training SNR.
(a) Write the dimensions of the NN weight matrices.
(b) Count the total number of trainable parameters.
(c) Explain why the NN can outperform LS + interpolation.
Network dimensions
The input is the LS estimate at the $N_p$ pilot subcarriers, with real and imaginary parts stacked into a vector of length $2N_p$. The output is the full channel over all $N$ subcarriers, of length $2N$. With 32 hidden units, $\mathbf{W}_1 \in \mathbb{R}^{32 \times 2N_p}$ and $\mathbf{W}_2 \in \mathbb{R}^{2N \times 32}$.
Parameter count

$$\underbrace{(2N_p \cdot 32 + 32)}_{\text{layer 1}} + \underbrace{(32 \cdot 2N + 2N)}_{\text{layer 2}}$$

counting weights and biases of both layers.
This is a very small network by modern standards (GPT-3 has 175 billion parameters), yet it suffices for this task because the input-output mapping has low intrinsic dimensionality.
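A small helper makes such counts easy to verify; the numeric values below are assumptions chosen for illustration (and happen to be consistent with the roughly 5000-parameter network mentioned in the demo further down).

```python
def fc2_param_count(n_in, n_hidden, n_out):
    """Weights + biases of a two-layer fully connected network."""
    return (n_in * n_hidden + n_hidden) + (n_hidden * n_out + n_out)

# Assumed values for illustration only (real/imag parts stacked at input and output).
N, N_p, hidden = 64, 16, 32
print(fc2_param_count(2 * N_p, hidden, 2 * N))   # 5280
```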
Why NN outperforms LS + interpolation
LS + linear interpolation treats each subcarrier independently (via local interpolation kernels) and ignores the channel structure. The NN implicitly learns:
- The power delay profile: The exponential decay means higher-frequency channel components carry less energy and should be attenuated (denoised). This is equivalent to Wiener filtering.
- Cross-subcarrier correlations: The $L$-tap channel creates smooth frequency-domain variations that the NN exploits for joint estimation.
- Non-linear denoising: At low SNR, the NN learns to shrink noisy components more aggressively, approaching the MMSE estimator.
The performance gap is largest when pilots are sparse ($N_p \ll N$) and at low-to-moderate SNR, where the prior information about channel structure is most valuable.
NN vs LS Channel Estimation MSE
Compare a two-layer neural-network channel estimator with LS + linear interpolation across a range of SNR values. The NN is trained at the specified SNR on 500 channel realisations with an exponential power delay profile. Observe how the NN provides a larger gain at low SNR (where denoising matters most) and with fewer pilots (where interpolation is harder). The tiny numpy NN has only 5000 parameters --- a production system would use a deeper architecture with skip connections for even better performance.
Definition: End-to-End Autoencoder for Communication
The autoencoder approach treats the entire communication system --- encoder (transmitter), channel, and decoder (receiver) --- as a single neural network that is trained end-to-end to minimise the symbol error probability.
Architecture. For an $M$-ary modulation scheme over $n$ channel uses:
- Encoder $f_\theta$: Maps each message $m \in \{1, \ldots, M\}$ (represented as a one-hot vector) to a transmitted signal $\mathbf{x} = f_\theta(m)$ spanning the $n$ channel uses, subject to a power constraint $\mathbb{E}[\|\mathbf{x}\|^2] \le n$.
- Channel $p(\mathbf{y} \mid \mathbf{x})$: The physical channel (AWGN, Rayleigh fading, etc.) adds noise. For AWGN: $\mathbf{y} = \mathbf{x} + \mathbf{n}$, $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$.
- Decoder $g_\phi$: Maps the received signal $\mathbf{y}$ to a probability distribution over the $M$ messages via softmax, and the estimated message is $\hat{m} = \arg\max_m\, [g_\phi(\mathbf{y})]_m$.
Training. The system is trained by minimising the categorical cross-entropy:

$$\mathcal{L}(\theta, \phi) = -\frac{1}{B} \sum_{i=1}^{B} \log\, [g_\phi(\mathbf{y}_i)]_{m_i}$$

Gradients pass through the channel layer: for AWGN, the channel is a simple addition node with a deterministic gradient of $\partial \mathbf{y} / \partial \mathbf{x} = \mathbf{I}$.
The key insight of the autoencoder approach (O'Shea and Hoydis, 2017) is that the encoder learns the optimal constellation geometry jointly with the decoder, without being constrained to classical formats (PSK, QAM). The learned constellations often resemble known optimal packings, but can discover novel arrangements for non-standard channels.
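The sketch below is a minimal numpy realisation of this idea for an AWGN channel: the encoder is a table of learnable constellation points, the decoder is a linear map followed by a softmax, and the power constraint is enforced by re-normalising the constellation after each gradient step rather than by backpropagating through the normalisation. The message count, SNR convention, learning rate, and architecture are assumptions made for brevity and need not match the interactive demo.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: M messages, 2 real channel dimensions, AWGN at a fixed SNR.
M, dim = 4, 2
snr_db = 10.0
sigma = np.sqrt(0.5 * 10 ** (-snr_db / 10))   # per-dimension noise std (assumed convention)

# Encoder: one learnable 2-D point per message (the constellation itself).
# Decoder: linear map + softmax over the M messages.
E = 0.5 * rng.standard_normal((M, dim))       # constellation points
W = 0.5 * rng.standard_normal((dim, M))       # decoder weights
b = np.zeros(M)

def normalise(E):
    """Project onto the average-power constraint E[||x||^2] = 1."""
    return E / np.sqrt(np.mean(np.sum(E ** 2, axis=1)))

eta, batch = 0.1, 256
for step in range(2000):
    E = normalise(E)
    m = rng.integers(0, M, size=batch)                 # random messages
    x = E[m]                                           # encoder output
    y = x + sigma * rng.standard_normal(x.shape)       # AWGN channel

    # Decoder forward pass: softmax(y W + b) and categorical cross-entropy.
    logits = y @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(p[np.arange(batch), m] + 1e-12))

    # Backward pass: softmax/cross-entropy, then straight through the channel.
    dlogits = p.copy(); dlogits[np.arange(batch), m] -= 1.0; dlogits /= batch
    gW = y.T @ dlogits; gb = dlogits.sum(axis=0)
    dy = dlogits @ W.T                                 # gradient at the channel output
    dx = dy                                            # dy/dx = I for AWGN
    gE = np.zeros_like(E)
    np.add.at(gE, m, dx)                               # scatter per-message gradients
    W -= eta * gW; b -= eta * gb; E -= eta * gE

print(f"final cross-entropy: {loss:.3f} nats")
print("learned constellation:\n", np.round(normalise(E), 3))
```

With these settings the four learned points typically settle into a QPSK-like arrangement (up to an arbitrary rotation), consistent with the observation that the autoencoder rediscovers classical constellations on AWGN channels.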
Autoencoder Discovers QPSK Constellation
End-to-End Autoencoder Architecture
Backpropagation Through the Channel
A natural question is: how can we backpropagate through a stochastic channel?
For AWGN, $\mathbf{y} = \mathbf{x} + \mathbf{n}$, where $\mathbf{n}$ is independent of $\mathbf{x}$. The gradient is simply:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{I}$$
The noise acts as a stochastic regulariser during training (analogous to dropout or Gaussian noise injection in standard deep learning), but does not block gradient flow.
For non-differentiable channels (e.g., quantised feedback, discrete fading states), one must use techniques such as:
- Straight-through estimator (STE): Replace the non-differentiable operation with an identity in the backward pass (a minimal sketch follows this list).
- Generative adversarial network (GAN) channel: Train a differentiable neural network to approximate the channel and backpropagate through the surrogate.
- REINFORCE / policy gradient: Treat the encoder output as a stochastic policy and estimate gradients via Monte Carlo sampling.
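As an illustration of the first technique, a straight-through estimator for a hard quantiser can be written as a forward/backward pair; the quantiser step size and the example vectors below are hypothetical.

```python
import numpy as np

def quantise_forward(x, step=0.5):
    """Hard quantiser used in the forward pass (non-differentiable)."""
    return step * np.round(x / step)

def quantise_backward(grad_out):
    """Straight-through estimator: treat the quantiser as the identity in the
    backward pass, so the upstream gradient flows through unchanged (biased)."""
    return grad_out

x = np.array([0.23, -0.87, 1.41])                  # hypothetical encoder outputs
x_q = quantise_forward(x)                          # what the channel actually sees

grad_from_decoder = np.array([0.1, -0.3, 0.05])    # hypothetical upstream gradient
grad_to_encoder = quantise_backward(grad_from_decoder)
```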
Autoencoder Learned Constellation
Train an end-to-end autoencoder with $M$ constellation points over 2 channel uses (so each point is 2D) under AWGN at the specified SNR. The left panel shows the learned constellation points (circles) alongside the classical $M$-PSK reference (diamonds). The right panel shows the training loss convergence. Try $M = 4$ (compare with QPSK) and $M = 16$ (compare with 16-PSK). At high SNR the learned constellation approaches a circle; at low SNR it may discover arrangements with larger minimum distance. Note: This tiny numpy autoencoder has only 100 parameters and trains in under a second.
What the Autoencoder Learns
The autoencoder framework yields several insights about optimal signal design:
- AWGN channel, $n = 1$ channel use per symbol: The learned constellation converges to uniform spacing on a line segment (essentially $M$-PAM), which is indeed optimal for 1D AWGN.
- AWGN channel, $n = 2$ channel uses: For $M = 4$, the learned constellation closely resembles QPSK (4 points on a circle). For $M = 8$, it discovers a rotated 8-PSK or occasionally a (1,7) arrangement (1 point at the origin, 7 on a circle) --- which has a slightly better minimum-distance packing.
- Rayleigh fading: Under fading without CSI at the transmitter, the autoencoder learns constellations that are more robust to amplitude variations, often placing points at varying radii rather than all on a single circle.
- Non-linear channels: When a non-linear power amplifier (PA) model is included in the channel, the autoencoder jointly learns pre-distortion (at the encoder) and equalisation (at the decoder).
These results validate the autoencoder approach: it rediscovers known optimal solutions when the theory is well-established, and discovers novel solutions when the channel is too complex for closed-form analysis.
Common Mistake: NN Trained at One SNR Degrades at Others
Mistake:
Training a neural-network channel estimator or detector at a single SNR value (e.g., 20 dB) and deploying it across all operating conditions without retraining or SNR conditioning.
Correction:
A network trained at 20 dB learns denoising behaviour appropriate for that noise level. At 5 dB, the noise is 15 dB stronger and the network under-regularises, resulting in worse performance than a classical MMSE estimator tuned for 5 dB. Two solutions:
- Train across a range of SNR values by sampling the training SNR uniformly from [0, 30] dB.
- Condition on SNR: Feed the estimated SNR as an auxiliary input to the network, allowing it to adapt its behaviour.
Empirically, SNR-conditioned networks match or exceed per-SNR specialists, with a single model serving all operating points.
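A minimal sketch of both fixes, written for a generic denoising-style input; the SNR range, feature layout, and normalisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_training_batch(batch, n_in):
    """Draw a fresh training SNR per example and expose it to the network."""
    snr_db = rng.uniform(0.0, 30.0, size=batch)       # fix 1: randomise the training SNR
    sigma = 10 ** (-snr_db / 20)

    # Placeholder "clean" targets; in the channel-estimation example these would be
    # the true channels, and the noisy observations would be the LS pilot estimates.
    clean = rng.standard_normal((batch, n_in))
    noisy = clean + sigma[:, None] * rng.standard_normal((batch, n_in))

    # Fix 2: SNR conditioning --- append the (estimated) SNR as an auxiliary input
    # feature, so a single network can adapt its denoising strength.
    features = np.concatenate([noisy, snr_db[:, None] / 30.0], axis=1)
    return features, clean

X, target = make_training_batch(batch=64, n_in=32)
print(X.shape)   # (64, 33): 32 observation features + 1 normalised SNR feature
```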
Common Mistake: Autoencoder Fails with Non-Differentiable Channel
Mistake:
Attempting to train an end-to-end autoencoder with backpropagation when the channel model includes non-differentiable operations (quantisation, discrete feedback, hard detection).
Correction:
Standard backpropagation requires a differentiable path from decoder loss to encoder parameters. Non-differentiable operations block gradient flow. Use:
- Straight-through estimator (STE) for quantisation layers
- GAN-based surrogate channel that is differentiable
- REINFORCE / policy gradient for discrete actions
- Gumbel-Softmax relaxation for discrete selections
Each approach has trade-offs: STE introduces gradient bias, GAN surrogates add training complexity, and REINFORCE has high variance. The choice depends on the specific non-linearity.
Historical Note: The Autoencoder Revolution: O'Shea and Hoydis (2017)
Timothy O'Shea and Jakob Hoydis introduced the end-to-end autoencoder concept for communication systems in their landmark 2017 IEEE TCCN paper. By treating the entire transmitter-channel-receiver chain as a single neural network and training it to minimise classification error, they demonstrated that the network independently discovers classical constellation geometries (QPSK, 8-PSK) for AWGN channels --- without any knowledge of modulation theory. The paper ignited a wave of research in "learning to communicate" and has been cited over 3,000 times. The autoencoder framework showed that deep learning could be applied not just as a tool within existing systems but as a fundamentally new approach to system design.
Historical Note: Gregor and LeCun: LISTA and the Birth of Deep Unfolding (2010)
Karol Gregor and Yann LeCun introduced LISTA (Learned ISTA) at ICML 2010, establishing the deep unfolding paradigm. The key insight was deceptively simple: take a well-known iterative algorithm (ISTA), unfold a fixed number of iterations into layers, make the per-iteration parameters learnable, and train end-to-end. The result --- 10 LISTA layers matching 100+ ISTA iterations --- demonstrated that algorithm structure is a powerful inductive bias. This idea has since been applied to ADMM, belief propagation, WMMSE, and dozens of other algorithms across signal processing, computer vision, and wireless communications.
Neural Network Inference Latency in Real-Time PHY
Physical-layer processing must complete within strict time budgets:
- OFDM symbol duration: 66.7 μs (5G NR, 15 kHz SCS)
- Slot duration: 0.5 ms (5G NR, 30 kHz SCS)
- HARQ feedback deadline: 4--8 slots (2--4 ms)
A neural-network channel estimator or detector must complete inference within a fraction of the symbol duration. Practical constraints:
- FPGA inference: A 2-layer FC network with 5000 parameters achieves ~1 μs latency on a Xilinx Zynq UltraScale+ at 200 MHz clock. This fits comfortably within the time budget.
- GPU inference: Batch processing of 1000 subframes takes ~0.5 ms on an NVIDIA A100, but the latency per subframe is dominated by kernel launch overhead (~10 μs).
- Quantisation: INT8 quantisation reduces model size by 4x and doubles throughput on FPGA, with <0.5 dB performance loss for small networks. INT4 is feasible for inference-only.
- LISTA layers: Each LISTA layer involves a matrix-vector product ($nm$ MACs for an $n \times m$ matrix) and element-wise soft-thresholding ($n$ ops). For the configuration quoted here this amounts to 40,960 MACs in total, achievable in <1 μs on DSP hardware (a minimal per-layer sketch appears below).
- 5G NR symbol: 66.7 μs (15 kHz SCS)
- FPGA inference: ~1 μs for a 5000-parameter FC network
- INT8 quantisation: <0.5 dB loss, 4x size reduction
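To make the per-layer cost concrete, here is a minimal LISTA-style layer together with its MAC count; the matrix sizes are hypothetical and are not the configuration quoted above.

```python
import numpy as np

def soft_threshold(v, lam):
    """Element-wise soft-thresholding (the only non-linearity per layer)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def lista_layer(x_prev, y, W, S, lam):
    """One unfolded layer: x_next = soft(W y + S x_prev, lam).
    MAC count: W is (n x m), S is (n x n) -> n*m + n*n MACs per layer."""
    return soft_threshold(W @ y + S @ x_prev, lam)

# Hypothetical sizes, chosen only to make the per-layer MAC count concrete.
n, m = 64, 32
W, S, lam = np.zeros((n, m)), np.zeros((n, n)), 0.1
x = lista_layer(np.zeros(n), np.zeros(m), W, S, lam)
print("MACs per layer:", n * m + n * n)   # 6144 for these assumed sizes
```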
Supervised Learning
A machine learning paradigm where a model is trained on labelled pairs to minimise an empirical loss. In wireless PHY, the labels are typically channel coefficients (for estimation) or transmitted symbols (for detection), obtained from simulation or calibration.
Related: Deep Unfolding (Algorithm Unrolling), Communication Autoencoder
Communication Autoencoder
A neural network architecture that jointly optimises encoder (transmitter) and decoder (receiver) end-to-end, treating the communication chain as a single autoencoder. The encoder learns optimal constellation geometry and the decoder learns the corresponding detection rule.
Related: Supervised Learning
Quick Check
An end-to-end autoencoder for 16-ary signalling is trained using categorical cross-entropy loss. If the decoder outputs a uniform distribution for all messages (i.e., it has not learned anything), what is the training loss per sample?
2.77 bits
$\ln 16 \approx 2.77$ nats
16 nats
The cross-entropy of a uniform prediction over 16 messages is $-\log(1/16) = \ln 16 \approx 2.77$ nats (equivalently, 4 bits). As training progresses, the loss should decrease toward 0 (perfect classification) or toward a floor set by the channel noise.
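A two-line numpy check of this value:

```python
import numpy as np

M = 16
print(-np.log(1.0 / M))    # 2.7726 nats
print(-np.log2(1.0 / M))   # 4.0 bits
```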
Quick Check
A neural-network channel estimator is trained at SNR = 20 dB and deployed at SNR = 5 dB. Which statement is most accurate?
The NN will perform optimally at 5 dB since neural networks generalise perfectly across SNR
The NN will likely perform worse than an MMSE estimator tuned for 5 dB, because it learned denoising behaviour appropriate for 20 dB noise levels
The NN will refuse to produce an estimate at a mismatched SNR
The performance will be identical because the LS input is the same
A network trained at 20 dB has learned to denoise mildly; at 5 dB the noise is 15 dB stronger and the network under-regularises. Training across a range of SNR values (or conditioning on SNR via an auxiliary input) is essential for robust deployment.