Model-Based vs Data-Driven Design
Choosing the Right ML Paradigm for Wireless
The preceding sections presented three paradigms for applying machine learning to wireless communications:
- Black-box neural networks (Section 31.1): Learn mappings directly from data without incorporating domain knowledge.
- Deep unfolding / model-based ML (Section 31.2): Embed the structure of known algorithms into the network architecture.
- Reinforcement learning (Section 31.3): Learn policies through interaction without explicit supervision.
This section provides a systematic comparison of model-based and data-driven (black-box) approaches along four axes: sample efficiency, generalisation, interpretability, and computational cost. The goal is to equip the reader with design guidelines for selecting the appropriate paradigm for a given wireless problem.
The comparison is particularly important in wireless, where domain knowledge (channel models, signal processing theory, information-theoretic bounds) is rich and well-established. The central question is: how much domain knowledge should be injected into the ML model, and in what form?
Definition: The Model-Based to Data-Driven Spectrum
ML approaches for wireless can be placed on a continuum from fully model-based to fully data-driven: classical algorithm (no learning) → learned hyperparameters → deep unfolding → black-box neural network.
Each step to the right trades domain knowledge for flexibility:
- More domain knowledge → fewer parameters → better sample efficiency and interpretability
- More flexibility → more parameters → better potential performance under model mismatch, but higher data requirements
The optimal operating point depends on:
- How accurate the domain model is
- How much training data is available
- The computational budget for training and inference
- Whether interpretability/certifiability is required
Sample Efficiency: The Key Advantage of Model-Based ML
Sample efficiency measures how many training examples are needed to achieve a target performance level. It is arguably the most important criterion in wireless, where labelled training data is expensive (requires over-the-air measurements or high-fidelity simulation).
Consider sparse channel estimation with N subcarriers, M < N measurements, and sparsity K:
- Black-box NN (fully connected network): thousands of parameters. Needs 500--1000 training samples to converge.
- LISTA (10 layers, only thresholds learned): 10 scalar threshold parameters. Needs 20--50 training samples (or even 0 if initialised from ISTA and used without fine-tuning).
This advantage in sample efficiency arises because LISTA encodes the structure of the problem (the measurement matrix and the proximal operator) into the architecture. The black-box NN must learn this structure from data, requiring far more examples.
Rule of thumb: As a rough guide, the number of training samples should be at least five times the number of learnable parameters for good generalisation. Model-based approaches, by having fewer parameters, directly reduce the data requirement.
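To make the parameter-count argument concrete, here is a minimal NumPy sketch of an unfolded ISTA in which only the per-layer thresholds would be learned. The function names, matrix sizes, and threshold values are illustrative, not taken from the text:

```python
import numpy as np

def soft_threshold(v, theta):
    """Proximal operator of the l1 norm: the ISTA shrinkage step."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def unfolded_ista(y, A, thetas):
    """Run one ISTA iteration per entry of `thetas`.

    In a LISTA-style network, only these scalar thresholds are learned;
    the measurement matrix A is baked into the architecture.
    """
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the data-fit gradient
    x = np.zeros(A.shape[1])
    for theta in thetas:                 # each loop pass = one network layer
        x = soft_threshold(x + A.T @ (y - A @ x) / L, theta)
    return x
```

Ten layers means ten learnable scalars; a fully connected network mapping y to x would need orders of magnitude more parameters to represent the same computation.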
Model-Based vs Black-Box Sample Efficiency
Compare the NMSE (normalised mean-squared error) of a model-based approach (LISTA with known measurement matrix) and a black-box neural network, as a function of the number of training samples. The model-based approach leverages the measurement matrix to initialise its weights, achieving low NMSE even with few training samples. The black-box NN must learn everything from data and requires substantially more samples to match. Increase the SNR to see both methods improve (less noise), but the relative advantage of the model-based approach persists, especially in the low-data regime.
Generalisation: Robustness to Distribution Shift
Generalisation refers to the ability to perform well on data that differs from the training distribution. In wireless, distribution shifts are the rule rather than the exception:
- A model trained in urban macrocell channels is deployed in an indoor small cell.
- A model trained at 3.5 GHz is used at 28 GHz.
- A model trained with 4 users encounters 8 users.
Model-based approaches generalise better because their inductive bias captures physical invariants that hold across distributions:
- LISTA inherits the ISTA convergence guarantee for any measurement matrix, so even with learned thresholds it remains a valid proximal operator.
- An unfolded WMMSE algorithm preserves the alternating optimisation structure that guarantees monotone sum-rate improvement.
Black-box NNs, in contrast, may memorise training-distribution-specific patterns that fail under shift. Empirically, deep unfolding methods degrade gracefully (e.g., 2--5 dB NMSE increase) under moderate distribution shift, while black-box networks can fail catastrophically (10--20 dB degradation or worse).
Mitigation strategies for black-box models:
- Domain randomisation: Train on a diverse mixture of channel models, SNR levels, and system configurations.
- Meta-learning: Train a model that can quickly adapt to new distributions with a few gradient steps (MAML, Reptile).
- Online fine-tuning: Continuously update the model on incoming data at deployment.
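The first of these strategies, domain randomisation, can be sketched in a few lines. The SNR range, tap counts, and flat-fading model below are illustrative placeholders, not values from the text:

```python
import numpy as np

def randomised_example(rng):
    """Draw one training example from a randomised mixture of conditions
    (domain randomisation): random SNR and random channel richness."""
    snr_db = rng.uniform(0.0, 30.0)            # randomise operating SNR
    n_taps = int(rng.integers(1, 9))           # randomise delay spread
    h = (rng.normal(size=n_taps) + 1j * rng.normal(size=n_taps)) / np.sqrt(2 * n_taps)
    x = rng.choice(np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])) / np.sqrt(2)  # QPSK
    noise = 10 ** (-snr_db / 20) * (rng.normal() + 1j * rng.normal()) / np.sqrt(2)
    y = h.sum() * x + noise                    # flat-fading observation
    return y, x

rng = np.random.default_rng(0)
dataset = [randomised_example(rng) for _ in range(1000)]
```

A network trained on such a mixture sees many operating points during training, which reduces (but does not eliminate) its sensitivity to the distribution shifts listed above.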
Systematic Comparison
The following table summarises the key trade-offs:
| Criterion | Model-Based (Deep Unfolding) | Data-Driven (Black-Box NN) |
|---|---|---|
| Sample efficiency | Excellent (few parameters to learn) | Poor (many parameters) |
| Generalisation | Good (physics-based inductive bias) | Variable (depends on training diversity) |
| Performance ceiling | Limited by algorithm structure | Potentially higher (more flexible) |
| Interpretability | High (each layer = one algorithm iteration) | Low (opaque mapping) |
| Design effort | High (must derive the unfolded algorithm) | Low (standard NN architecture) |
| Inference speed | Fast (few layers, structured operations) | Variable (depends on architecture) |
| Adaptability | Moderate (re-train thresholds) | High (re-train entire network) |
| Theoretical analysis | Easier (convergence guarantees exist) | Harder (few guarantees) |
When to use model-based ML:
- The underlying algorithm is known and performs reasonably well
- Training data is scarce (tens to hundreds of labelled samples)
- Interpretability or safety certification is required
- The deployment environment may differ from training
When to use black-box NN:
- No good algorithm exists for the problem
- Abundant training data is available (thousands of samples or more)
- The channel model is too complex for closed-form treatment
- Maximum performance is the priority over interpretability
Example: Choosing an ML Approach for MIMO Detection
A MIMO system with 64-QAM modulation requires a detector. The classical MMSE detector achieves acceptable BER at high SNR but degrades at low SNR and under channel estimation errors. The engineering team has access to 1000 labelled training samples (transmitted symbol, received signal pairs) and requires the detector to work across a range of channel conditions.
Recommend an ML approach and justify your choice.
Assess the problem characteristics
- Known algorithm exists: MMSE detection, x̂ = (H^H H + σ² I)^{-1} H^H y. Also iterative detectors: gradient descent projected onto the constellation (projected gradient descent, PGD).
- Training data: 1000 samples --- moderate.
- Deployment: Multiple channel conditions --- robustness is important.
- Dimensions: many antennas and a large constellation --- a high-dimensional discrete output.
Evaluate options
- Black-box NN (fully connected, operating per real dimension): 12,000 parameters. With 1000 training samples and a rule-of-thumb of five samples per parameter, this is data-starved. Likely to overfit.
- Deep unfolding of PGD ("DetNet"): Unfold 10 iterations of projected gradient descent. Per-layer parameters: a step size and a learned perturbation. Total: 200 parameters. Well within the data budget.
- MMSE + learned residual: Use MMSE as a first stage, then train a small NN to correct the residual. Roughly 500 parameters. Also feasible.
Recommendation
Deep unfolding (DetNet) is the recommended approach:
- Sample efficiency: 200 parameters vs 1000 samples gives a comfortable 1:5 parameter-to-sample ratio.
- Generalisation: The PGD structure ensures that the detector always moves toward lower detection cost, providing a safety net under distribution shift.
- Interpretability: Each layer has a clear meaning (one gradient descent step + projection).
- Performance: Published results show that a 10-layer DetNet approaches maximum-likelihood (ML) detection performance at a fraction of the ML detector's complexity.
The black-box NN would be preferred only if the channel model were highly non-standard (e.g., severe hardware impairments that invalidate the linear model y = Hx + n) and abundant data were available.
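A minimal sketch of the unfolded-PGD idea follows. It uses fixed step sizes and a hard nearest-symbol projection in place of DetNet's learned step sizes and soft projection, so it is an illustration of the structure rather than a faithful DetNet implementation:

```python
import numpy as np

def unfolded_pgd_detect(y, H, step_sizes, constellation):
    """One projected-gradient step per entry of `step_sizes`.

    In a DetNet-style detector the step sizes (and a learned perturbation)
    are trained from data; here they are fixed for illustration, and a hard
    nearest-symbol projection replaces the learned soft projection.
    """
    x = np.zeros(H.shape[1])
    for mu in step_sizes:
        x = x + mu * H.T @ (y - H @ x)    # gradient step on ||y - Hx||^2
        # project each coordinate onto the nearest constellation point
        x = constellation[np.argmin(np.abs(x[:, None] - constellation[None, :]), axis=1)]
    return x
```

With real BPSK symbols and a well-conditioned channel this already detects correctly in the high-SNR limit; the learned components earn their keep in the hard low-SNR and ill-conditioned cases.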
Hybrid Approaches: The Best of Both Worlds
In practice, the most successful wireless ML systems are hybrid, combining model-based and data-driven components:
- Model-based backbone + NN refinement: Use a classical algorithm (MMSE, OFDM, LDPC decoder) as the primary processing pipeline and attach a small NN to correct residual errors (e.g., non-linear PA distortion, imperfect CSI). This is robust (the classical algorithm handles the bulk of the processing) and data-efficient (the NN only needs to learn the residual).
- NN feature extraction + model-based decision: Use a CNN or transformer to extract features from raw I/Q samples, then feed these features into a model-based algorithm (e.g., beamforming based on extracted AoA/AoD). The NN handles the hard perception task; the algorithm handles the structured optimisation.
- Learned hyperparameters: Keep the algorithm fixed but learn its hyperparameters (step sizes, regularisation weights, constellation scaling) from data. This is the lightest-weight ML integration and often provides surprisingly large gains.
The overarching principle is: inject as much domain knowledge as possible into the architecture, and let the data fill in what the model misses.
Open Challenges in ML for Wireless
Despite the rapid progress, several fundamental challenges remain open:
1. Standardisation and deployment. Unlike computer vision (where ImageNet and ResNet are universal benchmarks), wireless ML lacks standardised datasets, channel models, and evaluation protocols. The O-RAN Alliance is working toward open interfaces that enable ML-based RAN intelligent controllers (RICs), but deployment in commercial networks remains limited.
2. Real-time constraints. Physical-layer processing must complete within microseconds (OFDM symbol duration ≈ 70 µs in 5G). NN inference on specialised hardware (FPGAs, ASICs) can meet this, but training and adaptation at such timescales is an open problem.
3. Safety and robustness. Adversarial examples can fool NN-based detectors and decoders. Providing formal guarantees (e.g., worst-case BER bounds) for ML-based systems is largely unsolved.
4. Transfer across systems. A model trained for one carrier frequency, antenna configuration, or environment must adapt to new conditions. Meta-learning and few-shot adaptation are promising but not yet mature for wireless.
5. Energy efficiency. Training large ML models has a significant carbon footprint. Developing green AI methods that achieve good performance with minimal computation is critical for sustainable deployment.
Model-Based vs Data-Driven ML for Wireless
| Criterion | Model-Based (Deep Unfolding) | Data-Driven (Black-Box NN) |
|---|---|---|
| Sample efficiency | Excellent (10-100 parameters) | Poor (1000+ parameters) |
| Generalisation | Good (physics-based bias) | Variable (depends on training) |
| Performance ceiling | Limited by algorithm | Potentially higher |
| Interpretability | High (layer = iteration) | Low (opaque) |
| Design effort | High (derive unfolded form) | Low (standard architecture) |
| Inference speed | Fast (structured ops) | Variable |
| Theoretical analysis | Convergence guarantees | Few guarantees |
| Best when | Scarce data, known algorithm | Abundant data, complex model |
Key Takeaway
The central principle for ML in wireless is: inject as much domain knowledge as possible into the architecture, and let the data fill in what the model misses. Deep unfolding achieves 10--100× better sample efficiency than black-box networks by encoding algorithm structure as an inductive bias. In the data-scarce regime that characterises wireless (labelled samples cost over-the-air measurements or high-fidelity simulation), model-based ML is almost always the right starting point. Black-box networks should be reserved for problems where no suitable algorithm exists or the channel model is too complex for analytical treatment.
Why This Matters: Secure and Distributed Computing in the SC Book
The federated learning and secure aggregation techniques in this chapter connect to the broader theory of secure and distributed computing developed in the SC (Secure Computing) book:
- Secure aggregation protocols: MPC-based approaches, Shamir's secret sharing, and the ByzSecAgg framework (CommIT contribution: Jahani-Nezhad, Maddah-Ali, Caire)
- Differential privacy: Formal privacy guarantees and the privacy-utility trade-off in gradient sharing
- Byzantine fault tolerance: Detecting and mitigating adversarial client updates in distributed training
- Over-the-air computation: Using the wireless MAC channel for native gradient aggregation
Readers interested in the theoretical foundations of privacy-preserving distributed learning should consult the SC book.
Quick Check
A research team develops a deep unfolding network for sparse channel estimation. The network has 15 learnable thresholds (one per layer) and is initialised with the known measurement matrix. A competing team trains a fully connected black-box network with 5000 parameters from random initialisation. Both teams have access to 200 training channel realisations. Which outcome is most likely?
The black-box network significantly outperforms deep unfolding because it has more parameters and thus more capacity
Both networks perform identically because 200 samples is sufficient for either approach
The deep unfolding network outperforms the black-box because 200 samples provides ample data for 15 parameters but severely under-trains 5000 parameters
Neither network will work because 200 samples is always insufficient for ML
With only 200 training samples, the black-box network (5000 parameters) has a parameter-to-sample ratio of 25:1, far above the recommended 1:5 ratio. It will likely overfit. The deep unfolding network (15 parameters) has a ratio of 1:13, well within the safe regime. Moreover, its physics-informed initialisation means it starts close to the optimum and needs minimal fine-tuning.
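The ratios quoted above are easy to verify directly:

```python
params_blackbox = 5000   # fully connected black-box network
params_unfolded = 15     # learnable thresholds in the deep unfolding network
n_samples = 200          # training channel realisations available

# parameter-to-sample ratio of the black-box network: 25 parameters per sample,
# far above the recommended 1:5 (at least five samples per parameter)
ratio_blackbox = params_blackbox / n_samples

# the unfolded network instead has about 13 samples per parameter
ratio_unfolded = n_samples / params_unfolded
```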