Training and Deployment Considerations

Making ML Work in Production

The theoretical sections of this chapter showed that ML receivers can match or beat classical designs. The practical question is how to actually train, deploy, and maintain them. Training data, infrastructure, drift, privacy, and adversarial robustness are all real challenges. This section discusses them end to end and offers deployment patterns for 6G OTFS with ML.

Definition:

Training Data for ML OTFS

Training data for ML OTFS receivers comes in three categories:

Simulated: synthetic channels from 3GPP models (Urban Micro, Rural Macro, HST). Cheap, diverse, controllable. Typical: \sim 10^6 samples, a few hours of generation.

Emulated: channel emulator (hardware or high-fidelity software) reproduces real-world conditions. Intermediate fidelity.

Real-world: measured channels from BS/UE logs. Most authentic but expensive and privacy-constrained. Typical: 10^4-10^5 samples.

Best practice: train on simulated data for bulk coverage, fine-tune on emulated data for site-specific conditions, and adapt online on real-world data for final tuning. Combined mix: 80% simulation, 15% emulation, 5% real-world.

Theorem: Simulation-to-Real Gap

A NN trained on pure simulation typically shows \sim 2-3 dB BER degradation when deployed on real channels (the "sim-to-real gap"). The gap arises from:

  • Hardware non-idealities not captured in simulation.
  • Fractional Doppler structure different from idealized models.
  • Interference patterns from real co-deployed services.

Mitigation:

  • Domain randomization: train on wide parameter distributions (not fixed values). Reduces the gap to \sim 1 dB.
  • Domain adaptation: fine-tune on small real-world data. Recovers most of the gap.
  • Adversarial training: inject synthetic perturbations. Improves OOD robustness.

Total: with these mitigations, the sim-to-real gap drops to \sim 0.5-1 dB. Acceptable for deployment.
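Domain randomization amounts to sampling each channel parameter from a wide distribution rather than fixing it. A minimal sketch, with hypothetical parameter ranges (a real pipeline would take ranges from the 3GPP channel-model spec):

```python
import random

# Hypothetical parameter ranges for domain randomization; real values
# would come from the 3GPP channel-model specification.
PARAM_RANGES = {
    "max_doppler_hz":  (10.0, 2000.0),   # pedestrian through HST speeds
    "delay_spread_ns": (30.0, 1000.0),
    "snr_db":          (-5.0, 25.0),
    "num_paths":       (2, 12),
}

def sample_channel_config(rng=random):
    """Draw one randomized channel configuration for training-data generation."""
    cfg = {}
    for name, (lo, hi) in PARAM_RANGES.items():
        if isinstance(lo, int):
            cfg[name] = rng.randint(lo, hi)   # inclusive integer range
        else:
            cfg[name] = rng.uniform(lo, hi)
    return cfg
```

Each training frame is then generated under a freshly sampled configuration, so the NN never overfits to a single idealized channel.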

Simulation is the friend of ML training (infinite data, perfect labels), but a naive simulator produces non-representative data. Good practice: make the simulator as realistic as feasible (include HW imperfections, non-Gaussian noise, and band-specific behaviors), then fine-tune on real-world data. The sim-to-real gap is a standard, well-studied ML deployment challenge.

Definition:

Federated Learning for OTFS

Federated learning (FL) trains ML models without centralizing training data:

  • Each UE/BS trains locally on its own channel observations.
  • Periodically, local model updates (gradients or weights) are aggregated at a server.
  • Server averages updates and distributes the new global model.
  • Cycle repeats.
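The aggregation step above is classically done with FedAvg, a sample-size-weighted average of client weights. A minimal numpy sketch (flattened weight vectors; a real system would aggregate per-layer tensors):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client model weights (FedAvg).

    client_weights: list of 1-D arrays, one flattened weight vector per client.
    client_sizes:   number of local training samples at each client.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    w = np.stack(client_weights)                    # shape (K, D)
    return (sizes[:, None] * w).sum(axis=0) / sizes.sum()
```

A client holding 3x more data pulls the global model 3x harder, which is what makes the i.i.d. assumption below matter: biased clients bias the average.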

Advantages:

  • Privacy: UE data never leaves the UE.
  • Efficiency: no centralized data collection.
  • Personalization: each UE can fine-tune the global model locally.

Disadvantages:

  • Communication overhead for model updates.
  • Convergence is slower than centralized training.
  • Clients may hold biased (non-i.i.d.) data.

OTFS application: federated training of NN detectors and learned pilots. Model updates: \sim 10 MB per UE per round, \sim 100 rounds to convergence. Total: \sim 1 GB per UE, which is acceptable.
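The overhead arithmetic above (figures from the text) is easy to check and to re-run for other update sizes or round counts:

```python
def fl_uplink_overhead_gb(update_mb=10, rounds=100):
    """Per-UE uplink traffic for federated training.

    Defaults use the figures from the text: ~10 MB per update,
    ~100 rounds to convergence -> ~1 GB per UE total.
    """
    return update_mb * rounds / 1000  # MB -> GB
```

Halving the update size (e.g. by quantizing weights) halves the total, which is the usual first lever when uplink budgets are tight.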

Theorem: Federated vs Centralized Performance

Federated learning with K clients, each with N samples, achieves performance comparable to centralized training on K \cdot N samples, subject to:

  • Clients have independent data (i.i.d. assumption).
  • Aggregation happens frequently enough.

Gap: FL converges in \mathcal{O}(K) more rounds than centralized training, but the total compute is the same.

Practical 6G: K = 100-1000 UEs contributing, each with 10^3-10^4 frames. Total: 10^5-10^7 samples equivalent. Sufficient for NN OTFS detector training.

FL embodies a privacy-performance trade-off: centralized training has access to all data but violates privacy; FL protects privacy at a slight training cost. For commercial 6G, where privacy regulation (GDPR, CCPA) forbids centralized data collection, FL is the only viable approach.

Example: Federated OTFS Receiver Deployment

A Tier-1 operator deploys federated learned-pilot training for 6G OTFS: 1000 UEs across 10 cities. Describe the training/deployment flow.

Figure (interactive): Federated vs Centralized Training Convergence. Training loss and inference BER vs. epoch for centralized and federated learning. Sliders: client count K, rounds per epoch.


Definition:

Adversarial Robustness

Adversarial examples are small, specifically crafted input perturbations that cause misclassification: \hat{x}_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(x)) for adversarial loss L. For OTFS receivers, adversarial perturbations can be injected into the received signal by a malicious transmitter (a jamming-attack analog).
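The perturbation above is the fast gradient sign method (FGSM). A minimal sketch on a toy linear "detector" W x with squared-error loss, so the gradient is available in closed form (a real attack would backpropagate through the NN detector):

```python
import numpy as np

def fgsm_perturb(x, W, y_target, eps):
    """FGSM: x_adv = x + eps * sign(grad_x L(x)), with L = ||W x - y_target||^2 / 2.

    Toy linear model W @ x stands in for the NN detector; the closed-form
    gradient of L with respect to x is W^T (W x - y_target).
    """
    grad = W.T @ (W @ x - y_target)
    return x + eps * np.sign(grad)
```

Note the perturbation budget eps bounds each sample's change, so the attack stays small in amplitude while being maximally aligned with the loss gradient.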

OTFS susceptibility:

  • Pure-NN detector: high susceptibility. A perturbation of only 0.5 dB relative power can worsen BER by 10x.
  • Classical MP: robust. Perturbation unlikely to cause ML-specific failures.
  • Unfolded MP: intermediate. The embedded algorithmic structure provides partial robustness.

Defenses:

  • Adversarial training: include adversarial examples in training data. NN learns to resist.
  • Detection: flag anomalous inputs and fall back to classical.
  • Input smoothing: denoise before NN detector. Reduces perturbation magnitude.
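The detection defense can be sketched as a simple gate: score each input for anomaly and fall back to the classical detector when the score is out of range. Here the score is just received-power deviation with a hypothetical tolerance; a deployed system would use a learned or statistical anomaly detector:

```python
import numpy as np

def detect_with_fallback(y, nn_detector, mp_detector,
                         expected_power=1.0, tol=3.0):
    """Run the NN detector only when the input looks in-distribution.

    Anomaly score: deviation of mean received power from its expected value
    (expected_power, tol are illustrative). Anomalous frames fall back to
    the classical MP detector, which is robust to ML-specific attacks.
    """
    power = np.mean(np.abs(y) ** 2)
    if abs(power - expected_power) > tol * expected_power:
        return mp_detector(y)   # classical fallback on anomalous input
    return nn_detector(y)
```

The fallback keeps the worst-case behavior at classical-MP level: an attacker who trips the gate only buys back the un-attacked classical performance.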

Theorem: Adversarial Training Performance

An NN trained with adversarial examples achieves \mathrm{BER}_{\mathrm{adv}} \leq 2 \cdot \mathrm{BER}_{\mathrm{clean}}, whereas a naively trained NN exhibits \mathrm{BER}_{\mathrm{adv}} \geq 10 \cdot \mathrm{BER}_{\mathrm{clean}} under attack.

Cost: \sim 2\times training time and \sim 0.5 dB worse performance on clean data.

For 6G V2X safety, adversarial training is mandatory.

Adversarial robustness is critical for safety-critical applications. The extra training time is a one-time cost; inference remains unchanged. Clean-data performance degrades slightly, while performance under attack improves dramatically.

🔧Engineering Note

ML OTFS Deployment Patterns

ML OTFS deployment patterns:

Pattern 1: Pre-trained vendor models

  • Vendor (Qualcomm, MediaTek) trains NN detectors on broad simulated data.
  • Deployed to UE chip. Fixed. No UE-side training.
  • Pro: simple. Con: no personalization.

Pattern 2: Federated personalization

  • Vendor pre-trains. UE fine-tunes on local data.
  • Federated updates for global improvement.
  • Pro: privacy-preserving. Con: complex coordination.

Pattern 3: Online learning

  • UE continuously fine-tunes NN on measured channel.
  • Fastest adaptation. Privacy-preserving.
  • Pro: best performance in stable environments. Con: sensitive to environment changes.

Pattern 4: Hybrid

  • Vendor model + Federated updates + online fine-tuning.
  • Maximum flexibility. Standard for 6G commercial deployment.

2026 reality: Pattern 1 dominates (simpler). Pattern 2 in pilot. Pattern 4 expected 2030+.

Practical Constraints
  • Pattern 1: vendor-only (simplest, fixed)

  • Pattern 2: federated (privacy + personalization)

  • Pattern 3: online (adaptive, noisy)

  • Pattern 4: hybrid (6G commercial)

Common Mistake: Respect Privacy Constraints

Mistake:

Centralizing UE channel data for ML training, violating GDPR, CCPA, or similar privacy regulations.

Correction:

Design ML training for privacy from the ground up:

  • Federated learning (§3): training happens at UE.
  • Differential privacy: add noise to model updates.
  • Secure aggregation: cryptographic sum of updates.
  • Anonymization: remove UE identifiers before any data handling.
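The differential-privacy bullet above is usually implemented as clip-then-noise on each model update before aggregation. A minimal sketch with illustrative constants (calibrated DP would derive noise_std from the (epsilon, delta) budget and the clip norm):

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a model update to a norm bound, then add Gaussian noise.

    clip_norm and noise_std are illustrative; a calibrated DP mechanism
    would set noise_std from the privacy budget and the clip norm.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=update.shape)
```

Clipping bounds any single UE's influence on the aggregate; the noise then masks individual contributions, which is the source of the \sim 1 dB gap quoted below.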

Compliance costs: training complexity increases by \sim 20\%. Performance gap vs non-private training: \sim 1 dB. Acceptable for regulatory compliance. Commercial deployments 2028+ will mandate privacy-preserving ML for the physical layer.