RL for Scheduling and Power Control

Reinforcement Learning for a Problem That Already Has Answers

Chapter 5 treated power control and scheduling as convex optimization problems with clean closed-form or iterative solutions: max-min fairness by bisection, proportional fairness by a gradient method, weighted sum-rate maximization by the Björnson-Hoydis-Sanguinetti framework. Every one of those problems has a rigorous convergence guarantee and a deterministic implementation that runs in a handful of milliseconds per scheduling interval on a commodity DSP.

So why consider reinforcement learning at all? Three reasons:

  1. Non-convex extensions. Once you introduce per-antenna power constraints with nonlinear amplifiers, per-user QoS tiers with hard deadlines, or cross-slot memory via HARQ, the optimization becomes non-convex and the clean algorithms of Chapter 5 lose their guarantees. An RL policy trained on these nonconvexities can sometimes find better operating points than iterative heuristics.

  2. Model uncertainty. The optimization in Chapter 5 assumes perfect CSI and known interference statistics. Under imperfect CSI (the common case) the optimal policy is formally an MDP policy, and policy gradient methods are one way to compute it.

  3. Cross-layer objectives. When the reward involves application-layer goodput, queue latency, or fairness over long windows, the problem is a control problem more than a mathematical program. RL handles the temporal credit assignment directly.

This section is structured around PPO (Proximal Policy Optimization) because it is the de facto standard in modern wireless RL papers. We develop the MDP formulation, write the PPO update, look at a training-reward trajectory, and then spend the second half of the section being honest about what goes wrong in production: reward gaming, training instability, and the simulation-to-reality gap that has prevented RL from actually shipping in commercial gNBs. Chapter 5's iterative methods remain the workhorse; RL is the augmentation, not the replacement.

Definition:

Scheduling as a Markov Decision Process

A Markov decision process for downlink scheduling at a massive MIMO BS is the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$:

  • State $s_t \in \mathcal{S}$: the BS-observable state at slot $t$. Typically includes per-user CSI estimates $\{\hat{\mathbf{H}}_k\}$, queue lengths $\{q_k\}$, historical throughputs $\{R_k^{\text{avg}}\}$, and possibly buffered HARQ state.
  • Action $a_t \in \mathcal{A}$: the scheduling + power decision. For $K$ users this is typically a tuple of (selected user set, per-user power allocation, precoder choice).
  • Transition $p(s_{t+1} \mid s_t, a_t)$: governed by the physical channel (fading, mobility) and the queue dynamics $q_k(t+1) = [q_k(t) - R_k(t)]^+ + A_k(t)$ for arrival process $A_k(t)$.
  • Reward $r(s_t, a_t)$: the designer's objective. Sum-rate $\sum_k R_k$, weighted sum-rate $\sum_k w_k R_k$, max-min $\min_k R_k$, or a long-horizon utility $\sum_k \log R_k^{\text{avg}}$.
  • Discount factor $\gamma \in [0, 1)$ weighting future rewards.

The optimal policy $\pi^{\star}(a \mid s)$ maximizes $\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. In small discrete cases this is solvable by value iteration. For realistic massive MIMO state and action spaces it is not, and we fall back on policy gradient methods with neural network function approximation.
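
To make the state, transition, and reward concrete, here is a minimal environment sketch for one slot of the queue dynamics and a proportional-fairness reward. The class name, the Poisson arrival model, and the EWMA throughput update are illustrative assumptions, not part of any standard simulator.

```python
import numpy as np

class ToySchedulingEnv:
    """Minimal single-cell MDP sketch: K users, queue dynamics, log-utility reward.
    The rate model and arrival process are illustrative placeholders."""

    def __init__(self, K=8, mean_arrival_bits=1e5, ewma=0.05, rng=None):
        self.K = K
        self.mean_arrival_bits = mean_arrival_bits
        self.ewma = ewma                      # smoothing factor for R_k^avg
        self.rng = rng or np.random.default_rng(0)
        self.q = np.zeros(K)                  # queue lengths q_k [bits]
        self.r_avg = np.full(K, 1e3)          # historical throughputs R_k^avg

    def step(self, rates):
        """rates: per-user served rate R_k(t) [bits/slot] chosen by the scheduler.
        Returns (next_state, reward)."""
        served = np.minimum(rates, self.q)             # cannot serve more than is queued
        arrivals = self.rng.poisson(self.mean_arrival_bits, self.K)
        self.q = self.q - served + arrivals            # q_k(t+1) = [q_k(t) - R_k(t)]^+ + A_k(t)
        self.r_avg = (1 - self.ewma) * self.r_avg + self.ewma * served
        reward = float(np.sum(np.log(self.r_avg + 1e-9)))   # proportional-fairness utility
        state = np.concatenate([self.q, self.r_avg])   # CSI features would be appended here
        return state, reward
```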

PPO for Power Control and Scheduling

Complexity: $\mathcal{O}(T \cdot |\phi|)$ per iteration for the policy update, where $|\phi|$ is the number of network parameters. Total training time for a realistic 8-user BS simulation is 10-50 GPU-hours to convergence, far more than the milliseconds of the Chapter 5 iterative algorithms, but done once offline.
Input: MDP simulator, actor network $\pi_\phi(a \mid s)$, critic $V_\psi(s)$, clip parameter $\epsilon$, learning rate $\eta$
Output: Trained policy $\pi_\phi$
1. Initialize $\phi, \psi$ randomly
2. for iteration $= 1, 2, \ldots$ do
3.     Collect $T$ trajectories $\{(s_t, a_t, r_t)\}$ by running $\pi_{\phi_{\text{old}}}$ in the simulator
4.     Compute advantage estimates $\hat{A}_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k} - V_\psi(s_t)$ (a simple return-minus-baseline form; generalized advantage estimation is the usual refinement)
5.     For each minibatch of $(s_t, a_t, \hat{A}_t)$:
6.         Compute ratio $\rho_t(\phi) = \pi_\phi(a_t \mid s_t) / \pi_{\phi_{\text{old}}}(a_t \mid s_t)$
7.         Compute clipped surrogate $L^{\text{clip}}_t = \min\big(\rho_t \hat{A}_t,\ \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \hat{A}_t\big)$
8.         Update $\phi \leftarrow \phi + \eta \nabla_\phi L^{\text{clip}}$
9.         Update $\psi$ by regressing $V_\psi(s_t)$ against $\sum_{k=0}^{T-t} \gamma^k r_{t+k}$
10. end for

The clipping trick in step 7 is what makes PPO stable: it prevents any single update from moving the new policy so far from the old one that the advantage estimate, which is only valid locally around the old policy, no longer applies. PPO has essentially replaced vanilla policy gradient, natural gradient, and TRPO in practice because it is simpler to implement and gives comparable or better empirical results.
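
A minimal PyTorch sketch of steps 6-8 of the algorithm above, assuming the trajectory buffer already holds old log-probabilities and advantage estimates; the `policy.log_prob` interface and the tensor names are illustrative assumptions, not tied to any specific wireless simulator.

```python
import torch

def ppo_policy_loss(policy, states, actions, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (steps 6-7); returns the loss to minimize."""
    logp_new = policy.log_prob(states, actions)        # log pi_phi(a_t | s_t), assumed interface
    ratio = torch.exp(logp_new - logp_old)             # rho_t(phi)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * adv, clipped * adv)  # L_t^clip, elementwise
    return -surrogate.mean()                           # ascend L^clip == descend its negative

# One minibatch update (step 8), assuming `policy` exposes .log_prob and .parameters():
# opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
# loss = ppo_policy_loss(policy, s_mb, a_mb, logp_old_mb, adv_mb)
# opt.zero_grad(); loss.backward(); opt.step()
```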

Figure: PPO Training Reward vs Heuristic Baseline

Training reward trajectory for a PPO power-control agent compared to the weighted sum-rate of the proportional-fairness heuristic from Chapter 5. PPO catches up slowly and can eventually surpass the heuristic, but only after tens of thousands of environment interactions and only on the exact simulator it was trained on.

Example: PPO vs Proportional Fairness on an 8-User Scheduler

An 8-user, 64-antenna BS uses PPO to learn a scheduler targeting proportional fairness. Compare against the analytical PF heuristic (Chapter 5) in terms of (i) achieved utility at convergence, (ii) training cost, and (iii) sensitivity to mobility.
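
For reference, the analytical baseline in this comparison can be sketched as the classic proportional-fairness rule: serve the user maximizing the ratio of instantaneous rate to exponentially averaged throughput. The single-user-per-slot form below is a deliberate simplification of the Chapter 5 multi-user version, shown only to fix ideas.

```python
import numpy as np

def pf_schedule(inst_rates, r_avg, ewma=0.05):
    """One slot of the proportional-fairness heuristic (single user served per slot).

    inst_rates: achievable instantaneous rates R_k(t) for the K users
    r_avg:      running average throughputs R_k^avg
    Returns (selected_user, updated r_avg)."""
    k_star = int(np.argmax(inst_rates / (r_avg + 1e-9)))    # PF metric R_k / R_k^avg
    served = np.zeros_like(inst_rates)
    served[k_star] = inst_rates[k_star]
    r_avg = (1 - ewma) * r_avg + ewma * served               # EWMA update of R_k^avg
    return k_star, r_avg
```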

Common Mistake: Reward Gaming in Wireless RL

Mistake:

A naive reward function "maximize sum-rate" leads the RL agent to starve the weakest users because serving them costs DoF that could go to stronger users. The reported sum-rate looks excellent while the fairness metric collapses. Similar failures appear when the reward mixes HARQ success probability with throughput: the agent learns to lower the MCS to the point where HARQ always succeeds, tanking the effective rate.

Correction:

Use a proportional-fairness-style utility $\sum_k \log R_k^{\text{avg}}$ that explicitly penalizes starvation, or a max-min reward $\min_k R_k$ that values the objective at the weakest user's rate, so the agent gains nothing by boosting already-strong users. Always monitor BOTH the reward curve and an independent fairness metric (Jain's index, 95%-likely rate) during training. If the fairness metric tracks the reward, the reward function is doing its job; if it diverges, the agent is gaming the reward.
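
A minimal sketch of the monitoring suggested above: log an independent fairness report next to the training reward every few iterations. Jain's index and the 5th-percentile (95%-likely) rate are the standard choices; the helper names here are illustrative.

```python
import numpy as np

def jain_index(throughputs):
    """Jain's fairness index: ranges from 1/K (one user gets everything) to 1 (perfectly fair)."""
    x = np.asarray(throughputs, dtype=float)
    return (x.sum() ** 2) / (len(x) * np.sum(x ** 2) + 1e-12)

def fairness_report(r_avg):
    """Independent fairness metrics to plot alongside the training reward."""
    return {
        "pf_utility": float(np.sum(np.log(r_avg + 1e-9))),    # the reward itself
        "jain": float(jain_index(r_avg)),
        "rate_95pct_likely": float(np.percentile(r_avg, 5)),  # 95%-likely = 5th percentile rate
    }
```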

Theorem: PPO Monotonic Improvement (Informal)

Under the PPO clipped surrogate objective with clip parameter $\epsilon > 0$, each policy update produces a policy $\pi_{\phi}$ whose true expected reward is at least that of $\pi_{\phi_{\text{old}}}$ minus a term of order $\epsilon^2$, provided the advantage estimate is unbiased and the simulator is stationary. Equivalently, the clipping enforces a bounded KL-divergence trust region within which local improvement is guaranteed.

Vanilla policy gradient can take arbitrarily large steps and diverge. PPO limits the per-step change in the policy ratio to the interval $[1 - \epsilon, 1 + \epsilon]$, which acts as an approximate KL constraint. Inside that trust region, the linear advantage estimate is valid and improvement is monotonic up to a quadratic remainder.
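
In practice the trust-region interpretation is usually monitored rather than enforced: many PPO implementations track a sample estimate of the KL divergence between the old and new policy from the logged log-probabilities and stop the minibatch updates once it exceeds a target. A minimal sketch, assuming logged log-probabilities as in the loss sketch above:

```python
import torch

def approx_kl(logp_old, logp_new):
    """Sample-based estimate of KL(pi_old || pi_new) from logged log-probabilities."""
    return (logp_old - logp_new).mean()

# Typical use inside the minibatch loop (target_kl ~ 0.01-0.05 is a common heuristic):
# if approx_kl(logp_old_mb, policy.log_prob(s_mb, a_mb)).item() > target_kl:
#     break  # stop updating on this batch of trajectories
```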

🚨 Critical Engineering Note

The Simulation-to-Reality Gap in Wireless RL

The hardest problem in wireless RL is not training instability (PPO handles that) or convergence (hours of compute handle that); it is the sim-to-real gap. Every published RL-for-wireless result is trained in a simulator (QuaDRiGa, a custom Python model, or a 3GPP system-level simulator) and then either (i) never actually deployed, or (ii) deployed and observed to underperform the iterative baseline because the simulator missed something the real hardware does: nonlinear PA distortion, LTE/NR inter-slot dependencies, RU clock drift, processing latency variability, you name it.

Practical paths forward are all conservative: use RL to tune the hyperparameters of a classical algorithm rather than replace it (e.g. the step size of PF or the regularization of RZF), combine online learning with a safe-policy fallback, or deploy in narrow scenarios (URLLC slices, handover decisions) where the consequences of a bad action are bounded. Every major wireless RL lab now accepts this caveat; the field has largely given up on pure end-to-end RL replacing hand-designed scheduling.

Practical Constraints

  • Training time: 10-50 GPU-hours
  • Reward sensitivity: $\pm 30\%$ across random seeds
  • Sim-to-real gap: typically 20-50% performance loss
  • Deployment mode: augmentation over classical algorithms, not replacement

📋 Ref: 3GPP TR 38.843, Section 7.3 (observations on RL for wireless)

Key Takeaway

RL for massive MIMO is not yet a replacement technology. The classical algorithms of Chapter 5 remain the production workhorse because they have rigorous guarantees, tiny compute footprints, and zero retraining cost. PPO and its siblings can squeeze 1-5% extra utility in narrow deployment scenarios, but the engineering cost (GPU-hours, hyperparameter search, sim-to-real gap) has so far exceeded the benefit in commercial networks. The honest verdict: use RL to tune classical algorithms, not to replace them.

Why This Matters: RL as an xApp in the O-RAN RIC

The Open RAN near-real-time RIC is the first realistic deployment vehicle for wireless RL: xApps inside the RIC can observe rich telemetry (per-user throughputs, CQIs, buffer occupancies) and inject scheduling or handover decisions through standardized E2 service models. O-RAN Alliance working group 3 is actively specifying RL xApp interfaces. The practical constraint is that the xApp must fall back gracefully to a classical policy when its confidence is low, which is exactly the conservative model argued for above.
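
A sketch of that graceful-fallback pattern, with confidence expressed as a policy-entropy threshold; the `rl_policy.act` interface, the entropy gate, and the threshold value are illustrative design choices, not part of any O-RAN specification.

```python
def select_action(state, rl_policy, classical_policy, entropy_threshold=1.5):
    """Use the RL xApp's action only when the policy is confident; otherwise
    fall back to the classical scheduler. Entropy is one possible confidence proxy."""
    action, entropy = rl_policy.act(state)     # assumed to return (action, policy entropy)
    if entropy > entropy_threshold:            # diffuse action distribution -> low confidence
        return classical_policy.act(state)     # e.g. the Chapter 5 PF scheduler
    return action
```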

Proximal Policy Optimization (PPO)

A policy gradient method that constrains the per-step policy change via a clipped surrogate objective. Introduced by Schulman et al. (2017) as a simpler alternative to TRPO, PPO has become the default deep RL algorithm for continuous-control problems including wireless power control and scheduling. Its main advantages are implementation simplicity, empirical stability, and compatibility with large neural network policies.

Quick Check

Under which of the following conditions is PPO most likely to beat the proportional-fairness heuristic of Chapter 5 in production?

  • A standard multi-user downlink with stationary users and perfect CSI.
  • A deployment with exotic nonlinear constraints (hard latency deadlines, HARQ state, nonlinear amplifiers) where PF has no clean form.
  • A new 6G deployment with zero measurement data yet.
  • Any scheduling problem, because RL is universal.

Common Mistake: Reward Curves Masked by Seed Variance

Mistake:

Single-seed RL experiments are notoriously misleading: the same PPO + hyperparameter combination can produce radically different reward trajectories across random seeds, and cherry-picking the best one gives an illusory "PPO beats heuristic" result.

Correction:

Always report median $\pm$ interquartile range across at least 5 (ideally 10) random seeds. If the classical baseline lies below the PPO interquartile range, the gain is real; otherwise it is seed noise. This is now standard in the deep-RL literature (Henderson et al., 2018) but is still inconsistently applied in wireless papers.
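
A minimal sketch of the recommended reporting: run the same configuration over several seeds and compare the heuristic baseline against the median and interquartile range of the final rewards. The function name and array shapes are illustrative.

```python
import numpy as np

def seed_summary(final_rewards, baseline):
    """final_rewards: one converged reward per seed (at least 5 seeds); baseline: heuristic utility."""
    med = float(np.median(final_rewards))
    q1, q3 = np.percentile(final_rewards, [25, 75])
    gain_is_real = baseline < q1               # baseline below the whole interquartile range
    return {"median": med, "iqr": (float(q1), float(q3)), "beats_baseline": bool(gain_is_real)}

# Usage sketch with made-up numbers: seed_summary([12.1, 11.4, 12.6, 10.9, 12.3], baseline=11.8)
```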