Reinforcement Learning for Resource Allocation

Beyond Supervised Learning: When Labels Are Unavailable

Supervised learning requires a dataset of (input, desired output) pairs. For channel estimation, the ground-truth channel can be obtained from a simulator; for detection, the transmitted symbols are known during training. But many wireless networking problems lack such explicit labels:

  • Power control: What is the "correct" power allocation? It depends on all users' channels, interference patterns, and fairness constraints that change every millisecond.
  • Scheduling and resource allocation: The optimal decision depends on future traffic arrivals that are unknown at decision time.
  • Handover and beam management: These are sequential decisions with delayed consequences.

Reinforcement learning (RL) provides the natural framework: an agent interacts with an environment, observes states, takes actions, and receives rewards. The agent's goal is to learn a policy $\pi(a|s)$ that maximises the cumulative discounted reward:

$$J(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the instantaneous reward (e.g., sum rate, negative outage probability). No supervisor provides the "correct" action: the agent must discover it through trial and error.
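As a quick illustration of the objective, the snippet below evaluates the discounted return for a short reward sequence; the reward values and discount factor are arbitrary numbers chosen for the example.

```python
# Minimal sketch: discounted return J = sum_t gamma^t * r_t for an example reward sequence.
gamma = 0.9                      # discount factor (illustrative value)
rewards = [2.0, 1.5, 3.0, 0.5]   # instantaneous rewards r_0, r_1, ... (illustrative)

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(f"Discounted return: {discounted_return:.3f}")
```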

Definition:

Markov Decision Process for Wireless Resource Allocation

A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • State space $\mathcal{S}$: encodes the relevant information about the network, e.g., quantised channel gains $\{|h_k|^2\}_{k=1}^K$, buffer occupancy, interference levels.
  • Action space $\mathcal{A}$: the set of possible decisions, e.g., discrete power levels $\{p_1, \ldots, p_J\}$ for each user, scheduling decisions, beam indices.
  • Transition probability $P(s'|s, a)$: the probability of moving to state $s'$ given current state $s$ and action $a$. In wireless, this is governed by channel dynamics, mobility, and traffic models.
  • Reward function $R(s, a)$: the immediate feedback, e.g., the instantaneous sum rate $$R(s, a) = \sum_{k=1}^K \log_2\!\left(1 + \frac{|h_k|^2 p_k}{\alpha \sum_{j \neq k} |h_j|^2 p_j + \sigma^2}\right)$$ where $\alpha$ models the interference coupling factor (see the reward sketch after this list).
  • Discount factor $\gamma \in [0, 1)$: trades off immediate vs future rewards.
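A minimal sketch of this reward under the simplified interference model above; the function name `sum_rate_reward` and the default parameter values are illustrative choices, not part of the definition.

```python
import numpy as np

def sum_rate_reward(h_sq, p, alpha=1.0, sigma2=1.0):
    """Instantaneous sum rate for the simplified interference model.

    h_sq   : channel power gains |h_k|^2, shape (K,)
    p      : transmit powers p_k, shape (K,)
    alpha  : interference coupling factor
    sigma2 : noise power
    """
    h_sq, p = np.asarray(h_sq, float), np.asarray(p, float)
    rx = h_sq * p                                  # received powers |h_k|^2 p_k
    interference = alpha * (rx.sum() - rx)         # sum over j != k
    sinr = rx / (interference + sigma2)
    return float(np.log2(1.0 + sinr).sum())

# Example: 3 users with illustrative gains and equal power
print(sum_rate_reward([2.0, 1.0, 0.5], [1.0, 1.0, 1.0]))
```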

The agent's goal is to find the optimal policy $\pi^*(a|s) = \arg\max_\pi J(\pi)$. For finite MDPs, the optimal policy can be found via dynamic programming (value iteration, policy iteration), but the state-action space in wireless problems is typically too large for exact solutions.

In wireless power control with $K$ users and $J$ discrete power levels per user, the action space has $|\mathcal{A}| = J^K$ elements. For $K = 8$ users and $J = 5$ levels, $|\mathcal{A}| = 5^8 = 390\,625$. This "curse of dimensionality" motivates both decomposition approaches (per-user independent actions) and function approximation (deep RL).
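When joint actions are kept, a single action index can be mapped to a per-user power vector by base-$J$ decoding. A minimal sketch, with illustrative helper names and level values:

```python
POWER_LEVELS = [0.5, 1.0, 2.0]   # J = 3 illustrative levels
K = 4                            # number of users (illustrative)

def decode_action(a, levels=POWER_LEVELS, n_users=K):
    """Map a joint action index a in [0, J^K) to a per-user power vector."""
    J = len(levels)
    powers = []
    for _ in range(n_users):
        powers.append(levels[a % J])
        a //= J
    return powers

def encode_action(level_indices, J=len(POWER_LEVELS)):
    """Inverse mapping: per-user level indices -> joint action index."""
    return sum(d * J**k for k, d in enumerate(level_indices))

print(decode_action(5))   # joint index 5 -> one power level per user
```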

Definition:

Q-Learning for Power Control

Q-learning (Watkins, 1989) is a model-free, off-policy RL algorithm that directly learns the optimal state-action value function $Q^*(s, a) = \max_\pi \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$.

The Bellman optimality equation states:

$$Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a\right]$$

Q-learning approximates $Q^*$ via the update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\!\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

where $\alpha \in (0, 1)$ is the learning rate. Actions are selected via an $\epsilon$-greedy policy:

$$a_t = \begin{cases} \text{random action from } \mathcal{A} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \epsilon \end{cases}$$

For convergence, $\epsilon$ is typically annealed from a high value (exploration) to a low value (exploitation).

For wireless power control:

  • State $s_t$: quantised channel gains (e.g., low/medium/high)
  • Action $a_t$: power level vector (one of $J^K$ joint actions, or per-user independent actions)
  • Reward $r_t$: instantaneous sum rate
  • Each "episode" is one coherence interval with a new channel realisation

Tabular Q-learning converges to $Q^*$ under the conditions that every state-action pair is visited infinitely often and the learning rate satisfies $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ (Robbins-Monro conditions). In practice, convergence is fast for small state-action spaces, but the table size explodes in high-dimensional problems.
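One schedule that satisfies the Robbins-Monro conditions is a visit-count-based learning rate; a minimal sketch (the dictionary keyed by state-action pairs is an illustrative implementation choice):

```python
from collections import defaultdict

visit_count = defaultdict(int)   # N(s, a): number of updates applied to (s, a)

def learning_rate(state, action):
    """alpha_t = 1 / N(s, a): sum over t diverges while the sum of squares stays finite."""
    visit_count[(state, action)] += 1
    return 1.0 / visit_count[(state, action)]
```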

Q-Learning for Multi-User Power Control

Input: $K$ users, power levels $\mathcal{P} = \{p_1, \ldots, p_J\}$,
episodes $T$, learning rate $\alpha$, discount $\gamma$, exploration $\epsilon_0$
Initialisation:
1. $Q(s, a) \leftarrow 0$ for all $(s, a)$
2. $\epsilon \leftarrow \epsilon_0$
For episode $t = 1, \ldots, T$:
3. Observe channel gains $\mathbf{h}_t = [|h_1|, \ldots, |h_K|]^T$
4. Quantise to state: $s_t \leftarrow \mathrm{Quantise}(\mathbf{h}_t)$
5. Action selection ($\epsilon$-greedy):
- With probability $\epsilon$: $a_t \sim \mathrm{Uniform}(\mathcal{A})$
- Otherwise: $a_t = \arg\max_a Q(s_t, a)$
6. Decode power vector: $\mathbf{p}_t \leftarrow \mathrm{Decode}(a_t)$
7. Compute reward: $r_t = \sum_{k=1}^K \log_2(1 + \mathrm{SINR}_k(\mathbf{h}_t, \mathbf{p}_t))$
8. Q-update:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$
9. Anneal: $\epsilon \leftarrow \max(\epsilon_{\min},\, \epsilon \cdot \beta)$ where $\beta < 1$
Output: Greedy policy $\pi^*(s) = \arg\max_a Q(s, a)$
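A runnable sketch of this loop for two users. It assumes i.i.d. Rayleigh-fading gains per episode, a three-level channel quantiser, and the simplified interference-coupling reward from the MDP definition; these modelling choices and all hyperparameter values are illustrative, not specified above. Because each episode is a single coherence interval with an independent channel draw, the update target here reduces to the immediate reward (equivalent to setting $\gamma = 0$ in step 8).

```python
import numpy as np

rng = np.random.default_rng(0)
K, LEVELS, SIGMA2 = 2, np.array([0.5, 1.0, 2.0]), 1.0   # users, power levels (illustrative)
J = len(LEVELS)
N_ACTIONS = J**K                                          # joint action space
ALPHA, EPS, EPS_MIN, BETA, T = 0.1, 1.0, 0.05, 0.999, 5000

def quantise(h_sq, edges=(0.5, 1.5)):
    """Map each |h_k|^2 to {0: low, 1: medium, 2: high}; returns a hashable state."""
    return tuple(int(np.digitize(g, edges)) for g in h_sq)

def decode(a):
    """Joint action index -> per-user power vector (base-J decoding)."""
    return np.array([LEVELS[(a // J**k) % J] for k in range(K)])

def sum_rate(h_sq, p):
    rx = h_sq * p
    return float(np.log2(1 + rx / (rx.sum() - rx + SIGMA2)).sum())

Q = {}                                                    # tabular Q: state -> array over joint actions
for t in range(T):
    h_sq = rng.exponential(1.0, K)                        # Rayleigh fading: |h|^2 ~ Exp(1)
    s = quantise(h_sq)
    q = Q.setdefault(s, np.zeros(N_ACTIONS))
    a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(q.argmax())
    r = sum_rate(h_sq, decode(a))
    q[a] += ALPHA * (r - q[a])                            # one-step episode: target is the reward itself
    EPS = max(EPS_MIN, EPS * BETA)                        # anneal exploration

greedy = {s: int(q.argmax()) for s, q in Q.items()}
print(f"Learned greedy actions for {len(greedy)} visited states")
```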

Example: Q-Learning for Two-User Interference Channel

Two single-antenna transmitter-receiver pairs share a frequency band. The direct channel gains are $|h_{11}|^2 = 2.0$ and $|h_{22}|^2 = 0.5$, and the cross-interference gains are $|h_{12}|^2 = 0.3$ and $|h_{21}|^2 = 0.4$. The noise power is $\sigma^2 = 1$. Available power levels are $\{0.5, 1.0, 2.0\}$.

(a) Compute the sum rate for equal power $p_1 = p_2 = 1.0$.

(b) Enumerate all 9 joint power allocations and find the sum-rate-optimal one.

(c) Explain how Q-learning would discover this optimum without computing all SINR values explicitly.
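For parts (a) and (b), a short brute-force check can be scripted as below. It assumes $|h_{kj}|^2$ denotes the gain from transmitter $j$ to receiver $k$, so receiver 1 sees interference through $|h_{12}|^2$; if the opposite indexing convention is intended, transpose the matrix.

```python
import itertools
import numpy as np

H = np.array([[2.0, 0.3],     # |h_11|^2, |h_12|^2
              [0.4, 0.5]])    # |h_21|^2, |h_22|^2
SIGMA2, LEVELS = 1.0, [0.5, 1.0, 2.0]

def sum_rate(p):
    p = np.asarray(p, float)
    total = 0.0
    for k in range(2):
        signal = H[k, k] * p[k]
        interference = sum(H[k, j] * p[j] for j in range(2) if j != k)
        total += np.log2(1 + signal / (interference + SIGMA2))
    return float(total)

print("(a) equal power:", sum_rate([1.0, 1.0]))
# (b) enumerate all 9 joint allocations and pick the sum-rate-optimal one
best = max(itertools.product(LEVELS, repeat=2), key=sum_rate)
print("(b) best allocation:", best, "sum rate:", sum_rate(best))
```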

RL Power Control vs Equal Power

Train a tabular Q-learning agent to allocate transmit power across $K$ users to maximise the sum rate. Each episode corresponds to one coherence interval with a random channel realisation. The solid curve shows the Q-learning agent's sum rate (smoothed) and the dashed line shows the equal-power baseline. Observe that the RL agent gradually learns to outperform equal power by adapting its allocation to channel conditions. Increasing $K$ makes the problem harder (larger action space, more interference), requiring more episodes for convergence. Note: for large $K$, the agent uses per-user independent Q-tables to keep the problem tractable.


Deep Reinforcement Learning for Wireless

When the state-action space is too large for a Q-table, the Q-function can be approximated by a neural network $Q_\theta(s, a)$. This is the basis of Deep Q-Networks (DQN) (Mnih et al., 2015), which add two key ingredients (a minimal sketch follows the list):

  1. Experience replay: Store transitions $(s_t, a_t, r_t, s_{t+1})$ in a buffer and sample random mini-batches for training, breaking temporal correlations.
  2. Target network: Maintain a slowly updated copy $Q_{\theta^-}$ for computing the target $r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a')$, stabilising training.
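A minimal PyTorch sketch of both ingredients, assuming a small fully connected Q-network over a channel-gain state vector and a discrete joint action space; the architecture, buffer size, and hyperparameters are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 8, 125, 0.9   # illustrative: 8 channel gains, 5^3 joint actions

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())      # target starts as an exact copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)                        # experience replay buffer of (s, a, r, s')

def train_step(batch_size=32):
    """One DQN update: sample a random mini-batch and regress onto the frozen target."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)        # random sampling breaks temporal correlations
    s, a, r, s_next = zip(*batch)
    s = torch.tensor(np.array(s), dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(np.array(s_next), dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                            # target computed with the slowly updated copy
        target = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few hundred environment steps, sync the target network:
# target_net.load_state_dict(q_net.state_dict())
```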

For continuous action spaces (e.g., continuous power levels), policy gradient methods are more natural:

  • DDPG / TD3: Actor-critic algorithms that learn both a policy $\pi_\theta(s)$ (actor) and a Q-function $Q_\phi(s, a)$ (critic); see the actor sketch below.
  • PPO / SAC: State-of-the-art methods with better stability and sample efficiency.
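For continuous power control, the actor can map the state directly to per-user powers. A minimal sketch of such a deterministic (DDPG-style) policy head; the sigmoid scaling to $(0, P_{\max})$ and the layer sizes are illustrative design choices:

```python
import torch
import torch.nn as nn

class PowerActor(nn.Module):
    """Deterministic policy: channel-gain state -> continuous per-user transmit powers."""
    def __init__(self, n_users, p_max=2.0):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(n_users, 64), nn.ReLU(),
            nn.Linear(64, n_users), nn.Sigmoid(),   # outputs in (0, 1)
        )

    def forward(self, state):
        return self.p_max * self.net(state)         # scale to (0, p_max)

actor = PowerActor(n_users=4)
powers = actor(torch.rand(1, 4))                     # powers for one channel realisation
print(powers)
```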

In wireless, deep RL has been applied to:

  • Dynamic spectrum access
  • Beamforming with limited feedback
  • UAV trajectory optimisation
  • RIS phase shift configuration
  • Joint scheduling and power control

The main challenge is sample efficiency: deep RL often needs millions of interactions, which translates to hours or days of simulated channel realisations. Transfer learning and model-based RL (using a learned channel model as the environment) can mitigate this.


Multi-Agent RL for Distributed Resource Management

In a cellular network, each base station (or user) can be modelled as an independent RL agent, leading to a multi-agent reinforcement learning (MARL) problem. Key challenges:

  • Non-stationarity: Each agent's environment includes the actions of all other agents, which change as they learn. From agent $k$'s perspective, the environment is non-stationary.
  • Credit assignment: The reward (sum rate) depends on all agents' actions. Attributing credit to individual agents is difficult.
  • Communication overhead: Centralised training with decentralised execution (CTDE) is the dominant paradigm: agents share information during training but act independently at deployment.

The MARL formulation connects directly to game theory: the power control problem is an interference game, and the RL agents implicitly converge to (or oscillate around) a Nash equilibrium. Whether this is close to the socially optimal (sum-rate-maximising) allocation depends on the reward design.
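A minimal sketch of the simplest MARL baseline for this setting, independent Q-learners: each link keeps its own Q-table over its local (quantised) channel state and updates it with the shared sum-rate reward. The fading model, quantiser, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K, LEVELS, SIGMA2 = 3, np.array([0.5, 1.0, 2.0]), 1.0
ALPHA, EPS, N_STATES = 0.1, 0.2, 3                 # 3 local states: low / medium / high gain
Q = np.zeros((K, N_STATES, len(LEVELS)))           # one Q-table per agent

def quantise(g, edges=(0.5, 1.5)):
    return int(np.digitize(g, edges))

for episode in range(5000):
    h_sq = rng.exponential(1.0, K)                 # |h_k|^2 ~ Exp(1) (Rayleigh fading)
    states = [quantise(g) for g in h_sq]
    # Each agent picks its own power level epsilon-greedily from its local state only
    actions = [rng.integers(len(LEVELS)) if rng.random() < EPS
               else int(Q[k, states[k]].argmax()) for k in range(K)]
    p = LEVELS[actions]
    rx = h_sq * p
    reward = float(np.log2(1 + rx / (rx.sum() - rx + SIGMA2)).sum())   # shared sum rate
    for k in range(K):                             # every agent updates with the same reward
        Q[k, states[k], actions[k]] += ALPHA * (reward - Q[k, states[k], actions[k]])
```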

Markov Decision Process (MDP)

A sequential decision framework defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transition probabilities, rewards, and discount factor. The agent learns a policy $\pi(a|s)$ to maximise cumulative discounted reward. In wireless, states encode channel/network conditions and actions are resource allocation decisions.

Related: Q-Learning

Q-Learning

A model-free RL algorithm that learns the optimal state-action value function $Q^*(s, a)$ via temporal-difference updates. The policy is derived greedily: $\pi^*(s) = \arg\max_a Q(s, a)$. Tabular Q-learning converges to the optimum for finite MDPs; deep Q-networks (DQN) extend it to large state spaces via neural-network function approximation.

Related: Markov Decision Process (MDP)

⚠️ Engineering Note

Sample Efficiency Challenges in RL for Wireless

Reinforcement learning for wireless resource allocation faces severe sample efficiency challenges in practice:

  • Training episodes: Tabular Q-learning for a 4-user power control problem with 5 power levels typically requires $10^3$--$10^4$ episodes to converge. Deep RL (DQN, PPO) for realistic 20-user systems may need $10^5$--$10^6$ episodes.
  • Simulation cost: Each episode requires a channel realisation and SINR computation. At $10^6$ episodes, this takes minutes in simulation but would take days of real-time interaction.
  • Sim-to-real gap: Policies trained in simulation may not transfer to real deployments due to model mismatch. Domain randomisation (varying channel models during training) and fine-tuning on real data help but add complexity.
  • Non-stationarity: The wireless environment changes due to mobility, traffic load, and network reconfiguration. RL agents must continuously adapt, but retraining is expensive.
  • O-RAN RIC timescales: The near-RT RIC operates at 10--100 ms granularity, while the non-RT RIC operates at timescales above 1 s. RL inference must fit within these timescales.
Practical Constraints
  • Tabular Q-learning convergence: 10³-10⁴ episodes (small systems)
  • Deep RL convergence: 10⁵-10⁶ episodes (large systems)
  • O-RAN near-RT RIC: 10-100 ms decision timescale

Quick Check

In $\epsilon$-greedy Q-learning with $\epsilon = 0.1$, the agent takes a random action 10% of the time. If $\epsilon$ is set to 0 (pure greedy) from the start, what is the primary risk?

  • The Q-table will overflow due to too many updates
  • The agent will converge faster because it always picks the best action
  • The agent may get stuck in a suboptimal policy because it never explores alternative actions whose Q-values were initialised to zero
  • The algorithm will not converge because the Bellman equation requires stochastic policies