Reinforcement Learning for Resource Allocation
Beyond Supervised Learning: When Labels Are Unavailable
Supervised learning requires a dataset of (input, desired output) pairs. For channel estimation, the ground-truth channel can be obtained from a simulator; for detection, the transmitted symbols are known during training. But many wireless networking problems lack such explicit labels:
- Power control: What is the "correct" power allocation? It depends on all users' channels, interference patterns, and fairness constraints that change every millisecond.
- Scheduling and resource allocation: The optimal decision depends on future traffic arrivals that are unknown at decision time.
- Handover and beam management: These are sequential decisions with delayed consequences.
Reinforcement learning (RL) provides the natural framework: an agent interacts with an environment, observes states, takes actions, and receives rewards. The agent's goal is to learn a policy that maximises the cumulative discounted reward
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},$$
where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the instantaneous reward (e.g., sum rate, negative outage probability). No supervisor provides the "correct" action --- the agent must discover it through trial and error.
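The discounted return can be computed directly; a minimal sketch with a hypothetical reward sequence (e.g., per-slot sum rates):

```python
# Cumulative discounted return G_t = sum_k gamma^k * r_{t+k},
# illustrated with a made-up finite reward sequence.
gamma = 0.9                      # discount factor
rewards = [2.3, 1.8, 2.1, 2.5]   # hypothetical instantaneous rewards (bps/Hz)

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(f"Discounted return G = {G:.3f}")
```

The discount factor down-weights future rewards geometrically, so distant coherence intervals contribute little to the return.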
Definition: Markov Decision Process for Wireless Resource Allocation
A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$:
- State space $\mathcal{S}$: encodes the relevant information about the network, e.g., quantised channel gains $|h_k|^2$, buffer occupancy, interference levels.
- Action space $\mathcal{A}$: the set of possible decisions, e.g., discrete power levels for each user, scheduling decisions, beam indices.
- Transition probability $P(s' \mid s, a)$: the probability of moving to state $s'$ given current state $s$ and action $a$. In wireless, this is governed by channel dynamics, mobility, and traffic models.
- Reward function $r(s, a)$: the immediate feedback, e.g., the instantaneous sum rate $r(s,a) = \sum_{k=1}^{K} \log_2(1 + \mathrm{SINR}_k)$, where $\mathrm{SINR}_k = p_k |h_{kk}|^2 / (\sigma^2 + \sum_{j \neq k} p_j |h_{kj}|^2)$ models the interference coupling between users.
- Discount factor $\gamma \in [0, 1)$: trades off immediate vs future rewards.
The agent's goal is to find the optimal policy $\pi^* = \arg\max_\pi \mathbb{E}_\pi[G_t]$. For finite MDPs, the optimal policy can be found via dynamic programming (value iteration, policy iteration), but the state-action space in wireless problems is typically too large for exact solutions.
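As a sketch of value iteration on a toy finite MDP (the transition and reward numbers below are made up for illustration, not a wireless model):

```python
import numpy as np

# Toy finite MDP: 3 states, 2 actions (hypothetical numbers).
# P[a, s, s'] = transition probability, r[s, a] = immediate reward.
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],   # action 0
    [[0.1, 0.6, 0.3], [0.3, 0.3, 0.4], [0.5, 0.1, 0.4]],   # action 1
])
r = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])          # r[s, a]
gamma = 0.9

V = np.zeros(3)
for _ in range(500):                 # value iteration: V <- max_a [r + gamma * P V]
    Q = r + gamma * (P @ V).T        # Q[s, a]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)            # greedy policy w.r.t. the converged values
print("V* ~", np.round(V, 3), "greedy policy:", policy)
```

The contraction property of the Bellman operator guarantees geometric convergence for $\gamma < 1$; the catch in wireless is the size of $\mathcal{S} \times \mathcal{A}$, not the iteration count.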
In wireless power control with $K$ users and $L$ discrete power levels per user, the joint action space has $L^K$ elements. For example, $K = 10$ users and $L = 4$ levels already give $4^{10} \approx 10^6$ actions. This "curse of dimensionality" motivates both decomposition approaches (per-user independent actions) and function approximation (deep RL).
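The exponential growth of the joint action space is easy to tabulate:

```python
# Joint action space size L**K grows exponentially in the number of users K.
for K, L in [(2, 3), (4, 5), (10, 4), (20, 4)]:
    print(f"K={K:2d}, L={L}: |A| = L**K = {L**K:,}")
```

Per-user decomposition replaces the single $L^K$-way decision with $K$ independent $L$-way decisions, shrinking the search space from exponential to linear in $K$.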
Definition: Q-Learning for Power Control
Q-learning (Watkins, 1989) is a model-free, off-policy RL algorithm that directly learns the optimal state-action value function $Q^*(s, a)$.
The Bellman optimality equation states:
$$Q^*(s, a) = \mathbb{E}\bigl[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \,\big|\, s, a \bigr].$$
Q-learning approximates $Q^*$ via the update rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr],$$
where $\alpha \in (0, 1]$ is the learning rate. Actions are selected via an $\varepsilon$-greedy policy: with probability $1 - \varepsilon$ take the greedy action $\arg\max_a Q(s, a)$, and with probability $\varepsilon$ take a uniformly random action.
For convergence, $\varepsilon$ is typically annealed from a high value (exploration) to a low value (exploitation).
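A minimal sketch of $\varepsilon$-greedy selection with a linear annealing schedule (the schedule endpoints are illustrative):

```python
import random

def epsilon_greedy(Q_row, epsilon, n_actions):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q_row[a])

def annealed_epsilon(step, n_steps, eps_start=1.0, eps_end=0.05):
    """Linear anneal from eps_start to eps_end over n_steps, then hold."""
    frac = min(step / n_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Other common schedules (exponential decay, inverse-time decay) trade off exploration speed differently, but any schedule that keeps $\varepsilon > 0$ preserves the infinite-visitation condition needed for convergence.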
For wireless power control:
- State $s$: quantised channel gains (e.g., low/medium/high)
- Action $a$: power level vector (one of $L^K$ joint actions, or $K$ per-user independent actions)
- Reward $r$: instantaneous sum rate
- Each "episode" is one coherence interval with a new channel realisation
Tabular Q-learning converges to $Q^*$ under the conditions that every state-action pair is visited infinitely often and the learning rate satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (Robbins-Monro conditions). In practice, convergence is fast for small state-action spaces but the table size explodes in high-dimensional problems.
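The loop below sketches tabular Q-learning for a toy power-control problem. It assumes a hypothetical two-state quantised channel, and a power-cost term is added to the reward so that maximum power is not always optimal; since each episode is a single coherence interval, the TD target reduces to the immediate reward:

```python
import math, random

random.seed(1)

powers = [0.5, 1.0, 2.0]                  # discrete power levels (W)
gains  = {"bad": 0.5, "good": 4.0}        # hypothetical channel gain per state
lam    = 0.5                              # power-cost weight (illustrative)

def reward(state, a):
    p = powers[a]
    return math.log2(1.0 + p * gains[state]) - lam * p

Q = {s: [0.0] * len(powers) for s in gains}
alpha, eps = 0.1, 0.2

for episode in range(20000):              # one coherence interval per episode
    s = random.choice(list(gains))        # fresh channel realisation
    a = random.randrange(len(powers)) if random.random() < eps \
        else max(range(len(powers)), key=lambda i: Q[s][i])
    Q[s][a] += alpha * (reward(s, a) - Q[s][a])   # one-step episode: target = r

policy = {s: powers[max(range(len(powers)), key=lambda i: Q[s][i])] for s in gains}
print("learned policy:", policy)
```

With these made-up numbers the learned policy backs off to $p = 1.0$ in the bad state (where extra power buys little rate) and spends $p = 2.0$ in the good state, which matches a brute-force check of the per-state rewards.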
Q-Learning for Multi-User Power Control
Example: Q-Learning for Two-User Interference Channel
Two single-antenna transmitter-receiver pairs share a frequency band. The direct channel gains are $g_{11}$ and $g_{22}$, the cross-interference gains are $g_{12}$ and $g_{21}$ (transmitter $j$ to receiver $i$), and the noise power is $\sigma^2$. Available power levels are $p_k \in \{0.5, 1.0, 2.0\}$.
(a) Compute the sum rate for equal power $p_1 = p_2 = 1.0$.
(b) Enumerate all 9 joint power allocations and find the sum-rate-optimal one.
(c) Explain how Q-learning would discover this optimum without computing all SINR values explicitly.
Equal-power sum rate
With $p_1 = p_2 = 1.0$ (the middle row of the table below), $\mathrm{SINR}_1 = 1.429$ and $\mathrm{SINR}_2 = 0.385$, giving $R = \log_2(1 + 1.429) + \log_2(1 + 0.385) \approx 1.75$ bps/Hz.
Exhaustive search
Computing the sum rate for all nine allocations $(p_1, p_2)$:
| $p_1$ | $p_2$ | $\mathrm{SINR}_1$ | $\mathrm{SINR}_2$ | Sum rate (bps/Hz) |
| --- | --- | --- | --- | --- |
| 0.5 | 0.5 | 0.833 | 0.217 | 1.119 |
| 0.5 | 1.0 | 0.714 | 0.385 | 1.235 |
| 0.5 | 2.0 | 0.556 | 0.625 | 1.294 |
| 1.0 | 0.5 | 1.667 | 0.179 | 1.585 |
| 1.0 | 1.0 | 1.429 | 0.385 | 1.753 |
| 1.0 | 2.0 | 1.111 | 0.625 | 1.742 |
| 2.0 | 0.5 | 3.333 | 0.147 | 2.100 |
| 2.0 | 1.0 | 2.857 | 0.294 | 2.283 |
| 2.0 | 2.0 | 2.222 | 0.435 | 2.180 |
The maximum sum rate is $R = 2.283$ bps/Hz at $(p_1, p_2) = (2.0, 1.0)$. User 1 (with the stronger direct channel) receives more power.
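The exhaustive search can be sketched as follows. Since the example's exact gain values are not reproduced in this text, the snippet uses hypothetical gains, so its optimum need not coincide with the table above:

```python
import math
from itertools import product

# Hypothetical gains for illustration only.
g11, g22 = 2.0, 1.0          # direct gains
g12, g21 = 0.5, 0.4          # cross-interference gains (tx j -> rx i)
sigma2 = 1.0
powers = [0.5, 1.0, 2.0]

def sum_rate(p1, p2):
    sinr1 = p1 * g11 / (sigma2 + p2 * g21)
    sinr2 = p2 * g22 / (sigma2 + p1 * g12)
    return math.log2(1 + sinr1) + math.log2(1 + sinr2)

for p1, p2 in product(powers, powers):
    print(f"p1={p1}, p2={p2}: R = {sum_rate(p1, p2):.3f}")

best = max(product(powers, powers), key=lambda pp: sum_rate(*pp))
print("best allocation:", best, f"R = {sum_rate(*best):.3f}")
```

With weaker cross-interference than in the worked example, this particular gain choice makes full power for both users optimal, illustrating how the optimum shifts with the interference coupling.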
Q-learning discovery
Q-learning discovers this optimum without exhaustive enumeration:
- Initially, $Q(s, a) = 0$ for all actions, so the agent explores randomly.
- When it happens to try $(p_1, p_2) = (2.0, 1.0)$, it receives a high reward ($R = 2.283$), and the corresponding Q-value increases.
- With $\varepsilon$-greedy exploration, the agent increasingly favours action $(2.0, 1.0)$ but continues to explore.
- Over hundreds of episodes with varying channel realisations, Q-learning builds a table mapping each channel state to the best power allocation.
The advantage over exhaustive search emerges in large problems (e.g., $K = 10$ users with $L = 4$ levels gives $4^{10} \approx 10^6$ joint actions) where enumeration is infeasible, but Q-learning can exploit structure (e.g., per-user decomposition) to converge.
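A single-state sketch (one fixed channel realisation, hypothetical gains) shows Q-learning discovering the brute-force optimum through $\varepsilon$-greedy trial and error:

```python
import math, random
from itertools import product

random.seed(0)

# Hypothetical two-user setup (illustrative gains; one fixed channel state).
g11, g22, g12, g21, sigma2 = 2.0, 1.0, 0.5, 0.4, 1.0
actions = list(product([0.5, 1.0, 2.0], repeat=2))   # 9 joint power allocations

def sum_rate(p1, p2):
    s1 = p1 * g11 / (sigma2 + p2 * g21)
    s2 = p2 * g22 / (sigma2 + p1 * g12)
    return math.log2(1 + s1) + math.log2(1 + s2)

Q = [0.0] * len(actions)
alpha, eps = 0.2, 0.3

for episode in range(3000):               # one-step episodes, single state
    a = random.randrange(len(actions)) if random.random() < eps \
        else max(range(len(actions)), key=lambda i: Q[i])
    Q[a] += alpha * (sum_rate(*actions[a]) - Q[a])   # TD target = reward

greedy = actions[max(range(len(actions)), key=lambda i: Q[i])]
brute  = max(actions, key=lambda pp: sum_rate(*pp))
print("Q-learning greedy action:", greedy, " brute force:", brute)
```

The agent evaluates only the allocations it happens to visit, yet its greedy action converges to the same allocation a full enumeration would return.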
RL Power Control vs Equal Power
Train a tabular Q-learning agent to allocate transmit power across $K$ users to maximise the sum rate. Each episode corresponds to one coherence interval with a random channel realisation. The solid curve shows the Q-learning agent's sum rate (smoothed) and the dashed line shows the equal-power baseline. Observe that the RL agent gradually learns to outperform equal power by adapting its allocation to channel conditions. Increasing $K$ makes the problem harder (larger action space, more interference), requiring more episodes for convergence. Note: For large $K$, the agent uses per-user independent Q-tables to keep the problem tractable.
Deep Reinforcement Learning for Wireless
When the state-action space is too large for a Q-table, the Q-function can be approximated by a neural network: $Q(s, a) \approx Q_\theta(s, a)$, where $\theta$ are the network weights. This is the basis of Deep Q-Networks (DQN) (Mnih et al., 2015), which add two key ingredients:
- Experience replay: Store transitions in a buffer and sample random mini-batches for training, breaking temporal correlations.
- Target network: Maintain a slowly updated copy $Q_{\theta^-}$ for computing the target $r + \gamma \max_{a'} Q_{\theta^-}(s', a')$, stabilising training.
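Both ingredients can be sketched framework-agnostically (plain Python stands in for the neural network; the class names and defaults are illustrative, not any particular library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; sampling random mini-batches breaks the temporal
    correlation between consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)   # old transitions are evicted

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

def soft_update(target_params, online_params, tau=0.005):
    """Target network: Polyak averaging, target <- tau*online + (1-tau)*target,
    so the TD target drifts slowly behind the online network."""
    return [tau * w + (1 - tau) * t for w, t in zip(online_params, target_params)]
```

An alternative to Polyak averaging is a hard copy of the online weights every fixed number of steps, as in the original DQN paper; both serve the same purpose of decorrelating the TD target from the parameters being updated.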
For continuous action spaces (e.g., continuous power levels), policy gradient methods are more natural:
- DDPG / TD3: Actor-critic algorithms that learn both a policy (actor) and a Q-function (critic).
- PPO / SAC: State-of-the-art methods with better stability and sample efficiency.
In wireless, deep RL has been applied to:
- Dynamic spectrum access
- Beamforming with limited feedback
- UAV trajectory optimisation
- RIS phase shift configuration
- Joint scheduling and power control
The main challenge is sample efficiency: deep RL often needs millions of interactions, which translates to hours or days of simulated channel realisations. Transfer learning and model-based RL (using a learned channel model as the environment) can mitigate this.
Multi-Agent RL for Distributed Resource Management
In a cellular network, each base station (or user) can be modelled as an independent RL agent, leading to a multi-agent reinforcement learning (MARL) problem. Key challenges:
- Non-stationarity: Each agent's environment includes the actions of all other agents, which change as they learn. From agent $i$'s perspective, the environment is non-stationary.
- Credit assignment: The reward (sum rate) depends on all agents' actions. Attributing credit to individual agents is difficult.
- Communication overhead: Centralised training with decentralised execution (CTDE) is the dominant paradigm --- agents share information during training but act independently at deployment.
The MARL formulation connects directly to game theory: the power control problem is an interference game, and the RL agents implicitly converge to (or oscillate around) a Nash equilibrium. Whether this is close to the socially optimal (sum-rate-maximising) allocation depends on the reward design.
Markov Decision Process (MDP)
A sequential decision framework defined by $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$: states, actions, transition probabilities, rewards, and discount factor. The agent learns a policy to maximise cumulative discounted reward. In wireless, states encode channel/network conditions and actions are resource allocation decisions.
Related: Q-Learning
Q-Learning
A model-free RL algorithm that learns the optimal state-action value function $Q^*(s, a)$ via temporal-difference updates. The policy is derived greedily: $\pi(s) = \arg\max_a Q(s, a)$. Tabular Q-learning converges to the optimum for finite MDPs; deep Q-networks (DQN) extend it to large state spaces via neural-network function approximation.
Related: Markov Decision Process (MDP)
Sample Efficiency Challenges in RL for Wireless
Reinforcement learning for wireless resource allocation faces severe sample efficiency challenges in practice:
- Training episodes: Tabular Q-learning for a 4-user power control problem with 5 power levels typically requires $10^3$--$10^4$ episodes to converge. Deep RL (DQN, PPO) for realistic 20-user systems may need $10^5$--$10^6$ episodes.
- Simulation cost: Each episode requires a channel realisation and SINR computation. At $10^6$ episodes, this takes minutes in simulation but would take days of real-time interaction.
- Sim-to-real gap: Policies trained in simulation may not transfer to real deployments due to model mismatch. Domain randomisation (varying channel models during training) and fine-tuning on real data help but add complexity.
- Non-stationarity: The wireless environment changes due to mobility, traffic load, and network reconfiguration. RL agents must continuously adapt, but retraining is expensive.
- O-RAN RIC timescales: The near-RT RIC operates at 10--100 ms granularity, while the non-RT RIC operates at timescales of 1 s and above. RL inference must fit within these timescales.
- Tabular Q-learning convergence: $10^3$--$10^4$ episodes (small systems)
- Deep RL convergence: $10^5$--$10^6$ episodes (large systems)
- O-RAN near-RT RIC: 10--100 ms decision timescale
Quick Check
In $\varepsilon$-greedy Q-learning with $\varepsilon = 0.1$, the agent takes a random action 10% of the time. If $\varepsilon$ is set to 0 (pure greedy) from the start, what is the primary risk?
- The Q-table will overflow due to too many updates
- The agent will converge faster because it always picks the best action
- The agent may get stuck in a suboptimal policy because it never explores alternative actions whose Q-values were initialised to zero
- The algorithm will not converge because the Bellman equation requires stochastic policies
With $\varepsilon = 0$, the agent always picks $\arg\max_a Q(s, a)$. If Q-values are initialised to 0 and an early positive reward locks in a suboptimal action, the agent never tries better alternatives. Exploration is essential for Q-learning to converge to the optimal policy.