Reinforcement Learning for Resource Allocation

Beyond Supervised Learning: When Labels Are Unavailable

Supervised learning requires a dataset of (input, desired output) pairs. For channel estimation, the ground-truth channel can be obtained from a simulator; for detection, the transmitted symbols are known during training. But many wireless networking problems lack such explicit labels:

  • Power control: What is the "correct" power allocation? It depends on all users' channels, interference patterns, and fairness constraints that change every millisecond.
  • Scheduling and resource allocation: The optimal decision depends on future traffic arrivals that are unknown at decision time.
  • Handover and beam management: These are sequential decisions with delayed consequences.

Reinforcement learning (RL) provides the natural framework: an agent interacts with an environment, observes states, takes actions, and receives rewards. The agent's goal is to learn a policy $\pi(a|s)$ that maximises the cumulative discounted reward:

$$J(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

where $\gamma \in [0, 1)$ is the discount factor and $r_t$ is the instantaneous reward (e.g., sum rate, negative outage probability). No supervisor provides the "correct" action: the agent must discover it through trial and error.
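As a quick illustration of the objective, the snippet below evaluates the discounted return for a short reward sequence; the reward values and discount factor are arbitrary numbers chosen for the example.

```python
# Minimal sketch: discounted return J = sum_t gamma^t * r_t for an example reward sequence.
gamma = 0.9                      # discount factor (illustrative value)
rewards = [2.0, 1.5, 3.0, 0.5]   # instantaneous rewards r_0, r_1, ... (illustrative)

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(f"Discounted return: {discounted_return:.3f}")
```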

Definition:

Markov Decision Process for Wireless Resource Allocation

A Markov Decision Process (MDP) is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • State space $\mathcal{S}$: encodes the relevant information about the network, e.g., quantised channel gains $\{|h_k|^2\}_{k=1}^K$, buffer occupancy, interference levels.
  • Action space $\mathcal{A}$: the set of possible decisions, e.g., discrete power levels $\{p_1, \ldots, p_J\}$ for each user, scheduling decisions, beam indices.
  • Transition probability $P(s'|s, a)$: the probability of moving to state $s'$ given current state $s$ and action $a$. In wireless, this is governed by channel dynamics, mobility, and traffic models.
  • Reward function $R(s, a)$: the immediate feedback, e.g., the instantaneous sum rate $$R(s, a) = \sum_{k=1}^K \log_2\!\left(1 + \frac{|h_k|^2 p_k}{\alpha \sum_{j \neq k} |h_j|^2 p_j + \sigma^2}\right)$$ where $\alpha$ models the interference coupling factor (see the reward sketch after this list).
  • Discount factor $\gamma \in [0, 1)$: trades off immediate vs future rewards.
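A minimal sketch of this reward under the simplified interference model above; the function name `sum_rate_reward` and the default parameter values are illustrative choices, not part of the definition.

```python
import numpy as np

def sum_rate_reward(h_sq, p, alpha=1.0, sigma2=1.0):
    """Instantaneous sum rate for the simplified interference model.

    h_sq   : channel power gains |h_k|^2, shape (K,)
    p      : transmit powers p_k, shape (K,)
    alpha  : interference coupling factor
    sigma2 : noise power
    """
    h_sq, p = np.asarray(h_sq, float), np.asarray(p, float)
    rx = h_sq * p                                  # received powers |h_k|^2 p_k
    interference = alpha * (rx.sum() - rx)         # sum over j != k
    sinr = rx / (interference + sigma2)
    return float(np.log2(1.0 + sinr).sum())

# Example: 3 users with illustrative gains and equal power
print(sum_rate_reward([2.0, 1.0, 0.5], [1.0, 1.0, 1.0]))
```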

The agent's goal is to find the optimal policy $\pi^*(a|s) = \arg\max_\pi J(\pi)$. For finite MDPs, the optimal policy can be found via dynamic programming (value iteration, policy iteration), but the state-action space in wireless problems is typically too large for exact solutions.

In wireless power control with $K$ users and $J$ discrete power levels per user, the action space has $|\mathcal{A}| = J^K$ elements. For $K = 8$ users and $J = 5$ levels, $|\mathcal{A}| = 5^8 = 390\,625$. This "curse of dimensionality" motivates both decomposition approaches (per-user independent actions) and function approximation (deep RL).
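When joint actions are kept, a single action index can be mapped to a per-user power vector by base-$J$ decoding. A minimal sketch, with illustrative helper names and level values:

```python
POWER_LEVELS = [0.5, 1.0, 2.0]   # J = 3 illustrative levels
K = 4                            # number of users (illustrative)

def decode_action(a, levels=POWER_LEVELS, n_users=K):
    """Map a joint action index a in [0, J^K) to a per-user power vector."""
    J = len(levels)
    powers = []
    for _ in range(n_users):
        powers.append(levels[a % J])
        a //= J
    return powers

def encode_action(level_indices, J=len(POWER_LEVELS)):
    """Inverse mapping: per-user level indices -> joint action index."""
    return sum(d * J**k for k, d in enumerate(level_indices))

print(decode_action(5))   # joint index 5 -> one power level per user
```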

Definition:

Q-Learning for Power Control

Q-learning (Watkins, 1989) is a model-free, off-policy RL algorithm that directly learns the optimal state-action value function $Q^*(s, a) = \max_\pi \mathbb{E}\!\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$.

The Bellman optimality equation states:

$$Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a\right]$$

Q-learning approximates $Q^*$ via the update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\!\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

where $\alpha \in (0, 1)$ is the learning rate. Actions are selected via an $\epsilon$-greedy policy:

$$a_t = \begin{cases} \text{random action from } \mathcal{A} & \text{with probability } \epsilon \\ \arg\max_a Q(s_t, a) & \text{with probability } 1 - \epsilon \end{cases}$$

For convergence, $\epsilon$ is typically annealed from a high value (exploration) to a low value (exploitation).

For wireless power control:

  • State $s_t$: quantised channel gains (e.g., low/medium/high)
  • Action $a_t$: power level vector (one of $J^K$ joint actions, or per-user independent actions)
  • Reward $r_t$: instantaneous sum rate
  • Each "episode" is one coherence interval with a new channel realisation

Tabular Q-learning converges to $Q^*$ under the conditions that every state-action pair is visited infinitely often and the learning rate satisfies $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$ (Robbins-Monro conditions). In practice, convergence is fast for small state-action spaces, but the table size explodes in high-dimensional problems.
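One schedule that satisfies the Robbins-Monro conditions is a visit-count-based learning rate; a minimal sketch (the dictionary keyed by state-action pairs is an illustrative implementation choice):

```python
from collections import defaultdict

visit_count = defaultdict(int)   # N(s, a): number of updates applied to (s, a)

def learning_rate(state, action):
    """alpha_t = 1 / N(s, a): sum over t diverges while the sum of squares stays finite."""
    visit_count[(state, action)] += 1
    return 1.0 / visit_count[(state, action)]
```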

Q-Learning for Multi-User Power Control

Input: $K$ users, power levels $\mathcal{P} = \{p_1, \ldots, p_J\}$,
episodes $T$, learning rate $\alpha$, discount $\gamma$, exploration $\epsilon_0$
Initialisation:
1. $Q(s, a) \leftarrow 0$ for all $(s, a)$
2. $\epsilon \leftarrow \epsilon_0$
For episode $t = 1, \ldots, T$:
3. Observe channel gains $\mathbf{h}_t = [|h_1|, \ldots, |h_K|]^T$
4. Quantise to state: $s_t \leftarrow \mathrm{Quantise}(\mathbf{h}_t)$
5. Action selection ($\epsilon$-greedy):
- With probability $\epsilon$: $a_t \sim \mathrm{Uniform}(\mathcal{A})$
- Otherwise: $a_t = \arg\max_a Q(s_t, a)$
6. Decode power vector: $\mathbf{p}_t \leftarrow \mathrm{Decode}(a_t)$
7. Compute reward: $r_t = \sum_{k=1}^K \log_2(1 + \mathrm{SINR}_k(\mathbf{h}_t, \mathbf{p}_t))$
8. Q-update:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$
9. Anneal: $\epsilon \leftarrow \max(\epsilon_{\min},\, \epsilon \cdot \beta)$ where $\beta < 1$
Output: Greedy policy $\pi^*(s) = \arg\max_a Q(s, a)$
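A runnable sketch of this loop for two users. It assumes i.i.d. Rayleigh-fading gains per episode, a three-level channel quantiser, and the simplified interference-coupling reward from the MDP definition; these modelling choices and all hyperparameter values are illustrative, not specified above. Because each episode is a single coherence interval with an independent channel draw, the update target here reduces to the immediate reward (equivalent to setting $\gamma = 0$ in step 8).

```python
import numpy as np

rng = np.random.default_rng(0)
K, LEVELS, SIGMA2 = 2, np.array([0.5, 1.0, 2.0]), 1.0   # users, power levels (illustrative)
J = len(LEVELS)
N_ACTIONS = J**K                                          # joint action space
ALPHA, EPS, EPS_MIN, BETA, T = 0.1, 1.0, 0.05, 0.999, 5000

def quantise(h_sq, edges=(0.5, 1.5)):
    """Map each |h_k|^2 to {0: low, 1: medium, 2: high}; returns a hashable state."""
    return tuple(int(np.digitize(g, edges)) for g in h_sq)

def decode(a):
    """Joint action index -> per-user power vector (base-J decoding)."""
    return np.array([LEVELS[(a // J**k) % J] for k in range(K)])

def sum_rate(h_sq, p):
    rx = h_sq * p
    return float(np.log2(1 + rx / (rx.sum() - rx + SIGMA2)).sum())

Q = {}                                                    # tabular Q: state -> array over joint actions
for t in range(T):
    h_sq = rng.exponential(1.0, K)                        # Rayleigh fading: |h|^2 ~ Exp(1)
    s = quantise(h_sq)
    q = Q.setdefault(s, np.zeros(N_ACTIONS))
    a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(q.argmax())
    r = sum_rate(h_sq, decode(a))
    q[a] += ALPHA * (r - q[a])                            # one-step episode: target is the reward itself
    EPS = max(EPS_MIN, EPS * BETA)                        # anneal exploration

greedy = {s: int(q.argmax()) for s, q in Q.items()}
print(f"Learned greedy actions for {len(greedy)} visited states")
```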

Example: Q-Learning for Two-User Interference Channel

Two single-antenna transmitter-receiver pairs share a frequency band. The direct channel gains are $|h_{11}|^2 = 2.0$ and $|h_{22}|^2 = 0.5$, and the cross-interference gains are $|h_{12}|^2 = 0.3$ and $|h_{21}|^2 = 0.4$. The noise power is $\sigma^2 = 1$. Available power levels are $\{0.5, 1.0, 2.0\}$.

(a) Compute the sum rate for equal power $p_1 = p_2 = 1.0$.

(b) Enumerate all 9 joint power allocations and find the sum-rate-optimal one.

(c) Explain how Q-learning would discover this optimum without computing all SINR values explicitly.
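For parts (a) and (b), a short brute-force check can be scripted as below. It assumes $|h_{kj}|^2$ denotes the gain from transmitter $j$ to receiver $k$, so receiver 1 sees interference through $|h_{12}|^2$; if the opposite indexing convention is intended, transpose the matrix.

```python
import itertools
import numpy as np

H = np.array([[2.0, 0.3],     # |h_11|^2, |h_12|^2
              [0.4, 0.5]])    # |h_21|^2, |h_22|^2
SIGMA2, LEVELS = 1.0, [0.5, 1.0, 2.0]

def sum_rate(p):
    p = np.asarray(p, float)
    total = 0.0
    for k in range(2):
        signal = H[k, k] * p[k]
        interference = sum(H[k, j] * p[j] for j in range(2) if j != k)
        total += np.log2(1 + signal / (interference + SIGMA2))
    return float(total)

print("(a) equal power:", sum_rate([1.0, 1.0]))
# (b) enumerate all 9 joint allocations and pick the sum-rate-optimal one
best = max(itertools.product(LEVELS, repeat=2), key=sum_rate)
print("(b) best allocation:", best, "sum rate:", sum_rate(best))
```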

RL Power Control vs Equal Power

Train a tabular Q-learning agent to allocate transmit power across $K$ users to maximise the sum rate. Each episode corresponds to one coherence interval with a random channel realisation. The solid curve shows the Q-learning agent's sum rate (smoothed) and the dashed line shows the equal-power baseline. Observe that the RL agent gradually learns to outperform equal power by adapting its allocation to channel conditions. Increasing $K$ makes the problem harder (larger action space, more interference), requiring more episodes for convergence. Note: for large $K$, the agent uses per-user independent Q-tables to keep the problem tractable.


Deep Reinforcement Learning for Wireless

When the state-action space is too large for a Q-table, the Q-function can be approximated by a neural network $Q_\theta(s, a)$. This is the basis of Deep Q-Networks (DQN) (Mnih et al., 2015), which add two key ingredients (a minimal sketch follows the list):

  1. Experience replay: Store transitions $(s_t, a_t, r_t, s_{t+1})$ in a buffer and sample random mini-batches for training, breaking temporal correlations.
  2. Target network: Maintain a slowly updated copy $Q_{\theta^-}$ for computing the target $r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a')$, stabilising training.
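A minimal PyTorch sketch of both ingredients, assuming a small fully connected Q-network over a channel-gain state vector and a discrete joint action space; the architecture, buffer size, and hyperparameters are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 8, 125, 0.9   # illustrative: 8 channel gains, 5^3 joint actions

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())      # target starts as an exact copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)                        # experience replay buffer of (s, a, r, s')

def train_step(batch_size=32):
    """One DQN update: sample a random mini-batch and regress onto the frozen target."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)        # random sampling breaks temporal correlations
    s, a, r, s_next = zip(*batch)
    s = torch.tensor(np.array(s), dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(np.array(s_next), dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                            # target computed with the slowly updated copy
        target = r + GAMMA * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few hundred environment steps, sync the target network:
# target_net.load_state_dict(q_net.state_dict())
```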

For continuous action spaces (e.g., continuous power levels), policy gradient methods are more natural:

  • DDPG / TD3: Actor-critic algorithms that learn both a policy $\pi_\theta(s)$ (actor) and a Q-function $Q_\phi(s, a)$ (critic); see the actor sketch below.
  • PPO / SAC: State-of-the-art methods with better stability and sample efficiency.
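For continuous power control, the actor can map the state directly to per-user powers. A minimal sketch of such a deterministic (DDPG-style) policy head; the sigmoid scaling to $(0, P_{\max})$ and the layer sizes are illustrative design choices:

```python
import torch
import torch.nn as nn

class PowerActor(nn.Module):
    """Deterministic policy: channel-gain state -> continuous per-user transmit powers."""
    def __init__(self, n_users, p_max=2.0):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(n_users, 64), nn.ReLU(),
            nn.Linear(64, n_users), nn.Sigmoid(),   # outputs in (0, 1)
        )

    def forward(self, state):
        return self.p_max * self.net(state)         # scale to (0, p_max)

actor = PowerActor(n_users=4)
powers = actor(torch.rand(1, 4))                     # powers for one channel realisation
print(powers)
```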

In wireless, deep RL has been applied to:

  • Dynamic spectrum access
  • Beamforming with limited feedback
  • UAV trajectory optimisation
  • RIS phase shift configuration
  • Joint scheduling and power control

The main challenge is sample efficiency: deep RL often needs millions of interactions, which translates to hours or days of simulated channel realisations. Transfer learning and model-based RL (using a learned channel model as the environment) can mitigate this.


Multi-Agent RL for Distributed Resource Management

In a cellular network, each base station (or user) can be modelled as an independent RL agent, leading to a multi-agent reinforcement learning (MARL) problem. Key challenges:

  • Non-stationarity: Each agent's environment includes the actions of all other agents, which change as they learn. From agent $k$'s perspective, the environment is non-stationary.
  • Credit assignment: The reward (sum rate) depends on all agents' actions. Attributing credit to individual agents is difficult.
  • Communication overhead: Centralised training with decentralised execution (CTDE) is the dominant paradigm: agents share information during training but act independently at deployment.

The MARL formulation connects directly to game theory: the power control problem is an interference game, and the RL agents implicitly converge to (or oscillate around) a Nash equilibrium. Whether this is close to the socially optimal (sum-rate-maximising) allocation depends on the reward design.
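A minimal sketch of the simplest MARL baseline for this setting, independent Q-learners: each link keeps its own Q-table over its local (quantised) channel state and updates it with the shared sum-rate reward. The fading model, quantiser, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K, LEVELS, SIGMA2 = 3, np.array([0.5, 1.0, 2.0]), 1.0
ALPHA, EPS, N_STATES = 0.1, 0.2, 3                 # 3 local states: low / medium / high gain
Q = np.zeros((K, N_STATES, len(LEVELS)))           # one Q-table per agent

def quantise(g, edges=(0.5, 1.5)):
    return int(np.digitize(g, edges))

for episode in range(5000):
    h_sq = rng.exponential(1.0, K)                 # |h_k|^2 ~ Exp(1) (Rayleigh fading)
    states = [quantise(g) for g in h_sq]
    # Each agent picks its own power level epsilon-greedily from its local state only
    actions = [rng.integers(len(LEVELS)) if rng.random() < EPS
               else int(Q[k, states[k]].argmax()) for k in range(K)]
    p = LEVELS[actions]
    rx = h_sq * p
    reward = float(np.log2(1 + rx / (rx.sum() - rx + SIGMA2)).sum())   # shared sum rate
    for k in range(K):                             # every agent updates with the same reward
        Q[k, states[k], actions[k]] += ALPHA * (reward - Q[k, states[k], actions[k]])
```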

Markov Decision Process (MDP)

A sequential decision framework defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transition probabilities, rewards, and discount factor. The agent learns a policy $\pi(a|s)$ to maximise cumulative discounted reward. In wireless, states encode channel/network conditions and actions are resource allocation decisions.

Related: Q-Learning

Q-Learning

A model-free RL algorithm that learns the optimal state-action value function $Q^*(s, a)$ via temporal-difference updates. The policy is derived greedily: $\pi^*(s) = \arg\max_a Q(s, a)$. Tabular Q-learning converges to the optimum for finite MDPs; deep Q-networks (DQN) extend it to large state spaces via neural-network function approximation.

Related: Markov Decision Process (MDP)

⚠️ Engineering Note

Sample Efficiency Challenges in RL for Wireless

Reinforcement learning for wireless resource allocation faces severe sample efficiency challenges in practice:

  • Training episodes: Tabular Q-learning for a 4-user power control problem with 5 power levels typically requires $10^3$--$10^4$ episodes to converge. Deep RL (DQN, PPO) for realistic 20-user systems may need $10^5$--$10^6$ episodes.
  • Simulation cost: Each episode requires a channel realisation and SINR computation. At $10^6$ episodes, this takes minutes in simulation but would take days of real-time interaction.
  • Sim-to-real gap: Policies trained in simulation may not transfer to real deployments due to model mismatch. Domain randomisation (varying channel models during training) and fine-tuning on real data help but add complexity.
  • Non-stationarity: The wireless environment changes due to mobility, traffic load, and network reconfiguration. RL agents must continuously adapt, but retraining is expensive.
  • O-RAN RIC timescales: The near-RT RIC operates at 10--100 ms granularity, while the non-RT RIC operates at timescales above 1 s. RL inference must fit within these timescales.
Practical Constraints
  • Tabular Q-learning convergence: 10³-10⁴ episodes (small systems)
  • Deep RL convergence: 10⁵-10⁶ episodes (large systems)
  • O-RAN near-RT RIC: 10-100 ms decision timescale

Quick Check

In $\epsilon$-greedy Q-learning with $\epsilon = 0.1$, the agent takes a random action 10% of the time. If $\epsilon$ is set to 0 (pure greedy) from the start, what is the primary risk?

  • The Q-table will overflow due to too many updates
  • The agent will converge faster because it always picks the best action
  • The agent may get stuck in a suboptimal policy because it never explores alternative actions whose Q-values were initialised to zero
  • The algorithm will not converge because the Bellman equation requires stochastic policies