Markov Decision Processes and Environments

Why Reinforcement Learning?

An RL agent learns to make sequential decisions by interacting with an environment. Unlike supervised learning, there are no labels, only reward signals. This framework naturally models resource allocation, power control, and scheduling in wireless networks.

Definition:

Markov Decision Process (MDP)

An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • $\mathcal{S}$: state space
  • $\mathcal{A}$: action space
  • $P(s' \mid s, a)$: transition probability
  • $R(s, a, s')$: reward function
  • $\gamma \in [0, 1)$: discount factor

The Markov property: $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$. The next state depends only on the current state and action, not on the rest of the history.

Definition:

Policy and Value Functions

A policy $\pi(a \mid s)$ maps states to distributions over actions.

State-value function: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s\right]$

Action-value function: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s, a_0 = a\right]$

Definition:

Bellman Equation

The optimal value function satisfies:

$$V^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s')\right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$
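
The backup above can be applied repeatedly to any initial value function. Below is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP; all transition probabilities, rewards, and the tolerance are illustrative numbers, not from any real problem.

import numpy as np

# Toy MDP (hypothetical numbers): P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    V_new = np.max(R + gamma * (P @ V), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # contraction guarantees convergence
        break
    V = V_new
print(V)  # approximate V*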

Definition:

Gymnasium (Gym) Environment Interface

The standard RL environment API:

import gymnasium as gym

# Create the environment and get the initial observation
env = gym.make('CartPole-v1')
obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # random policy: sample uniformly from the action space
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # episode ended naturally or hit the time limit
        obs, info = env.reset()
env.close()

Definition:

Discounted Return

The discounted return from time tt:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

With $\gamma = 0.99$, rewards 100 steps in the future are worth $\gamma^{100} \approx 0.37$ of immediate rewards.
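
A short sketch of how returns are computed in practice for a finite episode, using the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

def discounted_return(rewards, gamma=0.99):
    """Backward recursion G_t = R_{t+1} + gamma * G_{t+1} over a finite episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

print(discounted_return([0.0, 0.0, 1.0]))  # [gamma^2, gamma, 1.0]
print(0.99 ** 100)                         # ~0.366: the factor quoted above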

Definition:

Epsilon-Greedy Exploration

With probability $\varepsilon$, take a random action; otherwise take the greedy action $\arg\max_a Q(s, a)$. Decay $\varepsilon$ from 1.0 to 0.01 over training.
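
A minimal sketch, assuming a tabular $Q$ stored as a NumPy array indexed by state and action; the linear decay schedule is one common choice among several.

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit argmax_a Q[state, a]."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def epsilon_schedule(step, decay_steps, eps_start=1.0, eps_end=0.01):
    """Linear decay from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)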

Theorem: Bellman Optimality and Convergence

For finite MDPs, the Bellman optimality equation has a unique solution $Q^*$. Q-learning converges to $Q^*$ with probability 1 if all state-action pairs are visited infinitely often and the learning rate satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Q-learning is a stochastic approximation of value iteration, which converges because the Bellman operator is a contraction mapping.
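
A small numerical sanity check of the step-size conditions: the common per-visit schedule $\alpha_t = 1/t$ satisfies both, since the harmonic sum diverges while the sum of squares stays finite.

import numpy as np

t = np.arange(1, 1_000_001)
alphas = 1.0 / t
print(alphas.sum())          # grows without bound as t -> infinity (~14.4 at t = 10^6)
print((alphas ** 2).sum())   # converges to pi^2 / 6 ~ 1.645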

Theorem: Policy Gradient Theorem

For a parameterised policy $\pi_\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^{\pi_\theta}(s, a)\right]$$

This gradient can be estimated from sampled trajectories without knowing the transition model $P$.

Increase the probability of actions that lead to high returns. The log-probability trick converts the expectation over actions into a form that can be estimated via sampling.
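
A minimal sketch of the resulting loss in PyTorch, assuming `policy` is a network mapping a batch of states to action logits (a hypothetical helper, not defined in these notes); minimising this loss performs ascent along the policy gradient, with Monte Carlo returns standing in for $Q^{\pi_\theta}$.

import torch

def reinforce_loss(policy, states, actions, returns):
    """Score-function estimator: gradient of this loss equals
    -E[grad log pi(a|s) * G_t], the negated policy gradient."""
    logits = policy(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(chosen * returns).mean()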

Theorem: Advantage Function and Variance Reduction

The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action $a$ is than the policy's average behaviour in state $s$. Using the advantage in the policy gradient (instead of $Q$) reduces variance without introducing bias:

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot A^{\pi_\theta}(s, a)\right]$$

Subtracting the baseline V(s)V(s) reduces noise in the gradient estimate by centring the returns around their expected value.
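
A small sketch of the estimator, assuming `returns` and `values` are PyTorch tensors holding Monte Carlo returns and baseline predictions; the final normalisation is a common practical variance-reduction trick, not required for the result above.

def advantage_estimates(returns, values):
    """A_hat(s,a) = G_t - V(s_t): subtract the learned baseline from the returns."""
    adv = returns - values
    return (adv - adv.mean()) / (adv.std() + 1e-8)  # optional normalisation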

Example: Custom Gymnasium Environment

Implement a simple resource allocation environment.
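
A minimal sketch under the standard Gymnasium API; the single-user channel model, power cost, and i.i.d. fading are toy placeholders for illustration, not a calibrated wireless model.

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ResourceAllocEnv(gym.Env):
    """Toy resource allocation: each step, pick one of n_levels power levels;
    reward is log(1 + SNR) throughput minus a power cost."""

    def __init__(self, n_levels=4, max_steps=50):
        super().__init__()
        self.n_levels = n_levels
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(n_levels)  # power level index
        self.observation_space = spaces.Box(0.0, 10.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.gain = self.np_random.uniform(0.1, 10.0)  # current channel gain
        return np.array([self.gain], dtype=np.float32), {}

    def step(self, action):
        power = (action + 1) / self.n_levels            # map index to (0, 1]
        snr = self.gain * power
        reward = float(np.log1p(snr)) - 0.1 * power     # throughput minus power cost
        self.gain = self.np_random.uniform(0.1, 10.0)   # i.i.d. fading (simplification)
        self.t += 1
        truncated = self.t >= self.max_steps
        return np.array([self.gain], dtype=np.float32), reward, False, truncated, {}

# Sanity check of the interface
env = ResourceAllocEnv()
obs, info = env.reset(seed=0)
obs, r, term, trunc, info = env.step(env.action_space.sample())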

Example: Tabular Q-Learning

Implement Q-learning for a small discrete environment.
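
A compact sketch using Gymnasium's FrozenLake-v1 as the small discrete environment; the learning rate, decay schedule, and episode count are illustrative choices.

import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 1.0

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behaviour policy
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + gamma * np.max(Q[s2]) * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s, done = s2, terminated or truncated
    eps = max(0.01, eps * 0.999)  # decay exploration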

Example: Deep Q-Network (DQN)

Implement DQN with experience replay and target network.
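
A condensed sketch on CartPole-v1, assuming PyTorch; network size, buffer size, and schedules are illustrative, and common refinements (Huber loss, gradient clipping, Double DQN) are omitted for brevity.

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make('CartPole-v1')
n_obs = env.observation_space.shape[0]
n_act = env.action_space.n

def make_net():
    return nn.Sequential(nn.Linear(n_obs, 128), nn.ReLU(), nn.Linear(128, n_act))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())     # start with identical weights
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=50_000)                      # experience replay buffer
gamma, batch_size, eps = 0.99, 64, 1.0

obs, _ = env.reset()
for step in range(20_000):
    # Epsilon-greedy behaviour policy
    if random.random() < eps:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = int(q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, reward, next_obs, terminated))  # bootstrap through truncation
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()
    eps = max(0.01, eps - 1e-4)                    # linear decay to 0.01

    if len(buffer) >= batch_size:
        batch = random.sample(buffer, batch_size)  # random sampling breaks temporal correlation
        s, a, r, s2, d = (np.array(x) for x in zip(*batch))
        s, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in (s, s2))
        q = q_net(s).gather(1, torch.as_tensor(a).unsqueeze(1)).squeeze(1)
        with torch.no_grad():                      # targets come from the frozen network
            target = (torch.as_tensor(r, dtype=torch.float32)
                      + gamma * target_net(s2).max(1).values
                      * (1 - torch.as_tensor(d, dtype=torch.float32)))
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:                            # hard target-network update every C steps
        target_net.load_state_dict(q_net.state_dict())

env.close()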

Interactive Demo: Q-Value Landscape

Visualise Q-values for a simple gridworld.

Interactive Demo: RL Training Reward Curves

Compare DQN, REINFORCE, and PPO learning curves.

Interactive Demo: Exploration vs Exploitation

See the effect of epsilon on learning.

Interactive Demo: RL Agent Learning

Watch an agent learn to navigate a gridworld.

Figure: MDP Interaction Loop

The agent observes the state, takes an action, receives a reward, and transitions to the next state.

Figure: RL Algorithm Taxonomy

Value-based methods (DQN), policy-based methods (REINFORCE), and actor-critic methods (PPO, SAC).

Quick Check

What does the discount factor gamma control in RL?

  • The learning rate
  • How much the agent values future vs immediate rewards (correct)
  • The exploration rate

Quick Check

Why does DQN use a target network?

  • To double the model capacity
  • To stabilise training by providing a slowly-updated target for the Bellman equation (correct)
  • To speed up training

Quick Check

What is the key advantage of policy gradient methods over DQN?

  • They converge faster
  • They naturally handle continuous action spaces (correct)
  • They use less memory

Common Mistake: Not Using Target Network in DQN

Mistake:

Using the same network for both Q-value prediction and target computation.

Correction:

Use a separate target network, updated either by a hard copy every C steps or by a soft (Polyak) update with coefficient tau.
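
A sketch of both update variants, assuming `q_net`, `target_net`, and `step` as in the DQN example above; the values of C and tau are typical choices, not prescriptions.

import torch

tau, C = 0.005, 1_000

# Hard update: copy the online weights every C steps
if step % C == 0:
    target_net.load_state_dict(q_net.state_dict())

# Soft (Polyak) update: blend in a small fraction tau every step
with torch.no_grad():
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)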

Common Mistake: Sparse Rewards

Mistake:

Environment gives reward only at episode end, making credit assignment extremely difficult.

Correction:

Use reward shaping, curiosity-driven exploration, or hindsight experience replay.
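
One standard remedy is potential-based reward shaping (Ng, Harada and Russell, 1999), which provably preserves the optimal policy. A minimal sketch with a hypothetical gridworld potential (negative Manhattan distance to an assumed goal cell):

def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)

goal = (3, 3)  # hypothetical goal cell for illustration
potential = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))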

Common Mistake: Setting Gamma = 1.0

Mistake:

Using gamma=1 in infinite-horizon tasks, causing divergent returns.

Correction:

Use gamma < 1 (typically 0.99) for infinite-horizon MDPs.

Key Takeaway

RL learns from interaction. DQN handles discrete actions; PPO handles both discrete and continuous action spaces. For DQN, use experience replay and a target network for stability. Reward design is as important as algorithm choice.

Key Takeaway

Wireless resource allocation (power control, scheduling, beam management) naturally maps to MDPs. States = channel conditions, actions = resource allocations, rewards = throughput or QoS.

Why This Matters: DQN for Power Control

A DQN agent can learn optimal power allocation in multi-cell networks. State: channel gains and interference levels. Action: discrete power levels. Reward: sum-rate or energy efficiency. The agent learns to coordinate without explicit inter-cell communication.

Historical Note: DQN: Deep RL's Breakthrough

2015

Mnih et al. (2015) demonstrated that a single DQN agent could learn to play Atari games from raw pixels, achieving human-level performance. Experience replay and target networks were the key stabilising innovations.

Historical Note: PPO: Simple and Effective

2017

Schulman et al. (2017) introduced PPO, which constrains policy updates using a clipped surrogate objective. PPO became the default algorithm for many RL applications due to its simplicity and robustness.

MDP

Markov Decision Process: formal framework for sequential decision-making under uncertainty.

DQN

Deep Q-Network: neural network approximation of the Q-function with experience replay.

PPO

Proximal Policy Optimization: policy gradient method with clipped surrogate objective.

Experience Replay

Buffer storing past transitions, sampled randomly to break correlation in training data.

Advantage Function

$A(s, a) = Q(s, a) - V(s)$: how much better an action is than the average.

RL Algorithm Comparison

Algorithm  | Type            | Action Space | Stability            | Key Feature
-----------|-----------------|--------------|----------------------|--------------------------------
Q-Learning | Value-based     | Discrete     | Convergent (tabular) | Bellman update
DQN        | Value-based     | Discrete     | Good (with tricks)   | Experience replay + target net
REINFORCE  | Policy gradient | Both         | High variance        | Monte Carlo returns
PPO        | Actor-critic    | Both         | Very stable          | Clipped surrogate objective
SAC        | Actor-critic    | Continuous   | Stable               | Maximum entropy RL