Markov Decision Processes and Environments

Why Reinforcement Learning?

An RL agent learns to make sequential decisions by interacting with an environment. Unlike supervised learning, there are no labels, only reward signals. This framework naturally models resource allocation, power control, and scheduling in wireless networks.

Definition:

Markov Decision Process (MDP)

An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

  • $\mathcal{S}$: state space
  • $\mathcal{A}$: action space
  • $P(s' \mid s, a)$: transition probability
  • $R(s, a, s')$: reward function
  • $\gamma \in [0, 1)$: discount factor

The Markov property: $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$. The next state depends only on the current state and action, not on the rest of the history.

Definition:

Policy and Value Functions

A policy $\pi(a \mid s)$ maps states to distributions over actions.

State-value function: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s\right]$

Action-value function: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_t \mid s_0 = s, a_0 = a\right]$

Definition:

Bellman Equation

The optimal value function satisfies:

$$V^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s')\right]$$

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$$
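
The backup above can be applied repeatedly to any initial value function. Below is a minimal value-iteration sketch on a hypothetical two-state, two-action MDP; all transition probabilities, rewards, and the tolerance are illustrative numbers, not from any real problem.

import numpy as np

# Toy MDP (hypothetical numbers): P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    V_new = np.max(R + gamma * (P @ V), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # contraction guarantees convergence
        break
    V = V_new
print(V)  # approximate V*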

Definition:

Gymnasium (Gym) Environment Interface

The standard RL environment API:

import gymnasium as gym

# Create the environment and get the initial observation
env = gym.make('CartPole-v1')
obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # random policy: sample uniformly from the action space
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # episode ended naturally or hit the time limit
        obs, info = env.reset()
env.close()

Definition:

Discounted Return

The discounted return from time tt:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

With $\gamma = 0.99$, rewards 100 steps in the future are worth $\gamma^{100} \approx 0.37$ of immediate rewards.
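
A short sketch of how returns are computed in practice for a finite episode, using the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

def discounted_return(rewards, gamma=0.99):
    """Backward recursion G_t = R_{t+1} + gamma * G_{t+1} over a finite episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

print(discounted_return([0.0, 0.0, 1.0]))  # [gamma^2, gamma, 1.0]
print(0.99 ** 100)                         # ~0.366: the factor quoted above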

Definition:

Epsilon-Greedy Exploration

With probability $\varepsilon$, take a random action; otherwise take the greedy action $\arg\max_a Q(s, a)$. Decay $\varepsilon$ from 1.0 to 0.01 over training.
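
A minimal sketch, assuming a tabular $Q$ stored as a NumPy array indexed by state and action; the linear decay schedule is one common choice among several.

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit argmax_a Q[state, a]."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def epsilon_schedule(step, decay_steps, eps_start=1.0, eps_end=0.01):
    """Linear decay from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)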

Theorem: Bellman Optimality and Convergence

For finite MDPs, the Bellman optimality equation has a unique solution $Q^*$. Q-learning converges to $Q^*$ with probability 1 if all state-action pairs are visited infinitely often and the learning rate satisfies $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Q-learning is a stochastic approximation of value iteration, which converges because the Bellman operator is a contraction mapping.
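
A small numerical sanity check of the step-size conditions: the common per-visit schedule $\alpha_t = 1/t$ satisfies both, since the harmonic sum diverges while the sum of squares stays finite.

import numpy as np

t = np.arange(1, 1_000_001)
alphas = 1.0 / t
print(alphas.sum())          # grows without bound as t -> infinity (~14.4 at t = 10^6)
print((alphas ** 2).sum())   # converges to pi^2 / 6 ~ 1.645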

Theorem: Policy Gradient Theorem

For a parameterised policy $\pi_\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot Q^{\pi_\theta}(s, a)\right]$$

This gradient can be estimated from sampled trajectories without knowing the transition model $P$.

Increase the probability of actions that lead to high returns. The log-probability trick converts the expectation over actions into a form that can be estimated via sampling.
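
A minimal sketch of the resulting loss in PyTorch, assuming `policy` is a network mapping a batch of states to action logits (a hypothetical helper, not defined in these notes); minimising this loss performs ascent along the policy gradient, with Monte Carlo returns standing in for $Q^{\pi_\theta}$.

import torch

def reinforce_loss(policy, states, actions, returns):
    """Score-function estimator: gradient of this loss equals
    -E[grad log pi(a|s) * G_t], the negated policy gradient."""
    logits = policy(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(chosen * returns).mean()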

Theorem: Advantage Function and Variance Reduction

The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action $a$ is than the policy's average behaviour in state $s$. Using the advantage in the policy gradient (instead of $Q$) reduces variance without introducing bias:

$$\nabla_\theta J = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot A^{\pi_\theta}(s, a)\right]$$

Subtracting the baseline V(s)V(s) reduces noise in the gradient estimate by centring the returns around their expected value.
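
A small sketch of the estimator, assuming `returns` and `values` are PyTorch tensors holding Monte Carlo returns and baseline predictions; the final normalisation is a common practical variance-reduction trick, not required for the result above.

def advantage_estimates(returns, values):
    """A_hat(s,a) = G_t - V(s_t): subtract the learned baseline from the returns."""
    adv = returns - values
    return (adv - adv.mean()) / (adv.std() + 1e-8)  # optional normalisation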

Example: Custom Gymnasium Environment

Implement a simple resource allocation environment.
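
A minimal sketch under the standard Gymnasium API; the single-user channel model, power cost, and i.i.d. fading are toy placeholders for illustration, not a calibrated wireless model.

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ResourceAllocEnv(gym.Env):
    """Toy resource allocation: each step, pick one of n_levels power levels;
    reward is log(1 + SNR) throughput minus a power cost."""

    def __init__(self, n_levels=4, max_steps=50):
        super().__init__()
        self.n_levels = n_levels
        self.max_steps = max_steps
        self.action_space = spaces.Discrete(n_levels)  # power level index
        self.observation_space = spaces.Box(0.0, 10.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        self.gain = self.np_random.uniform(0.1, 10.0)  # current channel gain
        return np.array([self.gain], dtype=np.float32), {}

    def step(self, action):
        power = (action + 1) / self.n_levels            # map index to (0, 1]
        snr = self.gain * power
        reward = float(np.log1p(snr)) - 0.1 * power     # throughput minus power cost
        self.gain = self.np_random.uniform(0.1, 10.0)   # i.i.d. fading (simplification)
        self.t += 1
        truncated = self.t >= self.max_steps
        return np.array([self.gain], dtype=np.float32), reward, False, truncated, {}

# Sanity check of the interface
env = ResourceAllocEnv()
obs, info = env.reset(seed=0)
obs, r, term, trunc, info = env.step(env.action_space.sample())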

Example: Tabular Q-Learning

Implement Q-learning for a small discrete environment.
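
A compact sketch using Gymnasium's FrozenLake-v1 as the small discrete environment; the learning rate, decay schedule, and episode count are illustrative choices.

import numpy as np
import gymnasium as gym

env = gym.make('FrozenLake-v1')
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 1.0

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy behaviour policy
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        # Q-learning update: bootstrap from the greedy value of the next state
        target = r + gamma * np.max(Q[s2]) * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s, done = s2, terminated or truncated
    eps = max(0.01, eps * 0.999)  # decay exploration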

Example: Deep Q-Network (DQN)

Implement DQN with experience replay and target network.
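
A condensed sketch on CartPole-v1, assuming PyTorch; network size, buffer size, and schedules are illustrative, and common refinements (Huber loss, gradient clipping, Double DQN) are omitted for brevity.

import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make('CartPole-v1')
n_obs = env.observation_space.shape[0]
n_act = env.action_space.n

def make_net():
    return nn.Sequential(nn.Linear(n_obs, 128), nn.ReLU(), nn.Linear(128, n_act))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())     # start with identical weights
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=50_000)                      # experience replay buffer
gamma, batch_size, eps = 0.99, 64, 1.0

obs, _ = env.reset()
for step in range(20_000):
    # Epsilon-greedy behaviour policy
    if random.random() < eps:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = int(q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.append((obs, action, reward, next_obs, terminated))  # bootstrap through truncation
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()
    eps = max(0.01, eps - 1e-4)                    # linear decay to 0.01

    if len(buffer) >= batch_size:
        batch = random.sample(buffer, batch_size)  # random sampling breaks temporal correlation
        s, a, r, s2, d = (np.array(x) for x in zip(*batch))
        s, s2 = (torch.as_tensor(x, dtype=torch.float32) for x in (s, s2))
        q = q_net(s).gather(1, torch.as_tensor(a).unsqueeze(1)).squeeze(1)
        with torch.no_grad():                      # targets come from the frozen network
            target = (torch.as_tensor(r, dtype=torch.float32)
                      + gamma * target_net(s2).max(1).values
                      * (1 - torch.as_tensor(d, dtype=torch.float32)))
        loss = nn.functional.mse_loss(q, target)
        opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:                            # hard target-network update every C steps
        target_net.load_state_dict(q_net.state_dict())

env.close()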

Interactive Demo: Q-Value Landscape

Visualise Q-values for a simple gridworld.

Interactive Demo: RL Training Reward Curves

Compare DQN, REINFORCE, and PPO learning curves.

Interactive Demo: Exploration vs Exploitation

See the effect of epsilon on learning.

Interactive Demo: RL Agent Learning

Watch an agent learn to navigate a gridworld.

Figure: MDP Interaction Loop

The agent observes the state, takes an action, receives a reward, and transitions to the next state.

Figure: RL Algorithm Taxonomy

Value-based methods (DQN), policy-based methods (REINFORCE), and actor-critic methods (PPO, SAC).

Quick Check

What does the discount factor gamma control in RL?

  • The learning rate
  • How much the agent values future vs immediate rewards (correct)
  • The exploration rate

Quick Check

Why does DQN use a target network?

  • To double the model capacity
  • To stabilise training by providing a slowly-updated target for the Bellman equation (correct)
  • To speed up training

Quick Check

What is the key advantage of policy gradient methods over DQN?

  • They converge faster
  • They naturally handle continuous action spaces (correct)
  • They use less memory

Common Mistake: Not Using Target Network in DQN

Mistake:

Using the same network for both Q-value prediction and target computation.

Correction:

Use a separate target network, updated either by a hard copy every C steps or by a soft (Polyak) update with coefficient tau.
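
A sketch of both update variants, assuming `q_net`, `target_net`, and `step` as in the DQN example above; the values of C and tau are typical choices, not prescriptions.

import torch

tau, C = 0.005, 1_000

# Hard update: copy the online weights every C steps
if step % C == 0:
    target_net.load_state_dict(q_net.state_dict())

# Soft (Polyak) update: blend in a small fraction tau every step
with torch.no_grad():
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)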

Common Mistake: Sparse Rewards

Mistake:

Environment gives reward only at episode end, making credit assignment extremely difficult.

Correction:

Use reward shaping, curiosity-driven exploration, or hindsight experience replay.
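
One standard remedy is potential-based reward shaping (Ng, Harada and Russell, 1999), which provably preserves the optimal policy. A minimal sketch with a hypothetical gridworld potential (negative Manhattan distance to an assumed goal cell):

def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next) - potential(s)

goal = (3, 3)  # hypothetical goal cell for illustration
potential = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))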

Common Mistake: Setting Gamma = 1.0

Mistake:

Using gamma=1 in infinite-horizon tasks, causing divergent returns.

Correction:

Use gamma < 1 (typically 0.99) for infinite-horizon MDPs.

Key Takeaway

RL learns from interaction. DQN handles discrete actions; PPO handles both discrete and continuous action spaces. For DQN, use experience replay and a target network for stability. Reward design is as important as algorithm choice.

Key Takeaway

Wireless resource allocation (power control, scheduling, beam management) naturally maps to MDPs. States = channel conditions, actions = resource allocations, rewards = throughput or QoS.

Why This Matters: DQN for Power Control

A DQN agent can learn optimal power allocation in multi-cell networks. State: channel gains and interference levels. Action: discrete power levels. Reward: sum-rate or energy efficiency. The agent learns to coordinate without explicit inter-cell communication.

Historical Note: DQN: Deep RL's Breakthrough

2015

Mnih et al. (2015) demonstrated that a single DQN agent could learn to play Atari games from raw pixels, achieving human-level performance. Experience replay and target networks were the key stabilising innovations.

Historical Note: PPO: Simple and Effective

2017

Schulman et al. (2017) introduced PPO, which constrains policy updates using a clipped surrogate objective. PPO became the default algorithm for many RL applications due to its simplicity and robustness.

MDP

Markov Decision Process: formal framework for sequential decision-making under uncertainty.

DQN

Deep Q-Network: neural network approximation of the Q-function with experience replay.

PPO

Proximal Policy Optimization: policy gradient method with clipped surrogate objective.

Experience Replay

Buffer storing past transitions, sampled randomly to break correlation in training data.

Advantage Function

$A(s, a) = Q(s, a) - V(s)$: how much better an action is than the average.

RL Algorithm Comparison

Algorithm  | Type            | Action Space | Stability            | Key Feature
-----------|-----------------|--------------|----------------------|--------------------------------
Q-Learning | Value-based     | Discrete     | Convergent (tabular) | Bellman update
DQN        | Value-based     | Discrete     | Good (with tricks)   | Experience replay + target net
REINFORCE  | Policy gradient | Both         | High variance        | Monte Carlo returns
PPO        | Actor-critic    | Both         | Very stable          | Clipped surrogate objective
SAC        | Actor-critic    | Continuous   | Stable               | Maximum entropy RL