Markov Decision Processes and Environments
Why Reinforcement Learning?
RL learns to make sequential decisions by interacting with an environment. Unlike supervised learning, there are no labels, only rewards. This framework naturally models resource allocation, power control, and scheduling in wireless networks.
Definition: Markov Decision Process (MDP)
An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: state space
- $\mathcal{A}$: action space
- $P(s' \mid s, a)$: transition probability
- $R(s, a)$: reward function
- $\gamma \in [0, 1)$: discount factor
The Markov property: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$.
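A concrete way to read the tuple is to write a tiny MDP down as arrays. A minimal sketch, assuming a made-up two-state, two-action example (the numbers are purely illustrative):

import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and R[s, a]
P = np.array([[[0.9, 0.1],    # s=0, a=0
               [0.2, 0.8]],   # s=0, a=1
              [[0.0, 1.0],    # s=1, a=0
               [0.5, 0.5]]])  # s=1, a=1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) row is a probability distribution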
Definition: Policy and Value Functions
A policy $\pi(a \mid s)$ maps states to action distributions.
State-value function: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right]$
Action-value function: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]$
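For a small finite MDP, $V^\pi$ can be computed by repeatedly applying the Bellman expectation backup. A sketch assuming the hypothetical `P`, `R`, `gamma` arrays from the MDP example above and a uniform random policy:

def policy_evaluation(P, R, gamma, pi, n_iters=200):
    # Iterate V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)        # Q[s, a] under the current V estimate
        V = (pi * Q).sum(axis=1)       # average over the policy's action distribution
    return V

pi_uniform = np.full(R.shape, 1.0 / R.shape[1])   # pi(a|s) = 1/|A| in every state
V_pi = policy_evaluation(P, R, gamma, pi_uniform)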
Definition: Bellman Equation
The optimal value function satisfies:
$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]$
or equivalently, for action values, $Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$.
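This fixed point can be found numerically by value iteration, which applies the max-backup until the values stop changing. A sketch, again assuming the hypothetical `P`, `R`, `gamma` arrays from the MDP example above:

def value_iteration(P, R, gamma, tol=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)             # one-step lookahead values
        V_new = Q.max(axis=1)               # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # V* and a greedy (optimal) policy
        V = V_new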
Definition: Gymnasium (Gym) Environment Interface
The standard RL environment API:
import gymnasium as gym

env = gym.make('CartPole-v1')
obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
Definition: Discounted Return
The discounted return from time $t$: $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$.
With $\gamma = 0.99$, rewards 100 steps in the future are worth $0.99^{100} \approx 0.37$ of immediate rewards.
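A two-line check of both statements, assuming an arbitrary finite reward sequence:

import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 2.0])                   # r_{t+1}, r_{t+2}, ...
gamma = 0.99
G_t = np.sum(gamma ** np.arange(len(rewards)) * rewards)   # discounted return
print(G_t, 0.99 ** 100)                                    # 0.99**100 ~ 0.366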
Definition: Epsilon-Greedy Exploration
With probability $\epsilon$, take a random action; otherwise take the greedy action $a = \arg\max_{a'} Q(s, a')$. Decay $\epsilon$ from 1.0 to 0.01 over training.
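A sketch of one possible decay schedule; the 50,000-step horizon is an assumption, not a rule:

eps_start, eps_end, decay_steps = 1.0, 0.01, 50_000

def epsilon_at(step):
    # Linear decay from eps_start to eps_end, then held constant
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)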
Theorem: Bellman Optimality and Convergence
For finite MDPs, the Bellman optimality equation has a unique solution $Q^*$. Q-learning converges to $Q^*$ with probability 1 if all state-action pairs are visited infinitely often and the learning rates satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
Q-learning is a stochastic approximation of value iteration, which converges because the Bellman operator is a contraction mapping.
Theorem: Policy Gradient Theorem
For a parameterised policy $\pi_\theta(a \mid s)$:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$
This gradient can be estimated from sampled trajectories without knowing the transition dynamics $P(s' \mid s, a)$.
Increase the probability of actions that lead to high returns. The log-probability trick converts the expectation over actions into a form that can be estimated via sampling.
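A minimal REINFORCE-style loss in PyTorch showing how the log-probability trick is used in practice. This is a sketch: `policy_net` is assumed to output action logits, and `returns` to hold the discounted returns for each step of one episode:

import torch
import torch.nn.functional as F

def reinforce_loss(policy_net, states, actions, returns):
    logits = policy_net(states)                                   # (T, n_actions)
    log_probs = F.log_softmax(logits, dim=-1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
    # Monte Carlo estimate of -E[ log pi(a|s) * G_t ]; minimising it ascends J(theta)
    return -(taken * returns).mean()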
Theorem: Advantage Function and Variance Reduction
The advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ measures how much better action $a$ is than the policy's average behaviour in state $s$. Using the advantage in the policy gradient (instead of $Q^\pi(s, a)$) reduces variance without introducing bias:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right]$
Subtracting the baseline reduces noise in the gradient estimate by centring the returns around their expected value.
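In practice the advantage is usually estimated as the sampled return minus a learned baseline. A sketch assuming a hypothetical `value_net` that predicts $V(s)$:

def advantage_estimates(value_net, states, returns):
    values = value_net(states).squeeze(-1)       # baseline V(s_t)
    advantages = returns - values.detach()       # A_hat_t = G_t - V(s_t)
    # Normalising advantages is a common additional variance-reduction trick
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, values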
Example: Custom Gymnasium Environment
Implement a simple resource allocation environment.
Implementation
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class PowerControlEnv(gym.Env):
    """Toy power-control environment: choose which user to serve each step."""

    def __init__(self, n_users=4):
        super().__init__()
        self.n_users = n_users
        # Observation: normalised channel gain per user
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n_users,), dtype=np.float32)
        # Action: index of the user that receives the power allocation
        self.action_space = spaces.Discrete(n_users)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random
        self.channels = self.np_random.random(self.n_users)
        return self.channels.astype(np.float32), {}

    def step(self, action):
        # SINR of the served user: gain over noise plus aggregate interference
        sinr = self.channels[action] / (0.1 + 0.5 * self.channels.sum())
        reward = float(np.log2(1 + sinr))
        self.channels = self.np_random.random(self.n_users)  # new channel realisation
        return self.channels.astype(np.float32), reward, False, False, {}
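A quick smoke test of the environment above (the episode length of 10 steps is arbitrary):

env = PowerControlEnv(n_users=4)
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"served user {action}, reward = {reward:.3f}")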
Example: Tabular Q-Learning
Implement Q-learning for a small discrete environment.
Implementation
import numpy as np

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))      # n_states, n_actions taken from the environment

for episode in range(1000):
    s = env.reset()[0]
    for t in range(100):
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s])
        s2, r, terminated, truncated, _ = env.step(a)
        # Q-learning update towards the TD target r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        if terminated or truncated:
            break
Example: Deep Q-Network (DQN)
Implement DQN with experience replay and target network.
Key components
import copy
from collections import deque
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

# Experience replay buffer: stores (s, a, r, s', done) transitions
replay = deque(maxlen=10000)

# Target network, updated every C steps (or by soft update)
target_net = copy.deepcopy(policy_net)
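A sketch of a single DQN update that ties these pieces together. The batch size, Huber loss, and optimiser call are typical choices rather than requirements, and `replay` is assumed to hold `(s, a, r, s2, done)` tuples:

import random
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, replay, optimizer, batch_size=64, gamma=0.99):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.as_tensor(np.array(x)) for x in zip(*batch))
    # Q(s, a) for the actions actually taken
    q = policy_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Bellman target from the slowly-updated target network
    with torch.no_grad():
        target = r.float() + gamma * (1 - done.float()) * target_net(s2.float()).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()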
Interactive demos and diagrams:
- Q-Value Landscape: visualise Q-values for a simple gridworld.
- RL Training Reward Curves: compare DQN, REINFORCE, and PPO learning curves.
- Exploration vs Exploitation: see the effect of epsilon on learning.
- RL Agent Learning: watch an agent learn to navigate a gridworld.
- MDP Interaction Loop (diagram).
- RL Algorithm Taxonomy (diagram).
Quick Check
What does the discount factor gamma control in RL?
- The learning rate
- How much the agent values future vs immediate rewards
- The exploration rate
Quick Check
Why does DQN use a target network?
- To double the model capacity
- To stabilise training by providing a slowly-updated target for the Bellman equation
- To speed up training
Quick Check
What is the key advantage of policy gradient methods over DQN?
- They converge faster
- They naturally handle continuous action spaces
- They use less memory
Common Mistake: Not Using Target Network in DQN
Mistake:
Using the same network for both Q-value prediction and target computation.
Correction:
Use a separate target network updated every C steps or with soft update tau.
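Both variants, as a short sketch (C = 1000 and tau = 0.005 are common defaults, not requirements):

# Hard update: copy the online weights into the target network every C steps
if step % C == 0:
    target_net.load_state_dict(policy_net.state_dict())

# Soft (Polyak) update: blend in a small fraction tau of the online weights each step
tau = 0.005
for tp, p in zip(target_net.parameters(), policy_net.parameters()):
    tp.data.copy_(tau * p.data + (1 - tau) * tp.data)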
Common Mistake: Sparse Rewards
Mistake:
Environment gives reward only at episode end, making credit assignment extremely difficult.
Correction:
Use reward shaping, curiosity-driven exploration, or hindsight experience replay.
Common Mistake: Setting Gamma = 1.0
Mistake:
Using gamma=1 in infinite-horizon tasks, causing divergent returns.
Correction:
Use gamma < 1 (typically 0.99) for infinite-horizon MDPs.
Key Takeaway
RL learns from interaction. Use DQN for discrete action spaces and PPO for continuous or very large ones. For value-based methods, experience replay and a target network are essential for stability. Reward design is as important as algorithm choice.
Key Takeaway
Wireless resource allocation (power control, scheduling, beam management) naturally maps to MDPs. States = channel conditions, actions = resource allocations, rewards = throughput or QoS.
Why This Matters: DQN for Power Control
A DQN agent can learn optimal power allocation in multi-cell networks. State: channel gains and interference levels. Action: discrete power levels. Reward: sum-rate or energy efficiency. The agent learns to coordinate without explicit inter-cell communication.
Historical Note: DQN: Deep RL's Breakthrough
Mnih et al. (2015) demonstrated that a single DQN architecture could learn to play dozens of Atari games from raw pixels, reaching human-level performance on many of them. Experience replay and target networks were the key stabilising innovations.
Historical Note: PPO: Simple and Effective
Schulman et al. (2017) introduced PPO, which constrains policy updates using a clipped surrogate objective. PPO became the default algorithm for many RL applications due to its simplicity and robustness.
MDP
Markov Decision Process: formal framework for sequential decision-making under uncertainty.
DQN
Deep Q-Network: neural network approximation of the Q-function with experience replay.
PPO
Proximal Policy Optimization: policy gradient method with clipped surrogate objective.
Experience Replay
Buffer storing past transitions, sampled randomly to break correlation in training data.
Advantage Function
$A(s, a) = Q(s, a) - V(s)$: how much better an action is than the average.
RL Algorithm Comparison
| Algorithm | Type | Action Space | Stability | Key Feature |
|---|---|---|---|---|
| Q-Learning | Value-based | Discrete | Convergent (tabular) | Bellman update |
| DQN | Value-based | Discrete | Good (w/ tricks) | Experience replay + target net |
| REINFORCE | Policy gradient | Both | High variance | Monte Carlo returns |
| PPO | Actor-critic | Both | Very stable | Clipped surrogate objective |
| SAC | Actor-critic | Continuous | Stable | Maximum entropy RL |