RLHF and Alignment

Definition:

Reinforcement Learning from Human Feedback (RLHF)

RLHF aligns a pre-trained LLM with human preferences in three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs to get $\pi_\text{SFT}$.
  2. Reward Model Training: Train $r_\phi(x, y)$ on human preference data: given two responses $y_w \succ y_l$, minimize $\mathcal{L}_\text{RM} = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$ (sketched in code after this list).
  3. RL Optimization: Optimize the policy $\pi_\theta$ using PPO: $\max_\theta \mathbb{E}_{x,\, y \sim \pi_\theta}\!\left[r_\phi(x, y) - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})\right]$.
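
For concreteness, here is a minimal PyTorch sketch of the stage-2 pairwise loss. The scalar scores stand in for outputs of a real reward model; the function name and dummy tensors are illustrative, not taken from a specific library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: -log sigma(r(x, y_w) - r(x, y_l)).

    chosen_scores / rejected_scores: (batch,) scalar rewards assigned by the
    reward model to the preferred (y_w) and dispreferred (y_l) responses.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with dummy scores; a real r_phi would compute these from
# (prompt, response) pairs, e.g. via a scalar head on a transformer.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.1])
print(reward_model_loss(chosen, rejected).item())
```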

The KL penalty prevents the policy from diverging too far from the SFT model, maintaining generation quality.
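
To make the penalized objective concrete, here is a hedged sketch of how the quantity $r_\phi(x, y) - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})$ might be estimated per sampled response from token log-probabilities; the sequence-level KL estimate and all names are illustrative choices, not a prescribed implementation.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Estimate r_phi(x, y) - beta * KL(pi_theta || pi_SFT) per sampled response.

    reward:          (batch,) scalar reward-model scores
    policy_logprobs: (batch, seq_len) log pi_theta for the sampled tokens
    ref_logprobs:    (batch, seq_len) log pi_SFT for the same tokens

    Summing log-ratios over sampled tokens gives a Monte Carlo estimate of the
    sequence-level KL; PPO implementations often apply it per token instead.
    """
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)  # (batch,)
    return reward - beta * kl_estimate
```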

Definition:

Direct Preference Optimization (DPO)

DPO eliminates the reward model by directly optimizing preferences:

$\mathcal{L}_\text{DPO} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)$

This is equivalent to RLHF with the implicit reward:

$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$
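
To make the connection explicit (a sketch of the standard substitution rather than a full derivation): plugging this implicit reward into the pairwise reward-model loss $\mathcal{L}_\text{RM}$ from the RLHF recipe recovers the DPO loss above,

$-\log \sigma\big(r(x, y_w) - r(x, y_l)\big) = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right) = \mathcal{L}_\text{DPO},$

so the policy itself plays the role of the reward model.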

DPO is simpler (no RL loop), more stable, and produces comparable results.

DPO has become the dominant alignment method due to its simplicity. It requires only preference pairs and standard supervised training.

Example: DPO Training Loop

Implement the DPO loss for a simplified preference dataset.
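
A minimal sketch of one way to do this, assuming each example provides summed log-probabilities $\log \pi(y \mid x)$ for the chosen and rejected responses under the policy and a frozen reference model; the toy tensors optimized here are illustrative stand-ins for a real model and dataloader.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigma(beta * [(log-ratio for y_w) - (log-ratio for y_l)]).

    Each argument is a (batch,) tensor of summed response log-probs
    log pi(y | x) under the trainable policy or the frozen reference.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy training loop: optimize random "policy log-probs" directly, standing in
# for log-probs produced by an actual LLM on a preference dataset.
policy_logps = torch.randn(8, 2, requires_grad=True)  # col 0: chosen, col 1: rejected
ref_logps = torch.randn(8, 2)                         # frozen reference log-probs
optimizer = torch.optim.SGD([policy_logps], lr=0.1)

for step in range(5):
    loss = dpo_loss(policy_logps[:, 0], policy_logps[:, 1],
                    ref_logps[:, 0], ref_logps[:, 1], beta=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: DPO loss = {loss.item():.4f}")
```

Note that this is ordinary supervised training with no sampling or RL loop, which is exactly the simplification the section highlights.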

RLHF Pipeline Visualization

Visualize the three stages of RLHF and their loss curves

Quick Check

What is the main advantage of DPO over RLHF?

DPO produces better aligned models

DPO eliminates the need for a separate reward model and RL optimization

DPO requires less preference data

Common Mistake: Reward Model Hacking

Mistake:

Training the policy too aggressively against the reward model.

Correction:

Without the KL penalty, the policy learns to exploit weaknesses in the reward model rather than genuinely improving. Always use a KL penalty ($\beta \in [0.01, 0.5]$) and monitor for degenerate outputs like excessive length or repetitive patterns.
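
As an illustration of that monitoring advice, one might log rough batch-level diagnostics during the RL stage; the threshold values and names below are arbitrary examples rather than recommended settings.

```python
import torch

def hacking_diagnostics(policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        response_lengths: torch.Tensor,
                        kl_limit: float = 10.0,
                        length_limit: int = 512) -> dict:
    """Flag common symptoms of reward-model hacking from per-token log-probs.

    policy_logprobs / ref_logprobs: (batch, seq_len) token log-probs
    response_lengths:               (batch,) generated-token counts
    """
    kl_per_sample = (policy_logprobs - ref_logprobs).sum(dim=-1)
    mean_kl = kl_per_sample.mean().item()        # drift away from pi_SFT
    mean_len = response_lengths.float().mean().item()  # length inflation
    return {
        "mean_kl": mean_kl,
        "mean_length": mean_len,
        "kl_alarm": mean_kl > kl_limit,
        "length_alarm": mean_len > length_limit,
    }
```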

Why This Matters: RLHF and Reward Design in Wireless

The reward model concept parallels reward design in wireless RL: a network optimizer must balance throughput, latency, and fairness just as an LLM must balance helpfulness, safety, and truthfulness. The challenge of reward specification and reward hacking appears in both domains.

See full treatment in Chapter 38

RLHF

Reinforcement Learning from Human Feedback — a technique that aligns LLMs with human preferences by training a reward model on human comparisons and optimizing the LLM policy via PPO.

Related: DPO (Direct Preference Optimization)

DPO (Direct Preference Optimization)

A simpler alternative to RLHF that directly optimizes a policy from preference pairs without an explicit reward model, using a classification-style loss.

Related: RLHF

Historical Note: The Path to RLHF

2019-2022

RLHF was first applied to language models by Ziegler et al. (2019) at OpenAI. InstructGPT (2022) demonstrated that RLHF on a small 1.3B model could outperform the 175B GPT-3 on human evaluations. This result showed that alignment can be more impactful than scale alone.