RLHF and Alignment

Definition:

Reinforcement Learning from Human Feedback (RLHF)

RLHF aligns a pre-trained LLM with human preferences in three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs to get $\pi_\text{SFT}$.
  2. Reward Model Training: Train $r_\phi(x, y)$ on human preference data: given two responses $y_w \succ y_l$, minimize $\mathcal{L}_\text{RM} = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$ (sketched in code after this list).
  3. RL Optimization: Optimize the policy $\pi_\theta$ using PPO: $\max_\theta \mathbb{E}_{x,\, y \sim \pi_\theta}\!\left[r_\phi(x, y) - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})\right]$.
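
For concreteness, here is a minimal PyTorch sketch of the stage-2 pairwise loss. The scalar scores stand in for outputs of a real reward model; the function name and dummy tensors are illustrative, not taken from a specific library.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: -log sigma(r(x, y_w) - r(x, y_l)).

    chosen_scores / rejected_scores: (batch,) scalar rewards assigned by the
    reward model to the preferred (y_w) and dispreferred (y_l) responses.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with dummy scores; a real r_phi would compute these from
# (prompt, response) pairs, e.g. via a scalar head on a transformer.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.1])
print(reward_model_loss(chosen, rejected).item())
```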

The KL penalty prevents the policy from diverging too far from the SFT model, maintaining generation quality.
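
To make the penalized objective concrete, here is a hedged sketch of how the quantity $r_\phi(x, y) - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})$ might be estimated per sampled response from token log-probabilities; the sequence-level KL estimate and all names are illustrative choices, not a prescribed implementation.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Estimate r_phi(x, y) - beta * KL(pi_theta || pi_SFT) per sampled response.

    reward:          (batch,) scalar reward-model scores
    policy_logprobs: (batch, seq_len) log pi_theta for the sampled tokens
    ref_logprobs:    (batch, seq_len) log pi_SFT for the same tokens

    Summing log-ratios over sampled tokens gives a Monte Carlo estimate of the
    sequence-level KL; PPO implementations often apply it per token instead.
    """
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)  # (batch,)
    return reward - beta * kl_estimate
```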

Definition:

Direct Preference Optimization (DPO)

DPO eliminates the reward model by directly optimizing preferences:

$\mathcal{L}_\text{DPO} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)$

This is equivalent to RLHF with the implicit reward:

$r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$
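
To make the connection explicit (a sketch of the standard substitution rather than a full derivation): plugging this implicit reward into the pairwise reward-model loss $\mathcal{L}_\text{RM}$ from the RLHF recipe recovers the DPO loss above,

$-\log \sigma\big(r(x, y_w) - r(x, y_l)\big) = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right) = \mathcal{L}_\text{DPO},$

so the policy itself plays the role of the reward model.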

DPO is simpler (no RL loop), more stable, and produces comparable results.

DPO has become the dominant alignment method due to its simplicity. It requires only preference pairs and standard supervised training.

Example: DPO Training Loop

Implement the DPO loss for a simplified preference dataset.
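
A minimal sketch of one way to do this, assuming each example provides summed log-probabilities $\log \pi(y \mid x)$ for the chosen and rejected responses under the policy and a frozen reference model; the toy tensors optimized here are illustrative stand-ins for a real model and dataloader.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigma(beta * [(log-ratio for y_w) - (log-ratio for y_l)]).

    Each argument is a (batch,) tensor of summed response log-probs
    log pi(y | x) under the trainable policy or the frozen reference.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy training loop: optimize random "policy log-probs" directly, standing in
# for log-probs produced by an actual LLM on a preference dataset.
policy_logps = torch.randn(8, 2, requires_grad=True)  # col 0: chosen, col 1: rejected
ref_logps = torch.randn(8, 2)                         # frozen reference log-probs
optimizer = torch.optim.SGD([policy_logps], lr=0.1)

for step in range(5):
    loss = dpo_loss(policy_logps[:, 0], policy_logps[:, 1],
                    ref_logps[:, 0], ref_logps[:, 1], beta=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: DPO loss = {loss.item():.4f}")
```

Note that this is ordinary supervised training with no sampling or RL loop, which is exactly the simplification the section highlights.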

RLHF Pipeline Visualization

Visualize the three stages of RLHF and their loss curves

Quick Check

What is the main advantage of DPO over RLHF?

DPO produces better aligned models

DPO eliminates the need for a separate reward model and RL optimization

DPO requires less preference data

Common Mistake: Reward Model Hacking

Mistake:

Training the policy too aggressively against the reward model.

Correction:

Without the KL penalty, the policy learns to exploit weaknesses in the reward model rather than genuinely improving. Always use a KL penalty ($\beta \in [0.01, 0.5]$) and monitor for degenerate outputs like excessive length or repetitive patterns.
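
As an illustration of that monitoring advice, one might log rough batch-level diagnostics during the RL stage; the threshold values and names below are arbitrary examples rather than recommended settings.

```python
import torch

def hacking_diagnostics(policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        response_lengths: torch.Tensor,
                        kl_limit: float = 10.0,
                        length_limit: int = 512) -> dict:
    """Flag common symptoms of reward-model hacking from per-token log-probs.

    policy_logprobs / ref_logprobs: (batch, seq_len) token log-probs
    response_lengths:               (batch,) generated-token counts
    """
    kl_per_sample = (policy_logprobs - ref_logprobs).sum(dim=-1)
    mean_kl = kl_per_sample.mean().item()        # drift away from pi_SFT
    mean_len = response_lengths.float().mean().item()  # length inflation
    return {
        "mean_kl": mean_kl,
        "mean_length": mean_len,
        "kl_alarm": mean_kl > kl_limit,
        "length_alarm": mean_len > length_limit,
    }
```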

Why This Matters: RLHF and Reward Design in Wireless

The reward model concept parallels reward design in wireless RL: a network optimizer must balance throughput, latency, and fairness just as an LLM must balance helpfulness, safety, and truthfulness. The challenge of reward specification and reward hacking appears in both domains.

See full treatment in Chapter 38

RLHF

Reinforcement Learning from Human Feedback — a technique that aligns LLMs with human preferences by training a reward model on human comparisons and optimizing the LLM policy via PPO.

Related: DPO (Direct Preference Optimization)

DPO (Direct Preference Optimization)

A simpler alternative to RLHF that directly optimizes a policy from preference pairs without an explicit reward model, using a classification-style loss.

Related: RLHF

Historical Note: The Path to RLHF

2019-2022

RLHF was first applied to language models by Ziegler et al. (2019) at OpenAI. InstructGPT (2022) demonstrated that RLHF on a small 1.3B model could outperform the 175B GPT-3 on human evaluations. This result showed that alignment can be more impactful than scale alone.