RLHF and Alignment
Definition: Reinforcement Learning from Human Feedback (RLHF)
RLHF aligns a pre-trained LLM with human preferences in three stages:
- Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs to obtain an initial policy $\pi_{\mathrm{SFT}}$.
- Reward Model Training: Train a reward model $r_\phi$ on human preference data: given a preferred response $y_w$ and a dispreferred response $y_l$ for the same prompt $x$, minimize $\mathcal{L}_{\mathrm{RM}} = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$.
- RL Optimization: Optimize the policy $\pi_\theta$ with PPO against the learned reward, penalizing divergence from the SFT model: $\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid x)\big)$.
The KL penalty prevents the policy from diverging too far from the SFT model, maintaining generation quality.
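A minimal sketch of the second and third stages, assuming the reward model already produces one scalar per (prompt, response) pair; the function names and the kl_coef parameter are illustrative, not from any specific library:

import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry) loss: push r_phi(x, y_w) above r_phi(x, y_l).

    reward_chosen, reward_rejected: shape (batch,) scalar rewards for the
    preferred and dispreferred responses of each comparison.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def kl_penalized_reward(task_reward, policy_logprob, ref_logprob, kl_coef=0.1):
    """Per-sequence reward used in the PPO stage: the learned reward minus a
    KL penalty that keeps pi_theta close to the SFT/reference policy."""
    return task_reward - kl_coef * (policy_logprob - ref_logprob)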
Definition: Direct Preference Optimization (DPO)
DPO eliminates the explicit reward model by optimizing the policy directly on preference pairs:
$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$
This is equivalent to RLHF with the implicit reward $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ (up to a prompt-dependent constant).
DPO is simpler (no RL loop, no separate reward model), more stable to train, and produces comparable results.
DPO has become one of the most widely used alignment methods because of its simplicity: it requires only preference pairs and standard supervised training.
Example: DPO Training Loop
Implement the DPO loss for a simplified preference dataset.
Implementation
import torch
import torch.nn.functional as F

def dpo_loss(policy_logprobs_w, policy_logprobs_l,
             ref_logprobs_w, ref_logprobs_l, beta=0.1):
    """
    policy_logprobs_w: log pi_theta(y_w | x) for winning responses
    policy_logprobs_l: log pi_theta(y_l | x) for losing responses
    ref_logprobs_w/l:  same quantities under the frozen reference model
    beta: temperature controlling how far the policy may move from the reference
    """
    # Log-ratios of policy to reference for winning and losing responses.
    log_ratio_w = policy_logprobs_w - ref_logprobs_w
    log_ratio_l = policy_logprobs_l - ref_logprobs_l
    # DPO logits: scaled difference of implicit rewards; the loss is a
    # binary classification of "winner beats loser".
    logits = beta * (log_ratio_w - log_ratio_l)
    return -F.logsigmoid(logits).mean()
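A quick usage sketch, continuing from the definitions above; the log-probabilities here are random stand-ins for the per-response sums of token log-probs that a real pipeline would compute under the policy and the frozen reference model:

batch = 4
policy_w = torch.randn(batch) - 1.0   # log pi_theta(y_w | x), one value per pair
policy_l = torch.randn(batch) - 1.5   # log pi_theta(y_l | x)
ref_w = torch.randn(batch) - 1.0      # log pi_ref(y_w | x)
ref_l = torch.randn(batch) - 1.5      # log pi_ref(y_l | x)

loss = dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1)

# Implicit reward margins beta * (log_ratio_w - log_ratio_l); the fraction that
# is positive is commonly tracked as "reward accuracy" during DPO training.
margins = 0.1 * ((policy_w - ref_w) - (policy_l - ref_l))
reward_accuracy = (margins > 0).float().mean()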
RLHF Pipeline Visualization
[Interactive figure: the three stages of RLHF and their loss curves.]
Quick Check
What is the main advantage of DPO over RLHF?
DPO produces better aligned models
DPO eliminates the need for a separate reward model and RL optimization
DPO requires less preference data
DPO directly optimizes preferences using a classification-like loss, avoiding the instability of PPO.
Common Mistake: Reward Model Hacking
Mistake:
Training the policy too aggressively against the reward model.
Correction:
Without the KL penalty, the policy learns to exploit weaknesses in the reward model rather than genuinely improving. Always keep the KL term $\beta\,\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\big)$ in the objective and monitor for degenerate outputs such as excessive length or repetitive patterns.
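A small monitoring sketch, assuming per-token log-probs of the sampled responses are available from both the policy and the frozen reference model (function and argument names are illustrative):

import torch

def sequence_kl(policy_logprobs, ref_logprobs, response_mask):
    """Sampled-token estimate of KL(pi_theta || pi_ref) per sequence.

    policy_logprobs, ref_logprobs: log-probs of the generated tokens, shape (batch, seq_len).
    response_mask: 1.0 on response tokens, 0.0 on prompt/padding, shape (batch, seq_len).
    """
    token_kl = (policy_logprobs - ref_logprobs) * response_mask
    return token_kl.sum(dim=-1)

# During RL optimization, a steadily rising KL or ballooning response length is
# an early warning that the policy is exploiting the reward model.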
Why This Matters: RLHF and Reward Design in Wireless
The reward model concept parallels reward design in wireless RL: a network optimizer must balance throughput, latency, and fairness just as an LLM must balance helpfulness, safety, and truthfulness. The challenge of reward specification and reward hacking appears in both domains.
See full treatment in Chapter 38
RLHF
Reinforcement Learning from Human Feedback — a technique that aligns LLMs with human preferences by training a reward model on human comparisons and optimizing the LLM policy via PPO.
Related: DPO (Direct Preference Optimization)
DPO (Direct Preference Optimization)
A simpler alternative to RLHF that directly optimizes a policy from preference pairs without an explicit reward model, using a classification-style loss.
Related: RLHF
Historical Note: The Path to RLHF
2019–2022: RLHF was first applied to language models by Ziegler et al. (2019) at OpenAI. InstructGPT (2022) demonstrated that RLHF on a 1.3B-parameter model could outperform the 175B-parameter GPT-3 in human evaluations. This result showed that alignment can be more impactful than scale alone.