Fine-Tuning Pre-Trained Models

Definition:

Low-Rank Adaptation (LoRA)

LoRA freezes the pre-trained weights $\mathbf{W}_0$ and adds a trainable low-rank decomposition:

$$\mathbf{W} = \mathbf{W}_0 + \frac{\alpha}{r} \mathbf{B}\mathbf{A}$$

where $\mathbf{A} \in \mathbb{R}^{r \times d}$ and $\mathbf{B} \in \mathbb{R}^{d \times r}$ with rank $r \ll d$.

Trainable parameters: $2rd$ per layer (vs. $d^2$ for full fine-tuning).

With $r = 16$ and $d = 4096$, LoRA adapters across a 7B model's attention projections add only about 0.2% of its parameters while achieving performance comparable to full fine-tuning.
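
As a concrete sketch of the update, here is a minimal PyTorch module that wraps a frozen linear layer with a LoRA adapter (class and variable names are illustrative, not taken from the peft library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0 x + (alpha/r) * B A x, with W0 frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B: zeros, so W starts at W0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n_trainable)  # 2 * 16 * 4096 = 131072, vs. 4096^2 ≈ 16.8M for the full matrix
```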

Definition:

QLoRA: Quantized LoRA

QLoRA combines 4-bit quantization of base weights with LoRA adapters:

  1. Quantize base model to NF4 (4-bit NormalFloat)
  2. Add LoRA adapters in fp16/bf16
  3. Train only the LoRA parameters

This allows fine-tuning a 65B model on a single 48GB GPU.
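
A minimal loading sketch using the Hugging Face transformers, bitsandbytes, and peft libraries; the model id is a placeholder, and exact argument names can shift between library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Step 1: quantize the base model to 4-bit NF4 on load
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model id
    quantization_config=bnb_config,
)

# Steps 2-3: add bf16 LoRA adapters and train only those
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```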

Definition:

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods modify only a small fraction of model parameters:

| Method | Trainable Params | Approach |
| --- | --- | --- |
| LoRA | 0.1-1% | Low-rank weight update |
| Prefix Tuning | ~0.1% | Learnable prefix tokens |
| Adapters | 1-5% | Small bottleneck layers |
| Full FT | 100% | Update all parameters |

Definition:

SFT Training Data Format

Supervised fine-tuning uses instruction-response pairs:

{
  "instruction": "Compute the SNR for received power -80 dBm and noise -100 dBm",
  "response": "SNR = P_rx - N = -80 - (-100) = 20 dB"
}

Quality > quantity: 1000 high-quality examples often outperform 100,000 noisy ones.
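
A minimal sketch of rendering such a pair into a single training string (the "### Instruction / ### Response" template is one common convention, not a fixed standard):

```python
import json

def format_example(pair: dict) -> str:
    """Render an instruction-response pair as a single training string."""
    return (
        f"### Instruction:\n{pair['instruction']}\n\n"
        f"### Response:\n{pair['response']}"
    )

pair = json.loads("""{
  "instruction": "Compute the SNR for received power -80 dBm and noise -100 dBm",
  "response": "SNR = P_rx - N = -80 - (-100) = 20 dB"
}""")
print(format_example(pair))
```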

Definition:

Fine-Tuning Learning Rate

Fine-tuning requires much lower learning rates than pre-training:

$$\eta_\text{FT} \approx 10^{-5} \text{ to } 10^{-4}$$

Use cosine decay with warmup: after the warmup steps, $\eta(t) = \eta_{\max} \cdot \frac{1 + \cos(\pi t / T)}{2}$, where $T$ is the total number of decay steps.
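
A sketch of the full schedule with a linear warmup phase prepended to the cosine decay (the warmup length and peak rate below are illustrative):

```python
import math

def lr_schedule(t: int, T: int, eta_max: float = 1e-4, warmup: int = 100) -> float:
    """Linear warmup to eta_max, then cosine decay to zero at step T."""
    if t < warmup:
        return eta_max * t / warmup
    progress = (t - warmup) / max(1, T - warmup)
    return eta_max * (1 + math.cos(math.pi * progress)) / 2

T = 1000
for t in (0, 100, 550, 1000):
    print(t, f"{lr_schedule(t, T):.2e}")  # 0 -> peak -> half -> 0
```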

Theorem: LoRA Rank Selection

For a weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times d}$, the expressiveness of the LoRA update is bounded by:

$$\text{rank}(\Delta\mathbf{W}) \le r$$

Empirically, $r \in [4, 64]$ is sufficient for most tasks. The optimal $r$ depends on the task complexity and dataset size.

Fine-tuning for a specific domain shifts weights in a low-dimensional subspace. LoRA captures this subspace with rank $r$.
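
The rank bound is easy to verify numerically; a quick NumPy check (matrix sizes chosen small for speed):

```python
import numpy as np

d, r = 512, 16                      # smaller d than 4096, for a fast check
rng = np.random.default_rng(0)
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
delta_W = B @ A                     # the LoRA update, a d x d matrix

# Numerical rank: number of non-negligible singular values
print(np.linalg.matrix_rank(delta_W))   # 16, never more than r
```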

Theorem: Catastrophic Forgetting in Fine-Tuning

Full fine-tuning on domain data $\mathcal{D}_\text{new}$ can degrade performance on the original distribution $\mathcal{D}_\text{orig}$:

$$L(\theta_\text{FT}, \mathcal{D}_\text{orig}) > L(\theta_0, \mathcal{D}_\text{orig})$$

LoRA mitigates this because the original weights $\mathbf{W}_0$ are frozen.

When you overwrite weights with new task knowledge, old knowledge can be lost. LoRA adds rather than replaces.

Theorem: Fine-Tuning Data Scaling

Fine-tuning performance on downstream tasks follows:

$$\text{Perf}(n) \approx a - b \cdot n^{-\gamma}$$

where $n$ is the number of training examples and $\gamma \approx 0.3$ to $0.5$. Beyond ~10K examples, returns diminish significantly for most tasks.

The pre-trained model already has strong representations. Fine-tuning mainly teaches it the task format and domain conventions.
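
Plugging illustrative constants into the law makes the diminishing returns concrete (the values of $a$, $b$, and $\gamma$ below are invented for illustration):

```python
# Perf(n) = a - b * n^(-gamma), with illustrative constants
a, b, gamma = 0.90, 0.8, 0.4

for n in (100, 1_000, 10_000, 100_000):
    perf = a - b * n ** -gamma
    print(f"n = {n:>7,}: Perf = {perf:.3f}")
# Going from 100 to 10K examples gains ~0.11;
# going from 10K to 100K gains only ~0.01.
```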

Example: Fine-Tuning with LoRA using PEFT

Fine-tune a 7B model on wireless communication Q&A data using LoRA.
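
A condensed sketch of the setup with peft and the Hugging Face Trainer; the model id, data file, and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"           # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"]))

data = load_dataset("json", data_files="wireless_qa.json")["train"]  # placeholder file
data = data.map(
    lambda ex: tokenizer(
        f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}",
        truncation=True, max_length=512),
    remove_columns=data.column_names,
)

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments("lora-out", num_train_epochs=2,
                           learning_rate=1e-4, per_device_train_batch_size=4),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```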

Example: Preparing Fine-Tuning Data

Convert wireless paper abstracts into instruction-tuning format.
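
One possible conversion script, pairing each abstract with a templated question (the field names and template are assumptions about the data):

```python
import json

def abstract_to_pair(title: str, abstract: str) -> dict:
    """Turn a paper abstract into one instruction-response pair."""
    return {
        "instruction": f"Summarize the key contribution of the paper '{title}'.",
        "response": abstract.strip(),
    }

papers = [{"title": "Deep Learning for mmWave Beam Selection",
           "abstract": "We propose a CNN-based beam selection scheme ..."}]
pairs = [abstract_to_pair(p["title"], p["abstract"]) for p in papers]
with open("sft_data.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")   # one JSON pair per line
```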

Example: Evaluating Fine-Tuned Models

Evaluate a fine-tuned model on domain-specific benchmarks.
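
A bare-bones exact-match scorer as a starting point (real domain benchmarks use more robust matching than string equality):

```python
def exact_match_eval(generate, test_set):
    """Score answers against references by exact string match.

    `generate` is any callable mapping a prompt string to a model answer.
    """
    correct = 0
    for ex in test_set:
        answer = generate(ex["instruction"]).strip()
        correct += answer == ex["response"].strip()
    return correct / len(test_set)

# Usage with a stub "model" for illustration:
test_set = [{"instruction": "2 + 2?", "response": "4"}]
print(exact_match_eval(lambda prompt: "4", test_set))  # 1.0
```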

LoRA Rank Explorer

Explore how LoRA rank affects parameter count and performance

Fine-Tuning Loss Comparison

Compare training dynamics for different fine-tuning methods

GPU Memory Estimator

Estimate GPU memory for different fine-tuning configurations

LoRA Weight Update Animation

Visualize how LoRA adapters modify attention weights during training

Quantization Explorer

Compare model sizes and quality across quantization levels

LoRA Architecture

LoRA adds trainable low-rank matrices alongside frozen pre-trained weights.

Fine-Tuning Pipeline

From data preparation through training to evaluation and deployment.

Quick Check

Why does LoRA work with such low rank (r=16) for fine-tuning a 4096-dimensional model?

The model is not actually learning anything

Fine-tuning shifts weights in a low-dimensional subspace

Higher ranks would cause overfitting

Quick Check

What is the main advantage of QLoRA over standard LoRA?

Better model quality

4x reduction in base model memory through NF4 quantization

Faster training speed

Quick Check

For fine-tuning, which is more important: data quality or data quantity?

Data quantity — more data is always better

Data quality — 1000 excellent examples often beat 100K noisy ones

Neither — the model architecture matters most

Common Mistake: Overfitting Small Datasets

Mistake:

Training for too many epochs on small fine-tuning datasets.

Correction:

Use early stopping, monitor validation loss, and keep epochs low (1-3 for instruction tuning). The pre-trained model already has strong representations.
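
With the Hugging Face Trainer, for example, early stopping can be wired in roughly like this (argument names follow recent transformers versions):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="sft-out",
    num_train_epochs=3,               # keep epochs low for instruction tuning
    eval_strategy="steps",            # "evaluation_strategy" in older versions
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)
# Stop when validation loss fails to improve for 3 consecutive evaluations
callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]
```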

Common Mistake: Applying LoRA to Wrong Modules

Mistake:

Only applying LoRA to query projections.

Correction:

Apply LoRA to all attention projections (Q, K, V, O) and optionally the MLP layers. This provides the best performance-efficiency trade-off.
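
In peft, the fix is a one-line change to `target_modules` (the module names below follow LLaMA-style naming; other architectures use different names):

```python
from peft import LoraConfig

# Too narrow: adapts only the query projection
narrow = LoraConfig(r=16, target_modules=["q_proj"])

# Better: all attention projections, optionally plus the MLP layers
broad = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```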

Common Mistake: Evaluation Data Contamination

Mistake:

Including evaluation data in the training set accidentally.

Correction:

Always split data before any processing. Use held-out test sets that are never seen during training or hyperparameter tuning.
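
A minimal sketch of a deterministic split applied to the raw examples, before any tokenization or filtering (the hash-on-instruction scheme and 90/10 ratio are illustrative):

```python
import hashlib

def split(examples, test_frac=0.1):
    """Deterministically split raw examples BEFORE any processing step."""
    train, test = [], []
    for ex in examples:
        h = int(hashlib.sha256(ex["instruction"].encode()).hexdigest(), 16)
        (test if (h % 100) < test_frac * 100 else train).append(ex)
    return train, test

train, test = split([{"instruction": f"Q{i}", "response": f"A{i}"}
                     for i in range(1000)])
print(len(train), len(test))  # roughly 900 / 100, and stable across reruns
```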

Key Takeaway

LoRA fine-tuning modifies only 0.1-1% of parameters while achieving comparable results to full fine-tuning. QLoRA further reduces memory requirements by 4x through base weight quantization.

Key Takeaway

Data quality dominates quantity for fine-tuning. A well-curated dataset of 1000-5000 examples is typically sufficient for domain adaptation. Focus on diverse, high-quality instruction-response pairs.

Why This Matters: Domain-Specific LLMs for Telecom

Fine-tuning on 3GPP specifications, IEEE papers, and simulation code creates LLMs that understand telecom terminology, can generate correct simulation code, and reason about system-level design trade-offs. This is a growing area of applied research.

See full treatment in Chapter 38

Historical Note: The LoRA Breakthrough

2021

Hu et al. (2021) at Microsoft introduced LoRA, showing that the weight updates during fine-tuning are inherently low-rank. This enabled fine-tuning GPT-3 (175B) with only ~18M trainable parameters, making large model adaptation accessible to researchers without massive GPU clusters.

Historical Note: QLoRA Democratization

2023

Dettmers et al. (2023) showed that combining 4-bit quantization with LoRA allows fine-tuning a 65B model on a single 48GB GPU. This democratized LLM fine-tuning, enabling graduate students to customize frontier-class models.

Fine-Tuning Method Comparison

| Method | Trainable % | Memory (7B) | Quality | Speed |
| --- | --- | --- | --- | --- |
| Full FT | 100% | ~60 GB | Best | Slowest |
| LoRA (r=16) | 0.08% | ~16 GB | Near-best | Fast |
| QLoRA (r=16) | 0.08% | ~6 GB | Good | Fast |
| Prefix Tuning | 0.1% | ~15 GB | Good | Fast |
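
Rough arithmetic behind the memory column, as a sketch that ignores activations and framework overhead (the ~17M LoRA parameter count assumes r=16 on all attention projections; exact figures will differ somewhat from the table):

```python
GB = 1e9

def estimate_gb(n_params, n_trainable, bytes_per_weight):
    """Base weights plus fp16 grads and fp32 Adam moments for trainables."""
    base = n_params * bytes_per_weight            # stored model weights
    overhead = n_trainable * (2 + 4 + 4)          # grads + Adam m and v
    return (base + overhead) / GB

n, n_lora = 7e9, 17e6
print(f"Full FT: ~{estimate_gb(n, n, 2):.0f} GB")        # fp16 base, all trainable
print(f"LoRA:    ~{estimate_gb(n, n_lora, 2):.0f} GB")   # frozen fp16 base
print(f"QLoRA:   ~{estimate_gb(n, n_lora, 0.5):.0f} GB") # 4-bit NF4 base
```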

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method that adds trainable low-rank matrices to frozen pre-trained weights, requiring only 0.1-1% additional parameters.

QLoRA

Quantized LoRA — combines 4-bit NormalFloat quantization of base weights with LoRA adapters, enabling fine-tuning of large models on consumer GPUs.

PEFT

Parameter-Efficient Fine-Tuning — a family of methods that fine-tune only a small subset of model parameters, including LoRA, adapters, and prefix tuning.

SFT (Supervised Fine-Tuning)

Fine-tuning a pre-trained model on curated instruction-response pairs to improve instruction following.

Catastrophic Forgetting

The tendency of neural networks to lose previously learned knowledge when fine-tuned on new data, mitigated by LoRA and other regularization techniques.