Fine-Tuning Pre-Trained Models
Definition: Low-Rank Adaptation (LoRA)
LoRA freezes the pre-trained weights W_0 and adds a trainable low-rank decomposition:
W = W_0 + \Delta W = W_0 + BA
where B \in \mathbb{R}^{d \times r} and A \in \mathbb{R}^{r \times k} with rank r \ll \min(d, k).
Trainable parameters: r(d + k) per layer (vs. dk for full fine-tuning).
With d = k = 4096 and r = 4, LoRA adds only ~0.2% parameters while achieving comparable performance to full fine-tuning.
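The per-layer parameter count can be checked directly; a minimal sketch, using the d = k = 4096 square-projection case from the definition above:

```python
# Trainable-parameter count for one LoRA layer (square 4096 x 4096 projection).
d, k, r = 4096, 4096, 4
full = d * k          # full fine-tuning updates every entry of W
lora = r * (d + k)    # LoRA trains only B (d x r) and A (r x k)
print(f"LoRA fraction: {lora / full:.2%}")  # ~0.20%
```

At r = 16 the fraction rises to roughly 0.8% per layer; the whole-model percentage is lower still when only some projections are adapted.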
Definition: QLoRA: Quantized LoRA
QLoRA combines 4-bit quantization of base weights with LoRA adapters:
- Quantize base model to NF4 (4-bit NormalFloat)
- Add LoRA adapters in fp16/bf16
- Train only the LoRA parameters
This allows fine-tuning a 65B model on a single 48GB GPU.
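A back-of-envelope check of that 48 GB claim; this sketch counts only weight storage and ignores activations, adapter optimizer state, and quantization overhead:

```python
# Rough weight-memory estimate for a 65B-parameter model.
params = 65e9
base_nf4_gb = params * 0.5 / 1e9   # NF4: 4 bits = 0.5 bytes per weight
base_fp16_gb = params * 2.0 / 1e9  # fp16 baseline for comparison
print(f"NF4: {base_nf4_gb} GB, fp16: {base_fp16_gb} GB")
```

The quantized base (~32.5 GB) leaves headroom on a 48 GB GPU for adapters and activations, whereas the fp16 weights alone (~130 GB) would not fit.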
Definition: Parameter-Efficient Fine-Tuning (PEFT)
PEFT methods modify only a small fraction of model parameters:
| Method | Trainable Params | Approach |
|---|---|---|
| LoRA | 0.1-1% | Low-rank weight update |
| Prefix Tuning | 0.1% | Learnable prefix tokens |
| Adapters | 1-5% | Small bottleneck layers |
| Full FT | 100% | Update all parameters |
Definition: SFT Training Data Format
Supervised fine-tuning uses instruction-response pairs:
{
  "instruction": "Compute the SNR for received power -80 dBm and noise -100 dBm",
  "response": "SNR = P_rx - N = -80 - (-100) = 20 dB"
}
Quality > quantity: 1000 high-quality examples often outperform 100,000 noisy ones.
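In practice such records are stored one JSON object per line (JSONL); a minimal sketch building and sanity-checking the record shown above:

```python
import json

# Build one SFT record in the instruction/response format from the definition.
record = {
    "instruction": "Compute the SNR for received power -80 dBm and noise -100 dBm",
    "response": "SNR = P_rx - N = -80 - (-100) = 20 dB",
}
line = json.dumps(record)   # one JSON object per line for JSONL datasets
assert -80 - (-100) == 20   # the arithmetic in the response checks out
print(line)
```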
Definition: Fine-Tuning Learning Rate
Fine-tuning requires much lower learning rates than pre-training: full fine-tuning typically uses \eta \approx 10^{-5} to 5 \times 10^{-5}, while LoRA tolerates larger values around 10^{-4} to 3 \times 10^{-4}.
Use a short linear warmup followed by cosine decay: \eta(t) = \eta_\max \cdot \frac{1 + \cos(\pi t / T)}{2}
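The schedule can be sketched as follows; the warmup length of 100 steps is an illustrative choice, not from the text:

```python
import math

def lr_schedule(t, T, eta_max, warmup=100):
    """Linear warmup to eta_max, then the cosine decay formula above."""
    if t < warmup:
        return eta_max * t / warmup          # linear ramp up
    frac = (t - warmup) / (T - warmup)       # progress through the decay phase
    return eta_max * (1 + math.cos(math.pi * frac)) / 2

# Ramps up, peaks right after warmup, decays to ~0 at step T.
print(lr_schedule(50, 1000, 2e-4), lr_schedule(100, 1000, 2e-4), lr_schedule(1000, 1000, 2e-4))
```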
Theorem: LoRA Rank Selection
For a weight matrix W_0 \in \mathbb{R}^{d \times k}, the expressiveness of the LoRA update is bounded by its rank: \mathrm{rank}(\Delta W) = \mathrm{rank}(BA) \le r.
Empirically, r \in [4, 64] is sufficient for most tasks. The optimal r depends on the task complexity and dataset size.
Fine-tuning for a specific domain shifts weights in a low-dimensional subspace. LoRA captures this subspace with rank r \ll \min(d, k).
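The rank bound is easy to verify numerically: however large the update matrix, its rank never exceeds r.

```python
import numpy as np

# Delta W = B @ A has rank at most r, regardless of the full d x k shape.
rng = np.random.default_rng(0)
d, k, r = 512, 512, 16
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
delta_w = B @ A                                # a 512 x 512 update...
print(np.linalg.matrix_rank(delta_w))          # ...but rank 16
```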
Theorem: Catastrophic Forgetting in Fine-Tuning
Full fine-tuning on domain data \mathcal{D}' can degrade performance on the original distribution \mathcal{D}: the fine-tuned loss \mathcal{L}_{\mathcal{D}}(\theta_{\mathrm{FT}}) can be much larger than the pre-trained loss \mathcal{L}_{\mathcal{D}}(\theta_0).
LoRA mitigates this because the original weights W_0 are frozen.
When you overwrite weights with new task knowledge, old knowledge can be lost. LoRA adds rather than replaces.
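The additive structure means the adapter is removable: since W_0 is never overwritten, zeroing or subtracting the adapter recovers the original model exactly. A small numerical sketch:

```python
import numpy as np

# W0 stays frozen; the task-specific behavior lives entirely in B @ A.
rng = np.random.default_rng(1)
d, r = 64, 4
W0 = rng.standard_normal((d, d))        # frozen pre-trained weights
B = rng.standard_normal((d, r)) * 0.01  # small trained adapter factors
A = rng.standard_normal((r, d)) * 0.01
W_adapted = W0 + B @ A                  # adapted weights used at inference
recovered = W_adapted - B @ A           # drop the adapter -> original model
print(np.allclose(recovered, W0))       # True
```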
Theorem: Fine-Tuning Data Scaling
Fine-tuning performance on downstream tasks follows a power law: \mathcal{L}(N) = \mathcal{L}_\infty + c \cdot N^{-\alpha},
where N is the number of training examples and \alpha is a task-dependent exponent (typically 0.1–0.5). Beyond ~10K examples, returns diminish significantly for most tasks.
The pre-trained model already has strong representations. Fine-tuning mainly teaches it the task format and domain conventions.
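The diminishing returns can be illustrated with a toy power law of this form; the constants below are made up for demonstration, not fitted to real data:

```python
# Illustrative scaling: L(N) = L_inf + c * N**(-alpha), fabricated constants.
L_inf, c, alpha = 1.0, 2.0, 0.3

def loss(n):
    return L_inf + c * n ** (-alpha)

gain_1k_10k = loss(1_000) - loss(10_000)      # improvement from 1K -> 10K
gain_10k_100k = loss(10_000) - loss(100_000)  # improvement from 10K -> 100K
print(f"{gain_1k_10k:.4f} vs {gain_10k_100k:.4f}")  # each 10x buys less
```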
Example: Fine-Tuning with LoRA using PEFT
Fine-tune a 7B model on wireless communication Q&A data using LoRA.
Setup and Training
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,553,600 || all params: 8,036,065,280 || trainable%: 0.08
Example: Preparing Fine-Tuning Data
Convert wireless paper abstracts into instruction-tuning format.
Data Conversion
import json

def abstract_to_instruction(abstract, keywords):
    return {
        "instruction": f"Summarize this paper and list key contributions:\n{abstract}",
        "response": f"Keywords: {', '.join(keywords)}\n..."
    }
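A usage sketch for the converter (repeated here for self-containment; the abstract and keywords are made up):

```python
import json

def abstract_to_instruction(abstract, keywords):
    return {
        "instruction": f"Summarize this paper and list key contributions:\n{abstract}",
        "response": f"Keywords: {', '.join(keywords)}\n..."
    }

# Convert one fabricated abstract and serialize it as a JSONL line.
example = abstract_to_instruction(
    "We study massive MIMO beamforming under hardware impairments.",
    ["massive MIMO", "beamforming"],
)
jsonl_line = json.dumps(example)   # one record per line for SFT trainers
print(jsonl_line)
```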
Example: Evaluating Fine-Tuned Models
Evaluate a fine-tuned model on domain-specific benchmarks.
Evaluation
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf", model_args=f"pretrained={model_path}",
    tasks=["winogrande", "hellaswag", "arc_easy"],
)
LoRA Rank Explorer
Explore how LoRA rank affects parameter count and performance
Fine-Tuning Loss Comparison
Compare training dynamics for different fine-tuning methods
GPU Memory Estimator
Estimate GPU memory for different fine-tuning configurations
LoRA Weight Update Animation
Visualize how LoRA adapters modify attention weights during training
Quantization Explorer
Compare model sizes and quality across quantization levels
LoRA Architecture
Fine-Tuning Pipeline
Quick Check
Why does LoRA work with such low rank (r=16) for fine-tuning a 4096-dimensional model?
- The model is not actually learning anything
- Fine-tuning shifts weights in a low-dimensional subspace (correct)
- Higher ranks would cause overfitting

Explanation: The weight changes needed for domain adaptation lie in a much lower-dimensional space than the full weight matrix.
Quick Check
What is the main advantage of QLoRA over standard LoRA?
- Better model quality
- 4x reduction in base model memory through NF4 quantization (correct)
- Faster training speed

Explanation: QLoRA quantizes the frozen base weights to 4 bits, allowing fine-tuning of much larger models on limited hardware.
Quick Check
For fine-tuning, which is more important: data quality or data quantity?
- Data quantity — more data is always better
- Data quality — 1000 excellent examples often beat 100K noisy ones (correct)
- Neither — the model architecture matters most

Explanation: Pre-trained models already have strong representations. Fine-tuning mainly teaches task format and domain conventions, requiring less but higher-quality data.
Common Mistake: Overfitting Small Datasets
Mistake:
Training for too many epochs on small fine-tuning datasets.
Correction:
Use early stopping, monitor validation loss, and keep epochs low (1-3 for instruction tuning). The pre-trained model already has strong representations.
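The early-stopping logic can be sketched in a few lines; the validation losses below are fabricated to show the mechanism, not real training output:

```python
# Stop once validation loss fails to improve for `patience` epochs.
val_losses = [1.20, 0.95, 0.88, 0.87, 0.89, 0.93]   # per-epoch validation loss
patience = 1
best, best_epoch, wait, stop_epoch = float("inf"), -1, 0, None
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, best_epoch, wait = loss, epoch, 0     # new best: reset patience
    else:
        wait += 1
        if wait > patience:                         # no improvement: stop
            stop_epoch = epoch
            break
print(f"best epoch {best_epoch}, stopped at epoch {stop_epoch}")
```

In practice, restore the checkpoint from the best epoch rather than the last one.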
Common Mistake: Applying LoRA to Wrong Modules
Mistake:
Only applying LoRA to query projections.
Correction:
Apply LoRA to all attention projections (Q, K, V, O) and optionally the MLP layers. This provides the best performance-efficiency trade-off.
Common Mistake: Evaluation Data Contamination
Mistake:
Including evaluation data in the training set accidentally.
Correction:
Always split data before any processing. Use held-out test sets that are never seen during training or hyperparameter tuning.
Key Takeaway
LoRA fine-tuning modifies only 0.1-1% of parameters while achieving comparable results to full fine-tuning. QLoRA further reduces memory requirements by 4x through base weight quantization.
Key Takeaway
Data quality dominates quantity for fine-tuning. A well-curated dataset of 1000-5000 examples is typically sufficient for domain adaptation. Focus on diverse, high-quality instruction-response pairs.
Why This Matters: Domain-Specific LLMs for Telecom
Fine-tuning on 3GPP specifications, IEEE papers, and simulation code creates LLMs that understand telecom terminology, can generate correct simulation code, and reason about system-level design trade-offs. This is a growing area of applied research.
See full treatment in Chapter 38
Historical Note: The LoRA Breakthrough
2021: Hu et al. at Microsoft introduced LoRA, showing that the weight updates during fine-tuning are inherently low-rank. This enabled fine-tuning GPT-3 (175B) with only ~18M trainable parameters, making large model adaptation accessible to researchers without massive GPU clusters.
Historical Note: QLoRA Democratization
2023: Dettmers et al. showed that combining 4-bit quantization with LoRA allows fine-tuning a 65B model on a single 48GB GPU. This democratized LLM fine-tuning, enabling graduate students to customize frontier-class models.
Fine-Tuning Method Comparison
| Method | Trainable % | Memory (7B) | Quality | Speed |
|---|---|---|---|---|
| Full FT | 100% | ~60 GB | Best | Slowest |
| LoRA (r=16) | 0.08% | ~16 GB | Near-best | Fast |
| QLoRA (r=16) | 0.08% | ~6 GB | Good | Fast |
| Prefix Tuning | 0.1% | ~15 GB | Good | Fast |
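These memory figures can be approximated with simple rules of thumb; the byte counts below (fp16 weights and gradients, fp32 Adam moments for full FT; weight storage only for the PEFT rows) are rough assumptions, and real usage also depends on batch size and activation memory:

```python
# Back-of-envelope weight-memory for fine-tuning a 7B model.
params = 7e9
full_ft = params * (2 + 2 + 8) / 1e9  # fp16 weights + grads, fp32 Adam m and v
lora = params * 2.0 / 1e9             # fp16 frozen base dominates
qlora = params * 0.5 / 1e9            # NF4 base: 4 bits per weight
print(f"full={full_ft:.0f} GB, lora={lora:.0f} GB, qlora={qlora:.1f} GB")
```

Adapter parameters, activations, and CUDA overhead add a few GB on top of each estimate, which is why the table's totals sit above the raw weight numbers.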
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that adds trainable low-rank matrices to frozen pre-trained weights, requiring only 0.1-1% additional parameters.
QLoRA
Quantized LoRA — combines 4-bit NormalFloat quantization of base weights with LoRA adapters, enabling fine-tuning of large models on consumer GPUs.
PEFT
Parameter-Efficient Fine-Tuning — a family of methods that fine-tune only a small subset of model parameters, including LoRA, adapters, and prefix tuning.
SFT (Supervised Fine-Tuning)
Fine-tuning a pre-trained model on curated instruction-response pairs to improve instruction following.
Catastrophic Forgetting
The tendency of neural networks to lose previously learned knowledge when fine-tuned on new data, mitigated by LoRA and other regularization techniques.