Using Pre-Trained Models
Why Not Train From Scratch?
Training a large model from scratch requires massive data and compute. Models pre-trained on ImageNet or large text corpora capture general features (edges, textures, semantics) that transfer to new tasks with minimal fine-tuning.
Definition: Transfer Learning
Transfer learning reuses a model trained on a source task for a different target task:
- Feature extraction: freeze backbone, train only the classifier head
- Fine-tuning: unfreeze some/all layers, train with low LR
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze the backbone
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head (the new layer is trainable by default)
model.fc = nn.Linear(512, num_classes)  # num_classes assumed defined elsewhere
Definition: Fine-Tuning Strategies
- Gradual unfreezing: start with a frozen backbone, then unfreeze deeper layers progressively (a sketch follows this list).
- Discriminative LR: use lower learning rates for early layers and higher rates for later layers.
- Linear probing then fine-tuning: first train only the head (linear probe), then fine-tune the full model.
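A minimal gradual-unfreezing sketch, assuming a torchvision ResNet-18; num_classes and the train_for_a_few_epochs helper are placeholders, not part of any library:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Linear(512, num_classes)  # num_classes assumed defined

# Phase 0: freeze everything except the new head
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

# Later phases: unfreeze one stage at a time, deepest first
for stage in (model.layer4, model.layer3, model.layer2):
    for p in stage.parameters():
        p.requires_grad = True
    # train_for_a_few_epochs(model)  # hypothetical training step per phase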
Definition: torchvision Pre-Trained Models
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights)
preprocess = weights.transforms()  # use these transforms for inference
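A possible inference sketch using these transforms; the image path is a placeholder:

from PIL import Image
import torch

img = Image.open('example.jpg')       # placeholder image path
batch = preprocess(img).unsqueeze(0)  # resize/crop/normalise as the weights expect
model.eval()
with torch.no_grad():
    probs = model(batch).softmax(dim=1)
print(int(probs.argmax(dim=1)))       # predicted ImageNet class index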
Definition: HuggingFace Model Hub
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
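A minimal usage sketch: tokenise a sentence and extract contextual embeddings (the output shape assumes bert-base-uncased's hidden size of 768):

import torch

inputs = tokenizer("Transfer learning is efficient.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) contextual token embeddings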
Definition: LoRA: Low-Rank Adaptation
Instead of fine-tuning all weights, LoRA adds trainable low-rank matrices to frozen weights:

$W' = W_0 + BA$, where $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ is frozen, $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, and $r \ll \min(d_{\text{out}}, d_{\text{in}})$.

Only $A$ and $B$ are trained, reducing trainable parameters by 100x or more.
Theorem: Feature Transferability Across Layers
Early CNN layers learn generic features (edges, textures) that transfer well across tasks. Deeper layers learn task-specific features. The transferability of features decreases monotonically with layer depth (Yosinski et al., 2014).
This is why fine-tuning works: keep early generic features frozen and only retrain the task-specific later layers.
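A minimal sketch of this idea for a torchvision ResNet-18 (num_classes assumed defined): freeze the early, generic stages and leave the later stages and the head trainable.

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Linear(512, num_classes)

# Freeze the generic early stages; layer3, layer4 and fc remain trainable
for stage in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in stage.parameters():
        p.requires_grad = False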
Theorem: LoRA Rank Selection
For a weight matrix $W \in \mathbb{R}^{d \times d}$, LoRA with rank $r$ adds $2dr$ trainable parameters instead of $d^2$. For $d = 768$ and $r = 8$: 12,288 vs 589,824 (48x reduction). Empirically, a small rank (roughly 4-16) suffices for most fine-tuning tasks.
The weight updates during fine-tuning often have low intrinsic rank, meaning the full parameter space is not needed.
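A quick back-of-the-envelope check of these numbers, assuming a square 768 x 768 weight matrix:

# LoRA parameter count vs the full weight matrix
d = 768
for r in (4, 8, 16):
    lora_params = 2 * d * r   # A (r x d) plus B (d x r)
    full_params = d * d       # the dense weight matrix
    print(r, lora_params, full_params, round(full_params / lora_params))
# r = 8 -> 12,288 vs 589,824 (48x fewer trainable parameters)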
Theorem: Sample Efficiency of Transfer Learning
Transfer learning from a pre-trained model can match the performance of training from scratch on 10-100x more data (Raghu et al., 2019). With only 100 labelled examples per class, a fine-tuned model can significantly outperform a from-scratch model.
The pre-trained features provide a strong initialisation that is already close to a good solution for the target task.
Example: Feature Extraction with Pre-Trained ResNet
Use a pre-trained ResNet as a fixed feature extractor.
Implementation
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')
model.fc = nn.Identity()  # remove the classification head
model.eval()

with torch.no_grad():
    features = model(images)  # (B, 512); `images` is a batch of preprocessed inputs

# Train a simple classifier on the frozen features
clf = nn.Linear(512, num_classes)
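One possible way to fit the head on the pre-computed features, assuming integer class targets in a `labels` tensor:

import torch
import torch.nn as nn

opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(clf(features), labels)  # `labels` assumed defined
    loss.backward()
    opt.step()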
Example: Fine-Tuning with Discriminative LR
Fine-tune a ResNet with different LR per layer group.
Implementation
import torch

# Lower learning rates for early (generic) layers, higher for later (task-specific) ones
optimizer = torch.optim.Adam([
    {'params': model.layer1.parameters(), 'lr': 1e-5},
    {'params': model.layer2.parameters(), 'lr': 1e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 1e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
])
Example: LoRA Implementation
Add LoRA to a linear layer.
Implementation
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear, rank=4):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad = False  # freeze the pre-trained weight
        d_out, d_in = linear.weight.shape
        # A starts small and random, B starts at zero, so the LoRA branch is initially a no-op
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # frozen path plus low-rank update: x @ A.T @ B.T equals x @ (B @ A).T
        return self.linear(x) + (x @ self.A.T @ self.B.T)
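A hypothetical usage check: wrap a single 768 x 768 projection (bias omitted for simplicity) and count trainable vs frozen parameters, matching the numbers from the rank-selection discussion:

base = nn.Linear(768, 768, bias=False)
layer = LoRALinear(base, rank=8)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(trainable, frozen)  # 12288 trainable vs 589824 frozen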
Transfer Learning Efficiency
Compare accuracy vs data size for from-scratch vs transfer learning.
LoRA Parameter Savings
See how rank affects trainable parameter count.
Layer Freezing Strategy Explorer
See trainable parameters vs layers frozen.
Fine-Tuning Progress
Watch accuracy improve during fine-tuning.
Transfer Learning Pipeline
Model Export Formats
Quick Check
When should you freeze the backbone and train only the head?
- When the target dataset is very large
- When the target dataset is small and similar to the source domain
- When you want the best possible accuracy
Quick Check
What is the main advantage of LoRA over full fine-tuning?
- Better accuracy
- Orders of magnitude fewer trainable parameters
- Faster inference
Quick Check
Which model export format is most portable across frameworks?
- TorchScript
- ONNX
- pickle
ONNX is supported by TensorFlow, PyTorch, ONNX Runtime, and many inference engines.
Common Mistake: Overfitting When Fine-Tuning on Small Data
Mistake:
Fine-tuning all layers on a small target dataset leads to overfitting.
Correction:
Use feature extraction first, then fine-tune with strong regularisation (small LR, weight decay, early stopping).
Common Mistake: Wrong Preprocessing for Pre-Trained Models
Mistake:
Using different normalisation (mean/std) than the pre-trained model expects.
Correction:
Always use the transforms provided with the model weights (e.g., ResNet50_Weights.transforms()).
Common Mistake: BatchNorm Statistics During Fine-Tuning
Mistake:
Leaving BatchNorm running statistics as they were at pre-training, even though they may not match the target domain.
Correction:
Fine-tune with model.train() so the running stats adapt to the target data, or freeze the BatchNorm layers entirely (see the sketch below).
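One way to freeze the BatchNorm layers of a torchvision CNN during fine-tuning; a sketch, assuming `model` is already defined:

import torch.nn as nn

def freeze_batchnorm(model):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()                       # stop updating running stats
            for p in m.parameters():
                p.requires_grad = False    # freeze affine scale/shift

model.train()            # the rest of the model trains normally
freeze_batchnorm(model)  # re-apply after every call to model.train()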
Key Takeaway
Pre-trained models are the starting point for most tasks: use feature extraction for small datasets, fine-tuning for medium-sized ones, and training from scratch only for very large ones. LoRA enables efficient adaptation of large models.
Key Takeaway
Export models via TorchScript (torch.jit.script/trace) for PyTorch deployment, ONNX for cross-framework portability, and TensorRT for maximum GPU inference speed.
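A minimal export sketch for a ResNet-18; the file names and example input shape are placeholders:

import torch
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1').eval()
example = torch.randn(1, 3, 224, 224)  # dummy input for tracing/export

# TorchScript: PyTorch-native deployment without a Python runtime
traced = torch.jit.trace(model, example)
traced.save('resnet18_traced.pt')

# ONNX: portable across frameworks and inference engines
torch.onnx.export(model, example, 'resnet18.onnx',
                  input_names=['input'], output_names=['logits'])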
Why This Matters: Plug-and-Play Denoisers for Inverse Problems
Pre-trained denoisers (e.g., DRUNet) can be used as priors in iterative algorithms for channel estimation, signal recovery, and imaging. The plug-and-play (PnP) framework alternates between a data-fidelity step and a denoising step, requiring no task-specific training.
See full treatment in Chapter 27
Historical Note: ImageNet: The Foundation of Transfer Learning
2012-2017: The ImageNet Large Scale Visual Recognition Challenge (2010-2017) produced the pre-trained models that enabled transfer learning. Features learned on 1.2 million images transfer remarkably well to tasks with as few as 100 examples per class.
Historical Note: LoRA: Efficient Fine-Tuning
2021: Hu et al. (2021) introduced LoRA, showing that fine-tuning updates have low intrinsic rank. This enabled adaptation of billion-parameter models on single GPUs, democratising large model fine-tuning.
Transfer Learning
Reusing a model trained on one task for a different but related task.
Fine-Tuning
Continuing training of a pre-trained model on new data, typically with a lower learning rate.
LoRA
Low-Rank Adaptation: adding trainable low-rank matrices to frozen pre-trained weights.
ONNX
Open Neural Network Exchange: cross-framework model format for portable deployment.
TorchScript
PyTorch's JIT compilation format for deploying models without Python runtime.
Fine-Tuning Strategy Selection
| Strategy | Trainable Params | Data Needed | Best For |
|---|---|---|---|
| Feature extraction | Head only | Very small | Quick baseline, similar domains |
| Fine-tune last layers | Moderate | Small-medium | Moderate domain shift |
| Fine-tune all | All | Medium-large | Significant domain shift |
| LoRA | ~1% of model | Small-medium | Large models, limited compute |