Using Pre-Trained Models

Why Not Train From Scratch?

Training a large model from scratch requires massive amounts of data and compute. Models pre-trained on ImageNet or large text corpora capture general features (edges, textures, semantics) that transfer to new tasks with minimal fine-tuning.

Definition:

Transfer Learning

Transfer learning reuses a model trained on a source task for a different target task:

  1. Feature extraction: freeze backbone, train only the classifier head
  2. Fine-tuning: unfreeze some/all layers, train with low LR

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze the backbone so only the new head receives gradient updates
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head (resnet18's final feature dimension is 512)
model.fc = nn.Linear(512, num_classes)  # num_classes: number of target classes

Definition:

Fine-Tuning Strategies

Gradual unfreezing: start with frozen backbone, unfreeze deeper layers progressively.

Discriminative LR: use lower learning rates for early layers and higher rates for later layers.

Linear probing then fine-tuning: first train only the head (linear probe), then fine-tune the full model.
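
A minimal sketch of gradual unfreezing on a torchvision ResNet; the staging (one residual block group per stage) and the chosen layer names are illustrative assumptions:

import torch.nn as nn
import torchvision.models as models

num_classes = 10  # example target task size
model = models.resnet18(weights='IMAGENET1K_V1')

# Stage 0: freeze the backbone, train only the new head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, num_classes)  # new parameters train by default

def unfreeze(module):
    for param in module.parameters():
        param.requires_grad = True

# Later stages: progressively unfreeze deeper blocks, closest to the head first
unfreeze(model.layer4)    # stage 1
# unfreeze(model.layer3)  # stage 2, and so on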

Definition:

torchvision Pre-Trained Models

from torchvision.models import resnet50, ResNet50_Weights
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
preprocess = ResNet50_Weights.IMAGENET1K_V2.transforms()
# Use preprocess for inference transforms
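
A short usage sketch applying the weights-bundled preprocessing at inference time; the image path is a placeholder:

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = Image.open('example.jpg')          # placeholder path
batch = preprocess(img).unsqueeze(0)     # resize, crop, and normalise as the weights expect
with torch.no_grad():
    probs = model(batch).softmax(dim=-1)
print(weights.meta['categories'][probs.argmax().item()])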

Definition:

HuggingFace Model Hub

from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
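
A minimal sketch of running the loaded encoder on one sentence; the output shape assumes the 768-dimensional bert-base model:

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

inputs = tokenizer('Transfer learning reuses pre-trained features.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768): one embedding per token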

Definition:

LoRA: Low-Rank Adaptation

Instead of fine-tuning all weights, LoRA adds a trainable low-rank update $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ to the frozen weights:

$$\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} \in \mathbb{R}^{d \times r},\ \mathbf{A} \in \mathbb{R}^{r \times d},\ r \ll d$$

Only $\mathbf{A}$ and $\mathbf{B}$ are trained, reducing trainable parameters by 100x or more.

Theorem: Feature Transferability Across Layers

Early CNN layers learn generic features (edges, textures) that transfer well across tasks. Deeper layers learn task-specific features. The transferability of features decreases monotonically with layer depth (Yosinski et al., 2014).

This is why fine-tuning works: keep early generic features frozen and only retrain the task-specific later layers.

Theorem: LoRA Rank Selection

For a weight matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$, LoRA with rank $r$ adds $2dr$ parameters instead of $d^2$. For $d = 768$ and $r = 8$: 12,288 vs 589,824 (a 48x reduction). Empirically, $r \in [4, 16]$ suffices for most fine-tuning tasks.
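
The parameter counts above can be checked directly; this snippet just reproduces the arithmetic:

d, r = 768, 8
full = d * d         # full-rank update: 589,824 parameters
lora = 2 * d * r     # B (d x r) plus A (r x d): 12,288 parameters
print(full, lora, full / lora)  # ratio = 48.0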

The weight updates during fine-tuning often have low intrinsic rank, meaning the full parameter space is not needed.

Theorem: Sample Efficiency of Transfer Learning

Fine-tuning a pre-trained model can match the performance of training from scratch on 10-100x more data (Raghu et al., 2019). With only 100 labelled examples per class, a fine-tuned model can significantly outperform a model trained from scratch.

The pre-trained features provide a strong initialisation that is already close to a good solution for the target task.

Example: Feature Extraction with Pre-Trained ResNet

Use a pre-trained ResNet as a fixed feature extractor.
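
A minimal sketch, assuming a hypothetical DataLoader named train_loader over the target dataset:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # example target task size
model = models.resnet18(weights='IMAGENET1K_V1')
for param in model.parameters():
    param.requires_grad = False          # backbone acts as a fixed feature extractor
model.fc = nn.Linear(512, num_classes)   # only the new head is trainable

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:      # train_loader: assumed to exist
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()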

Example: Fine-Tuning with Discriminative LR

Fine-tune a ResNet with different LR per layer group.
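
A sketch using optimiser parameter groups; the specific learning rates, and the choice to leave the stem (conv1/bn1) out of the optimiser, are illustrative:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # example target task size
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(2048, num_classes)  # resnet50's final feature dimension is 2048

optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 1e-5},  # generic early features: tiny LR
    {'params': model.layer2.parameters(), 'lr': 1e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 1e-4},  # more task-specific: larger LR
    {'params': model.fc.parameters(),     'lr': 1e-3},  # new head: largest LR
], momentum=0.9, weight_decay=1e-4)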

Example: LoRA Implementation

Add LoRA to a linear layer.
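
A minimal LoRA wrapper around an existing nn.Linear, following the $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ parametrisation above; the alpha/r scaling and the zero/small-random initialisation follow the usual convention, and the hyperparameters are illustrative:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False               # frozen pre-trained weight and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: Delta W = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)    # adapt a 768 -> 768 projection
out = layer(torch.randn(4, 768))                # only A and B receive gradients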

Transfer Learning Efficiency

Compare accuracy vs data size for from-scratch vs transfer learning.

LoRA Parameter Savings

See how rank affects trainable parameter count.

Layer Freezing Strategy Explorer

See trainable parameters vs layers frozen.

Fine-Tuning Progress

Watch accuracy improve during fine-tuning.

Transfer Learning Pipeline

Pre-trained backbone + new head: freeze, fine-tune, or LoRA.

Model Export Formats

PyTorch -> TorchScript, ONNX, TensorRT for deployment.
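
A sketch of the two most common export calls; the file names are placeholders, and TensorRT conversion (typically done from the ONNX file) is omitted:

import torch
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1').eval()
example = torch.randn(1, 3, 224, 224)   # dummy input for tracing

# TorchScript: serialised module runnable without the original Python source
traced = torch.jit.trace(model, example)
traced.save('resnet18_traced.pt')

# ONNX: portable across runtimes (ONNX Runtime, TensorRT, ...)
torch.onnx.export(model, example, 'resnet18.onnx',
                  input_names=['input'], output_names=['logits'],
                  dynamic_axes={'input': {0: 'batch'}})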

Quick Check

When should you freeze the backbone and train only the head?

When the target dataset is very large

When the target dataset is small and similar to the source domain

When you want the best possible accuracy

Quick Check

What is the main advantage of LoRA over full fine-tuning?

Better accuracy

Orders of magnitude fewer trainable parameters

Faster inference

Quick Check

Which model export format is most portable across frameworks?

TorchScript

ONNX

pickle

Common Mistake: Overfitting When Fine-Tuning on Small Data

Mistake:

Fine-tuning all layers on a small target dataset leads to overfitting.

Correction:

Use feature extraction first, then fine-tune with strong regularisation (small LR, weight decay, early stopping).

Common Mistake: Wrong Preprocessing for Pre-Trained Models

Mistake:

Using different normalisation (mean/std) than the pre-trained model expects.

Correction:

Always use the transforms provided with the model weights (e.g., ResNet50_Weights.transforms()).

Common Mistake: BatchNorm Statistics During Fine-Tuning

Mistake:

BatchNorm running stats from pre-training may not match target domain.

Correction:

Fine-tune with model.train() to update running stats, or freeze BN layers.
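
A sketch of freezing the BatchNorm layers so both their running statistics and affine parameters keep their pre-trained values:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')

for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.eval()                       # stop updating running mean/var
        for param in module.parameters():
            param.requires_grad = False     # stop updating affine scale/shift

# Note: model.train() switches BN modules back to training mode,
# so re-apply module.eval() after every model.train() call.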

Key Takeaway

Pre-trained models are the starting point for most tasks: use feature extraction for small datasets, fine-tuning for medium-sized ones, and training from scratch only for very large datasets. LoRA enables efficient adaptation of large models.

Key Takeaway

Export models via TorchScript (torch.jit.script/trace) for PyTorch deployment, ONNX for cross-framework portability, and TensorRT for maximum GPU inference speed.

Why This Matters: Plug-and-Play Denoisers for Inverse Problems

Pre-trained denoisers (e.g., DRUNet) can be used as priors in iterative algorithms for channel estimation, signal recovery, and imaging. The plug-and-play (PnP) framework alternates between a data-fidelity step and a denoising step, requiring no task-specific training.

See full treatment in Chapter 27

Historical Note: ImageNet: The Foundation of Transfer Learning

2012-2017

The ImageNet Large Scale Visual Recognition Challenge (2010-2017) produced the pre-trained models that enabled transfer learning. Features learned on 1.2 million images transfer remarkably well to tasks with as few as 100 examples per class.

Historical Note: LoRA: Efficient Fine-Tuning

2021

Hu et al. (2021) introduced LoRA, showing that fine-tuning updates have low intrinsic rank. This enabled adaptation of billion-parameter models on single GPUs, democratising large model fine-tuning.

Transfer Learning

Reusing a model trained on one task for a different but related task.

Fine-Tuning

Continuing training of a pre-trained model on new data, typically with a lower learning rate.

LoRA

Low-Rank Adaptation: adding trainable low-rank matrices to frozen pre-trained weights.

ONNX

Open Neural Network Exchange: cross-framework model format for portable deployment.

TorchScript

PyTorch's JIT compilation format for deploying models without Python runtime.

Fine-Tuning Strategy Selection

Strategy                Trainable Params   Data Needed    Best For
Feature extraction      Head only          Very small     Quick baseline, similar domains
Fine-tune last layers   Moderate           Small-medium   Moderate domain shift
Fine-tune all           All                Medium-large   Significant domain shift
LoRA                    ~1% of model       Small-medium   Large models, limited compute