Using Pre-Trained Models

Why Not Train From Scratch?

Training a large model from scratch requires massive amounts of data and compute. Models pre-trained on ImageNet or large text corpora capture general features (edges, textures, semantics) that transfer to new tasks with minimal fine-tuning.

Definition:

Transfer Learning

Transfer learning reuses a model trained on a source task for a different target task:

  1. Feature extraction: freeze backbone, train only the classifier head
  2. Fine-tuning: unfreeze some/all layers, train with low LR

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze the backbone so only the new head receives gradient updates
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head (resnet18's final feature dimension is 512)
model.fc = nn.Linear(512, num_classes)  # num_classes: number of target classes

Definition:

Fine-Tuning Strategies

Gradual unfreezing: start with frozen backbone, unfreeze deeper layers progressively.

Discriminative LR: use lower learning rates for early layers and higher rates for later layers.

Linear probing then fine-tuning: first train only the head (linear probe), then fine-tune the full model.
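
A minimal sketch of gradual unfreezing on a torchvision ResNet; the staging (one residual block group per stage) and the chosen layer names are illustrative assumptions:

import torch.nn as nn
import torchvision.models as models

num_classes = 10  # example target task size
model = models.resnet18(weights='IMAGENET1K_V1')

# Stage 0: freeze the backbone, train only the new head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(512, num_classes)  # new parameters train by default

def unfreeze(module):
    for param in module.parameters():
        param.requires_grad = True

# Later stages: progressively unfreeze deeper blocks, closest to the head first
unfreeze(model.layer4)    # stage 1
# unfreeze(model.layer3)  # stage 2, and so on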

Definition:

torchvision Pre-Trained Models

from torchvision.models import resnet50, ResNet50_Weights
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
preprocess = ResNet50_Weights.IMAGENET1K_V2.transforms()
# Use preprocess for inference transforms
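
A short usage sketch applying the weights-bundled preprocessing at inference time; the image path is a placeholder:

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = Image.open('example.jpg')          # placeholder path
batch = preprocess(img).unsqueeze(0)     # resize, crop, and normalise as the weights expect
with torch.no_grad():
    probs = model(batch).softmax(dim=-1)
print(weights.meta['categories'][probs.argmax().item()])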

Definition:

HuggingFace Model Hub

from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
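
A minimal sketch of running the loaded encoder on one sentence; the output shape assumes the 768-dimensional bert-base model:

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

inputs = tokenizer('Transfer learning reuses pre-trained features.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768): one embedding per token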

Definition:

LoRA: Low-Rank Adaptation

Instead of fine-tuning all weights, LoRA adds a trainable low-rank update $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ to the frozen weights:

$$\mathbf{W}' = \mathbf{W} + \mathbf{B}\mathbf{A}, \quad \mathbf{B} \in \mathbb{R}^{d \times r},\ \mathbf{A} \in \mathbb{R}^{r \times d},\ r \ll d$$

Only $\mathbf{A}$ and $\mathbf{B}$ are trained, reducing trainable parameters by 100x or more.

Theorem: Feature Transferability Across Layers

Early CNN layers learn generic features (edges, textures) that transfer well across tasks. Deeper layers learn task-specific features. The transferability of features decreases monotonically with layer depth (Yosinski et al., 2014).

This is why fine-tuning works: keep early generic features frozen and only retrain the task-specific later layers.

Theorem: LoRA Rank Selection

For a weight matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$, LoRA with rank $r$ adds $2dr$ parameters instead of $d^2$. For $d = 768$ and $r = 8$: 12,288 vs 589,824 (a 48x reduction). Empirically, $r \in [4, 16]$ suffices for most fine-tuning tasks.
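
The parameter counts above can be checked directly; this snippet just reproduces the arithmetic:

d, r = 768, 8
full = d * d         # full-rank update: 589,824 parameters
lora = 2 * d * r     # B (d x r) plus A (r x d): 12,288 parameters
print(full, lora, full / lora)  # ratio = 48.0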

The weight updates during fine-tuning often have low intrinsic rank, meaning the full parameter space is not needed.

Theorem: Sample Efficiency of Transfer Learning

Fine-tuning a pre-trained model can match the performance of training from scratch on 10-100x more data (Raghu et al., 2019). With only 100 labelled examples per class, a fine-tuned model can significantly outperform a model trained from scratch.

The pre-trained features provide a strong initialisation that is already close to a good solution for the target task.

Example: Feature Extraction with Pre-Trained ResNet

Use a pre-trained ResNet as a fixed feature extractor.
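
A minimal sketch, assuming a hypothetical DataLoader named train_loader over the target dataset:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # example target task size
model = models.resnet18(weights='IMAGENET1K_V1')
for param in model.parameters():
    param.requires_grad = False          # backbone acts as a fixed feature extractor
model.fc = nn.Linear(512, num_classes)   # only the new head is trainable

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader:      # train_loader: assumed to exist
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()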

Example: Fine-Tuning with Discriminative LR

Fine-tune a ResNet with different LR per layer group.
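
A sketch using optimiser parameter groups; the specific learning rates, and the choice to leave the stem (conv1/bn1) out of the optimiser, are illustrative:

import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # example target task size
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(2048, num_classes)  # resnet50's final feature dimension is 2048

optimizer = torch.optim.SGD([
    {'params': model.layer1.parameters(), 'lr': 1e-5},  # generic early features: tiny LR
    {'params': model.layer2.parameters(), 'lr': 1e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 1e-4},  # more task-specific: larger LR
    {'params': model.fc.parameters(),     'lr': 1e-3},  # new head: largest LR
], momentum=0.9, weight_decay=1e-4)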

Example: LoRA Implementation

Add LoRA to a linear layer.
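
A minimal LoRA wrapper around an existing nn.Linear, following the $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ parametrisation above; the alpha/r scaling and the zero/small-random initialisation follow the usual convention, and the hyperparameters are illustrative:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False               # frozen pre-trained weight and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: Delta W = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)    # adapt a 768 -> 768 projection
out = layer(torch.randn(4, 768))                # only A and B receive gradients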

Transfer Learning Efficiency

Compare accuracy vs data size for from-scratch vs transfer learning.

LoRA Parameter Savings

See how rank affects trainable parameter count.

Layer Freezing Strategy Explorer

See trainable parameters vs layers frozen.

Fine-Tuning Progress

Watch accuracy improve during fine-tuning.

Transfer Learning Pipeline

Pre-trained backbone + new head: freeze, fine-tune, or LoRA.

Model Export Formats

PyTorch -> TorchScript, ONNX, TensorRT for deployment.
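
A sketch of the two most common export calls; the file names are placeholders, and TensorRT conversion (typically done from the ONNX file) is omitted:

import torch
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1').eval()
example = torch.randn(1, 3, 224, 224)   # dummy input for tracing

# TorchScript: serialised module runnable without the original Python source
traced = torch.jit.trace(model, example)
traced.save('resnet18_traced.pt')

# ONNX: portable across runtimes (ONNX Runtime, TensorRT, ...)
torch.onnx.export(model, example, 'resnet18.onnx',
                  input_names=['input'], output_names=['logits'],
                  dynamic_axes={'input': {0: 'batch'}})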

Quick Check

When should you freeze the backbone and train only the head?

When the target dataset is very large

When the target dataset is small and similar to the source domain

When you want the best possible accuracy

Quick Check

What is the main advantage of LoRA over full fine-tuning?

Better accuracy

Orders of magnitude fewer trainable parameters

Faster inference

Quick Check

Which model export format is most portable across frameworks?

TorchScript

ONNX

pickle

Common Mistake: Overfitting When Fine-Tuning on Small Data

Mistake:

Fine-tuning all layers on a small target dataset leads to overfitting.

Correction:

Use feature extraction first, then fine-tune with strong regularisation (small LR, weight decay, early stopping).

Common Mistake: Wrong Preprocessing for Pre-Trained Models

Mistake:

Using different normalisation (mean/std) than the pre-trained model expects.

Correction:

Always use the transforms provided with the model weights (e.g., ResNet50_Weights.transforms()).

Common Mistake: BatchNorm Statistics During Fine-Tuning

Mistake:

BatchNorm running stats from pre-training may not match target domain.

Correction:

Fine-tune with model.train() to update running stats, or freeze BN layers.
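
A sketch of freezing the BatchNorm layers so both their running statistics and affine parameters keep their pre-trained values:

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights='IMAGENET1K_V1')

for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.eval()                       # stop updating running mean/var
        for param in module.parameters():
            param.requires_grad = False     # stop updating affine scale/shift

# Note: model.train() switches BN modules back to training mode,
# so re-apply module.eval() after every model.train() call.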

Key Takeaway

Pre-trained models are the starting point for most tasks: use feature extraction for small datasets, fine-tuning for medium-sized ones, and training from scratch only for very large datasets. LoRA enables efficient adaptation of large models.

Key Takeaway

Export models via TorchScript (torch.jit.script/trace) for PyTorch deployment, ONNX for cross-framework portability, and TensorRT for maximum GPU inference speed.

Why This Matters: Plug-and-Play Denoisers for Inverse Problems

Pre-trained denoisers (e.g., DRUNet) can be used as priors in iterative algorithms for channel estimation, signal recovery, and imaging. The plug-and-play (PnP) framework alternates between a data-fidelity step and a denoising step, requiring no task-specific training.

See full treatment in Chapter 27

Historical Note: ImageNet: The Foundation of Transfer Learning

2012-2017

The ImageNet Large Scale Visual Recognition Challenge (2010-2017) produced the pre-trained models that enabled transfer learning. Features learned on 1.2 million images transfer remarkably well to tasks with as few as 100 examples per class.

Historical Note: LoRA: Efficient Fine-Tuning

2021

Hu et al. (2021) introduced LoRA, showing that fine-tuning updates have low intrinsic rank. This enabled adaptation of billion-parameter models on single GPUs, democratising large model fine-tuning.

Transfer Learning

Reusing a model trained on one task for a different but related task.

Fine-Tuning

Continuing training of a pre-trained model on new data, typically with a lower learning rate.

LoRA

Low-Rank Adaptation: adding trainable low-rank matrices to frozen pre-trained weights.

ONNX

Open Neural Network Exchange: cross-framework model format for portable deployment.

TorchScript

PyTorch's JIT compilation format for deploying models without Python runtime.

Fine-Tuning Strategy Selection

Strategy                Trainable Params   Data Needed    Best For
Feature extraction      Head only          Very small     Quick baseline, similar domains
Fine-tune last layers   Moderate           Small-medium   Moderate domain shift
Fine-tune all           All                Medium-large   Significant domain shift
LoRA                    ~1% of model       Small-medium   Large models, limited compute