Automatic Differentiation (autograd)

Definition:

Computational Graph and Autograd

When requires_grad=True is set on a tensor, PyTorch records every operation into a directed acyclic graph (DAG). Each node stores the operation (a Function) and its inputs. Calling .backward() traverses this graph in reverse (topological) order, applying the chain rule to accumulate gradients.

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2 + 3 * x
z = y.sum()
z.backward()
print(x.grad)   # tensor([7., 9.])  i.e. 2x + 3

The graph is dynamic: it is rebuilt from scratch on every forward pass. This allows data-dependent control flow (if/else, loops) that is awkward or impossible to express with static graphs.

Gradients are accumulated into .grad, not overwritten. Call x.grad.zero_() (or optimizer.zero_grad() when using an optimizer) before each new backward pass to avoid accumulating stale gradients.
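A quick sketch of the accumulation behavior: two forward/backward passes on the same leaf sum their gradients rather than replacing them.

import torch

x = torch.tensor([2.0], requires_grad=True)
for _ in range(2):
    y = (x ** 2).sum()
    y.backward()        # adds into x.grad rather than overwriting it
print(x.grad)           # tensor([8.]): 2x = 4 from each pass, summed
x.grad.zero_()          # reset before the next iteration
print(x.grad)           # tensor([0.])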

Definition:

Leaf Tensors and Intermediate Tensors

A leaf tensor is one created directly by the user (not as the result of an operation). Only leaf tensors with requires_grad=True retain their .grad attribute after .backward().

a = torch.tensor([1.0], requires_grad=True)   # leaf
b = a * 2                                     # NOT a leaf
print(a.is_leaf)   # True
print(b.is_leaf)   # False

Intermediate tensors' gradients are computed during backprop but discarded unless you call b.retain_grad() before .backward().
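For example, retain_grad() makes an intermediate tensor keep its gradient after backward (a minimal sketch):

import torch

a = torch.tensor([1.0], requires_grad=True)
b = a * 2               # intermediate tensor
b.retain_grad()         # ask autograd to keep b's gradient
c = (b ** 2).sum()
c.backward()
print(b.grad)           # tensor([4.]) = 2*b, kept thanks to retain_grad()
print(a.grad)           # tensor([8.]); leaf gradients are always kept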

Theorem: Reverse-Mode Automatic Differentiation

For a composition $L = f_n \circ f_{n-1} \circ \cdots \circ f_1(\mathbf{x})$, the gradient is:

$$\frac{\partial L}{\partial \mathbf{x}} = \left(\frac{\partial f_1}{\partial \mathbf{x}}\right)^{T} \left(\frac{\partial f_2}{\partial f_1}\right)^{T} \cdots \left(\frac{\partial f_n}{\partial f_{n-1}}\right)^{T}$$

Reverse-mode AD computes this right to left (from output to input) in a single backward pass, regardless of the dimension of $\mathbf{x}$. The cost is $O(1)$ backward passes for any number of parameters.

Forward-mode AD would need one pass per input dimension; reverse-mode needs one pass per output dimension. Since loss functions map $\mathbb{R}^n \to \mathbb{R}$ (scalar output), reverse-mode is dramatically more efficient: this is why backpropagation works.
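To make the asymmetry concrete, here is a small sketch: a scalar loss over a million parameters, whose full gradient falls out of a single backward pass (forward mode would need a million passes).

import torch

n = 1_000_000
w = torch.randn(n, requires_grad=True)
loss = (w ** 2).sum()    # f: R^n -> R, a scalar output
loss.backward()          # one backward pass, full gradient
print(w.grad.shape)      # torch.Size([1000000]); equals 2*w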

Theorem: Gradient of a Quadratic Form

For symmetric $\mathbf{A} \in \mathbb{R}^{n \times n}$ and the quadratic form $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$:

$$\nabla_{\mathbf{x}} f = 2\mathbf{A}\mathbf{x}$$

PyTorch's autograd computes this exactly:

A = torch.eye(3, dtype=torch.float64)
x = torch.tensor([1., 2., 3.], dtype=torch.float64, requires_grad=True)
f = x @ A @ x
f.backward()
print(x.grad)              # tensor([2., 4., 6.]) = 2*A@x

This is the multivariable analogue of $\frac{d}{dx}(ax^2) = 2ax$. It appears constantly in optimization: the gradient of a least-squares objective $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ involves exactly this form.
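As a sanity check, the analytical least-squares gradient $2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ can be compared against autograd (a short sketch with random data):

import torch

A = torch.randn(5, 3)
b = torch.randn(5)
x = torch.randn(3, requires_grad=True)
loss = ((A @ x - b) ** 2).sum()           # ||Ax - b||^2
loss.backward()
manual = 2 * A.T @ (A @ x.detach() - b)   # analytical gradient
print(torch.allclose(x.grad, manual))     # True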

Autograd-Powered Gradient Descent

Watch gradient descent minimize a 2D function using autograd to compute gradients automatically. Compare convergence for different learning rates and functions.
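The core loop behind such a demo is only a few lines. A minimal sketch, assuming a simple quadratic bowl as the example function:

import torch

xy = torch.tensor([2.0, -1.5], requires_grad=True)
lr = 0.1                                  # learning rate
for step in range(100):
    f = (xy[0] - 1) ** 2 + 2 * (xy[1] + 0.5) ** 2
    f.backward()                          # autograd fills xy.grad
    with torch.no_grad():                 # the update must not be recorded
        xy -= lr * xy.grad
    xy.grad.zero_()                       # clear before the next step
print(xy)                                 # converges toward (1.0, -0.5)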


Example: Differentiating Through Matrix Operations

Use autograd to compute $\frac{\partial}{\partial \mathbf{A}} \mathrm{tr}(\mathbf{A}^T \mathbf{B})$ for random matrices $\mathbf{A}$ and $\mathbf{B}$. Verify the result matches the analytical formula.
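A sketch of this verification. Since $\mathrm{tr}(\mathbf{A}^T\mathbf{B}) = \sum_{ij} A_{ij} B_{ij}$, the analytical gradient is exactly $\mathbf{B}$:

import torch

A = torch.randn(4, 4, requires_grad=True)
B = torch.randn(4, 4)
f = torch.trace(A.T @ B)                  # tr(A^T B) = sum_ij A_ij * B_ij
f.backward()
print(torch.allclose(A.grad, B))          # True: the gradient is B itself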

Example: Disabling Gradient Tracking for Inference

Show how torch.no_grad() disables the computational graph to save memory during inference or evaluation.
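A minimal sketch of the pattern, using a small hypothetical nn.Linear model for illustration:

import torch

model = torch.nn.Linear(10, 2)            # hypothetical tiny model
x = torch.randn(5, 10)

with torch.no_grad():                     # no graph is recorded inside
    out = model(x)
print(out.requires_grad)                  # False: nothing to backprop, less memory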

Computational Graph Visualizer

Enter a mathematical expression and see the computational graph that autograd builds. Observe how the backward pass traverses nodes in reverse order.
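A rough textual stand-in for the visualizer: walk the grad_fn links backward from the output, following just the first parent at each node (this relies on the grad_fn and next_functions attributes autograd exposes on graph nodes):

import torch

x = torch.tensor([2.0], requires_grad=True)
z = (x ** 2 + 3 * x).sum()
node = z.grad_fn
while node is not None:                   # SumBackward0 -> AddBackward0 -> ...
    print(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None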


Quick Check

If you run the forward pass and call .backward() twice without zeroing gradients in between, what happens to .grad?

It is overwritten with the new gradient

It is doubled (accumulated)

A RuntimeError is raised

It remains unchanged

Quick Check

What does .detach() return?

A copy of the tensor on CPU

A view sharing data but detached from the computation graph

A new tensor with gradients zeroed

The same tensor with requires_grad set to False

Common Mistake: In-Place Ops Break Autograd

Mistake:

Using in-place operations on tensors that are part of the computation graph:

x = torch.tensor([1.0], requires_grad=True)
y = torch.exp(x)    # exp saves its output for the backward pass
y.add_(1)           # in-place edit of a tensor autograd still needs
y.sum().backward()  # RuntimeError: one of the variables needed for gradient
                    # computation has been modified by an inplace operation

Correction:

Use out-of-place operations when autograd is active:

y = torch.exp(x)
y = y + 1           # out-of-place: creates a new tensor, graph stays valid
y.sum().backward()  # succeeds; x.grad == exp(x)

Why This Matters: Autograd Powers Wireless System Optimization

In modern wireless communications, autograd enables joint end-to-end learning of the transmitter and receiver. Instead of deriving gradients of the bit error rate analytically (often intractable), autograd differentiates through the entire communication chain: encoder $\to$ channel $\to$ decoder. This is the foundation of autoencoder-based communication systems.

See full treatment in Chapter 30

Key Takeaway

Autograd builds a dynamic computational graph on-the-fly and computes exact gradients via reverse-mode AD in a single backward pass. Use torch.no_grad() for inference, .detach() to sever graph connections, and always zero gradients between iterations.

Autograd

PyTorch's automatic differentiation engine that records operations on tensors and computes gradients via reverse-mode differentiation.

Related: Computational Graph

Computational Graph

A directed acyclic graph where nodes represent operations and edges represent data flow. Built dynamically during the forward pass and traversed in reverse during .backward().

Related: Autograd

Autograd Basics and Patterns

Complete examples of autograd usage: gradient computation, higher-order derivatives, custom backward functions, and Jacobian computation.

# Code from: ch12/python/autograd_basics.py