Automatic Differentiation (autograd)

Definition:

Computational Graph and Autograd

When requires_grad=True is set on a tensor, PyTorch records every operation into a directed acyclic graph (DAG). Each node stores the operation (a Function) and its inputs. Calling .backward() traverses this graph in reverse (topological) order, applying the chain rule to accumulate gradients.

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2 + 3 * x
z = y.sum()
z.backward()
print(x.grad)   # tensor([7., 9.])  i.e. 2x + 3

The graph is dynamic: it is rebuilt from scratch on every forward pass. This allows data-dependent control flow (if/else, loops) that is awkward or impossible to express with static graphs.

Gradients are accumulated into .grad, not overwritten. Call x.grad.zero_() (or optimizer.zero_grad() when using an optimizer) before each new backward pass to avoid accumulating stale gradients.
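A quick sketch of the accumulation behavior: two forward/backward passes on the same leaf sum their gradients rather than replacing them.

import torch

x = torch.tensor([2.0], requires_grad=True)
for _ in range(2):
    y = (x ** 2).sum()
    y.backward()        # adds into x.grad rather than overwriting it
print(x.grad)           # tensor([8.]): 2x = 4 from each pass, summed
x.grad.zero_()          # reset before the next iteration
print(x.grad)           # tensor([0.])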

Definition:

Leaf Tensors and Intermediate Tensors

A leaf tensor is one created directly by the user (not as the result of an operation). Only leaf tensors with requires_grad=True retain their .grad attribute after .backward().

a = torch.tensor([1.0], requires_grad=True)   # leaf
b = a * 2                                     # NOT a leaf
print(a.is_leaf)   # True
print(b.is_leaf)   # False

Intermediate tensors' gradients are computed during backprop but discarded unless you call b.retain_grad() before .backward().
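For example, retain_grad() makes an intermediate tensor keep its gradient after backward (a minimal sketch):

import torch

a = torch.tensor([1.0], requires_grad=True)
b = a * 2               # intermediate tensor
b.retain_grad()         # ask autograd to keep b's gradient
c = (b ** 2).sum()
c.backward()
print(b.grad)           # tensor([4.]) = 2*b, kept thanks to retain_grad()
print(a.grad)           # tensor([8.]); leaf gradients are always kept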

Theorem: Reverse-Mode Automatic Differentiation

For a composition $L = f_n \circ f_{n-1} \circ \cdots \circ f_1(\mathbf{x})$, the gradient is:

$$\frac{\partial L}{\partial \mathbf{x}} = \left(\frac{\partial f_1}{\partial \mathbf{x}}\right)^{T} \left(\frac{\partial f_2}{\partial f_1}\right)^{T} \cdots \left(\frac{\partial f_n}{\partial f_{n-1}}\right)^{T}$$

Reverse-mode AD computes this right to left (from output to input) in a single backward pass, regardless of the dimension of $\mathbf{x}$. The cost is $O(1)$ backward passes for any number of parameters.

Forward-mode AD would need one pass per input dimension; reverse-mode needs one pass per output dimension. Since loss functions map $\mathbb{R}^n \to \mathbb{R}$ (scalar output), reverse-mode is dramatically more efficient: this is why backpropagation works.
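To make the asymmetry concrete, here is a small sketch: a scalar loss over a million parameters, whose full gradient falls out of a single backward pass (forward mode would need a million passes).

import torch

n = 1_000_000
w = torch.randn(n, requires_grad=True)
loss = (w ** 2).sum()    # f: R^n -> R, a scalar output
loss.backward()          # one backward pass, full gradient
print(w.grad.shape)      # torch.Size([1000000]); equals 2*w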

Theorem: Gradient of a Quadratic Form

For symmetric $\mathbf{A} \in \mathbb{R}^{n \times n}$ and the quadratic form $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$:

$$\nabla_{\mathbf{x}} f = 2\mathbf{A}\mathbf{x}$$

PyTorch's autograd computes this exactly:

A = torch.eye(3, dtype=torch.float64)
x = torch.tensor([1., 2., 3.], dtype=torch.float64, requires_grad=True)
f = x @ A @ x
f.backward()
print(x.grad)              # tensor([2., 4., 6.]) = 2*A@x

This is the multivariable analogue of $\frac{d}{dx}(ax^2) = 2ax$. It appears constantly in optimization: the gradient of a least-squares objective $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ involves exactly this form.
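As a sanity check, the analytical least-squares gradient $2\mathbf{A}^T(\mathbf{A}\mathbf{x} - \mathbf{b})$ can be compared against autograd (a short sketch with random data):

import torch

A = torch.randn(5, 3)
b = torch.randn(5)
x = torch.randn(3, requires_grad=True)
loss = ((A @ x - b) ** 2).sum()           # ||Ax - b||^2
loss.backward()
manual = 2 * A.T @ (A @ x.detach() - b)   # analytical gradient
print(torch.allclose(x.grad, manual))     # True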

Autograd-Powered Gradient Descent

Watch gradient descent minimize a 2D function using autograd to compute gradients automatically. Compare convergence for different learning rates and functions.
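The core loop behind such a demo is only a few lines. A minimal sketch, assuming a simple quadratic bowl as the example function:

import torch

xy = torch.tensor([2.0, -1.5], requires_grad=True)
lr = 0.1                                  # learning rate
for step in range(100):
    f = (xy[0] - 1) ** 2 + 2 * (xy[1] + 0.5) ** 2
    f.backward()                          # autograd fills xy.grad
    with torch.no_grad():                 # the update must not be recorded
        xy -= lr * xy.grad
    xy.grad.zero_()                       # clear before the next step
print(xy)                                 # converges toward (1.0, -0.5)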


Example: Differentiating Through Matrix Operations

Use autograd to compute $\frac{\partial}{\partial \mathbf{A}} \mathrm{tr}(\mathbf{A}^T \mathbf{B})$ for random matrices $\mathbf{A}$ and $\mathbf{B}$. Verify the result matches the analytical formula.
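A sketch of this verification. Since $\mathrm{tr}(\mathbf{A}^T\mathbf{B}) = \sum_{ij} A_{ij} B_{ij}$, the analytical gradient is exactly $\mathbf{B}$:

import torch

A = torch.randn(4, 4, requires_grad=True)
B = torch.randn(4, 4)
f = torch.trace(A.T @ B)                  # tr(A^T B) = sum_ij A_ij * B_ij
f.backward()
print(torch.allclose(A.grad, B))          # True: the gradient is B itself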

Example: Disabling Gradient Tracking for Inference

Show how torch.no_grad() disables the computational graph to save memory during inference or evaluation.
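A minimal sketch of the pattern, using a small hypothetical nn.Linear model for illustration:

import torch

model = torch.nn.Linear(10, 2)            # hypothetical tiny model
x = torch.randn(5, 10)

with torch.no_grad():                     # no graph is recorded inside
    out = model(x)
print(out.requires_grad)                  # False: nothing to backprop, less memory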

Computational Graph Visualizer

Enter a mathematical expression and see the computational graph that autograd builds. Observe how the backward pass traverses nodes in reverse order.
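A rough textual stand-in for the visualizer: walk the grad_fn links backward from the output, following just the first parent at each node (this relies on the grad_fn and next_functions attributes autograd exposes on graph nodes):

import torch

x = torch.tensor([2.0], requires_grad=True)
z = (x ** 2 + 3 * x).sum()
node = z.grad_fn
while node is not None:                   # SumBackward0 -> AddBackward0 -> ...
    print(type(node).__name__)
    node = node.next_functions[0][0] if node.next_functions else None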


Quick Check

If you run the forward pass and call .backward() twice without zeroing gradients in between, what happens to .grad?

It is overwritten with the new gradient

It is doubled (accumulated)

A RuntimeError is raised

It remains unchanged

Quick Check

What does .detach() return?

A copy of the tensor on CPU

A view sharing data but detached from the computation graph

A new tensor with gradients zeroed

The same tensor with requires_grad set to False

Common Mistake: In-Place Ops Break Autograd

Mistake:

Using in-place operations on tensors that are part of the computation graph:

x = torch.tensor([1.0], requires_grad=True)
y = torch.exp(x)    # exp saves its output for the backward pass
y.add_(1)           # in-place edit of a tensor autograd still needs
y.sum().backward()  # RuntimeError: one of the variables needed for gradient
                    # computation has been modified by an inplace operation

Correction:

Use out-of-place operations when autograd is active:

y = torch.exp(x)
y = y + 1           # out-of-place: creates a new tensor, graph stays valid
y.sum().backward()  # succeeds; x.grad == exp(x)

Why This Matters: Autograd Powers Wireless System Optimization

In modern wireless communications, autograd enables joint end-to-end learning of the transmitter and receiver. Instead of deriving gradients of the bit error rate analytically (often intractable), autograd differentiates through the entire communication chain: encoder $\to$ channel $\to$ decoder. This is the foundation of autoencoder-based communication systems.

See full treatment in Chapter 30

Key Takeaway

Autograd builds a dynamic computational graph on-the-fly and computes exact gradients via reverse-mode AD in a single backward pass. Use torch.no_grad() for inference, .detach() to sever graph connections, and always zero gradients between iterations.

Autograd

PyTorch's automatic differentiation engine that records operations on tensors and computes gradients via reverse-mode differentiation.

Related: Computational Graph

Computational Graph

A directed acyclic graph where nodes represent operations and edges represent data flow. Built dynamically during the forward pass and traversed in reverse during .backward().

Related: Autograd

Autograd Basics and Patterns

Complete examples of autograd usage: gradient computation, higher-order derivatives, custom backward functions, and Jacobian computation.

# Code from: ch12/python/autograd_basics.py