Automatic Differentiation (autograd)
Definition: Computational Graph and Autograd
When requires_grad=True is set on a tensor, PyTorch records every
operation into a directed acyclic graph (DAG). Each node stores
the operation (a Function) and its inputs. Calling .backward()
traverses this graph in reverse (topological) order, applying the
chain rule to accumulate gradients.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2 + 3 * x
z = y.sum()
z.backward()
print(x.grad) # tensor([7., 9.]) i.e. 2x + 3
The graph is dynamic: it is rebuilt from scratch on every forward pass. This allows data-dependent control flow (if/else, loops) that would be impossible with static graphs.
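As a small illustration (a minimal sketch, not taken from the chapter's code files), ordinary Python branching and looping decide which operations get recorded, and gradients flow through whatever graph was actually built:
import torch

x = torch.tensor(3.0, requires_grad=True)

if x.item() > 0:       # ordinary Python branch, decided at runtime
    y = x ** 2         # only this branch is recorded for x = 3.0
else:
    y = -x

for _ in range(2):     # ordinary Python loop, unrolled into the graph
    y = y * 2

y.backward()
print(x.grad)          # tensor(24.) = d/dx (4 * x**2) at x = 3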
Gradients are accumulated into .grad, not overwritten.
Call x.grad.zero_() (or an optimizer's zero_grad()) before each new backward
pass to avoid stale gradient accumulation.
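A short, self-contained sketch of the accumulation behaviour:
import torch

x = torch.tensor([2.0], requires_grad=True)
for step in range(2):
    y = (x ** 2).sum()
    y.backward()
    print(x.grad)      # step 0: tensor([4.]); step 1: tensor([8.]) - accumulated

x.grad.zero_()         # reset before the next backward pass
(x ** 2).sum().backward()
print(x.grad)          # tensor([4.]) again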
Definition: Leaf Tensors and Intermediate Tensors
A leaf tensor is one created directly by the user (not as the
result of an operation). Only leaf tensors with requires_grad=True
retain their .grad attribute after .backward().
a = torch.tensor([1.0], requires_grad=True) # leaf
b = a * 2 # NOT a leaf
print(a.is_leaf) # True
print(b.is_leaf) # False
Intermediate tensors' gradients are computed during backprop but
discarded unless you call b.retain_grad() before .backward().
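A brief sketch of retain_grad() in use (the values are chosen only for illustration):
import torch

a = torch.tensor([1.0], requires_grad=True)   # leaf
b = a * 2                                     # intermediate (not a leaf)
b.retain_grad()                               # ask autograd to keep b.grad
c = (b ** 2).sum()
c.backward()
print(a.grad)   # tensor([8.]) = dc/db * db/da = 2*b * 2
print(b.grad)   # tensor([4.]) = 2*b, kept only because of retain_grad()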
Theorem: Reverse-Mode Automatic Differentiation
For a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ evaluated at $x$, the chain rule gives the Jacobian product
$$\frac{\partial f}{\partial x} = \frac{\partial f_L}{\partial f_{L-1}} \, \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdots \frac{\partial f_1}{\partial x}.$$
Reverse-mode AD evaluates this product from the output side toward the input, as a sequence of vector-Jacobian products, in a single backward pass, regardless of the dimension of $x$. The cost is one backward pass for any number of parameters.
Forward-mode AD would need one pass per input dimension. Reverse-mode needs one pass per output dimension. Since loss functions map $\mathbb{R}^n \to \mathbb{R}$ (scalar output), reverse-mode is dramatically more efficient; this is why backpropagation works.
Complexity Argument
Let the computation have $n$ parameters and $m$ outputs.
- Forward-mode: $n$ passes, each $O(C)$, where $C$ is the cost of one forward evaluation.
- Reverse-mode: $m$ passes, each $O(C)$.
- For $m = 1$ (scalar loss): reverse-mode is $O(C)$ total.
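A small sketch of this asymmetry (the dimension below is arbitrary): one call to .backward() on a scalar loss yields the gradient with respect to every parameter at once.
import torch

n = 1_000_000                    # number of parameters
w = torch.randn(n, requires_grad=True)
x = torch.randn(n)

loss = torch.sin(w @ x)          # scalar output, i.e. m = 1
loss.backward()                  # a single reverse pass
print(w.grad.shape)              # torch.Size([1000000]): all n gradients at once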
Theorem: Gradient of a Quadratic Form
For symmetric $A \in \mathbb{R}^{n \times n}$ and the quadratic form $f(x) = x^\top A x$:
$$\nabla f(x) = 2 A x$$
PyTorch's autograd computes this exactly:
A = torch.eye(3, dtype=torch.float64)
x = torch.tensor([1., 2., 3.], dtype=torch.float64, requires_grad=True)
f = x @ A @ x
f.backward()
print(x.grad) # tensor([2., 4., 6.]) = 2*A@x
This is the multivariable analogue of $\frac{d}{dx}\left(a x^2\right) = 2 a x$. It appears constantly in optimization: the gradient of a least-squares objective involves this form.
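The identity matrix in the snippet above makes the check somewhat trivial; a quick sketch with a random symmetric matrix verifies the general formula:
import torch

M = torch.randn(4, 4, dtype=torch.float64)
A = 0.5 * (M + M.T)                       # symmetrize
x = torch.randn(4, dtype=torch.float64, requires_grad=True)

f = x @ A @ x                             # quadratic form x^T A x
f.backward()
print(torch.allclose(x.grad, 2 * A @ x))  # True: gradient equals 2*A@x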
Autograd-Powered Gradient Descent
Watch gradient descent minimize a 2D function using autograd to compute gradients automatically. Compare convergence for different learning rates and functions.
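The interactive demo itself is not reproduced here, but the loop it animates can be sketched in a few lines (the objective and learning rate below are illustrative choices, not the demo's defaults):
import torch

xy = torch.tensor([2.0, -1.5], requires_grad=True)   # starting point
lr = 0.05                                             # illustrative learning rate

for step in range(50):
    loss = (xy[0] - 1) ** 2 + 10 * (xy[1] + 2) ** 2   # illustrative 2D objective
    loss.backward()                                   # autograd supplies the gradient
    with torch.no_grad():                             # update without recording a graph
        xy -= lr * xy.grad
    xy.grad.zero_()                                   # avoid stale accumulation

print(xy)   # approaches tensor([1., -2.]), the minimizer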
Example: Differentiating Through Matrix Operations
Use autograd to compute $\nabla_A \operatorname{tr}(A^\top B)$ for random matrices $A$ and $B$. Verify the result matches the analytical formula $\nabla_A \operatorname{tr}(A^\top B) = B$.
Autograd Computation
import torch
A = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)
B = torch.randn(3, 3, dtype=torch.float64)
f = torch.trace(A.T @ B)
f.backward()
print(f"Autograd gradient:\n{A.grad}")
print(f"Analytical (= B):\n{B}")
print(f"Match: {torch.allclose(A.grad, B)}")
Mathematical Verification
We know that $\nabla_A \operatorname{tr}(A^\top B) = B$: since $\operatorname{tr}(A^\top B) = \sum_{ij} A_{ij} B_{ij}$, differentiating with respect to $A_{ij}$ leaves exactly $B_{ij}$. Autograd confirms this identity numerically.
Example: Disabling Gradient Tracking for Inference
Show how torch.no_grad() disables the computational graph to
save memory during inference or evaluation.
Implementation
import torch
x = torch.randn(1000, 1000, requires_grad=True)
# WITH autograd: builds a graph and keeps intermediates alive for backward
y = x @ x.T + x.sum()
# WITHOUT autograd: no graph is recorded, so memory stays minimal
with torch.no_grad():
    y_fast = x @ x.T + x.sum()
print(f"requires_grad: {y_fast.requires_grad}") # False
# .detach() creates a view without gradient tracking
x_det = x.detach()
print(f"Shares memory: {x_det.data_ptr() == x.data_ptr()}") # True
When to Use Each
- torch.no_grad(): for evaluation blocks where nothing needs gradients.
- .detach(): to extract a tensor from the graph as a new leaf.
- .requires_grad_(False): to permanently disable gradients on a tensor.
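A compact sketch showing all three in context (the model is a placeholder, not from this chapter):
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                  # placeholder model for illustration
x = torch.randn(4, 10)

# 1. Evaluation block: nothing inside needs gradients
model.eval()
with torch.no_grad():
    preds = model(x)

# 2. Pull a value out of the graph as a new leaf (e.g. for logging)
logits = model(x)
logged = logits.detach()                  # shares storage, no graph connection

# 3. Permanently freeze parameters (e.g. for transfer learning)
for p in model.parameters():
    p.requires_grad_(False)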
Computational Graph Visualizer
Enter a mathematical expression and see the computational graph that autograd builds. Observe how the backward pass traverses nodes in reverse order.
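Outside the visualizer, the recorded graph can also be inspected programmatically through the grad_fn chain; a minimal sketch (node names may differ slightly across PyTorch versions):
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
z = (x ** 2 + 3 * x).sum()

def walk(fn, depth=0):
    # Recursively print backward-graph nodes from the output toward the leaves.
    if fn is None:
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

walk(z.grad_fn)
# Typical output: SumBackward0 -> AddBackward0 -> {PowBackward0, MulBackward0}
# -> AccumulateGrad nodes at the leaves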
Quick Check
If you call .backward() twice without zeroing gradients, what happens
to .grad?
It is overwritten with the new gradient
It is doubled (accumulated)
A RuntimeError is raised
It remains unchanged
PyTorch accumulates gradients by default. Call .grad.zero_() between backward passes.
Quick Check
What does .detach() return?
A copy of the tensor on CPU
A view sharing data but detached from the computation graph
A new tensor with gradients zeroed
The same tensor with requires_grad set to False
.detach() returns a view (same storage) with requires_grad=False, severed from the graph.
Common Mistake: In-Place Ops Break Autograd
Mistake:
Using in-place operations on tensors that are part of the computation graph:
x = torch.tensor([1.0], requires_grad=True)
y = torch.exp(x)   # autograd saves the output y for the backward pass
y.add_(1)          # in-place modification of a tensor the graph still needs
y.backward()       # RuntimeError: one of the variables needed for gradient
                   # computation has been modified by an inplace operation
Correction:
Use out-of-place operations when autograd is active:
y = torch.exp(x)
y = y + 1          # creates a new tensor; the graph remains valid
y.backward()
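If an in-place update is genuinely needed, one common workaround (a sketch, assuming the same x as above) is to clone first, so the tensor saved for the backward pass stays untouched:
y = torch.exp(x)
y2 = y.clone()      # the clone participates in the graph
y2.add_(1)          # in-place on the clone; the saved y is unmodified
y2.backward()       # works: the gradient flows through the clone back to x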
Why This Matters: Autograd Powers Wireless System Optimization
In modern wireless communications, autograd enables end-to-end learning of transmitter and receiver jointly. Instead of deriving gradients of the bit error rate analytically (often intractable), autograd differentiates through the entire communication chain: encoder → channel → decoder. This is the foundation of autoencoder-based communication systems.
See full treatment in Chapter 30
Key Takeaway
Autograd builds a dynamic computational graph on-the-fly and
computes exact gradients via reverse-mode AD in a single backward pass.
Use torch.no_grad() for inference, .detach() to sever graph connections,
and always zero gradients between iterations.
Autograd
PyTorch's automatic differentiation engine that records operations on tensors and computes gradients via reverse-mode differentiation.
Related: Computational Graph
Computational Graph
A directed acyclic graph where nodes represent operations and edges
represent data flow. Built dynamically during the forward pass and
traversed in reverse during .backward().
Related: Autograd
Autograd Basics and Patterns
# Code from: ch12/python/autograd_basics.py