Unconstrained Optimization
Why Optimization Is Everywhere
Every machine learning model, every signal processing algorithm, every
engineering design problem ultimately reduces to an optimization problem:
find the parameters that minimize (or maximize) some objective.
SciPy provides scipy.optimize.minimize, a unified interface to
dozens of battle-tested optimization algorithms, each suited to
different problem structures.
In wireless communications, water-filling power allocation, beamforming design, and maximum-likelihood detection are all optimization problems. The tools in this chapter let you solve them in a few lines of Python.
Definition: Unconstrained Optimization Problem
An unconstrained optimization problem is:

$$\min_{x \in \mathbb{R}^n} f(x)$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is the objective function. A point $x^*$ is a local minimizer if $f(x^*) \le f(x)$ for all $x$ in a neighborhood of $x^*$.
Necessary condition (first-order): $\nabla f(x^*) = 0$.
Sufficient condition (second-order): $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite).
A global minimizer satisfies $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$. For non-convex problems, local and global minimizers may differ.
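As a minimal sketch (the simple convex quadratic below is an illustrative choice, not from the text), both conditions can be checked numerically at a candidate point:
import numpy as np
def f(x):
    return x[0]**2 + 2 * x[1]**2          # convex quadratic, minimizer at the origin
def grad_f(x):
    return np.array([2 * x[0], 4 * x[1]])
def hess_f(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])
x_star = np.array([0.0, 0.0])
print(np.linalg.norm(grad_f(x_star)))          # first-order: gradient norm ~ 0
print(np.linalg.eigvalsh(hess_f(x_star)) > 0)  # second-order: all eigenvalues positive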
Definition: Gradient and Hessian
The gradient of $f: \mathbb{R}^n \to \mathbb{R}$ is the vector of first partial derivatives:

$$\nabla f(x) = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right]^T$$

The Hessian is the $n \times n$ matrix of second partial derivatives:

$$[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$

For a twice-differentiable function, the Hessian is symmetric. The Hessian encodes the curvature of $f$: its eigenvalues determine whether the function curves up (positive), down (negative), or is flat in each direction.
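A quick sketch of how the eigenvalue signs classify a stationary point (the saddle function used here is an illustrative choice):
import numpy as np
# Hessian of f(x) = x_1^2 - x_2^2, which is constant; the origin is a stationary point
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigvals, eigvecs = np.linalg.eigh(H)
print(eigvals)         # [-2.  2.] -> mixed signs: the origin is a saddle, not a minimizer
print(eigvecs[:, 0])   # eigenvector of the negative eigenvalue: a descent direction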
Definition: The scipy.optimize.minimize Interface
from scipy.optimize import minimize
result = minimize(
fun, # callable: f(x) -> float
x0, # initial guess (1D array)
method=..., # 'Nelder-Mead', 'BFGS', 'L-BFGS-B', etc.
jac=..., # gradient: callable, or True if fun returns (f, grad)
hess=..., # Hessian (callable, or HessianUpdateStrategy)
tol=..., # convergence tolerance
options=..., # dict of solver-specific options
)
The returned OptimizeResult object contains:
- result.x: the optimal point
- result.fun: the optimal value
- result.jac: gradient at the solution
- result.nit: number of iterations
- result.success: whether the optimizer converged
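A minimal end-to-end call that prints each of these fields; the quadratic objective is an illustrative placeholder:
import numpy as np
from scipy.optimize import minimize
def f(x):
    return (x[0] - 1)**2 + (x[1] + 2)**2
result = minimize(f, x0=np.zeros(2), method='BFGS')
print(result.x)        # optimal point, approximately [1, -2]
print(result.fun)      # optimal value, approximately 0
print(result.jac)      # gradient at the solution, approximately [0, 0]
print(result.nit)      # number of iterations
print(result.success)  # True if the optimizer converged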
Definition: Nelder-Mead (Simplex) Method
Nelder-Mead is a derivative-free method that maintains a simplex of $n+1$ vertices in $\mathbb{R}^n$. At each iteration, it reflects, expands, contracts, or shrinks the simplex based on function values at the vertices.
result = minimize(f, x0, method='Nelder-Mead')
Pros: No gradient needed; robust for noisy or non-smooth objectives.
Cons: Slow convergence (at best sublinear); no convergence guarantee in dimension $n \ge 2$; does not scale to high dimensions ($n \gtrsim 20$).
Use Nelder-Mead only when gradients are unavailable or the objective is noisy. For smooth problems, gradient-based methods are orders of magnitude faster.
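A short sketch of the intended use case, with an illustrative noisy objective (the noise level is an arbitrary choice):
import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(0)
def noisy_quadratic(x):
    # smooth bowl plus small evaluation noise, so finite-difference gradients would be unreliable
    return np.sum((x - 1.0)**2) + 1e-3 * rng.standard_normal()
res = minimize(noisy_quadratic, x0=np.zeros(4), method='Nelder-Mead')
print(res.x)   # near [1, 1, 1, 1] despite the noise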
Definition: BFGS and L-BFGS-B Quasi-Newton Methods
BFGS (Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that builds an approximate inverse Hessian $H_k \approx [\nabla^2 f(x_k)]^{-1}$ from gradient differences:

$$x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$$

where $\alpha_k$ is chosen by a line search. The update from $H_k$ to $H_{k+1}$ uses rank-2 corrections, built from $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, to maintain positive definiteness.
L-BFGS-B (Limited-memory BFGS with Bounds) stores only the last $m$ gradient differences (typically $m \approx 10$), reducing memory from $O(n^2)$ to $O(mn)$. It also supports simple box constraints $l_i \le x_i \le u_i$.
# BFGS (no bounds)
result = minimize(f, x0, method='BFGS', jac=grad_f)
# L-BFGS-B (with bounds)
bounds = [(0, None)] * n # x_i >= 0
result = minimize(f, x0, method='L-BFGS-B', jac=grad_f, bounds=bounds)
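For reference, the rank-2 inverse-Hessian update can be written out explicitly. SciPy performs this bookkeeping internally, so the following is only an illustrative sketch of the formula:
import numpy as np
def bfgs_update(H, s, y):
    # One BFGS update of the inverse-Hessian approximation H_k -> H_{k+1},
    # with s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k).
    # Requires y @ s > 0 (guaranteed by a Wolfe line search), which keeps
    # the updated approximation symmetric positive definite.
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)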
Definition: Newton-CG and Conjugate Gradient Methods
Newton's method uses the exact Hessian for quadratic convergence:

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
Newton-CG avoids forming the full Hessian inverse by solving the Newton system approximately using the Conjugate Gradient method. This requires only Hessian-vector products, not the full Hessian matrix.
result = minimize(f, x0, method='Newton-CG',
jac=grad_f, hess=hess_f)
# Or with Hessian-vector product only
result = minimize(f, x0, method='Newton-CG',
jac=grad_f, hessp=hess_vec_product)
CG (Conjugate Gradient): A separate method='CG' uses the nonlinear conjugate gradient method (Fletcher-Reeves / Polak-Ribiere). It requires only gradients, not Hessians.
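To make the Hessian-vector-product route concrete, here is a sketch on a quadratic, where the product is simply $A p$ and the Hessian never needs to be formed or inverted; the data are illustrative:
import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)              # symmetric positive definite (illustrative)
b = rng.standard_normal(n)
def f(x):
    return 0.5 * x @ A @ x - b @ x
def grad_f(x):
    return A @ x - b
def hessp(x, p):
    # Hessian-vector product: for this quadratic the Hessian is A everywhere
    return A @ p
res = minimize(f, np.zeros(n), method='Newton-CG', jac=grad_f, hessp=hessp)
print(np.linalg.norm(A @ res.x - b))  # residual of the Newton system, near zero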
Theorem: Convergence Rates of Optimization Methods
For a twice-differentiable, strongly convex function with Lipschitz-continuous Hessian:
| Method | Convergence Rate | Per-Iteration Cost |
|---|---|---|
| Gradient Descent | Linear: $\lVert x_{k+1} - x^* \rVert \le \rho \,\lVert x_k - x^* \rVert$, $\rho < 1$ | $O(n)$ |
| Conjugate Gradient | Superlinear | $O(n)$ |
| BFGS | Superlinear | $O(n^2)$ |
| Newton | Quadratic: $\lVert x_{k+1} - x^* \rVert \le C \,\lVert x_k - x^* \rVert^2$ | $O(n^3)$ |
| Nelder-Mead | Sublinear (no guarantee) | $O(n)$ |
The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ of the Hessian determines how fast gradient descent converges: the linear rate constant is $\rho = (\kappa - 1)/(\kappa + 1)$.
Newton's method fits a local quadratic and jumps to its minimum, giving quadratic convergence. BFGS approximates this behavior without computing the Hessian. Gradient descent uses only first-order information and struggles with ill-conditioned problems.
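A small sketch, using a diagonal quadratic with tunable $\kappa$ (an illustrative choice), showing how the contraction factor $(\kappa - 1)/(\kappa + 1)$ plays out for plain gradient descent:
import numpy as np
for kappa in [2, 10, 100]:
    A = np.diag([1.0, float(kappa)])     # Hessian of f(x) = 0.5 * x @ A @ x; minimizer at the origin
    step = 2.0 / (1.0 + kappa)           # optimal fixed step: 2 / (lambda_min + lambda_max)
    x = np.array([1.0, 1.0])
    for _ in range(50):
        x = x - step * (A @ x)           # gradient step: grad f(x) = A @ x
    # the error contracts by about (kappa - 1)/(kappa + 1) per iteration
    print(kappa, np.linalg.norm(x))      # larger kappa -> slower error decay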
Theorem: Second-Order Sufficient Conditions
If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (positive definite), then $x^*$ is a strict local minimizer of $f$.
Conversely, if $\nabla f(x^*) = 0$ but $\nabla^2 f(x^*)$ has a negative eigenvalue, then $x^*$ is not a local minimizer (it is a saddle point or local maximum), and one can decrease $f$ by moving in the direction of the corresponding eigenvector.
Taylor expansion
For small $d$: $f(x^* + d) = f(x^*) + \nabla f(x^*)^T d + \tfrac{1}{2}\, d^T \nabla^2 f(x^*)\, d + o(\lVert d \rVert^2)$.
Positive definiteness gives the result
Since $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$, we have $d^T \nabla^2 f(x^*)\, d \ge \lambda_{\min} \lVert d \rVert^2 > 0$ for all $d \ne 0$. Hence $f(x^* + d) > f(x^*)$ for all sufficiently small $d \ne 0$.
Theorem: Water-Filling Power Allocation
Given channel gains $g_1, \ldots, g_N$, noise variance $\sigma^2$, and a total power budget $P$, the capacity-maximizing power allocation for $N$ parallel Gaussian channels is:

$$p_i^* = \left(\mu - \frac{\sigma^2}{g_i}\right)^{\!+}$$

where $(z)^+ = \max(z, 0)$ and $\mu$ is the water level chosen so that $\sum_{i=1}^{N} p_i^* = P$.
This maximizes $C = \sum_{i=1}^{N} \log_2\!\left(1 + \frac{p_i g_i}{\sigma^2}\right)$ subject to $\sum_i p_i = P$, $p_i \ge 0$.
Pour water (power) into a vessel whose bottom has heights $\sigma^2/g_i$. Water fills to a common level $\mu$. Good channels (low bottom) get more power; poor channels (high bottom) may get none.
Example: Minimizing the Rosenbrock Function
Minimize the Rosenbrock function

$$f(x_1, x_2) = (1 - x_1)^2 + 100\,(x_2 - x_1^2)^2$$

using different methods of scipy.optimize.minimize. Compare iteration counts and final accuracy.
Setup
import numpy as np
from scipy.optimize import minimize
def rosenbrock(x):
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def rosenbrock_grad(x):
return np.array([
-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
200*(x[1] - x[0]**2)
])
def rosenbrock_hess(x):
return np.array([
[2 - 400*(x[1] - 3*x[0]**2), -400*x[0]],
[-400*x[0], 200]
])
Compare methods
x0 = np.array([-1.0, 1.0])
for method in ['Nelder-Mead', 'CG', 'BFGS', 'Newton-CG']:
kw = {'jac': rosenbrock_grad} if method != 'Nelder-Mead' else {}
if method == 'Newton-CG':
kw['hess'] = rosenbrock_hess
res = minimize(rosenbrock, x0, method=method, **kw)
print(f"{method:12s}: nit={res.nit:4d}, "
f"f={res.fun:.2e}, x={res.x}")
Typical output:
- Nelder-Mead: ~170 iterations (the most of the four)
- CG: ~30 iterations
- BFGS: ~34 iterations
- Newton-CG: ~25 iterations (fewest with Hessian)
Key insight
The Rosenbrock function has a narrow curved valley; at the minimum the Hessian has condition number roughly $2.5 \times 10^3$. First-order methods struggle with the ill-conditioning; Newton-CG exploits curvature information to converge fastest.
Example: Water-Filling Power Allocation via SciPy
Implement the water-filling algorithm for $N = 8$ parallel channels with gains $g_i$ drawn from an exponential distribution. Total power budget $P = 10$, noise variance $\sigma^2 = 1$.
Direct water-filling
import numpy as np
np.random.seed(42)
N = 8
gains = np.random.exponential(1.0, N)
P_total = 10.0
sigma2 = 1.0
# Sort channels by gain (descending)
idx = np.argsort(gains)[::-1]
gains_sorted = gains[idx]
# Water-filling: find water level mu
noise_floor = sigma2 / gains_sorted
for K in range(N, 0, -1):
mu = (P_total + np.sum(noise_floor[:K])) / K
powers = np.maximum(mu - noise_floor, 0)
if powers[K-1] > 0:
break
# Unsort
p_opt = np.zeros(N)
p_opt[idx] = powers
capacity = np.sum(np.log2(1 + p_opt * gains / sigma2))
print(f"Water level: {mu:.3f}")
print(f"Sum capacity: {capacity:.3f} bits/s/Hz")
Verify with scipy.optimize
from scipy.optimize import minimize
def neg_capacity(p):
return -np.sum(np.log2(1 + p * gains / sigma2))
bounds = [(0, None)] * N
res = minimize(neg_capacity, np.ones(N) * P_total/N,
method='SLSQP', bounds=bounds,
constraints={'type': 'ineq',
'fun': lambda p: P_total - np.sum(p)})
print(f"SciPy capacity: {-res.fun:.3f}")
print(f"Max |p_diff|: {np.max(np.abs(res.x - p_opt)):.2e}")
Example: Speeding Up BFGS with Analytical Gradients
Show the speedup from providing an analytical gradient versus relying on finite-difference approximation for an $n = 100$-dimensional quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with symmetric positive definite $A$.
Setup
import numpy as np
from scipy.optimize import minimize
import time
np.random.seed(42)
n = 100
M = np.random.randn(n, n)
A = M.T @ M + 0.1 * np.eye(n) # SPD
b = np.random.randn(n)
def f(x):
return 0.5 * x @ A @ x - b @ x
def grad_f(x):
return A @ x - b
Benchmark
x0 = np.zeros(n)
t0 = time.perf_counter()
res_fd = minimize(f, x0, method='BFGS')
t_fd = time.perf_counter() - t0
t0 = time.perf_counter()
res_an = minimize(f, x0, method='BFGS', jac=grad_f)
t_an = time.perf_counter() - t0
print(f"Finite-diff: {t_fd:.3f}s, nfev={res_fd.nfev}")
print(f"Analytical: {t_an:.3f}s, nfev={res_an.nfev}")
print(f"Speedup: {t_fd/t_an:.1f}x")
With analytical gradients, you typically see a 3-10x speedup and fewer function evaluations.
Optimizer Trajectories on 2D Functions
Visualize the paths taken by different optimization methods (Nelder-Mead, CG, BFGS, Newton-CG) on classic test functions.
Water-Filling Power Allocation
Explore how the water-filling solution changes with total power budget and channel conditions.
Gradient Descent Convergence Animation
Watch gradient descent, BFGS, and Newton's method converge on a 2D quadratic with tunable condition number.
Taxonomy of Optimization Methods
Unconstrained Optimization Examples
# Code from: ch08/python/unconstrained_optimization.py
Quick Check
Which method should you use first for a smooth, unconstrained problem with available gradients?
Nelder-Mead
BFGS
Newton-CG
Brute force grid search
BFGS is the default choice for smooth problems with gradients. It has superlinear convergence and only O(n^2) storage.
Quick Check
What information does BFGS use that gradient descent does not?
The Hessian matrix
An approximate inverse Hessian built from gradient history
Higher-order derivatives
Random perturbations
BFGS accumulates curvature information via rank-2 updates to an approximate inverse Hessian.
Common Mistake: Not Supplying Gradients to BFGS
Mistake:
Using minimize(f, x0, method='BFGS') without providing jac.
SciPy falls back to finite-difference gradients, which require $n$ extra
function evaluations per gradient and introduce approximation error.
Correction:
Always supply analytical gradients when possible:
minimize(f, x0, method='BFGS', jac=grad_f).
Use jac=True if your function returns (f, grad) as a tuple.
Verify gradients with scipy.optimize.check_grad(f, grad_f, x0).
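A short sketch of both idioms, using an illustrative quadratic objective: verifying a gradient with check_grad, then passing a combined (f, grad) callable via jac=True:
import numpy as np
from scipy.optimize import minimize, check_grad
def f(x):
    return np.sum((x - 3.0)**2)
def grad_f(x):
    return 2.0 * (x - 3.0)
x0 = np.zeros(5)
print(check_grad(f, grad_f, x0))      # small value (~1e-7 or below) means the gradient is consistent
def f_and_grad(x):
    # return (objective value, gradient) as a tuple and pass jac=True
    return np.sum((x - 3.0)**2), 2.0 * (x - 3.0)
res = minimize(f_and_grad, x0, method='BFGS', jac=True)
print(res.x)                          # approximately [3, 3, 3, 3, 3]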
Common Mistake: Trusting Local Optima in Non-Convex Problems
Mistake:
Running minimize once on a non-convex problem and assuming the
result is the global minimum.
Correction:
For non-convex problems, use multiple random restarts or global
methods like scipy.optimize.differential_evolution or
scipy.optimize.basinhopping. Always check result.success
and compare objective values from different starting points.
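A sketch of a simple multi-start loop; the Rastrigin test function and the number of restarts are illustrative choices, and basinhopping or differential_evolution are the built-in alternatives:
import numpy as np
from scipy.optimize import minimize
def f(x):
    # Rastrigin function: a quadratic bowl with many superimposed local minima
    return np.sum(x**2) + 10 * np.sum(1 - np.cos(2 * np.pi * x))
rng = np.random.default_rng(0)
best = None
for _ in range(20):
    x0 = rng.uniform(-5, 5, size=2)
    res = minimize(f, x0, method='BFGS')
    # keep the best result; in practice also inspect res.success for each run
    if best is None or res.fun < best.fun:
        best = res
# best local minimum found; the global minimum (the origin, f = 0) may need more restarts
print(best.x, best.fun, best.success)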
Key Takeaway
Match the method to your information. Nelder-Mead for derivative-free, BFGS for gradient-only, Newton-CG for gradient + Hessian, L-BFGS-B for large-scale with bounds. More information per iteration = faster convergence.
Key Takeaway
Always provide analytical gradients when possible. Finite-difference
approximations cost $O(n)$ extra function evaluations per step and
introduce truncation error. Use check_grad to verify your gradient
implementation before running the optimizer.
Why This Matters: Water-Filling in MIMO-OFDM Systems
Water-filling power allocation is the optimal strategy for parallel Gaussian channels, which arise naturally in MIMO-OFDM systems after SVD decomposition of the channel matrix. The water-filling solution from this section directly determines how power is distributed across subcarriers and spatial streams in 5G NR and Wi-Fi 6.
See full treatment in Chapter 12
Historical Note: The Nelder-Mead Simplex Method
1960s: John Nelder and Roger Mead published their simplex method in 1965 in The Computer Journal. Despite having no convergence proof for dimensions $n \ge 2$, it remains one of the most widely used optimization methods in science and engineering, especially for noisy or non-differentiable objectives. SciPy implements the original algorithm with adaptive parameter selection (Gao and Han, 2012).
Historical Note: Four Independent Discoverers of BFGS
1970s: The BFGS update was independently discovered by Broyden, Fletcher, Goldfarb, and Shanno, all in 1970. This was one of the most remarkable cases of simultaneous discovery in numerical analysis. The limited-memory variant (L-BFGS) by Nocedal (1980) made it practical for large-scale problems and is the workhorse behind many machine learning optimizers.
Gradient
The vector of first partial derivatives $\nabla f(x)$, pointing in the direction of steepest ascent.
Related: Hessian, Objective Function
Hessian
The matrix of second partial derivatives $\nabla^2 f(x)$, encoding the curvature of the objective function.
Related: Gradient
Quasi-Newton Method
An optimization method that approximates the Hessian (or its inverse) from gradient differences, avoiding the cost of computing exact second derivatives. BFGS is the most popular example.
Line Search
A one-dimensional optimization along the search direction to find a step size satisfying sufficient decrease (Wolfe conditions).
Comparison of Unconstrained Optimization Methods in SciPy
| Method | Gradient? | Hessian? | Convergence | Best For |
|---|---|---|---|---|
| Nelder-Mead | No | No | Sublinear | Noisy/non-smooth, n < 20 |
| CG | Yes | No | Superlinear | Medium-scale smooth |
| BFGS | Yes | No | Superlinear | Default for smooth problems |
| L-BFGS-B | Yes | No | Superlinear | Large-scale, box constraints |
| Newton-CG | Yes | Yes (or HVP) | Quadratic | When Hessian is cheap |
| trust-ncg | Yes | Yes | Quadratic | Non-convex with good Hessian |