Unconstrained Optimization

Why Optimization Is Everywhere

Every machine learning model, every signal processing algorithm, every engineering design problem ultimately reduces to an optimization problem: find the parameters that minimize (or maximize) some objective. SciPy provides scipy.optimize.minimize, a unified interface to dozens of battle-tested optimization algorithms, each suited to different problem structures.

In wireless communications, water-filling power allocation, beamforming design, and maximum-likelihood detection are all optimization problems. The tools in this chapter let you solve them in a few lines of Python.

Definition:

Unconstrained Optimization Problem

An unconstrained optimization problem is:

$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function. A point $\mathbf{x}^\star$ is a local minimizer if $f(\mathbf{x}^\star) \le f(\mathbf{x})$ for all $\mathbf{x}$ in a neighborhood of $\mathbf{x}^\star$.

Necessary condition (first-order): $\nabla f(\mathbf{x}^\star) = \mathbf{0}$.

Sufficient condition (second-order): $\nabla f(\mathbf{x}^\star) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^\star) \succ 0$ (positive definite).

A global minimizer satisfies $f(\mathbf{x}^\star) \le f(\mathbf{x})$ for all $\mathbf{x}$. For non-convex problems, local and global minimizers may differ.

Definition:

Gradient and Hessian

The gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is the vector of first partial derivatives:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T$$

The Hessian is the $n \times n$ matrix of second partial derivatives:

$$[\nabla^2 f(\mathbf{x})]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

For a twice continuously differentiable function, the Hessian is symmetric. The Hessian encodes the curvature of $f$: its eigenvalues determine whether the function curves up (positive), down (negative), or is flat in each direction.
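As a sanity check, the gradient and Hessian of a small test function can be approximated with central finite differences and compared against hand-derived values. The function and helper routines below are illustrative, not part of SciPy:

```python
import numpy as np

def f(x):
    # Illustrative function: f(x) = x0^2 + 3 x0 x1 + 2 x1^2
    return x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2

def num_grad(func, x, h=1e-6):
    # Central-difference approximation of the gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

def num_hess(func, x, h=1e-5):
    # Central-difference Hessian, built column by column from num_grad
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (num_grad(func, x + e) - num_grad(func, x - e)) / (2 * h)
    return H

x = np.array([1.0, -2.0])
print(num_grad(f, x))  # hand-derived gradient at (1, -2): [-4, -5]
print(num_hess(f, x))  # hand-derived Hessian: [[2, 3], [3, 4]] (symmetric)
```

Note that the numerically computed Hessian comes out symmetric, as the definition above requires.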

Definition:

The scipy.optimize.minimize Interface

from scipy.optimize import minimize

result = minimize(
    fun,          # callable: f(x) -> float
    x0,           # initial guess (1D array)
    method=...,   # 'Nelder-Mead', 'BFGS', 'L-BFGS-B', etc.
    jac=...,      # gradient (callable or True for auto)
    hess=...,     # Hessian (callable, or HessianUpdateStrategy)
    tol=...,      # convergence tolerance
    options=...,  # dict of solver-specific options
)

The returned OptimizeResult object contains:

  • result.x: the optimal point $\mathbf{x}^\star$
  • result.fun: the optimal value $f(\mathbf{x}^\star)$
  • result.jac: gradient at the solution
  • result.nit: number of iterations
  • result.success: whether the optimizer converged
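A minimal end-to-end call illustrates these fields. The objective below is a made-up quadratic bowl with minimizer at $(1, -2)$:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective with known minimizer (1, -2)
f = lambda x: (x[0] - 1)**2 + (x[1] + 2)**2

result = minimize(f, x0=np.zeros(2), method='BFGS')
print(result.x)        # near [1, -2]
print(result.fun)      # near 0
print(result.success)  # True once the gradient norm drops below tolerance
print(result.nit)      # number of iterations used
```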

Definition:

Nelder-Mead (Simplex) Method

Nelder-Mead is a derivative-free method that maintains a simplex of $n+1$ vertices in $\mathbb{R}^n$. At each iteration, it reflects, expands, contracts, or shrinks the simplex based on function values at the vertices.

result = minimize(f, x0, method='Nelder-Mead')

Pros: No gradient needed; robust for noisy or non-smooth objectives.

Cons: Slow convergence ($O(1/\sqrt{k})$); no convergence guarantee in dimensions $> 1$; does not scale to high dimensions ($n > 20$).

Use Nelder-Mead only when gradients are unavailable or the objective is noisy. For smooth problems, gradient-based methods are orders of magnitude faster.

Definition:

BFGS and L-BFGS-B Quasi-Newton Methods

BFGS (Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that builds an approximate inverse Hessian $\mathbf{B}_k^{-1}$ from gradient differences:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{B}_k^{-1} \nabla f(\mathbf{x}_k)$$

where $\alpha_k$ is chosen by a line search. The update uses rank-2 corrections to maintain positive definiteness.

L-BFGS-B (Limited-memory BFGS with Bounds) stores only the last $m$ gradient differences (typically $m = 10$), reducing memory from $O(n^2)$ to $O(mn)$. It also supports simple box constraints $l_i \le x_i \le u_i$.

# BFGS (no bounds)
result = minimize(f, x0, method='BFGS', jac=grad_f)

# L-BFGS-B (with bounds)
bounds = [(0, None)] * n  # x_i >= 0
result = minimize(f, x0, method='L-BFGS-B', jac=grad_f, bounds=bounds)

Definition:

Newton-CG and Conjugate Gradient Methods

Newton's method uses the exact Hessian for quadratic convergence:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - [\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$

Newton-CG avoids forming the full Hessian inverse by solving the Newton system $\nabla^2 f(\mathbf{x}_k)\,\mathbf{d} = -\nabla f(\mathbf{x}_k)$ approximately using the Conjugate Gradient method. This requires only Hessian-vector products, not the full Hessian matrix.

result = minimize(f, x0, method='Newton-CG',
                  jac=grad_f, hess=hess_f)

# Or with Hessian-vector product only
result = minimize(f, x0, method='Newton-CG',
                  jac=grad_f, hessp=hess_vec_product)

CG (Conjugate Gradient): A separate method='CG' uses the nonlinear conjugate gradient method (Fletcher-Reeves / Polak-Ribiere). It requires only gradients, not Hessians.

Theorem: Convergence Rates of Optimization Methods

For a twice-differentiable, strongly convex function $f$ with Lipschitz-continuous Hessian:

| Method | Convergence Rate | Per-Iteration Cost |
|---|---|---|
| Gradient Descent | Linear: $\lVert \mathbf{x}_{k+1} - \mathbf{x}^\star \rVert \le c\, \lVert \mathbf{x}_k - \mathbf{x}^\star \rVert$ | $O(n)$ |
| Conjugate Gradient | Superlinear | $O(n)$ |
| BFGS | Superlinear | $O(n^2)$ |
| Newton | Quadratic: $\lVert \mathbf{x}_{k+1} - \mathbf{x}^\star \rVert \le c\, \lVert \mathbf{x}_k - \mathbf{x}^\star \rVert^2$ | $O(n^3)$ |
| Nelder-Mead | Sublinear (no guarantee) | $O(n)$ |

The condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ of the Hessian determines how fast gradient descent converges: the linear rate constant is $c = (\kappa - 1)/(\kappa + 1)$.

Newton's method fits a local quadratic and jumps to its minimum, giving quadratic convergence. BFGS approximates this behavior without computing the Hessian. Gradient descent uses only first-order information and struggles with ill-conditioned problems.
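The linear rate can be checked empirically. Running fixed-step gradient descent on a 2D quadratic whose Hessian has eigenvalues $1$ and $\kappa$ (a toy setup chosen for illustration) reproduces the contraction factor $c = (\kappa - 1)/(\kappa + 1)$:

```python
import numpy as np

kappa = 10.0
A = np.diag([1.0, kappa])            # Hessian eigenvalues 1 and kappa
alpha = 2.0 / (1.0 + kappa)          # optimal fixed step for this spectrum

x = np.array([1.0, 1.0])             # minimizer of 0.5 x^T A x is the origin
rates = []
for _ in range(50):
    x_new = x - alpha * (A @ x)      # gradient of 0.5 x^T A x is A x
    rates.append(np.linalg.norm(x_new) / np.linalg.norm(x))
    x = x_new

print(max(rates))                    # matches (kappa - 1)/(kappa + 1) = 0.8181...
```

For this step size both coordinates contract by exactly $(\kappa-1)/(\kappa+1)$ per iteration, so the observed per-step error ratio equals the theoretical rate.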

Theorem: Second-Order Sufficient Conditions

If $\nabla f(\mathbf{x}^\star) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^\star) \succ 0$ (positive definite), then $\mathbf{x}^\star$ is a strict local minimizer of $f$.

Conversely, if $\nabla f(\mathbf{x}^\star) = \mathbf{0}$ but $\nabla^2 f(\mathbf{x}^\star)$ has a negative eigenvalue, then $\mathbf{x}^\star$ is a saddle point, and one can decrease $f$ by moving in the direction of the corresponding eigenvector.
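A small sketch of the saddle-point case, using the standard illustrative example $f(x, y) = x^2 - y^2$ at the origin:

```python
import numpy as np

f = lambda p: p[0]**2 - p[1]**2           # stationary at the origin
H = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of f (constant)

eigvals, eigvecs = np.linalg.eigh(H)
v = eigvecs[:, np.argmin(eigvals)]        # eigenvector of the negative eigenvalue

t = 0.1
print(eigvals)                            # one negative eigenvalue -> saddle point
print(f(t * v) < f(np.zeros(2)))          # True: f decreases along v
```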

Theorem: Water-Filling Power Allocation

Given channel gains $\{h_i\}_{i=1}^{N}$ and a total power budget $P$, the capacity-maximizing power allocation for parallel Gaussian channels is:

$$p_i^\star = \left(\mu - \frac{\sigma^2}{|h_i|^2}\right)^+$$

where $(\cdot)^+ = \max(\cdot, 0)$ and $\mu$ is the water level chosen so that $\sum_i p_i^\star = P$.

This maximizes $\sum_i \log_2(1 + p_i |h_i|^2 / \sigma^2)$ subject to $\sum_i p_i \le P$, $p_i \ge 0$.

Pour water (power) into a vessel whose bottom has heights $\sigma^2/|h_i|^2$. Water fills to a common level $\mu$. Good channels (low bottom) get more power; poor channels (high bottom) may get none.

Example: Minimizing the Rosenbrock Function

Minimize the Rosenbrock function $f(x, y) = (1-x)^2 + 100(y - x^2)^2$ using different methods from scipy.optimize.minimize. Compare iteration counts and final accuracy.
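One possible sketch of this comparison, using SciPy's built-in rosen, rosen_der, and rosen_hess helpers and the classic starting point $(-1.2, 1)$:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])                # classic hard starting point
for method, kwargs in [('Nelder-Mead', {}),
                       ('BFGS', {'jac': rosen_der}),
                       ('Newton-CG', {'jac': rosen_der, 'hess': rosen_hess})]:
    res = minimize(rosen, x0, method=method, **kwargs)
    print(f"{method:11s} nit={res.nit:3d} f(x*)={res.fun:.2e} x*={res.x}")
```

All three converge to the global minimizer $(1, 1)$; the gradient- and Hessian-based methods reach it with far fewer function evaluations than Nelder-Mead.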

Example: Water-Filling Power Allocation via SciPy

Implement the water-filling algorithm for $N = 8$ parallel channels with gains $|h_i|^2$ drawn from an exponential distribution. Total power budget $P = 10$, noise variance $\sigma^2 = 1$.
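A sketch of this example, finding the water level $\mu$ by bisection; the random seed and the bracket for $\mu$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)              # seed chosen for reproducibility
N, P, sigma2 = 8, 10.0, 1.0
gains = rng.exponential(scale=1.0, size=N)  # |h_i|^2
floors = sigma2 / gains                     # vessel-bottom heights sigma^2/|h_i|^2

def total_power(mu):
    # Power used when the water level is mu
    return np.sum(np.maximum(mu - floors, 0.0))

lo, hi = 0.0, floors.min() + P              # bracket: total_power(hi) >= P
for _ in range(100):                        # bisection on the water level
    mu = 0.5 * (lo + hi)
    if total_power(mu) > P:
        hi = mu
    else:
        lo = mu

p = np.maximum(mu - floors, 0.0)
print(p)            # per-channel allocations; weak channels may get zero
print(p.sum())      # approximately P
```

Bisection works because total_power(mu) is continuous and nondecreasing in the water level, so the budget constraint pins down $\mu$ uniquely.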

Example: Speeding Up BFGS with Analytical Gradients

Show the speedup from providing an analytical gradient versus relying on finite-difference approximation for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $n = 100$.
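A possible timing sketch for this experiment; the random positive definite $\mathbf{A}$ below is an assumed construction for illustration:

```python
import time
import numpy as np
from scipy.optimize import minimize

n = 100
rng = np.random.default_rng(1)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite (illustrative)
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b         # analytical gradient of the quadratic

nfevs = {}
for label, jac in [('finite-diff', None), ('analytical', grad)]:
    t0 = time.perf_counter()
    res = minimize(f, np.zeros(n), method='BFGS', jac=jac)
    nfevs[label] = res.nfev
    print(f"{label:11s} nfev={res.nfev:5d} time={time.perf_counter() - t0:.3f} s")
```

Without jac, every gradient evaluation is built from extra calls to f, so the finite-difference run records far more function evaluations for the same solution.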

Optimizer Trajectories on 2D Functions

Visualize the paths taken by different optimization methods (Nelder-Mead, CG, BFGS, Newton-CG) on classic test functions.


Water-Filling Power Allocation

Explore how the water-filling solution changes with total power budget and channel conditions.


Gradient Descent Convergence Animation

Watch gradient descent, BFGS, and Newton's method converge on a 2D quadratic with tunable condition number.


Taxonomy of Optimization Methods

Classification of optimization methods by the information they use. More information (gradients, Hessians) enables faster convergence but requires more computation per iteration.

Unconstrained Optimization Examples

Complete examples of Nelder-Mead, BFGS, L-BFGS-B, CG, and Newton-CG with analytical gradients and Hessians (ch08/python/unconstrained_optimization.py).

Quick Check

Which method should you use first for a smooth, unconstrained problem with available gradients?

Nelder-Mead

BFGS

Newton-CG

Brute force grid search

Quick Check

What information does BFGS use that gradient descent does not?

The Hessian matrix

An approximate inverse Hessian built from gradient history

Higher-order derivatives

Random perturbations

Common Mistake: Not Supplying Gradients to BFGS

Mistake:

Using minimize(f, x0, method='BFGS') without providing jac. SciPy falls back to finite-difference gradients, which requires $n+1$ function evaluations per gradient and introduces approximation error.

Correction:

Always supply analytical gradients when possible: minimize(f, x0, method='BFGS', jac=grad_f). Use jac=True if your function returns (f, grad) as a tuple. Verify gradients with scipy.optimize.check_grad(f, grad_f, x0).
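A quick sketch of this verification, with a deliberately wrong gradient included to show what a failure looks like; the objective is a made-up example:

```python
import numpy as np
from scipy.optimize import check_grad

# Made-up objective with a known gradient
f = lambda x: np.sum(x**4)
grad_f = lambda x: 4 * x**3
bad_grad = lambda x: 3 * x**3      # deliberate bug: wrong coefficient

x0 = np.array([1.0, 2.0, -1.5])
print(check_grad(f, grad_f, x0))   # small (finite-difference noise): consistent
print(check_grad(f, bad_grad, x0)) # order 1 or larger: bug detected
```

check_grad returns the norm of the difference between the supplied gradient and a finite-difference estimate, so "small" means agreement up to truncation error.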

Common Mistake: Trusting Local Optima in Non-Convex Problems

Mistake:

Running minimize once on a non-convex problem and assuming the result is the global minimum.

Correction:

For non-convex problems, use multiple random restarts or global methods like scipy.optimize.differential_evolution or scipy.optimize.basinhopping. Always check result.success and compare objective values from different starting points.
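A minimal multi-start sketch on a made-up non-convex 1D objective (restart count and search interval are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up non-convex objective with several local minima
f = lambda x: np.sin(3 * x[0]) + 0.1 * x[0]**2

rng = np.random.default_rng(42)
best = None
for _ in range(20):                          # multiple random restarts
    x0 = rng.uniform(-5, 5, size=1)
    res = minimize(f, x0, method='BFGS')
    if res.success and (best is None or res.fun < best.fun):
        best = res

print(best.x, best.fun)   # best local minimum found across the restarts
```

Each restart finds only the local minimum of its own basin; keeping the best of many starts is a cheap, commonly used defense against spurious local optima.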

Key Takeaway

Match the method to your information. Nelder-Mead for derivative-free, BFGS for gradient-only, Newton-CG for gradient + Hessian, L-BFGS-B for large-scale with bounds. More information per iteration = faster convergence.

Key Takeaway

Always provide analytical gradients when possible. Finite-difference approximations cost $O(n)$ extra function evaluations per step and introduce truncation error. Use check_grad to verify your gradient implementation before running the optimizer.

Why This Matters: Water-Filling in MIMO-OFDM Systems

Water-filling power allocation is the optimal strategy for parallel Gaussian channels, which arise naturally in MIMO-OFDM systems after SVD decomposition of the channel matrix. The water-filling solution from this section directly determines how power is distributed across subcarriers and spatial streams in 5G NR and Wi-Fi 6.

See full treatment in Chapter 12

Historical Note: The Nelder-Mead Simplex Method

1960s

John Nelder and Roger Mead published their simplex method in 1965 in The Computer Journal. Despite having no convergence proof for dimensions $> 1$, it remains one of the most widely used optimization methods in science and engineering, especially for noisy or non-differentiable objectives. SciPy implements the original algorithm with adaptive parameter selection (Gao and Han, 2012).

Historical Note: Four Independent Discoverers of BFGS

1970s

The BFGS update was independently discovered by Broyden, Fletcher, Goldfarb, and Shanno, all in 1970. This was one of the most remarkable cases of simultaneous discovery in numerical analysis. The limited-memory variant (L-BFGS) by Nocedal (1980) made it practical for large-scale problems and is the workhorse behind many machine learning optimizers.

Objective Function

The function $f(\mathbf{x})$ to be minimized (or maximized) in an optimization problem.

Related: Gradient, Hessian

Gradient

The vector of first partial derivatives $\nabla f(\mathbf{x})$, pointing in the direction of steepest ascent.

Related: Hessian, Objective Function

Hessian

The matrix of second partial derivatives $\nabla^2 f(\mathbf{x})$, encoding the curvature of the objective function.

Related: Gradient

Quasi-Newton Method

An optimization method that approximates the Hessian (or its inverse) from gradient differences, avoiding the cost of computing exact second derivatives. BFGS is the most popular example.

Comparison of Unconstrained Optimization Methods in SciPy

| Method | Gradient? | Hessian? | Convergence | Best For |
|---|---|---|---|---|
| Nelder-Mead | No | No | Sublinear | Noisy/non-smooth, n < 20 |
| CG | Yes | No | Superlinear | Medium-scale smooth |
| BFGS | Yes | No | Superlinear | Default for smooth problems |
| L-BFGS-B | Yes | No | Superlinear | Large-scale, box constraints |
| Newton-CG | Yes | Yes (or HVP) | Quadratic | When Hessian is cheap |
| trust-ncg | Yes | Yes | Quadratic | Non-convex with good Hessian |