Unconstrained Optimization

Why Optimization Is Everywhere

Every machine learning model, every signal processing algorithm, every engineering design problem ultimately reduces to an optimization problem: find the parameters that minimize (or maximize) some objective. SciPy provides scipy.optimize.minimize, a unified interface to dozens of battle-tested optimization algorithms, each suited to different problem structures.

In wireless communications, water-filling power allocation, beamforming design, and maximum-likelihood detection are all optimization problems. The tools in this chapter let you solve them in a few lines of Python.

Definition:

Unconstrained Optimization Problem

An unconstrained optimization problem is:

$$\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$$

where $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function. A point $\mathbf{x}^\star$ is a local minimizer if $f(\mathbf{x}^\star) \le f(\mathbf{x})$ for all $\mathbf{x}$ in a neighborhood of $\mathbf{x}^\star$.

Necessary condition (first-order): $\nabla f(\mathbf{x}^\star) = \mathbf{0}$.

Sufficient condition (second-order): $\nabla f(\mathbf{x}^\star) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^\star) \succ 0$ (positive definite).

A global minimizer satisfies $f(\mathbf{x}^\star) \le f(\mathbf{x})$ for all $\mathbf{x}$. For non-convex problems, local and global minimizers may differ.

Definition:

Gradient and Hessian

The gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is the vector of first partial derivatives:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T$$

The Hessian is the $n \times n$ matrix of second partial derivatives:

$$[\nabla^2 f(\mathbf{x})]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

For a twice continuously differentiable function, the Hessian is symmetric. The Hessian encodes the curvature of $f$: its eigenvalues determine whether the function curves up (positive), down (negative), or is flat in each direction.
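As a sanity check, the gradient and Hessian of a small test function can be approximated with central finite differences and compared against hand-derived values. The function and helper routines below are illustrative, not part of SciPy:

```python
import numpy as np

def f(x):
    # Illustrative function: f(x) = x0^2 + 3 x0 x1 + 2 x1^2
    return x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2

def num_grad(func, x, h=1e-6):
    # Central-difference approximation of the gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

def num_hess(func, x, h=1e-5):
    # Central-difference Hessian, built column by column from num_grad
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (num_grad(func, x + e) - num_grad(func, x - e)) / (2 * h)
    return H

x = np.array([1.0, -2.0])
print(num_grad(f, x))  # hand-derived gradient at (1, -2): [-4, -5]
print(num_hess(f, x))  # hand-derived Hessian: [[2, 3], [3, 4]] (symmetric)
```

Note that the numerically computed Hessian comes out symmetric, as the definition above requires.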

Definition:

The scipy.optimize.minimize Interface

from scipy.optimize import minimize

result = minimize(
    fun,          # callable: f(x) -> float
    x0,           # initial guess (1D array)
    method=...,   # 'Nelder-Mead', 'BFGS', 'L-BFGS-B', etc.
    jac=...,      # gradient (callable or True for auto)
    hess=...,     # Hessian (callable, or HessianUpdateStrategy)
    tol=...,      # convergence tolerance
    options=...,  # dict of solver-specific options
)

The returned OptimizeResult object contains:

  • result.x: the optimal point $\mathbf{x}^\star$
  • result.fun: the optimal value $f(\mathbf{x}^\star)$
  • result.jac: gradient at the solution
  • result.nit: number of iterations
  • result.success: whether the optimizer converged
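A minimal end-to-end call illustrates these fields. The objective below is a made-up quadratic bowl with minimizer at $(1, -2)$:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective with known minimizer (1, -2)
f = lambda x: (x[0] - 1)**2 + (x[1] + 2)**2

result = minimize(f, x0=np.zeros(2), method='BFGS')
print(result.x)        # near [1, -2]
print(result.fun)      # near 0
print(result.success)  # True once the gradient norm drops below tolerance
print(result.nit)      # number of iterations used
```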

Definition:

Nelder-Mead (Simplex) Method

Nelder-Mead is a derivative-free method that maintains a simplex of $n+1$ vertices in $\mathbb{R}^n$. At each iteration, it reflects, expands, contracts, or shrinks the simplex based on function values at the vertices.

result = minimize(f, x0, method='Nelder-Mead')

Pros: No gradient needed; robust for noisy or non-smooth objectives.

Cons: Slow convergence ($O(1/\sqrt{k})$); no convergence guarantee in dimensions $> 1$; does not scale to high dimensions ($n > 20$).

Use Nelder-Mead only when gradients are unavailable or the objective is noisy. For smooth problems, gradient-based methods are orders of magnitude faster.

Definition:

BFGS and L-BFGS-B Quasi-Newton Methods

BFGS (Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that builds an approximate inverse Hessian $\mathbf{B}_k^{-1}$ from gradient differences:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{B}_k^{-1} \nabla f(\mathbf{x}_k)$$

where $\alpha_k$ is chosen by a line search. The update uses rank-2 corrections to maintain positive definiteness.

L-BFGS-B (Limited-memory BFGS with Bounds) stores only the last $m$ gradient differences (typically $m = 10$), reducing memory from $O(n^2)$ to $O(mn)$. It also supports simple box constraints $l_i \le x_i \le u_i$.

# BFGS (no bounds)
result = minimize(f, x0, method='BFGS', jac=grad_f)

# L-BFGS-B (with bounds)
bounds = [(0, None)] * n  # x_i >= 0
result = minimize(f, x0, method='L-BFGS-B', jac=grad_f, bounds=bounds)

Definition:

Newton-CG and Conjugate Gradient Methods

Newton's method uses the exact Hessian for quadratic convergence:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - [\nabla^2 f(\mathbf{x}_k)]^{-1} \nabla f(\mathbf{x}_k)$$

Newton-CG avoids forming the full Hessian inverse by solving the Newton system $\nabla^2 f(\mathbf{x}_k)\,\mathbf{d} = -\nabla f(\mathbf{x}_k)$ approximately using the Conjugate Gradient method. This requires only Hessian-vector products, not the full Hessian matrix.

result = minimize(f, x0, method='Newton-CG',
                  jac=grad_f, hess=hess_f)

# Or with Hessian-vector product only
result = minimize(f, x0, method='Newton-CG',
                  jac=grad_f, hessp=hess_vec_product)

CG (Conjugate Gradient): A separate method='CG' uses the nonlinear conjugate gradient method (Fletcher-Reeves / Polak-Ribiere). It requires only gradients, not Hessians.

Theorem: Convergence Rates of Optimization Methods

For a twice-differentiable, strongly convex function $f$ with Lipschitz-continuous Hessian:

| Method | Convergence Rate | Per-Iteration Cost |
|---|---|---|
| Gradient Descent | Linear: $\lVert \mathbf{x}_{k+1} - \mathbf{x}^\star \rVert \le c\, \lVert \mathbf{x}_k - \mathbf{x}^\star \rVert$ | $O(n)$ |
| Conjugate Gradient | Superlinear | $O(n)$ |
| BFGS | Superlinear | $O(n^2)$ |
| Newton | Quadratic: $\lVert \mathbf{x}_{k+1} - \mathbf{x}^\star \rVert \le c\, \lVert \mathbf{x}_k - \mathbf{x}^\star \rVert^2$ | $O(n^3)$ |
| Nelder-Mead | Sublinear (no guarantee) | $O(n)$ |

The condition number $\kappa = \lambda_{\max} / \lambda_{\min}$ of the Hessian determines how fast gradient descent converges: the linear rate constant is $c = (\kappa - 1)/(\kappa + 1)$.

Newton's method fits a local quadratic and jumps to its minimum, giving quadratic convergence. BFGS approximates this behavior without computing the Hessian. Gradient descent uses only first-order information and struggles with ill-conditioned problems.
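The linear rate can be checked empirically. Running fixed-step gradient descent on a 2D quadratic whose Hessian has eigenvalues $1$ and $\kappa$ (a toy setup chosen for illustration) reproduces the contraction factor $c = (\kappa - 1)/(\kappa + 1)$:

```python
import numpy as np

kappa = 10.0
A = np.diag([1.0, kappa])            # Hessian eigenvalues 1 and kappa
alpha = 2.0 / (1.0 + kappa)          # optimal fixed step for this spectrum

x = np.array([1.0, 1.0])             # minimizer of 0.5 x^T A x is the origin
rates = []
for _ in range(50):
    x_new = x - alpha * (A @ x)      # gradient of 0.5 x^T A x is A x
    rates.append(np.linalg.norm(x_new) / np.linalg.norm(x))
    x = x_new

print(max(rates))                    # matches (kappa - 1)/(kappa + 1) = 0.8181...
```

For this step size both coordinates contract by exactly $(\kappa-1)/(\kappa+1)$ per iteration, so the observed per-step error ratio equals the theoretical rate.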

Theorem: Second-Order Sufficient Conditions

If $\nabla f(\mathbf{x}^\star) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}^\star) \succ 0$ (positive definite), then $\mathbf{x}^\star$ is a strict local minimizer of $f$.

Conversely, if $\nabla f(\mathbf{x}^\star) = \mathbf{0}$ but $\nabla^2 f(\mathbf{x}^\star)$ has a negative eigenvalue, then $\mathbf{x}^\star$ is a saddle point, and one can decrease $f$ by moving in the direction of the corresponding eigenvector.
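A small sketch of the saddle-point case, using the standard illustrative example $f(x, y) = x^2 - y^2$ at the origin:

```python
import numpy as np

f = lambda p: p[0]**2 - p[1]**2           # stationary at the origin
H = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of f (constant)

eigvals, eigvecs = np.linalg.eigh(H)
v = eigvecs[:, np.argmin(eigvals)]        # eigenvector of the negative eigenvalue

t = 0.1
print(eigvals)                            # one negative eigenvalue -> saddle point
print(f(t * v) < f(np.zeros(2)))          # True: f decreases along v
```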

Theorem: Water-Filling Power Allocation

Given channel gains $\{h_i\}_{i=1}^{N}$ and a total power budget $P$, the capacity-maximizing power allocation for parallel Gaussian channels is:

$$p_i^\star = \left(\mu - \frac{\sigma^2}{|h_i|^2}\right)^+$$

where $(\cdot)^+ = \max(\cdot, 0)$ and $\mu$ is the water level chosen so that $\sum_i p_i^\star = P$.

This maximizes $\sum_i \log_2(1 + p_i |h_i|^2 / \sigma^2)$ subject to $\sum_i p_i \le P$, $p_i \ge 0$.

Pour water (power) into a vessel whose bottom has heights $\sigma^2/|h_i|^2$. Water fills to a common level $\mu$. Good channels (low bottom) get more power; poor channels (high bottom) may get none.

Example: Minimizing the Rosenbrock Function

Minimize the Rosenbrock function $f(x, y) = (1-x)^2 + 100(y - x^2)^2$ using different methods from scipy.optimize.minimize. Compare iteration counts and final accuracy.
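One possible sketch of this comparison, using SciPy's built-in rosen, rosen_der, and rosen_hess helpers and the classic starting point $(-1.2, 1)$:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([-1.2, 1.0])                # classic hard starting point
for method, kwargs in [('Nelder-Mead', {}),
                       ('BFGS', {'jac': rosen_der}),
                       ('Newton-CG', {'jac': rosen_der, 'hess': rosen_hess})]:
    res = minimize(rosen, x0, method=method, **kwargs)
    print(f"{method:11s} nit={res.nit:3d} f(x*)={res.fun:.2e} x*={res.x}")
```

All three converge to the global minimizer $(1, 1)$; the gradient- and Hessian-based methods reach it with far fewer function evaluations than Nelder-Mead.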

Example: Water-Filling Power Allocation via SciPy

Implement the water-filling algorithm for $N = 8$ parallel channels with gains $|h_i|^2$ drawn from an exponential distribution. Total power budget $P = 10$, noise variance $\sigma^2 = 1$.
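A sketch of this example, finding the water level $\mu$ by bisection; the random seed and the bracket for $\mu$ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)              # seed chosen for reproducibility
N, P, sigma2 = 8, 10.0, 1.0
gains = rng.exponential(scale=1.0, size=N)  # |h_i|^2
floors = sigma2 / gains                     # vessel-bottom heights sigma^2/|h_i|^2

def total_power(mu):
    # Power used when the water level is mu
    return np.sum(np.maximum(mu - floors, 0.0))

lo, hi = 0.0, floors.min() + P              # bracket: total_power(hi) >= P
for _ in range(100):                        # bisection on the water level
    mu = 0.5 * (lo + hi)
    if total_power(mu) > P:
        hi = mu
    else:
        lo = mu

p = np.maximum(mu - floors, 0.0)
print(p)            # per-channel allocations; weak channels may get zero
print(p.sum())      # approximately P
```

Bisection works because total_power(mu) is continuous and nondecreasing in the water level, so the budget constraint pins down $\mu$ uniquely.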

Example: Speeding Up BFGS with Analytical Gradients

Show the speedup from providing an analytical gradient versus relying on finite-difference approximation for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$ with $n = 100$.
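A possible timing sketch for this experiment; the random positive definite $\mathbf{A}$ below is an assumed construction for illustration:

```python
import time
import numpy as np
from scipy.optimize import minimize

n = 100
rng = np.random.default_rng(1)
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite (illustrative)
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b         # analytical gradient of the quadratic

nfevs = {}
for label, jac in [('finite-diff', None), ('analytical', grad)]:
    t0 = time.perf_counter()
    res = minimize(f, np.zeros(n), method='BFGS', jac=jac)
    nfevs[label] = res.nfev
    print(f"{label:11s} nfev={res.nfev:5d} time={time.perf_counter() - t0:.3f} s")
```

Without jac, every gradient evaluation is built from extra calls to f, so the finite-difference run records far more function evaluations for the same solution.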

Optimizer Trajectories on 2D Functions

Visualize the paths taken by different optimization methods (Nelder-Mead, CG, BFGS, Newton-CG) on classic test functions.


Water-Filling Power Allocation

Explore how the water-filling solution changes with total power budget and channel conditions.


Gradient Descent Convergence Animation

Watch gradient descent, BFGS, and Newton's method converge on a 2D quadratic with tunable condition number.


Taxonomy of Optimization Methods

Classification of optimization methods by the information they use. More information (gradients, Hessians) enables faster convergence but requires more computation per iteration.

Unconstrained Optimization Examples

Complete examples of Nelder-Mead, BFGS, L-BFGS-B, CG, and Newton-CG with analytical gradients and Hessians (ch08/python/unconstrained_optimization.py).

Quick Check

Which method should you use first for a smooth, unconstrained problem with available gradients?

Nelder-Mead

BFGS

Newton-CG

Brute force grid search

Quick Check

What information does BFGS use that gradient descent does not?

The Hessian matrix

An approximate inverse Hessian built from gradient history

Higher-order derivatives

Random perturbations

Common Mistake: Not Supplying Gradients to BFGS

Mistake:

Using minimize(f, x0, method='BFGS') without providing jac. SciPy falls back to finite-difference gradients, which requires $n+1$ function evaluations per gradient and introduces approximation error.

Correction:

Always supply analytical gradients when possible: minimize(f, x0, method='BFGS', jac=grad_f). Use jac=True if your function returns (f, grad) as a tuple. Verify gradients with scipy.optimize.check_grad(f, grad_f, x0).
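A quick sketch of this verification, with a deliberately wrong gradient included to show what a failure looks like; the objective is a made-up example:

```python
import numpy as np
from scipy.optimize import check_grad

# Made-up objective with a known gradient
f = lambda x: np.sum(x**4)
grad_f = lambda x: 4 * x**3
bad_grad = lambda x: 3 * x**3      # deliberate bug: wrong coefficient

x0 = np.array([1.0, 2.0, -1.5])
print(check_grad(f, grad_f, x0))   # small (finite-difference noise): consistent
print(check_grad(f, bad_grad, x0)) # order 1 or larger: bug detected
```

check_grad returns the norm of the difference between the supplied gradient and a finite-difference estimate, so "small" means agreement up to truncation error.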

Common Mistake: Trusting Local Optima in Non-Convex Problems

Mistake:

Running minimize once on a non-convex problem and assuming the result is the global minimum.

Correction:

For non-convex problems, use multiple random restarts or global methods like scipy.optimize.differential_evolution or scipy.optimize.basinhopping. Always check result.success and compare objective values from different starting points.
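A minimal multi-start sketch on a made-up non-convex 1D objective (restart count and search interval are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up non-convex objective with several local minima
f = lambda x: np.sin(3 * x[0]) + 0.1 * x[0]**2

rng = np.random.default_rng(42)
best = None
for _ in range(20):                          # multiple random restarts
    x0 = rng.uniform(-5, 5, size=1)
    res = minimize(f, x0, method='BFGS')
    if res.success and (best is None or res.fun < best.fun):
        best = res

print(best.x, best.fun)   # best local minimum found across the restarts
```

Each restart finds only the local minimum of its own basin; keeping the best of many starts is a cheap, commonly used defense against spurious local optima.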

Key Takeaway

Match the method to your information. Nelder-Mead for derivative-free, BFGS for gradient-only, Newton-CG for gradient + Hessian, L-BFGS-B for large-scale with bounds. More information per iteration = faster convergence.

Key Takeaway

Always provide analytical gradients when possible. Finite-difference approximations cost $O(n)$ extra function evaluations per step and introduce truncation error. Use check_grad to verify your gradient implementation before running the optimizer.

Why This Matters: Water-Filling in MIMO-OFDM Systems

Water-filling power allocation is the optimal strategy for parallel Gaussian channels, which arise naturally in MIMO-OFDM systems after SVD decomposition of the channel matrix. The water-filling solution from this section directly determines how power is distributed across subcarriers and spatial streams in 5G NR and Wi-Fi 6.

See full treatment in Chapter 12

Historical Note: The Nelder-Mead Simplex Method

1960s

John Nelder and Roger Mead published their simplex method in 1965 in The Computer Journal. Despite having no convergence proof for dimensions $> 1$, it remains one of the most widely used optimization methods in science and engineering, especially for noisy or non-differentiable objectives. SciPy implements the original algorithm with adaptive parameter selection (Gao and Han, 2012).

Historical Note: Four Independent Discoverers of BFGS

1970s

The BFGS update was independently discovered by Broyden, Fletcher, Goldfarb, and Shanno, all in 1970. This was one of the most remarkable cases of simultaneous discovery in numerical analysis. The limited-memory variant (L-BFGS) by Nocedal (1980) made it practical for large-scale problems and is the workhorse behind many machine learning optimizers.

Objective Function

The function $f(\mathbf{x})$ to be minimized (or maximized) in an optimization problem.

Related: Gradient, Hessian

Gradient

The vector of first partial derivatives $\nabla f(\mathbf{x})$, pointing in the direction of steepest ascent.

Related: Hessian, Objective Function

Hessian

The matrix of second partial derivatives $\nabla^2 f(\mathbf{x})$, encoding the curvature of the objective function.

Related: Gradient

Quasi-Newton Method

An optimization method that approximates the Hessian (or its inverse) from gradient differences, avoiding the cost of computing exact second derivatives. BFGS is the most popular example.

Comparison of Unconstrained Optimization Methods in SciPy

| Method | Gradient? | Hessian? | Convergence | Best For |
|---|---|---|---|---|
| Nelder-Mead | No | No | Sublinear | Noisy/non-smooth, n < 20 |
| CG | Yes | No | Superlinear | Medium-scale smooth |
| BFGS | Yes | No | Superlinear | Default for smooth problems |
| L-BFGS-B | Yes | No | Superlinear | Large-scale, box constraints |
| Newton-CG | Yes | Yes (or HVP) | Quadratic | When Hessian is cheap |
| trust-ncg | Yes | Yes | Quadratic | Non-convex with good Hessian |