Iterative Algorithms
Why Iterative Methods?
While KKT conditions characterise the solution, we still need an algorithm to find it. For small problems, closed-form solutions like water-filling exist. For large-scale wireless problems (massive MIMO precoder design, network utility maximisation, deep learning) we need iterative methods that converge to the optimum.
Definition: Gradient Descent
Gradient Descent
Gradient descent minimises a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ via the iteration
$$x^{k+1} = x^k - \alpha_k \nabla f(x^k),$$
where $\alpha_k > 0$ is the step size (or learning rate). Common choices:
- Fixed step size: $\alpha_k = \alpha$ for all $k$.
- Exact line search: $\alpha_k = \arg\min_{\alpha \ge 0} f\big(x^k - \alpha \nabla f(x^k)\big)$ (see the sketch after this list).
- Backtracking line search: Start with large $\alpha$, shrink by a factor $\beta \in (0, 1)$ until the Armijo condition $f\big(x^k - \alpha \nabla f(x^k)\big) \le f(x^k) - c\,\alpha\,\|\nabla f(x^k)\|^2$ is satisfied.
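For a quadratic objective the exact line search step has a closed form, which makes the rule easy to try out. A minimal sketch (the matrix $Q$, starting point, and iteration count are illustrative, not from the original text):

```python
import numpy as np

# Exact line search on a quadratic f(x) = 0.5 x^T Q x:
# alpha_k = (g^T g) / (g^T Q g), where g = grad f(x^k) = Q x^k.
Q = np.diag([1.0, 10.0])          # illustrative positive-definite Hessian
x = np.array([10.0, 1.0])         # illustrative starting point
for k in range(20):
    g = Q @ x                     # gradient
    alpha = (g @ g) / (g @ Q @ g) # exact line search step
    x = x - alpha * g
print(x)                          # approaches the minimiser [0, 0]
```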
Theorem: Convergence Rate of Gradient Descent
Suppose $f$ is convex and $L$-smooth (i.e., $\nabla f$ is $L$-Lipschitz: $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y$). With fixed step size $\alpha = 1/L$:
$$f(x^k) - f(x^\star) \le \frac{L\,\|x^0 - x^\star\|^2}{2k}.$$
This is an $O(1/k)$ convergence rate. If $f$ is also $\mu$-strongly convex, the rate improves to linear (geometric):
$$\|x^k - x^\star\|^2 \le \Big(1 - \frac{\mu}{L}\Big)^k \|x^0 - x^\star\|^2.$$
The ratio $\kappa = L/\mu \ge 1$ is the condition number: it controls the convergence speed.
A well-conditioned problem ($\kappa \approx 1$) has roughly circular contours and converges fast. An ill-conditioned problem ($\kappa \gg 1$) has elongated contours, causing zig-zagging and slow convergence.
$L$-smoothness bound
By $L$-smoothness: $f(x^{k+1}) \le f(x^k) + \nabla f(x^k)^\top (x^{k+1} - x^k) + \frac{L}{2}\|x^{k+1} - x^k\|^2$. Setting $\alpha = 1/L$, so that $x^{k+1} = x^k - \frac{1}{L}\nabla f(x^k)$: $f(x^{k+1}) \le f(x^k) - \frac{1}{2L}\|\nabla f(x^k)\|^2$.
Telescoping
By the first-order convexity condition: $f(x^k) \le f(x^\star) + \nabla f(x^k)^\top (x^k - x^\star)$. Combining this with the per-step decrease and summing over iterations (via a careful telescoping argument over $k$) gives $f(x^k) - f(x^\star) \le \frac{L\,\|x^0 - x^\star\|^2}{2k}$, i.e., the $O(1/k)$ rate.
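For completeness, the telescoping step can be spelled out as follows (a standard argument; notation as above):
$$
\begin{aligned}
f(x^{k+1}) - f(x^\star)
&\le \nabla f(x^k)^\top (x^k - x^\star) - \frac{1}{2L}\|\nabla f(x^k)\|^2 \\
&= \frac{L}{2}\Big(\|x^k - x^\star\|^2 - \big\|x^k - \tfrac{1}{L}\nabla f(x^k) - x^\star\big\|^2\Big)
 = \frac{L}{2}\Big(\|x^k - x^\star\|^2 - \|x^{k+1} - x^\star\|^2\Big).
\end{aligned}
$$
Summing this inequality over the first $k$ iterations telescopes the right-hand side to at most $\tfrac{L}{2}\|x^0 - x^\star\|^2$; since the objective values are non-increasing, $f(x^k) - f(x^\star)$ is bounded by the average of the $k$ summed gaps, which yields the $\frac{L\|x^0 - x^\star\|^2}{2k}$ bound.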
Gradient Descent with Backtracking Line Search
Complexity: $O(n)$ per iteration (the gradient computation dominates). The Armijo condition in the inner loop ensures sufficient decrease at every accepted step. Typical values: $c = 10^{-4}$, $\beta = 0.5$. A code sketch is given below.
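A minimal sketch of the procedure (the function name, default values, and the quadratic test problem are illustrative, not part of the original listing):

```python
import numpy as np

def gradient_descent_backtracking(f, grad, x0, beta=0.5, c=1e-4,
                                  max_iter=100, tol=1e-8):
    """Gradient descent with backtracking (Armijo) line search.

    `f` and `grad` are callables returning the objective value and its
    gradient; defaults follow the typical values quoted in the text.
    """
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:       # gradient small enough: stop
            break
        alpha = 1.0                       # start with a large step
        # Shrink alpha until the Armijo sufficient-decrease condition holds
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g
    return x

# Usage: minimise an ill-conditioned quadratic f(x) = 0.5 x^T Q x
Q = np.diag([1.0, 50.0])
x_star = gradient_descent_backtracking(lambda x: 0.5 * x @ Q @ x,
                                       lambda x: Q @ x,
                                       x0=np.array([5.0, 1.0]))
print(x_star)    # close to the minimiser [0, 0]
```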
Gradient Descent on a Quadratic
Watch gradient descent converge on a 2D quadratic with adjustable condition number and step size. Observe the zig-zagging behaviour when the condition number is large.
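A static stand-in for the interactive demo: fixed-step gradient descent on $f(x) = \tfrac{1}{2}x^\top \mathrm{diag}(1, \kappa)\,x$ for two condition numbers (the starting point and iteration count are assumed for illustration):

```python
import numpy as np

def gd_quadratic(kappa, alpha, iters=50):
    """Fixed-step GD on f(x) = 0.5 * (x1^2 + kappa * x2^2).

    kappa is the condition number L/mu (here mu = 1, L = kappa).
    Returns the objective value after `iters` iterations.
    """
    Q = np.diag([1.0, kappa])
    x = np.array([kappa, 1.0])          # start away from the optimum
    for _ in range(iters):
        x = x - alpha * (Q @ x)         # gradient of the quadratic is Q x
    return 0.5 * x @ Q @ x

for kappa in (1.0, 50.0):
    print(kappa, gd_quadratic(kappa, alpha=1.0 / kappa))   # alpha = 1/L
# kappa = 1 reaches the optimum almost immediately; kappa = 50 is still far
# from optimal after 50 iterations, illustrating the (1 - mu/L)^k slowdown.
```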
Definition: Projected Gradient Descent
Projected Gradient Descent
For constrained problems $\min_{x \in C} f(x)$, where $C$ is a closed convex set, projected gradient descent adds a projection step:
$$x^{k+1} = \Pi_C\big(x^k - \alpha_k \nabla f(x^k)\big),$$
where $\Pi_C(y) = \arg\min_{x \in C} \|x - y\|_2$ is the Euclidean projection onto $C$.
For many constraint sets (box, simplex, ball, PSD cone), the projection has a closed-form solution. For the simplex (power allocation with $\sum_i p_i = P$, $p_i \ge 0$), projection involves sorting and thresholding.
Example: Common Projections
Give closed-form projections for the following sets:
- Box $C = \{x : l \le x \le u\}$ (element-wise bounds)
- $\ell_2$ ball $C = \{x : \|x\|_2 \le r\}$
- Probability simplex $C = \{x : \sum_i x_i = 1,\ x_i \ge 0\}$
Box projection
$[\Pi_C(y)]_i = \min\big(\max(y_i, l_i),\, u_i\big)$ (clip each coordinate independently).
$\ell_2$ ball projection
$\Pi_C(y) = y$ if $\|y\|_2 \le r$, otherwise $\Pi_C(y) = r\, y / \|y\|_2$ (scale to the boundary if outside).
Simplex projection
Sort the components in decreasing order, $y_{(1)} \ge y_{(2)} \ge \cdots \ge y_{(n)}$. Find $\rho = \max\big\{j : y_{(j)} + \tfrac{1}{j}\big(1 - \sum_{i=1}^{j} y_{(i)}\big) > 0\big\}$. Set $\tau = \tfrac{1}{\rho}\big(\sum_{i=1}^{\rho} y_{(i)} - 1\big)$. Then $[\Pi_C(y)]_i = \max(y_i - \tau, 0)$. Complexity: $O(n \log n)$ due to the sorting step.
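The three projections above are short enough to implement directly. A minimal sketch (function names are illustrative); for a power budget $P$ rather than the unit simplex, apply $P \cdot \Pi_{\Delta}(y/P)$:

```python
import numpy as np

def project_box(y, lo, hi):
    """Clip each coordinate to the box [lo, hi]."""
    return np.clip(y, lo, hi)

def project_l2_ball(y, r):
    """Scale to the boundary if the point lies outside the radius-r ball."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def project_simplex(y):
    """Euclidean projection onto the probability simplex (sort + threshold)."""
    u = np.sort(y)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    cond = u + (1.0 - css) / np.arange(1, len(y) + 1) > 0
    rho = np.nonzero(cond)[0][-1]              # last index satisfying the test
    tau = (css[rho] - 1.0) / (rho + 1)         # threshold
    return np.maximum(y - tau, 0.0)

# Example: project an arbitrary vector onto the unit simplex
p = project_simplex(np.array([0.9, 0.6, -0.2]))
print(p, p.sum())    # non-negative entries summing to 1
```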
Definition: Proximal Operator
Proximal Operator
For a (possibly non-smooth) convex function $g$, the proximal operator with parameter $\lambda > 0$ is
$$\operatorname{prox}_{\lambda g}(y) = \arg\min_{x}\Big( g(x) + \frac{1}{2\lambda}\|x - y\|_2^2 \Big).$$
The proximal operator generalises projection: when $g$ is the indicator function of a convex set $C$, $\operatorname{prox}_{\lambda g} = \Pi_C$.
Definition: Proximal Gradient Descent
Proximal Gradient Descent
For composite problems $\min_x\, f(x) + g(x)$, where $f$ is smooth and $g$ is convex (possibly non-smooth), proximal gradient descent alternates a gradient step on $f$ with a proximal step on $g$:
$$x^{k+1} = \operatorname{prox}_{\alpha_k g}\big(x^k - \alpha_k \nabla f(x^k)\big).$$
When $g(x) = \lambda\|x\|_1$ (LASSO regularisation), the proximal operator is the soft-thresholding operator, $\big[\operatorname{prox}_{\alpha\lambda\|\cdot\|_1}(y)\big]_i = \operatorname{sign}(y_i)\max(|y_i| - \alpha\lambda,\, 0)$.
In wireless, $\ell_1$ regularisation arises in sparse channel estimation and compressed sensing for mmWave channel recovery.
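A minimal sketch of proximal gradient descent for the LASSO objective $\tfrac{1}{2}\|Ax - b\|^2 + \lambda\|x\|_1$ (this is the classical ISTA iteration; the sparse-recovery test problem and all parameter values are illustrative):

```python
import numpy as np

def soft_threshold(y, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def ista(A, b, lam, alpha=None, iters=200):
    """Proximal gradient for min_x 0.5 ||Ax - b||^2 + lam * ||x||_1."""
    if alpha is None:
        alpha = 1.0 / np.linalg.norm(A, 2) ** 2    # 1/L with L = sigma_max(A)^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                   # gradient of the smooth part
        x = soft_threshold(x - alpha * grad, alpha * lam)
    return x

# Recover a sparse vector from a few noisy linear measurements
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 42]] = [1.0, -0.5, 2.0]
b = A @ x_true + 0.01 * rng.standard_normal(40)
x_hat = ista(A, b, lam=0.1)
print(np.flatnonzero(np.abs(x_hat) > 0.1))         # should recover the support
```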
Definition: Alternating Optimization and Block Coordinate Descent
Alternating Optimization and Block Coordinate Descent
When optimising over multiple variable blocks $x = (x_1, \ldots, x_m)$, alternating optimisation (or block coordinate descent) cycles through
$$x_i^{k+1} = \arg\min_{x_i} f\big(x_1^{k+1}, \ldots, x_{i-1}^{k+1},\, x_i,\, x_{i+1}^{k}, \ldots, x_m^{k}\big)$$
for $i = 1, \ldots, m$ in each iteration.
For convex problems, this converges to the global optimum under mild regularity conditions (e.g., when each block minimiser is unique). For non-convex problems, it can only be guaranteed to converge to a stationary point.
The weighted MMSE (WMMSE) algorithm for MIMO interference networks is a celebrated example: it alternates between receiver, weight, and precoder updates, each of which has a closed form.
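The WMMSE updates themselves are too long for a short snippet, but the alternating pattern is easy to illustrate on a toy bi-convex problem, rank-1 matrix factorisation, where each block update is a least-squares problem with a closed form (all values illustrative, not the WMMSE algorithm itself):

```python
import numpy as np

# Alternating optimisation for min_{u, v} ||M - u v^T||_F^2:
# with v fixed, the optimal u is M v / (v^T v); with u fixed, v is M^T u / (u^T u).
rng = np.random.default_rng(1)
M = np.outer(rng.standard_normal(6), rng.standard_normal(4))   # rank-1 target
u = rng.standard_normal(6)
v = rng.standard_normal(4)

for _ in range(20):
    u = M @ v / (v @ v)       # block 1: minimise over u with v fixed
    v = M.T @ u / (u @ u)     # block 2: minimise over v with u fixed

print(np.linalg.norm(M - np.outer(u, v)))   # residual, close to 0
```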
Gradient Descent Animation
Watch gradient descent trace a path on a 2D objective surface. Compare the behaviour on well-conditioned vs. ill-conditioned quadratics.
Gradient Descent: Well-Conditioned vs. Ill-Conditioned
Convergence Criteria in Real-Time Systems
In wireless system design, iterative algorithms must converge within strict latency budgets:
- 5G NR slot duration: 0.5 ms (at 30 kHz SCS) to 0.125 ms (at 120 kHz SCS). A precoder must be computed within a fraction of a slot. This limits WMMSE-type algorithms to 3–5 iterations.
- Stopping criteria: In practice, algorithms stop at a fixed iteration count rather than a convergence tolerance. Typical choices: 3–10 iterations for beamforming, 1–3 for power control.
- Warm starting: Initialising from the previous slot's solution exploits temporal correlation and often gives near-optimal performance in 1–2 iterations.
- Hardware constraints: FPGA and ASIC implementations require fixed-point arithmetic. Gradient descent with 16-bit fixed-point can lose 2–3 dB compared to floating-point at high SNR.
Common Mistake: Step Size Too Large Causes Divergence
Mistake:
Choosing a step size $\alpha > 2/L$ (where $L$ is the Lipschitz constant of the gradient), thinking bigger steps mean faster convergence.
Correction:
For an $L$-smooth function, step sizes above $2/L$ cause gradient descent to diverge (exactly so for quadratics, where the iterates grow without bound). The "safe" choice is $\alpha = 1/L$, matching the convergence theorem above. Backtracking line search automatically adapts the step size. A numerical check of the threshold follows.
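A quick numerical check on a one-dimensional quadratic (the value of $L$ and the step sizes tried are assumed for illustration):

```python
# On f(x) = 0.5 * L * x^2 the gradient is L*x, so the GD map is
# x <- (1 - alpha*L) * x. For alpha > 2/L the factor |1 - alpha*L| exceeds 1
# and the iterates blow up.
L = 10.0
for alpha in (1.0 / L, 1.9 / L, 2.1 / L):
    x = 1.0
    for _ in range(50):
        x -= alpha * L * x
    print(f"alpha = {alpha:.2f}: |x_50| = {abs(x):.3e}")
# alpha = 1/L and 1.9/L shrink towards 0; alpha = 2.1/L diverges.
```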
Comparison of First-Order Methods
| Method | Per-Iteration Cost | Convergence (convex) | Convergence (strongly convex) |
|---|---|---|---|
| Gradient descent | one gradient evaluation | $O(1/k)$ | linear, $O\big((1 - \mu/L)^k\big)$ |
| Projected GD | gradient + projection onto $C$ | $O(1/k)$ | linear, $O\big((1 - \mu/L)^k\big)$ |
| Proximal GD | gradient + proximal operator of $g$ | $O(1/k)$ | linear, $O\big((1 - \mu/L)^k\big)$ |
| Nesterov acceleration | gradient + momentum step | $O(1/k^2)$ | linear, $O\big((1 - \sqrt{\mu/L})^k\big)$ |
Why This Matters: Iterative Algorithms in MIMO Precoding
The WMMSE (Weighted Minimum Mean Square Error) algorithm by Shi et al. (2011) is one of the most widely used iterative methods in wireless. It solves the non-convex sum-rate maximisation problem for MIMO interference channels by reformulating it as a block-coordinate descent over auxiliary variables. Each sub-problem is a convex QP with a closed-form solution.
See full treatment in Limited Feedback
Quick Check
Gradient descent on an $L$-smooth, $\mu$-strongly convex function with step size $\alpha = 1/L$ converges at what rate?
$O(1/k)$ (sublinear)
$O(1/k^2)$ (accelerated sublinear)
Linear: $O\big((1 - \mu/L)^k\big)$
Quadratic convergence
Correct. Strong convexity with parameter $\mu$ and smoothness $L$ give linear (geometric) convergence. The condition number $\kappa = L/\mu$ controls the rate.
Step Size (Learning Rate)
The parameter $\alpha$ controlling the magnitude of each gradient descent update: $x^{k+1} = x^k - \alpha \nabla f(x^k)$.
Related: Gradient Descent, Convergence Rate, Lipschitz Constant
Condition Number
For an $L$-smooth and $\mu$-strongly convex function, $\kappa = L/\mu$. It measures the "eccentricity" of the function's contours. Larger $\kappa$ means slower convergence for first-order methods.
Related: Gradient Descent, Convergence Rate