Iterative Algorithms

Why Iterative Methods?

While KKT conditions characterise the solution, we still need an algorithm to find it. For small problems, closed-form solutions like water-filling exist. For large-scale wireless problems (massive MIMO precoder design, network utility maximisation, deep learning), we need iterative methods that converge to the optimum.

Definition: Gradient Descent

Gradient descent minimises a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ via the iteration

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k),$$

where $\alpha_k > 0$ is the step size (or learning rate). Common choices:

  • Fixed step size: $\alpha_k = \alpha$ for all $k$.
  • Exact line search: $\alpha_k = \arg\min_{\alpha > 0} f(\mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k))$.
  • Backtracking line search: start with a large $\alpha$ and shrink by a factor $\beta \in (0,1)$ until the Armijo condition is satisfied.

Theorem: Convergence Rate of Gradient Descent

Suppose $f$ is convex and $L$-smooth (i.e., $\nabla f$ is $L$-Lipschitz: $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\|$). With fixed step size $\alpha = 1/L$:

$$f(\mathbf{x}_k) - f(\mathbf{x}^\star) \leq \frac{L \|\mathbf{x}_0 - \mathbf{x}^\star\|^2}{2k}.$$

This is an $O(1/k)$ convergence rate. If $f$ is also $\mu$-strongly convex, the rate improves to linear (geometric):

$$f(\mathbf{x}_k) - f(\mathbf{x}^\star) \leq \left(1 - \frac{\mu}{L}\right)^k \left(f(\mathbf{x}_0) - f(\mathbf{x}^\star)\right).$$

The ratio $\kappa = L/\mu$ is the condition number; it controls the convergence speed.

A well-conditioned problem ($\kappa \approx 1$) has roughly circular contours and converges fast. An ill-conditioned problem ($\kappa \gg 1$) has elongated contours, causing zig-zagging and slow convergence.
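To see the effect of conditioning numerically, here is a minimal NumPy sketch that runs fixed-step gradient descent with $\alpha = 1/L$ on the diagonal test quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x}$ with $\mathbf{Q} = \text{diag}(1, \kappa)$, whose condition number is exactly $\kappa$ (the starting point and iteration count are illustrative choices):

```python
import numpy as np

def gd_on_quadratic(kappa, num_iters=30):
    """Fixed-step gradient descent on f(x) = 0.5 * x^T Q x with Q = diag(1, kappa)."""
    Q = np.diag([1.0, kappa])
    L = kappa                      # largest eigenvalue = Lipschitz constant of the gradient
    x = np.array([1.0, 1.0])       # x_0
    alpha = 1.0 / L                # the "safe" step size
    for _ in range(num_iters):
        x = x - alpha * (Q @ x)    # grad f(x) = Q x
    return 0.5 * x @ Q @ x         # f(x_k); the optimum value is 0

for kappa in (2.0, 20.0):
    print(f"kappa = {kappa:4.0f}:  f(x_30) = {gd_on_quadratic(kappa):.3e}")
# After the same number of iterations, the ill-conditioned case (kappa = 20)
# is orders of magnitude further from the optimum.
```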


Gradient Descent with Backtracking Line Search

Complexity: $O(n)$ per iteration (the gradient computation dominates)
Input: $f$, $\nabla f$, initial point $\mathbf{x}_0$, tolerance $\varepsilon$
Parameters: $\alpha_0 > 0$, $\beta \in (0,1)$, $c \in (0,1)$
Output: Approximate minimiser $\mathbf{x}^\star$
1. for $k = 0, 1, 2, \ldots$ do
2.   $\mathbf{g} \leftarrow \nabla f(\mathbf{x}_k)$
3.   if $\|\mathbf{g}\| < \varepsilon$ then return $\mathbf{x}_k$
4.   $\alpha \leftarrow \alpha_0$
5.   while $f(\mathbf{x}_k - \alpha \mathbf{g}) > f(\mathbf{x}_k) - c\,\alpha \|\mathbf{g}\|^2$ do
6.     $\alpha \leftarrow \beta \alpha$
7.   end while
8.   $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k - \alpha \mathbf{g}$
9. end for

The Armijo condition in line 5 ensures sufficient decrease. Typical values: $c = 10^{-4}$, $\beta = 0.5$.
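The pseudocode translates almost line for line into NumPy. The following is a minimal sketch; the quadratic test problem and default parameter values are illustrative choices, not part of the algorithm itself:

```python
import numpy as np

def gradient_descent_backtracking(f, grad_f, x0, eps=1e-6,
                                  alpha0=1.0, beta=0.5, c=1e-4,
                                  max_iters=1000):
    """Gradient descent with Armijo backtracking, following the pseudocode above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:               # stopping criterion (line 3)
            break
        alpha = alpha0
        # Armijo condition: require sufficient decrease along -g (lines 5-7)
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta                          # shrink the step
        x = x - alpha * g                          # gradient update (line 8)
    return x

# Illustrative test: an ill-conditioned quadratic with minimiser at the origin
Q = np.diag([1.0, 50.0])
x_star = gradient_descent_backtracking(lambda x: 0.5 * x @ Q @ x,
                                       lambda x: Q @ x,
                                       x0=[1.0, 1.0])
print(x_star)   # close to the origin
```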


Gradient Descent on a Quadratic

Watch gradient descent converge on a 2D quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x}$ with adjustable condition number and step size. Observe the zig-zagging behaviour when the condition number is large.


Definition: Projected Gradient Descent

For constrained problems $\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$, where $\mathcal{C}$ is a closed convex set, projected gradient descent adds a projection step:

$$\mathbf{x}_{k+1} = \Pi_{\mathcal{C}}\bigl(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)\bigr),$$

where $\Pi_{\mathcal{C}}(\mathbf{y}) = \arg\min_{\mathbf{x} \in \mathcal{C}} \|\mathbf{x} - \mathbf{y}\|$ is the Euclidean projection onto $\mathcal{C}$.

For many constraint sets (box, simplex, $\ell_2$ ball, PSD cone), the projection has a closed-form solution. For the simplex (power allocation with $p_i \geq 0$, $\sum_i p_i = P$), the projection involves sorting and thresholding.

Example: Common Projections

Give closed-form projections for the following sets (a short code sketch follows the list):

  1. Box $[a, b]^n$
  2. $\ell_2$ ball $\{\mathbf{x} : \|\mathbf{x}\| \leq r\}$
  3. Probability simplex $\{\mathbf{x} : x_i \geq 0,\; \sum_i x_i = 1\}$
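A minimal NumPy sketch of the three projections; the simplex routine uses the standard sort-and-threshold scheme, and the test vector is an arbitrary illustration:

```python
import numpy as np

def project_box(y, a, b):
    """Box [a, b]^n: clip each coordinate."""
    return np.clip(y, a, b)

def project_l2_ball(y, r):
    """l2 ball of radius r: rescale if outside, leave unchanged otherwise."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def project_simplex(y):
    """Probability simplex: sort, find the threshold tau, take the positive part."""
    u = np.sort(y)[::-1]                      # sorted in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(y - tau, 0.0)

print(project_simplex(np.array([0.6, 1.2, -0.3])))  # nonnegative, sums to 1
```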

Definition: Proximal Operator

For a (possibly non-smooth) convex function $g$, the proximal operator with parameter $\alpha > 0$ is

$$\text{prox}_{\alpha g}(\mathbf{v}) = \arg\min_{\mathbf{x}} \left(g(\mathbf{x}) + \frac{1}{2\alpha}\|\mathbf{x} - \mathbf{v}\|^2\right).$$

The proximal operator generalises projection: when $g$ is the indicator function of a convex set $\mathcal{C}$, $\text{prox}_{\alpha g} = \Pi_{\mathcal{C}}$.

Definition: Proximal Gradient Descent

For composite problems $\min_{\mathbf{x}}\; f(\mathbf{x}) + g(\mathbf{x})$, where $f$ is smooth and $g$ is convex (possibly non-smooth), proximal gradient descent alternates a gradient step on $f$ with a proximal step on $g$:

$$\mathbf{x}_{k+1} = \text{prox}_{\alpha_k g}\bigl(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)\bigr).$$

When $g(\mathbf{x}) = \lambda \|\mathbf{x}\|_1$ (LASSO regularisation), the proximal operator is the soft-thresholding operator.

In wireless, $\ell_1$ regularisation arises in sparse channel estimation and compressed sensing for mmWave channel recovery.
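As an illustration, here is a minimal sketch of proximal gradient descent (ISTA) for the LASSO problem $\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^2 + \lambda\|\mathbf{x}\|_1$; the problem dimensions and the sparse test signal are hypothetical:

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, y, lam, num_iters=200):
    """ISTA for 0.5 * ||A x - y||^2 + lam * ||x||_1 with step size 1/L."""
    L = np.linalg.norm(A, 2) ** 2                 # L = largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)                  # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L) # proximal (shrinkage) step
    return x

# Toy sparse-recovery example (hypothetical sizes and support)
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 64]] = [1.0, -2.0, 0.5]
x_hat = proximal_gradient_lasso(A, A @ x_true, lam=0.1)
print(np.flatnonzero(np.abs(x_hat) > 0.1))        # should recover the support {3, 17, 64}
```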

Definition: Alternating Optimisation and Block Coordinate Descent

When optimising over multiple variable blocks $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_B)$, alternating optimisation (or block coordinate descent) cycles through

$$\mathbf{x}_b^{(k+1)} = \arg\min_{\mathbf{x}_b} f\bigl(\mathbf{x}_1^{(k+1)}, \ldots, \mathbf{x}_{b-1}^{(k+1)}, \mathbf{x}_b, \mathbf{x}_{b+1}^{(k)}, \ldots, \mathbf{x}_B^{(k)}\bigr)$$

for $b = 1, 2, \ldots, B$ in each iteration.

For convex problems, this converges to the global optimum under mild regularity conditions (e.g., uniquely attained block minimisers). For non-convex problems, it in general converges only to a stationary point.

The weighted MMSE (WMMSE) algorithm for MIMO interference networks is a celebrated example: it alternates between receiver, weight, and precoder updates, each of which has a closed form.
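WMMSE itself is too involved for a short snippet, but the alternating structure is easy to illustrate on a two-block least-squares problem, where each block update has a closed form (here obtained with a library least-squares solve); the problem data below are random and purely illustrative:

```python
import numpy as np

def alternating_least_squares(A, B, y, num_iters=50):
    """Two-block coordinate descent on f(u, v) = ||A u + B v - y||^2.
    Each block sub-problem is a linear least-squares with a closed-form solution."""
    u = np.zeros(A.shape[1])
    v = np.zeros(B.shape[1])
    for _ in range(num_iters):
        u = np.linalg.lstsq(A, y - B @ v, rcond=None)[0]   # update block 1, block 2 fixed
        v = np.linalg.lstsq(B, y - A @ u, rcond=None)[0]   # update block 2, block 1 fixed
    return u, v

rng = np.random.default_rng(1)
A, B = rng.standard_normal((30, 4)), rng.standard_normal((30, 3))
y = rng.standard_normal(30)
u, v = alternating_least_squares(A, B, y)
print(np.linalg.norm(A @ u + B @ v - y))   # residual of the (convex) joint fit
```

Because this joint problem is convex, the alternating updates reach the same residual as solving for both blocks simultaneously.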

Gradient Descent Animation

Watch gradient descent trace a path on a 2D objective surface. Compare the behaviour on well-conditioned vs. ill-conditioned quadratics.


Gradient Descent: Well-Conditioned vs. Ill-Conditioned

Side-by-side comparison of gradient descent on a well-conditioned quadratic ($\kappa = 2$) and an ill-conditioned one ($\kappa = 20$). Observe the zig-zagging behaviour when the condition number is large.
Left: $\kappa = 2$ converges smoothly. Right: $\kappa = 20$ zig-zags along the narrow valley, taking many more iterations.
⚠️ Engineering Note

Convergence Criteria in Real-Time Systems

In wireless system design, iterative algorithms must converge within strict latency budgets:

  • 5G NR slot duration: 0.5 ms (at 30 kHz SCS) to 0.125 ms (at 120 kHz SCS). A precoder must be computed within a fraction of a slot. This limits WMMSE-type algorithms to 3–5 iterations.
  • Stopping criteria: In practice, algorithms stop at a fixed iteration count rather than a convergence tolerance. Typical choices: 3–10 iterations for beamforming, 1–3 for power control.
  • Warm starting: Initialising from the previous slot's solution exploits temporal correlation and often gives near-optimal performance in 1–2 iterations.
  • Hardware constraints: FPGA and ASIC implementations require fixed-point arithmetic. Gradient descent with 16-bit fixed-point can lose 2–3 dB compared to floating-point at high SNR.

Common Mistake: Step Size Too Large Causes Divergence

Mistake:

Choosing a step size $\alpha > 2/L$ (where $L$ is the Lipschitz constant of the gradient), thinking bigger steps mean faster convergence.

Correction:

For an $L$-smooth function, convergence is only guaranteed for step sizes below $2/L$; beyond that threshold gradient descent can diverge, and does diverge on a quadratic whose curvature equals $L$. The "safe" choice is $\alpha = 1/L$. Backtracking line search automatically adapts the step size.
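A one-dimensional quadratic makes the threshold visible. This sketch (with an arbitrary choice $L = 4$) contrasts $\alpha = 1/L$ with $\alpha = 2.5/L$:

```python
# f(x) = 0.5 * L * x^2, so grad f(x) = L * x and the gradient is L-Lipschitz.
L = 4.0

def run(alpha, num_iters=20):
    x = 1.0
    for _ in range(num_iters):
        x = x - alpha * L * x      # each step multiplies x by (1 - alpha * L)
    return x

print(run(alpha=1.0 / L))   # |1 - alpha*L| = 0: jumps straight to the optimum
print(run(alpha=2.5 / L))   # |1 - alpha*L| = 1.5 > 1: the iterates blow up
```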

Comparison of First-Order Methods

| Method | Per-Iteration Cost | Convergence (convex) | Convergence (strongly convex) |
|---|---|---|---|
| Gradient descent | $O(n)$ | $O(1/k)$ | $O((1 - \mu/L)^k)$ |
| Projected GD | $O(n + T_{\text{proj}})$ | $O(1/k)$ | $O((1 - \mu/L)^k)$ |
| Proximal GD | $O(n + T_{\text{prox}})$ | $O(1/k)$ | $O((1 - \mu/L)^k)$ |
| Nesterov acceleration | $O(n)$ | $O(1/k^2)$ | $O((1 - \sqrt{\mu/L})^k)$ |
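For completeness, here is a minimal sketch of Nesterov's accelerated gradient with the standard $(k-1)/(k+2)$ momentum weights and step size $1/L$, run on an illustrative ill-conditioned quadratic:

```python
import numpy as np

def nesterov_gd(grad_f, L, x0, num_iters=100):
    """Nesterov's accelerated gradient for a convex L-smooth f:
    gradient step taken at an extrapolated point, giving the O(1/k^2) rate."""
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for k in range(1, num_iters + 1):
        momentum = (k - 1) / (k + 2)
        y = x + momentum * (x - x_prev)   # extrapolation (momentum) step
        x_prev = x
        x = y - grad_f(y) / L             # gradient step with alpha = 1/L
    return x

Q = np.diag([1.0, 100.0])                 # condition number kappa = 100
x = nesterov_gd(lambda z: Q @ z, L=100.0, x0=[1.0, 1.0])
print(0.5 * x @ Q @ x)                    # objective gap; much smaller than plain GD's
```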

Why This Matters: Iterative Algorithms in MIMO Precoding

The WMMSE (Weighted Minimum Mean Square Error) algorithm by Shi et al. (2011) is one of the most widely used iterative methods in wireless. It solves the non-convex sum-rate maximisation problem for MIMO interference channels by reformulating it as a block-coordinate descent over auxiliary variables. Each sub-problem is a convex QP with a closed-form solution.

See full treatment in Limited Feedback

Quick Check

Gradient descent on an $L$-smooth, $\mu$-strongly convex function with step size $\alpha = 1/L$ converges at what rate?

$O(1/k)$ (sublinear)

$O(1/k^2)$ (accelerated sublinear)

Linear: $O((1 - \mu/L)^k)$

Quadratic convergence

Step Size (Learning Rate)

The parameter $\alpha_k > 0$ controlling the magnitude of each gradient descent update: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)$.

Related: Gradient Descent, Convergence Rate, Lipschitz Constant

Condition Number

For an $L$-smooth and $\mu$-strongly convex function, $\kappa = L/\mu \geq 1$. Measures the "eccentricity" of the function's contours. A larger $\kappa$ means slower convergence for first-order methods.

Related: Gradient Descent, Convergence Rate