Iterative Algorithms

Why Iterative Methods?

While KKT conditions characterise the solution, we still need an algorithm to find it. For small problems, closed-form solutions like water-filling exist. For large-scale wireless problems (massive MIMO precoder design, network utility maximisation, deep learning), we need iterative methods that converge to the optimum.

Definition: Gradient Descent

Gradient descent minimises a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ via the iteration

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k),$$

where $\alpha_k > 0$ is the step size (or learning rate). Common choices:

  • Fixed step size: $\alpha_k = \alpha$ for all $k$.
  • Exact line search: $\alpha_k = \arg\min_{\alpha > 0} f(\mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k))$.
  • Backtracking line search: start with a large $\alpha$ and shrink by a factor $\beta \in (0,1)$ until the Armijo condition is satisfied.

Theorem: Convergence Rate of Gradient Descent

Suppose $f$ is convex and $L$-smooth (i.e., $\nabla f$ is $L$-Lipschitz: $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\|$). With fixed step size $\alpha = 1/L$:

$$f(\mathbf{x}_k) - f(\mathbf{x}^\star) \leq \frac{L \|\mathbf{x}_0 - \mathbf{x}^\star\|^2}{2k}.$$

This is an $O(1/k)$ convergence rate. If $f$ is also $\mu$-strongly convex, the rate improves to linear (geometric):

$$f(\mathbf{x}_k) - f(\mathbf{x}^\star) \leq \left(1 - \frac{\mu}{L}\right)^k \left(f(\mathbf{x}_0) - f(\mathbf{x}^\star)\right).$$

The ratio $\kappa = L/\mu$ is the condition number; it controls the convergence speed.

A well-conditioned problem ($\kappa \approx 1$) has roughly circular contours and converges fast. An ill-conditioned problem ($\kappa \gg 1$) has elongated contours, causing zig-zagging and slow convergence.
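To see the effect of conditioning numerically, here is a minimal NumPy sketch that runs fixed-step gradient descent with $\alpha = 1/L$ on the diagonal test quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x}$ with $\mathbf{Q} = \text{diag}(1, \kappa)$, whose condition number is exactly $\kappa$ (the starting point and iteration count are illustrative choices):

```python
import numpy as np

def gd_on_quadratic(kappa, num_iters=30):
    """Fixed-step gradient descent on f(x) = 0.5 * x^T Q x with Q = diag(1, kappa)."""
    Q = np.diag([1.0, kappa])
    L = kappa                      # largest eigenvalue = Lipschitz constant of the gradient
    x = np.array([1.0, 1.0])       # x_0
    alpha = 1.0 / L                # the "safe" step size
    for _ in range(num_iters):
        x = x - alpha * (Q @ x)    # grad f(x) = Q x
    return 0.5 * x @ Q @ x         # f(x_k); the optimum value is 0

for kappa in (2.0, 20.0):
    print(f"kappa = {kappa:4.0f}:  f(x_30) = {gd_on_quadratic(kappa):.3e}")
# After the same number of iterations, the ill-conditioned case (kappa = 20)
# is orders of magnitude further from the optimum.
```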


Gradient Descent with Backtracking Line Search

Complexity: $O(n)$ per iteration (the gradient computation dominates)
Input: $f$, $\nabla f$, initial point $\mathbf{x}_0$, tolerance $\varepsilon$
Parameters: $\alpha_0 > 0$, $\beta \in (0,1)$, $c \in (0,1)$
Output: Approximate minimiser $\mathbf{x}^\star$
1. for $k = 0, 1, 2, \ldots$ do
2.   $\mathbf{g} \leftarrow \nabla f(\mathbf{x}_k)$
3.   if $\|\mathbf{g}\| < \varepsilon$ then return $\mathbf{x}_k$
4.   $\alpha \leftarrow \alpha_0$
5.   while $f(\mathbf{x}_k - \alpha \mathbf{g}) > f(\mathbf{x}_k) - c\,\alpha \|\mathbf{g}\|^2$ do
6.     $\alpha \leftarrow \beta \alpha$
7.   end while
8.   $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k - \alpha \mathbf{g}$
9. end for

The Armijo condition in line 5 ensures sufficient decrease. Typical values: $c = 10^{-4}$, $\beta = 0.5$.
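The pseudocode translates almost line for line into NumPy. The following is a minimal sketch; the quadratic test problem and default parameter values are illustrative choices, not part of the algorithm itself:

```python
import numpy as np

def gradient_descent_backtracking(f, grad_f, x0, eps=1e-6,
                                  alpha0=1.0, beta=0.5, c=1e-4,
                                  max_iters=1000):
    """Gradient descent with Armijo backtracking, following the pseudocode above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:               # stopping criterion (line 3)
            break
        alpha = alpha0
        # Armijo condition: require sufficient decrease along -g (lines 5-7)
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta                          # shrink the step
        x = x - alpha * g                          # gradient update (line 8)
    return x

# Illustrative test: an ill-conditioned quadratic with minimiser at the origin
Q = np.diag([1.0, 50.0])
x_star = gradient_descent_backtracking(lambda x: 0.5 * x @ Q @ x,
                                       lambda x: Q @ x,
                                       x0=[1.0, 1.0])
print(x_star)   # close to the origin
```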


Gradient Descent on a Quadratic

Watch gradient descent converge on a 2D quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{Q} \mathbf{x}$ with adjustable condition number and step size. Observe the zig-zagging behaviour when the condition number is large.


Definition: Projected Gradient Descent

For constrained problems $\min_{\mathbf{x} \in \mathcal{C}} f(\mathbf{x})$, where $\mathcal{C}$ is a closed convex set, projected gradient descent adds a projection step:

$$\mathbf{x}_{k+1} = \Pi_{\mathcal{C}}\bigl(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)\bigr),$$

where $\Pi_{\mathcal{C}}(\mathbf{y}) = \arg\min_{\mathbf{x} \in \mathcal{C}} \|\mathbf{x} - \mathbf{y}\|$ is the Euclidean projection onto $\mathcal{C}$.

For many constraint sets (box, simplex, $\ell_2$ ball, PSD cone), the projection has a closed-form solution. For the simplex (power allocation with $p_i \geq 0$, $\sum_i p_i = P$), the projection involves sorting and thresholding.

Example: Common Projections

Give closed-form projections for the following sets (a short code sketch follows the list):

  1. Box $[a, b]^n$
  2. $\ell_2$ ball $\{\mathbf{x} : \|\mathbf{x}\| \leq r\}$
  3. Probability simplex $\{\mathbf{x} : x_i \geq 0,\; \sum_i x_i = 1\}$
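A minimal NumPy sketch of the three projections; the simplex routine uses the standard sort-and-threshold scheme, and the test vector is an arbitrary illustration:

```python
import numpy as np

def project_box(y, a, b):
    """Box [a, b]^n: clip each coordinate."""
    return np.clip(y, a, b)

def project_l2_ball(y, r):
    """l2 ball of radius r: rescale if outside, leave unchanged otherwise."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def project_simplex(y):
    """Probability simplex: sort, find the threshold tau, take the positive part."""
    u = np.sort(y)[::-1]                      # sorted in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(y - tau, 0.0)

print(project_simplex(np.array([0.6, 1.2, -0.3])))  # nonnegative, sums to 1
```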

Definition: Proximal Operator

For a (possibly non-smooth) convex function $g$, the proximal operator with parameter $\alpha > 0$ is

$$\text{prox}_{\alpha g}(\mathbf{v}) = \arg\min_{\mathbf{x}} \left(g(\mathbf{x}) + \frac{1}{2\alpha}\|\mathbf{x} - \mathbf{v}\|^2\right).$$

The proximal operator generalises projection: when $g$ is the indicator function of a convex set $\mathcal{C}$, $\text{prox}_{\alpha g} = \Pi_{\mathcal{C}}$.

Definition: Proximal Gradient Descent

For composite problems $\min_{\mathbf{x}}\; f(\mathbf{x}) + g(\mathbf{x})$, where $f$ is smooth and $g$ is convex (possibly non-smooth), proximal gradient descent alternates a gradient step on $f$ with a proximal step on $g$:

$$\mathbf{x}_{k+1} = \text{prox}_{\alpha_k g}\bigl(\mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)\bigr).$$

When $g(\mathbf{x}) = \lambda \|\mathbf{x}\|_1$ (LASSO regularisation), the proximal operator is the soft-thresholding operator.

In wireless, $\ell_1$ regularisation arises in sparse channel estimation and compressed sensing for mmWave channel recovery.
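As an illustration, here is a minimal sketch of proximal gradient descent (ISTA) for the LASSO problem $\min_{\mathbf{x}} \tfrac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^2 + \lambda\|\mathbf{x}\|_1$; the problem dimensions and the sparse test signal are hypothetical:

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, y, lam, num_iters=200):
    """ISTA for 0.5 * ||A x - y||^2 + lam * ||x||_1 with step size 1/L."""
    L = np.linalg.norm(A, 2) ** 2                 # L = largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)                  # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L) # proximal (shrinkage) step
    return x

# Toy sparse-recovery example (hypothetical sizes and support)
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 17, 64]] = [1.0, -2.0, 0.5]
x_hat = proximal_gradient_lasso(A, A @ x_true, lam=0.1)
print(np.flatnonzero(np.abs(x_hat) > 0.1))        # should recover the support {3, 17, 64}
```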

Definition: Alternating Optimisation and Block Coordinate Descent

When optimising over multiple variable blocks $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_B)$, alternating optimisation (or block coordinate descent) cycles through

$$\mathbf{x}_b^{(k+1)} = \arg\min_{\mathbf{x}_b} f\bigl(\mathbf{x}_1^{(k+1)}, \ldots, \mathbf{x}_{b-1}^{(k+1)}, \mathbf{x}_b, \mathbf{x}_{b+1}^{(k)}, \ldots, \mathbf{x}_B^{(k)}\bigr)$$

for $b = 1, 2, \ldots, B$ in each iteration.

For convex problems, this converges to the global optimum under mild regularity conditions (e.g., uniquely attained block minimisers). For non-convex problems, it in general converges only to a stationary point.

The weighted MMSE (WMMSE) algorithm for MIMO interference networks is a celebrated example: it alternates between receiver, weight, and precoder updates, each of which has a closed form.
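WMMSE itself is too involved for a short snippet, but the alternating structure is easy to illustrate on a two-block least-squares problem, where each block update has a closed form (here obtained with a library least-squares solve); the problem data below are random and purely illustrative:

```python
import numpy as np

def alternating_least_squares(A, B, y, num_iters=50):
    """Two-block coordinate descent on f(u, v) = ||A u + B v - y||^2.
    Each block sub-problem is a linear least-squares with a closed-form solution."""
    u = np.zeros(A.shape[1])
    v = np.zeros(B.shape[1])
    for _ in range(num_iters):
        u = np.linalg.lstsq(A, y - B @ v, rcond=None)[0]   # update block 1, block 2 fixed
        v = np.linalg.lstsq(B, y - A @ u, rcond=None)[0]   # update block 2, block 1 fixed
    return u, v

rng = np.random.default_rng(1)
A, B = rng.standard_normal((30, 4)), rng.standard_normal((30, 3))
y = rng.standard_normal(30)
u, v = alternating_least_squares(A, B, y)
print(np.linalg.norm(A @ u + B @ v - y))   # residual of the (convex) joint fit
```

Because this joint problem is convex, the alternating updates reach the same residual as solving for both blocks simultaneously.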

Gradient Descent Animation

Watch gradient descent trace a path on a 2D objective surface. Compare the behaviour on well-conditioned vs. ill-conditioned quadratics.


Gradient Descent: Well-Conditioned vs. Ill-Conditioned

Side-by-side comparison of gradient descent on a well-conditioned quadratic ($\kappa = 2$) and an ill-conditioned one ($\kappa = 20$). Observe the zig-zagging behaviour when the condition number is large.
Left: $\kappa = 2$ converges smoothly. Right: $\kappa = 20$ zig-zags along the narrow valley, taking many more iterations.
⚠️ Engineering Note

Convergence Criteria in Real-Time Systems

In wireless system design, iterative algorithms must converge within strict latency budgets:

  • 5G NR slot duration: 0.5 ms (at 30 kHz SCS) to 0.125 ms (at 120 kHz SCS). A precoder must be computed within a fraction of a slot. This limits WMMSE-type algorithms to 3–5 iterations.
  • Stopping criteria: In practice, algorithms stop at a fixed iteration count rather than a convergence tolerance. Typical choices: 3–10 iterations for beamforming, 1–3 for power control.
  • Warm starting: Initialising from the previous slot's solution exploits temporal correlation and often gives near-optimal performance in 1–2 iterations.
  • Hardware constraints: FPGA and ASIC implementations require fixed-point arithmetic. Gradient descent with 16-bit fixed-point can lose 2–3 dB compared to floating-point at high SNR.

Common Mistake: Step Size Too Large Causes Divergence

Mistake:

Choosing a step size $\alpha > 2/L$ (where $L$ is the Lipschitz constant of the gradient), thinking bigger steps mean faster convergence.

Correction:

For an $L$-smooth function, convergence is only guaranteed for step sizes below $2/L$; beyond that threshold gradient descent can diverge, and does diverge on a quadratic whose curvature equals $L$. The "safe" choice is $\alpha = 1/L$. Backtracking line search automatically adapts the step size.
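A one-dimensional quadratic makes the threshold visible. This sketch (with an arbitrary choice $L = 4$) contrasts $\alpha = 1/L$ with $\alpha = 2.5/L$:

```python
# f(x) = 0.5 * L * x^2, so grad f(x) = L * x and the gradient is L-Lipschitz.
L = 4.0

def run(alpha, num_iters=20):
    x = 1.0
    for _ in range(num_iters):
        x = x - alpha * L * x      # each step multiplies x by (1 - alpha * L)
    return x

print(run(alpha=1.0 / L))   # |1 - alpha*L| = 0: jumps straight to the optimum
print(run(alpha=2.5 / L))   # |1 - alpha*L| = 1.5 > 1: the iterates blow up
```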

Comparison of First-Order Methods

| Method | Per-Iteration Cost | Convergence (convex) | Convergence (strongly convex) |
|---|---|---|---|
| Gradient descent | $O(n)$ | $O(1/k)$ | $O((1 - \mu/L)^k)$ |
| Projected GD | $O(n + T_{\text{proj}})$ | $O(1/k)$ | $O((1 - \mu/L)^k)$ |
| Proximal GD | $O(n + T_{\text{prox}})$ | $O(1/k)$ | $O((1 - \mu/L)^k)$ |
| Nesterov acceleration | $O(n)$ | $O(1/k^2)$ | $O((1 - \sqrt{\mu/L})^k)$ |
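For completeness, here is a minimal sketch of Nesterov's accelerated gradient with the standard $(k-1)/(k+2)$ momentum weights and step size $1/L$, run on an illustrative ill-conditioned quadratic:

```python
import numpy as np

def nesterov_gd(grad_f, L, x0, num_iters=100):
    """Nesterov's accelerated gradient for a convex L-smooth f:
    gradient step taken at an extrapolated point, giving the O(1/k^2) rate."""
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for k in range(1, num_iters + 1):
        momentum = (k - 1) / (k + 2)
        y = x + momentum * (x - x_prev)   # extrapolation (momentum) step
        x_prev = x
        x = y - grad_f(y) / L             # gradient step with alpha = 1/L
    return x

Q = np.diag([1.0, 100.0])                 # condition number kappa = 100
x = nesterov_gd(lambda z: Q @ z, L=100.0, x0=[1.0, 1.0])
print(0.5 * x @ Q @ x)                    # objective gap; much smaller than plain GD's
```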

Why This Matters: Iterative Algorithms in MIMO Precoding

The WMMSE (Weighted Minimum Mean Square Error) algorithm by Shi et al. (2011) is one of the most widely used iterative methods in wireless. It solves the non-convex sum-rate maximisation problem for MIMO interference channels by reformulating it as a block-coordinate descent over auxiliary variables. Each sub-problem is a convex QP with a closed-form solution.

See full treatment in Limited Feedback

Quick Check

Gradient descent on an $L$-smooth, $\mu$-strongly convex function with step size $\alpha = 1/L$ converges at what rate?

$O(1/k)$ (sublinear)

$O(1/k^2)$ (accelerated sublinear)

Linear: $O((1 - \mu/L)^k)$

Quadratic convergence

Step Size (Learning Rate)

The parameter $\alpha_k > 0$ controlling the magnitude of each gradient descent update: $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k)$.

Related: Gradient Descent, Convergence Rate, Lipschitz Constant

Condition Number

For an $L$-smooth and $\mu$-strongly convex function, $\kappa = L/\mu \geq 1$. Measures the "eccentricity" of the function's contours. A larger $\kappa$ means slower convergence for first-order methods.

Related: Gradient Descent, Convergence Rate