ADMM for Sparse Problems

Why ADMM When We Already Have FISTA?

FISTA is fast, but it has two limitations. First, it is a pure proximal method: complex regularizers (overlapping group-$\ell_1$, fused-LASSO, total variation, analysis-$\ell_1$ penalties $\|\mathbf{D}\mathbf{x}\|_1$ with a non-trivial $\mathbf{D}$) do not have closed-form proximal operators. Second, FISTA is inherently serial, one mat-vec at a time, whereas many distributed applications demand algorithms that split across machines. The Alternating Direction Method of Multipliers (ADMM) addresses both problems by decoupling variables through splitting, giving each sub-problem its own clean structure (a linear solve, a prox, a dual update). Every primitive can be parallelized. ADMM is the workhorse of distributed sparse optimization and appears under different names throughout signal processing: split Bregman, Douglas-Rachford, operator splitting.

Definition: Variable Splitting and Augmented Lagrangian

Rewrite the LASSO by introducing a copy $\mathbf{z}$ of $\mathbf{x}$:

$$\min_{\mathbf{x},\,\mathbf{z}}\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 \quad\text{s.t.}\quad \mathbf{x}-\mathbf{z}=\mathbf{0}.$$

The augmented Lagrangian with penalty $\rho>0$ is

$$\mathcal{L}_\rho(\mathbf{x},\mathbf{z},\boldsymbol{\nu}) = \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 + \boldsymbol{\nu}^\top(\mathbf{x}-\mathbf{z}) + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}\|_2^2.$$

With the scaled dual $\mathbf{u}=\boldsymbol{\nu}/\rho$, we absorb the linear term into the quadratic:

$$\mathcal{L}_\rho = \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}+\mathbf{u}\|_2^2 - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2.$$
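The "absorb" step is just completing the square in $\mathbf{x}-\mathbf{z}$; writing it out once:

$$\boldsymbol{\nu}^\top(\mathbf{x}-\mathbf{z}) + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}\|_2^2 = \tfrac{\rho}{2}\Big(\|\mathbf{x}-\mathbf{z}\|_2^2 + 2\mathbf{u}^\top(\mathbf{x}-\mathbf{z}) + \|\mathbf{u}\|_2^2\Big) - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2 = \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}+\mathbf{u}\|_2^2 - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2, \qquad \mathbf{u}=\boldsymbol{\nu}/\rho.$$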

ADMM for the LASSO

Complexity: $O(N^3)$ once for the Cholesky factorization of $\mathbf{A}^\top\mathbf{A}+\rho\mathbf{I}$, then $O(N^2)$ per iteration (just two triangular solves). Alternatively $O(MN)$ per iteration via the matrix-inversion lemma when $M\ll N$.
Input: A, y, lambda, rho, tolerances eps_pri, eps_dual.
Precompute: M = (A^T A + rho * I)^{-1} (or Cholesky factor thereof)
q_prec = A^T y (reused every iter)
Initialize: x_0 = 0, z_0 = 0, u_0 = 0.
1: for k = 0, 1, 2, ... do
2: x_{k+1} <- M @ (q_prec + rho*(z_k - u_k)) # x-update: ridge regression
3: z_{k+1} <- soft_threshold(x_{k+1} + u_k, lambda/rho) # z-update: componentwise prox
4: u_{k+1} <- u_k + (x_{k+1} - z_{k+1}) # dual update
5: r_pri <- ||x_{k+1} - z_{k+1}||_2 # primal residual
6: r_dual <- rho * ||z_{k+1} - z_k||_2 # dual residual
7: if r_pri < eps_pri and r_dual < eps_dual then stop
8: end for

The three updates have complementary roles: the $\mathbf{x}$-update handles the smooth data fit (a ridge regression), the $\mathbf{z}$-update handles the non-smooth penalty (a soft-threshold), and the $\mathbf{u}$-update accumulates the constraint violation $\mathbf{x}-\mathbf{z}$. When the algorithm has converged, $\mathbf{x}=\mathbf{z}$ (primal feasible) and $\mathbf{u}$ stabilizes at the dual optimum divided by $\rho$.
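A compact NumPy/SciPy sketch of the algorithm box above, assuming a dense A held in memory and a fixed rho; the function name admm_lasso and its default values are illustrative, not taken from any particular library:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def soft_threshold(v, kappa):
    # Componentwise prox of kappa * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, y, lam, rho=1.0, eps_pri=1e-6, eps_dual=1e-6, max_iter=1000):
    N = A.shape[1]
    # Precompute: Cholesky factor of A^T A + rho*I (O(N^3), once) and A^T y (reused)
    factor = cho_factor(A.T @ A + rho * np.eye(N))
    q_prec = A.T @ y
    x, z, u = np.zeros(N), np.zeros(N), np.zeros(N)
    for _ in range(max_iter):
        # x-update: ridge regression via two triangular solves (O(N^2))
        x = cho_solve(factor, q_prec + rho * (z - u))
        # z-update: componentwise soft-threshold
        z_old = z
        z = soft_threshold(x + u, lam / rho)
        # dual update: accumulate the constraint violation x - z
        u = u + x - z
        # primal and dual residuals
        r_pri = np.linalg.norm(x - z)
        r_dual = rho * np.linalg.norm(z - z_old)
        if r_pri < eps_pri and r_dual < eps_dual:
            break
    return z  # z is exactly sparse; x agrees with it up to the primal residual

Because the factor is cached, each iteration costs only the two triangular solves, matching the $O(N^2)$ per-iteration figure above; if $\rho$ is changed during the run (see the engineering note below), the factor must be recomputed.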

Deriving the x-Update

Fixing $\mathbf{z}=\mathbf{z}^{(k)}$ and $\mathbf{u}=\mathbf{u}^{(k)}$, the $\mathbf{x}$-update minimizes $\tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}^{(k)}+\mathbf{u}^{(k)}\|_2^2$. Setting the gradient to zero yields the normal equations

$$(\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})\mathbf{x} = \mathbf{A}^\top\mathbf{y} + \rho(\mathbf{z}^{(k)} - \mathbf{u}^{(k)}).$$

When $M\ll N$ (typical in compressed sensing), use the Woodbury identity

$$(\mathbf{A}^\top\mathbf{A}+\rho\mathbf{I})^{-1} = \tfrac{1}{\rho}\mathbf{I} - \tfrac{1}{\rho}\mathbf{A}^\top(\rho\mathbf{I}+\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}$$

to invert an $M\times M$ matrix instead of $N\times N$.
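A sketch of the Woodbury-based x-update under the same assumptions as the code above (the helper name make_x_update_woodbury is illustrative); it factors the small $M\times M$ system once and then costs $O(MN)$ per call:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update_woodbury(A, y, rho):
    # Factor the small M x M matrix (rho*I + A A^T) once
    M_dim = A.shape[0]
    small = cho_factor(rho * np.eye(M_dim) + A @ A.T)
    ATy = A.T @ y
    def x_update(z, u):
        # Solves (A^T A + rho*I) x = A^T y + rho*(z - u) via the Woodbury identity
        q = ATy + rho * (z - u)
        return q / rho - A.T @ cho_solve(small, A @ q) / rho
    return x_update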

Theorem: ADMM Convergence for Convex Problems

Consider the problem

$$\min_{\mathbf{x},\mathbf{z}}\; f(\mathbf{x}) + g(\mathbf{z}) \quad\text{s.t.}\quad \mathbf{C}\mathbf{x}+\mathbf{D}\mathbf{z}=\mathbf{c},$$

with $f,g$ closed, proper, convex and $\mathcal{L}_0$ having a saddle point. Then, for any $\rho>0$, the ADMM iterates satisfy: (i) the primal residual $\mathbf{r}^{(k)}=\mathbf{C}\mathbf{x}^{(k)}+\mathbf{D}\mathbf{z}^{(k)}-\mathbf{c}\to\mathbf{0}$; (ii) the objective $f(\mathbf{x}^{(k)})+g(\mathbf{z}^{(k)})\to p^\star$; (iii) the dual $\boldsymbol{\nu}^{(k)}\to\boldsymbol{\nu}^\star$. The objective gap and primal residual decay as $O(1/k)$ (in the ergodic sense).

ADMM is a particular instantiation of Douglas-Rachford splitting applied to the dual problem. The $O(1/k)$ rate matches ISTA; ADMM is not accelerated by default. Under strong convexity one can achieve linear convergence, and accelerated/over-relaxed variants exist. The practical appeal is robustness: ADMM converges for any positive $\rho$, though the speed depends dramatically on $\rho$.

Definition: Primal and Dual Residuals

The primal residual measures constraint violation: $\mathbf{r}^{(k)}_{\text{pri}} = \mathbf{x}^{(k)} - \mathbf{z}^{(k)}$. The dual residual measures the change between consecutive $\mathbf{z}$-iterates: $\mathbf{r}^{(k)}_{\text{dual}} = \rho(\mathbf{z}^{(k)} - \mathbf{z}^{(k-1)})$. Standard stopping criterion: stop when both

$$\|\mathbf{r}_{\text{pri}}\|_2 \leq \varepsilon_{\text{pri}}, \qquad \|\mathbf{r}_{\text{dual}}\|_2 \leq \varepsilon_{\text{dual}},$$

where the tolerances combine an absolute part $\sqrt{N}\,\varepsilon_{\text{abs}}$ and a relative part $\varepsilon_{\text{rel}} \cdot \max(\|\mathbf{x}^{(k)}\|,\|\mathbf{z}^{(k)}\|)$.

A small primal residual means the constraint $\mathbf{x}=\mathbf{z}$ is nearly satisfied. A small dual residual means the $\mathbf{z}$-iterate has settled. Both must be small: monitoring only one can give false convergence. Boyd et al. (2011) recommend $\varepsilon_{\text{abs}}=10^{-4}$, $\varepsilon_{\text{rel}}=10^{-3}$ as defaults.
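A small helper in that spirit (names illustrative) could replace the fixed eps_pri/eps_dual inputs in the earlier sketch; note that in the scaled form the dual variable is $\boldsymbol{\nu}=\rho\mathbf{u}$, which is where the $\rho$ factor in the dual tolerance comes from in Boyd et al. (2011):

import numpy as np

def boyd_tolerances(x, z, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    # Absolute part sqrt(N)*eps_abs plus a relative part scaled to the iterates
    N = x.size
    eps_pri = np.sqrt(N) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
    eps_dual = np.sqrt(N) * eps_abs + eps_rel * rho * np.linalg.norm(u)
    return eps_pri, eps_dual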

Example: ADMM on a 2D LASSO

With $\mathbf{A}=\mathbf{I}_2$, $\mathbf{y}=(1.2,\,0.1)^\top$, $\lambda=0.5$, $\rho=1$, and zero initialization, perform one full ADMM iteration.
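A quick numerical check of that single iteration (a minimal sketch; with $\mathbf{A}=\mathbf{I}$ the ridge solve reduces to division by $1+\rho$):

import numpy as np

y = np.array([1.2, 0.1])
lam, rho = 0.5, 1.0
x, z, u = np.zeros(2), np.zeros(2), np.zeros(2)
# x-update: (I + rho*I)^{-1}(y + rho*(z - u)) = (y + rho*(z - u)) / (1 + rho)
x = (y + rho * (z - u)) / (1.0 + rho)                    # [0.6, 0.05]
# z-update: soft-threshold at lambda/rho = 0.5
v = x + u
z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # [0.1, 0.0]
# dual update
u = u + x - z                                            # [0.5, 0.05]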

ADMM Primal and Dual Residuals

Track both residuals as a function of iteration on the same synthetic problem used in the ISTA/FISTA comparison. Play with $\rho$: too small a penalty wastes dual updates; too large a penalty stalls the $\mathbf{z}$-update.

⚠️ Engineering Note: Choosing and Adapting $\rho$

The performance of ADMM is notoriously sensitive to $\rho$. Too small: the primal residual decays slowly. Too large: the dual residual decays slowly. Boyd's recipe: if $\|\mathbf{r}_{\text{pri}}\| > \mu\|\mathbf{r}_{\text{dual}}\|$, increase $\rho\leftarrow\tau^{\text{incr}}\rho$; if $\|\mathbf{r}_{\text{dual}}\| > \mu\|\mathbf{r}_{\text{pri}}\|$, decrease $\rho\leftarrow\rho/\tau^{\text{decr}}$. Typical values: $\mu=10$, $\tau^{\text{incr}}=\tau^{\text{decr}}=2$. When $\rho$ changes, rescale $\mathbf{u}$ to keep the augmented Lagrangian consistent: $\mathbf{u}\leftarrow \mathbf{u}\cdot(\rho_{\text{old}}/\rho_{\text{new}})$.
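A minimal sketch of this residual-balancing rule (the helper name update_rho is illustrative); whenever $\rho$ changes, the cached Cholesky factor of $\mathbf{A}^\top\mathbf{A}+\rho\mathbf{I}$ must be recomputed, which is the point of the first bullet below:

def update_rho(rho, u, r_pri, r_dual, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    # Residual balancing (Boyd et al., 2011): keep the primal and dual residuals
    # within a factor mu of each other, rescaling the scaled dual u whenever rho
    # changes so the augmented Lagrangian stays consistent.
    if r_pri > mu * r_dual:
        return tau_incr * rho, u / tau_incr   # u <- u * (rho_old / rho_new)
    if r_dual > mu * r_pri:
        return rho / tau_decr, u * tau_decr
    return rho, u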

Practical Constraints

• Cache the Cholesky factorization only when $\rho$ is held fixed for many iterations.

• For parallel ADMM (consensus form), communication cost often dominates; choose $\rho$ to minimize total wall-time, not iteration count.

Why This Matters: ADMM for Massive-MIMO Detection

Modern massive-MIMO receivers (5G/6G) solve regularized least-squares problems in which each user's signal is embedded in a massive $\mathbf{A}\in\mathbb{C}^{M\times N}$ with $N\gg M$. ADMM-based detectors (splitting across base-station antennas) achieve per-subcarrier latencies compatible with sub-millisecond frame structures, whereas interior-point solvers cannot meet the latency budget. The same machinery underpins activity detection in grant-free random access, which we meet in Chapter 15.

See full treatment in Chapter 15

Common Mistake: Stopping Only on Primal Residual

Mistake:

Declaring convergence when $\|\mathbf{x}-\mathbf{z}\|$ is small.

Correction:

A small primal residual means feasibility, not optimality. Always check the dual residual $\rho\|\mathbf{z}^{(k)}-\mathbf{z}^{(k-1)}\|$ as well. A small primal residual with a large dual residual indicates that the $\mathbf{z}$-updates are still moving: the iterate is feasible but not yet optimal.

Common Mistake: Letting $\rho\to 0$ for a Tight Fit

Mistake:

Reducing $\rho$ below $10^{-3}$ to "not distort" the original objective.

Correction:

As $\rho\to 0$, the $\mathbf{z}$-update's effective threshold $\lambda/\rho$ blows up and the iterate collapses to the zero vector every iteration. ADMM converges for any $\rho>0$, but the practical sweet spot is typically $\rho \in [0.1\|\mathbf{A}\|^2,\,\|\mathbf{A}\|^2]$.

Historical Note: ADMM: From 1970s Numerical Analysis to 2010s Machine Learning

1975-2011

ADMM was introduced by Glowinski-Marroco (1975) and Gabay-Mercier (1976) for partial differential equations, and placed on firm convergence footing by Lions and Mercier (1979). It lay dormant in numerical analysis for decades until Boyd, Parikh, Chu, Peleato and Eckstein's 2011 monograph Distributed Optimization and Statistical Learning via ADMM popularized it in machine learning and signal processing. The monograph reframed ADMM as a general recipe for distributed convex problems, showed how to split over data, features, or constraints, and sparked a decade of research, from consensus optimization to the split-Bregman method for TV denoising.

ADMM

Alternating Direction Method of Multipliers. Applies Douglas-Rachford splitting to the dual of a separable convex problem. Alternates primal minimizations over decoupled blocks with a dual update.

Related: ADMM for the LASSO

Augmented Lagrangian

The Lagrangian with an added quadratic penalty on constraint violation. Converts equality-constrained problems into unconstrained problems whose minimizers coincide with the original.

Related: Variable Splitting and Augmented Lagrangian

Quick Check

Which of the three ADMM steps for LASSO is the one that injects sparsity?

The $\mathbf{x}$-update (ridge regression).

The $\mathbf{z}$-update (soft-threshold).

The $\mathbf{u}$-update (dual).

None of them individually; sparsity only emerges at convergence.

Quick Check

Your ADMM run stops after 100 iterations with primal residual $10^{-6}$ but dual residual $10^{-1}$. What happened?

The algorithm has converged; the dual residual can be ignored.

$\rho$ was too large, making $\mathbf{z}$-updates stall.

$\lambda$ was too small.

The LASSO has no solution.

Key Takeaway

ADMM splits the LASSO into a smooth ridge regression ($\mathbf{x}$-update) and a soft-threshold ($\mathbf{z}$-update) linked through a dual variable $\mathbf{u}$. It converges for any $\rho>0$ at rate $O(1/k)$, is trivially parallelizable, and handles regularizers that FISTA cannot (overlapping group-$\ell_1$, TV, analysis-$\ell_1$). Its Achilles heel is sensitivity to $\rho$; the adaptive schemes from Boyd et al. (2011) are essential.