ADMM for Sparse Problems

Why ADMM When We Already Have FISTA?

FISTA is fast, but it has two limitations. First, it is a pure proximal method: complex regularizers (overlapping group-$\ell_1$, fused-LASSO, total variation, analysis-$\ell_1$ penalties $\|\mathbf{D}\mathbf{x}\|_1$ with a non-trivial $\mathbf{D}$) do not have closed-form proximal operators. Second, FISTA is inherently serial, one mat-vec at a time, whereas many distributed applications demand algorithms that split across machines. The Alternating Direction Method of Multipliers (ADMM) addresses both problems by decoupling variables through splitting, giving each sub-problem its own clean structure (a linear solve, a prox, a dual update). Every primitive can be parallelized. ADMM is the workhorse of distributed sparse optimization and appears under different names throughout signal processing: split Bregman, Douglas-Rachford, operator splitting.

Definition: Variable Splitting and Augmented Lagrangian

Rewrite the LASSO by introducing a copy $\mathbf{z}$ of $\mathbf{x}$:

$$\min_{\mathbf{x},\,\mathbf{z}}\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 \quad\text{s.t.}\quad \mathbf{x}-\mathbf{z}=\mathbf{0}.$$

The augmented Lagrangian with penalty $\rho>0$ is

$$\mathcal{L}_\rho(\mathbf{x},\mathbf{z},\boldsymbol{\nu}) = \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 + \boldsymbol{\nu}^\top(\mathbf{x}-\mathbf{z}) + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}\|_2^2.$$

With the scaled dual $\mathbf{u}=\boldsymbol{\nu}/\rho$, we absorb the linear term into the quadratic:

$$\mathcal{L}_\rho = \tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}+\mathbf{u}\|_2^2 - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2.$$
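The "absorb" step is just completing the square in $\mathbf{x}-\mathbf{z}$; writing it out once:

$$\boldsymbol{\nu}^\top(\mathbf{x}-\mathbf{z}) + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}\|_2^2 = \tfrac{\rho}{2}\Big(\|\mathbf{x}-\mathbf{z}\|_2^2 + 2\mathbf{u}^\top(\mathbf{x}-\mathbf{z}) + \|\mathbf{u}\|_2^2\Big) - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2 = \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}+\mathbf{u}\|_2^2 - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2, \qquad \mathbf{u}=\boldsymbol{\nu}/\rho.$$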

ADMM for the LASSO

Complexity: $O(N^3)$ once for the Cholesky factorization of $\mathbf{A}^\top\mathbf{A}+\rho\mathbf{I}$, then $O(N^2)$ per iteration (just two triangular solves). Alternatively $O(MN)$ per iteration via the matrix-inversion lemma when $M\ll N$.
Input: A, y, lambda, rho, tolerances eps_pri, eps_dual.
Precompute: M = (A^T A + rho * I)^{-1} (or Cholesky factor thereof)
q_prec = A^T y (reused every iter)
Initialize: x_0 = 0, z_0 = 0, u_0 = 0.
1: for k = 0, 1, 2, ... do
2: x_{k+1} <- M @ (q_prec + rho*(z_k - u_k)) # x-update: ridge regression
3: z_{k+1} <- soft_threshold(x_{k+1} + u_k, lambda/rho) # z-update: componentwise prox
4: u_{k+1} <- u_k + (x_{k+1} - z_{k+1}) # dual update
5: r_pri <- ||x_{k+1} - z_{k+1}||_2 # primal residual
6: r_dual <- rho * ||z_{k+1} - z_k||_2 # dual residual
7: if r_pri < eps_pri and r_dual < eps_dual then stop
8: end for

The three updates have complementary roles: the $\mathbf{x}$-update handles the smooth data fit (a ridge regression), the $\mathbf{z}$-update handles the non-smooth penalty (a soft-threshold), and the $\mathbf{u}$-update accumulates the constraint violation $\mathbf{x}-\mathbf{z}$. When the algorithm has converged, $\mathbf{x}=\mathbf{z}$ (primal feasible) and $\mathbf{u}$ stabilizes at the dual optimum divided by $\rho$.
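A compact NumPy/SciPy sketch of the algorithm box above, assuming a dense A held in memory and a fixed rho; the function name admm_lasso and its default values are illustrative, not taken from any particular library:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def soft_threshold(v, kappa):
    # Componentwise prox of kappa * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, y, lam, rho=1.0, eps_pri=1e-6, eps_dual=1e-6, max_iter=1000):
    N = A.shape[1]
    # Precompute: Cholesky factor of A^T A + rho*I (O(N^3), once) and A^T y (reused)
    factor = cho_factor(A.T @ A + rho * np.eye(N))
    q_prec = A.T @ y
    x, z, u = np.zeros(N), np.zeros(N), np.zeros(N)
    for _ in range(max_iter):
        # x-update: ridge regression via two triangular solves (O(N^2))
        x = cho_solve(factor, q_prec + rho * (z - u))
        # z-update: componentwise soft-threshold
        z_old = z
        z = soft_threshold(x + u, lam / rho)
        # dual update: accumulate the constraint violation x - z
        u = u + x - z
        # primal and dual residuals
        r_pri = np.linalg.norm(x - z)
        r_dual = rho * np.linalg.norm(z - z_old)
        if r_pri < eps_pri and r_dual < eps_dual:
            break
    return z  # z is exactly sparse; x agrees with it up to the primal residual

Because the factor is cached, each iteration costs only the two triangular solves, matching the $O(N^2)$ per-iteration figure above; if $\rho$ is changed during the run (see the engineering note below), the factor must be recomputed.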

Deriving the x-Update

Fixing $\mathbf{z}=\mathbf{z}^{(k)}$ and $\mathbf{u}=\mathbf{u}^{(k)}$, the $\mathbf{x}$-update minimizes $\tfrac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \tfrac{\rho}{2}\|\mathbf{x}-\mathbf{z}^{(k)}+\mathbf{u}^{(k)}\|_2^2$. Setting the gradient to zero yields the normal equations

$$(\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})\mathbf{x} = \mathbf{A}^\top\mathbf{y} + \rho(\mathbf{z}^{(k)} - \mathbf{u}^{(k)}).$$

When $M\ll N$ (typical in compressed sensing), use the Woodbury identity

$$(\mathbf{A}^\top\mathbf{A}+\rho\mathbf{I})^{-1} = \tfrac{1}{\rho}\mathbf{I} - \tfrac{1}{\rho}\mathbf{A}^\top(\rho\mathbf{I}+\mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}$$

to invert an $M\times M$ matrix instead of $N\times N$.
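A sketch of the Woodbury-based x-update under the same assumptions as the code above (the helper name make_x_update_woodbury is illustrative); it factors the small $M\times M$ system once and then costs $O(MN)$ per call:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update_woodbury(A, y, rho):
    # Factor the small M x M matrix (rho*I + A A^T) once
    M_dim = A.shape[0]
    small = cho_factor(rho * np.eye(M_dim) + A @ A.T)
    ATy = A.T @ y
    def x_update(z, u):
        # Solves (A^T A + rho*I) x = A^T y + rho*(z - u) via the Woodbury identity
        q = ATy + rho * (z - u)
        return q / rho - A.T @ cho_solve(small, A @ q) / rho
    return x_update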

Theorem: ADMM Convergence for Convex Problems

Consider the problem

$$\min_{\mathbf{x},\mathbf{z}}\; f(\mathbf{x}) + g(\mathbf{z}) \quad\text{s.t.}\quad \mathbf{C}\mathbf{x}+\mathbf{D}\mathbf{z}=\mathbf{c},$$

with $f,g$ closed, proper, convex and $\mathcal{L}_0$ having a saddle point. Then, for any $\rho>0$, the ADMM iterates satisfy: (i) the primal residual $\mathbf{r}^{(k)}=\mathbf{C}\mathbf{x}^{(k)}+\mathbf{D}\mathbf{z}^{(k)}-\mathbf{c}\to\mathbf{0}$; (ii) the objective $f(\mathbf{x}^{(k)})+g(\mathbf{z}^{(k)})\to p^\star$; (iii) the dual $\boldsymbol{\nu}^{(k)}\to\boldsymbol{\nu}^\star$. The objective gap and primal residual decay as $O(1/k)$ (in the ergodic sense).

ADMM is a particular instantiation of Douglas-Rachford splitting applied to the dual problem. The $O(1/k)$ rate matches ISTA; ADMM is not accelerated by default. Under strong convexity one can achieve linear convergence, and accelerated/over-relaxed variants exist. The practical appeal is robustness: ADMM converges for any positive $\rho$, though the speed depends dramatically on $\rho$.

Definition: Primal and Dual Residuals

The primal residual measures constraint violation: $\mathbf{r}^{(k)}_{\text{pri}} = \mathbf{x}^{(k)} - \mathbf{z}^{(k)}$. The dual residual measures the change between consecutive $\mathbf{z}$-iterates: $\mathbf{r}^{(k)}_{\text{dual}} = \rho(\mathbf{z}^{(k)} - \mathbf{z}^{(k-1)})$. Standard stopping criterion: stop when both

$$\|\mathbf{r}_{\text{pri}}\|_2 \leq \varepsilon_{\text{pri}}, \qquad \|\mathbf{r}_{\text{dual}}\|_2 \leq \varepsilon_{\text{dual}},$$

where the tolerances combine an absolute part $\sqrt{N}\,\varepsilon_{\text{abs}}$ and a relative part $\varepsilon_{\text{rel}} \cdot \max(\|\mathbf{x}^{(k)}\|,\|\mathbf{z}^{(k)}\|)$.

A small primal residual means the constraint $\mathbf{x}=\mathbf{z}$ is nearly satisfied. A small dual residual means the $\mathbf{z}$-iterate has settled. Both must be small: monitoring only one can give false convergence. Boyd et al. (2011) recommend $\varepsilon_{\text{abs}}=10^{-4}$, $\varepsilon_{\text{rel}}=10^{-3}$ as defaults.
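A small helper in that spirit (names illustrative) could replace the fixed eps_pri/eps_dual inputs in the earlier sketch; note that in the scaled form the dual variable is $\boldsymbol{\nu}=\rho\mathbf{u}$, which is where the $\rho$ factor in the dual tolerance comes from in Boyd et al. (2011):

import numpy as np

def boyd_tolerances(x, z, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    # Absolute part sqrt(N)*eps_abs plus a relative part scaled to the iterates
    N = x.size
    eps_pri = np.sqrt(N) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
    eps_dual = np.sqrt(N) * eps_abs + eps_rel * rho * np.linalg.norm(u)
    return eps_pri, eps_dual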

Example: ADMM on a 2D LASSO

With $\mathbf{A}=\mathbf{I}_2$, $\mathbf{y}=(1.2,\,0.1)^\top$, $\lambda=0.5$, $\rho=1$, and zero initialization, perform one full ADMM iteration.
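A quick numerical check of that single iteration (a minimal sketch; with $\mathbf{A}=\mathbf{I}$ the ridge solve reduces to division by $1+\rho$):

import numpy as np

y = np.array([1.2, 0.1])
lam, rho = 0.5, 1.0
x, z, u = np.zeros(2), np.zeros(2), np.zeros(2)
# x-update: (I + rho*I)^{-1}(y + rho*(z - u)) = (y + rho*(z - u)) / (1 + rho)
x = (y + rho * (z - u)) / (1.0 + rho)                    # [0.6, 0.05]
# z-update: soft-threshold at lambda/rho = 0.5
v = x + u
z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # [0.1, 0.0]
# dual update
u = u + x - z                                            # [0.5, 0.05]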

ADMM Primal and Dual Residuals

Track both residuals as a function of iteration on the same synthetic problem used in the ISTA/FISTA comparison. Play with $\rho$: too small a penalty wastes dual updates; too large a penalty stalls the $\mathbf{z}$-update.

⚠️ Engineering Note: Choosing and Adapting $\rho$

The performance of ADMM is notoriously sensitive to $\rho$. Too small: the primal residual decays slowly. Too large: the dual residual decays slowly. Boyd's recipe: if $\|\mathbf{r}_{\text{pri}}\| > \mu\|\mathbf{r}_{\text{dual}}\|$, increase $\rho\leftarrow\tau^{\text{incr}}\rho$; if $\|\mathbf{r}_{\text{dual}}\| > \mu\|\mathbf{r}_{\text{pri}}\|$, decrease $\rho\leftarrow\rho/\tau^{\text{decr}}$. Typical values: $\mu=10$, $\tau^{\text{incr}}=\tau^{\text{decr}}=2$. When $\rho$ changes, rescale $\mathbf{u}$ to keep the augmented Lagrangian consistent: $\mathbf{u}\leftarrow \mathbf{u}\cdot(\rho_{\text{old}}/\rho_{\text{new}})$.
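A minimal sketch of this residual-balancing rule (the helper name update_rho is illustrative); whenever $\rho$ changes, the cached Cholesky factor of $\mathbf{A}^\top\mathbf{A}+\rho\mathbf{I}$ must be recomputed, which is the point of the first bullet below:

def update_rho(rho, u, r_pri, r_dual, mu=10.0, tau_incr=2.0, tau_decr=2.0):
    # Residual balancing (Boyd et al., 2011): keep the primal and dual residuals
    # within a factor mu of each other, rescaling the scaled dual u whenever rho
    # changes so the augmented Lagrangian stays consistent.
    if r_pri > mu * r_dual:
        return tau_incr * rho, u / tau_incr   # u <- u * (rho_old / rho_new)
    if r_dual > mu * r_pri:
        return rho / tau_decr, u * tau_decr
    return rho, u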

Practical Constraints

• Cache the Cholesky factorization only when $\rho$ is held fixed for many iterations.

• For parallel ADMM (consensus form), communication cost often dominates; choose $\rho$ to minimize total wall-time, not iteration count.

Why This Matters: ADMM for Massive-MIMO Detection

Modern massive-MIMO receivers (5G/6G) solve regularized least-squares problems in which each user's signal is embedded in a massive $\mathbf{A}\in\mathbb{C}^{M\times N}$ with $N\gg M$. ADMM-based detectors (splitting across base-station antennas) achieve per-subcarrier latencies compatible with sub-millisecond frame structures, whereas interior-point solvers cannot meet the latency budget. The same machinery underpins activity detection in grant-free random access, which we meet in Chapter 15.

See full treatment in Chapter 15

Common Mistake: Stopping Only on Primal Residual

Mistake:

Declaring convergence when $\|\mathbf{x}-\mathbf{z}\|$ is small.

Correction:

A small primal residual means feasibility, not optimality. Always check the dual residual $\rho\|\mathbf{z}^{(k)}-\mathbf{z}^{(k-1)}\|$ as well. A small primal residual with a large dual residual indicates that the $\mathbf{z}$-updates are still moving: the iterate is feasible but not yet optimal.

Common Mistake: Letting $\rho\to 0$ for a Tight Fit

Mistake:

Reducing $\rho$ below $10^{-3}$ to "not distort" the original objective.

Correction:

As $\rho\to 0$, the $\mathbf{z}$-update's effective threshold $\lambda/\rho$ blows up and the iterate collapses to the zero vector every iteration. ADMM converges for any $\rho>0$, but the practical sweet spot is typically $\rho \in [0.1\|\mathbf{A}\|^2,\,\|\mathbf{A}\|^2]$.

Historical Note: ADMM: From 1970s Numerical Analysis to 2010s Machine Learning

1975-2011

ADMM was introduced by Glowinski-Marroco (1975) and Gabay-Mercier (1976) for partial differential equations, and placed on firm convergence footing by Lions and Mercier (1979). It lay dormant in numerical analysis for decades until Boyd, Parikh, Chu, Peleato and Eckstein's 2011 monograph Distributed Optimization and Statistical Learning via ADMM popularized it in machine learning and signal processing. The monograph reframed ADMM as a general recipe for distributed convex problems, showed how to split over data, features, or constraints, and sparked a decade of research, from consensus optimization to the split-Bregman method for TV denoising.

ADMM

Alternating Direction Method of Multipliers. Applies Douglas-Rachford splitting to the dual of a separable convex problem. Alternates primal minimizations over decoupled blocks with a dual update.

Related: ADMM for the LASSO

Augmented Lagrangian

The Lagrangian with an added quadratic penalty on constraint violation. Converts equality-constrained problems into unconstrained problems whose minimizers coincide with the original.

Related: Variable Splitting and Augmented Lagrangian

Quick Check

Which of the three ADMM steps for LASSO is the one that injects sparsity?

The $\mathbf{x}$-update (ridge regression).

The $\mathbf{z}$-update (soft-threshold).

The $\mathbf{u}$-update (dual).

None of them individually; sparsity only emerges at convergence.

Quick Check

Your ADMM run stops after 100 iterations with primal residual $10^{-6}$ but dual residual $10^{-1}$. What happened?

The algorithm has converged; the dual residual can be ignored.

$\rho$ was too large, making $\mathbf{z}$-updates stall.

$\lambda$ was too small.

The LASSO has no solution.

Key Takeaway

ADMM splits the LASSO into a smooth ridge regression ($\mathbf{x}$-update) and a soft-threshold ($\mathbf{z}$-update) linked through a dual variable $\mathbf{u}$. It converges for any $\rho>0$ at rate $O(1/k)$, is trivially parallelizable, and handles regularizers that FISTA cannot (overlapping group-$\ell_1$, TV, analysis-$\ell_1$). Its Achilles heel is sensitivity to $\rho$; the adaptive schemes from Boyd et al. (2011) are essential.