ADMM for Sparse Problems
Why ADMM When We Already Have FISTA?
FISTA is fast, but it has two limitations. First, it is a pure proximal method: complex regularizers (group LASSO, fused LASSO, total variation, analysis-$\ell_1$ with a non-trivial analysis operator $\mathbf{D}$) do not have closed-form proximal operators. Second, FISTA is inherently serial, one mat-vec at a time, whereas many distributed applications demand algorithms that split across machines. The Alternating Direction Method of Multipliers (ADMM) addresses both problems by decoupling variables through splitting, giving each sub-problem its own clean structure (a linear solve, a prox, a dual update). Every primitive can be parallelized. ADMM is the workhorse of distributed sparse optimization and appears under different names throughout signal processing: split Bregman, Douglas-Rachford, operator splitting.
Definition: Variable Splitting and Augmented Lagrangian
Variable Splitting and Augmented Lagrangian
Rewrite the LASSO

$$\min_{\mathbf{x}} \;\tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{x}\|_1$$

by introducing a copy $\mathbf{z}$ of $\mathbf{x}$:

$$\min_{\mathbf{x},\mathbf{z}} \;\tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 \quad \text{subject to } \mathbf{x} - \mathbf{z} = \mathbf{0}.$$

The augmented Lagrangian with penalty $\rho > 0$ is

$$L_\rho(\mathbf{x}, \mathbf{z}, \mathbf{v}) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 + \mathbf{v}^\top(\mathbf{x} - \mathbf{z}) + \tfrac{\rho}{2}\|\mathbf{x} - \mathbf{z}\|_2^2.$$

With the scaled dual $\mathbf{u} = \mathbf{v}/\rho$, we absorb the linear term into the quadratic:

$$L_\rho(\mathbf{x}, \mathbf{z}, \mathbf{u}) = \tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \lambda\|\mathbf{z}\|_1 + \tfrac{\rho}{2}\|\mathbf{x} - \mathbf{z} + \mathbf{u}\|_2^2 - \tfrac{\rho}{2}\|\mathbf{u}\|_2^2.$$
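The absorption is nothing more than completing the square; writing the step out once with $\mathbf{u} = \mathbf{v}/\rho$:

$$\mathbf{v}^\top(\mathbf{x}-\mathbf{z}) + \frac{\rho}{2}\|\mathbf{x}-\mathbf{z}\|_2^2 \;=\; \frac{\rho}{2}\Big(\|\mathbf{x}-\mathbf{z}\|_2^2 + 2\,\mathbf{u}^\top(\mathbf{x}-\mathbf{z})\Big) \;=\; \frac{\rho}{2}\,\|\mathbf{x}-\mathbf{z}+\mathbf{u}\|_2^2 \;-\; \frac{\rho}{2}\,\|\mathbf{u}\|_2^2.$$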
ADMM for the LASSO
The ADMM iteration alternates three updates:

$$\mathbf{x}^{k+1} = (\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})^{-1}\big(\mathbf{A}^\top\mathbf{y} + \rho(\mathbf{z}^k - \mathbf{u}^k)\big), \qquad \mathbf{z}^{k+1} = S_{\lambda/\rho}(\mathbf{x}^{k+1} + \mathbf{u}^k), \qquad \mathbf{u}^{k+1} = \mathbf{u}^k + \mathbf{x}^{k+1} - \mathbf{z}^{k+1}.$$

Complexity: $O(n^3)$ once for the Cholesky factorization of $\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I}$, then $O(n^2)$ per iteration (just two triangular solves). Alternatively $O(mn)$ per iteration via the matrix-inversion lemma when $m \ll n$. The three updates have complementary roles: the $\mathbf{x}$-update handles the smooth data fit (a ridge regression), the $\mathbf{z}$-update handles the non-smooth penalty (a soft-threshold), and the $\mathbf{u}$-update accumulates the constraint violation $\mathbf{x}^{k+1} - \mathbf{z}^{k+1}$. When the algorithm has converged, $\mathbf{x} = \mathbf{z}$ (primal feasible) and $\mathbf{u}$ stabilizes at the dual optimum divided by $\rho$.
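A minimal NumPy sketch of these three updates; the function name `admm_lasso`, its defaults, and the returned residual histories are choices of mine rather than anything from the text. It caches the Cholesky factor as described above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def soft_threshold(v, t):
    """Elementwise soft-thresholding: the prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(A, y, lam, rho=1.0, n_iter=200):
    """ADMM for min_x 0.5*||y - A x||_2^2 + lam*||x||_1 (illustrative sketch)."""
    n = A.shape[1]
    L = cho_factor(A.T @ A + rho * np.eye(n))   # O(n^3) once, reused every iteration
    Aty = A.T @ y
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    r_hist, s_hist = [], []                     # primal / dual residual norms
    for _ in range(n_iter):
        x = cho_solve(L, Aty + rho * (z - u))   # x-update: ridge regression
        z_old = z
        z = soft_threshold(x + u, lam / rho)    # z-update: soft-threshold
        u = u + x - z                           # u-update: scaled dual ascent
        r_hist.append(np.linalg.norm(x - z))            # primal residual ||x - z||
        s_hist.append(rho * np.linalg.norm(z - z_old))  # dual residual rho*||z - z_old||
    return x, z, u, np.array(r_hist), np.array(s_hist)
```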
Deriving the x-Update
Fixing $\mathbf{z}^k$ and $\mathbf{u}^k$, the $\mathbf{x}$-update minimizes $\tfrac{1}{2}\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2 + \tfrac{\rho}{2}\|\mathbf{x} - \mathbf{z}^k + \mathbf{u}^k\|_2^2$. Setting the gradient to zero yields the normal equations $(\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})\,\mathbf{x}^{k+1} = \mathbf{A}^\top\mathbf{y} + \rho(\mathbf{z}^k - \mathbf{u}^k)$. When $m \ll n$ (typical in compressed sensing), use the Woodbury identity to invert an $m \times m$ matrix instead of $n \times n$.
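Concretely, the Woodbury identity gives $(\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})^{-1} = \tfrac{1}{\rho}\big(\mathbf{I} - \mathbf{A}^\top(\rho\mathbf{I}_m + \mathbf{A}\mathbf{A}^\top)^{-1}\mathbf{A}\big)$. A sketch of such an $\mathbf{x}$-update (the helper name and structure are mine, factoring the small $m \times m$ system once):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_x_update_woodbury(A, rho):
    """x-update for (A^T A + rho I) x = b via the Woodbury identity (m << n case)."""
    m = A.shape[0]
    small = cho_factor(rho * np.eye(m) + A @ A.T)   # factor the m x m system once
    def x_update(b):
        # (A^T A + rho I)^{-1} b = (b - A^T (rho I_m + A A^T)^{-1} A b) / rho
        return (b - A.T @ cho_solve(small, A @ b)) / rho
    return x_update
```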
Theorem: ADMM Convergence for Convex Problems
Consider the problem $\min_{\mathbf{x},\mathbf{z}} f(\mathbf{x}) + g(\mathbf{z})$ subject to $\mathbf{x} - \mathbf{z} = \mathbf{0}$, with $f$ and $g$ closed, proper, and convex, and with the unaugmented Lagrangian having a saddle point. Then, for any $\rho > 0$, the ADMM iterates satisfy: (i) primal residual convergence, $\mathbf{r}^k = \mathbf{x}^k - \mathbf{z}^k \to \mathbf{0}$; (ii) objective convergence, $f(\mathbf{x}^k) + g(\mathbf{z}^k) \to p^\star$; (iii) dual convergence, $\mathbf{u}^k \to \mathbf{u}^\star$. The objective gap and primal residual decay as $O(1/k)$ (in the ergodic sense).
ADMM is a particular instantiation of Douglas-Rachford splitting applied to the dual problem. The $O(1/k)$ rate matches ISTA; ADMM is not accelerated by default. Under strong convexity one can achieve linear convergence, and accelerated/over-relaxed variants exist. The practical appeal is robustness: ADMM converges for any positive $\rho$, though the speed depends dramatically on $\rho$.
Monotone operator setup
Define the maximal monotone operators $\mathcal{A} = \partial\!\left[f^*(-\,\cdot\,)\right]$ and $\mathcal{B} = \partial g^*$ (both acting on the dual variable). ADMM can be rewritten as Douglas-Rachford splitting on the sum $\mathcal{A} + \mathcal{B}$, which is known to converge whenever the problem has a solution. The argument is due to Lions and Mercier (1979).
Lyapunov function
Define $V^k = \rho\,\|\mathbf{u}^k - \mathbf{u}^\star\|_2^2 + \rho\,\|\mathbf{z}^k - \mathbf{z}^\star\|_2^2$. Direct computation shows $V^{k+1} \le V^k - \rho\,\|\mathbf{r}^{k+1}\|_2^2 - \rho\,\|\mathbf{z}^{k+1} - \mathbf{z}^k\|_2^2$. The Lyapunov function is monotone non-increasing, and summability of the decrements forces both the primal and dual residuals to vanish.
Ergodic rate
Averaging the one-step inequalities and using convexity of $f$ and $g$ yields $f(\bar{\mathbf{x}}^k) + g(\bar{\mathbf{z}}^k) - p^\star = O(1/k)$ and $\|\bar{\mathbf{x}}^k - \bar{\mathbf{z}}^k\|_2 = O(1/k)$ for the running averages $\bar{\mathbf{x}}^k = \tfrac{1}{k}\sum_{j=1}^{k}\mathbf{x}^j$ and $\bar{\mathbf{z}}^k = \tfrac{1}{k}\sum_{j=1}^{k}\mathbf{z}^j$.
Definition: Primal and Dual Residuals
Primal and Dual Residuals
The primal residual measures constraint violation: $\mathbf{r}^k = \mathbf{x}^k - \mathbf{z}^k$. The dual residual measures the change between consecutive $\mathbf{z}$-iterates: $\mathbf{s}^k = \rho\,(\mathbf{z}^k - \mathbf{z}^{k-1})$. Standard stopping criterion: stop when both $\|\mathbf{r}^k\|_2 \le \epsilon^{\text{pri}}$ and $\|\mathbf{s}^k\|_2 \le \epsilon^{\text{dual}}$, where the tolerances combine an absolute part and a relative part: $\epsilon^{\text{pri}} = \sqrt{n}\,\epsilon^{\text{abs}} + \epsilon^{\text{rel}}\max\{\|\mathbf{x}^k\|_2, \|\mathbf{z}^k\|_2\}$ and $\epsilon^{\text{dual}} = \sqrt{n}\,\epsilon^{\text{abs}} + \epsilon^{\text{rel}}\,\rho\,\|\mathbf{u}^k\|_2$.
Small primal residual = constraint nearly satisfied. Small dual residual = the $\mathbf{z}$-iterate has settled. Both must be small: monitoring only one can give false convergence. Boyd et al. (2011) recommend defaults on the order of $\epsilon^{\text{abs}} = 10^{-4}$ and $\epsilon^{\text{rel}} = 10^{-3}$.
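A direct transcription of this stopping test for the LASSO splitting $\mathbf{x} - \mathbf{z} = \mathbf{0}$; the function name and argument layout are mine:

```python
import numpy as np

def admm_converged(x, z, z_old, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    """Combined absolute/relative stopping test for the LASSO splitting x - z = 0."""
    n = x.size
    r_norm = np.linalg.norm(x - z)              # primal residual ||r||
    s_norm = rho * np.linalg.norm(z - z_old)    # dual residual ||s||
    eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * rho * np.linalg.norm(u)
    return (r_norm <= eps_pri) and (s_norm <= eps_dual)
```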
Example: ADMM on a 2D LASSO
With a given $\mathbf{A}$, $\mathbf{y}$, $\lambda$, $\rho$ (a two-dimensional $\mathbf{x}$) and zero initialization ($\mathbf{z}^0 = \mathbf{u}^0 = \mathbf{0}$), perform one full ADMM iteration.
$\mathbf{x}$-update
With $\mathbf{z}^0 = \mathbf{u}^0 = \mathbf{0}$, the normal equations reduce to $(\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})\,\mathbf{x}^1 = \mathbf{A}^\top\mathbf{y}$, so $\mathbf{x}^1 = (\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I})^{-1}\mathbf{A}^\top\mathbf{y}$: a plain ridge-regression solve.
$\mathbf{z}$-update
$\mathbf{z}^1 = S_{\lambda/\rho}(\mathbf{x}^1 + \mathbf{u}^0) = S_{\lambda/\rho}(\mathbf{x}^1)$: components of $\mathbf{x}^1$ with magnitude at most $\lambda/\rho$ are set to zero, and the remaining components are shrunk toward zero by $\lambda/\rho$.
$\mathbf{u}$-update
$\mathbf{u}^1 = \mathbf{u}^0 + \mathbf{x}^1 - \mathbf{z}^1 = \mathbf{x}^1 - \mathbf{z}^1$. The primal residual $\mathbf{r}^1 = \mathbf{x}^1 - \mathbf{z}^1$ is nonzero, indicating that $\mathbf{x}$ and $\mathbf{z}$ still disagree; more iterations are needed.
Verification against closed-form LASSO
When $\mathbf{A}^\top\mathbf{A} = \mathbf{I}$, the LASSO solution is componentwise soft-thresholding, $x_i^\star = S_\lambda\big((\mathbf{A}^\top\mathbf{y})_i\big)$. ADMM will drive $\mathbf{z}^k$ toward $\mathbf{x}^\star$ as $k \to \infty$.
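The numerical values of the original example are not reproduced above; as a stand-in, the snippet below runs the same three steps on a small hypothetical 2D instance (all values are my own assumptions, not the book's) so each step can be checked against the closed form:

```python
import numpy as np

# Hypothetical 2D instance -- illustrative values only, not the book's numbers.
A = np.eye(2)                      # A^T A = I, so the closed-form check applies
y = np.array([3.0, 0.5])
lam, rho = 1.0, 1.0

x = z = u = np.zeros(2)            # zero initialization

# One full ADMM iteration.
x = np.linalg.solve(A.T @ A + rho * np.eye(2), A.T @ y + rho * (z - u))   # -> [1.5, 0.25]
z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)           # -> [0.5, 0.0]
u = u + x - z                                                             # -> [1.0, 0.25]
print("x:", x, "z:", z, "u:", u, "||r||:", np.linalg.norm(x - z))

# Closed form for A^T A = I: x*_i = soft_threshold((A^T y)_i, lam) -> [2.0, 0.0]
print("closed-form LASSO:", np.sign(y) * np.maximum(np.abs(y) - lam, 0.0))
```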
ADMM Primal and Dual Residuals
Track both residuals as a function of iteration on the same synthetic problem used in the ISTA/FISTA comparison. Play with $\rho$: too small a penalty wastes dual updates; too large a penalty stalls the $\mathbf{x}$-update.
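One quick way to run this experiment, reusing the `admm_lasso` sketch from above; the synthetic problem here is a stand-in of my own, not the book's ISTA/FISTA test problem:

```python
import numpy as np
# assumes admm_lasso from the earlier sketch is in scope

rng = np.random.default_rng(0)
m, n = 50, 200
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=10, replace=False)] = rng.standard_normal(10)
y = A @ x_true + 0.01 * rng.standard_normal(m)

for rho in (0.01, 1.0, 100.0):   # too small / moderate / too large
    *_, r_hist, s_hist = admm_lasso(A, y, lam=0.1, rho=rho, n_iter=100)
    print(f"rho={rho:7.2f}: final primal {r_hist[-1]:.2e}, dual {s_hist[-1]:.2e}")
```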
Choosing and Adapting $\rho$
The performance of ADMM is notoriously sensitive to $\rho$. Too small: the primal residual decays slowly. Too large: the dual residual decays slowly. Boyd's recipe: if $\|\mathbf{r}^k\|_2 > \mu\,\|\mathbf{s}^k\|_2$, increase $\rho \leftarrow \tau\rho$; if $\|\mathbf{s}^k\|_2 > \mu\,\|\mathbf{r}^k\|_2$, decrease $\rho \leftarrow \rho/\tau$. Typical values: $\mu = 10$, $\tau = 2$. When $\rho$ changes, rescale the scaled dual to keep the augmented Lagrangian consistent: $\mathbf{u} \leftarrow \mathbf{u}\,(\rho_{\text{old}}/\rho_{\text{new}})$. A small sketch of this rule appears after the tips below.
- Cache the Cholesky factorization only when $\rho$ is held fixed for many iterations.
- For parallel ADMM (consensus form), communication cost often dominates; choose $\rho$ to minimize total wall-time, not iteration count.
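A minimal sketch of the residual-balancing rule described above; the helper name is mine and the thresholds are the typical values quoted:

```python
def update_rho(rho, u, r_norm, s_norm, mu=10.0, tau=2.0):
    """Residual balancing (Boyd et al., 2011): adapt rho and rescale the scaled dual u."""
    if r_norm > mu * s_norm:            # primal residual lagging -> increase rho
        return rho * tau, u / tau       # u = v / rho, so u shrinks when rho grows
    if s_norm > mu * r_norm:            # dual residual lagging -> decrease rho
        return rho / tau, u * tau
    return rho, u
```

Whenever $\rho$ actually changes, the cached Cholesky factor of $\mathbf{A}^\top\mathbf{A} + \rho\mathbf{I}$ must be recomputed, which is why the first tip above recommends caching it only while $\rho$ is held fixed.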
Why This Matters: ADMM for Massive-MIMO Detection
Modern massive-MIMO receivers (5G/6G) solve regularized least-squares problems in which each user's signal is embedded in a massive system matrix $\mathbf{A}$. ADMM-based detectors (splitting across base-station antennas) achieve per-subcarrier latencies compatible with sub-millisecond frame structures, whereas interior-point solvers cannot meet the latency budget. The same machinery underpins activity detection in grant-free random access, which we meet in Chapter 15.
See full treatment in Chapter 15
Common Mistake: Stopping Only on Primal Residual
Mistake:
Declaring convergence when the primal residual $\|\mathbf{r}^k\|_2$ is small.
Correction:
A small primal residual means feasibility, not optimality. Always check the dual residual as well. A small primal residual with a large dual residual indicates that the $\mathbf{z}$-updates are still moving: the iterate is feasible but not yet optimal.
Common Mistake: Letting $\rho \to 0$ for a Tight Fit
Mistake:
Reducing $\rho$ toward zero so as to "not distort" the original objective.
Correction:
As $\rho \to 0$, the $\mathbf{z}$-update's effective threshold $\lambda/\rho$ blows up and the $\mathbf{z}$-iterate collapses to the zero vector every iteration. ADMM converges for any $\rho > 0$, but the practical sweet spot is a moderate $\rho$, found by tuning or by the adaptive rule above.
Historical Note: ADMM: From 1970s Numerical Analysis to 2010s Machine Learning
1975-2011. ADMM was introduced by Glowinski and Marrocco (1975) and Gabay and Mercier (1976) for partial differential equations, and placed on firm convergence footing by Lions and Mercier (1979). It lay dormant in numerical analysis for decades until Boyd, Parikh, Chu, Peleato, and Eckstein's 2011 monograph Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers popularized it in machine learning and signal processing. The monograph reframed ADMM as a general recipe for distributed convex problems, showed how to split over data, features, or constraints, and sparked a decade of research, from consensus optimization to the split Bregman method for TV denoising.
ADMM
Alternating Direction Method of Multipliers. Applies Douglas-Rachford splitting to the dual of a separable convex problem. Alternates primal minimizations over decoupled blocks with a dual update.
Related: ADMM for the LASSO
Augmented Lagrangian
The Lagrangian with an added quadratic penalty on constraint violation. Converts equality-constrained problems into unconstrained problems whose minimizers coincide with the original.
Quick Check
Which of the three ADMM steps for LASSO is the one that injects sparsity?
The $\mathbf{x}$-update (ridge regression).
The $\mathbf{z}$-update (soft-threshold).
The $\mathbf{u}$-update (dual).
None of them individually; sparsity only emerges at convergence.
Soft-thresholding kills components below $\lambda/\rho$. The sparsity ends up in $\mathbf{z}$, which equals $\mathbf{x}$ at convergence.
Quick Check
After 100 iterations, ADMM reports a tiny primal residual but a dual residual that is still large. What happened?
The algorithm has converged; ignore the dual residual.
$\rho$ was too large, making the $\mathbf{x}$-updates stall.
$\rho$ was too small.
The LASSO has no solution.
A large $\rho$ weights the constraint heavily, so $\mathbf{x}$ tracks $\mathbf{z}$ almost from the first iteration (tiny primal residual), but the pair drifts only slowly toward the minimizer, so the dual residual $\rho\,(\mathbf{z}^{k+1} - \mathbf{z}^k)$ remains large.
Key Takeaway
ADMM splits the LASSO into a smooth ridge regression ($\mathbf{x}$-update) and a soft-threshold ($\mathbf{z}$-update) linked through a scaled dual variable $\mathbf{u}$. It converges for any $\rho > 0$ at rate $O(1/k)$, is trivially parallelizable, and handles regularizers that FISTA cannot (group LASSO, TV, analysis-$\ell_1$). Its Achilles heel is the sensitivity to $\rho$; adaptive schemes from Boyd et al. (2011) are essential.