Implementing Proximal Operators

Why Proximal Operators?

Many modern optimization problems have objectives that are the sum of a smooth term and a non-smooth regularizer: $\min_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x})$. Gradient descent cannot handle the non-smooth $g$ directly. Proximal operators generalize the gradient step to non-smooth functions, enabling algorithms like ISTA, FISTA, and ADMM.

In signal processing, proximal operators implement denoising, sparsity promotion, and constraint projection as simple closed-form operations.

Definition: Proximal Operator

The proximal operator of a function $g : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ with step size $\lambda > 0$ is:

$$\mathrm{prox}_{\lambda g}(\mathbf{v}) = \arg\min_{\mathbf{x}} \left\{ g(\mathbf{x}) + \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|_2^2 \right\}$$

Interpretation: find the point that balances minimizing $g$ against staying close to $\mathbf{v}$. When $g = 0$, the proximal operator is the identity. When $g$ is the indicator function of a set $C$ ($g(\mathbf{x}) = 0$ if $\mathbf{x} \in C$, $+\infty$ otherwise), the proximal operator is the projection onto $C$.

The proximal operator is well defined (the minimizer exists and is unique) when $g$ is closed, proper, and convex.
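Before deriving closed forms, it can help to sanity-check a prox numerically by solving the defining minimization directly. A minimal sketch, assuming NumPy and SciPy are available; the helper prox_numeric is hypothetical, for illustration only, and Nelder-Mead is one reasonable choice for the non-smooth objective:

import numpy as np
from scipy.optimize import minimize

def prox_numeric(g, v, lam):
    """Evaluate prox_{lam*g}(v) by direct numerical minimization (illustrative only)."""
    obj = lambda x: g(x) + np.sum((x - v) ** 2) / (2 * lam)
    # Nelder-Mead tolerates the non-smooth term; the result is approximate
    return minimize(obj, x0=np.asarray(v, dtype=float), method="Nelder-Mead").x

v = np.array([2.0, -0.3, 0.7])
print(prox_numeric(lambda x: np.sum(np.abs(x)), v, lam=0.5))
# approximately [1.5, 0.0, 0.2], matching the soft-thresholding closed form derived next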

Definition: Soft-Thresholding Operator

The proximal operator of $g(\mathbf{x}) = \|\mathbf{x}\|_1$ is the soft-thresholding (shrinkage) operator, applied element-wise:

$$\mathcal{S}_\lambda(v_i) = \mathrm{sign}(v_i)\,\max(|v_i| - \lambda, 0)$$

import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||x||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0)

This shrinks each component toward zero by $\lambda$. Components with $|v_i| \le \lambda$ are set exactly to zero, producing sparsity: the mechanism behind LASSO and compressed sensing.
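A quick check of the shrinkage behavior on arbitrary values:

v = np.array([3.0, -0.5, 1.2, 0.1, -2.0])
print(soft_threshold(v, lam=1.0))
# [ 2.  -0.   0.2  0.  -1. ]: everything inside [-1, 1] collapses to zero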

Definition: Projection as Proximal Operator

The proximal operator of the indicator function $\iota_C$ of a closed convex set $C$ is the Euclidean projection:

$$\mathrm{prox}_{\iota_C}(\mathbf{v}) = \Pi_C(\mathbf{v}) = \arg\min_{\mathbf{x} \in C} \|\mathbf{x} - \mathbf{v}\|_2$$

Common projections:

# Projection onto non-negative orthant
def proj_nonneg(v):
    return np.maximum(v, 0)

# Projection onto L2 ball of radius r
def proj_l2_ball(v, r):
    norm_v = np.linalg.norm(v)
    return v * min(1, r / norm_v) if norm_v > 0 else v

# Projection onto simplex {x >= 0, sum(x) = 1}
def proj_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al.)."""
    n = len(v)
    u = np.sort(v)[::-1]                 # sort components in descending order
    cssv = np.cumsum(u) - 1              # cumulative sums, shifted by the target sum 1
    rho = np.nonzero(u > cssv / np.arange(1, n + 1))[0][-1]  # last index where the condition holds
    theta = cssv[rho] / (rho + 1)        # optimal threshold
    return np.maximum(v - theta, 0)
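For example, the simplex projection of an arbitrary hand-picked vector returns a valid probability vector:

v = np.array([0.9, 0.6, -0.2])
x = proj_simplex(v)
print(x, x.sum())  # [0.65 0.35 0.  ] 1.0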

Definition: Group Soft-Thresholding

For the group LASSO penalty $g(\mathbf{x}) = \sum_g \|\mathbf{x}_g\|_2$ (the mixed $\ell_{2,1}$ norm), the proximal operator applies block-wise shrinkage:

$$[\mathrm{prox}_{\lambda g}(\mathbf{v})]_g = \mathbf{v}_g \cdot \max\!\left(1 - \frac{\lambda}{\|\mathbf{v}_g\|_2}, 0\right)$$

This sets entire groups to zero (group sparsity) rather than individual elements:

def group_soft_threshold(v, lam, groups):
    """Proximal of lam * sum_g ||x_g||_2."""
    result = np.zeros_like(v)
    for g in groups:
        vg = v[g]
        norm_vg = np.linalg.norm(vg)
        if norm_vg > lam:
            result[g] = vg * (1 - lam / norm_vg)
    return result
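A small demonstration with two hand-picked groups; the entire second group is zeroed because its norm falls below the threshold:

v = np.array([3.0, 4.0, 0.3, 0.4])
groups = [[0, 1], [2, 3]]
print(group_soft_threshold(v, lam=1.0, groups=groups))
# [2.4 3.2 0.  0. ]: group 0 (norm 5) is shrunk by factor 0.8, group 1 (norm 0.5) is zeroed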

Definition: Total Variation (TV) Proximal Operator

The total variation of a 1D signal is $\mathrm{TV}(\mathbf{x}) = \sum_{i=1}^{n-1} |x_{i+1} - x_i|$, which promotes piecewise-constant solutions.

The proximal operator of TV has no simple closed form but can be computed in $O(n)$ time using the taut-string algorithm or via iterative methods:

def prox_tv_1d(v, lam, n_iter=100):
    """Proximal of lam * TV(x) via projected gradient on the dual (Chambolle)."""
    n = len(v)
    p = np.zeros(n - 1)  # dual variable, one entry per adjacent pair
    for _ in range(n_iter):
        # Primal estimate from the current dual: x = v - lam * D^T p
        x = v - lam * _adj_diff(p, n)
        # Gradient step on the dual using the forward difference of the primal;
        # step 1/(4*lam) is a safe choice since ||D D^T||_2 <= 4
        dx = np.diff(x)
        p = p + dx / (4 * lam)
        # Project the dual variable onto the box [-1, 1]
        p = np.clip(p, -1, 1)
    return v - lam * _adj_diff(p, n)

def _adj_diff(p, n):
    """Adjoint of forward difference: -div."""
    d = np.zeros(n)
    d[0] = -p[0]
    d[1:-1] = p[:-1] - p[1:]
    d[-1] = p[-1]
    return d
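To see the piecewise-constant effect, one can denoise a noisy step signal; the signal, noise level, and lam below are arbitrary test choices:

rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + 0.1 * rng.standard_normal(100)
denoised = prox_tv_1d(noisy, lam=0.5)
# denoised is nearly constant on each half while the jump at index 50 survives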

Theorem: ISTA (Iterative Shrinkage-Thresholding Algorithm)

For $\min_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x})$ where $f$ is $L$-smooth (i.e., $\nabla f$ is $L$-Lipschitz) and $g$ is convex, the iteration:

$$\mathbf{x}_{k+1} = \mathrm{prox}_{\frac{1}{L} g}\!\left( \mathbf{x}_k - \frac{1}{L}\nabla f(\mathbf{x}_k)\right)$$

converges to the global minimum with rate $O(1/k)$: $f(\mathbf{x}_k) + g(\mathbf{x}_k) - f^\star - g^\star \le \frac{L\|\mathbf{x}_0 - \mathbf{x}^\star\|^2}{2k}$.

FISTA (Fast ISTA) adds Nesterov momentum to achieve $O(1/k^2)$: $\mathbf{y}_{k+1} = \mathbf{x}_k + \frac{k-1}{k+2}(\mathbf{x}_k - \mathbf{x}_{k-1})$, then the proximal step is applied to $\mathbf{y}_{k+1}$.

ISTA splits the problem: gradient descent handles the smooth part, the proximal operator handles the non-smooth part. Each iteration takes a gradient step on $f$, then "cleans up" with the proximal operator of $g$.

Example: ISTA for LASSO

Implement ISTA to solve the LASSO problem $\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_1$.
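A minimal sketch of a solution, reusing soft_threshold from above; the problem sizes, sparsity pattern, and regularization heuristic below are arbitrary test choices, and a FISTA variant with the momentum rule from the theorem is included for comparison:

def ista_lasso(A, b, lam, n_iter=500):
    """ISTA for 0.5 * ||Ax - b||_2^2 + lam * ||x||_1."""
    L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                   # gradient of the smooth part
        x = soft_threshold(x - grad / L, lam / L)  # prox of (lam/L) * ||.||_1
    return x

def fista_lasso(A, b, lam, n_iter=500):
    """FISTA: ISTA plus Nesterov momentum for the O(1/k^2) rate."""
    L = np.linalg.norm(A.T @ A, 2)
    x = y = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        x_new = soft_threshold(y - A.T @ (A @ y - b) / L, lam / L)
        y = x_new + (k - 1) / (k + 2) * (x_new - x)  # momentum step from the theorem
        x = x_new
    return x

# Sparse recovery test: 3 nonzeros, 50 measurements of a length-100 signal
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [1.0, -2.0, 1.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
lam = 0.1 * np.max(np.abs(A.T @ b))   # common heuristic: a fraction of ||A^T b||_inf
x_hat = fista_lasso(A, b, lam)
print(np.flatnonzero(np.abs(x_hat) > 0.1))  # should recover a support close to [5, 37, 80]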

Proximal Operator Visualization

Visualize how soft-thresholding, projection, and group soft-thresholding transform input signals.


Proximal Operators and ISTA/FISTA

Complete implementations of soft-thresholding, projections, group soft-thresholding, the TV proximal operator, ISTA, and FISTA are in ch08/python/proximal_operators.py.

Quick Check

What does the soft-thresholding operator $\mathcal{S}_\lambda(v)$ do to components with $|v_i| < \lambda$?

Leaves them unchanged

Sets them to zero

Doubles them

Scales them by lambda

Common Mistake: Wrong Step Size in ISTA

Mistake:

Using step size $1/L$ where $L$ is guessed instead of computed. If $L$ is too small (so the step $1/L$ is too large), ISTA diverges.

Correction:

Compute $L = \|\mathbf{A}^T\mathbf{A}\|_2$ (the largest singular value of $\mathbf{A}$, squared) for least-squares problems: L = np.linalg.norm(A.T @ A, 2). Alternatively, use backtracking line search to adaptively find $L$, as sketched below.
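A minimal backtracking sketch, assuming generic callables f, grad_f, and prox(point, step); the helper name and the defaults L0 and eta are illustrative choices, not a fixed API:

def ista_step_backtracking(x, f, grad_f, prox, L0=1.0, eta=2.0):
    """One proximal-gradient step that adapts L by backtracking (illustrative)."""
    L, g, fx = L0, grad_f(x), f(x)
    while True:
        x_new = prox(x - g / L, 1.0 / L)
        d = x_new - x
        # Accept once f lies below its quadratic upper model at x_new
        if f(x_new) <= fx + g @ d + 0.5 * L * (d @ d):
            return x_new, L
        L *= eta  # step 1/L was too large: increase L and retry

In a full solver, the accepted L would typically be passed back as L0 for the next iteration, so backtracking only triggers when the local curvature grows.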

Key Takeaway

Proximal operators are the building blocks of modern optimization. Soft-thresholding gives LASSO, projection gives constrained optimization, group soft-thresholding gives group LASSO, and TV proximal gives piecewise-constant denoising. Memorize these four operators and you can implement most first-order optimization algorithms.

Why This Matters: Compressed Sensing in Wireless Channel Estimation

Wireless channels are often sparse in the delay-Doppler domain: only a few paths carry significant energy. ISTA and FISTA with soft-thresholding (the LASSO proximal) recover these sparse channels from far fewer pilot symbols than traditional least-squares estimation requires. This is the basis of compressed sensing channel estimation in 5G mmWave systems.

See full treatment in Chapter 15

Proximal Operator

The mapping $\mathrm{prox}_{\lambda g}(\mathbf{v}) = \arg\min_{\mathbf{x}} g(\mathbf{x}) + \frac{1}{2\lambda}\|\mathbf{x} - \mathbf{v}\|^2$, generalizing the gradient step to non-smooth functions.

Related: Soft-Thresholding

Soft-Thresholding

The element-wise operator $\mathcal{S}_\lambda(v_i) = \mathrm{sign}(v_i)\max(|v_i| - \lambda, 0)$, which is the proximal operator of the $\ell_1$ norm.

Related: Proximal Operator