Non-Convex Methods: Wirtinger Flow

Wirtinger Flow β€” Non-Convex Gradient Descent for Phase Retrieval

Wirtinger flow (Candes, Li, and Soltanolkotabi, 2015) takes a radically different approach from PhaseLift: instead of convexifying the problem, it directly minimizes a non-convex loss function using gradient descent. Surprisingly, with a careful spectral initialization and sufficient measurements, gradient descent converges to the global optimum β€” despite the loss landscape having many saddle points.

The point is that the loss landscape, while non-convex globally, has no spurious local minima in a basin around the true solution. Spectral initialization places us inside this basin, and gradient descent does the rest. Wirtinger flow is the practical workhorse of modern phase retrieval.

Definition:

The Intensity Loss Function

Wirtinger flow minimizes the intensity-based loss:

$$f(\mathbf{z}) = \frac{1}{4M}\sum_{i=1}^{M} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i\right)^2.$$

This is a quartic function of $\mathbf{z}$ — non-convex with potentially many saddle points.

Alternative: amplitude loss.

$$g(\mathbf{z}) = \frac{1}{2M}\sum_{i=1}^{M} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle| - \sqrt{y_i}\right)^2.$$

The amplitude loss has better conditioning near the solution but is non-smooth where $\langle \mathbf{a}_i, \mathbf{z}\rangle = 0$. Truncated variants exist for both losses.
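As a concrete sketch (not from the text), both losses are a few lines of NumPy. The convention assumed here is that the rows of `A` store $\mathbf{a}_i^H$, so that `A @ z` computes the inner products $\langle \mathbf{a}_i, \mathbf{z}\rangle$:

```python
import numpy as np

def intensity_loss(z, A, y):
    """f(z) = (1/4M) * sum_i (|<a_i, z>|^2 - y_i)^2; rows of A hold a_i^H."""
    r = np.abs(A @ z) ** 2 - y            # residuals against intensities
    return np.sum(r ** 2) / (4 * len(y))

def amplitude_loss(z, A, y):
    """g(z) = (1/2M) * sum_i (|<a_i, z>| - sqrt(y_i))^2; non-smooth at <a_i,z> = 0."""
    r = np.abs(A @ z) - np.sqrt(y)        # residuals against amplitudes
    return np.sum(r ** 2) / (2 * len(y))
```

Both losses vanish exactly at the true signal (and at any of its global phase rotations), which is what makes them sensible objectives for phase retrieval.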

Wirtinger Calculus

A framework for differentiating real-valued functions of complex variables. For $f : \mathbb{C}^N \to \mathbb{R}$, the Wirtinger gradient $\nabla_{\bar{\mathbf{z}}} f$ provides the steepest-ascent direction, and $-\nabla_{\bar{\mathbf{z}}} f$ is the gradient descent direction. Essential for optimizing over complex-valued signals.
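For reference, the standard Wirtinger operators (textbook definitions, not specific to this text) treat $z$ and $\bar z$ as formally independent variables:

```latex
\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\,\frac{\partial f}{\partial y}\right),
\qquad
\frac{\partial f}{\partial \bar z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\,\frac{\partial f}{\partial y}\right),
\qquad z = x + iy.
```

A quick sanity check: for $f(z) = |z|^2 = z\bar z$, differentiating with respect to $\bar z$ gives $\partial f/\partial \bar z = z$, so $-\nabla_{\bar z} f = -z$ points straight at the minimizer $z = 0$, as a descent direction should.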

Related: The Intensity Loss Function

Theorem: Wirtinger Gradient of the Intensity Loss

For the intensity loss $f(\mathbf{z}) = \frac{1}{4M}\sum_{i=1}^{M} (|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i)^2$, the Wirtinger gradient is:

$$\nabla_{\bar{\mathbf{z}}} f(\mathbf{z}) = \frac{1}{2M}\sum_{i=1}^{M} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i\right) \langle \mathbf{a}_i, \mathbf{z}\rangle \,\mathbf{a}_i.$$

Each term in the sum is the residual $(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i)$ times the inner product $\langle \mathbf{a}_i, \mathbf{z}\rangle$ times the measurement direction $\mathbf{a}_i$. Measurements with large residuals contribute more to the gradient — this is why truncation helps.
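A minimal NumPy sketch of this gradient (assuming, as before, that the rows of `A` store $\mathbf{a}_i^H$, so the columns of `A.conj().T` are the $\mathbf{a}_i$):

```python
import numpy as np

def wirtinger_grad(z, A, y):
    """Wirtinger gradient of the intensity loss:
    (1/2M) * sum_i (|<a_i,z>|^2 - y_i) <a_i,z> a_i.
    Rows of A hold a_i^H, so A.conj().T maps residuals back to signal space."""
    Az = A @ z                        # inner products <a_i, z>
    r = np.abs(Az) ** 2 - y           # intensity residuals
    return A.conj().T @ (r * Az) / (2 * len(y))
```

Stepping $\mathbf{z} \mapsto \mathbf{z} - \mu\,\nabla_{\bar{\mathbf{z}}} f$ decreases $f$ for a small enough step, since $df = 2\,\mathrm{Re}\,\langle \nabla_{\bar{\mathbf{z}}} f, d\mathbf{z}\rangle$.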

Definition:

Spectral Initialization

The convergence of Wirtinger flow depends critically on initialization. Spectral initialization provides a starting point in the basin of attraction of the global optimum.

Algorithm:

  1. Form the weighted matrix $\mathbf{Y} = \frac{1}{M}\sum_{i=1}^{M} y_i\,\mathbf{a}_i\mathbf{a}_i^H$.
  2. Compute the leading eigenvector: $\mathbf{z}^{(0)} = \sqrt{\lambda_1(\mathbf{Y})}\, \mathbf{v}_1(\mathbf{Y})$.

Intuition: $\mathbb{E}[\mathbf{Y}] = \mathbf{x}_0\mathbf{x}_0^H + \|\mathbf{x}_0\|^2\mathbf{I}$, so the leading eigenvector of $\mathbf{Y}$ approximates $\mathbf{x}_0$ (up to global phase).

Quality: With $M \geq CN$ Gaussian measurements:

$$\text{dist}(\mathbf{z}^{(0)}, \mathbf{x}_0) \leq \epsilon_0\|\mathbf{x}_0\|,$$

where $\text{dist}(\mathbf{z}, \mathbf{x}_0) = \min_{\phi} \|\mathbf{z} - e^{i\phi}\mathbf{x}_0\|$ accounts for the global-phase ambiguity, and $\epsilon_0 \to 0$ as $M/N \to \infty$. Typically $M \geq 6N$ gives $\epsilon_0 \lesssim 0.3$.
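The two steps above can be sketched directly in NumPy (dense eigendecomposition for clarity; rows of `A` again assumed to hold $\mathbf{a}_i^H$):

```python
import numpy as np

def spectral_init(A, y):
    """Spectral initialization: z0 = sqrt(lambda_1(Y)) * v1(Y), where
    Y = (1/M) sum_i y_i a_i a_i^H. Rows of A hold a_i^H."""
    M = len(y)
    Y = (A.conj().T * y) @ A / M       # (1/M) sum_i y_i a_i a_i^H
    vals, vecs = np.linalg.eigh(Y)     # Hermitian eigenpairs, ascending order
    return np.sqrt(vals[-1]) * vecs[:, -1]
```

In practice one would run a few power or Lanczos iterations instead of a full eigendecomposition, keeping the cost near $O(MN)$.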


Spectral Initialization

An initialization strategy for non-convex phase retrieval that computes the leading eigenvector of the weighted matrix $\mathbf{Y} = \frac{1}{M}\sum_i y_i \mathbf{a}_i\mathbf{a}_i^H$. Provides a starting point within a constant factor of the true signal, enabling gradient descent to converge globally.

Related: The Intensity Loss Function

Wirtinger Flow Algorithm

Complexity: $O(MN)$ per iteration; $O(MN^2\log(1/\epsilon))$ total with $T = O(N\log(1/\epsilon))$ iterations
Input: Measurements $\{(\mathbf{a}_i, y_i)\}_{i=1}^M$,
step size $\mu$, iterations $T$
Output: Estimate $\hat{\mathbf{x}}$
1. Spectral initialization:
$\mathbf{Y} \leftarrow \frac{1}{M}\sum_{i=1}^M y_i\,\mathbf{a}_i\mathbf{a}_i^H$
2. $\mathbf{z}^{(0)} \leftarrow \sqrt{\lambda_1(\mathbf{Y})}\,\mathbf{v}_1(\mathbf{Y})$
3. for $t = 0, 1, \ldots, T-1$ do
4. $\quad \nabla \leftarrow \frac{1}{2M}\sum_{i=1}^M (|\langle\mathbf{a}_i, \mathbf{z}^{(t)}\rangle|^2 - y_i)\,\langle\mathbf{a}_i, \mathbf{z}^{(t)}\rangle\,\mathbf{a}_i$
5. $\quad \mathbf{z}^{(t+1)} \leftarrow \mathbf{z}^{(t)} - \frac{\mu}{\|\mathbf{z}^{(0)}\|^2}\,\nabla$
6. end for
7. return $\hat{\mathbf{x}} = \mathbf{z}^{(T)}$

The normalization by $\|\mathbf{z}^{(0)}\|^2$ in line 5 ensures scale invariance: the step size does not depend on the signal energy.
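Putting the pieces together, a compact NumPy version of the whole loop (illustrative defaults for `mu` and `T`, not the paper's tuned schedule; rows of `A` hold $\mathbf{a}_i^H$):

```python
import numpy as np

def wirtinger_flow(A, y, mu=0.2, T=1000):
    """Wirtinger flow: spectral initialization followed by gradient descent
    on the intensity loss. Rows of A hold a_i^H."""
    M = len(y)
    # Steps 1-2: spectral initialization
    Y = (A.conj().T * y) @ A / M
    vals, vecs = np.linalg.eigh(Y)
    z = np.sqrt(vals[-1]) * vecs[:, -1]
    scale = np.linalg.norm(z) ** 2          # ||z0||^2 fixes the step scale
    # Steps 3-6: gradient iterations
    for _ in range(T):
        Az = A @ z
        grad = A.conj().T @ ((np.abs(Az) ** 2 - y) * Az) / (2 * M)
        z = z - (mu / scale) * grad
    return z
```

Recovery is only defined up to a global phase, so the error of the output should be measured after rotating the estimate to best align with the truth.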

Theorem: Convergence Guarantee for Wirtinger Flow

Suppose $\mathbf{a}_i \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$, $M \geq CN\log N$ (for a universal constant $C$), and $\mathbf{z}^{(0)}$ is the spectral initialization. Then with step size $\mu \leq c_0/N$ in the normalized update $\mathbf{z}^{(t+1)} = \mathbf{z}^{(t)} - (\mu/\|\mathbf{z}^{(0)}\|^2)\nabla$, Wirtinger flow converges linearly:

$$\text{dist}(\mathbf{z}^{(t)}, \mathbf{x}_0) \leq \left(1 - \frac{c_1}{N}\right)^t \cdot \text{dist}(\mathbf{z}^{(0)}, \mathbf{x}_0),$$

with high probability. After $T = O(N\log(1/\epsilon))$ iterations, the relative error is at most $\epsilon$.

Total complexity: $O(MN \cdot N\log(1/\epsilon)) = O(MN^2\log(1/\epsilon))$ with Gaussian measurements. With FFT structure: $O(MN\log N\log(1/\epsilon))$.

The loss landscape has no spurious local minima in a neighborhood of $\mathbf{x}_0$ (and its global phase rotations). Spectral initialization lands inside this neighborhood, and the restricted strong convexity of the loss in this region drives linear convergence.

Historical Note: Gerchberg--Saxton: The Grandfather of Phase Retrieval (1972)


The Gerchberg--Saxton (GS) algorithm, published in 1972 by Ralph Gerchberg and Owen Saxton, was the first practical algorithm for phase retrieval from Fourier magnitudes. It alternates between the Fourier and spatial domains, replacing the magnitude in each domain with the known constraints.

GS is an alternating projection method: it projects back and forth between the set of signals consistent with the Fourier magnitude and the set consistent with spatial constraints (support, non-negativity). When both constraint sets are convex, alternating projections converge; on non-convex sets (like the magnitude constraint), convergence to the global optimum is not guaranteed.

Despite lacking convergence guarantees, GS and its descendants (Fienup's Hybrid Input-Output, error reduction) remained the standard algorithms for four decades, until Wirtinger flow provided the first provably convergent alternative.

Definition:

Truncated Wirtinger Flow

Truncated Wirtinger Flow (Chen and Candes, 2017) discards measurements with large residuals from the gradient computation:

$$\nabla_{\bar{\mathbf{z}}} f_{\text{trunc}} = \frac{1}{2M}\sum_{i \in \mathcal{T}^{(t)}} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i\right) \langle \mathbf{a}_i, \mathbf{z}\rangle\,\mathbf{a}_i,$$

where $\mathcal{T}^{(t)} = \{i : |\langle \mathbf{a}_i, \mathbf{z}^{(t)}\rangle| \leq \alpha_h \sqrt{y_i}\} \cap \{i : \big||\langle \mathbf{a}_i, \mathbf{z}^{(t)}\rangle|^2 - y_i\big| \leq \alpha_y y_i + \beta\}$.

Key improvement: Truncation reduces the measurement requirement from $M = O(N\log N)$ to $M = O(N)$ — optimal up to constants.
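A sketch of the truncated gradient under the trimming rule above, taking $\beta = 0$ and illustrative threshold values $\alpha_h, \alpha_y$ (rows of `A` hold $\mathbf{a}_i^H$ as in the earlier sketches):

```python
import numpy as np

def truncated_grad(z, A, y, alpha_h=5.0, alpha_y=3.0):
    """Truncated Wirtinger gradient: sum only over measurements that pass
    both trimming tests. beta = 0 here; thresholds are illustrative."""
    M = len(y)
    Az = A @ z
    r = np.abs(Az) ** 2 - y
    keep = (np.abs(Az) <= alpha_h * np.sqrt(y)) & (np.abs(r) <= alpha_y * y)
    return A[keep].conj().T @ (r[keep] * Az[keep]) / (2 * M)
```

Note the normalization keeps $1/(2M)$ with the full $M$, so trimmed measurements simply contribute zero rather than reweighting the rest.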

Other variants:

  • Reshaped Wirtinger flow: Uses the amplitude loss gg with truncation, achieving optimal M=O(N)M = O(N) without the log factor.
  • Incremental (stochastic) Wirtinger flow: Uses one measurement per iteration for O(N)O(N) per-step cost β€” suitable for streaming data.

Wirtinger Flow Convergence

Demonstrates Wirtinger flow convergence on a 1D signal recovery task. The plot shows the relative error $\text{dist}(\mathbf{z}^{(t)}, \mathbf{x}_0)/\|\mathbf{x}_0\|$ versus iteration number.

Adjust the measurement ratio $M/N$ and noise level to observe:

  • Linear convergence after spectral initialization
  • Faster convergence with more measurements
  • Error floor determined by noise level
  • Effect of truncation on convergence speed

Example: Wirtinger Flow vs. PhaseLift: Speed and Accuracy

Setup: $N = 128$, $M = 512$ Gaussian measurements, SNR = 30 dB.

Compare recovery quality, computation time, and memory for PhaseLift, Wirtinger flow, truncated WF, and Gerchberg--Saxton.


Wirtinger Flow: Gradient Descent on the Non-Convex Landscape

Visualizes the gradient descent trajectory of Wirtinger flow on a 2D slice of the intensity loss landscape. Watch how spectral initialization places the iterate near the global minimum, and gradient descent converges despite the non-convex terrain.
The loss landscape (intensity loss for a $2 \times 1$ signal) has saddle points and a ring of global minima (the global phase orbit). Spectral initialization (blue dot) starts near the basin of attraction; gradient descent (blue trajectory) converges linearly.

Common Mistake: Random Initialization Fails for Wirtinger Flow

Mistake:

Using a random initialization (e.g., $\mathbf{z}^{(0)} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$) instead of spectral initialization. With random starting points, Wirtinger flow converges to the global optimum less than 5% of the time for $N > 32$ — it gets trapped at saddle points or converges to spurious local minima.

Correction:

Always use spectral initialization. Computing the leading eigenvector of $\frac{1}{M}\sum_i y_i \mathbf{a}_i\mathbf{a}_i^H$ costs only $O(MN)$ per iteration of the power method or Lanczos, and it places the starting point within the basin of attraction (with high probability) when $M \geq 6N$.
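The $O(MN)$ claim comes from never forming $\mathbf{Y}$ explicitly: each power-method step only needs two matrix-vector products with `A`. A sketch (a hypothetical helper, assuming rows of `A` hold $\mathbf{a}_i^H$):

```python
import numpy as np

def spectral_init_power(A, y, iters=200, seed=0):
    """Spectral initialization via power iteration. Each step applies
    Y v = (1/M) A^H (y * (A v)) in O(MN) without ever forming Y."""
    M, N = A.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.conj().T @ (y * (A @ v)) / M   # Y @ v, two O(MN) products
        v = w / np.linalg.norm(w)
    # Rayleigh quotient estimates lambda_1 for the final scaling
    lam = np.real(np.vdot(v, A.conj().T @ (y * (A @ v)) / M))
    return np.sqrt(lam) * v
```

Because $\mathbb{E}[\mathbf{Y}]$ has eigengap $\|\mathbf{x}_0\|^2$ between its top two eigenvalues, a modest number of power iterations suffices in practice.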

Quick Check

What is the per-iteration computational cost of Wirtinger flow with $M$ Gaussian measurements and signal dimension $N$?

$O(N)$

$O(MN)$

$O(MN^2)$

$O(N^3)$

Key Takeaway

Wirtinger flow minimizes the non-convex intensity loss via gradient descent with Wirtinger calculus — $O(MN)$ per iteration. Spectral initialization places the starting point in the basin of attraction, enabling linear convergence to the global optimum with high probability. Truncated Wirtinger flow achieves optimal sample complexity $M = O(N)$ by discarding outlier gradient contributions. Wirtinger flow is orders of magnitude faster than PhaseLift while achieving comparable accuracy — it is the practical method of choice for phase retrieval.