Non-Convex Methods: Wirtinger Flow

Wirtinger Flow β€” Non-Convex Gradient Descent for Phase Retrieval

Wirtinger flow (Candes, Li, and Soltanolkotabi, 2015) takes a radically different approach from PhaseLift: instead of convexifying the problem, it directly minimizes a non-convex loss function using gradient descent. Surprisingly, with a careful spectral initialization and sufficient measurements, gradient descent converges to the global optimum β€” despite the loss landscape having many saddle points.

The point is that the loss landscape, while non-convex globally, has no spurious local minima in a basin around the true solution. Spectral initialization places us inside this basin, and gradient descent does the rest. Wirtinger flow is the practical workhorse of modern phase retrieval.

Definition:

The Intensity Loss Function

Wirtinger flow minimizes the intensity-based loss:

$$f(\mathbf{z}) = \frac{1}{4M}\sum_{i=1}^{M} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i\right)^2.$$

This is a quartic function of $\mathbf{z}$ — non-convex with potentially many saddle points.

Alternative: amplitude loss.

$$g(\mathbf{z}) = \frac{1}{2M}\sum_{i=1}^{M} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle| - \sqrt{y_i}\right)^2.$$

The amplitude loss has better conditioning near the solution but is non-smooth where $\langle \mathbf{a}_i, \mathbf{z}\rangle = 0$. Truncated variants exist for both losses.
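As a concrete sketch (not from the text), both losses are a few lines of NumPy. The convention assumed here is that the rows of `A` store $\mathbf{a}_i^H$, so that `A @ z` computes the inner products $\langle \mathbf{a}_i, \mathbf{z}\rangle$:

```python
import numpy as np

def intensity_loss(z, A, y):
    """f(z) = (1/4M) * sum_i (|<a_i, z>|^2 - y_i)^2; rows of A hold a_i^H."""
    r = np.abs(A @ z) ** 2 - y            # residuals against intensities
    return np.sum(r ** 2) / (4 * len(y))

def amplitude_loss(z, A, y):
    """g(z) = (1/2M) * sum_i (|<a_i, z>| - sqrt(y_i))^2; non-smooth at <a_i,z> = 0."""
    r = np.abs(A @ z) - np.sqrt(y)        # residuals against amplitudes
    return np.sum(r ** 2) / (2 * len(y))
```

Both losses vanish exactly at the true signal (and at any of its global phase rotations), which is what makes them sensible objectives for phase retrieval.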

Wirtinger Calculus

A framework for differentiating real-valued functions of complex variables. For $f : \mathbb{C}^N \to \mathbb{R}$, the Wirtinger gradient $\nabla_{\bar{\mathbf{z}}} f$ provides the steepest-ascent direction, and $-\nabla_{\bar{\mathbf{z}}} f$ is the gradient descent direction. Essential for optimizing over complex-valued signals.
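For reference, the standard Wirtinger operators (textbook definitions, not specific to this text) treat $z$ and $\bar z$ as formally independent variables:

```latex
\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\,\frac{\partial f}{\partial y}\right),
\qquad
\frac{\partial f}{\partial \bar z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\,\frac{\partial f}{\partial y}\right),
\qquad z = x + iy.
```

A quick sanity check: for $f(z) = |z|^2 = z\bar z$, differentiating with respect to $\bar z$ gives $\partial f/\partial \bar z = z$, so $-\nabla_{\bar z} f = -z$ points straight at the minimizer $z = 0$, as a descent direction should.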

Related: The Intensity Loss Function

Theorem: Wirtinger Gradient of the Intensity Loss

For the intensity loss $f(\mathbf{z}) = \frac{1}{4M}\sum_{i=1}^{M} (|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i)^2$, the Wirtinger gradient is:

$$\nabla_{\bar{\mathbf{z}}} f(\mathbf{z}) = \frac{1}{2M}\sum_{i=1}^{M} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i\right) \langle \mathbf{a}_i, \mathbf{z}\rangle \,\mathbf{a}_i.$$

Each term in the sum is the residual $(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i)$ times the inner product $\langle \mathbf{a}_i, \mathbf{z}\rangle$ times the measurement direction $\mathbf{a}_i$. Measurements with large residuals contribute more to the gradient — this is why truncation helps.
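A minimal NumPy sketch of this gradient (assuming, as before, that the rows of `A` store $\mathbf{a}_i^H$, so the columns of `A.conj().T` are the $\mathbf{a}_i$):

```python
import numpy as np

def wirtinger_grad(z, A, y):
    """Wirtinger gradient of the intensity loss:
    (1/2M) * sum_i (|<a_i,z>|^2 - y_i) <a_i,z> a_i.
    Rows of A hold a_i^H, so A.conj().T maps residuals back to signal space."""
    Az = A @ z                        # inner products <a_i, z>
    r = np.abs(Az) ** 2 - y           # intensity residuals
    return A.conj().T @ (r * Az) / (2 * len(y))
```

Stepping $\mathbf{z} \mapsto \mathbf{z} - \mu\,\nabla_{\bar{\mathbf{z}}} f$ decreases $f$ for a small enough step, since $df = 2\,\mathrm{Re}\,\langle \nabla_{\bar{\mathbf{z}}} f, d\mathbf{z}\rangle$.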

Definition:

Spectral Initialization

The convergence of Wirtinger flow depends critically on initialization. Spectral initialization provides a starting point in the basin of attraction of the global optimum.

Algorithm:

  1. Form the weighted matrix $\mathbf{Y} = \frac{1}{M}\sum_{i=1}^{M} y_i\,\mathbf{a}_i\mathbf{a}_i^H$.
  2. Compute the leading eigenvector: $\mathbf{z}^{(0)} = \sqrt{\lambda_1(\mathbf{Y})}\, \mathbf{v}_1(\mathbf{Y})$.

Intuition: $\mathbb{E}[\mathbf{Y}] = \mathbf{x}_0\mathbf{x}_0^H + \|\mathbf{x}_0\|^2\mathbf{I}$, so the leading eigenvector of $\mathbf{Y}$ approximates $\mathbf{x}_0$ (up to global phase).

Quality: With $M \geq CN$ Gaussian measurements:

$$\text{dist}(\mathbf{z}^{(0)}, \mathbf{x}_0) \leq \epsilon_0\|\mathbf{x}_0\|,$$

where $\text{dist}(\mathbf{z}, \mathbf{x}_0) = \min_{\phi} \|\mathbf{z} - e^{i\phi}\mathbf{x}_0\|$ accounts for the global-phase ambiguity, and $\epsilon_0 \to 0$ as $M/N \to \infty$. Typically $M \geq 6N$ gives $\epsilon_0 \lesssim 0.3$.
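The two steps above can be sketched directly in NumPy (dense eigendecomposition for clarity; rows of `A` again assumed to hold $\mathbf{a}_i^H$):

```python
import numpy as np

def spectral_init(A, y):
    """Spectral initialization: z0 = sqrt(lambda_1(Y)) * v1(Y), where
    Y = (1/M) sum_i y_i a_i a_i^H. Rows of A hold a_i^H."""
    M = len(y)
    Y = (A.conj().T * y) @ A / M       # (1/M) sum_i y_i a_i a_i^H
    vals, vecs = np.linalg.eigh(Y)     # Hermitian eigenpairs, ascending order
    return np.sqrt(vals[-1]) * vecs[:, -1]
```

In practice one would run a few power or Lanczos iterations instead of a full eigendecomposition, keeping the cost near $O(MN)$.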


Spectral Initialization

An initialization strategy for non-convex phase retrieval that computes the leading eigenvector of the weighted matrix $\mathbf{Y} = \frac{1}{M}\sum_i y_i \mathbf{a}_i\mathbf{a}_i^H$. Provides a starting point within a constant factor of the true signal, enabling gradient descent to converge globally.

Related: The Intensity Loss Function

Wirtinger Flow Algorithm

Complexity: $O(MN)$ per iteration; $O(MN^2\log(1/\epsilon))$ total with $T = O(N\log(1/\epsilon))$ iterations
Input: Measurements $\{(\mathbf{a}_i, y_i)\}_{i=1}^M$,
step size $\mu$, iterations $T$
Output: Estimate $\hat{\mathbf{x}}$
1. Spectral initialization:
$\mathbf{Y} \leftarrow \frac{1}{M}\sum_{i=1}^M y_i\,\mathbf{a}_i\mathbf{a}_i^H$
2. $\mathbf{z}^{(0)} \leftarrow \sqrt{\lambda_1(\mathbf{Y})}\,\mathbf{v}_1(\mathbf{Y})$
3. for $t = 0, 1, \ldots, T-1$ do
4. $\quad \nabla \leftarrow \frac{1}{2M}\sum_{i=1}^M (|\langle\mathbf{a}_i, \mathbf{z}^{(t)}\rangle|^2 - y_i)\,\langle\mathbf{a}_i, \mathbf{z}^{(t)}\rangle\,\mathbf{a}_i$
5. $\quad \mathbf{z}^{(t+1)} \leftarrow \mathbf{z}^{(t)} - \frac{\mu}{\|\mathbf{z}^{(0)}\|^2}\,\nabla$
6. end for
7. return $\hat{\mathbf{x}} = \mathbf{z}^{(T)}$

The normalization by $\|\mathbf{z}^{(0)}\|^2$ in line 5 ensures scale invariance: the step size does not depend on the signal energy.
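Putting the pieces together, a compact NumPy version of the whole loop (illustrative defaults for `mu` and `T`, not the paper's tuned schedule; rows of `A` hold $\mathbf{a}_i^H$):

```python
import numpy as np

def wirtinger_flow(A, y, mu=0.2, T=1000):
    """Wirtinger flow: spectral initialization followed by gradient descent
    on the intensity loss. Rows of A hold a_i^H."""
    M = len(y)
    # Steps 1-2: spectral initialization
    Y = (A.conj().T * y) @ A / M
    vals, vecs = np.linalg.eigh(Y)
    z = np.sqrt(vals[-1]) * vecs[:, -1]
    scale = np.linalg.norm(z) ** 2          # ||z0||^2 fixes the step scale
    # Steps 3-6: gradient iterations
    for _ in range(T):
        Az = A @ z
        grad = A.conj().T @ ((np.abs(Az) ** 2 - y) * Az) / (2 * M)
        z = z - (mu / scale) * grad
    return z
```

Recovery is only defined up to a global phase, so the error of the output should be measured after rotating the estimate to best align with the truth.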

Theorem: Convergence Guarantee for Wirtinger Flow

Suppose $\mathbf{a}_i \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$, $M \geq CN\log N$ (for a universal constant $C$), and $\mathbf{z}^{(0)}$ is the spectral initialization. Then with step size $\mu \leq c_0/N$ in the normalized update $\mathbf{z}^{(t+1)} = \mathbf{z}^{(t)} - (\mu/\|\mathbf{z}^{(0)}\|^2)\nabla$, Wirtinger flow converges linearly:

$$\text{dist}(\mathbf{z}^{(t)}, \mathbf{x}_0) \leq \left(1 - \frac{c_1}{N}\right)^t \cdot \text{dist}(\mathbf{z}^{(0)}, \mathbf{x}_0),$$

with high probability. After $T = O(N\log(1/\epsilon))$ iterations, the relative error is at most $\epsilon$.

Total complexity: $O(MN \cdot N\log(1/\epsilon)) = O(MN^2\log(1/\epsilon))$ with Gaussian measurements. With FFT structure: $O(MN\log N\log(1/\epsilon))$.

The loss landscape has no spurious local minima in a neighborhood of $\mathbf{x}_0$ (and its global phase rotations). Spectral initialization lands inside this neighborhood, and the restricted strong convexity of the loss in this region drives linear convergence.

Historical Note: Gerchberg--Saxton: The Grandfather of Phase Retrieval (1972)


The Gerchberg--Saxton (GS) algorithm, published in 1972 by Ralph Gerchberg and Owen Saxton, was the first practical algorithm for phase retrieval from Fourier magnitudes. It alternates between the Fourier and spatial domains, replacing the magnitude in each domain with the known constraints.

GS is an alternating projection method: it projects back and forth between the set of signals consistent with the Fourier magnitude and the set consistent with spatial constraints (support, non-negativity). When both constraint sets are convex, alternating projections converge; on non-convex sets (like the magnitude constraint), convergence to the global optimum is not guaranteed.

Despite lacking convergence guarantees, GS and its descendants (Fienup's Hybrid Input-Output, error reduction) remained the standard algorithms for four decades, until Wirtinger flow provided the first provably convergent alternative.

Definition:

Truncated Wirtinger Flow

Truncated Wirtinger Flow (Chen and Candes, 2017) discards measurements with large residuals from the gradient computation:

$$\nabla_{\bar{\mathbf{z}}} f_{\text{trunc}} = \frac{1}{2M}\sum_{i \in \mathcal{T}^{(t)}} \left(|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i\right) \langle \mathbf{a}_i, \mathbf{z}\rangle\,\mathbf{a}_i,$$

where $\mathcal{T}^{(t)} = \{i : |\langle \mathbf{a}_i, \mathbf{z}^{(t)}\rangle| \leq \alpha_h \sqrt{y_i}\} \cap \{i : \big||\langle \mathbf{a}_i, \mathbf{z}^{(t)}\rangle|^2 - y_i\big| \leq \alpha_y y_i + \beta\}$.

Key improvement: Truncation reduces the measurement requirement from $M = O(N\log N)$ to $M = O(N)$ — optimal up to constants.
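A sketch of the truncated gradient under the trimming rule above, taking $\beta = 0$ and illustrative threshold values $\alpha_h, \alpha_y$ (rows of `A` hold $\mathbf{a}_i^H$ as in the earlier sketches):

```python
import numpy as np

def truncated_grad(z, A, y, alpha_h=5.0, alpha_y=3.0):
    """Truncated Wirtinger gradient: sum only over measurements that pass
    both trimming tests. beta = 0 here; thresholds are illustrative."""
    M = len(y)
    Az = A @ z
    r = np.abs(Az) ** 2 - y
    keep = (np.abs(Az) <= alpha_h * np.sqrt(y)) & (np.abs(r) <= alpha_y * y)
    return A[keep].conj().T @ (r[keep] * Az[keep]) / (2 * M)
```

Note the normalization keeps $1/(2M)$ with the full $M$, so trimmed measurements simply contribute zero rather than reweighting the rest.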

Other variants:

  • Reshaped Wirtinger flow: Uses the amplitude loss gg with truncation, achieving optimal M=O(N)M = O(N) without the log factor.
  • Incremental (stochastic) Wirtinger flow: Uses one measurement per iteration for O(N)O(N) per-step cost β€” suitable for streaming data.

Wirtinger Flow Convergence

Demonstrates Wirtinger flow convergence on a 1D signal recovery task. The plot shows the relative error $\text{dist}(\mathbf{z}^{(t)}, \mathbf{x}_0)/\|\mathbf{x}_0\|$ versus iteration number.

Adjust the measurement ratio $M/N$ and noise level to observe:

  • Linear convergence after spectral initialization
  • Faster convergence with more measurements
  • Error floor determined by noise level
  • Effect of truncation on convergence speed

Example: Wirtinger Flow vs. PhaseLift: Speed and Accuracy

Setup: $N = 128$, $M = 512$ Gaussian measurements, SNR = 30 dB.

Compare recovery quality, computation time, and memory for PhaseLift, Wirtinger flow, truncated WF, and Gerchberg--Saxton.


Wirtinger Flow: Gradient Descent on the Non-Convex Landscape

Visualizes the gradient descent trajectory of Wirtinger flow on a 2D slice of the intensity loss landscape. Watch how spectral initialization places the iterate near the global minimum, and gradient descent converges despite the non-convex terrain.
The loss landscape (intensity loss for a $2 \times 1$ signal) has saddle points and a ring of global minima (the global phase orbit). Spectral initialization (blue dot) starts near the basin of attraction; gradient descent (blue trajectory) converges linearly.

Common Mistake: Random Initialization Fails for Wirtinger Flow

Mistake:

Using a random initialization (e.g., $\mathbf{z}^{(0)} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I})$) instead of spectral initialization. With random starting points, Wirtinger flow converges to the global optimum less than 5% of the time for $N > 32$ — it gets trapped at saddle points or converges to spurious local minima.

Correction:

Always use spectral initialization. Computing the leading eigenvector of $\frac{1}{M}\sum_i y_i \mathbf{a}_i\mathbf{a}_i^H$ costs only $O(MN)$ per iteration of the power method or Lanczos, and it places the starting point within the basin of attraction (with high probability) when $M \geq 6N$.
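The $O(MN)$ claim comes from never forming $\mathbf{Y}$ explicitly: each power-method step only needs two matrix-vector products with `A`. A sketch (a hypothetical helper, assuming rows of `A` hold $\mathbf{a}_i^H$):

```python
import numpy as np

def spectral_init_power(A, y, iters=200, seed=0):
    """Spectral initialization via power iteration. Each step applies
    Y v = (1/M) A^H (y * (A v)) in O(MN) without ever forming Y."""
    M, N = A.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.conj().T @ (y * (A @ v)) / M   # Y @ v, two O(MN) products
        v = w / np.linalg.norm(w)
    # Rayleigh quotient estimates lambda_1 for the final scaling
    lam = np.real(np.vdot(v, A.conj().T @ (y * (A @ v)) / M))
    return np.sqrt(lam) * v
```

Because $\mathbb{E}[\mathbf{Y}]$ has eigengap $\|\mathbf{x}_0\|^2$ between its top two eigenvalues, a modest number of power iterations suffices in practice.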

Quick Check

What is the per-iteration computational cost of Wirtinger flow with $M$ Gaussian measurements and signal dimension $N$?

$O(N)$

$O(MN)$

$O(MN^2)$

$O(N^3)$

Key Takeaway

Wirtinger flow minimizes the non-convex intensity loss via gradient descent with Wirtinger calculus — $O(MN)$ per iteration. Spectral initialization places the starting point in the basin of attraction, enabling linear convergence to the global optimum with high probability. Truncated Wirtinger flow achieves optimal sample complexity $M = O(N)$ by discarding outlier gradient contributions. Wirtinger flow is orders of magnitude faster than PhaseLift while achieving comparable accuracy — it is the practical method of choice for phase retrieval.