Exercises
ex-ch25-01
Easy: Sparse PCA minimax rate. Consider the spiked covariance model with a $k$-sparse spike, ambient dimension $p$, and $n$ samples. Write down the information-theoretic rate $\lambda^*_{\mathrm{stat}}$ and the conjectured polynomial-time rate $\lambda^*_{\mathrm{comp}}$, and compute their ratio as a function of $k$.
The statistical rate scales like $\sqrt{k \log p / n}$.
The computational rate scales like $\sqrt{k^2 \log p / n}$.
The ratio is $\sqrt{k}$.
Write down the rates
Up to constants, $\lambda^*_{\mathrm{stat}} \asymp \sqrt{k \log p / n}$ and $\lambda^*_{\mathrm{comp}} \asymp \sqrt{k^2 \log p / n}$.
Compute the ratio
$\lambda^*_{\mathrm{comp}} / \lambda^*_{\mathrm{stat}} = \sqrt{k}$, independent of $p$ and $n$.
Interpretation
A polynomial-time algorithm needs roughly $\sqrt{k}$ times more signal strength than the information-theoretic minimum in this regime.
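A quick numeric sanity check of the two rates; the values of $k$, $p$, $n$ below are illustrative, not taken from the exercise.

```python
import math

# Illustrative problem sizes (assumed for this check, not from the exercise).
k, p, n = 10, 10_000, 2_000

rate_stat = math.sqrt(k * math.log(p) / n)       # information-theoretic rate
rate_comp = math.sqrt(k**2 * math.log(p) / n)    # conjectured poly-time rate
ratio = rate_comp / rate_stat                    # equals sqrt(k) identically

print(f"stat ~ {rate_stat:.3f}  comp ~ {rate_comp:.3f}  ratio = {ratio:.3f}")
```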
ex-ch25-02
Easy: Regret of the best fixed expert. In a prediction-with-experts game with $N$ experts and $T$ rounds, expert $i$ has cumulative loss $L_i(T)$, and the learner achieved cumulative loss $\hat L_T$. Compute the regret against the best expert.
Regret is learner loss minus best-expert loss.
The best expert is the one with smallest $L_i(T)$.
Find the best expert
$i^* = \arg\min_i L_i(T)$.
Regret
$R_T = \hat L_T - L_{i^*}(T) = \hat L_T - \min_i L_i(T)$.
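In code the computation is a one-liner; the loss values below are invented for illustration.

```python
# Hypothetical cumulative losses after T rounds for N = 4 experts, plus the
# learner's cumulative loss (all values invented for this example).
expert_losses = [37.0, 42.5, 31.2, 40.1]
learner_loss = 35.9

best_loss = min(expert_losses)      # cumulative loss of the best fixed expert
regret = learner_loss - best_loss
print(f"best expert loss = {best_loss}, regret = {regret:.1f}")
```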
ex-ch25-03
Easy: Consensus matrix check. For the ring on $n$ nodes with self-weight $1/2$ and neighbour weight $1/4$, verify that $W$ is symmetric, doubly stochastic, and has $\mathbf{1}$ as an eigenvector.
Write $W$ as a circulant matrix.
Sum across any row.
Build matrix
$W = \tfrac{1}{2} I + \tfrac{1}{4}(P + P^\top)$, where $P$ is the cyclic shift; equivalently, the circulant matrix with first row $(\tfrac{1}{2}, \tfrac{1}{4}, 0, \dots, 0, \tfrac{1}{4})$.
Verify properties
Symmetric by inspection. Each row sums to $\tfrac{1}{4} + \tfrac{1}{2} + \tfrac{1}{4} = 1$; by symmetry so does each column. Then $W\mathbf{1} = \mathbf{1}$ because every row-sum is $1$, so $\mathbf{1}$ is an eigenvector with eigenvalue $1$.
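The verification is mechanical; here is a minimal numpy check for a small ring, with self-weight $1/2$ and neighbour weight $1/4$ (the ring size is arbitrary).

```python
import numpy as np

n = 6  # ring size (any n >= 3 works)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5                  # self-weight
    W[i, (i - 1) % n] = 0.25       # left neighbour
    W[i, (i + 1) % n] = 0.25       # right neighbour

assert np.allclose(W, W.T)                        # symmetric
assert np.allclose(W.sum(axis=1), 1.0)            # row sums: stochastic
assert np.allclose(W.sum(axis=0), 1.0)            # column sums: doubly stochastic
assert np.allclose(W @ np.ones(n), np.ones(n))    # 1 is an eigenvector, eigenvalue 1
print("all consensus-matrix properties hold")
```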
ex-ch25-04
Easy: UCB exploration bonus. The UCB1 index at time $t$ for arm $i$ that has been pulled $n_i$ times is $\hat\mu_i + \sqrt{2 \ln t / n_i}$. Given values of $\hat\mu_i$, $t$, and $n_i$, compute the index.
Plug into the formula.
Compute
Bonus $= \sqrt{2 \ln t / n_i}$. Index $= \hat\mu_i + \sqrt{2 \ln t / n_i}$; the bonus shrinks as the arm accumulates pulls and grows only logarithmically in $t$.
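With illustrative values for $\hat\mu_i$, $t$, and $n_i$ (chosen here for the example, not prescribed by the exercise):

```python
import math

# Illustrative values, chosen for this example.
mu_hat, t, n_i = 0.5, 100, 10

bonus = math.sqrt(2 * math.log(t) / n_i)   # exploration bonus
index = mu_hat + bonus                     # UCB1 index
print(f"bonus = {bonus:.3f}, index = {index:.3f}")
```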
ex-ch25-05
Easy: Differential privacy mechanism. Let $\hat\mu$ be the sample mean of $n$ iid $[0,1]$-valued observations. To release an $\varepsilon$-differentially private estimate, add Laplace noise with scale $b = \Delta/\varepsilon$, where $\Delta$ is the sensitivity. Compute $\Delta$ and $b$ as functions of $n$ and $\varepsilon$.
The sensitivity of the mean is $\Delta = 1/n$ (changing one sample moves the mean by at most $1/n$).
Sensitivity
$\Delta = 1/n$.
Noise scale
$b = \Delta/\varepsilon = 1/(n\varepsilon)$. The MSE added is $2b^2 = 2/(n\varepsilon)^2$, which is of lower order than the sampling variance $O(1/n)$ once $n \gtrsim 1/\varepsilon^2$.
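A concrete comparison of the noise scale against the sampling variance; the values of $n$ and $\varepsilon$ below are illustrative.

```python
# Illustrative parameters for the Laplace mechanism on a [0,1]-bounded mean.
n, eps = 1000, 1.0

sensitivity = 1.0 / n             # one sample moves the mean by at most 1/n
b = sensitivity / eps             # Laplace scale
added_mse = 2 * b**2              # variance of Laplace(b) noise
sampling_var = 1.0 / (4 * n)      # worst-case variance of the mean on [0,1]

print(f"scale b = {b:.1e}, added MSE = {added_mse:.1e}, "
      f"sampling variance <= {sampling_var:.1e}")
```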
ex-ch25-06
Medium: Multiplicative weights regret bound. Show that against any sequence of losses $\ell_t \in [0,1]^N$, the MWU algorithm with step size $\eta = \sqrt{\ln N / T}$ has regret at most $2\sqrt{T \ln N}$.
Track the potential $\Phi_t = \sum_{i=1}^{N} w_{t,i}$ and bound $\Phi_{T+1}$.
Use $e^{-x} \le 1 - x + x^2$ for $x \ge 0$.
Lower bound $\Phi_{T+1}$ by the weight of the best expert.
Potential method
Let $\Phi_t = \sum_i w_{t,i}$ with $w_{1,i} = 1$. The update gives $w_{t+1,i} = w_{t,i}\, e^{-\eta \ell_{t,i}}$.
Bound per-round change
Applying $e^{-x} \le 1 - x + x^2$ and then $1 + x \le e^x$, with $p_t = w_t / \Phi_t$ the played distribution: $\Phi_{t+1} \le \Phi_t \big(1 - \eta \langle p_t, \ell_t \rangle + \eta^2 \langle p_t, \ell_t^2 \rangle\big) \le \Phi_t \exp\!\big(-\eta \langle p_t, \ell_t \rangle + \eta^2\big)$.
Telescope and lower bound
Summing: $\Phi_{T+1} \le N \exp(-\eta \hat L_T + \eta^2 T)$ with $\hat L_T = \sum_t \langle p_t, \ell_t \rangle$. Also $\Phi_{T+1} \ge w_{T+1,i^*} = e^{-\eta L_{i^*}(T)}$. Combine, take logarithms, and set $\eta = \sqrt{\ln N / T}$ to get $\hat L_T - L_{i^*}(T) \le \frac{\ln N}{\eta} + \eta T = 2\sqrt{T \ln N}$.
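The bound holds deterministically for any loss sequence in $[0,1]$, so it can be checked empirically. A minimal simulation (the random losses are purely illustrative):

```python
import math
import random

def mwu_loss(losses, eta):
    """Cumulative expected loss of multiplicative weights on losses[t][i] in [0,1]."""
    n_experts = len(losses[0])
    w = [1.0] * n_experts
    total = 0.0
    for loss_t in losses:
        s = sum(w)
        p = [wi / s for wi in w]                        # played distribution
        total += sum(pi * li for pi, li in zip(p, loss_t))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, loss_t)]
    return total

random.seed(0)
T, N = 500, 8
losses = [[random.random() for _ in range(N)] for _ in range(T)]

alg_loss = mwu_loss(losses, eta=math.sqrt(math.log(N) / T))
best_loss = min(sum(row[i] for row in losses) for i in range(N))
bound = 2 * math.sqrt(T * math.log(N))
print(f"regret = {alg_loss - best_loss:.2f}  <=  bound = {bound:.2f}")
```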
ex-ch25-07
Medium: Online-to-batch conversion. Suppose an online algorithm achieves regret $R_T = O(\sqrt{T})$ for convex losses. Show that averaging its iterates gives a statistical estimator with excess risk $O(1/\sqrt{T})$.
Use convexity of the population risk $F(w) = \mathbb{E}_z[\ell(w; z)]$.
Take expectation over iid data.
Regret inequality
$\sum_{t=1}^{T} f_t(w_t) - \min_w \sum_{t=1}^{T} f_t(w) \le R_T$, where $f_t(w) = \ell(w; z_t)$ is the loss on sample $z_t$.
Take expectations
For iid data, $w_t$ depends only on $z_1, \dots, z_{t-1}$, so $\mathbb{E}[f_t(w_t)] = \mathbb{E}[F(w_t)]$; and the fixed comparator $w^* = \arg\min_w F(w)$ dominates the empirical minimizer, so $\mathbb{E}\big[\min_w \sum_t f_t(w)\big] \le \sum_t \mathbb{E}[f_t(w^*)] = T\, F(w^*)$.
Jensen
By convexity, $F(\bar w_T) \le \frac{1}{T}\sum_{t=1}^{T} F(w_t)$ for $\bar w_T = \frac{1}{T}\sum_t w_t$. Taking expectations and dividing by $T$: $\mathbb{E}[F(\bar w_T)] - F(w^*) \le R_T / T = O(1/\sqrt{T})$.
ex-ch25-08
Medium: Spectral gap controls consensus. For a symmetric doubly stochastic $W$ with second-largest eigenvalue modulus $\lambda_2 < 1$, show that the averaging error satisfies $\|W^t x_0 - \bar x \mathbf{1}\|_2 \le \lambda_2^{\,t}\, \|x_0 - \bar x \mathbf{1}\|_2$, where $\bar x = \frac{1}{n}\mathbf{1}^\top x_0$.
Project onto the subspace orthogonal to $\mathbf{1}$.
Use the spectral decomposition of $W$.
Decompose initial state
Write $x_0 = \bar x \mathbf{1} + e_0$ where $e_0 \perp \mathbf{1}$.
Apply iteration
$W^t x_0 = \bar x \mathbf{1} + W^t e_0$ since $W\mathbf{1} = \mathbf{1}$.
Bound with spectral norm
$W$ is symmetric, so it preserves $\mathbf{1}^\perp$, and restricted to $\mathbf{1}^\perp$ its spectral radius is $\lambda_2$. Thus $\|W^t e_0\|_2 \le \lambda_2^{\,t} \|e_0\|_2$.
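A numerical check of the contraction on the lazy ring matrix from ex-ch25-03 (the size and number of steps are illustrative):

```python
import numpy as np

n = 8
I = np.eye(n)
W = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))

lam = np.sort(np.abs(np.linalg.eigvalsh(W)))
lam2 = lam[-2]                                  # second-largest |eigenvalue|

rng = np.random.default_rng(0)
x0 = rng.normal(size=n)
xbar = x0.mean()
err0 = np.linalg.norm(x0 - xbar)

x = x0.copy()
for t in range(1, 21):
    x = W @ x
    # geometric contraction toward the average, at rate lambda_2
    assert np.linalg.norm(x - xbar) <= lam2**t * err0 + 1e-12
print(f"lambda_2 = {lam2:.4f}; contraction bound verified for 20 steps")
```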
ex-ch25-09
Medium: UCB regret via confidence bounds. Sketch the argument that UCB1 achieves regret $O\big(\sum_{i: \Delta_i > 0} \ln T / \Delta_i\big)$ in stochastic bandits with (sub-)Gaussian rewards and gaps $\Delta_i = \mu_* - \mu_i$.
Bound the expected number of pulls of a suboptimal arm.
Use Hoeffding to show the optimal arm is not under-explored.
Show that if arm $i$ is pulled, either its empirical mean is too high, the optimal arm's is too low, or arm $i$ is still under-sampled.
Event decomposition
Arm $i$ is pulled at round $t$ only if $\hat\mu_i + \sqrt{2\ln t / n_i} \ge \hat\mu_* + \sqrt{2\ln t / n_*}$.
Bad events
This requires at least one of: (i) $\hat\mu_* \le \mu_* - \sqrt{2\ln t / n_*}$, (ii) $\hat\mu_i \ge \mu_i + \sqrt{2\ln t / n_i}$, or (iii) $n_i < 8\ln t / \Delta_i^2$.
Hoeffding and union bound
Each of (i), (ii) has probability at most $t^{-4}$ by Hoeffding; summed over $t$ this gives an $O(1)$ contribution. Event (iii) caps the pulls at $n_i(T) = O(\ln T / \Delta_i^2)$, so the regret contribution of arm $i$ is $\Delta_i \cdot O(\ln T / \Delta_i^2) = O(\ln T / \Delta_i)$.
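A short simulation makes the logarithmic pull count visible; the two-armed Gaussian instance and horizon below are illustrative.

```python
import math
import random

def ucb1_counts(means, T, seed=0):
    """Run UCB1 on unit-variance Gaussian arms; return pull counts per arm."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    for i in range(K):                        # initialize: pull each arm once
        counts[i], sums[i] = 1, rng.gauss(means[i], 1.0)
    for t in range(K + 1, T + 1):
        i = max(range(K), key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]))
        sums[i] += rng.gauss(means[i], 1.0)
        counts[i] += 1
    return counts

counts = ucb1_counts([0.0, 1.0], T=5000)      # gap Delta = 1
print(f"suboptimal arm pulled {counts[0]} times out of 5000")
```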
ex-ch25-10
Medium: Stochastic approximation. Consider Robbins-Monro updates $\theta_{t+1} = \theta_t + \gamma_t (X_t - \theta_t)$, where $X_t = \mu + \xi_t$ with $\xi_t$ iid, zero-mean, variance $\sigma^2$. With $\gamma_t = 1/t$, show that $\mathbb{E}[(\theta_{t+1} - \mu)^2] \to 0$.
Let $e_t = \theta_t - \mu$, write the recursion for $e_t$.
Unroll the recursion.
Error recursion
$e_{t+1} = (1 - \gamma_t)\, e_t + \gamma_t\, \xi_t$.
Unroll with $\gamma_t = 1/t$
$e_{t+1} = e_1 \prod_{s=1}^{t}\big(1 - \tfrac{1}{s}\big) + \sum_{s=1}^{t} \tfrac{1}{s}\,\xi_s \prod_{r=s+1}^{t}\big(1 - \tfrac{1}{r}\big)$. The product $\prod_{r=s+1}^{t}(1 - 1/r) = s/t$, and $\tfrac{1}{s}\cdot\tfrac{s}{t} = \tfrac{1}{t}$, so $e_{t+1} = \tfrac{1}{t}\sum_{s=1}^{t}\xi_s$ (the $e_1$ term vanishes because its first factor is $0$).
Variance
$\mathbb{E}[e_{t+1}^2] = \frac{1}{t^2}\sum_{s=1}^{t}\mathbb{E}[\xi_s^2] = \sigma^2/t \to 0$ by independence.
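With $\gamma_t = 1/t$ the recursion is exactly a running average, so the empirical MSE should track $\sigma^2/t$; a small Monte Carlo check (all parameters illustrative):

```python
import random

random.seed(1)
mu, sigma, t_max = 2.0, 1.0, 2000

def robbins_monro():
    theta = 0.0
    for t in range(1, t_max + 1):
        x = random.gauss(mu, sigma)
        theta += (1.0 / t) * (x - theta)   # theta_{t+1} = theta_t + gamma_t (X_t - theta_t)
    return theta

estimates = [robbins_monro() for _ in range(200)]
mse = sum((th - mu) ** 2 for th in estimates) / len(estimates)
print(f"empirical MSE = {mse:.2e}, predicted sigma^2/t = {sigma**2 / t_max:.2e}")
```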
ex-ch25-11
Medium: Gossip vs synchronous consensus. On a ring of $n$ nodes, synchronous averaging takes $\Theta(n^2 \log \epsilon^{-1})$ rounds, for total messages $\Theta(n^3 \log \epsilon^{-1})$. Pairwise randomized gossip converges in $\Theta(n^3 \log \epsilon^{-1})$ pair exchanges in expectation. Which uses fewer messages? How does the answer change on a complete graph?
Each synchronous round uses $\Theta(m)$ message transmissions ($m$ = number of edges).
On a ring $1 - \lambda_2 = \Theta(1/n^2)$; on the complete graph $1 - \lambda_2 = \Theta(1)$.
Ring messages
Synchronous: $\Theta(n^2 \log \epsilon^{-1})$ rounds $\times$ $\Theta(n)$ messages per round $= \Theta(n^3 \log \epsilon^{-1})$. Gossip: $\Theta(n^3 \log \epsilon^{-1})$ pair exchanges $\times$ $2$ messages $= \Theta(n^3 \log \epsilon^{-1})$. The totals match up to constants; gossip's advantage on the ring is that it needs no global synchronization.
Complete graph
Synchronous converges in $\Theta(\log \epsilon^{-1})$ rounds (the spectral gap is $\Theta(1)$), but each round costs $\Theta(n^2)$ messages, for $\Theta(n^2 \log \epsilon^{-1})$ in total. Gossip takes $\Theta(n \log \epsilon^{-1})$ pair exchanges, i.e. $\Theta(n \log \epsilon^{-1})$ messages, and wins by a factor of $n$.
Lesson
Gossip is never worse in message count and wins outright on dense graphs; synchronous averaging wins in wall-clock rounds when the spectral gap is large, because each round is globally useful.
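The message counts can be tabulated directly from the spectral gaps; constants are dropped, so this is order-of-magnitude only.

```python
import math

def total_messages(gap_per_step, msgs_per_step, eps=1e-3):
    """Steps to contract by eps scale as log(1/eps)/gap; multiply by cost per step."""
    return (math.log(1 / eps) / gap_per_step) * msgs_per_step

n = 100
ring_sync       = total_messages(1 / n**2, 2 * n)       # rounds x messages/round
ring_gossip     = total_messages(1 / n**3, 2)           # exchanges x 2 messages
complete_sync   = total_messages(1.0, n * (n - 1))
complete_gossip = total_messages(1 / n, 2)

print(f"ring:     sync ~ {ring_sync:.2e}   gossip ~ {ring_gossip:.2e}")
print(f"complete: sync ~ {complete_sync:.2e}   gossip ~ {complete_gossip:.2e}")
```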
ex-ch25-12
Medium: Paper critique. A submitted paper claims a new estimator achieves MSE scaling as $O(1/n)$ for estimating a Lipschitz density at a point. Explain why this claim is extraordinary and what checks you would perform as a reviewer.
What is the minimax rate for nonparametric density estimation at a point?
The claim should either explain how a known lower bound is evaded or restrict the assumptions.
Known lower bound
The minimax rate for pointwise estimation of a Lipschitz density is $n^{-2/3}$. A rate of $O(1/n)$ matches the parametric rate, which no estimator can attain uniformly over a nonparametric class, so the claim is suspicious.
Reviewer checks
(i) Is the claim in MSE, with squared bias and variance correctly accounted? (ii) Does the proof implicitly restrict to a parametric sub-family (smoother than Lipschitz)? (iii) Are constants allowed to depend on the unknown density (pointwise superefficiency)? (iv) Are boundary and tuning effects absorbed?
Conclusion
Either the function class is smaller than claimed, the estimator uses side information, or the analysis has a bug. Request clarified assumptions and a simulation against the Le Cam two-point lower bound.
ex-ch25-13
Medium: Metropolis-Hastings weights. Design symmetric gossip weights for a graph where each node only knows its own degree and its neighbours' degrees. Use the Metropolis rule $W_{ij} = 1/(1 + \max\{d_i, d_j\})$ for edges and verify double stochasticity.
The diagonal must be chosen to make each row sum to $1$.
Off-diagonal
Set $W_{ij} = 1/(1 + \max\{d_i, d_j\})$ for edges $\{i, j\}$, $W_{ij} = 0$ for non-adjacent $i \ne j$. Symmetric by construction.
Diagonal
$W_{ii} = 1 - \sum_{j \in N(i)} W_{ij}$. Since $W_{ij} \le 1/(1 + d_i)$ and there are $d_i$ neighbours, $\sum_{j \in N(i)} W_{ij} \le d_i/(1 + d_i) < 1$, so $W_{ii} > 0$.
Double stochasticity
Rows sum to $1$ by construction; symmetry gives the column sums.
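A check on a small assumed graph (the edge list is invented for illustration):

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]   # illustrative graph
n = 5
deg = [0] * n
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))   # Metropolis rule
for i in range(n):
    W[i, i] = 1.0 - W[i].sum()     # diagonal fixes each row sum at 1

assert np.allclose(W, W.T)                 # symmetric
assert np.allclose(W.sum(axis=1), 1.0)     # rows
assert np.allclose(W.sum(axis=0), 1.0)     # columns: doubly stochastic
assert (np.diag(W) > 0).all()              # diagonal stays positive
print("Metropolis weights are symmetric and doubly stochastic")
```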
ex-ch25-14
Hard: Planted clique reduction. Sketch how a polynomial-time sparse-PCA algorithm achieving a rate below $\sqrt{k^2 \log p / n}$ would refute the planted clique hardness conjecture.
Encode a graph on $N$ vertices as a sample matrix.
A planted clique of size $\kappa$ induces a rank-one spike.
Detecting the spike lets you recover the clique.
Reduction construction
Given a graph with adjacency matrix $A$, form samples whose covariance is $I + \theta\, vv^\top$, where $v$ is the normalized indicator of the clique (sparsity $k = \kappa$). Under the null (no clique) there is no spike.
What the algorithm gives
If a poly-time sparse-PCA estimator achieves detection below the $\sqrt{k^2 \log p / n}$ threshold, applying it to the reduction detects planted cliques of size $\kappa = o(\sqrt{N})$, which contradicts the planted clique conjecture.
Takeaway
Beating the computational rate is therefore at least as hard as refuting a widely believed average-case complexity assumption.
ex-ch25-15
Hard: Follow-the-regularized-leader. For OCO with convex losses $f_t$ whose gradients are bounded by $G$, and a $1$-strongly convex regularizer $R$ with $\max_u R(u) \le R_{\max}$, show that FTRL, $x_{t+1} = \arg\min_x \big(\sum_{s \le t} \langle g_s, x \rangle + \tfrac{1}{\eta} R(x)\big)$ with $g_s = \nabla f_s(x_s)$, attains regret $O(G\sqrt{R_{\max} T})$ for a suitable $\eta$.
Use the FTL-BTL lemma: follow-the-leader on the regularized losses has telescoping regret.
Bound the stability term via strong convexity of $R$.
FTL-BTL lemma
For the iterates above, telescoping gives, for any comparator $u$: $\sum_t \langle g_t, x_t - u \rangle \le \frac{R(u) - R(x_1)}{\eta} + \sum_t \langle g_t, x_t - x_{t+1} \rangle$.
Stability from strong convexity
Strong convexity of $R$ implies $\|x_t - x_{t+1}\| \le \eta \|g_t\|$, so $\langle g_t, x_t - x_{t+1} \rangle \le \eta \|g_t\|^2 \le \eta G^2$; convexity of $f_t$ gives $f_t(x_t) - f_t(u) \le \langle g_t, x_t - u \rangle$.
Combine
Plugging in, $\mathrm{Regret}_T \le \frac{R_{\max}}{\eta} + \eta G^2 T$, which is minimized at $\eta = \sqrt{R_{\max}/(G^2 T)}$, giving $2G\sqrt{R_{\max} T}$. A factor-$\tfrac{1}{2}$ improvement comes from a tighter version of the stability step.
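For linear losses and $R(x) = \|x\|_2^2/2$, FTRL has the closed form $x_{t+1} = -\eta \sum_{s \le t} g_s$, and the $\frac{R(u)}{\eta} + \eta G^2 T$ bound can be checked numerically. The random loss sequence below is illustrative; comparators range over the unit ball, so $G = D = 1$.

```python
import math
import random

random.seed(0)
T, d = 1000, 3
grads = [[random.uniform(-1, 1) / math.sqrt(d) for _ in range(d)]
         for _ in range(T)]                     # each ||g_t|| <= 1, so G = 1

eta = 1.0 / math.sqrt(T)                        # tuned for G = D = 1
x = [0.0] * d                                   # x_1 = argmin R = 0
g_sum = [0.0] * d
alg_loss = 0.0
for g in grads:
    alg_loss += sum(xi * gi for xi, gi in zip(x, g))
    g_sum = [s + gi for s, gi in zip(g_sum, g)]
    x = [-eta * s for s in g_sum]               # FTRL closed form

# Best fixed comparator in the unit ball: u = -g_sum / ||g_sum||.
best_loss = -math.sqrt(sum(s * s for s in g_sum))
regret = alg_loss - best_loss
bound = 1.5 * math.sqrt(T)                      # R(u)/eta + eta G^2 T for these choices
print(f"regret = {regret:.2f} <= bound = {bound:.2f}")
```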
ex-ch25-16
Hard: Distributed Kalman filter consistency. Each of $n$ sensors observes $y_i(t) = H_i \theta + v_i(t)$ with $v_i(t)$ iid zero-mean noise. Show that the consensus+innovations update $x_i(t+1) = x_i(t) - \beta_t \sum_{j \in N(i)} \big(x_i(t) - x_j(t)\big) + \alpha_t H_i^\top \big(y_i(t) - H_i x_i(t)\big)$ converges to the centralized ML estimator under appropriate step sizes $\alpha_t$, $\beta_t$.
Split the analysis into the disagreement $x_i(t) - \bar x(t)$ and the average $\bar x(t) = \frac{1}{n}\sum_i x_i(t)$.
Require $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$, and $\alpha_t / \beta_t \to 0$.
Average dynamics
Summing over $i$, the consensus term cancels by symmetry of the graph Laplacian: $\bar x(t+1) = \bar x(t) + \frac{\alpha_t}{n} \sum_i H_i^\top \big(y_i(t) - H_i x_i(t)\big)$. Up to the vanishing disagreement, this is a stochastic approximation driving $\bar x(t)$ toward $\theta$.
Disagreement vanishes
Consensus contracts the disagreement at rate $1 - \beta_t \lambda_2(L)$ per step, while the innovation term injects perturbations of order $\alpha_t$. With $\alpha_t / \beta_t \to 0$, the disagreement shrinks to $0$ in mean square.
Robbins-Monro limit
Standard stochastic approximation arguments (under $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$) yield $x_i(t) \to \theta$ almost surely for every $i$.
ex-ch25-17
Hard: Cost of privacy. Under $\varepsilon$-differential privacy, the minimax risk for estimating a scalar mean over $n$ iid $[0,1]$-valued samples is $\Theta\big(\frac{1}{n} + \frac{1}{(n\varepsilon)^2}\big)$. Derive both regimes and identify the sample size at which privacy becomes free.
Laplace noise of scale $b = 1/(n\varepsilon)$ contributes MSE $2b^2$.
Without privacy the rate is $\Theta(1/n)$.
Non-private part
The empirical mean has variance $O(1/n)$.
Private part
Adding Laplace noise with scale $b = 1/(n\varepsilon)$ adds MSE $2/(n\varepsilon)^2$.
Total and regime
Total risk $\asymp \frac{1}{n} + \frac{1}{(n\varepsilon)^2}$. Privacy is free when $\frac{1}{(n\varepsilon)^2} \lesssim \frac{1}{n}$, i.e. once $n \gtrsim 1/\varepsilon^2$; below that threshold the privacy term dominates.
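Evaluating the two-term risk makes the crossover visible; the value of $\varepsilon$ below is illustrative.

```python
def private_mse(n, eps):
    """Sampling variance (worst case 1/4n on [0,1]) plus Laplace-mechanism MSE."""
    return 1.0 / (4 * n) + 2.0 / (n * eps) ** 2

eps = 0.1
crossover = 8.0 / eps**2     # solve 2/(n eps)^2 = 1/(4n)  =>  n = 8/eps^2
for n in (10, 100, 1000, 10_000):
    dominated = "privacy" if 2.0 / (n * eps) ** 2 > 1.0 / (4 * n) else "sampling"
    print(f"n = {n:>6}: MSE = {private_mse(n, eps):.2e} ({dominated}-dominated)")
print(f"the two terms cross at n = {crossover:.0f}")
```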
ex-ch25-18
Challenge: Open-ended: statistical-computational trade-offs in your field. Pick a telecom estimation problem (one of: channel estimation in massive MIMO with sparse angular support, user-activity detection, symbol decoding with low-precision ADCs). (a) Identify a candidate statistical-computational gap in that problem. (b) Propose a polynomial-time estimator and its rate. (c) Propose a lower-bound strategy (reduction, sum-of-squares lower bound, or average-case hardness).
Sparse recovery with structured (non-iid) measurements often has gaps like sparse PCA's.
Low-rank tensor estimation has very wide gaps.
Average-case hardness from planted clique transfers via black-box reductions.
Example: massive MIMO sparse channel
The angular-domain channel is $k$-sparse over $p$ grid directions, observed through $n$ pilots. The information-theoretic minimax rate is $\asymp \sigma^2 k \log(p/k)/n$; the LASSO achieves $\asymp \sigma^2 k \log p / n$; simple thresholding after matched filtering is cruder still.
Polynomial estimator
Orthogonal matching pursuit or the LASSO with an $\ell_1$ penalty; the analysis in Ch. 14-15 gives rates under mutual incoherence.
Lower bound candidate
A reduction from planted dense submatrix would give conditional lower bounds at $\sqrt{k^2 \log p / n}$, the widely believed barrier for poly-time sparse PCA/CCA. Write it up as a research note.
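A toy version of part (b): ISTA (proximal gradient for the LASSO) recovering a sparse angular-domain channel from random pilots. All dimensions, the measurement model, and the penalty level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 256, 64, 4                       # angular grid, pilots, sparsity

A = rng.normal(size=(n, p)) / np.sqrt(n)   # pilot measurement matrix
h = np.zeros(p)
h[rng.choice(p, size=k, replace=False)] = rng.normal(size=k)
y = A @ h + 0.01 * rng.normal(size=n)      # noisy pilot observations

lam = 0.02                                 # l1 penalty weight
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth part
x = np.zeros(p)
for _ in range(500):                       # ISTA: gradient step + soft threshold
    z = x - (A.T @ (A @ x - y)) / L
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

rel_err = np.linalg.norm(x - h) / np.linalg.norm(h)
print(f"relative estimation error = {rel_err:.3f}")
```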
References
- N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006. The canonical reference for adversarial online learning; Theorem 2.2 is the multiplicative-weights regret bound cited in Section 25.2.
- E. Hazan, Introduction to Online Convex Optimization, 2016. Modern treatment of OCO and regret minimization; used for the online-to-batch conversion and the projected-gradient regret analysis.
- S. Bubeck and N. Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems, 2012. Comprehensive survey of stochastic and adversarial bandits; covers UCB, EXP3, and the lower-bound techniques used in Section 25.2.
- P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-Time Analysis of the Multiarmed Bandit Problem, 2002. Original paper introducing UCB1 with the logarithmic regret bound cited in Theorem 25.2.2.
- H. Robbins and S. Monro, A Stochastic Approximation Method, 1951. Origin of stochastic approximation, the mathematical parent of modern online estimation and SGD.
- L. Xiao and S. Boyd, Fast Linear Iterations for Distributed Averaging, 2004. The reference for optimal symmetric gossip weights and the spectral-gap convergence rate used in Section 25.3.
- S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, Randomized Gossip Algorithms, 2006. Rigorous analysis of averaging and pairwise gossip, including the message-complexity lower bounds compared in the simulation.
- R. Olfati-Saber, J. A. Fax, and R. M. Murray, Consensus and Cooperation in Networked Multi-Agent Systems, 2007. Broad survey of consensus dynamics, Laplacian spectra, and applications to distributed estimation and control.
- S. Kar, J. M. F. Moura, and K. Ramanan, Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication, 2012. Distributed consensus+innovations estimation with convergence guarantees under random link failures.
- Q. Berthet and P. Rigollet, Computational Lower Bounds for Sparse PCA, 2013. Landmark result establishing a statistical-computational gap for sparse PCA under the planted clique hypothesis; the basis for Section 25.1.
- F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, and P. Zhang, Spectral Redemption in Clustering Sparse Networks, 2013. Non-backtracking spectral methods closing part of the statistical-computational gap for stochastic block models.
- C. Dwork, F. McSherry, K. Nissim, and A. Smith, Calibrating Noise to Sensitivity in Private Data Analysis, 2006. Introduces differential privacy and the Laplace mechanism; the starting point for privacy-constrained estimation.
- M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge University Press, 2019. Comprehensive reference for minimax rates, concentration, and the non-asymptotic analysis underlying the tradeoffs in this chapter.
- A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust Stochastic Approximation Approach to Stochastic Programming, 2009. Modern convergence analysis for online / stochastic gradient estimation used throughout Section 25.2.