Exercises

ex-ch08-01

Easy

State Jensen's inequality for a concave function $\varphi$ and a random variable $X$. Apply it to $\varphi = \log$ to derive the bound $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$ for a positive random variable $X$.
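
A quick numerical sanity check of the bound (not part of the proof the exercise asks for): a minimal sketch, assuming numpy is available, with a log-normal $X$ chosen purely for illustration.

```python
import numpy as np

# Monte Carlo check of log E[X] >= E[log X] for a positive random variable.
# X = exp(G) with G ~ N(0, 1) is an illustrative choice, not prescribed by the exercise.
rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=1_000_000))

lhs = np.log(x.mean())   # log E[X]  (analytically 1/2, since E[X] = e^{1/2})
rhs = np.log(x).mean()   # E[log X]  (analytically 0, since log X ~ N(0, 1))
print(f"log E[X] = {lhs:.3f} >= E[log X] = {rhs:.3f}")
```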

ex-ch08-02

Medium

Show directly (without invoking the ELBO identity) that for any valid $q$,
$$\log f(\mathbf{y};\boldsymbol{\theta}) - \mathcal{F}(q,\boldsymbol{\theta}) = D\big(q \,\Vert\, p(\cdot\mid\mathbf{y};\boldsymbol{\theta})\big).$$

ex-ch08-03

Easy

In GMM-EM, show that $\sum_k \gamma_{ik}^{(t)} = 1$ for each $i$.

ex-ch08-04

Medium

Derive the M-step update for the mixing weights $\pi_k$ in a GMM by maximizing $\sum_k N_k \log \pi_k$ subject to $\sum_k \pi_k = 1$, $\pi_k \geq 0$. Identify the Lagrange multiplier.

ex-ch08-05

Medium

Consider a two-component 1-D GMM with known equal variances $\sigma^2 = 1$ and known mixing weights $\pi_1 = \pi_2 = 1/2$. Write out the E-step and M-step updates for $(\mu_1, \mu_2)$ explicitly. Show that the mapping is symmetric under swapping $(\mu_1,\mu_2) \leftrightarrow (\mu_2,\mu_1)$.
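
Once you have derived the updates, they can be checked numerically. The following is a hedged sketch (assuming numpy; the responsibility and weighted-mean formulas in the code are the standard ones and should agree with your derivation), and it also illustrates the swap symmetry via two mirrored initializations.

```python
import numpy as np

def em_two_gaussians(y, mu1, mu2, iters=50):
    """EM for a 1-D two-component GMM with sigma^2 = 1 and pi_1 = pi_2 = 1/2."""
    for _ in range(iters):
        # E-step: responsibility of component 1 (component 2 gets 1 - g1).
        logit = -0.5 * (y - mu1) ** 2 + 0.5 * (y - mu2) ** 2
        g1 = 1.0 / (1.0 + np.exp(-logit))
        # M-step: responsibility-weighted means.
        mu1 = np.sum(g1 * y) / np.sum(g1)
        mu2 = np.sum((1 - g1) * y) / np.sum(1 - g1)
    return mu1, mu2

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])
print(em_two_gaussians(y, mu1=-0.5, mu2=0.5))   # converges near (-2, 2)
print(em_two_gaussians(y, mu1=0.5, mu2=-0.5))   # swapped init -> swapped fixed point
```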

ex-ch08-06

Hard

Prove Fisher's identity: for a latent-variable model with smooth densities,
$$\nabla_{\boldsymbol{\theta}} \log f(\mathbf{y};\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{Z}\mid\mathbf{y},\boldsymbol{\theta}}\!\big[\nabla_{\boldsymbol{\theta}}\log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big].$$
State the regularity conditions needed to interchange differentiation and integration.

ex-ch08-07

Hard

(Louis's formula.) Show that the observed Fisher information can be written as
$$-\nabla_{\boldsymbol{\theta}}^2 \log f(\mathbf{y};\boldsymbol{\theta}) = -\mathbb{E}_{\mathbf{Z}\mid\mathbf{y},\boldsymbol{\theta}}\big[\nabla^2 \log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big] - \operatorname{Cov}_{\mathbf{Z}\mid\mathbf{y},\boldsymbol{\theta}}\big(\nabla \log p(\mathbf{y},\mathbf{Z};\boldsymbol{\theta})\big).$$

ex-ch08-08

Medium

Consider $n$ i.i.d. samples $y_i$ with $y_i = \theta + z_i$ where $z_i \sim \mathcal{N}(0,1)$. Suppose, however, that only $y_i^+ = \max(y_i,0)$ is observed (the data are censored at zero). Set up the EM algorithm for estimating $\theta$. Identify what the latent variable is and write out the E-step.
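
To get a feel for the setup before deriving the E-step, here is a hedged simulation sketch (assuming numpy; the value of $\theta$ is illustrative). It shows that naively averaging the observed $y_i^+$ overestimates $\theta$, which is the bias the EM iteration has to correct.

```python
import numpy as np

# Zero-censored Gaussian data: y_i ~ N(theta, 1), but only y_i^+ = max(y_i, 0) is observed.
rng = np.random.default_rng(2)
theta_true = 0.5
y = theta_true + rng.normal(size=100_000)
y_plus = np.maximum(y, 0.0)

print("mean of observed y+:", y_plus.mean())   # biased upward, since max(y, 0) >= y
print("true theta:         ", theta_true)
# EM treats the unobserved value of each censored sample (those with y_i <= 0) as the
# latent variable; the E-step replaces it by its conditional mean given y_i <= 0 under
# the current theta (a truncated-normal expectation), which is what the exercise derives.
```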

ex-ch08-09

Easy

True or false: if the EM iterates $\boldsymbol{\theta}^{(t)}$ converge to some $\boldsymbol{\theta}^{\star}$, then $\boldsymbol{\theta}^{\star}$ is a global maximum of the log-likelihood. Justify.

ex-ch08-10

Hard

Show that for a GMM with full covariances and $n$ samples, the log-likelihood is unbounded above. Construct an explicit sequence of parameters along which it diverges.
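
A hedged numerical illustration of the degeneracy (assuming numpy; shown in 1-D for brevity, but the full-covariance case behaves the same way): center one component on a single data point and let its variance shrink, and the log-likelihood grows without bound.

```python
import numpy as np

def gmm_loglik(y, mus, sigmas, weights):
    # Log-likelihood of a 1-D GMM: sum over points of log of the mixture density.
    comps = np.array([
        w * np.exp(-0.5 * ((y - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)
        for m, s, w in zip(mus, sigmas, weights)
    ])
    return np.sum(np.log(comps.sum(axis=0)))

rng = np.random.default_rng(3)
y = rng.normal(size=20)

# Component 1 sits exactly on y[0] with shrinking variance; component 2 stays broad,
# so every other point keeps a strictly positive density and only y[0]'s term blows up.
for s in [1.0, 1e-1, 1e-3, 1e-6, 1e-9]:
    print(s, gmm_loglik(y, mus=[y[0], 0.0], sigmas=[s, 1.0], weights=[0.5, 0.5]))
```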

ex-ch08-11

Medium

For K-means, show that Lloyd's algorithm monotonically decreases the distortion $J = \sum_i \|\mathbf{y}_i - \boldsymbol{\mu}_{c_i}\|^2$.
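
The monotonicity is easy to observe empirically before proving it. Below is a minimal sketch of Lloyd's algorithm (assuming numpy; initialization at random data points and the synthetic three-cluster data are illustrative choices) that records the distortion after every iteration.

```python
import numpy as np

def lloyd(Y, K, iters=20, seed=0):
    """Lloyd's algorithm on the rows of Y; returns the distortion J after each iteration."""
    rng = np.random.default_rng(seed)
    mu = Y[rng.choice(len(Y), K, replace=False)]   # initialize centers at data points
    history = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        d2 = ((Y[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # Update step: each nonempty cluster's center moves to the mean of its points.
        for k in range(K):
            if np.any(c == k):
                mu[k] = Y[c == k].mean(axis=0)
        history.append(((Y - mu[c]) ** 2).sum())   # distortion J at the end of the iteration
    return np.array(history)

rng = np.random.default_rng(4)
Y = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (-3, 0, 3)])
J = lloyd(Y, K=3)
print(np.all(np.diff(J) <= 1e-12))   # True: the recorded distortion never increases
```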

ex-ch08-12

Medium

In the SBL model from Section 8.5, derive the M-step update $\alpha_j^{(t+1)} = 1/\big(\mu_j^{(t),2} + \boldsymbol{\Sigma}_{jj}^{(t)}\big)$.

ex-ch08-13

Easy

Why must responsibilities be strictly positive during EM iterations? What happens if $\gamma_{ik}^{(t)} = 0$ for all $i$ and some $k$?

ex-ch08-14

Challenge

(Wu 1983.) Let $\ell(\boldsymbol{\theta})$ be the incomplete-data log-likelihood, continuous on a closed bounded set $\Lambda_0 \subset \Lambda$. Show that if $\{\boldsymbol{\theta}^{(t)}\} \subset \Lambda_0$ is produced by EM, then every limit point is a stationary point of $\ell$, provided $Q(\boldsymbol{\theta}'\mid\boldsymbol{\theta})$ is continuous in both of its arguments $\boldsymbol{\theta}'$ and $\boldsymbol{\theta}$.

ex-ch08-15

Medium

For a mixture of two known-variance Gaussians with known means $\mu_1, \mu_2$ and unknown mixing weight $\pi \in (0,1)$, write the EM update for $\pi$. Interpret the update as a weighted count.

ex-ch08-16

Hard

Consider EM for estimating the mean $\boldsymbol{\theta}$ of a Gaussian $\mathcal{N}(\boldsymbol{\theta},\mathbf{I})$ from i.i.d. observations, where each coordinate $y_{ij}$ of each observation is censored independently: we observe $\tilde{y}_{ij} = y_{ij}$ if $|y_{ij}|\leq c$, else $\tilde{y}_{ij} = \text{censored}$. Derive the EM update for $\boldsymbol{\theta}$.

ex-ch08-17

Medium

Show that the EM updates for a GMM can equivalently be derived by maximizing the ELBO $\mathcal{F}(q,\boldsymbol{\theta})$ by alternating coordinate ascent, where $q$ factorizes as $q(\mathbf{z}) = \prod_i q_i(z_i)$ (which it does exactly for a GMM).

ex-ch08-18

Easy

A student claims: "EM always decreases the KL divergence between $q$ and the posterior." Is this correct? If yes, justify; if no, correct it.

ex-ch08-19

Hard

(Speed of convergence.) Near a local maximum $\boldsymbol{\theta}^{\star}$, show that EM has a linear convergence rate equal to the largest eigenvalue of $\mathbf{I} - \mathbf{I}_c^{-1}(\boldsymbol{\theta}^{\star})\,\mathbf{I}_o(\boldsymbol{\theta}^{\star})$, where $\mathbf{I}_c$ is the expected complete-data Fisher information and $\mathbf{I}_o$ is the observed Fisher information. Interpret the ratio $\mathbf{I}_o/\mathbf{I}_c$ as the "fraction of information not missing."

ex-ch08-20

Medium

You are given 1-D data from a 3-component GMM. You fit it with $K=3$, $K=5$, and $K=10$ components using EM and pick the one with the highest log-likelihood on the fitting data. Why is this procedure flawed? Which criterion should you use instead?
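
One way to see the problem empirically is the hedged sketch below, which assumes scikit-learn is available and uses illustrative synthetic data: the in-sample log-likelihood keeps rising with $K$, whereas a penalized criterion such as BIC does not reward the extra components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# 1-D data drawn from a 3-component GMM (component means are an illustrative choice).
rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(m, 0.5, 300) for m in (-4, 0, 4)]).reshape(-1, 1)

for k in (3, 5, 10):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # score() is the average in-sample log-likelihood; bic() penalizes model complexity.
    print(f"K={k:2d}  loglik/n={gm.score(X):8.3f}  BIC={gm.bic(X):10.1f}")
```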