Exercises
ex-ch08-01
Easy: State Jensen's inequality for a concave function $f$ and a random variable $X$. Apply it to $f = \log$ to derive the bound $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$ for a positive random variable $X$.
A concave $f$ is one for which $-f$ is convex.
Remember that for convex $g$, $\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])$.
Statement
If $f$ is concave and $X$ is a random variable with $\mathbb{E}|X| < \infty$, then $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$.
Apply to $\log$
The logarithm is concave on $(0, \infty)$. Hence $\mathbb{E}[\log X] \le \log \mathbb{E}[X]$ for positive $X$. Equality holds iff $X$ is almost surely constant.
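A quick numerical sanity check of the bound (a minimal sketch; the lognormal distribution, seed, and sample size are arbitrary choices for illustration):

```python
# Monte Carlo check of E[log X] <= log E[X] for a positive random variable.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive X

lhs = np.mean(np.log(x))   # E[log X]
rhs = np.log(np.mean(x))   # log E[X]
print(f"E[log X] = {lhs:.4f} <= log E[X] = {rhs:.4f}: {lhs <= rhs}")
```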
ex-ch08-02
Medium: Show directly (without invoking the ELBO identity) that for any valid $q(\mathbf{z})$, $\mathcal{F}(q, \boldsymbol{\theta}) = \log p(\mathbf{x} \mid \boldsymbol{\theta}) - \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\big)$.
Write $p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\, p(\mathbf{x} \mid \boldsymbol{\theta})$ and insert into $\mathcal{F}(q, \boldsymbol{\theta}) = \mathbb{E}_{q(\mathbf{z})}\!\big[\log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) - \log q(\mathbf{z})\big]$.
Expand $\mathcal{F}$
$\mathcal{F}(q, \boldsymbol{\theta}) = \mathbb{E}_{q(\mathbf{z})}\!\big[\log p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) + \log p(\mathbf{x} \mid \boldsymbol{\theta}) - \log q(\mathbf{z})\big]$.
Separate terms
$\mathcal{F}(q, \boldsymbol{\theta}) = \log p(\mathbf{x} \mid \boldsymbol{\theta}) - \mathbb{E}_{q(\mathbf{z})}\!\Big[\log \tfrac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\Big] = \log p(\mathbf{x} \mid \boldsymbol{\theta}) - \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\big)$. Rearranging gives the stated identity.
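The identity can also be verified numerically for a toy discrete latent variable; the joint table `p_xz` and the distribution `q` below are made-up values, not from the text:

```python
# Check F(q, theta) = log p(x | theta) - KL(q || p(z | x, theta)) on a toy
# discrete latent variable z in {1, 2}.
import numpy as np

p_xz = np.array([0.12, 0.28])        # p(x, z = k) for one fixed observed x
q = np.array([0.6, 0.4])             # any valid distribution over z

log_px = np.log(p_xz.sum())          # log marginal likelihood
post = p_xz / p_xz.sum()             # posterior p(z | x)
elbo = np.sum(q * np.log(p_xz / q))  # F(q, theta)
kl = np.sum(q * np.log(q / post))    # KL(q || posterior)
print(np.isclose(elbo, log_px - kl)) # -> True
```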
ex-ch08-03
Easy: In GMM-EM, show that $\sum_{k=1}^{K} r_{nk} = 1$ for each $n$.
Look at the definition of the responsibility: the denominator is the same for all $k$.
Sum the numerators
$\sum_{k=1}^{K} r_{nk} = \frac{\sum_{k} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} = 1$.
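A small numerical illustration (placeholder parameter values; relies on `scipy.stats.norm`):

```python
# Responsibilities share one denominator per sample, so each row sums to 1.
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.3, 2.5])                 # three 1-D samples
pis = np.array([0.2, 0.5, 0.3])                # mixing weights
mus = np.array([-2.0, 0.0, 3.0])
sigmas = np.array([1.0, 0.5, 1.5])

num = pis * norm.pdf(x[:, None], loc=mus, scale=sigmas)  # pi_k * N(x_n | mu_k, sigma_k^2)
resp = num / num.sum(axis=1, keepdims=True)              # normalize per sample
print(resp.sum(axis=1))                                  # -> [1. 1. 1.]
```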
ex-ch08-04
Medium: Derive the M-step update for the mixing weights $\pi_k$ in a GMM by maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{n}\sum_{k} r_{nk} \log\big[\pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\big]$ subject to $\sum_k \pi_k = 1$, $\pi_k \ge 0$. Identify the Lagrange multiplier.
Form the Lagrangian and differentiate.
The multiplier will be determined by the sum constraint.
Lagrangian
$\mathcal{L} = \sum_{n}\sum_{k} r_{nk} \log \pi_k + \lambda\Big(\sum_{k} \pi_k - 1\Big) + \text{terms not involving } \boldsymbol{\pi}$.
Stationarity
$\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\sum_n r_{nk}}{\pi_k} + \lambda = 0 \;\Rightarrow\; \pi_k = -\frac{N_k}{\lambda}$, where $N_k = \sum_n r_{nk}$.
Constraint
$\sum_k \pi_k = -\frac{1}{\lambda}\sum_k N_k = -\frac{N}{\lambda} = 1$, so $\lambda = -N$. Hence $\pi_k = \frac{N_k}{N}$. Non-negativity holds automatically since $r_{nk} \ge 0$.
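A minimal sketch of this update on a toy responsibility matrix (the numbers are arbitrary):

```python
# M-step for the mixing weights: pi_k = N_k / N with soft counts N_k.
import numpy as np

resp = np.array([[0.9, 0.1],     # toy responsibilities; each row sums to 1
                 [0.2, 0.8],
                 [0.5, 0.5],
                 [0.7, 0.3]])
N_k = resp.sum(axis=0)           # soft count per component
pi_new = N_k / resp.shape[0]     # divide by N; sums to 1 by construction
print(pi_new, pi_new.sum())
```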
ex-ch08-05
Medium: Consider a two-component 1-D GMM with known equal variances $\sigma^2$ and known mixing weights $\pi_1 = \pi_2 = \tfrac{1}{2}$. Write out the E-step and M-step updates for $(\mu_1, \mu_2)$ explicitly. Show that the mapping is symmetric under swapping $\mu_1 \leftrightarrow \mu_2$.
The responsibility is a logistic sigmoid of the difference of the two components' log-densities.
E-step
$r_{n1} = \mathrm{sigm}\!\Big(\frac{(\mu_1 - \mu_2)\big(x_n - \frac{\mu_1 + \mu_2}{2}\big)}{\sigma^2}\Big)$ (logistic sigmoid); $r_{n2} = 1 - r_{n1}$.
M-step
$\mu_k^{\text{new}} = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$ for $k = 1, 2$.
Symmetry
Swapping $\mu_1 \leftrightarrow \mu_2$ swaps $r_{n1} \leftrightarrow r_{n2}$ for every $n$, hence swaps the M-step outputs. The symmetric fixed point $\mu_1 = \mu_2$ is stationary; breaking symmetry in initialization is essential.
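A minimal sketch of the resulting E/M map, checking the swap symmetry numerically; the data, variance, and starting points are illustrative assumptions:

```python
# One EM iteration (E-step + M-step) for a two-component 1-D GMM with known
# shared variance and equal weights; swapping the means swaps the outputs.
import numpy as np

def em_map(mu1, mu2, x, sigma2=1.0):
    # E-step: r_n1 = sigm((mu1 - mu2) * (x_n - (mu1 + mu2) / 2) / sigma2)
    a = (mu1 - mu2) * (x - 0.5 * (mu1 + mu2)) / sigma2
    r1 = 1.0 / (1.0 + np.exp(-a))
    r2 = 1.0 - r1
    # M-step: responsibility-weighted means
    return (r1 @ x) / r1.sum(), (r2 @ x) / r2.sum()

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

print(em_map(-1.0, 1.0, x))   # (new mu1, new mu2)
print(em_map(1.0, -1.0, x))   # swapped inputs give swapped outputs
```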
ex-ch08-06
Hard: Prove Fisher's identity: for a latent-variable model with smooth densities, $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\big[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\big]$. State the regularity conditions needed to interchange differentiation and integration.
Differentiate under the integral sign.
Differentiate the marginal
Assume dominated convergence applies (an integrable bound on $\|\nabla_{\boldsymbol{\theta}}\, p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\|$, uniform in $\boldsymbol{\theta}$ locally). Then $\nabla_{\boldsymbol{\theta}}\, p(\mathbf{x} \mid \boldsymbol{\theta}) = \int \nabla_{\boldsymbol{\theta}}\, p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\, d\mathbf{z}$.
Log-derivative identity
$\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \frac{\nabla_{\boldsymbol{\theta}}\, p(\mathbf{x} \mid \boldsymbol{\theta})}{p(\mathbf{x} \mid \boldsymbol{\theta})} = \int \frac{p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})}{p(\mathbf{x} \mid \boldsymbol{\theta})}\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\, d\mathbf{z}$, using $\nabla_{\boldsymbol{\theta}}\, p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) = p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})$.
Recognize the posterior
The fraction equals $p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})$, so the right-hand side is the posterior expectation of the complete-data score, as claimed.
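Fisher's identity can be spot-checked numerically for a simple two-component Gaussian mixture, differentiating with respect to $\mu_1$; all values below are illustrative:

```python
# Compare d/dmu1 log p(x) (finite difference) with the posterior expectation
# of the complete-data score, E_{p(z|x)}[d/dmu1 log p(x, z)] = r1 * (x - mu1).
import numpy as np
from scipy.stats import norm

def log_marginal(mu1, x=0.7, mu2=2.0):
    return np.log(0.5 * norm.pdf(x, mu1, 1) + 0.5 * norm.pdf(x, mu2, 1))

x, mu1, eps = 0.7, -1.0, 1e-6
lhs = (log_marginal(mu1 + eps) - log_marginal(mu1 - eps)) / (2 * eps)
r1 = 0.5 * norm.pdf(x, mu1, 1) / np.exp(log_marginal(mu1))
rhs = r1 * (x - mu1)   # complete-data score is x - mu1 for z = 1, 0 for z = 2
print(lhs, rhs)        # the two values should match closely
```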
ex-ch08-07
Hard: (Louis's formula.) Show that the observed Fisher information can be written as $-\nabla^2_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) = -\mathbb{E}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\big[\nabla^2_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\big] - \mathrm{Cov}_{p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})}\big[\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})\big]$.
Differentiate Fisher's identity once more.
Use the fact that the score has zero posterior mean at the MLE.
Differentiate Fisher's identity
Let $\mathbf{s}(\boldsymbol{\theta}; \mathbf{z}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta})$. Then Fisher's identity reads $\nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \int p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\, \mathbf{s}(\boldsymbol{\theta}; \mathbf{z})\, d\mathbf{z}$. Differentiating once more using the product rule: $\nabla^2_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta}) = \int p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}}\, \mathbf{s}(\boldsymbol{\theta}; \mathbf{z})\, d\mathbf{z} + \int \nabla_{\boldsymbol{\theta}}\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\, \mathbf{s}(\boldsymbol{\theta}; \mathbf{z})^{\top} d\mathbf{z}$.
Score identity for the posterior
$\nabla_{\boldsymbol{\theta}}\, p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta})\big[\mathbf{s}(\boldsymbol{\theta}; \mathbf{z}) - \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x} \mid \boldsymbol{\theta})\big]$, since $\log p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) = \log p(\mathbf{x}, \mathbf{z} \mid \boldsymbol{\theta}) - \log p(\mathbf{x} \mid \boldsymbol{\theta})$.
Combine
Plugging in and simplifying yields Louis's formula. The first term is the expected complete-data Hessian; the second is (minus) the posterior covariance of the complete-data score.
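A similar spot check of Louis's formula for a two-component mixture with parameter $\mu_1$; the observed information is approximated by a finite difference, and all numbers are illustrative:

```python
# Observed information (finite-difference second derivative) vs. Louis's
# decomposition for a two-component mixture with parameter mu1.
import numpy as np
from scipy.stats import norm

def log_marginal(mu1, x=0.7, mu2=2.0):
    return np.log(0.5 * norm.pdf(x, mu1, 1) + 0.5 * norm.pdf(x, mu2, 1))

x, mu1, eps = 0.7, -1.0, 1e-4
obs_info = -(log_marginal(mu1 + eps) - 2 * log_marginal(mu1)
             + log_marginal(mu1 - eps)) / eps**2
# complete-data Hessian: -1 if z = 1, 0 if z = 2; complete-data score: (x - mu1) or 0
r1 = 0.5 * norm.pdf(x, mu1, 1) / np.exp(log_marginal(mu1))
louis = r1 - r1 * (1 - r1) * (x - mu1) ** 2   # E[-Hessian] - Var[score] under the posterior
print(obs_info, louis)                        # the two values should agree closely
```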
ex-ch08-08
Medium: Consider $N$ i.i.d. samples with $z_n \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is known. Suppose, however, that only $x_n = \max(z_n, 0)$ is observed (the data are censored at zero). Set up the EM algorithm for estimating $\mu$. Identify what the latent variable is and write out the E-step.
The latent variable is $z_n$ (the uncensored value) for those $n$ with $x_n = 0$.
The E-step requires $\mathbb{E}[z_n \mid z_n \le 0, \mu^{\text{old}}]$, a truncated-Gaussian mean.
Complete data
For each censored $n$ (i.e. $x_n = 0$), the latent variable $z_n$ satisfies $z_n \le 0$ and $z_n \sim \mathcal{N}(\mu, \sigma^2)$ truncated to $(-\infty, 0]$.
E-step
For censored $n$: $\hat{z}_n = \mathbb{E}[z_n \mid z_n \le 0, \mu^{\text{old}}] = \mu^{\text{old}} - \sigma\, \frac{\phi(\mu^{\text{old}}/\sigma)}{\Phi(-\mu^{\text{old}}/\sigma)}$ (the mean of a Gaussian truncated to $(-\infty, 0]$), where $\phi, \Phi$ are the standard normal pdf/cdf. For uncensored $n$: $\hat{z}_n = x_n$.
M-step
$\mu^{\text{new}} = \frac{1}{N}\sum_{n=1}^{N} \hat{z}_n$, a simple sample mean over the imputed values.
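A minimal sketch of the full EM loop for this censored model, assuming synthetic data and a fixed iteration count:

```python
# EM for the zero-censored Gaussian mean with known sigma, as derived above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma, mu_true = 1.0, 0.7
z = rng.normal(mu_true, sigma, 5_000)
x = np.maximum(z, 0.0)            # observed data, censored at zero
censored = (x == 0.0)

mu = 0.0                          # initial guess
for _ in range(50):
    # E-step: impute each censored value with the truncated-Gaussian mean
    z_hat = x.copy()
    z_hat[censored] = mu - sigma * norm.pdf(mu / sigma) / norm.cdf(-mu / sigma)
    # M-step: plain sample mean of imputed/observed values
    mu = z_hat.mean()

print(f"EM estimate of mu: {mu:.3f} (true value {mu_true})")
```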
ex-ch08-09
Easy: True or false: if the EM iterates $\boldsymbol{\theta}^{(t)}$ converge to some $\boldsymbol{\theta}^\star$, then $\boldsymbol{\theta}^\star$ is a global maximum of the log-likelihood. Justify.
Think about non-convex optimization in general.
Answer
False. EM converges to a stationary point (a local maximum, a saddle point, or sometimes a point on the boundary of the parameter space). The log-likelihood for mixtures and most latent-variable models is non-concave, so local maxima are generic. Multi-start is the standard remedy.
ex-ch08-10
Hard: Show that for a GMM with full covariances and $N$ samples, the log-likelihood is unbounded above. Construct an explicit sequence of parameters along which it diverges.
Let one component's mean equal $\mathbf{x}_1$ and send its covariance to zero.
Construction
Fix $\boldsymbol{\mu}_1 = \mathbf{x}_1$ (say), $\boldsymbol{\Sigma}_1 = \epsilon \mathbf{I}$, $\pi_1 = \tfrac{1}{2}$, and keep the other component with any finite parameters covering all data.
Divergence
$\mathcal{N}(\mathbf{x}_1 \mid \boldsymbol{\mu}_1, \epsilon \mathbf{I}) = (2\pi\epsilon)^{-D/2} \to \infty$ as $\epsilon \to 0$. Hence the term for $\mathbf{x}_1$ diverges, while every other sample retains a finite log-density through the second component, and the total log-likelihood (a sum over samples) also diverges. Remedy: regularize the covariances.
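The divergence is easy to see numerically; the sketch below uses arbitrary toy data and an $\epsilon$ schedule chosen for illustration:

```python
# Log-likelihood of a 2-component GMM as one component collapses onto x[0].
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 50)

def loglik(eps):
    # component 1: mean x[0], variance eps; component 2: fixed N(0, 1)
    dens = 0.5 * norm.pdf(x, loc=x[0], scale=np.sqrt(eps)) \
         + 0.5 * norm.pdf(x, loc=0.0, scale=1.0)
    return np.log(dens).sum()

for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    print(f"variance eps = {eps:.0e}: log-likelihood = {loglik(eps):.1f}")
```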
ex-ch08-11
Medium: For K-means, show that Lloyd's algorithm monotonically decreases the distortion $J = \sum_{n=1}^{N} \big\|\mathbf{x}_n - \boldsymbol{\mu}_{c_n}\big\|^2$, where $c_n$ is the cluster to which $\mathbf{x}_n$ is assigned.
Both the assignment step and the update step separately decrease $J$ (more precisely, never increase it).
Assignment step
Holding the means $\{\boldsymbol{\mu}_k\}$ fixed, reassigning each $\mathbf{x}_n$ to its nearest mean minimizes its contribution to $J$, so $J$ does not increase.
Update step
Holding the assignments fixed, $J$ becomes a separable sum of quadratics in the $\boldsymbol{\mu}_k$; each is minimized by the mean of its cluster. So the update does not increase $J$ either. Combining, $J$ is monotonically non-increasing. Since $J \ge 0$ and there are finitely many possible assignments, Lloyd's algorithm converges in finitely many iterations.
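A minimal sketch of Lloyd's algorithm that records $J$ at every iteration and checks monotonicity; the data and initialization are arbitrary:

```python
# Lloyd's algorithm with the distortion J recorded once per iteration.
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, (100, 2)) for m in (-2.0, 0.0, 2.0)])
K = 3
mu = X[rng.choice(len(X), K, replace=False)]        # random initial means

history = []
for _ in range(20):
    # assignment step: nearest mean
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    c = d2.argmin(axis=1)
    history.append(d2[np.arange(len(X)), c].sum())   # current distortion J
    # update step: cluster means (keep the old mean if a cluster empties)
    mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                   for k in range(K)])

print(np.all(np.diff(history) <= 1e-9))              # J never increases
```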
ex-ch08-12
Medium: In the SBL model from Section 8.5, derive the M-step update $\alpha_j^{\text{new}} = \frac{1}{m_j^2 + \Sigma_{jj}}$, where $m_j$ and $\Sigma_{jj}$ are the posterior mean and variance of the weight $w_j$.
The complete-data log-likelihood (viewing $\mathbf{w}$ as latent) contains $\sum_j \big(\tfrac{1}{2}\log \alpha_j - \tfrac{1}{2}\alpha_j w_j^2\big)$.
Take the posterior expectation, then maximize over .
Expected log-prior
$\mathbb{E}\big[\log p(\mathbf{w} \mid \boldsymbol{\alpha})\big] = \sum_j \Big(\tfrac{1}{2}\log \alpha_j - \tfrac{1}{2}\alpha_j\, \mathbb{E}[w_j^2]\Big) + \text{const}$, with $\mathbb{E}[w_j^2] = m_j^2 + \Sigma_{jj}$.
Maximize
Differentiating with respect to $\alpha_j$: $\frac{1}{2\alpha_j} - \frac{1}{2}\mathbb{E}[w_j^2] = 0 \;\Rightarrow\; \alpha_j^{\text{new}} = \frac{1}{\mathbb{E}[w_j^2]} = \frac{1}{m_j^2 + \Sigma_{jj}}$.
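A tiny illustration of the update, assuming hypothetical posterior moments `m` and `Sigma_diag` (names invented for this sketch):

```python
# alpha_j <- 1 / (m_j^2 + Sigma_jj): large alpha prunes near-zero weights.
import numpy as np

m = np.array([0.8, -0.05, 0.0])              # hypothetical posterior means of w
Sigma_diag = np.array([0.04, 0.01, 0.002])   # hypothetical posterior variances of w
alpha_new = 1.0 / (m ** 2 + Sigma_diag)
print(alpha_new)
```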
ex-ch08-13
Easy: Why must responsibilities be strictly positive during EM iterations? What happens if $r_{nk} = 0$ for all $n$ and some $k$?
Look at $N_k = \sum_n r_{nk}$ in the M-step.
Answer
If $N_k = 0$, the M-step divisions become $0/0$ and the component is dead: it cannot recover, because zero responsibility gives $\pi_k = 0$, i.e. zero prior weight in the next E-step. In practice, maintain $r_{nk} > 0$ by regularizing (e.g. a small floor on the responsibilities or a prior on $\boldsymbol{\pi}$), or drop the collapsed component and restart with $K - 1$ components.
ex-ch08-14
Challenge: (Wu 1983.) Let $\ell(\boldsymbol{\theta}) = \log p(\mathbf{x} \mid \boldsymbol{\theta})$ be the incomplete-data log-likelihood, continuous on a closed bounded set $\Theta$. Show that if $\{\boldsymbol{\theta}^{(t)}\}$ is produced by EM, then every limit point is a stationary point of $\ell$, provided the map $Q(\boldsymbol{\theta}, \boldsymbol{\theta}')$ is continuous in both arguments.
Use monotonicity + boundedness to get convergence of $\ell(\boldsymbol{\theta}^{(t)})$.
Convert the fixed-point property of a limit point into a stationarity condition via Fisher's identity.
Convergence of likelihood
$\ell(\boldsymbol{\theta}^{(t)})$ is non-decreasing (Theorem 8.2) and bounded on $\Theta$, hence converges to some $\ell^\star$.
Fixed-point of a limit point
Let $\bar{\boldsymbol{\theta}}$ be a limit point, $\boldsymbol{\theta}^{(t_j)} \to \bar{\boldsymbol{\theta}}$. By continuity of the EM map $M$ and the fact that every accumulation point attains the same log-likelihood value $\ell^\star$, $\bar{\boldsymbol{\theta}}$ is a fixed point: $M(\bar{\boldsymbol{\theta}}) = \bar{\boldsymbol{\theta}}$.
Stationarity
Interior fixed points satisfy $\nabla_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \bar{\boldsymbol{\theta}})\big|_{\boldsymbol{\theta} = \bar{\boldsymbol{\theta}}} = \mathbf{0}$. Fisher's identity gives $\nabla_{\boldsymbol{\theta}}\, \ell(\bar{\boldsymbol{\theta}}) = \nabla_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \bar{\boldsymbol{\theta}})\big|_{\boldsymbol{\theta} = \bar{\boldsymbol{\theta}}} = \mathbf{0}$, so $\bar{\boldsymbol{\theta}}$ is a stationary point of $\ell$.
ex-ch08-15
Medium: For a mixture of two known-variance Gaussians with known means and unknown mixing weight $\pi$, write the EM update for $\pi$. Interpret the update as a weighted count.
Responsibility is $r_{n1} = \frac{\pi\, \mathcal{N}(x_n \mid \mu_1, \sigma_1^2)}{\pi\, \mathcal{N}(x_n \mid \mu_1, \sigma_1^2) + (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \sigma_2^2)}$.
E-step
$r_{n1}$ as above; $r_{n2} = 1 - r_{n1}$.
M-step
$\pi^{\text{new}} = \frac{1}{N}\sum_{n=1}^{N} r_{n1}$, the average soft count of samples assigned to component 1.
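A minimal sketch of this EM loop on synthetic data (the component means, variances, and true weight are assumptions for illustration):

```python
# EM for the unknown mixing weight of two fixed Gaussians N(-2, 1) and N(2, 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
pi_true = 0.3
from_first = rng.random(2_000) < pi_true
x = np.where(from_first, rng.normal(-2, 1, 2_000), rng.normal(2, 1, 2_000))

pi = 0.5
for _ in range(100):
    # E-step: responsibility of component 1
    p1 = pi * norm.pdf(x, -2, 1)
    p2 = (1 - pi) * norm.pdf(x, 2, 1)
    r1 = p1 / (p1 + p2)
    # M-step: average soft count
    pi = r1.mean()

print(f"EM estimate of pi: {pi:.3f} (true value {pi_true})")
```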
ex-ch08-16
Hard: Consider EM for estimating the mean $\boldsymbol{\mu}$ of a Gaussian with known diagonal covariance, from $N$ i.i.d. observations, where each component of each observation is censored independently: we observe $x_{nd}$ if $x_{nd} > c$, else only the fact that $x_{nd} \le c$. Derive the EM update for $\boldsymbol{\mu}$.
Treat uncensored components as observed; censored ones as latent.
For a censored component: $\mathbb{E}[x_{nd} \mid x_{nd} \le c]$ is a truncated-Gaussian integral.
Imputation
For each censored component $d$ of observation $n$, let $\hat{x}_{nd} = \mathbb{E}\big[x_{nd} \mid x_{nd} \le c, \boldsymbol{\mu}^{\text{old}}\big] = \mu_d^{\text{old}} - \sigma_d\, \frac{\phi(\beta_d)}{\Phi(\beta_d)}$ with $\beta_d = (c - \mu_d^{\text{old}})/\sigma_d$; for uncensored components, $\hat{x}_{nd} = x_{nd}$.
M-step
$\boldsymbol{\mu}^{\text{new}} = \frac{1}{N}\sum_{n=1}^{N} \hat{\mathbf{x}}_n$ (component-wise, using imputed or observed values).
ex-ch08-17
Medium: Show that the EM updates for a GMM can equivalently be derived by maximizing the ELBO $\mathcal{F}(q, \boldsymbol{\theta})$ by alternating coordinate ascent, where $q(\mathbf{z})$ factorizes as $\prod_n q_n(z_n)$ (which it does exactly for a GMM).
Optimizing over $q$ at fixed $\boldsymbol{\theta}$ yields $q_n(z_n = k) = r_{nk}$.
q-step
$\max_q \mathcal{F}(q, \boldsymbol{\theta})$ at fixed $\boldsymbol{\theta}$ has the unconstrained solution $q_n(z_n) = p(z_n \mid \mathbf{x}_n, \boldsymbol{\theta})$, i.e. $q_n(z_n = k) = r_{nk}$.
$\boldsymbol{\theta}$-step
$\max_{\boldsymbol{\theta}} \mathcal{F}(q, \boldsymbol{\theta})$ with $q$ fixed reduces to maximizing $\mathbb{E}_q\big[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\big] = Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, which recovers the Theorem 8.4 updates.
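A compact sketch of the alternating scheme for a 1-D GMM, tracking the log-likelihood to confirm monotone ascent; data and initialization are arbitrary:

```python
# Alternating q-step / theta-step for a 1-D GMM; the log-likelihood is monotone.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 0.7, 200), rng.normal(1.5, 0.7, 300)])
K, N = 2, len(x)
pi, mu, sig = np.full(K, 1 / K), np.array([-1.0, 1.0]), np.ones(K)

loglik = []
for _ in range(50):
    # q-step (E-step): q_n(k) = r_nk, the posterior over the latent component
    dens = pi * norm.pdf(x[:, None], mu, sig)
    r = dens / dens.sum(axis=1, keepdims=True)
    loglik.append(np.log(dens.sum(axis=1)).sum())
    # theta-step (M-step): maximize E_q[log p(X, Z | theta)]
    Nk = r.sum(axis=0)
    pi = Nk / N
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sig = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(np.all(np.diff(loglik) >= -1e-9))   # monotone non-decreasing
```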
ex-ch08-18
Easy: A student claims: "EM always decreases the KL divergence between $q(\mathbf{z})$ and the posterior." Is this correct? If yes, justify; if no, correct it.
The E-step sends the KL to zero, but the M-step changes the posterior.
Answer
Partially correct. The E-step sets $q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}^{\text{old}})$, so the KL drops to zero with respect to the current parameters. The M-step then moves $\boldsymbol{\theta}$, which may increase the KL to the new posterior. The quantity that monotonically increases is the log-likelihood $\log p(\mathbf{x} \mid \boldsymbol{\theta}^{(t)})$, not the negative KL.
ex-ch08-19
Hard: (Speed of convergence.) Near a local maximum $\boldsymbol{\theta}^\star$, show that EM has linear convergence rate equal to the largest eigenvalue of $\mathbf{I} - \mathbf{I}_c(\boldsymbol{\theta}^\star)^{-1}\, \mathbf{I}_{\text{obs}}(\boldsymbol{\theta}^\star)$, where $\mathbf{I}_c$ is the expected complete-data Fisher information and $\mathbf{I}_{\text{obs}}$ is the observed Fisher information. Interpret the ratio $\mathbf{I}_c^{-1}\mathbf{I}_{\text{obs}}$ as the "fraction of information not missing."
Linearize the EM map around $\boldsymbol{\theta}^\star$; use Louis's formula for the observed Fisher information.
Linearize
Write the EM map as $\boldsymbol{\theta}^{(t+1)} = M(\boldsymbol{\theta}^{(t)})$. At the fixed point $\boldsymbol{\theta}^\star = M(\boldsymbol{\theta}^\star)$, the Jacobian is $\nabla M(\boldsymbol{\theta}^\star) = \mathbf{I} - \mathbf{I}_c(\boldsymbol{\theta}^\star)^{-1}\, \mathbf{I}_{\text{obs}}(\boldsymbol{\theta}^\star)$ (Dempster-Laird-Rubin 1977, §3.2).
Rate
The convergence rate is the spectral radius of $\nabla M(\boldsymbol{\theta}^\star)$. When little information is missing ($\mathbf{I}_{\text{obs}} \approx \mathbf{I}_c$) the rate is small (fast convergence). When much is missing ($\mathbf{I}_{\text{obs}} \ll \mathbf{I}_c$) the rate approaches 1 (slow convergence), which motivates acceleration methods.
ex-ch08-20
Medium: You are given 1-D data from a 3-component GMM. You fit it with several values of $K$ using EM and pick the one with the highest log-likelihood on the fitting data. Why is this procedure flawed? Which criterion should you use instead?
More components = more parameters = higher training likelihood, always.
Flaw
Log-likelihood on the training data increases monotonically with $K$ (more flexibility, more overfitting). In the extreme case $K = N$, each component can collapse to a Dirac at a data point and the likelihood diverges.
Remedies
Use a criterion that penalizes complexity: BIC $= \log \hat{L} - \tfrac{d}{2}\log N$ (where $d$ is the number of free parameters), AIC $= \log \hat{L} - d$, or cross-validated likelihood. For GMMs, BIC is standard.
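A short sketch of BIC-based selection, assuming scikit-learn's `GaussianMixture` is available (note its BIC convention is lower-is-better) and using toy data:

```python
# Fit GMMs with increasing K; training log-likelihood keeps rising, but BIC
# penalizes the extra parameters and favors a parsimonious model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(m, 0.4, 150) for m in (-3.0, 0.0, 3.0)])[:, None]

for K in range(1, 7):
    gm = GaussianMixture(n_components=K, random_state=0).fit(x)
    print(f"K={K}: train loglik={gm.score(x) * len(x):9.1f}  BIC={gm.bic(x):9.1f}")
```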