Semantic Metrics

What Makes a Good Reconstruction?

Classical information theory measures distortion by comparing source and reconstruction symbol by symbol: MSE, Hamming distance, absolute error. But humans and machines evaluate quality differently. Two images can have the same MSE yet look vastly different — one with imperceptible high-frequency noise, the other with visible blurring. A speech signal with low MSE may sound robotic, while one with higher MSE but preserved prosody sounds natural. The question is: can we define distortion measures that capture perceptual or semantic quality, and what are the information-theoretic implications?

Definition:

Semantic Distortion Measures

A semantic distortion measure $d_{\text{sem}}(s, \hat{s})$ evaluates the quality of a reconstruction $\hat{s}$ based on task-relevant or perceptual criteria rather than symbol-by-symbol fidelity. Common examples:

  • Perceptual quality: $d_{\text{perc}}(s, \hat{s}) = \|f(s) - f(\hat{s})\|^2$, where $f$ is a learned feature extractor (e.g., VGG features for images, wav2vec for audio)
  • Task accuracy: $d_{\text{task}}(s, \hat{s}) = \mathbb{1}[c(\hat{s}) \neq c(s)]$, where $c$ is a classifier — distortion is 0 if the task output is preserved, 1 otherwise
  • Semantic similarity: $d_{\text{sim}}(s, \hat{s}) = 1 - \cos(e(s), e(\hat{s}))$, where $e$ is an embedding function (e.g., CLIP for image-text alignment)
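
The three measures above can be sketched directly. The feature extractor `f`, classifier `c`, and embedding `e` below are toy stand-ins for illustration, not real pretrained models:

```python
import numpy as np

def d_perc(s, s_hat, f):
    """Perceptual distortion: squared distance in feature space."""
    return float(np.sum((f(s) - f(s_hat)) ** 2))

def d_task(s, s_hat, c):
    """Task distortion: 0 if the classifier's output is preserved, else 1."""
    return 0.0 if c(s_hat) == c(s) else 1.0

def d_sim(s, s_hat, e):
    """Semantic-similarity distortion: 1 - cosine similarity of embeddings."""
    u, v = e(s), e(s_hat)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for the learned components (illustrative assumptions):
f = lambda s: s[:4]                    # "feature extractor": first 4 dims
c = lambda s: int(np.mean(s) > 0)      # "classifier": sign of the mean
e = lambda s: s / np.linalg.norm(s)    # "embedding": unit-normalized input

rng = np.random.default_rng(0)
s = rng.normal(size=16)
s_hat = s + 0.1 * rng.normal(size=16)  # mildly distorted reconstruction
print(d_perc(s, s_hat, f), d_task(s, s_hat, c), d_sim(s, s_hat, e))
```

Note that a reconstruction can have large pixel-level error yet zero task distortion, which is exactly the gap these measures are designed to expose.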

Theorem: Rate-Distortion Theory with Semantic Distortion

For a source $S$ with distribution $P_S$ and a semantic distortion measure $d_{\text{sem}}(s, \hat{s})$, the rate-distortion function is

$$R_{\text{sem}}(D) = \min_{P_{\hat{S}|S}:\, \mathbb{E}[d_{\text{sem}}(S, \hat{S})] \leq D} I(S; \hat{S})$$

When $d_{\text{sem}}$ is a feature-space MSE $\|f(S) - f(\hat{S})\|^2$, this reduces to the rate-distortion function of the feature vector $f(S)$:

$$R_{\text{sem}}(D) \leq R_{f(S)}(D)$$

with equality when $f$ is sufficient for reconstruction.

The rate-distortion function with semantic distortion can be much lower than with MSE because the semantic measure ignores irrelevant variations. If $f(S)$ is a low-dimensional feature, then $R_{f(S)}(D) \ll R_S(D)$ — we need far fewer bits to preserve features than to preserve pixels. This is why semantic communication can achieve dramatic compression gains.

Example: MSE vs. Perceptual Distortion for Images

A $d$-dimensional Gaussian source has covariance $\Sigma_S$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$. A perceptual feature extractor projects onto the top $m \ll d$ principal components: $f(S) = U_m^\top S \in \mathbb{R}^m$. Compare $R_{\text{MSE}}(D)$ with $R_{\text{perc}}(D)$.
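
A minimal numerical sketch of this comparison uses reverse water-filling for the Gaussian rate-distortion function. The decaying spectrum, feature dimension, and distortion level below are assumptions chosen for illustration:

```python
import numpy as np

def gaussian_rd(eigvals, D):
    """Rate (bits) to reconstruct a Gaussian vector with the given spectrum
    at total squared-error distortion D, via reverse water-filling."""
    lam = np.sort(np.asarray(eigvals, float))[::-1]
    # Bisect for the water level theta with sum(min(theta, lam)) = D.
    lo, hi = 0.0, lam[0]
    for _ in range(100):
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, lam).sum() < D:
            lo = theta
        else:
            hi = theta
    d_i = np.minimum(theta, lam)           # per-component distortion
    return float(0.5 * np.sum(np.log2(lam / d_i)))

# Spectrum with fast decay: most energy sits in the top few components.
d, m = 64, 4
lam = 1.0 / (1 + np.arange(d)) ** 2        # eigenvalues of Sigma_S (assumed)
R_mse = gaussian_rd(lam, D=0.1)            # must preserve all d components
R_perc = gaussian_rd(lam[:m], D=0.1)       # only the top-m feature components
print(R_mse, R_perc)                       # R_perc is far smaller than R_mse
```

Because the perceptual measure discards the $d - m$ tail components entirely, the feature rate-distortion curve sits well below the pixel-level one at every distortion level.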

Definition:

The Perception-Distortion Tradeoff

The perception-distortion tradeoff (Blau and Michaeli, 2018) states that for any reconstruction $\hat{S}$ of source $S$:

$$\text{MSE}(S, \hat{S}) + \lambda \cdot d_{\text{perc}}(P_S, P_{\hat{S}}) \geq D_{\min}(\lambda)$$

where $d_{\text{perc}}(P_S, P_{\hat{S}})$ measures the divergence between the distribution of the source and the distribution of reconstructions (e.g., FID for images). Perfect perception ($P_{\hat{S}} = P_S$) requires higher MSE, and low MSE requires imperfect perception (blurring).

This tradeoff explains why MMSE estimators produce blurry images: they minimize MSE but the output distribution PS^P_{\hat{S}} concentrates around the conditional mean, losing the sharpness of PSP_S. Generative models (GANs, diffusion models) sacrifice MSE to improve perceptual quality by generating realistic-looking samples.
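This can be seen numerically in a scalar Gaussian denoising problem, where both the MMSE estimator and posterior sampling have closed forms. The variances below are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, sigma2_n, n = 1.0, 1.0, 200_000     # assumed signal/noise variances
s = rng.normal(0.0, np.sqrt(sigma2), n)     # source S ~ N(0, sigma2)
y = s + rng.normal(0.0, np.sqrt(sigma2_n), n)  # noisy observation Y

a = sigma2 / (sigma2 + sigma2_n)            # Wiener gain, here a = 0.5

# (1) MMSE estimator: minimizes MSE, but its output is "blurry" --
#     Var(a*Y) = a*sigma2 < sigma2, so the output distribution is too narrow.
s_mmse = a * y
mse_mmse = np.mean((s - s_mmse) ** 2)       # = a*sigma2_n = 0.5 in theory

# (2) Posterior sampling: draw from P(S|Y).  Its marginal matches P_S
#     exactly (perfect "perception"), but the MSE doubles.
s_post = a * y + rng.normal(0.0, np.sqrt(a * sigma2_n), n)
mse_post = np.mean((s - s_post) ** 2)       # = 2*a*sigma2_n = 1.0 in theory

print(mse_mmse, np.var(s_mmse))             # low MSE, shrunken variance
print(mse_post, np.var(s_post))             # doubled MSE, variance = sigma2
```

The posterior sampler pays exactly a factor of two in MSE to restore the source distribution, a known extreme point of the tradeoff: you can have the right distribution or the minimum error, not both.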

Perception-Distortion Tradeoff

Visualize the tradeoff between MSE (distortion) and distributional divergence (perception) for a Gaussian source with varying compression rate.


Classical vs. Semantic Distortion Measures

| Metric | Formula | What It Measures | Limitations |
| --- | --- | --- | --- |
| MSE | $\|s - \hat{s}\|^2 / d$ | Per-dimension squared error | Does not correlate with perception; penalizes irrelevant details |
| SSIM | Luminance × contrast × structure | Structural similarity | Hand-crafted; not differentiable end-to-end |
| LPIPS | $\|f(s) - f(\hat{s})\|^2$ (VGG features) | Learned perceptual similarity | Depends on pretrained network; not interpretable |
| FID | $\|\mu_S - \mu_{\hat{S}}\|^2 + \text{tr}(\Sigma_S + \Sigma_{\hat{S}} - 2(\Sigma_S \Sigma_{\hat{S}})^{1/2})$ | Distributional divergence | Requires many samples; ignores per-sample quality |
| Task accuracy | $\mathbb{P}[c(\hat{s}) = c(s)]$ | Downstream task performance | Task-specific; binary (no gradation) |
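
As a sketch, the Fréchet distance that FID is built on can be computed directly between two Gaussians (a real FID pipeline first extracts Inception-v3 features; here we work with the Gaussians themselves). The symmetric-square-root identity avoids taking the root of the non-symmetric product $\Sigma_S \Sigma_{\hat{S}}$:

```python
import numpy as np

def _sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    w = np.clip(w, 0.0, None)              # guard tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Wasserstein-2 distance between two Gaussians -- the quantity
    FID evaluates on Inception features.  Uses the identity
    tr((C1 C2)^{1/2}) = tr((C2^{1/2} C1 C2^{1/2})^{1/2})."""
    s2 = _sqrtm_psd(cov2)
    cross = _sqrtm_psd(s2 @ cov1 @ s2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1) + np.trace(cov2) - 2.0 * np.trace(cross))

# Sanity checks where the answer is known in closed form:
d = 8
mu, cov = np.zeros(d), np.eye(d)
print(frechet_distance(mu, cov, mu, cov))        # identical Gaussians -> 0
print(frechet_distance(mu, cov, mu + 1.0, cov))  # pure mean shift -> d = 8
```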

Common Mistake: FID with Few Samples Is Unreliable

Mistake:

Computing the FID (Fréchet Inception Distance) between small batches of generated images (e.g., $N < 1000$) and using it to compare semantic communication systems.

Correction:

FID estimates the Wasserstein-2 distance between Gaussian fits to the Inception feature distributions. With small samples, the covariance estimate is biased, and FID can vary by 30–50% across random seeds. Use at least $N = 10{,}000$ samples for stable FID estimates, or use alternatives such as the Kernel Inception Distance (KID), which has an unbiased estimator, or the CMMD metric.
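
The small-sample bias is easy to reproduce: two batches drawn from the same distribution have a true FID of zero, yet the small-batch estimate lands far above it. The feature dimension and batch sizes below are illustrative assumptions:

```python
import numpy as np

def fid_from_samples(x, y):
    """Frechet distance between Gaussian fits to two sample sets (rows)."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    c1, c2 = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    # Symmetric square root of c2, then of c2^{1/2} c1 c2^{1/2}.
    w, v = np.linalg.eigh(c2)
    s2 = (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T
    w, _ = np.linalg.eigh(s2 @ c1 @ s2)
    tr_cross = np.sqrt(np.clip(w, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(c1) + np.trace(c2) - 2.0 * tr_cross)

# Both batches come from the SAME distribution, so the true FID is 0.
rng = np.random.default_rng(0)
d = 16                                   # feature dimension (assumption)
small = fid_from_samples(rng.normal(size=(100, d)),
                         rng.normal(size=(100, d)))
large = fid_from_samples(rng.normal(size=(10_000, d)),
                         rng.normal(size=(10_000, d)))
print(small, large)                      # small-N estimate sits far above 0
```

The bias grows roughly like $d^2/N$ (with $d = 2048$ Inception features it is far worse than in this toy), which is why large-sample FID or an unbiased estimator is essential for fair system comparisons.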

Historical Note: From Weaver to Bao: 70 Years of Semantic Communication

1949–present

After Weaver's 1949 articulation of the three levels of communication, the semantic level lay largely dormant for decades. Carnap and Bar-Hillel attempted a formal theory of "semantic information" in the 1950s, but it was too rigid to be practical. The modern revival began around 2019–2021, driven by three forces: (1) the success of deep learning in representation learning, which made it possible to learn semantic features; (2) 5G systems approaching the Shannon limit, motivating beyond-Shannon gains; and (3) the rise of machine-to-machine communication (IoT, autonomous driving), where "meaning" is well defined as task performance. Bao et al. (2011) proposed one of the first formal frameworks for semantic communication, and Bourtsoulatze et al. (2019) demonstrated learned JSCC for images, showing that neural networks could discover efficient source-channel codes without hand-designed coding schemes.

⚠️ Engineering Note

Deployment Challenges for Semantic Communication

Deploying semantic communication in real systems faces several practical challenges:

  1. Model sharing: Both transmitter and receiver must have compatible neural network models. Unlike standard codecs, DeepJSCC models are not standardized.
  2. Computational cost: Neural network inference at both ends requires GPUs or dedicated hardware, which may not be available at edge devices.
  3. Generalization: A model trained for one source distribution (e.g., faces) may fail on another (e.g., landscapes). Domain adaptation or universal models are needed.
  4. Security: Adversarial attacks can exploit the neural network's learned representation to cause targeted misreconstruction.
  5. Interpretability: Unlike separate coding, it is hard to debug or verify DeepJSCC systems because the latent representation has no standard structure.

Quick Check

The perception-distortion tradeoff says that:

Low MSE always means high perceptual quality

Perfect perceptual quality ($P_{\hat{S}} = P_S$) requires accepting higher MSE

MSE and perceptual quality always improve together

The tradeoff only exists for non-Gaussian sources

Key Takeaway

Semantic distortion measures capture task-relevant or perceptual quality that classical MSE misses. The rate-distortion function under semantic distortion can be dramatically lower than under MSE, providing the information-theoretic foundation for semantic communication gains. However, the perception-distortion tradeoff warns that no reconstruction can simultaneously minimize MSE and match the source distribution — system designers must choose where on this tradeoff to operate.

Why This Matters: Semantic Communication and 6G

Semantic communication is a leading candidate for 6G systems, where the goal is to support AI-native applications (autonomous driving, immersive XR, digital twins) that require task-relevant information rather than bit-perfect reconstruction. The 3GPP and ITU are investigating semantic communication as a key technology for IMT-2030. See Book telecom, Ch. 32 for the broader 6G context.

Semantic Distortion

A quality measure that evaluates reconstruction based on task-relevant features or perceptual similarity rather than symbol-by-symbol error.

Related: Semantic Distortion Measures

Perception-Distortion Tradeoff

The fundamental tradeoff between reconstruction fidelity (low MSE) and realism (matching the source distribution), first formalized by Blau and Michaeli (2018).

Related: The Perception-Distortion Tradeoff