Beyond Shannon: Task-Relevant Communication
Why Go Beyond Shannon?
Shannon's framework answers the question: "How many bits are needed to reconstruct a message exactly (or within a given distortion)?" But in many modern applications, the receiver does not want to reconstruct the message — it wants to act on it. A self-driving car receiving camera images does not need pixel-perfect reconstruction; it needs to detect obstacles. A voice assistant does not need to reconstruct speech waveforms; it needs to understand the command.
The point is that Shannon's theory is agnostic to the meaning of the message — and this is both its strength (universal applicability) and its limitation (potential inefficiency when the task is known). Semantic communication asks: can we do better by encoding only what is relevant to the task?
Historical Note: Weaver's Three Levels of Communication
In the 1949 preface to Shannon's "Mathematical Theory of Communication," Warren Weaver identified three levels of the communication problem:
- Level A (Technical): How accurately can the symbols be transmitted? (Shannon's theory.)
- Level B (Semantic): How precisely do the transmitted symbols convey the desired meaning?
- Level C (Effectiveness): How effectively does the received meaning affect conduct?
Shannon explicitly addressed only Level A, writing that "the semantic aspects of communication are irrelevant to the engineering problem." For 70 years, this was the dominant paradigm. The resurgence of interest in Levels B and C — semantic and goal-oriented communication — is driven by the confluence of deep learning (which can learn task-specific representations), the approaching limits of Shannon-optimal systems (5G is nearly capacity-achieving), and the emergence of machine-to-machine communication where "meaning" is well-defined.
Definition: The Rate-Utility Function
Let $X$ be a source, $Y$ be a goal variable (the task-relevant information), and $u(\hat{x}, y)$ be a utility function measuring how well the reconstructed signal serves the task. The rate-utility function is:
$$R(U) = \min_{p(\hat{x} \mid x)\,:\, \mathbb{E}[u(\hat{X}, Y)] \ge U} I(X; \hat{X})$$
This is the minimum rate needed to achieve expected utility at least $U$.
When $Y = X$ and $u(\hat{x}, x) = -d(\hat{x}, x)$ (negative distortion), the rate-utility function reduces to the classical rate-distortion function $R(D)$. The key difference is that the utility depends on the goal $Y$, not the source $X$ itself. If $Y$ is a low-dimensional function of $X$ (e.g., a classification label), then $R(U)$ can be much smaller than $R(D)$.
Theorem: Rate-Utility Bound via the Information Bottleneck
If the goal $Y$ satisfies the Markov chain $Y \to X \to T$, then:
$$R(U) \;\ge\; R_{\mathrm{IB}}(I_{\min}(U)) \;=\; \min_{p(t \mid x)\,:\, I(T;Y) \,\ge\, I_{\min}(U)} I(X;T)$$
where $I_{\min}(U)$ is the minimum $I(T;Y)$ needed to achieve utility $U$. Furthermore, if the utility function depends on $T$ only through $I(T;Y)$, then the rate-utility function is characterized by the information bottleneck:
$$R(U) = \min_{p(t \mid x)\,:\, I(T;Y)\,\ge\,U} I(X;T) = R_{\mathrm{IB}}(U)$$
which is the inverse of the information curve from Chapter 28.
The IB provides the fundamental limit for task-relevant compression. The source $X$ contains both task-relevant information ($I(X;Y)$ bits) and irrelevant information. The rate-utility function says how many bits of $X$ you must transmit to achieve a given task performance — and this is always at least as many as the IB requires, because the IB is the tightest compression that preserves task relevance.
Lower bound by IB
Since $Y \to X \to T$ is a Markov chain, any encoding $T$ that achieves utility $U$ must satisfy $I(T;Y) \ge I_{\min}(U)$, where $I_{\min}(U)$ is the minimum MI needed for utility $U$. By the data processing inequality, $I(X;T) \ge I(T;Y)$, and minimizing over all valid encoders gives $R(U) \ge R_{\mathrm{IB}}(I_{\min}(U))$.
Achievability for MI-based utility
When the utility is $u(T) = I(T;Y)$, the optimization becomes the IB problem exactly. The IB-optimal encoder $p^*(t \mid x)$ achieves the minimum rate $R_{\mathrm{IB}}(U)$ for target utility $U$. This is achievable by the Blahut-Arimoto algorithm from Chapter 28.
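To make the achievability step concrete, here is a minimal NumPy sketch of the self-consistent IB iteration; the toy joint distribution, the function name `ib`, and the random initialization are illustrative assumptions, not the exact formulation from Chapter 28.

```python
import numpy as np

def ib(p_xy, n_t, beta, n_iter=300, seed=0):
    """Iterative information bottleneck (Blahut-Arimoto style).

    p_xy : joint distribution over (X, Y), shape (n_x, n_y)
    n_t  : cardinality of the bottleneck variable T
    beta : tradeoff parameter (larger beta preserves more I(T;Y))
    Returns the encoder p(t|x) and the achieved (I(X;T), I(T;Y)) in bits.
    """
    eps = 1e-12
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_y_x = p_xy / (p_x[:, None] + eps)              # p(y|x)

    q = rng.random((p_x.size, n_t))                  # encoder p(t|x), random init
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        p_t = q.T @ p_x                              # marginal p(t)
        p_y_t = q.T @ p_xy / (p_t[:, None] + eps)    # decoder p(y|t)
        # KL divergence D( p(y|x) || p(y|t) ) for every (x, t) pair
        kl = (p_y_x[:, None, :] * np.log((p_y_x[:, None, :] + eps)
              / (p_y_t[None, :, :] + eps))).sum(axis=2)
        # self-consistent encoder update: p(t|x) proportional to p(t) exp(-beta*KL)
        logq = np.log(p_t + eps)[None, :] - beta * kl
        q = np.exp(logq - logq.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)

    p_t = q.T @ p_x
    p_ty = q.T @ p_xy                                # joint p(t, y)
    i_xt = (q * p_x[:, None] * np.log2((q + eps) / (p_t + eps)[None, :])).sum()
    i_ty = (p_ty * np.log2((p_ty + eps) / (np.outer(p_t, p_y) + eps))).sum()
    return q, i_xt, i_ty

# Toy source: X in {0..3}, Y indicates X >= 2, observed through 10% noise
p_xy = np.array([[0.225, 0.025], [0.225, 0.025],
                 [0.025, 0.225], [0.025, 0.225]])
q, i_xt, i_ty = ib(p_xy, n_t=2, beta=5.0)
print(f"I(X;T) = {i_xt:.3f} bits, I(T;Y) = {i_ty:.3f} bits")
```

With a large enough `beta`, the encoder collapses the four source symbols into the two task-relevant clusters, spending roughly 1 bit of rate while retaining nearly all of $I(X;Y)$.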
Definition: Semantic Source-Channel Encoder
A semantic encoder is a mapping $f: \mathcal{X}^n \to \mathbb{R}^k$ that maps a source block $x^n$ to a channel input sequence $z^k$, where $\rho = k/n$ is the bandwidth ratio. Unlike classical separate source-channel coding, the semantic encoder:
- Does not explicitly separate source coding from channel coding
- Is task-aware: optimized for the utility $u$, not for reconstruction fidelity
- Is typically implemented as a neural network with parameters $\theta$ trained end-to-end
The corresponding decoder $g$ maps channel outputs $\tilde{z}^k$ to task-relevant reconstructions $\hat{y}$.
The bandwidth ratio $\rho = k/n$ is the analog of the rate in digital communication. When $\rho < 1$, the system operates in bandwidth compression (more source symbols than channel uses); when $\rho > 1$, it operates in bandwidth expansion (redundancy for error protection).
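As a concrete illustration, here is a minimal PyTorch sketch of such an encoder-decoder pair with an AWGN layer between them; the architecture, the latent size $k = 64$, the 32×32 input, and the 10 dB SNR are all illustrative assumptions, not a reference design.

```python
import torch
import torch.nn as nn

class SemanticJSCC(nn.Module):
    """Task-aware joint source-channel autoencoder (sketch).

    Encodes an image directly to k real channel uses, adds AWGN, and
    decodes class logits: the training objective is task utility
    (cross-entropy), not reconstruction fidelity.
    """
    def __init__(self, n_classes=10, k=64, snr_db=10.0):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(k),                      # k = channel uses per image
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 256), nn.ReLU(),
            nn.Linear(256, n_classes),             # task output, not pixels
        )
        self.snr = 10.0 ** (snr_db / 10.0)

    def forward(self, x):
        z = self.encoder(x)
        # normalize to unit average power (channel input constraint)
        z = z * torch.rsqrt(z.pow(2).mean(dim=1, keepdim=True) + 1e-8)
        z = z + torch.randn_like(z) / self.snr ** 0.5   # AWGN at the given SNR
        return self.decoder(z)

# end-to-end training step: the "utility" is classification accuracy
model = SemanticJSCC()
logits = model(torch.randn(8, 3, 32, 32))              # dummy batch
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (8,)))
loss.backward()
```

Because the loss is the task utility, gradients flow through the channel noise into the encoder, which learns to spend its $k$ channel uses only on class-relevant features.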
Example: Communication for Remote Classification
A sensor observes images $X \in \mathbb{R}^{224 \times 224 \times 3}$ and transmits over an AWGN channel at a given SNR to a receiver that must classify the image into one of 1000 classes (ImageNet). Compare the required rate for:
- (a) Reconstruct-then-classify: compress the image to MSE distortion $D$, transmit at rate $R(D)$, then classify.
- (b) Semantic communication: transmit only the class-relevant features.
Reconstruct-then-classify
The source has $224 \times 224 \times 3 \approx 1.5 \times 10^5$ dimensions. For reasonable image quality (PSNR around 30 dB), the rate-distortion function requires on the order of 1 bit per dimension, giving a total on the order of $10^5$ bits per image. At a channel capacity of a few bits per use, this requires on the order of $10^5$ channel uses.
Semantic communication
The goal is classification into 1000 classes, requiring at most $\log_2 1000 \approx 10$ bits of task-relevant information. Even with error protection at rate 1/2 (to handle channel errors), the total is about 20 bits, needing on the order of 10 channel uses. This is a 10,000× reduction in bandwidth.
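A back-of-the-envelope check of this comparison; the bits-per-dimension and capacity values below are the assumed round numbers from above:

```python
import math

n_dims = 224 * 224 * 3        # ImageNet-size source
bits_per_dim = 1.0            # assumed rate for ~30 dB PSNR reconstruction
capacity = 2.0                # assumed channel capacity, bits per channel use

uses_reconstruct = n_dims * bits_per_dim / capacity
bits_semantic = math.log2(1000) / 0.5        # ~10 task bits, rate-1/2 protection
uses_semantic = bits_semantic / capacity

print(f"reconstruct-then-classify: {uses_reconstruct:,.0f} channel uses")
print(f"semantic:                  {uses_semantic:.0f} channel uses")
print(f"bandwidth reduction:       {uses_reconstruct / uses_semantic:,.0f}x")
```

The exact factor depends on the assumed values; the point is the roughly four-orders-of-magnitude gap.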
The catch
The semantic approach assumes a known classification task and a good feature extractor. If the task changes (e.g., from classification to object detection or to image captioning), the semantic encoder must be retrained or made flexible enough to support multiple tasks. Shannon's separation theorem guarantees that reconstruct-then-process works for any downstream task, at the cost of higher rate. This is the universality-efficiency tradeoff at the heart of semantic communication.
Rate-Utility vs. Rate-Distortion Comparison
(Interactive demo: compares the rate-distortion function $R(D)$, reconstruct then process, with the rate-utility function $R(U)$, task-specific encoding, for a Gaussian source with a classification task.)
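In lieu of the interactive demo, a minimal sketch of the same comparison, assuming a unit-variance Gaussian source and the binary task $Y = \mathrm{sign}(X)$, so the task-relevant information is $I(X;Y) = 1$ bit:

```python
import numpy as np
import matplotlib.pyplot as plt

d = np.linspace(0.01, 1.0, 200)
r_d = 0.5 * np.log2(1.0 / d)      # Gaussian rate-distortion: R(D) = 1/2 log2(s^2/D)

plt.plot(d, r_d, label="R(D): reconstruct-then-process")
plt.axhline(1.0, linestyle="--",
            label="I(X;Y) = 1 bit: ceiling on the task-relevant rate")
plt.xlabel("distortion D (MSE)")
plt.ylabel("rate (bits per symbol)")
plt.legend()
plt.title("Reconstruction rate grows without bound; task rate saturates")
plt.show()
```

As $D \to 0$ the reconstruction rate diverges, while the rate needed for the classification task can never usefully exceed 1 bit per symbol.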
Theorem: When Separation Is Suboptimal
For a source $X$ with goal $Y$, transmitted over a channel with capacity $C$:
- Separation is optimal when the source is ergodic and the channel is memoryless, in the limit of infinite blocklength, for any distortion measure $d$.
- Separation is suboptimal in general when:
- The blocklength is finite (practical systems)
- The source and channel have memory that can be exploited jointly
- The channel is unknown or time-varying and must be learned online
- The system has strict latency constraints
In all these cases, joint source-channel coding (JSCC) can outperform separate coding.
Shannon's separation theorem is an asymptotic result: in the infinite blocklength limit, you lose nothing by separating source and channel coding. But for finite blocklength, separation incurs a penalty because the source code must target a specific rate, which may not match the channel's instantaneous capacity (especially under fading). JSCC adapts: when the channel is good, more information gets through; when it is bad, the system degrades gracefully rather than failing catastrophically (the "cliff effect" of digital communication).
Separation theorem (classical)
By Shannon's separation theorem, if $R(D) < C$, there exists a separate source code at rate $R(D)$ and a channel code at rate below $C$ such that the end-to-end distortion approaches $D$ as blocklength $n \to \infty$. This is optimal.
Finite blocklength gap
At blocklength $n$, the best achievable rate is approximately $C - \sqrt{V/n}\,Q^{-1}(\epsilon)$ for channel coding and $R(D) + \sqrt{V_s/n}\,Q^{-1}(\epsilon)$ for source coding (where $V$ and $V_s$ are the channel and source dispersions). The gap between these second-order terms means that matching the rates requires $n$ to be large. For small $n$, JSCC can exploit the slack.
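A quick numerical look at the channel-coding side of this gap, assuming SciPy is available; the real-AWGN capacity and dispersion expressions below follow the standard normal approximation, with third-order terms omitted:

```python
import numpy as np
from scipy.stats import norm

def awgn_na_rate(snr, n, eps=1e-3):
    """Normal approximation C - sqrt(V/n) * Q^{-1}(eps), real AWGN channel."""
    c = 0.5 * np.log2(1.0 + snr)
    v = (snr * (snr + 2.0)) / (2.0 * (snr + 1.0) ** 2) * np.log2(np.e) ** 2
    return c - np.sqrt(v / n) * norm.isf(eps)      # norm.isf is Q^{-1}

snr = 10.0                                         # 10 dB
print(f"capacity: {0.5 * np.log2(1 + snr):.3f} bits/use")
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: best rate ~ {awgn_na_rate(snr, n):.3f} bits/use")
```

At $n = 100$ the backoff from capacity is substantial; only around $n \sim 10^5$ does the achievable rate get close to $C$.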
Fading channel example
Over a block-fading channel with random capacity $C(h)$ that depends on the fading state $h$, a fixed-rate digital scheme at rate $R$ fails (outage) whenever $C(h) < R$. JSCC avoids this by transmitting an analog representation: the reconstruction quality degrades continuously with the channel, avoiding the cliff effect.
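A small Monte Carlo sketch of this contrast, assuming Rayleigh block fading and the simplest possible analog scheme (one uncoded channel use per Gaussian source symbol with an MMSE receiver); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, snr = 100_000, 10.0                 # average SNR (linear scale)
h2 = rng.exponential(1.0, n_blocks)           # Rayleigh fading: |h|^2 ~ Exp(1)
cap = np.log2(1.0 + h2 * snr)                 # instantaneous capacity per block

# Digital at fixed rate R: total failure (outage) whenever C(h) < R
R = np.log2(1.0 + snr)                        # rate designed for the mean SNR
print(f"digital outage probability: {(cap < R).mean():.2f}")

# Analog uncoded X ~ N(0,1): MMSE error 1/(1 + |h|^2 SNR) varies smoothly
mse = 1.0 / (1.0 + h2 * snr)
print(f"analog MSE: mean {mse.mean():.3f}, worst decile {np.quantile(mse, 0.9):.3f}")
```

With the rate pegged to the mean-SNR capacity, the digital scheme is in outage on a large fraction of blocks and delivers nothing there, while the analog MSE merely fluctuates with the channel.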
Common Mistake: The Semantic Communication Fallacy
Mistake:
Claiming that semantic communication always outperforms Shannon's separation approach by extracting only "meaningful" information.
Correction:
Shannon's separation theorem is optimal in the asymptotic regime for ergodic sources and memoryless channels. Semantic communication gains come from finite blocklength effects, channel adaptation, or task-specific metrics — not from a fundamental failure of Shannon's theory. Claims of "beating Shannon" typically compare against a poorly designed baseline (e.g., JPEG + LDPC at the wrong rate) rather than against the true rate-distortion limit. The honest comparison is: how close does the semantic system get to the rate-utility bound vs. how close does the separate system get to the rate-distortion bound?
Quick Check
Shannon's separation theorem says that separate source and channel coding is optimal. When is this NOT true?
- (a) When the source has memory
- (b) At finite blocklength or over time-varying channels
- (c) When the distortion measure is not MSE
- (d) When the channel is Gaussian
Answer: (b). The separation theorem is an asymptotic result. At finite blocklength, the rate-matching penalty makes JSCC potentially better. Over time-varying channels, JSCC can adapt gracefully.
Semantic Communication
A communication paradigm that encodes and transmits only the meaning or task-relevant information in a message, rather than the literal bit sequence. The goal is to maximize utility at the receiver rather than minimize reconstruction error.
Related: The Rate-Utility Function
Goal-Oriented Communication
Communication designed to achieve a specific task or goal at the receiver (classification, control, decision-making), measured by a utility function rather than by signal fidelity.
Related: The Rate-Utility Function
The Universality-Efficiency Tradeoff
Shannon's framework is universal: a good reconstruction of enables any downstream task, without knowing the task in advance. Semantic communication trades universality for efficiency: it is highly efficient for the designed task but may be useless for other tasks. In system design, this means:
- Use semantic communication when the task is fixed and well-defined (e.g., sensor networks, industrial IoT)
- Use Shannon's approach when the data may be used for multiple or unknown tasks (e.g., general-purpose networks)
- Consider hybrid approaches that encode a "base layer" (sufficient for any task) plus a "semantic layer" (optimized for the primary task), as sketched below
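A toy sketch of this layered idea; every name and format below is hypothetical, and a real system would replace the scalar quantizer and the stand-in semantic model with learned codecs:

```python
import numpy as np

def encode_hybrid(x, semantic_model, base_bits=2):
    """Hypothetical hybrid encoder: coarse universal base + task-specific layer.

    x              : source vector, assumed scaled to [0, 1]
    semantic_model : callable mapping x to task features (e.g., class logits)
    base_bits      : quantizer resolution for the universal base layer
    """
    # Base layer: coarse scalar quantization, usable for ANY downstream task
    levels = 2 ** base_bits
    base = np.round(x * (levels - 1)).astype(np.uint8)
    # Semantic layer: compact features optimized for the primary task only
    semantic = semantic_model(x)
    return {"base": base, "semantic": semantic}

# Toy usage: the "semantic model" here is a stand-in for a trained network
packet = encode_hybrid(np.random.rand(16), semantic_model=lambda x: x.mean())
print(packet)
```

The receiver serves the primary task from the cheap semantic layer and falls back to the base layer for any task the designer did not anticipate.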
Why This Matters: Semantic Communication in the Telecom Book
The telecom book (Ch. 11) introduces information theory for wireless capacity analysis. The semantic communication framework in this chapter extends those foundations by replacing the reconstruction objective with a task-specific utility. See the telecom book, Ch. 32, for the broader 6G context, where semantic communication is a key enabling technology.