The ISI Channel as an Inference Problem

Why Inference? Why Now?

Every chapter of this book has been about extracting a hidden quantity from noisy observations. We have estimated parameters, reconstructed signals, and decided between hypotheses, all under the assumption that the channel between source and sensor was benign, or at most a scaling. Reality is rarely so kind. When symbols are transmitted faster than the channel's coherence bandwidth allows, each received sample is a linear combination of several past symbols plus noise. The operational question is simple: what did the transmitter actually send?

This is the equalization problem, and we will approach it exactly as we approached estimation and detection in earlier chapters: by writing down the likelihood and asking what the optimal inference rule does. The pattern will feel familiar. The twist is that the unknown is no longer a parameter but a sequence, and sequences live in a space whose size grows exponentially with length. That growth is the central computational tension of the chapter: the ML rule is well defined, but evaluating it naively is hopeless. The art lies in exploiting structure, the finite memory of the channel, to reduce an exponential search to one that is merely linear in sequence length.

Definition:

Discrete-Time ISI Channel

Let $\{x[n]\}$ be a sequence of transmitted symbols drawn from a finite alphabet $\mathcal{A} \subset \mathbb{C}$ with $M = |\mathcal{A}|$ and $\mathbb{E}[|x[n]|^2] = E_s$. Let $\{h[k]\}_{k=0}^{L}$ be a causal FIR channel of memory $L$, meaning $h[k] = 0$ for $k < 0$ and for $k > L$. The discrete-time ISI channel produces

$$y[n] = \sum_{k=0}^{L} h[k]\, x[n-k] + w[n], \qquad n = 0, 1, \ldots, T-1,$$

where $\{w[n]\}$ is circularly symmetric complex Gaussian noise, $w[n] \sim \mathcal{CN}(0, N_0)$, independent across $n$.

The channel taps $\mathbf{h} = [h[0], \ldots, h[L]]^T$ are assumed known at the receiver (having been estimated from pilot symbols in an earlier stage). The unknowns are the transmitted symbols $\mathbf{x} = [x[0], \ldots, x[T-1]]^T \in \mathcal{A}^T$.

The terminology "intersymbol interference" reflects what happens when $L \geq 1$: the sample $y[n]$ depends not only on $x[n]$ but on up to $L$ past symbols, which interfere with one another at the receiver. When $L = 0$ the channel collapses to a memoryless AWGN channel and equalization is unnecessary.
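Before turning to inference, it helps to see the forward model in code. The following sketch simulates a block through the discrete-time ISI channel; the taps, block length, and noise level are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this sketch, not prescribed by the text)
h = np.array([1.0, 0.5])          # channel taps h[0..L], so L = 1
T = 8                             # block length
N0 = 0.1                          # noise variance per complex sample

# BPSK symbols, preceded by L known symbols (+1) that fix the symbols before the block
x = rng.choice([-1.0, +1.0], size=T)
x_padded = np.concatenate([np.ones(len(h) - 1), x])

# y[n] = sum_k h[k] x[n-k] + w[n]; 'valid' convolution keeps exactly T samples
noise = np.sqrt(N0 / 2) * (rng.standard_normal(T) + 1j * rng.standard_normal(T))
y = np.convolve(x_padded, h, mode="valid") + noise

print("transmitted:", x)
print("received   :", np.round(y, 3))
```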

Definition:

Maximum-Likelihood Sequence Estimate (MLSE)

Given the received block $\mathbf{y} = [y[0], \ldots, y[T-1]]^T$ and known channel $\mathbf{h}$, the maximum-likelihood sequence estimate is

$$\hat{\mathbf{x}}_{\text{ML}} = \arg\max_{\mathbf{x} \in \mathcal{A}^T} f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}).$$

Under the white Gaussian noise model of the Discrete-Time ISI Channel definition, the likelihood reduces to a squared-error criterion:

$$\hat{\mathbf{x}}_{\text{ML}} = \arg\min_{\mathbf{x} \in \mathcal{A}^T} \sum_{n=0}^{T-1} \left| y[n] - \sum_{k=0}^{L} h[k]\, x[n-k] \right|^2.$$

We refer to this rule as maximum-likelihood sequence estimation, or MLSE.

The MLSE rule is a statement, not an algorithm. It specifies what answer we want, the sequence of symbols that best explains the observations, without saying how to find it. The set $\mathcal{A}^T$ contains $M^T$ candidates; for a modest block of $T = 100$ BPSK symbols, that is $2^{100} \approx 10^{30}$ sequences. The problem is not to define optimality but to compute it.
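For very short blocks the rule can be evaluated literally. The sketch below enumerates all $M^T$ candidates, assuming BPSK and a known run of $L$ symbols before the block; it exists only to make the criterion concrete, since the cost explodes immediately. The function name and the placeholder values are illustrative.

```python
import numpy as np
from itertools import product

def mlse_brute_force(y, h, alphabet=(-1.0, +1.0), prefix=(+1.0,)):
    """Exhaustive MLSE: minimize sum_n |y[n] - sum_k h[k] x[n-k]|^2 over A^T.

    `prefix` holds the L known symbols transmitted before the block.
    Complexity is M**T, so this is usable only for very short blocks.
    """
    T, L = len(y), len(h) - 1
    assert len(prefix) == L, "prefix must supply the L symbols before the block"
    best_x, best_cost = None, np.inf
    for x in product(alphabet, repeat=T):
        full = list(prefix) + list(x)                   # x[-L], ..., x[T-1]
        y_hat = [sum(h[k] * full[L + n - k] for k in range(L + 1)) for n in range(T)]
        cost = sum(abs(y[n] - y_hat[n]) ** 2 for n in range(T))
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost

# Placeholder two-tap example; any short block works
print(mlse_brute_force([1.2, -0.7, 0.3], [1.0, 0.5]))
```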

Definition:

Channel State

The channel state at time $k$ is the vector of the $L$ most recent past symbols,

$$s_k = (x[k-1], x[k-2], \ldots, x[k-L]) \in \mathcal{S},$$

where $\mathcal{S} = \mathcal{A}^L$ has cardinality $|\mathcal{S}| = M^L$.

The state $s_k$ summarizes everything about past transmissions that is relevant to the next observation $y[k]$: given $s_k$ and the current input $x[k]$, the noise-free output $\sum_{\ell=0}^{L} h[\ell]\, x[k-\ell]$ is completely determined.
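The state update is simple enough to write in a few lines. The helper below, with illustrative names, maps a state and a current input to the noise-free output and the next state:

```python
def step(state, x_k, h):
    """Given state s_k = (x[k-1], ..., x[k-L]), input x[k], and taps h[0..L],
    return the noise-free output and the next state s_{k+1}."""
    symbols = (x_k,) + state                      # (x[k], x[k-1], ..., x[k-L])
    noise_free = sum(h[l] * symbols[l] for l in range(len(h)))
    next_state = symbols[:-1]                     # drop the oldest symbol x[k-L]
    return noise_free, next_state

# Example: two-tap channel, BPSK, state (x[k-1],) = (+1,), input x[k] = -1
h = [1.0, 0.5]
print(step((+1.0,), -1.0, h))   # -> (-0.5, (-1.0,))
```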

Theorem: Markov Structure of the ISI Likelihood

For the channel of the Discrete-Time ISI Channel definition, assume the initial state $s_0$ is known (for instance through a preamble of zeros). Then the log-likelihood of a candidate sequence $\mathbf{x} \in \mathcal{A}^T$ factorizes over transitions $(s_k, s_{k+1})$ of the state process:

$$\ln f_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = -\frac{1}{N_0} \sum_{k=0}^{T-1} \gamma_k(s_k, s_{k+1}) + \text{const},$$

where the branch metric is

$$\gamma_k(s_k, s_{k+1}) = \left| y[k] - \sum_{\ell=0}^{L} h[\ell]\, x[k-\ell] \right|^2,$$

and the symbols $x[k], x[k-1], \ldots, x[k-L]$ entering the branch metric are read off from the pair $(s_k, s_{k+1})$: the state $s_k$ supplies the past symbols and the transition to $s_{k+1}$ identifies the current input $x[k]$.

The ISI channel is finite-memory, so knowing the state at time $k$ and the current input determines both the next state and the noise-free output. The likelihood therefore decomposes into per-transition terms: a sum of local costs along a path through the state graph. This is the same structure one sees in hidden Markov models and in the Viterbi decoder for convolutional codes, and it is precisely the structure that makes dynamic programming the right tool.

Key Takeaway

The global optimum of MLSE decomposes into a sum of local costs along paths through a finite-state graph. Finite memory begets dynamic programming; this is the structural observation on which the Viterbi algorithm rests.
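The resulting dynamic program is the Viterbi algorithm. The sketch below implements it for the model of this section, assuming real BPSK symbols and an all-$(+1)$ initial state by default; the function name, variable names, and traceback bookkeeping are illustrative choices, not notation fixed by the text.

```python
import numpy as np
from itertools import product

def viterbi_mlse(y, h, alphabet=(-1.0, +1.0), initial=None):
    """MLSE via the Viterbi algorithm for a known FIR channel h[0..L].

    States are tuples (x[k-1], ..., x[k-L]); the branch metric is the
    squared error |y[k] - sum_l h[l] x[k-l]|^2 from the theorem above.
    Complexity is O(M**(L+1) * T) instead of M**T.
    """
    L = len(h) - 1
    states = list(product(alphabet, repeat=L))
    initial = tuple([+1.0] * L) if initial is None else tuple(initial)

    cost = {s: (0.0 if s == initial else np.inf) for s in states}   # best metric into s
    path = {s: [] for s in states}                                   # surviving symbols

    for k in range(len(y)):
        new_cost = {s: np.inf for s in states}
        new_path = {}
        for s in states:
            if not np.isfinite(cost[s]):
                continue
            for x_k in alphabet:
                symbols = (x_k,) + s                      # (x[k], x[k-1], ..., x[k-L])
                y_hat = sum(h[l] * symbols[l] for l in range(L + 1))
                metric = cost[s] + abs(y[k] - y_hat) ** 2
                s_next = symbols[:L]                      # next state drops x[k-L]
                if metric < new_cost[s_next]:             # keep only the survivor
                    new_cost[s_next] = metric
                    new_path[s_next] = path[s] + [x_k]
        cost, path = new_cost, new_path

    best = min(cost, key=cost.get)
    return path[best], cost[best]

# Illustrative call (taps and observations are placeholder values)
print(viterbi_mlse([1.2, -0.7, 0.3], [1.0, 0.5]))
```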

Example: Branch Metrics for a Two-Tap Channel with BPSK

Consider a two-tap channel $\mathbf{h} = [h[0], h[1]]^T = [1, 0.5]^T$ with BPSK modulation $\mathcal{A} = \{-1, +1\}$ (so $M = 2$, $L = 1$, $|\mathcal{S}| = 2$). The initial state is $s_0 = +1$. Suppose the receiver observes $y[0] = 1.2$ and $y[1] = -0.7$.

Enumerate all state transitions, compute the corresponding branch metrics, and identify the two-symbol sequence that is most likely to have been transmitted.
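A brute-force check of the four candidate pairs, sketched below with illustrative variable names, confirms the hand enumeration: the total metrics are 3.53 for $(-1,-1)$, 4.33 for $(-1,+1)$, 0.13 for $(+1,-1)$, and 4.93 for $(+1,+1)$, so the ML decision is $(x[0], x[1]) = (+1, -1)$.

```python
from itertools import product

h = [1.0, 0.5]          # h[0], h[1]
y = [1.2, -0.7]         # received samples
x_prev = +1.0           # initial state s_0 = +1

for x0, x1 in product([-1.0, +1.0], repeat=2):
    # noise-free outputs for this candidate sequence
    y0_hat = h[0] * x0 + h[1] * x_prev
    y1_hat = h[0] * x1 + h[1] * x0
    metric = (y[0] - y0_hat) ** 2 + (y[1] - y1_hat) ** 2
    print(f"(x[0], x[1]) = ({x0:+.0f}, {x1:+.0f}):  metric = {metric:.2f}")
```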

Output SNR of ZF vs MMSE as Channel Null Deepens

A preview of the tension that motivates MLSE: when the channel $h = [1, -\alpha]$ develops a spectral null as $\alpha \to 1$, the zero-forcing equalizer's output SNR collapses while the MMSE equalizer gracefully trades bias for variance. MLSE avoids inversion entirely and suffers no null-depth penalty, at the price of exponential complexity.

⚠️Engineering Note

The Known-Channel Assumption

Every equalizer in this chapter presumes that $\mathbf{h}$ has been estimated elsewhere, typically from a pilot or training sequence inserted at the start of each packet. The estimate $\hat{\mathbf{h}}$ is not the true $\mathbf{h}$; its error feeds directly into every subsequent computation. In practice one budgets SNR for pilots, designs pilots to have favorable autocorrelation (pseudonoise, Zadoff-Chu, or Golay sequences), and re-estimates the channel at the block rate dictated by the coherence time. Treating channel estimation and equalization as separate subsystems is the dominant paradigm in standards such as LTE and 5G NR, but it is not optimal: joint estimation and detection can do better when pilots are sparse. A minimal least-squares pilot estimator is sketched after the constraints listed below.

Practical Constraints

  • Estimate $\hat{\mathbf{h}}$ once per coherence interval
  • Pilot overhead scales with channel memory $L$ and mobility
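To make the pilot stage concrete, here is a minimal least-squares channel estimator; it is a sketch under assumed values (pilot length, taps, noise level), and the variable names are illustrative rather than drawn from any standard.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed values, not from any standard)
h_true = np.array([1.0, 0.5, -0.2])        # true taps h[0..L], L = 2
L = len(h_true) - 1
Np = 32                                     # number of usable pilot observations
N0 = 0.05

# Known BPSK pilot, preceded by L known symbols so every output sample is usable
pilot = rng.choice([-1.0, 1.0], size=Np + L)
y = np.convolve(pilot, h_true, mode="valid")            # Np noise-free samples
y = y + np.sqrt(N0 / 2) * (rng.standard_normal(Np) + 1j * rng.standard_normal(Np))

# Build the convolution matrix A so that y ≈ A @ h, then solve in least squares
A = np.column_stack([pilot[L - k : L - k + Np] for k in range(L + 1)]).astype(complex)
h_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("true taps     :", h_true)
print("estimated taps:", np.round(h_hat, 3))
```

The same construction works for any known training sequence; pilots with good autocorrelation keep the matrix $\mathbf{A}^H \mathbf{A}$ well conditioned, which is the practical reason for the sequence families named in the note above.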

Common Mistake: MLSE Complexity Is Exponential in Memory, Not Block Length

Mistake:

Students often assume that because MLSE searches over $\mathcal{A}^T$, its complexity must scale as $M^T$, and therefore conclude that MLSE is simply infeasible for realistic block sizes.

Correction:

The naive enumeration has complexity $M^T$, but the Viterbi algorithm exploits the Markov structure of the likelihood (Theorem: Markov Structure of the ISI Likelihood) and achieves complexity $O(M^{L+1} \cdot T)$. Complexity is linear in the block length and exponential in the channel memory, not in the block length. For short-memory channels ($L = 1, 2, 3$) MLSE is eminently practical; for long wireless channels with $L \sim 50$, it is not.
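A back-of-the-envelope comparison, using assumed values of $M$, $T$, and $L$, makes the scaling concrete:

```python
import math

M, T = 2, 1000                                   # BPSK, 1000-symbol block (assumed)
for L in (1, 3, 50):
    naive_log10 = T * math.log10(M)              # log10 of the M**T candidate sequences
    viterbi_ops = (M ** (L + 1)) * T             # branch metrics evaluated by Viterbi
    print(f"L = {L:2d}: naive search ~ 10^{naive_log10:.0f} sequences, "
          f"Viterbi ~ {viterbi_ops:.2e} branch metrics")
```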

Historical Note: Forney Recasts Viterbi for ISI (1972)


The Viterbi algorithm was introduced in 1967 for decoding convolutional codes. Its applicability to equalization was not immediate; it was G. David Forney Jr. who, in a landmark 1972 paper in the IEEE Transactions on Information Theory, recognized that an ISI channel is structurally identical to a convolutional code with memory $L$, and that Viterbi decoding therefore yields the maximum-likelihood symbol sequence estimate. Forney's paper did more than transfer an algorithm; it established the unified view that channels with memory, convolutional codes, and Markov sources all share a single inference framework. That view is the reason this book places equalization and detection in the same chapter of an inference textbook.

Three Views of the Same Channel

| Quantity | Estimation view | Detection view | Equalization view |
|---|---|---|---|
| Unknown | Continuous parameter $\theta$ | Discrete hypothesis $H \in \{0, 1\}$ | Discrete sequence $\mathbf{x} \in \mathcal{A}^T$ |
| Objective | Minimize $\mathbb{E}[\lVert \hat{\theta} - \theta \rVert^2]$ | Minimize $P(\text{error})$ | Minimize sequence error probability |
| ML rule complexity | Closed form (often) | Single comparison | Exponential in $T$ (naive) |
| Structural exploit | Linearity, Gaussianity | Sufficient statistic | Finite memory $\to$ state machine |
| Canonical algorithm | LMMSE, Wiener filter | Likelihood ratio test | Viterbi / dynamic programming |

Intersymbol interference (ISI)

The phenomenon in which a received sample depends on multiple past transmitted symbols, arising from any channel whose impulse response extends beyond one symbol period. ISI is the signal-processing shadow of bandwidth limitation: the more tightly a symbol's energy is packed in time, the more it spills into adjacent symbol periods through the channel.

Related: Maximum-likelihood sequence estimation (MLSE), Channel memory

Maximum-likelihood sequence estimation (MLSE)

The rule that returns the transmitted sequence maximizing the conditional density of the received block under a known channel model. For Gaussian noise, MLSE reduces to a minimum-distance search over sequences. Implemented efficiently via the Viterbi algorithm when the channel has finite memory.

Related: Intersymbol interference (ISI), Viterbi algorithm

Channel memory

The integer $L$ such that the impulse response $h[k]$ vanishes for $k > L$. A channel with memory $L$ causes each output sample to depend on $L+1$ consecutive input symbols. The number of states in the MLSE trellis, $M^L$, is exponential in this quantity.

Related: Intersymbol interference (ISI)