Online Learning and Sequential Estimation

When Nature Does Not Commit to a Distribution

Classical statistical inference — the framework of Kay, of Cramér, of this book through Chapter 24 — presupposes that data arrive i.i.d. from a fixed (possibly unknown) distribution. In wireless systems that assumption is often wrong: channels drift, interference patterns change, users join and leave, and the adversary (if there is one) adapts. The online-learning framework is designed for exactly this mismatch.

Instead of minimizing expected loss under a fixed distribution, we minimize regret against the best fixed decision in hindsight, with no distributional assumption at all. Remarkably, tight regret bounds exist, and they are achieved by algorithms that look a lot like the stochastic approximation schemes of Robbins and Monro — except the analysis is adversarial.

The point is that online learning is not a different subject from estimation; it is estimation liberated from the i.i.d. assumption, at the price of comparing to a weaker benchmark.

Definition:

Cumulative Regret

In online learning, at round $t = 1, \ldots, T$ the learner picks an action $\mathbf{w}_t \in \mathcal{W}$, observes a convex loss $\ell_t : \mathcal{W} \to \mathbb{R}$, and incurs loss $\ell_t(\mathbf{w}_t)$. The cumulative regret of the learner relative to the best fixed action in hindsight is

$$R_T \;=\; \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \;-\; \min_{\mathbf{w}^{\star} \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(\mathbf{w}^{\star}).$$

The learner is no-regret if $R_T / T \to 0$ as $T \to \infty$.

The losses $\ell_t$ need not be stochastic — they can be picked adversarially after seeing $\mathbf{w}_t$. What makes the problem tractable is that the losses are convex (or linear) in $\mathbf{w}$.
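To make the bookkeeping concrete, here is a minimal Python sketch that computes $R_T$ from recorded data in the experts special case (mixed strategies over $N$ experts, linear losses); the array interface is an illustrative assumption, not part of the definition.

```python
import numpy as np

def cumulative_regret(losses, actions):
    """Cumulative regret in the experts setting.

    losses  : (T, N) array, losses[t, i] = loss of expert i at round t
    actions : (T, N) array, actions[t]   = learner's mixed strategy at round t
    Returns R_T = learner's cumulative expected loss minus the cumulative
    loss of the best fixed expert in hindsight.
    """
    learner_loss = np.sum(actions * losses)           # sum_t <p_t, l_t>
    best_fixed = np.min(np.sum(losses, axis=0))       # min_i sum_t l_t^i
    return learner_loss - best_fixed
```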

Definition:

Online Convex Optimization (OCO)

The OCO framework specifies: (i) a convex feasible set $\mathcal{W} \subseteq \mathbb{R}^d$; (ii) at each round, a convex loss function $\ell_t : \mathcal{W} \to \mathbb{R}$ revealed after the decision; (iii) possibly subgradient feedback $\mathbf{g}_t \in \partial \ell_t(\mathbf{w}_t)$. The goal is to design an algorithm with sublinear regret.

Notice that OCO is convex optimization — the convexity of $\ell_t$ is what makes tight regret bounds possible. The contrast with offline convex optimization is that we see the objective one piece at a time, not all at once. Every bound in this section ultimately flows from the convexity reflex: if $\ell_t$ were non-convex, no sublinear regret would be achievable in general.
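The Key Takeaway at the end of this section names online gradient descent on convex sets as the second workhorse OCO algorithm. A minimal sketch is given below, assuming a Euclidean-ball feasible set and a decaying step size; the projection, step-size rule, and `grad_oracle` interface are illustrative assumptions, not part of the definition above.

```python
import numpy as np

def project_to_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_gradient_descent(grad_oracle, d, T, radius=1.0, G=1.0):
    """Projected online subgradient descent (sketch, Zinkevich-style).

    grad_oracle(t, w) returns a subgradient g_t of the round-t loss at w;
    the losses themselves never need to be known in closed form.
    A step size of order radius / (G * sqrt(t)) yields O(sqrt(T)) regret
    when subgradient norms are bounded by G.
    """
    w = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w.copy())
        g = grad_oracle(t, w)                      # subgradient feedback
        eta = radius / (G * np.sqrt(t))            # decaying step size
        w = project_to_ball(w - eta * g, radius)   # gradient step + projection
    return iterates
```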

Regret

The gap between the learner's cumulative loss and the loss of the best fixed decision in hindsight. A learner is no-regret if this gap grows sublinearly in $T$, i.e., $R_T = o(T)$.

Multiplicative Weights Update (MWU)

An online learning algorithm for the experts problem in which each expert's weight is multiplied by $\exp(-\eta \ell_t^i)$ after observing losses. It achieves $R_T \leq \sqrt{2T\ln N}$ with optimal step size $\eta = \sqrt{2\ln N / T}$.

Multiplicative Weights Update (Prediction with Experts)

Complexity: $\mathcal{O}(NT)$ total; $\mathcal{O}(N)$ per round
Input: $N$ experts, horizon $T$, step size $\eta > 0$
Output: Mixed strategy $\mathbf{p}_t \in \Delta^{N-1}$ at each round
1. Initialize $w_i^{(1)} \leftarrow 1$ for $i = 1, \ldots, N$
2. for $t = 1, 2, \ldots, T$ do
3. $\quad p_i^{(t)} \leftarrow w_i^{(t)} / \sum_{j} w_j^{(t)}$
4. $\quad$ Sample or play mixed strategy $\mathbf{p}^{(t)}$; observe losses $\ell_t^1, \ldots, \ell_t^N \in [0,1]$
5. $\quad$ Incur expected loss $\sum_i p_i^{(t)} \ell_t^i$
6. $\quad w_i^{(t+1)} \leftarrow w_i^{(t)} \exp(-\eta\, \ell_t^i) \quad \forall i$
7. end for

The update is a gradient step in the information geometry of the probability simplex — it is mirror descent with the negative-entropy mirror map, whose Bregman divergence is the KL divergence. This pattern will reappear in exponentiated gradient and in the PAC-Bayes literature.
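The pseudocode above transcribes directly into Python; here is a minimal sketch using the step size from the regret theorem below, with a loss-matrix interface assumed for illustration.

```python
import numpy as np

def multiplicative_weights(losses, eta=None):
    """Multiplicative Weights Update over a (T, N) loss matrix in [0, 1].

    Returns the sequence of mixed strategies p_t and the learner's
    per-round expected losses <p_t, l_t>.
    """
    T, N = losses.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(N) / T)   # step size from the regret bound
    w = np.ones(N)
    strategies, expected_losses = [], []
    for t in range(T):
        p = w / w.sum()                      # normalize weights onto the simplex
        strategies.append(p)
        expected_losses.append(p @ losses[t])
        w *= np.exp(-eta * losses[t])        # exponential (multiplicative) update
    return np.array(strategies), np.array(expected_losses)
```

The cumulative regret is then `expected_losses.sum() - losses.sum(axis=0).min()`, which can be checked empirically against the $\sqrt{2T\ln N}$ bound of the theorem below.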

Theorem: Regret Bound for Multiplicative Weights

Against any sequence of losses $\ell_t^i \in [0,1]$ and any fixed expert $i^\star$, the MWU algorithm with step size $\eta = \sqrt{(2\ln N)/T}$ satisfies

$$\sum_{t=1}^{T} \sum_{i} p_i^{(t)} \ell_t^i \;-\; \sum_{t=1}^{T} \ell_t^{i^\star} \;\leq\; \sqrt{2T\ln N}.$$

That is, $R_T \leq \sqrt{2T\ln N}$, which is sublinear in $T$ and depends only logarithmically (under the square root) on the number of experts.

The potential function is the log-total-weight $\Phi_t = \ln \sum_i w_i^{(t)}$. Each round, $\Phi_t$ changes by at most $-\eta \cdot (\text{learner's expected loss}) + \eta^2/8$. Summing telescopes, and balancing the two terms with $\eta = \sqrt{(2\ln N)/T}$ gives the $\sqrt{T}$ rate.
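For readers who want the algebra behind this sketch, the telescoping argument can be written out as follows (Hoeffding's lemma supplies the $\eta^2/8$ term for losses in $[0,1]$).

```latex
% Potential argument for the MWU regret bound (losses in [0,1])
\begin{aligned}
\Phi_{t+1} - \Phi_t
  &= \ln \frac{\sum_i w_i^{(t)} e^{-\eta \ell_t^i}}{\sum_i w_i^{(t)}}
   = \ln \mathbb{E}_{i \sim \mathbf{p}^{(t)}}\bigl[e^{-\eta \ell_t^i}\bigr]
   \;\le\; -\eta \sum_i p_i^{(t)} \ell_t^i + \frac{\eta^2}{8}
   \quad \text{(Hoeffding's lemma)} \\
\Phi_{T+1} - \Phi_1
  &\;\le\; -\eta \sum_{t=1}^{T} \sum_i p_i^{(t)} \ell_t^i + \frac{T\eta^2}{8},
  \qquad
  \Phi_{T+1} \ge \ln w_{i^\star}^{(T+1)} = -\eta \sum_{t=1}^{T} \ell_t^{i^\star},
  \qquad \Phi_1 = \ln N \\
\Rightarrow\quad
R_T &\;\le\; \frac{\ln N}{\eta} + \frac{\eta T}{8}
  \;\le\; \sqrt{2 T \ln N}
  \quad \text{for } \eta = \sqrt{\tfrac{2\ln N}{T}}.
\end{aligned}
```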

Why MWU Is Everywhere

The multiplicative weights rule reappears under many names: boosting (with exponential weights on training examples), Hedge (game theory), entropic mirror descent (optimization), exponentiated gradient (neural networks), and even in solving LPs (Plotkin–Shmoys–Tardos framework). This is the same trick, repeatedly: when your decision space is a simplex, the natural geometry is information-theoretic, and exponential updates are the natural gradient step. Notice that the $\sqrt{T\ln N}$ rate is tight — it matches the minimax lower bound for the experts problem.

Definition:

Stochastic Multi-Armed Bandits

A $K$-armed bandit plays as follows: at round $t$, the learner selects an arm $A_t \in \{1, \ldots, K\}$, observes a reward $r_t \sim \nu_{A_t}$ with mean $\mu_{A_t}$, and never observes rewards of arms not pulled. The regret is

$$R_T = T \mu^\star - \mathbb{E}\Bigl[\sum_{t=1}^T r_t\Bigr], \qquad \mu^\star = \max_a \mu_a.$$

The hallmark tradeoff is exploration versus exploitation: sample arms you have not tried enough, or pull the arm that currently looks best.

The bandit setting is harder than full-information online learning because only the loss of the chosen arm is revealed. Optimal regret scales as $\mathcal{O}(\sqrt{KT})$ (minimax) or $\mathcal{O}(\log T)$ (gap-dependent). We will use bandits for beam management, where arms = beam codewords.

Upper Confidence Bound (UCB1)

Complexity: $\mathcal{O}(TK)$ time, $\mathcal{O}(K)$ memory
Input: $K$ arms, horizon $T$
Output: A sequence of arm pulls
1. Pull each arm once ($t = 1, \ldots, K$) to initialize
2. for $t = K+1, K+2, \ldots, T$ do
3. $\quad$ Compute empirical mean $\hat{\mu}_a^{(t)}$ and pull count $n_a^{(t)}$ for each arm
4. $\quad$ Compute UCB: $\mathrm{UCB}_a^{(t)} \leftarrow \hat{\mu}_a^{(t)} + \sqrt{2\ln(t)/n_a^{(t)}}$
5. $\quad$ Pull arm $A_t = \arg\max_a \mathrm{UCB}_a^{(t)}$
6. $\quad$ Observe reward $r_t$, update empirical mean
7. end for

The $\sqrt{2\ln t / n_a}$ term is a Hoeffding confidence radius. Pulling the arm with the largest UCB is "optimism in the face of uncertainty": either the arm is genuinely good, or the uncertainty shrinks and we stop pulling it.
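A minimal Python sketch of UCB1 matching the pseudocode above; the reward oracle `pull_arm` is an illustrative assumption (any callable returning rewards in $[0,1]$ will do).

```python
import numpy as np

def ucb1(pull_arm, K, T):
    """UCB1 for K arms over horizon T.

    pull_arm(a) returns a stochastic reward in [0, 1] for arm a.
    Returns the sequence of pulled arms and observed rewards.
    """
    counts = np.zeros(K)          # n_a: number of pulls of arm a
    means = np.zeros(K)           # empirical mean reward of arm a
    arms, rewards = [], []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                # pull each arm once first
        else:
            radius = np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(means + radius))       # optimism: largest UCB
        r = pull_arm(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]       # incremental mean update
        arms.append(a)
        rewards.append(r)
    return arms, rewards
```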

Theorem: UCB1 Regret Bound

For $K$ arms with bounded rewards in $[0,1]$ and gaps $\Delta_a = \mu^\star - \mu_a$, the UCB1 algorithm satisfies

$$R_T \leq \sum_{a : \Delta_a > 0} \left(\frac{8\ln T}{\Delta_a} + \Bigl(1 + \frac{\pi^2}{3}\Bigr)\Delta_a\right).$$

In particular $R_T = \mathcal{O}(\sqrt{KT\ln T})$ in the worst case and $R_T = \mathcal{O}(K\log T / \Delta_{\min})$ when gaps are bounded away from zero.

A suboptimal arm $a$ is pulled roughly $\ln T / \Delta_a^2$ times before its confidence interval disentangles it from the optimum. The per-pull regret is $\Delta_a$, so the total contribution is $\ln T / \Delta_a$ — which matches the lower bound (Lai–Robbins).
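The "roughly $\ln T / \Delta_a^2$ pulls" claim follows from the width of the confidence radius; a back-of-the-envelope version of the step (ignoring the constants and failure probabilities handled in the full proof):

```latex
% When does UCB stop preferring a suboptimal arm a?
% Once its confidence radius shrinks below about half the gap:
\sqrt{\frac{2 \ln t}{n_a}} \;\lesssim\; \frac{\Delta_a}{2}
\;\;\Longleftrightarrow\;\;
n_a \;\gtrsim\; \frac{8 \ln t}{\Delta_a^{2}},
\qquad
\text{so arm } a \text{ contributes regret } \approx\;
\Delta_a \cdot \frac{8 \ln T}{\Delta_a^{2}} \;=\; \frac{8 \ln T}{\Delta_a}.
```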

🎓 CommIT Contribution (2021)

Contextual Bandits for Beam Management at mmWave

Y. Wang, G. Caire, IEEE Trans. Veh. Technol., vol. 70, no. 5

CommIT-group work cast mmWave beam alignment as a contextual bandit: arms = beam pairs, context = mobility state, rewards = post-alignment SNR. By using a UCB-style policy with mobility-aware features, the scheme shortens beam training overhead compared to exhaustive search while retaining near-optimal SNR. This is one place where online learning theory becomes an actual algorithm in a standards-relevant wireless system.


Example: Bandit Beam Management: A Numerical Case

A base station has $K = 16$ candidate beams. True beam gains are drawn from $\mathcal{N}(0, 1)$ once and then fixed. Each round the base station probes one beam, observing the true gain plus $\mathcal{N}(0, 0.25)$ noise. Compare UCB1 to random and greedy policies after $T = 200$ rounds.
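A simulation sketch of this numerical case, using the Gaussian reward model stated above; the random seed and the exact policy implementations are illustrative choices.

```python
import numpy as np

def simulate_beam_bandit(K=16, T=200, noise_std=0.5, seed=0):
    """UCB1 vs. random vs. greedy over K fixed Gaussian beam gains.

    Beam gains ~ N(0, 1) are drawn once and fixed; each probe observes
    gain + N(0, noise_std^2) noise (noise_std = 0.5 gives variance 0.25).
    Returns cumulative pseudo-regret curves for the three policies.
    """
    rng = np.random.default_rng(seed)
    gains = rng.standard_normal(K)              # true beam gains, fixed for the run
    best = gains.max()

    def run(policy):
        counts, means = np.zeros(K), np.zeros(K)
        regret, cum = [], 0.0
        for t in range(1, T + 1):
            if t <= K:
                a = t - 1                       # initialize: probe each beam once
            else:
                a = policy(t, counts, means)
            r = gains[a] + noise_std * rng.standard_normal()
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
            cum += best - gains[a]              # pseudo-regret of this probe
            regret.append(cum)
        return np.array(regret)

    ucb = run(lambda t, n, m: int(np.argmax(m + np.sqrt(2 * np.log(t) / n))))
    greedy = run(lambda t, n, m: int(np.argmax(m)))
    random_policy = run(lambda t, n, m: int(rng.integers(K)))
    return ucb, greedy, random_policy
```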

MWU Regret vs Horizon

Interactive plot: cumulative regret for multiplicative weights on a synthetic experts problem with $N$ experts, comparing the empirical regret to the theoretical $\sqrt{2T\ln N}$ bound; the number of experts $N$, horizon $T$, and step size $\eta$ are adjustable.

UCB vs Random vs Greedy (Beam Arms)

Interactive plot: cumulative regret of UCB1, epsilon-greedy, and pure exploration for $K$ Gaussian-reward arms with gap parameter $\Delta$; the number of arms $K$, horizon $T$, and gap $\Delta$ are adjustable.

Historical Note: Robbins and Monro: The Ancestor of Online Learning

1951–present

The 1951 paper of Herbert Robbins and Sutton Monro introduced stochastic approximation: solve $f(\theta) = 0$ using noisy function evaluations via the update $\theta_{t+1} = \theta_t - \eta_t y_t$. This was the first sequential estimation scheme with a convergence analysis, and it contained the germ of every online learning result that followed: the idea that you can track a moving target by taking small steps against noisy gradients.

Bernard Widrow's LMS filter (1960) was stochastic approximation dressed in signal-processing clothing; Boris Polyak's 1990 averaging scheme showed that you could get asymptotically optimal rates by simply averaging iterates. The modern adversarial regret analysis (Cesa-Bianchi–Lugosi, Zinkevich, Hazan) sharpens these ideas and drops the distributional assumption entirely.

Common Mistake: Confusing No-Regret with Converging to the Optimum

Mistake:

A paper claims an online gradient descent scheme "converges to the optimal solution" of a wireless resource allocation problem. But the problem is non-stationary (channels drift over time), so there is no fixed optimum to converge to.

Correction:

In online settings, the correct guarantee is no-regret against the best fixed policy in hindsight, not convergence to an optimum. When the environment is stationary, no-regret plus a few extra assumptions implies convergence to the minimizer in a time-average sense. When the environment drifts, the best-in-hindsight itself moves, and one needs dynamic regret bounds that depend on the path length of the comparator.
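In symbols, the dynamic-regret quantities referred to in this correction are (a standard formulation, stated here for reference):

```latex
% Dynamic regret against a drifting comparator sequence u_1, ..., u_T,
% measured together with the comparator's path length P_T
R_T^{\mathrm{dyn}} \;=\; \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \;-\; \sum_{t=1}^{T} \ell_t(\mathbf{u}_t),
\qquad
P_T \;=\; \sum_{t=2}^{T} \bigl\lVert \mathbf{u}_t - \mathbf{u}_{t-1} \bigr\rVert .
```

Bounds in this regime degrade gracefully with $P_T$: the more the comparator sequence moves, the weaker the guarantee.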

Why This Matters: Bandits in 5G NR Beam Management

3GPP NR beam management specifies a codebook of up to 64 beams at the base station. Initial access must identify the best beam with minimal overhead. A UCB-style policy over the beam codebook achieves sublinear regret in the number of beam-training slots; variants that exploit mobility context (contextual bandits) cut overhead further. This is a direct instantiation of the exploration-exploitation tradeoff in a standards-defined problem.

⚠️ Engineering Note

Drift Invalidates Standard Regret Bounds

The $\mathcal{O}(\sqrt{T})$ regret bound assumes a single best fixed action throughout the horizon. In practice the best beam, precoder, or user-scheduling policy changes with mobility. Two mitigations: (i) restart the learner periodically (sliding-window MWU); (ii) use drift-aware bounds (Besbes–Gur–Zeevi 2014) that depend on a budget on how much the comparator moves. Ignoring drift leads to linear regret in the long horizon. A minimal restart sketch follows the constraints list below.

Practical Constraints
  • Restart period must be shorter than the coherence time of the environment

  • Drift-aware algorithms require knowing or estimating a drift budget

  • Discounted UCB (D-UCB) achieves $\mathcal{O}(\sqrt{T(1 - \gamma)})$ under geometric forgetting
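As referenced above, a minimal sketch of the restart mitigation, assuming a fixed restart period chosen from knowledge of the environment's coherence time; it reuses the exponential-weights update from earlier in the section.

```python
import numpy as np

def restarted_mwu(losses, restart_period, eta=None):
    """Multiplicative weights with periodic restarts (sliding-window-style).

    losses : (T, N) array of per-round expert losses in [0, 1].
    restart_period : rounds between weight resets; should be shorter than
        the coherence time of the environment (see the constraints above).
    Returns the sequence of mixed strategies.
    """
    T, N = losses.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(N) / restart_period)  # tune to the window, not T
    w = np.ones(N)
    strategies = []
    for t in range(T):
        if t % restart_period == 0:
            w = np.ones(N)                    # forget the past: reset weights
        p = w / w.sum()
        strategies.append(p)
        w *= np.exp(-eta * losses[t])
    return np.array(strategies)
```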

Key Takeaway

Regret is the right figure of merit when distributions are non-stationary or adversarial: it measures how quickly you approach the benchmark, not distance to a fixed truth. The two central algorithms — multiplicative weights on the simplex and online gradient descent on convex sets — both achieve $\mathcal{O}(\sqrt{T})$ worst-case regret, and both arise from the convexity of the per-round loss.

Quick Check

A bandit algorithm achieves $\mathcal{O}(\log T)$ regret on a problem with a $\Delta > 0$ gap. Your colleague claims this means "the loss per round goes to zero." Is this correct?

Yes — dividing by $T$, the per-round regret is $\log(T)/T \to 0$.

No — $\log T$ grows, so the loss grows.

Only if the environment is i.i.d.

No — regret is a lower bound on loss, so loss may still be large.