Online Learning and Sequential Estimation
When Nature Does Not Commit to a Distribution
Classical statistical inference — the framework of Kay, of Cramér, of this book through Chapter 24 — presupposes that data arrive i.i.d. from a fixed (possibly unknown) distribution. In wireless systems that assumption is often wrong: channels drift, interference patterns change, users join and leave, and the adversary (if there is one) adapts. The online-learning framework is designed for exactly this mismatch.
Instead of minimizing expected loss under a fixed distribution, we minimize regret against the best fixed decision in hindsight, with no distributional assumption at all. Remarkably, tight regret bounds exist, and they are achieved by algorithms that look a lot like the stochastic approximation schemes of Robbins and Monro — except the analysis is adversarial.
The point is that online learning is not a different subject from estimation; it is estimation liberated from the i.i.d. assumption, at the price of comparing to a weaker benchmark.
Definition: Cumulative Regret
In online learning, at round $t = 1, 2, \dots$ the learner picks an action $x_t \in \mathcal{K}$, observes a convex loss function $f_t$, and incurs loss $f_t(x_t)$. The cumulative regret of the learner relative to the best fixed action in hindsight is
$$R_T \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x).$$
The learner is no-regret if $R_T / T \to 0$ as $T \to \infty$.
The losses need not be stochastic — they can be picked adversarially after seeing $x_t$. What makes the problem tractable is that the losses are convex (or linear) in $x$.
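To make the bookkeeping concrete, here is a minimal sketch of the regret computation; the array layout and function name are ours, not from the text:

```python
import numpy as np

def cumulative_regret(losses, plays):
    """Cumulative regret of a play sequence vs. the best fixed action.

    losses : (T, N) array, losses[t, i] = loss of action i at round t
    plays  : length-T integer array, plays[t] = action chosen at round t
    """
    learner_loss = losses[np.arange(len(plays)), plays].sum()
    best_fixed = losses.sum(axis=0).min()  # best single action in hindsight
    return learner_loss - best_fixed
```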
Definition: Online Convex Optimization (OCO)
The OCO framework specifies: (i) a convex feasible set $\mathcal{K} \subseteq \mathbb{R}^d$; (ii) at each round, a convex loss function $f_t : \mathcal{K} \to \mathbb{R}$ revealed after the decision; (iii) possibly only subgradient feedback $g_t \in \partial f_t(x_t)$. The goal is to design an algorithm with sublinear regret.
Notice that OCO is convex optimization — the convexity of $f_t$ is what makes tight regret bounds possible. The contrast with offline convex optimization is that we see the objective one piece at a time, not all at once. Every bound in this section ultimately flows from the convexity reflex: if $f_t$ were non-convex, no sublinear regret would be achievable in general.
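The canonical OCO algorithm, the online gradient descent referenced in the Key Takeaway at the end of this section, admits a short sketch. This is a minimal illustration, assuming a Euclidean projection onto $\mathcal{K}$ is available; the `project` and `grad` callables are placeholders of this sketch, not names from the text:

```python
import numpy as np

def online_gradient_descent(project, grad, x0, T, D=1.0, G=1.0):
    """Projected online gradient descent (Zinkevich) on a convex set K.

    project : maps a point to its Euclidean projection onto K
    grad    : grad(t, x) returns a subgradient of the round-t loss at x
              (revealed only after x is played)
    D, G    : diameter of K and bound on subgradient norms

    With eta_t = D / (G * sqrt(t)) this achieves O(sqrt(T)) regret.
    """
    x = x0.copy()
    iterates = [x.copy()]
    for t in range(1, T + 1):
        eta = D / (G * np.sqrt(t))           # standard 1/sqrt(t) step size
        x = project(x - eta * grad(t, x))    # gradient step, then project
        iterates.append(x.copy())
    return iterates
```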
Regret
The gap between the learner's cumulative loss and the loss of the best fixed decision in hindsight. A learner is no-regret if this gap grows sublinearly in $T$, i.e., $R_T = o(T)$.
Multiplicative Weights Update (MWU)
An online learning algorithm for the experts problem in which each expert's weight is multiplied by $e^{-\eta\,\ell_t(i)}$ after observing losses. It achieves $R_T \le 2\sqrt{T\ln N}$ with optimal step size $\eta = \sqrt{\ln N / T}$.
Multiplicative Weights Update (Prediction with Experts)
Complexity: $O(NT)$ total; $O(N)$ per round. The update is a gradient step in the information geometry of the probability simplex — it is mirror descent with the negative-entropy mirror map, whose Bregman divergence is the KL divergence. This pattern will reappear in exponentiated gradient and in the PAC-Bayes literature.
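A minimal sketch of the exponential-weights update in the box above; the loss representation and variable names are ours:

```python
import numpy as np

def multiplicative_weights(loss_matrix, eta):
    """Hedge / MWU over N experts with losses in [0, 1].

    loss_matrix : (T, N) array of per-round expert losses
    eta         : step size, e.g. sqrt(ln(N) / T)

    Returns the learner's cumulative expected loss.
    """
    T, N = loss_matrix.shape
    w = np.ones(N)                           # uniform initial weights
    total = 0.0
    for t in range(T):
        p = w / w.sum()                      # play the normalized weights
        total += p @ loss_matrix[t]          # expected loss this round
        w *= np.exp(-eta * loss_matrix[t])   # exponential down-weighting
    return total
```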
Theorem: Regret Bound for Multiplicative Weights
Against any sequence of losses $\ell_t \in [0,1]^N$ and any fixed expert $i$, the MWU algorithm with step size $\eta > 0$ satisfies
$$\sum_{t=1}^{T}\langle p_t, \ell_t\rangle \;-\; \sum_{t=1}^{T}\ell_t(i) \;\le\; \frac{\ln N}{\eta} + \eta T.$$
That is, $R_T \le 2\sqrt{T \ln N}$ for the tuned step size, which is sublinear in $T$ and grows only logarithmically (under the square root) with the number of experts $N$.
The potential function is the log-total-weight $\Phi_t = \ln W_t = \ln \sum_i w_t(i)$. Each round, $\Phi_t$ changes by (at most) $-\eta\langle p_t,\ell_t\rangle + \eta^2$. Summing telescopes, and balancing the two terms with $\eta = \sqrt{\ln N / T}$ gives the rate.
Define the potential $W_t = \sum_i w_t(i)$ and track its change per round.
Use the inequality $e^{-x} \le 1 - x + x^2$ for $x \ge 0$.
Telescope the potential bound and observe that $W_{T+1} \ge w_{T+1}(i) = e^{-\eta \sum_t \ell_t(i)}$ (since one expert's weight alone is at most the total).
Minimize the resulting upper bound in $\eta$.
Potential function
Let $w_1(i) = 1$ and $w_{t+1}(i) = w_t(i)\,e^{-\eta \ell_t(i)}$, with $W_t = \sum_{i=1}^N w_t(i)$ and $p_t(i) = w_t(i)/W_t$. Using $e^{-x} \le 1 - x + x^2$ for $x \ge 0$ and noting that $\ell_t(i) \in [0,1]$,
$$W_{t+1} = \sum_{i=1}^N w_t(i)\,e^{-\eta \ell_t(i)} \;\le\; W_t\left(1 - \eta\langle p_t,\ell_t\rangle + \eta^2\langle p_t,\ell_t^2\rangle\right).$$
Change in potential
Dividing by $W_t$ and taking logs (using $\ln(1+x) \le x$):
$$\ln\frac{W_{t+1}}{W_t} \;\le\; -\eta\langle p_t,\ell_t\rangle + \eta^2,$$
where $\langle p_t,\ell_t\rangle$ is the learner's expected loss and we used $\langle p_t,\ell_t^2\rangle \le 1$.
Telescoping
Summing from $t = 1$ to $T$:
$$\ln W_{T+1} - \ln W_1 \;\le\; -\eta\sum_{t=1}^T\langle p_t,\ell_t\rangle + \eta^2 T, \qquad \ln W_1 = \ln N.$$
On the other hand, since $w_{T+1}(i)$ is at most $W_{T+1}$, we have $\ln W_{T+1} \ge -\eta\sum_{t=1}^T \ell_t(i)$.
Combine and optimize
Rearranging:
$$\sum_{t=1}^T\langle p_t,\ell_t\rangle - \sum_{t=1}^T \ell_t(i) \;\le\; \frac{\ln N}{\eta} + \eta T.$$
Minimizing over $\eta$ gives $\eta = \sqrt{\ln N / T}$ and $R_T \le 2\sqrt{T\ln N}$.
Why MWU Is Everywhere
The multiplicative weights rule reappears under many names: boosting (with exponential weights on training examples), Hedge (game theory), entropic mirror descent (optimization), exponentiated gradient (neural networks), and even in solving LPs (Plotkin–Shmoys–Tardos framework). This is the same trick, repeatedly: when your decision space is a simplex, the natural geometry is information-theoretic, and exponential updates are the natural gradient step. Notice that the $\sqrt{T\ln N}$ rate is tight up to constants — it matches the minimax lower bound for the experts problem.
Definition: Stochastic Multi-Armed Bandits
A $K$-armed bandit plays as follows: at round $t$, the learner selects an arm $A_t \in \{1,\dots,K\}$, observes a reward $X_t$ with mean $\mu_{A_t}$, and never observes rewards of arms not pulled. The regret is
$$R_T = T\mu^\ast - \mathbb{E}\!\left[\sum_{t=1}^T \mu_{A_t}\right], \qquad \mu^\ast = \max_k \mu_k.$$
The hallmark tradeoff is exploration vs. exploitation: try arms you have not sampled enough, vs. pull the arm that currently looks best.
The bandit setting is harder than full-information online learning because only the loss of the chosen arm is revealed. Optimal regret scales as $\Theta(\sqrt{KT})$ (minimax) or $O\big(\sum_{k:\Delta_k>0} \ln T / \Delta_k\big)$ (gap-dependent). We will use bandits for beam management, where arms = beam codewords.
Upper Confidence Bound (UCB1)
Complexity: $O(K)$ time per round, $O(K)$ memory. The term $\sqrt{2\ln t / N_k(t)}$, where $N_k(t)$ counts pulls of arm $k$ so far, is a Hoeffding confidence radius. Pulling the arm with the largest UCB is "optimism in the face of uncertainty": either the arm is genuinely good, or the uncertainty shrinks and we stop pulling it.
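A minimal UCB1 sketch matching the box above; the `pull` callback is an assumption of this illustration:

```python
import numpy as np

def ucb1(pull, K, T):
    """UCB1 over K arms with rewards assumed in [0, 1].

    pull : pull(k) returns one stochastic reward from arm k
    """
    counts = np.zeros(K)              # N_k(t): pulls of each arm
    means = np.zeros(K)               # empirical mean reward of each arm
    for k in range(K):                # initialize: pull each arm once
        means[k], counts[k] = pull(k), 1
    for t in range(K, T):
        radius = np.sqrt(2.0 * np.log(t + 1) / counts)  # Hoeffding radius
        k = int(np.argmax(means + radius))              # optimism
        r = pull(k)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]          # running mean
    return means, counts
```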
Theorem: UCB1 Regret Bound
For $K$ arms with bounded rewards in $[0,1]$ and gaps $\Delta_k = \mu^\ast - \mu_k$, the UCB1 algorithm satisfies
$$R_T \;\le\; \sum_{k:\Delta_k>0} \frac{8\ln T}{\Delta_k} \;+\; \Big(1 + \frac{\pi^2}{3}\Big)\sum_{k=1}^{K}\Delta_k.$$
In particular $R_T = O(\sqrt{KT\ln T})$ in the worst case and $R_T = O\big((K/\Delta_{\min})\ln T\big)$ when gaps are bounded away from zero by $\Delta_{\min}$.
A suboptimal arm $k$ is pulled roughly $O(\ln T / \Delta_k^2)$ times before its confidence interval disentangles it from the optimum. The per-pull regret is $\Delta_k$, so the total contribution is $O(\ln T / \Delta_k)$ — which matches the lower bound (Lai–Robbins) up to constants.
Bound pulls of suboptimal arm
By Hoeffding, $\Pr\big(|\hat\mu_k(n) - \mu_k| \ge \sqrt{2\ln t / n}\big) \le 2t^{-4}$ for each $n \le t$. Hence the expected number of pulls of suboptimal arm $k$ beyond the threshold $\lceil 8\ln T / \Delta_k^2 \rceil$ is $O(1)$ via a union bound over $t$.
Sum up regret
Each pull of arm $k$ contributes $\Delta_k$ to the regret; the expected total number of pulls of arm $k$ is bounded by $8\ln T/\Delta_k^2 + O(1)$. Summing over $k$ gives the stated bound.
Contextual Bandits for Beam Management at mmWave
CommIT-group work cast mmWave beam alignment as a contextual bandit: arms = beam pairs, context = mobility state, rewards = post-alignment SNR. By using a UCB-style policy with mobility-aware features, the scheme shortens beam training overhead compared to exhaustive search while retaining near-optimal SNR. This is one place where online learning theory becomes an actual algorithm in a standards-relevant wireless system.
Example: Bandit Beam Management: A Numerical Case
A base station has $K = 16$ candidate beams. True beam gains $\mu_1,\dots,\mu_K$ are drawn from a standard normal distribution once and then fixed. Each round the base station probes one beam, observing the true gain plus noise. Compare UCB1 to random and greedy policies after $T$ rounds.
Baseline: random probing
Random probing pulls each arm $T/K$ times in expectation. Expected regret is $\bar\Delta\,T$ with $\bar\Delta = \mu^\ast - \frac{1}{K}\sum_k \mu_k$; for standard normals, $\mathbb{E}[\mu^\ast] \approx 1.77$ (the expected max of 16 standard normals) while the average arm gain is $0$, so $R_T \approx 1.77\,T$: linear in the horizon.
Greedy after short warmup
Pull each arm once, then commit to the empirical best. The probability of picking the wrong arm is non-negligible at one sample per arm, leading to linear regret in the bad event. Expected regret typically exceeds UCB's by a large factor.
UCB1
UCB achieves $R_T = O(\sqrt{KT\ln T})$ in this regime. The improvement over random probing is roughly a factor of $\sqrt{T/(K\ln T)}$ — and it grows with $T$.
MWU Regret vs Horizon
Plot of cumulative regret for multiplicative weights on a synthetic experts problem with $N$ experts, comparing the empirical regret to the theoretical $2\sqrt{T\ln N}$ bound. Vary $T$, $N$, and step size $\eta$.
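A sketch of the experiment behind this figure, under assumed parameters $T = 5000$, $N = 10$ (not specified in the text); we print final values rather than plot:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 5000, 10                            # horizon and number of experts
eta = np.sqrt(np.log(N) / T)               # step size tuned to the horizon
losses = rng.uniform(size=(T, N))          # synthetic losses in [0, 1]

w = np.ones(N)
cum_learner, cum_expert = 0.0, np.zeros(N)
regret = np.zeros(T)
for t in range(T):
    p = w / w.sum()
    cum_learner += p @ losses[t]
    cum_expert += losses[t]
    regret[t] = cum_learner - cum_expert.min()
    w *= np.exp(-eta * losses[t])

print(regret[-1], 2 * np.sqrt(T * np.log(N)))  # empirical vs 2*sqrt(T ln N)
```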
UCB vs Random vs Greedy (Beam Arms)
Compare cumulative regret of UCB1, epsilon-greedy, and pure exploration for $K$ Gaussian-reward arms with gap parameter $\Delta$.
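A sketch of this comparison under assumed parameters ($K = 16$, $T = 10^4$, $\Delta = 0.5$, $\epsilon = 0.1$); note the rewards here are Gaussian, so the $[0,1]$ Hoeffding constant inside UCB1 is heuristic:

```python
import numpy as np

rng = np.random.default_rng(1)
K, T, delta, eps = 16, 10_000, 0.5, 0.1
mu = np.zeros(K); mu[0] = delta              # one good arm, gap = delta

def run(policy):
    counts = np.ones(K)                      # warm start: one pull per arm
    means = rng.normal(mu, 1.0)
    regret = 0.0
    for t in range(K, T):
        if policy == "explore":              # pure exploration
            k = int(rng.integers(K))
        elif policy == "eps-greedy":         # explore w.p. eps, else exploit
            k = int(rng.integers(K)) if rng.random() < eps else int(np.argmax(means))
        else:                                # UCB1
            k = int(np.argmax(means + np.sqrt(2 * np.log(t + 1) / counts)))
        r = rng.normal(mu[k], 1.0)
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]   # incremental mean
        regret += mu.max() - mu[k]               # pseudo-regret increment
    return regret

for name in ("explore", "eps-greedy", "ucb1"):
    print(name, run(name))
```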
Historical Note: Robbins and Monro: The Ancestor of Online Learning
1951–present
The 1951 paper of Herbert Robbins and Sutton Monro introduced stochastic approximation: solve $g(\theta) = 0$ using noisy function evaluations via the update $\theta_{n+1} = \theta_n - a_n Y_n$, where $\mathbb{E}[Y_n \mid \theta_n] = g(\theta_n)$ and the step sizes satisfy $\sum_n a_n = \infty$, $\sum_n a_n^2 < \infty$. This was the first sequential estimation scheme with a convergence analysis, and it contained the germ of every online learning result that followed: the idea that you can track a moving target by taking small steps against noisy gradients.
Bernard Widrow's LMS filter (1960) was stochastic approximation dressed in signal-processing clothing; Boris Polyak's 1990 averaging scheme showed that you could get asymptotically optimal rates by simply averaging iterates. The modern adversarial regret analysis (Cesa-Bianchi–Lugosi, Zinkevich, Hazan) sharpens these ideas and drops the distributional assumption entirely.
Common Mistake: Confusing No-Regret with Converging to the Optimum
Mistake:
A paper claims an online gradient descent scheme "converges to the optimal solution" of a wireless resource allocation problem. But the problem is non-stationary (channels drift over time), so there is no fixed optimum to converge to.
Correction:
In online settings, the correct guarantee is no-regret against the best fixed policy in hindsight, not convergence to an optimum. When the environment is stationary, no-regret plus a few extra assumptions implies convergence to the minimizer in a time-average sense. When the environment drifts, the best-in-hindsight itself moves, and one needs dynamic regret bounds that depend on the path length of the comparator.
Why This Matters: Bandits in 5G NR Beam Management
3GPP NR beam management specifies a codebook of up to 64 beams at the base station. Initial access must identify the best beam with minimal overhead. A UCB-style policy over the beam codebook achieves sub-linear regret in the number of beam-training slots; variants that exploit mobility context (contextual bandits) cut overhead further. This is a direct instantiation of the exploration-exploitation tradeoff in a standards-defined problem.
Drift Invalidates Standard Regret Bounds
The static $O(\sqrt{T\ln N})$ regret bound assumes a single best fixed action throughout the horizon. In practice the best beam, precoder, or user-scheduling policy changes with mobility. Two mitigations: (i) restart the learner periodically (sliding-window MWU); (ii) use drift-aware bounds (Besbes–Gur–Zeevi 2014) that depend on a budget on how much the comparator moves. Ignoring drift leads to linear regret in the long horizon; a discounted-UCB sketch follows the list below.
- Restart period must be shorter than the coherence time of the environment.
- Drift-aware algorithms require knowing or estimating a drift budget.
- Discounted UCB (D-UCB) achieves $O(\sqrt{T\,\Upsilon_T}\,\ln T)$ regret under geometric forgetting, where $\Upsilon_T$ is the number of times the reward distributions change.
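A hedged sketch of D-UCB's geometric forgetting; the parameter names loosely follow the Garivier–Moulines formulation, and the `pull` callback and tuning values are illustrative assumptions:

```python
import numpy as np

def discounted_ucb(pull, K, T, gamma=0.99, xi=0.6):
    """Discounted UCB (D-UCB) sketch for slowly drifting arm means.

    Discounted counts and reward sums decay by gamma each round, so old
    observations are geometrically forgotten.  gamma and xi are tuning
    parameters; rewards are assumed bounded in [0, 1].
    """
    n = np.zeros(K)        # discounted pull counts N_k
    s = np.zeros(K)        # discounted reward sums
    for t in range(T):
        if t < K:
            k = t                                    # initialization pulls
        else:
            mean = s / n
            radius = 2 * np.sqrt(xi * np.log(n.sum()) / n)
            k = int(np.argmax(mean + radius))        # optimism, forgetting
        r = pull(k, t)                               # reward may drift with t
        n *= gamma; s *= gamma                       # geometric forgetting
        n[k] += 1; s[k] += r
    return s / n
```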
Key Takeaway
Regret is the right figure of merit when distributions are non-stationary or adversarial: it measures how quickly you approach the benchmark, not distance to a fixed truth. The two central algorithms — multiplicative weights on the simplex and online gradient descent on convex sets — both achieve $O(\sqrt{T})$ worst-case regret, and both arise from the convexity of the per-round loss.
Quick Check
A bandit algorithm achieves $O(\ln T)$ regret on a problem with a positive gap. Your colleague claims this means "the loss per round goes to zero." Is this correct?
Yes — dividing by $T$, the per-round regret is $O(\ln T / T) \to 0$.
No — $R_T$ grows, so the loss grows.
Only if the environment is i.i.d.
No — regret is a lower bound on loss, so loss may still be large.
The per-round average regret is $R_T/T = O(\ln T / T)$, which does vanish. This is the no-regret property. (Strictly, it is the excess loss over the best arm, not the raw loss, that goes to zero per round.)