Online Learning and Sequential Estimation

When Nature Does Not Commit to a Distribution

Classical statistical inference — the framework of Kay, of Cramér, of this book through Chapter 24 — presupposes that data arrive i.i.d. from a fixed (possibly unknown) distribution. In wireless systems that assumption is often wrong: channels drift, interference patterns change, users join and leave, and the adversary (if there is one) adapts. The online-learning framework is designed for exactly this mismatch.

Instead of minimizing expected loss under a fixed distribution, we minimize regret against the best fixed decision in hindsight, with no distributional assumption at all. Remarkably, tight regret bounds exist, and they are achieved by algorithms that look a lot like the stochastic approximation schemes of Robbins and Monro — except the analysis is adversarial.

The point is that online learning is not a different subject from estimation; it is estimation liberated from the i.i.d. assumption, at the price of comparing to a weaker benchmark.

Definition:

Cumulative Regret

In online learning, at round $t = 1, \ldots, T$ the learner picks an action $\mathbf{w}_t \in \mathcal{W}$, observes a convex loss $\ell_t : \mathcal{W} \to \mathbb{R}$, and incurs loss $\ell_t(\mathbf{w}_t)$. The cumulative regret of the learner relative to the best fixed action in hindsight is

$$R_T \;=\; \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \;-\; \min_{\mathbf{w}^{\star} \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(\mathbf{w}^{\star}).$$

The learner is no-regret if $R_T / T \to 0$ as $T \to \infty$.

The losses $\ell_t$ need not be stochastic — they can be picked adversarially after seeing $\mathbf{w}_t$. What makes the problem tractable is that the losses are convex (or linear) in $\mathbf{w}$.
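To make the bookkeeping concrete, here is a minimal Python sketch that computes $R_T$ from recorded data in the experts special case (mixed strategies over $N$ experts, linear losses); the array interface is an illustrative assumption, not part of the definition.

```python
import numpy as np

def cumulative_regret(losses, actions):
    """Cumulative regret in the experts setting.

    losses  : (T, N) array, losses[t, i] = loss of expert i at round t
    actions : (T, N) array, actions[t]   = learner's mixed strategy at round t
    Returns R_T = learner's cumulative expected loss minus the cumulative
    loss of the best fixed expert in hindsight.
    """
    learner_loss = np.sum(actions * losses)           # sum_t <p_t, l_t>
    best_fixed = np.min(np.sum(losses, axis=0))       # min_i sum_t l_t^i
    return learner_loss - best_fixed
```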

Definition:

Online Convex Optimization (OCO)

The OCO framework specifies: (i) a convex feasible set $\mathcal{W} \subseteq \mathbb{R}^d$; (ii) at each round, a convex loss function $\ell_t : \mathcal{W} \to \mathbb{R}$ revealed after the decision; (iii) possibly subgradient feedback $\mathbf{g}_t \in \partial \ell_t(\mathbf{w}_t)$. The goal is to design an algorithm with sublinear regret.

Notice that OCO is convex optimization — the convexity of $\ell_t$ is what makes tight regret bounds possible. The contrast with offline convex optimization is that we see the objective one piece at a time, not all at once. Every bound in this section ultimately flows from the convexity reflex: if $\ell_t$ were non-convex, no sublinear regret would be achievable in general.
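The Key Takeaway at the end of this section names online gradient descent on convex sets as the second workhorse OCO algorithm. A minimal sketch is given below, assuming a Euclidean-ball feasible set and a decaying step size; the projection, step-size rule, and `grad_oracle` interface are illustrative assumptions, not part of the definition above.

```python
import numpy as np

def project_to_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def online_gradient_descent(grad_oracle, d, T, radius=1.0, G=1.0):
    """Projected online subgradient descent (sketch, Zinkevich-style).

    grad_oracle(t, w) returns a subgradient g_t of the round-t loss at w;
    the losses themselves never need to be known in closed form.
    A step size of order radius / (G * sqrt(t)) yields O(sqrt(T)) regret
    when subgradient norms are bounded by G.
    """
    w = np.zeros(d)
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w.copy())
        g = grad_oracle(t, w)                      # subgradient feedback
        eta = radius / (G * np.sqrt(t))            # decaying step size
        w = project_to_ball(w - eta * g, radius)   # gradient step + projection
    return iterates
```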

Regret

The gap between the learner's cumulative loss and the loss of the best fixed decision in hindsight. A learner is no-regret if this gap grows sublinearly in $T$, i.e., $R_T = o(T)$.

Multiplicative Weights Update (MWU)

An online learning algorithm for the experts problem in which each expert's weight is multiplied by $\exp(-\eta \ell_t^i)$ after observing losses. It achieves $R_T \leq \sqrt{2T\ln N}$ with optimal step size $\eta = \sqrt{2\ln N / T}$.

Multiplicative Weights Update (Prediction with Experts)

Complexity: $\mathcal{O}(NT)$ total; $\mathcal{O}(N)$ per round
Input: $N$ experts, horizon $T$, step size $\eta > 0$
Output: Mixed strategy $\mathbf{p}_t \in \Delta^{N-1}$ at each round
1. Initialize $w_i^{(1)} \leftarrow 1$ for $i = 1, \ldots, N$
2. for $t = 1, 2, \ldots, T$ do
3. $\quad p_i^{(t)} \leftarrow w_i^{(t)} / \sum_{j} w_j^{(t)}$
4. $\quad$ Sample or play mixed strategy $\mathbf{p}^{(t)}$; observe losses $\ell_t^1, \ldots, \ell_t^N \in [0,1]$
5. $\quad$ Incur expected loss $\sum_i p_i^{(t)} \ell_t^i$
6. $\quad w_i^{(t+1)} \leftarrow w_i^{(t)} \exp(-\eta\, \ell_t^i) \quad \forall i$
7. end for

The update is a gradient step in the information geometry of the probability simplex — it is mirror descent with the negative-entropy mirror map, whose Bregman divergence is the KL divergence. This pattern will reappear in exponentiated gradient and in the PAC-Bayes literature.
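The pseudocode above transcribes directly into Python; here is a minimal sketch using the step size from the regret theorem below, with a loss-matrix interface assumed for illustration.

```python
import numpy as np

def multiplicative_weights(losses, eta=None):
    """Multiplicative Weights Update over a (T, N) loss matrix in [0, 1].

    Returns the sequence of mixed strategies p_t and the learner's
    per-round expected losses <p_t, l_t>.
    """
    T, N = losses.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(N) / T)   # step size from the regret bound
    w = np.ones(N)
    strategies, expected_losses = [], []
    for t in range(T):
        p = w / w.sum()                      # normalize weights onto the simplex
        strategies.append(p)
        expected_losses.append(p @ losses[t])
        w *= np.exp(-eta * losses[t])        # exponential (multiplicative) update
    return np.array(strategies), np.array(expected_losses)
```

The cumulative regret is then `expected_losses.sum() - losses.sum(axis=0).min()`, which can be checked empirically against the $\sqrt{2T\ln N}$ bound of the theorem below.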

Theorem: Regret Bound for Multiplicative Weights

Against any sequence of losses $\ell_t^i \in [0,1]$ and any fixed expert $i^\star$, the MWU algorithm with step size $\eta = \sqrt{(2\ln N)/T}$ satisfies

$$\sum_{t=1}^{T} \sum_{i} p_i^{(t)} \ell_t^i \;-\; \sum_{t=1}^{T} \ell_t^{i^\star} \;\leq\; \sqrt{2T\ln N}.$$

That is, $R_T \leq \sqrt{2T\ln N}$, which is sublinear in $T$ and depends only logarithmically (under the square root) on the number of experts.

The potential function is the log-total-weight $\Phi_t = \ln \sum_i w_i^{(t)}$. Each round, $\Phi_t$ changes by at most $-\eta \cdot (\text{learner's expected loss}) + \eta^2/8$. Summing telescopes, and balancing the two terms with $\eta = \sqrt{(2\ln N)/T}$ gives the $\sqrt{T}$ rate.
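For readers who want the algebra behind this sketch, the telescoping argument can be written out as follows (Hoeffding's lemma supplies the $\eta^2/8$ term for losses in $[0,1]$).

```latex
% Potential argument for the MWU regret bound (losses in [0,1])
\begin{aligned}
\Phi_{t+1} - \Phi_t
  &= \ln \frac{\sum_i w_i^{(t)} e^{-\eta \ell_t^i}}{\sum_i w_i^{(t)}}
   = \ln \mathbb{E}_{i \sim \mathbf{p}^{(t)}}\bigl[e^{-\eta \ell_t^i}\bigr]
   \;\le\; -\eta \sum_i p_i^{(t)} \ell_t^i + \frac{\eta^2}{8}
   \quad \text{(Hoeffding's lemma)} \\
\Phi_{T+1} - \Phi_1
  &\;\le\; -\eta \sum_{t=1}^{T} \sum_i p_i^{(t)} \ell_t^i + \frac{T\eta^2}{8},
  \qquad
  \Phi_{T+1} \ge \ln w_{i^\star}^{(T+1)} = -\eta \sum_{t=1}^{T} \ell_t^{i^\star},
  \qquad \Phi_1 = \ln N \\
\Rightarrow\quad
R_T &\;\le\; \frac{\ln N}{\eta} + \frac{\eta T}{8}
  \;\le\; \sqrt{2 T \ln N}
  \quad \text{for } \eta = \sqrt{\tfrac{2\ln N}{T}}.
\end{aligned}
```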

Why MWU Is Everywhere

The multiplicative weights rule reappears under many names: boosting (with exponential weights on training examples), Hedge (game theory), entropic mirror descent (optimization), exponentiated gradient (neural networks), and even in solving LPs (Plotkin–Shmoys–Tardos framework). This is the same trick, repeatedly: when your decision space is a simplex, the natural geometry is information-theoretic, and exponential updates are the natural gradient step. Notice that the $\sqrt{T\ln N}$ rate is tight — it matches the minimax lower bound for the experts problem.

Definition:

Stochastic Multi-Armed Bandits

A $K$-armed bandit plays as follows: at round $t$, the learner selects an arm $A_t \in \{1, \ldots, K\}$, observes a reward $r_t \sim \nu_{A_t}$ with mean $\mu_{A_t}$, and never observes rewards of arms not pulled. The regret is

$$R_T = T \mu^\star - \mathbb{E}\Bigl[\sum_{t=1}^T r_t\Bigr], \qquad \mu^\star = \max_a \mu_a.$$

The hallmark tradeoff is exploration versus exploitation: sample arms you have not tried enough, or pull the arm that currently looks best.

The bandit setting is harder than full-information online learning because only the loss of the chosen arm is revealed. Optimal regret scales as $\mathcal{O}(\sqrt{KT})$ (minimax) or $\mathcal{O}(\log T)$ (gap-dependent). We will use bandits for beam management, where arms = beam codewords.

Upper Confidence Bound (UCB1)

Complexity: $\mathcal{O}(TK)$ time, $\mathcal{O}(K)$ memory
Input: $K$ arms, horizon $T$
Output: A sequence of arm pulls
1. Pull each arm once ($t = 1, \ldots, K$) to initialize
2. for $t = K+1, K+2, \ldots, T$ do
3. $\quad$ Compute empirical mean $\hat{\mu}_a^{(t)}$ and pull count $n_a^{(t)}$ for each arm
4. $\quad$ Compute UCB: $\mathrm{UCB}_a^{(t)} \leftarrow \hat{\mu}_a^{(t)} + \sqrt{2\ln(t)/n_a^{(t)}}$
5. $\quad$ Pull arm $A_t = \arg\max_a \mathrm{UCB}_a^{(t)}$
6. $\quad$ Observe reward $r_t$, update empirical mean
7. end for

The $\sqrt{2\ln t / n_a}$ term is a Hoeffding confidence radius. Pulling the arm with the largest UCB is "optimism in the face of uncertainty": either the arm is genuinely good, or the uncertainty shrinks and we stop pulling it.
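A minimal Python sketch of UCB1 matching the pseudocode above; the reward oracle `pull_arm` is an illustrative assumption (any callable returning rewards in $[0,1]$ will do).

```python
import numpy as np

def ucb1(pull_arm, K, T):
    """UCB1 for K arms over horizon T.

    pull_arm(a) returns a stochastic reward in [0, 1] for arm a.
    Returns the sequence of pulled arms and observed rewards.
    """
    counts = np.zeros(K)          # n_a: number of pulls of arm a
    means = np.zeros(K)           # empirical mean reward of arm a
    arms, rewards = [], []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                # pull each arm once first
        else:
            radius = np.sqrt(2.0 * np.log(t) / counts)
            a = int(np.argmax(means + radius))       # optimism: largest UCB
        r = pull_arm(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]       # incremental mean update
        arms.append(a)
        rewards.append(r)
    return arms, rewards
```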

Theorem: UCB1 Regret Bound

For $K$ arms with bounded rewards in $[0,1]$ and gaps $\Delta_a = \mu^\star - \mu_a$, the UCB1 algorithm satisfies

$$R_T \leq \sum_{a : \Delta_a > 0} \left(\frac{8\ln T}{\Delta_a} + \Bigl(1 + \frac{\pi^2}{3}\Bigr)\Delta_a\right).$$

In particular $R_T = \mathcal{O}(\sqrt{KT\ln T})$ in the worst case and $R_T = \mathcal{O}(K\log T / \Delta_{\min})$ when gaps are bounded away from zero.

A suboptimal arm $a$ is pulled roughly $\ln T / \Delta_a^2$ times before its confidence interval disentangles it from the optimum. The per-pull regret is $\Delta_a$, so the total contribution is $\ln T / \Delta_a$ — which matches the lower bound (Lai–Robbins).
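The "roughly $\ln T / \Delta_a^2$ pulls" claim follows from the width of the confidence radius; a back-of-the-envelope version of the step (ignoring the constants and failure probabilities handled in the full proof):

```latex
% When does UCB stop preferring a suboptimal arm a?
% Once its confidence radius shrinks below about half the gap:
\sqrt{\frac{2 \ln t}{n_a}} \;\lesssim\; \frac{\Delta_a}{2}
\;\;\Longleftrightarrow\;\;
n_a \;\gtrsim\; \frac{8 \ln t}{\Delta_a^{2}},
\qquad
\text{so arm } a \text{ contributes regret } \approx\;
\Delta_a \cdot \frac{8 \ln T}{\Delta_a^{2}} \;=\; \frac{8 \ln T}{\Delta_a}.
```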

🎓 CommIT Contribution (2021)

Contextual Bandits for Beam Management at mmWave

Y. Wang, G. Caire, IEEE Trans. Veh. Technol., vol. 70, no. 5

CommIT-group work cast mmWave beam alignment as a contextual bandit: arms = beam pairs, context = mobility state, rewards = post-alignment SNR. By using a UCB-style policy with mobility-aware features, the scheme shortens beam training overhead compared to exhaustive search while retaining near-optimal SNR. This is one place where online learning theory becomes an actual algorithm in a standards-relevant wireless system.


Example: Bandit Beam Management: A Numerical Case

A base station has $K = 16$ candidate beams. True beam gains are drawn from $\mathcal{N}(0, 1)$ once and then fixed. Each round the base station probes one beam, observing the true gain plus $\mathcal{N}(0, 0.25)$ noise. Compare UCB1 to random and greedy policies after $T = 200$ rounds.
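A simulation sketch of this numerical case, using the Gaussian reward model stated above; the random seed and the exact policy implementations are illustrative choices.

```python
import numpy as np

def simulate_beam_bandit(K=16, T=200, noise_std=0.5, seed=0):
    """UCB1 vs. random vs. greedy over K fixed Gaussian beam gains.

    Beam gains ~ N(0, 1) are drawn once and fixed; each probe observes
    gain + N(0, noise_std^2) noise (noise_std = 0.5 gives variance 0.25).
    Returns cumulative pseudo-regret curves for the three policies.
    """
    rng = np.random.default_rng(seed)
    gains = rng.standard_normal(K)              # true beam gains, fixed for the run
    best = gains.max()

    def run(policy):
        counts, means = np.zeros(K), np.zeros(K)
        regret, cum = [], 0.0
        for t in range(1, T + 1):
            if t <= K:
                a = t - 1                       # initialize: probe each beam once
            else:
                a = policy(t, counts, means)
            r = gains[a] + noise_std * rng.standard_normal()
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
            cum += best - gains[a]              # pseudo-regret of this probe
            regret.append(cum)
        return np.array(regret)

    ucb = run(lambda t, n, m: int(np.argmax(m + np.sqrt(2 * np.log(t) / n))))
    greedy = run(lambda t, n, m: int(np.argmax(m)))
    random_policy = run(lambda t, n, m: int(rng.integers(K)))
    return ucb, greedy, random_policy
```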

MWU Regret vs Horizon

Interactive plot: cumulative regret for multiplicative weights on a synthetic experts problem with $N$ experts, comparing the empirical regret to the theoretical $\sqrt{2T\ln N}$ bound; the number of experts $N$, horizon $T$, and step size $\eta$ are adjustable.

UCB vs Random vs Greedy (Beam Arms)

Interactive plot: cumulative regret of UCB1, epsilon-greedy, and pure exploration for $K$ Gaussian-reward arms with gap parameter $\Delta$; the number of arms $K$, horizon $T$, and gap $\Delta$ are adjustable.

Historical Note: Robbins and Monro: The Ancestor of Online Learning

1951–present

The 1951 paper of Herbert Robbins and Sutton Monro introduced stochastic approximation: solve $f(\theta) = 0$ using noisy function evaluations via the update $\theta_{t+1} = \theta_t - \eta_t y_t$. This was the first sequential estimation scheme with a convergence analysis, and it contained the germ of every online learning result that followed: the idea that you can track a moving target by taking small steps against noisy gradients.

Bernard Widrow's LMS filter (1960) was stochastic approximation dressed in signal-processing clothing; Boris Polyak's 1990 averaging scheme showed that you could get asymptotically optimal rates by simply averaging iterates. The modern adversarial regret analysis (Cesa-Bianchi–Lugosi, Zinkevich, Hazan) sharpens these ideas and drops the distributional assumption entirely.

Common Mistake: Confusing No-Regret with Converging to the Optimum

Mistake:

A paper claims an online gradient descent scheme "converges to the optimal solution" of a wireless resource allocation problem. But the problem is non-stationary (channels drift over time), so there is no fixed optimum to converge to.

Correction:

In online settings, the correct guarantee is no-regret against the best fixed policy in hindsight, not convergence to an optimum. When the environment is stationary, no-regret plus a few extra assumptions implies convergence to the minimizer in a time-average sense. When the environment drifts, the best-in-hindsight itself moves, and one needs dynamic regret bounds that depend on the path length of the comparator.
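In symbols, the dynamic-regret quantities referred to in this correction are (a standard formulation, stated here for reference):

```latex
% Dynamic regret against a drifting comparator sequence u_1, ..., u_T,
% measured together with the comparator's path length P_T
R_T^{\mathrm{dyn}} \;=\; \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) \;-\; \sum_{t=1}^{T} \ell_t(\mathbf{u}_t),
\qquad
P_T \;=\; \sum_{t=2}^{T} \bigl\lVert \mathbf{u}_t - \mathbf{u}_{t-1} \bigr\rVert .
```

Bounds in this regime degrade gracefully with $P_T$: the more the comparator sequence moves, the weaker the guarantee.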

Why This Matters: Bandits in 5G NR Beam Management

3GPP NR beam management specifies a codebook of up to 64 beams at the base station. Initial access must identify the best beam with minimal overhead. A UCB-style policy over the beam codebook achieves sublinear regret in the number of beam-training slots; variants that exploit mobility context (contextual bandits) cut overhead further. This is a direct instantiation of the exploration-exploitation tradeoff in a standards-defined problem.

⚠️ Engineering Note

Drift Invalidates Standard Regret Bounds

The $\mathcal{O}(\sqrt{T})$ regret bound assumes a single best fixed action throughout the horizon. In practice the best beam, precoder, or user-scheduling policy changes with mobility. Two mitigations: (i) restart the learner periodically (sliding-window MWU); (ii) use drift-aware bounds (Besbes–Gur–Zeevi 2014) that depend on a budget on how much the comparator moves. Ignoring drift leads to linear regret in the long horizon. A minimal restart sketch follows the constraints list below.

Practical Constraints
  • Restart period must be shorter than the coherence time of the environment

  • Drift-aware algorithms require knowing or estimating a drift budget

  • Discounted UCB (D-UCB) achieves $\mathcal{O}(\sqrt{T(1 - \gamma)})$ under geometric forgetting
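As referenced above, a minimal sketch of the restart mitigation, assuming a fixed restart period chosen from knowledge of the environment's coherence time; it reuses the exponential-weights update from earlier in the section.

```python
import numpy as np

def restarted_mwu(losses, restart_period, eta=None):
    """Multiplicative weights with periodic restarts (sliding-window-style).

    losses : (T, N) array of per-round expert losses in [0, 1].
    restart_period : rounds between weight resets; should be shorter than
        the coherence time of the environment (see the constraints above).
    Returns the sequence of mixed strategies.
    """
    T, N = losses.shape
    if eta is None:
        eta = np.sqrt(2.0 * np.log(N) / restart_period)  # tune to the window, not T
    w = np.ones(N)
    strategies = []
    for t in range(T):
        if t % restart_period == 0:
            w = np.ones(N)                    # forget the past: reset weights
        p = w / w.sum()
        strategies.append(p)
        w *= np.exp(-eta * losses[t])
    return np.array(strategies)
```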

Key Takeaway

Regret is the right figure of merit when distributions are non-stationary or adversarial: it measures how quickly you approach the benchmark, not distance to a fixed truth. The two central algorithms — multiplicative weights on the simplex and online gradient descent on convex sets — both achieve $\mathcal{O}(\sqrt{T})$ worst-case regret, and both arise from the convexity of the per-round loss.

Quick Check

A bandit algorithm achieves $\mathcal{O}(\log T)$ regret on a problem with a $\Delta > 0$ gap. Your colleague claims this means "the loss per round goes to zero." Is this correct?

Yes — dividing by $T$, the per-round regret is $\log(T)/T \to 0$.

No — $\log T$ grows, so the loss grows.

Only if the environment is i.i.d.

No — regret is a lower bound on loss, so loss may still be large.