Ferkans — Interactive Telecom Tutor

The Joint Optimization

Sections §17.1–§17.2 defined the wireless-FL problem and its convergence behavior as a function of per-round aggregation MSE and user participation. Two levers shape the per-round MSE in practice:

Device scheduling $\mathcal{S}_t$ — which users upload each round.
Power allocation $P_k^{(t)}$ — how much transmit budget each user spends.

These are coupled. Selecting more users increases statistical representativeness but, for AirComp, worsens the worst-user MSE if weak-channel users are included. Allocating more power to weak users accelerates their gradients' contribution but drains batteries.

The point is that the wireless-FL design problem is a bi-level optimization: the outer layer minimizes convergence loss over $T$ rounds; the inner layer chooses $(\mathcal{S}_t, P_k^{(t)})$ each round to meet the MSE target at minimum cost. This section derives optimal (and practical heuristic) scheduling and power-control rules and quantifies their effect on end-to-end convergence.

Definition:
Wireless-FL Joint Optimization

Given a total-round budget $T$ , per-user energy budgets $\{E_k\}$ , and channel process $\{h_k^{(t)}\}$ , the wireless-FL joint optimization is $\min_{\{(\mathcal{S}_t, P_k^{(t)})\}_{t=0}^{T-1}} \mathbb{E}[F(\boldsymbol{\theta}_T)]$ subject to $\sum_{t: k \in \mathcal{S}_t} P_k^{(t)} \;\leq\; E_k \qquad \forall k,$ $|\mathcal{S}_t| \leq n \text{ (all users)}, \qquad P_k^{(t)} \leq P_{\max} \qquad \forall (k, t).$

The expectation is over randomness in gradients and channel noise. By Theorem 17.2.1, this reduces to minimizing the noise floor, that is, an online per-round optimization over $\mathsf{MSE}_{\text{agg}}(t)$.

Theorem: Optimal Per-Round AirComp Scheduling

Given per-round CSI $\{\gamma_k^{(t)}\}$ and MSE tolerance $\mathsf{MSE}^{\text{tol}}$, the round-$t$ scheduling and power rules $\mathcal{S}_t = \{k : \gamma_k^{(t)} \geq \tau_t\}, \qquad P_k^{(t)} = \eta_t^2 \sigma_s^2 / |h_k^{(t)}|^2$ (for $k \in \mathcal{S}_t$) with threshold $\tau_t = \sigma^2/\mathsf{MSE}^{\text{tol}}$ and common amplitude $\eta_t = \sqrt{\tau_t}$ are Pareto-optimal: they minimize the MSE for a given user count and per-user power limit, and they satisfy the feasibility constraints of Theorem 16.2.1.

Proof

Feasibility

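Users with $\gamma_k^{(t)} \geq \tau_t$ can satisfy $b_k h_k = \eta_t$: using $\gamma_k^{(t)} \sigma_s^2 = |h_k|^2 P_k$, the required precoder has $|b_k|^2 = \eta_t^2/|h_k|^2 = (P_k/\sigma_s^2) \cdot \tau_t/\gamma_k^{(t)} \leq P_k/\sigma_s^2$, with equality when $\gamma_k^{(t)} = \tau_t$.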

MSE

$\mathsf{MSE}(\mathcal{S}_t, \eta_t) = \sigma^2/\eta_t^2 = \sigma^2/\tau_t = \mathsf{MSE}^{\text{tol}}$ . Exactly meeting the target.

Pareto optimality

From the threshold-scheduling theorem (§16.2 Thm 16.2.2): the threshold set is Pareto-optimal. Increasing $\tau_t$ reduces MSE but shrinks $\mathcal{S}_t$; decreasing $\tau_t$ expands $\mathcal{S}_t$ but raises MSE. $\tau_t = \sigma^2/\mathsf{MSE}^{\text{tol}}$ is the corner of the frontier.

Operational

The designer picks the MSE target (from the convergence analysis); the scheduling and power rules fall out analytically. No iterative optimization per round. For deployment: pre-compute $\{\tau_t\}$ offline based on channel statistics.
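The per-round rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `threshold_schedule`, its argument names, and the sample numbers are hypothetical, users are zero-indexed, and the server is assumed to know $\gamma_k^{(t)}$, $|h_k^{(t)}|^2$, $\sigma^2$, and $\sigma_s^2$ exactly.

```python
import math

def threshold_schedule(gamma, h_abs2, sigma2, sigma_s2, mse_tol):
    """One round of threshold scheduling with channel-inversion power.

    gamma    : effective channel gains gamma_k, one per user
    h_abs2   : squared channel magnitudes |h_k|^2, same order as gamma
    sigma2   : receiver noise variance sigma^2
    sigma_s2 : signal variance sigma_s^2
    mse_tol  : per-round MSE tolerance
    """
    tau = sigma2 / mse_tol                      # threshold tau_t
    eta = math.sqrt(tau)                        # common amplitude eta_t
    scheduled = [k for k, g in enumerate(gamma) if g >= tau]
    # Each scheduled user inverts its channel so that b_k * h_k = eta_t.
    power = {k: eta**2 * sigma_s2 / h_abs2[k] for k in scheduled}
    mse = sigma2 / eta**2                       # equals mse_tol by construction
    return tau, eta, scheduled, power, mse

# Hypothetical three-user round (here |h_k|^2 is set equal to gamma_k).
tau, eta, scheduled, power, mse = threshold_schedule(
    gamma=[1.2, 0.3, 0.8], h_abs2=[1.2, 0.3, 0.8],
    sigma2=0.01, sigma_s2=1.0, mse_tol=0.02)
```

Nothing here iterates: the threshold depends only on $\sigma^2$ and the MSE target, which is why the rule can be applied online at negligible cost.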

Example: Worked Scheduling at Round 5

Round $t = 5$ : $n = 20$ users with effective channel gains (units $P \sigma_s^{-2}$ ): $\gamma_k^{(5)} = \{1.2, 0.3, 0.8, 0.1, 2.0, 0.5, 0.9, 1.5, 0.7, 0.4, 1.1, 0.6, 0.2, 0.3, 0.8, 1.8, 0.5, 0.9, 1.3, 0.7\}$ . Noise variance $\sigma^2 = 0.01$ . MSE target $\mathsf{MSE}^{\text{tol}} = 0.02$ . Apply Theorem 17.3.1 to compute $\mathcal{S}_5$ .

Solution

Compute $\tau_5$

$\tau_5 = 0.01/0.02 = 0.5$ .

Identify users satisfying $\gamma_k \geq 0.5$

Users with $\gamma_k \geq 0.5$ : $\{1, 3, 5, 6, 7, 8, 9, 11, 12, 15, 16, 17, 18, 19, 20\}$ . Count: $15$ of $20$ .

Compute $\eta_5$

$\eta_5 = \sqrt{0.5} \approx 0.707$ .

Per-user power

$P_k^{(5)} = 0.5 \sigma_s^2 / |h_k|^2$ (from $\gamma_k^{(5)} \sigma_s^2 = |h_k|^2 P_k$ ).

Verify feasibility

Users with $\gamma_k^{(5)} \approx 0.5$ use full budget. Users with $\gamma_k^{(5)} > 0.5$ use less than full budget — they have slack.

Operational

Five users are excluded this round. If they are excluded regularly, their data contributes less, which biases the learned model. Mitigate via rotation or fairness-aware scheduling (below).
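The numbers in this example can be checked with a short script. It is a sketch only: the variable names are illustrative, users are numbered from 1 to match the text, and the "budget fraction" $\tau_5/\gamma_k^{(5)}$ is derived from the relation $\gamma_k^{(5)} \sigma_s^2 = |h_k|^2 P_k$ quoted above, so it is an interpretation of that relation rather than a formula stated in the source.

```python
gamma = [1.2, 0.3, 0.8, 0.1, 2.0, 0.5, 0.9, 1.5, 0.7, 0.4,
         1.1, 0.6, 0.2, 0.3, 0.8, 1.8, 0.5, 0.9, 1.3, 0.7]
sigma2, mse_tol = 0.01, 0.02

tau = sigma2 / mse_tol          # threshold tau_5
eta = tau ** 0.5                # common amplitude eta_5
EPS = 1e-12                     # guard against rounding at the boundary

# Users are numbered 1..20 as in the example.
scheduled = [k for k, g in enumerate(gamma, start=1) if g >= tau - EPS]
excluded = [k for k in range(1, len(gamma) + 1) if k not in scheduled]

# Fraction of the power budget each scheduled user spends: tau / gamma_k.
budget_fraction = {k: tau / gamma[k - 1] for k in scheduled}
full_budget = [k for k, f in budget_fraction.items() if f >= 1 - 1e-9]

print("scheduled:", scheduled)
print("excluded:", excluded)
print("users at full budget:", full_budget)
```

Users sitting exactly on the threshold spend their whole budget, while a strong user such as user 5 ($\gamma = 2.0$) spends only a quarter of it.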

Definition:
Fairness-Aware Scheduling

A scheduling rule is $\alpha$-proportionally fair if, over $T$ rounds, each user $k$ has participated in $\mathcal{S}_t$ for at least $\alpha T / n$ rounds (i.e., each user carries at least an $\alpha$-fraction of its "proportional share" of participations).

Threshold scheduling (Theorem 17.3.1) is not fair: persistently weak-channel users are systematically excluded. To enforce $\alpha$ -fairness, add a lower-bound constraint: $\sum_{t=0}^{T-1} \mathbb{1}[k \in \mathcal{S}_t] \;\geq\; \alpha T/n, \quad \forall k.$

Under this constraint, the per-round MSE increases (the scheduler must accept weaker users in some rounds), but all users' gradients contribute.
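The lower-bound constraint is easy to audit after the fact. The sketch below counts participations and checks them against $\alpha T / n$; the function names and the two toy schedules are hypothetical, and users are zero-indexed.

```python
def participation_counts(schedule, n):
    """Count how many rounds each of the n users appears in."""
    counts = [0] * n
    for round_set in schedule:
        for k in round_set:
            counts[k] += 1
    return counts

def is_alpha_fair(schedule, n, alpha):
    """True if every user joins at least alpha * T / n rounds."""
    required = alpha * len(schedule) / n
    return all(c >= required for c in participation_counts(schedule, n))

n, T = 4, 8
# Round-robin: one user per round, so each user joins T / n = 2 rounds.
round_robin = [{t % n} for t in range(T)]
# Threshold-like: the same two strong users every round.
strong_only = [{0, 1} for _ in range(T)]

print(is_alpha_fair(round_robin, n, alpha=1.0))
print(is_alpha_fair(strong_only, n, alpha=0.5))
```

The second schedule fails because users 2 and 3 never participate, which is the systematic exclusion that threshold scheduling produces for persistently weak channels.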

Theorem: Fairness-MSE Trade-off

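For a given $\alpha \in [0, 1]$, the optimal $\alpha$-fair scheduler has per-round MSE that is, on average, at most a factor $\eta_{\alpha} = \frac{1}{1 - \alpha \cdot \kappa}$ larger than the unconstrained optimal MSE, where $\kappa = 1 - \mathbb{E}[\gamma_{\min}]/\mathbb{E}[\gamma_{(\alpha)}]$ is a channel-distribution-dependent constant ($\gamma_{(\alpha)}$ is the $\alpha$-th percentile). Since $\mathbb{E}[\gamma_{\min}] \leq \mathbb{E}[\gamma_{(\alpha)}]$, we have $\kappa \in [0, 1]$ and hence $\eta_{\alpha} \geq 1$. Note that $\eta_{\alpha}$ is a cost factor and is unrelated to the amplitude $\eta_t$.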

Interpretation. Tighter fairness ( $\alpha \to 1$ , every user included every round) approaches the worst-user bottleneck; looser fairness ( $\alpha \to 0$ , threshold scheduling) approaches the unconstrained optimum.

Proof

Unconstrained MSE

Minimum MSE = $\sigma^2/\mathbb{E}[\max \gamma]$ (threshold scheduling selects best).

$\alpha$-fair constraint

Must schedule the weakest $\alpha n$ users some of the time. Weakest user has average $\mathbb{E}[\gamma_{\min}]$ — dragging down the schedule-weighted average.

Bound factor

Simple algebra: MSE scales as $\sigma^2/\mathbb{E}[\gamma_{\alpha\text{-aware}}]$ where the effective $\gamma$ is a weighted average. Combining gives the stated $\eta_{\alpha}$ .

Operational

For typical channels (Rayleigh), $\eta_{0.5} \approx 1.5$ — $50\%$ -fairness costs $\sim 50\%$ MSE. For the FL convergence, this means $\sim 50\%$ more rounds to reach the same loss floor — a manageable trade-off in exchange for unbiased participation.
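To get a feel for how the bound grows with $\alpha$, the factor $\eta_{\alpha} = 1/(1 - \alpha\kappa)$ can be tabulated directly. In this sketch $\kappa$ is simply an input: the value $\kappa = 2/3$ is back-solved from the quoted $\eta_{0.5} \approx 1.5$ and is an assumption, not a measured Rayleigh statistic. The function name is hypothetical.

```python
def fairness_mse_factor(alpha, kappa):
    """Upper bound eta_alpha = 1 / (1 - alpha * kappa) on the MSE inflation."""
    denom = 1.0 - alpha * kappa
    if denom <= 0:
        raise ValueError("bound is vacuous when alpha * kappa >= 1")
    return 1.0 / denom

KAPPA = 2.0 / 3.0  # assumed value, chosen so that eta_0.5 = 1.5
table = {a: fairness_mse_factor(a, KAPPA) for a in (0.0, 0.25, 0.5, 0.75, 1.0)}
for a, factor in table.items():
    print(f"alpha = {a:.2f}  ->  eta_alpha = {factor:.2f}")
```

Under this assumed $\kappa$, the bound grows faster than linearly in $\alpha$, so the last increments of fairness are the most expensive in MSE terms.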

Fairness vs. MSE Trade-off

Vary the fairness parameter $\alpha$ and observe how the average per-round aggregation MSE increases as more weak users are forced into the schedule. Compare this to the unconstrained threshold-scheduling baseline ( $\alpha = 0$ ). The simulation draws Rayleigh-distributed channels.

Parameters: $n = 50$ users; $T = 100$ rounds; $\alpha_{\max} = 1$ (maximum fairness).

Theorem: Energy-Constrained Power Allocation

Given a per-user energy budget $E_k$ and $T$ rounds, the optimal power allocation across the rounds user $k$ participates in is water-filling on the channel gains: $P_k^{(t)\star} = \left[\frac{1}{\lambda_k} - \frac{1}{\gamma_k^{(t)}}\right]^+ \text{subject to } \sum_{t: k \in \mathcal{S}_t} P_k^{(t)} = E_k,$ where $\lambda_k > 0$ is the dual variable for the energy constraint.

Interpretation. User $k$ spends more power in high-channel-gain rounds, less (or zero) in bad rounds. Water-filling is the standard Lagrangian answer to a sum-concave objective with linear constraint.

Proof

Dual Lagrangian

Minimizing per-round MSE over $P_k^{(t)}$ (linear in $|h_k^{(t)}|^2 P_k^{(t)}$ ) with energy constraint $\sum_t P_k^{(t)} \leq E_k$ gives water-filling at its most textbook.

Closed form

Standard derivation; we skip the Lagrange-multiplier algebra. The result is the bracketed-positive form above.

Operational

Each user needs local knowledge of their channel history and the dual variable $\lambda_k$ (can be computed by sorting $\gamma_k^{(t)}$ across $t$ ). Total power budget is respected across rounds; individual rounds see wildly varying per-user powers.
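The sorting-based computation of $\lambda_k$ mentioned above can be sketched as follows. This is a minimal illustration that assumes user $k$ knows all of its gains $\gamma_k^{(t)}$ for the rounds it participates in up front (a non-causal simplification) and that the budget is fully spent; the function name `water_filling` and the sample gains are hypothetical.

```python
def water_filling(gains, energy):
    """Split `energy` across rounds with P_t = max(0, 1/lam - 1/gain_t).

    gains  : positive channel gains gamma_k^(t) for the rounds user k joins
    energy : total budget E_k (> 0)
    Returns (powers, lam), where lam is the dual variable lambda_k.
    """
    if not gains or energy <= 0 or any(g <= 0 for g in gains):
        raise ValueError("need at least one gain; energy and gains must be positive")
    inv = sorted(1.0 / g for g in gains)       # inverse gains, best rounds first
    for m in range(len(inv), 0, -1):           # try keeping the m best rounds
        level = (energy + sum(inv[:m])) / m    # water level 1/lam for m active rounds
        if level > inv[m - 1]:                 # weakest active round still gets P > 0
            break
    powers = [max(0.0, level - 1.0 / g) for g in gains]
    return powers, 1.0 / level

# Three rounds: good, average, and poor channel.
powers, lam = water_filling([2.0, 1.0, 0.25], energy=1.0)
print(powers, lam)
```

In this toy case the poor round receives no power at all and the budget is split between the two better rounds, matching the interpretation of the theorem.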

Joint Optimization — What's Tractable?

The full joint problem — scheduling, power allocation, learning rate, convergence — is non-convex. Practical decomposition:

Per-round decomposition. Assume average MSE targets; each round applies Theorem 17.3.1.
Per-user decomposition. Each user solves its energy water-filling (Theorem 17.3.2) independently of others, given a target MSE.
Meta-level: the target MSE is picked from the convergence analysis (§17.2).

The decomposition is provably suboptimal but provably tractable. Empirically, the loss from decomposition is typically $< 5\%$ of the optimal. Ongoing research closes the gap.

⚠️Engineering Note

Deploying Wireless-FL Scheduling

Production wireless-FL scheduling guidelines:

Estimate channel statistics offline. Before deployment, collect a few hours of channel measurements from each user. This drives the target MSE selection.
Choose $\alpha$ from heterogeneity. If user gradients are i.i.d. across devices (uniform datasets), low $\alpha$ is fine. If heterogeneous (different demographics per device), enforce higher $\alpha$ .
Pre-compute the threshold schedule. For a fixed target MSE, the threshold $\tau_t$ is predictable given channel statistics. Compute once, apply online.
Adaptive $\alpha$ scheduling. In non-stationary environments (e.g., moving devices), adapt $\alpha$ based on tracked channel variability.
Monitor convergence in real time. The FL loss (or its estimate) provides feedback on whether the current $(\tau_t, \alpha)$ is in the MSE-dominated regime. If yes, tighten $\tau_t$; if no, loosen to reduce power cost.
Integrate with other layers. MAC-layer scheduling interacts with physical-layer power control, which interacts with the FL learning rate. A hierarchical design (cross-layer control of FL over the wireless stack) is where deployments are heading.

Practical Constraints

• Offline channel statistics estimation before deployment
• $\alpha$ from data heterogeneity
• Adaptive $\alpha$ for non-stationary environments
• Cross-layer design: MAC + PHY + FL

📋 Ref: Yang et al. 2020; Chen et al. 2021

Common Mistake: CSI Acquisition Is Not Free

Mistake:

Assume perfect CSIT is available at zero cost — and design the FL system around ideal scheduling decisions.

Correction:

CSIT requires uplink pilots from each user in each round — adding a non-trivial overhead (typically 10-20% of round bandwidth). Poor CSIT degrades scheduling (wrong thresholds) and power control (misaligned $b_k h_k$ ). Budget for CSIT acquisition: either (i) pilot-based estimation at each round (overhead per round), or (ii) reciprocity-based estimation in TDD systems (lower overhead but requires TDD). Production FL should include the CSIT overhead in the total energy/bandwidth budget. Under-estimated CSIT cost inflates paper performance vs. reality.

Key Takeaway

Wireless-FL scheduling is a Pareto optimization: tight MSE (threshold scheduling) favors best-channel users; fair participation requires $\alpha$ -fairness at an MSE cost factor $\sim 1/(1 - \alpha\kappa)$ . Energy budgets reduce to per-user water-filling (Theorem 17.3.2). The full joint problem is non-convex; practical decomposition (per-round + per-user) is within $5\%$ of optimal. The golden thread — privacy, robustness, efficiency — reappears here as fairness vs. MSE vs. energy: three axes, no perfect corner.

Quick Check

A wireless-FL system enforces $\alpha = 0.5$ fairness (each user participates in at least half their proportional-share rounds). Relative to unconstrained threshold scheduling, the average per-round MSE is:

$2\times$ larger

$\sim 1.5\times$ larger (Rayleigh channels)

Equal to unconstrained

$10\times$ larger

Correction:

$\sim 1.5\times$ larger (Rayleigh channels)

By the Fairness-MSE Trade-off theorem, $\eta_{0.5} = 1/(1 - 0.5\kappa)$, which for typical Rayleigh fading gives $\sim 1.5\times$. This is a manageable convergence-rate cost for unbiased participation.
