Multi-Modal Fusion: Radar + Camera + LiDAR

Why Fuse Multiple Sensor Modalities?

No single sensor modality is sufficient for robust perception in all conditions. Cameras provide high spatial resolution and semantic understanding but fail in fog, darkness, and glare. LiDAR provides precise 3D geometry but degrades in rain and cannot measure velocity. Radar provides range and velocity in all weather conditions but has low angular resolution.

Multi-modal fusion creates a complementary representation that inherits the strengths of each modality while mitigating their individual weaknesses. This section develops the mathematical framework for fusion and connects it to the RF imaging model of this book.

Definition: Multi-Modal Sensor Fusion

Multi-modal fusion combines data from different sensor modalities to produce a unified scene representation. The key modalities for autonomous driving and environmental sensing:

Modality   Strengths                            Weaknesses
Radar      Range, velocity, all-weather         Low angular resolution
Camera     High resolution, colour, semantics   No depth, fails in fog/dark
LiDAR      Precise 3D point cloud               No velocity, degrades in rain

The fusion output is typically a set of 3D bounding boxes with class labels, velocities, and uncertainty estimates.

Radar provides the "backbone" of robust sensing (all-weather detection + velocity), while camera and LiDAR add high-resolution spatial detail.
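For concreteness, the fused output described above could be represented with a schema like the following; the field names and layout are an illustrative assumption, not a standard interface:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class FusedDetection:
    """One fused object hypothesis in the ego/BEV frame (illustrative schema)."""
    center_xyz: np.ndarray        # (3,) object centre in metres
    size_lwh: np.ndarray          # (3,) length, width, height in metres
    yaw: float                    # heading angle in radians
    velocity_xy: np.ndarray       # (2,) ground-plane velocity in m/s (largely from radar Doppler)
    class_label: str              # e.g. "pedestrian", "car"
    score: float                  # detection confidence in [0, 1]
    position_cov: Optional[np.ndarray] = None  # (3, 3) positional uncertainty, if available
```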

Definition: Fusion Architectures

Early fusion: Concatenate raw or minimally processed sensor data into a unified representation before processing:

\hat{\boldsymbol{\sigma}} = f_\theta\big([\mathbf{y}_{\text{radar}}, \mathbf{y}_{\text{camera}}, \mathbf{y}_{\text{LiDAR}}]\big)

Pros: network learns optimal feature extraction jointly. Cons: high-dimensional input; difficult to handle missing modalities.

Late fusion: Process each modality independently, then combine at the decision level:

\hat{\boldsymbol{\sigma}} = g\big(\hat{\boldsymbol{\sigma}}_{\text{radar}}, \hat{\boldsymbol{\sigma}}_{\text{camera}}, \hat{\boldsymbol{\sigma}}_{\text{LiDAR}}\big)

Pros: modular; robust to single-sensor failure. Cons: cannot learn cross-modal correlations.

Mid-level (feature) fusion: Extract features from each modality independently, then fuse in a shared feature space:

\mathbf{z} = h_\theta\big(\mathbf{z}_{\text{radar}}, \mathbf{z}_{\text{camera}}, \mathbf{z}_{\text{LiDAR}}\big)

This is the most common approach in practice, balancing flexibility and cross-modal interaction.

Mid-level fusion with transformer-based cross-attention between modalities is the current state of the art. The attention mechanism learns which spatial regions benefit from which modality.
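As a rough sketch of this idea, the snippet below (PyTorch; all dimensions and layer choices are illustrative assumptions, not a specific published architecture) lets every camera BEV cell attend to the radar BEV features on the same grid:

```python
import torch
import torch.nn as nn

class CrossModalBEVFusion(nn.Module):
    """Mid-level fusion: camera BEV features attend to radar BEV features (illustrative sketch)."""
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))

    def forward(self, cam_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, radar_bev: (B, C, H, W) feature maps on the same BEV grid
        B, C, H, W = cam_bev.shape
        q = cam_bev.flatten(2).transpose(1, 2)      # (B, H*W, C) queries from camera
        kv = radar_bev.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values from radar
        fused, _ = self.attn(q, kv, kv)             # each BEV cell decides how much radar to use
        fused = self.norm(q + fused)                # residual keeps the camera information
        fused = fused + self.ffn(fused)
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Usage sketch:
# fuse = CrossModalBEVFusion()
# out = fuse(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64))
```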


Definition: Bird's-Eye-View (BEV) Fusion

BEV fusion projects all sensor data into a common bird's-eye-view (top-down) coordinate frame before fusion. This avoids the perspective distortion of camera images and naturally handles the 3D nature of radar and LiDAR data.

The pipeline:

  1. Camera → BEV: Use a learned depth estimator to "lift" 2D image features into 3D, then project onto the BEV plane (LSS, BEVFormer).
  2. LiDAR → BEV: Voxelise the point cloud and project onto the BEV plane.
  3. Radar → BEV: Map radar detections (range, azimuth, velocity) directly onto the BEV plane.
  4. Fusion: Concatenate or cross-attend the BEV feature maps from all modalities; apply a detection head (e.g., CenterPoint).
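Step 3 is the simplest of the four; a minimal sketch (grid size, cell size, and channel layout are illustrative assumptions) rasterises radar detections onto a two-channel occupancy/Doppler BEV map:

```python
import numpy as np

def radar_to_bev(ranges, azimuths, velocities, grid_size=128, cell_m=0.5):
    """Rasterise radar detections (range [m], azimuth [rad], radial velocity [m/s])
    onto a BEV grid centred on the ego vehicle. Channels: occupancy, radial velocity."""
    bev = np.zeros((2, grid_size, grid_size), dtype=np.float32)
    half = grid_size // 2
    x = ranges * np.cos(azimuths)          # forward axis
    y = ranges * np.sin(azimuths)          # lateral axis
    col = (x / cell_m + half).astype(int)
    row = (y / cell_m + half).astype(int)
    valid = (col >= 0) & (col < grid_size) & (row >= 0) & (row < grid_size)
    bev[0, row[valid], col[valid]] = 1.0                 # occupancy channel
    bev[1, row[valid], col[valid]] = velocities[valid]   # Doppler channel
    return bev

# Example: three detections at 10 m, 25 m, 40 m
bev = radar_to_bev(np.array([10.0, 25.0, 40.0]),
                   np.deg2rad(np.array([-5.0, 0.0, 12.0])),
                   np.array([0.0, -3.2, 1.5]))
```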

Theorem: Information-Theoretic Bound on Fusion Gain

Let Y_1, Y_2 be measurements from two sensor modalities observing a scene parameter \Theta. The mutual information satisfies:

I(\Theta; Y_1, Y_2) = I(\Theta; Y_1) + I(\Theta; Y_2 \mid Y_1) \geq \max\{I(\Theta; Y_1), I(\Theta; Y_2)\},

with equality if and only if Y_2 is conditionally independent of \Theta given Y_1, i.e., Y_2 provides no information about the scene beyond what Y_1 already carries.

For the Fisher information matrix, the analogous result is:

\mathbf{J}(\Theta; Y_1, Y_2) = \mathbf{J}(\Theta; Y_1) + \mathbf{J}(\Theta; Y_2 \mid Y_1) \succeq \mathbf{J}(\Theta; Y_1),

meaning that the fused Fisher information dominates that of any single modality (in the positive-semidefinite matrix sense), so fusion can only tighten, never loosen, the Cramér-Rao bound on estimation variance.

Fusion can never hurt: additional measurements can only increase the information about the scene. The gain depends on the complementarity of the modalities: if they provide information about different aspects of the scene (radar: velocity; camera: texture), the gain is large.
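As a concrete scalar illustration (assuming two independent Gaussian measurements of the same quantity; the numbers below are illustrative), the Fisher informations simply add:

J_{\text{fused}} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \quad\Rightarrow\quad \text{CRB}_{\text{fused}} = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1} \leq \min\{\sigma_1^2, \sigma_2^2\}.

For instance, with \sigma_1 = 0.5 m (radar range) and \sigma_2 = 2 m (camera depth), the fused standard deviation is \sqrt{(4 + 0.25)^{-1}} \approx 0.49 m: a modest gain, because the camera adds little range information. In azimuth, where the roles reverse, the gain would be far larger.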

BEV Fusion: Detection Performance Comparison

Compare detection performance (average precision) for radar-only, camera-only, LiDAR-only, and fused perception across different weather conditions and SNR levels. Observe that (i) camera performance degrades sharply in fog and night, (ii) LiDAR degrades moderately in rain, (iii) radar is robust across conditions but has lower baseline performance, and (iv) fusion consistently outperforms all individual modalities.


Example: Radar-Camera Fusion for Pedestrian Detection

A vehicle has a 77 GHz radar (range resolution 0.1 m, angular resolution 5^\circ) and a monocular camera (1920 \times 1080). A pedestrian is at 30 m range, partially occluded by a parked car. How does mid-level fusion improve detection?
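A back-of-the-envelope comparison (assuming, for illustration, a \sim 60^\circ horizontal camera field of view) makes the complementarity concrete:

\Delta x_{\text{radar}} \approx R\,\Delta\theta = 30 \text{ m} \times 5^\circ \times \frac{\pi}{180^\circ} \approx 2.6 \text{ m}, \qquad \Delta x_{\text{camera}} \approx 30 \text{ m} \times \frac{60^\circ}{1920} \times \frac{\pi}{180^\circ} \approx 1.6 \text{ cm per pixel}.

Radar alone cannot separate the pedestrian from the adjacent parked car laterally, but it still supplies range and a walking-motion Doppler signature despite the partial occlusion; the camera resolves the visible part of the pedestrian to centimetre-level lateral accuracy but provides no depth and only a weakened appearance cue. Mid-level fusion lets the radar range/velocity feature raise the confidence of the weak camera feature in the corresponding BEV cell, recovering a detection that either sensor alone would likely miss.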

Definition: NeuRadar and Neural Scene Synthesis

NeuRadar is a neural rendering framework that jointly synthesises novel views for radar, camera, and LiDAR from a shared 3D scene representation. The key idea:

  1. Represent the scene as a neural field f_\theta(\mathbf{p}) that outputs density, colour, radar cross-section, and LiDAR reflectance at each 3D point.
  2. Render each modality through its own differentiable forward model:
    • Camera: Standard volume rendering (NeRF).
    • LiDAR: Ray casting with range and intensity rendering.
    • Radar: Coherent rendering with range-Doppler response.
  3. Train jointly on multi-modal data, with the shared 3D structure enforcing cross-modal consistency.

The unified representation enables cross-modal supervision: LiDAR ground truth can supervise radar geometry learning, and camera images can provide texture information for radar scene understanding.
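A minimal sketch of such a shared field is given below (an illustrative MLP, not the published NeuRadar architecture; positional encoding of the query point is omitted for brevity):

```python
import torch
import torch.nn as nn

class MultiModalField(nn.Module):
    """Shared scene field: 3D point -> density, colour, radar cross-section, LiDAR reflectance."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density = nn.Linear(hidden, 1)       # shared geometry, supervised by all modalities
        self.colour = nn.Linear(hidden, 3)        # camera rendering head
        self.rcs = nn.Linear(hidden, 1)           # radar cross-section head
        self.reflectance = nn.Linear(hidden, 1)   # LiDAR intensity head

    def forward(self, p: torch.Tensor):
        h = self.trunk(p)                          # p: (N, 3) query points
        return {
            "density": torch.relu(self.density(h)),
            "colour": torch.sigmoid(self.colour(h)),
            "rcs": self.rcs(h),
            "reflectance": torch.sigmoid(self.reflectance(h)),
        }
```

Because the density head is shared, gradients from any modality's rendering loss update the same geometry, which is what enforces the cross-modal consistency described above.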

🎓 CommIT Contribution (2025)

Multi-Sensor Data Fusion for RF Imaging

J. Gao, G. Caire, TU Berlin Technical Report (CommIT Group)

The CommIT group developed a multi-sensor fusion framework for RF imaging that borrows from Junyuan Gao's data fusion approach. The key contributions:

  1. Per-sensor back-projection + learned fusion: Each sensor (radio unit) independently produces a back-projection image \hat{\mathbf{c}}^{(i)}_{\text{BP}} = \mathbf{A}^{(i),H}\,\mathbf{D}^{-1}\,\mathbf{y}^{(i)}. A learned fusion network combines these per-sensor images into a final reconstruction.

  2. Robustness to phase incoherence: When radio units have large/arbitrary phase errors between them, separate estimation + fusion is more robust than joint processing. This is directly relevant to distributed ISAC networks (Chapter 37).
  3. 2D system model: The framework uses a 2D forward model with multi-sensor data, capturing non-isotropic scattering from different viewing angles.
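A toy sketch of contribution 1 is shown below, under simplifying assumptions: the forward operator \mathbf{A}^{(i)} and noise covariance \mathbf{D} are given as dense matrices, and the fusion network is a generic small CNN standing in for the group's learned fusion model.

```python
import torch
import torch.nn as nn

def back_projection(A: torch.Tensor, D: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-sensor back-projection c_BP^(i) = A^(i),H D^{-1} y^(i) for one radio unit."""
    return A.conj().T @ torch.linalg.solve(D, y)

class BPFusionNet(nn.Module):
    """Learned fusion of per-sensor back-projection images (illustrative stand-in)."""
    def __init__(self, num_sensors: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_sensors, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, bp_images: torch.Tensor) -> torch.Tensor:
        # bp_images: (B, num_sensors, H, W) stacked magnitude images |c_BP^(i)|,
        # each reshaped onto the common 2D scene grid before stacking
        return self.net(bp_images)
```

Because each back-projection is computed per sensor, an arbitrary phase offset between radio units only affects the (discarded) global phase of each image, which is why this pipeline tolerates the incoherence described in contribution 2.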

As Caire noted: "this may also be the basis for the training of the AI models for fusion... and in the future for regularised inverses."


Quick Check

Which fusion architecture is best suited when one sensor modality (e.g., camera) may be completely unavailable (e.g., total darkness)?

  • Early fusion
  • Mid-level fusion with cross-attention
  • Late fusion
  • No fusion needed

Common Mistake: Sensor Calibration Drift in Multi-Modal Systems

Mistake:

Assuming that the spatial registration (extrinsic calibration) between radar, camera, and LiDAR remains constant after initial calibration, and not monitoring for calibration drift in deployed systems.

Correction:

Vehicle vibrations, temperature changes, and minor collisions cause the relative poses between sensors to drift over time. A 1^\circ angular misalignment between radar and camera at 30 m range translates to \sim 0.5 m positional error (\sim 30 pixels in the image). This can cause the fusion network to associate radar detections with the wrong image regions.

Remedies: (i) Online calibration refinement using natural correspondences. (ii) Soft association: let the network learn to handle small misalignments via attention over a spatial neighbourhood. (iii) Regular recalibration schedules.

⚠️ Engineering Note

Latency Constraints in Real-Time Fusion

Autonomous driving requires end-to-end perception latency < 100 ms. Sensor modalities have different data rates and processing times:

  • Camera: 30 fps, CNN inference \sim 20 ms (GPU).
  • LiDAR: 10--20 fps, point cloud processing \sim 30 ms.
  • Radar: 10--20 fps, signal processing \sim 5 ms.

The fusion system must synchronise these asynchronous streams (temporal alignment) and produce fused detections within the latency budget. Early fusion is typically slowest (largest input); late fusion is fastest (independent processing, simple combination). Mid-level fusion with efficient cross-attention (e.g., deformable attention) achieves a practical balance.
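A minimal sketch of the temporal-alignment step (nearest-timestamp matching within a tolerance; the 10 ms tolerance and frame rates below are illustrative):

```python
from bisect import bisect_left

def align_to_reference(ref_stamps, other_stamps, tol_s=0.010):
    """For each reference timestamp (e.g. radar frames), find the index of the nearest
    timestamp in another stream (e.g. camera frames) if it lies within tol_s seconds."""
    matches = []
    for t in ref_stamps:
        i = bisect_left(other_stamps, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_stamps)]
        best = min(candidates, key=lambda j: abs(other_stamps[j] - t), default=None)
        if best is not None and abs(other_stamps[best] - t) <= tol_s:
            matches.append((t, best))
        else:
            matches.append((t, None))   # no frame close enough: fall back to single-modality output
    return matches

# Example: 20 Hz radar aligned against a 30 Hz camera
radar_t  = [k / 20 for k in range(5)]
camera_t = [k / 30 for k in range(8)]
print(align_to_reference(radar_t, camera_t))
```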

Practical Constraints
  • End-to-end latency < 100 ms for L3+ autonomous driving
  • Temporal synchronisation across modalities within 10 ms

BEV (Bird's-Eye-View)

A top-down representation of the scene in a 2D grid aligned with the ground plane. Commonly used as the shared coordinate frame for multi-modal sensor fusion in autonomous driving.

Related: Bird's-Eye-View (BEV) Fusion

Cross-Modal Supervision

Using ground truth or predictions from one sensor modality (e.g., LiDAR depth maps) to supervise the training of a network that processes a different modality (e.g., radar).

Related: NeuRadar and Neural Scene Synthesis

Key Takeaway

Multi-modal fusion combines the complementary strengths of radar (velocity, all-weather), camera (resolution, semantics), and LiDAR (3D geometry). Mid-level fusion with BEV representations and cross-attention is the current best practice. The CommIT group's per-sensor back-projection + learned fusion approach is particularly well-suited to distributed RF imaging, where phase incoherence between nodes makes joint processing fragile.