Multi-Modal Fusion: Radar + Camera + LiDAR

Why Fuse Multiple Sensor Modalities?

No single sensor modality is sufficient for robust perception in all conditions. Cameras provide high spatial resolution and semantic understanding but fail in fog, darkness, and glare. LiDAR provides precise 3D geometry but degrades in rain and cannot measure velocity. Radar provides range and velocity in all weather conditions but has low angular resolution.

Multi-modal fusion creates a complementary representation that inherits the strengths of each modality while mitigating their individual weaknesses. This section develops the mathematical framework for fusion and connects it to the RF imaging model of this book.

Definition: Multi-Modal Sensor Fusion

Multi-modal fusion combines data from different sensor modalities to produce a unified scene representation. The key modalities for autonomous driving and environmental sensing:

Modality   Strengths                            Weaknesses
Radar      Range, velocity, all-weather         Low angular resolution
Camera     High resolution, colour, semantics   No depth, fails in fog/dark
LiDAR      Precise 3D point cloud               No velocity, degrades in rain

The fusion output is typically a set of 3D bounding boxes with class labels, velocities, and uncertainty estimates.

Radar provides the "backbone" of robust sensing (all-weather detection + velocity), while camera and LiDAR add high-resolution spatial detail.
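For concreteness, the fused output described above could be represented with a schema like the following; the field names and layout are an illustrative assumption, not a standard interface:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class FusedDetection:
    """One fused object hypothesis in the ego/BEV frame (illustrative schema)."""
    center_xyz: np.ndarray        # (3,) object centre in metres
    size_lwh: np.ndarray          # (3,) length, width, height in metres
    yaw: float                    # heading angle in radians
    velocity_xy: np.ndarray       # (2,) ground-plane velocity in m/s (largely from radar Doppler)
    class_label: str              # e.g. "pedestrian", "car"
    score: float                  # detection confidence in [0, 1]
    position_cov: Optional[np.ndarray] = None  # (3, 3) positional uncertainty, if available
```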

Definition: Fusion Architectures

Early fusion: Concatenate raw or minimally processed sensor data into a unified representation before processing:

\hat{\boldsymbol{\sigma}} = f_\theta\big([\mathbf{y}_{\text{radar}}, \mathbf{y}_{\text{camera}}, \mathbf{y}_{\text{LiDAR}}]\big)

Pros: network learns optimal feature extraction jointly. Cons: high-dimensional input; difficult to handle missing modalities.

Late fusion: Process each modality independently, then combine at the decision level:

\hat{\boldsymbol{\sigma}} = g\big(\hat{\boldsymbol{\sigma}}_{\text{radar}}, \hat{\boldsymbol{\sigma}}_{\text{camera}}, \hat{\boldsymbol{\sigma}}_{\text{LiDAR}}\big)

Pros: modular; robust to single-sensor failure. Cons: cannot learn cross-modal correlations.

Mid-level (feature) fusion: Extract features from each modality independently, then fuse in a shared feature space:

\mathbf{z} = h_\theta\big(\mathbf{z}_{\text{radar}}, \mathbf{z}_{\text{camera}}, \mathbf{z}_{\text{LiDAR}}\big)

This is the most common approach in practice, balancing flexibility and cross-modal interaction.

Mid-level fusion with transformer-based cross-attention between modalities is the current state of the art. The attention mechanism learns which spatial regions benefit from which modality.
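As a rough sketch of this idea, the snippet below (PyTorch; all dimensions and layer choices are illustrative assumptions, not a specific published architecture) lets every camera BEV cell attend to the radar BEV features on the same grid:

```python
import torch
import torch.nn as nn

class CrossModalBEVFusion(nn.Module):
    """Mid-level fusion: camera BEV features attend to radar BEV features (illustrative sketch)."""
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim))

    def forward(self, cam_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, radar_bev: (B, C, H, W) feature maps on the same BEV grid
        B, C, H, W = cam_bev.shape
        q = cam_bev.flatten(2).transpose(1, 2)      # (B, H*W, C) queries from camera
        kv = radar_bev.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values from radar
        fused, _ = self.attn(q, kv, kv)             # each BEV cell decides how much radar to use
        fused = self.norm(q + fused)                # residual keeps the camera information
        fused = fused + self.ffn(fused)
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Usage sketch:
# fuse = CrossModalBEVFusion()
# out = fuse(torch.randn(1, 128, 64, 64), torch.randn(1, 128, 64, 64))
```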


Definition: Bird's-Eye-View (BEV) Fusion

BEV fusion projects all sensor data into a common bird's-eye-view (top-down) coordinate frame before fusion. This avoids the perspective distortion of camera images and naturally handles the 3D nature of radar and LiDAR data.

The pipeline:

  1. Camera → BEV: Use a learned depth estimator to "lift" 2D image features into 3D, then project onto the BEV plane (LSS, BEVFormer).
  2. LiDAR → BEV: Voxelise the point cloud and project onto the BEV plane.
  3. Radar → BEV: Map radar detections (range, azimuth, velocity) directly onto the BEV plane.
  4. Fusion: Concatenate or cross-attend the BEV feature maps from all modalities; apply a detection head (e.g., CenterPoint).
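Step 3 is the simplest of the four; a minimal sketch (grid size, cell size, and channel layout are illustrative assumptions) rasterises radar detections onto a two-channel occupancy/Doppler BEV map:

```python
import numpy as np

def radar_to_bev(ranges, azimuths, velocities, grid_size=128, cell_m=0.5):
    """Rasterise radar detections (range [m], azimuth [rad], radial velocity [m/s])
    onto a BEV grid centred on the ego vehicle. Channels: occupancy, radial velocity."""
    bev = np.zeros((2, grid_size, grid_size), dtype=np.float32)
    half = grid_size // 2
    x = ranges * np.cos(azimuths)          # forward axis
    y = ranges * np.sin(azimuths)          # lateral axis
    col = (x / cell_m + half).astype(int)
    row = (y / cell_m + half).astype(int)
    valid = (col >= 0) & (col < grid_size) & (row >= 0) & (row < grid_size)
    bev[0, row[valid], col[valid]] = 1.0                 # occupancy channel
    bev[1, row[valid], col[valid]] = velocities[valid]   # Doppler channel
    return bev

# Example: three detections at 10 m, 25 m, 40 m
bev = radar_to_bev(np.array([10.0, 25.0, 40.0]),
                   np.deg2rad(np.array([-5.0, 0.0, 12.0])),
                   np.array([0.0, -3.2, 1.5]))
```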

Theorem: Information-Theoretic Bound on Fusion Gain

Let Y_1, Y_2 be measurements from two sensor modalities observing a scene parameter \Theta. The mutual information satisfies:

I(\Theta; Y_1, Y_2) = I(\Theta; Y_1) + I(\Theta; Y_2 \mid Y_1) \geq \max\{I(\Theta; Y_1), I(\Theta; Y_2)\},

with equality if and only if Y_2 is conditionally independent of \Theta given Y_1, i.e., Y_2 provides no information about the scene beyond what Y_1 already carries.

For the Fisher information matrix, the analogous result is:

\mathbf{J}(\Theta; Y_1, Y_2) = \mathbf{J}(\Theta; Y_1) + \mathbf{J}(\Theta; Y_2 \mid Y_1) \succeq \mathbf{J}(\Theta; Y_1),

meaning that the fused Fisher information dominates that of any single modality (in the positive-semidefinite matrix sense), so fusion can only tighten, never loosen, the Cramér-Rao bound on estimation variance.

Fusion can never hurt: additional measurements can only increase the information about the scene. The gain depends on the complementarity of the modalities: if they provide information about different aspects of the scene (radar: velocity; camera: texture), the gain is large.
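As a concrete scalar illustration (assuming two independent Gaussian measurements of the same quantity; the numbers below are illustrative), the Fisher informations simply add:

J_{\text{fused}} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \quad\Rightarrow\quad \text{CRB}_{\text{fused}} = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1} \leq \min\{\sigma_1^2, \sigma_2^2\}.

For instance, with \sigma_1 = 0.5 m (radar range) and \sigma_2 = 2 m (camera depth), the fused standard deviation is \sqrt{(4 + 0.25)^{-1}} \approx 0.49 m: a modest gain, because the camera adds little range information. In azimuth, where the roles reverse, the gain would be far larger.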

BEV Fusion: Detection Performance Comparison

Compare detection performance (average precision) for radar-only, camera-only, LiDAR-only, and fused perception across different weather conditions and SNR levels. Observe that (i) camera performance degrades sharply in fog and night, (ii) LiDAR degrades moderately in rain, (iii) radar is robust across conditions but has lower baseline performance, and (iv) fusion consistently outperforms all individual modalities.


Example: Radar-Camera Fusion for Pedestrian Detection

A vehicle has a 77 GHz radar (range resolution 0.1 m, angular resolution 5^\circ) and a monocular camera (1920 \times 1080). A pedestrian is at 30 m range, partially occluded by a parked car. How does mid-level fusion improve detection?
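A back-of-the-envelope comparison (assuming, for illustration, a \sim 60^\circ horizontal camera field of view) makes the complementarity concrete:

\Delta x_{\text{radar}} \approx R\,\Delta\theta = 30 \text{ m} \times 5^\circ \times \frac{\pi}{180^\circ} \approx 2.6 \text{ m}, \qquad \Delta x_{\text{camera}} \approx 30 \text{ m} \times \frac{60^\circ}{1920} \times \frac{\pi}{180^\circ} \approx 1.6 \text{ cm per pixel}.

Radar alone cannot separate the pedestrian from the adjacent parked car laterally, but it still supplies range and a walking-motion Doppler signature despite the partial occlusion; the camera resolves the visible part of the pedestrian to centimetre-level lateral accuracy but provides no depth and only a weakened appearance cue. Mid-level fusion lets the radar range/velocity feature raise the confidence of the weak camera feature in the corresponding BEV cell, recovering a detection that either sensor alone would likely miss.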

Definition: NeuRadar and Neural Scene Synthesis

NeuRadar is a neural rendering framework that jointly synthesises novel views for radar, camera, and LiDAR from a shared 3D scene representation. The key idea:

  1. Represent the scene as a neural field f_\theta(\mathbf{p}) that outputs density, colour, radar cross-section, and LiDAR reflectance at each 3D point.
  2. Render each modality through its own differentiable forward model:
    • Camera: Standard volume rendering (NeRF).
    • LiDAR: Ray casting with range and intensity rendering.
    • Radar: Coherent rendering with range-Doppler response.
  3. Train jointly on multi-modal data, with the shared 3D structure enforcing cross-modal consistency.

The unified representation enables cross-modal supervision: LiDAR ground truth can supervise radar geometry learning, and camera images can provide texture information for radar scene understanding.
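A minimal sketch of such a shared field is given below (an illustrative MLP, not the published NeuRadar architecture; positional encoding of the query point is omitted for brevity):

```python
import torch
import torch.nn as nn

class MultiModalField(nn.Module):
    """Shared scene field: 3D point -> density, colour, radar cross-section, LiDAR reflectance."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density = nn.Linear(hidden, 1)       # shared geometry, supervised by all modalities
        self.colour = nn.Linear(hidden, 3)        # camera rendering head
        self.rcs = nn.Linear(hidden, 1)           # radar cross-section head
        self.reflectance = nn.Linear(hidden, 1)   # LiDAR intensity head

    def forward(self, p: torch.Tensor):
        h = self.trunk(p)                          # p: (N, 3) query points
        return {
            "density": torch.relu(self.density(h)),
            "colour": torch.sigmoid(self.colour(h)),
            "rcs": self.rcs(h),
            "reflectance": torch.sigmoid(self.reflectance(h)),
        }
```

Because the density head is shared, gradients from any modality's rendering loss update the same geometry, which is what enforces the cross-modal consistency described above.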

🎓 CommIT Contribution (2025)

Multi-Sensor Data Fusion for RF Imaging

J. Gao, G. Caire, TU Berlin Technical Report (CommIT Group)

The CommIT group developed a multi-sensor fusion framework for RF imaging that borrows from Junyuan Gao's data fusion approach. The key contributions:

  1. Per-sensor back-projection + learned fusion: Each sensor (radio unit) independently produces a back-projection image \hat{\mathbf{c}}^{(i)}_{\text{BP}} = \mathbf{A}^{(i),H}\,\mathbf{D}^{-1}\,\mathbf{y}^{(i)}. A learned fusion network combines these per-sensor images into a final reconstruction.

  2. Robustness to phase incoherence: When radio units have large/arbitrary phase errors between them, separate estimation + fusion is more robust than joint processing. This is directly relevant to distributed ISAC networks (Chapter 37).
  3. 2D system model: The framework uses a 2D forward model with multi-sensor data, capturing non-isotropic scattering from different viewing angles.
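A toy sketch of contribution 1 is shown below, under simplifying assumptions: the forward operator \mathbf{A}^{(i)} and noise covariance \mathbf{D} are given as dense matrices, and the fusion network is a generic small CNN standing in for the group's learned fusion model.

```python
import torch
import torch.nn as nn

def back_projection(A: torch.Tensor, D: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-sensor back-projection c_BP^(i) = A^(i),H D^{-1} y^(i) for one radio unit."""
    return A.conj().T @ torch.linalg.solve(D, y)

class BPFusionNet(nn.Module):
    """Learned fusion of per-sensor back-projection images (illustrative stand-in)."""
    def __init__(self, num_sensors: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_sensors, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, bp_images: torch.Tensor) -> torch.Tensor:
        # bp_images: (B, num_sensors, H, W) stacked magnitude images |c_BP^(i)|,
        # each reshaped onto the common 2D scene grid before stacking
        return self.net(bp_images)
```

Because each back-projection is computed per sensor, an arbitrary phase offset between radio units only affects the (discarded) global phase of each image, which is why this pipeline tolerates the incoherence described in contribution 2.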

As Caire noted: "this may also be the basis for the training of the AI models for fusion... and in the future for regularised inverses."


Quick Check

Which fusion architecture is best suited when one sensor modality (e.g., camera) may be completely unavailable (e.g., total darkness)?

  • Early fusion
  • Mid-level fusion with cross-attention
  • Late fusion
  • No fusion needed

Common Mistake: Sensor Calibration Drift in Multi-Modal Systems

Mistake:

Assuming that the spatial registration (extrinsic calibration) between radar, camera, and LiDAR remains constant after initial calibration, and not monitoring for calibration drift in deployed systems.

Correction:

Vehicle vibrations, temperature changes, and minor collisions cause the relative poses between sensors to drift over time. A 1^\circ angular misalignment between radar and camera at 30 m range translates to \sim 0.5 m positional error (\sim 30 pixels in the image). This can cause the fusion network to associate radar detections with the wrong image regions.

Remedies: (i) Online calibration refinement using natural correspondences. (ii) Soft association: let the network learn to handle small misalignments via attention over a spatial neighbourhood. (iii) Regular recalibration schedules.

⚠️ Engineering Note

Latency Constraints in Real-Time Fusion

Autonomous driving requires end-to-end perception latency < 100 ms. Sensor modalities have different data rates and processing times:

  • Camera: 30 fps, CNN inference \sim 20 ms (GPU).
  • LiDAR: 10--20 fps, point cloud processing \sim 30 ms.
  • Radar: 10--20 fps, signal processing \sim 5 ms.

The fusion system must synchronise these asynchronous streams (temporal alignment) and produce fused detections within the latency budget. Early fusion is typically slowest (largest input); late fusion is fastest (independent processing, simple combination). Mid-level fusion with efficient cross-attention (e.g., deformable attention) achieves a practical balance.
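A minimal sketch of the temporal-alignment step (nearest-timestamp matching within a tolerance; the 10 ms tolerance and frame rates below are illustrative):

```python
from bisect import bisect_left

def align_to_reference(ref_stamps, other_stamps, tol_s=0.010):
    """For each reference timestamp (e.g. radar frames), find the index of the nearest
    timestamp in another stream (e.g. camera frames) if it lies within tol_s seconds."""
    matches = []
    for t in ref_stamps:
        i = bisect_left(other_stamps, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_stamps)]
        best = min(candidates, key=lambda j: abs(other_stamps[j] - t), default=None)
        if best is not None and abs(other_stamps[best] - t) <= tol_s:
            matches.append((t, best))
        else:
            matches.append((t, None))   # no frame close enough: fall back to single-modality output
    return matches

# Example: 20 Hz radar aligned against a 30 Hz camera
radar_t  = [k / 20 for k in range(5)]
camera_t = [k / 30 for k in range(8)]
print(align_to_reference(radar_t, camera_t))
```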

Practical Constraints
  • End-to-end latency < 100 ms for L3+ autonomous driving
  • Temporal synchronisation across modalities within 10 ms

BEV (Bird's-Eye-View)

A top-down representation of the scene in a 2D grid aligned with the ground plane. Commonly used as the shared coordinate frame for multi-modal sensor fusion in autonomous driving.

Related: Bird's-Eye-View (BEV) Fusion

Cross-Modal Supervision

Using ground truth or predictions from one sensor modality (e.g., LiDAR depth maps) to supervise the training of a network that processes a different modality (e.g., radar).

Related: NeuRadar and Neural Scene Synthesis

Key Takeaway

Multi-modal fusion combines the complementary strengths of radar (velocity, all-weather), camera (resolution, semantics), and LiDAR (3D geometry). Mid-level fusion with BEV representations and cross-attention is the current best practice. The CommIT group's per-sensor back-projection + learned fusion approach is particularly well-suited to distributed RF imaging, where phase incoherence between nodes makes joint processing fragile.