Multi-Modal Fusion: Radar + Camera + LiDAR
Why Fuse Multiple Sensor Modalities?
No single sensor modality is sufficient for robust perception in all conditions. Cameras provide high spatial resolution and semantic understanding but fail in fog, darkness, and glare. LiDAR provides precise 3D geometry but degrades in rain and cannot measure velocity. Radar provides range and velocity in all weather conditions but has low angular resolution.
Multi-modal fusion creates a complementary representation that inherits the strengths of each modality while mitigating their individual weaknesses. This section develops the mathematical framework for fusion and connects it to the RF imaging model of this book.
Definition: Multi-Modal Sensor Fusion
Multi-modal fusion combines data from different sensor modalities to produce a unified scene representation. The key modalities for autonomous driving and environmental sensing:
| Modality | Strengths | Weaknesses |
|---|---|---|
| Radar | Range, velocity, all-weather | Low angular resolution |
| Camera | High resolution, colour, semantics | No depth, fails in fog/dark |
| LiDAR | Precise 3D point cloud | No velocity, degrades in rain |
The fusion output is typically a set of 3D bounding boxes with class labels, velocities, and uncertainty estimates.
Radar provides the "backbone" of robust sensing (all-weather detection + velocity), while camera and LiDAR add high-resolution spatial detail.
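As a concrete illustration of this output format, a single fused detection could be carried in a small data structure like the sketch below; the field names and types are illustrative, not a standard interface.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FusedDetection:
    """One fused object hypothesis (field names are illustrative, not a standard API)."""
    center_xyz: np.ndarray   # 3D box centre in the ego/BEV frame [m]
    size_lwh: np.ndarray     # box length, width, height [m]
    yaw: float               # heading in the BEV plane [rad]
    velocity_xy: np.ndarray  # planar velocity, largely contributed by radar Doppler [m/s]
    class_label: str         # e.g. "pedestrian", "car"
    score: float             # detection confidence in [0, 1]
    position_cov: np.ndarray = field(default_factory=lambda: np.eye(3))  # positional uncertainty [m^2]

# Example: a walking pedestrian roughly 30 m ahead of the ego vehicle
ped = FusedDetection(center_xyz=np.array([30.0, 0.0, 0.9]),
                     size_lwh=np.array([0.6, 0.6, 1.8]),
                     yaw=0.0,
                     velocity_xy=np.array([0.0, 1.5]),
                     class_label="pedestrian",
                     score=0.9)
```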
Definition: Fusion Architectures
Early fusion: Concatenate raw or minimally processed sensor data into a unified representation before processing:
$$\hat{y} = f_\theta\big([\mathbf{x}_{\text{radar}},\, \mathbf{x}_{\text{cam}},\, \mathbf{x}_{\text{lidar}}]\big)$$
Pros: network learns optimal feature extraction jointly. Cons: high-dimensional input; difficult to handle missing modalities.
Late fusion: Process each modality independently, then combine at the decision level:
$$\hat{y} = g\big(f_{\text{radar}}(\mathbf{x}_{\text{radar}}),\, f_{\text{cam}}(\mathbf{x}_{\text{cam}}),\, f_{\text{lidar}}(\mathbf{x}_{\text{lidar}})\big)$$
Pros: modular; robust to single-sensor failure. Cons: cannot learn cross-modal correlations.
Mid-level (feature) fusion: Extract features from each modality independently, then fuse in a shared feature space:
$$\mathbf{z} = \operatorname{fuse}\big(\phi_{\text{radar}}(\mathbf{x}_{\text{radar}}),\, \phi_{\text{cam}}(\mathbf{x}_{\text{cam}}),\, \phi_{\text{lidar}}(\mathbf{x}_{\text{lidar}})\big), \qquad \hat{y} = h(\mathbf{z})$$
This is the most common approach in practice, balancing flexibility and cross-modal interaction.
Mid-level fusion with transformer-based cross-attention between modalities is the current state of the art. The attention mechanism learns which spatial regions benefit from which modality.
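A minimal PyTorch sketch of this idea is given below. It assumes both modalities have already been embedded into BEV feature maps of the same shape (the module name, channel count, and grid size are illustrative); practical systems replace the full attention with more efficient variants such as deformable attention on large grids.

```python
import torch
import torch.nn as nn

class CrossModalBEVAttention(nn.Module):
    """Mid-level fusion sketch: camera BEV features attend to radar BEV features.
    Assumes both modalities are already embedded to C channels on the same HxW BEV grid."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, cam_bev: torch.Tensor, radar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, radar_bev: (B, C, H, W) BEV feature maps
        B, C, H, W = cam_bev.shape
        q = cam_bev.flatten(2).transpose(1, 2)     # (B, H*W, C) queries from the camera branch
        kv = radar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from the radar branch
        fused, _ = self.attn(q, kv, kv)            # each camera cell decides how much radar to use
        fused = self.norm(q + fused)               # residual connection keeps the camera features
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Toy usage on a small 32x32 BEV grid with 64-channel features
fusion = CrossModalBEVAttention(channels=64)
out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))  # (1, 64, 32, 32)
```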
Definition: Bird's-Eye-View (BEV) Fusion
BEV fusion projects all sensor data into a common bird's-eye-view (top-down) coordinate frame before fusion. This avoids the perspective distortion of camera images and naturally handles the 3D nature of radar and LiDAR data.
The pipeline:
- Camera BEV: Use a learned depth estimator to "lift" 2D image features into 3D, then project onto the BEV plane (LSS, BEVFormer).
- LiDAR BEV: Voxelise the point cloud and project onto the BEV plane.
- Radar BEV: Map radar detections (range, azimuth, velocity) directly onto the BEV plane (a minimal sketch follows this list).
- Fusion: Concatenate or cross-attend the BEV feature maps from all modalities; apply a detection head (e.g., CenterPoint).
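Of these steps, the radar-to-BEV mapping is the most direct. The sketch below assumes the radar detections arrive as (range, azimuth, radial velocity) tuples in the ego frame; the grid size and cell spacing are illustrative choices.

```python
import numpy as np

def radar_to_bev(detections, grid_size=128, cell_m=0.5):
    """Scatter radar detections, given as (range [m], azimuth [rad], radial velocity [m/s]),
    into a 2-channel BEV grid centred on the ego vehicle (channel 0: occupancy, 1: velocity)."""
    bev = np.zeros((2, grid_size, grid_size), dtype=np.float32)
    half = grid_size // 2
    for rng, az, vr in detections:
        x = rng * np.cos(az)            # forward distance
        y = rng * np.sin(az)            # lateral offset (left positive)
        i = int(half - x / cell_m)      # row: forward maps to the top of the grid
        j = int(half + y / cell_m)      # column: left/right offset
        if 0 <= i < grid_size and 0 <= j < grid_size:
            bev[0, i, j] = 1.0          # occupancy feature
            bev[1, i, j] = vr           # radial-velocity feature
    return bev

# Example: a single detection 30 m ahead, 5 degrees to the left, closing at 1.5 m/s
bev_map = radar_to_bev([(30.0, np.deg2rad(5.0), 1.5)])
```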
Theorem: Information-Theoretic Bound on Fusion Gain
Let $Y_1$ and $Y_2$ be measurements from two sensor modalities observing a scene parameter $X$. The mutual information satisfies
$$I(X;\, Y_1, Y_2) \;\ge\; \max\big\{\, I(X; Y_1),\; I(X; Y_2) \,\big\},$$
with equality in the first comparison if and only if $Y_2$ is conditionally independent of $X$ given $Y_1$ (i.e., $Y_2$ provides no additional information about the scene).
For the Fisher information matrix, the analogous result is
$$\mathbf{J}_{Y_1, Y_2}(\boldsymbol{\theta}) \;\succeq\; \mathbf{J}_{Y_k}(\boldsymbol{\theta}), \qquad k \in \{1, 2\},$$
meaning that fusion never increases the achievable estimation variance (in the matrix sense) compared to any single modality.
Fusion can never hurt: additional measurements can only increase the information about the scene. The gain depends on the complementarity of the modalities: if they provide information about different aspects of the scene (radar: velocity, camera: texture), the gain is large.
Chain rule for mutual information
By the chain rule, $I(X; Y_1, Y_2) = I(X; Y_1) + I(X; Y_2 \mid Y_1)$. Since mutual information is non-negative, $I(X; Y_2 \mid Y_1) \ge 0$, giving $I(X; Y_1, Y_2) \ge I(X; Y_1)$. By symmetry, $I(X; Y_1, Y_2) \ge I(X; Y_2)$.
Fisher information additivity
For observations that are conditionally independent given the scene parameter, i.e. $p(y_1, y_2 \mid \boldsymbol{\theta}) = p(y_1 \mid \boldsymbol{\theta})\, p(y_2 \mid \boldsymbol{\theta})$, the FIM is additive: $\mathbf{J}_{Y_1, Y_2}(\boldsymbol{\theta}) = \mathbf{J}_{Y_1}(\boldsymbol{\theta}) + \mathbf{J}_{Y_2}(\boldsymbol{\theta}) \succeq \mathbf{J}_{Y_k}(\boldsymbol{\theta})$. More generally, the conditional Fisher information of $Y_2$ given $Y_1$ is positive semidefinite, which yields the matrix inequality.
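A quick numerical check of the additivity argument in the scalar case, using two independent Gaussian measurements of the same parameter (the noise levels are illustrative, loosely standing in for a radar and a LiDAR range measurement):

```python
import numpy as np

# Two sensors observe the same scalar scene parameter with independent Gaussian noise.
sigma1, sigma2 = 2.0, 0.8            # illustrative noise standard deviations

J1 = 1.0 / sigma1**2                 # Fisher information of sensor 1 alone
J2 = 1.0 / sigma2**2                 # Fisher information of sensor 2 alone
J_fused = J1 + J2                    # additivity for conditionally independent measurements

crb1, crb2, crb_fused = 1 / J1, 1 / J2, 1 / J_fused   # Cramer-Rao bounds on the variance
print(f"CRB sensor 1: {crb1:.3f}, sensor 2: {crb2:.3f}, fused: {crb_fused:.3f}")

# The fused bound never exceeds either single-sensor bound, and it equals the
# variance of the inverse-variance-weighted (maximum-likelihood) combination.
assert crb_fused <= min(crb1, crb2)
assert np.isclose(crb_fused, sigma1**2 * sigma2**2 / (sigma1**2 + sigma2**2))
```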
BEV Fusion: Detection Performance Comparison
Compare detection performance (average precision) for radar-only, camera-only, LiDAR-only, and fused perception across different weather conditions and SNR levels. Observe that (i) camera performance degrades sharply in fog and night, (ii) LiDAR degrades moderately in rain, (iii) radar is robust across conditions but has lower baseline performance, and (iv) fusion consistently outperforms all individual modalities.
Example: Radar-Camera Fusion for Pedestrian Detection
A vehicle carries a 77 GHz radar (range resolution 0.1 m, comparatively coarse angular resolution) and a monocular camera. A pedestrian is at 30 m range, partially occluded by a parked car. How does mid-level fusion improve detection?
Radar alone
The radar detects a target at 30 m with radial velocity of about 1.5 m/s (walking speed). At 30 m, the radar's coarse angular resolution gives a cross-range cell wide enough that the pedestrian and the parked car fall in the same angular bin, making them difficult to separate.
Camera alone
The camera sees the visible portion of the pedestrian, but the rest is hidden behind the parked car. Detection confidence is low because of the partial occlusion, and the camera provides no depth or velocity information.
Mid-level fusion
- Radar features: Range (30 m) and velocity (1.5 m/s) confirm a moving target distinct from the stationary car.
- Camera features: The radar detection is projected onto the image plane using the camera calibration $\mathbf{K}[\mathbf{R} \mid \mathbf{t}]$ (see the sketch after this list), and the region of interest around the projection is fed to the classifier.
- Cross-attention: The network attends to radar velocity features (1.5 m/s pedestrian, not car) and camera appearance features (body shape visible above the car).
- Result: High detection confidence, with accurate range, velocity, and classification.
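The projection step in this example can be sketched as follows, assuming a pinhole camera model; the intrinsic matrix, extrinsic pose, and radar-frame coordinates are placeholder values, not a calibrated setup.

```python
import numpy as np

def project_radar_to_image(point_radar_xyz, K, R, t):
    """Project a radar detection (3D point in the radar frame) onto the image plane
    with the pinhole model u ~ K (R x + t)."""
    p_cam = R @ point_radar_xyz + t   # radar frame -> camera frame
    uvw = K @ p_cam                   # camera frame -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]           # perspective divide -> (u, v) in pixels

# Placeholder calibration: ~1000 px focal length, 1920x1080 image, sensors nearly aligned
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                          # radar-to-camera rotation (placeholder)
t = np.array([0.0, -0.5, 0.0])         # radar-to-camera translation in metres (placeholder)

# Pedestrian detection ~30 m ahead, slightly to the right (camera convention: x right, y down, z forward)
uv = project_radar_to_image(np.array([1.0, 0.0, 30.0]), K, R, t)
# The region of interest around uv is cropped from the image and fed to the classifier.
```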
Definition: NeuRadar and Neural Scene Synthesis
NeuRadar is a neural rendering framework that jointly synthesises novel views for radar, camera, and LiDAR from a shared 3D scene representation. The key idea:
- Represent the scene as a neural field that outputs density, colour, radar cross-section, and LiDAR reflectance at each 3D point.
- Render each modality through its own differentiable forward model:
- Camera: Standard volume rendering (NeRF).
- LiDAR: Ray casting with range and intensity rendering.
- Radar: Coherent rendering with range-Doppler response.
- Train jointly on multi-modal data, with the shared 3D structure enforcing cross-modal consistency.
The unified representation enables cross-modal supervision: LiDAR ground truth can supervise radar geometry learning, and camera images can provide texture information for radar scene understanding.
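The shared-representation idea can be illustrated with a toy neural field whose single trunk feeds separate per-modality output heads; this is only a sketch of the concept, not NeuRadar's actual architecture.

```python
import torch
import torch.nn as nn

class MultiModalField(nn.Module):
    """Toy neural field: one MLP trunk maps a 3D point to a latent feature, and
    per-modality heads read out density, colour, radar cross-section, and LiDAR reflectance."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.density = nn.Sequential(nn.Linear(hidden, 1), nn.Softplus())     # shared geometry
        self.colour = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())       # camera head
        self.rcs = nn.Linear(hidden, 1)                                        # radar head (unbounded, e.g. dBsm)
        self.reflectance = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # LiDAR head

    def forward(self, xyz: torch.Tensor):
        h = self.trunk(xyz)   # (N, hidden) shared feature per 3D point
        return self.density(h), self.colour(h), self.rcs(h), self.reflectance(h)

# Each modality's differentiable renderer consumes the relevant heads; the shared trunk
# and density are what enforce cross-modal geometric consistency during joint training.
field = MultiModalField()
sigma, rgb, rcs, refl = field(torch.randn(1024, 3))
```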
Multi-Sensor Data Fusion for RF Imaging
The CommIT group developed a multi-sensor fusion framework for RF imaging that borrows from Junyuan Gao's data fusion approach. The key contributions:
- Per-sensor back-projection + learned fusion: Each sensor (radio unit) independently produces a back-projection image, and a learned fusion network combines these per-sensor images into a final reconstruction (sketched below).
- Robustness to phase incoherence: When radio units have large or arbitrary phase errors between them, separate estimation + fusion is more robust than joint processing. This is directly relevant to distributed ISAC networks (Chapter 37).
- 2D system model: The framework uses a 2D forward model with multi-sensor data, capturing non-isotropic scattering from different viewing angles.
As Caire noted: "this may also be the basis for the training of the AI models for fusion... and in the future for regularised inverses."
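A minimal sketch of the per-sensor back-projection + learned fusion idea is given below: the per-sensor images are stacked as input channels and a small CNN produces the fused reconstruction. The network size and shapes are illustrative, not the CommIT group's actual model.

```python
import torch
import torch.nn as nn

class BackProjectionFusion(nn.Module):
    """Illustrative fusion network: stack K per-sensor back-projection images as input
    channels and let a small CNN produce the final reconstruction. Because each sensor's
    image is formed independently, inter-sensor phase errors only degrade individual
    channels rather than breaking a jointly coherent measurement model."""
    def __init__(self, num_sensors: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_sensors, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, per_sensor_images: torch.Tensor) -> torch.Tensor:
        # per_sensor_images: (B, K, H, W) magnitude back-projection images, one per radio unit
        return self.net(per_sensor_images)

# Example: fuse back-projections from 4 radio units on a 64x64 scene grid
fusion = BackProjectionFusion(num_sensors=4)
reconstruction = fusion(torch.rand(1, 4, 64, 64))   # (1, 1, 64, 64) fused image
```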
Quick Check
Which fusion architecture is best suited when one sensor modality (e.g., camera) may be completely unavailable (e.g., total darkness)?
- Early fusion
- Mid-level fusion with cross-attention
- Late fusion
- No fusion needed
Late fusion processes each modality independently; if one is unavailable, the others still produce valid detections. The fusion function simply ignores the missing input.
Common Mistake: Sensor Calibration Drift in Multi-Modal Systems
Mistake:
Assuming that the spatial registration (extrinsic calibration) between radar, camera, and LiDAR remains constant after initial calibration, and not monitoring for calibration drift in deployed systems.
Correction:
Vehicle vibrations, temperature changes, and minor collisions cause the relative poses between sensors to drift over time. Even a small angular misalignment between radar and camera (for example, 1° at 30 m range) translates to roughly 0.5 m of positional error, corresponding to tens of pixels in the image depending on the focal length. This can cause the fusion network to associate radar detections with the wrong image regions.
Remedies: (i) Online calibration refinement using natural correspondences. (ii) Soft association: let the network learn to handle small misalignments via attention over a spatial neighbourhood. (iii) Regular recalibration schedules.
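The arithmetic behind the numbers above, with the misalignment angle and focal length as assumed example values:

```python
import numpy as np

range_m = 30.0            # target range from the worked example above
misalign_deg = 1.0        # assumed angular calibration error between radar and camera
focal_px = 1000.0         # assumed camera focal length in pixels

cross_range_error_m = range_m * np.tan(np.deg2rad(misalign_deg))  # ~0.52 m lateral error at 30 m
pixel_error = focal_px * np.tan(np.deg2rad(misalign_deg))         # ~17 px offset, independent of range
print(f"{cross_range_error_m:.2f} m lateral error, about {pixel_error:.0f} px image offset")
```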
Latency Constraints in Real-Time Fusion
Autonomous driving requires end-to-end perception latency below 100 ms. Sensor modalities have different data rates and processing times:
- Camera: 30 fps; CNN inference typically takes tens of milliseconds on a GPU.
- LiDAR: 10--20 fps; point cloud processing adds a comparable amount of time.
- Radar: 10--20 fps; the radar signal-processing chain (range-Doppler FFTs, CFAR detection) is usually the fastest of the three.
The fusion system must synchronise these asynchronous streams (temporal alignment) and produce fused detections within the latency budget. Early fusion is typically slowest (largest input); late fusion is fastest (independent processing, simple combination). Mid-level fusion with efficient cross-attention (e.g., deformable attention) achieves a practical balance.
- End-to-end latency < 100 ms for L3+ autonomous driving
- Temporal synchronisation across modalities within 10 ms
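A minimal sketch of the temporal-alignment step: for each reference timestamp (here the camera frame driving the fusion cycle), pick the nearest measurement from another stream and reject it if the gap exceeds the synchronisation budget. Timestamps and thresholds are illustrative.

```python
import numpy as np

def align_to_reference(ref_times, other_times, max_offset=0.010):
    """For each reference timestamp, return the index of the nearest measurement in the
    other stream, or None if the gap exceeds the synchronisation budget (10 ms here)."""
    matches = []
    for t in ref_times:
        idx = int(np.argmin(np.abs(other_times - t)))
        matches.append(idx if abs(other_times[idx] - t) <= max_offset else None)
    return matches

# Camera at 30 fps and radar at 20 fps (illustrative, slightly offset timestamps in seconds)
cam_t = np.arange(0.0, 0.2, 1 / 30)
radar_t = np.arange(0.003, 0.2, 1 / 20)
print(align_to_reference(cam_t, radar_t))
# Frames without a radar measurement inside the 10 ms budget come back as None;
# in practice these gaps are handled by interpolation or ego-motion compensation.
```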
BEV (Bird's-Eye-View)
A top-down representation of the scene in a 2D grid aligned with the ground plane. Commonly used as the shared coordinate frame for multi-modal sensor fusion in autonomous driving.
Related: Bird's-Eye-View (BEV) Fusion
Cross-Modal Supervision
Using ground truth or predictions from one sensor modality (e.g., LiDAR depth maps) to supervise the training of a network that processes a different modality (e.g., radar).
Related: NeuRadar and Neural Scene Synthesis
Key Takeaway
Multi-modal fusion combines the complementary strengths of radar (velocity, all-weather), camera (resolution, semantics), and LiDAR (3D geometry). Mid-level fusion with BEV representations and cross-attention is the current best practice. The CommIT group's per-sensor back-projection + learned fusion approach is particularly well-suited to distributed RF imaging, where phase incoherence between nodes makes joint processing fragile.