Cooperative Perception via OTFS-ISAC

Why Cooperate?

An autonomous vehicle sees the world through its own sensors — cameras, lidar, radar — with range limited by line of sight and occlusion. A truck ahead blocks the view of the car beyond it; a building corner hides pedestrians; weather reduces effective range. Cooperative perception lifts this limitation by letting each vehicle share its sensed scene with neighbors, so the collective has a more complete world model than any single vehicle. The OTFS-ISAC waveform is ideally suited to this: it already produces DD-domain scene estimates as part of its normal operation, and the V2V link can carry them. This section shows how.

Definition: Cooperative Perception

Cooperative perception (CP) is the sharing of sensed-scene information between vehicles to augment individual perception. Data shared include:

  • Target lists: discrete objects with position, velocity, class, confidence. Bandwidth: $\sim 1$ kbps per vehicle (sparse).
  • Occupancy grids: 2D probability maps of road occupancy. Bandwidth: $\sim 100$–$1000$ kbps (depends on resolution).
  • Dense point clouds: full lidar-like 3D data. Bandwidth: $\sim 10$–$100$ Mbps.
  • DD-domain scene: the OTFS-ISAC estimate $\hat\Theta = \{(\tau_i, \nu_i, \theta_i, \phi_i, a_i)\}$ directly. Bandwidth: $\sim 1$–$10$ kB/s.

Architectures:

  • Centralized: an RSU or BS aggregates scenes from all vehicles in coverage; requires a high-bandwidth V2I link.
  • Distributed: vehicles broadcast directly over the V2V sidelink; each vehicle fuses its own sensing with received data.
  • Hybrid: RSU for the long-range scene + V2V for immediate-neighbor updates.

Theorem: Cooperative Perception Accuracy Gain

For a scene observed by $K$ cooperating vehicles, each with sensing CRB $\sigma_k^2$ for a target parameter, the fused estimate has CRB
$$\sigma_{\text{fused}}^2 \;=\; \left(\sum_k \sigma_k^{-2}\right)^{-1}.$$
For $K$ equal-accuracy vehicles, $\sigma_{\text{fused}}^2 = \sigma^2/K$: the fused accuracy gain is $\sqrt{K}$ in standard deviation.

Consequence. Four vehicles at a T-intersection, each sensing with $\sigma_\theta = 1°$, fuse to $\sigma_\theta^{\text{fused}} = 0.5°$ — halving the angle uncertainty. More importantly, occluded targets (not visible from any single vehicle) become visible through the union of observations.
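A minimal numerical check of the fusion rule, using NumPy; the second set of vehicle accuracies is illustrative, not from the text:

import numpy as np

def fused_std(stds):
    """Fuse per-vehicle standard deviations via inverse-variance weighting:
    sigma_fused^2 = (sum_k sigma_k^-2)^-1."""
    stds = np.asarray(stds, dtype=float)
    return float(np.sqrt(1.0 / np.sum(stds ** -2)))

# Four equal-accuracy vehicles, sigma_theta = 1 deg each (the T-intersection example).
print(fused_std([1.0, 1.0, 1.0, 1.0]))   # 0.5 deg = 1 deg / sqrt(4)

# Unequal accuracies: the best sensor dominates, but every observation still helps.
print(fused_std([1.0, 2.0, 0.5]))        # ~0.436 deg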

The point is that cooperative perception is both quantitatively and qualitatively better than individual perception. Quantitatively: a $\sqrt{K}$ gain from information pooling. Qualitatively: coverage extension to NLOS targets. The DD-domain representation makes the fusion particularly clean: each vehicle's scene estimate is a list of $(\tau, \nu, \theta, \phi, a)$ tuples, and fusion is a maximum-likelihood merge of these tuples.


Key Takeaway

Cooperative perception's biggest win is coverage, not accuracy. While the theoretical accuracy gain is $\sqrt{K}$, the bigger operational win is covering NLOS and occluded targets — e.g., a pedestrian behind a parked truck becomes visible via a neighbor vehicle's radar. Tail-probability arguments, not just mean-error arguments, justify CP: it eliminates "catastrophic" misses rather than merely reducing the average error.

Definition: DD-Domain Scene Exchange Format

For OTFS-ISAC-equipped vehicles, the canonical cooperative perception payload is the DD-domain scene
$$\mathcal{S} \;=\; \left\{(\hat\tau_i, \hat\nu_i, \hat\theta_i, \hat\phi_i, |\hat a_i|, \hat{\mathrm{class}}_i, \hat{\mathrm{conf}}_i)\right\}_{i=1}^{P_{\text{local}}},$$
plus metadata: vehicle position (from GPS or SLAM), timestamp, frame ID, uncertainty ellipse.

Size: 5 floats + 1 enum + 1 float per target, plus $\sim 50$ bytes of metadata. For $P_{\text{local}} = 20$ targets: $\sim 700$ bytes per frame. At a 10 Hz frame rate: $\sim 7$ kB/s ($\sim 56$ kbps) — roughly 1000× less than raw lidar point-cloud sharing.
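A minimal serialization sketch in Python, assuming 32-bit floats, a 1-byte class enum, and a fixed 50-byte metadata header; the field widths are illustrative, not a standardized message format. It lands in the same ballpark as the figures above (the quoted $\sim 700$-byte frame allows some extra per-target framing overhead):

import struct

# Per-target record: 5 float32 (tau, nu, theta, phi, |a|) + 1-byte class + float32 confidence.
TARGET_FMT = "<5fBf"            # assumption: little-endian, 32-bit floats, no padding
METADATA_BYTES = 50             # assumption: GPS pose, timestamp, frame ID, uncertainty ellipse

def frame_size_bytes(num_targets: int) -> int:
    return METADATA_BYTES + num_targets * struct.calcsize(TARGET_FMT)

size = frame_size_bytes(20)
rate_kbps = size * 8 * 10 / 1e3                  # 10 Hz frame rate
print(size, "bytes/frame,", rate_kbps, "kbps")   # 550 bytes/frame, 44.0 kbps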

Fusion protocol: received scenes are transformed from the sender's coordinate system to the receiver's (using GPS + yaw), then fused via nearest-neighbor association (target-target matching by proximity in $(\tau, \nu, \theta, \phi)$).

Cooperative Perception Fusion

Input: Own scene S_own, received scenes {S_k}, transformation metadata
Output: Fused scene S_fused
1. TRANSFORM (per neighbor k):
   Convert S_k to the local coordinate system using:
   - Neighbor GPS + yaw.
   - Timestamp alignment (compensate for propagation delay).
2. ASSOCIATE (global matching):
   Form cost matrix C[i, j] = d(s_i_own, s_j_k) using:
   - Spatial distance in DD-angle-position space.
   - Class agreement.
   - Time alignment.
   Solve the multi-dimensional assignment.
3. FUSE (per matched pair):
   For each matched target i, merge information from all sources
   using Kalman-style information fusion:
     θ_fused = (Σ σ_k^{-2} θ_k) / (Σ σ_k^{-2})
     σ_fused^2 = 1 / Σ σ_k^{-2}
   Class: majority vote weighted by confidence.
   Uncertainty: minimum of contributing uncertainties.
4. ADD UNMATCHED:
   Targets in received scenes not matched to own targets: add to S_fused
   with a "remote-only" flag.
5. CONFIRMATION:
   Target confirmed if observed by ≥ 2 vehicles.
   Tentative if only one source.
Complexity: O(P^2 K) per frame for association. For P = 20 targets,
K = 4 neighbors: ~1600 ops. Modest.
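Below is a compact Python sketch of this pipeline, using NumPy and SciPy's Hungarian solver (scipy.optimize.linear_sum_assignment). It assumes each target has already been converted from its DD-domain tuple to a 2D Cartesian position with a scalar position variance, and it simplifies the cost to Euclidean distance with a fixed gate; the field layout, gate value, and class-vote rule are illustrative choices, not the text's exact protocol.

import numpy as np
from scipy.optimize import linear_sum_assignment

# A target here is (x, y, var, cls, conf): 2D position in the sender's frame,
# position variance, class label, and confidence.

def to_local(scene, sender_xy, sender_yaw):
    """Rotate/translate a neighbor's targets into the receiver's frame."""
    c, s = np.cos(sender_yaw), np.sin(sender_yaw)
    R = np.array([[c, -s], [s, c]])
    out = []
    for x, y, var, cls, conf in scene:
        px, py = R @ np.array([x, y]) + np.asarray(sender_xy)
        out.append((px, py, var, cls, conf))
    return out

def fuse_scenes(own, neighbor_scenes, neighbor_poses, gate=2.0):
    """Associate and information-fuse the own scene with transformed neighbor scenes."""
    fused = [list(t) + [1] for t in own]          # last field: number of observers
    for scene, (xy, yaw) in zip(neighbor_scenes, neighbor_poses):
        remote = to_local(scene, xy, yaw)
        if not fused or not remote:
            fused += [list(t) + [1] for t in remote]
            continue
        F = np.array([[f[0], f[1]] for f in fused])
        Rm = np.array([[r[0], r[1]] for r in remote])
        C = np.linalg.norm(F[:, None, :] - Rm[None, :, :], axis=-1)   # cost matrix
        rows, cols = linear_sum_assignment(C)
        matched_remote = set()
        for i, j in zip(rows, cols):
            if C[i, j] > gate:                    # gating: reject distant matches
                continue
            xf, yf, vf, clsf, conff, n = fused[i]
            xr, yr, vr, clsr, confr = remote[j]
            w1, w2 = 1.0 / vf, 1.0 / vr           # inverse-variance (information) weights
            v_new = 1.0 / (w1 + w2)
            fused[i] = [v_new * (w1 * xf + w2 * xr),
                        v_new * (w1 * yf + w2 * yr),
                        v_new,
                        clsf if conff >= confr else clsr,   # confidence-weighted class vote
                        max(conff, confr),
                        n + 1]
            matched_remote.add(j)
        # Unmatched remote targets enter as "remote-only" observations.
        fused += [list(remote[j]) + [1] for j in range(len(remote)) if j not in matched_remote]
    # Confirmation rule: confirmed if seen by >= 2 vehicles, tentative otherwise.
    return [{"x": t[0], "y": t[1], "var": t[2], "class": t[3],
             "conf": t[4], "confirmed": t[5] >= 2} for t in fused]

Greedy nearest-neighbor association would also work at this problem size; the Hungarian solver is used here because the global assignment of step 2 maps onto it directly.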

Example: T-Intersection Cooperative Perception

A pedestrian crosses a road at a T-intersection. Three vehicles approach: one on the main road, one entering from the left, one exiting to the right. Each has OTFS-ISAC sensors at 77 GHz. The pedestrian is occluded from the main-road vehicle by a parked truck.

(a) Determine which vehicle first senses the pedestrian. (b) Compute the information propagation time. (c) Discuss the impact on collision avoidance.


Theorem: Cooperative Perception Throughput Scaling

For a V2V sidelink with $K$ neighbors each transmitting DD-domain scenes at rate $R_s$, the total required V2V bandwidth is
$$B_{\text{total}} \;=\; K \cdot R_s / \mathrm{SE}_{\text{V2V}},$$
where $\mathrm{SE}_{\text{V2V}}$ is the V2V spectral efficiency. For 77 GHz OTFS-V2V with 10 neighbors, $R_s = 7$ kB/s ($56$ kbps), and $\mathrm{SE} = 4$ bits/s/Hz:
$$B_{\text{total}} \;=\; 10 \cdot 56 \text{ kbps} / 4 \text{ bits/s/Hz} \;=\; 140 \text{ kHz}.$$
Negligible compared to the 100+ MHz bandwidth of 77-GHz V2V.

Consequence. DD-scene CP is essentially free of overhead — the V2V link can serve multiple simultaneous neighbors without rate reduction. Point-cloud CP ($R_s = 10$ Mbps per vehicle) would require 25 MHz — it still fits, but leaves far less margin for communication data.
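A two-line numerical check of this scaling, in Python; the neighbor count, scene rates, and spectral efficiency are the example values just given:

def v2v_bandwidth_hz(num_neighbors, scene_rate_bps, spectral_eff_bps_per_hz):
    """B_total = K * R_s / SE_V2V."""
    return num_neighbors * scene_rate_bps / spectral_eff_bps_per_hz

print(v2v_bandwidth_hz(10, 56e3, 4.0))    # DD-domain scenes: 140e3 Hz = 140 kHz
print(v2v_bandwidth_hz(10, 10e6, 4.0))    # raw point clouds:  25e6 Hz = 25 MHz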

The DD-scene format is so compact that bandwidth is not the bottleneck; latency and reliability are. Low latency comes from OTFS's short frame duration and DD structure; reliability from the diversity gain of Theorem 15.5. DD-domain CP is thus a "sweet spot" — low bandwidth, high utility, high reliability.

Interactive figure: Cooperative Perception Coverage Gain — fraction of detected targets vs. number of cooperating vehicles, for LOS-only and NLOS-heavy scenarios. Sliders: scenario density, vehicle spacing.
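A toy Monte Carlo sketch of the curve this figure shows, assuming each vehicle independently misses a target with some probability; the per-vehicle detection probabilities are illustrative assumptions, not values from the text, and real occlusion is of course correlated with geometry.

import numpy as np

rng = np.random.default_rng(0)

def coverage_fraction(num_vehicles, p_detect, num_targets=1000):
    """Fraction of targets seen by at least one of K vehicles, assuming
    independent per-vehicle detections with probability p_detect."""
    seen = rng.random((num_vehicles, num_targets)) < p_detect
    return seen.any(axis=0).mean()

for K in range(1, 6):
    los  = coverage_fraction(K, p_detect=0.9)   # LOS-only: high per-vehicle detection
    nlos = coverage_fraction(K, p_detect=0.4)   # NLOS-heavy: frequent occlusion
    print(f"K={K}: LOS-only {los:.2f}, NLOS-heavy {nlos:.2f}")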
⚠️Engineering Note

CP Privacy and Trust

Cooperative perception shares sensed-scene data — including trajectories of pedestrians, parked vehicles, personal property. This creates privacy and trust concerns:

  • Pedestrian tracking: receiving vehicles may track pedestrian identities via cross-referenced observations. Mitigation: de-identify shared scenes (target class only, no identity).
  • Trust: a malicious vehicle could broadcast false scene data, causing downstream receivers to react incorrectly. Mitigation: cryptographic authentication (X.509 certificates per vehicle), sanity checks against own observations, ignore scenes from unauthenticated sources.
  • Privacy regulation: GDPR (Europe) and CCPA (California) may restrict sensor data sharing. Local anonymization before V2V broadcast is essential.

Practical deployments (2024): C-V2X broadcasts Basic Safety Messages (BSM) with vehicle identification. Regulators require pseudonymization; vehicle IDs rotate every 5 minutes to prevent long-term tracking. OTFS V2V CP inherits this scheme; additional privacy-preserving aggregation is a research topic.

Practical Constraints
  • De-identify shared scene data (no pedestrian IDs)
  • Cryptographic authentication per vehicle
  • Cross-validate received data with own observations
  • Regulatory: GDPR, CCPA compliance

Common Mistake: Coordinate Transformation Errors

Mistake:

Fusing DD-scenes without accurate coordinate transformation (sender's frame → receiver's frame). GPS errors, yaw estimation errors, timestamp misalignment all introduce biases that turn a helpful neighbor into a harmful one.

Correction:

Deploy stringent calibration: GPS + inertial measurement unit (IMU) + wheel-odometry fusion for each vehicle's own position/velocity. Time-synchronize vehicles via GNSS-PPS or PTP over the sidelink. Filter outliers in CP fusion: observations with transformation uncertainty $> \sigma_{\text{threshold}}$ are discarded. Typical automotive thresholds: 30-cm position, 1° yaw, 1-ms timestamp — achievable with 2024 automotive-grade hardware.
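A small sketch of the outlier gate, in Python; the threshold values are the typical figures quoted above, while the record field names are hypothetical.

# Hypothetical transform-uncertainty record for one received scene.
THRESHOLDS = {"position_m": 0.30, "yaw_deg": 1.0, "time_s": 1e-3}

def accept_scene(transform_uncertainty: dict) -> bool:
    """Discard a received scene if any transformation uncertainty exceeds its threshold."""
    return all(transform_uncertainty[k] <= v for k, v in THRESHOLDS.items())

print(accept_scene({"position_m": 0.12, "yaw_deg": 0.4, "time_s": 2e-4}))   # True: fuse it
print(accept_scene({"position_m": 0.80, "yaw_deg": 0.4, "time_s": 2e-4}))   # False: discard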