Multi-View Stereo and Structure from Motion

Why Multi-View Geometry Matters for RF

Every RF imaging system with multiple Tx-Rx pairs is, at its core, a multi-view system. Each Tx-Rx pair "views" the scene from a different geometric perspective β€” just as cameras in a stereo rig observe the same scene from different positions. The mathematical machinery of multi-view geometry β€” epipolar constraints, the fundamental matrix, bundle adjustment β€” transfers directly to RF with one critical difference: RF measurements are coherent (phase-bearing), while camera images are incoherent (intensity only).

This section develops the optical multi-view framework; Section 28.3 adapts it to RF wave propagation.

Definition:

Pinhole Camera Model and Projection

A pinhole camera maps a 3D point $\mathbf{P} = [X, Y, Z]^\mathsf{T}$ in world coordinates to a 2D image point $\mathbf{x} = [u, v]^\mathsf{T}$ via perspective projection:

$$\tilde{\mathbf{x}} = \mathbf{K}[\mathbf{R} \mid \mathbf{t}]\,\tilde{\mathbf{P}},$$

where $\tilde{\mathbf{x}} \in \mathbb{R}^3$ and $\tilde{\mathbf{P}} \in \mathbb{R}^4$ are homogeneous coordinates, $\mathbf{R} \in SO(3)$ and $\mathbf{t} \in \mathbb{R}^3$ are the camera extrinsics (rotation and translation), and the intrinsic matrix is:

$$\mathbf{K} = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$

with focal lengths $f_x, f_y$, principal point $(c_x, c_y)$, and skew $s$ (usually 0).

The full $3 \times 4$ projection matrix $\mathbf{P} = \mathbf{K}[\mathbf{R} \mid \mathbf{t}]$ has 11 degrees of freedom (6 extrinsic + 5 intrinsic).
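As a concrete sketch, the projection above takes a few lines of NumPy. The intrinsics, pose, and 3D point below are illustrative values, not taken from the text:

```python
import numpy as np

# Illustrative intrinsics: fx = fy = 800, principal point (320, 240), no skew
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                      # extrinsics: camera aligned with world axes
t = np.zeros(3)

def project(K, R, t, P):
    """Perspective projection x ~ K [R | t] P, returning pixel coordinates."""
    p_cam = R @ P + t              # world frame -> camera frame
    x_h = K @ p_cam                # homogeneous image point
    return x_h[:2] / x_h[2]        # perspective divide

P = np.array([0.5, -0.25, 2.0])    # a 3D point 2 m in front of the camera
u, v = project(K, R, t, P)
```

The perspective divide in the last line of `project` is what makes the mapping nonlinear in $Z$: points twice as far move half as much in the image.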

Definition:

Epipolar Geometry and the Fundamental Matrix

Given two cameras observing the same 3D point $\mathbf{P}$, the projections $\mathbf{x}_1$ and $\mathbf{x}_2$ in the two images satisfy the epipolar constraint:

$$\tilde{\mathbf{x}}_2^\mathsf{T}\,\mathbf{F}\,\tilde{\mathbf{x}}_1 = 0,$$

where $\mathbf{F} \in \mathbb{R}^{3 \times 3}$ is the fundamental matrix (rank 2, 7 degrees of freedom).

Geometric interpretation: the point $\mathbf{x}_1$ in image 1 constrains the corresponding point $\mathbf{x}_2$ in image 2 to lie on the epipolar line $\ell_2 = \mathbf{F}\,\tilde{\mathbf{x}}_1$. This reduces stereo matching from a 2D search over the image to a 1D search along that line.

When the cameras are calibrated (intrinsics $\mathbf{K}_1, \mathbf{K}_2$ known), the fundamental matrix factors as:

$$\mathbf{F} = \mathbf{K}_2^{-\mathsf{T}}\,\mathbf{E}\,\mathbf{K}_1^{-1},$$

where $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ is the essential matrix (5 DOF), with $[\mathbf{t}]_\times$ the skew-symmetric matrix of the baseline translation.

The fundamental matrix encodes the relative geometry between two views. It can be estimated linearly from $\geq 8$ point correspondences (the 8-point algorithm), or from exactly 7 with the 7-point algorithm, which enforces the rank-2 constraint $\det \mathbf{F} = 0$ and yields up to three solutions.

Theorem: Properties of the Essential Matrix

The essential matrix $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ satisfies:

  1. $\mathbf{E}$ has rank 2, and its two nonzero singular values are equal.
  2. The SVD of $\mathbf{E}$ is $\mathbf{E} = \mathbf{U}\,\mathrm{diag}(\sigma, \sigma, 0)\,\mathbf{V}^\mathsf{T}$, where $\sigma = \|\mathbf{t}\|$.
  3. Given $\mathbf{E}$, the rotation and translation can be recovered (up to a four-fold ambiguity resolved by the positive-depth constraint).

The essential matrix captures the rigid-body geometry between two calibrated cameras. Its rank-2 structure reflects the fact that a single point correspondence constrains but does not determine the 3D point: one degree of freedom (depth) remains.
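The theorem's recovery procedure can be sketched as follows: build $\mathbf{E}$ from an assumed pose, then read off the four candidate $(\mathbf{R}, \mathbf{t})$ pairs from its SVD. In a real pipeline the positive-depth (cheirality) test on triangulated points selects the physical solution; here the candidates are only enumerated:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x of a 3-vector v."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

# Assumed relative pose: small rotation about z, unit-length baseline
th = 0.1
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0, 0, 1.0]])
t_true = np.array([1.0, 0.2, 0.0])
t_true /= np.linalg.norm(t_true)

E = skew(t_true) @ R_true
U, S, Vt = np.linalg.svd(E)
# Properties 1-2: S = (sigma, sigma, 0) with sigma = ||t|| = 1 here

# Property 3: four candidates (two rotations x two baseline signs)
W = np.array([[0, -1.0, 0], [1, 0, 0], [0, 0, 1]])  # 90-degree z-rotation
R_cands = [U @ W @ Vt, U @ W.T @ Vt]
R_cands = [r if np.linalg.det(r) > 0 else -r for r in R_cands]  # proper rotations
t_cands = [U[:, 2], -U[:, 2]]   # translation spans the left null space of E
```

With an exact $\mathbf{E}$, the true rotation appears in `R_cands` and the true baseline direction in `t_cands`; the sign fix on the determinant mirrors what standard implementations do when the SVD returns reflections.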

Epipolar Geometry Visualisation

Visualise epipolar geometry for a stereo camera pair. A 3D point projects onto both images; the epipolar lines show the geometric constraint between correspondences. Increasing the baseline separates the epipoles further and increases the disparity (the displacement between corresponding points), which improves depth estimation precision.


Definition:

Structure from Motion (SfM)

Structure from Motion jointly estimates 3D scene structure (a sparse point cloud) and camera poses from a collection of unposed images:

  1. Feature extraction: Detect and describe keypoints in each image (e.g., SIFT, SuperPoint).
  2. Feature matching: Find correspondences between image pairs.
  3. Geometric verification: Filter matches using epipolar geometry (corresponding points must satisfy $\tilde{\mathbf{x}}_2^\mathsf{T} \mathbf{F}\,\tilde{\mathbf{x}}_1 = 0$).
  4. Incremental reconstruction: Initialise structure from a well-conditioned two-view pair, then alternately register new cameras (via PnP) and triangulate new points.
  5. Bundle adjustment: Jointly optimise 3D point positions $\{\mathbf{P}_j\}$ and camera parameters $\{\mathbf{K}_k, \mathbf{R}_k, \mathbf{t}_k\}$ by minimising the reprojection error:

$$\min_{\{\mathbf{P}_j\}, \{\mathbf{R}_k, \mathbf{t}_k\}} \sum_{k,j} \rho\!\left(\|\pi(\mathbf{K}_k, \mathbf{R}_k, \mathbf{t}_k, \mathbf{P}_j) - \mathbf{x}_{k,j}\|^2\right),$$

where $\pi$ is the projection function and $\rho$ is a robust loss (e.g., Huber).
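The robust reprojection objective can be sketched directly. The camera values and the 100-pixel outlier below are illustrative; the point is that the Huber loss penalises a gross mismatch linearly rather than quadratically, limiting its influence on the optimisation:

```python
import numpy as np

def huber(sq_err, delta=1.0):
    """Huber loss rho applied to a squared residual ||e||^2."""
    r = np.sqrt(sq_err)
    return np.where(r <= delta, sq_err, 2 * delta * r - delta**2)

def reprojection_cost(K, R, t, points, obs, delta=1.0):
    """Sum of robust reprojection errors for one camera's observations."""
    p = (K @ (R @ points.T + t[:, None])).T   # pi(K, R, t, P_j)
    proj = p[:, :2] / p[:, 2:]                # perspective divide
    sq = np.sum((proj - obs) ** 2, axis=1)    # ||pi(...) - x_kj||^2
    return float(huber(sq, delta).sum())

# Assumed camera; observations generated by exact projection
K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 5.0], [0.3, -0.2, 6.0]])
p = (K @ pts.T).T
obs = p[:, :2] / p[:, 2:]
cost_clean = reprojection_cost(K, R, t, pts, obs)    # 0: perfect matches

obs_out = obs.copy()
obs_out[0, 0] += 100.0                               # gross matching outlier
cost_outlier = reprojection_cost(K, R, t, pts, obs_out)
```

With a quadratic loss the outlier would contribute $100^2 = 10{,}000$ to the cost; under Huber (here $\delta = 1$) it contributes only $2\delta r - \delta^2 = 199$.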

SfM is the standard preprocessing step for NeRF and 3DGS: it provides the camera poses needed for training. COLMAP is the most widely used SfM pipeline.

Definition:

Bundle Adjustment

Bundle adjustment is a nonlinear least-squares optimisation that jointly refines 3D point positions and camera parameters. Let $\theta = (\{\mathbf{P}_j\}, \{\mathbf{R}_k, \mathbf{t}_k\})$ denote the unknowns. The cost function is:

$$\min_\theta \sum_{(k,j) \in \mathcal{V}} \rho\!\left(\|\pi_k(\mathbf{P}_j; \theta) - \mathbf{x}_{k,j}\|^2\right),$$

where $\mathcal{V}$ is the set of visibility pairs (point $j$ seen in camera $k$).

The Jacobian of this system has a sparse block structure (each observation depends on exactly one point and one camera), enabling the Schur complement trick: eliminate point variables first, then solve a reduced system over camera variables only.

Levenberg-Marquardt is the standard solver, with cost per iteration $O(|\mathcal{V}|\,c^2 + n_c^3)$, where $c$ is the camera parameter dimension and $n_c$ is the number of cameras.

The Schur complement trick reduces a system with millions of 3D points and hundreds of cameras to a dense system of size $\sim 6 n_c$, making bundle adjustment tractable for large-scale SfM.
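The elimination order described above can be sketched on a toy damped Gauss-Newton system with bundle adjustment's sparsity pattern (all dimensions and values are synthetic): because each observation touches one camera and one point, $\mathbf{H}_{pp}$ is $3 \times 3$ block-diagonal and cheap to invert, leaving a reduced system of only $6 n_c$ unknowns:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cam, n_pts = 3, 40
nc, npts = 6 * n_cam, 3 * n_pts

# Jacobian with BA sparsity: each observation row touches one camera
# (6 columns) and one point (3 columns); here each point is seen once (toy)
rows = []
for j in range(n_pts):
    k = j % n_cam
    row = np.zeros((2, nc + npts))
    row[:, 6 * k:6 * k + 6] = rng.standard_normal((2, 6))
    row[:, nc + 3 * j:nc + 3 * j + 3] = rng.standard_normal((2, 3))
    rows.append(row)
J = np.vstack(rows)
H = J.T @ J + 1e-3 * np.eye(nc + npts)   # LM damping keeps H invertible
g = rng.standard_normal(nc + npts)       # gradient (right-hand side)

# Schur complement: eliminate point variables first
Hcc, Hcp, Hpp = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]
Hpp_inv = np.zeros_like(Hpp)
for j in range(n_pts):                   # Hpp is 3x3 block-diagonal
    s = slice(3 * j, 3 * j + 3)
    Hpp_inv[s, s] = np.linalg.inv(Hpp[s, s])
S = Hcc - Hcp @ Hpp_inv @ Hcp.T          # reduced system, size 6 * n_cam
dc = np.linalg.solve(S, g[:nc] - Hcp @ Hpp_inv @ g[nc:])   # camera update
dp = Hpp_inv @ (g[nc:] - Hcp.T @ dc)                       # back-substitute points
```

The update `(dc, dp)` is algebraically identical to solving the full system `H x = g` directly, but the dense solve involves only the small camera block.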

Example: The COLMAP SfM Pipeline

Describe the steps to go from a set of uncalibrated photographs to camera poses suitable for training a NeRF.

Why This Matters: Multi-View Geometry in RF Imaging

The geometry of multi-view imaging has direct parallels in RF:

  • SfM $\leftrightarrow$ Array calibration: Estimating antenna positions and orientations from calibration measurements is analogous to camera pose estimation in SfM.

  • Epipolar geometry $\leftrightarrow$ Range-azimuth ambiguity: The fundamental matrix constrains where a correspondence can appear; similarly, the range-azimuth ambiguity in monostatic radar constrains where a scatterer can be localised.

  • Bundle adjustment $\leftrightarrow$ Autofocus: Both jointly estimate the scene and nuisance parameters (camera poses in vision, phase errors in RF) from the measurements.

  • Stereo disparity $\leftrightarrow$ Bistatic range difference: In stereo vision, depth is recovered from disparity; in bistatic radar, the target position is recovered from the difference between the Tx and Rx path lengths.

These parallels motivate adapting computer vision's mature 3D reconstruction pipeline to RF imaging problems.


Quick Check

The fundamental matrix $\mathbf{F}$ has 7 degrees of freedom. What is the minimum number of point correspondences needed to estimate it (using the classical linear method)?

5

7

8

11

Historical Note: From Photogrammetry to Computer Vision

1981--1997

Epipolar geometry was first formalised in the context of aerial photogrammetry in the early 20th century, where overlapping photographs from aircraft were used to create topographic maps. The fundamental matrix was introduced by Luong and Faugeras in 1996, unifying earlier work on the essential matrix (Longuet-Higgins, 1981) with uncalibrated cameras. The 8-point algorithm, rediscovered by Hartley in 1997, demonstrated that careful normalisation of point coordinates makes linear estimation of $\mathbf{F}$ practical and numerically stable.

Epipolar Line

The line in image 2 on which the projection of a 3D point must lie, given its projection in image 1. Computed as $\ell_2 = \mathbf{F}\,\tilde{\mathbf{x}}_1$.

Related: Epipolar Geometry and the Fundamental Matrix

Bundle Adjustment

Nonlinear least-squares refinement of 3D point positions and camera parameters by minimising the total reprojection error across all views and observed points.

Related: Bundle Adjustment

Common Mistake: SfM Scale Ambiguity

Mistake:

Assuming that SfM recovers metric (absolute) scale from images alone.

Correction:

Monocular SfM recovers structure and motion only up to an unknown global scale factor. The fundamental matrix encodes epipolar geometry but not the absolute baseline length. To recover metric scale, you need at least one known distance (a calibration object) or additional sensor data (GPS, IMU, known object size). In RF imaging, the carrier wavelength provides a natural scale reference that optical SfM lacks.

Key Takeaway

Multi-view geometry β€” epipolar constraints, the fundamental/essential matrix, and bundle adjustment β€” provides the mathematical backbone for 3D reconstruction from 2D observations. These concepts transfer directly to RF imaging: Tx-Rx pairs are "cameras," range measurements replace pixel disparities, and autofocus is the RF analog of bundle adjustment.