Neural Radiance Fields Recap

Why NeRF Matters for RF Imaging

Neural radiance fields (NeRFs) represent a 3D scene as a continuous function parameterised by a neural network, rather than as a discrete voxel grid or mesh. This shift from discrete to continuous is precisely what makes NeRF attractive for RF imaging: the scene can be queried at arbitrary positions, and the rendering integral is differentiable end-to-end. Before we adapt NeRF for radio frequencies, we need a solid understanding of how it works in its native optical domain.

The golden thread of this chapter: NeRF replaces the voxel grid and the sensing matrix $\mathbf{A}$ with a continuous neural scene function, but to work for RF it must respect the physics of complex-valued fields, diffraction, and specular scattering.

Definition:

Neural Radiance Field (NeRF)

A Neural Radiance Field represents a 3D scene as a continuous volumetric function parameterised by an MLP:

$$F_\theta \colon (\mathbf{x}, \mathbf{d}) \mapsto (\sigma, \mathbf{c}),$$

where $\mathbf{x} \in \mathbb{R}^3$ is a 3D position, $\mathbf{d} \in \mathbb{S}^2$ is a viewing direction, $\sigma \geq 0$ is the volume density (opacity per unit length), and $\mathbf{c} \in [0,1]^3$ is the view-dependent colour (RGB).

The density $\sigma$ depends only on position (geometry is view-independent), while the colour depends on both position and direction (appearance may be view-dependent):

$$\sigma = g_\theta(\gamma(\mathbf{x})), \qquad \mathbf{c} = h_\theta(\gamma(\mathbf{x}), \gamma(\mathbf{d})).$$
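As a concrete reference, here is a minimal sketch of this two-branch factorisation, assuming PyTorch. The class name `TinyNeRF` and the layer sizes are illustrative choices for this sketch (the original paper uses an 8-layer, 256-unit trunk with a skip connection), not the reference implementation.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Sketch of F_theta: (gamma(x), gamma(d)) -> (sigma, c).

    Density depends only on position; colour also sees the direction.
    """
    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)        # sigma = g_theta(gamma(x))
        self.colour_head = nn.Sequential(            # c = h_theta(gamma(x), gamma(d))
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),  # RGB in [0, 1]^3
        )

    def forward(self, pos_enc, dir_enc):
        h = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(h))       # enforce sigma >= 0
        colour = self.colour_head(torch.cat([h, dir_enc], dim=-1))
        return sigma, colour
```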

Definition:

Differentiable Volume Rendering

The colour $\hat{C}(\mathbf{r})$ of a pixel is computed by casting a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ from the camera origin $\mathbf{o}$ through the pixel and integrating:

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,$$

where the transmittance is

$$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right),$$

representing the probability that the ray travels from $t_n$ to $t$ without being absorbed.

Numerical approximation. With $N$ stratified samples $\{t_i\}_{i=1}^N$ along the ray:

$$\hat{C} \approx \sum_{i=1}^{N} T_i\,(1 - e^{-\sigma_i \delta_i})\,\mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$

where $\delta_i = t_{i+1} - t_i$. This computation is differentiable with respect to $\theta$ via backpropagation.
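In code, the quadrature is a short array computation. A minimal NumPy sketch, assuming the densities and colours have already been evaluated at the sample points; the function name and argument layout are conventions of this sketch, not taken from any reference implementation:

```python
import numpy as np

def render_ray(sigmas, colours, t_vals):
    """Quadrature approximation of the volume-rendering integral.

    sigmas:  (N,) non-negative densities at the samples
    colours: (N, 3) RGB values at the samples
    t_vals:  (N + 1,) sample boundaries, so delta_i = t[i+1] - t[i]
    """
    deltas = np.diff(t_vals)                    # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)     # alpha_i = 1 - exp(-sigma_i delta_i)
    # T_i = prod_{j<i} (1 - alpha_j); T_1 = 1 is the empty product
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                    # w_i = T_i alpha_i
    return weights @ colours, weights           # C_hat = sum_i w_i c_i
```

Returning the weights alongside the colour is deliberate: the theorem below reuses them for hierarchical sampling.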

Theorem: Volume Rendering as Weighted Combination

Define the alpha value $\alpha_i = 1 - e^{-\sigma_i \delta_i}$ at sample $i$. Then the quadrature approximation can be written as:

$$\hat{C} = \sum_{i=1}^{N} w_i\, \mathbf{c}_i, \qquad w_i = T_i \alpha_i, \qquad T_i = \prod_{j=1}^{i-1}(1 - \alpha_j).$$

The weights $\{w_i\}$ satisfy $\sum_{i=1}^N w_i \leq 1$, with equality when the ray is fully absorbed before $t_f$. Once normalised, these weights define a probability distribution over sample positions, which enables hierarchical sampling in the fine network.
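Hierarchical sampling then amounts to inverting the piecewise-constant CDF built from the normalised weights. A NumPy sketch; the original implementation additionally stratifies $u$ and merges the fine samples with the coarse ones, both omitted here:

```python
import numpy as np

def sample_pdf(t_bins, weights, n_fine, rng=None):
    """Draw fine samples from the distribution defined by coarse weights.

    t_bins:  (N + 1,) bin edges along the ray
    weights: (N,) rendering weights w_i from the coarse pass
    """
    if rng is None:
        rng = np.random.default_rng()
    pdf = weights / (weights.sum() + 1e-8)          # normalise (sum w_i <= 1)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])   # (N + 1,)
    u = rng.uniform(size=n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    # Invert the CDF linearly within the selected bin
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    return t_bins[idx] + frac * (t_bins[idx + 1] - t_bins[idx])
```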

Definition:

Positional Encoding

The positional encoding maps low-dimensional coordinates to a higher-dimensional space, enabling the MLP to learn high-frequency functions. For a scalar $p$:

$$\gamma(p) = \bigl[\sin(2^0 \pi p),\; \cos(2^0 \pi p),\; \ldots,\; \sin(2^{L-1} \pi p),\; \cos(2^{L-1} \pi p)\bigr].$$

For a 3D position $\mathbf{x} \in \mathbb{R}^3$, each coordinate is encoded independently, yielding a $6L$-dimensional vector. Typical values: $L = 10$ for position ($60$-dimensional), $L = 4$ for direction ($24$-dimensional).

Without positional encoding, the MLP's spectral bias causes it to learn only low-frequency functions, producing blurry reconstructions. The encoding effectively lifts the input into a space where the MLP can represent sharp edges and fine details.
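A NumPy sketch of $\gamma$ applied coordinate-wise; it groups all sines before all cosines per coordinate, which differs from the interleaved ordering above but spans the same feature space:

```python
import numpy as np

def positional_encoding(x, L):
    """Encode (..., D) coordinates into (..., 2 * L * D) features."""
    freqs = (2.0 ** np.arange(L)) * np.pi      # 2^l pi for l = 0 .. L-1
    angles = x[..., None] * freqs              # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

x = np.array([0.3, -0.7, 0.1])                 # a 3D position
print(positional_encoding(x, L=10).shape)      # (60,), matching the text
```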


Positional Encoding Frequencies (interactive figure)

Visualise how positional encoding lifts a 1D (or 3D) input into a higher-dimensional feature space. Each frequency band $2^l \pi$ captures spatial variation at a different scale. Low $L$ produces smooth representations; high $L$ captures fine detail but may overfit noise in the measurements.

Example: Training a NeRF from Posed Images

Describe the training procedure for a NeRF given $K = 100$ posed images $\{(I_k, \mathbf{P}_k)\}_{k=1}^K$, where $\mathbf{P}_k$ contains the camera extrinsics and intrinsics.
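One possible shape of that loop, sketched in PyTorch. The helpers `load_dataset`, `sample_rays`, and `render_rays` (and the module they are imported from) are hypothetical stand-ins for ray generation from the posed images and the quadrature above, not a real API; `TinyNeRF` is the sketch from earlier in this section.

```python
import torch
# Hypothetical module: ray generation from {(I_k, P_k)} and quadrature rendering.
from nerf_utils import load_dataset, sample_rays, render_rays

images, poses = load_dataset()                     # K = 100 posed images
model = TinyNeRF()
optimiser = torch.optim.Adam(model.parameters(), lr=5e-4)

for step in range(200_000):
    # One batch = a few thousand random rays drawn across all images
    rays_o, rays_d, target_rgb = sample_rays(images, poses, batch=4096)
    pred_rgb = render_rays(model, rays_o, rays_d, n_coarse=64, n_fine=128)
    loss = ((pred_rgb - target_rgb) ** 2).mean()   # photometric MSE per ray
    optimiser.zero_grad()
    loss.backward()      # gradients flow through the rendering quadrature
    optimiser.step()
```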

Definition:

Instant-NGP: Multi-Resolution Hash Encoding

Instant-NGP (Müller et al., 2022) replaces the slow combination of positional encoding and a deep MLP with a multi-resolution hash table:

  1. The scene is discretised into $L$ resolution levels, each with a grid of resolution $N_l = \lfloor N_{\min} \cdot b^l \rfloor$ for a per-level growth factor $b$.
  2. Grid vertices are mapped to a hash table of size $T$ with feature vectors of dimension $F$ per entry.
  3. For a query point $\mathbf{x}$, trilinear interpolation retrieves features at each level; the concatenated $L \cdot F$-dimensional vector feeds a small MLP (2 layers, 64 units).

Result: Training drops from hours to seconds. Rendering reaches near-real-time rates. The key insight is that hash collisions are tolerable --- they are resolved by gradient-based optimisation, which assigns different feature vectors to colliding entries based on the training loss.
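The spatial hash itself is tiny. A NumPy sketch of the per-level vertex hashing, using the XOR-of-prime-multiplied-coordinates hash from the Instant-NGP paper; the learnable feature table and the trilinear interpolation are omitted:

```python
import numpy as np

# Primes from the Instant-NGP spatial hash (the first coordinate uses pi_1 = 1).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint32)

def hash_index(grid_coords, table_size):
    """Map integer grid vertices (..., 3) to hash-table slots in [0, T)."""
    c = grid_coords.astype(np.uint32)
    h = (c[..., 0] * PRIMES[0]) ^ (c[..., 1] * PRIMES[1]) ^ (c[..., 2] * PRIMES[2])
    return h % np.uint32(table_size)         # uint32 arithmetic wraps, as intended

# The eight cell corners around a query point at one resolution level
x = np.array([0.37, 0.82, 0.15])
N_l = 128                                    # grid resolution at this level
base = np.floor(x * N_l).astype(np.int64)
offsets = np.stack(np.meshgrid([0, 1], [0, 1], [0, 1], indexing="ij"), -1).reshape(-1, 3)
slots = hash_index(base + offsets, table_size=2**19)   # T = 2^19 entries
```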

Definition:

Mip-NeRF: Anti-Aliased Volume Rendering

Mip-NeRF (Barron et al., 2021) addresses aliasing artifacts in NeRF by replacing point samples with conical frustums:

  • Each pixel subtends a cone (not a ray) through the scene.
  • The cone is divided into frustums between sample boundaries.
  • Each frustum is approximated by a 3D Gaussian $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
  • The positional encoding is replaced by an integrated positional encoding (IPE): the expected value of the encoding over the Gaussian, computed in closed form.

This eliminates the scale ambiguity of point-sampled NeRF and improves quality for both close-up and distant views.
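The closed form follows from $\mathbb{E}[\sin(a x)] = \sin(a\mu)\,e^{-a^2 \sigma^2 / 2}$ for $x \sim \mathcal{N}(\mu, \sigma^2)$, and likewise for cosine: wide Gaussians smoothly attenuate the high frequencies. A NumPy sketch for the diagonal-covariance case, reusing this chapter's $2^l \pi$ frequency convention (Mip-NeRF's own frequency ladder differs slightly):

```python
import numpy as np

def integrated_pos_enc(mu, var, L):
    """Expected encoding of x ~ N(mu, diag(var)); mu, var: (..., D)."""
    freqs = (2.0 ** np.arange(L)) * np.pi
    a_mu = mu[..., None] * freqs                      # (..., D, L)
    damp = np.exp(-0.5 * var[..., None] * freqs**2)   # per-frequency attenuation
    enc = np.concatenate([np.sin(a_mu) * damp, np.cos(a_mu) * damp], axis=-1)
    return enc.reshape(*mu.shape[:-1], -1)            # (..., 2 * L * D)
```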

For RF imaging, Mip-NeRF's cone-tracing philosophy maps naturally to modelling antenna beam widths: a wide beam illuminates a cone, not a ray, and the received signal integrates over that cone.

Historical Note: The NeRF Revolution (2020-2024)


When Mildenhall et al. published NeRF in 2020, the paper's ability to synthesise photorealistic novel views from a handful of photographs took the computer vision community by storm. Within two years, over 500 papers extended NeRF to dynamic scenes, large-scale environments, generative modelling, and --- crucially for this chapter --- RF signal propagation.

The original NeRF required 12 hours of training per scene. Instant-NGP (Müller et al., 2022) compressed this to under 5 seconds. 3D Gaussian Splatting (Kerbl et al., 2023) then achieved real-time rendering at comparable quality. The speed of this evolution --- from a novel idea to a mature, deployable technology in under four years --- is remarkable in computational science.


Common Mistake: NeRF Is Slow Without Acceleration

Mistake:

Assuming that the original NeRF can be used for real-time RF propagation prediction in deployed systems.

Correction:

The original NeRF requires $\sim 192$ MLP evaluations per ray, and a single RSS prediction casts multiple rays. For real-time applications, use one of:

  • Instant-NGP (hash encoding, $10\times$--$100\times$ speedup);
  • Baked NeRF (pre-compute density on a sparse grid); or
  • 3D Gaussian Splatting (explicit primitives, $>100$ FPS).

For RF applications where training happens offline and inference queries are few (e.g., channel prediction for a handful of Tx-Rx pairs), the original NeRF's speed is acceptable.

Quick Check

In the NeRF architecture, which quantity depends on the viewing direction $\mathbf{d}$?

Volume density $\sigma$

Colour $\mathbf{c}$

Both density and colour

Neither

Volume Density

The function $\sigma(\mathbf{x}) \geq 0$ representing the differential probability of a ray being absorbed at point $\mathbf{x}$. Units: inverse length (m$^{-1}$). In the NeRF context, high density indicates opaque material (surfaces), while $\sigma \approx 0$ indicates free space.

Related: Transmittance

Transmittance

The quantity $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$ representing the fraction of light (or RF energy) that survives propagation from $t_n$ to $t$ along a ray. $T(t) = 1$ means no absorption; $T(t) = 0$ means full absorption.

Related: Volume Density

Positional Encoding

A fixed mapping $\gamma \colon \mathbb{R} \to \mathbb{R}^{2L}$ that lifts scalar coordinates to a high-dimensional space using sinusoidal functions at exponentially increasing frequencies. This overcomes the spectral bias of MLPs and enables learning of high-frequency scene features.

Related: Hash Encoding (Instant-NGP)

Hash Encoding (Instant-NGP)

A learnable spatial encoding that stores feature vectors in a multi-resolution hash table. Query points are mapped to hash table entries via spatial hashing at each resolution level, and features are retrieved by trilinear interpolation. Replaces positional encoding with orders-of-magnitude speedup.

Related: Positional Encoding

Key Takeaway

NeRF represents 3D scenes as continuous volumetric fields $(\sigma, \mathbf{c})$ parameterised by an MLP with positional encoding. Differentiable volume rendering integrates density and colour along camera rays, enabling end-to-end training from posed images. Instant-NGP and Mip-NeRF address NeRF's speed and aliasing limitations, respectively --- both improvements are directly relevant to the RF adaptations in the following sections.