Neural Radiance Fields Recap

Why NeRF Matters for RF Imaging

Neural radiance fields (NeRFs) represent a 3D scene as a continuous function parameterised by a neural network, rather than as a discrete voxel grid or mesh. This shift from discrete to continuous is precisely what makes NeRF attractive for RF imaging: the scene can be queried at arbitrary positions, and the rendering integral is differentiable end-to-end. Before we adapt NeRF for radio frequencies, we need a solid understanding of how it works in its native optical domain.

The golden thread of this chapter: NeRF replaces the voxel grid and the sensing matrix $\mathbf{A}$ with a continuous neural scene function, but to work for RF it must respect the physics of complex-valued fields, diffraction, and specular scattering.

Definition:

Neural Radiance Field (NeRF)

A Neural Radiance Field represents a 3D scene as a continuous volumetric function parameterised by an MLP:

$$F_\theta \colon (\mathbf{x}, \mathbf{d}) \mapsto (\sigma, \mathbf{c}),$$

where $\mathbf{x} \in \mathbb{R}^3$ is a 3D position, $\mathbf{d} \in \mathbb{S}^2$ is a viewing direction, $\sigma \geq 0$ is the volume density (opacity per unit length), and $\mathbf{c} \in [0,1]^3$ is the view-dependent colour (RGB).

The density $\sigma$ depends only on position (geometry is view-independent), while the colour depends on both position and direction (appearance may be view-dependent):

$$\sigma = g_\theta(\gamma(\mathbf{x})), \qquad \mathbf{c} = h_\theta(\gamma(\mathbf{x}), \gamma(\mathbf{d})).$$
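As a concrete reference, here is a minimal sketch of this two-branch factorisation, assuming PyTorch. The class name `TinyNeRF` and the layer sizes are illustrative choices for this sketch (the original paper uses an 8-layer, 256-unit trunk with a skip connection), not the reference implementation.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Sketch of F_theta: (gamma(x), gamma(d)) -> (sigma, c).

    Density depends only on position; colour also sees the direction.
    """
    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)        # sigma = g_theta(gamma(x))
        self.colour_head = nn.Sequential(            # c = h_theta(gamma(x), gamma(d))
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),  # RGB in [0, 1]^3
        )

    def forward(self, pos_enc, dir_enc):
        h = self.trunk(pos_enc)
        sigma = torch.relu(self.sigma_head(h))       # enforce sigma >= 0
        colour = self.colour_head(torch.cat([h, dir_enc], dim=-1))
        return sigma, colour
```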

Definition:

Differentiable Volume Rendering

The colour $\hat{C}(\mathbf{r})$ of a pixel is computed by casting a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ from the camera origin $\mathbf{o}$ through the pixel and integrating:

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,$$

where the transmittance is

$$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right),$$

representing the probability that the ray travels from $t_n$ to $t$ without being absorbed.

Numerical approximation. With $N$ stratified samples $\{t_i\}_{i=1}^N$ along the ray:

$$\hat{C} \approx \sum_{i=1}^{N} T_i\,(1 - e^{-\sigma_i \delta_i})\,\mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right),$$

where $\delta_i = t_{i+1} - t_i$. This computation is differentiable with respect to $\theta$ via backpropagation.
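In code, the quadrature is a short array computation. A minimal NumPy sketch, assuming the densities and colours have already been evaluated at the sample points; the function name and argument layout are conventions of this sketch, not taken from any reference implementation:

```python
import numpy as np

def render_ray(sigmas, colours, t_vals):
    """Quadrature approximation of the volume-rendering integral.

    sigmas:  (N,) non-negative densities at the samples
    colours: (N, 3) RGB values at the samples
    t_vals:  (N + 1,) sample boundaries, so delta_i = t[i+1] - t[i]
    """
    deltas = np.diff(t_vals)                    # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)     # alpha_i = 1 - exp(-sigma_i delta_i)
    # T_i = prod_{j<i} (1 - alpha_j); T_1 = 1 is the empty product
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                    # w_i = T_i alpha_i
    return weights @ colours, weights           # C_hat = sum_i w_i c_i
```

Returning the weights alongside the colour is deliberate: the theorem below reuses them for hierarchical sampling.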

Theorem: Volume Rendering as Weighted Combination

Define the alpha value $\alpha_i = 1 - e^{-\sigma_i \delta_i}$ at sample $i$. Then the quadrature approximation can be written as:

$$\hat{C} = \sum_{i=1}^{N} w_i\, \mathbf{c}_i, \qquad w_i = T_i \alpha_i, \qquad T_i = \prod_{j=1}^{i-1}(1 - \alpha_j).$$

The weights $\{w_i\}$ satisfy $\sum_{i=1}^N w_i \leq 1$, with equality when the ray is fully absorbed before $t_f$. Once normalised, these weights define a probability distribution over sample positions, which enables hierarchical sampling in the fine network.
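Hierarchical sampling then amounts to inverting the piecewise-constant CDF built from the normalised weights. A NumPy sketch; the original implementation additionally stratifies $u$ and merges the fine samples with the coarse ones, both omitted here:

```python
import numpy as np

def sample_pdf(t_bins, weights, n_fine, rng=None):
    """Draw fine samples from the distribution defined by coarse weights.

    t_bins:  (N + 1,) bin edges along the ray
    weights: (N,) rendering weights w_i from the coarse pass
    """
    if rng is None:
        rng = np.random.default_rng()
    pdf = weights / (weights.sum() + 1e-8)          # normalise (sum w_i <= 1)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])   # (N + 1,)
    u = rng.uniform(size=n_fine)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    # Invert the CDF linearly within the selected bin
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    return t_bins[idx] + frac * (t_bins[idx + 1] - t_bins[idx])
```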

Definition:

Positional Encoding

The positional encoding maps low-dimensional coordinates to a higher-dimensional space, enabling the MLP to learn high-frequency functions. For a scalar $p$:

$$\gamma(p) = \bigl[\sin(2^0 \pi p),\; \cos(2^0 \pi p),\; \ldots,\; \sin(2^{L-1} \pi p),\; \cos(2^{L-1} \pi p)\bigr].$$

For a 3D position $\mathbf{x} \in \mathbb{R}^3$, each coordinate is encoded independently, yielding a $6L$-dimensional vector. Typical values: $L = 10$ for position ($60$-dimensional), $L = 4$ for direction ($24$-dimensional).

Without positional encoding, the MLP's spectral bias causes it to learn only low-frequency functions, producing blurry reconstructions. The encoding effectively lifts the input into a space where the MLP can represent sharp edges and fine details.
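A NumPy sketch of $\gamma$ applied coordinate-wise; it groups all sines before all cosines per coordinate, which differs from the interleaved ordering above but spans the same feature space:

```python
import numpy as np

def positional_encoding(x, L):
    """Encode (..., D) coordinates into (..., 2 * L * D) features."""
    freqs = (2.0 ** np.arange(L)) * np.pi      # 2^l pi for l = 0 .. L-1
    angles = x[..., None] * freqs              # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

x = np.array([0.3, -0.7, 0.1])                 # a 3D position
print(positional_encoding(x, L=10).shape)      # (60,), matching the text
```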


Positional Encoding Frequencies (interactive figure)

Visualise how positional encoding lifts a 1D (or 3D) input into a higher-dimensional feature space. Each frequency band $2^l \pi$ captures spatial variation at a different scale. Low $L$ produces smooth representations; high $L$ captures fine detail but may overfit noise in the measurements.

Example: Training a NeRF from Posed Images

Describe the training procedure for a NeRF given $K = 100$ posed images $\{(I_k, \mathbf{P}_k)\}_{k=1}^K$, where $\mathbf{P}_k$ contains the camera extrinsics and intrinsics.
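One possible shape of that loop, sketched in PyTorch. The helpers `load_dataset`, `sample_rays`, and `render_rays` (and the module they are imported from) are hypothetical stand-ins for ray generation from the posed images and the quadrature above, not a real API; `TinyNeRF` is the sketch from earlier in this section.

```python
import torch
# Hypothetical module: ray generation from {(I_k, P_k)} and quadrature rendering.
from nerf_utils import load_dataset, sample_rays, render_rays

images, poses = load_dataset()                     # K = 100 posed images
model = TinyNeRF()
optimiser = torch.optim.Adam(model.parameters(), lr=5e-4)

for step in range(200_000):
    # One batch = a few thousand random rays drawn across all images
    rays_o, rays_d, target_rgb = sample_rays(images, poses, batch=4096)
    pred_rgb = render_rays(model, rays_o, rays_d, n_coarse=64, n_fine=128)
    loss = ((pred_rgb - target_rgb) ** 2).mean()   # photometric MSE per ray
    optimiser.zero_grad()
    loss.backward()      # gradients flow through the rendering quadrature
    optimiser.step()
```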

Definition:

Instant-NGP: Multi-Resolution Hash Encoding

Instant-NGP (Müller et al., 2022) replaces the slow combination of positional encoding and a deep MLP with a multi-resolution hash table:

  1. The scene is discretised into $L$ resolution levels, each with a grid of resolution $N_l = \lfloor N_{\min} \cdot b^l \rfloor$ for a per-level growth factor $b$.
  2. Grid vertices are mapped to a hash table of size $T$ with feature vectors of dimension $F$ per entry.
  3. For a query point $\mathbf{x}$, trilinear interpolation retrieves features at each level; the concatenated $L \cdot F$-dimensional vector feeds a small MLP (2 layers, 64 units).

Result: Training drops from hours to seconds. Rendering reaches near-real-time rates. The key insight is that hash collisions are tolerable --- they are resolved by gradient-based optimisation, which assigns different feature vectors to colliding entries based on the training loss.
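The spatial hash itself is tiny. A NumPy sketch of the per-level vertex hashing, using the XOR-of-prime-multiplied-coordinates hash from the Instant-NGP paper; the learnable feature table and the trilinear interpolation are omitted:

```python
import numpy as np

# Primes from the Instant-NGP spatial hash (the first coordinate uses pi_1 = 1).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint32)

def hash_index(grid_coords, table_size):
    """Map integer grid vertices (..., 3) to hash-table slots in [0, T)."""
    c = grid_coords.astype(np.uint32)
    h = (c[..., 0] * PRIMES[0]) ^ (c[..., 1] * PRIMES[1]) ^ (c[..., 2] * PRIMES[2])
    return h % np.uint32(table_size)         # uint32 arithmetic wraps, as intended

# The eight cell corners around a query point at one resolution level
x = np.array([0.37, 0.82, 0.15])
N_l = 128                                    # grid resolution at this level
base = np.floor(x * N_l).astype(np.int64)
offsets = np.stack(np.meshgrid([0, 1], [0, 1], [0, 1], indexing="ij"), -1).reshape(-1, 3)
slots = hash_index(base + offsets, table_size=2**19)   # T = 2^19 entries
```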

Definition:

Mip-NeRF: Anti-Aliased Volume Rendering

Mip-NeRF (Barron et al., 2021) addresses aliasing artifacts in NeRF by replacing point samples with conical frustums:

  • Each pixel subtends a cone (not a ray) through the scene.
  • The cone is divided into frustums between sample boundaries.
  • Each frustum is approximated by a 3D Gaussian $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
  • The positional encoding is replaced by an integrated positional encoding (IPE): the expected value of the encoding over the Gaussian, computed in closed form.

This eliminates the scale ambiguity of point-sampled NeRF and improves quality for both close-up and distant views.
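The closed form follows from $\mathbb{E}[\sin(a x)] = \sin(a\mu)\,e^{-a^2 \sigma^2 / 2}$ for $x \sim \mathcal{N}(\mu, \sigma^2)$, and likewise for cosine: wide Gaussians smoothly attenuate the high frequencies. A NumPy sketch for the diagonal-covariance case, reusing this chapter's $2^l \pi$ frequency convention (Mip-NeRF's own frequency ladder differs slightly):

```python
import numpy as np

def integrated_pos_enc(mu, var, L):
    """Expected encoding of x ~ N(mu, diag(var)); mu, var: (..., D)."""
    freqs = (2.0 ** np.arange(L)) * np.pi
    a_mu = mu[..., None] * freqs                      # (..., D, L)
    damp = np.exp(-0.5 * var[..., None] * freqs**2)   # per-frequency attenuation
    enc = np.concatenate([np.sin(a_mu) * damp, np.cos(a_mu) * damp], axis=-1)
    return enc.reshape(*mu.shape[:-1], -1)            # (..., 2 * L * D)
```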

For RF imaging, Mip-NeRF's cone-tracing philosophy maps naturally to modelling antenna beam widths: a wide beam illuminates a cone, not a ray, and the received signal integrates over that cone.

Historical Note: The NeRF Revolution (2020-2024)


When Mildenhall et al. published NeRF in 2020, the paper's ability to synthesise photorealistic novel views from a handful of photographs took the computer vision community by storm. Within two years, over 500 papers extended NeRF to dynamic scenes, large-scale environments, generative modelling, and --- crucially for this chapter --- RF signal propagation.

The original NeRF required 12 hours of training per scene. Instant-NGP (Müller et al., 2022) compressed this to under 5 seconds. 3D Gaussian Splatting (Kerbl et al., 2023) then achieved real-time rendering at comparable quality. The speed of this evolution --- from a novel idea to a mature, deployable technology in under four years --- is remarkable in computational science.


Common Mistake: NeRF Is Slow Without Acceleration

Mistake:

Assuming that the original NeRF can be used for real-time RF propagation prediction in deployed systems.

Correction:

The original NeRF requires $\sim 192$ MLP evaluations per ray, and a single RSS prediction casts multiple rays. For real-time applications, use one of:

  • Instant-NGP (hash encoding, $10\times$--$100\times$ speedup);
  • Baked NeRF (pre-compute density on a sparse grid); or
  • 3D Gaussian Splatting (explicit primitives, $>100$ FPS).

For RF applications where training happens offline and inference queries are few (e.g., channel prediction for a handful of Tx-Rx pairs), the original NeRF's speed is acceptable.

Quick Check

In the NeRF architecture, which quantity depends on the viewing direction $\mathbf{d}$?

Volume density $\sigma$

Colour $\mathbf{c}$

Both density and colour

Neither

Volume Density

The function $\sigma(\mathbf{x}) \geq 0$ representing the differential probability of a ray being absorbed at point $\mathbf{x}$. Units: inverse length (m$^{-1}$). In the NeRF context, high density indicates opaque material (surfaces), while $\sigma \approx 0$ indicates free space.

Related: Transmittance

Transmittance

The quantity $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$ representing the fraction of light (or RF energy) that survives propagation from $t_n$ to $t$ along a ray. $T(t) = 1$ means no absorption; $T(t) = 0$ means full absorption.

Related: Volume Density

Positional Encoding

A fixed mapping $\gamma \colon \mathbb{R} \to \mathbb{R}^{2L}$ that lifts scalar coordinates to a high-dimensional space using sinusoidal functions at exponentially increasing frequencies. This overcomes the spectral bias of MLPs and enables learning of high-frequency scene features.

Related: Hash Encoding (Instant-NGP)

Hash Encoding (Instant-NGP)

A learnable spatial encoding that stores feature vectors in a multi-resolution hash table. Query points are mapped to hash table entries via spatial hashing at each resolution level, and features are retrieved by trilinear interpolation. Replaces positional encoding with orders-of-magnitude speedup.

Related: Positional Encoding

Key Takeaway

NeRF represents 3D scenes as continuous volumetric fields $(\sigma, \mathbf{c})$ parameterised by an MLP with positional encoding. Differentiable volume rendering integrates density and colour along camera rays, enabling end-to-end training from posed images. Instant-NGP and Mip-NeRF address NeRF's speed and aliasing limitations, respectively --- both improvements are directly relevant to the RF adaptations in the following sections.