Paper summary: NeRF — Neural Radiance Fields for View Synthesis

At the time of publication, the NeRF model produced state-of-the-art novel view synthesis. What this means is that, given a set of images of a specific scene, NeRF learns to generate views from viewing angles it has not seen before. It does so by encoding a representation of the scene in its weights that can be queried.

Sanna Persson
Jul 15, 2022

Imagine that you have a dozen images of a scene and would like to generate a smooth video from them; this is exactly the kind of scenario NeRF is designed for.

Example of a rendered view not seen during training, compared to the ground-truth image

Key ideas

The key idea of NeRF is quite unusual. The model is trained to generalize to new viewing angles; however, it is trained only for a specific scene. This means that you cannot use the model to generate images of another scene without completely retraining it. Essentially, NeRF forms a space-efficient representation of a single scene at the cost of a significant amount of computation.

Inference with a model trained on a set of images of a specific scene is done by inputting a sampled set of three-dimensional spatial coordinates together with the camera angle we are interested in. A trained NeRF then generates a realistic image of how that scene would appear from this camera angle.
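
In the paper's notation, this query is a function F_Θ that maps a 3D position and a viewing direction to a color and a volume density, roughly:

```latex
F_\Theta : (\mathbf{x}, \mathbf{d}) \longmapsto (\mathbf{c}, \sigma),
\qquad \mathbf{x} = (x, y, z), \quad \mathbf{c} = (r, g, b)
```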

Model overview

The model pipeline consists of three important parts. First, we sample a large set of spatial coordinates in three dimensions for each pixel in the image. We can imagine these as forming a light ray from the camera position towards the scene, where the ray points in the direction given by the camera angle. We input the sampled points together with the viewing angle of the image to the model. Each point is processed independently by the model, meaning that no information about which image the coordinates belong to is given except through the viewing angle.
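
As a concrete illustration, here is a minimal sketch of how points could be sampled along a single ray, assuming stratified sampling over a near/far depth range as described in the paper (the function and variable names are my own, not the authors' code):

```python
import numpy as np

def sample_points_along_ray(origin, direction, near, far, n_samples, rng=None):
    """Stratified sampling of 3D points along one camera ray.

    `origin` and `direction` are length-3 arrays; `near`/`far` bound the
    scene depth. Returns (points, t_vals) with shapes (n_samples, 3)
    and (n_samples,).
    """
    rng = rng or np.random.default_rng()
    # Split [near, far] into n_samples evenly spaced bins ...
    bin_edges = np.linspace(near, far, n_samples + 1)
    # ... and draw one depth uniformly at random inside each bin.
    t_vals = rng.uniform(bin_edges[:-1], bin_edges[1:])
    # A point on the ray at depth t is origin + t * direction.
    points = origin[None, :] + t_vals[:, None] * direction[None, :]
    return points, t_vals
```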

The model architecture is fairly simple and consists of a stack of fully-connected layers with 256 features that are mapped to four output nodes. For each coordinate, the model outputs RGB values and a volume density. Now, what exactly is meant by volume density? In the paper it is described as the “differential probability of a ray terminating at an infinitesimal particle at location x”, which is, more simply, how dense the matter is at a particular location. It determines how much the color of this point contributes to the pixel, as well as whether points further along the ray are visible from the specific camera angle.

Formula for a traditional volume rendering method
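
For reference, the volume rendering integral from the paper, where C(r) is the expected color of the camera ray r(t) = o + t d and T(t) is the accumulated transmittance:

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt,
\qquad
T(t) = \exp\!\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds \right)
```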

The formula for volume rendering looks daunting, but we can view it as a weighted mean of the colors along the ray, weighted by the volume density and the transmittance. The transmittance can be viewed as how much light has been transmitted up until the point t. The intuition is that if there is a solid object closer to the camera than the point t we are looking at, the transmittance at that point will be low, and the color at point t will influence the pixel color to a lesser extent. The same holds for a low volume density, which means that the color at the particular point r(t) will have little influence on the pixel color. For example, a point in the scene that is just air can appear to have different colors depending on the angle you view it from, but it is the object behind it that determines the color we see in a photo of the scene.
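
In practice the integral is approximated with a sum over the sampled points. A minimal sketch of that quadrature, following the paper's discretization with per-sample opacities 1 - exp(-σδ) (array names are my own):

```python
import numpy as np

def render_ray_color(rgb, sigma, t_vals, far):
    """Numerical quadrature of the volume rendering integral for one ray.

    `rgb` has shape (n_samples, 3); `sigma` and `t_vals` have shape
    (n_samples,). Returns the composited RGB color of the ray.
    """
    # Distance between adjacent samples (pad the last interval up to `far`).
    deltas = np.append(t_vals[1:] - t_vals[:-1], far - t_vals[-1])
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity contributed by sample i.
    alphas = 1.0 - np.exp(-sigma * deltas)
    # T_i: transmittance, i.e. how unobstructed the ray is up to sample i.
    transmittance = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    # The pixel color is the transmittance- and opacity-weighted sum of colors.
    weights = transmittance * alphas
    return (weights[:, None] * rgb).sum(axis=0)
```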

From the volume rendering method we obtain an image, and during training we compare its pixel values with the pixel values of the training images. Remember that we give the model nothing about the color or composition of the scene, only sampled points for it to predict on.

Network architecture

The network architecture is fairly simple, with only fully-connected layers. A notable detail is that, at first, only the spatial coordinates are fed into the network, and the volume density is output independently of the viewing angle. Only before the last few layers is the viewing angle fed in, to generate the RGB color output. They also add a skip-connection that re-injects the coordinates in the middle of the network.

Model architecture

We can further note that a function denoted γ is applied to the input, which will be explained under positional encodings.
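
A rough PyTorch sketch of such an architecture is given below; the layer sizes follow the paper (eight 256-unit layers, a skip connection, and a 128-unit color branch), but details such as the exact skip position are simplified and the class name is my own:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """Simplified sketch of the NeRF MLP: density from position only,
    color from position features plus the encoded viewing direction."""

    def __init__(self, pos_dim=60, dir_dim=24, hidden=256):
        super().__init__()
        # First block: layers on the encoded position only.
        self.block1 = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Second block: the skip-connection re-injects the encoded position.
        self.block2 = nn.Sequential(
            nn.Linear(hidden + pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Volume density depends on position only.
        self.sigma_head = nn.Linear(hidden, 1)
        # Color additionally sees the encoded viewing direction.
        self.feature = nn.Linear(hidden, hidden)
        self.rgb_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc):
        h = self.block1(x_enc)
        h = self.block2(torch.cat([h, x_enc], dim=-1))
        sigma = torch.relu(self.sigma_head(h))   # density is non-negative
        feat = self.feature(h)
        rgb = self.rgb_head(torch.cat([feat, d_enc], dim=-1))
        return rgb, sigma
```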

Training and tricks

The model is trained separately on each scene. In practice this means that they have a couple of dozen images of a specific scene where the viewing angle of each image is known. The loss function is an L2 loss comparing the pixel values of the training image and the rendered image, summed over the rays in a batch. Notably, there are two terms in the loss function due to the sampling method they use to improve the efficiency of the rendering, which they call hierarchical volume sampling.

NeRF loss function
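
Reconstructed from the paper, the loss sums over the set of rays R in each batch and penalizes both the coarse rendering Ĉ_c and the fine rendering Ĉ_f against the ground-truth pixel color C(r):

```latex
\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}}
\left[ \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2
     + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \right]
```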

Hierarchical volume sampling

Hierarchical volume sampling builds on the idea that it is inefficient to sample points with equal density throughout the entire scene. The points most important for determining the color of a ray are those where the volume density is high, since that signifies there is an object at that position. To solve this computational problem, they train two separate models, denoted the coarse model c and the fine model f in the loss function. For the coarse model they sample points along each ray on a coarse grid. From the output of the coarse model they determine which regions along the rays have high density. In these regions they sample additional points, which are fed to the fine model. The output of the fine model is the image produced during inference.
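
A simplified sketch of how the fine-sample depths could be drawn by inverse-transform sampling of the coarse weights (the paper additionally interpolates within each bin; function and variable names here are my own):

```python
import numpy as np

def sample_fine_depths(t_mids, coarse_weights, n_fine, rng=None):
    """Draw extra depths where the coarse pass placed high weights.

    `t_mids` are the midpoints of the coarse bins and `coarse_weights`
    the per-sample weights (transmittance * alpha) from the coarse model.
    """
    rng = rng or np.random.default_rng()
    # Normalize the coarse weights into a probability distribution over bins.
    pdf = coarse_weights / (coarse_weights.sum() + 1e-8)
    cdf = np.cumsum(pdf)
    # Uniform samples mapped through the inverse CDF land more often
    # in bins where the coarse model saw high density.
    u = rng.uniform(size=n_fine)
    idx = np.clip(np.searchsorted(cdf, u), 0, len(t_mids) - 1)
    return np.sort(t_mids[idx])
```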

Positional encodings

During training, the authors noticed that inputting only the raw spatial coordinates and viewing angle did not even allow the network to learn to reproduce the training images. They explain this by the fact that neural networks are biased towards producing low-frequency functions. To mitigate the problem, they map the input using Fourier features, as given by the formula below. We can view this as a sort of feature engineering, or an embedding of the input into a higher-dimensional space. In particular, these features map the input to a higher-frequency space.

Fourier features to encode input
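
A small NumPy sketch of such an encoding, assuming the γ from the paper with frequencies π, 2π, …, 2^(L-1)π applied to each (normalized) coordinate; the paper uses L = 10 for positions and L = 4 for viewing directions:

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Map each coordinate of `p` (shape (..., dim)) to Fourier features.

    With num_freqs = L, every input coordinate becomes 2 * L features,
    so a 3D position with L = 10 becomes a 60-dimensional vector.
    """
    freqs = 2.0 ** np.arange(num_freqs) * np.pi   # pi, 2*pi, 4*pi, ...
    angles = p[..., None] * freqs                 # shape (..., dim, L)
    # Concatenating sin and cos (rather than interleaving) is equivalent
    # up to a permutation of the features.
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)
```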

Summary

The application areas of view synthesis and of forming a 3D representation of an object from a limited number of views are many, and at the time of publication NeRF took a step forward in generating more detailed and higher-quality images compared to previous models. Since NeRF, several papers building on its ideas have been published, most recently [Instant Neural Graphics Primitives with a Multiresolution Hash Encoding](https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf) from Nvidia.

Link to paper: NeRF — Representing Scenes as Neural Radiance Fields for View Synthesis, by Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng

