NeRF Series (1): NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, Paper Walkthrough and Formula Derivation (Part 1)


NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis



Project: NeRF: Neural Radiance Fields


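NeRF (Neural Radiance Fields) is a technique that uses a neural network to represent a 3D scene and to synthesize new views of it. It was introduced in 2020 by Ben Mildenhall et al. in the paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis".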




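NeRF produces impressive results when generating photorealistic images with complex lighting and view-dependent effects, but it also has limitations. Rendering a single pixel requires evaluating the network at many sample points along that pixel's camera ray, so NeRF is computationally expensive and slow. Many extensions have been proposed to address these and other limitations; for example, NeRF++ extends NeRF to large, unbounded scenes, and NeRF in the Wild handles uncontrolled outdoor scenes.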


NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis: Paper Walkthrough


We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.

Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
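To make the 5D input concrete, here is a minimal sketch that turns a viewing direction given as two angles (θ, φ) into a 3D unit vector and pairs it with a position. The angle convention (θ as the polar angle from +z, φ as the azimuth from +x) and the function name `direction_from_angles` are assumptions made for this example; the abstract does not fix a convention.

```python
import math

def direction_from_angles(theta, phi):
    """Convert spherical angles to a 3D unit vector.

    Assumed convention: theta is the polar angle measured from +z,
    phi is the azimuth measured from +x in the xy-plane.
    """
    sin_theta = math.sin(theta)
    return (
        sin_theta * math.cos(phi),
        sin_theta * math.sin(phi),
        math.cos(theta),
    )

# A 5D query is a 3D position plus a 2D viewing direction.
position = (0.5, -0.2, 1.0)
direction = direction_from_angles(math.pi / 2, 0.0)  # points along +x
```

Two angles are enough because a direction has unit length, which is why the input is 5D rather than 6D.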

We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image.

Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses.

We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.

View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.



Keywords: scene representation, view synthesis, image-based rendering, volume rendering, 3D deep learning


1 Introduction

In this work, we address the long-standing problem of view synthesis in a new way by directly optimizing parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images.

We represent a static scene as a continuous 5D function that outputs the radiance emitted in each direction (θ,φ) at each point (x,y,z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x,y,z). Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color. To render this neural radiance field (NeRF) from a particular viewpoint we:

1) march camera rays through the scene to generate a sampled set of 3D points,

2) use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities, and

3) use classical volume rendering techniques to accumulate those colors and densities into a 2D image.
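The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: `toy_field` is a hypothetical stand-in for the trained MLP (a solid red sphere), samples are taken at evenly spaced bin midpoints rather than with the paper's sampling scheme, and the accumulation uses the standard quadrature in which each sample contributes its color weighted by its opacity and by the transmittance of everything in front of it.

```python
import math

def toy_field(point, direction):
    """Hypothetical stand-in for the trained MLP.

    Returns (rgb, sigma) for a solid red sphere of radius 1 at the origin.
    The direction is ignored here; a real NeRF uses it for
    view-dependent color.
    """
    inside = sum(c * c for c in point) <= 1.0
    sigma = 10.0 if inside else 0.0
    return (1.0, 0.0, 0.0), sigma

def render_ray(origin, direction, near, far, n_samples, field=toy_field):
    """Render one pixel color by numerical quadrature along a ray."""
    step = (far - near) / n_samples
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light that has survived so far
    for i in range(n_samples):
        # Step 1: sample a 3D point at the midpoint of each depth bin.
        t = near + (i + 0.5) * step
        point = tuple(o + t * d for o, d in zip(origin, direction))
        # Step 2: query the field for color and density.
        rgb, sigma = field(point, direction)
        # Step 3: accumulate color, weighted by opacity and transmittance.
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance

# A ray through the sphere comes back red; a ray that misses stays black.
hit_color, hit_T = render_ray((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), 0.0, 6.0, 64)
miss_color, miss_T = render_ray((2.0, 0.0, -3.0), (0.0, 0.0, 1.0), 0.0, 6.0, 64)
```

Every operation in the loop is a smooth function of the field's outputs, which is what makes the rendered color differentiable with respect to the network weights.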

Because this process is naturally differentiable, we can use gradient descent to optimize this model by minimizing the error between each observed image and the corresponding views rendered from our representation.

Minimizing this error across multiple views encourages the network to predict a coherent model of the scene by assigning high volume densities and accurate colors to the locations that contain the true underlying scene content.
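In a simplified form, the error being minimized can be written as a sum of squared differences between rendered and observed pixel colors, followed by a gradient step. The notation here is chosen for this sketch and is not taken from this part of the text: $\mathcal{R}$ is a batch of camera rays, $\hat{C}$ the rendered color, $C$ the observed color, $\Theta$ the network weights, and $\eta$ a learning rate. The paper's full objective, given in a later section, may contain additional terms.

```latex
\mathcal{L}(\Theta) = \sum_{\mathbf{r} \in \mathcal{R}}
  \left\lVert \hat{C}(\mathbf{r}; \Theta) - C(\mathbf{r}) \right\rVert_2^2,
\qquad
\Theta \leftarrow \Theta - \eta \, \nabla_{\Theta} \mathcal{L}(\Theta)
```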

Figure 2 visualizes this overall pipeline.

We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation.
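A minimal sketch of the positional encoding idea, assuming the usual sinusoidal form with frequencies that double at each level (2^0·π, 2^1·π, ...), applied separately to each input component. The function names and the choice of 10 frequency levels in the usage line are illustrative assumptions.

```python
import math

def positional_encoding(p, num_freqs):
    """Map a scalar p to 2 * num_freqs sinusoidal features.

    Frequencies double at each level: 2^0 * pi, 2^1 * pi, ...
    """
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * p
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

def encode_vector(v, num_freqs):
    """Encode each component of a vector and concatenate the results."""
    encoded = []
    for component in v:
        encoded.extend(positional_encoding(component, num_freqs))
    return encoded

# A 3D position with 10 frequency levels becomes 3 * 2 * 10 = 60 features.
encoded_position = encode_vector((0.5, -0.2, 1.0), 10)
```

Two nearby inputs that differ only slightly produce very different high-frequency features, which gives the MLP an easier way to represent fine detail than raw coordinates do.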

Our approach inherits the benefits of volumetric representations: both can represent complex real-world geometry and appearance and are well suited for gradient-based optimization using projected images. Crucially, our method overcomes the prohibitive storage costs of discretized voxel grids when modeling complex scenes at high-resolutions. In summary, our technical contributions are:

– An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.

– A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy to allocate the MLP’s capacity towards space with visible scene content.

– A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.
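The hierarchical sampling strategy in the second contribution can be illustrated with inverse transform sampling: weights from a first, coarse pass along the ray are treated as a piecewise-constant probability density, and new sample depths are drawn from it so that they concentrate where the coarse pass found visible content. This is a sketch under assumptions; `sample_from_weights`, its arguments, and the uniform fallback are hypothetical choices for this example, not the paper's code.

```python
import bisect
import random

def sample_from_weights(bin_edges, weights, n_samples, rng=None):
    """Draw sample depths from a piecewise-constant PDF over ray bins.

    bin_edges: sorted depths t_0 < t_1 < ... < t_N (N + 1 values).
    weights:   N non-negative coarse weights, one per bin.
    """
    rng = rng or random.Random(0)
    total = sum(weights)
    if total <= 0.0:
        # Coarse pass found nothing visible: fall back to a uniform PDF.
        weights = [1.0] * len(weights)
        total = float(len(weights))
    # Build the cumulative distribution function at the bin edges.
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)
    samples = []
    for _ in range(n_samples):
        u = rng.random()
        # Find the bin whose CDF interval contains u.
        i = bisect.bisect_right(cdf, u) - 1
        i = min(max(i, 0), len(weights) - 1)
        # Place the sample inside the bin by linear interpolation.
        width = cdf[i + 1] - cdf[i]
        frac = (u - cdf[i]) / width if width > 0.0 else 0.5
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)

# All of the coarse weight sits in the third bin, so every fine sample lands there.
fine_depths = sample_from_weights([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 0.0], 8)
```

Empty space receives few or no fine samples, so network queries are spent on the parts of the ray that actually affect the rendered color.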

We demonstrate that our resulting neural radiance field method quantitatively and qualitatively outperforms state-of-the-art view synthesis methods, including works that fit neural 3D representations to scenes as well as works that train deep convolutional networks to predict sampled volumetric representations. As far as we know, this paper presents the first continuous neural scene representation that is able to render high-resolution photorealistic novel views of real objects and scenes from RGB images captured in natural settings.

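1 Introduction (restated)

  • A way to represent continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized by plain MLP networks.
  • A differentiable rendering procedure built on classical volume rendering, used to optimize these representations from standard RGB images. It includes a hierarchical sampling strategy that directs the MLP's capacity toward regions of space containing visible scene content.
  • A positional encoding that maps each input 5D coordinate into a higher-dimensional space, which makes it possible to optimize neural radiance fields to represent high-frequency scene content.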


2 Related Work

A promising recent direction in computer vision is encoding objects and scenes in the weights of an MLP that directly maps from a 3D spatial location to an implicit representation of the shape, such as the signed distance [6] at that location. However, these methods have so far been unable to reproduce realistic scenes with complex geometry with the same fidelity as techniques that represent scenes using discrete representations such as triangle meshes or voxel grids.

In this section, we review these two lines of work and contrast them with our approach, which enhances the capabilities of neural scene representations to produce state-of-the-art results for rendering complex realistic scenes. A similar approach of using MLPs to map from low-dimensional coordinates to colors has also been used for representing other graphics functions such as images [44], textured materials [12,31,36,37], and indirect illumination values [38].

Neural 3D shape representations

Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep networks that map xyz coordinates to signed distance functions [15,32] or occupancy fields [11,27]. However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet [3]. Subsequent work has relaxed this requirement of ground truth 3D shapes by formulating differentiable rendering functions that allow neural implicit shape representations to be optimized using only 2D images. Niemeyer et al. [29] represent surfaces as 3D occupancy fields and use a numerical method to find the surface intersection for each ray, then calculate an exact derivative using implicit differentiation. Each ray intersection location is provided as the input to a neural 3D texture field that predicts a diffuse color for that point. Sitzmann et al. [42] use a less direct neural 3D representation that simply outputs a feature vector and RGB color at each continuous 3D coordinate, and propose a differentiable rendering function consisting of a recurrent neural network that marches along each ray to decide where the surface is located.
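To illustrate what "representing a shape as a level set" means, the sketch below uses an analytic sphere in place of a learned network. The signed distance is negative inside the shape, positive outside, and exactly zero on the surface, so the surface is the zero level set of the function; an occupancy field carries the same information as an inside/outside indicator. The function names and the unit sphere are assumptions for this example only.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    squared = sum((p - c) ** 2 for p, c in zip(point, center))
    return math.sqrt(squared) - radius

def occupancy(point):
    """Binary occupancy derived from the sign of the signed distance."""
    return 1.0 if sphere_sdf(point) <= 0.0 else 0.0

inside = sphere_sdf((0.0, 0.0, 0.0))      # one unit inside the surface
on_surface = sphere_sdf((1.0, 0.0, 0.0))  # exactly on the zero level set
outside = sphere_sdf((2.0, 0.0, 0.0))     # one unit outside the surface
```

The methods cited above replace the analytic formula with a network that maps xyz coordinates to such a value.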

Though these techniques can potentially represent complicated and high resolution geometry, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed renderings. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view-dependent appearance) can represent higher-resolution geometry and appearance to render photorealistic novel views of complex scenes.

View synthesis and image-based rendering

Given a dense sampling of views, photorealistic novel views can be reconstructed by simple light field sample interpolation techniques [21,5,7]. For novel view synthesis with sparser view sampling, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from observed images. One popular class of approaches uses mesh-based representations of scenes with either diffuse [48] or view-dependent [2,8,49] appearance. Differentiable rasterizers [4,10,23,25] or pathtracers [22,30] can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [22], which is typically unavailable for unconstrained real-world scenes.

Another class of methods use volumetric representations to address the task of high-quality photorealistic view synthesis from a set of input RGB images.

Volumetric approaches are able to realistically represent complex shapes and materials, are well-suited for gradient-based optimization, and tend to produce less visually distracting artifacts than mesh-based methods.

Early volumetric approaches used observed images to directly color voxel grids [19,40,45]. More recently, several methods [9,13,17,28,33,43,46,52] have used large datasets of multiple scenes to train deep networks that predict a sampled volumetric representation from a set of input images, and then use either alpha-compositing [34] or learned compositing along rays to render novel views at test time.

Other works have optimized a combination of convolutional networks (CNNs) and sampled voxel grids for each specific scene, such that the CNN can compensate for discretization artifacts from low resolution voxel grids [41] or allow the predicted voxel grids to vary based on input time or animation controls [24].

While these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher resolution imagery is fundamentally limited by poor time and space complexity due to their discrete sampling: rendering higher resolution images requires a finer sampling of 3D space.

We circumvent this problem by instead encoding a continuous volume within the parameters of a deep fully-connected neural network, which not only produces significantly higher quality renderings than prior volumetric approaches, but also requires just a fraction of the storage cost of those sampled volumetric representations.
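A rough back-of-the-envelope comparison shows why storage matters. The sizes below are hypothetical and chosen only for illustration: a dense 512³ grid holding four float32 values per voxel, against a small fully-connected network with 8 hidden layers of width 256. Neither figure is taken from the paper.

```python
BYTES_PER_FLOAT32 = 4

def voxel_grid_bytes(resolution, channels=4):
    """Dense grid storing `channels` float32 values (e.g. RGB + density) per voxel."""
    return resolution ** 3 * channels * BYTES_PER_FLOAT32

def mlp_bytes(layer_sizes):
    """Fully-connected network: weights plus biases for each layer."""
    params = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params += n_in * n_out + n_out
    return params * BYTES_PER_FLOAT32

# Hypothetical sizes chosen for illustration only.
grid = voxel_grid_bytes(512)               # 512^3 voxels, 4 channels
mlp = mlp_bytes([60] + [256] * 8 + [4])    # 8 hidden layers of width 256
print(f"voxel grid: {grid / 2**20:.0f} MiB")
print(f"MLP:        {mlp / 2**20:.2f} MiB")
```

With these assumed sizes the grid takes 2 GiB while the network stays under 2 MiB. Doubling the grid resolution multiplies its size by 8, whereas the network's size does not depend on the rendering resolution at all.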

2 Related Work (restated)




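Recent work has studied implicit representations of continuous 3D shapes as level sets, by optimizing deep networks that map xyz coordinates to signed distance functions [15,32] or occupancy fields [11,27]. These models, however, require ground-truth 3D geometry, which is usually taken from synthetic 3D shape datasets such as ShapeNet [3]. Later work relaxed this requirement by formulating differentiable rendering functions, so that neural implicit shape representations can be optimized from 2D images alone. Niemeyer et al. [29] represent surfaces as 3D occupancy fields, use a numerical method to find the surface intersection of each ray, and then compute exact derivatives by implicit differentiation. Each ray intersection is fed to a neural 3D texture field that predicts a diffuse color for that point. Sitzmann et al. [42] use a less direct neural 3D representation that outputs only a feature vector and an RGB color at each continuous 3D coordinate, and propose a differentiable rendering function built from a recurrent neural network that marches along each ray to decide where the surface lies.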



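Given a dense sampling of views, photorealistic novel views can be reconstructed with simple light field sample interpolation techniques [21,5,7]. For novel view synthesis from sparser views, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from observed images. One popular class of approaches uses mesh-based scene representations with either diffuse [48] or view-dependent [2,8,49] appearance. Differentiable rasterizers [4,10,23,25] or path tracers [22,30] can optimize a mesh representation directly with gradient descent so that it reproduces a set of input images. However, gradient-based mesh optimization driven by image reprojection is often difficult, probably because of local minima or poor conditioning of the loss landscape. In addition, this strategy needs a template mesh with fixed topology as an initialization before optimization [22], and such a template is usually unavailable for unconstrained real-world scenes.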

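Another class of methods uses volumetric representations for high-quality photorealistic view synthesis from a set of input RGB images. Volumetric representations can realistically capture complex shapes and materials, are well suited to gradient-based optimization, and tend to produce fewer visually distracting artifacts than mesh-based methods. Early volumetric approaches used the observed images to color voxel grids directly [19,40,45]. More recently, several methods [9,13,17,28,33,43,46,52] have trained deep networks on large multi-scene datasets to predict a sampled volumetric representation from a set of input images, and then render novel views at test time using either alpha compositing [34] or learned compositing along rays. Other works optimize a combination of convolutional networks (CNNs) and sampled voxel grids for each specific scene, so that the CNN can compensate for discretization artifacts of low-resolution voxel grids [41], or so that the predicted voxel grids can vary with input time or animation controls [24].

Although these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher-resolution imagery is fundamentally limited by the time and space complexity of discrete sampling: rendering higher-resolution images requires a finer sampling of 3D space.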

