NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Paper: https://arxiv.org/pdf/2003.08934v2.pdf
Code: https://github.com/yenchenlin/nerf-pytorch
Project: NeRF: Neural Radiance Fields
Overview:
NeRF (Neural Radiance Fields) is a technique that uses a neural network to represent and synthesize 3D scenes. It was introduced in 2020 by Ben Mildenhall et al. in the paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis".
Traditionally, 3D scenes have been represented with explicit geometric models such as meshes or point clouds that describe an object's shape and appearance. NeRF takes a different approach: it represents a scene as a continuous function that maps a 3D location (together with a viewing direction) to an RGB color and a volume density. This function, called a "neural radiance field", is modeled by a neural network.
The key idea of NeRF is to train the network to approximate this volumetric scene function from a set of images captured from different viewpoints. During training, the network takes a 3D point in the scene and a viewing direction as input and outputs the color and density at that point; by rendering rays through the training views and minimizing the photometric error against the observed pixels, NeRF learns to capture the scene's complex lighting and view-dependent effects.
Once a NeRF model has been trained, it can be used for view synthesis, i.e. generating new images of the scene from arbitrary viewpoints. Given a camera pose, NeRF evaluates the neural radiance field along the ray corresponding to each pixel, producing an estimated color and depth. Accumulating the colors along many rays yields a high-quality synthesized image.
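For concreteness, the following PyTorch sketch shows the kind of interface such a network exposes. It is a minimal illustration only: the class name, layer sizes, and structure are assumptions for this note, not the paper's exact architecture (which uses 8 fully-connected layers of width 256 plus a skip connection).

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal sketch of a NeRF-style MLP: a 3D position plus a viewing
    direction (passed here as a 3D unit vector) is mapped to an RGB color
    and a volume density. Layer sizes are illustrative only."""
    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)          # density depends on position only
        self.color_head = nn.Sequential(                # color also depends on view direction
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),    # RGB in [0, 1]
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(xyz)
        sigma = torch.relu(self.sigma_head(h))          # non-negative volume density
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))
        return rgb, sigma
```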
NeRF produces impressively realistic images with complex lighting and view-dependent effects, but it also has limitations. Because the neural network must be evaluated many times for every pixel of an image, rendering is computationally expensive and slow. Many follow-up works address these and other limitations; for example, NeRF++ extends NeRF to unbounded scenes, and NeRF in the Wild handles unconstrained photo collections with varying illumination.
Overall, NeRF is a major breakthrough in 3D scene representation and view synthesis, and it provides a powerful tool for computer graphics, virtual reality, and augmented reality applications.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (Paper Notes)
Abstract.
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views.
Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image.
Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses.
We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
Abstract (Translation)
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene with a fully-connected (non-convolutional) deep network whose input is a single continuous 5D coordinate (a spatial location (x,y,z) and a viewing direction (θ,φ)) and whose output is the volume density and the view-dependent emitted radiance at that location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and we demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we encourage readers to watch our supplementary video for convincing comparisons.
Keywords: scene representation, view synthesis, image-based rendering, volume rendering, 3D deep learning
Keywords (Translation): scene representation, view synthesis, image-based rendering, volume rendering, 3D deep learning
1 Introduction
In this work, we address the long-standing problem of view synthesis in a new way by directly optimizing parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images.
We represent a static scene as a continuous 5D function that outputs the radiance emitted in each direction (θ,φ) at each point (x,y,z) in space, and a density at each point which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x,y,z). Our method optimizes a deep fully-connected neural network without any convolutional layers (often referred to as a multilayer perceptron or MLP) to represent this function by regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and view-dependent RGB color. To render this neural radiance field (NeRF) from a particular viewpoint we:
1) march camera rays through the scene to generate a sampled set of 3D points,
2) use those points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities, and
3) use classical volume rendering techniques to accumulate those colors and densities into a 2D image.
Because this process is naturally differentiable, we can use gradient descent to optimize this model by minimizing the error between each observed image and the corresponding views rendered from our representation.
Minimizing this error across multiple views encourages the network to predict a coherent model of the scene by assigning high volume densities and accurate colors to the locations that contain the true underlying scene content.
Figure 2 visualizes this overall pipeline.
We find that the basic implementation of optimizing a neural radiance field representation for a complex scene does not converge to a sufficiently high resolution representation and is inefficient in the required number of samples per camera ray. We address these issues by transforming input 5D coordinates with a positional encoding that enables the MLP to represent higher frequency functions, and we propose a hierarchical sampling procedure to reduce the number of queries required to adequately sample this high-frequency scene representation.
Our approach inherits the benefits of volumetric representations: both can represent complex real-world geometry and appearance and are well suited for gradient-based optimization using projected images. Crucially, our method overcomes the prohibitive storage costs of discretized voxel grids when modeling complex scenes at high-resolutions. In summary, our technical contributions are:
– An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.
– A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy to allocate the MLP’s capacity towards space with visible scene content.
– A positional encoding to map each input 5D coordinate into a higher dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.
We demonstrate that our resulting neural radiance field method quantitatively and qualitatively outperforms state-of-the-art view synthesis methods, including works that fit neural 3D representations to scenes as well as works that train deep convolutional networks to predict sampled volumetric representations. As far as we know, this paper presents the first continuous neural scene representation that is able to render high-resolution photorealistic novel views of real objects and scenes from RGB images captured in natural settings.
1 Introduction (Translation)
Figure 1: We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) of a scene from a set of input images. We use volume rendering techniques to accumulate samples of this scene representation along rays in order to render the scene from any viewpoint. Here we visualize the set of 100 input views of the synthetic Drums scene randomly captured on a surrounding hemisphere, and we show two novel views rendered from our optimized NeRF representation.
In this work, we address the long-standing problem of view synthesis in a new way, by directly optimizing the parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images.
We represent a static scene as a continuous 5D function that outputs the radiance emitted in each direction (θ,φ) at each point (x,y,z) in space, along with a density at each point that acts like a differential opacity controlling how much radiance is accumulated by a ray passing through (x,y,z). Our method optimizes a deep fully-connected neural network without any convolutional layers (often called a multilayer perceptron, or MLP) to represent this function, regressing from a single 5D coordinate (x,y,z,θ,φ) to a single volume density and a view-dependent RGB color.
To render this neural radiance field (NeRF) from a particular viewpoint, we:
1) march camera rays through the scene to generate a sampled set of 3D points,
2) feed those points and their corresponding 2D viewing directions into the neural network to produce an output set of colors and densities, and
3) use classical volume rendering techniques to accumulate those colors and densities into a 2D image.
Because this process is naturally differentiable, we can use gradient descent to optimize the model by minimizing the error between each observed image and the corresponding view rendered from our representation.
Minimizing this error across multiple views encourages the network to predict a coherent model of the scene, assigning high volume densities and accurate colors to the locations that contain the true underlying scene content.
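As a concrete illustration of steps 1) to 3), the sketch below renders the color of a single pixel in PyTorch, assuming an MLP with the interface of the TinyNeRF sketch in the overview above. The near/far bounds and the uniform 64-sample scheme are simplifying assumptions; the paper uses stratified sampling plus the hierarchical scheme discussed below.

```python
import torch

def render_ray(model, ray_o, ray_d, near=2.0, far=6.0, n_samples=64):
    """Render the color of one camera ray with origin ray_o and direction
    ray_d (both 3-vectors), by querying `model` at sampled points and
    alpha-compositing the results."""
    # 1) march along the ray: uniform samples between the near and far bounds
    t = torch.linspace(near, far, n_samples)
    pts = ray_o + t[:, None] * ray_d                    # (n_samples, 3) sample positions
    dirs = ray_d.expand(n_samples, 3)                   # same viewing direction for all samples

    # 2) query the MLP for a color and density at each sample
    rgb, sigma = model(pts, dirs)                       # (n_samples, 3), (n_samples, 1)

    # 3) classical volume rendering: alpha-composite along the ray
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])            # distances between samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)                  # opacity of each segment
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]     # accumulated transmittance
    weights = alpha * trans                                              # per-sample contribution
    return (weights[:, None] * rgb).sum(dim=0)                           # composited pixel color
```

Because every operation above is differentiable, training reduces to gradient descent on a photometric loss such as `((render_ray(model, o, d) - observed_rgb) ** 2).mean()`, summed over rays cast through the pixels of the captured images.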
We find that the basic implementation of optimizing a neural radiance field representation of a complex scene does not converge to a sufficiently high-resolution representation and is inefficient in the number of samples required per camera ray. We address these issues by transforming the input 5D coordinates with a positional encoding that enables the MLP to represent higher-frequency functions, and by proposing a hierarchical sampling procedure that reduces the number of queries required to adequately sample this high-frequency scene representation.
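The positional encoding itself is simple; the sketch below follows the formulation in the paper (sin and cos at exponentially growing frequencies), with `num_freqs=10` for positions and a smaller value for viewing directions. The function name and the exact handling of the raw input are illustrative choices for this note, not the official implementation.

```python
import math
import torch

def positional_encoding(x, num_freqs=10):
    """Map each coordinate p of x to (sin(2^k * pi * p), cos(2^k * pi * p))
    for k = 0 .. num_freqs-1, so the MLP receives a higher-dimensional input
    and can fit higher-frequency variation in geometry and appearance."""
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi   # 2^k * pi
    angles = x[..., None] * freqs                        # (..., dim, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                     # (..., dim * 2 * num_freqs)
```

With this encoding, the network would be fed `positional_encoding(xyz)` (60 dimensions for 10 frequencies) instead of the raw 3-vector, and similarly an encoded viewing direction with fewer frequencies, so the first linear layers would take correspondingly wider inputs.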
Our approach inherits the benefits of volumetric representations: both can represent complex real-world geometry and appearance, and both are well suited to gradient-based optimization using projected images. Crucially, our method overcomes the prohibitive storage costs of discretized voxel grids when modeling complex scenes at high resolution.
In summary, our technical contributions are:
- An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.
- A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize these representations from standard RGB images. This includes a hierarchical sampling strategy that allocates the MLP's capacity towards space with visible scene content (see the sketch after this list).
- A positional encoding that maps each input 5D coordinate into a higher-dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.
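The hierarchical strategy mentioned in the second contribution can be sketched as follows: a coarse pass produces per-sample compositing weights along each ray (the `weights` from the rendering sketch above), which are treated as a piecewise-constant PDF over depth, and additional fine samples are drawn from it by inverse-transform sampling. This is a simplified illustration; the function name and the handling of bin edges are assumptions rather than the released implementation.

```python
import torch

def sample_fine(t_coarse, weights, n_fine=128):
    """Draw extra sample depths along a ray, concentrated where the coarse
    pass assigned large compositing weights (i.e. where visible content is)."""
    bins = 0.5 * (t_coarse[1:] + t_coarse[:-1])          # midpoints between coarse depths
    pdf = weights[1:-1] + 1e-5                           # drop edge weights, avoid zeros
    pdf = pdf / pdf.sum()
    cdf = torch.cat([torch.zeros(1), torch.cumsum(pdf, dim=0)])

    u = torch.rand(n_fine)                               # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u, right=True).clamp(max=len(bins) - 1)
    below = (idx - 1).clamp(min=0)
    frac = (u - cdf[below]) / (cdf[idx] - cdf[below]).clamp(min=1e-5)
    return bins[below] + frac * (bins[idx] - bins[below])  # fine sample depths
```

The union of the coarse and fine depths is then rendered again with a second "fine" network, and both networks are trained jointly, as described in the paper.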
We demonstrate that the resulting neural radiance field method quantitatively and qualitatively outperforms state-of-the-art view synthesis methods, including works that fit neural 3D representations to scenes as well as works that train deep convolutional networks to predict sampled volumetric representations. To our knowledge, this paper presents the first continuous neural scene representation that is able to render high-resolution photorealistic novel views of real objects and scenes from RGB images captured in natural settings.
2 Related Work
A promising recent direction in computer vision is encoding objects and scenes in the weights of an MLP that directly maps from a 3D spatial location to an implicit representation of the shape, such as the signed distance [6] at that location. However, these methods have so far been unable to reproduce realistic scenes with complex geometry with the same fidelity as techniques that represent scenes using discrete representations such as triangle meshes or voxel grids.
In this section, we review these two lines of work and contrast them with our approach, which enhances the capabilities of neural scene representations to produce state-of-the-art results for rendering complex realistic scenes. A similar approach of using MLPs to map from low-dimensional coordinates to colors has also been used for representing other graphics functions such as images [44], textured materials [12,31,36,37], and indirect illumination values [38].
Neural 3D shape representations
Recent work has investigated the implicit representation of continuous 3D shapes as level sets by optimizing deep networks that map xyz coordinates to signed distance functions [15,32] or occupancy fields [11,27]. However, these models are limited by their requirement of access to ground truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet [3]. Subsequent work has relaxed this requirement of ground truth 3D shapes by formulating differentiable rendering functions that allow neural implicit shape representations to be optimized using only 2D images. Niemeyer et al. [29] represent surfaces as 3D occupancy fields and use a numerical method to find the surface intersection for each ray, then calculate an exact derivative using implicit differentiation. Each ray intersection location is provided as the input to a neural 3D texture field that predicts a diffuse color for that point. Sitzmann et al. [42] use a less direct neural 3D representation that simply outputs a feature vector and RGB color at each continuous 3D coordinate, and propose a differentiable rendering function consisting of a recurrent neural network that marches along each ray to decide where the surface is located.
Though these techniques can potentially represent complicated and high resolution geometry, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed renderings. We show that an alternate strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view-dependent appearance) can represent higher-resolution geometry and appearance to render photorealistic novel views of complex scenes.
View synthesis and image-based rendering
Given a dense sampling of views, photorealistic novel views can be reconstructed by simple light field sample interpolation techniques [21,5,7]. For novel view synthesis with sparser view sampling, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from observed images. One popular class of approaches uses mesh-based representations of scenes with either diffuse [48] or view-dependent [2,8,49] appearance. Differentiable rasterizers [4,10,23,25] or pathtracers [22,30] can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [22], which is typically unavailable for unconstrained real-world scenes.
Another class of methods use volumetric representations to address the task of high-quality photorealistic view synthesis from a set of input RGB images.
Volumetric approaches are able to realistically represent complex shapes and materials, are well-suited for gradient-based optimization, and tend to produce less visually distracting artifacts than mesh-based methods.
Early volumetric approaches used observed images to directly color voxel grids [19,40,45]. More recently, several methods [9,13,17,28,33,43,46,52] have used large datasets of multiple scenes to train deep networks that predict a sampled volumetric representation from a set of input images, and then use either alpha-compositing [34] or learned compositing along rays to render novel views at test time.
Other works have optimized a combination of convolutional networks (CNNs) and sampled voxel grids for each specific scene, such that the CNN can compensate for discretization artifacts from low resolution voxel grids [41] or allow the predicted voxel grids to vary based on input time or animation controls [24].
While these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher resolution imagery is fundamentally limited by poor time and space complexity due to their discrete sampling — rendering higher resolution images requires a finer sampling of 3D space.
We circumvent this problem by instead encoding a continuous volume within the parameters of a deep fully-connected neural network, which not only produces significantly higher quality renderings than prior volumetric approaches, but also requires just a fraction of the storage cost of those sampled volumetric representations.
2 Related Work (Translation)
A promising recent direction in computer vision is encoding objects and scenes in the weights of an MLP that directly maps a 3D spatial location to an implicit representation of the shape, such as the signed distance [6] at that location. However, these methods have so far been unable to reproduce realistic scenes with complex geometry at the same fidelity as techniques that represent scenes with discrete representations such as triangle meshes or voxel grids. In this section, we review these two lines of work and contrast them with our approach, which enhances the capabilities of neural scene representations to produce state-of-the-art results for rendering complex realistic scenes.
A similar approach of using MLPs to map low-dimensional coordinates to colors has also been used to represent other graphics functions, such as images [44], textured materials [12,31,36,37], and indirect illumination values [38].
Neural 3D shape representations
Recent work has investigated implicit representations of continuous 3D shapes as level sets, by optimizing deep networks that map xyz coordinates to signed distance functions [15,32] or occupancy fields [11,27]. However, these models are limited by their requirement of access to ground-truth 3D geometry, typically obtained from synthetic 3D shape datasets such as ShapeNet [3]. Subsequent work relaxed this requirement by formulating differentiable rendering functions that allow neural implicit shape representations to be optimized using only 2D images. Niemeyer et al. [29] represent surfaces as 3D occupancy fields and use a numerical method to find the surface intersection for each ray, then compute an exact derivative using implicit differentiation. Each ray intersection location is fed to a neural 3D texture field that predicts a diffuse color for that point. Sitzmann et al. [42] use a less direct neural 3D representation that simply outputs a feature vector and an RGB color at each continuous 3D coordinate, and propose a differentiable rendering function consisting of a recurrent neural network that marches along each ray to decide where the surface is located.
Although these techniques can in principle represent complicated and high-resolution geometry, they have so far been limited to simple shapes with low geometric complexity, resulting in oversmoothed renderings. We show that an alternative strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view-dependent appearance) can represent higher-resolution geometry and appearance and render photorealistic novel views of complex scenes.
View synthesis and image-based rendering
Given a dense sampling of views, photorealistic novel views can be reconstructed with simple light field sample interpolation techniques [21,5,7]. For novel view synthesis with sparser view sampling, the computer vision and graphics communities have made significant progress by predicting traditional geometry and appearance representations from the observed images. One popular class of approaches uses mesh-based scene representations with either diffuse [48] or view-dependent [2,8,49] appearance. Differentiable rasterizers [4,10,23,25] or path tracers [22,30] can directly optimize mesh representations to reproduce a set of input images using gradient descent. However, gradient-based mesh optimization based on image reprojection is often difficult, likely because of local minima or poor conditioning of the loss landscape. Furthermore, this strategy requires a template mesh with fixed topology to be provided as an initialization before optimization [22], which is typically unavailable for unconstrained real-world scenes.
Another class of methods uses volumetric representations to address the task of high-quality photorealistic view synthesis from a set of input RGB images. Volumetric approaches can realistically represent complex shapes and materials, are well suited to gradient-based optimization, and tend to produce fewer visually distracting artifacts than mesh-based methods. Early volumetric approaches used observed images to directly color voxel grids [19,40,45]. More recently, several methods [9,13,17,28,33,43,46,52] have used large datasets of multiple scenes to train deep networks that predict a sampled volumetric representation from a set of input images, and then use either alpha compositing [34] or learned compositing along rays to render novel views at test time. Other works optimize a combination of convolutional networks (CNNs) and sampled voxel grids for each specific scene, so that the CNN can compensate for discretization artifacts of low-resolution voxel grids [41] or allow the predicted voxel grids to vary with input time or animation controls [24].
While these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher-resolution imagery is fundamentally limited by poor time and space complexity due to their discrete sampling: rendering higher-resolution images requires a finer sampling of 3D space.
We circumvent this problem by encoding a continuous volume within the parameters of a deep fully-connected neural network, which not only produces significantly higher-quality renderings than prior volumetric approaches, but also requires only a fraction of the storage cost of those sampled volumetric representations.