Abstract
Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.
1 Introduction
Generating videos by animating objects in still images has countless applications across areas of interest including movie production, photography, and e-commerce. More precisely, image animation refers to the task of automatically synthesizing videos by combining the appearance extracted from a source image with motion patterns derived from a driving video. For instance, a face image of a certain person can be animated following the facial expressions of another individual (see Fig. 1). In the literature, most methods tackle this problem by assuming strong priors on the object representation (e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can be referred to as object-specific methods, as they assume knowledge about the model of the specific object to animate.
Recently, deep generative models have emerged as effective techniques for image animation and video retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks (GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expressions [37] or motion patterns [3] between human subjects in videos. Nevertheless, these approaches usually rely on pre-trained models to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are generally not available for an arbitrary object category. To address these issues, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image animation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, since it assumes a zeroth-order model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4).
First, to tackle this issue, we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions. We therefore call our method a first-order motion model.
Second, we introduce an occlusion-aware generator, which adopts an occlusion mask automatically estimated to indicate object parts that are not visible in the source image and that should be inferred from the context. This is especially needed when the driving video contains large motion patterns and occlusions are typical.
Third, we extend the equivariance loss commonly used for keypoint detector training [18, 44] to improve the estimation of local affine transformations. Fourth, we experimentally show that our method significantly outperforms state-of-the-art image animation methods and can handle high-resolution datasets where other approaches generally fail.
Finally, we release a new high-resolution dataset, Tai-Chi-HD, which we believe could become a reference benchmark for evaluating frameworks for image animation and video generation.
2 Related work
Video Generation. Earlier works on deep video generation discussed how spatio-temporal neural networks could render video frames from noise vectors [36, 26]. More recently, several approaches tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a recurrent neural network with a VAE in order to generate face videos. Considering a wider range of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially trained in order to synthesize videos from noise, categorical labels or static images. Another typical case of conditional generation is the problem of future frame prediction, in which the generated video is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related to these previous works since we use a warping formulation to generate video sequences. However, in the case of image animation, the applied spatial deformations are not predicted but given by the driving video.
Image Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31] and required a strong prior of the animated object. For example, in face animation, the method of Zollhofer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. To animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we rely neither on labels, nor on prior information about the animated objects, nor on specific training procedures for each object instance. Furthermore, our approach can be applied to any object within the same category (e.g., faces, human bodies, robot arms, etc.).
Several approaches were proposed that do not require priors about the object. X2Face [40] uses a dense motion field in order to generate the output video via image warping. Similarly to us, they employ a reference pose that is used to obtain a canonical representation of the object. In our formulation, we do not require an explicit reference pose, leading to significantly simpler optimization and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ sparse trajectories induced by self-supervised keypoints. However, we model object motion in the neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly model occlusions in order to indicate to the generator network the image regions that can be generated by warping the source image and the occluded areas that need to be inpainted.
3 Method
We are interested in animating an object depicted in a source image $S$ based on the motion of a similar object in a driving video $\mathcal{D}$. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired by Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same object category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time, we apply our model to pairs composed of the source image and of each frame of the driving video and perform image animation of the source object.
An overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame $D \in \mathbb{R}^{3 \times H \times W}$ of dimension $H \times W$ of the driving video $\mathcal{D}$ to the source frame $S \in \mathbb{R}^{3 \times H \times W}$. The dense motion field is later used to align the feature maps computed from $S$ with the object pose in $D$. The motion field is modeled by a function $\mathcal{T}_{S \leftarrow D} : \mathbb{R}^2 \to \mathbb{R}^2$ that maps each pixel location in $D$ to its corresponding location in $S$. $\mathcal{T}_{S \leftarrow D}$ is often referred to as backward optical flow. We employ backward optical flow, rather than forward optical flow, since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame $R$. We independently estimate two transformations: from $R$ to $S$ ($\mathcal{T}_{S \leftarrow R}$) and from $R$ to $D$ ($\mathcal{T}_{D \leftarrow R}$). Note that, unlike X2Face [40], the reference frame is an abstract concept that cancels out in our derivations later. Therefore, it is never explicitly computed and cannot be visualized. This choice allows us to independently process $D$ and $S$. This is desired since, at test time, the model receives pairs of the source image and driving frames sampled from a different video, which can be very different visually. Instead of directly predicting $\mathcal{T}_{D \leftarrow R}$ and $\mathcal{T}_{S \leftarrow R}$, the motion estimator module proceeds in two steps.
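To make the back-warping step concrete, the following is a minimal NumPy sketch of bilinear sampling with a backward flow. It is an illustration only: the function name `backward_warp`, the `(H, W, C)` array layout, the pixel-coordinate convention `(x, y)`, and the clamping of out-of-range locations to the border are all assumptions, not details taken from the method described above.

```python
import numpy as np

def backward_warp(source, flow):
    """Warp `source` (H, W, C) with a backward flow of shape (H, W, 2).

    flow[y, x] = (x_s, y_s) is the (possibly fractional) pixel location in
    the source that output pixel (x, y) reads from. Locations outside the
    image are clamped to the border (an assumption of this sketch).
    """
    h, w = source.shape[:2]
    xs = np.clip(flow[..., 0], 0, w - 1)
    ys = np.clip(flow[..., 1], 0, h - 1)

    # Integer corners surrounding each sampling location.
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as interpolation weights.
    wx = (xs - x0)[..., None]
    wy = (ys - y0)[..., None]

    top = (1 - wx) * source[y0, x0] + wx * source[y0, x1]
    bottom = (1 - wx) * source[y1, x0] + wx * source[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because every output pixel is a weighted sum of four source pixels, with weights that depend smoothly on the flow, the operation is differentiable with respect to both the source values and the flow almost everywhere, which is why a backward formulation is convenient for training.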
In the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in $D$ and $S$ are separately predicted by an encoder-decoder network. The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use Taylor expansion to represent $\mathcal{T}_{D \leftarrow R}$ by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.
During the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field $\hat{\mathcal{T}}_{S \leftarrow D}$. Furthermore, in addition to the dense motion field, this network outputs an occlusion mask $\hat{\mathcal{O}}_{S \leftarrow D}$ that indicates which image parts of $D$ can be reconstructed by warping of the source image and which parts should be inpainted, i.e., inferred from the context.
Finally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network $G$ that warps the source image according to $\hat{\mathcal{T}}_{S \leftarrow D}$ and inpaints the image parts that are occluded in the source image. In the following sections, we detail each of these steps and the training procedure.
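As a rough illustration of how an occlusion mask can gate warped content, the sketch below multiplies an already-warped feature map by a mask with values in [0, 1], so that occluded regions are suppressed and left for later layers to inpaint. The function name `apply_occlusion_mask`, the `(C, H, W)` layout, and the element-wise product itself are assumptions made for illustration; the overview above does not specify how the generator applies the mask.

```python
import numpy as np

def apply_occlusion_mask(warped_features, occlusion_mask):
    """Suppress warped features where the mask signals occlusion.

    warped_features: array of shape (C, H, W), already aligned with the
        driving pose by warping.
    occlusion_mask: array of shape (H, W) with values in [0, 1]; values
        near 0 mark regions that warping cannot explain (hypothetical
        convention for this sketch).
    """
    # Broadcast the single-channel mask over all feature channels.
    return warped_features * occlusion_mask[None, :, :]
```

With this convention, a mask of all ones leaves the warped features untouched, while a zero entry removes the warped content at that location entirely.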
3.1 Local Affine Transformations for Approximate Motion Description
The motion estimation module estimates the backward optical flow $\mathcal{T}_{S \leftarrow D}$ from a driving frame $D$ to the source frame $S$. As discussed above, we propose to approximate $\mathcal{T}_{S \leftarrow D}$ by its first order Taylor expansion in a neighborhood of the keypoint locations. In the rest of this section, we describe the motivation behind this choice, and detail the proposed approximation of $\mathcal{T}_{S \leftarrow D}$.
We assume there exists an abstract reference frame $R$. Therefore, estimating $\mathcal{T}_{S \leftarrow D}$ consists in estimating $\mathcal{T}_{S \leftarrow R}$ and $\mathcal{T}_{R \leftarrow D}$. Furthermore, given a frame $X$, we estimate each transformation $\mathcal{T}_{X \leftarrow R}$ in the neighbourhood of the learned keypoints. Formally, given a transformation $\mathcal{T}_{X \leftarrow R}$, we consider its first order Taylor expansions in $K$ keypoints $p_1, \dots, p_K$. Here, $p_1, \dots, p_K$ denote the coordinates of the keypoints in the reference frame $R$. Note that, for the sake of simplicity, in the following the point locations in the reference pose space are all denoted by $p$, while the point locations in the $X$, $S$ or $D$ pose spaces are denoted by $z$. We obtain:
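For reference, a first-order Taylor expansion of $\mathcal{T}_{X \leftarrow R}$ around a keypoint $p_k$ takes the standard form below. It is written here from the definitions above, with $o(\cdot)$ denoting the remainder term; the exact typesetting and the index range are assumptions.

```latex
\mathcal{T}_{X \leftarrow R}(p)
  = \mathcal{T}_{X \leftarrow R}(p_k)
  + \left( \left. \frac{d}{dp} \mathcal{T}_{X \leftarrow R}(p) \right|_{p = p_k} \right) (p - p_k)
  + o\left( \lVert p - p_k \rVert \right),
  \qquad k = 1, \dots, K.
```

The zeroth-order term is the keypoint location in $X$, and the Jacobian in the first-order term is a $2 \times 2$ matrix, so near each keypoint the transformation is described by an affine map.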
Combining Local Motions. We employ a convolutional network $P$ to estimate $\hat{\mathcal{T}}_{S \leftarrow D}$ from the set of Taylor approximations of $\mathcal{T}_{S \leftarrow D}(z)$ in the keypoints and the original source frame $S$. Importantly, since $\hat{\mathcal{T}}_{S \leftarrow D}$ maps each pixel location in $D$ to its corresponding location in $S$, the local patterns in $\hat{\mathcal{T}}_{S \leftarrow D}$, such as edges or texture, are pixel-to-pixel aligned with $D$ but not with $S$. This misalignment makes it harder for the network to predict $\hat{\mathcal{T}}_{S \leftarrow D}$ from $S$. In order to provide inputs already roughly aligned with $\hat{\mathcal{T}}_{S \leftarrow D}$, we warp the source frame $S$ according to the local transformations estimated in Eq. (4). Thus, we obtain $K$ transformed images $S^1, \dots, S^K$ that are each aligned with $\hat{\mathcal{T}}_{S \leftarrow D}$ in the neighbourhood of a keypoint. Importantly, we also consider an additional image $S^0 = S$ for the background.
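The sketch below builds the $K + 1$ sampling grids that such warping needs, one per keypoint plus an identity grid for the background. It assumes that each local transformation has the affine form $z \mapsto z_S^k + J_k (z - z_D^k)$, where $z_S^k$ and $z_D^k$ are the keypoint locations in the source and driving frames and $J_k$ is a $2 \times 2$ matrix. That form, the function name `local_affine_grids`, and the pixel-coordinate convention are assumptions of this illustration rather than a restatement of Eq. (4).

```python
import numpy as np

def local_affine_grids(kp_source, kp_driving, jacobians, height, width):
    """Return sampling grids of shape (K + 1, H, W, 2).

    kp_source, kp_driving: (K, 2) keypoint locations as (x, y) pixels.
    jacobians: (K, 2, 2) assumed local linear maps J_k.
    Grid 0 is the identity (background); grid k + 1 maps each driving
    pixel z to kp_source[k] + J_k @ (z - kp_driving[k]).
    """
    yy, xx = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    z = np.stack([xx, yy], axis=-1).astype(float)            # (H, W, 2)

    # Offsets of every pixel from each driving keypoint: (K, H, W, 2).
    offsets = z[None] - kp_driving[:, None, None, :]

    # Apply J_k to every offset vector.
    moved = np.einsum("kij,khwj->khwi", jacobians, offsets)

    grids = moved + kp_source[:, None, None, :]
    return np.concatenate([z[None], grids], axis=0)
```

Each grid can then be used as a backward flow to resample the source frame, giving one roughly aligned copy of the source per keypoint.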
For each keypoint $p_k$, we additionally compute heatmaps $H_k$ indicating to the dense motion network where each transformation happens. Each $H_k(z)$ is implemented as the difference of two heatmaps centered at $\mathcal{T}_{D \leftarrow R}(p_k)$ and $\mathcal{T}_{S \leftarrow R}(p_k)$:
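A difference of two keypoint-centered heatmaps can be sketched as follows. The Gaussian shape, the normalized coordinate range $[-1, 1]$, the default variance of 0.01, and the function names are assumptions chosen for illustration; only the idea of subtracting a source-centered heatmap from a driving-centered one comes from the text above.

```python
import numpy as np

def gaussian_heatmap(center, height, width, variance):
    """Gaussian bump centered at `center` = (x, y) in [-1, 1] coordinates."""
    yy, xx = np.meshgrid(
        np.linspace(-1.0, 1.0, height),
        np.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    sq_dist = (xx - center[0]) ** 2 + (yy - center[1]) ** 2
    return np.exp(-0.5 * sq_dist / variance)

def keypoint_heatmap_difference(kp_driving, kp_source, height, width,
                                variance=0.01):
    """Heatmap for one keypoint: driving-centered minus source-centered."""
    return (gaussian_heatmap(kp_driving, height, width, variance)
            - gaussian_heatmap(kp_source, height, width, variance))
```

The result is positive near the keypoint's location in the driving frame, negative near its location in the source frame, and exactly zero everywhere when the two locations coincide, so it tells the network both where the keypoint is now and where it came from.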