Paper: "First Order Motion Model for Image Animation", Translation and Commentary (Part 2) - Alibaba Cloud Developer Community

3.2 Occlusion-aware Image Generation

As mentioned in Sec. 3, the source image S is not pixel-to-pixel aligned with the image to be generated D̂. To handle this misalignment, we use a feature warping strategy similar to [29, 28, 15]. More precisely, after two down-sampling convolutional blocks, we obtain a feature map ξ ∈ ℝ^{H′×W′} of dimension H′ × W′. We then warp ξ according to T̂_{S←D}. In the presence of occlusions in S, optical flow may not be sufficient to generate D̂. Indeed, the occluded parts in S cannot be recovered by image warping and thus should be inpainted. Consequently, we introduce an occlusion map Ô_{S←D} ∈ [0, 1]^{H′×W′} to mask out the feature map regions that should be inpainted. Thus, the occlusion mask diminishes the impact of the features corresponding to the occluded parts. The transformed feature map is written as:
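The formula itself is not reproduced in this copy. A minimal NumPy sketch of the step described above, assuming the transformed map is the backward-warped feature map multiplied element-wise by the occlusion map, i.e. ξ′ = Ô_{S←D} ⊙ f_w(ξ, T̂_{S←D}). The function names, the pixel-unit sampling grid, and the bilinear sampler are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def warp_features(features, sample_grid):
    """Backward-warp a (C, H, W) feature map with bilinear sampling.

    sample_grid has shape (H, W, 2); sample_grid[y, x] = (x_src, y_src) is the
    source location, in pixel units, read for output position (y, x).
    """
    _, h, w = features.shape
    x = np.clip(sample_grid[..., 0], 0, w - 1)
    y = np.clip(sample_grid[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
    bottom = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def occlusion_aware_warp(features, sample_grid, occlusion):
    """Warp the features, then scale each location by its occlusion value in [0, 1]."""
    warped = warp_features(features, sample_grid)
    return warped * occlusion[None, :, :]

# Toy example: identity motion, left half of the map marked as occluded.
h, w = 4, 4
features = np.arange(2 * h * w, dtype=float).reshape(2, h, w)
ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
identity_grid = np.stack([xs, ys], axis=-1).astype(float)
occlusion = np.ones((h, w))
occlusion[:, :2] = 0.0
out = occlusion_aware_warp(features, identity_grid, occlusion)
```

Locations where the occlusion value is 0 carry no warped features, so later layers have to inpaint them from context.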

3.3 Training Losses

We train our system in an end-to-end fashion, combining several losses. First, as our main driving loss, we use a reconstruction loss based on the perceptual loss of Johnson et al. [19], computed with the pre-trained VGG-19 network. The loss is based on the implementation of Wang et al. [37]. With the input driving frame D and the corresponding reconstructed frame D̂, the reconstruction loss is written as:
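The loss formula is not reproduced in this copy. Following the perceptual-loss formulation that the paragraph refers to, it can be sketched as below, where N_i(·) is assumed to denote the i-th channel feature extracted from a chosen VGG-19 layer and I the number of feature channels in that layer (both symbols are assumptions introduced here):

```latex
L_{rec}(\hat{D}, D) = \sum_{i=1}^{I} \left| N_i(\hat{D}) - N_i(D) \right|
```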

Imposing Equivariance Constraint. Our keypoint predictor does not require any keypoint annotations during training. This may lead to unstable performance. The equivariance constraint is one of the most important factors driving the discovery of unsupervised keypoints [18, 43]. It forces the model to predict consistent keypoints with respect to known geometric transformations. We use thin-plate spline deformations, as they were previously used in unsupervised keypoint detection [18, 43] and are similar to natural image deformations. Since our motion estimator predicts not only the keypoints but also the Jacobians, we extend the well-known equivariance loss to additionally include constraints on the Jacobians.

We assume that an image X undergoes a known spatial deformation T_{X←Y}. In this case T_{X←Y} can be an affine transformation or a thin-plate spline deformation. After this deformation we obtain a new image Y. Now, by applying our extended motion estimator to both images, we obtain a set of local approximations for T_{X←R} and T_{Y←R}. The standard equivariance constraint is written as:
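The constraint and the two first-order conditions cited in the text as Eq. (11) and Eq. (12) are not reproduced in this copy. A reconstruction consistent with the surrounding definitions, assuming the Jacobian condition follows from applying the chain rule to the composition, is:

```latex
% Standard equivariance constraint
T_{X \leftarrow R} \equiv T_{X \leftarrow Y} \circ T_{Y \leftarrow R}

% Eq. (11): keypoint locations
T_{X \leftarrow R}(p)\Big|_{p=p_k} \equiv T_{X \leftarrow Y} \circ T_{Y \leftarrow R}(p)\Big|_{p=p_k}

% Eq. (12): Jacobians
\left(\frac{d}{dp} T_{X \leftarrow R}(p)\Big|_{p=p_k}\right) \equiv
\left(\frac{d}{dp} T_{X \leftarrow Y}(p)\Big|_{p=T_{Y \leftarrow R}(p_k)}\right)
\left(\frac{d}{dp} T_{Y \leftarrow R}(p)\Big|_{p=p_k}\right)
```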

Note that the constraint Eq. (11) is strictly the same as the standard equivariance constraint for the keypoints [18, 43]. During training, we constrain every keypoint location using a simple L1 loss between the two sides of Eq. (11). However, implementing the second constraint from Eq. (12) with L1 would force the magnitude of the Jacobians to zero and would lead to numerical problems. To this end, we reformulate this constraint in the following way:
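The reformulated constraint is not shown in this copy. The sketch below assumes it asks the product of the inverse Jacobian predicted in X, the Jacobian of the known deformation, and the Jacobian predicted in Y to equal the 2×2 identity. To keep the example short, the deformation is restricted to an affine map T_{X←Y}(p) = A p + b, whose Jacobian is A everywhere; for a thin-plate spline it would have to be evaluated at T_{Y←R}(p_k). All names are illustrative.

```python
import numpy as np

def equivariance_losses(kp_x, jac_x, kp_y, jac_y, A, b):
    """L1 equivariance penalties for an affine deformation T_{X<-Y}(p) = A p + b.

    kp_x, kp_y: (K, 2) keypoints predicted in images X and Y.
    jac_x, jac_y: (K, 2, 2) Jacobians predicted in images X and Y.
    """
    # Keypoints: T_{X<-R}(p_k) should equal T_{X<-Y}(T_{Y<-R}(p_k)).
    kp_loss = np.abs(kp_x - (kp_y @ A.T + b)).mean()
    # Jacobians: inv(J_X) @ A @ J_Y should equal the identity. Comparing to
    # the identity avoids the trivial solution of shrinking Jacobians to zero.
    product = np.linalg.inv(jac_x) @ A @ jac_y
    jac_loss = np.abs(np.eye(2) - product).mean()
    return kp_loss, jac_loss

A = np.array([[1.2, 0.1], [-0.2, 0.9]])
b = np.array([0.05, -0.03])
kp_y = np.array([[0.0, 0.0], [0.5, -0.5], [1.0, 1.0]])
jac_y = np.tile(1.5 * np.eye(2), (3, 1, 1))

# Predictions in X that are consistent with the deformation.
kp_loss, jac_loss = equivariance_losses(kp_y @ A.T + b, A @ jac_y, kp_y, jac_y, A, b)

# Jacobians in X that ignore the deformation are penalised.
_, bad_jac_loss = equivariance_losses(kp_y @ A.T + b, jac_y, kp_y, jac_y, A, b)
```

With consistent predictions both penalties vanish up to floating-point error, while the second call yields a clearly positive Jacobian penalty.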

3.4 Testing Stage: Relative Motion Transfer

At this stage our goal is to animate an object in a source frame S_1 using the driving video D_1, ..., D_T. Each frame D_t is independently processed to obtain S_t. Rather than transferring the motion encoded in T_{S_1←D_t}(p_k) to S, we transfer the relative motion between D_1 and D_t to S_1. In other words, we apply a transformation T_{D_t←D_1}(p) to the neighbourhood of each keypoint p_k:
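Eq. (14) is not reproduced in this copy. The sketch below is one illustrative reading of relative motion transfer, not the paper's exact formula: each source keypoint is shifted by the displacement of the matching driving keypoint since frame D_1, and the source Jacobians are composed with the change of the driving Jacobians since D_1. The function name and this particular composition order are assumptions.

```python
import numpy as np

def transfer_relative_motion(kp_s1, jac_s1, kp_d1, jac_d1, kp_dt, jac_dt):
    """Move source keypoints by the driving motion between frames D_1 and D_t.

    kp_*: (K, 2) keypoint coordinates; jac_*: (K, 2, 2) local Jacobians.
    """
    # Displacement of each driving keypoint since the first driving frame.
    kp_new = kp_s1 + (kp_dt - kp_d1)
    # Local deformation of the driving video since its first frame,
    # applied on top of the source Jacobians.
    jac_rel = jac_dt @ np.linalg.inv(jac_d1)
    jac_new = jac_rel @ jac_s1
    return kp_new, jac_new

identity = np.tile(np.eye(2), (2, 1, 1))
kp_s1 = np.array([[0.0, 0.0], [0.2, 0.4]])
kp_d1 = np.array([[0.1, 0.1], [0.3, 0.3]])
kp_dt = kp_d1 + np.array([0.05, -0.1])

# Driving keypoints moved by (0.05, -0.1) and doubled their local scale.
kp_new, jac_new = transfer_relative_motion(
    kp_s1, identity, kp_d1, identity, kp_dt, 2.0 * identity
)
# With D_t equal to D_1 the source is left unchanged.
kp_same, jac_same = transfer_relative_motion(
    kp_s1, identity, kp_d1, identity, kp_d1, identity
)
```

Because only differences between D_t and D_1 are used, the source keypoints keep their own layout, which is why S_1 and D_1 need to start in similar poses.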

Detailed mathematical derivations are provided in Sup. Mat. Intuitively, we transform the neighbourhood of each keypoint p_k in S_1 according to its local deformation in the driving video. Indeed, transferring relative motion rather than absolute coordinates makes it possible to transfer only relevant motion patterns while preserving global object geometry. Conversely, when transferring absolute coordinates, as in X2Face [40], the generated frame inherits the object proportions of the driving video. One limitation of transferring relative motion is that we need to assume that the objects in S_1 and D_1 have similar poses (see [28]). Without initial rough alignment, Eq. (14) may lead to absolute keypoint locations that are physically impossible for the object of interest.

4 Experiments

Datasets. We train and test our method on four different datasets containing various objects. Our model is capable of rendering videos of much higher resolution compared to [28] in all our experiments.

The VoxCeleb dataset [22] is a face dataset of 22496 videos, extracted from YouTube videos. For pre-processing, we extract an initial bounding box in the first video frame. We track this face until it is too far away from the initial position. Then, we crop the video frames using the smallest crop containing all the bounding boxes. The process is repeated until the end of the sequence. We filter out sequences that have resolution lower than 256 × 256 and the remaining videos are resized to 256 × 256 preserving the aspect ratio. It’s important to note that compared to X2Face [40], we obtain more natural videos where faces move freely within the bounding box. Overall, we obtain 12331 training videos and 444 test videos, with lengths varying from 64 to 1024 frames.
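A minimal sketch of the cropping logic described above, in plain Python. The text does not say how "too far away" is measured, so the Euclidean distance between box centres and the `max_shift` threshold are assumptions, as are the function names.

```python
def union_box(boxes):
    """Smallest axis-aligned box containing every (left, top, right, bottom) box."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))

def center(box):
    left, top, right, bottom = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def split_track(boxes, max_shift):
    """Cut a per-frame box track into segments that stay near their first box.

    A new segment starts once the box centre moves more than max_shift pixels
    away from the centre of the segment's first box. Returns a list of
    (start, end, crop) tuples with end exclusive.
    """
    segments = []
    start = 0
    for i in range(1, len(boxes) + 1):
        if i < len(boxes):
            cx0, cy0 = center(boxes[start])
            cx, cy = center(boxes[i])
            far = ((cx - cx0) ** 2 + (cy - cy0) ** 2) ** 0.5 > max_shift
        else:
            far = True  # the end of the track closes the last segment
        if far:
            segments.append((start, i, union_box(boxes[start:i])))
            start = i
    return segments

track = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 0, 60, 10), (52, 0, 62, 10)]
segments = split_track(track, max_shift=20)
```

Each segment is cropped with the union of its boxes, so the face can move inside the crop instead of being re-centred on every frame.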

The UvA-Nemo dataset [9] is a facial analysis dataset that consists of 1240 videos. We apply the exact same pre-processing as for VoxCeleb. Each video starts with a neutral expression. Similar to Wang et al. [38], we use 1116 videos for training and 124 for evaluation.

The BAIR robot pushing dataset [10] contains videos collected by a Sawyer robotic arm pushing diverse objects over a table. It consists of 42880 training and 128 test videos. Each video is 30 frames long and has a 256 × 256 resolution.

Following Tulyakov et al. [34], we collected 280 tai-chi videos from YouTube. We use 252 videos for training and 28 for testing. Each video is split into short clips as described in the pre-processing of the VoxCeleb dataset. We retain only high-quality videos and resize all the clips to 256 × 256 pixels (instead of 64 × 64 pixels in [34]). Finally, we obtain 3049 and 285 video chunks for training and testing respectively, with video length varying from 128 to 1024 frames. This dataset is referred to as the Tai-Chi-HD dataset. The dataset will be made publicly available.

Evaluation Protocol. Evaluating the quality of image animation is not obvious, since ground truth animations are not available. We follow the evaluation protocol of Monkey-Net [28]. First, we quantitatively evaluate each method on the "proxy" task of video reconstruction. This task consists of reconstructing the input video from a representation in which appearance and motion are decoupled. In our case, we reconstruct the input video by combining the sparse motion representation in (2) of each frame and the first video frame. Second, we evaluate our model on image animation according to a user-study. In all experiments we use K=10 as in [28]. Other implementation details are given in Sup. Mat.

Metrics. To evaluate video reconstruction, we adopt the metrics proposed in Monkey-Net [28]:

L1. We report the average L1 distance between the generated and the ground-truth videos.

Average Keypoint Distance (AKD). For the Tai-Chi-HD, VoxCeleb and Nemo datasets, we use 3rd-party pre-trained keypoint detectors in order to evaluate whether the motion of the input video is preserved. For the VoxCeleb and Nemo datasets we use the facial landmark detector of Bulat et al. [5]. For the Tai-Chi-HD dataset, we employ the human-pose estimator of Cao et al. [7]. These keypoints are independently computed for each frame. AKD is obtained by computing the average distance between the detected keypoints of the ground truth and of the generated video.

Missing Keypoint Rate (MKR). In the case of Tai-Chi-HD, the human-pose estimator returns an additional binary label for each keypoint indicating whether or not the keypoints were successfully detected. Therefore, we also report the MKR defined as the percentage of keypoints that are detected in the ground truth frame but not in the generated one. This metric assesses the appearance quality of each generated frame.

Average Euclidean Distance (AED). Considering an externally trained image representation, we report the average euclidean distance between the ground truth and generated frame representation, similarly to Esser et al. [11]. We employ the feature embedding used in Monkey-Net [28].
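The four metrics above can be sketched with NumPy as follows. The array layouts are assumptions, the keypoints and embeddings are taken as already computed by the external detectors and embedding network, and MKR is normalised here by the number of keypoints detected in the ground truth, which is one possible reading of the definition.

```python
import numpy as np

def l1_distance(generated, ground_truth):
    """Mean absolute pixel difference between two videos of equal shape."""
    return float(np.abs(generated - ground_truth).mean())

def akd(kp_generated, kp_ground_truth):
    """Average Euclidean distance between matching detected keypoints.

    Both inputs have shape (frames, keypoints, 2).
    """
    return float(np.linalg.norm(kp_generated - kp_ground_truth, axis=-1).mean())

def mkr(found_generated, found_ground_truth):
    """Share of keypoints detected in the ground truth but missed in the output.

    Both inputs are boolean arrays of shape (frames, keypoints).
    """
    in_gt = found_ground_truth.sum()
    missed = (found_ground_truth & ~found_generated).sum()
    return float(missed / in_gt) if in_gt else 0.0

def aed(emb_generated, emb_ground_truth):
    """Average Euclidean distance between per-frame embeddings (frames, dim)."""
    return float(np.linalg.norm(emb_generated - emb_ground_truth, axis=-1).mean())

video_gt = np.zeros((2, 4, 4, 3))
video_gen = np.full_like(video_gt, 0.5)
l1_value = l1_distance(video_gen, video_gt)

akd_value = akd(np.array([[[3.0, 4.0], [0.0, 0.0]]]), np.zeros((1, 2, 2)))

found_gt = np.array([[True, True, True, False]])
found_gen = np.array([[True, False, True, True]])
mkr_value = mkr(found_gen, found_gt)

aed_value = aed(np.array([[3.0, 4.0], [6.0, 8.0]]), np.zeros((2, 2)))
```

Lower is better for all four values.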

Ablation Study. We compare the following variants of our model. Baseline: the simplest model, trained without the occlusion mask (O_{S←D} = 1 in Eq. (8)) and without Jacobians (J_k = 1 in Eq. (4)), and supervised with L_rec at the highest resolution only; Pyr.: the pyramid loss is added to Baseline; Pyr.+O_{S←D}: with respect to Pyr., we replace the generator network with the occlusion-aware network; Jac. w/o Eq. (12): our model with local affine transformations but without the equivariance constraint on the Jacobians, Eq. (12); Full: the full model including the local affine transformations described in Sec. 3.1.

In Tab. 1, we report the quantitative ablation. First, the pyramid loss leads to better results according to all the metrics except AKD. Second, adding O_{S←D} to the model consistently improves all the metrics with respect to Pyr. This illustrates the benefit of explicitly modeling occlusions. We found that without the equivariance constraint on the Jacobians, J_k becomes unstable, which leads to poor motion estimates. Finally, our Full model further improves all the metrics. In particular, we note that, with respect to the Baseline model, the MKR of the Full model is smaller by a factor of 2.75. This shows that our rich motion representation helps generate more realistic images. These results are confirmed by our qualitative evaluation in Fig. 3, where we compare the Baseline and the Full models. In these experiments, each frame D of the input video is reconstructed from its first frame (first column) and the estimated keypoint trajectories. We note that the Baseline model does not locate any keypoints in the arms area. Consequently, when the pose difference with the initial pose increases, the model cannot reconstruct the video (columns 3, 4 and 5). In contrast, the Full model learns to detect a keypoint on each arm and can therefore reconstruct the input video more accurately, even in the case of complex motion.

Comparison with State of the Art. We now compare our method with the state of the art on the video reconstruction task, as in [28]. To the best of our knowledge, X2Face [40] and Monkey-Net [28] are the only previous approaches for model-free image animation. Quantitative results are reported in Tab. 3. We observe that our approach consistently improves every single metric for each of the four different datasets. Even on the two face datasets, VoxCeleb and Nemo, our approach clearly outperforms X2Face, which was originally proposed for face generation. The better performance of our approach compared to X2Face is especially impressive given that X2Face exploits a larger motion embedding (128 floats) than our approach (60 = K*(2+4) floats). Compared to Monkey-Net, which uses a motion representation of similar dimension (50 = K*(2+3)), the advantages of our approach are clearly visible on the Tai-Chi-HD dataset, which contains highly non-rigid objects (i.e., the human body).
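The embedding sizes quoted above follow directly from K = 10 keypoints. A short arithmetic check, where the helper name is illustrative and the per-keypoint breakdown in the comments simply restates the figures in the text:

```python
K = 10  # number of keypoints used in all experiments

def floats_per_frame(num_keypoints, floats_per_keypoint):
    """Size of a per-frame motion representation, in floats."""
    return num_keypoints * floats_per_keypoint

ours = floats_per_frame(K, 2 + 4)        # 2 coordinates + 4 entries of a 2x2 Jacobian
monkey_net = floats_per_frame(K, 2 + 3)  # 2 coordinates + 3 further floats per keypoint
x2face = 128                             # size of the X2Face motion embedding

print(ours, monkey_net, x2face)  # 60 50 128
```

The proposed representation is less than half the size of the X2Face embedding and only 10 floats larger than Monkey-Net's.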

We now report a qualitative comparison for image animation. Generated sequences are reported in Fig. 4. The results are well in line with the quantitative evaluation in Tab. 3. Indeed, in both examples, X2Face and Monkey-Net are not able to correctly transfer the body motion in the driving video, instead warping the human body in the source image as a blob. Conversely, our approach is able to generate significantly better-looking videos in which each body part is independently animated. This qualitative evaluation illustrates the potential of our rich motion description. We complete our evaluation with a user study. We ask users to select the most realistic image animation. Each question consists of the source image, the driving video, and the corresponding results of our method and a competing method. We require each question to be answered by 10 AMT workers. This evaluation is repeated on 50 different input pairs. Results are reported in Tab. 2. We observe that our method is clearly preferred over the competitor methods. Interestingly, the largest difference with the state of the art is obtained on Tai-Chi-HD, the most challenging dataset in our evaluation due to its rich motions.

5 Conclusions

We presented a novel approach for image animation based on keypoints and local affine transformations. Our novel mathematical formulation describes the motion field between two frames and is efficiently computed by deriving a first-order Taylor expansion approximation. In this way, motion is described as a set of keypoint displacements and local affine transformations. A generator network combines the appearance of the source image and the motion representation of the driving video. In addition, we proposed to explicitly model occlusions in order to indicate to the generator network which image parts should be inpainted. We evaluated the proposed method both quantitatively and qualitatively and showed that our approach clearly outperforms the state of the art on all the benchmarks.
