Paper structure
1. Introduction
2. Related Work
3. Stacked Generative Adversarial Networks
3.1 Preliminaries
3.2 Conditioning Augmentation
3.3 Stage-I GAN
3.4 Stage-II GAN
3.5 Implementation details
4. Experiments
4.1 Datasets and evaluation metrics
4.2 Quantitative and qualitative results
4.3 Component analysis
5. Conclusions
Abstract
Original text
Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256x256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.
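Key points
Samples generated by existing text-to-image methods roughly convey the meaning of the given text, but they lack necessary detail and the image quality is poor.
StackGAN generates 256x256 photo-realistic images conditioned on text descriptions.
The problem is decomposed into a two-stage sketch-refinement process.
The Stage-I GAN sketches the primitive shape and colors of the object from the given text description; the Stage-II GAN takes the text description and the Stage-I output as inputs, corrects defects in the sketch, and adds detail to produce a higher-resolution image.
A Conditioning Augmentation technique is also proposed, which encourages smoothness in the latent conditioning manifold.
Extensive experiments show that the method achieves significant improvements in generating photo-realistic images conditioned on text descriptions.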
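The Conditioning Augmentation idea above can be illustrated with a small sketch. Instead of feeding a fixed text embedding to the generator, the conditioning vector is sampled from a Gaussian whose mean and diagonal variance depend on the text, and a KL term pulls that Gaussian toward the standard normal. In the paper the mean and variance come from a learned layer applied to the text embedding; here they are passed in as plain lists, and the function names and the log-variance parameterization are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def sample_conditioning(mu: Sequence[float],
                        log_var: Sequence[float],
                        rng: random.Random) -> List[float]:
    """Draw c = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float],
                          log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), used as a regularizer."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Hypothetical values standing in for the output of a learned layer.
mu = [0.3, -0.1]
log_var = [0.0, -1.0]
rng = random.Random(0)

c_first = sample_conditioning(mu, log_var, rng)
c_second = sample_conditioning(mu, log_var, rng)  # same text, different draw
regularizer = kl_to_standard_normal(mu, log_var)
```

Because each call draws fresh noise, the same description yields slightly different conditioning vectors, which is one way to read the claim that the technique improves sample diversity and smooths the conditioning manifold.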
Research background
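Energy-Based GAN (EBGAN)
• The discriminator is treated as an energy function; the smaller its (non-negative) value, the more likely the data is real.
• An autoencoder serves as the discriminator (energy function).
• The discriminator can be pre-trained on its own using real data only.
• It can be trained on the ImageNet dataset to generate 256x256 images.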
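A minimal sketch of the energy-function view described above: the discriminator is an autoencoder, and the energy of a sample is its reconstruction error, so well-reconstructed (real-looking) data gets low energy. The margin-based losses follow the usual EBGAN formulation. The toy autoencoder, the use of mean squared error, and the margin value are assumptions chosen only to keep the example self-contained.

```python
from typing import Callable, List, Sequence

Autoencoder = Callable[[Sequence[float]], List[float]]


def energy(x: Sequence[float], autoencoder: Autoencoder) -> float:
    """Non-negative energy: mean squared reconstruction error of x."""
    recon = autoencoder(x)
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)


def discriminator_loss(e_real: float, e_fake: float, margin: float) -> float:
    """Push real energy down; push fake energy up until it reaches the margin."""
    return e_real + max(0.0, margin - e_fake)


def generator_loss(e_fake: float) -> float:
    """The generator tries to produce samples with low energy."""
    return e_fake


def toy_autoencoder(x: Sequence[float]) -> List[float]:
    """Hypothetical stand-in: reconstructs values in [0, 1] exactly, clips the rest."""
    return [min(1.0, max(0.0, v)) for v in x]


e_real = energy([0.2, 0.8], toy_autoencoder)   # in-range sample, zero error
e_fake = energy([2.0, 0.5], toy_autoencoder)   # out-of-range sample, nonzero error
d_loss = discriminator_loss(e_real, e_fake, margin=1.0)
g_loss = generator_loss(e_fake)
```

Since the autoencoder only needs to reconstruct real data well, it can be pre-trained on real samples before adversarial training starts, which matches the pre-training point in the list above.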
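Text-to-image generation
• VAE
• DRAW (Deep Recurrent Attention Writer)
  • Uses a recurrent neural network with an attention mechanism
  • Generates objects one at a time and superimposes them to form the final result
• GAN
  • In the generator, the text embedding is fused with random noise and fed into the generator network.
  • The discriminator learns to reject two kinds of error: a generated (fake) image paired with the correct text, and a real image paired with the wrong text.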
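The two GAN points above can be sketched as follows. The generator input is built by fusing noise with the text embedding, and the discriminator loss scores three kinds of pair: real image with matching text (should be accepted), real image with mismatched text, and fake image with matching text (both should be rejected). Plain concatenation as the fusion step, the equal weighting of the two error cases, and all function names are assumptions for illustration; the scores are taken to be probabilities in (0, 1).

```python
import math
import random
from typing import List, Sequence


def generator_input(text_embedding: Sequence[float],
                    noise_dim: int,
                    rng: random.Random) -> List[float]:
    """Concatenate Gaussian noise z with the text embedding."""
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return z + list(text_embedding)


def matching_aware_d_loss(s_real_match: float,
                          s_real_mismatch: float,
                          s_fake_match: float) -> float:
    """Negative log-likelihood over the three (image, text) pair types.

    s_real_match:    score for a real image with its own text
    s_real_mismatch: score for a real image with the wrong text
    s_fake_match:    score for a generated image with the correct text
    """
    reject = 0.5 * (math.log(1.0 - s_real_mismatch)
                    + math.log(1.0 - s_fake_match))
    return -(math.log(s_real_match) + reject)


g_in = generator_input([0.1, 0.2, 0.3], noise_dim=4, rng=random.Random(0))
good_d = matching_aware_d_loss(0.9, 0.1, 0.1)     # separates the cases well
confused_d = matching_aware_d_loss(0.5, 0.5, 0.5)  # cannot tell them apart
```

Scoring the mismatched-text case separately gives the discriminator a signal about whether the image agrees with the description, not only whether it looks real.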