3、《What Value Do Explicit High Level Concepts Have in Vision to Language Problems?》
http://cn.arxiv.org/pdf/1506.01144v6 This paper improves model performance by using high-level semantic concepts.
Abstract: Much recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we investigate whether this direct approach succeeds due to, or despite, the fact that it avoids the explicit representation of high-level information. We propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We also show that the same mechanism can be used to introduce external semantic information and that doing so further improves performance. We achieve the best reported results on both image captioning and VQA on several benchmark datasets, and provide an analysis of the value of explicit high-level concepts in V2L problems.
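To make the mechanism described in this abstract concrete, below is a minimal PyTorch sketch of the general idea: CNN image features are mapped to probabilities over a fixed vocabulary of high-level concepts (attributes), and that probability vector, rather than the raw CNN feature, conditions the RNN caption generator. All module names, dimensions, and the choice to feed the concept vector to the LSTM as a first pseudo-token are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch: condition an LSTM captioner on high-level concept (attribute)
# probabilities instead of raw CNN features. Sizes are illustrative.
import torch
import torch.nn as nn

class AttributePredictor(nn.Module):
    """Maps CNN image features to probabilities over a fixed attribute vocabulary."""
    def __init__(self, feat_dim=2048, num_attributes=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_attributes)

    def forward(self, cnn_feats):                  # (B, feat_dim)
        return torch.sigmoid(self.fc(cnn_feats))   # (B, num_attributes), multi-label probs

class ConceptConditionedCaptioner(nn.Module):
    """LSTM language model whose first input step is the attribute probability vector."""
    def __init__(self, num_attributes=256, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.concept_embed = nn.Linear(num_attributes, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, attribute_probs, captions):
        # Prepend the embedded concept vector as the first "token" of the sequence.
        v = self.concept_embed(attribute_probs).unsqueeze(1)   # (B, 1, embed_dim)
        w = self.word_embed(captions)                          # (B, T, embed_dim)
        h, _ = self.lstm(torch.cat([v, w], dim=1))             # (B, T+1, hidden_dim)
        return self.out(h)                                     # per-step word logits
```

The same conditioning slot is also how external semantic information could be injected: any extra concept scores can be concatenated onto the attribute vector before the concept embedding layer.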
4、《Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation》
https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/2A_022_ext.pdf
A good image description is often said to “paint a picture in your mind’s eye.” The creation of a mental image may play a significant role in sentence comprehension in humans [3]. In fact, it is often this mental image that is remembered long after the exact sentence is forgotten [5, 7]. As an illustrative example, Figure 1 shows how a mental image may vary and increase in richness as a description is read. Could computer vision algorithms that comprehend and generate image captions take advantage of similar evolving visual representations? Recently, several papers have explored learning joint feature spaces for images and their descriptions [2, 4, 9]. These approaches project image features and sentence features into a common space, which may be used for image search or for ranking image captions. Various approaches were used to learn the projection, including Kernel Canonical Correlation Analysis (KCCA) [2], recursive neural networks [9], or deep neural networks [4]. While these approaches project both semantics and visual features to a common embedding, they are not able to perform the inverse projection. That is, they cannot generate novel sentences or visual depictions from the embedding.
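For context on the joint-embedding approaches this excerpt describes, the following sketch shows the basic pattern: project image features and sentence features into a common space and rank candidate captions for an image by similarity. The encoder choices, dimensions, and cosine-similarity ranking are assumptions for illustration, not any specific paper's model; the excerpt's point is that such an embedding supports search and ranking but cannot be inverted to generate novel sentences or visual depictions.

```python
# Sketch: rank candidate captions for an image via a learned common embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, sent_dim=300, common_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, common_dim)
        self.sent_proj = nn.Linear(sent_dim, common_dim)

    def embed_image(self, img_feats):
        return F.normalize(self.img_proj(img_feats), dim=-1)

    def embed_sentence(self, sent_feats):
        return F.normalize(self.sent_proj(sent_feats), dim=-1)

def rank_captions(model, img_feat, candidate_sent_feats):
    """Return candidate indices sorted by cosine similarity to the image (best first)."""
    img = model.embed_image(img_feat.unsqueeze(0))        # (1, D)
    sents = model.embed_sentence(candidate_sent_feats)    # (N, D)
    scores = (sents @ img.t()).squeeze(1)                 # cosine similarity per candidate
    return torch.argsort(scores, descending=True)
```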
5、《From Captions to Visual Concepts and Back》
https://arxiv.org/pdf/1411.4952v2.pdf
Abstract: This paper presents a novel approach for automatically generating image descriptions: visual detectors and language models learn directly from a dataset of image captions. We use Multiple Instance Learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. When human judges compare the system captions to ones written by other people, the system captions have equal or better quality over 23% of the time.
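The Multiple Instance Learning step mentioned in the abstract can be sketched as noisy-OR pooling of per-region word scores: each image is treated as a bag of regions, and a word is "present" in the image if any region fires for it. The shapes, the linear scorer, and the source of region features below are illustrative assumptions rather than the paper's exact detector.

```python
# Sketch: noisy-OR MIL pooling of per-region word probabilities.
import torch
import torch.nn as nn

class MILWordDetector(nn.Module):
    def __init__(self, region_dim=4096, vocab_size=1000):
        super().__init__()
        self.word_scores = nn.Linear(region_dim, vocab_size)

    def forward(self, region_feats):                               # (B, R, region_dim)
        p_region = torch.sigmoid(self.word_scores(region_feats))   # (B, R, vocab)
        # Noisy-OR over regions: the word is absent only if every region says absent.
        p_image = 1.0 - torch.prod(1.0 - p_region, dim=1)          # (B, vocab)
        return p_image
```

In the pipeline the abstract outlines, such image-level word probabilities would be trained with a multi-label loss against the words in the reference captions, then passed as conditional inputs to the maximum-entropy language model, with candidates re-ranked by sentence-level features and a multimodal similarity model.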