Original paper download: https://download.csdn.net/download/qq_41185868/15548439
Abstract
Deep Convolutional Neural Networks (CNNs) are a special type of Neural Networks, which have shown state-of-the-art performance on various competitive benchmarks. The powerful learning ability of deep CNNs is largely due to the use of multiple feature extraction stages (hidden layers) that can automatically learn representations from the data. The availability of a large amount of data and improvements in hardware processing units have accelerated research in CNNs, and recently very interesting deep CNN architectures have been reported. The recent race in developing deep CNNs shows that innovative architectural ideas, as well as parameter optimization, can improve CNN performance. In this regard, different ideas in CNN design have been explored, such as the use of different activation and loss functions, parameter optimization, regularization, and restructuring of the processing units. However, the major improvement in the representational capacity of deep CNNs has been achieved by restructuring the processing units. In particular, the idea of using a block as a structural unit instead of a layer is receiving substantial attention. This survey thus focuses on the intrinsic taxonomy present in the recently reported deep CNN architectures and, consequently, classifies the recent innovations in CNN architectures into seven different categories. These seven categories are based on spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention. Additionally, this survey covers the elementary understanding of CNN components and sheds light on current challenges and applications of CNNs.
Keywords: Deep Learning, Convolutional Neural Networks, Architecture, Representational Capacity, Residual Learning, and Channel Boosted CNN.
1. Introduction
Machine Learning (ML) algorithms belong to a specialized area of Artificial Intelligence (AI), which endows computers with intelligence by learning the underlying relationships among the data and making decisions without being explicitly programmed. Different ML algorithms have been developed since the late 1990s for the emulation of human sensory responses such as speech and vision, but they have generally failed to achieve human-level satisfaction [1]–[6]. The challenging nature of Machine Vision (MV) tasks gave rise to a specialized class of Neural Networks (NN), known as the Convolutional Neural Network (CNN) [7].
CNNs are considered one of the best techniques for learning image content and have shown state-of-the-art results on image recognition, segmentation, detection, and retrieval-related tasks [8], [9]. The success of CNNs has captured attention beyond academia. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have established active research groups for exploring new architectures of CNN [10]. At present, most of the frontrunners in image processing competitions employ deep CNN-based models.
The topology of a CNN is divided into multiple learning stages composed of a combination of convolutional layers, non-linear processing units, and subsampling layers [11]. Each layer performs multiple transformations using a bank of convolutional kernels (filters) [12]. The convolution operation extracts locally correlated features by dividing the image into small slices (similar to the retina of the human eye), making the network capable of learning suitable features. The output of the convolutional kernels is assigned to non-linear processing units, which not only helps in learning abstractions but also embeds non-linearity in the feature space. This non-linearity generates different patterns of activations for different responses and thus facilitates the learning of semantic differences in images. The output of the non-linear function is usually followed by subsampling, which helps in summarizing the results and also makes the input invariant to geometrical distortions [12], [13].
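To make this stage-wise composition concrete, the following is a minimal sketch of one such learning stage in PyTorch (an assumed framework choice; the channel counts and kernel sizes are illustrative, not taken from the survey): convolution, non-linear activation, then subsampling.

```python
import torch
import torch.nn as nn

# One CNN learning stage: convolution -> non-linearity -> subsampling.
stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # bank of convolutional kernels
    nn.ReLU(),                                                            # non-linear processing unit
    nn.MaxPool2d(kernel_size=2),                                          # subsampling summarizes local responses
)

x = torch.randn(1, 3, 32, 32)  # dummy 32x32 RGB image
y = stage(x)
print(y.shape)                 # torch.Size([1, 16, 16, 16])
```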
The architectural design of CNN was inspired by Hubel and Wiesel’s work and thus largely follows the basic structure of the primate’s visual cortex [14], [15]. CNN first came into the limelight through the work of LeCun in 1989 for the processing of grid-like topological data (images and time series data) [7], [16]. The popularity of CNN is largely due to its hierarchical feature extraction ability. The hierarchical organization of CNN emulates the deep and layered learning process of the Neocortex in the human brain, which automatically extracts features from the underlying data [17]. The staged learning process in CNN shows quite a resemblance to the primate’s ventral pathway of the visual cortex (V1-V2-V4-IT/VTC) [18]. The visual cortex of primates first receives input from the retinotopic area, where multi-scale high-pass filtering and contrast normalization are performed by the lateral geniculate nucleus. After this, detection is performed by different regions of the visual cortex categorized as V1, V2, V3, and V4. In fact, the V1 and V2 portions of the visual cortex are similar to convolutional and subsampling layers, whereas the inferior temporal region resembles the higher layers of CNN, which make inferences about the image [19]. During training, CNN learns through the backpropagation algorithm, by regulating the change in weights with respect to the input. Minimization of a cost function by CNN using the backpropagation algorithm is similar to the response-based learning of the human brain. CNN has the ability to extract low-, mid-, and high-level features. High-level features (more abstract features) are a combination of lower- and mid-level features. With its automatic feature extraction ability, CNN reduces the need for synthesizing a separate feature extractor [20]. Thus, CNN can learn good internal representations from raw pixels with minimal processing.
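As a concrete illustration of this cost-minimizing, response-based learning, the sketch below shows a single backpropagation step in PyTorch (the toy model, loss, and optimizer settings are assumptions for illustration, not the survey's method):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
criterion = nn.CrossEntropyLoss()                                # cost function to minimize
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)       # dummy batch of inputs
labels = torch.randint(0, 10, (8,))      # dummy class labels

loss = criterion(model(images), labels)  # forward pass computes the cost
optimizer.zero_grad()
loss.backward()                          # backpropagation: gradient of the cost w.r.t. each weight
optimizer.step()                         # regulate the weights using those gradients
```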
The main boom in the use of CNNs for image classification and segmentation occurred after it was observed that the representational capacity of a CNN can be enhanced by increasing its depth [21]. Deep architectures have an advantage over shallow architectures when dealing with complex learning problems. Stacking multiple linear and non-linear processing units in a layer-wise fashion provides deep networks the ability to learn complex representations at different levels of abstraction. In addition, advancements in hardware, and thus the availability of high computing resources, are also among the main reasons for the recent success of deep CNNs. Deep CNN architectures have shown significant performance improvements over shallow and conventional vision-based models. Apart from their use in supervised learning, deep CNNs have the potential to learn useful representations from large-scale unlabeled data. The use of multiple mapping functions by a CNN enables it to improve the extraction of invariant representations and, consequently, makes it capable of handling recognition tasks with hundreds of categories. Recently, it has been shown that features at different levels, both low- and high-level, can be transferred to a generic recognition task by exploiting the concept of Transfer Learning (TL) [22]–[24]. Important attributes of CNN are hierarchical learning, automatic feature extraction, multi-tasking, and weight sharing [25]–[27].
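A minimal sketch of Transfer Learning in this spirit, assuming torchvision is available (the backbone choice, the frozen layers, and the 5-class head are illustrative assumptions, not the survey's setup):

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet and reuse its learned feature hierarchy.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the transferred low- and mid-level features...
for p in backbone.parameters():
    p.requires_grad = False

# ...and replace only the task-specific head for a new 5-class recognition task.
backbone.fc = nn.Linear(backbone.fc.in_features, 5)
```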
Various improvements in CNN learning strategy and architecture were performed to make CNNs scalable to large and complex problems. These innovations can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that CNN-based applications became prevalent after the exemplary performance of AlexNet on the ImageNet dataset [21]. Thus, major innovations in CNN have been proposed since 2012, mainly due to the restructuring of processing units and the designing of new blocks. Similarly, Zeiler and Fergus [28] introduced the concept of layer-wise visualization of features, which shifted the trend towards extraction of features at low spatial resolution in deep architectures such as VGG [29]. Nowadays, most of the new architectures are built upon the principle of simple and homogeneous topology introduced by VGG. On the other hand, the Google group introduced the interesting idea of split, transform, and merge, with the corresponding block known as the inception block. The inception block for the very first time gave the concept of branching within a layer, which allows abstraction of features at different spatial scales [30]. In 2015, the concept of skip connections, introduced by ResNet [31] for the training of deep CNNs, became famous, and afterwards this concept was used by most of the succeeding networks, such as Inception-ResNet, WideResNet, ResNeXt, etc. [32]–[34].
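Both block designs mentioned above are easy to express in code. Below is a minimal sketch, in PyTorch, of an inception-style split-transform-merge block and a ResNet-style skip connection (channel counts and branch choices are illustrative assumptions, not the published architectures):

```python
import torch
import torch.nn as nn

class MiniInceptionBlock(nn.Module):
    """Split-transform-merge: parallel branches abstract features at different spatial scales."""
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels // 2, kernel_size=1)             # fine scale
        self.branch3 = nn.Conv2d(channels, channels // 2, kernel_size=3, padding=1)  # coarser scale

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x)], dim=1)  # merge by concatenation

class ResidualBlock(nn.Module):
    """Skip connection: the input bypasses the transformation and is added back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # skip connection eases training of very deep CNNs

x = torch.randn(1, 16, 8, 8)
print(MiniInceptionBlock(16)(x).shape, ResidualBlock(16)(x).shape)
```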
In order to improve the learning capacity of a CNN, different architectural designs such as WideResNet, Pyramidal Net, and Xception explored the effect of multilevel transformations in terms of additional cardinality and increased width [32], [34], [35]. Therefore, the focus of research shifted from parameter optimization and connection readjustment towards improved architectural design (layer structure) of the network. This shift resulted in many new architectural ideas such as channel boosting, spatial and channel-wise exploitation, and attention-based information processing [36]–[38].
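The two knobs named above, width and cardinality, can each be written down as a single layer definition. A minimal sketch, again assuming PyTorch (the channel counts and group count are illustrative):

```python
import torch.nn as nn

# Width: more feature channels per layer (the WideResNet direction).
wide_conv = nn.Conv2d(64, 256, kernel_size=3, padding=1)

# Cardinality: many parallel transformation paths, expressible as a grouped
# convolution (the ResNeXt direction); groups=32 gives 32 independent paths.
grouped_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=32)
```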
In the past few years, different interesting surveys on deep CNNs have been conducted that elaborate the basic components of CNN and their alternatives. The survey reported by [39] reviewed the famous architectures from 2012 to 2015 along with their components. Similarly, in the literature, there are prominent surveys that discuss different algorithms of CNN and focus on applications of CNN [20], [26], [27], [40], [41]. Likewise, the survey presented in [42] discussed a taxonomy of CNNs based on acceleration techniques. In this survey, on the other hand, we discuss the intrinsic taxonomy present in recent and prominent CNN architectures. The various CNN architectures discussed in this survey can be broadly classified into seven main categories, namely: spatial exploitation, depth, multi-path, width, feature map exploitation, channel boosting, and attention based CNNs. The rest of the paper is organized in the following order (shown in Fig. 1): Section 1 summarizes the underlying basics of CNN, its resemblance to the primate’s visual cortex, as well as its contribution to MV. In this regard, Section 2 provides an overview of basic CNN components, and Section 3 discusses the architectural evolution of deep CNNs. Section 4 discusses the recent innovations in CNN architectures and categorizes CNNs into seven broad classes. Sections 5 and 6 shed light on applications of CNNs and current challenges, whereas Section 7 discusses future work and the last section draws conclusions.
Fig. 1: Organization of the survey paper.