CV: Translation and Commentary on the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapters 1-3 (Part 3)

2.4 Batch Normalization

Note: in the blogger's experience, this topic is frequently tested in exams!

Batch normalization is used to address the issues related to internal covariate shift within feature maps. Internal covariate shift is a change in the distribution of hidden units' values, which slows down convergence (by forcing the learning rate to a small value) and requires careful initialization of parameters. Batch normalization for a transformed feature map $T_l^k$ is shown in equation (4), where $\varepsilon$ is a small constant added for numerical stability.

$$N_l^k = \frac{F_l^k - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} \qquad (4)$$

In equation (4), $N_l^k$ represents the normalized feature map, $F_l^k$ is the input feature map, and $\mu_B$ and $\sigma_B^2$ denote the mean and variance of a feature map for a mini-batch, respectively. Batch normalization unifies the distribution of feature map values by bringing them to zero mean and unit variance [54]. Furthermore, it smooths the flow of gradients and acts as a regulating factor, which helps improve the generalization of the network.
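The normalization described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the survey authors' implementation: the function name `batch_norm`, the default `eps` value, and the toy mini-batch are assumptions, and the learnable scale and shift parameters of the original batch normalization method are left out.

```python
import math

def batch_norm(feature_maps, eps=1e-5):
    """Normalize one feature map (channel) across a mini-batch.

    feature_maps: list of 2D lists, one H x W map per sample in the batch.
    eps is an assumed small constant that avoids division by zero.
    """
    values = [v for fmap in feature_maps for row in fmap for v in row]
    mu_b = sum(values) / len(values)                             # mini-batch mean
    var_b = sum((v - mu_b) ** 2 for v in values) / len(values)   # mini-batch variance
    scale = 1.0 / math.sqrt(var_b + eps)
    return [[[(v - mu_b) * scale for v in row] for row in fmap]
            for fmap in feature_maps]

# Toy mini-batch: two samples, each with a 2 x 2 feature map (hypothetical values)
batch = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]]]
normalized = batch_norm(batch)
```

After the call, the values in `normalized` have a mean of approximately zero and a variance of approximately one, which is the property the paragraph attributes to batch normalization.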

2.5 Dropout

Dropout introduces regularization within the network, which ultimately improves generalization by randomly skipping some units or connections with a certain probability. In NNs, multiple connections that learn a non-linear relation are sometimes co-adapted, which causes overfitting [55]. This random dropping of some connections or units produces several thinned network architectures, and finally one representative network is selected with small weights. This selected architecture is then considered as an approximation of all of the proposed networks [56].
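As a rough illustration of randomly skipping units with a certain probability, the sketch below uses the common "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at inference time. The function name `dropout`, its parameters, and the toy activations are assumptions made for this example and do not come from the survey.

```python
import random

def dropout(activations, drop_prob, training=True, rng=random):
    """Inverted dropout: zero each unit with probability drop_prob while training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)          # inference: every unit is kept as-is
    keep_prob = 1.0 - drop_prob
    # Kept units are scaled by 1 / keep_prob so the expected activation is unchanged
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)                    # fixed seed, only for repeatability
hidden = [0.5, 1.0, 1.5, 2.0]             # hypothetical hidden-unit activations
thinned = dropout(hidden, drop_prob=0.5, rng=rng)
inference = dropout(hidden, drop_prob=0.5, training=False)
```

Each training-time call samples a different mask, which corresponds to one of the thinned networks mentioned above; the unmasked network used at inference then stands in for all of them.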

2.6 Fully Connected Layer

A fully connected layer is mostly used at the end of the network for classification. Unlike pooling and convolution, it is a global operation. It takes input from the previous layer and globally analyses the output of all the preceding layers [57]. This produces a non-linear combination of selected features, which is used for the classification of data [58].
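To make the "global operation" concrete, here is a minimal sketch in which every output unit is a weighted sum over all input features, followed by a softmax to turn the scores into class probabilities. The softmax step, the function names, and the toy weights are assumptions chosen for illustration; the survey does not prescribe them.

```python
import math

def fully_connected(x, weights, biases):
    """Each output sees every input: y[j] = sum_i weights[j][i] * x[i] + biases[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

features = [1.0, 2.0, 3.0]                   # flattened features (hypothetical values)
W = [[0.1, 0.2, 0.3],                        # one weight row per output class
     [0.0, -0.1, 0.2]]
b = [0.0, 0.5]
scores = fully_connected(features, W, b)
probs = softmax(scores)
```

In contrast to a convolution, which only looks at a local window, each row of `W` here spans the whole input vector.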

Fig. 3: Evolutionary history of deep CNNs

3 Architectural Evolution of Deep CNN

Nowadays, CNNs are considered the most widely used algorithms among biologically inspired AI techniques. The history of CNNs begins with the neurobiological experiments conducted by Hubel and Wiesel (1959, 1962) [14], [59]. Their work provided a platform for many cognitive models, almost all of which were later replaced by CNNs. Over the decades, various efforts have been made to improve the performance of CNNs. This history is pictorially represented in Fig. 3. These improvements can be categorized into five different eras and are discussed below.

3.1 Late 1980s-1999: Origin of CNN

CNNs have been applied to visual tasks since the late 1980s. In 1989, LeCun et al. proposed the first multilayered CNN, named ConvNet, whose origin is rooted in Fukushima's Neocognitron [60], [61]. LeCun proposed supervised training of ConvNet using the backpropagation algorithm [7], [62], in contrast to the unsupervised reinforcement learning scheme used by its predecessor, the Neocognitron. LeCun's work thus laid the foundation for modern 2D CNNs. Supervised training gives a CNN the ability to learn features automatically from raw input, rather than relying on the handcrafted features used by traditional ML methods. This ConvNet showed successful results on handwritten digit and zip code recognition problems [63]. In 1998, ConvNet was improved by LeCun and used for classifying characters in a document recognition application [64]. This modified architecture was named LeNet-5, which was an improvement over the initial CNN as it can extract feature representations hierarchically from raw pixels [65]. LeNet-5's reliance on fewer parameters, along with its consideration of the spatial topology of images, enabled the CNN to recognize rotational variants of an image [65]. Owing to the good performance of CNNs in optical character recognition, their commercial use in ATMs and banks started in 1993 and 1996, respectively. Although LeNet-5 achieved many successful milestones, the main concern associated with it was that its discrimination power did not scale to classification tasks other than handwriting recognition.

3.2 Early 2000: Stagnation of CNN

In the late 1990s and early 2000s, interest in NNs declined and less attention was given to exploring the role of CNNs in different applications such as object detection, video surveillance, etc. The use of CNNs in ML-related tasks became dormant because the improvement in performance was insignificant relative to the high computational cost. At that time, other statistical methods, in particular SVMs, became more popular than CNNs due to their relatively high performance [66]–[68]. It was widely presumed in the early 2000s that the backpropagation algorithm used for training CNNs was not effective in converging to optimal points and was therefore unable to learn useful features in a supervised fashion, as compared to handcrafted features [69]. Meanwhile, different researchers kept working on CNNs and tried to optimize their performance. In 2003, Simard et al. improved the CNN architecture and showed good results compared to SVMs on MNIST, a handwritten digit benchmark dataset [64], [68], [70]–[72]. This performance improvement expedited research in CNNs by extending their application from optical character recognition (OCR) to character recognition in other scripts [72]–[74], deployment in image sensors for face detection in video conferencing, regulation of street crime, etc. Likewise, CNN-based systems were industrialized in markets for tracking customers [75]–[77]. Moreover, the potential of CNNs in other applications such as medical image segmentation, anomaly detection, and robot vision was also explored [78]–[80].

3.3 2006-2011: Revival of CNN

Deep NNs generally have complex architectures and a time-intensive training phase that sometimes spans weeks or even months. In the early 2000s, there were only a few techniques for training deep networks. Additionally, it was believed that CNNs could not scale to complex problems. These challenges halted the use of CNNs in ML-related tasks.

To address these problems, many interesting methods were reported in 2006 to overcome the difficulties encountered in training deep CNNs and learning invariant features. Hinton proposed a greedy layer-wise pre-training approach for deep architectures in 2006, which revived and reinstated the importance of deep learning [81], [82]. The revival of deep learning [83], [84] was one of the factors that brought deep CNNs into the limelight. Huang et al. (2006) used max pooling instead of subsampling, which showed good results by learning invariant features [46], [85].
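The difference between max pooling and subsampling can be illustrated with a small sketch. It assumes non-overlapping 2 x 2 windows and treats subsampling as plain averaging, which simplifies the trainable subsampling layers of early ConvNets; the helper names and the toy feature map are hypothetical.

```python
def pool2x2(fmap, reduce_fn):
    """Apply reduce_fn over non-overlapping 2 x 2 windows of a 2D feature map."""
    pooled = []
    for r in range(0, len(fmap) - 1, 2):
        row = []
        for c in range(0, len(fmap[0]) - 1, 2):
            window = [fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1]]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

def mean(window):
    """Average of a window, used here as a stand-in for subsampling."""
    return sum(window) / len(window)

fmap = [[1.0, 0.0, 2.0, 2.0],
        [0.0, 9.0, 2.0, 2.0],
        [0.0, 0.0, 1.0, 3.0],
        [0.0, 4.0, 5.0, 7.0]]
max_pooled = pool2x2(fmap, max)     # keeps the strongest response in each window
avg_pooled = pool2x2(fmap, mean)    # averages the responses in each window
```

Max pooling returns the same value wherever the strongest response sits inside a window, which is one way to see why it helps with learning features that are invariant to small shifts.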

In late 2006, researchers started using graphics processing units (GPUs) [86], [87] to accelerate the training of deep NN and CNN architectures [88], [89]. In 2007, NVIDIA launched the CUDA programming platform [90], [91], which allows the parallel processing capabilities of GPUs to be exploited to a much greater degree [92]. In essence, the use of GPUs for NN training [88], [93] and other hardware improvements were the main factors that revived research in CNNs. In 2010, Fei-Fei Li's group at Stanford established a large database of images known as ImageNet, containing millions of labeled images [94]. This database was coupled with the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) competitions, where the performance of various models has been evaluated and scored [95]. Consequently, ILSVRC and NIPS have been very active in strengthening research and increasing the use of CNNs, thus making them popular. This was a turning point in improving the performance and increasing the use of CNNs.

3.4 2012-2014: Rise of CNN

The availability of large training datasets, hardware advancements, and computational resources contributed to advances in CNN algorithms. A renaissance of CNNs in object detection, image classification, and segmentation tasks was observed in this period [9], [96]. However, the success of CNNs in image classification was due not only to the aforementioned factors but also, in large part, to architectural modifications, parameter optimization, the incorporation of regulatory units, and the reformulation and readjustment of connections within the network [39], [42], [97].

The main breakthrough in CNN performance was brought by AlexNet [21]. AlexNet won the 2012-ILSVRC competition, which has been one of the most difficult challenges in image detection and classification. AlexNet improved performance by exploiting depth (incorporating multiple levels of transformation) and introduced a regularization term in CNNs. The exemplary performance of AlexNet [21] compared to conventional ML techniques in 2012-ILSVRC (AlexNet reduced the error rate from 25.8% to 16.4%) suggested that the saturation in CNN performance before 2006 was largely due to the unavailability of sufficient training data and computational resources. In summary, before 2006, these resource deficiencies made it hard to train a high-capacity CNN without deterioration of performance [98].

With CNNs becoming more of a commodity in the computer vision (CV) field, a number of attempts have been made to improve their performance at reduced computational cost. Each new architecture therefore tries to overcome the shortcomings of previously proposed architectures in combination with new structural reformulations. In 2013 and 2014, researchers mainly focused on parameter optimization to accelerate CNN performance in a range of applications with a small increase in computational complexity. In 2013, Zeiler and Fergus [28] defined a mechanism to visualize the learned filters of each CNN layer. This visualization approach was used to improve the feature extraction stage by reducing the size of the filters. Similarly, the VGG architecture [29] proposed by the Oxford group, which was runner-up at the 2014-ILSVRC competition, made the receptive field much smaller than that of AlexNet, but with increased volume. In VGG, depth was increased from 9 layers to 16 by doubling the volume of feature maps at each layer. In the same year, GoogleNet [99], which won the 2014-ILSVRC competition, not only made an effort to reduce computational cost by changing the layer design, but also increased width in line with depth to improve CNN performance. GoogleNet introduced the concept of split, transform, and merge based blocks, within which multiscale and multilevel transformations are incorporated to capture both local and global information [33], [99], [100]. The use of multilevel transformations helps a CNN handle image details at various levels. In 2012-2014, the main improvement in the learning capacity of CNNs was achieved by increasing depth and through parameter optimization strategies. This suggested that the depth of a CNN helps improve the performance of a classifier.
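A small piece of arithmetic helps explain why smaller filters combined with greater depth can pay off. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the helper names and the channel count of 64 are hypothetical choices for illustration, not figures from the survey.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) with `channels` inputs and outputs per layer."""
    return num_layers * kernel_size * kernel_size * channels * channels

channels = 64                                            # hypothetical channel count
rf_stack = stacked_receptive_field(3, 2)                 # two stacked 3 x 3 layers
params_stack = conv_weights(3, channels, num_layers=2)   # weights in the 3 x 3 stack
params_single = conv_weights(5, channels)                # weights in one 5 x 5 layer
```

Under these assumptions, two stacked 3 x 3 layers cover the same 5 x 5 region as a single 5 x 5 layer while using fewer weights and adding an extra non-linearity between the two layers.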

3.5 2015-Present: Rapid increase in Architectural Innovations and Applications of CNN

It is generally observed that the major improvements in CNN performance occurred from 2015 to 2019. Research in CNNs is still ongoing and has significant potential for improvement. The representational capacity of a CNN depends on its depth, and in a sense depth can help in learning complex problems by defining diverse levels of features ranging from simple to complex. Multiple levels of transformation make learning easier by chopping complex problems into smaller modules. However, the main challenge faced by deep architectures is the problem of negative learning, which occurs due to diminishing gradients at lower layers of the network. To handle this problem, different research groups worked on the readjustment of layer connections and the design of new modules. In early 2015, Srivastava et al. used the concept of cross-channel connectivity and an information gating mechanism to solve the vanishing gradient problem and to improve the network's representational capacity [101]–[103]. This idea became famous in late 2015, and a similar concept of residual blocks or skip connections was coined [31]. Residual blocks are a variant of cross-channel connectivity that smooth learning by regularizing the flow of information across blocks [104]–[106]. This idea was used in the ResNet architecture for the training of a 150-layer deep network [31]. The idea of cross-channel connectivity was further extended to multilayer connectivity by Deluge, DenseNet, etc. to improve representation [107], [108].
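The gating and skip-connection ideas can be sketched with a toy layer. In this illustration `transform` is a hypothetical stand-in for the learned layers inside a block, and the highway gate is reduced to a single constant bias, whereas in practice the gate is learned and depends on the input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def transform(x, weight):
    """Toy stand-in for the block's layers F(x): elementwise scaling plus ReLU."""
    return [max(0.0, weight * xi) for xi in x]

def residual_block(x, weight):
    """Skip connection: output = F(x) + x, so x always has an identity path."""
    return [f + xi for f, xi in zip(transform(x, weight), x)]

def highway_block(x, weight, gate_bias):
    """Gated variant: output = t * F(x) + (1 - t) * x with gate t in (0, 1)."""
    t = sigmoid(gate_bias)           # simplified: constant gate instead of a learned one
    return [t * f + (1.0 - t) * xi for f, xi in zip(transform(x, weight), x)]

x = [1.0, -2.0, 3.0]
res_out = residual_block(x, weight=0.0)                  # F(x) = 0, block acts as identity
hw_out = highway_block(x, weight=0.5, gate_bias=-20.0)   # gate near 0, input carried through
```

Because the input reaches the output directly in both blocks, the signal (and, during training, the gradient) has a path that bypasses the transformation, which is the intuition behind using such connections against diminishing gradients.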

In 2016, the width of the network was also explored in connection with depth to improve feature learning [34], [35]. Apart from this, no new architectural modification became prominent; instead, different researchers used hybrids of the already proposed architectures to improve deep CNN performance [33], [104]–[106], [109], [110]. This gave the intuition that there might be other factors, more important than the appropriate assembly of network units, that can effectively regulate CNN performance. In this regard, Hu et al. (2017) identified that the network representation has a role in the learning of deep CNNs [111]. Hu et al. introduced the idea of feature map exploitation and pinpointed that less informative and domain-extraneous features may affect the performance of the network to a larger extent. They exploited this idea and proposed a new architecture named the Squeeze and Excitation Network (SE-Network) [111]. It exploits feature map (commonly known as channel in the literature) information by designing a specialized SE-block. This block assigns a weight to each feature map depending on its contribution to class discrimination. This idea was further investigated by different researchers, who assign attention to important regions by exploiting both spatial and feature map (channel) information [37], [38], [112]. In 2018, a new idea of channel boosting was introduced by Khan et al. [36]. The motivation behind training a network with a boosted channel representation was to use an enriched representation. This idea effectively boosts the performance of a CNN by learning diverse features as well as exploiting already learnt features through the concept of transfer learning (TL).
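The per-channel weighting performed by an SE-block can be sketched as three steps: squeeze, excitation, and scaling. The sketch follows the commonly described formulation (global average pooling, a small bottleneck of two fully connected layers, and a sigmoid); the function names and the tiny hand-picked weight matrices are assumptions for illustration, whereas real weights are learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights):
    """Bias-free fully connected layer: one weight row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def se_block(feature_maps, w_reduce, w_expand):
    """feature_maps: list of C channels, each a 2D list (H x W)."""
    # Squeeze: global average pooling gives one descriptor per channel
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then a layer with sigmoid
    hidden = [max(0.0, h) for h in dense(squeezed, w_reduce)]
    channel_weights = [sigmoid(s) for s in dense(hidden, w_expand)]
    # Scale: multiply every value in a channel by that channel's weight
    scaled = [[[v * cw for v in row] for row in ch]
              for ch, cw in zip(feature_maps, channel_weights)]
    return scaled, channel_weights

maps = [[[1.0, 1.0], [1.0, 1.0]],    # channel 0 (hypothetical values)
        [[4.0, 4.0], [4.0, 4.0]]]    # channel 1
w_reduce = [[0.5, 0.5]]              # 2 channels -> 1 hidden unit
w_expand = [[-1.0], [1.0]]           # 1 hidden unit -> 2 channel scores
scaled, channel_weights = se_block(maps, w_reduce, w_expand)
```

With these toy weights the second channel receives the larger weight, so its feature map is preserved almost fully while the first is suppressed.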

From 2012 until now, many improvements have been reported in CNN architecture. Regarding the architectural advancement of CNNs, the recent focus of research has been on designing new blocks that can boost network representation by exploiting both feature maps and spatial information or by adding artificial channels.