CV: Translation and Commentary on the 2019 Survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapters 1-3 (Part 2)


2 Basic CNN Components


       Nowadays, the CNN is considered the most widely used ML technique, especially in vision-related applications, and CNNs have recently shown state-of-the-art results in various ML applications. A typical block diagram of an ML system is shown in Fig. 2. Since a CNN possesses both good feature extraction and strong discrimination ability, it is mostly used for the feature extraction and classification stages of an ML system.

A typical CNN architecture generally comprises alternating convolution and pooling layers, followed by one or more fully connected layers at the end. In some cases, the fully connected layer is replaced with a global average pooling layer. In addition to the various learning stages, regulatory units such as batch normalization and dropout are also incorporated to optimize CNN performance [43]. The arrangement of CNN components plays a fundamental role in designing new architectures and thus in achieving enhanced performance. This section briefly discusses the role of these components in a CNN architecture.
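To make this layout concrete, below is a minimal sketch of such an architecture in PyTorch (a framework choice of ours; the survey is framework-agnostic). The `TypicalCNN` class name, channel widths, and dropout rate are all illustrative, not taken from the paper: two convolution/pooling blocks with batch normalization, followed by global average pooling in place of a large fully connected stack.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the typical layout described above:
# alternating convolution/pooling blocks, regulatory units
# (batch normalization, dropout), and a global-average-pooling
# head replacing most of the fully connected layers.
class TypicalCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),               # regulatory unit
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                  # downsample: 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                  # 16x16 -> 8x8
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # global average pooling
            nn.Flatten(),
            nn.Dropout(p=0.5),                # regulatory unit
            nn.Linear(64, num_classes),       # final fully connected layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# Example: logits = TypicalCNN()(torch.randn(1, 3, 32, 32))
```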


2.1 Convolutional Layer


       The convolutional layer is composed of a set of convolutional kernels (each neuron acts as a kernel). These kernels are associated with a small area of the image known as a receptive field. The layer works by dividing the image into small blocks (receptive fields) and convolving them with a specific set of weights (multiplying the elements of the filter with the corresponding receptive-field elements) [43]. The convolution operation can be expressed as follows:

$$F_l^k = I_{x,y} * K_l^k \tag{1}$$

where the input image is represented by $I_{x,y}$, the subscripts $x, y$ show spatial locality, and $K_l^k$ represents the $l$th convolutional kernel of the $k$th layer. Dividing the image into small blocks helps in extracting locally correlated pixel values. This locally aggregated information is also known as a feature motif. Different sets of features within the image are extracted by sliding the convolutional kernel over the whole image with the same set of weights. This weight-sharing property of the convolution operation makes a CNN parameter-efficient compared to fully connected networks. Convolution operations may further be categorized into different types based on the type and size of filters, the type of padding, and the direction of convolution [44]. Additionally, if the kernel is symmetric, the convolution operation becomes a correlation operation [16].
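As a sketch of equation (1), the following NumPy function slides one kernel over one single-channel image; the function name `conv2d_single_kernel` and the valid-padding choice are illustrative assumptions, not the survey's notation.

```python
import numpy as np

def conv2d_single_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution of one image with one kernel, as in equation (1).

    Each output element is the elementwise product of the kernel with one
    receptive field of the image, summed up. The same weights are reused at
    every spatial location (weight sharing).

    Note: like most deep-learning libraries, this computes cross-correlation
    (no kernel flip); as the text notes, for a symmetric kernel the two
    operations coincide.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            receptive_field = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(receptive_field * kernel)
    return out
```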


2.2 Pooling Layer


       Feature motifs, which result as the output of the convolution operation, can occur at different locations in the image. Once a feature has been extracted, its exact location becomes less important as long as its approximate position relative to others is preserved. Pooling, or downsampling, like convolution, is an interesting local operation. It sums up similar information in the neighborhood of the receptive field and outputs the dominant response within this local region [45].

$$Z_l = f_p\!\left(F_{x,y}^l\right) \tag{2}$$

Equation (2) shows the pooling operation, in which $Z_l$ represents the $l$th output feature map, $F_{x,y}^l$ shows the $l$th input feature map, and $f_p(\cdot)$ defines the type of pooling operation. The use of the pooling operation helps to extract a combination of features that are invariant to translational shifts and small distortions [13], [46]. Reducing the size of the feature map to an invariant feature set not only regulates the complexity of the network but also helps to increase generalization by reducing overfitting. Different types of pooling formulations such as max, average, L2, overlapping, and spatial pyramid pooling are used in CNNs [47]-[49].
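A minimal NumPy sketch of equation (2), taking max pooling as the choice of $f_p(\cdot)$; the function name `max_pool2d` and the non-overlapping 2x2 window are illustrative assumptions.

```python
import numpy as np

def max_pool2d(feature_map: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Max pooling over windows of a single feature map, as in equation (2)."""
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = feature_map[y * stride:y * stride + size,
                                 x * stride:x * stride + size]
            out[y, x] = window.max()  # dominant response in the local region
    return out
```

With `size == stride` the windows do not overlap, which is what downsamples the map; choosing `stride < size` would instead give the overlapping pooling mentioned above.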


2.3 Activation Function


       The activation function serves as a decision function and helps in learning complex patterns. Selecting an appropriate activation function can accelerate the learning process. The activation function for a convolved feature map is defined in equation (3).


$$T_l^k = f_A\!\left(F_l^k\right) \tag{3}$$

In the above equation, $F_l^k$ is the output of a convolution operation, which is assigned to the activation function $f_A(\cdot)$ that adds non-linearity and returns a transformed output $T_l^k$ for the $k$th layer. In the literature, different activation functions such as sigmoid, tanh, maxout, ReLU, and variants of ReLU such as leaky ReLU, ELU, and PReLU [39], [48], [50], [51] are used to inculcate non-linear combinations of features. However, ReLU and its variants are preferred over other activations, as they help in overcoming the vanishing-gradient problem [52], [53].
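As a sketch of equation (3), the NumPy snippet below implements ReLU and leaky ReLU as two possible choices of $f_A(\cdot)$; the function names and the slope value `alpha` are illustrative, not prescribed by the survey.

```python
import numpy as np

def relu(F: np.ndarray) -> np.ndarray:
    """ReLU activation: T = f_A(F) = max(0, F), as in equation (3)."""
    return np.maximum(0.0, F)

def leaky_relu(F: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Leaky ReLU keeps a small slope alpha for negative inputs, one of the
    ReLU variants cited above for mitigating gradient problems."""
    return np.where(F > 0, F, alpha * F)
```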

        Fig. 2: Basic layout of a typical ML system. In ML-related tasks, data is initially preprocessed and then assigned to a classification system. A typical ML problem follows three steps: stage 1 is related to data gathering and generation, stage 2 performs preprocessing and feature selection, whereas stage 3 is based on model selection, parameter tuning, and analysis. A CNN has good feature extraction and strong discrimination ability; therefore, in an ML system, it can be used for feature extraction and classification.




 

