CV: Translation and Interpretation of the 2019 Paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks", Chapter 4 (Part 3)

Overview: Translation and interpretation of Chapter 4 of the 2019 paper "A Survey of the Recent Architectures of Deep Convolutional Neural Networks"

4.4.2 Pyramidal Net


     In earlier deep CNN architectures such as AlexNet, VGG, and ResNet, the depth of the feature maps increases in subsequent layers due to the deep stacking of multiple convolutional layers, while the spatial dimension decreases because each convolutional layer is followed by a sub-sampling layer [21], [29], [31]. Therefore, Han et al. argued that in deep CNNs the enriched feature representation is compensated by a decrease in feature-map size [35]. The drastic increase in feature-map depth, combined with the loss of spatial information, limits the learning ability of a CNN. ResNet has shown remarkable results for the image classification problem; however, in ResNet, deleting a residual block in which both the spatial and the feature-map (channel) dimensions vary (feature-map depth increases while the spatial dimension decreases) generally deteriorates performance. In this regard, stochastic ResNet improved performance by reducing the information loss associated with dropping a residual unit [105]. To increase the learning ability of ResNet, Han et al. proposed Pyramidal Net [35]. In contrast to ResNet's drastic decrease in spatial width with increasing depth, Pyramidal Net increases the width gradually per residual unit. This strategy enables Pyramidal Net to cover all possible locations instead of maintaining the same spatial dimension within each residual block until down-sampling occurs. Because the depth of the feature maps increases gradually in a top-down fashion, it is named Pyramidal Net. In Pyramidal Net, the depth of the feature maps is regulated by a step factor λ and is computed using equation (9):


$$
d_l = \begin{cases} 16, & l = 1 \\ \left\lfloor d_{l-1} + \dfrac{\lambda}{n} \right\rfloor, & 2 \le l \le n+1 \end{cases} \qquad (9)
$$

Where $d_l$ denotes the dimension of the $l$-th residual unit, $n$ is the total number of residual units, $\lambda$ is a step factor, and $\lambda/n$ regulates the increase in depth. The depth-regulating factor $\lambda$ thus distributes the burden of the increase in feature-map depth across the residual units. Residual connections are inserted between the layers using zero-padded identity mapping. The advantage of zero-padded identity mapping is that it needs fewer parameters than a projection-based shortcut connection and hence may result in better generalization [153]. Pyramidal Net uses two different approaches for widening the network: addition-based and multiplication-based widening. The difference between the two types of widening is that the additive pyramidal structure increases the depth linearly, whereas the multiplicative one increases it geometrically [50], [54]. However, a major problem with Pyramidal Net is that, as the width increases, the space and time requirements grow quadratically.
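To make the two widening schemes concrete, here is a minimal Python sketch (not the authors' code) of the per-unit feature-map depth: the additive variant follows the floor-based recurrence of equation (9), while the multiplicative variant is shown generically as a constant per-unit ratio; the starting depth of 16, the step factor, and the ratio are illustrative assumptions.

```python
import math

def additive_widths(n, alpha, d0=16):
    """Additive (linear) widening: each of the n residual units adds a
    fixed step of alpha/n channels to the previous unit's depth, as in
    equation (9)."""
    widths = [d0]
    for _ in range(n):
        widths.append(math.floor(widths[-1] + alpha / n))
    return widths

def multiplicative_widths(n, ratio, d0=16):
    """Multiplicative (geometric) widening: the depth is multiplied by a
    constant ratio at every residual unit, so it grows geometrically."""
    widths = [d0]
    for _ in range(n):
        widths.append(math.floor(widths[-1] * ratio))
    return widths

# Example with hypothetical values: 18 residual units, step factor alpha = 48
print(additive_widths(18, 48))         # roughly linear growth in depth
print(multiplicative_widths(18, 1.1))  # geometric growth in depth
```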


4.4.3 Xception


    Xception can be considered an extreme Inception architecture, which exploits the idea of depthwise separable convolution [21], [114]. Xception modified the original Inception block by making it wider and replacing the different spatial dimensions (1x1, 5x5, 3x3) with a single dimension (3x3) followed by a 1x1 convolution to regulate computational complexity. The architecture of the Xception block is shown in Fig. 9. Xception makes the network computationally efficient by decoupling spatial and feature-map (channel) correlation. It works by first mapping the convolved output to low-dimensional embeddings using 1x1 convolutions and then spatially transforming it k times, where k is a width-defining cardinality that determines the number of transformations. Xception simplifies computation by convolving each feature map separately across the spatial axes, followed by pointwise convolution (1x1 convolutions) to capture cross-channel correlation. In Xception, 1x1 convolution is used to regulate feature-map depth. In conventional CNN architectures, the standard convolutional operation uses only one transformation segment and the Inception block uses three transformation segments, whereas in Xception the number of transformation segments equals the number of feature maps. Although the transformation strategy adopted by Xception does not reduce the number of parameters, it makes learning more efficient and results in improved performance.
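The following is a minimal PyTorch-style sketch of the separable transformation described above (a 1x1 pointwise convolution followed by a per-channel 3x3 depthwise convolution); it is an illustration of the idea, not the Xception architecture itself, and the channel sizes are assumed for the example.

```python
import torch
import torch.nn as nn

class SeparableConv(nn.Module):
    """Decouples cross-channel and spatial correlation: a 1x1 pointwise
    convolution first maps the input into a new channel space, then a 3x3
    depthwise convolution transforms each resulting feature map separately
    across the spatial axes."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # groups=out_ch makes the 3x3 convolution operate on each channel
        # independently (one spatial filter per feature map).
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=3,
                                   padding=1, groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

x = torch.randn(1, 64, 32, 32)     # a single 64-channel input map
y = SeparableConv(64, 128)(x)      # -> shape (1, 128, 32, 32)
```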


                                                                                Fig. 9: Xception building block.


4.4.4 Inception Family


   The Inception family of CNNs also falls under the class of width-based methods [33], [99], [100]. In Inception networks, filters of varying sizes are used within a layer, which increases the output of the intermediate layers. The use of different filter sizes is helpful in capturing the diversity of high-level features. Salient characteristics of the Inception family are discussed in sections 4.1.4 and 4.2.3.
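As a generic illustration of this width-based multi-scale filtering (a sketch, not any specific Inception version), the block below convolves the same input with filters of different sizes in parallel and concatenates the results along the channel axis; the channel counts are assumed values.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Applies 1x1, 3x3 and 5x5 filters to the same input in parallel and
    concatenates the outputs, so later layers see multi-scale features."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

y = MultiScaleBlock(32, 16)(torch.randn(1, 32, 28, 28))  # -> (1, 48, 28, 28)
```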


4.5 Feature Map (Channel) Exploitation based CNNs


    CNN became popular for MV tasks because of its hierarchical learning and automatic feature extraction ability [12]. The selection of features plays an important role in determining the performance of classification, segmentation, and detection modules. Conventional feature extraction techniques are generally static and limit the performance of the classification module because of the limited types of features [154]. In a CNN, features are set dynamically by tuning the weights associated with a kernel (mask). Moreover, multiple stages of feature extraction are used, which can extract diverse types of features (known as feature maps or channels in CNN). However, some of the feature maps impart little or no role in object discrimination [116]. A very large feature set may create a noise-like effect and thus lead to over-fitting of the network. This suggests that, apart from network engineering, the selection of feature maps can play an important role in improving the generalization of the network. In this section, the terms feature maps and channels are used interchangeably, as many researchers use the word channels for feature maps.


                                                                             Fig. 10: Squeeze and Excitation block.


4.5.1 Squeeze and Excitation Network


      The Squeeze and Excitation Network (SE-Network) was reported by Hu et al. [116]. They proposed a new block for the selection of feature maps (commonly known as channels) relevant to object discrimination. This new block was named the SE-block (shown in Fig. 10); it suppresses the less important feature maps and gives a high weightage to the class-specific feature maps. SE-Network reported a record decrease in error on the ImageNet dataset. The SE-block is a processing unit designed in a generic way, and it can therefore be added to any CNN architecture before the convolution layer. The working of this block consists of two operations: squeeze and excitation. A convolution kernel captures information locally but ignores the contextual relation (correlation) of features that lie outside its receptive field. To obtain a global view of the feature maps, the squeeze block generates feature-map-wise statistics by suppressing the spatial information of the convolved input. As global average pooling has the potential to learn the extent of the target object effectively, it is employed by the squeeze operation to generate feature-map-wise statistics using the following equation [57], [155]:


$$
MD_k = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} u_k(i, j) \qquad (10)
$$

Where $MD$ is the feature-map descriptor, $u_k$ is the $k$-th convolved feature map, and $m \times n$ is the spatial dimension of the input. The output of the squeeze operation, $MD$, is passed to the excitation operation, which models motif-wise interdependencies by exploiting a gating mechanism. The excitation operation assigns weights to the feature maps using a two-layer feed-forward NN, which is mathematically expressed in equation (11):


$$
V_M = \sigma\!\left(w_2\,\delta(w_1\,MD)\right) \qquad (11)
$$

In equation (11), $V_M$ denotes the weightage for each feature map, where $\delta$ and $\sigma$ refer to the ReLU and sigmoid functions, respectively. In the excitation operation, $w_1$ and $w_2$ are used as regulating factors to limit model complexity and aid generalization [50], [51]. The output of the squeeze block, after the first transformation $w_1$, is passed through the ReLU activation function, which adds non-linearity to the feature maps. The gating mechanism is exploited in the SE-block using the sigmoid activation function, which models the interdependencies among feature maps and assigns a weight based on feature-map relevance [156]. The SE-block is simple and adaptively recalibrates the feature maps of each layer by multiplying the convolved input with the motif responses.
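The following is a minimal PyTorch-style sketch of the squeeze and excitation operations of equations (10) and (11) (an illustrative re-implementation, not the authors' code); the reduction ratio r is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze: global average pooling turns each m x n feature map into a
    single descriptor MD. Excitation: a two-layer feed-forward network with
    ReLU and sigmoid produces one weight per feature map, which then
    rescales the convolved input."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # eq. (10)
        self.excite = nn.Sequential(                    # eq. (11)
            nn.Linear(channels, channels // r),         # w1
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),         # w2
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        md = self.squeeze(u).view(b, c)                 # feature-map descriptor MD
        v = self.excite(md).view(b, c, 1, 1)            # per-map weights V_M
        return u * v                                    # recalibrated feature maps

out = SEBlock(64)(torch.randn(2, 64, 16, 16))           # same shape as the input
```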


4.5.2 Competitive Squeeze and Excitation Networks


    Competitive Inner-Imaging Squeeze and Excitation for Residual Network, also known as the CMPE-SE Network, was proposed by Hu et al. in 2018 [118]. Hu et al. used the idea of the SE-block to improve the learning of deep residual networks [116]. SE-Network recalibrates the feature maps based on their contribution to class discrimination. However, the main concern with SE-Net is that, in ResNet, it only considers the residual information for determining the weight of each channel [116]. This minimizes the impact of the SE-block and makes the ResNet information redundant. Hu et al. addressed this problem by generating feature-map-wise statistics from both the residual and the identity-mapping-based features. In this regard, a global representation of the feature maps is generated using the global average pooling operation, whereas the relevance of the feature maps is estimated by setting up a competition between the residual and identity-mapping-based descriptors. This phenomenon is termed inner imaging [118]. The CMPE-SE block not only models the relationship between residual feature maps but also maps their relation with the identity feature maps and creates a competition between residual and identity feature maps. The mathematical expression for the CMPE-SE block is represented using the following equation:


where $x_{id}$ is the identity mapping of the input, $F_{se}$ represents the squeeze operation applied to the residual feature map $u_r$ and the identity feature map $x_{id}$, and $F_{res}$ denotes the implementation of the SE-block on the residual feature maps. The output of the squeeze operation is multiplied with the SE-block output $F_{res}$. The backpropagation algorithm thus tries to optimize the competition between identity and residual feature maps as well as the relationship between all feature maps in the residual block.
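A rough sketch of the inner-imaging idea as described above: descriptors are squeezed from both the residual and the identity feature maps and fed jointly to the excitation network, so the two branches compete for the recalibration weights. The concatenation-based fusion and the reduction ratio below are assumptions made for illustration; the original CMPE-SE block defines the competition differently in its published equation.

```python
import torch
import torch.nn as nn

class CMPESEBlock(nn.Module):
    """Squeezes both the residual maps u_r and the identity mapping x_id
    with global average pooling, lets the two descriptors compete inside a
    shared excitation network (here: simple concatenation, an assumed
    fusion), and rescales the residual maps before the skip addition."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(2 * channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, u_r, x_id):
        b, c, _, _ = u_r.shape
        d_res = self.squeeze(u_r).view(b, c)              # residual descriptor
        d_id = self.squeeze(x_id).view(b, c)              # identity descriptor
        v = self.excite(torch.cat([d_res, d_id], dim=1))  # competition between the two
        return x_id + u_r * v.view(b, c, 1, 1)            # recalibrated residual + skip

out = CMPESEBlock(64)(torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16))
```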


4.6 Channel (Input) Exploitation based CNNs


Image representation plays an important role in determining the performance of image-processing algorithms, including both conventional and deep-learning algorithms. A good representation of an image is one that can define its salient features from a compact code. In the literature, various types of conventional filters have been applied to extract different levels of information from a single type of image [157], [158]. These diverse representations are then used as input to a model to improve performance [159], [160]. A CNN, in contrast, is an effective feature learner that can automatically extract discriminating features depending upon the problem [161]. However, the learning of a CNN relies on the input representation. A lack of diversity and the absence of class-discernible information in the input may affect the performance of a CNN as a discriminator. For this purpose, the concept of channel boosting (boosting the input channel dimension) using auxiliary learners was introduced in CNN to boost the representation of the network [36].

4.6.1 Channel Boosted CNN using TL


   In 2018, Khan et al. proposed a new CNN architecture named Channel Boosted CNN (CB-CNN), based on the idea of boosting the number of input channels to improve the representational capacity of the network [36]. The block diagram of CB-CNN is shown in Fig. 11. Channel boosting is performed by artificially creating extra channels (known as auxiliary channels) through deep generative models and then exploiting them through deep discriminative models. This shows that TL can be used at both the generation and discrimination stages. Data representation plays an important role in determining the performance of a classifier, as different representations may present different aspects of the information [84]. To improve the representational potential of the data, Khan et al. exploited the power of TL and deep generative learners [24], [162], [163]. Generative learners attempt to characterize the data-generating distribution during the learning phase. In CB-CNN, autoencoders are used as generative learners to learn the explanatory factors of variation behind the data. The concept of inductive TL is used in a novel way to build a boosted input representation by augmenting the learned distribution of the input data with the original channel space (input channels). CB-CNN encodes the channel-boosting phase into a generic block, which is inserted at the start of a deep Net. For training, Khan et al. used a pre-trained network to reduce computational cost. The significance of the study is that multiple deep learners are used, where generative learning models serve as auxiliary learners that enhance the representational capacity of the deep CNN-based discriminator. Although the potential of channel boosting was only evaluated by inserting the boosting block at the start, Khan et al. suggested that the idea can be extended by providing auxiliary channels at any layer of the deep architecture. CB-CNN has also been evaluated on a medical image dataset, where it shows improved results compared to previously proposed approaches. The convergence plot of CB-CNN on the mitosis dataset is shown in Fig. 12.
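A conceptual sketch of channel boosting as described above (the autoencoder architecture and the channel counts are illustrative assumptions, not the CB-CNN design): an auxiliary generative learner processes the input, and its learned auxiliary channels are concatenated with the original channels before the discriminative CNN.

```python
import torch
import torch.nn as nn

class ChannelBooster(nn.Module):
    """Generates auxiliary channels with a small convolutional autoencoder
    and concatenates them with the original input channels, boosting the
    input representation fed to the downstream CNN discriminator."""
    def __init__(self, in_ch=3, aux_ch=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(16, aux_ch, 3, padding=1)   # auxiliary channels

    def forward(self, x):
        aux = self.decoder(self.encoder(x))                  # learned auxiliary channels
        return torch.cat([x, aux], dim=1)                    # original + auxiliary

boosted = ChannelBooster()(torch.randn(1, 3, 64, 64))        # -> (1, 6, 64, 64)
# `boosted` would then be passed to a (possibly pre-trained) CNN classifier.
```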


                                                                       Fig. 11: Basic architecture of CB-CNN.


           Fig. 12: Convergence plot of CB-CNN on the mitosis dataset. Loss and accuracy are shown on the y-axis, whereas the x-axis represents epochs. The training plot of CB-CNN shows that the model converges after about 14 epochs.


4.7 Attention based CNNs


      Different levels of abstraction play an important role in defining the discrimination power of a NN. In addition to learning different levels of abstraction, focusing on features relevant to the context also plays a significant role in image localization and recognition. In the human visual system, this phenomenon is referred to as attention. Humans view a scene in a succession of partial glimpses and pay attention to its context-relevant parts. This process not only serves to focus on the selected region but also deduces different interpretations of objects at that location, and thus helps in capturing the visual structure in a better way. A more or less similar kind of interpretability has been added to RNNs and LSTMs [147], [148]. RNN and LSTM networks exploit attention modules for the generation of sequential data, and new samples are weighted based on their occurrence in previous iterations. The concept of attention was incorporated into CNNs by various researchers to improve representation and overcome computational limits. The idea of attention also helps make a CNN intelligent enough to recognize objects even in cluttered backgrounds and complex scenes.


4.7.1 Residual Attention Neural Network


Wang et al. proposed the Residual Attention Network (RAN) to improve the feature representation of the network [38]. The motivation behind incorporating attention into CNN was to make the network capable of learning object-aware features. RAN is a feed-forward CNN built by stacking residual blocks with attention modules. The attention module is branched into trunk and mask branches that adopt a bottom-up top-down learning strategy. The assembly of the two different learning strategies into the attention module enables fast feed-forward processing and top-down attention feedback in a single feed-forward process. The bottom-up feed-forward structure produces low-resolution feature maps with strong semantic information, whereas the top-down architecture produces dense features in order to make an inference for each pixel. In previously proposed studies, a top-down bottom-up learning strategy was used by Restricted Boltzmann Machines [164]. Similarly, Goh et al. exploited the top-down attention mechanism as a regularizing factor in the Deep Boltzmann Machine (DBM) during the reconstruction phase of training. The top-down learning strategy globally optimizes the network in such a way that it gradually maps the output back to the input during the learning process [82], [164], [165]. The attention module in RAN generates an object-aware soft mask $S_{i,FM}(x_c)$ at each layer [166]. The soft mask $S_{i,FM}(x_c)$ assigns attention towards the object using equation (13) by recalibrating the trunk-branch output $T_{i,FM}(x_c)$, and thus behaves like a control gate for the output of every neuron:

$$
H_{i,FM}(x_c) = S_{i,FM}(x_c) \times T_{i,FM}(x_c) \qquad (13)
$$


In one of the previous studies, the Transformation network [167], [168] also exploited the idea of attention in a simple way by incorporating it into the convolution block, but the main problem was that the attention modules in the Transformation network are fixed and cannot adapt to changing circumstances. RAN was made efficient at recognizing cluttered, complex, and noisy images by stacking multiple attention modules. The hierarchical organization of RAN endowed it with the ability to adaptively assign a weight to each feature map based on its relevance within the layers [38]. Learning of the deep hierarchical structure was supported through residual units. Moreover, three different levels of attention (mixed, channel, and spatial attention) were incorporated, thus providing the capability to capture object-aware features at different levels [38].
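The following is a toy sketch of the recalibration step in equation (13), with the trunk and mask branches stubbed as single convolutions purely for illustration; the real RAN branches are deep bottom-up top-down sub-networks.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Toy version of a RAN attention module: a trunk branch extracts
    features, a mask branch produces a soft mask in [0, 1] via a sigmoid,
    and the mask recalibrates the trunk output element-wise (equation 13)."""
    def __init__(self, channels):
        super().__init__()
        self.trunk = nn.Conv2d(channels, channels, 3, padding=1)
        self.mask = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, x):
        t = self.trunk(x)   # T(x): trunk-branch features
        s = self.mask(x)    # S(x): soft attention mask, acts as a control gate
        return s * t        # RAN also uses a residual form, (1 + s) * t

y = AttentionModule(32)(torch.randn(1, 32, 28, 28))
```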


4.7.2 Convolutional Block Attention Module


The significance of the attention mechanism and feature-map exploitation was validated through RAN and SE-Network [38], [111]. In this regard, Woo et al. came up with a new attention-based CNN named the Convolutional Block Attention Module (CBAM) [37]. CBAM is simple in design and similar to SE-Network. SE-Network only considers the contribution of feature maps to image classification but ignores the spatial locality of the object in the image, whereas the spatial location of the object plays an important role in object detection. CBAM infers attention maps sequentially, first applying feature-map (channel) attention and then spatial attention, to find the refined feature maps. In the literature, 1x1 convolution and pooling operations are generally used for spatial attention. Woo et al. showed that pooling features along the spatial axis generates an efficient feature descriptor. CBAM concatenates the average-pooling operation with max pooling, which generates a strong spatial attention map. Likewise, feature-map statistics are modeled using a combination of max pooling and global average pooling. Woo et al. showed that max pooling can provide a clue about distinctive object features, whereas using global average pooling alone returns a suboptimal inference of feature-map attention. Exploiting both average pooling and max pooling improves the representational power of the network. The refined feature maps not only focus on the important parts but also increase the representational power of the selected feature maps. Woo et al. empirically showed that formulating the 3D attention map via a serial learning process helps reduce the number of parameters as well as the computational cost. Due to its simplicity, CBAM can be integrated easily with any CNN architecture.
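A compact PyTorch-style sketch of the two sequential attention steps described above (a re-implementation for illustration; the 7x7 kernel and the reduction ratio are common CBAM defaults, assumed here rather than quoted from the paper).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Feature-map (channel) attention: descriptors from global average
    pooling and global max pooling pass through a shared MLP, are combined,
    and squashed with a sigmoid."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // r, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention: average- and max-pooled maps along the channel
    axis are concatenated and convolved into a single spatial map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(pooled))

def cbam(x, ca, sa):
    """Refine feature maps sequentially: channel attention first, then spatial."""
    x = x * ca(x)
    return x * sa(x)

x = torch.randn(1, 64, 32, 32)
print(cbam(x, ChannelAttention(64), SpatialAttention()).shape)  # (1, 64, 32, 32)
```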


4.7.3 Concurrent Spatial and Channel Excitation Mechanism


    In 2018, Roy et al. extended the work of Hu et al. by incorporating the effect of spatial information in combination with feature-map (channel) information to make it applicable to segmentation tasks [111], [112]. They introduced three different modules: (i) squeezing spatially and exciting feature-map wise (cSE), (ii) squeezing feature-map wise and exciting spatially (sSE), and (iii) concurrent spatial and channel squeeze and excitation (scSE). In this work, an autoencoder-based convolutional NN was used for segmentation, and the proposed modules were inserted after the encoder and decoder layers. The cSE module exploits the same concept as the SE-block: a scaling factor is derived from the combination of feature maps for object detection. As spatial information plays an important role in segmentation, the sSE module gives spatial locality more importance than feature-map information; for this purpose, different combinations of feature maps are selected and exploited spatially for segmentation. In the last module, scSE, attention to each channel is assigned by deriving the scaling factor from both spatial and channel information, thus selectively highlighting the object-specific feature maps [112].
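A brief sketch of the three excitation variants (an illustrative re-implementation; combining cSE and sSE by element-wise addition is shown as one of the aggregation choices, and the reduction ratio is assumed).

```python
import torch
import torch.nn as nn

class cSE(nn.Module):
    """Squeeze spatially (global average pooling), excite feature-map wise."""
    def __init__(self, channels, r=2):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // r),
                                nn.ReLU(inplace=True),
                                nn.Linear(channels // r, channels),
                                nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # per-channel weights
        return x * w

class sSE(nn.Module):
    """Squeeze feature-map wise (1x1 convolution), excite spatially."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))            # per-pixel weights

def scSE(x, cse, sse):
    """Concurrent spatial and channel excitation (combined here by addition)."""
    return cse(x) + sse(x)

x = torch.randn(1, 32, 48, 48)
print(scSE(x, cSE(32), sSE(32)).shape)   # (1, 32, 48, 48)
```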

 
