CV: Translating and interpreting Chapter 4 of the 2019 survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks" (Part 2)

Summary: CV: Translating and interpreting Chapter 4 of the 2019 survey "A Survey of the Recent Architectures of Deep Convolutional Neural Networks"

4.2.1 Highway Networks


     Based on the intuition that learning capacity can be improved by increasing network depth, Srivastava et al. in 2015 proposed a deep CNN named Highway Network [101]. The main problem with deep Nets is slow training and convergence speed [136]. Highway Network exploited depth for learning enriched feature representations by introducing new cross-layer connectivity (discussed in Section 4.3.1). Therefore, Highway Networks are also categorized under multi-path based CNN architectures. A 50-layer Highway Network showed a better convergence rate than thin but deep architectures on the ImageNet dataset [94], [95]. Srivastava et al. experimentally showed that the performance of a plain Net decreases when hidden units are added beyond 10 layers [137]. Highway Networks, on the other hand, were shown to converge significantly faster than plain ones, even with a depth of 900 layers.

   


4.2.2 ResNet


    ResNet was proposed by He et al. and is considered a continuation of deep Nets [31]. ResNet revolutionized the CNN architectural race by introducing the concept of residual learning in CNNs and devised an efficient methodology for the training of deep Nets. Similar to Highway Networks, it is also placed under the multi-path based CNNs, thus its learning methodology is discussed in Section 4.3.2. ResNet proposed a 152-layer deep CNN, which won the 2015-ILSVRC competition. The architecture of the residual block of ResNet is shown in Fig. 7. ResNet, which was 20 and 8 times deeper than AlexNet and VGG respectively, showed less computational complexity than previously proposed Nets [21], [29]. He et al. empirically showed that ResNet with 50/101/152 layers has lower error on the image classification task than a 34-layer plain Net. Moreover, ResNet gained a 28% improvement on the well-known image recognition benchmark dataset COCO [138]. The good performance of ResNet on image recognition and localization tasks showed that depth is of central importance for many visual recognition tasks.


   


4.2.3 Inception-V3, V4 and Inception-ResNet


   Inception-V3, V4 and Inception-ResNet are improved versions of Inception-V1 and V2 [33], [99], [100]. The idea of Inception-V3 was to reduce the computational cost of deeper Nets without affecting generalization. For this purpose, Szegedy et al. replaced large-size filters (5x5 and 7x7) with small and asymmetric filters (1x7 and 1x5) and used 1x1 convolutions as a bottleneck prior to the large filters [100]. This makes the traditional convolution operation more like a cross-channel correlation. In one of the earlier works, Lin et al. exploited the potential of 1x1 filters in the NIN architecture [57]. Szegedy et al. [100] used the same concept in an intelligent way. In Inception-V3, a 1x1 convolutional operation was used, which maps the input data into 3 or 4 separate spaces that are smaller than the original input space, and then maps all correlations in these smaller 3D spaces via regular 3x3 or 5x5 convolutions. In Inception-ResNet, Szegedy et al. combined the power of residual learning and the inception block [31], [33]. In doing so, filter concatenation was replaced by the residual connection. Moreover, Szegedy et al. experimentally showed that Inception-V4 with residual connections (Inception-ResNet) has the same generalization power as plain Inception-V4, but with increased depth and width. However, they observed that Inception-ResNet converges more quickly than Inception-V4, which clearly shows that training with residual connections accelerates the training of Inception networks significantly.
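To make the factorization idea concrete, the following is a minimal PyTorch sketch (not the survey's or Szegedy et al.'s actual code) of one such branch: a 1x1 bottleneck followed by a pair of asymmetric convolutions (1x7 then 7x1) standing in for one large filter. The class name FactorizedConvBranch and all channel sizes are illustrative choices, not values taken from Inception-V3.

import torch
import torch.nn as nn

class FactorizedConvBranch(nn.Module):
    """Illustrative Inception-V3-style branch: a 1x1 bottleneck followed by
    asymmetric 1x7 / 7x1 convolutions that approximate one large 7x7 filter
    at a fraction of the cost. Channel sizes are made up for the example."""
    def __init__(self, in_ch=192, bottleneck_ch=64, out_ch=96):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_ch, bottleneck_ch, kernel_size=1),              # 1x1 bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, bottleneck_ch, kernel_size=(1, 7),
                      padding=(0, 3)),                                   # 1x7 asymmetric conv
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_ch, out_ch, kernel_size=(7, 1),
                      padding=(3, 0)),                                   # 7x1 asymmetric conv
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch(x)

x = torch.randn(1, 192, 17, 17)          # dummy feature map
print(FactorizedConvBranch()(x).shape)   # torch.Size([1, 96, 17, 17])

The saving comes from the factorization itself: for a fixed channel count, an n x n filter costs on the order of n^2 weights per channel pair, whereas the 1 x n plus n x 1 pair costs about 2n, and the 1x1 bottleneck reduces the channel count before the expensive filters are applied.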

   


4.2.4 ResNext


    ResNext, also known as the Aggregated Residual Transform Network, is an improvement over the Inception Network [115]. Xie et al. exploited the concept of split, transform and merge in a powerful but simple way by introducing a new term, cardinality [99]. Cardinality is an additional dimension, which refers to the size of the set of transformations [139], [140]. The Inception network not only improved the learning capability of conventional CNNs but also made the network resource-efficient. However, due to the use of diverse spatial embeddings (such as 3x3, 5x5 and 1x1 filters) in the transformation branches, each layer needs to be customized separately. In fact, ResNext derives its characteristic features from Inception, VGG, and ResNet [29], [31], [99]. ResNext utilized the deep homogeneous topology of VGG and simplified the GoogleNet architecture by fixing the spatial filter size to 3x3 within the split, transform, and merge block. It also uses residual learning. The building block of ResNext is shown in Fig. 8. ResNext used multiple transformations within a split, transform and merge block and defined these transformations in terms of cardinality. Xie et al. (2017) showed that an increase in cardinality significantly improves performance. The complexity of ResNext was regulated by applying low-dimensional embeddings (1x1 filters) before the 3x3 convolutions, whereas training was optimized by using skip connections [141].

   

                     

                                          Fig. 8: ResNext building block.
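A minimal sketch of the split-transform-merge block in Fig. 8 can be written with a grouped convolution, where the number of groups plays the role of cardinality. This is an illustrative PyTorch rendering under assumed sizes (256 input channels, bottleneck width 128, cardinality 32); it is not the exact ResNext configuration.

import torch
import torch.nn as nn

class ResNeXtBlockSketch(nn.Module):
    """Illustrative split-transform-merge block in the spirit of ResNeXt:
    1x1 reduce -> grouped 3x3 (groups = cardinality) -> 1x1 expand,
    with an identity skip connection. Sizes are examples, not the paper's."""
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),   # cardinality parallel paths
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.transform(x))   # residual (skip) connection

x = torch.randn(1, 256, 14, 14)
print(ResNeXtBlockSketch()(x).shape)   # torch.Size([1, 256, 14, 14])

Using groups=cardinality is a common equivalent form of the aggregated transformations: each group is one of the parallel paths of the split-transform-merge block, and their outputs are merged before the final 1x1 convolution.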


4.3 Multi-Path based CNNs


   Training deep networks is a challenging task, and it has been the subject of much of the recent research on deep Nets. Deep CNNs generally perform well on complex tasks. However, deeper networks may suffer from performance degradation and gradient vanishing or explosion problems, which are not caused by overfitting but by the increase in depth [53], [142]. The vanishing gradient problem results not only in higher test error but also in higher training error [142]–[144]. For training deeper Nets, the concept of multi-path or cross-layer connectivity was proposed [101], [107], [108], [113]. Multiple paths or shortcut connections can systematically connect one layer to another by skipping some intermediate layers, allowing a specialized flow of information across the layers [145], [146]. Cross-layer connectivity partitions the network into several blocks. These paths also try to solve the vanishing gradient problem by making the gradient accessible to the lower layers. For this purpose, different types of shortcut connections are used, such as zero-padded, projection-based, dropout, skip, and 1x1 connections.

   

4.3.1 Highway Networks


   The increase in depth of a network improves performance mostly for complex problems, but it also makes training of the network difficult. In deep Nets, due to the large number of layers, the backpropagation of error may result in small gradient values at the lower layers. To solve this problem, Srivastava et al. [101] in 2015 proposed a new CNN architecture, named Highway Network, based on the idea of cross-layer connectivity. In the Highway Network, the unimpeded flow of information across layers is enabled by imparting two gating units within a layer (equation (5)). The idea of a gating mechanism was inspired by Long Short Term Memory (LSTM) based Recurrent Neural Networks (RNN) [147], [148]. The aggregation of information by combining the l-th layer and the previous l−k layers' information creates a regularizing effect, making gradient-based training of very deep networks easy. This enables training of a network with more than 100 layers, even as deep as 900 layers, with the Stochastic Gradient Descent (SGD) algorithm. Cross-layer connectivity for the Highway Network is defined in equations (5) and (6).


g(x_i) = H_l(x_i, W_Hl) · T_g(x_i, W_Tg) + x_i · C_g(x_i, W_Cg)        (5)

C_g(x_i, W_Cg) = 1 − T_g(x_i, W_Tg)        (6)

In equation (5), T_g refers to the transformation gate, which expresses the amount of the produced output, whereas C_g is a carry gate. In the network, H_l(x_i, W_Hl) represents the working of the hidden layers and the residual implementation, whereas T_g(x_i, W_Tg) behaves as a switch in a layer, which decides the path for the flow of information; as equation (6) states, the carry gate is the complement of the transformation gate.
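A minimal sketch of equations (5) and (6) as a fully connected highway layer is given below, assuming PyTorch and an arbitrary feature dimension of 64; the transform gate is produced by a sigmoid, and the carry gate is taken as 1 − T_g as in the simplified formulation.

import torch
import torch.nn as nn

class HighwayLayerSketch(nn.Module):
    """Illustrative fully connected highway layer (equations (5) and (6)):
    output = H(x) * T(x) + x * (1 - T(x)), where T is the transform gate
    and (1 - T) plays the role of the carry gate. Dimensions are examples."""
    def __init__(self, dim=64):
        super().__init__()
        self.hidden = nn.Linear(dim, dim)          # H(x, W_H)
        self.transform_gate = nn.Linear(dim, dim)  # T(x, W_T)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        t = torch.sigmoid(self.transform_gate(x))  # gate values in (0, 1)
        return h * t + x * (1.0 - t)               # carry gate C = 1 - T

x = torch.randn(8, 64)
layers = nn.Sequential(*[HighwayLayerSketch(64) for _ in range(10)])
print(layers(x).shape)   # torch.Size([8, 64])

Because the gate can push t toward 0, a stack of such layers can pass the input through almost unchanged, which is what keeps the gradient usable even when many layers are stacked.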


4.3.2 ResNet


    To address the problems faced during the training of deeper Nets, in 2015 He et al. proposed ResNet [31], in which they exploited the idea of bypass pathways used in Highway Networks. The mathematical formulation of ResNet is expressed in equations (7) and (8).


g(x_i) = f(x_i) + x_i        (7)

f(x_i) = g(x_i) − x_i        (8)

Where f(x_i) is the transformed signal and x_i is the original input. The original input x_i is added to f(x_i) through the bypass pathway; in essence, g(x_i) − x_i performs residual learning. ResNet introduced shortcut connections within layers to enable cross-layer connectivity, but these shortcuts are data independent and parameter free, in contrast to the gated shortcuts of Highway Networks. In Highway Networks, when a gated shortcut is closed, the layers represent non-residual functions. In ResNet, however, residual information is always passed and the identity shortcuts are never closed. Residual links (shortcut connections) speed up the convergence of deep networks, thus giving ResNet the ability to avoid gradient diminishing problems. ResNet, with a depth of 152 layers (20 and 8 times deeper than AlexNet and VGG, respectively), won the 2015-ILSVRC championship [21]. Even with the increased depth, ResNet exhibited lower computational complexity than VGG [29].
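For comparison with the gated highway layer above, the following is an illustrative PyTorch sketch of a basic residual block implementing g(x) = f(x) + x with a parameter-free identity shortcut; the two-convolution branch and the channel count are assumptions made for the example, not the exact block of [31].

import torch
import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """Illustrative basic residual block (equation (7)): g(x) = f(x) + x.
    Unlike the highway layer, the shortcut is an identity: data independent
    and parameter free. The channel count is an example."""
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(                      # f(x): the residual branch
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # identity shortcut, never "closed"

x = torch.randn(1, 64, 32, 32)
print(ResidualBlockSketch()(x).shape)   # torch.Size([1, 64, 32, 32])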


4.3.3 DenseNets


    In continuation of Highway Networks and ResNet, DenseNet was proposed to solve the vanishing gradient problem [31], [101], [107]. The problem with ResNet was that it explicitly preserves information through additive identity transformations, due to which many layers may contribute very little or no information. To address this problem, DenseNet used cross-layer connectivity, but in a modified fashion. DenseNet connects each layer to every other layer in a feed-forward fashion, thus the feature maps of all preceding layers are used as inputs to all subsequent layers. This establishes l(l+1)/2 direct connections in DenseNet, as compared to the l connections between a layer and its preceding layer in traditional CNNs. It imprints the effect of cross-layer depth-wise convolutions. As DenseNet concatenates the previous layers' features instead of adding them, the network may gain the ability to explicitly differentiate between information that is added to the network and information that is preserved. DenseNet has a narrow layer structure; however, it becomes parametrically expensive with an increase in the number of feature maps. The direct access of each layer to the gradients through the loss function improves the flow of information throughout the network. This incorporates a regularizing effect, which reduces overfitting on tasks with smaller training sets.
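The concatenation-based connectivity can be sketched as follows; this is an illustrative PyTorch dense block with an assumed growth rate of 12 and 4 layers, not a specific DenseNet configuration. Each layer receives the concatenation of all preceding feature maps, so the channel count grows by the growth rate at every step.

import torch
import torch.nn as nn

class DenseBlockSketch(nn.Module):
    """Illustrative dense block: each layer receives the concatenation of all
    preceding feature maps and contributes growth_rate new maps. The numbers
    of layers and channels are examples, not a DenseNet configuration."""
    def __init__(self, in_ch=16, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            ch += growth_rate   # the next layer sees all previous feature maps

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concatenate, not add
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(DenseBlockSketch()(x).shape)   # torch.Size([1, 64, 32, 32]) = 16 + 4*12

The growing channel count in the loop is exactly the parametric cost mentioned above: concatenation keeps old information around instead of overwriting it, at the price of wider inputs to later layers.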


4.4 Width based Multi-Connection CNNs


    During 2012-2015, the focus was largely on exploiting the power of depth, along with the effectiveness of multi-pass regulatory connections in network regularization [31], [101]. However, Kawaguchi et al. reported that the width of the network is also important [149]. The multilayer perceptron gained an advantage over the perceptron in mapping complex functions by making parallel use of multiple processing units within a layer. This suggests that width is as important a parameter in defining the principles of learning as depth. Lu et al. (2017), and Hanin and Sellke (2017), have recently shown that NNs with the ReLU activation function have to be wide enough in order to retain the universal approximation property as depth increases [150]. Moreover, a class of continuous functions on a compact set cannot be arbitrarily well approximated by an arbitrarily deep network if the maximum width of the network is not larger than the input dimension [135], [151]. Although stacking multiple layers (increasing depth) may learn diverse feature representations, it may not necessarily increase the learning power of the NN. One major problem linked with deep architectures is that some layers or processing units may not learn useful features. To tackle this problem, the focus of research shifted from deep and narrow architectures towards thin and wide architectures.


4.4.1 WideResNet


     It has been argued that the main drawback associated with deep residual networks is the feature reuse problem, in which some feature transformations or blocks may contribute very little to learning [152]. This problem was addressed by WideResNet [34]. Zagoruyko and Komodakis suggested that the main learning potential of deep residual networks is due to the residual units, whereas depth has a supplementary effect. WideResNet exploited the power of the residual blocks by making ResNet wide rather than deep [31]. WideResNet increased the width by introducing an additional factor k, which controls the width of the network. WideResNet showed that widening the layers may provide a more effective way of improving performance than making residual networks deeper. Although deep residual networks improved representational capacity, they have some demerits, such as time-intensive training, inactivation of many feature maps (the feature reuse problem), and the gradient vanishing and exploding problems. He et al. addressed the feature reuse problem by incorporating dropout in residual blocks to regularize the network in an effective way [31]. Similarly, Huang et al. introduced the concept of stochastic depth by exploiting dropout to solve the vanishing gradient and slow learning problems [105]. It was observed that even a fractional improvement in performance may require the addition of many new layers. An empirical study showed that WideResNet had twice the number of parameters of ResNet, but can be trained in a better way than the deeper networks [34]. The wider residual network was based on the observation that almost all architectures before residual networks, including the most successful Inception and VGG, were wider compared to ResNet. In WideResNet, learning is made effective by adding dropout in-between the convolutional layers rather than inside a residual block.
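An illustrative PyTorch sketch of a wide residual block follows, with an assumed base width of 16, a widening factor k = 8, and dropout placed between the two convolutions as described above; the 1x1 projection on the shortcut is only there to match the widened channel count and is not meant to reproduce the exact WideResNet block.

import torch
import torch.nn as nn

class WideResidualBlockSketch(nn.Module):
    """Illustrative wide residual block: the base width is multiplied by a
    widening factor k, and dropout is placed between the two convolutions.
    The values of base_width, k, and the dropout rate are examples."""
    def __init__(self, in_ch=16, base_width=16, k=8, dropout=0.3):
        super().__init__()
        width = base_width * k                      # widened number of channels
        self.residual = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, width, 3, padding=1, bias=False),
            nn.Dropout(p=dropout),                  # dropout between the convolutions
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, bias=False),
        )
        # 1x1 projection so the shortcut matches the widened channel count
        self.shortcut = nn.Conv2d(in_ch, width, kernel_size=1, bias=False)

    def forward(self, x):
        return self.residual(x) + self.shortcut(x)

x = torch.randn(1, 16, 32, 32)
print(WideResidualBlockSketch()(x).shape)   # torch.Size([1, 128, 32, 32])

Here the widening factor k multiplies the number of feature maps per layer instead of adding more layers, which is the trade-off between width and depth that the section discusses.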


