4.2.1 Highway Networks
Based on the intuition that learning capacity can be improved by increasing network depth, Srivastava et al. proposed, in 2015, a deep CNN named Highway Network [101]. The main problem with deep Nets is their slow training and convergence speed [136]. Highway Network exploited depth for learning enriched feature representations by introducing a new cross-layer connectivity (discussed in Section 4.3.1); therefore, Highway Networks are also categorized as multi-path based CNN architectures. A Highway Network with 50 layers showed a better convergence rate than thin but deep architectures on the ImageNet dataset [94], [95]. Srivastava et al. experimentally showed that the performance of a plain Net decreases when hidden units are added beyond 10 layers [137]. Highway Networks, on the other hand, were shown to converge significantly faster than plain ones, even at a depth of 900 layers.
4.2.2 ResNet
ResNet, proposed by He et al., is considered a continuation of the trend towards deeper Nets [31]. ResNet revolutionized the CNN architectural race by introducing the concept of residual learning in CNNs and devised an efficient methodology for the training of deep Nets. Similar to Highway Networks, it is also placed among the multi-path based CNNs; thus, its learning methodology is discussed in Section 4.3.2. ResNet proposed a 152-layer deep CNN, which won the 2015-ILSVRC competition. The architecture of the residual block of ResNet is shown in Fig. 7. ResNet, which was 20 and 8 times deeper than AlexNet and VGG respectively, showed less computational complexity than previously proposed Nets [21], [29]. He et al. empirically showed that ResNet with 50/101/152 layers has a lower error on the image classification task than the 34-layer plain Net. Moreover, ResNet gained a 28% improvement on the well-known COCO benchmark dataset [138]. The good performance of ResNet on image recognition and localization tasks showed that depth is of central importance for many visual recognition tasks.
4.2.3 Inception-V3, V4 and Inception-ResNet
Inception-V3, V4, and Inception-ResNet are improved versions of Inception-V1 and V2 [33], [99], [100]. The idea of Inception-V3 was to reduce the computational cost of deeper Nets without affecting generalization. For this purpose, Szegedy et al. replaced large-size filters (5x5 and 7x7) with small and asymmetric filters (1x7 and 1x5) and used a 1x1 convolution as a bottleneck prior to the large filters [100]. This makes the traditional convolution operation more like a cross-channel correlation. In earlier work, Lin et al. had exploited the potential of 1x1 filters in the NIN architecture [57]; Szegedy et al. [100] used the same concept in an intelligent way. In Inception-V3, a 1x1 convolutional operation maps the input data into 3 or 4 separate spaces that are smaller than the original input space, and then maps all correlations in these smaller 3D spaces via regular 3x3 or 5x5 convolutions. In Inception-ResNet, Szegedy et al. combined the power of residual learning and the inception block [31], [33]; in doing so, filter concatenation was replaced by the residual connection. Moreover, Szegedy et al. experimentally showed that Inception-V4 with residual connections (Inception-ResNet) has the same generalization power as plain Inception-V4, but with increased depth and width. However, they observed that Inception-ResNet converges more quickly than Inception-V4, which clearly shows that training with residual connections significantly accelerates the training of Inception networks.
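As an illustration of the 1x1 bottleneck followed by factorization of a large spatial filter into asymmetric convolutions, the following PyTorch sketch shows one possible branch. It is not the exact Inception-V3 module: the 1x7/7x1 factorization, channel sizes, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FactorizedInceptionBranch(nn.Module):
    """Illustrative branch: 1x1 bottleneck followed by an asymmetric
    factorization of a 7x7 receptive field into 1x7 and 7x1 convolutions."""
    def __init__(self, in_channels, bottleneck_channels, out_channels):
        super().__init__()
        # 1x1 convolution reduces channel dimensionality before the costly spatial filters
        self.bottleneck = nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1)
        # two cheaper asymmetric convolutions cover the same 7x7 receptive field
        self.conv_1x7 = nn.Conv2d(bottleneck_channels, bottleneck_channels,
                                  kernel_size=(1, 7), padding=(0, 3))
        self.conv_7x1 = nn.Conv2d(bottleneck_channels, out_channels,
                                  kernel_size=(7, 1), padding=(3, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bottleneck(x))
        x = self.relu(self.conv_1x7(x))
        return self.relu(self.conv_7x1(x))

# Example: instead of a 7x7 filter over 256 channels, the branch applies
# (1x7 + 7x1) over a reduced 64-channel embedding.
branch = FactorizedInceptionBranch(in_channels=256, bottleneck_channels=64, out_channels=96)
out = branch(torch.randn(1, 256, 35, 35))   # -> torch.Size([1, 96, 35, 35])
```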
4.2.4 ResNext
ResNext, also known as the Aggregated Residual Transform Network, is an improvement over the Inception Network [115]. Xie et al. exploited the concept of split, transform, and merge in a powerful but simple way by introducing a new term, cardinality [99]. Cardinality is an additional dimension, which refers to the size of the set of transformations [139], [140]. The Inception network not only improved the learning capability of conventional CNNs but also made the network resource-efficient. However, due to the use of diverse spatial embeddings (such as 3x3, 5x5, and 1x1 filters) in the transformation branches, each layer needs to be customized separately. In fact, ResNext derives its characteristic features from Inception, VGG, and ResNet [29], [31], [99]. ResNext utilized the deep homogeneous topology of VGG and simplified the GoogLeNet architecture by fixing the spatial resolution to 3x3 filters within the split, transform, and merge block; it also uses residual learning. The building block of ResNext is shown in Fig. 8. ResNext used multiple transformations within a split, transform, and merge block and defined these transformations in terms of cardinality. Xie et al. (2017) showed that increasing cardinality significantly improves performance. The complexity of ResNext was regulated by applying low-dimensional embeddings (1x1 filters) before the 3x3 convolutions, whereas training was optimized by using skip connections [141].
Fig. 8: ResNext building block.
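The split, transform, and merge block with cardinality can be written compactly as a grouped convolution. The PyTorch sketch below illustrates this idea; the bottleneck width, cardinality value, and absence of downsampling are assumptions chosen for brevity rather than the exact ResNext configuration.

```python
import torch
import torch.nn as nn

class ResNextBlock(nn.Module):
    """Sketch of an aggregated-transform block: 1x1 reduce -> grouped 3x3
    (groups = cardinality) -> 1x1 expand, with an identity skip connection."""
    def __init__(self, channels, bottleneck_width=4, cardinality=32):
        super().__init__()
        inner = bottleneck_width * cardinality              # e.g. 4 * 32 = 128
        self.reduce = nn.Conv2d(channels, inner, kernel_size=1, bias=False)
        # the grouped convolution realizes the C parallel 3x3 transformations
        self.transform = nn.Conv2d(inner, inner, kernel_size=3, padding=1,
                                   groups=cardinality, bias=False)
        self.expand = nn.Conv2d(inner, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(inner)
        self.bn2 = nn.BatchNorm2d(inner)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn1(self.reduce(x)))
        y = self.relu(self.bn2(self.transform(y)))
        y = self.bn3(self.expand(y))
        return self.relu(y + x)                             # residual (skip) connection

block = ResNextBlock(channels=256, bottleneck_width=4, cardinality=32)
out = block(torch.randn(1, 256, 56, 56))
```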
4.3 Multi-Path based CNNs
Training of deep networks is a challenging task and has been the subject of much of the recent research on deep Nets. Deep CNNs generally perform well on complex tasks. However, deeper networks may suffer from performance degradation and gradient vanishing or explosion problems, which are caused not by overfitting but by the increase in depth [53], [142]. The vanishing gradient problem results not only in a higher test error but also in a higher training error [142]–[144]. For training deeper Nets, the concept of multi-path or cross-layer connectivity was proposed [101], [107], [108], [113]. Multiple paths or shortcut connections systematically connect one layer to another by skipping some intermediate layers, allowing a specialized flow of information across the layers [145], [146]. Cross-layer connectivity partitions the network into several blocks. These paths also try to solve the vanishing gradient problem by making the gradient accessible to the lower layers. For this purpose, different types of shortcut connections are used, such as zero-padded, projection-based, dropout, skip, and 1x1 connections.
4.3.1 Highway Networks
The increase in the depth of a network improves performance mostly for complex problems, but it also makes training of the network difficult. In deep Nets, due to the large number of layers, the backpropagation of error may result in small gradient values at the lower layers. To solve this problem, Srivastava et al. [101] in 2015 proposed a new CNN architecture, named Highway Network, based on the idea of cross-layer connectivity. In a Highway Network, the unimpeded flow of information across layers is enabled by imparting two gating units within a layer (equation (5)). The idea of a gating mechanism was inspired by Long Short-Term Memory (LSTM) based Recurrent Neural Networks (RNNs) [147], [148]. The aggregation of information, obtained by combining the information of the $l$th layer with that of the previous $l-k$ layers, creates a regularizing effect, making gradient-based training of very deep networks easy. This enables training of a network with more than 100 layers, even as deep as 900 layers, with the Stochastic Gradient Descent (SGD) algorithm. Cross-layer connectivity for a Highway Network is defined in equations (5) and (6).
$$ y_l = H_l(x_i, W_H) \cdot T_g(x_i, W_T) + x_i \cdot C_g(x_i, W_C) \tag{5} $$
$$ C_g(x_i, W_C) = 1 - T_g(x_i, W_T) \tag{6} $$
In equation (5), $T_g$ refers to the transformation gate, which expresses the amount of the produced output, whereas $C_g$ is the carry gate. In the network, $H_l(x_i, W_H)$ represents the working of the hidden layers and the residual implementation, whereas $C_g(x_i, W_C) = 1 - T_g(x_i, W_T)$ behaves as a switch within a layer, which decides the path for the flow of information.
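A minimal sketch of a single highway layer, written here in PyTorch purely for illustration, may help make the gating mechanism concrete. The fully connected form, the layer size, and the negative initialization of the transform-gate bias are common conventions assumed here, not details taken from the source.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Sketch of a fully connected highway layer: the transform gate T_g decides
    how much transformed information H_l(x) passes through, while the carry gate
    C_g = 1 - T_g decides how much of the raw input x is carried forward."""
    def __init__(self, dim):
        super().__init__()
        self.hidden = nn.Linear(dim, dim)          # H_l(x_i, W_H)
        self.transform_gate = nn.Linear(dim, dim)  # T_g(x_i, W_T)
        # bias initialized negative so that carry behaviour dominates early in training
        nn.init.constant_(self.transform_gate.bias, -2.0)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        t = torch.sigmoid(self.transform_gate(x))
        return h * t + x * (1.0 - t)               # equations (5) and (6)

layer = HighwayLayer(dim=64)
out = layer(torch.randn(8, 64))
```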
4.3.2 ResNet
To address the problems faced during the training of deeper Nets, He et al. proposed ResNet [31] in 2015, in which they exploited the idea of bypass pathways used in Highway Networks. The mathematical formulation of ResNet is expressed in equations (7) and (8).
$$ g(x_i) = f(x_i) + x_i \tag{7} $$
$$ f(x_i) = g(x_i) - x_i \tag{8} $$
where $f(x_i)$ is the transformed signal and $x_i$ is the original input. The original input $x_i$ is added to $f(x_i)$ through a bypass pathway; in essence, $g(x_i) - x_i$ performs residual learning. ResNet introduced shortcut connections within layers to enable cross-layer connectivity, but in contrast to Highway Networks, these connections are data-independent and parameter-free. In Highway Networks, when a gated shortcut is closed, the layers represent non-residual functions; in ResNet, however, residual information is always passed and the identity shortcuts are never closed. Residual links (shortcut connections) speed up the convergence of deep networks, thus giving ResNet the ability to avoid the gradient diminishing problem. ResNet, with a depth of 152 layers (20 and 8 times deeper than AlexNet and VGG, respectively), won the 2015-ILSVRC championship [21]. Even with increased depth, ResNet exhibited lower computational complexity than VGG [29].
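The residual formulation in equations (7) and (8) can be sketched as a basic residual block. The following PyTorch snippet is only an illustrative sketch; the layer widths, batch normalization placement, and module names are assumptions rather than the exact configuration of ResNet.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a basic residual block: the stacked layers learn the residual
    f(x_i) = g(x_i) - x_i, and the parameter-free identity shortcut adds x_i back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # f(x_i)
        return self.relu(f + x)                                       # g(x_i) = f(x_i) + x_i

block = ResidualBlock(channels=64)
out = block(torch.randn(1, 64, 32, 32))
```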
4.3.3 DenseNets
In continuation of Highway Networks and ResNet, DenseNet was proposed to solve the vanishing gradient problem [31], [101], [107]. The problem with ResNet was that it explicitly preserves information through additive identity transformations, due to which many layers may contribute very little or no information. To address this problem, DenseNet used cross-layer connectivity, but in a modified fashion. DenseNet connects each layer to every other layer in a feed-forward fashion, so the feature maps of all preceding layers are used as inputs to all subsequent layers. This establishes $l(l+1)/2$ direct connections in DenseNet, as compared to the $l$ connections between a layer and its preceding layer in traditional CNNs, and imprints the effect of cross-layer depth-wise convolutions. As DenseNet concatenates the features of previous layers instead of adding them, the network may gain the ability to explicitly differentiate between information that is added to the network and information that is preserved. DenseNet has a narrow layer structure; however, it becomes parametrically expensive with an increase in the number of feature maps. The direct access of each layer to the gradients through the loss function improves the flow of information throughout the network. This incorporates a regularizing effect, which reduces overfitting on tasks with smaller training sets.
4.4 Width based Multi-Connection CNNs
During 2012-2015, the focus was largely on exploiting the power of depth, along with the effectiveness of multi-path regulatory connections, in network regularization [31], [101]. However, Kawaguchi et al. reported that the width of a network is also important [149]. The multilayer perceptron gained the advantage over the perceptron of mapping complex functions by making parallel use of multiple processing units within a layer. This suggests that width is as important a parameter in defining the principles of learning as depth. Lu et al. (2017) and Hanin and Sellke (2017) have recently shown that NNs with the ReLU activation function have to be wide enough to retain the universal approximation property as depth increases [150]. Moreover, a class of continuous functions on a compact set cannot be arbitrarily well approximated by an arbitrarily deep network if the maximum width of the network is not larger than the input dimension [135], [151]. Although stacking multiple layers (increasing depth) may learn diverse feature representations, it does not necessarily increase the learning power of the NN. One major problem linked with deep architectures is that some layers or processing units may not learn useful features. To tackle this problem, the focus of research shifted from deep and narrow architectures towards comparatively shallower but wider architectures.
4.4.1 WideResNet
The main drawback associated with deep residual networks is the feature reuse problem, in which some feature transformations or blocks may contribute very little to learning [152]. This problem was addressed by WideResNet [34]. Zagoruyko and Komodakis suggested that the main learning potential of deep residual networks is due to the residual units, whereas depth has a supplementary effect. WideResNet exploited the power of the residual blocks by making ResNet wide rather than deep [31]. WideResNet increased the width by introducing an additional factor k, which controls the width of the network. WideResNet showed that widening the layers may provide a more effective way of improving performance than making the residual networks deeper. Although deep residual networks improved representational capacity, they have some demerits, such as time-intensive training, inactivation of many feature maps (the feature reuse problem), and gradient vanishing and exploding problems. He et al. addressed the feature reuse problem by incorporating dropout in residual blocks to regularize the network in an effective way [31]. Similarly, Huang et al. introduced the concept of stochastic depth by exploiting dropout to solve the vanishing gradient and slow learning problems [105]. It was observed that even a fractional improvement in performance may require the addition of many new layers. An empirical study showed that WideResNet had twice the number of parameters of ResNet, but can be trained in a better way than the deeper networks [34]. The wide residual network was based on the observation that almost all architectures before residual networks, including the most successful Inception and VGG, were wider than ResNet. In WideResNet, learning is made effective by adding dropout in-between the convolutional layers rather than inside a residual block.
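As an illustration of the widening factor k and of placing dropout between the convolutional layers of a block, the following PyTorch sketch shows one possible wide residual block. The base width, the value of k, and the dropout rate are illustrative assumptions, not the configuration reported in [34].

```python
import torch
import torch.nn as nn

class WideResidualBlock(nn.Module):
    """Sketch of a wide residual block: channel width is scaled by the widening
    factor k, and dropout is placed between the two convolutions of the block."""
    def __init__(self, base_width=16, k=8, dropout_rate=0.3):
        super().__init__()
        width = base_width * k                  # e.g. 16 * 8 = 128 channels
        self.bn1 = nn.BatchNorm2d(width)
        self.conv1 = nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
        self.dropout = nn.Dropout(p=dropout_rate)
        self.bn2 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.conv1(self.relu(self.bn1(x)))
        y = self.dropout(y)                     # dropout between the convolutional layers
        y = self.conv2(self.relu(self.bn2(y)))
        return y + x                            # identity shortcut

block = WideResidualBlock(base_width=16, k=8, dropout_rate=0.3)
out = block(torch.randn(1, 128, 32, 32))
```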