4 Architectural Innovations in CNN
Different improvements in CNN architecture have been made from 1989 to date. These improvements can be categorized as parameter optimization, regularization, structural reformulation, etc. However, it is observed that the main thrust in CNN performance improvement came from the restructuring of processing units and the design of new blocks. Most of the innovations in CNN architectures have been made in relation to depth and spatial exploitation. Depending upon the type of architectural modification, CNNs can be broadly categorized into seven different classes, namely: spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting, and attention based CNNs. The taxonomy of deep CNN architectures presented in Fig. 4 shows these seven classes, while their summary is presented in Table 1.
Fig. 4: Taxonomy of deep CNN architectures.
Table 1 Performance comparison of the recent architectures of different categories. Top-5 error rate is reported for all architectures.
4.1 Spatial Exploitation based CNNs
CNNs have a large number of parameters and hyperparameters, such as weights, biases, number of processing units (neurons), number of layers, filter size, stride, learning rate, activation function, etc. [119], [120]. As the convolutional operation considers the neighborhood (locality) of input pixels, different levels of correlation can be explored by using different filter sizes. Consequently, in the early 2000s, researchers exploited spatial filters to improve performance; filters of various sizes were explored to evaluate their impact on the learning of the network. Filters of different sizes encapsulate different levels of granularity; usually, small filters extract fine-grained information and large filters extract coarse-grained information. In this way, by adjusting the filter size, a CNN can perform well on both coarse- and fine-grained details.
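To make the effect of filter size concrete, the following minimal sketch (assuming PyTorch, with illustrative channel counts) applies a small and a large filter to the same input; the small kernel responds to a fine local neighborhood, whereas the larger kernel aggregates a coarser region.

```python
# Illustrative sketch (PyTorch assumed): the same input convolved with a small
# and a large kernel. The 3x3 filter responds to fine-grained local patterns,
# while the 9x9 filter aggregates information over a coarser neighborhood.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                        # a single RGB image (batch, channels, H, W)

fine   = nn.Conv2d(3, 16, kernel_size=3, padding=1)    # small receptive field
coarse = nn.Conv2d(3, 16, kernel_size=9, padding=4)    # large receptive field

print(fine(x).shape, coarse(x).shape)                  # both: torch.Size([1, 16, 224, 224])
```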
4.1.1 LeNet
LeNet was proposed by LeCun in 1998 [65]. It is famous due to its historical importance, as it was the first CNN that showed state-of-the-art performance on handwritten digit recognition tasks. It has the ability to classify digits without being affected by small distortions, rotation, and variation of position and scale. LeNet is a feed-forward NN that consists of five alternating convolutional and pooling layers, followed by two fully connected layers. In the early 2000s, GPUs were not commonly used to speed up training, and even CPUs were slow [121]. The main limitation of the traditional multilayer fully connected NN was that it considered each pixel as a separate input and applied a transformation to it, which was a huge computational burden, especially at that time [122]. LeNet exploited the underlying property of images that neighboring pixels are correlated with each other and that similar features are distributed across the entire image. Therefore, convolution with learnable parameters is an effective way to extract similar features at multiple locations with few parameters. This changed the conventional view of training, in which each pixel was considered as an input feature separate from its neighborhood and the correlation among neighboring pixels was ignored. LeNet was the first CNN architecture that not only reduced the number of parameters and the amount of computation but was also able to learn features automatically.
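As an illustration of the layout described above, the following is a minimal LeNet-style sketch (assuming PyTorch); the layer widths are indicative rather than taken verbatim from the original paper.

```python
# A minimal LeNet-style sketch (PyTorch assumed): five alternating convolution
# and pooling layers followed by two fully connected layers. Sizes are illustrative.
import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # 32x32 -> 28x28
            nn.AvgPool2d(2),                               # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # 14x14 -> 10x10
            nn.AvgPool2d(2),                               # 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # 5x5 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = LeNetLike()(torch.randn(1, 1, 32, 32))   # -> shape (1, 10)
```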
4.1.2 AlexNet
Although LeNet [65] began the history of deep CNNs, at that time CNNs were limited to handwritten digit recognition tasks and did not scale well to all classes of images. AlexNet [21] is considered the first deep CNN architecture that showed groundbreaking results for image classification and recognition tasks. AlexNet was proposed by Krizhevsky et al., who enhanced the learning capacity of the CNN by making it deeper and by applying a number of parameter optimization strategies [21]. The basic architectural design of AlexNet is shown in Fig. 5. In the early 2000s, hardware limitations curtailed the learning capacity of deep CNN architectures by restricting them to a small size. To benefit from the representational capacity of CNNs, AlexNet was trained in parallel on two NVIDIA GTX 580 GPUs to overcome these hardware shortcomings. In AlexNet, the feature extraction stages were extended from 5 (LeNet) to 7 to make the CNN applicable to diverse categories of images. Although depth generally improves generalization for different resolutions of images, the main drawback associated with an increase in depth is overfitting. To address this challenge, Krizhevsky et al. (2012) exploited the idea of Hinton [56], [123], whereby their algorithm randomly skips some transformational units during training (dropout) to force the model to learn features that are more robust. In addition, ReLU was employed as a non-saturating activation function to improve the convergence rate by alleviating the problem of vanishing gradients to some extent [53], [124]. Overlapping subsampling and local response normalization were also applied to improve generalization by reducing overfitting. Other adjustments were the use of large filters (11x11 and 5x5) at the initial layers, compared to previously proposed networks. Due to its efficient learning approach, AlexNet has significant importance in the new generation of CNNs and started a new era of research into the architectural advancement of CNNs.
Fig. 5: Basic layout of AlexNet architecture.
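The main ingredients discussed above (a large initial filter, ReLU as a non-saturating activation, overlapping pooling, and dropout) can be sketched as follows; this is an illustration assuming PyTorch, with illustrative channel sizes and a single fully connected layer, not the full AlexNet.

```python
# A hedged sketch (PyTorch assumed) of the AlexNet-style ingredients: an 11x11
# filter with stride 4 in the first layer, ReLU, overlapping max pooling
# (kernel 3, stride 2), and dropout applied before the fully connected stage.
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),   # large initial filter
    nn.ReLU(inplace=True),                        # non-saturating activation
    nn.MaxPool2d(kernel_size=3, stride=2),        # overlapping subsampling
)
head = nn.Sequential(
    nn.Dropout(p=0.5),                            # randomly drop units during training
    nn.Linear(96 * 26 * 26, 1000),                # size assumes a 224x224 input
)

x = torch.randn(1, 3, 224, 224)
y = head(torch.flatten(stem(x), 1))               # -> shape (1, 1000)
```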
4.1.3 ZefNet
The learning mechanism of CNNs before 2013 was largely based on trial and error, without knowing the exact reason behind the improvement. This lack of understanding limited the performance of deep CNNs on complex images. In 2013, Zeiler and Fergus proposed an interesting multilayer Deconvolutional NN (DeconvNet), which became famous as ZefNet [28]. ZefNet was developed to quantitatively visualize network performance. The idea of visualizing network activity was to monitor CNN performance by interpreting neuron activations. In a previous study, Erhan et al. (2009) exploited the same idea and optimized the performance of Deep Belief Networks (DBNs) by visualizing the features of hidden layers [125]. In the same manner, Le et al. (2011) evaluated the performance of a deep unsupervised autoencoder (AE) by visualizing the image classes generated by the output neurons [126]. DeconvNet works in the same manner as a forward-pass CNN but reverses the order of the convolutional and pooling operations. This reverse mapping projects the output of convolutional layers back to visually perceptible image patterns, and consequently gives a neuron-level interpretation of the internal feature representation learned at each layer [127], [128]. The objective of ZefNet was to monitor the learning scheme during training and thus use the findings to diagnose potential problems associated with the model. This idea was experimentally validated on AlexNet using DeconvNet, which showed that only a few neurons were active, while other neurons were dead (inactive), in the first and second layers of the network. Moreover, it showed that the features extracted by the second layer exhibited aliasing artifacts. Based on these findings, Zeiler and Fergus adjusted the CNN topology and performed parameter optimization. Zeiler and Fergus maximized the learning of the CNN by reducing both the filter size and the stride to retain the maximum number of features in the first two convolutional layers. This readjustment of the CNN topology resulted in a performance improvement, which suggested that feature visualization can be used to identify design shortcomings and for timely adjustment of parameters.
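ZefNet relies on DeconvNet-based reverse mapping; as a much simpler, purely illustrative proxy for spotting inactive (dead) neurons, one can record activations with forward hooks and inspect per-channel statistics, as sketched below (assuming PyTorch; the model, layer choice, and threshold are placeholders, not the authors' method).

```python
# Illustrative sketch (PyTorch assumed): record ReLU activations with forward
# hooks and report the fraction of channels that barely fire across a batch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)

activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for idx, layer in enumerate(model):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(save_activation(f"relu_{idx}"))

model(torch.randn(8, 3, 64, 64))                      # one forward pass
for name, act in activations.items():
    # fraction of channels whose mean activation is (near) zero across the batch
    dead = (act.mean(dim=(0, 2, 3)) < 1e-6).float().mean().item()
    print(f"{name}: {dead:.0%} of channels inactive")
```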
4.1.4 VGG
With the successful use of CNNs for image recognition, Simonyan et al. proposed a simple and effective design principle for CNN architectures. Their architecture, named VGG, was modular in its layer pattern [29]. VGG was made 19 layers deep, compared to AlexNet and ZefNet, to investigate the relationship of depth with the representational capacity of the network [21], [28]. ZefNet, a frontline network of the 2013-ILSVRC competition, suggested that small filters can improve the performance of CNNs. Based on these findings, VGG replaced the 11x11 and 5x5 filters with a stack of 3x3 filter layers and experimentally demonstrated that the concurrent placement of 3x3 filters can induce an effective receptive field equivalent to that of large filters (5x5 and 7x7). The use of small filters provides the additional benefit of low computational complexity by reducing the number of parameters. These findings set a new trend in research to work with smaller filters in CNNs. VGG regulates the complexity of the network by placing 1x1 convolutions in between the convolutional layers, which, in addition, learn a linear combination of the resultant feature maps. For the tuning of the network, max pooling is placed after the convolutional layers, while padding is performed to maintain the spatial resolution [46]. VGG showed good results both for image classification and for localization problems. Although VGG was not at the top of the 2014-ILSVRC competition, it gained fame due to its simplicity, homogeneous topology, and increased depth. The main limitation of VGG was its high computational cost: even with the use of small filters, VGG suffered from a high computational burden due to its approximately 140 million parameters.
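The receptive-field and parameter argument above can be checked with a small back-of-the-envelope computation (illustrative Python; C denotes the number of input and output channels, biases ignored): two stacked 3x3 layers cover a 5x5 region with fewer parameters than a single 5x5 filter, and three stacked 3x3 layers similarly replace a 7x7 filter.

```python
# A small check of the claim above (illustrative): stacking stride-1 3x3
# convolutions grows the effective receptive field like a single larger filter,
# while using fewer parameters (C input channels, C output channels, no biases).
def receptive_field(num_3x3_layers):
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2                       # each stride-1 3x3 layer adds 2 to the receptive field
    return rf

def params_stacked_3x3(n, C):
    return n * 3 * 3 * C * C

def params_single(k, C):
    return k * k * C * C

C = 64
print(receptive_field(2), params_stacked_3x3(2, C), params_single(5, C))  # 5, 73728, 102400
print(receptive_field(3), params_stacked_3x3(3, C), params_single(7, C))  # 7, 110592, 200704
```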
4.1.5 GoogleNet
GoogleNet was the winner of the 2014-ILSVRC competition and is also known as Inception-V1. The main objective of the GoogleNet architecture was to achieve high accuracy with a reduced computational cost [99]. It introduced the new concept of the inception block in CNNs, which incorporates multi-scale convolutional transformations using a split, transform, and merge idea. The architecture of the inception block is shown in Fig. 6. This block encapsulates filters of different sizes (1x1, 3x3, and 5x5) to capture spatial information at different scales (both fine and coarse grained). In GoogleNet, conventional convolutional layers are replaced by small blocks, similar to the idea of substituting each layer with a micro NN, as proposed in the Network in Network (NIN) architecture [57]. The exploitation of the split, transform, and merge idea by GoogleNet helped in addressing a problem related to learning the diverse types of variations present in different images of the same category. In addition to improving the learning capacity, the focus of GoogleNet was to make the CNN parameter efficient. GoogleNet regulates the computation by adding a bottleneck layer with a 1x1 convolutional filter before employing large kernels. It used sparse connections (not all output feature maps are connected to all input feature maps) to overcome the problem of redundant information, and reduced cost by omitting feature maps (channels) that were not relevant. Furthermore, connection density was reduced by using global average pooling at the last layer, instead of a fully connected layer. These parameter tunings caused a significant decrease in the number of parameters, from 40 million to 5 million. Other regulatory factors applied were batch normalization and the use of RMSprop as an optimizer [129]. GoogleNet also introduced the concept of auxiliary learners (auxiliary classifiers) to speed up the convergence rate. However, the main drawback of GoogleNet was its heterogeneous topology, which needs to be customized from module to module. Another limitation of GoogleNet was the representation bottleneck that drastically reduces the feature space in the next layer and thus may sometimes lead to the loss of useful information.
Fig. 6: Basic architecture of inception block
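A hedged sketch of an inception-style block reflecting the split, transform, and merge idea is given below (assuming PyTorch; branch widths are illustrative and not those of the original GoogleNet modules).

```python
# An inception-style block sketch (PyTorch assumed): parallel 1x1, 3x3, and 5x5
# branches plus pooling, with 1x1 bottleneck convolutions before the larger
# kernels; the branch outputs are merged along the channel axis.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(                      # 1x1 bottleneck, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(                      # 1x1 bottleneck, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # split into parallel transforms, then merge along the channel axis
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

y = InceptionBlock(192)(torch.randn(1, 192, 28, 28))       # -> (1, 256, 28, 28)
```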
4.2 Depth based CNNs
Deep CNN architectures are based on the assumption that with an increase in depth, the network can better approximate the target function with a larger number of nonlinear mappings and improved feature representations [130]. Network depth has played an important role in the success of supervised training. Theoretical studies have shown that deep networks can represent certain classes of functions more efficiently than shallow architectures [131]. Csáji presented the universal approximation theorem in 2001, which states that a single hidden layer is sufficient to approximate any function, but this comes at the cost of exponentially many neurons, often making it computationally infeasible [132]. In this regard, Bengio and Delalleau [133] suggested that deeper networks have the potential to maintain the expressive power of the network at a reduced cost [134]. In 2013, Bengio et al. empirically showed that deep networks are computationally more efficient for complex tasks [84], [135]. Inception and VGG, which showed the best performance in the 2014-ILSVRC competition, further strengthened the idea that depth is an essential dimension in regulating the learning capacity of networks [29], [33], [99], [100].