5.5 Speech Recognition
Speech is a primary communication link between human beings. In the ML field, speech recognition models did not show promising results before adequate hardware resources became available. With advances in hardware, training DNNs on large training data became possible. Deep CNNs are mostly considered the best option for image classification; however, recent studies have shown that they also perform well on speech recognition tasks. Hamid et al. reported a CNN-based, speaker-independent speech recognition system [199]. Experimental results showed a ten percent reduction in error rate compared with earlier reported methods [200], [201]. In another work, various CNN architectures, based on either full or limited weight sharing within the convolution layer, were explored [202]. Furthermore, the performance of a CNN was also evaluated after initializing the whole network using a pre-training phase [200]. Experimental results showed that almost all of the explored architectures yield good performance on phone and vocabulary recognition tasks.
6 CNN Challenges
Deep CNNs have achieved good performance on data that is either of a time-series nature or follows a grid-like topology. However, there are also tasks on which deep CNN architectures still face challenges. In vision-related tasks, one shortcoming of CNNs is that they generally perform poorly when used to estimate the pose, orientation, and location of an object. In 2012, AlexNet addressed this problem to some extent by using data augmentation. Data augmentation can help a CNN learn diverse internal representations, which ultimately leads to improved performance. Similarly, Hinton reported that lower layers should hand over their knowledge only to the relevant neurons of the next layer. In this regard, Hinton proposed the Capsule Network approach [203], [204].
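As a minimal sketch of the data augmentation idea mentioned above, the following Python code generates randomly cropped and horizontally flipped copies of an image. The toy 4x4 image, the crop size, and the function names (horizontal_flip, random_crop, augment) are assumptions made for illustration only; they are not taken from AlexNet or from any particular library.

```python
import random

def horizontal_flip(image):
    """Mirror a 2-D image (a list of rows) left to right."""
    return [list(reversed(row)) for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def augment(image, crop_h, crop_w, rng):
    """Return a randomly cropped copy that is flipped with probability 0.5."""
    out = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out

# Toy 4x4 "image" whose pixel values encode their position (illustrative only).
image = [[4 * r + c for c in range(4)] for r in range(4)]
rng = random.Random(0)
samples = [augment(image, 3, 3, rng) for _ in range(5)]
```

Each call to augment yields a slightly different view of the same object, so the network sees the object at several positions and orientations during training.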
In another work, Szegedy et al. showed that noisy image data can increase the misclassification error of a CNN [205]. Adding a small quantity of noise to the input image can fool the network in such a way that the model classifies the original and its slightly perturbed version differently.
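The sensitivity to small perturbations described above can be illustrated with a toy linear classifier. This is only a sketch under strong assumptions: the classifier, its weights, and the 100-dimensional input are hypothetical, and the sign-based perturbation used here is a simple stand-in, not the procedure of Szegedy et al. [205]. It shows how a change of at most 0.02 per input value can accumulate over many dimensions and flip the predicted class.

```python
def score(weights, bias, x):
    """Linear decision score w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Class 1 if the score is positive, otherwise class 0."""
    return 1 if score(weights, bias, x) > 0 else 0

def sign(v):
    return (v > 0) - (v < 0)

def perturb(weights, bias, x, epsilon):
    """Shift every input value by at most epsilon against the predicted class."""
    direction = -1 if predict(weights, bias, x) == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical 100-dimensional input and classifier (illustrative values).
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
bias = 0.0
x = [0.52 if i % 2 == 0 else 0.50 for i in range(100)]

x_adv = perturb(weights, bias, x, epsilon=0.02)
original_label = predict(weights, bias, x)
perturbed_label = predict(weights, bias, x_adv)
```

Here the original input is assigned to class 1 and the perturbed input to class 0, although no input value changes by more than 0.02.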
Researchers have discussed the performance of CNNs on different ML tasks at length. Some of the challenges faced during the training of deep CNN models are given below:
Deep NNs generally behave like a black box and thus may lack interpretability and explainability. Therefore, it is sometimes difficult to verify them, and in vision-related tasks, CNNs may offer little robustness against noise and other alterations to images.
Each layer of a CNN automatically tries to extract better, problem-specific features for the task. However, for some tasks, it is important to know the nature of the features extracted by the deep CNN before classification. Feature visualization in CNNs can help in this direction.
Deep CNNs are based on a supervised learning mechanism, and therefore a large amount of annotated data is required for proper learning. In contrast, humans can learn and generalize from a few examples.
Hyperparameter selection strongly influences the performance of a CNN. A small change in the hyperparameter values can affect the overall performance. That is why careful selection of hyperparameters is a major design issue that needs to be addressed through a suitable optimization strategy.
Efficient training of CNNs demands powerful hardware resources such as GPUs. However, how to employ CNNs efficiently in embedded and smart devices still needs to be explored. A few applications of deep learning in embedded systems are wound intensity correction, law enforcement in smart cities, etc. [206]–[208].
7 Future Directions
The exploitation of innovative ideas in CNN architectural design has changed the direction of research, especially in MV. The good performance of CNNs on grid-like topological data makes them a powerful representation model for image data. CNN architecture design is a promising research field, and in the future CNNs are likely to be among the most widely used AI techniques.
Ensemble learning [209] is one of the prospective areas of research in CNNs. Combining multiple, diverse architectures can help a model generalize better across diverse categories of images by extracting different levels of semantic representations. Similarly, concepts such as batch normalization, dropout, and new activation functions are also worth mentioning.
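One simple way to combine several architectures, shown here only as a sketch, is to average the class probabilities that each model produces and pick the class with the highest mean. The logits below are made-up numbers standing in for the outputs of three hypothetical CNNs, and ensemble_predict is an illustrative name; other combination rules, such as voting or stacking, are equally possible.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(logits_per_model):
    """Average the per-model class probabilities and return (label, mean_probs)."""
    probs = [softmax(logits) for logits in logits_per_model]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return mean.index(max(mean)), mean

# Made-up logits of three hypothetical models for one image and three classes.
logits_per_model = [
    [2.0, 1.0, 0.1],   # model A favours class 0
    [1.0, 2.0, 0.3],   # model B favours class 1
    [1.8, 1.2, 0.2],   # model C favours class 0
]
label, mean_probs = ensemble_predict(logits_per_model)
```

In this example the individual models disagree, and the averaged probabilities select class 0.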
The potential of a CNN as a generative learner has been exploited in image segmentation tasks, where it has shown good results [210]. Exploiting the generative learning capabilities of a CNN at the supervised feature extraction stages (learning of filters using backpropagation) can boost the representation power of the model. Similarly, new paradigms are needed that can enhance the learning capacity of a CNN by incorporating informative feature maps that are learnt using auxiliary learners at the intermediate stages of the CNN [36].
In the human visual system, attention is one of the important mechanisms for capturing information from images. The attention mechanism operates in such a way that it not only extracts the essential information from an image, but also stores its contextual relation with other components of the image [211], [212]. In the future, research is expected to move in a direction that preserves the spatial relevance of an object along with its discriminating features at later stages of learning.
The learning capacity of a CNN is enhanced by increasing the size of the network, which has been made possible by advances in hardware processing units and computational resources. However, training deep, high-capacity architectures imposes a significant overhead on memory usage and computational resources. This calls for considerable improvements in hardware that can accelerate research in CNNs. The main concern with CNNs is their run-time applicability. Moreover, the use of CNNs is hindered on small hardware, especially mobile devices, because of their high computational cost. In this regard, hardware accelerators are needed to reduce both execution time and power consumption [213]. Some very interesting accelerators have already been proposed, such as Application Specific Integrated Circuits, Eyeriss, and the Google Tensor Processing Unit [214]. Moreover, different techniques have been applied to save hardware resources in terms of chip area and power, such as reducing the precision of operands, ternary quantization, and reducing the number of matrix multiplication operations. It is now also time to redirect research towards hardware-oriented approximation models [215].
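To make the idea of ternary quantization concrete, the following sketch maps each full-precision weight to one of three values, -alpha, 0, or +alpha, so that a layer can be stored as two-bit codes plus a single scale. The threshold rule (0.7 times the mean absolute weight) is a commonly used heuristic and is an assumption here, as are the function name ternarize and the example weights; the works cited above may use different schemes.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Quantize weights to codes in {-1, 0, +1} plus one shared scale alpha.

    threshold_ratio is an assumed heuristic: weights whose magnitude is below
    threshold_ratio * mean(|w|) are set to zero.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    # alpha is the mean magnitude of the weights that survive the threshold.
    alpha = sum(kept) / len(kept) if kept else 0.0
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    return codes, alpha

def dequantize(codes, alpha):
    """Recover the approximate weights from the ternary codes."""
    return [alpha * c for c in codes]

# Illustrative full-precision weights of one small filter.
weights = [0.9, -0.8, 0.05, -0.1, 0.7, 0.02]
codes, alpha = ternarize(weights)
approx = dequantize(codes, alpha)
```

For these weights the codes are [1, -1, 0, 0, 1, 0] with alpha equal to 0.8, so the three large weights are kept at a common magnitude and the three small ones are dropped.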
Deep CNNs have a large number of hyperparameters, such as the activation function, kernel size, number of neurons per layer, and arrangement of layers. The number of hyperparameters and the time needed to evaluate each setting make tuning quite difficult in the context of deep learning. Hyperparameter tuning is a tedious, intuition-driven task that cannot be defined via an explicit formulation. In this regard, genetic algorithms can be used to optimize the hyperparameters automatically, by searching both in a random fashion and by directing the search using previous results [216]–[218].
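A genetic search over hyperparameters can be sketched as follows. Random mutation provides the random part of the search, while selection of the fittest configurations and crossover between them direct the search using previous results. Everything specific in this sketch is an assumption: the search space, the population settings, and above all toy_fitness, which is a hypothetical stand-in for the validation accuracy that would normally be obtained by training a CNN with each configuration.

```python
import random

# Hypothetical search space (illustrative values only).
SEARCH_SPACE = {
    "kernel_size": [3, 5, 7],
    "num_filters": [16, 32, 64, 128],
    "activation": ["relu", "tanh", "sigmoid"],
}

def toy_fitness(config):
    """Stand-in for validation accuracy; in practice this would train a CNN."""
    score = 0.5 if config["activation"] == "relu" else 0.0
    score -= abs(config["kernel_size"] - 3) / 4
    score -= abs(config["num_filters"] - 64) / 112
    return score

def random_config(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def crossover(a, b, rng):
    """Take each hyperparameter from one of the two parents."""
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}

def mutate(config, rng, rate=0.2):
    """Resample each hyperparameter with probability `rate`."""
    return {name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
            for name, value in config.items()}

def evolve(fitness, generations=20, population_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]   # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history

best, history = evolve(toy_fitness)
```

Because the fitter half of the population is carried over unchanged, the best fitness recorded in history never decreases from one generation to the next.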
The learning capacity of a deep CNN model correlates strongly with the size of the model. However, the capacity of deep CNN models is restricted by hardware resources [219]. To overcome hardware limitations, pipeline parallelism can be exploited to scale up deep CNN training. A group at Google has proposed a distributed machine learning library, GPipe [220], that uses synchronous stochastic gradient descent and pipeline parallelism for training. In the future, pipelining can be used to accelerate the training of large models and to scale performance without tuning hyperparameters.
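The two ingredients named above can be sketched in a few lines. The first function lists, for an idealized forward pass, which pipeline stage works on which micro-batch at each time step. The second shows that gradients accumulated over micro-batches equal the gradient of the whole mini-batch, which is what allows a single synchronous update. This is a conceptual sketch only: it does not use or reproduce the GPipe API, the one-parameter linear model and squared-error loss are assumptions chosen for brevity, and all function names are hypothetical.

```python
def forward_schedule(num_stages, num_micro):
    """Idealized forward pipeline: stage s handles micro-batch m at step s + m."""
    steps = []
    for t in range(num_stages + num_micro - 1):
        steps.append([(s, t - s) for s in range(num_stages)
                      if 0 <= t - s < num_micro])
    return steps

def gradient_sum(w, b, batch):
    """Summed squared-error gradients of the model y = w*x + b over a batch."""
    gw = sum(2 * (w * x + b - y) * x for x, y in batch)
    gb = sum(2 * (w * x + b - y) for x, y in batch)
    return gw, gb

def split(batch, num_micro):
    """Split a mini-batch into equally sized micro-batches (assumes divisibility)."""
    size = len(batch) // num_micro
    return [batch[i * size:(i + 1) * size] for i in range(num_micro)]

def accumulated_gradient(w, b, batch, num_micro):
    """Accumulate gradients over micro-batches, then average once at the end."""
    gw = gb = 0.0
    for micro in split(batch, num_micro):
        micro_gw, micro_gb = gradient_sum(w, b, micro)
        gw += micro_gw
        gb += micro_gb
    return gw / len(batch), gb / len(batch)

# Illustrative mini-batch of eight (x, y) pairs and arbitrary parameters.
batch = [(float(i), 3.0 * i + 1.0) for i in range(8)]
w, b = 0.5, 0.0

schedule = forward_schedule(num_stages=4, num_micro=8)
full_gw, full_gb = (g / len(batch) for g in gradient_sum(w, b, batch))
acc_gw, acc_gb = accumulated_gradient(w, b, batch, num_micro=4)
```

With four stages and eight micro-batches, the idealized forward pass takes 11 time steps instead of the 32 that would be needed if each micro-batch passed through all stages before the next one started.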
8 Conclusion
CNNs have made remarkable progress, especially in vision-related tasks, and have thus revived the interest of scientists in ANNs. In this context, several research works have been carried out to improve the performance of CNNs on vision-related tasks. The advancements in CNNs can be categorized in different ways, including activation, loss function, optimization, regularization, learning algorithms, and restructuring of processing units. This paper reviews advancements in CNN architectures, especially those based on the design patterns of the processing units, and proposes a taxonomy for CNN architectures. In addition to categorizing CNNs into different classes, this paper also covers the history of CNNs, their applications, challenges, and future directions.
The learning capacity of CNNs has improved significantly over the years through the exploitation of depth and other structural modifications. Recent literature shows that the main boost in CNN performance has been achieved by replacing the conventional layer structure with blocks. Nowadays, one of the paradigms of research in CNN architectures is the development of new and effective block architectures. These blocks act as auxiliary learners in a network, improving the overall performance by exploiting spatial or feature-map information or by boosting input channels. They play a significant role in boosting CNN performance by making the learning problem-aware. Moreover, the block-based architecture of CNNs encourages learning in a modular fashion, thereby making the architecture simpler and more understandable. The concept of the block as a structural unit is going to persist and further enhance CNN performance. Additionally, the idea of attention and the exploitation of channel information in addition to spatial information within a block is expected to gain more importance.
Acknowledgments
We thank the Pattern Recognition Lab at DCIS, and PIEAS, for providing us with computational facilities.