3.1 Multimodal Fusion in HAR

The goal of modality fusion is to exploit the complementary strengths of different data modalities to achieve better recognition performance. Existing multimodal fusion schemes fall into two main categories: (1) score fusion, which combines the scores output by the different modalities, e.g., by a weighted average or by learning a score-fusion model; and (2) feature fusion, which combines the features extracted from the different modalities. Data fusion (fusing the inputs of different modalities before feature extraction) can be regarded as a form of feature fusion, since the raw data of a modality can be viewed as that modality's raw features. Depending on the input modalities, existing multimodal fusion methods can be roughly divided into fusion among visual modalities and fusion between visual and non-visual modalities; the two categories are reviewed in detail below.
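The difference between the two schemes can be made concrete with a minimal PyTorch sketch (not taken from the survey; the encoders, feature dimensions, class count, and fusion weight below are hypothetical placeholders): score fusion averages the per-modality class probabilities, while feature fusion concatenates per-modality features before a shared classifier.

```python
import torch
import torch.nn as nn

# Hypothetical per-modality encoders (e.g., a CNN for RGB, another for depth),
# abstracted here as small MLPs over pre-extracted clip-level features.
rgb_encoder   = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
depth_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())

num_classes = 60
rgb_head    = nn.Linear(256, num_classes)      # per-modality classifiers (score fusion)
depth_head  = nn.Linear(256, num_classes)
joint_head  = nn.Linear(256 * 2, num_classes)  # shared classifier (feature fusion)

rgb_feat   = rgb_encoder(torch.randn(8, 512))    # a batch of 8 clips
depth_feat = depth_encoder(torch.randn(8, 512))

# (1) Score fusion: combine per-modality class probabilities, e.g. a weighted average.
w = 0.6  # fusion weight; could also be learned
score_fused = w * rgb_head(rgb_feat).softmax(-1) + (1 - w) * depth_head(depth_feat).softmax(-1)

# (2) Feature fusion: concatenate the features, then classify jointly.
feat_fused = joint_head(torch.cat([rgb_feat, depth_feat], dim=-1)).softmax(-1)

print(score_fused.shape, feat_fused.shape)  # both torch.Size([8, 60])
```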
Fusion among visual modalities

(1) RGB + depth: RGB and depth capture appearance information and 3D shape information respectively, so the two modalities are strongly complementary. [25] proposed a four-stream CNN in which one stream takes RGB data and the remaining three streams take depth motion maps captured from three different viewpoints, with score fusion as the fusion strategy. [26] represents the RGB and depth data as pairs of RGB and depth dynamic images, extracts features with a cooperatively trained CNN, and trains by jointly optimizing a ranking loss and a softmax loss. [27] also proposed a multi-stream hybrid network that uses a CNN and a 3D ConvLSTM to extract features from RGB and depth maps respectively, and then fuses the cross-modal information via Canonical Correlation Analysis.

(2) RGB + skeleton: the skeleton modality provides body position and joint motion information, which is likewise complementary to RGB. [28] proposed a two-stream deep network whose streams, a CNN and an RNN, process RGB and skeleton data respectively; both feature fusion and score fusion were tried, and feature fusion was found to perform better. [29] designed a three-stream 3D CNN that processes human pose, motion, and RGB images separately and fuses the three streams with a Markov chain for action classification. [30] proposed a spatio-temporal LSTM network that can effectively fuse RGB and skeleton features inside the LSTM unit.

(3) Depth + skeleton: [31] uses the relative geometric relations between each body part and the others as skeleton features, and depth image patches around different body parts as appearance features, to encode body-object and part-part relations for reliable HAR. [32] proposed a three-stream 2D CNN that classifies three different handcrafted features extracted from depth and skeleton sequences, and then applies a score-fusion module to obtain the final classification result.

(4) RGB + depth + skeleton: most of these methods are extensions of the three categories above. For example, [33] studies the correlations between modalities, decomposes them into correlated and independent components, and then outputs the classification result with a structured sparsity-based classifier. [34] extracts a temporal feature map from each modality and then concatenates these feature maps along the modality dimension to capture time-varying information across the RGB, skeleton, and depth modalities. [35] proposed a five-stream network whose inputs are a motion history image, a depth motion map, and three skeleton images generated from the RGB, depth, and skeleton sequences respectively.

(5) Fusion of other visual modalities: these methods follow essentially the same ideas as above. For example, [36] proposed a multimodal fusion model based on TSN [37], in which the RGB, depth, infrared, and optical-flow sequences are each classified with a TSN, and a fusion network then produces the final classification scores.

Fusion of visual and non-visual modalities

Similarly, fusing visual and non-visual modalities also aims to exploit the complementarity between modalities to obtain a more accurate HAR model.

(1) Video + audio: as mentioned earlier, audio can complement the appearance and motion information in video, and a number of deep-learning methods already fuse these modalities. For example, [38] introduced a three-stream CNN that extracts features from the audio signal, RGB frames, and optical flow separately and then fuses them (in that paper, feature fusion outperformed score fusion). [39] improves on [37] by fusing the multimodal input sequences within each temporal binding window (i.e., the information fused across modalities may be asynchronous). [40] uses the audio signal to reduce temporal redundancy in video; the idea is to distill the knowledge of a teacher network trained on video clips into a student network trained on image-audio pairs.

(2) Video + acceleration: most existing deep-learning methods for fusing video and acceleration adopt a two-stream or multi-stream architecture. For example, [41] represents the inertial signal as an image, processes the video and the inertial signal with two separate CNNs, and finally combines the two modalities by score fusion. [42] feeds the 3D video frame sequence and the 2D inertial image into a 3D CNN and a 2D CNN respectively, and then performs cross-modal fusion.

(3) Other modality combinations: representative works here are [43] and [44]. The core idea of [43] is to convert non-RGB modalities, including skeleton, acceleration, and WiFi data, into color images and feed them into a CNN. [44] proposed a video-audio-text transformer (VATT) that takes linear projections of video, audio, and text data as transformer inputs and extracts multimodal feature representations; VATT also accounts for the different granularities of the modalities and is trained with NCE losses on video-audio and video-text pairs.
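To make the pairwise NCE training mentioned for VATT [44] concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of video-audio embedding pairs (this is a generic contrastive loss written for illustration, not VATT's exact implementation; the embedding dimension and temperature are hypothetical).

```python
import torch
import torch.nn.functional as F

def video_audio_nce(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE: the matching video/audio clips in a batch are positives,
    all other pairings in the batch serve as negatives."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)    # diagonal = positive pairs
    # average the video->audio and audio->video directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with hypothetical projected embeddings from the two streams.
loss = video_audio_nce(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```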
3.2 Multimodal Co-learning in HAR

Multimodal co-learning explores how knowledge learned from auxiliary modalities can help the learning of another modality, in the hope that cross-modal knowledge transfer can overcome the weaknesses of a single modality and improve performance. A key difference between multimodal co-learning and multimodal fusion is that in co-learning the data of the auxiliary modalities are needed only during training, not at test time. Co-learning is therefore particularly suitable for scenarios with missing modalities, and it can also help when the number of samples for a modality is small.
Co-learning among visual modalities

(1) Co-learning of RGB and depth. For example, [45] uses knowledge distillation to realize cross-modal co-learning, where the teacher network takes depth maps as input and the student network takes RGB images (see the sketch after this list). [46] proposed an adversarial-learning-based knowledge distillation strategy to train the student network. [47] proposed a cooperative learning strategy: among the different input modalities, the predicted labels produced by the modality with the smallest classification loss are used as additional supervision for training the other modalities.

(2) Co-learning of RGB and skeleton. For example, [48] uses a CNN+LSTM to perform classification from RGB video, while an LSTM trained on skeleton data serves as a regularizer that forces the output features of the two models to be similar.

(3) Co-learning of other visual modalities. Beyond RGB, skeleton, and depth, there is also co-learning work on other visual modalities. For example, [49] proposed a transferable generative model that takes infrared video as input and generates fake feature representations of the corresponding RGB video. Its generator consists of two sub-networks: the first distinguishes the generated fake features from real RGB features, and the second takes the infrared video features and the generated features as input and performs action classification.
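A minimal sketch of the cross-modal distillation setup described in (1), assuming a frozen teacher trained on depth and a student trained on RGB; the linear "networks", temperature, and loss weight are illustrative placeholders rather than the exact models of [45].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, T, alpha = 60, 4.0, 0.5   # temperature and loss weight: hypothetical values

teacher = nn.Linear(512, num_classes)  # stands in for an already-trained depth network
student = nn.Linear(512, num_classes)  # stands in for the RGB network being trained
teacher.eval()

optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

# One toy training step on hypothetical pre-extracted features and labels.
rgb_feat, depth_feat = torch.randn(8, 512), torch.randn(8, 512)
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():
    soft_targets = F.softmax(teacher(depth_feat) / T, dim=-1)   # depth teacher's predictions

student_logits = student(rgb_feat)
kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   soft_targets, reduction="batchmean") * T * T
ce_loss = F.cross_entropy(student_logits, labels)

loss = alpha * kd_loss + (1 - alpha) * ce_loss
loss.backward()
optimizer.step()
# At test time only the RGB student is used; the depth modality is no longer required.
```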
Co-learning of visual and non-visual modalities

These works fall roughly into two types. The first type transfers knowledge between modalities: for example, in [50] the teacher network is trained on non-visual modalities while the student network takes RGB as input, and forcing the attention maps of teacher and student to be similar bridges the gap between modalities and distills the knowledge. The second type exploits the correlation between modalities for self-supervised learning: for example, [51] uses unsupervised clustering results from the audio/video modality as supervision signals for the video/audio modality respectively, and [52] uses the temporal synchronization between video and audio as the self-supervision signal.
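The second type can be illustrated with a minimal sketch of synchronization-based self-supervision in the spirit of [52]: aligned and misaligned video-audio pairs are built from the same batch, and a binary classifier is trained on the concatenated embeddings (the encoders are omitted, and the embeddings and dimensions below are hypothetical placeholders).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical clip-level embeddings produced by a video encoder and an audio encoder.
B, D = 16, 256
video_emb = torch.randn(B, D)
audio_emb = torch.randn(B, D)

sync_classifier = nn.Linear(2 * D, 1)   # predicts "in sync" vs. "out of sync"

# Positives: temporally aligned pairs. Negatives: each audio clip shifted to another video.
pos = torch.cat([video_emb, audio_emb], dim=-1)
neg = torch.cat([video_emb, audio_emb.roll(shifts=1, dims=0)], dim=-1)

logits = sync_classifier(torch.cat([pos, neg], dim=0)).squeeze(-1)
targets = torch.cat([torch.ones(B), torch.zeros(B)])

# No manual labels are needed: temporal alignment itself provides the supervision signal.
loss = F.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```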
4 Existing Datasets

Table 6 of the original survey lists the datasets currently available for each data modality used in HAR; it is shown below:
As can be seen, datasets exist for the vast majority of data modalities, which greatly facilitates research and exploration of the HAR task.
5 Summary

In the final part of the survey, the authors look ahead to future directions for HAR. They identify six directions likely to be the focus of future research: (1) new datasets (e.g., multimodal datasets collected in uncontrolled environments); (2) multimodal learning; (3) efficient action analysis; (4) early action recognition (i.e., recognizing an action when only part of it has been performed); (5) large-scale training; (6) unsupervised and semi-supervised learning. The authors also mention that they will periodically collect the latest advances in HAR and update the survey.
6 Personal Thoughts

This survey reviews roughly 500 papers and covers every modality that may be used in HAR, making it a very comprehensive summary of the field. From the survey we can see that, whether for unimodal or multimodal models, the backbone is usually one of the following networks (or a combination of them): (1) 2D CNN (spatial feature extraction); (2) RNN/LSTM/GRU (temporal feature extraction); (3) 3D CNN (joint spatio-temporal feature extraction); (4) GNN/GCN (relation extraction between nodes); (5) Transformer (long-range temporal modeling). For multimodal fusion in HAR, the most common practice today is a two-stream or multi-stream network in which each stream extracts features for one modality, followed by a multimodal fusion module. For multimodal co-learning in HAR, the common practice is a cross-modal knowledge-distillation or adversarial-learning framework. Combinations of these backbones and fusion/co-learning strategies cover most of the current HAR literature.

Different modalities often require different models to extract their features, which is inconvenient for HAR model design. Sometimes, to fit existing models, the data of certain modalities need modality-specific preprocessing (for example, a common way to extract audio features is to convert the 1D audio signal into a 2D spectrogram and feed it into a CNN, as sketched below), and such preprocessing may lose some information. So can we have a general-purpose model that handles multimodal data of very different forms well? This is a question the whole AI community cares about, and it is especially prominent in HAR. Transformers have already achieved excellent results on images, text, and other modalities; can they become the general model we are hoping for? Given the rapid pace of progress in AI, I believe we will see the answer soon.

In addition, the multimodal-learning part of this survey categorizes existing work by the modalities used, whereas a large part of the core of multimodal-learning research lies in the fusion or co-learning strategies between modalities; categorizing existing work by the specific fusion or co-learning strategy might have been even better.
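As a concrete illustration of the audio preprocessing mentioned above, here is a small sketch (assuming torchaudio is available; the transform parameters, CNN, and class count are illustrative) that converts a 1D waveform into a 2D log-mel spectrogram and feeds it to a 2D CNN.

```python
import torch
import torch.nn as nn
import torchaudio

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 2)    # two seconds of (toy) mono audio

# 1D signal -> 2D time-frequency "image" (log-mel spectrogram).
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate,
                                              n_fft=1024, hop_length=256, n_mels=64)
to_db = torchaudio.transforms.AmplitudeToDB()
spec = to_db(to_mel(waveform)).unsqueeze(0)   # shape: (1, 1, 64, time_frames)

# A toy 2D CNN "audio stream"; any image backbone could be used here instead.
audio_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 60),                        # 60 hypothetical action classes
)
print(audio_cnn(spec).shape)                  # torch.Size([1, 60])
```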
References
[1] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, vol. 27, 2014.
[2] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732.
[3] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, "Real-time action recognition with enhanced motion vector cnns," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2718-2726.
[4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625-2634.
[5] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," arXiv preprint arXiv:1511.04119, 2015.
[6] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,” in Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 461-470.
[7] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[8] G. Varol, I. Laptev, and C. Schmid, "Long-term temporal convolutions for action recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1510-1517, 2017.
[9] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3d residual networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4489-4497.
[10] Y. Zhou, X. Sun, C. Luo, Z.-J. Zha, and W. Zeng, "Spatiotemporal fusion in 3d cnns: A probabilistic view," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 1725-1732.
[11] J. Kim, S. Cha, D. Wee, S. Bae, and J. Kim, "Regularization on spatio-temporally smoothed feature for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 12103-12112.
[12] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," in ICML, vol. 2, no. 3, 2021.
[13] Q. Fan, C.-F. Chen, and R. Panda, "Can an image classifier suffice for action recognition?," in International Conference on Learning Representations, 2022.
[14] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, "Video transformer network," in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 3163-3172.
[15] Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1110-1118.
[16] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "Ntu rgb+d: A large scale dataset for 3d human activity analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010-1019.
[17] Y. Hou, Z. Li, P. Wang, and W. Li, "Skeleton optical spectra-based action recognition using convolutional neural networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, 2016.
[18] P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM international conference on Multimedia, 2016, pp. 102-106.
[19] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Skeleton-based action recognition with directed graph neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7912-7921.
[20] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Thirty-second AAAI conference on artificial intelligence, 2018.
[21] Y. Zhang, B. Wu, W. Li, L. Duan, and C. Gan, "Stst: Spatial-temporal specialized transformer for skeleton-based action recognition," in Proceedings of the 29th ACM international conference on Multimedia, 2021, pp. 3229-3237.
[22] Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan, "3dv: 3d dynamic voxel for action recognition in depth video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 511-520.
[23] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[24] X. Liu, M. Yan, and J. Bohg, "Meteornet: Deep learning on dynamic 3d point cloud sequences," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9246-9255.
[25] J. Imran and P. Kumar, "Human action recognition using rgb-d sensor and deep convolutional neural networks," in 2016 international conference on advances in computing, communications and informatics (ICACCI), 2016, pp. 144-148.
[26] P. Wang, W. Li, J. Wan, P. Ogunbona, and X. Liu, "Cooperative training of deep aggregation networks for rgb-d action recognition," in Thirty-second AAAI conference on artificial intelligence, 2018.
[27] H. Wang, Z. Song, W. Li, and P. Wang, "A hybrid network for large-scale action recognition from rgb and depth modalities," Sensors, vol. 20, no. 11, 2020.
[28] R. Zhao, H. Ali, and P. Van der Smagt, "Two-stream rnn/cnn for action recognition in 3d videos," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 4260-4267.
[29] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2904-2913.
[30] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, "Skeleton-based action recognition using spatio-temporal lstm network with trust gates," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 3007-3021, 2017.
[31] H. Rahmani and M. Bennamoun, "Learning action recognition model from depth and skeleton videos," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5832-5841.
[32] S. S. Rani, G. A. Naidu, and V. U. Shree, "Kinematic joint descriptor and depth motion descriptor with convolutional neural networks for human action recognition," Materials Today, vol. 37, pp. 3164-3173, 2021.
[33] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang, "Deep multimodal feature analysis for action recognition in rgb+d videos," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 5, pp. 1045-1058, 2017.
[34] J.-F. Hu, W.-S. Zheng, J. Pan, J. Lai, and J. Zhang, "Deep bilinear learning for rgb-d action recognition," in Proceedings of the European Conference on Computer Vision, 2018, pp. 5832-5841.
[35] P. Khaire, P. Kumar, and J. Imran, "Combining cnn streams of rgb-d and skeletal data for human activity recognition," Pattern Recognition Letters, vol. 115, pp. 107-116, 2018.
[36] S. Ardianto and H.-M. Hang, "Multi-view and multi-modal action recognition with learned fusion," in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1601-1604, 2018.
[37] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in Proceedings of the European Conference on Computer Vision, 2016, pp. 20-36.
[38] C. Wang, H. Yang, and C. Meinel, "Exploring multimodal video representation for action recognition," in 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 1924-1931.
[39] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen, "Epic-fusion: Audiovisual temporal binding for egocentric action recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5492-5501.
[40] R. Gao, T.-H. Oh, K. Grauman, and L. Torresani, "Listen to look: Action recognition by previewing audio," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10457-10467.
[41] N. Dawar and N. Kehtarnavaz, "A convolutional neural network-based sensor fusion system for monitoring transition movements in healthcare applications," in 2018 IEEE 14th International Conference on Control and Automation (ICCA), pp. 482-485, 2018.
[42] H. Wei, R. Jafari, and N. Kehtarnavaz, "Fusion of video and inertial sensing for deep learning–based human action recognition," Sensors, vol. 19, no. 17, 2019.
[43] A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir, I. Laptev, M. Shah, and R. Sukthankar, "THUMOS challenge: Action recognition with a large number of classes." http://www.thumos.info/, 2015.
[44] H. Akbari, L. Yuan, R. Qian, W.-H. Chuang, S.-F. Chang, Y. Cui, and B. Gong, "Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text," in Advances in Neural Information Processing Systems, vol. 34, 2021.
[45] N. C. Garcia, P. Morerio, and V. Murino, "Modality distillation with multiple stream networks for action recognition," in Proceedings of the European Conference on Computer Vision, 2018, pp. 5832-5841.
[46] N. C. Garcia, P. Morerio, and V. Murino, "Learning with privileged information via adversarial discriminative modality distillation," IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 10, pp. 2581-2593, 2019.
[47] N. C. Garcia, S. A. Bargal, V. Ablavsky, P. Morerio, V. Murino, and S. Sclaroff, "Dmcl: Distillation multiple choice learning for multimodal action recognition," arXiv preprint arXiv:1912.10982, 2019.
[48] B. Mahasseni and S. Todorovic, "Regularizing long short term memory with 3d human-skeleton sequences for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3054-3062.
[49] L. Wang, C. Gao, L. Yang, Y. Zhao, W. Zuo, and D. Meng, "Pm-gans: Discriminative representation learning for action recognition using partial-modalities," in Proceedings of the European Conference on Computer Vision, 2018, pp. 384-401.
[50] Y. Liu, K. Wang, G. Li, and L. Lin, "Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition," IEEE Transactions on Image Processing, vol. 30, pp. 5573-5588, 2021.
[51] H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran, "Self-supervised learning by cross-modal audio-video clustering," arXiv preprint arXiv:1911.12667, 2019.
[52] B. Korbar, D. Tran, and L. Torresani, "Cooperative learning of audio and video models from self-supervised synchronization," in Advances in Neural Information Processing Systems, vol. 31, 2018.