Keywords
RGB color model
Artificial intelligence
Computer science
Discriminative model
Benchmark (surveying)
Modality (human–computer interaction)
Pattern recognition (psychology)
Pattern
Convolutional neural network
Action recognition
Computer vision
Deep learning
Class (philosophy)
Social science
Geodesy
Sociology
Geography
Authors
Bruce X. B. Yu, Yan Liu, Xiang Zhang, Sheng-hua Zhong, Keith C. C. Chan
Identifier
DOI: 10.1109/tpami.2022.3177813
Abstract
Human action recognition (HAR) in RGB-D videos has been widely investigated since the release of affordable depth sensors. Currently, unimodal approaches (e.g., skeleton-based and RGB video-based) have realized substantial improvements with increasingly larger datasets. However, multimodal methods, particularly those with model-level fusion, have seldom been investigated. In this paper, we propose a model-based multimodal network (MMNet) that fuses skeleton and RGB modalities via a model-based approach. The objective of our method is to improve ensemble recognition accuracy by effectively applying mutually complementary information from different data modalities. For the model-based fusion scheme, we use a spatiotemporal graph convolutional network for the skeleton modality to learn attention weights that are then transferred to the network of the RGB modality. Extensive experiments are conducted on five benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, Northwestern-UCLA Multiview, and Toyota Smarthome. Upon aggregating the results of multiple modalities, our method outperforms state-of-the-art approaches on six evaluation protocols of the five datasets; thus, the proposed MMNet can effectively capture mutually complementary features in different RGB-D video modalities and provide more discriminative features for HAR. We also tested MMNet on Kinetics 400, an RGB video dataset that contains more outdoor actions, and obtained results consistent with those on the RGB-D video datasets.
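The abstract describes a model-level fusion in which attention learned by the skeleton branch re-weights features in the RGB branch, with the final prediction formed as a score-level ensemble of the two modalities. The following is a minimal PyTorch sketch of that idea only; the module structure, tensor shapes, and the simple MLP stand-ins for the spatiotemporal GCN and the RGB backbone are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of skeleton->RGB model-based fusion (not the official MMNet code).
import torch
import torch.nn as nn


class SkeletonBranch(nn.Module):
    """Stand-in for the spatiotemporal GCN; a plain MLP over joint coordinates."""
    def __init__(self, in_dim=3, hidden=64, num_classes=60):
        super().__init__()
        self.embed = nn.Linear(in_dim, hidden)
        self.classify = nn.Linear(hidden, num_classes)

    def forward(self, skel):                      # skel: (B, T, J, 3)
        h = torch.relu(self.embed(skel))          # (B, T, J, hidden)
        attn = torch.softmax(h.mean(-1), dim=-1)  # per-joint attention, (B, T, J)
        logits = self.classify(h.mean(dim=(1, 2)))
        return logits, attn


class RGBBranch(nn.Module):
    """Stand-in for the RGB network; consumes per-joint RGB features."""
    def __init__(self, feat_dim=128, num_classes=60):
        super().__init__()
        self.classify = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb_feats, attn):           # rgb_feats: (B, T, J, F)
        # Re-weight RGB features with the skeleton-derived attention,
        # then pool over time and joints before classification.
        weighted = rgb_feats * attn.unsqueeze(-1)
        return self.classify(weighted.mean(dim=(1, 2)))


class MMNetSketch(nn.Module):
    def __init__(self, num_classes=60):
        super().__init__()
        self.skeleton = SkeletonBranch(num_classes=num_classes)
        self.rgb = RGBBranch(num_classes=num_classes)

    def forward(self, skel, rgb_feats):
        skel_logits, attn = self.skeleton(skel)
        rgb_logits = self.rgb(rgb_feats, attn)
        # Score-level ensemble of the two modalities.
        return skel_logits + rgb_logits


if __name__ == "__main__":
    B, T, J, F = 2, 16, 25, 128
    model = MMNetSketch(num_classes=60)
    scores = model(torch.randn(B, T, J, 3), torch.randn(B, T, J, F))
    print(scores.shape)  # torch.Size([2, 60])
```

The sketch only shows the information flow the abstract names: skeleton-derived attention modulating RGB features, followed by summing the two branches' class scores for the ensemble prediction.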