Keywords
Computer science, RGB color model, Transformer, Artificial intelligence, Action recognition, Modality (human-computer interaction), Pattern recognition (psychology), Fusion, Computer vision, Voltage, Electrical engineering, Linguistics, Engineering, Philosophy, Class (philosophy)
Authors
Zhen Liu, Jun Cheng, Libo Liu, Ziliang Ren, Qieshi Zhang, Chengqun Song
Identifier
DOI:10.1016/j.knosys.2022.109741
Abstract
RGB-D-based action recognition can achieve accurate and robust performance thanks to rich complementary information, and thus has many application scenarios. However, existing works either combine multiple modalities by late fusion or learn multimodal representations with simple feature-level fusion methods, which fail to effectively utilize complementary semantic information and to model interactions between unimodal features. In this paper, we design a self-attention-based modal enhancement module (MEM) and a cross-attention-based modal interaction module (MIM) to enhance and fuse RGB and depth features. Moreover, a novel bottleneck excitation feed-forward block (BEF) is proposed to enhance the expressive ability of the model with few extra parameters and little computational overhead. Integrating these two modules with BEFs yields one basic fusion layer of the cross-modality fusion transformer. We apply the transformer on top of dual-stream convolutional neural networks (ConvNets) to build a dual-stream cross-modality fusion transformer (DSCMT) for RGB-D action recognition. Extensive experiments on the NTU RGB+D 120, PKU-MMD, and THU-READ datasets verify the effectiveness and superiority of the DSCMT. Furthermore, the DSCMT still yields considerable improvements when the convolutional backbone is changed or when it is applied to different multimodal combinations, indicating its universality and scalability. The code is available at https://github.com/liuzwin98/DSCMT.
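As a rough illustration of how one such fusion layer might be wired, the following is a minimal PyTorch sketch, not the authors' implementation (their code is at the GitHub link above). The module names follow the abstract, but every detail is an assumption for illustration: attention via nn.MultiheadAttention, a squeeze-and-excitation-style gate standing in for the bottleneck excitation inside the BEF, the token shapes, and the omission of layer normalization.

```python
import torch
import torch.nn as nn


class BottleneckExcitationFF(nn.Module):
    """Hypothetical BEF sketch: a feed-forward layer followed by a
    squeeze-and-excitation-style channel gate whose narrow bottleneck
    (dim // reduction) keeps the extra parameter count small."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x):  # x: (B, N, C) token features
        h = self.ff(x)
        # Pool over tokens, compute per-channel gates, rescale, add residual.
        return x + h * self.gate(h.mean(dim=1, keepdim=True))


class FusionLayer(nn.Module):
    """One basic fusion layer as described in the abstract: self-attention
    modal enhancement (MEM) per stream, cross-attention modal interaction
    (MIM) where each modality queries the other, then a BEF per stream.
    Layer norms are omitted for brevity."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.mem_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mem_d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mim_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mim_d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bef_rgb = BottleneckExcitationFF(dim)
        self.bef_d = BottleneckExcitationFF(dim)

    def forward(self, rgb, depth):  # each: (B, N, C) tokens per modality
        # MEM: self-attention enhances each modality independently.
        rgb = rgb + self.mem_rgb(rgb, rgb, rgb)[0]
        depth = depth + self.mem_d(depth, depth, depth)[0]
        # MIM: cross-attention; each stream queries the other modality,
        # both using the pre-interaction features for symmetry.
        rgb2 = rgb + self.mim_rgb(rgb, depth, depth)[0]
        depth2 = depth + self.mim_d(depth, rgb, rgb)[0]
        # BEF in the usual transformer feed-forward position.
        return self.bef_rgb(rgb2), self.bef_d(depth2)


if __name__ == "__main__":
    # E.g. a 7x7 ConvNet feature map flattened into 49 tokens per modality.
    rgb = torch.randn(2, 49, 256)
    depth = torch.randn(2, 49, 256)
    out_rgb, out_depth = FusionLayer(dim=256)(rgb, depth)
    print(out_rgb.shape, out_depth.shape)  # both torch.Size([2, 49, 256])
```

In this reading, stacking several such layers on top of the two ConvNet streams and pooling their outputs would give the fused representation for classification; the actual stacking depth and fusion head are choices the sketch does not attempt to reproduce.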