Computer science
Artificial intelligence
Feature (linguistics)
Modality (human-computer interaction)
Benchmark (surveying)
Pattern recognition (psychology)
Joint (building)
Block (permutation group theory)
Feature extraction
Optical flow
Key (lock)
Discriminative model
Machine learning
Image (mathematics)
Philosophy
Engineering
Geometry
Civil engineering
Linguistics
Geography
Computer security
Mathematics
Geodesy
Authors
Fan Xia,Min Jiang,Jun Kong,Danfeng Zhuang
Identifier
DOI:10.1117/1.jei.31.1.013019
Abstract
Joint spatio-temporal feature learning is the key to video-based action recognition. Off-the-shelf techniques mostly apply two-stream networks, which either simply fuse the classification scores or integrate only the high-level features; as a result, these methods cannot learn inter-modality relationships well. We propose a joint attentive (JA) adaptive feature fusion (AFF) network, a three-stream network that improves inter-modality fusion by exploiting the complementary and interactive information of two modalities, RGB and optical flow. Specifically, we design an AFF block that performs layer-wise fusion across both modality network channels and feature levels, so that spatio-temporal feature representations from different modalities and at various levels can be fused effectively. To capture the three-dimensional interaction of spatio-temporal features, we devise a JA module that incorporates the inter-dependencies learned by a spatial-channel attention mechanism and combines multi-scale attention to refine fine-grained features. Extensive experiments on three public action recognition benchmark datasets demonstrate that our method achieves competitive results.
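The layer-wise adaptive fusion idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it only shows one plausible reading of "adaptive feature fusion": per-channel fusion weights for the RGB and optical-flow feature maps are derived from global average pooling and normalized with a softmax over the two modalities. The function names and the pooling-plus-softmax gating are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_feature_fusion(rgb_feat, flow_feat):
    """Fuse two modality feature maps of shape (C, H, W).

    Per-channel fusion weights are computed from global average
    pooling of each modality and normalized across modalities,
    so the two weights sum to 1 for every channel. This is a
    hypothetical sketch of layer-wise adaptive fusion, not the
    paper's exact AFF block.
    """
    rgb_desc = rgb_feat.mean(axis=(1, 2))    # (C,) channel descriptor
    flow_desc = flow_feat.mean(axis=(1, 2))  # (C,) channel descriptor
    # Softmax over the modality axis -> per-channel weights (2, C).
    weights = softmax(np.stack([rgb_desc, flow_desc]), axis=0)
    w_rgb = weights[0][:, None, None]
    w_flow = weights[1][:, None, None]
    # Weighted sum keeps the (C, H, W) shape of the inputs.
    return w_rgb * rgb_feat + w_flow * flow_feat
```

In a layer-wise scheme such as the one the abstract describes, a block like this would be applied at several depths of the two modality streams, with the fused maps feeding a third stream.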