Computer science
Concatenation (mathematics)
Artificial intelligence
Feature (linguistics)
Fusion mechanism
Pattern recognition (psychology)
Convolutional neural network
Pyramid (geometry)
Feature extraction
Scale (ratio)
Fusion
Computer vision
Philosophy
Linguistics
Physics
Quantum mechanics
Mathematics
Combinatorics
Lipid bilayer fusion
Optics
Authors
Yunzuo Zhang, Tian Zhang, Cunyu Wu, Ran Tao
Identifier
DOI: 10.1109/TMM.2023.3321394
Abstract
Recently, video saliency prediction has attracted increasing attention, yet further gains in accuracy are still limited by the insufficient use of multi-scale spatiotemporal features. To address this issue, we propose a 3D convolutional Multi-scale Spatiotemporal Feature Fusion Network (MSFFNet) to make full use of spatiotemporal features. Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of a bi-directional fusion architecture in this field, which adds a flow of shallow location information alongside the existing flow of deep semantic information. Then, unlike simple addition or concatenation, we design an Attention-Guided Fusion (AGF) mechanism that adaptively learns the fusion weights of adjacent features so as to integrate them appropriately. Moreover, a Frame-wise Attention (FA) module is introduced to selectively emphasize the useful frames, augmenting the multi-scale temporal features to be fused. Our model is simple but effective, and it runs in real time. Experimental results on the DHF1K, Hollywood-2, and UCF-sports datasets demonstrate that the proposed MSFFNet outperforms existing state-of-the-art methods in accuracy.
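The abstract contrasts the AGF mechanism with plain addition or concatenation: adjacent features are combined with learned, input-dependent weights. The paper's exact formulation is not given here, so the following is only a minimal NumPy sketch of that idea; the global-average-pooling descriptor, the tiny gating layer (`w`, `b`), and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_guided_fusion(feat_a, feat_b, w, b):
    """Fuse two adjacent feature maps with learned weights.

    feat_a, feat_b: (C, H, W) feature maps at the same resolution
                    (e.g. an upsampled deep feature and a shallow one).
    w, b: parameters of a toy gating layer mapping the pooled
          (2*C,) descriptor to 2 branch logits.
    Returns the fused (C, H, W) map and the two fusion weights.
    """
    # Global average pooling gives one (C,) descriptor per branch.
    desc = np.concatenate([feat_a.mean(axis=(1, 2)),
                           feat_b.mean(axis=(1, 2))])
    logits = w @ desc + b        # (2,): one logit per branch
    alpha = softmax(logits)      # adaptive weights, summing to 1
    fused = alpha[0] * feat_a + alpha[1] * feat_b
    return fused, alpha

# Toy usage with random features standing in for pyramid levels.
rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
fa = rng.normal(size=(C, H, W))
fb = rng.normal(size=(C, H, W))
w = rng.normal(size=(2, 2 * C)) * 0.1
b = np.zeros(2)
fused, alpha = attention_guided_fusion(fa, fb, w, b)
```

Unlike fixed addition (`0.5 * fa + 0.5 * fb`), the weights here depend on the inputs through the gating layer, which is the property the AGF mechanism relies on; in the actual network the gate would be trained end-to-end and would likely operate per channel or per position rather than with a single scalar per branch.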