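Computer science
Artificial intelligence
Computer vision
Transformer
Salience
Feature extraction
Object detection
Pattern recognition (psychology)
Engineering
Voltage
Electrical engineering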
Authors
Kan Huang,Chunwei Tian,Jingyong Su,Jun Lin
Identifier
DOI:10.1016/j.patrec.2022.06.006
Abstract
Video salient object detection is a fundamental computer vision task aimed at highlighting the most conspicuous objects in a video sequence. It presents two key challenges: (1) how to extract effective feature representations from appearance and motion cues, and (2) how to combine them into a robust saliency representation. To address these challenges, we propose a novel Transformer-based Cross Reference Network (TCRN), which fully exploits long-range context dependencies in both feature representation extraction and cross-modal (i.e., appearance and motion) integration. In contrast to existing CNN-based methods, our approach formulates video salient object detection as a sequence-to-sequence prediction task. In the proposed approach, deep feature extraction is achieved by a pure vision transformer with multi-resolution token representations. Specifically, we design a Gated Cross Reference (GCR) module to effectively integrate appearance and motion into a saliency representation. The GCR first propagates global context information between the two modalities, and then performs cross-modal fusion through a gate mechanism. Extensive evaluations on five widely used benchmarks show that the proposed Transformer-based method performs favorably against existing state-of-the-art methods.
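The two-step idea behind the GCR module (cross-modal context propagation followed by gated fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layer design, so the function names (`cross_reference`, `gated_fusion`, `gated_cross_reference`), the use of plain scaled dot-product attention without learned query/key/value projections, the sigmoid gate over concatenated tokens, and all shapes are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_reference(query_tokens, context_tokens):
    """Assumed step 1: every query token attends to all tokens of the other
    modality (scaled dot-product attention), with a residual connection."""
    dim = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(dim)  # (N, M)
    return query_tokens + softmax(scores, axis=-1) @ context_tokens


def gated_fusion(appearance, motion, w_gate, b_gate):
    """Assumed step 2: a sigmoid gate computed from both modalities decides,
    per token and channel, how much of each modality to keep."""
    joint = np.concatenate([appearance, motion], axis=-1)    # (N, 2 * dim)
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_gate + b_gate)))  # (N, dim)
    return gate * appearance + (1.0 - gate) * motion


def gated_cross_reference(appearance, motion, w_gate, b_gate):
    """Propagate context in both directions, then fuse with the gate."""
    a_ref = cross_reference(appearance, motion)
    m_ref = cross_reference(motion, appearance)
    return gated_fusion(a_ref, m_ref, w_gate, b_gate)


# Toy inputs: 16 tokens of width 8 per modality (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
appearance = rng.normal(size=(n_tokens, dim))
motion = rng.normal(size=(n_tokens, dim))
w_gate = rng.normal(scale=0.1, size=(2 * dim, dim))  # random stand-in for learned weights
b_gate = np.zeros(dim)

fused = gated_cross_reference(appearance, motion, w_gate, b_gate)
print(fused.shape)
```

In this sketch, setting the gate weights and bias to zero makes the gate equal to 0.5 everywhere, so the fusion reduces to a plain average of the two modalities; a trained gate would instead weight appearance and motion differently per token and channel.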