Computer science
Event (particle physics)
Robustness (evolution)
Artificial intelligence
Security token
Class (philosophy)
Audiovisual
Speech recognition
Audio signal processing
Pattern recognition (psychology)
Machine learning
Audio signal
Speech coding
Multimedia
Computer security
Quantum mechanics
Gene
Physics
Chemistry
Biochemistry
Authors
Yiling Wu,Xinfeng Zhang,Yaowei Wang,Qingming Huang
Identifier
DOI:10.1145/3503161.3548318
Abstract
This paper focuses on the audio-visual event localization task, which aims to match the visible and audible components of a video to identify the event of interest. Existing methods largely ignore the continuity of audio-visual events and classify each segment separately, predicting either per-segment event category scores or per-segment event-relevance scores. However, events in video are typically continuous and span several segments. Motivated by this, we propose a span-based framework that considers consecutive segments jointly. The framework handles audio-visual event localization by predicting the event class and extracting the event span. Specifically, a [CLS] token collects global information through self-attention to predict the event class, while relevance scores and positional embeddings are fed into a span predictor that estimates the start and end boundaries of the event. Multi-modal Mixup is further used to improve the robustness and generalization of the model. Experiments on the AVE dataset demonstrate that the proposed method outperforms state-of-the-art methods.
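The two mechanisms the abstract names — extracting a contiguous event span from start/end boundary scores, and Mixup-style interpolation of paired audio and visual features — can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the additive start+end span scoring, and the fixed mixing coefficient are all assumptions made for illustration.

```python
import numpy as np

def best_span(start_scores, end_scores):
    """Return the (start, end) pair with the highest combined score,
    subject to start <= end. Assumes one score per video segment."""
    T = len(start_scores)
    best, best_score = (0, 0), -np.inf
    for s in range(T):
        for e in range(s, T):  # enforce a valid, contiguous span
            sc = start_scores[s] + end_scores[e]
            if sc > best_score:
                best_score, best = sc, (s, e)
    return best

def multimodal_mixup(audio1, visual1, audio2, visual2, lam=0.7):
    """Interpolate two samples' audio and visual features with the
    same coefficient, in the spirit of Mixup data augmentation."""
    return (lam * audio1 + (1 - lam) * audio2,
            lam * visual1 + (1 - lam) * visual2)

# Toy boundary scores over 4 segments: segment 1 looks like a start,
# segment 2 looks like an end, so the predicted event span is (1, 2).
start = np.array([0.1, 2.0, 0.3, 0.0])
end = np.array([0.0, 0.2, 1.5, 0.4])
print(best_span(start, end))  # -> (1, 2)
```

The brute-force search over (start, end) pairs is O(T^2), which is cheap for the short segment sequences used in event localization; a real system would typically take the argmax over a score matrix instead.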