Attribute-Guided Cross-Modal Interaction and Enhancement for Audio-Visual Matching

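Computer science; Matching (statistics); Benchmark (surveying); Feature (linguistics); Modal verb; Artificial intelligence; Embedding; Pattern recognition (psychology); Similarity (geometry); Feature extraction; Image (mathematics); Philosophy; Geodesy; Statistics; Chemistry; Linguistics; Polymer chemistry; Geography; Mathematics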

Authors

Jiaxiang Wang, Aihua Zheng, Yan Yan, Ran He, Jin Tang

Source

Journal: IEEE Transactions on Information Forensics and Security [Institute of Electrical and Electronics Engineers]
Date: 2024-01-01 Volume: 19, pages 4986-4998

Identifier

DOI: 10.1109/tifs.2024.3388949

Abstract

Audio-visual matching is an essential task that measures the correlation between audio clips and visual images. However, current methods rely solely on the joint embedding of global features from audio clips and face image pairs to learn semantic correlations. This approach overlooks the importance of high-confidence correlations and discrepancies of local subtle features, which are crucial for cross-modal matching. To address this issue, we propose a novel Attribute-guided Cross-modal Interaction and Enhancement Network (ACIENet), which employs multiple attributes to explore the associations of different key local subtle features. The ACIENet contains two novel modules: the Attribute-guided Interaction (AGI) module and the Attribute-guided Enhancement (AGE) module. The AGI module employs global feature alignment similarity to guide cross-modal local feature interactions, which enhances cross-modal association features for the same identity and expands cross-modal distinctive features for different identities. Additionally, the interactive features and original features are fused to ensure intra-class discriminability and inter-class correspondence. The AGE module captures subtle attribute-related features by using an attribute-driven network, thereby enhancing discrimination at the attribute level. Specifically, it strengthens the combined attribute-related features of gender and nationality. To prevent interference between multiple attribute features, we design a multi-attribute learning network as a parallel framework. Experiments conducted on a public benchmark dataset demonstrate the efficacy of the ACIENet method in different scenarios. Code and models are available at https://github.com/w1018979952/ACIENet.
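The abstract describes the AGI module only at a high level, so the following is a minimal illustrative sketch of the general idea rather than the authors' implementation. It assumes a cosine similarity between global embeddings acts as a gate on a simple dot-product attention between local features, and that the attended context is added back to the original features. The function names, the clamping of the gate, and the additive fusion are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_interaction(audio_local, face_local, audio_global, face_global):
    """Fuse each audio local feature with attended face context.

    Hypothetical scheme: the global similarity (clamped at zero, an
    assumption) scales how much cross-modal context is mixed in.
    """
    gate = max(0.0, cosine(audio_global, face_global))
    fused = []
    for a in audio_local:
        # Attention weights of this audio feature over all face features.
        weights = softmax([sum(x * y for x, y in zip(a, f)) for f in face_local])
        # Weighted sum of face local features (the cross-modal context).
        context = [
            sum(w * f[d] for w, f in zip(weights, face_local))
            for d in range(len(a))
        ]
        # Additive fusion of original and interactive features (assumption).
        fused.append([x + gate * c for x, c in zip(a, context)])
    return fused
```

In this sketch, a pair whose global embeddings are orthogonal receives no cross-modal context, while a well-aligned pair receives the full attended context. The released code at the repository linked in the abstract is the reference for the actual design.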
