Computer science
Feature learning
Granularity
Representation (politics)
Modal
Benchmark (surveying)
Similarity (geometry)
Set (abstract data type)
Artificial intelligence
Aggregate (composite)
Natural language processing
Information retrieval
Law
Programming language
Geography
Materials science
Composite material
Polymer chemistry
Chemistry
Image (mathematics)
Operating system
Politics
Political science
Geodesy
Authors
Shengwei Zhao, Linhai Xu, Yuying Liu, Shaoyi Du
Identifier
DOI:10.1145/3539618.3592025
Abstract
The purpose of audio-text retrieval is to learn a cross-modal similarity function between audio and text, so that a given audio clip (or text) can retrieve similar text (or audio) from a candidate set. Recent audio-text retrieval models aggregate multi-modal features into a single-grained representation. However, a single-grained representation struggles when one audio clip is described by multiple texts at different granularity levels, because the association patterns between audio and text are complex. We therefore propose an adaptive aggregation strategy that automatically finds the optimal pooling function for aggregating features into a comprehensive representation, so as to learn valuable multi-grained representations. Multi-grained contrastive learning is then performed to capture the complex correlations between audio and text at different granularities. Meanwhile, text-guided token interaction is used to reduce the impact of redundant audio clips. We evaluated the proposed method on two audio-text retrieval benchmark datasets, AudioCaps and Clotho, achieving state-of-the-art results in both text-to-audio and audio-to-text retrieval. Our findings emphasize the importance of learning multi-modal, multi-grained representations.
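The abstract does not give the exact formulation of the adaptive aggregation strategy or the contrastive objective. The following is a minimal PyTorch sketch of one plausible reading, not the authors' code: a softmax-gated mixture over candidate pooling functions (mean, max, attention) that adaptively selects how token features are aggregated, plus a symmetric InfoNCE loss that would be applied at each granularity level. All names (AdaptiveAggregation, info_nce) and design choices here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAggregation(nn.Module):
    """Hypothetical adaptive aggregation: a learnable softmax gate
    mixes mean, max, and attention pooling of token features."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)             # scores for attention pooling
        self.gate = nn.Parameter(torch.zeros(3))  # learnable mixture weights

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) audio-frame or text-token features
        mean_pool = tokens.mean(dim=1)
        max_pool = tokens.max(dim=1).values
        attn_w = F.softmax(self.attn(tokens), dim=1)   # (batch, seq_len, 1)
        attn_pool = (attn_w * tokens).sum(dim=1)
        w = F.softmax(self.gate, dim=0)                # mixture over 3 pools
        return w[0] * mean_pool + w[1] * max_pool + w[2] * attn_pool

def info_nce(audio_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings; under the
    sketch's assumptions it would be summed across granularity levels."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Because the gate is trained end-to-end with the retrieval loss, the model can shift weight toward whichever pooling best matches the granularity of the paired captions, rather than committing to a single fixed aggregation.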