Computer science
Similarity (geometry)
Joint (building)
Modal verb
Artificial intelligence
Natural language processing
Information retrieval
Image (mathematics)
Engineering
Architectural engineering
Chemistry
Polymer chemistry
Authors
Mingyuan Ge, Yewen Li, Honghao Wu, Mingyong Li
Identifier
DOI:10.1109/icassp48485.2024.10446490
Abstract
In recent years, research on video-text retrieval has advanced considerably thanks to the emergence of large-scale pre-training methods. However, these works focus solely on inter-modal interaction and contrast, neglecting contrasts among multi-grained features within each modality, which makes the similarity measurement less accurate. Worse still, many of them contrast only features of the same grain, ignoring the important information carried by features of different grains. For these reasons, we propose a joint modal similarity contrastive learning model, named JM-CLIP. First, the method employs a multidimensional contrastive strategy over the two modalities' features, covering both inter-modal and intra-modal multi-grained feature contrasts. Second, to fuse the resulting contrastive similarities effectively, we design a joint modal attention module that combines them into a final joint multi-granularity similarity score. The significant performance improvements achieved on three popular video-text retrieval datasets demonstrate the effectiveness and superiority of the proposed method. The code is available at https://github.com/DannielGe/JM-CLIP.
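The abstract leaves the loss design to the paper and the linked repository; the sketch below illustrates, under assumptions, what inter-modal and intra-modal multi-grained contrasts could look like in a CLIP-style PyTorch setup. Every function name and pooling choice (info_nce, coarse_sim, fine_sim, cross_grain_sim) is illustrative, not taken from JM-CLIP.

```python
# A minimal sketch, assuming CLIP-style encoders that expose a global (coarse)
# embedding plus frame/token-level (fine) embeddings per sample. All names and
# pooling choices are assumptions, not JM-CLIP's actual design.
import torch
import torch.nn.functional as F

def info_nce(sim: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a [B, B] similarity matrix (diagonal = positives)."""
    logits = sim / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def coarse_sim(v_global: torch.Tensor, t_global: torch.Tensor) -> torch.Tensor:
    """Coarse-coarse: video-level vs. sentence-level cosine similarity, [B, B]."""
    v = F.normalize(v_global, dim=-1)
    t = F.normalize(t_global, dim=-1)
    return v @ t.t()

def fine_sim(v_frames: torch.Tensor, t_tokens: torch.Tensor) -> torch.Tensor:
    """Fine-fine: frame-token similarity, max over tokens then mean over frames.
    v_frames: [B, Nf, D], t_tokens: [B, Nt, D] -> [B, B]."""
    v = F.normalize(v_frames, dim=-1)
    t = F.normalize(t_tokens, dim=-1)
    s = torch.einsum('bfd,ctd->bcft', v, t)   # pairwise scores: [B, B, Nf, Nt]
    return s.max(dim=-1).values.mean(dim=-1)

def cross_grain_sim(global_feat: torch.Tensor, fine_feats: torch.Tensor) -> torch.Tensor:
    """Coarse-fine: a global feature vs. a set of fine features.
    global_feat: [B, D], fine_feats: [B, N, D] -> [B, B]."""
    g = F.normalize(global_feat, dim=-1)
    f = F.normalize(fine_feats, dim=-1)
    return torch.einsum('bd,cnd->bcn', g, f).mean(dim=-1)

def total_loss(vg, vf, tg, tf):
    """One possible combination: vg/tg are global features, vf/tf are fine ones."""
    inter = (info_nce(coarse_sim(vg, tg)) + info_nce(fine_sim(vf, tf)) +
             info_nce(cross_grain_sim(vg, tf)) + info_nce(cross_grain_sim(tg, vf)))
    # Intra-modal grain contrast: each sample's global feature against the
    # fine features of all samples in the same modality.
    intra = info_nce(cross_grain_sim(vg, vf)) + info_nce(cross_grain_sim(tg, tf))
    return inter + intra
```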
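Similarly, the joint modal attention module is only named in the abstract. One simple way to fuse several [B, B] similarity matrices into a single joint multi-granularity score is a learned softmax weighting over the similarity channels, sketched below; the class name SimilarityFusion and the weighting scheme are assumptions, since the paper's module may instead condition its attention on the features themselves.

```python
# A hedged sketch of similarity fusion via learned channel weights; this is
# an assumed stand-in for the paper's joint modal attention module.
import torch
import torch.nn as nn

class SimilarityFusion(nn.Module):
    def __init__(self, num_sims: int):
        super().__init__()
        # One learnable logit per similarity channel (coarse-coarse,
        # fine-fine, coarse-fine, ...); softmax keeps the fusion convex.
        self.logits = nn.Parameter(torch.zeros(num_sims))

    def forward(self, sims: list[torch.Tensor]) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)           # [num_sims]
        stacked = torch.stack(sims, dim=0)              # [num_sims, B, B]
        return (w.view(-1, 1, 1) * stacked).sum(dim=0)  # [B, B] joint score

# Usage with the four inter-modal channels from the sketch above:
# fusion = SimilarityFusion(num_sims=4)
# joint = fusion([s_cc, s_ff, s_cf, s_fc])  # each s_* is a [B, B] matrix
```

A convex (softmax-weighted) combination keeps the joint score on the same scale as its inputs, which makes it directly usable for retrieval ranking; whether JM-CLIP imposes such a constraint is not stated in the abstract.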