Computer science
Matching (statistics)
Artificial intelligence
Image fusion
Modal verb
Pattern
Modality (human–computer interaction)
Fusion
Image (mathematics)
Similarity (geometry)
Representation (politics)
Scheme (mathematics)
Task (project management)
Pattern recognition (psychology)
Sensor fusion
Natural language processing
Machine learning
Computer vision
Mathematics
Engineering
Sociology
Philosophy
Mathematical analysis
Statistics
Chemistry
Polymer chemistry
Law
Systems engineering
Politics
Linguistics
Social science
Political science
Authors
Yifan Wang,Xing Xu,Wu Yu,Ruicong Xu,Zhiwei Cao,Heng Tao Shen
Identifier
DOI:10.1109/icme51207.2021.9428201
Abstract
Image-text matching is a challenging task in cross-modal learning due to the discrepancy in data representation between the image and text modalities. Mainstream methods adopt late fusion to compute image-text similarity on encoded cross-modal features, and expend considerable training cost to capture intra-modality associations. In this work, we propose to Combine Early and Late Fusion Together (CELFT), a universal hybrid fusion framework that effectively overcomes the above shortcomings of the late fusion scheme. In the proposed CELFT framework, the hybrid structure with early and late fusion facilitates interaction between the image and text modalities at an early stage. Moreover, the two fusion strategies complement each other in capturing inter-modal and intra-modal information, which ensures that a more accurate image-text similarity is learned. In the experiments, we choose four recent approaches based on the late fusion scheme as base models and integrate them with our CELFT framework. Results on two widely used image-text datasets, MSCOCO and Flickr30K, show that the matching performance of all base models is significantly improved with remarkably reduced training time.
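The early-plus-late fusion idea described in the abstract can be sketched minimally as follows. This is an illustration of the general hybrid-fusion pattern, not the authors' CELFT implementation: the bilinear early-fusion term, the function names, and the mixing weight `alpha` are all assumptions made for this sketch.

```python
import numpy as np

def late_fusion_similarity(img_feat, txt_feat):
    # Late fusion: score independently encoded features,
    # here with cosine similarity.
    a = img_feat / np.linalg.norm(img_feat)
    b = txt_feat / np.linalg.norm(txt_feat)
    return float(a @ b)

def early_fusion_similarity(img_feat, txt_feat, W):
    # Early fusion (toy version): let the two modalities interact
    # before scoring, here via a learnable bilinear form img^T W txt.
    return float(img_feat @ W @ txt_feat)

def hybrid_similarity(img_feat, txt_feat, W, alpha=0.5):
    # Hybrid scheme: mix the early- and late-fusion scores.
    # alpha is a hypothetical blending weight, not from the paper.
    return (alpha * early_fusion_similarity(img_feat, txt_feat, W)
            + (1 - alpha) * late_fusion_similarity(img_feat, txt_feat))

# Example with fixed toy features: with W = 0 the early term vanishes,
# so the hybrid score reduces to half the cosine similarity.
img = np.ones(4)
txt = np.ones(4)
W = np.zeros((4, 4))
print(hybrid_similarity(img, txt, W, alpha=0.5))  # prints 0.5
```

In a real model both branches would operate on learned region and word features, and the early-fusion branch would be a trained cross-modal interaction module rather than a single bilinear form.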