Authors
Tao Hong, Ya Wang, Xingwu Sun, Xiaoqing Li, Jinwen Ma
Source
Journal: Communications in Computer and Information Science
Date: 2023-11-26
Pages: 471-482
Identifier
DOI: 10.1007/978-981-99-8148-9_37
Abstract
Visual grounding (VG) is a representative multi-modal task that has recently attracted increasing attention. Nevertheless, existing methods still under-perform because of insufficient training data. To address this, some researchers have attempted to generate new samples by combining pairs of (image, text) samples, inspired by the success of the uni-modal CutMix family of data augmentation. However, these methods mix images and texts separately and neglect their contextual correspondence. To overcome this limitation, we propose a novel data augmentation method for the visual grounding task, called Cross-Modal Mix (CMMix). Our approach adopts a fine-grained mix paradigm: sentence-structure analysis locates the central noun parts in the texts, and the corresponding image patches are extracted via the noun-specific bounding boxes available in VG. In this way, CMMix maintains the matching correspondence during the mix operation, retaining the coherent relationship between images and texts and yielding richer, more meaningful mixed samples. Furthermore, we employ a filtering-sample-by-loss strategy to enhance the effectiveness of our method. Experiments on four VG benchmarks, ReferItGame, RefCOCO, RefCOCO+, and RefCOCOg, fully verify the superiority of our method.
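The abstract describes the pipeline only at a high level, so the following is a minimal, illustrative sketch of the cross-modal mixing idea under stated assumptions: spaCy stands in for the sentence-structure analysis, Pillow handles the image operations, and all names here (VGSample, head_noun_span, cmmix, filter_by_loss) as well as the decision to keep low-loss mixes are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

import spacy                      # assumed dependency for sentence-structure analysis
from PIL import Image

nlp = spacy.load("en_core_web_sm")  # small English pipeline (hypothetical choice)


@dataclass
class VGSample:
    """One visual-grounding sample: an image, a referring expression,
    and the bounding box (x0, y0, x1, y1) of the referred object."""
    image: Image.Image
    text: str
    box: Tuple[int, int, int, int]


def head_noun_span(text: str) -> str:
    """Locate the central noun part of a referring expression.

    Here we simply take the first noun chunk returned by spaCy; the paper's
    sentence-structure analysis is more involved, so treat this as a stand-in.
    """
    chunks = list(nlp(text).noun_chunks)
    return chunks[0].text if chunks else text


def cmmix(a: VGSample, b: VGSample) -> VGSample:
    """Cross-modal mix of two (image, text) pairs (illustrative only).

    The object patch of sample ``b`` is pasted into the box of sample ``a``,
    and the central noun of ``a``'s expression is replaced by ``b``'s noun,
    so the mixed image and the mixed text stay in correspondence.
    """
    # Image side: crop b's object patch and paste it over a's object region.
    patch = b.image.crop(b.box)
    target_w = a.box[2] - a.box[0]
    target_h = a.box[3] - a.box[1]
    patch = patch.resize((target_w, target_h))
    mixed_image = a.image.copy()
    mixed_image.paste(patch, a.box)

    # Text side: swap the central noun phrase so the sentence refers to the
    # object that is now visible inside a's box.
    noun_a = head_noun_span(a.text)
    noun_b = head_noun_span(b.text)
    mixed_text = a.text.replace(noun_a, noun_b, 1)

    # The grounding target is unchanged: the mixed object occupies a's box.
    return VGSample(image=mixed_image, text=mixed_text, box=a.box)


def filter_by_loss(samples: List[VGSample], losses: List[float],
                   keep_ratio: float = 0.5) -> List[VGSample]:
    """Filtering-sample-by-loss (illustrative): keep the mixed samples with
    the lowest current training loss, discarding presumably noisy mixes.
    The abstract does not state whether low- or high-loss samples are kept,
    so the direction used here is an assumption.
    """
    ranked = sorted(zip(losses, samples), key=lambda pair: pair[0])
    cut = max(1, int(len(ranked) * keep_ratio))
    return [sample for _, sample in ranked[:cut]]
```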