Keywords
Robustness (evolution), Computer science, Modal, Artificial intelligence, Retraining, Question answering, Construct (Python library), Machine learning, Natural language processing, Biochemistry, Chemistry, International trade, Polymer chemistry, Business, Gene, Programming language
Authors
Akib Mashrur, Wei Luo, Nayyar A. Zaidi, Antonio Robles-Kelly
Identifier
DOI: 10.1016/j.cviu.2023.103862
Abstract
Recent advances in vision-language models have resulted in improved accuracy in visual question answering (VQA) tasks. However, their robustness remains limited when faced with out-of-distribution data containing unanswerable questions. In this study, we first construct a simple randomised VQA dataset, incorporating unanswerable questions from the VQA v2 dataset, to evaluate the robustness of a state-of-the-art VQA model. Our findings reveal that the model struggles to predict the "unknown" answer or provides inaccurate responses with high confidence scores for irrelevant questions. To address this issue without retraining the large backbone models, we propose Cross Modal Augmentation (CMA), a model-agnostic, test-time-only, multi-modal semantic augmentation technique. CMA generates multiple semantically consistent but heterogeneous instances from the visual and textual inputs, which are then fed to the model, and the predictions are combined to achieve a more robust output. We demonstrate that implementing CMA enables the VQA model to provide more reliable answers in scenarios involving unanswerable questions, and show that the approach is generalisable across different categories of pre-trained vision language models.
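As described, CMA amounts to test-time augmentation applied jointly to the image and the question, with the per-variant predictions aggregated into one answer distribution. Below is a minimal sketch of that idea, assuming a generic `vqa_model(image, question)` callable that returns an answer-to-probability mapping; the augmentation functions, names, and parameters here are illustrative placeholders, not the authors' implementation.

```python
import random
from collections import defaultdict

def augment_image(image):
    """Placeholder visual augmentation; a real pipeline would apply
    semantics-preserving perturbations such as crops or flips."""
    return image  # identity stand-in

def paraphrase_question(question, rng):
    """Placeholder textual augmentation: wrap the question in
    meaning-preserving templates (a real version might paraphrase)."""
    templates = ["{q}", "To be precise: {q}", "{q} Answer briefly."]
    return rng.choice(templates).format(q=question)

def cma_predict(vqa_model, image, question, n_augments=8, seed=0):
    """Run the frozen VQA model on several semantically consistent
    (image, question) variants and average the answer distributions."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_augments):
        img_aug = augment_image(image)
        q_aug = paraphrase_question(question, rng)
        for answer, prob in vqa_model(img_aug, q_aug).items():
            totals[answer] += prob / n_augments
    best = max(totals, key=totals.get)
    return best, dict(totals)

if __name__ == "__main__":
    def toy_model(image, question):
        # Toy stand-in: a real VQA model would return a softmax over answers.
        return {"yes": 0.6, "unknown": 0.4}
    best, dist = cma_predict(toy_model, image=None, question="Is there a cat?")
    print(best, dist)
```

Averaging over variants suppresses answers that only score highly under one particular phrasing or view, which is consistent with the more reliable behaviour on unanswerable questions that the abstract reports; because the backbone is only called at inference time, no retraining is required.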