Keywords
Closed captioning
Bootstrapping (finance)
Computer science
Language model
Generalization
Artificial intelligence
Coding (set theory)
Natural language processing
Image (mathematics)
Range (aeronautics)
Filter (signal processing)
Speech recognition
Computer vision
Programming language
Set (abstract data type)
Mathematical analysis
Financial economics
Composite material
Economics
Mathematics
Materials science
Authors
Junnan Li, Dongxu Li, Caiming Xiong, Steven C. H. Hoi
Source
Journal: Cornell University - arXiv
Date: 2022-01-28
Citations: 5
Identifier
DOI: 10.48550/arxiv.2201.12086
Abstract
Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
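The caption bootstrapping the abstract describes (a captioner proposes a synthetic caption for each web image, and a filter keeps only image-text pairs it judges as matched) can be summarized as a short data-flow sketch. The `Pair`, `captioner`, and `is_matched` names below are hypothetical stand-ins, not the released BLIP API; in the actual repo these roles are played by the finetuned BLIP decoder and the image-text matching head. This is a minimal illustration under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Pair:
    image_id: str   # stand-in for the actual image tensor
    caption: str

def bootstrap_captions(
    web_pairs: Iterable[Pair],
    captioner: Callable[[str], str],          # image -> synthetic caption (hypothetical)
    is_matched: Callable[[str, str], bool],   # (image, caption) -> keep? (hypothetical)
) -> List[Pair]:
    """For each noisy web pair, also generate a synthetic caption,
    then keep only the candidates the filter judges as matching."""
    clean: List[Pair] = []
    for pair in web_pairs:
        synthetic = Pair(pair.image_id, captioner(pair.image_id))
        for candidate in (pair, synthetic):
            if is_matched(candidate.image_id, candidate.caption):
                clean.append(candidate)
    return clean

# Toy stand-ins so the sketch runs end to end (not the real models).
if __name__ == "__main__":
    pairs = [Pair("img1", "random seo text"), Pair("img2", "a dog on grass")]
    toy_captioner = lambda image_id: f"a photo related to {image_id}"
    toy_filter = lambda image_id, caption: "random" not in caption
    for p in bootstrap_captions(pairs, toy_captioner, toy_filter):
        print(p.image_id, "->", p.caption)
```

The resulting cleaned pairs (surviving web captions plus surviving synthetic captions) would then serve as the pre-training corpus in place of the raw noisy web data.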