Keywords
Computer Science; Multimodal Learning; Data Science; Artificial Intelligence
Authors
Chunyuan Li,Zhe Gan,Zhengyuan Yang,Jianwei Yang,Linjie Li,Lijuan Wang,Jianfeng Gao
Source
Venue: arXiv (Cornell University)
Date: 2023-09
Citations: 9
Identifiers
DOI: 10.48550/arXiv.2309.10020
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
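To make the last topic in the abstract concrete, below is a minimal, self-contained sketch of "chaining multimodal tools with LLMs". It is not code from the surveyed paper: all names (caption_tool, ocr_tool, llm_route, answer) are hypothetical stand-ins, and the LLM's tool-selection step is stubbed with a keyword check. The pattern it illustrates is the one the abstract names: a language model picks a vision tool, the tool runs on the image, and the tool's output is folded back into the reply.

```python
# Hedged sketch of LLM tool-chaining; every function here is a hypothetical
# stand-in, not an API from the paper or any real library.
from typing import Callable, Dict

def caption_tool(image_path: str) -> str:
    # Placeholder for an image-captioning model.
    return f"a scene described by a captioning model ({image_path})"

def ocr_tool(image_path: str) -> str:
    # Placeholder for an OCR engine.
    return f"text read from the image by an OCR model ({image_path})"

TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": caption_tool,
    "ocr": ocr_tool,
}

def llm_route(query: str) -> str:
    """Stand-in for the LLM's tool-selection step (stubbed as a keyword check)."""
    return "ocr" if "read" in query.lower() else "caption"

def answer(query: str, image_path: str) -> str:
    tool_name = llm_route(query)                # 1. LLM chooses a tool
    observation = TOOLS[tool_name](image_path)  # 2. the tool runs on the image
    # 3. LLM composes the final reply from the observation (stubbed here)
    return f"[via {tool_name}] {observation}"

if __name__ == "__main__":
    print(answer("What is in this picture?", "demo.jpg"))
    print(answer("Please read the sign.", "demo.jpg"))
```

In a real system the routing and composition steps would be prompts to an actual LLM and the tools would wrap trained vision models; the point of the sketch is only the control flow that the survey's third exploratory topic describes.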