Keywords
Computer science
Modality
Training
Natural language processing
Artificial intelligence
Machine learning
Geography
Chemistry
Meteorology
Polymer chemistry
Authors
Yunkai Chen, Qimeng Wang, Shiwei Wu, Yan Gao, Tong Xu, Yao Hu
Source
Journal: ACM Transactions on Knowledge Discovery From Data
[Association for Computing Machinery]
Date: 2024-03-28
Volume/Issue: 18(7): 1-19
Citations: 5
Abstract
Multi-modal large language models (MLLMs), such as GPT-4, exhibit strong comprehension of human instructions, as well as zero-shot ability on new downstream multi-modal tasks. To integrate the different modalities within a unified embedding space, previous MLLMs conduct visual instruction tuning with massive, high-quality image-text pair data, which incurs substantial costs in data collection and training resources. In this article, we propose TOMGPT (Text-Only training Multi-modal GPT), a cost-effective MLLM tuned solely on easily accessible text data with far fewer resources. Building on the coupled visual-linguistic modality space of pre-trained models (e.g., CLIP and ALIGN), a text-only training strategy is devised to project the aligned multi-modal latent space into that of the LLM, endowing the LLM with visual comprehension capabilities in an efficient manner. Instead of the enormous image-text training data required by previous MLLMs, we find that TOMGPT can be well tuned with a smaller yet diverse set of GPT-generated free-form text data, as we establish a semantic connection between the LLM and the pre-trained vision-language model. A quantitative evaluation is conducted on both MME and LVLM, which are recently released and widely used MLLM benchmarks. The experiments reveal that TOMGPT achieves reliable performance compared to numerous models trained on large amounts of image-text pair data. Case studies are also presented, demonstrating TOMGPT's broad understanding and dialogue capabilities across diverse image categories.
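As a rough illustration of the text-only training strategy described in the abstract: because CLIP-style models place paired images and captions near each other in a shared embedding space, a caption's CLIP text feature can stand in for the (unavailable) image feature while a small projection module is trained to map that space into the LLM's input-embedding space. The sketch below is a minimal, hypothetical PyTorch rendering of this idea, not the authors' released code; the names (Projector, training_step, clip_text_encoder), the dimensions, and the assumption that llm and tokenizer follow the Hugging Face transformers causal-LM interface are all illustrative.

import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps a CLIP-space feature to a fixed number of LLM soft tokens.
    (Hypothetical module; dimensions are placeholders.)"""
    def __init__(self, clip_dim=512, llm_dim=4096, num_tokens=32):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, num_tokens * llm_dim),
        )
        self.num_tokens, self.llm_dim = num_tokens, llm_dim

    def forward(self, clip_feat):                     # clip_feat: (B, clip_dim)
        out = self.proj(clip_feat)                    # (B, num_tokens * llm_dim)
        return out.view(-1, self.num_tokens, self.llm_dim)

def training_step(projector, clip_text_encoder, llm, tokenizer, captions, answers):
    """One text-only step: the caption's CLIP *text* feature is used as a proxy
    for the image feature; only the projector is trained (CLIP and the LLM are
    assumed frozen, i.e. their parameters have requires_grad=False)."""
    with torch.no_grad():
        # clip_text_encoder is a placeholder callable returning (B, clip_dim) features
        proxy_feat = clip_text_encoder(captions)
    soft_tokens = projector(proxy_feat)               # (B, T, llm_dim)

    target_ids = tokenizer(answers, return_tensors="pt", padding=True).input_ids
    target_emb = llm.get_input_embeddings()(target_ids)          # (B, L, llm_dim)
    inputs_embeds = torch.cat([soft_tokens, target_emb], dim=1)  # prefix + answer

    # Standard next-token loss; -100 masks the soft-token prefix from supervision.
    ignore = torch.full(soft_tokens.shape[:2], -100, dtype=torch.long)
    labels = torch.cat([ignore, target_ids], dim=1)
    loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    return loss

At inference time, under the same assumptions, the projector would instead receive CLIP image-encoder features, which land in roughly the same region of the coupled space, so the LLM can consume visual content without ever having been trained on image-text pairs.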