Computer science
Artificial intelligence
Task (project management)
Robot
Generalization
Transformer
Human-computer interaction
Scalability
Protocol (science)
Machine learning
Mathematics
Quantum mechanics
Alternative medicine
Management
Voltage
Economics
Pathology
Mathematical analysis
Physics
Database
Medicine
Authors
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan
Source
Journal: Cornell University - arXiv
Date: 2022-01-01
Citations: 48
Identifier
DOI: 10.48550/arxiv.2210.03094
Abstract
Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io/
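The abstract describes VIMA as a transformer-based agent that consumes a multimodal prompt of interleaved textual and visual tokens and emits motor actions autoregressively. Below is a minimal sketch of that general idea, not the authors' released code: the embedding dimensions, vocabulary sizes, discrete action tokens, and the simplification of "interleaving" to simple concatenation are illustrative assumptions, and positional encodings and the visual tokenizer are omitted.

```python
import torch
import torch.nn as nn

class MultimodalPromptPolicy(nn.Module):
    """Toy prompt-conditioned policy: encode a text/visual prompt, then decode
    discrete motor-action tokens autoregressively (illustrative sketch only)."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 text_vocab=1000, visual_dim=512, action_vocab=256):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, d_model)      # prompt word tokens
        self.visual_proj = nn.Linear(visual_dim, d_model)        # prompt visual features
        self.action_embed = nn.Embedding(action_vocab, d_model)  # past action tokens
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_vocab)      # next-action-token logits

    def forward(self, text_ids, visual_feats, action_ids):
        # "Interleaving" is simplified here to concatenating the two token streams.
        prompt = torch.cat([self.text_embed(text_ids),
                            self.visual_proj(visual_feats)], dim=1)
        tgt = self.action_embed(action_ids)
        # Causal mask so each action token only attends to earlier ones.
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, prompt, tgt_mask=causal)
        return self.action_head(out)

policy = MultimodalPromptPolicy()
logits = policy(torch.randint(0, 1000, (1, 12)),  # 12 text prompt tokens
                torch.randn(1, 4, 512),           # 4 visual prompt tokens
                torch.randint(0, 256, (1, 6)))    # 6 previously emitted action tokens
print(logits.shape)  # torch.Size([1, 6, 256]): logits for each next action token
```

In this sketch the decoder cross-attends to the encoded prompt while a causal mask keeps the action stream autoregressive, mirroring the prompt-conditioned decoding the abstract describes.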