Watch and Act: Learning Robotic Manipulation From Visual Demonstration
Computer Science
Human-Computer Interaction
Artificial Intelligence
Computer Vision
Authors
Shuo Yang, Wei Zhang, Ran Song, Jiyu Cheng, Hesheng Wang, Yibin Li
Source
Journal: IEEE Transactions on Systems, Man, and Cybernetics [Institute of Electrical and Electronics Engineers] Date: 2023-03-09 Volume/Issue: 53 (7): 4404-4416 Cited by: 9
Identifiers
DOI:10.1109/tsmc.2023.3248324
Abstract
Learning from demonstration holds the promise of enabling robots to learn diverse actions from expert experience. In contrast to learning from observation-action pairs, humans imitate in a more flexible and efficient manner: they learn behaviors by simply "watching." In this article, we propose a "watch-and-act" imitation learning pipeline that endows a robot with the ability to learn diverse manipulations from visual demonstrations. Specifically, we address this problem by intuitively casting it as two subtasks: 1) understanding the demonstration video and 2) learning the demonstrated manipulations. First, a captioning module based on visual change is presented to understand the demonstration by translating the demonstration video into a command sentence. Then, to execute the captioned command, a manipulation module that learns the demonstrated manipulations is built upon an instance segmentation model and a manipulation affordance prediction model. We validate the superiority of the two modules over existing methods separately via extensive experiments and demonstrate the whole robotic imitation system developed based on the two modules in diverse scenarios using a real robotic arm. A supplementary video is available at https://vsislab.github.io/watch-and-act/.
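The abstract describes a two-stage architecture: a "watch" stage that captions the demonstration video into a command sentence, and an "act" stage that executes the command via instance segmentation and affordance prediction. The sketch below is only an illustration of that control flow; the class names (CaptioningModule, ManipulationModule), method signatures, and all internals are hypothetical placeholders, not the paper's actual models or interfaces.

```python
"""Minimal, assumption-laden sketch of a watch-and-act style pipeline.

Every model call below is a stub standing in for the learned components
named in the abstract (visual-change-based captioning, instance
segmentation, manipulation affordance prediction).
"""

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InstanceMask:
    label: str                      # predicted object category, e.g. "cup"
    pixels: List[Tuple[int, int]]   # (row, col) coordinates of the instance


class CaptioningModule:
    """'Watch' stage: translate a demonstration video into a command sentence."""

    def describe(self, video_frames: list) -> str:
        # Placeholder: the paper infers the command from visual change
        # across the demonstration; here we just return a fixed sentence.
        return "put the red cup on the plate"


class ManipulationModule:
    """'Act' stage: execute the command using segmentation + affordance prediction."""

    def segment(self, scene_image) -> List[InstanceMask]:
        # Placeholder for the instance segmentation model.
        return [InstanceMask("red cup", [(10, 12)]), InstanceMask("plate", [(40, 55)])]

    def predict_affordance(self, scene_image, mask: InstanceMask) -> dict:
        # Placeholder for the affordance prediction model: in practice this
        # would output a grasp/placement pose for the target instance.
        return {"object": mask.label, "grasp_point": mask.pixels[0]}

    def execute(self, command: str, scene_image) -> None:
        for mask in self.segment(scene_image):
            if mask.label in command:
                action = self.predict_affordance(scene_image, mask)
                print(f"acting on {action['object']} at {action['grasp_point']}")


def watch_and_act(video_frames: list, scene_image) -> None:
    """Run the full pipeline: understand the demonstration, then reproduce it."""
    command = CaptioningModule().describe(video_frames)
    ManipulationModule().execute(command, scene_image)


if __name__ == "__main__":
    watch_and_act(video_frames=[], scene_image=None)
```

The point of the sketch is the decomposition itself: the natural-language command is the only interface between the two modules, which is what lets the robot learn from a video alone rather than from observation-action pairs.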