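Pose
Computer science
Artificial intelligence
Object (grammar)
Vocabulary
Computer vision
Resolution (logic)
Estimation
Natural language processing
Linguistics
Engineering
Philosophy
Systems engineering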
Authors
Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi
Source
Journal: Cornell University - arXiv
Date: 2024-06-24
Identifier
DOI:10.48550/arxiv.2406.16384
Abstract
Generalising to unseen objects in 6D pose estimation is very challenging. While Vision-Language Models (VLMs) enable the use of natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.
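The abstract does not say how registration is computed from the cross-scene matches. As an illustration only, the sketch below assumes the matches have already been lifted to corresponding 3D points in the two scenes and recovers the relative pose with the standard Kabsch (SVD-based) least-squares alignment. The function name `estimate_rigid_transform` and the input layout are hypothetical and are not taken from Horyon.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid alignment (Kabsch algorithm).

    Assumes src[i] and dst[i] are matched 3D points (N x 3 arrays).
    Returns R (3x3) and t (3,) such that dst is approximately src @ R.T + t.
    This is an illustrative stand-in, not the paper's registration step.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Centre both point sets on their centroids.
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)

    # 3x3 cross-covariance between the centred sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)

    # The SVD of H yields the optimal rotation.
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Translation that maps the rotated source centroid onto the target one.
    t = dst_centroid - R @ src_centroid
    return R, t
```

With noise-free matches this recovers the exact relative pose. With real matches, some of which are wrong, such a solver is usually wrapped in a robust estimator such as RANSAC; whether Horyon does so is not stated in the abstract.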