LiDAR
Artificial intelligence
Computer vision
Computer science
Transformer
Representation
Pattern recognition
Remote sensing
Geography
Engineering
Electrical engineering
Authors
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lü, Yu Qiao, Jifeng Dai
Identifier
DOI: 10.1109/tpami.2024.3515454
Abstract
Multi-modality fusion is currently the de-facto most competitive strategy for 3D perception tasks. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from both point-cloud and camera input, thus completing multi-modality fusion in BEV space. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Comparisons with other fusion paradigms demonstrate that the proposed fusion method is both succinct and effective. Our approach achieves a new state of the art of 74.1% NDS on the nuScenes test set. In addition, we extend BEVFormer to a wide range of autonomous driving tasks, including object tracking, vectorized mapping, occupancy prediction, and end-to-end autonomous driving, achieving outstanding results across these tasks. The code is released at https://github.com/fundamentalvision/BEVFormer.
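The two mechanisms the abstract names can be illustrated with a heavily simplified sketch: grid-shaped BEV queries are first blended with the previous frame's BEV features (a stand-in for temporal self-attention), then each query aggregates multi-view features (a stand-in for spatial cross-attention). All names, shapes, and the weighted-average formulation below are illustrative assumptions, not the paper's actual deformable-attention implementation.

```python
import numpy as np

# Hypothetical, simplified sketch of BEVFormer's two attention steps.
# Shapes (bev_h, bev_w, dim, num_cams) are illustrative, not the
# paper's configuration.

def temporal_self_attention(bev_query, prev_bev, alpha=0.5):
    """Recurrently fuse history BEV features into the current queries
    (toy stand-in for the paper's temporal self-attention)."""
    if prev_bev is None:          # first frame: nothing to fuse
        return bev_query
    return alpha * bev_query + (1 - alpha) * prev_bev

def spatial_cross_attention(bev_query, cam_feats, weights):
    """Each BEV query aggregates features from the camera views.
    cam_feats: (num_cams, num_query, dim); weights: (num_cams, num_query)."""
    w = weights / weights.sum(axis=0, keepdims=True)  # normalize over cameras
    return (w[..., None] * cam_feats).sum(axis=0)

bev_h, bev_w, dim, num_cams = 4, 4, 8, 6
num_query = bev_h * bev_w          # grid-shaped BEV queries, flattened

rng = np.random.default_rng(0)
bev = rng.normal(size=(num_query, dim))       # current BEV queries
prev = rng.normal(size=(num_query, dim))      # BEV from the previous frame
feats = rng.normal(size=(num_cams, num_query, dim))
hits = rng.random(size=(num_cams, num_query)) # per-camera relevance weights

bev = temporal_self_attention(bev, prev)
bev = spatial_cross_attention(bev, feats, hits)
print(bev.shape)  # (16, 8)
```

In the actual model both steps use deformable attention with learned sampling offsets and are stacked over several encoder layers; the sketch only shows the data flow of "fuse history, then fuse multi-view input" that the abstract describes.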