Keywords: Artificial intelligence; Computer vision; Computer science; Pixel; Voxel; Convolution; Point cloud; Object detection; Noise; LiDAR; Pooling; Feature extraction; Pattern recognition; Image; Remote sensing; Artificial neural network
Authors
Huaijin Liu,Ji‐Xiang Du,Yong Zhang,Hongbo Zhang,Jiandian Zeng
DOI: 10.1016/j.patcog.2024.110284
Abstract
Current LiDAR-only 3D detection methods inevitably suffer from the sparsity of point clouds and insufficient semantic information. To alleviate this difficulty, recent proposals densify LiDAR points by depth completion and then fuse features with image pixels at the data level or result level. However, these methods often fuse poorly and make insufficient use of image information at the voxel feature level. Meanwhile, noise introduced by inaccurate depth completion significantly degrades detection accuracy. In this paper, we propose PVConvNet, a unified framework for multi-modal feature fusion that combines LiDAR points, virtual points and image pixels. First, we develop an efficient Pixel-Voxel Sparse Convolution (PVConv) to perform voxel-wise feature-level fusion of point clouds and images. Second, we design a Noise-Resistant Dilated Sparse Convolution (NRDConv) to encode the voxel features of virtual points, which effectively reduces the impact of noise. Finally, we propose a unified RoI pooling strategy, Multimodal Voxel-RoI Pooling, to improve proposal refinement accuracy. We evaluate PVConvNet on the widely used KITTI dataset and the more challenging nuScenes dataset. Experimental results show that our method outperforms state-of-the-art multi-modal methods, achieving a 3D AP of 86.92% on the moderate difficulty level of the KITTI test set.
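The abstract does not give implementation details, but the core idea behind PVConv's voxel-wise feature-level fusion can be illustrated with the generic projection-and-sample pattern used by many pixel-voxel fusion methods: project each voxel center into the image with the camera calibration, bilinearly sample the image feature map at that location, and concatenate the sampled pixel features with the voxel features. The PyTorch sketch below shows only that generic pattern; every function name, shape, and the stand-in calibration matrix are assumptions for illustration, not the authors' code or API.

```python
# Hypothetical sketch of voxel-wise pixel-feature fusion (generic pattern,
# not the PVConvNet implementation). Assumed shapes are noted per function.
import torch
import torch.nn.functional as F


def project_voxels_to_image(voxel_centers, proj_matrix):
    """Project 3D voxel centers (N, 3) to pixel coordinates (N, 2)
    with a 3x4 camera projection matrix (LiDAR frame -> image plane)."""
    ones = torch.ones_like(voxel_centers[:, :1])
    homo = torch.cat([voxel_centers, ones], dim=1)        # (N, 4) homogeneous
    cam = homo @ proj_matrix.T                            # (N, 3)
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)         # perspective divide
    return uv


def sample_pixel_features(image_feats, uv, img_h, img_w):
    """Bilinearly sample per-voxel pixel features from an image feature map.
    image_feats: (C, H, W); uv: (N, 2) coordinates in the original image."""
    # Normalize pixel coordinates to [-1, 1], the range grid_sample expects.
    u = uv[:, 0] / (img_w - 1) * 2 - 1
    v = uv[:, 1] / (img_h - 1) * 2 - 1
    grid = torch.stack([u, v], dim=1).view(1, 1, -1, 2)   # (1, 1, N, 2)
    sampled = F.grid_sample(image_feats.unsqueeze(0), grid, align_corners=True)
    return sampled.view(image_feats.shape[0], -1).T       # (N, C)


def fuse_pixel_voxel(voxel_feats, voxel_centers, image_feats, proj_matrix, img_hw):
    """Concatenate each voxel's features with the image features sampled at
    its projected pixel location: voxel-wise feature-level fusion."""
    uv = project_voxels_to_image(voxel_centers, proj_matrix)
    pix = sample_pixel_features(image_feats, uv, *img_hw)
    return torch.cat([voxel_feats, pix], dim=1)           # (N, C_voxel + C_img)


if __name__ == "__main__":
    # Toy usage: 1000 voxels with 64-dim features, a 32-channel feature map.
    voxel_feats = torch.randn(1000, 64)
    voxel_centers = torch.rand(1000, 3) * torch.tensor([70.0, 40.0, 4.0])
    image_feats = torch.randn(32, 94, 311)   # downsampled image feature map
    proj = torch.randn(3, 4)                 # stand-in for a KITTI calib matrix
    fused = fuse_pixel_voxel(voxel_feats, voxel_centers, image_feats,
                             proj, (375, 1242))
    print(fused.shape)                        # torch.Size([1000, 96])
```

Voxels that project outside the image receive zero pixel features from grid_sample's default padding, a common convention for points outside the camera frustum; how PVConvNet handles such voxels, and how NRDConv suppresses noisy virtual points, is specified only in the full paper.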