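Computer science
Artificial intelligence
Convolutional neural network
Mobile device
Monocular
Transformer
Deep learning
Feature extraction
Computer vision
Real-time computing
Machine learning
Engineering
Voltage
Electrical engineering
Operating system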
Authors
Albert Luginov, Ilya Makarov
Identifier
DOI:10.1109/ismar-adjunct60411.2023.00137
Abstract
Self-supervised Monocular Depth Estimation (MDE) models trained solely on single-camera video have gained significant popularity. Recent studies have shown that Vision Transformers (ViTs) can improve depth estimation quality, despite their high computational demands. In Extended Reality (XR) contexts, lightweight and fast models are crucial for seamless operation on mobile devices. This paper proposes SwiftDepth, a hybrid MDE framework that fulfils these requirements. The model combines the benefits of Convolutional Neural Networks (CNNs), which provide speed and shift invariance, and ViTs, which offer a global receptive field. We utilize SwiftFormer, a low-latency feature extraction network with efficient additive attention. We also introduce a novel two-level decoder to enhance depth estimation quality without increasing the number of parameters. Our model achieves comparable results to the state-of-the-art lightweight Lite-Mono on the KITTI outdoor dataset while demonstrating better generalization capability on the NYUv2 indoor dataset.