Topics: Computer science, Inference, Artificial intelligence, Transformer, Encoder, Monocular, Fusion mechanism, Pattern recognition (psychology), Computer engineering, Algorithm, Fusion, Linguistics, Philosophy, Physics, Quantum mechanics, Voltage, Lipid bilayer fusion, Operating system
Authors
Chen Lv, Chenggong Han, Junhui Chen, Deqiang Cheng, Jiansheng Qian
Source
Journal: Optik
[Elsevier]
Date: 2023-07-26
Volume 288, Article 171219
Citations: 2
Identifier
DOI:10.1016/j.ijleo.2023.171219
Abstract
Supervised monocular depth estimation has long been one of the most important tasks in computer vision. With the convolution module as the basic operator, the U-shaped network architecture has become the de facto standard and has achieved tremendous success. However, due to the limited receptive field of the convolution operation, CNNs are generally inferior at explicitly modeling long-range dependencies. Originally proposed for natural language processing, transformers perform sequence-to-sequence prediction with a global self-attention mechanism and can therefore capture long-range dependencies; however, they have limited localization ability because they lack low-level detail. In this work, we propose TSD-Depth, a model that combines the merits of transformers and CNNs, as a strong alternative for self-supervised monocular depth estimation. The proposed model simultaneously extracts global contextual information and local spatial detail features. Furthermore, by designing a hybrid encoder connection method and a properly sized transformer module, the global and local information can interact more effectively. In addition, a local multi-scale fusion block is proposed for the first time to refine fine-grained details. More importantly, self-distillation is used to transfer knowledge so that the multi-scale fusion block concatenated with the encoder can be skipped at inference time; the block is computed only during training, adding minimal overhead. Experimental results on the NYU-v2 and ScanNet datasets show that the proposed TSD-Depth outperforms previous state-of-the-art methods.
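The abstract's motivation rests on a receptive-field contrast: a convolution's output depends only on a local window, while self-attention lets every position attend to every other. The minimal numpy sketch below (not the paper's code; the functions and the identity projections are illustrative assumptions) makes that contrast concrete: perturbing the first position changes the attention output at the far end of the sequence, but leaves a small convolution's far output untouched.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, d). Single-head attention with identity Q/K/V
    # projections: every output position is a weighted sum over ALL
    # positions, so the receptive field is global.
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores, axis=-1) @ x

def conv1d(x, kernel):
    # 'same'-padded 1-D convolution: each output depends only on a
    # local window, so long-range dependencies need many stacked layers.
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, (pad, pad))
    return np.array([np.dot(xp[i:i + k], kernel) for i in range(len(x))])
```

With a length-8 sequence and a 3-tap kernel, editing position 0 alters the attention output at position 7 but not the convolution output there, which is the long-range-dependency argument the abstract makes for the transformer branch of the hybrid encoder.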
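The self-distillation idea, as described, is that the refinement block is computed only during training, where a distillation loss pushes the plain path to match the refined path, so the block can be dropped at inference. A toy numpy sketch of that training pattern, under loud assumptions: `refine` stands in for the local multi-scale fusion block as a simple smoothing step, and the "network" is just the output vector itself; none of these names or choices come from the paper.

```python
import numpy as np

def refine(depth):
    # Stand-in for the multi-scale fusion block: 3-tap edge-padded smoothing.
    pad = np.pad(depth, 1, mode="edge")
    return 0.25 * pad[:-2] + 0.5 * pad[1:-1] + 0.25 * pad[2:]

def distill_loss(student_out):
    # Teacher path = student path + refinement block (training only).
    return 0.5 * np.mean((student_out - refine(student_out)) ** 2)

def distill_step(student_out, lr=0.5):
    # Gradient step on 0.5*||student - teacher||^2 w.r.t. the student
    # output (teacher treated as a fixed target, as in distillation).
    grad = student_out - refine(student_out)
    return student_out - lr * grad
```

After a few such steps the student output already resembles its own refined version, so at inference the refinement can be skipped; this mirrors the abstract's claim of training-only overhead, though the real model distills network predictions rather than raw vectors.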