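Computer science
Monocular
Artificial intelligence
Feature (linguistics)
Estimation
Computer vision
Fusion
Engineering
Linguistics
Philosophy
Systems engineering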
Authors
Zhongyu Wu,Hua Huang,Qishen Li,Penghui Chen
Identifier
DOI:10.1109/iaeac59436.2024.10503704
Abstract
Monocular depth estimation is a fundamental task in computer vision and has drawn increasing attention. Recently, attention-based models and encoder-decoder architectures have led to large improvements in monocular depth estimation. Most previous methods decode with repeated simple up-sampling operations, which may not make full use of the features extracted by the encoder and which lead to inaccurate predictions at object edges and in regions of maximum depth. We propose an attention-based feature fusion module for the encoder and decoder. We treat monocular depth estimation as a pixel-level optimization problem: the coarsest encoder feature initializes the optimization, and the result is then refined to higher resolution by the proposed attentional feature fusion (AFF). We formulate the prediction problem as ordinal regression over the bin centers that discretize the continuous depth range. The distribution of bins adapts to each input image, and the bins are predicted at the coarsest level using global pooling and MLP layers. On the NYUv2 dataset, the proposed architecture improves on the original model by 2.5% and 1.1% in terms of Log10 error and absolute relative error, respectively.
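The bin-based formulation and the two reported metrics can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how per-pixel depth is read out from the bins, so the softmax-weighted sum of bin centers used here is an assumption borrowed from common adaptive-bin approaches. The function names and the depth range `d_min`, `d_max` are hypothetical, and the metric definitions are the ones conventionally used on depth benchmarks.

```python
import numpy as np

def bin_centers(widths, d_min=1e-3, d_max=10.0):
    """Turn non-negative per-image bin widths into bin centers in [d_min, d_max].

    d_min and d_max are assumed values, not taken from the paper.
    """
    widths = np.asarray(widths, dtype=np.float64)
    widths = widths / widths.sum()  # normalize so the widths sum to 1
    edges = d_min + (d_max - d_min) * np.concatenate(([0.0], np.cumsum(widths)))
    return 0.5 * (edges[:-1] + edges[1:])  # midpoint of each bin

def depth_from_bins(logits, centers):
    """Map per-pixel bin logits (N_bins, H, W) to a depth map (H, W).

    Assumption: depth is the softmax-weighted sum of the bin centers.
    """
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    return np.tensordot(centers, probs, axes=(0, 0))

def abs_rel(pred, gt):
    """Absolute relative error: mean(|pred - gt| / gt)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt) / gt))

def log10_error(pred, gt):
    """Log10 error: mean(|log10(pred) - log10(gt)|)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(np.log10(pred) - np.log10(gt))))
```

In a full model, the widths would come from the global-pooling and MLP branch and the logits from the decoder. Both metrics are lower-is-better and assume strictly positive ground-truth depths.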