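Granularity
Computer science
Discriminant
Artificial intelligence
Feature (linguistics)
Pattern recognition (psychology)
Fusion
Machine learning
Philosophy
Linguistics
Operating system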
Authors
Yang Xu, Shanshan Wu, Biqi Wang, Ming-Hsuan Yang, Zebin Wu, Yazhou Yao, Zhihui Wei
Identifier
DOI:10.1016/j.patcog.2023.110042
Abstract
Fine-grained visual classification (FGVC) is a difficult task because discriminative features are hard to learn. Most existing methods directly use the final output of the network, which always contains global features with high-level semantic information. However, the differences between fine-grained images are reflected in subtle local regions, which often appear in the early layers of the network. When the textures of the background and the object are similar, or the background occupies too large a proportion of the image, the prediction is greatly affected. To address these problems, this paper proposes a multi-granularity feature fusion (MGFF) module and a two-stage classification scheme based on the Vision Transformer (ViT). The former represents images comprehensively by fusing features of different granularities, thus avoiding the limitations of single-scale features. The latter leverages the ViT model to separate the object from the background at very small cost, thereby improving prediction accuracy. We conduct comprehensive experiments and achieve the best performance on two fine-grained tasks, CUB-200-2011 and NA-Birds.
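The abstract does not say how MGFF combines features or how the object is separated from the background, so the following is only a minimal sketch of the two ideas under loudly stated assumptions: fusion is illustrated as mean pooling of token features from several layers followed by concatenation, and localization is illustrated as thresholding a patch attention map at its mean and taking the bounding box. The function names `fuse_granularities` and `attention_bbox` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / n for d in range(dim)]


def fuse_granularities(layer_tokens: List[List[Vector]]) -> Vector:
    """Assumed fusion rule: mean-pool each layer's tokens, then concatenate.

    layer_tokens[i] holds the token vectors taken from the i-th chosen layer
    (for example one shallow and one deep layer).
    """
    fused: Vector = []
    for tokens in layer_tokens:
        fused.extend(mean_pool(tokens))
    return fused


def attention_bbox(attn: List[List[float]]) -> Tuple[int, int, int, int]:
    """Assumed localization rule: keep patches whose attention exceeds the mean.

    Returns (top, left, bottom, right) in patch coordinates, inclusive.
    Falls back to the whole grid when no patch exceeds the mean.
    """
    flat = [a for row in attn for a in row]
    thresh = sum(flat) / len(flat)
    rows = [r for r, row in enumerate(attn) if any(a > thresh for a in row)]
    cols = [c for c in range(len(attn[0])) if any(row[c] > thresh for row in attn)]
    if not rows:
        return 0, 0, len(attn) - 1, len(attn[0]) - 1
    return rows[0], cols[0], rows[-1], cols[-1]


# Toy inputs: two layers with two 2-dimensional tokens each.
shallow = [[1.0, 3.0], [3.0, 5.0]]
deep = [[0.0, 2.0], [4.0, 6.0]]
fused = fuse_granularities([shallow, deep])

# Toy 4x4 attention map with the object in the centre.
attn_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = attention_bbox(attn_map)
```

In a two-stage setup of this kind, the box from the first pass would be used to crop the image, and the crop would be classified in a second pass; whether the paper crops, masks, or reweights patches is not stated in the abstract.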