Authors
Qin Xu,Jiahui Wang,Bo Jiang,Bin Luo
Identifier
DOI:10.1109/tmm.2023.3244340
Abstract
Recently, vision transformers (ViTs) have been investigated for fine-grained visual classification (FGVC) and now define the state of the art. However, most ViT-based works ignore the unequal learning performance of the heads in the multi-head self-attention (MHSA) mechanism and of its layers. To address these issues, in this paper we propose a novel internal ensemble learning transformer (IELT) for FGVC. The proposed IELT comprises three main modules: a multi-head voting (MHV) module, a cross-layer refinement (CLR) module, and a dynamic selection (DS) module. To solve the problem of the inconsistent performance of the multiple heads, the MHV module treats all of the heads in each layer as weak learners and votes for tokens of discriminative regions, based on the attention maps and spatial relationships, to form the cross-layer feature. To effectively mine the cross-layer feature and suppress noise, the CLR module extracts the refined feature and applies the assist-logits operation for the final prediction. In addition, the newly designed DS module adjusts the number of tokens selected at each layer by weighting their contributions to the refined feature. In this way, the idea of ensemble learning is combined with the ViT to improve fine-grained feature representation. Experiments on five popular FGVC datasets demonstrate that our method achieves competitive results compared with the state of the art. Source code has been released at https://github.com/mobulan/IELT .
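The core idea of the MHV module, as described above, is to treat each attention head as a weak learner that votes for discriminative tokens. A minimal sketch of such head-level voting is shown below; the function name, the array shapes (heads × patch tokens, e.g. CLS-to-token attention), and the vote counts per head are illustrative assumptions, not the authors' implementation, which also incorporates spatial relationships.

```python
import numpy as np

def multi_head_voting(attn, k_per_head=4, k_select=6):
    """Head-level voting over tokens (illustrative sketch).

    attn: array of shape (num_heads, num_tokens), e.g. each head's
          CLS-to-patch attention weights in one layer.
    Each head (a "weak learner") votes for its k_per_head strongest
    tokens; the k_select tokens with the most votes overall are kept.
    """
    num_heads, num_tokens = attn.shape
    votes = np.zeros(num_tokens, dtype=int)
    for head in attn:
        top = np.argsort(head)[-k_per_head:]  # this head's strongest tokens
        votes[top] += 1                       # one vote per selected token
    # rank tokens by total votes, highest first
    return np.argsort(votes)[-k_select:][::-1]

rng = np.random.default_rng(0)
attn = rng.random((12, 196))       # e.g. 12 heads, 14x14 patch tokens
selected = multi_head_voting(attn)
print(selected)                    # indices of the most-voted tokens
```

Tokens agreed upon by many heads are more likely to mark genuinely discriminative regions than tokens favored by a single head, which is the ensemble intuition the abstract invokes.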