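Closed captioning
Transformer
Computer science
Encoder
Grid
Artificial intelligence
Fuse (electrical)
Image (mathematics)
Speech recognition
Pattern recognition (psychology)
Engineering
Electrical engineering
Voltage
Geometry
Mathematics
Operating system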
Authors
Liangshan Lou, Ke Lü, Jian Xue
Identifier
DOI:10.1109/icpr56361.2022.9956518
Abstract
Image captioning is currently one of the most important multimodal tasks. Since the introduction of the Transformer, many Transformer-based models have achieved good performance in image captioning, but substantial work is still needed to improve results further. We propose an improved model that uses the meshed-memory Transformer as its backbone and takes region features and grid features together as input. Two identical parallel encoders process the region features and the grid features separately, and the outputs of each layer of the two encoders are fused to form one of the inputs of the decoder. We comprehensively compare our model with existing state-of-the-art models on the official COCO dataset. Experiments show that, on the Karpathy test split, our model outperforms the backbone on all evaluation metrics: for example, it increases BLEU-1 from 80.8% to 81.4% and CIDEr from 131.2% to 133.5%.
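The dual-encoder data flow described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not say how the per-layer outputs are fused, so concatenation along the token axis is an assumption here, and `toy_layer`, `encode_all_layers`, `fuse_per_layer`, and `dual_encoder` are hypothetical names. `toy_layer` is only a stand-in for a real attention-based encoder layer.

```python
from typing import Callable, List

Vector = List[float]
Tokens = List[Vector]                 # one vector per region or per grid cell
Layer = Callable[[Tokens], Tokens]


def toy_layer() -> Layer:
    """Hypothetical stand-in for an encoder layer (NOT real attention):
    each token is averaged with the mean of all tokens in the sequence."""
    def apply(tokens: Tokens) -> Tokens:
        dim = len(tokens[0])
        mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
        return [[(t[d] + mean[d]) / 2 for d in range(dim)] for t in tokens]
    return apply


def encode_all_layers(tokens: Tokens, layers: List[Layer]) -> List[Tokens]:
    """Run a layer stack and keep the output of every layer, not only the last."""
    outputs = []
    for layer in layers:
        tokens = layer(tokens)
        outputs.append(tokens)
    return outputs


def fuse_per_layer(region_outs: List[Tokens], grid_outs: List[Tokens]) -> List[Tokens]:
    """Assumed fusion rule: concatenate the two token sequences, layer by layer."""
    return [r + g for r, g in zip(region_outs, grid_outs)]


def dual_encoder(region_feats: Tokens, grid_feats: Tokens, num_layers: int = 3) -> List[Tokens]:
    """Two encoders with the same structure but separate layer instances."""
    region_layers = [toy_layer() for _ in range(num_layers)]
    grid_layers = [toy_layer() for _ in range(num_layers)]
    region_outs = encode_all_layers(region_feats, region_layers)
    grid_outs = encode_all_layers(grid_feats, grid_layers)
    # One fused sequence per layer; a decoder would receive this list as one input.
    return fuse_per_layer(region_outs, grid_outs)


# Toy input: 2 region vectors and 3 grid vectors, each of dimension 2.
memory = dual_encoder([[1.0, 1.0], [3.0, 3.0]], [[0.0, 0.0]] * 3)
print(len(memory), len(memory[0]), len(memory[0][0]))
```

With the toy input, `memory` holds one fused sequence per encoder layer, and each fused sequence contains the region tokens followed by the grid tokens. Keeping every layer's output rather than only the last one mirrors the abstract's statement that the outputs of each layer are fused.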