Authors
Shikha Dubey, Farrukh Olimov, Muhammad Aasim Rafique, Joonmo Kim, Moongu Jeon
Identifier
DOI:10.1016/j.ins.2022.12.018
Abstract
Encoder-decoder-based image captioning techniques are generally used to describe the meaningful information present in an image. In this work, we investigate two unexplored ideas for transformer-based image captioning: 1) an object-focused label attention module (LAM), and 2) a geometrically coherent proposal (GCP) module that focuses on the scale and position of objects so that the transformer model attains better image perception. These modules enforce objects' relevance to their surrounding environment and explore the effectiveness of learning an explicit association between vision and language constructs. LAM and GCP tolerate variation in object classes and their association with labels in multi-label classification. The proposed framework, the label-attention transformer with geometrically coherent objects (LATGeO), acquires proposals of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using LAM. LAM associates the extracted object classes with the available dictionary using self-attention layers. Object coherence is acquired in the GCP module using the localized ratio of the proposals' geometrical features. Experiments are performed on the MSCOCO dataset. The evaluation of LATGeO on MSCOCO shows that objects' relevance to their surroundings, together with their visual features bound to geometrically localized ratios and associated labels, yields improved and meaningful captions.
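The abstract describes the GCP module as using localized ratios of the proposals' geometrical features. A minimal sketch of that idea is below, assuming each proposal is a bounding box in `(x, y, w, h)` form; the relative-offset and log-scale-ratio features shown are a common choice for geometry-aware attention, and the exact features used by LATGeO's GCP module may differ.

```python
import numpy as np

def geometric_relation(box_a, box_b):
    """Pairwise geometric features between two object proposals.

    Boxes are (x, y, w, h). Offsets are normalized by the reference
    box's size, and scale ratios are taken in log space, so the
    features are invariant to overall image scale. This is an
    illustrative sketch, not the paper's exact formulation.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return np.array([
        (xb - xa) / wa,   # horizontal offset, normalized by width
        (yb - ya) / ha,   # vertical offset, normalized by height
        np.log(wb / wa),  # width (scale) ratio
        np.log(hb / ha),  # height (scale) ratio
    ])

# Two hypothetical proposals: a large object and a smaller one inside it.
proposals = [(10.0, 10.0, 40.0, 40.0), (30.0, 20.0, 20.0, 20.0)]
rel = geometric_relation(proposals[0], proposals[1])
```

Features like these can be computed for every proposal pair and fed into an attention layer as a geometric bias, which is one way a model can reason about the relative scale and position of objects.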