Concepts
Closed captioning, Computer science, Feature (linguistics), Artificial intelligence, Task (project management), Coding (set theory), Natural language processing, Image (mathematics), Software deployment, Mechanism (biology), Comprehension, Information retrieval, Programming language, Philosophy, Epistemology, Linguistics, Management, Set (abstract data type), Economy, Operating system
Authors
Qi Wang, Hongyu Deng, Xue Wu, Zhenguo Yang, Yun Liu, Yazhou Wang, Ge-Fei Hao
Identifier
DOI: 10.1016/j.neunet.2023.03.010
Abstract
Text-based image captioning (TextCap) aims to remedy a shortcoming of existing image captioning tasks, which ignore the text that appears in images when describing them. It instead requires models to recognize and describe images from both their visual and textual content, achieving a deeper comprehension of the image. However, existing methods tend to stack numerous complex network architectures to improve performance; on the one hand, this still fails to adequately model the relationship between vision and text, and on the other, it leads to long running times, high memory consumption, and other problems that hinder deployment. To address these issues, we developed a lightweight captioning method with a collaborative mechanism, LCM-Captioner, which balances high efficiency with high performance. First, we propose a feature-lightening transformation for the TextCap task, named TextLighT, which learns rich multimodal representations while mapping features to lower dimensions, thereby reducing memory costs. Next, we present a collaborative attention module for visual and textual information, VTCAM, which facilitates the semantic alignment of multimodal information to uncover important visual objects and textual content. Finally, extensive experiments on the TextCaps dataset demonstrate the effectiveness of our method. Code is available at https://github.com/DengHY258/LCM-Captioner.
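To make the two components concrete, here is a minimal, hypothetical PyTorch sketch of the ideas the abstract names: a feature-lightening projection that maps each modality into a smaller shared dimension (the TextLighT idea), and a bidirectional visual-text cross-attention block for semantic alignment (the VTCAM idea). Every module name, dimension, and structural choice below is an illustrative assumption, not the authors' implementation; the real code lives in the repository linked above.

```python
import torch
import torch.nn as nn

class FeatureLightening(nn.Module):
    """Hypothetical sketch of a feature-lightening transform: project
    high-dimensional object features and OCR-token features into a
    shared lower-dimensional space to cut memory cost. All dimensions
    are illustrative assumptions, not values from the paper."""

    def __init__(self, vis_dim=2048, ocr_dim=768, hid_dim=256):
        super().__init__()
        self.vis_proj = nn.Sequential(nn.Linear(vis_dim, hid_dim), nn.LayerNorm(hid_dim))
        self.ocr_proj = nn.Sequential(nn.Linear(ocr_dim, hid_dim), nn.LayerNorm(hid_dim))

    def forward(self, vis_feats, ocr_feats):
        # vis_feats: (B, num_objects, vis_dim); ocr_feats: (B, num_tokens, ocr_dim)
        return self.vis_proj(vis_feats), self.ocr_proj(ocr_feats)


class CollaborativeAttention(nn.Module):
    """Generic bidirectional cross-attention between visual and textual
    streams, standing in for the VTCAM idea: each modality queries the
    other so that salient objects and OCR tokens are aligned. This is a
    textbook block, not the authors' module."""

    def __init__(self, hid_dim=256, n_heads=8):
        super().__init__()
        self.txt_to_vis = nn.MultiheadAttention(hid_dim, n_heads, batch_first=True)
        self.vis_to_txt = nn.MultiheadAttention(hid_dim, n_heads, batch_first=True)
        self.norm_vis = nn.LayerNorm(hid_dim)
        self.norm_txt = nn.LayerNorm(hid_dim)

    def forward(self, vis, txt):
        # Visual stream attends over text tokens, and vice versa;
        # residual connections keep each stream's own information.
        vis_ctx, _ = self.txt_to_vis(query=vis, key=txt, value=txt)
        txt_ctx, _ = self.vis_to_txt(query=txt, key=vis, value=vis)
        return self.norm_vis(vis + vis_ctx), self.norm_txt(txt + txt_ctx)


if __name__ == "__main__":
    # Toy shapes: 36 detected objects and 20 OCR tokens per image.
    vis = torch.randn(2, 36, 2048)
    ocr = torch.randn(2, 20, 768)
    vis_l, ocr_l = FeatureLightening()(vis, ocr)
    vis_a, ocr_a = CollaborativeAttention()(vis_l, ocr_l)
    print(vis_a.shape, ocr_a.shape)  # torch.Size([2, 36, 256]) torch.Size([2, 20, 256])
```

The residual connections in the attention block let each stream retain its own features while absorbing context from the other modality, which is the usual way a co-attention design keeps alignment from washing out modality-specific information.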