Computer science
Automatic summarization
Interpretability
Sentence
Artificial intelligence
Natural language processing
Key (lock)
Security token
Domain (mathematical analysis)
Generative grammar
Feature (linguistics)
Embedding
Language model
Machine learning
Philosophy
Mathematical analysis
Linguistics
Computer security
Mathematics
Authors
Dengtian Lin, Liqiang Jing, Xuemeng Song, Meng Liu, Teng Sun, Liqiang Nie
Identifier
DOI: 10.1145/3539618.3591633
Abstract
Multimodal sentence summarization (MMSS), which aims to generate a brief summary of a source sentence and its paired image, is a new yet challenging task. Although existing methods have achieved compelling success, they still suffer from two key limitations: 1) they do not adapt generative pre-trained language models to open-domain MMSS, and 2) they lack explicit modeling of critical information. To address these limitations, we propose a BART-MMSS framework that adopts BART as the backbone. Specifically, we propose a prompt-guided image encoding module to extract source image features: it leverages several learnable soft prompts for image patch embedding, which facilitates injecting visual content into BART for open-domain MMSS tasks. Thereafter, we devise an explicit source critical token learning module that directly captures the critical tokens of the source sentence with reference to the source image, and we incorporate explicit supervision to improve performance. Extensive experiments on a public dataset fully validate the superiority of the proposed method. In addition, the tokens predicted by the vision-guided key-token highlighting module can be easily understood by humans, which improves the interpretability of our model.
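The following is a minimal, hypothetical PyTorch sketch of the two ideas the abstract describes: learnable soft prompts that pull visual content out of image patch features so it can be prepended to a BART-style encoder, and an image-conditioned scorer that flags critical source tokens under explicit supervision. All class names, dimensions, and the pooling and scoring choices here are illustrative assumptions, not the authors' implementation.

# Sketch only: not the authors' code. Assumes patch features come from some
# frozen vision backbone and that BART token embeddings are available.
import torch
import torch.nn as nn

class PromptGuidedImageEncoder(nn.Module):
    def __init__(self, patch_dim=768, model_dim=768, num_prompts=8, num_heads=8):
        super().__init__()
        # Soft, to-be-learned prompt vectors: one query per visual prompt slot.
        self.soft_prompts = nn.Parameter(torch.randn(num_prompts, model_dim) * 0.02)
        # Project image patch features to the language model's hidden size.
        self.patch_proj = nn.Linear(patch_dim, model_dim)
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)

    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, patch_dim)
        patches = self.patch_proj(patch_feats)
        queries = self.soft_prompts.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        visual_prompts, _ = self.cross_attn(queries, patches, patches)
        return visual_prompts  # (batch, num_prompts, model_dim)

class VisionGuidedKeyTokenScorer(nn.Module):
    """Scores each source token as critical or not, conditioned on the image."""
    def __init__(self, model_dim=768):
        super().__init__()
        self.scorer = nn.Linear(2 * model_dim, 1)

    def forward(self, token_states, visual_prompts):
        # token_states: (batch, seq_len, dim); pool visual prompts into one image vector.
        img_vec = visual_prompts.mean(dim=1, keepdim=True).expand(-1, token_states.size(1), -1)
        logits = self.scorer(torch.cat([token_states, img_vec], dim=-1)).squeeze(-1)
        # Explicit supervision could be a BCE loss against key-token labels.
        return logits  # (batch, seq_len)

# Usage sketch with dummy tensors: prepend visual prompts to the encoder input.
enc = PromptGuidedImageEncoder()
scorer = VisionGuidedKeyTokenScorer()
patch_feats = torch.randn(2, 49, 768)    # dummy image patch features
token_embeds = torch.randn(2, 32, 768)   # dummy BART token embeddings
visual_prompts = enc(patch_feats)
encoder_inputs = torch.cat([visual_prompts, token_embeds], dim=1)  # fed to the encoder
key_logits = scorer(token_embeds, visual_prompts)

In this reading, the soft prompts act as a small, trainable interface between the vision features and the frozen-or-finetuned language backbone, while the key-token logits give the human-readable highlighting the abstract credits for interpretability.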