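Closed captioning
Computer science
Leverage (statistics)
Question answering
Artificial intelligence
Task (project management)
Natural language processing
Language model
Natural language
Information retrieval
Machine learning
Image (mathematics)
Economics
Management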
Authors
Hui Li Liu,Xiaojun Wan
Identifier
DOI:10.1145/3652583.3658061
Abstract
Video captioning is the task of describing video content in natural-language sentences. While recent models have shown significant improvements on automatic metrics, some issues remain unresolved: model-generated captions often contain factual errors and omit important details. In contrast, human-written captions describe the video content accurately and comprehensively. In this work, we propose a novel method that uses question answering (QA) techniques to enhance video captioning models. We start by generating QA pairs from both videos and human-written captions. We then propose a QA-enhanced captioning model to better leverage the QA information. Finally, we employ reinforcement learning to train the model to maximize a QA reward. By incorporating these QA-related techniques, our model can generate more accurate and comprehensive video captions. We conduct experiments on three datasets: ActivityNet Captions, YouCookII, and MSR-VTT. The experimental results, ablation studies, and human evaluations demonstrate the advantages of our method.