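Computer science
Task (project management)
Artificial intelligence
Perspective (graphical)
Modal verb
Quality (philosophy)
Human-computer interaction
Video quality
Multimedia
Computer vision
Natural language processing
Philosophy
Polymer chemistry
Management
Chemistry
Metric (unit)
Economics
Epistemology
Operations management
Authors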
Xuefei Huang,Wei Ke,Hao Sheng
Identifier
DOI:10.1109/uv56588.2022.10185501
Abstract
Video captioning is the automatic generation of abstract expressions for the content contained in videos. It involves two important fields, computer vision and natural language processing, and has become an important research topic for smart living. Deep learning has been applied to this task with good results. Videos contain multiple modalities of information, yet most existing solutions start from the visual perspective and ignore the equally important audio modality. How to benefit from cues other than visual information therefore remains a major challenge. In this work, we propose a video caption generation method that fuses multimodal features in videos and adds an attention mechanism to improve the quality of the generated description sentences. Experimental results on the MSR-VTT dataset validate the method.
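The abstract does not say how the multimodal features are fused or which form of attention is used. As a rough illustration only, the sketch below assumes the simplest options: per-time-step concatenation of visual and audio feature vectors, followed by dot-product attention against a query vector standing in for a decoder state. The function names (`fuse_features`, `attend`), the feature values, and the dimensions are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(visual, audio):
    """Assumed fusion: concatenate visual and audio vectors per time step."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio must cover the same time steps")
    return [v + a for v, a in zip(visual, audio)]


def attend(fused, query):
    """Assumed attention: dot-product scores, softmax weights, weighted sum."""
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in fused]
    weights = softmax(scores)
    dim = len(fused[0])
    context = [sum(w * f[d] for w, f in zip(weights, fused)) for d in range(dim)]
    return weights, context


# Toy inputs: 3 time steps, 2-d visual features, 1-d audio features.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio = [[0.2], [0.9], [0.4]]
query = [0.0, 1.0, 1.0]  # hypothetical decoder state

fused = fuse_features(visual, audio)
weights, context = attend(fused, query)
```

In a captioning model of this general kind, the `context` vector would be recomputed at each decoding step and fed to the language decoder, so that each generated word can draw on the time steps most relevant to it across both modalities.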