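Closed captioning
Computer science
Feature (linguistics)
Frame (networking)
Video quality
Semantic gap
Benchmark (surveying)
Semantics (computer science)
Natural language processing
Artificial intelligence
Task (project management)
Natural language
Multimedia
Information retrieval
Image (mathematics)
Image retrieval
Telecommunications
Metric (unit)
Philosophy
Linguistics
Operations management
Management
Geodesy
Economics
Programming language
Geography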
Authors
Xuemei Luo,Xiaotong Luo,Di Wang,Jinhui Liu,Bo Wan,Lin Zhao
Identifier
DOI:10.1016/j.patcog.2023.109906
Abstract
Video captioning, a hot research topic in multimedia processing, aims to briefly describe the content of a video in accurate and fluent natural language. As a bridge between video and natural language, it is a challenging task that requires a deep understanding of video content and effective use of a video's diverse multimodal information. Existing video captioning methods usually ignore the relative importance of different frames when aggregating frame-level video features, and they neglect the global semantic correlations between videos and texts when learning visual representations, which makes the learned representations less effective. To address these problems, we propose a novel framework, the Global Semantic Enhancement Network (GSEN), to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame-level video features, highlighting the features of informative frames in the video representation. In addition, a global semantic enhancement module is proposed to strengthen the semantic correlations between video and language representations so that the generated captions are semantically more accurate. Extensive qualitative and quantitative experiments on two public benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed GSEN outperforms state-of-the-art methods.
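The abstract does not specify the form of the lightweight attention mechanism, so the following is only a minimal sketch of one common way to weight frames during aggregation: additive attention pooling. The function name `attention_pool`, the projection `w`, the scoring vector `v`, and the toy frame vectors are all hypothetical stand-ins chosen for illustration; they are not taken from GSEN.

```python
import math


def attention_pool(frames, w, v):
    """Aggregate T frame vectors (each of length D) into one video vector.

    frames: list of T feature vectors, each a list of D floats.
    w:      D x H projection, given as a list of D rows (assumed, stands in
            for a learned parameter).
    v:      length-H scoring vector (assumed, stands in for a learned parameter).
    Returns (video_vector, weights), where weights sum to 1 over the frames.
    """
    scores = []
    for f in frames:
        # Project the frame and squash it, then score it against v.
        hidden = [
            math.tanh(sum(f[d] * w[d][h] for d in range(len(f))))
            for h in range(len(v))
        ]
        scores.append(sum(hj * vj for hj, vj in zip(hidden, v)))

    # Numerically stable softmax over the per-frame scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frame features: higher-scoring frames contribute more.
    dim = len(frames[0])
    video = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return video, weights


# Toy input: three 2-dimensional "frame features" (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0], [1.0]]  # D=2, H=1
v = [1.0]

video, weights = attention_pool(frames, w, v)

# With a zero scoring vector every frame gets the same weight,
# so the pooling reduces to plain mean pooling.
mean_video, uniform = attention_pool(frames, w, [0.0])
```

With these toy parameters the third frame receives the largest weight, whereas a zero scoring vector recovers plain mean pooling, which is the baseline that ignores the relative importance of frames.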