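Computer science
Artificial intelligence
Closed captioning
Leverage (statistics)
Granularity
Graph
Encoder
Video quality
Natural language processing
Social media
Automatic summarization
Machine learning
Information retrieval
Metric (unit)
World Wide Web
Theoretical computer science
Economics
Image (mathematics)
Operating system
Operations management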
Authors
Ping Li,Pan Zhang,Xianghua Xu
Identifier
DOI:10.1016/j.neucom.2020.12.137
Abstract
Video as an information carrier has gained overwhelming popularity in city surveillance and on social networks such as WeChat, Weibo, and TikTok. To bridge the semantic gap between video content (e.g., user and landmark building) and textual information (e.g., user location), video captioning has emerged as an attractive technique in recent years. Existing works mostly focus on sentence-level Part-of-Speech (POS) information and use Long Short-Term Memory (LSTM) as the encoder; they therefore neglect word- and phrase-level POS information and fail to consider long-range temporal relations among video frames globally. To address these drawbacks, we leverage multi-granularity POS guidance to learn a Graph Convolutional Network (GCN) via meta-learning, abbreviated as GMMP (GCN Meta-learning with Multi-granularity POS), for generating high-quality captions for videos. The model captures temporal dependency by treating frames as nodes in a graph, and captures POS information of words and phrases through a multi-granularity POS attention mechanism. We adopt meta-learning to better learn the GCN by simultaneously maximizing the reward of the generated caption in a reinforcement task and the probability of the ground-truth caption in a supervised task. Experiments on several benchmark data sets verify the advantages of our GMMP model.
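To make the "frames as nodes" idea concrete, the following is a minimal sketch of one generic graph-convolution step over per-frame features. It is not the GMMP encoder: the abstract does not specify how edges are built or weighted, so the fully connected graph, the symmetric normalization, the toy 2-dimensional features, and the function names `normalized_adjacency`, `matmul`, and `gcn_layer` are all illustrative assumptions.

```python
import math

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square, non-negative adjacency matrix."""
    n = len(adj)
    # Add self-loops so every frame also keeps its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(frame_feats, adj, weight):
    """One graph-convolution step: ReLU(A_norm @ H @ W)."""
    a_norm = normalized_adjacency(adj)
    out = matmul(matmul(a_norm, frame_feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy input (assumed): 3 frames with 2-d features on a fully connected graph,
# so every frame can exchange information with every other frame in one step.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0.0, 1.0, 1.0],
       [1.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix

updated = gcn_layer(frames, adj, identity)
```

Because every frame is connected to every other frame, a single step already mixes information from distant frames, whereas a recurrent encoder such as an LSTM has to pass it along the sequence one step at a time. In this toy setting with an identity weight, each updated frame feature is simply the mean of all frame features.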