Computer science
Artificial intelligence
Pyramid (geometry)
Graph
Visual word
Computer vision
Frame (networking)
Image retrieval
Redundancy (engineering)
Feature (linguistics)
Pattern recognition (psychology)
Information retrieval
Image (mathematics)
Theoretical computer science
Telecommunications
Linguistics
Philosophy
Physics
Optics
Operating system
Authors
Guoping Zhao, Mingyu Zhang, Yaxian Li, Jiajun Liu, Bingqing Zhang, Ji-Rong Wen
Identifier
DOI:10.1016/j.ipm.2020.102488
Abstract
Conventional video retrieval methods commonly aggregate the visual feature representations from every frame into a feature for the whole video, treating each frame as an isolated, static image. Such methods lack the power to model intra-frame and inter-frame relationships among local regions, and are often vulnerable to the visual redundancy and noise caused by various types of video transformation and editing, such as adding image patches, adding banners, etc. From the perspective of video retrieval, a video's key information is more often than not conveyed by geometrically centered, dynamic visual content, while static areas tend to reside in regions farther from the center and to exhibit heavy temporal visual redundancy. This phenomenon has hardly been investigated by conventional retrieval methods. In this article, we propose an unsupervised video retrieval method that simultaneously models intra-frame and inter-frame contextual information for video representation with a graph topology constructed on top of pyramid regional feature maps. By decomposing each frame into a pyramid regional sub-graph and thereby transforming a video into a regional graph, we use graph convolutional networks to extract features that incorporate information from multiple types of context. Our method is unsupervised and uses only the frame features extracted by a pre-trained network. We have conducted extensive experiments and demonstrated that the proposed method outperforms state-of-the-art video retrieval methods.
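To make the pipeline concrete, here is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: each frame's feature map is pooled into pyramid region nodes, intra-frame and inter-frame edges form one video-level graph, and a plain GCN layer (in the style of Kipf and Welling) yields a video descriptor for retrieval. The pyramid levels, the edge rules, the `GCNLayer`, and the mean-pooling aggregation are illustrative assumptions, not the authors' exact design.

```python
# Hypothetical sketch of "pyramid regional graph + GCN" video features.
# Assumptions (not from the paper): 1x1 and 2x2 pyramid levels, fully
# connected intra-frame edges, like-indexed inter-frame edges, one GCN layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

PYRAMID = [1, 2]  # assumed pyramid grids: 1x1 and 2x2 -> 5 region nodes/frame

def frame_to_region_nodes(fmap: torch.Tensor) -> torch.Tensor:
    """Pool a (C, H, W) frame feature map into pyramid region vectors (N, C)."""
    nodes = []
    for g in PYRAMID:
        pooled = F.adaptive_avg_pool2d(fmap, g)   # (C, g, g)
        nodes.append(pooled.flatten(1).t())       # (g*g, C)
    return torch.cat(nodes, dim=0)

def build_adjacency(num_frames: int, regions_per_frame: int) -> torch.Tensor:
    """Intra-frame: fully connect the regions of one frame.
    Inter-frame: connect the same region index in adjacent frames."""
    n = num_frames * regions_per_frame
    A = torch.zeros(n, n)
    for f in range(num_frames):
        s = f * regions_per_frame
        A[s:s + regions_per_frame, s:s + regions_per_frame] = 1.0  # intra-frame
        if f + 1 < num_frames:                                      # inter-frame
            for r in range(regions_per_frame):
                i, j = s + r, s + regions_per_frame + r
                A[i, j] = A[j, i] = 1.0
    A.fill_diagonal_(1.0)                  # self-loops (idempotent here)
    d = A.sum(1).rsqrt()                   # D^{-1/2}
    return d[:, None] * A * d[None, :]     # symmetrically normalized adjacency

class GCNLayer(nn.Module):
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.lin = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return F.relu(A_hat @ self.lin(H))  # H' = ReLU(A_hat · H · W)

# Toy usage: random 256x7x7 maps stand in for features from a pre-trained
# backbone; the output is a single descriptor for the whole video.
frames = [torch.randn(256, 7, 7) for _ in range(8)]
H = torch.cat([frame_to_region_nodes(f) for f in frames])  # (8*5, 256)
A_hat = build_adjacency(8, sum(g * g for g in PYRAMID))
gcn = GCNLayer(256, 128)
video_descriptor = gcn(A_hat, H).mean(dim=0)               # (128,) for retrieval
```

Fully connecting the regions within a frame and linking like-indexed regions across adjacent frames is just one simple way to encode the intra-frame and inter-frame context the abstract refers to; richer edge schemes (e.g., parent-child links across pyramid levels) would slot into `build_adjacency` unchanged.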