Closed captioning
Computer science
Feature (linguistics)
Artificial intelligence
Dependency (UML)
Image (mathematics)
Speech recognition
Natural language processing
Linguistics
Philosophy
Authors
Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li
Source
Journal: Neurocomputing
[Elsevier]
Date: 2022-07-16
Volume/pages: 508: 293-304
Cited by: 278
Identifier
DOI: 10.1016/j.neucom.2022.07.028
Abstract
Video clip retrieval and captioning tasks play an essential role in multimodal research and are fundamental problems for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies on 1) whether image features are enough for video-text retrieval and captioning, 2) how post-pretraining on a large-scale video-text dataset based on CLIP affects performance, 3) what the practical mechanism is for modeling temporal dependency between video frames, and 4) the hyper-parameter sensitivity of the model. Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves SOTA results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for both multimodal understanding and generation tasks.
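The abstract describes transferring the image-text CLIP model to video-text retrieval by encoding sampled frames and aggregating them before matching against caption embeddings. The sketch below is only an illustrative approximation of the simplest, parameter-free mean-pooling variant, not the authors' released code: it assumes the OpenAI clip package is installed, that frames have already been sampled from the clip as PIL images, and the helper names encode_video and encode_text are hypothetical.

# Illustrative sketch of a mean-pooling video-text similarity, in the spirit of CLIP4Clip.
# Assumption: the OpenAI `clip` package (pip install git+https://github.com/openai/CLIP).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_video(frames):
    # frames: list of PIL images sampled from one video clip
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)              # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # normalize each frame feature
    return feats.mean(dim=0)                           # parameter-free mean pooling over frames

def encode_text(captions):
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)              # (num_captions, dim)
    return feats / feats.norm(dim=-1, keepdim=True)

# Usage (hypothetical captions): rank captions for one video by cosine similarity.
# video_emb = encode_video(sampled_frames)
# text_emb = encode_text(["a man is cooking", "a dog runs on the beach"])
# scores = text_emb @ (video_emb / video_emb.norm())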