Convolutional Reconstruction-to-Sequence for Video Captioning
Authors
Aming Wu, Yahong Han, Yi Yang, Qinghua Hu, Fei Wu
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology [Institute of Electrical and Electronics Engineers]; Date: 2020-11-01; Volume/Issue: 30 (11): 4299-4308; Cited by: 6
Identifier
DOI: 10.1109/tcsvt.2019.2956593
Abstract
Recent advances in video captioning mainly follow an encoder-decoder (sequence-to-sequence) framework and generate captions via a recurrent neural network (RNN). However, employing an RNN as the decoder (generator) tends to dilute long-term information, which weakens its ability to capture long-term dependencies. Recently, some work has demonstrated that convolutional neural networks (CNNs) can also model sequential information. Despite its strengths in representation ability and computational efficiency, the CNN has not been well exploited in video captioning, partly because of the difficulty of modeling multi-modal sequences with CNNs. In this paper, we devise a novel CNN-based encoder-decoder framework for video captioning. Specifically, we first append inter-frame differences to each CNN-extracted frame feature to obtain a more discriminative representation; then, taking this as input, we encode each frame into a more compact feature via a one-layer convolutional mapping, which can be viewed as a reconstruction network. In the decoding stage, we first fuse visual and lexical features, then stack multiple dilated convolutional layers to form a hierarchical decoder. Because long-term dependencies can be captured along a shorter path through the hierarchical structure, the decoder alleviates the loss of long-term information. Experiments on two benchmark datasets show that our method achieves state-of-the-art performance.
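To make the two ideas sketched in the abstract concrete, the following is a minimal PyTorch sketch of (a) appending inter-frame differences to CNN frame features and (b) a hierarchical decoder built from stacked causal dilated 1-D convolutions. The module names, feature dimension, layer count, and kernel size are illustrative assumptions and do not reproduce the authors' exact configuration.

```python
import torch
import torch.nn as nn


def append_frame_differences(frame_feats):
    # frame_feats: (batch, T, D) CNN-extracted frame features.
    # Returns (batch, T, 2D) with inter-frame differences appended
    # (the first frame gets a zero difference).
    diffs = frame_feats[:, 1:] - frame_feats[:, :-1]
    diffs = torch.cat([torch.zeros_like(frame_feats[:, :1]), diffs], dim=1)
    return torch.cat([frame_feats, diffs], dim=-1)


class HierarchicalDilatedDecoder(nn.Module):
    """Illustrative stack of causal dilated 1-D convolutions over a fused
    visual-lexical sequence; the dilation doubles per layer, so the
    receptive field grows exponentially and distant positions are reached
    via a short path (assumed configuration, not the paper's exact one)."""

    def __init__(self, dim=512, num_layers=4, kernel_size=3):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size,
                      padding=(kernel_size - 1) * (2 ** i),
                      dilation=2 ** i)
            for i in range(num_layers)
        ])

    def forward(self, x):
        # x: (batch, dim, T) fused visual + lexical features.
        for conv in self.convs:
            y = conv(x)[:, :, :x.size(2)]  # trim the right side -> causal
            x = torch.relu(y)
        return x


# Usage sketch: fuse (hypothetically, by projection and addition) visual and
# word-embedding features into a (batch, dim, T) tensor, then decode.
decoder = HierarchicalDilatedDecoder(dim=512)
fused = torch.randn(2, 512, 20)          # placeholder fused sequence
hidden = decoder(fused)                  # (2, 512, 20) hierarchical features
```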