Journal: Studies in Computational Intelligence. Date: 2021-01-01. Pages: 353-363
Identifier
DOI:10.1007/978-3-030-68291-0_28
Abstract
Wildlife videos often exhibit complex dynamics, and generating captions for wildlife clips involves both natural language processing and computer vision. Current video-captioning techniques have shown encouraging results; however, they derive captions from video frames alone, ignoring audio information. In this paper we propose to generate natural-language video captions using both audio and visual information. We employ deep neural networks combining convolutional and recurrent components. Experimental results on a corpus of wildlife clips show that fusing audio information substantially improves the effectiveness of video description. These results are achieved using convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
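The abstract describes fusing audio and visual features and encoding the result with a recurrent network. The paper does not give implementation details, so the following is only an illustrative sketch: it assumes a simple late-fusion scheme (per-timestep concatenation of hypothetical CNN frame features and audio features) followed by a plain Elman-style RNN encoder, with all dimensions and weights invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
T = 4        # number of timesteps (video frames / audio windows)
D_VIS = 8    # visual feature size (stand-in for a CNN frame encoding)
D_AUD = 3    # audio feature size
H = 6        # recurrent hidden size

def fuse(visual, audio):
    """Late fusion: concatenate visual and audio features per timestep."""
    return np.concatenate([visual, audio], axis=-1)  # shape (T, D_VIS + D_AUD)

def rnn_encode(x, W_xh, W_hh, b_h):
    """Plain (Elman) RNN over the fused sequence; returns the final hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in x:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

# Random stand-ins for extracted features and learned weights.
visual_feats = rng.normal(size=(T, D_VIS))
audio_feats = rng.normal(size=(T, D_AUD))
W_xh = rng.normal(size=(H, D_VIS + D_AUD)) * 0.1
W_hh = rng.normal(size=(H, H)) * 0.1
b_h = np.zeros(H)

fused = fuse(visual_feats, audio_feats)
h_final = rnn_encode(fused, W_xh, W_hh, b_h)
print(fused.shape, h_final.shape)  # (4, 11) (6,)
```

In a full captioning system, a state like `h_final` would condition an RNN decoder that emits the caption word by word; here only the fusion and encoding steps are shown.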