LCSNet: End-to-end Lipreading with Channel-aware Feature Selection

Keywords: computer science, speech recognition, pronunciation, artificial intelligence, feature (linguistics), task (project management), connectionism, pattern recognition (psychology), artificial neural network, channel (broadcasting), process (computing), decoding methods, computer networks, linguistics, telecommunications, operating systems, philosophy, economics, management
Authors
Feng Xue, Tian Yang, Kang Liu, Zikun Hong, Mingwei Cao, Dan Guo, Richang Hong
Source
Journal: ACM Transactions on Multimedia Computing, Communications, and Applications [Association for Computing Machinery]
Volume/Issue: 19 (1s): 1-21; Cited by: 5
Identifier
DOI: 10.1145/3524620
Abstract

Lipreading is the task of decoding the movement of a speaker's lip region into text. In recent years, lipreading methods based on deep neural networks have attracted widespread attention, and their accuracy has far surpassed that of experienced human lipreaders. The visual differences between some phonemes are extremely subtle and pose a great challenge to lipreading. Most existing lipreading methods do not further process the extracted visual features, which leads to two main problems. First, the extracted features contain a lot of useless information, such as noise caused by differences in speaking speed and lip shape. Second, the extracted features are not abstract enough to distinguish phonemes with similar pronunciations. Both problems degrade lipreading performance. To extract features from the lip region that are more discriminative and more relevant to the speech content, this article proposes an end-to-end deep-neural-network-based lipreading model (LCSNet). The proposed model extracts short-term spatio-temporal features and motion trajectory features from the lip region in video clips. The extracted features are filtered by a channel attention module to eliminate useless components and then passed to the proposed Selective Feature Fusion Module (SFFM) to extract high-level abstract features. These features are then fed in temporal order to a bidirectional GRU network for temporal modeling, yielding long-term spatio-temporal features. Finally, a Connectionist Temporal Classification (CTC) decoder generates the output text. Experimental results show that the proposed model achieves a 1.0% CER and a 2.3% WER on the GRID corpus, which represents relative improvements of 52% and 47%, respectively, over LipNet.
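The abstract outlines a multi-stage architecture: two feature streams (short-term spatio-temporal and motion trajectory), channel attention, the SFFM fusion block, a bidirectional GRU, and CTC decoding. The PyTorch sketch below illustrates how such a pipeline could fit together; all module internals, layer sizes, and names (ChannelAttention, SelectiveFeatureFusion, LCSNetSketch) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pipeline described in the abstract, written in PyTorch.
# All layer sizes, kernel shapes, and the internals of the attention and fusion
# blocks are assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style gate that re-weights feature channels."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, T, C)
        weights = self.fc(x.mean(dim=1))       # pool over time -> (B, C)
        return x * weights.unsqueeze(1)        # suppress uninformative channels


class SelectiveFeatureFusion(nn.Module):
    """Fuses two feature streams with a learned per-channel selection gate."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, appearance, motion):     # both: (B, T, C)
        g = self.gate(torch.cat([appearance, motion], dim=-1))
        return g * appearance + (1.0 - g) * motion


class LCSNetSketch(nn.Module):
    def __init__(self, channels=512, vocab_size=28):
        super().__init__()
        # Stand-in front-end: a 3D conv stem for short-term spatio-temporal features.
        self.frontend = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),    # keep the time axis, pool space
        )
        self.proj = nn.Linear(64, channels)
        self.attn_app = ChannelAttention(channels)     # filters the appearance stream
        self.attn_motion = ChannelAttention(channels)  # filters the motion-trajectory stream
        self.sffm = SelectiveFeatureFusion(channels)
        self.gru = nn.GRU(channels, 256, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(512, vocab_size)   # vocabulary includes the CTC blank

    def forward(self, frames, motion_feats):
        # frames: (B, 3, T, H, W) lip-region clip; motion_feats: (B, T, C) trajectory stream.
        x = self.frontend(frames).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        x = self.attn_app(self.proj(x))                    # channel-aware feature selection
        x = self.sffm(x, self.attn_motion(motion_feats))   # fuse the two filtered streams
        x, _ = self.gru(x)                                 # long-term temporal modelling
        return self.classifier(x).log_softmax(dim=-1)      # per-frame log-probs for CTC
```

For training, the per-frame log-probabilities can be permuted to (T, B, V) and passed to torch.nn.CTCLoss with the target transcripts; at inference, greedy or beam-search CTC decoding over the same distributions produces the output text.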