Computer science
Speech recognition
Speaker recognition
Speaker diarization
Identity (music)
Artificial intelligence
Feature (linguistics)
Encoder
Feature vector
Pattern recognition (psychology)
Acoustics
Linguistics
Operating system
Physics
Philosophy
Authors
Longbin Lu, Xuebin Xu, Jun Fu
Identifier
DOI: 10.1117/1.jei.31.3.033045
Abstract
Lipreading aims to decode speech content from a moving mouth. It is a very challenging task because lip appearance variations and speech contents are coupled together in the subtle movements of the lip region. In the speaker-independent recognition scenario in particular, training and testing data differ substantially in distribution owing to diverse speaker identities, so the learned model generalizes poorly at test time. We propose a Siamese decoupling lipreading network (SDLipNet) to address this problem. Specifically, we exploit an encoder–decoder framework to establish a collaborative representation of speaker identities and speech contents, and utilize the identity-specific information to regularize the content feature space. The identity features are derived from a Siamese identity encoder trained with paired visual speech data from different speakers. In addition, we align the content representation with a prior Gaussian distribution by imposing a Kullback–Leibler divergence constraint between the two outputs of the Siamese content encoder. In this way, the learned content feature space is expected to be universal to the target speaker domain. Extensive experiments on two lipreading benchmarks demonstrate that our proposed SDLipNet achieves better performance in the speaker-independent recognition task than state-of-the-art lipreading methods.
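The abstract's Gaussian-prior alignment is commonly implemented as the closed-form KL divergence between a diagonal Gaussian and the standard normal. A minimal NumPy sketch of that term follows; the function name and feature shapes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over feature dimensions and averaged over the batch.
    Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)."""
    kl_per_dim = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var)
    return kl_per_dim.sum(axis=-1).mean()

# A content feature that already matches the prior incurs zero penalty
# (batch of 4 samples, 16-dimensional features; shapes are assumptions):
mu = np.zeros((4, 16))
log_var = np.zeros((4, 16))  # log(1) = 0, i.e. unit variance
print(kl_to_standard_normal(mu, log_var))  # → 0.0
```

Minimizing this term pulls the content encoder's outputs toward a speaker-agnostic prior, which is the mechanism the abstract relies on for generalization to unseen speakers.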