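Speech recognition
Speaker recognition
Mel-frequency cepstrum
Computer science
Discriminative model
Feature extraction
Feature (linguistics)
Pattern recognition (psychology)
Artificial intelligence
Speaker diarization
Linguistics
Philosophy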
Abstract
The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. In other words, classification can be at most as accurate as the features. Several feature extraction methods have been proposed and successfully exploited in the speaker recognition task. However, these methods are almost exclusively adopted directly from the speech recognition task. This is somewhat ironic, considering the opposite nature of the two tasks. In speech recognition, speaker variability is one of the major error sources, whereas in speaker recognition it is the information that we wish to extract. The mel-frequency cepstral coefficients (MFCC) are the most evident example of a feature set that is extensively used in speaker recognition but was originally developed for speech recognition purposes. When an MFCC front-end is used in a speaker recognition system, one makes an implicit assumption that the human hearing mechanism is the optimal speaker recognizer. However, this has not been confirmed, and in fact opposite results exist. Although several methods adopted from speech recognition have been shown to work well in practice, they are often used as “black boxes” with fixed parameters. It is not understood what kind of information the features capture from the speech signal. Understanding the features at some level requires experience from specific areas such as speech physiology, acoustic phonetics, digital signal processing and statistical pattern recognition.
The author’s general impression of the literature is that, at present, we are at best guessing which code in the signal carries our individuality. This thesis has two main purposes. On the one hand, we attempt to see feature extraction as a whole, starting from understanding the speech production process and what is known about speaker individuality, and then going