Keywords
Computer science; Modality (human-computer interaction); Speech recognition; Feature (linguistics); Audiovisual; Fusion; Artificial intelligence; Multimedia; Linguistics; Philosophy
Authors
Yicong Jiang, Youjun Chen, Tianzi Wang, Zengrui Jin, Xurong Xie, Hui Chen, Xunying Liu, Feng Tian
Identifier
DOI: 10.1109/iscslp63861.2024.10800618
Abstract
Dysarthria, a speech disorder resulting from neurological conditions, presents significant obstacles to speech intelligibility and daily communication. Automatic dysarthria assessment can provide low-cost diagnostic and treatment support for diseases such as Parkinson's disease, Alzheimer's disease, and stroke. This study investigates the efficacy of cross-modality feature fusion using audio-visual data for the automatic assessment of dysarthric speech. Leveraging the advanced self-supervised learning models AV-HuBERT and Wav2Vec 2.0, we develop a multimodal system to enhance dysarthria severity classification. Using the Mandarin Subacute Stroke Dysarthria Multimodal (MSDM) dataset, which includes synchronized audio and lip-movement video recordings, our system achieves promising performance. Experimental results demonstrate that both our back-end fusion and feature fusion approaches outperform traditional single-modality methods: the best back-end fusion system achieves a speaker-level F1 score of 0.841, while the best feature-level fusion system achieves 0.772. This study marks the first application of pre-trained self-supervised learning models to multimodal dysarthria assessment, highlighting their potential to assist diagnosis and treatment.
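The abstract names two fusion strategies but gives no implementation details. Below is a minimal sketch, not the authors' code, of what these two strategies typically look like: feature-level fusion concatenates modality embeddings before a joint classifier, while back-end (score-level) fusion combines the class posteriors of two independently trained classifiers. Pre-extracted utterance-level embeddings from Wav2Vec 2.0 (audio) and AV-HuBERT (lip video) are assumed; the embedding dimensions, the MLP classifier, the number of severity classes, and the interpolation weight `alpha` are all illustrative assumptions.

```python
# Illustrative sketch of the two fusion strategies described in the abstract.
# Assumes utterance-level embeddings already extracted from Wav2Vec 2.0
# (audio) and AV-HuBERT (lip video); all sizes below are assumptions.
import torch
import torch.nn as nn

AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 768, 768, 4  # assumed dimensions/classes


class FeatureFusionClassifier(nn.Module):
    """Feature-level fusion: concatenate the two modality embeddings and
    classify severity with a single joint MLP (architecture is assumed)."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(AUDIO_DIM + VISUAL_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, NUM_CLASSES),
        )

    def forward(self, audio_emb, visual_emb):
        # Concatenate along the feature axis, then classify jointly.
        return self.mlp(torch.cat([audio_emb, visual_emb], dim=-1))


def backend_fusion(audio_logits, visual_logits, alpha=0.5):
    """Back-end (score-level) fusion: each modality has its own classifier;
    only their class posteriors are combined, here by linear interpolation
    (alpha=0.5 is an assumed, untuned weight)."""
    audio_p = torch.softmax(audio_logits, dim=-1)
    visual_p = torch.softmax(visual_logits, dim=-1)
    return alpha * audio_p + (1 - alpha) * visual_p


if __name__ == "__main__":
    a = torch.randn(8, AUDIO_DIM)   # batch of 8 audio embeddings
    v = torch.randn(8, VISUAL_DIM)  # matching lip-video embeddings
    print(FeatureFusionClassifier()(a, v).shape)                    # [8, 4]
    print(backend_fusion(torch.randn(8, 4), torch.randn(8, 4)).shape)  # [8, 4]
```

The design trade-off the sketch exposes: feature-level fusion lets the classifier learn cross-modal interactions directly, whereas back-end fusion keeps the two modality pipelines fully independent and combines only their scores. The abstract's results (speaker-level F1 of 0.841 for back-end fusion vs. 0.772 for feature-level fusion) show the two strategies can perform quite differently on this task.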