Computer science
Embedding
Sentence
Natural language processing
Depression
Artificial intelligence
Speech recognition
Authors
Junqi Xue, Ruihan Qin, Xinxu Zhou, Honghai Liu, Min Zhang, Zhiguo Zhang
Identifier
DOI:10.1109/icassp48485.2024.10446253
Abstract
Automatic depression detection based on audio and text representations from participants' interviews has attracted widespread attention. However, most previous studies used only one type of feature from a single modality for depression detection, so the rich information in the interview audio and text has not been fully exploited. Moreover, an effective multi-modal fusion approach that leverages the independence between audio and text representations is still lacking. To address these problems, we propose a multi-modal fusion depression detection model based on the interaction of multi-level audio features and text sentence embeddings. Specifically, we first extract Low-Level Descriptors (LLDs), mel-spectrogram features, and wav2vec features from the audio. We then design a Multi-level Audio Features Interaction Module (MAFIM) to fuse these three levels of features into a comprehensive audio representation. For the interview text, we use pre-trained BERT to extract sentence-level embeddings. Further, to effectively fuse the audio and text representations, we design a Channel Attention-based Multi-modal Fusion Module (CAMFM) that accounts for both the independence and the correlation between the two modalities. Our proposed model outperforms existing methods on two datasets, DAIC-WOZ and EATD-Corpus, and thus shows strong potential for interview-based depression detection in practice.
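As a rough illustration of the channel-attention-based fusion idea described in the abstract, the following is a minimal PyTorch sketch, assuming the multi-level audio features and the BERT sentence embedding have already been pooled and projected to vectors of equal size. The class and parameter names (ChannelAttentionFusionSketch, dim, reduction) are illustrative assumptions and not the paper's actual CAMFM implementation.

```python
import torch
import torch.nn as nn


class ChannelAttentionFusionSketch(nn.Module):
    """Hypothetical sketch of channel-attention fusion of one audio
    vector and one text vector; not the paper's actual CAMFM code."""

    def __init__(self, dim: int = 256, reduction: int = 4):
        super().__init__()
        # Squeeze-and-excitation style gate over the two "channels"
        # (audio, text) stacked along a channel axis.
        self.gate = nn.Sequential(
            nn.Linear(2, 2 * reduction),
            nn.ReLU(),
            nn.Linear(2 * reduction, 2),
            nn.Sigmoid(),
        )
        # Binary head: depressed vs. non-depressed.
        self.classifier = nn.Linear(2 * dim, 2)

    def forward(self, audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb, text_emb: (batch, dim)
        x = torch.stack([audio_emb, text_emb], dim=1)    # (batch, 2, dim)
        squeezed = x.mean(dim=-1)                        # (batch, 2)
        weights = self.gate(squeezed).unsqueeze(-1)      # (batch, 2, 1)
        fused = (x * weights).flatten(start_dim=1)       # (batch, 2*dim)
        return self.classifier(fused)


if __name__ == "__main__":
    model = ChannelAttentionFusionSketch(dim=256)
    audio = torch.randn(4, 256)   # e.g. pooled multi-level audio features
    text = torch.randn(4, 256)    # e.g. projected BERT sentence embedding
    logits = model(audio, text)
    print(logits.shape)           # torch.Size([4, 2])
```

The per-channel gate lets the model re-weight the audio and text representations before fusion, which is one simple way to respect their independence while still learning a joint decision.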