Keywords: Dialogue, Computer Science, Human-Computer Interaction, Semantics (computer science), Natural Language Processing, Speech Synthesis, Dialogue Systems, Artificial Intelligence, Linguistics
Authors
Yuchen Liu, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma, Qin Jin
Identifier
DOI: 10.1145/3581783.3613823
Abstract
Conversational Text-to-speech Synthesis (TTS) aims to generate speech with the proper style in a user-agent conversation scenario. Although previous works have explored modeling the context of the dialogue history to provide style information for the agent, they remain deficient in modeling the role-aware multi-modal context. Moreover, previous works ignore the emotional dependencies between the user and the agent, which include: 1) the agent understanding the emotional states of the user, and 2) the agent expressing proper emotion in the generated speech. In this work, we propose an Emotionally Situated Text-to-speech Synthesis (EmoSit-TTS) framework that understands the user's semantics and subtle emotional states, and generates speech with the proper speaking style and emotional expression in user-agent conversation. Experiments on the DailyTalk dataset show the superiority of our proposed framework for user-agent conversational TTS, especially in terms of emotion-aware expressiveness, where it outperforms other state-of-the-art methods by 0.69 MOS. Demos of our proposed framework are available at https://anonydemo.github.io.