Transformer
Computer science
Speech recognition
Architecture
Grid
Artificial intelligence
Engineering
Voltage
Electrical engineering
Art
Geometry
Mathematics
Visual arts
Authors
Karthik Ramesh, Xing Chao, Wupeng Wang, Dong Wang, Chunxia Xiao
Identifiers
DOI:10.1109/icassp39728.2021.9414053
Abstract
The transformer architecture has shown great capability in learning long-term dependencies and performs well across multiple domains. However, transformers have received little attention in audio-visual speech enhancement (AVSE) research, partly due to the convention of treating speech enhancement as a short-time signal processing task. In this paper, we challenge this common belief and show that an audio-visual transformer can significantly improve AVSE performance by learning long-term dependencies both within and across modalities. We test this new transformer-based AVSE model on the GRID and AVSpeech datasets and show that it beats several state-of-the-art models by a large margin.
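The inter-modality dependency the abstract refers to is typically modeled with cross-modal attention, where frames of one modality attend over frames of the other. The following is a minimal sketch of scaled dot-product cross-attention from audio frames to visual frames, not the paper's actual architecture; the function names, dimensions, and toy inputs are illustrative assumptions.

```python
import math

def softmax(scores):
    """Row-wise softmax over a list of score rows."""
    out = []
    for row in scores:
        m = max(row)  # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def cross_modal_attention(audio, visual):
    """Each audio frame attends over all visual frames (hypothetical sketch).

    audio:  list of Ta frame embeddings, each of length d
    visual: list of Tv frame embeddings, each of length d
    Returns Ta fused embeddings (convex combinations of visual frames).
    """
    d = len(audio[0])
    scale = math.sqrt(d)  # scaled dot-product, as in standard attention
    scores = [[sum(a * v for a, v in zip(af, vf)) / scale for vf in visual]
              for af in audio]
    weights = softmax(scores)  # (Ta, Tv) attention weights, rows sum to 1
    fused = [[sum(w * vf[k] for w, vf in zip(wrow, visual)) for k in range(d)]
             for wrow in weights]
    return fused

# Toy example: 3 audio frames attend over 2 visual frames (d = 2).
audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
visual = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_modal_attention(audio, visual)
```

Because attention weights in each row sum to 1, every fused audio frame is a convex combination of the visual frames; a full model would apply such attention in both directions (audio→visual and visual→audio) alongside self-attention within each modality.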