End-to-end principle
Computer science
Speech translation
Speech recognition
Artificial intelligence
Machine translation
Natural language processing
Authors
Xiaohu Zhao, Haoran Sun, Yikun Lei, Deyi Xiong
Identifier
DOI:10.1016/j.eswa.2024.123241
Abstract
The cross-attention mechanism enables the Transformer to capture correspondences between the input and output. However, in end-to-end (E2E) speech-to-text translation (ST), the learned cross-attention weights often fail to match actual alignments, since speech and text must be aligned across different modalities and languages. In this paper, we present a simple yet effective method called regularized cross-attention learning (RCAL) for end-to-end speech translation in a multitask learning (MTL) framework. RCAL leverages knowledge from auxiliary automatic speech recognition (ASR) and machine translation (MT) tasks to generate a teacher cross-attention matrix, which serves as prior alignment knowledge to enhance cross-attention learning within the ST task. An additional loss function is introduced as part of the MTL framework to facilitate this process. We conducted experiments on the MuST-C benchmark dataset to evaluate the effectiveness of RCAL. The results demonstrate that the proposed approach yields significant improvements over the baseline, with an average gain of +0.8 BLEU across four translation directions in two experimental settings, outperforming state-of-the-art E2E and cascaded speech translation models. Further analysis and visualization reveal that the model with RCAL effectively learns high-quality alignment information from the auxiliary ASR and MT tasks, thereby improving ST alignment quality. Moreover, experiments with different sizes of MT and ST data provide strong evidence of our model's robustness in various scenarios.
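The abstract does not specify the exact form of the additional loss. A common choice for regularizing attention toward a teacher alignment is a KL-divergence penalty between the teacher matrix and the student's cross-attention distributions; the sketch below illustrates that idea (the function name and the KL formulation are assumptions for illustration, not the paper's confirmed method).

```python
import math

def attention_regularization_loss(student_attn, teacher_attn, eps=1e-9):
    """Hypothetical RCAL-style penalty: KL(teacher || student),
    averaged over target positions.

    Each argument is a (tgt_len x src_len) matrix given as a list of
    rows, where each row is a probability distribution over source
    positions (non-negative, sums to 1).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        # Sum over source positions: t * (log t - log s); eps avoids log(0).
        total += sum(
            t * (math.log(max(t, eps)) - math.log(max(s, eps)))
            for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_attn)

# When the student matches the teacher exactly, the penalty vanishes;
# as the student's attention drifts from the teacher alignment, it grows.
teacher = [[0.7, 0.3], [0.2, 0.8]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(attention_regularization_loss(teacher, teacher))  # ~0.0
print(attention_regularization_loss(uniform, teacher))  # > 0
```

In training, this term would be added to the ST cross-entropy loss (typically with a weighting coefficient) alongside the auxiliary ASR and MT objectives.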