Computer science
Machine translation
Natural language processing
Artificial intelligence
Translation (biology)
BLEU
Synthetic data
Training set
Parallel corpus
Machine learning
Speech recognition
Biochemistry
Chemistry
Messenger RNA
Gene
Authors
Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez
Identifier
DOI:10.1109/tpami.2023.3333949
Abstract
When the amount of parallel sentences available to train a neural machine translation system is scarce, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the available parallel data. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may deteriorate translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found that this simple approach consistently improves translation performance compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus; the resulting systems are much more robust against domain shift and produce fewer hallucinations.
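The core idea in the abstract, treating non-fluent synthetic targets as if they belonged to a separate language within a multilingual framework, is commonly realized by prepending a target-language token to each source sentence. Below is a minimal illustrative sketch of that data-preparation step; the token names (`<2trg>`, `<2synth>`) and the function are hypothetical, not taken from the paper.

```python
def tag_training_samples(real_pairs, synthetic_pairs,
                         real_token="<2trg>", synth_token="<2synth>"):
    """Prepend a pseudo-language token to each source sentence.

    real_pairs and synthetic_pairs are lists of (source, target) tuples.
    Synthetic samples, whose targets may be non-fluent, are tagged with a
    distinct token so the multilingual model treats them as another
    'target language' rather than as noisy samples of the real one.
    """
    tagged = [(f"{real_token} {src}", trg) for src, trg in real_pairs]
    tagged += [(f"{synth_token} {src}", trg) for src, trg in synthetic_pairs]
    return tagged

# Toy example: one authentic pair plus one non-fluent synthetic pair.
pairs = tag_training_samples(
    [("la casa azul", "the blue house")],
    [("la casa azul", "the house blue")],  # non-fluent synthetic target
)
```

At inference time only the real-language token would be used, so the synthetic "language" serves purely as an auxiliary training signal.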