Computer science
Ranking (information retrieval)
Machine translation
Machine learning
Artificial intelligence
Machine translation evaluation
Quality (philosophy)
Annotation
Natural language processing
Set (abstract data type)
Data mining
Machine translation software usability
Philosophy
Epistemology
Example-based machine translation
Programming language
Source
Journal: Natural Language Engineering
Publisher: Cambridge University Press
Date: 2019-09-11
Volume & issue: 26 (2): 137-161
Citations: 62
Identifier
DOI: 10.1017/s1351324919000469
Abstract
This article presents the most up-to-date and influential automated, semiautomated and human metrics used to evaluate the quality of machine translation (MT) output, and provides the necessary background for MT evaluation projects. Evaluation is, as is widely acknowledged, highly relevant for the improvement of MT. The article is divided into three parts: the first is dedicated to automated metrics; the second to human metrics; and the last to the challenges that neural machine translation (NMT) poses for evaluation. The first part covers reference translation–based metrics; confidence or quality estimation (QE) metrics, which serve as alternatives for quality assessment; and diagnostic evaluation based on linguistic checkpoints. Human evaluation metrics are classified according to whether human judges directly express a so-called subjective evaluation judgment, such as ‘good’ or ‘better than’, or not, as is the case in error classification. The former methods are based on directly expressed judgment (DEJ) and are therefore called ‘DEJ-based evaluation methods’, while the latter are called ‘non-DEJ-based evaluation methods’. The DEJ-based evaluation section presents tasks such as fluency and adequacy annotation, ranking and direct assessment (DA), whereas the non-DEJ-based section details tasks such as error classification and postediting, with definitions and guidelines, making this article a useful guide for evaluation projects. Following the detailed presentation of these metrics, the specificities of NMT are set forth along with suggestions for its evaluation, according to the latest studies. As human translators are the most adequate judges of translation quality, emphasis is placed on human metrics seen from a translator-judge perspective, to provide useful methodological tools for interdisciplinary research groups that evaluate MT systems.
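To make the abstract's first category concrete, the sketch below computes a smoothed sentence-level BLEU score, the canonical reference translation–based metric: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. This follows the standard BLEU formulation (Papineni et al. 2002) with add-one smoothing; the whitespace tokenization, single reference and smoothing choice are simplifying assumptions for illustration, not an implementation prescribed by the article.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # Add-one smoothing so one missing n-gram order does not
        # zero out the whole score (an assumption for this sketch).
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat is on the mat"))
```

On the human side, the abstract mentions direct assessment (DA), in which judges rate segments on a 0–100 scale. A common practice in DA campaigns is to standardize each judge's raw scores so that judges with different internal scales become comparable; the snippet below sketches that per-judge z-scoring with invented values, as an illustration of typical practice rather than the article's own procedure.

```python
from statistics import mean, stdev

# Hypothetical raw DA scores (0-100) from two judges with different scales.
raw = {"judge_a": [70, 80, 90], "judge_b": [20, 40, 60]}
# Standardize within each judge: subtract the judge's mean, divide by
# the judge's standard deviation.
z_scores = {judge: [(s - mean(scores)) / stdev(scores) for s in scores]
            for judge, scores in raw.items()}
print(z_scores)  # both judges map to [-1.0, 0.0, 1.0]
```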