Computer Science
Question Answering
Benchmark (surveying)
Software
Examination (biology)
Information Retrieval
Artificial Intelligence
Data Mining
Software Engineering
Programming Language
Paleontology
Geodesy
Biology
Geography
Authors
Xiaoyuan Xie, Shuo Jin, Songqiang Chen
Source
Journal: Research Square
Date: 2022-04-20
Identifier
DOI: 10.21203/rs.3.rs-1563040/v1
Abstract
Question Answering (QA) is an attractive yet challenging area in the NLP community. As QA techniques have developed, a great deal of QA software has entered daily human life, offering convenient access to information retrieval. To evaluate the performance of QA software, many benchmark datasets have been constructed to provide diverse test cases. However, current QA software is mainly tested under a reference-based paradigm, in which the expected outputs (labels) of test cases must be annotated with substantial human effort before testing. As a result, neither just-in-time testing during usage nor extensible testing on massive unlabeled real-life data is feasible, which keeps current QA software testing from being flexible and sufficient. In this work, we propose a novel testing method, QAAskeR+, with five new Metamorphic Relations for QA software. QAAskeR+ does not rely on the annotated labels of test cases. Instead, based on the idea that a correct answer should imply a piece of reliable knowledge that always conforms with any other correct answer, QAAskeR+ tests QA software by inspecting its behavior on multiple recursively asked questions that relate to the same, or further enriched, knowledge. Experimental results show that QAAskeR+ can reveal numerous violations indicating actual answering issues in various mainstream QA software, without using any pre-annotated labels.
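The core idea above — checking a QA system's self-consistency across recursively asked questions instead of comparing against annotated labels — can be sketched in a few lines. The toy QA system and the single relation below are illustrative assumptions for this sketch only; they are not QAAskeR+'s actual five Metamorphic Relations.

```python
# Hypothetical sketch of label-free metamorphic testing for QA software.
# Stand-in QA system: answers factoid questions from a tiny fact table.
FACTS = {
    "Who wrote Hamlet": "Shakespeare",
    "Where was Shakespeare born": "Stratford-upon-Avon",
}

def toy_qa(question: str) -> str:
    """Return the answer to a question, or 'unknown' when none is found."""
    return FACTS.get(question.rstrip("?"), "unknown")

def recursive_ask_check(qa, first_question: str, follow_up_template: str):
    """Ask a question, then feed its answer into a follow-up question.

    No ground-truth label is needed: a violation is flagged when the
    follow-up question built on the system's own answer yields no
    consistent answer, hinting that one of the two answers is wrong.
    """
    first_answer = qa(first_question)
    follow_up = follow_up_template.format(first_answer)
    second_answer = qa(follow_up)
    violation = second_answer == "unknown"
    return first_answer, second_answer, violation

a1, a2, bad = recursive_ask_check(
    toy_qa, "Who wrote Hamlet?", "Where was {} born?"
)
```

Here the second question is derived entirely from the system's own first answer, so the check runs on unlabeled inputs; a contradiction or failed follow-up signals an answering issue without any human-annotated expected output.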