Computer science
Language model
Domain (mathematical analysis)
Domain engineering
Domain knowledge
Artificial intelligence
Natural language processing
Task (project management)
Software
Domain model
Software development
Programming language
Software construction
Economics
Management
Mathematical analysis
Mathematics
Authors
Ding, Ruiqing; Han, Xiao; Wang, Leye
Source
Journal: Cornell University - arXiv
Date: 2022-12-10
Identifier
DOI:10.48550/arxiv.2212.05251
Abstract
Natural Language Processing (NLP) is one of the core techniques in AI software. As AI is applied to more and more domains, how to efficiently develop high-quality domain-specific language models becomes a critical question in AI software engineering. Existing domain-specific language model development processes mostly focus on learning a domain-specific pre-trained language model (PLM); when training the domain task-specific language model based on the PLM, only a direct (and often unsatisfactory) fine-tuning strategy is commonly adopted. By enhancing the task-specific training procedure with domain knowledge graphs, we propose KnowledgeDA, a unified and low-code domain language model development service. Given domain-specific task texts input by a user, KnowledgeDA can automatically generate a domain-specific language model following three steps: (i) localize domain knowledge entities in texts via an embedding-similarity approach; (ii) generate augmented samples by retrieving replaceable domain entity pairs from two views of both the knowledge graph and the training data; (iii) select high-quality augmented samples for fine-tuning via confidence-based assessment. We implement a prototype of KnowledgeDA to learn language models for two domains, healthcare and software development. Experiments on five domain-specific NLP tasks verify the effectiveness and generalizability of KnowledgeDA. (Code is publicly available at https://github.com/RuiqingDing/KnowledgeDA.)
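The three-step pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (which is in the linked repository): the toy word vectors, the tiny knowledge graph of same-type entity pairs, and the `confidence_fn` stub standing in for a task model's predicted probability are all assumptions introduced for illustration.

```python
# Hypothetical sketch of a KnowledgeDA-style augmentation pipeline:
# (i) entity localization by embedding similarity, (ii) KG-based entity
# replacement, (iii) confidence-based selection of augmented samples.
# Toy vectors and KG below are made up for illustration only.
import math

# Assumed toy embeddings (real system would use a PLM's embeddings).
EMBEDDINGS = {
    "aspirin":   [1.0, 0.0],
    "ibuprofen": [0.9, 0.1],
    "take":      [0.0, 1.0],
    "daily":     [0.1, 0.9],
}

# Assumed toy knowledge graph: entity -> same-type replaceable entities.
KG_SAME_TYPE = {"aspirin": ["ibuprofen"]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def localize_entities(tokens, entity_vocab, threshold=0.9):
    """Step (i): match tokens to KG entities via embedding similarity."""
    hits = []
    for i, tok in enumerate(tokens):
        for ent in entity_vocab:
            if tok in EMBEDDINGS and ent in EMBEDDINGS \
                    and cosine(EMBEDDINGS[tok], EMBEDDINGS[ent]) >= threshold:
                hits.append((i, ent))
    return hits

def augment(tokens, hits, kg):
    """Step (ii): replace localized entities with same-type KG neighbors."""
    out = []
    for i, ent in hits:
        for repl in kg.get(ent, []):
            new = list(tokens)
            new[i] = repl
            out.append(new)
    return out

def select_confident(samples, confidence_fn, tau=0.8):
    """Step (iii): keep samples the task model scores confidently."""
    return [s for s in samples if confidence_fn(s) >= tau]

# Usage on a toy healthcare-style sentence.
tokens = ["take", "aspirin", "daily"]
hits = localize_entities(tokens, ["aspirin"])          # [(1, "aspirin")]
candidates = augment(tokens, hits, KG_SAME_TYPE)       # swap in "ibuprofen"
kept = select_confident(candidates, lambda s: 0.9)     # stub confidence
```

In the real service, step (iii)'s confidence would come from the fine-tuned task model's predictions rather than a constant, so low-quality substitutions are filtered out before further fine-tuning.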