Fine-tuning Large Language Models for Chemical Text Mining

Computer science; Natural language processing
Authors
Wei Zhang, Qinggong Wang, Xiangtai Kong, Jiacheng Xiong, Shengkun Ni, Duanhua Cao, Buying Niu, Mingan Chen, Yameng Li, Runze Zhang, Yitian Wang, Lehan Zhang, Xutong Li, Zhaoping Xiong, Qian Shi, Ziming Huang, Zunyun Fu, Mingyue Zheng
Source
Journal: Chemical Science [The Royal Society of Chemistry]
Volume (issue): 15 (27): 10600-10611 Citations: 9
Identifier
DOI: 10.1039/d4sc00924j
Abstract

Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 show competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
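
The abstract reports "exact accuracy" without defining it here. One common reading is exact-match accuracy: a prediction counts only if the whole extracted output equals the annotated reference. The sketch below illustrates that reading. The function name `exact_match_accuracy` and the whitespace normalization are assumptions made for illustration, not the authors' evaluation code.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference.

    Assumption: strings are compared after collapsing runs of whitespace;
    the paper's actual normalization rules are not given in the abstract.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        # Collapse all whitespace runs to single spaces and trim the ends.
        return " ".join(text.split())

    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical action-sequence outputs, invented for this example.
predicted = ["ADD water; STIR", "FILTER"]
annotated = ["ADD water;  STIR", "FILTER; DRY"]
score = exact_match_accuracy(predicted, annotated)  # 0.5
```

Under this metric a partially correct extraction scores zero, which is why it is stricter than token-level precision or recall.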