CoRTEx: contrastive learning for representing terms via explanations with applications on constructing biomedical knowledge graphs

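Unified Medical Language System; Computer science; Cluster analysis; Artificial intelligence; Natural language processing; Leverage (statistics); Ontology; Machine learning; Set (abstract data type); Open Biomedical Ontologies; Information retrieval; Domain knowledge; Upper ontology; Philosophy; Epistemology; Programming language; Suggested Upper Merged Ontology
Authors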
Huaiyuan Ying,Zhengyun Zhao,Yang Zhao,Sihang Zeng,Sheng Yu
Source
Journal: Journal of the American Medical Informatics Association [Oxford University Press]
Volume (issue): 31 (9): 1912-1920 Citations: 2
Identifier
DOI: 10.1093/jamia/ocae115
Abstract

Objectives: Biomedical knowledge graphs play a pivotal role in various biomedical research domains. Term clustering, which aims to identify synonymous terms, is a crucial step in constructing these knowledge graphs. Due to a lack of knowledge, previous contrastive learning models trained with Unified Medical Language System (UMLS) synonyms struggle to cluster difficult terms and do not generalize well beyond UMLS terms. In this work, we leverage the world knowledge from large language models (LLMs) and propose Contrastive Learning for Representing Terms via Explanations (CoRTEx) to enhance term representation and significantly improve term clustering.

Materials and Methods: The model training involves generating explanations for a cleaned subset of UMLS terms using ChatGPT. We employ contrastive learning, considering term and explanation embeddings simultaneously, and progressively introduce hard negative samples. Additionally, a ChatGPT-assisted BIRCH algorithm is designed for efficient clustering of a new ontology.

Results: We established a clustering test set and a hard negative test set, on both of which our model consistently achieves the highest F1 score. With CoRTEx embeddings and the modified BIRCH algorithm, we grouped 35 580 932 terms from the Biomedical Informatics Ontology System (BIOS) into 22 104 559 clusters with O(N) queries to ChatGPT. Case studies highlight the model’s efficacy in handling challenging samples, aided by information from explanations.

Conclusion: By aligning terms to their explanations, CoRTEx demonstrates superior accuracy over benchmark models and robustness beyond its training set, and it is suitable for clustering terms for large-scale biomedical ontologies.
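
The abstract does not give the training objective, so the following is only a generic illustration of what aligning term embeddings with explanation embeddings through contrastive learning can look like. It is a minimal sketch of a standard InfoNCE-style loss with in-batch negatives, not the authors' actual loss. The function names, the temperature value, and the toy vectors are assumptions made up for this example, and the progressive hard-negative sampling mentioned in the abstract is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(term_embs, expl_embs, temperature=0.05):
    """Average InfoNCE loss over a batch (hypothetical helper, not from the paper).

    Term i is paired with explanation i as its positive; every other
    explanation in the batch serves as a negative.
    """
    terms = [normalize(v) for v in term_embs]
    expls = [normalize(v) for v in expl_embs]
    total = 0.0
    for i, t in enumerate(terms):
        # Temperature-scaled cosine similarity of term i to every explanation.
        logits = [sum(a * b for a, b in zip(t, e)) / temperature for e in expls]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching explanation.
        total += log_denom - logits[i]
    return total / len(terms)


# Toy 2-dimensional embeddings, invented purely for illustration.
aligned_terms = [[1.0, 0.0], [0.0, 1.0]]
aligned_expls = [[0.9, 0.1], [0.1, 0.9]]   # each explanation near its own term
swapped_expls = [[0.1, 0.9], [0.9, 0.1]]   # each explanation near the wrong term

loss_aligned = info_nce_loss(aligned_terms, aligned_expls)
loss_swapped = info_nce_loss(aligned_terms, swapped_expls)
print(loss_aligned < loss_swapped)
```

In this toy batch the loss is lower when each term sits close to its own explanation than when the pairs are swapped, which is the behavior a contrastive objective of this kind rewards.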
