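UniProt
Computer science
Gene ontology
Function (biology)
Ontology
Vocabulary
Set (abstract data type)
Interpretation (philosophy)
Natural language processing
Term (time)
Artificial intelligence
Information retrieval
Linguistics
Programming language
Gene
Biology
Genetics
Philosophy
Gene expression
Physics
Epistemology
Quantum mechanics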
Authors
Swagarika Jaharlal Giri,Nabil Ibtehaz,Daisuke Kihara
Identifier
DOI:10.1038/s41540-024-00358-0
Abstract
Understanding the biological functions of proteins is of fundamental importance in modern biology. Protein function is frequently represented with Gene Ontology (GO), a controlled vocabulary, because computer programs can handle it easily without open-ended text interpretation. In particular, the majority of current protein function prediction methods rely on GO terms. However, the extensive list of GO terms that describes a protein's function can be difficult for biologists to interpret. In response to this issue, we developed GO2Sum (Gene Ontology terms Summarizer), a model that takes a set of GO terms as input and generates a human-readable summary using the T5 large language model. GO2Sum was developed by fine-tuning T5 on GO term assignments and free-text function descriptions for UniProt entries, enabling it to recreate function descriptions by concatenating GO term descriptions. Our results demonstrated that GO2Sum significantly outperforms the original T5 model that was trained on the entire web corpus in generating Function, Subunit Structure, and Pathway paragraphs for UniProt entries.
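The input step described in the abstract (turning a set of GO terms into text that a sequence-to-sequence model can summarize) might look roughly like the sketch below. It is only an illustration: the function name `build_summarizer_input`, the `summarize <section>:` prefix, and the `; ` separator are assumptions made here, not the format GO2Sum actually uses, and the fine-tuned T5 model itself is not shown.

```python
from typing import Iterable, Mapping


def build_summarizer_input(
    go_ids: Iterable[str],
    go_names: Mapping[str, str],
    section: str = "Function",
) -> str:
    """Join GO term names into one text prompt (hypothetical format)."""
    seen = set()
    names = []
    for go_id in go_ids:
        # Skip repeated annotations so each term appears once, in input order.
        if go_id in seen:
            continue
        seen.add(go_id)
        # Fall back to the raw identifier when no name is available.
        names.append(go_names.get(go_id, go_id))
    # The task prefix and separator are assumptions made for this sketch.
    return f"summarize {section}: " + "; ".join(names)


# A small hand-written lookup table; a real one would be loaded from a GO release.
GO_NAMES = {
    "GO:0005524": "ATP binding",
    "GO:0004672": "protein kinase activity",
    "GO:0006468": "protein phosphorylation",
}

prompt = build_summarizer_input(
    ["GO:0005524", "GO:0004672", "GO:0006468", "GO:0005524"], GO_NAMES
)
print(prompt)
```

A string built this way would then be tokenized and passed to a summarization model; the `section` argument mirrors the three paragraph types named in the abstract (Function, Subunit Structure, Pathway).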