计算机科学
人工智能
自然语言处理
计算生物学
生物
作者
Rafael Josip Penić,Tin Vlašić,Roland G. Huber,Yue Wan,Mile Šikić
出处
期刊:Cornell University - arXiv
日期:2024-02-29
被引量:2
标识
DOI:10.48550/arxiv.2403.00043
摘要
Ribonucleic acid (RNA) plays a variety of crucial roles in fundamental biological processes. Recently, RNA has become an interesting drug target, emphasizing the need to improve our understanding of its structures and functions. Over the years, sequencing technologies have produced an enormous amount of unlabeled RNA data, which hides important knowledge and potential. Motivated by the successes of protein language models, we introduce RiboNucleic Acid Language Model (RiNALMo) to help unveil the hidden code of RNA. RiNALMo is the largest RNA language model to date with $650$ million parameters pre-trained on $36$ million non-coding RNA sequences from several available databases. RiNALMo is able to extract hidden knowledge and capture the underlying structure information implicitly embedded within the RNA sequences. RiNALMo achieves state-of-the-art results on several downstream tasks. Notably, we show that its generalization capabilities can overcome the inability of other deep learning methods for secondary structure prediction to generalize on unseen RNA families. The code has been made publicly available on https://github.com/lbcb-sci/RiNALMo.
科研通智能强力驱动
Strongly Powered by AbleSci AI