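Generative grammar
Computer science
Natural language processing
Generative model
Linguistics
Language model
Artificial intelligence
Psychology
Philosophy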
Authors
Haiyang Bian, Yixin Chen, Xiaomin Dong, Chen Li, Minsheng Hao, Sijie Chen, Jinyi Hu, Maosong Sun, Lei Wei, Xuegong Zhang
Identifier
DOI:10.1101/2024.01.25.577152
Abstract
Gene expression could be perceived as a form of cell language, with underlying regulatory mechanisms akin to biological grammar. Decoding this “language” is critical to understanding cellular functions and behaviors, but presents significant challenges. Several works have attempted to learn the biological language by pre-training large foundation models on single-cell transcriptomic data, inspired by the success of large language models in natural language processing. In this study, we further enrich the pre-training paradigm by integrating an abundance of metadata and a multiplicity of pre-training tasks, and obtain scMulan, a multitask generative pre-trained language model tailored for single-cell analysis. We represent a cell as a structured cell sentence (c-sentence) by encoding its gene expression, metadata terms, and target tasks as words of tuples, each consisting of entities and their corresponding values. We construct a unified generative framework to model the cell language on c-sentences and design three pre-training tasks to bridge the microscopic and macroscopic information within the c-sentences. We pre-train scMulan, which has 368 million parameters, on 10 million single-cell transcriptomes and their corresponding metadata. As a single model, scMulan can perform cell type annotation, batch integration, and conditional cell generation zero-shot, guided by different task prompts. scMulan can also be extended to novel tasks through fine-tuning. We have evaluated the effectiveness of scMulan on multiple downstream tasks. As a foundation model, scMulan is pre-trained to capture both the microscopic regulations and macroscopic patterns of gene expression, positioning it as a multifunctional and easily expandable tool for comprehensive single-cell analysis.
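To make the c-sentence idea concrete, here is a minimal Python sketch of how a cell might be flattened into a sequence of (entity, value) words. It is an illustration under assumptions, not scMulan's actual encoding: the function name `build_c_sentence`, the `<task:...>` and `<field:term>` token formats, the word ordering, and the choice to drop zero-expression genes are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A "word" is an (entity, value) tuple. Task and metadata words carry no
# numeric value in this sketch; gene words carry an expression value.
Word = Tuple[str, Optional[float]]


def build_c_sentence(
    expression: Dict[str, float],
    metadata: Dict[str, str],
    task: str,
) -> List[Word]:
    """Flatten one cell into a list of (entity, value) words.

    Hypothetical layout: task prompt first, then metadata terms,
    then expressed genes. The real model may order and tokenize
    these differently.
    """
    sentence: List[Word] = [(f"<task:{task}>", None)]

    # Metadata terms (e.g. organ, cell type) become value-less words.
    for field, term in sorted(metadata.items()):
        sentence.append((f"<{field}:{term}>", None))

    # Keep only expressed genes (an assumption made for brevity).
    for gene, value in sorted(expression.items()):
        if value > 0:
            sentence.append((gene, value))

    return sentence


cell = build_c_sentence(
    expression={"CD3E": 2.0, "MS4A1": 0.0, "CD8A": 1.0},
    metadata={"organ": "blood"},
    task="cell_type_annotation",
)
for word in cell:
    print(word)
```

In a layout like this, changing the task word at the start of the sentence is what would let one generative model be prompted for different tasks, which is the role the abstract assigns to task prompts.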