TCfinder: Robust tumor cell discrimination in scRNA‐seq based on gene pathway activity

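Identification (biology), Annotation, Computational biology, Transcriptome, Stromal cell, Cancer cell, Cell, Cluster analysis, Cell type, Computer science, Cancer, Gene, Biology, Artificial intelligence, Gene expression, Cancer research, Genetics, Botany
Authors
Chenxu Wu, Ning Wei, Tao Wu, Jing Chen, Huizi Yao, Ziyu Tao, Xiangyu Zhao, Kaixuan Diao, Jinyu Wang, Weiliang Wang, Xinxing Li, Qianqian Song, Xuesong Liu
Identifier
DOI:10.1002/imo2.22
Abstract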

TCfinder is a tumor cell identification tool based on pathway activity and a deep neural network (DNN). Across scRNA-seq datasets from different platforms, TCfinder demonstrates robust identification efficiency. It outperforms existing tumor cell identification tools and performs well on sparse data. TCfinder is freely available as an R package at: https://github.com/XSLiuLab/TCfinder.
Traditional RNA sequencing (RNA-seq) of bulk tumors captures only the average gene expression [1] and therefore cannot precisely delineate the tumor microenvironment and infiltrating cell states. In contrast, single-cell RNA sequencing (scRNA-seq) provides more in-depth insights into the cellular ecosystem, including the interrogation of specific cell populations and their respective transcriptomic characteristics [2], intratumoral heterogeneity [3], tumor evolution maps [4], and tumor phylogenetic trees [5], and it has been widely applied in cancer research. One of the major challenges for cancer tissue-related scRNA-seq analysis is to accurately and quickly discriminate tumor cells from normal stromal cells [6]. Many cell-type annotation methods have been constructed for scRNA-seq analysis [7, 8]; however, few tools are available for tumor versus normal cell discrimination.
Currently, there are two common strategies for distinguishing tumor cells from normal cells in scRNA-seq data. One strategy involves clustering and manual annotation [9]. Although manual annotation can effectively distinguish tumor cells from normal cells in single-cell data, it is still limited by the sparsity of single-cell data sets and relies heavily on professional experience and knowledge, making it difficult to scale up. The other strategy involves automatic annotation, which includes methods based on marker genes, such as ikarus [10], SCINA [11], and scMRMA [12], and methods based on copy number variation (CNV) inference [13].
However, this automatic strategy is problematic: extensive dropout in scRNA-seq [14] results in insufficient expression of marker genes in most cancer cells. In addition, CNV is not universally prevalent in tumor cells, and some normal cells can carry CNV [15], which limits the widespread application of CNV-based methods. Therefore, it is of great importance for single-cell-based tumor research to develop a widely applicable automatic annotation method that can overcome the sparsity of single-cell data.
Since a typical gene pathway usually has dozens of genes, pathway-based expression quantification overcomes the data sparseness faced by traditional marker gene-based annotation methods. Additionally, alterations in gene pathways are one of the primary differences between cancer cells and normal cells [16]. Therefore, characterizing gene pathways has great potential for accurately distinguishing between cancerous and normal cells. Herein, we developed TCfinder (Tumor Cell finder) based on pathway activity and a deep neural network. TCfinder not only presents robust performance in simulated scRNA-seq data with random gene inactivation but also shows improved tumor versus normal cell discrimination precision and accuracy in multiple cancer types compared with existing methods.
We first analyzed the distribution of gene counts in tumor cells and normal cells (Figure S1). The number of genes measured in tumor cells is higher than in normal cells, which may be the result of more vigorous metabolism and growth of tumor cells. We collected all human gene pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and scored the activity of each pathway (see Methods for details) to obtain a single-cell pathway score matrix. In the training data set, we performed Wilcoxon tests on the pathway scores between tumor cells and normal cells, retaining pathways with p values <0.05. Ultimately, we identified 213 pathways to be used in TCfinder (Table S1).
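The pathway screening step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the actual scoring procedure is defined in their Methods, so the mean-expression score used here is an assumption, and the names `pathway_scores`, `rank_sum_p`, and `select_pathways` are hypothetical. The test is a two-sided Wilcoxon rank-sum test using a normal approximation with tie correction and no continuity correction.

```python
import math
from statistics import mean


def pathway_scores(expr, pathways):
    """Score each pathway in each cell.

    expr: {cell: {gene: expression}}; pathways: {name: [genes]}.
    ASSUMPTION: the score is the mean expression of member genes,
    with undetected genes counted as 0. The paper's own scoring may differ.
    """
    return {
        cell: {name: mean(genes_expr.get(g, 0.0) for g in genes)
               for name, genes in pathways.items()}
        for cell, genes_expr in expr.items()
    }


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p value (normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u1 - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def select_pathways(scores, labels, alpha=0.05):
    """Keep pathways whose scores differ between tumor and normal cells."""
    names = list(next(iter(scores.values())).keys())
    kept = []
    for name in names:
        tumor = [s[name] for c, s in scores.items() if labels[c] == "tumor"]
        normal = [s[name] for c, s in scores.items() if labels[c] == "normal"]
        if rank_sum_p(tumor, normal) < alpha:
            kept.append(name)
    return kept
```

Averaging over dozens of member genes is what makes a pathway score tolerant of dropout: a single undetected gene shifts the mean only slightly, whereas it can remove a marker gene signal entirely.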
In TCfinder, we used a fully connected neural network architecture to develop a model for discriminating between tumor and normal cells (Figure 1A). To develop a model broadly applicable at the pan-cancer level, we collected over 70,000 cells from six types of cancer as training data. These data sets were randomly divided into training and test sets in an 8:2 ratio. Additionally, we used 10 data sets comprising over 230,000 cells as independent validation data for the model (Table S2).
The performance of TCfinder and existing methods, including ikarus, SCINA, copykat, and scMRMA, was compared using independent data sets. TCfinder obtained an average F1 score of 0.98 on the 10X platform data sets and 0.95 on the SMART-Seq2 platform data sets (Table S3). The other four methods showed poorer performance, as reflected by lower F1 score, accuracy, and precision compared with TCfinder (Figures 1B and S2).
scRNA-seq data sets with professionally annotated tumor versus normal status are quite limited. One possible strategy to address this issue is to use healthy tissue samples as normal samples and tumor cell line samples as tumor samples. To determine the actual false-positive and false-negative rates for tumor cell classification, we used single-cell data from healthy individuals in the GSE162616 data set and tumor cell lines in the GSE140440 data set for comparative testing. TCfinder exhibited fewer false positives and false negatives than existing methods (Figures 1C and S3).
To further demonstrate the robustness and applicability of our method, we validated the model performance by randomly retaining different numbers of genes or randomly inactivating different proportions of pathways (Figure S4A). TCfinder showed robust performance and significantly outperformed other methods across different numbers of retained genes (Figure S4B). This result demonstrates the robust performance of gene pathway expression-based methods compared with marker gene-based methods.
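The evaluation protocol above relies on a random 8:2 split and on precision, accuracy, and F1 score. A minimal sketch of both follows, assuming cells are labeled with the strings "tumor" and "normal"; the function names `split_80_20` and `binary_metrics` are hypothetical, and the neural network itself is out of scope here.

```python
import random


def split_80_20(items, seed=0):
    """Randomly divide items into training and test sets in an 8:2 ratio."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def binary_metrics(y_true, y_pred, positive="tumor"):
    """Precision, recall, F1, and accuracy with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

F1 is the harmonic mean of precision and recall, so it penalizes a classifier that labels nearly every cell as tumor (high recall, low precision) as well as one that misses most tumor cells.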
In simulations of different proportions of pathway inactivation, the model's F1 score remained above 0.8 when 60% of the pathways were randomly inactivated (Figure S4C). In addition, models retaining the top four contributing pathways (see next section for details) show significantly higher performance compared with models that do not contain these four pathways (Figure S4D). This further illustrates the important role of these four pathways in TCfinder.
We used the limma package to perform differential analysis between the pathway scores of tumor and normal cells; pathways with |log2FoldChange| >1 and false discovery rate <0.05 are shown in Figure 1D. The results indicated that pathways active in tumor cells were primarily related to metabolism, such as oxidative phosphorylation, reactive oxygen species, glutathione metabolism, and glycolysis/gluconeogenesis. Consistent with the Warburg effect, glycolysis/gluconeogenesis has long been known to be more active in tumors [17]. Conversely, pathways suppressed in tumor cells were mainly related to immune functions, including antigen processing and presentation, Th1 and Th2 cell differentiation, and natural killer cell-mediated cytotoxicity. These findings illustrate that, from a single-cell perspective, tumor cells exhibit a dual characteristic of metabolic hyperactivity and immune suppression.
To further investigate which pathways contribute the most to the discrimination of tumor cells, we used the GSE148673 data set: we randomly shuffled the scores of each pathway and calculated the resulting change in model loss, which reflects the importance of that pathway. After 100 randomizations, we found that the four pathways contributing most to tumor cell identification were type I diabetes mellitus, oxidative phosphorylation, viral myocarditis, and antigen processing and presentation (Figure 1E). Their pathway scores present strong differential patterns between tumor and normal cells in single-cell data sets (GSE148673).
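The shuffling procedure described above is a form of permutation importance. The sketch below illustrates the idea under stated assumptions: `toy_predict` is a hypothetical stand-in for the trained network, binary cross-entropy is assumed as the loss, and the function names are illustrative rather than part of the TCfinder package.

```python
import math
import random


def log_loss(y_true, probs):
    """Mean binary cross-entropy; y_true holds 0/1, probs holds P(tumor)."""
    eps = 1e-12
    total = 0.0
    for t, p in zip(y_true, probs):
        total += t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
    return -total / len(y_true)


def permutation_importance(predict, X, y, n_repeats=100, seed=0):
    """Mean increase in loss after shuffling each feature column.

    predict: function mapping one row (list of pathway scores) to P(tumor).
    X: list of rows (lists); y: list of 0/1 labels.
    """
    rng = random.Random(seed)
    base = log_loss(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            loss = log_loss(y, [predict(row) for row in X_perm])
            deltas.append(loss - base)
        importances.append(sum(deltas) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical model that uses only the first pathway score."""
    return 1.0 / (1.0 + math.exp(-4.0 * row[0]))
```

A pathway that the model ignores receives an importance of zero, because shuffling it leaves every prediction unchanged, while a pathway the model depends on yields a positive loss increase.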
To verify whether these pathways were also prominent in bulk tumors, we collected The Cancer Genome Atlas (TCGA) data. Intriguingly, except for the oxidative phosphorylation pathway, the other three pathways, which are related to autoimmunity and antigen presentation, do not present higher pathway scores in tumor samples, indicating that bulk expression cannot reveal the underlying tumor cell expression differences (Figure S5). We further examined the shared antigen presentation (AP)-related genes in these contributing pathways, which include MHC I and MHC II genes (Figure S6). These AP-related gene differences emerged in single tumor cells rather than in bulk tissue (Figure S7), suggesting that some critical tumor cell-related information is obscured in bulk tissue RNA-Seq.
We developed TCfinder, a new gene pathway expression score-based deep learning method to accurately and rapidly discriminate cancer cells from normal cells in scRNA-seq. We distinguish cancer cells from normal cells from the perspective of gene pathways and compute a single score for each gene pathway, so that even when only a small number of genes are detected in a single cell, the score still reflects the activity of the entire pathway, which overcomes the sparsity problem of single-cell data. Based on multiple independent data sets and simulation data, TCfinder shows improved performance over existing methods. In traditional RNA-Seq of bulk tissue, the expression profile of tumor cells cannot be accurately determined because bulk cancer tissue contains a large number of cell types. In scRNA-Seq, TCfinder identified the oxidative phosphorylation and antigen presentation pathways as the important contributors to tumor versus normal cell discrimination, and tumor cells have higher oxidative phosphorylation and lower antigen presentation gene expression compared with normal cells.
Downregulation of the antigen presentation pathway has been reported to contribute to tumor immune escape and immunotherapy nonresponsiveness [18, 19]. Interestingly, this difference in single cells is not fully recapitulated in bulk tissues: in TCGA bulk tissues, tumor tissues do not have decreased antigen presentation pathway gene expression compared with surrounding normal tissues. This may be because normal cells in bulk tissue mask the true expression of tumor cells. By uncovering potentially hidden information in bulk tumor tissues, TCfinder could be used in clinical settings to identify tumor cells and tailor personalized treatment plans based on the clinical characteristics of these tumor cells.
Although TCfinder has demonstrated its ability to overcome data sparsity in multiple single-cell data sets and has shown promising performance, it still has some limitations. One of the main limitations is the small number of annotated single-cell data sets used for training the model. Many cancer tissue-related single-cell studies have been reported; however, few data sets have professionally annotated cancer versus normal cell status information. In general, despite its limitations, TCfinder represents a significant improvement over existing methods in addressing the issue of data sparsity in single-cell annotation. Its success in this regard may also provide useful insights for annotating other cell types.
TCfinder is the first tool to distinguish tumor cells from normal cells in single-cell data from the perspective of gene pathway expression quantification. TCfinder performs well in single-cell data sets prepared using both 10× and SMART-Seq2, with prediction accuracy exceeding 0.95.
Interestingly, the antigen presentation pathway was identified as the key pathway that distinguishes tumor cells from normal cells, and this antigen presentation pathway expression difference is not recapitulated in bulk tissue RNA-seq, suggesting that traditional bulk tissue RNA sequencing conceals the true information of a large number of cell states.
Chenxu Wu: Writing – original draft; writing – review and editing; visualization; validation; methodology; software; formal analysis; project administration; data curation. Wei Ning: Investigation; methodology; validation; visualization; software. Tao Wu: Investigation; writing – review and editing. Jing Chen: Data curation. Huizi Yao: Writing – review and editing; data curation. Ziyu Tao: Data curation. Xiangyu Zhao: Data curation. Kaixuan Diao: Data curation. Jinyu Wang: Data curation. Weiliang Wang: Supervision. Xinxing Li: Supervision; writing – review and editing; project administration. Qianqian Song: Writing – review and editing. Xue-Song Liu: Writing – review and editing; conceptualization; data curation; supervision; funding acquisition; resources; project administration.
The authors thank ShanghaiTech University High Performance Computing Public Service Platform for computing services. The authors thank the multi-omics facility and the molecular and cell biology core facility of ShanghaiTech University for technical help. This work was supported by the Shanghai Science and Technology Commission (No. 21ZR1442400), the National Natural Science Foundation of China (No. 31771373), the cross-disciplinary research fund of Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, and startup funding from ShanghaiTech University.
The authors declare no conflict of interest.
No animals or humans were involved in this study.
Only publicly available data were used in this study, and data sources and handling of these data are described in the Materials and Methods and in Table S2.
TCfinder is freely available as an R package at: https://github.com/XSLiuLab/TCfinder. All code required to reproduce the results reported in this manuscript is freely available at: https://github.com/XSLiuLab/TCfinder/tree/master/inst/analysis. Supplementary materials (results, methods, figures, tables, graphical abstract, slides, videos, Chinese translated version, and updated materials) may be found in the online DOI or iMetaOmics, http://www.imeta.science/imetaomics/.
Figure S1: Distribution of the detected gene numbers in tumor and normal cells. Normal cells present fewer detected genes than tumor cells.
Figure S2: Performance comparison between TCfinder and other known methods.
Figure S3: TCfinder correctly recognizes most cells in the GSE140440 (tumor cell line) dataset as tumor cells.
Figure S4: Performance of TCfinder in the simulated datasets.
Figure S5: Pathway scores at bulk tissue and single-cell level.
Figure S6: Identify gene pathways important for tumor vs normal cell classification.
Figure S7: Antigen presentation gene expression in bulk tissues.
Figure S8: Heatmap showing the performance of TCfinder classifier.
Figure S9: Performance comparisons of different machine learning models.
Figure S10: Performance of different tumor vs normal cells classification methods for cells with the indicated number of genes.
Figure S11: Application of TCfinder in exploring the trajectories/fates of tumor cells.
Table S1: Pathways retained for model building after screening.
Table S2: List of single cell datasets used in the paper, along with basic statistics.
Table S3: The performance of different methods.
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
