TCGA-Assembler: open-source software for retrieving and processing TCGA data

Computer science, Software, Open source, Open-source software, Computational biology, Operating system, Biology
Authors
Yitan Zhu, Peng Qiu, Yuan Ji
Source
Journal: Nature Methods [Springer Nature]
Volume (issue): 11 (6): 599-600 Citations: 362
Identifier
DOI:10.1038/nmeth.2956
Abstract

To the Editor: The Cancer Genome Atlas (TCGA) has been generating multimodal genomics, epigenomics, and proteomics data for thousands of tumor samples across more than 20 types of cancer. Although access to most level-1 and level-2 TCGA data is restricted, all level-3 TCGA data and some level-1 clinical data (e.g., survival and drug treatments) are publicly available. The public data include genome-wide measurements of different genetic characterizations, such as DNA copy number, DNA methylation, and mRNA expression for the same genes, providing unprecedented opportunities for systematic investigation of cancer mechanisms at multiple molecular and regulatory layers [1-3].

Few integrative data-mining tools exist for TCGA, partly because tools to acquire and assemble the large-scale TCGA data are lacking. Specifically, the level-3 TCGA data are stored as hundreds of thousands of sample- and platform-specific files, accessible through HTTP directories on the servers of the TCGA Data Coordinating Center (DCC) [4]. Navigating through all of the files manually is impossible. Although Firehose [5] nicely assembles and publishes TCGA data, it does not share the program code for data assembly. The community currently has no access to open-source data-retrieval tools for automatic and flexible data acquisition, which severely hinders progress in systemic data integration and reproducible computational analysis using TCGA data.

To meet these challenges, we introduce TCGA-Assembler, a software package that automates and streamlines the retrieval, assembly, and processing of public TCGA data. TCGA-Assembler equips users with the ability to produce Firehose-type TCGA data using open-source and freely available program scripts. It opens the door to the development of data-mining and data-analysis tools that generate fully reproducible results, including data acquisition.

TCGA-Assembler consists of two modules (Fig. 1a), both written in R (http://www.r-project.org). Module A streamlines data downloading and quality checking, and module B processes the downloaded data for subsequent analyses (Supplementary Methods). In particular, module A takes advantage of the informative naming mechanism of the TCGA data file system (Supplementary Fig. 1) and applies a recursive algorithm to retrieve the URLs of all data files. By string matching on the URLs, module A allows users to download most TCGA public data (Supplementary Table 1) across genomic features and cancer types. For each genomic feature (such as gene expression from RNA-Seq), a data matrix combining multiple samples (Fig. 1b) is produced, with rows representing genomic units (such as genes) and columns representing samples.

Module B provides convenient and important data-preprocessing functions, such as mega-data assembly, data cleaning, and quantification of various measurements. Users interested in integrative analysis [6] require a mega data matrix (Fig. 1c) that matches different types of genomics measurements for the same genes across samples. Module B provides a function, “CombineMultiPlatformData”, to fulfill this requirement (Supplementary Methods); it involves intricate data-matching steps to overcome the feature-labeling discrepancies caused by different lab protocols and biotechnologies in the experiments. Other data-processing functions are also provided to facilitate downstream analysis (Supplementary Methods).

Figure 1 TCGA-Assembler as a tool for acquiring, assembling, and processing public TCGA data. (a) Flowchart of TCGA-Assembler. Module A acquires data from TCGA DCC. Module B processes the obtained data using various functions. (b) Illustration of a data matrix ...

Other big data tools for TCGA are available [5, 7, 8]. In particular, level-3 TCGA data can also be obtained from Firehose [5] at the MIT Broad Institute in the same format as in Fig. 1b, with one matrix for each cancer type and genomics platform.
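
The recursive URL retrieval and string matching described for module A can be illustrated with a small sketch. This is not the TCGA-Assembler code (which is written in R); it is a minimal Python illustration under stated assumptions. The directory tree, the example.org URLs, and the function names `list_directory`, `collect_file_urls`, and `select_urls` are all hypothetical, and an in-memory dictionary stands in for real HTTP directory listings.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an HTTP directory tree: each directory URL maps to
# the entry names it lists; names ending in "/" are subdirectories.
FAKE_TREE: Dict[str, List[str]] = {
    "http://example.org/tcga/": ["brca/", "README.txt"],
    "http://example.org/tcga/brca/": ["rnaseq/", "methylation/"],
    "http://example.org/tcga/brca/rnaseq/": [
        "sampleA.rnaseq.txt",
        "sampleB.rnaseq.txt",
    ],
    "http://example.org/tcga/brca/methylation/": ["sampleA.methylation.txt"],
}


def list_directory(url: str) -> List[str]:
    """Return the entries of one directory.

    Assumption: a real version would fetch and parse an HTTP listing here.
    """
    return FAKE_TREE[url]


def collect_file_urls(
    url: str, list_dir: Callable[[str], List[str]] = list_directory
) -> List[str]:
    """Recursively walk the directory tree and return the URLs of all files."""
    urls: List[str] = []
    for name in list_dir(url):
        child = url + name
        if name.endswith("/"):
            # Subdirectory: descend into it and collect its files.
            urls.extend(collect_file_urls(child, list_dir))
        else:
            urls.append(child)
    return urls


def select_urls(urls: List[str], required_substrings: List[str]) -> List[str]:
    """Keep only URLs that contain every required substring."""
    return [u for u in urls if all(s in u for s in required_substrings)]


all_urls = collect_file_urls("http://example.org/tcga/")
rnaseq_urls = select_urls(all_urls, ["/brca/", "rnaseq"])
print(rnaseq_urls)
```

Because the file and directory names carry the cancer type and platform in this toy tree, plain substring matching is enough to pick out one data type, which mirrors the role the text assigns to the informative naming scheme.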
Module A of TCGA-Assembler provides the same type of data matrices as Firehose, and it also distributes the R functions and associated programs that produce them. Equipped with this open-source tool, users can independently control which TCGA data are acquired locally and when. More importantly, quantitatively advanced users may integrate our open-source programs with downstream data-analysis tools to achieve reproducible and automated data analysis for TCGA.

Unique to TCGA-Assembler is module B, which provides critical functions for data cleaning and processing. For example, the mega data table (Fig. 1c) can be obtained with a single function, behind which substantial effort has gone into ensuring the validity of the process, such as checking and correcting gene symbol discrepancies. Lastly, TCGA-Assembler is fully compatible with Firehose in that the data-processing functions in module B can directly process data files downloaded from Firehose. This compatibility is crucial for those who want to take advantage of both software pipelines.

TCGA-Assembler will remain freely available and open source. In the future, more data-processing and analysis functions will be added to TCGA-Assembler based on user feedback and new research needs. The authors request acknowledgment of the use of TCGA-Assembler in published works.
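
The idea of a mega data table that matches measurements from several platforms for the same genes and samples can be sketched as follows. This is a simplified Python illustration, not the package's R function; the alias table `ALIASES`, the names `normalize_symbols` and `combine_platforms`, and the toy values are assumptions made for the example. It assumes each input matrix is complete (every gene row has a value for every sample of that platform).

```python
from typing import Dict, List, Tuple

# One platform's matrix: gene symbol -> {sample ID -> measured value}.
Matrix = Dict[str, Dict[str, float]]

# Hypothetical alias table mapping an outdated gene symbol to its current one.
ALIASES: Dict[str, str] = {"OLD1": "GENE1"}


def normalize_symbols(matrix: Matrix) -> Matrix:
    """Rename outdated gene symbols so that platforms agree on row labels.

    Note: if an alias and its current symbol both occur, the later row wins.
    """
    return {ALIASES.get(gene, gene): row for gene, row in matrix.items()}


def combine_platforms(
    platforms: Dict[str, Matrix],
) -> Tuple[List[str], Dict[Tuple[str, str], List[float]]]:
    """Build one table with rows keyed by (gene, platform).

    Columns are the samples measured on every platform, in sorted order.
    """
    cleaned = {name: normalize_symbols(m) for name, m in platforms.items()}
    # Samples present on each platform, then their intersection.
    sample_sets = [
        {sample for row in m.values() for sample in row} for m in cleaned.values()
    ]
    samples = sorted(set.intersection(*sample_sets))
    genes = sorted({gene for m in cleaned.values() for gene in m})
    combined: Dict[Tuple[str, str], List[float]] = {}
    for gene in genes:
        for name, m in cleaned.items():
            if gene in m:
                combined[(gene, name)] = [m[gene][s] for s in samples]
    return samples, combined


expression: Matrix = {
    "GENE1": {"S1": 5.0, "S2": 7.0},
    "GENE2": {"S1": 1.0, "S2": 2.0},
}
copy_number: Matrix = {"OLD1": {"S2": 0.5, "S3": -0.2}}

samples, combined = combine_platforms(
    {"expression": expression, "copy_number": copy_number}
)
print(samples)
for key, values in combined.items():
    print(key, values)
```

In this toy input the copy-number platform labels the gene with an outdated symbol, so without the renaming step its row would not line up with the expression row for the same gene; this is the kind of labeling discrepancy the text says must be checked and corrected.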