
TCGA-Assembler: open-source software for retrieving and processing TCGA data

Keywords: computer science, software, open source, open-source software, computational biology, operating system, biology
Authors
Yitan Zhu, Peng Qiu, Yuan Ji
Source
Journal: Nature Methods [Springer Nature]
Volume (issue): 11 (6): 599-600 Citations: 362
Identifier
DOI: 10.1038/nmeth.2956
Abstract

To the Editor: The Cancer Genome Atlas (TCGA) has been generating multi-modal genomics, epigenomics, and proteomics data for thousands of tumor samples across more than 20 types of cancer. Although access to most level-1 and level-2 TCGA data is restricted, all level-3 TCGA data and some level-1 clinical data (e.g., survival and drug treatments) are publicly available. The public data include genome-wide measurements of different genetic characterizations, such as DNA copy number, DNA methylation, and mRNA expression, for the same genes, providing unprecedented opportunities for systematic investigation of cancer mechanisms at multiple molecular and regulatory layers [1-3].

Few tools for integrative data mining of TCGA data exist, partly because tools for acquiring and assembling the large-scale TCGA data are lacking. Specifically, the level-3 TCGA data are stored as hundreds of thousands of sample- and platform-specific files, accessible through HTTP directories on the servers of the TCGA Data Coordinating Center (DCC) [4]. Navigating through all of these files manually is impossible. Although Firehose [5] assembles and publishes TCGA data, it does not share the program code used for data assembly. The community therefore lacks open-source data-retrieval tools for automatic and flexible data acquisition, which severely hinders progress in systematic data integration and reproducible computational analysis using TCGA data.

To meet these challenges, we introduce TCGA-Assembler, a software package that automates and streamlines the retrieval, assembly, and processing of public TCGA data. TCGA-Assembler equips users with the ability to produce Firehose-type TCGA data using open-source, freely available program scripts. It opens a door for the development of data-mining and data-analysis tools that generate fully reproducible results, including the data-acquisition step.

TCGA-Assembler consists of two modules (Fig. 1a), both written in R (http://www.r-project.org). Module A streamlines data downloading and quality checking, and module B processes the downloaded data for subsequent analyses (Supplementary Methods). In particular, module A takes advantage of the informative naming mechanism of the TCGA data file system (Supplementary Fig. 1) and applies a recursive algorithm to retrieve the URLs of all data files. By string matching on the URLs, module A allows users to download most TCGA public data (Supplementary Table 1) across genomic features and cancer types. For each genomic feature (such as gene expression from RNA-Seq), a data matrix combining multiple samples (Fig. 1b) is produced, with rows representing genomic units (such as genes) and columns representing samples.

Module B provides convenient and important data-preprocessing functions, such as mega-data assembly, data cleaning, and quantification of various measurements. Users interested in integrative analysis [6] require a mega data matrix (Fig. 1c) that matches different types of genomic measurements for the same genes across samples. Module B provides a function, "CombineMultiPlatformData", to fulfill this requirement (Supplementary Methods); it involves intricate data-matching steps to overcome the feature-labeling discrepancies caused by different lab protocols and biotechnologies in the experiments. Other data-processing functions are also provided to facilitate downstream analysis (Supplementary Methods).

Figure 1. TCGA-Assembler as a tool for acquiring, assembling, and processing public TCGA data. (a) Flowchart of TCGA-Assembler. Module A acquires data from TCGA DCC. Module B processes the obtained data using various functions. (b) Illustration of a data matrix ...

Other big data tools for TCGA are available [5, 7, 8]. In particular, level-3 TCGA data can also be obtained from Firehose [5] at the MIT Broad Institute in the same format as in Fig. 1b, with one data matrix for each cancer type and genomics platform.
Module A of TCGA-Assembler not only provides the same type of data matrices but also distributes the R functions and associated programs that produce them. Equipped with this open-source tool, users can independently control which TCGA data are acquired locally and when. More importantly, quantitatively advanced users may integrate our open-source programs with downstream data-analysis tools to realize reproducible and automated data analysis for TCGA.

Unique to TCGA-Assembler is module B, which provides critical functions for data cleaning and processing. For example, the mega data table (Fig. 1c) can be obtained with a single function, behind which substantial effort has gone into ensuring the validity of the process, such as checking and correcting gene symbol discrepancies. Lastly, TCGA-Assembler is fully compatible with Firehose in that the data-processing functions in module B can directly process data files downloaded from Firehose. This compatibility is crucial to those who want to take advantage of both software pipelines.

TCGA-Assembler will remain freely available and open-source. In the future, more data-processing and analysis functions will be added to TCGA-Assembler based on user feedback and new research needs. The authors request acknowledgment of the use of TCGA-Assembler in published works.
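
The two ideas described in the letter (recursive collection of file URLs followed by string matching in module A, and gene-matched assembly of a mega data matrix in module B) can be illustrated with a small sketch. This is not the authors' R code: it is written in Python, the nested dictionary is a hypothetical stand-in for the HTTP directory listings, the URLs and values are made up, and all function names (`collect_file_urls`, `select_urls`, `combine_platforms`) are invented for illustration. Keeping only genes and samples shared by all platforms is an assumption of this sketch; the gene-symbol correction mentioned in the letter is not modeled.

```python
def collect_file_urls(listing, base_url):
    """Recursively walk a nested directory listing and return all file URLs.

    `listing` maps a name to either a nested dict (a sub-directory)
    or None (a data file). This mimics, in memory, a crawl over
    HTTP directory pages.
    """
    urls = []
    for name in sorted(listing):
        child = listing[name]
        url = f"{base_url}/{name}"
        if isinstance(child, dict):
            urls.extend(collect_file_urls(child, url))  # descend into directory
        else:
            urls.append(url)                            # leaf: a data file
    return urls


def select_urls(urls, *keywords):
    """Keep the URLs that contain every keyword (plain substring matching)."""
    return [u for u in urls if all(k in u for k in keywords)]


def combine_platforms(matrices):
    """Build a mega matrix keyed by (gene, platform).

    `matrices` maps platform -> {gene -> {sample -> value}}.
    Assumption of this sketch: only genes and samples present on
    every platform are kept.
    """
    platforms = sorted(matrices)
    shared_genes = set.intersection(*(set(m) for m in matrices.values()))
    shared_samples = set.intersection(
        *(set(s for row in m.values() for s in row) for m in matrices.values())
    )
    mega = {}
    for gene in sorted(shared_genes):
        for platform in platforms:
            row = matrices[platform][gene]
            mega[(gene, platform)] = {s: row.get(s) for s in sorted(shared_samples)}
    return mega


# Hypothetical directory tree: cancer type -> platform -> files.
listing = {
    "brca": {
        "rnaseqv2": {"sampleA.genes.txt": None, "sampleB.genes.txt": None},
        "methylation": {"sampleA.meth.txt": None},
    },
    "gbm": {"rnaseqv2": {"sampleC.genes.txt": None}},
}
urls = collect_file_urls(listing, "http://example.org/tcga")
brca_rnaseq = select_urls(urls, "/brca/", "rnaseqv2")

# Made-up per-platform matrices (rows: genes, columns: samples).
matrices = {
    "copy_number": {"TP53": {"S1": -0.4, "S2": 0.1}, "EGFR": {"S1": 1.2, "S2": 0.0}},
    "expression": {"TP53": {"S1": 8.5, "S2": 9.1}, "MYC": {"S1": 11.0, "S2": 10.2}},
}
mega = combine_platforms(matrices)
```

In this toy input only TP53 appears on both platforms, so the mega matrix holds one copy-number row and one expression row for that gene, aligned on the same samples.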