CPStools: A package for analyzing chloroplast genome sequences

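GenBank, genome, phylogenetic tree, biology, computational biology, genetics, whole-genome sequencing, genome project, sequence analysis, gene
Authors
Lijin Huang, Huanxi Yu, Zhi Wang, Wenbo Xu
Identifier
DOI: 10.1002/imo2.25
Abstract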

CPStools is a user-friendly software package for comprehensive chloroplast genome analysis. It integrates 10 functionalities: GenBank file checking, statistical information generation, sequence adjustment, inverted repeat (IR) region identification, nucleotide diversity (Pi) analysis, relative synonymous codon usage (RSCU) calculation, simple sequence repeat (SSR) identification, long sequence repeat (LSR) statistics, phylogenetic analysis, and format conversion. CPStools handles GenBank or Fasta format inputs, delivering results comparable to other tools while excelling in data preparation for advanced analysis. It uniquely generates consensus merged protein-coding sequence (CDS) or protein sequences from multiple GenBank files, facilitating advanced phylogenetic analysis. CPStools offers reliable results for comprehensive chloroplast genome analysis.

Chloroplasts are essential organelles for photosynthesis in green plants and algae [1]. Chloroplast genomes are typically circular and feature a quadripartite structure consisting of a small single-copy (SSC) region, a large single-copy (LSC) region, and two inverted repeat (IR) regions [2]. These genomes are pivotal in phylogenetic classification and species identification. With advances in next-generation sequencing technology, chloroplast genome analysis has become routine. However, current tools for chloroplast genome analysis have notable limitations. For instance, MIcroSAtellite Identification (MISA) is widely used for detecting simple sequence repeats (SSRs), but it involves complex categorization that can be challenging for inexperienced users [3]. CodonW, used for calculating relative synonymous codon usage (RSCU) values, requires a time-consuming process to prepare the necessary inputs, such as extracting consensus protein-coding sequences (CDS) and filtering short sequences from multiple GenBank files [4].
Additionally, Geseq and Geneious, when used for identifying chloroplast genome regions, often produce inaccurate results due to short IR fragments [5, 6]. There is a clear need for efficient tools that provide accurate results and prepare data for advanced analyses, such as nucleotide diversity (Pi) and phylogenetic analyses. To address these challenges, we developed CPStools, which integrates 10 subcommands, each offering specific functionalities, to overcome the limitations of existing tools. By simplifying input requirements and automating complex processes, CPStools significantly enhances the efficiency and accuracy of chloroplast genome analyses. This streamlined approach not only saves considerable time for researchers but also reduces the likelihood of errors, making CPStools an important contribution to chloroplast genome studies.

CPStools addresses 10 core functionalities that are essential for comparative genomic studies (Figure 1). These functions include GenBank file checking, statistical information generation, sequence adjustment, IR region identification, Pi analysis, RSCU calculation, SSR identification, long sequence repeat (LSR) statistics, phylogenetic analysis, and format conversion.

Nine sequences downloaded from NCBI were analyzed using CPStools for comparative analysis. During the analysis, 13 genes were identified that do not start with "ATG." All nine chloroplast genomes were annotated with 113 unique genes, except for Gynostemma yixingense, which had an incorrect annotation in the trnfM-CAU and trnM-CAU genes, a common error among inexperienced researchers. The "IR" subcommand was used for boundary detection, revealing that two of the nine sequences do not start with the first base pair of the LSC region. In Geseq and Geneious, all nine sequences start with the first base pair of the LSC region; however, the short repeats cannot be identified accurately (Table S1).
Based on the collinearity results and the IR identification results, the sequences were readily adjusted with the "Seq" subcommand. Pi analysis was performed with the "Pi" subcommand, extracting 110 shared single genes and 150 intergenic regions. After multiple sequence alignment and calculation of the Pi values, regions with high Pi values were selected as barcode regions for identification purposes (Figure 2A). The GenBank files were also converted to mVISTA input and accurately visualized using mVISTA (Figure 2B). Fifty-two genes were retained in eight of the species and 51 in G. yixingense, and the RSCU values were calculated with the "RSCU" subcommand (Figure 2C). Then, 44, 55, 52, 58, 37, 62, 47, 54, and 45 SSRs were identified in the nine Gynostemma species, respectively. The locations of these SSRs in the intergenic spacer (IGS), intron, and exon regions were detected, along with the analysis of LSRs (Figure 2D, Tables S2 and S3). All analyses were completed within half an hour with high accuracy.

CPStools represents a breakthrough in chloroplast genome analysis, providing a user-friendly platform for rapid and comprehensive analysis with reliable results. It surpasses tools like Geseq and Geneious in identifying tetrad structures and handling short sequences. Unlike the labor-intensive processes required for DNAsp6 and CodonW, CPStools efficiently extracts shared gene sequences and batch-adjusts sequences, simplifying file preparation and supporting batch Pi calculations. This significantly reduces research workflow time and makes CPStools highly advantageous for researchers. Continuous refinement and feature expansion are planned.

CPStools relies on Biopython for parsing GenBank and Fasta files, which must strictly adhere to standard format specifications [7]. We recommend using CPGAVAS2 for annotation, as results from other software may not match CPStools due to format discrepancies [8].
The "gbcheck" subcommand checks GenBank files and adjusts them to the standard format required by CPStools. Researchers should ensure data compatibility and accuracy when using CPStools.

CPStools represents a significant advance in the field of chloroplast genome analysis by integrating 10 essential functionalities into a single, user-friendly package. This tool simplifies and automates complex processes, significantly enhancing the efficiency and accuracy of chloroplast genome studies. By addressing the limitations of existing tools, CPStools provides reliable and comprehensive results, facilitating detailed genomic analyses and phylogenetic studies. The incorporation of features such as sequence adjustment, nucleotide diversity analysis, codon usage calculation, and repeat identification ensures that researchers can conduct thorough research in the least amount of time. Future developments will continue to expand its capabilities, making CPStools an important resource for researchers in the field of chloroplast genomics.

The "gbcheck" subcommand offers two modes: self-checking of GenBank files and comparative analysis with a reference file. In the self-checking mode, the script examines CDS genes, assesses start and stop codons, and identifies multiple stop codons. The comparative mode compares annotation files by identifying discrepancies in gene annotation. The "info" subcommand provides statistical analysis of gene counts, types, and exon numbers, which is crucial for detailed genomic element statistics and helps speed up and improve the accuracy of chloroplast genome annotation.

The chloroplast genome's circular topology allows segmentation at arbitrary locations to yield linear sequences. Challenges arise when the IR region is split into fragments with only a few base pairs. The "IR" subcommand, using a seed size of 1000 base pairs, ensures accurate identification of the four chloroplast genome regions.
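To make the seed-based idea behind IR identification more concrete, the following Python sketch locates the longest inverted repeat pair in a sequence by seed-and-extend matching against the reverse complement. It is an illustration only, not the CPStools implementation: the function names, the matching strategy, and the returned coordinate format (0-based, half-open intervals) are assumptions. It treats the input as a linear sequence, so it does not handle an IR split across the sequence ends, and it is not optimized for full-length genomes.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_inverted_repeat(genome, seed=1000):
    """Hypothetical helper: find the longest inverted repeat pair.

    Returns ((start_a, end_a), (start_b, end_b)) with 0-based, half-open
    coordinates on the forward strand, or None if no pair shares a seed.
    """
    n = len(genome)
    rc = reverse_complement(genome)

    # Index every seed-length window of the reverse complement.
    seeds = {}
    for j in range(n - seed + 1):
        seeds.setdefault(rc[j:j + seed], []).append(j)

    best = None
    for i in range(n - seed + 1):
        for j in seeds.get(genome[i:i + seed], ()):
            # Extend the exact seed match to the left ...
            a, b = i, j
            while a > 0 and b > 0 and genome[a - 1] == rc[b - 1]:
                a -= 1
                b -= 1
            length = seed + (i - a)
            # ... and to the right.
            while (a + length < n and b + length < n
                   and genome[a + length] == rc[b + length]):
                length += 1
            # Map the reverse-complement interval back to forward coordinates.
            c = n - (b + length)
            first, second = sorted([(a, a + length), (c, c + length)])
            if first[1] > second[0]:
                continue  # overlapping copies: a palindrome, not an IR pair
            if best is None or length > best[0]:
                best = (length, first, second)
    return None if best is None else (best[1], best[2])


# Toy genome laid out as LSC + IR + SSC + reverse-complemented IR.
lsc = "ACCA" * 5
ir = "GATTACAGGCTTAACGTGCATCGGATACCT"
ssc = "ACAACCACCAACA"
toy = lsc + ir + ssc + reverse_complement(ir)
print(find_inverted_repeat(toy, seed=8))  # ((20, 50), (63, 93))
```

The toy example uses a seed of 8 only because the sequence is short; the seed size of 1000 base pairs quoted in the text is kept as the default. With the two IR intervals known, the LSC and SSC regions follow as the remaining stretches of the sequence.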
The "Seq" subcommand provides three modes for sequence adjustment: the LSC mode sets the sequence to start at the first base pair of the LSC region, the SSC mode orients the SSC region forward, and the RP mode performs reverse complementation while placing the first base pair of the LSC region at the start of the sequence.

Pi analysis detects polymorphisms within sequences, with regions of high mutation rates serving as genetic markers for species differentiation. Extracting sequences from gene and intergenic spacer (IGS) regions is challenging, and computing Pi values via DNAsp6 is time-consuming because it accepts only a single multiple sequence alignment file per calculation [9]. This process is further complicated by the presence of over 200 consensus sequences extracted from the entire chloroplast genome. The "Pi" subcommand streamlines this analysis by identifying and extracting consensus sequences from gene and IGS regions, supporting batch Pi value computation, and organizing results by their location within chloroplast genomes.

RSCU analysis, essential for understanding codon bias in chloroplast genomes, traditionally involves time-consuming steps, including filtering the lengths of conserved protein-coding sequences, excluding repetitive sequences, and computing relative codon usage frequencies. The "RSCU" subcommand allows rapid and accurate RSCU value calculation from multiple GenBank files.

The "SSRs" subcommand accurately identifies SSRs using preset minimum lengths for different types: 10 for mononucleotides, 6 for dinucleotides, 5 for trinucleotides, and 4 for tetranucleotides, pentanucleotides, and hexanucleotides. It also locates each SSR within gene, intron, or IGS regions. The "LSRs" subcommand pinpoints each LSR within genomic structures, offering a clearer understanding of genomic variations.

Phylogenetic analysis is primarily based on three types of data: the entire chloroplast genome, consensus CDS, and protein sequences.
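The SSR thresholds given above (10, 6, 5, 4, 4, and 4 for mono- to hexanucleotide motifs) can be illustrated with a short Python sketch. This is not the code used by the "SSRs" subcommand: the function names and the output format are assumptions, and the thresholds are interpreted here as minimum repeat counts, as in MISA-style definitions, which the text does not state explicitly.

```python
# Minimum number of tandem repeats per motif size (values from the text,
# interpreted as repeat counts; that interpretation is an assumption).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4, 5: 4, 6: 4}
VALID_BASES = set("ACGT")


def is_primitive(unit):
    """True if the motif is not itself a repetition of a shorter motif."""
    size = len(unit)
    return all(
        unit != unit[:k] * (size // k)
        for k in range(1, size)
        if size % k == 0
    )


def find_ssrs(seq):
    """Hypothetical helper: return sorted (start, end, motif, repeats) tuples.

    Coordinates are 0-based and half-open.
    """
    seq = seq.upper()
    hits = []
    for size, min_repeats in MIN_REPEATS.items():
        i = 0
        while i + size <= len(seq):
            unit = seq[i:i + size]
            # Skip ambiguous bases and motifs such as "AA" or "ATAT",
            # which belong to a shorter motif class.
            if not set(unit) <= VALID_BASES or not is_primitive(unit):
                i += 1
                continue
            j = i + size
            while seq[j:j + size] == unit:
                j += size
            repeats = (j - i) // size
            if repeats >= min_repeats:
                hits.append((i, j, unit, repeats))
                i = j  # continue after the reported repeat
            else:
                i += 1
    return sorted(hits)


example = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GG" + "AGC" * 5 + "T"
for hit in find_ssrs(example):
    print(hit)
# (2, 14, 'A', 12)
# (16, 28, 'AT', 6)
# (30, 45, 'AGC', 5)
```

Assigning each reported interval to a gene, intron, or IGS region would additionally require the feature coordinates from the GenBank annotation, which this sketch leaves out.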
The "Seq" subcommand efficiently obtains and merges the complete chloroplast genome sequence. The "phy" subcommand facilitates the extraction and combination of shared CDS and protein sequences. After multiple sequence alignment, these sequences are ready for phylogenetic analysis.

The "convert" subcommand converts gb files into tbl, Fasta, and mVISTA annotation formats, with the tbl format used for NCBI database uploads and the mVISTA format used as input for the mVISTA software [10].

Lijin Huang: Conceptualization; software; data curation; visualization; validation; writing - original draft; formal analysis. Huanxi Yu: Conceptualization; methodology; funding acquisition; investigation; data curation; writing - review and editing. Zhi Wang: Conceptualization; methodology; funding acquisition; visualization. Wenbo Xu: Conceptualization; investigation; writing - original draft; writing - review and editing; visualization; validation; methodology; software; formal analysis; project administration; data curation; supervision; resources.

We would like to thank Mr. Lei Xu from Nanjing Genepioneer Biotechnologies Co., Ltd. for the visualization of CPStools. This work was supported by the Special Fund of the Chinese Central Government for Basic Scientific Research Operations in the Commonweal Research Institute (Grant no. GYZX240417), the National Key Research and Development Program of China (Grant no. SQ2020YFF0426320), and the Innovative Team Project of Nanjing Institute of Environmental Sciences in MEE (Grant no. ZX2023QT022).

The authors declare no conflict of interest.

No animals or humans were involved in this study.

CPStools and its dependencies are coded in Python, and the source code is available at GitHub (https://github.com/Xwb7533/CPStools). Sample data for each function are provided in the test data directory, along with detailed help documentation.
Additionally, video tutorials on how to use CPStools can be found on Bilibili (https://www.bilibili.com/video/BV1fZ421K7nw). Supplementary materials (tables, graphical abstract, slides, videos, Chinese translated version, and update materials) may be found in the online DOI or iMeta Science http://www.imeta.science/imetaomics/.

Table S1: Comparison of chloroplast genome tetrad structure identification results using CPStools, Geseq, and Geneious.
Table S2: Comparison of SSR identification results from CPStools and the MISA website.
Table S3: Comparison of LSR identification results from CPStools and the Reputer website.

Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.