Complete telomere‐to‐telomere assemblies of two sorghum genomes to guide biological discovery

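Sorghum, Genome, Biology, Reference genome, Sweet sorghum, Genomics, Crop, Sequence assembly, Biotechnology, Genetics, Agronomy, Gene, Transcriptome, Gene expression
Authors
Chuanzheng Wei, Lei Gao, Ruixue Xiao, Yanbo Wang, Bingru Chen, Wenhui Zou, Jihong Li, Emma Mace, David Jordan, Yongfu Tao
Source
Journal: iMeta [Wiley]
Volume (Issue): 3 (2) Citations: 3
Identifier
DOI: 10.1002/imt2.193
Abstract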

The assembly of two sorghum T2T genomes corrected the assembly errors in the current reference, uncovered centromere variation, boosted functional genomics research, and accelerated sorghum improvement.

Cultivated sorghum (Sorghum bicolor (L.) Moench) is a C4 crop well known for its high efficiency of biomass accumulation and its adaptation to drought and hot environments. It is a staple food for half a billion people in Africa and Asia and provides a major source of feed, fiber, and biofuel globally. The release of the first sorghum reference genome of BTx623 greatly accelerated functional genomics studies in sorghum and related C4 grasses [1]. Subsequent improvement has further enhanced the quality of the reference genome [2]. Assemblies of other sorghum genomes such as Tx430, Rio, and wild sorghum accessions have shown marked intraspecific sequence variation in this crop [3, 4]. However, all of the available sorghum genomes are still incomplete, in particular with unresolved centromeres and telomeres, constraining a full understanding of the genomic landscape in the sorghum gene pool.

In this study, we utilized ultra-long reads from Oxford Nanopore Technologies (ONT), high-fidelity (HiFi) long reads from PacBio, Hi-C reads, and Illumina short reads to assemble complete sequences of two sorghum genomes, BTx623 and Ji2055. The white-seeded BTx623 has long served the sorghum community as the reference genome [1], while Ji2055 is an inbred line with red seeds that has led to the successful release of dozens of commercial varieties in China (Figure S1). We generated an average of >150× sequence coverage of ultra-long ONT data, >65× coverage of PacBio HiFi data, >50× coverage of Hi-C data, and >50× coverage of Illumina short-read data for both varieties (Table S1). The initial assemblies of the two genomes were obtained using Hifiasm [5] with HiFi reads only, resulting in two genomes containing 246 and 581 contigs, respectively.
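As a rough illustration of what the sequencing depths above imply, the sketch below converts a fold-coverage figure into total sequenced bases using coverage = total bases / genome size. The genome size of about 720 Mb is an assumption based on the final assembly sizes reported in this study (719.90 Mb and 722.96 Mb), and the function name `bases_for_coverage` is hypothetical; none of this is the authors' code.

```python
def bases_for_coverage(fold_coverage: float, genome_size_bp: float) -> float:
    """Return the total sequenced bases corresponding to a given fold coverage."""
    return fold_coverage * genome_size_bp


# Assumption: ~720 Mb, close to the assembly sizes reported in the study.
GENOME_SIZE_BP = 720e6

# Lower bounds on average depth quoted in the text for each data type.
reported_depths = {
    "ONT ultra-long": 150,
    "PacBio HiFi": 65,
    "Hi-C": 50,
    "Illumina": 50,
}

for platform, depth in reported_depths.items():
    gigabases = bases_for_coverage(depth, GENOME_SIZE_BP) / 1e9
    print(f"{platform}: more than {gigabases:.1f} Gb per genome")
```

Under these assumptions, >150× ONT coverage corresponds to more than roughly 108 Gb of ultra-long read data per genome.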
Hi-C data were then employed to anchor and orient these contigs into 10 pseudomolecules for each genome (Figure S2). Ultra-long ONT reads longer than 50 kb were used together with HiFi reads to fill sequence gaps and correct assembly errors, which reduced the number of gaps to only four for each assembly. These gaps were then closed via manual extension to achieve gap-free assemblies. Coverage depth analysis using HiFi reads identified 13 genomic regions with assembly errors, which were then corrected using HiFi and ONT reads. After further polishing of the assembled genomes with Illumina and HiFi reads, our final telomere-to-telomere (T2T) assemblies were obtained, with a genome size of 719.90 Mb for BTx623-T2T (Figure S3) and 722.96 Mb for Ji2055-T2T.

To validate the quality of our T2T assemblies, comprehensive assessments were performed. The overall accuracy of our T2T assemblies was supported by the uniform coverage distribution of PacBio HiFi and ONT reads across nearly all regions of the assemblies (Figure 1A). The two T2T genomes were estimated to have a base accuracy rate of 99.99% using PacBio HiFi reads. The completeness of our T2T assemblies was assessed using the Benchmarking Universal Single-Copy Orthologs (BUSCO) pipeline [6], which showed that our two assemblies captured >98.5% of the 1614 conserved orthologous genes, slightly higher than BTx623-v3 (Table S2). The LTR Assembly Index (LAI) [7], which measures genome continuity, was also higher for our assemblies than for BTx623-v3 (Table S2). Nearly all of the HiFi (100%) and ONT (>99.80%) reads could be mapped back to the T2T assemblies derived from them, highlighting the completeness of our T2T genomes. The published resequencing data of 44 sorghum varieties [8] were also mapped to the T2T genomes and displayed a significantly higher mapping rate against our T2T assemblies (average 99.20%) than against the BTx623-v3 genome (average 97.45%) (Table S3, Figure S4).
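The coverage-depth check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes mean HiFi read depth has already been computed per fixed-size window (for example with a depth tool such as mosdepth or samtools), and the thresholds of 0.5 and 1.5 times the median depth are hypothetical choices.

```python
from statistics import median


def flag_depth_outliers(window_depths, low=0.5, high=1.5):
    """Return indices of windows whose depth is far from the median depth.

    window_depths: mean read depth per fixed-size window along a chromosome.
    low, high: hypothetical thresholds, expressed as multiples of the median.
    """
    med = median(window_depths)
    return [
        i
        for i, depth in enumerate(window_depths)
        if depth < low * med or depth > high * med
    ]


# Toy data: depth sits near 65x except for one window with excess depth
# and one window with unusually low depth.
depths = [64, 66, 65, 130, 63, 67, 20, 65]
print(flag_depth_outliers(depths))  # prints [3, 6]
```

Windows flagged this way are only candidates. Excess depth is often associated with collapsed repeats and low depth with poorly supported joins, so each region would still need inspection against the read alignments before correction.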
All the centromeric regions of our T2T genomes contained the sorghum centromere-specific repetitive elements pSau3A10 and pSau3A9 [9, 10] (Figures S5 and S6). Overall, this evidence strongly supports the accuracy and completeness of our T2T assemblies. These two T2T sorghum assemblies, with their complete genome sequence and intact centromeres and telomeres on all 10 chromosomes, represent a significant improvement over the previous version of the reference genome (Tables S4 and S5).

Genome annotation showed that repetitive elements accounted for 66.50% of the BTx623-T2T genome and 65.22% of the Ji2055-T2T genome, including around 50% retroelements and 9% DNA transposons (Table S6). Our T2T assemblies contained a slightly higher percentage of repetitive elements than BTx623-v3 (63.18%), mainly due to the larger amount of satellite sequence captured in the T2T genomes (over 38 Mb in each) compared to BTx623-v3 (around 19 Mb). Genes in our T2T genomes were predicted using BRAKER, combining evidence from protein homology and RNA-seq data with ab initio prediction. A total of 35,695 and 36,950 protein-coding genes were identified in the BTx623-T2T and Ji2055-T2T genomes, respectively (Table S6). The majority (~83%) of these annotated genes were supported by RNA-seq data.

Compared to BTx623-v3, the BTx623-T2T genome contained 36.25 Mb of newly assembled sequence. Most (94.10%) of the newly assembled sequence consisted of repeat elements, including retroelements (44.01%) and satellites (45.34%) (Figure 1B, Table S7). The newly assembled sequences were mainly distributed around centromeric regions (82.12%) (Figure 1C). A total of 133 genes were identified in the newly assembled sequence, around 65% of them supported by RNA-seq data. These newly identified genes were predicted to play a role in transmembrane transport, regulation of transcription, developmental processes, and other functions.
Comparison with the BTx623-T2T genome revealed the misorientation of four genomic regions around the centromeres of chromosomes 1 (7.39 Mb), 5 (20.80 Mb), 6 (6.28 Mb), and 7 (13.13 Mb) in BTx623-v3, in addition to the mispositioning of two sequence segments of over 1 Mb on chromosomes 5 and 7 and the absence of hundreds of sequence segments (Figure 1D). Correcting these assembly errors in the reference genome is critical for exploiting genetic information in these complex regions for functional genomics research in sorghum.

The centromere size of BTx623-T2T varied from 2.24 Mb on chromosome 1 to 13.70 Mb on chromosome 4 (Table S5). Centromeric DNA sequence was mainly composed of satellites and retrotransposons such as Gypsy and Copia (Table S8). However, the content of these repeat elements differed among chromosomes. Gypsy accounted for more centromeric sequence than satellites did in chromosomes 3, 5, 6, 7, 8, and 9, while satellites were the most abundant component of centromeric sequence in chromosomes 1, 2, 4, and 10. A total of 134 genes were identified in the centromeric regions of BTx623-T2T. These genes were enriched for biological functions such as reproductive process, response to stimuli, and developmental processes, suggesting they are critical to fundamental biological processes in sorghum.

The assembly of two sorghum T2T genomes allows us to investigate sequence variation across the sorghum genome with a focus on centromeric regions. Substantial sequence variation was observed between the BTx623-T2T genome and the Ji2055-T2T genome. The size of centromeres varied between the two T2T genomes, particularly for chromosomes 1, 5, and 7 (Figure 1E). However, the sequence composition of centromeres was largely stable between the corresponding chromosomes of the two genomes (Figure 1F), suggesting that the variation in centromere size is unlikely to be due to the expansion of a particular class of repeat elements.
Most of the genes (84.96%) in centromeres were syntenic between BTx623-T2T and Ji2055-T2T, possibly due to limited recombination in these regions. Sequence comparison of the two T2T genomes identified a total of six large inversions (>50 kb) (Figure S7, Table S9). However, none of them overlapped with the centromeric regions.

In summary, this study assembled the complete genome sequences of the sorghum reference genome, BTx623, and a popular Chinese inbred line, Ji2055. These two high-quality T2T genomes could serve as the new reference genomes to guide biological discovery and unlock the full potential of global sequence variation for genetic improvement of sorghum.

Yongfu Tao designed the project. Chuanzheng Wei, Lei Gao, Ruixue Xiao, Bingru Chen, Jihong Li, and Yanbo Wang analyzed the sequence. Chuanzheng Wei and Wenhui Zou performed glasshouse experiments. Yongfu Tao, Emma Mace, and David Jordan supervised the work. Chuanzheng Wei and Yongfu Tao wrote the manuscript. All authors have read the final manuscript and approved it for publication.

The authors thank Prof. Weihua Pan (Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences) for the discussion on T2T genome assembly. This work was supported by the National Natural Science Fund for Excellent Young Scientists Fund Program (Overseas), the startup package from the Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, and the National Natural Science Foundation of China (No. 32372176).

The authors declare no conflict of interest.

No animals or humans were involved in this study.

Figure S1: The seeds of Ji2055 (left) and BTx623 (right).
Figure S2: Hi-C interaction maps of BTx623 and Ji2055.
Figure S3: Circos plot showing genome features of BTx623-T2T. (A) Chromosome, (B) centromere and telomere, (C) gene density, (D) density of repeat elements, (E) density of Gypsy, (F) density of Copia, (G) density of DNA transposons, and (H) GC content.
Figure S4: Mapping rate and coverage rate of 44 sorghum lines against three sorghum genomes.
Figure S5: Distribution of different types of repeat elements around centromere regions of BTx623. Motif includes pSau3A10 and pSau3A9.
Figure S6: Distribution of different types of repeat elements around centromere regions of Ji2055. Motif includes pSau3A10 and pSau3A9.
Figure S7: Sequence variation between BTx623-T2T and Ji2055-T2T.
Table S1: Summary of sequencing data generated in this study.
Table S2: Summary statistics of T2T sorghum genome assemblies.
Table S3: Summary of the mapping rate and coverage of 44 sorghum resequencing data sets.
Table S4: Summary of predicted telomeres in our T2T assemblies.
Table S5: Summary of predicted centromeres in our T2T assemblies.
Table S6: Genome annotation of our T2T genomes.
Table S7: Annotation of repeat elements in the newly identified sequence of BTx623-T2T.
Table S8: Composition of centromeres in the two sorghum T2T genomes.
Table S9: Sequence variation identified between BTx623-T2T and Ji2055-T2T.

Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.