
A complete assembly of the rice Nipponbare reference genome

Biology  Genome  Genetics  Computational biology  Gene
Authors
Yong‐Min Liang,Wenchuang He,Tianyi Wang,Yingxue Yang,Qiang Xu,Xianjia Zhao,Longbo Yang,Hong Zhang,Xiaoxia Li,Yang Lv,Wu Chen,Shuo Cao,Xianmeng Wang,Bin Zhang,Xiangpei Liu,Xiao-Man Yu,Huiying He,Wei Hua,Yue Leng,Chuanlin Shi,Mingliang Guo,Zhipeng Zhang,Bintao Zhang,Qiaoling Yuan,Hongge Qian,Xinglan Cao,Yan Cui,Qianqian Zhang,Xiaofan Dai,Congcong Liu,Longbiao Guo,Yongfeng Zhou,Xiaoming Zheng,Jue Ruan,Zhukuan Cheng,Weihua Pan,Qian Qian
Source
Journal: Molecular Plant [Elsevier]
Volume (issue): 16 (8): 1232-1236  Citations: 44
Identifier
DOI:10.1016/j.molp.2023.08.003
Abstract

In 2005, the currently used rice reference genome (Oryza sativa ssp. japonica cv. Nipponbare) was first released by the International Rice Genome Sequencing Project (International Rice Genome Sequencing Project, 2005). The reference genome was updated in 2013 with an improved genome assembly (IRGSP-1.0) and gene annotations (MSU7, RAP-DB) (Kawahara et al., 2013; Sakai et al., 2013). Over the past 10 years, this reference has served as one of the most important genetic resources for rice functional genomics. Although several rice genomes have been assembled into gapless chromosomes with only 2–5 telomeres absent (Li et al., 2021; Song et al., 2021; Zhang et al., 2022), IRGSP-1.0 and its annotations remain the most widely used reference. However, limitations of sequencing technology and intricate genomic organization led to an under-representation of complex regions in this reference, leaving a total of 72 major gaps (including 19 telomeres), 167 minor gaps, and 779 unknown bases (Kawahara et al., 2013), with an estimated ∼3% of the genome unresolved. To pursue a complete sequence of this foundational reference genome, we applied a hybrid assembly strategy that integrated PacBio HiFi and Oxford Nanopore Technology (ONT) ultra-long reads to generate initial contigs, which were then scaffolded into a chromosome-level assembly with the support of the Hi-C dataset. Gap filling and terminal extension were then conducted to resolve the remaining seven gaps and one telomere region within the scaffolds. All gap-closure regions were supported by uniform coverage of ONT reads (Supplemental Figure 1).
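The gap figures quoted above for IRGSP-1.0 (major gaps, minor gaps, unknown bases) are the kind of tally that can be approximated by scanning an assembly for runs of N. The following is a minimal sketch, assuming gaps are encoded as runs of `N` in the sequence (a common FASTA convention, but not something the source states). The function names and the toy sequences are hypothetical, and the sketch does not attempt to separate major from minor gaps.

```python
import re
from typing import Dict, List, Tuple


def find_n_runs(seq: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of consecutive N/n bases."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def summarize_gaps(records: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Count N runs and total N bases for each named sequence."""
    summary = {}
    for name, seq in records.items():
        runs = find_n_runs(seq)
        summary[name] = {
            "n_runs": len(runs),
            "n_bases": sum(end - start for start, end in runs),
        }
    return summary


# Toy input (hypothetical): chrA has two N runs, chrB is gap-free.
toy = {"chrA": "ACGTNNNNACGTNACGT", "chrB": "ACGTACGT"}
print(summarize_gaps(toy))
```

A gap-free assembly would report zero runs for every chromosome under this convention. Telomere completeness needs a separate check, because a missing telomere does not show up as a run of N.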
A large rDNA array with nearly identical repeats of 45S rDNA was identified beside the telomere on the short arm of chromosome 9 (Supplemental Figure 2). It was artificially filled with consecutive blocks reflecting the estimated copy number (see supplemental materials and methods). This model captured 93.8% of HiFi reads and 93.9% of ONT reads containing 45S rDNA by full-length mapping, but it should be treated as a model sequence. Following sequence polishing with the HiFi and Illumina PE (next-generation sequencing [NGS]) reads, we produced a complete assembly of the rice reference genome, T2T-NIP (version AGIS-1.0), within which all 12 centromere and 24 telomere regions were resolved (Figure 1A). Multiple strategies were applied to evaluate the accuracy and completeness of T2T-NIP. All available primary data (HiFi, ONT, NGS, and Hi-C) were remapped to T2T-NIP with high mapping rates of >99.6% for all datasets except ONT reads (93.1%). All reads displayed uniform coverage across the whole genome, except for the Hi-C dataset because of large centromeres and complex regions near two telomeres (Figure 1B). Chromatin immunoprecipitation and sequencing (ChIP-seq) was conducted with the rice CENH3 antibody to identify the location and sequence of functional centromeres in T2T-NIP (Figure 1A, Supplemental Table 1, and Supplemental Figure 3). CentO-enriched regions were also identified by sequence homology to the 155- to 165-bp CentO satellite repeats (Figure 1A and Supplemental Table 1), eight of which were similar or consistent in size with those previously determined by fluorescence in situ hybridization (Cheng et al., 2002).
The consensus accuracy of the whole genome was estimated to be approximately one error per 5 million bases (Q63), which showed much higher sequence accuracy (Supplemental Table 2). For gene content assessment, T2T-NIP captured 99.88% of the 1614-gene BUSCO set (Supplemental Table 3), which was equal to or higher than previously reported gapless rice genomes (Li et al., 2021; Song et al., 2021; Zhang et al., 2022). A total of 1747 ribosomal RNA (rRNA) genes were identified in T2T-NIP, whereas only several hundred were identified in IRGSP-1.0. A total of 57 359 protein-coding genes and 325 794 repeat elements (51.1%) were identified, both of which exceed the numbers for IRGSP-1.0 (Supplemental Tables 4 and 5). In the model sequence of the 45S rDNA array, 1022 genes were annotated with support of transcriptome data (Supplemental Table 6).
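The consensus accuracy quoted above is given both as an error rate and as a Q value. A minimal sketch of the standard Phred relation between the two, Q = 10 · log10(bases per error), is shown below. The function names are hypothetical, and the sketch does not attempt to reproduce how the authors derived their estimate, which is not described in this text.

```python
import math


def phred_from_bases_per_error(bases_per_error: float) -> float:
    """Phred-scaled quality for an error rate of one error per `bases_per_error` bases."""
    return 10.0 * math.log10(bases_per_error)


def bases_per_error_from_phred(q: float) -> float:
    """Inverse of the Phred relation: expected number of bases per error at quality q."""
    return 10.0 ** (q / 10.0)


# One error per 5 million bases, as quoted in the text.
print(f"1 error / 5e6 bases -> Q{phred_from_bases_per_error(5e6):.1f}")
# Q63, as quoted in the text.
print(f"Q63 -> 1 error / {bases_per_error_from_phred(63):,.0f} bases")
```

Under this standard definition the two quoted figures are close but not identical (one error per 5 million bases is about Q67, while Q63 is about one error per 2 million bases), so the reported values presumably reflect rounding or a different estimator.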
Among 314 protein-coding genes annotated in the gap-filling regions excluding this rDNA array, 142 genes were confirmed to be expressed in T2T-NIP and showed tissue-specific patterns (Supplemental Figure 4). With T2T-NIP, we achieved a complete sequence of the important rice reference genome with 385.7 million base pairs (Mbp), including abundant improvements compared with the prior assembly (Figure 1A and Supplemental Tables 4–6). Compared to IRGSP-1.0, T2T-NIP contains 12.5 Mbp of newly identified sequence, including rDNA arrays (33.2%), pericentromeric and centromeric regions (32.1%), transposable elements (27.1%), and telomere and subtelomeric regions (5.1%), all of which are necessary for fundamental cellular processes (Figure 1C–1E). Some of the largest gap-filling regions covered the centromeres of nine chromosomes, subtelomeric and telomeric regions of two chromosomes, and large complex and repetitive regions in three chromosomes, which are represented in IRGSP-1.0 as unknown or unresolved sequences (Figure 1A and Supplemental Table 7). In addition to these apparent gaps, other minor gap regions of IRGSP-1.0 were found to be artificial or otherwise incorrect (Supplemental Table 8). We investigated all possible 500 kb flanking regions adjacent to the 72 major gaps in IRGSP-1.0 and found that most regions far from centromeres and telomeres (39/44) showed excellent synteny with T2T-NIP, while almost all regions close to centromeric gaps (11/12) contained additional minor gaps with extensive large structural differences (e.g., deletions and inversions with lengths >20 kb) compared to T2T-NIP (Figure 1D). Additionally, four major gaps and their flanking regions with several minor gaps could be well resolved by T2T-NIP, resulting in two continuous regions of 100–117 kb (Figure 1D and Supplemental Table 7). 
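The breakdown of the 12.5 Mbp of newly identified sequence given above can be turned into absolute lengths with simple arithmetic. The sketch below uses only the percentages quoted in the text; the function and variable names are hypothetical.

```python
# Figures quoted in the text above.
NEW_SEQUENCE_MBP = 12.5
CATEGORY_PERCENT = {
    "rDNA arrays": 33.2,
    "pericentromeric and centromeric regions": 32.1,
    "transposable elements": 27.1,
    "telomere and subtelomeric regions": 5.1,
}


def category_lengths_mbp(total_mbp, percent_by_category):
    """Convert percentage shares of a total length into lengths in Mbp."""
    return {
        name: total_mbp * pct / 100.0
        for name, pct in percent_by_category.items()
    }


lengths = category_lengths_mbp(NEW_SEQUENCE_MBP, CATEGORY_PERCENT)
for name, mbp in lengths.items():
    print(f"{name}: {mbp:.2f} Mbp")

# Share of the new sequence covered by the four listed categories.
listed = sum(CATEGORY_PERCENT.values())
print(f"listed categories: {listed:.1f}% of the new sequence")
```

The four listed categories sum to 97.5%, so about 2.5% of the new sequence (roughly 0.3 Mbp) is not itemised in the text.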
These results demonstrated a significant update of the rice reference genome by resolving the gaps and misassembled structures probably caused by complex and large repetitive structures in IRGSP-1.0. T2T-NIP removes a long-standing barrier that has hidden 3% of the genome from sequence-based analysis, resolving all centromeric and telomeric regions. Therefore, it is important to further describe the initial analysis of a truly complete rice reference genome and to discuss its potential applications. We have produced a rich collection of annotations and omics datasets for T2T-NIP, including gene models and transposon elements (TEs), RNA sequencing, and methylation datasets, as presented in an online database (http://www.ricesuperpir.com/web/nip). To highlight the utility of these genetic resources, we demonstrate examples of complex duplicated regions in chromosomes 10 and 11 that were associated with previously unresolved gaps. The gene AGIS_Os10g035850 (denoted as LOC_Os10g43075 in IRGSP-1.0/MSU7) traversed across the boundary of a major gap at the subtelomeric region of chromosome 10, resulting in an incomplete annotation of only 76.3% of the entire gene and some misannotated exons in the previous version. T2T-NIP thus supported the correction of this gene model, including an addition of six new exons into each of its two splicing alternatives from the gap-filling region (Supplemental Figure 5). Most TE-related genes have multiple copies (paralogs) caused by repetitive sequences, which previously have always complicated their genetic analysis. When mapping NGS reads, the absence of the additional paralogs in IRGSP-1.0 causes these reads to incorrectly align to LOC_Os11g12240 (AGIS_Os11g010790), resulting in many false-positive variants (Figure 1F). When mapped to T2T-NIP, the reads show the expected coverage and a typical heterozygous variation pattern at a small region. 
Any variants within these paralogs, and others like them, will be overlooked when using IRGSP-1.0 as a reference, which underscores the importance of the release of T2T-NIP. To investigate how T2T-NIP affects short-read variant calling, we collected NGS reads of 230 cultivated (Oryza sativa) and wild (Oryza rufipogon) rice accessions from our previous study (Shang et al., 2022). The cultivated collection consisted of three populations: Xian/indica (XI), Geng/japonica (GJ), and Aus (cA). The same pipeline was applied for variant calling based on T2T-NIP and IRGSP-1.0 to eliminate interference caused by software parameters. On average, BWA-MEM mapped an additional 1.04 × 10⁷ (6.9%) properly paired reads to T2T-NIP compared to IRGSP-1.0. Interestingly, even though more reads aligned to T2T-NIP, the per-read mismatch rate was 1.2%–8.2% lower across all populations (Figure 1G). Similarly, T2T-NIP improved other mapping characteristics, reducing the number of misoriented read pairs (Figure 1H) and improving coverage uniformity (Figure 1I) compared to IRGSP-1.0. Within gene regions, we noted a decrease of 2.0%–4.3% in the standard deviation of read coverage, with analogous improvements among all population groups (Figure 1I). From these alignments, we identified a total of 741 895 221 high-quality single-nucleotide variants and small indel variants relative to T2T-NIP (per-sample mean, 3 225 631) compared to 744 667 800 variants relative to IRGSP-1.0 (per-sample mean, 3 237 686), observing a shared decrease in the number of called variants per individual genome (Supplemental Figure 6 and Supplemental Table 9).
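The per-sample means quoted above follow directly from the variant totals and the 230 accessions. The sketch below reproduces that arithmetic and the overall relative change; the function names are hypothetical and only numbers from the text are used.

```python
N_SAMPLES = 230                   # accessions in the short-read panel
TOTAL_T2T_NIP = 741_895_221       # variants called relative to T2T-NIP
TOTAL_IRGSP = 744_667_800         # variants called relative to IRGSP-1.0


def per_sample_mean(total_variants: int, n_samples: int) -> float:
    """Mean number of called variants per sample."""
    return total_variants / n_samples


def relative_change_percent(new: float, old: float) -> float:
    """Signed change from `old` to `new`, as a percentage of `old`."""
    return (new - old) / old * 100.0


mean_t2t = per_sample_mean(TOTAL_T2T_NIP, N_SAMPLES)
mean_irgsp = per_sample_mean(TOTAL_IRGSP, N_SAMPLES)
change = relative_change_percent(TOTAL_T2T_NIP, TOTAL_IRGSP)

print(f"T2T-NIP   per-sample mean: {mean_t2t:,.0f}")
print(f"IRGSP-1.0 per-sample mean: {mean_irgsp:,.0f}")
print(f"relative change in total calls: {change:.2f}%")
```

The overall reduction is small, a fraction of a percent, which is consistent with the text's emphasis on where the calls change (heterozygous calls in repetitive regions) and not on the total count.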
Along with the improvement in the per-read mismatch rate, we attribute the reduction in the number of per-sample variant calls to the lower number of consensus errors and structural errors, and especially to the resolution of the complex repetitive regions with correct copies in T2T-NIP (Figure 1F). This conclusion is supported by the observation that the number of heterozygous variants per sample decreased substantially in all populations, while homozygous variants showed a slight increase except in GJ (Supplemental Figure 6 and Supplemental Table 9). These results demonstrate the superiority of T2T-NIP as a reference genome for more accurate mapping and variation analysis based on short reads. Next, we investigated the effects of using T2T-NIP as a reference genome for structural variant (SV) calling from published long reads (Shang et al., 2022). Alignment to T2T-NIP also reduced the observed mismatch rate per mapped read (Figure 1J) and the standard deviation of coverage within genes (Figure 1K) across all populations. T2T-NIP also corrected structural errors in IRGSP-1.0 and contained a complete assembly of the genome, which facilitated a much more accurate alignment, similar to what we observed for short reads (Supplemental Table 10). From these results, we observed a shared reduction (from −16.3% to −4.6%) in the number of SVs from different populations when calling variants against T2T-NIP instead of IRGSP-1.0. As with the small variants above, the number of heterozygous variants decreased more than that of homozygous variants (Supplemental Figure 7), likely also due to improved resolution of the complex repetitive regions in T2T-NIP, which reduced the rare structures found in IRGSP-1.0.
To supplement our variant and phenotype datasets, we conducted genome-wide association studies (GWASs) to assess whether using T2T-NIP instead of IRGSP-1.0 as the reference genome improves the efficiency of genetic analysis. A total of 101 associated SNPs were identified for five agronomic traits, all of which were detected only from variant datasets called relative to T2T-NIP. For example, a pleiotropic locus on chromosome 1 related to yield per plant (qYPP1) was significantly associated with both grain yield and plant height when using T2T-NIP but was not identified using IRGSP-1.0 (Figure 1L–1M and Supplemental Figure 8). Gene-editing experiments and phenotype screening revealed significant differences in yield per plant and plant height between wild-type plants and plants carrying a loss-of-function mutation in OsAGPL2, a gene encoding the large subunit of ADP-glucose pyrophosphorylase (Figure 1N and Supplemental Figure 8). A favorable haplotype of OsAGPL2 was identified, showing significantly higher yield per plant (44.7 ± 11.8 g) than the other haplotypes (Figure 1O). Additionally, we identified some T2T-NIP-specific associated SVs related to grain width (Supplemental Figure 9). These results demonstrate the enhanced efficiency of genetic analysis of population variation and gene mining based on T2T-NIP. In summary, we achieved a complete sequence of the most commonly used rice reference genome in our assembly, T2T-NIP, by addressing the missing 3% of the genomic information, which represents a significant update to this important resource. T2T-NIP introduced ∼12.5 Mbp of sequence containing 1324 gene predictions, including rDNA arrays, centromeric satellite arrays, subtelomeres, and large repeat regions, thereby unlocking these complex regions of the genome for rice variational and functional studies.
All the raw sequencing reads, genome assembly, and annotations for T2T-NIP were deposited in the National Center for Biotechnology Information database under project accession number PRJNA953663 and the National Genomics Data Center database under project accession number PRJCA018610. The genome browser of T2T-NIP and its related annotations and omics datasets can also be easily accessed from our online database website (http://www.ricesuperpir.com/web/nip). This research was supported by the National Natural Science Foundation of China (32188102, 32101718), Guangdong Basic and Applied Basic Research Foundation (2023B1515020053), the Youth Innovation of Chinese Academy of Agricultural Sciences (Y20230C36), and the specific research fund of The Innovation Platform for Academicians of Hainan Province (YSPTZX202303).
References
Cheng Z., Dong F., Langdon T., Ouyang S., Buell C.R., Gu M., Blattner F.R., Jiang J. (2002). Functional rice centromeres are marked by a satellite repeat and a centromere-specific retrotransposon. Plant Cell 14: 1691-1704. https://doi.org/10.1105/tpc.003079
International Rice Genome Sequencing Project (2005). The map-based sequence of the rice genome. Nature 436: 793-800. https://doi.org/10.1038/nature03895
Kawahara Y., de la Bastide M., Hamilton J.P., Kanamori H., McCombie W.R., Ouyang S., Schwartz D.C., Tanaka T., Wu J., Zhou S., et al. (2013). Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6: 4. https://doi.org/10.1186/1939-8433-6-4
Li K., Jiang W., Hui Y., Kong M., Feng L.Y., Gao L.Z., Li P., Lu S. (2021). Gapless indica rice genome reveals synergistic contributions of active transposable elements and segmental duplications to rice genome evolution. Mol. Plant 14: 1745-1756. https://doi.org/10.1016/j.molp.2021.06.017
Sakai H., Lee S.S., Tanaka T., Numa H., Kim J., Kawahara Y., Wakimoto H., Yang C.C., Iwamoto M., Abe T., et al. (2013). Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics. Plant Cell Physiol. 54: e6. https://doi.org/10.1093/pcp/pcs183
Shang L., Li X., He H., Yuan Q., Song Y., Wei Z., Lin H., Hu M., Zhao F., Zhang C., et al. (2022). A super pan-genomic landscape of rice. Cell Res. 32: 878-896. https://doi.org/10.1038/s41422-022-00685-z
Song J.M., Xie W.Z., Wang S., Guo Y.X., Koo D.H., Kudrna D., Gong C., Huang Y., Feng J.W., Zhang W., et al. (2021). Two gap-free reference genomes and a global view of the centromere architecture in rice. Mol. Plant 14: 1757-1767. https://doi.org/10.1016/j.molp.2021.06.018
Zhang Y., Fu J., Wang K., Han X., Yan T., Su Y., Li Y., Lin Z., Qin P., Fu C., et al. (2022). The telomere-to-telomere gap-free genome of four rice parents reveals SV and PAV patterns in hybrid rice breeding. Plant Biotechnol. J. 20: 1642-1644. https://doi.org/10.1111/pbi.13880