Abstract
As one of the most important crops, supplying the majority of the world's plant oil and protein, soybean faces increasing global demand. The reference genome of the accession “Williams82” opened the door to genomics research in soybean (Schmutz et al., 2010). Since then, vast amounts of multi-omics data have been generated, providing valuable resources for functional studies and molecular breeding. Parts of these data have been collected in different soybean databases (see details in Supplemental Table 1), such as SoyBase (Grant et al., 2010) and SoyKB (Joshi et al., 2012), which have made valuable efforts to facilitate the wide use of these data. Nevertheless, these existing databases handle multi-omics data integration and interactivity for soybean poorly, which poses considerable challenges for researchers working with these big omics data, particularly considering the unprecedented rate of data growth (Yang et al., 2021).
Thus, an integrated multi-omics database for soybean that provides a one-stop solution for big data mining with friendly interactivity is highly desirable. Here, we collect reported high-quality omics data, including assembled genomes, a graph pan-genome, and resequencing and phenotypic data of representative germplasms (Zhou et al., 2015; Fang et al., 2017; Liu et al., 2020); de novo assembled genomes of species in the subgenus Glycine (Zhuang et al., 2022); transcriptomic and epigenomic data from different tissues, organs, and accessions (Shen et al., 2014, 2018, 2019); and knowledge of quantitative trait loci and genome-wide association studies (GWAS) (Grant et al., 2010), and construct an integrated soybean multi-omics database named SoyOmics (https://ngdc.cncb.ac.cn/soyomics). Equipped with multiple analysis modules and toolkits, SoyOmics helps the global scientific community make full use of these big omics datasets for a wide range of soybean studies, from fundamental functional investigation to molecular breeding.
By integrating different multi-omics data, we develop six highly interactive basic modules in SoyOmics: "Genome", "Variome", "Transcriptome", "Phenome", "Homology", and "Synteny" (Figure 1A). The Genome module holds information on 2898 soybean germplasms and 27 de novo assembled genomes, giving users open access to basic information on sequenced germplasms, assembled genomes, and genes (Supplemental Figure 1). The Variome module organizes approximately 38 million SNPs and short insertions/deletions across the 2898 soybean accessions, allowing users to check variation information and whole-genome selective signals for any germplasm of interest (Supplemental Figure 2).
The Transcriptome module contains two gene expression datasets: one from 27 tissues at different developmental stages of the Williams82 and ZH13 accessions, and the other from nine tissues at different developmental stages of each of the 26 accessions used for pan-genome analysis. In this module, users can obtain gene expression profiles and orthologous gene information by specifying a gene ID or functional description (Supplemental Figure 3). The Phenome module collects approximately 27 000 records of 115 phenotypes, with terms defined as controlled vocabularies that fall into five classes (morphology, growth and development, biochemistry, biotic stress, and vigor) and 17 subclasses (Supplemental Figure 4). The Homology module displays the soybean pan-genome by characterizing 57 480 homologous gene groups. Users can specify any gene ID, homologous group ID, or gene functional description to retrieve the homologous group of interest (Supplemental Figure 5). The Synteny module stores approximately 550 000 large-scale structural variations (SVs) in the pan-genome; users can visualize and download the SVs and synteny blocks for a specified genomic region. Furthermore, the graph pan-genome is embedded, and a SequenceTubeMap web service (https://github.com/vgteam/sequenceTubeMap) is deployed to visualize pan-genome threads (or haplotypes) according to nodes defined by SVs (Supplemental Figure 6). In addition, each module in SoyOmics provides a user-friendly search bar designed to cover as much content as possible. Depending on the search category and the input, the search engine returns comprehensive associated results with convenient links from one module to the others (Supplemental Figure 7).
In addition to the six modules, we design several easy-to-use toolkits, including easyGWAS, ExpPattern, HapSnap, VersionMap, SoyArray, and SeqFetch (Figure 1A).
A BLAST module based on NCBI BLAST+ (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) is also provided for sequence searches against the genome, mRNA, coding sequence (CDS), and protein sequences of the pan-genome accessions. easyGWAS is a tool for quick-start GWAS analysis; it provides a friendly interface for setting parameters and selecting algorithms and offers multiple high-quality outputs, including a Manhattan plot, a QQ plot, and text results (Supplemental Figure 8). ExpPattern performs expression pattern analysis for a gene list across soybean tissues. It generates expression heatmaps, with optional clustering (Supplemental Figure 9). In addition, tspex (https://github.com/apcamargo/tspex/) is incorporated into ExpPattern to assess a gene’s tissue specificity. HapSnap is designed for haplotype analysis of a genomic region. Users can filter variations by variation type and quality control criteria. The output includes haplotype frequency, haplotype vs. genotype, and linkage disequilibrium (Supplemental Figure 10). VersionMap converts genomic regions between ZH13 (v2.0) and other de novo soybean genomes, or gene IDs between Williams82 (a2.v1) and ZH13 (v2.0) (Supplemental Figure 11). SeqFetch retrieves the sequence of a specific genomic region, gene, mRNA, CDS, and/or protein from 29 soybean genomes (Supplemental Figure 12). We also develop a toolkit named SoyArray by embedding the information of the GenoBaits soybean array (Liu et al., 2022), in which users can search and download the marker information of interest.
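To make the tissue-specificity assessment mentioned for ExpPattern more concrete, the following minimal Python sketch computes the tau index, a commonly used specificity metric, for one gene's expression profile. It is an illustration only and not the SoyOmics or tspex implementation: the function name `tau_index`, the example values, and the convention of returning 0 for a gene with no detectable expression are all assumptions.

```python
def tau_index(expression):
    """Tau tissue-specificity index for one gene (hypothetical helper).

    expression: non-negative expression values, one per tissue.
    Returns a value in [0, 1]: 0 means uniform expression across tissues,
    1 means expression confined to a single tissue.
    """
    values = [float(v) for v in expression]
    if len(values) < 2:
        raise ValueError("tau needs at least two tissues")
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")

    peak = max(values)
    if peak == 0:
        # Assumption: a gene that is silent everywhere is reported as 0.
        return 0.0

    # Sum of (1 - x_i / x_max) over tissues, normalised by (n - 1).
    return sum(1.0 - v / peak for v in values) / (len(values) - 1)


if __name__ == "__main__":
    # Toy profiles (made-up numbers, not SoyOmics data).
    print(tau_index([5, 5, 5, 5]))   # uniform expression
    print(tau_index([0, 0, 10, 0]))  # single-tissue expression
```

Values close to 0 suggest broad expression and values close to 1 suggest tissue-restricted expression; whether raw or log-transformed values are used changes the result, so the input scale should be stated alongside any reported tau.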
SoyArray also offers a function to compare divergent sites between two germplasms based on the markers of the GenoBaits soybean array, which is helpful for selecting parents in genetic or breeding studies (Supplemental Figure 13).
As SoyOmics integrates a wide variety of soybean multi-omics data, it can be used for deep mining ranging from fundamental research to molecular breeding. Here, we take a previously reported causal gene for seed coat color, G (Wang et al., 2018), as an example. In SoyOmics, we can group germplasms by green or yellow seed coat color (Figure 1B). Using the phenotype data, we can conveniently conduct GWAS analysis with the easyGWAS toolkit and identify a significant association signal located in the G gene, SoyZH13_01G182000 (Figures 1C and 1D). For an associated genetic variant of interest, users can view phenotypic variation among different genotypes, such as seed coat color (Figure 1E). By searching for the candidate gene SoyZH13_01G182000 in different modules, users can obtain a wealth of gene information, including a basic summary, functional annotation, homology in 29 soybean genomes, and expression pattern in 28 tissues (Figures 1D, 1F, and 1G). Furthermore, users can investigate functional annotations for any variant of interest (Figure 1H), linkage disequilibrium around the associated variant (Figure 1I), allele frequency in different populations (Figure 1J), and selective sweeps in the associated regions detected by three different test methods (Figure 1K). Notably, the majority of charts generated in SoyOmics can be directly downloaded and edited.
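The linkage disequilibrium view in the case study can be illustrated with the r² statistic between two biallelic variants. The sketch below is a minimal Python example under stated assumptions and not the method SoyOmics uses: each accession is treated as a single haplotype coded 0 (reference) or 1 (alternate), a simplification that suits largely homozygous, self-pollinating accessions, and the function name `ld_r2` and the toy genotypes are hypothetical.

```python
def ld_r2(site_a, site_b):
    """Pairwise LD (r^2) between two biallelic sites (hypothetical helper).

    site_a, site_b: equal-length sequences of 0/1 alleles, one entry per
    accession, in the same accession order.
    Returns r^2 in [0, 1], or None if either site is monomorphic.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")
    if any(x not in (0, 1) for x in list(site_a) + list(site_b)):
        raise ValueError("alleles must be coded 0 or 1")

    n = len(site_a)
    p_a = sum(site_a) / n   # alternate-allele frequency at site A
    p_b = sum(site_b) / n   # alternate-allele frequency at site B
    # Frequency of haplotypes carrying the alternate allele at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # r^2 is undefined when a site does not vary

    d = p_ab - p_a * p_b    # disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    # Toy alleles for four accessions (made-up, not SoyOmics data).
    print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated sites
    print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # independent sites
```

Computing this value for every variant pair in a window around an associated variant gives the kind of matrix that an LD heatmap displays; heterozygous or missing calls would need explicit handling that this sketch omits.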
In summary, SoyOmics features comprehensive integration of multi-omics datasets and provides user-friendly interfaces for soybean studies. Compared with other popular soybean databases, SoyOmics has significant advantages in multi-omics interaction, pan-genome scanning, and online analysis functionality (Supplemental Table 1), conforming well to the trend of omics databases in the post-genomics era. Soybean omics data are being generated at increasing scales and rates, including resequencing data for more germplasms; transcriptome data from bulk, single-cell, and spatial RNA sequencing; and epigenetic data such as Hi-C, ATAC-seq (assay for transposase-accessible chromatin using sequencing), and histone modification data. Therefore, future directions for SoyOmics mainly focus on the continuous integration of these newly generated omics data. In addition, artificial intelligence-based approaches for deep mining of these big data would provide valuable insights for a wide range of soybean studies, particularly for AI breeding in the era of big data. Toward this end, we call for global collaborations to build SoyOmics into a valuable platform for the research community worldwide.
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA24000000, XDA19050302, and XDA24040201); the Science and Technology Innovation 2030 - Major Project (2022ZD04017); the National Natural Science Foundation of China (32030021, 32000475, and 32201775); the National Key Research and Development Program of China (2021YFF1001201); the Taishan Scholars Program; the Xplorer Prize Award; the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Y2021038); and the China National Postdoctoral Program for Innovative Talents (BX2021354).