摘要
Article14 December 2017Open Access Transparent process Subspecies in the global human gut microbiome Paul I Costea Paul I Costea orcid.org/0000-0003-1645-3947 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Luis Pedro Coelho Luis Pedro Coelho Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Shinichi Sunagawa Shinichi Sunagawa Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Department of Biology, Institute of Microbiology, ETH Zurich, Zurich, Switzerland Search for more papers by this author Robin Munch Robin Munch Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Jaime Huerta-Cepas Jaime Huerta-Cepas Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Kristoffer Forslund Kristoffer Forslund Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Falk Hildebrand Falk Hildebrand Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Almagul Kushugulova Almagul Kushugulova Center for Life Sciences, Nazarbayev University, Astana, Kazakhstan Search for more papers by this author Georg Zeller Corresponding Author Georg Zeller [email protected] orcid.org/0000-0003-1429-7485 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Peer Bork Corresponding Author Peer Bork [email protected] orcid.org/0000-0002-2627-833X Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Max-Delbrück-Centre for Molecular Medicine, Berlin, Germany Molecular Medicine Partnership Unit, Heidelberg, Germany Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany Search for more papers by this author Paul I Costea Paul I Costea orcid.org/0000-0003-1645-3947 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Luis Pedro Coelho Luis Pedro Coelho Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Shinichi Sunagawa Shinichi Sunagawa Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Department of Biology, Institute of Microbiology, ETH Zurich, Zurich, Switzerland Search for more papers by this author Robin Munch Robin Munch Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Jaime Huerta-Cepas Jaime Huerta-Cepas Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Kristoffer Forslund Kristoffer Forslund Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Falk Hildebrand Falk Hildebrand Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Almagul Kushugulova Almagul Kushugulova Center for Life Sciences, Nazarbayev University, Astana, Kazakhstan Search for more papers by this author Georg Zeller Corresponding Author Georg Zeller [email protected] orcid.org/0000-0003-1429-7485 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Search for more papers by this author Peer Bork Corresponding Author Peer Bork [email protected] orcid.org/0000-0002-2627-833X Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany Max-Delbrück-Centre for Molecular Medicine, Berlin, Germany Molecular Medicine Partnership Unit, Heidelberg, Germany Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany Search for more papers by this author Author Information Paul I Costea1, Luis Pedro Coelho1, Shinichi Sunagawa1,2, Robin Munch1, Jaime Huerta-Cepas1, Kristoffer Forslund1, Falk Hildebrand1, Almagul Kushugulova3, Georg Zeller *,1 and Peer Bork *,1,4,5,6 1Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany 2Department of Biology, Institute of Microbiology, ETH Zurich, Zurich, Switzerland 3Center for Life Sciences, Nazarbayev University, Astana, Kazakhstan 4Max-Delbrück-Centre for Molecular Medicine, Berlin, Germany 5Molecular Medicine Partnership Unit, Heidelberg, Germany 6Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany *Corresponding author. Tel: +49 6221 387 8361; E-mail: [email protected] *Corresponding author. Tel: +49 6221 387 8361; E-mail: [email protected] Molecular Systems Biology (2017)13:960https://doi.org/10.15252/msb.20177589 PDFDownload PDF of article text and main figures. Peer ReviewDownload a summary of the editorial decision process including editorial decision letters, reviewer comments and author responses to feedback. ToolsAdd to favoritesDownload CitationsTrack CitationsPermissions ShareFacebookTwitterLinked InMendeleyWechatReddit Figures & Info Abstract Population genomics of prokaryotes has been studied in depth in only a small number of primarily pathogenic bacteria, as genome sequences of isolates of diverse origin are lacking for most species. Here, we conducted a large-scale survey of population structure in prevalent human gut microbial species, sampled from their natural environment, with a culture-independent metagenomic approach. We examined the variation landscape of 71 species in 2,144 human fecal metagenomes and found that in 44 of these, accounting for 72% of the total assigned microbial abundance, single-nucleotide variation clearly indicates the existence of sub-populations (here termed subspecies). A single subspecies (per species) usually dominates within each host, as expected from ecological theory. At the global scale, geographic distributions of subspecies differ between phyla, with Firmicutes subspecies being significantly more geographically restricted. To investigate the functional significance of the delineated subspecies, we identified genes that consistently distinguish them in a manner that is independent of reference genomes. We further associated these subspecies-specific genes with properties of the microbial community and the host. For example, two of the three Eubacterium rectale subspecies consistently harbor an accessory pro-inflammatory flagellum operon that is associated with lower gut community diversity, higher host BMI, and higher blood fasting insulin levels. Using an additional 676 human oral samples, we further demonstrate the existence of niche specialized subspecies in the different parts of the oral cavity. Taken together, we provide evidence for subspecies in the majority of abundant gut prokaryotes, leading to a better functional and ecological understanding of the human gut microbiome in conjunction with its host. Synopsis Genomic variation within gut microbial species analyzed in > 2,000 metagenomic samples from three continents reveals strongly structured populations (subspecies) in 44 out of 71 cases. Ecological and functional properties of subspecies as well as associations with the host were explored. For most abundant gut microbes from six phyla, subspecies were detected. Within each host, a single subspecies generally dominates and persists over time. Many subspecies exhibit biogeographic patterning, some Firmicutes subspecies strong restrictions, mostly to Chinese samples. Reconstruction of the subspecies-specific gene content predicts functional differences, some of which are linked to host health. Introduction Despite the long-standing debate on whether there is a coherent species concept in the prokaryotic world (Achtman & Wagner, 2008; Doolittle & Zhaxybayeva, 2009), modern molecular technologies offer multiple operational definitions of species, as well as of subordinate levels, such as ecotypes, that are being successfully applied to aid the discovery process in microbial ecology. Some rely on genome-wide features, such as DNA–DNA hybridization (Stackebrandt et al, 2002) or pairwise average nucleotide identity (Konstantinidis & Tiedje, 2005), while others are restricted to a specific marker gene (e.g., 16S rRNA gene, or a variable region therein) or multiple ones (Mende et al, 2013), (i.e., multilocus sequence typing). These schemes aim to classify prokaryotic strains or isolates [which may differ by just a single nucleotide (Viana et al, 2015)] into genetically or phenotypically coherent taxa; these either delineate a specific feature of a population (e.g., antibiotic resistance or invasion potential) or are determined in an unsupervised manner following observed population structure (Urwin & Maiden, 2003). Noting such consistent differences within operationally defined species has prompted the introduction of the subspecies concept as early as the 1950s, when Bacillus cereus could be stratified into virulent and non-virulent types (Davenport & Smith, 1952). Subspecies showing adaptation to different environments are also referred to as "ecotypes", a prominent example being the ocean-dwelling Prochlorococcus spp. (Johnson et al, 2006), for which specific subspecies are specialized for different environmental conditions (Biller et al, 2014). In regard to the human microbiome, extensive work has been undertaken to characterize population structure and describe subspecies in species that include pathogenic bacteria, such as Escherichia coli, different Salmonella species, and a few others, for which multiple typing schemes exist (Gordienko et al, 2013; Chakraborty et al, 2015; Bale et al, 2016; Sharma et al, 2016), but not for the vast majority of human gut commensals. Current genomic approaches to determine population structure are hampered by their dependence on bacterial isolates and sequenced genomes. Due to limited throughput and biased by growth requirements, cultivation-based assessments are unviable for the study of complex microbial communities. Recently, the advent of culture-independent metagenomic approaches and newly developed analysis tools have made it possible to study the population structure and genomic variation of microbes in their natural environment (Luo et al, 2015; Nayfach et al, 2016; Truong et al, 2017). Here, we devised a novel approach for broadly delineating subspecies in the majority of abundant gut microbes and determined their associated single-nucleotide variants in order to quantify and characterize them in their natural habitat. We observed consistent functional differences between subspecies with respect to their gene pools suggesting that they likely differ in phenotypic or ecological properties. We discovered that clearly delineated subspecies are the rule rather than the exception in the gut microbiome. When analyzing their global geographic distribution, we found that within a host, they generally persist stably over time and are mutually exclusive. Illustrating the utility of the subspecies concept, we associated particular subspecies with specific genes and host phenotypes, with implications for disease. Results To be able to determine population structure without relying on limited or biased sets of pan-genomes (Gordienko et al, 2013), we explored the natural genomic variation of the human gut microbiome (Schloissnig et al, 2013; Luo et al, 2015; Nayfach et al, 2016; Lloyd-Price et al, 2017; Truong et al, 2017). To this end, we assessed the microbial genetic landscape of 2,144 deeply sequenced human stool metagenomes from nine countries, spanning three continents, including published (Huttenhower et al, 2012; Qin et al, 2012; Karlsson et al, 2013; Le Chatelier et al, 2013; Zeller et al, 2014) as well as 298 newly generated ones (Table EV1). The newly sequenced samples include a Kazakh cohort as well as three individuals from Germany that have been sampled over an extended period of time (see Materials and Methods). By mapping the respective reads to a set of 1,753 previously determined representative genomes, each one representing one species (Mende et al, 2013), and using conservative thresholds for minimal sequencing depth and prevalence, we were able to confidently assess the variation in 71 abundant microbial species (Fig 1, Materials and Methods, and Table EV2). While the latter represent only a small fraction of known gut species, they account for the vast majority of the mapped reads (Table EV3; on average, 95.5% (SD = 6%) of assigned relative abundance per sample; Fig 1B). Figure 1. Identification and prevalence of human gut microbial subspecies A, B. Human gut microbial species explored for the existence of subspecies show wide phylogenetic spread according to NCBI taxonomy (A) and include Methanobrevibacter smithii, the main archaeal member of the human gut microbiome, as well as representatives of all abundant phyla. Species names are according to NCBI taxonomy, with species cluster (specI) identifiers according to Mende et al (2013), which splits some named species into multiple specI clusters. Of 71 investigated species, 44 stratify into subspecies (highlighted in blue). Each species' average abundance across 2,144 human gut metagenomes is proportional to the size of the circles on the cladogram. Bars represent the number of subspecies identified in each, with "1" indicating no subdivision. The black portion of the bar corresponds to subspecies for which no representative genome sequence is available from NCBI. Geographic enrichments of subspecies are displayed as a heat map (showing only significant enrichment, FDR-corrected Fisher test P-value < 0.05, per country as maximum log-odds ratio across conspecific subspecies). Subspecies with a restricted geographic range are predominantly found in the Chinese and Kazakh populations. The 71 investigated species captured an average of 95.5% of sequencing reads that were assigned to any reference genome. The subset of 44 species with identified subspecies accounted for the majority of this abundance (B). Download figure Download PowerPoint Subspecies delineation To delineate population structure corresponding to subspecies (referred to as metagenomic subspecies, MGSS, in the following), we used metaSNV (Costea et al, 2017) to compute a matrix of genomic allele distances between all samples that allowed variant calling for any given species (Materials and Methods), and identified clusters by applying a very stringent cutoff for separation (Materials and Methods). With this operational definition, we found between two and four subspecies in 44 of the 71 species (accounting for 72% (SD = 15%) of the assigned relative abundance per sample), totaling 112 subspecies (Table EV2). The horizontal coverage over the analyzed genomes was generally above 70% and did not correlate with the existences of subspecies (Appendix Fig S1). We thus uncovered extensive population structure reflecting naturally occurring subspecies in the majority of abundant gut microbes. The 71 gut species investigated represent 28 genera, for 25 of which population structure has never been assessed before (Materials and Methods). Of the other three, for which population genetic studies have been published, one, Escherichia, was previously described to have well-defined structure below the species level (Chakraborty et al, 2015). In comparison with the four subspecies of E. coli determined here, there is an exact correspondence, such that MGSS1, MGSS2, MGSS3, and MGSS4 recapitulate phylogroups B1, A, D, and B2, respectively (Materials and Methods, Table EV4). For the remaining two genera, a cohesive subspecies scheme has not been established, which precluded a comparison. To investigate whether the stratification into subspecies is a potentially more broad feature of human-associated prokaryotes, we further assessed the genomic variation of 676 oral samples, collected from different sites within the oral cavity (buccal mucosa, tongue dorsum, and supragingival plaque) of multiple individuals. Subjecting them to the same procedure as the gut metagenomes, we were able to investigate the existence of subspecies in 62 additional named bacterial species. Of these, 18 could be delineated into two subspecies, which may be found in different habitats of the oral cavity. For example, Granulicatella adiacens [1358] MGSS1 is only found in plaque, while MGSS2 only lives in the buccal mucosa (Table EV5). We then focused on the human gut metagenome for which a wealth of public datasets exist that could be utilized by our approach. Thus, for each of the subspecies identified in human gut species, we determined a set of unique SNPs that unambiguously identify it; the median number of such "genotyping" positions is 2,133 (with 1,142 and 3,989 being the 25th and 75th quantile, respectively). Using these positions, we could assign subspecies in a substantially expanded set of samples with low sequence coverage (on average, a 60% increase in the number of samples profiled per species; see Materials and Methods and Table EV2). We found that 47 of the 112 identified subspecies lack a representative reference genome, sometimes even the most prevalent ones (Fig 1A, Table EV2), as similarly noted by independent studies (Lloyd-Price et al, 2017; Truong et al, 2017). Hence, the metagenomic SNP profiles not only increased our power to investigate the relation between subspecies and host properties, but also allowed for subspecies identification and quantification in new samples. Subspecies biogeography Having analyzed metagenomes from nine countries and three continents (China, Kazakhstan, Sweden, Denmark, Germany, Austria, France, Spain, and the United States), we could assess the global geographic distribution of each subspecies. While many subspecies appeared to be distributed without any recognizable geographic pattern, some did show striking regional enrichments (Figs 1A and EV1), consistent with a recently published, independent study (Truong et al, 2017). Overall, the most restricted geographic ranges across subspecies were observed in the Chinese samples, followed by the ones from Kazakhstan, while European and American samples appeared more similar in their gut subspecies compositions (Fig 1A). For example, for Eubacterium rectale, one subspecies (E. rectale MGSS3) was almost exclusive to Chinese samples (see also Truong et al, 2017), while the other two were found in all other countries, including neighboring Kazakhstan (Fig EV2). Strong geographic segregation might explain why associations between E. rectale and host physiology have often proven to be unstable when testing in different cohorts, as these subspecies could not be profiled previously (David et al, 2014). When comparing geographic ranges across bacterial taxa, members of the Firmicutes phylum showed significantly more geographic restrictions compared to all other phyla (Wilcoxon test P-value = 0.002). The regional enrichments among several Firmicutes species discovered here might reflect an adaptation to specific environmental factors, many of which are more prevalent in some geographic regions, but not exclusively found there. Alternatively, the observed structure could be the result of drift and differences in dispersal among gut microbial subspecies, though further work is needed to disentangle the two. Click here to expand this figure. Figure EV1. Geographic distribution of subspeciesThe enrichment of each MGSS is computed for each species within each country, showing which subspecies is enriched where. An enrichment is considered significant if the P-value of Fisher's exact test is less than 0.05. Download figure Download PowerPoint Click here to expand this figure. Figure EV2. Geographic patterns of Eubacterium rectale subspeciesThe PCoA of nucleotide variation shows a clear geographic pattern of Eubacterium rectale, with MGSS3 being almost exclusively found in the Chinese population. The relative proportion of countries in each subspecies is summarized in the top right bar plot highlighting that very few individuals from the other countries considered in the study harbor MGSS3. Download figure Download PowerPoint Subspecies dominance and persistence in individuals We next investigated whether subspecies occurrences are also restricted in individuals and how stable an individual's gut subspecies composition is over time. It has been previously shown that overall strain-level composition is generally stable (Schloissnig et al, 2013; Lloyd-Price et al, 2017; Truong et al, 2017), while upon perturbations, such as fecal microbiota transplantation, lasting changes in strain populations can occur (Li et al, 2016). In order to study the population structure of conspecific subspecies, we recorded the frequency of each subspecies in each sample (Fig 2A–E). For the 44 species with substructure, we generally observed a clear dominance (one subspecies represents more than 90% of the combined abundance in any given individual), and in 83% of the samples even exclusivity, of one conspecific subspecies, with no apparent effect of sequencing depth on deducing subspecies exclusivity (see Materials and Methods and Appendix Fig S2) consistent with a recent independent report (Truong et al, 2017). This observation is in line with the frequency profiles observed in most samples, where the vast majority of alleles are "fixed", indicating the presence of only one dominating strain. Specifically, for 41 of the 44 species with substructure, over 75% of samples have 90% of assessed variants with frequencies below 5 or above 95% (Appendix Fig S3). Moreover, subspecies exclusion is also in line with ecological theory predicting that for closely related taxa, one outcompetes the others in the same ecological niche (Hardin, 1960). To analyze temporal stability of conspecific subspecies, we considered individuals for which samples from multiple time points were available and tracked the number of times we observed a change in the dominant one. Considering 74 individuals from the United States, each sampled twice about 200 days apart (Huttenhower et al, 2012), as high as 94% of the dominant subspecies were the same at both time points, almost all of them being exclusive. Of the remaining 6%, the vast majority involved individuals that had more than one co-occurring subspecies at the earlier time point already (in these, subspecies replacement was 17 times more likely than in cases with a single subspecies at baseline, Fisher test P = 2.2 × 10−16), indicating that mixed subspecies populations are more fragile over time. This is supported by a more in-depth comparison of three individuals for which we have collected more than 20 samples over more than 2 years, and for which we could monitor 16–18 species across multiple time points spanning at least 1 year. For two individuals, we only found one incidence of switching in each. In both cases, the switching was observed where subspecies coexisted and the most abundant one changed. In the third individual, sampled over the same time period, seven out of the 18 species which we were able to track after an antibiotic intervention showed a subspecies replacement, supporting a strong and long-lasting impact of treatment on microbial composition (Voigt et al, 2015) accompanied by a switch of conspecific subspecies dominance (Fig 2). Figure 2. Subspecies co-occurrence and phylogenetic consistency A–D. Subspecies identified for Methanobrevibacter smithii and Bacteroides vulgatus/dorei are shown in principal coordinate (PCoA) projections of the between-sample distances based on single-nucleotide variations (see Materials and Methods). The first principal coordinate (PC) explains over 70% of the variation in both cases (panels A and C). Reference genomes have been projected into the same PCoA plots (marked with "×" in A and C; see Materials and Methods). The numbers adjacent to the placed genomes correspond to those shown in parentheses next to NCBI taxonomy identifiers (leaves) on the phylogenetic trees in (B, D), respectively. The sample density for each subspecies is highlighted by the histogram above, and the total number of samples in which the species could be quantified is indicated in headers. Quantification of the frequency of each subspecies (bottom plot in A and C) reveals that for M. smithii, only one sample has two subspecies co-occurring in one individual, while all the others have a single dominating one. In contrast, for B. vulgatus/dorei, co-occurrence is more commonly observed. Phylogenies reconstructed from the reference genomes (NCBI taxonomy identifiers; see Materials and Methods) are fully consistent with the SNP-based clustering. The representative genome for each species, relative to which genomic variants were called, is highlighted with a box. E. In B. vulgatus/dorei, subspecies composition within each individual is generally stable over time, with a change of the dominant subspecies being rare even over a period of up to 1,000 days. An exception (highlighted by purple line) is seen for an individual, in which one dominant subspecies is replaced by another one after antibiotic treatment. The right-hand panel summarizes subspecies frequency changes, underlining remarkable stability over time. Download figure Download PowerPoint Our observation of general dominance or exclusivity of a single subspecies adds another layer to gut microbial individuality. That it persists over time under normal conditions highlights the potential of targeted microbial interventions for achieving long-lasting effects once a desired strain is successfully "implanted" by manipulations such as fecal microbial transplantation (Li et al, 2016) or probiotic treatment. Subspecies functional characterization In the absence of phenotypic information (Nichols et al, 2011), a way of inferring functional differences between distinct taxa is finding genes that are consistent within, but different between them. Applied to subspecies, we first identified for a given bacterial species core genes (species core, SC), consistently present in all strains, which should encode necessary functions performed by all strains (Vernikos et al, 2015). We further determined from the other (accessory) genes of that species a subspecies-specific core (SSSC). These SSSC genes are shared by all strains of that subspecies, but are distinct from those in any other conspecific subspecies. This set of genes thus encodes the consistent functional differences between them (Materials and Methods). Both SC and SSSC sets were inferred de novo, through a newly developed method based on the co-abundance concept (Materials and Methods), previously used to pool genes into metagenomic species independent of reference genomes (Nielsen et al, 2014). Using the SNP frequencies at the genotyping positions of each subspecies along with the overall species abundance, we could identify genes whose abundance consistently correlated with these (i.e., both SC and SSSC) with high accuracy (Appendix Fig S4). For the functional characterization of subspecies, we annotated the genes to functional categories (Materials and Methods) and ascertained whether these were specific to either SC or SSSC, by testing for enrichment of functional annotations such as SEED (Overbeek et al, 2005) pathways, eggNOG (Huerta-Cepas et al, 2016b) categories, and KEGG (Kanehisa & Goto, 2000) modules in all the species with subspecies structure (Fig EV3). Compared to the combined SCs, the combined SSSCs were enriched in genes functionally related to phages (including genes for integration and excision), cell surface protein modifications (such as genes from the sortase pathway), chemotaxis, and motility as well as genes of unknown function (Table EV6). Such enrichments are broadly consistent with functional specialization of subspecies to different environmental niches. We next tested for functional differences between the SSSCs of different subspecies. Using homology-based KEGG annotation as well as SEED functional annotations across all species, we found 9 and 16, respectively, significant enrichments between conspecific subspecies, encoding a wide range of differences (at a cutoff of P < 0.05, corrected for multiple testing within each species; Table EV7). For example, we found significant differences in adhesion genes in two of four E. coli subspecies (MGSS3 and MGSS4), suggesting higher virulence potential (Lee et al, 2010). Using reference genomes and pathogenicity annotations from the PATRIC database (Wattam et al, 2014), we noted that these two subspecies were significantly enriched in isolates that caused diseases in the human host. These findings are in line with results on the corresponding phylogroups D and B2 (MGSS3 and MGSS4, respectively), which have been associated with extraintestinal infections in humans (Chakraborty et al, 2015). Moreover, metagenomic data from a recent enterohemorrhagic E. coli outbreak (Loman et al, 2013) indicates that the pathogenic strain comes from one of these two subspecies (Fig EV4). Even though our analysis is limited by the fact that an average of 70% of the SSSCs were not assigned to any functional category by either KEGG o