Abstract
Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.

Identifying the microbial taxon or taxa present in complex biological and environmental samples is one of the oldest and most frequent challenges in microbiology, from determining the etiology of an infection from a patient’s blood sample to surveying the bacteria in an environmental soil sample (Jones et al., 2017; Pedersen et al., 2016; Somasekar et al., 2017; Zhang et al., 2016). Prior to the advent of genomic sequencing technologies, identifying taxa required time-consuming sequential testing of candidates (Pavia, 2011; Venkatesan et al., 2013). The application of metagenomic sequencing is transforming microbiology by directly interrogating the community composition in an unbiased manner, enabling more rapid species detection and the discovery of novel species and reducing reliance on culture-dependent approaches (Knights et al., 2011; Loman et al., 2013). The potential application of these technologies to improve diagnostics and in public health settings has also been widely recognized (Chiu and Miller, 2019; Miller et al., 2013), and there is extensive ongoing work to overcome the challenges associated with clinical use of these approaches (Blauwkamp et al., 2019; Miller et al., 2019).
Because metagenomic sequencing produces genomic data from a set of species instead of a pure species isolate, one of the primary challenges in the field is the development of computational methods for identifying all of the species contained in these samples (Figure 1). There are two primary drivers of this computational challenge. First, the widespread use of high-throughput sequencing technologies that generate millions of short sequences (generally 50–200 nt) makes it difficult to classify large numbers of reads in a reasonable time. BLAST (Basic Local Alignment Search Tool) is one of the best-known and most commonly used software programs for DNA search and alignment against a database of genomic sequences (Altschul et al., 1990). Although BLAST is one of the most sensitive metagenomics alignment methods, it is computationally intensive, making it infeasible to run on the millions of reads typically generated by metagenomic sequencing studies. Second, this challenge is compounded by the exponential growth in recent years of the number of sequenced microbial genomes, meaning that the number of comparisons that need to be performed for new sequencing reads is huge and ever increasing.

Many software tools have recently been developed to taxonomically classify metagenomic data and estimate taxon abundance profiles. For accurate analysis and interpretation of these data, it is important to understand how these different tools, broadly referred to as classifiers, work and how to determine the best approach for a given sample type, microbial kingdom, or application. This includes continually benchmarking the ensemble of tools for the best performance characteristics along multiple dimensions: classification accuracy, speed, and computational requirements.
Several groups have previously benchmarked metagenomic tools (Lindgreen et al., 2016; Mavromatis et al., 2007; McIntyre et al., 2017; Meyer et al., 2019; Sczyrba et al., 2017), but the continual introduction of newer tools requires ongoing evaluation to compare them against established tools. Here we review the core principles of metagenomic sequence classification methods, describe how to evaluate classifier performance, and use these approaches to benchmark 20 commonly used taxonomic classifiers. To account for database differences and updates between methods, we further compare the performance of these tools on a uniform database, which has not been considered in earlier studies. We also provide recommendations for their use and describe future directions for the expansion of this field.
A large number of tools have recently been developed that are focused on classifying large amounts of sequencing reads to known taxa with increasing speed. These taxonomic classifiers require pre-computed databases of previously sequenced microbial genetic sequences against which sequencing data are matched. Within taxonomic classifiers, a distinction can be made between taxonomic binning and taxonomic profiling. Binning approaches provide classification of individual sequence reads to reference taxa. Profilers report the relative abundances of taxa within a dataset but do not classify individual reads. However, in practice, these methods are often used interchangeably when analyzing metagenomic sequencing data. Although not generated by default, a taxonomic profile can be calculated from binning approaches by summing up the individual read classifications. Taxonomic classifiers should not be confused with a distinct class of assembly-based tools for analysis of metagenomic sequencing data that cluster contigs de novo without the aid of any reference databases, an approach known as reference-free binning (Alneberg et al., 2014; Kang et al., 2015; Wu et al., 2016).
Reference-free binning tools cannot taxonomically classify sequences and, thus, are not evaluated here but have recently been benchmarked elsewhere (Sczyrba et al., 2017).

To generate assignments, classifiers use newer algorithmic approaches to ensure that classification is fast enough for even large numbers of sequencing reads. To do so, most tools first seek to reduce the number of candidate hits for processing via approaches such as searching for stretches of perfect sequence matches with reference sequences (k-mers, typically around 31 nt in length) or via an FM index (full-text index in minute space) (Ferragina and Manzini, 2000). As a result, these methods are typically not as sensitive as BLAST but are designed to be much faster. In addition, they frequently favor more memory usage to reduce CPU usage and, thus, classification time. Taxonomic classifiers can be divided into three groups: DNA-to-DNA classification (BLASTn-like), DNA-to-protein classification (BLASTx-like), and marker-based classification. DNA-to-DNA and DNA-to-protein tools classify sequencing reads by comparison with comprehensive genomic databases of DNA or protein sequences, respectively.
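Two ideas from the preceding paragraphs, exact k-mer matching against reference sequences and summing read-level assignments into a taxonomic profile, can be illustrated with a small sketch. This is a toy illustration and not the algorithm of any classifier benchmarked here: the function names, the reference sequences, the reads, the tie-handling rule, and the very short k-mer length (5 instead of the ~31 nt typical of real tools) are all assumptions chosen for readability.

```python
from collections import Counter, defaultdict

K = 5  # toy k-mer length (assumption); real tools typically use around 31 nt


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k=K):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = defaultdict(set)
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(taxon)
    return index


def classify_read(read, index, k=K):
    """Bin one read: pick the taxon with the most taxon-specific exact k-mer hits.

    Returns None when the read has no informative k-mers or the top two
    taxa are tied (a simplification made for this sketch).
    """
    hits = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:  # ignore k-mers shared between taxa
            hits[next(iter(taxa))] += 1
    if not hits:
        return None
    ranked = hits.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def profile_from_bins(assignments):
    """Sum per-read assignments into relative abundances of classified reads."""
    counts = Counter(a for a in assignments if a is not None)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()} if total else {}


# Hypothetical toy references and reads, invented for illustration only.
REFERENCES = {
    "taxon_A": "ACGTACGTTAGCCGATAGGCTA",
    "taxon_B": "TTGACCGGTTAACCGGAATTCC",
}
READS = ["ACGTTAGCCG", "CGATAGGCTA", "GGTTAACCGG", "GGGGGGGGGG"]

index = build_index(REFERENCES)
assignments = [classify_read(read, index) for read in READS]
profile = profile_from_bins(assignments)
```

Here `assignments` plays the role of a binning output (one call per read, with unclassified reads left as `None`), and `profile` is the "raw" read-based relative abundance derived from it. Real tools replace the dictionary index with far more compact data structures and resolve ambiguous k-mers using the taxonomy rather than discarding them.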
DNA-to-protein tools are more computationally intensive than DNA-to-DNA tools because they need to analyze all six frames of potential DNA-to-amino acid translation, but they can be more sensitive to novel and highly variable sequences because of the lower mutation rates of amino acid sequences compared with nucleotide sequences (Altschul et al., 1990). DNA-to-protein tools, however, target only the coding sequence of the genome and, therefore, will not be able to classify non-coding sequencing reads. Marker-based methods typically include in their reference database only a subset of gene sequences instead of whole genomes, normally specific gene families that have good discriminatory power between species. The most widely used single marker gene for bacterial metagenomics is the highly conserved 16S rRNA sequence (Edgar, 2018; Yarza et al., 2014), although other markers are needed to identify viruses, fungi, and other microbes that do not have the 16S marker gene. Some marker-based methods, such as MetaPhlAn2, address this limitation by indexing a number of different gene families in their databases to identify taxa from other microbial kingdoms (Truong et al., 2015). The use of a subset of genes makes these methods quick; however, the marker sequences used can introduce a bias in the results when they are not evenly distributed among the microbial sequences of interest (D’Amore et al., 2016).

All metagenomics classifiers require a pre-computed database based on previously sequenced microbial genetic sequences whose sheer size presents a considerable computational challenge. The most popular reference databases are RefSeq complete genomes (RefSeq CG) for microbial species as well as the BLAST nt and nr databases for high-quality nucleotide and protein sequences, respectively, from all kingdoms of life, with ∼50 and ∼200 million sequences, respectively, as of 2019. Other databases include SILVA for 16S rRNA, with ∼2 million sequences, and GenBank for a larger quantity of genomes with lower quality control standards (Benson et al., 2005; Quast et al., 2013). The universe of microbial sequences is very diverse, and the resulting databases are fairly large, typically requiring tens to hundreds of gigabytes. This vast search space can also result in a significant number of false positive classifications because of the large number of possible taxa against which the sequences are matched.
Additionally, the large universe of presently undiscovered microbial species can result in false negative classifications simply because the genetic sequences have never been cataloged in a database before. Recent efforts to expand the number of known microbial genomes have highlighted the improvement in the proportion of reads classified compared with older databases (Forster et al., 2019), but this improvement must be balanced against the challenges of handling larger databases.

All classifier tools are distributed with pre-compiled reference databases, the composition of which can vary substantially between classifiers. This can act as a confounder when comparing classification performance across methods. These databases may use entirely different sources for sequence data, or, even when they share a common source for sequences (e.g., RefSeq), continual updates and addition of new sequences will mean databases created at different times will have different content. Most tools also allow users to build their own database based on a desired set of sequences. This is a computationally intensive process, especially for comprehensive databases, but affords the user greater control over the analysis, especially when investigating rare, recently discovered, or highly diverse species. Given the complexity of both the test samples and reference databases in metagenomic classification, it is further important to perform comparisons using a uniform database to eliminate any confounding effects of differences in default database compositions.
The metrics selected to benchmark classifiers can greatly influence their relative rankings and performance and, thus, must be carefully selected to best reflect the way these tools are used in practice. The most important metrics for metagenomic classification are precision and recall. Precision is the number of true positive species identified in the sample divided by the total number of species identified by the method, whereas recall is the number of true positive species divided by the number of distinct species actually in the sample. These measures and derived metrics are commonly used across benchmarking studies (McIntyre et al., 2017; Meyer et al., 2019; Sczyrba et al., 2017). The F1 score is the harmonic mean of recall and precision, weighting them equally in a single metric. However, because end users will often filter out taxa below a certain abundance threshold, using a single raw precision, recall, or F1 score does not provide a realistic estimate of classifier performance. To better assess precision and recall scores across all abundance thresholds, it is preferable to use a precision-recall curve, where each point represents the precision and recall scores at a specific abundance threshold (Figure 2).
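The definitions above can be made concrete with a minimal sketch that computes species-level precision, recall, and F1 and then traces precision-recall points across abundance thresholds. The function names, the toy abundance report, the chosen thresholds, and the convention of returning a precision of 1.0 when no taxa pass the threshold are assumptions made for illustration, not part of the benchmark itself.

```python
def precision_recall_f1(called, truth):
    """Species-level precision, recall, and F1 for a set of called taxa."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Convention (assumption): precision is 1.0 when nothing is called.
    precision = true_positives / len(called) if called else 1.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def pr_curve(abundances, truth, thresholds):
    """Return (threshold, precision, recall) after filtering taxa by abundance."""
    points = []
    for threshold in thresholds:
        called = {taxon for taxon, a in abundances.items() if a >= threshold}
        precision, recall, _ = precision_recall_f1(called, truth)
        points.append((threshold, precision, recall))
    return points


# Hypothetical classifier report: relative abundance per called species.
TRUTH = {"sp1", "sp2", "sp3"}
REPORTED = {"sp1": 0.55, "sp2": 0.30, "sp3": 0.04, "fp1": 0.10, "fp2": 0.01}

unfiltered = precision_recall_f1(REPORTED.keys(), TRUTH)
curve = pr_curve(REPORTED, TRUTH, thresholds=[0.0, 0.02, 0.05, 0.2])
```

In this toy report, raising the threshold first removes a low-abundance false positive (precision rises with no loss of recall) and later removes the low-abundance true species `sp3` (recall drops), which is the trade-off a single unfiltered score hides.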
By varying the abundance threshold from 0 to 1.0, the area under the precision-recall curve (AUPR) provides a single metric that aggregates precision and recall scores (Davis and Goadrich, 2006; Saito and Rehmsmeier, 2015). It should be noted that precision and recall focus only on the positive class of identified taxa. Performance metrics that require the calculation of true negatives, such as ROC (receiver operating characteristic) curves, are less informative in this context because true negatives are poorly defined in real-world metagenomic samples. A potential drawback of AUPR is that it is biased toward low-precision, high-recall classifiers. Classifiers that do not recall all of the ground-truth taxa are penalized with zero AUPR from the highest achieved recall to 100% recall. For classifiers that do reach 100% recall, additional false positive taxon calls do not further penalize the AUPR score.

In addition to considering the number of correctly identified species, it is also important to evaluate how accurately the abundance of each species or genus in the resulting classification reflects the abundance of each species in the original biological sample (“ground truth”). This is especially critical for applications such as microbiome sequencing studies, where changes in population composition can confer phenotypic effects (Morgan et al., 2012; Ross et al., 2013). Abundance can be considered either as the relative abundance of reads from each taxon (“raw”) or by inferring the abundance of individuals from each taxon by correcting read counts for genome size (“corrected”). Some programs incorporate a correction for genome length into abundance estimates; this calculation can also be manually performed by reweighting the read counts after classification. Here we use raw abundance profiles unless correction is performed automatically by the software, as in the case of PathSeq and Bracken.

To evaluate the accuracy of abundance profiles, we can calculate the pairwise distances between ground-truth abundances and normalized abundance counts for each identified taxon at a given taxonomic level (e.g., species or genus). For this, we calculate the L2 distance for a given dataset’s classified output as the straight-line distance between the observed and true abundance vectors (Figure 2). We can also use this measure to compare abundance profiles between classifiers by instead computing L2 distances between classified abundances for pairs of classifiers. Abundance profile distance is more sensitive to accurate quantification of the highly abundant taxa present in the sample (Aitchison, 1982; Quinn et al., 2018). High numbers of very-low-abundance false positives will not dramatically affect the measure because they comprise only a small portion of the total abundance. For this reason, using such a measure alongside AUPR, which is highly sensitive to classifiers’ performance in correctly identifying low-abundance taxa, allows comprehensive evaluation of classifier performance.

The L2 distance should be considered as a representation of the abundance profiles. Because metagenomic abundance profiles are proportional data and not absolute data, it is important to remember that many common distance metrics (including L2 distance) are not true mathematical metrics in proportional space (Badri et al., 2018; Quinn et al., 2018). Generally, in proportional data analysis, a common method is to normalize proportions by using the centered log-ratio transform to calculate distances. However, the output of these metagenomic classifiers includes many low-abundance false positives, leading to sparse zero counts for many taxa across the different reports. The log-transform of these zero counts is undefined unless arbitrary pseudocounts are added to each taxon, which can negatively bias accurate classifiers because false positive taxa will have added counts. Another commonly used metric to compare abundance profiles is the UniFrac distance, which considers both the abundance proportion of component taxa as well as the evolutionary distance for incorrectly called taxa (Lozupone and Knight, 2005). However, using this metric is complicated by the difficulty in assessing evolutionary distance between microbial species’ whole genomes.

Metrics should also be tested across many datasets because classifiers may perform better or worse on certain species or sample types. Last, other features, such as classification speed, memory usage, and output format, may also influence the choice of classifier and should also be considered in any thorough evaluation.

Here we benchmarked 20 metagenomic classifiers to compare performance in classification precision, recall, F1, speed, and other metrics using a uniform database to eliminate any confounding effects of differences in default databases. DNA-to-DNA classifiers evaluated here were Kraken (and its add-on for more accurate abundance quantification, Bracken), Kraken2, KrakenUniq, k-SLAM, MegaBLAST, metaOthello, CLARK, CLARK-S, GOTTCHA, taxMaps, prophyle, PathSeq, Centrifuge, and Karp. DNA-to-protein classifiers evaluated were DIAMOND, Kaiju, and MMseqs2. We also evaluated the marker-based methods MetaPhlAn2 and mOTUs2. A more detailed description of each classifier’s qualitative characteristics is provided in Table 1 and Table S1. To evaluate classifier performance controlling for database differences, a uniform database was created, when possible, based on RefSeq CG and benchmarked for each method alongside the default database. We considered the precision and recall across a range of abundance thresholds as well as overall abundance profiles as our primary benchmarking metrics.

Table 1. A List of Benchmarked Classifiers and Their Various Characteristics

| Type | Classifier | Custom Databases | Generates Abundance Profile | Memory Required | Time Required | Reference |
|------|------------|------------------|-----------------------------|-----------------|---------------|-----------|
| DNA | Bracken | yes | yes | <1 Gb | <1 min | Lu et al., 2017 |
| DNA | Centrifuge | yes | yes | 20 Gb | 7 min | Kim et al., 2016 |
| DNA | CLARK | yes | yes | 80 Gb | 2 min | Ounit et al., 2015 |
| DNA | CLARK-S | yes | yes | 170 Gb | 40 min | Ounit and Lonardi, 2016 |
| DNA | Kraken | yes | yes | 190 Gb | 1 min | Wood and Salzberg, 2014 |
| DNA | Kraken2 | yes | yes | 36 Gb | 1 min | Wood and Salzberg, 2014 |
| DNA | KrakenUniq | yes | yes | 200 Gb | 1 min | Breitwieser et al.,