Review | 25 August 2021 | Open Access

Diagnostics and correction of batch effects in large-scale proteomic studies: a tutorial

Jelena Čuklina (1,2,3), Chloe H Lee (1), Evan G Williams (1,4), Tatjana Sajic (1), Ben C Collins (1,5), María Rodríguez Martínez (3), Varun S Sharma (1), Fabian Wendt (6), Sandra Goetze (6,7,8), Gregory R Keele (9), Bernd Wollscheid (6,7,8), Ruedi Aebersold (1,10,*) and Patrick G A Pedrioli (1,6,7,8,*)

1 Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
2 PhD Program in Systems Biology, University of Zurich and ETH Zurich, Zurich, Switzerland
3 IBM Research Europe, Rüschlikon, Switzerland
4 Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg, Luxembourg
5 Queen’s University Belfast, Belfast, UK
6 Department of Health Sciences and Technology, Institute of Translational Medicine, ETH Zurich, Zurich, Switzerland
7 ETH Zürich, PHRT-CPAC, Zürich, Switzerland
8 SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
9 The Jackson Laboratory, Bar Harbor, ME, USA
10 Faculty of Science, University of Zurich, Zurich, Switzerland

* Corresponding authors. Ruedi Aebersold, Tel: +41 44 633 31 70, E-mail: [email protected]; Patrick G A Pedrioli, Tel: +41 44 633 21 95, E-mail: [email protected]

Molecular Systems Biology (2021) 17: e10240. https://doi.org/10.15252/msb.202110240

Abstract

Advancements in mass spectrometry-based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much-needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step-by-step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques that enable control of batch effect adjustment quality. We provide an R package, "proBatch", containing functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets, each encompassing hundreds of samples and covering multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.

Introduction

Recent advances in mass spectrometry (MS)-based proteomic approaches have significantly increased sample throughput and quantitative reproducibility. As a consequence, large-scale studies consisting of hundreds of samples are becoming increasingly common (Zhang et al, 2014, 2016; Liu et al, 2015; Mertins et al, 2016; Okada et al, 2016; Williams et al, 2016; Collins et al, 2017; Sajic et al, 2018). These technological and methodological advances, combined with the fact that proteins regulate the majority of biological processes, make MS-based proteomics a key methodology for studying physiological processes and diseases (Schubert et al, 2017). MS-derived quantitative measurements on thousands of proteins can, however, be affected by differences in sample preparation and data acquisition conditions, such as different technicians, reagent batches, or changes in instrumentation. This phenomenon, known as “batch effects”, introduces noise that reduces the statistical power to detect the true biological signal. In the most severe cases, the biological signal ends up correlating with technical variables, calling into question the validity of the biological conclusions (Petricoin et al, 2002; Hu et al, 2005; Akey et al, 2007; Leek et al, 2010).
Batch effects have been extensively discussed, both in the genomic community, which made major contributions to the problem about a decade ago (Leek et al, 2010; Luo et al, 2010; Chen et al, 2011; Dillies et al, 2013; Lazar et al, 2013; Chawade et al, 2014), and in the proteomic community, which has faced the issue more recently (Gregori et al, 2012; Karpievitch et al, 2012; Chawade et al, 2014; Välikangas et al, 2018). Nevertheless, finding solutions to the problem of batch effects is still a topic of active research. Although extensive reviews have been written on the topic (Leek et al, 2010; Lazar et al, 2013), the terminology remains a source of confusion. For example, the distinction between normalization, batch effect correction, and batch effect adjustment is not always clear, and these terms are often used interchangeably. To clarify how we use these terms in this Review, we compiled a glossary in Table 1. Some definitions are adapted from Leek et al, 2010.

Table 1. Terminology.

Batch effects: Systematic differences between the measurements due to technical factors, such as sample or reagent batches.

Normalization: Sample-wide adjustment of the data with the intention of bringing the distributions of measured quantities into alignment. Most prominently, sample means and medians are aligned after normalization.

Batch effect correction: Data transformation procedure that corrects quantities of specific features (genes, peptides, metabolites) across samples, to reduce differences that are associated with technical factors recorded in the experimental protocol (e.g., sample preparation or measurement batches). Samples are usually assumed to be normalized prior to batch effect correction. This step is often called "batch effect removal" or "batch effect adjustment" in the literature; note the difference from the definitions used here.

Batch effect adjustment: Data transformation procedure that adjusts for differences between samples due to technical factors that altered the data (sample-wise and/or feature-wise). The fundamental objective of batch effect adjustment is to make all samples comparable for a meaningful biological analysis. In our definition, batch effect adjustment is a two-step transformation: first normalization, then batch effect correction. Performing normalization first helps feature-level batch effect correction by alleviating sample-level discrepancies.

There is also considerable debate on which batch correction method performs best, and multiple articles have compared various methods (Luo et al, 2010; Chen et al, 2011; Chawade et al, 2014). Other publications advise checking the assumptions about the data before selecting the bias adjustment method (Goh et al, 2017; Evans et al, 2018). The issue of batch correction is further complicated by the fact that each technology faces different issues. Specifically, RNA-seq batch effect adjustment requires approaches that address sequencing-specific problems (Dillies et al, 2013). Similarly, MS methods in proteomics (e.g., data-dependent acquisition, DDA; data-independent acquisition, DIA; and tandem mass tag, TMT) present several field-specific challenges. First, there is the problem of peptide-to-protein inference (Clough et al, 2012; Choi et al, 2014; Rosenberger et al, 2014; Teo et al, 2015; Muntel et al, 2019). As protein quantities are inferred from the quantities of measured peptides or even fragment ions, one needs to decide at which level to correct the data.
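To make the inference hierarchy concrete, the sketch below rolls a long-format table of fragment intensities up to a peptide-level matrix. The table `fragments` and its columns (`protein`, `peptide`, `sample`, `intensity`) are hypothetical names, and summing fragment intensities is only one common roll-up choice; the point is that correcting at the fragment or peptide level means adjusting `intensity` before this aggregation, not after.

```r
# A minimal sketch, assuming a long-format table `fragments` with
# hypothetical columns `protein`, `peptide`, `sample`, and `intensity`.
library(dplyr)

peptide_matrix <- fragments %>%
  # sum fragment intensities per peptide and sample (one common roll-up)
  group_by(protein, peptide, sample) %>%
  summarise(intensity = sum(intensity, na.rm = TRUE), .groups = "drop") %>%
  # spread samples into columns: one row per peptide, one column per sample
  tidyr::pivot_wider(names_from = sample, values_from = intensity)
```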
Second, it is known that missing values can be associated with technical factors (Karpievitch et al, 2012; Matafora et al, 2017). Finally, when dealing with experiments with large sample numbers, typically on the order of hundreds, one needs to account for MS signal drift.

Here, we discuss the application of established approaches for batch effect adjustment, as well as methods that address MS-specific challenges. We start by providing an overview of the workflow and a definition of key terms for each step. In addition to covering batch effect assessment and adjustment, we summarize best practices for assessing the improvement in data quality post-correction. We also devote a section to the implications of missing values in relation to batch effects and the potential pitfalls of their imputation. We finish with a discussion and a future perspective on the presented approaches. To facilitate application to practical use cases, we illustrate all the relevant steps using three large-scale DIA and two DDA studies. For these "case studies", we primarily rely on the largest of the five datasets (i.e., the Aging mouse study; preprint: Williams et al, 2021) and refer to the others where appropriate. The data analyses we show are for illustration purposes only and are not intended for deriving new biological insights.

Workflow overview

The purpose of this article is to guide researchers working with large-scale proteomic datasets toward minimizing bias and maximizing the robustness and reproducibility of results generated from such data. The workflow starts from a matrix of quantified features (e.g., transitions, peptides, or proteins) across multiple samples, here referred to as the "raw data matrix", and finishes with "batch-adjusted" data, which are ready for downstream analyses (e.g., differential expression or network inference). We split the workflow into five steps, shown in Fig 1, and describe each of them below.

Figure 1. Batch effect processing workflow.
1. Initial assessment evaluates whether batch effects are present in the raw data.
2. Normalization brings all samples of the dataset to a common scale.
3. Diagnostics of batch effects in normalized data determines whether further correction is required.
4. Batch effect correction addresses feature-specific biases.
5. Quality control tests whether bias has been reduced while meaningful signals have been retained.
(A minimal code skeleton of these five steps is sketched below.)

In the context of this article, we use the term "adjust for batch effects" when referring to the whole workflow and "correct for batch effects" when referring to the correction of normalized data (see Table 1). We provide a checklist that summarizes the most important points of the protocol in Table 2. It is also important to stress that batch factors should already be considered in the experimental design phase, to ensure that the data are not biased beyond repair, something that can happen when biological groups are completely confounded with sample preparation batches (Hu et al, 2005; Gilad & Mizrahi-Man, 2015). For an extensive discussion of experimental design, we refer the reader to previously published materials on the topic (Oberg & Vitek, 2009; Čuklina et al, 2020). Here, we assume that the experiment has been designed with appropriate randomization and blocking, ensuring that bias caused by batch effects remains correctable.
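The following is a minimal, self-contained R sketch of the five steps on a simulated toy matrix. It is not the proBatch implementation: `sva::ComBat` stands in for the correction step, the simulated data and all variable names are illustrative, and real data would additionally require the missing-value handling discussed later.

```r
# Minimal five-step sketch on simulated data (illustrative only).
library(sva)  # Bioconductor package providing ComBat

set.seed(1)
raw <- matrix(rnorm(200 * 30, mean = 20), nrow = 200,
              dimnames = list(paste0("pep", 1:200), paste0("s", 1:30)))
batch <- factor(rep(c("A", "B", "C"), each = 10))  # 3 batches of 10 samples

## 1. Initial assessment: sample medians, to be inspected in running order
med <- apply(raw, 2, median)

## 2. Normalization: align each sample median to the global median
normalized <- sweep(raw, 2, med - median(med))

## 3. Diagnostics: leading principal components, to be colored by batch
pca <- prcomp(t(normalized), scale. = TRUE)

## 4. Batch effect correction (discrete batches): ComBat
corrected <- ComBat(dat = normalized, batch = batch)

## 5. Quality control: within- vs. between-batch sample correlation
b    <- as.character(batch)
cc   <- cor(corrected)
same <- outer(b, b, "==")
ut   <- upper.tri(cc)
c(within = mean(cc[same & ut]), between = mean(cc[!same & ut]))
```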
Table 2. Batch effect processing checklist.

Experimental design (for details, see Čuklina et al, 2020):
- Randomize samples in a balanced manner to prevent confounding of biological factors with batches (technical factors).
- Consider adding replicates if possible, for example: (a) add replication for each technical factor; (b) regularly inject a sample mix every few (e.g., 10–15, with the exact number adjusted to the experimental conditions) samples for control; (c) incorporate a sample mix per batch.
- Record all technical factors, both plannable and unexpected.

Initial assessment:
- Check whether the sample intensity distributions are consistent.
- Check the correlation of all sample pairs.
- If intensities or sample correlations differ, check whether the intensities show batch-specific biases.

Normalization:
- Choose a normalization procedure appropriate for the biological background and data properties.

Diagnostics:
- Using diagnostic tools, determine whether batch effects persist in the data.
- Use quality control already at this step and skip the correction if it is not necessary.
- Tip: If the goal is to determine differentially expressed proteins, and the batch effects are discrete or linear, multi-factor ANOVA on normalized data is a sound statistical approach. It adjusts for batch effects while simultaneously identifying differentially expressed proteins. Note that "hits", or differentially expressed proteins, identified with this approach are valid even if diagnostic tools indicate the presence of batch effects. For more details on ANOVA methods, refer to Rice, 2006.

Batch effect correction:
- Choose a batch effect correction procedure appropriate for the biological background and data properties, especially those detected at the previous step.
- Repeat the diagnostic step.
- Assess the ultimate benefit with quality control.

Quality control:
- Compare the correlation of samples within and between batches. Pay special attention to replicate correlation, if replicates are available.
- Compare the correlation of peptides within and between proteins.

In the accompanying "proBatch" package, we implemented several methods with proven utility in batch effect analysis and adjustment. We also provide tips for integrating other tools that might be useful in this context, and for making them compatible. proBatch is available as a Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/proBatch.html) and a pre-built Docker container (https://hub.docker.com/r/digitalproteomes/probatch), as well as a GitHub repository (https://github.com/symbioticMe/batch_effects_workflow_code) of the workflow with all code and data required to reproduce the case study analyses. Extensive comparisons of various methods have been published previously (Luo et al, 2010; Chawade et al, 2014); here, we summarize the best practices from these papers, as well as from reviews (Leek et al, 2010; Lazar et al, 2013) and application papers (Collins et al, 2017; Sajic et al, 2018), and turn them into principles that can guide the reader in choosing an appropriate methodology.

Raw data matrix: choosing between protein/peptide/fragment level

This workflow starts with a raw data matrix, for which initial steps such as peptide-spectrum matching, quantification, and FDR control have been completed. Data are assumed to be log-transformed unless the variance stabilizing transformation (Durbin et al, 2002) is used; in the latter case, the data transformation is included in the normalization procedure. We suggest performing batch effect adjustment on the peptide or fragment ion level, as this procedure alters feature abundances that are critical for protein quantity inference (Clough et al, 2012; Teo et al, 2015). We also suggest that all detected peptides, including non-proteotypic peptides and peptides with missed cleavages, be kept under consideration during batch effect adjustment. Keeping all measurements allows a better evaluation of the intensity distribution within each sample, which is critical for the subsequent normalization and correction steps.
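As a small illustration of this starting point, the snippet below log2-transforms a hypothetical intensity matrix; `raw_matrix` is an assumed name for a numeric feature-by-sample matrix of non-logged intensities.

```r
# A minimal sketch, assuming `raw_matrix` is a numeric feature-by-sample
# matrix of non-logged intensities (the name is hypothetical).
log_matrix <- log2(raw_matrix)
log_matrix[!is.finite(log_matrix)] <- NA  # log2(0) = -Inf -> treat as missing
```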
Initial assessment

The goals of the initial assessment phase are to determine the magnitude and sources of bias and to select a normalization method. In most cases, the intensity distributions differ among samples. Comparing global quantitative properties such as sample medians or standard deviations helps with the choice of normalization method and the identification of technical factors requiring further control. Three approaches are particularly useful for the initial assessment: (i) plotting the sample intensity average or median in order of MS measurement or technical batch, which allows estimation of MS drift or discrete bias in each batch; (ii) boxplots, which allow assessment of sample variance and outliers; and (iii) inter- vs. intra-batch sample correlation. A higher correlation of samples from the same batch compared with unrelated batches is a clear sign of bias. Optionally, a few proteins or peptides can be checked for signs of bias.

Normalization

The goal of normalization is to bring all samples to the same scale to make them comparable. Commonly used methods are quantile normalization, median normalization, and z-transformation. Two main considerations drive the choice of normalization method:

1. Heterogeneity of the data: If samples are fairly similar, the bulk of the proteome does not change, and thus techniques such as quantile normalization (Bolstad et al, 2003) can be used. In datasets in which the samples are substantially different (i.e., when a large fraction of the variables is either positively or negatively affected by the treatment), different methods, such as HMM-assisted normalization, can be used (Landfors et al, 2011). Additionally, if some samples are expected to have informative outliers (e.g., muscle tissue, in which a handful of proteins are several orders of magnitude more abundant than the rest of the proteome), methods that preserve the relationship of the outliers to the bulk proteome need to be used (Wang et al, 2021).

2. Distribution of sample intensities: The initial assessment step, especially the boxplots, indicates which level of correction is required: in most cases, shifting the means or medians is enough, but when variances differ substantially, these need to be brought to the same scale as well.

It should be noted that after normalization, no further data correction might be required. This can be determined with the diagnostic plots and quality control methods described below. If the results are satisfactory, keeping data manipulation minimal is advisable.
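The assessment plots and the two most common normalization options can be sketched in a few lines of base R. This sketch assumes `log_matrix` from the previous snippet, a numeric `run_order` vector giving each sample's MS acquisition order, and a per-sample `batch` factor (all hypothetical annotations); quantile normalization uses the Bioconductor package preprocessCore, whose output needs its dimnames restored.

```r
## Initial assessment: sample medians in measurement order, and boxplots
med <- apply(log_matrix, 2, median, na.rm = TRUE)
plot(run_order, med, col = batch,
     xlab = "MS run order", ylab = "Sample median")  # drift/batch steps show here
boxplot(log_matrix, outline = FALSE)                 # per-sample spread

## Median normalization: shift every sample to the global median
normalized <- sweep(log_matrix, 2, med - median(med))

## Quantile normalization (for fairly similar samples); dimnames are
## dropped by normalize.quantiles and must be restored by hand
qn <- preprocessCore::normalize.quantiles(log_matrix)
dimnames(qn) <- dimnames(log_matrix)
```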
Diagnostics of normalized data

While normalization makes the samples more comparable, it only aligns their global patterns. Therefore, batch effects affecting specific proteins or protein groups might still represent a major source of variance even after normalization. Thus, the diagnosis of batch effects is most informative when performed on normalized data. The diagnostic approaches can be divided into proteome-wide and peptide-level approaches. The main approaches for proteome-wide diagnostics are as follows:

Hierarchical clustering is an algorithm that groups similar samples into a tree-like structure called a dendrogram. Similar samples cluster together, and the driving cause of this similarity can be visualized by coloring the dendrogram by technical and biological factors. Hierarchical clustering is often combined with a heatmap, which maps the quantitative values in the data matrix to colors and thus facilitates the assessment of patterns in the dataset.

Principal component analysis (PCA) is a technique that identifies the leading directions of variation, known as principal components. Projecting the data onto two component axes visualizes sample proximity. Additional coloring of the samples by technical or biological factors, or highlighting of replicates, facilitates the interpretation of what drives sample proximity. This technique is particularly convenient for assessing clustering by biological and technical factors or for checking replicate similarity. In our experience, visualization without overlapping sample points or labels works up to about 50–100 samples in a dataset.

One should be careful when interpreting proteome-wide diagnostics because these methods were designed for data matrices without missing values. Proteomic datasets often contain missing values for technical or biological reasons. For more details, we refer the reader to Box 1.

In proteomics, peptide-level diagnostics are as useful as proteome-wide diagnostics. As in other high-throughput measurements, individual features, in this case peptides, are visualized to check for batch-related bias. In proteomic datasets, spike-in proteins or peptides can be added as controls. In most DIA datasets, iRT peptides (Escher et al, 2012), if added in precise concentrations, are well suited for individual feature diagnostics. It should be noted that individual peptides respond differently to various batch effects, so checking a handful of peptides, whether endogenous or spiked-in, is necessary. Another reason to check individual peptides in proteomics is to examine trends associated with sample running order. These trends might occur as the MS signal deteriorates, and they require special correction approaches. Note that in proteomics, individual features are sometimes not peptides but transitions or peptide groups; thus, the methods referred to here as peptide-level diagnostics are applicable to any feature-level diagnostics.
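Both proteome-wide diagnostics can be run with base R alone. The sketch below assumes the `normalized` matrix and `batch` factor from the previous snippets and, as a simplification, drops features with missing values before clustering and PCA (see Box 1 for why this matters).

```r
## Keep only complete features; a simplification, see Box 1
complete <- normalized[complete.cases(normalized), ]

## Hierarchical clustering of samples, with the batch appended to each label
hc <- hclust(dist(t(complete)))
plot(hc, labels = paste(colnames(complete), batch, sep = "|"))

## PCA: do samples separate by batch along the leading components?
pca <- prcomp(t(complete), scale. = TRUE)
plot(pca$x[, 1:2], col = batch, pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(batch),
       col = seq_along(levels(batch)), pch = 19)
```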
Batch effect correction

Diagnostics help to determine whether batch effect correction is needed. While global sample patterns are corrected during normalization, batch effects affect specific features and feature groups, and that is the level at which they need to be corrected. In proteomic datasets, two types of batch effects are frequently encountered: continuous and discrete. Continuous batch effects, e.g., MS signal drift progressing from run to run during the sample measurement process, require fitting an order-specific curve, such as a LOESS fit or another continuous algorithm. Signal drifts are likely to occur in studies profiling hundreds of samples. This problem is more prominent in mass spectrometry than in next-generation sequencing and is thus still relatively new to the research community.

Discrete batch effects manifest as feature-specific shifts of each batch as a whole. Here, methods such as mean and median centering work very well. An advanced modification of the mean shift is provided by ComBat (Johnson et al, 2007), which uses a Bayesian framework and can be applied to proteomic data (Lee et al, 2019). However, ComBat requires that all features be represented in each of the batches. Therefore, especially in large-scale proteomic datasets, applying ComBat might require the removal of a substantial number of peptides that happen to be missing in at least one batch, regardless of how small this batch is (see Box 1 for details). Thus, one should be very careful when choosing the method for batch effect correction.

Quality control

The purpose of the quality control step is to determine whether the adjustment procedures (normalization and/or batch effect correction) have improved the data. At this step, the data after adjustment are compared with the raw data matrix. There are two types of criteria for evaluating data quality: (i) removal of the bias (negative control) and (ii) improvement of the data (positive control). Typically, bias is considered removed if the similarity between samples is no longer driven by technical factors. This means that neither hierarchical clustering nor PCA shows clustering by batch, and the correlation of samples from the same batch is no longer stronger than the correlation of samples from different batches.
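Tying the last two steps together, the sketch below implements a generic continuous correction (a per-feature LOESS trend over run order), a generic discrete correction (per-batch median centering), and the correlation-based quality-control check. It operates under the same assumptions as the earlier snippets (`normalized`: features x samples in measurement order; `run_order`: numeric; `batch`: factor); it is a sketch, not the proBatch implementation, with ComBat shown as a commented alternative. The default `span` and the minimum-point cutoff are arbitrary illustrative choices.

```r
library(sva)  # only needed for the ComBat alternative below

## Continuous drift: fit a LOESS trend per feature over run order and
## subtract it, keeping the feature's overall level
remove_drift <- function(x, run_order, span = 0.75) {
  t(apply(x, 1, function(y) {
    ok <- !is.na(y)
    if (sum(ok) < 10) return(y)  # too few points for a stable fit
    fit <- loess(y[ok] ~ run_order[ok], span = span)
    y[ok] <- y[ok] - fitted(fit) + mean(y[ok])
    y
  }))
}
drift_corrected <- remove_drift(normalized, run_order)

## Discrete batches: move each feature's per-batch median back onto its
## overall median
center_batches <- function(x, batch) {
  overall <- apply(x, 1, median, na.rm = TRUE)
  for (b in levels(batch)) {
    idx  <- batch == b
    bmed <- apply(x[, idx, drop = FALSE], 1, median, na.rm = TRUE)
    x[, idx] <- x[, idx] - bmed + overall
  }
  x
}
corrected <- center_batches(drift_corrected, batch)
## ComBat alternative; note it needs every feature observed in all batches:
# corrected <- ComBat(dat = drift_corrected[complete.cases(drift_corrected), ],
#                     batch = batch)

## Quality control: after adjustment, within-batch correlation should no
## longer exceed between-batch correlation
qc <- function(x, batch) {
  cc   <- cor(x, use = "pairwise.complete.obs")
  same <- outer(as.character(batch), as.character(batch), "==")
  ut   <- upper.tri(cc)
  c(within = mean(cc[same & ut]), between = mean(cc[!same & ut]))
}
rbind(before = qc(normalized, batch), after = qc(corrected, batch))
```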