A working guide to boosted regression trees

J. Elith (School of Botany, The University of Melbourne, Parkville, Victoria 3010, Australia; correspondence: j.elith@unimelb.edu.au), J. R. Leathwick (National Institute of Water and Atmospheric Research, PO Box 11115, Hamilton, New Zealand) and T. Hastie (Department of Statistics, Stanford University, CA, USA)

Journal of Animal Ecology, Volume 77, Issue 4, pp. 802-813. First published 08 April 2008. https://doi.org/10.1111/j.1365-2656.2008.01390.x

Summary

1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions.

2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion.

3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single-tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods.

4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand.
We use a data set of over 13 000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. We provide code and a tutorial to enable the wider use of BRT by ecologists.

Introduction

Ecologists frequently use models to detect and describe patterns, or to predict to new situations. In particular, regression models are often used as tools for quantifying the relationship between one variable and others upon which it depends. Whether analysing the body weight of birds in relation to their age, sex and guild; the abundance of squirrels as it varies with temperature, food and shelter; or vegetation type in relation to aspect, rainfall and soil nutrients, models can be used to identify variables with the most explanatory power, indicate optimal conditions and predict to new cases.

The past 20 years have seen a growing sophistication in the types of statistical model applied in ecology, with impetus from substantial advances in both statistics and computing. Early linear regression models were attractively straightforward, but too simplistic for many real-life situations. In the 1980s and 1990s, generalized linear models (GLM; McCullagh & Nelder 1989) and generalized additive models (GAM; Hastie & Tibshirani 1990) increased our capacity to analyse data with non-normally distributed errors (presence–absence and count data), and to model nonlinear relationships. These models are now widely used in ecology, for example for analysis of morphological relationships (Clarke & Johnston 1999) and population trends (Fewster et al. 2000), and for predicting the distributions of species (Buckland & Elston 1993).

Over the same period, computer scientists developed a wide variety of algorithms particularly suited to prediction, including neural nets, ensembles of trees and support vector machines. These machine learning (ML) methods are used less frequently than regression methods in ecology, perhaps partly because they are considered less interpretable and therefore less open to scrutiny. It may also be that ecologists are less familiar with the modelling paradigm of ML, which differs from that of statistics. Statistical approaches to model fitting start by assuming an appropriate data model, and parameters for this model are then estimated from the data. By contrast, ML avoids starting with a data model and rather uses an algorithm to learn the relationship between the response and its predictors (Breiman 2001). The statistical approach focuses on questions such as what model will be postulated (e.g. are the effects additive, or are there interactions?), how the response is distributed, and whether observations are independent. By contrast, the ML approach assumes that the data-generating process (in the case of ecology, nature) is complex and unknown, and tries to learn the response by observing inputs and responses and finding dominant patterns. This places the emphasis on a model's ability to predict well, and focuses on what is being predicted and how prediction success should be measured.

In this paper we discuss a relatively new technique, boosted regression trees (BRT), which draws on insights and techniques from both statistical and ML traditions. The BRT approach differs fundamentally from traditional regression methods that produce a single ‘best’ model, instead using the technique of boosting to combine large numbers of relatively simple tree models adaptively, to optimize predictive performance (e.g. Elith et al. 2006; Leathwick et al. 2006, 2008).
The boosting approach used in BRT has its origins in ML (Schapire 2003), but subsequent developments in the statistical community reinterpret it as an advanced form of regression (Friedman, Hastie & Tibshirani 2000). Despite clear evidence of strong predictive performance and reliable identification of relevant variables and interactions, BRT has rarely been used in ecology (although see Moisen et al. 2006; De’ath 2007). In this paper we aim to facilitate the wider use of BRT by ecologists, demonstrating its use in an analysis of relationships between frequency of capture of short-finned eels (Anguilla australis Richardson) and a set of predictors describing river environments in New Zealand. We first explain what BRT models are, and then show how to develop, explore and interpret an optimal model. Supporting software and a tutorial are provided as Supplementary material.

Explanation of boosted regression trees

BRT is one of several techniques that aim to improve the performance of a single model by fitting many models and combining them for prediction. BRT uses two algorithms: regression trees are from the classification and regression tree (decision tree) group of models, and boosting builds and combines a collection of models. We deal with each of these components in turn.

Decision trees

Modern decision trees are described statistically by Breiman et al. (1984) and Hastie, Tibshirani & Friedman (2001), and for ecological applications by De’ath & Fabricius (2000). Tree-based models partition the predictor space into rectangles, using a series of rules to identify regions having the most homogeneous responses to predictors. They then fit a constant to each region (Fig. 1), with classification trees fitting the most probable class as the constant, and regression trees fitting the mean response for observations in that region, assuming normally distributed errors. For example, in Fig. 1 the two predictor variables X1 and X2 could be temperature and rainfall, and the response Y the mean adult weight of a species. Regions Y1, Y2, etc. are terminal nodes or leaves, and t1, t2, etc. are split points. Predictors and split points are chosen to minimize prediction errors. Growing a tree involves recursive binary splits: a binary split is repeatedly applied to its own output until some stopping criterion is reached. An effective strategy for fitting a single decision tree is to grow a large tree, then prune it by collapsing the weakest links identified through cross-validation (CV) (Hastie et al. 2001).

Figure 1. A single decision tree (upper panel), with a response Y, two predictor variables X1 and X2, and split points t1, t2, etc. The bottom panel shows its prediction surface (after Hastie et al. 2001).

Decision trees are popular because they represent information in a way that is intuitive and easy to visualize, and have several other advantageous properties. Preparation of candidate predictors is simplified because predictor variables can be of any type (numeric, binary, categorical, etc.), model outcomes are unaffected by monotone transformations and differing scales of measurement among predictors, and irrelevant predictors are seldom selected. Trees are insensitive to outliers, and can accommodate missing data in predictor variables by using surrogates (Breiman et al. 1984).
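To make the grow-then-prune strategy concrete, the following sketch fits a single regression tree to simulated data and prunes it back to the size favoured by cross-validation. It uses the rpart package, which is not the software used in this paper, and the data and variable names are invented for illustration.

```r
# A minimal sketch (not from the paper's tutorial): grow a large regression
# tree on simulated data, then prune it back using cross-validated error.
library(rpart)

set.seed(1)
dat <- data.frame(X1 = runif(500), X2 = runif(500))
dat$Y <- ifelse(dat$X1 > 0.5 & dat$X2 > 0.3, 2, 1) + rnorm(500, sd = 0.2)

# Grow a deliberately large tree (a small cp allows many splits)
full_tree <- rpart(Y ~ X1 + X2, data = dat, method = "anova",
                   control = rpart.control(cp = 0.0001, xval = 10))

# Choose the complexity parameter with the lowest cross-validated error,
# then prune the weakest links back to that size
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(full_tree, cp = best_cp)

# The terminal-node means are the constants fitted to each rectangular region
print(pruned)
```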
The hierarchical structure of a tree means that the response to one input variable depends on values of inputs higher in the tree, so interactions between predictors are automatically modelled. Despite these benefits, trees are not usually as accurate as other methods such as GLM and GAM. They have difficulty in modelling smooth functions, even ones as simple as a straight-line response at 45° to two input axes. Also, the tree structure depends on the sample of data, and small changes in the training data can result in very different series of splits (Hastie et al. 2001). These factors detract from the advantages of trees, introducing uncertainty into their interpretation and limiting their predictive performance.

Boosting

Boosting is a method for improving model accuracy, based on the idea that it is easier to find and average many rough rules of thumb than to find a single, highly accurate prediction rule (Schapire 2003). Related techniques – including bagging, stacking and model averaging – also build and then merge results from multiple models, but boosting is unique because it is sequential: it is a forward, stagewise procedure. In boosting, models (e.g. decision trees) are fitted iteratively to the training data, using methods that gradually increase emphasis on observations modelled poorly by the existing collection of trees. Boosting algorithms vary in how they quantify lack of fit and select settings for the next iteration. The original boosting algorithms, such as AdaBoost (Freund & Schapire 1996), were developed for two-class classification problems. They apply weights to the observations, emphasizing poorly modelled ones, so the ML literature tends to discuss boosting in terms of changing weights. Here, though, we focus on regression trees (including logistic regression trees), and the intuition is different.

For regression problems, boosting is a form of ‘functional gradient descent’. Consider a loss function – in this case, a measure (such as deviance) that represents the loss in predictive performance due to a suboptimal model. Boosting is a numerical optimization technique for minimizing the loss function by adding, at each step, a new tree that best reduces (steps down the gradient of) the loss function. For BRT, the first regression tree is the one that, for the selected tree size, maximally reduces the loss function. For each following step, the focus is on the residuals: on variation in the response that is not so far explained by the model. [Technical aside: for ordinary regression and squared-error loss, standard residuals are used. For more general loss functions, the analogue of the residual vector is the vector of negative gradients. Deviance is used as the loss function in the software we use. The negative gradient of the deviance in a logistic or Poisson BRT model is the residual y – p, where y is the response and p the fitted probability or fitted Poisson mean. These residuals are fitted by a tree, and the fitted values are added to the current logit(p) or log(p).] For example, at the second step, a tree is fitted to the residuals of the first tree, and that second tree could contain quite different variables and split points compared with the first. The model is then updated to contain two trees (two terms), the residuals from this two-term model are calculated, and so on. The process is stagewise (not stepwise), meaning that existing trees are left unchanged as the model is enlarged; only the fitted value for each observation is re-estimated at each step to reflect the contribution of the newly added tree.
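For the simplest case of squared-error loss, where the negative gradient is just the ordinary residual, the stagewise process described above can be sketched in a few lines of R. This is a schematic only: the data are simulated, rpart stumps stand in for the base learner, and the gbm implementation used later in the paper adds stochastic subsampling and other refinements.

```r
# Schematic of forward stagewise boosting with squared-error loss
# (illustration only; not the gbm algorithm used in the paper).
library(rpart)

set.seed(1)
n   <- 500
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

lr      <- 0.05                      # learning rate (shrinkage)
n_trees <- 200                       # number of boosting iterations
fitted  <- rep(mean(dat$y), n)       # start from a constant (the mean)
trees   <- vector("list", n_trees)

for (m in seq_len(n_trees)) {
  dat$resid <- dat$y - fitted        # negative gradient = residuals here
  # fit a small tree (a stump, i.e. tc = 1) to the current residuals
  trees[[m]] <- rpart(resid ~ x, data = dat,
                      control = rpart.control(maxdepth = 1, cp = 0))
  # add the shrunken tree to the model; earlier trees are left unchanged
  fitted <- fitted + lr * predict(trees[[m]], dat)
}
# 'fitted' now holds the stagewise approximation to the underlying curve
```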
The final BRT model is a linear combination of many trees (usually hundreds to thousands) that can be thought of as a regression model where each term is a tree. We illustrate the way in which the trees combine and contribute to the final fitted model in a later section, ‘How multiple trees produce curvilinear functions’. The model-building process performs best if it moves slowly down the gradient, so the contribution of each tree is usually shrunk by a learning rate that is substantially less than one. Fitted values in the final model are computed as the sum of all trees multiplied by the learning rate, and are much more stable and accurate than those from a single decision tree model.

As with GLM, BRT models can be fitted to a variety of response types (Gaussian, Poisson, binomial, etc.) by specifying the error distribution and the link. Ridgeway (2006) provides mathematical details for the distributions available in the software we use here, including calculations for deviance (the loss function), initial values, gradients, and the constants predicted in each terminal node. Some loss functions are more robust to noisy data than others (Hastie et al. 2001). For example, binomial data can be modelled in BRT with several loss functions: exponential loss makes the models similar to boosted classification trees such as AdaBoost, but binomial deviance is more robust and likely to perform better in data where classes may be mislabelled (e.g. false negative observations).

From a user's point of view, important features of BRT as applied in this paper are as follows. First, the process is stochastic: it includes a random or probabilistic component. The stochasticity improves predictive performance, reducing the variance of the final model, by using only a random subset of data to fit each new tree (Friedman 2002). This means that, unless a random seed is set initially, final models will be subtly different each time they are run. Second, the sequential model-fitting process builds on trees fitted previously, and increasingly focuses on the observations that are hardest to predict. This distinguishes the process from one where a single large tree is fitted to the data set. However, if a single tree were the perfect fit, a boosted model would probably fit it as a sum of identical shrunken versions of itself. Third, values must be provided for two important parameters. The learning rate (lr), also known as the shrinkage parameter, determines the contribution of each tree to the growing model, and the tree complexity (tc) controls whether interactions are fitted: a tc of 1 (a single decision stump with two terminal nodes) fits an additive model, a tc of 2 fits a model with up to two-way interactions, and so on. These two parameters together determine the number of trees (nt) required for optimal prediction. Finally, prediction from a BRT model is straightforward, but interpretation requires tools for identifying which variables and interactions are important, and for visualizing the fitted functions. In the following sections, we use a case study to show how to manage these features of BRT in a typical ecological setting.
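These user-level choices map directly onto arguments of the gbm package used in this paper. The sketch below shows the correspondence on simulated data; the data frame, formula and settings are invented for illustration, and the paper's analyses use custom wrapper code around gbm rather than a bare call like this.

```r
# Minimal sketch of how the user-level BRT settings map onto gbm arguments.
# The data are simulated for illustration; they are not the paper's data.
library(gbm)

set.seed(1)
n <- 1000
sim_dat <- data.frame(
  x1 = runif(n),                                            # continuous predictor
  x2 = runif(n),                                            # continuous predictor
  x3 = factor(sample(c("a", "b", "c"), n, replace = TRUE))  # categorical predictor
)
# a made-up presence/absence response related to x1 and x2
sim_dat$y <- rbinom(n, 1, plogis(-1 + 3 * sim_dat$x1 - 2 * sim_dat$x2))

brt_fit <- gbm(
  y ~ x1 + x2 + x3,
  data              = sim_dat,
  distribution      = "bernoulli",  # binomial deviance loss, logit link
  n.trees           = 1000,         # nt: number of trees
  shrinkage         = 0.005,        # lr: learning rate
  interaction.depth = 5,            # tc: tree complexity
  bag.fraction      = 0.5           # random subset used to fit each tree
)
summary(brt_fit, plotit = FALSE)    # relative influence of each predictor
```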
The case study

We demonstrate the use of BRT with data describing the distribution of, and environments occupied by, the short-finned eel (Anguilla australis) in New Zealand. We aim to produce a model that not only identifies major environmental determinants of A. australis distribution, but also can be used to predict and map its occurrence in unsampled rivers. The model will be a form of logistic regression that models the probability that the species occurs, y = 1, at a location with covariates X, P(y = 1 | X). This probability will be modelled via a logit: logit P(y = 1 | X) = f(X).

Anguilla australis is a freshwater eel native to south-eastern Australia, New Zealand and western Pacific islands. Within New Zealand it is a common freshwater species, frequenting lowland lakes, swamps, and sluggish streams and rivers in pastoral areas, and forming a valuable traditional and commercial fishery. Short-finned eels take 10–20 years to mature, then migrate – perhaps in response to rainfall or flow triggers – to the sea to spawn. The eels spawn at considerable depth, then larvae are brought back to the coast on ocean currents and metamorphose into glass eels. After entering freshwater, they become pigmented and migrate upstream. They tend not to penetrate as far upstream as long-finned eels (Anguilla dieffenbachii), probably because there is little suitable habitat further inland rather than because they are unable to do so (McDowall 1993).

The data set, developed for research and conservation planning in New Zealand, is described in detail by Leathwick et al. (2008). Briefly, species data were records of species caught at 13 369 sites spanning the major environmental gradients in New Zealand's rivers. Anguilla australis was caught at 20% of sites. Because this is a much larger data set than is often available in ecology, here we subsample the 13 369 sites, usually partitioning off 1000 records for modelling and keeping the remainder for independent evaluation. The explanatory variables were a set of 11 functionally relevant environmental predictors (Table 1) that summarize conditions over several spatial scales: local (segment and reach) scale, upstream catchment scale, and downstream to the sea. Most were available as GIS data for the full river system of New Zealand, enabling prediction to all rivers. The exception was one variable describing local substrate conditions (LocSed), which had records at only 82% of sites. The 12th variable was categorical, and described fishing method (Table 1). Given these records and covariates, the logistic regression will model the joint probability of occurrence and capture of A. australis.
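Putting these pieces together, the model fitted to these records combines the logit formulation above with the additive-trees structure described earlier. The display below uses our own notation rather than the paper's: f_0 is the initial constant, lr the learning rate, and each T_m a regression tree of the chosen complexity fitted at stage m to the negative gradient of the binomial deviance.

\[
\operatorname{logit} P(y = 1 \mid X) \;=\; f(X) \;=\; f_0 + \mathrm{lr} \sum_{m=1}^{nt} T_m(X)
\]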
Table 1. Environmental variables used to model fish occurrence (variable, description, and mean and range)

Reach scale predictor
  LocSed      Weighted average of proportional cover of bed sediment (1 = mud, 2 = sand, 3 = fine gravel, 4 = coarse gravel, 5 = cobble, 6 = boulder, 7 = bedrock). Mean 3·77, range 1–7
Segment scale predictors
  SegSumT     Summer air temperature (°C). Mean 16·3, range 8·9–19·8
  SegTSeas    Winter air temperature (°C), normalized with respect to SegJanT. Mean 0·36, range –4·2 to 4·1
  SegLowFlow  Segment low flow (m³ s⁻¹), fourth-root transformed. Mean 1·092, range 1·0–4·09
Downstream predictors
  DSDist      Distance to coast (km). Mean 74, range 0·03–433·4
  DSDam       Presence of known downstream obstructions, mostly dams. Mean 0·18, values 0 or 1
  DSMaxSlope  Maximum downstream slope (°). Mean 3·1, range 0–29·7
Upstream/catchment scale predictors
  USAvgT      Average temperature in catchment (°C) compared with segment, normalized with respect to SegJanT. Mean –0·38, range –7·7 to 2·2
  USRainDays  Days per month with rain >25 mm. Mean 1·22, range 0·21–3·30
  USSlope     Average slope in the upstream catchment (°). Mean 14·3, range 0–41·0
  USNative    Area with indigenous forest (proportion). Mean 0·57, range 0–1
Fishing method
  Method      Fishing method in five classes: electric, net, spot, trap, mixture. Mean and range not applicable

Software and modelling

All models were fitted in R (R Development Core Team 2006) version 2.3.1, using the gbm package version 1.5-7 (Ridgeway 2006) plus custom code written by J.L. and J.E. Our code is available with a tutorial (Supplementary material). There is also a growing range of alternative implementations of boosted trees, but we do not address those here. The following two sections explain how to fit, evaluate and interpret a BRT model, highlighting features that make BRT particularly useful in ecology. For all settings other than those mentioned, we used the defaults in gbm.

Optimizing the model with ecological data

Model development in BRT is best understood in the context of other model-fitting practices. For all prediction problems, overfitting models to training data reduces their generality, so regularization methods are used to constrain the fitting procedure so that it balances model fit and predictive performance (Hastie et al. 2001). Regularization is particularly important for BRT because its sequential model fitting allows trees to be added until the data are completely overfitted. For most modelling methods, model simplification is achieved by controlling the number of terms. The number of terms is defined by the number of predictor variables and the complexity of the fitted functions, and is often determined using stepwise procedures (for a critique of these see Whittingham et al. 2006) or by building several models and comparing them with information-theoretic measures such as Akaike's information criterion (Burnham & Anderson 2002). Controlling the number of terms implies a prior belief that parsimonious models (fewer terms) provide better prediction. Alternatively, more terms can be fitted and their contributions downweighted using shrinkage (Friedman 2001). In conventional regression, this is applied as global shrinkage (direct, proportional shrinkage on the full model) using ridge or lasso methods (Hastie et al. 2001; Reineking & Schröder 2006). Shrinkage in BRT is similar, but is incremental, and is applied to each new tree as it is fitted. Analytically, BRT regularization involves jointly optimizing the number of trees (nt), learning rate (lr) and tree complexity (tc). We focus on trade-offs between these elements in the following sections, after explaining the role of stochasticity.
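In practice this joint optimization is usually approached by fixing lr and tc and estimating the optimal nt. The paper's tutorial provides wrapper code for this; the sketch below shows the same idea using only core gbm functions and internal cross-validation, reusing the simulated sim_dat data frame from the earlier sketch with illustrative settings.

```r
# Estimate the optimal number of trees (nt) for a chosen learning rate and
# tree complexity using gbm's built-in cross-validation.
# (Sketch only; reuses the simulated sim_dat data frame from the earlier example.)
library(gbm)

cv_fit <- gbm(
  y ~ x1 + x2 + x3,
  data              = sim_dat,
  distribution      = "bernoulli",
  n.trees           = 5000,        # fit more trees than we expect to need
  shrinkage         = 0.005,       # lr
  interaction.depth = 5,           # tc
  bag.fraction      = 0.5,
  cv.folds          = 10           # 10-fold cross-validation
)

best_nt <- gbm.perf(cv_fit, method = "cv")  # nt minimising the CV deviance
best_nt
```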
Boosting with stochasticity

Introducing some randomness into a boosted model usually improves accuracy and speed and reduces overfitting (Friedman 2002), but it does introduce variance in fitted values and predictions between runs (Appendix S1, see Supplementary material). In gbm, stochasticity is controlled through a ‘bag fraction’ that specifies the proportion of data to be selected at each step. The default bag fraction is 0·5, meaning that, at each iteration, 50% of the data are drawn at random, without replacement, from the full training set. Optimal bag fractions can be established by comparing predictive performance and model-to-model variability under different bag fractions. In our experience, stochasticity improves model performance, and fractions in the range 0·5–0·75 have given the best results for presence–absence responses. From here on we use a bag fraction of 0·5, but with new data it is worth exploring other values.

Number of trees vs. learning rate

The lr is used to shrink the contribution of each tree as it is added to the model. Decreasing (slowing) lr increases the number of trees required, and in general a smaller lr (and larger nt) is preferable, conditional on the number of observations and the time available for computation. The usual approach is to estimate optimal nt and lr with an independent test set or with CV, using deviance reduction as the measure of success. The following analysis demonstrates how performance varies with these parameters, using a subsample of the data set for model fitting and the remaining data for independent evaluation. Using a set of 1000 sites and 12 predictor variables, we fitted BRT models with varying values for nt (100–20 000) and lr (0·1–0·0001), and evaluated them on the 12 369 excluded sites. Example code is given in the online tutorial (see Supplementary material). Results for up to 10 000 trees and for tc values of 1 and 5 are shown in Fig. 2. Our aim here is to find the combination of parameters (lr, tc and nt) that achieves minimum predictive error (minimum error for predictions to independent samples). A value of 0·1 for lr (not plotted) was too fast for both tc values: at each addition of trees above the minimum of 100 trees, predictive deviance increased, indicating that overfitting occurred almost immediately. The fastest feasible lr (0·05) fitted relatively few trees, did not achieve minimum error for tc = 1 (see the horizontal dashed line) or tc = 5, and in both cases predicted poorly as more trees were added (the curves rise steeply after reaching a minimum, indicating overfitting). In contrast, the smallest values for lr approached best predictive performance slowly, and required thousands of trees to reach minimum error. There was little gain in predictive power once more than 500 or so trees were fitted. However, slower lr values are generally preferable to faster ones, because they shrink the contribution of each tree more, and help the final model to reliably estimate the response. We explain this further in Appendix S1, and as a rule of thumb recommend fitting models with at least 1000 trees.

Figure 2. The relationship between number of trees and predictive deviance for models fitted with five learning rates and two levels of tree complexity. Deviance was calculated from models fitted to a data set of 1000 sites, and predicted to a data set of 12 369 sites. The lowest predictive deviance achieved for each panel is indicated by a dotted horizontal line; the line for the learning rate achieving this minimum is shown in bold.
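The kind of exploration behind Fig. 2 can be reproduced in outline: for each learning rate, fit a model with many trees and track predictive deviance on excluded data as trees are added. The sketch below is a simplified stand-in for the paper's tutorial code, reusing the simulated sim_dat data and one common definition of mean binomial deviance.

```r
# Sketch of the nt-vs-lr exploration behind Fig. 2: track predictive deviance
# on held-out data as trees are added, for several learning rates.
# (Reuses the simulated sim_dat data frame from the earlier sketches.)
library(gbm)

train <- sim_dat[1:500, ]      # stand-in for the sites used for model fitting
test  <- sim_dat[501:1000, ]   # stand-in for the excluded evaluation sites

binom_dev <- function(obs, pred) {        # mean binomial (Bernoulli) deviance
  -2 * mean(obs * log(pred) + (1 - obs) * log(1 - pred))
}

tree_seq <- seq(100, 5000, by = 100)
for (lr in c(0.05, 0.01, 0.005)) {
  fit <- gbm(y ~ x1 + x2 + x3, data = train, distribution = "bernoulli",
             n.trees = max(tree_seq), shrinkage = lr,
             interaction.depth = 5, bag.fraction = 0.5)
  # predicted probabilities for each value of nt in tree_seq (one column per nt)
  preds <- predict(fit, test, n.trees = tree_seq, type = "response")
  dev   <- apply(preds, 2, binom_dev, obs = test$y)
  cat("lr =", lr, " best nt =", tree_seq[which.min(dev)],
      " min deviance =", round(min(dev), 3), "\n")
}
```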
Tree complexity

Tree complexity – the number of nodes in a tree – also affects the optimal nt. For a given lr, fitting more complex trees leads to fewer trees being required to reach minimum error. So, as tc is increased, lr must be decreased if sufficient trees are to be fitted (tc = 5, Fig. 2b). Theoretically, tc should reflect the true interaction order in the response being modelled (Friedman 2001), but as this is almost always unknown, tc is best set with independent data.

Sample size influences the optimal settings for lr and tc, as shown in Fig. 3. For this analysis, the full data set was split into training sets of various sizes (6000, 2000, 1000, 500 and 250 sites), plus an independent test set (7369 sites). BRT models of 30 000 trees were then fitted over a range of values for tc (1, 2, 3, 5, 7, 10) and lr (0·1, 0·05, 0·01, 0·005, 0·001, 0·0005). We identified, for each parameter combination, the nt that achieved minimum prediction error, and summarized results as averages across tc (Fig. 3a) and lr (Fig. 3b). If the minimum was not reached by 30 000 trees, that parameter combination was excluded.

Figure 3. Predictive deviance as a function of data set size and (a) tree complexity; (b) learning rate. Models were run on data sets of 250–6000 sites, and minimum predictive deviance was estimated on an excluded data set of 7369 sites. ‘No result’ indicates that no model of up to 30 000 trees achieved a minimum predictive deviance with that parameter combination.

Predictive performance was influenced most strongly by sample size and, as expected, large samples gave models with lower predictive error. Gains from increased tc were greater with larger data sets, presumably because more data provided more detailed information about the full range of sites in which the species occurs, and the complexity in that information could be modelled better using more complex trees. Decision stumps (tc = 1) were never best (they always had higher predictive deviance), but for small samples there was no advantage – though also little penalty – in using larger (higher-tc) trees. The reason for not using the highest tc, though, is that the model would have to be learnt very slowly to achieve enough trees for reliable estimates. So, small samples here (e.g. 250 sites) would be best modelled with simple trees (tc of 2 or 3) and an lr slow enough to allow at least 1000 trees. As a general guide, lr needs to be decreased as tc increases, usually inversely: doubling tc should be matched with halving lr to give approximately the same nt. While the results here suggest that using a higher tc and a very slow lr is the best strategy (for samples >500 sites the curves keep descending), the other trade-off is computing time. For example, fitting BRT models on the 1000-site data set on a modern laptop and using our online code took 0·9