Abstract
For informed decision-making in biodiversity conservation and ecological management, accurate predictions of species abundance are essential. This study aimed to assess the predictive performance of random forest (RF) spatial variants in modelling species abundance distribution compared to standard RF, Poisson regression, their hybrid methods with ordinary kriging (OK), and the random generalised linear model (RGLM). Model performance in abundance modelling has rarely been quantified using a comprehensive index rather than existing single statistical indices. Therefore, modified Taylor diagrams were used to evaluate the models’ overall ability to predict species abundance spatial patterns, taking into account abundance class and detection probability. An exponential correlation function was used to generate spatially correlated random effects with and without a quadratic term and two variation strengths. Species abundance class and the relationship between abundance and independent variables determine which RF spatial variant performs the best. Spatial RF variants outperform conventional modelling in terms of prediction accuracy and power, particularly when spatial autocorrelation and species detection probabilities are high. RF spatial variants were less precise for common species than RGLM and GLM-OK, which better predicted species abundance for low or no spatial autocorrelation cases. However, none of the models outperformed the others for all prediction goals, highlighting the need for combining performance metrics to evaluate species abundance distribution models. The study highlights the importance of model specification in ecological research and cautions against the use of RF algorithms as a black box.
Introduction
The distribution of species abundance plays a crucial role in many ecological applications (McGill et al. 2007; Royle and Dorazio 2009; Baldridge et al. 2016; Su 2018). Species abundance data are relatively easy to collect and can reveal less obvious features of a community, such as how well organised it is and the extent of competition and predation within it (Verberk 2011). In-depth knowledge of species abundance distribution and associated factors is essential for understanding population dynamics, implications of management strategies, conservation planning, and climate impact assessment (McGill et al. 2007; Kellner and Swihart 2014; Baldridge et al. 2016; Su 2018).
Advanced computers and statistical methods are enabling scientists to better understand, quantify, and predict complex ecological processes. However, species occurrence, rather than abundance, has been the main focus of the development of spatial predictive modelling (Waldock et al. 2022). Models of species occurrence do not fully capture the changes in local abundance associated with changes in species distributions (Gregory et al. 2004; Lenoir and Svenning 2013; Howard et al. 2014; Hastings et al. 2020). For example, species present in large numbers at few sites may contribute significantly to ecological processes, but a focus on occurrence alone will overlook these species (Stuart-Smith et al. 2013; Johnston et al. 2015; Genung et al. 2020). Changes in abundance may also provide an early warning of a population decline, whereas patterns of occurrence may not change until the local population is decimated (O’Grady et al. 2004; Clements et al. 2017; Ceballos et al. 2020a; Hastings et al. 2020; Waldock et al. 2022).
Although Weber et al. (2017) showed a positive relationship between species abundance and environmental suitability, other studies reported a weak or non-existent relationship (Van Horne 1983; Dallas and Hastings 2018; Sporbert et al. 2020; Dallas and Santini 2020; Holt 2020). This relationship can be weakened due to species characteristics (Allee effect, species detectability, demographic stochasticity, non-equilibrium population states), environmental variability, measurement errors and spatial autocorrelation (Howard et al. 2014; Osorio-Olvera et al. 2019; Dallas and Santini 2020; Holt 2020; Waldock et al. 2022).
Measurement errors induce omissions, commissions, under-counting, over-counting, or imperfect detection (Borchers et al. 2015; Kéry and Royle 2016). Imperfect detection is common in ecological datasets, resulting in underestimated abundance and uncertainty (Royle et al. 2007; Kéry and Schmidt 2008; Kellner and Swihart 2014; Lahoz-Monfort et al. 2014; Benoit et al. 2018). Spatial autocorrelation is a measure of the degree to which a model’s residuals are spatially clustered considering the effects of covariates (Legendre 1993; Dormann et al. 2007).
Although not necessarily a challenge for statistical analysis, data’s spatial structure is often overlooked in many application fields and can lead to poor model performance (Brenning 2005; Ruß and Brenning 2010). Spatial structure is not taken into account by many popular machine learning (ML) techniques, including random forest (RF) (Roberts et al. 2017). However, these limitations have not been extensively explored in applied studies, especially in species abundance modelling (Johnson et al. 2013; Broms et al. 2016). Understanding how and why models behave differently for different species is critical to informing and managing conservation (Waldock et al. 2022).
Ecologists and statisticians are increasingly dealing with high-dimensional observations of species and environmental data (Beery et al. 2021). These data often have complex and non-linear interactions and missing values. Traditional statistical methods struggle to provide meaningful analyses of such data (De’ath and Fabricius 2000). RF, a flexible and powerful alternative to traditional statistical methods, is increasingly applied in various fields, particularly to ecological datasets, to study complex systems (Gislason et al. 2006; Jj et al. 2006; Prasad et al. 2006; Cutler et al. 2007; Kuhn and Johnson 2013; Merow et al. 2014; Zurell et al. 2016; Lucas 2020; Pichler et al. 2020; Martín et al. 2021; Beery et al. 2021; Ceulemans et al. 2021; Wardeh et al. 2021; Simon et al. 2023).
Despite the introduction of RF variants that account for spatial autocorrelation in data, standard RF remains widely used (Ahijevych et al. 2016; Fayad et al. 2016; Lim et al. 2019; Fox et al. 2020; Hengl et al. 2018; Liu et al. 2018; Georganos et al. 2021; Saha et al. 2023). In recent years, several methods have been proposed to adapt standard RF to a spatial framework by incorporating spatial information. One of these techniques, known as RF Residual Kriging (RFRK), is the most widely used. After fitting a standard RF, spatial adjustment is performed using ordinary kriging (OK) on RF residuals (Hengl et al. 2015; Liu et al. 2018). Another technique is the explicit inclusion of spatial information as additional covariates in the RF, which adversely affects predictive performance when there is a dominating covariate effect (Saha et al. 2022). Hengl et al. (2018) introduced a spatial RF (RFSP) that incorporates all pairwise buffer distances as additional factors. Georganos et al. (2021) developed the Geographical Random Forest (GRF), in which the global process is a decomposition of several local sub-models of nearby observations. It is based on the concept of spatially varying coefficient models (Fotheringham et al. 2003; Stewart et al. 2017).
Furthermore, Talebi et al. (2022) introduced a spatial RF variant based on higher-order spatial statistics. It uses local spatial-spectral information to learn intrinsic heterogeneity, spatial dependencies and complex spatial patterns, in contrast to the standard RF algorithm that uses pixel-wise spectral information as predictors. The RFGLS, developed by Saha et al. (2023), is a method used to estimate nonlinear covariate effects in spatial mixed models, where spatial correlation is handled using a Gaussian process. Just as generalised least squares (GLS) extends ordinary least squares (OLS) to account for dependence in linear models, RFGLS extends RF.
However, there is a lack of comparisons between spatial RF variants and spatial regression in modelling of the distribution of species abundance. Spatial regression models are known to perform well in predicting data with spatial autocorrelation. Li et al. (2017) found that GLM hybrids with geostatistical techniques were less accurate than RF-OK hybrids in modelling count data. Song et al. (2013) introduced the Random Generalised Linear Model (RGLM), which is a GLM-based ensemble predictor and combines the standard RF and GLM models using forward selection. Garcia-Marti et al. (2019) combined RF and count data models to better model overdispersed, skewed, and zero-inflated distributions. It combines the segmenting capabilities of standard RF, using decision tree rules to partition data into homogeneous groups, with count data models more appropriate for modelling count data.
Yet spatial abundance models are often better suited to common and widespread species and have shown their greatest applicability to questions of ecosystem function and service supply rather than modelling rare or endemic species threatened with extinction (Purvis et al. 2000; Ceballos et al. 2020b; Waldock et al. 2022). However, the effects of species features combined with spatial autocorrelation on the performance of RF within a spatial framework are less well understood. Other studies have shown that for other types of species, particularly wide-ranging species, it may be difficult to model abundance accurately because their environmental niches are less significant (Chisholm and Muller-Landau 2011; Chu et al. 2016; Bowler et al. 2017; Yenni et al. 2017; Hallett et al. 2018). Rare species with a small niche have more stable populations, making it easier to predict abundance distribution (Yenni et al. 2017; Thuiller et al. 2019; Waldock et al. 2022). This study explores the effectiveness of different RF spatial variants in improving species abundance predictions and evaluates the effectiveness of combining performance metrics in model selection for species abundance distribution modelling. Model selection involves evaluating competing models’ performance to choose the best model (Hastie et al. 2009).
This study uses modified Taylor diagrams to assess the performance of different modelling approaches. The overall performance of these models has not previously been comprehensively quantified beyond the existing individual statistical indices. The study aims to determine how different RF spatial variants perform in predicting species abundance distributions for different species and to select the most accurate model, accounting for species and data features. To this end, we: (1) tested the predictive performance of different modelling approaches using modified Taylor diagrams and various metrics of accuracy, discrimination and precision; (2) compared RF spatial variants with spatial regression modelling approaches and the random generalised linear model (RGLM); and (3) examined the influence of spatially correlated random effects’ complexity on their predictive performance.
Methods
Data simulation
To evaluate the effect of species and data features on RF spatial variants performance in predicting species abundance, we simulated virtual species abundance data as proposed by Guélat and Kéry (2018). Scenarios including cases with spatial autocorrelation and imperfect detection were considered in the present study. The following model was used to generate abundance data sets within a \(50 \times 50\) cell landscape.
Here, \(N_i\) is the true latent abundance generated randomly at each site i; \(x_1,\) \(x_2\) and \(x_3\) are continuous covariates following the standard normal distribution, of which only \(x_1\) is informative (\(x_2\) and \(x_3\) were uninformative independent variables), as described in Table 1; \(\rho _i\) is a spatially correlated random effect produced by an exponential correlation function; \(\beta _0\) represents a constant term contributing to abundance; \(\beta _1\) is the growth rate coefficient of the exponential function; and \(\gamma\) is the autocorrelation’s strength. We also tested scenarios where \(\rho _i\) was a spatially correlated random effect produced by an exponential function with a quadratic term \(({x_1}^2).\)
The strength of pairwise correlations in the landscape is based on the distance \(d_{ij}\) between sites i and j, determined by the covariance matrix of the multivariate normal distribution (MVN). Additionally, \(\sigma ^2\) is the spatial variance and \(\theta\) is the scale parameter controlling the distance-dependent decay of the spatial correlation in the expected abundance. \(C_{it}\) represents the observed data, that is, the counts at site i during visit t, and \(p\) is the detection probability, which describes abundance measurement error. Observed data are randomly generated conditional on \(N_i.\) For each site, three visits were considered to account for imperfect detectability (p). For each case, 200 sites were selected to test the models. A common species occurs in large numbers in a specific ecosystem or habitat, while a rare species occurs in low numbers. For each case, 200 datasets were generated.
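The simulation steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a log link for expected abundance, that \(\gamma\) scales the spatial effect \(\rho_i\), and a binomial observation process with detection probability p. The grid size, parameter values and function names (`exp_cov`, `cholesky`, `rpois`, `simulate`) are hypothetical and chosen only to keep the example small.

```python
import math
import random

def exp_cov(coords, sigma2, theta):
    """Exponential covariance: sigma2 * exp(-d_ij / theta)."""
    n = len(coords)
    return [[sigma2 * math.exp(-math.dist(coords[i], coords[j]) / theta)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def rpois(lam, rng):
    """Knuth's Poisson sampler; adequate for the moderate means used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(side=6, beta0=0.5, beta1=0.5, gamma=0.5,
             sigma2=1.0, theta=2.0, p=0.5, visits=3, seed=1):
    """Simulate latent abundance N_i and repeated counts C_it on a side x side grid."""
    rng = random.Random(seed)
    coords = [(i, j) for i in range(side) for j in range(side)]
    n = len(coords)
    # Spatially correlated random effect: rho = L z, with z standard normal (MVN draw).
    low = cholesky(exp_cov(coords, sigma2, theta))
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    rho = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    # Only x1 is informative; the log link and the gamma scaling are assumptions.
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    lam = [math.exp(beta0 + beta1 * x1[i] + gamma * rho[i]) for i in range(n)]
    abundance = [rpois(l, rng) for l in lam]
    # Each visit detects each individual independently with probability p.
    counts = [[sum(rng.random() < p for _ in range(abundance[i]))
               for _ in range(visits)] for i in range(n)]
    return coords, x1, abundance, counts
```

Because counts are drawn conditional on the latent abundance, every simulated count is bounded above by \(N_i\), which is the mechanism by which imperfect detection leads to underestimated abundance.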
Dispersion coefficient \(\alpha\) was used to determine the dispersion level for each dataset and each case. The coefficient \(\alpha\) was estimated using ordinary least squares (OLS) auxiliary regression. It was tested using the t statistic, which is asymptotically standard normal under the null hypothesis of equidispersion (Kleiber and Zeileis 2008). Standard Poisson models the conditional mean \(E(y)=\mu ,\) assumed equal to the variance \((var(y)=\mu )\) in case of equidispersion. We estimated the dispersion parameter \(\alpha\) with the linear variance function (quasi-Poisson model) such that
Overdispersion corresponds to \(\alpha > 0\) and underdispersion corresponds to \(\alpha < 0\) (Cameron and Trivedi 2005).
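As a rough illustration of the auxiliary regression, the sketch below assumes the linear variance function \(var(y) = (1+\alpha )\mu\) and fitted Poisson means that are already available. Under that assumption the auxiliary OLS regression has a single constant regressor and no intercept, so \(\hat{\alpha }\) reduces to the mean of the auxiliary response. The function name `dispersion_alpha` is hypothetical, and the code is not a substitute for the AER implementation used in the study.

```python
import math

def dispersion_alpha(y, mu):
    """Estimate alpha in Var(y) = (1 + alpha) * mu and its t statistic.

    y  : observed counts
    mu : fitted means from a Poisson model (assumed to be given)
    """
    # Auxiliary response of the Cameron-Trivedi regression-based test.
    aux = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu)]
    n = len(aux)
    # OLS on a constant regressor without intercept: the estimate is the mean.
    alpha = sum(aux) / n
    resid_var = sum((a - alpha) ** 2 for a in aux) / (n - 1)
    t_stat = alpha / math.sqrt(resid_var / n)
    return alpha, t_stat
```

A positive estimate with a large t statistic points to overdispersion, a negative one to underdispersion, in line with the sign convention stated above.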
Predictive methods
Spatial linear model
Suppose that \(y = (y(s_1), \ldots , y(s_n))\) is a response vector observed at locations \(s_i \in D \subset {\mathbb {R}}^2\). The spatial regression model is given by:
where X is an \(n \times p\) design matrix for the covariates, n is the number of locations (sample size), \(\beta\) is a \(p \times 1\) vector of unknown regression coefficients, Z is an \(n \times 1\) vector of spatially autocorrelated random variables, \(\epsilon\) is an \(n \times 1\) vector of independent random errors.
The spatial structure of \(\epsilon _i\)’s covariance is defined by a variogram. The \(n \times n\) covariance matrix \(\Sigma\) for the spatial linear model is given by:
We assume a stationary covariance function that depends on Euclidean distance and has an exponential form to simplify the estimation of Eq. 7. Thus, the entry (i, j) of R is expressed as:
where \(\Vert \cdot \Vert\) denotes the Euclidean distance metric, and \(\sigma _z^2\) and A are the parameters to be estimated. In the geostatistical literature, the parameters \(\sigma _z^2,\) A, and \(\sigma _\epsilon ^2\) are referred to as the partial sill, range, and nugget, respectively. The nugget parameter models the residual variation in the response when the separation distance is zero (Cressie and Wikle 2011). The spatial linear model (SLM) represented by the model in Eq. 6 is also known as the Universal Kriging model when used for spatial forecasting, the Ordinary Kriging model being the special case when X is an \(n \times 1\) column vector of ones (Cressie and Wikle 2011; Fox et al. 2020). In the present study, we compared four types of covariance functions (exponential, Matérn, spherical, and Gaussian) for the spatial linear model (Chilès and Delfiner 2012), and the best covariance function was selected for model comparison. Restricted maximum likelihood (REML) estimation was used for parameter estimation of the spatial regression model (Webster and Oliver 2007).
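As a compact summary of the definitions in this subsection, the spatial linear model and its covariance can be written as below. This is one reading that is consistent with the symbols defined above (partial sill \(\sigma _z^2\), range A, nugget \(\sigma _\epsilon ^2\)). The exact scaling of the range parameter in the exponential correlation is an assumption, since some parametrisations include a constant factor in the exponent.

```latex
\begin{aligned}
y &= X\beta + Z + \epsilon , \\
\Sigma &= \sigma _z^2 R + \sigma _\epsilon ^2 I , \\
R_{ij} &= \exp \left( -\frac{\Vert s_i - s_j \Vert }{A} \right) ,
\end{aligned}
```

Here I is the \(n \times n\) identity matrix, so the nugget contributes only to the diagonal of \(\Sigma\), while the partial sill scales the distance-dependent correlation.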
Random forest
Random forest (RF) is a data-driven statistical method, an ensemble learning algorithm primarily used for classification or regression. It was developed to improve the prediction accuracy of classification and regression trees by combining a large set of decision trees (Breiman 2001). The algorithm benefits from two powerful techniques: random subspace selection at each split (using the “Classification and Regression Trees (CART) split criterion”; Breiman et al. 1984) and bagging (a contraction of “bootstrap-aggregating”) of unpruned decision tree learners (Breiman 1996). In regression, Random Forest (RF) predictions \((\hat{\theta })\) are obtained by averaging results from a given number (B) of individual decision trees \((t_b^*)\) based on generated bootstrap samples (K), as described in the literature (Breiman et al. 1984; Breiman 2001; Prasad et al. 2006; Biau and Scornet 2016; Hengl et al. 2018):
where: b is an individual bootstrap sample, \(t_b^*\) is an individual decision tree, B is the total number of trees, and \(t_b^*(x) = t\left( x; z_{b1}^*, z_{b2}^*, \ldots , z_{bK}^*\right) ,\) and \(z_{bk}^{*}\) \((k=1, \ldots , K) = (y_k, x_k)\) is the kth training sample with pairs of values for the target variable (y) and covariates (x). We used default values in the regression for the number of possible splits in each node \((mtry =p/3) ,\) the number of trees \((ntree = 500) ,\) and the minimum terminal node size \((nodesize= 5) ,\) as these are often good options (Liaw and Wiener 2002; Díaz-Uriarte and Alvarez de Andrés 2006).
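The averaging over bootstrap trees can be illustrated with a deliberately small sketch. It assumes a single covariate and one-split regression trees (stumps), so it shows only the bagging step \(\hat{\theta }^B(x) = \frac{1}{B}\sum _{b=1}^{B} t_b^*(x)\) and not the random selection of mtry candidate covariates at each split. The names `fit_stump` and `bagged_predict` are hypothetical.

```python
import random
from statistics import mean

def fit_stump(sample):
    """Fit a one-split regression tree on (x, y) pairs by minimising squared error."""
    overall = mean(y for _, y in sample)
    best = (float("inf"), None, overall, overall)
    xs = sorted({x for x, _ in sample})
    for lo, hi in zip(xs, xs[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in sample if x <= cut]
        right = [y for x, y in sample if x > cut]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    # With no valid split (all x equal), the stump predicts the sample mean.
    return lambda x: ml if cut is None or x <= cut else mr

def bagged_predict(train, x_new, n_trees=500, seed=0):
    """Average the predictions of n_trees stumps, each fitted to a bootstrap sample."""
    rng = random.Random(seed)
    k = len(train)
    preds = []
    for _ in range(n_trees):
        boot = [train[rng.randrange(k)] for _ in range(k)]  # sample with replacement
        preds.append(fit_stump(boot)(x_new))
    return mean(preds)
```

For example, `bagged_predict([(0, 0), (1, 0), (2, 10), (3, 10)], 3)` returns a value between 0 and 10 that is larger than the prediction at `x_new=0`, because every bootstrap stump predicts at least as much on the right of its split as on the left for these data.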
Random forest variants for spatial framework
Random Forest for spatial data (RFSP)
RF is a non-spatial approach in that it does not take into account spatial heterogeneity or general sampling schemes when estimating model parameters. This may lead to non-optimum predictions and systematic over- or under-prediction, especially when spatial autocorrelation is high and point patterns show clear sampling bias. To overcome this, Hengl et al. (2018) proposed the “RFSP”, which uses buffer distances as additional predictor variables.
where \(x_P\) represents process-related covariates and \(x_G\) are covariates that take into account geographical proximity and spatial relationships between sampled sites.
where \(d_{pi}\) is the buffer distance to the sampled location pi, and N is the total number of sampled sites.
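A minimal sketch of how the geographical covariates \(x_G\) might be assembled is given below. It assumes that each buffer distance is the Euclidean distance from a prediction location to one sampled site, so every location receives N extra covariates. The function name `buffer_distance_covariates` is hypothetical and this is not the implementation of Hengl et al. (2018).

```python
import math

def buffer_distance_covariates(sampled, targets):
    """Return, for each target location, its distances to all N sampled sites."""
    return [[math.dist(t, s) for s in sampled] for t in targets]

# Usage: append the distances (x_G) to the process-related covariates (x_P).
sampled_sites = [(0.0, 0.0), (3.0, 4.0)]
x_p = [[0.7], [1.2]]                       # one process covariate per site
x_g = buffer_distance_covariates(sampled_sites, sampled_sites)
design = [p + g for p, g in zip(x_p, x_g)]  # rows passed to a standard RF
```

The number of added covariates grows with the number of sampled sites, which is one reason this variant becomes computationally heavier than adding coordinates alone.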
Spatial Random Forest
In this study, we use the term Spatial Random Forest (SRF) for a spatial RF variant in which, instead of adding buffer distances as additional predictor variables, only the spatial coordinates are added, such that:
where y is the dependent variable, \(x_P\) represents process-related predictor variables and \(x_G =(X,Y)\) with X and Y being spatial coordinates.
Random Forest residual Kriging
The RF residual kriging (RFRK) model is configured to perform a spatial adjustment by ordinary kriging on the standard RF residuals.
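Written out, the two-step prediction takes the generic regression-kriging form below. The notation (\(\hat{f}_{\text{RF}}\) for the fitted forest, \(e(s_i)\) for the training residuals and \(\lambda _i\) for the ordinary kriging weights) is introduced here for illustration and is not taken from the original text.

```latex
\begin{aligned}
e(s_i) &= y(s_i) - \hat{f}_{\text{RF}}\big( x(s_i)\big) , \\
\hat{y}_{\text{RFRK}}(s_0) &= \hat{f}_{\text{RF}}\big( x(s_0)\big) + \sum _{i=1}^{n} \lambda _i \, e(s_i), \qquad \sum _{i=1}^{n} \lambda _i = 1 .
\end{aligned}
```

The constraint that the weights sum to one is what makes the residual interpolation ordinary kriging; the weights themselves depend on the variogram fitted to the RF residuals.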
Geographical Random Forest
Geographical Random Forest (GRF) is a spatial RF variant developed by Georganos et al. (2021), in which the global process is decomposed into several local sub-models of nearby observations. It is based on the concept of spatially varying coefficient models used in geographically weighted regressions (Fotheringham et al. 2003; Stewart et al. 2017). A local random forest is computed for each sampled location i, but only the nearest ni observations are considered, resulting in the computation of a random forest for each training data location. This increases the flexibility of the locally calibrated RF compared to the global RF. Using the simplified version of the linear equation, we have the following:
where \(y_i\) is the observed value of the dependent variable for the location i, a is a coefficient, \(a(X_i,Y_i )x\) is the prediction (nonlinear) of the locally calibrated RF model on the location i, and \((X_i,Y_i)\) are its coordinates.
The longest distance between a data point and its kernel is called bandwidth, while the area where the sub-model operates is called neighbourhood (or kernel). In this work, we used an adaptive kernel where the given number of nearest neighbours to be selected determines the neighbourhood (Brunsdon et al. 1998; Fotheringham et al. 2003).
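The adaptive kernel can be sketched as a nearest-neighbour search: the neighbourhood is the set of the k closest training sites, and the bandwidth is the distance to the furthest of them. The function name `adaptive_neighbourhood` is hypothetical, and the brute-force sort is chosen for clarity rather than speed.

```python
import math

def adaptive_neighbourhood(target, coords, k):
    """Return the indices of the k nearest training sites and the implied bandwidth."""
    order = sorted(range(len(coords)), key=lambda i: math.dist(target, coords[i]))
    nearest = order[:k]
    bandwidth = math.dist(target, coords[nearest[-1]])
    return nearest, bandwidth
```

With an adaptive kernel the bandwidth therefore varies from site to site: it is short where sampled sites are dense and long where they are sparse, while the number of observations in each local sub-model stays fixed.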
Geographically Weighted Random Forest
Geographically Weighted Random Forest (GWRF) is a GRF model that gives more weight to observations that are spatially autocorrelated when calibrating the model. GRF does not use weights to calibrate the model. All observations have the same weight, regardless of geographical position. GWRF handles spatial autocorrelation and is therefore appropriate for data that are highly spatially autocorrelated. GRF is suitable for data that are not or only weakly autocorrelated (Georganos et al. 2021). We examined a range of neighbours’ values (n = 10, 20, 30, 40, and 50) to determine the optimal value setting for both GRF and GWRF.
Generalised least squares-based random forest
Just as generalised least squares (GLS) extends ordinary least squares (OLS) to account for dependence in linear models, random forest based on generalised least squares (RF-GLS) extends RF. RF-GLS estimates non-linear covariate effects in spatial mixed models where spatial correlation is handled by a Gaussian process. The following mixed model considers spatial point data:
where \(y_i\) and \(x_i\) denote the observed values of the dependent and predictor variables, respectively, corresponding to the \(i\text{th}\) observed location \(s_i;\) \(m(x_i)\) denotes the covariate effect; \(w(s_i)\) is the spatial random effect accounting for spatial dependence beyond the covariates modelled by a Gaussian process, and \(\epsilon _i\) accounts for independent and identically distributed Gaussian random noise.
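From the term definitions above, the mixed model can be written as follows. The covariance function \(C(\cdot , \cdot )\) of the Gaussian process and the noise variance \(\tau ^2\) are symbols assumed here for illustration.

```latex
\begin{aligned}
y_i &= m(x_i) + w(s_i) + \epsilon _i , \\
w(\cdot ) &\sim \text{GP}\big( 0, C(\cdot , \cdot )\big) , \qquad \epsilon _i \overset{\text{iid}}{\sim } N(0, \tau ^2) .
\end{aligned}
```

In this formulation the forest estimates only the mean function \(m\), while the spatial dependence is carried by \(w\), which is what separates RF-GLS from variants that pass spatial information to the forest as covariates.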
In the present study, the R package RandomForestsGLS was used to fit and predict the abundance of species (Saha et al. 2023). This R package uses the computationally efficient Nearest Neighbour Gaussian Process (NNGP) (Datta et al. 2016). Model parameters were estimated from the generated data following Saha et al. (2022). By integrating the non-linear mean estimate and the spatial kriging estimate from the BRISC package (Saha and Datta 2018), as explained by Saha et al. (2022), spatial prediction at new locations using nonlinear kriging is provided. We evaluated the four different covariance functions supported by the RandomForestsGLS package: exponential, Matérn, spherical, and Gaussian. We also evaluated the different numbers of neighbours used in the NNGP (10, 20, 30, 40 and 50). The results of this study were derived using the exponential covariance function, as it best predicts the structure of the simulated data.
GLM hybrid methods with random forest and ordinary kriging
GLM and Ordinary Kriging hybrid model
For the GLM and ordinary kriging hybrid (GLM-OK) model, a spatial adjustment was performed by ordinary kriging on the residuals of the Poisson model. As the response consists of count data, the Poisson distribution was used.
The Random Generalised Linear Model
The Random Generalised Linear Model (RGLM) is an ensemble prediction method based on bootstrapped GLMs whose predictor variables are selected by forward regression using the AIC criterion. Since species abundances are count data, the Poisson distribution was used in the RGLM. Song and Langfelder (2022) described the construction of the RGLM method.
Model performance assessment
We randomly sampled 200 observations from the generated datasets, which were divided into training (80%), and testing (20%) datasets. We used 5-fold cross-validation to evaluate the performance of the selected methods (Hastie et al. 2009). Performance metrics were averaged over 200 independent simulation runs to reduce the influence of randomness associated with 5-fold cross-validation. The measures of accuracy, precision, and discrimination were used to assess the predictive performance of the models that were tested.
- Accuracy: The root mean square error (RMSE) and the mean error (ME) or bias between the RF predicted value and the observed abundance values at the sampled location were estimated as follows:
$$\begin{aligned} \text{RMSE}= \sqrt{\frac{1}{n} \sum_{i=1}^{n}(O_i - P_i)^2}, \end{aligned}$$ (15)
$$\begin{aligned} \text{ME}= \frac{1}{n} \sum_{i=1}^{n}(O_i - P_i), \end{aligned}$$ (16)
where \(O_i\) and \(P_i\) refer to observed and predicted species abundance at sampled locations i, respectively, and n is the number of sampled locations (number of observations).
- Precision: We used the mean of the standard deviations of the predictions (SDP).
- Discrimination: The mean squared Spearman rank correlation coefficient \(R^2 = \rho _s^2\) between the predicted and observed abundance of species at the sampled sites and the modelling efficiency coefficient (MEC) of Nash and Sutcliffe (1970) were used.
The n raw observed \(O_i\) and predicted \(P_i\) are converted to ranks \(R(O_i),\) \(R(P_i)\) and (\(\rho _s\)) is defined as their Pearson correlation coefficient.
$$\begin{aligned} \rho _s = \frac{\text{cov}(R(O), R(P))}{\sigma _{R(O)} \sigma _{R(P)}}, \end{aligned}$$ (17)
where \(\text{cov}(R(O), R(P))\) represents the covariance between the rank variables of O and P, \(\sigma _{R(O)}\) and \(\sigma _{R(P)}\) are standard deviations of the rank variable of O and P.
The MEC is calculated as:
$$\begin{aligned} \text{MEC} = 1 - \frac{\sum _{i=1}^{n}(O_i - P_i)^2}{\sum _{i=1}^{n}(O_i - \bar{O})^2}, \end{aligned}$$ (18)
where \(O_i\) and \(P_i\) refer to observed and predicted species abundance at sampled location i, n is the number of sampled locations and \(\bar{O}\) represents the mean observed species abundance across all sampled locations.
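The metrics in Eqs. 15 to 18 translate directly into code. The sketch below is an illustration using only the Python standard library; the function names are hypothetical, and tied values receive average ranks, which is the usual convention for Spearman's coefficient but is an assumption about the original analysis.

```python
import math
from statistics import mean

def rmse(obs, pred):
    """Root mean square error (Eq. 15)."""
    return math.sqrt(mean((o - p) ** 2 for o, p in zip(obs, pred)))

def mean_error(obs, pred):
    """Mean error or bias (Eq. 16); negative values indicate over-prediction."""
    return mean(o - p for o, p in zip(obs, pred))

def ranks(values):
    """Ranks starting at 1, with ties given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def spearman(obs, pred):
    """Spearman rank correlation (Eq. 17): Pearson correlation of the ranks."""
    return pearson(ranks(obs), ranks(pred))

def mec(obs, pred):
    """Modelling efficiency coefficient of Nash and Sutcliffe (Eq. 18)."""
    mo = mean(obs)
    return 1 - sum((o - p) ** 2 for o, p in zip(obs, pred)) / sum((o - mo) ** 2 for o in obs)
```

Note that MEC has no lower bound: a model whose squared errors exceed the variability of the observations around their mean yields a negative value, which is how strongly negative MEC values can arise for models that are sensitive to extreme predictions.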
Model comparison
Kruskal–Wallis tests were used to identify significant differences in predictive performance between the models tested, as the different performance measures were either not normally distributed according to the Shapiro and Wilk (1965) test of normality, had heterogeneous variance according to the Fligner and Killeen (1976) test of homogeneity of variances, or both. Dunn’s (1964) post-hoc test for the Kruskal and Wallis (1952) test was used to compare their predictive performance. P-values were adjusted using the Benjamini and Hochberg (1995) procedure.
To graphically summarise the predictive performance of the models studied, a modified Taylor (2001) diagram was used. In the standard diagram, three statistics (the standard deviation of predictions, the root mean square error (RMSE) and the Pearson correlation coefficient) are plotted on a single graph. We used this diagram because an accuracy metric such as RMSE alone is not meaningful for highly skewed data. We replaced Pearson’s coefficient with Spearman’s rank-order correlation coefficient because the distributions of species abundance for common and rare species were skewed and zero-inflated, and we used the normalised standard deviation of predictions (NSTD) and the centred RMSE (CRMSE). The standard deviations of observed and predicted species abundances are calculated respectively by:
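For readers unfamiliar with the p-value adjustment, the Benjamini and Hochberg step-up procedure can be sketched as follows. The function name `benjamini_hochberg` is hypothetical, and the sketch covers only the adjustment, not the Kruskal–Wallis or Dunn tests themselves.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each raw p-value is scaled by the number of comparisons divided by its rank, so the smallest p-values are penalised most and the largest is left unchanged.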
and
The normalized STD (NSTD) is obtained by:
The centred RMSE (CRMSE) is defined as follows:
For each case, the best model was considered to be the one with the lowest CRMSE (for a perfect model, the CRMSE would equal 0) and with the normalised standard deviation and Spearman correlation coefficient closest to 1 (Jiang et al. 2015).
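The two non-correlation statistics of the modified diagram can be computed as in the sketch below. It assumes population standard deviations (division by n), NSTD defined as the ratio of the predicted to the observed standard deviation, and a CRMSE that is computed after removing each series' mean and is not further normalised. These conventions follow common usage of Taylor diagrams and may differ in detail from those of the study; the function name `taylor_statistics` is hypothetical.

```python
import math
from statistics import mean

def taylor_statistics(obs, pred):
    """Return (NSTD, CRMSE) for observed and predicted abundances."""
    n = len(obs)
    mo, mp = mean(obs), mean(pred)
    sd_obs = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    sd_pred = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    nstd = sd_pred / sd_obs
    # Centred RMSE: errors measured after subtracting each series' own mean.
    crmse = math.sqrt(sum(((p - mp) - (o - mo)) ** 2 for o, p in zip(obs, pred)) / n)
    return nstd, crmse
```

Because both series are centred, a constant bias does not affect either statistic: predictions that equal the observations plus a fixed offset give NSTD = 1 and CRMSE = 0, which is why bias (ME) is reported separately.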
Computational environment
Models were implemented in R version 4.2.2 (R Core Team 2022) using the raster package to manipulate data (Hijmans 2023), gstat for geostatistical analysis (Gräler et al. 2016) and AER to test the significance of dispersion (Kleiber and Zeileis 2008). The R packages used to predict the distribution of species abundance in this study are listed in Table 2.
Results
Table 3 presents descriptive statistics of simulated species abundances and their variation, including minimum and maximum values, arithmetic mean, median, and sum. As medians are smaller than means, simulated abundance datasets are skewed to the right for all species.
Variogram parameters of simulated species abundance are presented in Table 4. In all cases, variogram parameters are right-skewed, indicating that there are some extremely high values in the simulated dataset that pull the variogram parameter means upwards. This pattern is particularly apparent when the species is common with a high probability of detection and the abundance data are highly spatially autocorrelated. Data sets with large range produced fewer than five blocks in the \(50\times 50\) space. They were removed from the analysis because they did not produce enough blocks for fivefold cross-validation. Figure 1 shows how the spatial structure of the data, imperfect detection and species characteristics like rarity and commonness influence the variation in the abundance of a species. For example, a high degree of overdispersion will be observed if spatial autocorrelation is high and the probability of detecting a species is high, especially if the species is common. In contrast, if the probability of detecting a species is low, the spatial autocorrelation is low, and especially if the species is rare, there will naturally be little variability in the distribution of abundance and hence little dispersion in the abundance data. Results show that random sampling is effective in dealing with the dispersion in abundance data, whether the species is rare or common, even with significant spatial autocorrelation, as long as the sample is representative of the population. However, the dispersion of data collected within a population (sample) and the dispersion of data used for model training/evaluation may differ from the population. Therefore, a large sample is needed.
For each case, the proportion of simulated databases with significant overdispersion is shown in Table 5. Results reveal that when the species is common, the presence of high spatial autocorrelation leads to statistically significant over-dispersed abundance data in all (100%) of the simulated populations. When the spatial autocorrelation is low and the probability of detection is low, the relative abundance data for a common species is still highly over-dispersed in the majority of simulated populations (91%). However, when both spatial autocorrelation and detection probability are low, almost all simulated populations (98%) will have equidispersed abundance data if the species is rare, whereas when spatial autocorrelation is low but detection probability is high for a rare species, almost all simulated populations (98.5%) will have overdispersed abundance data. The fact that random sampling within a population can significantly alter the degree of dispersion within the sample collected is also highlighted. Depending on the data used to train and validate the models tested in this study, the proportions of over-dispersed samples vary significantly.
Assessment of model performance using single performance metrics for species abundance prediction
Random spatial effect is an exponential correlation function with linear terms only
Table 6 presents Kruskal–Wallis and Dunn’s post hoc tests results, which demonstrate that the predictive performance of the different modelling approaches is influenced by a number of factors, including selected metrics, abundance class, species detection probability, and spatial autocorrelation.
Case 1: Results reveal that the different modelling approaches have low predictive accuracy and modelling efficiency, with weaker relationships between predictions and observed abundances. RFGLS and SRF models exhibited the highest precision, although the RFGLS model demonstrated the weakest relationships between predicted and observed abundance. The least efficient model was the RGLM, with the RFSP and SRF yielding the highest modelling efficiency and predictive power.
Case 2: The predictive accuracies of the models are similar, but the GLM-OK model underpredicted species abundance and exhibited greater bias. The RFGLS and SRF models are more precise. The RFSP and GLM-OK models have the highest \(R^2\) and MEC values.
Case 3: RMSE values differ among the modelling approaches, while the predictive bias does not. The RGLM and GLM-OK models are the most accurate for abundance predictions, while OK is the least accurate. The predictive accuracy of the spatial RF variants is similar to that of standard RF for common species with low spatial autocorrelation and high detection probability. OK, despite low accuracy, provides the most precise predictions. RGLM and GLM-OK are the least precise models but offer the highest predictive power and efficiency. RFSP, RFRK, and SRF are the spatial RF variants that exhibited the highest \(R^2\) and MEC values.
Case 4: RMSE and biases of the models varied significantly. RGLM, GLM-OK, and GLM models demonstrated the highest predictive accuracy, while the GLM-OK model exhibited the highest bias. OK exhibited the highest precision, while RGLM, GLM-OK, and Poisson models demonstrated the highest predictive power and modelling efficiency. In comparison to cases involving common species, the accuracy metrics are relatively low for rare species due to the limited variation in their abundance.
Case 5: The predictive accuracy and biases of the models were similar, except for the RGLM, which is sensitive to extreme values. The GLM-OK algorithm is the least precise, while the RFGLS and SRF are the most precise. GLM-OK and RFSP have the highest predictive power, while SRF has the highest modelling efficiency. RGLM has the worst MEC, tending to negative infinity.
Case 6: Predictive accuracy and bias of the various modelling approaches are similar, with the RGLM exhibiting a sensitivity to extreme values in terms of bias and RMSE. The RFGLS is the most precise. The GLM-OK is the most biased, the least precise and underestimates species abundance, but is among models with the highest predictive power. The standard RF and RGLM models exhibited the lowest \(R^2\) values and yielded MEC values that tend towards negative infinity. SRF, followed by RFSP and OK models, exhibited the highest modelling efficiency.
In Cases 7 and 8, Poisson and Poisson hybrid models exhibited the greatest predictive accuracy, power, and efficiency. Conversely, OK exhibited the least predictive accuracy, the lowest predictive power, and modelling efficiency for case 8. However, it is the most precise model. GLM-OK exhibited bias, underestimated rare species abundance, and was the least precise. SRF, followed by RFGLS and RFRK, are spatial RF variants with the best predictive power and efficiency. However, they were not significantly different from standard RF.
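The single metrics compared across the eight cases can be sketched as follows. This is a minimal illustration, assuming that bias is the mean prediction error, precision is summarised by the standard deviation of the predictions, \(R^2\) is the squared Pearson correlation, and MEC is one minus the ratio of the error sum of squares to the total sum of squares; the exact definitions used in the study are given in its methods, and the function name `abundance_metrics` is hypothetical.

```python
import math
import statistics


def abundance_metrics(observed, predicted):
    """Single-index summaries of agreement between observed and predicted counts."""
    n = len(observed)
    obs_mean = statistics.fmean(observed)
    pred_mean = statistics.fmean(predicted)
    errors = [p - o for o, p in zip(observed, predicted)]
    sse = sum(e * e for e in errors)                  # error sum of squares
    sst = sum((o - obs_mean) ** 2 for o in observed)  # total sum of squares
    spp = sum((p - pred_mean) ** 2 for p in predicted)
    sop = sum((o - obs_mean) * (p - pred_mean)
              for o, p in zip(observed, predicted))
    r2 = (sop * sop) / (sst * spp) if sst > 0 and spp > 0 else float("nan")
    return {
        "rmse": math.sqrt(sse / n),                   # accuracy
        "bias": pred_mean - obs_mean,                 # mean error
        "sd_pred": statistics.stdev(predicted),       # spread of the predictions
        "r2": r2,                                     # predictive power
        "mec": 1 - sse / sst if sst > 0 else float("nan"),  # modelling efficiency
    }


if __name__ == "__main__":
    observed = [0, 1, 2, 3, 4]
    # A constant offset keeps R^2 at 1 but lowers the MEC
    print(abundance_metrics(observed, [1, 2, 3, 4, 5]))
    # A single extreme prediction drives the MEC to a large negative value
    print(abundance_metrics(observed, [0, 1, 2, 3, 1000]))
```

The second call shows why a model that is sensitive to extreme values can have an MEC that tends towards negative infinity: the MEC has no lower bound, whereas \(R^2\) stays between 0 and 1.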
Spatial random effect is an exponential correlation function with a quadratic term
Accuracy
Table 7 shows that the different modelling approaches predict species abundance with no significant difference in accuracy when the spatial random effect is an exponential correlation function with a quadratic term and the spatial autocorrelation is high. However, the RGLM is sensitive to outliers when the species is rare, although this difference is not statistically significant owing to the high variability of the estimates. The Poisson model and its hybrids exhibit higher prediction accuracy than RF and its spatial variants when the spatial autocorrelation is low or absent. The GLM-OK model is more biased than the other models in most cases. When species are rare, spatial RF variants yield predictions comparable to those of the standard RF. However, when species detection probability is low, the RFSP produces higher biases than the standard RF, although the difference is not statistically significant. For common species, the RFSP overestimates abundance when spatial autocorrelation is low or absent.
Precision
RFGLS and SRF models are more precise for common species with high spatial autocorrelation, while the RFRK model is the least precise. OK is the most precise model, followed by SRF and RFGLS, while the Poisson model and its hybrids (RGLM and GLM-OK) are the least precise when spatial autocorrelation is low or absent. The RFGLS and SRF models are more precise for rare species with high spatial autocorrelation, while the OK model remains the most precise for rare species with low or no autocorrelation (see Table 7).
Discrimination
Results reveal that RFSP has the highest predictive power \((R^2)\) for common species with high spatial autocorrelation, followed by GLM-OK and GWRF. Standard RF and RFGLS have the lowest \(R^2\) regardless of species abundance class for high spatially autocorrelated species abundance. GLM-OK, RGLM, RFSP, and Poisson models achieve the best \(R^2\) for common species with weak or no spatial autocorrelation, while OK yields the lowest \(R^2\) regardless of species abundance class. GLM-OK and RFSP models have the highest \(R^2\) for rare species with high spatial autocorrelation, while SRF and RFSP have the highest efficiency coefficients when spatial autocorrelation is high. RGLM has the highest efficiency coefficient when spatial autocorrelation is low or absent. However, it does not differ from GLM-OK when the species is common (see Table 7).
Evaluation of models’ predictive performance using modified Taylor diagrams
Training performance
Figures 2 and 3 illustrate that hybrid Poisson regression models are more precise than spatial variants of RF when predicting (interpolating) common species abundance. If the spatial autocorrelation is high and the species is rare, RFSP or RFRK provide more precise predictions with variability closer to observed abundance, while GLM-OK provides the most closely matching centred RMSE and Spearman correlation coefficients. However, hybrid Poisson models have larger centred RMSEs and smaller Spearman correlation coefficients between predicted and observed abundance than standard and spatial RF variants when both the spatial autocorrelation and the species’ probability of detection are high. In contrast, when spatial autocorrelation is low, the best centred RMSE and Spearman correlation coefficients are obtained from different models depending on the data complexity and species abundance class. For example, if the detection probability is high and the species is rare, hybrid Poisson models give the best predictions, whereas if the detection probability is low, standard models (Poisson or RF) give the best predictions in terms of accuracy and predictive power. Variants of RF (spatial or hybrid) produced predictions with better centred RMSE and Spearman correlation coefficients than the hybrid Poisson models when spatial autocorrelation is low and the species is common.
Predictive performance
Cases with linear terms only.
Figure 4 presents modified Taylor diagrams to illustrate the predictive performance of various modelling approaches for species abundance. Most models’ predictions, except those of the RGLM, have less variability than observed abundance for common species with a high probability of detection and high spatial autocorrelation (Fig. 4a). This sub-figure shows that GWRF, despite not being the most precise model, yields the most accurate bias-adjusted predictions and the pattern of variability closest to that of the observed abundance. For common species with a low probability of detection and high spatial autocorrelation (Fig. 4b), the RGLM comes closest to the observed abundance in terms of precision, but the most accurate predictions and the pattern of variability closest to the observed abundance are obtained using GLM-OK, followed by several models with similar predictive performance, such as RFSP, GWRF, SRF, GLM and RFRK.
For common species with high probability of detection but low spatial autocorrelation, GLM hybrid models (RGLM and GLM-OK) followed by the Poisson model yield the best predictive performance. These models provide more accurate predictions with the closest pattern of variability with observed abundance, while RFRK is the best RF spatial variant (Fig. 4c). GLM hybrids have the best predictive accuracy and predictive power but are less precise compared to RF variants for case 4 species (see Fig. 4d). RFRK is the spatial RF variant with the closest predictive power, precision, and accuracy compared to observed abundance.
For rare species with high probability of detection and high spatial autocorrelation, points are clustered together, with RFRK followed by GWRF being the closest models to observed abundance (see Fig. 4e). However, RGLM does not appear on the plot because its mean-centred RMSE tends to positive infinity. The GLM-OK hybrid model is the most appropriate model for rare species with low probability of detection and high spatial autocorrelation, followed by GWRF, which is the most appropriate spatial RF variant, while RFSP is the most precise (see Fig. 4f). Figure 4g and h show that Poisson model hybrids outperform spatial RF variants for case 7 and 8 species, with RFRK and RFGLS being the most effective spatial RF variants.
Cases with a quadratic term
Figure 5 reveals that RF spatial variants predict abundance better for species with high spatial autocorrelation, while hybrid Poisson models are best for low autocorrelation. However, the choice of the best variant depends on the complexity of the data. Four models (RFSP, RFGLS, GWRF and GRF) yield the shortest radial distance from the reference point for common species with high detection probability and high spatial autocorrelation (see Fig. 5a). RFSP remains the best modelling approach for case 2 species as shown in Fig. 5b. However, GRF and GWRF appear to be sensitive to a low probability of detection. Figure 5c reveals that for case 3 species, GLM-OK and Poisson are the best predictive models, while RFGLS, standard RF, and RFRK are the best-performing RF spatial variants. For case 4, RGLM outperforms other models, followed by GLM-OK and Poisson models. The RFSP is the best-performing spatial variant of the RF (see Fig. 5d). Figure 5e shows that when spatial autocorrelation is high for a rare species with a high probability of detection, the GLM-OK model yields predictions with the closest variability to observed abundance variability. However, GRF and GWRF have the smallest radial distances, making their predictions closer to the observed abundance. Figure 5f shows that when the probability of detection of a rare species is low, RFSP is the best-performing model, followed by GLM-OK and Poisson. Figure 5g and h show that for a rare species, when spatial autocorrelation is low or absent, hybrid models like RGLM and GLM-OK produce predictions closer to the observed abundance than RF spatial variants.
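The quantities read off a modified Taylor diagram can be sketched as below. This is an illustrative reconstruction, not the study's code: it assumes that the diagram plots the standard deviation of predictions normalised by that of the observations against the Spearman correlation, and that the radial distance from the reference point follows the law of cosines on those two quantities. The names `average_ranks`, `pearson` and `taylor_statistics` are hypothetical.

```python
import math
import statistics


def average_ranks(values):
    """Ranks starting at 1, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def taylor_statistics(observed, predicted):
    """Normalised SD, Spearman correlation, centred RMSE and distance to reference."""
    sd_obs = statistics.pstdev(observed)
    sd_ratio = statistics.pstdev(predicted) / sd_obs
    rho = pearson(average_ranks(observed), average_ranks(predicted))
    mo, mp = statistics.fmean(observed), statistics.fmean(predicted)
    centred = [(p - mp) - (o - mo) for o, p in zip(observed, predicted)]
    crmse = math.sqrt(statistics.fmean(c * c for c in centred)) / sd_obs
    # Distance between (sd_ratio, rho) and the reference point (1, 1) on the diagram
    distance = math.sqrt(max(0.0, 1 + sd_ratio ** 2 - 2 * sd_ratio * rho))
    return {"sd_ratio": sd_ratio, "spearman": rho,
            "centred_rmse": crmse, "distance": distance}


if __name__ == "__main__":
    observed = [1, 2, 3, 4, 5]
    print(taylor_statistics(observed, [2, 4, 6, 8, 10]))  # variability overestimated
    print(taylor_statistics(observed, [4, 5, 6, 7, 8]))   # biased, yet distance is 0
```

The second call returns a zero distance for predictions that are uniformly shifted, which illustrates that centred statistics do not register overall bias.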
Discussion
The increasing biodiversity loss due to climate change and human activities has led to a rise in species distribution models, but concerns have been raised about their reliability in predicting species distributions (Guillera-Arroita et al. 2015; Rizvanovic et al. 2019; Zhang et al. 2020). Statistical models are crucial in applied ecology, providing predictions that are as close to reality as possible (Houlahan et al. 2017; Norberg et al. 2019). However, they cannot replicate the complexity of ecological systems. Consequently, practitioners should be guided in selecting models and understanding the limits of predictive performance given the wide range of models available (Urban et al. 2016; Norberg et al. 2019).
Oppel et al. (2012) found that predicting species abundance is challenging, suggesting ensemble models as a solution. Machine learning approaches, particularly RF algorithms, have shown better predictive ability for species occurrence and abundance than artificial neural networks and classical regression-based models (e.g., Boulesteix et al. 2012; Jetz et al. 2019; Zhang et al. 2020; Waldock et al. 2022). RF’s ability to handle high-dimensional prediction problems, handle situations where predictor variables exceed observations, and capture complex relationships has made it attractive in various fields. However, spatial autocorrelation challenges the effectiveness of RF, despite its flexibility and non-linear nature (Hengl et al. 2018; Liu et al. 2018; Georganos et al. 2021; Saha et al. 2023).
Evaluation of alternative models using multiple diagnostic metrics
The study highlights the importance of the choice of performance metrics and model assessment approach in species abundance distribution modelling. It found that only a few models consistently provided the best accuracy, precision, and discrimination on hold-out datasets. High performance on some metrics may be accompanied by poor performance on other metrics for the same data set. Norberg et al. (2019) evaluated 33 species distribution models and found similar trends in predictive performance. None of the models performed well for all prediction tasks simultaneously. This can lead to poor global performance when performance assessment tools combine different metrics.
For example, this study suggests that using the standard deviation of predictions as an indicator of prediction’s precision may not be meaningful for skewed abundance datasets. The RFGLS model was found to be the most precise when spatial autocorrelation is high. However, modified Taylor plots revealed that this model overestimated precision, as its normalized standard deviation was closer to the origin but further from 1, indicating underestimation of observed species abundance variability.
The Kruskal–Wallis test followed by Dunn’s post hoc test revealed no difference in prediction accuracy between spatial variants of random forest and other modelling approaches for high spatially autocorrelated abundance datasets with imperfect species detection probability. Garcia-Marti et al. (2019) noted that for highly skewed data, such as highly autocorrelated species relative abundance, an accuracy metric like RMSE alone is not informative. Metrics that concisely combine performance metrics should be used to determine the degree of agreement between observed and predicted species abundances. Combining various metrics is the most effective method for model selection (Moriasi et al. 2015; Izzaddin et al. 2024). The geometric relationship of the Taylor diagram balances model fit measures, making it easier to assess performance (Taylor 2001).
However, Taylor’s diagram has shortcomings, such as failing to account for overall model bias and relying on RMSE (Gleckler et al. 2008; Hu et al. 2019). This means that selecting the best-performing models based on RMSE minimisation favours those that underestimate observed variability, unless the correlation coefficient between model results and observations is equal to 1 (Izzaddin et al. 2024). Spatial RF variants offer better predictive accuracy and power for common species with high spatial autocorrelation and high probability of being detected but are less precise than hybrid Poisson models. For rare species with high spatial autocorrelation, hybrid Poisson regression models are less precise than OK, RF, and its spatial variants. Norberg et al. (2019) suggest that poorly calibrated models may provide overly confident predictions. They recommend fitting a limited set of models with complementary performance and using a cross-validation procedure with independent data to determine the best model for the study.
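The link between RMSE minimisation and underestimated variability can be made explicit with the identity underlying the Taylor diagram (Taylor 2001). The notation is introduced here for illustration only: \(\sigma_o\) and \(\sigma_p\) denote the standard deviations of observed and predicted abundance, \(R\) their correlation coefficient, and \(E'\) the centred RMSE. The identity holds exactly for the Pearson correlation; with the Spearman coefficient used in the modified diagrams it should be read as an approximation.

\[
E'^2 = \sigma_p^2 + \sigma_o^2 - 2\,\sigma_p\,\sigma_o\,R
\]

For a fixed \(R > 0\), minimising over \(\sigma_p\) gives

\[
\frac{\partial E'^2}{\partial \sigma_p} = 2\sigma_p - 2\sigma_o R = 0
\;\Longrightarrow\;
\sigma_p = R\,\sigma_o,
\qquad
E'^2_{\min} = \sigma_o^2\,(1 - R^2).
\]

The centred RMSE is therefore smallest when the predicted variability is only a fraction \(R\) of the observed variability, which is below \(\sigma_o\) unless \(R = 1\). In addition, the total mean squared error decomposes as

\[
\mathrm{RMSE}^2 = (\bar{p} - \bar{o})^2 + E'^2,
\]

where \(\bar{p}\) and \(\bar{o}\) are the mean predicted and observed abundances, so a diagram built on \(E'\) alone does not display the overall bias.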
Impact of data and species features on the predictive performance of different modelling approaches
Several studies have compared random forest and its variants with other models, but results are not always consistent. Li et al. (2011), Appelhans et al. (2015), Hengl et al. (2015) and Fox et al. (2020) comparing RF and spatial regressions found that RF is superior when using many covariates with nonlinear relationships. Parmentier et al. (2011) and Temesgen and Ver Hoef (2014) showed that spatial regression models may outperform RF when using datasets with many spatially autocorrelated records. Li et al. (2017) found that GLMs and their hybrid methods with geostatistical techniques (OK) were less accurate than hybrid methods of RF and OK in modelling count data.
The current study examines various scenarios in which these alternative models may perform better or not. We found that RF spatial variants offer better predictive accuracy and power than other modelling methods when spatial autocorrelation and species probability of detection are high. However, they tend to underestimate species abundance variability more than hybrid Poisson models with RF and OK. The optimal RF spatial variant varies depending on species features and the relationship between abundance and independent variables. Hybrid Poisson models with RF and OK are more effective in predicting data that closely aligns with observed values when spatial autocorrelation is low or absent.
These results contrast with the claim of Saha et al. (2022) that all spatial variants of RF outperform the standard RF due to the lack of spatial information. Their conclusion only holds when spatial autocorrelation is very high and species are common and likely to be detected. The authors concluded that the RFGLS method was the most effective or one of the most effective methods in all contexts. However, due to the discrete nature of species abundance and the presence of overdispersion, the RFGLS method, although technically applicable to count data, is not the most appropriate choice for predicting abundance in most cases.
The RFSP is one of the best performing spatial RF variants, but not as good as other models such as RGLM, GLM-OK and Poisson for species abundance with low spatial autocorrelation. SRF and RFSP belong to the class of approaches where spatial features are added to the RF as additional covariates (Hengl et al. 2018; Saha et al. 2022). In 1 out of 16 cases, SRF outperforms RFSP, particularly for rare species with high detection probability and low or no spatial autocorrelation during training.
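The idea of supplying spatial information to the RF as additional covariates can be sketched as a feature-construction step. The example below is a minimal illustration under stated assumptions: it appends, for each location, the Euclidean distances to a set of reference (for example, training) locations, in the spirit of the buffer-distance covariates of Hengl et al. (2018). The function name `add_distance_features` is hypothetical, the random forest fit itself is omitted, and the implementation used in the study may differ.

```python
import math


def add_distance_features(covariates, coords, reference_coords):
    """Append the distance to every reference location to each covariate row."""
    augmented = []
    for row, (x, y) in zip(covariates, coords):
        distances = [math.hypot(x - rx, y - ry) for rx, ry in reference_coords]
        augmented.append(list(row) + distances)
    return augmented


if __name__ == "__main__":
    covariates = [[1.0], [2.0]]          # one environmental covariate per site
    coords = [(0.0, 0.0), (3.0, 4.0)]    # site coordinates
    # Using the sites themselves as reference locations adds one column per site
    print(add_distance_features(covariates, coords, coords))
```

Because one column is added per reference location, the design matrix grows with the number of sites, which is one reason such variants become computationally demanding on large datasets.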
Saha et al. (2022) confirmed the findings of Mentch and Zhou (2020a) that adding noise covariates improves RF prediction performance at low signal-to-noise ratios, i.e. when the number of informative predictor variables is very small compared to the number of uninformative predictors. In this setting, they found that ignoring spatial variation in the RFRK fit led to a worse performance than the RFSP fit. The RFSP performed worse than the RFRK when the number of informative predictor variables was larger than the number of non-informative predictors.
Although many studies (Hengl et al. 2018; Georganos et al. 2021; Mentch and Zhou 2020a) have found that including geographic coordinates as additional features is a good practice when using ML with spatial data, Mentch and Zhou (2020b) showed that even the simple act of including additional pure noise variables in the model can dramatically improve the accuracy of out-of-sample predictions, sometimes significantly outperforming even the best-tuned standard RFs. Beyond this important limitation, adding geographic coordinates as features increases computational complexity. Generalisations about the most effective spatial variant of RF for predicting species relative abundance are difficult, as this choice depends on data characteristics and species features.
However, based on the agreement between model predictions and observed abundance data, GWRF, GRF, and in some cases, RFRK seem to be the most appropriate when the spatial autocorrelation is high, whereas RFRK seems to be the appropriate spatial RF variant when spatial variance is low. This suggests the need for accurate, precise, and reliable spatial RF variants that account for both dispersion and spatial autocorrelation. In addition, spatial structure and dispersion parameters can be modified by the sampling, validation approach and sub-sampling process in the RF algorithm. This helps in understanding species abundance distribution and factors associated with it, which could inform future research and decision-making.
Future studies should explore datasets with diverse covariance functions and other species characteristics to evaluate spatial RF variants’ performance and explore ensemble methods that combine spatial approaches with count data models with full dispersion flexibility.
Conclusion
This study compared spatial RF variants to standard RF and other modeling methods for species abundance distributions in ecology using different model assessment approaches. Findings suggested using metrics that combine performance metrics and indicate agreement between observed and predicted abundances.
Spatial RF variants typically outperformed other models in prediction accuracy and power, especially when spatial autocorrelation and species detection probability were high. However, they are less precise than hybrid Poisson models like RGLM and GLM-OK. The best-performing RF spatial variant depends on data, species features, and the complexity of the relationship between abundance and independent variables. The RFSP provides better predictions for common species with high detection probability and high spatial autocorrelation but is computationally expensive. RFSP should be used with caution, as even adding pure noise variables can improve accuracy.
GWRF and GRF gave better predictions when the species was rare, with high detection probability and high spatial autocorrelation. For linear terms only, GWRF gave better predictions for common species with high spatial autocorrelation and high probability of detection. For rare species, it ranked second to RFRK. Hybrid Poisson models are most likely to give predictions close to observed data when there is no or low spatial autocorrelation. This study emphasizes the importance of model specification in ecological research, recommending species- and data-specific models and the combination of performance metrics to assess model performance. It suggests testing spatial variants that account for data dispersion to improve species abundance predictions. These findings are crucial for future studies, enhancing understanding of population dynamics, management strategies, conservation planning, and climate impact assessment.
Availability of data and materials
The data and materials supporting this study’s findings are available from the corresponding author on request.
References
Ahijevych D, Pinto JO, Williams JK et al (2016) Probabilistic forecasts of mesoscale convective system initiation using the random forest data mining technique. Weather Forecast 31(2):581–599. https://doi.org/10.1175/WAF-D-15-0113.1. https://journals.ametsoc.org/view/journals/wefo/31/2/waf-d-15-0113_1.xml
Appelhans T, Mwangomo E, Hardy DR et al (2015) Evaluating machine learning approaches for the interpolation of monthly air temperature at Mt. Kilimanjaro, Tanzania. Spat Stat 14:91–113. https://doi.org/10.1016/j.spasta.2015.05.008. https://www.sciencedirect.com/science/article/pii/S2211675315000482, Spatial and Spatio-Temporal Models for Interpolating Climatic and Meteorological Data
Baldridge E, Harris DJ, Xiao X et al (2016) An extensive comparison of species-abundance distribution models. PeerJ 4:e2823
Beery S, Cole E, Parker J et al (2021) Species distribution modeling for machine learning practitioners: a review. In: ACM SIGCAS conference on computing and sustainable societies. COMPASS ’21. Association for Computing Machinery, New York, pp 329–348. https://doi.org/10.1145/3460112.3471966
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc: Ser B (Methodol) 57(1):289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Benoit D, Jackson DA, Ridgway MS (2018) Assessing the impacts of imperfect detection on estimates of diversity and community structure through multispecies occupancy modeling. Ecol Evol 8(9):4676–4684. https://doi.org/10.1002/ece3.4023
Biau G, Scornet E (2016) A random forest guided tour. TEST 25:197–227. https://doi.org/10.1007/s11749-016-0481-7
Borchers DL, Stevenson BC, Kidney D et al (2015) A unifying model for capture-recapture and distance sampling surveys of wildlife populations. J Am Stat Assoc 110(509):195–204. https://doi.org/10.1080/01621459.2014.893884
Boulesteix AL, Janitza S, Kruppa J et al (2012) Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Min Knowl Discov 2(6):493–507. https://doi.org/10.1002/widm.1072
Bowler DE, Haase P, Kröncke I et al (2017) Cross-taxa generalities in the relationship between population abundance and ambient temperatures. Proc Biol Sci 284(1863):20170870. https://doi.org/10.1098/rspb.2017.0870
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324
Breiman L, Friedman J, Olshen R et al (1984) Classification and regression trees, 1st edn. Chapman and Hall/CRC Press, Boca Raton. https://doi.org/10.1201/9781315139470
Brenning A (2005) Spatial prediction models for landslide hazards: review, comparison and evaluation. Nat Hazards Earth Syst Sci 5(6):853–862. https://doi.org/10.5194/nhess-5-853-2005. https://nhess.copernicus.org/articles/5/853/2005/
Broms KM, Hooten MB, Fitzpatrick RM (2016) Model selection and assessment for multi-species occupancy models. Ecology 97(7):1759–1770. https://doi.org/10.1890/15-1471.1
Brunsdon C, Fotheringham S, Charlton M (1998) Geographically weighted regression. J R Stat Soc: Ser D (Stat) 47(3):431–443. https://doi.org/10.1111/1467-9884.00145
Cameron AC, Trivedi PK (2005) Microeconometrics: methods and applications. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511811241
Ceballos G, Ehrlich PR, Raven PH (2020a) Vertebrates on the brink as indicators of biological annihilation and the sixth mass extinction. Proc Natl Acad Sci USA 117(24):13596–13602. https://doi.org/10.1073/pnas.1922686117
Ceballos G, Ehrlich PR, Raven PH (2020b) Vertebrates on the brink as indicators of biological annihilation and the sixth mass extinction. Proc Natl Acad Sci 117(24):13596–13602. https://doi.org/10.1073/pnas.1922686117
Ceulemans R, Guill C, Gaedke U (2021) Top predators govern multitrophic diversity effects in tritrophic food webs. Ecology 102(7):e03379. https://doi.org/10.1002/ecy.3379
Chilès JP, Delfiner P (2012) Structural analysis. In: Geostatistics: modeling spatial uncertainty, chap 2. Wiley, New York, pp 28–146. https://doi.org/10.1002/9781118136188.ch2
Chisholm RA, Muller-Landau HC (2011) A theoretical model linking interspecific variation in density dependence to species abundances. Theor Ecol 4(2):241–253. https://doi.org/10.1007/s12080-011-0119-z
Chu C, Kleinhesselink AR, Havstad KM et al (2016) Direct effects dominate responses to climate perturbations in grassland plant communities. Nat Commun 7(1):11766. https://doi.org/10.1038/ncomms11766
Clements CF, Blanchard JL, Nash KL et al (2017) Body size shifts and early warning signals precede the historic collapse of whale stocks. Nat Ecol Evol 1(7):188. https://doi.org/10.1038/s41559-017-0188
Cressie N, Wikle CK (2011) Statistics for spatio-temporal data, 1st edn. Wiley series in probability and statistics. Wiley, New York
Cutler DR, Edwards TC Jr, Beard KH et al (2007) Random forests for classification in ecology. Ecology 88(11):2783–2792. https://doi.org/10.1890/07-0539.1
Dallas TA, Hastings A (2018) Habitat suitability estimated by niche models is largely unrelated to species abundance. Glob Ecol Biogeogr 27(12):1448–1456. https://doi.org/10.1111/geb.12820
Dallas TA, Santini L (2020) The influence of stochasticity, landscape structure and species traits on abundant–centre relationships. Ecography 43(9):1341–1351. https://doi.org/10.1111/ecog.05164
Datta A, Banerjee S, Finley AO et al (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc 111(514):800–812. https://doi.org/10.1080/01621459.2015.1044091
De’ath G, Fabricius KE (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81(11):3178–3192. https://doi.org/10.1890/0012-9658(2000)081[3178:CARTAP]2.0.CO;2. https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/0012-9658(2000)081[3178:CARTAP]2.0.CO;2
Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7(3):1–13
Dormann CF, McPherson JM, Araújo MB et al (2007) Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30(5):609–628. https://doi.org/10.1111/j.2007.0906-7590.05171.x
Dunn OJ (1964) Multiple comparisons using rank sums. Technometrics 6(3):241–252. https://doi.org/10.1080/00401706.1964.10490181
Fayad I, Baghdadi N, Bailly JS et al (2016) Regional scale rain-forest height mapping using regression-kriging of spaceborne and airborne lidar data: application on French Guiana. Remote Sens 8(3). https://doi.org/10.3390/rs8030240. https://www.mdpi.com/2072-4292/8/3/240
Fligner JM, Killeen TL (1976) Distribution-free two-sample tests for scale. J Am Stat Assoc 71(353):210–213
Fotheringham A, Brunsdon C, Charlton M (2003) Geographically weighted regression: the analysis of spatially varying relationships. Wiley, New York. https://books.google.bj/books?id=9DZgV1vXOuMC
Fox EW, Ver Hoef JM, Olsen AR (2020) Comparing spatial regression to random forests for large environmental data sets. PLoS ONE 15(3):1–22. https://doi.org/10.1371/journal.pone.0229509
Garcia-Marti I, Zurita-Milla R, Swart A (2019) Modelling tick bite risk by combining random forests and count data regression models. PLoS ONE 14(12):1–22. https://doi.org/10.1371/journal.pone.0216511
Genung MA, Fox J, Winfree R (2020) Species loss drives ecosystem function in experiments, but in nature the importance of species loss depends on dominance. Glob Ecol Biogeogr 29(9):1531–1541. https://doi.org/10.1111/geb.13137
Georganos S, Grippa T, Gadiaga AN et al (2021) Geographical random forests: a spatial extension of the random forest algorithm to address spatial heterogeneity in remote sensing and population modelling. Geocarto Int 36(2):121–136. https://doi.org/10.1080/10106049.2019.1595177
Gislason PO, Benediktsson JA, Sveinsson JR (2006) Random forests for land cover classification. Pattern Recognit Lett 27(4):294–300. https://doi.org/10.1016/j.patrec.2005.08.011. https://www.sciencedirect.com/science/article/pii/S0167865505002242, Pattern Recognition in Remote Sensing (PRRS 2004)
Gleckler PJ, Taylor KE, Doutriaux C (2008) Performance metrics for climate models. J Geophys Res: Atmos. https://doi.org/10.1029/2007JD008972
Gräler B, Pebesma E, Heuvelink G (2016) Spatio-temporal interpolation using gstat. R J 8:204–218. https://journal.r-project.org/archive/2016/RJ-2016-014/index.html
Gregory RD, Noble DG, Custance J (2004) The state of play of farmland birds: population trends and conservation status of lowland farmland birds in the United Kingdom. Ibis 146(s2):1–13. https://doi.org/10.1111/j.1474-919X.2004.00358.x
Guélat J, Kéry M (2018) Effects of spatial autocorrelation and imperfect detection on species distribution models. Methods Ecol Evol 9(6):1614–1625. https://doi.org/10.1111/2041-210X.12983
Guillera-Arroita G, Lahoz-Monfort JJ, Elith J et al (2015) Is my species distribution model fit for purpose? Matching data and models to applications. Glob Ecol Biogeogr 24(3):276–292. https://doi.org/10.1111/geb.12268
Hallett LM, Farrer EC, Suding KN et al (2018) Tradeoffs in demographic mechanisms underlie differences in species abundance and stability. Nat Commun 9(1):5047–5055. https://doi.org/10.1038/s41467-018-07535-w
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer series in statistics. Springer, New York. https://doi.org/10.1007/978-0-387-84858-7
Hastings R, Rutterford L, Freer J et al (2020) Climate change drives poleward increases and equatorward declines in marine species. Curr Biol 30(8):1572-1577.e2. https://doi.org/10.1016/j.cub.2020.02.043
Hengl T, Heuvelink GBM, Kempen B et al (2015) Mapping soil properties of Africa at 250 m resolution: random forests significantly improve current predictions. PLoS ONE 10(6):1–26. https://doi.org/10.1371/journal.pone.0125814
Hengl T, Nussbaum M, Wright MN et al (2018) Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ 6:e5518. https://doi.org/10.7717/peerj.5518
Hijmans RJ (2023) raster: Geographic data analysis and modeling. https://CRAN.R-project.org/package=raster. R package version 3.6-20
Holt RD (2020) Reflections on niches and numbers. Ecography 43(3):387–390. https://doi.org/10.1111/ecog.04828
Houlahan JE, McKinney ST, Anderson TM et al (2017) The priority of prediction in ecological understanding. Oikos 126(1):1–7. https://doi.org/10.1111/oik.03726
Howard C, Stephens PA, Pearce-Higgins JW et al (2014) Improving species distribution models: the value of data on abundance. Methods Ecol Evol 5(6):506–513. https://doi.org/10.1111/2041-210X.12184
Hu Z, Chen X, Zhou Q et al (2019) DISO: a rethink of Taylor diagram. Int J Climatol 39(5):2825–2832. https://doi.org/10.1002/joc.5972
Izzaddin A, Langousis A, Totaro V et al (2024) A new diagram for performance evaluation of complex models. Stoch Environ Res Risk Assess. https://doi.org/10.1007/s00477-024-02678-3
Jetz W, McGeoch MA, Guralnick R et al (2019) Essential biodiversity variables for mapping and monitoring species populations. Nat Ecol Evol 3:539–551. https://doi.org/10.1038/s41559-019-0826-1
Jiang Z, Li W, Xu J et al (2015) Extreme precipitation indices over China in CMIP5 models. Part I: model evaluation. J Clim 28(21):8603–8619. https://doi.org/10.1175/JCLI-D-15-0099.1
Johnson PT, Preston DL, Hoverman JT et al (2013) Biodiversity decreases disease through predictable changes in host community competence. Nature 494(7436):230–233
Johnston A, Fink D, Reynolds MD et al (2015) Abundance models improve spatial and temporal prioritization of conservation resources. Ecol Appl 25(7):1749–1756. https://doi.org/10.1890/14-1826.1
Kalogirou S, Georganos S (2022) SpatialML: spatial machine learning. https://CRAN.R-project.org/package=SpatialML. R package version 0.1.5
Kellner KF, Swihart RK (2014) Accounting for imperfect detection in ecology: a quantitative review. PLoS ONE 9(10):1–8. https://doi.org/10.1371/journal.pone.0111436
Kéry M, Royle JA (2016) Chapter 6—modeling abundance with counts of unmarked individuals in closed populations: binomial N-mixture models. In: Kéry M, Royle JA (eds) Applied hierarchical modeling in ecology. Academic Press, Boston, pp 219–312. https://doi.org/10.1016/B978-0-12-801378-6.00006-0
Kéry M, Schmidt BR (2008) Imperfect detection and its consequences for monitoring for conservation. Community Ecol 9(2):207–216. https://doi.org/10.1556/ComEc.9.2008.2.10
Kleiber C, Zeileis A (2008) Applied econometrics with R. Springer, New York. https://CRAN.R-project.org/package=AER. ISBN:978-0-387-77316-2
Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47(260):583–621. https://doi.org/10.1080/01621459.1952.10483441
Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York. http://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485/
Lahoz-Monfort JJ, Guillera-Arroita G, Wintle BA (2014) Imperfect detection impacts the performance of species distribution models. Glob Ecol Biogeogr 23(4):504–515. https://doi.org/10.1111/geb.12138
Lawler JJ, White D, Neilson RP et al (2006) Predicting climate-induced range shifts: model differences and model reliability. Glob Change Biol 12(8):1568–1584. https://doi.org/10.1111/j.1365-2486.2006.01191.x
Legendre P (1993) Spatial autocorrelation: trouble or new paradigm? Ecology 74(6):1659–1673. https://doi.org/10.2307/1939924
Lenoir J, Svenning JC (2013) Latitudinal and elevational range shifts under contemporary climate change. In: Levin SA (ed) Encyclopedia of biodiversity, 2nd edn. Academic Press, Waltham, pp 599–611. https://doi.org/10.1016/B978-0-12-384719-5.00375-0
Li J, Heap AD, Potter A et al (2011) Application of machine learning methods to spatial interpolation of environmental variables. Environ Model Softw 26(12):1647–1659. https://doi.org/10.1016/j.envsoft.2011.07.004
Li J, Alvarez B, Siwabessy J et al (2017) Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: predicting sponge species richness. Environ Model Softw 97:112–129. https://doi.org/10.1016/j.envsoft.2017.07.016
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. https://CRAN.R-project.org/doc/Rnews/
Lim CC, Kim H, Vilcassim MR et al (2019) Mapping urban air quality using mobile sampling with low-cost sensors and machine learning in Seoul, South Korea. Environ Int 131:105022. https://doi.org/10.1016/j.envint.2019.105022
Liu Y, Cao G, Zhao N et al (2018) Improve ground-level PM2.5 concentration mapping using a random forests-based geostatistical approach. Environ Pollut 235:272–282. https://doi.org/10.1016/j.envpol.2017.12.070
Lucas TC (2020) A translucent box: interpretable machine learning in ecology. Ecol Monogr 90(4):e01422
Martín B, González-Arias J, Vicente-Vírseda JA (2021) Machine learning as a successful approach for predicting complex spatio-temporal patterns in animal species abundance. Anim Biodivers Conserv 44(2):289–301
McGill BJ, Etienne RS, Gray JS et al (2007) Species abundance distributions: moving beyond single prediction theories to integration within an ecological framework. Ecol Lett 10(10):995–1015. https://doi.org/10.1111/j.1461-0248.2007.01094.x
Mentch LK, Zhou S (2020a) Getting better from worse: augmented bagging and a cautionary tale of variable importance. J Mach Learn Res 23:224:1–224:32. https://api.semanticscholar.org/CorpusID:212633465
Mentch LK, Zhou S (2020b) Randomization as regularization: a degrees of freedom explanation for random forest success. J Mach Learn Res 21(171):1–36. http://jmlr.org/papers/v21/19-905.html
Merow C, Smith MJ, Edwards TC Jr et al (2014) What do we gain from simplicity versus complexity in species distribution models? Ecography 37(12):1267–1281. https://doi.org/10.1111/ecog.00845
Moriasi DN, Gitau MW, Pai N et al (2015) Hydrologic and water quality models: performance measures and evaluation criteria. Trans ASABE 58(6):1763–1785. https://doi.org/10.13031/trans.58.10715
Nash J, Sutcliffe J (1970) River flow forecasting through conceptual models part I—a discussion of principles. J Hydrol 10(3):282–290. https://doi.org/10.1016/0022-1694(70)90255-6
Norberg A, Abrego N, Blanchet FG et al (2019) A comprehensive evaluation of predictive performance of 33 species distribution models at species and community levels. Ecol Monogr 89(3):e01370. https://doi.org/10.1002/ecm.1370
O’Grady JJ, Reed DH, Brook BW et al (2004) What are the best correlates of predicted extinction risk? Biol Conserv 118(4):513–520. https://doi.org/10.1016/j.biocon.2003.10.002
Oppel S, Meirinho A, Ramírez I et al (2012) Comparison of five modelling techniques to predict the spatial distribution and abundance of seabirds. Biol Conserv 156:94–104. https://doi.org/10.1016/j.biocon.2011.11.013. Seabirds and Marine Protected Areas planning
Osorio-Olvera L, Soberón J, Falconi M (2019) On population abundance and niche structure. Ecography 42(8):1415–1425. https://doi.org/10.1111/ecog.04442
Parmentier I, Harrigan RJ, Buermann W et al (2011) Predicting alpha diversity of African rain forests: models based on climate and satellite-derived data do not perform better than a purely spatial model. J Biogeogr 38(6):1164–1176. https://doi.org/10.1111/j.1365-2699.2010.02467.x
Pichler M, Boreux V, Klein AM et al (2020) Machine learning algorithms to infer trait-matching and predict species interactions in ecological networks. Methods Ecol Evol 11(2):281–293. https://doi.org/10.1111/2041-210X.13329
Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9(2):181–199. https://doi.org/10.1007/s10021-005-0054-1
Purvis A, Gittleman JL, Cowlishaw G et al (2000) Predicting extinction risk in declining species. Proc R Soc Lond Ser B: Biol Sci 267(1456):1947–1952. https://doi.org/10.1098/rspb.2000.1234
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rizvanovic M, Kennedy JD, Nogués-Bravo D et al (2019) Persistence of genetic diversity and phylogeographic structure of three New Zealand forest beetles under climate change. Divers Distrib 25(1):142–153. https://doi.org/10.1111/ddi.12834
Roberts DR, Bahn V, Ciuti S et al (2017) Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40(8):913–929. https://doi.org/10.1111/ecog.02881
Royle JA, Dorazio RM (2009) 8-Metapopulation models of abundance. In: Royle JA, Dorazio RM (eds) Hierarchical modeling and inference in ecology. Academic Press, San Diego, pp 267–295. https://doi.org/10.1016/B978-0-12-374097-7.00010-7
Royle JA, Kéry M, Gautier R et al (2007) Hierarchical spatial models of abundance and occurrence from imperfect survey data. Ecol Monogr 77(3):465–481. https://doi.org/10.1890/06-0912.1
Ruß G, Brenning A (2010) Data mining in precision agriculture: management of spatial information. In: Hüllermeier E, Kruse R, Hoffmann F (eds) Computational intelligence for knowledge-based systems design. Springer, Berlin, pp 350–359
Saha A, Datta A (2018) BRISC: bootstrap for rapid inference on spatial covariances. Stat 7(1):e184. https://doi.org/10.1002/sta4.184
Saha A, Basu S, Datta A (2022) RandomForestsGLS: random forests for dependent data. https://CRAN.R-project.org/package=RandomForestsGLS. R package version 0.1.4
Saha A, Basu S, Datta A (2023) Random forests for spatially dependent data. J Am Stat Assoc 118(541):665–683. https://doi.org/10.1080/01621459.2021.1950003
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52(3–4):591–611. https://doi.org/10.1093/biomet/52.3-4.591
Simon SM, Glaum P, Valdovinos FS (2023) Interpreting random forest analysis of ecological models to move from prediction to explanation. Sci Rep. https://doi.org/10.1038/s41598-023-30313-8
Song L, Langfelder P (2022) randomGLM: random general linear model prediction. https://CRAN.R-project.org/package=randomGLM. R package version 1.10-1
Song L, Langfelder P, Horvath S (2013) Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinform 14(1):5. https://doi.org/10.1186/1471-2105-14-5
Sporbert M, Keil P, Seidler G et al (2020) Testing macroecological abundance patterns: the relationship between local abundance and range size, range position and climatic suitability among European vascular plants. J Biogeogr 47(10):2210–2222. https://doi.org/10.1111/jbi.13926
Stewart FA, Yang W, Kang W (2017) Multiscale geographically weighted regression (MGWR). Ann Am Assoc Geogr 107(6):1247–1265. https://doi.org/10.1080/24694452.2017.1352480
Stuart-Smith RD, Bates AE, Lefcheck JS et al (2013) Integrating abundance and functional traits reveals new global hotspots of fish diversity. Nature 501:539–542. https://doi.org/10.1038/nature12529
Su Q (2018) A general pattern of the species abundance distribution. PeerJ 6:e5928. https://doi.org/10.7717/peerj.5928
Talebi H, Peeters L, Otto A et al (2022) A truly spatial random forests algorithm for geoscience data analysis and modelling. Math Geosci 54(1):1–22. https://doi.org/10.1007/s11004-021-09946-w
Taylor KE (2001) Summarizing multiple aspects of model performance in a single diagram. J Geophys Res: Atmos 106(D7):7183–7192. https://doi.org/10.1029/2000JD900719
Temesgen H, Ver Hoef JM (2014) Evaluation of the spatial linear model, random forest and gradient nearest-neighbour methods for imputing potential productivity and biomass of the Pacific Northwest forests. For: Int J For Res 88(1):131–142. https://doi.org/10.1093/forestry/cpu036
Thuiller W, Guéguen M, Renaud J et al (2019) Uncertainty in ensembles of global biodiversity scenarios. Nat Commun 10(1):1446. https://doi.org/10.1038/s41467-019-09519-w
Urban MC, Bocedi G, Hendry AP et al (2016) Improving the forecast for biodiversity under climate change. Science 353(6304):aad8466. https://doi.org/10.1126/science.aad8466
Van Horne B (1983) Density as a misleading indicator of habitat quality. J Wildl Manag 47(4):893–901
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York. https://www.stats.ox.ac.uk/pub/MASS4/. ISBN:0-387-95457-0
Verberk W (2011) Explaining general patterns in species abundance and distributions. Nat Educ Knowl 3(10):38
Waldock C, Stuart-Smith RD, Albouy C et al (2022) A quantitative review of abundance-based species distribution models. Ecography. https://doi.org/10.1111/ecog.05694
Wardeh M, Blagrove MS, Sharkey KJ et al (2021) Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nat Commun 12(1):3954. https://doi.org/10.1038/s41467-021-24085-w
Weber MM, Stevens RD, Diniz-Filho JAF et al (2017) Is there a correlation between abundance and environmental suitability derived from ecological niche modelling? A meta-analysis. Ecography 40(7):817–828. https://doi.org/10.1111/ecog.02125
Webster R, Oliver MA (2007) Geostatistics for environmental scientists, 2nd edn. Wiley, New York
Wright MN, Ziegler A (2017) ranger: A fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 77(1):1–17. https://doi.org/10.18637/jss.v077.i01
Yenni G, Adler PB, Ernest SKM (2017) Do persistent rare species experience stronger negative frequency dependence than common species? Glob Ecol Biogeogr 26(5):513–523. https://doi.org/10.1111/geb.12566
Zhang C, Chen Y, Xu B et al (2020) Improving prediction of rare species’ distribution from community data. Sci Rep 10(1):12230. https://doi.org/10.1038/s41598-020-69157-x
Zurell D, Thuiller W, Pagel J et al (2016) Benchmarking novel approaches for modelling species range dynamics. Glob Change Biol 22(8):2651–2664. https://doi.org/10.1111/gcb.13251
Acknowledgements
This work was supported by the DAAD In-Country/In-Region Scholarship Program FSA/UAC; by the International Development Research Centre (IDRC) and the Swedish International Development Cooperation Agency (SIDA) through the Artificial Intelligence for Development (AI4D) Africa Programme, managed by the Africa Centre for Technology Studies (ACTS) Scholarship Program; and by the Regional Universities Forum for Capacity Building in Agriculture through the Graduate Training Assistantship Program, supported by the Carnegie Corporation of New York.
Author information
Contributions
C.A.M. designed the study, analysed and interpreted the results, and drafted the manuscript. A.B.F. and R.G.K. supervised the study and revised the manuscript. All authors approved the final version.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this article.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mushagalusa, C.A., Fandohan, A.B. & Glèlè Kakaï, R. Predicting species abundance using machine learning approach: a comparative assessment of random forest spatial variants and performance metrics. Model. Earth Syst. Environ. 10, 5145–5171 (2024). https://doi.org/10.1007/s40808-024-02055-7
DOI: https://doi.org/10.1007/s40808-024-02055-7