Introduction

The distribution of species abundance plays a crucial role in many ecological applications (McGill et al. 2007; Royle and Dorazio 2009; Baldridge et al. 2016; Su 2018). Species abundance data are relatively easy to collect and can reveal less obvious features of a community, such as how well organised it is and how its species compete and predate (Verberk 2011). In-depth knowledge of species abundance distribution and associated factors is essential for understanding population dynamics, implications of management strategies, conservation planning, and climate impact assessment (McGill et al. 2007; Kellner and Swihart 2014; Baldridge et al. 2016; Su 2018).

Advanced computers and statistical methods are enabling scientists to better understand, quantify, and predict complex ecological processes. However, species occurrence, rather than abundance, has been the main focus of the development of spatial predictive modelling (Waldock et al. 2022). Models of species occurrence do not fully capture the changes in local abundance associated with changes in species distributions (Gregory et al. 2004; Lenoir and Svenning 2013; Howard et al. 2014; Hastings et al. 2020). For example, species present in large numbers at few sites may contribute significantly to ecological processes, but a focus on occurrence alone will overlook these species (Stuart-Smith et al. 2013; Johnston et al. 2015; Genung et al. 2020). Changes in abundance may also provide an early warning of a population decline, whereas patterns of occurrence may not change until the local population is decimated (O’Grady et al. 2004; Clements et al. 2017; Ceballos et al. 2020a; Hastings et al. 2020; Waldock et al. 2022).

Although Weber et al. (2017) showed a positive relationship between species abundance and environmental suitability, other studies reported a weak or non-existent relationship (Van Horne 1983; Dallas and Hastings 2018; Sporbert et al. 2020; Dallas and Santini 2020; Holt 2020). This relationship can be weakened due to species characteristics (Allee effect, species detectability, demographic stochasticity, non-equilibrium population states), environmental variability, measurement errors and spatial autocorrelation (Howard et al. 2014; Osorio-Olvera et al. 2019; Dallas and Santini 2020; Holt 2020; Waldock et al. 2022).

Measurement errors induce omissions, commissions, under-counting, over-counting, or imperfect detection (Borchers et al. 2015; Kéry and Royle 2016). Imperfect detection is common in ecological datasets, resulting in underestimated abundance and uncertainty (Royle et al. 2007; Kéry and Schmidt 2008; Kellner and Swihart 2014; Lahoz-Monfort et al. 2014; Benoit et al. 2018). Spatial autocorrelation is a measure of the degree to which a model’s residuals are spatially clustered considering the effects of covariates (Legendre 1993; Dormann et al. 2007).

Although not necessarily a challenge for statistical analysis, data’s spatial structure is often overlooked in many application fields and can lead to poor model performance (Brenning 2005; Ruß and Brenning 2010). Spatial structure is not taken into account by many popular machine learning (ML) techniques, including random forest (RF) (Roberts et al. 2017). However, these limitations have not been extensively explored in applied studies, especially in species abundance modelling (Johnson et al. 2013; Broms et al. 2016). Understanding how and why models behave differently for different species is critical to informing and managing conservation (Waldock et al. 2022).

Ecologists and statisticians are increasingly dealing with high-dimensional observations of species and environmental data (Beery et al. 2021). These data often have complex and non-linear interactions and missing values. Traditional statistical methods struggle to provide meaningful analyses of such data (Déath and Fabricius 2000). RF, a flexible and powerful alternative to traditional statistical methods, is increasingly applied in various fields, particularly to ecological datasets, to study complex systems (Gislason et al. 2006; Jj et al. 2006; Prasad et al. 2006; Cutler et al. 2007; Kuhn and Johnson 2013; Merow et al. 2014; Zurell et al. 2016; Lucas 2020; Pichler et al. 2020; Martín et al. 2021; Beery et al. 2021; Ceulemans et al. 2021; Wardeh et al. 2021; Simon et al. 2023).

Despite the introduction of RF variants that account for spatial autocorrelation in data, standard RF remains widely used (Ahijevych et al. 2016; Fayad et al. 2016; Lim et al. 2019; Fox et al. 2020; Hengl et al. 2018; Liu et al. 2018; Georganos et al. 2021; Saha et al. 2023). In recent years, several methods have been proposed to adapt standard RF to a spatial framework by incorporating spatial information. One of these techniques, known as RF Residual Kriging (RFRK), is the most widely used. After fitting a standard RF, spatial adjustment is performed using ordinary kriging (OK) on RF residuals (Hengl et al. 2015; Liu et al. 2018). Another technique is the explicit inclusion of spatial information as additional covariates in the RF, which adversely affects predictive performance when there is a dominating covariate effect (Saha et al. 2022). Hengl et al. (2018) introduced a spatial RF (RFSP) that incorporates all pairwise buffer distances as additional factors. Georganos et al. (2021) developed the Geographical Random Forest (GRF), in which the global process is a decomposition of several local sub-models of nearby observations. It is based on the concept of spatially varying coefficient models (Fotheringham et al. 2003; Stewart et al. 2017).

Furthermore, Talebi et al. (2022) introduced a spatial RF variant based on higher-order spatial statistics. It uses local spatial-spectral information to learn intrinsic heterogeneity, spatial dependencies and complex spatial patterns, in contrast to the standard RF algorithm, which uses pixel-wise spectral information as predictors. The RFGLS, developed by Saha et al. (2023), estimates nonlinear covariate effects in spatial mixed models, where spatial correlation is handled using a Gaussian process. Just as generalised least squares (GLS) extends ordinary least squares (OLS) to account for dependence in linear models, RFGLS extends RF.

However, there is a lack of comparisons between spatial RF variants and spatial regression in modelling of the distribution of species abundance. Spatial regression models are known to perform well in predicting data with spatial autocorrelation. Li et al. (2017) found that GLM hybrids with geostatistical techniques were less accurate than RF-OK hybrids in modelling count data. Song et al. (2013) introduced the Random Generalised Linear Model (RGLM), which is a GLM-based ensemble predictor and combines the standard RF and GLM models using forward selection. Garcia-Marti et al. (2019) combined RF and count data models to better model overdispersed, skewed, and zero-inflated distributions. It combines the segmenting capabilities of standard RF, using decision tree rules to partition data into homogeneous groups, with count data models more appropriate for modelling count data.

Yet spatial abundance models are often better suited to common and widespread species and have shown their greatest applicability to questions of ecosystem function and service supply rather than modelling rare or endemic species threatened with extinction (Purvis et al. 2000; Ceballos et al. 2020b; Waldock et al. 2022). However, the effects of species features combined with spatial autocorrelation on the performance of RF within a spatial framework are less well understood. Other studies have shown that for other types of species, particularly wide-ranging species, it may be difficult to model abundance accurately because their environmental niches are less significant (Chisholm and Muller-Landau 2011; Chu et al. 2016; Bowler et al. 2017; Yenni et al. 2017; Hallett et al. 2018). Rare species with a small niche have more stable populations, making it easier to predict abundance distribution (Yenni et al. 2017; Thuiller et al. 2019; Waldock et al. 2022). This study explores the effectiveness of different RF spatial variants in improving species abundance predictions and evaluates the effectiveness of combining performance metrics in model selection for species abundance distribution modelling. Model selection involves evaluating competing models’ performance to choose the best model (Hastie et al. 2009).

This study uses modified Taylor plots to assess the performance of different modelling approaches. The overall performance of these models has not previously been comprehensively quantified beyond the existing individual statistical indices. The study aims to determine how different RF spatial variants perform in predicting species abundance distributions for different species and to select the most accurate model, accounting for species and data features. To this end, we: (1) tested the predictive performance of different modelling approaches using modified Taylor diagrams and various metrics of accuracy, discrimination and precision; (2) compared RF spatial variants with spatial regression modelling approaches and the random generalised linear model (RGLM); and (3) examined the influence of the complexity of spatially correlated random effects on their predictive performance.

Methods

Data simulation

To evaluate the effect of species and data features on the performance of RF spatial variants in predicting species abundance, we simulated virtual species abundance data as proposed by Guélat and Kéry (2018). Scenarios including cases with spatial autocorrelation and imperfect detection were considered in the present study. The following model was used to generate abundance data sets within a \(50 \times 50\) cell landscape.

$$\begin{aligned} N_i\sim \text{Poisson}(\lambda _i), \end{aligned}$$
(1)
$$\begin{aligned} \log (\lambda _i)= \beta _0 + \beta _1 x_{1i} + \beta _2 x_{2i} + \beta _3 x_{3i} + \gamma \rho _i, \end{aligned}$$
(2)
$$\begin{aligned} \rho _i\sim \text{MVN}(0, \sigma ^2 e^{-\theta d_{ij}}), \end{aligned}$$
(3)
$$\begin{aligned} C_{it}\sim \text{Binomial}(N_i, p). \end{aligned}$$
(4)

\(N_i\) is the true latent abundance generated randomly at each site i. \(x_1,\) \(x_2\) and \(x_3\) are continuous covariates following the standard normal distribution, but only \(x_1\) is informative (\(x_2\) and \(x_3\) were uninformative independent variables), as described in Table 1. \(\rho _i\) is a spatially correlated random effect produced by an exponential correlation function, \(\beta _0\) represents a constant term contributing to abundance, \(\beta _1\) is the growth rate coefficient of the exponential function and \(\gamma\) is the strength of the autocorrelation. However, we also tested scenarios where \(\rho _i\) was a spatially correlated random effect produced by an exponential function with a quadratic term \(({x_1}^2).\)

Table 1 Simulation parameters used to generate species abundance data

The strength of pairwise correlations in the landscape depends on the distance \(d_{ij}\) between sites i and j and is determined by the covariance matrix of the multivariate normal distribution (MVN). Additionally, \(\sigma ^2\) is the spatial variance and \(\theta\) is the scale parameter controlling the distance-dependent decay of the spatial correlation in the expected abundance. \(C_{it}\) represents the observed data, that is, the count at site i during visit t, and the detection probability p describes abundance measurement error. Observed data are randomly generated conditional on \(N_i.\) For each site, three visits were considered to account for imperfect detectability (p). A common species occurs in large numbers in a specific ecosystem or habitat, while a rare species occurs in low numbers. For each case, 200 datasets were generated and 200 sites were selected to test the models.
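The data-generating model in Eqs. 1 to 4 can be sketched in a few lines of code. The following is a minimal illustration only, not the simulation code used in this study: the grid is reduced to \(8 \times 8\) cells, all parameter values are hypothetical rather than those of Table 1, and the uninformative covariates \(x_2\) and \(x_3\) are left out of the linear predictor on the assumption that their coefficients are zero. The helper names (`cholesky`, `poisson`, `simulate`) are introduced here for illustration.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def poisson(lam, rng):
    """Draw one Poisson variate with Knuth's method (adequate for moderate lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(side=8, beta0=1.0, beta1=0.5, gamma=1.0,
             sigma2=1.0, theta=0.5, p=0.6, visits=3, seed=1):
    """Return latent abundances N_i and repeated counts C_it (hypothetical settings)."""
    rng = random.Random(seed)
    sites = [(r, c) for r in range(side) for c in range(side)]
    n = len(sites)
    # Eq. 3: exponential covariance sigma^2 * exp(-theta * d_ij)
    cov = [[sigma2 * math.exp(-theta * math.dist(a, b)) for b in sites] for a in sites]
    low = cholesky(cov)
    z = [rng.gauss(0, 1) for _ in range(n)]
    rho = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # Eq. 2: log-linear expected abundance
    lam = [math.exp(beta0 + beta1 * x1[i] + gamma * rho[i]) for i in range(n)]
    # Eq. 1: latent abundance; Eq. 4: binomial counts on each visit
    N = [poisson(l, rng) for l in lam]
    C = [[sum(rng.random() < p for _ in range(N[i])) for _ in range(visits)]
         for i in range(n)]
    return N, C


N, C = simulate()
```

Every observed count is bounded above by the latent abundance at that site, which is how imperfect detection enters the simulated data.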

Dispersion coefficient \(\alpha\) was used to determine the dispersion level for each dataset and each case. The coefficient \(\alpha\) was estimated using ordinary least squares (OLS) auxiliary regression. It was tested using the t statistic, which is asymptotically standard normal under the null hypothesis of equidispersion (Kleiber and Zeileis 2008). Standard Poisson models the conditional mean \(E(y)=\mu ,\) assumed equal to the variance \((var(y)=\mu )\) in case of equidispersion. We estimated the dispersion parameter \(\alpha\) with the linear variance function (quasi-Poisson model) such that

$$\begin{aligned} var(y) = \mu +\alpha \mu = \mu (1 + \alpha ). \end{aligned}$$
(5)

Overdispersion corresponds to \(\alpha > 0\) and underdispersion corresponds to \(\alpha < 0\) (Cameron and Trivedi 2005).
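As a worked illustration of Eq. 5, the auxiliary regression of Cameron and Trivedi can be written out directly. This is a sketch under stated assumptions and not the AER implementation: the fitted means `mu` are assumed to come from a previously fitted Poisson model (here an intercept-only fit, so every fitted mean equals the sample mean), and the function name `dispersion_test` is hypothetical. With the linear variance function, the auxiliary regression through the origin reduces to the mean of the auxiliary variable.

```python
import math


def dispersion_test(y, mu):
    """Estimate alpha in var(y) = mu * (1 + alpha) and return (alpha, t statistic)."""
    n = len(y)
    # Auxiliary variable: ((y - mu)^2 - y) / mu has expectation alpha under Eq. 5
    aux = [((yi - mi) ** 2 - yi) / mi for yi, mi in zip(y, mu)]
    alpha = sum(aux) / n
    sd = math.sqrt(sum((a - alpha) ** 2 for a in aux) / (n - 1))
    t_stat = alpha / (sd / math.sqrt(n))
    return alpha, t_stat


# Toy counts with one large value; intercept-only fitted mean (assumption)
y = [0, 0, 0, 0, 10]
mu = [sum(y) / len(y)] * len(y)
alpha, t_stat = dispersion_test(y, mu)
```

For this toy sample the estimate is \(\alpha = 7\), that is, a variance eight times the mean, although with only five observations the t statistic (1.4) remains small.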

Predictive methods

Spatial linear model

Suppose that \(y = (y(s_1), \ldots , y(s_n))\) is a response vector observed at locations \(s_i \in D \subset {\mathbb {R}}^2\). The spatial regression model is given by:

$$\begin{aligned} y = X\beta + Z + \epsilon , \end{aligned}$$
(6)

where X is an \(n \times p\) design matrix for the covariates, n is the number of locations (sample size), \(\beta\) is a \(p \times 1\) vector of unknown regression coefficients, Z is an \(n \times 1\) vector of spatially autocorrelated random variables, \(\epsilon\) is an \(n \times 1\) vector of independent random errors.

The spatial structure of \(\epsilon _i\)’s covariance is defined by a variogram. The \(n \times n\) covariance matrix \(\Sigma\) for the spatial linear model is given by:

$$\begin{aligned} \Sigma = \text{var}(Z) + \text{var}(\epsilon ) = R + \sigma _\epsilon ^2 I. \end{aligned}$$
(7)

We assume a stationary covariance function that depends on Euclidean distance and has an exponential form to simplify the estimation of Eq. 7. Thus, the (i, j) entry of R is expressed as:

$$\begin{aligned} \text{cov}(z(s_i), z(s_j)) = \sigma _z^2 \exp \left( -\frac{\Vert s_i - s_j\Vert }{A}\right) \end{aligned}$$
(8)

where \(\Vert \cdot \Vert\) denotes the Euclidean distance metric, and \(\sigma _z^2\) and A are the parameters to be estimated. In the geostatistical literature, the parameters \(\sigma _z^2,\) A, and \(\sigma _\epsilon ^2\) are referred to as the partial sill, range, and nugget, respectively. The nugget parameter models the residual variation in the response when the separation distance is zero (Cressie and Wikle 2011). The spatial linear model (SLM) in Eq. 6 is also known as the Universal Kriging model when used for spatial forecasting; the Ordinary Kriging model is the special case in which X is an \(n \times 1\) column vector of ones (Cressie and Wikle 2011; Fox et al. 2020). In the present study, we compared four types of covariance functions (exponential, Matérn, spherical, and Gaussian) for the spatial linear model (Chilès and Delfiner 2012), and the best covariance function was selected for model comparison. Restricted maximum likelihood (REML) estimation was used for parameter estimation of the spatial regression model (Webster and Oliver 2007).
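To make Eqs. 7 and 8 concrete, the covariance matrix \(\Sigma\) can be assembled from site coordinates and the three variogram parameters. The sketch below is illustrative only; the function name `exponential_covariance` and the parameter values are hypothetical, and no parameter estimation (REML) is attempted.

```python
import math


def exponential_covariance(coords, partial_sill, range_par, nugget):
    """Build Sigma = R + nugget * I with R_ij = partial_sill * exp(-||s_i - s_j|| / A)."""
    n = len(coords)
    sigma = [[partial_sill * math.exp(-math.dist(coords[i], coords[j]) / range_par)
              for j in range(n)] for i in range(n)]
    for i in range(n):
        sigma[i][i] += nugget  # independent error variance on the diagonal
    return sigma


# Two sites 5 units apart; hypothetical partial sill, range and nugget
Sigma = exponential_covariance([(0, 0), (3, 4)], partial_sill=2.0, range_par=5.0, nugget=0.5)
```

The diagonal equals the partial sill plus the nugget (2.5 here), while the covariance between two sites separated by exactly one range unit A falls to the partial sill times \(e^{-1}\).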

Random forest

Random forest (RF) is a data-driven statistical method, an ensemble learning algorithm primarily used for classification or regression. It was developed to improve the prediction accuracy of classification and regression trees by combining a large set of decision trees (Breiman 2001). The algorithm benefits from two powerful techniques: random subspace selection at each split, using the Classification and Regression Trees (CART) split criterion (Breiman et al. 1984), and bagging (a contraction of “bootstrap-aggregating”) of unpruned decision tree learners (Breiman 1996). In regression, RF predictions \((\hat{\theta })\) are obtained by averaging results from a given number (B) of individual decision trees \((t_b^*)\) based on generated bootstrap samples (K), as described in the literature (Breiman et al. 1984; Breiman 2001; Prasad et al. 2006; Biau and Scornet 2016; Hengl et al. 2018):

$$\begin{aligned} \hat{\theta }^B(x) = \frac{1}{B} \sum _{b=1}^{B} t_b^*(x), \end{aligned}$$
(9)

where b is an individual bootstrap sample, \(t_b^*\) is an individual decision tree, B is the total number of trees, \(t_b^*(x) = t\left( x; z_{b1}^*, z_{b2}^*, \ldots , z_{bK}^*\right) ,\) and \(z_{bk}^{*}\) \((k=1, \ldots , K) = (y_k, x_k)\) is the kth training sample with pairs of values for the target variable (y) and covariates (x). We used the default regression values for the number of candidate variables considered at each split \((mtry = p/3),\) the number of trees \((ntree = 500),\) and the minimum terminal node size \((nodesize = 5),\) as these are often good options (Liaw and Wiener 2002; Díaz-Uriarte and Alvarez de Andrés 2006).
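The averaging in Eq. 9 can be illustrated with a deliberately simplified ensemble. The sketch below is not the randomForest implementation: each tree is a one-split regression stump on a single covariate, so there is no random subspace selection (mtry) and no node-size control, and the names `fit_stump` and `bagged_forest` are hypothetical. It only shows bootstrap resampling followed by averaging of the tree predictions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) by minimising the squared error."""
    mean = sum(ys) / len(ys)
    best = (float("inf"), None, mean, mean)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, thr, ml, mr)
    _, thr, ml, mr = best
    # With no valid split (all x equal) the stump predicts the sample mean
    return lambda x: ml if thr is None or x <= thr else mr


def bagged_forest(xs, ys, n_trees=50, seed=0):
    """Average n_trees stumps, each fitted to a bootstrap sample (Eq. 9)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / len(trees)


xs = list(range(10))
ys = [0] * 5 + [10] * 5  # step-shaped toy response
predict = bagged_forest(xs, ys)
```

On this toy step function the ensemble prediction is close to 0 for small x and close to 10 for large x, because almost every bootstrap stump recovers the same split.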

Random forest variants for spatial framework

Random Forest for spatial data (RFSP)


RF is a non-spatial approach in that it does not take into account spatial heterogeneity or general sampling schemes when estimating model parameters. This may lead to non-optimum predictions and systematic over- or under-prediction, especially when spatial autocorrelation is high and point patterns show clear sampling bias. To overcome this, Hengl et al. (2018) proposed the “RFSP”, which uses buffer distances as additional predictor variables.

$$\begin{aligned} y = f\left( x_P,x_G\right) , \end{aligned}$$
(10)

where \(x_P\) represents process-related covariates and \(x_G\) are covariates that take into account geographical proximity and spatial relationships between sampled sites.

$$\begin{aligned} x_G = \left( d_{p1},d_{p2},d_{p3},\ldots , d_{pN}\right) , \end{aligned}$$
(11)

where \(d_{pi}\) is the buffer distance to the sampled location \(p_i,\) and N is the total number of sampled sites.
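The construction of \(x_G\) in Eq. 11 amounts to appending one distance column per sampled site to the covariate table. The following sketch assumes Euclidean distances between point coordinates and uses a hypothetical helper name, `add_buffer_distances`; it does not reproduce the raster-based workflow of Hengl et al. (2018).

```python
import math


def add_buffer_distances(covariates, coords):
    """Append the distance from each site to every sampled site as extra predictors."""
    return [list(row) + [math.dist(s, p) for p in coords]
            for row, s in zip(covariates, coords)]


# Three sites with one process-related covariate each (toy values)
x_p = [[0.1], [0.2], [0.3]]
sites = [(0, 0), (3, 4), (6, 8)]
design = add_buffer_distances(x_p, sites)
```

With N sampled sites the design matrix gains N columns, so the number of predictors grows with the sample size; this is the main computational cost of the approach.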


Spatial Random Forest


In this study, we use the term Spatial Random Forest (SRF) for a spatial RF variant in which only the spatial coordinates, rather than buffer distances, are added as additional predictor variables:

$$\begin{aligned} y = f\left( x_P,x_G \right) , \end{aligned}$$
(12)

where y is the dependent variable, \(x_P\) represents process-related predictor variables and \(x_G =(X,Y)\) with X and Y being spatial coordinates.


Random Forest residual Kriging


The RF residual kriging (RFRK) model is configured to perform a spatial adjustment by ordinary kriging on the standard RF residuals.


Geographical Random Forest


Geographical Random Forest (GRF) is a spatial RF variant developed by Georganos et al. (2021), in which the global process is decomposed into several local sub-models of nearby observations. It is based on the concept of spatially varying coefficient models used in geographically weighted regressions (Fotheringham et al. 2003; Stewart et al. 2017). A local random forest is computed for each sampled location i, but only the nearest ni observations are considered, resulting in the computation of a random forest for each training data location. This increases the flexibility of the locally calibrated RF compared to the global RF. Using the simplified version of the linear equation, we have the following:

$$\begin{aligned} y_i = ax_i + e =a(X_i,Y_i)x_i + e,\quad i = 1:n \end{aligned}$$
(13)

where \(y_i\) is the observed value of the dependent variable for the location i, a is a coefficient, \(a(X_i,Y_i )x_i\) is the (nonlinear) prediction of the locally calibrated RF model at location i, and \((X_i,Y_i)\) are its coordinates.

The longest distance between a data point and its kernel is called bandwidth, while the area where the sub-model operates is called neighbourhood (or kernel). In this work, we used an adaptive kernel where the given number of nearest neighbours to be selected determines the neighbourhood (Brunsdon et al. 1998; Fotheringham et al. 2003).
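The adaptive kernel can be illustrated by selecting, for a given site, its nearest neighbours and reporting the resulting bandwidth. This is a minimal sketch with a hypothetical function name, `adaptive_neighbourhood`; it assumes Euclidean distances and counts the focal site as a member of its own neighbourhood, which may differ from the convention of a particular GRF implementation.

```python
import math


def adaptive_neighbourhood(coords, i, n_neighbours):
    """Return the indices of the n nearest sites to site i and the kernel bandwidth."""
    order = sorted(range(len(coords)), key=lambda j: math.dist(coords[i], coords[j]))
    members = order[:n_neighbours]
    # Bandwidth: distance from the focal site to the farthest member of its kernel
    bandwidth = math.dist(coords[i], coords[members[-1]])
    return members, bandwidth


coords = [(0, 0), (1, 0), (5, 0), (2, 0)]
members, bandwidth = adaptive_neighbourhood(coords, i=0, n_neighbours=3)
```

Because the number of neighbours is fixed, the bandwidth shrinks where sites are dense and widens where they are sparse; a local model would then be fitted to the `members` of each neighbourhood.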


Geographically Weighted Random Forest


Geographically Weighted Random Forest (GWRF) is a GRF model that gives more weight to observations that are spatially autocorrelated when calibrating the model. GRF does not use weights to calibrate the model. All observations have the same weight, regardless of geographical position. GWRF handles spatial autocorrelation and is therefore appropriate for data that are highly spatially autocorrelated. GRF is suitable for data that are not or only weakly autocorrelated (Georganos et al. 2021). We examined a range of neighbours’ values (n = 10, 20, 30, 40, and 50) to determine the optimal value setting for both GRF and GWRF.

Generalised least squares-based random forest

Just as generalised least squares (GLS) extends ordinary least squares (OLS) to account for dependence in linear models, random forest based on generalised least squares (RF-GLS) extends RF. RF-GLS estimates non-linear covariate effects in spatial mixed models where spatial correlation is handled by a Gaussian process. The following mixed model is considered for spatial point data:

$$\begin{aligned} y_i = m(x_i) + w(s_i)+ \epsilon _i, \end{aligned}$$
(14)

where \(y_i\) and \(x_i\) denote the observed values of the dependent and predictor variables, respectively, corresponding to the \(i\text{th}\) observed location \(s_i;\) \(m(x_i)\) denotes the covariate effect; \(w(s_i)\) is the spatial random effect accounting for spatial dependence beyond the covariates modelled by a Gaussian process, and \(\epsilon _i\) accounts for independent and identically distributed Gaussian random noise.

In the present study, the R package RandomForestsGLS was used to fit and predict the abundance of species (Saha et al. 2023). This R package uses the computationally efficient Nearest Neighbour Gaussian Process (NNGP) (Datta et al. 2016). Model parameters were estimated from the generated data following Saha et al. (2022). Spatial prediction at new locations is provided by nonlinear kriging, which integrates the non-linear mean estimate and the spatial kriging estimate from the BRISC package (Saha and Datta 2018), as explained by Saha et al. (2022). We evaluated the four different covariance functions supported by the RandomForestsGLS package: exponential, Matérn, spherical, and Gaussian. We also evaluated different numbers of neighbours used in the NNGP (10, 20, 30, 40 and 50). The results of this study were derived using the exponential covariance function, as it best predicts the structure of the simulated data.

GLM hybrid methods with random forest and ordinary kriging

GLM and Ordinary Kriging hybrid model

For the GLM and Ordinary Kriging hybrid (GLM-OK) model, a spatial adjustment was performed by ordinary kriging on the residuals of the Poisson model. The Poisson distribution was used to account for the count nature of the data.


The Random Generalised Linear Model


The Random Generalised Linear Model (RGLM) is an ensemble prediction method based on bootstrapped GLMs, with predictor variables selected using forward regression with the AIC criterion. Since species abundances are count data, the Poisson distribution was used in the RGLM. Song and Langfelder (2022) described the construction of the RGLM method.

Model performance assessment

We randomly sampled 200 observations from the generated datasets, which were divided into training (80%), and testing (20%) datasets. We used 5-fold cross-validation to evaluate the performance of the selected methods (Hastie et al. 2009). Performance metrics were averaged over 200 independent simulation runs to reduce the influence of randomness associated with 5-fold cross-validation. The measures of accuracy, precision, and discrimination were used to assess the predictive performance of the models that were tested.

  • Accuracy: The root mean square error (RMSE) and the mean error (ME) or bias between the RF predicted value and the observed abundance values at the sampled location were estimated as follows:

    $$\begin{aligned} \text{RMSE}= \sqrt{\frac{1}{n} \sum_{i=1}^{n}(O_i - P_i)^2}, \end{aligned}$$
    (15)
    $$\begin{aligned} \text{ME}= \frac{1}{n} \sum_{i=1}^{n}(O_i - P_i), \end{aligned}$$
    (16)

    where \(O_i\) and \(P_i\) refer to observed and predicted species abundance at sampled locations i, respectively, and n is the number of sampled locations (number of observations).

  • Precision: We used the mean of the standard deviations of the predictions (SDP).

  • Discrimination: The mean squared Spearman rank correlation coefficient \(R^2 = \rho _s^2\) between the predicted and observed abundance of species at the sampled sites and the modelling efficiency coefficient (MEC) of Nash and Sutcliffe (1970) were used.

    The n raw observed \(O_i\) and predicted \(P_i\) are converted to ranks \(R(O_i),\) \(R(P_i)\) and (\(\rho _s\)) is defined as their Pearson correlation coefficient.

    $$\begin{aligned} \rho _s = \frac{\text{cov}(R(O), R(P))}{\sigma _{R(O)} \sigma _{R(P)}}, \end{aligned}$$
    (17)

    where \(\text{cov}(R(O), R(P))\) represents the covariance between the rank variables of O and P, \(\sigma _{R(O)}\) and \(\sigma _{R(P)}\) are standard deviations of the rank variable of O and P.

    The MEC is calculated as:

    $$\begin{aligned} \text{MEC} = 1 - \frac{\sum _{i=1}^{n}(O_i - P_i)^2}{\sum _{i=1}^{n}(O_i - \bar{O})^2}, \end{aligned}$$
    (18)

    where \(O_i\) and \(P_i\) refer to observed and predicted species abundance at sampled location i, n is the number of sampled locations and \(\bar{O}\) represents the mean observed species abundance across all sampled locations.
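The metrics in Eqs. 15 to 18 can be computed directly from paired observed and predicted values. The following is a minimal sketch with hypothetical helper names (`ranks`, `pearson`, `performance`); ties are assumed to receive average ranks, which is the usual convention for Spearman's coefficient. SDP is omitted because it is computed over repeated predictions rather than a single pair of vectors.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            out[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return out


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def performance(obs, pred):
    """Return RMSE (Eq. 15), ME (Eq. 16), squared Spearman rho (Eq. 17) and MEC (Eq. 18)."""
    n = len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    mean_obs = sum(obs) / n
    rho_s = pearson(ranks(obs), ranks(pred))
    return {
        "RMSE": math.sqrt(sse / n),
        "ME": sum(o - p for o, p in zip(obs, pred)) / n,
        "R2": rho_s ** 2,
        "MEC": 1 - sse / sum((o - mean_obs) ** 2 for o in obs),
    }


scores = performance([0, 2, 4, 6], [1, 2, 3, 8])
```

In this toy example the predictions preserve the ranking of the observations exactly, so the squared Spearman coefficient is 1 even though the MEC is only 0.7; the two discrimination measures therefore capture different aspects of performance.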

Model comparison

Kruskal–Wallis tests (Kruskal and Wallis 1952) were used to identify significant differences in predictive performance between the models tested, as the different performance measures were either not normally distributed according to the Shapiro and Wilk (1965) test of normality, had heterogeneous variance according to the Fligner and Killeen (1976) test of homogeneity of variances, or both. Dunn’s (1964) post-hoc test was then used for pairwise comparisons of predictive performance. P-values were adjusted using the Benjamini and Hochberg (1995) procedure.
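For readers unfamiliar with the Benjamini and Hochberg (1995) adjustment, the step-up procedure can be written in a few lines. This sketch uses a hypothetical function name, `benjamini_hochberg`, and toy p-values; it is not the code used in this study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    n = len(p_values)
    # Walk from the largest p-value down, enforcing monotonicity with a running minimum
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each p-value is multiplied by the number of tests divided by its ascending rank, and adjusted values are never allowed to exceed those of larger raw p-values.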

To graphically summarise the predictive performance of the models studied, a modified Taylor (2001) diagram was used. In the standard diagram, three statistics (the standard deviation of predictions, the root mean square error (RMSE) and the Pearson correlation coefficient) are plotted on a single graph. We used this diagram because an accuracy metric such as RMSE alone is not meaningful for highly skewed data. We replaced Pearson’s coefficient with Spearman’s rank-order correlation coefficient because the distributions of species abundance for common and rare species were skewed and zero-inflated, and we used the normalised standard deviation of predictions (NSTD) and the centred RMSE (CRMSE). The standard deviations of observed and predicted species abundances are calculated respectively by:

$$\begin{aligned} \sigma _o = \sqrt{\frac{1}{n} \sum _{i=1}^{n} (O_i - \bar{O})^2} \end{aligned}$$
(19)

and

$$\begin{aligned} \sigma _p = \sqrt{\frac{1}{n} \sum _{i=1}^{n} (P_i - \bar{P})^2}. \end{aligned}$$
(20)

The normalized STD (NSTD) is obtained by:

$$\begin{aligned} \text{NSTD} = \frac{\sigma _p}{\sigma _o}. \end{aligned}$$
(21)

The centred RMSE (CRMSE) is defined as follows:

$$\begin{aligned} \text{CRMSE} = \sqrt{\frac{1}{n} \sum _{i=1}^{n} [(P_i - \bar{P}) - (O_i - \bar{O})]^2}. \end{aligned}$$
(22)

For each case, the best model was considered to be the one with the lowest CRMSE (in the perfect model, the CRMSE would be equal to 0) and with the normalised standard deviation and Spearman correlation coefficient closest to 1 (Jiang et al. 2015).
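The two Taylor-diagram statistics in Eqs. 19 to 22 can be computed as follows. This is an illustrative sketch with a hypothetical function name, `taylor_statistics`, and toy data.

```python
import math


def taylor_statistics(obs, pred):
    """Return (NSTD, CRMSE) following Eqs. 19 to 22."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sd_o = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    crmse = math.sqrt(sum(((p - mp) - (o - mo)) ** 2 for o, p in zip(obs, pred)) / n)
    return sd_p / sd_o, crmse


# Predictions shifted by a constant bias: same spread, same pattern
nstd_shift, crmse_shift = taylor_statistics([1, 2, 3], [6, 7, 8])
# Predictions with twice the spread of the observations
nstd_scaled, crmse_scaled = taylor_statistics([1, 2, 3], [3, 5, 7])
```

Because both statistics are centred, a constant bias leaves them at their ideal values (NSTD of 1 and CRMSE of 0), which is why the mean error is reported separately from the diagram.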

Computational environment

Models were implemented in R version 4.2.2 (R Core Team 2022) using the raster package to manipulate data (Hijmans 2023), gstat for geostatistical analysis (Gräler et al. 2016) and AER to test the significance of dispersion (Kleiber and Zeileis 2008). The R packages used to predict the distribution of species abundance in this study are listed in Table 2.

Table 2 R packages and functions used to predict species abundance

Results

Table 3 presents descriptive statistics of simulated species abundances and their variation, including minimum and maximum values, arithmetic mean, median, and sum. As medians are smaller than means, simulated abundance datasets are skewed to the right for all species.

Table 3 Descriptive statistics of the simulated species abundance

Variogram parameters of simulated species abundance are presented in Table 4. In all cases, variogram parameters are right-skewed, indicating that there are some extremely high values in the simulated dataset that pull the variogram parameter means upwards. This pattern is particularly apparent when the species is common with a high probability of detection and the abundance data are highly spatially autocorrelated. Data sets with large range produced fewer than five blocks in the \(50\times 50\) space. They were removed from the analysis because they did not produce enough blocks for fivefold cross-validation. Figure 1 shows how the spatial structure of the data, imperfect detection and species characteristics like rarity and commonness influence the variation in the abundance of a species. For example, a high degree of overdispersion will be observed if spatial autocorrelation is high and the probability of detecting a species is high, especially if the species is common. In contrast, if the probability of detecting a species is low, the spatial autocorrelation is low, and especially if the species is rare, there will naturally be little variability in the distribution of abundance and hence little dispersion in the abundance data. Results show that random sampling is effective in dealing with the dispersion in abundance data, whether the species is rare or common, even with significant spatial autocorrelation, as long as the sample is representative of the population. However, the dispersion of data collected within a population (sample) and the dispersion of data used for model training/evaluation may differ from the population. Therefore, a large sample is needed.

Table 4 Variogram parameters for simulated species abundance data

Fig. 1 Dispersion coefficients of species abundance data based on predictor variables

For each case, the proportion of simulated databases with significant overdispersion is shown in Table 5. Results reveal that when the species is common, the presence of high spatial autocorrelation leads to statistically significant over-dispersed abundance data in all (100%) of the simulated populations. When the spatial autocorrelation is low and the probability of detection is low, the relative abundance data for a common species is still highly over-dispersed in the majority of simulated populations (91%). However, when both spatial autocorrelation and detection probability are low, almost all simulated populations (98%) will have equidispersed abundance data if the species is rare, whereas when spatial autocorrelation is low but detection probability is high for a rare species, almost all simulated populations (98.5%) will have overdispersed abundance data. The fact that random sampling within a population can significantly alter the degree of dispersion within the sample collected is also highlighted. Depending on the data used to train and validate the models tested in this study, the proportions of over-dispersed samples vary significantly.

Table 5 Dispersion levels of species abundance data

Assessment of model performance using single performance metrics for species abundance prediction

Random spatial effect is an exponential correlation function with linear terms only

Table 6 presents Kruskal–Wallis and Dunn’s post hoc tests results, which demonstrate that the predictive performance of the different modelling approaches is influenced by a number of factors, including selected metrics, abundance class, species detection probability, and spatial autocorrelation.

Case 1: Results reveal that the different modelling approaches have low predictive accuracy and modelling efficiency, with weaker relationships between predictions and observed abundances. RFGLS and SRF models exhibited the highest precision, although the RFGLS model demonstrated the weakest relationships between predicted and observed abundance. The least efficient model was the RGLM, with the RFSP and SRF yielding the highest modelling efficiency and predictive power.

Case 2: The models’ predictive accuracies are similar, but the GLM-OK model underpredicted species abundance and exhibited greater bias. The RFGLS and SRF models are more precise. The RFSP and GLM-OK models have the highest \(R^2\) and MEC values.

Case 3: RMSE values differ among the modelling approaches, while the predictive bias does not. The RGLM and GLM-OK models are the most accurate for abundance predictions, while OK is the least accurate. The predictive accuracy of the spatial RF variants is similar to that of the standard RF for common species with low spatial autocorrelation and high detection probability. OK, despite low accuracy, provides the most precise predictions. RGLM and GLM-OK are the least precise models but offer the highest predictive power and efficiency. RFSP, RFRK, and SRF are the spatial RF variants that exhibited the highest \(R^2\) and MEC values.

Case 4: RMSE and biases of the models varied significantly. RGLM, GLM-OK, and GLM models demonstrated the highest predictive accuracy, while the GLM-OK model exhibited the highest bias. OK exhibited the highest precision, while RGLM, GLM-OK, and Poisson models demonstrated the highest predictive power and modelling efficiency. In comparison to cases involving common species, the accuracy metrics are relatively low for rare species due to the limited variation in their abundance.

Case 5: The models’ predictive accuracy and biases were similar, except for the RGLM, which is sensitive to extreme values. The GLM-OK model is the least precise, while the RFGLS and SRF are the most precise. GLM-OK and RFSP have the highest predictive power, while SRF has the highest modelling efficiency. RGLM has the worst MEC, tending to negative infinity.

Case 6: Predictive accuracy and bias of the various modelling approaches are similar, with the RGLM exhibiting a sensitivity to extreme values in terms of bias and RMSE. The RFGLS is the most precise. The GLM-OK is the most biased, the least precise and underestimates species abundance, but is among models with the highest predictive power. The standard RF and RGLM models exhibited the lowest \(R^2\) values and yielded MEC values that tend towards negative infinity. SRF, followed by RFSP and OK models, exhibited the highest modelling efficiency.

In Cases 7 and 8, Poisson and Poisson hybrid models exhibited the greatest predictive accuracy, power, and efficiency. Conversely, OK exhibited the least predictive accuracy, the lowest predictive power, and modelling efficiency for case 8. However, it is the most precise model. GLM-OK exhibited bias, underestimated rare species abundance, and was the least precise. SRF, followed by RFGLS and RFRK, are spatial RF variants with the best predictive power and efficiency. However, they were not significantly different from standard RF.

Table 6 Predictive performance of various modelling approaches when the spatial autocorrelation is exponential with linear terms only
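The single metrics compared across the cases above can be sketched as follows. The definitions are assumptions based on common usage, since the formulas are not restated in this section: bias as the mean error, precision as the standard deviation of predictions, \(R^2\) as the squared Pearson correlation, and MEC as one minus the ratio of the error sum of squares to the total sum of squares. The function name is hypothetical.

```python
import math
from statistics import mean, pstdev


def abundance_metrics(observed, predicted):
    """Single-metric summaries for predicted versus observed abundance."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    obs_mean = mean(observed)
    pred_mean = mean(predicted)
    sse = sum(e ** 2 for e in errors)
    sst = sum((o - obs_mean) ** 2 for o in observed)
    if sst == 0:
        raise ValueError("observed abundance has zero variance")
    # Squared Pearson correlation as R^2 (an assumption; other definitions exist)
    cov = sum((o - obs_mean) * (p - pred_mean)
              for o, p in zip(observed, predicted))
    spp = sum((p - pred_mean) ** 2 for p in predicted)
    r2 = cov ** 2 / (sst * spp) if spp > 0 else float("nan")
    return {
        "rmse": math.sqrt(sse / n),
        "bias": mean(errors),              # > 0 means overprediction
        "precision_sd": pstdev(predicted),  # spread of the predictions
        "r2": r2,
        "mec": 1 - sse / sst,              # 1 is perfect; < 0 is worse than the observed mean
    }
```

Under these definitions, a single extreme prediction inflates the error sum of squares without bound, which is one way an MEC can become strongly negative while \(R^2\) stays between 0 and 1.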

Spatial random effect is an exponential correlation function with a quadratic term

Accuracy


Table 7 shows that different modelling approaches predict species abundance without significant difference in accuracy when spatial random effect is an exponential correlation function with a quadratic term and the spatial autocorrelation is high. However, the RGLM is sensitive to the influence of outliers when the species is rare, although this difference is not statistically significant due to the high variability of estimates. Poisson model and its hybrid models exhibit higher prediction accuracy than RF and its spatial variants when the spatial autocorrelation is low or absent. The GLM-OK model is more biased than other models in most cases. When species are rare, spatial RF variants yield predictions that are comparable to those of the standard RF. However, when species detection probability is low, the RFSP produces higher biases than the standard RF, but not statistically different. For common species, the RFSP overestimates abundance predictions when spatial autocorrelation is low or absent.


Precision


RFGLS and SRF models are more precise for common species with high spatial autocorrelation, while the RFRK model is the least precise. OK is the most precise model, followed by SRF and RFGLS, while the Poisson model and its hybrids (RGLM and GLM-OK) are the least precise when spatial autocorrelation is low or absent. The RFGLS and SRF models are more precise for rare species with high spatial autocorrelation, while the OK model remains the most precise for rare species with low or no autocorrelation (see Table 7).


Discrimination


Results reveal that RFSP has the highest predictive power \((R^2)\) for common species with high spatial autocorrelation, followed by GLM-OK and GWRF. Standard RF and RFGLS have the lowest \(R^2\) regardless of species abundance class for high spatially autocorrelated species abundance. GLM-OK, RGLM, RFSP, and Poisson models achieve the best \(R^2\) for common species with weak or no spatial autocorrelation, while OK yields the lowest \(R^2\) regardless of species abundance class. GLM-OK and RFSP models have the highest \(R^2\) for rare species with high spatial autocorrelation, while SRF and RFSP have the highest efficiency coefficients when spatial autocorrelation is high. RGLM has the highest efficiency coefficient when spatial autocorrelation is low or absent. However, it does not differ from GLM-OK when the species is common (see Table 7).

Table 7 Predictive performance of various modelling approaches when the spatial correlation is exponential with a quadratic term

Evaluation of models’ predictive performance using modified Taylor diagrams

Training performance

Figures 2 and 3 illustrate that hybrid Poisson regression models are more precise than spatial variants of RF when predicting (interpolating) common species abundance. If the spatial autocorrelation is high and the species is rare, RFSP or RFRK provide more precise predictions with variability closer to observed abundance, while GLM-OK provides the most closely matching centred RMSE and Spearman correlation coefficients. However, hybrid Poisson models have larger centred RMSEs and smaller Spearman correlation coefficients between predicted and observed abundance than standard and spatial RF variants when the spatial autocorrelation is high and the species detection probability is high. In contrast, when spatial autocorrelation is low, the best centred RMSE and Spearman correlation coefficients are obtained from different models depending on the data complexity and species abundance class. For example, if the detection probability is high and the species is rare, hybrid Poisson models will give the best predictions, whereas if the detection probability is low, standard models (Poisson or RF) will give the best predictions in terms of accuracy and predictive power. Variants of RF (spatial or hybrid) produced predictions with better centred RMSE and Spearman correlation coefficients than the hybrid Poisson models when spatial autocorrelation is low and the species is common.

Fig. 2

Modified Taylor diagrams comparing the training performance of different modelling approaches for species abundance distributions with spatial randomness determined by an exponential correlation function with linear terms only

Fig. 3

Modified Taylor diagrams comparing the training performance of several modelling approaches for species abundance distributions where spatial randomness is determined by an exponential correlation function with a quadratic term

Predictive performance

Cases with linear terms only.

Figure 4 presents modified Taylor diagrams to illustrate the predictive performance of various modelling approaches for species abundance. With the exception of the RGLM, most models’ predictions are less variable than the observed abundance for common species with a high probability of detection and high spatial autocorrelation (Fig. 4a). This sub-figure shows that GWRF, despite not being the most precise model, gives the most accurate bias-adjusted predictions and the pattern of variability closest to that of the observed abundance. For common species with a low probability of detection and high spatial autocorrelation (Fig. 4b), the precision of the RGLM is closest to that of the observed abundance, but the most accurate predictions and the closest pattern of variability are obtained with GLM-OK, followed by several models with similar predictive performance (RFSP, GWRF, SRF, GLM and RFRK).

For common species with high probability of detection but low spatial autocorrelation, GLM hybrid models (RGLM and GLM-OK) followed by the Poisson model yield the best predictive performance. These models provide more accurate predictions with the closest pattern of variability with observed abundance, while RFRK is the best RF spatial variant (Fig. 4c). GLM hybrids have the best predictive accuracy and predictive power but are less precise compared to RF variants for case 4 species (see Fig. 4d). RFRK is the spatial RF variant with the closest predictive power, precision, and accuracy compared to observed abundance.

For rare species with high probability of detection and high spatial autocorrelation, points are clustered together, with RFRK followed by GWRF being the closest models to observed abundance (see Fig. 4e). RGLM does not appear on the plot because its mean-centred RMSE tends to positive infinity. GLM-OK, followed by GWRF (the best spatial RF variant), is the most appropriate model for rare species with low probability of detection and high spatial autocorrelation, while RFSP is the most precise (see Fig. 4f). Figures 4g and h show that Poisson model hybrids outperform spatial RF variants for case 7 and 8 species, with RFRK and RFGLS being the most effective spatial RF variants.

Fig. 4

Modified Taylor diagrams comparing the performance of several models in predicting species abundance distributions when the spatial randomness is determined by an exponential correlation function with linear terms only

Cases with a quadratic term

Figure 5 reveals that RF spatial variants predict abundance better for species with high spatial autocorrelation, while hybrid Poisson models are best for low autocorrelation. However, the choice of the best variant depends on the complexity of the data. Four models (RFSP, RFGLS, GWRF and GRF) yield the shortest radial distance from the reference point for common species with high detection probability and high spatial autocorrelation (see Fig. 5a). RFSP remains the best modelling approach for case 2 species, as shown in Fig. 5b. However, GRF and GWRF appear to be sensitive to a low probability of detection. Figure 5c reveals that for case 3 species, GLM-OK and Poisson are the best predictive models, while RFGLS, standard RF, and RFRK are the best-performing RF-based models. For case 4, RGLM outperforms other models, followed by GLM-OK and Poisson models. The RFSP is the best-performing spatial variant of the RF (see Fig. 5d). Figure 5e shows that when spatial autocorrelation is high for a rare species with a high probability of detection, the GLM-OK model yields predictions whose variability is closest to that of the observed abundance. However, GRF and GWRF have the smallest radial distances, making their predictions closer to the observed abundance. Figure 5f shows that when the probability of detection of a rare species is low, RFSP is the best-performing model, followed by GLM-OK and Poisson. Figures 5g and h show that for a rare species, when spatial autocorrelation is low or absent, hybrid models like RGLM and GLM-OK produce predictions closer to the observed abundance than RF spatial variants.

Fig. 5

Modified Taylor diagrams comparing the performance of several models in predicting species abundance distributions when the spatial randomness is determined by an exponential correlation function with a quadratic term
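The quantities read off a Taylor diagram, including the radial distance from the reference point used above to rank models, can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the Pearson correlation of the standard diagram (Taylor 2001), population standard deviations, and hypothetical function names. The modified diagrams in this study plot Spearman coefficients, in which case the distance no longer equals the normalised centred RMSE exactly.

```python
import math
from statistics import mean


def taylor_point(observed, predicted):
    """Normalised SD, Pearson correlation, and distance from the reference point."""
    o_bar, p_bar = mean(observed), mean(predicted)
    o_dev = [o - o_bar for o in observed]
    p_dev = [p - p_bar for p in predicted]
    n = len(observed)
    sd_o = math.sqrt(sum(d * d for d in o_dev) / n)
    sd_p = math.sqrt(sum(d * d for d in p_dev) / n)
    if sd_o == 0 or sd_p == 0:
        raise ValueError("both series need non-zero variance")
    corr = sum(a * b for a, b in zip(o_dev, p_dev)) / (n * sd_o * sd_p)
    # Centred RMSE: RMSE after removing each series' mean (overall bias is ignored)
    crmse = math.sqrt(sum((b - a) ** 2 for a, b in zip(o_dev, p_dev)) / n)
    sd_ratio = sd_p / sd_o
    # Law of cosines: distance from the model point to the reference point,
    # which sits at normalised SD = 1 and correlation = 1
    distance = math.sqrt(max(0.0, sd_ratio ** 2 + 1 - 2 * sd_ratio * corr))
    return {"sd_ratio": sd_ratio, "corr": corr,
            "crmse_norm": crmse / sd_o, "distance": distance}


def rank_models(observed, predictions_by_model):
    """Model names ordered from shortest to longest radial distance."""
    points = {name: taylor_point(observed, pred)
              for name, pred in predictions_by_model.items()}
    return sorted(points, key=lambda name: points[name]["distance"])
```

Because the means are removed before any statistic is computed, a model whose predictions are shifted by a constant sits exactly on the reference point. This is why the diagram has to be read alongside a separate measure of bias.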

Discussion

The increasing biodiversity loss due to climate change and human activities has led to a rise in species distribution models, but concerns have been raised about their reliability in predicting species distributions (Guillera-Arroita et al. 2015; Rizvanovic et al. 2019; Zhang et al. 2020). Statistical models are crucial in applied ecology, providing predictions that are as close to reality as possible (Houlahan et al. 2017; Norberg et al. 2019). However, they cannot replicate the complexity of ecological systems. Consequently, practitioners should be guided in selecting models and understanding the limits of predictive performance given the wide range of models available (Urban et al. 2016; Norberg et al. 2019).

Oppel et al. (2012) found that predicting species abundance is challenging, suggesting ensemble models as a solution. Machine learning approaches, particularly RF algorithms, have shown better predictive ability for species occurrence and abundance than artificial neural networks and classical regression-based models (e.g., Boulesteix et al. 2012; Jetz et al. 2019; Zhang et al. 2020; Waldock et al. 2022). RF’s ability to handle high-dimensional prediction problems, handle situations where predictor variables exceed observations, and capture complex relationships has made it attractive in various fields. However, spatial autocorrelation challenges the effectiveness of RF, despite its flexibility and non-linear nature (Hengl et al. 2018; Liu et al. 2018; Georganos et al. 2021; Saha et al. 2023).

Evaluation of alternative models using multiple diagnostic metrics

The study highlights the importance of performance metrics and model assessment approach choice in species abundance distribution modelling. It found that only a few models consistently provided the best accuracy, precision, and discrimination on hold-out datasets. High performance for some metrics may result in poor performance for other metrics for the same data set. Norberg et al. (2019) evaluated 33 species distribution models and found similar trends in predictive performance. None of the models performed well for all prediction tasks simultaneously. This can lead to poor global performance when performance assessment tools combine different metrics.

For example, this study suggests that using the standard deviation of predictions as an indicator of prediction precision may not be meaningful for skewed abundance datasets. The RFGLS model was found to be the most precise when spatial autocorrelation is high. However, modified Taylor plots revealed that this model overestimated precision, as its normalized standard deviation was closer to the origin but further from 1, indicating underestimation of observed species abundance variability.

The Kruskal–Wallis test followed by Dunn’s post hoc test revealed no difference in prediction accuracy between spatial variants of random forest and other modelling approaches for high spatially autocorrelated abundance datasets with imperfect species detection probability. Garcia-Marti et al. (2019) noted that for highly skewed data, such as highly autocorrelated species relative abundance, an accuracy metric like RMSE alone is not informative. Measures that concisely combine several performance metrics should be used to determine the degree of agreement between observed and predicted species abundances. Combining various metrics is the most effective method for model selection (Moriasi et al. 2015; Izzaddin et al. 2024). The geometric relationship of the Taylor diagram balances model fit measures, making it easier to assess performance (Taylor 2001).

However, Taylor’s diagram has shortcomings, such as failing to account for overall model bias and relying on RMSE (Gleckler et al. 2008; Hu et al. 2019). As a result, selecting the best-performing models by minimising RMSE favours those that underestimate observed variability, unless the correlation coefficient between model results and observations is equal to 1 (Izzaddin et al. 2024). Spatial RF variants offer better predictive accuracy and power for common species with high spatial autocorrelation and high probability of being detected but are less precise than hybrid Poisson models. For rare species with high spatial autocorrelation, hybrid Poisson regression models are less precise than OK, RF, and its spatial variants. Norberg et al. (2019) suggest that poorly calibrated models may provide overly confident predictions. They recommend fitting a limited set of models with complementary performance and using a cross-validation procedure with independent data to determine the best model for the study.
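The link between RMSE minimisation and underestimated variability can be made explicit with a short derivation. It is sketched here under the assumption of the standard Taylor diagram with a Pearson correlation \(R\), where \(\sigma_f\) and \(\sigma_r\) denote the standard deviations of the predictions and of the observations, and \(E'\) is the centred RMSE (symbols introduced here for illustration). The diagram rests on the identity (Taylor 2001)

\[
E'^2 = \sigma_f^2 + \sigma_r^2 - 2\,\sigma_f\,\sigma_r\,R .
\]

For fixed \(R\) and \(\sigma_r\), setting the derivative with respect to \(\sigma_f\) to zero gives

\[
\frac{\partial E'^2}{\partial \sigma_f} = 2\,\sigma_f - 2\,\sigma_r\,R = 0
\quad\Longrightarrow\quad
\sigma_f = R\,\sigma_r ,
\qquad
E'^2_{\min} = \sigma_r^2\,(1 - R^2) .
\]

For \(0 \le R < 1\), the centred RMSE is therefore smallest when the predictions vary less than the observations, and the two standard deviations coincide at the minimum only when \(R = 1\). With the Spearman coefficients used in the modified diagrams, the identity holds only approximately, so this argument should be read as indicative.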

Impact of data and species features on the predictive performance of different modelling approaches

Several studies have compared random forest and its variants with other models, but results are not always consistent. Li et al. (2011), Appelhans et al. (2015), Hengl et al. (2015) and Fox et al. (2020), comparing RF and spatial regressions, found that RF is superior when using many covariates with nonlinear relationships. Parmentier et al. (2011) and Temesgen and Ver Hoef (2014) showed that spatial regression models may outperform RF when using datasets with many spatially autocorrelated records. Li et al. (2017) found that GLMs and their hybrid methods with geostatistical techniques (OK) were less accurate than hybrid methods of RF and OK in modelling count data.

The current study examines various scenarios in which these alternative models may perform better or not. We found that RF spatial variants offer better predictive accuracy and power than other modelling methods when spatial autocorrelation and species probability of detection are high. However, they tend to underestimate species abundance variability more than hybrid Poisson models with RF and OK. The optimal RF spatial variant varies depending on species features and the relationship between abundance and independent variables. Hybrid Poisson models with RF and OK are more effective in predicting data that closely aligns with observed values when spatial autocorrelation is low or absent.

These results contrast with the claim of Saha et al. (2022) that all spatial variants of RF outperform the standard RF due to the lack of spatial information. Their conclusion only holds when spatial autocorrelation is very high and species are common and likely to be detected. The authors concluded that the RFGLS method was the most effective or one of the most effective methods in all contexts. However, due to the discrete nature of species abundance and the presence of overdispersion, the RFGLS method, although technically applicable to count data, is not the most appropriate choice for predicting abundance in most cases.

The RFSP is one of the best performing spatial RF variants, but not as good as other models such as RGLM, GLM-OK and Poisson for species abundance with low spatial autocorrelation. SRF and RFSP belong to the class of approaches where spatial features are added to the RF as additional covariates (Hengl et al. 2018; Saha et al. 2022). In 1 out of 16 cases, SRF outperforms RFSP, particularly for rare species with high detection probability and low or no spatial autocorrelation during training.
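The idea of adding spatial features to the RF as covariates can be sketched as a simple feature-augmentation step. This is an illustration under assumptions rather than the implementation of either method: which features SRF and RFSP each use is not detailed in this section, so the sketch appends geographic coordinates and, optionally, Euclidean distances to a set of reference locations (in the spirit of the buffer distances of Hengl et al. 2018). The function and parameter names are hypothetical.

```python
import math


def add_spatial_features(covariates, coords, reference_coords, use_distances=True):
    """Append spatial features to each covariate row.

    covariates: list of feature lists, one per site
    coords: list of (x, y) locations for those sites
    reference_coords: (x, y) locations (e.g. training observations)
        to which distances are computed
    """
    augmented = []
    for row, (x, y) in zip(covariates, coords):
        spatial = [x, y]  # coordinates as additional features
        if use_distances:
            # Euclidean distance from this site to every reference location
            spatial += [math.hypot(x - rx, y - ry) for rx, ry in reference_coords]
        augmented.append(list(row) + spatial)
    return augmented
```

The augmented rows can then be passed to any standard RF implementation. With distance features, the number of columns grows with the number of reference locations, which is one source of the added computational cost of this class of approaches.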

Saha et al. (2022) confirmed the findings of Mentch and Zhou (2020a) that adding noise covariates improves RF prediction performance at low signal-to-noise ratios, i.e. when the number of informative predictor variables is very small compared to the number of uninformative predictors. In this setting, they found that ignoring spatial variation in the RFRK fit led to a worse performance than the RFSP fit. The RFSP performed worse than the RFRK when the number of informative predictor variables was larger than the number of non-informative predictors.

Although many studies (Hengl et al. 2018; Georganos et al. 2021; Mentch and Zhou 2020a) have found that including geographic coordinates as additional features is a good practice when using ML with spatial data, Mentch and Zhou (2020b) showed that even the simple act of including additional pure noise variables in the model can dramatically improve the accuracy of out-of-sample predictions, sometimes significantly outperforming even the best-tuned standard RFs. Beyond this important limitation, adding geographic coordinates as features increases computational complexity. Generalisations about the most effective spatial variant of RF for predicting species relative abundance are difficult, as this choice depends on data characteristics and species features.

However, based on the agreement between model predictions and observed abundance data, GWRF, GRF, and in some cases, RFRK seem to be the most appropriate when the spatial autocorrelation is high, whereas RFRK seems to be the appropriate spatial RF variant when spatial autocorrelation is low. This suggests the need for accurate, precise, and reliable spatial RF variants that account for both dispersion and spatial autocorrelation. In addition, spatial structure and dispersion parameters can be modified by the sampling, validation approach and sub-sampling process in the RF algorithm. Accounting for these effects helps in understanding species abundance distribution and factors associated with it, which could inform future research and decision-making.

Future studies should explore datasets with diverse covariance functions and other species characteristics to evaluate spatial RF variants’ performance and explore ensemble methods that combine spatial approaches with count data models with full dispersion flexibility.

Conclusion

This study compared spatial RF variants to standard RF and other modelling methods for species abundance distributions in ecology using different model assessment approaches. The findings support using measures that combine several performance metrics and indicate agreement between observed and predicted abundances.

Spatial RF variants typically outperformed other models in prediction accuracy and power, especially when spatial autocorrelation and species detection probability were high. However, they are less precise than hybrid Poisson models like RGLM and GLM-OK. The best-performing RF spatial variant depends on data, species features, and the complexity of the relationship between abundance and independent variables. The RFSP provides better predictions for common species with high detection probability and high spatial autocorrelation but is computationally expensive. RFSP should be used with caution, as even adding pure noise variables can improve accuracy.

GWRF and GRF gave better predictions when the species was rare, with high detection probability and high spatial autocorrelation. For linear terms only, GWRF gave better predictions for common species with high spatial autocorrelation and high probability of detection. For rare species, it ranked second to RFRK. Hybrid Poisson models are most likely to give predictions close to observed data when there is no or low spatial autocorrelation. This study emphasizes the importance of model specification in ecological research, recommending species- and data-specific models and the combination of performance metrics to assess model performance. It suggests testing spatial variants that account for data dispersion to improve species abundance predictions. These findings are crucial for future studies, enhancing understanding of population dynamics, management strategies, conservation planning, and climate impact assessment.