Introduction

Conservation and management planning for a species requires accurate knowledge of its distribution. Historically, a key source of species distributional data has been expert knowledge, which defines species distribution boundaries based on where they are likely to occur (Hurlbert and Jetz 2007; Mainali et al. 2020). Expert-based range maps are generally drawn by minimizing the probability of false negatives: they often avoid declaring absence in truly occupied regions (Mainali et al. 2020). However, the resulting maps lack fine-scale resolution, and they often encompass environmentally unsuitable and unoccupied sites, skewing the actual distribution (Hurlbert and Jetz 2007; Mainali et al. 2020). Experts also may have limited ability to extrapolate their knowledge beyond their region of expertise as they are unable to study other regions (Murray et al. 2009; Hochachka et al. 2012). These factors highlight the need to include additional sources of data for species distribution mapping.

Research in ecology benefits from public contribution and engagement, which can also support nature conservation programs (Hochachka et al. 2012). Citizen-science initiatives can contribute to scientific efforts by providing large datasets useful for mapping distributions (Kéry et al. 2013; Zulian et al. 2021) and planning population management and conservation (Stauffer et al. 2018). Information gathered in these initiatives is commonly aggregated into Species Distribution Models (SDMs) (Fletcher et al. 2019), which relate species observations to relevant environmental and geographic variables (Guillera-Arroita et al. 2015; Kéry 2011; Kéry et al. 2013). SDMs provide an alternative to expert-based range maps because they capture the fine-scale nuances of species distributions and combine spatially extensive data from several sources to strengthen inference on species geographical distribution and environmental effects (Altwegg and Nichols 2018; Fletcher et al. 2019; Miller et al. 2019). However, inference can be biased when datasets are simply aggregated, because many citizen-science data sources are not standardized in terms of location, protocol, and sampling effort (Fletcher et al. 2019; Miller et al. 2019).

Data integration, rather than data aggregation, represents a recent methodological advance in the use of SDMs (Fletcher et al. 2019; Miller et al. 2019). Data integration enables the inclusion and analysis of different, non-standardized datasets while accounting for processes that may influence the precision of the distribution estimates, such as residual spatial autocorrelation and detection probability (Pacifici et al. 2017; Guélat and Kéry 2018; Zulian et al. 2021). Indeed, SDMs that accommodate spatial autocorrelation and imperfect detection often outperform other models, and are more precise, when estimating distribution both for a focal site and for sites lacking data (Guélat and Kéry 2018). In short, spatial autocorrelation refers to the similarity in environment and species occurrence probability between geographically close sites (Legendre 1993; Dormann et al. 2007), and imperfect detection refers to false negatives being recorded in truly occupied sites (MacKenzie et al. 2002).

Globally, there is an urgent need for effective conservation and management of large-bodied species for two main reasons: (i) they are relatively more prone to extinction than many smaller species (Cooke et al. 2020), and (ii) they generally have large home ranges, so their conservation might benefit populations of other species (Walpole and Leader-Williams 2002). The greater rhea (Rhea americana Linnaeus, 1758 (Rheiformes, Rheidae)) is a large, flightless bird that is highly vagile and has a large home range (Sick 2001). Although the greater rhea can be seen foraging in crop fields and artificial pastures with domestic cattle, sheep, and horses, its preferred habitats are grasslands, shrublands, and savannas (Sick 2001; Bellis et al. 2008; Giordano et al. 2010). Despite a geographic range that extends from northeastern to southern Brazil, eastern Bolivia, Paraguay, Uruguay, and part of Argentina, greater rhea populations are declining throughout the species’ distribution due to hunting (for meat, feathers, skin, and eggs) and habitat loss, and the species has a global conservation status of “Near Threatened” (BirdLife International 2016). Its preferred habitats are also globally threatened by conversion to croplands and tree plantations (Hoekstra et al. 2005; Veldman et al. 2015). Locally, in the Brazilian state of Rio Grande do Sul (hereafter referred to as RS), grasslands originally comprised more than 50% of the vegetation cover, especially within the limits of the Pampa biome (Overbeck et al. 2007; Cordeiro and Hasenack 2009; Hofmann et al. 2018). However, today only half of this vegetation cover remains (Souza Jr et al. 2020) due to high rates of grassland fragmentation and loss (Cordeiro and Hasenack 2009; Vélez-Martin et al. 2015) and insufficient and inefficient conservation measures (Overbeck et al. 2007; Andrade et al. 2019).

In this study, we elicited expert knowledge and created SDMs with data integration to produce independent maps of the greater rhea distribution in the 497 municipalities of RS state, Brazil. Our objectives were threefold: (1) to compare different models to define whether habitat and/or spatial autocorrelation explains the realized occurrence of the greater rhea in RS; (2) to compare the map of realized occurrence generated by the models with the map generated by experts; and (3) to compare the disparity between the realized occurrence and expert knowledge between the biomes found in RS (Pampa and Atlantic Forest), in order to find whether most of the species’ preferred habitat is inside the Pampa biome. The Pampa biome is a region of flat land with large extents of open grasslands (Boldrini 2009) where an observant animal that is attentive to its surroundings, such as the greater rhea, can thrive (Sick 2001; Bellis et al. 2008). Given the species’ dispersal ability, conspecific attraction, preference for grasslands, and ability to use crop fields, we expected a strong spatial autocorrelation component in occurrence, as well as a positive effect of grasslands and crop fields and a negative effect of the number of rural properties on greater rhea occurrence in RS municipalities. Nonetheless, we compared this hypothesis with alternative scenarios comprising different combinations of covariates. Our second hypothesis was that most disparities between the occurrence derived from expert knowledge and that derived from SDMs would lie inside the expert-drawn maps, because such maps tend to include areas not suitable for the species (Mainali et al. 2020).

Methods

We produced maps of the distribution of the greater rhea in the RS state using two procedures: (1) expert knowledge and (2) site-occupancy models based on citizen-science data. Since some citizen-science data sources provide only the municipality as the most local unit of species record, we used the boundaries of the 497 state municipalities as discrete sites of analysis.

Study area

We studied the distribution of the greater rhea in RS, the southernmost and the ninth largest state of Brazil, with an area of 281,707 km² (IBGE 2020). Grasslands and grassland-forest mosaics originally covered most of the southern, southwestern, and northeastern regions of the state (Boldrini 2009). In these grasslands, the dominant land use was livestock raising, traditionally practiced by European and Portuguese colonizers (Ribeiro and de Quadros 2015). A considerable portion of these grasslands was converted to crop fields and tree plantations during the last few decades (Vélez-Martin et al. 2015). Forests dominated in the northern region of the state (along the Uruguay river) and the southern limits of the Atlantic Rainforest (central and eastern RS) (Boldrini 2009), and their conversion to agriculture was more pronounced at the beginning of the twentieth century. Currently, the main land use in the state is crop fields, mainly soybean, rice, and maize (Cordeiro and Hasenack 2009; Hofmann et al. 2018), forming the mosaic of habitats potentially suitable for the greater rhea.

Expert-drawn maps

We defined an “expert” as a researcher and professional with sound fieldwork experience, with involvement in bird-watching initiatives, and who has supervised and published research regarding the bird species of RS. We gathered expert knowledge through an application created using the R package shiny (Chang et al. 2021). In this application, each expert was presented with two drawing steps, each with different mapped features; these were (1) a map which had municipality boundaries and labels, and (2) a map with the municipality boundaries, labels, and grassland area cover per municipality. The different mapped features were used to improve expert precision when mapping the distribution of the greater rhea by providing more information about the presence of suitable habitat for the species. As polygons were similar between mapped features, we used the polygons drawn over the first map (Supplementary Material Fig. S1.1). We requested information from 24 experts by email between November 2020 and February 2021. We received polygons from eight experts and used them to calculate the proportion of experts declaring the greater rhea in each municipality (hereafter “expert-based occurrence”).
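The expert-based occurrence described above is a per-municipality proportion across experts. A minimal sketch, assuming each expert's polygons have already been converted to a 0/1 value per municipality; the function name and the toy values are hypothetical and are not the authors' data:

```python
def expert_based_occurrence(expert_maps):
    """Proportion of experts declaring presence in each municipality.

    expert_maps: list of dicts (one per expert) mapping municipality -> 0/1.
    All dicts are assumed to share the same municipality keys.
    """
    n_experts = len(expert_maps)
    municipalities = expert_maps[0].keys()
    return {m: sum(e[m] for e in expert_maps) / n_experts for m in municipalities}


# Toy example with three hypothetical experts (illustrative values only).
toy_maps = [
    {"Alegrete": 1, "Tavares": 1, "Mostardas": 0},
    {"Alegrete": 1, "Tavares": 0, "Mostardas": 0},
    {"Alegrete": 1, "Tavares": 1, "Mostardas": 0},
]
occurrence = expert_based_occurrence(toy_maps)
# Summing the proportions gives a fractional "number of municipalities".
total_occurrence = sum(occurrence.values())
```

With eight experts, each municipality takes a value that is a multiple of 1/8, so the state-wide sum of these proportions can be fractional.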

SDM data

We fitted species distribution models, namely site-occupancy models with data integration, to five datasets comprising the detection and non-detection of the greater rhea and associated sampling effort per site (municipality) across several visits (sampling occasions) within a site. The five citizen-science datasets used were eBird (http://www.ebird.org), WikiAves (http://www.wikiaves.com.br), GBIF (Global Biodiversity Information Facility, http://www.gbif.org), VertNet (http://www.vertnet.org), and iNaturalist (http://www.inaturalist.org). The first two are dedicated exclusively to bird sightings. eBird is global and stores data as checklists, in which the observer uploads a list of all the species detected in a certain georeferenced location (Sullivan et al. 2009). In contrast, WikiAves is a Brazilian initiative that holds photos and sound recordings of birds in an individual detection format, at the municipality level (WikiAves 2019). We gathered data from eBird directly from its website (eBird 2019), and used both complete and incomplete lists because (i) removing incomplete lists could result in the loss of valuable data; (ii) our model took into account sampling effort, so that incomplete lists represent less effort to detect the greater rhea; and (iii) incomplete lists represent a small portion of the eBird dataset (n=80 of 5492 lists). We gathered WikiAves data directly from its administrator (WikiAves 2019). GBIF, VertNet, and iNaturalist are global open-data platforms that provide georeferenced detections for a wider range of taxa. We downloaded bird data (searching for the term “Aves” or “Rio Grande do Sul birds”) from these platforms directly in R through the packages rgbif, spocc, and rvertnet. We used functions of the CoordinateCleaner package (Zizka et al. 2019) to clean misplaced coordinates (e.g., centroids, capitals, duplicates, institutions, sea, urban areas, and zero coordinates). All data were temporally bound between January 1, 2008, and December 31, 2018.

Because of its checklist format, eBird is the only dataset that closely follows the classical sampling protocol of replicated visits to sites, which yields optimal data for site-occupancy modeling (see MacKenzie et al. 2002; MacKenzie and Royle 2005). In this case, each list was a visit to a certain municipality and received a detection (1) if the greater rhea was one of the species on the list, and a non-detection (0) if not. The number of lists per municipality varied widely. Municipalities with protected areas or with habitat features that make them “birding hotspots,” such as Lagoa do Peixe National Park (municipalities of Tavares, Mostardas, and São José do Norte), attract more migrant birds (and birdwatchers) than other municipalities. This resulted in large differences in the number of checklists per municipality (from 0 to 623 lists, Supplementary Material Fig. S1.2). To minimize processing time and avoid issues regarding missing data and parameter convergence when estimating effort parameters for the eBird model, we used data from the first 11 checklists per municipality, which was the average number of lists found across all 497 municipalities (Supplementary Material Fig. S1.2). We used the total number of observers of the checklist, the duration, and the distance traveled as effort covariates for eBird data.
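The conversion of checklists into detection histories can be sketched as follows. This is an illustration only: it assumes that “first” means chronologically first (the text does not state the ordering), that dates are ISO-formatted strings, and that each checklist is a hypothetical tuple of municipality, date, and species set.

```python
from collections import defaultdict

MAX_VISITS = 11  # number of checklists retained per municipality


def detection_histories(checklists, target="Rhea americana", max_visits=MAX_VISITS):
    """Build a 0/1 detection history per municipality from checklists.

    checklists: iterable of (municipality, iso_date, set_of_species).
    Each checklist counts as one visit; ordering by date is an assumption.
    """
    by_site = defaultdict(list)
    for municipality, _date, species in sorted(checklists, key=lambda c: c[1]):
        by_site[municipality].append(1 if target in species else 0)
    # Keep only the first max_visits visits of each municipality.
    return {site: visits[:max_visits] for site, visits in by_site.items()}


# Toy checklists (hypothetical records, not eBird data).
toy_checklists = [
    ("Tavares", "2010-01-02", {"Rhea americana", "Vanellus chilensis"}),
    ("Tavares", "2009-05-01", {"Vanellus chilensis"}),
] + [("Alegrete", f"2012-01-{day:02d}", {"Rhea americana"}) for day in range(1, 16)]

histories = detection_histories(toy_checklists)
```

Municipalities with fewer than 11 checklists simply yield shorter histories here; how missing visits are encoded for the model is left open.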

WikiAves, GBIF, VertNet, and iNaturalist datasets had no direct measurement of sampling effort associated with the municipalities. The effort associated with GBIF, VertNet, and iNaturalist datasets was the total number of any bird species recorded at the municipality. If the greater rhea was among the detected species, the municipality received a detection (1); otherwise, it received a non-detection (0). The effort associated with the WikiAves dataset was the total number of photos and song recordings per municipality, with any photo or call recording of the greater rhea indicating a detection. All these covariates are reasonable proxies for effort in bird sampling (Miller et al. 2019) (for more details, see Zulian et al. 2021). Even though eBird and GBIF had a similar spatial distribution of detections, each dataset brought different metadata characterizing sampling events, such as the number of observers, sighting duration, and distance traveled, that were useful in the modeling process and contributed to the detection probability estimation.

Site covariates

We used the municipality-level area of grasslands, the area of crop fields, and the number of properties relative to the municipality area as site covariates. We measured the area of grasslands and crop fields per municipality using data from Hofmann et al. (2018). To obtain grassland area per municipality, we gathered the total cover of dry, humid, and secondary (regenerating) grasslands within the state, and then divided it by the municipality area. To obtain the crop fields’ area, we first summed the cover of soybean, corn, and wheat and then divided it by each municipality area (Hofmann et al. 2018). As a proxy for how fragmented the greater rhea habitat is, we gathered data on the number of properties per municipality from the Brazilian Census of Agriculture (IBGE 2010) and divided it by municipality area to obtain the number of properties relative to municipality area. All spatial objects were converted into Lambert projection before data extraction and spatial analysis to ensure a proper calculation of municipality and habitat area. The three site covariates were square-root transformed and then standardized by the mean and standard deviation before modeling.
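The covariate preparation (square root, then standardization) can be illustrated with a short sketch. The use of the sample standard deviation (n − 1 denominator) is an assumption, as are the function name and the toy values:

```python
import math
from statistics import mean, stdev


def sqrt_standardize(values):
    """Square-root transform, then center and scale to unit standard deviation.

    Uses the sample standard deviation (n - 1), which is an assumption here.
    """
    roots = [math.sqrt(v) for v in values]
    center, scale = mean(roots), stdev(roots)
    return [(r - center) / scale for r in roots]


# Hypothetical grassland proportions for four municipalities.
grassland_cover = [0.0, 0.04, 0.25, 0.64]
grassland_std = sqrt_standardize(grassland_cover)
```

After this step each covariate has mean 0 and standard deviation 1, which puts the regression coefficients of the occupancy model on a comparable scale.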

Site-level covariate maps showed that grasslands are predominantly concentrated in the southern region of RS, whereas the crop fields are distributed throughout the state (Supplementary Material Fig. S1.3). The municipalities with the highest proportion of rural properties relative to municipality area are concentrated in the north, northwestern, and central regions of RS (Supplementary Material Fig. S1.3).

Statistical analyses

Species distribution modeling

We used hierarchical species distribution models with data integration of standardized (eBird) and non-standardized (other) citizen-science datasets to estimate the greater rhea distribution (Miller et al. 2019). For the biological part of the model, we denote the true occupancy state z of each municipality i, by:

$${z}_i\sim Bernoulli\ \left({\psi}_i\right),$$
(1)

where the state of this latent variable z is partially observed and follows a Bernoulli distribution, with mean ψi. We denote here zi as the realized state: 1 when occupied and 0 when not occupied. The zi is estimated by choosing values of parameters that maximize the likelihood of finding the observed data, per citizen-science dataset (Miller et al. 2019).

The site occupancy probability ψi is estimated by considering the influence of ecological processes on the probability of site occupancy (MacKenzie et al. 2002), as follows:

$$logit\left({\psi}_i\right)={\beta}_0+{\beta}_1\ast {X}_1+\dots +{\beta}_n\ast {X}_n+{\delta}_i.$$
(2)

The \({\psi}_i\) has its own GLM, with \({\beta}_0\) being the GLM intercept (the average site occupancy probability on the logit scale), \({\beta}_1\) to \({\beta}_n\) the regression coefficients, \({X}_1\) to \({X}_n\) the values of each environmental covariate (see “Site covariates” above), and \({\delta}_i\) being the conditional autoregressive (CAR) random effect of a focal site i:

$${\delta}_i\sim CARnormal\ \left(\frac{\sum_{j=1}^m{\delta}_j}{m},\frac{\tau^2}{m}\right),$$
(3)

where \({\delta}_j\) are the values of the CAR random effect for the m neighbors of the focal site, and \({\tau}^2\) is the spatial variance parameter (Besag et al. 1991; Guélat and Kéry 2018). We built a discrete spatial neighborhood of hexagonal cells with a cell size of either 10, 25, or 50 km, to test for the optimal cell size for neighborhood consideration.
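To make Eqs. 2 and 3 concrete, the sketch below computes the conditional mean and variance of the CAR effect for one site given its neighbors, and the resulting occupancy probability through the inverse logit. It only evaluates the formulas for given values; it is not the authors' BUGS implementation, and all names and numbers are hypothetical.

```python
import math


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def car_conditional(delta, neighbors, site, tau2):
    """Conditional mean and variance of the CAR effect of `site` (Eq. 3).

    delta: dict site -> current value of the spatial effect.
    neighbors: dict site -> list of neighboring sites.
    """
    nb = neighbors[site]
    m = len(nb)
    return sum(delta[j] for j in nb) / m, tau2 / m


def occupancy_probability(beta0, betas, covariates, delta_i):
    """Occupancy probability of one site from the linear predictor (Eq. 2)."""
    linear = beta0 + sum(b * x for b, x in zip(betas, covariates)) + delta_i
    return inv_logit(linear)


# Toy values: site 0 has two neighbors with effects 1.0 and 3.0.
cond_mean, cond_var = car_conditional({0: 0.0, 1: 1.0, 2: 3.0}, {0: [1, 2]}, 0, tau2=2.0)
psi_spatial_only = occupancy_probability(0.0, [], [], cond_mean)
```

A site surrounded by neighbors with high spatial effects is pulled toward a high occupancy probability even without covariates, which is the behavior a purely spatial model relies on.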

For the observational part of the model, we analyzed detection error by building one model per citizen-science dataset (before integrating them to estimate the zi) and estimating the detection probability. The general model had the following format:

$${p}_j^{\ast }=1-{\left(1-p\right)}^{E_j},$$
(4)

where the probability of detecting the greater rhea \({p}_j^{\ast }\) on occasion j was an exponential function of the sampling effort \({E}_j\) in each sampling occasion j (Pacifici et al. 2017; Stauffer et al. 2018; Zulian et al. 2021). The sampling effort is estimated from data using a linear model without the intercept, so that \({E}_j\) is in units of sampling effort. The parameter p was fixed to 0.5, as it is not identifiable from the available data. In this exponential function, \({p}_j^{\ast }\) increases non-linearly with increasing sampling effort up to an asymptote (Stauffer et al. 2018). The \({E}_j\) is estimated separately for each dataset \({DS}_n\) (with n varying from 1 to 5: eBird, WikiAves, iNaturalist, GBIF, and VertNet) based on the effort covariates from each dataset (\({X}_{1j}\), \({X}_{2j}\), and \({X}_{3j}\)), which varied across the j visits, following a linear equation without intercept:

$${E}_j^{DS_n}={\alpha}_1\ast {X}_{1j}+{\alpha}_2\ast {X}_{2j}+{\alpha}_3\ast {X}_{3j},$$
(5)
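The effort and detection equations (Eqs. 4 and 5) can be evaluated directly. The sketch below is illustrative: the coefficient values are hypothetical rather than estimates from the model, and only p = 0.5 comes from the text.

```python
def sampling_effort(alphas, covariates):
    """Linear effort model without intercept (Eq. 5)."""
    return sum(a * x for a, x in zip(alphas, covariates))


def detection_probability(effort, p=0.5):
    """Detection probability as a saturating function of effort (Eq. 4).

    p is the per-unit-effort detection probability, fixed to 0.5 in the text.
    """
    return 1.0 - (1.0 - p) ** effort


# Hypothetical coefficients for (observers, duration in hours, distance in km).
toy_alphas = [0.5, 0.25, 0.125]
toy_effort = sampling_effort(toy_alphas, [2, 2, 4])  # 1.0 + 0.5 + 0.5 = 2.0
toy_p_star = detection_probability(toy_effort)
```

With p = 0.5, one unit of effort gives a detection probability of 0.5 and each additional unit halves the remaining probability of missing the species, so the curve approaches 1 asymptotically.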

Finally, we integrated the biological and observational parts of the data integration model by multiplying \({z}_i\) estimated in Eq. 1 by the \({p}_j^{\ast }\) from Eq. 4. The true state of site occupancy, \({z}_i\), was then updated in each multiplication with \({p}_j^{\ast }\). The observations \({Y}_{ij}\) were modeled as Bernoulli random variables with success probability \({z}_i{p}_j^{\ast }\):

$${Y}_{ij}\sim Bernoulli\ \left({z}_i\times {p}_j^{\ast}\right).$$
(6)

The parameter \({z}_i{p}_j^{\ast }\) contained the same information—detection conditional to site occupancy state—which is typical for site-occupancy models (MacKenzie et al. 2002): the species is only observed in a site i, for a sampling occasion j (y = 1), if the site is truly occupied by the species, and the species is effectively detected by the observer.
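The conditional structure of Eq. 6 can be illustrated by simulation. The sketch below draws detections for given occupancy states and detection probabilities; it is a toy generator with hypothetical names and values, not the fitting procedure.

```python
import random


def simulate_detections(z, p_star, rng):
    """Draw Y_ij ~ Bernoulli(z_i * p*_j) for every site i and occasion j.

    z: list of 0/1 occupancy states, one per site.
    p_star: list of detection probabilities, one per occasion.
    rng: a random.Random instance, passed in for reproducibility.
    """
    return [
        [1 if rng.random() < z_i * p_j else 0 for p_j in p_star]
        for z_i in z
    ]


rng = random.Random(42)
# Site 0 is unoccupied; site 1 is occupied and always detected.
certain = simulate_detections([0, 1], [1.0, 1.0, 1.0], rng)
# Occupied site with imperfect detection: zeros here are false negatives.
imperfect = simulate_detections([1], [0.5] * 20, rng)
```

An unoccupied site can only produce zeros, whereas an occupied site produces a mix of ones and zeros whenever detection is imperfect, which is why repeated visits are needed to separate absence from non-detection.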

We estimated the parameters of the data integration models using Bayesian inference through which it was possible to update the prior knowledge regarding the model parameters using data to form “posterior” estimates of each model parameter (Ellison 2004). We wrote our models in BUGS language and ran them in WinBUGS through the R package R2WinBUGS. We ran the model in three independent chains of 60,000 iterations, burn-in phase of 40,000 iterations, adaptive phase of 30,000 iterations, and thinning each 20 samples, yielding a posterior sample of 3000 iterations. We obtained a value of convergence criterion R-hat lower than 1.1 for all the parameters. All analyses were run in the R programming environment (R Core Team 2021).
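The size of the posterior sample follows from the chain settings. As a quick arithmetic check, assuming the burn-in iterations are counted within the 60,000 iterations of each chain:

```python
def retained_samples(chains, iterations, burn_in, thin):
    """Number of posterior draws kept after discarding burn-in and thinning."""
    return chains * ((iterations - burn_in) // thin)


# Settings reported in the text: 3 chains, 60,000 iterations,
# 40,000 burn-in iterations, and one draw kept every 20 iterations.
posterior_draws = retained_samples(chains=3, iterations=60_000, burn_in=40_000, thin=20)
```

Each chain contributes (60,000 − 40,000) / 20 = 1000 draws, and three chains give the 3000 posterior samples stated above.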

Model selection analysis

We built 12 different models depicting different alternative hypotheses to explain the spatial distribution of the greater rhea in RS (Table 1). We selected the best model using the model deviance criterion (D, see the following Eq. 8), for which the lowest value represents the best model. The deviance of each dataset was calculated based on two different sets of data: (i) training and (ii) validation datasets. We used 20% of the data from each dataset (~100 municipalities) for validation and used the remaining 80% for training the model and estimating the parameters. Then, the training and validation datasets were compared, in terms of the squared number of detections, producing a likelihood function for each citizen-science dataset n calculated as:

$$L={\hat{y}}^{y}\ {\left(1-\hat{y}\right)}^{1-y},$$
(7)

where y is the validation dataset, and the \(\hat{y}\) dataset was estimated using the training dataset (i.e., derived from the model) (Hooten and Hobbs 2015). Deviance across all datasets was calculated from the sum of the log-likelihoods across the N citizen-science datasets as:

$$D=-2{\sum}_{n=1}^N\mathit{\log}(L).$$
(8)

Table 1 Model selection analysis, showing models ranked according to the Deviance Criterion. Models 4, 7, 8, and 9, designated as “Complete,” include all the site covariates and account for different spatial structures. Spatial models (1, 2, and 5) account for different spatial structures and do not include the site covariates. Grassland models (6, 10, 11, and 12) only include this site covariate. The Null model includes neither site covariates nor spatial structure.
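The deviance criterion can be sketched numerically. The code assumes the standard Bernoulli likelihood (prediction raised to the observed value) and the usual definition of deviance as −2 times the summed log-likelihood; the clipping constant and the toy predictions are hypothetical.

```python
import math


def bernoulli_log_likelihood(y, y_hat, eps=1e-12):
    """Log of y_hat**y * (1 - y_hat)**(1 - y), with y_hat clipped away from 0 and 1."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)


def deviance(datasets):
    """Deviance summed over datasets.

    datasets: list of datasets, each a list of (observed y, predicted y_hat) pairs.
    """
    total = sum(
        bernoulli_log_likelihood(y, y_hat)
        for dataset in datasets
        for y, y_hat in dataset
    )
    return -2.0 * total


# Two toy validation sets with the same observations but different predictions.
uninformative = deviance([[(1, 0.5), (0, 0.5)]])
sharper = deviance([[(1, 0.9), (0, 0.1)]])
```

Predictions that sit closer to the held-out observations yield a lower deviance, which is why the model with the lowest value is ranked best.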

Spatial distribution

We used the zi estimated by the best-ranked model to obtain the proportion of municipalities where the realized occurrence of the greater rhea was equal to 1. This measure is also known as “Finite Sample Size” or FSS and it was calculated using the equation:

$$FSS=\frac{\sum_{i=1}^N{z}_i}{N},$$
(9)

where zi is the realized occurrence in each municipality i, and N is the total number of sites (N=497 municipalities) (MacKenzie et al. 2006; Royle and Kéry 2007). The greater rhea distribution based on the expert knowledge was obtained by counting the number of municipalities within the expert-drawn polygons where they considered the species to be present.
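The FSS calculation can be illustrated on posterior draws of the occupancy states. The sketch assumes the posterior is available as a list of draws, each holding one 0/1 value per municipality; the toy draws are hypothetical.

```python
N_MUNICIPALITIES = 497  # total number of sites in the study


def finite_sample_size(z_draws):
    """Proportion of occupied sites in each posterior draw, and its mean.

    z_draws: list of posterior draws, each a list of 0/1 occupancy states.
    Returns (posterior mean of FSS, list of per-draw FSS values).
    """
    per_draw = [sum(draw) / len(draw) for draw in z_draws]
    return sum(per_draw) / len(per_draw), per_draw


# Two toy draws over four sites.
fss_mean, fss_draws = finite_sample_size([[1, 0, 1, 0], [1, 1, 1, 0]])

# Converting a proportion into a number of municipalities.
occupied = round(0.48 * N_MUNICIPALITIES)
```

Computing the proportion within each draw before summarizing keeps the full posterior uncertainty, so credible intervals for the number of occupied municipalities can be read from the per-draw values.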

Comparison of expert knowledge and SDM

To compare the expert-drawn and the SDM maps, we overlaid the expert-based occurrence map with the map of zi estimated by the best-ranked SDM and computed their difference (expert minus SDM). Values equal to 0 indicate municipalities where expert and SDM results were the same. Positive values indicate municipalities with greater rhea presence according to experts, whereas negative values indicate municipalities with greater rhea presence according to the SDM. To depict the disparity between expert knowledge and SDM, we counted the number of municipalities in which the difference was higher than, lower than, or equal to 0. We therefore considered the differences between the number of municipalities with greater rhea occurrence declared only by the experts, only by the SDM, and by both. Finally, we used linear models (LMs) to test whether expert-based and SDM-based occurrences were related, and whether the average disparity between experts and the SDM was higher inside than outside the expert-drawn map (expert-based occurrence close to 1 and 0, respectively). In the first LM, the expert-based occurrence was the predictor variable and the SDM-based occurrence the response variable; in the second, the expert-based occurrence was the predictor variable and the absolute difference between expert- and SDM-based occurrence (both with values between 0 and 1) the response variable. The model parameters were estimated through Bayesian inference, and models were run using the function brm, from the R package brms (Bürkner 2017) and the Stan software. We used flat priors in four parallel Hamiltonian Monte Carlo chains comprising 2000 iterations each, with a burn-in (warmup) of 1000 iterations and no thinning, producing 4000 samples of the posterior probability distribution of each model parameter. Average parameter values and associated credible intervals were calculated using these 4000 posterior-distribution samples.
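The counting step of the comparison can be sketched as follows, assuming the difference is taken as expert-based minus SDM-based occurrence per municipality. The function name, the tolerance argument, and the toy values are hypothetical.

```python
def classify_disparity(expert, sdm, tol=0.0):
    """Count municipalities by the sign of (expert - SDM) occurrence.

    expert, sdm: dicts mapping municipality -> occurrence in [0, 1].
    tol: differences with absolute value <= tol are treated as agreement.
    """
    counts = {"expert_higher": 0, "sdm_higher": 0, "equal": 0}
    for municipality, expert_value in expert.items():
        difference = expert_value - sdm[municipality]
        if abs(difference) <= tol:
            counts["equal"] += 1
        elif difference > 0:
            counts["expert_higher"] += 1
        else:
            counts["sdm_higher"] += 1
    return counts


# Toy occurrences for three hypothetical municipalities.
toy_expert = {"A": 1.0, "B": 0.0, "C": 0.5}
toy_sdm = {"A": 0.4, "B": 0.9, "C": 0.5}
toy_counts = classify_disparity(toy_expert, toy_sdm)
```

Because both occurrences are continuous proportions, exact equality is rare; a small tolerance would classify near-agreement as agreement, whereas the default of 0 reproduces a strict comparison.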

Variation between biomes and mapping approach

To investigate the differences between expert-drawn maps and estimated maps from the SDM per biome, we overlaid expert and SDM maps with the shapefile of the Pampa biome. Municipalities outside the Pampa were categorized as Atlantic Forest. We extracted the values of zi and expert knowledge ranks (1s or 0s) per biome to check variation of greater rhea occurrence relative to the biome. We inspected this variation using a biplot showing the relationship between SDM-based occurrence zi and the expert-based occurrence. We further tested for differences in average realized occurrence between biomes using ANOVA with biome, mapping approach (expert knowledge or SDM), and the interaction between biome and mapping approach as predictor factors, and the realized occurrence (collated zi from SDM and 1s and 0s from experts) as the response variable. The ANOVA was run using the same settings as the LMs. The post hoc test was done using the emmeans function, from the emmeans R package (Lenth 2022), in which the marginal median of linear trends (with the associated 95% highest posterior density (HPD) interval) was estimated to identify the direct effect of the biome × mapping approach interaction on each metric while holding constant the isolated effects of biome and approach.

Results

The number of municipalities with samples and greater rhea detections varied among the citizen-science datasets (Fig. 1). WikiAves had the largest spatial coverage and the highest number of detections among the five datasets, accounting for 466 municipalities with samples and 99 with greater rhea detections. The second largest spatial coverage was from GBIF, with 246 municipalities with samples and greater rhea detections in 35 of them. Although eBird had a smaller spatial coverage than GBIF, it had a higher number of municipalities with greater rhea detections (36 municipalities). Finally, VertNet and iNaturalist had the smallest spatial coverage and number of municipalities with detections. VertNet had 91 municipalities with samples and two with greater rhea detections, whereas iNaturalist had 73 municipalities with samples and seven with greater rhea detections (Fig. 1).

Fig. 1

Spatial distribution of sampling and detections of the greater rhea in the municipalities of the RS, Brazil, from the eBird, GBIF, WikiAves, iNaturalist, and VertNet datasets. The greater rhea was detected in the municipalities in red and not detected in municipalities in white. Gray municipalities had no sampling. The bottom-right map shows the aggregated data across the five datasets. The figure was produced in R Programming Environment and edited in Inkscape

Model selection analysis

Model selection analysis based on the deviance criterion showed that the spatial model with a neighborhood cell size of 50 km (model 1, Table 1) had the best predictive performance among the twelve models tested. The spatial model with a neighborhood of 10 km (model 2) was the second best-ranked model and had a slightly worse predictive ability than the first model. The complete models (with all space and site covariate options: models 4, 7, 8, and 9), the models accounting only for site covariates (6, 10, 11, and 12), and the model accounting only for space with a neighborhood of 25 km (model 5) performed slightly worse than the Null model (model 3, Table 1).

Spatial distribution

The estimates of the finite sample size (FSS) were very similar across all models, ranging from 0.47 to 0.50. The mean FSS of the best ranked model was 0.48, with the 95% credible interval (CI) ranging from 0.41 to 0.56 (Table 1), which translates into a realized occurrence in 239 municipalities (95% CI: 204–278). From now on, we report only the results from the best-ranked model (model 1).

Estimated occurrence based on the integration of the five datasets showed a greater rhea distribution range that includes predominantly the Pampa biome area, comprising municipalities from the southern, western, and coastal regions of RS (Fig. 2a). The estimated occurrence also includes some municipalities within the northeastern and middle plateau of RS. The northern portion of the state had the highest uncertainty in the estimated occurrence, with most of the municipalities showing an average z estimate around 0.5 (Fig. 2a).

Fig. 2

Geographic distribution of the greater rhea in RS, Brazil. The dashed black line depicts the limits of the Pampa biome. a The realized occurrence (average of zi across the 3000 posterior-distribution samples) estimated from the data integration model, with darker tones representing higher mean zi and intermediate values representing the municipalities with the highest uncertainty regarding the occurrence. b The expert-drawn map, with dark-blue municipalities representing municipalities with a high proportion of experts declaring greater rhea presence. c The disparity between the realized occurrence and the expert knowledge. In c, blue colors represent greater rhea occurrence only from the expert drawn map, red colors represent occurrence only from the SDM results, and the white colors represent agreement between the expert drawn map and the SDM result. The figure was produced in R Programming Environment and edited in Inkscape

Based on the eight maps received from experts, the polygons drawn indicate greater rhea presence in most municipalities inside the Pampa biome, for example in Santana do Livramento and Alegrete (dark-blue areas in western Pampa), in a portion of the northern municipalities, and in two isolated municipalities in the northeast of RS (Fig. 2b). The sum of expert-based occurrences resulted in 103.13 municipalities (21% of the 497 municipalities) with the presence of the species.

Comparison of expert knowledge and species distribution modeling (SDM)

Linear models showed a positive relationship between expert-based and SDM-based occurrence (average regression coefficient (β) and 95% credible interval (CI) of β: 0.59; CI: 0.46, 0.72) (Fig. 3a), and a negative, albeit weak, relationship between expert-based occurrence and the absolute difference in expert- and SDM-based occurrence (β: −0.13; CI: −0.23, −0.03) (Fig. 3b). The map of the difference between the expert- and SDM-based occurrence shows that most agreement between the SDM and the expert-drawn map lies within the Pampa biome (Fig. 2c). Most of the municipalities in this region returned a difference close to 0, highlighting that the greater rhea was recorded as present in the region by both the experts and the SDM. The results from the SDM and from the experts agreed exactly (difference between expert and SDM equal to 0) in only four municipalities and differed in all the others. Expert-based occurrence exceeded SDM-based occurrence in 78 municipalities (i.e., difference between expert and SDM > 0), whereas SDM-based occurrence exceeded expert-based occurrence in 415 municipalities (i.e., difference between expert and SDM < 0).

Fig. 3

Plots showing the relationship between expert- and SDM-based occurrence (a), the disagreement between these mapping approaches (b), and the variation in realized occurrence between biomes and mapping approaches (c). Points were subtly jittered in a and b to improve visualization. Small points in c show the 3000 posterior distribution samples of a Bayesian ANOVA, and the black points depict the median across these samples. The figure was produced in R Programming Environment and edited in Inkscape

Variation between biomes and mapping approach

ANOVA with post hoc test showed differences in greater rhea realized occurrence among the levels of the interaction between biome (Pampa or Atlantic Forest) and mapping approach (SDM or expert aggregated map) (Fig. 3c). The effect of this interaction shows that (1) there was a higher realized occurrence in the municipalities within the Pampa, and (2) experts estimated a lower chance of occurrence than the SDM in both the Atlantic Forest and the Pampa (Fig. 3c). Experts estimated 13% (lower and higher 95% HPD: 10–15%) of the Atlantic Forest municipalities as occupied by the greater rhea, and 37% (HPD: 33–40%) of the Pampa municipalities. The SDM, in turn, estimated 43% (HPD: 41–46%) of the Atlantic Forest and 58% (HPD: 54–61%) of the Pampa municipalities as occupied by the greater rhea (Fig. 3c).

Discussion

Both expert knowledge and the SDM show that the greater rhea is a bird with affinities to open landscapes, and that there is a considerable overlap of its distribution with the area of the Pampa biome. However, expert knowledge and the SDM showed contrasting patterns within the Atlantic Forest biome, where the species’ occurrence might be more uncertain. We initially expected that most disparity between experts and SDM would occur inside the range depicted by the experts (e.g., Mainali et al. 2020). Our analysis of absolute difference did not support this hypothesis, showing that the difference between SDM- and expert-based occurrence was lower in municipalities with occurrence declared by the experts. This result indicates that the experts are aware of the factors determining the spatial distribution of the greater rhea, at least within the grassy physiognomies of the Pampa biome. Regarding the effect of site-level covariates, we found a strong spatial autocorrelation component influencing the spatial distribution of the greater rhea in RS. While a spatial autocorrelation effect was initially expected, we did not expect it to be the only factor influencing the greater rhea distribution in the studied area. We attribute this large spatial effect to the biology of the greater rhea, mainly its vagility, large home range size, and strong conspecific attraction during the reproduction period. Finally, the estimated realized occurrence, which corresponds to the area of occupancy (sensu IUCN 2001), includes 48% of the state municipalities, and showed the locations more suitable for greater rhea occurrence in southernmost RS.

Both expert knowledge and SDMs aim to present a binary state of species distributions (either occupied or unoccupied) but use different methods (Hurlbert and Jetz 2007; Mainali et al. 2020). We found that the greater rhea distribution generated by the experts and by the SDM agreed on an important point: both resulted in a realized area of occupancy covering almost the complete extent of the Pampa biome. The Pampa biome is a region of flat land originally covered by large extents of open grassland (Boldrini 2009). The greater rhea is well adapted to such a habitat, with a long neck and long legs that help it escape predators and thrive (Bellis et al. 2004). Its long neck enables it to see over large distances and remain vigilant to predators. It is described as a cursorial bird, so it is well adapted for running long distances over open grasslands and can reach speeds of up to 60 km/h (Abourachid and Renous 2000). It has also been described as having a large home range within natural grasslands (11 km²) driven by the availability of its main food source, wild dicots, which tend to be found in low densities within grasslands, meaning that the birds often have to travel farther to feed (Bellis et al. 2004).

Nonetheless, the greater rhea distribution maps from the SDM and the expert knowledge differ in the following aspects: (1) the SDM estimated a broader occurrence in the municipalities of highland grasslands (northeastern RS), medium plateau, and close to the Uruguay River, and (2) experts declared occurrence in the southeast and central RS mountain ranges, and in the western portion of the Uruguay River coast. Possible sources of disparity are (i) false negative results from experts in the Atlantic Forest; (ii) false positive results from experts in the Pampa, mainly along the limits between the Pampa and Atlantic Forest; and (iii) an omission of important variables within the SDMs (see below). First, although the Atlantic Forest is predominantly covered by forests, a habitat not suitable for the species (Souza Jr et al. 2020), it is possible to find natural grassy landscapes and grassland-forest mosaics potentially suitable for the greater rhea in the northeastern and middle plateau of RS (Boldrini 2009). Second, expert maps included unsuitable areas not occupied by the species in the eastern Pampa, which might include more forested areas found along the hills in the southernmost limits of the Atlantic Rainforest.

Third, we found a strong effect of the spatial component on greater rhea occurrence. Spatial autocorrelation can be introduced by two different factors (that unfortunately cannot be distinguished based on detection/non-detection data), namely omission of important site covariates, and biotic processes such as dispersal and conspecific attraction (Legendre 1993; Dormann et al. 2007). One important covariate that might vary in space and shape the species distribution is the expansion of agriculture towards northern RS, where large extents of open habitats (crop fields) were created and are used by the greater rhea. Perhaps the species is expanding its range within RS alongside the agricultural expansion, an effect that cannot be detected with static site-occupancy models but was already noticed in RS and in other Brazilian states (Miranda 2021). Also, the species moves across large geographic areas when searching for dense and tall grasslands to reproduce (Erize and Villafañe 2017). Such spatiotemporal dynamics are not explained by habitat covariates but might be represented by the spatial autocorrelation component. Therefore, although habitat covariates may not be necessary to explain the geographic distribution of the greater rhea in RS, it is essential to monitor its populations and track changes in its preferred habitat to accurately identify and analyze population dynamics over time.
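As a concrete illustration of what spatial autocorrelation means, the sketch below computes Moran's I, a standard descriptive index of whether neighboring sites hold similar values. This is a swapped-in diagnostic chosen for simplicity, not the spatial component of the SDM used in this study. The function `morans_i`, the six-municipality chain, and the occupancy values are all hypothetical, and binary adjacency weights are assumed.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    `neighbors` maps each site index to the list of adjacent site indices.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Cross-products of deviations over all neighboring pairs (weight 1 each)
    cross = sum(dev[i] * dev[j] for i, adj in neighbors.items() for j in adj)
    total_weight = sum(len(adj) for adj in neighbors.values())
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)


# Hypothetical chain of six municipalities, each adjacent to the next one
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

clustered = [1, 1, 1, 0, 0, 0]    # occupied sites sit next to each other
alternating = [1, 0, 1, 0, 1, 0]  # occupied sites never touch

print(morans_i(clustered, neighbors))
print(morans_i(alternating, neighbors))
```

The clustered pattern yields a positive index and the alternating pattern a negative one. As noted above, such an index describes the pattern only: it cannot tell whether clustering arises from an omitted covariate or from biotic processes such as dispersal and conspecific attraction.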

The conservation of Pampean grasslands can be greatly aided by the definition of a flagship species. This type of species can help increase interest from citizens and governments to support and fund conservation efforts (Simberloff 1998; Walpole and Leader-Williams 2002). When these species are protected, this can in turn ensure the protection of the wider landscape and other species that rely on it (Walpole and Leader-Williams 2002). The greater rhea is an unusual bird due to its morphology and large size, its large home range, and interesting breeding behavior. The species is also classified globally as “Near Threatened” (BirdLife International 2016), predominantly due to the loss of the grassland habitat, illegal hunting (for their feathers, meat, and skin), egg collection, and predation by felids and feral dogs (Sales 2006). Being a species of conservation importance, the greater rhea could be considered as an effective flagship species for the conservation of the Pampa habitat within Brazil.

Using expert knowledge and data integration models based on citizen-science data, both of which produced maps showing the area of greater rhea occupancy, we highlighted that (i) regions in which both SDM and experts agreed about the greater rhea distribution can form the core areas for future conservation efforts, and can be targeted by studies considering parameters such as population size and survival; and (ii) regions where the SDM and expert knowledge differed should be the focus of further sampling effort. Acknowledging the possibility of false negatives in observations (MacKenzie et al. 2002; Kéry et al. 2013; Guillera-Arroita et al. 2015; Guélat and Kéry 2018) is crucial for minimizing statistical uncertainty. This cannot be achieved with expert-based range maps or by the simple aggregation of different datasets. Therefore, the collaboration of experts, ecologists, and citizen scientists in mapping species distribution over space and time is encouraged to produce an informative source of evidence for planning species management and conservation.