Introduction

Forest management strategies accounting for spatial patterns of genetic structure are critical for delimiting conservation management units, germ plasm collection, sources and target sites for reforestation, and assisted migrations (Broadhurst et al. 2008; Frankham 2010). Moreover, spatial genetic structure provides insights for understanding relevant ecological and evolutionary processes, such as demography, progeny fitness, colonization history (Kalisz et al. 2001; Zhao et al. 2009), as well as potential for local adaptation in the light of environmental change (Alberto et al. 2013; Savolainen et al. 2013). Efforts to protect biodiversity thus should include the quantification of spatial patterns of genetic structure across species’ ranges.

Reliable evaluations of genetic structure require sound sampling schemes that involve matching the scale of the spatial process driving genetic structure, which is determined by the life history traits of the study species (Anderson et al. 2010; Keller et al. 2013). Although conducting an extensive sampling is recommended to avoid biased results from genetic analyses, in real situations this condition is often difficult to achieve (Hall and Beissinger 2014). Sampling can be constrained by varying factors, such as limited economic resources for processing a small number of individuals for genetic analysis, permitted only in few sites or within geographically accessible areas.

Plants due to their sessile life form are expected to develop strong genetic structure over distance (mostly due to seed dispersal), where neighboring individuals are more closely related than distant individuals, resulting in a genetic pattern of isolation by geographical distance (IBD) (Ennos 2001; Hampe and Petit 2005; Epperson 2007). IBD is a prevalent driver of genetic structure in tree populations (Vekemans and Hardy 2004), and is likely the most common spatial process investigated in landscape genetic studies (Jenkins et al. 2010). Inappropriate sampling is recognized to mislead inferences of genetic structure (Latch and Rhodes 2006; Koen et al. 2013; Naujokaitis-Lewis et al. 2013; Prunier et al. 2013; Hoban and Schlarbaum 2014; Tucker et al. 2014). For instance, typical analyses in population genetics, such as estimates of genetic differentiation F ST (Schwartz and McKelvey 2009; Landguth and Schwartz 2014), genetic clustering algorithms (Schwartz and McKelvey 2009), and correlative distance methods (Landguth and Schwartz 2014; Oyler-McCance et al. 2013) are sensitive to sampling bias under IBD, which may over- and under-estimate genetic structure (Meirmans 2012).

Sampling effects on inferences of spatial genetic structure have been explored using computer simulations (e.g., Schwartz and McKelvey 2009; Landguth et al. 2012; Oyler-McCance et al. 2013; Landguth and Schwartz 2014). From simulation studies, it has been suggested that random or systematic sampling (e.g., uniform sampling within transects) perform well for identifying the underlying spatial process (Schwartz and McKelvey 2009; Oyler-McCance et al. 2013). Other factors, such as the number of loci and alleles per locus are also important (Landguth et al. 2012). Sampling issues in simulation studies have been developed on scenarios that emulate animals with high dispersal abilities and mainly with continuous distributions (but see Prunier et al. 2013). For plants that have different life history traits, results from such studies may not apply. Moreover, recent evidence from landscape genetic simulations suggest that IBD is not only determined by dispersal distances per se, but also by habitat configuration and the spatial locations of populations (van Strien et al. 2014). This implies that sampling bias on inferences of spatial genetic structure under IBD, may be far more complex than anticipated for species with non-continuous distributions, such as in many tree species.

The aim of this study was to assess the effects of sampling design on inferences of spatial genetic structure in forest tree species with non-continuous distributions. I used spatially explicit, individual-based computer simulations to evaluate the sensitivity of sampling design through varying three factors: the spatial sampling scheme (i.e. spatial arrangement of samples), the sampling intensity (i.e. number of sites), and the number of microsatellite loci. I compared the performance of each sampling design combination to quantify spatial genetic differentiation under varying degrees of isolation by distance. Specifically, I asked: (1) which is the best performing spatial sampling scheme? (2) Does performance of spatial sampling largely depend on the number of sites sampled and the number of loci? (3) How does performance of sampling design vary with the strength of spatial genetic structure (i.e., IBD)? Through simulating and contrasting two species with different characteristics, this study also illustrates how bias in sampling design may depend on the study species.

Methods

Model system

To illustrate the effects of sampling design on inferences of spatial genetic structure under approximate real-like scenarios, I selected two tropical dry forest trees varying in spatial distributions, individual abundances, and mating systems. The selected species are characteristic of the tropical dry forest of the Bajío region in Central Mexico, utilized for timber and valuable for reforestation: Cedrela dugesii and Bursera palmeri.

Cedrela dugesii is a monoicous deciduous tree endemic to the Bajío region, where the species is commonly associated and abundant on rocky outcrops (Calderón de Rzedowski and Germán 1993). It has seeds adapted to dispersal by wind and is pollinated by insects (Calderón de Rzedowski and Germán 1993). B. palmeri is a dioicous deciduous tree (sometimes shrub) of the tropical dry forest and scrublands of the Mexican Plateau. It has a larger distribution in the Bajío region relative to C. dugesii (Fig. 1). B. palmeri has seeds dispersed by animals and is pollinated by insects (Rzedowski and Guevara-Féfer 1992). The tropical dry forest in the Bajío region covered large areas, but currently it is decreasing due to the intensification of agriculture, extensive livestock, and urbanization (Encino-Ruiz et al. 2014; Rzedowski et al. 2014). Both tree species are in decline and increasingly fragmented due to logging for timber and wood products (Rzedowski and Guevara-Féfer 1992; Calderón de Rzedowski and Germán 1993).

Fig. 1
figure 1

Sample design under isolation by distance in two simulated forest tree species with non-continuous distributions: a Cedrela dugesii and b Bursera palmeri. The circles denote the 10 km2 radius sampling locations for the systematic sampling. The stars denote the sampling sites included for the cluster sampling scheme. For the random sampling, individuals were collected randomly across the landscape

Landscape genetic simulations

Simulations were implemented in CDPOP v1.2.31 (Landguth and Cushman 2010), which is an individual-based landscape genetics program. CDPOP simulates gene flow under Mendelian inheritance among spatially referenced genotypes based on probabilistic functions of individual movement and mating across multiple generations (for more information on the software see Landguth and Cushman 2010 and the github site https://github.com/ComputationalEcologyLab/CDPOP). To map the spatial distribution for each tree species in the Bajío region, I accessed all collection records from specialized flora web-databases congregated by the National Commission for Knowledge and Use of Biodiversity of Mexico (REMIB-CONABIO 2016; http://www.conabio.gob.mx/remib/cgi-bin/). Records were also obtained from the herbarium IEB (Instituto de Ecología, A. C., Centro Regional del Bajío), which is specialized in the flora of the Bajío region.

Within an area of approximately 3.5 × 2.9 km that corresponded to the main area of species distribution in the Bajío region, 5000 individuals of C. dugesii (Fig. 1a) and 3000 individuals of B. palmeri (Fig. 1b) were randomly positioned within a 10 km2 radius of each recorded location. Estimates of population size do not exist, but population sizes were established according to field observations. Although floristic surveys have been carried in the Bajío region in the last decades to record all flora species (which is the main objective of the herbarium IEB, https://plants.jstor.org/partner/IEB), there is still a possibility that some sites of species occurrence are unknown. To include the effect of missing sites on sampling schemes (see sampling design), I assigned individuals within few random locations across the landscape. The distribution of C. dugesii was more spatially aggregated relative to the more scattered distribution of B. palmeri. Mating parameters were set in CDPOP to represent a dioicous (equal number of females and males in B. palmeri) and a monoicous (C. dugesii) species. Pollen and seed dispersal were modeled from a probabilistic distribution proportional to an inverse linear function, which is determined by the maximal dispersal distance travelled by seeds (female-D max ) and pollen (male-D max ) (Landguth and Cushman 2010). Pollen distances were specified three times larger than seed dispersal as pollen is assumed to travel longer distances relative to the distances travelled by seeds (McCauley 1997; Petit et al. 2005). For C. dugesii, pollen flow was set to reach D max  = 25% (60, 952.8 m) of the maximal Euclidean distance between individuals (243,811.3 m), while seed dispersal distance was set to D max  = 5% (36, 571.6 m). For B. palmeri, pollen flow was set to D max  = 15% (51, 816.9 m) and seed dispersal distances were set to D max  = 3% (10, 363.4 m) of the maximal Euclidean pairwise individual distance (345, 445.8 m). For both species, there is no knowledge of pollen and seed dispersal, but the parameterized seed dispersal distance corresponded to dispersal distances in related Bursera species reported in the literature (Cantarello et al. 2011). For C. dugesii, there is no dispersal distances information in related species, but dispersal parameters selected allow building a genetic pattern of isolation by distance without crashing the population to extinction as IBD become stronger over generations.

Population growth followed a logistic function with a maximum size of 5000 (C. dugesii) or 3000 (B. palmeri) individuals (population size was constant at every generation), while allowing seeds fecund from multiple fathers. Based on information from the literature (Bonfil-Sanders et al. 2008; Sánchez-Martínez et al. 2011), age structure and percentage of mortality per class were defined as: (1) seeds (40% mortality), (2) seedlings (80% mortality), (3) juveniles (50% mortality), (4) reproductive individuals (20% mortality), and (5) mature individuals (100% mortality for non-overlapping generations). Simulations were set from generation 1 to 300, which generated genotypes of 20 microsatellites and ten alleles per loci. Because CDPOP simulates stochastic processes, I ran 25 replicates to account for the mean and variability of the spatial genetic structure for each species.

Sampling design combinations

I was interested to analyze the effects of sampling given three common factors: (1) spatial sampling scheme, (2) sampling intensity (number of sites), and (3) number of microsatellite loci. I applied three common types of spatial sampling schemes (random, geographically clustered, and systematic sampling within documented localities). For the random sampling, I randomly collected individuals across the whole landscape, without distinguishing sampling sites. For the geographically clustered sampling, I sampled individuals from small portions of the landscape within a 10 km2 radius. This type of sampling is common when logistic and economic resources are limited to collect few samples, but the researcher categorize sampling sites based on their geographic or environmental differences (e.g., Abeysinghe et al. 2000; Rico et al. 2008; Carrillo-Ángeles and Mandujano 2011). For the systematic sampling, I sampled individuals within a 10 km2 radius of documented locations. This sampling type is more extensive than the cluster sampling since a larger spatial representation of the species distribution is achieved, but some areas are missed because of their geographically inaccessibility or because the researcher have incomplete knowledge of species occurrence (e.g., He et al. 2009; Dubreuil et al. 2010). For each sampling strategy, four sampling intensities were applied: 80, 65, 50 and 35%. For the cluster and systematic sampling, sampling intensity was the number of documented sites of occurrence, while for the random sampling was the number of individuals across the whole landscape. The number of individuals collected within each of the four intensity levels in the random sampling, was set equal to the number of individuals sampled in each level of the systematic sampling. I implemented three sets of microsatellites: 20, 14, and 8 microsatellite loci, which reflect similar numbers of markers commonly employed in genetic studies for trees (e.g., Astronium urundeuva n = 7 loci, Caetano et al. 2005; Dipteryix alata n = 8 loci, de Campos Telles et al. 2014; Cedrela odorata n = 9 loci, Hernández et al. 2008; Tetragastris panamensis n = 15 loci, Kenfack and Dick 2009; Protorhus deflexa n = 19 loci, Sato et al. 2014; Laurus n = 20 loci, Arroyo et al. 2010). Levels within the three main sampling factors resulted in 36 sampling design combinations (Table 1).

Table 1 Simulated sampling factors and associated levels of analysis investigated in two forest tree species

Individual-based sampling

The objective of this study was not to assess the performance of individual-based and population-based analysis, but to examine how sampling scheme and effort can bias genetic differentiation estimates. Population-based analysis assumes that each sampling site constitute a distinct population, which in many tree species that are recently fragmented such assumption is likely inadequate. Empirical evidence suggests that individual-based analysis is adequate for quantifying spatial genetic structure in species with continuous and non-continuous distributions (Bankenhol and Fortin 2016). Prunier et al. (2013) and Luximon et al. (2014) using spatially-explicit computer simulations showed that individual-based analysis provides more statistical power for detecting isolation by distance relative to population-based analysis. Specifically, Prunier et al. (2013) for species with patchy distributions showed that as few as three to four sampling individuals per aggregate was sufficient for estimating spatial genetic structure using individual-based analysis. To avoid artificial delimitations of putative populations and to make analyses comparable among sampling schemes, I performed individual-based analysis.

Effects of sampling design on estimates of IBD

I used Mantel test (Mantel 1967) to evaluate the correlation of genetic and geographical distances for each of the 36 scenarios using the library ecodist (Goslee and Urban 2007) in the R statistical software (R Development Core Team 2016). Inter-individual genetic distances were calculated using the proportion of shared alleles (D PS , Bowcock et al. 1994), and Euclidean distances were calculated from the X and Y coordinates between all pairs of individuals. This was applied for each of the 25 replicate runs and for seven generation periods: 25, 50, 100, 150, 200, 250, and 300. To facilitate comparisons among sampling combinations and their deviation from the true r coefficient, I calculated a standardized measure of relative error (RE; Kossinets 2006): RE = (r coefficient True-valuer coefficient Replicate-value/r coefficient True-value) × 100%. The relative error represents the proportional difference between the mean true r coefficient and the estimated mean r coefficient among all simulated runs and for each sampling combination, which means that higher absolute values represent a greater divergence between the true and estimated value.

Significant deviations of IBD r coefficients from the true values and effect size

Three-way ANOVAs were applied to identify main factors and their interactions using as response variable the r coefficients. A logarithmic transformation to the response variable was applied to meet ANOVA assumptions. For all ANOVAs, I calculated the effect size of each resultant significant factor and their interactions to determine their relative strength on the response variable. The effect size (η2) was calculated as the ratio of the effect variance (sum of squares SS) to the total variance (total SS; Tabachnick and Fidell 2007). To assess whether the 25 replicate Mantel r coefficients significantly deviated from the observed distribution of “true” r coefficients (using the total genotype dataset for each of the 25 runs), I applied a Kolmogorov-Smirnoff test (KS) for each of the 36 sampling combinations.

Results

Effects of sampling design on estimates of IBD

For both tree species, the strength of the association between genetic and Euclidean distances increased with the number of generations, but B. palmeri resulted in a stronger IBD pattern (generation 25: r = 0.09 and generation 300: r = 0.35, p < 0.05) relative to the strength of IBD shown by C. dugesii (generation 25: r = 0.05 and generation 300: r = 0.29, p < 0.05). Figure 2 shows an example of the variation of RE values within sampling combinations and from generation 25 to 300. To facilitate visualization, Fig. 2 shows only comparisons of the three sampling strategies at sampling intensities of 80 and 35%, and for 20 and 8 loci, which were the scenarios with marked differences. For both species, RE values were larger when the r coefficient was low (generation 25) and tended to decrease as the number of generations progressed and the strength of the correlation coefficient increased. This trend was more evident in C. dugesii (Fig. 2a). An exception to this pattern was for the random sampling with 20 loci, which regardless of sampling intensity and generation time, was the sampling combination that produced the lowest error (<10%). Instead, the number of loci was more important in the random sampling, with an increase of >30% in RE values over scenarios with 8 loci.

Fig. 2
figure 2

Relative error plots of the three spatial sampling schemes at 80 and 35% of sampling intensity, and 20 and 8 microsatellite loci combinations for a Cedrela dugesii and b Bursera palmeri. RE values were calculated every 25 generations from generation 25 to generation 300. Sampling abbreviations: Syst Systematic, Clus Cluster and Rand Random

For both species, the cluster sampling with 20 loci and 35% of sampling intensity was the sampling combination that had the largest RE values independently of generation time. Overall, the cluster and systematic sampling were more sensitive to sampling intensity and number of loci, but for B. palmeri RE values within each sampling combination varied more and trends were less predictable (Fig. 2b). Moreover, the cluster and systematic sampling at 35% of sampling intensity resulted in higher RE variation, which changed from negative (above the true mean r coefficient) to positive (below the true mean r coefficient) as IBD become stronger over generations. Cluster sampling with 8 loci resulted in lower RE values relative to 20 loci regardless of sampling intensity, which was more evident in C. dugesii (Fig. 2a).

Significant deviations of IBD r coefficients from the true values

Figures 3 and 4 show boxplots of the distribution of estimated r coefficients around the “true” mean r coefficient (dashed line) for the 36 sampling combinations and for three generations (50, 150, and 300). For C. dugesii, random sampling with 20 loci resulted in estimated r coefficients closest to the true mean r, irrespective of sampling intensity and the strength of IBD. Results from the KS test indicated non-significant differences between distributions of the true r and estimated r coefficients for this sampling combination (Fig. 3). Cluster and systematic sampling at 20 loci across the three generations, showed significant differences from the true correlation for all sampling intensities. For both sampling strategies, estimated r coefficients were above the true mean r, and hence associations of IBD were overestimated. In scenarios with 14 and 8 loci, random sampling across all levels of sampling intensity resulted in estimated r coefficients below (underestimating) the true mean r, and with marked differences at 8 loci. This trend was observed across the three generations, and these deviations from the true r value were statistically significant. For the cluster and systematic sampling, estimated r coefficients were higher at loci 20, which monotonically decreased as the number of loci went from 14 to 8, which was more evident at generation 25 relative to generation 300.

Fig. 3
figure 3

Boxplots of the distribution of estimated r coefficients for each of the 36 sampling combinations in three generations (50, 150, and 300) in Cedrela dugesii. The dashed line represents the “true” mean Mantel r coefficient, calculated across 25 replicates for the complete data set. The stars above each box denotes a significant deviation (p < 0.05) of the distribution of estimated r coefficients from the true r values using the Kolmogorov-Smirnoff test

Fig. 4
figure 4

Boxplots of the distribution of estimated r coefficients for each of the 36 sampling combinations in three generations (50, 150, and 300) in Bursera palmeri. The dashed line represents the “true” mean Mantel r coefficient, calculated across 25 replicates for the complete data set. The stars above each box denotes a significant deviation (p < 0.05) of the distribution of estimated r coefficients from the true r values using the Kolmogorov-Smirnoff test

Similarly in B. palmeri, the random sampling resulted in estimated r coefficients that were not significantly different from the true r across the three generations analyzed. Although estimated r values for the systematic sampling were overall higher relative to the mean true r in scenarios with 20 loci, those differences were not statistically significant. Moreover, estimated r values for the systematic sampling were closer to the true r value with increasing the number of sites sampled, which was more evident as the strength of IBD increased. This trend was similar for the cluster sampling with 85% intensity at generation 150 and 300 (Fig. 4). Unlike C. dugesii, sampling intensity produced a lower variation within the three sampling strategies and irrespective of the number of loci (Fig. 4). Moreover, all sampling combinations with 8 loci (except cluster sampling at 35% intensity) had similar distributions of estimated r coefficients at all sampling intensity levels, which distributions were significantly lower than the true r coefficient.

Effect size of sampling factors

Results from ANOVA showed that effects of sampling strategy, sampling intensity, number of loci, and the interaction between sampling strategy and number of loci were statistically significant across the seven analyzed generations in both species (Tables 2 and 3). However, there were important differences in the relative effect sizes among factors, which values varied in relation to generation time and thus the strength of the r coefficient. Specifically, for C. dugesii, sampling strategy had the largest effect size, which comprised 74% of the variation at generation 25 (r = 0.05), but that decreased to 50% at generation 150 (r = 0.2). After generation 200 (r = 0.23), the number of loci reversed the trend and had a largest importance (>50% of the variation). Sampling intensity followed by the interaction between sampling strategy and number of loci accounted for less than 10% of the total variation (Table 2). For B. palmeri, sampling strategy had the largest effect size only at generation 25 (r = 0.1), which comprised 56% of the variation, while number of loci was the most important factor after generation 50 (r = 0.15), explaining 54%. Effect size values for the number of loci continued to increase as associations of IBD become stronger, and for a maximum effect size of 86% at generation 300 (r = 0.35). Also, sampling intensity followed by the interaction between sampling strategy and number of loci accounted for less than 15% of the variation (Table 3).

Table 2 Summary statistics of three-way ANOVAs to test differences among sampling factors and their interactions on mantel correlation coefficients generated for simulated data of seven generations in Cedrela dugesii
Table 3 Summary statistics of three-way ANOVAs to test differences among sampling factors and their interactions on mantel correlation coefficients generated for simulated data of seven generations in Bursera palmeri

Discussion

Performance of sampling design on the correct quantification of IBD

As expected, the dioicous species with shorter dispersal range (B. palmeri) developed stronger genetic differentiation over distance across all generations relative to the monoicous species with larger dispersal (C. dugesii). Results showed that the spatial sampling scheme and the number of loci were important factors influencing the correct estimation of IBD. Specifically, the random sampling with 20 microsatellite loci was the sampling combination that performed best across generations and for both species, and thus performance was independent of the strength of IBD (generation time) and the spatial distribution of individuals.

The good performance of the random sampling has been observed in previous computer simulation studies, which also found that increasing sample size increased the accuracy for identifying the underlying spatial process (Landguth et al. 2012; Oyler-McCance et al. 2013). I found that the random sampling with 20 loci was equally sufficient for quantifying the “true” strength of IBD at the lowest and highest proportions of sampled individuals. Indeed, the random sampling scheme was unaffected by sampling intensity across all scenarios. The random sampling in this study, was a good unbiased distribution of samples across all sites of species occurrence, which allowed to capture the existing pattern of genetic differentiation with acceptable degree on the condition that the number of loci was high (20 loci).

Simulation studies have found that the systematic sampling, where the sampling is distributed at regular intervals across the landscape, perform with similar acceptable rates to the random sampling (Naujokaitis-Lewis et al. 2013; Oyler-McCance et al. 2013). In contrast to previous studies, the systematic sampling did not show similar trends to the random sampling. Unlike the random sampling, performance of the systematic sampling was related to sampling intensity and the strength of IBD. Specifically, in B. palmeri, the effectiveness of the systematic sampling for successfully describing the “true” pattern of IBD, increased with the proportion of sampled sites and the number of loci. For C. dugesii, the above was observed only for the combination with 14 loci and when the pattern of IBD was moderate. The difference observed on the performance of the random versus the systematic sampling scheme is related to the influence of missing sites and the spatial configuration of the species distribution. In simulation studies for species with continuous distributions, the systematic sampling spaced at regular intervals and covering all areas of the species occurrence is more likely to capture the prevailing pattern of genetic structure, thus performing equally well to the random sampling. For the systematic sampling in this study, samples were collected only in recorded locations, while missing “undocumented” sites. Hence, as fewer sites were sampled, there was a larger chance of not accurately capturing the genetic variation found across the landscape (e.g., Koen et al. 2013; Naujokaitis-Lewis et al. 2013). The random and systematic sampling differ only in the spatial allocation of samples because the number of individuals within each intensity level were set equal, and hence differences are not due to sample size. The simulation study of van Strien et al. (2014), showed that habitat configuration has a large influence on spatial patterns of genetic differentiation under IBD, where the highest differentiation between pairs of demes (as measure with genetic distance metrics) may not necessarily corresponds to the largest pair-wise Euclidean distance. This implies that the strength of IBD may not appear as a uniform spatial gradient of genetic differentiation across the landscape. The different performance of sampling schemes in the two simulated species is likely related to the spatial configuration of individuals as suggested from the work of van Strien et al. (2014).

Oyler-McCance et al. (2013) through an individual-based simulation study for an animal species with continuous distributions found that the cluster sampling had a low probability of identifying the underlying spatial process irrespective of sample size. Overall, the results here also suggest that the cluster sampling does not perform well for species with non-continuous distributions. Remarkably, at the lowest proportion of sampled sites and combined with 20 loci, the cluster sampling performed the worst, overestimating the strength of IBD. This sampling scheme, was more sensitive to sample size and the number of loci, resulting in large variance of estimated r values across all scenarios. This pattern was more evident in C. dugesii, which spatial distribution was more aggregated.

Sampling in few locations of the whole species distribution, is common in population genetic studies due to logistics or financial constraints (e.g., Tucker et al. 2014; Dubreuil et al. 2010; Rico et al. 2008). The cluster sampling captures only a portion of the genetic variation found across the landscape, if by chance areas of high genetic differentiation are included, the strength of the association between genetic distances and Euclidean distances would appear stronger, thus overestimating IBD (Landguth and Schwartz 2014). On the other hand, if areas of the landscape with low genetic differentiation are included, IBD would be underestimated. This would explain the patterns shown for the systematic and the cluster sampling, but more marked in the cluster sampling at low levels of sampling intensity where the range of the r coefficient values was larger (i.e., interquartile size and whiskers of boxplots Figs. 3 and 4) relative to the variation in the other sampling schemes.

Relative importance of sampling factors

Although the three analyzed sampling factors had significant effects on the quantification of spatial genetic structure, the relative importance of each factor varied between species and across generations. For C. dugesii, the spatial sampling scheme was the most important from generation 25 to 150, which then was reversed for the number of loci, while for B. palmeri the number of loci was by far the most important factor from generation 50. In all cases, sampling intensity had a small effect size. Differences between species in the relative importance of sampling factors may be related to the strength of IBD. Specifically as IBD become stronger, the relative effect size values for the number of loci increased, while decreased for the spatial sampling scheme. Landguth et al. (2012), in a simulation study found that increasing the number of loci (up to 25) was important for the correct identification of the underlying spatial process. This trend was more evident if IBD was strong (Mantel r > 0.7). However, the authors found that increasing the number of samples (from 100 to 500) if the number of loci was small (n = 15) improved the probability of correctly identifying the underlying spatial process (Landguth et al. 2012). In this study, increasing the number of samples (e.g., from 260 individuals at 35% to 700 individuals at 80%) did not improve the performance of the sampling scheme if loci was either 14 or 8. Specifically, using 8 loci underestimated the “true” pattern of IBD in most sampling combinations (except in C. dugesii for the cluster sampling at low sample sizes). Notably, underestimating IBD with 8 loci was marked at generation 300 when IBD was the strongest.

These results indicate the lower power to detect genetic structure in standard correlative matrix distance methods if few microsatellite loci are used (Landguth et al. 2012; Peterman et al. 2016). Mantel and partial Mantel test statistics have been criticized due to its poor performance for accurately identifying the main spatial process driving genetic structure if multiple and correlated competing landscape hypotheses occur (Guillot and Rousset 2013; Zeller et al. 2016). Because in this study isolation by distance is the only spatial process being modeled, I consider the simple Mantel test sufficient for illustrating the relative performance that sampling design have on the quantification of IBD. It would be interesting to explore under more complex simulation scenarios, such as landscape resistances to gene flow (e.g., Zeller et al. 2016) or adaptive genetic differentiation under gene flow (e.g., Landguth and Balkenhol 2012), the performance of alternative statistical methods, such as spatial regressions and multiple regression on distance matrices (MRM) (Wagner and Fortin 2016). Other important aspect to explore using computer simulations, would be the role of dispersal vectors on patterns of genetic structure. For instance, the influence of wind speed and turbulence is known to importantly influence the distance and direction of pollen flow (Wang et al. 2016) and seed deposition (Nathan et al. 2002). Moreover, the movement of pollinators can also be influenced by the speed and direction of wind (Ahmed et al. 2009). The effect of sampling schemes under more realistic scenarios beyond the species spatial configuration and landscape structure would be needed.

Recommendations and Conclusions

Results here highlight that covering the total spatial extent of the species distribution is critical because missing areas of occurrence is likely to under- or over-estimate patterns of genetic differentiation. By contrasting two species with varying spatial distributions, it was observed that a cluster sampling scheme is even more problematic for species with patchy distributions. Equally important is the number of microsatellite loci because as fewer loci are included, there is an increasing chance of underestimating IBD due to low power to detect spatial genetic structure. Increasing the number of samples over the number of loci, appears not to improve the performance of sampling schemes for correctly identifying the underlying process even if sampling is randomly distributed across the landscape. This implies that researchers should give more priority to obtain large number of markers over the effort of increasing sample size. Obtaining a large number of loci >20 and with good-level of polymorphism (i.e., 10 alleles) is still difficult for many research studies in non-model species with poor o non-existing genetic information. Fortunately, advances of next generation sequencing technologies allow overcoming the limitation associated with marker discovery and polymorphism. Specifically, restriction site associated DNA sequencing and genotyping by sequencing methods that identify thousands of polymorphic markers across the genome (single nucleotide polymorphisms SNP) are becoming rapidly accessible to research studies in non-model species (Baird et al. 2008; Elshire et al. 2011).

Despite the increasing development of computer simulation software in ecological and landscape genetics (e.g., SPOTG of Hoban et al. 2013; SimAdapt of Rebaudo et al. 2013; CDMetaPOP of Landguth et al. 2016), computer simulations remain poorly used as a tool for sampling planning in research studies. The routine implementation of computer simulations with model parameters focused on the biology of the study species, the vectors, and the landscape structure would support more effective decisions on sampling design in population genetic studies (Hoban 2014).