Introduction

Most angiosperms are hermaphrodite (70%; Yampolsky and Yampolsky 1922). This co-occurrence of both male and female reproductive organs on the same individual allows self-pollination and, in the absence of self-incompatibility mechanisms, self-fertilization. Indeed, about 40% of flowering plants do self at various rates, and about 15% of them reproduce predominantly through selfing, with outcrossing rates lower than 10% (Igic and Kohn 2006). Such high selfing rates are expected to have strong consequences on the genetic diversity of natural populations and on its organization.

Theoretical studies have examined the consequences of selfing for the genetic diversity of populations, particularly in terms of effective population size Ne. The effective population size is defined as the size of an ideal Wright–Fisher population experiencing the same rate of genetic drift as the population under consideration (Crow and Kimura 1970). As reviewed in Charlesworth (2009), the effective population size can be affected by several factors, including the mating system. Self-fertilization reduces the number of independent gametes sampled for reproduction, which directly decreases Ne (Pollak 1987). Demographic events are also likely to affect the effective size of selfing populations where founder effects can be frequent, e.g., through the establishment of a new population by a single individual (Baker 1967). In addition, according to the “dead-end hypothesis” (Stebbins 1957), selfing populations are expected to accumulate deleterious mutations (Lynch et al. 1995; Abu Awad et al. 2014), and could lack the genetic diversity required to adapt to changing environmental conditions (Charlesworth and Charlesworth 1995; Lande and Porcher 2015; Abu Awad and Roze 2018). Therefore, we can expect frequent catastrophic demographic events, with strong bottlenecks or even extinctions followed by recolonization accompanied by strong founder effects (Schoen and Brown 1991). Such metapopulation dynamics (Ingvarsson 2002) are expected to further reduce Ne (Pannell and Charlesworth 2000). Finally, selective effects can also reduce the effective size in selfing populations. Indeed, the increase in homozygosity due to selfing (Caballero and Hill 1992) reduces the effective recombination (Golding and Strobeck 1980; Nordborg 2000). The reduction of Ne due to selective sweeps or background selection is thus expected to extend by hitchhiking to larger linked regions of the genome or, in some extreme cases, to the whole genome (Charlesworth 2009). 
These different effects of the mating system on the effective population size have been summarized by Glémin (2007) as

$$N_e = \frac{{\alpha N}}{{1 + F}}{,}$$
(1)

where N is the census size, F = σ/(2 − σ) is Wright’s equilibrium fixation index with a selfing rate σ, and α summarizes the reduction of the effective population size due to demographic effects and hitchhiking (\(\alpha \in \left[ {0;\,1} \right]\)). Overall, this formalizes the predominant role of genetic drift in shaping genetic diversity within selfing populations.
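As an illustration of Eq. (1), the following minimal Python sketch computes Wright's equilibrium fixation index and the resulting effective size. The function names and the choice of Python are ours, for illustration only; α is left as a free parameter because its value depends on demography and hitchhiking and is not given by the equation itself.

```python
def fixation_index(selfing_rate):
    """Wright's equilibrium fixation index, F = sigma / (2 - sigma)."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing rate must lie in [0, 1]")
    return selfing_rate / (2.0 - selfing_rate)


def effective_size(census_size, selfing_rate, alpha=1.0):
    """Eq. (1): Ne = alpha * N / (1 + F), with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * census_size / (1.0 + fixation_index(selfing_rate))


# With alpha = 1, complete selfing halves the effective size.
for sigma in (0.0, 0.5, 0.95, 1.0):
    print(sigma, effective_size(250, sigma))
```

With α = 1, the reduction due to selfing alone is at most twofold (σ = 1), so any stronger reduction has to come through α.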

Empirical observations confirm these theoretical expectations, notably through estimations of Ne and genetic diversity in natural predominantly selfing populations. The contemporary effective size of a population can be estimated based on the temporal changes in allele frequency (FC, Waples 1989) as was done by Siol et al. (2007) and Gomaa et al. (2011) (Table 1). More recently, Frachon et al. (2017) extended the original method to use the temporal differentiation (FST instead of FC). When we apply this method to published temporal FST values, we find that Ne estimates in predominantly selfing populations are generally lower than 100 (Table 1). For comparison, Palstra and Ruzzante (2008) reviewed temporal estimates of Ne in 83 studies concerning different taxa (including the aforementioned selfing species) and found a median Ne of 260. Published estimates of Ne (or temporal FST) therefore support theoretical predictions of a reduced effective population size in predominantly selfing populations compared to outcrossing ones. Reviews on allozyme data (Schoen and Brown 1991; Hamrick and Godt 1997), as well as on sequence polymorphism (Glémin et al. 2006), showed convincing evidence for lower genetic diversity (as measured by HE) in predominantly selfing populations compared to outcrossing ones. Yet, Schoen and Brown (1991) also reported larger variability in levels of genetic diversity among selfing populations than among outcrossing populations. Empirical estimates for genetic diversity reported in the temporal studies we reviewed for estimates of effective size are consistent with these findings (Table 1), with some monomorphic populations (HE = 0) along with highly diverse populations (up to HE = 0.88). 
Substantial genetic diversity can therefore persist in some predominantly selfing populations, suggesting that evolutionary forces other than genetic drift may play a significant role in shaping the genetic diversity of natural populations reproducing predominantly through selfing.

Table 1 Ranges of temporal FST and estimates of effective population sizes in the literature of predominantly selfing populations

Besides the level of genetic diversity, the within-population genetic structure is also affected by selfing. Indeed, due to reduced effective recombination (Nordborg 2000), we expect populations to be organized in homozygous lineages, where some multilocus genotypes (hereafter called MLGs) can reach a high frequency (Hartfield et al. 2017). Empirical studies confirm this expectation, for example in the nearly obligate selfing species Lobelia inflata where substantial genetic differentiation was found between completely homozygous lineages co-occurring within populations (Hughes and Simons 2015). Similar population genetic structures have also been observed in several other predominantly selfing plant or animal species (e.g., Barrière and Félix 2007; Montesinos et al. 2009; Siol et al. 2008). Because this multilocus genetic structure is specific to predominantly selfing populations, we believe that comparing single and multilocus indices of diversity can help separate the effects of selfing from those of genetic drift due to small population sizes or demographic processes such as population size changes or migration.

Along with the effect of selfing, Allard (1975) interpreted this distinctive structure of genetic diversity in repeated genotypes as a result of selection favoring locally adapted MLGs, which can then reach high frequencies in the population. The reduction in gene flow through pollen dispersal in selfing populations could indeed promote local adaptation, as suggested by Hereford (2010), even if no significant effect of the mating system on local adaptation was found in his meta-analysis. However, given the strong incidence of genetic drift expected in populations undergoing high and recurrent selfing, the efficacy of selection is questionable and the role of local adaptation as opposed to genetic drift in shaping the MLG composition of these populations remains to be assessed. We propose to test whether neutral processes alone, in the absence of selection, can be responsible for the peculiar genetic structure observed in highly selfing populations (high-frequency MLGs). Answering this question requires analytical predictions for multilocus diversity indices such as those available for single locus diversity. Such predictions are lacking, and little is known about the expected range of values for multilocus diversity indices under high and recurrent selfing. This absence of a formal description of the multilocus genetic diversity expected in predominantly selfing populations evolving under neutral scenarios limits the interpretation of empirical data.

In addition, the organization of predominantly selfing populations in MLG lineages offers the possibility to follow the changes in MLG frequencies through time. Such temporal surveys can give additional insight into the processes shaping diversity. In particular, they are useful to measure the strength of genetic drift through the estimation of the effective population size (e.g., Table 1). Although data gathered from time series are frequent in experimental populations evolving under artificial selection, they are more rarely available for natural populations (Bailey and Bataillon 2016), in particular predominantly selfing populations (Table 1). Temporal studies of natural selfing populations have found that MLGs can be maintained within populations over time (Siol et al. 2007; Bomblies et al. 2010; Gomaa et al. 2011). Nonetheless, the last two studies have also found populations in which all the MLGs changed over time, which led the authors to propose extinction–recolonization dynamics to explain their observations. Yet, because there are no theoretical predictions for the trajectory of MLG frequencies over time in populations evolving neutrally, it is not clear how demographic events (from changes in population size to extinction–recolonization) affect the persistence of multilocus genotypes over time without selection.

Here, we propose to use simulations to explore how predominant selfing shapes single locus and multilocus genetic diversity in neutrally evolving populations. The goals of our study are threefold. First, we use simulations to provide neutral expectations for multilocus indices of diversity and determine whether neutral scenarios can explain the peculiar population genetic structure (with high frequency and persistent MLGs) observed in predominantly selfing species without selection. Second, we examine how combining single and multilocus indices of diversity may be insightful when studying the evolutionary trajectory of predominantly selfing populations to distinguish the effects of selfing, population size, and more complex demographic events such as bottlenecks, migration, admixture, or extinction–recolonization. Third, we use changes in allele frequency through time to examine whether we can estimate effective sizes as small as those reported in the literature, and we consider the influence of complex demographic scenarios such as bottlenecks, admixture, and extinction–recolonization on the trajectory of MLG frequencies through time. We compare our simulation results with observations from temporal data on nine populations of the highly selfing plant species Medicago truncatula. These nine temporal datasets for M. truncatula natural populations can be viewed as a reality check (independent iterations of evolution in a selfing population across 20 generations). As such, they validate the pertinence of our simulation framework as we find genetic diversity patterns similar to our simulations.

Material and methods

Simulation model and scenarios explored

We performed individual-based simulations of diploid hermaphroditic populations using SLiM 2.5 (Haller and Messer 2017). In order to be able to qualitatively compare the simulation results with our empirical data, we fixed some simulation parameters such as the type of genetic markers, the number of loci, and the time span between the temporal samples. We simulated the evolution of 20 independent loci (with a recombination rate of 0.5). SLiM output was processed in R (R Core Team 2018) in order to transform the mutations that occurred on each of the 20 predefined loci into microsatellite allele sizes following the stepwise mutation model (Ohta and Kimura 1973). Briefly, we randomly attributed an effect to each mutation (±1 repeat unit) and the effects of all the mutations occurring at a given locus in a given individual were summed in order to obtain microsatellite allele size. Mutations were neutral and occurred at a rate \(\mu = 10^{-3}\) per generation and per locus, which is a realistic rate for plant microsatellites (Thuillet et al. 2002; Marriage et al. 2009). To produce the next generation, new zygotes were built as a combination of two gametes sampled either from two different individuals for outcrossing, or from the same individual for selfing, according to a fixed selfing rate (σ). Each simulation comprised two periods. A first period of 25N generations (with N the demographic population size, measured as the number of diploid individuals) allowed the populations to reach the mutation-drift equilibrium. At this stage (time t0 = 0), 100 diploid individuals were randomly sampled. Twenty generations later (t20), a second sample of 100 individuals was drawn to obtain temporal sampling.
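The conversion of mutations into microsatellite allele sizes described above can be sketched as follows. This is an illustrative Python version, not the R post-processing script used for the simulations; the mutation identifiers, the haplotype labels, and the ancestral size of 0 (sizes expressed relative to the ancestral allele) are assumptions made for the example.

```python
import random


def assign_effects(mutation_ids, rng):
    """Draw one stepwise effect (+1 or -1 repeat unit) per distinct mutation."""
    return {m: rng.choice((-1, 1)) for m in sorted(set(mutation_ids))}


def allele_size(mutations_at_locus, effects, ancestral_size=0):
    """Sum the effects of all mutations carried at one locus on one haplotype."""
    return ancestral_size + sum(effects[m] for m in mutations_at_locus)


# Hypothetical input: mutation identifiers carried at one locus by three haplotypes.
haplotypes = {"ind1_hapA": [101, 102, 105], "ind1_hapB": [101], "ind2_hapA": []}

rng = random.Random(1)  # seeded only to make the example reproducible
effects = assign_effects([m for h in haplotypes.values() for m in h], rng)
sizes = {name: allele_size(muts, effects) for name, muts in haplotypes.items()}
print(sizes)
```

Because the effect is attached to the mutation rather than to the individual, two haplotypes that inherited the same mutation shift by the same step, which preserves identity by descent in the allele sizes.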

Five demographic scenarios were considered. In the first one, we simulated a single isolated population with a constant demographic size N. Four demographic population sizes were considered, \(N \in \left[ {50;\,100;\,250;\,1000} \right]\), and combined with five values of the selfing rate (σ): 0 (completely outcrossing population), 0.5 (partially selfing population), 0.95, 0.98 (predominantly selfing population), and 1 (completely selfing population). To disentangle the effects of selfing from those of genetic drift, we also simulated populations of the same effective size with different selfing rates by setting N = 2Ne/(2 − σ) for \(\sigma \in \left[ {0;\,0.5;\,0.95;\,0.98;\,1} \right]\) and \(N_e \in \left[ {100;\,250} \right]\). To examine the effect of sample size, we repeated the analysis for one of the simulations with N = 100 and \(\sigma \in \left[ {0;\,0.95;\,1} \right]\) after reducing the sample size at t0 and t20 to 5, 10, 20, 30, or 50 individuals. Each sampling was repeated independently 100 times. In the following scenarios, we considered only predominantly selfing populations (σ = 0.95). In a second scenario, we explored the combined effects of predominant selfing and a bottleneck. To this aim, we simulated an isolated population of size N = 250 and a selfing rate σ = 0.95 undergoing at time t10 a drastic demographic size reduction (to N’ = 1, 5, or 25 diploid individuals) for one generation. In a third scenario, we evaluated the effects of migration by simulating an island model with ten subpopulations of constant size \(N \in \left[ {50;\,100;\,250} \right]\) exchanging diploid migrants at a constant rate. Three values of migration rate (m) were simulated: \(2 \times 10^{-4}\), \(2 \times 10^{-3}\), and \(2 \times 10^{-2}\) per generation. Samples were taken from a single deme (the focal population) in these structured scenarios.
The effects of more drastic migration events were investigated in a fourth scenario, the admixture scenario, where a fraction of the focal subpopulation was replaced by individuals from another single population. The metapopulation was again simulated with an island model with a migration rate \(m = 2 \times 10^{-3}\) per generation. At time t10 a single admixture event was simulated, with an admixture rate of 50%, 75%, or 100%. Note that 100% admixture is equivalent to a local extinction and recolonization scenario without change in population size. The focal population was sampled at generations t0 and t20, as in the previous scenarios. Because extinction–recolonization events may be associated with founder events, we evaluated a final set of scenarios with a bottleneck concomitant with an extinction–recolonization event. During the bottleneck, the focal population size was reduced to 1, 5, or 25 diploid individuals. After one generation, the population size was restored to N = 250 individuals and 100 diploid individuals were sampled at t20. For each simulation scenario described above (and summarized in Table S1), 1000 independent replicates were performed. SLiM simulation scripts for each of these scenarios as well as R scripts are available on the INRA data portal (https://doi.org/10.15454/VYPXIJ).
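The adjustment of the demographic size used to hold the effective size constant across selfing rates, N = 2Ne/(2 − σ), follows from Eq. (1) with α = 1. The short Python sketch below tabulates it for the parameter values listed above. The function name is ours, and how non-integer values of N were rounded for the simulations is not specified here, so the values are left unrounded.

```python
def census_size_for(target_ne, selfing_rate):
    """Demographic size N giving effective size Ne under selfing rate sigma:
    N = 2 * Ne / (2 - sigma), i.e. Eq. (1) with alpha = 1 solved for N."""
    return 2.0 * target_ne / (2.0 - selfing_rate)


selfing_rates = (0.0, 0.5, 0.95, 0.98, 1.0)
target_sizes = (100, 250)

# Grid of demographic sizes, keyed by (Ne, sigma); values are not rounded.
grid = {(ne, s): census_size_for(ne, s) for ne in target_sizes for s in selfing_rates}

for (ne, s), n in sorted(grid.items()):
    print(f"Ne = {ne:>3}, sigma = {s:<4} -> N = {n:.1f}")
```

A completely selfing population therefore needs twice as many individuals as an outcrossing one to experience the same rate of drift.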

Diversity indices

Diversity analyses were performed using the Hierfstat package in R (Goudet 2005). The genetic diversity of each simulated population was assessed on the t20 sample using the average gene diversity across loci (HE, Nei 1973), the variance in allele size (V), the average number of alleles per locus (nA), and the number of polymorphic loci (PL). In an isolated random mating population at mutation-drift equilibrium, HE and V measured on microsatellite markers evolving under the stepwise mutation model are expected to vary with the effective population size Ne and the mutation rate µ as

$$H_E = 1 - \sqrt {\frac{1}{{2\theta + 1}}}$$
(2)
$$V = 2N_e\mu$$
(3)

where \(\theta = 4N_e\mu\) (Kimmel et al. 1998).
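For reference, Eqs. (2) and (3) can be evaluated numerically as in the following Python sketch (function names are ours). It only restates the two equilibrium expectations and makes no attempt to model sampling variance.

```python
import math


def expected_he(ne, mu):
    """Eq. (2): HE = 1 - sqrt(1 / (2 * theta + 1)), with theta = 4 * Ne * mu."""
    theta = 4.0 * ne * mu
    return 1.0 - math.sqrt(1.0 / (2.0 * theta + 1.0))


def expected_allele_size_variance(ne, mu):
    """Eq. (3): V = 2 * Ne * mu."""
    return 2.0 * ne * mu


# Expectations for the mutation rate used in the simulations (mu = 1e-3).
for ne in (50, 100, 250, 1000):
    print(ne, round(expected_he(ne, 1e-3), 3), expected_allele_size_variance(ne, 1e-3))
```

For Ne = 250 and µ = 0.001, θ = 1, so the expected gene diversity is 1 − 1/√3 and the expected variance in allele size is 0.5.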

The deviation from Hardy-Weinberg proportions was measured using the inbreeding coefficient FIS with the R package Hierfstat. The percentage of pairs of loci showing significant linkage disequilibrium (LD%) was calculated using Genepop (Rousset 2008) with a significance threshold of 0.05. The identity disequilibrium (g2), which is expected to depend on the selfing rate following \(g_2 = \frac{{1 - \sigma }}{{\left( {1 - \frac{\sigma }{4}} \right)\left( {1 - \frac{\sigma }{{2 - \sigma }}} \right)^2}} - 1\) (David et al. 2007), was also computed using the R package inbreedR (Stoffel et al. 2016). The R package Poppr (Kamvar et al. 2014) was used to identify the number of private alleles (pA), to group individuals with identical combinations of alleles (multilocus genotypes, MLG), and to compute the number of distinct MLGs, their frequency and their distribution over time in the two samples (t0, t20). The multilocus diversity was characterized by Shannon’s index, computed as \(H = - {\sum} {p_i\ln \left( {p_i} \right)} ,\) where \(p_i\) is the frequency of the ith MLG. The frequency of the most frequent MLG (MFMLG) was also computed and we analyzed the correlation between MFMLG and H through Spearman correlations using R.
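The multilocus indices defined above can be sketched in a few lines of Python. This is an illustration rather than the Poppr implementation; representing each individual as a tuple of per-locus allele pairs is an assumption made for the example.

```python
import math
from collections import Counter


def mlg_summary(genotypes):
    """Number of distinct MLGs, Shannon's index H, and frequency of the
    most frequent MLG, from one hashable multilocus genotype per individual."""
    if not genotypes:
        raise ValueError("at least one individual is required")
    counts = Counter(genotypes)
    n = len(genotypes)
    freqs = [c / n for c in counts.values()]
    shannon = -sum(p * math.log(p) for p in freqs)
    return {"nMLG": len(counts), "H": shannon, "MFMLG": max(freqs)}


# Hypothetical sample of four individuals typed at two loci.
sample = [
    ((10, 10), (12, 12)),
    ((10, 10), (12, 12)),
    ((10, 10), (12, 13)),
    ((11, 11), (14, 14)),
]
print(mlg_summary(sample))
```

When every individual carries a distinct MLG, H reaches its maximum ln(n) and MFMLG its minimum 1/n, which is the pattern expected under outcrossing.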

We calculated the pairwise genetic distances between individuals at generation t20 as the number of allele differences (between 0 and 40) between each pair of individuals, regardless of the allele size. We used two indices to characterize the distributions of distances: the mean pairwise genetic distance (Dmean) and the maximum pairwise genetic distance (Dmax). The correlation between some indices (Dmean and HE, or Dmax and LD%) was measured through Spearman correlations using R.
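A possible way to compute these pairwise distances is sketched below in Python. How allele differences are counted within a locus is not detailed above; the sketch assumes that, at each locus, the distance is two minus the number of allele copies shared between the two diploid genotypes, which yields values between 0 and 40 for 20 loci. The function names and the genotype encoding are likewise ours.

```python
from collections import Counter
from itertools import combinations


def allele_differences(genotype_a, genotype_b):
    """Number of allele differences between two diploid multilocus genotypes.
    Assumption: per locus, 2 minus the number of shared allele copies."""
    total = 0
    for locus_a, locus_b in zip(genotype_a, genotype_b):
        shared = sum((Counter(locus_a) & Counter(locus_b)).values())
        total += 2 - shared
    return total


def distance_summary(genotypes):
    """Return (Dmean, Dmax) over all pairs of individuals."""
    distances = [allele_differences(a, b) for a, b in combinations(genotypes, 2)]
    if not distances:
        raise ValueError("at least two individuals are required")
    return sum(distances) / len(distances), max(distances)


# Hypothetical sample of three individuals typed at two loci.
sample = [((10, 10), (12, 12)), ((10, 10), (12, 13)), ((11, 11), (14, 14))]
d_mean, d_max = distance_summary(sample)
print(d_mean, d_max)
```

Allele sizes enter only through identity, so two alleles differing by one repeat count the same as two alleles differing by ten, in line with the definition above.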

To summarize the trajectories of MLG frequencies through time, we considered the joint MLG frequency spectrum (MLGFS, by analogy with the allele frequency spectrum) as the matrix containing the proportion of MLGs found at the corresponding individual counts in each generation, averaged over simulation replicates. For K simulation replicates, we have \(MLGFS\left[ {i,j} \right] = \frac{{\mathop {\sum}\nolimits_{k = 1}^K {\frac{{MLG_k\left( {i,j} \right)}}{{MLG_k}}} }}{K},\) where \(MLG_k\left( {i,j} \right)\) is the number of MLGs found in i individuals at t0 and in j individuals at t20 in replicate k, and \(MLG_k\) is the total number of different MLGs in replicate k. The MLGFS therefore makes it possible to follow the evolution of MLG frequencies over time.
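The construction of the joint spectrum can be illustrated with the following Python sketch. It assumes that each replicate is available as a pair of samples (t0, t20), each a list of hashable MLG labels, and that the total number of MLGs in a replicate is counted over both samples; names and data layout are ours.

```python
from collections import Counter


def joint_mlg_spectrum(replicates, n_t0, n_t20):
    """Joint MLG frequency spectrum averaged over replicates.
    Entry [i][j] is the mean proportion of MLGs carried by i individuals
    at t0 and by j individuals at t20."""
    if not replicates:
        raise ValueError("at least one replicate is required")
    spectrum = [[0.0] * (n_t20 + 1) for _ in range(n_t0 + 1)]
    for sample_t0, sample_t20 in replicates:
        counts_t0, counts_t20 = Counter(sample_t0), Counter(sample_t20)
        mlgs = set(counts_t0) | set(counts_t20)  # all MLGs seen in this replicate
        for mlg in mlgs:
            spectrum[counts_t0[mlg]][counts_t20[mlg]] += 1.0 / len(mlgs)
    k = len(replicates)
    return [[value / k for value in row] for row in spectrum]


# Hypothetical replicate with three individuals per sample.
replicates = [(["A", "A", "B"], ["A", "C", "C"])]
spectrum = joint_mlg_spectrum(replicates, n_t0=3, n_t20=3)
print(spectrum[2][1], spectrum[1][0], spectrum[0][2])
```

MLGs lost between samples fall in the column j = 0 and MLGs that are new at t20 fall in the row i = 0, so persistence shows up as mass away from both axes.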

The relative temporal differentiation between the two samples was assessed with Weir and Cockerham’s FST (1984), estimated using the R package Hierfstat (Goudet 2005). The effective population size was estimated based on the temporal differentiation between samples (temporal FST) as outlined in Frachon et al. (2017):

$$\widehat {N_e} = \frac{{\tau \left( {1 - F_{ST}} \right)}}{{4F_{ST}}}{,}$$
(4)

where \(\widehat {N_e}\) is the estimate of the effective population size and τ is the number of generations separating the two sampling events. This method assumes that the population is isolated (no migration), of constant size and that no mutation occurs between samplings. We estimated the focal population effective size in our different simulation scenarios in order to examine whether the deviations from theoretical assumptions (e.g., admixture or bottlenecks) can lead to Ne estimates as small as those reported in the literature. Ne estimates in scenarios of isolated populations were compared with the theoretical expectations given by Eq. (1) (assuming α = 1: \(N_e = \frac{N}{{1 + \frac{\sigma }{{2 - \sigma }}}}\)) for an isolated population with no change of population size, where N is the demographic size of the population and σ is the selfing rate (Pollak 1987); and \(\frac{1}{{N_e}} = \frac{1}{T}\mathop {\sum}\nolimits_{t = 1}^T {\frac{1}{{N_e^t}}}\) for an isolated population undergoing bottlenecks, where \(N_e^t\) is the effective population size at generation t (Crow and Kimura 1970).
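The estimator of Eq. (4) and the two theoretical expectations it is compared with can be written as the following Python sketch. Function names are ours, and returning infinity when the temporal FST is zero or negative is a convention chosen for the example, not a rule stated above.

```python
def temporal_ne(fst, generations):
    """Eq. (4): Ne_hat = tau * (1 - FST) / (4 * FST).
    Assumption: a non-positive FST is reported as an infinite estimate."""
    if fst <= 0.0:
        return float("inf")
    return generations * (1.0 - fst) / (4.0 * fst)


def expected_ne_selfing(census_size, selfing_rate):
    """Eq. (1) with alpha = 1: Ne = N / (1 + sigma / (2 - sigma))."""
    return census_size / (1.0 + selfing_rate / (2.0 - selfing_rate))


def harmonic_mean_ne(ne_per_generation):
    """Long-term Ne as the harmonic mean of per-generation effective sizes."""
    return len(ne_per_generation) / sum(1.0 / ne for ne in ne_per_generation)


# Example: 20 generations at N = 250 with sigma = 0.95,
# one of which is spent at N' = 5 (bottleneck).
sizes = [250] * 19 + [5]
per_generation = [expected_ne_selfing(n, 0.95) for n in sizes]
print(temporal_ne(0.05, 20), harmonic_mean_ne(per_generation))
```

The harmonic mean is dominated by the smallest values, so a single generation at N' = 5 is enough to pull the long-term expectation far below the 131.25 expected for a constant N = 250 with σ = 0.95.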

Medicago truncatula natural populations

Medicago truncatula is an annual, predominantly selfing species of the legume family (Fabaceae), found around the Mediterranean Basin. Maternal progeny analyses have shown very low levels of residual outcrossing (Siol et al. 2008). Between 1986 and 2014, nine natural populations located in Spain (SP1–SP3), Corsica (CO1–CO3), and southern France (FR1–FR3) were sampled two or three times each (locations of the different populations can be found on a map in Figure S1). In order to avoid over-sampling the progeny of a single individual, pods were sampled along transects running across the populations, with at least 1-m distance between each collected pod. This sampling strategy also limits spatial effects due to the very fine spatial structure observed in M. truncatula natural populations (Bonnin et al. 2001). Sample sizes varied between 31 and 232 individuals. Hereafter, each temporal sample is denoted by its population code followed by the sampling year.

DNA was extracted from 50 mg of fresh leaves with the Chemagic DNA Plant Kit (Perkin Elmer), according to the manufacturer’s instructions. The protocol is adapted to the use of the KingFisher Flex™ (Thermo Fisher Scientific) automated DNA purification workstation. Twenty microsatellite loci were used for genotyping. Eighteen of them have been described previously (Baquerizo‐Audiot et al. 2001; Arrighi et al. 2006; Ronfort et al. 2006; Siol et al. 2007). Two new loci, 319 and DMI1-6, were developed in our team after identifying long and polymorphic simple sequence repeats in resequencing studies (319-F GTGGGATTTGAATAGGATTG, 319-R CGATATGGTCCACTTTTGTC, annealing temperature: 57 °C; DMI1-6-F1 TAGAAGATGAAGCGCAAACG, DMI1-6-R2 TTCACCTTAACGCGTCCAAC, annealing temperature: 60 °C). We followed the protocol of amplification reactions described in Siol et al. (2007). Samples were prepared by adding 3 µl of diluted PCR products to 16.5 µl of ultrapure water and 0.5 µl of the size marker AMM524. Amplified products were analyzed on an ABI prism 3130 Genetic Analyzer and genotype reading was performed using GeneMapper Software version 5. Individuals and loci with more than 10% missing data across all samples of a population were removed from the diversity analyses, as well as completely monomorphic loci.

For each population and year, we performed the same analyses of single and multilocus diversity as those performed on our simulated populations. To account for variation in sample sizes, mean allelic richness per locus (Rs) and private alleles (pA) were computed using the rarefaction method with the program ADZE (Szpiech et al. 2008). Selfing rates were estimated from FIS using the classical relationship FIS = σ/(2 − σ) (Hartl and Clark 1998) and using a maximum-likelihood approach based on the identity disequilibrium (g2), with the software RMES (David et al. 2007). MLG frequency spectra were computed for each population. Temporal FST estimates were used to estimate the effective population size using the method described previously (Eq. (4), Frachon et al. 2017), assuming a single generation per year. Approximate bootstrap confidence intervals for the temporal estimates of effective size were computed following DiCiccio and Efron (1996).
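Inverting the classical relationship FIS = σ/(2 − σ) gives σ = 2FIS/(1 + FIS). A minimal Python sketch of this step is given below; truncating the result to the interval [0, 1] when FIS is negative is an assumption of the example, and the sketch does not cover the maximum-likelihood estimation performed with RMES.

```python
def selfing_rate_from_fis(fis):
    """Invert FIS = sigma / (2 - sigma): sigma = 2 * FIS / (1 + FIS).
    Assumption: estimates are truncated to the interval [0, 1]."""
    if fis <= -1.0:
        raise ValueError("FIS must be greater than -1")
    sigma = 2.0 * fis / (1.0 + fis)
    return min(1.0, max(0.0, sigma))


for fis in (0.0, 0.5, 0.9, 1.0):
    print(fis, round(selfing_rate_from_fis(fis), 3))
```

Because the relationship is steep near FIS = 1, small errors in FIS translate into small errors in σ for highly selfing populations, but null alleles or substructure that inflate FIS would bias σ upward.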

Results

Single-locus and multilocus genetic diversity in isolated populations

In the simulations of a single isolated population for different combinations of selfing rates and demographic population sizes, estimates of single locus indices (HE and V) are in accordance with theoretical predictions at mutation-drift equilibrium (Table S2), showing a decrease in the neutral genetic diversity with increasing selfing rates (Table 2). When the demographic size is adjusted to keep the effective size constant while the selfing rate varies, single locus diversity indices remain around the expected value too (Table 3). These results are not new as they replicate the known effects of selfing on single locus diversity but they are helpful to validate our simulation framework.

Table 2 Mean values for single locus and multilocus indices of genetic diversity for isolated populations with increasing selfing rates and a constant demographic size of N = 250
Table 3 Simulated populations with increasing selfing rates and demographic sizes adjusted to keep the effective size (Ne) constant and equal to 250

As shown in Table 2, we found that the multilocus diversity (nMLG and H) also decreases with selfing, while the homozygosity (FIS) and the associations between loci (LD%, g2) increase. For completely selfing populations (σ = 1), g2 is biased downwards due to extremely high homozygosity limiting the number of loci available for the estimation (as g2 measures the correlation of heterozygosity between loci). In completely outcrossing or low selfing populations (up to 50% selfing in our simulations), there are on average as many MLGs as individuals sampled. In contrast, nMLG decreases to around two thirds of the sample in our simulations with 95% selfing and to less than one-third in completely selfing populations for N = 250. This loss of MLGs is even more dramatic when the population size is lower (e.g., N = 50, Table S2). In addition, contrary to single locus indices, multilocus diversity indices keep decreasing with increasing selfing rate even for a given Ne value (Table 3), in conjunction with the increase in linkage disequilibrium. Our analysis of the effect of sample size shows that, for a given selfing rate, both HE and the frequency of the most frequent MLG (MFMLG) are biased and less precise for small sample sizes. The statistics approach the expected value with a smaller sample size for single locus compared to multilocus diversity (Nsamp = 20 for HE and Nsamp > 30 for MFMLG, Fig. S7).

Structure of multilocus genetic diversity in more complex scenarios

The frequency of the MFMLG summarizes the increase of repeated multilocus genotypic combinations, because it increases with the selfing rate, especially for σ ≥ 0.95 (Fig. 1a, Table S2) and is highly correlated with Shannon’s index H (\(P < 2.2 \times 10^{-16}\), \(r^2 = 0.91\), Fig. S2). In the following, we will use MFMLG as an indicator of multilocus diversity variations. Our simulations with lower population size, bottlenecks or metapopulation dynamics highlight that extreme patterns of MLG repetition (MFMLG > 30%) are observed only for very low demographic population sizes (N = 50) combined with high selfing rates (σ ≥ 0.95), or with strong bottlenecks (reduction to fewer than five individuals), associated with predominant selfing (Fig. 1b, Table S2). These extreme patterns of MLG repetition are nevertheless highly variable among simulation replicates. Figure 1b also illustrates that this increase of MFMLG with low population size or bottlenecks is associated with a decrease in single locus diversity (HE). In addition, Fig. 1b shows that changes in single locus diversity due to migration are greater than changes in the frequency of the MFMLG (migration and admixture scenarios on Fig. 1b). Similar patterns can be observed when comparing Shannon’s index (H) and HE (Fig. S3).

Fig. 1

Covariation between the frequency of the most frequent MLG (MFMLG) and the gene diversity (HE) across simulated scenarios. a Scenarios of isolated populations with varying demographic sizes N and selfing rates σ; b scenarios of migration (orange), admixture (black), bottleneck (blue), and extinction–recolonization (light blue) with σ = 0.95; c natural populations of Medicago truncatula for each sampling date. For a and b, points indicate means and horizontal and vertical bars stand for the standard deviation across the 1000 replicates

The mean distance between two individuals within a population (Dmean) is highly correlated to HE (\(P < 2.2 \times 10^{-16}\), \(r^2 = 0.98\), Fig. S4). We used the maximum distance found between two individuals in a population (Dmax) to describe the genetic divergence accumulated between MLGs. Dmax increases with the genetic diversity (with increasing N) and with the selfing rate (for populations with the same effective size, Ne = 250 or Ne = 100 in Table S2). For predominantly selfing populations (σ ≥ 0.95), Dmax is strongly correlated with LD% (\(P < 2.2 \times 10^{-16}\), Fig. S5). As a consequence, Dmax is the highest in scenarios involving migration, either at a constant rate or after drastic admixture (Fig. 2b). Another interesting result is that only very low demographic population sizes or very strong bottlenecks in isolated populations result in low Dmax (<0.5).

Fig. 2

Covariation between the maximum distance between pairs of individuals in a population (Dmax) and HE across simulated scenarios. a Scenarios of isolated populations with increasing demographic sizes N and selfing rates σ; b scenarios of migration (orange), admixture (black), bottleneck (blue) and extinction–recolonization (light blue) with σ = 0.95; c natural populations of Medicago truncatula for each sampling date. For a and b, points indicate means and horizontal and vertical bars stand for the standard deviation across the 1000 replicates

Temporal changes

We measured the change in allele frequencies through time (samples separated by 20 generations) with the relative genetic differentiation between temporal samples using FST. We observed that FST estimates and their variance increase with selfing (Table S2). Whereas migration equilibrium (at rates ≤ 0.002) does not affect the temporal differentiation much, occasional large migration events (admixture) raise the differentiation through time. Genetic differentiation is particularly strong in extreme demographic scenarios such as extinction–recolonization because of population replacement.

We used the temporal FST to estimate the effective population size according to Eq. (4), ignoring the fact that some of our simulated scenarios do not meet the assumptions of the underlying theoretical model (isolated populations). Figure 3a shows that, as expected in isolated populations of constant size (see Eq. (2)), \(\widehat {N_e}\) estimates increase with HE. Despite a large variance between replicates, average \(\widehat {N_e}\) estimates in isolated populations are close to the theoretical expectations (see Table S2), except for large population sizes or complete selfing (σ = 1). In these cases, the variance of \(\widehat {N_e}\) is extreme, due to sampling variance and linkage disequilibrium, respectively. Under complex demographic scenarios involving strong migration events or extinction–recolonization, \(\widehat {N_e}\) estimates are remarkably low compared to the simulated demographic population sizes. Those estimates disagree with the levels of genetic diversity (HE) observed in these populations given the expectations from Eq. (2). This could be caused by departures from the model’s assumptions (see discussion). Indeed, the effects of migration are visible through the increase in the number of private alleles at t20 (pA, Table S2).

Fig. 3

Temporal estimation of effective population size (\(\widehat {N_e}\)) compared to gene diversity (HE). The red curve corresponds to the expected relationship between HE and Ne assuming a constant isolated population (Eq. (2)) and a mutation rate of 0.001. a Scenarios of isolated populations with varying selfing rates σ and demographic sizes N; b scenarios of migration, admixture, bottleneck and extinction–recolonization with σ = 0.95; c temporal \(\widehat {N_e}\) estimates in natural populations of Medicago truncatula, with vertical bars representing the 95% confidence interval. For a and b, horizontal and vertical bars stand for the standard deviation across the 1000 replicates

Figure 4 shows patterns of MLG persistence over time (after 20 generations in our simulations) in the different scenarios we investigated. Under complete outcrossing, the conservation of a MLG after 20 generations is extremely rare, as expected due to recombination (Fig. 4a). In contrast, under predominant selfing (Fig. 4b), MLGs are frequently shared between temporal samples. Moreover, MLGs that reach a high frequency within the first generation (measured by the abscissa for t0) tend to remain at high frequency at t20. This pattern is amplified when the population size is low (Fig. 4c). Strong bottlenecks (Fig. 4d) raise the frequency of some MLGs independently of their frequency in the first generation. Scenarios with migration and admixture slightly reduce the occurrence of conserved high-frequency MLGs (Fig. 4e, f), while scenarios including extinction–recolonization produce spectra with fewer MLGs conserved over time (Fig. 4g, h for extinction–recolonization with a bottleneck).

Fig. 4

Joint MLG frequency spectra for each demographic scenario. The horizontal axis represents the frequency at which a MLG is found in the first sample (t0), the vertical axis represents the frequency of this same MLG in the second sample (t20). The color gradient represents the log10 of the frequency at which each case is observed in 1000 simulation replicates. a Isolated outcrossing population (σ = 0) of 250 individuals. The inset is a zoom of frequencies between 0 and 0.15; b isolated predominantly selfing population (σ = 0.95; N = 250); c isolated predominantly selfing population (σ = 0.95; N = 50); d isolated predominantly selfing population (σ = 0.95; N = 250) undergoing a bottleneck of one individual at t10; e predominantly selfing population in an island model (σ = 0.95; m = 0.002; N = 250); f predominantly selfing population in an island model (σ = 0.95; m = 0.002; N = 250) undergoing 50% admixture with constant population size at t10; g predominantly selfing population in an island model (σ = 0.95; m = 0.002; N = 250) undergoing extinction–recolonization with constant population size at t10; h predominantly selfing population in an island model (σ = 0.95; m = 0.002; N = 250) undergoing extinction–recolonization by one individual at t10
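As an illustration of how the coordinates of such a joint spectrum can be obtained from two temporal samples, here is a minimal sketch. It assumes that each individual is encoded as a hashable tuple of per-locus genotypes; the function name and the data layout are hypothetical and are not taken from the scripts archived with the study.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def joint_mlg_frequencies(
    sample_t0: Sequence[Hashable], sample_t1: Sequence[Hashable]
) -> Dict[Hashable, Tuple[float, float]]:
    """Map each MLG to its (frequency at t0, frequency at t20) pair.

    MLGs seen in only one of the two samples get a frequency of 0 in the other.
    """
    if not sample_t0 or not sample_t1:
        raise ValueError("both temporal samples must be non-empty")
    counts_t0, counts_t1 = Counter(sample_t0), Counter(sample_t1)
    n0, n1 = len(sample_t0), len(sample_t1)
    all_mlgs = counts_t0.keys() | counts_t1.keys()
    return {m: (counts_t0[m] / n0, counts_t1[m] / n1) for m in all_mlgs}


# Hypothetical two-locus homozygous genotypes, coded by allele size.
t0 = [(120, 98), (120, 98), (124, 98), (118, 102)]
t20 = [(120, 98), (120, 98), (120, 98), (122, 100)]
for mlg, (f0, f20) in sorted(joint_mlg_frequencies(t0, t20).items()):
    print(mlg, f0, f20)
```

Accumulating these (t0, t20) pairs over simulation replicates and binning them on a grid would give a two-dimensional histogram of the kind shown in the figure.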

Empirical data

In the nine natural populations of M. truncatula studied, FIS values are high, ranging between 0.88 and 1. This translates into very high selfing rate estimates for all populations (σFIS > 0.9, Table 4). Selfing rate estimates with RMES are sometimes lower but remain well above 0.8 (Table 4). The number of MLGs is generally low and compatible with high selfing rates.
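The step from FIS to a selfing rate can be illustrated with the standard equilibrium relationship FIS = σ/(2 − σ), which gives σ = 2FIS/(1 + FIS). The sketch below assumes that this is the relationship behind σFIS and that inbreeding equilibrium holds; the function name is hypothetical.

```python
def selfing_rate_from_fis(fis: float) -> float:
    """Selfing rate implied by FIS at inbreeding equilibrium: 2*FIS / (1 + FIS).

    Negative FIS values (heterozygote excess) are mapped to a selfing rate of 0.
    """
    if fis > 1:
        raise ValueError("FIS cannot exceed 1")
    return max(0.0, 2 * fis / (1 + fis))


# The range of FIS values reported above for the nine populations.
for fis in (0.88, 1.0):
    print(fis, selfing_rate_from_fis(fis))
```

Under this relationship, the lowest FIS reported here (0.88) already implies a selfing rate above 0.9.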

Table 4 Genetic diversity in M. truncatula populations

Single and multilocus diversity

The mean gene diversity within population and year (HE) is remarkably high (higher than 0.3, Table 4). The maximum genetic distance between two individuals, Dmax, is also always high (higher than 0.5, Fig. 2c). Most populations therefore seem to be distributed within a parameter space more limited than the one explored by our simulations (high HE associated with high Dmax). Only sample FR1_1991 presents both low single and multilocus diversity (Fig. 1c, Fig. 2c). MFMLG values are highly variable, with extreme patterns of MLG repetition (MFMLG higher than 30%) in nearly half of the populations studied. Accordingly, these populations also display the lowest values of Shannon’s H (Table 4). However, such a combination of high single-locus diversity and extremely low multilocus diversity was not observed in any of our simulated scenarios (Fig. 1c).
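For readers who want to compute comparable multilocus indices on their own data, the following is a minimal sketch. The definitions used in the study are given in its methods and are not repeated in this section, so several choices here are assumptions: individuals are tuples of per-locus genotypes, Shannon's H uses the natural logarithm, and the distance behind Dmax is taken to be the proportion of loci at which two MLGs differ. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence, Tuple

Genotype = Tuple[Hashable, ...]  # one entry per locus


def multilocus_summary(sample: Sequence[Genotype]) -> Dict[str, float]:
    """Number of MLGs, frequency of the most frequent MLG, Shannon's H, Dmax."""
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    n = len(sample)
    freqs = [c / n for c in counts.values()]
    # Assumed distance: proportion of loci at which two distinct MLGs differ.
    dmax = max(
        (sum(a != b for a, b in zip(g1, g2)) / len(g1)
         for g1, g2 in combinations(counts, 2)),
        default=0.0,  # a single MLG in the sample gives Dmax = 0
    )
    return {
        "n_mlg": len(counts),
        "mfmlg": max(freqs),
        "shannon_h": -sum(f * math.log(f) for f in freqs),
        "dmax": dmax,
    }


# Hypothetical sample: three copies of one MLG and one divergent MLG.
sample = [("a", "b")] * 3 + [("c", "d")]
print(multilocus_summary(sample))
```

A sample dominated by one MLG gives a high MFMLG and a low H, while a single divergent lineage is enough to push Dmax to its maximum.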

Temporal dynamics of diversity

The MLG frequency spectra highlight two different types of dynamics of MLGs through time. In populations SP1, SP2, and CO1, a single low-frequency MLG or none at all remains over time (Fig. S6). In our simulations, such dynamics of multilocus diversity are obtained with extinction–recolonization events only (Fig. 4g, h). In the other populations, several MLGs are conserved through time (SP3, CO3, FR3, CO2, FR2, and FR1, Fig. S6). Among these populations, FR3 and CO3 are the most diverse (HE > 0.5) and present extreme patterns of multilocus diversity with MFMLG lower than 0.2 and Dmax higher than 0.8 (Figs. 1c and 2c). Such patterns were also observed in our simulations with strong migration and admixture (Figs. 1b and 2b).

Temporal differentiation between sampling years, as estimated by temporal FST, is high for most of the populations studied (Table 4). The effective population sizes estimated using the temporal FST method are variable but consistently low: all estimates except one are lower than 100 (with a maximum of 150 for population CO2). These estimates are too low compared with the observed single locus diversity given the expectations from Eq. (2) (Fig. 3c). In our simulations, we observed a similar mismatch between HE and Ne values for migration and admixture scenarios. This resemblance with migration and admixture scenarios is also visible in the high number of private alleles we observe in recent compared to older temporal samples in our populations (Table 4, Table S2).
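The size of this mismatch can be gauged with a back-of-the-envelope calculation. Eq. (2) is not reproduced in this section, so the sketch below does not implement it; it assumes the textbook infinite-allele expectation HE = 4Neμ/(1 + 4Neμ) and the classical reduction of effective size under partial selfing, Ne = N(2 − σ)/2 (Pollak 1987). The mutation rate of 0.001 is the one quoted in the legend of Fig. 3; function names are hypothetical.

```python
def ne_under_selfing(n: float, selfing_rate: float) -> float:
    """Effective size of a partially selfing population: N * (2 - sigma) / 2."""
    return n * (2 - selfing_rate) / 2


def expected_he_iam(ne: float, mu: float) -> float:
    """Equilibrium gene diversity under the infinite-allele model (assumed)."""
    theta = 4 * ne * mu
    return theta / (1 + theta)


def ne_from_he_iam(he: float, mu: float) -> float:
    """Effective size needed to maintain a given HE under the same model."""
    if not 0 <= he < 1:
        raise ValueError("HE must lie in [0, 1)")
    return he / (4 * mu * (1 - he))


MU = 0.001  # mutation rate quoted in the legend of Fig. 3
print(expected_he_iam(100, MU))   # diversity expected for Ne = 100
print(ne_from_he_iam(0.5, MU))    # Ne needed to reach HE = 0.5
print(expected_he_iam(ne_under_selfing(250, 0.95), MU))
```

Under these assumptions, an effective size of 100 would maintain a gene diversity below 0.3, whereas a diversity of 0.5 would require an effective size of roughly 250, so temporal estimates below 100 are difficult to reconcile with the diversity observed in an isolated population at equilibrium.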

Discussion

Our work aimed at describing the consequences of high selfing rates and metapopulation dynamics on the structure of genetic diversity in populations, and how it can change over time. We argue that the classical single locus diversity indices are not sufficient to fully understand the demographic history of predominantly selfing populations, whose study should benefit from multilocus indices. Because of the lack of analytical expectations for such indices, we proposed a simulation approach to address the question in a theoretical framework.

Neutral scenarios can explain the multilocus population genetic structure of predominantly selfing species

Our simulations of isolated and predominantly selfing populations (with selfing rates above 0.95) show a population genetic structure organized in repeated multilocus genotypes. As for single locus diversity, multilocus diversity (measured as the number of MLGs or the haplotypic diversity H) decreases with increasing selfing rates. In addition, the nonindependence between loci increases with selfing (Nordborg 2000), as seen through the elevated linkage and identity disequilibrium, and the multilocus diversity is further reduced. The maximum distance between two individuals, Dmax, increases with the selfing rate, highlighting the fact that self-fertilizing populations are composed of differentiated lineages. The increase in Dmax is caused by the reduced effective recombination in selfing populations, which constrains new mutations within only one genetic background. For predominantly selfing populations (σ ≥ 0.95), Dmax is also highly correlated to LD%, because they both increase with within-population structure. The empirical data obtained on natural populations of M. truncatula strongly support our simulation results: our estimates of selfing rates confirm M. truncatula as a predominantly selfing species, and we find repeated MLGs and large Dmax in every population. In Arabidopsis thaliana, studies in natural populations also showed that they are composed either of identical or highly differentiated individuals (Bakker et al. 2006; Montesinos et al. 2009). Overall, these results highlight the importance of multilocus analyses for the study of natural selfing populations. Yet, such analyses are often overlooked in the literature (e.g., Trouvé et al. 2005; Gow et al. 2007) or are limited to reporting the number of distinct MLGs (e.g., Bomblies et al. 2010; Gomaa et al. 2011). Furthermore, our results stress that accurate estimates of MLG frequencies are essential, especially in the presence of rare MLGs, and require larger samples than for estimates of single locus diversity (above 30 individuals, Fig. S7).

The first aim of the present study was to verify whether selectively neutral scenarios with high selfing rates could lead to repeated MLGs, sometimes at high frequency and maintained through time. Indeed, Avise and Tatarenkov (2012) argued that this peculiar genetic structure in selfing populations provided evidence for the occurrence of selective processes promoting locally adapted MLGs. Our simulation results show that strong genetic drift induced by small population size or bottlenecks may be sufficient to explain the multilocus genetic structure observed in selfing populations, without any selection. It is, however, important to stress that our results are not sufficient to rule out selective processes in a population but only present alternative hypotheses to explain the observed structure of genetic diversity. Testing the hypothesis of Avise and Tatarenkov (2012) would require reciprocal transplants, or at least measuring the fitness of the MLGs in order to see if the locally most frequent MLG has indeed the highest fitness in the local environment.

Combining single and multilocus indices of diversity is insightful when studying the demographic history of predominantly selfing populations

We analyzed the multilocus structure of the simulated populations and focused on indices describing MLG frequency (MFMLG) or MLG genetic similarity (Dmax). Those indices are especially informative when analyzed jointly with single locus diversity (HE) and can help disentangle the effect of selfing and demographic events (such as bottlenecks or migration) on genetic diversity in selfing populations. Indeed, high MFMLG combined with low HE is characteristic of small populations (constantly small or due to a bottleneck) with a high selfing rate. On the other hand, low levels of multilocus diversity combined with high single locus diversity were observed with strong migration or admixture scenarios. This highlights the fact that migration restores single locus diversity faster than multilocus diversity in predominantly selfing populations (Fig. 1b). This was also visible when analyzing HE and Dmax jointly: in our migration scenarios, new alleles combined within migrant MLGs were introduced in the population, resulting in both high LD% and Dmax values.

Even though our simulations explored only a restricted number of scenarios (in terms of population size, sample size, selfing rate, time span between sampling, etc.), they were able to replicate patterns observed in empirical data. Except for one population (FR1), the levels of genetic diversity (HE) in most populations of M. truncatula were surprisingly high compared with theory and other studies of predominantly selfing populations (e.g., Gomaa et al. 2011; Lundemo et al. 2009; Stenøien et al. 2005). In addition, high single locus diversity (HE) was combined with repeated MLGs (high MFMLG), which may be consistent with populations belonging to a metapopulation with strong migration or even admixture events. Genetic diversity measured by both MFMLG and HE (or Dmax and HE) as well as the MLG frequency spectrum suggest likely extinction–recolonization events in three populations of M. truncatula (SP1, SP2, and CO1). Nevertheless, a fine analysis of the spatial genetic structure should be performed to ensure that the drastic changes in genetic structure observed in these populations are not due to a shift in the location of the sampling transect. However, in a set of populations, MLG repetition (MFMLG) and single-locus genetic diversity (HE) were both very high (higher than 0.4). This combination of a low number of MLGs and high single-locus diversity was never observed in our simulations. Other studies also reported a similar pattern (e.g., Barrière and Félix 2007), which is probably associated with scenarios or combinations of parameter values that were not considered in our set of simulations. This highlights the need for a more systematic exploration of the parameter space if one intends to perform statistical inferences on empirical data.

Insights from the temporal analysis of predominantly selfing populations

The temporal dimension of our analyses is one of the main particularities of this study, and is rarely examined in natural populations with high selfing rates (we found fewer than a dozen studies, some being reported in Table 1). Yet temporal data are useful because they allow estimating the effective population size and thus give insight into the strength of genetic drift (Waples 1989). However, the decreased effective recombination in selfing populations reduces the number of independent loci, and after several generations of predominant selfing the whole genome tends to behave as a single “superlocus”. FST estimates based on few or a single locus suffer from a large sampling variance (Weir and Hill 2002) and this is visible in our simulations in which the variability of FST estimates increased with the selfing rate. Indeed, measuring FST from linked loci is equivalent to measuring it from a lower number of loci. Moreover, if metapopulation dynamics are frequent in selfing populations, they may cause departures from the assumptions of the theoretical model underlying the estimation of Ne from temporal FST (i.e., isolated population of constant size, Waples 1989). Thus, temporal estimates of Ne in highly selfing populations should be treated with caution.

Interestingly, our simulations showed that examining the trajectory of MLG frequencies through time gives insights into the demographic history of a predominantly selfing population. In particular, the MLG frequency spectrum (MLGFS) describes the upper and lower bounds for the trajectory of MLG frequencies between two generations for a given demographic scenario. For example, a strong increase in MFMLG between two generations suggests a bottleneck. Although MLGFS are more difficult to interpret on empirical data because of the absence of replicates (e.g., Fig. S6), we show that they can provide support for a hypothesis of extinction–recolonization, along with large values of temporal differentiation.

Finally, our study also highlights high temporal stochasticity in M. truncatula natural populations. Indeed, diversity often changed over time, and the temporal samples of a given population were not clustered together in joint analyses of diversity indices (Figs. 1c and 2c). This temporal variability was also described in Bulinus forskalii by Gow et al. (2007), who attributed it to “highly dynamic demographic systems, including bottleneck and extinction–recolonization events”. Larger variance in diversity levels among selfing compared to outcrossing populations has also been reported before (Schoen and Brown 1991).

Conclusion

The comparison of our simulation results with data obtained in a highly selfing species (Medicago truncatula) highlighted the pertinence of our simulation approach. Yet, the number of scenarios and parameter values we explored was limited and this could limit the generalization of our results to other datasets (e.g., other molecular markers, other sampling intervals). We expect, however, that the general patterns highlighted here using microsatellites will remain unchanged with other molecular markers such as a large number of genome-wide SNPs. In addition, the scripts are available and can easily be amended to fine-tune the comparison with other empirical datasets (in terms of population size, sample size, selfing rate, time span between sampling, etc.).

The demographic scenarios examined here were sufficient to show that selection is not required to explain the prevalence of repeated MLGs in predominantly selfing populations and their persistence through time. Whereas background selection or selective sweeps are expected to reduce the effective size and thereby reduce single and multilocus diversity concomitantly (Glémin 2007; Kamran‐Disfani and Agrawal 2014), the effects of complex selection scenarios such as local adaptation are more difficult to predict. Simulating scenarios involving selection was beyond the scope of this study but would be useful to address the question of the threshold selfing rate beyond which selection will act at the MLG (or haplotype) level rather than at the locus level, due to the severely reduced effective recombination (Neher and Shraiman 2009). Another perspective for this work will be to develop an integrated method to infer parameters such as the effective size and the selfing rate, based on the summary statistics described here (using a likelihood-free inference method, e.g., Beaumont et al. 2002; Rousset et al. 2016).

Data archiving

Genotype data of Medicago truncatula natural populations are available on the INRA dataportal. https://doi.org/10.15454/VCZIMR

The scripts used for the simulations and the computation of diversity indicators are available on the INRA dataportal. https://doi.org/10.15454/VYPXIJ