Introduction

Importance of Assessing Diversity

Potato is a clonal crop, but has about 100 wild species relatives. Maintaining these wild relatives in the genebank as botanical seed populations has several practical advantages over in vitro or tuber clonal propagation. It is also important to accurately characterize the diversity among individuals within these populations, since this informs decisions for adopting the most effective approaches to sampling when collecting, maintaining and evaluating germplasm.

Emphasis on Diversity within Species

Great effort has been expended by numerous taxonomists over many decades in an effort to partition potato diversity into species. This includes studies of many kinds based on physical traits as well as DNA markers. The most recent taxonomy reduced the number of species by about half (Spooner et al. 2014). This sacrificed some level of information for the goal of having a scheme in which potato species names are more stable, and all populations in the genebank having a definite species name, rather than being uncertain or hybrids (Bamberg et al. 2018). Thus, species names should already establish the first hierarchy of diversity, and the genebank manager should be able to confine concerns about diversity sampling to variation among populations within species and individuals within populations within species. The question of how to efficiently sample populations has often been addressed by selecting “core” collections (reviewed in Bamberg and del Rio 2014 and Bamberg et al. 2016). One can also assess populations within species in the genebank by how their total DNA marker diversity has accumulated over real time as populations were added, or how many markers we would expect to miss in a single average population (Bamberg and del Rio 2016).

Context for Measuring the Partitioning of Diversity within Populations—Index of Variation

Hosaka and Hanneman (1991) proposed an index of variation (IV) as the percent of variable markers of all those detected. But fixed markers may be irrelevant. They might be common to all potatoes or even all plants. We proposed IV2 (Bamberg et al. 2000) in which variation within a population is measured in the context of only markers known to be variable within the species. Thus, if all populations are identical and have within them the total variation found within the species, they would have an IV2 of 100%. At the other extreme, if populations are completely homogeneous (e.g., a strictly inbreeding species with a single genotype per population), individuals within a population do not vary at all for traits or markers that vary within the species, and so every population has an IV2 of 0%.
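The difference between the two indices can be illustrated with a minimal sketch. The function name index_of_variation, the 0/1 marker coding and the toy data are assumptions made for illustration only; they are not taken from the cited studies.

```python
def index_of_variation(population, species_polymorphic=None):
    """Percent of loci that vary within one population.

    population: list of individuals, each a list of 0/1 marker scores.
    species_polymorphic: optional set of locus indices known to vary
        within the species. If given, only those loci form the
        denominator (IV2); otherwise all scored loci do (IV).
    """
    n_loci = len(population[0])
    if species_polymorphic is None:
        loci = list(range(n_loci))
    else:
        loci = sorted(species_polymorphic)
    # A locus is variable if the individuals do not all share one score.
    variable = sum(1 for j in loci if len({ind[j] for ind in population}) > 1)
    return 100.0 * variable / len(loci)


# Hypothetical toy data: 3 plants scored at 6 loci.
# Loci 0-2 are assumed fixed across the whole species; loci 3-5 vary within it.
plants = [
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1],
]
iv = index_of_variation(plants)              # all 6 loci in the denominator
iv2 = index_of_variation(plants, {3, 4, 5})  # only species-variable loci
```

In this toy case two loci vary within the population, so IV is 2 of 6 loci while IV2 is 2 of 3 loci: the same population looks twice as variable once markers fixed in the whole species are excluded from the denominator.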

Review of Heterogeneity – DNA Markers

Hardigan et al. (2015) reported that wild diploid species have very low heterozygosity, but that was on the basis of tuberosum cultivar-derived SNP loci for which there is ascertainment bias. For example, the primitive outcrossing diploid S. jamesii had the fewest apparently heterozygous loci, at <1%. Similarly, Aversano et al. (2015) sequenced an individual of the diploid outcrossing species S. commersonii, claiming the observed average 1.5% SNP heterozygosity was very low compared to cultivars, but also not accounting for the ascertainment bias of comparing random SNPs to selected tuberosum-derived SNPs of tetraploid cultivated potato. In contrast, RAPD data from our early work (Bamberg and del Rio 2004, Bamberg et al. 2009), and a follow-up study with new data using the same tuberosum-derived SNP markers (Bamberg et al. 2015b) showed S. jamesii populations had essentially the maximum detectable heterozygosity when adjusted for ascertainment bias. They had IV2 of 50-70%, at least as much as the 56% (Hirsch et al. 2013) observed in the average tuberosum cultivar (where 100% of these markers were known to be variable by design). Similarly, Haynes et al. (2017) used SSRs (presumably without ascertainment bias) to show that 10 natural diploid S. chacoense populations had at least as much heterozygosity as tetraploid cultivars (after adjustment was made for the ploidy difference). Bisognin and Douches (2002) used SSRs and isozymes to detect high levels of heterogeneity within populations.

Review of SNP Frequency Reports in Potato

The review of Sood et al. (2017, p. 65) concludes potato is extremely heterozygous, citing various estimates of 1% to 5% SNPs per base pair. For example, Hardigan et al. (2017) did a GBS study of numerous wild species, concluding that SNP diversity of potato was over 1%, much higher than reported in other crops, particularly inbreds. Leisner et al. (2018) sequenced an inbred clone of wild diploid (normally outcrossing) species S. chacoense, similarly showing that some chromosomal regions that had ostensibly resisted becoming homozygous (presumably still being in their natural non-inbred form) had about 1.5% SNP per base pair. The homogenized regions’ SNP frequency was only one-tenth as much, roughly as expected by chance following several generations of selfing, and approaching known inbreds like tomato and soybean at about one SNP per 1000 bp (see Simko et al. 2006). On the other hand, Li et al. (2018) also did a GBS study of numerous potato species concluding that potato actually has about 20-fold less SNP frequency than these other reports, attributing the discrepancy to improper SNP filtering (Huang et al. 2018, Hardigan et al. 2018).

Review of Heterogeneity – Traits

There is evidence of high phenotype variation within populations of wild diploid potato species. This is the variability relevant to potato breeding, so it would indeed be curiously unhelpful if a reasonable analysis of DNA markers concluded that individuals within diploid potato germplasm populations are very homozygous and uniform.

The following are examples in diploid wild species with 5% or less within-population SNP heterozygosity according to Hardigan et al. (2015) for species not known to be selfers.

S. berthaultii... highly significant variation for multiple Colorado Potato Beetle resistance parameters (Bamberg et al. 1996).

S. boliviense… variation for folate (Robinson et al. 2019).

S. chacoense... highly significant variation for multiple CPB resistance parameters (Bamberg et al. 1996), leptine glycoalkaloid (Sinden et al. 1986), in vitro rooting characteristics (Christensen et al. 2017), seed proteins (Hosaka and Hanneman 1991).

S. microdontum… highly variable tuber calcium level (Bamberg et al. 1998), large late blight differences (Douches et al. 2001), tuber greening under illumination (Bamberg et al. 2015a), presence of Crazy Sepal mutant (Bamberg 2006).

S. bulbocastanum… resistance to nematodes (Brown et al. 1989), late blight (Lokossou et al. 2010), Zebra chip (Cooper and Bamberg 2014), seed proteins (Hosaka and Hanneman 1991).

S. bulbocastanum, cardiophyllum, pinnatisectum, circaeifolium... resistance to late blight, black leg, CPB (Chen et al. 2003), seed proteins (Hosaka and Hanneman 1991).

S. jamesii… tuber antioxidants (Hale et al. 2008).

S. commersonii… resistance to early blight (Jansky et al. 2008).

Unpublished late blight evaluations sponsored by the genebank conducted at multiple locations also identified wide ranges of resistance within single populations of S. microdontum, S. berthaultii and S. okadae.

Materials and Methods

Empirical Assessment of Heterozygosity within Populations

We sought to re-assess heterozygosity within populations in the context of the whole diversity within model species. Thus, we calculated the IV2 value, the proportion of loci polymorphic within populations of all loci polymorphic in the species (see Table 1).

Table 1 Heterogeneity estimates in diploid wild potato species

S. jamesii heterogeneity was assessed with RAPD markers on individuals within 7 genebank populations, and AFLP markers on 28 individuals of PI 275169 from the genebank, 34 individuals of a large natural population in Ida Canyon near Sierra Vista in southern Arizona, and 78 individuals from an extremely large natural population in Navajo Canyon of Mesa Verde National Park near Cortez in southwestern Colorado. The species context was usually comprehensive, being derived from an AFLP marker set on 128 genebank populations sampled as bulks of at least 25 individuals as described in Bamberg et al. (2016).

S. microdontum is a species with an array of remarkable qualities for breeding (Bamberg and del Rio 2014) and genome analysis suggests it has had a surprisingly large introgression to cultivars (Hardigan et al. 2017). This species’ heterozygosity was assessed by AFLP on 23 plants from genebank population PI 473170 in the context of the core set of 50 populations identified previously by Bamberg and del Rio (2014).

Factors that Bias Estimation of within-Population Diversity

Various factors can lead to underestimation of diversity. These include bias due to ascertainment (Bamberg et al. 2015b), comparing populations with different allele frequencies (Bamberg and del Rio 2004), and different ploidy (Meirmans et al. 2018). We examined each of these by calculating and modeling the bias when artificial differences appear between two samples that are actually the same except for the bias factor.

Ascertainment Bias

For ascertainment bias simulation, we isolated the variable of proportion of heterozygous loci. We used the Randbetween function of Excel® to compose 10 genotypes of 5,000 loci as though they were drawn from populations with heterozygosity at 2%, 5% and 12%. DARwin software (Perrier and Jacquemoud-Collet 2006) was used to visualize the result as radial plots based on genetic similarities calculated with simple matching coefficient and UPGMA clustering. This analysis included a single random 50% heterozygosity individual representing populations with maximum detectable heterozygosity serving as the root.
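The numerical core of this simulation can be approximated outside of a spreadsheet. The following is only a sketch under stated assumptions: each locus is coded 1 (heterozygous) or 0, the function names are hypothetical, and the DARwin clustering step is not reproduced. Only the simple matching similarities that would feed such a tree are computed.

```python
import random


def random_genotype(n_loci, het_rate, rng):
    """Score each locus 1 (heterozygous) with probability het_rate, else 0."""
    return [1 if rng.random() < het_rate else 0 for _ in range(n_loci)]


def simple_matching(g1, g2):
    """Fraction of loci at which two genotypes have the same score."""
    return sum(a == b for a, b in zip(g1, g2)) / len(g1)


def mean_similarity(group_a, group_b):
    """Mean simple matching over all pairs of distinct individuals."""
    pairs = [(x, y) for x in group_a for y in group_b if x is not y]
    return sum(simple_matching(x, y) for x, y in pairs) / len(pairs)


rng = random.Random(1)  # fixed seed, only so the sketch is repeatable
N_LOCI, N_IND = 5000, 10

# Ten genotypes at each restricted level of observable heterozygosity.
groups = {h: [random_genotype(N_LOCI, h, rng) for _ in range(N_IND)]
          for h in (0.02, 0.05, 0.12)}
# One individual at 50% heterozygosity, playing the role of the root.
root = [random_genotype(N_LOCI, 0.50, rng)]

within = {h: mean_similarity(g, g) for h, g in groups.items()}
to_root = {h: mean_similarity(g, root) for h, g in groups.items()}
```

Under these assumptions the mean similarity within a group rises as the heterozygosity level falls, while every group stays near 50% similar to the root individual.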

Allele Frequencies Bias

With all other factors equal, multiple samples of single individuals from one population will appear to be more similar than single individuals drawn from different populations. We used approximate parameters of AFLP results reported for diploid S. brevicaule in Bryan et al. (2017). Simulation data were created using the Index and Randbetween functions of Excel. For example, a cell with the formula =INDEX(A1:A8, RANDBETWEEN(1,8)) returns the value in a randomly chosen cell from the array A1:A8. To isolate the effect of different allele frequencies, 20 individual genotypes were drawn at random for which each of 200 polymorphic loci had the same random allele frequency (that is, as though they were sibs from the same population). Then 20 individuals were assigned genotypes for 200 polymorphic loci where each locus could have a different allele frequency (that is, as though they were individuals from different populations). We repeated this using 2,000 polymorphic loci. These simulations were based on random allele frequencies from 1% to 99%, but empirical evidence (Bryan et al. 2017, Bamberg and del Rio 2004) suggests that real allele frequencies tend to be at the extremes, so we added a simulation of random genotypes for which 200 polymorphic loci could have only extreme allele frequencies x (x < 20% or x > 80%). In these simulations, the proportion and identity of polymorphic loci were identical for all populations. This means that all simulated populations are identical in terms of the loci possessing the allele, with the only difference being the frequency of the allele at those loci. Thus, if multiple individuals from the same population had been bulked, the bulk would appear identical to a bulk of individuals from the different populations (since all the same alleles would be included in both bulks).
We also calculated the apparent genetic similarity (GS) of two single plants drawn from the same population (polymorphic loci have the same random allele frequency for both plants) versus those drawn from different populations (polymorphic loci have different random allele frequencies for each plant) by averaging GS for 50,000 random loci.
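A comparable simulation can be sketched in a few lines of code. This is an illustrative reimplementation, not the spreadsheet used here: it assumes each locus is scored 1 with probability equal to its frequency, draws frequencies uniformly between 1% and 99%, and uses hypothetical function names.

```python
import random


def draw_individual(freqs, rng):
    """One 0/1 marker genotype: locus j is scored 1 with probability freqs[j]."""
    return [1 if rng.random() < f else 0 for f in freqs]


def mean_pairwise_similarity(individuals):
    """Mean simple matching coefficient over all pairs of individuals."""
    total, pairs = 0.0, 0
    for i in range(len(individuals)):
        for j in range(i + 1, len(individuals)):
            a, b = individuals[i], individuals[j]
            total += sum(x == y for x, y in zip(a, b)) / len(a)
            pairs += 1
    return total / pairs


rng = random.Random(7)  # fixed seed, only so the sketch is repeatable
N_LOCI, N_IND = 200, 20

# Same population: one frequency per locus, shared by all 20 individuals.
shared = [rng.uniform(0.01, 0.99) for _ in range(N_LOCI)]
same_pop = [draw_individual(shared, rng) for _ in range(N_IND)]

# Different populations: every individual gets its own frequency at each locus.
diff_pop = [
    draw_individual([rng.uniform(0.01, 0.99) for _ in range(N_LOCI)], rng)
    for _ in range(N_IND)
]

gs_same = mean_pairwise_similarity(same_pop)
gs_diff = mean_pairwise_similarity(diff_pop)
```

Both groups are polymorphic at the same 200 loci, so any gap between gs_same and gs_diff in this sketch comes only from whether the individuals share per-locus frequencies.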

Bias Due to Ploidy

Assessing ploidy bias does not require simulation. One need only calculate the average percentage of heterozygous loci in individuals for all possible allele frequencies for both diploids and tetraploids, then average the 4x/2x ratio. We made this calculation using Excel.
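The same calculation can be sketched in code. The sketch assumes a biallelic locus with allele frequencies a and 1 - a, alleles combined at random within an individual, and a grid of frequencies from 1% to 99% in 1% steps; the function names are hypothetical. Both the ratio of the two averages and the average of the per-frequency ratios are computed, since either reading of the procedure is possible.

```python
def het_diploid(a):
    """Chance that a random diploid is heterozygous at a biallelic locus."""
    b = 1.0 - a
    return 1.0 - (a ** 2 + b ** 2)


def het_tetraploid(a):
    """Same for a tetraploid, assuming its four alleles are drawn independently."""
    b = 1.0 - a
    return 1.0 - (a ** 4 + b ** 4)


# Assumed grid of allele frequencies: 1%, 2%, ..., 99%.
freqs = [i / 100 for i in range(1, 100)]

mean_2x = sum(het_diploid(a) for a in freqs) / len(freqs)
mean_4x = sum(het_tetraploid(a) for a in freqs) / len(freqs)

ratio_of_means = mean_4x / mean_2x
mean_of_ratios = sum(het_tetraploid(a) / het_diploid(a) for a in freqs) / len(freqs)
```

On this grid the two summaries differ only slightly (close to 1.80 and 1.83, respectively), so the choice between them does not change the size of the adjustment much.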

Results and Discussion

Empirical Assessments of Heterozygosity within Populations

Table 1 presents new and previous empirical data on wild species. High IV2 was observed. Thus, of the polymorphic loci detected in a species, many are polymorphic within populations of that species. Combined with potentially low marker frequencies, the average risk of missing a polymorphic locus with a single plant is high. So, any practical evaluation aimed at mining traits variable within these populations would not be expected to be very efficient if based on single plant samples per population.
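The sampling risk can be made concrete with a small sketch. It assumes that marker_freq is the fraction of plants in a population that carry a given marker and that plants are sampled independently; the function names and the 95% target are illustrative choices, not values from this study.

```python
def chance_marker_missed(marker_freq, n_plants):
    """Chance that none of n independently sampled plants carries the marker."""
    return (1.0 - marker_freq) ** n_plants


def plants_needed(marker_freq, capture_prob=0.95):
    """Smallest sample size that captures the marker with the given chance."""
    if not 0.0 < marker_freq <= 1.0:
        raise ValueError("marker_freq must be in (0, 1]")
    n = 1
    while 1.0 - chance_marker_missed(marker_freq, n) < capture_prob:
        n += 1
    return n


# Hypothetical example: a marker carried by 20% of the plants in a population.
risk_single_plant = chance_marker_missed(0.20, 1)
sample_for_95_percent = plants_needed(0.20, 0.95)
```

Under these assumptions a single plant misses such a marker four times out of five, and 14 plants are needed to capture it with at least 95% probability.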

Simulations of the Effects of Biases

Ascertainment Bias

A SNP locus is codominant, genotyping for the wild allele and, potentially, a second mutant allele. The SNPs on the Infinium 8303 Potato SNP array are mapped mutants in functional genes of S. tuberosum (cultivars), so they are not random and may not be neutral. Using them to measure diversity will introduce a cultivated potato “ascertainment bias” (Bamberg et al. 2015b). The concern with this bias is that populations less related to the germplasm from which SNPs were developed necessarily appear less heterogeneous, even potentially reversing the actual order of diversity of two populations (Lachance and Tishkoff 2013). A simple illustration with animal physical descriptors shows why ascertainment bias needs to be adjusted by calculating the IV2 statistic. Descriptors used to measure diversity in a certain bird species would likely be applicable to most birds. But mammals would appear to be very homogeneous as an average of bird descriptors because they have zero variation for many descriptors such as feather length, egg weight, beak color, etc. To use bird descriptors fairly to assess the relative diversity in mammals, the denominator in the ratio of variable versus fixed descriptors has to be descriptors for which variation is known to occur in both birds and mammals (e.g., lifespan, body temperature, diet). Within potato, we would expect the most primitive species like S. jamesii (Sarkinen et al. 2013) to have their apparent heterozygosity depressed the most by ascertainment bias.

If potato averages only a few percent of polymorphic loci across the whole genome, a plant tested with a random sample of all possible loci will obviously appear unrealistically distinct and homozygous in comparison to a plant tested with a set of custom loci all known to be polymorphic within its own genepool. Thus, comparable estimates of diversity within populations of potato species relatively unrelated to tuberosum must level the playing field by considering only loci for which variation has been shown to be possible as variation among populations.

Figure 1 illustrates the simulation for which polymorphic loci were artificially restricted to 2%, 5%, and 12%, representing about the observed heterozygosity reported in Hardigan et al. (2015, Figure 4) for very primitive wild diploids, typical wild diploids, and primitive cultivated landrace diploids, respectively. Ten genotypes of each were randomly selected only on the basis of observed heterozygosity level that could be depressed by ascertainment bias. As expected, individuals cluster by their degree of ascertainment bias. Specifically, when percent heterozygous loci randomly decreases, individuals appear increasingly homogeneous and distinct from individuals with greater percent heterozygosity. Thus, ascertainment bias would make the very primitive species S. jamesii appear very homogeneous despite empirical evidence of both high DNA and trait variability (Table 1).

Fig. 1

UPGMA DARwin plot of 30 individuals with 5000 loci with genotypes assigned randomly based only on the level of restricted observable heterozygosity (2%, 5%, 12%). Tree is rooted with a single random individual with expected maximum heterozygosity (50%)

One can easily calculate and graph the pattern of apparent similarity between and among populations as ascertainment bias varies from none to 100%. An example is provided in Supplemental Figure S.1b.

Bias Due to Small Samples of Low and Different Allele Frequencies

Bryan et al. (2017) used AFLP markers on an array of wild potato species to compare the variation among numerous individuals in a species: about 20 plants from a single population and single plants from each of about 20 different populations. It was assumed that since within-population individuals tightly clustered with each other, one could conclude that any single random plant from that cluster was a good representative of the genetic identity of the population. But single individuals within a population have a special biased relationship with each other, being drawn from a single distribution of loci which each have a single allele frequency. This means that all of the samples would present exactly the same genotype if assessed as a bulk of multiple plants using a dominant DNA marker like AFLP. Figure 2 shows the effect of common versus different allele frequencies. For 200 polymorphic loci at all random allele frequencies, individuals drawn at random from the same population (each locus has the same random allele frequency, the white circles) will always appear much more similar to each other than individuals from different populations (where each locus in each individual can have a different allele frequency, the black squares). Remember that this effect is completely due to bias: all these simulated individuals are polymorphic at exactly the same loci, so they would appear to be identical in bulk because all the alleles would be captured regardless of their individual frequency. Each set has exactly the same alleles that are present in the other. Supplemental Figure S2b illustrates that increasing the number of polymorphic loci observed to 2,000 does not remove the bias, but increases it. Supplemental Figure S2c illustrates that bias also increases if one considers only extreme allele frequencies at 200 loci, which is more realistic according to empirical assessments of allele frequencies at real polymorphic loci in wild potato species.

Fig. 2

PCA plots of relationship of individual genotypes chosen at random from 200 polymorphic loci at all random frequencies, isolating allele frequency bias. Individuals represented by circles have random allele frequencies in common (as if from a single population) while individuals represented as squares are drawn at random from random allele frequencies (as if from different populations)

We can easily estimate absolute similarities of the two different groups of plants with calculations on simulated genotypes. When there are 50,000 polymorphic loci and genotypes are drawn at random, two random plants deviate very little from the expected 50% Simple Matching Coefficient similarity as individuals from different populations and 66% similarity as individuals from the same population. If allele frequencies are restricted to the extremes (that is, allele frequencies tend to be close to fixed or rare rather than balanced), the bias making individuals from the same population look artificially similar becomes even more extreme. For example, when allele frequencies (x) are at the extremes (x < 20% or x > 80%), individuals drawn randomly from the same population appear to be very close to 83% similar, and when allele frequencies are very extreme (x < 10% or x > 90%), individuals drawn randomly from the same population will appear to be very close to 95% similar (random individuals from different populations remain at an average 50% similarity).
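The expected values for the unrestricted and the 20%/80% cases can also be approximated without drawing any genotypes. The sketch below assumes that a marker is scored present with probability p in each plant, that two plants are drawn independently, and that frequencies are spread evenly over the allowed range; the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def same_population_similarity(freqs):
    """Expected simple matching of two plants that share each locus frequency p."""
    return mean([p * p + (1 - p) * (1 - p) for p in freqs])


def different_population_similarity(freqs):
    """Expected simple matching when each plant draws its own frequency.

    With independent frequencies p and q, the expectation of
    p*q + (1-p)*(1-q) depends only on the mean frequency m.
    """
    m = mean(freqs)
    return m * m + (1 - m) * (1 - m)


def grid(lo, hi, steps):
    """Evenly spaced midpoints covering the interval [lo, hi]."""
    width = (hi - lo) / steps
    return [lo + (k + 0.5) * width for k in range(steps)]


all_freqs = grid(0.0, 1.0, 1000)
extreme_freqs = grid(0.0, 0.2, 500) + grid(0.8, 1.0, 500)

same_all = same_population_similarity(all_freqs)              # about 0.667
same_extreme = same_population_similarity(extreme_freqs)      # about 0.827
diff_all = different_population_similarity(all_freqs)         # about 0.5
diff_extreme = different_population_similarity(extreme_freqs)  # about 0.5
```

The same-population value rises as frequencies move toward the extremes, while the different-population value stays at one half whenever the frequency distribution is symmetric about 50%.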

Ploidy Bias

Most DNA markers have two alleles by design. If allele frequencies are a and b, the chance of sampling random heterozygous diploid individuals will be the complement of the homozygote frequency = $1-(a^2+b^2)$. For tetraploids, it is $1-(a^4+b^4)$. So if a diploid and tetraploid population had the same allele frequencies for all polymorphic loci, the chance of detecting heterozygosity in a single plant is obviously much greater for tetraploids. If known polymorphic loci have random allele frequencies, the average advantage of a tetraploid in detecting heterozygotes is very close to 1.8 times as much as its corresponding diploid form (higher if allele frequencies are extreme). Figure 3 shows the percent heterozygous loci observed in a variety of classes of potato germplasm (Hardigan et al. 2017, Fig 2D) and how very primitive diploid “outgroups”, other wild diploid species, and cultivated diploid landraces are not remarkably less heterozygous when the adjustment is made for ploidy bias by multiplying by a factor of 1.8. Again, the magnitude of this bias will depend on the actual proportions of allele frequencies, with the advantage of detecting heterozygosity in tetraploids increasing as the balance of allele frequency departs from 50%:50%. Calculations based on the observed allele frequencies of the S. jamesii population from Navajo Canyon in Table 1 indicate that one would indeed expect about 1.8 times as much heterozygosity to be detected in a single tetraploid individual as a single comparable diploid. In the study of Haynes et al. (2017), diploid S. chacoense heterozygosity estimates needed to be increased by a factor of 1.3 to remove ploidy bias and allow a fair comparison to the heterozygosity of tetraploid cultivars.
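The size of the ploidy factor can be traced with a short worked sketch. It assumes two alleles with b = 1 - a and alleles combined at random within individuals; the labels H_{2x} and H_{4x} are introduced here only for convenience.

$$H_{2x} = 1-(a^2+b^2) = 2ab, \qquad H_{4x} = 1-(a^4+b^4) = 2ab\,(2-ab), \qquad \frac{H_{4x}}{H_{2x}} = 2-ab$$

The ratio is 1.75 when a = b = 0.5 and approaches 2 as either allele becomes rare, consistent with the advantage of tetraploids growing as allele frequencies depart from 50%:50%. If a is taken as uniformly distributed between 0 and 1, then

$$\int_0^1 H_{2x}\,da = \frac{1}{3}, \qquad \int_0^1 H_{4x}\,da = \frac{3}{5}, \qquad \frac{3/5}{1/3} = 1.8, \qquad \int_0^1 (2-ab)\,da = \frac{11}{6} \approx 1.83$$

so the ratio of the average heterozygosities is 1.8, and the average of the per-frequency ratios is only slightly higher.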

Fig. 3

Diploid wild species and outgroups have similar heterozygosity to cultivated tetraploid forms if adjusted for ploidy bias

Conclusions

When biases are corrected for, marker evidence is in harmony with trait evidence that primitive diploid potato species can be quite heterogeneous within their populations. It follows that accurate collection, evaluation and preservation of diversity depends on sampling an adequate number of individuals within populations.