Introduction

Genome-wide association studies (GWAS) are widely used for the identification of SNP-markers associated with phenotypic variation in humans, animals, and plants. Application of GWAS has resulted in the identification of QTLs in a variety of major crop species like maize, cotton, wheat, sorghum, barley, and rice (Tibbs Cortes et al. 2021). The main advantages of GWAS are that a wide genetic variation can be analysed simultaneously and that it utilizes all historical recombinations, resulting in an increased resolution for QTL detection when compared to traditional bi-parental linkage mapping. Nevertheless, GWAS suffers from several disadvantages not present in mapping populations (Korte and Farlow 2013). Firstly, population structure can result in the identification of false positive associations. Correction for relatedness between genotypes can partly solve this problem; however, it may also result in missing true positives. Secondly, the geometric distribution of the population allele frequency of SNP markers indicates that many rare SNPs will have a reduced power in GWAS as compared to SNP allele frequencies in bi-parental populations. And finally, genetic heterogeneity, where different loci or alleles may lead to similar phenotypes, complicates the detection of individual QTLs.

Despite these disadvantages, association studies still offer the potential to dissect the genetics of complex traits, with the ultimate goal to improve breeding efficiency through the use of molecular markers. In the last decade several association studies have been conducted in potato (Baldwin et al. 2011; D’hoop et al. 2008, 2014; Li et al. 2010; Lindqvist-Kreuze et al. 2014; Malosetti et al. 2007; Rosyara et al. 2016; Schönhals et al. 2016; Urbany et al. 2011; Klaassen et al. 2019; Prodhomme et al. 2020). All of these studies found significant marker-trait associations, but differed in the number of markers and marker types used. For example, in the studies of D’hoop et al. (2014) and D’hoop et al. (2008), a relatively high number of markers were used; however, the multi-locus AFLP marker system is not easily simplified into a single locus assay for follow-up studies or breeding. Other studies (Baldwin et al. 2011; Li et al. 2010; Schönhals et al. 2016; Urbany et al. 2011) focused on a fixed number of several candidate genes, and as a consequence, only a relatively small portion of the genome was taken into account. These studies can therefore technically not be considered as truly genome-wide. Two SNP arrays have been developed for potato, a 10 K SNP array, known as the SolCAP 8303 SNP array (Felcher et al. 2012) and a 20 K SNP array as described by Vos et al. (2015). Both arrays offer genome-wide coverage of SNPs and should be able to capture the genetics underlying complex traits. However, two studies using the SolCAP array presented fewer significant marker-trait associations than expected (Lindqvist-Kreuze et al. 2014; Rosyara et al. 2016). In this study, we explore the potential of the other array (Vos et al. 2015) using steroidal glycoalkaloid content in potato tubers as an example.

Steroidal glycoalkaloids (SGAs) are secondary metabolites and abundantly present in the Solanaceae family. These SGAs act (mainly in leaf tissue) as a defence against different plant pests and pathogens such as insects (Nenaah 2011) and fungi (Hoagland 2009). In contrast to this beneficial property for the plant, the potato SGAs α-solanine and α-chaconine can have toxic effects on humans upon consumption (Friedman 2006). Therefore, in many countries, a threshold of 200 mg kg−1 of tuber fresh weight is set as the upper limit of total SGA content. In the Netherlands, measuring SGA content is a requirement in the Value for Cultivation and Use (VCU) testing needed to obtain breeders’ rights; the maximum is set at the multiyear average of the variety Innovator for table and ware potatoes and at the multiyear average of the variety Aventra for starch potatoes (Anonymous 2015). In order to breed varieties that do not exceed this threshold, an understanding of the genetics of glycoalkaloid accumulation is helpful. Previous mapping studies identified several QTLs for SGA accumulation in potato. While most studies measure SGA levels in potato leaves (Manrique-Carpintero et al. 2013, 2014; Medina et al. 2002; Ronning et al. 1999; Sagredo et al. 2006, 2011; Yencho et al. 1998), a few have also been conducted on SGA accumulation in tubers (Mariot et al. 2016; Sørensen et al. 2008; Valcarcel et al. 2014; Villano et al. 2020). The majority of these mapping studies involved hybrids with wild species like S. sparsipilum (Sørensen et al. 2008), S. berthaultii (Yencho et al. 1998), S. phureja (Medina et al. 2002), S. commersonii (Carputo et al. 2003), and S. chacoense (Manrique-Carpintero et al. 2014; Medina et al. 2002; Ronning et al. 1999; Sagredo et al. 2006, 2011). This suggests the involvement of alleles originating from wild species in the accumulation of SGA content in the contemporary potato germplasm.
Six mapping studies reported a QTL for accumulation of SGAs on chromosome 1 (Hutvágner et al. 2001; Manrique-Carpintero et al. 2014; Ronning et al. 1999; Sagredo et al. 2011; Sørensen et al. 2008; Yencho et al. 1998). Additionally, α-solanine and α-chaconine QTLs have been mapped on chromosomes 6 and 11 in two studies (Manrique-Carpintero et al. 2014; Yencho et al. 1998) and on chromosomes 4, 8 and 12 (Yencho et al. 1998). However, the only QTL found for accumulation of α-solanine and α-chaconine in tubers is on chromosome 1 (Sørensen et al. 2008). The major genes involved in the SGA biosynthetic pathway in Solanaceae have been published (Itkin et al. 2013). In this study, six genes involved in glycoalkaloid metabolism in potato are described with four and two GAME (GLYCOALKALOID METABOLISM) genes located on chromosome 7 and 12 respectively. The cluster on chromosome 7 includes two of the three known genes encoding steroidal alkaloid glycosyltransferases (SGT-genes) that decorate the steroidal alkaloid skeleton with various sugar moieties. It has been shown that SGT-genes are responsible for the final glycosylation steps in the SGA pathway (McCue et al. 2011, 2005; Moehs et al. 1997). In the paper of Cárdenas et al. (2016), GAME9 was postulated as a major regulator of the SGA pathway which co-localized with the QTL on chromosome 1 from Sørensen et al. (2008). From these studies, the key-metabolic and regulator genes of the SGA pathway are known; however, to be able to apply this knowledge to marker-assisted selection in a breeding program, it is essential to know which allelic variants of these genes are responsible for variation in SGA content in potato tubers. We therefore used a GWAS approach combined with QTL-mapping from bi-parental segregating populations to identify natural variation related to glycoalkaloid accumulation in potato tubers.

Materials and methods

Plant materials

For the genome-wide association study, a variety panel of 275 tetraploid genotypes was used (Table S1). This is a subset of the variety panel described in Vos et al. (2015). In addition, three bi-parental segregating populations were used: (1) a tetraploid bi-parental population Altus × Colomba of 87 genotypes, which is a subset from Bourke et al. (2015) (hereafter referred to as A × C); (2) a tetraploid bi-parental population Altus × KA 2004–4057 of 34 genotypes (hereafter referred to as A × K); and (3) a diploid bi-parental population SH 83–92-488 × RH89-039–16 (Van Os et al. 2006) of 157 genotypes (hereafter referred to as SH × RH).

Phenotypic data collection

Phenotypic data for the variety panel were collected from three sources (Table S1 lists, for each variety, its inclusion in each dataset). First, field trials were performed in 2008 and 2009 at Averis in Valthermond on dark sandy soil (reclaimed peat bog). The varieties were arranged into maturity classes according to their maturity type (early maturing–main crop–late maturing) to avoid competition as much as possible. Plot size was 16 plants, and the plants were spaced at 40 cm within rows and at 75 cm between rows. Plots were randomized within maturity classes, while maturity classes were randomized. Tubers were planted at the end of April and harvested at the end of September or the beginning of October. Three weeks before harvest, vines were killed. No irrigation was applied. The fertility regime followed standard commercial potato production field practices. In 2008, 134 genotypes and in 2009, 183 genotypes were phenotyped for α-solanine and α-chaconine content as described below. The overlap between both years is 132 varieties. Second, for 138 varieties, the phenotypic values of total glycoalkaloid content were collected as part of the official “Value for Cultivation and Use” testing (VCU data), a requirement for admission to the National List (Anonymous 2015). The overlap between the field trial data (N = 132) and VCU data (N = 138) is 42 varieties. Third, for 18 varieties, glycoalkaloid content was available from multi-year, multi-location (MYML) data from breeding programs as described in D’hoop et al. (2011). The overlap between the MYML and VCU data is 13 varieties. In summary, we used an initial variety panel (N = 132), a supplementary panel (N = 143), and a total panel (N = 275).

The tetraploid population A × C was phenotyped once in 2011 and in two replicates on a field trial in 2012. The diploid population SH × RH and the tetraploid population A × K were both grown in field trials in 2009 and 2011, and phenotypes were collected from one replicate per year.

TGA measurements

For TGA extraction, 5 kg of tubers including the skin were ground and the liquid was separated. Tubers affected by greening were carefully excluded from the sample. Subsequently, 300 μl of the potato juice was put in a 15-ml plastic centrifuge tube and 10 ml standard TGA extraction buffer was added (5% acetic acid supplemented with 20 mM Na-1-heptanesulfonate). Samples were incubated overnight while shaking at 150 strokes/min followed by centrifugation at 9000 × g for 10 min. Supernatant of the extract was filtered using a Pall Acrodisc GHP 0.45 µm syringe filter.

Analysis was carried out as described by Laus et al. (2017) using an isocratic online-SPE-HPLC system, which contains Oasis HLB Prospect-2/Symbiosis 10 × 2 mm cartridges (Waters, Milford, Massachusetts, USA). Cartridges were replaced preventively every 96 cycles. The cartridge was washed at a flow rate of 1 ml/min for 4 min with 100% acetonitrile followed by a 3-min wash with ultrapure water after which 850 µl of TGA extract was loaded onto the cartridge. The cartridge was subsequently washed for 30 s with water, 2 min with wash buffer 1 containing 25% acetonitrile /1.5% NH4OH, and 30 s with ultrapure water followed by 4 min with wash buffer 2 containing 15% acetonitrile /10 mM phosphate buffer pH 7.6. TGA was retrieved from the cartridge by the mobile phase of the HPLC system. Separation of α-solanine and α-chaconine was carried out on a Hypersil ODS 250 mm × 4.6 mm 5 μm C18 column with a Hypersil ODS C18 5.0 × 4.6 mm guard column (Thermo Scientific, Waltham, USA) at a temperature of 34 °C and a flow rate of 1.6 ml/min. The mobile phase consisted of a 60/40 mixture of 100% acetonitrile and 10 mM phosphate buffer pH 7.6, degassed using the degas function of an ultrasonic cleaner (VWR International, Leuven, Belgium) and filtered over a 0.45-µm PVDF filter (Waters, Milford, Massachusetts, USA). Quantification of α-solanine and α-chaconine was based on the peak absorbance area at a wavelength of 202 nm. Peak areas were converted to mg kg−1 concentrations using calibration curves based on α-solanine and α-chaconine standards added to the sample. Total SGA (equal to the sum of α-solanine and α-chaconine), α-solanine, and α-chaconine contents were reported in mg kg−1 (ppm).

Marker data

The variety panel and both tetraploid bi-parental populations (A × C and A × K) were genotyped with the SolSTW 20 K Infinium array (Vos et al. 2015). In the two bi-parental populations, 7150 and 7144 markers segregated respectively. In the variety panel, an allele frequency cut-off of 1.25% was used, which implies for tetraploids that 4.81% of the varieties are positive for the minor allele in simplex condition. Given our panel size (N), the probability to find a duplex individual is negligible (0.0009 × N). After exclusion of rare variants (< 1.25%), 11,674 SNP markers were polymorphic in the entire panel of 275 genotypes (11,147 SNPs in the subset of 132 genotypes). The SNP markers were used according to the reference genome where the SNP dosage used is the dosage of the non-reference SNP-allele. The diploid mapping population was genotyped with AFLP. The genetic map of this population (Khan 2012; Van Os et al. 2006) could be used to convert the genetic positions of marker bins into physical coordinates (Sharma et al. 2013).
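
The 4.81% and 0.0009 figures quoted above are consistent with binomial sampling of four chromosomes at a minor allele frequency of 1.25%. The following minimal sketch reproduces them under that assumption; the function name is hypothetical, and the binomial model is an assumption that matches the quoted numbers rather than a method stated in the text.

```python
from math import comb


def dosage_probabilities(allele_freq, ploidy=4):
    """Binomial probability of carrying 0..ploidy copies of the minor allele.

    Assumes chromosomes are sampled independently (an assumption, not a
    statement from the source).
    """
    q = allele_freq
    return [
        comb(ploidy, k) * q**k * (1 - q) ** (ploidy - k)
        for k in range(ploidy + 1)
    ]


probs = dosage_probabilities(0.0125)
simplex_fraction = probs[1]  # about 0.0481, i.e. 4.81% of varieties
duplex_fraction = probs[2]   # about 0.0009 per variety
print(f"simplex: {simplex_fraction:.4f}, duplex: {duplex_fraction:.4f}")
```

Multiplying `duplex_fraction` by the panel size gives the expected number of duplex varieties at this cut-off.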

Phenotypic analyses

Unless specified otherwise, statistical analyses were performed in GenStat 14th Edition (VSN International).

Prior to the GWAS, best linear unbiased estimates (BLUE) were calculated using restricted maximum likelihood (REML), with the following mixed model:

$$\underset{\_}{\mathrm{phenotype}}=\underset{\_}{\mathrm{trial}}+\mathrm{genotype}+\underset{\_}{\mathrm{error}}$$

where random effects are underlined. Phenotype is a vector of \({\mathrm{Log}}_{10}\)-transformed plot means of a trait measured on different varieties by different companies, in different locations, and in different years; trial is a combination factor of company, location, and year, and genotype is a factor identifying each variety. The adjusted means per genotype (BLUE) were saved to use as response in GWAS.

A random components model was used on the same phenotypic data to estimate the contribution from each component of the total phenotypic variance:

$$\underset{\_}{\mathrm{phenotype}}=\underset{\_}{\mathrm{trial}}+\underset{\_}{\mathrm{genotype}}+\underset{\_}{\mathrm{error}}$$

Broad sense heritability was calculated as \({H}^{2}={V}_{G}/\left({V}_{G}+{V}_{e}\right)\), where \({V}_{G}\) is the genetic variance component, and \({V}_{e}\) is the remaining phenotypic variance after correcting for the different trials.

QTL detection

For the GWAS, both a naive and a kinship-corrected model were applied. The threshold to call associations significant requires correction for multiple testing. A Bonferroni correction for 11,674 independent marker tests results in a significance threshold at − log10(p) = 5.37. In view of the amount of LD between many markers, the Bonferroni correction is too strict. Previous work on LD-decay (Vos et al. 2017) showed that haploblocks extend for at least 1 Mb. Exclusion of pericentromeric regions (not involved in recombination) would leave 500 independent haploblocks of 1 Mb per genome, demanding a significance threshold at − log10(p) = 4.0.
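
Both thresholds follow from dividing a family-wise error rate by the number of independent tests. The short sketch below recomputes them; the error rate of 0.05 is an assumption (it is not stated explicitly above, but it reproduces both quoted values), and the function name is hypothetical.

```python
from math import log10


def neg_log10_threshold(n_tests, alpha=0.05):
    """Bonferroni-style threshold on the -log10(p) scale.

    alpha = 0.05 is assumed here; it is not given in the text.
    """
    return -log10(alpha / n_tests)


bonferroni = neg_log10_threshold(11674)  # about 5.37 for all markers
haploblock = neg_log10_threshold(500)    # about 4.0 for 500 haploblocks
print(f"{bonferroni:.2f} {haploblock:.2f}")
```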

For the naive GWAS, a regression model was used to test the association between the trait and each marker. The naive model used for the association panel, and also regression model for A × C, A × K populations, was:

$$\underline{y}=\mathrm{marker}+\underset{\_}{\varepsilon }$$

where y is a vector of adjusted means per variety (BLUE) obtained from the phenotypic model; marker is a vector of SNP dosage taking values between zero and four, and ε is the residual, with \(\mathrm{Var}\left(\underset{\_}{\varepsilon }\right)= I{\sigma }_{\varepsilon }^{2}\). This model was also used for the QTL analysis in the tetraploid bi-parental populations (A × C and A × K).
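
As an illustration of the naive model, the sketch below fits an ordinary least-squares regression of a trait on SNP dosage for one marker and returns the slope, the R² (the squared marker-trait correlation), and the t statistic of the slope. The data values are invented for illustration, the function name is hypothetical, and the conversion of the t statistic to a p-value (which needs a t-distribution) is left out.

```python
from math import sqrt


def single_marker_regression(dosage, y):
    """Least-squares fit of y = intercept + slope * dosage for one marker."""
    n = len(y)
    mean_x = sum(dosage) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(dosage, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy**2 / (sxx * syy)
    residual_ss = syy - slope * sxy
    # Standard error of the slope with n - 2 residual degrees of freedom.
    slope_se = sqrt(residual_ss / (n - 2) / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": r2,
        "t": slope / slope_se,
    }


# Hypothetical dosages (0..4) and log10-transformed trait values.
toy_dosage = [0, 1, 2, 3, 4]
toy_trait = [1.0, 1.1, 1.3, 1.3, 1.5]
fit = single_marker_regression(toy_dosage, toy_trait)
print(fit)
```

In a genome scan this function would be called once per marker; a marker with no dosage variation (sxx equal to zero) or a perfect fit would need to be filtered beforehand.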

In a second step, a mixed model was used to correct for relatedness among these 132, 275, or 143 varieties, to prevent false positive detection due to population structure. A kinship matrix K was calculated using procedure Fsimilarity in Genstat using the ecological distance (1 −|xi-xj|/r, unless xi = xj = 0, where xi and xj are allele dosages and r is the range). This distance measure bears similarity with Euclidean distance, but the shared absence of a SNP minor allele is excluded and does not inflate similarity. For this purpose, 1000 markers were randomly picked. From marker pairs with r2 > 0.1, one marker was removed, until no marker pair exceeded this threshold. With the remaining 710 independent markers, three subpopulations were identified named the “Agria,” “Starch,” and “Rest” group as defined by D’hoop et al. (2008). A principal coordinate analysis based on these 710 markers has been presented (Vos et al. 2015). For the QTL analysis in the tetraploid bi-parental populations (A × C and A × K), single-marker regression using marker dosages was used, similar to the naive GWAS model, and the marker positions are not in cM, but in PGSC 4.03 physical coordinates.
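
The similarity measure and the marker thinning described above can be sketched as follows. This is not the Genstat implementation: averaging the per-marker values over all markers that are not shared zeros, using r = 4 for tetraploid dosages, and the greedy order in which correlated markers are dropped are all assumptions, and every name in the block is hypothetical.

```python
def ecological_similarity(dos_i, dos_j, value_range=4):
    """Mean of 1 - |xi - xj| / r over markers, skipping shared zeros."""
    total = 0.0
    used = 0
    for xi, xj in zip(dos_i, dos_j):
        if xi == 0 and xj == 0:
            continue  # shared absence of the minor allele is not counted
        total += 1 - abs(xi - xj) / value_range
        used += 1
    # Returning 0.0 when no marker is informative is an arbitrary convention.
    return total / used if used else 0.0


def similarity_matrix(genotypes):
    """Square matrix of pairwise similarities, in dictionary order."""
    names = list(genotypes)
    return [
        [ecological_similarity(genotypes[a], genotypes[b]) for b in names]
        for a in names
    ]


def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy**2 / (sxx * syy)


def prune_by_ld(markers, threshold=0.1):
    """Greedily keep markers whose r2 with every kept marker is <= threshold."""
    kept = []
    for name, dosages in markers.items():
        if all(r_squared(dosages, markers[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical dosages: one list of markers per variety.
toy_genotypes = {"v1": [0, 1, 4, 0], "v2": [0, 2, 4, 1], "v3": [0, 0, 0, 4]}
K = similarity_matrix(toy_genotypes)

# Hypothetical dosages: one list of varieties per marker.
toy_markers = {
    "m1": [0, 1, 2, 3, 4],
    "m2": [4, 3, 2, 1, 0],  # perfectly correlated with m1
    "m3": [2, 0, 1, 0, 2],  # uncorrelated with m1
}
independent = prune_by_ld(toy_markers)
```

With the toy data, `m2` is dropped because it is in complete LD with `m1`, while `m3` is retained.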

The model correcting for relatedness (mixed model) used for the association panel was:

$$\underline{y}=\mathrm{marker}+\underset{\_}{\mathrm{genotype}}+ \underset{\_}{\varepsilon }$$

where \(\mathrm{Var}\left(\underset{\_}{\mathrm{genotype}}\right)= K{\sigma }_{g}^{2}\), and \(\mathrm{Var}\left(\underset{\_}{\varepsilon }\right)= I{\sigma }_{\varepsilon }^{2}\) as before.

The percentage of phenotypic variance explained by a single marker in the GWAS mixed model was calculated in three different ways. The first method, \({R}^{2}\), is the squared correlation between marker and trait, which is the same \({R}^{2}\) as produced by a simple regression model, and does not take population structure into account. The second method is the \({R}_{W2}^{2}\) statistic (Kramer 2005; Sun et al. 2010), and the third method is the concordance correlation rc (Vonesh et al. 1996; Sun et al. 2010). Because values obtained with \({R}^{2}\) are always in between the considerably higher estimates obtained with \({R}_{W2}^{2}\) and the lower estimates of the rc statistic, we used the estimates of the first method \({R}^{2}\) to describe our results in the text. The explained variances as obtained with the \({R}^{2}\), \({R}_{W2}^{2}\), and rc statistics are reported in Table S2 for all associated markers with a significance of − log10(p) > 4.0.

Backward selection

After the single marker analyses, a backward selection procedure was used to select a multi-locus model that explained a substantial part of the trait variation. In a first step, we tested LD between all significant markers from the single-marker analysis (model correcting for relatedness), and selected the most significant marker in each set of markers that were in linkage disequilibrium (LD) (r2 > 0.9). The remaining markers had a lower LD, and were all included in a multi-locus model that corrected for relatedness:

$$\underline{y}=\sum\nolimits_{j}{\mathrm{marker}}_{j}+\underset{\_}{\mathrm{genotype}}+ \underset{\_}{\varepsilon }$$

where the terms are the same as for the single-marker model above. In the backward selection procedure, first we checked if any marker was not significant in the presence of all other markers, and in that case, the least significant marker in the model was removed. The model was then re-fitted, and tested. Non-significant markers were removed, one at a time, until all markers in the model were significant (p < 0.05).
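
The elimination loop described above can be sketched generically. In the sketch below the model fit is abstracted into a caller-supplied function `fit_pvalues`, a hypothetical callable that refits the multi-locus model for a given marker set and returns one p-value per marker; the mixed-model fit itself is not implemented, and the toy p-values are invented for illustration.

```python
def backward_select(markers, fit_pvalues, alpha=0.05):
    """Drop the least significant marker and refit until all have p < alpha."""
    current = list(markers)
    while current:
        pvals = fit_pvalues(current)  # refit with the markers still present
        worst = max(current, key=lambda m: pvals[m])
        if pvals[worst] < alpha:
            break  # every remaining marker is significant
        current.remove(worst)
    return current


def toy_fit(marker_set):
    """Stand-in for a model fit: marker c becomes significant once b is gone."""
    table = {
        "a": 0.001,
        "b": 0.30,
        "c": 0.08 if "b" in marker_set else 0.03,
    }
    return {m: table[m] for m in marker_set}


selected = backward_select(["a", "b", "c"], toy_fit)
print(selected)  # ['a', 'c']
```

The toy example shows why the model is refitted after each removal: the significance of one marker can change once a correlated marker has left the model.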

Results

Phenotypic evaluation

Phenotypic data for this study was collected from several sources: (1) a designated field trial, (2) official VCU data published by Plantum.nl (Anonymous 2015) to allow admittance to the National list (the legally permitted threshold of SGA may not exceed 200 mg kg−1), and (3) historical multi-year, multi-location data collected from breeding programs (D’hoop et al. 2011). Correlations of the overlap between the different data origins resulted in relatively high R2-values (Fig. S1). The R2 between VCU data and our field trial was 0.72 (n = 42), and the R2 between historical data and VCU data was 0.85 (n = 13). There was no overlap between the field trial and historical breeding data. Because of these high correlations, we were confident that merging the different datasets would likely increase the power for QTL detection.

Working with a variety panel on a trait for which high values are discouraged because it will lead to rejection of those varieties, it is obvious that the number of varieties with high SGA content is limited. Nevertheless, varieties exceeding 200 mg kg−1 were identified, which, apart from B5141-6 (withdrawn from cultivation; hereafter indicated by its former name Lenape), were all varieties from the starch industry for which a different maximum applies (not exceeding the multiyear average of the variety Aventra) and not used for human consumption such as Festien, Allure, Astarte, Elkana, Mantra, Kuras, and Mercator. Having such a low number of high SGA varieties in our variety panel results in an extremely skewed distribution as shown in Figs. 1 and S2. A similarly skewed distribution of SGA phenotypic values is observed in the bi-parental populations A × K, A × C, and SH × RH (Fig. S2), although not as extreme as in the variety panel. In order to obtain a normal distribution of the trait values required for a correct analysis, a log10-transformation was applied, and to be consistent, this transformation was applied to all datasets. After a log10-transformation, BLUEs (best linear unbiased estimates) were calculated for the variety panel and the A × C population using REML. For both populations, no significant origin, year, or location effects were identified. The A × K and SH × RH were only phenotyped once; therefore, REML could not be applied. The log10-transformation resulted in a more normal distribution of trait values suitable for GWAS (Fig. 1).

Fig. 1 Boxplots showing statistical distribution of all traits. Both the distribution for the raw data (mg kg−1) and log10-transformed data are shown. The width of the boxplot is proportional to the population size

The broad sense heritability (H2) was calculated (Table 1). Highly similar estimates (H2 = 0.58) were observed for α-solanine, α-chaconine, and total SGA content in the variety panel. In the A × C population, the heritabilities for the different SGA traits ranged between 0.64 and 0.67 (Table 1). The log10-transformation resulted in a slight increase of the heritabilities for α-solanine and total SGA content (Table 1). The heritability of the α-solanine/α-chaconine ratio (SGR) was relatively low (H2 = 0.26) in the variety panel; however, a much higher heritability (H2 = 0.77) was observed in the A × C population. Also, for the ratio, the log10-transformation resulted in an increase of the heritability in the variety panel to 0.53.

Table 1 Populations used in this study, their sizes, and broad sense heritabilities (H2) for the glycoalkaloid traits in mg kg−1 (untransformed) and the log10 transformation of mg kg−1

In Fig. 1, boxplots show the statistical distribution of all traits in all populations of this study for both the raw data (mg kg−1) and after the log10-transformation. The maximum SGA content observed in the variety panel is 508 mg kg−1 in Festien. Some progeny from two bi-parental populations also exceeded 200 mg kg−1, of which the SH × RH population shows a clear transgressive segregation (Fig. S2). The third population A × C does not show transgressive segregation, and the complete population remains well below the 200 mg kg−1 threshold. The ratio between α-solanine and α-chaconine in the majority of the varieties is below 1, indicating that more α-chaconine is accumulating than α-solanine. In Table S3, the correlation between the different SGA traits for the different populations is shown. For all populations, a very high correlation between α-solanine, α-chaconine, and total glycoalkaloid content is observed, while the correlation between the ratio and level of SGAs is less significant or even absent.

QTL identification for glycoalkaloid content

An initial GWAS was performed with 132 genotypes, because for these genotypes, both α-solanine and α-chaconine phenotypes were available. Analysis of α-solanine and α-chaconine contents (as separate traits) gave almost identical results to total SGA contents, which is due to their high correlations. Separate α-solanine and α-chaconine data are only available for the 132 varieties of the basic panel, and in view of the high correlations, we only show results of total SGA contents. As a second step, we performed a GWAS on the complete panel of 275 varieties. Figure 2A shows the results of a naive GWAS on 132 genotypes, where 312 markers exceeded the multiple testing threshold of − log10(p) ≥ 4 (indicated with green dots). The same naive analysis with all 275 genotypes resulted in 603 significantly associated markers (Fig. 2C). The maximum − log10(p) of both analyses is around 12. Population structure strongly affects the results of GWAS, and therefore, both naive and kinship-corrected results of GWAS are presented in Fig. 2A and B, respectively, for the GWAS with 132 genotypes and Fig. 2C and D, respectively, for the GWAS with 275 genotypes. Correction for population structure should also diminish p-value inflation, and therefore, Q-Q plots are presented in Fig. S3a and S3b to show the expected and observed p-values of the marker associations with SGA and ratio, respectively. The Q-Q plots clearly show that correction with the kinship matrix largely circumvents the p-value inflation. A kinship-corrected GWAS of the initial (N = 132) and the total panel (N = 275) both identified 21 significantly associated markers (Table S2). The strong reduction in the number of significant marker-trait associations suggests a strong correlation between trait values and population structure. Indeed, Fig. 3A confirms this confounding effect, as higher SGA values are overrepresented among varieties that belong to the “Starch” subpopulation.
More striking is that the 21 significantly associated markers from the analysis with 132 genotypes with replicated phenotypes and the 21 SNPs from the complete set of 275 plants are non-overlapping. Therefore, we also tested the subset of 143 plants for which we relied on data from the VCU test and the historical data from breeding programs. The kinship-corrected GWAS identified eight SNPs exceeding the threshold (listed in Table S2) of which two associations were unique for this supplementary panel and six SNPs overlapped with SNPs discovered in the total panel. For several SNPs the significance of the association and the explained variance was much higher in this complementary panel as compared to the total panel. When we compared the physical coordinates of the SNPs from the different association analyses, it was obvious that the significant markers largely came from the same genomic regions on chromosomes 1 and 11. However, unique positions were also found for the initial panel on chromosome 4 and 5 and for the complete panel on chromosome 3.

Fig. 2 Manhattan plots for total SGA content. A Naive GWAS with 132 genotypes, B kinship-corrected GWAS with 132 genotypes, C naive GWAS with 275 genotypes, D kinship-corrected GWAS with 275 genotypes, E QTL analysis of bi-parental mapping population A × C, and F QTL analysis of bi-parental mapping population A × K. Dotted horizontal line is at the multiple-testing threshold of − log10 (p) of 4. All green dots are significant according to this threshold

Fig. 3 Phenotypes are confounded with population structure. The distribution of A total SGA content and B α-solanine/α-chaconine ratio over the three different subpopulations. The width of the boxplot is proportional to the population size

An overview of the SNPs and QTL regions identified in this study is presented in Table 2, and can be summarized as the detection of five genomic regions containing QTLs involved in amount of steroidal glycoalkaloids: Sga1.1, Sga3.1, Sga5.1, Sga7.1 and Sga11.1.

Table 2 Summary of the QTLs and peak markers underlying the QTLs as shown in Figs. 2 and 6

Within the Sga1.1 region, the most significant SNPs identified in the kinship-corrected GWAS across the different panels are located on chromosome 1 within superscaffold PGSC0003DMB000000095 explaining up to 21% of the variation. The 23.8 Mb large region (chr01:63,764,681…87,544,718) has two sub-regions where SNPs pile up predominantly within a 4-Mb region (chr01:65,692,910…69,665,033) identified in the analyses with 132 and 275 genotypes and another 7.7 Mb interval more south (chr01:80,467,982…87,185,358) only identified in the subset of 132 genotypes. Both are independent, because the markers of the sub-regions do not correlate (data not shown). Validation of Sga1.1 with bi-parental mapping populations was not feasible with the A × C or A × K populations, because none of the significant SNPs segregated in A × C and only three segregate in A × K, but were not significant. The QTL Sga1.1 could be validated in mapping population SH × RH, where a major-effect QTL on linkage group SH01 was detected (LOD = 27.8). In SH × RH, this QTL explained 56.7% of the variance (Fig. 4). Using the study of Sharma et al. (2013), the peak marker maps around the exact same region of PGSC0003DMB000000095.

Fig. 4 QTL mapping in the diploid SH × RH population. Maps of chromosomes 1, 5, 7, and 11. Marker names are shown left and map positions (cM) right from the chromosome. Bars show the minus 1 and minus 2 LOD interval for all QTLs

The locus Sga3.1 maps to a sharp QTL peak encompassing SNPs from a 1.7-Mb interval (chr03:42,232,155…43,921,814) where PotVar0068174 is the SNP most significantly associated with SGA (− log10(p) = 5.8) in the total panel, explaining 17.5% of the variation. In the subset of 132 genotypes, the associations did not reach the significance threshold of 4. This QTL was not validated in any of the bi-parental mapping populations, although PotVar0068174 was segregating in both A × C and A × K.

GWAS identified PotVar0034580 as a single, significantly associated SNP (− log10(p) = 4.5) located on a distal position (51.70 Mb) of the south arm of chromosome 5. We did not assign a QTL name to this putatively spurious association, because PotVar0034580 was only significant in the basic panel and not associated with SGA content in the total panel, nor in any of the mapping populations. On the north arm of chromosome 5 however, the validation population SH × RH displayed a significant QTL (− log10(p) = 5.9) at 18.2 cM, explaining 7.1% of the variance. The AFLP markers in the SH × RH population, associated with the QTL called Sga5.1, correspond to a position on superscaffold PGSC0003DMB000000192.

The validation population A × C did not result in any significant QTL (Fig. 2E), while the validation population A × K displayed a highly significant QTL, called Sga7.1, on chromosome 7 with a − log10(p) of 8.1, explaining 63.9% of the variance. Figure 2F shows the location of Sga7.1 in a physically large interval with many SNPs. The long-range LD in this interval is caused by suppression of recombination in the pericentromeric heterochromatin. The peak position consists of two co-segregating SNPs (PotVar0092875 and PotVar0115020). GWAS could not identify SNPs on chromosome 7 associated with SGA content. This QTL was validated in the SH × RH population (Fig. 4), explaining 7.5% of the variation with a LOD of 6.2, although the position of the AFLP markers corresponds to superscaffold PGSC0003DMB000000233, slightly more towards the chromosome end (Sharma et al. 2013). The QTL interval of Sga7.1 includes candidate genes GAME6 and GAME11.

On chromosome 11, a significant QTL was discovered with the subset of 132 genotypes and the complete panel (Fig. 2B and C). This QTL named Sga11.1 maps to a 2.3-Mb region (Chr11:2,037,454…4,347,636) where peak marker PotVar0066293 explains 14.5% of the variance. Sga11.1 could be validated in the SH × RH population, where a QTL with a LOD-value of 5.6 explaining 15.4% of variation could be mapped genetically to a position corresponding closely to superscaffold PGSC0003DMB000000133 (Sharma et al. 2013).

A backward selection procedure was used to compose a multi-locus model of non-redundant markers. The markers included in this model are listed in Table S4, and the prediction by this model is illustrated in Fig. 5A. Collectively, these SNPs explain 32% of the total variation. In view of the high broad-sense heritability for SGA (H2 ranging between 0.56 and 0.73), the results of our model suggest a considerable amount of missing heritability.
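The idea of backward selection can be illustrated with a minimal sketch. The stopping rule used in this study is not specified here, so the sketch assumes a simple criterion: the marker whose removal costs the least R2 is dropped until every remaining marker costs at least a chosen amount (`min_loss`, a hypothetical parameter). The marker names, dosages, and trait values are toy data, not data from this study.

```python
def ols_r2(columns, y):
    """R^2 of an ordinary least-squares fit of y on an intercept plus the given columns.
    Assumes the columns are not perfectly collinear."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
                 for r in range(n))
    return 1.0 - ss_res / ss_tot


def backward_select(markers, y, min_loss=0.02):
    """Drop the marker whose removal costs the least R^2, one at a time,
    until every remaining marker costs at least min_loss (assumed criterion)."""
    kept = dict(markers)
    while len(kept) > 1:
        full = ols_r2(list(kept.values()), y)
        loss = {m: full - ols_r2([c for name, c in kept.items() if name != m], y)
                for m in kept}
        weakest = min(loss, key=loss.get)
        if loss[weakest] >= min_loss:
            break
        del kept[weakest]
    return sorted(kept), ols_r2(list(kept.values()), y)


# Toy tetraploid dosages (0-4 copies of the alternative allele) for 10 genotypes
markers = {
    "snp_a": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
    "snp_b": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
    "snp_c": [1, 0, 1, 0, 2, 0, 1, 1, 0, 2],  # unrelated to the trait
}
noise = [0.5, -0.5] * 5
y = [10 + 2 * a + 3 * b + e
     for a, b, e in zip(markers["snp_a"], markers["snp_b"], noise)]

selected, r2 = backward_select(markers, y)
print(selected, round(r2, 3))
```

In this toy setting the trait depends only on `snp_a` and `snp_b`, so the unrelated marker is the one removed. A p-value-based or information-criterion-based rule would follow the same loop with a different definition of the weakest marker.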

Fig. 5 Backward selection models of (A) total SGA content and (B) α-solanine/α-chaconine ratio. The observed phenotypes (x-axis) are plotted against the predicted phenotypes based on a set of five significant SNP markers for total SGA and six significant SNPs for α-solanine/α-chaconine ratio

QTL analysis for α-solanine/α-chaconine ratio

The ratio between α-solanine and α-chaconine was studied in the panel of 132 varieties only, because α-solanine and α-chaconine were not measured separately in the additional 143 genotypes. In Fig. 6, the Manhattan plots of a naive and a kinship-corrected GWAS are shown. In contrast to the GWAS with total SGA, kinship correction did not cause a strong reduction in the number of associated SNPs. This suggests that trait values for SGA ratio are negligibly confounded with population structure (Fig. 3B). A highly significant QTL for the ratio (SGR) named Sgr7.1 (Table 2) was identified on chromosome 7 (Fig. 6A, B). The most significantly associated SNP is PotVar0069919 (− log10(p) = 8.5) and explains up to 24.4% of the variation. The QTL covers a 36-Mb region, comprising approximately half of the physical chromosome (PGSC0003DMB000000684 to PGSC0003DMB000000251) and spanning the pericentromeric heterochromatin. PotVar0069919 is located at 1.5 Mb distance from the most obvious candidate gene SGT1 (PGSC0003DMG400011749). Three strongly associated SNPs, shown in Fig. 6A, B, suggest the presence of a QTL on chromosome 4. However, such isolated dots are suspicious, because they completely co-segregate with markers on chromosome 7, and indeed, the recent revision of scaffold DMB67 (Endelman and Jansky 2016) would relocate these markers to chr07:26,125,388…26,309,187, right within the QTL on chromosome 7. Many of the highly significant SNPs are present in the old varieties Jaune d’Or, Yam, and Myatt’s Ashleaf, suggesting that this haplotype is not a recent introgression from wild species. The validation of this QTL failed because no QTLs could be discovered for α-solanine/α-chaconine ratio on chromosome 7 in the three bi-parental mapping populations. However, the majority of the SNPs underlying the QTLs Sgr7.1 (Fig. 6A) and Sga7.1 (Fig. 2F) are the same.

Fig. 6 Manhattan plots for α-solanine/α-chaconine ratio. (A) Naive GWAS with 132 genotypes, (B) kinship-corrected GWAS with 132 genotypes, (C) QTL analysis in bi-parental A × C population, and (D) QTL analysis in bi-parental population A × K

The second QTL for α-solanine/α-chaconine ratio was detected on chromosome 8. This locus, named Sgr8.1, is most significantly detected with PotVar0063333 (− log10(p) = 7.1) and explains 19.7% of the variation (Table 2). The SNPs significantly associated with QTL Sgr8.1 cover a 2.1-Mb region comprising the candidate gene SGT2 (PGSC0003DMG400017508). Validation of this QTL region was possible using the A × C mapping population, where highly significant SNPs were observed in a much wider 7.5-Mb genomic region. However, the SNPs underlying this QTL differ completely between the two populations (Table S2). In the bi-parental population, a haplotype segregates, tagged by the peak marker PotVar0063060 (− log10(p) = 9.5; Fig. 6C), explaining 35.5% of the variance. Remarkably, PotVar0063060 is a rare SNP in the variety panel, with a population allele frequency of 0.4%, located within the open reading frame of SGT2. Such rare SNPs do not have the power to detect a QTL in an association panel. Graphical genotyping on the data of Uitdewilligen et al. (2013) revealed a unique rare haplotype of at least 5.8 Mb (chr08:47,377,386…53,138,672) present in the varieties Festien, Kartel, and Aveka (Table S5). These SNPs are identical by descent and located on an introgressed haplotype from S. vernei. Kartel is the male parent of variety Altus, which is the female parent of the A × C population, and therefore, the QTL could be discovered with SNPs specific to this rare haplotype. Table S5 lists additional SNPs within the open reading frame of SGT2 that were not implemented on the 20 K array. These include PotVar0063041, a synonymous C/T SNP at coordinate 49,813,826; PotVar0063070, a non-synonymous G/T SNP at coordinate 49,813,558; and PotVar0063092, a G/A SNP causing a premature stop codon at coordinate 49,813,385. In the other two bi-parental populations (SH × RH and A × K), we did not detect any significant QTLs for the ratio between α-solanine and α-chaconine.

In Fig. 5B, the phenotypic predictions from a backward selection model are shown. This model combines a set of non-redundant significant SNPs listed in Table S4. The R2 of this model is 60%, indicating that the QTLs Sgr7.1 and Sgr8.1 can explain a major part of the heritability within the variety panel.

Discussion

Genome-wide association studies have become a widely accepted method to explore the genetic structure of complex traits in plants. Extensive research in plants, predominantly in Arabidopsis thaliana, has highlighted some of its advantages and limitations, nicely reviewed by Korte and Farlow (2013). They state that Arabidopsis is almost an ideal organism to conduct GWAS, because the continued self-fertilization allows repeated phenotyping of genetically identical individuals. From that perspective, vegetatively propagated crops such as potato should be equally suitable for GWAS. As a textbook example for GWAS, we made use of publicly available phenotypic data from VCU testing, and multi-year multi-location data from breeding programs in addition to a designated trial. With genotypic data collected with the 20 K SolSTW array (Vos et al. 2015), we could explore the possibilities and limitations of genome-wide association studies in tetraploid potato.

Population structure

In general, population structure is a major obstacle in GWAS. Without a proper correction for the relationship between individuals, many spurious associations might be identified. In this study, we show that total SGA content is confounded with population structure (Fig. 3A), resulting in many spurious associations (Fig. 2A, C). In contrast, α-solanine/α-chaconine ratio is not confounded with population structure and the naive and kinship-corrected results are almost identical (Fig. 6). This difference is the obvious result of selection: selection against SGA is less important for varieties bred for the starch industry, whereas for α-solanine/α-chaconine ratio, selection is absent.

In natural populations, not only selection but also reproductive isolation due to geographic origin is involved in shaping structured populations (Kooke et al. 2016). In crop species, the various strategies of germplasm collection and controlled crosses will reduce reproductive isolation and create less structured populations. In potato, no evidence has been reported so far that geographic origin contributed to population structure. Clear evidence for population structure caused by breeding towards market segments has been documented in the literature (D’hoop et al. 2010; Hirsch et al. 2013; Vos et al. 2015). Potatoes for processing or starch industry and fresh consumption have different requirements for trait values, and therefore, traits are expected to be confounded with a selection-induced population structure. An approach to avoid the burden of correction for population structure was demonstrated by Zhao et al. (2011), as they performed GWAS both across and within subpopulations. Unfortunately, the subgroups in this study (“Agria” (n = 28) and “starch” (n = 41)) are too small to perform a genome-wide association study.

The three bi-parental populations differed in their capacity to detect QTL. Whereas four QTL were significant in the diploid population SH × RH (N = 157), only one QTL was significant in the tetraploid mapping populations A × C (N = 87) and A × K (N = 34). This may reflect the impact of population size on statistical power, but ploidy level may also cause a difference. When comparing a disomic 1:2:1 segregation ratio with a tetrasomic 1:8:18:8:1 ratio, the latter has only 1/36 of the offspring in each homozygous class showing extreme trait values, compared with 1/4 in the disomic case. Hence, a higher phenotypic variance can be expected in diploids, facilitating QTL discovery.
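The two segregation ratios can be reproduced with a short enumeration. The sketch below assumes a cross between two heterozygous parents (Aa × Aa for the diploid case, duplex AAaa × AAaa for the tetraploid case) and random sampling of chromosomes into gametes without double reduction; these assumptions and the function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations


def gamete_distribution(genotype):
    """Probability of each 'A' dosage in a gamete that receives a random
    half of the parental chromosomes (no double reduction assumed)."""
    half = len(genotype) // 2
    counts = Counter(sum(1 for allele in gamete if allele == "A")
                     for gamete in combinations(genotype, half))
    total = sum(counts.values())
    return {dosage: Fraction(n, total) for dosage, n in counts.items()}


def offspring_distribution(parent1, parent2):
    """Probability of each 'A' dosage in the offspring of parent1 x parent2."""
    g1 = gamete_distribution(parent1)
    g2 = gamete_distribution(parent2)
    result = defaultdict(Fraction)
    for d1, p1 in g1.items():
        for d2, p2 in g2.items():
            result[d1 + d2] += p1 * p2
    return dict(sorted(result.items()))


diploid = offspring_distribution("Aa", "Aa")
tetraploid = offspring_distribution("AAaa", "AAaa")

print([str(p) for p in diploid.values()])
print([str(p) for p in tetraploid.values()])
```

Multiplying the diploid probabilities by 4 and the tetraploid probabilities by 36 gives the 1:2:1 and 1:8:18:8:1 ratios, with each homozygous class at 1/4 and 1/36, respectively.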

Missing heritability and genetic heterogeneity

When we compare the broad sense heritabilities with the variance collectively explained by all SNP markers that were included in the backward selection models, an obvious difference is found for SGA and α-solanine/α-chaconine ratio. For SGA, we obtained H2 = 65–70% while R2 = 32% (Fig. 5), and for ratio, we obtained H2 = 53% while R2 = 60% (Fig. 5). Although comparison of H2 with R2 is not straightforward, it is clear that the proportion of missing heritability is larger for the structure-confounded trait of SGA. This suggests that the correction for population structure has dismissed true-positive markers for the structure-confounded trait SGA and that most likely the majority of QTLs involved in α-solanine/α-chaconine ratio have been identified.

Trait values for SGA were collected from field trials, historical breeder’s records, and VCU documents, but these different phenotypic records relate to different variety panels. GWAS analysis of the sub-panels with either field data or the VCU data, as well as the merger of these panels into a larger GWAS panel, allowed the reproducible identification of QTL positions for SGA and α-solanine/α-chaconine ratio, as these loci were consistent across sub-panels and the total variety panel.

So far, the discovery of the map positions of the QTLs may seem straightforward, but at these QTL loci, we lack information about the haplotype structure, or how many of the various alleles have a positive or negative influence on the trait values. This problem is illustrated by the striking observation that the 21 significantly associated SNPs detected with the small sub-panel (n = 132) did not overlap with the 21 significant SNPs detected with the total panel (n = 275). SNPs not detected in the smaller set become significant in the larger set, or SNPs significant in the smaller set are no longer significant in the larger dataset. The former may be indicative of some missing heritability due to SNPs with a low allele frequency; the latter indicates that similar levels of total SGA content can be explained by different combinations of alleles, i.e. genetic heterogeneity.

This study provides two examples of SNPs that were excluded from GWAS because their allele frequency was below the pre-set threshold of 1.25%. The allele frequency of PotVar0043608 in the subset of 132 genotypes is only 0.9%, but with an allele frequency of 1.6% in the supplementary panel and 1.3% in the total panel, highly significant associations with SGA contents allowed the identification of Sga1.1. The second example relates to the detection of Sgr8.1 with SNP PotVar0063060 in the A × C population, associated with the haplotype derived from the variety Kartel. This SNP has a population allele frequency below the threshold in the GWAS panels and was therefore excluded from analysis.
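The effect of the frequency threshold on a rare SNP can be made concrete with a small calculation. In a tetraploid panel, the allele frequency is the number of allele copies divided by four times the number of genotypes. The allele-copy counts below are hypothetical; they were chosen only because they reproduce the rounded frequencies quoted above for PotVar0043608. Whether the threshold is applied as "at least" or "greater than" is also an assumption.

```python
PLOIDY = 4
THRESHOLD = 0.0125  # pre-set allele-frequency threshold of 1.25%


def allele_frequency(allele_copies, n_genotypes, ploidy=PLOIDY):
    """Population allele frequency in a panel of polyploid genotypes."""
    return allele_copies / (ploidy * n_genotypes)


def passes_filter(freq, threshold=THRESHOLD):
    """Assumed rule: keep a SNP when its frequency is at least the threshold."""
    return freq >= threshold


# (hypothetical allele copies, number of genotypes) per panel
panels = {"subset": (5, 132), "supplementary": (9, 143)}
panels["total"] = (5 + 9, 132 + 143)

for name, (copies, n) in panels.items():
    freq = allele_frequency(copies, n)
    status = "kept" if passes_filter(freq) else "excluded"
    print(f"{name}: {freq:.1%} ({status})")
```

Under these assumed counts, the SNP falls below the threshold in the subset of 132 genotypes but passes it once the supplementary panel is added, which is the situation described for Sga1.1.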

The genetic heterogeneity in this particular situation can be explained as the result of the different varieties found in the 132-set and the complete panel. The supplementary panel with phenotypic data from VCU records represents more modern varieties. For example, two-thirds of the 132 varieties were released before 1991, whereas two-thirds of the supplementary panel (N = 143) consists of varieties released after 1995. Presumably, the incidence of alleles contributing to high SGA contents derived from heirlooms has declined over the years, but introgression breeding for disease resistance has passed new alleles to the contemporary gene pool.

Candidate genes underlying QTLs for SGA content and α-solanine/α-chaconine ratio

Much progress has been made in recent years to elucidate the genes involved in the regulation and biosynthesis of SGAs (Cárdenas et al. 2015, 2016; Itkin et al. 2013; Zhao et al. 2021). This biochemical information is valuable, but it does not provide information on the loci that cause genetic differences in the amounts and composition of SGA in potato varieties. The development of marker-assisted selection strategies requires the identification of relevant alleles at the QTLs involved in trait variation rather than the genes themselves.

The most significant SNP underlying Sga1.1 is PotVar0043608 (Fig. 2D). This SNP is located at 160 kb distance from GAME9 (Cárdenas et al. 2016), suggesting linkage disequilibrium between this SNP and a GAME9 allele that increases SGA contents. The QTL Sga1.1 was validated in the SH × RH population, and also co-localizes with the QTL identified by Sørensen et al. (2008). In our study, Sga1.1 is the QTL with the largest effect, which confirms the characterisation of GAME9 as a key regulator of the SGA pathway (Cárdenas et al. 2016). This study shows that GAME6/GAME11 co-localize with the QTL Sga7.1; however, the position of these genes might be confusing because the reference genome (PGSC 2011) reports a location 10 Mb differing from the location reported by Itkin et al. (2013). Furthermore, the QTLs Sgr7.1 and Sgr8.1 clearly match with the responsible genes SGT1 and SGT2. Earlier reports (Krits et al. 2007) already documented the correlation between SGT1 and SGT2 transcript ratio and the α-solanine to α-chaconine ratio in potato tubers. As discussed before, the validation of Sgr8.1 seems straightforward; nevertheless, the SNPs were not validated. In the A × C validation population, other SNPs were associated with α-solanine and α-chaconine ratio. This provides evidence that multiple alleles of SGT2 are involved in modulation of the α-solanine and α-chaconine ratio in potato tubers.

Three more QTLs, Sga3.1, Sga5.1, and Sga11.1, were identified at genomic positions where no obvious candidate genes have been identified. Although Sga3.1 was only identified by GWAS and Sga5.1 only in one mapping population (SH × RH), the QTL Sga11.1 was discovered by GWAS and validated in the SH × RH population, and co-localizes with QTLs identified in an earlier study (Manrique-Carpintero et al. 2013). On the other hand, the associated SolCAP SNPs identified by Manrique-Carpintero et al. (2013) could not be reproduced in our study.

Fig. 2B suggests that PotVar0076636 and PotVar0107030, distally located on the short arm of chromosome 4, and PotVar0034580 on chromosome 5 have been ignored. These QTL regions could not be validated in the complete panel of 275 genotypes, nor in any of the bi-parental populations. Additionally, these SNPs were close to the significance threshold and were therefore assumed to be false positives.

For many of the other GAME genes postulated in literature, there was no QTL identified, such as GAME4 and GAME12 on chromosome 12 (Cárdenas et al. 2015; Itkin et al. 2013), SSR2 on chromosome 2 (Sawai et al. 2014), and HMG1 (chromosome 2), HMG2 (chromosome 2), and SQE (chromosome 4) (Ginzberg et al. 2012; Manrique-Carpintero et al. 2013, 2014). This suggests that these genes do not display functional variation, and may be highly conserved.

Are introgression segments involved in the SGA biosynthesis?

As described in the introduction, many studies have been performed on SGA content using several wild relatives. In the paper of Vos et al. (2015), it was demonstrated that specific SNPs can be used to identify introgression segments from wild species, which were used as donors of disease resistance genes. In this study, PotVar0043608 is the most significant SNP associated with Sga1.1 on chromosome 1. Analysis of the year of market introduction of potato varieties polymorphic for PotVar0043608 (Vos et al. 2015) has indicated that polymorphism at this SNP locus was first observed in variety Lenape (Table S2). Lenape is a variety with Solanum chacoense in its pedigree, which was withdrawn from the national variety lists because of its high SGA content (Zitnak and Johnston 1970). This study provides evidence that this specific SNP is a Solanum chacoense-specific DNA variant. In descendants from Lenape, the SNP is indicative of an introgressed haploblock responsible for an elevated SGA content.

Another example is the most significant SNP underlying the QTL involved in SGA ratio in A × C. This SNP was first observed in the variety Kartel, and is derived from one of the at least five Solanum species (vrn (three different accessions from UK, NL and GER), edn, opl, dms, adg) used to develop ancestral progenitors in the pedigree of Kartel (Van Berloo et al. 2007). There are also indications for another S. vernei-derived haplotype underlying the QTL Sga7.1 (but not Sgr7.1), because several of the highly significant SNPs in Fig. 2A, C, and F originate not from Kartel but from VTN 62-33-3, which is a progenitor clone with Dutch and Scottish S. vernei-derived progenitors in its pedigree.

Breeding for disease resistance has contributed new haplotypes to the gene pool. Wild species derived alleles for genes involved in SGA located on these introgressed haplotypes may increase the risk of genetic complementation. In particular, the efforts to develop potato cyst nematode-resistant starch varieties resulted in elevated levels of SGA. This study confirms earlier reports (Hellenäs et al. 1995; Yencho et al. 1998) proposing that combinations of S. tuberosum and non-tuberosum alleles may be responsible for high SGA levels in potato tubers.

In this study, we illustrate the utility of GWAS in tetraploid potato to gain insight into the genetics of a complex trait. However, not all traits are equally suitable and some traits will suffer more than others from the disadvantages of GWAS. The combination of GWAS and bi-parental mapping populations seems essential to avoid incorrect interpretation of the data.