Introduction

Many sophisticated genomic tools have been established within the last decade, greatly accelerating the development of genomic resources for major crop species that are useful for plant breeding and genetic studies. Sequencing of whole genomes and transcriptomes has enabled the genome- and transcriptome-wide discovery of single nucleotide polymorphism (SNP) markers amenable to high-throughput genotyping platforms. Genotyping arrays are now used for many purposes, such as genetic diversity analysis, high-density genetic mapping, fine mapping of quantitative trait loci (QTL), and detection of marker-trait associations suitable for the application of marker-assisted selection (MAS) in breeding programs.

Cultivated sunflower (Helianthus annuus L.) is an oilseed crop of great economic importance with a worldwide production of approximately 44.7 million tons per annum (FAOSTAT 2013). Sunflower is considered a model species for other large-genome members of the Compositae family, especially with regard to evolutionary and ecological questions. The whole genome (~3.6 Gbp) of inbred line HA412HO is the target of an ongoing sequencing project (Gill et al. 2014; Grassa et al. 2015; Kane et al. 2011; Natali et al. 2013) and will represent the reference genome sequence for this species. Once assembled, it will greatly facilitate the discovery of SNP markers by re-sequencing of other sunflower lines.

The development of molecular markers is advanced in sunflower and, over the years, different marker types have been generated. Restriction Fragment Length Polymorphisms (RFLP) fingerprinting was the first molecular marker technique available in sunflower (Berry et al. 1994; Gentzbittel et al. 1992, 1994). Several detailed genetic linkage maps have been developed based on simple sequence repeat (SSR), sequence-tagged-site (STS) markers, and EST-derived SNP markers (Pérez Vich and Berry 2010) for both cultivated sunflower (Al-Chaarani et al. 2004; Berrios et al. 2000; Lai et al. 2005a; Tang et al. 2002; Yu et al. 2003), as well as some wild relatives (Barb et al. 2014; Burke et al. 2004; Heesacker et al. 2009; Lai et al. 2005b; Rieseberg et al. 2003). Based on transcriptome sequencing, a medium density SNP array (10 K) for sunflower was successfully developed (Bachlava et al. 2012). It was used to construct an integrated high-resolution genetic linkage map of H. annuus L. (Bowers et al. 2012a) and for association mapping (Mandel et al. 2013). Using a custom Affymetrix Expression GeneChip, an ultra-dense genetic map for sunflower was developed by placing 67,486 short features representing 22,481 unigenes (Bowers et al. 2012b). Sunflower SNP resources were extended by applying a restriction site-associated DNA sequencing (RAD-Seq) approach that finally resulted in an Illumina Infinium array with 8723 SNPs suitable for genotyping and genetic mapping of three populations (5019 mapped markers) (Pegadaraju et al. 2013; Talukder et al. 2014). To meet the demand for an integrated dense genetic map based on publically available SNP resources and genetic maps developed by two SNP marker consortia (Bowers et al. 2012a; Talukder et al. 2014) an in silico approach was used. 
Resequencing of a mapping population and alignment of the resulting contigs and of known marker flanking sequences to draft genome scaffolds allowed the genetic positions of more than 10,000 markers to be determined in a unified map (Hulke et al. 2015). To analyze and predict complex agronomic traits, genome-wide approaches such as genome-wide association studies (GWAS) or whole genome-based prediction (Meuwissen et al. 2001) have become popular genomic tools. However, compared to QTL mapping in bi- or multiparental crosses, a larger set of markers is required in GWAS or genomic prediction to ensure that linkage disequilibrium (LD) between markers and QTL is preserved (Goddard and Hayes 2007; Mammadov et al. 2012). The aim of the current study was to develop a genotyping array based on the Illumina® Infinium assay, with a high number of predominantly haplotype-specific SNP markers located mostly in or near genes, that can be used for a better understanding of the genetic regulation of complex agronomic traits in sunflower.

Materials and methods

Whole genome and amplicon sequencing

Four sunflower inbred lines representing the two main groups of the sunflower gene pool (two restorer lines, SUN48-0003 and SUN48-0006, and two maintainer lines, SUN48-0025 and SUN48-0026) were selected for whole genome sequencing (WGS). For high molecular weight DNA extraction, a protocol for cell nuclei isolation (Murray and Thompson 1980) was applied that minimizes mitochondrial and chloroplast DNA contamination. DNA quality was checked electrophoretically, and 5 µg was subjected to standard 350 bp library preparation according to the manufacturer’s protocols (Illumina, San Diego, USA).

For amplicon sequencing, 48 maintainer and restorer lines representative of current elite breeding material were selected. Based on 5955 EST sequences of the Helianthus annuus UniGene EST Set (http://www.ncbi.nlm.nih.gov/genbank/dbest), primers were designed using Primer 3.06 software (Untergasser et al. 2012). The primer pairs were tested for amplification in eight genotypes using the following PCR conditions: 5 min at 94 °C; 40 cycles of 1 min at 94 °C, 1 min at 60 °C, and 2 min at 72 °C; and a final extension step of 10 min at 72 °C. The reaction volume was set to 25 µl containing 20 ng DNA, 1 × GoTaq® buffer, 1 unit GoTaq® polymerase (Promega, Madison, USA), 0.5 µl of each primer (10 µmol/l), and 1.5 µl dNTPs (25 µmol/l). More than 56 % (3356) of the designed primer pairs showed distinct bands on agarose gel after PCR. Amplification was carried out using these 3356 primer pairs (Table S1) for a further 40 genotypes. Amplicons of each inbred line were pooled, and 10 µg of each pool was used to prepare 200 bp insert libraries as recommended by Illumina (Illumina Inc., San Diego, CA, USA). Pool-specific bar codes were added by ligation of an adapter containing a six-base index sequence, allowing reads to be filtered after sequencing. WGS and amplicon sequencing were performed on an Illumina HiSeq 2000 instrument using the 2 × 100 bp paired-end sequencing strategy (Aros Applied Biotechnology A/S, Aarhus, Denmark).

SNP detection and final choice

A multi-step selection procedure (Fig. 1) was followed to obtain high-confidence bi-allelic SNP markers with stable cluster performance. For de novo assembly, we employed the CLC Assembly Cell de novo assembler (version 3.2.2, CLC bio, Aarhus, Denmark). First, raw sequence reads were quality-trimmed with the program “quality_trim”. In order to create a reference sequence against which SNP calling could be carried out, trimmed reads from all WGS and amplicon sequencing runs were assembled using the parameter settings “-p fb ss 250 450”. Contigs generated from the genomic sequences were used for blast analyses to detect genomic targets from the UniGene set (Build # 11, Helianthus annuus, NCBI). UniGene (http://www.ncbi.nlm.nih.gov/UniGene/) is a largely automated analytical system that produces an organized view of a species-specific transcriptome. Subsequently, all reads were mapped to the generated de novo reference sequence (identified contigs representing unigenes), whereby reads that matched more than once were ignored. CLC Genomics Workbench 5.01 was used for further read mapping, with the following CLC parameters applied for quality-based variant detection: maximum expected variations (ploidy) = 2; maximum gap and mismatch count = 5; minimum average quality = 15; minimum central quality = 20; minimum coverage = 20; minimum variant frequency (%) = 20.0; window length = 11.
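The coverage and frequency thresholds listed above can be illustrated with a minimal Python sketch. It is not the CLC Genomics Workbench implementation: the `AlleleCounts` record and the `passes_variant_filter` function are hypothetical names introduced here, and only the minimum coverage (20) and minimum variant frequency (20 %) criteria are modelled. The quality and window parameters are left out.

```python
from dataclasses import dataclass

MIN_COVERAGE = 20         # minimum coverage, as stated in the text
MIN_VARIANT_FREQ = 20.0   # minimum variant frequency in percent, as stated in the text


@dataclass
class AlleleCounts:
    """Read counts per base at one reference position (hypothetical record)."""
    reference_base: str
    counts: dict  # e.g. {"A": 30, "G": 10}


def passes_variant_filter(site: AlleleCounts) -> bool:
    """Return True if any non-reference base meets both thresholds."""
    coverage = sum(site.counts.values())
    if coverage < MIN_COVERAGE:
        return False
    for base, n_reads in site.counts.items():
        if base == site.reference_base:
            continue
        if 100.0 * n_reads / coverage >= MIN_VARIANT_FREQ:
            return True
    return False
```

For example, a site covered by 30 reference reads and 10 variant reads (25 % variant frequency at 40-fold coverage) would pass, whereas the same proportion at 15-fold coverage would not.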

Fig. 1

Flow diagram describing steps and major criteria of the SNP selection process during the development of the sunflower 25 K genotyping array

SNPs were called from genomic sequences if sequence at that position was present in ≥3 sunflower lines. SNPs derived from amplicon sequences were included if sequences at that position were available from at least 23 lines. SNPs with additional polymorphisms within 50 bp on either side were eliminated. Further, SNPs with more than two alleles were discarded. SNP markers that represented the same haplotype over the entire contig were reduced to one entry.
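These filtering rules can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline used in the study: the `CandidateSnp` record, the `"wgs"`/`"amplicon"` labels, and the function names are hypothetical. It also assumes that the 50 bp flank check considers every called polymorphism on the same contig, and it does not model the reduction of markers tagging the same haplotype.

```python
from dataclasses import dataclass

FLANK_BP = 50  # flanking window on each side of the SNP, from the text


@dataclass(frozen=True)
class CandidateSnp:
    contig: str
    position: int
    alleles: tuple        # observed alleles, e.g. ("A", "G")
    lines_with_data: int  # number of lines with sequence at this position
    source: str           # "wgs" or "amplicon" (hypothetical labels)


def min_lines(source: str) -> int:
    """Minimum number of lines with sequence data, by SNP source."""
    return 3 if source == "wgs" else 23


def filter_candidates(snps):
    """Keep bi-allelic SNPs with enough lines and no polymorphism within 50 bp."""
    kept = []
    for snp in snps:
        if len(snp.alleles) != 2:
            continue  # more than two alleles: discarded
        if snp.lines_with_data < min_lines(snp.source):
            continue  # too few lines with sequence at this position
        has_neighbour = any(
            other is not snp
            and other.contig == snp.contig
            and abs(other.position - snp.position) <= FLANK_BP
            for other in snps
        )
        if not has_neighbour:
            kept.append(snp)
    return kept
```

The neighbour check is quadratic in the number of candidates, which is acceptable for a sketch. Sorting by contig and position would be the natural choice for several hundred thousand SNPs.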

Additional sequences of pre-validated SNPs available from the sunflower 10,640 SNP genotyping array (Bachlava et al. 2012) were included in the selection process. SNP selection was continued with the removal of duplicated SNPs. All remaining SNPs were submitted to the Illumina Assay Design Tool (Illumina, San Diego, CA), and only SNPs that matched the Illumina® Infinium assay quality requirements (final score ≥ 0.4) were finally used for the array design.

Plant material and genotyping

The resulting sunflower Infinium iSelect HD BeadChip (Illumina®, San Diego, USA) was used to genotype a diversity panel of lines, hybrids, and mapping populations, altogether 1090 genotypes. Among them was a sunflower collection of 287 accessions (Table S2) representing 243 inbred lines, 19 open-pollinated varieties (OPVs), 5 landraces, and 20 lines with recent introgressions from wild Helianthus relatives that are referred to as introgression lines. This set of accessions captures nearly 90 % of the allelic diversity present within the gene pool of cultivated sunflower (Mandel et al. 2011) and originates from collections of the USDA North Central Regional Plant Introduction Station (NCRPIS) and the French National Institute for Agricultural Research (INRA). Pedigree information was available for about half of the accessions (USDA-ARS 2014; USDA 2006). Information on the designation into the categories maintainer, restorer, nonoil, and oil was available for almost all USDA inbred lines (USDA 2006). INRA-derived accessions could not be assigned to the nonoil or oil class. However, they could be distinguished in terms of breeding history into maintainer and restorer class (INRA 2014; Mandel et al. 2011, 2013).

Further, a subset of the population NDBLOSsel × CM625 consisting of 159 recombinant inbred lines (RILs) which was previously used for QTL mapping of resistance to Sclerotinia midstalk rot (Micic et al. 2005) was investigated. Prior to DNA isolation and genotyping, the ninth generation of each RIL was generated by selfing the previous generation (F8) through single seed descent.

A total of 22,299 SNPs were analyzed with respect to their genotype clustering using GenomeStudio software (v2011.1, Illumina, San Diego, USA). In order to create three high-quality clusters representing the three possible genotypes at each locus, SNP marker quality was assessed by visual inspection of the cluster distribution and by subsequent adjustment of the cluster calling for each marker, as exemplified in Figure S1. SNP markers for which two or more polymorphic loci were scored simultaneously (i.e. SNPs that created more than three clusters) were excluded.

Use of SNP array for analysis of population structure

To assess the utility of the 25 K SNP array in detecting population structure within the set of inbred lines, principal coordinate analysis (PCoA; Gower 1966) and population substructure analysis using ADMIXTURE (Alexander et al. 2009) were performed using genotypic data of 243 inbred lines. SNP markers with ≥5 % missing data were excluded. Remaining missing data were imputed using Beagle (Browning and Browning 2009) via the R package “synbreed” (Wimmer et al. 2012) using R version 3.0.1 (http://www.R-project.org/). PCoA was calculated based on Rogers' distances using R with the packages “synbreed” (Wimmer et al. 2012), “adegenet” (Jombart 2008), and “ape” (Paradis et al. 2004). File conversion was done via PLINK version 1.07 (Purcell et al. 2007), and population substructure was analyzed using ADMIXTURE version 1.23 (Alexander et al. 2009) running with default settings for K = 1 to K = 20. Nucleotide diversity was calculated per SNP according to Tajima (1983) and the differentiation index \(F_{ST}\) according to Weir and Cockerham (1984). In the case of landraces and OPVs, SNPs were tested for Hardy–Weinberg equilibrium (HWE) using an exact test following Wigginton et al. (2005) with p ≤ 0.001. \(F_{ST}\) calculation and HWE tests were performed with the software PLINK (version 1.09) with default parameter settings (Chang et al. 2015).
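As an illustration of the distance measure underlying the PCoA, the following Python sketch computes Rogers' distance between individuals from bi-allelic SNP genotypes. It is a simplified stand-in and not the “synbreed” implementation: the function names are hypothetical, and genotypes are assumed to be coded as 0, 1, or 2 copies of one allele. For a bi-allelic locus, the per-locus term of Rogers' distance reduces to the absolute difference in allele frequency, so the distance is the mean of these differences over loci.

```python
def rogers_distance(genotypes_a, genotypes_b):
    """Rogers' distance between two individuals from bi-allelic SNP genotypes.

    Genotypes are coded as 0, 1 or 2 copies of one allele; None marks missing
    data, and loci missing in either individual are skipped.
    """
    total = 0.0
    n_loci = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        p, q = a / 2.0, b / 2.0
        # For two alleles, sqrt(0.5 * ((p - q)^2 + ((1 - p) - (1 - q))^2))
        # simplifies to |p - q|.
        total += abs(p - q)
        n_loci += 1
    if n_loci == 0:
        raise ValueError("no shared loci with genotype data")
    return total / n_loci


def distance_matrix(genotype_rows):
    """Symmetric matrix of pairwise Rogers' distances (list of lists)."""
    n = len(genotype_rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = rogers_distance(genotype_rows[i], genotype_rows[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

The handling of `None` is a convenience of this sketch. In the analysis described above, missing genotypes were imputed before distances were calculated.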

Genetic mapping

A genetic map of the 159 RILs derived from the cross NDBLOSsel × CM625 was constructed by using JoinMap 4.0 (van Ooijen 2006) with default parameter settings. Graphical genotypes representing the calculated linkage groups (LGs) were visually inspected, and doubtful genotyping results such as low quality data and suspicious double cross-overs were eliminated from the dataset. MapManager QTXb20 version 0.3 (Manly et al. 2001) was used to recalculate the map positions. Distances between SNP markers were estimated using the Kosambi mapping function (Kosambi 1944). Linkage group assignment according to Tang et al. (2002) was based on the overlap of SNPs mapped by Bowers et al. (2012a) and our marker set.
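For reference, the Kosambi mapping function converts a recombination fraction r into a map distance of d = 25 ln((1 + 2r)/(1 − 2r)) cM. The short Python sketch below implements this conversion and its inverse. The function names are chosen here for illustration and do not correspond to JoinMap or MapManager routines.

```python
import math


def kosambi_cm(r: float) -> float:
    """Map distance in centimorgans for recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_recombination(cm: float) -> float:
    """Inverse function: recombination fraction for a map distance in centimorgans."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)
```

At small recombination fractions the map distance is close to 100 r, and it grows faster than r as r approaches 0.5, reflecting the assumed interference between cross-overs.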

Use of SNP array for analysis of quantitative traits

The suitability of our SNP genotyping platform was investigated for QTL mapping and genome-based prediction of resistance to sunflower midstalk rot caused by the pathogen Sclerotinia sclerotiorum. From the study of Micic et al. (2005), we extracted phenotypic data for three resistance traits, stem lesion length (SLL), speed of fungal growth (SFG), and leaf lesion length (LLL), as well as for the morphological trait leaf length with petiole (LLP), for 113 RILs derived from the cross NDBLOSsel × CM625. Based on the genotyping data and genetic linkage map described above and on adjusted entry means from field trials across two locations, QTL mapping was performed for each trait using composite interval mapping (CIM) implemented in the software package PLABQTL 1.2 (Utz and Melchinger 2006). A conservative LOD threshold corresponding to an experiment-wise type I error rate of α = 0.05 was chosen. This threshold was determined using 1000 permutations as described by Churchill and Doerge (1994). The support interval of a putative QTL was defined as the chromosomal region surrounding a QTL peak with a LOD fall-off of 1.0. The additive effect as well as the proportion of phenotypic variance explained (\(R^2\)) by each QTL was obtained from a multiple regression model fitting all significant QTL simultaneously.
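The principle of the permutation-based threshold can be sketched as follows. Note the simplification: instead of composite interval mapping as implemented in PLABQTL, this sketch uses a single-marker regression scan, for which the LOD score is −(n/2) log10(1 − r²). All function names and the data layout (one list of genotype codes per marker) are assumptions made for illustration.

```python
import math
import random


def single_marker_lod(genotypes, phenotypes):
    """LOD score from a simple linear regression of phenotype on one marker."""
    n = len(phenotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # monomorphic marker or constant phenotype
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -(n / 2.0) * math.log10(1.0 - r2)


def permutation_threshold(marker_matrix, phenotypes, n_perm=1000, alpha=0.05, seed=1):
    """Experiment-wise LOD threshold from permuted genome-wide maximum LOD scores."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        # Shuffling phenotypes breaks any marker-trait association
        # while preserving the correlation structure among markers.
        rng.shuffle(shuffled)
        maxima.append(max(single_marker_lod(m, shuffled) for m in marker_matrix))
    maxima.sort()
    n_above = round(alpha * n_perm)  # number of permuted maxima allowed above the threshold
    return maxima[max(n_perm - n_above - 1, 0)]
```

Because the maximum LOD across all markers is recorded in each permutation, the resulting threshold controls the type I error rate for the whole genome scan rather than for a single test.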

For genomic prediction of Sclerotinia resistance traits, a genome-based best linear unbiased prediction (GBLUP) model was used: \(\mathbf{y} = \mathbf{1}_{n}\mu + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon},\) where \(\mathbf{y}\) is the n-dimensional vector of adjusted means across the two locations for the n = 113 RILs, \(\mathbf{1}_{n}\) is an n-dimensional vector of ones, \(\mu\) is an overall mean, and \(\mathbf{Z}\) is an n × n matrix assigning genotypes to phenotypes. The n-dimensional vector \(\mathbf{u}\) of genotypic effects is assumed to be normally distributed with \(\mathbf{u} \sim \text{N}\left(\mathbf{0}, \mathbf{U}\sigma_{g}^{2}\right),\) where \(\mathbf{U}\) is a marker-derived relationship matrix calculated according to Habier et al. (2007) and \(\sigma_{g}^{2}\) is the genotypic variance. The n-dimensional vector of residuals is assumed to be normally distributed with \(\boldsymbol{\varepsilon} \sim \text{N}\left(\mathbf{0}, \mathbf{I}\sigma_{\varepsilon}^{2}\right),\) where \(\mathbf{I}\) is an n × n identity matrix and \(\sigma_{\varepsilon}^{2}\) is the residual variance. Genotypic and residual variances were estimated by restricted maximum likelihood using ASREML (Gilmour et al. 2009). To assess the prediction performance of the model, ten times replicated fivefold cross-validation with random sampling of estimation and test sets was performed as described in Albrecht et al. (2011). Predictive ability of the model was estimated as Pearson’s correlation between predicted and observed phenotypes of lines in the test set. Further, prediction accuracy, i.e. the correlation between predicted and true genotypic values, was approximated by the mean predictive ability divided by the square root of the trait heritability (Dekkers 2007). Analyses were performed using the “synbreed” R package (Wimmer et al. 2012).
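The cross-validation scheme and the accuracy approximation can be sketched in Python as shown below. The sketch covers only the sampling of test sets, the Pearson correlation, and the division by the square root of the heritability. Fitting the GBLUP model itself is not shown, and the function names and the seed are assumptions introduced for illustration rather than part of the “synbreed” package.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cv_folds(n_lines, n_folds=5, n_reps=10, seed=1):
    """Yield (replicate, fold, test_indices) for replicated k-fold cross-validation."""
    rng = random.Random(seed)
    for rep in range(n_reps):
        order = list(range(n_lines))
        rng.shuffle(order)  # new random assignment of lines in every replicate
        for fold in range(n_folds):
            yield rep, fold, sorted(order[fold::n_folds])


def prediction_accuracy(predictive_ability, heritability):
    """Approximate accuracy as predictive ability divided by sqrt(h^2)."""
    return predictive_ability / math.sqrt(heritability)
```

With 113 lines, ten replicates, and five folds, this yields 50 test sets of 22 or 23 lines each. As a worked example, a mean predictive ability of 0.74 at a heritability of 0.79 gives an approximate accuracy of 0.74 / √0.79 ≈ 0.83.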

Results

SNP array development

Sequencing of the four sunflower lines on Illumina HiSeq 2000 resulted in a total of 268 Gb of DNA sequence generated from 100 bp paired-end reads. On average, a >20-fold coverage was reached for each line. The genome was de novo assembled into 142,137 contigs with an average length of 407 bp. Blast analyses were carried out to compare the de novo assembled contigs to the UniGene set, and matching sequences were used as reference contigs. By applying the selection criteria of step 1 (Fig. 1) to the reference contigs, 616,781 SNPs were called from read mapping; of these, 532,613 SNPs were derived from WGS and 84,168 from amplicon sequences. In total, 25,742 SNP markers passed all filtering steps and were usable for Illumina® Infinium array design: 11,042 from amplicon sequencing and 14,700 from the de novo assemblies. The inclusion of an additional 10,640 pre-validated publicly available SNPs (Bachlava et al. 2012) resulted in 36,382 markers for further processing. After removal of redundant markers, the SNP pool was reduced to a final size of 25,944 bi-allelic candidate SNPs with high design scores. In total, 3645 SNP markers (14 %) failed to meet bead representation and decoding quality metrics during the Illumina manufacturing process. For genotyping of 1090 sunflower samples and subsequent cluster file construction, 22,299 functional SNPs (Table S4) were used.
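The SNP counts reported for the successive selection steps can be cross-checked with simple arithmetic. The Python snippet below uses only numbers given in the text; the variable names are chosen here for readability.

```python
# SNP counts reported in the text at each stage of array development.
wgs_called = 532_613
amplicon_called = 84_168
total_called = wgs_called + amplicon_called                # SNPs called in step 1

passed_amplicon = 11_042
passed_wgs = 14_700
passed_filters = passed_amplicon + passed_wgs              # usable for array design

prevalidated = 10_640
pool_before_dedup = passed_filters + prevalidated          # before removing redundant markers

candidates_on_design = 25_944
failed_manufacturing = 3_645
functional = candidates_on_design - failed_manufacturing   # SNPs used for genotyping

failure_rate_pct = 100.0 * failed_manufacturing / candidates_on_design
```

The sums reproduce the reported totals of 616,781, 25,742, 36,382, and 22,299 SNPs, and the manufacturing failure rate is about 14 %.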

Finally, a set of 20,502 high-quality, bi-allelic SNPs was obtained that included 6393 publicly available markers (Bachlava et al. 2012) and corresponded to an average density of one SNP per ~176 kb of the genome. In the final marker set, 18,990 (92.6 %) SNPs were polymorphic and 15,535 (75.8 %) SNPs had a minor allele frequency (MAF) ≥ 10 % when tested on 243 inbred lines, 5 landraces, 19 OPVs, and 20 introgression lines (Table S5). The lowest proportion of monomorphic SNPs was observed for the inbred lines (7.7 %). In addition, 3.4 % of the markers detected rare alleles (MAF < 1 %) within the inbred lines of this sunflower diversity collection. The proportion of heterozygous calls per SNP, denoted as observed heterozygosity, ranged from 0.07 to 0.59, with inbred lines displaying the lowest and introgression lines the highest values (Table 1). Average nucleotide diversity per SNP was lowest for inbred lines (0.35) and in a comparable range (0.38 to 0.39) for landraces, OPVs, and introgression lines. Only 50 SNPs (0.24 %) failed the test for Hardy–Weinberg equilibrium in landraces and OPVs.
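The per-SNP statistics reported here can be computed from genotype calls as in the following Python sketch. It assumes genotypes coded as 0, 1, or 2 copies of one allele, with `None` for missing calls, and uses the sample-size-corrected form n/(n − 1) · 2p(1 − p) for the nucleotide diversity of a bi-allelic site, where n is the number of sampled alleles. Whether the study counted alleles in exactly this way is an assumption, and the function names are hypothetical.

```python
def allele_counts(genotypes):
    """Count both alleles from genotypes coded 0/1/2 (copies of allele B)."""
    called = [g for g in genotypes if g is not None]
    n_b = sum(called)
    n_a = 2 * len(called) - n_b
    return n_a, n_b


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among all called alleles."""
    n_a, n_b = allele_counts(genotypes)
    total = n_a + n_b
    if total == 0:
        raise ValueError("no genotype calls")
    return min(n_a, n_b) / total


def nucleotide_diversity(genotypes):
    """Per-SNP diversity with sample-size correction: n/(n-1) * 2p(1-p)."""
    n_a, n_b = allele_counts(genotypes)
    n = n_a + n_b  # number of sampled alleles
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    p = n_a / n
    return n / (n - 1) * 2.0 * p * (1.0 - p)


def observed_heterozygosity(genotypes):
    """Proportion of heterozygous calls among all called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no genotype calls")
    return sum(1 for g in called if g == 1) / len(called)
```

A SNP would then be classed as polymorphic if its MAF is greater than zero in the panel under consideration, and as carrying a rare allele if its MAF lies below 1 %.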

Table 1 Overview of diversity parameters

Analysis of population structure and substructure

A high frequency of polymorphic variants in a representative set of genotypes is crucial for demonstrating the applicability of a genotyping array. In the present diversity collection of 287 accessions, the affiliation to one of the four categories nonoil restorer (n = 19), oil restorer (n = 68), nonoil maintainer (n = 41), and oil maintainer (n = 56) was known for a subset of 184 genotypes. Within this subset of 184 categorized lines, 91.8 % of the SNP variants on the array were polymorphic and only 1677 SNPs were monomorphic, indicating a very low false discovery rate during SNP identification. When the genotypic data were assigned to the subsets separated according to agronomic use and breeding history, 95.4 % of the SNPs were polymorphic for oil maintainers, 95.1 % for oil restorers, and 92.1 % for nonoil maintainers (Figure S2). Probably due to the small sample size and the higher degree of relationship indicated by the available pedigree information (USDA-ARS 2014; USDA 2006), only 78.3 % of the SNPs were polymorphic within the nonoil restorers.

The determination of population substructure is a key aspect of quantitative genetic or population genetic analyses since population stratification or admixture may affect detection of marker-trait associations, genomic prediction accuracy, or estimation of population genetic parameters. Principal coordinate analysis (PCoA) and subpopulation structure analysis were performed to investigate the potential of the array to resolve population substructure in 243 sunflower inbred lines. When applying PCoA to the dataset, the first axis separated nonoil and oil lines. It explained 5.5 % of the observed variation within the set of inbred lines (Fig. 2), in accordance with a moderate level of differentiation between these two groups (\(F_{ST}\) = 0.116). The second axis further subdivided restorer and maintainer lines, explaining 4.9 % of the total variation (average \(F_{ST}\) = 0.056). A high number of subgroups was observed for the 243 inbred lines based on the cross-validation errors calculated by ADMIXTURE. Errors were similar for the number of groups K = 11 to 14 (0.696–0.702) with a minimum for K = 13. The population structure of the set of inbred lines is shown in Fig. 3 for K = 13. Our analysis separated the 243 lines into four subgroups belonging to the group of restorer lines and eight subgroups belonging to the group of maintainer lines. A further subgroup was composed of restorer and maintainer lines. Within the restorer group, three subgroups were found to represent oil restorers. One subgroup contained nonoil restorers. Taking the known pedigrees into account, the largest restorer subgroup could be clearly assigned to progeny derived from line RHA274. In the group of maintainer lines, the majority of lines clustered into three subgroups that contained nonoil as well as oil maintainers. Clearly separated were two oil maintainer subgroups which comprised the offspring of HA300 (Peredovik 301) and HA89, respectively. The majority of inbred lines were strongly admixed, highlighting the diversity within the panel. Only genotypes representing nonoil maintainers clustered into subgroups of very closely related lines.

Fig. 2

Genetic differentiation of 243 sunflower inbred lines. Association of lines as revealed by principal coordinate analysis based on Rogers’ distances is presented. Nonoil maintainer, oil maintainer, nonoil restorer, oil restorer, and inbred lines of unknown origin are designated by different colors (color figure online)

Fig. 3

Population substructure among 243 inbred lines. Identified subgroups are shown as revealed by ADMIXTURE for K  = 13. Individuals are plotted on the x-axis and sorted in descending order according to their subgroup assignment given at the bottom. Ancestry was plotted on the y-axis

Analysis of NDBLOSsel × CM625 segregating for resistance to Sclerotinia midstalk rot

A genetic map of the RIL population NDBLOSsel × CM625 was developed based on 6355 high-quality polymorphic markers (Table S3). The distribution of the SNPs (Fig. 4) was generally found to be even across the 17 LGs, with the exception of the upper half of LG3 and the bottom half of LG10, where only a few markers could be placed. This was probably because the genomes of the two parental lines are very similar in the respective regions and thus displayed a lower level of polymorphism. The frequency distributions of adjusted means for the four traits under study are shown in Figure S4. Resistance traits were significantly correlated (0.40–0.65) with each other (Figure S4). There was no notable correlation between the morphological trait leaf length with petiole and the three resistance traits. In the QTL analysis, two, one, and three QTL were identified for stem lesion length, speed of fungal growth, and leaf length with petiole, respectively. Details on putative QTL, including positions in the genome, information on flanking markers, and QTL effects, are given in Table 2. LOD scores along the genome for all four traits are shown in Fig. 5. The detected QTL explained 8.1–35.2 % of the phenotypic variance and exhibited small support intervals ≤4 cM. The largest proportion of phenotypic variance was explained by a QTL on LG8 affecting the resistance traits stem lesion length and speed of fungal growth. For all resistance traits, the NDBLOSsel allele increased the Sclerotinia resistance. The allele increasing leaf length with petiole at the QTL on LG17 originated from NDBLOSsel, while it was contributed by CM625 at the QTL on LG5 and LG15. Predictive abilities from the GBLUP model are represented by boxplots for each of the four traits in Fig. 6. High predictive ability was observed for stem lesion length, with an average of 0.74 (\(h^2\) = 0.79; accuracy = 0.83). For the other traits, predictive abilities were at a medium level, with mean predictive ability ranging from 0.31 (\(h^2\) = 0.51; accuracy = 0.43) for leaf lesion length and 0.41 (\(h^2\) = 0.57; accuracy = 0.54) for speed of fungal growth to 0.61 (\(h^2\) = 0.63; accuracy = 0.77) for the morphological trait. Predictive abilities for leaf lesion length were highly variable, with a standard deviation of 0.18.

Fig. 4

Genetic map of the RIL population NDBLOSsel × CM625 constructed based on 6355 high-quality polymorphic SNP markers. Below each linkage group (LG) the number of markers is presented

Table 2 Characteristics of detected QTL

Fig. 5

LOD score profile from QTL mapping along the 17 linkage groups for the traits stem lesion length (SLL), speed of fungal growth (SFG), leaf lesion length (LLL), and leaf length with petiole (LLP). The dashed line represents the LOD threshold obtained by a permutation test according to Churchill and Doerge (1994) corresponding to a type I error rate of 5 %

Fig. 6

Genome-based prediction of phenotypic traits. Boxplots showing the distribution of predictive abilities from ten times replicated fivefold cross-validation within the RIL population NDBLOSsel × CM625 obtained with GBLUP for stem lesion length, speed of fungal growth, leaf lesion length, and leaf length with petiole

Discussion

This study aimed to discover a large set of SNPs within Helianthus annuus L. in order to generate an Illumina® Infinium iSelect HD BeadChip. For this, de novo assembled contigs were filtered for sequences that mapped to the H. annuus-specific UniGene set prior to variant calling. In order to achieve an unbiased SNP set, two different approaches were used. In the first approach, the focus was on identifying as many SNPs as possible through genome sequencing of four sunflower lines. The drawback of this procedure was that only a few lines could be sequenced at high coverage at reasonable cost, so the identified SNPs could not be fully representative of the entire cultivated gene pool. In the amplicon approach, we analyzed 48 lines so that the gene pool was more widely represented in terms of the observed allelic variation, especially since only haplotype-specific markers (one marker per observed haplotype in each amplicon) were selected for the array. The drawback here was that only 3356 genes could be analyzed in this way. In order to allow the data to be related to the draft sunflower genome sequence (Kane et al. 2011), SNP markers recently validated and genetically mapped by other groups were included (Bachlava et al. 2012; Bowers et al. 2012a). Due to the selection procedure, most of the SNPs were located within genes (exons and introns) or near them (5′- and 3′-flanking sequences). Validation based on genotyping of inbred lines, OPVs, introgression lines, landraces, and RIL and F2 populations resulted in 20,502 high-quality bi-allelic SNPs, each detecting a single locus in the sunflower genome. This number corresponded to 91.9 % of the SNPs assayed and was comparable to scoring rates reported for other plant species during Illumina® Infinium assay design (Bianco et al. 2014; Dalton-Morgan et al. 2014; Song et al. 2013). The final set included 14,109 new SNP markers and 6393 publicly available high-quality SNP markers (Bachlava et al. 2012). The developed 25 K array is thus the largest genotyping array currently available for routine sunflower genotyping, since the array described by Bowers et al. (2012b) is, as described in their publication, too error-prone and expensive for routine SNP genotyping. The high overall polymorphism rate of 92.6 % reflected the quality of the SNP filtering procedure and is in line with results obtained by other studies regarding genotyping array validation in animals and plants (Chen et al. 2014; Ramos et al. 2009; Tosser-Klopp et al. 2014; Unterseer et al. 2014). It further confirmed the utility of the array for applications in a wide range of sunflower germplasm.

Central applications of a genotyping array are the characterization of genetic variation and subpopulation structure in germplasm collections. Here, we found a high level of nucleotide diversity in accordance with a previous report by Mandel et al. (2013) and in line with the history of sunflower breeding. Elite sunflower inbred lines have been developed after passing through at least three major bottlenecks: breeding for oilseed traits, self-compatibility and self-pollination, and for hybrid seed production traits (Hongtrakul 1997). This breeding strategy could have resulted in a considerable decrease of diversity, but migration of OPVs and other exotic germplasm as well as selection of inbred lines from inter-pool crosses counteracted a strong reduction. Furthermore, in the recent past several wild species have become increasingly important as sources of disease resistance, drought tolerance, and other agronomically important traits (Jan and Chandler 1988; Miller 1987; Seiler 1992, 2010). The development of interspecific hybrids is often accompanied by a transfer of large segments of wild species genome into the respective breeding line (Barb et al. 2014; Dußle et al. 2004; Qi et al. 2012), leading to broadening of genetic diversity but also to undesired linkage drag. Due to its high resolution, the new genotyping array offers the possibility to improve the targeted introgression and reduction of donor segments. In addition, it enables a better representation of the heterozygosity in a set of lines. We observed a threefold higher level of heterozygosity for inbred lines, introgression lines, as well as landraces and OPVs compared to Mandel et al. (2013) which can be explained by the increased number of SNPs with a MAF below 10 % in our study.

In the last decades of the twentieth century, breeding of sunflower was focused on generating inbred lines and heterotic pools to maximize heterosis and improve traits essential for hybrid breeding (Miller 1987). In the long term, this strategy should result in distinct germplasm groups which exhibit maintenance of cytoplasmic male sterility (cms) in the seed parent pool and fertility restoration in the pollen parent pool. However, when classifying the genetic material into maintainer and restorer lines we observed only a moderate level of differentiation between the two primary sunflower breeding pools in accordance with Mandel et al. (2013). The low level of molecular variation explained by the first two principal coordinates in our study reflected a rather complex genetic composition of the investigated germplasm. Here, the differentiation between nonoil and oil lines was stronger compared to the separation between restorer and maintainer lines. A differentiation between nonoil and oil lines has been observed for restorer lines previously (Mandel et al. 2013), but was indicated by the second coordinate. However, between the two studies only 165 accessions overlapped, corresponding to 61 % (Mandel et al. 2013) and 68 % (this study) of the investigated lines. Moreover, the previous study was based on 5.5 K SNPs, compared to our set of 18.9 K polymorphic markers. These findings underline the demand for high-density genotypic data to resolve the population structure within the present sunflower collection. Indeed, the fine scale resolution of the array enabled us to uncover the presence of thirteen subgroups. These subgroups could generally be assigned to the maintainer and restorer group and were separated to a large extent regarding agronomic use (oil vs. nonoil), although the majority of inbred lines was characterized by a high degree of admixture. The new SNP array allowed fine scale resolution of ancestry, identifying, e.g., one clear subgroup formed by descendants of RHA274, a restorer line of the PET1 cms system, which was released in 1973 and represents a prominent parent of the pollen parent pool. Two further subgroups were formed by the progeny of HA89 and HA300, released in 1971 and 1976, respectively, which co-founded the seed parent pool (Fick and Miller 1997; Miller 1997). Thus, the array developed here offers the possibility to depict the allelic diversity of sunflower and to represent the breeding history with high resolution.
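To make the notion of "molecular variation explained by the first two principal coordinates" concrete, the following minimal Python sketch performs a classical principal coordinate analysis on a distance matrix. The Euclidean distance, the simulated two-group genotypes, and all names are assumptions for illustration; the distance measure and software used in this study may differ.

```python
import numpy as np

def pcoa(dist):
    """Classical principal coordinate analysis of a symmetric distance matrix.

    Returns the coordinates and the share of variation explained by each axis
    (axes with non-positive eigenvalues are dropped).
    """
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering  # double-centred matrix
    eigval, eigvec = np.linalg.eigh(gower)              # ascending eigenvalues
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > 1e-10 * eigval.max()
    coords = eigvec[:, keep] * np.sqrt(eigval[keep])
    explained = eigval[keep] / eigval[keep].sum()
    return coords, explained

# Simulated inbred lines (coded 0/2) from two hypothetical pools with
# different allele frequencies; purely illustrative data.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=50)
freq_b = rng.uniform(0.1, 0.9, size=50)
pool_a = 2 * rng.binomial(1, freq_a, size=(10, 50))
pool_b = 2 * rng.binomial(1, freq_b, size=(10, 50))
geno = np.vstack([pool_a, pool_b]).astype(float)

# Euclidean distance between genotype vectors (an assumed distance measure).
diff = geno[:, None, :] - geno[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

coords, explained = pcoa(dist)
share_first_two = float(explained[:2].sum())
```

A small value of `share_first_two` indicates that two axes capture only a minor part of the total variation, which is the situation described above for a collection with complex ancestry and a high degree of admixture.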

Marker-assisted selection based on results from QTL mapping studies or genome-based prediction of genetic values is expected to increase progress in plant breeding. Especially in resistance breeding, phenotyping is often not trivial as it requires the occurrence of pathogens in the field or expensive artificial infection methods. In the QTL analysis of a biparental population with 6355 polymorphic SNPs, we confirmed the majority of QTL detected by Micic et al. (2005). The slightly lower number of QTL identified in our study resulted from the more stringent LOD threshold applied to account for multiple testing, which is crucial in large marker datasets. With a maximum of 4 cM, the LOD support intervals were strongly reduced compared to Micic et al. (2005). Thus, genotyping with the 25 K array allowed a better resolution of genomic regions involved in resistance to Sclerotinia midstalk rot and should enable successful marker-based selection for this trait. We hypothesize that the high marker density of this new array will be highly beneficial for QTL mapping in advanced mating designs such as multiparental mapping populations (Giraud et al. 2014). Furthermore, the 25 K array will constitute an essential tool for map-based cloning of genes associated with important agronomic traits of sunflower.
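The link between marker density and the width of a LOD support interval can be illustrated with a minimal Python sketch. The 1-LOD drop, the function name, and the toy LOD profile are assumptions for illustration; the drop size and mapping software of this study are not restated here.

```python
import numpy as np

def lod_support_interval(positions_cm, lod, drop=1.0):
    """Return (left, right) map positions of the LOD support interval.

    The interval spans the contiguous markers around the LOD peak whose LOD
    score stays within `drop` units of the peak (no interpolation between
    markers, so the resolution is limited by marker spacing).
    """
    peak = int(np.argmax(lod))
    cutoff = lod[peak] - drop
    left = peak
    while left > 0 and lod[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod) - 1 and lod[right + 1] >= cutoff:
        right += 1
    return positions_cm[left], positions_cm[right]

# Toy LOD profile along one linkage group (hypothetical values).
positions = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
lod_scores = np.array([1.0, 3.2, 5.0, 4.4, 2.1, 0.5])

interval = lod_support_interval(positions, lod_scores)
width_cm = interval[1] - interval[0]
```

Because the interval boundaries can only fall on marker positions in this sketch, a denser map directly permits narrower intervals, in line with the reduction reported above.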

An alternative approach to predicting untested phenotypes based on individual QTL is whole genome-based prediction. With the development of high-density marker arrays, genome-based prediction has been successfully applied in a number of crops (Albrecht et al. 2011; Heffner et al. 2011; Hofheinz et al. 2012). So far, whole genome-based prediction studies for resistance traits in sunflower have been lacking, but studies on quantitative fungal or insect resistance conducted in maize (Technow et al. 2013), wheat (Daetwyler et al. 2014; Rutkoski et al. 2012), and barley (Lorenz et al. 2012) have shown its merit and applicability. Here, we assessed the performance of GBLUP to predict Sclerotinia midstalk rot in a biparental population genotyped with the newly developed 25 K array. We obtained high predictive abilities especially for the resistance trait stem lesion length (mean predictive ability = 0.74). For leaf lesion length and speed of fungal growth, predictive abilities were lower, but both traits also showed significantly lower trait heritabilities. Predictive abilities obtained in a biparental population need to be considered as an upper bound of what can be achieved in a breeding population. However, we consider the results presented here as a first indication that the potential of genome-based prediction of Sclerotinia midstalk rot resistance warrants further investigation.
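As an illustration of how a predictive ability of the kind reported above can be obtained, the following minimal Python sketch applies GBLUP in a cross-validation scheme. It is a sketch under stated assumptions, not the procedure of this study: the genomic relationship matrix of the VanRaden type, the shrinkage parameter derived from an assumed heritability, the five-fold scheme, and the simulated data are all illustrative choices.

```python
import numpy as np

def vanraden_g(markers):
    """Genomic relationship matrix from (n_lines, n_snps) markers coded 0/1/2."""
    p = markers.mean(axis=0) / 2.0
    centered = markers - 2.0 * p
    return centered @ centered.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y, train, test, h2):
    """Predict phenotypes of test lines from training lines.

    The shrinkage parameter is the ratio of residual to genetic variance,
    here derived from an assumed heritability h2.
    """
    lam = (1.0 - h2) / h2
    mu = y[train].mean()
    G_train = G[np.ix_(train, train)]
    weights = np.linalg.solve(G_train + lam * np.eye(len(train)), y[train] - mu)
    return mu + G[np.ix_(test, train)] @ weights

def cv_predictive_ability(G, y, h2, k=5, seed=0):
    """Mean correlation between predicted and observed values over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    correlations = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predicted = gblup_predict(G, y, train, test, h2)
        correlations.append(np.corrcoef(predicted, y[test])[0, 1])
    return float(np.mean(correlations))

# Simulated data (hypothetical): 300 lines, 100 SNPs, heritability 0.6.
rng = np.random.default_rng(42)
n_lines, n_snps, h2 = 300, 100, 0.6
freqs = rng.uniform(0.1, 0.9, size=n_snps)
markers = rng.binomial(2, freqs, size=(n_lines, n_snps)).astype(float)
genetic = markers @ rng.normal(size=n_snps)
noise_sd = np.sqrt(genetic.var() * (1.0 - h2) / h2)
phenotype = genetic + rng.normal(scale=noise_sd, size=n_lines)

G = vanraden_g(markers)
ability = cv_predictive_ability(G, phenotype, h2)
```

In this formulation, the predictive ability is the correlation between predicted and observed phenotypes in lines that were not used for model training, so traits with lower heritability are expected to yield lower values even when genetic values are predicted equally well.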

The high number of SNP markers now available for sunflower opens new avenues for marker-based genetic studies and breeding. It will be particularly useful when GWAS or genome-based prediction is applied to diverse datasets, which require high marker densities to preserve the marker-QTL LD. The same holds true for genome-based prediction of sunflower hybrid performance. A first analysis based on a few hundred AFLP markers (Reif et al. 2013) did not have the power to predict sunflower hybrid performance with high accuracy. It remains to be shown whether the availability of the 25 K SNP markers will overcome this limitation.

Author contribution statement

This study was carried out in collaboration between all authors. CCS, MWG, MO, SW, RW, and ML conceived the study. VH and SW provided material. HL, MM, AP, JP, RW, CL, SU, and WE performed analyses. ML, MWG, and CCS drafted the manuscript. All authors read and approved the final version of the manuscript.