Introduction

Many approaches in plant genetics and breeding such as association mapping (AM), QTL linkage mapping and genome-wide selection (GS) require high-throughput marker systems with genome-wide coverage (Mammadov et al. 2012). Single-nucleotide polymorphism (SNP) markers fulfill the criteria for such approaches due to their frequency, reproducibility and genome-wide distribution (Mammadov et al. 2012; Huang and Han 2013). The discovery of a large number of SNPs in sunflower (Helianthus annuus L.) is necessary because sunflower is an agronomically important oilseed crop and the application of molecular markers will improve the efficiency and accuracy of selection in sunflower breeding. Two studies described the development of SNP loci by sequencing the transcriptomes of eight sunflower lines using the Roche 454 GS FLX platform (10,640 SNPs; Bachlava et al. 2012) and by sequencing of six sunflower inbred lines using restriction site-associated DNA sequencing (RAD-Seq) and Illumina paired-end sequencing technology (16,467 SNPs; Pegadaraju et al. 2013). In addition, Livaja et al. (2016) recently discovered 20,502 high-quality SNP loci by amplicon sequencing.

Genotyping by sequencing (GBS) is a high-throughput and practical SNP discovery approach which involves sequencing of a plant population using multiplexing (bar coding) and next-generation sequencing (NGS) technologies (Kumar et al. 2012). GBS can be used for SNP discovery and genotyping of plants by mapping the sequence reads of individual plants on a reference genome (Elshire et al. 2011; Huang and Han 2013). To date, the GBS approach has been popular and applied to many plant species for SNP discovery and genotyping. These species include barley (Elshire et al. 2011), Miscanthus sinensis (Ma et al. 2012), Rubus idaeus (Ward et al. 2013) and maize (Romay et al. 2013). Also GBS was applied in an F2 population of oil palm to identify 21,471 SNPs and to construct a linkage map containing 1085 SNPs (Pootakham et al. 2015).

Although approximately 27,000 SNP markers were developed by sequencing of a few sunflower accessions (Bachlava et al. 2012; Pegadaraju et al. 2013), no high-throughput SNP discovery study has been performed in sunflower using GBS. Thus, the first aim of this study was to discover SNPs in the sunflower genome using such an approach while the second goal was to construct a linkage map from these markers.

Materials and methods

Plant materials

An intraspecific F2 population consisting of 300 individuals was generated by self-pollinating F1 hybrids from a cross of H. annuus cv. RHA 436 × H. annuus cv. H08 M1. Plants were grown in the field in Bursa, Turkey, in 3 m × 0.3 m rows with row spacing of 0.7 m. Soil was fertilized with 15:15:15 fertilizer (25 kg ha−1) and nitrogen (N), phosphorus (P) and potassium (K) (3.75 kg ha−1 for each element). Total genomic DNA was isolated from leaf tissue of the two parents (RHA 436 and H08 M1) and 300 F2 individuals using a CTAB method (Doyle and Doyle 1990). Genomic DNA of individuals was quantified using Qubit™ quantitation assay (Life Technologies) and agarose gel electrophoresis. Genomic DNA of the two parents (RHA 436 and H08 M1) and 93 F2 individuals selected from the population for the best DNA quality and quantity were used for GBS.

Genotyping by sequencing (GBS)

Genomic DNA of the 95 sunflower genotypes (93 F2 individuals plus 2 parents) was subjected to GBS by the University of Wisconsin Biotechnology Center (https://www.biotech.wisc.edu/, Madison, WI USA). Genomic DNA digestion, common and bar code adaptor ligation, sample pooling and amplification for sequencing library construction were performed according to protocols described by Elshire et al. (2011). Single-end sequencing of the 95-plex library was performed with the Genome Analyzer II next-generation sequencing platform (Illumina Inc. San Diego, CA).

GBS data analysis and SNP identification

The FASTQ file format generated from raw sequence reads by the CASAVA 1.8.2 software package (Illumina Inc.) was used for further processing in the GBS discovery pipeline of TASSEL software (Glaubitz et al. 2014). The FASTQ and sample key file (containing the plate layout and bar codes for each genotype) were used as input files in the pipeline. Sequence reads containing the expected bar codes followed by an ApeKI cut site remnant (CWGC) were trimmed to 64 bases using the FastqToTagCountPlugin of the pipeline. Reads containing unidentified bases (N) were excluded from analysis. The bar-coded sequence reads were sorted and collapsed into unique sequence tags with counts using the plug-in with default parameters. The only exception was that minimum number of times a tag must be present to be output was altered to 3. After filtration using a minimum count threshold of 3, tag count files were merged into a master file using the MergeMultipleTagCountPlugin. The master tags were converted to FASTQ format by the TagCountToFastqPlugin using default parameters except the minimum count of reads for a tag was 3. The tags were then aligned to the sunflower draft genome (Celera_14libs_sspace2_ext.final.scaffolds.split.fasta.bz2) (http://sunflowergenome.org/early_access/repository/main/genome/) using the Bowtie 2 plug-in with default parameters (Langmead and Salzberg 2012). Overall, these steps of the pipeline were used to filter and correct most of the sequencing errors (Glaubitz et al. 2014). The output of the alignment was converted to a “Tags On Physical Map” (TOPM) file. This file contained information about the physical positions of the master tags which had the best unique alignment with the genome as generated by the SAMConverterPlugin. In addition, a “Tags by Taxa” (TBT) file which contained sorted and demultiplexed reads according to their bar code adapters was generated by the FastqToTBTPlugin. Finally, the TOPM and TBT files were used for SNP calling using the TagsToSNPByAlignmentPlugin. Parameters used for SNP calling and filtration are given in Table S1. SNPs were recorded in a HapMap file for each linkage group. MergeDuplicateSNPsPlugin was used to merge the duplicate SNPs in the HapMap file. A physical map of the identified SNPs was drawn using Mapchart software (Voorrips 2002).

Construction of linkage map

The SNP markers developed in this study were further filtered for linkage map construction. SNP markers with more than 10 % missing data and which deviated significantly from the expected 1:2:1 Mendelian segregation ratio (Chi-square goodness-of-fit test, p < 0.01) were excluded from linkage analysis. The linkage map was constructed using JoinMap version 3.0 software (Ooijen and Voorrips 2001) with F2 population-type parameters. Linkage groups were determined using a minimum LOD score of 3.0 and maximum recombination ratio of 0.5. Default parameters of the maximum likelihood algorithm were used for ordering of markers in linkage groups. Map distances were calculated using the Kosambi mapping function. Correlations between the linkage and physical maps were calculated for each linkage group using the positions of SNPs to compare the maps using PASW software (Norusis 2010).

Results

GBS

A total of 93 individuals and the two parents of the F2 population were subjected to GBS and analysis. A total of 271,445,770 sequence reads were generated from the 95-plex library Genome Analyzer II next-generation sequencing platform. From these reads, 1,208,784 tags were generated. While 29.2 % (353,304) of the sequence tags were uniquely aligned to the sunflower genome, 14.2 % (172,196) of the tags were aligned to multiple positions. A majority of the sequence tags (56.5 %, 683,284) could not be aligned to the genome. Only those that were uniquely aligned to the genome were used for SNP calling by the TASSEL software pipeline. Tags averaged 64 bases and covered 22.6 Mb. Sequences of the merged GBS tags can be accessed at https://figshare.com/s/7e40b56de730e94f9687.

GBS data analysis and SNP identification

A total of 46,278 SNP loci were discovered in the sunflower genome based on the locations of the unique alignment of merged tags (353,304) in the genome which contained 17 linkage groups (LG1–LG17). Thus, the SNP loci identified in this study were distributed on all 17 linkage groups. Physical mapping of the identified SNPs in the sunflower genome showed that the SNPs were well distributed with high coverage (Fig. 1). After filtration of the SNPs according to TASSEL software parameters (Table S1), a total of 9535 (20.6 %) SNPs were retained. Average number of filtered SNPs per tag was 0.03. While the frequency (resolution) of unfiltered SNPs was 72.2 kb/SNP (1 SNP every 72.2 kb), the frequency of filtered SNPs was 350.6 kb/SNP (Table 1). For both unfiltered and filtered SNPs, LG2 had the most SNPs: 3849 and 879 unfiltered and filtered SNPs, respectively. As a result, LG2 had the highest resolution among chromosomes. LG15 had the fewest SNPs with 834 and 91 unfiltered and filtered SNPs, respectively) (Table 1). Most of the SNPs (58.7 %) identified in this study were transition mutations (A/G or C/T) (Table 2). The most observed substitution type was A/G (29.7 %). The least common substitution type was C/G transversion (6.9 %). The observed transition/transversion ratio was 1.42.

Fig. 1
figure 1

Physical map of 46,278 unfiltered SNPs in sunflower genome. Black-colored locations were highly saturated with SNPs. Detailed information about physical map and SNPs is available in supplemental Excel file

Table 1 Numbers and frequencies of unfiltered, filtered and polymorphic SNP loci in sunflower linkage groups
Table 2 Types of substitutions represented by the identified SNP loci

The filtered SNPs were further analyzed to identify only SNPs that were polymorphic for the parents of the F2 population. In this way, a total of 7646 (16.5 % of the total unfiltered SNPs) polymorphic SNPs were identified. Frequency (resolution) of polymorphic SNPs between parents was 437 kb/SNP (1 SNP every 437 kb). Physical mapping of these SNPs showed that they were well distributed in the sunflower genome with some gaps on LGs 14, 15, 16 and 17 (Fig. 2).

Fig. 2
figure 2

Physical map of 7646 SNPs found to be polymorphic for parental lines. Detailed information about physical map and SNPs is available in supplemental Excel file

Linkage map construction

After elimination of SNPs with too much missing data (more than 10 %) or segregation distortion, a total of 1548 SNPs were retained for linkage map construction. A total of 6098 SNPs were excluded from linkage map construction due to high levels of missing data and segregation distortion. Most of the SNPs (73.8 % of the total SNPs polymorphic between the parental lines) were excluded due to high levels of missing data (more than 10 %), the rest (26.2 %) were excluded due to segregation distortion. A genetic linkage map consisting of 817 SNPs distributed over 17 linkage groups was constructed (Fig. 3). The total distance of the map was 2472 cM. The size of the linkage groups varied from 17 cM (LG15 with 7 SNPs) to 236 cM (LG11 with 236 SNPs) (Table 3). Average distance between SNPs in the map was 3.0 ± 0.2 cM. The genetic linkage map was compared to the physical map of sunflower, and the correlation between SNP positions was calculated. A moderate level of correlation was determined for all linkage groups with r = 0.70 except for linkage groups LG4 and LG10 (r = 0.39 and 0.20, respectively).

Fig. 3
figure 3

SNP-based genetic linkage map. Detailed information about genetic linkage map is available in supplemental Excel file

Table 3 Summary of SNP loci in linkage map

Discussion

GBS and SNP identification

The genotyping by sequencing approach has been applied to many species for SNP discovery. Although the numbers of sequence reads and percentages of uniquely aligned reads were variable in previous work, comparison indicates that the sunflower GBS performed in this study was efficient in terms of read number. For example, total reads of Rubus idaeus generated by GBS (135,776,036) (Ward et al. 2013) were lower than those we obtained for sunflower. In addition, the percentage of reads that passed filtration and uniquely aligned to the genome in our study (29.2 %) was higher than for GBS analysis of Miscanthus sinensis (23 %) for which many more reads were obtained (415,694,046) (Ma et al. 2012). Differences in read numbers among studies can be due to the number of individuals used in GBS and/or technical performance of the sequencing chemistry. GBS of Miscanthus sinensis used 192 individuals (Ma et al. 2012) while our study and that of Rubus idaeus (Ward et al. 2013) used 95 individuals. In another study, GBS analysis of an oil palm F2 population consisting of 108 individuals generated a much higher number of reads (524,508,111) with 88 % of the total reads aligned to unique locations in the oil palm genome (Pootakham et al. 2015). This higher alignment percentage stresses the importance of the genome because oil palm has a complete genome sequence (Singh et al. 2013) while the current assembly of the sunflower genome represents only 85 % of the complete genome. Although the draft genome of sunflower is not complete, the assembled genome is highly accurate and contiguous because most (more than 76 %) of the contigs larger than 5 kb were assigned to the physical and/or genetic map (Kane et al. 2011).

The present study implemented the GBS approach for SNP discovery in sunflower. Although a reasonable number of SNPs were discovered, the study showed that the GBS approach performed in a parental genetic background in sunflower was less efficient than other SNP discovery methods. These alternative approaches discovered many more validated SNPs in sunflower using NGS sequences from a few individuals (10,640 SNPs; Bachlava et al. 2012; 16,467 SNPs; Pegadaraju et al. 2013; 20,502 SNPs; Livaja et al. 2016). The number of SNPs discovered by GBS was relatively low because of the limited number of sequence tags uniquely aligned to the sunflower genome (29.2 %). This was due to the high number of repetitive sequences in the genome which made tag alignments problematic (Treangen and Salzberg 2012). However, when compared with other GBS studies, the number of identified SNPs after filtering in this study (9535) was similar to the numbers reported in Rubus idaeus (9143) and soybean (10,120) (Ward et al. 2013; Sonah et al. 2013).

The average frequency of sunflower SNPs identified in this study was 72 kb/SNP (1 SNP every 72 kb); this is higher than soybean (100 kb/SNP) (Sonah et al. 2013) and much lower than the frequency identified in oil palm (1 SNP in 0.665 kb) (Pootakham et al. 2015). Again, the higher frequency of SNPs in oil palm could reflect the higher number of reads and higher coverage of the oil palm alignment (88 %). This result could also indicate that the oil palm genome is more diverse than sunflower due to higher gene flow and less inbreeding in trees than annual plants (Austerlitz et al. 2000).

The number of transition SNPs was higher than transversions as expected because it is well known that random mutations lead to higher transition rates during evolution (Vogel and Kopun 1977). This higher transition rate was also observed in oil palm. In oil palm and Miscanthus sinensis, the most prevalent substitution types were A/G and C/T (Ma et al. 2012; Pootakham et al. 2015). These substitutions were also the most prevalent substitution types identified in this study.

The SNPs identified in this study were further filtered to retain only those SNPs which were polymorphic for the parents. After parental filtration, most of the SNPs (80 % of the filtered SNPs) were retained. Although some gaps were observed in four linkage groups, most of the sunflower genome was well covered with SNPs. Gaps can be filled using the unfiltered SNPs identified in the relevant regions (Fig. 1). In a future study, these SNPs can be utilized with genotyping platforms such as Kompetitive Allele Specific PCR (KASP) (Semagn et al. 2013) for breeder friendly sunflower genotyping.

Linkage map construction

The polymorphic SNPs were used to construct a linkage map containing 817 SNP markers. Although this set represented only 1.7 % of the unfiltered SNPs, exclusion of most of the SNPs due to missing data or segregation distortion was also observed in other GBS linkage mapping studies performed in apple (0.9 % of total identified SNPs retained) and oil palm (5 % retained) (Pootakham et al. 2015; Gardner et al., 2014). The total size of the sunflower linkage map (2472 cM) was larger than previous SNP marker maps reported by Talukder et al. (2014) (1444 cM) and Bowers et al. (2012) (1080 cM). The larger total size of the linkage map and SNPs with segregation distortion might be due genotyping errors caused by the nature of sunflower genome structure. For example, repetitive sequences can limit the alignment efficiency of the sequence tags generated by GBS. Because the sunflower genome is known to contain a large proportion of repetitive sequences (∼78 %), this made tag alignments problematic and caused genotyping errors (Kane et al. 2011; Treangen and Salzberg 2012). Although the linkage map constructed in this study had lower marker density than the other two SNP-based maps (0.13 cM/SNP for Talukder’s map and 0.29 cM/SNP for Bowers’ map), it had higher marker density than the map (5.5 cM/SNP) constructed by Lai et al. (2005). The maps of Talukder and Bowers had higher resolution because they are consensus maps and were constructed from multiple populations. In addition, one of the two parents of the four mapping populations (Bowers et al. 2012) was a wild accession of sunflower and a non-oilseed landrace while the rest of the parents were sunflower cultivars. Although our linkage map had some unmapped regions similar to the maps of Talukder et al. (2014) and Lai et al. (2005), the SNP markers were generally well distributed.

The number of linkage groups of the map equals the haploid chromosome number of sunflower (n = 17) as expected and moderate correlation was determined between the genetic linkage and physical maps. Conflict between linkage and physical maps was also reported in a GBS study in apple with some SNP positions in the linkage map conflicting with their physical map locations (Gardner et al. 2014). This might be due to misassembly of pseudochromosomes or a small population size (Silva et al. 2007). The F2 population used for linkage map construction had 93 individuals and this might not be sufficient for accurate ordering of large numbers of SNPs (Silva et al. 2007).

In conclusion, this study is the first to demonstrate the implementation of a GBS approach for SNP identification in sunflower. A total of 46,278 SNP loci were identified and 7646 SNP loci were validated in an F2 population. Although a reasonable number of SNPs was discovered, the relatively low efficiency of GBS in sunflower was revealed. This is most likely due to the high number of repetitive sequences in the genome which resulted in a larger and lower-resolution genetic linkage map. Nevertheless, the SNP markers and SNP-based linkage map constructed in this work will be valuable molecular genetics tools for sunflower breeding.