Introduction

Wild goatgrass, Aegilops tauschii, is the D-genome progenitor of bread wheat, Triticum aestivum. Hexaploid bread wheat (2n = 6× = 42; AABBDD) originated around 8000 years ago through a spontaneous hybridization between the tetraploid emmer wheat, Triticum turgidum (2n = 4× = 28; AABB), and the diploid Ae. tauschii (2n = 2× = 14; DD), and thereafter came to account for most of the global wheat production (Jia et al. 2013). Cultivated wheat species occupy more arable land than any other food crop; yet, in terms of production and yield, wheat ranks fourth after sugarcane, maize, and rice (http://faostat.fao.org/). Given projections of world population growth and climate change patterns, wheat production may need to increase by up to 70% by 2050 (Mayer et al. 2014). To meet this challenge, wild germplasms, including relatives and progenitors, will be key to exploring favorable alleles and small RNAs such as microRNAs for wheat improvement (Nevo and Chen 2010; Lucas et al. 2011a, b; Kuzuoglu-Ozturk et al. 2012; Budak et al. 2013a; Budak et al. 2013b; Mochida and Shinozaki 2013; Akpinar et al. 2015a, b, c, d).

Molecular breeding allows the development and release of improved varieties much faster than classical breeding strategies. In particular, molecular markers have been pivotal in the screening of genetic resources for favorable alleles and in the characterization of germplasms and introgressed lines (Budak et al. 2004; Budak et al. 2005; Castillo et al. 2010; Mizuno et al. 2010; Paux et al. 2012; Lucas et al. 2012; Budak et al. 2015; Winfield et al. 2015). While sequence-independent markers, such as restriction fragment length polymorphisms (RFLPs) and random amplified polymorphic DNA (RAPD), initially aided genetic mapping and genetic diversity studies, molecular breeding efforts have mostly benefited from sequence-based markers, which enable high-throughput, rapid, and cost-effective characterization of large numbers of samples (Lucas et al. 2012; Paux et al. 2012). Of these sequence-based markers, single-nucleotide polymorphisms (SNPs) have become the marker of choice as they overcome shortages of expressed sequence tags (ESTs) and simple sequence repeats (SSRs) (Berkman et al. 2012; Paux et al. 2012; Thomson 2014; Akpinar et al. 2015c). SNPs are the most widespread markers as they involve changes at the nucleotide level, which are transitions (C/T or G/A), transversions (C/G, C/A, T/A, or T/G), or small insertions/deletions (indels), and they can be linked to other marker types, such as EST-SNPs (Paux et al. 2012). Although homoeologous SNPs found in polyploid genomes complicate the identification of varietal SNPs, the flexibility, speed, and cost-efficiency of SNP-based markers and platforms have greatly increased the popularity of SNPs in genetic and genomic studies. SNP-based marker systems have the potential to accelerate breeding programs not only by enabling the construction of highly saturated genetic maps but also by facilitating fine mapping of regions of interest and effective utilization of a germplasm by developing genome-specific assays (Mammadov et al. 2012).

Ae. tauschii is a promising resource for wheat breeding due to its direct ancestry to bread wheat and its genetically diverse populations (Dvorák et al. 1998; Wang et al. 2013). This is particularly important for D-genome traits, as bread wheat D genomes exhibit a low level of polymorphism (Paux et al. 2012). Accordingly, the potential of this resource has begun to be realized, particularly since the release of its whole genome physical map and draft genome sequence (Luo et al. 2013; Jia et al. 2013). SNPs have been utilized to map a powdery mildew resistance gene (Wang et al. 2015), to assess genetic diversity (Akhunov et al. 2010), to evaluate genetic and physical maps (You et al. 2010; Kumar et al. 2015), and to identify associations with morphological traits (Liu et al. 2015a; Liu et al. 2015b). Additionally, a few studies reported genome-wide large-scale SNP identification in Ae. tauschii (You et al. 2011; Iehisa et al. 2012; Wang et al. 2013; Winfield et al. 2015).

Here, we report the chromosome-specific discovery of SNPs in Ae. tauschii 5D chromosome using next-generation genomic and transcriptomic sequences of a total of seven Ae. tauschii accessions. Wheat 5D chromosome is known to carry alleles for important agronomic traits, such as dough strength (HMW-GS, Blechl and Anderson 1996), grain texture (Ha locus, Chantret et al. 2005), grain hardness (Pina, Pinb loci, Morris 2002), free-threshing (WAP2, Ning et al. 2009), flowering (Vrn loci, Quarrie et al. 2005; Yoshida et al. 2010; Zhang et al. 2012; Kippes et al. 2015), disease resistance (Lr1, Cloutier et al. 2007), abiotic stress tolerance (Galaeva et al. 2013), and putative microRNA target genes (Kurtoglu et al. 2013). We also investigated single nucleotide variations (SNVs) between Ae. tauschii and T. aestivum 5D chromosomes by comparing SNVs found on T. aestivum 5D chromosome by mapping reads of Ae. tauschii AL8/78, for which the draft genome sequence is available (Jia et al. 2013), with SNVs found on Ae. tauschii 5D chromosome by mapping T. aestivum D-genome reads published by the International Wheat Genome Sequencing Consortium (Mayer et al. 2014). The SNPs and SNVs reported in this study provide an extensive collection on the D genome, which should contribute to genetic and genomic efforts, including studies of small RNAs (Budak and Kantar 2015).

Materials and Methods

Sequencing of Flow-Sorted Chromosomes

Flow sorting and sequencing of isolated chromosomes 5DS and 5DL from T. aestivum, 5B from Triticum dicoccoides, and 5D from Ae. tauschii were described previously (Akpinar et al. 2015a, b; Lucas et al. 2014). Briefly, suspensions of intact mitotic chromosomes were isolated from T. aestivum L. var. Chinese Spring using a double ditelosomic line (2n = 40 + 2t5DS + 2t5DL) and from Ae. tauschii genotype MvGB589, and the 5D chromosomes were selected by flow sorting, evaluated for purity, and amplified for direct sequencing, as described previously in detail (Lucas et al. 2014; Akpinar et al. 2015b; Akpinar and Budak 2016).

Shotgun libraries were prepared from 0.5 μg of amplified DNA from each flow-sorted chromosome/chromosome arm, quantified, and sequenced using GS FLX Titanium kits according to the manufacturer’s protocols (all Roche Life Sciences) as previously described (Akpinar et al. 2015a; Lucas et al. 2014). Sequence reads have been submitted to the EBI European Nucleotide Archive under the primary accession numbers ERP002330 (T. aestivum 5D) and PRJEB5993 (Ae. tauschii 5D).

Additional Datasets and Software

Whole genome shotgun sequence reads from Ae. tauschii accession AL8/78 (Jia et al. 2013) were downloaded from the NCBI Sequence Read Archive (BioProject ID: PRJNA182898; run accession numbers: SRR124169-SRR124252). Illumina sequence reads produced from flow-sorted chromosome arms of T. aestivum L. var. Chinese Spring (Mayer et al. 2014) were downloaded from the EBI European Nucleotide Archive in .fastq format (run accession numbers: ERR277143–ERR277149).

RNA-sequencing reads from six Ae. tauschii accessions, IG47182 (DRR001933), PI476874 (DRR001934), 2220007 (SRR947843-SRR947850), AS2404 (SRR1222442-SRR1222447), TQ20 (SRR043333-SRR043335), and D2220009 (SRR949823), were retrieved from the NCBI Sequence Read Archive. Illumina and 454 reads were checked for read quality and composition using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and PRINSEQ (http://prinseq.sourceforge.net/), respectively. Reads containing tag, adapter, or primer sequences were trimmed. Illumina reads of 49–51-nucleotide length containing >10% Ns and 454 reads containing >1% Ns were discarded to avoid ambiguity.
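
The ambiguity filter described above can be illustrated with a short sketch. This is not the authors' procedure, which relied on FastQC and PRINSEQ; the function names and the threshold table below are hypothetical, and only the two cut-offs (>10% Ns for Illumina reads, >1% Ns for 454 reads) are taken from the text.

```python
# Thresholds from the text; the dictionary and function names are illustrative assumptions.
MAX_N_FRACTION = {"illumina": 0.10, "454": 0.01}


def n_fraction(seq: str) -> float:
    """Return the fraction of ambiguous bases (N) in a read."""
    if not seq:
        return 1.0  # treat empty reads as fully ambiguous so they are discarded
    return seq.upper().count("N") / len(seq)


def keep_read(seq: str, platform: str) -> bool:
    """Keep a read unless its N fraction exceeds the platform threshold."""
    return n_fraction(seq) <= MAX_N_FRACTION[platform]


# Made-up reads for illustration only.
reads = ["ACGTACGTAC", "ACGTNNACGT", "NACGTACGTA"]
kept_illumina = [r for r in reads if keep_read(r, "illumina")]
kept_454 = [r for r in reads if keep_read(r, "454")]
```

With these toy reads, the read carrying two Ns in ten bases is dropped under both thresholds, whereas the read with a single N survives only the more permissive Illumina threshold.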

Illumina reads were aligned to reference sequences using the Burrows-Wheeler Aligner (bwa) software (Li and Durbin 2009). Identification of sequence variations and initial filtering of potential SNPs were carried out using SAMtools (Li et al. 2009). Routine comparisons of small numbers of sequences were carried out using standalone BLAST+ applications, v.2.2.27 (Camacho et al. 2009). Sequence annotations were performed using Blast2GO software with an initial blast step run locally against all known Viridiplantae proteins (Conesa and Götz 2008).

Defining 5D Chromosome Gene-Related Reference Sequences

In order to identify SNPs present on the 5D chromosomes, we used a strategy adapted from that described by You et al. (2011) for calling SNPs using NGS data in the absence of a finished quality genome sequence. First of all, the low-coverage 454 reads from each chromosome (arm) were filtered against mitochondrial/chloroplast DNA contamination and rRNA sequences and masked against known repetitive elements. The masked sequences were then assembled using gsAssembler (Roche 454 Life Sciences), and contigs with higher than expected depth (defined as 3× the average sequencing coverage of the chromosome) were excluded as potential unknown repeat elements. From the remaining low-copy-number (LCN) contigs and singletons, those associated with coding sequences were identified using sequence similarity searches against annotated protein sequences of Brachypodium distachyon, Sorghum bicolor, and Oryza sativa and high-confidence predicted protein sequences of Hordeum vulgare. These procedures are described in detail in our previous papers (Akpinar et al. 2015a,b; Lucas et al. 2014). The gene-associated, LCN contigs and singletons were combined into a single fasta file for each chromosome/chromosome arm and used as references for identifying sequence variations.
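
The depth-based exclusion of putative repeats can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the contig identifiers, and the depth values are hypothetical, and only the cut-off of 3× the average sequencing coverage comes from the text.

```python
from typing import Dict, List


def low_copy_contigs(contig_depths: Dict[str, float], mean_coverage: float,
                     factor: float = 3.0) -> List[str]:
    """Return IDs of contigs whose depth does not exceed factor x mean coverage."""
    cutoff = factor * mean_coverage
    return [cid for cid, depth in contig_depths.items() if depth <= cutoff]


# Made-up depths; contig2 is well above 3x the assumed mean coverage of 1.0.
depths = {"contig1": 1.2, "contig2": 9.5, "contig3": 2.9}
lcn = low_copy_contigs(depths, mean_coverage=1.0)
```

Contigs that pass this step would then proceed to the similarity searches against annotated protein sets described above.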

SNP Calling and Filtering

Read alignment, mapping, and variant calling steps were carried out on a high-performance computing cluster. For SNP identification from Ae. tauschii AL8/78, run files containing paired-end reads with nominal insert sizes ranging from 167 to 510 nt, giving a total average genome coverage of >50×, were aligned to the references using the bwa aln and sampe modules (reads from libraries with smaller insert sizes were preferred, as both halves of each pair were more likely to map to the same 454 singleton or contig). For SNP identification from T. aestivum var. Chinese Spring, paired-end reads from the flow-sorted chromosomes of the D genome were used for a comparable data volume (Mayer et al. 2014). The resulting alignment files in Sequence Alignment/Map (.sam) format were converted to sorted binary (.bam) files using samtools view and sort. After sorting, the .bam files generated from all runs from the same biological sample were combined using “samtools mpileup”; the command “bcftools view -vcg” was then used to extract all sequence variations in the variant call format (.vcf). Filtering criteria were then applied to eliminate apparent variations resulting from alignment errors as follows: sites with a read depth of <10 or >3 times the average depth of coverage of the chromosome, sites with a mapping quality score of <30, and sites within 3 nt of an alignment gap were eliminated using vcfutils.pl varFilter. Finally, in cases where two sequence variations were detected separated by 3 nt or fewer, both were removed as probable products of low-quality sequence data. After these criteria had been applied, the remaining variations were accepted as putative SNPs or SNVs. SNPs/SNVs identified from different genotypes were compared and collated using in-house Perl scripts that are available upon request. The summarized workflow is given in Fig. 1.
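
The filtering criteria listed above can be expressed compactly in code. The sketch below is not the authors' pipeline, which used vcfutils.pl and in-house Perl scripts; the Variant record and filter_variants function are hypothetical, and two points are assumptions: indels are treated as the alignment gaps and are not themselves retained, and the clustering rule is applied only to the variants that survive the other filters.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    contig: str
    pos: int            # 1-based position on the reference contig
    depth: int          # read depth at the site
    mapq: int           # mapping quality at the site
    is_indel: bool = False


def filter_variants(variants: List[Variant], mean_depth: float,
                    min_depth: int = 10, max_factor: float = 3.0,
                    min_mapq: int = 30, window: int = 3) -> List[Variant]:
    """Apply depth, mapping-quality, gap-proximity and clustering filters."""
    gaps = [(v.contig, v.pos) for v in variants if v.is_indel]
    passed = []
    for v in variants:
        if v.is_indel:
            continue  # assumption: only substitutions are kept as putative SNPs
        if v.depth < min_depth or v.depth > max_factor * mean_depth:
            continue
        if v.mapq < min_mapq:
            continue
        if any(c == v.contig and abs(p - v.pos) <= window for c, p in gaps):
            continue
        passed.append(v)
    # Remove both members of any pair of surviving variants within `window` nt.
    clustered = set()
    for i, a in enumerate(passed):
        for b in passed[i + 1:]:
            if a.contig == b.contig and abs(a.pos - b.pos) <= window:
                clustered.update((a, b))
    return [v for v in passed if v not in clustered]


# Made-up calls, one per rejection reason, with an assumed mean depth of 50.
calls = [
    Variant("c1", 100, 25, 60),                 # passes every filter
    Variant("c1", 200, 5, 60),                  # depth below 10
    Variant("c1", 300, 200, 60),                # depth above 3 x 50
    Variant("c1", 400, 40, 20),                 # mapping quality below 30
    Variant("c1", 500, 40, 60, is_indel=True),  # alignment gap
    Variant("c1", 502, 40, 60),                 # within 3 nt of the gap
    Variant("c1", 600, 40, 60),                 # clustered with the next call
    Variant("c1", 602, 40, 60),
]
kept = filter_variants(calls, mean_depth=50.0)
```

Only the first call survives in this toy example, which mirrors the intent of the workflow: a variant is accepted only when none of the error-prone conditions apply.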

Fig. 1. Schematic overview of the workflow used to collate SNP calls generated for different genotypes

Results and Discussion

Identification of SNPs in Multiple Ae. tauschii Genotypes

Prior to the identification of SNPs on genic sequences from Ae. tauschii chromosome 5D, our SNP calling and filtering workflow was tested using two independently generated sequence datasets in the Ae. tauschii AL8/78 genotype. The validity of the conditions for initial SNP filtering was confirmed by mapping the Ae. tauschii AL8/78 whole genome shotgun reads from Jia et al. (2013), against preliminary sequence scaffolds generated independently from pooled BAC clones of the same accession by Dvorak et al. (http://aegilops.wheat.ucdavis.edu/ATGSP/index.php). Any variants detected between these two datasets should be the result of errors in sequencing or alignment, as both derive from the same genetic material. Accordingly, while initial alignment and variant calling gave a putative 274,286 SNVs and indels, 274,163 of these (99.96%) were eliminated by filtering for read depth and mapping quality as described above. Elimination of all but 123 out of a total 274,286 variations indicated that filtering criteria employed in our workflow can differentiate most sequencing errors from genuine SNPs. This gives confidence that the filtering criteria used are stringent enough to remove the vast majority of erroneous SNP calls.

The filtered SNP calls generated for each genotype under test were compared and collated using in-house Perl scripts, with the workflow summarized in Fig. 1. Briefly, .vcf files containing SNP calls for different genotypes aligned to the Ae. tauschii 5D gene-related reference sequences from accession MvGB589 were combined, sorted, and assigned unique SNP IDs based on their position in the reference sequences. As the most comprehensive dataset available, the genomic DNA reads from Ae. tauschii AL8/78 were used to filter the resulting summary table. Only positions in the reference to which AL8/78 reads aligned at a depth of 10 or more were retained, and the genotype of AL8/78 at every position called as a SNP in one or more of the other accessions but not AL8/78 was re-checked. In most cases, AL8/78 matched the reference sequence, but in some positions, AL8/78 also had indications of sequence variation but had not been called as an SNP owing to high sequence depth or low mapping quality. These instances were accepted as genuine SNPs based on the strength of the evidence from other accessions.
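
The collation step can be pictured as building one table keyed by reference position. The following sketch is only an illustration of that idea: the in-house Perl scripts are not shown in this paper, so the function name, the SNP ID format, and the example calls are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each call is (contig, 1-based position, alternative base).
# Accession names are from the text; the calls themselves are made up.
calls_by_accession: Dict[str, List[Tuple[str, int, str]]] = {
    "AL8/78": [("contig7", 120, "T"), ("contig2", 45, "G")],
    "AS2404": [("contig2", 45, "G"), ("contig7", 310, "A")],
}


def collate(calls: Dict[str, List[Tuple[str, int, str]]]) -> List[dict]:
    """Merge per-accession calls into one table sorted by reference position."""
    table = defaultdict(dict)
    for accession, records in calls.items():
        for contig, pos, alt in records:
            table[(contig, pos)][accession] = alt
    rows = []
    for i, (contig, pos) in enumerate(sorted(table), start=1):
        rows.append({
            "snp_id": f"SNP{i:06d}",  # hypothetical ID scheme
            "contig": contig,
            "pos": pos,
            "alleles": dict(table[(contig, pos)]),
        })
    return rows


summary = collate(calls_by_accession)
```

Keying on (contig, position) means that a site called in several accessions occupies a single row, which is what makes the later per-position comparison of genotypes straightforward.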

Furthermore, each Ae. tauschii SNP was compared with its corresponding position in T. aestivum var. Chinese Spring. Firstly, gene-related sequences from Chinese Spring 5D chromosome (Lucas et al. 2014) were used to construct a BLAST database. Then, a segment of sequence centered on each SNP was excised from the Ae. tauschii MvGB589 reference dataset and searched for the best match in the Chinese Spring 5D sequences using blastn, requiring a minimum of 90% sequence identity over at least 60 nt. The optimum length of the SNP-centered sequence segment to search for was determined empirically by carrying out searches with segment lengths of 60, 80, 100, and 120 nt and comparing the number of matches to these segments at different percentage sequence identities. Most matches were the same in all of these searches; however, the total number of matches increased slightly with segment length. It was observed that up to 100 nt, the increase was primarily at high identity (>98%) whereas at 120 nt, most of the increase was at 90–96% identity, suggesting that less reliable alignments were beginning to be included (Fig. 2). Therefore, 100 nt was selected as the optimum segment length. For the matches obtained to these segments, the equivalent base in Chinese Spring 5D for each Ae. tauschii SNP position was added to the summary table of all SNPs (Supplementary Table S1).
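
Excising an SNP-centered segment amounts to simple coordinate arithmetic. The sketch below assumes 1-based SNP positions and truncates the segment at contig ends; how the original scripts handled SNPs close to the ends of contigs or singletons is not stated, so that behavior and the function name are assumptions.

```python
def snp_centered_segment(sequence: str, snp_pos: int, length: int = 100) -> str:
    """Excise up to `length` nt around a 1-based SNP position."""
    half = length // 2
    start = max(0, snp_pos - 1 - half)          # 0-based start, clipped at the contig start
    end = min(len(sequence), start + length)    # clipped at the contig end
    return sequence[start:end]


# Made-up 300-nt reference with a marked base at position 150.
reference = "A" * 149 + "G" + "A" * 150
segment = snp_centered_segment(reference, 150)
```

The resulting segment would then serve as the blastn query against the Chinese Spring 5D gene-related sequences, subject to the identity and length thresholds given above.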

Fig. 2. Comparison of the percent sequence identities and number of blast matches for SNP-centered segment lengths of 60, 80, 100, and 120 nt

In total, compared to gene-related sequences from chromosome 5D of Ae. tauschii, 68,592 SNPs were identified in one or more accessions, corresponding to an average density of 4.49 SNPs per kilobase in these sequences. As expected, more SNPs were identified in genomic shotgun than RNA-Seq datasets, with 31,250 identified in AL8/78 and a total of 25,543 in chromosome arms 5DS and 5DL of T. aestivum var. Chinese Spring (Table 1). A small number of SNP positions (550) were called in both 5DS and 5DL; these could be explained by sequence reads from genes with highly similar paralogs on both chromosome arms matching each other. Although the majority of the gene-related sequences carrying those SNPs lacked known orthologs in Viridiplantae, others that do exhibit orthologous relationships to known plant proteins suggested involvement in core functions and/or affiliation to large gene families or repeat elements. For instance, seven such contigs and singletons were associated with serine-threonine kinases and phosphatases as core components of cellular signal transduction. Interestingly, 16 more contigs and singletons revealed significant similarity to the far-red-impaired response (FAR) family of transcription factors (TFs) and mutator proteins that are related to the Mutator-like element (MULE) superfamily of transposases (Lin et al. 2007). Two contigs had significant similarity to the SKP1-Cullin-F-box (SCF) E3 ubiquitin ligase complex, while two more sequences were also related to components of the protein turnover machinery (pumilio and ubiquitin carrier E2). Both SKP1-like and pumilio-like genes have multiple copies in plant genomes (Tam et al. 2010; Hong et al. 2013). These observations support the view that SNP positions commonly identified both in 5DS and 5DL may indeed correspond to genes with paralogous coding regions on both chromosome arms or genes associated with repetitive elements.

The majority of the SNP-containing Aegilops 5D contigs and singletons had only a single SNP (Fig. 3a). While almost 68% of the contigs, which are much longer than singleton reads, contained one or more SNPs, only 32% of the singletons carried at least one SNP. Overall, Aegilops 5D contigs also had higher numbers of SNPs per contig, compared to singletons (Fig. 3b).

Table 1 Identification of SNPs in genomic DNA sequences by genotype

Fig. 3. Distribution of SNPs in gene-related Ae. tauschii 5D contigs and singletons (a) and the comparison of SNP distributions in contigs and singletons (b). Green color shows contigs, whereas purple color indicates singleton reads (color figure online)

Transcriptome-Guided SNP Identification

Gene-associated sequences from Ae. tauschii (MvGB589) 5D chromosome were aligned with next-generation RNA-Seq reads from six Ae. tauschii genotypes (Table 2). Cumulatively, 17,553 SNPs were identified between the MvGB589 genotype and the six accessions, corresponding to an average density of 1.15 SNPs per kilobase (that is, 1 SNP every ∼870 bp), close to the SNP density (1 SNP every ∼876 bp) in genic sequences between Ae. tauschii AL8/78 and AS75 genotypes (You et al. 2011). However, higher SNP densities, as high as 1 in every 202 bp, were reported for the genic regions on the wheat D genome (Akhunov et al. 2010). This figure is more similar to the overall SNP density identified from all genotypes (including T. aestivum Chinese Spring), which is close to 1 SNP per 223 bp, suggesting that RNA-Seq reads may not be sufficient to discover all SNPs linked to gene-associated regions on the genome. Since the gene-associated sequences used in the identification of SNPs were of genomic source, these sequences may include noncoding regions, such as introns and flanking regulatory elements, which may not be fully covered by RNA-Seq reads, resulting in the underestimation of the overall number of SNPs in these sequences.
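
The two ways of expressing SNP density used in this section (SNPs per kilobase and base pairs per SNP) are reciprocals of each other, as the short calculation below shows. The function name is illustrative, and the back-calculated reference length is only an inference from the reported counts and densities, not a value stated in the text.

```python
def bp_per_snp(snps_per_kb: float) -> float:
    """Convert a density in SNPs per kilobase to the average spacing in bp."""
    return 1000.0 / snps_per_kb


rna_seq_spacing = bp_per_snp(1.15)   # RNA-Seq-derived SNPs
overall_spacing = bp_per_snp(4.49)   # SNPs from all genotypes

# Back-calculation: reference length (kb) implied by each count/density pair.
implied_kb_rna_seq = 17_553 / 1.15
implied_kb_overall = 68_592 / 4.49
```

Both pairs imply a gene-related reference of roughly the same size (about 15.3 Mb), which suggests that the reported counts and densities are internally consistent.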

Table 2 Identification of SNPs in Ae. tauschii RNA-Seq data by genotype

The number of identified SNPs in individual accessions was largely dependent on the size of the transcriptomic data available, ranging from 33 (accession TQ20) to 11,585 (accession D2220009). With larger datasets, however, the number of SNPs appeared to level off around 12,000, which may indicate the maximum number of SNPs that can be identified by transcriptomic sequences of a single accession (Fig. 4). For the accession D2220009, which had the largest RNA-Seq dataset, an SNP density of 1 in every 1318 bp was observed on gene-related 5D sequences, which is considerably lower than the estimated SNP density for random Ae. tauschii haplotypes (corresponding to 1 SNP in every 409 bp; Dvorak et al. 2011). This may suggest a closer evolutionary relationship between D2220009 and MvGB589, assuming similar evolutionary dynamics for the remaining chromosomes. Interestingly, ∼3-Gb RNA-Seq reads of the AS2404 genotype covered almost 80% of all gene-associated 5D sequences, similar to the much larger datasets of the 2220007 (12.4 Gb) and D2220009 (31.8 Gb) accessions, although far fewer SNPs could be identified in this genotype (2486 vs 9752 and 11,585, respectively). This may also suggest a closer relationship between MvGB589 and AS2404, compared to the 2220007 and D2220009 accessions. Additionally, accessions IG47182 and PI476874 had comparable amounts of transcriptomic sequence mapping against the gene-related 5D sequences. While IG47182 revealed 203 SNPs in 71 gene-related sequences from the 5D chromosome, PI476874 had 228 SNPs in 84 gene-related 5D sequences. Of these, 31 SNPs were common to IG47182 and PI476874, with 29 having the same alternative bases in these accessions.

Fig. 4. Number of SNPs identified versus the size of the RNA-Seq data used for SNP discovery. Mb megabases

Among the SNPs from all samples, the ratio of transitions (Ti; C<>T or A<>G) to transversions (Tv; Purine <> Pyrimidine) was 1.66 ± 0.25 (excluding the smallest dataset TQ20, 1.75 ± 0.12). A bias towards transitions is commonly observed in SNP analysis due to the high rate at which methylated cytosine residues are deaminated to thymine in biological conditions (Shen et al. 1994). This bias increases confidence that the variants detected are real SNPs; conversely, random sequencing errors would be expected to have a Ti/Tv ratio of 0.5, as for any base substitution, there are two possible transversions and only one transition. Curiously, the Ti/Tv ratio of genic SNPs between PI476874 and MvGB589 5D chromosome was considerably higher (1.92) than all others, including that of the highly similar dataset of IG47182 (Table 2). A previous study of SNPs in the group 7 chromosomes of T. aestivum (Lorenc et al. 2012) observed a Ti/Tv ratio of 2.12. This higher value could be explained by the bread wheat genome being more highly methylated than Ae. tauschii, owing to its hexaploidy and higher repeat content.
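
The expectation of a Ti/Tv ratio of 0.5 for random substitutions can be verified by enumerating all twelve possible base changes. The sketch below is a minimal illustration; the function name and the input format (pairs of reference and alternative bases) are assumptions, not the format used by the authors.

```python
from typing import Iterable, Tuple

# A<>G and C<>T are transitions; every other substitution is a transversion.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Return the transition/transversion ratio for (ref, alt) base pairs."""
    pairs = list(snps)
    ti = sum(1 for ref, alt in pairs if frozenset((ref, alt)) in TRANSITIONS)
    tv = len(pairs) - ti
    if tv == 0:
        return float("inf")  # no transversions observed
    return ti / tv


# All 12 possible substitutions, each equally likely under random error.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
random_expectation = ti_tv_ratio(all_substitutions)
```

Of the twelve substitutions, four are transitions and eight are transversions, giving the ratio of 0.5 quoted above; observed ratios well above this value are therefore difficult to explain by random error alone.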

Polymorphic SNPs in Ae. tauschii Accessions

In total, 17,553 SNPs were identified by mapping over 40-Gb transcriptomic sequences from all six Ae. tauschii accessions on gene-related 5D sequences of genotype MvGB589. When SNPs identified from individual samples were combined, however, the total number of SNPs was 26,575, likely indicating that almost one third of the SNPs identified by the combined transcriptome sequences lacked sufficient depth in individual samples. This may be due to the limited data size of certain accessions and low representations of transcripts that are not highly expressed in the tissues or under the conditions that the samples are from.

A total of 5732 SNPs were covered by two or more genotypes (Fig. 5). Of these, 4179 (72%) have the same alternative base in all genotypes covering the SNP position, differing from MvGB589. Similarly, of the 965 SNPs covered by three or more genotypes, 776 had the same alternative base in all genotypes covering the SNP position, whereas among the remaining, some genotypes had the same reference base. The SNPs found between the gene-related sequences of the 5D chromosome of MvGB589 and transcriptome sequences of six Ae. tauschii accessions were exclusively biallelic, assuming only two out of the four (A, G, T, C) possibilities.

Fig. 5. SNPs identified from transcriptomic sequences belonging to six Ae. tauschii accessions. a The distribution of the SNPs identified from 1 genotype or 2+ genotypes. None of the SNPs were identified from all six genotypes. b Distribution of SNPs commonly identified for two Ae. tauschii accessions