Introduction

The Chinese jujube (Ziziphus jujuba Mill.) is one of the most popular fruits in Asia, known for its delicious taste, high nutritional value, and high sugar and vitamin C contents (Chen et al. 2013; Gao et al. 2013; Liu and Wang 2009). With the rapid increase in jujube cultivation, there is an urgent need for breeding improvement (Wang et al. 2007). Despite great efforts that have been made in crossbreeding, limited progress has been achieved in the past decades due to the unique biological characteristics of the jujube tree, such as early flowers dropping, small flower size, and high percentages of seedless fruits (Liu et al. 2014a; Ma et al. 2008). Consequently, jujube breeding is far behind that of other popular fruit trees (Zhang et al. 2015). In particular, breeding strategies of crops have involved molecular marker-assisted selection (MAS) in which a molecular genetic linkage map is required as a fundamental resource; however, few high-density genetic maps have been constructed for the jujube. Therefore, an advanced breeding strategy and more genetic maps are necessary for jujube breeding as well as for identifying genes related to important genotypes.

The first molecular genetic map of the jujube was reported by Shen (2005), based on an F1 population generated from ‘Dongzao’ × ‘Linyillizao’. It comprised 14 linkage groups (LGs) and spanned 1237.4 cM with 333 amplified fragment length polymorphism (AFLP) markers. Qi et al. (2009) improved the map to 15 LGs spanning 1309.4 cM with 388 AFLP markers and 35 random amplified polymorphic DNA (RAPD) markers, with an average marker interval of 3.1 cM. However, these two linkage maps are not sufficient for fine mapping of quantitative trait loci (QTLs) due to the long marker interval distance. Also, AFLP and RAPD markers are not compatible for genome anchoring due to the lack of sequencing information. More recently, a high-density genetic map derived from the cross of Z. jujuba ‘JMS2’ × Z. jujuba var. spinosa ‘Xing 16’ was reported (Zhao et al. 2014) that contained 2872 SNP markers with an average marker interval of 0.34 cM and comprising 12 LGs. This map provides new insight for jujube crossbreeding and genome research, and also shows great potential for the discovery and application of SNP markers with next-generation sequencing (NGS) technology.

Several strategies for genome complexity reduction have proved to be efficient for SNP discovery in various fruit species, such as the grape (Wang et al. 2012a), pear (Wu et al. 2014), peach (Bielenberg et al. 2015), and sweet cherry (Guajardo et al. 2015). The jujube has a genome size of ∼360 Mb with a high level of heterozygosity (Liu et al. 2014b; Wu et al. 2013). A rapid and low cost approach had been considered for application in a large mapping population. Genotyping by sequencing (GBS) utilizes one or multiple restriction enzymes to digest genomic DNA into fragments that can be sequenced on high-throughput platforms (Elshire et al. 2011). A low cost for each sample can be achieved by multiplexing different samples and adding specific barcodes ligated to digested fragments prior to sequencing (Huang et al. 2014; Pootakham et al. 2015). GBS has been widely used in genetic linkage map construction (Bielenberg et al. 2015; Fiedler et al. 2015; Moumouni et al. 2015), scaffold anchoring, and genome assembly (Glazer et al. 2015; Mathew et al. 2014; Soto et al. 2015).

From the perspective of jujube breeding and improvement, a cross between two excellent cultivars would offer great potential for selection of superior progenies. Therefore, we performed hybridization between two excellent cultivars: ‘Dongzao’ and ‘Zhongningyuanzao’. Based on the F1 population, a new high-density genetic map was constructed using a GBS strategy. Sequenced scaffolds from the reference genome of Z. jujuba Mill. ‘Junzao’ (Accession: LPXJ 00000000) were also anchored based on this map.

Materials and methods

Plant material and DNA extraction

Two jujube cultivars, ‘Dongzao’ (8 years old) and ‘Zhongningyuanzao’ (8 years old), were selected as the hybrid parents. They exhibited obvious differences in several phenotypes, such as fruit shape, fruit quality, and fruit maturity time. The hybridization was performed in Zhongning County (37.48° N, 105.70° E), Ningxia, China, in 2012. In total, 385 seeds were harvested and 294 were grown into seedlings. Among them, 145 individuals were selected randomly for genetic map construction. All progenies and the two parents were maintained in the experimental field of the Ningxia Academy of Agricultural and Forestry Science (Zhongning County, Ningxia).

Young leaves from the parents and progenies were collected at 3 weeks after sprouting in May 2015. Collected leaves were immediately stored in liquid nitrogen, then transferred to the laboratory and stored at −80 °C. Genomic DNA was isolated using the MiniBEST Plant Genomic DNA Extraction Kit 9768 (Takara, Dalian, China) following the manufacturer’s protocol. DNA purity and concentration were determined using a NanoDrop 2000 UV-Vis Spectrophotometer (Thermo Fisher Scientific, USA), and finally checked on 1 % agarose gels.

GBS protocol and high-throughput sequencing

We used a GBS strategy in this study to develop SNP markers. First, we performed a GBS pre-design for restriction enzyme selection. In this step, the enzyme combination and sizes of digested fragments were predicted and evaluated according to the reference genome of ‘Junzao’ (Accession: LPXJ 00000000). Generally, the following characteristics were required for the selection of appropriate restriction enzymes: (1) The digested tags should be evenly distributed throughout the sequences to be examined; (2) Repeated tags should be avoided; (3) The paired-end length of each tag should be suitable for the coverage of the Illumina HiSeq 4000 sequencing platform; (4) The tag numbers must be sufficient for the subsequent steps. To maintain sequence depth uniformity of different fragments, a tight length range of ∼50 base pairs (bp) was selected.

Next, we constructed the GBS library according to the pre-designed scheme. The pipeline diagram was shown in Fig. 1. Here, three restriction enzymes were selected for DNA digestion and 0.1–1 μg of genomic DNA was incubated at 37 °C with MseI (New England Biolabs; NEB), T4 DNA ligase (NEB), ATP (NEB), and MseI Y adapter N containing barcodes, and then heat-inactivated at 65 °C. To further decrease the complexity and increase the sequencing depth and genome coverage, two additional enzymes, HaeIII and EcoRI (NEB), were simultaneously added into the MseI digestions to further digest the fragments at 37 °C. Then, the digested fragments with ligations were purified with Agencourt AMPure XP (Beckman) and subjected to PCR amplification using Phusion Master Mix (NEB) with universal primers as well as index primers to add index and complete i5 and i7 sequences. The PCR products were purified using Agencourt AMPure XP (Beckman) and pooled, then run on a 2 % agarose gel. Fragments of 400–450 bp (with indexes and adaptors) were cut from the gel and purified with a gel extraction kit (QIAGEN). These purified products were then further cleaned with Agencourt AMPure XP (Beckman) prior to sequencing. Then, paired-end 150-bp sequencing was performed on the selected tags on the Illumina HiSeq 4000 platform.

Fig. 1
figure 1

GBS pipeline diagram. The genomic DNA were digested with MseI and ligated with Y adapter N containing a barcode for each individual. Then two additional restriction enzymes HaeIII and EcoRI were simultaneously added into the MseI ligations for further digestion. Then PCR was applied for removing the fragments lacking MseI cutting sites in both pair-end and adding the index as well as completing i5 and i7 sequences on the Y adapters. Then PCR products were purified and pooled, and run on a 2 % agarose gel. Finally, fragment length varied from 400 to 450 bp (with indexes and adaptors) were cut from the gel and purified for sequencing

Sequence data analysis

The sequence data from each individual were sorted according to the barcodes in the raw reads. To ensure that reads were reliable and without artificial bias (such as low quality paired reads, which resulted mainly from base-calling duplicates and adapter contamination) in the following analysis, raw data (raw reads) in FASTQ format were first processed through a series of quality control (QC) procedures with in-house C scripts. QC standards were as follows: (1) Reads with ≥10 % unidentified nucleotides (Ns) were removed, (2) Reads with >50 % of bases having a Phred quality <5 were removed, (3) Reads with >10 nt aligned to the adapter were removed, allowing ≤10 % mismatches, and (4) Reads containing the HaeIII or EcoRI sequences were removed. Then, the Burrows-Wheeler Aligner (BWA) software (Li and Durbin 2009) was used to align the clean reads from each individual against the reference genome (settings: mem –t − 4 –k 32 –M -R). Alignment files were converted to bam files using the SAMtools software (Li et al. 2009) (settings: –bS –t). If multiple read pairs had identical external coordinates, only the pair with the highest mapping quality was retained.

SNP calling and genotyping

SNP calling was performed for parents and progenies using the SAMtools software (Li et al. 2009). The number of SNPs and types of transitions or transversions were counted. Then, a Perl script was used to filter the SNPs that had more than two genotypes. Polymorphic markers between the two parents were detected and classified into eight segregation patterns (ab × cd, ef × eg, hk × hk, lm × ll, nn × np, aa × bb, ab × cc, and cc × ab) according to the CP model in JoinMap 4.0 software (Van Ooijen 2011).

Linkage map construction and evaluation

Markers showing significantly distorted segregation (P < 0.001), integrity (> 65 %), or containing abnormal bases were filtered by JoinMap 4.0. The segregation patterns hk × hk and nn × np were used for the male parent map construction while the patterns lm × ll and hk × hk were used for the female parent map using JoinMap 4.0. The regression algorithm, three times circulation sequence, and Kosambi (1943) mapping function were used in marker distance calculation. The LOD value was set to 2.0–20. The integrated map for both the male and female parents was computed using the combine group for map integration function in MergeMap software (Wu et al. 2008). A Perl script SVG was used to visualize exported maps, and heat maps were constructed to evaluate the maps.

Anchoring sequenced scaffolds to the genetic map

Markers localized on the genetic map were used to anchor sequenced scaffolds of ‘Junzao’ using a Perl script. The visualized map of LGs that correlated with anchored scaffolds was then constructed and exported.

Results

Restriction enzyme selection for library construction

Before construction of the sequencing library, restriction enzymes were evaluated based on the predicted number of tags, length of fragments, and distribution across the reference genome. After scanning the entire ‘Junzao’ genome, a combination of three restriction enzymes, MseI (TTAA), HaeIII (GGCC), and EcoRI (GAATTC), was selected for GBS library construction. According to their restriction sites, the predicted sizes of the DNA fragments ranged from 400 to 450 bp and the number of tags was 99,050 with only 3.4 % repetitive tags. These tags covered 332,817,121 bp (94.74 %) of the genome and were distributed across 2745 scaffolds, accounting for 7.6 % of total scaffolds (Online Resource 1 ESM_1).

Analysis of GBS data

In total, 79.8 Gb of raw data containing 256,708,177 paired-end reads were generated by sequencing the parents and 145 progenies. After data filtering, 92 % of reads were high quality, with an average Q20 ratio of 93.79 % and a GC content of 35.31 %. The parents were sequenced at a higher level to enhance the chances of detecting more SNP markers. Finally, clean data covering 1,480,832,064 (99.99 %) and 1,840,679,712 (99.99 %) bp were obtained for the female and male parents, respectively. For each individual, the clean data ranged from 189,546,336 to 700,333,632 bp, with an average of 486,968,574 bp (Online Resource 2 ESM_2).

High-quality clean reads were aligned against the ‘Junzao’ genome. Consequently, 9,876,597 clean reads were obtained for the female parent and 7,802,479 clean reads for the male parent. For F1 individuals, an average of 2,789,866 clean reads was aligned to the reference genome. From all of these data, 98 % could be aligned to unique positions on the genome, while the remaining 2 % were aligned to either multiple or zero positions (Online Resource 3 ESM_3). Only reads aligned to unique positions on the reference genome were retained for the following SNP calling and genotyping.

SNP discovery and genotyping

SNP calling of the two parents and F1 individuals was performed with SAMtools (Li et al. 2009). In total, 136,178 and 109,275 SNPs were detected in the female and male parents, respectively. For F1 individuals, an average of 68,921 SNPs was discovered for an individual progeny. Among them, the parents exhibited a higher heterozygous rate (27.89 %) of SNPs than the F1 individuals (18.51 %) (Online Resource 4 ESM_4).

By excluding the missing information, 40,372 (54.99 %) polymorphic SNPs were detected between the two parents. These SNPs could be classified into eight segregation types according to the CP model in JoinMap 4.0. As shown in Fig. 2, seven patterns were detected. Among them, four major patterns including lm × ll, nn × np, hk × hk, and aa × bb accounted for nearly 99.8 %, while the other three patterns, ab × cc, cc × ab, and ef × eg, only accounted for 0.2 %. The type aa × bb was excluded, as it was homozygous in both parents and resulted in no segregation in F1 individuals. Only segregation types lm × ll, nn × np, and hk × hk were selected for genotyping in F1 individuals. Among the 35,733 selected polymorphic SNPs, the ratios of transitions and transversions were 56.49 and 43.51 %, respectively. Each type of transition (A/G and C/T) and transversion (A/C, A/T, C/G, and G/T) was nearly balanced (Online Resource 5 ESM_5).

Fig. 2
figure 2

Segregation types of polymorphic SNP markers. The x-axis indicates the eight segregation types; the y-axis indicates the corresponding number of markers

Genetic linkage map construction and evaluation

Available markers were filtered for <65 % integrity in each individual and chi-square tests with P < 0.001. Finally, 694 markers with hk × hk, 1755 markers with lm × ll, and 1656 markers with nn × np segregation types could be used for map construction (Online Resource 6 ESM_6). After data preparation, 2350 markers with types nn × np and hk × hk and 2449 markers with types lm × ll and hk × hk were used for the male and female map constructions, respectively. On the female map, 1409 (971 non-redundant) markers fell into 12 LGs and the genetic length was 1250.564 cM with an average marker interval of 1.38 cM (Online Resource 7 ESM_7). On the male map, 1214 (843 non-redundant) markers fell into 12 LGs; the genetic length was 1140.657 cM with an average marker interval distance of 1.49 cM (Online Resource 8 ESM_8). Heat maps reflect the linkage relationship between each marker in a single linkage group (Online Resource 9 ESM_9). Most of the LGs in both parent maps performed well in visualization.

Then the two parent maps were merged; the integrated map spanned 1456.53 cM with 2540 (1756 non-redundant) markers and fell into 12 LGs, which is consistent with the chromosome number of the jujube (Fig. 3, Online Resource 10 ESM_10). Among the 12 LGs, LG01 was the largest group, with a genetic distance of 183.8 cM and 292 markers. LG02 was the shortest group with 90 markers and spanning 74.27 cM. The average marker interval ranged from 0.54 to 1.20 cM, with an average distance of 0.88 cM (Table 1). Between the markers, 1745 gaps were detected. Among them, 1724 gaps (98.8 %) were less than 5 cM, 19 gaps were between 5 and 10 cM, and only two gaps were larger than 10 cM, which were on LG05 and LG07, respectively.

Fig. 3
figure 3

Linkage groups 01–12 of the integrated map. The x-axis indicates the numbers of linkage groups; the y-axis indicates the genetic length (cM)

Table 1 Characteristics of the integrated genetic map derived from the F1 population (‘Dongzao’ × ‘Zhongningyuanzao’)

Anchoring scaffolds of the sequenced jujube genome to the genetic map

The sequenced jujube genome (‘Junzao’) consisted of 36,147 scaffolds covering 351 Mb of sequences. The 1756 non-redundant markers were used as anchors to orient these scaffolds. In total, 316 unique scaffolds representing 224 Mb (63.7 %) of the total genome were localized on the 12 LGs (Fig. 4). LG01 anchored the highest number of scaffolds with a physical length of 32.8 Mb. LG12 anchored the lowest number of scaffolds with a length of 17.2 Mb. The correlation between genetic and physical length was 153.81 Kb/cM on average (Table 2).

Fig. 4
figure 4

Anchoring of sequenced scaffolds of ‘Junzao’ (Accession number: LPXJ00000000) to the 12 linkage groups. The x-axis indicates linkage groups (LG01–12) and scaffolds, the y-axis indicates the genetic length (cM)

Table 2 Anchored sequenced genome scaffolds of Z. jujuba ‘Junzao’ with SNP markers

Based on the scaffold orientation information of the draft genome of ‘Junzao’, we compared the orders and directions of scaffolds anchored by the current map. As shown in Table 3, our map shared 264 anchored scaffolds with the anchored genome. Of them, 180 (68.2 %) were in accordance with the placement orders. Most of the LGs showed a good collinear relationship between the different anchorings, such as LG04, 05, 06, and 11 (ESM_11a and b). In addition, 52 scaffolds covering 7,056,409 bp of data were recruited with the current map. These scaffolds were distributed on 12 LGs, and the sizes ranged from 406 to 1,210,059 bp. LG11 was enriched with the highest number (14) of new scaffolds, while LG02 and LG03 had the lowest number with only one new scaffold each. In this way, the linkage map of the ‘Junzao’ jujube genome was extended by 7 Mb with new anchored scaffolds (Online Resource 12 ESM_12).

Table 3 Comparison of anchored scaffold information with ‘Junzao’ genome assembly

Discussion

GBS is a rapid, efficient, and cost-effective strategy for SNP development, genetic linkage map construction, marker-based complex trait selection, and draft genome assembly in many species with or without reference genomes (Elshire et al. 2011; He et al. 2014; International Barley Genome Sequencing Consortium et al. 2012; Poland et al. 2012; Ward et al. 2013). Here, we used the combination of three restriction enzymes for GBS library construction. MseI has a digestion site with a 4-bp sequence (TTAA) and has a higher distribution frequency on the jujube genome. The additional two enzymes HaeIII and EcoRI were added to further digest fragments after MseI. The combination of three restriction enzymes improved the efficiency of GBS by increasing the tag number, sequencing depth, and genome coverage, and also supplied more chances to detect suitable regions for targeted fragments. In total, 79.8 Gb of raw sequencing data with 92 % clean data were generated, and finally 98 % of the clean data were mapped to unique positions on the reference genome. Thus, these data are indicative of good results in the library construction. Moreover, 73,417 SNPs and 40,372 (54.99 %) polymorphic SNPs were detected between the two parents. A previous study revealed that ‘Dongzao’ and ‘Zhongningyuanzao’ had a distant genetic relationship (Huang et al. 2015), which could increase the chances of detecting more polymorphic markers. These results demonstrated that GBS is a low cost, high efficiency, and rapid approach for SNP development and map construction.

The integrated genetic linkage map comprised 2540 (1756 non-redundant) SNP markers and spanned 1456.33 cM, with an average marker interval of 0.88 cM. The map was divided into 12 LGs and was consistent with the haploid chromosome number. Compared with previously reported linkage maps (Qi et al. 2009; Shen 2005), a large number of markers with sequencing information as well as a higher map density with a shorter marker interval distance were achieved in the current study. However, this map is still lower density than the ‘interspecific’ map reported by Zhao et al. (2014), which is attributed to the different sequencing strategies and cross combinations. The RAD strategy could identify more SNP alleles with its higher genome coverage (Baird et al. 2008; Davey et al. 2011). Notably, interspecific hybridization used in genetic map construction would contribute to more recombination and polymorphic markers (Wang et al. 2012a). However, the new high-density map developed by crossing two excellent jujube cultivars will provide more breeding resources for further research.

A high-density genetic linkage map can facilitate genome assembly and has been one of the fundamental components of genome sequencing (Gaur et al. 2012). Assisted by the high-density genetic map, we anchored 63.7 % of assembled scaffolds (224 Mb) of the ‘Junzao’ genome. Comparison of our map to the previously anchored genome revealed a highly collinear relationship between the genetic and physical maps. Also, 52 newly anchored scaffolds extended the genome by 7 Mb, demonstrating that the high-density genetic map was useful and efficient for genome assembly improvement. Additionally, the inconsistent scaffold placement order may be attributed to the different cultivars sequenced for the genetic map and genome sequencing. Rearrangements, translocations, gains or losses of DNA segments, and copy number variations (CNVs) have been widely observed in different genotypes of the same species (Swanson-Wagner et al. 2010; Zmienko et al. 2014). On the other hand, the markers on different genetic maps could influence the anchoring results. Certainly, improper mapping or errors present in the genome assembly would also contribute to inconsistent placement orders (Soto et al. 2015). We calculated the relationships between the genetic and physical maps in the present study. The ratio of physical/genetic distance was 153.81 Kb/cM on average, and this information will be useful for further research on gene cloning and genome structure analysis.

Segregation distortion is regarded as the driving force of evolution (Cai et al. 2015; Taylor and Ingvarsson 2003). Segregation distortion is widely present in many species and is affected by population types and environmental factors (Wang et al. 2012b; Zhou et al. 2015; Zhou et al. 2014). It is perhaps caused by gametophyte and/or zygotic selection and chromosomal rearrangements (Faure et al. 1993; Wang et al. 2012b). Considering the self-incompatible behavior of the jujube (Asatryan and Tel-Zur 2013a, b), unsurprisingly, 2337 (36.28 %) distorted markers were found in the mapping process. A similar phenomenon with the female parent ‘Dongzao’ was also reported by Shen (2005). Although these markers could increase the map density and might not significantly influence QTL mapping (Xu 2008; Zhang et al. 2010; Zhang et al. 2013), these markers were filtered and not used in this study.

Conclusions

We used GBS technology to develop SNP markers based on an F1 population derived from the cross of ‘Dongzao’ × ‘Zhongningyuanzao’. Using this map, we anchored 63.7 % (224 Mb) of assembled genome scaffolds, containing 52 new scaffolds, which extended the draft genome assembly by 7 Mb. These results have enriched marker resources with sequencing information of the jujube. The newly constructed high-density genetic map will serve as an efficient tool for molecular-assisted breeding and genomic studies. This work will contribute to the further research and improvement of the jujube in the near future.