Introduction

Recent years have seen a massive rise in the volume of data generated by next-generation sequencing (NGS) technologies, which now form the basis of most genome studies, ranging from the assembly of draft genome sequences to genome diversity analysis. Genomic information is becoming increasingly available for Brassica and Cicer species. A public B. rapa genome was published in 2011 (Wang et al. 2011), and two genomes of B. oleracea were published recently (Liu et al. 2014; Parkin et al. 2014), together with that of B. napus (Chalhoub et al. 2014). Draft references of both kabuli and desi Cicer arietinum genomes were also published in 2013 (Jain et al. 2013; Varshney et al. 2013). The availability of these reference genomes enables the discovery of sequence-based molecular markers and their association with agronomic traits for applied crop improvement (Edwards and Batley 2010; Edwards et al. 2013; Hayward et al. 2012b).

Recombination is one of the major sources of genetic variation, shuffling sets of genes to produce novel allelic combinations. Both reciprocal exchange between homologous chromosomes (crossover) and non-reciprocal exchange (non-crossover) can occur, and both are initiated by the repair of double-strand breaks (DSBs) in DNA during meiosis, reviewed by Chen et al. (2007). A subset of non-crossover events referred to as gene conversions result in fragments from homologous regions in the partner chromosome being used as a template for DSB repair (Mezard et al. 2007). The positioning of crossovers is well known in many plant species through cytological observation and through recombination mapping, but the frequency of gene conversion is mostly unknown in crops (Gaut et al. 2007). Recombination events in plants have frequently been genetically mapped using molecular markers (Farkhari et al. 2011; Yao et al. 2002). These studies showed variation in recombination frequencies for the same recombination bins across different populations. In general, pericentromeric regions showed the lowest frequency of recombination, and telomeric regions showed the highest frequency of recombination.

One study used 13,551 SRAP markers to produce a recombination map for B. napus (Westar × Zhongyou 821). The study identified 1663 crossovers in 58 double-haploid lines, which corresponds to 1.51 crossovers per individual per chromosome (Sun et al. 2007). Other studies used genotyping by sequencing (GBS) based on restriction site-associated genomic DNA (RAD tags) to produce genetic maps. In maize and barley, an early GBS approach was able to map 200,000 and 25,000 sequence tags, respectively (Elshire et al. 2011). Another approach used two different restriction enzymes to reduce genomic complexity and was able to map 34,000 SNPs in barley and 20,000 SNPs in wheat (Poland et al. 2012).

The distribution of recombination has been mapped in A. thaliana, where 40 F2 individuals from lines Columbia and Landsberg erecta were resequenced to assess the distribution of crossovers and gene conversions (Yang et al. 2012). This study identified more than 3000 gene conversions and 73 crossovers per plant. Interestingly, the majority (72.6 %) of smaller crossover blocks (10–500 kb) were found in pericentromeric regions, while larger crossover blocks were distributed evenly among all chromosomes. A follow-up study repeated the analysis, but under the assumption that genomic rearrangements confound the mapping of short reads (Wijnker et al. 2013). The researchers removed all markers near putative duplicated regions to counter the errors introduced by mismapped reads, leading to an estimated one to three gene conversions and ten crossovers per meiosis, a much smaller number than presented in Yang et al. (2012). A recent study hypothesised that the large number of gene conversions found in Arabidopsis by Yang et al. (2012) is due to false-positive SNPs caused by copy-number variation and by mismapping in duplicated regions arising from transposable elements and tandem repeats (Qi et al. 2014). After removal of these erroneous SNPs, 11 crossovers could be identified in two plants, which is equal to an average of 1.1 crossovers per chromosome. A smaller number of non-crossovers could be identified, with five non-crossovers in one plant and one non-crossover in the other.

One of the limitations of most forms of genotyping is that only a restricted set of the total SNPs are assayed across a population. While this is efficient for the identification of major recombination events used for genetic mapping, the resolution is fixed by the restriction site density. With the decreasing cost of NGS data generation and the increasing availability of reference genomes, it is becoming cost-effective to generate whole genome sequence data for GBS applications. To this end, we have developed a novel GBS approach called skim-based genotyping by sequencing (skimGBS) which uses low-coverage whole genome sequencing for high-resolution genotyping. We demonstrate the application of this approach to genotype a double-haploid (DH) canola population derived from a cross between the cultivars Tapidor and Ningyou 7, as well as a population of C. arietinum recombinant inbred lines (RILs). Using this approach, it was possible to generate genome-wide recombination maps and to assess and compare the frequency of crossover and gene conversion events. We hypothesise that inflated numbers of crossovers and gene conversions are caused by errors in the reference assembly, and that the correct number of gene conversions and crossovers should be similar to those reported by Wijnker et al. (2013) and Qi et al. (2014).

Materials and methods

SkimGBS is a two-stage method that requires a reference genome sequence, genomic reads from parental individuals and individuals of the population. Firstly, the parental reads are mapped to the reference genome and SNPs are called using SGSautoSNP (Lorenc et al. 2012). Subsequent mapping of the progeny reads to the same reference and comparison with the parental SNP file enables the calling of the parental genotype. According to the SGSautoSNP protocol, read data were not trimmed or filtered.
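The second stage can be illustrated with a minimal sketch, assuming a simplified data model in which each parental SNP is stored with its two parental alleles and the progeny alignment has already been reduced to the bases observed at each SNP position. The function name, genotype codes and example positions are hypothetical and are not taken from the skimGBS scripts.

```python
def call_genotype(observed_bases, allele_a, allele_b):
    """Assign a parental genotype at one SNP from bases seen in progeny reads.

    Returns "A" or "B" when only one parental allele is seen, "H" when both
    are seen, "E" when only non-parental bases are seen and "-" when no read
    covers the position. Non-parental bases are ignored if a parental allele
    is also present (an assumption of this sketch).
    """
    bases = set(observed_bases)
    if not bases:
        return "-"
    has_a = allele_a in bases
    has_b = allele_b in bases
    if has_a and has_b:
        return "H"
    if has_a:
        return "A"
    if has_b:
        return "B"
    return "E"


# Hypothetical parental SNP table: (chromosome, position) -> (allele A, allele B)
parental_snps = {
    ("A01", 1200): ("G", "T"),
    ("A01", 5400): ("C", "A"),
    ("A01", 9100): ("T", "C"),
}
# Hypothetical bases observed in one low-coverage progeny individual
progeny_pileup = {("A01", 1200): ["G"], ("A01", 5400): ["A", "A"]}

genotypes = {
    site: call_genotype(progeny_pileup.get(site, []), a, b)
    for site, (a, b) in parental_snps.items()
}
```

At skim coverage most positions are covered by zero or one read, so the missing code is expected to be common in such a table.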

For B. napus, two reference sequences relating to the B. napus diploid progenitors were used for mapping reads: the A-genome (Wang et al. 2011) and the C-genome (Parkin et al. 2014). The Brassica population consisted of 92 double-haploid Tapidor × Ningyou 7 individuals from the TNDH mapping population previously described (Qiu et al. 2006) (Supplementary Table 1). The C. arietinum population consisted of 46 PI489777 × ICC4958 F9-F10 RILs (Supplementary Table 2) (Gaur et al. 2012) and reads were aligned to the published kabuli reference genome (Varshney et al. 2013). Both parental and offspring reads were aligned using SOAPaligner v2.21 (Li et al. 2009), using only reads that map uniquely (setting: ‘-r 0’, maximum mismatch: standard of 2) with a generous insert size (0–1000). Only reads aligning in pairs were used in subsequent analyses. The genomic reads of both populations have been deposited in the Short Read Archives and are collected in two BioProjects at http://www.ncbi.nlm.nih.gov/bioproject/PRJNA274890 and http://www.ncbi.nlm.nih.gov/bioproject/PRJNA274892.

SNPs for the parental genomes were called using SGSautoSNP (Lorenc et al. 2012). A custom script (‘snp_genotyping_all.pl’) compared the progeny read alignments with parental genotypes to assign genotypes. SNPs that had more than 85 % missing alleles were removed with a simple Python script (‘RemoveEmptyeSNPs.py’). We implemented a simple method of sideways imputation (‘imputeFlapjackAlleles.py’), which assumes that no recombination occurs between two called alleles of the same parental genotype. For example, if there are two missing alleles in between two Ningyou genotypes in an individual, then the two missing alleles are imputed with the Ningyou genotype.
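The missing-allele filter and the sideways imputation can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code of the scripts named above: genotypes are single-character codes, "-" marks a missing allele, and the function names are hypothetical.

```python
MISSING = "-"


def filter_snps(snp_rows, max_missing=0.85):
    """Keep SNPs (one row per SNP, one column per individual) whose
    fraction of missing alleles does not exceed max_missing."""
    return [
        row for row in snp_rows
        if row.count(MISSING) / len(row) <= max_missing
    ]


def impute_sideways(calls):
    """Fill runs of missing alleles along one individual's chromosome when
    the nearest called alleles on both sides carry the same genotype."""
    imputed = list(calls)
    last_idx = None  # index of the most recent non-missing call
    for i, call in enumerate(calls):
        if call == MISSING:
            continue
        if last_idx is not None and calls[last_idx] == call:
            for j in range(last_idx + 1, i):
                imputed[j] = call
        last_idx = i
    return imputed


# "N" = Ningyou 7 allele, "T" = Tapidor allele (codes chosen for this example)
example = list("N--N-T--")
result = impute_sideways(example)
```

In this example the two missing alleles between the Ningyou calls are filled, whereas the gap between a Ningyou and a Tapidor call and the trailing gap are left missing, because the position of any recombination within them is unknown.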

In both populations, some individuals exhibited a much higher heterozygosity than the others in the population: 10 individuals in the C. arietinum population and 25 in the B. napus population. These were removed from all subsequent steps of analysis.

Gene conversion events have previously been defined as being shorter than 10 kb in length and longer than 20 bp (Yang et al. 2012). Additionally, we defined a gene conversion block as containing at least two SNPs. It follows from this definition that crossover events are longer than 10 kb. Crossovers and gene conversions that shared their start or endpoints within the resolution offered by the skimGBS data were removed using a custom script (‘fuzzyRecombinationFilter.py’). For each individual, the total number of gene conversions, crossover events and the number of nucleotides covered by these were counted, as well as the distribution of recombination and gene conversion events. The Shapiro–Wilk test, t test and Spearman’s rank correlation coefficient test were performed using R v3.0.1 using the functions shapiro.test(), t.test() and cor(). The distribution of recombination events was plotted using Python v2.7. The centromere positions for Brassica were derived from Cheng et al. (2013) and Parkin et al. (2014).
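The length-based definitions above can be expressed as a short sketch. It assumes that the genotype calls of one individual on one chromosome are available as position-sorted (position, genotype) pairs with missing calls excluded; the function names are hypothetical, and the treatment of blocks of exactly 10 kb and of blocks at chromosome ends is a choice made for this example only.

```python
from itertools import groupby


def haplotype_blocks(calls):
    """Collapse position-sorted (position, genotype) calls into blocks of
    consecutive identical genotypes: (start, end, genotype, n_snps)."""
    blocks = []
    for genotype, run in groupby(calls, key=lambda c: c[1]):
        run = list(run)
        blocks.append((run[0][0], run[-1][0], genotype, len(run)))
    return blocks


def classify_block(start, end, n_snps, min_len=20, max_gc_len=10000, min_snps=2):
    """Label a block as a crossover block (> 10 kb) or a gene conversion
    (> 20 bp, < 10 kb and at least two SNPs); anything else is unclassified."""
    length = end - start
    if length > max_gc_len:
        return "crossover"
    if min_len < length < max_gc_len and n_snps >= min_snps:
        return "gene_conversion"
    return "unclassified"


# Hypothetical calls for one individual ("T" = Tapidor, "N" = Ningyou 7)
calls = [(100, "T"), (30000, "T"), (50000, "N"), (50400, "N"),
         (52000, "N"), (80000, "T"), (140000, "T")]
blocks = haplotype_blocks(calls)
labels = [classify_block(s, e, n) for s, e, _, n in blocks]
```

Here the 2 kb Ningyou block flanked by long Tapidor blocks is labelled as a gene conversion, and a block supported by a single SNP would remain unclassified.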

Results

Brassica napus

A total of 78.8 and 46.0 Gbp of whole genome sequence data, representing 69.6× and 40.6× genome coverage, were generated for the parental cultivars Tapidor and Ningyou 7. After mapping these reads to the B. rapa cv. Chiifu and B. oleracea cv. TO1000 draft genome assemblies using SOAPaligner (Li et al. 2009), SGSautoSNP (Lorenc et al. 2012) identified a total of 880,809 intervarietal SNPs. Of these, 840,264 (95 %) were distributed across the 19 pseudomolecules and the remaining 40,545 SNPs were located on unplaced contigs (see Table 1).

Table 1 Predicted SNPs in B. napus between the cultivars Tapidor and Ningyou

Illumina genome sequence data were generated for 92 individual progeny of a Tapidor × Ningyou 7 population, with an average coverage of 1.3× per individual and ranging from 0.1× to 7.36× (Supplementary Table 1). After read mapping to the reference, an average of 313,590.25 alleles could be called per individual, with a minimum of 7000 and a maximum of 602,133 alleles called per individual. The relationship between coverage and called SNPs in the population is shown in Fig. 1. This shows that with about 1.5 Gbp of coverage, the majority of SNPs are called.

Fig. 1 Relationship between the number of called SNPs and number of aligned reads for each of the 92 Brassica napus DH individuals

An estimate was made of the frequency of miscalled alleles due to sequence errors. Across the 92 individuals, 0.065 % of called alleles (19,219) were different from either of the parental alleles and presumed to be due to sequence error. As these errors represent two possible non-parental alleles, we estimate the frequency that a sequence error calls the incorrect parental allele to be 0.032 % (1 in 3000).
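The arithmetic behind this estimate can be written out as a short sketch. It rests on the assumption, implied by the text, that the three possible base substitutions at a biallelic SNP are equally likely, so that wrong-parent calls occur at half the rate of the observed non-parental calls; the variable names are illustrative.

```python
# Percentage of called alleles matching neither parental allele (from the text)
nonparental_pct = 0.065

# A miscalled base can be one of three alternatives: two are non-parental
# (and therefore detectable), one is the other parental allele (undetectable).
# Assuming equal likelihood, the undetectable rate is half the detectable rate.
wrong_parent_pct = nonparental_pct / 2

# Express the same rate as "one in N" genotype calls
one_in = 100 / wrong_parent_pct
```

This gives approximately 0.032 %, or roughly one call in 3000.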

Due to the low coverage of the population, for many SNPs only a few alleles could be called in the population. After removal of SNPs with more than 85 % missing alleles, 794,837 SNPs (90.2 %) remained, with an average of 306,982 called alleles per individual. Sideways imputation raised the total number of alleles in the population from 28,242,426 to 62,903,177, with an average per individual of 683,730.

After removal of the SNPs with too many missing alleles in the population, a very large number of crossovers and gene conversions could be identified (see Supplementary Table 3). There was a wide variation in heterozygosity between individuals, and 25 individuals with high heterozygosity were removed from further analysis.

After the initial analysis, the A-genome exhibited on average 151.18 crossovers and 697.85 gene conversions per individual. In the A-genome, TN80 exhibited the smallest number of gene conversions (76), TN21 exhibited the smallest number of crossovers (85), TN98 showed the highest number of gene conversions (19,047), and TN100 showed the highest number of crossovers (536).

Similarly, the C-genome exhibited on average 115.53 crossovers and 374.85 gene conversions per individual. TN80 exhibited the smallest number of gene conversions (93), TN21 the smallest number of crossovers (37), TN65 the highest number of gene conversions (1628), and TN100 the highest number of crossovers (337). Close examination of these results suggested that many were due to structural differences, arising either from differences between the reference cultivars and Tapidor and Ningyou 7 or from misassemblies in the reference genomes, and so a filter was applied to remove all overlapping gene conversions and crossovers. A total of 16,943 crossovers and 70,894 gene conversions from the 67 individuals were removed, with more gene conversions and crossovers removed from the A-genome than from the C-genome.
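One possible reading of the overlap filter is sketched below. It assumes that an event is treated as suspect when its start or end position lies within a fixed tolerance of the start or end of an event on the same chromosome in a different individual, on the reasoning that independent recombination events are unlikely to share breakpoints. The tolerance value, the event representation and the function name are hypothetical and do not reproduce the logic of ‘fuzzyRecombinationFilter.py’.

```python
def filter_shared_boundaries(events, tolerance=1000):
    """Drop events whose start or end lies within `tolerance` bp of the
    start or end of an event in another individual on the same chromosome.

    events: list of dicts with keys "individual", "chrom", "start", "end".
    """
    kept = []
    for ev in events:
        shared = any(
            other["individual"] != ev["individual"]
            and other["chrom"] == ev["chrom"]
            and (
                abs(other["start"] - ev["start"]) <= tolerance
                or abs(other["end"] - ev["end"]) <= tolerance
            )
            for other in events
        )
        if not shared:
            kept.append(ev)
    return kept


# Hypothetical events: the first two share a start point across individuals
events = [
    {"individual": "TN1", "chrom": "A01", "start": 50000, "end": 52000},
    {"individual": "TN2", "chrom": "A01", "start": 50200, "end": 58000},
    {"individual": "TN3", "chrom": "A01", "start": 900000, "end": 1500000},
]
kept = filter_shared_boundaries(events)
```

The pairwise comparison is quadratic in the number of events; for the event counts reported here, sorting events by position and comparing only neighbours would be a more practical design.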

From the A-genome, 9655 crossovers (95.32 %) and 46,245 gene conversions (98.91 %) were removed, compared to removal of 7288 crossovers (94.15 %) and 24,649 gene conversions (98.14 %) from the C-genome. The difference in crossovers and gene conversions between the A-genome and the C-genome was statistically significant (two-tailed Student’s t test, crossovers p < 0.001, gene conversions p < 0.00001).

After filtering, we identified 927 crossovers, 13.84 per individual or 0.73 per chromosome and individual. These ranged from 0 in TN2 to 249 in TN100. In addition, we identified 977 gene conversions, 14.58 per individual or 0.76 per chromosome and individual, ranging from 10 in TN18 to 20 in TN7 (see Supplementary Table 4). An overview of chromosome A1 before and after filtering is presented in Figs. 2 and 3.

Fig. 2 Recombination map for Brassica napus chromosome A1 before filtering of overlapping recombinations. Red genotype Tapidor, blue genotype Ningyou, white missing. Each line is one individual; from top to bottom: TN9, TN99, TN98, TN97, TN94, TN93, TN90, TN8, TN89, TN88, TN87, TN86, TN85, TN83, TN82, TN80, TN7, TN78, TN76, TN75, TN74, TN73, TN65, TN5, TN57, TN54, TN4, TN48, TN47, TN46, TN45, TN44, TN43, TN42, TN41, TN40, TN3, TN39, TN38, TN37, TN36, TN35, TN34, TN32, TN31, TN30, TN2, TN29, TN28, TN27, TN26, TN25, TN24, TN22, TN21, TN20, TN1, TN19, TN18, TN17, TN16, TN15, TN14, TN12, TN11, TN10 and TN100 (colour figure online)

Fig. 3 Recombination map for Brassica napus chromosome A1 after filtering of overlapping recombinations. Red genotype Tapidor, blue genotype Ningyou, white missing. Each line is one individual; from top to bottom: TN9, TN99, TN98, TN97, TN94, TN93, TN90, TN8, TN89, TN88, TN87, TN86, TN85, TN83, TN82, TN80, TN7, TN78, TN76, TN75, TN74, TN73, TN65, TN5, TN57, TN54, TN4, TN48, TN47, TN46, TN45, TN44, TN43, TN42, TN41, TN40, TN3, TN39, TN38, TN37, TN36, TN35, TN34, TN32, TN31, TN30, TN2, TN29, TN28, TN27, TN26, TN25, TN24, TN22, TN21, TN20, TN1, TN19, TN18, TN17, TN16, TN15, TN14, TN12, TN11, TN10 and TN100 (colour figure online)

After filtering, chromosome C4 had the highest number of crossovers per individual, ranging from 0 to 40, with an average of 1.42, whilst chromosome C5 had the lowest average number of crossovers at 0.28. The number of gene conversions per chromosome was very similar with all chromosomes carrying zero to two gene conversions (see Supplementary Table 5). After filtering, there was no difference in crossovers or gene conversions when comparing the A-genome with the C-genome (two-tailed Student’s t test, crossovers p > 0.7, gene conversions p > 0.6). The distribution of crossovers on chromosomes was plotted (Supplementary Figs. 1–19).

Cicer arietinum

A total of 7.2 and 5.9 Gbp of sequence data were generated from the two C. arietinum cultivars PI489777 (wild-type) and ICC4958, which represent an estimated coverage of 9.7× and 7.9×, respectively (Supplementary Table 2). A total of 555,346 SNPs were identified using SGSautoSNP (Lorenc et al. 2012), of which 448,619 (80.7 %) were distributed over the eight chromosomes and 106,727 were located on unplaced contigs. A total of 20.9 Gbp of Illumina paired read sequence data were generated for 46 progeny individuals (between 0.13× and 1.54×, with an average of 0.58×). Mapping these reads to the reference led to between 37,444 (RIL12) and 268,431 (RIL43) called SNPs, with an average of 147,363 per individual. 43,722 SNPs had too many missing alleles (>85 %) and were discarded (Table 2). Sideway imputation added 15,221,101 alleles, leading to a total of 21,999,801 called alleles, an average of 478,256 per individual.

Table 2 Predicted SNPs in C. arietinum between accessions PI489777 and ICC4958

Out of a total of 6,778,700 called alleles, 6440 (0.09 %) exhibited heterozygosity, with 10 individuals exhibiting high heterozygosity. These individuals were removed from subsequent analyses.

Crossovers and gene conversions were predicted following the same approach as for Brassica. Prior to filtering, crossovers totalled 3960, an average of 110 per individual, while gene conversions totalled 4675, or 129.86 per individual. After filtering, 219 crossovers and 256 gene conversions remained (see Supplementary Table 6); gene conversions ranged from 5 (RIL4) to 22 (RIL29) per individual, and crossovers ranged from 0 (RIL7) to 60 (RIL29). An overview of chromosome 1 before and after filtering is presented in Supplementary Figs. 20 and 21.

After filtering, out of the eight C. arietinum chromosomes, chromosome 6 had the highest average number of crossovers with 1.44 and chromosome 3 had the lowest average number of crossovers with 0.19 (ranging from 0 to 3 in individuals). The number of gene conversions varied over the chromosomes, from an average of 0.69 in chromosome 7 to 1.0 in chromosome 5 (see Supplementary Table 7). The distribution of all crossovers was plotted (Supplementary Figs. 22–29).

Discussion

Here, we present the application of a skim-based genotyping by sequencing (skimGBS) method in B. napus and C. arietinum to assess the frequency and distribution of recombination. SGSautoSNP has been previously used to successfully predict SNPs in B. napus with an accuracy of >95 % (Hayward et al. 2012a) and in wheat with an accuracy of 93 % (Lai et al. 2014). By combining this SNP discovery method with skimGBS, we can assess the segregation of SNPs in a population. All scripts for the skimGBS pipeline are available at: http://www.appliedbioinformatics.com.au/index.php/SkimGBS.

We demonstrate that SkimGBS can be used to genotype a greater number of SNPs than previous approaches in these species. For example, in B. napus without imputation we genotyped an average of 313,590 SNPs per individual compared to 2604 using RAD-based GBS (Bus et al. 2012). SkimGBS was able to call more SNPs than earlier GBS approaches (Elshire et al. 2011; Poland et al. 2012), since there are no genomic complexity reduction steps in the SkimGBS pipeline.

The relatively high rates of sequence error found in next-generation DNA sequence data are a potential source of genotype miscalling. We estimate that 0.032 % of genotypes (about one in 3000) are erroneously called in our analysis due to sequence error. Erroneous SNPs may also be predicted due to mismapping of reads to the reference genome. Due to the low coverage of the population, for about 10 % of the SNPs more than 85 % of the alleles in the population were missing. We removed these SNPs from further steps as these may have a negative impact on the sideways imputation.

We used only non-repetitively aligning reads to minimise the number of SNPs from homeologous regions (Lai et al. 2014; Lorenc et al. 2012). As we require two adjacent SNPs to call a gene conversion and need both SNPs to be at least 20 bp apart, we estimate the frequency of miscalled gene conversions due to sequence error to be negligible. We observed that some individuals in the B. napus population had a relatively high frequency of heterozygous alleles. This was unexpected as the population was produced as double haploids and so should be homozygous. We expect that the heterozygous individuals were due to pollen flow between lines during population development, and so these individuals were removed from the analysis.

Due to the low coverage of the sequence-based genotyping, some alleles were not called and so we used sideway imputation to predict these missing alleles, increasing the average number of alleles from 306,982 to 683,730 per individual in Brassica. While imputation allows for improved visualisation of haplotype blocks, it is not required to determine haplotype blocks or recombination events. There was weak to moderate correlation between the number of aligned reads and number of both crossovers (−0.22) and gene conversions (−0.54) (Supplementary Tables 8 and 9), suggesting that the majority of recombination events were captured. There was a high correlation (0.81) between the number of aligned reads and the number of heterozygous SNPs for an individual. This is due to the fact that higher coverage is required to observe a heterozygous SNP. For a heterozygous SNP to be observed, at least two reads have to align to the locus, and due to the low coverage of skimGBS many heterozygous SNPs may be missed.

Initial results suggested that gene conversions outnumbered crossovers in B. napus and C. arietinum, with the ratio of gene conversions to crossovers similar to that observed in Arabidopsis by Yang et al. (2012). A subsequent paper by Wijnker et al. (2013) suggested that small genomic rearrangements may lead to falsely high counts of gene conversion events. After filtering to remove genotypes around potentially rearranged regions, the number of gene conversions and crossovers in our study reduced to levels observed in Arabidopsis by Wijnker et al. (2013). After filtering, an average of 0.76 gene conversions and 0.73 crossovers per individual and per chromosome were detected in B. napus. The number of crossover and gene conversion events per meiosis differs between species due to various factors, and is dependent on the number and size of chromosomes present. Also, only 50 % of the total recombination events occurring in the F1 meiosis can be detected by progeny testing, as recombination events occur between only two of the four chromatids comprising a homologous chromosome pair, and only one chromatid is subsequently retained in gamete production. However, these results are in the same range as the 1–3 gene conversions and ten crossovers per meiosis (or 0.2–0.6 gene conversions and 2 crossovers per chromosome per individual) detected by Wijnker et al. (2013) and the average of one crossover per chromosome and 0.2–1 gene conversion per chromosome reported by Qi et al. (2014) using similar methods.

The RIL population of C. arietinum exhibited a similar number of crossovers and gene conversions to B. napus. One individual in the population had a much higher number of gene conversions than the rest of the population, leading to the possibility that this individual skewed the average. Some chromosomes show a greater abundance of crossovers towards the telomeres, but others exhibit a more even distribution (Supplementary Figs. 22–29).

In both populations there are individuals that after filtering exhibit either a much higher non-crossover rate (RIL29 in the Cicer population) or crossover rate (TN100 in the Brassica population) than the rest of the population. It could be that these were missed in the filtering step of non-homozygous individuals, or that the reads for these individuals were actually from several different individuals, leading to a larger number of recombinations.

Interestingly, we observed a difference in erroneously called recombination events between the three genomes used as references in this study, with more errors in the Brassica A-genome than the C-genome, and fewer again in the C. arietinum genome. This corresponds with genome assembly quality and likelihood of misassembled regions. The Brassica diploid genomes are highly complex, sharing a whole genome triplication (Liu et al. 2014; Parkin et al. 2014; Wang et al. 2011), and the recent assembly of the Brassica C-genome (Parkin et al. 2014) is of greater quality than the A-genome assembly which was published 3 years earlier (Wang et al. 2011). While the C. arietinum genome reference carries some misassembled regions (Ruperao et al. 2014), this relatively simple genome, with no recent genome duplications, and produced using the latest sequencing chemistry and assembly methods, is likely to have fewer misassembled regions than the Brassica genomes.

Previous studies suggest that lower numbers of recombination events occur around centromeres and a greater number of crossover events occur towards telomeres (Farkhari et al. 2011; Helms et al. 1992; Roberts 1965). In human genomes, DSBs and recombination hotspots exhibit specific sequence motifs: for example, polypurine and polypyrimidine tracts are overrepresented in regions of gene conversions (Chen et al. 2007). In A. thaliana, recombination hotspots seem to be biased towards regions with a high AT content, located away from methylated DNA and carrying at least two distinct sequence motifs (Wijnker et al. 2013). Other studies such as that by Drouaud et al. (2013) showed distinct recombination hotspots and related the results to proteins such as MSH4.

In addition to predicted recombination, we observed regions of the genome which demonstrated an alternative haplotype structure compared to the surrounding regions across all individuals. These regions reflect major differences in structure between the reference genomes used for read mapping and the genomes of the sequenced population. While these positions were removed from the analysis of recombination in this study, they offer the potential to validate genome structural assemblies and characterise differences in genome structure at a high resolution. Due to the early draft status of both genome annotations (compared to the highly validated annotations for A. thaliana), we did not compare the distribution of recombination to genetic content.

This study demonstrates high-resolution skimGBS in two important crops and identifies gene conversion and crossover recombination with high precision. The skimGBS approach is flexible, with relatively little data required for trait association, while increasing the volume of sequence data enables fine mapping of recombination events, detailed characterisation of gene conversions as well as the potential to validate genome assemblies and identify structural variations. The continued decline in the cost of generating genome sequence data should lead to an increase in the application of skimGBS for crop improvement.

Author contribution statement

PEB and PR ran the pipeline and analysed the results. PEB wrote parts of the SkimGBS pipeline and co-wrote the manuscript. ASM provided valuable critique and discussion of recombinations and their positions. JS and CKC wrote parts of the SkimGBS pipeline. SH, YL, JM, TS, PV and RV generated the DH and RIL populations, provided genetic material and genomic data and provided genome references. JB and DE conceived the study and co-wrote the manuscript.