7.1 Introduction

Knowledge of genome sequences has a huge impact in plant biology (Schadt et al. 2010). The number of plant genomes being sequenced is rising (Michael and Jackson 2013) due to the rapid advancement of genome sequencing technologies, including those that allow high-throughput sequencing of longer reads and high-resolution assembly algorithms (Edwards and Batley 2010; Metzker 2010; Schatz et al. 2012). However, a common hurdle is assembly accuracy, especially considering the highly repetitive nature of plant genomes (Macas et al. 2007; Schatz et al. 2012). For example, bread wheat, which has one of the largest genomes among those sequenced from plants (17,000 Mbp; Brenchley et al. 2012), has an estimated repeat content of 80 % and the sequences assembled into scaffolds covered only 22 % of the genome (Brenchley et al. 2012; Michael and Jackson 2013). Even for Chinese cabbage (Brassica rapa), which has a relatively small genome of 529 Mbp, only about 60 % of the genome was assembled into pseudo-chromosome sequences, with the remaining 40 % made up mainly of repeat elements (Johnston et al. 2005; Wang et al. 2011; Michael and Jackson 2013).

Repetitive components of genomes are responsible for the extensive genome size variation in higher plants (Hardman 1986; Pagel and Johnstone 1992; Macas et al. 2007) and used to be considered ‘junk’ (Doolittle and Sapienza 1980; Nowak 1994; Shapiro and von Sternberg 2005). However, many recent studies have shown that repetitive elements have diverse functions within cells (Biémont and Vieira 2006; Biémont 2010), from involvement in maintaining chromosome integrity (Nowak 1994), and gene expression (Biémont and Vieira 2006), to changing phenotypes (Biémont and Vieira 2006). Therefore, characterization of these components in relation to genome assemblies is fundamental to understanding the holistic landscape and deciphering the complexity of plant genomes (Biémont 2010).

Despite their importance, repetitive sequences have hindered genome assembly and increased costs in terms of both time and money (Schatz et al. 2012). They remain largely unexplored and unassembled in many sequenced plant genomes (Wang et al. 2011; Michael and Jackson 2013; Liu et al. 2014), because most assembly algorithms are designed for less complex sequences (Schatz et al. 2012). However, the large amount of information that could be gathered from these repeats would be useful for understanding genome structure and evolution (Biémont 2010).

In the assembled genome sequences, most of the repetitive elements that occupy ~40 % of the B. rapa genome are transposon related (Wang et al. 2011; Michael and Jackson 2013). However, more redundant repeats such as centromeric and pericentromeric LTR retrotransposons (CRBs and PCRBrs, respectively; Lim et al. 2007), centromeric tandem repeats (including CentBr1 and CentBr2; Lim et al. 2005), and subtelomeric tandem repeats (STRs; Koo et al. 2011), in addition to the rDNA arrays were not included in the assembled genome sequence. Less than 1 % of these repeats are included in the currently available 283 Mbp assembled sequences (Table 7.1) despite coverage of >98 % of the euchromatic regions (Wang et al. 2011). This discrepancy demonstrates the difficulty of anchoring repeats in the assembly. Characterizing, quantifying and cytogenetically mapping these elements should aid in the final refinement of the genome structure.

Table 7.1 Comparison of major repeat composition identified in the reference genome assembly of B. rapa ‘Chiifu’ (Wang et al. 2011) with that found in 1x WGS sequence of 11 B. rapa accessions

In this chapter, we describe a genomic survey for major repeats of B. rapa using 1x whole-genome sequence (WGS) that captured a substantial portion of previously reported repeats and allowed us to characterize others. We also review the possible evolutionary roles of the identified repetitive elements in shaping the B. rapa genome. We further demonstrate the utility of combining in silico mapping of low-coverage WGS and fluorescence in situ hybridization (FISH) techniques to localize and estimate the genomic distribution and abundance of each repeat family. Finally, we discuss exciting applications and future prospects for this approach, especially for large and repeat-replete genomes and resource-deficient plant species.

7.2 The Hidden Genome: Characterization of Major Repeats

Knowing the distribution of repetitive elements within a genome is important in understanding genome organization, evolution, and function (Harrison and Heslop-Harrison 1995). In B. rapa, analysis using mitotic chromosome spreads demonstrated that heterochromatin is mostly concentrated in the centromeric and pericentromeric regions (Lim et al. 2005). These regions were later shown to contain major repetitive elements including the centromeric tandem repeats CentBr1 and CentBr2, centromeric retrotransposon of Brassica (CRB) and peri-centromeric retrotransposon of B. rapa (PCRBr; Harrison and Heslop-Harrison 1995; Lim et al. 2000, 2005; Koo et al. 2004). Repeats that are not concentrated in the centromeric regions have also been characterized (Wang et al. 2011; Liu et al. 2014). In addition to the tandemly repeated housekeeping 5S and 45S rRNA genes, a tandem repeat named STR based on its localization in the subtelomeric regions of several Brassica species was recently discovered (Koo et al. 2011). Collectively, these elements constitute the major repeat components of the hidden portion of the B. rapa genome (Table 7.1).

Most of these repeats have been identified by capture and characterization of single or a few elements via various efforts by independent groups; thus, global and comparative analyses of repetitive elements among related genomes have been limited (Macas et al. 2007). Considerable time and resources were often spent to characterize these elements. For example, CentBr1, CentBr2, CRB, and PCRBr were isolated after identification of patterns in restriction enzyme digestion, screening several thousand BAC clones, downstream cloning of isolated sequences, sequencing, and cytogenetic mapping (Harrison and Heslop-Harrison 1995; Koo et al. 2004; Lim et al. 2005, 2007). With the current availability of NGS technology, a huge amount of information now awaits capture and utilization without the tedium and expense of more traditional approaches.

7.2.1 Reconstruction of Nuclear rDNA Units

Owing to the vital function they play in protein biosynthesis and cellular function, ribosomal RNA genes are highly conserved across plant species (Hershkovitz and Zimmer 1996; Martins and Wasko 2004; Waminal et al. 2014). However, the spacers between each rDNA repeat unit are more divergent among species, making them an excellent tool for phylogenetic studies (Martins and Wasko 2004). Additionally, they have been exploited as cytogenetic FISH markers for studies related to genome dynamics and evolution (Roa and Guerra 2012; Waminal et al. 2012). However, complete sequences of B. rapa nuclear rDNAs have not yet been reported. Using de novo assembly of low-coverage WGSs (dnaLCW; Kim et al. 2015), we obtained the complete 5S unit without gaps and 45S rDNA unit sequences with small gaps in the intergenic spacer (IGS) for B. rapa.

The complete 5S rDNA unit was 501 bp, comprising a 120-bp 5S rRNA gene and 381-bp IGS (Fig. 7.1a). Based on mapping of raw reads to the complete 5S rDNA contig (Fig. 7.1b), it was estimated that there were 16,756 copies of the 5S rDNA unit in the haploid ‘Chiifu’ genome (Fig. 7.2c). Likewise, the complete 5S rDNA unit for Brassica oleracea ‘C1234’ totaled 503 bp with 119-bp genic and 384-bp IGS regions. However, only 1743 copies were estimated to be present in the B. oleracea genome based on raw read mapping (Fig. 7.1c), a value much lower than that in the B. rapa ‘Chiifu’ genome and supported by FISH analysis (Fig. 7.3a, b, e, f). Obtaining the complete unit of the 45S rDNA sequence for B. rapa ‘Chiifu’ was hindered by GC-rich repeats in the IGS region. Due to the abundant subrepeat regions and possible heterogeneous sequences in the IGS, gap-filling methods were ineffective, leaving a small gap in the 45S rDNA unit of 7764 bp for B. rapa ‘Chiifu’ (Fig. 7.1d). Nevertheless, using the same methods we successfully obtained a complete 7586-bp 45S rDNA unit for B. oleracea. Mapping 1x NGS reads to 45S rDNA sequences of B. rapa ‘Chiifu’ and B. oleracea C1234 (Fig. 7.1e, f) revealed 8709 and 1339 copies, respectively (Fig. 7.2c).
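The copy-number estimates above rest on read depth: at roughly 1x genome coverage, the mean depth of reads mapped onto a single repeat unit approximates the number of copies of that unit per haploid genome. The chapter does not spell out the exact calculation, so the following Python sketch is an assumption about how such an estimate might be obtained; the function name and the example inputs are hypothetical and are not taken from the data reported here.

```python
def estimate_copy_number(mapped_bases: int, unit_length: int,
                         genome_coverage: float = 1.0) -> float:
    """Estimate the copies of a repeat unit per haploid genome.

    mapped_bases    -- total bases of reads mapped to the unit (assumed input)
    unit_length     -- length of the assembled repeat unit in bp
    genome_coverage -- average sequencing depth of the data set (1.0 for 1x WGS)
    """
    if unit_length <= 0 or genome_coverage <= 0:
        raise ValueError("unit_length and genome_coverage must be positive")
    # Mean depth across the unit: how many times each position is covered.
    mean_depth = mapped_bases / unit_length
    # Normalizing by genome-wide coverage converts depth into copy number.
    return mean_depth / genome_coverage


# Hypothetical example: 250,500 bases mapped onto a 501-bp unit in 1x data.
copies = estimate_copy_number(mapped_bases=250_500, unit_length=501)
print(round(copies))
```

Normalizing by coverage means the same function could be applied to data sets sequenced deeper than 1x, although uneven coverage of GC-rich regions such as the 45S IGS would still bias any estimate of this kind.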

Fig. 7.1
Structure of 5S and 45S rDNAs of B. rapa ‘Chiifu’ and B. oleracea C1234 and raw read mapping. a Structure of the complete 5S rDNA unit of B. rapa and B. oleracea assembled based on the dnaLCW method (Kim et al. 2015). b, c Coverage of the 5S rDNA unit based on raw read mapping against the 1x genomes of B. rapa (genbank no: KM538957) and B. oleracea (genbank no: KM538957), respectively. d Structure of the 45S rDNA unit of B. rapa (partial) (genbank no: KM538957) and B. oleracea (complete) (genbank no: KM538957) assembled based on the dnaLCW method. e, f Coverage of 45S rDNA unit based on raw read mapping against the 1x genome of B. rapa and B. oleracea, respectively

Fig. 7.2
Sequences identified via genomic survey of major repeats among 11 different B. rapa and two B. oleracea accessions for comparison. a1 Comparison of centromeric and subtelomeric tandem repeat copy numbers and a2 rDNA and centromeric retrotransposons between B. rapa and B. oleracea. Error bars represent standard deviation. Copy numbers of b centromeric tandem repeats of B. rapa (CentBr), c ribosomal DNA (rDNA), d B. rapa and B. oleracea subtelomeric satellite repeats (BrSTR and BoSTR, respectively), and e centromere-specific retrotransposon of Brassica (CRB) and peri-centromeric retrotransposon of B. rapa (PCRBr)

Fig. 7.3
Cytogenetic mapping and evolution of B. rapa and B. oleracea major repeats. B. rapa a FISH signals of 5S rDNA (yellow arrows), b 45S rDNA (red arrows), c CentBr2 (green signals, yellow arrows indicate 4 major signals) and BrSTR (red, on both arms of chromosome A05, blue arrows indicate weak BrSTR signals) in B. rapa root metaphase chromosomes. d Karyotype idiogram showing the cytogenetic distribution of major repeats. Chromosome numbering is according to Xiong and Pires (2011). e–h B. oleracea. e 45S rDNA, f 5S rDNA, g CentBo1, and h CentBo2. i Genome-specific evolution of Brassica centromeric repeats showing lineage divergence (Mya) at nodes and repeats with corresponding estimated insertion and amplification time (Myr). Bars in a–h = 10 μm, i = 5 μm

7.2.2 Exploring the Hidden Portion of the Genome

A few studies have reconstructed and estimated the genomic content of major repeats using low-coverage NGS sequences (Hawkins et al. 2006; Macas et al. 2007; Swaminathan et al. 2007). This approach allowed the identification of up to 48 % of the 75–97 % repeats in the 4300 Mbp Pisum sativum genome (Macas et al. 2007). Even though not all of the repeats were captured in silico, enough information was available to carry out comparative studies among closely-related species. Coupled with FISH, this approach was able to reveal the distribution of the newly identified tandem repeats, providing a better picture of their actual location and abundance in the genome.

In B. rapa, several types of major repeats have been characterized, including the centromeric and pericentromeric LTR retrotransposons, CRB and PCRBr, respectively (Lim et al. 2007), centromeric tandem repeats CentBr1 and CentBr2 (Lim et al. 2005), and subtelomeric tandem repeat STR (Koo et al. 2011). We used these publicly available sequences along with the B. rapa rDNA sequences we assembled herein to survey the abundance of each element using 1x Illumina WGS data with at least 80 % sequence similarity as a criterion. As stated above, repetitive elements currently identified in the B. rapa pseudo-chromosome sequences covered less than 1 % of the total assembled sequence (Wang et al. 2011). Here, we identified repetitive elements representing more than 20 % of the genome. Accordingly, only 0.3 % of these sequences are represented in the current genome assembly (Table 7.1). The most abundant repeats in the B. rapa genome were 45S rDNA (8 %), followed by CentBr1 (7 %) and PCRBr (2 %).
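The survey logic described above (keep reads that match a known repeat at 80 % similarity or more, then express the matched bases as a share of all sequenced bases) can be illustrated with a minimal Python sketch. This is an assumption about the bookkeeping only: the chapter does not name the alignment tool or the exact formula, and the hit tuples, function name, and numbers below are hypothetical.

```python
from typing import Iterable, Tuple


def repeat_genome_fraction(hits: Iterable[Tuple[int, float]],
                           total_bases: int,
                           min_identity: float = 0.80) -> float:
    """Fraction of sequenced bases assigned to one repeat family.

    hits         -- (aligned_length_bp, identity) for each read's match to the
                    repeat; identity is a fraction between 0 and 1
    total_bases  -- total bases in the low-coverage WGS data set
    min_identity -- similarity threshold (0.80 mirrors the 80 % criterion)
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    # Only alignments at or above the threshold count toward the family.
    matched = sum(length for length, identity in hits
                  if identity >= min_identity)
    return matched / total_bases


# Hypothetical toy data: four 100-bp alignments, one below the threshold.
toy_hits = [(100, 0.95), (100, 0.82), (100, 0.79), (100, 0.80)]
fraction = repeat_genome_fraction(toy_hits, total_bases=1000)
print(f"{fraction:.1%}")
```

Because a 1x data set samples the genome roughly once, the fraction of bases assigned to a repeat family serves as a first approximation of that family's genomic proportion.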

In B. rapa (A genome), CentBr1 is more abundant than CentBr2 (Fig. 7.2a, b), unlike their orthologous sequences in B. oleracea (C genome), CentBo1 and CentBo2, which are present in similar copy numbers (Fig. 7.2a, b; Lim et al. 2007; Koo et al. 2011). This was supported by our 1x WGS survey of 11 B. rapa and two B. oleracea accessions that revealed large copy number differences between CentBr1 and CentBr2, but not much difference between CentBo1 and CentBo2 (Fig. 7.2b).

The 1x WGS survey also identified >5000 and >300 times more 45S and 5S rDNA, respectively, than what was included in the assembled pseudo-chromosome sequences (Table 7.1). When compared to B. oleracea, B. rapa had 5 and 17 times more copies of 5S and 45S rDNA, respectively (Fig. 7.2a), which was consistent with FISH results (Fig. 7.3a, b, e, f; Xiong and Pires 2011).

Previous reports have identified two classes of subtelomeric tandem repeats in Brassica, STRa and STRb, which share 89 % sequence identity (Koo et al. 2011). More sequences were identified from the 1x WGS reads when searching with BrSTRa compared to BrSTRb, suggesting that BrSTRa-type tandem repeat sequences are more abundant than BrSTRb-type sequences in both the B. rapa and B. oleracea genomes (Fig. 7.2d). In addition, different accessions of B. rapa and B. oleracea showed orders of magnitude difference in abundance for other repeat elements, indicative of genome plasticity, which may reflect phenotypic polymorphism among accessions (Fig. 7.2b–e).

There was not much copy number variation for CRB elements among different B. rapa and B. oleracea accessions (Fig. 7.2e), supporting their common existence in the genus Brassica (Lim et al. 2007). By contrast, PCRBr was significantly more abundant in B. rapa compared with the negligible amount found in B. oleracea (Fig. 7.2e), supporting the observation of Lim et al. (2007) that PCRBr is specific to the A genome.

7.2.3 Cytogenetic Mapping of Repetitive Elements

FISH is an invaluable tool in genetic and genomic studies. It has allowed confirmation of chromosomal segment inversions (van der Knaap et al. 2004; Huang et al. 2009; Cabo et al. 2014), localization of centromeric repeats (Lee et al. 2005; Wolfgruber et al. 2009), visualization of transposons (Yu et al. 2007; Neumann et al. 2011) and repetitive elements (Lamb et al. 2007a; Macas et al. 2007; Suzuki et al. 2012), and even detection of single genes (Khrustaleva and Kik 2001; Lamb et al. 2007b) and transgenes (Santos et al. 2006; Park et al. 2010). Macas et al. (2007) demonstrated the utility of FISH to cytogenetically map the major repeats identified in the pea genome in a survey of 454 NGS sequence data. Additionally, there are some limitations in identifying these repetitive elements through computational analysis, which may not always accurately report the proportion of repeats that resides in that genome (Macas et al. 2007; Schatz et al. 2012).

With our analysis of the B. rapa genome, FISH data afforded us a better view of the genomic proportion of each repetitive element. Whereas about 20 % of the total repetitive elements were captured using in silico analysis, FISH generally revealed about 29 % of all the repetitive elements in the genome (Table 7.1). We consider the FISH signal likely to represent an overestimate because it only detects two-dimensional hybridization signals from the three-dimensional chromosome structure.

In B. rapa, CentBr1 and CentBr2 show about 85 % sequence similarity and are separately distributed to eight and two chromosome pairs, respectively (Lim et al. 2007). However, in B. oleracea, there is less distinct separation between the chromosomal locations of CentBo1 and CentBo2, which show co-localization in several centromeres (Fig. 7.3g, h; Lim et al. 2007; Koo et al. 2011; Liu et al. 2014). This is consistent with there being little copy number difference between CentBo1 and CentBo2 compared to CentBr1 and CentBr2 in the 1x WGS survey (Fig. 7.2a). This also suggests that there was a different rate of homogenization of centromeric tandem repeats between B. rapa and B. oleracea genomes as well as among centromeres within each genome, as observed in some Brassicaceae species (Hall et al. 2005).

CentBr arrays are intermingled with a major centromeric LTR retrotransposon, CRB. Although CRB is common to the three basic Brassica lineage A, B, and C genomes, CentBr is present only in the A and C genomes (Lim et al. 2007). Additionally, the A genome-specific retrotransposon PCRBr hybridized to B. rapa chromosomes, but not to those of B. oleracea and B. nigra (Fig. 7.3i; Lim et al. 2007). It localized to four chromosomes with major heterochromatin blocks in B. rapa, which could explain the relatively high genomic proportion of PCRBr identified based on the 1x WGS survey (Fig. 7.2a, e). In addition, although Koo et al. (2011) reported three loci on three separate chromosomes for BrSTR, our data showed two loci on both arms of chromosome A05, with a major locus on the short arm, and two other very weak loci on two short chromosomes (Fig. 7.3c). This may be explained by the different sensitivity of FISH experiments, or different cytotypes used in the experiments, noting that Brassica genomes are highly dynamic and polymorphic (Koo et al. 2011). This was also demonstrated by Xiong and Pires (2011), who showed different numbers of 5S rDNA loci between different B. rapa accessions ‘Chiifu’ and the double haploid B. rapa IMB218. Taken together, the satellite repeat distribution in B. rapa further supports the general observation that centromeric and subtelomeric regions are havens for satellite repeats (Charlesworth et al. 1994).

Although in silico analysis identified more 45S rDNA than CentBr1, FISH showed that 45S rDNA was second to CentBr1 in terms of genomic abundance (Table 7.1). This suggests that some CentBr1 may not have been thoroughly captured despite their relative abundance; this is likely true for the other types of sequence as well considering that our analysis identified only half of the 40 % unassembled sequences.

B. rapa has more 5S and 45S rDNA loci, three and five, respectively (Lim et al. 2005; Koo et al. 2011; Xiong and Pires 2011), than B. oleracea, which has only one and two (Liu et al. 2014). This underlies the higher genomic proportion of rDNA in B. rapa relative to that in B. oleracea (Fig. 7.2a, c).

A summary of the cytogenetic distribution of B. rapa repeats is presented in Fig. 7.3d. Together, the eight major repeats examined in this study account for about half of the unassembled sequence based on mapping of 1x WGS reads, indicating that more repeats, such as DNA transposons, remain hidden in the genome and could be identified through a refined dnaLCW method (Table 7.2).

Table 7.2 Summary of different B. rapa and B. oleracea accessions used in this survey

7.3 Functions and Evolutionary Implications of Repetitive Elements

The differential accumulation of repetitive elements, rather than gene sequences, is mainly responsible for the differences in C-value in plant genomes (Wei et al. 2013), a phenomenon commonly known as the C-value paradox (Hardman 1986; Pagel and Johnstone 1992; Macas et al. 2007). A growing amount of evidence supports the importance of these repeats in genome functions and evolution (Nowak 1994; Pardue and DeBaryshe 2003; Hall et al. 2005; Shapiro and von Sternberg 2005; Biémont and Vieira 2006; Wei et al. 2013).

Transposable elements (TE) are now known to possess characteristics that help shape the structure and evolution of genomes. They help regulate genes, defend genomes from retrotransposon proliferation and retrovirus invasion, cause mutations, influence recombination rates, protect chromosomes through telomerase-independent fashion, and maintain centromeres, which play a significant role in chromosome segregation (Pardue and DeBaryshe 2003; Wolfgruber et al. 2009; Biémont 2010; Sarilar et al. 2011; Goodier et al. 2012; Sampath et al. 2013). In Brassica, MITE transposons preferentially accumulate near or inside of genic regions indicating these likely play roles in gene evolution (Sarilar et al. 2011; Sampath et al. 2013, 2014).

Most plant centromeric DNA is composed of 150–180 bp tandem repeats and centromere-specific retrotransposons (CR; Jiang et al. 2003; Lim et al. 2007; Talbert and Henikoff 2010; Neumann et al. 2011; Jiang 2013). The centromeric tandem repeat arrays can extend to several megabases and are often interrupted by CRs, which can also insert into other CRs, forming a complex nested pattern, and play a significant role in centromere function and evolution (Jiang et al. 2003; Lim et al. 2007; Wei et al. 2013). Association of these tandem repeats and CRs with modified histone H3 (CENH3), the hallmark of active centromeres, further indicates their active role in centromere function (Neumann et al. 2011; Jiang 2013).

Some evidence has been presented to help explain the rapid evolution of centromeric tandem arrays across different centromeres within a species. Unequal crossover, gene conversion, and repeat transposition have been invoked as key players in the homogenization and spread of repeats intra-chromosomally, between sister chromatids, between homologous chromosomes, and between non-homologous chromosomes (Walsh 1987; Charlesworth et al. 1994; Cohen et al. 2003; Hall et al. 2005). Unequal crossovers usually result in higher-order repeat units consisting of more than one type of element and variation in lengths of arrays (Hall et al. 2005; Talbert and Henikoff 2010). Other mechanisms such as gene conversion and repeat transposition may amplify satellite arrays and cause their spread into nonhomologous chromosomes (Hall et al. 2005).

In Brassica, CentBr and CRB are major components of the centromere (Lim et al. 2007). The CRB is a common centromeric component of the A, B, and C genomes. However, the absence of CentBr hybridization in B. nigra (B genome) indicates that the B genome diverged from the A and C genomes earlier, supporting the 9 MYA divergence time for the B genome (Fig. 7.3i; Lim et al. 2007; Koo et al. 2011). This was further supported by the FISH results with the subtelomeric repeat STR, which also showed genome-specific evolution. The BnSTR tandem repeat from B. nigra (B genome) did not hybridize to either the A or C genome, and BrSTR from the A genome did not hybridize to either the B or C genome, although BoSTR from the C genome hybridized to both the A and C genomes (Koo et al. 2011). However, those tandem repeats (CentB and STR) show high sequence similarity between species (Lim et al. 2005, 2007; Koo et al. 2011), suggesting that the tandem repeats subsequently diverged in the A, B, and C genomes after speciation even though they shared a single origin in the ancient genome.

The pericentromeric retrotransposon PCRBr showed A-genome specificity (Fig. 7.3i). PCRBr is a gypsy-type retrotransposon that has accumulated in several chromosomes of B. rapa, suggesting that these retrotransposons were rapidly amplified in the A genome after its divergence from the C genome within the last 4.6 million years (Wang et al. 2011; Liu et al. 2014). Additionally, CentBr1 and CentBr2 have diverged in sequence and chromosomal distribution in B. rapa and B. oleracea. CentBr2 has both HindIII (AAGCTT) and Sau3AI (GATC) restriction sites, while CentBr1 has lost the Sau3AI site (Koo et al. 2011). This phenomenon was also observed for maize CentC and Cen4 (Kato et al. 2004). Collectively, these results highlight the dynamic nature of the genomes in the genus Brassica and present examples of lineage- and genome-specific rapid evolution of centromeric components (Koo et al. 2011).
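The restriction-site difference between CentBr1 and CentBr2 is the kind of feature that can be checked directly on a sequence. The Python sketch below scans a sequence for the HindIII (AAGCTT) and Sau3AI (GATC) recognition sites named above. Both sites are palindromic, so searching one strand is sufficient. The function name and the toy sequence are hypothetical; the sequence is made up for illustration and is not a real CentBr monomer.

```python
from typing import Dict, List

# Recognition sequences as given in the text.
ENZYMES: Dict[str, str] = {"HindIII": "AAGCTT", "Sau3AI": "GATC"}


def find_sites(sequence: str, site: str) -> List[int]:
    """Return 0-based start positions of every occurrence of a site."""
    seq = sequence.upper()
    site = site.upper()
    positions: List[int] = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping occurrences are also found.
        start = seq.find(site, start + 1)
    return positions


# Made-up toy sequence carrying one site for each enzyme.
toy_sequence = "ttAAGCTTccGATCgg"
profile = {name: find_sites(toy_sequence, site)
           for name, site in ENZYMES.items()}
print(profile)
```

Applied to aligned monomers of the two repeat families, an empty list for one enzyme would flag a lost site of the kind reported for Sau3AI in CentBr1.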

7.4 Conclusion and Perspectives

As exemplified by Macas et al. (2007) in Pisum sativum, survey of plant genomes using low-coverage NGS data proved to be an excellent tool for capturing the highly repetitive genomic sequences that are mostly left out during assembly. Our application of this technique to Brassica species further corroborated the usefulness of this approach. Characterizing the genomic abundance and distribution of these repetitive sequences is further facilitated when 1x WGS genomic survey is coupled with molecular cytogenetic techniques such as FISH.

Using this approach, independent analysis of repetitive elements from genome assembly data can provide a huge amount of information regarding genome structure and evolution when comparative analyses are performed with closely and distantly related species. This approach may also advance our knowledge of plants with huge genomes such as Allium (Jakse et al. 2008). Repetitive sequences can be analyzed using low-coverage WGS before completion of genome sequencing and can provide guidance for complete elucidation of the genome structure of the target plant. This combined genome survey and cytogenetic approach will also be useful for evolutionary genomics analysis of plant families lacking available genome sequences by allowing comparison of the repetitive yet highly informative portions of their genomes, as exemplified by our work in ginseng (Panax ginseng; Choi et al. 2014).