Introduction

The exploitation of evolving experimental techniques, from early cytological approaches, molecular markers, fluorescence in situ hybridization (FISH) and optical mapping to the nucleotide sequencing of entire genomes, has contributed relevant discoveries on genome organization and has helped to relate chromosomal peculiarities to phylogeny and evolution.

Comparative approaches highlighted that many structural features of plant genomes are remarkably similar among different species, and are also shared with other eukaryotes, animals and fungi (Heslop-Harrison 2000). All eukaryotes have their genomic DNA organized in chromosomes, associated with proteins, showing almost the same organization. Centromeres are located at positions that are largely conserved along the chromosome structure, and the terminal regions are organized in telomeres.

Comparative approaches also highlighted the relevance of polyploidy in plants. Chromosome number varies widely among plant species, with 2n ranging from 4 to more than 1000, although the number within any given species is usually constant. Polyploidy may also be associated with diploidization events, with rearrangements that imply genome reshuffling, translocations, and fusion and fission of chromosomes. These events have been proposed as some of the reasons why plant genomes are highly duplicated (Lysak et al. 2005; Cui et al. 2006; Tang et al. 2008a, b; Jiao et al. 2011, 2012; Sangiovanni et al. 2013). Beyond the interesting issue of investigating the mechanisms involved in polyploidy and diploidization events in plants, which can occur even in a relatively short time span and trace plant genome evolution and diversification (Jaillon et al. 2007; Tomato Genome Consortium 2012; Denoeud et al. 2014), it would also be intriguing to understand what enabled angiosperms to manage the presence of homologous chromosomes efficiently in comparison with all other eukaryotes, where polyploids are rare. In the context of this chapter, however, it is important to focus on the effects that whole-genome and segmental duplications had on the redundancy of genome regions and of gene copies, with the definition of novel gene families. Though it is not the aim of this chapter to discuss DNA repeats due to polyploidization events or to the retention of duplicated regions, it is noteworthy that one of the main outcomes of the tomato genome sequencing effort was the tracing of two consecutive genome triplications in the Solanum lineage. The more ancient event was shared with rosids, while a more recent one appeared specific to the Solanum lineage (Tomato Genome Consortium 2012; Denoeud et al. 2014).
These events had a relevant impact on diversification and evolution of novel functionalities in this clade of plants. However, the repeated regions tracing these possible events in the tomato genome were mainly detected only at the sequence level (Tomato Genome Consortium 2012), presumably because of the high divergence caused by gene loss or mutations since the last hypothesized polyploidization event (Shearer et al. 2014).

The dynamics of genome evolution in plants offers striking opportunities to have multiple copies of the genome content, i.e. to repeat it, and to keep it duplicated even when diploidization occurred. Furthermore, the transfer of genes or of entire parts of the DNA from organelles to nucleus is now well documented both in plants and animals (Martin and Herrmann 1998; Vaughan et al. 1999).

It is worth noting that, despite the various genome rearrangements that occurred in plants, gene numbers as well as gene order are largely conserved over substantial evolutionary distances (Gebhardt et al. 1991; Ahn et al. 1993; Devos and Gale 1993, 1997, 2000).

The tomato genome, as an example, is highly syntenic with those of other economically important Solanaceae (Potato Genome Sequencing Consortium 2011; Tomato Genome Consortium 2012; Hirakawa et al. 2014; Kim et al. 2014; Sierro et al. 2014) as well as other plants (Jaillon et al. 2007). However, plant genome size can vary strongly among different species. Indeed, repetitive sequences contribute significantly to genome size in plants. Understanding the mechanisms and inferring possible functional reasons favouring this variability and plasticity is still an open challenge.

DNA Content in the Cell

The amount of DNA (in picograms) in an unreplicated haploid cell, which corresponds to the constant value or C-value (Swift 1950; Greilhuber et al. 2005), is relatively homogeneous within a species. However, the C-value is particularly variable between species. This variability is not related to the complexity of the organisms in terms of size or developmental mechanisms. The DNA content of the unicellular amoeba, for example, is 200 times higher than that of human cells, though mammals have evidently higher developmental complexity. This initially “unexpected” phenomenon represents the so-called “C-value paradox”. The paradox is explained today by the fact that the DNA content of a species can be rich in repetitive sequences, while the number of coding genes is of the same order of magnitude in all eukaryotes, ranging from about 6000 in the unicellular Saccharomyces cerevisiae to approximately 20,000 to 25,000 in the human genome (which is 200 times bigger than the yeast genome) (Richard et al. 2008).

In general, the term “repetitive sequences” refers to highly similar DNA fragments that are present in multiple copies in a genome. In particular, the major contribution to haploid genome size in eukaryotes comes from highly and moderately repeated sequences, i.e. DNA motifs ranging in length from a couple of nucleotides to thousands of nucleotides, repeated many hundreds or thousands of times. These repeated motifs are ubiquitous in eukaryotic genomes (Charlesworth et al. 1994; Kumar and Bennetzen 1999; Bowen and Jordan 2002) and represent a large portion of the chromosome structure (von Sternberg 2002), ranging between 50 and 90 % or more of all the nuclear DNA content. As an example, more than 50 % of the human genome is composed of repeats (Richard et al. 2008).

In higher plants, the amount of DNA is particularly variable between species (Flavell et al. 1974; Bennett and Smith 1976; Ouyang and Buell 2004; Hawkins et al. 2009). The low content reported for A. thaliana is one of the main reasons why this genome was the first one to be sequenced among plant species (NSF 1990; Arabidopsis Genome Initiative 2000). Similarly, mainly thanks to its “modest” genome size, poplar was the first tree to be sequenced (Brunner et al. 2004). In plant genomes too, the proportion of protein-coding regions is rather similar among species (Table 10.1). Indeed, the structural and developmental complexity of plant species does not differ fundamentally between those with low and those with the highest amounts of DNA per cell (Smyth 1991). It is also evident (Table 10.1) that the contribution of repeats to each genome varies widely, from very low percentages, as in Arabidopsis thaliana, to a very high relative content, as in Capsicum annuum (~82 %) and in several monocots (~85 %).

Table 10.1 List of plants with sequenced genomes

DNA Repeat Classes

Repetitive DNA was first detected because of its rapid reassociation kinetics when denatured, since the rate at which a particular sequence reassociates is proportional to the number of times it is found in the genome. Based on the renaturation rates observed in denaturation–renaturation experiments on genomic DNA after heat exposure, it is possible to identify three major classes of DNA sequence types: the highly repetitive sequences, representing DNA fragments that reassociate very rapidly; the moderately repetitive ones, i.e. DNA fragments that reassociate at an intermediate rate; and the single copy (or very low copy number) class, representing fragments that do not repeat at a consistent frequency in DNA sequences. Such approaches to estimating the repetitive content of genomic DNA in different organisms, despite possible underestimations due to diverging repetitive elements, are remarkable since they give an accurate global picture of genome composition in the absence of sequence information. In parallel to the reassociation kinetics properties, repeated sequences can also be divided into two major categories based on their organization or distribution in a genome: “tandem repeats” and “dispersed repeats” (Fig. 10.1). Tandem repeats generally correspond to the highly repetitive sequences. They mostly localize on large conspicuous heterochromatic DNA blocks at the distal ends and interstitial parts of the chromosome (Schmidt and Heslop-Harrison 1998) and include sequences that are repeated in tandem along the genome, such as ribosomal DNA repeat arrays (rDNA) and satellite DNA. Duplicated protein-coding genes (paralogs) can also be included among tandem repeats. Dispersed repeats usually correspond to moderately repeated sequences, and include transposons and dispersed gene paralogs. Transfer RNA genes (tDNA) are often distributed in tandem, but they are usually included among the dispersed repeats (Richard et al. 2008).
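
The relation between copy number and reassociation rate can be illustrated with a minimal Python sketch. It assumes ideal second-order kinetics, in which the fraction of DNA remaining single stranded is 1 / (1 + k · C0t), and a rate constant k that scales linearly with copy number. The function names and the numerical value of the rate constant are hypothetical and chosen only for illustration.

```python
def fraction_single_stranded(cot, k):
    """Fraction of DNA still single stranded at a given C0t value,
    assuming ideal second-order reassociation kinetics."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """C0t value at which half of the DNA has reassociated."""
    return 1.0 / k


# Hypothetical rate constant for a single-copy sequence (arbitrary units).
K_SINGLE_COPY = 0.001

# Assumption: the rate constant is proportional to the copy number.
for copies in (1, 100, 10_000):
    k = K_SINGLE_COPY * copies
    print(f"{copies:>6} copies: C0t1/2 = {cot_half(k):g}")
```

Under these assumptions, a sequence present in 10,000 copies reaches half reassociation at a C0t value 10,000 times lower than a single-copy sequence, which is why the three classes separate in a reassociation experiment.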

Fig. 10.1

Repeated DNA sequences in eukaryotic genomes. The two main categories of repeated elements (tandem and dispersed repeats) are shown, along with their subcategories

Tandem Repeats

rDNA

rDNAs represent non protein-coding multigene families usually classified as tandem repeats. rDNAs (Fig. 10.1) are usually head-to-tail arrays of genes encoding the precursor (45S) of the three largest ribosomal RNAs (18S, 5.8S and 25S in plants). The corresponding DNA region generally contains several tandem copies, including active rRNA genes and silent rRNA genes, which are often highly compacted in dense heterochromatin. The rDNA region gives rise to secondary constrictions in metaphase chromosomes that are called the nucleolus organizer regions (NOR), around which the nucleolus forms. These rRNA coding genes are usually transcribed by RNA polymerase I. The 5S rRNA genes, highly conserved genes of around 120 nt in length, are distributed independently from the 45S rDNA, in multiple copies arranged as tandem arrays separated by a spacer that is highly variable in sequence and length. The core unit is 200 to 900 nucleotides long, and its copy number can vary from 1000 to 50,000. The sequences may or may not be adjacent to the 45S rDNA region and are usually transcribed by RNA polymerase III.

Satellite DNA

The name “satellite DNA” refers to a “satellite” band different in density from bulk DNA in a density gradient, due to repetitions of short DNA sequences. It consists of a large number of repeat units, distributed as tandem arrays of DNA. Satellite DNA is further divided into minisatellites and microsatellites. Both subcategories are variable in number of repeats (Variable Number of Tandem Repeats or VNTR). Minisatellites consist of a core repeat unit of 10 to 60–90 nucleotides. Microsatellites (also known as “Simple Sequence Repeats” or SSRs, or “Short Tandem Repeats” or STRs) consist of a core of around 2 to 6–10 nucleotides. In general, satellite DNA can be distributed throughout the chromosomes (King et al. 1997; Richard et al. 2008), both in heterochromatin and euchromatin regions (Cuadrado and Schwarzacher 1998; Cuadrado and Jouve 2007a, b; Chang et al. 2008), in genes (in protein-coding regions, in introns, or in regulatory regions), and within transposable elements.
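
The definition of microsatellites given above lends itself to a short illustration. The following Python sketch finds perfect tandem arrays of a 2 to 6 nucleotide core with a regular expression. It is only a toy, not the method of any tool cited in this chapter: the function name, the default thresholds and the decision to skip mononucleotide runs are assumptions made for the example.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, end, unit, copies) for perfect tandem arrays.

    Coordinates are 0-based, end-exclusive. The lazy quantifier makes the
    regular expression prefer the shortest possible core unit.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Assumption of this sketch: homopolymer runs are not reported.
            continue
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


# Toy sequence with an (AT)6 and an (AAG)5 array.
toy = "GGC" + "AT" * 6 + "CCT" + "AAG" * 5 + "T"
print(find_microsatellites(toy))
```

Imperfect or compound repeats, which are common in real genomes, are not detected by this simple pattern; dedicated tools handle them with alignment-based scoring.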

Tandem satellite DNA sequences generally exhibit characteristic chromosomal locations, with roles depending on their locations. They can be at telomeric, subtelomeric and centromeric regions, with repetitive families that can be shared within a taxonomic family or a genus, or may be specific to the species, genome or even a chromosome (Sharma and Raina 2005). These features have formed the basis of extensive utilization of repetitive sequences for taxonomic and phylogenetic studies. Satellite DNA is the main component of centromeres, with core units from 9 to 64 bp long, and of telomeric regions, with conserved core units of around 6 bp and repetition numbers that can range from hundreds to thousands, depending on the species (Podlevsky et al. 2008), forming the main structural constituent of heterochromatin. Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly, in contrast with the high conservation of the core units of telomeres (Henikoff et al. 2001). Centromeres differ greatly in their sequence organization among different species. In Saccharomyces cerevisiae a “point centromere” of 125 bp is sufficient to confer centromere function (Meraldi et al. 2006). In most animals and plants, centromeres contain megabase-scale arrays of simple tandem repeats, sometimes interspersed with long terminal repeat transposons (Heslop-Harrison et al. 2003) and, despite their relevant role, very little is known about the degree to which centromere tandem repeats share common properties between different species (Melters et al. 2013). However, the key kinetochore proteins are conserved in both plants and animals, particularly the centromere-specific histone H3-like protein (CENH3), highlighting the importance of epigenetic mechanisms in the establishment and maintenance of centromere identity (Houben and Schubert 2003).
Telomere repeats occur predominantly at the ends of eukaryotic chromosomes, arranged in tandem to form large uninterrupted blocks often associated with subtelomeric satellite repeats (Ganal et al. 1991). They appear to protect chromosome ends from degradation and shortening during replication (Mason and Biessmann 1995).

Microsatellites may be highly variable in length, due to unequal crossing over, rolling circle amplification and replication slippage, even before meiosis (Tautz and Schlotterer 1994), which gives these regions a high rate of mutation per locus per generation (Jarne and Lagoda 1996; Kruglyak et al. 1998). This is why these sequences are important for different approaches (Buschiazzo and Gemmell 2006). Indeed, microsatellites can be amplified using primers designed on the unique sequences of their flanking regions, producing variable patterns of fragment lengths that are useful for population studies, fingerprinting, marker assisted selection, and the study of breeding patterns of wild or domesticated species (Martinez-Zapater et al. 1986; Maluszynska and Heslop-Harrison 1991; Michelmore et al. 1991; Martin et al. 1992; Maughan et al. 1995; Liu et al. 1996; McCouch et al. 1997; Milbourne et al. 1997; Livingstone et al. 1999).

Dispersed Repeats

tDNA

Genes coding for transfer RNAs represent a non protein-coding multigene family, like rRNA coding genes. Though often distributed in tandem, they are usually classified as dispersed repeats.

In addition to their essential function in protein synthesis, recent studies have shown that tRNAs are multifunctional molecules involved in many processes of cellular metabolism (Minajigi and Francklyn 2010). Furthermore, tRNA-derived RNAs appear to be used in the RNA silencing pathway, and are a major source of short interspersed nuclear elements (Bermudez-Santana et al. 2010; Phizicky and Hopper 2010).

It is postulated that all tRNA genes (tDNAs) derive from an ancestral molecule (Eigen et al. 1989) that during evolution gave rise to a full set of tRNA genes as the result of numerous mutation, duplication and reorganization events. The number of tRNA pseudogenes and organellar-like tRNA genes present in nuclear genomes varies greatly from one plant species to another. Generally, there is no correlation between genome size and tDNA copy number in the nuclear genome (Richard et al. 2008). However, Michaud et al. (2011), in their analysis of tRNA gene distribution in plant genomes, revealed that the tRNA gene content in plants is rather homogeneous, and is mostly correlated with genome size.

Transposable Elements

Among dispersed repeats, transposable elements (TEs) are DNA sequences that are capable of “moving” in the cell, integrating into a new site within the genome where they originated (Craig et al. 2002), thereby creating changes and altering, often by amplification, the cell’s genome size. This is why they were also termed “jumping elements”. They were discovered in plants by Barbara McClintock, who earned her Nobel Prize for this scientific contribution in 1983 (McClintock 1953). She not only found that genes could move, but also that they could be turned on or off according to the environmental conditions or during different stages of cell development. Transposons consist of two major classes: retrotransposons (class I elements) and DNA transposons (class II elements) (Fig. 10.1), depending on the mechanisms that determine their excision and insertion in the genome.

Retrotransposons replicate by forming RNA intermediates, which are then reverse transcribed to DNA sequences and inserted into new genomic locations. Therefore, retrotransposons need transcription and a reverse transcriptase to move, while DNA transposons are excised from the genome, and the “cut-and-paste” mechanisms for transposition require transposases (Craig et al. 2002). Retrotransposons are commonly grouped into LTR or non-LTR retrotransposons according to the presence or absence of long terminal repeats (LTRs). In LTR retrotransposons, the terminal repeats range from ~100 bp to over 5 kb in size. They are the most highly represented class in plant genomes (Kumar and Bennetzen 1999; Bennetzen 2000) and may be further subclassified into different classes, differing by the degree of sequence similarity and by the order of encoded gene products along their structure. Among these, Ty1-copia-like and Ty3-gypsy-like elements are commonly found in high copy number in plant genomes, but also in animals, fungi and protists. Retroviruses are often classified separately from the LTR retrotransposons though they share many features with them. A major difference from Ty1-copia and Ty3-gypsy retrotransposons is that retroviruses have an envelope protein (ENV) and have domains that enable extracellular mobility (Cotton 2001).

Non-LTR retrotransposons include long interspersed elements (LINEs) and short interspersed elements (SINEs). LINEs encode functionalities that are essential for retrotransposition, such as reverse transcriptase and endonuclease activities, and are transcribed by RNA polymerase II, like mRNAs. Their mechanisms of transposition, however, differ from those of LTR elements (Bibillo and Eickbush 2004). SINEs are nonautonomous retroelements, with lengths ranging from 100 to 900 bp and copies that are not identical within the genome (Kramerov and Vassetzky 2005). They do not encode reverse transcriptase, and presumably co-opt the LINE machinery to be retrotransposed (Jurka 1997). They are transcribed by RNA polymerase III, their 5’ end being organized like a typical tRNA promoter (Defraia and Slotkin 2014).

Bioinformatics for Repeat Detection

Repeat Sequence Databases

Due to the presence of different types of repeats, there are different dedicated databases that organize repeats, such as Repbase (Jurka et al. 2005), the Tandem Repeats Database (Gelfand et al. 2007) and RepeatsDB (Di Domenico et al. 2014). In particular, Repbase is a comprehensive repeat collection including prototypes of repetitive DNA sequences derived from the consensus of each of the repeat families from each eukaryotic species. The Tandem Repeats Database is specific for repeated regions in tandem, while RepeatsDB specifically contains tandem repeats found in protein sequences. In parallel to these resources, Rfam (Burge et al. 2013) contains families of non protein-coding RNAs, and is useful to support annotation of the corresponding genes in a genome, rRNA and tRNA coding genes included.

Some available databases are specific for plants. The PGSB Repeat Database (Nussbaumer et al. 2013) and the Plant Repeat Database, organized starting from the TIGR Plant Repeat Database (Ouyang and Buell 2004) and updated until 2008, are both designed as comprehensive repeat collections. PlantSat (Macas et al. 2002) and the Plant rDNA database (Garcia et al. 2012) are dedicated to satellite repeats and rDNAs, respectively. Some of these databases, such as the Plant rDNA database and the Plant Repeat Database, which is made of subsections dedicated to Solanaceae, Gramineae or other plants, allow searches for repeated regions in specific genera or species.

Methodologies

Bioinformatics strategies to identify and annotate repeats in genome sequences are similar across species. In general, the currently available methods can be based on comparative approaches, which aim to identify and classify the repeated regions by aligning a query sequence, the one to be analyzed, with sequences representing repeat classes organized in dedicated databases. Other approaches are based on de novo detection of repeats along a sequence. These methods support the identification of novel repeat sequences, i.e. sequences not available in dedicated collections because they have not yet been discovered and classified.

RepeatMasker (Smit et al. 1996) and Censor (Kohany et al. 2006) are among the well-known similarity-based search tools. They support the annotation of the repeats detected along a sequence and provide its masked version, i.e. a sequence in which all the regions matching known repeats are replaced by X or N characters so that they are ignored in subsequent analyses, like those necessary to detect coding genes.
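
The idea of a masked sequence can be shown with a minimal Python sketch. It does not reproduce how RepeatMasker or Censor work internally: it simply assumes that repeat coordinates are already available as 0-based, end-exclusive intervals, and the function name is hypothetical.

```python
def mask_repeats(sequence, intervals, mask_char="N"):
    """Replace every position covered by a repeat interval with mask_char.

    intervals: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumption of this sketch; annotation files often use 1-based,
    inclusive coordinates and would need converting first).
    """
    masked = list(sequence)
    for start, end in intervals:
        # Clip the interval to the sequence boundaries.
        start = max(0, start)
        end = min(len(masked), end)
        for position in range(start, end):
            masked[position] = mask_char
    return "".join(masked)


print(mask_repeats("ACGTACGTAC", [(2, 5), (8, 12)]))  # prints ACNNNCGTNN
```

Downstream analyses such as gene prediction can then skip the masked stretches while the coordinates of the remaining sequence stay unchanged.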

Similarity-based methods may also compare a query with established reference genome sequences to find occurrences of similar repeat regions.

Tandem Repeats Finder (Benson 1999) and mreps (Kolpakov 2003) are other specific tools helpful to find and annotate tandem repeats in DNA sequences. Like LTR_STRUC (McCarthy and McDonald 2003), Recon (Bao and Eddy 2002) and RepeatScout (Price et al. 2005), they detect repeated DNA sequences by de novo approaches. These approaches are generally based on self-comparisons of similar repeated regions. Associated clustering approaches usually also make it possible to group related sequences and to classify them into families and/or subfamilies.

The identification and annotation of repeated gene loci, such as non protein-coding genes (tRNA, rRNA), can be performed with dedicated tools like Infernal (Nawrocki et al. 2009), which is also useful for the identification of other non protein-coding RNAs. Specifically, Infernal is used to search sequences against databases dedicated to RNA families, such as Rfam. Infernal builds a profile with a position-specific scoring system from a structurally annotated multiple sequence alignment of an RNA family. The scoring approach also takes into consideration the secondary structure organization of the family being modelled, such as base pairing, combining different levels of structure information to reach the most appropriate result. Other tools, such as tRNAscan-SE (Schattner et al. 2005) and ARAGORN (Laslett and Canback 2004), or SnoReport (Hertel et al. 2008), are specific for some classes of RNAs, namely tRNAs and snoRNAs, respectively.

Repeats in the Tomato Genome

Protein-coding Gene Paralogs

Though the description of protein-coding paralog genes is not the main topic of this chapter, we preferred to report briefly on their distribution in the tomato genome, since they represent repeat sequences in a genome and their occurrence contributed to revealing the two consecutive triplication events of the Solanum lineage that moulded the gene set controlling fruit characteristics (Tomato Genome Consortium 2012). The total number of genes with at least one paralog in tomato is 25,992, about 75 % of the total gene content. In Fig. 10.2 we report the distribution of paralog gene numbers per chromosome. This reflects the high duplication level of mRNA coding genes reported in the tomato genome (Tomato Genome Consortium 2012).

Fig. 10.2

Paralog gene distribution per chromosome. The data summarized here were obtained from the BioMart section of EnsemblPlants (http://plants.ensembl.org/)

Non Protein-coding Repeated Genes

Among paralogs we may also consider large multigene families such as ribosomal RNA (rDNA) and tRNA (tDNA) genes.

Non protein-coding RNAs in the tomato genome sequences were annotated by Infernal using the Rfam database (version 9.1) (specifically, the collection available at ftp://ftp.sanger.ac.uk/pub/databases/Rfam/9.1/infernal-latest.tar.gz and compatible with Infernal 1.0) (Tomato Genome Consortium 2012).

Long rDNAs were excluded from the analyses of the tomato assembly released by the consortium because of a specific option used by the authors when running Infernal, which excluded the annotation of these regions (Tomato Genome Consortium 2012, supplementary materials 2.3.2). The analysis was therefore limited to the identification of 1853 non protein-coding RNAs from 90 distinct Rfam families, in which almost 48 % of all the targets represented tRNA coding genes (RF00005) (Tomato Genome Consortium 2012).

Table 10.2 summarizes the results included in the iTAG2.4_infernal.gff3 file made available by the tomato genome sequencing consortium at the ftp section of the Sol Genomics Network (http://solgenomics.net/). Moreover, in order to complete the annotation of the non protein-coding rDNAs, we performed a BLASTn of the tomato chromosomes versus the Large Subunit sequences (LSU, RF02543), which include the 25S RNA, and the Small Subunit (SSU, RF01960) sequences, corresponding to 18S, both collections available in the Rfam database (release 12.0). We considered only loci that corresponded to matches with identity and coverage ≥98 %.
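
The filtering step described above can be sketched in Python as follows. This is not the script used for the analysis, which is not given in the text. The sketch assumes that the BLASTn results are in the standard 12-column tabular format, that the rRNA sequences are the queries, and that coverage is the aligned fraction of the query; the function name and the identifiers in the toy rows are hypothetical.

```python
def filter_blast_hits(lines, query_lengths, min_identity=98.0, min_coverage=98.0):
    """Keep tabular BLAST hits passing both the identity and coverage thresholds.

    lines: rows in 12-column tabular format (qseqid, sseqid, pident, length,
           mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore).
    query_lengths: dict mapping each query id to its length in nucleotides.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, subject_id = fields[0], fields[1]
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        # Assumption: coverage = aligned part of the query / query length.
        coverage = 100.0 * (abs(q_end - q_start) + 1) / query_lengths[query_id]
        if identity >= min_identity and coverage >= min_coverage:
            kept.append((query_id, subject_id, identity, coverage))
    return kept


# Toy rows with hypothetical identifiers; only the first passes both filters.
example = [
    "LSU_1\tchr02\t99.10\t1000\t9\t0\t1\t1000\t5000\t5999\t0.0\t1800",
    "LSU_1\tchr06\t96.00\t1000\t40\t0\t1\t1000\t7000\t7999\t0.0\t1600",
    "LSU_1\tchr11\t99.50\t600\t3\t0\t1\t600\t100\t699\t0.0\t1100",
]
print(filter_blast_hits(example, {"LSU_1": 1000}))
```

A different definition of coverage, for example relative to the subject or to the alignment length, would change which hits are retained, so the thresholds are only comparable when the definition is stated.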

Table 10.2 Number of 5.8S rRNA, 5S rRNA, tRNA as reported by the Tomato Genome Consortium (2012)

5.8S rRNA genes defined by the consortium are listed mainly on chromosomes 11 and 6, while our updated analysis reports higher numbers, corresponding to regions similar to 25S sequences (Table 10.2). It is also evident that there are still matches on the unassigned sequences collected as unassembled on “chromosome 0”, probably because of the difficulties in assigning repeated sequences during the assembly of large and complex genomes.

The table also shows a high number of 5S coding regions on chromosome 1 (Fig. 10.3a), confirming the loci identified as repeated in tandem by FISH on pachytene chromosomes on the short arm of chromosome 1 (1S), close to the centromeric region (Vallejos et al. 1986; Lapitan et al. 1991; Xu and Earle 1996a, b). Though, as explained, information on the long rDNA regions (45S or at least the 18S and 25S families) was not available from the sequencing and annotation effort, we reviewed the information collected from analyses preceding the tomato genome sequencing and exploited our update based on the BLASTn analysis. Indeed, it was known that ribosomal DNA represents the most abundant repetitive DNA family in tomato, comprising approximately 3 % of the genome. From experimental analysis, 5S and 45S rRNA genes were detected as tandemly repeated, with 1000 and 2300 copies, respectively. Karyotyping in combination with fluorescence in situ hybridization (FISH) on tomato pachytene chromosomes allowed the identification and mapping of the 45S rDNA on the satellite of the short arm of chromosome 2 (2S) and of a minor locus on 2L, though this evidence is not confirmed by the tomato genome sequencing, in which no match was detected at these loci, not even with 5.8S, the only marker considered (Vallejos et al. 1986; Tanksley et al. 1988; Lapitan et al. 1991; Xu and Earle 1996a, b). However, these results find some confirmation in our updated analysis, with a few 25S matches confirmed on chromosome 2. Other minor loci were also revealed at 6S, 9S and 11S (Xu and Earle 1996a, b), the first and the last also finding some confirmation in the annotation from the consortium, with stronger support from our update. Indeed, the updated analysis shows regions similar to the 25S (LSU) in all the chromosomes, accompanied by a similar distribution of the 18S, though with lower numbers, in contrast with what was expected from previous analyses.

Fig. 10.3

Distribution of repeated non protein-coding genes on chromosome 1 (a) and chromosome 6 (b). The percentage of N is also reported from a nonoverlapping window analysis of the chromosomes divided into 500-kb windows, with a total of 197 windows for chromosome 1 and 100 windows for chromosome 6. Details of regions with 5S rRNA and tRNA in tandem on chromosome 1 are shown

In Fig. 10.3a, b the distribution of non protein-coding genes on chromosomes 1 and 6 is shown, respectively. Data are from the iTAG2.4_infernal.gff file made available by the tomato genome consortium at ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/. The results from the updated analysis provided here are also shown in the figure.
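
The nonoverlapping window analysis of the percentage of N reported in Fig. 10.3 can be sketched as follows. This is a minimal Python illustration rather than the script used to draw the figure; the function name is hypothetical, and treating the last, shorter window like any other is an assumption.

```python
def n_percent_per_window(sequence, window=500_000):
    """Percentage of N characters in consecutive, nonoverlapping windows.

    The last window may be shorter than the others; its percentage is
    computed on its actual length (an assumption of this sketch).
    """
    percentages = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        percentages.append(100.0 * chunk.count("N") / len(chunk))
    return percentages


# Toy example with a window of 4 nucleotides instead of 500 kb.
print(n_percent_per_window("ACGTNNNNAC", window=4))  # prints [0.0, 100.0, 0.0]
```

With this scheme the number of windows equals the chromosome length divided by the window size, rounded up.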

Our updated analysis also permitted the clear identification of an rDNA locus associated with the occurrence of 45S loci on chromosome 6, since 18S, 5.8S and 25S are all located in the region (Fig. 10.3b).

tDNA distribution is shown both in Table 10.2 and in Fig. 10.3. Interestingly, tDNAs occur in all the chromosomes.
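
A per-chromosome tally of annotated features, like the one summarized in Table 10.2, can be obtained from a GFF3 file with a few lines of Python. The sketch below relies only on the standard nine-column GFF3 layout. The function name, the sequence identifiers and the feature type used in the toy rows are hypothetical, and the way the consortium file actually labels tRNA or rRNA features is not known from the text and would need to be checked.

```python
from collections import Counter


def count_features_per_sequence(gff3_lines, feature_type=None):
    """Count GFF3 features per sequence id (column 1).

    If feature_type is given, only rows whose type (column 3) matches it
    are counted. Comment lines and malformed rows are skipped.
    """
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        seqid, feature = fields[0], fields[2]
        if feature_type is None or feature == feature_type:
            counts[seqid] += 1
    return counts


# Toy rows; identifiers and feature types are made up for the example.
toy_gff3 = [
    "##gff-version 3",
    "chr01\tinfernal\ttRNA\t100\t172\t50.1\t+\t.\tID=t1",
    "chr01\tinfernal\ttRNA\t900\t972\t48.7\t-\t.\tID=t2",
    "chr06\tinfernal\trRNA\t500\t619\t80.2\t+\t.\tID=r1",
]
print(count_features_per_sequence(toy_gff3, feature_type="tRNA"))
```

In practice the lines would be read from the annotation file, for example by passing an open file handle to the function.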

Noncoding Tandem Repeats

Noncoding tandem repeat sequences in tomato chromosomes were detected using the de novo approach of Tandem Repeats Finder (Benson 1999), with default parameters. This allowed the sequences to be classified by length into microsatellites (2–9 bp), minisatellites (10–99 bp) and satellites (≥100 bp), while overlapping annotations of more than one of the three classes were classified as hybrid type.
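
The classification by repeat unit length described above can be written as a small Python sketch. The thresholds are those given in the text. The function names, the label used for units shorter than 2 bp, and the rule used to flag hybrid annotations (any overlap with a repeat of a different class) are assumptions, since the exact procedure is not detailed.

```python
def classify_tandem_repeat(period):
    """Assign a class from the repeat unit length in bp."""
    if period < 2:
        return "unclassified"  # not covered by the three classes in the text
    if period <= 9:
        return "microsatellite"
    if period <= 99:
        return "minisatellite"
    return "satellite"


def label_repeats(annotations):
    """Label (start, end, period) annotations, flagging hybrids.

    Coordinates are 0-based, end-exclusive. An annotation is labelled
    'hybrid' when it overlaps an annotation of a different class.
    """
    classified = [(s, e, classify_tandem_repeat(p)) for s, e, p in annotations]
    labelled = []
    for i, (start, end, cls) in enumerate(classified):
        overlaps_other_class = any(
            j != i and start < other_end and other_start < end and other_cls != cls
            for j, (other_start, other_end, other_cls) in enumerate(classified)
        )
        labelled.append((start, end, "hybrid" if overlaps_other_class else cls))
    return labelled


print(label_repeats([(0, 50, 2), (40, 200, 30), (300, 900, 150)]))
```

The pairwise comparison is quadratic in the number of annotations, which is acceptable for a sketch; a sorted sweep over the coordinates would be preferable for a whole genome.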

Overall, the tandem repeats cover 3.2 % of the genome, with the major contribution from minisatellites (1.7 % of the entire genome and 53.7 % of the tandem repeats). Microsatellite repeats in the tomato genome were also analyzed by Suresh et al. (2014), who detected a total of 68,641 microsatellite repeat motifs. Dinucleotide repeats (60.18 %) were much more abundant than trinucleotide (19.56 %) and other repeats. Of the detected motifs, ~82.90 and ~17.10 % were simple and compound repeats, respectively. A total of 5841 and 4773 SSRs were present in the assigned genes and their 5′-upstream sequences, with average frequencies of 0.172 SSRs/gene and 0.14 SSRs/5′-upstream sequence, respectively. Data are accessible at the Tomato Genomic Resources Database (http://59.163.192.91/tomato2/).

Telomere

Beyond rDNAs, telomeres are the most ubiquitous tandem repeated arrays in the genome of eukaryotes.

The telomere repeats have been studied extensively in species of the Solanaceae family, which mostly show the Arabidopsis-type telomere (TTTAGGG). The typical tomato telomeric repeat (TR) (TT(T/A)AGGG) is arranged in tandem to form large uninterrupted blocks (Ganal et al. 1991). A block of 162-bp subtelomeric repeats (TGRI) is localized a few hundred kb from the terminal telomere repeats in 20 of the 24 homologous chromosomes (Ganal et al. 1988, 1991; Schweizer et al. 1988; Lapitan et al. 1989). These repeated blocks together account for around 2 % of the total chromosomal DNA and, though the TR repeat is highly conserved, the long range physical structure of these arrays has been shown to be highly variable in different varieties (Broun et al. 1992) and within the genome (Zhong et al. 1998). Zhong et al. (1998) investigated the relative length and distribution of the TR, the spacer and the TGRI blocks in tomato chromosomes. The main finding of their work was that the TR-spacer-TGRI organization differs in most if not all the chromosome ends in tomato. Concerning the role of the spacer and the TGRI repeats, it is assumed that they could represent buffering blocks separating chromosome ends from unique sequences or, alternatively, that they play a role in favouring or preventing chromosome degradation, fusions and fissions (Meyne et al. 1990). However, they have also been speculated to be regions susceptible to unequal crossing over between homologous and even nonhomologous chromosomes, leading to high polymorphism even in conserved genomes (Broun et al. 1992).
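
The tomato telomeric repeat TT(T/A)AGGG can be expressed directly as a regular expression. The following Python sketch looks for the longest uninterrupted tandem array of this unit on the given strand. It is an illustration only: the function name and the minimum number of copies are hypothetical, the reverse-complement orientation is not searched, and imperfect copies interrupt an array.

```python
import re

# Tomato-type telomeric unit from the text: TT(T/A)AGGG.
TELOMERE_UNIT = "TT[TA]AGGG"
UNIT_LENGTH = 7


def longest_telomere_array(seq, min_copies=2):
    """Return (start, end, copies) of the longest perfect tandem array,
    or None if no array with at least min_copies units is found.
    Coordinates are 0-based, end-exclusive."""
    pattern = re.compile("(?:%s){%d,}" % (TELOMERE_UNIT, min_copies))
    best = None
    for match in pattern.finditer(seq.upper()):
        copies = (match.end() - match.start()) // UNIT_LENGTH
        if best is None or copies > best[2]:
            best = (match.start(), match.end(), copies)
    return best


toy = "ACGT" + "TTTAGGG" * 3 + "TTAAGGG" * 2 + "CCC"
print(longest_telomere_array(toy))  # prints (4, 39, 5)
```

The same pattern applied to internal chromosome regions would flag candidate interstitial telomeric repeats, although real arrays usually contain degenerate copies that a strict pattern misses.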

Interestingly, interstitial telomeric repeats (ITRs) were also revealed by hybridizing the TR repeat to lambda clones of tomato, which showed unexpected telomere-homologous sequences at 8 of the 12 tomato centromeres (Ganal et al. 1991; Presting et al. 1996).

ITRs are organized as short tandem arrays and are expected to be evolutionary relics derived from chromosomal rearrangements and DNA repair (He et al. 2013). However, megabase-sized ITR arrays were reported in Solanum species (Tek and Jiang 2004). These results showed that some ITR subfamilies were amplified and invaded the functional centromeres of Solanaceae chromosomes, suggesting roles other than simply being relics of chromosomal rearrangements. The epigenetic landscape and transcription of telomeres and ITRs were also investigated. As an example, in Nicotiana tabacum (with no detectable ITRs) and in Ballantinia antipoda (with large blocks of pericentromeric ITRs and relatively short telomeres), Majerová et al. (2014) revealed that genuine telomeres displayed heterochromatic as well as euchromatic marks, while ITRs were only heterochromatic. Methylated cytosines were present at telomeres and ITRs, but with a bias towards more methylation at distal telomere positions and with different blocks of ITRs methylated to different levels (Majerová et al. 2014). Interestingly, the authors also showed that telomeres and ITRs are transcribed, and that the level of telomerase transcripts is tissue dependent, contributing novel insights into the specific role and regulatory activity of the associated transcripts.

Centromere

The tomato genome sequencing confirmed the presence of a high DNA repeat content in the heterochromatic pericentromeric regions; however, the sequencing effort provided no additional information to characterize the centromeric tandem repeated regions. It is known, however, that both the centromeric satellites and the retroelements are essential for centromere recognition by kinetochore proteins (Zhong et al. 2002; Nagaki and Murata 2005; Nagaki et al. 2011), and previous efforts also revealed the mosaic structure of centromeres in plant species (Nagaki et al. 2012). Interestingly, though it was evident that centromeric repeats evolve rapidly (Melters et al. 2013), Gong et al. (2012) reported that six of the 12 potato centromeres contain megabase-sized arrays of satellite repeats that differ in each centromere. By contrast, five potato centromeres were shown to be composed of single- and low-copy DNA sequences, with no satellite repeats detected. These five potato centromeres structurally resemble neocentromeres. Moreover, the authors also showed that most of the centromeric satellite repeats in potato were amplified recently from retrotransposon-related sequences and are not present in wild Solanum species closely related to potato.

A deeper comparative analysis revealed that different centromeric haplotypes are associated with three potato centromeres, including haplotypes containing megabase-sized satellite repeats and haplotypes that do not contain the same repeats (Wang et al. 2014).

To further understand the evolution of centromeric DNA in Solanum species, Zhang et al. (2014) conducted a genome-wide analysis of DNA sequences associated with the cenH3 nucleosomes in Solanum verrucosum (2n = 2x = 24), a wild species closely related to potato. They demonstrated a rapid divergence of the centromeric sequences between these two closely related species. They therefore hypothesized that centromeric satellite repeats may undergo boom–bust cycles of evolution, from which structurally favourable repeat lengths, possibly those best suited to cenH3 nucleosome organization, could emerge.

Many existing centromeres are believed to have originated as neocentromeres that were activated de novo from noncentromeric regions by acquiring specific histones in the nucleosome (for example, the canonical histone H3 is replaced by cenH3 histone in plants or by CENP-A in animals) (Kalitsis and Choo 2012; Rocchi et al. 2012). Newly formed neocentromeres are associated with gene “desert” regions and initially do not contain satellite repeats (Marshall et al. 2008; Wang et al. 2014). The evolutionarily new centromeres presumably accumulate satellite repeats and/or retrotransposons during evolution and eventually evolve rapidly to become repeat-based centromeres (Yan et al. 2006; Kalitsis and Choo 2012; Sharma et al. 2013).

Transposons

Considering the dispersed repeats, we already reported on tDNA distribution in the tomato genome in the paragraph on non protein-coding repeated gene families.

The other relevant class among dispersed repeats includes the transposons. In Table 10.3, we report the nucleotide coverage in terms of transposon classes of all the chromosomes, as derived from the annotation reported in the iTAG2.4_repeat.gff3 file released by the tomato genome consortium (Tomato Genome Consortium 2012) and available at http://solgenomics.net.
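The per-chromosome coverage figures of the kind summarized in Table 10.3 can be derived from a GFF3 annotation by merging overlapping features within each class and summing the covered bases. The sketch below is only illustrative: it assumes that the repeat class can be read from column 3 of each record, which may not match the actual layout of the released file, and the function names `merged_length` and `coverage_by_class` are hypothetical.

```python
from collections import defaultdict


def merged_length(intervals):
    """Bases covered by (start, end) intervals, 1-based inclusive, overlaps counted once."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def coverage_by_class(gff_lines, get_class=lambda fields: fields[2]):
    """Map (chromosome, class) to covered nucleotides.

    `get_class` extracts the repeat class from the nine GFF3 columns;
    using column 3 is an assumption made for this example.
    """
    per_key = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        per_key[(fields[0], get_class(fields))].append((int(fields[3]), int(fields[4])))
    return {key: merged_length(ivs) for key, ivs in per_key.items()}


# Toy records (not real annotation data).
toy = [
    "##gff-version 3",
    "ch01\tsrc\tLTR\t1\t100\t.\t+\t.\tID=a",
    "ch01\tsrc\tLTR\t51\t150\t.\t+\t.\tID=b",
    "ch01\tsrc\tSINE\t200\t249\t.\t-\t.\tID=c",
]
print(coverage_by_class(toy))
```

Merging before summing avoids counting a nucleotide twice when two annotations of the same class overlap.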

While the pseudomolecule images in the Nature paper (Tomato Genome Consortium 2012) report the general behaviour of repeat content along tomato and potato pseudomolecules, in this chapter we provide, as an example, a more detailed view based on a similar approach, showing the distribution of each single class of repeats along tomato chromosomes 1 and 6 (Fig. 10.4a, b).

As reported by the Tomato Genome Consortium (2012), full-length LTR retrotransposons in the tomato genome sequence were detected by a curated analysis starting from a de novo approach based on LTR-STRUC (McCarthy and McDonald 2003), which identified 1647 intact LTR retrotransposons. These sequences were assigned to the gypsy or copia subgroups according to the order of their inner protein domains.

Additional full-length LTR elements were found by sequence similarity, leading to a total of 4052 still intact elements. Moreover, a cluster analysis of these sequences highlighted that the tomato and potato (Potato Genome Sequencing Consortium 2011) genome sequences share common LTR retrotransposons (Tomato Genome Consortium 2012).

The insertion events of LTR retrotransposons were also dated from the sequence divergence between left and right LTRs (Wiley et al. 2009). Interestingly, this analysis showed fewer copies and older insertion ages in tomato and potato when compared to sorghum. This appears to be a peculiarity of tomato, and apparently also of potato, among angiosperms (Tomato Genome Consortium 2012).
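The dating principle rests on the fact that the two LTRs of an element are identical at insertion and then diverge independently, so that an age can be approximated as T = K / (2r), where K is the divergence between the two LTRs and r is a substitution rate per site per year. The sketch below is a generic illustration and not the consortium's pipeline: the choice of the Kimura two-parameter distance, the function names and the rate value are all assumptions made for the example.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned, equal-length LTRs."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if a != b:
            if (a in PURINES) == (b in PURINES):
                transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
            else:
                transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / sites, transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age(k, rate):
    """Years since insertion, T = K / (2r); both LTRs accumulate changes."""
    return k / (2 * rate)


# Toy LTR pair: 100 aligned sites with two transitions.
left = "A" * 100
right = "G" * 2 + "A" * 98
k = k2p_distance(left, right)
ASSUMED_RATE = 1.3e-8  # placeholder rate (substitutions/site/year), not from the text
print(k, insertion_age(k, ASSUMED_RATE))
```

The resulting age scales inversely with the assumed rate, so absolute dates are only as reliable as the rate chosen for the lineage.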

Transposons along tomato chromosomes were annotated by the wublast version of RepeatMasker (http://www.repeatmasker.org) against the dicots section of mipsREdat (REdat_v8.9_Eudico). This transposon library is connected to a repeat classification scheme (mips_REcat) and contains a collection of known transposons as well as de novo detected LTR retrotransposons from tomato (1647) and potato (1309). The RepeatMasker output was subjected to two post-processing filter steps: (a) removal of low confidence hits (length <50 bp, score ≥255) and (b) cleaning of overlapping annotations, in which higher-scoring hits were considered first and overlapping lower-scoring hits were either shortened or, if the overlap exceeded 80 % of their length, removed.
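The two post-processing steps can be sketched as follows. This is a hedged illustration and not the consortium's code: it assumes that hits shorter than 50 bp or scoring below 255 are the low-confidence ones (the direction of the score threshold is an interpretation of the text), and that a "shortened" hit keeps its longest uncovered piece. The function name and parameters are hypothetical.

```python
def postprocess_hits(hits, min_len=50, min_score=255, max_overlap=0.8):
    """Filter and de-overlap hits given as (start, end, score), 1-based inclusive.

    All hits are assumed to lie on the same sequence.
    """
    # Step (a): drop low-confidence hits (assumed: too short OR low score).
    kept = [h for h in hits if (h[1] - h[0] + 1) >= min_len and h[2] >= min_score]

    # Step (b): visit hits from highest to lowest score.
    accepted = []
    for start, end, score in sorted(kept, key=lambda h: -h[2]):
        length = end - start + 1
        pieces = [(start, end)]  # parts of this hit not covered by better hits
        for a_start, a_end, _ in accepted:
            next_pieces = []
            for p_start, p_end in pieces:
                if a_end < p_start or a_start > p_end:
                    next_pieces.append((p_start, p_end))  # no overlap
                    continue
                if a_start > p_start:
                    next_pieces.append((p_start, a_start - 1))
                if a_end < p_end:
                    next_pieces.append((a_end + 1, p_end))
            pieces = next_pieces
        overlap = length - sum(e - s + 1 for s, e in pieces)
        if overlap > max_overlap * length:
            continue  # more than 80 % covered by higher-scoring hits: remove
        s, e = max(pieces, key=lambda p: p[1] - p[0])  # assumed: keep longest piece
        accepted.append((s, e, score))
    return sorted(accepted)


# Toy hits (not real RepeatMasker output).
toy_hits = [(1, 100, 900), (81, 200, 500), (10, 95, 400), (300, 330, 1000), (400, 500, 100)]
print(postprocess_hits(toy_hits))
```

In the toy input the second hit is trimmed to the part not covered by the first, the third is removed because it is entirely covered, and the last two fail the length and score filters, respectively.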

In Table 10.3 we redefined the nucleotide coverage in terms of repeat classes for all the tomato chromosomes, starting from the available annotation from the consortium (Tomato Genome Consortium 2012).


Moreover, in Fig. 10.5 we report the distribution of the transposons as a function of the difference between repeat and gene content (ΔRG) in 500 kb windows along chromosome 6. The plots confirm the high content of LTR retrotransposons in repeat-rich regions, which should correspond to heterochromatic regions (Di Filippo et al. 2012), with a higher content of the gypsy-like class and a much lower content of the copia-like one. The plots also show that, among non-LTR retrotransposons, SINEs are more frequent in gene-richer regions, as also demonstrated at BAC level (Di Filippo et al. 2012), with a similar trend for LINEs.
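A window-based difference of this kind can be computed as in the following sketch, which assumes that "content" means the fraction of nucleotides in each nonoverlapping window covered by the features of interest. This reading, as well as the function names `window_fractions` and `delta_rg`, is an assumption made for illustration.

```python
def window_fractions(intervals, chrom_len, window=500_000):
    """Covered fraction per nonoverlapping window for 1-based inclusive intervals."""
    n = -(-chrom_len // window)  # ceiling division
    covered = [0] * n
    merged = []
    for s, e in sorted(intervals):  # merge overlaps so bases are counted once
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    for s, e in merged:
        s0, e0 = s - 1, min(e, chrom_len)  # 0-based, end-exclusive, clipped
        w = s0 // window
        while s0 < e0:
            w_end = min((w + 1) * window, chrom_len)
            chunk = min(e0, w_end) - s0
            covered[w] += chunk
            s0 += chunk
            w += 1
    sizes = [min((i + 1) * window, chrom_len) - i * window for i in range(n)]
    return [c / size for c, size in zip(covered, sizes)]


def delta_rg(repeat_intervals, gene_intervals, chrom_len, window=500_000):
    """Repeat fraction minus gene fraction for each window."""
    rep = window_fractions(repeat_intervals, chrom_len, window)
    gen = window_fractions(gene_intervals, chrom_len, window)
    return [r - g for r, g in zip(rep, gen)]


# Toy chromosome of 250 bp with 100 bp windows (not real data).
toy_repeats = [(1, 100), (151, 200)]
toy_genes = [(101, 150), (201, 250)]
print(delta_rg(toy_repeats, toy_genes, chrom_len=250, window=100))
```

Positive values mark repeat-rich windows and negative values gene-rich ones; the last window is normalized by its actual length when the chromosome length is not a multiple of the window size.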

Table 10.3 Number of nucleotides covered by transposons per chromosome
Fig. 10.4 Distribution of gene and repeat content along chromosomes 1 and 6. Annotation of line, LTR, Gypsy, Copia, Sine and DNA transposons were obtained from ITAG2.4_repeats.gff3; gene annotations were from ITAG2.4_gene_models.gff3, both available at http://solgenomics.net/. Data are reported by a 500 Kb nonoverlapping window. Left and right y-axes represent different percentages. The right y-axes represent the number of undefined nucleotide (N) per window

Fig. 10.5 Distribution of main repeat classes by windows of 500 kb along chromosome 6. The data are reported as frequency in the window versus the difference between repeat and gene content frequency (ΔRG). Annotations were obtained as for Fig. 10.3

The iTAG2.4_repeats.gff3 file used to perform this analysis was downloaded from the ftp section at http://solgenomics.net/.

Discussion

Solanaceae is an unusually divergent family consisting of approximately 90 genera and 3000–4000 species (Knapp et al. 2004), and almost all members share the same chromosome number (x = 12) (Wikstrom et al. 2001). The genomes appear to have undergone relatively few chromosomal rearrangements (Park et al. 2011) and have maintained a conserved gene content and order (Bonierbale et al. 1988; Tanksley et al. 1988; Prince et al. 1993; Livingstone et al. 1999; Wang et al. 2008; Wu et al. 2009). Although the sequencing of different genotypes of the same species revealed microscale heterogeneity between cultivated and wild species (Traini et al. 2013; Ercolano et al. 2014; Qin et al. 2014), the Solanaceae gene regions were generally described as conserved, even at the level of syntenic segments (Wang et al. 2011). The conservation observed at gene level, however, does not extend to genome size or to repetitive sequence content and composition. Within the Solanaceae family, Solanum lycopersicum (tomato) has a genome size of ~950 Mb, Solanum tuberosum (potato) of 840 Mb and Capsicum annuum (pepper) of 3349 Mb, although the estimated gene content is comparable, suggesting a possibly significant role of repeats in the speciation of this clade of plants (Zhu et al. 2008).

The 12 tomato chromosomes consist largely of heterochromatin (>60 % of the genome), mostly located at the telomeres and in extended pericentromeric regions. The euchromatic regions lie in the distal parts of the chromosomes (Peterson et al. 1996, 1998) and are composed mostly of single-copy sequences, with fewer retrotransposons and 90 % of the genes (Chang et al. 2008).

Pericentromeric heterochromatin is generally assumed to be gene poor and repeat-rich, where crossing over is severely repressed (Sherman and Stack 1995). The pericentromeric heterochromatic segments contain a large portion of retrotransposons, other types of repeated sequences and some single-copy sequences, which also include a lower but representative gene content (Di Filippo et al. 2012).

Among tandem repeats, ribosomal DNA represents one of the most abundant repetitive DNA families. The repeat unit, estimated to be 9.1 kb long, was estimated by Ganal et al. (1988) to be present in about 2300 copies and located at the end of chromosome 2. rDNA should represent 3 % of the tomato genome, and its distribution was also described by several other efforts (Vallejos et al. 1986; Lapitan et al. 1991). As reported in this chapter, the rDNA regions appear not to be exhaustively covered by the tomato genome sequencing and by the associated annotation, and this is presumably the reason why they are not broadly discussed in that effort (Tomato Genome Consortium 2012). However, the presence of satellite DNA joined to the intergenic spacer of rDNA units also reveals the strong association of these two types of repeats and a possible origin of satellite repeats from these loci (Jo et al. 2009).

Previous analyses also confirmed that a 162 bp satellite repeat, named TGRI, present in 77,000 copies in the genome, is localized within a few hundred kb of the terminal 7 bp telomeric repeat TT(T/A)AGGG at 20 of the 24 tomato chromosome ends (Ganal et al. 1988). In addition, internal telomeric repeats (ITRs) were also found at a few centromeric and interstitial sites (Lapitan et al. 1989; Ganal et al. 1992; Presting et al. 1996), raising interesting questions about the reasons for this organization, as also highlighted in this chapter.

Two other tomato genomic repeats, TGRII and TGRIII, are less abundant, with an estimated 4200 and 2100 copies, respectively. TGRII is apparently randomly distributed, with a fairly regular spacing of 133 kb (Ganal et al. 1988), while TGRIII is predominantly clustered in the pericentromeric region. The TGRIV repeat was discovered later and was found to be mainly associated with satellite repeats in the centromere (Chang et al. 2008).

Microsatellite polymorphism and genomic distribution were studied in tomato by fingerprinting with labelled oligonucleotide probes complementary to GATA or GACA microsatellites (Vosman et al. 1992; Grandillo and Tanksley 1996). The mapping of individual fingerprint bands showed that they are mainly associated with centromeres (Arens et al. 1995). The copy number and the size of microsatellite-containing restriction fragments proved to be highly variable between tomato cultivars (Arens et al. 1995). Structure, abundance, variability and location were also evaluated (Broun and Tanksley 1996) and successfully used for genotyping tomato cultivars and accessions (Smulders et al. 1997; Bredemeijer et al. 2002). Interestingly, a distinctive feature of tomato is the presence of compound satellite repeats, highly variable in length and strongly specific to the species. Ganal et al. (1988) underlined that the distribution of the major classes of tandem repeats described in tomato is limited to this species, probably because these regions evolve at a high rate. Zamir and Tanksley (1988) also reported a positive correlation between copy number and rate of divergence of repeats among DNA sequences from related Solanaceae species. This means that highly repeated regions are less conserved than single-copy regions, consistent with a different selective pressure on the two types of regions. Further analyses revealed rapid evolution of centromere-proximal sequences (Presting et al. 1996), which is also confirmed by analyses in other Solanaceae (Gong et al. 2012; Melters et al. 2013; Wang et al. 2014; Zhang et al. 2014).

Among all classes of repeats, transposons comprise a large proportion of the tomato genome. In general, the highest contribution to dispersed repeats in plant genomes comes from LTR retrotransposons (Piegu et al. 2006; Richard et al. 2008; Lee and Kim 2014). Plants show more C-value variation than other taxa (http://data.kew.org) (Bennett and Leitch 2005), which appears to be correlated with LTR retrotransposon abundance (Michael 2014). In animals, non-LTR elements appear to be more abundant (Sakowicz et al. 2009). DNA transposons have a minor impact on genome size because of the way they expand (Lee and Kim 2014). In particular, repeat-rich regions of the tomato genome revealed an abundance of the LTR retroelements Ty3–gypsy and Ty1–copia (Yasuhara and Wakimoto 2006; Chang et al. 2008; Szinay et al. 2008; Tang et al. 2008a, b; Peters et al. 2009; Di Filippo et al. 2012), though the second class is present to a lesser extent, as also confirmed by the tomato genome annotation (Table 10.3; Fig. 10.5).

In Di Filippo et al. (2012), tomato genome sequences obtained by the preliminary BAC sequencing that preceded the whole-genome shotgun approach were analyzed to correlate heterochromatic and euchromatic regions with their relative gene and repeat content. In the same effort, the associated sequences were analyzed using the molecular markers available to define the eu/heterochromatin boundaries along each tomato chromosome (data from the Solanaceae Genome Network website) and all the BACs assigned to the chromosome structure by fluorescence in situ hybridization (FISH) (de Jong 1998; de Jong et al. 2000; Wang et al. 2006; Szinay et al. 2008; Tang et al. 2008a, b; Peters et al. 2009). This provided a preliminary, sequence-based confirmation that BACs associated with euchromatin in the tomato genome were indeed richer in genes and poorer in repeats than BACs associated with heterochromatic regions. The analyses presented in Di Filippo et al. (2012), while confirming the initial assumption that genes were predominantly located in repeat-poor euchromatic regions, showed that the repeat-rich heterochromatic BACs were not completely depleted of genes (Yasuhara and Wakimoto 2006; Mueller et al. 2009). Interestingly, Di Filippo et al. (2012) also proposed a straightforward approach to show the specific content of repeat classes in gene-richer or repeat-richer tomato BACs, corresponding to euchromatic and heterochromatic BACs, respectively. We exploited the same approach here to confirm, at chromosome level, the distribution of different repeat classes in compositionally different genome regions (Fig. 10.5).

Today it is well known that transposons play various relevant roles in genome evolution, gene expression regulation and genetic instability. They can change position within the genome, contributing to genome reorganization and altering genome size, since transposition often results in duplication of the transposable elements; with their movement they also contribute to changes in cell function and in the development of organisms (Nowacki et al. 2009). Interestingly, in most cases transposable elements are silenced through epigenetic mechanisms such as methylation and chromatin remodelling. As a consequence, when transposons are silenced, as in the wild-type plant, neither phenotypic effects nor transposon movement occur (Martienssen and Colot 2001; Reik et al. 2001). It is important to note, however, that DNA methylation is not conceived as a factor provoking heterochromatin formation (some species may lack methylation) but rather as a factor stabilizing heterochromatin structures (for review, see Wolffe and Matzke 1999).

Type, number and size of repeat domains in a genome can vary among species, and can even differ between close genotypes or accessions, making them useful as genome markers in karyotype analysis and as chromosome markers in a segregating population. However, on the assumption that a portion comprising such a large extent of the genome sequence of higher eukaryotes cannot exist without specific reasons, it would be more interesting to understand the role and the possible advantages, if any, of repeat expansion or reduction, as well as the association of these phenomena with heterochromatin formation. A prerequisite for heterochromatin formation appears to be the structural organization of the repeats rather than the nature of the particular sequences or their repetitive character. It is evident that DNA repeats have a specific structural role in constitutive heterochromatin, which is essential in multicellular organisms at chromosomal and nuclear level. At the chromosomal level, constitutive heterochromatin is present around vital areas such as telomeres and centromeres. The centromeric satellite DNA and retrotransposons are known to be essential for the recognition of the kinetochore (Zhong et al. 2002; Nagaki et al. 2003). The pericentromeric repeats are considered important in the recruitment of histone modification enzymes, promoting the formation and maintenance of heterochromatin (Hall et al. 2002; Volpe et al. 2002; Zhong et al. 2002; Bender 2004; Lippman et al. 2004) and conferring protection and strength to the centromere. Around secondary constrictions, heterochromatic blocks may protect ribosomal DNA against evolutionary change by decreasing the frequency of crossing over in these regions during meiosis, also absorbing the effects of mutagenesis.

Indeed, repetitive sequences in the form of constitutive heterochromatin appeared concomitantly with the localization of the portion of the genome concerned with the synthesis of ribosomal RNA, and with the need to protect chromosome structure and function by telomeres and centromeres, when the mitotic spindle developed in evolution. During meiosis, heterochromatin may also aid in the initial alignment of chromosomes, facilitating speciation by allowing chromosomal rearrangement but also providing, through the species specificity of its DNA, barriers against cross-fertilization. At the nuclear level, constitutive heterochromatin may help to maintain spatial relationships through all the steps of the cell cycle. Repetitive DNA was therefore retained through natural selection and, because of its innate tendency to amplify and expand, it favoured the expansion and evolution of eukaryotic genomes (Yunis and Yasmineh 1971; Bennetzen and Kellogg 1997). This occurred within the limits of an efficient management of other cellular activities (Knight et al. 2005). In principle, repeats are prone to expand, but there are also mechanisms that can dramatically decrease their content, if necessary, including illegitimate or unequal recombination and other types of deletions (Grover and Wendel 2010). However, beyond the relevance discussed here and the impact that DNA repeats can have on genome evolution and expansion, it would also be important to investigate further possible roles of species-specific repeats in structuring and protecting the genome, despite the energy requirements that genome expansion can subtract from cell functionality.