
1.1 Introduction

1.1.1 Early Findings on Genome Sizes and Sequence Complexity

Even before DNA could be sequenced, researchers realised that eukaryotic genomes show an extreme variation in size (Bennett and Smith 1976). Some studies reported an over 200,000-fold variation in genome size, namely between the amoeba Amoeba dubia that has an estimated genome size of 670,000 Mbp (Gregory 2001) and the 2.9 Mbp genome of the microsporidium Encephalitozoon cuniculi (Biderre et al. 1995; Katinka et al. 2001). In the absence of DNA sequence information, genome sizes were measured by estimating nuclear DNA amounts through densitometric measurements (e.g. Bennett and Smith 1976). The “sequence complexity” of genomes was assessed by DNA re-association kinetics. These experiments showed that the vast differences in genome sizes are due to the presence of different amounts of “repeating DNA sequences” (Britten et al. 1974), although their nature was completely unknown at that time. Nevertheless, it was clear early on that the repetitive fraction of a genome is relatively complex and consists of many different types of repeats. Genomes could even be fractionated into highly and moderately repetitive sequences by DNA re-association kinetics (Peterson et al. 2002).

1.1.2 Definition of “Gene Space” and the “C-Value Paradox”

Only when technological advances allowed near-complete sequencing of eukaryotic genomes could actual gene numbers finally be estimated. Here, it needs to be noted that the definition of what actually constitutes the “gene space” of a genome is still a topic of debate. It certainly includes all “typical” protein-coding genes. Additionally, many components of the gene space do not encode proteins, such as the highly repetitive ribosomal DNA clusters, tRNAs and small nucleolar and small interfering RNAs. Probably, gene space should also include conserved non-coding sequences (Freeling and Subramaniam 2009) and ultraconserved elements (Bejerano et al. 2004), although their functions are barely understood. In the following discussion of gene numbers, I will only refer to protein-coding genes.

1.1.3 The Number of Genes is Similar in All Genomes

As Table 1.1 shows, the estimates of gene numbers differ from species to species, but for all sequenced eukaryotic genomes they are in a range from 5,000 to 50,000. Thus, at first glance, gene numbers vary only by a factor of 10 while genome sizes, as described above, vary more than 200,000-fold. The recently finished genome of Brachypodium distachyon probably has the most stringent gene annotation so far and possesses 25,554 genes. This gene number is very similar to that of the most recent version of the Arabidopsis thaliana genome (version 9), which has 26,173 annotated genes. Even the large maize genome is estimated to contain only about 30,000 genes (Schnable et al. 2009). Interestingly, these numbers are very similar to those for vertebrate genomes: for all sequenced vertebrate genomes, such as human, mouse or chicken, gene numbers are now estimated in the range of 25,000–30,000 (Table 1.1). Only fungi and invertebrate animals have clearly fewer genes. Yeast, with its compact 12 Mbp genome, has fewer than 6,000 genes, while insects such as Anopheles gambiae or Drosophila melanogaster have approximately 12,000 genes (Table 1.1). Thus, a consensus emerges that most eukaryotes possess between 5,000 and 30,000 genes, making it obvious that only a relatively small fraction of each genome sequenced to date actually encodes functional genes.

Table 1.1 Genome sizes and gene numbers in publicly available genomes

1.1.4 The C-Value Paradox

The fact that gene numbers are very similar while genome sizes vary extremely came to be known as the “C-value Paradox”. Moreover, depending on which taxonomic group is analysed, there may be little or no correlation between genome size and phylogenetic relationships. This effect is particularly strong in plants, where even very closely related species can have very different genome sizes (Fig. 1.1). Among the dicotyledonous plants, there is Arabidopsis thaliana, the first plant to have its genome completely sequenced. With a size of about 120 Mbp (Arabidopsis Genome Initiative 2000), it is one of the smallest plant genomes known. In contrast, closely related Brassica species that diverged from Arabidopsis only 15–20 MYA (Yang et al. 1999) have five to ten times larger genomes. In monocotyledonous plants, variation is even more extreme: the grasses Brachypodium distachyon, rice and sorghum have genome sizes of 273 Mbp, 389 Mbp and 690 Mbp, respectively, considerably larger than the Arabidopsis genome but roughly an order of magnitude smaller than the genomes of some agriculturally important grass species such as wheat and maize, with haploid genome sizes of 5,700 and 2,500 Mbp, respectively. And even they are still dwarfed by the genomes of some lilies, among them Fritillaria uva-vulpis, which has a genome size of more than 87,000 Mbp, over 700 times the size of the Arabidopsis genome (Leitch et al. 2007). Also among monocotyledons, closely related species often differ dramatically in their genome sizes. Maize and sorghum, for example, diverged only about 12 MYA (Swigonova et al. 2004), but the maize genome is more than three times the size of the sorghum genome (Table 1.1, Fig. 1.1).
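The size ratios quoted in this section can be checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the genome sizes stated in the text; the dictionary and function names are illustrative choices and do not belong to any published tool.

```python
# Haploid genome sizes in Mbp, as quoted in the text of this section.
GENOME_SIZE_MBP = {
    "Arabidopsis thaliana": 120,
    "Brachypodium distachyon": 273,
    "rice": 389,
    "sorghum": 690,
    "maize": 2500,
    "wheat": 5700,
    "Fritillaria uva-vulpis": 87000,
}


def fold_difference(species_a, species_b, sizes=GENOME_SIZE_MBP):
    """Return how many times larger the genome of species_a is than that of species_b."""
    return sizes[species_a] / sizes[species_b]


def sizes_relative_to(reference, sizes=GENOME_SIZE_MBP):
    """Return (species, multiple of the reference genome size) pairs, smallest first."""
    ref = sizes[reference]
    pairs = ((name, size / ref) for name, size in sizes.items())
    return sorted(pairs, key=lambda item: item[1])


if __name__ == "__main__":
    for name, ratio in sizes_relative_to("Arabidopsis thaliana"):
        print(f"{name:26s} {ratio:8.1f}x")
```

With these values, the Fritillaria uva-vulpis genome comes out at 725 times the size of the Arabidopsis genome, in line with the “over 700 times” stated above.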

Fig. 1.1

Phylogenetic relationships and genome sizes in selected plant species. Divergence times of specific clades are indicated in red numbers next to the corresponding branching. These numbers are averages of the published values provided in Table 1.1. The scale at the bottom indicates divergence times in million years ago (MYA). Major taxonomic groups that are discussed in the text are indicated at the left

1.2 Transposable Elements

1.2.1 Basics of Selfishness and Junk

As the number of genes is similar in all organisms, it became clear early on that the factor which mainly determines genome size is the amount of repetitive sequences. Nowadays we know that the vast majority of these repetitive sequences are in fact transposable elements (TEs). These elements contain no genes with apparent importance for the immediate survival of the organism. Instead they contain just enough genetic information to produce copies of themselves and/or move around in the genome. For this reason, such sequences are often referred to as “selfish” DNA (Orgel and Crick 1980). To some degree that disparaging view is justified, because TEs are small genetic units, actual “minimal genomes”, which contain exactly enough information to be able to replicate, move around in the genome or both. They use the DNA replication and translation machinery of their “host” and thrive within the environment of the genome. For this reason, the term “junk DNA” is often used almost synonymously with TE sequences, reflecting the view that TEs are largely a parasitic burden to the organism.

1.2.2 TE Taxonomy and Classification

Pioneering work in TE classification was done by Hull and Covey (1986), Finnegan (1989) and Capy et al. (1996). The first publicly available database for TEs was RepBase (girinst.org/repbase/) by Jerzy Jurka and colleagues, who also proposed a classification system for all TEs (Jurka et al. 2005). In 2007, a group of TE experts met at the Plant and Animal Genome Conference in San Diego (CA, USA) with the goal of defining a broad consensus for the classification of all eukaryotic transposable elements. This included the definition of consistent criteria for the characterisation of the main superfamilies and families and a proposal for a naming system (Wicker et al. 2007). The proposed system is a consensus of previous TE classification systems and groups all TEs into 2 major classes, 9 orders and 29 superfamilies (Fig. 1.2). A practical aspect of the classification system is that the TE family name should be preceded by a three-letter code for class, order and superfamily (Fig. 1.2). This was intended to make working with large sets of diverse TEs easier, as it enables simple text-based sorting and allows the immediate recognition of the classification when seeing the name of a TE. The proposed classification system is open to expansion, as new types of TEs might still be identified in the future. A system that attempts to cover such a vast and complex biological field is by its nature reductionist and tends to oversimplify matters. Thus, there is still an ongoing scientific debate about various aspects of the system (Kapitonov and Jurka 2008; Seberg and Petersen 2009), some of which will be discussed in more detail below.
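To illustrate how the three-letter prefix enables text-based sorting, the sketch below splits a TE name into its code and family name. The underscore-separated name format and the small lookup tables are assumptions that follow the convention of Wicker et al. (2007); they are a deliberately incomplete subset that is not spelled out in this text and should be checked against Fig. 1.2 before use. The names RLG_ExampleA and DTT_ExampleB are hypothetical.

```python
from typing import NamedTuple

# Illustrative, incomplete lookup tables (assumption: letters as in Wicker et al. 2007).
CLASS_CODES = {"R": "Class 1 (retrotransposon)", "D": "Class 2 (DNA transposon)"}
ORDER_CODES = {"L": "LTR", "I": "LINE", "S": "SINE", "T": "TIR", "H": "Helitron"}
SUPERFAMILY_CODES = {
    "RLC": "Copia",
    "RLG": "Gypsy",
    "DTT": "Tc1-Mariner",
    "DTM": "Mutator",
    "DHH": "Helitron",
}


class TEName(NamedTuple):
    code: str         # three-letter code, e.g. "RLC"
    family: str       # family name, e.g. "BARE1"
    te_class: str
    order: str
    superfamily: str


def parse_te_name(name: str) -> TEName:
    """Split a name such as 'RLC_BARE1' into its three-letter code and family name."""
    code, separator, family = name.partition("_")
    if not separator or len(code) != 3 or not family:
        raise ValueError(f"expected '<three-letter code>_<family>', got {name!r}")
    return TEName(
        code=code,
        family=family,
        te_class=CLASS_CODES.get(code[0], "unknown"),
        order=ORDER_CODES.get(code[1], "unknown"),
        superfamily=SUPERFAMILY_CODES.get(code, "unknown"),
    )


if __name__ == "__main__":
    names = ["RLG_ExampleA", "DTT_ExampleB", "RLC_Maximus", "RLC_BARE1"]
    # Plain alphabetical sorting already groups elements by class, order and superfamily.
    for name in sorted(names):
        parsed = parse_te_name(name)
        print(f"{name:14s} {parsed.te_class:28s} {parsed.order:6s} {parsed.superfamily}")
```

Because the code comes first in the name, an ordinary alphabetical sort is enough to bring all members of a superfamily together, which is the practical benefit described above.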

Fig. 1.2

Classification system for transposable elements (Wicker et al. 2007). The classification divides TEs into two main classes on the basis of the presence or absence of RNA as a transposition intermediate. They are further subdivided into subclasses, orders and superfamilies. The size of the target site duplication (TSD), which is characteristic of most superfamilies, can be used as a diagnostic feature. A three-letter code describes all major groups and is added to the family name of each TE

1.2.3 Class and Subclass: The Highest Levels of TE Classification

At the highest taxonomic level, TEs are divided into two classes. Class 1 contains all TEs that replicate via an RNA intermediate in a “copy-and-paste” process. This class includes both LTR and non-LTR retrotransposons. In Class 2 elements, the DNA itself is moved in a process analogous to “cut-and-paste”. Class 2 elements are further subdivided into Subclasses 1 and 2. Subclass 1 comprises the classic cut-and-paste elements, in which the DNA is moved with the help of a transposase enzyme. Subclass 2 includes TEs whose transposition process entails replication without double-stranded cleavage and the displacement of only one strand. The order Helitron from Subclass 2 seems to replicate via a rolling-circle mechanism (Kapitonov and Jurka 2001). The placement of Subclass 2 elements within Class 2 reflects the common lack of an RNA intermediate, but not necessarily common ancestry.

1.2.4 TE Superfamilies Represent Ancient Evolutionary Lineages

The most commonly used level of classification is the assignment of a TE to a particular superfamily. Superfamilies are ancient evolutionary lineages that arose during the very early evolution of eukaryotes, some even before the divergence of prokaryotes and eukaryotes. Superfamilies are mainly defined by homology at the protein level. That means that two TEs belong to the same superfamily if their predicted protein sequences show clear homology and can be aligned over most of their length. Terms like “clear homology” and “most of their length” reflect a plea to common sense and should not be tightly bound to arbitrary cut-offs based on E-values or percent sequence similarity. The fact is that TEs belonging to the same superfamily (even if they come from very distantly related species) usually share many conserved amino acid motifs along the length of their predicted proteins which, importantly for practical work, are usually picked up in a blastx or blastp search. In contrast, TEs from different superfamilies usually show hardly any sequence similarity in their encoded proteins. Protein similarity between members of different superfamilies is reduced to very ancient sequence motifs such as the DDE or Zn-finger motifs (Capy et al. 1997). Here it has to be noted that sequence similarity within the same superfamily can only be expected in the “core” enzymes of the TEs, such as the transposase, reverse transcriptase or integrase, while fast-evolving proteins such as gag (in LTR retrotransposons) and ORF2 (in many DNA transposons) often cannot be aligned between members of the same superfamily. The superfamily of SINEs (small interspersed nuclear elements) has a special status. These small elements do not encode any proteins but are derived from RNA Polymerase promoters and can therefore only be classified based on specific DNA motifs.

1.2.5 TEs Show Most Diversity at the Family Level

It is at the family level where things get really complicated. While the 29 superfamilies are relatively clearly defined, the exact definition of a TE family is still a topic of debate (Kapitonov and Jurka 2008; Seberg and Petersen 2009). It is clear that within superfamilies TEs have diverged into an almost incomprehensibly large number of sub-groups and clades. Here, researchers usually introduce the family as the next lower level (after superfamily). Early on, it became clear that there must be hundreds or even thousands of different types of TEs populating genomes (SanMiguel et al. 1998; Wicker et al. 2001). However, the challenge has been to define criteria for a family that, on the one hand, make at least some biological sense and, on the other hand, are reasonably simple to apply. Of course, the most biologically meaningful TE classification would be based on phylogenetic analysis (Seberg and Petersen 2009). Construction of phylogenetic trees deduced from DNA or predicted protein sequences allows the identification of specific clades, and is therefore a classification scheme based on biological criteria. Such analyses are essential for our understanding of how TEs and genomes evolve. However, phylogenetic analyses are complex, very labour intensive and require a thorough knowledge of TEs, and they are relatively irrelevant when it comes to the initial task of TE identification and annotation, especially in large-scale genome projects.

1.2.6 The 80–80–80 Rule Revisited

In 2007, several colleagues and I proposed the “80–80–80” rule (Wicker et al. 2007), which became both famous and infamous among researchers working on TE annotation. The rule says that two TEs belong to the same family if they share at least 80 % sequence identity at the DNA level over at least 80 % of their total size. The third criterion simply refers to the minimal size (80 bp) of a putative TE sequence that should be analysed, in order to avoid over-interpreting unspecific signals. The rule was mainly based on practical criteria. We assumed that most researchers tasked with annotating TE sequences would need a simple guideline to classify them. In most cases, blastn (DNA against DNA) searches would be performed as a first step for TE identification. The BLAST algorithm is not able to align DNAs which are significantly less than 80 % identical. Thus, a given TE sequence will produce no strong blastn alignments if its sequence is significantly less than 80 % identical to sequences in the reference database. The second criterion (80 % of the entire length of the TE) was introduced to address the problem that different parts show different levels of sequence conservation within the same TE family. Most TEs are composed of protein-coding sequences and regulatory regions. Good examples illustrating that problem are the long terminal repeat (LTR) retrotransposon superfamilies. The two LTRs contain promoter and downstream regions while the internal domain contains mainly protein-coding regions. Comparisons between many different TE families show that the regulatory regions evolve much faster than the coding sequences. Thus, often the DNA sequences of the coding region might be alignable while up- and downstream regions (e.g. LTRs) are completely diverged and cannot be aligned. The second criterion of the 80–80–80 rule requires that at least some of the regulatory sequences can be aligned at the DNA level. There is at least some biological justification for the 80–80–80 rule, as elements which are similar at the DNA level must have originated from a common “mother” copy in evolutionarily recent times.
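The three criteria of the rule can be written down as a simple check. The following is a minimal sketch under stated assumptions, not a reference implementation: the PairwiseAlignment record is hypothetical, coverage is measured against the longer of the two sequences (the text only says “their total size”), and the 80 bp default for the minimal size follows Wicker et al. (2007).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseAlignment:
    """Summary of one DNA-level alignment between two TE sequences (hypothetical record)."""
    length_a: int            # total length of sequence A in bp
    length_b: int            # total length of sequence B in bp
    aligned_length: int      # number of aligned positions in bp
    percent_identity: float  # identity within the aligned region, 0-100


def same_family_80_80_80(aln, min_identity=80.0, min_coverage=80.0, min_length=80):
    """Return True if the alignment satisfies all three criteria of the rule."""
    # Criterion 3: sequences or alignments below the minimal size are not evaluated.
    if min(aln.length_a, aln.length_b, aln.aligned_length) < min_length:
        return False
    # Criterion 1: at least 80 % sequence identity at the DNA level.
    if aln.percent_identity < min_identity:
        return False
    # Criterion 2: the alignment spans at least 80 % of the total size.
    # Assumption: coverage is computed relative to the longer sequence.
    coverage = 100.0 * aln.aligned_length / max(aln.length_a, aln.length_b)
    return coverage >= min_coverage


if __name__ == "__main__":
    full_length_hit = PairwiseAlignment(5000, 5100, 4500, 85.0)
    coding_region_only = PairwiseAlignment(5000, 5100, 2000, 95.0)
    print(same_family_80_80_80(full_length_hit))     # True
    print(same_family_80_80_80(coding_region_only))  # False
```

The second example mirrors the LTR retrotransposon case described above: the coding region aligns well, but the diverged LTRs keep the coverage far below 80 %, so the two elements are not assigned to the same family.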

1.2.7 Biological Meaning vs. Pragmatism in TE Classification

It is clear that a classification rule based simply on the fact that DNA sequences can be aligned is arbitrary, and it was justifiably criticised (Kapitonov and Jurka 2008; Seberg and Petersen 2009). Indeed, TE families (we shall stick to the term “family” for this discussion) sometimes form a continuum, where a sequence from one end of the spectrum might not be properly alignable with one from the other end. But within the continuum, it is possible to move from one end to the other by continuously aligning the most similar sequences. Thus, the simple criterion of whether the DNA sequences of two TEs can be aligned over most of their length can lead to unclear situations. Nevertheless, in most cases, the criterion works quite well. Indeed, usually it is not possible to cross the boundary from one TE family to the other simply by continuously aligning the most similar sequences. For example, the Copia families BARE1 and Maximus from barley show practically no DNA sequence identity, not even in the most conserved parts of the CDS (Wicker and Keller 2007). It is, therefore, not possible to cross the boundary from one family to the other based on alignments of the DNA sequences. If nothing else, the strategy of defining TE families based on sequence homology is at least pragmatic and allows classification without complex phylogenetic analyses. Nevertheless, it does not replace phylogenetic analyses when it comes to the study of evolution.
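The continuum described above corresponds to what is commonly called single-linkage grouping: two sequences end up in the same group if a chain of pairwise alignable sequences connects them, even if the two ends of the chain cannot be aligned with each other. The following is a minimal sketch, assuming that the pairwise “alignable” calls are already available as a list of name pairs; the sequence names are hypothetical.

```python
def single_linkage_groups(names, alignable_pairs):
    """Group sequences that are connected by a chain of pairwise alignable sequences."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in alignable_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    names = ["te1", "te2", "te3", "te4", "te5"]
    # te1 and te3 cannot be aligned directly, but both align with te2.
    pairs = [("te1", "te2"), ("te2", "te3"), ("te4", "te5")]
    print(single_linkage_groups(names, pairs))
    # [['te1', 'te2', 'te3'], ['te4', 'te5']]
```

In this toy example, te1 and te3 fall into one group through te2, whereas no chain of alignments links that group to te4 and te5, which is the situation described for BARE1 and Maximus.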

1.2.8 How Many Different TE Families Are There?

Recently, the classification system of Wicker et al. (2007) was put to the test in the framework of the International Brachypodium Initiative (2010). The stated goal was to obtain a TE annotation that is comparable in quality to gene annotation. Thus, Brachypodium became the first plant genome where a special group, the Brachypodium repeat annotation consortium (BRAC), was responsible solely for TE annotation. Great care was taken to isolate and characterise as many TE families as possible. As shown in Table 1.2, a total of 499 TE families were characterised. The largest variety was found in LTR retrotransposons, which contribute over two-thirds of all families. They are also the group of elements that contributes most to the total genome sequence due to their large size. Most abundant in numbers of copies were Miniature Inverted-Repeat Transposable Elements (MITEs; Bureau and Wessler 1994), which are small non-autonomous DNA transposons. Over 20,000 Stowaway MITEs of 23 different families were identified. Despite the large effort invested in TE annotation in the Brachypodium genome, TE annotation is still not complete. When sequences were annotated carefully in comparative analyses, dozens of additional TE families could be identified (Jan Buchmann, pers. comm.). Many of them are low-copy elements which have weak or no homology to previously described TE families. Thus, the 499 TE families identified in the framework of the genome project are certainly a minimal number. The Brachypodium genome is relatively small compared to other plant genomes. However, there is evidence that the size of larger genomes is mainly due to the excessive expansion of relatively few TE families, rather than the diversification of countless small families. Especially in plants, a single or a few LTR retrotransposon families can contribute large parts of the genome (Paterson et al. 2009; Schnable et al. 2009; Wicker et al. 2009). In fungi, the situation is similar: in the very repetitive genome of barley powdery mildew, a few dozen TEs completely dominate the repetitive fraction (Spanu et al. 2010). In summary, in most genomes one has to expect hundreds of different TE families, in some probably thousands. However, fears that there might be more TE families in a single genome than words in the English language (SanMiguel et al. 2002), and that naming all individual families would thus be impossible, seem to be unfounded.

Table 1.2 Numbers of TE families in the genome of the model grass Brachypodium distachyon

1.2.9 The Necessity of TE Databases

For the researcher confronted with the epic task of annotating TEs in a genome, it is essential to have a good reference database of TE sequences. In the best case, this is a dataset of well-characterised TE sequences. In the worst case, it is a collection of sequences that are simply known to be repetitive and which were assembled automatically into contigs. Often the reality lies somewhere between the two. The most abundant TEs are usually well characterised with respect to their precise termini and the proteins they encode. But for many sequences, one only knows that they are repetitive, while their exact size or classification is not known. Repeat classification and characterisation is still done very much on a species-by-species basis. This is mainly because TEs from different species (if they diverged more than a dozen million years ago) share very little sequence identity at the DNA level. Thus, usually only protein-coding TEs can be identified across species boundaries. If one also wants to precisely annotate non-coding regions and non-autonomous TEs, one usually needs to generate a TE database for the respective species. There are too many TE databases available for different species to describe here. The most inclusive product available today is probably RepBase (girinst.org/repbase/), which includes TE sequences from many different species. However, the task of compiling an all-inclusive TE database which adheres to consistent rules is a monumental one, and it is growing literally by the day.