
2.1 Introduction

Transposable elements (TEs) are mobile genetic elements that shape the eukaryotic genomes in which they are present. They are virtually ubiquitous and make up, for instance, 20% of a typical D. melanogaster genome (Bergman et al. 2006), 50% of a H. sapiens genome (Lander et al. 2001), and 85% of a Z. mays genome (Schnable et al. 2009). They are classified into two classes depending on their transposition mode: via RNA for class I retrotransposons and via DNA for class II transposons (Finnegan 1989). Each class is also subdivided into several orders, superfamilies, and families (Wicker et al. 2007). Due to their unique ability to transpose and because they frequently amplify, TEs are major determinants of genome size (Petrov 2001; Piegu et al. 2006) and cause genome rearrangements (Gray 2000; Fiston-Lavier et al. 2007). Once described as the “ultimate parasites” (Orgel and Crick 1980), TEs are commonly found to regulate the expression of neighboring genes (Feschotte 2008; Bourque 2009) or even to have been domesticated so as to provide a specific host function (Zhou et al. 2004; Bundock and Hooykaas 2005; Santangelo et al. 2007; Kapitonov and Jurka 2005).

As a consequence of the development of new rapid sequencing techniques, the number of available sequenced eukaryotic genomes is constantly increasing. However, the first step of the analysis, i.e., accurate annotation, remains a major challenge, particularly concerning TEs. Correct genome annotation of genes and TEs is an indispensable part of thorough genome-wide studies. Consequently, efficient computational methods have been proposed for TE annotation (Bergman and Quesneville 2007; Lerat 2010; Janicki et al. 2011). Given that the pace at which genomes are sequenced is unlikely to decrease in the coming years, the process of TE annotation needs to be made widely accessible.

This chapter lays down a clear road map detailing the order in which computational tools (or combinations of such tools) should be used to annotate TEs in a whole genome. We distinguish three steps: (1) identifying TEs by searching for reference sequences (e.g., full-length TE sequences) and building consensuses from similar sequences, (2) manual curation to define and classify TE families, and (3) annotation of every TE copy. We also provide some hints on manual curation, a step that is still necessary.

2.2 De Novo Detection of Transposable Elements

Various efficient computational methods are available to identify unknown TEs in genomic sequences. Each method is based on specific assumptions that must be understood in order to select and combine methods appropriately for any particular analytic goal.

2.2.1 Computing Highly-Repeated Words

TEs, owing to their capacity to transpose, are often present in a large number of copies within the same genome. Although TE sequences degenerate with time, the words (i.e., short subsequences of a few nucleotides) that compose them are consequently repeated throughout the genome. Software such as TALLYMER (Kurtz et al. 2008) and P-CLOUDS (Gu et al. 2008) has been designed to find repeats rapidly in genome sequences by counting highly frequent words of a given length k, called k-mers. These programs are very useful for quickly providing a view of the repeated fraction of a given set of genomic sequences, especially unassembled sequences. However, they do not provide much detail about the TEs present in these sequences: their output only identifies highly repeated regions, without indicating precise TE fragment boundaries or TE family assignments. These methods are quick and simple to use but allow only limited biological interpretation and no real TE annotation.
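As an illustration, the word-counting idea behind tools such as TALLYMER and P-CLOUDS can be sketched in a few lines of Python. This is a toy version only; the real programs use enhanced suffix trees and other index structures to scale to whole genomes:

```python
from collections import Counter

def count_kmers(seq, k):
    """Count occurrences of every k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeated_fraction(seq, k, min_count):
    """Fraction of k-mer positions covered by k-mers occurring
    at least min_count times: a rough view of the repeated
    fraction of the input sequence."""
    counts = count_kmers(seq, k)
    total = len(seq) - k + 1
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / total
```

As the text notes, such counts flag repeated regions but say nothing about TE boundaries or families.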

Other methods also start by counting frequent k-mers but then go on to try to define consensuses. ReAS (Li et al. 2005) applies this approach directly to shotgun reads. For each frequent k-mer, a multiple alignment of all short reads containing it is built and then extended iteratively. REPEATSCOUT (Price et al. 2005) has a similar approach but works on assembled sequences. These tools return a library of consensus sequences. Although their results are more biologically relevant than those of previous methods, the consensuses are usually too short and correspond to truncated versions of ancestral TEs (Flutre et al. 2011). Substantial manual inspection and editing is therefore needed to obtain a meaningful list of consensus sequences.

2.2.2 All-by-All Alignment and Clustering of Interspersed Repeats

Repeats can also be identified by self-alignment of genomic sequences, starting with an all-by-all alignment of the assembled sequences.

Several tools can be used for this. Some, such as BLAST (Altschul et al. 1997) and BLAST-like algorithms, use heuristics. For instance, BLASTER (Quesneville et al. 2003) performs this search by launching BLAST repeatedly over the genome sequences. Others are exact algorithms: PALS, for example, uses "q-gram filters" which, unlike a heuristic such as BLAST, rapidly and stringently eliminate a large part of the search space before the alignment search while nevertheless guaranteeing not to eliminate a region containing a match (Rasmussen et al. 2005). As the amount of input data is usually large, the computations are intensive. Consequently, stringent parameters are applied: good results are obtained with BLAST-like tools when matches shorter than 100 bp, with identity below 90%, or with an E-value above 1e-300 are dismissed (Flutre et al. 2011). As most TEs are shorter than 25 kb, segmental duplications can also be filtered out by removing longer matches. To speed up the computations, such alignment tools can be launched in parallel on a computer cluster.
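The stringent filters quoted above (minimum 100 bp, minimum 90% identity, E-value at most 1e-300, matches over 25 kb discarded as likely segmental duplications) could be applied to tabular alignment results with a simple predicate such as this sketch; the tuple layout is a hypothetical simplification of a BLAST-like output row:

```python
def keep_match(match, min_len=100, min_ident=90.0,
               max_evalue=1e-300, max_len=25_000):
    """Apply the stringent filters described in the text to one
    alignment, given as a (length, percent_identity, evalue) tuple."""
    length, ident, evalue = match
    return (min_len <= length <= max_len   # not too short, not a segmental duplication
            and ident >= min_ident         # well-conserved copies only
            and evalue <= max_evalue)      # highly significant matches only
```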

With these parameters, only closely related TE copies will be found. Note that the aim of this step is not to recover all TE copies of a family but to use those that are well conserved to build a robust consensus (see below). Stringent alignment parameters are crucial for successful reconstruction of a valid consensus. Interestingly, even with these stringent criteria, this approach is still more sensitive than other methods for identifying repeats. However, it is also the most computationally intensive, and it misses single-copy TE families because at least two copies are required for detection by self-alignment.

Once the matches corresponding to repeats have been obtained, they need to be clustered into groups of similar sequences. The aim is for each cluster to correspond to copies of a single TE family. However, TEs may include divergent interspersed repeats, often nested within each other, making the task difficult. Algorithms have been designed to cluster identified sequences appropriately, limiting the artifacts induced by nested and deleted TE copies and non-TE repeats such as segmental duplications. The various tools that are available are based on different assumptions about (1) the sequence diversity within a TE family, (2) the evolutionary dynamics of TE sequences, (3) nested patterns, and (4) repeat numbers.

GROUPER (Quesneville et al. 2003; Flutre et al. 2011) starts by connecting fragments belonging to the same copy by dynamic programming, and then applies a single-link clustering algorithm with (1) a 95% coverage constraint between copies of the same cluster and (2) cluster selection based on the number of copies not included in larger copies of other clusters. The rationale here is to detect copies that have the same length, as these most probably correspond to mobile entities. Indeed, copies can diverge rapidly by accumulating deletions, leading to copies of different sizes, whereas copies that are almost intact can transpose while conserving their original, presumably functional, size. RECON (Bao and Eddy 2002) also starts with a single-link clustering step. If a cluster includes nested repeats and is thus chimeric, it can be subdivided according to the distribution of its all-by-all genome alignment ends. Indeed, nested repeats exhibit a specific pattern in alignments of sequences obtained in an all-by-all genome comparison: the alignment ends of any one inner repeat are all in the same relative position.
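The single-link clustering step with a GROUPER-style 95% coverage constraint can be sketched with a union-find structure. This is a minimal illustration, not the actual GROUPER algorithm, and the input representation (copy lengths plus pairwise aligned lengths) is assumed for simplicity:

```python
def single_link_clusters(copies, links, coverage=0.95):
    """copies: {name: length}; links: (a, b, aligned_len) pairwise matches.
    Two copies are linked only if the aligned region covers at least
    `coverage` of BOTH sequences; connected components are then clusters."""
    parent = {c: c for c in copies}

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, alen in links:
        if alen >= coverage * copies[a] and alen >= coverage * copies[b]:
            parent[find(a)] = find(b)  # union: same cluster

    groups = {}
    for c in copies:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())
```

The coverage test on both sequences is what favors clusters of same-length copies, per the rationale above.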

PILER-DF (Edgar and Myers 2005) identifies lists of matches covering a maximal contiguous region, defines them as piles, and then builds clusters of globally alignable piles. The rationale here is identical to that used by GROUPER, where copies of identical length are sought; however, PILER-DF has no specific handling of indels.

The three clustering programs behave differently according to the sequence diversity of TE families. For instance, GROUPER better distinguishes groups of mobile elements differing in size inside a TE family. It also better recovers fragmented copies, thanks to its dynamic programming joining algorithm. However, it produces more redundant results and only correctly recovers TE families if there are at least three complete copies. RECON is better for TE families with fewer than three complete copies, being able to reconstruct the complete TE from fragments. PILER is fast and very specific. It is a useful option for large genomes when time is an issue, or if a non-exhaustive search is sufficient.

Once clusters are defined, a filter is usually applied to retain only those with at least three members, thereby eliminating the vast majority of segmental duplications. Finally, for each remaining cluster, a multiple alignment is built from which a consensus sequence is derived. Numerous algorithms are available for this, but only those complying with the following criteria should be used: (1) speed, because the number of clusters is usually very large, and (2) the ability to handle sequences of different lengths appropriately, as is the case for the clusters generated by RECON. MAP (Huang 1994) and MAFFT (Katoh et al. 2002) comply with these criteria and give good results (Flutre et al. 2011). Taking the 20 longest sequences is generally sufficient to build the consensus. The set of consensus sequences obtained represents a condensed view of all TE families present in the genome being studied.
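Deriving a consensus from a multiple alignment can be as simple as a majority vote per column. The sketch below is a simplification of what real pipelines do after MAP or MAFFT: gap-majority columns are dropped, and an N is written where no residue reaches the majority threshold:

```python
from collections import Counter

def consensus(msa, min_freq=0.5):
    """Majority-rule consensus from a gapped multiple alignment
    (list of equal-length strings). Columns whose most frequent
    character is a gap are dropped; ambiguous columns become N."""
    out = []
    for col in zip(*msa):  # iterate over alignment columns
        base, n = Counter(col).most_common(1)[0]
        if base == '-':
            continue               # gap-majority column: drop it
        out.append(base if n / len(col) >= min_freq else 'N')
    return ''.join(out)
```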

For easy identification of TE families, i.e., those for which there are full-length copies that are very similar to each other, all clustering methods will find roughly the same consensus. However, for other families, which may be numerous, different methods generate different clusters, because they rely on different assumptions. Therefore, manual curation is required to identify an appropriate set of representative sequences (see below).

This all-by-all genome comparison strategy has been implemented in a pipeline called TEdenovo (Fig. 2.1). The TEdenovo pipeline is part of the REPET package (Flutre et al. 2011) and was designed to be used on a computer cluster for fast calculations. It allows the use of different software at each step to exploit the best strategy according to the genome size and the TE identification goal.

Fig. 2.1
figure 00021

Workflow of the 4-step de novo TE detection pipeline (Flutre et al. 2011)

2.2.3 Feature-Based Methods

Alternatively, TEs can be detected using prior knowledge about TE features. For example, class I LTR retrotransposons characteristically have LTRs at both ends of the element, and this can be used for their detection. Numerous class II TEs encompass TIR structures that can be used as markers. Many TE families generate a double-strand break when they insert into the DNA sequence. The break is caused by the enzymatic machinery of the TE, which generally cuts the DNA with a shift between the two DNA strands. After the insertion, DNA repair processes generate a short repeat of a few nucleotides (up to 11) at each end; these repeats are called Target Site Duplications (TSDs) and are characteristic of particular TE families.
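A TSD search can be sketched as a scan for identical short direct repeats flanking a putative insertion. Real feature-based tools combine this with structural signals (LTRs, TIRs); the core test, however, is simple. Coordinates are assumed 0-based and half-open:

```python
def find_tsd(genome, start, end, max_len=11, min_len=2):
    """Look for a Target Site Duplication: identical short direct
    repeats (here 2-11 bp) immediately flanking a putative insertion
    at [start, end). Returns the longest flanking repeat found, or
    None. Mutated or deleted TSDs, common in old insertions, are not
    handled by this exact-match sketch."""
    for n in range(max_len, min_len - 1, -1):  # prefer the longest repeat
        left = genome[max(0, start - n):start]
        right = genome[end:end + n]
        if len(left) == n and left == right:
            return left
    return None
```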

There are many different types of TEs and several tools to detect them are available (Table 2.1). Most of these tools have been described in detail in various reviews (Bergman and Quesneville 2007; Lerat 2010; Janicki et al. 2011). Here, we will address the general principles behind their design.

Table 2.1 Availability of feature-based detection programs for TE de novo identification

As class I LTR retrotransposons are easily characterized on the basis of their LTRs and are abundant in genomes, there have been substantial efforts to design bioinformatics tools for their detection. Some of these tools also use the characteristics of some of the substructures of the LTR retrotransposons. The programs available are: LTR_STRUC (McCarthy and McDonald 2003), LTR_MINER (Pereira 2004), SmaRTFinder (Morgante et al. 2005b), LTR_FINDER (Xu and Wang 2007), LTR_par (Kalyanaraman and Aluru 2006), find_LTR (Rho et al. 2007), which is now called MGEscanLTR, LTRharvest (Ellinghaus et al. 2008), and LTRdigest (Steinbiss et al. 2009), which also identifies protein-coding regions within the LTR element. The algorithms of these tools are generally divided into two parts: they first build a data structure to speed up searches for repeats, and then use this structure to search for repeats in the genomic sequences. For example, LTRharvest builds a suffix array using the "suffixerator" tool from the GenomeTools package (Lee and Chen 2002). Some of these tools add a third step to refine the search by looking for additional substructures, such as Primer Binding Sites (PBS) and Poly-Purine Tracts (PPT), which are important signals for LTR retrotransposon transposition. These programs also allow searching for TSDs and for coding regions, including those encoding protein domains, specific to these TEs.

There are also tools aimed at detecting class I non-LTR retrotransposons, e.g., Long Interspersed Nuclear Elements (LINEs) and Short Interspersed Nuclear Elements (SINEs). TSDfinder (Szak et al. 2002) is based on the L1 TE insertion signature, which consists in part of two Target Site Duplications (TSDs) and a polyA tail. RTAnalyzer (Lucier et al. 2007) is a Web server that follows the same approach as TSDfinder. SINEDR (Tu et al. 2004) is designed to look for SINE elements, a group of non-LTR retrotransposons, in sequence databases. MGEScan-non-LTR (Rho and Tang 2009) identifies and classifies non-LTR TEs in genomic sequences using probabilistic models; it is based on the structure of the 12 clades of non-LTR TEs. It uses two separate Hidden Markov Model (HMM) profiles, one for the Reverse Transcriptase (RT) gene and one for the endonuclease (APE) gene, both of which are well conserved among non-LTR TEs.

Class II TEs, with the exception of Helitrons and Cryptons, are structurally characterized by TIRs. Some class II-specific bioinformatics tools, for example, FindMite (Tu 2001), Transpo (Santiago et al. 2002), and MAK (Yang and Hall 2003), search for defined TIR features in sequences. MUST (Chen et al. 2009) is designed to search for TEs containing two TIRs and two direct repeats (i.e., TSDs) to identify MITE candidates. Two tools were published more recently: MITE-Hunter (Han and Wessler 2010), a five-step pipeline whose first step involves a TIR-like structure search, and a TS-clustering approach (Hikosaka and Kawahara 2010) dedicated to finding T2-MITEs.

Despite there being no TIR structures in Helitrons, programs have also been designed for their detection: HelitronFinder (Du et al. 2008) is based on known consensus sequences, and HelSearch (Yang and Bennetzen 2009) looks for a Helend structure constituted by a six base-pair hairpin and a CTRR nucleotide motif.

2.2.4 Evidence for TE Mobility

The identification of a long indel by sequence alignments between two closely related species is suggestive of the presence of a TE. The rest of the genome can then be searched for this sequence to assess its repetitive nature. This approach has been used (Caspi and Pachter 2006) and appears to work well for recent TE insertions: indeed, it will only detect insertions that occurred after speciation. Using several alignments with species diverging at different times may lead to more TEs being identified (Caspi and Pachter 2006), as each alignment allows detection of TEs inserted at different times. However, one limitation is the difficulty of correctly aligning long genomic sequences from increasingly divergent species.
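At its core, this comparative approach reduces to locating long gap runs in an alignment: a long gap in one species marks a candidate lineage-specific insertion in the other. A minimal sketch, assuming one row of a pairwise alignment given as a gapped string:

```python
import re

def long_gaps(aligned_seq, min_len=500):
    """Return (start, length) for each gap run of at least min_len
    in one row of a pairwise alignment. Long gaps in this species
    flag candidate lineage-specific insertions (possibly TEs) in
    the other; candidates must then be checked for repetitiveness
    elsewhere in the genome."""
    return [(m.start(), len(m.group()))
            for m in re.finditer('-+', aligned_seq)
            if len(m.group()) >= min_len]
```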

This idea could be also used within a genomic sequence by considering segmental duplications. A long indel apparent in sequence alignments of genomic duplications may similarly be an indication of the presence of a TE (Le et al. 2000). Various controls are needed, however, to confirm the TE status of the sequence. For example, TE features such as terminal repeats (e.g., LTR, TIR) or similarity to other TE sequences could be used. This approach only detects TE insertions that occur after the duplication event and may thus be limited to rare events.

TSDs are hallmarks of a transposition event, but they can be difficult to find in old insertions because they are short, and they can be altered by mutations or deletions. In addition, the size of the TSD depends on the family and not all TEs generate a TSD upon insertion.

2.3 Classification and Curation of Transposable Element Sequences

When they amplify, TE copies may nest within each other in complex patterns (Bergman et al. 2006), thereby fragmenting the elements. With time, the sequences accumulate (1) point substitutions, (2) deletions that truncate copies, and (3) insertions that interrupt their sequences (Blumenstiel et al. 2002). These events generate complex remnants of TEs. Various de novo tools use these remnants to try to infer the ancestral sequence that actually transposed.

When starting with a self-alignment (i.e., all-by-all genome comparison) of genomic sequences, the optimal strategy is to use several tools and even to combine them. However, all de novo approaches can encounter difficulties when trying to distinguish true TEs from segmental duplications, multimember gene families, tandem repeats, and satellites. It is therefore strongly recommended to confirm that the predicted sequences can be classified as TEs. Computerized analysis thus still needs to be complemented by manual curation.

2.3.1 Classification

Sequences believed to correspond to TEs can be classified according to their similarity to known TEs, for example, those recorded in databases like Repbase Update (Jurka et al. 2005). A tool called TEclass (Abrusan et al. 2009) implements a support vector machine, using oligomer frequencies, to classify TE candidates.

However, for most previously unknown TE sequences obtained via de novo approaches from nonmodel organisms, classification requires the specific identification of several TE features [see (Wicker et al. 2007) for a complete description]. By searching for structural features, such as terminal repeats, features characteristic of various TE types can be identified: long terminal repeats specific to class I LTR retrotransposons, terminal inverted repeats specific to class II DNA transposons, and poly-A or SSR-like tails specific to class I non-LTR retrotransposons. In addition, comparing TE candidates with a reference data bank using BLASTN, BLASTX, and TBLASTX can provide hints for classification, as long as the reference data bank contains elements similar to the TE candidate. It is therefore also recommended to search TE sequences for matches to TE-specific protein profiles. For example, the presence of a transposase gene is strongly indicative of a class II DNA transposon. Such protein profiles can be obtained from the Pfam database, which includes protein families represented by multiple sequence alignments and hidden Markov models (HMMs) (Finn et al. 2010). These profiles can be used by programs such as HMMER to find matches within the candidate TE sequences.
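Feature-based classification is naturally expressed as a decision tree over the detected features. The sketch below is a deliberately simplified, hypothetical tree; the feature names and the ordering of tests are illustrative, not those of TEclassifier or REPCLASS:

```python
def classify_te(features):
    """Toy decision tree over boolean TE features (hypothetical names).
    Order matters: LTRs are checked first because LTR retrotransposons
    also carry a reverse transcriptase (RT) domain."""
    if features.get("ltr"):
        return "Class I / LTR retrotransposon"
    if features.get("tir"):
        if features.get("transposase"):
            return "Class II / TIR DNA transposon (autonomous)"
        return "Class II / TIR DNA transposon (possible MITE)"
    if features.get("polya_tail") or features.get("rt_domain"):
        return "Class I / non-LTR retrotransposon"
    return "Unclassified"
```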

Some tools classify TE sequences according to their features, usually via a decision tree. The TEclassifier in the REPET package (Flutre et al. 2011) and REPCLASS (Feschotte et al. 2009) search for all the features listed above. In addition, REPCLASS allows TE candidates to be filtered on the basis of the number of copies they have in the genome. TEclassifier, interestingly, allows redundancy to be removed from among potential TE sequences. It uses the classification to eliminate redundant copies (a sequence contained within a longer one) and preferentially retains well-classified TE candidate sequences over less well-classified ones. This tool is particularly useful for reconciling different TE reference libraries obtained independently, as it guarantees the retention of well-classified TE candidate sequences.

2.3.2 Identification of Families

Once the newly identified TE sequences have been classified, manual curation is required, as some consensus sequences may not have been classified previously and there may still be some redundant consensus sequences. Manual curation is crucial because the annotation of TE copies, as described in the next section, depends on the quality of the TE library. One way to curate a library of TE consensus sequences is to gather these sequences into clusters that may constitute TE families. A tool like BLASTCLUST in the NCBI-BLAST suite can quickly build such clusters via single-link clustering based on sequence alignment coverage and identity. Eighty percent identity and coverage, as proposed by Wicker et al. (2007), give good results. Typical clusters will contain well-classified consensuses (e.g., class I—LTR—Gypsy element) as well as unclassified consensuses (without structural features and with little sequence similarity either to known TEs or to any TE domain).

Then, computing a multiple sequence alignment (MSA) for each cluster gives a useful view of the relationships between the consensus sequences, such that it is possible to assess whether they belong to the same TE family. One of the programs detailed above, MAP or MAFFT, can be used. It can also be informative to build a MSA with the consensuses and the genomic sequences from which they were derived and/or the genomic copies that each consensus can detect. In such cases, we advise first building a single MSA for each consensus with the genomic sequences it detects, and then building a global MSA by aligning these multiple alignments together, for example, using the "profile" option of the MUSCLE program (Edgar 2004). Finally, after a visual check of the MSA against the evidence used to assign a classification to the consensus, it is then possible to tag all consensus sequences in the same cluster with the most frequent TE class, order, superfamily, and family, if one has been assigned (Fig. 2.2). The MSA can also be edited, by splitting it or deleting sequences, to obtain a MSA corresponding to a single TE family. Indeed, in some cases, consensuses are similar only along a small segment or display substantial sequence divergence; the MSA can then be split into as many MSAs as there are candidate TE families. In other cases, an insertion appears to be specific to one consensus sequence and may sometimes show evidence (e.g., BLAST hits) for a different TE order. This may indicate a chimeric consensus, which can either be removed from the library, if artifactual according to the sequences used to build the consensus (also visible in the MSA), or be used to build a new TE family (if several copies support it). In all these cases, finding a genomic copy that aligns along almost the full length of a consensus (e.g., 95% coverage) appears to be a reasonable criterion for retaining the consensus. Consensuses that fail this test generally appear to be artifacts, or at least can be considered to be of no value.

Fig. 2.2
figure 00022

Alignment (Jalview (Clamp et al. 2004) screenshot) of de novo TE consensus sequences with Athila, the best-matching known TEs in the Repbase Update. They are represented with some of the features shown: LTRs (red zones), ORFs (blue zone), and matches with HMM profiles (black). The differences between the consensuses obtained by different methods, here RECON (cons1) and GROUPER (cons2, cons3, cons4), are indicated. Manual curation would remove cons3 as it corresponds to a single LTR with short sequences not present in the Athila family and cons4 as it corresponds to a LTR probably formed from the Athila solo-LTRs of the genome. A good consensus for the family would be a combination of cons1 and cons2

Phylogenies of TE family copies and/or consensus sequences provide another view of the members of a TE family. This can serve as an aid to curation if the cluster has many members or if two or more subfamilies are present. In such cases, subfamilies can be hard to detect by examination of the MSA alone, but may become evident in a phylogeny if distinct subtrees emerge. Such phylogenies can be constructed from the MSA with currently available software, including the PhyML program (Guindon and Gascuel 2003). Note, however, that as most phylogeny programs do not consider gaps, branch lengths may be biased when sequences are of very different lengths. Divergence between the sequences can also be a criterion. Some authors (Wicker et al. 2007) have suggested an 80–80–80 rule: two sequences can be considered to belong to the same TE family if they can be aligned over more than 80 bp, over more than 80% of their length, with more than 80% identity. This rule is empirical but appears to be useful for classifying TE sequences into families that are consistent for the following annotation step, the annotation of their copies. These authors also suggest a nomenclature system for naming new TEs.
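The 80–80–80 rule can be encoded directly. One ambiguity is which length the 80% coverage refers to; the sketch below assumes the shorter of the two sequences, which is one reasonable reading:

```python
def same_family_80_80_80(aln_len, len1, len2, identity):
    """Wicker et al. (2007) 80-80-80 rule: two sequences belong to
    the same family if they align over more than 80 bp, covering more
    than 80% of their length (here taken as the shorter sequence, an
    assumption of this sketch), with more than 80% identity."""
    coverage = aln_len / min(len1, len2)
    return aln_len > 80 and coverage > 0.80 and identity > 80.0
```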

2.4 Annotation of Transposable Element Copies

This third phase annotates all TE copies in the genome, resolving the most complex degenerate or nested structures. This requires a library of reference sequences representing the TE families. In the best case, the library is both exhaustive and non-redundant, i.e., each ancestral TE, autonomous or not, is represented by a single consensus sequence. We usually use the manually curated library built as described in the previous section, as well as known TE sequences present in the public data banks. Note that some TE families, particularly those including structural variants with independent amplification histories, are best represented by several consensuses. In such cases, manual curation would retain several consensuses for a family, considered here as nonredundant.

2.4.1 Detecting TE Fragments

The first step mines the genomic sequences with the TE library via local pairwise alignments. Several tools were designed specifically for this purpose, such as REPEATMASKER (Smit et al. 1996–2004), CENSOR (Jurka et al. 1996; Kohany et al. 2006), and BLASTER (Quesneville et al. 2003). Some of these tools incorporate scoring matrices to be used with particular GC percentages, as is the case for isochores in the human genome. All these tools propose a small set of parameter combinations depending on the level of sensitivity required by the user.

Although similar, these tools are complementary. We have shown previously that combining these three programs is the best strategy (Quesneville et al. 2005). The MATCHER program (Quesneville et al. 2003) can then be used to assess the multiple results and keep only the best for each location.

Whatever parameters are used for the pairwise alignments, some of the matches will be false positives, i.e., a TE reference sequence will match a locus although no TE is present. For protein-coding genes, full-length cDNAs can be used for confirmation; unfortunately, there is no equivalent way of checking a TE annotation. An empirical statistical filter, such as the one implemented in the TEannot pipeline (REPET package) (Flutre et al. 2011), can be used to assess the false-positive risk. The genomic sequences are shuffled and screened with the TE library. The alignments obtained on a shuffled sequence can be considered false positives; the 95th-percentile score of these alignments is then used to filter out spurious alignments obtained with the true genome. Only the matches with the true genomic sequences having a higher score are kept. This procedure guarantees that no match score used for the annotation could be obtained from random sequences with a probability greater than 5%.
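This empirical filter can be sketched as follows: collect the alignment scores obtained on the shuffled genome, take their 95th percentile as a threshold, and keep only real-genome matches scoring above it. Scores are assumed to be plain numbers here; TEannot's actual implementation differs in detail:

```python
def score_threshold(null_scores, quantile=0.95):
    """95th-percentile score from alignments against a shuffled genome.
    Scores at or below this level arise by chance with probability
    greater than 5% and are treated as spurious."""
    s = sorted(null_scores)
    idx = min(int(quantile * len(s)), len(s) - 1)
    return s[idx]

def filter_matches(real_scores, null_scores, quantile=0.95):
    """Keep only real-genome matches scoring above the null threshold."""
    t = score_threshold(null_scores, quantile)
    return [m for m in real_scores if m > t]
```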

2.4.2 Filtering Satellites

Simple sequence repeats (SSRs) are short motifs repeated in tandem. Many TE sequences contain SSRs, but SSRs are also present in the genome independently. It is therefore necessary to filter out TE matches if they are restricted to SSRs that the TE consensus may contain. This can be done by annotating SSRs and then removing TE matches included in SSR annotations. Several efficient programs, for example, TRF (Benson 1999), MREPS (Kolpakov et al. 2003), and REPEATMASKER, are available for SSR annotation. In TEannot from the REPET package, these three programs are launched in parallel, and their results are subsequently combined and used to eliminate hits due only to SSRs within TE consensuses.
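Removing TE hits that lie entirely within SSR annotations is an interval-containment test. A minimal sketch, with intervals as (start, end) tuples, 0-based and half-open:

```python
def contained(inner, outer):
    """Is interval `inner` fully inside interval `outer`?"""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def drop_ssr_hits(te_matches, ssr_annotations):
    """Discard TE matches that fall entirely within an annotated SSR,
    i.e., hits due only to simple repeats inside a TE consensus;
    matches extending beyond the SSR are kept."""
    return [m for m in te_matches
            if not any(contained(m, s) for s in ssr_annotations)]
```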

Satellites are longer motifs, around 100 bp long, also repeated in tandem. Although they are not TEs, they are sometimes difficult to distinguish because they may contain parts of TEs. PILER-TA (Edgar and Myers 2005) detects pyramids in a self-alignment of the genomic sequences. These pyramids can be used to make a consensus of the satellite unit motif. These consensuses can then be aligned on the whole genome to find all their occurrences and to distinguish them from TEs.

2.4.3 Connecting TE Fragments to Recover TE Copies

Even when TE fragments have been mapped in the genome, the work is only half-finished. Indeed, TE copies can be disrupted into several fragments. A complete TE annotation requires retrieving all copies and thus linking fragments belonging to the same copy when it has transposed.

The first, historical method was manual curation using dot plots. However, this is laborious and curator dependent, and is impractical for large genomes. It requires the curator to have detailed knowledge of transposable elements. Moreover, it ignores the age of nested fragments, potentially leading to incongruities. Therefore, several computational approaches have been proposed; many of them are reviewed in the article by Pereira (2008).

Joining TE fragments to reconstruct a TE copy is known as a "chain problem," as it corresponds to finding the best chain of local pairwise alignments. The optimal solution is found via dynamic programming, as implemented in MATCHER. Subsequently, an additional procedure implemented in the TEannot pipeline (Fig. 2.3), called "long join," can be used to take into account additional considerations related to TE biology. Two TE fragments distant from each other but mostly separated by other TE fragments (e.g., at least 95%, as in heterochromatin) can be joined as long as the TE fragments between them are younger. The age can be approximated using the percent identity of the matches between the TE reference sequences and the fragments.
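The "long join" criterion can be sketched as follows. Fragments are represented here as hypothetical dicts with genome coordinates and percent identity to their reference, identity serving as a proxy for age (higher identity means a younger insertion); this is an illustration of the criterion described above, not TEannot's implementation:

```python
def can_long_join(frag_a, frag_b, between, min_te_fill=0.95):
    """Decide whether two fragments of the same TE family can be joined.
    frag_a precedes frag_b on the genome; `between` lists other TE
    fragments lying in the gap. The join is allowed if at least
    min_te_fill of the gap is covered by TE fragments that are all
    younger (higher identity) than both fragments to be joined."""
    gap = frag_b['start'] - frag_a['end']
    if gap <= 0:
        return True  # adjacent or overlapping: trivially joinable
    covered = sum(max(0, min(f['end'], frag_b['start'])
                         - max(f['start'], frag_a['end']))
                  for f in between)
    oldest = max(frag_a['identity'], frag_b['identity'])
    all_younger = all(f['identity'] > oldest for f in between)
    return covered / gap >= min_te_fill and all_younger
```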

Fig. 2.3
figure 00023

The four steps of the TEannot pipeline (Quesneville et al. 2005)

2.5 Discussion

The contribution of TEs to genome structure and evolution, and their impact on genome assembly, has generated increasing interest in the development of improved methods for their computational analysis. The most common strategy is to detect pairs of similar sequences at different locations in an all-by-all genome comparison, and then cluster these pairs to obtain families of repeats. These methods are not specific to TEs and, therefore, find repeats generated by many different processes, including tandem repeats, segmental duplications, and satellites. Moreover, TE copies can be highly degenerate, deleted, or nested, so repeat detection methods can make errors in the detection of individual TE copies and consequently in defining TE families. We believe that existing automatic approaches still need to be supplemented by expert manual curation. At this step, careful examination is required because some identified families that may appear to be artifactual can in fact be unusual TE families. Indeed, well-documented cases illustrate how TE families can appear confusing, as they may (1) include cellular genes or parts of genes [e.g., pack-MULEs (Jiang et al. 2004) or Helitrons (Morgante et al. 2005a)], (2) be restricted to rDNA genes [e.g., the R2 non-LTR retroelement superfamily (Eickbush et al. 1997)], or (3) form telomeres [in Drosophila (Clark et al. 2007)]. Close examination of noncanonical cases may also reveal new and interesting TE families or particular transposition events [e.g., macrotranspositions (Gray 2000)].

Knowledge-based TE detection methods (i.e., based on structure or on similarity to distant TEs) have distinct advantages over de novo repeat discovery methods. They capitalize on prior knowledge established from the large number of previously reported TE sequences. Thus, they are more likely to detect bona fide TEs, including even those present as only a single copy in the genome. However, these methods are not well suited to the discovery of new TEs (especially of new types). Moreover, they have intrinsic ascertainment biases. For example, miniature inverted-repeat transposable elements (MITEs) and short interspersed nuclear elements (SINEs) will be under-identified if only similarity-based methods are used, because these TEs are composed entirely of noncoding sequences.

For some species, only parts of the genome are available, for instance as assemblies of BAC sequences. Working on a genome subset can be problematic for all-by-all genome comparison approaches, as a TE may not appear repeated if its other copies have not yet been sequenced. The detection sensitivity of such approaches increases with both the sequenced fraction of the genome and its repeat density; consequently, depending on the sequence size and the repeat density, they will succeed to varying degrees. Interestingly, the detection sensitivity of knowledge-based approaches (i.e., based on structure or on similarity to distant TEs) is independent of the sequenced fraction, making them highly recommended in this situation.

Through our experience with many genome projects (Cock et al. 2010; Abad et al. 2008; Amselem et al. 2011; Cuomo et al. 2007; Duplessis et al. 2011; Martin et al. 2008, 2010; Nene et al. 2007; Quesneville et al. 2003, 2005; Rouxel et al. 2011; Spanu et al. 2010), we have assessed the relative benefits of using different programs for TE detection, clustering, and multiple alignment. Our investigations suggest that only combined approaches, using both de novo and knowledge-based TE detection methods, are likely to produce reasonably comprehensive and sensitive results. Figure 2.4 shows the general workflow to follow when annotating TEs. With this in mind, the REPET package (Flutre et al. 2011) has been developed. It is composed of two pipelines, TEdenovo and TEannot, which launch several different prediction programs in parallel and then combine their results to optimize the accuracy and exhaustiveness of TE detection. Even with this sophisticated pipeline, manual curation is still needed. Hence, in addition to automating all the steps required for TE annotation, REPET computes data useful for manual curation, including TE sequence multiple alignments, TE sequence phylogenies, and supporting TE evidence. Sequencing costs have dropped dramatically and sequences have thus become easier to obtain; sequence analysis, however, remains a major bottleneck. Efficient analysis pipelines are therefore required. They need to be quick and robust to keep pace with data production; they should also capture the knowledge of the few specialists able to perform genome analysis on a large scale, so that TE annotations are made available to the wider community of scientists.

Fig. 2.4
figure 00024

Workflow for annotating TEs in genomic sequences