Introduction

Bananas and plantains, the fourth most important tropical crop in the world, are herbaceous monocotyledonous plants of the genus Musa, family Musaceae, order Zingiberales (Tomlinson 1969). There are more than 1000 banana cultivars with high genomic diversity and variability (Heslop-Harrison and Schwarzacher 2007). The genus Musa is divided into sections on the basis of morphology and chromosome number: Musa (n = 11; formerly Eumusa and Rhodochlamys) and Callimusa (n = 9/10; including Australimusa and Ingentimusa) (Häkkinen 2013). Edible bananas belong to section Musa and are mostly sterile, parthenocarpic, triploid (2n = 3x = 33) hybrids (along with a few diploids and tetraploids) derived from Musa acuminata (A-genome) alone or in combination with the B-genome diploid Musa balbisiana (Heslop-Harrison and Schwarzacher 2007). Cultivars have multiple origins from cultivated and wild forms by hybridization (Hippolyte et al. 2012). Most cooking types are inter-specific hybrids (AAB/ABB), while sweet dessert bananas are triploid M. acuminata (AAA) (Pollefeys et al. 2004; Heslop-Harrison 2011).

Transposable elements (TEs) are a diverse group of DNA sequences classified into two major classes (Class I retrotransposons and Class II DNA transposons) based on their mode of transposition. Class I elements are further classified into superfamilies such as Copia, Gypsy, Retroviruses, Caulimoviruses, and LINE elements (Hansen and Heslop-Harrison 2004). Based on the presence or absence of functional gagpol protein coding domains, retrotransposons are classified as complete (autonomous) or incomplete (non-autonomous) elements. Many retroelement families have characteristic long terminal repeats (LTRs) flanking their coding domains. Large retrotransposon derivatives (LARDs) are non-autonomous elements considered to be deletion derivatives of autonomous LTR retrotransposons (Wicker et al. 2007; Defraia and Slotkin 2014; Nouroz 2015). Caulimoviruses (pararetroviruses) belong to the Caulimoviridae, replicate in plants via an RNA intermediate, and are thought to have evolved from LTR retroelements (Bousalem et al. 2008; Llorens et al. 2011). Among TEs, the major proportion in plants is represented by LTR retrotransposons (LTR REs), which reverse transcribe their RNA into a DNA copy that integrates at a new site in the host genome. LTR REs in plants display 4–6 bp target site duplications (TSDs) and LTRs from a few hundred bp to several kilobases long, with a primer binding site (PBS) adjacent to the 5′ LTR and a polypurine tract (PPT) adjacent to the 3′ LTR (Eickbush and Jamburuthugoda 2008; Wicker et al. 2007; Nouroz et al. 2015).

Genome expansion between species can result from increases in both TE copy number and the number of TE types from different superfamilies (Du et al. 2010; Zhang et al. 2014). Whole-genome sequencing has opened ways to identify and characterize TEs in sequenced genomes. About half of the Musa genome is made up of TEs, with LTR REs the most dominant elements (>27.76%), followed by long interspersed elements (LINEs; 5.5%). Class II DNA TEs are rare (1.3%) in Musa and are mostly represented by the hAT, Harbinger, and Mutator superfamilies (D’Hont et al. 2012).

LTR retroelements have been studied in many eukaryotic genomes. Because of their role in genome evolution, high copy number, and mobility, they have proved valuable for development of transposon-based markers (IRAP and REMAP; Schulman et al. 2012), which have been used to measure diversity in plants including wheat (Queen et al. 2004), Crocus (Alsayied et al. 2015), and Brassica (Nouroz et al. 2015).

In the present study, we aimed to identify complete and fragmentary LTR retrotransposons in Musa BAC sequences, and to study in detail their structural diversity, evolutionary relationships, and distribution in various Musa genotypes.

Materials and methods

Dot plot identification of LTR retrotransposons from Musa BACs

The present study identified LTR retrotransposons by dot plot comparison of BAC sequences against themselves. The Musa BACs available in the NCBI database before June 2014 were retrieved, and BACs with LTR REs were selected for further analysis. In a self dot plot, a diagonal line extends from one corner of the plot to the diagonally opposite corner, representing the identity of the sequence with itself. Where an LTR retroelement is present, the terminal LTRs appear as two diagonal lines flanking the main diagonal, indicating the 5′ and 3′ LTRs. The boundaries of the LTRs were defined in the BAC sequences, the number of nucleotides in the LTRs of each element was counted, and target site duplications (TSDs) were characterized by visual inspection.
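
As an illustration of the self-comparison idea described above, the following Python sketch finds direct repeats lying off the main diagonal of a self dot plot. It is a minimal exact k-mer approximation written for this text, not the dot plot software used in the study; the function name `self_dotplot_repeats` and the parameters `k` and `min_len` are assumptions chosen for the example.

```python
from collections import defaultdict


def self_dotplot_repeats(seq, k=12, min_len=100):
    """Return direct repeats found off the main diagonal of a self dot plot.

    Each result is (start_a, start_b, length) with start_a < start_b,
    using 0-based coordinates. Only exact matches are considered.
    """
    seq = seq.upper()

    # Index every k-mer position; shared k-mers are the 'dots' of the plot.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # Group dots by diagonal offset (j - i); offset 0 is the main diagonal
    # and is skipped because only pairs with j > i are enumerated.
    diagonals = defaultdict(list)
    for hits in positions.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                diagonals[hits[b] - hits[a]].append(hits[a])

    # Merge consecutive dots on the same diagonal into line segments.
    repeats = []
    for offset, starts in diagonals.items():
        starts.sort()
        run_start = prev = starts[0]
        for s in starts[1:] + [None]:  # None flushes the final run
            if s is not None and s == prev + 1:
                prev = s
                continue
            length = prev - run_start + k
            if length >= min_len:
                repeats.append((run_start, run_start + offset, length))
            if s is not None:
                run_start = prev = s
    return sorted(repeats)
```

On a real BAC, pairs of off-diagonal segments a few kilobases apart would only be candidate 5′ and 3′ LTRs; they would still need to be confirmed by inspecting the TSDs and internal domains. Diverged LTRs would also break into several shorter exact segments under this simplification.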

Computational analysis and data mining for LTR retrotransposons

The intact or full-length elements identified by dot plot analyses were further blasted against the Musa Nucleotide Collection (nr/nt) database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch) available in NCBI. The searches for LTR retrotransposons were performed in several steps to identify the intact, truncated, partial elements, solo LTRs, and remnants. The intact or full-length elements were defined as elements having both LTRs and internal gagpol genes. In the second step, the complete elements were used as query to find the full-length copies, truncated elements, partial or deleted elements, and remnants, which were defined with small modifications according to the recommendations of Ma et al. (2004) and Nouroz et al. (2015). For the identification of conserved gagpol gene encoding proteins, the nucleotide sequences were investigated in ‘Conserved Domain Database’ (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) implemented in NCBI and motifs in Hansen and Heslop-Harrison (2004). The PBS and PPT motifs were detected in the LTR_FINDER using parameter ‘Predict PBS using Zea mays and Oryza sativa tRNA database’.

Characterization and naming of LTR retrotransposons

The Repbase (Jurka et al. 2005) and Gypsy databases (Llorens et al. 2011) were used as reference databases to characterize the retrotransposons on the basis of homology. Elements that failed to be characterized by homology searches against TE databases were characterized by visual inspection on the basis of their structural hallmarks such as TSDs, LTRs, and organization of their gagpol encoding protein domains such as integrase (INT), reverse transcriptase (RT), RNaseH (RH), and envelope (ENV). The retrotransposons were classified as Copia if they displayed pol gene as 5′-INT-RT-RH-3′, Gypsy as 5′-RT-RH-INT-3′, Retroviruses as 5′-RT-RH-INT-ENV-3′ and LARDs if they exhibit TSDs, LTRs, and large non-coding internal regions without any known gagpol gene domain. Naming of elements followed recommendations of Capy (2005) such as MaGYP1, where ‘M’ indicates genus Musa, the second letter ‘a’ indicates species acuminata, three letters ‘GYP’ represent the superfamily (Gypsy), and the number ‘1’ indicates the number of the identified element.
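
The structural rules and the naming scheme in this section can be sketched in a few lines of Python. This is an illustrative reading of the rules as stated above, not the procedure actually run in the study; the function names `classify_element` and `element_name` are hypothetical, and elements with defective or rearranged domain orders (which the study resolved by homology searches first) are simply returned as "unclassified" here.

```python
def classify_element(domains, has_ltrs=True):
    """Assign a superfamily label from the 5'->3' order of protein domains.

    `domains` is a list such as ["GAG", "AP", "RT", "RH", "INT"].
    Only INT, RT, RH, and ENV are used to decide the order.
    """
    if has_ltrs and not domains:
        # LTRs present but no known gagpol domain: LARD-like element.
        return "LARD"
    order = [d for d in domains if d in {"INT", "RT", "RH", "ENV"}]
    if order == ["INT", "RT", "RH"]:
        return "Copia"
    if order == ["RT", "RH", "INT"]:
        return "Gypsy"
    if order == ["RT", "RH", "INT", "ENV"]:
        return "Retrovirus"
    return "unclassified"


def element_name(genus, species, superfamily_code, number):
    """Build a name such as MaGYP1 from genus, species, code, and number."""
    return f"{genus[0].upper()}{species[0].lower()}{superfamily_code}{number}"


print(classify_element(["GAG", "AP", "RT", "RH", "INT"]))  # Gypsy
print(element_name("Musa", "acuminata", "GYP", 1))         # MaGYP1
```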

Polymerase chain reactions (PCRs)

DNA samples from 48 Musa accession/genotypes (Table 1) were analysed for the presence of LTR REs. The degenerate primer pairs (Table 2) designated as reverse transcriptase amplification polymorphism (RTAP) markers were designed from RT regions with online program Primer3 (http://frodo.wi.mit.edu/primer3/). PCR was conducted in a 15 µl reaction mixture with 50–75 ng/µl genomic DNA + 10× buffer A (Kapa Biosystems, UK) + 1.0 mM MgCl2 + 2–2.5 mM dNTP (YORKBIO) + 10 pmol of each primer (SIGMA-ALDRICH) + 0.5–1 U of 5 U/µl Taq polymerase (Kapa Biosystems, UK). The thermal cycling conditions were 3 min denaturation at 94 °C; 35 cycles of 1 min denaturation at 94 °C, 1 min annealing at 52–64 °C (depending on primers), and 1 min extension at 72 °C; final 5 min extension at 72 °C. PCR products were separated by electrophoresis in 1% agarose gel with TAE buffer and gels were stained with 1–2 µl ethidium bromide for the detection of DNA bands under UV illumination.

Table 1 Musa accessions used with their names and genomic compositions
Table 2 PCR primer pairs to amplify the RT region of Gypsy, Copia, and Caulimovirus (CV) elements

Multiple sequence alignment and phylogenetic analysis

The RT sequences from 77 known LTR REs (Supplementary Table) of Copia, Gypsy, and Pararetroviruses superfamilies were collected from Gypsy database (Llorens et al. 2011). The 33 RT sequences (~180–220 aa) were taken from identified Musa LTR retrotransposons and were aligned with two known elements (Ty1-Copia, Ty3-Gypsy) in CLUSTALW multiple alignment implemented in BioEdit, which were visually inspected and edited manually, if needed. Small insertions and deletions were removed and frame shifts were introduced. All 5′–3′ oriented RT regions were included in alignment, even if they have stop codons or frame shift mutations. The phylogenetic analyses were performed by constructing unrooted neighbor-joining trees with 1000 bootstrap replicates in the MEGA5 program (Tamura et al. 2011). Evolutionary distances were computed using the p distance method. The methodology is summarized in Fig. 1.
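
For readers unfamiliar with the p distance, it is the proportion of aligned sites at which two sequences differ. The Python sketch below shows the calculation for one pair of aligned sequences. It is not MEGA5 code; the function name `p_distance` is hypothetical, and skipping any site with a gap in either sequence (pairwise deletion) is an assumption, since the gap treatment used in the study is not stated.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap are ignored (assumed
    pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Six comparable sites, one difference (V vs I): p = 1/6
print(round(p_distance("MKVLA-T", "MKILAGT"), 3))
```

A neighbor-joining tree is then built from the matrix of such pairwise distances.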

Fig. 1

Flow chart representing the methodology used in present work from identification to phylogenetic analysis of LTR retroelements in Musa. BACs bacterial artificial chromosomes, LTR REs LTR retrotransposons, CDD conserved domain database, LARDs large retrotransposons derivatives, PBS primer binding site, PPT polypurine tract, RT reverse transcriptase

Results

The LTR retrotransposon landscape in Musa

Fifty elements from 30 Musa BAC clones (column 3 of Table 3) were identified by dot plot analysis, plotting each BAC sequence against itself (Fig. 2). Of the 50 identified elements, 20 belonged to Gypsy, 19 to Copia, 1 to Caulimoviridae (Pararetroviruses), and 10 to LARD-like elements (Table 3). The search was extended by using these elements as queries in BLASTN searches against the Musa Nucleotide Collection (nr/nt) database of NCBI, and all full-length, truncated, and partial copies were counted. A total of 16,246 elements and partial fragments from Copia, Gypsy, Caulimoviridae, and LARDs were identified, of which 153 were intact (full-length) elements: 58 Gypsy, 48 Copia, 1 pararetrovirus, and 46 LARD-like elements. In addition, 61 truncated elements, 635 partial elements, 258 solo LTRs, and 15,140 remnants were counted in the same database.

Table 3 Superfamilies of LTR retrotransposons identified from Musa with their sizes, TSDs, LTRs, positions, and orientations in BAC clone sequences
Fig. 2

Dot plot of a Musa acuminata BAC sequence (AC226035) against itself to identify LTR retrotransposons. The central diagonal line running from one corner to other showed the homology of the sequence against itself. The boxes on the diagonal line showed the positions of LTR retrotransposons insertions with LTRs. Four Copia, three Gypsy, and two LARD-like elements are inserted with a total size of ~60 kb of the 105 kb BAC size covering 58.5% of total BAC sequence. The nested structure of LTR retrotransposon is also shown in largest square

General characteristics of Musa Gypsy retrotransposons

The Gypsy elements ranged in size from 3015 to 17,804 bp: the smallest, the non-autonomous MaGYP6, was 3015 bp long, while the largest, the autonomous MbGYP20, was 17,804 bp long and had a nested structure (Table 3). Around 90% of the elements were terminated by 5 bp TSDs, while the rest (10%) showed 4 bp TSDs. The LTRs of the Gypsy elements ranged in size from 264 to 1105 bp, with an average size of 450–550 bp (Table 3). The M. acuminata BAC sequence AC226035 showed the highest copy number of retrotransposons among the investigated BACs, with four Copia, three Gypsy, and two LARD-like elements covering a total of ~60 kb (58%) of the 105 kb-long BAC (Fig. 2). Another M. acuminata BAC sequence, AC226048, harbored six Gypsy elements (MaGYP8–MaGYP13; Table 3) covering a total of ~31 kb (24%) of the 134.5 kb BAC sequence. Partial copies or remnants of these elements further increased the size and percentage of TE-derived sequence.

Structural features of the Gypsy retrotransposons in Musa

The structural features of all Gypsy elements identified in the present study were analysed in detail. MaGYP1 (4982 bp) was flanked by 5 bp TSDs and 505 bp LTRs (Table 3). MaGYP2 was identified as a non-autonomous element (3.8 kb) flanked by 5′-543/527-3′ bp LTRs. MaGYP3 and MaGYP4 showed high structural homology with sizes of 4.5 and 4.6 kb, respectively, and incorporated Transcriptional regulator (TR), Heme-thiolate proteins (HP), Tymovirus proteins (TVP), and Hepadnavirus proteins (HVP) like additional proteins (Table 4). MaGYP5 (6.25 kb) was found to be flanked by LTRs of 586 bp and displayed internally deleted gagpol region with additional domains not common to REs (Table 4). MaGYP6, the smallest non-autonomous Gypsy, was only 3 kb including 655 bp LTRs (Table 3). MaGYP7 and MaGYP9 (5.3 kb) were flanked by LTRs of 411–519 bp with small insertions in 5′LTR. MaGYP8 (5.9 kb) and MaGYP10 (5.4 kb) displayed canonical gagpol polyproteins structure (Table 4).

Table 4 Musa LTR retrotransposons with PBS/PPT motifs and gagpol gene protein domains

MaGYP12 (5.76 kb) terminated by 671 bp LTRs (Table 3) encoded gagpol gene domains as 5′-AP-RT-RH-INT-3′ with an additional Zinc knuckle (ZK) domain (Fig. 3a). MaGYP13 (5.4 kb) was flanked by 1062 bp LTRs, displaying only RT domain with homology to the RT of Non-LTR retroelements. A 4.9 kb MbGYP15 was identified in M. balbisiana flanked by 884 bp LTRs and displayed gagpol genes (Table 4). The elements MbGYP16, MaGYP17 and MbGYP18 were 4.0, 6.1, and 7.4 kb large terminated by 4 bp TSDs, where MaGYP17 encoded an additional CSP protein domain (Fig. 3a). MbGYP19 (7.36 kb) was flanked by 1105 bp 5′LTR and 883 bp 3′LTR with AT rich insertion next to the downstream of 5′LTR.

Fig. 3

a Schematic representation of representative retrotransposons in Musa. The red arrowheads at the corners represent the TSDs, while blue arrows indicate TIRs. The gagpol regions are drawn with their protein domains. The scale shows the lengths of the elements (bp). Additional insertions or unknown sequences are represented by different colours. b The 17.8 kb MbGYP20 is drawn with another Gypsy element and a DNA transposon inserted in it. The 16.2 kb MaCOP3 is shown with a 5.2 kb Copia element inserted in it. AP aspartic protease, RT reverse transcriptase, INT integrase, GAG gag-nucleocapsid, ZK zinc knuckle, DUF domain of unknown function, CHR chromatin organization modifier, CMV cauliflower mosaic virus, PR hypothetical protein, UN unknown (colour figure online)

Structural features of Musa Copia elements

Nineteen intact Copia elements identified by dot plot analyses of Musa BACs were investigated in detail. MaCOP1 and MbCOP19 showed homologies in their molecular structures with 5.3 and 5.2 kb sizes, flanked by LTRs of 605 and 592 bp, respectively (Table 3) encoding the conserved protein domains of 5′-INT-RT-RH-3′ (Table 4). MaCOP2 (4.8 kb) displayed a PBS, canonical gagpol genes and PPT upstream to 3′LTR. MaCOP4 (4.0 kb) was identified from M. acuminata with only INT and an additional ZK domain in its structure (Table 4). MaCOP5 and MaCOP17 with structural homologies and lengths (8.1 kb) were flanked by LTRs of 5′-1285/1201-3′ and 5′-1000/1324-3′ bp, respectively. MaCOP6 and MaCOP9 though ~7.0 kb in sizes lacked the pol polyproteins except RT. MaCOP7 (5.0 kb) flanked by 5′-144/149-3′ bp LTRs have shown the shortest LTRs in the present study. MaCOP8 and MaCOP14 ranged 6.0 kb in size, flanked by 5′-499/500-3′ and 5′-492/548-3′ bp LTRs, respectively. MaCOP10 and MaCOP11 were 8.7 and 8.4 kb large elements, where MaCOP10 was found terminated by 1597 bp LTRs while MaCOP11 (Fig. 3a) was flanked by 5′-1494/1388-3′ LTRs. MaCOP12 and MaCOP13 were 7.1 and 5.9 kb large elements, flanked by 5′-1238/1132-3′ and 5′-573/548-3′ bp, respectively. MbCOP15 and MaCOP16 displayed the ORF encoding the gagpol products (GAG-INT-RT-RH). MbCOP18 (9.8 kb) investigated in M. balbisiana accession ‘AC226052.1’ was terminated by long (5′-1415/1396-3′ bp) LTRs (Fig. 3a; Table 3).

Structural features of Musa Caulimovirus (Pararetrovirus)

An 11.1 kb element was identified in the M. acuminata BAC sequence AC226046.1, flanked by the largest LTRs found in the present study. The element, named MaCV1 (M. acuminata caulimovirus), was characterized by 3.8 kb LTRs, an internal region containing the PBS, a pol gene encoding the AP, RT, and RH domains, and a PPT adjacent to the 3′LTR, with two additional protein domains (Fig. 3a). The PBS of MaCV1 differed from those of all the Copia and Gypsy elements described here in matching tRNAGly (an unusual tRNA type). A 15 bp PPT upstream of the 3′LTR had a different sequence structure compared to other elements (Table 4).

Structural features of LARD-like elements

Non-autonomous LARD-like elements were also characterized (Table 3) by having 4–5 bp TSDs, LTRs, PBS/PPT motifs, and internal non-coding regions. MaLAR1 in M. acuminata accession AY484588 was 4564 bp long, flanked by 4 bp TSDs and 447 bp LTRs (Fig. 3a). MbLAR2 (4428 bp), similar to MaLAR1, was identified from M. balbisiana, flanked by 445 bp LTRs. MbLAR3 shared structural homology with MbLAR5 and MbLAR6, with a size of 4.4 kb and LTRs of 382–383 bp. MaLAR4 was 4.3 kb including the LTRs (5′-607/611-3′ bp) and was flanked by 5 bp imperfect TSDs. MaLAR7 and MaLAR8 showed similar structural features, each 4.5 kb in size. MbLAR9, identified from the M. balbisiana BAC AC186754, was a 7.7 kb element with no detectable PBS and PPT motifs. It displayed an unknown insertion and a non-autonomous hAT element with two additional solo LTRs (Fig. 3a). MaLAR10 was the smallest LARD-like element (4 kb) studied here, displaying LTRs of 5′-974/984-3′ bp.

Nested LTR retrotransposons structures in Musa

Two Gypsy-like LTR REs showed complex nested structures, where one or more TEs or unidentifiable sequences were found inserted within the element. MaGYP14 (11.6 kb) was flanked by 624 bp LTRs, exhibiting pol gene domains as 5′-AP-RT-RH-INT-3′, with a ~1.7 kb additional transcriptional regulator protein and a GC-rich unknown insertion of ~2.3 kb (Table 3). The most complex structure was observed in MbGYP20 (17.8 kb), which showed a nested structure of three insertions and two solo LTRs (Fig. 3b). One insertion was a 9.6 kb Gypsy element, in which a further unknown insertion of 4.5 kb was found in the opposite orientation. Among the Copia elements, MaCOP3 (16.2 kb), found in the M. acuminata BAC AC226035, also displayed a complex nested structure of LTR REs (Fig. 3b). A 5.3 kb MaCOP1 element was inserted in MaCOP3 from position 3203 to 8492 bp. The outer element MaCOP3 (10.9 kb without the insertion) carried this insertion towards its 5′LTR and was flanked by 5′-338/299-3′ bp LTRs, while the inserted element MaCOP1 was terminated by 605 bp LTRs.

The gagpol polyprotein organization in Musa LTR retrotransposons

The organization of gagpol protein domains of Gypsy REs revealed 2 patterns (canonical and defective) and 14 sub-patterns of domain organization (Table 4). The canonical Gypsy domain structure (5′-GAG-RT-RH-INT-3′) was observed in six elements. A single element, MaGYP6, encoded a gag protein only; MaGYP2 showed gag and a transcriptional regulator (TR) domain; MaGYP13 encoded RT only; and MbGYP18 displayed 5′-AP-RT-3′. Five elements (MaGYP7, MaGYP9, MaGYP11, MbGYP15, and MbGYP16) lacked the INT domain, while the RT and RH domains were absent in MaGYP1. Three elements (MaGYP8, MaGYP10, and MaGYP17) showed a similar domain pattern (5′-GAG-AP-RT-RH-INT-CHR-3′) with one or another extra domain. MbGYP20 showed a complex organization of protein domains due to its nested retrotransposon structure: 5′-GAG-AP-(3′-CMV-RH-DUF-5′)-DUF-CMV-RT-RH-CHR-3′ (Fig. 3b; Table 4).

The Copia gagpol protein organization revealed seven sub-patterns of the two main patterns (canonical and defective). The canonical pattern of Copia protein domain organization is 5′-GAG-INT-RT-RH-3′, observed in almost 90% of the elements with one domain fewer or one extra. MaCOP6 and MaCOP11 encoded only an RT domain, while MaCOP17 showed a slightly different pattern, 5′-GAG-RT-RH-MT-3′, in which INT was replaced by an additional mannosyl transferase (MT) domain. The nested LTR RE MaCOP3 showed a complex pattern, 5′-GAG-AP-INT-RT-RH/GAG-INT-RT-RH-3′, in which two sets of protein domains were detected, encoded by two different Copia elements. The other 13 elements showed the canonical Copia protein organization (5′-INT-RT-RH-3′) (Table 4). All the LARD elements were checked for gagpol protein domains, but no identifiable gagpol gene was detected.

PBS and PPT pattern of Musa retrotransposons

The 15–18 bp PBS located downstream of the 5′LTR and the PPT located adjacent to the 3′LTR were detected (Table 4) by scanning the LTR REs against the Z. mays tRNA database. In total, 80% and 75% of elements showed a 14–18 bp PBS and a 15 bp PPT, respectively, and 10% showed a PPT only; for the remaining 10%, no PBS or PPT was detected with the Z. mays tRNA database, so these elements were then scanned against the O. sativa tRNA database, which yielded their PBS and PPT (Table 4). MaGYP2 lacked a PPT, while a PBS was not detected in MaGYP6 and MbGYP19. Seven different tRNA types were found in Gypsy elements, with tRNAMet the most frequent type, present in 30% of the elements, followed by tRNAAsn, found in 20% of elements (Table 4). In Copia elements, 95% showed a 14–18 bp PBS, the exception being MaCOP14 (Table 4). Eight different tRNA types were observed in the Copia elements investigated, with tRNAMet the most common type, detected in 40% of the elements, followed by tRNAVal, observed in 20% of elements. All the other six tRNA types contributed 5% of the tRNA types. A PPT adjacent to the 3′LTR was detected in 90% of all Copia elements, the exceptions being MaCOP6 and MaCOP14 (Table 4). Of the ten LARD elements, only two (20%), MaLAR7 and MaLAR8, displayed PBS and PPT motifs at their 5′ and 3′ LTRs, respectively (Table 4).
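
To make the PPT criterion concrete, the sketch below searches the region immediately upstream of the 3′LTR for the longest uninterrupted purine (A/G) run. It is a deliberately simplified illustration and not the algorithm implemented in LTR_FINDER; the function name `find_ppt`, the 50 bp search window, and the 10 bp minimum length are all assumptions made for this example.

```python
import re


def find_ppt(element, ltr3_start, window=50, min_len=10):
    """Return (start, end, sequence) of the longest purine run found in the
    `window` bases upstream of the 3' LTR, or None if no run reaches
    `min_len`. Coordinates are 0-based, end-exclusive, relative to
    `element`; `ltr3_start` is the first base of the 3' LTR.
    """
    region_start = max(0, ltr3_start - window)
    region = element[region_start:ltr3_start].upper()
    runs = [m for m in re.finditer(r"[AG]+", region)
            if len(m.group()) >= min_len]
    if not runs:
        return None
    # Prefer the longest run; break ties by closeness to the 3' LTR.
    best = max(runs, key=lambda m: (len(m.group()), m.end()))
    return (region_start + best.start(), region_start + best.end(),
            best.group())


# Toy element: pyrimidine filler, a 15 bp purine tract, 2 bp spacer, 3' LTR.
toy = "CT" * 20 + "AAGGAGGGAAGGAGA" + "TG" + "TGTTAGCCTATGTTAGCC"
print(find_ppt(toy, ltr3_start=57))
```

A PBS search works differently, by matching the region downstream of the 5′LTR against the 3′ ends of known tRNAs, which is why the result depends on the tRNA database supplied.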

PCR amplification of retrotransposons in Musa genomes

The presence and abundance of Gypsy elements in 48 diverse Musa genotypes were determined by reverse transcriptase amplification polymorphism (RTAP) using PCR. The primers were designed from conserved RT regions (Table 2). The 48 Musa genotypes (Table 1) comprised 6 M. acuminata (AA), 6 M. balbisiana (BB), 3 hybrids (AB), 8 triploid M. acuminata (AAA), 19 AAB, and 6 ABB allotriploids. The primer pair MaGYP8F/R (Table 2) was used to amplify a 684 bp RT region of the MaGYP8 family. The products were amplified from all M. acuminata (AA) (Calcutta 4, Sannachenkadali, Pisanglilin, Kadali, Matti, and Cherukadali), M. balbisiana (BB) (PKW1, PKW2, Javan, Klutuk, Tani, and Batu), AB genomes (Njalipovan, Adukkan, and Padalamukili), AAA (Manoranjitham, Grand Nain, Gross Michel, Greenred, Red, Monsmari, Robusta, and Dwarf Cavendish), AAB (Motta Povan, Karimkadali, Perumadali, Kunoor Ettan, Palyamcodan, Mysoreettan, Krisnavazhai, Poovan, Doothsagar, Charapadati, Kumbillakannan, Velipadati, Vellapalayamcodan, Ettapadati, Padati, Chinali, Nendran, Poomkalli, and Kamaramasengi), and ABB genotypes (Kosta Bontha, Peyan, Kanchikela, Boothibale, Monthan, and Karpooravali) (Fig. 4a). This suggests that the element is ancient and was present in a common ancestor predating the separation of A- and B-genome Musa. The RT-based amplification polymorphism of the MaGYP12 family revealed amplification from 47 of the 48 Musa accessions; the exception was M. acuminata (Calcutta 4), where the lack of amplification suggests the absence of the element or its recent loss from the genome (Fig. 4b). The 835 bp RT regions of the MaGYP17 family were amplified from all 48 Musa accessions (Fig. 4c) by primer pair MaGYP17F/R. The amplification of various Gypsy elements from Musa genotypes revealed their distribution in almost all accessions regardless of A- or B-genome specificity (Fig. 4a–c).

Fig. 4

PCR analysis for the detection of retrotransposon reverse transcriptase (RT) polymorphisms across 48 cultivars in Musa. Dark bands indicate the expected products. Amplification of a MaGYP8, b MaGYP12, c MaGYP17, d MaCOP5, e MaCOP8, and f MaCV1. PCR figures show reversed images of size-separated ethidium bromide-stained DNA on agarose gels after electrophoresis. Ladders (HP-I) show fragment sizes in base pairs; the diploid and triploid Musa genomes are indicated at the top of the lanes (AA, BB, AB, AAA, AAB, and ABB), and the numbers at the base correspond to the accessions listed in Table 1

The presence of various members of the Copia superfamily was investigated in 48 Musa accessions (Table 1) by PCR analysis. The primer pair MaCOP5F/R (Table 2) amplified a 744 bp RT region in all M. acuminata (AA), M. balbisiana (BB), AB, AAA, AAB, and ABB genotypes (Fig. 4d). A 964 bp MaCOP8 RT genomic sequence was amplified by primer pair MaCOP8F/R from all 48 diploid and triploid Musa genotypes, with weak or strong signals depending on the genotype (Fig. 4e). The abundance of the caulimovirus MaCV1 was examined in various Musa genotypes by PCR analysis. The primer pair MACVIF/R was designed to amplify a 425 bp RT sequence, and the product was amplified from all Musa genotypes except the M. balbisiana (BB) accession ‘PKW2’ (Fig. 4f).

Phylogenetic relationships of Musa and other plant LTR retrotransposons

The phylogenetic relationships of 33 RT sequences of Musa LTR REs and two known elements (Ty1-Copia, Ty3-Gypsy) were analysed in MEGA5. Two main lineages separated the Copia and the Gypsy/Caulimoviridae (Pararetrovirus) elements, with 19 and 16 elements, respectively (Fig. 5). The Copia lineage is further resolved into two groups with 2 (MaCOP6, MaCOP9) and 17 elements, respectively. Among the 17 Copia elements, Ty1-Copia from Saccharomyces cerevisiae formed an outgroup to the Musa Copia elements. Some Copia elements clustered on the same or sister branches due to high homology in their RT sequences (Fig. 5). In the Gypsy lineage, Ty3-Gypsy from S. cerevisiae formed an outgroup to the Musa Gypsy elements. MaCV1 from Caulimoviridae also fell outside the Musa Gypsy elements. The RT sequences of MaCV1 and Gypsy showed homology to each other, yet both are distinct from Copia. Most of the RT sequences from M. acuminata and M. balbisiana showed homology to each other and are resolved on the same or sister branches (Fig. 5).

Fig. 5

Phylogenetic analysis of Musa LTR retrotransposons. The 33 RT sequences from intact elements from Musa and two known sequences (Ty1-Copia, Ty3-Gypsy) from S. cerevisiae were used to construct the phylogenetic tree. The neighbor-joining tree was constructed with 1000 bootstrap replicates in the MEGA5 program. The p distance model was used to calculate the genetic distance. The two major lineages separate the Gypsy (represented by black rhombus)/pararetrovirus elements (green square) and Copia (blue circles). The detailed descriptions of the elements are given in Table 1. Ma, M. acuminata; Mb, M. balbisiana; COP, Copia; GYP, Gypsy; CV, Caulimoviridae; MaCV1, M. acuminata Caulimoviridae (colour figure online)

The evolutionary relationships between LTR REs of Musa and those of other organisms were examined by constructing a phylogenetic tree of 110 RT sequences (Fig. 6), of which 33 came from the Musa LTR REs of the present study, while 77 were from various organisms and were collected from the Gypsy database (Supplementary Table). The evolutionary history was reconstructed by the unrooted neighbor-joining method with 1000 bootstrap replicates. Strong bootstrap values supported the monophyletic origin of Gypsy and Copia retrotransposons, and the three main lineages (shown by different colours and shapes in Fig. 6) separate the Gypsy, Copia, and Caulimoviruses, indicating distinct homology and no recombination between the sequences of these superfamilies. The Gypsy lineage clustered 40 elements, Copia 46, and Caulimoviruses 24. In the Gypsy lineage, Ty3-Gypsy from S. cerevisiae and the Gypsy element from Drosophila melanogaster formed outgroups. Most Musa Gypsy elements clustered in Musa-specific groups, with a few exceptions (Fig. 6): MaGYP12 clustered with CRM of Z. mays, MaGYP14 with Gloin of Arabidopsis thaliana, and MbGYP19 with Cereba of Hordeum vulgare. The Caulimoviridae, with 24 elements, constituted a lineage close to the Gypsy lineage, although PCSV fell outside it and was placed near the Copia elements. MaCV1 formed a sister family with other Caulimoviridae members such as BSOLV, CSSV, KTSV, BSGFV, and BSVAV (Supplementary Table). In the Copia lineage, Ty1-Copia and Ty2 from S. cerevisiae formed outgroups. In most cases, the Musa-specific Copia elements clustered in their respective families, while others (MaCOP2, MaCOP15, and MaCOP17) shared families with elements from other species: MaCOP2 grouped with Araco of A. thaliana, MaCOP15 with Tork-4 of Solanum lycopersicum, and MaCOP17 with TSI-9 of Setaria italica (Fig. 6).

Fig. 6

Phylogenetic tree showing the relationship between RT nucleotide sequences of Musa and other plants. Of the 110 RT sequences, 33 are from Musa and the remaining 77 (Supplementary Table) are from known plant retrotransposons collected from the Gypsy database. The tree was inferred using the neighbor-joining method in MEGA5, where the p distance model was used to calculate the genetic distance. 1000 bootstrap replicates were used, and values <50% are not shown. The three main lineages separate the Gypsy (represented by black filled and open rhombus), Copia (blue filled and open circles), and Caulimoviruses (blue filled and open squares). The Musa-specific elements from Gypsy, Copia, and Caulimoviruses are represented by filled shapes, while open shapes represent elements from other organisms. The details of the Musa elements are given in Table 1 and other plant elements in the Supplementary Table. Ma, M. acuminata; Mb, M. balbisiana; COP, Copia; GYP, Gypsy; MaCV1, M. acuminata Caulimoviridae (colour figure online)

Discussion

Repetitive DNA sequences are the most abundant and rapidly evolving component of the genome (Bennetzen 2000; Biscotti et al. 2015). The genome of Musa is rich in LTR retroelements belonging to the Copia, Gypsy, and Caulimovirus superfamilies (D’Hont et al. 2012; Davey et al. 2013), and there is a need to identify all transposable elements and their derivatives, especially LTR REs, which are major drivers of gene and genome evolution. The BAC analysis here provides reference sequences, uninfluenced by prior knowledge of sequence motifs, presence of heterozygosity, and multiple copies throughout the genome. D’Hont et al. (2012) identified the composition of TEs in M. acuminata genotype DH-Pahang, finding Copia in high proportion followed by Gypsy and LINEs. DNA transposons were very rare and represented the Harbinger, Mutator, and hAT families; Menzel et al. (2015) found only 70 hAT elements, although the related MITEs were amplified to much higher copy numbers. The most active DNA transposons identified in other angiosperms, such as Mariner, Harbinger, and CACTA, were very rare in Musa, in contrast to recent studies confirming the abundance of Harbinger (Nouroz et al. 2016) and CACTA (Nouroz et al. 2017) elements in plants like Brassica. A study of the Musa balbisiana genotype ‘PKW’ showed 26.85% LTR REs, with Copia and Gypsy as the dominant superfamilies; the percentages of LINEs and DNA transposons were lower and similar in A- and B-genome Musa (Davey et al. 2013).

The present study involved de novo identification and description of Copia, Gypsy, Caulimoviruses, and LARD-like elements in Musa BACs, building from the previous analyses focussed on selected repeats (Hribova et al. 2010; D’Hont et al. 2012). The approach of comparative analysis of BAC sequences by dot plot was effective and highly informative to identify the LTR REs in the sequenced genome of Musa. This strategy helped in the identification of elements present in Musa BAC sequences, with a method independent of others based on homology and software such as LTR_FINDER, LTR_STRUC, and LTR_harvest. In the initial analysis, 50 intact elements from three main superfamilies (Copia, Gypsy and Caulimoviruses) were identified. Further BLAST analysis using these full-length elements retrieved a total of 153 intact elements from 6 Mbp of Musa BACs screened. The intact copies (listed in Table 3) covered 15–18% of the genome surveyed, further supporting investigations revealing the high repetitive proportions found in the Musa genome analysis using short reads from 454 sequencing (Hribova et al. 2010) and BAC-end sequencing (Cheung and Town 2007). About 61 truncated copies, 635 partial copies, 258 solo LTRs, and 16246 small fragments (remnants) were also identified here; precise alignment of truncated or partial copies is not possible due to deletions and the high numbers, but their contribution to the Musa genome was counted. Such deleted elements and insertions in LTR REs are common in plants like rice (Ma et al. 2004) and Arabidopsis (Devos et al. 2002). Most of the deletions or insertions in the intact elements were bounded by terminal duplications of few bp. Such terminal duplications were observed around the deletions within retroelements from Arabidopsis (Devos et al. 2002). The numbers of partial copies, truncated elements, and remnants (mentioned above) were very high in our study in comparison to the full-length copies. 
The low copy number of full-length LTR REs relative to partial or deleted copies has also been observed in other plants; for example, only 583 full-length LTR REs were identified in the Elaeis guineensis genome (Beulé et al. 2015). The full-length elements of the current study ranged in size from 4 to 17.8 kb, with flanking LTRs of 149 bp to 3.8 kb. These findings are in accordance with the study of LTR REs in Medicago truncatula, where the elements ranged in size from 4 to 18.7 kb with similar-sized LTRs (Wang and Liu 2008).
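
The dot-plot idea discussed above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it assumes exact k-mer matches only, and the function names (kmer_dotplot, longest_direct_repeat) and the default word size k = 12 are hypothetical choices made for illustration. A sequence is compared with itself, and a run of matches on one off-main diagonal is reported as a candidate pair of direct repeats, such as the two LTRs of an intact element.

```python
from collections import defaultdict


def kmer_dotplot(seq, k=12):
    """Return sorted (i, j) pairs, i < j, where the k-mers starting at i and j are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for hits in positions.values():
        for a in range(len(hits)):
            for b in range(a + 1, len(hits)):
                dots.append((hits[a], hits[b]))
    return sorted(dots)


def longest_direct_repeat(seq, k=12):
    """Find the longest run of dots on a single diagonal of the self dot plot.

    Returns (left_start, right_start, length), or None if no k-mer is repeated.
    """
    by_offset = defaultdict(list)
    for i, j in kmer_dotplot(seq, k):
        by_offset[j - i].append(i)  # dots with the same offset lie on one diagonal

    best = None
    for offset, starts in by_offset.items():
        starts.sort()
        run_start = prev = starts[0]
        # The trailing None is a sentinel that closes the final run.
        for s in starts[1:] + [None]:
            if s is not None and s == prev + 1:
                prev = s
                continue
            length = prev - run_start + k
            if best is None or length > best[2]:
                best = (run_start, run_start + offset, length)
            if s is not None:
                run_start = prev = s
    return best
```

Because only exact matches are scored, diverged LTRs would appear as several shorter runs on the same diagonal; a practical analysis would tolerate mismatches and would also compare different BACs with each other rather than only a sequence with itself.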

A caulimovirus element (MaCV1) residing in the Musa genome displayed the structural features common to caulimoviruses present in many plant genomes, including Musa and potato. Three families of such pararetroviruses were isolated from potato, and their distributions on chromosomes were studied by fluorescent in situ hybridization (Hansen et al. 2005). The reverse transcriptase alignment and phylogenetic analysis revealed that MaCV1 formed a sister lineage to Gypsy elements, suggesting homology in their sequences, but detailed analysis showed that the two followed different evolutionary pathways. In Brassica, the virus-like elements also grouped with the Gypsy lineage, indicating a common ancestral origin, but they too followed two different evolutionary pathways (Alix and Heslop-Harrison 2004). Most of the Musa-related sequences clustered within their respective families, revealing separate lines of evolutionary history, while a few elements shared sequence similarity with other elements and clustered with them. The domain organization of the elements also varied, consistent with earlier studies: Copia-like elements were 5′-AP-INT-RT-RH-3′, Gypsy-like elements were 5′-AP-RT-RH-INT-3′, and caulimoviruses showed a 5′-ORF-AP-RT-RH-3′ domain pattern (Hansen and Heslop-Harrison 2004; Wicker et al. 2007).
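
The domain orders given above can be turned into a simple lookup, sketched here in Python. The three orders are taken from the text; everything else is an assumption made for illustration, including the function name classify_by_domain_order, the use of plain string labels for annotated domains, and the decision to ignore labels outside the listed set (for example a GAG annotation).

```python
# Domain orders (5' to 3') as stated in the text.
DOMAIN_ORDERS = {
    "Copia": ("AP", "INT", "RT", "RH"),
    "Gypsy": ("AP", "RT", "RH", "INT"),
    "Caulimovirus": ("ORF", "AP", "RT", "RH"),
}

# Labels that take part in the comparison; any other label is ignored (assumption).
KNOWN_LABELS = {label for order in DOMAIN_ORDERS.values() for label in order}


def classify_by_domain_order(domains):
    """Assign a superfamily from the 5'-to-3' order of annotated domains.

    `domains` is a list of label strings for one element. Returns the
    superfamily name, or "unclassified" if the order matches none of them.
    """
    observed = tuple(d for d in domains if d in KNOWN_LABELS)
    for superfamily, order in DOMAIN_ORDERS.items():
        if observed == order:
            return superfamily
    return "unclassified"
```

For instance, classify_by_domain_order(["GAG", "AP", "RT", "RH", "INT"]) returns "Gypsy" under these assumptions. An element with a missing or rearranged domain falls into "unclassified", which is where deleted or partial copies would need a similarity-based assignment instead.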

The LARD-like elements were frequent in the Musa genome, and their abundance indicated that, like other LTR retroelements, they are a major component of these genomes. Despite lacking internal coding domains, several copies were detected in Musa BACs. We cannot fully answer the question of which LTR retroelement superfamily these LARDs belong to, nor identify the full-length elements that provide the coding domains and the machinery for transposition and integration at a new site. However, comparison of the elements with known TE sequences and structural homology suggested that both Gypsy and Copia are the progenitors of these elements. Previous studies revealed that LARDs constitute a major proportion of several genomes, as identified in M. truncatula and Brassica (Wang and Liu 2008; Nouroz 2015).

In the present work, PCR experiments based on the RT region demonstrated that Gypsy and Copia retrotransposon families were amplified from all diploid and triploid Musa genotypes tested. This confirmed the ancient and conserved nature of RT sequences, which evolved before the separation of the A and B genomes of Musa and the transduplication of various Musa genomes. Similar ratios of RT were reported in various members of the family Asteraceae (Docking et al. 2006). The present study shows that very few elements were species specific, occurring only in A-genome M. acuminata or only in B-genome M. balbisiana, while the majority were present in both. The PBS and PPT motifs were detected in most elements, while in a few, either the PBS or the PPT was missing or deleted. The tRNAMet was the most commonly used primer type in both superfamilies, as was also found in M. truncatula (Wang and Liu 2008). Some of the retrotransposons of the present study had acquired an extra protein domain without any role in their transposition; such domains are harbored by retrotransposons in other plants (Havecker et al. 2004).
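
A screen for the PBS and PPT motifs mentioned above might be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the procedure used in this study: the internal region (the sequence between the two LTRs) is assumed to be already extracted, the tRNA 3′ end is supplied by the caller rather than hard-coded, and the function names, window sizes, and length thresholds are hypothetical defaults.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_pbs(internal, trna_3prime_end, window=20, min_match=10):
    """Look for a primer binding site near the 5' end of the internal region.

    The PBS is complementary to the 3' end of a tRNA, so the search string is
    the first `min_match` bases of the reverse complement of the supplied
    tRNA tail. Returns the match position, or None if there is no exact match.
    """
    target = reverse_complement(trna_3prime_end)[:min_match]
    pos = internal.upper()[:window + len(target)].find(target)
    return pos if pos >= 0 else None


def find_ppt(internal, window=30, min_len=8):
    """Look for a polypurine tract near the 3' end of the internal region.

    Returns (position, tract) for the longest run of A/G of at least
    `min_len` bases within the last `window` bases, or None.
    """
    tail_start = max(0, len(internal) - window)
    tail = internal[tail_start:].upper()
    runs = [m for m in re.finditer(r"[AG]+", tail) if len(m.group()) >= min_len]
    if not runs:
        return None
    best = max(runs, key=lambda m: len(m.group()))
    return tail_start + best.start(), best.group()
```

An element for which either function returns None would correspond to the cases noted above in which the PBS or the PPT is missing or deleted, although an exact-match search like this one would also miss motifs that carry point mutations.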

Conclusion

The present study described several novel LTR retroelements in the Musa genome, including their structural features, protein domain organization, pattern of PBS/PPT motifs, evolutionary dynamics, and the proportion of the host genome they occupy through transduplication. The results indicated that individual LTR retroelement families have distinct behaviour and genomic organization and are actively proliferating in their host genomes. This work provides reference sequences of single elements, with the description and annotation of major portions of retrotransposons, and is a valuable advance in the quest to unravel the genomics of LTR REs in general and their evolutionary dynamics in Musa in particular.