Introduction

An understanding of species divergence is essential to natural history reconstructions (for example, Avise and Nelson 1989; Baker et al. 1993; Charruau et al. 2011; Driscoll et al. 2007). For example, the wild silkworm, Bombyx mandarina, is now widely distributed across Far East Asia, except for the Ryukyu Islands in the southern part of the Japanese archipelago. The domesticated model organism B. mori and B. mandarina can be crossed, and the hybrid offspring are fertile. B. mori is thought to have originated from B. mandarina inhabiting China, because the Chinese B. mandarina has 28 chromosomes, the same as B. mori, whereas B. mandarina in Korea and Japan has 27 chromosomes. The divergence between B. mandarina from Japan and a B. mori strain has been dated to around seven million years ago (Yukuhiro et al. 2002). Dating further details of these divergence events attracts much interest.

Different species, and sometimes even different kingdoms, may share various types of mariner-like transposable elements (MLEs; DNA-type transposons) in their genomes (Robertson 1993; Robertson and MacLeod 1993; Hartl 2001; Bui et al. 2008). Moreover, very similar MLEs are often found in distantly related species (Maruyama and Hartl 1991; Robertson and Lampe 1995; Hartl et al. 1997; Casse et al. 2006). Each MLE is thought to have been transferred horizontally across species at various times by moving into the germline of a different organism. Full-length MLE sequences can be amplified and isolated from several different species using primers designed from the terminal inverted repeats (TIRs) of MLEs from other species. Therefore, it may be possible to date events between or among species indirectly by investigating MLE sequence divergence (Nakajima et al. 2002; Kawanishi et al. 2007). However, such inferences are possible only as long as the molecular clock hypothesis holds (Zuckerkandl and Pauling 1965).

In previous studies by one of the authors, full-length MLEs isolated from several species using the TIR of the MLE from Hyalophora cecropia (Lidholm et al. 1991) were named CIMs (Cecropia-ITR-MLEs). CIMs containing a complete open reading frame for the transposase were shown to be shared among distantly related species from the Ryukyu Islands, including lepidopteran insects such as an emperor moth (Attacus atlas), a grasshopper (Traulia ornata), and a coral (Fungia scruposa) (Nakajima et al. 2002). These CIMs appear to have been transmitted horizontally among the different species in the Ryukyu Islands. CIM is categorized as the Bmmar2 type of MLE defined by Kumaresan and Mathavan (2004), and 19 copies are dispersed in the B. mori genome, four of them on chromosome 6. One of these copies has a unique tripartite structure composed of three different transposable elements, namely CIM (1,309 bp) and the retrotransposable elements L1Bm (318 bp) and BMC1 (2,642 bp); we named it the “BmTNM locus” (BmTNML) (Nakajima et al. 1999). BmTNML sequences, including the flanking sequence (1,072 bp; hereafter, host-sequences), were obtained from populations of B. mori and B. mandarina (Kawanishi et al. 2008). Although the tripartite structure of the BmTNML differs little among B. mandarina from various areas, the sequences are highly polymorphic, and some of them carry an insertion of a new maT-type transposon, BmamaT1 (1,307 bp) (Kawanishi et al. 2008). The structures of the sequences are summarized in Fig. 1. The advantage of focusing on the BmTNML is that the locus is uniquely identified in the Bombyx genome and that, because of tight linkage, the host-sequences and CIM share the same genealogy, while the patterns of mutations superimposed on that genealogy are independent between them. Therefore, the equality of the substitution rates estimated from the host-sequences and from CIM can be evaluated at various time points on the genealogy.

Fig. 1

Genomic structures of the BmTNML locus, consisting of host-sequences, CIM, L1Bm, BmamaT1, and BMC1. The sequences upstream and downstream of the insertion sites are shown in the flanking regions of the TEs. Polymorphisms in these sequences are shown in parentheses as the two alleles. The excision of BmamaT1 is shown as a dotted line with the footprint “ACTA” and the flanking sequences “taatgaaata” and “tatattttac”

In this study, we estimated the times to the most recent common ancestor (MRCA) and the times of the TE insertion/deletion (hereafter, indel) events on the BmTNML locus. The estimates obtained from CIM sequences were strongly correlated with those obtained from the flanking sequences, implying that this type of MLE can be used as a molecular clock for dating evolutionary events among species. Using the CIM sequences from A. atlas and F. scruposa, we estimated the time of the horizontal transmission event between these species.

Methods

A basic local alignment search tool (BLAST) search (http://kaikoblast.dna.affrc.go.jp/) was conducted against the B. mori genome using the common TIR sequence flanking MLEs (Robertson 1993; Robertson and MacLeod 1993) as a query. At a threshold of more than 95 % sequence identity, we found 19 CIM copies inserted in the B. mori genome. For the BmTNML locus, 11 publicly available sequences from B. mandarina and two from B. mori were obtained from the National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/). The sequences were categorized into four types (Seq-1 to Seq-4) according to the indels of TEs. Geographic labels (and accession numbers) of the sequences are as follows: Seq-1 China (AB3630105); Seq-2 Korea (AB363006) and Japan (AB363010, AB363014, AB473770, AB473771); Seq-3 Japan (AB363021, AB363024, AB473769, AB473773); Seq-4 China (AB473763); B. mori Japan (AB363029); and B. mori strain DAIZO, derived from the B. mori genome sequence. The BmTNML locus is intergenic, and no significant deviation from selective neutrality was detected by Tajima’s test (Tajima 1989).

Indels of TEs can be treated as unique event polymorphisms (UEPs), because each of these events occurs exactly once in the evolutionary history. Therefore, the genealogy is uniquely defined by the UEPs. Figure 2a shows the genealogy. \(T_{1}\)–\(T_{6}\) represent the MRCAs of the sequences included in each lineage. A lineage is defined as the set of descendant leaves of an internal node of the genealogy. For example, the descendants of the \(T_{2}\) node consist of the sequences belonging to Seq-4, Seq-2, and Seq-3 (Fig. 1). The number of segregating sites among the sample sequences in a lineage was counted separately for each segment: host-sequence, CIM, L1Bm, BmamaT1, and BMC1. When a given leaf consisted of multiple sequences, the number of segregating sites among those sequences was counted. Alignment gaps were excluded from subsequent analyses. We see that (1) CIM was already inserted in the BmTNML locus at the root of the genealogy; (2) the tripartite structure of CIM, L1Bm, and BMC1 (Fig. 1) characterizes the B. mori sequences and the B. mandarina sequence from China; and (3) the BmamaT1 deletion was observed in a sub-lineage (Seq-3) of the sequences carrying the BmamaT1 insertion. This suggests that the wild silkworm came to Japan from China through Korea and that silkworm domestication occurred in China, as is historically known (Xiang et al. 2005).
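The counting step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `count_segregating_sites` and the toy alignment are hypothetical, and it assumes that sequences are already aligned to equal length and that any column containing a gap is skipped.

```python
def count_segregating_sites(aligned_seqs, gap_chars="-"):
    """Count alignment columns with more than one nucleotide state.

    Columns containing any gap character are excluded, mirroring the
    statement that alignment gaps were excluded from the analyses.
    """
    if not aligned_seqs:
        return 0
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    segregating = 0
    for column in zip(*aligned_seqs):
        if any(base in gap_chars for base in column):
            continue  # skip gapped columns entirely
        if len({base.upper() for base in column}) > 1:
            segregating += 1
    return segregating


# Hypothetical toy alignment (not data from the BmTNML locus).
toy_alignment = ["ACGT-A", "ACGTTA", "ATGTTC"]
toy_count = count_segregating_sites(toy_alignment)
```

In the toy alignment, the second and sixth columns vary and the fifth column is skipped because it contains a gap. In practice the function would be applied separately to each segment (host-sequence, CIM, L1Bm, BmamaT1, BMC1) of each lineage.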

Fig. 2

a The genealogy of the B. mandarina population inferred from the host-sequences at the BmTNML locus. Black triangles represent the insertion or the excision (ΔBmamaT1) (Kawanishi et al. 2008) of TEs. For Seq-1, the lineage consists of a single sequence, so no coalescent estimate of \(TMRCA\Updelta\) is obtained for this lineage. Times are scaled in MYA (see Table 1). b The correlation between the estimated times (in MYA) from host-sequences and from CIM. Circles indicate the \(TMRCA\Updelta\) and \(T\Updelta\) values taken from Table 1

Based on the coalescent model (Kingman 1982; Hudson 1983; Tajima 1983), we estimated the ages of the TE indels and the TMRCAs (times to the MRCA) of the lineages by a Bayesian method. Given the number of segregating sites in a sample of size n (\(S_{n}\)), coalescent times were estimated by the rejection-sampling method (Ripley 1987). From Bayes' rule, the posterior distribution of the coalescent times \(T = \left( {T_{n} ,T_{n - 1} , \cdots ,T_{2} } \right)\) is \(P\left( {T|S_{n} } \right) \propto f\left( {S_{n} |T} \right)\pi \left( T \right)\). The likelihood \(f\left( {S_{n} |T} \right)\) follows a Poisson distribution, and \(T_{i}\), the waiting time until coalescence while there are i sequences, follows an exponential distribution with parameter \(i\left( {i - 1} \right)/2\). We denote the joint distribution of the waiting times \(T_{i}\) by \(\pi \left( T \right)\). Following the infinitely-many-sites model, we assumed that each gene is a sequence of completely linked sites and that every mutation occurs at a site different from the sites of the previous mutations (Watterson 1975). We obtained the posterior estimates of the TMRCAs of the sample using the following algorithm (Tavaré et al. 1997):

Algorithm 1

To simulate from the joint density of \(T_{n} ,T_{n - 1} , \cdots ,T_{2}\) given \(S_{n} = k\). The algorithm is given by replacing θ in Algorithm 7.3 of Tavaré (2004) with Watterson’s estimator \(\theta_{W}\) (Watterson 1975). We repeated the simulation until 1,000,000 simulated samples were obtained.
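As a rough illustration of this kind of rejection sampler, the sketch below draws coalescent waiting times from the prior and accepts each draw with probability proportional to the Poisson likelihood of observing k segregating sites. It is a generic sketch under stated assumptions, not a transcription of Algorithm 7.3 of Tavaré (2004): the function names, the acceptance bound (the Poisson probability maximized over its mean), and the toy inputs n = 5 and k = 10 are all illustrative choices. Times are in coalescent units, with mutations falling on the tree at rate θ/2 per unit branch length.

```python
import math
import random


def watterson_theta(n, k):
    """Watterson's estimator: k / sum_{i=1}^{n-1} 1/i."""
    return k / sum(1.0 / i for i in range(1, n))


def poisson_pmf(k, mean):
    """Poisson probability of k events with the given mean."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def sample_posterior_tmrca(n, k, theta, n_accept, seed=0):
    """Rejection-sample the TMRCA given S_n = k (illustrative sketch)."""
    rng = random.Random(seed)
    # For fixed k, the Poisson pmf is largest when the mean equals k,
    # so this bound keeps the acceptance probability at or below 1.
    bound = poisson_pmf(k, float(k))
    accepted = []
    while len(accepted) < n_accept:
        # T_i ~ Exponential(i(i-1)/2) for i = n, ..., 2 (the prior).
        times = {i: rng.expovariate(i * (i - 1) / 2.0)
                 for i in range(n, 1, -1)}
        tree_length = sum(i * t for i, t in times.items())
        accept_prob = poisson_pmf(k, theta * tree_length / 2.0) / bound
        if rng.random() < accept_prob:
            accepted.append(sum(times.values()))  # TMRCA of this draw
    return accepted


# Hypothetical toy inputs: 5 sequences with 10 segregating sites.
theta_w = watterson_theta(5, 10)
draws = sample_posterior_tmrca(5, 10, theta_w, n_accept=1000, seed=1)
posterior_mean = sum(draws) / len(draws)
```

Converting the accepted times from coalescent units to years requires a separate scaling by the effective population size and generation time, which this sketch does not attempt.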

We investigated the time to a TE indel event (\(T\Updelta\)) observed in the sample sequences using the conditional genealogy given the UEP. Given that the TE indel is represented b times in the sample (1 ≤ b < n), the UEP property requires that the b sequences carrying the indel coalesce with one another before any of them shares a common ancestor with a sequence lacking the indel. We modified the rejection-sampling method of Algorithm 1 according to previous studies (Slatkin and Rannala 1997; Tavaré 2004), and defined \(TMRCA\Updelta\) as the TMRCA given that the sequences carry the UEP.

Algorithm 2

To simulate coalescent times from the conditional distributions of \(T\Updelta\) and \(TMRCA\Updelta\) given that m additional mutations have occurred in a linked region containing the UEP. The algorithm is given by replacing θ in Algorithm 8.2 of Tavaré (2004) with \(\theta_{W}\).

We repeated the simulation until the accepted samples reached 1,000,000.

We computed the number of segregating sites among the sequences corresponding to each of the MRCAs (\(T_{1}\)–\(T_{6}\) in Fig. 2a). The TMRCAs at \(T_{1}\) and \(T_{5}\) were estimated using Algorithm 1, because the corresponding lineages are either fixed for CIM or not conditioned on a TE indel. The other four TMRCAs were conditional on the insertions of L1Bm, BmamaT1, and BMC1, or on the deletion of BmamaT1; for these, Algorithm 2 was used to estimate \(TMRCA\Updelta\) and \(T\Updelta\).

Results and Discussion

The posterior distributions of the estimated times at all MRCAs and UEPs overlapped among host-sequences, CIM, and the other TEs (Table 1). The host-sequences suggest that the MRCA of the B. mandarina population existed 10.713 MYA (95 % credible interval (CI) 6.701–15.642). This estimate was close to that obtained from CIM (9.373 MYA; 95 % CI 5.916–13.579). Subsequent lineages were characterized by the L1Bm (\(T_{2}\)), BmamaT1 (\(T_{3}\)), and BMC1 (\(T_{6}\)) insertions and the BmamaT1 deletion (\(T_{4}\)) (Fig. 2a). One of the UEPs, the BMC1 insertion, defines the origin of the B. mori population, which shares a common ancestor with B. mandarina from China. This relationship is consistent with previous studies (MinHui et al. 2008; Li et al. 2010). We suggest that their common ancestor dates to 0.468 MYA. This estimate is much older than the time of domestication, about 5,000 years ago (Xiang et al. 2005), but much more recent than the 7.1 MYA divergence between B. mandarina from Japan and B. mori (Yukuhiro et al. 2002). The estimates of \(TMRCA\Updelta\) and \(T\Updelta\) from host-sequences were strongly correlated with those from CIMs (r = 0.9972) along the lineages of the genealogy (Fig. 2b). This strong correlation between the estimates from a selectively neutral sequence and those from an MLE sequence supports the view that CIM is a useful molecular clock.

Table 1 Coalescent times of host-sequences, Cecropia-ITR-MLE (CIM), L1Bm, BmamaT1, and BMC1 in the B. mandarina genealogy

The horizontal transmission of the CIM has been observed in an emperor moth (A. atlas: AB006464) and a coral (F. scruposa: AB055188) from the Ryukyu Islands (Nakajima et al. 2002). The two sequences differ at six sites. We estimated the time to the insertion event by assuming that the number of mutations follows a Poisson distribution with parameter \(2ut\), where the mutation rate u follows a Gamma distribution with shape parameter α and scale parameter β, and with mean equal to the B. mandarina estimate (\(2.759 \times 10^{-3}\) mutations/site/million years × 1,223 bp; see “Note” in Table 1). Integrating out the mutation rate, the marginal likelihood of the time (t) was obtained as \(L\left( {t|n} \right) = \frac{{\left( {2t} \right)^{n} {{\Upgamma}}\left( {n + \alpha } \right)}}{{n!{{\Upgamma}}\left( \alpha \right)\beta^{\alpha } \left( {2t + \frac{1}{\beta }} \right)^{n + \alpha } }}\), where n is the number of nucleotide differences between the two sequences. The maximum likelihood estimator of the time is \(n/\left( {2\alpha \beta } \right)\), with asymptotic variance \(\left( {n - 1} \right)\left( {\alpha + n - 1} \right)/2\alpha \left( {\alpha + 2} \right)\). The maximum likelihood estimate was 0.89 MYA, with a 95 % confidence interval of 0–3.23 MYA (when the variance of the Gamma distribution equals its mean) or 0–3.88 MYA (when the variance is ten times the mean). These two species belong to different phyla, Arthropoda and Cnidaria. The estimate is therefore much more recent than the time at which their MRCA existed.
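The point estimate can be checked numerically from the quantities given above. The sketch below is illustrative only: the function names are hypothetical, and it assumes the case in which the variance of the Gamma distribution equals its mean, which for a shape-scale parameterization (mean αβ, variance αβ²) implies β = 1 and α equal to the mean rate. It reproduces the maximum likelihood estimate but not the confidence intervals.

```python
import math


def log_marginal_likelihood(t, n, alpha, beta):
    """Log of L(t | n): Poisson(2ut) counts with u ~ Gamma(alpha, scale=beta)."""
    return (n * math.log(2.0 * t)
            + math.lgamma(n + alpha)
            - math.lgamma(n + 1)
            - math.lgamma(alpha)
            - alpha * math.log(beta)
            - (n + alpha) * math.log(2.0 * t + 1.0 / beta))


def mle_time(n, alpha, beta):
    """Closed-form maximizer of the marginal likelihood: n / (2 * alpha * beta)."""
    return n / (2.0 * alpha * beta)


# Mean mutation rate per sequence per million years, from the text.
mean_rate = 2.759e-3 * 1223

# Assumption: variance equals mean, so beta = 1 and alpha = mean_rate.
alpha, beta = mean_rate, 1.0
n_differences = 6

t_hat = mle_time(n_differences, alpha, beta)  # about 0.89 (million years)
```

Setting the derivative of the log-likelihood to zero, \(n/t = 2\left( {n + \alpha } \right)/\left( {2t + 1/\beta } \right)\), gives \(t = n/\left( {2\alpha \beta } \right)\), so the estimate depends on α and β only through the mean rate αβ.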

Here we have demonstrated that CIMs can be used as a molecular clock to date horizontal transmissions. However, our justification comes from the analysis of only a single locus in a single species. Further studies of other CIM loci, and of other MLEs, are needed to confirm the usefulness of MLEs as a molecular clock. Because MLEs are easily amplified and identified without prior information about the surrounding genomic sequences of the organisms, our method can be extended to the analysis of other transposable elements and applied to diverse organisms, and it should provide important insights into natural history reconstructions in biogeography.