Introduction

Gene duplication has been recognized as an important source of raw material for biological innovations (Ohno 1970). Following gene duplication, the functional and regulatory divergence of the different copies of an ancestral gene may contribute to the generation of evolutionary novelties (Hahn 2009; Lynch 2007; Nei and Rooney 2005; Ohno 1970; Zhang 2003). The relaxins and insulin-like genes provide an interesting example of the evolutionary versatility that can be generated by the differentiation of duplicated genes. In this gene family, the different paralogs have acquired a variety of physiological roles including bone formation, testicular descent, trophoblast development, and cell differentiation (Bathgate et al. 2003; de Pablo and de la Rosa 1995; Reinecke and Collet 1998; Sherwood 2004).

The insulin–relaxin gene superfamily comprises four different kinds of genes: insulin (INS), insulin-like growth factors (IGFs), relaxins (RLNs), and insulin-like peptides (INSLs) (Chan and Steiner 2000; Nagamatsu et al. 1991; Olinski et al. 2006a; Park et al. 2008a). The number and nature of genes in the insulin–relaxin superfamily vary among species. For example, humans possess a single INS ortholog, 2 IGFs, 3 RLNs, and 4 INSLs in their genomes, whereas mice possess 2 INS paralogs, 2 IGFs, 2 RLNs, and 3 INSLs (Park et al. 2008a; Wentworth et al. 1986). These genes have been placed into a single superfamily because of structural similarities and the presence of six conserved cysteine residues (Chan and Steiner 2000). The insulin–relaxin superfamily was further classified into two separate families, with INS and IGFs in the first, and RLNs and INSLs in the second. The RLN and INSL genes are located in separate clusters, and synteny analyses have shown that these clusters were established early in vertebrate evolution (Olinski et al. 2006a, b; Park et al. 2008a, b). For example, in humans, one cluster includes the RLN1, RLN2, INSL4, and INSL6 genes and is located on chromosome 9, a second cluster includes the RLN3 and INSL3 genes and is located on chromosome 19, and the third cluster includes the INSL5 gene and is located on chromosome 1 (Park et al. 2008a).

Vertebrate genomes have undergone two rounds of whole genome duplications (WGDs) prior to the divergence between cyclostomes and gnathostomes (Dehal and Boore 2005; Kuraku et al. 2009; Meyer and Schartl 1999; Ohno 1970), and these two WGDs have been postulated to be major determinants of the diversification of the insulin–relaxin gene superfamily (Olinski et al. 2006a, b; Park et al. 2008a). The accepted model of evolution of the RLN/INSL family posits that all extant members of this group derive from a single progenitor found in the common ancestor of vertebrates. Through the two successive rounds of WGDs, the single-copy proto-RLN/INSL gene would have given rise to three proto-RLN/INSL paralogs, located on three different chromosomes (Fig. 1a, Olinski et al. 2006b; Park et al. 2008a, b), plus a paralog that was secondarily lost. Subsequent rounds of tandem gene duplication and divergence of each of these three paralogs would eventually give rise to the different clusters found today in extant vertebrates (Hsu 2003; Olinski et al. 2006b; Park et al. 2008a, b). Under this model, paralogs found on the same cluster should share a most recent common ancestor to the exclusion of paralogs on separate clusters. From a functional standpoint, this would imply that the functional divergence of the different RLN and INSL paralogs occurred after the two rounds of WGD (divergence post-WGD model, Fig. 1a). An alternative view to explain the presence of both RLN and INSL genes on different chromosomes could invoke a tandem duplication predating the WGDs. In this scenario, the ancestors of RLN and INSL genes would have already been present in the ancestral locus (Fig. 1b), and functional divergence would have preceded WGDs (divergence pre-WGD model, Fig. 1b). Thus, the major difference between the two competing models is in the timing of the functional differentiation relative to the two rounds of WGDs in the last common ancestor of extant vertebrates.

Fig. 1

Alternative hypotheses regarding the evolutionary history of the RLN/INSL-like genes. a The divergence post-WGD model. Starting with a single-copy proto-RLN/INSL gene, two successive rounds of whole genome duplications gave rise to three proto-RLN/INSL genes (one was lost) on three different chromosomes. After that, duplications of each of the resulting paralogs gave rise to all of the other paralogs found on the same chromosome. According to this model, paralogs on a chromosome should form monophyletic groups. b The divergence pre-WGD model. In this case, the ancestors of the RLN and INSL genes would have diverged prior to the two rounds of whole genome duplications. Here the two successive rounds of whole genome duplications would have generated the copies of the RLN–INSL ancestral arrangement. Under this scenario, the RLN and INSL genes would form reciprocally monophyletic clades. Our results favor the divergence post-WGD model depicted in a

On a more recent time scale, tandem duplications have also played an important role in the evolution of the vertebrate RLN and INSL repertoire. Within mammals, for example, apes possess an additional RLN gene, RLN2, with no clear ortholog in any other vertebrate group, and a similar pattern can be observed for the INSL4 gene, which is only found in catarrhine primates, the group that includes apes and Old World monkeys (Bieche et al. 2003; Park et al. 2008a, b). Given that the data at hand suggest that the RLN2 and INSL4 genes derive from lineage-specific tandem duplications, we would expect them to nest within the corresponding primate clade in the corresponding phylogenies: the RLN2 clade should nest within ape sequences, and the INSL4 clade should nest within catarrhine sequences.

From a phylogenetic standpoint, the divergence post-WGD and divergence pre-WGD models make mutually exclusive topological predictions. For the divergence post-WGD model (Fig. 1a) we would expect paralogs on the same cluster to form monophyletic groups. By contrast, in the divergence pre-WGD model (Fig. 1b) we would expect RLN and INSL paralogs to cluster in separate clades regardless of their genomic location. Similarly, if the RLN2 and INSL4 genes were the result of lineage-specific tandem duplications, they would be expected to group with other paralogs of the corresponding lineage. Accordingly, the main goals of this study are (1) to compare the divergence post-WGD and divergence pre-WGD models, and (2) to assess the relative contribution of WGDs and tandem duplications to the observed diversity of RLN and INSL genes in vertebrates, two questions that were not directly addressed in the previous studies (Good-Avila et al. 2009; Olinski et al. 2006a, b; Park et al. 2008a, b).
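The two topological predictions can be made concrete with a small monophyly check. The sketch below is illustrative only: the two toy trees are hypothetical, rooted for simplicity, and are not the trees inferred in this study. The cluster membership follows the human chromosomal arrangement described above, and the function names are our own.

```python
# Toy rooted trees encoded as nested tuples; leaves are gene names.
# Both topologies are hypothetical illustrations of the two models.

def leaves(tree):
    """Return the set of leaf names found under a node."""
    if isinstance(tree, str):
        return {tree}
    found = set()
    for child in tree:
        found |= leaves(child)
    return found

def is_monophyletic(tree, group):
    """True if some node of the tree contains exactly the genes in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

def supports(tree, groups):
    """True if every group in `groups` is monophyletic in `tree`."""
    return all(is_monophyletic(tree, g) for g in groups.values())

# Human cluster membership as described in the text (Park et al. 2008a).
CLUSTERS = {
    "chr9": {"RLN1", "RLN2", "INSL4", "INSL6"},
    "chr19": {"RLN3", "INSL3"},
    "chr1": {"INSL5"},
}
# Grouping by gene type, as predicted by the pre-WGD model.
GENE_TYPES = {
    "RLN": {"RLN1", "RLN2", "RLN3"},
    "INSL": {"INSL3", "INSL4", "INSL5", "INSL6"},
}

# Hypothetical topology in which genes group by cluster (post-WGD prediction).
post_wgd_tree = ((((("RLN1", "RLN2"), "INSL4"), "INSL6"), ("RLN3", "INSL3")), "INSL5")
# Hypothetical topology in which genes group by type (pre-WGD prediction).
pre_wgd_tree = ((("RLN1", "RLN2"), "RLN3"), (("INSL3", "INSL4"), ("INSL5", "INSL6")))

for name, tree in [("post-WGD", post_wgd_tree), ("pre-WGD", pre_wgd_tree)]:
    print(name, "clusters:", supports(tree, CLUSTERS),
          "gene types:", supports(tree, GENE_TYPES))
```

Each toy tree satisfies one grouping and fails the other, which is what makes the two models distinguishable from topology alone. For an unrooted tree the same idea would be expressed in terms of bipartitions rather than rooted clades.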

Materials and Methods

Data

We selected placental mammals as a model system because they have the most diverse repertoire of RLN/INSL genes, and allow us to evaluate competing models explaining the origin of these genes, and address specific questions relative to the gain and loss of RLN/INSL genes in this group. Accordingly, the DNA sequences from structural genes in the RLN/INSL gene family of placental mammals were obtained from the Ensembl database (release 55). In each case, RLN/INSL-like genes were identified by comparing known exon sequences with genomic fragments using the program Blast2 version 2.2 (Tatusova and Madden 1999) available from NCBI (http://www.ncbi.nlm.nih.gov/blast/bl2seq). Sequences derived from shorter records based on genomic DNA or cDNA were also included in order to attain a broad and balanced taxonomic coverage of placental mammals (Supplementary Table 1).

To explore the sensitivity of our analyses to changes in the alignment method, translated nucleotide sequences were aligned using Dialign-TX (Subramanian et al. 2008), Kalign2 (Lassmann et al. 2009), the E-INS-i, G-INS-i, and L-INS-i strategies from Mafft v.6 (Katoh et al. 2009), MUSCLE v3.5 (Edgar 2004), Probcons (Do et al. 2005), and Tcoffee (Notredame et al. 2000). Nucleotide alignments were generated using the amino acid alignments as a template with the software PAL2NAL (Suyama et al. 2006). Finally, the biological accuracy of alignments was assessed using the software MUMSA (Lassmann and Sonnhammer 2005), which compares alignment blocks from different alignment strategies to assess the difficulty of an alignment case, and also ranks the quality of each alternative alignment. For each set of alignments of a given set of sequences, MUMSA provides an Average Overlap Score (AOS), which gives a measure of the alignment difficulty that ranges from 0 to 1, with 1 being the least difficult. In addition, it also assigns a Multiple Overlap Score (MOS) to each of the different alignments, which also ranges from 0 to 1, with 1 being the highest quality.
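The idea behind an overlap score can be illustrated with a short sketch. It assumes that the overlap between two alignments of the same sequences is the number of aligned residue pairs they share, divided by the average number of pairs in the two alignments, and that the AOS is the mean of this quantity over all pairs of alignments. This is a simplification written for illustration, not MUMSA's implementation, and it does not reproduce the weighting used for the MOS. The sequences and function names are hypothetical.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` maps sequence names to gapped strings of equal length.
    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    position = {name: 0 for name in names}
    pairs = set()
    for col in range(length):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        # Every pair of residues in the same column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs

def overlap_score(aln_a, aln_b):
    """Shared aligned pairs divided by the average number of pairs."""
    pairs_a, pairs_b = aligned_pairs(aln_a), aligned_pairs(aln_b)
    if not pairs_a and not pairs_b:
        return 1.0
    return 2 * len(pairs_a & pairs_b) / (len(pairs_a) + len(pairs_b))

def average_overlap(alignments):
    """Mean pairwise overlap across a list of alternative alignments."""
    scores = [overlap_score(a, b) for a, b in combinations(alignments, 2)]
    return sum(scores) / len(scores)

# Two hypothetical alignments of the same two toy sequences.
aln1 = {"s1": "AC-GT", "s2": "ACTGT"}
aln2 = {"s1": "ACG-T", "s2": "ACTGT"}
print(overlap_score(aln1, aln2))  # 0.75: three of four aligned pairs are shared
```

Values near 1 indicate that the alignment programs largely agree on residue homology, whereas lower values flag a more difficult alignment case.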

Phylogenetic Inference

Phylogenetic relationships among the different RLN/INSL-like DNA sequences in the dataset were estimated using Bayesian and maximum likelihood approaches, as implemented in MrBayes v3.1.2 (Ronquist and Huelsenbeck 2003) and Treefinder version October 2008 (Jobb et al. 2004), respectively. The best fitting models were estimated separately for each gene segment, and also for each codon position within each segment using the “propose model” routine from Treefinder version October 2008 (Supplementary material 2; Jobb et al. 2004). For the Bayesian analyses, two simultaneous independent runs were performed for 30 × 10^6 iterations of a Markov Chain Monte Carlo algorithm, with five simultaneous chains, sampling every 1000 generations. Support for the nodes and parameter estimates were derived from a majority rule consensus of the last 15,000 trees sampled after convergence. In maximum likelihood, we estimated the best tree for each alignment, and support for the nodes was estimated with 1,000 bootstrap pseudoreplicates.
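As a worked check of the sampling scheme, the short calculation below derives the number of trees sampled per run and the fraction discarded before building the consensus. The text does not state whether the 15,000 retained trees are counted per run or across both runs, so both readings are shown as explicit assumptions. The variable names are ours.

```python
iterations = 30 * 10**6   # MCMC iterations per run
sample_every = 1000       # sampling interval, in generations
n_runs = 2                # simultaneous independent runs
kept_trees = 15_000       # trees retained for the consensus

samples_per_run = iterations // sample_every   # 30000 trees per run
total_samples = samples_per_run * n_runs       # 60000 trees over both runs

# Assumption A: 15,000 trees were retained from each run.
discarded_per_run = 1 - kept_trees / samples_per_run   # 0.5

# Assumption B: 15,000 trees were retained across both runs combined.
discarded_pooled = 1 - kept_trees / total_samples      # 0.75

print(samples_per_run, total_samples, discarded_per_run, discarded_pooled)
```

Under either reading, at least half of the sampled trees are treated as burn-in.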

Results and Discussion

Alignment Accuracy

The AOS for the alignments was 0.55, and the MOS for each individual alignment ranged from 0.577 to 0.689, with higher scores denoting higher quality. Based on these scores, we selected the four multiple alignments with the best MUMSA scores (L-INS-i, E-INS-i, G-INS-i, and Probcons), and compared the likelihood scores of the resulting trees. We then selected the tree with the highest likelihood score, which was obtained with the G-INS-i MAFFT alignment strategy, as our best tree. Results obtained with the other three alignment strategies are reported as Supplementary material 3.

Phylogenetic Analysis

The topology recovered in our analysis is congruent with the divergence post-WGD model (Fig. 1a), as genes found on the same clusters share a most recent common ancestor to the exclusion of paralogs found on separate clusters (Fig. 2). We recovered the monophyly of each of the INSL3, INSL5, RLN3, and INSL6 paralogs, plus the monophyly of a clade containing the RLN1, RLN2, and INSL4 genes, with strong support in all phylogenies (Fig. 2). Among these groups, the relationship between the INSL6 and RLN1, RLN2, RLN3, and INSL4 clades is strongly supported, while the relationship between the RLN3 and INSL3 paralogs is moderately supported (Fig. 2).

Fig. 2

Unrooted maximum likelihood phylogeny (lnL = −18013.49) describing phylogenetic relationships among the RLN/INSL-like genes of placental mammals. Values on the nodes denote bootstrap support values (above) and Bayesian posterior probabilities (below). The chromosomal location of the human paralogs is indicated with shading

To further clarify relationships within the RLN1, RLN2, and INSL4 clade, we performed a second set of analyses restricted to these genes, and added marsupial and platypus sequences as outgroups. Because the RLN2 gene is restricted to apes, represented by human and chimp in our study, it was expected to derive from an ape-specific duplication (Wilkinson et al. 2005; Park et al. 2008a, b). Our phylogenies are consistent with this interpretation: the human and chimp RLN2 orthologs were sister to the human and chimp RLN1 genes (Figs. 2, 3), indicating that the ape RLN1 and RLN2 genes derive from the duplication of a proto-RLN1 ortholog (Fig. 4), in agreement with previous studies. Similarly, it is generally thought that the duplication that gave rise to INSL4 is an evolutionary innovation specific to the catarrhine lineage (Bieche et al. 2003). According to this scenario, INSL4 would be expected to be sister to the RLN1/RLN2 clade of human and chimp, to the exclusion of the marmoset RLN1 gene. However, our analyses are not compatible with this scenario: the RLN1 gene of the marmoset was placed sister to the RLN1/RLN2 genes of human and chimp with strong support (Fig. 3). Additionally, all RLN sequences from Euarchontoglires, the group that includes primates, rodents, and lagomorphs, were recovered as a monophyletic group to the exclusion of the INSL4 clade with strong support (Fig. 3). In all cases the INSL4 clade is embedded within the RLN1 clade with strong support (Figs. 2, 3), suggesting that this gene arose from an RLN, and not from an INSL ancestor (Bieche et al. 2003; Olinski et al. 2006b; Wilkinson et al. 2005). Thus, this phylogeny suggests that the INSL4 gene derives from the duplication of an RLN-like gene that predates the radiation of Euarchontoglires (Fig. 3), and that the gene was secondarily lost in all Euarchontoglires other than catarrhine primates. This latter point was also supported by the approximately unbiased topology test (Shimodaira 2002), which rejected the placement of INSL4 as sister to the catarrhine RLN1 and RLN2 clade (P < 0.0001), but not as a sister group of the Euarchontoglires clade (P = 0.42).

Fig. 3

Maximum likelihood phylogeny (lnL = −7543.97) describing phylogenetic relationships among RLN1, RLN2, and INSL4 genes of placental mammals. Sequences were aligned using the G-INS-i strategy from Mafft v.6 (Katoh et al. 2009). Two marsupial and one monotreme sequences were added to root the tree. Values on the nodes denote bootstrap support values (above) and Bayesian posterior probabilities (below). Note the position of the INSL4 clade outside the RLN1/RLN2 clade from Euarchontoglires

Fig. 4

Evolutionary reconstruction of the duplication history of the RLN2 and INSL4 genes in placental mammals. Starting with the RLN1–INSL6 gene arrangement, the proto-RLN1 gene would have gone through two tandem duplication events giving rise to the INSL4 gene in the last common ancestor of Euarchontoglires, and to the RLN2 in the last common ancestor of apes. The INSL4 gene was lost independently in all Euarchontoglires other than catarrhine primates

Evolution of the RLN/INSL-Like Genes

In the past, different models for the evolution of the RLN/INSL gene family were developed that made specific predictions regarding genealogical relationships among the different paralogs (Hsu 2003; Olinski et al. 2006b; Park et al. 2008a, b). However, the most recent phylogenetic studies did not compare the competing evolutionary scenarios in a phylogenetic framework (Good-Avila et al. 2009; Park et al. 2008a; Wilkinson et al. 2005). According to the divergence post-WGD model of evolution (Fig. 1a), the RLN and INSL paralogs derive from a single RLN/INSL ancestral gene that underwent two successive rounds of WGDs and gave rise to three RLN/INSL genes at three different genomic locations. Each of these resulting genes would have been the progenitor of the paralogs found on a given cluster (Fig. 1a, Hsu 2003; Olinski et al. 2006b; Park et al. 2008a, b). The alternative divergence pre-WGD model would require the duplication of the single RLN/INSL ancestral gene and ensuing differentiation into a proto-RLN and a proto-INSL gene prior to the two rounds of WGDs. Here, the RLN and INSL ancestral genes would have been already present when the WGDs took place (Fig. 1b). The phylogenetic predictions of these two competing models are mutually exclusive and easily recognizable. Under the divergence post-WGD model we would expect genes on the same cluster to be monophyletic, whereas under the divergence pre-WGD model we would expect to find all RLN genes in one clade, and all INSL genes in another. The topology recovered in our ML and Bayesian analyses is congruent with the divergence post-WGD model, and the approximately unbiased topology test (Shimodaira 2002) was marginally significant in rejecting the divergence pre-WGD model in all cases (P = 0.0508).

From an evolutionary perspective, the main implication of this finding is that the differentiation of the RLN and INSL genes occurred independently in the different clusters after the two rounds of WGDs. The initial stages of this process would have occurred early in the evolution of vertebrates as orthology among tetrapod and teleost fish members of the family has been well established in previous studies (Good-Avila et al. 2009; Park et al. 2008a).

On a more recent time scale, tandem duplications have also played an important role in the evolution of the vertebrate RLN and INSL repertoire. For example, most mammals possess five relatively old paralogs of this family in their genomes: INSL5, RLN1, INSL6, RLN3, and INSL3 (Park et al. 2008a, b), while primates have two additional, younger genes (INSL4 and RLN2) that are not present in any other placental lineages (Bieche et al. 2003). These genes derive from more recent tandem duplications (Fig. 4). The RLN2 gene apparently originated between 29 and 18 mya (Goodman et al. 1998; Steiper and Young 2009) in the last common ancestor of apes (Figs. 3, 4; Good-Avila et al. 2009; Park et al. 2008a; Wilkinson et al. 2005), whereas the INSL4 gene arose from an RLN ancestor prior to the divergence of Euarchontoglires (Figs. 2, 3, 4). Outside placental mammals, monotremes and marsupials appear to possess four paralogs (INSL5, RLN1, RLN3, and INSL3; Park et al. 2008a, b).

Variation in the RLN/INSL gene complement is also observed among other vertebrates. There are two RLN and two INSL genes in the western clawed frog (Xenopus tropicalis), but only one RLN and one INSL gene in chicken (Gallus gallus) (Park et al. 2008a, b). Ray-finned fish underwent an additional round of WGD, and as a result, possess duplicated RLN/INSL clusters and duplicated copies of INSL5 and RLN3 genes relative to tetrapods (Good-Avila et al. 2009; Park et al. 2008a). On the other hand, the RLN1 gene was lost in zebrafish (Danio rerio), but is found as a single-copy gene in other ray-finned fish species (Good-Avila et al. 2009; Park et al. 2008a), as is the case with the INSL3 gene, which has been found as a single-copy gene in the zebrafish, the spotted green pufferfish (Tetraodon nigroviridis), and the fugu (Takifugu rubripes) (Good-Avila et al. 2009).

Conclusions

This study provides strong support for the divergence post-WGD model of evolution for the vertebrate RLN/INSL family of genes (Fig. 1a). Under the proposed scenario, there would have been a single ancestral proto-RLN/INSL gene prior to the two rounds of WGDs in the last common ancestor of extant vertebrates. The two successive rounds of WGD would have then generated the progenitors of the different RLN–INSL clusters on each chromosome, and subsequent duplications within each cluster then gave rise to the present RLN/INSL gene clusters. From a functional standpoint, our study illustrates the interplay between gene duplication and functional differentiation in the generation of biological novelties. In this case, the differentiation of the proto-RLN/INSL genes deriving from the WGDs gave rise to RLN and INSL genes independently on each separate cluster. Our study also indicates that lineage-specific patterns of duplication, deletion, and retention of genes have played a strong role in shaping the RLN and INSL gene complement in extant species. An example of this process is the INSL4 gene, which at present is only found in catarrhine primates, but appears to derive from a relatively older duplication that predates divergence among Euarchontoglires, and has apparently been secondarily lost in all Euarchontoglires other than catarrhines (Fig. 4).