Introduction

The Per-ARNT-Sim (PAS) transcription factors (TFs), which belong to the basic helix-loop-helix (bHLH) TFs, are critical regulators of the gene expression networks involved in many essential physiological and developmental processes (Kewley et al. 2004). The bHLH/PAS genes are thought to include the aryl hydrocarbon receptor (AHR), hypoxia inducible factors (HIF-α), single minded proteins (SIM), aryl hydrocarbon receptor nuclear translocator (ARNT), aryl hydrocarbon receptor nuclear translocator-like (ARNTL, also called BMAL1 and BMAL2) and neuronal PAS domain protein (NPAS) (Gilles-Gonzalez and Gonzalez 2004; Kewley et al. 2004). The bHLH/PAS proteins contain a bHLH domain and PAS domain dimers, labeled PAS and PAS_3 in the Pfam database (Gilles-Gonzalez and Gonzalez 2004; Kewley et al. 2004). Despite the widespread distribution of bHLH/PAS genes and their importance in physiological and developmental processes, the process of domain fusion and the expansion of this gene family remain unknown. We aimed to explore the evolution, origin and gene structure conservation of this gene family, which may also provide new insights into the evolution of other gene families.

bHLH/PAS proteins function in many essential physiological and developmental processes, including dioxin toxicology, hypoxic response, neural development, the circadian clock, schizophrenia and learning ability. AHR, the receptor of dioxin, is well studied in toxicology using zebrafish and mice (Hahn et al. 2006) and also plays essential roles in the immune system (Esser et al. 2009; Stevens et al. 2009). Responses to hypoxia are mediated by HIF-1-3α: under hypoxia, the oxygen-dependent hydroxylases are inactivated, resulting in the accumulation of HIF-α and the formation of a dimer with ARNT (Kewley et al. 2004). ARNT is a general partner of bHLH/PAS proteins such as AHR, HIF-α, SIM and itself (Kewley et al. 2004). In the circadian clock, CLOCK and BMAL1 (ARNTL) form a heterodimer that binds to the E-box response elements in the promoter region and stimulates the transcription of other essential clock genes (Griffin et al. 1999; Kume et al. 1999; Preitner et al. 2002). NPAS1/3/4 are associated with neuronal function, social recognition and learning ability (Erbel-Sieler et al. 2004; Pickard et al. 2005; Ploski et al. 2011), and NPAS2 likely functions as part of a molecular clock operative in the mammalian forebrain (Reick et al. 2001).

Gene duplications provide a redundant gene copy for functional innovation (Conant and Wolfe 2008). Hahn reviewed the models for the three fates of duplicated genes: neofunctionalization, subfunctionalization and conservation of function (Hahn 2009). Conant and Wolfe stated that a duplicate gene pair passes through three life stages to form new functions: creation, fixation-preservation and subsequent optimization (Conant and Wolfe 2008). Here, we are interested in the specific role of natural selection in the evolutionary history of the bHLH/PAS family. Previous evolutionary studies of bHLH/PAS genes focused on particular subfamilies or systems, such as HIF-α (Zhang et al. 2010), the PHD–HIF oxygen-sensing system (Rytkonen et al. 2011), SIM in arthropods (Linne et al. 2012) and the animal circadian clock in sea anemone (Reitzel et al. 2010). A comprehensive study of the molecular evolution of the bHLH/PAS gene family is still lacking.

In this study, we performed an extensive phylogenetic analysis of the bHLH/PAS gene family. Based on the analysis of exon–intron structure and phylogenetic trees, we propose a model of bHLH/PAS origination and evolution, including the retention or change of a conserved “1001-0210” eight-exon structure. Analyses of selective pressure and functional divergence were combined to explain the relationship between site-specific evolution and functional divergence. Our study will help in understanding the mechanism of bHLH/PAS evolution and provide a starting point for further experimental investigations.

Materials and methods

Identification of bHLH/PAS genes

The proteome sequences of the representative species used to predict bHLH/PAS genes were obtained as in our previous study (Zhou et al. 2012). The proteome of Dictyostelium discoideum was downloaded from dictyBase (http://www.dictybase.org). The proteome of Strongylocentrotus purpuratus was downloaded from NCBI (http://www.ncbi.nlm.nih.gov). The proteomes of Monosiga brevicollis, Amphimedon queenslandica, Nematostella vectensis and Branchiostoma floridae were obtained from JGI (http://genome.jgi-psf.org/). The other investigated proteomes were downloaded from Ensembl (http://www.ensembl.org).

HMMER searches on our local server and Pfam 26.0 batch searches (http://pfam.sanger.ac.uk/), both with an E-value below 1.0, were run against all proteomes. The Pfam profiles PF00010 (bHLH domain), PF08447 (PAS_3 domain) and PF00989 (PAS domain) were used in HMMER. CD-HIT (Huang et al. 2010), with a 90 % sequence identity threshold and otherwise default parameters, was applied to remove polymorphisms, splice variants, pseudogenes and duplicates from the JGI and NCBI data. To remove splice variants from the Ensembl data, our Perl script kept only the longest protein among those sharing the same gene ID.
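The longest-isoform rule applied to the Ensembl data can be sketched as follows. This is a minimal Python illustration, not the Perl script used in the study; the function name `keep_longest_per_gene`, the record layout and the example identifiers are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def keep_longest_per_gene(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Keep one protein per gene: the longest sequence.

    records: (gene_id, protein_id, sequence) triples (hypothetical layout).
    Returns a mapping gene_id -> (protein_id, sequence).
    On ties, the first record seen is kept.
    """
    longest: Dict[str, Tuple[str, str]] = {}
    for gene_id, protein_id, seq in records:
        current = longest.get(gene_id)
        if current is None or len(seq) > len(current[1]):
            longest[gene_id] = (protein_id, seq)
    return longest


# Made-up identifiers and sequences, for illustration only.
example = [
    ("GENE_A", "PROT_A1", "MKT"),
    ("GENE_A", "PROT_A2", "MKTAYI"),
    ("GENE_B", "PROT_B1", "MA"),
]
representatives = keep_longest_per_gene(example)
```

Keeping the first record on ties is an arbitrary choice made here; the text does not say how ties were resolved.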

Multiple alignment and phylogenetic analysis

For the phylogenetic analysis, truncated sequences comprising the bHLH, PAS and PAS_3 domains and their intervals were aligned (Fig. S1) with ClustalX (v2.0) (Larkin et al. 2007). The phylogenetic trees were constructed by three different approaches: Bayesian analysis, the neighbor-joining (NJ) method with the JTT model, and maximum likelihood (ML). Bayesian analysis was performed using MrBayes 3.1.2 (Ronquist and Huelsenbeck 2003) with the mixed amino acid substitution model and an MCMC run of 12,000,000 generations. Markov chains were sampled every 100 generations, and the first 25 % of the trees were discarded as burn-in. Convergence was assessed by checking that the average standard deviation of split frequencies was below 0.01. MEGA5 (Tamura et al. 2007) was used to construct the NJ trees from the multiple alignments, and node confidence in the NJ trees was assessed by bootstrapping with 1,000 replicates. The ML trees were constructed using PhyML 3.0 (Guindon and Gascuel 2003) with 100 bootstrap replicates. The appropriate model for the ML analysis, including model parameters, was selected with ProtTest 2.4 (Abascal et al. 2005) using the Akaike Information Criterion (AIC); the selected model was JTT + I + G + F.
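As a reminder of how AIC-based model selection works, the criterion is AIC = 2k − 2lnL, where k is the number of free parameters and lnL the maximized log-likelihood; the model with the smallest AIC is preferred. The sketch below illustrates the rule only. The log-likelihoods and parameter counts are made-up numbers, not ProtTest output from this study.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model(fits: Dict[str, Tuple[float, int]]) -> str:
    """Return the model name with the smallest AIC.

    fits maps a model name to (log_likelihood, n_params).
    """
    return min(fits, key=lambda name: aic(*fits[name]))


# Hypothetical fits, for illustration only.
hypothetical_fits = {
    "JTT": (-10500.0, 100),
    "JTT+I+G+F": (-10300.0, 121),
}
selected = best_model(hypothetical_fits)
```

With these invented values the richer model wins because its gain in log-likelihood outweighs the penalty for 21 extra parameters.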

Identification of exon–intron structure and domain search

Exon–intron structures were taken from NCBI (S. purpuratus), JGI (N. vectensis), UCSC (S. purpuratus, http://genome.ucsc.edu/) and Ensembl (others). The information on the bHLH, PAS and PAS_3 domains was obtained from Pfam 26.0 in batch mode (http://pfam.sanger.ac.uk/), and was checked and edited manually. The diagrams of exon–intron structures were generated from the extracted information using Perl and R scripts, followed by manual editing.

Analysis of functional divergence

Type I functional divergence analysis was performed with DIVERGE (version 2.0) (Gu and Vander Velden 2002). Type I functional divergence results from altered functional constraints between duplicated genes, leading to one of the genes being conserved while the other shows great variation. The coefficient of functional divergence (θ) indicates the level of type I functional divergence between two homologous gene clusters. The sites (k) contributing to the functional divergence were predicted according to their posterior probabilities (Qk). Sites with Qk > 0.67 were considered meaningful for functional divergence. We chose 75 sequences (including the bHLH, PAS and PAS_3 domains and their intervals) from deuterostomes (including sea urchin, ciona, zebrafish, frog, anole, chicken and human) to calculate the coefficients of functional divergence.

Analysis of selective pressure

The truncated amino acid sequences of the bHLH, PAS and PAS_3 domains and their intervals were aligned with ClustalX (v2.0) (Larkin et al. 2007). The corresponding cDNA data were downloaded from Ensembl (http://www.ensembl.org). A Perl script and local BLAT software were used to obtain the corresponding truncated cDNA. The codon alignment of the cDNA was generated with the PAL2NAL web server (Suyama et al. 2006). CODEML in PAML 4.5 (Yang 2007) was then used to test the selective pressure with the models M0 (one ratio), M1a (nearly neutral), M2a (positive selection), M3 (discrete), M7 (beta) and M8 (beta and ω). The non-synonymous/synonymous substitution rate ratio (ω = dN/dS) was calculated to indicate the selective pressure: ω > 1 indicates positive selection, ω < 1 indicates negative selection, and ω ≈ 1 indicates neutral evolution. The ML method was used to explore the selective pressure, and nested models were compared by the likelihood ratio test (LRT), in which the statistic 2ΔlnL = 2(lnL1 − lnL0) is compared with the χ² distribution with the corresponding degrees of freedom (df).
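The likelihood ratio test described above can be sketched in a few lines. This is a standard-library Python illustration, not part of PAML; the function names `chi2_sf` and `lrt` are hypothetical, and the upper-tail χ² probability is computed from the closed-form series valid for integer degrees of freedom.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i-1/2) / Gamma(i+1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


def lrt(lnl_null: float, lnl_alt: float, df: int) -> Tuple[float, float]:
    """Return (2*delta_lnL, p-value) for a nested model comparison."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(max(statistic, 0.0), df)
```

For example, a statistic of 3.841 with df = 1 gives p ≈ 0.05, the usual 5 % critical value, and identical log-likelihoods (2ΔlnL = 0) give p = 1.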

Results

Genome wide identification revealed that bHLH/PAS genes existed in metazoa

To study the origination and evolution of bHLH/PAS genes, we searched all investigated species for bHLH/PAS genes, defined as genes containing all three domains: bHLH, PAS and PAS_3 (see “Materials and methods”, Table 1, Table S1). The results showed that this family is present in metazoa, as we identified four sequences with the bHLH, PAS and PAS_3 domains in A. queenslandica. Moreover, two sequences with bHLH and PAS domains (jgi|Monbr1|12346| and jgi|Monbr1|26507|) and two similar sequences with bHLH and PAS_3 domains (jgi|Monbr1|37483| and jgi|Monbr1|7924|) were found in M. brevicollis, suggesting that these three domains may have fused step by step to form the bHLH/PAS genes. Because we are interested in the bHLH/PAS genes with three domains, we excluded from the following analyses the sequences with only two domains that we found in different species. No sequence with two or three of these domains was found in Saccharomyces cerevisiae, D. discoideum, plants, fungi or microbes by BLAST search at NCBI. Invertebrate genomes usually contain <10 bHLH/PAS genes, whereas vertebrates always contain more than 10 (Table 1). In addition, we summarized the detailed gene information of bHLH/PAS genes in human and zebrafish (Table S2–3).
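The three-domain definition used above can be expressed as a simple filter over domain hits. The sketch below assumes hits are available as (protein ID, Pfam accession, E-value) triples, which is a hypothetical layout rather than the actual HMMER or Pfam output format; the function name and the example protein IDs are also made up. The Pfam accessions and the E-value cutoff of 1.0 are taken from the Materials and methods.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Pfam accessions named in the Materials and methods.
REQUIRED_DOMAINS = {"PF00010", "PF00989", "PF08447"}  # bHLH, PAS, PAS_3


def proteins_with_all_domains(
    hits: Iterable[Tuple[str, str, float]],
    max_evalue: float = 1.0,
) -> List[str]:
    """Return sorted protein IDs that have hits to all three required domains.

    hits: (protein_id, pfam_accession, e_value) triples (hypothetical layout).
    Only hits with an E-value below max_evalue are counted.
    """
    domains_by_protein: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, accession, e_value in hits:
        if e_value < max_evalue and accession in REQUIRED_DOMAINS:
            domains_by_protein[protein_id].add(accession)
    return sorted(
        pid for pid, found in domains_by_protein.items()
        if found == REQUIRED_DOMAINS
    )


# Made-up hits, for illustration only.
example_hits = [
    ("prot_1", "PF00010", 1e-12),
    ("prot_1", "PF00989", 1e-8),
    ("prot_1", "PF08447", 1e-5),
    ("prot_2", "PF00010", 1e-10),
    ("prot_2", "PF00989", 1e-6),   # two domains only: excluded
    ("prot_3", "PF00010", 1e-9),
    ("prot_3", "PF00989", 1e-7),
    ("prot_3", "PF08447", 5.0),    # fails the E-value cutoff: excluded
]
three_domain_proteins = proteins_with_all_domains(example_hits)
```

Two-domain sequences, such as those reported for M. brevicollis, would be dropped by this filter, which mirrors the exclusion described in the text.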

Table 1 List of bHLH/PAS genes in the genomes investigated

Phylogenetic analysis of the bHLH/PAS genes

To infer the evolution and classification of bHLH/PAS genes, a Bayesian phylogenetic tree was constructed based on the three domains (bHLH, PAS and PAS_3) and their intervals (Fig. 1). The subtrees of the Bayesian tree are shown in supplementary Fig. S2. The topologies of the ML and NJ trees are similar to that of the Bayesian tree (Fig. S3). Based on these results, we classified the bHLH/PAS genes into three groups: the first group is AHR; the second group is SIM1/2, HIF-1-3α and NPAS1/3; and the third group is NPAS2/CLOCK, ARNT/ARNT2 and ARNTL/ARNTL2.

Fig. 1

Phylogenetic tree and gene structure of bHLH/PAS genes. The Bayesian tree was built from the sequences of the three domains (bHLH, PAS and PAS_3) and their intervals using MrBayes 3.1.2 with the mixed amino acid substitution model. The vertebrate bHLH/PAS branches are compressed, and their expanded subtrees are shown in Fig. S2. The three groups (I–III) are labeled in the diagram

The first group is AHR, which is present from Caenorhabditis elegans to Homo sapiens. The AHR clade contains two clusters, the deuterostome cluster and the protostome cluster. There is only one copy of the AHR gene in most vertebrates, except zebrafish, chicken and opossum (three copies each). A lamprey AHR (ENSPMAP00000006554) was also found. The AHR group clusters with three sequences (ENSDARP00000105317 and ENSDARP00000083340 in zebrafish, which were identified as “neuronal PAS domain-containing protein 4-like” by NCBI, and C15C8.2b in C. elegans) with a high common Bayesian posterior probability (91) and high bootstrap values in the NJ and ML trees (Fig. S3).

The second group, SIM1/2, HIF-1-3α and NPAS1/3, appears to have originated in the common ancestor of metazoans, because we identified two sponge sequences of A. queenslandica (Aqu1.222845 and Aqu1.222846) in this group. The NPAS1/3 clade contained a sequence from N. vectensis (jgi|Nemve1|96877|), implying that this clade emerged early, in the common ancestor of eumetazoans. NPAS1 and NPAS3 diverged in vertebrates, suggesting that they duplicated during the origin of vertebrates, since there was only one copy in ciona (ENSCINP00000001875). Although two copies of NPAS1/3 (ENSPMAP00000010632 and ENSPMAP00000008398) were identified in lamprey, they seem to have arisen from a tandem duplication. Similarly, the SIM1/2 clade diverged in vertebrates, while one sequence each from amphioxus (jgi|Brafl1|265033|) and S. purpuratus (gi|115920139|) is located in this clade. The fly SIM (FBpp0082178) indicates that the SIM gene emerged before the split of deuterostomes and protostomes. Two copies of SIM1 (ENSDARP00000033085 and ENSDARP00000003984) were found in zebrafish, which may be the result of the 3R WGD (whole genome duplication) in teleosts (Taylor et al. 2001; Vandepoele et al. 2004; Meyer and Van de Peer 2005). One lamprey HIF-α (ENSPMAP00000000148) and one amphioxus HIF-α (jgi|Brafl1|208408|) were found. HIF-α was also found in sea anemone (jgi|Nemve1|161959|), implying that HIF-α arose before bilateral animals emerged. The presence of three members of HIF-α in most vertebrates (HIF-1-3α) suggests that HIF-α duplicated at the origin of vertebrates, possibly as a result of the 2R WGD.

The third group contains two large clades: the first is NPAS2/CLOCK and the second is ARNT/ARNT2 and ARNTL/ARNTL2. The NPAS2/CLOCK clade was found in A. queenslandica (Aqu1.220834). There was only one gene copy in this clade in N. vectensis (jgi|Nemve1|160110|), Drosophila melanogaster (FBpp0099478) and B. floridae (jgi|Brafl1|63636|), while the genes in this clade diverged into NPAS2 and CLOCK in vertebrates. We noticed that a clade containing five sequences (ENSDARP00000016938 in zebrafish, ENSXETP00000005089 in frog, ENSGALP00000014735 in chicken, ENSOANP00000018438 in platypus and ENSMODP00000007175 in opossum) is located within the NPAS2 clade of the Bayesian, ML and NJ trees (Fig. 1, Fig. S3). Surprisingly, we did not find genes of this clade in other mammals, which suggests that they may have been lost during evolution. There are two CLOCK genes (ENSDARP00000014377 and ENSDARP00000120780) in zebrafish, as compared to one CLOCK gene in other vertebrates. The ARNT/ARNT2 and ARNTL/ARNTL2 clade also contained two large subclades, and their common Bayesian posterior probability was very high (100). The ARNT/ARNT2 subclade appeared as early as the sponge (Aqu1.211840), and contained only one copy in invertebrates, including N. vectensis (jgi|Nemve1|116698|), D. melanogaster (FBpp0081483), S. purpuratus (gi|390346232|), Ciona intestinalis (ENSCINP00000016558) and B. floridae (jgi|Brafl1|124387|). ARNT diverged into ARNT and ARNT2 in vertebrates. The ARNTL subclade is similar, also containing two copies in vertebrates. ARNTL emerged early, in N. vectensis (jgi|Nemve1|132249|), and contained only one copy in invertebrates.

A conserved exon phase pattern “1001-0210” existed in bHLH/PAS genes

To study the gene structures of bHLH/PAS genes, we drew the exon–intron structures for some species based on information from Ensembl and JGI (Fig. 2, Fig. S4, Fig. S5). Our results showed that most deuterostome genes in the AHR clade (e.g. gi|390353292| in S. purpuratus, ENSCINP00000019134 in C. intestinalis) contained a conserved exon–intron structure with the exon phases “1001-0212”. The AHR genes in flies (FBpp0297169) and nematodes (C41G7.5a) lost this exon–intron structure, which may be a result of gene rearrangement in protostomes (Putnam et al. 2007). Similar exon–intron structures are found in other clades. For example, in the NPAS1/3 clade the exon–intron structure of the sea anemone gene (jgi|Nemve1|96877|) is “1001-0210”, while that of the deuterostome genes in this clade evolved into “1001-0212”. The small change in the last exon of “1001-0210” is caused by an extension at the C terminus. The exon–intron structure of the amphioxus SIM gene (jgi|Brafl1|265033|) is still “1001-0210”; however, it became “1001-0212” in vertebrate SIM1/2 (see zebrafish and human SIM1/2). The exon phases of HIF-α in sea anemone and amphioxus were “1001-0210”, while they became “101-0210” in vertebrates as a result of a fusion of the second and third exons, as indicated by the similarity between the amino acid sequence of the fused exon and those of the original two exons (Fig. 2a, Fig. S4a).
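To make the phase notation concrete, the sketch below derives a phase string from exon coding lengths and shows why fusing two adjacent exons removes one digit without disturbing the others. It assumes that each digit is the end phase of an exon, computed as the cumulative coding length modulo 3; the text does not state which phase convention was used, so this mapping is an assumption, and the exon lengths are invented.

```python
from itertools import accumulate
from typing import List


def exon_end_phases(coding_lengths: List[int]) -> str:
    """End phase of each exon: cumulative coding length modulo 3.

    Assumption: one digit per exon, as in strings such as "1001".
    """
    return "".join(str(total % 3) for total in accumulate(coding_lengths))


def fuse_exons(coding_lengths: List[int], i: int) -> List[int]:
    """Merge exon i with exon i + 1 (0-based), as after loss of the intron between them."""
    merged = coding_lengths[i] + coding_lengths[i + 1]
    return coding_lengths[:i] + [merged] + coding_lengths[i + 2:]


# Invented coding lengths (in nucleotides) giving the pattern "1001".
lengths = [4, 5, 3, 4]
before = exon_end_phases(lengths)                 # "1001"
after = exon_end_phases(fuse_exons(lengths, 1))   # "101"
```

Under this convention, fusing the second and third exons turns “1001” into “101”, which matches the change described for vertebrate HIF-α, while the phases of all downstream exons are unchanged because the total coding length is preserved.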

Fig. 2

The detailed exon split and fusion in the HIF-1-3α, ARNTL1/2 and NPAS2/CLOCK clades. The one or two exons marked by red boxes and arrows indicate the exon split and fusion, and the corresponding sequence alignments are shown on the right. a Exon fusion in HIF-1-3α. b Exon fusion in ARNTL1/2. c Exon split in NPAS2/CLOCK (color figure online)

In the NPAS2 and CLOCK clade, the exon–intron structure changed into “1001-10212”, as compared to “1001-0212” in vertebrate NPAS1/3 and SIM1/2. This was caused not by an exon insertion but by the rearrangement of the fifth and sixth exons in “1001-0210” (Fig. S5). We noticed that the exon–intron structure of the amphioxus gene (jgi|Brafl1|63636|) in this clade is “1001-1212”, as a result of the same rearrangement (Fig. S5). The exon–intron structure of the sea anemone gene (jgi|Nemve1|214146|) in this clade has become “1001-121112”, which is caused by the duplication of the two PAS domains in this gene. The other sea anemone gene (jgi|Nemve1|160110|) in this clade has even been reduced to a single exon (Fig. S4b).

The changes in the ARNT/ARNT2 and ARNTL/ARNTL2 clade were large and occurred mostly in the ARNT/ARNT2 subtree (Fig. S4b). In the ARNTL/ARNTL2 subtree, the exon–intron structure is “1001-1212” in amphioxus (jgi|Brafl1|110703|), human and zebrafish, suggesting that it is conserved in evolution. The fly ARNTL gene (FBpp0074693) has been reduced to four exons with exon phases “2212”. Similar to the exon fusion in HIF-α, we found that an exon fusion may have occurred at the origin of chordates: the “1001-10212” in sea urchin ARNTL (gi|390370839|) became “1001-1212” in amphioxus ARNTL (jgi|Brafl1|110703|) (Fig. 2b). An exon split was also found in the interval between the PAS and PAS_3 domains in NPAS2/CLOCK (Fig. 2c). In the ARNT/ARNT2 subtree, the exon–intron structure changed repeatedly during evolution. The exon–intron structure of the sea anemone gene (jgi|Nemve1|116698|) is “1001-10210”. It became “1001-11212” in sea urchin (gi|390346232|), and then “1001-111212” in amphioxus (jgi|Brafl1|124387|). It changed dramatically into “0122-10002” in vertebrates (zebrafish and human ARNT/ARNT2).

Based on the analysis of exon–intron structures, we found a conserved exon phase pattern “1001-0210” in some primitive species, including the sea anemone (jgi|Nemve1|96877|) and amphioxus (jgi|Brafl1|208408|), which may be the ancestral exon phase pattern of the bHLH/PAS family. It was retained or changed only slightly in most clades, except in the ARNT/ARNT2 subtree.

Analysis of functional divergence

Gene family proliferation resulting from genome-wide or local chromosome duplication events provides the raw material for functional innovation. After gene duplication, one gene copy maintains the original function, while the other copy is free to accumulate amino acid changes and diverge functionally. Type I functional divergence refers to amino acid configurations in which one duplicate copy is conserved while the other is highly variable (Gu 1999, 2001). To determine the shift in selective constraints in the bHLH/PAS gene family, the coefficients of functional divergence (θ) were calculated based on the pairwise sequence comparisons (Table 2). The results show that all θ values were significantly >0 with p < 0.05 (LRT, df = 1, 3.841 at 5 % for χ²), suggesting that a site-specific rate shift after gene duplication is a common phenomenon in the evolution of the bHLH/PAS gene family. Further analysis focused on some groups (Fig. 3). ARNT/ARNT2 and ARNTL/ARNTL2 are grouped in one clade, yet they have more sites with posterior probabilities (Qk) above 0.67, implying that they underwent functional divergence after gene duplication (Fig. 3a). NPAS1/3 and SIM1/2 are also neighboring groups, but they have fewer sites with Qk > 0.67 than the ARNT/ARNT2 and ARNTL/ARNTL2 pair (Fig. 3b). The θ between HIF-1-3α and NPAS1/NPAS3 is 0.4752, lower than the others, and the sites with Qk > 0.67 are mostly located in the PAS_3 domain (Fig. 3c). We also checked ARNTL/ARNTL2 and NPAS2/CLOCK, which have many sites with Qk > 0.67 (Fig. 3d). ARNTL1/2 and HIF-1-3α are not in neighboring clades and have many functional divergence sites (Fig. 3e). NPAS1/3 and AHR are also not in neighboring clades but have fewer functional divergence sites (Fig. 3f). We noticed that the three θ values among SIM1/2, HIF-1-3α and NPAS1/3 are all lower than 0.5, suggesting that they did not diverge dramatically. ARNT1/2 and ARNTL1/2 are grouped with high bootstrap support, but their θ is 0.6088, hinting that they underwent functional divergence after duplication.
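The site-level reading of these comparisons amounts to thresholding the posterior probabilities and assigning each site to a domain or to an interval. The sketch below uses the 0.67 cutoff from the Materials and methods and the alignment boundaries given in the Fig. 3 legend (bHLH 1-64, PAS 115-279, PAS_3 397-490); the function names and the posterior values are made up and do not reproduce DIVERGE output.

```python
from typing import Dict

# Alignment coordinates from the Fig. 3 legend.
DOMAINS = {"bHLH": (1, 64), "PAS": (115, 279), "PAS_3": (397, 490)}


def region_of(position: int) -> str:
    """Name the domain containing an alignment position, or 'interval' if none does."""
    for name, (start, end) in DOMAINS.items():
        if start <= position <= end:
            return name
    return "interval"


def divergent_sites_by_region(
    posteriors: Dict[int, float], cutoff: float = 0.67
) -> Dict[str, int]:
    """Count sites whose posterior probability exceeds the cutoff, per region."""
    counts: Dict[str, int] = {}
    for position, q in posteriors.items():
        if q > cutoff:
            region = region_of(position)
            counts[region] = counts.get(region, 0) + 1
    return counts


# Made-up posterior probabilities, for illustration only.
hypothetical_q = {10: 0.90, 64: 0.67, 120: 0.50, 200: 0.70, 300: 0.80, 450: 0.68}
site_counts = divergent_sites_by_region(hypothetical_q)
```

Note that the comparison is strict, so a site with a posterior probability of exactly 0.67 is not counted.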

Table 2 Estimates of the coefficient of type I functional divergence (θ)
Fig. 3

Site-specific profile of evolutionary rate change. The dotted lines perpendicular to the X-axis mark the boundaries of the bHLH (1-64), PAS (115-279) and PAS_3 (397-490) domains. a ARNT/ARNT2 vs ARNTL/ARNTL2. b NPAS1/NPAS3 vs SIM1/SIM2. c HIF-1-3α vs NPAS1/NPAS3. d ARNTL/ARNTL2 vs NPAS2/CLOCK. e ARNTL/ARNTL2 vs HIF-1-3α. f AHR vs NPAS1/NPAS3

Analysis of selective pressure

Because the bHLH, PAS and PAS_3 domains are important for the bHLH/PAS genes, we tested for positive selection at individual codons in the three domains and their intervals using the CODEML program (Yang 2007). According to the LRT of site-specific models, model M3 fit significantly better than model M0 (2ΔlnL = 1,805.934, p < 0.01, df = 4), indicating that selective pressure is heterogeneous among amino acid sites (Table 3). The three site classes under model M3 had ω values of 0.00929, 0.03115 and 0.09922, suggesting that most amino acid sites underwent strong purifying selection. Models M2a and M1a showed no difference (2ΔlnL = 0, df = 2), indicating that no positively selected site was found in these three domains during vertebrate evolution. The fact that model M8 did not fit significantly better than model M7 (2ΔlnL = 0, df = 2) also supports this conclusion. These three domains therefore underwent strong purifying selection, and any positively selected sites may lie in the sequences outside these three domains. Other gene families were also reported to have undergone strong purifying selection (Milenkovic et al. 2010; Teng et al. 2010; Yan and Cai 2010), as was HIF-1-3α, which also belongs to the bHLH transcription factors (Zhang et al. 2010).

Table 3 Detection of positive selection of bHLH/PAS genes in vertebrates and ciona

Discussion

In this study, we described the evolution and functional divergence of the bHLH/PAS gene family on a genome-wide scale. We found that bHLH/PAS proteins appeared during the emergence of metazoans (in sponge). Phylogenetic trees indicated that this family can be divided into three groups and that its gene members expanded in vertebrates. Exon–intron structure analysis revealed a conserved gene structure with exon phases “1001-0210” in most members, and some exon fusion and split events were inferred. Analysis of functional divergence suggested that a site-specific rate shift after gene duplication is a common phenomenon in their evolution. Our study provides new insights into the evolution and origination of the bHLH/PAS family, which may provide clues for further functional analysis.

Origin and evolution of bHLH/PAS gene

The bHLH/PAS proteins are critical regulators of gene expression networks underlying many essential physiological and developmental processes (Kewley et al. 2004). We performed an extensive bioinformatics analysis and proposed a model to describe the origin and evolution of the bHLH/PAS genes. In brief, this model involves two points: first, the number of genes in this family expanded in vertebrates as compared to invertebrates; second, a conserved exon–intron structure with the exon phases “1001-0210” was retained in most clades during evolution.

The bHLH proteins are found in plants, fungi and animals (Sailsbery and Dean 2012), whereas PAS domains can be found in bacterial, eukaryotic and archaeal proteins (Moglich et al. 2009). In our study, we focused on the sequences with both bHLH and PAS domains. The four sequences with two domains in M. brevicollis (jgi|Monbr1|12346|, jgi|Monbr1|26507|, jgi|Monbr1|37483| and jgi|Monbr1|7924|) imply that the domain fusion or domain rearrangement of bHLH and PAS/PAS_3 occurred before the emergence of metazoa. Considering that the PAS and PAS_3 domains are dimers (Gilles-Gonzalez and Gonzalez 2004; Kewley et al. 2004), we inferred that the sponge bHLH/PAS genes evolved from a choanoflagellate bHLH/PAS-like primitive sequence by duplication of the PAS domain. The four bHLH/PAS genes of the sponge A. queenslandica are ancestor-like sequences that might represent the primitive metazoan bHLH/PAS genes. They are located at the roots of the two clades group II and group III (Fig. 1), implying that early metazoans may have contained two copies of this family. Based on the conserved exon pattern (Fig. 4a), we inferred that group I may be a duplication of group II. Most invertebrates contained 4–7 bHLH/PAS genes, suggesting that the gene number of this family was relatively conserved during the evolution of invertebrates. However, the number of genes in this family expanded to 10–14 in vertebrates, which may be a result of the 2R WGD (Hokamp et al. 2003; Panopoulou and Poustka 2005).

Fig. 4

The evolutionary model of the exon–intron structure of bHLH/PAS genes. The sequences shown are from representative primitive species and human. Abbreviations of species names are given in the pictures. A conserved exon–intron structure with the exon phases “1001-0210” exists in many species. The two or three exons indicated by red boxes and arrows show the exon split, fusion or rearrangement in the HIF-1-3α, ARNTL1/2 and NPAS2/CLOCK clades, which are detailed in Fig. 2 and Fig. S5. The three groups (I–III) corresponding to Fig. 1 are labeled in the diagram. a AHR, SIM1, SIM2, HIF-1α, HIF-2α, HIF-3α, NPAS1 and NPAS3 retained the conserved exon phases, except for HIF-1-3α. b NPAS2, CLOCK, ARNT, ARNT2, ARNTL and ARNTL2 show small changes in the conserved exon phases (color figure online)

We identified 18 bHLH/PAS genes in zebrafish (Table S3). This large number of genes may be due to the teleost-specific 3R WGD, because we found some genes with two copies in zebrafish, such as SIM1, CLOCK and HIF-2α (Taylor et al. 2001; Vandepoele et al. 2004; Meyer and Van de Peer 2005). The three nematode and seven fly bHLH/PAS genes located in the corresponding vertebrate clusters hint that these genes were conserved at the protein level across the three-domain (bHLH + PAS + PAS_3) sequences during evolution. However, their exon–intron structures changed as compared to those of the deuterostomes and sea anemone, possibly as a result of rearrangement in protostomes (Putnam et al. 2007).

The yellow blocks shown in Fig. S6 indicate the retention of and changes in the conserved exon–intron structure in the three groups. The results show that the ancestor contained an exon–intron structure of eight exons with the exon phases “1001-0210”, which can be found in sea anemone, sea urchin and amphioxus. The length of the sequence outside the three domains (bHLH + PAS + PAS_3) varies among the 11 human bHLH/PAS genes, suggesting that conservation is restricted to the sequences of these three domains. The clade of HIF-α, NPAS1/3 and SIM1/2 retained “1001-0212” and did not change in evolution, except for HIF-α, which showed a small change (Fig. 4a). We noticed that other clades clustering together with a high Bayesian posterior probability of 100 in the Bayesian tree (Fig. 1) have experienced small changes in the conserved exon phases “1001-0212”. The clade of NPAS2 and CLOCK also retained most of the exon–intron structure but, through an exon rearrangement, became “1001-10212” (Fig. 4b). The ARNT/ARNT2 clade changed greatly in vertebrates, although the NvARNT still retained the ancestral exon phases and the eight exons (Fig. 4b).

We also noticed three exon changes in HIF-1-3α, ARNTL1/2 and NPAS2/CLOCK. The two exons in HIF-α with phases “00” in “1001” fused during the emergence of vertebrates (Fig. 2a). An exon fusion occurred in sea urchin and amphioxus (Fig. 2b). An exon split may have occurred in amphioxus and vertebrates (Fig. 2c). Two exons with phases “02” in “0210” were rearranged in ARNTL1/2, ARNT1/2 and NPAS2/CLOCK (Fig. S5). Interestingly, these changes all occurred in the intervals between the PAS and PAS_3 domains and not in the three functional domains (bHLH, PAS and PAS_3). In summary, the ancestral exon–intron structure “1001-0210” was retained in most members of this gene family, implying that it has been important during evolution.

Functional divergence and purifying selection of the bHLH/PAS genes

Our results show that functional divergence occurred in early gene duplication events, before vertebrates emerged. Strong purifying selection plays a role in the origin of new subfamilies in bHLH/PAS evolution. Molecular selection and functional divergence have been studied for HIF-α proteins in vertebrates, which also underwent strong purifying selection (Zhang et al. 2010). Purifying selection was also detected in other genes, such as the anoctamin, CYP3, NMDA receptor subunit 2 and UCP genes (Hughes et al. 2009; Milenkovic et al. 2010; Teng et al. 2010; Yan and Cai 2010). Our result is consistent with these reports that strong purifying selection plays an essential role during the evolution of some gene families, especially the bHLH/PAS subfamily of HIF-α proteins in vertebrates (Zhang et al. 2010). Another study of HIF-α focused on the evolution of the PHD–HIF oxygen-sensing system; its conclusion that the three HIF isoforms in extant vertebrate genomes are products of 2R is consistent with our findings (Rytkonen et al. 2011).