Introduction

The iconic immunoglobulin (IG) molecule is a tetrapartite structure consisting of four polypeptide chains: two identical heavy (H) chains and two identical light (L) chains (Klein and Hořejší 1997; Lefranc and Lefranc 2001). Both the H and L chains consist of a variable (V) domain and a constant (C) region. The C region is encoded by a C gene. The V domain of the H chain is encoded by three kinds of genes, IGHV, IGHJ, and IGHD, each occurring in multiple copies and in different arrangements with the other two kinds of genes, depending on the species. To form a V domain, one copy of each of the three kinds of genes is brought together by a special process of genetic recombination. The rearrangement involves recombination signal sequences (RSS) composed of conserved heptamers and less conserved nonamers, separated by a 23-bp spacer sequence (Early et al. 1980; Tonegawa 1983). The V domain can be further subdivided into framework regions (FR) and hypervariable or complementarity-determining regions (CDR), which are distinguished by the extent of sequence divergence and by structural delimitations.

The IGHV genes encode the antigen-binding regions of antibodies. Despite a clear sequence homology among IGHV sequences from different species, there is a marked plasticity in the organization of the region and in the mechanism for the generation of antibody diversity. In cartilaginous fishes, the IGHV genes are organized in cassettes of IGHV-IGHD-IGHJ-IGHC, which occur at different chromosomal locations (Litman et al. 1993). This organization is referred to as the cassette type. In bony fishes and tetrapods, the IGHV genes occur in the organization IGHVn-IGHDn-IGHJn-IGHCn (where ‘n’ stands for multiple copies) and are clustered in a single chromosomal location (Marchalonis et al. 1998). The advantage of this organization is thought to be in the facilitation of combinatorial diversification of antibodies (Litman et al. 1993). The repertoire of IGHV genes is produced by the combination of gene duplication and the divergence of duplicate genes (Hughes and Yeager 1997; Ota and Nei 1994). Hence, the evolution of the IGHV genes can be explained by two evolutionary processes: the birth-and-death process and diversifying selection (Ota and Nei 1994). In the birth-and-death model, new genes are created by gene duplication. Some of the duplicate genes acquire new functions and remain in the genome, while others become pseudogenes or are eliminated from the genome. The process of diversifying selection serves to increase variation in amino acid sequences of the CDRs by higher rates of non-synonymous compared to synonymous substitutions, without significant changes in the canonical structure of the FR regions (Tanaka and Nei 1989).

On the basis of the degree of sequence identity, mammalian IGHV genes have been classified into three major clans (clans I–III) (Kirkham et al. 1992; Kodaira et al. 1986; Kofler et al. 1992; Ota and Nei 1994; Schroeder et al. 1990). The number of IGHV genes in these three clans varies among different mammals (Sitnikova and Su 1998). The reason behind the expansion and contraction of the IGHV multigene family and the factors affecting the evolution of antibody repertoire in jawed vertebrates are poorly understood. Furthermore, little is known about the evolutionary relationship between mammalian and non-mammalian IGHV sequences and the evolutionary dynamics of IGHV genes at the chromosomal level, although the structural and functional significance of the genomic location of several genes has been recognized (Linardopoulou et al. 2001). Now that the draft genome sequences of several vertebrate species are available, we have conducted a comparative analysis of IGHV genes of 16 vertebrate species. These comparisons are expected to give new insights into the evolution of the IGHV multigene family.

Materials and methods

Identification of IGHV genes

An exhaustive gene search was conducted to identify all the IGHV genes in the draft genome sequences of zebrafish Danio rerio (assembly: Zv6, Mar 2006; 6.7× coverage), medaka Oryzias latipes (assembly: HdrR, Oct 2005; 6.7× coverage), stickleback Gasterosteus aculeatus (assembly: BROAD S1, Feb 2006; 11× coverage), western clawed frog Xenopus tropicalis (assembly: JGI 4.1, Aug 2005; 7.6× coverage), chicken Gallus gallus (assembly: WASHUC2, May 2006; 7.1× coverage), platypus Ornithorhynchus anatinus (assembly: Ornithorhynchus_anatinus-5.0, Dec 2005; 6× coverage), opossum Monodelphis domestica (assembly: MonDom 4.0, Jan 2006; 6.5× coverage), dog Canis familiaris (assembly: CanFam 2.0, May 2006; 7.6× coverage), cat Felis catus (assembly: Pre Ensembl – release 41, Nov 2006; 2× coverage), mouse Mus musculus (assembly: NCBI m36, Dec 2005; 7.7× coverage), rat Rattus norvegicus (assembly: RGSC 3.4, Dec 2004; 7.0× coverage), macaque Macaca mulatta (assembly: MMUL 1.0, Feb 2006; 5.1× coverage), chimpanzee Pan troglodytes (assembly: CHIMP 2.1, Mar 2006; 6× coverage), and human Homo sapiens (assembly: NCBI Build 36.2, Sep 2006) from the Ensembl Genome Browser. The IGHV genes from cow (Bos taurus; assembly: Btau 2.0, Oct 2005; 6.2× coverage) were retrieved from NCBI Map Viewer. The sheep (Ovis aries) IGHV sequences were identified by sheep-human genome sequence comparison using the Australian Sheep gene mapping web site (http://rubens.its.unimelb.edu.au/%7Ejillm/jill.htm). The human position corresponding to the IGHV locus was used to retrieve the sheep IGHV genes. For all species except sheep, we performed a two-round TBlastN search (Altschul et al. 1997) against the genome sequences with a cutoff E value of 10⁻¹⁵. In the first round, the amino acid sequences of seven functional IGHV genes (one from each previously defined family) annotated in the human genome sequence were used as queries. Because these seven queries are similar to one another, they hit the same genomic regions. We therefore extracted only the non-overlapping sequences given by the best hit (the one with the lowest E value). Taking into account the alignment with the query IGHV genes, we manually annotated each retrieved sequence. If the retrieved sequence was aligned with the query sequence without any frameshifts or premature stop codons in the leader sequence and FR regions (FR1, FR2, and FR3) and had a proper RSS, the sequence was regarded as a potentially functional IGHV gene. All other sequences (including truncated sequences) were regarded as IGHV pseudogenes. Next, the first-round best-hit sequences of each organism were used as queries for a second-round TBlastN search to find additional IGHV sequences, and non-redundant sequences were retrieved in the same way (see Supplementary Table 1 for the list of IGHV sequences). The flowchart of the procedure is shown in Supplementary Fig. 1.
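
The selection of non-overlapping best hits described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in this study: the `Hit` record, its field names, the greedy keep-the-lowest-E-value strategy, and the example coordinates are all hypothetical, and the cutoff is treated as inclusive.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One TBlastN hit. Field names are illustrative, not a BLAST output format."""
    contig: str
    start: int      # 1-based, inclusive
    end: int        # inclusive, start <= end
    evalue: float
    query: str


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits share at least one position on the same contig."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def best_nonoverlapping(hits, max_evalue=1e-15):
    """Greedy selection: visit hits from lowest to highest E value and keep a
    hit only if it does not overlap a hit that has already been kept."""
    passing = [h for h in hits if h.evalue <= max_evalue]
    kept = []
    for hit in sorted(passing, key=lambda h: h.evalue):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.contig, h.start))


# Made-up hits on a made-up contig, for illustration only.
hits = [
    Hit("contig_demo", 100, 400, 1e-40, "query_A"),
    Hit("contig_demo", 120, 410, 1e-35, "query_B"),    # overlaps the first hit, weaker
    Hit("contig_demo", 900, 1200, 1e-20, "query_A"),
    Hit("contig_demo", 2000, 2300, 1e-5, "query_C"),   # fails the E-value cutoff
]
selected = best_nonoverlapping(hits)
for h in selected:
    print(h.contig, h.start, h.end, h.query)
```

The functional-versus-pseudogene decision is left out of the sketch because it was made by manual annotation.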

For all species except cat and sheep, the coverage of the genome was >5×. Therefore, the total numbers of IGHV genes identified in the present study are likely to be close to the actual numbers. The IGHV gene contains one intron between the leader sequence and the V-exon, which consists of the complementarity-determining regions (CDRs) and framework regions (FRs). The CDRs were excluded from the analysis because they are highly variable and contain many insertions/deletions.

Phylogenetic analysis

The amino acid sequences of the FR regions of the functional IGHV genes were aligned using the CLUSTALW program (Thompson et al. 1994). After elimination of gap sites, p-distances (Nei and Kumar 2000) for amino acid sequences were computed, and phylogenetic trees for functional IGHV genes were constructed by the neighbor-joining (NJ) method (Saitou and Nei 1987) using the MEGA4.0 program (Tamura et al. 2007). The p-distance is the proportion of amino acid sites at which two sequences differ and is known to give phylogenetic trees with higher bootstrap values (Takahashi and Nei 2000). The tree was rooted by using two IGHV sequences of an elasmobranch species, Heterodontus francisci (accession nos. S24657 and S24658). The reliability of the tree was assessed by bootstrap resampling with a minimum of 1,000 replications.
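
As a concrete illustration of the distance measure, the following Python sketch removes every alignment column that contains a gap and then computes the p-distance as the number of differing sites divided by the number of compared sites. It is a minimal stand-in for what MEGA computes, not the program itself; the function names and the three short aligned sequences are invented for the example, and gaps are assumed to be written as `-`.

```python
def drop_gap_columns(aligned):
    """Remove every alignment column that contains a gap ('-') in any sequence."""
    names = list(aligned)
    lengths = {len(aligned[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    columns = zip(*(aligned[n] for n in names))
    kept = [col for col in columns if "-" not in col]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


def p_distance(a, b):
    """Proportion of sites at which two gap-free aligned sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Invented toy alignment (not real IGHV framework sequences).
aligned = {
    "seq1": "EVQLVESG-G",
    "seq2": "EVQLVQSGAG",
    "seq3": "QVQLQESGPG",
}
clean = drop_gap_columns(aligned)
d12 = p_distance(clean["seq1"], clean["seq2"])
d13 = p_distance(clean["seq1"], clean["seq3"])
d23 = p_distance(clean["seq2"], clean["seq3"])
print(round(d12, 3), round(d13, 3), round(d23, 3))
```

Dropping a column whenever any sequence has a gap there means that all pairwise distances are computed over the same set of sites.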

Results

Number of IGHV genes in vertebrates

We determined the number of IGHV genes from the draft genome sequences of 16 vertebrate species (Table 1). The total numbers of potentially functional (37–41) and probably nonfunctional (10–13) IGHV genes are nearly the same for zebrafish, medaka, and stickleback, although the species belong to different orders. By contrast, the number of IGHV genes varies strikingly among the mammalian species. The total number of functional IGHV genes in rodents (mouse and rat) is considerably higher than that of the other mammalian species. The numbers of both functional and nonfunctional IGHV genes in artiodactyls (cow and sheep) are much smaller than those in other placental mammals. The two non-placental mammals (opossum and platypus) also differ considerably from each other in the number of IGHV genes. In chicken, there is a single functional IGHV and 58 IGHV pseudogenes. As reported previously (Reynaud et al. 1989), most of these IGHV pseudogenes had the complete V-exon but lacked the proper leader and/or recombination signal sequence. A few of the IGHV pseudogenes were truncated in their 5′ or 3′ ends or contained internal stop codons or frame shift mutations.

Table 1 Number of Immunoglobulin IGHV genes in 16 vertebrates

There is a significant positive correlation between the number of functional and nonfunctional IGHV genes (Fig. 1). Therefore, it seems that the more duplicate genes occur, the more nonfunctional genes are produced in the IGHV multigene family. However, there are some exceptions to this rule. For example, the chicken has a special IGHV organization. The IGHV pseudogenes in this species are not truly nonfunctional, as they are used to generate immunoglobulin diversity by gene conversion and evolve slowly (Ota and Nei 1995; Reynaud et al. 1994). In earlier studies of IGHV evolution from a small number of species, it appeared that the number of IGHV genes per genome is roughly the same in different species (Gojobori and Nei 1984; Ota and Nei 1994). The present study, however, shows that the number of IGHV genes varies considerably with species. In some species, there are a small number of IGHV genes, but in others, the numbers of IGHV genes are very large. These differences have apparently arisen independently in different phylogenetic lineages.
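
A correlation of this kind can be computed as follows. The text does not state which correlation statistic or significance test was used, so the Pearson coefficient below is an assumption, and the gene counts are placeholders chosen for illustration; they are not the values from Table 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Placeholder counts per species (NOT the data from Table 1).
functional_counts = [5, 12, 30, 44, 90]
pseudogene_counts = [3, 20, 25, 60, 85]
r = pearson_r(functional_counts, pseudogene_counts)
print(f"r = {r:.3f}")
```

A rank-based coefficient would be a reasonable alternative if a few species with very large gene numbers dominate the result.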

Fig. 1 Relationship between the numbers of functional IGHV and IGHV pseudogenes

IGHV sequence divergence in different species

To determine the pattern of sequence divergence of the IGHV genes, we calculated the average p-distance for all pairs of functional IGHV sequences in each species (Table 1). The extent of intraspecific IGHV sequence variation varied with species, and the average variation was generally higher in bony fishes than in tetrapods. Marked differences in sequence variation exist between mammalian species. Thus, artiodactyls (cow and sheep) show a lower level of variation than primates and rodents (Table 1).
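
The per-species average can be sketched as the mean p-distance over all unordered pairs of sequences. The sequences below are invented, are assumed to be already aligned and free of gaps, and stand in for the framework-region sequences of a single species; the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def p_distance(a, b):
    """Proportion of differing sites between two aligned, gap-free sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_p_distance(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    return mean(p_distance(a, b) for a, b in combinations(sequences, 2))


# Invented toy sequences for one hypothetical species.
toy_species = ["EVQL", "EVQV", "EVKV"]
average = mean_pairwise_p_distance(toy_species)
print(round(average, 3))
```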

Phylogenetic relationships of IGHV genes in vertebrates

In an earlier study based on a limited number of species, Ota and Nei (1994) classified various IGHV genes from vertebrates into five different phylogenetic classes. However, this classification no longer holds when a large number of species are included. On the 50 to 70% condensed phylogenetic trees (Nei and Kumar 2000) based on both fish and tetrapod sequences, no reproducible phylogenetic classification could be obtained when different sets of sequences were used (data not shown). When the trees were made separately for fish or tetrapod sequences, however, a reproducible classification was observed in tetrapods (Fig. 2), but not in fishes (data not shown). The absence of a clear-cut classification of fish sequences could be due to a high degree of intraspecific sequence divergence (Table 1). In fact, it has been shown that there are at least 11 IGHV families in the rainbow trout (Roman et al. 1996) and multiple IGHV families in the channel catfish (Ghaffari and Lobb 1999). By contrast, the phylogenetic classification of tetrapod IGHV sequences into three clans (I, II, and III) is clearly supported by high (>75%) bootstrap values. These three clans are equivalent to the clans I, II, and III reported previously for the phylogenetic classification of mammalian IGHV sequences (Kirkham et al. 1992; Schroeder et al. 1990). The presence of all three clans in the frog Xenopus (Fig. 2, Table 2) indicates that their divergence occurred before the radiation of tetrapods. The absence of clan I and clan II genes in the chicken suggests that they were lost in this species. Whether this situation is representative of all bird species remains to be seen. Similarly, the absence of clan I genes in several additional species (cow, sheep, opossum, and platypus), of clan II genes in opossum, and of clan III genes in cow and sheep could represent random losses of IGHV genes.
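
A condensed tree is obtained by collapsing every interior branch whose bootstrap support falls below a chosen cutoff, so that weakly supported groupings become multifurcations. The Python sketch below illustrates the idea on a toy tree; the nested-tuple representation, the function name, and the support values are assumptions made for the example and do not reflect how MEGA stores or processes trees.

```python
def condense(node, cutoff):
    """Collapse interior branches with bootstrap support below `cutoff`.

    A leaf is a string. An interior node is a tuple (support, children),
    where support is a percentage, or None for the root.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = condense(child, cutoff)
        is_weak = (
            not isinstance(child, str)
            and child[0] is not None
            and child[0] < cutoff
        )
        if is_weak:
            # Dissolve the weak branch: attach its children to the current node.
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)


# Toy tree with made-up bootstrap values.
tree = (None, [
    (90, ["a", "b"]),
    (40, ["c", (80, ["d", "e"])]),
])
condensed_50 = condense(tree, 50)
print(condensed_50)
```

With a 50% cutoff, the branch supported at 40% disappears and its descendants join the root, while the groupings supported at 80% and 90% are retained.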

Fig. 2 NJ phylogenetic tree condensed at the 50% bootstrap value level for all functional IGHV genes of eight tetrapod species. Two shark IGHV sequences were used as the outgroup

Table 2 Number of functional IGHV genes in each tetrapod IGHV clan

Distribution of the functional IGHV sequences at the heavy chain locus of mammals

IGHV genes of a particular species often cluster together in the phylogenetic trees. For example, the mouse clan I genes form a large cluster in the tree in Fig. 2. To understand how such clustering has evolved, we analyzed the chromosomal distribution of functional IGHV genes representing the three phylogenetic clans in the heavy chain locus of human, mouse, rat, and dog, whose genome sequences are better assembled than those of the other species. The chromosomal distribution of the functional IGHV genes differs among species (Fig. 3). In mouse and rat, most of the functional IGHV genes proximal to the IGHD gene are members of clans III and II. No functional clan I members are found in the first 20 IGHD-proximal IGHV genes in mice. Similarly, the rat has no clan I gene in the first 68 IGHD-proximal functional IGHV genes. In the human, however, functional IGHV genes of all three clans are intermingled (Lefranc 2001; Lefranc and Lefranc 2001; Matsuda et al. 1998). Most IGHD-proximal IGHV genes in the dog belong to the functional clan II genes, and a single functional clan I IGHV gene is located in the middle of the clan III genes. In the mouse, the 48 functional clan I genes most distant from the IGHD gene (Fig. 3) all belong to a specific cluster of the tetrapod IGHV tree (see Fig. 2). Therefore, these genes may have originated by tandem duplication after separation of the mouse lineage. These observations are consistent with the idea that the IGHV genes evolve by the birth-and-death process rather than by concerted evolution.

Fig. 3 Distribution of the functional IGHV genes in the heavy chain locus of humans, mice, rats, and dogs. The red, blue, and yellow colors represent clan I, clan II, and clan III genes, respectively. The rectangular box indicates the IGHD gene

Locations of orthologous sequences in the IGHV locus of humans and chimpanzees

IGHV genes are short and evolve relatively fast so that it is difficult to identify the orthologous genes between mammalian species belonging to different orders. However, this can be done relatively easily between the human and chimpanzee, which diverged about 6 million years ago. We therefore examined the orthologous relationships of human and chimpanzee IGHV genes. These relationships were determined primarily by phylogenetic analysis (except truncated IGHV pseudogenes). However, this analysis occasionally gave ambiguous results because of the relatively low bootstrap values. We therefore used another method of identification of orthologous and paralogous genes using information about the flanking gene or repeat sequences. In this approach, we first identified the SINE or LINE or other repeat elements flanking the 5′ and 3′ sides of each IGHV gene in the human and chimpanzee genome sequences and used this information for identifying the homologous genes between the two species (the names of the repeat elements used for this purpose are given in Supplementary Table 2). We could identify the orthologous and paralogous relationships of about 80% of IGHV genes and their chromosomal locations by this method. In the case of truncated IGHV pseudogenes, we used the latter method exclusively. One limitation of this analysis was the incompleteness of the chimpanzee genome sequence, and because of this limitation, certain conclusions remain tentative.
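
The flanking-repeat approach can be sketched as grouping the genes of both species by the pair of repeat elements found on their 5′ and 3′ sides. The Python example below is a deliberately simplified illustration: the gene names and repeat labels are hypothetical, an exact match of both flanks is required (an assumption, since the actual comparison may have tolerated partial matches), and the category labels are ours.

```python
from collections import defaultdict


def group_by_flanks(human, chimp):
    """Group genes from both species by their (5' repeat, 3' repeat) signature.

    `human` and `chimp` map a gene name to a tuple of two repeat labels.
    """
    groups = defaultdict(lambda: {"human": [], "chimp": []})
    for species, genes in (("human", human), ("chimp", chimp)):
        for gene, flanks in genes.items():
            groups[flanks][species].append(gene)
    return dict(groups)


def classify(group):
    """Label one signature group by the number of genes found in each species."""
    n_human, n_chimp = len(group["human"]), len(group["chimp"])
    if n_human == 1 and n_chimp == 1:
        return "one-to-one ortholog candidate"
    if n_human == 0 or n_chimp == 0:
        return "no counterpart found"
    return "duplication candidate"


# Hypothetical gene names and repeat labels, for illustration only.
human_genes = {
    "hs_V1": ("SINE_a", "LINE_b"),
    "hs_V2": ("SINE_c", "LINE_d"),
    "hs_V3": ("SINE_c", "LINE_d"),
    "hs_V4": ("LINE_e", "SINE_f"),
}
chimp_genes = {
    "pt_V1": ("SINE_a", "LINE_b"),
    "pt_V2": ("SINE_c", "LINE_d"),
}
groups = group_by_flanks(human_genes, chimp_genes)
for flanks, members in groups.items():
    print(flanks, members, classify(members))
```

In this toy input, two human genes sharing a signature with one chimpanzee gene are flagged as a possible duplication, and a signature seen in only one species is reported as lacking a counterpart, which could reflect either gene loss or a gap in the assembly.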

The results of the analysis indicated that the IGHV genes and their chromosomal locations are generally conserved between human and chimpanzee (Fig. 4). There are, however, some scattered events of gene duplication and deletion that have occurred after the divergence of the two species. Small-scale sequence inversions and transpositions also appear to have occurred. One block duplication involving two functional and three nonfunctional IGHV genes that occurred in the human lineage is also identifiable (Fig. 5). These results indicate that the IGHV locus has undergone a continuous change in gene copy number.

Fig. 4 Location of human and chimpanzee IGHV genes and their orthologous and paralogous relationships. Long and short vertical rods represent functional IGHV genes and IGHV pseudogenes, respectively. Broken and solid lines show orthologous and paralogous relationships between IGHV sequences, respectively. The filled and open rectangles indicate the IGHV genes whose orthologs were presumably lost in human and chimpanzee lineages, respectively. ‘C’, ‘N’, and ‘UN’ stand for “chromosome”, “non-assembled” and “undetermined” regions, respectively. The gene numbering starts with the first IGHD-proximal IGHV gene. The red, blue, and yellow colors represent clan I, clan II, and clan III genes, respectively. Due to the incompleteness of the chimpanzee genomic sequences (indicated by the gaps in the lines), orthologous relationships and exact locations could not be found for some of the IGHV genes. Using the parsimony principle, 16 IGHV sequences have been placed between the 5th and 22nd IGHV genes

Fig. 5 An example of a block gene duplication event in the human IGHV locus (see Fig. 4). The IGHV genes are color-coded as in Fig. 4

Chromosomal location of the IGHV multigene family

We determined the chromosomal location of the IGHV multigene family in the species in which genome assembly has been completed or nearly completed. In eutherian mammals, the IGHV multigene family is always found in the subtelomeric region of a specific chromosome (Fig. 6). By contrast, the gene family in zebrafish (Wallace and Wallace 2003), medaka (Ueda and Naoi 1999), and opossum (Rens et al. 2001) is located near the centromere. To examine whether other genes present in subtelomeric regions of different chromosomes are similarly conserved across eutherian species, we carried out synteny analysis between human, mouse, and dog genomes using the Ensembl genome browser. More than 50% of the subtelomeric regions of one species were found at the interstitial sites of chromosomes in other species (not shown). Earlier studies also indicated that several subtelomeric segments of the human chromosomes have homologous counterparts at interstitial sites of mouse and rat chromosomes (Gibbs et al. 2004). Hence, the subtelomeric conservation of IGHV genes does not appear to be a property that applies to other genes located in subtelomeric regions. As the subtelomeric localization of IGHV genes is not observed in the opossum, it appears that the subtelomeric conservation of the IGHV gene family in eutherian mammals occurred through chromosomal rearrangement after separation of placental and non-placental mammals.

Fig. 6 Chromosomal location of the IGHV gene family deduced from the completely annotated or nearly completed genomes. The IGHV locus is indicated by an arrow. Black band indicates the centromeric region. The genomic location of IGHV gene family in macaque and cow was determined by synteny analysis, as the sequences are not yet completely assembled

Discussion

Our analysis of IGHV genes from 16 vertebrate genomes presents a more complete picture of the evolutionary dynamics of the multigene family than earlier studies suggested. We found that the number of IGHV genes varies considerably among different species of mammals, but the number of IGHV genes in teleosts is more or less uniform, although the three teleost species studied diverged from their common ancestor about 140 Myr ago, long before the major mammalian radiation (80–100 Myr ago; Hedges and Kumar 2002; Yamanoue et al. 2006). By contrast, the overall intraspecies IGHV sequence variation is higher in teleosts than in tetrapods. There are several mechanisms to produce antibody diversity in jawed vertebrates (Klein and Hořejší 1997). After the activation of B cells, somatic hypermutation (SHM) introduces additional diversity and can improve the antigen-binding affinity of the expressed antibodies (Cannon et al. 2004). The enzyme activation-induced cytidine deaminase (AID) is known to be required for inducing SHM (Cannon et al. 2004; Yang et al. 2006). The role of AID in bony fishes is still unclear. Although a recent study indicates the presence of SHM in fishes, the spectrum of mutational targets is restricted in comparison to mammals (Yang et al. 2006). Hence, it is possible that in fishes, a high degree of intraspecies variation of IGHV sequences compensates for the apparently poor somatic hypermutation in the generation of antibody diversity. In mammals, there is also a marked heterogeneity in the intraspecies sequence divergence. For example, artiodactyls show low levels of sequence variation of IGHV genes, whereas primates and rodents exhibit a high degree of variation. The mechanism of antibody diversification in artiodactyls differs from that in primates and rodents in that somatic gene conversion and hypermutation play a more important role in artiodactyls (Reynaud et al. 1991; Sun and Butler 1996). The differences in intraspecies IGHV sequence variation between species might therefore be associated with different mechanisms for the development of antibody repertoires, although several other factors might be involved in a synergistic way.

There is a significant positive correlation between the number of functional and nonfunctional IGHV genes. This observation suggests that the more gene duplications occur, the more IGHV pseudogenes are generated in the IGHV multigene family. Previously, Nei and coworkers (Nei et al. 1997; Nei and Hughes 1992; Nei and Rooney 2005; Nozawa and Nei 2007) showed that in many multigene families, gene duplication often occurs, but because of deleterious mutations, many duplicate genes become nonfunctional and either stay in the genome as pseudogenes or are gradually eliminated from the genome by unequal crossing over. Although the number of deleted IGHV pseudogenes cannot be assessed easily, our results suggest that throughout IGHV evolution, the numbers of functional and nonfunctional genes are maintained by birth-and-death evolution.

The IGHV genes of diverse tetrapod species have been found to fall into three major clans (clans I, II, and III). Hence, these clans must have persisted for about 370 Myr in the tetrapod genomes. Clan III IGHV genes have a broader taxonomic distribution than clan I and clan II genes. In cow and sheep (artiodactyls), only clan II sequences are found. By contrast, swine has only clan III IGHV genes (Aitken et al. 1997), although it belongs to the same mammalian order, Artiodactyla. Apparently, the ancestors of the extant artiodactyl species had IGHV genes belonging to clans II and III; swine then lost the clan II genes, whereas cow and sheep lost the clan III genes after their divergence. In chicken, the single functional IGHV gene and all IGHV pseudogenes are closely related and belong to clan III (Ota and Nei 1995). A restricted IGHV repertoire is also observed in several non-placental and placental mammals. Therefore, it seems that the loss of all or part of specific IGHV clan(s) is a relatively common phenomenon during IGHV gene evolution. The results of our evolutionary study of IGHV genes from different vertebrate species suggest that the great diversity of IGHV locus organization has been generated by large-scale birth-and-death evolution or genomic drift (Nei 2007).