Introduction

Gap junctions, found in the plasma membranes of vertebrate animal cells, consist of clusters of closely packed transmembrane channels, the connexons, in which the principal proteins are referred to as connexins (Beyer et al., 1987; Loewenstein, 1987; Kumar & Gilula, 1996; Harris, 2001; Shibata et al., 2001; Evans & Martin, 2002a; Hand et al., 2002). Topologically related putative gap junctional proteins found in both invertebrates and vertebrates exhibiting little or no significant sequence similarity to the connexins are called innexins (White & Paul, 1999; Phelan & Starich, 2001; Potenza et al., 2002). Connexins and innexins comprise two distinct protein families whose structures and functions have been suggested to be overlapping (Curtin et al., 1999; Ganfornina et al., 1999; Landesman et al., 1999; White & Paul, 1999; Stebbings et al., 2000).

Gap junctional complexes provide direct electrical coupling and metabolic communication by allowing the free flow of ions and other small molecules between neighboring cells (Bevans et al., 1998; Kim et al., 1999; Landesman et al., 1999). They play important roles in a variety of pathological conditions such as congenital deafness (Kitamura et al., 2000; D'Andrea et al., 2002), convulsive seizures (Jahromi et al., 2002), congenital cataracts (Mackay et al., 1999), erythrokeratodermia variabilis (Richard et al., 1998), and Charcot-Marie-Tooth disease (Omori et al., 1996). Their dynamic assembly (Lopez et al., 2001; Evans & Martin, 2002b) and regulation by ATP and protein kinases (Ghosh et al., 2002) and by Ca2+ and calmodulin (Sotkis et al., 2001) are complex. Vertebrate connexons consist of homo- and hetero-hexameric arrays of connexins, and the connexon in one plasma membrane docks end to end with a connexon in the membrane of a closely apposed cell (Yeager et al., 1998; Unger et al., 1999; Delmar, 2002). Although invertebrate innexins have been much less studied, both Drosophila and C. elegans innexins have multiple paralogues, some of which have been studied with respect to their capacity to form intercellular channels (Starich et al., 2002; Stebbings et al., 2002). Recently, innexins have been proposed to have orthologues in vertebrates based on sequence similarity (Panchin et al., 2000), although this has not been confirmed by functional studies.

Tight junctions, also found in the plasma membranes of animal cells, form charge-selective paracellular diffusion barriers that regulate the diffusion of small molecules across epithelial and endothelial cell sheets and serve as major cell adhesion molecules (Balda et al., 2000; Tsukita & Furuse, 2000; Blaschuk et al., 2002; Colegio et al., 2002; D'Atri & Citi, 2002). They also prevent the intermixing of apical and basolateral proteins, especially in the extracytoplasmic leaflet of these membranes (Tsukita & Furuse, 2002). Protein constituents of the tight junction include the claudins and the occludins (Tsukita & Furuse, 2000; Heiskala et al., 2001; Kollmar et al., 2001; D'Atri & Citi, 2002; Langbein et al., 2002). These oligomeric transmembrane proteins are regulated by phosphorylation (Cordenonsi et al., 1999). Like connexins, but unlike innexins in this regard, occludins are found in vertebrate animals. Claudins may be found in both vertebrates and invertebrates (Ando-Akatsuka et al., 1996; see below). Evidence suggests that claudins and occludins cooperate in the regulation of paracellular permeability (Balda et al., 2000; Morcos et al., 2001). As is well established for the connexins, claudins are differentially synthesized in various tissue and cell types (Kiuchi-Saishin et al., 2002). Interestingly, some of the claudins have been shown to secondarily serve as receptors for Clostridium perfringens enterotoxin (McClane, 2000; Long et al., 2001). Occludin isoforms of altered structure are synthesized in variable amounts, depending on conditions, and these isoforms may contribute to the regulation of occludin function (Ghassemifar et al., 2002).

Connexins, innexins, claudins and occludins share certain structural features but also exhibit distinctive characteristics. All four of these protein types exhibit four putative transmembrane α-helical spanners (TMS). They vary in size between about 20 kDa and 60 kDa with overlapping size variation within each of these four protein families (see below). Three-dimensional structural data are available for connexon membrane channels (Unger et al., 1999). Electron density analyses of the dodecameric channels, formed by end-to-end docking of two hexamers with a total of 48 TMSs, are consistent with an α-helical configuration for all four TMSs of each connexin subunit (Unger et al., 1999). The extracellular vestibule forms a tight seal to prevent the exchange of substances with the extracellular milieu.

We have identified all currently available homologues of the connexins, innexins, occludins and claudins in the publicly available databases using BLAST search tools. These searches were initially conducted in January 2002, and the tabulations have since been updated. However, the analyses reported here used only the family members that were available at the time each analysis was performed. The sequences of the proteins in these four families were multiply aligned, and the alignments were used to generate average hydropathy, amphipathicity and similarity plots. Phylogenetic trees were constructed, allowing definition of the sequence relatedness of proteins within each of these four families. The reported results not only define the current members of these four families of (putative) junctional proteins, they also allow predictions regarding the evolutionary origins of some of them. Thus, we can predict (1) which proteins are orthologues (having arisen in different species exclusively by speciation), (2) which proteins are recent versus early diverging paralogues (homologues that arose by gene duplication in a single organism), and (3) what the relative rates of sequence divergence were for different orthologous sets. We suggest that although these protein families do not exhibit significant sequence or motif similarity, the evolutionary precursor of the connexins and the innexins might have been the same. The same is possible for the claudins and occludins. We consider the possibility that at least some of these junctional proteins arose by an internal gene duplication event in which one or more 2-TMS-encoding genetic element(s) gave rise to the present-day 4-TMS-encoding gene. This hypothesis presupposes that this duplication event occurred more than once during the evolution of these protein families. Internal duplication may be a general evolutionary strategy that has been used to generate new families of channels and junctions with unique functions (Saier, 2000, 2001).

Results

CONNEXINS

Table 1 presents the sequenced connexin homologues we have identified from publicly available databases. All contain four transmembrane regions and are derived exclusively from vertebrates including mammals, birds, fish and amphibians.

Table 1 Sequenced proteins of the connexin family

Several organisms exhibit multiple paralogues. For example, six chicken paralogues, 12 rat paralogues, 14 mouse paralogues and 21 human paralogues are listed in Table 1. Because these proteins often do not exhibit sequence relationships suggestive of orthology with proteins from other organisms (see below), mammals, and possibly birds, may have as many as 22–24 connexin paralogues. However, one or more of these may be pseudogenes. Recently, the human genome was reported to contain 20 connexin paralogues as determined from genomic databases from Celera and NIH (Eiberger et al., 2001; Willecke et al., 2002). These are the same as the 20 sequence-divergent full-length human paralogues we report here.

Connexins tabulated in Table 1 are reported to be maximally 542 and minimally 223 amino-acyl residues (aas) in length. Because several of the largest and smallest proteins are found with comparable sizes, connexins probably exhibit just slightly greater than a 2× size variation.

The proteins listed in Table 1 were aligned using the CLUSTAL X program (Thompson et al., 1997). The complete multiple alignment (available on our ALIGN website; http://www-biology.ucsd.edu/~msaier/transport/) revealed that most of the size variation observed for these proteins occurred in their C-terminal regions and the single cytoplasmic loop between the second and third TMSs. The 4-TMS topology, originally deduced using site-directed antibody localization approaches (Milks et al., 1988; Yeager et al., 1998), and confirmed and extended by electron density analyses (Unger et al., 1999), is now well established. Both of the variable regions cited above are located intracellularly. Thus, residue positions 1–110 are well conserved; positions 121–200 are poorly conserved; positions 201–300 are well conserved, and the remaining residue positions of the alignment are poorly conserved. In the first well-conserved region (alignment positions 56–80), the following consensus motif was identified:

[X = any residue; alternative residues at a single alignment position are indicated in parentheses; *: a fully conserved position]

All of these residues are in the extracellular loop between TMSs 1 and 2.

In a second well-conserved region, a less well-conserved cysteine-rich motif was identified. This motif occurs at alignment positions 246–269 as follows:

[–: a one-residue gap in the alignment of most proteins.]

Three orthologous connexins, connexin β3 of the mouse, rat and human, display an additional residue at alignment position 247 corresponding to the gap (–). The best signature sequence for the connexin family (alignment positions 56–80) corresponding to the first conserved motif (see above) is:

The connexin phylogenetic tree, based on the complete multiple alignment presented on our website (Fig. S1), is shown in Fig. 1, and the corresponding tree for the human proteins, based on the alignment shown in Fig. S2, is shown on our website in Fig. S3. The proteins fall into 12 clusters that branch from points near the center of the unrooted tree as indicated by the roman numerals (I–XII). Human proteins are found in all 12 of these clusters, and four of the clusters include only mammalian proteins. Sequences from birds (the chicken) appear in six clusters; those from fish are found in five clusters, and those from amphibians are found in two clusters. The absence in these organisms of several of the connexin paralogues found in mammals may reflect a deficiency of sequence data. The configuration of the tree leads to the suggestion that most (but not all) of the sequence divergence observed for the connexins arose due to fairly early gene duplication events prior to divergence of most of the vertebrate species represented.

Figure 1

Phylogenetic tree for the complete connexin family. Protein abbreviations are as indicated in Table 1. The Clustal X program (Thompson et al., 1997) was used to derive this tree and all other trees presented here and on our website. See text for explanation of the clustering patterns. The multiple alignment for all connexins is shown on our website (http://www-biology.ucsd.edu/~msaier/transport/; Fig. S1). That for the 22 human proteins is shown in website Fig. S2, and the tree for the human proteins is shown in Fig. S3.

The six clusters that include both mammalian and avian proteins reveal that in each cluster, the avian protein is more distant from the mammalian proteins than the latter are from each other. In all six cases it can be concluded that the chicken protein is orthologous to the mammalian proteins. Similarly, in the clusters including both mammalian and fish or amphibian proteins, the fish or amphibian proteins are always more distant from the mammalian and avian proteins than the latter are from each other. These observations provide evidence regarding potential orthologous relationships. They reveal that while the major clusters arose by fairly early gene duplication events, several late gene duplication events gave rise to similar sequence paralogues that cluster together. Thus, sets of orthologues as well as non-orthologous proteins can be visualized.

In almost all cases, a single human connexin is present in each set of mammalian orthologues. Cluster I includes three sets of probable orthologues (β1, β2 and β6), and of these, an avian protein is associated with one of them, while both fish and amphibian proteins are associated with another. Cluster II includes four sets of mammalian orthologues (β3, β4, β5 and HS-25). Clusters III and IV include exclusively mammalian proteins, and the two deep-rooted branches each bears only a single human protein. Cluster V includes one human protein (α1) and potential orthologues from other mammals, the chicken, the frog and fish, but surprisingly, one distant rat homologue (RN-33) that has no recognized human counterpart is found in this cluster. Cluster VI consists of one mammalian cluster (α4) with two human homologues (α4 and HS-37) and two associated distant frog proteins (XL-α4 and XL-α2). Based on the phylogenetic tree, at least one of these frog proteins (α2) is not likely to have a mammalian counterpart, possibly due to a unique function in Xenopus oocytes. Cluster VII consists of a single mammalian/avian cluster (α3) with two loosely associated fish proteins, both from the Atlantic croaker. As for the two frog proteins in cluster VI, at least one of these fish proteins probably lacks a mammalian counterpart. Clusters VIII (α5) and IX (α8) both include mammalian and avian proteins, but cluster X consists of a single mammalian/avian cluster (α7) with two distantly related human paralogues and two loosely associated fish proteins. Cluster XI consists of a single mammalian cluster (α9) with three related fish homologues, two of which are from the white perch. Finally, cluster XII consists of two distantly related human proteins with orthologues from the mouse that were revealed after this work was completed (see Footnote 1 to Table 1).

Further analysis of the tree shown in Fig. 1 revealed that some of the clusters of mammalian/avian orthologues have undergone very little sequence divergence, while others have undergone much more. For example, the α1 orthologues in cluster V exhibit minimal sequence divergence, while the α3 orthologues in cluster VII exhibit maximal divergence. The proteins in other probable orthologous clusters have diverged at intermediate rates. The results clearly suggest that all of the chicken homologues are orthologous to proteins in mammals, but that some of the fish and frog proteins lack mammalian orthologues. The human paralogues exhibit the phylogenetic relationships shown in Fig. S3 (see our ALIGN website). All relationships are in accord with those presented in Fig. 1.

Average hydropathy, average similarity and average amphipathicity (angle set at 100° for an α-helix) plots were derived using a sliding window of 21 residues (Kyte & Doolittle, 1982; Le et al., 1999; Zhai & Saier, 2001). The first two plots are presented in Fig. 2 A and B, respectively. Four clear peaks of hydrophobicity are apparent, the first pair separated from the second pair by a poorly conserved hydrophilic region of variable length (residue positions 100–190). A second variable hydrophilic region follows the fourth putative TMS (residue positions 300–550). As seen in the average similarity plot (Fig. 2 B), not only the four TMSs, but also the extracellular loops connecting TMSs 1 and 2, and TMSs 3 and 4 are well conserved. All cytoplasmically localized hydrophilic regions are poorly conserved. Interestingly, TMSs 1 and 2 and the intervening extracytoplasmic loop are much better conserved than TMSs 3 and 4 and the intervening loop. This fact suggests that while TMSs 1 and 2 serve an important and universal functional role, TMSs 3 and 4 are either less important or provide functions that differ for different protein members of the family, such as forming the lining of the channel pore. The average amphipathicity plot was uninformative and is therefore not presented.
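The sliding-window averaging behind an average hydropathy plot can be illustrated with a minimal Python sketch. This is not the AveHAS program; it is a simplified stand-in that assumes an alignment supplied as equal-length strings, averages the Kyte and Doolittle (1982) values in each column while ignoring gaps, and then smooths the column means with a 21-residue window. The function names (column_hydropathy, sliding_average, average_hydropathy) are hypothetical and chosen only for illustration.

```python
# Kyte-Doolittle (1982) hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_hydropathy(alignment):
    """Mean hydropathy per alignment column, ignoring gaps and unknown symbols.

    Assumes every aligned sequence has the same length (hypothetical input format).
    """
    n_cols = len(alignment[0])
    means = []
    for col in range(n_cols):
        values = [KD[seq[col]] for seq in alignment if seq[col] in KD]
        means.append(sum(values) / len(values) if values else 0.0)
    return means


def sliding_average(values, window=21):
    """Average every full window; entry i is centered on position i + window // 2."""
    if len(values) < window:
        return []
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


def average_hydropathy(alignment, window=21):
    """Smoothed average hydropathy profile for a gapped multiple alignment."""
    return sliding_average(column_hydropathy(alignment), window)


# Toy input: a hydrophobic stretch followed by a hydrophilic stretch.
profile = average_hydropathy(["I" * 21 + "D" * 21])
print(len(profile), profile[0], profile[-1])
```

In such a profile, a putative TMS appears as a peak roughly one window wide, which is how the four hydrophobic peaks in a plot of this kind would be read.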

Figure 2

Average hydropathy (A) and similarity (B) plots for the connexins. Proteins used for this study are the 19 sequence-divergent proteins included in the two partial multiple alignments shown in Fig. 3. The AveHAS program (Zhai & Saier, 2001) was used for both plots with a sliding window of 21 residues. Hydropathy values were those used by Kyte and Doolittle (1982).

For further similarity analyses, 19 sequence-divergent proteins from all of the 12 clusters shown in Fig. 1 were selected for construction of a multiple alignment using the TREE program (Feng & Doolittle, 1990). As seen in Fig. 3 A and B, the first two TMSs are separated from each other by exactly the same number of residues as are the second two TMSs, showing that the two extracellular loops in these connexins are of the same length. The only exceptions are three of the aligned proteins, which have a single amino-acid insertion in this region (see legend to Fig. 3). Additionally, two of the three fully conserved cysteyl residues in the inter-TMS loops are conserved in position in the two alignments. Although there is little further residue conservation between these two protein segments, we suggest that the positional similarities of the TMSs and cysteyl residues argue that the connexins arose by an internal gene duplication event. The primordial protein presumably was half sized and exhibited just 2 TMSs. The proposed intragenic duplication event doubled both the size of the protein and its number of TMSs.
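The positional comparison described above (equal loop lengths and cysteyl residues at matching offsets in the two halves) can be sketched in Python as follows. The sequence and TMS coordinates below are invented toy data, not taken from any connexin, and the helper names loop_length and cysteine_offsets are hypothetical; TMS boundaries are assumed to be given as 1-based inclusive (start, end) pairs.

```python
def loop_length(tms_a, tms_b):
    """Number of residues strictly between two TMSs (1-based inclusive spans)."""
    return tms_b[0] - tms_a[1] - 1


def cysteine_offsets(sequence, tms_a, tms_b):
    """Offsets of cysteines within the loop, counting the first loop residue as 1."""
    loop = sequence[tms_a[1]:tms_b[0] - 1]  # 0-based slice covering the loop only
    return [i + 1 for i, residue in enumerate(loop) if residue == "C"]


# Invented toy protein: TMS1, loop, TMS2, cytoplasmic linker, TMS3, loop, TMS4.
toy = "LLLL" + "ACGGC" + "LLLL" + "KKK" + "LLLL" + "SCGGC" + "LLLL"
tms1, tms2, tms3, tms4 = (1, 4), (10, 13), (17, 20), (26, 29)

first_loop = loop_length(tms1, tms2)
second_loop = loop_length(tms3, tms4)
shared = sorted(
    set(cysteine_offsets(toy, tms1, tms2)) & set(cysteine_offsets(toy, tms3, tms4))
)
print(first_loop, second_loop, shared)
```

Equal loop lengths and shared cysteine offsets in the two halves would be the kind of positional agreement discussed in the text; by itself such agreement is suggestive rather than proof of an internal duplication.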

Figure 3

Alignments of the two well-conserved regions of 19 sequence-divergent connexins. Residues comprising the two putative TMSs in each alignment are presented in bold print, as are the three fully conserved cysteyl residues in each of the two inter-TMS loop regions. Fully conserved residues are indicated by a line adjacent to the lower right of the one-letter abbreviation of the amino acid. To be noted are the facts that the TMSs and two of the three fully conserved cysteyl residues align in the top and the bottom figures. The asterisk between the fully conserved Y and the largely conserved G in the lower alignment is the site of single amino-acyl residue insertions in three of these proteins.

INNEXINS

Table 2 presents the innexin homologues retrieved from the databases as of January 2002. Forty-two sequences were identified. Of these, twenty-six are from Caenorhabditis elegans (Starich et al., 2001) and nine are from Drosophila melanogaster (Stebbings et al., 2002). Both the C. elegans and D. melanogaster genomes had been fully sequenced when these studies were conducted, so these numbers presumably correspond to the total numbers encoded. It is surprising that the worm encodes nearly three times as many innexin paralogues as does the fly. In addition to the worm and fly, only a few other organisms, Schistocerca americana (grasshopper) and three closely related vertebrates, are represented (Panchin et al., 2000). The vertebrate proteins have been suggested to be innexins based on sequence similarity with the invertebrate innexins, but it is not known whether they are able to form functional gap junction channels. After the completion of the work reported here, an innexin gene was cloned from the annelid polychaete worm Chaetopterus variopedatus (Potenza et al., 2002).

Table 2 Sequenced proteins of the innexin family

As can be seen from the data summarized in Table 2, innexins fall roughly into the same size range as do the connexins (317–554 amino-acyl residues). However, excluding the single C. elegans unc9 homologue, the smallest protein is of 359 residues. Assuming that unc9 is an incomplete sequence, the size range of the innexins (359–554 residues) is narrower than that for the connexins (223–543 residues).

The complete multiple alignment of the innexin family proved to be much more divergent than that of the connexins in spite of their narrower size range. Only seven fully conserved residues were identified (G189, C194, C214, P325, W329, F501 and K542; numbers refer to the alignment positions; see Fig. S4 in our ALIGN website). These were scattered throughout the alignment, as indicated. Only two of these seven residues proved to be cysteines. The alignment also revealed an increased proportion of gaps between putative transmembrane segments compared with the connexins (see below). As invertebrates evolved over a much greater time period than did the vertebrates, and the innexin family includes both invertebrate and vertebrate proteins, the degree of divergence is in accordance with expectation. The gaps and sequence divergence observed for the innexin alignment precluded derivation of a reliable signature sequence characteristic of this family.
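As an illustration of how fully conserved alignment positions of this kind can be tabulated, the following minimal Python sketch scans a gapped alignment column by column. It assumes the alignment is supplied as equal-length strings with "-" as the gap character; the function name fully_conserved_columns and the toy sequences are hypothetical and are not taken from the innexin alignment.

```python
def fully_conserved_columns(alignment, gap="-"):
    """Return (1-based position, residue) for columns holding one residue and no gaps."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for pos in range(length):
        column = {seq[pos] for seq in alignment}
        # A column is fully conserved only if every sequence shows the same residue.
        if len(column) == 1 and gap not in column:
            conserved.append((pos + 1, alignment[0][pos]))
    return conserved


# Invented toy alignment; positions 1, 3 and 5 are fully conserved.
toy_alignment = ["GAC-W", "GTCAW", "GSC-W"]
print(fully_conserved_columns(toy_alignment))
```

Reporting positions in alignment coordinates, as done here, matches the convention of labels such as G189 or C194, where the number refers to the alignment column rather than to a residue number in any single protein.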

The innexin family tree, shown in Fig. 4, differs greatly from the connexin tree shown in Fig. 1. All of the Drosophila and grasshopper proteins cluster separately from the C. elegans proteins, and the three mammalian proteins comprise a tight cluster that branches from a point between the worm and insect proteins. Moreover, there are far greater numbers of branches stemming from points near the center of the tree and far fewer large clusters than observed for the connexin tree. This latter fact reflects (1) the lack of more than a few sequence-similar paralogues in both C. elegans and D. melanogaster, and (2) the lack of close orthologues to any but a few of the innexins. The former fact contrasts with the situation for connexins in mammals, where relatively close paralogues have evolved as a result of more recent gene duplication events. The lack of close orthologues may reflect a deficiency of invertebrate sequence data. Thus, very scant sequence data are available for invertebrate organisms other than C. elegans and D. melanogaster. The absence of close paralogues within each of these two organisms represents a fundamental difference between vertebrate connexins and invertebrate innexins.

Figure 4

Phylogenetic tree for the innexin protein family. Abbreviations of the proteins are as indicated in Table 2. Format of presentation and the program used were the same as described in the legend to Figure 1. The multiple alignment upon which the tree was based is shown on our website (Fig. S4).

Fig. 5 shows the average hydropathy (A) and average similarity (B) plots for the innexin family. Both plots show four clear peaks of hydropathy (1–4 in A) corresponding to the four putative TMSs. The inter-TMS loops between TMSs 1 and 2 and TMSs 3 and 4 are poorly conserved. This fact contrasts with the situation for the connexins where both loops were well conserved. Not all of the inter-TMS loop regions are poorly conserved, however. Comparison of Fig. 5 A with Fig. 5 B shows that relatively well-conserved regions occur to the left of TMSs 1 and 3 and to the right of TMSs 2 and 4. These facts also become apparent when the widths of the peaks in Fig. 5 B (average similarity) are compared with those in Fig. 5 A (average hydropathy). The latter are much sharper than the former. The plots shown in Fig. 5 also reveal that most of the size variation observed for the innexins occurs in the N-terminal region preceding TMS1, and to a lesser extent, in the C-terminal region following TMS4. Since none of these regions is well conserved, they presumably either do not serve an important functional role or their functions are not common to many innexins. This observation correlates with the great phylogenetic distance separating most of these proteins.

Figure 5

Average hydropathy (A) and similarity (B) plots for the innexins. The format of presentation and the programs used were the same as for Fig. 2. The innexin family multiple alignment, from which these plots were derived using the AveHAS program (Zhai & Saier, 2001), is shown in Fig. S4 (see our ALIGN website).

Partial multiple alignments of putative TMSs 1 and 2 as compared with TMSs 3 and 4 revealed that the TMSs align approximately with each other, although there are many inter-TMS gaps. In contrast to the alignment of the connexin sequences, the cysteyl residues in the two segments do not align. This is not surprising in view of the fact that so many gaps are present in the alignment. If the innexins arose by an internal gene duplication event, many insertions and deletions must have been introduced during the evolution of these proteins.

CLAUDINS

Table 3 tabulates the current members of the claudin family. Fifty-six sequences were identified, and of these, 17 are from humans, 22 are from the mouse, and 6 are from the rat. In addition to mammalian proteins, bird (chicken), fish (zebrafish), amphibian (frog) and chordate (ascidian) proteins are represented. These proteins are generally smaller than the connexins and innexins, the size range being 191–305 residues. Excluding the two largest and two smallest homologues, the size range is 207–264. Claudins have evidently undergone little size divergence during their evolution.

Table 3 Sequenced proteins of the claudin family

According to the database entries provided, one claudin homologue is a senescence-associated epithelial protein, while another is found in brain endothelial cells, and a third is associated with oligodendrocytes. Dysentery-inducing bacteria such as Shigella spp. can regulate tight junction function both by regulating claudin-1 association and by influencing occludin phosphorylation (Sakaguchi et al., 2002). Claudin 4 can secondarily serve as a receptor for the Clostridium perfringens enterotoxin (see Introduction). Examination of the claudin family multiple alignment revealed that only three residues, two cysteines at alignment positions 122 and 136 and a glycyl residue at position 272 were fully conserved.

Tepass et al. (2001) note that D. melanogaster encodes two possible claudin-like proteins (CG3770 and CG6982). Both of these invertebrate proteins are about 210 residues long and have four predicted transmembrane domains with a single large inter-TMS loop between putative TMSs 1 and 2. They show a low degree of sequence similarity with claudins and much more with mammalian lens fiber intrinsic membrane proteins and p53 apoptosis effectors. Sequences from C. elegans have also been suggested to be claudin-like. These include NP_509257, NP_508583, NP_509800 and NP_509847. Although some similarity is observed, the sequence similarity of these proteins with claudins is insufficient to establish homology, and no functional data suggest a role in tight junction formation. They were therefore not included in our study.

The claudin family tree, based on the multiple alignment shown in Fig. S5, is shown in Fig. 6. No two mammalian paralogues from the human, mouse or rat are closely related to each other, showing that the gene duplication events that gave rise to these paralogues occurred relatively early. This suggestion is substantiated by the observation that close mammalian orthologues occur frequently. Moreover, the two chicken proteins represented are probably orthologues of the mammalian CLD3 and CLD5 claudins. By contrast, none of the fish, frog or ascidian proteins cluster closely with any mammalian protein. Orthologous relationships of these proteins can therefore not be assigned.

Figure 6

Phylogenetic tree for the claudin protein family. Protein abbreviations are as indicated in Table 3. The claudin family multiple alignment is shown in Fig. S5 on our ALIGN website.

Average hydropathy and similarity plots for the claudin family are shown in Fig. 7 A and B, respectively. The four peaks of hydropathy are clearly displayed. In contrast to the connexins and innexins, the claudins show comparable degrees of similarity in the loop regions between TMSs 1 and 2, and between TMSs 2 and 3, with substantially less similarity in the loop between TMSs 3 and 4. The N- and C-termini are poorly conserved. These facts suggest that the first extracellular loop as well as the central cytoplasmic loop may be more important for functions conserved among the proteins than the terminal extracellular loop.

Figure 7

Average hydropathy (A) and similarity (B) plots for the claudins. The format of presentation and the programs used were the same as for Fig. 2.

OCCLUDINS

Only 7 tight-junctional occludins were identified following database searches (Table 4). These proteins are derived from mammals (4), the chicken (1), the kangaroo rat (1) and the frog (1). They are large proteins (489 to 522 residues) of fairly uniform size.

Table 4 Sequenced proteins of the occludin family

The occludin multiple alignment, including all seven sequenced members of the family, revealed considerable sequence conservation throughout the alignment (see Fig. S6 on our ALIGN website). The average hydropathy and average similarity plots for the occludins are shown in Figure 8. Like the connexins, the extracellular loops of the occludins are well conserved while the central cytoplasmic loop is not. Several extended well-conserved motifs including four fully conserved cysteyl residues (underlined) were present as follows:

Figure 8

Average hydropathy (A) and similarity (B) plots for the occludins. The format of presentation and the programs used were the same as for Fig. 2.

The tree for the occludins is shown in Fig. 9. All mammalian proteins cluster tightly together, and the shape of the mammalian cluster suggests that these proteins are orthologous in agreement with the fact that only one occludin is found per organism. The kangaroo rat protein clusters loosely with the chicken protein, far from the frog homologue. However, in contrast to the connexins, large segments of the N- and particularly the C-terminal hydrophilic domains are well conserved. This fact suggests an important unified function for these large domains.

Figure 9

Phylogenetic tree for the occludin protein family. Protein abbreviations are as indicated in Table 4. The occludin family multiple alignment is shown in Fig. S6 on our ALIGN website.

Perspectives and Conclusions

In this article we have analyzed the sequences of integral membrane 4-TMS proteins implicated in junction formation in animals. Four protein families were analyzed: the connexins, innexins, claudins and occludins. The uniform structural features of these proteins are illustrated in Fig. 10. The multiple sequence alignments for these four protein families revealed a higher degree of sequence similarity for the connexins than for the innexins, in agreement with the facts that invertebrates have evolved over a much greater period of time than have the vertebrates, and that innexin homologues, but not connexins, are shared by invertebrates and vertebrates. One might propose that the connexins arose from a primordial innexin precursor, but the similarities between the two halves of the connexins suggest that the gene duplication event that gave rise to these proteins occurred long after any duplication event or events that might have given rise to the innexins. There is no compelling evidence that any two of these four families of junctional proteins are related. However, extensive sequence divergence could have obscured such a relationship. Multiple duplication events have been documented during the evolution of other protein superfamilies (Nies et al., 1998; Pao et al., 1998; Tseng et al., 1999; Saier, 2000, 2001).

Figure 10

Schematic representation of the transmembrane topologies of all four types of junctional proteins examined in this report. N and C correspond to the N- and C-termini of the proteins; E1 and E2 are the two extracytoplasmic loops, while L is the single cytoplasmic loop.

Connexins exhibit uniform topological features as well as the presence of conserved cysteyl residues in the loops between TMSs 1 and 2, and TMSs 3 and 4. With the exception of its vertebrate members, the innexin family similarly exhibits well-conserved cysteyl residues. Other residues are fully or well conserved within each of these families, but not between the two families. Thus, when the complete multiple alignment of the innexins was derived, several residues proved to be largely conserved, and these residues occur exclusively in the extracytoplasmic loops and in the even-numbered TMSs. The conserved residues include four cysteyl residues, two between TMSs 1 and 2, and two between TMSs 3 and 4. The two cysteyl residues in each extracytoplasmic loop are separated by 16 or 17 residues. Fully conserved residues in the first halves of the innexins are G, C, C, Y, W, P, and W while in the second halves they are F, C, C, N, K, and W. These fully conserved residues are generally not conserved in nature or position between the two halves. Assuming that these fully conserved residues are of structural or functional significance, we conclude that the two halves of these proteins serve dissimilar functions. The same argument can be made for the connexins, where except for the cysteyl residues, the fully conserved residues in the first extracellular loop differ in both nature and position from those in the second extracellular loop.

Multiple paralogues were identified for the connexin, innexin and claudin families but not for the occludins. Thus, 22 paralogous connexin homologues are present in humans, 26 and 9 paralogues of innexins were found in C. elegans and D. melanogaster, respectively, and 22 paralogous mouse claudins were identified. Many of these paralogues are likely to serve cell type-specific or tissue-specific functions. However, the presence of over 200 cell types in a mammal clearly suggests that many cell types share the same junctional proteins.

Analyses of the data reported in this article led to the following evolutionary and functional suggestions: (1) In all four families, the most conserved regions of the proteins are the four TMSs. However, the loops between TMSs 1 and 2, and TMSs 3 and 4 are well conserved in the connexins and innexins (although less well conserved in the innexins). The loops between TMSs 1 and 2, and TMSs 3 and 4 are also well conserved in the claudins, and all loops plus flanking hydrophilic cytoplasmic domains are well conserved in the occludins. This last fact may reflect the small number of occludins and the total lack of paralogues. (2) The phylogenetic trees for these four families allowed us to propose the existence of sets of orthologous proteins in all families except the innexins, where phylogeny reflects the organismal source. Whether this is due to a lack of sequence information for other organisms or is a biological property of the innexin family remains to be determined. In this context, it is interesting to note that, unlike the situation for many vertebrate cells, gap junctional communication between cells from different insect orders could not be detected (Epstein & Gilula, 1977). (3) In the case of the connexins, evidence was presented to suggest that the two halves of the proteins derived from a common origin by internal gene duplication. Only the cysteyl residues that form disulfide bridges in the connexins and innexins on the external surfaces of the two adjacent cells are positionally well conserved both between the two halves of these proteins and between these two families (Kumar & Gilula, 1996; Yeager et al., 1998). This fact suggests an essential function, possibly as a receptor for specific protein-protein interactions, for the disulfide bridges that they form and leads to the very tenuous suggestion that connexins and innexins share a common origin.
(4) No evidence for a common origin of claudins and occludins, or for an origin resulting from intragenic duplication, was obtained. Thus, if they do share a common 2 TMS precursor with each other or with the gap junctional proteins, they have diverged in sequence from the precursor peptide beyond recognition. Perhaps three-dimensional structural data will provide evidence for or against such a proposal. We suggest a similar role for conserved extracellular residues in the claudins and occludins. These findings and suggestions should serve as guides for future studies concerning the functions and origins of junctional proteins.