Introduction

The complement control protein (CCP) family has been conserved from invertebrates to provide required regulation of complement and other cascades and to enable self–nonself discrimination. The CCPs have evolved in parallel with the complement cascade and can be regarded as the major agents of regulation. During vertebrate evolution the family has expanded such that there are numerous copies clustered throughout the vertebrate and indeed human genome. Functions have apparently diverged and now include viral receptors, regulation of adaptive immunity, and maintenance of fetomaternal tolerance (Bell 2000; Xu et al. 2000).

The CCP family is defined by the presence of reiterated protein domains known as short consensus repeats (SCRs). An SCR typically comprises 56–70 amino acids, including four cysteines and a tryptophan which are thought to be essential in holding the domain in its rigid triple-loop structure (Schwarzenbacher et al. 1999) which permits the formation of complex proteins and multiple binding affinities. Each CCP has a specific number of SCR units at the genomic level, e.g., human (Homo sapiens) CR1 has 30 and Homo sapiens decay accelerating factor (DAF) has four SCRs. Numerous SCR-containing proteins are encoded within the genomic region known as the regulators of complement activation (RCA) complex on 1q32.

It remains to be determined how such domains could be responsible for such diverse functions, how each CCP achieves a remarkable degree of ligand specificity, and how self–nonself discrimination is mediated. Indeed, it is still uncertain how it will be possible to explain the similarities as well as the differences in function when component SCRs are compared.

Structure–function studies have exploited several different approaches but can be difficult to interpret. For example, the X-ray structure of two SCRs of CR2 in complex with C3d highlights some of the problems including similarities in sequence of SCRs and therefore cross-reactivity of antibodies and inhibitory peptides (Szakonyi et al. 2001).

Valuable insights have been provided through the use of in situ mutagenesis. For example, the Atkinson group (Krych et al. 1998; Krych-Goldberg et al. 1999) has shown that some residues are important in the binding of C4b to CR1. Interestingly, some of the effects of mutagenesis are unexpected in that the critical mutation is remote from the putative binding site. Furthermore, a major conclusion is that combinations of SCRs are more important than single domains (Kirkitadze et al. 1999), suggesting that the evolutionary unit is a specific set of SCRs. There is a need for alternative approaches to define the sequence characteristics of these sets. Here we illustrate the value of phylogenetic analysis.

From an evolutionary perspective, the CCP family is a very useful model. Undoubtedly, duplication has been the driver of expansion and diversification but the extent of the region duplicated must have ranged from a single domain or less through multiple contiguous domains as fixed sets to genomic segments containing large genes. All of these processes are suggested from a comparison of cDNA sequences and apparent to various degrees from genomic analysis of the data available for the RCA complex (Heine-Suner et al. 1997; Rodriguez de Cordoba et al. 1999; http://www.ncbi.nlm.nih.gov).

With the dual aims of elucidating structure–function relationships and defining mechanisms of genomic duplication, we have analyzed genes within the RCA complex and compared the results with those obtained after earlier analyses of duplication within the major histocompatibility complex (MHC) (Dawkins et al. 1999).

We show here that SCRs can be classified into a limited number of groups and the arrangement of these groups relates to function. By analogy, each SCR is a letter, the order of the letters defines sets (or words), and the number and content of the words create CCPs (or sentences).

We also show that, as in the MHC, it is the imperfection of the duplication processes that contributes to diversity.

Methods

Primate and Rodent RCA Sequences

The relationships between individual SCRs were investigated using the following amino acid and nucleotide sequences: Homo sapiens CR1 (P17927 and Y00816), Homo sapiens CR1L (positions 1278188–1355188 on gi:22041066), Homo sapiens complement receptor 2 (CR2) (PL0009), Homo sapiens DAF (P08174), Homo sapiens membrane cofactor protein (MCP) (P15529), Homo sapiens MCPL (positions 1263188–1273188 on gi:22041066), Homo sapiens C2 and factor B (Bf) (AF019413), chimpanzee (Pan troglodytes) CR1 (AAA51438), Pan troglodytes DAF (AAF73177), baboon (Papio hamadryas) CR1 (AAA62170), Papio hamadryas DAF (AAF73178), Papio hamadryas MCP (AAB66815), house mouse (Mus musculus) CR1 (AAA78271), Mus musculus complement regulatory protein (Crry) (AAA37468), Mus musculus DAF (AAB00092), Mus musculus MCP (BAA31859), Mus musculus CR2 (A43526), and Norway rat (Rattus norvegicus) Crry (AAA91821).

Strategy

Domain-by-Domain Analysis

Initially, we identified SCRs contained within RCA proteins and other CCPs using Pfam (http://pfam.wustl.edu/) and database annotation. Alternative definitions were also used, but in essence the defining motif was simply C..C..C..W..C (Hourcade et al. 1989; Reid et al. 1986).
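As a rough illustration of motif-based identification, the C..C..C..W..C pattern can be expressed as a regular expression. The sketch below is not the Pfam procedure used here. The function name, the reading of each ".." as a run of one or more residues, and the length bounds are all assumptions made for illustration.

```python
import re

# Assumed reading of "C..C..C..W..C": each ".." stands for one or more
# arbitrary residues (the text does not specify gap lengths).
SCR_MOTIF = re.compile(r"C[A-Z]+?C[A-Z]+?C[A-Z]+?W[A-Z]+?C")


def find_scr_candidates(protein, min_len=30, max_len=80):
    """Return (start, end, subsequence) for motif matches of plausible length.

    min_len and max_len are illustrative bounds on the span from the first
    to the last cysteine; they are not taken from the paper.
    """
    hits = []
    for match in SCR_MOTIF.finditer(protein.upper()):
        if min_len <= len(match.group()) <= max_len:
            hits.append((match.start(), match.end(), match.group()))
    return hits


# Synthetic example sequence (not a real SCR).
toy = "AAA" + "C" + "G" * 10 + "C" + "G" * 12 + "C" + "G" * 10 + "W" + "G" * 5 + "C" + "AAA"
candidates = find_scr_candidates(toy)
```

Non-greedy quantifiers keep each match as short as possible, so a hit ends at the first cysteine that completes the pattern. Real SCRs would still need the database annotation and manual checks described above.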

Next, all domains were extracted and annotated before being added to a SCR database. The individual sequences could be analyzed without regard to their presence within a particular CCP. The intention was to compare and classify SCRs independently without bias.

Third, various phylogenetic analyses were used to establish relationships between individual SCR sequences. Alignments were created (ClustalW v1.8) and adjustments were evaluated depending upon the purpose of the alignment. For example, the clusters were found to differ by indels as well as residues and it was sometimes necessary to realign on the basis of the C..C..C..W..C motif. Given the divergence between and within duplicons of CCPs, evolutionary distances were estimated using the gamma distance model, which accounts for multiple amino acid substitutions and variation of substitution rate among sites. A gamma shape parameter of 0.93 was used based on previous studies of SCRs (Krushkal et al. 2000). Phylogenetic trees were constructed by neighbor-joining and parsimony methods. Each tree underwent 1000 bootstrap replications and was not rooted (Molecular Evolutionary Genetics Analysis version 2.1; Pennsylvania State University 1993).
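To make the distance estimate concrete, the following sketch computes a gamma distance between two aligned amino acid sequences using the commonly cited form d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing sites and a is the shape parameter (0.93 here). It is an illustration only, not the MEGA implementation. The function names and the pairwise removal of gapped positions are assumptions.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over positions ungapped in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def gamma_distance(seq_a, seq_b, shape=0.93):
    """Gamma-corrected amino acid distance: shape * ((1 - p) ** (-1 / shape) - 1)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance undefined when every site differs")
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)
```

Because the correction grows faster than p, divergent SCR pairs are placed further apart than a simple proportion of differences would suggest, which matters for the long branches discussed below.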

Peptide Searches

Using the amino acid alignment shown in Fig. 1, selected amino acid motifs from the group consensus sequences were searched against the databases SwissProt, Prosite, and GenBank/DDBJ/EMBL using BLAST (http://www.ncbi.nlm.nih.gov/Blast) and Motif 2 (http://motif.genome.ad.jp/MOTIF2.html).

Figure 1

Characteristic amino acid patterns of 11 SCR groups. Hosa, Homo sapiens; Mumu, Mus musculus; Rano, Rattus norvegicus; Patr, Pan troglodytes; Paha, Papio hamadryas. *The consensus SCR sequence derived by Hourcade et al. (1989). The proteins used to define the groups are Mumu and Rano Crry, Hosa, Patr, Paha CR1, and the ajefbk portion of Mumu CR1. The residues essential for defining any of the groups were only assigned when all group members had a single residue at a specific position and are shown as black boxes. Positions where multiple residues were present are designated with an x. Of these, gray boxes indicate amino acids shared by multiple members of the group. Boxed amino acids indicate the same amino acid in the CR1 and Crry protein(s) and Hosa CR2, MCP, DAF, C2, and/or Bf. §Number of residues for group based only on CR1 and Crry proteins.
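The rule used in the legend (a residue is assigned only when all group members agree, otherwise the position is marked x) can be sketched as follows. The function name and the toy alignment are hypothetical; real input would be the aligned SCRs of one group.

```python
def group_pattern(aligned_members):
    """Return the defining pattern of a group of aligned, equal-length sequences.

    A column shared by every member keeps its residue; any other column
    becomes 'x', mirroring the convention described for Fig. 1.
    """
    lengths = {len(seq) for seq in aligned_members}
    if len(lengths) != 1:
        raise ValueError("members must be non-empty and aligned to one length")
    return "".join(
        column[0] if len(set(column)) == 1 else "x"
        for column in zip(*aligned_members)
    )


# Toy alignment, not taken from the figure.
pattern = group_pattern(["CAGW", "CTGW", "CSGW"])
```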

Genomic Analysis of the MCP and CR1 Duplicons

The genomic sequence containing the genes CR1, MCP, CR1L, and MCPL at 1q32 was taken from the NCBI database (http://www.ncbi.nlm.nih.gov/) (positions 1103188–1453188 on contig NT_021877.12 [gi:22041066]; accession numbers AL691452.7, AL137789.11, AL365178.8, and AL035209.1) and compared against itself using the dot matrix program, Dotter (Sonnhammer and Durbin 1995). SCR and septamer locations were identified by aligning the CR1 mRNA (Y00816) with the genomic sequence of CR1 and CR1L using Blast 2 (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html).
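The idea behind a self-comparison dot matrix can be shown with a much simpler stand-in than Dotter: record every pair of positions that share an identical k-mer. Dotter itself scores sliding windows rather than requiring exact matches, so this is only a sketch of the principle. The function name and the default word size are assumptions.

```python
from collections import defaultdict


def self_dotplot(seq, k=11):
    """Return sorted (i, j) pairs, i < j, where seq[i:i+k] == seq[j:j+k].

    Points on the main diagonal (i == j) are omitted because they are trivial.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)

    dots = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                dots.append((positions[a], positions[b]))
    return sorted(dots)
```

Runs of points parallel to the main diagonal correspond to duplicated segments, and breaks or offsets in those runs correspond to insertions and deletions of the kind discussed for CR1 and CR1L.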

Genomic Analysis of the CR1 Revised Sequence

The genomic sequence containing the gene CR1 at 1q32 has recently been revised. The new sequence was taken from the NCBI database (http://www.ncbi.nlm.nih.gov/) (positions 1129691–1479691 on contig NT_021877.15 [gi:29789880]; accession numbers AL691452.10 and AL137789.11) and compared against the older CR1 genomic sequence (positions 1128988–1248188 on contig NT_021877.12 [gi:22041066]; accession numbers AL691452.7 and AL137789.11) using the dot matrix program, Dotter (Sonnhammer and Durbin 1995). SCR and septamer locations were identified by aligning the CR1 mRNA (Y00816) with the genomic sequences of CR1 using Blast 2 (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html).

Synonymous and Nonsynonymous Rates Within CCPs

Based on the amino acid alignment described above, a sliding window analysis was performed on the nucleotide sequences of CR1 SCRs to determine synonymous and nonsynonymous rates using the program WINA v0.3 (Endo et al. 1996). Results were plotted using Microsoft Excel 98.
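A heavily simplified picture of a sliding window analysis is given below. It only counts codon pairs that differ synonymously or nonsynonymously within each window, using the standard genetic code. It does not estimate rates and is not the WINA method. The function names, window size, and step are assumptions, and the input is assumed to be aligned, in frame, upper case, and free of gaps.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered with T, C, A, G at each position.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_changes(seq_a, seq_b):
    """Label each codon pair: '=' identical, 'S' synonymous, 'N' nonsynonymous."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    labels = []
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            labels.append("=")
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            labels.append("S")
        else:
            labels.append("N")
    return labels


def sliding_counts(seq_a, seq_b, window=20, step=1):
    """Return (start_codon, synonymous_count, nonsynonymous_count) per window."""
    labels = codon_changes(seq_a, seq_b)
    results = []
    for start in range(0, len(labels) - window + 1, step):
        chunk = labels[start:start + window]
        results.append((start, chunk.count("S"), chunk.count("N")))
    return results
```

Plotting the two counts against window position gives a crude analogue of a dS and dN profile. A proper analysis would normalize by the number of synonymous and nonsynonymous sites and correct for multiple substitutions.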

Results

Phylogenetic Analysis Suggests a Universal Classification of SCRs into Subfamilies Shared by Many CCPs

To classify the SCRs and define relationships within and between CCPs, we undertook several phylogenetic analyses of individual domains. All revealed unexpected but remarkable similarities between some SCRs. As an approach to classification we selected the SCRs from the largest CCP. Figure 2 shows that the 30 SCRs of Homo sapiens CR1 yield seven distinct clusters, each containing four SCRs. Each cluster contains every seventh SCR, reflecting the four septamers (otherwise known as long homologous repeats; LHR) which account for the first 28 SCRs. The seven clusters were designated a (1, 8, 15, 22), b (5, 12, 19, 26), d (7, 14, 21, 28), e (3, 10, 17, 24), f (4, 11, 18, 25), j (2, 9, 16, 23), and k (6, 13, 20, 27) so as to reflect phylogenetic proximity. Although the relationships between clusters were only provisional due to the long branch lengths and require further analysis, the intention was to develop a series of hypotheses to describe the likelihood of common origins.
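The membership lists above follow a simple periodic rule: SCRs separated by seven positions fall in the same cluster. A minimal sketch of that rule, assuming the within-septamer order ajefbkd implied by the listed positions (the function name is hypothetical):

```python
SEPTAMER_ORDER = "ajefbkd"  # cluster letters for positions 1-7 of each septamer


def classify_cr1_scr(number):
    """Map a human CR1 SCR number (1-30) to (cluster, septamer).

    Returns None for SCRs 29 and 30, which lie outside the four septamers.
    """
    if not 1 <= number <= 30:
        raise ValueError("human CR1 has SCRs numbered 1 to 30")
    if number > 28:
        return None
    septamer, offset = divmod(number - 1, 7)
    return SEPTAMER_ORDER[offset], septamer + 1


k_members = [n for n in range(1, 29) if classify_cr1_scr(n)[0] == "k"]
```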

Figure 2

Amino acid phylogeny of individual domains from Hosa CR1 reveals seven distinct clusters, each containing four SCRs. CR1 domains are numbered 1 to 30 (5′ to 3′). There are seven distinct clusters “abdefjk” containing every seventh domain, e.g., a contains 1, 8, 15, and 22. By contrast, c and h contain a single member (29 and 30, respectively). The data indicate that CR1 comprises four septamers, each containing in order ajefbkd. The neighbor-joining tree is unrooted and the bar at the bottom of the tree reflects the distance as calculated by the gamma distance model (α = 0.93).

So as to define further clusters we then pooled the SCRs of CR1 and DAF; the SCRs of DAF fell within the a, j, and e clusters such that CR1 and DAF can be written (ajefbkd)4xx and aaje, respectively, indicating a common set of aje (Table 1).

Table 1 Conservation of the complement regulatory SCR set aje across vertebrates

When SCRs of CR2 are included, clusters c, g, h, and i appear, giving a total of 11 subfamilies. Interestingly, the two unclassified (x) and terminal (29 and 30) SCRs of CR1 cluster with corresponding SCRs of CR2, i.e., c contains CR1 SCR 29 and CR2 SCR 15 and h contains CR1 SCR 30 and CR2 SCR 16; CR1 and CR2 share a set consisting of ch. Thus, as illustrated in Table 1, Homo sapiens CR1 can be written (ajefbkd)4ch. CR2 is jfbkdgaidgbidfch, which can be rewritten jf(a/b i/k d f/g)3ch, where a/b, i/k, and f/g are introduced to recognize alternatives yielding three tetramers with a or b, i or k, d, f or g preceded by jf and followed, as in the case of CR1, by ch. Alternatively CR2 could be j(f/g a/b i/k d)3fch, where the tetramer is similar in content although different in order.
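The shorthand can be checked mechanically by treating each architecture as a string of subfamily letters. In the sketch below the helper name and the regular expressions are our own encoding of the two readings of CR2; only the letter strings themselves come from the text.

```python
import re


def expand(unit, copies, tail=""):
    """Expand shorthand such as (ajefbkd)4ch into its full letter string."""
    return unit * copies + tail


CR1 = expand("ajefbkd", 4, "ch")
CR2 = "jfbkdgaidgbidfch"

# Reading 1: jf, then three tetramers (a or b, i or k, d, f or g), then ch.
CR2_READING_1 = re.compile(r"jf(?:[ab][ik]d[fg]){3}ch")
# Reading 2: j, then three tetramers (f or g, a or b, i or k, d), then fch.
CR2_READING_2 = re.compile(r"j(?:[fg][ab][ik]d){3}fch")

both_readings_fit = bool(CR2_READING_1.fullmatch(CR2)) and bool(CR2_READING_2.fullmatch(CR2))
```

Both patterns match the full 16-letter CR2 string, which is why the two tetramer framings cannot be distinguished from the order of subfamilies alone.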

As shown in Figs. 1 and 2, this analysis suggests that SCR subfamilies have diverged to greater or lesser degrees since the original sets were duplicated. For example, the original tetramer of CR2 may have contained the progenitors of a + b, i + j + k, d + e, and f + g, which is now present in CR1 as ajef and in CR2 as a/b i/k d f/g. Within CR2, we deduce that the d subfamily has been conserved relative to the other more divergent components of the tetramer. Interestingly, the terminal ch set has been retained by both CR1 and CR2.

SCR Subfamilies and Their Sets Are Shared by Different RCA Proteins

With rare exceptions, all SCRs in our database could be classified into these 11 subfamilies (a to k) or their putative progenitors and all CCPs examined could be described according to the schema used for CR1 and CR2 (Table 1). For example, MCP is aj/kdx, suggesting a common origin with CR1 and CR2, divergence or shuffling yet relative retention of the proximal aj set. CR1L is particularly interesting. When the available genomic sequence is examined (Fig. 3), we identify extra SCRs and a structure of a(jefbk)2d, indicating a close but imperfect relationship with CR1, viz. (ajefbkd)4ch. Other sequences may well indicate further heterogeneity, but on the basis of the current data, it appears that the internal jefbk set was duplicated as a pentameric segment or that deletions occurred after duplication of the ajefbkd septamer. Apparently, divergence is ongoing since, for example, the introns between j1 and j2 differ in length and, especially, the content of insertions indicated by gaps (Figs. 4A and 4B). We therefore asked whether there were other ways of defining the subfamilies.

Figure 3

Multiple duplications and divergence following segmental duplication of CR1 and MCP precursors. Dot-plot analysis of 350-kb genomic region against itself. Genes CR1, MCP, CR1L, and MCPL and part of CR2 are shown. The dot plot beneath the diagonal line is of the unmasked genomic sequence, while the dot plot above the line uses sequence that was masked using Repeat Masker (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker). Since segmental duplication of the CR1 precursor, it is apparent that divergence has occurred by at least three separate processes: first, the basic unit has replicated such that CR1 is now (ajefbkd)4ch, whereas CR1L is now a(jefbk)2d; second, individual SCRs have been added or lost, e.g., c and h; third, there have been insertions/deletions to explain the expansion of the region between j(i) and j(ii) of CR1 septamer 4 (see Fig. 4A). Arrows indicate the direction of the coding sequence.

Figure 4A Figure 4B

Analysis of masked genomic sequence of CR1L against CR1. Sequences were masked using Repeat Masker (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker). Colored lines running through the plot indicate positions of individual SCRs such that a is red, j is dark blue, e is orange, f is green, b is pink, k is brown, d is light blue, and c and h are black. Filled regions show noncoding sequence between split exons (j and k). For the purpose of discussion we assume that the CR1L was the progenitor of CR1 which arose through further piecemeal duplication of components within CR1L. For example, the region of CR1L commencing at e1 has duplicated en bloc and there has been further diversification, especially in noncoding regions between j2(i) and j2(ii). There are other possible units of duplication commencing at e2. Note also that in this sequence there are 12 SCRs within CR1L. The deletion of the segment including d1, a2 is indicated by a vertical gap between k1(ii) and j2(i).

Dot-plot analysis of revised genomic sequence reveals an additional duplicon. Revised genomic contig (gi:29789880) containing CR1 shows evidence of an additional duplicated unit when compared against previously analyzed contig (gi:22041066). The same unit (e1 to j2[ii]) can be identified in Fig. 4A.

Substitutions and Indels Define SCR Subfamilies

Once the major clusters were identified by phylogenetic analysis, it became obvious that some residues are characteristic of, or even restricted to, a single cluster. Many of these residues had previously been regarded as alternatives within all SCRs but we prefer the possibility that there are sites which define specific subfamilies. Further evidence for this interpretation is shown in Table 2, where it can be seen that substitutions correlate with coding indels and therefore the number of residues, i.e., the length of the SCR.

Table 2 Characteristic amino acid patterns of nine SCR groups

Conservation of SCR Subfamilies and Their Sets in Other Vertebrates

The validity of the phylogenetic approach was evaluated by examining non-human primates and other vertebrates. If the clusters are meaningful, they should be apparent throughout at least recent evolution. In terms of the order and position of subfamilies, DAF is the same in primates and Mus musculus. Pan troglodytes and Papio hamadryas CR1 are identical to Homo sapiens CR1 with the exception that the first Papio hamadryas SCR a has been deleted at the protein level (Table 1).

More interesting is the comparison with the mouse. Mus musculus CR1 has been contrasted with Homo sapiens CR1 (Holers et al. 1992; Kalli and Fearon 1994; Molina et al. 1994; Prechl and Erdei 2000), but here we show that in spite of all the differences, the sequence designated MCR1 contains ajefbk and is therefore identical to the first six SCRs of Homo sapiens CR1 and CR1L. The remaining sequence of MCR1 is similar to CR2, as expected from earlier observations, suggesting that MCR1 is in fact CR1 and CR2 or, in view of the present findings, a composite of CR1L and CR2. A comparison of Homo sapiens and Mus musculus CR2 reveals the expected three tetramers (a/b, k/i, d, f/g) followed by ch, with the exception that the third tetramer in Mus musculus has lost the first SCR. Interestingly, as with Homo sapiens, d is less divergent than the other SCRs.

The differences between primate and rodent CR1 suggest that the true orthologue for Homo sapiens CR1 and CR1L may be Crry rather than Mus musculus CR1. Using our classification, the sequence designated Mus musculus Crry is ajefh, raising the interpretation that four external domains (ajef) may subserve at least some CR1 or CR1L functions. The Rattus norvegicus Crry sequence is designated ajefkhh and therefore shares the ajef, although the terminus may be different.

Human CR1 is Closer to Mus musculus Crry than Mus musculus CR1

So as to investigate the evolutionary relationships between Homo sapiens CR1 and Mus musculus CR1 and Crry, we included all component SCRs in the same phylogenetic analysis as shown in Fig. 5. Clearly Crry ajef are very similar to the ajef, respectively, of the first, second, and third septamers of human CR1. Murine CR1 is more remote.

Figure 5

Rodent Crry is phylogenetically closer to Hosa (Homo sapiens) CR1 than to Mumu (Mus musculus) CR1. In a, e, and f, the Crry domains are more similar to the corresponding SCRs of the first, second, and third septamers of Hosa CR1 (hCR1) than Mumu CR1 (mCR1). Domains from Mumu CR1 and the fourth septamer of Hosa CR1 are more divergent. The neighbor-joining tree is unrooted and the bar at the bottom of the tree reflects the distance as calculated by the gamma distance model (α = 0.93).

SCR Subfamilies and Function

Previous studies have suggested that the first three and particularly the second and third SCRs are involved in binding to C3 and C4. From Table 1, it can be deduced that the aje set encodes the region of relevance. Other sets of subfamilies clearly have a different significance. For example, the ch set relates to the terminus.

Peptide Patterns Marking Functional Subfamilies

If, as implied above, the classification of SCRs relates to function, it should be possible to determine which amino acid residues are specific and critical to the function of subfamilies, at least within orthologous proteins in related species. From Fig. 1 and Table 2 it can be seen that the C..C..C..W..C residues are defining in terms of SCRs generally but that most of the other patterns are actually characteristic of specific subfamilies at least within primates. For example, a contains an A which is specific, with some minor exceptions, within the b subfamily, possibly indicating the proximity of a and b. The proximal I/L/V marker, used in earlier studies (see patterns at the base of Fig. 1), is an I in d, e, and c but M in h and P in j. It can also be seen that the D/E marker is in a as a T, in b as Q, in c as M, in d as K, in j as S, in k as A. These residues are preceded by a G which is not present in f, g, or i. The more terminal proline is found in b, d, e, f, g, j, and k but not in a and not necessarily in h, c, or i, etc. With minor exceptions most of the so-called motifs can be used to define subfamilies of the SCRs of CR1 and Crry (Table 2).

Indels and Extensive Continuous Sequences

A striking feature of these analyses is that the different groups of SCRs have specific insertions and deletions (indels) and therefore specific lengths (Table 2). Furthermore, as a consequence, there are many more or less specific continuous sequences which might not have been expected from previous analyses.

Evolution of SCRs and CCPs

From the analyses of protein sequence, it is apparent that RCA proteins have evolved through a series of processes involving duplication and divergence of SCR domains. It is also clear that individual SCRs and particularly sets of SCRs have become specialized in terms of position, proximity, and undoubtedly function. Thus, for example, aje and ch occur on different proteins but in the same relative position. Duplications, deletions, and other processes of divergence have contributed to this process of specialization or selection.

Units of Duplication and Deletion

Having demonstrated that there are specific subfamilies of SCRs, it becomes clear that there must have been many diverse units of duplication and deletion. For example, a alone has been duplicated in DAF. The set ajefbkd has been duplicated as a septamer in the evolution of CR1. This septamer may have arisen by insertion and addition of d to ajefbk, which is found in CR1L and MCR1, or by a more complex process of shuffling. Based on similarities between CR1 and CR2, we expect that there was a much earlier tetramer corresponding to ajef and a trimer corresponding to bkd in CR1 and it appears that these may have been key elements in the formation of many CCPs. Note also that aje and bkd may have arisen from an ancestral form of a trimer designated a+b, j+k, d+e. That is, progenitors of a+b, of j+k, and of d+e may have given rise to aje and also bkd.

Examination of genomic sequence must be qualified by the difficulty in assembling duplicated sequences but appears to indicate extreme complexity. Large units have also been duplicated and/or deleted; shuffling of segments, whole genes, or their fragments has been superimposed on the addition or loss of smaller units. Thus the primordial CR1 and MCP sequences were duplicated as a segment of at least 100 kb which has now extended through further internal duplication and insertion.

It is apparent that duplication may have been imperfect, at least to the extent that some SCRs have diverged substantially and independently while maintaining their position within a duplicon. For example, as shown in Fig. 2, in Homo sapiens CR1, a of septamer 1 (a1) has diverged from a2 and a3 (which are identical), whereas f1, f2, and f3 are conserved. In terms of divergence and conservation within CR1, there are similarities between a and j, between e and f, and among b, d, and k. It is interesting that these groupings are similar to the sets identified above, suggesting that there may be constraints imposed by function and selection.

Relevance to CCPs Encoded Within the MHC

C2 and Bf contain C..C..C..W..C but are encoded within the MHC rather than the RCA complex. We therefore asked whether subfamilies of SCR can be identified. By phylogenetic analysis (Martinez et al. 2001), it can be seen that the h subfamily is present within C2-SCR2 and Bf-SCR2, suggesting that the subfamily existed before the separation of the MHC and RCA clusters. In both genes, SCR3 has some features of a and b, as might be expected if a and b had a common origin, viz a + b. In both genes, SCR1 is truncated but possibly related to a.

A remarkable feature of these analyses is the similarity between C2 and Bf (see Fig. 1). Thus although the two forms are quite different in length and intron structure, they have retained almost identical SCRs. The differences between C2 and Bf may relate to differences in their functional roles in the complement cascades but the residual similarities suggest that there is local conservation of sequence.

Nucleotide Analysis

Phylogenetic analysis of the nucleotide sequence (synonymous sites) of CR1 revealed a very similar result to the amino acid sequence as shown in Fig. 2. Not only were the clusters the same but so too were the different patterns of divergence. This finding emphasizes the apparent paradox: although it appears that CR1 has been formed by successive duplications of a septamer, some of the components of each septamer appear to have a different evolutionary history.

To search for indicators of selection, we used a window analysis to plot the synonymous (dS) and nonsynonymous (dN) changes throughout the length of the CR1 SCRs. The differences within a subfamily (see Fig. 2) can be predominantly synonymous (e.g., a1 [CR1-1] versus a2 [CR1-8] and a3 [CR1-15]) or nonsynonymous (e.g., j2 [CR1-9] and j3 [CR1-16] versus j4 [CR1-23]) (Fig. 6), suggesting that there has been some divergence of function since the time of duplication of the septamers. In some comparisons (e.g., j and f), the dN/dS ratio reverses as the window progresses along the sequence (Fig. 6). The d and b comparisons reveal limited synonymous and nonsynonymous changes. These findings emphasize the complexity of duplication and suggest that there have been further indels and shuffling of the SCRs following duplication of the septamers.

Figure 6

SCR window analysis reveals striking differences in dS and dN patterns. Comparisons yielding relatively flat lines were not included and correlated with short genetic distances in Fig. 2 (e.g., e1 vs e2). The asterisk indicates a large number based on the WINA program (Endo et al. 1996). The broken line indicates dN and is plotted against the right axis, and the black line indicates dS plotted against the left axis. A bold black line above the graph indicates dN>dS; a double line indicates dN>2dS.

Discussion

Units of Duplication

This work arose from our interest in comparing the MHC and the RCA complex, especially from the point of view of duplication. Within the MHC, the duplication process generally involves segments of heterogeneous genes, fragments, and indels and there is emerging evidence to suggest that the RCA complex may contain similar segments containing fragments of CR1 and MCP (Hourcade et al. 2000). However, in contrast to the MHC, the dominant process within the RCA complex appears to have involved sequential duplication of one or more relatively homogeneous domains (SCRs). So as to define the latter process, we classified SCR sequences excised from several different CCPs. Remarkably, most SCRs examined could be grouped into only 11 subfamilies, designated a to k. As shown above, Homo sapiens CR1 can be written as (ajefbkd)4 ch, and DAF as aaje. From such examples, it can be concluded that the subfamilies have been conserved but also shuffled to create specific sets which have themselves been conserved. Duplication has operated at multiple levels of complexity including individual SCRs, specific sets of SCRs such as aje and even ajefbkd, whole genes, and even segments with more than one gene.

After examining CCPs of different species, we conclude that this process of duplication can be imperfect or at least modified by deletion and insertion. As shown in Table 1, variants such as ajefbk, ajef and jefbkd occur, although possibly only as additional copies. Functional diversification and species differences may be direct consequences of imperfect duplication.

Overlap Between CCPs

Historically, distinct CCPs have been identified and contrasted with those already known. Preferred ligands have been defined and differences in expression have been described. The number of SCRs in each CCP has been used as a means of classification and equivalents in different species have been sought often with confusing and controversial conclusions.

The present study has revealed far more structural overlap than could have been expected from earlier observations and from the existing nomenclature. Because single domains from multiple CCPs were examined, unexpected relationships were revealed. These can be considered under three headings: within a gene of a species, between genes of a species, and between species.

Within-Gene Relationships

For more than a decade, it has been appreciated that the amino acid sequence of CR1 contains four long homologous repeats (LHR) or septamers followed by two different carboxy-terminal SCRs. So-called allotypes with three to six septamers have been described. Functional studies by the Atkinson (Uhrinova et al. 2003) and Barlow (Kirkitadze and Barlow 2001) groups have revealed similarities between septamers and the importance of particular sets of SCRs. For example (in CR1), functional Site 1, which binds C4b and has decay accelerating activity (DAA), resides in SCR 1–3, 8–10, and 15–17 of CR1 and all three sets are aje using our nomenclature. Interestingly, within DAF, it has been shown that DAA resides in SCR 2–4 (aje) and that SCR1 (a preceding aje) is redundant. Thus, our approach has confirmed, extended, and simplified earlier observations (Krushkal et al. 2000). Furthermore, phylogenetic analysis alone could have predicted these functional relationships while suggesting a useful nomenclature.

Although there are insufficient data at present, it can be predicted that there will be considerable diversity when the same set is compared within a gene (isotypic) or between subjects (polymorphism).

Between-Gene Relationships

As shown in Table 1, there are many similarities between different CCPs within a species, and as indicated above, these correlate with shared functional sites, e.g., DAA is encoded within aje. Even more remarkable is the fact that another transmembrane CCP, CR2, shares the carboxy-terminal ch. The comparison of CR1 and CR2 suggests that an early and functional set is that represented by fbkd in CR1. These three sets can be regarded as the putative building blocks of CCPs and their functions as relevant to the evolution of innate and acquired immunity.

The degree of overlap between CCPs has been underestimated in the past. Attempts to define each CCP as a distinct protein must be reviewed. We expect considerable although not complete functional and immunological overlap between CR1, CR1L, and DAF inter alia. On the other hand, we also expect that there will be polymorphism.

Between-Species Relationships

It is often said that CCPs are very different between species but the present analysis yields a different conclusion. Within primates, subfamily classification of SCRs reveals striking similarities and the order is consistent allowing for occasional deletion as in the case of the first a in Papio hamadryas CR1.

When primates are compared with rodents, some differences are apparent but these can be explained simply by single indels or by different degrees of duplication and alternative shuffling of conserved sets of SCRs. Nomenclature and annotation may give false impressions as to true and false orthologues. For example, so-called MCR1 is a fusion of Homo sapiens CR1/CR1L and Homo sapiens CR2 (with a single deletion). The human equivalent of Mus musculus Crry has been sought because of its likely importance in successful pregnancy. Indeed, although the nomenclature is necessarily historic and arbitrary, there could be a case for revision. In the meanwhile, we argue that the human equivalent of the CCP which protects the murine fetus is a mixture of CR1 and all those human CCPs which contain at least aje, or more of the ajefbkd CR1 septamer, and therefore the conformational unit implied by the collocation of up to seven SCRs on the trophoblast surface. It will be important to consider these possibilities (i.e., aje irrespective of the CCP) when defining the immunopathology and designing antibodies or peptides which might block or initiate the process of abortion.

Evolution of SCRs

While the data are still limited and the process obviously very complex, we propose a framework for the evolution of SCRs and CCPs. Early SCRs were beneficial because of their inherent binding affinities, especially when occurring together. Any high-affinity binding has the potential to contribute to innate immunity by way of localization and immobilization of invading organisms. The same binding may also play a role in self–nonself discrimination in that there could be selection for receptors which are unoccupied and therefore free to bind foreign (rather than host) ligands. Selection would have favored closely linked (duplicated) gene clusters since increased copy number would result in increased concentrations of the receptor while facilitating the assembly of the multicomponent products and the conservation of specific sets.

As individual SCR domains diverged, functional diversity increased especially when sets were imperfectly duplicated. Some sets, however, were conserved particularly when crucial to the regulation of coevolving cascades such as the complement system. Duplication of genes such as the C3+4+5 precursors favored the imperfect but segmental duplication of specific sets of SCRs. Imperfections included insertion and deletion of SCRs as well as errors in linear sequence. Some coding polymorphism arose but was not especially favored because it was necessary to retain specificity for particular ligands. Noncoding insertions may have been important in relation to achieving differential but precise regulation of the expression of CCPs. During primate evolution these processes intensified due to the recruitment of human endogenous retroviruses (HERVs) which facilitated further divergence (Gaudieri et al. 2000).