Introduction

With increasing knowledge of gene regulation, protein–protein interactions, and metabolic processes, it has become possible to assemble these and similar sorts of biological information in the form of networks, that is, graphical representations of intermolecular interactions (Kanehisa 2000). Networks have been constructed from information on metabolic pathways, signal transduction, transcriptional regulation, and other cellular processes (Kanehisa 2000). One important generalization regarding biological networks is that they tend to be scale-free (Barabási and Albert 1999). Unlike a random network, a scale-free network has the property that a small proportion of nodes have a large number of connections, while the other nodes have smaller numbers of connections (Barabási and Albert 1999). In intuitive terms, in a scale-free network, nodes are divided into “hubs” (having many connections) and “spokes” (having few connections and connected with one another mainly through hubs). More formally, the scale-free property occurs when $P(k)$, the probability that a node in the network is connected to $k$ other nodes, decays as a power law, following $P(k) \sim k^{-\gamma}$, where $\gamma$ is a positive real number (often about 2.0 in a wide variety of networks known from both the biological and the social sciences) (Barabási and Albert 1999).

An apparent paradox of biological networks is that within many such networks there are numerous small modules of densely interconnected nodes, while connections between modules are sparser (Ravasz et al. 2002). A modular organization would seem to contradict the scale-free property, since in such a network nodes would tend to have roughly equal numbers of connections. However, it has been shown in the case of metabolic networks from a variety of species that the network consists of many small highly connected modules that combine in a hierarchical manner, as a result of a small number of nonrandom intermodule links that connect modules in a nested fashion (Ravasz et al. 2002). This hierarchical mode of organization explains the observation that biological networks are scale-free and yet consist of functionally distinct modules.

It has been hypothesized that repeated gene duplication over evolutionary time can account for the properties of biological networks (Wagner 2001). However, this will only be true if the duplication process has certain characteristics. Figure 1A illustrates the simplest possible protein–protein interaction network: a network consisting of just two interacting proteins, X and Y. This network is not scale-free. Suppose that the genes encoding X and Y are both duplicated, giving rise to two new genes encoding two new proteins (X′ and Y′). If X retains its interaction with Y, while X′ interacts only with Y′ (Fig. 1B), the network will not be any more scale-free than was the original network. Similarly, if X interacts with both Y and Y′ while X′ interacts with both Y and Y′ (Fig. 1C), the network will not be any more scale-free than was the original network.

Figure 1. Hypothetical scenarios of gene duplication in biological networks. (For explanation, see text.)

On the other hand, the scale-free property is increased if, after duplication, interactions are lost differentially. An example of such a process is illustrated in Fig. 1D. Here, after duplication of both the genes encoding X and Y, X retains the capacity to interact with both Y and Y′, whereas X′ interacts only with Y′ (Fig. 1D). Likewise, the scale-free property of the network will be increased if only one of the two genes is duplicated, but both duplicates retain the capacity to interact with the unduplicated gene (Fig. 1E). The same effect would be produced if both genes were duplicated, but one duplicated gene was subsequently deleted.

These simple examples show that gene duplication will increase the scale-free property of networks if genes involved in networks are duplicated differentially, if duplicates are deleted differentially after duplication, and/or if interactions are retained differentially after duplication. A similar effect would also be produced if different interactions were acquired by each gene independently after duplication, especially if duplicates differed with respect to the number of interactions acquired.
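
The scenarios of Fig. 1A–E can be made concrete by counting connections per node. The following is a minimal sketch rather than part of the original analysis: the edge lists are transcribed from the text, the labels `Xp` and `Yp` stand for X′ and Y′, and the choice of Y as the duplicated gene in scenario E is an assumption made for illustration.

```python
from collections import Counter

# Edge lists for the scenarios of Fig. 1; "Xp" and "Yp" stand for X' and Y'.
# In scenario E we assume (arbitrarily) that Y is the gene that was duplicated.
SCENARIOS = {
    "A": [("X", "Y")],
    "B": [("X", "Y"), ("Xp", "Yp")],
    "C": [("X", "Y"), ("X", "Yp"), ("Xp", "Y"), ("Xp", "Yp")],
    "D": [("X", "Y"), ("X", "Yp"), ("Xp", "Yp")],
    "E": [("X", "Y"), ("X", "Yp")],
}


def degrees(edges):
    """Return a mapping from each node to its number of connections."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)


def degree_distribution(edges):
    """Return a mapping from degree k to the number of nodes with that degree."""
    return dict(Counter(degrees(edges).values()))


for name, edges in SCENARIOS.items():
    print(name, degree_distribution(edges))
```

In scenarios A, B, and C every node has the same number of connections, whereas in D and E the nodes differ in degree, which is the sense in which differential retention or differential duplication moves the network toward a hub-and-spoke structure.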

Figure 1F represents a simple “module” of three interacting proteins (X, Y, and Z). If the genes encoding Y and Z are duplicated, and the proteins encoded by the duplicates continue to interact with X (Fig. 1G), the result will be a network with two hierarchically combined modules. This simple example shows that differential duplication can, under appropriate circumstances, yield a network having the property of a modular and hierarchical organization. Again, the same effect would be produced if both genes were duplicated, but one duplicated gene was subsequently deleted. Consistent with this reasoning, eukaryotic genomes include substantial numbers of unduplicated genes (“singletons”), in spite of the existence of numerous multigene families (Friedman and Hughes 2001a, b).

These theoretical considerations lead to the prediction that biological networks will be characterized by three phenomena: (1) After gene duplication, new interactions will be differentially acquired by duplicates and/or ancestral interactions will be differentially retained by duplicates. The latter will be the case particularly when the duplication involves genes constituting multiconnected nodes (“hubs”) in a gene interaction network. (2) Genes corresponding to network hubs will consist disproportionately of singletons. (3) When genes corresponding to network hubs are duplicated, it will often happen that one copy is quickly deleted. Here we test these predictions with data on gene interaction networks from two model organisms: the genetic interaction network of the yeast Saccharomyces cerevisiae (Tong et al. 2004) and the protein–protein interaction network of the nematode worm Caenorhabditis elegans (Li et al. 2004).

Methods

The complete sets of predicted protein translations for the following organisms were downloaded from the euGenes Web site, http://www.iubio.bio.indiana.edu:8089: yeast Saccharomyces cerevisiae (version 06/24/2002) and Caenorhabditis elegans (version 06/24/2002). For each of these genomes, gene families were assembled using the BLASTCLUST computer program available in the BLAST tools (Altschul et al. 1997). This program establishes families by BLASTP homology search and the single-linkage method (i.e., if a match is scored between A and B and between B and C, then A, B, and C are placed in the same family). We used a value of $10^{-6}$ for the E parameter (representing the probability that a score as high as that observed between two sequences will be found by chance in a database of the size examined) of the BLAST algorithm. To score a match between two proteins, we further required that 30% of amino acids be identical and 50% of aligned amino acid sites be shared. These criteria have been shown to assemble multigene families whose members show evidence of homology throughout the length of the sequence, thus making them suitable for phylogenetic analysis and estimation of evolutionary distances (Hughes and Friedman 2004).
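
The single-linkage rule described above can be sketched with a union-find structure. This is an illustration only and not the BLASTCLUST implementation: the function name `single_linkage_families` is hypothetical, and the input is assumed to be a list of gene pairs that have already passed the match criteria.

```python
def single_linkage_families(genes, matches):
    """Group genes into families; any chain of pairwise matches joins a family.

    genes   -- iterable of gene identifiers
    matches -- iterable of (gene_a, gene_b) pairs already scored as matches
    """
    genes = list(genes)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the family representative, shortening the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return list(families.values())


# A matches B and B matches C, so A, B, and C form one family; D is a singleton.
print(single_linkage_families(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

Under this rule A and C are placed together even if no match is scored between them directly, which is why the stringency of the pairwise criteria matters for family size.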

In the case of selected duplicate gene pairs, sequences were aligned at the amino acid level using the CLUSTALW program (Thompson et al. 1994), and this alignment was imposed on the DNA sequences. The number of synonymous nucleotide substitutions per synonymous site ($d_S$) and the number of nonsynonymous nucleotide substitutions per nonsynonymous site ($d_N$) were estimated by a maximum likelihood method (Yang and Nielsen 2000) using the software package PAML (Yang 1997). Since most synonymous mutations are selectively neutral or nearly so (Kimura 1977), $d_S$ is expected to be correlated with the amount of time since duplication of the two genes compared. By contrast, $d_N$ reflects the extent to which the two genes are subject to purifying selection arising from functional constraint on the amino acid sequence (Kimura 1977; Nei 1987).

Information for the yeast genetic interaction network was obtained from Tong et al. (2004), who determined about 4000 interactions for about 1000 genes using computational analysis of data from 132 synthetic gene array screens. Information for the C. elegans protein–protein interaction network was obtained from Li et al. (2004), who obtained data on over 4000 interactions using high-throughput yeast two-hybrid screens and combined these results with previously described interactions and in silico predictions for a total of about 5500 interactions. Genes included in these networks (820 genes from yeast and 2606 genes from C. elegans) were matched with the gene families determined by homology search. We examined the relationship between the number of connections a gene had in the network and the size of the family to which it belonged. Family sizes were based on the complete sets of predicted proteins rather than on the sets included in the networks. The clustering coefficient for node $i$ with $k_i$ links was defined as $C_i = 2n_i / [k_i (k_i - 1)]$, where $n_i$ is the number of links between the $k_i$ neighbors of $i$ (Ravasz and Barabási 2003). Note that $C_i$ is undefined for nodes with only one connection.
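
The clustering coefficient defined above can be computed directly from an adjacency structure. The sketch below is illustrative: the helper names and the toy network are hypothetical, and nodes with fewer than two connections return `None` because the coefficient is undefined for them.

```python
from itertools import combinations


def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def clustering_coefficient(adjacency, node):
    """C_i = 2 n_i / [k_i (k_i - 1)]; None when the node has fewer than 2 links."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return None
    # n_i: number of links among the k_i neighbors of the node.
    n_links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2 * n_links / (k * (k - 1))


# Toy network: X is linked to Y, Z, and W; of X's neighbors only Y and Z are linked.
adjacency = build_adjacency([("X", "Y"), ("X", "Z"), ("Y", "Z"), ("X", "W")])
for node in sorted(adjacency):
    print(node, clustering_coefficient(adjacency, node))
```

In the toy network, X has three neighbors with one link among them, so its coefficient is 2(1)/[3(2)] = 1/3, while Y and Z each have a coefficient of 1.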

Information on duplicated segments in the yeast genome was obtained from Seoighe and Wolfe (1999). In order to obtain a conservative estimate of duplicated regions, for purposes of our analyses we did not include duplicated regions designated “possible” or “low-scoring” by those authors.

Phylogenetic analyses were conducted by the following methods: (1) the maximum parsimony (MP) method, implemented in the PAUP* program (Swofford 2002); (2) the quartet maximum likelihood method (QML), implemented in the PUZZLE 5.0 program (Strimmer and von Haeseler 1996); and (3) the neighbor-joining (NJ) method (Saitou and Nei 1987), implemented in the MEGA2 program (Kumar et al. 2001). The NJ trees were based on the gamma-corrected amino acid distance, with the shape parameter estimated by the PUZZLE 5.0 program. The reliability of clustering in the MP and NJ trees was assessed by bootstrapping (Felsenstein 1985); 1000 bootstrap samples were used. In QML trees, the proportion of puzzling steps supporting a branch provided a similar index of the reliability of clustering patterns. Since all phylogenetic methods produced essentially identical trees, only the MP tree is shown below.

Results

Network Properties

Both the yeast gene interaction network and the C. elegans protein–protein interaction network showed patterns characteristic of a scale-free network, with the number of nodes having a given number of connections decreasing as a negative power of that number of connections. In each case, a regression relating the log number of genes to the log number of connections produced a highly significant linear relationship with a negative slope (Fig. 2). The slope of the relationship had a much higher absolute value in the C. elegans network (slope = −1.84; Fig. 2B) than in the yeast network (slope = −1.16; Fig. 2A). This difference is explained by a higher frequency of nodes with a small number of connections in the former network than in the latter network. In the C. elegans network, 1406 of 2606 nodes (54.0%) had a single connection, whereas in the yeast network only 254 of 820 nodes (31.0%) had a single connection.
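
A log-log regression of this kind can be sketched in a few lines. The example is a sketch under stated assumptions: the source does not specify the base of the logarithm or the fitting software, so base 10 and ordinary least squares are assumed here, and `fit_power_law` and the toy data are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(degrees):
    """Least-squares fit of log10(N) on log10(k).

    degrees -- list with one entry per node giving its number of connections
    Returns (slope, intercept, r_squared), where N is the number of nodes
    having k connections. Assumes at least two distinct degree values.
    """
    counts = Counter(degrees)
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(n) for n in counts.values()]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy data following N = 64 * k**-2 exactly: 64 nodes with 1 link, 16 with 2, ...
toy_degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope, intercept, r_squared = fit_power_law(toy_degrees)
print(slope, 10 ** intercept, r_squared)
```

With base-10 logarithms, the fitted line corresponds to N = a·k^slope with a = 10^intercept, so a steeper negative slope reflects a network dominated by nodes with few connections.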

Figure 2. Plots of the log number of nodes (N) having a given number of connections vs. the log number of connections (Connections) for (A) the yeast genetic interaction network and (B) the C. elegans protein–protein interaction network. In the yeast network, the relationship between N and Connections was described by the regression equation $Y = 162.18\,X^{-1.16}$ ($R^2$ = 83.9%). In the C. elegans network, the relationship between N and Connections was described by the equation $Y = 1288.25\,X^{-1.84}$ ($R^2$ = 91.4%).

In spite of this difference, clustering coefficients for the two networks were similar. The median clustering coefficient ($C_i$) for the yeast network was 0.500, while that for the C. elegans network was 0.533. These medians were not significantly different (Mann–Whitney test). In both networks, there was a strong negative correlation between $C_i$ and the number of connections at a node (Fig. 3). The Spearman rank correlation coefficient ($r_S$) between $C_i$ and the number of connections was –0.588 (p < 0.001) in the case of the yeast network (Fig. 3A) and –0.500 (p < 0.001) in the case of the C. elegans network (Fig. 3B).

Figure 3. Plots of the clustering coefficient ($C_i$) of a node vs. the number of connections at the node for (A) the yeast genetic interaction network ($r_S$ = –0.588; p < 0.001) and (B) the C. elegans protein–protein interaction network ($r_S$ = –0.500; p < 0.001).

Family Size and Connections

In the yeast network, there was a negative rank correlation between the number of connections a gene had and its family size ($r_S$ = –0.124; p < 0.001; Fig. 4A). This negative correlation was explained by the fact that most genes with large numbers of connections were singletons (Fig. 4A). Nine of 10 genes with 100 or more connections were singletons, and 64 of 83 genes (77.1%) with 25 or more connections were singletons (Fig. 4A). In the C. elegans network, there was a similar, though weaker, negative rank correlation between the number of connections a gene had and its family size ($r_S$ = –0.053; p = 0.007; Fig. 4B). The protein with the highest number of connections (90) was a member of a two-member family, while the proteins with the next three highest numbers of connections (82, 74, and 66) were all encoded by singletons (Fig. 4B). Of 27 proteins with 25 or more connections, 15 (55.6%) were encoded by singletons.

Figure 4. Plots of the number of connections of a node vs. family size for (A) the yeast genetic interaction network ($r_S$ = –0.124; p < 0.001) and (B) the C. elegans protein–protein interaction network ($r_S$ = –0.053; p = 0.007).

Because of the negative correlation between $C_i$ and the number of connections in both networks (Fig. 3), we further examined the relationship between family size and number of connections using rank partial correlation, controlling for the effect of $C_i$ (Table 1). Since $C_i$ is not defined for nodes with only one connection, such nodes were not included in the partial correlation analyses. Even excluding such nodes, there was a significant negative partial rank correlation between family size and number of connections in both the yeast network and the C. elegans network (Table 1). Likewise, in both networks there was a highly significant negative rank partial correlation between number of connections and $C_i$, controlling for the effect of family size (Table 1). On the other hand, in neither network was there a significant partial rank correlation between family size and $C_i$, controlling for the number of connections (Table 1).
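
A rank partial correlation among three variables can be sketched as follows. The source does not state which formula or software was used, so this example assumes the standard first-order partial correlation formula applied to Spearman coefficients, with tied values given average ranks; all function names and the toy data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values starting from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def partial_rank_correlation(xs, ys, zs):
    """Rank correlation of xs and ys, controlling for zs."""
    r_xy, r_xz, r_yz = spearman(xs, ys), spearman(xs, zs), spearman(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: the three pairwise rank correlations are 0.9, 0.8, and 0.6.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 5, 4]
zs = [2, 1, 4, 3, 5]
print(partial_rank_correlation(xs, ys, zs))
```

For the toy data the partial coefficient is (0.9 − 0.8 × 0.6)/√(0.36 × 0.64) = 0.875. In the setting described above, the three variables would be family size, number of connections, and the clustering coefficient, restricted to nodes with at least two connections.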

Table 1 Rank partial correlations among three variables describing network nodes, in each case controlling for the other variable

Gene Duplication and Network Connections

There is evidence of extensive ancient segmental duplication in the yeast genome, which has been attributed to an ancient polyploidization event which occurred about 200 million years ago (Wolfe and Shields 1997; Seoighe and Wolfe 1999; Friedman and Hughes 2001a; Hughes and Friedman 2003; Kellis et al. 2004). Of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated blocks believed to have originated from polyploidization (Seoighe and Wolfe 1999). The fact that, in spite of their location in duplicated regions, these families contain a single member implies that, after segmental duplication, one duplicate member of each of these 28 families was deleted from the genome.

In C. elegans, we compared the connections of 34 two-member families, both members of which were included in the protein interaction network (Table 2). Most of the duplication events giving rise to the 34 pairs of paralogues were ancient, as indicated by high mean values of $d_S$ and $d_N$ (Table 2). In general, the paralogous gene pairs showed little tendency to share network connections; the mean number of connections shared was less than one, and the median number of connections shared was zero (Table 2). Furthermore, there was no significant correlation between either $d_S$ or $d_N$ and either the number of connections shared or the percentage of connections shared (data not shown). In only 3 of the 34 gene pairs was $d_S$ less than 1.0, and in all 3 of these pairs no network connections were shared between pair members. C38C10.4 and F22B7.13 were the protein pair with the lowest $d_S$ (0.038); these two proteins shared none of the five connections of the former protein or of the six connections of the latter protein.
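
Counting the connections shared by a pair of paralogues amounts to intersecting their sets of interaction partners. The sketch below is illustrative only: the gene and partner names are placeholders rather than real C. elegans identifiers, and the percentage of connections shared is omitted because its exact definition is not given in the text.

```python
from statistics import mean, median


def shared_connections(adjacency, gene_a, gene_b):
    """Return (shared, total_a, total_b) for two genes in the network.

    shared counts the interaction partners common to both genes.
    """
    partners_a = adjacency.get(gene_a, set())
    partners_b = adjacency.get(gene_b, set())
    return len(partners_a & partners_b), len(partners_a), len(partners_b)


def summarize_pairs(adjacency, pairs):
    """Return the mean and median number of shared connections over gene pairs."""
    shared = [shared_connections(adjacency, a, b)[0] for a, b in pairs]
    return mean(shared), median(shared)


# Hypothetical partner sets; dupA/dupB share no partners, dupC/dupD share one.
adjacency = {
    "dupA": {"p1", "p2", "p3", "p4", "p5"},
    "dupB": {"p6", "p7", "p8", "p9", "p10", "p11"},
    "dupC": {"p1", "p2"},
    "dupD": {"p2", "p3", "p4"},
}
print(shared_connections(adjacency, "dupA", "dupB"))
print(summarize_pairs(adjacency, [("dupA", "dupB"), ("dupC", "dupD")]))
```

The first hypothetical pair mirrors the pattern reported for the pair with the lowest $d_S$: five and six connections, respectively, with none in common.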

Table 2 Summary statistics for variables describing 34 two-member families in the C. elegans protein–protein interaction network

Phylogenetic analyses of families with multiple members in a network were used to examine the relationship between phylogenetic relatedness and sharing of network connections. Figure 5A shows the phylogenetic tree of MAP kinases from the yeast network; this was the family showing the greatest within-family contrast in numbers of connections in either network. The two genes in this family with the highest numbers of connections, YPL031C (with 62 connections) and YHR030C (with 60 connections), were not sisters (Fig. 5A). There was strong (100%) bootstrap support for clustering of YHR030C with YLR113W, which had only 24 connections (Fig. 5A). The same clustering pattern received strong support in QML and NJ trees (data not shown). When sharing of connections among these genes was examined, YHR030C was found to share only a single connection with the closely related YLR113W (Fig. 5B). On the other hand, YHR030C shared 14 connections with YPL031C (Fig. 5B). All other members of this family included in the yeast network shared at most a single connection (Fig. 5B).

Figure 5. (A) MP tree of yeast MAP kinases included in the genetic interaction network. Numbers in parentheses after each gene name are numbers of network connections. Numbers on the branches show the percentage of 1000 bootstrap samples supporting the branch. (B) Network indicating numbers of network connections shared by yeast MAP kinases. Numbers in parentheses after each gene name are numbers of network connections.

Discussion

A number of general patterns emerged from the evolutionary analysis of two biological networks with rather different properties, a genetic interaction network of yeast and a protein–protein interaction network of C. elegans. First, in both networks, genes belonging to gene families represented by a single member in the genome (“singletons”) were disproportionately represented among the nodes having large numbers of connections. Furthermore, in the case of yeast, there was evidence that when singletons with large numbers of connections have been duplicated, one of the two duplicate copies has frequently been deleted. Of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated genomic segments believed to have originated from an ancient polyploidization event (Seoighe and Wolfe 1999). Each of these 28 loci was thus presumably duplicated along with the genomic segment to which it belongs, but one of the two duplicates has subsequently been deleted.

A second property shared by both networks was the strong negative correlation between the clustering coefficient ($C_i$) of a node and the number of connections at the node. This relationship held even when the effect of family size was controlled for statistically. Nodes connected to major “hubs” with a large number of connections tended to be sparsely interconnected among themselves.

Finally, there was evidence that network connections are remarkably labile over evolutionary time. Immediately after gene duplication, it seems reasonable to suppose that gene duplicates have the same network connections, unless one duplicate is partial or an exon-shuffling event or other recombinational event has accompanied gene duplication. But our results suggest that duplicated genes generally have quite distinct sets of connections, and that such changes can happen soon after duplication, as indicated by paralogous gene pairs in C. elegans with relatively low synonymous divergence.

Taken together, these observations paint a picture of the evolutionary process underlying the known characteristics of biological networks, namely, the properties of being scale-free and modular/hierarchical in organization. A multiply connected node is a hierarchical hub if the nodes connected to it have relatively little connection among themselves, whereas a module within a network would consist of a small set of mutually interconnected nodes. Therefore, evidence of a negative correlation between the clustering coefficient and the number of connections at a node provides an insight into at least one mechanism maintaining the modular/hierarchical nature of biological networks.

There was evidence that duplicated copies of multiply connected genes are frequently deleted, as evidently happened with such genes in duplicated segments of the yeast genome. Such deletion of duplicate genes has happened frequently enough in yeast to suggest that it may result from natural selection against duplication of multiply connected hubs. Moreover, the evidence that network connections are highly labile over evolutionary time suggests that even when multiply connected genes are duplicated and both duplicates are retained, one duplicate may lose numerous connections, while the other duplicate retains ancestral connections. This process would result in widely different numbers of connections within multigene families, as in the MAP kinase family of yeast (Fig. 5).