Introduction

One of the current goals of proteomics is to map the protein interaction networks of a large number of model organisms [1]. Protein–protein interaction information allows the function of a protein to be defined by its position in a complex web of interacting proteins. Access to such information will greatly aid biological research and potentially make the discovery of novel drug targets much easier. Previously, the detection of protein–protein interactions was limited to labor-intensive experimental techniques such as co-immunoprecipitation or affinity chromatography. High-throughput experimental techniques such as yeast two-hybrid and mass spectrometry have now also become available for large-scale detection of protein interactions. These methods, however, may not be generally applicable to all proteins in all organisms, and may also be prone to systematic error. Recently, a number of complementary computational approaches have been developed for the large-scale prediction of protein–protein interactions based on protein sequence, structure and evolutionary relationships in complete genomes.

Initially, computational prediction of protein–protein interactions was strictly limited to proteins whose three-dimensional structures had been determined. These methods predicted protein–protein interaction based on the structural context of proteins. Recent advances in complete genome sequencing have, however, provided a wealth of genomic information. It is now possible to establish the genomic context of a given gene in a complete genome [2, 3]. A gene is no longer thought of as a single protein-coding entity but as part of a coordinated network of interacting proteins. The potential for two proteins to interact is not only specified by their physical and structural properties, but is also encoded at a genomic level. For example, interacting genes are generally co-expressed [4–6] (both temporally and spatially). In other words, the fact that two proteins have the physical potential to interact is meaningless unless they are present in the same part of the cell at the same time. Other examples of genomic context include the co-localization of genes on chromosomes, the complete fusion of pairs of genes, correlated mutations between interacting protein families, and phylogenetic gene profiles. Even in the absence of structural or sequence information, one can detect the evolutionary fingerprints of pairs of interacting proteins from their genomic context. A number of these computational approaches also take advantage of high-throughput experimental information such as gene-expression data, cellular locality and molecular complex information [7, 8]. These hybrid computational approaches exploit both the genomic and biological context of genes and proteins in complete genomes in order to predict interactions.

Here, we will describe computational methods and resources available for protein–protein interaction prediction that exploit the structural, genomic and biological contexts of proteins in complete genomes. In addition to algorithms and methods for interaction prediction, a number of useful databases pertaining to protein–protein interaction will be described. These databases combine a large amount of data from both computational and experimental techniques. Finally, a number of tools for protein interaction network visualization and analysis will be described. Methods are presented in historical order together with online access information. Detailed computational protocols are also provided for each method.

Structural Context Approaches

Computational prediction of protein–protein interactions consists of two main areas: (i) the mapping of protein–protein interactions, i.e., determining whether two proteins are likely to interact, and (ii) the understanding of the mechanism of protein–protein interactions and the identification of the residues involved in those interactions. The first successful computational analyses of protein–protein interactions used the structural context of proteins to analyze known interaction interfaces and to derive physical rules governing protein–protein interaction specificity. Unlike other computational methods that use an evolutionary or genomic context to predict interaction, structural approaches tend to be more limited in terms of scale, as only a small proportion of protein sequences have accurate three-dimensional structures deposited in the Protein Data Bank (PDB) [9]. However, structural approaches allow for a much more detailed analysis of protein interactions than the genome-context based approaches. Structural approaches can determine not only whether two proteins interact, but also the physical characteristics of the interaction and the residues (sites) at the protein interface which mediate the interaction.

The identification of protein interaction sites is important for functional genomics, analysis of metabolic and signal transduction networks and also rational drug design. The first attempt to describe the characteristics of protein interaction sites was undertaken in 1975 by Chothia and Janin. With data from only three complexes, they suggested that the residues which form the interface are closely packed, tend to be hydrophobic and that complementarity may be an important factor in predicting which proteins can interact [10].

Later studies with larger datasets extended and developed their work to try to identify other characteristics of the interaction site that are sufficiently different from the rest of the protein to be identifiable, and thus predictive. Analysis of the hydrophobicity distribution of amino acids can be used to predict interaction sites, since interacting regions tend to be the most hydrophobic clusters on the surface of the protein [11–14]. This type of analysis yields a 60% success rate at predicting interacting sites. In general, hydrophobic residues such as Leu, Ile, Val, Phe, Tyr, and Met are over-represented at interaction sites, whereas polar residues such as Lys, Asp, and Glu (but not Arg) are under-represented [15, 16]. Other parameters which have been analyzed for their importance in identifying the residues that form the interaction site include the accessible surface area and residue composition [17, 18]. It has also become apparent that a distinction must be made between different types of complexes: interaction sites on stable and transient complexes have different properties [18]. Another study [19] indicates that the residue composition can be used to identify six different types of protein–protein interfaces, from domain-domain interfaces in the same protein to inter-protein contact surfaces.

Further studies [18, 20] using a six-parameter analysis (solvation potential, residue interface potential, hydrophobicity, planarity, protrusion and accessible surface area) have indicated that none of these parameters could individually be used as a definitive prediction method. Using a combined score from all six parameters yielded accurate predictions for 66% of 59 structures [21]. All interfaces tend to be planar and surface accessible, whereas the other parameters differed between complex types. Studies have also been carried out to classify different types of protein–protein interfaces based on sequence and structural features and to generate non-redundant datasets of protein–protein interfaces from the known protein complexes. Such datasets provide a convenient resource for finding the interface motifs shared among homologous proteins and the possible binding orientations [22–24]. Computational resources for the prediction and analysis of protein–protein interactions using structural features are described below (see section Structure Based Prediction of Interactions and Table 1).

Table 1 Methods and databases for computational prediction of protein–protein interactions

Shape complementarity is primarily used in docking studies, which focus on finding the best fit of the two interacting proteins using rigid- and soft-body searches [51–53]. Electrostatic complementarity between interfaces (Fig. 1) plays an important role in determining the best fit of two interacting proteins [52]. Interfaces between antibody-antigen complexes and transient heterodimers tend to have the least shape complementarity, while homodimers, enzyme-inhibitor complexes and permanent heterodimers are the most complementary [18]. Docking algorithms have been further improved using a benchmark dataset [54]. The performance of protein–protein docking methods is also assessed by a community-wide experiment known as the Critical Assessment of Predicted Interactions (CAPRI) [55]. CAPRI is a blind test which aims to assess the capability of different protein–protein docking methods to predict the mode of interaction between two proteins based on their 3D structures. The methods are evaluated by comparing the predictions with the unpublished experimental structures of the complexes.

Fig. 1

Three-dimensional structure of the T7 bacteriophage RNA polymerase complexed with T7 lysozyme. The multi-colored structure on the left is RNA polymerase, shown with a transparent blue molecular surface. The lysozyme is shown on the right in gray with its associated transparent surface. The interaction interface is highlighted in yellow on both surfaces. This figure was produced using PDB structure 1ARO and PyMOL

However, other research [56] has indicated that the physicochemical properties of interacting surfaces are difficult to distinguish from those of the whole protein surface. It has been suggested that instead of using patch analysis, it may be better to use interface contacts [19], i.e., residues whose closest atoms are annotated in PDB as being less than 6 Å apart. The authors argue that the analysis of surface patches may miss slightly buried residues with long side-chains, while other residues identified as being part of a patch may in fact not be important, or may not form contacts at all.
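The interface-contact definition above can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name `interface_residues`, the dictionary input format and the toy coordinates are hypothetical, and parsing of real PDB files is left out.

```python
import math

def interface_residues(chain_a, chain_b, cutoff=6.0):
    """Return (residues of chain A, residues of chain B) that have at least
    one atom closer than `cutoff` angstroms to an atom of the other chain.

    chain_a, chain_b: dict mapping a residue id to a list of (x, y, z)
    atom coordinates (hypothetical input format).
    """
    contacts_a, contacts_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # Distance between the closest pair of atoms of the two residues.
            closest = min(math.dist(p, q) for p in atoms_a for q in atoms_b)
            if closest < cutoff:
                contacts_a.add(res_a)
                contacts_b.add(res_b)
    return contacts_a, contacts_b

# Toy coordinates, invented for illustration (not from a real PDB entry).
chain_a = {"A:10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
           "A:55": [(30.0, 0.0, 0.0)]}
chain_b = {"B:7": [(5.0, 0.0, 0.0)],
           "B:90": [(60.0, 0.0, 0.0)]}

print(interface_residues(chain_a, chain_b))
```

Because the test is applied to the closest atom pair of each residue pair, a partly buried residue with a long side-chain is still reported when any one of its atoms reaches across the interface.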

Some methods are based on the fact that the interface residues critical for binding are evolutionarily conserved. These methods utilize the evolutionary information contained in multiple sequence alignments and analyze the amino acid characteristics of neighboring residues using neural networks. Multiple sequence alignments can help to identify family-specific regions which are conserved within a subfamily but differ between subfamilies. These regions are interpreted as being interaction sites which may endow specificity of ligand interaction [57–59]. Two groups [60, 61] have trained neural networks with sequence profiles of spatial neighbors of a target residue with solvent exposure to predict whether a residue will be part of an interaction site. Both methods gave a prediction accuracy of approximately 70%. The validity of using sequence profiles has been verified by results which demonstrate that the majority of interacting residues are clustered in sequence segments of several contacting residues [62].

Methods have also been developed to validate predicted protein–protein interactions against experimentally determined 3D structures [25, 63]. Given a known three-dimensional structure of a complex, these methods map homologs of the interacting proteins onto the structure and use empirical potentials to test whether the homologous proteins preserve the interactions from the known structure. However, the number of experimentally determined structures for complexes is small, and of the 2,590 interactions predicted by large-scale methods, only 59 could be mapped onto their set of interacting complexes. Of these, 59% had domains that appeared to be in direct contact, thus increasing the probability that these predicted protein–protein interactions are biologically correct. Computational methods [31] for the prediction of protein–protein interactions based on this (and other structural approaches) are described below (see section Structure Based Prediction of Interactions).

Genomic Context Approaches

Co-localization

One of the first methods for predicting protein–protein interactions from the genomic context of genes utilizes the idea of co-localization, or gene neighborhood (Fig. 2a). Such methods exploit the notion that genes which physically interact (or are functionally associated) will be kept in close physical proximity to each other on the genome [44, 64–66]. The most apparent case of this phenomenon involves bacterial and archaeal operons, where genes that work together are generally transcribed on the same polycistronic mRNA, so that proteins involved in the same process or pathway are frequently encoded on the same messenger. Moreover, operons encoding co-regulated genes are usually conserved.

Fig. 2

Overview of genome context approaches. (A) Gene neighborhood plots for eight complete genomes, showing a pair of genes (red and blue) which are in close physical proximity in all eight genomes. A gene fusion event between two genes (yellow and light blue) in two genomes is also shown. (B) Example phylogenetic profiles of selected genes from the previous panel. These three pairs of genes have the same patterns of co-occurrence in all eight genomes, and may physically interact based on this evidence. (C) Two protein family alignments are shown with conserved regions highlighted (in red and blue). Correlated mutations (shown in green) are present in two identical sub-trees for each family, which indicates that these sites may be involved in mediating interactions between proteins from each family

Operons are rare in eukaryotic species [67, 68]. However, genes involved in the same biological process or pathway are frequently situated in close genomic proximity [65]. It is hence possible to predict functional or physical interaction between genes that are repeatedly observed in close proximity (e.g. within 500 bp) across many genomes. This method has been successfully used to identify new members of metabolic pathways [65]. Like many of the genome-context approaches, this method becomes more powerful with larger numbers of genomes. This approach and a number of online resources that implement it will be described in detail below (see section Gene-neighborhood Based Interaction Prediction).

Phylogenetic Profiles

A relatively simple, yet powerful, form of genomic context is the co-occurrence of pairs of genes across multiple genomes. Two of the main driving forces in genome evolution are gene genesis and gene loss [69, 70]. The fact that a pair of genes remains together across many disparate species represents a concerted evolutionary effort that suggests that these genes are functionally associated (i.e. same biological process or pathway) or physically interacting. This criterion is less stringent than that of gene co-localization, where gene pairs must not only be present, but also situated close to each other on the genome. Homologous genes can be termed either orthologs or paralogs. In general the term ortholog is used to describe genes that are related by a speciation event, i.e. perform analogous functions in different organisms and are related to a single common ancestor gene in an ancestor species. The term paralog is used to describe homologous genes that have arisen following a gene duplication event, i.e., perform similar functions in the same organism. Classifying homologous genes as either paralogs or orthologs is difficult in the absence of accurate phylogenetic or speciation information [71]. Classification of genes in this way allows the inference of a phylogenetic context for a given gene.

The analysis of phylogenetic context in this fashion has been termed phylogenetic profiling [72]. These profiles can be as simple as a binary representation of the presence or absence of a gene across multiple genomes [72–74] (Fig. 2b). A library of these profiles may then be scanned to find genes that exhibit identical (or highly similar) phylogenetic patterns to each other. Pairs of genes detected in this fashion are hence candidates for physical interaction or functional association. This method has been used not only to infer physical interaction [72], but also to predict the cellular localization of gene products [75, 76]. Phylogenetic profiles can also be constructed for protein domains instead of entire proteins [77].
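Scanning a library of binary profiles for identical or near-identical patterns can be sketched in a few lines of Python. The gene names, the eight-genome profiles and the `max_mismatches` parameter below are illustrative assumptions, and the Hamming distance is just one simple choice of similarity measure.

```python
def hamming(p, q):
    """Number of genomes in which two presence/absence profiles disagree."""
    return sum(a != b for a, b in zip(p, q))

def similar_profiles(profiles, max_mismatches=0):
    """Return all gene pairs whose profiles differ in at most
    `max_mismatches` genomes (0 means identical profiles only)."""
    genes = sorted(profiles)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if hamming(profiles[g], profiles[h]) <= max_mismatches:
                pairs.append((g, h))
    return pairs

# Invented profiles over eight genomes: 1 = ortholog present, 0 = absent.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1, 1, 0),
    "geneB": (1, 1, 0, 1, 0, 1, 1, 0),
    "geneC": (0, 1, 1, 1, 0, 0, 1, 1),
    "geneD": (1, 1, 0, 1, 0, 1, 0, 0),
}

print(similar_profiles(profiles))                    # identical profiles only
print(similar_profiles(profiles, max_mismatches=1))  # allow one mismatch
```

Profiles that are present (or absent) in every genome match each other trivially and carry no signal, so in practice such genes would probably be filtered out before the comparison.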

This system is not, however, without flaws. First, the strength of any inference made using such profiles is heavily dependent on the number and distribution of genomes used to build the profile. A pair of genes with similar profiles across many bacterial, archaeal and eukaryotic genomes is much more likely to interact than genes found to co-occur in a small number of closely related species. Second, evolutionary processes such as lineage-specific gene loss, horizontal gene transfer, non-orthologous gene displacement [78] and the extensive expansion of many eukaryotic gene families can make orthology assignment across genomes very difficult. Also, shared phylogenetic relationships between two proteins can sometimes produce false correlations [79]. However, given the increasing number of completely sequenced genomes, the accuracy of these predictions is expected to improve over time. The details of this approach and online resources for phylogenetic profile-based prediction of protein interaction are described below (see section Phylogenetic Profile Based Prediction of Interaction).

Gene Fusion

Genome context approaches to the prediction of protein–protein interaction also include the analysis of gene fusion across complete genomes. This method is complementary to both co-localization of genes and phylogenetic profiles and uses both gene location and phylogenetic analysis to infer function or interaction. A gene fusion event represents the physical fusion of two separate parent genes into a single multi-functional gene. This is the ultimate form of gene co-localization, i.e., interacting genes are not just kept in close proximity on the genome, but are also physically joined into a single entity (Fig. 2a). It has been suggested that the driving force behind these events is to lower the regulatory load of multiple interacting gene products [80]. Gene fusion events hence provide an elegant way to computationally detect functional and physical interactions between proteins [80, 81]. FusionDB is one such resource; it provides an in-depth analysis of bacterial and archaeal gene fusion events and can help in identifying potential protein–protein interactions as well as regulatory and metabolic pathways [82].

Gene fusion events are detected by cross-species sequence comparison. Fused (composite) proteins in a given reference genome are detected by searching for un-fused component protein sequences that are homologous to the reference protein, but not to each other. These un-fused query sequences align to different regions of the reference protein, indicating that it is a composite protein resulting from a gene fusion event [80]. Once again, predictions of this type are complicated by a number of issues. The largest hindrance is the presence of so-called promiscuous domains. These domains (such as helix-turn-helix (HTH) and DnaJ) are highly abundant in eukaryotic organisms. The domain complexity of eukaryotic proteins, coupled with the presence of promiscuous domains and large degrees of paralogy, can hamper the accurate detection of gene fusion events [83].
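The detection logic described above (two component proteins that match different regions of one composite protein but are not similar to each other) can be sketched as follows. This is a simplified illustration rather than the published algorithm: the input format, the identifiers and the function name `fusion_candidates` are hypothetical, and the alignment regions are assumed to come from a prior similarity search.

```python
from itertools import combinations

def fusion_candidates(hits, homologous_pairs):
    """Report (composite, component1, component2) triples.

    hits: dict mapping a composite protein id to a list of
          (component id, start, end) regions on the composite protein
          to which each component aligns.
    homologous_pairs: set of frozensets naming component proteins that
          are similar to each other (and so cannot be fusion partners).
    """
    results = []
    for composite, regions in hits.items():
        for (c1, s1, e1), (c2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            unrelated = frozenset((c1, c2)) not in homologous_pairs
            if non_overlapping and unrelated:
                results.append((composite, c1, c2))
    return results

# Invented example: qA and qC are homologs matching the N-terminal region,
# qB matches the C-terminal region of the composite protein refP1.
hits = {
    "refP1": [("qA", 1, 210), ("qB", 230, 480), ("qC", 5, 200)],
    "refP2": [("qD", 1, 300)],
}
homologous_pairs = {frozenset(("qA", "qC"))}

print(fusion_candidates(hits, homologous_pairs))
```

A sketch of this kind does nothing to address promiscuous domains; a practical pipeline would presumably need to mask or down-weight such domains before accepting a candidate.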

Although the method is not generally applicable to all genes, i.e., it requires that an observable fusion event can be detected between gene pairs, it has been successfully applied to a large number of genomes (including eukaryotes) [26]. The basic gene fusion detection method and online resources such as the AllFUSE database [26] will be described in detail below (see section Gene Fusion Prediction of Protein Interactions).

In-Silico Two-Hybrid

The in-silico two-hybrid (i2h) approach has much in common with the other genome-context approaches, but also indirectly assesses structural properties of proteins that potentially interact. It has previously been shown that a mutation in the sequence of one protein in a pair of interacting proteins is frequently mirrored by a compensatory mutation in its interacting partner (Fig. 2c). The detection of such correlated mutations can not only be used to predict protein–protein interactions, but also has the potential to identify specific residues involved at the interaction sites [84].

Previous analyses [85] involved searching for correlation of residue mutations between sequences in the same protein family alignment (intra-family). The in-silico two-hybrid method extends this approach by searching for such mutations across different protein families or domain families [86]. Prediction of protein–protein interactions using this approach is achieved by taking pairs of protein family alignments and concatenating these alignments into a single cross-family alignment. A position-specific matrix is then built from this alignment, and a correlation function is applied to detect residues which are correlated both within and across families. Correlated sites that potentially indicate protein interaction are returned with a score. The method suffers from the computational complexity of constructing the large numbers of alignments needed, and poor quality alignments can dramatically increase noise in the procedure [84]. However, the method is similar to the gene fusion approach in that a single accurate prediction between two proteins can imply interaction between all members of both families used.
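A heavily simplified sketch of the cross-family correlation step is given below. It makes several assumptions that are not taken from the published method: residue similarity is scored as plain identity (1 or 0) rather than with a substitution matrix, rows of the two alignments are assumed to be paired by species, only cross-family column pairs are reported, and the names and the `threshold` value are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

def column_similarity(column):
    """Similarity (identity) of the residues for every pair of sequences."""
    return [1.0 if a == b else 0.0 for a, b in combinations(column, 2)]

def pearson(x, y):
    """Pearson correlation; invariant vectors carry no signal, so return 0."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)

def correlated_sites(aln_a, aln_b, threshold=0.9):
    """Concatenate two alignments row by row and return
    (column in family A, column in family B, correlation) for
    cross-family column pairs whose correlation reaches `threshold`."""
    concatenated = [a + b for a, b in zip(aln_a, aln_b)]
    len_a = len(aln_a[0])
    columns = list(zip(*concatenated))
    sims = [column_similarity(col) for col in columns]
    hits = []
    for i in range(len_a):
        for j in range(len_a, len(columns)):
            r = pearson(sims[i], sims[j])
            if r >= threshold:
                hits.append((i, j - len_a, round(r, 3)))
    return hits

# Invented alignments for four species; K->R in family A co-varies with
# D->E in family B.
aln_a = ["AKL", "AKL", "ARL", "ARL"]
aln_b = ["GDE", "GDE", "GEE", "GEE"]

print(correlated_sites(aln_a, aln_b))
```

With real alignments the number of column pairs grows with the product of the two alignment lengths, which is one concrete reason for the computational cost noted above.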

Biological Context Approaches

High-throughput experimental techniques now provide access to a more detailed view of protein interactions at a genomic level [87]. Gene expression analysis allows one to determine not only which genes are active in a given state, but also which sets of genes are co-regulated in many different states. It has been shown that many interacting proteins are co-expressed according to microarray analyses [4–6]. Current gene-expression methods now allow for every coding gene of a genome to be placed on a single microarray, allowing the activity of every gene to be monitored across different states or time-points. Although these methods cannot directly be used to determine whether or not two proteins interact, a number of computational approaches have been developed that use this information towards the prediction of protein–protein interaction and gene regulatory networks [4–6]. Other high-throughput experimental techniques such as yeast two-hybrid specifically test a bait protein for interactions against a set of prey proteins. The bait and prey consist of fusion constructs that activate a reporter gene if they interact with each other. While this method is not as accurate as other techniques such as co-immunoprecipitation, affinity chromatography or gel-overlay assays, it can be applied rapidly to genome-scale studies of protein–protein interactions.

Many of these high-throughput methods for investigating the biological context of genes and proteins are inherently noisy. For example, some proteins in yeast two-hybrid assays appear to detect a large number of spurious interactions (false-positives). Gene expression techniques suffer from a number of problems also, such as cross hybridization and poor signal-to-noise ratios. Recently however, research has shown that multiple datasets pertaining to the biological context of genes and proteins can be combined using machine learning techniques [8, 88]. Using Bayesian network analysis it is hence possible to computationally combine multiple noisy datasets in such a way that protein–protein interactions can be more reliably predicted. In this method each source of interaction evidence is compared against samples of known positive (proteins in the same complex) and negative (proteins in different cellular locations) interactions, allowing a statistical reliability index to be built for each data source. When this information is applied genome-wide, a prediction can be made for every protein pair in a genome by combining different sets of independent evidence according to their calculated reliability. Protein interactions predicted in this way have been shown to be as reliable as pure experimental techniques, while simultaneously covering a larger proportion of genes than most experimental methods [8].
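The idea of weighting each evidence source by its reliability against known positives and negatives can be illustrated with a naive Bayes calculation, which is the simplest special case of a Bayesian network and assumes the evidence sources are independent. The sketch below is an illustration only: the counts, the prior odds and the function names are invented, and it is not the implementation used in the cited studies.

```python
def likelihood_ratio(pos_with, pos_total, neg_with, neg_total):
    """Reliability of one evidence source: how much more often the evidence
    is seen among known interacting pairs than among known non-interacting
    pairs. Assumes neg_with > 0 (a pseudocount could be added otherwise)."""
    return (pos_with / pos_total) / (neg_with / neg_total)

def posterior_odds(prior_odds, ratios):
    """Combine independent evidence sources by multiplying their
    likelihood ratios onto the prior odds of interaction."""
    odds = prior_odds
    for lr in ratios:
        odds *= lr
    return odds

# Invented gold-standard counts for two evidence sources.
lr_coexpression = likelihood_ratio(300, 1000, 50, 10000)
lr_two_hybrid = likelihood_ratio(100, 1000, 10, 10000)

# Hypothetical prior: one true interaction per 600 candidate pairs.
odds = posterior_odds(1 / 600, [lr_coexpression, lr_two_hybrid])
print(lr_coexpression, lr_two_hybrid, odds)
```

A protein pair would then be predicted to interact when its posterior odds exceed a chosen cutoff, for example odds greater than 1. When evidence sources are not independent, a full Bayesian network that models their dependencies is the more appropriate tool.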

A number of available resources for protein–protein interaction data, gene expression data and Bayesian network analysis of multiple interaction datasets will be described below (see section Prediction of Protein Interactions from High-throughput Biological Datasets).

Data-Sources and Visualization Techniques

Computational biology is a data-rich research field. The advent of complete genome sequencing and high-throughput experimental techniques has created an enormous amount of data. In order for these data to be both informative and useful, they must be stored in a sensible and accessible way, and tools must be made available to visualize and exchange this information. A number of initiatives are tackling these problems by creating freely accessible databases storing a wide variety of biological information including protein–protein interactions. Recently, a number of research groups have created visualization tools for biological networks. These tools provide a new way to analyze protein–protein interaction networks, provide a multitude of different ways to represent interactions and can overlay other biological information onto these networks. A number of databases that store protein–protein interactions, molecular complexes and pathways will be described later (see section Data resources for Protein–Protein Interactions and Table 1). Finally, we will detail methods for the visualization and analysis of protein–protein interaction networks (see section Tools for Protein–Protein Interaction Visualization).

Methods

In this section we will describe computational resources and methods for the prediction of protein–protein interactions. These methods will be detailed in chronological order. Within each section a number of on-line computational resources are described that allow one to perform this type of analysis interactively. Resources mentioned in this section are further summarized below (see Table 1).

Structure Based Prediction of Interactions

The Protein–Protein Interaction Server at University College London (UCL) provides a simple web-based interface for exploring protein–protein interaction interfaces, given three-dimensional structures [18]. This server takes into account the following information for interaction analysis: accessible surface area, planarity, length & breadth, secondary structure, hydrogen bonds, salt bridges, gap volume, gap volume index, bridging water molecules and interface residues. This resource (Table 1) is very useful for exploring the protein–protein interaction potential of two protein structures identified through docking or shape-complementarity.

The structural bioinformatics group at EMBL Heidelberg provides the InterPreTS server for protein–protein interaction prediction [25]. Using this resource (Table 1) one can submit pairs of sequences that are then compared to the three-dimensional structures of known protein–protein interactions. This resource utilizes a pre-built Database of Interacting Domains (DBID) and an empirical scoring system to test whether a sequence pair fits a known three-dimensional structure of an interacting pair of proteins.

Gene-Neighborhood Based Interaction Prediction

Co-localization of genes across multiple genomes provides a fingerprint that they may physically interact [65]. Analysis of conserved gene locations across multiple genomes (Fig. 2a) can hence be used to predict protein interaction networks and metabolic pathways [89]. A number of excellent resources exist that allow one to determine whether two proteins may interact using this approach. The most notable of these are STRING (Search Tool for Recurring Instances of Neighbouring Genes) [27] and WIT (What is There?) [28]. The STRING database (Fig. 3) provides a web interface giving comprehensive access to gene neighborhood information [90] for 356,775 genes in 110 complete genomes. Similarly, the Predictome database at Boston University [29] provides a comprehensive web interface to predictions of this type. The WIT database provides access to protein family information, metabolic pathway reconstruction and gene co-localization information. Using these resources allows detailed pre-computed gene neighborhood information to be analyzed for evidence of protein–protein interaction (Table 1). The actual protocols used for these analyses can vary considerably; a general protocol adapted from WIT [28] is described below:

  1. In order to assess whether pairs of orthologous genes share a common gene neighborhood across multiple genomes one needs (a) protein sequences/genomic locations and (b) orthology mappings between proteins from multiple genomes.

  2. Orthology mappings are generated by searching for pairs of close bi-directional best hits (PCBBH). These are a specific form of bi-directional best hit (see section Phylogenetic Profile Based Prediction of Interaction), a commonly used method for orthology assignment. For a given pair of proteins α and β in genome X, a pair of close bi-directional best hits to genes α′ and β′ in genome Y is defined as follows:

     a. The best BLAST hit for protein α in genome X is protein α′ in genome Y.

     b. The best BLAST hit for protein β in genome X is protein β′ in genome Y.

     c. The genes of proteins α and β are situated within 300 bp in genome X.

     d. The genes of proteins α′ and β′ are situated within 300 bp in genome Y.

  3. Genes that satisfy the above criteria can be considered as having a conserved gene neighborhood across two genomes. When this procedure is repeated across multiple genomes it becomes possible to identify genes which are significantly co-localized across many genomes, and are hence likely to either physically interact or be functionally associated.

  4. The PCBBH criteria are quite strict, and it is also possible to perform the procedure using Pairs of Close Homologs (PCHs).

  5. Sets of PCBBHs or PCHs in multiple genomes are typically scored for significance based on the number and phylogenetic distribution of genomes in which they are co-localized. Phylogenetic distance can be estimated by examining a 16S rRNA phylogenetic tree.

  6. A common score (coupling score) for the likelihood that two genes interact is then calculated by summing the individual scores from multiple genomes.

  7. Finally, genes that have significant coupling scores are candidates for either physical interaction or functional association.
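Steps 2 and 3 of the protocol above can be sketched in Python for a single pair of genomes. The sketch rests on assumptions that go beyond the text: best hits are supplied as precomputed dictionaries, gene positions are (start, end) coordinates on one shared contig, strand is ignored, and all names and coordinates are invented. The coupling score of steps 5 and 6 is not implemented.

```python
from itertools import combinations

def bidirectional_best_hits(best_xy, best_yx):
    """Genes of X whose best hit in Y has them as its own best hit in X."""
    return {x: y for x, y in best_xy.items() if best_yx.get(y) == x}

def gap(pos1, pos2):
    """Intergenic distance in bp between two (start, end) gene positions;
    overlapping genes have a gap of 0."""
    (s1, e1), (s2, e2) = sorted([pos1, pos2])
    return max(0, s2 - e1)

def pcbbh(best_xy, best_yx, pos_x, pos_y, max_gap=300):
    """Pairs of close bi-directional best hits between genomes X and Y."""
    bbh = bidirectional_best_hits(best_xy, best_yx)
    pairs = []
    for a, b in combinations(sorted(bbh), 2):
        close_in_x = gap(pos_x[a], pos_x[b]) <= max_gap
        close_in_y = gap(pos_y[bbh[a]], pos_y[bbh[b]]) <= max_gap
        if close_in_x and close_in_y:
            pairs.append(((a, b), (bbh[a], bbh[b])))
    return pairs

# Invented best hits and gene coordinates for two genomes.
best_xy = {"xA": "yA", "xB": "yB", "xC": "yC"}
best_yx = {"yA": "xA", "yB": "xB", "yC": "xZ"}   # xC/yC is not reciprocal
pos_x = {"xA": (100, 1000), "xB": (1150, 2000), "xC": (2100, 3000)}
pos_y = {"yA": (5000, 5900), "yB": (6000, 6800), "yC": (9000, 9900)}

print(pcbbh(best_xy, best_yx, pos_x, pos_y))
```

Repeating the call for every pair of genomes and counting how often the same gene pair recurs gives a crude stand-in for the coupling score; a real score would also weight each genome pair by its phylogenetic distance, as described in step 5.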

Fig. 3

Screenshots from the STRING web resource. The left panel illustrates the STRING representation of gene neighborhood and gene fusion. The right panel shows a typical phylogenetic profile for multiple genes and genomes. Finally, the inset shows a predicted protein interaction map generated from gene neighborhood, gene fusion, and phylogenetic profile methods combined

Phylogenetic Profile Based Prediction of Interaction

Phylogenetic profile based prediction of protein interactions (Fig. 2b) has been shown to be an accurate and widely applicable method. Perhaps the easiest way to utilize this information for prediction of protein interaction is to use precomputed phylogenetic profiles for proteins of interest. The Clusters of Orthologous Groups (COGs) resource at the National Center for Biotechnology Information (NCBI) contains large numbers of profiles for a variety of bacterial and archaeal organisms and also S. cerevisiae [30, 91]. Other excellent resources for combined computational predictions of protein interactions using phylogenetic profiles are available from the STRING [27] resource at EMBL Heidelberg and from Predictome [29] (Table 1). Using the web interfaces to these resources, it is relatively straightforward to find groups of proteins with similar or identical phylogenetic profiles, indicating proteins that physically interact or are functionally associated (Fig. 3). For a more detailed analysis of specific proteins of interest a general protocol is described below:

  1.

    For each genome to be analyzed, a FASTA sequence file containing all protein sequences is assembled.

  2.

    All protein sequences in each genome are compared against all other sequences using a sequence similarity search algorithm such as BLASTp [92]. A variety of other sequence similarity search tools could also be used at this step.

  3.

    Orthology between proteins in different genomes is assigned as follows:

    • Two proteins (from different genomes) are considered orthologous if each is the other’s highest-scoring BLAST hit when searched against the other genome. This is frequently referred to as a bidirectional best hit (BBH).

    • This process is repeated to assign, where possible, an ortholog in every other genome to each protein in a given genome.

  4.

    All orthology assignments made in this way are stored for post-processing.

  5.

    A phylogenetic profile for a protein can then be constructed by representing the presence or absence of an ortholog for that protein across all genomes analyzed. Frequently, this is represented by a simple binary vector with ‘1’ indicating presence and ‘0’ representing absence of a gene in each genome (Fig. 2b).

  6.

    All profiles are compared to all other profiles using a clustering procedure. A similarity or distance measure (such as Pearson correlation or Euclidean distance) between each profile and all other profiles is used to group profiles according to how similar they are. Finally, protein profiles that are highly similar or identical to each other represent candidate proteins that physically or functionally interact.
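The protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `best_hits` table stands in for parsed BLASTp output, and the genome names, protein names, function names and the 0.9 correlation threshold are all hypothetical. A simple pairwise threshold is used here in place of a full clustering procedure.

```python
import math
from itertools import combinations

# Hypothetical best-hit tables parsed from BLASTp output:
# best_hits[(source_genome, target_genome)][protein] = top-scoring hit.
best_hits = {
    ("G0", "G1"): {"p1": "a1", "p2": "a2", "p3": "a9"},
    ("G1", "G0"): {"a1": "p1", "a2": "p2", "a9": "p1"},  # a9 prefers p1, so p3 has no BBH in G1
    ("G0", "G2"): {"p1": "b1", "p2": "b2"},
    ("G2", "G0"): {"b1": "p1", "b2": "p2"},
    ("G0", "G3"): {"p3": "c3"},
    ("G3", "G0"): {"c3": "p3"},
}


def has_bbh(protein, source, target, hits):
    """True if `protein` and its best hit in `target` are each other's best hit."""
    forward = hits.get((source, target), {}).get(protein)
    if forward is None:
        return False
    return hits.get((target, source), {}).get(forward) == protein


def build_profiles(proteins, source, targets, hits):
    """Binary presence/absence vector per protein, one entry per target genome."""
    return {
        p: [1 if has_bbh(p, source, t, hits) else 0 for t in targets]
        for p in proteins
    }


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def similar_profiles(profiles, threshold=0.9):
    """Protein pairs whose profiles correlate at or above `threshold`."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if pearson(profiles[a], profiles[b]) >= threshold
    ]


profiles = build_profiles(["p1", "p2", "p3"], "G0", ["G1", "G2", "G3"], best_hits)
candidates = similar_profiles(profiles)
```

In this toy data set, p1 and p2 share an identical profile and are reported as a candidate pair, while p3 has a complementary profile and is not.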

Gene Fusion Prediction of Protein Interactions

Gene fusion is a relatively common evolutionary phenomenon [26]. A detected gene fusion between two genes indicates that their protein products may physically interact or be involved in the same biological process or pathway [80, 81]. One extreme example of this is the aromatic amino acid biosynthesis pathway in S. cerevisiae. In yeast, a single fused gene encodes what are normally five separate genes in this pathway [80]. Prediction of protein interactions using gene fusion has been successful in a number of areas, including the prediction of novel protein interactions involved in important biological processes in Drosophila melanogaster [93].

A comprehensive set of fused genes and inferred protein–protein interactions is available from the AllFUSE database [26] at the European Bioinformatics Institute (EBI), the STRING database at EMBL Heidelberg [27] (Fig. 3) and the Predictome database at Boston University [29] (Table 1). Using the AllFUSE resource one can search for potential interactions for a given protein sequence from a database of 24 complete genomes. A general protocol for gene fusion based prediction of protein–protein interactions can be described as follows [80]:

  1.

    This analysis requires two genomes, a query and a reference. One searches for gene fusion (composite) proteins in the reference genome using protein sequences from the query genome. Sequences from both genomes need to be assembled into FASTA format for this analysis.

  2.

    Each protein in the query genome is then iteratively searched against each protein from the reference genome using a sequence similarity search tool such as BLASTp [92], with an expectation-value (E-value) threshold applied to eliminate similarities that may have arisen by chance.

  3.

    All significant similarities detected in this way are then stored in a binary matrix which for each protein pair stores ‘1’ for significant similarity or ‘0’ for no detectable similarity. The matrix may be symmetrified by post-processing with a more sensitive sequence search tool such as Smith-Waterman [94] to clear up ambiguities.

  4.

    Finding evidence of a gene fusion event in the reference species extends the previous symmetrification problem to one of transitivity [80]. In this case one searches for instances where query proteins A and B match a reference protein C, but do not match each other (i.e., A⇔C; B⇔C but A≠B). These triangular inequalities are resolved once again by using the more accurate Smith-Waterman algorithm to double-check that no detectable significant similarity exists between A and B. Further analysis using alignment geometry can then verify that proteins A and B are orthologous to different regions of a composite fusion protein but not to each other [95].

  5.

    Candidate fusion proteins detected in this way provide evidence that proteins A and B may physically interact.
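The transitivity test in steps 3 to 5 can be illustrated with a short Python sketch. It assumes that similarity searches have already been run and filtered by E-value; the `hits` tuples, the `similar_queries` set and all protein names are hypothetical, and the interval-overlap check is only a crude stand-in for the alignment-geometry analysis cited above.

```python
from itertools import combinations

# Hypothetical significant hits (already filtered by E-value):
# (query protein, reference protein, start, end of the match on the reference).
hits = [
    ("A", "C", 1, 200),
    ("B", "C", 220, 450),
    ("D", "E", 1, 300),
    ("F", "E", 10, 290),
    ("G", "H", 1, 150),
    ("I", "H", 160, 320),
]
# Hypothetical query pairs that show significant similarity to each other
# (for example, after a Smith-Waterman check).
similar_queries = {frozenset(("G", "I"))}


def regions_overlap(start1, end1, start2, end2):
    """True if two closed intervals on the reference protein share any residue."""
    return start1 <= end2 and start2 <= end1


def candidate_fusions(hits, similar_queries):
    """Return (query1, query2, composite) triples that support a fusion event."""
    by_reference = {}
    for query, ref, start, end in hits:
        by_reference.setdefault(ref, []).append((query, start, end))

    fusions = []
    for ref in sorted(by_reference):
        for (q1, s1, e1), (q2, s2, e2) in combinations(sorted(by_reference[ref]), 2):
            if frozenset((q1, q2)) in similar_queries:
                continue  # the two queries match each other: no fusion signal
            if regions_overlap(s1, e1, s2, e2):
                continue  # both queries match the same region of the composite
            fusions.append((q1, q2, ref))
    return fusions


fusions = candidate_fusions(hits, similar_queries)
```

Only the pair A and B survives both filters here: D and F match overlapping regions of E, and G and I are similar to each other.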

Although this method is not generally applicable to all genes and suffers from the high levels of paralogy usually present in eukaryotic genomes [96], it has been shown to have an accuracy as high as 90% and readily detects well-known interacting proteins (e.g., tryptophan synthase α and β subunits) and many proteins previously shown to form complexes. As such, this method represents a useful way to build interaction networks for proteins of interest within and across genomes.

Prediction of Protein Interactions from High-Throughput Biological Datasets

Gene expression analysis using microarrays allows all genes from a given genome to be placed on a single array, so that many gene-expression experiments can be carried out rapidly and in parallel. Recently, efforts have been made to standardize data formats for reporting the results of gene expression experiments. The Minimum Information About a Microarray Experiment (MIAME) [97] standard allows different laboratories to effectively and accurately exchange microarray expression information. Using such standards, it has become easier for a number of publicly accessible resources to distribute microarray data (Table 1).

The Stanford Microarray Database (SMD) [45] provides access to raw data from public microarray experiments, as well as a number of software tools for utilizing this data. Currently, 140 experiments are indexed in the SMD web resource. The MicroArray group at the European Bioinformatics Institute provides ArrayExpress [46], a publicly available collection of gene expression data in MIAME format for over 66 experiments, together with integrated tools for expression profile analysis. Finally, the Gene Expression Omnibus (GEO) [47] database at the NCBI contains data from over 300 large-scale publicly available microarray and SAGE experiments, for which all data is linked into the NCBI protein, nucleotide and genomic databases.

Using these resources, it is hence possible to select a number of datasets for an organism of interest, and extract gene expression profiles for some or all genes. Proteins whose genes exhibit very similar patterns of expression across multiple states or experiments [98] may then be considered candidates for functional association and possibly direct physical interaction [46]. Gene expression analysis becomes much more reliable with more expression data. For example, genes that have high correlation across 10 experiments are much more likely to be related functionally than genes correlating across two experiments. Gene expression data is relatively susceptible to noise, and great care must be taken to minimize and filter this from any analysis. This data can, however, be very powerful when combined with analyses involving regulatory network reconstruction, and with other methods of detection of functional association and interaction of proteins [8].
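As a rough illustration of this idea, the following Python sketch scores gene pairs by the Pearson correlation of their expression profiles and ignores pairs measured together in too few experiments. The expression values, gene names, function names and both cut-offs (`min_shared`, `threshold`) are hypothetical and would need to be tuned for real data.

```python
import math
from itertools import combinations

# Hypothetical expression profiles (e.g. log ratios) across five experiments;
# None marks a missing measurement.
expression = {
    "geneW": [1.0, 2.0, None, None, None],
    "geneX": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneY": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneZ": [5.0, 1.0, 4.0, 2.0, 3.0],
}


def shared_values(x, y):
    """Keep only the experiments in which both genes were measured."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]


def pearson(x, y):
    """Pearson correlation; 0.0 for fewer than two points or a constant vector."""
    if len(x) < 2:
        return 0.0
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpressed_pairs(profiles, min_shared=3, threshold=0.8):
    """Gene pairs correlated at or above `threshold` over enough shared experiments."""
    result = []
    for g1, g2 in combinations(sorted(profiles), 2):
        x, y = shared_values(profiles[g1], profiles[g2])
        if len(x) < min_shared:
            continue  # too few shared experiments to trust the correlation
        if pearson(x, y) >= threshold:
            result.append((g1, g2))
    return result


pairs = coexpressed_pairs(expression)
```

With the default `min_shared=3`, geneW is never paired because it was measured in only two experiments, even though its two values correlate perfectly with geneX. Lowering `min_shared` to 2 would admit such weakly supported pairs.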

The Bayesian networks approach (see section Biological Context Approaches), which combines data from multiple biological datasets, is a useful way to minimize this noise and perform reliable protein–protein interaction prediction in S. cerevisiae [8]. Validation of the method indicates that it can successfully recover large numbers of previously known protein–protein interactions (Fig. 4) and many novel interaction predictions. The results of this analysis are available from the GeneCensus web site at Yale University (Table 1). These predictions are remarkable as they illustrate that combining multiple independent and noisy datasets in an intelligent way does not necessarily increase noise in the combined protein interaction predictions (assuming orthogonal error between datasets). This is also an excellent example of a combined computational and experimental approach, as interactions predicted using this approach appear to be more reliable than those from many purely experimental approaches [8].
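To give a feel for how independent evidence can be combined, the sketch below uses the simplest form of Bayesian integration, a naive Bayes model, in which the posterior odds of interaction are the prior odds multiplied by one likelihood ratio per dataset. This is an assumption made for illustration and is not necessarily the network structure used in [8]; it relies on the datasets being conditionally independent (the "orthogonal error" condition mentioned above). The prior, the dataset names and the likelihood ratios are all hypothetical numbers.

```python
from math import prod


def posterior_odds(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds x product of per-dataset likelihood ratios.

    Each likelihood ratio is P(evidence | interacting) / P(evidence | not interacting).
    """
    return prior_odds * prod(likelihood_ratios)


def odds_to_probability(odds):
    """Convert odds in favour of interaction into a probability."""
    return odds / (1.0 + odds)


# Hypothetical prior odds and likelihood ratios for one protein pair.
prior = 1 / 600
evidence = {"coexpression": 20.0, "shared_function": 15.0, "essentiality": 4.0}

odds = posterior_odds(prior, evidence.values())
probability = odds_to_probability(odds)
```

No single dataset in this example overcomes the low prior on its own (the largest ratio gives odds of 20/600), but the three together raise the odds above 1, which is the sense in which combining noisy datasets can sharpen rather than blur a prediction.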

Fig. 4

Bayesian network predictions of protein–protein interactions. Experimentally validated gold-standard protein–protein interactions (blue and green lines) between S. cerevisiae proteins (green dots) are shown as an interaction network. Bayesian network analysis prediction of protein–protein interaction successfully recovers a significant subset of these interactions (green lines). The gold-standard interactions are derived from MIPS and well-known complexes are annotated

Tools for Protein–Protein Interaction Visualization

Network and pathway visualization tools are computer programs that can automatically generate a diagram of a network or pathway. Perhaps the simplest representation of a protein–protein interaction network is a graph composed of nodes (proteins) connected by edges (interactions). Some of the first visualization tools were developed for browsing metabolic pathways. For example, a pathway drawing tool is present in the ACeDB database [99] and in EcoCyc [31]. In many cases these representations are clickable so that one can select a member of a pathway or a small molecule and get further information about that entity. Many of these initial visualizations are static and generated semi-automatically. The Kyoto Encyclopedia of Genes and Genomes (KEGG) [33], BioCarta and SigPath [34] websites (Table 1) are examples of this type of visualization. Other more advanced methods can dynamically generate pathway diagrams from raw information in a biological database, such as the EcoCyc and WIT databases (see Table 1). Recently, many of the above databases have started providing their data in BioPAX or Cytoscape SIF format for easy visualization.

A number of fully automatic, general-purpose layout algorithms have been developed for visualizing biological networks; they organize a graph of nodes and edges into an aesthetically pleasing layout. In graph terms this usually means minimizing the number of edges that cross each other, and placing nodes that are highly connected to each other close together. Typically, a well-organized graph layout will allow the user to identify global features of their data that may not have been previously apparent. An example layout algorithm is the Spring Embedder algorithm. This method models the graph as a physical system where nodes are spheres connected by springs (edges). Nodes are initially placed at random, and the forces between connected spheres (due to the springs) push the system into a lower-energy, more stable state. Other methods, such as the weighted Fruchterman-Reingold algorithm [95], represent the graph as a system of nodes with an attractive force (similar to a spring) between nodes connected by an edge and a distance-dependent repulsive force between all nodes. Additionally, the weighted algorithm allows the attractive forces between nodes to be modulated using weights, and the energy of the entire system is controlled using a temperature function. Other layout algorithms can arrange nodes hierarchically, in a circular fashion, or in less structured formats. It is important to choose the best layout algorithm for the type of graph being visualized. For example, a highly connected interaction network will not assume a meaningful graph layout when a hierarchical layout algorithm is used.
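The force model described above can be sketched compactly in Python. This is a simplified illustration in the spirit of a weighted Fruchterman-Reingold layout, not the code used by any of the tools discussed here: the force formulas (repulsion k²/d, attraction weight × d²/k), the linear cooling schedule, the function name and the toy graph are all assumptions made for the example.

```python
import math
import random


def force_directed_layout(nodes, weighted_edges, iterations=200, k=1.0, seed=0):
    """Return {node: (x, y)} from a simple weighted force-directed layout."""
    rng = random.Random(seed)
    pos = {n: [rng.random(), rng.random()] for n in nodes}  # random start

    for step in range(iterations):
        temperature = k * (1.0 - step / iterations)  # simple linear cooling
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes, magnitude k^2 / d.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                d = max(math.hypot(dx, dy), 1e-9)
                force = k * k / d
                disp[u][0] += dx / d * force
                disp[u][1] += dy / d * force
                disp[v][0] -= dx / d * force
                disp[v][1] -= dy / d * force

        # Attraction along each edge, magnitude weight * d^2 / k.
        for u, v, weight in weighted_edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = max(math.hypot(dx, dy), 1e-9)
            force = weight * d * d / k
            disp[u][0] -= dx / d * force
            disp[u][1] -= dy / d * force
            disp[v][0] += dx / d * force
            disp[v][1] += dy / d * force

        # Move each node, capping the step length at the current temperature.
        for n in nodes:
            length = max(math.hypot(*disp[n]), 1e-9)
            step_len = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step_len
            pos[n][1] += disp[n][1] / length * step_len

    return {n: (x, y) for n, (x, y) in pos.items()}


# Hypothetical toy network: two triangles joined by one weaker edge.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 0.5),
]
layout = force_directed_layout(nodes, edges, seed=1)
```

The temperature caps how far a node may move in one iteration, so early iterations make large rearrangements and later ones only fine adjustments. Fixing the seed makes the layout reproducible.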

Visualization tools for biological networks include BioLayout [48], Cytoscape [49] and VisANT [50] (Table 1). BioLayout and Cytoscape are commonly used tools that are written in the Java programming language and are hence portable across a wide variety of computer environments. Both tools also allow the interactive editing of graphs, through the movement of nodes, node labeling and the ability to change the appearance of nodes and edges. Additionally, both tools can export publication-quality, high-resolution graph images. BioLayout utilizes the weighted Fruchterman-Reingold layout algorithm, and has a number of options for graph customization, data overlay, export and graph analysis, including conventional 2-D rendering (Fig. 5a) and hardware-accelerated 3-D rendering (Fig. 5b). Cytoscape provides a number of different layout algorithms for producing useful visualizations and a number of plugins and import options for representing data such as gene expression (Fig. 6). Specifically, circular, hierarchical, organic, embedded and random layouts are available. Circular and hierarchical algorithms lay out a network as their names suggest. Organic and embedded are two versions of a force-directed layout algorithm. Plugins currently available for Cytoscape include one that reads PSI files (see section Data Resources for Protein–Protein Interactions) and one called ActiveModules that finds regions of a molecular interaction network that are correlated across multiple gene expression experiments. Both of these tools are suitable for small to medium-sized networks (fewer than 1,000 nodes), although it may not be long before both layout and visualization techniques become available for the analysis of much larger graphs.

Fig. 5

Example graphs from BioLayout. (a) A genetic regulatory network of E. coli genes. Genes are represented by circles (nodes) connected by regulatory interactions represented by lines (edges). Nodes are colored according to biochemical pathway assignments. Nodes in the center of the graph are not labeled for clarity. (b) 3D expression network showing transcripts (nodes) connected to each other by virtue of their expression level correlation (edges) across multiple experiments

Fig. 6

Example graph from Cytoscape. A number of the important features of Cytoscape are represented in this graph layout. Nodes in this case represent genes and edges represent either genetic interactions (green, cyan), protein–protein interactions (blue), or protein-DNA interactions (red). Nodes are colored according to the gene expression of that gene in a Gal4 knockout experiment, with blue representing highly significant fold-change of a gene, and red indicating no significant fold-change. Node shapes are determined by the annotation of each gene: diamonds for signal transduction genes, triangles for meiosis, Pol III transcription, mating response, and DNA repair. Circles represent genes that were not assigned to any of these categories

Data Resources for Protein–Protein Interactions

Current computational and experimental methods for protein–protein interaction prediction have been generating large amounts of data. It is imperative that this data be stored in a consistent and reliable way so that it may be useful for biological research. A number of databases are now publicly available for making both protein interaction and pathway information accessible. The PathGuide database [100] is a web resource aiming to capture information about all these resources in one searchable list. Two of the largest and most comprehensive interaction databases now available are the Biomolecular Interaction Network Database (BIND) [36] and the Database of Interacting Proteins (DIP) [37]. DIP is based at UCLA and currently contains over 56,000 experimentally determined protein–protein interactions for over 19,000 proteins in 154 organisms. Interactions in DIP are curated both manually (by expert curators) and automatically (using text-mining approaches). BIND, at the University of Toronto, stores and curates not only pair-wise protein–protein interactions, but also molecular complex information and biological pathways. Currently, BIND contains over 188,000 protein–protein and protein-DNA interactions, 3,705 molecular complexes, and 8 pathways. Other databases include 3did [40], PIBASE [24], SCOPPI [41], iPfam [42], DIMA [43] and Prolinks [44].

A number of initiatives are currently underway to ensure that data from different interaction databases are stored in a consistent and exchangeable format [101]. The Proteomics Standards Initiative (PSI) [102] has created a standard format for the exchange of protein–protein interaction data (PSI MI), while the BioPAX format aims to capture protein–protein interactions, molecular complexes and pathway information in a single consistent ontology and exchange format (http://www.biopax.org). Another commonly used format is the Systems Biology Markup Language (SBML) [103]. Access information for DIP, BIND and a number of other interaction and pathway databases is detailed further below (see Table 1).