1 Introduction

Proteins interact with other proteins to carry out life activities. Studying protein-protein interaction (PPI) networks helps to understand the evolutionary process of life. Developing computational methods to predict and analyze PPI networks not only offers advantages over traditional experiments, but is also crucial for understanding biological functions. The physical interactions of proteins have not been completely identified, and the functions of many proteins remain unknown.

Protein-protein interactions (PPIs) have been studied from multiple perspectives [1]. In a PPI network, each protein is defined as a node, and an interaction between two proteins is defined as an edge. Some proteins have distinctive characteristics and special regions for interacting with other proteins.

A large number of studies show that the connectivity of PPI networks follows a power-law distribution [2]. This indicates that a few proteins, called hub proteins, are highly connected to other proteins, while most proteins interact with only a few partners [1]. Each hub protein has an original conformation for interacting in the PPI network. One research question is how a hub protein can interact with so many non-hub proteins. In some environments, external changes such as pH, temperature, partner concentration and ionic strength can transform the spatial conformation [3]. However, since the coverage of known natural PPIs is low, it is still questioned whether the topological structure of the PPI network can be represented correctly [4].

In a PPI network, a hub protein is a protein with a high number of connections, and it plays a key role in driving the evolution of genomes and genetic systems. However, the number of connections does not accurately reflect the role of a protein in the network, because hub proteins with the same or similar degree are not always equally important in the biological network.

Although the distribution characteristics of PPI networks have not been completely determined, highly connected proteins clearly have properties that let them play an important role [3]. For example, hub proteins might be crucial in drug design, so understanding them is necessary for modern drug discovery and development. Han [4] indicated that date hub proteins may be responsible for organizing biological modules in the PPI network.

One important characteristic of protein interaction interfaces is the contribution of each amino acid to the binding free energy. Many experiments have shown that the binding free energy is not uniformly distributed across a protein-protein interface. Only a small fraction of amino acid residues, called hot spots, contributes disproportionately to the binding free energy [5]. In addition, some researchers showed that hot spots can be obtained from the alanine mutation energy database [6] to create hot region models for analyzing the evolutionary mechanism. Many computational methods have been proposed for studying hot spot residues in protein-protein interfaces [7,8,9]. Hus [10] and Cukuroglu et al. [11] predicted hot regions in protein-protein interactions from different perspectives. Carles Pons [12] identified and analyzed the conformation of protein-protein binding regions.

A PPI depends on the distinct spatial conformation (3D structure) of the proteins involved. Thus, the 3D structure of proteins can help to analyze the characteristics and rules of the evolution and functions of life. Our research group has contributed to the prediction of protein 3D structure and of hot regions in protein-protein interactions. Lin [13] proposed a local adjust tabu search algorithm to predict the protein folding structure with the 2D and 3D off-lattice models. Zhang [14, 15] used different heuristic algorithms to predict the protein spatial structure in 3D. Nan [16] and Hu [17] predicted hot regions in protein-protein interactions using complex network and clustering methods.

Tuncbag et al. [18] defined hot spot residues as interface residues with ΔΔG ≥ 2.0 kcal/mol and non-hot spot residues as interface residues with ΔΔG < 0.4 kcal/mol. In addition, Ozlem Keskin [5], Reichman [19] and Ahmad et al. [20] defined hot regions by different methods. This paper defines a hot region as a set of at least three hot spot residues adjacent to each other in the protein-protein interaction interface. Cukuroglu [11] studied hub proteins from the structural perspective of interfaces and investigated how hot spots and hot regions are organized in hub proteins.
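
The two definitions above can be illustrated with a minimal Python sketch. It assumes that "adjacent" hot spots are given as explicit residue pairs and that a hot region is a connected group of at least three hot spots; the function names, the `contacts` input and the residue identifiers are hypothetical and are not taken from the cited works.

```python
from collections import deque

HOT_SPOT_MIN_DDG = 2.0      # kcal/mol, hot spot threshold quoted from [18]
NON_HOT_SPOT_MAX_DDG = 0.4  # kcal/mol, non-hot spot threshold quoted from [18]
MIN_REGION_SIZE = 3         # a hot region needs at least three adjacent hot spots


def label_residue(ddg):
    """Label an interface residue from its ddG value (kcal/mol)."""
    if ddg >= HOT_SPOT_MIN_DDG:
        return "hot spot"
    if ddg < NON_HOT_SPOT_MAX_DDG:
        return "non-hot spot"
    return "unlabeled"  # values between the two thresholds stay unclassified


def find_hot_regions(ddg_by_residue, contacts):
    """Group adjacent hot spots into hot regions.

    ddg_by_residue: dict mapping a residue id to its ddG value
    contacts: pairs of residue ids treated as adjacent (an assumption of this sketch)
    """
    hot = {r for r, ddg in ddg_by_residue.items() if label_residue(ddg) == "hot spot"}
    neighbors = {r: set() for r in hot}
    for a, b in contacts:
        if a in hot and b in hot:
            neighbors[a].add(b)
            neighbors[b].add(a)
    regions, seen = [], set()
    for start in sorted(hot):
        if start in seen:
            continue
        # Breadth-first search collects one connected group of hot spots.
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            r = queue.popleft()
            component.append(r)
            for nb in sorted(neighbors[r]):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(component) >= MIN_REGION_SIZE:
            regions.append(sorted(component))
    return regions
```

Under this reading, an isolated hot spot or a pair of adjacent hot spots does not form a hot region.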

The rest of this paper is organized as follows. Section 2 describes the classification of date hub and party hub proteins. Section 3 analyzes the hot region features. Section 4 presents and discusses the experimental results. Section 5 gives the conclusion and future directions.

2 Classification of Date Hub and Party Hub

2.1 Classification Based on Average PCC

Han [4] proposed two types of hub proteins based on the average Pearson correlation coefficient (PCC): date hubs and party hubs. Date hubs interact with different proteins at different times or locations, and they are the global connection points between different groups of biological functions. Party hubs interact with most of their partners at the same time. Therefore, identifying and analyzing date hub proteins is critical to discovering hidden biological information in PPI networks. Because the mRNA expression profiles of proteins linked by false-positive interactions are independent and uncorrelated [21, 22], hub proteins connected to other proteins by false-positive interactions [23] could be wrongly identified as date hubs. To reduce false positives, Han [4] obtained a high-quality intersecting dataset. For each hub protein, it is necessary to calculate the average PCC of the mRNA expression between the hub protein and its interaction partners.

$$ \text{Average\_PCC} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {PCC_{i} } $$
(1)
$$ PCC_{i} = \sum {(E_{x} E_{y} )/N} $$
(2)

where \( E_{x} \) and \( E_{y} \) are the mRNA expression values of the two interacting proteins in different conditions or samples, N is the number of samples, and n is the number of interaction partners of the hub protein. Party hubs are those with an average PCC higher than the threshold. All other hubs are defined as date hubs.
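
As an illustration of Eqs. (1) and (2), the following Python sketch classifies a hub from expression profiles. It computes the standard Pearson correlation coefficient, which coincides with Eq. (2) under the assumption that the expression values are standardized to zero mean and unit variance. The function names and the example threshold are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles.

    Profiles with zero variance are not handled and raise ZeroDivisionError.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_pcc(hub_profile, partner_profiles):
    """Eq. (1): mean PCC between a hub and each of its n interaction partners."""
    pccs = [pearson(hub_profile, p) for p in partner_profiles]
    return sum(pccs) / len(pccs)


def classify_hub(hub_profile, partner_profiles, threshold):
    """Party hub if the average PCC exceeds the threshold, otherwise date hub."""
    if average_pcc(hub_profile, partner_profiles) > threshold:
        return "party"
    return "date"


# Illustrative profiles: one partner is co-expressed, the other anti-correlated.
hub = [1.0, 2.0, 3.0, 4.0]
partners = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
label = classify_hub(hub, partners, threshold=0.5)  # 0.5 is an arbitrary example value
```

In this toy case the two correlations cancel, so the hub falls below the threshold and is labeled a date hub.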

2.2 Classification Based on Betweenness

Many studies have shown that, besides connectivity, betweenness is an important topological property in graph theory. Nodes with higher betweenness values may be the key nodes that control modules in the whole network. The betweenness of a node is the number of shortest paths in the network that pass through it. Proteins with high betweenness values are key nodes that connect multiple important biological pathways in PPI networks. Girvan and Newman [24, 25] proposed improved algorithms for computing betweenness. In this paper, we use the same definition as Yu [26]: proteins with high betweenness are defined as bottlenecks. To facilitate comparison and analysis, the proteins with the top 20% of betweenness values are selected as bottlenecks, which is consistent with Yu [27]. All proteins in a network can then be divided into four categories: hub-bottleneck nodes (HBN, high betweenness and high connectivity), non-hub-bottleneck nodes (NHBN, high betweenness and low connectivity), hub-non-bottleneck nodes (HNBN, low betweenness and high connectivity) and non-hub-non-bottleneck nodes (NHNBN, low betweenness and low connectivity).
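
The four categories can be sketched in Python as follows. The top 20% betweenness cutoff for bottlenecks follows the text; the cutoff for hubs is not stated here, so the sketch assumes the top 20% of nodes by degree. Ties are broken by node name, which is also an assumption, and the function names are hypothetical.

```python
def top_fraction(scores, fraction):
    """Return the nodes whose score ranks in the top `fraction` of all nodes."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:k])


def categorize(degree, betweenness, hub_fraction=0.2, bottleneck_fraction=0.2):
    """Assign each protein to HBN, NHBN, HNBN or NHNBN.

    degree, betweenness: dicts mapping each protein to its value
    hub_fraction: assumed cutoff for hubs (not specified in the text)
    bottleneck_fraction: top 20% betweenness, as stated in the text
    """
    hubs = top_fraction(degree, hub_fraction)
    bottlenecks = top_fraction(betweenness, bottleneck_fraction)
    labels = {}
    for node in degree:
        if node in hubs:
            labels[node] = "HBN" if node in bottlenecks else "HNBN"
        else:
            labels[node] = "NHBN" if node in bottlenecks else "NHNBN"
    return labels
```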

The connection degree and betweenness of all nodes in the network are calculated, and the betweenness is defined as

$$ {\text{Betweenness}}({\text{v}}) = \sum\nolimits_{{{\text{x}},{\text{y}} \in {\text{V}}}} {\frac{{\upsigma({\text{x}},{\text{y}}|{\text{v}})}}{{\upsigma({\text{x}},{\text{y}})}}} $$
(3)

where \( \upsigma({\text{x}},{\text{y}}) \) is the number of shortest paths between x and y, and \( \upsigma({\text{x}},{\text{y}}|{\text{v}}) \) is the number of those shortest paths that pass through node v.
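
Eq. (3) can be computed as in the sketch below for an unweighted, undirected network. It uses Brandes' accumulation scheme rather than the algorithms of [24, 25], and it assumes that each unordered pair of nodes is counted once. The adjacency-dictionary input format and the function name are hypothetical.

```python
from collections import deque


def betweenness(adjacency):
    """Eq. (3) for an unweighted, undirected graph given as {node: set_of_neighbors}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths starting at `source`.
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        preds = {v: [] for v in adjacency}
        sigma[source], dist[source] = 1, 0
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path fractions from the farthest nodes toward the source.
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {v: s / 2.0 for v, s in score.items()}
```

For a three-node path a-b-c, only the middle node lies on a shortest path between two other nodes, so it is the only node with non-zero betweenness.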

3 Hot Region Features and Classification

Cukuroglu [11] described hub proteins from the structural perspective of interfaces and defined different types of interfaces. In that study, interfaces were labeled by the types of the two interacting proteins, for example DD (interface between two date hubs), PP (between two party hubs) and NN (between two non-hub proteins), where D, P, N and X stand for date hub, party hub, non-hub and any protein, respectively [11].

3.1 Feature Selection

Feature selection is a crucial step of classification and has been widely used in the study of protein-protein interactions [28]. It helps to avoid overfitting and to improve the accuracy of the classification model, because it preserves the most relevant features of amino acid residues. The results of feature selection affect the reliability of the classification results. Many characteristics of amino acid residues need to be removed because they carry little useful biological information. The feature selection procedure yields the best feature subset for improving the performance of the classifier. In this paper, the properties of proteins are evaluated by the mRMR (minimum Redundancy Maximum Relevance) algorithm [29].

The basic idea of the mRMR algorithm is to satisfy the minimum redundancy and maximum correlation criteria. That is, the redundancy between features and the correlation between each feature and the class are measured by mutual information and combined into a single objective function. When the objective function is optimal, the correlation of the selected features with the class is maximal and the redundancy among them is minimal. Given two random variables X and Y, the mutual information is defined as

$$ I(X,Y) = \iint {p(x,y)\log \frac{p(x,y)}{p(x)p(y)}\,dx\,dy} $$
(4)
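
For discrete features, the integral in Eq. (4) becomes a sum over observed value pairs. The following Python sketch estimates it from paired samples using empirical frequencies and the natural logarithm; the function name and the plug-in estimator are assumptions of this illustration, not part of the mRMR reference [29].

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Discrete counterpart of Eq. (4), estimated from paired samples (in nats)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))      # counts of each observed (x, y) pair
    px, py = Counter(xs), Counter(ys)  # marginal counts
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
```

Two identical binary variables with balanced values share log 2 nats of information, while two independent ones share none.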

The maximum correlation criterion is defined as

$$ \max D(S,C),\quad D(S,C) = \frac{1}{\left| S \right|}\sum\nolimits_{{f_{i} \in S}} {I(f_{i} ,C)} $$
(5)

The minimum redundancy criterion is defined as

$$ \min R(S),\quad R(S) = \frac{1}{{\left| S \right|^{2} }}\sum\nolimits_{{f_{i} ,f_{j} \in S}} {I(f_{i} ,f_{j} )} $$
(6)

where S is the feature set and C is the target class. \( I(f_{i} ,C) \) is the mutual information between feature i and class C, and \( I(f_{i} ,f_{j} ) \) is the mutual information between feature i and feature j. Combining the two criteria above, the mRMR criterion is defined as follows.

$$ \max \Phi (D,R),\quad \Phi = D - R $$
(7)

This criterion indicates that the selected feature subset should be highly correlated with the class and have low redundancy. Assume that n features have already been selected as the feature subset \( S_{n} \); the next feature \( f_{j} \) is then selected from the remaining alternative features \( S - S_{n} \) so that it satisfies the following formula.

$$ \max_{{f_{j} \in S - S_{n} }} \left[ {I(f_{j} ,C) - \frac{1}{n}\sum\nolimits_{{f_{i} \in S_{n} }} {I(f_{i} ,f_{j} )} } \right] $$
(8)
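
The incremental selection of Eq. (8) can be sketched as a greedy loop. The sketch assumes that the mutual information values are precomputed, that the first term of Eq. (8) is the relevance of the candidate feature, and that the first feature is simply the most relevant one; the function name, argument layout and tie-breaking by lowest index are hypothetical.

```python
def mrmr_select(relevance, redundancy, k):
    """Greedy mRMR selection following Eq. (8).

    relevance[j]     : mutual information between feature j and the class
    redundancy[i][j] : mutual information between features i and j
    k                : number of features to select
    """
    if k <= 0 or not relevance:
        return []
    remaining = set(range(len(relevance)))
    # No feature is selected yet, so the first pick uses relevance only.
    first = max(remaining, key=lambda j: (relevance[j], -j))
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < k:
        def score(j):
            # Relevance minus the mean redundancy with the features chosen so far.
            mean_redundancy = sum(redundancy[i][j] for i in selected) / len(selected)
            return relevance[j] - mean_redundancy
        best = max(remaining, key=lambda j: (score(j), -j))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With this scoring, a feature that is almost a copy of an already selected one is ranked below a less relevant but complementary feature.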

3.2 Classification

Machine learning methods have been widely used in bioinformatics [30,31,32,33]. This paper uses support vector machines (SVMs) for classification. SVMs, developed by Vapnik, are a set of related supervised learning methods used for classification and regression. SVM classifiers have been increasingly used in computational biology to predict protein-protein interaction sites [34,35,36,37]. Protein-protein interaction interfaces can be classified as hub and non-hub interfaces according to the different conformations of their hot spot residues.

The prediction of hot regions in protein-protein interaction interfaces adopts 10-fold cross validation. The data set is separated into ten portions: nine portions form the training set and the remaining one is the test set, and the procedure is repeated so that each portion serves as the test set once. The average of the ten accuracy values is used as an estimate of the algorithm’s accuracy. The main parameters of the kernel function are determined by the cross validation.
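
The cross-validation procedure can be sketched independently of the classifier. In the Python sketch below, `fit` and `predict` are hypothetical callables standing in for SVM training and prediction, the shuffling seed is an arbitrary choice, and the data set is assumed to contain at least k samples.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, fit, predict, k=10, seed=0):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    folds = k_fold_indices(len(samples), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        # Train on the other k - 1 folds, then score on the held-out fold.
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Kernel parameters could be tuned by calling `cross_validate` once per candidate setting and keeping the setting with the highest average accuracy.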

4 Experimental Results and Evaluation

In the experiments, the proportions of the different kinds of proteins were measured while adjusting the threshold used to separate date hub proteins from party hub proteins. According to Fig. 1, under different thresholds, proteins of the hub-bottleneck node (HBN, high betweenness and high connectivity) class have the highest gene encoding ratio, so these proteins are the most likely to be date hub proteins. When the threshold equals 0.1, the proportion of essential genes among HBN proteins is the highest. Figure 2 gives the proportion of essential genes when the threshold equals 0.1.

Fig. 1. The proportion of four kinds of proteins with different thresholds

Fig. 2. The proportion with the threshold equal to 0.1

It is generally known that polar and hydrophobic interactions play an important role in protein-protein interfaces. Therefore, amino acids are divided into two categories: polar and non-polar. The former includes R, N, D, E, Q, H, K, S, T and Y, while the latter includes A, C, G, I, L, M, F, P, W and V. Before classification, the features used for classification must be assessed. Table 1 describes the alternative features, and Table 2 lists their assessment between different types of PPI. A feature is selected for classification if its value is smaller than 0.05.
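
The polar and non-polar grouping above can be written down directly. The short Python sketch below computes the fraction of polar residues in a string of one-letter amino acid codes; the function name and the choice to ignore unknown characters are assumptions made for illustration.

```python
POLAR = set("RNDEQHKSTY")      # polar amino acids listed in the text
NON_POLAR = set("ACGILMFPWV")  # non-polar amino acids listed in the text


def polar_fraction(residues):
    """Fraction of polar residues among the recognized one-letter codes."""
    counted = [r for r in residues.upper() if r in POLAR or r in NON_POLAR]
    if not counted:
        return 0.0
    return sum(r in POLAR for r in counted) / len(counted)
```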

Table 1. Description of the alternative features
Table 2. Assessment of the alternative features

Figure 3 shows that the hot spots of date hub proteins are more inclined to cluster into one or more hot regions, which is consistent with Cukuroglu’s conclusion. The results also reveal obvious differences between different types of interfaces.

Fig. 3. The distribution of the average fraction of hot spots in the hot regions

5 Conclusion

Hub proteins have high connectivity in the PPI network and are one of the most significant factors in biological systems. Nevertheless, connectivity alone cannot entirely explain a hub protein’s role in the network, because hub proteins with similar connectivity may not contribute equally to it. Therefore, betweenness is considered a stronger determinant for analyzing and understanding the protein network system. In addition, a strong link is found between hub proteins and the hot spot residues or hot regions in protein-protein interfaces. There are obvious differences between the interfaces of date hub proteins and party hub proteins. Future work will consider the structural properties and energy contributions of different categories of hub proteins.