1 Introduction

Protein-protein interactions (PPIs) are involved in many fundamental cellular functions. Research on PPIs helps us to understand the molecular mechanisms of biological processes and may suggest new methods for practical medicine, so the study of PPIs is both necessary and urgent.

A large number of high-throughput experimental methods have been developed to detect PPIs, such as yeast two-hybrid (Y2H) screening [1, 2], immunoprecipitation [3], and protein chips [4]. However, these experiments are costly and time-consuming. Moreover, they yield high rates of false positives and false negatives, which makes it difficult to identify unknown PPIs by experimental methods alone.

In addition, many biological databases are available, such as BIND [5], DIP [6] and MINT [7]. Protein sequences far outnumber other types of data in these databases, so to use these sequence data efficiently it is necessary to develop computational methods that predict PPIs from protein sequences. In general, sequence-based computational methods have two main parts: feature extraction and sample classification [8,9,10].

In the first part, the Scale-Invariant Feature Transform (SIFT) [11, 12] is applied to extract features from the Position Weight Matrix (PWM) [13]. To reduce the effect of noise and shorten training time, Principal Component Analysis (PCA) is then used to reduce the dimension of the features.
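The dimension-reduction step can be illustrated with a minimal PCA sketch. It is an illustration under stated assumptions rather than the implementation used in this work: the function name `pca_reduce`, the random stand-in for SIFT descriptors, and the choice of 20 retained components are all hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-wise feature vectors onto their top principal components."""
    # Center each feature dimension so the SVD captures variance, not the mean.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Hypothetical stand-in for SIFT descriptors: 200 samples x 128 dimensions.
rng = np.random.default_rng(0)
descriptors = rng.standard_normal((200, 128))
reduced = pca_reduce(descriptors, n_components=20)
print(reduced.shape)  # (200, 20)
```

In practice the number of retained components would be chosen from the data, for example by the fraction of variance explained; the value above is only a placeholder.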

In the second part, the Weighted Extreme Learning Machine (WELM) [14, 15] is used to classify protein pairs as interacting or non-interacting based on their SIFT features. WELM has only two parameters to set, so the best parameters can be found quickly. Moreover, WELM offers good generalization performance.

In this paper, a novel computational method based on the SIFT algorithm and WELM is proposed to predict protein-protein interactions. Such predictions can provide insight into the molecular mechanisms of cells, help to explain the causes of some diseases, and may suggest new treatment methods in practical medicine.

2 Materials and Methods

2.1 Datasets

In our experiment, we collect the Yeast dataset from DIP [6]. After removing protein pairs whose sequences are shorter than 50 residues and filtering out protein pairs with sequence identity greater than 40%, we obtain 5594 positive protein pairs, and we construct 5594 negative samples according to the results in [16].

To demonstrate the generality of our approach, we also use a Human dataset: we collect 3899 protein pairs as the positive set after removing sequences with identity greater than 25%, and we construct 4262 negative protein pairs according to the work in [17]. In addition, the Helicobacter pylori (H. pylori) dataset consists of 1458 positive protein pairs and 1458 negative protein pairs according to the results of Martin et al. [18].

2.2 Scale-Invariant Feature Transform

Scale-Invariant Feature Transform (SIFT) is an algorithm widely used in computer vision to extract local features from images. SIFT was first introduced by Lowe in [11] and was summarized and refined in [12]. Because of its robustness to rotation, scaling, and changes in viewpoint, SIFT has been applied in fields such as face recognition, 3D modeling and template matching. In this paper, SIFT is used to extract features from the PWM.

2.3 Weighted Extreme Learning Machine

Extreme Learning Machine (ELM) [14] is a learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs) that is simple in theory but effective in practice. ELM only requires the number of hidden nodes to be set before use, and it produces a unique optimal solution, so it learns quickly and generalizes well. Weighted ELM (WELM) was proposed to handle data with imbalanced class distributions [15]; it retains the advantages of the original ELM and can be extended to cost-sensitive learning according to the user’s needs.
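The idea behind WELM can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in our experiments. It assumes a sigmoid activation, random Gaussian input weights, binary labels in {0, 1}, and the common weighting scheme in which each sample is weighted by the inverse of its class size. The names `welm_fit` and `welm_predict` and the toy data are hypothetical. The output weights are obtained in closed form as β = (I/C + HᵀWH)⁻¹HᵀWT, where H is the hidden-layer output matrix, W the diagonal sample-weight matrix and T the target vector.

```python
import numpy as np

def _hidden(X, W_in, bias):
    # Sigmoid hidden layer with fixed random parameters (assumed activation).
    return 1.0 / (1.0 + np.exp(-(X @ W_in + bias)))

def welm_fit(X, y, n_hidden=100, C=25.0, seed=0):
    """Fit a binary weighted ELM; y holds labels 0 (negative) or 1 (positive)."""
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((X.shape[1], n_hidden))  # never trained
    bias = rng.standard_normal(n_hidden)
    H = _hidden(X, W_in, bias)
    targets = np.where(y == 1, 1.0, -1.0)
    # Assumed weighting: each sample gets 1 / (size of its class).
    n_pos, n_neg = np.sum(y == 1), np.sum(y != 1)
    sample_w = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)
    HtW = H.T * sample_w  # H^T W without building the N x N diagonal matrix
    A = np.eye(n_hidden) / C + HtW @ H
    beta = np.linalg.solve(A, HtW @ targets)
    return W_in, bias, beta

def welm_predict(model, X):
    W_in, bias, beta = model
    return (_hidden(X, W_in, bias) @ beta > 0).astype(int)

# Toy imbalanced data (hypothetical): 20 positives and 80 negatives.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (20, 5)), rng.normal(-2.0, 1.0, (80, 5))])
y = np.array([1] * 20 + [0] * 80)
model = welm_fit(X, y, n_hidden=50, C=25.0)
train_acc = np.mean(welm_predict(model, X) == y)
```

Only the number of hidden nodes and the trade-off constant C need to be chosen, which is why parameter selection is quick compared with iteratively trained networks.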

2.4 Evaluation Criteria

In order to evaluate the performance of our method, we use the following evaluation criteria: accuracy, sensitivity, precision and Matthews correlation coefficient (MCC). They are calculated as

$$ \text{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP} \tag{1} $$
$$ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{2} $$
$$ \text{Precision} = \frac{TP}{TP + FP} \tag{3} $$
$$ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}} \tag{4} $$

where true positive (TP) is the number of true interacting pairs that are predicted correctly; true negative (TN) is the number of true non-interacting pairs that are predicted correctly; false positive (FP) is the number of true non-interacting pairs that are incorrectly predicted to be interacting; and false negative (FN) is the number of true interacting pairs that are incorrectly predicted to be non-interacting.

3 Results and Discussion

3.1 Evaluation of the Proposed Method

In our experiment, we use the same WELM parameters for all three datasets (Yeast, Human and H. pylori): L = 10000 and C = 25, where L is the number of hidden neurons and C is the trade-off constant [15]. Five-fold cross validation is employed to evaluate the performance of our method, which helps to reduce the risk of over-fitting and to assess the stability of the model [19]. Results of our method are shown in Tables 1, 2 and 3.
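The five-fold protocol can be sketched as follows. This is a minimal illustration assuming a simple random (non-stratified) partition; the helper names `five_fold_splits` and `summarize` are hypothetical, and the sample count in the usage lines is just the Yeast dataset size (5594 + 5594 pairs).

```python
import numpy as np

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])
        yield train_idx, folds[k]

def summarize(fold_scores):
    """Mean and (population) standard deviation of per-fold scores."""
    fold_scores = np.asarray(fold_scores, dtype=float)
    return fold_scores.mean(), fold_scores.std()

# Usage: partition the indices of 11188 samples into five folds.
splits = list(five_fold_splits(11188))
```

A classifier would be trained on each `train_idx` and scored on the matching `test_idx`; the per-fold scores are then passed to `summarize` to obtain the mean and standard deviation.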

Table 1. Five-fold cross validation results of our method applied on Yeast dataset.
Table 2. Five-fold cross validation results of our method applied on Human dataset.
Table 3. Five-fold cross validation results of our method applied on H. pylori dataset.

From the above tables, we can infer that the WELM classifier combined with SIFT descriptors predicts PPIs effectively, and the low standard deviations of the results indicate that our approach is robust. We attribute these results to the following reasons: (1) Compared with the raw sequence data, the corresponding PWM retains more prior information. (2) The SIFT descriptors extracted from the datasets retain abundant information about the protein pairs and are strongly resistant to noise. (3) WELM trains faster than traditional neural network algorithms while maintaining learning accuracy.

3.2 Comparison with SVM-Based Method

To further evaluate our method, we compare the results of the proposed approach with those of a widely used SVM classifier, LIBSVM, which was developed by Professor Chih-Jen Lin and colleagues at National Taiwan University [20]. From Table 4, we notice that WELM achieves better performance than SVM when performing classification on the Yeast, Human and H. pylori datasets. Thus we conclude that WELM outperforms SVM on these datasets.

Table 4. Performance comparison between the SIFT+WELM and the SVM prediction models.

4 Conclusions

Computational methods for predicting PPIs are becoming increasingly important because of their low cost and high efficiency compared with experimental methods. In this paper, we propose a novel prediction model that uses the scale-invariant feature transform and the weighted extreme learning machine to predict PPIs. Compared with the SVM-based method, our method increases the accuracy and greatly shortens the training time. The experimental results indicate that the proposed method is efficient, feasible and robust.