1 Introduction

Protein–protein interactions (PPIs) play an important role in understanding biological functions such as cellular behavior, disease, and health. It remains challenging to find interactions among proteins and their complexes. Identifying protein complexes (PCs) from protein–protein interaction networks (PPINs) is an important task in the biomedical natural language processing (BioNLP) field [1]. Most proteins achieve their functions by interacting with other proteins, so it is important to discover these interactions to understand biological functions in the body. A protein complex consists of multiple proteins that (stably) interact with each other; proteins interact to form PCs and to carry out their biological functions. Interacting proteins together form the PPIN, and PCs are significant objects for recognizing the organization and behavior of cellular functions within it. Therefore, correctly classifying the interaction network among proteins in an organism is a significant task, as it helps interpret the molecular mechanisms underlying given biological functions [2].

Although protein interactions are extraordinarily diverse, all protein interfaces share certain common properties. Based on these common properties, different experimental and computational techniques have been developed to detect protein complexes in PPINs. Experimental techniques use different biochemical [3], genetic, and physical methods such as DNA [4], yeast two-hybrid (Y2H) [5], tandem affinity purification (TAP) [6], mass spectrometry (MS) [7], protein microarrays [8] and phage display [9] for analyzing protein interactions. Other approaches focus on monitoring and illustrating specific physicochemical and biochemical properties of a PC. With the development of these high-throughput experimental methods, a large amount of PPI data has been produced. However, only a small fraction of protein complexes has been detected due to the limitations of these methods, and they still do not provide an effective and easy way to detect proteins in the same complex [10]. In addition, datasets from high-throughput methods frequently suffer from false positives and false negatives. Therefore, detecting PCs from PPIN data using computational approaches provides an alternative way.

There are several protein interaction and complex databases curated from expert consensus knowledge and primary experiments, such as the Biological General Repository for Interaction Datasets (BioGRID) [11], the Database of Interacting Proteins (DIP) [12], the comprehensive resource of mammalian protein complexes (CORUM) [13], the Mammalian Protein–Protein Interaction Database (MIPS) [14], the catalog of yeast protein complexes (CYC2008) [15], the Molecular Interaction Database (MINT) [16], KEGG [17], STRING [18], the Universal Protein Knowledgebase (UniProt) [19] and Reactome [20]. These databases contain high-throughput data and information about protein interactions, protein complexes of different species and the amino acid sequences of different proteins.

1.1 Density-Based Unsupervised Methods

Proteins within the same PC interact heavily with each other, so in the PPI network PCs generally correspond to densely connected subgraphs. Based on this intuition, various computational algorithms have been devised. An overview of these methods is given below.

MCODE [21] produces both small- and large-size complexes. It initially selects a vertex of high local weight and then repeatedly adds neighboring vertices of similar weight to grow clusters. CFinder [22] forms clusters of varying sizes. It identifies fully connected subgraphs (cliques) of different minimum sizes and merges subgraphs based on their ratio of shared members, so each node can be a member of several groups of different sizes. CFinder gives results that differ significantly with changing minimum clique size. MCL [23] works on both weighted and unweighted graphs. It detects protein complexes in these graphs by simulating random walks, expanding and inflating iteratively. DPClus [24] is a cluster-periphery tracking algorithm that detects PCs by keeping track of the periphery of an identified cluster.

Li et al. [25] reduced the number of parameters of DPClus and used two topological constraints, core and attachment, to detect protein complexes in PPINs. Restricted Neighborhood Search Clustering (RNSC) detects PCs based on two properties, i.e., graph-theoretical and gene-ontological. Leung et al. proposed a statistical framework to find PC cores using the CORE algorithm, in which CORE ranks the predicted complexes by assigning them scores. Wu et al. presented a core-attachment method that detects PCs by identifying the complex core and its attachment proteins [2]. Similarly, Dong et al. [26] proposed a search method based on a local-structure score function that detects PCs in backward and forward directions.

1.2 Feature Extraction and Selection-Based Supervised Learning Methods

The methods discussed in the previous section are based on the intuition that PCs form dense subgraphs in PPINs. Nevertheless, other topological properties are also significant for predicting PCs from PPINs: proteins do not only form cliques, as other shapes of PCs, i.e., linear, hybrid and star structures, are also found in PPINs. Various supervised learning methods have been devised that identify PCs of different structures from PPINs using a number of topological features such as degree, topological coefficient, clustering coefficient, etc. Subgraph topology is important for complex prediction, but the information contained within a protein's amino acid sequence is also very important, as these sequences specify the structure of proteins. Few computational approaches integrate biological properties computed from protein sequences with topological properties for complex prediction from PPINs. A detailed overview of these methods is given below.

Qi et al. [27] presented a supervised graph local clustering framework. This framework predicts PCs based on topological properties, including cluster size, cluster density, cluster degree, topological coefficient and clustering coefficient, and a biological property, the amino acid frequency of identified complexes. Yu et al. [28] presented CDIP to discover protein complexes based on integrated properties such as graph size, graph density, degree statistics, clustering and topological coefficient statistics, and amino acid frequency; amino acid frequency is measured as the biological property, while the diameter and density of the network are measured as topological constraints. Quan et al. [29] proposed ClusterEPs, which uses a compositional score based on emerging patterns to measure the likelihood of a subgraph being a complex, arguing that topological features alone are not enough for effective detection of PCs. Their method uses an integrative score to measure the likelihood that a subgraph forms a protein complex, so new protein complexes can be grown from a seed protein by updating the score iteratively. On the MIPS dataset, they reported a precision, recall and F1-measure of 0.638, 0.769 and 0.696, respectively.

Jiancang Zeng et al. [30] proposed a three-step framework: in the first step, a negative dataset is collected from an off-the-shelf database; in the second step, the n-gram frequency method is used to preprocess the PPI sequences; the third step concatenates the final features and then finds the optimal features among the computed ones. Lastly, the selected features are passed to a Random Forest classifier. Experiments on a real dataset show that the fusion method outperforms alternatives and yields encouraging results that are helpful for future research on protein function.

Many studies have worked on extracting the important features and reducing the size of the feature vector to improve model performance [31,32,33]. For extracting important features and reducing the dimension of a dataset, Quan Zou et al. [34] proposed a feature ranking method known as Max-Relevance-Max-Distance (MRMD). MRMD balances the stability and accuracy of the prediction task and feature ranking. The method was tested on two different datasets to demonstrate its efficiency on big data: the first is a high-dimensional image classification benchmark, and the second is protein–protein interaction prediction data from the authors' previous research. Experiments showed that the proposed method maintains accuracy on both datasets and, unlike other filter and wrapper methods such as information gain and mRMR, MRMD runs faster.

Shao Wu Zhang et al. [35] predicted PPIs with a new technique based on amino acid distance frequency combined with principal component analysis (PCA) [36] and physicochemical characteristics. First, based on four types of physicochemical property values, the twenty fundamental amino acids were split into three groups. Second, a feature extraction process based on distance frequency was presented for representing protein pairs, and the feature vectors obtained with the four physicochemical characteristics were combined to create distinct feature vector sets. Third, PCA was used to reduce the vector dimension, and a support vector machine (SVM) was adopted as the classifier. The overall prediction accuracy across the four physicochemical characteristics is 91.79%. The findings indicate that this strategy is very capable of PPI prediction and can be a helpful tool in the relevant fields.

In 2018, Aisha Sikandar et al. [37] used topological features including closeness centrality, average shortest path length, betweenness centrality, clustering coefficient, degree, neighborhood connectivity, eccentricity, radiality, stress centrality, self-loops, topological coefficient, density and size, together with 50 spectral domain biological features. They used different versions of decision trees, such as CART, C4.5 and breadth-first search, with CART outperforming the other models. Similarly, Misba et al. [61] worked on gene–disease association prediction using biological and topological features; they used the same CORUM human and mouse complexes that are used in this study.

In 2019, Aisha Sikandar et al. [38] predicted PPIs with subgraph topology and sequence entropy. Sequence length and entropy were calculated to capture the information contained within protein sequences. Interacting proteins form diverse subgraph topologies; using a logistic model tree, biological characteristics were incorporated with topological subgraph characteristics and complexes were modeled. The experimental findings showed that this technique outperforms four other state-of-the-art methods in terms of the number of known protein complexes detected. This framework also offers a perspective for future biological research and may be useful in anticipating other kinds of subgraph topologies. Similarly, in 2020, Faridoon et al. [39] proposed a supervised method to predict protein complexes using biological and topological features. They integrated SVM with error-correcting output coding (ECOC) to optimize performance based on the idea of autodetecting multiple protein complexes. The features used by the comparative studies are shown in Table 1.

Table 1 Overview of features used by the comparative studies

Most of the above-mentioned unsupervised learning methods are based on the intuition that proteins combine to form dense subgraphs in PPINs, whereas supervised learning methods mine not only cliques but also linear, star and hybrid structures as PCs by computing a wide variety of topological features such as degree, clustering coefficient, topological coefficient, diameter, etc. Few methods take into account the important biological information contained within protein amino acid sequences in the form of frequency, count vectorizer, fast Fourier transform and Kidera factors as biological features. By incorporating biological along with topological features, these methods achieve better performance than previous studies. Therefore, in this paper, we have computed a wide variety of topological features, given in Table 2, and integrated them with biological features. Among the biological features, we have computed spectral domain features, physicochemical properties and bag-of-words features. Another limitation of recent studies is that they have not used feature selection: combining different features leads to high-dimensional data, which can cause the curse of dimensionality. We have used a sequential forward feature selection method (SFFS), i.e., random-forest-based Boruta feature selection [40], in which the best features are selected before being given to the model for training. As not all extracted features are important, we selected the important features, which leads to more accurate results.

In this paper, we used gradient boosting, decision tree and linear discriminant analysis classifiers trained on human, mouse and yeast complex databases. Our methodology shows a significant improvement in protein complex prediction. Our experiments show that linear discriminant analysis performs better on both large and small databases with the best features selected by SFFS from the list of computed biological and topological features.

2 Methods

The protein–protein interaction network can be represented as a graph G = (V, E), where V represents the vertices (proteins) and E represents the edges (interactions) between proteins. The subgraphs in the PPIN can be represented as \(G_s = (V_s, E_s)\), and together these subgraphs make up the complete graph. To predict protein complexes, we downloaded the hand-curated yeast complexes datasets from CYC2008 [15] and MIPS [14]. The human and mouse protein complexes datasets were taken from MIPS CORUM [13], and the amino acid sequences for these genes were taken from UniProtKB [19]. We downloaded the protein interaction data from DIP and created four PPINs: two for the yeast databases, one for the human database and one for the mouse database. Then, the feature selection method was applied. From the interaction networks, we computed 13 topological features; from protein amino acid sequences, a total of 150 features were computed. All computed features are listed in Table 2. The extracted 163 features were passed to SFFS, i.e., the random-forest-based Boruta feature selection algorithm, and the top-ranked features, including amino acid G, physicochemical features 6, 26, 41, 21 and 61, density, size, topological coefficient, neighborhood connectivity, average shortest path length, closeness centrality, eccentricity and radiality, were selected to train the model. Figure 1 and Algorithm 1 present the flow diagram and the pseudocode of our methodology, respectively.

Fig. 1 Flow diagram of the proposed methodology

Algorithm 1 Pseudocode of the proposed technique

2.1 Feature Extraction

For complex detection, we have used topological features and biological features such as spectral domain, bag-of-words and physicochemical features. For computing topological features, we created four PPINs: two for yeast, by considering the CYC2008 and MIPS benchmark complexes with DIP interaction data, and one each for human and mouse, by considering complexes from CORUM and interaction data from DIP. Complexes of size greater than 3 were considered. To compute biological features, the amino acid sequences of proteins in the PPINs were considered. We calculated a total of 13 topological features and 150 biological features; the input parameter, name and category of these features are given in Table 2.

Density-based unsupervised learning methods rest on the intuition of mining dense subgraphs from PPINs as complexes and use density as the main feature. These methods are unable to mine other types of complex structures found in PPINs, i.e., star, linear and hybrid. Therefore, feature extraction and selection-based algorithms have been devised that incorporate a variety of topological features for mining not only cliques but also star, linear and hybrid complexes from PPINs. These methods are detailed in Sect. 1.2.

Table 2 List of computed topological and biological features

2.1.1 Topological Features

Protein–protein interaction networks can be represented as graphs in which proteins are nodes, denoted by V, and the connections between them are edges, denoted by E [41]. We have computed several topological features from the PPINs of yeast, mouse and human; these features are detailed below, and a consolidated code sketch computing them follows the density definition at the end of this list.


Average shortest path length: for a vertex V, the average of the shortest path lengths L(V, E) to all other reachable vertices E in the network.


Closeness centrality: It measures the average distance of a node to all other nodes; nodes with a high closeness score have the shortest distances to every other node and can disperse information across a network very efficiently. This feature is useful in mining stars and cliques. Closeness centrality is defined in the following equation.

$$\begin{aligned} C_c (V)=1/{\text {avg}}(L(V,E)), \end{aligned}$$
(1)

where \(C_c(V)\) is the closeness centrality of vertex V and L(V, E) indicates the length of the shortest path between vertices V and E. The value of closeness centrality lies between 0 and 1.


Betweenness centrality: The betweenness centrality of a node is the number of shortest paths between all pairs of nodes that traverse it. It shows the degree to which nodes stand between each other in a PPIN; the central nodes of clique, star and hybrid complexes have higher betweenness centrality, so this characteristic is useful in complex mining. Betweenness centrality is defined in the following equation.

$$\begin{aligned} C_b (V)=\sum _{g\ne V\ne d}\sigma _{gd} (V)/\sigma _{gd}, \end{aligned}$$
(2)

where \(C_b(V)\) is the betweenness centrality of vertex V, and g and d are vertices in clusters different from that of V. \(\sigma _{gd}\) represents the number of shortest paths from vertex g to vertex d, and \(\sigma _{gd}(V)\) represents the number of those shortest paths that pass through V.


Clustering coefficient: The clustering coefficient is defined in the following equation.

$$\begin{aligned} C_d=2e_d/(K_d (K_d-1)), \end{aligned}$$
(3)

where \(C_d\) is the clustering coefficient of vertex d, \(K_d\) is the number of neighbors of vertex d and \(e_d\) is the number of connected pairs among all the neighbors of vertex d.


Degree: The total number of vertices connected by an edge to vertex V is referred to as degree V.


Eccentricity: Vertex V's eccentricity is the maximum finite shortest-path length between vertex V and any other vertex in the network.


Neighborhood connectivity: The number of a vertex's neighbors is called its connectivity. The neighborhood connectivity of vertex V is the average connectivity of all its neighbors.


Radiality: Radiality is a centrality index of a vertex V that lies between 0 and 1. To compute radiality, the average shortest path length of vertex V is subtracted from the diameter of the connected component plus 1, and the result is divided by the diameter of the connected component. The diameter is the number of edges in the shortest path between the furthest pair of vertices in a cluster.


Self-loops: A self-loop is an edge that connects a vertex to itself.


Stress centrality: The stress of a vertex V is the number of shortest paths that pass through it. A vertex has a high stress value if it is traversed by a large number of shortest paths.

Topological coefficient: The topological coefficient is defined in the following equation.

$$\begin{aligned} T_V = {\text {avg}}( J(V,E) ) / K_v, \end{aligned}$$
(4)

where \(T_V\) is the topological coefficient of a vertex V and \(K_v\) is the number of neighbors of vertex V. J(V, E) denotes the number of neighbors shared between vertices V and E; its value is incremented by 1 if there is a direct link between V and E.


Size: Size can be defined as the number of vertices in a cluster.


Density: The density of a cluster G is defined in the following equation.

$$\begin{aligned} {\text {den}}(G)=(2 \times |E|)/(|V| \times (|V|-1)), \end{aligned}$$
(5)

where |E| represents the number of edges and |V| the number of vertices in cluster G.
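The features above can be computed directly from a PPIN with a standard graph library. Below is a minimal sketch, assuming the NetworkX library and a hypothetical edge list; radiality, which has no NetworkX builtin, is derived from the diameter and average shortest path length exactly as defined above.

```python
# Sketch of the topological feature computation, assuming NetworkX;
# the edge list and cluster membership are hypothetical placeholders.
import networkx as nx

edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]  # hypothetical PPIN edges
G = nx.Graph(edges)

features = {}
features["degree"] = dict(G.degree())
features["closeness_centrality"] = nx.closeness_centrality(G)      # Eq. 1
features["betweenness_centrality"] = nx.betweenness_centrality(G)  # Eq. 2
features["clustering_coefficient"] = nx.clustering(G)              # Eq. 3
features["eccentricity"] = nx.eccentricity(G)                      # longest finite shortest path
features["neighborhood_connectivity"] = nx.average_neighbor_degree(G)
features["self_loops"] = {v: int(G.has_edge(v, v)) for v in G}

# Cluster-level features for one candidate complex (a subgraph)
cluster = G.subgraph(["P1", "P2", "P3"])
size = cluster.number_of_nodes()
density = nx.density(cluster)  # Eq. 5: 2|E| / (|V|(|V|-1))

# Radiality: (diameter + 1 - average shortest path length of V) / diameter
diam = nx.diameter(G)
spl = dict(nx.shortest_path_length(G))
radiality = {v: (diam + 1 - sum(spl[v].values()) / (len(G) - 1)) / diam
             for v in G}
```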

2.1.2 Biological Features

Protein structure is defined by the composition of amino acid sequences, so investigating the amino acid sequence is sufficient to understand the interacting properties of a protein in performing any kind of biological function [23]. Therefore, it is important to consider amino acid sequences when classifying protein complexes. The following key features can be used to predict complexes based on the information within protein amino acid sequences.


Spectral domain features: The fast Fourier transform (FFT) is a very useful method for capturing the nature of data [42]. Previously, FFT has been used in the frequency domain to discover patterns, and in bioinformatics to classify proteins, but not to classify complexes in protein–protein interaction networks. As amino acid sequence information is enough to understand the structure of proteins, applying the FFT to these sequences lets us extract features that can be useful for predicting protein complexes [36].

To get the set of spectral domain features, the FFT is applied to protein amino acid sequences. To apply the FFT, each amino acid in a sequence is first converted to its molecular weight, and the resulting signal is standardized using the following equation before the FFT is computed.

$$\begin{aligned} x_{jk}= (x_{jk}- \bar{x}_k)/\sigma _k. \end{aligned}$$
(6)

where \(\bar{x}_k\) and \(\sigma _k\) are the mean and standard deviation of the kth signal. Computing the FFT requires sequences of equal length, but amino acid sequences differ in length; therefore, principal component analysis [43] is used to reduce the dimension of the amino acid sequences and equalize their lengths. The lowest 50 FFT coefficients of the generated spectrum are retained as spectral domain features.
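As a concrete illustration, the following sketch converts a sequence to a molecular-weight signal, standardizes it as in Eq. 6 and keeps the lowest 50 FFT coefficients. The molecular weights are approximate values for the 20 amino acids, and zero-padding to a fixed length is used here as a simplifying stand-in for the PCA-based length equalization described above.

```python
# Minimal sketch of spectral domain feature extraction; zero-padding
# replaces the paper's PCA-based length equalization as a simplification.
import numpy as np

MW = {  # approximate molecular weights (Da) of the 20 amino acids
    "A": 89.1, "R": 174.2, "N": 132.1, "D": 133.1, "C": 121.2,
    "E": 147.1, "Q": 146.2, "G": 75.1, "H": 155.2, "I": 131.2,
    "L": 131.2, "K": 146.2, "M": 149.2, "F": 165.2, "P": 115.1,
    "S": 105.1, "T": 119.1, "W": 204.2, "Y": 181.2, "V": 117.1,
}

def spectral_features(sequence, n_coeffs=50, length=512):
    """Map a protein sequence to its lowest n_coeffs FFT magnitudes."""
    signal = np.zeros(length)
    weights = [MW[aa] for aa in sequence[:length] if aa in MW]
    signal[:len(weights)] = weights
    signal = (signal - signal.mean()) / signal.std()  # standardization, Eq. 6
    spectrum = np.abs(np.fft.fft(signal))
    return spectrum[:n_coeffs]  # lowest 50 coefficients

features = spectral_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```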

2.1.3 Bag of Words Features

As amino acid sequences are collections of 20 different characters, a bag of words can be applied to them. In NLP, bag of words is a powerful technique for converting text into features and has many applications such as sentiment analysis, document classification and analogical reasoning [44]. We use a count vectorizer [45, 46] at the character level to convert our amino acid sequences into features. The count vectorizer converts text to features by using the characters as labels and assigning the count of each character in a sequence to its corresponding label. Every protein's amino acid sequence has a different length and character counts, which can distinguish proteins and be used to predict protein complexes.
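A minimal sketch of this step with scikit-learn's CountVectorizer is shown below; the sequences are hypothetical placeholders.

```python
# Character-level bag-of-words features for amino acid sequences.
from sklearn.feature_extraction.text import CountVectorizer

sequences = ["MKTAYIAKQR", "GAVLIMCFYW"]  # hypothetical amino acid sequences
vectorizer = CountVectorizer(analyzer="char", lowercase=False)
X = vectorizer.fit_transform(sequences)   # one count column per amino acid character

print(vectorizer.get_feature_names_out())  # the amino acid "vocabulary"
print(X.toarray())                         # per-sequence character counts
```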

2.1.4 Physicochemical Features

The physicochemical properties of proteins are determined by analogous properties of their amino acid sequences, and these properties can be used to predict complexes more accurately [47]. There are more than 554 physicochemical properties [48,49,50,51], but feature extraction using all of them leads to the curse of dimensionality and high computational cost. Fortunately, Gaurav et al. [52] ranked physicochemical properties based on frequency counts over a whole dataset and showed that not all properties are important for predicting PPI complexes. Four properties that play an important role in PPI complex detection are polarity (P), hydrophobicity (H), polarizability (Z) and van der Waals volume (V). We selected these four properties to extract physicochemical features from amino acid sequences. To integrate the physicochemical properties, we used PSI-BLAST to create a position-specific scoring matrix (PSSM), which defines the probability of each amino acid occurring at a specific position in a sequence, and multiplied the properties with that matrix as in Eq. 7.

$$\begin{aligned} F_n=S*R_n, \end{aligned}$$
(7)

where \(S = (S_i^j)\), \(i = 1, 2, \ldots , M\), \(j = 1, 2, \ldots , 20\), is the \(M \times 20\) PSSM of the query protein sequence, M is the length of the protein sequence and 20 denotes the 20 amino acids. \(R_n\) is the nth physicochemical property and \(F_n\) (n = 1, 2, 3, 4) is the feature obtained for the nth physicochemical property.
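The sketch below illustrates one plausible reading of Eq. 7: each property \(R_n\) is a 20-dimensional vector (one value per amino acid) multiplied with the \(M \times 20\) PSSM. The PSSM and property scales here are random placeholders; in practice, the PSSM comes from PSI-BLAST and the scales from published property tables.

```python
# Sketch of Eq. 7, F_n = S * R_n, with placeholder data; a real pipeline
# would substitute a PSI-BLAST PSSM and published P/H/Z/V property scales.
import numpy as np

M = 120                                   # hypothetical protein length
rng = np.random.default_rng(0)
S = rng.random((M, 20))                   # placeholder M x 20 PSSM

R = rng.random((4, 20))                   # placeholder scales for P, H, Z and V

F = [S @ R[n] for n in range(4)]          # each F_n is an M-dimensional profile
```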

2.2 Feature Selection

After calculating the topological and biological features, we applied SFFS, i.e., random-forest-based Boruta feature selection, to mine the best features from the computed ones. The Boruta algorithm builds on the idea of Stoppiglia et al. [53]: it extends the information system with shadow features, artificial attributes generated by shuffling the order of values in the original data, and then uses the shadows' importance scores to judge the scores obtained by the real features.

In each iteration, Boruta checks which features have attained greater importance than the best shadow; these hits are counted for each feature until their number is either significantly higher or lower than expected at random, using a 0.01 p value cut-off by default. A feature whose hit count is significantly higher than expected is confirmed, while one whose hit count is significantly lower is rejected and removed from the information system. After every iteration, the shadow features are reshuffled, and this process is repeated until all relevant features are decided or a set limit of iterations is reached. In this feature selection method, the k best features are selected from the set of n features on which the classification model gives the best accuracy. Let A be the set of all features.

\(A = \{a_1, a_2, \ldots , a_n\}\).

Boruta takes A as the input feature set and returns an optimal feature set B from A. The set B is a subset of A containing the features selected by random-forest-based Boruta.

\(B = \{b_j \mid j = 1, 2, \ldots , k;\ b_j \in A\}\).
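A minimal sketch of this step using the BorutaPy package is given below; the feature matrix X (163 columns) and label vector y are assumed to have been prepared as NumPy arrays as described in Sect. 2.1.

```python
# Random-forest-based Boruta feature selection; X and y are assumed inputs.
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", alpha=0.01,  # 0.01 p value cut-off
                    max_iter=100, random_state=42)
selector.fit(X, y)                    # X: (n_samples, 163), y: (n_samples,)

X_selected = X[:, selector.support_]  # the confirmed feature subset B
ranking = selector.ranking_           # 1 = confirmed, larger = rejected earlier
```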

2.3 Classification

After applying feature selection, the datasets were split into training and testing sets at a 70:30 ratio. After preparing the datasets, we trained decision tree, gradient boosting and linear discriminant analysis classifiers, and the complex prediction task was carried out during the test phase. We used grid search to find the optimal parameters for the models. For the decision tree classifier, we varied n_estimators from 50 to 200 and min_samples_split and min_samples_leaf from 1 to 10. The optimal parameters for the decision tree classifier are n_estimators=100, criterion="gini", max_depth=None, min_samples_split=2 and min_samples_leaf=1. For the GradientBoostingClassifier, the optimal hyperparameters are loss='deviance', learning_rate=0.01, subsample=1.0, n_estimators=100, min_samples_split=2, min_samples_leaf=1 and max_depth=5. For linear discriminant analysis, the parameters are tol=0.0001, shrinkage=None, solver='svd', n_components=None, priors=None and store_covariance=False. These classifiers are explained in the following sections.
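The following sketch illustrates the tuning procedure with scikit-learn's GridSearchCV; the grids are illustrative rather than the exact grids searched, and X_train, y_train, X_test and y_test denote the 70:30 split described above.

```python
# Illustrative grid search for the gradient boosting classifier,
# plus the LDA configuration reported above; data splits are assumed.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

gb_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [50, 100, 200],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 5, 10],
    "max_depth": [3, 5],
}
gb_search = GridSearchCV(GradientBoostingClassifier(subsample=1.0),
                         gb_grid, cv=5, scoring="f1_macro")
gb_search.fit(X_train, y_train)       # X_train/y_train from the 70:30 split

lda = LinearDiscriminantAnalysis(solver="svd", tol=0.0001)
lda.fit(X_train, y_train)
print(gb_search.best_params_, lda.score(X_test, y_test))
```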

2.3.1 Decision Tree Classifier

The decision tree classifier breaks the complexes database down into smaller groups while constructing a tree over these groups. The resulting tree consists of decision vertices and leaf vertices. The core decision tree algorithm is ID3, which uses a greedy search to create branches and selects splits using entropy and information gain. Information gain is defined by Eq. 8.

$$\begin{aligned} G(A,C)=E(A)-E(A,C), \end{aligned}$$
(8)

where G(A, C) is the information gain of feature C over the whole training set A, which contains all training examples of the dataset. E(A) is the entropy of the original training set, defined in Eq. 9, while E(A, C) is the entropy of the training set after partitioning it on feature C.

$$\begin{aligned} E(A)= -\sum _{i=1}^{n} a_i \log (a_i), \end{aligned}$$
(9)

where \(a_i\) is the proportion of A that belongs to complex i. The root vertex is the vertex with the highest information gain. The training dataset is then divided into smaller groups, each denoting a branch extending from the root vertex, and the process is repeated recursively on each branch using only the data points that reach it. A node in which all data points have the same label is called a pure node, and this process continues until all leaf nodes become pure.
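The sketch below implements Eqs. 8 and 9 in plain NumPy for a categorical feature, to make the split criterion concrete.

```python
# Entropy (Eq. 9) and information gain (Eq. 8) for a categorical feature.
import numpy as np

def entropy(labels):
    """E(A) = -sum_i a_i * log2(a_i) over class proportions a_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """G(A, C) = E(A) - E(A, C)."""
    total = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        weighted += mask.mean() * entropy(labels[mask])  # E(A, C)
    return total - weighted

y = np.array([0, 0, 1, 1, 1, 0])
f = np.array(["a", "a", "b", "b", "b", "b"])
print(information_gain(y, f))
```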

2.3.2 Gradient Boosting Classifier

The gradient boosting classifier uses a forward stagewise strategy to build an additive model. It allows the learning rate and loss function to be varied to optimize classifier performance, much as in neural networks. In each iteration of the model, regression trees are fit on the negative gradient of the binomial or multinomial loss function. For complex classification, it generates multiple regression trees. Tree generation for complex classification is defined in Eq. 10.

$$\begin{aligned} T_n(x)=\sum _{j=1}^{J_n} a_{jn} S_{jn}(x), \end{aligned}$$
(10)

where \(T_n(x)\) is a decision tree generated by gradient boosting for complexes and \(J_n\) is the number of leaf nodes. The constructed tree divides the input feature space into \(J_n\) disjoint regions \(S_{1n}, \ldots , S_{J_n n}\) and predicts a constant value \(a_{jn}\) for every region.
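A compact sketch of the forward stagewise strategy for the binary case is given below: each round fits a regression tree to the negative gradient of the binomial (log) loss and adds it to the model with a learning rate, mirroring the regression trees of Eq. 10. This is an illustrative simplification, not the exact scikit-learn implementation used in our experiments.

```python
# Forward stagewise gradient boosting for binary labels y in {0, 1}.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, lr=0.01, max_depth=5):
    """Binary gradient boosting with log loss; returns the tree ensemble."""
    F = np.zeros(len(y))                 # initial raw scores
    trees = []
    for _ in range(n_rounds):
        p = 1.0 / (1.0 + np.exp(-F))     # current probabilities
        residual = y - p                 # negative gradient of the log loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)            # T_n(x): one constant per region S_jn
        F += lr * tree.predict(X)        # stagewise additive update
        trees.append(tree)
    return trees
```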

2.3.3 Linear Discriminant Analysis

Linear discriminant analysis classifies protein complexes in a transformed space based on distance measures such as Euclidean distance. The scatter matrix for classification is computed using Eq. 11:

$$\begin{aligned} \Sigma _a= \sum _{i=1}^{n} m_i(x_i-\bar{x})(x_i-\bar{x})^T, \end{aligned}$$
(11)

where \(m_i\) is the number of training instances of complex i, \(x_i\) is the mean feature vector of complex i and \(\bar{x}\) is the overall mean of the feature vectors.
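A small NumPy sketch of Eq. 11 is shown below; X and y denote a hypothetical feature matrix and complex labels.

```python
# Scatter matrix of Eq. 11: sum over complexes i of m_i (x_i - x)(x_i - x)^T.
import numpy as np

def scatter_matrix(X, y):
    """Between-class scatter from feature matrix X and labels y."""
    overall_mean = X.mean(axis=0)
    S = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc.mean(axis=0) - overall_mean  # x_i - x
        S += len(Xc) * np.outer(diff, diff)    # weighted by m_i
    return S
```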

3 Results and Discussion

In this section, we discuss the characteristics of the datasets used in this study and the evaluation measures used to assess the performance of our models. We also compare the performance and robustness of our models with state-of-the-art studies and give the time and space complexity of our models. Furthermore, we show the protein complexes predicted from the PPINs. The experiments were performed using an 8th-generation Core i7 processor, 24 GB of RAM and Python 3.

3.1 Datasets Description and Evaluation Measures

We used five benchmark datasets in these experiments. To predict protein complexes, we downloaded the hand-curated yeast complexes datasets from CYC2008 [15, 22] and MIPS [14] (the versions used by Sikandar et al. [37, 38]). CYC2008 contains 148 complexes, 999 genes and 2475 interactions. The MIPS yeast complexes dataset contains 109 complexes, 1004 genes and 2745 interactions. The human and mouse protein complexes datasets were taken from MIPS CORUM [13]. CORUM contains 4274 mammalian complexes overall (67% human, 10% rat, 15% mouse) covering 4473 different genes; it includes 678 mouse protein complexes and 2916 human protein complexes. Protein interaction information was taken from DIP (the Database of Interacting Proteins) [12]. DIP contains 28850 proteins, 834 organisms and 81923 interactions overall, including 51221 proteins and 24918 interactions for S. cerevisiae, 5084 proteins and 9141 interactions for Homo sapiens, and 2387 proteins and 3069 interactions for Mus musculus. The amino acid sequences of all these proteins were taken from UniProt [19]. A summarized description of the datasets is given in Table 3.

Table 3 Dataset description

We evaluated the performance of our models with precision, recall, F1-measure and the Matthews correlation coefficient. Accuracy measures the ratio of correctly classified complexes to the total number of complexes.

Precision measures the fraction of predicted complexes that correspond to at least one complex in the benchmark dataset, while recall measures the fraction of benchmark complexes matched by at least one predicted complex. F1-measure is the harmonic mean (HM) of recall and precision. The Matthews correlation coefficient (MCC) measures the correlation between the actual and predicted protein complexes and returns a value between +1 and −1: a coefficient of −1 represents total contradiction between actual and predicted complexes, 0 corresponds to random prediction and +1 denotes perfect prediction. Precision, recall, F1-measure and MCC are defined by the following equations.

$$\begin{aligned} {\text {Precision}}_i&= \frac{A_{ii}}{\sum _{j}A_{ji}} \end{aligned}$$
(12)
$$\begin{aligned} {\text {Recall}}_i&= \frac{A_{ii}}{\sum _{j}A_{ij}} \end{aligned}$$
(13)
$$\begin{aligned} {\text {F1-measure}}&= \frac{2 \times {\text {precision}} \times {\text {recall}}}{{\text {precision}}+{\text {recall}}} \end{aligned}$$
(14)
$$\begin{aligned} {\text {MCC}}&=\frac{{\text {TP}} \times {\text {TN}}-{\text {FP}} \times {\text {FN}}}{\sqrt{({\text {TP}}+{\text {FP}})({\text {TP}}+{\text {FN}})({\text {TN}}+{\text {FP}})({\text {TN}}+{\text {FN}})}}, \end{aligned}$$
(15)

where \(A_{ii}\) represents the correctly predicted complexes of class i, \(\sum _{j}A_{ji}\) represents all complexes (correctly and incorrectly) predicted as class i and \(\sum _{j}A_{ij}\) represents all true complexes of class i. TP, TN, FP and FN represent the true-positive, true-negative, false-positive and false-negative counts of the predicted complexes, respectively; the square root in the MCC denominator keeps the coefficient bounded in [−1, +1].
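In practice, these measures can be computed with scikit-learn, as in the following sketch; y_true and y_pred denote the actual and predicted complex labels, and macro averaging over classes is one reasonable choice.

```python
# Evaluation measures of Eqs. 12-15 via scikit-learn; labels are assumed.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef)

precision = precision_score(y_true, y_pred, average="macro")  # Eq. 12
recall = recall_score(y_true, y_pred, average="macro")        # Eq. 13
f1 = f1_score(y_true, y_pred, average="macro")                # Eq. 14
mcc = matthews_corrcoef(y_true, y_pred)                       # Eq. 15, in [-1, +1]
```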

3.2 Comparison with Other Methods

To validate our work, we compare our methodology on the yeast datasets MIPS and CYC2008 [15], considering DIP [12] interactions, with MCL [23], MCODE [21], CFinder [22], CDIP [28] and Sikandar et al. [37, 38] in Table 1. MCL used the average degree and average density of the protein–protein interaction graph as topological features. MCODE used vertex weighting and outward traversal of dense regions, while CFinder used weighted density, graph size and clustering coefficients. CDIP also used topological features, including graph density, graph size, degree statistics, clustering coefficient and topological coefficient. Topological features capture the localization information of a PPIN, but these experiments show that this information alone is not enough to detect protein complexes reliably. Therefore, Sikandar et al. used biological features along with topological features, and this technique showed a significant improvement in protein complex detection. We also used these biological and topological features together with our bag-of-words and physicochemical features, and the combination significantly improves the results. The feature differences and comparative results are shown in Tables 4, 5 and 6 and Figs. 2, 3 and 4 for the CYC2008 and MIPS yeast benchmark sets.

Table 4 CYC2008 performance comparison
Table 5 MIPS performance comparison
Table 6 CORUM (human and mouse complexes) performance comparison
Fig. 2 Performance comparison of our methodology with baseline methods on the CYC2008 dataset

Fig. 3 Performance comparison of our methodology with baseline methods on the MIPS dataset

Fig. 4 Performance comparison of our methodology with baseline methods on the CORUM (human and mouse complexes) dataset

These results show a significant improvement achieved by our methodology. We observed that results improve as the number of complexes grows: our methodology can handle larger numbers of complexes, and we expect complex prediction from PPINs to become even more precise as the databases of interacting proteins grow.

3.3 Time and Space Complexity of the Algorithm

The complexity of our models and algorithm for time and space is given in Table 7.

Table 7 Complexity of algorithms with respect to time and space

Here, we give the upper-bound complexity. The complexity of the prediction models is defined in Table 7, while the corresponding results are shown in Table 6. The sample space n represents the number of instances in the dataset, divided into A training examples and B testing examples. The number of features is represented by f, while \(n_{\text{trees}}\) represents the number of trees generated by an algorithm to best fit the data. Prediction time complexity is given for the decision tree, gradient boosting and linear discriminant algorithms, because these are the models trained to predict protein complexes and used for prediction; for random forest, the complexity on the testing data is used, because random forest serves in the feature selection method that selects the best features. The decision tree, random forest and gradient boosting classifiers use trees for training and split nodes until they reach the maximum depth d. Detecting each split requires examining each of the f features at up to n different thresholds and computing the information gain (O(n)). In linear discriminant analysis, the weight vector \(\beta\) is determined by the equation \(\beta = (M^{T} M)^{-1} M^{T} N\). The most computation-intensive step is computing the product \(M^{T} M\), which takes \(O(f^2 n)\) operations, and then inverting it, which takes \(O(f^3)\) operations. While most implementations use gradient descent to solve the system \((M^{T} M)\beta = M^{T} N\), the complexity remains the same.

Fig. 5 Robustness comparison of our methodology with baseline methods on the MIPS dataset

Fig. 6 Robustness comparison of our methodology with baseline methods on the CYC2008 dataset

3.4 Robustness of Our Methodology

Table 8 Precision, recall, F-measure and MCC rates against different datasets without feature selection
Table 9 Precision, recall, F-measure and MCC rates against different datasets with feature selection using whole dataset
Table 10 Precision, recall, F-measure and MCC rates against different datasets with feature selection using train set

Our methodology includes a feature selection method that returns the features with their rankings and selects the high-impact features that give the best results. Although biological features in combination with topological features give better results than topological features alone, it is difficult and computationally expensive to calculate the biological (spectral domain) features from protein amino acid sequences, whereas sequential and physicochemical features are easy to calculate from amino acid sequences and give better results in combination with the topological features. Another limitation of the biological features is that they increase the dimension of the dataset, which leads to the curse of dimensionality and lowers the learning capability of the model. Consistent with this hypothesis, our feature selection returned a feature set that includes no spectral domain biological features, so we can eliminate those and use the sequential, physicochemical and topological features. By using topological, biological, sequential and physicochemical features for the CYC2008 and MIPS complexes, and topological, sequential and physicochemical features for the human and mouse complexes, we achieved significantly improved results and reduced computational time, as shown in Tables 8, 9 and 10; results without feature selection are shown in Table 8. Table 9 contains the results of feature selection on the whole dataset, while the results of feature selection on the training set only are shown in Table 10. These results show the robustness of our feature selection method, as there is a significant difference between the results and computational times of the two settings.

To show the robustness of our proposed methodology, we compared the true-positive rate of our model with other state-of-the-art studies at eight thresholds t = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9, in terms of the number of real complexes matched, on two individual datasets. Figures 5 and 6 show that for each value of t our proposed method matches more real complexes than the other comparative methods.

3.5 Protein Complexes Prediction

We also predicted the human Histone H3.1 complex, Mediator complex and Ubiquitin E3 ligase (FBXW11, SKP1A, CUL1, RBX1) complex, and the mouse Gata1–Fog1–MeCP1, Ikaros and Ikaros–NuRD complexes, shown in Figs. 7 and 8, respectively.

Fig. 7 Predicted human complexes using localization and biological information: (a) Histone H3.1 complex prediction based on H3C1 and H4C1 (histone H4 protein); (b) Mediator complex prediction based on mediator of RNA polymerase II transcription subunit 20 (MED20) and sterol regulatory element-binding protein 2 (SREBF2); (c) Ubiquitin E3 ligase (FBXW11, SKP1A, CUL1, RBX1) prediction based on E3 ubiquitin-protein ligase RBX1, S-phase kinase-associated protein 1 (SKP1) and cullin 1 (CUL1)

Fig. 8 Predicted mouse complexes using biological and localization information: (a) Gata1–Fog1–MeCP1 complex prediction based on metastasis-associated protein (Mta1), histone deacetylase 1 (Hdac1) and histone-binding protein (Rbbp4); (b) Ikaros complex prediction based on histone deacetylase 1 (Hdac1), transcription activator BRG1 (Smarca4) and SWI/SNF subunit SMARCC1 (Smarcc1); (c) Ikaros–NuRD complex prediction based on transcription factor 7-like 1 (Tcf7l1)

During DNA replication and possibly DNA repair, deposition of the main histone H3 (H3.1) is coupled to DNA synthesis, while the variant histone H3.3 serves as the replacement form in the DNA synthesis-independent deposition pathway [54]. Many RNA polymerase II-transcribed genes, both protein-coding and noncoding, require the Mediator complex to regulate their expression. The Mediator complex is also typically targeted by sequence-specific, DNA-binding transcription factors (TFs) that function in response to genetic or environmental signals to regulate gene expression programs [55]. Mediator's main function is to relay signals from transcription activators bound at enhancer regions to the transcription machinery assembled at promoters as a pre-initiation complex (PIC), thereby regulating transcription initiation [56]. The ubiquitin pathway is a powerful enzymatic pathway that marks unwanted proteins for proteasomal degradation. The ubiquitin–proteasome network is a master regulator of protein homeostasis by which proteins are first tagged by E3 ligases for polyubiquitination and then degraded by the proteasome into short peptides [57].

GATA-1 is crucial to the development of erythroid, megakaryocytic, eosinophilic and mast cell lineages. It functions as both a repressor and an activator of various target genes; in erythroid cells, for example, it suppresses cell proliferation and early hematopoietic genes while activating erythroid genes. FOG-1 regulates the association of GATA-1 with the MeCP1 complex and provides an explanation for the conflicting roles of these two erythropoiesis factors [58]. IKAROS is a hematopoietic regulator whose expression level influences how the NuRD and Mediator complexes interact with RNA polymerase II to modulate its ability to transcribe protein-coding genes [59]. It regulates transcription elongation and suppresses tumors in leukemia, where its loss is mostly associated with poor prognosis; it forms a complex with the NuRD and P-TEFb complexes that is essential for productive transcription elongation [60]. In the predicted complexes, the yellow node is the node we used to predict the protein complex, and the other nodes are the true proteins in the complexes. The protein complexes predicted based on these UniProt IDs show the robustness of our model.

The growing demand for protein–protein interaction studies is driving computational methods for predicting protein interactions and complexes. In our work, we explored a classification method for PPIs combining different features, namely biological, topological, sequential and physicochemical features. In the feature extraction step, graph features of the interacting proteins together with amino acid sequences represented through the BLOSUM62 and PSSM matrices proved very useful for the classification of protein complexes. Compared with earlier methods, our results have improved through these novel protein features and powerful classifiers. Models using these features should also perform well on other independent protein datasets. To the best of our knowledge, our method is feasible, robust and superior to the compared approaches.

4 Conclusion

Existing protein complex detection studies have mostly used topological features to predict protein complexes, and some recent studies have used a combination of biological and topological features. High-dimensional data and the biological features extracted through the fast Fourier transform contain noise that affects model accuracy and misguides protein complex prediction. In this study, we computed topological and biological features, i.e., bag-of-words, physicochemical and spectral domain features, for mining PCs from PPINs. We used SFFS, i.e., the random-forest-based Boruta feature selection method, to rank and select the best features, which eliminates noise from the dataset and reduces the dimension of the data. Decision tree, linear discriminant analysis and gradient boosting classifiers were trained on the best features selected by SFFS to predict the protein complexes. The comparison with recent studies shows a significant improvement in results. In our methodology, the linear discriminant analysis classifier gives a precision, recall and F-measure of 94.58%, 94.92% and 94.45% on the MIPS complexes and 96.31%, 93.55% and 96.02% on the CYC2008 complexes, respectively. The robustness of our model is demonstrated by the predicted complexes. In addition, our machine learning-based protein complex detection will help future studies predict protein complexes based on these features more precisely. In the future, we will implement deep learning-based models to predict protein complexes. Furthermore, we will implement conformal prediction to mine protein complexes, which will help find complexes with confidence estimates.