1 Introduction

A genetic disease is any disease caused by an abnormality in an individual's genome. The abnormality can range from minuscule to major: from a discrete mutation in a single base of a single gene to a gross chromosomal abnormality involving the addition or subtraction of an entire chromosome or set of chromosomes. Some genetic disorders are inherited from the parents, while others are caused by acquired changes or mutations in a preexisting gene or group of genes. Mutations can occur randomly or as a result of environmental exposure.

Contemporary classification of human disease dates to the late 19th century and derives from observational correlation between pathological analysis and clinical syndromes. Characterizing disease in this way established a classification schema that has served clinicians well to the current time, relying on observational skills to define the syndrome phenotype. Throughout the last century, this approach became more objective, as the molecular underpinnings of many disorders were identified and definitive laboratory tests became an essential part of the overall diagnostic paradigm [1].

In bioinformatics, various large projects, such as the Human Genome Project, together with new techniques, such as microarrays, have created enormous amounts of data. These data often have very high dimensionality, involving a huge number of genes/features. This significantly increases the computational burden, to the extent that it renders some data mining approaches infeasible. For example, it would be very difficult to train a neural network or support vector machine with tens of thousands of input nodes. Furthermore, many of these input features are redundant and/or irrelevant to a given task and can act as noise that decreases performance. Feature selection [3, 14] is a useful technique, since it can help alleviate the curse of dimensionality, speed up the learning process, and provide better interpretability.

Network data have become increasingly popular in the past decades because of the proliferation of various social and information networks. Social networks such as Facebook and Twitter have millions of users across the world. Different forms of information networks, such as co-author networks, citation networks, and protein interaction networks, also attract considerable research attention [4, 5]. In addition to the link structure, these network data are usually accompanied by content information on the nodes. For example, one can extract thousands of profiling features for users in social networks or ontology features for genes in protein interaction networks.

Proteins rarely act alone, as their functions tend to be regulated. Many molecular processes within a cell are carried out by molecular machines built from a large number of protein components organized by their protein–protein interactions (PPIs). Protein–protein interactions are lasting or ephemeral physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by electrostatic forces, including the hydrophobic effect. These interactions make up the so-called interactome of an organism, while aberrant PPIs are the basis of multiple aggregation-related diseases. Direct PPIs are among the strongest manifestations of a functional relation between genes/proteins, and a recent study showed that interacting proteins tend to lead to similar disease phenotypes when mutated. Therefore, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.

Accordingly, we incorporate PPI networks into the feature selection process, because two linked proteins are more likely to have similar properties than two randomly picked ones. Using this network information along with the features themselves, we try to select more discriminative features of proteins. However, most existing feature selection methods for gene/protein data seldom consider these inter-relations, owing to the lack of relationship information among instances in most biological datasets.

On the other hand, in the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) among a large set of sequences for which the class label is unknown. Therefore, our proposed feature selection method is designed to work in an unsupervised manner.

By combining the link information of unlabeled proteins with their abundant features, we propose an Unsupervised Feature Selection Framework for Linked Biological data (UFLB) in order to facilitate the classification of hereditary disease genes. For this purpose, we optimize a novel objective function via an efficient iterative algorithm in order to identify the most relevant and non-redundant features.

The rest of this paper is organized as follows. Related work is presented in Sect. 2. Our new framework for unsupervised feature selection in biological data, UFLB, is introduced in Sect. 3, including the approach to capturing protein–protein interactions, the iterative clustering of proteins, and the optimization analysis. The experimental results are presented in Sect. 4 and discussed further in Sect. 5. Finally, we conclude this work in Sect. 6.

2 Related work

Feature selection is an important operation in processing the data stored in gene microarrays. The most relevant features increase our understanding of the mechanisms of disease formation and allow us to predict the potential danger of being affected by such diseases. Applying feature selection methods allows the identification of a subset of important features that can be used as biomarkers of the corresponding disease. In the following, we introduce some related work on feature selection for both non-linked and linked data.

2.1 Feature selection for non-linked data

Many learning techniques have recently been proposed to solve the problem of feature selection, and a number of them are worth mentioning for their empirically demonstrated effectiveness. One of the differences among feature selection procedures is the way they search the feature space. Three categories of feature selection methods can be distinguished: filter [7, 13], wrapper [8], and embedded methods [9, 10].

Filter methods assess features by calculating a relevance score for each of them. Features with low relevance are then removed, and the selected features may be used for classification with many types of classifiers. Filter-based feature selection methods scale easily to high-dimensional datasets, since they are computationally simple and fast compared with the other approaches. Examples of filter-based approaches are ReliefF [11], mRMR [12], SPEC [13], the Laplacian score [14], and its extensions [13].

Wrapper methods evaluate feature subsets using a predictive model run on the dataset partitioned into training and testing sets. Each subset is used with the training set to train the model, which is then tested on the test set. The model's prediction error on the test set gives a score for that feature subset, and the subset with the highest evaluation is selected as the final set on which to run this particular model. Wrapper methods are computationally expensive, since a new model must be fitted for each subset [15, 16]. In embedded models, however, the feature selector combines aspects of both filter and wrapper approaches; they are less computationally expensive than wrapper methods.

Depending on the availability of class labels, feature selection algorithms can be categorized into supervised and unsupervised methods [8]. In supervised methods such as the Fisher score [17] and ReliefF [11], class labels provide clear guidance for the feature selection process, so supervised methods are usually more reliable than unsupervised ones. However, these methods suffer from two main restrictions. First, since they evaluate each feature independently, they ignore the correlation between features. Second, access to labeled training data in the real world is often too expensive. Consequently, much attention has been paid to unsupervised feature selection in recent years.

Unsupervised feature selection is a more challenging problem due to the absence of class labels. Unsupervised filter methods usually assign each feature a score indicating its capacity to preserve the structure of the data; top-ranked features are selected since they best preserve this structure. Typical methods include maximum variance [18], the Laplacian score [14], and SPEC [13]. Unsupervised wrapper methods [19] require a learning algorithm to evaluate candidate feature subsets. Unsupervised embedded methods perform feature selection as part of the model training process, e.g., UDFS [20] and NDFS [21].

State-of-the-art approaches introduce the notion of pseudo-labels [20,21,22] to guide the feature selection process. Unsupervised Discriminative Feature Selection (UDFS) [20] introduces pseudo-labels to better capture the discriminative information, and the sparsity-inducing \( l_{2,1} \)-norm is used to select features in an iterative manner. NDFS [21] performs non-negative spectral analysis and feature selection simultaneously.

The basic idea is to imitate supervised methods by generating pseudo-labels via certain clustering methods (e.g., spectral clustering and non-negative matrix factorization) and performing sparse regression toward these cluster labels. However, the generated pseudo-labels are usually inaccurate and could further mislead the feature selection process.

2.2 Feature selection for linked data

Traditional feature selection approaches assume that data instances are independent and identically distributed (i.i.d.). In network data, however, instances are implicitly or explicitly related through certain correlations and dependencies, and several methods have been proposed in recent years in which these relationships among data are also considered. For example, in research collaboration networks, researchers who collaborate with each other (i.e., connections in the network) tend to share more similar research topics (i.e., close distances in the feature space) than researchers without such collaboration. Most existing feature selection approaches fail to exploit the rich information contained in the links.

In [23], a supervised feature selection algorithm called FSNet is proposed. It adopts linear regression to fit the content information and uses graph regularization to capture the link information. LinkedFS [24] selects features in social media data in a semi-supervised manner. A supervised feature selection framework for social media data, CoSelect, is proposed in [25]; it incorporates instance selection into feature selection in order to select relevant instances and features simultaneously.

Linked unsupervised feature selection (LUFS) [26] is an unsupervised feature selection method that utilizes both content and link information. LUFS exploits network information by incorporating social dimension-based regularization [27] into the UDFS framework [20]: it enforces nodes within the same social dimension to have similar pseudo-labels. However, the social dimensions generated from links (e.g., by modularity [28] or spectral clustering [29]) and the pseudo-labels generated from attributes are usually far from accurate, which can mislead the feature selection process.

In our previous work, an unsupervised feature selection method for social media data called UFSS was presented [43]. UFSS incorporates the inter-relationships of objects in addition to their feature values. Using graph partitioning, the objects are labeled, and these labels are then used in the objective function. An iterative algorithm is designed to optimize the proposed objective function. However, in UFSS, the labeling of objects is a pre-processing step, and the labels do not change during the algorithm's run, which is a limitation.

In this paper, in contrast, the labels of objects are assigned dynamically. Unlike UFSS and LUFS, which use graph partitioning and social dimensions for static labeling, the proteins are labeled dynamically in the consecutive iterations of UFLB so that, after convergence, an appropriate clustering of proteins is achieved. Our unsupervised feature selection method for linked biological data takes into account both inter-protein relationship information and the feature content of proteins. It selects features that effectively discriminate proteins in the reduced space by using PPIs.

3 The proposed method

Our proposed approach is categorized as a hybrid method, since it combines both filter and wrapper strategies. In this section, we present several concepts as preliminaries of our unsupervised feature selection method. We aim to select a set of effective features that can highly discriminate the protein classes.

3.1 Notations

In this work, we use \( P = \left\{ {p_{1} ,p_{2} , \ldots ,p_{n} } \right\} \) to denote the set of \( n \) proteins and \( F = \left\{ {f_{1} , f_{2} , \ldots ,f_{m} } \right\} \) the set of \( m \) features. Also, let \( A \in {\mathcal{R}}^{m \times n} \) hold the feature values of these proteins; that is, the vector \( A\left( {:,j} \right) \) represents the features of protein \( p_{j} \), and \( A\left( {i,:} \right) \) holds the values of feature \( f_{i} \) over all proteins. Additionally, \( R \in {\mathcal{R}}^{n \times n} \) denotes the link information of the protein–protein network, where \( R\left( {i, j} \right) \) is set to 1 if proteins \( p_{i} \) and \( p_{j} \) are linked and 0 otherwise. We assume the connections between proteins are undirected, that is, \( R = R^{T} \). By applying the centering matrix \( H = I_{n} - \frac{1}{n}1_{n} 1_{n}^{T} \) to \( A \) via \( X = AH \), we obtain the data matrix \( X \in {\mathcal{R}}^{m \times n} \). This matrix is centered, that is, \( \sum\nolimits_{j = 1}^{n} {X\left( {:, j} \right)} = 0 \). In \( H \), \( I_{n} \) is the identity matrix and \( 1_{n} \) is a column vector of \( n \) ones.
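For concreteness, the construction of the centered data matrix \( X = AH \) and the symmetric link matrix \( R \) can be sketched in a few lines of Python/NumPy; the sizes and links below are small hypothetical placeholders, not the actual datasets.

```python
import numpy as np

m, n = 100, 30                         # hypothetical sizes (e.g., OMIM has m = 3522, n = 949)
rng = np.random.default_rng(0)

A = rng.random((m, n))                 # A[:, j] holds the feature values of protein p_j

# Centering matrix H = I_n - (1/n) 1_n 1_n^T, so X = A H has zero-mean rows
H = np.eye(n) - np.ones((n, n)) / n
X = A @ H
assert np.allclose(X.sum(axis=1), 0)   # each feature (row of X) is centered

# Undirected PPI link matrix R (a few hypothetical links)
R = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (3, 4)]:  # hypothetical interacting pairs
    R[i, j] = R[j, i] = 1              # symmetry: R = R^T
```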

3.2 Unsupervised feature selection for linked biological data

Supposing the \( n \) proteins are sampled from \( c \) classes/clusters, we assume there is a mapping matrix \( M \in {\mathcal{R}}^{m \times c} \) that maps the proteins to a cluster label indicator matrix \( C \in {\mathcal{R}}^{c \times n} \). In this matrix, \( C\left( {:,i} \right) \in \left\{ {0,1} \right\}^{{{\text{c}} \times 1}} \) is the cluster indicator vector for protein \( p_{i} \). To work with its scaled version \( G\left( {:,i} \right) \), we define the scaled cluster indicator matrix \( G \in {\mathcal{R}}^{c \times n} \), where \( G = \left( {CC^{T} } \right)^{{ - \frac{1}{2}}} C \) [30] and \( GG^{T} = \left( {CC^{T} } \right)^{{ - \frac{1}{2}}} CC^{T} \left( {CC^{T} } \right)^{{ - \frac{1}{2}}} = I_{c} \).
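Since \( CC^{T} \) is a diagonal matrix holding the cluster sizes, each nonzero entry of \( G \) is simply one over the square root of the corresponding cluster size, and the rows of \( G \) are orthonormal. A minimal numerical sketch (with hypothetical cluster assignments):

```python
import numpy as np

c, n = 2, 5
labels = np.array([0, 0, 1, 1, 1])       # hypothetical cluster labels of 5 proteins

C = np.zeros((c, n))
C[labels, np.arange(n)] = 1              # C(:, i) is the 0/1 indicator of protein p_i

sizes = C.sum(axis=1)                    # cluster sizes, i.e., the diagonal of C C^T
G = np.diag(1.0 / np.sqrt(sizes)) @ C    # scaled indicator: G = (C C^T)^(-1/2) C

print(np.allclose(G @ G.T, np.eye(c)))   # True: G G^T = I_c
```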

Accordingly, our aim is to learn the scaled cluster indicator matrix \( G \) and the feature selection matrix \( M \) simultaneously. In this regard, we propose to optimize the following objective function:

$$ \min_{M} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} $$
(1)

where \( \left\| \cdot \right\|_{F} \) denotes the Frobenius norm [32] and \( \left\| M \right\|_{2,1} \) is the \( l_{2,1} \)-norm of \( M \) [31], which controls the capacity of this matrix. The parameter \( \lambda \) controls the sparsity of \( M \): due to the nature of the \( l_{2,1} \)-norm penalty, some rows of \( M \) (and hence their coefficients) will be shrunk to exactly zero if \( \lambda \) is large enough.

In (1), \( M \) essentially contains the combination coefficients of the different features used to approximate \( G \). The joint minimization of the regression term and the \( l_{2,1} \)-norm regularization term enables \( M \) to evaluate the correlation between the cluster indicators and the features. Also, minimizing \( \left\| M \right\|_{2,1} \) ensures that \( M \) is sparse in rows. Together, these properties make \( M \) particularly suitable for feature selection.
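As a brief illustration (a sketch, not the authors' implementation), the two terms of (1) can be computed as follows; the \( l_{2,1} \)-norm is the sum of the Euclidean norms of the rows of \( M \).

```python
import numpy as np

def l21_norm(M):
    """l_{2,1}-norm: sum of the l_2-norms of the rows of M."""
    return np.linalg.norm(M, axis=1).sum()

def objective_eq1(M, X, G, lam):
    """Value of Eq. (1): ||M^T X - G||_F^2 + lambda * ||M||_{2,1}."""
    fit = np.linalg.norm(M.T @ X - G, ord='fro') ** 2
    return fit + lam * l21_norm(M)
```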

By considering matrix \( R \) and the fact that linked proteins are likely to have similar cluster label indicators, we minimize the following term:

$$ \min_{G} \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{n} R_{ij} \left\| {G\left( {:,i} \right) - G\left( {:,j} \right)} \right\|_{2}^{2} = Tr\left[ {GLG^{T} } \right] $$
(2)

where \( L = D - R \) is the graph Laplacian matrix and \( D \) is a diagonal matrix with \( D_{ii} = \sum\nolimits_{j = 1}^{n} {R_{ij} } \) as its diagonal elements. Including (2) in (1), we obtain the new version of the objective function:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] $$
(3)
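For clarity, the identity in (2) between the pairwise form and \( Tr\left[ {GLG^{T} } \right] \) can be checked with a small sketch (hypothetical matrices; the pairwise version is for illustration only, as it needs \( O(n^{2} c) \) memory):

```python
import numpy as np

def link_regularizer(G, R):
    """Tr[G L G^T] with L = D - R, where D holds the row sums of R (Eq. (2))."""
    L = np.diag(R.sum(axis=1)) - R
    return np.trace(G @ L @ G.T)

def link_regularizer_pairwise(G, R):
    """Equivalent pairwise form: 0.5 * sum_ij R_ij * ||G(:,i) - G(:,j)||_2^2."""
    diff = G[:, :, None] - G[:, None, :]            # shape (c, n, n)
    return 0.5 * np.sum(R * np.sum(diff ** 2, axis=0))
```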

According to the definition of \( G \), its elements are constrained to be discrete values, which makes the optimization of (3) an NP-hard problem [33]. A well-known solution is to relax them from discrete values to continuous ones [33, 34], so the objective function in (3) is relaxed to:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] $$
(4)
$$ s.t.\quad GG^{T} = I_{c} $$

The last part of our objective function is formed by taking into account the (centered) protein information matrix \( X \) for discrimination. A well-known way to utilize discriminative information is to find a low-dimensional subspace in which the between-class scatter matrix \( Q_{b} \) is maximized while the total scatter matrix \( Q_{t} \) is minimized [35].

As in [43], the maximum of \( Tr\left( {\frac{{Q_{b} }}{{Q_{t} }}} \right) \) (minimum of its negative) is included in (4) and the new objective function is given by:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] - \gamma Tr\left( {\frac{{Q_{b} }}{{Q_{t} }}} \right) $$
(5)
$$ s.t.\quad GG^{T} = I_{c} $$

where the parameter \( \gamma \) controls the contribution of the discriminative term. In order to use the definitions of \( Q_{b} \) and \( Q_{t} \) from [43], we define \( Y = M^{T} X \), and thus:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left( {YY^{T} - YG^{T} GY^{T} } \right) $$
(6)
$$ s.t.\quad GG^{T} = I_{c} $$

Note that all the elements of \( G \) are non-negative by definition. However, the optimal \( G \) of (6) has mixed signs, which violates this definition and makes \( G \) deviate severely from the ideal cluster indicators. As a result, we cannot directly assign labels to data using the cluster indicator matrix \( G \). To address this problem, it is reasonable to impose a non-negativity constraint in the objective function. When both the non-negativity and orthogonality constraints are satisfied, there is only one positive element in each column of \( G \), while all others are zero. In that way, the learned \( G \) is more accurate and more capable of providing discriminative information. Therefore, by rewriting (6), the new objective function is obtained as follows:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left[ {M^{T} X\left( {I_{n} - G^{T} G} \right)X^{T} M} \right] $$
(7)
$$ s.t.\quad GG^{T} = I_{c} {\text{ and }}G \ge 0 $$

To optimize this function, we propose an iterative optimization algorithm. In this regard, we rewrite the objective function in (7) as follows:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left[ {M^{T} X\left( {I_{n} - G^{T} G} \right)X^{T} M} \right] + \alpha \left\| {GG^{T} - I_{c} } \right\|_{F}^{2} $$
(8)
$$ s.t.\quad G \ge 0 $$

where \( \alpha > 0 \) is a parameter that controls the orthogonality condition. In practice, \( \alpha \) should be large enough to ensure that the orthogonality constraint is satisfied. For ease of presentation, the resulting objective function \( J\left( {M,G} \right) \) is defined as follows:

$$ J\left( {M,G} \right) = \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left[ {M^{T} X\left( {I_{n} - G^{T} G} \right)X^{T} M} \right] + \alpha \left\| {GG^{T} - I_{c} } \right\|_{F}^{2} $$
(9)

Theorem 1

The mapping matrix \( M \) in \( J\left( {M,G} \right) \) can be updated as follows:

$$ M^{{\left( {\text{new}} \right)}} = \left( {XX^{T} + \gamma X\left( {I_{n} - G^{T} G} \right)X^{T} + \lambda D_{M}^{{\left( {\text{old}} \right)}} } \right)^{ - 1} XG^{T} $$
(10)

where \( D_{M} \) is an \( m \times m \) diagonal matrix with \( \frac{1}{{2\left\| {M\left( {i,:} \right)} \right\|_{2} }} \) as its \( i \)th diagonal element.

Proof

In order to minimize \( J\left( {M,G} \right) \) in (9), its derivative with respect to \( M \) is taken as follows:

$$ \frac{{\partial J\left( {M,G} \right)}}{\partial M} = 2XX^{T} M - 2XG^{T} + 2\lambda D_{M} M + 2\gamma X\left( {I_{n} - G^{T} G} \right)X^{T} M $$

By setting this derivative to zero, the update rule in (10) is obtained.□
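A minimal sketch of the update rule (10) in Python/NumPy (assuming the matrices of Sect. 3.1); a linear solve replaces the explicit inverse, and a small epsilon guards the diagonal of \( D_{M} \) against zero rows of \( M \):

```python
import numpy as np

def update_M(M_old, X, G, lam, gamma, eps=1e-12):
    """One update of the mapping matrix M according to Eq. (10)."""
    m, n = X.shape
    row_norms = np.linalg.norm(M_old, axis=1)        # ||M(i, :)||_2
    D_M = np.diag(1.0 / (2.0 * row_norms + eps))     # diagonal matrix D_M
    S = X @ (np.eye(n) - G.T @ G) @ X.T              # X (I_n - G^T G) X^T
    lhs = X @ X.T + gamma * S + lam * D_M            # m x m coefficient matrix
    return np.linalg.solve(lhs, X @ G.T)             # M^(new), of size m x c
```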

Theorem 2

The scaled cluster indicator matrix, \( G \), is updated by this rule:

$$ G_{ij}^{{\left( {\text{new}} \right)}} = G_{ij}^{{\left( {\text{old}} \right)}} \cdot \frac{{U_{ij}^{{\left( {\text{old}} \right)}} }}{{V_{ij}^{{\left( {\text{old}} \right)}} }} $$
(11)

where \( U = M^{T} X + M^{T} XG^{T} M^{T} X + 2\alpha G \) and \( V = G + GL + 2\alpha GG^{T} G \).

Proof

Following [36,37,38], we introduce multiplicative updating rules. Setting the derivative of \( J\left( {M,G} \right) \) with respect to \( G_{ij} \) to 0 and using the Karush–Kuhn–Tucker condition [39], we obtain the updating rule in (11).□
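The rule in (11) can be transcribed directly, using \( U \) and \( V \) exactly as stated in Theorem 2 (a sketch; a small epsilon avoids division by zero and the result is clipped to keep \( G \) non-negative):

```python
import numpy as np

def update_G(G_old, M, X, L, alpha, eps=1e-12):
    """One multiplicative update of the scaled cluster indicator G (Eq. (11))."""
    Y = M.T @ X                                       # Y = M^T X, of size c x n
    U = Y + Y @ G_old.T @ Y + 2.0 * alpha * G_old
    V = G_old + G_old @ L + 2.0 * alpha * G_old @ G_old.T @ G_old
    G_new = G_old * U / np.maximum(V, eps)            # element-wise update
    return np.maximum(G_new, 0.0)                     # enforce non-negativity
```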

Using the updating rule of \( M \) in (10) and that of \( G \) in (11), we have developed the iterative algorithm of UFLB, sketched below.

The larger \( \left\| {M\left( {i,:} \right)} \right\|_{2} \) is, the more informative the feature \( f_{i} \) is; the features are therefore ranked by the \( l_{2} \)-norms of the rows of \( M \).
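Putting the two rules together, the overall UFLB iteration can be sketched as follows; update_M and update_G are the helper functions sketched after Theorems 1 and 2, while the random initialization, tolerance, and iteration count are assumptions of this sketch rather than the authors' exact settings.

```python
import numpy as np

def uflb(X, R, c, lam, gamma, alpha, n_iter=50, tol=1e-4, seed=0):
    """UFLB sketch: alternate the updates of M (Eq. 10) and G (Eq. 11),
    then rank the features by the row norms of M."""
    m, n = X.shape
    rng = np.random.default_rng(seed)

    L = np.diag(R.sum(axis=1)) - R                    # graph Laplacian L = D - R
    M = rng.random((m, c))                            # random non-negative init
    G = rng.random((c, n))

    prev = np.inf
    for _ in range(n_iter):
        M = update_M(M, X, G, lam, gamma)
        G = update_G(G, M, X, L, alpha)
        obj = (np.linalg.norm(M.T @ X - G, 'fro') ** 2          # J(M, G), Eq. (9)
               + lam * np.linalg.norm(M, axis=1).sum()
               + np.trace(G @ L @ G.T)
               + gamma * np.trace(M.T @ X @ (np.eye(n) - G.T @ G) @ X.T @ M)
               + alpha * np.linalg.norm(G @ G.T - np.eye(c), 'fro') ** 2)
        if abs(prev - obj) < tol:                     # stop once J(M, G) stabilizes
            break
        prev = obj

    ranking = np.argsort(-np.linalg.norm(M, axis=1))  # most informative features first
    return M, G, ranking
```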

4 Experiments and discussion

In this section, we present experimental details to verify the effectiveness of the proposed framework, UFLB. It is compared against state-of-the-art unsupervised feature selection methods with and without link information. We evaluate the effectiveness of the selected features using both an accuracy measure and clustering quality. Then, the effects of the parameters on the performance of UFLB are discussed. Finally, its convergence is analyzed experimentally.

4.1 Datasets

In this work, labeled genes from Online Mendelian Inheritance in Man (OMIM) covering six groups of confirmed diseases are selected. The labels are cardiovascular, endocrine, cancer, metabolic, neurological, and ophthalmological disease [40, 45]. To improve the quality and performance of disease gene classification, the data are derived from multiple biological sources [41]. This dataset consists of 949 genes with 4004 features and 956 links. The features are extracted from gene ontology (3000 features), protein domains (1000 features), and protein–protein interactions (4 features) to construct the feature vector of each gene [45]. We removed the features that appear in none of the genes, reducing the dataset to 3522 features.

The second dataset is a subset of IntAct with three groups of diseases. IntAct provides an open-source database and toolkit for the storage, presentation, and analysis of protein interactions. We extract 846 genes/proteins from the cancer, Alzheimer, and Parkinson databases, with 1980 links. The sequence of each gene/protein is obtained from the UniProt database. Then, using the distribution of 1-grams, 2-grams, and 3-grams in each protein sequence, 8420 features are extracted from combinations of amino acids [44]. After removing the features that appear in none of the genes/proteins, the dataset is reduced to 8404 features.
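As a hypothetical illustration of this feature construction (not necessarily the authors' exact pipeline), the n-gram composition of a protein sequence over the 20 standard amino acids can be computed as follows; 20 + 400 + 8000 = 8420 features, matching the dimensionality quoted above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids

def ngram_features(sequence, max_n=3):
    """Relative frequencies of all 1-, 2-, and 3-grams of amino acids."""
    feats = []
    for n in range(1, max_n + 1):
        grams = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
        counts = {g: 0 for g in grams}
        total = max(len(sequence) - n + 1, 1)
        for i in range(len(sequence) - n + 1):
            window = sequence[i:i + n]
            if window in counts:                 # skip non-standard residues
                counts[window] += 1
        feats.extend(counts[g] / total for g in grams)
    return feats

print(len(ngram_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))   # 8420
```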

A subset of the HPRD database is selected as the third dataset. It contains 234 genes from four disease classes: diabetes, myopathy, syndrome, and cancer. Each gene has 8420 features extracted from combinations of amino acids. This dataset, with a small number of samples (genes) and a very large number of features, is used to evaluate UFLB in that setting.

The fourth dataset contains fewer features than the other three. It comprises 966 genes with 127 features. Its six classes, Parkinson, Alzheimer, vitiligo, chronic lymphocytic leukemia, schizophrenia, and type I diabetes mellitus, are extracted from the Hetio database.

The statistics of the four datasets are shown in Table 1.

Table 1 Statistics of four linked biological datasets

4.2 Evaluation measures

Following the convention of clustering studies, we evaluate the clustering quality of UFLB by two commonly used metrics: the Unsupervised Accuracy Measure (UAM) and Normalized Mutual Information (NMI).

Denoting \( q_{i} \) as the clustering result and \( g_{i} \) as the ground-truth label of protein \( p_{i} \), UAM is computed as [20]:

$$ {\text{UAM}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \delta \left( {g_{i} ,map\left( {q_{i} } \right)} \right) $$
(12)

where

$$ \delta \left( {g,q} \right) = \left\{ {\begin{array}{*{20}l} {1 ,\quad if\,g = q} \\ {0 ,\quad if\,g \ne q} \\ \end{array} } \right. $$
(13)

and \( map\left( q \right) \) is the best mapping function that permutes the clustering labels to match the ground-truth labels, obtained using the Kuhn–Munkres algorithm. A larger UAM indicates better performance.
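A sketch of how UAM can be computed, using SciPy's implementation of the Kuhn–Munkres (Hungarian) algorithm for the best label mapping; integer labels in the range 0..c-1 are assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def uam(ground_truth, clusters):
    """Unsupervised accuracy measure of Eq. (12)."""
    ground_truth, clusters = np.asarray(ground_truth), np.asarray(clusters)
    c = int(max(ground_truth.max(), clusters.max())) + 1
    # confusion[q, g] = number of proteins in cluster q with ground-truth label g
    confusion = np.zeros((c, c), dtype=int)
    for q, g in zip(clusters, ground_truth):
        confusion[q, g] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize the matched counts
    return confusion[rows, cols].sum() / len(ground_truth)
```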

Also, NMI is computed as [42]:

$$ {\text{NMI}} = \mathop \sum \limits_{l} \mathop \sum \limits_{h} t_{l,h} {\text{log}}\left( {\frac{{n \times t_{l,h} }}{{t_{l} \times t_{h} }}} \right)/\sqrt {\left( {\mathop \sum \limits_{l} t_{l} {\text{log}}\left( {\frac{{t_{l} }}{n}} \right)} \right) \left( {\mathop \sum \limits_{h} t_{h} {\text{log}}\left( {\frac{{t_{h} }}{n}} \right)} \right)} $$
(14)

where \( t_{l} \) is the number of proteins in the \( l \) th cluster \( \left( {l = 1, \ldots , c} \right) \) and \( t_{h} \) is the number of proteins in the \( h \) th ground-truth class \( \left( {h = 1, \ldots ,c} \right) \). Also, \( t_{l,h} \) is the number of proteins in both the \( l \) th cluster and the \( h \) th ground-truth class. Again, higher values of NMI indicate better clustering results.
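The NMI of (14) can be computed directly from the contingency counts \( t_{l,h} \) and the marginals \( t_{l} \), \( t_{h} \); a minimal sketch:

```python
import numpy as np

def nmi(ground_truth, clusters):
    """Normalized mutual information as in Eq. (14)."""
    ground_truth, clusters = np.asarray(ground_truth), np.asarray(clusters)
    n = len(ground_truth)
    cl_labels, gt_labels = np.unique(clusters), np.unique(ground_truth)

    # contingency counts t_{l,h} and marginals t_l, t_h
    t = np.array([[np.sum((clusters == l) & (ground_truth == h))
                   for h in gt_labels] for l in cl_labels], dtype=float)
    t_l, t_h = t.sum(axis=1), t.sum(axis=0)

    nz = t > 0                                      # avoid log(0)
    mi = np.sum(t[nz] * np.log(n * t[nz] / np.outer(t_l, t_h)[nz]))
    h_cl = np.sum(t_l * np.log(t_l / n))            # cluster entropy term
    h_gt = np.sum(t_h * np.log(t_h / n))            # class entropy term
    return mi / np.sqrt(h_cl * h_gt)
```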

4.3 Experimental results

In this subsection, we compare the quality of the features selected by different algorithms using the NMI and accuracy metrics. For baseline methods with parameters, we tried different parameter values and report the best performance. UFLB has three important parameters, \( \gamma \), \( \lambda \), and \( \alpha \), which control the discriminative information, sparsity, and orthogonality, respectively. To select the best values, each parameter is varied while holding the others fixed, observing how the accuracy of UFLB changes as different numbers of features are selected. Figure 1 depicts the NMI measure when the three parameters of UFLB are examined on the OMIM dataset.

Fig. 1 NMI measure with different values of (a) \( \alpha \), (b) \( \gamma \), and (c) \( \lambda \)

As shown in Fig. 1, NMI degrades for \( \alpha > 0.5 \). In the case of \( \gamma \), the values are closer to each other, although they are lowest for \( \gamma = 0.9 \). For \( \lambda \), the NMI values decrease as \( \lambda \) decreases; however, for \( \lambda = 0.9 \), NMI drops drastically. Consequently, setting the orthogonality, sparsity, and discrimination parameters to high values does not produce the desired results. Finally, these parameters are set to \( \alpha = 0.5 \), \( \gamma = 0.7 \), and \( \lambda = 0.3 \) in the experiments on the OMIM dataset. Similarly, for the three other datasets, the following settings are used: \( \alpha = 0.3 \), \( \gamma = 0.5 \), \( \lambda = 0.3 \) for IntAct; \( \alpha = 0.1 \), \( \gamma = 0.2 \), \( \lambda = 0.3 \) for HPRD; and \( \alpha = 0.3 \), \( \gamma = 0.1 \), \( \lambda = 0.1 \) for Hetio.

After appropriately setting the parameters of UFLB, its performance is compared against seven unsupervised feature selection algorithms in terms of the UAM and NMI metrics. These methods are described briefly below.

(1) UDFS [20] optimizes an \( l_{2,1} \)-norm regularized minimization problem with an orthogonal constraint. It selects the most discriminative feature subset from the whole feature set in batch mode.

(2) UFSS [43] utilizes both the relationships between instances and the information of the features in its objective function. This function seeks a mapping matrix that captures the discriminative information of each feature; the ranked features are then obtained from this mapping matrix.

(3) NMF [37] is a matrix factorization method which approximately decomposes a known matrix into two unknown matrices of much lower dimensions.

(4) The Laplacian score [14] is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality-preserving power.

(5) LUFS [2] is an unsupervised feature selection framework for linked data in social media. It utilizes the concept of social dimensions from social network analysis to extract the relations among linked data as groups, and then defines a social dimension regularization, inspired by linear discriminant analysis, to model these relations mathematically.

(6) SPEC [13] is a general framework of spectral feature selection for both supervised and unsupervised learning, which evaluates features via the spectrum of a similarity graph built from the data. This algorithm performs well at both removing redundant features and preserving relevant ones.

(7) CGSSL [6] jointly exploits non-negative spectral analysis and structural learning with sparsity. In this unsupervised feature selection approach, the cluster indicators learned by non-negative spectral clustering are used to provide label information for the structural learning.

The comparison results w.r.t. both UAM and NMI are presented in Tables 2 and 3 for the OMIM dataset. In these tables, the \( q \) best features of each method are examined. Moreover, the clustering performance with all features (i.e., without feature selection) is also reported. Note that the results of each evaluation criterion are reported in two forms: as a table (left) and as a plot (right).

Table 2 NMI measure of different feature selection methods for OMIM dataset
Table 3 UAM of different feature selection methods for OMIM dataset

According to the results in Tables 2 and 3, UFLB mostly outperforms the seven compared methods. Although the NMI results for UFLB are much better than those of the other methods, the values are generally small. This is because proteins belonging to the same ground-truth class are not necessarily grouped into the same cluster, which leads to a noticeable decrease in NMI. The same holds for UAM.

In order to evaluate the accuracy measure more reliably, ground-truth labels are considered. Using these labels, a multi-class Support Vector Machine (SVM) classifier [46] is trained, and the resulting classification accuracy is reported in Table 4. As shown there, the accuracy of UFLB is close to that of LUFS and UFSS in most cases, because these algorithms also take the link information into account. However, the execution time of UFLB is lower than that of both LUFS and UFSS.

Table 4 Classification accuracy of different feature selection methods for OMIM dataset
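A sketch of how such a classification check could be set up with scikit-learn; the split ratio, kernel, and the use of the ranking produced by the UFLB sketch in Sect. 3 are assumptions, not the authors' exact protocol.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def classification_accuracy(X, labels, ranking, q, seed=0):
    """Train a multi-class SVM on the q top-ranked features and report test accuracy."""
    Xq = X[ranking[:q], :].T                     # proteins as rows, q selected features
    X_tr, X_te, y_tr, y_te = train_test_split(
        Xq, labels, test_size=0.3, random_state=seed, stratify=labels)
    clf = SVC(kernel='linear')                   # multi-class SVM (one-vs-one)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```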

In the same way, the NMI, UAM, and classification accuracy of the eight methods for the IntAct dataset are shown in Tables 5, 6, and 7, respectively.

Table 5 NMI measure of different feature selection methods for IntAct dataset
Table 6 UAM of different feature selection methods for IntAct dataset
Table 7 Classification accuracy of different feature selection methods for IntAct dataset

It is clear that, in most cases on the IntAct dataset, UFLB is better than the other methods. It outperforms almost all traditional feature selection methods, which do not take link information into account. Also, in comparison with UFSS and LUFS, our proposed method obtains better results in most cases.

In Tables 8, 9, and 10, the NMI, UAM, and classification accuracy of the eight feature selection methods are shown for the HPRD dataset. UFLB works best in most cases, even though the sample size is not large. This shows that the performance of UFLB is acceptable on medium-sized datasets with very high dimensionality.

Table 8 NMI measure of different feature selection methods for HPRD dataset
Table 9 UAM of different feature selection methods for HPRD dataset
Table 10 Classification accuracy of different feature selection methods for HPRD dataset

The results of NMI, UAM, and accuracy on the Hetio dataset are reported in Tables 11, 12, and 13, where UFLB is compared against seven unsupervised feature selection methods. According to these results, UFLB outperforms the other methods in most cases on this dataset, which has many genes but few features, because it takes the link information into account.

Table 11 NMI measure of different feature selection methods for Hetio dataset
Table 12 UAM of different feature selection methods for Hetio dataset
Table 13 Classification accuracy of different feature selection methods for Hetio dataset

4.4 Time complexity

To evaluate the time complexity of UFLB, all unsupervised feature selection methods are implemented in MATLAB R2017 and their required CPU time is compared in the experiments. The code is run on a Core i7, 1.8 GHz CPU with 8 GB of memory under 64-bit Windows 10. Table 14 shows the CPU time of UFLB and the seven compared methods on the four datasets.

Table 14 CPU time (in seconds) of UFLB against seven methods, run on four datasets

From the results, it is clear that UFLB is considerably faster than LUFS and UFSS, which also consider link information. It is also faster than CGSSL and UDFS. However, UFLB is not faster than SPEC, the Laplacian score, and NMF, since they do not use link information during feature selection. Methods that select features by considering the link information between samples naturally require more CPU time to process these connections, in exchange for improved performance.

4.5 Non-parametric test

In order to statistically assess the compared methods, we use a non-parametric statistical test to check for significant differences among them. Friedman's test [47] with a significance level of 0.05 is used in the experiment. This test is typically applied to detect significant differences among more than two results.

In Table 15, the classification accuracy of all algorithms for the smallest subset of selected features (as reported in Tables 4, 7, 10, and 13) on the four datasets is displayed. We apply Friedman's test to these accuracies to test the hypothesis that all the feature selection methods perform equally well at the given significance level. The test ranks the methods for each dataset separately, and the best-performing method receives the highest rank.

Table 15 Classification accuracy of different feature selection methods for the smallest subset of selected features on four datasets

Applying Friedman's non-parametric test yields a p value < 0.01, so it can be concluded that at least two of the algorithms differ significantly from each other. The average rankings of the eight methods on the four biological datasets, obtained from Friedman's test, are shown in Table 16. These rankings indicate that UFLB is the most effective method for the classification tasks.

Table 16 Average rankings of the compared methods
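A sketch of how this test can be run with scipy.stats.friedmanchisquare; the accuracy values below are placeholders, not the numbers reported in Table 15.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows: datasets (OMIM, IntAct, HPRD, Hetio); columns: the eight compared methods
acc = np.array([                      # placeholder accuracies, not the reported ones
    [0.61, 0.55, 0.50, 0.48, 0.58, 0.47, 0.52, 0.54],
    [0.72, 0.66, 0.60, 0.58, 0.69, 0.57, 0.62, 0.64],
    [0.58, 0.52, 0.47, 0.45, 0.55, 0.44, 0.49, 0.51],
    [0.66, 0.60, 0.55, 0.53, 0.63, 0.52, 0.57, 0.59],
])

stat, p = friedmanchisquare(*acc.T)   # one sample of accuracies per method
print(f"Friedman statistic = {stat:.3f}, p value = {p:.4f}")

avg_ranks = rankdata(acc, axis=1).mean(axis=0)   # higher accuracy -> higher rank
print(avg_ranks)
```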

5 Discussion

In biological studies, it is important to know which features can better discriminate the groups of hereditary diseases. In this section, we examine the nature of the selected features in the four datasets. First, by examining the top-ranked features for the OMIM dataset, we found that the 1000 top features are related to gene ontology (specifically the biological process subontology). According to [46], gene ontology includes three subontologies: molecular function, the elemental activities of a gene product at the molecular level; biological process, a set of molecular functions; and cellular component, which represents parts of a cell or its extracellular environment. Thus, the features derived from the biological process subontology discriminate better.

Similarly, for the IntAct and HPRD datasets, the top-ranked features are investigated. Since most of the features come from the 3-gram distribution (compared to 1-grams and 2-grams), it is expected that a considerable portion of the top-ranked features belongs to this category. In the IntAct dataset, among the 1000 top features, 202 are 2-gram features and only 3 are 1-gram features. This means that the 1-gram features are not discriminative for disease genes.

Furthermore, for the Hetio dataset, two groups of features are extracted: (1) features computed for each gene–disease pair, and (2) a processed version of the GNF BodyMap [47], providing a gene's expression value for 77 specific tissues, which allows more precise discrimination.

As stated before, UFLB mostly outperforms the other methods. In particular, for small numbers of best features on the four datasets, UFLB is usually the best method. This means that UFLB is able to recognize the top discriminative features better than the other methods. It is worth mentioning that the methods that consider link information usually outperform the others. This is because interacting genes/proteins have more similar characteristics than unrelated ones. Thus, it is beneficial to incorporate PPIs alongside features in the feature selection process.

5.1 Convergence study

In this part, we study the convergence of UFLB experimentally. Figure 2 shows the value of the objective function in (9) over consecutive iterations of the algorithm. From these plots, it is clear that UFLB converges after only a few iterations on all four datasets. This confirms that the UFLB algorithm converges reliably and quickly.

Fig. 2 Convergence curves of UFLB for (a) OMIM, (b) IntAct, (c) HPRD, and (d) Hetio datasets

6 Conclusion

Classification of hereditary disease genes/proteins plays a significant role in the prediction and diagnosis of diseases. Diseases with the same or similar phenotypes share the biological features that describe them. Since there are often a large number of features associated with biological data (genes/proteins), it is important to find out which input features are useful in the diagnosis of a given disease. For this reason, feature selection is an important tool in biological research.

On the other hand, almost all methods presented so far for feature selection in biological data have not considered the inter-relationships between data instances. Since interacting proteins have more similar characteristics than unrelated ones, it is beneficial to incorporate PPIs in addition to features in feature selection.

Therefore, an unsupervised method for feature selection is proposed here, because a huge amount of biological data is unlabeled. For this purpose, by optimizing a novel objective function, which incorporates the inter-relationships of genes/proteins in addition to their features, the top-ranked features are extracted. Also, unlike other methods, the data are labeled dynamically in the consecutive iterations of the proposed algorithm so that, after convergence, an appropriate clustering of proteins is achieved.

We compared our proposed method against state-of-the-art methods using well-known evaluation criteria on four real-world datasets. The experimental results demonstrate the effectiveness of our proposed method in exploiting link information to select informative features.