1 Introduction

A genetic disease is any disease caused by an abnormality in an individual's genome. The abnormality can range from minuscule to major: from a discrete mutation in a single base of a single gene to a gross chromosomal abnormality involving the addition or subtraction of an entire chromosome or set of chromosomes. Some genetic disorders are inherited from the parents, while others are caused by acquired changes or mutations in a preexisting gene or group of genes. Mutations can occur randomly or as a result of environmental exposure.

Contemporary classification of human disease dates to the late 19th century and derives from observational correlation between pathological analysis and clinical syndromes. Characterizing disease in this way established a classification schema that has served clinicians well to the current time, relying on observational skills to define the syndrome phenotype. Throughout the last century, this approach became more objective, as the molecular underpinnings of many disorders were identified and definitive laboratory tests became an essential part of the overall diagnostic paradigm [1].

In bioinformatics, various large projects, such as the Human Genome Project, together with new techniques, such as microarrays, have created enormous amounts of data. These data often have very high dimensionality, involving a huge number of genes/features. This significantly increases the computational burden, to the extent that it renders some data mining approaches infeasible. For example, it would be very difficult to train a neural network or support vector machine with tens of thousands of input nodes. Furthermore, many of these input features are redundant and/or irrelevant to a given task and can act as noise that decreases performance. Feature selection [3, 14] is a useful technique, since it can help alleviate the curse of dimensionality, speed up the learning process, and provide better interpretability.

Network data have become increasingly popular in the past decades because of the proliferation of various social and information networks. Social networks such as Facebook and Twitter have millions of users across the world. Different forms of information networks, such as co-author networks, citation networks, and protein interaction networks, also attract considerable research attention [4, 5]. In addition to the link structure, these network data are usually accompanied by content information on the nodes. For example, one can extract thousands of profiling features for users in social networks or ontology features for genes in protein interaction networks.

Proteins rarely act alone, as their functions tend to be regulated. Many molecular processes within a cell are carried out by molecular machines built from a large number of protein components organized by their protein–protein interactions (PPIs). Protein–protein interactions are lasting or ephemeral physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by electrostatic forces, including the hydrophobic effect. These interactions make up the so-called interactome of an organism, while aberrant PPIs are the basis of multiple aggregation-related diseases. Direct PPIs are among the strongest manifestations of a functional relation between genes/proteins, and a recent study showed that interacting proteins tend to lead to similar disease phenotypes when mutated. Therefore, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.

Accordingly, we incorporate PPI networks into the feature selection process, because two linked proteins are more likely to have similar properties than two randomly picked ones. Using this network information along with the features themselves, we try to select more discriminative features of proteins. However, most existing feature selection methods for gene/protein data seldom consider these inter-relations, owing to the lack of relationship information among instances in most biological datasets.

On the other hand, in the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) among a large set of sequences for which the class label is unknown. Therefore, our proposed feature selection method is designed to work in an unsupervised manner.

By combining the link information of unlabeled proteins with their abundant features, we propose an Unsupervised Feature Selection Framework for Linked Biological data (UFLB) in order to facilitate the classification of hereditary disease genes. For this purpose, we optimize a novel objective function via an efficient iterative algorithm in order to identify the most relevant and non-redundant features.

The rest of this paper is organized as follows. Related work is presented in Sect. 2. Our new framework for unsupervised feature selection in biological data, UFLB, is introduced in Sect. 3, including the approach to capturing protein–protein interactions, the iterative clustering of proteins, and the optimization analysis. The experimental results are presented in Sect. 4 and discussed further in Sect. 5. Finally, we conclude this work in Sect. 6.

2 Related work

Feature selection is an important operation in processing the data stored in gene microarrays. The most relevant features increase our understanding of the mechanisms of disease formation and allow us to predict the potential danger of being affected by such diseases. Applying feature selection methods allows the identification of a subset of important features that can be used as biomarkers of the corresponding disease. In the following, we introduce some related work on feature selection for both non-linked and linked data.

2.1 Feature selection for non-linked data

Many learning techniques have recently been proposed to solve the problem of feature selection, and a number of them are worth mentioning for their empirically demonstrated effectiveness. One of the differences among feature selection procedures is the way they search the feature space. Three categories of feature selection methods can be distinguished: filter [7, 13], wrapper [8], and embedded methods [9, 10].

Filter methods assess features by calculating a relevance score for each of them. Features with low relevance are then removed, and the selected features may be used for classification with many types of classifiers. Filter-based feature selection methods scale easily to high-dimensional datasets, since they are computationally simple and fast compared with the other approaches. Examples of filter-based approaches are ReliefF [11], mRMR [12], SPEC [13], the Laplacian score [14], and its extensions [13].

Wrapper methods evaluate feature subsets using a predictive model run on the dataset partitioned into training and testing sets. Each subset is used with the training set to train the model, which is then tested on the test set. The model's prediction error on the test set gives a score for that feature subset, and the subset with the highest evaluation is selected as the final set on which to run this particular model. Wrapper methods are computationally expensive, since a new model must be fitted for each subset [15, 16]. In embedded models, however, the feature selector combines aspects of both filter and wrapper approaches; they are less computationally expensive than wrapper methods.

Depending on the availability of class labels, feature selection algorithms can be categorized into supervised and unsupervised methods [8]. In supervised methods such as the Fisher score [17] and ReliefF [11], class labels provide clear guidance for the feature selection process, so supervised methods are usually more reliable than unsupervised ones. However, these methods suffer from two main restrictions. First, since they evaluate each feature independently, they ignore the correlation between features. Second, access to labeled training data in the real world is often too expensive. Consequently, much attention has been paid to unsupervised feature selection in recent years.

Unsupervised feature selection is a more challenging problem due to the absence of class labels. Unsupervised filter methods usually assign each feature a score indicating its capacity to preserve the structure of the data; top-ranked features are selected since they best preserve this structure. Typical methods include maximum variance [18], the Laplacian score [14], and SPEC [13]. Unsupervised wrapper methods [19] require a learning algorithm to evaluate candidate feature subsets. Unsupervised embedded methods perform feature selection as part of the model training process, e.g., UDFS [20] and NDFS [21].

State-of-the-art approaches introduce the notion of pseudo-labels [20,21,22] to guide the feature selection process. Unsupervised Discriminative Feature Selection (UDFS) [20] introduces pseudo-labels to better capture the discriminative information, and the sparsity-inducing \( l_{2,1} \)-norm is used to select features in an iterative manner. NDFS [21] performs non-negative spectral analysis and feature selection simultaneously.

The basic idea is to imitate supervised methods by generating pseudo-labels via certain clustering methods (e.g., spectral clustering and non-negative matrix factorization) and performing sparse regression toward these cluster labels. However, the generated pseudo-labels are usually inaccurate and could further mislead the feature selection process.

2.2 Feature selection for linked data

Traditional feature selection approaches assume that data instances are independent and identically distributed (i.i.d.). In network data, however, instances are implicitly or explicitly related through certain correlations and dependencies, and several methods have been proposed in recent years in which these relationships among data are also considered. For example, in research collaboration networks, researchers who collaborate with each other (i.e., connections in the network) tend to share more similar research topics (i.e., close distances in the feature space) than researchers without such collaboration. Most existing feature selection approaches fail to exploit the rich information contained in the links.

In [23], a supervised feature selection algorithm called FSNet is proposed. It adopts linear regression to fit the content information and uses graph regularization to capture the link information. LinkedFS [24] selects features in social media data in a semi-supervised manner. A supervised feature selection framework for social media data, CoSelect, is proposed in [25]; it incorporates instance selection into feature selection in order to select relevant instances and features simultaneously.

Linked unsupervised feature selection (LUFS) [26] is an unsupervised feature selection method that utilizes both content and link information. LUFS exploits network information by incorporating social dimension-based regularization [27] into the UDFS framework [20]: it enforces nodes within the same social dimension to have similar pseudo-labels. However, the social dimensions generated from links (e.g., by modularity [28] or spectral clustering [29]) and the pseudo-labels generated from attributes are usually far from accurate, which can mislead the feature selection process.

In our previous work, an unsupervised feature selection method for social media data called UFSS was presented [43]. UFSS incorporates the inter-relationships of objects in addition to their feature values. Using graph partitioning, the objects are labeled, and these labels are then used in the objective function. An iterative algorithm is designed to optimize the proposed objective function. However, in UFSS, the labeling of objects is a pre-processing step, and the labels do not change during the algorithm's run, which is a limitation.

In this paper, in contrast, the labels of objects are assigned dynamically. Unlike UFSS and LUFS, which use graph partitioning and social dimensions for static labeling, the proteins are labeled dynamically in the consecutive iterations of UFLB so that, after convergence, an appropriate clustering of proteins is achieved. Our unsupervised feature selection method for linked biological data takes into account both inter-protein relationship information and the feature content of proteins. It selects features that effectively discriminate proteins in the reduced space by using PPIs.

3 The proposed method

Our proposed approach is categorized as a hybrid method, since it combines both filter and wrapper strategies. In this section, we present several concepts as preliminaries of our unsupervised feature selection method. We aim to select a set of effective features that can highly discriminate the protein classes.

3.1 Notations

In this work, we use \( P = \left\{ {p_{1} ,p_{2} , \ldots ,p_{n} } \right\} \) to denote the set of \( n \) proteins and \( F = \left\{ {f_{1} , f_{2} , \ldots ,f_{m} } \right\} \) the set of \( m \) features. Also, let \( A \in {\mathcal{R}}^{m \times n} \) hold the feature values of these proteins; that is, the vector \( A\left( {:,j} \right) \) represents the features of protein \( p_{j} \), and \( A\left( {i,:} \right) \) holds the values of feature \( f_{i} \) over all proteins. Additionally, \( R \in {\mathcal{R}}^{n \times n} \) denotes the link information of the protein–protein network, where \( R\left( {i, j} \right) \) is set to 1 if proteins \( p_{i} \) and \( p_{j} \) are linked and 0 otherwise. We assume the connections between proteins are undirected, that is, \( R = R^{T} \). By applying the centering matrix \( H = I_{n} - \frac{1}{n}1_{n} 1_{n}^{T} \) to \( A \) via \( X = AH \), we obtain the data matrix \( X \in {\mathcal{R}}^{m \times n} \). This matrix is centered, that is, \( \sum\nolimits_{j = 1}^{n} {X\left( {:, j} \right)} = 0 \). In \( H \), \( I_{n} \) is the identity matrix and \( 1_{n} \) is a column vector of \( n \) ones.
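For concreteness, the construction of the centered data matrix \( X = AH \) and the symmetric link matrix \( R \) can be sketched in a few lines of Python/NumPy; the sizes and links below are small hypothetical placeholders, not the actual datasets.

```python
import numpy as np

m, n = 100, 30                         # hypothetical sizes (e.g., OMIM has m = 3522, n = 949)
rng = np.random.default_rng(0)

A = rng.random((m, n))                 # A[:, j] holds the feature values of protein p_j

# Centering matrix H = I_n - (1/n) 1_n 1_n^T, so X = A H has zero-mean rows
H = np.eye(n) - np.ones((n, n)) / n
X = A @ H
assert np.allclose(X.sum(axis=1), 0)   # each feature (row of X) is centered

# Undirected PPI link matrix R (a few hypothetical links)
R = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (3, 4)]:  # hypothetical interacting pairs
    R[i, j] = R[j, i] = 1              # symmetry: R = R^T
```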

3.2 Unsupervised feature selection for linked biological data

Supposing the \( n \) proteins are sampled from \( c \) classes/clusters, we assume there is a mapping matrix \( M \in {\mathcal{R}}^{m \times c} \) that maps the proteins to a cluster label indicator matrix \( C \in {\mathcal{R}}^{c \times n} \). In this matrix, \( C\left( {:,i} \right) \in \left\{ {0,1} \right\}^{{{\text{c}} \times 1}} \) is the cluster indicator vector for protein \( p_{i} \). To work with its scaled version \( G\left( {:,i} \right) \), we define the scaled cluster indicator matrix \( G \in {\mathcal{R}}^{c \times n} \), where \( G = \left( {CC^{T} } \right)^{{ - \frac{1}{2}}} C \) [30] and \( GG^{T} = \left( {CC^{T} } \right)^{{ - \frac{1}{2}}} CC^{T} \left( {CC^{T} } \right)^{{ - \frac{1}{2}}} = I_{c} \).
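Since \( CC^{T} \) is a diagonal matrix holding the cluster sizes, each nonzero entry of \( G \) is simply one over the square root of the corresponding cluster size, and the rows of \( G \) are orthonormal. A minimal numerical sketch (with hypothetical cluster assignments):

```python
import numpy as np

c, n = 2, 5
labels = np.array([0, 0, 1, 1, 1])       # hypothetical cluster labels of 5 proteins

C = np.zeros((c, n))
C[labels, np.arange(n)] = 1              # C(:, i) is the 0/1 indicator of protein p_i

sizes = C.sum(axis=1)                    # cluster sizes, i.e., the diagonal of C C^T
G = np.diag(1.0 / np.sqrt(sizes)) @ C    # scaled indicator: G = (C C^T)^(-1/2) C

print(np.allclose(G @ G.T, np.eye(c)))   # True: G G^T = I_c
```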

Accordingly, our aim is to learn the scaled cluster indicator matrix \( G \) and the feature selection matrix \( M \) simultaneously. In this regard, we propose to optimize the following objective function:

$$ \min_{M} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} $$
(1)

where \( \left\| \cdot \right\|_{F} \) denotes the Frobenius norm [32] and \( \left\| M \right\|_{2,1} \) is the \( l_{2,1} \)-norm of \( M \) [31], which controls the capacity of this matrix. The parameter \( \lambda \) controls the sparsity of \( M \): due to the nature of the \( l_{2,1} \)-norm penalty, some rows of \( M \) (and hence their coefficients) will be shrunk to exactly zero if \( \lambda \) is large enough.

In (1), \( M \) essentially contains the combination coefficients of the different features used to approximate \( G \). The joint minimization of the regression term and the \( l_{2,1} \)-norm regularization term enables \( M \) to evaluate the correlation between the cluster indicators and the features. Also, minimizing \( \left\| M \right\|_{2,1} \) ensures that \( M \) is sparse in rows. Together, these properties make \( M \) particularly suitable for feature selection.
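As a brief illustration (a sketch, not the authors' implementation), the two terms of (1) can be computed as follows; the \( l_{2,1} \)-norm is the sum of the Euclidean norms of the rows of \( M \).

```python
import numpy as np

def l21_norm(M):
    """l_{2,1}-norm: sum of the l_2-norms of the rows of M."""
    return np.linalg.norm(M, axis=1).sum()

def objective_eq1(M, X, G, lam):
    """Value of Eq. (1): ||M^T X - G||_F^2 + lambda * ||M||_{2,1}."""
    fit = np.linalg.norm(M.T @ X - G, ord='fro') ** 2
    return fit + lam * l21_norm(M)
```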

By considering matrix \( R \) and the fact that linked proteins are likely to have similar cluster label indicators, we minimize the following term:

$$ \min_{G} \frac{1}{2}\mathop \sum \limits_{i,j = 1}^{n} R_{ij} \left\| {G\left( {:,i} \right) - G\left( {:,j} \right)} \right\|_{2}^{2} = Tr\left[ {GLG^{T} } \right] $$
(2)

where \( L = D - R \) is the graph Laplacian matrix and \( D \) is a diagonal matrix with \( D_{ii} = \sum\nolimits_{j = 1}^{n} {R_{ij} } \) as its diagonal elements. Including (2) in (1), we obtain the new version of the objective function:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] $$
(3)
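For clarity, the identity in (2) between the pairwise form and \( Tr\left[ {GLG^{T} } \right] \) can be checked with a small sketch (hypothetical matrices; the pairwise version is for illustration only, as it needs \( O(n^{2} c) \) memory):

```python
import numpy as np

def link_regularizer(G, R):
    """Tr[G L G^T] with L = D - R, where D holds the row sums of R (Eq. (2))."""
    L = np.diag(R.sum(axis=1)) - R
    return np.trace(G @ L @ G.T)

def link_regularizer_pairwise(G, R):
    """Equivalent pairwise form: 0.5 * sum_ij R_ij * ||G(:,i) - G(:,j)||_2^2."""
    diff = G[:, :, None] - G[:, None, :]            # shape (c, n, n)
    return 0.5 * np.sum(R * np.sum(diff ** 2, axis=0))
```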

According to the definition of \( G \), its elements are constrained to be discrete values, which makes the optimization of (3) an NP-hard problem [33]. A well-known solution is to relax them from discrete values to continuous ones [33, 34], so the objective function in (3) is relaxed to:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] $$
(4)
$$ s.t.\quad GG^{T} = I_{c} $$

The last part of our objective function is formed by taking into account the (centered) protein information matrix \( X \) for discrimination. A well-known way to utilize discriminative information is to find a low-dimensional subspace in which the between-class scatter matrix \( Q_{b} \) is maximized while the total scatter matrix \( Q_{t} \) is minimized [35].

As in [43], the maximum of \( Tr\left( {\frac{{Q_{b} }}{{Q_{t} }}} \right) \) (minimum of its negative) is included in (4) and the new objective function is given by:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] - \gamma Tr\left( {\frac{{Q_{b} }}{{Q_{t} }}} \right) $$
(5)
$$ s.t.\quad GG^{T} = I_{c} $$

where the parameter \( \gamma \) controls the contribution of the discriminative term. In order to use the definitions of \( Q_{b} \) and \( Q_{t} \) from [43], we define \( Y = M^{T} X \), and thus:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left( {YY^{T} - YG^{T} GY^{T} } \right) $$
(6)
$$ s.t.\quad GG^{T} = I_{c} $$

Note that all the elements of \( G \) are non-negative by definition. However, the optimal \( G \) of (6) has mixed signs, which violates this definition and makes \( G \) deviate severely from the ideal cluster indicators. As a result, we cannot directly assign labels to data using the cluster indicator matrix \( G \). To address this problem, it is reasonable to impose a non-negativity constraint in the objective function. When both the non-negativity and orthogonality constraints are satisfied, there is only one positive element in each column of \( G \), while all others are zero. In that way, the learned \( G \) is more accurate and more capable of providing discriminative information. Therefore, by rewriting (6), the new objective function is obtained as follows:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left[ {M^{T} X\left( {I_{n} - G^{T} G} \right)X^{T} M} \right] $$
(7)
$$ s.t.\quad GG^{T} = I_{c} {\text{ and }}G \ge 0 $$

To optimize this function, we propose an iterative optimization algorithm. In this regard, we rewrite the objective function in (7) as follows:

$$ \min_{M,G} \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left[ {M^{T} X\left( {I_{n} - G^{T} G} \right)X^{T} M} \right] + \alpha \left\| {GG^{T} - I_{c} } \right\|_{F}^{2} $$
(8)
$$ s.t.\quad G \ge 0 $$

where \( \alpha > 0 \) is a parameter that controls the orthogonality condition. In practice, \( \alpha \) should be large enough to ensure that the orthogonality constraint is satisfied. For ease of presentation, the resulting objective function \( J\left( {M,G} \right) \) is defined as follows:

$$ J\left( {M,G} \right) = \left\| {M^{T} X - G} \right\|_{F}^{2} + \lambda \left\| M \right\|_{2,1} + Tr\left[ {GLG^{T} } \right] + \gamma Tr\left[ {M^{T} X\left( {I_{n} - G^{T} G} \right)X^{T} M} \right] + \alpha \left\| {GG^{T} - I_{c} } \right\|_{F}^{2} $$
(9)

Theorem 1

The mapping matrix \( M \) in \( J\left( {M,G} \right) \) can be updated as follows:

$$ M^{{\left( {\text{new}} \right)}} = \left( {XX^{T} + \gamma X\left( {I_{n} - G^{T} G} \right)X^{T} + \lambda D_{M}^{{\left( {\text{old}} \right)}} } \right)^{ - 1} XG^{T} $$
(10)

where \( D_{M} \) is an \( m \times m \) diagonal matrix with \( \frac{1}{{2\left\| {M\left( {i,:} \right)} \right\|_{2} }} \) as its \( i \)th diagonal element.

Proof

In order to minimize \( J\left( {M,G} \right) \) in (9), its derivative with respect to \( M \) is taken as follows:

$$ \frac{{\partial J\left( {M,G} \right)}}{\partial M} = 2XX^{T} M - 2XG^{T} + 2\lambda D_{M} M + 2\gamma X\left( {I_{n} - G^{T} G} \right)X^{T} M $$

By setting this derivative to zero, the update rule in (10) is obtained.□
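A minimal sketch of the update rule (10) in Python/NumPy (assuming the matrices of Sect. 3.1); a linear solve replaces the explicit inverse, and a small epsilon guards the diagonal of \( D_{M} \) against zero rows of \( M \):

```python
import numpy as np

def update_M(M_old, X, G, lam, gamma, eps=1e-12):
    """One update of the mapping matrix M according to Eq. (10)."""
    m, n = X.shape
    row_norms = np.linalg.norm(M_old, axis=1)        # ||M(i, :)||_2
    D_M = np.diag(1.0 / (2.0 * row_norms + eps))     # diagonal matrix D_M
    S = X @ (np.eye(n) - G.T @ G) @ X.T              # X (I_n - G^T G) X^T
    lhs = X @ X.T + gamma * S + lam * D_M            # m x m coefficient matrix
    return np.linalg.solve(lhs, X @ G.T)             # M^(new), of size m x c
```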

Theorem 2

The scaled cluster indicator matrix, \( G \), is updated by this rule:

$$ G_{ij}^{{\left( {\text{new}} \right)}} = G_{ij}^{{\left( {\text{old}} \right)}} \cdot \frac{{U_{ij}^{{\left( {\text{old}} \right)}} }}{{V_{ij}^{{\left( {\text{old}} \right)}} }} $$
(11)

where \( U = M^{T} X + M^{T} XG^{T} M^{T} X + 2\alpha G \) and \( V = G + GL + 2\alpha GG^{T} G \).

Proof

Following [36,37,38], we introduce multiplicative updating rules. Setting the derivative of \( J\left( {M,G} \right) \) with respect to \( G_{ij} \) to 0 and using the Karush–Kuhn–Tucker condition [39], we obtain the updating rule in (11).□
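The rule in (11) can be transcribed directly, using \( U \) and \( V \) exactly as stated in Theorem 2 (a sketch; a small epsilon avoids division by zero and the result is clipped to keep \( G \) non-negative):

```python
import numpy as np

def update_G(G_old, M, X, L, alpha, eps=1e-12):
    """One multiplicative update of the scaled cluster indicator G (Eq. (11))."""
    Y = M.T @ X                                       # Y = M^T X, of size c x n
    U = Y + Y @ G_old.T @ Y + 2.0 * alpha * G_old
    V = G_old + G_old @ L + 2.0 * alpha * G_old @ G_old.T @ G_old
    G_new = G_old * U / np.maximum(V, eps)            # element-wise update
    return np.maximum(G_new, 0.0)                     # enforce non-negativity
```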

Using the updating rule of \( M \) in (10) and that of \( G \) in (11), we have developed the iterative algorithm of UFLB, sketched below.

The larger \( \left\| {M\left( {i,:} \right)} \right\|_{2} \) is, the more informative the feature \( f_{i} \) is; the features are therefore ranked by the \( l_{2} \)-norms of the rows of \( M \).
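Putting the two rules together, the overall UFLB iteration can be sketched as follows; update_M and update_G are the helper functions sketched after Theorems 1 and 2, while the random initialization, tolerance, and iteration count are assumptions of this sketch rather than the authors' exact settings.

```python
import numpy as np

def uflb(X, R, c, lam, gamma, alpha, n_iter=50, tol=1e-4, seed=0):
    """UFLB sketch: alternate the updates of M (Eq. 10) and G (Eq. 11),
    then rank the features by the row norms of M."""
    m, n = X.shape
    rng = np.random.default_rng(seed)

    L = np.diag(R.sum(axis=1)) - R                    # graph Laplacian L = D - R
    M = rng.random((m, c))                            # random non-negative init
    G = rng.random((c, n))

    prev = np.inf
    for _ in range(n_iter):
        M = update_M(M, X, G, lam, gamma)
        G = update_G(G, M, X, L, alpha)
        obj = (np.linalg.norm(M.T @ X - G, 'fro') ** 2          # J(M, G), Eq. (9)
               + lam * np.linalg.norm(M, axis=1).sum()
               + np.trace(G @ L @ G.T)
               + gamma * np.trace(M.T @ X @ (np.eye(n) - G.T @ G) @ X.T @ M)
               + alpha * np.linalg.norm(G @ G.T - np.eye(c), 'fro') ** 2)
        if abs(prev - obj) < tol:                     # stop once J(M, G) stabilizes
            break
        prev = obj

    ranking = np.argsort(-np.linalg.norm(M, axis=1))  # most informative features first
    return M, G, ranking
```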

4 Experiments and discussion

In this section, we present experimental details to verify the effectiveness of the proposed framework, UFLB. It is compared against state-of-the-art unsupervised feature selection methods with and without link information. We evaluate the effectiveness of the selected features using both an accuracy measure and clustering quality. Then, the effects of the parameters on the performance of UFLB are discussed. Finally, its convergence is analyzed experimentally.

4.1 Datasets

In this work, labeled genes from Online Mendelian Inheritance in Man (OMIM) covering six groups of confirmed diseases are selected. The labels are cardiovascular, endocrine, cancer, metabolic, neurological, and ophthalmological disease [40, 45]. To improve the quality and performance of disease gene classification, the data are derived from multiple biological sources [41]. This dataset consists of 949 genes with 4004 features and 956 links. The features are extracted from gene ontology (3000 features), protein domains (1000 features), and protein–protein interactions (4 features) to construct the feature vector of each gene [45]. We removed the features that appear in none of the genes, reducing the dataset to 3522 features.

The second dataset is a subset of IntAct with three groups of diseases. IntAct provides an open-source database and toolkit for the storage, presentation, and analysis of protein interactions. We extract 846 genes/proteins from the cancer, Alzheimer, and Parkinson databases, with 1980 links. The sequence of each gene/protein is obtained from the UniProt database. Then, using the distribution of 1-grams, 2-grams, and 3-grams in each protein sequence, 8420 features are extracted from combinations of amino acids [44]. After removing the features that appear in none of the genes/proteins, the dataset is reduced to 8404 features.
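As a hypothetical illustration of this feature construction (not necessarily the authors' exact pipeline), the n-gram composition of a protein sequence over the 20 standard amino acids can be computed as follows; 20 + 400 + 8000 = 8420 features, matching the dimensionality quoted above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids

def ngram_features(sequence, max_n=3):
    """Relative frequencies of all 1-, 2-, and 3-grams of amino acids."""
    feats = []
    for n in range(1, max_n + 1):
        grams = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
        counts = {g: 0 for g in grams}
        total = max(len(sequence) - n + 1, 1)
        for i in range(len(sequence) - n + 1):
            window = sequence[i:i + n]
            if window in counts:                 # skip non-standard residues
                counts[window] += 1
        feats.extend(counts[g] / total for g in grams)
    return feats

print(len(ngram_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))   # 8420
```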

A subset of the HPRD database is selected as the third dataset. It contains 234 genes from four disease classes: diabetes, myopathy, syndrome, and cancer. Each gene has 8420 features extracted from combinations of amino acids. This dataset, with a small number of samples (genes) and a very large number of features, is used to evaluate UFLB in that setting.

The fourth dataset contains fewer features than the other three. It comprises 966 genes with 127 features. Its six classes, Parkinson, Alzheimer, vitiligo, chronic lymphocytic leukemia, schizophrenia, and type I diabetes mellitus, are extracted from the Hetio database.

The statistics of the four datasets are shown in Table 1.

Table 1 Statistics of four linked biological datasets

4.2 Evaluation measures

Following the convention of clustering studies, we evaluate the clustering quality of UFLB by two commonly used metrics: the Unsupervised Accuracy Measure (UAM) and Normalized Mutual Information (NMI).

Denoting \( q_{i} \) as the clustering result and \( g_{i} \) as the ground-truth label of protein \( p_{i} \), UAM is computed as [20]:

$$ {\text{UAM}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \delta \left( {g_{i} ,map\left( {q_{i} } \right)} \right) $$
(12)

where

$$ \delta \left( {g,q} \right) = \left\{ {\begin{array}{*{20}l} {1 ,\quad if\,g = q} \\ {0 ,\quad if\,g \ne q} \\ \end{array} } \right. $$
(13)

and \( map\left( q \right) \) is the best mapping function that permutes the clustering labels to match the ground-truth labels, obtained using the Kuhn–Munkres algorithm. A larger UAM indicates better performance.
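A sketch of how UAM can be computed, using SciPy's implementation of the Kuhn–Munkres (Hungarian) algorithm for the best label mapping; integer labels in the range 0..c-1 are assumed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def uam(ground_truth, clusters):
    """Unsupervised accuracy measure of Eq. (12)."""
    ground_truth, clusters = np.asarray(ground_truth), np.asarray(clusters)
    c = int(max(ground_truth.max(), clusters.max())) + 1
    # confusion[q, g] = number of proteins in cluster q with ground-truth label g
    confusion = np.zeros((c, c), dtype=int)
    for q, g in zip(clusters, ground_truth):
        confusion[q, g] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize the matched counts
    return confusion[rows, cols].sum() / len(ground_truth)
```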

Also, NMI is computed as [42]:

$$ {\text{NMI}} = \mathop \sum \limits_{l} \mathop \sum \limits_{h} t_{l,h} {\text{log}}\left( {\frac{{n \times t_{l,h} }}{{t_{l} \times t_{h} }}} \right)/\sqrt {\left( {\mathop \sum \limits_{l} t_{l} {\text{log}}\left( {\frac{{t_{l} }}{n}} \right)} \right) \left( {\mathop \sum \limits_{h} t_{h} {\text{log}}\left( {\frac{{t_{h} }}{n}} \right)} \right)} $$
(14)

where \( t_{l} \) is the number of proteins in the \( l \) th cluster \( \left( {l = 1, \ldots , c} \right) \) and \( t_{h} \) is the number of proteins in the \( h \) th ground-truth class \( \left( {h = 1, \ldots ,c} \right) \). Also, \( t_{l,h} \) is the number of proteins in both the \( l \) th cluster and the \( h \) th ground-truth class. Again, higher values of NMI indicate better clustering results.
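The NMI of (14) can be computed directly from the contingency counts \( t_{l,h} \) and the marginals \( t_{l} \), \( t_{h} \); a minimal sketch:

```python
import numpy as np

def nmi(ground_truth, clusters):
    """Normalized mutual information as in Eq. (14)."""
    ground_truth, clusters = np.asarray(ground_truth), np.asarray(clusters)
    n = len(ground_truth)
    cl_labels, gt_labels = np.unique(clusters), np.unique(ground_truth)

    # contingency counts t_{l,h} and marginals t_l, t_h
    t = np.array([[np.sum((clusters == l) & (ground_truth == h))
                   for h in gt_labels] for l in cl_labels], dtype=float)
    t_l, t_h = t.sum(axis=1), t.sum(axis=0)

    nz = t > 0                                      # avoid log(0)
    mi = np.sum(t[nz] * np.log(n * t[nz] / np.outer(t_l, t_h)[nz]))
    h_cl = np.sum(t_l * np.log(t_l / n))            # cluster entropy term
    h_gt = np.sum(t_h * np.log(t_h / n))            # class entropy term
    return mi / np.sqrt(h_cl * h_gt)
```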

4.3 Experimental results

In this subsection, we compare the quality of the features selected by different algorithms using the NMI and accuracy metrics. For baseline methods with parameters, we tried different parameter values and report the best performance. UFLB has three important parameters, \( \gamma \), \( \lambda \), and \( \alpha \), which control the discriminative information, sparsity, and orthogonality, respectively. To select the best values, each parameter is varied while holding the others fixed, observing how the accuracy of UFLB changes as different numbers of features are selected. Figure 1 depicts the NMI measure when the three parameters of UFLB are examined on the OMIM dataset.

Fig. 1 NMI measure with different values of (a) \( \alpha \), (b) \( \gamma \), and (c) \( \lambda \)

As shown in Fig. 1, NMI degrades for \( \alpha > 0.5 \). In the case of \( \gamma \), the values are closer to each other, although they are lowest for \( \gamma = 0.9 \). For \( \lambda \), the NMI values decrease as \( \lambda \) decreases; however, for \( \lambda = 0.9 \), NMI drops drastically. Consequently, setting the orthogonality, sparsity, and discrimination parameters to high values does not produce the desired results. Finally, these parameters are set to \( \alpha = 0.5 \), \( \gamma = 0.7 \), and \( \lambda = 0.3 \) in the experiments on the OMIM dataset. Similarly, for the three other datasets, the following settings are used: \( \alpha = 0.3 \), \( \gamma = 0.5 \), \( \lambda = 0.3 \) for IntAct; \( \alpha = 0.1 \), \( \gamma = 0.2 \), \( \lambda = 0.3 \) for HPRD; and \( \alpha = 0.3 \), \( \gamma = 0.1 \), \( \lambda = 0.1 \) for Hetio.

After appropriately setting the parameters of UFLB, its performance is compared against seven unsupervised feature selection algorithms in terms of the UAM and NMI metrics. These methods are described briefly below.

(1) UDFS [20] optimizes an \( l_{2,1} \)-norm regularized minimization problem with an orthogonal constraint. It selects the most discriminative feature subset from the whole feature set in batch mode.

(2) UFSS [43] utilizes both the relationships between instances and the information of the features in its objective function. This function seeks a mapping matrix that captures the discriminative information of each feature; the ranked features are then obtained from this mapping matrix.

(3) NMF [37] is a matrix factorization method which approximately decomposes a known matrix into two unknown matrices of much lower dimensions.

(4) The Laplacian score [14] is based on the observation that, in many real-world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its locality-preserving power.

(5) LUFS [2] is an unsupervised feature selection framework for linked data in social media. It utilizes the concept of social dimensions from social network analysis to extract the relations among linked data as groups, and then defines a social dimension regularization, inspired by linear discriminant analysis, to model these relations mathematically.

(6) SPEC [13] is a general framework of spectral feature selection for both supervised and unsupervised learning, which evaluates features via the spectrum of a similarity graph built from the data. This algorithm performs well at both removing redundant features and preserving relevant ones.

(7) CGSSL [6] jointly exploits non-negative spectral analysis and structural learning with sparsity. In this unsupervised feature selection approach, the cluster indicators learned by non-negative spectral clustering are used to provide label information for the structural learning.

The comparison results w.r.t. both UAM and NMI are presented in Tables 2 and 3 for the OMIM dataset. In these tables, the \( q \) best features of each method are examined. Moreover, the clustering performance with all features (i.e., without feature selection) is also reported. Note that the results of each evaluation criterion are reported in two forms: as a table (left) and as a plot (right).

Table 2 NMI measure of different feature selection methods for OMIM dataset
Table 3 UAM of different feature selection methods for OMIM dataset

According to the results in Tables 2 and 3, UFLB mostly outperforms the seven compared methods. Although the NMI results for UFLB are much better than those of the other methods, the values are generally small. This is because proteins belonging to the same ground-truth class are not necessarily grouped into the same cluster, which leads to a noticeable decrease in NMI. The same holds for UAM.

In order to evaluate the accuracy measure more reliably, ground-truth labels are considered. Using these labels, a multi-class Support Vector Machine (SVM) classifier [46] is trained, and the resulting classification accuracy is reported in Table 4. As shown there, the accuracy of UFLB is close to that of LUFS and UFSS in most cases, because these algorithms also take the link information into account. However, the execution time of UFLB is lower than that of both LUFS and UFSS.

Table 4 Classification accuracy of different feature selection methods for OMIM dataset
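A sketch of how such a classification check could be set up with scikit-learn; the split ratio, kernel, and the use of the ranking produced by the UFLB sketch in Sect. 3 are assumptions, not the authors' exact protocol.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def classification_accuracy(X, labels, ranking, q, seed=0):
    """Train a multi-class SVM on the q top-ranked features and report test accuracy."""
    Xq = X[ranking[:q], :].T                     # proteins as rows, q selected features
    X_tr, X_te, y_tr, y_te = train_test_split(
        Xq, labels, test_size=0.3, random_state=seed, stratify=labels)
    clf = SVC(kernel='linear')                   # multi-class SVM (one-vs-one)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```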

In the same way, the NMI, UAM, and classification accuracy of the eight methods for the IntAct dataset are shown in Tables 5, 6, and 7, respectively.

Table 5 NMI measure of different feature selection methods for IntAct dataset
Table 6 UAM of different feature selection methods for IntAct dataset
Table 7 Classification accuracy of different feature selection methods for IntAct dataset

It is clear that, in most cases on the IntAct dataset, UFLB is better than the other methods. It outperforms almost all traditional feature selection methods, which do not take link information into account. Also, in comparison with UFSS and LUFS, our proposed method obtains better results in most cases.

In Tables 8, 9, and 10, the NMI, UAM, and classification accuracy of the eight feature selection methods are shown for the HPRD dataset. UFLB works best in most cases, even though the sample size is not large. This shows that the performance of UFLB is acceptable on medium-sized datasets with very high dimensionality.

Table 8 NMI measure of different feature selection methods for HPRD dataset
Table 9 UAM of different feature selection methods for HPRD dataset
Table 10 Classification accuracy of different feature selection methods for HPRD dataset

The results of NMI, UAM, and accuracy on the Hetio dataset are reported in Tables 11, 12, and 13, where UFLB is compared against seven unsupervised feature selection methods. According to these results, UFLB outperforms the other methods in most cases on this dataset, which has many genes but few features, because it takes the link information into account.

Table 11 NMI measure of different feature selection methods for Hetio dataset
Table 12 UAM of different feature selection methods for Hetio dataset
Table 13 Classification accuracy of different feature selection methods for Hetio dataset

4.4 Time complexity

To evaluate the time complexity of UFLB, all unsupervised feature selection methods are implemented in MATLAB R2017 and their required CPU time is compared in the experiments. The code is run on a Core i7, 1.8 GHz CPU with 8 GB of memory under 64-bit Windows 10. Table 14 shows the CPU time of UFLB and the seven compared methods on the four datasets.

Table 14 CPU time (in seconds) of UFLB against seven methods, run on four datasets

From the results, it is clear that UFLB is considerably faster than LUFS and UFSS, which also consider link information. It is also faster than CGSSL and UDFS. However, UFLB is not faster than SPEC, the Laplacian score, and NMF, since they do not use link information during feature selection. Methods that select features by considering the link information between samples naturally require more CPU time to process these connections, in exchange for improved performance.

4.5 Non-parametric test

In order to statistically assess the compared methods, we use a non-parametric statistical test to check for significant differences among them. Friedman's test [47] with a significance level of 0.05 is used in the experiment. This test is typically applied to detect significant differences among more than two results.

In Table 15, the classification accuracy of all algorithms for the smallest subset of selected features (as reported in Tables 4, 7, 10, and 13) on the four datasets is displayed. We apply Friedman's test to these accuracies to test the hypothesis that all the feature selection methods perform equally well at the given significance level. The test ranks the methods for each dataset separately, and the best-performing method receives the highest rank.

Table 15 Classification accuracy of different feature selection methods for the smallest subset of selected features on four datasets

Applying Friedman's non-parametric test yields a p value < 0.01, so it can be concluded that at least two of the algorithms differ significantly from each other. The average rankings of the eight methods on the four biological datasets, obtained from Friedman's test, are shown in Table 16. These rankings indicate that UFLB is the most effective method for the classification tasks.

Table 16 Average rankings of the compared methods
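A sketch of how this test can be run with scipy.stats.friedmanchisquare; the accuracy values below are placeholders, not the numbers reported in Table 15.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# rows: datasets (OMIM, IntAct, HPRD, Hetio); columns: the eight compared methods
acc = np.array([                      # placeholder accuracies, not the reported ones
    [0.61, 0.55, 0.50, 0.48, 0.58, 0.47, 0.52, 0.54],
    [0.72, 0.66, 0.60, 0.58, 0.69, 0.57, 0.62, 0.64],
    [0.58, 0.52, 0.47, 0.45, 0.55, 0.44, 0.49, 0.51],
    [0.66, 0.60, 0.55, 0.53, 0.63, 0.52, 0.57, 0.59],
])

stat, p = friedmanchisquare(*acc.T)   # one sample of accuracies per method
print(f"Friedman statistic = {stat:.3f}, p value = {p:.4f}")

avg_ranks = rankdata(acc, axis=1).mean(axis=0)   # higher accuracy -> higher rank
print(avg_ranks)
```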

5 Discussion

In biological studies, it is important to know which features can better discriminate the groups of hereditary diseases. In this section, we examine the nature of the selected features in the four datasets. First, by examining the top-ranked features for the OMIM dataset, we found that the 1000 top features are related to gene ontology (specifically the biological process subontology). According to [46], gene ontology includes three subontologies: molecular function, the elemental activities of a gene product at the molecular level; biological process, a set of molecular functions; and cellular component, which represents parts of a cell or its extracellular environment. Thus, the features derived from the biological process subontology discriminate better.

Similarly, for the IntAct and HPRD datasets, the top-ranked features are investigated. Since most of the features come from the 3-gram distribution (compared to 1-grams and 2-grams), it is expected that a considerable portion of the top-ranked features belongs to this category. In the IntAct dataset, among the 1000 top features, 202 are 2-gram features and only 3 are 1-gram features. This means that the 1-gram features are not discriminative for disease genes.

Furthermore, for the Hetio dataset, two groups of features are extracted: (1) features computed for each gene–disease pair, and (2) a processed version of the GNF BodyMap [47], providing a gene's expression value for 77 specific tissues, which allows more precise discrimination.

As stated before, UFLB mostly outperforms the other methods. In particular, for small numbers of best features on the four datasets, UFLB is usually the best method. This means that UFLB is able to recognize the top discriminative features better than the other methods. It is worth mentioning that the methods that consider link information usually outperform the others. This is because interacting genes/proteins have more similar characteristics than unrelated ones. Thus, it is beneficial to incorporate PPIs alongside features in the feature selection process.

5.1 Convergence study

In this part, we study the convergence of UFLB experimentally. Figure 2 shows the value of the objective function in (9) over consecutive iterations of the algorithm. From these plots, it is clear that UFLB converges after only a few iterations on all four datasets. This confirms that the UFLB algorithm converges reliably and quickly.

Fig. 2 Convergence curves of UFLB for (a) OMIM, (b) IntAct, (c) HPRD, and (d) Hetio datasets

6 Conclusion

Classification of hereditary disease genes/proteins plays a significant role in the prediction and diagnosis of diseases. Diseases with the same or similar phenotypes share the biological features that describe them. Since there are often a large number of features associated with biological data (genes/proteins), it is important to find out which input features are useful in the diagnosis of a given disease. For this reason, feature selection is an important tool in biological research.

On the other hand, almost all methods presented so far for feature selection in biological data have not considered the inter-relationships between data instances. Since interacting proteins have more similar characteristics than unrelated ones, it is beneficial to incorporate PPIs in addition to features in feature selection.

Therefore, an unsupervised method for feature selection is proposed here, because a huge amount of biological data is unlabeled. For this purpose, by optimizing a novel objective function, which incorporates the inter-relationships of genes/proteins in addition to their features, the top-ranked features are extracted. Also, unlike other methods, the data are labeled dynamically in the consecutive iterations of the proposed algorithm so that, after convergence, an appropriate clustering of proteins is achieved.

We compared our proposed method against state-of-the-art methods using well-known evaluation criteria on four real-world datasets. The experimental results demonstrate the effectiveness of our proposed method in exploiting link information to select informative features.