
1 Introduction

The development of microarray technology has enabled researchers to obtain large amounts of gene expression data. Such data are characterized by small sample sizes and high dimensionality [1], and they contain many genes unrelated to the disease of interest. Therefore, when mining deep information from gene expression data, the key is to select an effective method for obtaining accurate sets of pathogenic genes.

Formal Concept Analysis (FCA), proposed by R. Wille in the 1980s, is an effective tool for data analysis. It essentially reflects the association between objects and attributes (samples and features), and the Hasse diagram embodies the relationship between instantiation and generalization. FCA has been applied to gene expression data to mine deep information, and existing methods have extended and applied concept lattices in many ways.

In 2009, Mehdi proposed two algorithms based on inter-ordinal scaling and pattern structures; the pattern-structure algorithm computes an interval algebra by adapting the standard algorithm [2]. In 2010, Dexing Wang introduced association rules to reduce concept lattices built on biological data [3]. In 2011, Benjamin J. constructed concept lattices over the gene expression data of similar diseases and found gene sets reflecting strong relationships among multiple diseases [4]. In 2018, Hongxiang Tang used structural dependence in the data to construct concept lattices and mined interesting information from genetic data [5]. Many current methods obtain good results. However, in the constructed concept lattice, it remains difficult to separate interesting concepts containing important information from redundant concepts caused by redundant genes.

Feature selection, an important data preprocessing step, can select important feature genes in biological data to alleviate the high-dimensionality problem while removing irrelevant feature genes [6]. In genetic data, a sample often contains tens of thousands of gene expression values, most of which cannot account for the disease. Feature selection therefore selects the smallest number of features from the original data that retains as much information as possible [7]. A concept lattice constructed on the selected features contains fewer redundant concepts and expresses the data more validly and accurately. Many feature selection methods exist, but a single feature selection method may bias classification and ignore interesting genes.

Chong Du proposed an integrated feature selection method that improves the accuracy of single selection methods on gene expression data. Maghsoudloo established a hybrid feature selection framework for extracting crucial genes of asthma and other lung diseases as biomarkers [8]. Gang Fang used ensemble-learning-based feature selection to find genes related to stroke patients and to predict acute stroke [9, 10].

The contributions of this paper are listed as follows:

  1. This paper proposes an integrated feature selection method called Multi-Cluster minimum Redundancy (MCmR). It reduces redundancy among the features selected from multiple clusters and lessens the interference of redundant genes in feature selection, so it obtains excellent feature subsets across multiple clusters.

  2. This paper uses the Prostate_Tumor, Lung_cancer, Breast_cancer and Leukemia datasets to verify the effectiveness of MCmR. The experimental results show that it obtains an excellent feature subset on every dataset and outperforms the compared methods.

In what follows, Sect. 2 introduces the definitions and concepts of the related methods. Section 3 presents the experiments and results on the datasets. Section 4 concludes this paper.

2 Related Work

2.1 Formal Concept Analysis

FCA takes concepts as basic elements, and each node of the lattice represents a concept. As the core structure of the analysis, the concept lattice builds a conceptual hierarchy over the formal context from the partial order among concepts [1, 2]. The Hasse diagram, as a visualization of the concept lattice, intuitively describes the information in the data.

Definition 1: A triple \(K = (G,A,I)\) composed of an object set \(G\), an attribute set \(A\) and a binary relation \(I\) between objects and attributes is called a formal context, where \(G = \{ g_{1} ,g_{2} ,...,g_{n} \}\), \(A = \{ a_{1} ,a_{2} ,...,a_{m} \}\) and \(I \subseteq G \times A\) [4]. If \(g \in G\) and \(a \in A\) satisfy \((g,a) \in I\) (also written \(gIa\)), then the object \(g\) has attribute \(a\), which is represented by 1 in the context table (otherwise by 0).

Definition 2: A pair \((X,M)\) extracted from the formal context \(K = (G,A,I)\) that satisfies \(X^{\prime} = M\) and \(M^{\prime} = X\), where \(X \subseteq G\) and \(M \subseteq A\), is called a formal concept (abbreviated as concept, \(C\)). \(X\) is called the extension of the concept, the set of objects sharing all attributes of the concept, and \(M\) is called the intension of the concept, the set of attributes shared by all objects of the concept [11]. Here \(X^{\prime}\) and \(M^{\prime}\) satisfy the following equation:

$$ \begin{gathered} X^{^{\prime}} = \{ a|a \in A,\forall x \in X,xIa\} ,X \subseteq G \hfill \\ M^{^{\prime}} = \{ x|x \in G,\forall a \in M,xIa\} ,M \subseteq A \hfill \\ \end{gathered} $$
(1)

Definition 3: Let \((X_{1} ,M_{1} )\) and \((X_{2} ,M_{2} )\), named \(C_{1}\) and \(C_{2}\), be two concepts of the formal context. If \(X_{1} \subseteq X_{2}\) (equivalently, \(M_{2} \subseteq M_{1}\)), then \((X_{1} ,M_{1} )\) is called a sub-concept of \((X_{2} ,M_{2} )\), and \((X_{2} ,M_{2} )\) a super-concept of \((X_{1} ,M_{1} )\), denoted \(C_{1} \le C_{2}\) [5, 12]. This partial order between concepts builds the complete concept lattice. It is formulated as follows (see Fig. 1) (Table 1):

$$ \begin{gathered} (X_{1} ,M_{1} ) \le (X_{2} ,M_{2} ) \Leftrightarrow X_{1} \subseteq X_{2} \hfill \\ (X_{1} ,M_{1} ) \le (X_{2} ,M_{2} ) \Leftrightarrow M_{2} \subseteq M_{1} \hfill \\ \end{gathered} $$
(2)
Table 1. Formal context
Fig. 1. The concept lattice.
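
The derivation operators of Eq. (1) and the closure-based enumeration of concepts can be sketched directly in code. The following is a minimal Python illustration over a small, hypothetical 4x4 formal context (not the context of Table 1); the array `I` and the helper names `extent` and `intent` are illustrative assumptions.

```python
# Derivation operators of Eq. (1) and naive concept enumeration
# over a small, hypothetical binary formal context.
from itertools import combinations

import numpy as np

# Hypothetical 4x4 context: rows = objects g1..g4, columns = attributes a1..a4.
I = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 1, 1, 0]], dtype=bool)

def extent(attr_idx):
    """M' : objects that possess every attribute in attr_idx."""
    if not attr_idx:
        return frozenset(range(I.shape[0]))
    return frozenset(np.flatnonzero(I[:, list(attr_idx)].all(axis=1)))

def intent(obj_idx):
    """X' : attributes shared by every object in obj_idx."""
    if not obj_idx:
        return frozenset(range(I.shape[1]))
    return frozenset(np.flatnonzero(I[list(obj_idx), :].all(axis=0)))

# Every formal concept (X, M) arises as the closure of some attribute subset.
concepts = set()
n_attr = I.shape[1]
for r in range(n_attr + 1):
    for m in combinations(range(n_attr), r):
        X = extent(m)
        M = intent(X)           # closure of the attribute subset
        concepts.add((X, M))

for X, M in sorted(concepts, key=lambda c: len(c[0]), reverse=True):
    print("extent:", sorted(X), "intent:", sorted(M))
```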

2.2 Feature Selection Method

Feature selection removes redundant features by selecting feature subsets with strong discriminative capacity from the high-dimensional raw data, while the selected features retain as much information of the raw data as possible [7]. Applying feature selection to gene data yields a subset of disease-related feature genes and removes redundant genes. Feature selection methods are divided into supervised and unsupervised algorithms [13]; unsupervised methods have become popular because they do not require labels in advance.

MCFS

The Multi-Cluster Feature Selection (MCFS) method uses spectral embedding for cluster analysis: it reduces dimensionality by constructing a graph, defining a weight matrix (e.g., heat-kernel weighting), and mapping the features. The eigenvalues and eigenvectors of the Laplacian matrix \(L\) are computed from the following formula [10]:

$$ Ly = \lambda Dy $$
(3)

\(D\) is a diagonal matrix with \(D_{ii} = \sum\nolimits_{j} {W_{ij} }\), so the Laplacian is \(L = D - W\). \(Y = [y_{1} ,y_{2} ,...,y_{k} ]\), where the \(y_{k}\) are the eigenvectors corresponding to the \(k\) smallest non-zero eigenvalues [14].

Sparse coefficient vectors \(\{ a_{k} \}_{k = 1}^{K} \in {\mathbb{R}}^{M}\) measure the significance of each dimension and the capability of each feature to distinguish different clusters. They are computed with the LARs algorithm by solving an L1-regularized regression for every \(y_{k}\) in \(Y\), which yields \(k\) sparse coefficient vectors [15].

$$ \begin{gathered} \mathop {\min }\limits_{{a_{k} }} ||y_{k} - X^{T} a_{k} ||^{2} \hfill \\ s.t.\;||a_{k} ||_{1} \le \gamma \hfill \\ \end{gathered} $$
(4)

Definition 4: For each feature \(j\), \(MCFS(j) = \mathop {\max }\limits_{k} |a_{k,j} |\), where \(a_{k,j}\) is the \(j\)-th element of \(a_{k}\). The MCFS_score is computed for every feature, and the top \(d\) features in descending order of score are selected [10].
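
To make the MCFS pipeline concrete, here is a minimal Python sketch under the formulas above: a k-NN graph with heat-kernel weights, the spectral embedding from \(Ly = \lambda Dy\), one LARs regression per embedding dimension, and the score \(MCFS(j) = \max_{k} |a_{k,j}|\). The parameter names (`n_clusters`, `n_neighbors`, `sigma`, `nonzero`) and the random data are illustrative assumptions, not values from the paper.

```python
# A minimal MCFS sketch: graph construction, spectral embedding,
# L1-regularized regression (LARs), and the per-feature MCFS score.
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import Lars
from sklearn.neighbors import kneighbors_graph

def mcfs_scores(X, n_clusters=5, n_neighbors=5, sigma=1.0, nonzero=10):
    """X: (n_samples, n_features). Returns one MCFS score per feature."""
    # Heat-kernel weight matrix on a symmetrized k-NN graph.
    dist = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    W = np.where(dist > 0, np.exp(-dist**2 / (2 * sigma**2)), 0.0)
    W = np.maximum(W, W.T)

    D = np.diag(W.sum(axis=1))
    L = D - W
    # Generalized eigenproblem Ly = lambda*D*y; skip the trivial first eigenvector.
    _, vecs = eigh(L, D)
    Y = vecs[:, 1:n_clusters + 1]

    # One sparse regression per embedding dimension (LARs with a limited active set).
    A = np.zeros((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        A[k] = Lars(n_nonzero_coefs=nonzero).fit(X, Y[:, k]).coef_
    return np.abs(A).max(axis=0)            # MCFS(j) = max_k |a_{k,j}|

# Example: keep the top-d features of a random matrix (placeholder for gene data).
X = np.random.rand(60, 200)
scores = mcfs_scores(X)
top_d = np.argsort(scores)[::-1][:20]
```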

mRMR

The maximum relevance minimum redundancy (mRMR) method uses mutual information for feature selection [16]. The relevance between a feature and the target category is measured by the mutual information \(I(f_{i} ;c)\), and the redundancy between two features by the mutual information \(I(f_{i} ;f_{j} )\) [6]. The feature subset selected by mRMR has maximum relevance and minimum redundancy.

$$ mRMR = \mathop {\max }\limits_{S} [D(S) - R(S)] $$
(5)

where \(R(S) = \frac{1}{{|S|^{2} }}\sum\limits_{{f_{i} ,f_{j} \in S}} {I(f_{i} ;f_{j} )}\) is the redundancy of all features in the set \(S\), and \(D(S) = \frac{1}{|S|}\sum\limits_{{f_{i} \in S}} {I(f_{i} ;c)}\) is the relevance between all features in \(S\) and the category \(c\).

Assuming that \(t\) features have already been placed in the subset \(S_{t}\), the next feature is selected from the set \(\{ S - S_{t} \}\) according to the following formula [17]:

$$ \mathop {\max }\limits_{{f_{j} \in S - S_{t} }} [I(f_{j} ;c) - \frac{1}{t}\sum\limits_{{f_{i} \in S_{t} }} {I(f_{j} ;f_{i} )} ] $$
(6)
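
As an illustration of the incremental rule in Eq. (6), the sketch below greedily adds the feature with the highest relevance-minus-redundancy score. It assumes the expression values have already been discretized so that `mutual_info_score` can estimate mutual information between genes; the function name `mrmr_select` and the parameter `n_selected` are illustrative.

```python
# Greedy incremental mRMR selection (Eq. (6)) on discretized data.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, n_selected):
    """X: (n_samples, n_features), discretized; y: class labels."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])

    selected = [int(np.argmax(relevance))]      # start with the most relevant feature
    while len(selected) < n_selected:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean mutual information with the already selected features.
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, i]) for i in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```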

2.3 Multi-cluster Minimum Redundancy

MCFS uses the correlation within each cluster to select features, calculates the MCFS_score of the selected features, and keeps the first \(d\) features. The method performs well only when the number of clusters is below 50, and it does not check for redundancy between the features of different clusters before scoring them. mRMR, in contrast, detects feature redundancy between clusters and eliminates the features with higher redundancy, so it can be used to further screen the selected features. We therefore propose an integrated feature selection method named MCmR, which keeps its advantages with a large number of clusters [17]. MCmR makes full use of the effective information (see Fig. 2), and the proposed feature selection algorithm effectively reduces the formal context and the concept lattice [18].

Fig. 2. The process of feature selection by MCmR.

The MCmR procedure is described as follows (a Python sketch of the whole pipeline is given after the list):

  1. Perform spectral clustering on the input data: construct the diagonal matrix \(D\) from the weight matrix \(W\) and obtain the set \(Y = [y_{1} ,y_{2} ,...,y_{k} ]\) from \(Ly = \lambda Dy\).

  2. Obtain \(k\) sparse coefficient vectors by solving the L1-regularized regression of each \(y_{k}\) in \(Y\).

  3. Set the mRMR threshold and filter out the features with higher redundancy using \(mRMR = \mathop {\max }\limits_{S} [D(S) - R(S)]\).

  4. Calculate the MCFS_scores of the remaining features, sort them in descending order, and select the top \(d\) features.

  5. Output the feature subset \(S\).
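
The following is one possible reading of steps 1-5 in Python, reusing the `mcfs_scores` helper sketched in Sect. 2.2. The redundancy threshold, the median-based discretization, and the rule of keeping the top \(d\) = half of the filtered features (taken from the description in Sect. 3.2) are assumptions, not the authors' reference implementation.

```python
# A hedged sketch of the MCmR pipeline (steps 1-5 above).
import numpy as np
from sklearn.metrics import mutual_info_score

def mcmr_select(X, n_clusters=50, redundancy_threshold=0.5):
    """X: (n_samples, n_genes) expression matrix. Returns indices of the kept genes."""
    # Steps 1-2: spectral embedding and sparse coefficient vectors -> MCFS scores.
    scores = mcfs_scores(X, n_clusters=n_clusters)   # helper from the MCFS sketch
    candidates = np.argsort(scores)[::-1]            # most informative genes first

    # Crude per-gene discretization (above/below the median) so mutual
    # information between genes can be estimated for the redundancy check.
    Xd = (X > np.median(X, axis=0)).astype(int)

    # Step 3: drop candidates that are too redundant w.r.t. genes already kept.
    kept = []
    for j in candidates:
        if all(mutual_info_score(Xd[:, j], Xd[:, i]) <= redundancy_threshold
               for i in kept):
            kept.append(j)

    # Steps 4-5: rank the surviving genes by MCFS score and output the top d.
    d = len(kept) // 2
    kept.sort(key=lambda j: scores[j], reverse=True)
    return kept[:d]
```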

MCmR not only obtains more informative feature genes with a smaller number of clusters, but also maintains accuracy as the number of clusters increases. It compensates for the shortcoming of single feature selection methods under a large number of clusters and improves the accuracy of feature selection by selecting from the feature sets multiple times [19].

3 Experiment and Results

3.1 Datasets

This paper verifies the effectiveness of MCmR on the Prostate_Tumor, Lung_Cancer, Breast_cancer and Leukemia datasets. Detailed information about the datasets is shown in Table 2.

Table 2. Detailed information of the datasets

3.2 Experiment

Evaluation Metrics

In the experiments we use the Normalized Mutual Information (NMI) and Accuracy as evaluation metrics [14]. Comparing the cluster labels produced by an algorithm with the labels of the raw data measures the clustering performance. Let \(L\) and \(L^{^{\prime}}\) be two sets of cluster labels, provided by the data and by the algorithm respectively. The mutual information \(MI(L,L^{^{\prime}} )\) is defined as follows:

$$ MI(L,L^{^{\prime}} ) = \sum\limits_{i = 1}^{|L|} {\sum\limits_{j = 1}^{{|L^{^{\prime}} |}} {P(i,j)\log \left( {\frac{P(i,j)}{{P(i)P^{^{\prime}} (j)}}} \right)} } $$
(7)

where \(P(i)\) is the probability that a sample picked at random falls into class \(L_{i}\), and \(P(i,j)\) is the probability that a sample picked at random falls into both class \(L_{i}\) and class \(L_{j}^{^{\prime}}\).

The NMI is as follows:

$$ NMI(L,L^{^{\prime}} ) = \frac{{MI(L,L^{^{\prime}} )}}{{\sqrt {H(L)H(L^{^{\prime}} )} }} $$
(8)

where \(H(L)\) and \(H(L^{^{\prime}} )\) are the entropies of \(L\) and \(L^{^{\prime}}\) respectively. The metric ranges from 0 to 1; a larger value indicates that \(L\) and \(L^{^{\prime}}\) are more similar. NMI = 1 if the two sets of clusters are identical, and NMI = 0 if the two sets are independent [10].
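
As a quick sanity check of Eqs. (7)-(8), the snippet below compares two made-up label assignments with scikit-learn's `normalized_mutual_info_score`, using the geometric-mean normalization of Eq. (8); the label vectors are purely illustrative.

```python
# NMI between the labels provided by the data and those produced by an algorithm.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

L_true = np.array([0, 0, 1, 1, 2, 2])     # labels provided by the data
L_pred = np.array([0, 0, 1, 2, 2, 2])     # labels produced by the algorithm

nmi = normalized_mutual_info_score(L_true, L_pred, average_method="geometric")
print(f"NMI = {nmi:.3f}")                 # 1.0 for identical partitions, 0 if independent
```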

Results

In this part, we compare the performance of MCmR with MCFS, mRMR and Relief for various cluster numbers on Prostate_Tumor (see Fig. 3).

Fig. 3. The performance of MCmR, MCFS, mRMR and Relief for various cluster numbers on Prostate_Tumor.

Figure 3 shows the performance of the different algorithms with 30, 50, 70 and 90 clusters. Panel (a) shows that MCmR performs as well as MCFS when fewer than 50 features are selected. Panels (b) and (c) show that the performance of MCmR gradually improves and is always higher than that of MCFS with 50 and 70 clusters; MCFS obtains its best results with 70 clusters. With more than 90 clusters, the performance deteriorates. Panel (d) shows that MCmR performs best with 90 clusters. Too many clusters introduce more redundant genes, while too few can filter out disease-related genes; compared with 70 and 110 clusters, MCmR obtains better results on Prostate_Tumor with 90 clusters. The number of selected features also affects the results: panels (a)-(d) show that MCmR performs best with 130 features, where the number of features \(d\) is half of the feature set satisfying the mRMR threshold. We then compare the experimental effects for different feature numbers to verify the best \(d\).

Table 3. Cluster performance (%) by using 130 features on Prostate_Tumor

Table 3 shows that MCmR obtains the best results on Prostate_Tumor, which proves that the proposed method is effective. Compared with MCFS, mRMR and Relief, MCmR improves the performance by 4.1, 12 and 10.8, respectively.

Table 4. The ACC of methods on various datasets.

Table 4 shows that the ACC of Relief and mRMR are close, reaching 82.0 and 82.4 on average. MCFS performs better than these two, reaching 89.8 on average. The highest ACC belongs to MCmR, at 95.5 on average, which demonstrates the effectiveness and superiority of MCmR in feature selection.

Table 5. The number of concepts on different datasets.

In Table 5, we use MCmR to construct a concept lattice on each dataset. Compared with MCFS, MCmR yields fewer concepts in the concept lattice. The table shows that MCmR performs differently on different datasets; it performs best on Lung_cancer, with a concept reduction of 10.1%. Using feature selection to remove redundant genes from high-dimensional gene expression data and to obtain feature genes reduces the formal context and the number of concepts in the concept lattice (see Fig. 4).

The experimental results show that MCmR improves the accuracy of related gene selection and reduces redundant genes compared with other methods.

Fig. 4. MCmR reduces the Breast_cancer concept lattice.

4 Conclusion

To address the poor performance of feature selection with a large number of clusters, this paper proposes an integrated feature selection method, MCmR, to improve the capability of feature selection on genetic data. We focus on extracting feature genes through clustering to reduce attribute genes and concepts. Our experiments show that MCmR improves the accuracy of data classification while reducing the number of concepts in the concept lattice. However, the method has a high computational overhead; in future work we will try to reduce it by improving the framework of MCmR.