1 Introduction

DNA Microarray technology signs gene expression in a cell, which is implemented by gene expression data [19, 20]. Gene expression data matrix illuminates expression profiles of thousands of genes under a series of different sample groups. The matrix is applied to the orthodox clustering technology, whose rows, columns and elements are made up of genes, samples and gene expression level values along certain samples severally. Clustering technology is used to merge genes (rows) based on all samples (columns) or vice versa. In some researches, the guilt-by-association heuristic is hypothesized that genomes have the same regulatory scheme with similar gene expression level, which have the same or similar heuristic [13, 16, 17, 21, 28]. This viewpoint is recently being reformulated, because the co-expression of a set of genes may be independently activated under some certain sample groups, rather than the same regulatory mechanism is determined [14, 18, 23, 33]. In this case, the researchers have proposed the concept of biclustering technology based on clustering technology [24, 30, 32]. Biclustering of gene expression data that gathers genomes under a subset of samples and not all samples or cluster sample groups under partial genome and not all genes was proposed. In other words, the difference between orthodox clustering and biclustering is that the biclustering concerns clustering genes and samples simultaneously.

The advantage of biclustering approach is to seek sets of locally co-expressed genes efficiently without acting on the entire gene expression matrix. The first successful implementation of biclustering on gene expression data matrix was completed by Cheng and Church [8]. So far, large amounts of biclustering algorithms have been reported with two typical methods: different heuristic and search norm to seek biclusters [6, 7, 29]. Several classic biclustering algorithms including Cheng and Church’s algorithm (CC), Order-preserving Submatrix Algorithm (OPSM) [3], Flexible Overlapped biclustering Algorithm (FLOC) [31], Iterative Signature Algorithm (ISA) [4], Factor Analysis for Bicluster Acquisition (FABIA) [10] and QUBIC [15], are commonly used as standards for evaluating the other biclustering algorithms. CC, a greedy iterative search algorithm, generates a bicluster by adding and deleting rows or columns alternately. OPSM algorithm based on biclustering enumeration have been demonstrated as a high-efficiency method in biclustering family [2]. FLOC, based on the method described by J.Yang et al., has the advantages of being able to obtain a set of optimal biclusters simultaneously and combining methods for dealing with missing values. Besides, ISA, FABIA and QUBIC are also effective biclustering algorithms that have various advantages in processing different gene expression datasets [30].

The gene expression level based on a set of samples can be considered as discrete random variable values. Therefore, it is reasonable to predict the degree of linear dependence between two genes by calculating the correlation value of two random variables. In this paper, the concept is used to explore the optimal biclusters in gene expression data matrix, which acts as an important factor in the construction of the fitness function in the biclustering algorithm in this paper.

BASF algorithm is enlightened by the foraging behavior of fish in the search for high-quality food sources and has been applied to the analysis of optimization problems [9]. The foraging behavior is specially divided into four modules: the preying, aggregating, following and random behavior. In order to improve the ability of BASF algorithm to search the global optimum, the Binary Simulated Annealing algorithm is introduced to form the hybrid BAFS-BSA algorithm. In this work, we presented a new biclustering technology based on the hybrid BAFS-BSA algorithm.

The remainder of the paper is organized as follows. Section 2 presents the biclustering technology based on the hybrid BAFS-BSA algorithm. Section 3 evaluates the performance of the proposed method based on several commonly used gene expression datasets. Finally, Section 4 concludes this paper.

2 Methodology

2.1 Initialization

The initialization phase of the algorithm was composed of two parts. First, K-Means clustering method was adopted to cluster genes under samples and gather samples based on genes respectively. Therefore, m gene clusters and n sample clusters were produced. Then the gene clusters and sample clusters were combined to form m × n co-clusters. Second, these co-clusters were encoded into binary form in order to unify their formats. The length of each of encoded co-cluster was p, where p standed for the sum of the number of genes and samples in the gene expression data matrix. When a gene or a sample was in a co-cluster, the corresponding bit was set to 1, otherwise, it was set to 0. Moreover, the input datum of the Hybrid BAFS-BSA algorithm were these encoded co-clusters, namely Artificial Fishes (AFs). Figure 1 depicts the general format of AF [26].

Fig. 1
figure 1

The general format of AF

2.2 Fitness function in BASF-BSA-BIC

In the BAFS-BSA-BIC algorithm, firstly, an objective function as fitness function was defined in order to find biologically relevant biclusters. The fitness function we used reflected the relationship between the linear correlation coefficients of all the genes in a bicluster and the volume of a bicluster. By determining the relationship between these two aspects, we looked for related genes with shift and scale patterns.

To calculate the fitness function value of a bicluster, a bicluster B needed to be determined first. Each AF represents a bicluster in BAFS-BSA-BIC algorithm. For example, an AF is 10,101|101, and M is a 5 × 3 gene expression data matrix, containing 5 rows (genes) and 3 columns (samples). However, the size of actual gene expression data matrix is much larger than the data matrix in this example. As shown in Fig. 2, a 3 × 2 gene expression data matrix is extracted, namely bicluster B.

Fig. 2
figure 2

Conversion method between AF and bicluster in the gene expression data matrix

The proposed fitness function to estimate bicluster B is defined as follows.

$$ f(B)=\left(1-\rho (B)\right)+\frac{1}{N_G}+\frac{1}{N_C} $$
(1)

where

$$ \rho (B)=\frac{1}{\left(\genfrac{}{}{0pt}{}{N_G}{2}\right)}{\sum}_{i=1}^{N_G-1}{\sum}_{j=i+1}^{N_G}\left|{\rho}_{ij}\right| $$
(2)

where |ρij| represents the Pearson correlation coefficient of gene Gi and gene Gj. The number of genes and samples of B are NG and NC [30]. The best bicluster was searched by the BSFS-BSA algorithm, which has the lowest value of the fitness function.

2.3 Description of BAFS-BSA-BIC

BAFS-BSA-BIC is a hybrid BIClustering algorithm that combines a Binary Artificial Fish Swarm with a Binary Bimulated Annealing algorithm. This discussion was divided into two parts. First, the BAFS algorithm was described briefly. Second, the Binary SA (BSA) was introduced in detail, which was different from the traditional SA algorithm. BSA was combined with the BAFS algorithm to form the hybrid BAFS-BSA-BIC algorithm and enriched the biclustering technology system.

2.3.1 The BAFS algorithm

The implementation of BAFS algorithm was assembled by four main parts: following behavior, aggregating behavior, preying behavior and random behavior. In addition, there were some important details in this algorithm that we needed to consider, such as determination of the visual range of an artificial fish, calculation of crowding factor of fish swarm and setting criteria for performing four different behaviors.

The differences between BAFS and traditional artificial fish swarm algorithms are as follows:

  1. (i)

    Each fish in a shoal is represented as a Boolean vector rather than traditional values;

  2. (ii)

    The visual range is constantly changing, which is not a predetermined fixed value, according to the position of each fish, achieved by the hamming distance [8].

  3. (iii)

    The crowding factor that is not a fixed value varies with the visual range.

These improvements make the artificial fish swarm algorithm more efficient and more widely used. In this paper, the further improved BAFS algorithm, namely BAFS-BSA, is applied in the biclustering experiment of genes expression data.

2.3.2 The BSA and BAFS-BSA-BIC algorithm

SA is a local search method inspired by the physical annealing process studied in statistical mechanics [1]. SA approach follows search directions and executes thousands of iteration generation procedures that perfects the objection function of algorithm, which reduces the fitness function value in BAFSA-BS-BIC algorithm. While exploring solution space, the possibility of accepting poorer neighbor solutions in a controlled manner is offered by SA algorithm to escape from local minima [5]. The SA approach we used had some differences from the traditional SA approach. To enable SA algorithm to process the gene microarray or gene expression data matrix, the binary form was introduced. More precisely, in each iteration, a current solution vector AF is a Boolean vector of length p, where p is the sum of the number of genes and conditions.

$$ AF=\left({g}_1,{g}_2,..{g}_m,\dots, {s}_1,{s}_{2,}\dots, {s}_N\right),{g}_i,{s}_j\in \left(0,1\right),i\in \left(1,2,\dots, m\right),j\in \left(1,2,\dots, n\right) $$
(3)

The 1-opt neighborhood N(AF) could be obtained by changing an element in the solution AF, i.e., the hamming distance dH between the current solution AF and each neighbor AF (AF ∈ N(AF)) is equal to 1 [12]. Therefore, the size of N(AF) was p. In this iteration, for a current AF characterized by the fitness function value f(AF), a neighbor AF’ was randomly selected from the neighborhood N(AF) and f(AF’) was calculated. Then the objection difference  = f(AF) − f(AF) was evaluated. When  ≤ 0, AF is replaced by AF’. Otherwise, AF’ would be accepted with a probability P = e(−)/T [5].

2.4 Mean squared residue (MSR) and average correlation value (ACV)

MSR [25] is one of the most popular measures of homogeneity. Given a bicluster B with I rows and J columns, let aij represent the element of the bicluster. MSR of this bicluster can be defined as

$$ MSR=\frac{1}{\left|I\right|\left|J\right|}\sum \limits_{i\in I,j\in J}\left({a}_{ij}-{a}_{iJ}-{a}_{Ij}+{a}_{IJ}\right) $$

where

$$ {\displaystyle \begin{array}{c}{a}_{iJ}=\frac{1}{\left|J\right|}\sum \limits_j{a}_{ij}\\ {}{a}_{Ij}=\frac{1}{\left|I\right|}\sum \limits_i{a}_{ij}\\ {}{a}_{IJ}=\frac{1}{\left|I\right|\left|J\right|}\sum \limits_{i,j}{a}_{ij}\end{array}} $$
(4)

Here aiJ denotes the mean of the i-th row and aIj denotes the mean of the j-th column. Besides, aIJ refers to overall means of the bicluster.

ACV is another useful measure of the coherence of the genes and conditions in a bicluster. ACV of a bicluster B(I,J) can be expressed as

$$ \mathrm{ACV}=\max \left\{\frac{\sum_{i_1=1}^{\mid I\mid }{\sum}_{i_1=2}^{\mid J\mid}\mid r\_{row}_{i_1{i}_2}\mid -\mid I\mid }{{\left|I\right|}^2-\mid I\mid },\frac{\sum_{j_1=1}^{\mid J\mid }{\sum}_{j_1=2}^{\mid J\mid}\mid r\_{col}_{j_1{j}_2}\mid -\mid J\mid }{{\left|J\right|}^2-\mid J\mid}\right\} $$
(5)

Where \( r\_{row}_{i_1{i}_2} \)and \( r\_{col}_{j_1{j}_2} \) stands for the Pearson coefficient of rows i1, i2 and columns j1, j2 , respectively.

3 Experiments

The purpose of experiments was to verify the performance of the proposed algorithm. The BAFS-BSA algorithm was applied to multiple datasets and desirable results were obtained. Then these results were compared with the classical biclustering methods such as CC, QUBIC, FLOC and the original BAFS biclustering algorithm.

3.1 Datasets

The hybrid BAFS-BSA algorithm for biclustering has been applied to four datasets: complete_DTT, elutriation, cdc_15 and Mice Protein Expression datasets. The first three datasets respectively include 962 genes and 7 samples, 935 genes and 14 samples, 1087 genes and 24 genes, which have been downloaded from the supplementary information provided in [11]. The last dataset is composed of 1080 genes and 51 samples. Because a small part of the value of the downloaded original data set is missing, we adopted a method to improve this data set, which has been realized by J. A. Nepomuceno et al. [22].

3.2 Results

In complete_DTT, elutriation, cdc_15 and Mice Protein Expression datasets, 10 optimum biclusters were produced by utilizing the BAFS-BSA-BIC algorithm respectively. Most of these biclusters have better performance and several of them are selected randomly to show in Figs. 3, 4, 5 and 6. Each bicluster is represented by a graph, where the x-coordinate represents the sample and the y-coordinate represents the gene expression value. From Figs. 3, 4, 5 and 6, we can intuitively find that each gene in the bicluster searched by the BAFS-BSA-BIC algorithm has similar gene expression in corresponding samples. In addition, this algorithm worked better in processing complete_DTT, elutriation and Mice Protein Expression datasets, because most of the gene expression profile trends in each bicluster searched from these datasets are the same. According to the bicluster graphs obtained from the cdc_15 dataset, we can see these gene expression profile fluctuate a little in the trend, but remain stable overall.

Fig. 3
figure 3

Several biclusters results with the BASF-BSA method on cdc_15 dataset

Fig. 4
figure 4

Several biclusters results with the BASF-BSA method on complete_DTT dataset

Fig. 5
figure 5

Several biclusters results with the BASF-BSA method on elutriation dataset

Fig. 6
figure 6

Several biclusters results with the BASF-BSA method on Mice Protein Expression dataset

To confirm the degree of correlation between genes in these biclusters, The MSR and ACV and Volume (V) were introduced as three different intra-biclusters evaluation criteria [27]. The value of MSR represents the change related to the interaction between genes (rows) and sampes (columes) in each biclutser. From this it can be concluded that the lower MSR value, the stronger coherence in the bicluster, while the judgment criterion of ACV value is the opposite. The details are as follows: a bicluster with smaller MSR value, larger ACV value and volume value suggests that all its genes are more coherent and similar genes are more numerous [27]. Therefore, these standards can not only define the quality of biclusters, but also can be used to compare the performance among these biclustering algorithms.

Table 1 summarizes the quality of biclusters acquired by BAFS-BSA-BIC algorithm for the proposed four datasets. In Table 1, datasets are composed among complete_DTT (D1), elutriation (D2), cdc_15 (D3) and Mice Protein Expression (D4). The number of genes (Genes) and samples (Samples), the volume (Volume), the average correlation (ρ(B)), the MSR and ACV are presented to identify the quality of each bicluster. From Table 1, the biclusters with strong coherence from these four datasets can be obtained with our method. Each bicluster in this table has a large volume, a large ACV value and a small MSR value. At the same time, the correlation value ρ(B) close to 1 can also prove the strong coherence of the genes in these biclusters.

Table 1 The quality of biclusters in four datasets

3.3 Discussion

The performance of BAFS-BSA for bicluster has been compared with several classical biclustering algorithms such as CC [8], QUBIC [15], FLOC [31] and the original BAFS algorithm for biclustering. The parameters, which are average volume (Ave-Volume), average correlation (Ave- ρ(B)), average MAR (Ave-MSR) and average ACV (Ave-ACV), were proposed to record the performance of these optimal biclusters. Tables 2, 3, 4 and 5 records the parameter values of biclusters in complete DTT dataset, elutriation dataset, cdc_15 dataset and Mice Protein Expression dataset respectively.

Table 2 Comparison of performance of proposed method with other algorithms in complete_DTT datasets
Table 3 Comparison of performance of proposed method with other algorithms in elutriation datasets
Table 4 Comparison of performance of proposed method with other algorithms in cdc_15 datasets
Table 5 Comparison of performance of proposed method with other algorithms in Mice Protein Expression datasets

A bicluster with some characteristics including larger volume, correlation coefficient value, ACV value and the smaller MSR value can be defined as a better bicluster [27]. Figure 7 is the comparison of Ave-Volume, Ave-ρ(B), Ave-MSR and Ave-ACV of different biclustering algorithm based on the same datasets. Compared with other algorithms, BAFS-BSA-BIC has no obvious advantage of acquiring a bicluster with a larger volume. A bigger volume of each bicluster was obtained by FLOC and BAFS algorithm. However, according to the calculation results of ρ(B), ACV and MSR, the biclusters obtained by the proposed method has the best quality. The gene expression level is very similar in each bicluster. Moreover, BAFS-BSA-BIC can obtain satisfactory results when processing different datasets, indicating its wide applicability.

Fig. 7
figure 7

Comparison of Ave-Volume, Ave-ρ(B), Ave-MSR and Ave-ACV of five biclustering algorithms in four datasets

In particular, both CC and QUBIC perform the worst in the volume of these datasets, extracting biclusters with smaller volume. The correlation coefficient ρ(B) of biclusters acquired from FLOC is much smaller, which means that the degree of linear correlation between each two genes of the biclusters obtained is lower. However, BAFS-BSA-BIC has prominent advantages in ρ(B), ACV and MSR aspects despite no obvious deficiency in the volume, which indicates that our algorithm has better stability than other algorithms.

4 Conclusions

The Hybrid BASF-BSA algorithm for biclustering that integrates biological information and revels new molecular functions of the organism has been implemented in this paper. The BAFS algorithm has the characteristic of fast convergence speed, but it is prone to fall into local optimal problem. In order to avoid this risk, the BSA algorithm has been combined with BAFS algorithm. Experiments on several real datasets show superior performance of our method.