1 Introduction

A biological network is a linked collection of biological entities like genes, proteins, and metabolistes [34]. Analyzing information and extracting biologically relevant knowledge, from these entities, is one of the key issues of bioinformatics. For instance, DNA microarray technologies help to measure the expression levels of thousands of genes under experimental conditions [14]. To do so, gene expression data are arranged in a data matrix. In the latter, rows represent genes, columns represent samples (experimental conditions), and each entry of the matrix denotes the expression level of a gene under a certain experimental condition. In this respect, the discovery of transcriptional modules of genes that are co-regulated in a set of experiments is of paramount importance [14].

Interestingly enough, the clustering technique has been beneficial in many challenges in bioinformatics. In fact, it allows researchers to gather information such as cancer occurrences, specific tumor subtypes and cancer survival rates [67]. However, the use of clustering algorithms has two major drawbacks.

  1. 1.

    They consider the whole set of samples. This is despite the fact that genes may not be relevant to every sample. Instead, they can be relevant to only a subset of samples, which is a fundamental aspect for numerous problems in the biomedicine field [66]. Thus, clustering should be performed simultaneously on both genes and conditions.

  2. 2.

    Each gene can only be clustered into one group. Nevertheless, many genes can belong to several clusters depending on their influence in different biological processes [28].

In this respect, biclustering, which is a particular clustering type, palliates these drawbacks. Hence, biclustering aims to identify maximal sub-matrices (aka biclusters) where a subset of genes expresses highly correlated behaviors over a range of conditions [14]. Nevertheless, the biclustering task is a highly combinatorial problem and is known to be an NP-hard one [18].

As it can be witnessed in the dedicated literature, the biclustering usage is widespread in the analysis of gene expression data. It was first introduced by the pioneering work of [18]. Subsequently, a lot of other algorithms have been proposed [14,15,16, 23, 25, 41, 54]. According to [25], the existing biclustering algorithms can be grouped into two main classes: systematic search algorithms and stochastic search algorithms (cf. Sect. 2.2).

Despite the large number of the aforementioned biclustering algorithms, most of them are based on greedy or stochastic approaches. However, they provide sub-higher-quality answers with restrictions on the structure, the coherency and the quality of biclusters [14, 33, 35]. Some attempts to palliate such drawbacks have relied on the pattern-mining approaches [38, 39, 49, 52]. Pattern-mining-based biclustering approaches aim to perform efficient and flexible searches with better solutions in terms of coherency and quality [30]. These advantages will bring these algorithms into the spotlight when it comes to biological data analysis [31, 32, 34, 38, 40, 49].

Among these pattern-mining-based algorithms are those relying on the formal concept analysis (FCA). Biclustering has multiple elements in common with the FCA. In fact, a bicluster can be seen as a formal concept that reflects the inherent relationship between objects and attributes [38]. Indeed, as for Prelic et al. [60], an inclusion-maximal bicluster is the maximal set of objects related to a maximal set of attributes. This definition perfectly matches with that of a formal concept in the FCA theory [68]. This close connection motivates the wide use of the FCA’s large collection of mathematical results for the biclustering task. Indeed, the FCA is a key method used for the analysis of object-attribute relationships and for knowledge presentations [42, 45].

For these reasons, one might argue that the FCA can be considered as a type of biclustering methods for binary data. Various approaches have been interested in extracting biclusters using the FCA. However, these algorithms have the tendency to focus on one type of biclusters, extract overlapping ones or refrain from biological validation.

In this paper, we introduce a new FCA-based algorithm for biclustering DNA microarray data, called BiFCA+. This latter allows observing the profile of each gene through all pairs of conditions by discretizing the original microarray data. Interestingly enough, BiFCA+ relies on the Bond correlation measure [55] to avoid the high overlap between extracted biclusters.

The main contributions of this paper are as follows:

  • we propose a new discretization method of the DNA microarray data.

  • We design an efficient FCA-based algorithm for extracting correlated genes.

  • We filter out the obtained biclusters using the Bond correlation measure in order to remove high overlapping biclusters.

  • We show the effectiveness of our method through extensive carried-out experiments on three real-life DNA microarray data. Indeed, we extract statistically and biologically significant biclusters, highlighting competitive results versus other popular biclustering algorithms.

The remainder of this paper is organized as follows: In the next section, we provide a background on the target task and we review some related work. Section 3 is dedicated to the description of the algorithm. In Sect. 4, we provide the results of the application of our algorithm on real-life microarray datasets. Finally, Sect. 5 concludes this paper and sketches issues of future work.

2 Background

In this section, we provide the basics on the biclustering field and we review the dedicated related work.

2.1 Biclustering: basic notions

In the following, we recall some basic definitions borrowed from the biclustering field.

Definition 1

(Bicluster) A bicluster is a subset of objects (genes) associated with a subset of attributes (conditions) in which these rows are co-expressed.

The bicluster associated with the matrix \(M=(I,J)\) is a couple (AB), such that \(A\subseteq I\) and \(B\subseteq J\), where (AB) is maximal; i.e. there does not exist a bicluster (CD) with \(A\subseteq C\) or \(B \subseteq D\).

This leads to the definition of biclustering.

Definition 2

(Biclustering) The biclustering problem focuses on the identification of the best biclusters of a given dataset. The best bicluster must fulfill a number of specific homogeneity and significance criteria (guaranteed through the use of a function to guide the search) [56].

In the following, we present the different types of biclusters.

Definition 3

(Types of biclusters) According to [14], a bicluster can be one of the following types (cf. Fig. 1) :

  • Bicluster with constant values It is a bicluster where all values are equal.

    $$\begin{aligned} m_{ij}=c \end{aligned}$$
    (1)
  • Bicluster with constant values on rows and columns There are two types of biclusters with constant values:

    1. 1.

      Constant values on rows:

      $$\begin{aligned} m_{ij}= & {} c+a_{i} \end{aligned}$$
      (2)
      $$\begin{aligned} m_{ij}= & {} c*a_{i} \end{aligned}$$
      (3)
    2. 2.

      Constant values on columns:

      $$\begin{aligned} m_{ij}= & {} c+b_{j} \end{aligned}$$
      (4)
      $$\begin{aligned} m_{ij}= & {} c*b_{j} \end{aligned}$$
      (5)
  • Bicluster with coherent values There are two types of biclusters with coherent values. Those with an additive model and those with a multiplicative model defined respectively by:

    1. 1.

      Additive model:

      $$\begin{aligned} m_{ij}=c+a_{i}+b_{j} \end{aligned}$$
      (6)
    2. 2.

      Multiplicative model:

      $$\begin{aligned} m_{ij}=c*a_{i}*b_{j} \end{aligned}$$
      (7)
  • Bicluster with coherent evolutions It is a bicluster where the coherent evolutions are observed across the rows and/or columns of the data matrix.

Fig. 1
figure 1

Examples of different types of biclusters. a Constant bicluster, b constant rows, c constant columns, d coherent values (additive model), e coherent values (multiplicative model), f overall coherent evolution, g coherent evolution on rows, h coherent evolution on columns [14]

Fig. 2
figure 2

Structured view on existing biclustering algorithms

In the following, we scrutinize the pioneering work that has addressed the extraction of biclusters from gene expression data.

2.2 Related work

The costly computation complexity of extracting maximal sub-matrices of genes and conditions such that the genes express highly correlated behaviors over a range of conditions has been a main impediment to the wide-scale use of gene expression analysis community. A recent review of various biclustering algorithms for gene expression data is provided in [25], where existing biclustering algorithms were grouped into two main streams to which we add the third stream. At a glance, as depicted in Fig. 2, the dedicated literature witnessed three main streams for addressing the biclustering task. These streams are detailed in the following.

(i):

The systematic search-based stream includes the following approaches:

1.:

The divide-and-conquer (DAC)-based approach Generally, this approach repeatedly splits the problem into smaller ones with similar structures to the original problem, until these sub-problems become smaller enough to be straightforwardly solved. The solutions to the sub-problems are then combined to create a solution to the original problem respectively [25]. The algorithms adopting this approach were given in [60, 63].

2.:

The Greedy Iterative Search (GIS)-based approach In this approach, a solution is constructed in a step-by-step way using a given quality criterion. The decisions made at each step are based on information at hand without worrying about the impact of these decisions in the future. Moreover, once a decision is made, it will become irreversible and will never be reconsidered [25]. The algorithms adopting this approach were given in [9, 17, 69].

3.:

The Biclusters Enumeration (BE)-based approach As indicated by its name, an enumeration algorithm enumerates all the solutions for the original problem. The enumeration process is generally represented by a search tree [25]. The algorithms adopting this approach were given in [4, 7, 37, 61].

(ii):

The stochastic search-based stream includes the following approaches:

1.:

The Neighborhood Search (NS)-based approach It starts with an initial solution and then moves iteratively to a neighboring solution thanks to the neighborhood exploitation strategy. The algorithms adopting this approach were given in [5, 20].

2.:

The Evolutionary Computation (EC)-based approach This approach is based on the natural evolutionary process such as population, reproduction, mutation, recombination, and selection. The algorithms adopting this approach were given in [21, 22].

3.:

The Hybrid (H)-based approach The latter tries to combine the neighborhood search and the evolutionary approaches. The algorithms adopting this approach were given in [26, 50].

(iii):

The pattern-mining-based stream includes:

1.:

Sequential Pattern-Mining (SPM)-based approaches SPM is used in order to extract order-preserving biclusters. A bicluster is order-preserving if there is a permutation of its columns under which the sequence of values in every row increases. In this context, SPM is applied; and the biclusters are extracted from the frequent sequences as well as their supporting transactions. The algorithms adopting this approach were given in [29, 32].

2.:

Association Rules Mining (ARM)-based approaches ARM can be used to compose biclusters. To perform this task, they divide the problem into two sub-problems:

(1) Finding all association rules that represent biclusters’ samples/genes. In fact, they consider items of both the premise and conclusion of an association rule.

(2) Extracting the supporting transactions of these items. The authors in [51] provided a review of various biological applications of association rule mining.

3.:

Formal Concept Analysis (FCA)-based approaches The FCA can be viewed as a kind of biclustering for binary data. It provides pattern (bicluster) extraction from a binary relation, namely, a formal concept. In its gene expression data applications, the concept’s extent represents the maximal sets of genes related to a maximal set of samples (concept’s intent). The algorithms adopting this approach were given in [38, 39].

In this work, we are particularly interested in the FCA-based biclustering algorithms. In this respect, several approaches have been interested in extracting biclusters using the FCA. In [52], the authors proposed a new approach, called FIST, for extracting the bases of extended association rules and conceptual biclusters, using frequent closed itemsets [57]. Nevertheless, they failed to detail their discretization method, and treated the matrix as though it had been already binary. This was done despite the fact that microarray data were not initially coded in a binary format. Furthermore, their approach did not entail any biological validation of the extracted biclusters.

Whereas, Pensa et al. [59] relied on a single threshold, where expression values greater than this threshold were represented by 1, otherwise by 0. Most discretization techniques commonly applied to gene expression data use absolute expression values. However, the main drawback of this technique is how to find the best method to set the threshold value.

Kaytoue et al. [39] utilized the scaling of numerical data and considered that formal concepts were the groups of genes whose expression values were in the same intervals for a subset of conditions.

Added to that, Kaytoue el al. [38] referred to the algorithm presented in [39], using the triadic concept analysis [43, 64] in order to extract biclusters with similar values. Both of the latter approaches only paid attention to the extraction of one type of biclusters, i.e. biclusters with similar values. In addition, they did not offer any biological validation for the obtained biclusters.

The above mentioned biclustering algorithms have the tendency to focus on one type of biclusters, extract overlapping ones or refrain from biological validation. Thus, in the remainder, we introduce a new FCA-based approach for the extraction of biclusters from gene expression data.

3 BiFCA+: the proposed biclustering algorithm

In this section, we first recall some basic definitions borrowed from the FCA field. Second, we provide a detailed description of the proposed algorithm, followed by an illustrative example.

3.1 FCA basic settings

We start this subsection by presenting the notion of a formal context.

Definition 4

(Formal context) A formal context is a triplet \(\mathcal {K} = (\mathcal {O}, \mathcal {I}, \mathcal {R})\), where \(\mathcal {O}\) represents a finite set of objects, \(\mathcal {I}\) is a finite set of items (or attributes) and \(\mathcal {R}\) is a binary (incidence) relation (i.e., \(\mathcal {R}\subseteq \mathcal {O}\times \mathcal {I}\)). Each couple \((o, i) \in \mathcal {R}\) expresses that the object \(o\in \mathcal {O}\) contains the item \(i \in \mathcal {I}\). Table 1 sketches an example of a formal context.

It is worth mentioning that the link between the power-sets \(\mathcal {P}(\mathcal {I})\) and \(\mathcal {P}(\mathcal {O})\), associated respectively to the set of items \(\mathcal {I}\) and the set of objects \(\mathcal {O}\), is defined as follows:

Definition 5

(Galois connection) Let \(\mathcal {K} = (\mathcal {O}, \mathcal {I}, \mathcal {R})\) be a formal context. The application \(\psi\) is defined from the power-set of objects [i.e., \({\mathcal {P}}(\mathcal {O})\)] to the power-set of items [i.e., \({\mathcal {P}}(\mathcal {I})\)]. It associates to a set of objects O the set of items \(i\in \mathcal {I}\) that are common to all objects \(o\in O\):

$$\begin{aligned}&\psi : \mathcal {P}(\mathcal {O}){} \rightarrow \mathcal {P}(\mathcal {I}){}\\&O\mapsto \psi (O)=\{i\in \mathcal {I}| \forall o \in O, (o, i) \in \mathcal {R}\} \end{aligned}$$

In a dual way, the application \(\phi\) is defined from the power-set of items [i.e., \({\mathcal {P}}(\mathcal {I})\)] to the power-set of objects [i.e., \({\mathcal {P}}(\mathcal {O})\)]. It associates to a set of items I the set of objects \(o\in \mathcal {O}\) that contains all items \(i\in I\):

$$\begin{aligned}&\phi :\mathcal {P}(\mathcal {I}){} \rightarrow \mathcal {P}(\mathcal {O}){}\\&I\mapsto \phi (I)=\{o\in \mathcal {O}| \forall i \in I, (o, i) \in \mathcal {R}\} \end{aligned}$$

The coupled applications (\(\psi , \phi\)) form a Galois connection between the power-set of \(\mathcal {O}\) and that of \(\mathcal {I}\) [8, 27].

The following definition introduces the closure operators associated to the Galois connection.

Definition 6

(Galois closure operators) Let us consider the power-sets \(\mathcal {P}(\mathcal {I})\) and \(\mathcal {P}(\mathcal {O})\), with the inclusion relation \(\subseteq\), i.e. the partially ordered sets \((\mathcal {P}(\mathcal {I}),\subseteq\)) and \((\mathcal {P}(\mathcal {O}),\subseteq\)). The operators \(\gamma = \phi o \psi\) from \((\mathcal {P}(\mathcal {I}),\subseteq\)) to \((\mathcal {P}(\mathcal {I}),\subseteq\)) and \(\omega = \psi o \phi\) from \((\mathcal {P}(\mathcal {O}),\subseteq )\) to \((\mathcal {P}(\mathcal {O}),\subseteq\)) are closure operators of the Galois connection [8, 27]. They define closure systems on \((\mathcal {P}(\mathcal {I}),\subseteq\)) and \((\mathcal {P}(\mathcal {O}),\subseteq\)), respectively. The operator \(\gamma\) generates closed subsets of items, while the operator \(\omega\) generates closed subsets of objects.

This leads to the definition of a formal concept.

Definition 7

(Formal concept) A pair \(\langle A, B \rangle \in \mathcal {O}\times \mathcal {I}\) of mutually corresponding subsets, i.e. \(A = \psi (B)\) and \(B = \phi (A)\), is called a formal concept, where A is called extent and B is called intent.

In its gene expression data application, we consider a formal concept as a bicluster, where the concept’s extent represents genes while the concept’s intent represents the experimental conditions.

Example 1

Let us consider the formal context given by Table 1. We have: \(\mathcal {O}= \{1, 2, 3, 4, 5\}\) and \(\mathcal {I}= \{I_{1}, I_{2}, I_{3}, I_{4}, I_{5}\}\). From this formal context, we can extract \(\langle 135, I_{1}I_{3} \rangle\)Footnote 1 as a formal concept.

Table 1 Example of a formal context

In the following, we define the support of an itemset.

Definition 8

(Support of an itemset) Let \(\mathcal {K} = (\mathcal {O}, \mathcal {I}, \mathcal {R})\) be a formal context. We distinguish two kinds of support that can be associated to a non-empty itemset I:

- Conjunctive support: :

Supp(\(\wedge I\)) = \(|\{o\in \mathcal {O}| (\forall i \in I,(o,i)\in \mathcal {R})\}|.\) Supp(\(\wedge I\)), seen as a conjunction of items (i.e. \(i_1\wedge i_2\wedge \cdots \wedge i_n\)), is the number of objects containing all items of I.

- Disjunctive support: :

Supp(\(\vee I\)) = \(|\{o\in \mathcal {O}| (\exists i \in I,(o,i)\in \mathcal {R})\}|\). Supp(\(\vee I\)), seen as a disjunction of items (i.e. \(i_1\, \vee i_2\, \vee \cdots \vee i_n\)), is the number of transactions containing at least one item of I.

This leads us to present the Bond correlation measure.

Definition 9

(Bond correlation measure) The Bond correlation measure [55] computes the ratio between the conjunctive support and the disjunctive one. Thus, the Bond correlation measure of two concept’s intents \(B_{1}\) and \(B_{2}\) is defined as follows:

$$\begin{aligned} \textit{Bond}(B_{1},B_{2})=\frac{ Supp(\wedge B_{1},B_{2})}{Supp(\vee B_{1},B_{2})} \end{aligned}$$
(8)

Consequently, we can redefine the Bond correlation measure as follows:

$$\begin{aligned} \textit{Bond}(B_{1},B_{2})=\frac{ |B_{1} \bigcap B_{2}|}{|B_{1} \bigcup B_{2}|} \end{aligned}$$
(9)

In the remainder of this section, we thoroughly describe our proposed method.

Fig. 3
figure 3

BiFCA+’s at a glance

3.2 BiFCA+ algorithm

The BiFCA+ biclustering algorithm is an FCA-based algorithm that identifies biclusters from gene expression data. As illustrated by Fig. 3, it operates in three main phases. The first one is the discretization phase. Starting from a numerical dataset, the basic idea is to build a formal context where genes stand for objects and conditions for the attributes. Subsequently, it starts the mining phase. The latter allows extracting formal concepts that represent the correlated biclusters. Finally, we have to perform the filtering phase. This latter is performed in order to remove the biclusters that have a high overlap. Given the biclusters obtained from the previous phase, we compute the similarity measure between each pair of biclusters. The latter is defined as the ratio between the conjunctive support of two biclusters and their disjunctive support. We only retain the biclusters having the Bond correlation measure that does not exceed a given threshold minBond. The pseudo-code of BiFCA+ is shown in Algorithm 1. BiFCA+ takes as an input a data matrix \(M_{1}\) and a minimum correlation threshold minBond. BiFCA+ enables the determination, from the data matrix \(M_{1}\), of the set of the obtained biclusters \({\beta }\).

Table 2 Example of gene expression data matrix (\(M_{1}\))
Table 3 3-state data matrix (\(M_{2}\))
Table 4 Binary data matrix (\(M_{3}\))

The phases of BiFCA+ are thoroughly described in the following subsections.

figure a

3.2.1 Pre-processing of gene expression data matrix

Our method applies a pre-processing phase to transform the original data matrix \(M_{1}\) into a binary one. This phase is split into two steps:

  1. 1.

    First, we discretize the original data into a 3-state data matrix \(M_{2}\). This step aims to unveil the trajectory patterns of genes. According to [48, 58], in the DNA microarray data analysis, we add genes into a bicluster whenever their trajectory patterns of expression levels are similar across a set of samples.

    Interestingly enough, our proposed discretization phase keeps track of the profile shapeFootnote 2 over conditions and preserves the similarity information of the trajectory patterns of the expression levels.

    Before delving through the mining process, we must at first discretize the initial data matrix. The discretization process outputs the 3-state data matrix. It consists in combining in pairs, for each gene, all the adjacent conditions. Indeed, the 3-state data matrix gives an idea about the profile. Furthermore, it gives a global view of the profile of all conditions.

    In our case, each column of the 3-state data matrix carries the meaning of the variation of genes between a pair of \(M_{1}\) conditions. It offers useful information for the identification of biclusters, i.e. up (1), down (− 1) and no change (0).

    Formally, the matrix \(M_{2}\) (3-state data matrix) is defined as follows :

    $$\begin{aligned} M_{2}= {\left\{ \begin{array}{ll} 1 &{}\quad \text {if } x_{1}<x_{2} \\ -1 &{}\quad \text {if } x_{1}>x_{2} \\ 0 &{}\quad \text {if } x_{1}=x_{2} \end{array}\right. } \end{aligned}$$
    (10)

    with \(x_{1}= M_{1}[j,l]\) ; \(x_{2}= M_{1} [j,l+1];\) and \(j\in [1 \ldots n];l\in [1 \ldots m-1]\)

  2. 2.

    For the second step of the pre-processing phase, we build the binary data matrix in order to extract formal concepts. In this respect, we compute the average number of repetitions for each column in the matrix \(M_{2}\) (3-state data matrix). In other words, we have:

    1. (a)

      \(|\)maxrepeat\(|\): This variable stands for the maximum number of occurrences by column.

    2. (b)

      \(|\)minrepeat\(|\): This variable stands for minimum number of occurrences by column.

    3. (c)

      \(|\)mediumrepeat\(|\): It stands for the medium number of occurrences by column.

    It is better to choose the mean value since the maximum will produce a huge number of high overlapping biclusters, whereas the minimum value generates biologically none-valid biclusters.

    Formally, we define the binary matrix as follows:

    $$\begin{aligned} M_{3}= {\left\{ \begin{array}{ll} 1 &{} \text {if } M_{2}[j,l]= average~value \\ 0 &{} \text { otherwise } \end{array}\right. } \end{aligned}$$
    (11)

    where \(j \in [1 \ldots n]\) and \(l\in [1 \ldots m-1]\)

    After the discretization step, the dimensions of our data matrix (\(M_{2}\)) become equal to \(n*(m-1)\).

Example 2

Let us consider the data matrix \(M_{1}\) given by Table 2. For the first row, we have \(M_{1_{1j}}= (10, 20, 5, 15, 0, 18)\) with \(j\in [1 \ldots 6]\). In the first step of the pre-processing phase we obtain the discretized first row; i.e. \(M_{2_{1j}}\)= (1, − 1, 1, − 1, 1) with \(j \in [1 \ldots 5]\). In the second step, the first column becomes \(M_{3_{i1}}= (0, 0, 0, 0, 1, 1)\) with \(i \in [1 \ldots 6]\).

3.2.2 Extracting formal concepts (biclusters)

The FCA can be viewed as a kind of biclustering for binary data. It provides patterns (biclusters) extraction from a binary context.

In this respect, after preparing the binary data matrix, we move to extract formal concepts (biclusters) from the binary matrix \(M_{3}\).

The extraction of the formal concepts is carried out through the invocation of a slightly modified version of the efficient LCM algorithm [65]. The choice of this algorithm is argued by the fact that it has a linear complexity in the number of closed attributes and has been shown to be one of the best algorithms dedicated to such a task.

3.2.3 Computation of similarity measure (Bond)

The BiFCA+ algorithm is already able to identify overlapping biclusters. In fact, for the filtering process, we only consider biclusters having a low overlap. Indeed, in the case of biclusters that have a high overlap, they have the same biological signification. The Bond correlation measure achieves its minimum of 0 when the biclusters do not overlap at all and its maximum value 1 whenever they are identical.

In order to compute the similarity between the two biclusters (i.e. formal concepts) \(\hbox {FC}_{1}\) and \(\hbox {FC}_{2}\), with \(\hbox {FC}_{1}= \langle A_{1}, B_{1} \rangle\) and \(\hbox {FC}_{2}= \langle A_{2}, B_{2} \rangle\), where A\(_{i}\), \(i=1,2\), represents the extent, and B\(_{i}\) represents the intent, we use the Bond correlation measure. The latter assesses the overlap between two concept’s intents (cf. Definition 9).

Finally, we only retain the obtained biclusters for which the Bond correlation measure does not exceed a given threshold. The set of such biclusters represents a solution to our problem.

In the following, we provide an illustrative example of the BiFCA+ approach.

3.3 Illustrative example

Let us consider the data matrix given by Table 2. Each column represents all the gene expression levels from a single experiment, and each row represents the expression of a gene across all experiments.

3.3.1 Pre-processing phase of data matrix

The pre-processing phase goes as follows:

  1. 1.

    First, we transform the numerical data into the 3-state data matrix. This is done using Eq. 10. Table 3 provides the obtained results.

  2. 2.

    Second, we create the binary matrix, using the 3-state data matrix. Let us consider the 3-state data matrix given by Table 3. For the sake of building the binary data matrix, we compute the average number of repetitions for each column in the matrix \(M_{2}\); e.g., for the column \(\grave{C}\)1 we have:

    1. (a)

      \(|\)maxrepeat\(|\): It is equal to 3 and corresponds to the value 1.

    2. (b)

      \(|\)minrepeat\(|\): It is equal to 1 and corresponds to the value \(-1\).

    3. (c)

      \(|\)mediumrepeat\(|\): It is equal to 2 and corresponds to the value 0. Therefore, the average value is 0.

    Subsequently, and using Eq. 11, we obtain the binary matrix sketched by Table 4.

Fig. 4
figure 4

BiFCA+ biclusters number w.r.t. minBond variations

3.3.2 Formal concept extraction phase

After preparing the binary data matrix, we move to extract formal concepts, i.e. the candidate biclusters, from the matrix \(M_{3}\) (using the LCM algorithm [65]).

By using the binary data matrix given in Table 4, we obtain as a result the formal concepts shown in Table 5.

Table 5 Formal concepts extracted from binary context

3.3.3 Filtering phase

In this phase, we only retain biclusters having a low overlap. This overlap is assessed through the Bond correlation measure. For example, with respect to the formal concepts given in Table 5, if we consider \(\hbox {FC}_{3}\) and \(\hbox {FC}_{4}\), we compute the Bond correlation measure:

$$\begin{aligned}&\textit{Bond}\ (B_{3},B_{4}) = \frac{ | \{\grave{C}1\grave{C}2\} \bigcap \{\grave{C}1\grave{C}3\} |}{|\{\grave{C}1\grave{C}2\} \bigcup \{\grave{C}1\grave{C}3\}|}\\&\textit{Bond}\ (B_{3}, B_{4}) = \frac{1}{3} = 0.33\\ \end{aligned}$$

The Bond correlation measure threshold is equal to 0.5. Thus, we consider the formal concepts \(FC_{3}\) and \(FC_{4}\) as non overlapping biclusters. Nevertheless, by lowering the threshold value to 0.3, we only consider one bicluster, that is the one having the highest number of conditions.

4 Experimental results

In this section, we show the results of applying the BiFCA+ approach on three well-known real-life datasets. The evaluation of biclustering algorithms and the comparison are based on two criteria: statistical and biological. We compare the results obtained by our algorithm versus the state-of-the-art biclustering algorithms as well as the Trimax algorithmFootnote 3 [38], which also relies on the FCA. The experiments are carried out on the different datasets. According to the obtained experimental results, interesting reductions of the number of biclusters are obtained as far as the value of minBond is lowered. Representative results are plot by Fig. 4, where the minBond is set when there is a significant decrease in the number of obtained biclusters. For instance, in the yeast cell-cycle dataset, the minBond is set to 0.25.

4.1 Description of used datasets

In order to assess the performance of our proposed algorithm and analyze its results, we carry out a series of experimentations on the following real-life gene expression datasets:

  • Yeast cell-cycle dataset The yeast cell-cycleFootnote 4 is a very popular dataset in the gene expression analysis community. In fact, it is one of the most studied organisms, and the functions of each gene are well known. We use the Yeast Cell-Cycle dataset described in [62], processed in [18] and publicly available from [19]. It contains 2884 genes and 17 samples. In the experimentations conducted on this dataset, minBond is experimentally set to 0.25.

  • Saccharomyces cerevisiae dataset The Saccharomyces cerevisiae datasetFootnote 5 contains the expression levels of 2993 genes under 173 samples. We experimentally set the minBond to 0.3 for our experiments on this dataset.

  • Human B-cell Lymphoma dataset: The Human B-cell Lymphoma dataset [1] contains the expression levels of 4026 genes under 96 samples.Footnote 6 In the experimentations we conduct on this dataset, minBond is fixed experimentally to 0.1.

Table 6 Human B-cell lymphoma coverage for different algorithms
Table 7 Yeast cell-cycle coverage for different algorithms

4.2 Description of considered tests

In the following, we describe respectively the statistical and biological criteria.

4.2.1 Statistical criterion

To evaluate the statistical relevance of our algorithm, we heavily rely on the following criteria.

  • Coverage [12, 46, 50] It represents the total number of cells in a microarray data matrix covered by the obtained biclusters. In the biclustering domain, validation using coverage is considered interesting since a large coverage of a dataset is very important in several applications that rely on biclusters [25]. In fact, the higher the number of highlighted correlations, the greater the amount of extracted information. Consequently, the higher the coverage, the lower the overlapping in the biclusters.

    In the literature, this test has been applied respectively on the Yeast Cell-Cycle and human B-cell lymphoma datasets.Footnote 7

  • p value We compute the percentage of biclusters having an adjusted p value, i.e. the proportion between the number of biclusters having an adjusted p value and the total number of obtained bicluters. We compute the adjusted p value [60], i.e. based on the exact value of Fisher test [24] to measure the quality of the obtained biclusters. In fact, the biclusters having a p value lower than 5% are considered as over-represented; in other words, the majority of genes of a bicluster have common biological characteristics. The best biclusters have an adjusted p value less than 0.001. This measure is computed thanks to the web tool FuncAssociateFootnote 8[11]. This test is applied respectively on the Yeast Cell-Cycle and Saccharomyces cerevisiae datasets.

4.2.2 Biological criterion

The Gene Ontology (GO) projectFootnote 9 is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The project began as a collaboration between three model organism databases, among them the Saccharomyces Genome Database (SGD). This latter concerns our datasets (Yeast Cell-Cycle and Saccharomyces cerevisiae). The GO project provides controlled vocabulary of defined terms representing gene product properties. This covers three domains: (1) biological process, (2) molecular function and (3) cellular component.

In order to evaluate our biclusters biologically we make use of the GoTermFinder web toolFootnote 10 [13]. It searches for significant shared GO terms, used to describe the genes in a given list to help discovring what the genes may have in common. In fact, the biological criterion enables measuring the quality of the resulting biclusters, by checking whether the genes of a bicluster have common biological characteristics.

This test is applied respectively on the Yeast Cell-Cycle and Saccharomyces cerevisiae datasets.

Fig. 5
figure 5

Proportions of biclusters significantly enriched by GO annotations (Saccharomyces cerevisiae dataset)

Fig. 6
figure 6

Proportions of biclusters significantly enriched by GO annotations (Yeast Cell-Cycle dataset)

4.3 Study of statistical relevance

To evaluate our algorithm on real-life datasets, the following criteria are of use:

4.3.1 Coverage criterion

As in [6, 12, 46, 47], we use the coverage criterion defined as the total number of cells in a microarray data matrix covered by the obtained biclusters.

We compare the results of our algorithm versus those obtained by Trimax [38] and those reported by [3]. In the latter reference, the following algorithms were considered: CC [18], BiMine [4], BiMine+ [7], BicFinder [6], MOPSOB [47], MOEA [50] and SEBI [22].

Table 6 (resp. Table 7) presents the coverage of the obtained biclusters. At a glance, we remark that most of the algorithms have relatively close results. For the Human B-cell Lymphoma (respectively the Yeast Cell-Cycle) dataset, the biclusters extracted by our algorithm cover 100% (respectively 80.12%) of the genes, 100% (respectively 100%) of the conditions and 67.84% (respectively 57.07%) of the cells of the expression data matrix. Trimax is largely outperformed, since it only covers respectively 8.50 % of cells, 46.32 % of genes and 11.46 % of conditions for the Human B-cell Lymphoma. It is also worth mentioning that for Yeast Cell-Cycle, the CC algorithm obtains the best results since it masks groups that are extracted with random values. Thus, it prohibits the genes/ conditions that were previously discovered from being selected during the next search process. This type of mask leads to a high coverage. Furthermore, The Yeast dataset only contains positive integer values. Consequently, one can use the Mean Squared Residue (MSR) [18] to extract large biclusters. By contrast, the Human B-cell Lymphoma dataset contains integer values including negative ones. This means that the application of the MSR on this dataset does not result in the extraction of large biclusters.

This implies that our algorithm can generate biclusters with a high coverage of a data matrix. This outstanding coverage is due to the discretization phase as well as the extraction of biclusters without focusing on a specific type of biclusters.

4.3.2 P value criterion

To assess the quality of the extracted biclusters, we use the web tool FuncAssociate [11] in order to compute the adjusted significance scores for each bicluster (adjusted p valueFootnote 11). In fact, the best biclusters have an adjusted p value less than 0.001%. The results of our algorithm are compared versus those obtained by Trimax [38] as well as those concerning CC [18], ISA [10], OSPM [9] and Bimax [60], reported by [3].

The obtained results of the Saccharomyces cerevisiae and the Yeast Cell Cycle datasets for different adjusted p values (p = 5%; 1%; 0.5%; 0.1%; 0.001%), for each algorithm over the percentage of total biclusters, are respectively depicted in Figs. 5 and 6. For the Saccharomyces cerevisiae dataset (Fig. 5), the BiFCA+ and Trimax results show that 100% of extracted biclusters are statistically significant with the adjusted p value equal to \(0.001 \%\). It is important to note that Bimax achieves its best results whenever \(p<0.1\%\), while CC, ISA and OSPM have a reasonable performance with \(p<0.5\%\).

Whereas, for the Yeast Cell Cycle (Fig. 6), 100% of the extracted biclusters by BiFCA+ are statistically significant when \(p <0.5\%\), while only 80% of extracted biclusters by Trimax are statistically significant for the same p value. By contrast, Trimax achieves 100% of extracted biclusters when \(p <1 \%\). Our results, then, sharply outperform those of Trimax; however, Bimax scored better when \(p <0.001\%\) and \(p <0.1 \%\).

Table 8 Significant GO terms (process, function, component) for two biclusters, extracted from Saccharomyces cerevisiae data using BiFCA+
Table 9 Significant GO terms (process, function, component) for two biclusters, extracted from Yeast Cell-Cycle data using BiFCA+
Fig. 7
figure 7

List of genes which concern the Gene Ontology term “ribonucleoprotein complex” (Cellular component) for the first bicluster (Saccharomyces cerevisiae dataset)

4.4 Biological results

The biological criterion allows measuring the quality of resulting biclusters, by checking whether the genes of a bicluster have common biological characteristics.

To evaluate the quality of the extracted biclusters and identify their biological annotations, we use GOTermFinder, which is designed to search for the significant shared Gene Ontology (GO) terms of a group of genes. The GO is organized according to 3 axes: biological process, molecular function and cellular component.Footnote 12 We indicate in Tables 8 and 9 the biological annotations of two randomly selected biclusters in terms of the above cited axis, where we report the most significant GO terms. For instance, with the first bicluster extracted from the Saccharomyces cerevisiae dataset (Table 8), the list of genes is illustrated in Fig. 7. These genes concern the Gene Ontology term “ribonucleoprotein complex”, in terms of Cellular component with a p value equal to 1.39\(e^{-65}\) (highly significant) and a background of 10.7%.

The results on these real-life datasets demonstrate that our proposed algorithm identifies biclusters with a high biological relevance.

4.5 Run time performs

Table 10 presents the comparison of the run time (in seconds) of our algorithm versus those respectively obtained by Trimax, BicFinder and BiMine. We note that for the Human B-cell Lymphoma dataset and Saccharomyces cerevisiae datasets BiFCA+ is the fastest, while BiMine is the costlier in execution time.

Table 10 Execution time of the algorithms: BiFCA+, Trimax, BicFinder and BiMine

5 Conclusion

In this work, we have introduced the BiFCA+ biclustering algorithm, a new FCA-based biclustering method for gene expression data.

Our approach consists in extracting formal concepts from a dataset after a discretization into a 3-state data matrix. A 3-state data matrix allows observing the profile of each gene through all pairs of conditions in the gene expression matrix. This latter discretization is used to extract formal concepts, a mathematical framework for deriving implicit relationships from a set of objects and their attributes. The resulting formal concepts represent biclusters. These biclusters are filtered with the help of the Bond correlation measure in order to remove the biclusters that have a high overlap.

The performances of the BiFCA+ algorithm have been assessed on three real-life DNA microarray datasets. These experimentations show that BiFCA+ enables extracting high quality biclusters. These biclusters have been evaluated with the GO annotations which checks the biological significance of biclusters. The obtained results confirm the BiFCA+’s ability to extract significant biclusters.

Future work will focus on the study of the extensions of the concepts of biclusters and formal concepts to those of triclusters and triconcepts [36]. Furthermore, in our forthcoming work, we will pay attention to the reduction of the obtained set of formal concepts [2] and the knowledge reduction of classical formal decision contexts [44]. Other avenues of future work also concern the extraction of biclusters by introducing biological knowledge during the extraction process. Moreover, we plan to use our method in other application domains such as text mining, target marketing and multimedia data processing. We also hope to enhance our experimentations by both extending our work to other correlation measures [53, 55] through classifying them into classes of measures sharing the same properties and using other statistical comparison criteria.