Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Association mapping has recently become a popular approach to discover the genetic causes of many complex diseases. A genome wide association study (GWAs) is the examination process of different genetic variants (markers) in several individuals in the purpose of detecting eventual association between the variants and certain traits. GWAs particularly focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major diseases. Once such genetic associations are identified, researchers can use the information to promote new strategies to detect, treat and prevent the diseases [2].

Regarding the considered phenotype’s nature, GWA studies usually deal with two classes of data. In the first class, the data comprise the genetic informations of all or a large fraction of the diseased subjects (cases) that appear in the considered study base and then sampling a comparable number of healthy subjects (controls), ideally from the same study base, and potentially matched with the cases by some socio-demographic characteristics such as race, age and gender. Accordingly, the considered trait is a qualitative trait i.e. an individual is even a case or a control. In the second class, the addressed phenotype is a quantitative trait i.e. numerical values that can be ordered from highest to lowest such as height, weight, cholesterol level, etc. The analysis of the later form of data is known as Quantitative Trait Locus (QTL) analysis.

By considering the entire genome, case/control data analysis is essentially based on seeking alleles of variants that are more frequent in people with the disease (cases). The found variant is then said to be associated with the disease.

Quantitative trait locus (QTL) analysis is a statistical method that links two types of information i.e. phenotypic data (quantitative trait) and genotypic data (usually markers), in an attempt to explain the genetic basis of variation in complex traits [5]. QTL analysis allows researchers in different fields such as agriculture, evolution, and medicine to link certain complex phenotypes to specific regions of chromosomes. The goal of this process is to identify the action, interaction, number, and precise location of these regions.

A QTL analysis starts by collecting phenotype and genotype data from a number of unrelated individuals in the same way as in a case-control study. However, in QTL studies there are no cases and no controls, just individuals with a range of phenotype values. After that, association between the traits and the different SNPs are detected using statistical method. The associations are commonly formulated as predictive models.

Generally, genome wide associations studies are performed using supervised methods such as logistic regression and discriminant analysis [1, 9], Bayesian approaches [4], etc. Commonly, the treated data comprises two main informations for each individual: genotype informations and phenotype informations. Using a training data set, the study mainly consists in defining a predictive model and validate it through a test data set.

In this work we propose an unsupervised study of the GWA data with quantitative traits (QTL). By this study we aim to extract a subset of SNPs that have the same alleles for a sub set of individuals sharing similar traits. Actually, the considered data can be seen as a matrix \(A=(X,(Y,Z))=\{a_{ij}\}\) where each row \(i\) presents an individual, each column \(j\) represents either a SNP (\(j \in Y\)) or a trait (\(j \in Z\)) and an element \(a_{ij}\) presents the corresponding SNP’s allele (if \(j \in Y\)) or the corresponding traits value (if \(j \in Z\)) (see Table 1). Thus, a bicluster \(B=(I,(J,K))\) is a sub-matrix of \(A=(X,(Y,Z))\) where \(I \subset X\), \(J \subset Y\) and \(K \subset Z\).

Table 1. Studied GWA data

This paper is organized as follows. Section 2 presented the biclustering problem and a new multiobjective model for a biclustering problem applied to analyzing GWA data sets. An adapted heuristic and metaheuristic are proposed in Sect. 3 to solve the proposed model. In Sect. 4, experimental analysis of the proposed approaches and results are presented. Finally Sect. 5 concludes the paper and presents perspectives.

2 Biclustering Method in Analyzing GWA Data

2.1 Biclustering

Biclustering or co-clustering is a well-known data mining method that has been widely applied in a broad range of domains such as marketing, psychology and bioinformatics. It consists in extracting submatrices \(B=(I,J)\) \((I\subset X, J \subset Y)\) (called biclusters) with maximal size and respecting a certain coherence constraint. Depending on the addressed problem, biclusters of different types can be considered. The different biclusters types and some corresponding applications are described below.

  1. 1.

    Constant bicluster: all the biclusters elements have the same value.

  2. 2.

    Bicluster with constant rows/columns: the elements of each row (column) have the same value.

  3. 3.

    Bicluster with coherent values: the definition of this type of biclusters is a generalization of constant rows/columns biclusters. There exist two different models associated to this class of biclusters:

    1. (a)

      shifting model: where each row (and each column) can be obtained by adding an offset to an other row (column).

    2. (b)

      scaling model: where each row (and each column) can be obtained by multiplying an other row (column) by a factor.

  4. 4.

    Bicluster with coherent evolution: the elements of the bicluster behave similarly (correlated) independently of their numerical values.

When formulating a biclustering problem, a similarity (dissimilarity) measure is required in order to evaluate the extracted results. The measure is, commonly, related to the bicluster’s type. In the case of microarray data analysis, the study aim to extract biclusters with coherent values or evolution (gene that present similar behavior under a sub set of conditions). Different multiobjective modeling for biclustering problem for microarrays data have been proposed [7, 1014] but none for the case of GWA data. Commonly, the proposed multiobjective models comprise: one or more function(s) to optimize the biclusters sizes, a function that optimizes biclusters coherences and a function to optimize the rows variances. In all of these models, a solution represents one bicluster. Regarding the size, most of the models maximize the ratio between the biclusters elements number and the microarray data elements. However, as the number of rows is generally more important than the number of columns, such functions may favor the maximization of rows number with regard to columns number. Thereby, in [7], authors proposed to maximize the number of rows and columns separately by using two objective functions. Concerning biclusters coherence, all the proposed models consider the Mean Squared Residue MSR [3] dissimilarity measure. In [14] the MSR value is allowed to increase as it does not exceed the threshold \(\delta \). Regarding the rows fluctuations, all the existing models maximize the mean row variance. In [12] the coherence and fluctuation objectives are merged in one function by defining a function as the ratio between the MSR of the bicluster and its mean rows variance.

The MSR measure is well adapted to identify biclusters with coherent values. However, this measure can not be applied for GWA data as different biclusters type is required.

2.2 Multiobjective Problem Modeling

In this section, we propose a multiobjective model for a biclustering method applied to GWAs. In this study, we seek to extract biclusters with constant columns, which correspond to a set of individuals that share SNPs presenting the same alleles and the same traits. In order to extract such biclusters, two objectives have to be considered: maximizing the biclusters size (find maximal biclusters) and minimizing the average of columns variances. Actually, these two criteria are clearly independent and conflicting. In fact, a non perfect bicluster’s coherence (columns constance) can be improved by removing a row or a column, i.e. by reducing its size. We can therefore deduce that the problem of biclustering in GWAs can be formulated as a multiobjective optimization problem. Thus, the proposed model is given by:

figure a

Where \(f_1\) (size) has to be maximized and \(f_2\) (average variance) has to be minimized

3 Resolution Approaches

In this section we present two new approaches to solve the proposed model. The first approach is a greedy heuristic \(Sbic\) and the second approach is a multiobjective metaheuristic \(SHMOBI_{ibea}\).

3.1 Sbic Heuristic

\(Sbic\) is a greedy heuristic that aims to extract relevant biclusters from GWA data matrix and that has been designed in a similar manner as Cheng and Churchs heuristic [3] widely used for microarray data. At each run, \(Sbic\) extracts one bicluster from the data matrix. \(Sbic\) deletes (adds) nodes that meet with some conditions in order to decrease the biclusters average columns variances and increase its size. The main steps of \(Sbic\) are given in Algorithm 1.

figure b

In multiple node deletion phase, \(Sbic\) starts by removing some nodes (rows and columns) in order to decrease the average columns variance. In columns dimension, the variance of each column is calculated. The columns that have the highest variance are deleted. This process will clearly decrease the whole average variances of the columns. Similarly, the average variance can also be decreased by applying the same process on the rows dimension. Indeed, rows with the highest contribution on the average columns variances are deleted. After that, if the bicluster’s average variance still higher than \(\delta \) the bicluster has to undergo the single node deletion processes. The main steps are illustrated in Algorithm 2.

figure c

In single node deletion, the nodes with the highest contribution on the average variance are iteratively deleted until the \(Avar\) reaches the desired value. The main steps are illustrated in Algorithm 3.

figure d

Once the \(Avar\) of the considered bicluster reaches the desired value, the algorithm tries to add other rows (columns) without increasing the \(Avar\). For instance all the columns (not present yet in the bicluster) that have a variance lower than or equal to \(Avar\) are added to the bicluster. Furthermore, the expected contribution of each row \(i\) (\(con_i\)) in the biclusters \(Avar\) value is computed in order to decide whether the row can be added to the bicluster or not. The main steps are illustrated in Algorithm 4.

figure e

Actually, \(Sbic\) is a deterministic algorithm. Thus, the same bicluster will be extracted if the starting matrix is always the same. In order to extract several biclusters from a data matrix \((X,(Y,Z))\) we propose to apply the \(Sbic\) over the whole data matrix to extract the first bicluster. After that, \(Sbic\) can be applied over a sub-matrix containing \(p\%\) of the data’s rows and columns selected randomly which will lead to discovering different bicluster at each run.

In the following section we present the main components of \(SHMOBI_{ibea}\) metaheuristic.

3.2 \(SHMOBI_{ibea}\)

\(SHMOBI_{ibea}\) is based on \(HMOBI_{ibea}\) [15] which is a multiobjective metaheuristic based on the evolutionary algorithm \(MOBI_{ibea}\) [6] and \(DMLS(1 \cdot 1_{\succ })\) [8].

MOBI is a hybrid MOEA (Multi Objective Evolutionary Algorithm) for solving biclustering problem in the specific case of microarray data. It combines IBEA with a local search inspired from Cheng and Churchs heuristic [3] which is dedicated for biclustering of microarray data. \(MOBI_{ibea}\) [6] allows in the case of microarray data to extract biclusters of good quality.

DMLS (Dominance-based Multiobjective Local Search) are a general concept of multiobjective local searches using the concept of Pareto Optimality. At each generation, DMLS selects one or more non-visited solutions (solutions with non-explored neighborhood) from the archive and explores their neighborhoods. After that, the solutions are marked as visited. Different variants of DMLS exists depending on the number of selected solutions and on the exploration strategy. In this study, we will use \(DMLS(1 \cdot 1_{\succ })\) where one solution is randomly selected and the exploration of its neighborhood stops when the first improving solution is found.

In this section, we propose \(SHMOBI_{ibea}\) which is an adapted version of \(HMOBI_{ibea}\) to SNP data. Several changes have been done to adapt \(HMOBI_{ibea}\) to the specific case of SNPs. Therefore, we present a suitable solutions encoding and variation operators.

Solutions Encoding. In \(SHMOBI_{ibea}\), we choose to represent a bicluster as a list compound of six parts: Each one of the first 3 parts of the chromosome is an ordered list of indexes corresponding to either rows, columns or traits; while parts 4 to 6 are just the cardinalities of those lists.

Table 2. Example of SNPs and traits data matrix

Example:

Given the data matrix presented in Table 2, the string \(\{1~ 3~ 2~ 3 ~2~2~2~1\}\) represents the following bicluster compound of two rows (1 and 3), two SNPs (2 and 3) and one trait (2):

$$ \{1~ 3~ 2~ 3~2~2~2~1\}\Longrightarrow \left[ \begin{array}{ccc} 2 &{} 1&{} 0.3\\ &{} &{}\\ 0&{}0&{}-0.75\\ \end{array} \right] $$

Variation Operators. 

  1. 1.

    Crossover:

    A Single point crossover is used in the three first parts of the solution (rows part, columns part and traits part). Each part undergoes crossover separately.

    Let parents be chromosomes \(P_1 =\{r_1~...~ r_n~c_1~...~c_m~t_1~...t_p~r_{nb}~c_{nb}~t_{nb}\} \) and \(P_2 =\{r'_1~...~ r'_l~c'_1~...~c'_k~t'_1~...~t'_q~r'_{nb}~c'_{nb}~t'_{nb}\}\) where \(r_n \leqslant r'_l \).

    The crossover in the rows part is performed as follows: The crossover point in \(P_1\) (\(\lambda _1\)) is generated as a random integer in the range \(2 \leqslant \lambda _1 \leqslant r_n\). the crossover point in \(P_2\) \(\lambda _2 = r'_j\) where \(r'_j \geqslant \lambda _1\) and \(r'_{j-1} \leqslant \lambda _1\). In the same way, the crossover in the columns part and traits part is performed. The parts 4–6 are not involved directly in the crossover and are computed after it.

    For example, consider the parents \(P_1 \) and \(P_2\) presented in Fig. 1. Suppose the \(3^{rd}\) gene index and the \(2^{nd}\) condition index of \(P_1\) are selected, so: \(\lambda _1 = 15\) and \(\lambda '_1 = 5\) then \(\lambda _2 = 16\) and \(\lambda '_2 = 6\), which results on the offspring \(C_1\) and \(C_2\).

  2. 2.

    Mutation:

    We replace the mutation operator by the \(Sbic\) heuristic.

Fig. 1.
figure 1

An example of the crossover operator application.

Fig. 2.
figure 2

General scheme of \(SHMOBI\)

When generating random biclusters, it may happen that irrelevant rows and columns get included in spite of their values lying far apart. Therefore, we start by randomly generating a population where the irrelevant rows and columns of each bicluster are deleted using the \(Sbic\) heuristic. The resulting population is used as the initial population for \(SMOBI_{ibea}\). After that, the \(DMLS(1 \cdot 1_{\succ })\) is applied for each solution of \(SMOBI_{ibea}\)’s archive (Pareto approximation). The main steps of \(SHMOBI_{ibea}\) are illustrated in Fig. 2.

4 Experiments and Results

In this section we present the experimental protocol in assessing the performance of the presented algorithms over synthetic data sets.

4.1 Data Sets

In order to assess the performance of the proposed algorithms, we use synthetic data sets to investigate the ability of our algorithms to extract implanted biclusters. In this purpose, we randomly generate different data sets of size: Set1(100, (1000, 3)) which corresponds to 100 rows 1000 SNPs columns and 3 traits columns and Set2(100, (10000, 3)) which corresponds to 100 rows 10000 SNPs columns and 3 traits columns. For each data set we implant 1 (called Set1-1 et Set2-1) and 5 biclusters (called Set1-5 and Set2-5) with size 10 rows 50 SNPs columns. In each case, the biclusters may involve all (Set*-A) or some of the traits (Set*-T).

4.2 Comparison Criteria

In order to assess the performance of the proposed biclustering algorithm, we use the following two ratios:

$$\begin{aligned} \theta _{Shared} = \frac{S_{cb}}{Tot_{size}} \times 100 \end{aligned}$$
(1)

Where \(S_{cb}\) is the portion size of bicluster correctly extracted and \(Tot_{size}\) is the total size of the implanted bicluster.

$$\begin{aligned} \theta _{NotShared} = \frac{S_{ncb}}{Tot^{'}_{size}} \times 100 \end{aligned}$$
(2)

Where \(S_{ncb}\) is the portion size of bicluster not correctly extracted and \(Tot^{'}_{size}\) is the total size of the extracted bicluster.

The ratio \(\theta _{Shared}\) (resp. \(\theta _{NotShared}\)) expresses the rate of shared (resp. not shared) biclusters volume with real biclusters. In fact, when \(\theta _{Shared}\) (resp. \(\theta _{NotShared}\)) is equal to 100 % the algorithm extracts the correct (resp. not correct) biclusters. A perfect solution has \(\theta _{Shared}\) =100 % and \(\theta _{NotShared}\)=0 % respectively. That is, the exact number of rows and columns of implanted biclusters.

4.3 Parameters

Concerning the models parameters, we set \(\alpha \), \(\beta \) and \(\gamma \) to 0.5, 0, 0.5 respectively. In fact, given the nature of data, SNPs columns present low variance compared to trait columns. Hence, a big number of SNP columns will be added for each bicluster undergoing the \(Sbic\) heuristic. Therefore, we favor biclusters having low average variance and low SNPs columns to be selected in the search process and this by setting \(\beta =0\).

In the other hand, all algorithms parameters have been set experimentally. For the \(Sbic\) we set \(\alpha \) to 1.5, \(\delta \) to 0.15 and \(\%p\) to 50 %. The algorithm is run 20 times in order to extract 20 biclusters. The first run uses all the data matrix. The remaining runs starts by sub-matrices where the rows and the columns are chosen randomly. When selecting rows, more chance is given to rows not present yet in the previously extracted biclusters.

Concerning \(SMOBI_{ibea}\), we experimentally set the initial population size to 400. The mutation and crossover operators parameters are set to 0.2 and 0.5 respectively. The algorithm stops after a fixed time depending on the data set size. For Set1 data sets the execution time is set to 500 s, and 700 s for Set2 data sets. The same time is allocated to \(SHMOBI_{ibea}\) algorithm where 90 % of the execution time is accorded to \(SMOBI_{ibea}\) and the remaining 10 % to \(DMLS(1 \cdot 1_{\succ })\).

We apply our algorithms on the considered data sets and for each algorithm we select the closest biclusters to the implanted ones. Thereafter, we calculate \(\theta _{Shared}\) and \(\theta _{NotShared}\) for each bicluster. For instances where several biclusters are implanted, we report the average \(\theta _{Shared}\) and \(\theta _{NotShared}\) of the extracted biclusters.

4.4 Results

In this section, we compare the efficiency of \(Sbic\), \(SMOBI_{ibea}\) and \(SHMOBI_{ibea}\) in extracting the implanted biclusters. The comparison is done with regard to \(\theta _{Shared}\), \(\theta _{NotShared}\) and the rate of found biclusters.

Table 3. Comparative results when extracting one bicluster. \(SMOBI\) stands for \(SMOBI_{ibea}\), \(SHMOBI\) for \(SHMOBI_{ibea}\).
Table 4. Comparative results when extracting five biclusters. \(SMOBI\) stands for \(SMOBI_{ibea}\), \(SHMOBI\) for \(SHMOBI_{ibea}\).

Tables 3 and 4 present the obtained results for the different instances corresponding to one and five implanted biclusters respectively. A detailed observation of the found solutions show that, in most cases, the not correctly biclusters extracted portions are mainly composed of extra columns (SNPs).

In Table 3 we can observe that all the approaches can find the implanted bicluster. However, \(SHMOBI_{ibea}\) find the best results with the highest \(\theta _{Shared}\) and lowest \(\theta _{notShared}\). For instance, in the case of data Set1-1-A where all the traits are involved in the bicluster, \(SHMOBI_{ibea}\) extracts the bicluster with only \(\theta _{notShared}= 24.24\,\%\). Actually, \(SMOBI_{ibea}\) is able to find the implanted bicluster. However, the \(\theta _{NotShared}\) of the extracted bicluster is very high. This result demonstrates the role of \(DMLS(1, 1_{\succ })\) in fine-tuning the found results.

Similarly, Table 4 shows that \(SHMOBI_{ibea}\) outperforms \(Sbic\) and \(SMOBI_{ibea}\) in finding the implanted biclusters. Actually, \(SHMOBI_{ibea}\) finds more biclusters than the other approaches with higher \(\theta _{Shared}\). However, the \(\theta _{NotShared}\) value of the biclusters extracted using all the approaches are relatively high. This can be explained by the huge number of SNPs columns in the data set.

Concerning running times, they are of 500 s for small instances (Set1-*) and 700 s for large instances (Set2-*).

5 Conclusion

In this article, we have presented a preliminary study on using a biclustering method to analyze GWA data. Actually, GWA data consists in two types of information i.e. phenotype data (traits) and genotype data (genetic variations). Commonly, SNPs are considered as they present the most frequent form of genetic variations. The analysis of such data consists in finding eventual associations between traits and SNPs combinations. Therefore, we propose a multiobjective modeling for biclustering in order to extract samples (individuals) sharing similar traits and having same alleles for a SNPs combination. The corresponding biclusters are constant columns biclusters.

The extracted biclusters may bring out existing associations between the considered SNPs and traits. Moreover, the extracted biclusters may provide important informations that can be used in further GWA studies. Given the huge number of SNPs, we propose to solve this problem using a hybrid metaheuristic \(SHMOBI_{ibea}\). The efficiency of \(SHMOBI_{ibea}\) have been assessed using synthetic data sets of different sizes and different implanted biclusters numbers. Further studies will be carried out in real data sets provided by the company Genes Diffusions Footnote 1.