1 Introduction

Microarray technology has been widely applied to measure the expression levels of thousands of genes simultaneously. In this context, cluster analysis is the most important method for gene expression data analysis [15]. By comparing the expression patterns of unknown genes to those of genes with known functions, one can predict the functions of the unknown genes. This is the primary objective of unsupervised cluster analysis of gene expression data.

Many clustering algorithms have been proposed for gene expression data analysis. Hierarchical clustering was one of the earliest methods applied to cluster gene expression data [6, 7]. K-means clustering methods are used in gene expression data analysis because of their computational efficiency [8, 9]. The self-organizing map (SOM), a kind of neural network that represents high-dimensional data in a low-dimensional space, has also been used for gene expression data clustering [10]. Other common clustering methods include the CAST algorithm [11], SVM clustering [12] and model-based clustering [13, 14]. Gene expression data are noisy, and many of the underlying variables cannot be observed directly. Model-based clustering assumes that the data are generated by an underlying probabilistic model, and the observed data are divided into clusters according to the estimated parameters. Successful applications of model-based clustering to gene expression data have been reported; the best-known mixture-model clustering software is MCLUST [15, 16]. Yeung et al. [17] applied a finite Gaussian mixture model to the yeast cell cycle data and showed that the model fits the data well. Yi et al. [18] used a model-based algorithm for supervised clustering of gene expression data. Gaussian mixture models have received the bulk of the attention in the mixture modeling literature due to their mathematical tractability [19, 20].

In model-based clustering, the parameters are typically estimated with the EM algorithm, which is a popular way to estimate the parameters of mixture models. Unfortunately, its performance depends heavily on the initialization. To address this problem, we propose an improved EM algorithm to estimate the model parameters. In our method, a random swap K-means algorithm is used to initialize the multivariate Gaussian mixture models. The EM algorithm starts with initial values of all unknowns and iteratively updates each parameter conditional on the parameter values from the previous iteration. Without any prior knowledge, each gene may be assigned to each cluster with equal probability.

This paper is structured as follows. Sect. 2 reviews the multivariate Gaussian mixture models and EM algorithm. The improved EM algorithm based on multivariate Gaussian mixture models is given in Sect. 3. Section 4 provides experimental results and analysis using the simulated and real gene expression data sets. We then conclude the paper in Sect. 5.

2 Multivariate Gaussian mixture models and EM algorithm

2.1 Multivariate Gaussian mixture models

The gene expression data are arranged in an \(m\times n\) matrix denoted by X, where n is the number of genes and m is the number of treatments. Suppose \(X=\{x_1, x_2, \ldots , x_n \}\) is a randomly observed gene expression data set, and all \(x_i \ (1\le i\le n)\) are mutually independent. Let \(x_i =[x_{i1}, x_{i2}, \ldots , x_{im} ]^{T}\) be the i-th column of X, where \(x_{ij}\) is the expression level of the i-th gene under the j-th treatment, for \(i=1, \ldots , n\) and \(j=1, \ldots , m\). Each \(x_i \in R^{m}\) is thus an m-dimensional random vector with probability density function \(f(x_i)\). With the finite multivariate mixture model, each \(x_i\) is assumed to follow an m-dimensional mixture of normal distributions, so the mixture density \(f(x_i )\) is

$$\begin{aligned} f(x_i )=\sum _{k=1}^C \alpha _k f_k (x_i \mid \theta _k ) \end{aligned}$$
(1)

with \(\sum \nolimits _{k=1}^C \alpha _k =1\) and \(0\le \alpha _k \le 1\), where C is the number of components of the mixture model, \(\alpha _k\) is the mixing proportion (weight) of the k-th component, and \(\theta =(\alpha _1, \ldots , \alpha _C ,\theta _1, \ldots , \theta _C )\) is the parameter vector of the model.

If the distribution of the component density function \(f_k (x\mid \theta _k )\) is specified, the model becomes a mixture of that component family, and Eq. (1) is called the mixture density function. Although C in Eq. (1) is usually treated as a fixed value, its true value is unknown in most applications. A common method of estimating the parameters of finite mixture models is the EM algorithm [21], which computes maximum likelihood estimates of the model parameters from incomplete data.

The multivariate Gaussian mixture model is a common mixture density model used to describe the distribution of multidimensional data. Under the Gaussian hypothesis, the parameter vector is \(\theta =(\alpha _{1},\mu _{1},\Sigma _{1},\alpha _{2},\mu _{2},\Sigma _{2},\ldots ,\alpha _{C},\mu _{C},\Sigma _{C})\), where \(\mu _{k}\) (an \(m\times 1\) vector) is the mean of component k and \(\Sigma _{k}\) (an \(m\times m\) matrix) is its covariance matrix. The multivariate Gaussian mixture probability density is expressed as

$$\begin{aligned} f (x_i \mid \theta )=\sum _{k=1}^C \alpha _k f_k (x_i \mid \mu _k ,\Sigma _k ) \end{aligned}$$
(2)
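
To make Eq. (2) concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; all function names are ours) evaluates the mixture density for a single observation \(x_i\):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density f_k(x | mu_k, Sigma_k)."""
    m = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    norm = np.sqrt((2.0 * np.pi) ** m * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def mixture_density(x, alphas, mus, Sigmas):
    """Eq. (2): sum over k of alpha_k * f_k(x | mu_k, Sigma_k)."""
    return sum(a * mvn_pdf(x, mu, S) for a, mu, S in zip(alphas, mus, Sigmas))
```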

2.2 EM algorithm

The EM algorithm [22–24] is an iterative technique for finding maximum likelihood estimates when there are, or are assumed to be, missing data. Without any prior knowledge, each gene may be assigned to each cluster with equal probability. The EM algorithm iterates between an expectation (E) step and a maximization (M) step. In the E step, we compute the expectation of the complete-data log likelihood with respect to the latent variables given the current parameter estimates. In the M step, we maximize this expected complete-data log likelihood. The EM algorithm therefore transfers the problem of maximizing the original log likelihood to the problem of maximizing the expected complete-data log likelihood, which is usually much easier to deal with. The EM iteration is described in the following steps; a compact code sketch of one iteration is given after the steps:

(1) Initialization: The parameters needed for the next step are initialized in the following way:

    $$\begin{aligned} P_{ik}^0 = 1/C \qquad \forall \, i=1,\ldots ,n;\ k=1,\ldots ,C \end{aligned}$$
    (3)
    $$\begin{aligned} \alpha _k^0 = 1/C \qquad \forall \, k=1,\ldots ,C \end{aligned}$$
    (4)
(2) Step E: this step consists of the estimation of the posterior probability \(P_{ik}^t (x_i )\) that the gene expression datum \(x_i\) belongs to class \(k\) at the \(t\)-th iteration:

    $$\begin{aligned} P_{ik}^t (x_i )=\frac{\alpha _k^t f(x_i \mid \theta _k^t )}{\sum \nolimits _{l=1}^C \alpha _l^t f(x_i \mid \theta _l^t )} \end{aligned}$$
    (5)
(3) Step M: the parameters needed for the next step are estimated in the following way:

    $$\begin{aligned} \alpha _k^{t+1}&= \frac{1}{n}\sum _{i=1}^n {P_{ik}^t (x_i )} \end{aligned}$$
    (6)
    $$\begin{aligned} \mu _k^{t+1}&= \frac{\sum \nolimits _{i=1}^n {x_i P_{ik}^t (x_i )} }{\sum \nolimits _{i=1}^n {P_{ik}^t (x_i )} } \end{aligned}$$
    (7)
    $$\begin{aligned} \Sigma _k^{t+1} = \frac{\sum \nolimits _{i=1}^n P_{ik}^t (x_i )(x_i -\mu _k^{t+1} )(x_i -\mu _k^{t+1})^{T} }{\sum \nolimits _{i=1}^n P_{ik}^t (x_i )} \end{aligned}$$
    (8)
(4) Turn back to step (2), or stop when the iterations have converged.
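
As a concrete illustration of steps (2) and (3), the following NumPy sketch (our own, not the authors' implementation; variable names are ours) performs one E step (Eq. 5) and one M step (Eqs. 6–8) for a multivariate Gaussian mixture:

```python
import numpy as np

def em_step(X, alphas, mus, Sigmas):
    """One EM iteration. X: (n, m) data; alphas: (C,); mus: (C, m); Sigmas: (C, m, m)."""
    n, m = X.shape
    C = len(alphas)

    # E step (Eq. 5): posterior probability that gene i belongs to cluster k
    P = np.empty((n, C))
    for k in range(C):
        diff = X - mus[k]
        inv = np.linalg.inv(Sigmas[k])
        norm = np.sqrt((2.0 * np.pi) ** m * np.linalg.det(Sigmas[k]))
        P[:, k] = alphas[k] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
    P /= P.sum(axis=1, keepdims=True)

    # M step (Eqs. 6-8): update mixing weights, means and covariance matrices
    Nk = P.sum(axis=0)                       # effective cluster sizes
    alphas_new = Nk / n                      # Eq. (6)
    mus_new = (P.T @ X) / Nk[:, None]        # Eq. (7)
    Sigmas_new = np.empty_like(Sigmas)
    for k in range(C):
        diff = X - mus_new[k]
        Sigmas_new[k] = (P[:, k, None] * diff).T @ diff / Nk[k]   # Eq. (8)
    return alphas_new, mus_new, Sigmas_new, P
```

In practice this step is repeated until the increase in the log likelihood falls below a chosen tolerance.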

3 The improved EM algorithm based on multivariate Gaussian mixture models

The idea of the improved EM algorithm is to alternate between a simple perturbation of the solution by a remove–add operation and convergence toward the nearest optimum by the EM algorithm. The initialization is performed as in the K-means method; after the solution has been initialized, we perform the remove and add operations.

Our method is outlined as follows:

(1) Initialization step: the number of clusters C is assumed to be known. An initial solution for the parameters \((\mu _k^0 ,\Sigma _k^0 )\) of the multivariate mixture models is obtained by K-means. The mixture weights \(\alpha _k^0\) are initialized as follows, where \(N_k^0\) is the number of points assigned to cluster k by K-means and N is the total number of points:

    $$\begin{aligned} \alpha _k^0 =\frac{N_k^0 }{N} \qquad \forall \, k=1,\ldots ,C \end{aligned}$$
    (9)
(2) Remove and add step: a randomly chosen component is removed and a new component is added. The location of the new component is decided by selecting one data point and setting it as the mean vector of the new component. The new component is therefore more likely to be placed in areas of high point density, such as cluster centers. At the s-th iteration of the remove and add operation, the parameters of the new component are as follows (a code sketch of this perturbation is given after the steps):

    $$\begin{aligned} \mu _r^s = x_p , \qquad r= \mathrm {random}(1,C) \end{aligned}$$
    (10)
    $$\begin{aligned} \alpha _r^s = \sum _{l=1,\,l\ne r}^C \left( \sum _{i=1}^N P_{il}^{s-1} \right) \alpha _r^{s-1} \end{aligned}$$
    (11)
    $$\begin{aligned} \Sigma _r^s = \sum _{l=1,\,l\ne r}^C \left( \sum _{i=1}^N P_{il}^{s-1} \right) \Sigma _r^{s-1} \end{aligned}$$
    (12)
(3) Step E: this step consists of the estimation of the posterior probability \(P_{ik}^t (x_i )\) that the gene expression datum \(x_i\) belongs to class \(k\) at the \(t\)-th iteration:

    $$\begin{aligned} P_{ik}^t (x_i )=\frac{\alpha _k^t f(x_i \mid \theta _k^t )}{\sum \nolimits _{l=1}^C \alpha _l^t f(x_i \mid \theta _l^t )} \end{aligned}$$
    (13)
(4) Step M: the parameters needed for the next step are estimated in the following way:

    $$\begin{aligned} \mu _k^{t+1}&= \frac{\sum \nolimits _{i=1}^n {x_i P_{ik}^t (x_i )} }{\sum \nolimits _{i=1}^n {P_{ik}^t (x_i )} } \end{aligned}$$
    (14)
    $$\begin{aligned} \alpha _k^{t+1}&= \frac{1}{n}\sum \nolimits _{i=1}^n {P_{ik}^t (x_i )} \end{aligned}$$
    (15)
    $$\begin{aligned} \Sigma _k^{t+1} = \frac{\sum \nolimits _{i=1}^n P_{ik}^t (x_i )(x_i -\mu _k^{t+1} )(x_i -\mu _k^{t+1})^{T} }{\sum \nolimits _{i=1}^n P_{ik}^t (x_i )} \end{aligned}$$
    (16)
(5) Turn back to step (3) and repeat the E and M steps until the EM iterations converge.

(6) Turn back to step (2), or stop when the number of remove and add operations is large enough.
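
The remove and add perturbation of step (2) can be sketched as follows (our own illustration under the same NumPy conventions as above; the final renormalisation of the weights is our assumption and is not stated explicitly in the paper):

```python
import numpy as np

def remove_add(X, alphas, mus, Sigmas, P, rng):
    """Remove a randomly chosen component and re-create it on a randomly chosen
    data point (Eqs. 10-12). P is the (n, C) posterior matrix of the previous step."""
    C = len(alphas)
    r = rng.integers(C)                      # component to replace, r = random(1, C)
    p = rng.integers(X.shape[0])             # data point that seeds the new component
    alphas, mus, Sigmas = alphas.copy(), mus.copy(), Sigmas.copy()
    mus[r] = X[p]                                                   # Eq. (10)
    # Eqs. (11)-(12): scale the replaced weight and covariance by the total
    # posterior mass of the remaining components
    mass = P[:, [l for l in range(C) if l != r]].sum()
    alphas[r] = mass * alphas[r]
    Sigmas[r] = mass * Sigmas[r]
    alphas = alphas / alphas.sum()           # renormalise the weights (our assumption)
    return alphas, mus, Sigmas
```

Here rng is a numpy.random.Generator, e.g. np.random.default_rng().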

This improved EM algorithm iteration scheme is robust and well behaved. The convergence speed is also reasonably fast.

4 Result and analysis

In this section, in order to evaluate the gene expression data clustering algorithm based on multivariate Gaussian mixture models, both the synthetic and the real data sets used in the clustering experiments are introduced and analyzed. All the experiments were implemented in C++ and run on a Lenovo PC with an Intel CPU (1.5 GHz), 1 GB of memory and a 120 GB hard disk.

4.1 Evaluation measures

The adjusted Rand index (ARI) evaluates the degree of agreement between two partitions of the same set of objects [25]. Suppose C is the true clustering of a gene expression data set based on domain knowledge, and \(\textit{C}^\prime \) is the clustering result given by some clustering algorithm. Let a denote the number of pairs of objects that are in the same cluster in both C and \(\textit{C}^\prime \), b the number of pairs in the same cluster in C but in different clusters in \(\textit{C}^\prime \), c the number of pairs in different clusters in C but in the same cluster in \(\textit{C}^\prime \), and d the number of pairs in different clusters in both C and \(\textit{C}^\prime \). The ARI(C, \(\textit{C}^\prime \)) is defined by

$$\begin{aligned} \mathrm {ARI}(C,C^\prime )=\dfrac{2(ad-bc)}{(a+b)(b+d)+(a+c)(c+d)} \end{aligned}$$
(17)

A higher value of ARI indicates that \(\textit{C}^\prime \) is more similar to C, and hence that the clustering performance is better.

Precision (\(P\)) and recall (\(R\)) are also used for performance evaluation. Precision measures how much noise there is in the output of the detector, while recall measures how much of the ground truth is detected. Precision is the fraction of detections that are true positives rather than false positives, and recall is the fraction of true positives that are detected rather than missed. The F-measure combines precision and recall to evaluate the clustering results:

$$\begin{aligned} F=\frac{2PR}{P+R} \end{aligned}$$
(18)

The higher the value of \(F\), the better the clustering performance.
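
Both measures can be computed directly from the pair counts a, b, c, d; the following sketch (our illustration, with hypothetical function names) mirrors Eqs. (17) and (18):

```python
from itertools import combinations

def pair_counts(true_labels, pred_labels):
    """Count the pairs a, b, c, d used by Eq. (17)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1
        elif same_true:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d

def adjusted_rand_index(true_labels, pred_labels):
    a, b, c, d = pair_counts(true_labels, pred_labels)
    return 2.0 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))  # Eq. (17)

def f_measure(precision, recall):
    return 2.0 * precision * recall / (precision + recall)                  # Eq. (18)
```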

4.2 The number of clusters

A good estimate of the number of clusters is very important for clustering. The number of clusters may be treated as another parameter and inferred from the data. Akaike's information criterion (AIC) or the Bayesian information criterion (BIC) can be used to estimate the optimal number of clusters [26, 27]. The AIC is

$$\begin{aligned} \hbox {AIC}(G)=-2 \ln L(\theta (G)) +2P(G) \end{aligned}$$
(19)

where P(G) is the number of parameters to be estimated in the model, \(L(\theta (G))\) is the likelihood evaluated at \(\theta (G)\), the vector of maximum likelihood estimates of the parameters, and G is the number of Gaussian components. The number of clusters C that gives the minimum AIC value is the estimated C.

Bozdogan proposed the modified AIC criteria \(\hbox {AIC}_{3} \) and \(\hbox {AIC}_{4}\), whose expressions are as follows:

$$\begin{aligned} \hbox {AIC}_3 (G)=-2 \ln L(\theta (G)) +3P(G) \end{aligned}$$
(20)
$$\begin{aligned} \hbox {AIC}_4 (G)=-2 \ln L(\theta (G)) +4P(G) \end{aligned}$$
(21)

In order to adapt to over-dispersed data, Lebreton [28] proposed a modified AIC criterion, the QAIC, with the expression:

$$\begin{aligned} \hbox {QAIC}(G)=-2 \ln L(\theta (G))/c+2P(G) \end{aligned}$$
(22)

where c is a single variance inflation factor, which can be estimated from the goodness-of-fit Chi-square statistic \(\chi ^{2}\) of the global model and its degrees of freedom,

$$\begin{aligned} c=\chi ^{2}/df \end{aligned}$$
(23)

We compare and analyze the experimental results on a simulated data set consisting of 81 points, each with coordinate attributes \((x, y)\), as depicted in Fig. 1. Figure 2 illustrates the density function and the Gaussian mixture density estimation for cluster numbers \(C\,=\,2,3,4,5\); a singular covariance occurs when \(C\,=\,6\), so the maximum value considered is \(C\,=\,5\). From Fig. 2, the estimated mixture density is smoothest when \(C\,=\,3\). How can the number \(C\) of mixture components be analyzed quantitatively? We carried out experiments with the AIC, AIC\(_{3}\), AIC\(_{4}\), BIC and QAIC criteria; the results are illustrated in Figs. 3 and 4.

However, for AIC, AIC\(_{3}\) and AIC\(_{4}\) the log-likelihood term is much larger than the penalty term \(P(G)\), so these criteria simply decrease, as shown in Fig. 3a–c. If we search only for the minimum criterion value and the best-fitting graph, the selected value of \(G\) is always large, which results in over-fitting. The BIC increases the penalty term, and its curve shows a more obvious trend of first decreasing and then increasing. When the value of \(G\) is 3, we obtain the minimum criterion value and the best fit.

For the QAIC method, the value of \(c\) is set to 2–9, corresponding to Fig. 4a–h. From the trend of the figures: when \(c\,=\,2\), the QAIC values decrease monotonically; when \(c\,=\,3, 4, 5\), the values first decrease and then increase gradually, and the decrease is large while the increase is small; when \(c\,=\,6, 7, 8\), the values again first decrease and then increase, but now the decrease is small and the increase is large, and the axis of symmetry moves from right to left. Only when \(c\,=\,5\) is the minimum of the function closest to the axis of symmetry, so \(c\,=\,5\) with \(G\,=\,3\) is the most reasonable fit.

Fig. 1 Simulated dataset D1

Fig. 2 Results of the density function and the Gaussian mixture estimation of the dataset D1. a The density function of dataset D1 (G = 2). b The Gaussian mixture estimation of the dataset D1 (G = 2). c The density function of dataset D1 (G = 3). d The Gaussian mixture estimation of the dataset D1 (G = 3). e The density function of dataset D1 (G = 4). f The Gaussian mixture estimation of the dataset D1 (G = 4). g The density function of dataset D1 (G = 5). h The Gaussian mixture estimation of the dataset D1 (G = 5)

Fig. 3 The values of AIC(\(G\)), AIC\(_{3}\)(\(G\)), AIC\(_{4}\)(\(G\)) and BIC(\(G\))

Fig. 4 The values of QAIC(\(G\)) for \(c\,=\,2, 3, \ldots , 9\)

From the analysis above, when choosing the value of G based on an information criterion, we need to consider the balance between the likelihood term and the penalty term, which means the criterion must be adapted to the specific data set.

When the function \(\hbox {AIC}(G)\) attains its minimum value, the corresponding \(G\) is the most reasonable number of components of the mixture density model. However, to search for this minimum, a sufficiently large upper bound \(M\) must be set in advance and the minimum of the AIC sought in the interval from 1 to \(M\). \(M\) must be at least as large as the true \(G\), which is difficult to determine; at the same time, when \(M\) is large the computation time grows and efficiency is hard to guarantee.

According to the analysis of the above experiments, we use the QAIC method for determining the number of components, with the criterion function:

$$\begin{aligned} \hbox {QAIC}(G)=-2 \ln L(\theta (G))/c+2P(G) \end{aligned}$$
(24)

where \(c=5\). We can determine the number of components from this expression.
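
The resulting selection rule can be sketched as follows (our illustration; the routine that fits each candidate model and returns its maximized log likelihood and parameter count is assumed to exist elsewhere):

```python
def qaic(log_likelihood, n_params, c=5.0):
    """Eq. (24): QAIC(G) = -2 ln L(theta(G)) / c + 2 P(G), with c = 5 as used here."""
    return -2.0 * log_likelihood / c + 2.0 * n_params

def select_num_components(candidates, c=5.0):
    """candidates maps each candidate G to (log_likelihood, n_params);
    the G with the smallest QAIC is returned."""
    return min(candidates, key=lambda G: qaic(*candidates[G], c))
```

For example, passing a dictionary such as {2: (ll2, p2), 3: (ll3, p3), ...} returns the candidate G with the smallest criterion value.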

4.3 Synthetic data experiment

We tested the algorithms on synthetic data sets drawn from different simulated multivariate Gaussian mixture distributions. Figure 5 shows the Gaussian models estimated by EM and by our improved EM method on these synthetic data sets. The data sets contain more than 10,000 points, and the models are displayed as ellipses. The experiment was repeated 20 times.
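
The paper does not report the exact simulation parameters; data of this kind could be generated along the following lines (a sketch with placeholder parameter values, not the ones used in the experiments):

```python
import numpy as np

def sample_gaussian_mixture(n, alphas, mus, Sigmas, seed=0):
    """Draw n points from a multivariate Gaussian mixture with known labels."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(alphas), size=n, p=alphas)
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in labels])
    return X, labels

# Hypothetical two-dimensional example with three components
alphas = [0.3, 0.4, 0.3]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([8.0, 0.0])]
Sigmas = [np.eye(2), np.diag([2.0, 0.5]), 1.5 * np.eye(2)]
X, labels = sample_gaussian_mixture(10000, alphas, mus, Sigmas)
```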

Fig. 5 Gaussian models estimated from the initial solution, EM and our method. a Initialization. b EM. c Our improved EM method

Figure 5a shows the initial solution on the data sets. Figure 5b shows the result of EM clustering compared with the initial solution, while the clustering result of our improved EM method in Fig. 5c is clearly better in parameter estimation than both the initial solution and EM. The standard EM has the shortcoming of getting stuck in a local maximum, and our proposed improved EM algorithm overcomes this well.

The number of remove and add operations is a key parameter in our improved EM method. To ensure a good solution, the number of remove and add iterations was set large enough. We observed that the effect of a bad initialization is diminished when the remove–add operation is used, so we expect the remove and add operation to yield good results with Gaussian mixture models. For a good remove and add to occur, a badly placed component must be chosen for removal, and a location in the area to which the component needs to move must be chosen for the new component.

To compare the random swap EM (RSEM) algorithm with the standard EM algorithm, we calculated the quantitative accuracy of the results; they are shown in Table 1.

From Table 1, we observe that RSEM achieved higher precision (P) and recall (R) values than EM. According to column (3), our improved EM method also reached a higher value of F, which means the cluster boundaries are represented better. The higher ARI value of our improved EM method indicates that \(\textit{C}^\prime \) is more similar to C, so the clustering performance of our improved EM method is better. With the random perturbation of the solution before continuing the EM iterations, and with only the number of remove and add operations to set, the random swap EM is much simpler and more efficient than EM.

4.4 Real data sets experiment

In this section, we compare the performance of our improved EM algorithm with the classical EM algorithm, an EM algorithm based on multivariate distributions [29], two standard clustering algorithms [30], and K-means [31] on four real gene expression data sets.

The rat CNS data set was obtained by reverse transcription-coupled PCR to study the expression levels of 112 genes during rat central nervous system development over nine time points. These 112 genes were selected from four gene families according to prior biological knowledge by Wen et al. [32]. We used this \(\hbox {112}\times \hbox {9}\) microarray as one of our experimental data sets and took the four functional categories as external standard categories.

The human fibroblasts serum data set contains the expression levels of 8613 human genes [33]. The data set has 13 dimensions corresponding to 12 time points (0, 0.25, 0.5, 1, 2, 4, 6, 8, 12, 16, 20, and 24 h) and one unsynchronized sample. A subset of 517 genes whose expression levels changed substantially across the time points was chosen for our experiments [34].

Table 1 P’s, R’s, and F’s for the EM algorithm and RSEM algorithm

The yeast cell cycle data set is from http://cellcycle-www.stanford.edu. In the study by Yeung et al. [17], a subset of 384 genes was used. These genes had expression levels peaking at different times corresponding to the five (C = 5) phases of the cell cycle: Early G1, Late G1, S, G2 and M. This subset of the data is available at http://www.cs.washington.edu/homes/kayee/model. For preprocessing, we removed the data corresponding to the 90- and 100-min time points, because these two time points were reported to be unreliable [35]. After the deletion, the total number of treatments became 15. We then standardized each gene expression profile by subtracting the mean expression from the original value and dividing the difference by the SD, so that the transformed expression levels have zero mean and unit variance. All 384 genes were assigned to one of the five clusters by the original investigators [17].
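
The per-gene standardization described above corresponds to the following short sketch (our illustration):

```python
import numpy as np

def standardize_profiles(X):
    """Give each gene expression profile (row of X) zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
```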

The HAHrma data consist of time-course responses of the human bronchial cell line A549 to interleukin 13 (IL13), a protein coded by the IL13 gene. IL13 is known to up-regulate CD23 and MHC class II expression, promote switching of the IgE isotype in a special kind of white blood cells known as B cells, and down-regulate the production of pro-inflammatory cytokines and chemokines that aid in the defense mechanism of white blood cells [36]. The human bronchial cells were exposed to IL13, and the expression levels of 22,283 genes were measured at 0, 4, 12 and 24 h after exposure by hybridization with Affymetrix U133A chips.

For all the compared algorithms, we used the QAIC method to determine the number of components on the four real data sets. We analyzed and compared the clustering algorithms in terms of precision (P), recall (R), F-measure and ARI.

Table 2 shows the comparative results of the six clustering algorithms implemented in this experiment on the rat CNS data set. The values of precision (P), recall (R) and F-measure are listed in Table 2 as the criteria to evaluate the respective clustering algorithms. The values of ARI were also considered in our comparison of the clustering algorithms. The best solution in each case is shown in bold.

Table 2 Comparison of the clustering results of various classification methods on the rat CNS data
Table 3 Comparison of the clustering results of various classification methods on the human fibroblasts serum data
Table 4 Comparison of the clustering results of various classification methods on the yeast cell cycle microarray data

Tables 3, 4 and 5 compare the quality of the six clustering algorithms on the human fibroblasts serum, yeast cell cycle and HAHrma data sets. The values of precision (P), recall (R), F-measure and ARI were again used as the criteria to evaluate the performance of the clustering algorithms.

From Tables 2, 3, 4 and 5, it can be seen that, in terms of precision and recall, our improved EM method had overall better performance than the other competing clustering algorithms. In terms of ARI, the value achieved by our method is also higher than those of the other clustering algorithms. This is because our method better alleviates the over-reliance on the initialization.

Table 5 Comparison of the clustering results of various classification methods on the HAHrma data

5 Conclusions

An improved EM method based on multivariate Gaussian mixture models for gene expression data clustering was presented in this paper. The proposed method estimates the coefficients of the multivariate Gaussian mixture models and the mixture weights. In order to find the number of clusters that fits the data best, we compared and analyzed experimental results obtained with the AIC, AIC\(_{3}\), AIC\(_{4}\), BIC and QAIC criteria; the QAIC method is used in this paper. To assess the performance of the improved EM method, we compared our new method with other clustering algorithms for gene expression data analysis. Synthetic data and real gene expression data sets were used in the experiments. The experimental results show that the improved EM method performs better than the other clustering algorithms.

Finally, data normalization is a prerequisite for gene expression data analysis. The model-based method developed here relies on the normality assumption, so there may be a mismatch between the real distribution of gene expression data and this assumption, and the method may be sensitive to departures from normality. Future research should focus on overcoming this model mismatch.