Introduction

Recent developments in microarray technology make it possible to rapidly capture all of the gene expression profiles in biological samples (Ross et al. 2000; Welsh et al. 2001; Bouton and Pevsner 2002; Guffanti et al. 2002). This technology results in large amounts of data, the interpretation of which is a major bottleneck in current studies. A natural step in extracting microarray data information is to examine the extremes, for example, genes with significant differential expression in two samples (case vs control) or in a time-series (such as cell cycles).

Microarray data are characterized by high dimensionality (thousands of genes) and small sample size (often <30). Systematic and stochastic fluctuations are usually involved in microarray experiments (Schuchhardt et al. 2000). Therefore, the raw dye intensity or ratio value has a high noise to signal ratio between probes. The x-fold change approach may induce high false positives and/or false negatives when used as a simple criterion to determine the genes differentially expressed between query and reference samples. Some biologically important genes with small x-fold changes are highly statistically significant when they are measured repetitively with high precision. Conversely, many genes with large x-fold changes in one array and high variability across multiple arrays have no statistical significance (Wolfinger et al. 2001). Various statistical methods have been proposed for identifying differentially expressed genes (DEGs; Chen et al. 1997, 2002; Ideker et al. 2000; Kerr et al. 2000; Newton et al. 2001; Thomas et al. 2001; Wolfinger et al. 2001; Efron et al. 2001; Churchill 2002; Ibrahim et al. 2002; West 2003; Smyth 2004), but none has yet gained widespread acceptance for the analysis of microarray data. The most basic statistical problem is that the measured differential expression cannot completely reflect a real biological shift in gene expression (Newton et al. 2001).

Discrimination and cluster analysis techniques have been very useful for finding patterns of highly correlated gene expression (Eisen et al. 1998; Spellman et al. 1998; Golub et al. 1999; Tamayo et al. 1999; Hastie et al. 2000). These methods use various clustering algorithms, such as self-organizing maps, k-means clustering and hierarchical clustering, to discriminate and characterize patterns of gene expression. However, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Furthermore, gene expression levels or relative ratios, which carry sampling errors within experiments, are used directly in discrimination and cluster analyses, and thus the distance between data points cannot reflect the true differential expression between genes.

Mixed-model approaches are widely used to partition various sources of variability. They have the flexibility to handle unbalanced data, can be easily extended to more complicated biological models and have proven to be powerful statistical tools in classic quantitative genetic analyses (Searle et al. 1992). The objectives of this paper are: (1) to propose a mixed-model approach to analyzing variance components for cDNA microarray data analysis, applying the method to selecting a target subset of DEGs that are of biological interest and (2) to assess the effectiveness of this method by extensive computer simulations, specifically compared with the widely used approach based on t-statistics for single genes (Dudoit et al. 2000). An analysis of publicly available data from a study of unstable transcripts in Arabidopsis demonstrates the utility of our method.

Materials and methods

Each datum in a microarray experiment is associated with one particular combination of an array in the experiment, a fluorescence dye (red or green), a treatment and a gene. In our analysis, we used the logarithms of the original fluorescence measurements as phenotypic values, not the log ratio values used by some previous studies (Kerr et al. 2000; Wolfinger et al. 2001).

To alleviate the computational burden, we propose a two-step strategy for analyzing microarray data. In the first step, we choose, with a loose criterion, a subset of genes that are potentially expressed differentially among treatments. In the second step, these potential genes are combined for further analyses and data-mining with a stringent criterion, in which DEGs are confirmed and some quantities of interest (such as gene × treatment interaction) are estimated. Both types of analyses are performed using a mixed-model approach within a variance-component framework.

Choosing a subset of potential genes with differential expression

We first normalized the original fluorescence data before choosing a subset of genes. The purpose of normalization is to minimize systematic experimental biases so that the observed variation arises from biological differences. Let \( y_{ijkl} \) denote the logarithm of a measurement from the ith array, the jth treatment, the kth dye and the lth gene in a cDNA microarray experiment. The original fluorescence data are normalized as: \( r_{ijkl} = y_{ijkl} - \left( {\bar{y}_{i\cdot\cdot\cdot} + \bar{y}_{\cdot j\cdot\cdot} + \bar{y}_{\cdot\cdot k\cdot} - 2\bar{y}_{\cdot\cdot\cdot\cdot}} \right). \)
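The normalization above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: it assumes the data are held in long format (one entry per spot measurement) and that NumPy is available; the function name `normalize` and its argument names are hypothetical.

```python
import numpy as np

def normalize(y, array, treatment, dye):
    """Subtract array, treatment and dye marginal means from log intensities.

    y, array, treatment, dye: 1-D sequences of equal length, one entry per
    spot measurement across all genes; the last three hold factor labels.
    Implements r = y - (mean_array + mean_treatment + mean_dye - 2 * grand).
    """
    y = np.asarray(y, dtype=float)
    grand = y.mean()

    def level_means(labels):
        # Mean of y within each level, expanded back to one value per row.
        _, inv = np.unique(np.asarray(labels).ravel(), return_inverse=True)
        return (np.bincount(inv, weights=y) / np.bincount(inv))[inv]

    return y - (level_means(array) + level_means(treatment)
                + level_means(dye) - 2.0 * grand)
```

For data that are exactly additive in array, treatment and dye in a balanced layout, every normalized value is zero, so whatever remains in real data reflects gene-specific variation and noise.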

The normalized data, \( r_{ijkl} \), can be viewed as a variation for each gene after removing systematic experimental errors and are the input data for the following single-gene model:

$$ r_{ijkl} = \mu _l + A_{il} + T_{jl} + D_{kl} + \gamma _{ijkl} $$
(1)

Here, \( \mu_l \) represents the overall average expression level of gene l (a fixed effect); \( A_{il} \) is the ith array effect of gene l (a random effect): \( A_{il} \sim \left( {0,\;\sigma _{A(l)}^2 } \right); \) \( T_{jl} \) is the jth treatment effect of gene l (a random effect): \( T_{jl} \sim \left( {0,\;\sigma _{T(l)}^2 } \right); \) \( D_{kl} \) is the kth dye effect of gene l (a random effect): \( D_{kl} \sim \left( {0,\;\sigma _{D(l)}^2 } \right); \) and \( \gamma _{ijkl} \) is the residual error of gene l: \( \gamma _{ijkl} \sim \left( {0,\;\sigma _{\gamma (l)}^2 } \right). \) The array effects account for differences among arrays. Differences among arrays may arise from differences in print quality or from differences in the ambient conditions when the plates were processed, which may increase or reduce the hybridization efficiencies of labeled cDNA. The treatment effects account for differences among treatments. Such differences can arise when some treatments (e.g., a specific cell line) have more transcription activity in general than others. The dye effects account for fluorescent signal differences. One dye may show consistently higher signal intensity than another. The single-gene model is fitted separately to the normalized data from each gene, allowing an elementary inference to be made, using a separate estimate of variability. The procedure described here serves to preselect a subset of genes with potential differential expression. This procedure is similar to a variation filter that is commonly used to exclude genes with less than a certain x-fold variation among the collected samples (Golub et al. 1999). However, the x-fold variation filter is usually based on total gene expression variations. In contrast, our procedure focuses on total treatment effects, which may increase the filter efficiency.

Combining analysis of multiple genes

A subset of genes potentially expressed differentially between one or more pairs of samples in the dataset can be used for further analysis as follows:

$$ y_{ijkl} = \mu + G_l + A_i + T_j + D_k + GA_{li} + GT_{lj} + GD_{lk} + \varepsilon _{ijkl} $$
(2)

where \( \mu \) is the average of overall expression levels (a fixed effect), \( G_l \) is the fixed effect of the lth gene, \( A_i \sim\left( {0,\;\sigma _A^2 } \right) \) is the random effect of the ith array, \( T_j \sim\left( {0,\;\sigma _T^2 } \right) \) is the random effect of the jth treatment and \( D_k \sim \left( {0,\;\sigma _D^2 } \right) \) is the random effect of the kth dye. \( GA_{li} \sim\left( {0,\sigma _{GA}^2 } \right) \) is the interaction between the lth gene and the ith array, \( GT_{lj} \sim\left( {0,\sigma _{GT}^2 } \right) \) is the interaction between the lth gene and the jth treatment and \( GD_{lk} \sim \left( {0,\sigma _{GD}^2 } \right) \) is the interaction between gene l and dye k. The random error term \( \varepsilon _{ijkl} \) is the residual effect: \( \varepsilon _{ijkl} \sim \left( {0,\;\sigma _\varepsilon ^2 } \right). \) Interpretations of \( A_i \), \( T_j \) and \( D_k \) are similar to those in Eq. 1. The gene effects, \( G_l \), account for differences in transcription level among the genes. Some genes may be inherently more active in mRNA transcription than others. The gene × array interactions, \( GA_{li} \), account for the average effect of the spot on the ith array for the lth gene. It is a “spot” effect due to the potentially incomplete control over the amount and concentration of cDNA immobilized from one array to the next. The gene × dye interactions, \( GD_{lk} \), are gene-specific dye effects and account for the average effect of the kth fluorescence dye for the lth gene. This may be attributable to the differential hybridization efficiencies of two chemically different fluorescence dyes for the same probe. The gene × treatment interactions, \( GT_{lj} \), are of primary interest in microarray experiments. These effects capture the departure from the overall averages that is attributable to the specific combination of the jth treatment and the lth gene.

Similar interpretations of the aforementioned factors were also detailed by Kerr et al. (2000). Whether a specific factor is regarded as fixed or random depends not only on the levels of source variation but also on the investigator’s particular interest in the study. A fixed effect is one that is repeatable. That is, if other researchers repeat a specific microarray experiment, they are estimating the same effects. A random effect is one that is not repeatable. That is, another researcher will not (probably cannot) estimate the same effects, but can estimate the variance of the effects from another sample. In our study, we treated gene effects as fixed, while others were treated as random. For example, the print quality of the arrays and the ambient conditions under which the arrays were probed varied from one microarray experiment to another. Such array effects may not be repeatable among different microarray experiments and thus are treated as random effects. The basic mRNA transcription level for a specific gene may remain inherently similar among different microarray experiments when there are no interference factors such as those from arrays and treatments. Such a basic transcription level is estimable with suitable experimental designs. Therefore, the gene effects are treated as fixed effects in our model.

Statistical assessment of gene significance

Both types of the above models can be analyzed by a mixed-model approach. The single-gene model (Eq. 1) can be rewritten in the following matrix form:

$$ \begin{gathered} {\mathbf{r}}_{(l)} = {\mathbf{1}}\mu _{(l)} + {\mathbf{U}}_{A(l)} {\mathbf{e}}_{A(l)} + {\mathbf{U}}_{T(l)} {\mathbf{e}}_{T(l)} + {\mathbf{U}}_{D(l)} {\mathbf{e}}_{D(l)} + {\mathbf{e}}_{\varepsilon (l)} \\ = {\mathbf{1}}\mu _{(l)} + \sum\limits_{u = 1}^4 {{\mathbf{U}}_{u(l)} {\mathbf{e}}_{u(l)} \sim N\left( {{\mathbf{\mu }}_{(l)} ,{\mathbf{V}}_{(l)} } \right)} \\ \end{gathered} $$
(3)

with this variance–covariance matrix:

$$ {\text{Var}}\left( {{\mathbf{r}}_{(l)} } \right) = {\mathbf{V}}_{(l)} = \sigma _{A(l)}^2 {\mathbf{U}}_{A(l)} {\mathbf{U}}_{A(l)}^{\text{T}} + \sigma _{T(l)}^2 {\mathbf{U}}_{T(l)} {\mathbf{U}}_{T(l)}^{\text{T}} + \sigma _{D(l)}^2 {\mathbf{U}}_{D(l)} {\mathbf{U}}_{D(l)}^{\text{T}} + \sigma _{\varepsilon (l)}^2 {\mathbf{I}} $$

where \( \mu _{(l)} \) is the population mean over all entries of gene l; \( {\mathbf{e}}_{u(l)} \) is the vector of random effects, \( {\mathbf{e}}_{u(l)} \sim (0,\sigma _{u(l)}^2 {\mathbf{I}}) \); \( {\mathbf{U}}_{u(l)} \) is the known incidence matrix relating to the random vector \( {\mathbf{e}}_{u(l)} \); \( {\mathbf{U}}_{u(l)}^{\text{T}} \) is the transpose of \( {\mathbf{U}}_{u(l)} \); and \( {\mathbf{U}}_{4(l)} = {\mathbf{I}} \) is an identity matrix. Similarly, the multi-gene model (Eq. 2) can also be expressed in matrix form.
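To make the matrix form concrete, the following sketch builds 0/1 incidence matrices from factor labels and assembles the variance–covariance matrix of Eq. 3. It is an illustration only: the helper names `incidence` and `covariance` are hypothetical, NumPy is assumed, and the variance values used in the usage note are made up.

```python
import numpy as np

def incidence(labels):
    """Return the 0/1 incidence matrix U (n_obs x n_levels) of one factor."""
    levels, inv = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    U = np.zeros((inv.size, levels.size))
    U[np.arange(inv.size), inv] = 1.0      # one 1 per row, in the level's column
    return U

def covariance(Us, variances):
    """Assemble V = sum_u sigma_u^2 * U_u U_u^T.

    Us        : list of incidence matrices (use np.eye(n) for the residual)
    variances : matching list of variance components
    """
    return sum(s2 * (U @ U.T) for U, s2 in zip(Us, variances))
```

For one gene measured on a single loop of three arrays (six observations), the labels could be `array = [0, 0, 1, 1, 2, 2]`, `treatment = [0, 1, 1, 2, 2, 0]` and `dye = [0, 1, 0, 1, 0, 1]`. Two observations then covary by the sum of the variance components of every factor level they share.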

Variance components of the aforementioned models can be estimated using maximum likelihood estimation (ML), restricted maximum likelihood estimation (REML), and minimum norm quadratic unbiased estimation (MINQUE; Searle et al. 1992). Among these three methods, MINQUE possesses the advantages of unbiasedness, no assumption of normal distribution and less computation (Zhu and Weir 1994a). The prediction of random effects can be obtained using methods for best linear unbiased prediction (BLUP; Henderson 1963), linear unbiased prediction (LUP; Zhu and Weir 1994a) and adjusted linear unbiased prediction (AUP; Zhu 1993; Zhu and Weir 1996). The fixed effects can be obtained through the ordinary least square estimation (OLSE) method or the generalized least square estimation (GLSE) method. The Jackknife resampling procedure (Miller 1974; Searle et al. 1992) can be used for estimating the sampling variance of estimated variance components, predicted random effects and estimated fixed effects; and a t-test is then used for the significance test.

Microarray data are characterized by high dimensionality and small sample size, so a normal distribution of the data may not be warranted and ML or REML estimators usually require intensive computation. For this reason, MINQUE(1), an unbiased MINQUE method with all the prior values set at one (Zhu and Weir 1996), was used to estimate the variance components and the Jackknife resampling procedure was used for significance tests in our method. The AUP and OLSE methods were used for predicting random effects and estimating fixed effects, respectively.
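As an illustration of MINQUE(1), the sketch below solves the standard MINQUE equations with every prior weight set to one. It is a generic textbook formulation written under our own assumptions (dense matrices, NumPy, a hypothetical function name `minque1`), not the code used in the study, and it makes no attempt at the efficiency needed for thousands of genes.

```python
import numpy as np

def minque1(y, X, Us):
    """MINQUE variance-component estimates with all prior weights equal to 1.

    y  : (n,) observations
    X  : (n, p) design matrix of fixed effects (a column of ones for Eq. 1)
    Us : list of (n, q_u) incidence matrices; include np.eye(n) for residuals
    Returns one estimated variance component per entry of Us.
    """
    y = np.asarray(y, dtype=float)
    Vs = [U @ U.T for U in Us]
    V_inv = np.linalg.inv(sum(Vs))                 # V under unit prior weights
    XtVi = X.T @ V_inv
    # Q = V^-1 - V^-1 X (X' V^-1 X)^-1 X' V^-1
    Q = V_inv - XtVi.T @ np.linalg.solve(XtVi @ X, XtVi)
    QVs = [Q @ V for V in Vs]
    # Solve F * sigma2 = rhs with F[u, v] = tr(Q V_u Q V_v), rhs[u] = y' Q V_u Q y
    F = np.array([[np.trace(A @ B) for B in QVs] for A in QVs])
    Qy = Q @ y
    rhs = np.array([Qy @ V @ Qy for V in Vs])
    return np.linalg.solve(F, rhs)
```

Because each estimate is a quadratic form in the data, it is unaffected by adding a constant to all observations when `X` contains an intercept. For a balanced one-way layout the result coincides with the usual ANOVA estimators.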

In the single-gene model, a series of hypotheses can be made about the variance of treatment: \( H_0 : \sigma _{T(l)}^2 = 0 \) vs \( H_1 : \sigma _{T(l)}^2 \ne 0. \) If \( H_0 \) is rejected for gene l, the observations of this gene are retained for further analysis in the multi-gene model. In the subsequent multi-gene model, a t-test following the Jackknife resampling procedure is applied to test the null hypothesis that a specific gene has no differential expression, that is, that the gene × treatment interaction effect (i.e., \( e_{GT} \)) is not significantly different from zero. If at least one of the \( e_{GT} \) of gene l is not equal to zero, gene l is considered a DEG. This resampling-based t-test in the multi-gene model can capture the departure from the overall average that is attributable to the specific combination of the jth treatment and the lth gene.
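The Jackknife-based t-test can be sketched generically as follows. This is a minimal delete-one-unit illustration under our own assumptions: the resampling unit (for example, a replicate block of arrays) is left to the caller, and the names `jackknife_t` and `estimator` are hypothetical. The returned statistic would be compared with a t distribution on the returned degrees of freedom.

```python
import numpy as np

def jackknife_t(estimator, units):
    """Delete-one jackknife estimate, standard error and t statistic.

    estimator : function mapping a list of resampling units to a scalar
                (e.g. a variance component or a predicted GT effect)
    units     : list of resampling units
    Returns (estimate, standard_error, t_statistic, degrees_of_freedom)
    for testing H0: parameter = 0.
    """
    units = list(units)
    n = len(units)
    full = estimator(units)
    # Re-estimate with each unit left out in turn.
    leave_one_out = np.array([estimator(units[:i] + units[i + 1:])
                              for i in range(n)])
    pseudo = n * full - (n - 1) * leave_one_out    # jackknife pseudo-values
    est = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n)
    return est, se, est / se, n - 1
```

When the estimator is a plain mean, the pseudo-values reduce to the observations themselves, so the result matches the ordinary one-sample t statistic.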

Simulation design

A series of simulations for cDNA microarray experiments was conducted to evaluate the performance of the proposed approach. The loop design was adopted in our simulated experiments. The loop design involves constructing a cyclic sequence of n treatments on n arrays, with each treatment represented twice, each time labeled with a different fluorescence dye (Kerr and Churchill 2001). In all the simulations conducted, there were 4,000 genes and six treatments. The six treatments were divided into two groups of three each. Treatments T1–T3 were in one group and treatments T4–T6 were in another. For the first group, in the first array treatment T1 was marked with Cy3 dye and treatment T2 was marked with Cy5 dye, in the second array treatment T2 was marked with Cy3 dye and treatment T3 was marked with Cy5 dye and in the third array treatment T3 was marked with Cy3 dye and treatment T1 was marked with Cy5 dye. Note that in spotted cDNA microarrays the two treatments under comparison are labeled with two different dyes and co-hybridized to the same array. The design was similar for another group with treatments T4–T6 (Table 1). Each of them was replicated three times, giving 18 arrays in total.
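The layout described above can be enumerated programmatically. The sketch below is illustrative only: the function name `loop_design` and the order in which arrays are numbered are assumptions, since the description fixes the dye assignments within each loop but not the numbering.

```python
def loop_design(groups=((1, 2, 3), (4, 5, 6)), replicates=3):
    """List the arrays of a replicated loop design.

    Within each group, array k carries treatment k labeled with Cy3 and the
    next treatment in the cycle labeled with Cy5, so every treatment is
    measured once with each dye per loop.
    """
    arrays = []
    for rep in range(1, replicates + 1):
        for group in groups:
            n = len(group)
            for pos in range(n):
                arrays.append({
                    "array": len(arrays) + 1,          # assumed numbering
                    "replicate": rep,
                    "Cy3": f"T{group[pos]}",
                    "Cy5": f"T{group[(pos + 1) % n]}",
                })
    return arrays

if __name__ == "__main__":
    for row in loop_design():
        print(row)
```

With the default arguments this yields 18 arrays, and each of the six treatments appears three times with each dye.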

Table 1 Experimental design of simulations

Generating gene-expression data

To generate each dataset, we preset different magnitudes of source variations (i.e., variance components) in the simulated microarray experiments. The gene × treatment interaction variance was set to 50 and the ratio of the gene × treatment interaction variance (\( V_{GT} \)) to the total phenotypic variance (\( V_P \)), that is, \( V_{GT}/V_P \), varied from 0.1 to 0.9 in all of the simulations. Four configurations of the remaining variance components (\( V_A \), \( V_D \), \( V_T \), \( V_{GA} \), \( V_{GD} \), \( V_\varepsilon \)) were simulated for the remainder of the phenotypic variation (i.e., \( V_P - V_{GT} \)): (1) the effects of A, D, T, GA, GD and ɛ contribute equally to the remainder of phenotypic variation, that is, \( V_A : V_D : V_T : V_{GA} : V_{GD} : V_\varepsilon = 1:1:1:1:1:1 \) (denoted EQUAL), (2) the A and GA effects dominate in the remainder of phenotypic variation, that is, \( (V_A + V_{GA})/(V_P - V_{GT}) = 0.9 \) and \( V_D : V_T : V_{GD} : V_\varepsilon = 1:1:1:1 \) (denoted ARRAYDOM), (3) the D and GD effects dominate in the remainder of phenotypic variance, that is, \( (V_D + V_{GD})/(V_P - V_{GT}) = 0.9 \) and \( V_A : V_T : V_{GA} : V_\varepsilon = 1:1:1:1 \) (denoted DYEDOM) and (4) the T effects dominate the remainder of phenotypic variation, that is, \( V_T /(V_P - V_{GT}) = 0.9 \) and \( V_A : V_D : V_{GA} : V_{GD} : V_\varepsilon = 1:1:1:1:1 \) (denoted TREATDOM). Note that the efficiency of identifying DEGs is dependent on the relative proportions among different source variations rather than on the absolute magnitude of each of them. We assumed that there were only 40 DEGs among a total of 4,000 genes tested in the experiment (representing 1% of total genes), that is, 40 genes had gene × treatment interaction effects. The gene-expression values were obtained from the multi-gene model (Eq. 2) and the random effects in the model were drawn as pseudo-random normal deviates with zero mean and different known variances.
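The four configurations can be turned into explicit variance components as in the sketch below. It is an illustration under stated assumptions: the function name `variance_components` is hypothetical, and where two effects jointly dominate (ARRAYDOM, DYEDOM) the 90% share is split equally between them, a choice the description does not specify.

```python
def variance_components(ratio, config, v_gt=50.0):
    """Return the variance components for one simulated configuration.

    ratio  : V_GT / V_P, strictly between 0 and 1
    config : "EQUAL", "ARRAYDOM", "DYEDOM" or "TREATDOM"
    v_gt   : gene x treatment interaction variance
    """
    rest = v_gt / ratio - v_gt                   # V_P - V_GT
    names = ["A", "D", "T", "GA", "GD", "e"]
    dominant = {"EQUAL": [], "ARRAYDOM": ["A", "GA"],
                "DYEDOM": ["D", "GD"], "TREATDOM": ["T"]}[config]
    if not dominant:
        comps = {k: rest / len(names) for k in names}
    else:
        others = [k for k in names if k not in dominant]
        # Assumption: dominant effects share 90% of the remainder equally.
        comps = {k: 0.9 * rest / len(dominant) for k in dominant}
        comps.update({k: 0.1 * rest / len(others) for k in others})
    comps["GT"] = v_gt
    return comps

if __name__ == "__main__":
    print(variance_components(0.5, "TREATDOM"))
```

For example, with a ratio of 0.5 the total variance is 100, so under TREATDOM the treatment variance is 45 and each of the five remaining components is 1.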

Efficiency of identifying differentially expressed genes

We compared the proposed method with the conventional two-sample t-test method (Dudoit et al. 2000). For the t-test method, simulations were performed with and without an x-fold filter. In the former case, we first excluded those genes with maximum x-fold changes of less than two among different treatments and then performed the t-test method on the remaining dataset. In the latter case, we performed the t-test method directly on the whole dataset. Power, false discovery rate and false number were used to evaluate the efficiency of these methods for identifying DEGs. Power refers to the probability of declaring statistical significance when a true DEG exists. False discovery rate is the proportion of genes declared to be differentially expressed that are not differentially expressed in reality. False number is the total number of false positives (genes declared to be differentially expressed which in reality are not) and false negatives (genes truly differentially expressed but not declared as such). The global significance level was set at 0.05 and multiple testing was adjusted by Bonferroni’s correction in both the mixed-model and the t-test methods.
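The three criteria, together with the Bonferroni threshold, can be computed for a single simulated dataset as sketched below. The function names are hypothetical and the convention of reporting a false discovery rate of zero when no gene is declared is our assumption. Averaging over repeated simulations would give the summary values.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni's correction."""
    return alpha / n_tests

def evaluate(declared, true_degs):
    """Power, false discovery rate and false number for one dataset.

    declared  : identifiers of genes declared differentially expressed
    true_degs : identifiers of genes that truly are differentially expressed
    """
    declared, true_degs = set(declared), set(true_degs)
    true_pos = len(declared & true_degs)
    false_pos = len(declared - true_degs)
    false_neg = len(true_degs - declared)
    power = true_pos / len(true_degs) if true_degs else float("nan")
    fdr = false_pos / len(declared) if declared else 0.0   # assumed convention
    return {"power": power, "fdr": fdr, "false_number": false_pos + false_neg}

if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 4000))
    print(evaluate(declared=[1, 2, 3, 9], true_degs=[1, 2, 3, 4, 5]))
```

In the usage example, three of five true DEGs are found (power 0.6), one of four declared genes is false (false discovery rate 0.25), and the false number is one false positive plus two false negatives.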

Efficiency of predicting random effects and estimating fixed effects

We then evaluated the efficiency of predicting random effects and estimating fixed effects with our models, using the proportion of bias, \((\bar {\hat {\theta}} - \theta )/\left| \theta \right|,\) where θ is the true effect value and \(\bar {\hat {\theta}}\) is the mean of the predicted random effect or estimated fixed effect.
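As a small worked illustration of this criterion, the function below computes the proportion of bias from repeated estimates of one effect. The name `bias_proportion` is hypothetical and NumPy is assumed.

```python
import numpy as np

def bias_proportion(estimates, theta):
    """(mean of estimates - true value) / |true value|; theta must be nonzero."""
    return (np.mean(estimates) - theta) / abs(theta)
```

Dividing by the absolute value keeps the sign of the bias interpretable for negative effects: a negative result always means the effect was, on average, estimated below its true value.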

Results

Monte Carlo simulations were run 200 times for each case and the mean results of the 200 simulations are presented below.

Identifying DEGs

We first evaluated the performance of the mixed-model approach and t-test methods under different source variations resulting from microarray experiments. Powers and false discovery rates are summarized in Fig. 1 and false numbers are summarized in Fig. 2. There is a general tendency: the larger the share of gene differential expression accounted for by GT interactions, the higher the power, the lower the false discovery rate and the fewer the false numbers achieved by each of these methods. Their efficiencies in identifying DEGs are apparently dependent on various source variations in the microarray experiments. In addition, the t-test method with the filtration procedure worked a little better than that without the filtration procedure in most cases, but the difference was quite small. For a simpler and clearer presentation of the results, in the following comparisons “the t-test method” refers to both variants, that is, the t-test method with and without the filtration procedure.

Fig. 1 False discovery rates (FDR) and powers of identifying DEGs using the mixed-model approach (circles) and the t-test method with (squares) and without (triangles) the filtration procedure. Dotted lines are false discovery rates and solid lines are powers

Fig. 2 False numbers (FN) when identifying DEGs by the mixed-model approach (circles) and the t-test method with (squares) and without (triangles) the filtration procedure

When the variances of A, D, T, GA, GD and ɛ are of similar magnitude (EQUAL), our method achieved consistently higher powers and lower false discovery rates than the t-test method. When the A and GA effects dominated in the remainder of the phenotypic variance (ARRAYDOM), our method produced dramatically higher powers than the t-test method, with lower or similar false discovery rates. When the D and GD effects dominated in the remainder of phenotypic variance (DYEDOM), our method still gave dramatically higher powers than the t-test method. The false discovery rates of our method were slightly higher than those of the t-test method when \( V_{GT}/V_P \) exceeded 0.3. When the T effects accounted for a majority of the remainder of the phenotypic variance (TREATDOM), the t-test method showed a higher power than our method but at the cost of much higher false discovery rates. In all of the four cases studied, our method always produced fewer false numbers than the t-test method. In particular, in the case of TREATDOM, about 2,500–3,000 genes of the total 4,000 genes were false positives or false negatives by the t-test method, while only 4–40 genes were false positives or false negatives by our method. These results indicate that, in most cases, our approach has a higher efficiency of identifying DEGs, while the odds of falsely declaring DEGs are lower.

We then classified differential expression into three categories with regard to the individual GT variance of a specific gene: genes with a large GT variance, genes with a medium GT variance and genes with a small GT variance. Powers of the mixed-model and the t-test methods for identifying each of the three groups of genes are shown in Table 2. All methods showed higher powers of identifying DEGs having a large GT variation. Specifically, those genes with GT variation >3% of the total GT variation of all genes were more frequently declared to be differentially expressed in our simulated experiments. When \( V_{GT}/V_P = 0.8 \), the powers for identifying DEGs with a large GT variation were similar among these methods. The differences in statistical powers between these methods were due to their ability to identify genes with medium or small GT variation. When \( V_{GT}/V_P = 0.4 \), none of the methods could efficiently identify the DEGs with a medium or small GT variation, but there were differences in statistical powers for identifying genes with large GT variation. In the simulated experiments, our method generally had high efficiency in identifying genes with medium to large GT variation in most cases when \( V_{GT}/V_P > 0.6 \).

Table 2 Effects of individual GT variance on powers for identifying DEGs. MM Mixed-model approach, t-testF t-test method with filtration procedure, t-test t-test method without filtration procedure

Predicting random effects and estimating fixed effects

Table 3 shows the proportion of bias for GT effects predicted by the AUP method and for gene effects estimated by the OLSE method, respectively. For GT effects with large absolute sizes, the biases of their predictors were reasonably small (ca. 5%). However, for GT effects with small absolute sizes, the biases of their predictors were considerably larger. Similar results were also observed in the estimation of gene effects. These results suggest that our method can well predict GT effects with large absolute values, while prediction of GT effects with small absolute values should be treated with caution. This is also true for the estimation of gene effects.

Table 3 Bias proportion of GT effects predicted by AUP and gene effects estimated by OLSE. GT effects and gene effects are divided into large, medium and small, according to their true absolute size

Real example

We applied our method to analyze the publicly available datasets from the study of Gutiérrez et al. (2002), who examined mRNA degradation in intact Arabidopsis thaliana by cDNA microarrays containing 11,521 clones. In their study, three independent cordycepin treatments (biological replicas) were analyzed. Each pair of samples from 0 min and 120 min after cordycepin treatment was used in two microarray hybridizations, the second with reverse labeling relative to the first (technical replicas). Statistical analyses of the ratios were performed using the t-test. The data are available online at the Stanford microarray database (http://genome-www5.stanford.edu/; ExptID: 11374, 11333, 11339, 11323, 11375, 11342).

When the t-test was used with the conservative Bonferroni method to adjust P values, 100 genes with unstable transcripts showed ratios significantly different from the mean of the population at α<0.0001 (see Gutiérrez et al. 2002, supporting table 2). For a comparison of the results, the significance level of α=0.0001 was also adopted for single tests using the mixed-model approach. We found 90 genes with significant mRNA degradation from 0 min to 120 min, including 51 genes identified by both methods and 39 genes identified only by the mixed-model approach (Table 4).

Table 4 A. thaliana genes with unstable transcripts identified by the mixed-model approach. Expressed sequence tags (Locus) were identified as differentially expressed genes by both the mixed-model approach and the t-test method

Gutiérrez et al. (2002, Table 1) listed some Arabidopsis genes with unstable messages, including the DNA-binding protein RAV1 gene at locus At1g13260 and the homeodomain transcription factor (ATHB-6) gene at locus At2g22430. AA395830 and N37328 are two expressed sequence tags (ESTs) from the gene at locus At1g13260; and H77088 and T04337 are two ESTs from the gene at locus At2g22430. They were all identified as unstable transcripts by our method, while only N37328 and T04337 were found by the t-test. AA720100, AA720105 and T76004 are all from the nucleotide sugar epimerase gene at locus At4g30440; and T20600, N65459 and T75944 are all from cytochrome P450 monooxygenase gene at locus At4g31500. The t-test only found that AA720100 and T20600 were unstable, whereas AA720105, T76004, N65459 and T75944 were identified as unstable genes by our method. T20543, AA720239 and AA720240 are three ESTs from the gene at locus At5g64260 which were identified as unstable genes by our method but not by the t-test. AA067525 and AA067498 are both from the calmodulin-related protein 2 gene at locus At5g37770, AA597715 and H36178 are both from the ethylene responsive element binding factor-like gene at locus At5g61590 and both AA597849 and T46143 are from the gene at locus At1g72450. Both of the methods identified one transcript from each of the three genes, respectively. However, the t-test did not find multiple transcripts from the same gene that were not found by the mixed-model approach. These EST identifications were searched in the A. thaliana annotation database and the A. thaliana gene index at the Institute for Genomic Research (http://www.tigr.org). Finding several unstable transcripts from the same gene is to be expected since the probes, coding for the same gene, should display very similar expression profiles (Liu et al. 2003). From this aspect, the mixed-model approach can identify more reasonable unstable transcripts.

In addition, polyA may play an important role in the translation of mRNA by increasing the stability of mRNA and allowing mRNA to function normally. Half-lives for histone mRNA that lacks a polyA tail were considerably lower than 30 min (Greenberg 1972). Two histone-related ESTs (H76940, AA720291) that were not identified as unstable genes by the t-test were found by our approach.

Discussion

Genome-wide identification of DEGs using conventional molecular techniques (e.g., Northern blot analysis) is expensive and time-consuming. Microarray technology represents one of the latest breakthroughs in experimental molecular biology, allowing the monitoring of gene expression for tens of thousands of genes in parallel. It is already producing huge amounts of valuable data (Brazma and Vilo 2000). Many standard statistical methods have been used to mine such data. In the present study, we propose a method for microarray data analysis based on a mixed-model approach. Compared with the conventional t-test approach, our method tends to have a higher efficiency in identifying DEGs, while the odds of falsely declaring genes with differential expression are lower. Furthermore, some quantities of interest can be obtained by the AUP method for random effects or by the OLSE method for fixed effects. The method developed here has been implemented in the Windows-interface software QGA Station that is available at http://www.cab.zju.edu.cn/english/ics/faculty/zhujun.html.

Our method is an extension of recent groundwork by Kerr et al. (2000) and Wolfinger et al. (2001). The rationale underlying these methods is that total gene expression is partitioned into various source variations due to different factors, attempting to minimize and/or eliminate inherent “noise” in microarray experiments. However, the mixed linear models employed in our method are of a different form from those in previous studies. We implemented our method in two interconnected steps using a concise algorithm, MINQUE, with no requirement for assuming a normal distribution in the microarray data. In the first step, we choose a subset of potential DEGs, using the single-gene model. This procedure is similar to an x-fold variation filter. However, the x-fold variation filter is usually based on total gene-expression variations, while our procedure uses total treatment effects, which may increase the filter efficiency. In the second step, multiple gene-expression profiles are analyzed simultaneously and some interesting effects are estimated, using the multi-gene model. In our study, Bonferroni’s method was used to set the cutoff for a significant P-value in both the single-gene and multi-gene models. The significance level, α, can be a little larger in the former than in the latter, which may reduce the risk of losing some interesting DEGs during the filtration procedure and thus increase statistical power. Other criteria such as Benjamini and Hochberg’s procedure can also be used to adjust the P-value to control false discovery rates in these two models (Benjamini and Hochberg 1995). Our method can also handle designs with more than two dyes, which can decrease the experimental costs (Forster et al. 2004). Another advantage of our method is its ability to handle missing data, a common problem in microarray experiments.
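For readers who wish to try the alternative adjustment mentioned above, the Benjamini and Hochberg step-up procedure can be sketched as follows. This is a generic illustration, not part of the analyses reported here; the function name `benjamini_hochberg` is hypothetical and NumPy is assumed.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q.

    Step-up rule: find the largest rank k (1-based, p-values sorted in
    ascending order) with p_(k) <= q * k / m, then reject ranks 1..k.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])     # largest 0-based rank meeting the bound
        reject[order[:k + 1]] = True
    return reject
```

Because the rule is step-up, a P-value that misses its own bound is still rejected when a larger P-value meets its bound, which makes the procedure less conservative than Bonferroni’s correction.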

Replications of spot measurements either within or between arrays are essential in our method. Our method can be applied to the reference design and loop design and their modifications with replications. Replication is an important aspect of a good microarray design. There are basically two types of replication: (1) biological replication, in which RNA samples from independent sources are used and (2) technical replication, in which the same RNA sample is applied to different arrays. Whether biological or technical replication or both are used in microarray experiments depends on the relative magnitude of the biological and technical variability in the sample. Repeated spots on the same array are a kind of replication but apply the same RNA samples within the same array. This can reduce array effects due to the quality of robot-fabricated immobilized cDNA probes within the same array. Lee et al. (2000) recommended that at least three replicates be used in designing experiments using cDNA microarrays. In our simulated experiments with three replicates, although our method performed reasonably better than the t-test method, only those DEGs with large GT variation were consistently identified in most cases. Therefore, the number of genes identified in most microarray experiments likely represents an underestimate of DEGs when using a conservative significance level. If budget and sample material permit, six to eight replicates are likely the best (Pan 2002).

Various clustering methods are commonly used in microarray data analysis (Eisen et al. 1998; Spellman et al. 1998; Golub et al. 1999; Tamayo et al. 1999; Hastie et al. 2000; Pan 2002). In these methods, expression levels or ratios that contain within-experiment sampling errors are usually analyzed directly, which may introduce noise and even bias in identifying groups of genes and thus result in false interpretation of gene-expression patterns. Our method is complementary to the current clustering methods. In our method, effects of interest (such as the gene × treatment interactions here) can be predicted and/or estimated. Investigators can use these genetic effects in clustering to make sure the inputs are biologically meaningful. In our previous study, we also proposed a dissimilarity coefficient for clustering populations, using mixed linear models (such as the models proposed in our microarray study). The dissimilarity coefficient has two parameters, one for the squared difference of marginal means and one for the variance component of the interaction, and has appropriate statistical properties (Zhu and Weir 1994b). Incorporating such techniques into our method specifically for microarray data is straightforward and awaits further investigation.

In our simulations, we investigated the impact of various sources of variation on the efficiency of identifying genes expressed differentially among treatments. We found that the same method resulted in dramatically different efficiencies (power, false discovery rate) under different configurations of the remaining sources of variation, given that the proportion of total gene-expression variation accounted for by GT interactions was fixed. For example, when V_GT/V_P = 0.6, the t-test method had 40% power in identifying DEGs when the dye effect and gene-specific dye effect accounted for a majority of the remaining variation, while this method had less than 10% power when the array effect and spot effect dominated the remaining variation (Fig. 1). A similar trend was observed for our method. This suggests that the efficiency of detecting DEGs is affected more by the systematic variation arising from arrays than by that arising from dyes. If the experiment is carried out in several batches within each array, batch effects within arrays may be modeled to reduce the systematic errors. Modeling such effects or other appropriate effects in the single- and multi-gene models is straightforward in our method. Our studies have an important implication for the experimental design and execution of microarray studies. A desirable microarray experimental design should keep experiment-wise systematic errors as low as possible and, at the same cost, selectively reduce the systematic errors from specific factors (such as the arrays here) that most affect the efficiency of detecting DEGs.

Treatments, genes, dyes, arrays and their interactions are well known as sources of effects contributing to variation in microarray data (Kerr et al. 2000; Churchill 2002). However, simulations of microarray data have not gained wide acceptance because, in the real world, these sources of variation may be considerably more complex. This also complicates the theoretical justification of different statistical methods. In our study, in addition to simulated data, we compared the mixed-model approach with the t-test empirically, using a real dataset for identifying unstable transcripts (Gutiérrez et al. 2002). The results showed that our method can identify more unstable transcripts than the t-test. We suggest that researchers check the distribution of their data and pre-analyze the various sources of variation in their experiments. Our method can be a competitive candidate approach for datasets that depart from normality and have moderate experimental errors.