A two-step strategy for detecting differential gene expression in cDNA microarray data

Lu, Yan; Zhu, Jun; Liu, Pengyuan

doi:10.1007/s00294-004-0551-3

Research Article
Published: 10 December 2004

Volume 47, pages 121–131 (2005)

Current Genetics

Yan Lu¹,
Jun Zhu¹ &
Pengyuan Liu¹

Abstract

A mixed-model approach is proposed for identifying differential gene expression in cDNA microarray experiments. This approach is implemented by two interconnected steps. In the first step, we choose a subset of genes that are potentially expressed differentially among treatments with a loose criterion. In the second step, these potential genes are used for further analyses and data-mining with a stringent criterion, in which differentially expressed genes (DEGs) are confirmed and some quantities of interest (such as gene × treatment interaction) are estimated. By simulating datasets with DEGs, we compare our statistical method with a widely used method, the t-statistic, for single genes. Simulation results show that our approach produces a high power and a low false discovery rate for DEG identification. We also investigate the impacts of various source variations resulting from microarray experiments on the efficiency of DEG identification. Analysis of a published experiment studying unstable transcripts in Arabidopsis illustrates the utility of our method. Our method identifies more novel and biologically interesting unstable transcripts than those reported in the original literature.

Introduction

Recent developments in microarray technology make it possible to rapidly capture all of the gene expression profiles in biological samples (Ross et al. 2000; Welsh et al. 2001; Bouton and Pevsner 2002; Guffanti et al. 2002). This technology results in large amounts of data, the interpretation of which is a major bottleneck in current studies. A natural step in extracting microarray data information is to examine the extremes, for example, genes with significant differential expression in two samples (case vs control) or in a time-series (such as cell cycles).

Microarray data are characterized by high dimensionality (thousands of genes) and small sample size (often <30). Systematic and stochastic fluctuations are usually involved in microarray experiments (Schuchhardt et al. 2000). Therefore, the raw dye intensity or ratio value has a high noise to signal ratio between probes. The x-fold change approach may induce high false positives and/or false negatives when used as a simple criterion to determine the genes differentially expressed between query and reference samples. Some biologically important genes with small x-fold changes are highly statistically significant when they are measured repetitively with high precision. Conversely, many genes with large x-fold changes in one array and high variability across multiple arrays have no statistical significance (Wolfinger et al. 2001). Various statistical methods have been proposed for identifying differentially expressed genes (DEGs; Chen et al. 1997, 2002; Ideker et al. 2000; Kerr et al. 2000; Newton et al. 2001; Thomas et al. 2001; Wolfinger et al. 2001; Efron et al. 2001; Churchill 2002; Ibrahim et al. 2002; West 2003; Smyth 2004), but none has yet gained widespread acceptance for the analysis of microarray data. The most basic statistical problem is that the measured differential expression cannot completely reflect a real biological shift in gene expression (Newton et al. 2001).

Discrimination and cluster analysis techniques have been very useful for searching patterns of gene expression that are highly correlated (Eisen et al. 1998; Spellman et al. 1998; Golub et al. 1999; Tamayo et al. 1999; Hastie et al. 2000). These methods use various clustering algorithms, such as self-organizing maps, k-means clustering and hierarchical clustering, to discriminate and characterize patterns of gene expression. However, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Furthermore, discrimination and cluster analyses operate directly on gene expression levels or relative ratios that still contain within-experiment sampling errors; thus the distance between data-points cannot reflect the true differential expression between genes.

Mixed-model approaches are widely used to partition various sources of variability. They have the flexibility to handle unbalanced data and can be easily extended to more complicated biological models which have been proven as powerful statistical tools in classic quantitative genetic analyses (Searle et al. 1992). The objectives of this paper are: (1) to propose a mixed-model approach to analyzing variance components for cDNA microarray data analysis, applying the method to selecting a target subset of DEGs that are of biological interest and (2) to assess the effectiveness of this method by extensive computer simulations, specifically compared with the widely used approach based on t-statistics for single genes (Dudoit et al. 2000). Analyzing data publicly available for the study of unstable transcripts in Arabidopsis demonstrates the utility of our method.

Materials and methods

Each datum in a microarray experiment is associated with one particular combination of an array in the experiment: a fluorescence dye (red or green), a treatment and a gene. In our analysis, we used the logarithms of the original fluorescence measurements as phenotypic values, not the log ratio values, as used by some previous studies (Kerr et al. 2000; Wolfinger et al. 2001).

To alleviate the computation burden, we propose a two-step strategy for analyzing microarray data. In the first step, we choose a subset of genes that are potentially expressed differentially among treatments with a loose criterion. In the second step, these potential genes are combined for further analyses and data-mining with a stringent criterion, in which DEGs are confirmed and some quantities of interest (such as gene × treatment interaction) are estimated. Both types of the aforementioned analyses are performed using a mixed-model approach for a variance–component framework.

Choosing a subset of potential genes with differential expression

We first normalized the original fluorescence data before choosing a subset of genes. The purpose of normalization is to minimize systematic experimental biases so that the observed variation arises from biological differences. Let y_ijkl denote the logarithm of a measurement from the ith array, the jth treatment, the kth dye and the lth gene in a cDNA microarray experiment. The original fluorescence data are normalized as: $ r_{ijkl} = y_{ijkl} - \left( {\bar y_{i \cdot \cdot \cdot } + \bar y_{ \cdot j \cdot \cdot } + \bar y_{ \cdot \cdot k \cdot } - 2\bar y_{ \cdot \cdot \cdot \cdot } } \right), $ where a dot denotes averaging over the corresponding subscript.
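The normalization formula can be illustrated with a short NumPy sketch. It assumes the data are stored in long format, one log-intensity per observation with parallel label vectors for array, treatment and dye; the function name `normalize` and this layout are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def normalize(y, array, treatment, dye):
    """Return r = y - (array mean + treatment mean + dye mean - 2 * grand mean)."""
    y = np.asarray(y, dtype=float)

    def level_means(labels):
        # Mean of y within each level, mapped back to one value per observation.
        _, inverse = np.unique(np.asarray(labels), return_inverse=True)
        sums = np.bincount(inverse, weights=y)
        counts = np.bincount(inverse)
        return (sums / counts)[inverse]

    grand_mean = y.mean()
    systematic = (level_means(array) + level_means(treatment)
                  + level_means(dye) - 2.0 * grand_mean)
    return y - systematic
```

By construction the normalized values average to zero over the whole dataset, because each of the three marginal means averages back to the grand mean.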

The normalized data, r_ijkl, can be viewed as a variation for each gene after removing systematic experimental errors and are the input data for the following single-gene model:

$$ r_{ijkl} = \mu _l + A_{il} + T_{jl} + D_{kl} + \gamma _{ijkl} $$

(1)

Here, μ_l represents the overall average expression level of gene l (a fixed effect); A_il is the ith array effect of gene l (a random effect): $ A_{il} \sim \left( {0,\;\sigma _{A(l)}^2 } \right); $ T_jl is the jth treatment effect of gene l (a random effect): $ T_{jl} \sim \left( {0,\;\sigma _{T(l)}^2 } \right); $ D_kl is the kth dye effect of gene l (a random effect): $ D_{kl} \sim \left( {0,\;\sigma _{D(l)}^2 } \right); $ and γ_ijkl is the residual error of gene l: $ \gamma _{ijkl} \sim \left( {0,\;\sigma _{\gamma (l)}^2 } \right). $ The array effects account for differences among arrays. Differences among arrays may arise from differences in print quality or from differences in the ambient conditions when the plates were processed, which may increase or reduce the hybridization efficiencies of labeled cDNA. The treatment effects account for differences among treatments. Such differences can arise when some treatments (e.g., a specific cell line) have more transcription activity in general than others. The dye effects account for fluorescent signal differences. One dye may show consistently higher signal intensity than another. The single-gene model is fitted separately to the normalized data from each gene, allowing an elementary inference to be made, using a separate estimate of variability. The methods described here are for the preliminary selection of a subset of genes with differential expression. This procedure is similar to a variation filter that is commonly used to exclude genes with less than a certain x-fold variation among the collected samples (Golub et al. 1999). However, the x-fold variation filter is usually based on total gene expression variations. Instead, our procedure focuses on total treatment effects, which may increase the filter efficiency.

Combining analysis of multiple genes

A subset of genes potentially expressed differentially between one or more pairs of samples in the dataset can be used for further analysis as follows:

$$ y_{ijkl} = \mu + G_l + A_i + T_j + D_k + GA_{li} + GT_{lj} + GD_{lk} + \varepsilon _{ijkl} $$

(2)

where μ is the average of overall expression levels (a fixed effect), G_l is the fixed effect of the lth gene, $ A_i \sim\left( {0,\;\sigma _A^2 } \right) $ is the random effect of the ith array, $ T_j \sim\left( {0,\;\sigma _T^2 } \right) $ is the random effect of the jth treatment and $ D_k \sim \left( {0,\;\sigma _D^2 } \right) $ is the random effect of the kth dye. $ GA_{li} \sim\left( {0,\sigma _{GA}^2 } \right) $ is the interaction between the lth gene and the ith array, $ GT_{lj} \sim\left( {0,\sigma _{GT}^2 } \right) $ is the interaction between the lth gene and the jth treatment and $ GD_{lk} \sim \left( {0,\sigma _{GD}^2 } \right) $ is the interaction between gene l and dye k. The random error term ɛ_ijkl is the residual effect: $ \varepsilon _{ijkl} \sim \left( {0,\;\sigma _\varepsilon ^2 } \right). $ Interpretations of A_i, T_j and D_k are similar to those in Eq. 1. The gene effects, G_l, account for differences in transcription level among the genes. Some genes may be inherently more active in mRNA transcription than others. The gene × array interactions, GA_li, account for the average effect of the spot on the ith array for the lth gene. It is a “spot” effect due to the potential incomplete control over the amount and concentration of cDNA immobilized from one array to the next. The gene × dye interactions, GD_lk, are gene-specific dye effects and account for the average effect of the kth fluorescence dye for the lth gene. This may contribute to the differential hybridization efficiencies of two chemically different fluorescence dyes for the same probe. The gene × treatment interactions, GT_lj, are of interest in microarray experiments. These effects capture the departure from the overall averages that are attributable to the specific combination of the jth treatment and the lth gene.

Similar interpretations of the aforementioned factors were also detailed by Kerr et al. (2000). Whether a specific factor is regarded as fixed or random depends not only on the levels of source variation but also on the investigator’s particular interest in the study. A fixed effect is one that is repeatable. That is, if other researchers repeat a specific microarray experiment, they are estimating the same effects. A random effect is one that is not repeatable. That is, another researcher will not (probably cannot) estimate the same effects, but can estimate the variance of the effects from another sample. In our study, we treated gene effects as fixed, while others were treated as random. For example, the print quality of the arrays and the ambient conditions under which the arrays were probed varied from one microarray experiment to another. Such array effects may not be repeatable among different microarray experiments and thus are treated as random effects. The basic mRNA transcription level for a specific gene may remain inherently similar among different microarray experiments when there are no interference factors such as those from arrays and treatments. Such a basic transcription level is estimable with suitable experimental designs. Therefore, the gene effects are treated as fixed effects in our model.

Statistical assessment of gene significance

Both types of the above models can be analyzed by a mixed-model approach. The single-gene model (Eq. 1) can be rewritten in the following matrix form:

$$ \begin{gathered} {\mathbf{r}}_{(l)} = {\mathbf{1}}\mu _{(l)} + {\mathbf{U}}_{A(l)} {\mathbf{e}}_{A(l)} + {\mathbf{U}}_{T(l)} {\mathbf{e}}_{T(l)} + {\mathbf{U}}_{D(l)} {\mathbf{e}}_{D(l)} + {\mathbf{e}}_{\varepsilon (l)} \\ = {\mathbf{1}}\mu _{(l)} + \sum\limits_{u = 1}^4 {{\mathbf{U}}_{u(l)} {\mathbf{e}}_{u(l)} \sim N\left( {{\mathbf{\mu }}_{(l)} ,{\mathbf{V}}_{(l)} } \right)} \\ \end{gathered} $$

(3)

with this variance–covariance matrix:

$$ {\text{Var}}\left( {{\mathbf{r}}_{(l)} } \right) = {\mathbf{V}}_{(l)} = \sigma _{A(l)}^2 {\mathbf{U}}_{A(l)} {\mathbf{U}}_{A(l)}^{\text{T}} + \sigma _{T(l)}^2 {\mathbf{U}}_{T(l)} {\mathbf{U}}_{T(l)}^{\text{T}} + \sigma _{D(l)}^2 {\mathbf{U}}_{D(l)} {\mathbf{U}}_{D(l)}^{\text{T}} + \sigma _{\varepsilon (l)}^2 {\mathbf{I}} $$

where ${\mathbf{\mu }}_{(l)} $ is the population mean over all entries of gene l; $ {\mathbf{e}}_{u(l)} $ is the vector of random effects, $ {\mathbf{e}}_{u(l)} \sim (0,\sigma _{u(l)}^2 {\mathbf{I}}) $; $ {\mathbf{U}}_{u(l)} $ is the known incidence matrix relating to the random vector $ {\mathbf{e}}_{u(l)} $; $ {\mathbf{U}}_{u(l)}^{\text{T}} $ is the transpose of $ {\mathbf{U}}_{u(l)} $; and $ {\mathbf{U}}_{4(l)} = {\mathbf{I}} $ is an identity matrix. Similarly, the multi-gene model (Eq. 2) can also be expressed in matrix form.
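To make the matrix form concrete, the following sketch builds 0/1 incidence matrices from factor labels and assembles the variance–covariance matrix of Eq. 3 for one gene. The helper names (`incidence`, `covariance`) and the numeric variance values in the usage note are hypothetical and chosen only for illustration.

```python
import numpy as np

def incidence(labels):
    """0/1 incidence matrix U with one row per observation, one column per level."""
    labels = np.asarray(labels)
    levels, inverse = np.unique(labels, return_inverse=True)
    U = np.zeros((labels.size, levels.size))
    U[np.arange(labels.size), inverse] = 1.0
    return U

def covariance(U_list, variances, residual_variance):
    """V = sum_u sigma_u^2 * U_u U_u^T + sigma_e^2 * I, as in Eq. 3."""
    n = U_list[0].shape[0]
    V = residual_variance * np.eye(n)
    for U, s2 in zip(U_list, variances):
        V += s2 * (U @ U.T)
    return V
```

For example, with six observations of one gene from three arrays of a loop (arrays `[1, 1, 2, 2, 3, 3]`, treatments `["T1", "T2", "T2", "T3", "T3", "T1"]`, dyes alternating Cy3 and Cy5), two observations share a covariance term for every factor level they have in common, and the diagonal of V is the sum of all variance components.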

Variance components of the aforementioned models can be estimated using maximum likelihood estimation (ML), restricted maximum likelihood estimation (REML), and minimum norm quadratic unbiased estimation (MINQUE; Searle et al. 1992). Among these three methods, MINQUE possesses the advantages of unbiasedness, no assumption of normal distribution and less computation (Zhu and Weir 1994a). The prediction of random effects can be obtained using methods for best linear unbiased prediction (BLUP; Henderson 1963), linear unbiased prediction (LUP; Zhu and Weir 1994a) and adjusted linear unbiased prediction (AUP; Zhu 1993; Zhu and Weir 1996). The fixed effects can be obtained through the ordinary least square estimation (OLSE) method or the generalized least square estimation (GLSE) method. The Jackknife resampling procedure (Miller 1974; Searle et al. 1992) can be used for estimating the sampling variance of estimated variance components, predicted random effects and estimated fixed effects; and a t-test is then used for the significance test.

Microarray data are characterized by high dimensionality and small sample size, which may not warrant normal distribution of the data and usually requires intensive computation for ML or REML estimators. For this reason, MINQUE(1), an unbiased MINQUE method with all the prior values set at one (Zhu and Weir 1996), was used to estimate the variance components and the Jackknife resampling procedure was used for significance tests in our method. The AUP and OLSE methods were used for predicting random effects and estimating fixed effects, respectively.
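A minimal sketch of MINQUE(1) follows, based on the textbook MINQUE equations (Searle et al. 1992) rather than on the authors' software. With all prior weights set to one, V = Σ_u U_u U_uᵀ + I and Q = V⁻¹ − V⁻¹X(XᵀV⁻¹X)⁻XᵀV⁻¹, and the variance components solve the linear system [tr(Q V_u Q V_v)] σ² = [yᵀ Q V_u Q y]. The function name `minque1` and the dense-matrix implementation are assumptions made for illustration; a dense inverse like this is only practical for small problems such as a single gene.

```python
import numpy as np

def minque1(y, X, U_list):
    """MINQUE with all prior weights equal to one.

    y      : (n,) observations
    X      : (n, p) fixed-effect design matrix
    U_list : incidence matrices of the random factors (residual excluded)
    Returns the estimated variance components, residual variance last.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    # One matrix U_u U_u^T per random factor, plus the identity for the residual.
    Vs = [U @ U.T for U in U_list] + [np.eye(n)]
    V = sum(Vs)  # prior values all set to one
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    # Q removes the fixed effects: Q X = 0.
    Q = Vinv - XtVinv.T @ np.linalg.pinv(XtVinv @ X) @ XtVinv
    QV = [Q @ Vu for Vu in Vs]
    # Left-hand side: tr(Q V_u Q V_v); right-hand side: y^T Q V_u Q y.
    S = np.array([[np.trace(a @ b) for b in QV] for a in QV])
    Qy = Q @ y
    q = np.array([Qy @ Vu @ Qy for Vu in Vs])
    return np.linalg.solve(S, q)
```

For a balanced one-way layout this estimator reproduces the classical ANOVA estimates, which gives a convenient sanity check. Estimates can be negative in small samples, since MINQUE does not constrain them to be non-negative.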

In the single-gene model, a series of hypotheses can be made about the variance of treatment: H₀:σ²_T(l)=0 vs H₁:σ²_T(l)≠0. If H₀ is rejected for gene l, the observations of this gene are retained for further analysis in the multi-gene model. In the subsequent multi-gene model, a t-test following the Jackknife resampling procedure is applied to test the null hypothesis that a specific gene is not differentially expressed, that is, that its gene × treatment interaction effects (i.e., e_GT) are not significantly different from zero. If at least one of the e_GT of gene l differs significantly from zero, gene l is considered a DEG. This resample-based t-test in the multi-gene model can capture the departure from the overall average that is attributable to the specific combination of the jth treatment and the lth gene.
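The Jackknife-based t-test can be sketched generically as a delete-one-group jackknife. The text does not state which units are deleted in turn, so the `groups` argument below (for example, replicate or array labels) is an assumption, as are the function name and the use of g − 1 degrees of freedom for g groups.

```python
import numpy as np

def jackknife_t(estimator, data, groups):
    """Delete-one-group jackknife for a scalar estimator.

    estimator : function mapping a data array to a scalar estimate
    data      : (n, ...) array of observations
    groups    : (n,) group labels; one group is left out at a time
    Returns (jackknife estimate, standard error, t statistic, degrees of freedom).
    """
    data = np.asarray(data)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    g = labels.size
    full = estimator(data)
    # Pseudo-values: g * full estimate - (g - 1) * leave-one-group-out estimate.
    pseudo = np.array([g * full - (g - 1) * estimator(data[groups != lab])
                       for lab in labels])
    estimate = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(g)
    return estimate, se, estimate / se, g - 1
```

When the estimator is the sample mean and every observation is its own group, the pseudo-values equal the observations and the result reduces to the ordinary one-sample t statistic. In the setting of the paper, the estimator would be a variance component or a predicted GT effect.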

Simulation design

A series of simulations for cDNA microarray experiments was conducted to evaluate the performance of the proposed approach. The loop design was adopted in our simulated experiments. The loop design involves constructing a cyclic sequence of n treatments on n arrays, with each treatment represented twice, each time labeled with a different fluorescence dye (Kerr and Churchill 2001). In all the simulations conducted, there were 4,000 genes and six treatments. The six treatments were divided into two groups of three each. Treatments T₁–T₃ were in one group and treatments T₄–T₆ were in another. For the first group, in the first array treatment T₁ was marked with Cy3 dye and treatment T₂ was marked with Cy5 dye, in the second array treatment T₂ was marked with Cy3 dye and treatment T₃ was marked with Cy5 dye and in the third array treatment T₃ was marked with Cy3 dye and treatment T₁ was marked with Cy5 dye. Note that in spotted cDNA microarrays the two treatments under comparison are labeled with two different dyes and co-hybridized to the same array. The design was similar for another group with treatments T₄–T₆ (Table 1). Each of them was replicated three times, giving 18 arrays in total.
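The layout described above can be enumerated programmatically. The sketch below lists one row per array channel; the function name `loop_design` and the tuple layout (array id, replicate, treatment, dye) are illustrative choices, and the order in which arrays are numbered across replicates is an assumption.

```python
def loop_design(groups=(("T1", "T2", "T3"), ("T4", "T5", "T6")), replicates=3):
    """Return (array_id, replicate, treatment, dye) rows for replicated loop designs."""
    rows = []
    array_id = 0
    for rep in range(1, replicates + 1):
        for group in groups:
            n = len(group)
            for pos in range(n):
                array_id += 1
                # Each array pairs a treatment (Cy3) with the next one in the loop (Cy5).
                rows.append((array_id, rep, group[pos], "Cy3"))
                rows.append((array_id, rep, group[(pos + 1) % n], "Cy5"))
    return rows
```

With the defaults this yields 18 arrays and 36 channel measurements per gene, and every treatment is labeled three times with each dye.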

Table 1 Experimental design of simulations

Generating gene-expression data

To generate each dataset, we preset different magnitudes of source variations (i.e., variance components) in the simulated microarray experiments. The gene × treatment interaction variance was set as 50 and the ratio of the gene × treatment interaction variance (V_GT) to the total phenotypic variance (V_P), that is, V_GT/V_P, varied from 0.1 to 0.9 in all of the simulations. Four configurations of the remaining variance components (V_A, V_D, V_T, V_GA, V_GD, V_ɛ) were simulated for the remainder of the phenotypic variation (i.e., V_P–V_GT): (1) the effects of A, D, T, GA, GD and ɛ contribute equally to the remainder of phenotypic variation, that is, V_A:V_D:V_T:V_GA:V_GD:V_ɛ=1:1:1:1:1:1 (denoting EQUAL), (2) the A and GA effects dominate in the remainder of phenotypic variation, that is, (V_A+ V_GA)/(V_P–V_GT)=0.9 and V_D:V_T:V_GD:V_ɛ=1:1:1:1 (denoting ARRAYDOM), (3) the D and GD effects dominate in the remainder of phenotypic variance, that is, (V_D+ V_GD)/(V_P–V_GT)=0.9 and V_A:V_T:V_GA:V_ɛ=1:1:1:1 (denoting DYEDOM) and (4) the T effects dominate the remainder of phenotypic variation, that is, V_T/(V_P–V_GT)=0.9 and V_A:V_D:V_GA:V_GD:V_ɛ=1:1:1:1:1 (denoting TREATDOM). Note that the efficiency of identifying DEGs is dependent on the relative proportions among different source variations rather than on the absolute magnitude of each of them. We assumed that there were only 40 DEGs among a total of 4,000 genes tested in the experiment (representing 1% of total genes), that is, 40 genes had gene × treatment interaction effects. The gene-expression value was obtained by the multi-gene model (Eq. 2) and the random effects in the model were drawn by generating a pseudo-random normal deviate with zero mean and different known variances.
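The four variance configurations can be written down as a small helper that returns the individual variance components for a given V_GT/V_P ratio. The text fixes only the combined share of the dominant effects (for example, V_A + V_GA in ARRAYDOM), so splitting that share equally between them is an assumption of this sketch, as is the function name.

```python
def variance_components(ratio, scheme, v_gt=50.0):
    """Variance components for one simulation setting.

    ratio  : V_GT / V_P, between 0 and 1 (exclusive)
    scheme : "EQUAL", "ARRAYDOM", "DYEDOM" or "TREATDOM"
    """
    v_p = v_gt / ratio
    rest = v_p - v_gt  # remainder of the phenotypic variance
    names = ["A", "D", "T", "GA", "GD", "E"]
    dominant = {"EQUAL": [], "ARRAYDOM": ["A", "GA"],
                "DYEDOM": ["D", "GD"], "TREATDOM": ["T"]}[scheme]
    if not dominant:
        comps = {name: rest / len(names) for name in names}
    else:
        minor = [name for name in names if name not in dominant]
        # Assumption: the 90% share is split equally among the dominant effects.
        comps = {name: 0.9 * rest / len(dominant) for name in dominant}
        comps.update({name: 0.1 * rest / len(minor) for name in minor})
    comps["GT"] = v_gt
    return comps
```

For V_GT/V_P = 0.5 the total phenotypic variance is 100, so under TREATDOM the treatment variance is 45 and each of the five remaining components is 1.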

Efficiency of identifying differentially expressed genes

We compared the proposed method with the conventional two-sample t-test method (Dudoit et al. 2000). For the t-test method, simulations were performed with and without an x-fold filter. In the former case, we first excluded those genes with maximum x-fold changes of less than two among different treatments and then performed the t-test method on the remaining dataset. In the latter case, we performed the t-test method directly on the whole dataset. Power, false discovery rate and false number were used to evaluate the efficiency of these methods for identifying DEGs. Power refers to the probability of declaring a statistical significance when a true DEG exists. False discovery rate is the proportion of genes declared to be differentially expressed that are not differentially expressed in reality. False number is the total number of false positives (genes declared to be differentially expressed which in reality are not) and false negatives (genes truly differentially expressed but not declared as such). The global significance level was set at 0.05; and multiple testing was adjusted by Bonferroni’s correction in both the mixed-model and the t-test methods.
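The three evaluation criteria and the Bonferroni cutoff can be computed as follows for a single simulated dataset. The function names and the convention that the false discovery rate is zero when no gene is declared are assumptions of this sketch; the number of tests to pass to the Bonferroni helper depends on the analysis step and is left to the caller.

```python
def evaluate(declared, truth):
    """Power, false discovery rate and false number for one simulated dataset.

    declared : iterable of gene ids declared differentially expressed
    truth    : iterable of gene ids that are truly differentially expressed
    """
    declared, truth = set(declared), set(truth)
    true_pos = len(declared & truth)
    false_pos = len(declared - truth)
    false_neg = len(truth - declared)
    power = true_pos / len(truth) if truth else float("nan")
    # Assumption: FDR is reported as 0 when nothing is declared.
    fdr = false_pos / len(declared) if declared else 0.0
    return {"power": power, "fdr": fdr, "false_number": false_pos + false_neg}

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni's correction."""
    return alpha / n_tests
```

With a global level of 0.05 and 4,000 tests, the per-test threshold is 0.05/4,000 = 1.25 × 10⁻⁵.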

Efficiency of predicting random effects and estimating fixed effects

We then evaluated the efficiency of predicting random effects and estimating fixed effects with our models, using the proportion of bias, $(\bar {\hat {\theta}} - \theta )/\left| \theta \right|,$ where θ is the true effect value and $\bar {\hat {\theta}}$ is the mean of the predicted random effect or estimated fixed effect.
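As a small illustration, the proportion of bias for one effect can be computed from its predictions across simulation runs. The function name is hypothetical; the formula is the one given above.

```python
import numpy as np

def bias_proportion(estimates, theta):
    """(mean of estimates - true value) / |true value| for a nonzero true effect."""
    return (np.mean(estimates) - theta) / abs(theta)
```

The measure is undefined for a true effect of exactly zero, and it becomes unstable as the true effect approaches zero because the denominator shrinks.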

Results

Monte Carlo simulations were run 200 times for each case and the mean results of the 200 simulations are presented below.

Identifying DEGs

We first evaluated the performance of the mixed-model approach and t-test methods under different source variations resulting from microarray experiments. Powers and false discovery rates are summarized in Fig. 1 and false numbers are summarized in Fig. 2. There is a general tendency: the larger the share of the variation in gene expression accounted for by GT interactions, the higher the power, the lower the false discovery rate and the fewer the false numbers achieved by each of these methods. Their efficiencies in identifying DEGs are apparently dependent on various source variations in the microarray experiments. In addition, the t-test method with the filtration procedure worked a little better than that without the filtration procedure in most cases, but the difference was quite small. For a simpler and clearer presentation of the results, in the following comparisons “the t-test method” refers to both variants, that is, the t-test method with and without the filtration procedure.

When the variances of A, D, T, GA, GD and ɛ are of similar magnitude (EQUAL), our method achieved consistently higher powers and lower false discovery rates than the t-test method. When the A and GA effects dominated in the remainder of the phenotypic variance (ARRAYDOM), our method produced dramatically higher powers and lower or similar false discovery rates than the t-test method. When the D and GD effects dominated in the remainder of phenotypic variance (DYEDOM), our method still gave dramatically higher powers than the t-test method. The false discovery rates of our method were slightly higher than those of the t-test method when V_GT/V_P exceeded 0.3. When the T effects accounted for a majority of the remainder of the phenotypic variance (TREATDOM), the t-test method showed a higher power than our method but at the cost of much higher false discovery rates. In all of the four cases studied, our method always produced fewer false numbers than the t-test method. In particular, in the case of TREATDOM, about 2,500–3,000 genes of the total 4,000 genes were false positives or false negatives by the t-test method, while only 4–40 genes were false positives or false negatives by our method. These results indicate that, in most cases, our approach has a higher efficiency of identifying DEGs, while the odds of falsely declaring DEGs are lower.

We then classified differential expression into three categories with regard to individual GT variance of a specific gene: genes with a large GT variance, genes with a medium GT variance and genes with a small GT variance. Powers of the mixed-model and the t-test methods for identifying each of the three groups of genes are shown in Table 2. All methods showed higher powers of identifying DEGs having a large GT variation. Specifically, those genes with GT variation >3% of the total GT variation of all genes were more frequently declared to be differentially expressed in our simulated experiments. When V_GT/V_P=0.8, the powers for identifying DEGs with a large GT variation were similar in these methods. The differences in statistical powers between these methods were due to their ability to identify genes with medium or small GT variation. When V_GT/V_P=0.4, neither method could efficiently identify the DEGs with a medium or small GT variation, but there were differences in statistical powers for identifying genes with large GT variation. In the simulated experiments, our method generally had high efficiency in identifying genes with medium to large GT variation in most cases when V_GT/V_P>0.6.

Table 2 Effects of individual GT variance on powers for identifying DEGs. MM Mixed-model approach, t-testF t-test method with filtration procedure, t-test t-test method without filtration procedure

Predicting random effects and estimating fixed effects

Table 3 shows the proportion of bias for GT effects predicted by the AUP method and for gene effects estimated by the OLSE method, respectively. For GT effects with large absolute sizes, the biases of their predictors were reasonably small (ca. 5%). However, for GT effects with small absolute sizes, the biases of their predictors were considerably larger. Similar results were also observed in the estimation of gene effects. These results suggest that our method can well predict GT effects with large absolute values, while prediction of GT effects with small absolute values should be treated with caution. This is also true for the estimation of gene effects.

Table 3 Bias proportion of GT effects predicted by AUP and gene effects estimated by OLSE. GT effects and gene effects are divided into large, medium and small, according to their true absolute size

Real example

We applied our method to analyze the publicly available datasets from the study of Gutiérrez et al. (2002), who examined mRNA degradation in intact Arabidopsis thaliana by cDNA microarrays containing 11,521 clones. In their study, three independent cordycepin treatments (biological replicas) were analyzed. Each pair of samples from 0 min and 120 min after cordycepin treatment was used in two microarray hybridizations, the second with reverse labeling relative to the first (technical replicas). Statistical analyses of the ratios were performed using the t-test. The data are available online at the Stanford microarray database (http://genome-www5.stanford.edu/; ExptID: 11374, 11333, 11339, 11323, 11375, 11342).

When using the t-test and the conservative Bonferroni method to adjust P values, 100 genes with unstable transcripts showed significantly different ratios from the mean of the population at α<0.0001 (see Gutiérrez et al. 2002, supporting table 2). For a comparison of the results, the significance level of α=0.0001 was also adopted for single tests using the mixed-model approach. We found 90 genes with significant mRNA degradation from 0 min to 120 min, including 51 genes identified by both methods and 39 genes identified only by the mixed-model approach (Table 4).

Table 4 A. thaliana genes with unstable transcripts identified by the mixed-model approach. Expressed sequence tags (Locus) were identified as differentially expressed genes by both the mixed-model approach and the t-test method

Gutiérrez et al. (2002, Table 1) listed some Arabidopsis genes with unstable messages, including the DNA-binding protein RAV1 gene at locus At1g13260 and the homeodomain transcription factor (ATHB-6) gene at locus At2g22430. AA395830 and N37328 are two expressed sequence tags (ESTs) from the gene at locus At1g13260; and H77088 and T04337 are two ESTs from the gene at locus At2g22430. They were all identified as unstable transcripts by our method, while only N37328 and T04337 were found by the t-test. AA720100, AA720105 and T76004 are all from the nucleotide sugar epimerase gene at locus At4g30440; and T20600, N65459 and T75944 are all from cytochrome P450 monooxygenase gene at locus At4g31500. The t-test only found that AA720100 and T20600 were unstable, whereas AA720105, T76004, N65459 and T75944 were identified as unstable genes by our method. T20543, AA720239 and AA720240 are three ESTs from the gene at locus At5g64260 which were identified as unstable genes by our method but not by the t-test. AA067525 and AA067498 are both from the calmodulin-related protein 2 gene at locus At5g37770, AA597715 and H36178 are both from the ethylene responsive element binding factor-like gene at locus At5g61590 and both AA597849 and T46143 are from the gene at locus At1g72450. Both of the methods identified one transcript from each of the three genes, respectively. However, the t-test did not find multiple transcripts from the same gene that were not found by the mixed-model approach. These EST identifications were searched in the A. thaliana annotation database and the A. thaliana gene index at the Institute for Genomic Research (http://www.tigr.org). Finding several unstable transcripts from the same gene is to be expected since the probes, coding for the same gene, should display very similar expression profiles (Liu et al. 2003). From this aspect, the mixed-model approach can identify more reasonable unstable transcripts.

In addition, polyA may play an important role in the translation of mRNA by increasing the stability of mRNA and allowing mRNA to function normally. Half-lives for histone mRNA that lacks a polyA tail were considerably lower than 30 min (Greenberg 1972). Two histone-related ESTs (H76940, AA720291) that were not identified as unstable genes by the t-test were found by our approach.

Discussion

Genome-wide identification of DEGs using conventional molecular techniques (e.g., Northern blot analysis) is expensive and time-consuming. Microarray technology represents one of the latest breakthroughs in experimental molecular biology which allows the monitoring of gene expression for tens of thousands of genes in parallel. It is already producing huge amounts of valuable data (Brazma and Vilo 2000). Many standard statistical methods have been used to mine such data. In the present study, we propose a method for microarray data analysis based on a mixed-model approach. As compared with the conventional t-test approach, our method tends to have a higher efficiency in identifying DEGs, while the odds of falsely declaring genes with differential expression are lower. Furthermore, some quantities of interest can be obtained by the AUP method for random effects or by the OLSE method for fixed effects. The method developed here has been implemented in the Windows-interface software QGA Station that is available at http://www.cab.zju.edu.cn/english/ics/faculty/zhujun.html.

Our method is an extension of recent groundwork by Kerr et al. (2000) and Wolfinger et al. (2001). The rationale underlying these methods is that total gene expression is partitioned into various source variations due to different factors, attempting to minimize and/or eliminate inherent “noise” in microarray experiments. However, the mixed linear models employed in our method are of a different form from those in previous studies. We implemented our method in two interconnected steps using a concise algorithm, MINQUE, with no requirement for assuming a normal distribution in the microarray data. In the first step, we choose a subset of potential DEGs, using the single-gene model. This procedure is similar to an x-fold variation filter. However, the x-fold variation filter is usually based on total gene-expression variations, while our procedure uses total treatment effects, which may increase the filter efficiency. In the second step, multiple gene-expression profiles are analyzed simultaneously and some interesting effects are estimated, using the multi-gene model. In our study, Bonferroni’s method was used to set the cutoff for a significant P-value in both the single-gene and multi-gene models. The significance level, α, can be a little larger in the former than in the latter, which may reduce the risk of losing some interesting DEGs during the filtration procedure and thus increase statistical power. Other criteria such as Benjamini and Hochberg’s procedure can also be used to adjust the P-value to control false discovery rates in these two models (Benjamini and Hochberg 1995). Our method can also handle designs with more than two dyes, which can decrease the experimental costs (Forster et al. 2004). Another advantage of our method is its ability to handle missing data, a common problem in microarray experiments.

Replication of spot measurements, either within or between arrays, is essential in our method. Our method can be applied to the reference design, the loop design and their modifications with replication. Replication is an important aspect of a good microarray design. There are basically two types of replication: (1) biological replication, in which RNA samples from independent sources are used, and (2) technical replication, in which the same RNA sample is applied to different arrays. Whether biological replication, technical replication or both are used in a microarray experiment depends on the relative magnitude of the biological and technical variability in the sample. Repeated spots on the same array are also a kind of replication, but they apply the same RNA sample within the same array. This can reduce array effects due to the quality of robot-fabricated immobilized cDNA probes within the same array. Lee et al. (2000) recommended that at least three replicates be used in designing experiments using cDNA microarrays. In our simulated experiments with three replicates, although our method performed better than the t-test method, only those DEGs with large GT variation were consistently identified in most cases. Therefore, the number of genes identified in most microarray experiments likely underestimates the number of DEGs when a conservative significance level is used. If the experimental budget and the amount of sample allow, six to eight replicates are likely to be best (Pan 2002).
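For reference, the single-gene t-statistic used as the comparison method can be sketched as follows for one gene measured under two treatments. The pooled-variance (equal-variance) form is an assumption, since the text here does not state which variant was used, and the function name and example values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group_a, group_b):
    """Two-sample t-statistic with pooled variance for a single gene.
    Returns the statistic and its degrees of freedom.
    Assumes at least two replicates per group and non-zero pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se, df


# Hypothetical log-expression values, three replicates per treatment
t_stat, df = pooled_t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_stat, 3), df)  # -3.674 4
```

With three replicates per treatment the statistic has only four degrees of freedom, so the critical value needed after a Bonferroni-type correction over thousands of genes is very large; this is one way to see why additional replicates help.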

Various clustering methods are commonly used in microarray data analysis (Eisen et al. 1998; Spellman et al. 1998; Golub et al. 1999; Tamayo et al. 1999; Hastie et al. 2000; Pan 2002). In these methods, expression levels or ratios with sampling errors within experiments are usually analyzed directly, which may introduce noise and even bias in identifying groups of genes and thus result in the false interpretation of gene-expression patterns. Our method is complementary to the current clustering methods. In our method, interesting effects (such as the gene × treatment interactions here) can be predicted and/or estimated. Investigators can use these genetic effects in clustering to make sure the inputs are biologically meaningful. In our previous study, we also proposed a dissimilarity coefficient for clustering populations, using mixed linear models (such as the models proposed in our microarray study). The dissimilarity coefficient has two parameters, for squared difference of marginal mean and variance component of interaction, and has appropriate statistical properties (Zhu and Weir 1994b). Incorporation of such techniques in our method specifically for microarray data is straightforward and awaits further investigation.

In our simulations, we investigated the impact of the various sources of variation on the efficiency of identifying genes expressed differentially among treatments. We found that the same method yielded dramatically different efficiencies (power, false discovery rate) under different configurations of the remaining sources of variation, even when the proportion of total gene-expression variation accounted for by GT interactions was fixed. For example, when V_GT/V_P=0.6, the t-test method had 40% power in identifying DEGs when the dye effect and gene-specific dye effect accounted for most of the remaining variation, whereas it had less than 10% power when the array effect and spot effect dominated the remaining variation (Fig. 1). A similar trend was observed for our method. This suggests that the efficiency of detecting DEGs is affected more by systematic variation arising from arrays than by that arising from dyes. If the experiment is carried out in several batches within each array, batch effects within arrays can be included in the model to reduce the systematic errors. Modeling such effects, or other appropriate effects, in the single- and multi-gene models is straightforward in our method. Our results have an important implication for the experimental design and execution of microarray studies. A desirable microarray design should keep experiment-wise systematic errors as low as possible and, at the same cost, selectively reduce the systematic errors of those specific factors (such as the arrays here) that have a greater effect on the efficiency of detecting DEGs.
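The two efficiency measures compared above can be computed from a simulated dataset in which the true DEGs are known. The sketch below assumes the usual empirical definitions (power as the fraction of true DEGs that are declared, false discovery rate as the fraction of declared genes that are not true DEGs); the function name and example data are hypothetical.

```python
def power_and_fdr(is_true_deg, is_called):
    """Empirical power and false discovery rate for one simulated dataset.
    is_true_deg[g] is True if gene g was simulated as a DEG;
    is_called[g] is True if the method declared gene g a DEG."""
    pairs = list(zip(is_true_deg, is_called))
    true_pos = sum(1 for t, c in pairs if t and c)
    false_pos = sum(1 for t, c in pairs if c and not t)
    n_true = sum(1 for t, _ in pairs if t)
    n_called = true_pos + false_pos
    power = true_pos / n_true if n_true else float("nan")
    fdr = false_pos / n_called if n_called else 0.0  # no calls, no false calls
    return power, fdr


# Ten simulated genes: the first four are true DEGs
truth = [True] * 4 + [False] * 6
called = [True, True, True, False, True, False, False, False, False, False]
print(power_and_fdr(truth, called))  # (0.75, 0.25)
```

Averaging these two quantities over repeated simulations, for each configuration of variance components, gives curves of the kind summarized in the comparison above.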

Treatments, genes, dyes, arrays and their interactions are well known as sources of variation in microarray data (Kerr et al. 2000; Churchill 2002). However, simulations of microarray data have not gained wide acceptance because, in real experiments, these sources of variation may be considerably more complex. This also makes it difficult to justify different statistical methods theoretically. In our study, in addition to simulated data, we compared the mixed-model approach with the t-test empirically, using a real dataset for identifying unstable transcripts (Gutiérrez et al. 2002). The results showed that our method can identify more unstable transcripts than the t-test. We suggest that researchers check the distribution of their data and pre-analyze the various sources of variation in their experiments. Our method is a competitive candidate for datasets that depart from normality and have moderate experimental errors.

References

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300
Bouton CM, Pevsner J (2002) DRAGON view: information visualization for annotated microarray data. Bioinformatics 18:323–324
Brazma A, Vilo J (2000) Gene expression data analysis. FEBS Lett 480:17–24
Chen Y, Dougherty ER, Bittner ML (1997) Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt 2:364–374
Chen Y, Kamat V, Dougherty ER, Bittner ML, Meltzer PS, Trent JM (2002) Ratio statistics of gene expression levels and applications to microarray data analysis. Bioinformatics 18:1207–1215
Churchill GA (2002) Fundamentals of experimental design for cDNA microarrays. Nat Genet 32 [Suppl]:490–495
Dudoit S, Yang YH, Callow MJ, Speed TP (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12:111–139
Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95:14863–14868
Forster T, Costa Y, Roy D, Cooke HJ, Maratou K (2004) Triple-target microarray experiments: a novel experimental strategy. BMC Genomics 5:13
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537
Greenberg JR (1972) High stability of messenger RNA in growing cultured cells. Nature 240:102–104
Guffanti A, Reid JF, Alcalay M, Simon G (2002) The meaning of it all: web-based resources for large-scale functional annotation and visualization of DNA microarray data. Trends Genet 18:589–592
Gutierrez RA, Ewing RM, Cherry JM, Green PJ (2002) Identification of unstable transcripts in Arabidopsis by cDNA microarray analysis: rapid decay is associated with a group of touch- and specific clock-controlled genes. Proc Natl Acad Sci USA 99:11513–11518
Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P (2000) ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 1:3
Henderson CR (1963) Selection index and expected genetic advance. In: Hanson WD, Robinson HE (eds) Statistical genetics and plant breeding. National Academy of Sciences, Washington, DC
Ibrahim JG, Chen MH, Gray RJ (2002) Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc 97:88–99
Ideker T, Thorsson V, Siegel AF, Hood LE (2000) Testing for differentially-expressed genes by maximum likelihood analysis of microarray data. J Comput Biol 7:805–817
Kerr MK, Churchill GA (2001) Experimental design for gene expression microarrays. Biostatistics 2:183–201
Kerr MK, Martin M, Churchill GA (2000) Analysis of variance for gene expression microarray data. J Comput Biol 7:819–837
Lee ML, Kuo FC, Whitmore GA, Sklar J (2000) Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci USA 97:9834–9839
Liu L, Hawkins DM, Ghosh S, Young SS (2003) Robust singular value decomposition analysis of microarray data. Proc Natl Acad Sci USA 100:13167–13172
Miller RG (1974) The Jackknife: a review. Biometrika 61:1–15
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 8:37–52
Pan W (2002) A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18:546–554
Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de RM, Waltham M, et al (2000) Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 24:227–235
Schuchhardt J, Beule D, Malik A, Wolski E, Eickhoff H, Lehrach H, Herzel H (2000) Normalization strategies for cDNA microarrays. Nucleic Acids Res 28:E47
Searle SR, Casella G, McCulloch CE (1992) Variance components. Wiley, New York
Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 31
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9:3273–3297
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96:2907–2912
Thomas JG, Olson JM, Tapscott SJ, Zhao LP (2001) An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 11:1227–1236
Welsh JB, Zarrinkar PP, Sapinoso LM, Kern SG, Behling CA, Monk BJ, Lockhart DJ, Burger RA, Hampton GM (2001) Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci USA 98:1176–1181
West D (2003) Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Stat 7:723–732
Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS (2001) Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 8:625–637
Zhu J (1993) Methods of predicting genotype value and heterosis for offspring of hybrids. J Biomath 8:32–44
Zhu J, Weir BS (1994a) Analysis of cytoplasmic and maternal effects. I. A genetic model for diploid plant seeds and animals. Theor Appl Genet 89:625–637
Zhu J, Weir BS (1994b) Clustering populations by mixed linear models. J Biomath 9:1–14
Zhu J, Weir BS (1996) Diallel analysis for sex-linked and maternal effects. Theor Appl Genet 92:1–9

Acknowledgements

This research was supported in part by the National Natural Science Foundation of China. We greatly thank David Bartsch for his careful reading of this manuscript. We thank the Stanford Microarray Database for making their data openly available.

Author information

Authors and Affiliations

Institute of Bioinformatics, Zhejiang University, Hangzhou, 310029, People’s Republic of China
Yan Lu, Jun Zhu & Pengyuan Liu

Corresponding author

Correspondence to Jun Zhu.

Additional information

Communicated by S. Hohmann

About this article

Cite this article

Lu, Y., Zhu, J. & Liu, P. A two-step strategy for detecting differential gene expression in cDNA microarray data. Curr Genet 47, 121–131 (2005). https://doi.org/10.1007/s00294-004-0551-3

Received: 04 September 2004
Revised: 27 October 2004
Accepted: 27 October 2004
Published: 10 December 2004
Issue Date: February 2005
DOI: https://doi.org/10.1007/s00294-004-0551-3
