On the Identification of Correlated Differential Features for Supervised Classification of High-Dimensional Data

Ng, Shu Kay; McLachlan, Geoffrey J.

doi:10.1007/978-3-319-55723-6_4


Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)


Abstract

Many real problems in supervised classification involve high-dimensional feature data measured for individuals of known origin from two or more classes. When the dimension of the feature vector is very large relative to the number of individuals, it presents formidable challenges to construct a discriminant rule (classifier) for assigning an unclassified individual to one of the known classes. One way to handle this high-dimensional problem is to identify highly relevant differential features for constructing a classifier. Here a new approach is considered, where a mixture model with random effects is used firstly to partition the features into clusters and then the relevance of each feature variable for differentiating the classes is formally tested and ranked using cluster-specific contrasts of mixed effects. Finally, a non-parametric clustering approach is adopted to identify networks of differential features that are highly correlated. The method is illustrated using a publicly available data set in cancer research for the discovery of correlated biomarkers relevant to the cancer diagnosis and prognosis.


1 Introduction

In supervised classification, the data are classified with respect to g known classes and the intent is to construct a discriminant rule or classifier on the basis of these classified data for assigning an unclassified individual to one of the g classes on the basis of its feature vector. Many real problems in supervised classification, however, involve high-dimensional feature vectors. While there is a vast literature on dimension reduction and/or feature selection in supervised classification [4, 8, 13], some of the methods may become inapplicable or unreliable when the dimension of the feature vector is very large relative to the number of individuals [2, 10, 15, 24]. An example of such an application is the analysis of gene-expression data, where expression levels of genes (features) are available from patients in g known classes of distinct disease stages or outcomes and the aim is to identify a small subset of “marker” genes that characterize the different classes and construct a discriminant rule to predict the class of origin of an unclassified patient [11, 17]. One way to handle this high-dimensional problem is to identify genes that are differentially expressed among the g classes of tissue samples. In this context, multiple hypothesis test-based approaches [27–29] have been proposed to assess the statistical significance of differential expression for each gene separately, with control of the false discovery rate (FDR), defined as the expected proportion of false positives among the genes declared to be differentially expressed [1]. Clustering-based approaches have also been considered, but these methods work either on gene-specific summary statistics [14, 23] or on reduced forms of gene-expression data [6]. Alternatively, clustering methods that can handle full gene-expression data rely on the assumption that pure clusters of null (non-differentially expressed) genes and differentially expressed genes exist [12, 26]; see also [25].
More recently, a mixture model-based approach with random-effects terms was proposed to draw inference on differences between classes using full gene-expression data [22]. This method does not rely on the clusters being pure as to whether all cluster members are differentially expressed or null genes. In this paper, we propose a new three-step method that extends this mixture model-based approach in order to identify networks of correlated differential features (genes) for supervised classification of high-dimensional data.

The rest of the paper is organized as follows. In Sect. 2, we describe the mixture model with random-effects terms [20] that is adopted in the first step to cluster the genes using full gene-expression data. We also present the second step, where the relevance of each feature variable for differentiating the classes is formally tested and ranked on the basis of cluster-specific contrasts of mixed effects. In Sect. 3, we describe the final third step in which a non-parametric clustering approach is used to further explore the group structures of selected highly ranked differential features for each cluster identified in the first step. Section 4 presents the application of the proposed method to a publicly available gene-expression data set in cancer research for the discovery of correlated biomarkers relevant to the cancer prognosis. Discussion is given in Sect. 5.

2 Mixture Model with Random-Effects Terms

With supervised classification, it is supposed that an individual belongs to one of g classes, denoted by C_1, …, C_g, and that there is a vector of p feature variables measured on each individual. Based on the observed feature vectors, represented by an n × p matrix, the intent is to construct a discriminant rule for allocating an unclassified individual to one of the g classes [15]. For applications in the context of supervised classification with gene-expression data, the number of individual tissue samples n is very small relative to the number of genes p. To handle this high-dimensional problem, it is proposed to adopt a mixture model with random-effects terms to first cluster the p genes and then identify those genes that are highly differentiated between the g classes of tissue samples.

Let $\boldsymbol{y}_{j} = (\,y_{1j},\ldots,y_{nj})^{T}$ contain the measurements on the jth gene ( j = 1, …, p), where the superscript T denotes vector transpose and p is much greater than n. It is assumed that $\boldsymbol{y}_{j}$ has an h-component mixture distribution with probability π_i of belonging to the ith cluster (i = 1, …, h), where the π_i sum to one. We let the h-dimensional vector $\boldsymbol{z}_{j}$ denote the cluster membership of $\boldsymbol{y}_{j}$, where $z_{ij} = (\boldsymbol{z}_{j})_{i} = 1$ if $\boldsymbol{y}_{j}$ belongs to the ith cluster and zero otherwise (i = 1, …, h). A mixture model with random-effects terms [20] is required because it is anticipated that repeated measurements of gene expression for a tissue sample are correlated, as are the expression levels for a gene; see also [19]. Specific random effects are thus considered in the mixture model to capture individual gene effects and the correlation between gene-expression levels among the tissue classes [22]. Conditional on its membership of the ith cluster, the distribution of $\boldsymbol{y}_{j}$ is specified by the linear mixed model

$$\displaystyle{ \boldsymbol{y}_{j} = \boldsymbol{X}\boldsymbol{\eta }_{i} + \boldsymbol{U}\boldsymbol{b}_{ij} + \boldsymbol{V }\boldsymbol{c}_{i} + \boldsymbol{\varepsilon }_{ij}, }$$

(1)

where $\boldsymbol{X},\boldsymbol{U}$, and $\boldsymbol{V }$ denote the known design matrices corresponding to the fixed-effects terms $\boldsymbol{\eta }_{i}$ and to the random-effects terms $\boldsymbol{b}_{ij}$ and $\boldsymbol{c}_{i}\,(i = 1,\,\ldots,\,h;\,j = 1,\ldots,p)$, respectively. The vector $\boldsymbol{b}_{ij} = (b_{1ij},\,\ldots,\,b_{gij})^{T}$ contains the unobservable gene-specific random effects for each of the g tissue classes, and $\boldsymbol{c}_{i} = (c_{1i},\,\ldots,\,c_{ni})^{T}$ contains the random effects common to all genes from the ith cluster. The measurement error vector $\boldsymbol{\varepsilon }_{ij}$ is taken to be multivariate normal $N_{n}(\boldsymbol{0},\,\boldsymbol{A}_{i})$, where $\boldsymbol{A}_{i}$ is a diagonal matrix. The vectors $\boldsymbol{b}_{ij}$ and $\boldsymbol{c}_{i}$ of random-effects terms are taken to be multivariate normal $N_{g}(\boldsymbol{0},\,\boldsymbol{B}_{i})$ and $N_{n}(\boldsymbol{0},\,\boldsymbol{C}_{i})$, respectively, where the variance component $\boldsymbol{C}_{i}$ is assumed to be diagonal and $\boldsymbol{B}_{i}$ is a non-diagonal g × g matrix whose off-diagonal elements model the correlation between the gene-specific random effects $b_{lij}$ (l = 1, …, g); see, for example, [22]. The assignment of the p genes into h clusters is implemented using the estimated conditional posterior probabilities of cluster membership given $\boldsymbol{y}_{j}$ and $\hat{\mathbf{c}}_{i}\,(i = 1,\ldots,h;\,j = 1,\ldots,p)$:

$$\displaystyle{ \tau _{i}(\boldsymbol{\,y}_{j};\hat{\boldsymbol{\varPsi }},\hat{\mathbf{c}}) = \mbox{ pr}(Z_{ij} = 1\vert \boldsymbol{y}_{j},\hat{\mathbf{c}}) ={ \hat{\boldsymbol{\pi }}_{i}\,f(\boldsymbol{\,y}_{j}\vert z_{ij} = 1;\hat{\boldsymbol{\psi }}_{i},\hat{\mathbf{c}}_{i}) \over \sum _{m=1}^{h}\hat{\boldsymbol{\pi }}_{m}\,f(\boldsymbol{\,y}_{j}\vert z_{mj} = 1;\hat{\boldsymbol{\psi }}_{m},\hat{\mathbf{c}}_{m})}, }$$

(2)

where $\boldsymbol{\psi }_{i}$ is the parameter vector for the ith component density containing the unknown parameters $\boldsymbol{\eta }_{i}$ and the distinct elements in $\boldsymbol{A}_{i}$, $\boldsymbol{B}_{i}$, and $\boldsymbol{C}_{i}\,(i = 1,\,\ldots,\,h)$, and

$$\displaystyle{ \log f(\boldsymbol{\,y}_{j}\vert z_{ij} = 1;\hat{\boldsymbol{\psi }}_{i},\hat{\mathbf{c}}_{i}) = -\frac{1} {2}\bigg\{\log \vert \hat{\mathbf{D}}_{i}\vert + (\boldsymbol{\,y}_{j} -\boldsymbol{X}\hat{\boldsymbol{\eta }}_{i} -\boldsymbol{V }\hat{\mathbf{c}}_{i})^{T}\hat{\mathbf{D}}_{ i}^{-1}(\boldsymbol{\,y}_{ j} -\boldsymbol{X}\hat{\boldsymbol{\eta }}_{i} -\boldsymbol{V }\hat{\mathbf{c}}_{i})\bigg\} }$$

is the log density of $\boldsymbol{y}_{j}$ conditioned on $\hat{\mathbf{c}}_{i}$ and the membership of the ith cluster, apart from an additive constant, and where $\hat{\mathbf{D}}_{i} = \hat{\mathbf{A}}_{i} + \boldsymbol{U}\hat{\mathbf{B}}_{i}\boldsymbol{U}^{T}$; see [20].
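As an illustration of Eq. (2), the following is a minimal sketch of how the posterior probabilities of cluster membership could be evaluated once estimates of the parameters and of the cluster-level random effects are available. The function names and all numerical values are hypothetical and chosen only for this toy example; the estimation itself (for instance by an EM-type algorithm as in [20]) is not shown.

```python
import numpy as np

def log_component_density(y, X, U, V, eta, c, A, B):
    """Log density of y given cluster membership and c, up to an additive
    constant that is common to all clusters and cancels in Eq. (2)."""
    D = A + U @ B @ U.T                  # D_i = A_i + U B_i U^T
    resid = y - X @ eta - V @ c          # y_j - X eta_i - V c_i
    _, logdet = np.linalg.slogdet(D)     # log |D_i|
    return -0.5 * (logdet + resid @ np.linalg.solve(D, resid))

def posterior_probs(y, X, U, V, pis, etas, cs, As, Bs):
    """tau_i(y_j) for i = 1, ..., h, computed on the log scale for stability."""
    log_w = np.array([
        np.log(pi) + log_component_density(y, X, U, V, eta, c, A, B)
        for pi, eta, c, A, B in zip(pis, etas, cs, As, Bs)
    ])
    log_w -= log_w.max()                 # guard against numerical underflow
    w = np.exp(log_w)
    return w / w.sum()

# Toy setting (hypothetical values): n = 4 samples, g = 2 classes, h = 2 clusters.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
U = X.copy()
V = np.eye(4)
pis = [0.5, 0.5]
etas = [np.array([2.0, 0.0]), np.array([0.0, 0.0])]
cs = [np.zeros(4), np.zeros(4)]
As = [0.5 * np.eye(4)] * 2
Bs = [np.array([[0.3, 0.1], [0.1, 0.3]])] * 2

y = X @ etas[0]                          # a profile lying exactly on the first cluster mean
tau = posterior_probs(y, X, U, V, pis, etas, cs, As, Bs)
```

In this toy setting the profile coincides with the mean of the first cluster, so the first element of `tau` is the larger one. Each gene would then be assigned to the cluster with the highest posterior probability.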

To quantify the relevance of each gene for differentiating the g classes, we consider an individual observation-specific contrast in the estimates of the fixed and random effects weighted by the estimated posterior probabilities (2) of cluster membership:

$$\displaystyle{ W_{j} =\sum _{ i=1}^{h}\tau _{ i}(\boldsymbol{\,y}_{j};\,\hat{\boldsymbol{\varPsi }},\hat{\mathbf{c}})\,\hat{S}_{ij}\qquad (\,j = 1,\ldots,p), }$$

(3)

where

$$\displaystyle{ \hat{S}_{ij} = \boldsymbol{d}_{j}^{T}(\hat{\boldsymbol{\eta }}_{ i}^{T},\hat{\mathbf{b}}_{ G_{i}}^{T},\hat{\mathbf{c}}_{ i}^{T})^{T}\bigg/\sqrt{\boldsymbol{d} _{ j}^{T}\hat{\boldsymbol{\varOmega }}_{i}\boldsymbol{d}_{j}} }$$

(4)

is the cluster-specific normalized contrast with the BLUP estimator of the mixed effects, and where $\boldsymbol{d}_{j}$ is a vector whose elements sum to zero, $\boldsymbol{b}_{G_{i}} = (\boldsymbol{b}_{i_{1}}^{T},\,\ldots,\,\boldsymbol{b}_{i_{p_{ i}}}^{T})^{T}$ contains the gene-specific random-effects terms for the p_i genes belonging to the ith cluster G_i (i = 1, …, h), and $\hat{\boldsymbol{\varOmega }}_{i}$ is the covariance matrix of the BLUP estimator of the mixed effects, which can be partitioned conformally corresponding to $\boldsymbol{\eta }_{i}\vert \boldsymbol{b}_{G_{i}}\vert \boldsymbol{c}_{i}$, respectively, as described in [22].
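The arithmetic of Eqs. (3) and (4) can be sketched as follows, assuming that the BLUP estimates of the mixed effects and their covariance matrix have already been obtained. The function names and the toy numbers are hypothetical and serve only to show how the two formulas combine.

```python
import numpy as np

def normalized_contrast(d, theta_hat, omega_hat):
    """S_ij = d^T theta / sqrt(d^T Omega d), as in Eq. (4).

    theta_hat stacks the estimates (eta_i, b_{G_i}, c_i) for cluster i and
    omega_hat is the covariance matrix of that stacked estimator.
    """
    d = np.asarray(d, dtype=float)
    return float(d @ theta_hat / np.sqrt(d @ omega_hat @ d))

def weighted_contrast(tau_j, s_j):
    """W_j = sum_i tau_i(y_j) * S_ij, as in Eq. (3)."""
    return float(np.dot(tau_j, s_j))

# Toy values (hypothetical): a stacked effect vector of length 4.
theta_hat = np.array([1.0, 0.2, 0.5, -0.5])
omega_hat = 0.04 * np.eye(4)
d = np.array([1.0, -1.0, 1.0, -1.0])     # elements sum to zero

s_1 = normalized_contrast(d, theta_hat, omega_hat)
w_j = weighted_contrast([0.9, 0.1], [s_1, 0.0])
```

Here the contrast in the numerator is 1.8 and its standard error is 0.4, so `s_1` is 4.5; weighting by posterior probabilities 0.9 and 0.1 gives `w_j` of 4.05.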

Based on the weighted contrast W_j ( j = 1, …, p) given in (3), the p genes can be ranked in order of their relevance for differentiating the g classes (with respect to the form of d_j defined for the normalized contrast (4)). In the final step of the proposed method, described in the next section, we explore the group structure of the top-ranked differentially expressed genes in each identified cluster G_i (i = 1, …, h), namely those genes with contrast W_j more extreme than the threshold w_0u (for upregulated genes) or w_0d (for downregulated genes). A guide to plausible values of w_0u and w_0d can be obtained from the percentile ranks of W_j ( j = 1, …, p), whereby the percentiles are taken to be the mixing proportions of the non-central portions of W_j fitted by a three-component mixture of t-distributions (the two non-central components are taken to represent the distribution of W_j for upregulated and downregulated differentially expressed genes).
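The percentile rule for the thresholds can be sketched as below. The sketch assumes that the mixing proportions of the two non-central components have already been estimated by fitting the three-component mixture of t-distributions, which is not shown; the function names are hypothetical and the contrasts are simulated purely for illustration.

```python
import numpy as np

def contrast_thresholds(w, prop_up, prop_down):
    """Return (w0u, w0d) as percentiles of the contrasts W_j.

    prop_up and prop_down are the mixing proportions of the upper and lower
    non-central components (assumed to be estimated elsewhere).
    """
    w = np.asarray(w, dtype=float)
    w0u = np.percentile(w, 100.0 * (1.0 - prop_up))
    w0d = np.percentile(w, 100.0 * prop_down)
    return float(w0u), float(w0d)

def select_differential(w, w0u, w0d):
    """Indices of genes with W_j above w0u (up) or below w0d (down)."""
    w = np.asarray(w, dtype=float)
    return np.flatnonzero(w > w0u), np.flatnonzero(w < w0d)

# Simulated contrasts (not real data), with illustrative mixing proportions.
rng = np.random.default_rng(0)
w = rng.standard_t(df=5, size=10_000)
w0u, w0d = contrast_thresholds(w, prop_up=0.115, prop_down=0.072)
up_idx, down_idx = select_differential(w, w0u, w0d)
```

By construction, roughly 11.5% of the simulated contrasts fall above `w0u` and roughly 7.2% fall below `w0d`.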

3 A Non-parametric Clustering Approach for Identification of Correlated Features

We consider the r_i top-ranked genes with W_j more extreme than either w_0u or w_0d in Cluster G_i (i = 1, …, h) and adopt a non-parametric method to cluster the r_i genes into networks of differentially expressed genes that are highly correlated. The method starts with the calculation of the correlation coefficient for each pair of the r_i genes in G_i (i = 1, …, h). The significance of the pairwise correlation coefficients is then assessed with a permutation method [21] that determines the null distribution of the correlation coefficients. Specifically, the n class labels of tissue samples are randomly permuted separately for each gene. We pool the permutations for all $N_{r_{i}} = r_{i}(r_{i} - 1)/2$ pairs of genes to determine the null distribution of correlation coefficients. In this paper, we use S = 100 repetitions of permutations and estimate the P-value for each pair of genes by

$$\displaystyle{ P_{l} =\sum _{ s=1}^{S}{\#\{m: R_{0m}^{(s)} \geq R_{ l},m = 1,\ldots,N_{r_{i}}\} \over N_{r_{i}}S} \qquad (l = 1,\ldots,N_{r_{i}}), }$$

(5)

where $R_{l}$ is the observed correlation coefficient for the lth pair of genes and $R_{0m}^{(s)}$ is the null version of the correlation coefficient for the mth pair of genes after the sth repetition of permutations $(m = 1,\ldots,N_{r_{i}};\,s = 1,\ldots,S)$. Let $P_{(1)} \leq \cdots \leq P_{(N_{r_{ i}})}$ be the ordered observed P-values obtained from (5). The Benjamini–Hochberg procedure [1] is adopted to determine the cut-off $\hat{k}$, where

$$\displaystyle{ \hat{k} = \mbox{ arg max}\{k: P_{(k)} \leq \alpha k/N_{r_{i}}\}, }$$

(6)

with control of the FDR at level α. Pairwise correlation coefficients corresponding to the P-values $P_{(1)} \leq \cdots \leq P_{(\hat{k})}$ are declared significant. The significance of the pairwise correlation coefficients is represented by an r_i × r_i symmetric binary matrix M, with elements of one or zero indicating whether or not the corresponding correlation coefficient is significant. Finally, we search M to identify networks of differentially expressed genes in which all members of a group are significantly correlated with one another [21]. This non-parametric clustering approach yields overlapping groups (networks) of correlated differentially expressed genes.
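The steps of this section up to the construction of M can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the Pearson correlation is assumed, the permutation is realized by shuffling the sample order independently within each gene, the P-value follows the one-sided form of Eq. (5), and all function names and the simulated data are hypothetical.

```python
import numpy as np

def pairwise_corr(Y):
    """Pearson correlations for all pairs of rows (genes) of the r x n array Y."""
    R = np.corrcoef(Y)
    iu = np.triu_indices(Y.shape[0], k=1)
    return R[iu], iu

def permutation_pvalues(Y, n_rep=100, seed=0):
    """P-values of Eq. (5): share of pooled null correlations >= observed."""
    rng = np.random.default_rng(seed)
    r_obs, iu = pairwise_corr(Y)
    null = []
    for _ in range(n_rep):
        # Shuffle the samples independently within each gene (row).
        Y_perm = rng.permuted(Y, axis=1)
        null.append(pairwise_corr(Y_perm)[0])
    null = np.sort(np.concatenate(null))     # pooled over pairs and repetitions
    n_ge = null.size - np.searchsorted(null, r_obs, side="left")
    return n_ge / null.size, iu

def bh_cutoff(p, alpha):
    """k_hat of Eq. (6); returns 0 when no ordered P-value meets the bound."""
    p_sorted = np.sort(p)
    N = p_sorted.size
    ok = np.flatnonzero(p_sorted <= alpha * np.arange(1, N + 1) / N)
    return int(ok[-1] + 1) if ok.size else 0

def significance_matrix(p, iu, r, alpha):
    """Symmetric 0/1 matrix M marking the significantly correlated pairs."""
    M = np.zeros((r, r), dtype=int)
    k_hat = bh_cutoff(p, alpha)
    if k_hat > 0:
        sig = p <= np.sort(p)[k_hat - 1]
        M[iu[0][sig], iu[1][sig]] = 1
        M = M + M.T
    return M

# Simulated data (not real expression values): genes 0-2 share a common signal.
rng = np.random.default_rng(1)
signal = rng.normal(size=20)
Y = rng.normal(size=(6, 20))
Y[:3] = signal + 0.1 * rng.normal(size=(3, 20))

p_values, iu = permutation_pvalues(Y, n_rep=100, seed=0)
M = significance_matrix(p_values, iu, r=6, alpha=0.05)
```

In this simulated example the three genes that share a signal are strongly correlated, so the corresponding entries of `M` are set to one; other pairs may or may not be flagged, depending on the permutation draw.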

4 Real Example

We consider the colorectal cancer gene-expression data set [5], which comprises expression values of 15,552 genes for plasma samples from 12 colorectal cancer patients and 8 healthy donors. The original study aimed to validate the power of four randomly selected markers (from a list of 40 genes differentially upregulated in cancer patients) to differentiate the tumour from the healthy condition [5]. With the proposed three-step approach, we first fitted a mixture model with random-effects terms to the column-normalized gene-expression data set with h = 3 to h = 20 clusters, taking $\boldsymbol{X} = \boldsymbol{U}$ to be a 20 × 2 zero-one matrix (the first 12 rows are (1, 0) and the next 8 rows are (0, 1)) and taking $\boldsymbol{V }$ to be $\boldsymbol{I}_{20}$. Based on the Bayesian information criterion (BIC) for model selection, we identified 15 clusters of genes. The ML estimates of the unknown parameters are presented in Table 1. The ranking of differentially expressed genes is then implemented on the basis of the weighted estimates of a contrast in the mixed effects (3). For the case of g = 2 classes of tissue samples (tumour versus healthy), we consider $\boldsymbol{d}_{j}$ of the form

$$\displaystyle{ \boldsymbol{d}_{j}^{T} = (1\ \ \mbox{ $-$1}\ \ \vdots\ \ 0\ \ 0,\ \ \ldots,\ \ 0\ \ 0,\ \ 1\ \ \mbox{ $-$1},\ \ 0\ \ 0,\ \ \ldots \ \ \vdots\ \ 0\ \ \ldots \ \ 0), }$$

(7)

where only one pair of (1 −1) exists in the second partition corresponding to $\boldsymbol{b}_{G_{i}}$; see Eq. (4). We then fitted a three-component mixture of t-distributions [16] to W_j and obtained the mixing proportions of the components corresponding to the non-central portions of W_j, which are 11.5% and 7.2% for upregulated and downregulated genes in the tumour tissues, respectively. Thus we selected w_0u = 1.661 (the 88.5th percentile of W_j) and w_0d = −2.236 (the 7.2th percentile of W_j; j = 1, …, p). A total of 2907 differentially expressed genes have W_j more extreme than w_0u or w_0d (W_j > 1.661 or W_j < −2.236). Among them, 1581 genes have valid identifiers (1073 upregulated and 508 downregulated). Descriptive statistics of W_j for these 1581 differentially expressed genes are provided in Table 2. It can be seen that Clusters 7–11 and 13–14 contain upregulated differentially expressed genes, Clusters 1, 3, 6, and 12 contain downregulated differentially expressed genes, and Clusters 5 and 15 contain both upregulated and downregulated differentially expressed genes.
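For concreteness, the design matrices stated for this example and a contrast vector of the form (7) can be written out as in the following sketch. The helper `contrast_vector` and its arguments are hypothetical; the ordering of the stacked effects follows the partition $\boldsymbol{\eta }_{i}\vert \boldsymbol{b}_{G_{i}}\vert \boldsymbol{c}_{i}$ described in Sect. 2, with the two class-specific random effects of each gene stored consecutively.

```python
import numpy as np

n_tumour, n_healthy = 12, 8
n = n_tumour + n_healthy

# X = U: 20 x 2 zero-one matrix, first 12 rows (1, 0), next 8 rows (0, 1).
X = np.zeros((n, 2))
X[:n_tumour, 0] = 1.0
X[n_tumour:, 1] = 1.0
U = X.copy()
V = np.eye(n)                            # V = I_20

def contrast_vector(pos, p_i, n=20):
    """d_j of Eq. (7) for g = 2 classes.

    pos is the 0-based position of gene j among the p_i genes of its cluster.
    """
    d_eta = np.array([1.0, -1.0])        # contrast in the fixed effects
    d_b = np.zeros(2 * p_i)              # partition for b_{G_i}
    d_b[2 * pos: 2 * pos + 2] = [1.0, -1.0]
    d_c = np.zeros(n)                    # cluster-level effects do not enter
    return np.concatenate([d_eta, d_b, d_c])

# Hypothetical cluster of three genes; contrast for the second gene.
d = contrast_vector(pos=1, p_i=3)
```

The resulting vector has length 2 + 2p_i + n and its elements sum to zero, as required of $\boldsymbol{d}_{j}$.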

Table 1 Estimates of the mixture model with random-effects terms for the colorectal cancer data set (15 clusters)


Table 2 Descriptive statistics of W_j for the differentially expressed genes with valid gene identifiers and W_j more extreme than either w_0u or w_0d (15 clusters)


In the final step, we applied the non-parametric method to identify networks of correlated differentially expressed genes from the r_i genes in Cluster G_i. We set α to be between 0.1 and 0.00005 such that the expected number of false positives among the pairs of genes identified to be significantly correlated is smaller than one; see [21]. With the matrix M, networks of differentially expressed genes were displayed using UCINET6 for Windows [3]. Figure 1 presents the identified networks of upregulated differentially expressed genes in Clusters 7, 9, 13, and 14, where the nodal size of a gene is proportional to the degree of the node (the number of genes that are significantly correlated with the gene). Networks of downregulated differentially expressed genes (Clusters 3, 6, and 12) are presented in Fig. 2. Clusters 5 and 15 had networks of both upregulated and downregulated differentially expressed genes (Fig. 3).
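The node degree and the groups in which every member is significantly correlated with every other member can both be read off the binary matrix M. The sketch below computes the degrees and enumerates maximal cliques with the basic Bron–Kerbosch recursion; this is one standard way to find such fully connected groups and is an assumption here, since the search procedure is not spelled out beyond the description in [21]. The function names, the `min_size` parameter, and the toy matrix are hypothetical.

```python
import numpy as np

def node_degrees(M):
    """Number of genes significantly correlated with each gene."""
    return np.asarray(M).sum(axis=1)

def maximal_cliques(M, min_size=2):
    """Maximal groups whose members are all pairwise connected in M."""
    M = np.asarray(M)
    nbrs = [set(int(k) for k in np.flatnonzero(row)) - {i}
            for i, row in enumerate(M)]
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            if len(current) >= min_size:
                cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & nbrs[v], excluded & nbrs[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(len(nbrs))), set())
    return sorted(cliques)

# Toy matrix (hypothetical): two triangles sharing gene 2.
M_toy = np.zeros((5, 5), dtype=int)
for a, b in [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]:
    M_toy[a, b] = M_toy[b, a] = 1

degrees = node_degrees(M_toy)
groups = maximal_cliques(M_toy)
```

In the toy matrix, gene 2 has degree 4 and belongs to both groups, which illustrates how the approach can return overlapping networks.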

A summary of the identified networks of correlated differentially expressed genes for each cluster is given in Table 3. Two isolated networks of differentially expressed genes were identified: the downregulated gene network {N62132, TRIP10, AA455350} from Cluster 5 and the upregulated gene network {CLK2, ENSA, AA416971} from Cluster 13. It is noted that four upregulated genes were considered in the original study and three of them (EPAS1, UBE2D3, KIAA0101) were validated to be significantly increased in cancer patients compared to healthy donors [5]. Our clustering results confirmed the same findings; these three genes were identified as differentially expressed genes in Cluster 14 (with contrast W_j = 3.7, 3.3, and 2.0, respectively, and ranked 2nd, 8th, and 156th among the 224 differentially expressed genes in Cluster 14). The original study could not validate the remaining upregulated gene DDX46. However, our method has sufficient power to identify DDX46 as a differentially expressed gene in Cluster 5, with W_j = 2.4 and ranked 1st among the 14 upregulated differentially expressed genes in Cluster 5.

Table 3 A summary of networks of highly correlated differentially expressed genes


5 Discussion

We have presented a new approach to identify correlated differential features for supervised classification of high-dimensional data. The method adopts a mixture model with random-effects terms to cluster the feature variables and then ranks them in terms of their cluster-specific contrasts of mixed effects that quantify the evidence of differentiation between the known classes. The final step of the method adopts a non-parametric clustering approach to identify networks of differential features that are highly correlated in each identified cluster.

The proposed method is illustrated with an application to the analysis of gene-expression cancer data. The identified differentially expressed genes and their correlation structures can contribute significantly to the discovery of novel biomarkers relevant to cancer diagnosis and prognosis; see also [7, 9] for the benefit of using the covariance information among genes for feature selection. Moreover, these differentially expressed genes can be included in a model to construct a classifier with a smaller subset of marker genes, using methods such as mixtures of factor analysers [15, 16] or mixtures of multivariate generalized Bernoulli distributions [18]. This work will be pursued in future research.

References

1. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995)
2. Bickel, P.J., Levina, E.: Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10, 989–1010 (2004)
3. Borgatti, S.P., Everett, M.G., Freeman, L.C.: Ucinet for Windows: Software for Social Network Analysis. Analytic Technologies, Harvard, MA (2002). Available via http://www.analytictech.com/. Accessed 8 Dec 2015
4. Cai, T., Liu, W.: A direct estimation approach to sparse linear discriminant analysis. J. Am. Stat. Assoc. 106, 1566–1577 (2011)
5. Collado, M., Garcia, V., Garcia, J.M., Alonso, I., Lombardia, L., et al.: Genomic profiling of circulating plasma RNA for the analysis of cancer. Clin. Chem. 53, 1860–1863 (2007)
6. Dahl, D.B., Newton, M.A.: Multiple hypothesis testing by clustering treatment effects. J. Am. Stat. Assoc. 102, 517–526 (2007)
7. Donoho, D., Jin, J.: Higher criticism for large-scale inference, especially for rare and weak effects. Stat. Sci. 30, 1–25 (2015)
8. Fan, J., Lv, J.: A selective review of variable selection in high dimensional feature space. Stat. Sin. 20, 101–148 (2010)
9. Fan, J., Feng, Y., Tong, X.: A road to classification in high dimensional space: the regularized optimal affine discriminant. J. R. Stat. Soc. B 74, 745–771 (2012)
10. Hall, P., Pittelkow, Y., Ghosh, M.: Theoretic measures of relative performance of classifiers for high-dimensional data with small sample sizes. J. R. Stat. Soc. B 70, 158–173 (2008)
11. Hall, P., Jin, J., Miller, H.: Feature selection when there are many influential features. Bernoulli 20, 1647–1671 (2014)
12. He, Y., Pan, W., Lin, J.: Cluster analysis using multivariate normal mixture models to detect differential gene expression with microarray data. Comput. Stat. Data Anal. 51, 641–658 (2006)
13. Kersten, J.: Simultaneous feature selection and Gaussian mixture model estimation for supervised classification problems. Pattern Recogn. 47, 2582–2595 (2014)
14. Matsui, S., Noma, H.: Estimating effect sizes of differentially expressed genes for power and sample-size assessments in microarray experiments. Biometrics 67, 1225–1235 (2011)
15. McLachlan, G.J.: Discriminant analysis. WIREs Comput. Stat. 4, 421–431 (2012)
16. McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)
17. McLachlan, G.J., Do, K.A., Ambroise, C.: Analyzing Microarray Gene Expression Data. Wiley, New York (2004)
18. Ng, S.K.: A two-way clustering framework to identify disparities in multimorbidity patterns of mental and physical health conditions among Australians. Stat. Med. 34, 3444–3460 (2015)
19. Ng, S.K., McLachlan, G.J.: Mixture models for clustering multilevel growth trajectories. Comput. Stat. Data Anal. 71, 43–51 (2014)
20. Ng, S.K., McLachlan, G.J., Wang, K., Ben-Tovim, L., Ng, S.-W.: A mixture model with random-effects components for clustering correlated gene-expression profiles. Bioinformatics 22, 1745–1752 (2006)
21. Ng, S.K., Holden, L., Sun, J.: Identifying comorbidity patterns of health conditions via cluster analysis of pairwise concordance statistics. Stat. Med. 31, 3393–3405 (2012)
22. Ng, S.K., McLachlan, G.J., Wang, K., Nagymanyoki, Z., Liu, S., Ng, S.-W.: Inference on differences between classes using cluster-specific contrasts of mixed effects. Biostatistics 16, 98–112 (2015)
23. Pan, W., Lin, J., Le, C.T.: Model-based cluster analysis of microarray gene-expression data. Genome Biol. 3, 0009.1–0009.8 (2002)
24. Pyne, S., Lee, S.X., Wang, K., Irish, J., Tamayo, P., et al.: Joint modeling and registration of cell populations in cohorts of high-dimensional flow cytometric data. PLoS One 9, e100334 (2014)
25. Qi, Y., Sun, H., Sun, Q., Pan, L.: Ranking analysis for identifying differentially expressed genes. Genomics 97, 326–329 (2011)
26. Qiu, W., He, W., Wang, X., Lazarus, R.: A marginal mixture model for selecting differentially expressed genes across two types of tissue samples. Int. J. Biostat. 4, Article 20 (2008)
27. Smyth, G.: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, Article 3 (2004)
28. Storey, J.D.: The optimal discovery procedure: a new approach to simultaneous significance testing. J. R. Stat. Soc. B 69, 347–368 (2007)
29. Zhao, Y.: Posterior probability of discovery and expected rate of discovery for multiple hypothesis testing and high throughput assays. J. Am. Stat. Assoc. 106, 984–996 (2011)


Acknowledgements

Part of this work was presented at the Conference of the International Federation of Classification Societies, Bologna, July 2015. This work was supported by a grant from the Australian Research Council.

Author information

Authors and Affiliations

School of Medicine and Menzies Health Institute Queensland, Griffith University, Nathan, QLD, 4111, Australia
Shu Kay Ng
Department of Mathematics, University of Queensland, St Lucia, QLD, 4072, Australia
Geoffrey J. McLachlan


Corresponding author

Correspondence to Shu Kay Ng.

Editor information

Editors and Affiliations

Department of Political Sciences, University of Naples Federico II, Napoli, Italy
Francesco Palumbo
Department of Statistical Sciences Paolo Fortunati, Alma Mater Studiorum, University of Bologna, Bologna, Italy
Angela Montanari
Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
Maurizio Vichi


Cite this paper

Ng, S.K., McLachlan, G.J. (2017). On the Identification of Correlated Differential Features for Supervised Classification of High-Dimensional Data. In: Palumbo, F., Montanari, A., Vichi, M. (eds) Data Science. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-55723-6_4


DOI: https://doi.org/10.1007/978-3-319-55723-6_4
Published: 05 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55722-9
Online ISBN: 978-3-319-55723-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
