
1 Introduction

The genomes of cancers often contain a large number of somatic aberrations. For example, DNA copy number aberrations are closely related to oncogene amplification and tumor suppressor loss caused by genomic instability and deregulation [1, 2]. In other cases, epigenetic aberrations such as genomic methylation can also activate oncogenes [3]. DNA sequence changes directly affect mRNA expression levels and even non-coding microRNAs, thereby altering the transcriptome, producing individual heterogeneity and ultimately leading to the development of cancer cells. The same cancer can harbor diverse somatic mutations and transcriptional profiles, so that different subtypes form with heterogeneous biological processes and phenotypes [4]. For example, glioblastoma (GBM) can be divided into the Classical, Mesenchymal and Proneural subtypes according to aberrations and gene expression of EGFR, NF1 and PDGFRA/IDH1 [5].

Recently, many cancer genome projects have been established and have amassed large amounts of data of various types. For example, The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/) contains genome, transcriptome and expression information for over 20 cancers from thousands of patients, including methylation data, mRNA expression data, DNA copy number data and so on. Several integrative methods have been proposed that combine different biological data types for cancer subtype classification. For example, iCluster is an integrative probabilistic model of multiple data types based on a Gaussian latent variable model. It first constructs a penalized log-likelihood function with a lasso-type sparsity term, which is optimized with the Expectation-Maximization algorithm [6, 7], and then applies K-means to obtain subtypes. However, such bulky datasets also bring many challenges for subtype classification. First, the key step of a data-integration clustering method is to construct a variance-covariance structure within the data types, i.e., to solve for the coefficient matrix, which is equivalent to a feature selection process. The coefficient matrix is a projection matrix that maps the original data onto an eigengene-eigenarray subspace. Second, high-dimensional genomic datasets typically have a small number of samples and a large number of genes, so dimension reduction of the coefficient matrix is essential. iCluster estimates the coefficient matrix with a PCA-based method that defines the first k–1 eigenvectors by a pivoted QR decomposition [8, 9]. However, in Gaussian latent variable models PCA has several deficiencies: (i) significant features cannot be fully extracted from high-dimensional data; (ii) the eigenvalue of the first principal component is much larger than the eigenvalues of the remaining components.

Briefly, the estimator of the coefficient matrix is therefore crucial. Under a Gaussian latent variable model, sparse reduced-rank regression (Srrr) is a useful parsimonious model for multiple-response regression with large amounts of data [10–14]. In general, Srrr, whose purpose is to solve for an indicator matrix, can be divided into three steps that differ across algorithms: (i) estimating a reduced-rank matrix, which reduces the noise of the model and improves robustness; (ii) constructing a sparse group lasso, group bridge or group MCP term, which handles the case in which the number of samples is much smaller than the number of genes [10, 11]; (iii) formulating a minimization problem to solve for the coefficient matrix. The Srrr method has already been applied in several research areas. For example, Lin et al. (2013) used it to detect genetic networks associated with brain functional networks in schizophrenia [12]. Chen et al. (2012) proposed a weighted rank-constrained group lasso approach with two heuristic numerical algorithms and studied its large-sample asymptotics [13].

In this paper, we use the subspace assisted regression with row sparsity (SARRS) algorithm proposed by Ma et al. (2014) [14], combined with iCluster (Srrr-cluster), to discover cancer subtypes. Srrr-cluster can be regarded as a data-integration clustering method that first estimates the coefficient matrix of the latent variable model using the Srrr method, and then solves for the estimator of the design matrix by optimizing a penalized complete-data log-likelihood with a sparsity term using the Expectation-Maximization (EM) algorithm.

2 Srrr-cluster Methods

2.1 Data Types Integration and a Gaussian Latent Variable Model Representation

Mo et al. summarized how different data types fit different mathematical probability models [15]. For example, mutation status is defined as a binary variable suited to a logistic regression model; copy number loss, gain and normal status are defined as a multicategory variable suited to multilogit regression; DNA copy number data, DNA methylation data, mRNA expression data and so on are defined as continuous variables suited to a Gaussian latent variable model. In this paper, different types of continuous data are regressed using the Srrr model to discover cancer subtypes. We fuse the different types of continuous data measured on the same samples into one multiple genomic data matrix. Therefore, we employ an integrated genomic dataset harboring different levels of expression and transcriptome information to search for subtypes.

Firstly, we establish a Gaussian latent variable model:

$$ X = ZW + \varepsilon $$
(1)

here \( X = \left\{ {X_{1} , \ldots ,X_{p} } \right\} \) is the original integrated data of dimension \( n \times m \), where \( X_{1} \) may denote DNA methylation data of dimension \( n \times m_{1} \), \( X_{2} \) may denote DNA copy number data of dimension \( n \times m_{2} \), \( X_{p} \) may denote mRNA expression data of dimension \( n \times m_{p} \), and so forth. \( Z \) is the design matrix of dimension \( n \times l \), \( W \) is the coefficient matrix of dimension \( l \times m \), and \( \varepsilon \) is the error term, with the additional assumptions that \( Z\sim N(0,I) \) and \( \varepsilon \sim N(0,\psi ) \). Here \( p \) is the number of genomic data types, \( n \) is the number of samples, \( m \) is the number of genes, and \( l \) is the number of predictors. Ding et al. (2004) noted that the K-means solution for \( Z \) can be obtained directly from the first \( k - 1 \) eigenvectors, which span a low-dimensional latent space onto which the original data are projected along each of the first \( k - 1 \) principal directions such that the total variance is maximized by PCA. Thus \( Z \) is the design matrix of dimension \( n \times (k - 1) \) that finally clusters latent tumor subtypes, and the initial value of \( Z \) is given by the first \( k - 1 \) eigenvectors from PCA, where \( k \) is the number of clusters [9].
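To make this setup concrete, the following sketch (in Python with NumPy, not part of the original work) illustrates how data blocks measured on the same samples could be column-bound into the integrated matrix \( X \) and how \( Z \) could be initialized from the first \( k - 1 \) principal directions; all function and variable names are illustrative assumptions.

```python
# A minimal sketch (assumed implementation, not the authors' code) of the
# data-integration step and the PCA initialization of Z described above.
import numpy as np

def integrate_and_init_Z(data_blocks, k):
    """data_blocks: list of (n x m_i) arrays measured on the same n samples."""
    X = np.hstack(data_blocks)                    # n x m integrated data matrix
    Xc = X - X.mean(axis=0)                       # column-center before PCA
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z0 = U[:, :k - 1] * s[:k - 1]                 # scores along the first k-1 principal directions
    return X, Z0

# Illustrative call with random stand-ins for two continuous data types
rng = np.random.default_rng(0)
X, Z0 = integrate_and_init_Z(
    [rng.standard_normal((41, 200)), rng.standard_normal((41, 300))], k=4)
```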

2.2 An Adaptive Srrr Method and Srrr-cluster

Following Eq. (1), we estimate the coefficient matrix \( W \) using an adaptive Srrr method. The goal is to reduce the rank \( r \) of \( W \) under the Gaussian latent variable model. First, two error parameters are estimated: a noise level \( \sigma = median(\sigma (X))/\sqrt {\hbox{min} (n,m)} \), where \( \sigma (X) \) is the collection of all nonzero singular values of \( X \), and a noise rank level \( \eta = \sqrt {2m} + \sqrt {2\hbox{min} (n,k)} \). These are used to determine the reduced rank \( r \) and an orthonormal matrix \( V_{(0)} \) whose columns approximately span the right singular subspace of \( W \). The estimator of \( r \) is computed by:

$$ r = \hbox{max} \{ j:\sigma_{j} (Z(Z^{\prime}Z)^{ - } Z^{\prime}X) \ge \sigma \eta \} $$
(2)

where \( \left( {Z^{\prime}Z} \right)^{ - } \) is the Moore-Penrose pseudo-inverse. The Srrr method then uses the first \( r \) right singular vectors of \( Z(Z^{\prime}Z)^{ - } Z^{\prime}X \) to form the orthonormal matrix \( V_{(0)} \):

$$ V_{(0)} = (V_{1}^{(0)} , \ldots ,V_{r}^{(0)} ) $$
(3)

By the properties of the orthonormal matrix \( V_{(0)} \), such as \( W = WV_{(0)} V^{\prime}_{(0)} \), the reduced-rank matrix \( B \) can be expressed as:

$$ B = WV_{(0)} $$
(4)

of dimension \( (k - 1) \times r \), whose number of columns equals the estimated rank. What is more, \( V_{(0)} V^{\prime}_{(0)} \) is a projection matrix that approximately maps onto the right singular subspace of \( W \).
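As an illustration, the noise parameters and the estimates in Eqs. (2)-(3) could be computed as in the following sketch; the function name and the numerical safeguards are our own assumptions and are not taken from [14].

```python
# A sketch of the rank estimate (Eq. 2) and the initial right-singular basis
# V_(0) (Eq. 3) under the paper's notation; tolerances and details are assumed.
import numpy as np

def estimate_rank_and_V0(X, Z, k):
    n, m = X.shape
    # noise level: median nonzero singular value of X over sqrt(min(n, m))
    sx = np.linalg.svd(X, compute_uv=False)
    sigma = np.median(sx[sx > 1e-12]) / np.sqrt(min(n, m))
    # noise rank level: sqrt(2m) + sqrt(2 min(n, k))
    eta = np.sqrt(2 * m) + np.sqrt(2 * min(n, k))
    # project X onto the column space of Z:  Z (Z'Z)^- Z' X
    PX = Z @ np.linalg.pinv(Z.T @ Z) @ (Z.T @ X)
    U, s, Vt = np.linalg.svd(PX, full_matrices=False)
    r = max(1, int(np.sum(s >= sigma * eta)))     # Eq. (2)
    V0 = Vt[:r].T                                 # Eq. (3): first r right singular vectors (m x r)
    return r, V0
```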

For the sake of simplicity, Ma et al. adopt a sparse group lasso penalty in this model, in which each row of \( B \) is regarded as a group and all groups have the same size \( r \) [14]. Each row is sparsified through its \( \ell_{2} \) norm as follows:

$$ \rho (B;\lambda ) = \lambda \sum\limits_{j = 1}^{k - 1} {\left\| {B_{j * } } \right\|_{2} } $$
(5)

where \( \lambda \) is the penalty level.
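The row-wise penalty in Eq. (5), together with the block soft-thresholding operator that a proximal-type solver for Eqs. (7)-(8) would apply to the rows of \( B \), can be sketched as follows; the proximal formulation is a standard choice for group lasso and is our assumption, not a detail specified in the paper.

```python
# A minimal sketch of the group penalty in Eq. (5) and its proximal operator
# (row-wise block soft-thresholding); the solver choice is assumed, not from the paper.
import numpy as np

def row_group_penalty(B, lam):
    """rho(B; lambda) = lambda * sum_j ||B_{j*}||_2   (Eq. 5)."""
    return lam * np.sum(np.linalg.norm(B, axis=1))

def prox_row_group(B, lam):
    """Shrink each row of B toward zero as a group: B_{j*} <- B_{j*} * max(0, 1 - lam/||B_{j*}||_2)."""
    norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-12)
    return B * np.maximum(0.0, 1.0 - lam / norms)
```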

Following these steps, the Srrr method constructs a bias-variance tradeoff objective, with the reduced-rank term representing the variance part and the sparse lasso term representing the bias part, using the SARRS algorithm:

$$ W = \mathop {\arg \hbox{min} }\limits_{{W \in \Re^{(k - 1) \times m} }} \left\{ {\left\| {XV_{(0)} V_{(0)}^{\prime} - ZWV_{(0)} V_{(0)}^{\prime} } \right\|_{F}^{2} /2 + \rho (WV_{(0)} V_{(0)}^{\prime} ;\lambda )} \right\} $$
(6)

We can further reduce the computational cost by first solving:

$$ B_{(1)} = \mathop {\arg \hbox{min} }\limits_{{B \in \Re^{(k - 1) \times r} }} \left\{ {\left\| {XV_{(0)} - ZB} \right\|_{F}^{2} /2 + \rho (B;\lambda )} \right\} $$
(7)

However, \( B_{(1)} \) is only an approximation of \( WV_{(0)} \), because the columns of \( V_{(0)} \) only approximately span the right singular subspace of \( W \).

It is worth noting that the right singular subspace of \( W \) is exactly the same as that of \( ZW \). In the next step, we estimate the left singular subspace \( U_{(1)} \in {\mathbb{R}}^{n \times r} \) of \( ZB_{(1)} \). Owing to (4), \( U_{(1)} \) estimates the left singular subspace of \( ZWV_{(0)} \), which in turn equals the left singular subspace of \( ZW \). By the same line of reasoning, \( U_{(1)} U^{\prime}_{(1)} \) is a projection matrix that accurately maps onto the left singular subspace of \( ZW \). Then we compute the right singular vectors \( V_{(1)} \in {\mathbb{R}}^{m \times r} \) of \( U_{(1)} U^{\prime}_{(1)} X \), which span the right singular subspace of \( ZW \). In this way, an accurate estimate of the right singular vectors of \( W \) is obtained. Finally, \( V_{(1)} \) is used instead of \( V_{(0)} \) to solve:

$$ B_{(2)} = \mathop {\arg \hbox{min} }\limits_{{B \in \Re^{(k - 1) \times r} }} \{ \left\| {XV_{(1)} - ZB} \right\|_{F}^{2} /2 + \rho (B;\lambda )\} $$
(8)

Finally, we compute the estimated indicator matrix as \( W = B_{(2)} V^{\prime}_{(1)} \).
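Putting the pieces together, a simplified sketch of the two-pass estimation of \( W \) (Eqs. 7-8) might look as follows; the proximal-gradient inner solver, the step size and the iteration count are our own assumptions and are not prescribed in [14].

```python
# A simplified sketch of the SARRS-style two-pass estimation of W (Eqs. 7-8).
# The proximal-gradient inner solver, step size and iteration count are assumptions.
import numpy as np

def _prox_rows(B, t):
    """Row-wise block soft-thresholding with threshold t."""
    norms = np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-12)
    return B * np.maximum(0.0, 1.0 - t / norms)

def solve_group_lasso(Z, Y, lam, n_iter=500):
    """argmin_B ||Y - Z B||_F^2 / 2 + lam * sum_j ||B_{j*}||_2, via proximal gradient."""
    step = 1.0 / (np.linalg.norm(Z, 2) ** 2)       # 1 / Lipschitz constant of the gradient
    B = np.zeros((Z.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = Z.T @ (Z @ B - Y)
        B = _prox_rows(B - step * grad, step * lam)
    return B

def sarrs_estimate_W(X, Z, V0, lam):
    r = V0.shape[1]
    # First pass (Eq. 7): B_(1) approximates W V_(0)
    B1 = solve_group_lasso(Z, X @ V0, lam)
    # Left singular subspace U_(1) of Z B_(1)
    U1, _, _ = np.linalg.svd(Z @ B1, full_matrices=False)
    U1 = U1[:, :r]
    # Refined right singular vectors V_(1) of U_(1) U_(1)' X
    _, _, Vt1 = np.linalg.svd(U1 @ (U1.T @ X), full_matrices=False)
    V1 = Vt1[:r].T
    # Second pass (Eq. 8) and the final estimate W = B_(2) V_(1)'
    B2 = solve_group_lasso(Z, X @ V1, lam)
    return B2 @ V1.T
```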

Given two or more types of data from the same cohort of patients, our Srrr-cluster method first fuses these data into an integrated matrix and then uses the optimized PCA to compute a design matrix for the integrated data. The next step is to use the adaptive Srrr method to calculate the coefficient matrix under the Gaussian latent variable model, which projects the sample \( \times \) gene space of the original data onto an eigenarray \( \times \) eigengene subspace. Finally, we use the iCluster method to discover cancer subtypes.
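As a summary, the sketch below strings the previous sketches together into one illustrative pipeline. It reuses the hypothetical functions integrate_and_init_Z, estimate_rank_and_V0 and sarrs_estimate_W defined above, and substitutes plain K-means on a crude latent projection for the full iCluster EM step, so it is only a rough approximation of the Srrr-cluster procedure rather than the actual implementation.

```python
# An illustrative end-to-end pipeline composed from the sketches above
# (integrate_and_init_Z, estimate_rank_and_V0, sarrs_estimate_W must be in scope).
# K-means on a latent projection stands in here for the full iCluster EM step.
import numpy as np
from sklearn.cluster import KMeans

def srrr_cluster_sketch(data_blocks, k, lam):
    X, Z = integrate_and_init_Z(data_blocks, k)     # fuse data, initialize design matrix
    r, V0 = estimate_rank_and_V0(X, Z, k)           # reduced rank and initial right basis
    W = sarrs_estimate_W(X, Z, V0, lam)             # adaptive Srrr estimate of W
    scores = X @ np.linalg.pinv(W)                  # crude stand-in for the EM estimate of E[Z|X]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(scores)
    return W, labels
```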

2.3 Evaluation Metric

We use three commonly used metrics to evaluate the performance of Srrr-cluster in identifying subtypes in these cancers. (i) Silhouette score, a measure of cluster homogeneity, defined as \( s(i) = (b(i) - a(i))/\hbox{max} (a(i),b(i)) \), where \( a(i) \) is the average dissimilarity between sample \( i \) and all other samples of the same subtype, and \( b(i) \) is the average dissimilarity between sample \( i \) and all samples of the other subtypes. A silhouette value close to 1 means the sample is appropriately clustered [16]. (ii) The P value of the Cox log-rank test, which assesses the significance of the difference in survival profiles between subtypes [17]. (iii) The proportion of deviance (POD), which evaluates the degree of cluster separation through a diagonal block structure. We set a matrix \( A = ZZ^{T} ,A \in \Re^{n \times n} \). The elements of A are then normalized as \( a_{ij} /\sqrt {a_{ii} a_{jj} } \) for i = 1,…,n and j = 1,…,n, and negative values are set to zero, which orders cancers belonging to the same cluster into an adjacent structure. If the diagonal block structure were perfect, all elements within the diagonal blocks would be nonzero and all elements outside the diagonal blocks would be zero. Comparing A with the perfect diagonal block structure, we therefore define a deviance measure d that sums the extent to which elements within the diagonal blocks of A are zero and elements outside the diagonal blocks are positive. POD is defined as \( d/n^{2} \), so POD lies between 0 and 1. Small values of POD indicate strong cluster separability, and large values indicate poor cluster separability [6].
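For concreteness, the POD score described above could be computed roughly as follows once cluster labels are known; the silhouette score is available in scikit-learn, and the construction of the "perfect" block structure from the labels is our own reading of the description, which may differ in detail from the original iCluster implementation [6].

```python
# A rough sketch of the POD computation described above; the exact deviance
# definition of the original iCluster implementation may differ in detail.
import numpy as np
from sklearn.metrics import silhouette_score   # (i) silhouette score, from scikit-learn

def pod_score(Z, labels):
    """Z: n x (k-1) design matrix; labels: length-n cluster assignment."""
    order = np.argsort(labels)                  # put samples of the same cluster adjacent
    Zo, lab = Z[order], np.asarray(labels)[order]
    A = Zo @ Zo.T                               # n x n similarity matrix
    d = np.sqrt(np.maximum(np.diag(A), 1e-12))
    A = np.maximum(A / np.outer(d, d), 0.0)     # normalize a_ij / sqrt(a_ii a_jj), clip negatives
    perfect = (lab[:, None] == lab[None, :]).astype(float)   # ideal diagonal block structure
    return np.abs(A - perfect).sum() / A.size   # deviance scaled by n^2

# Example usage: sil = silhouette_score(Z, labels); pod = pod_score(Z, labels)
```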

3 Results

3.1 Subtypes Discovery in Breast Cancer

Using DNA copy number and mRNA expression measured on the same cDNA microarrays containing 6691 genes from Pollack et al. [1], obtained from 37 primary breast cancers and four breast cancer cell lines, we compared the Srrr-cluster results with iCluster. As is well known, the expression profiles of the four cell line samples (BT474, T47D, MCF7 and SKBR3) should be similar to each other, so they can be considered a subtype distinct from the rest of the tumor samples. In addition, HER2/ERBB2, located near 17q12, is an important prognostic factor for breast cancer; HER2-positive (overexpressing or amplified) breast cancers differ greatly from other types of breast cancer in clinical features, biological behavior and treatment modalities.

Figure 1 shows the diagonal block structures under the sparse solution \( \lambda = (0.2,0.2) \) and rank r, corresponding to iCluster (k = 4) and Srrr-cluster (k = 4, k = 3 and k = 5), respectively. The POD values of the clustering solutions are 0.1519533, 0.1254317, 0.2478124 and 0.1259145, respectively. Considering the POD values, the four clusters obtained by the Srrr-cluster method appear to be the best solution. Figure 2 shows the heatmaps of the DNA copy number and mRNA expression profiles when the samples are split into four clusters using the Srrr-cluster method. Careful analysis of the four clusters in the combined heatmaps shows that cluster 1 is composed of the four cell lines and that cluster 2 shows DNA amplification and mRNA overexpression associated with HER2/ERBB2.

Fig. 1.

Diagonal block structures obtained using the iCluster (k = 4) and Srrr-cluster (k = 4, k = 3 and k = 5) methods.

Fig. 2.

Heatmaps of copy number (DNA) and gene expression (mRNA) with samples arranged by the integrated cluster assignment under the Srrr-Cluster model. (Color figure online)

3.2 Subtypes Discovery in GBM

The GBM dataset contains miRNA (534 genes) and mRNA expression (1740 genes) data from 73 patients with GBM [18]. We used three evaluation metrics to evaluate the results of Srrr-cluster and iCluster: (i) the silhouette scores, (ii) the P values and (iii) the POD values. The results of these three metrics are listed in Table 1. According to these metrics, with the Srrr-cluster method the within-cluster coherence is stronger and the between-cluster separability is better.

Table 1. Three evaluation metrics comparing iCluster and Srrr-cluster (three-cluster solution)

4 Discussion

The Srrr-cluster method can find a more suitable coefficient matrix that projects the original data onto an eigengene-eigenarray subspace when analyzing datasets with small sample sizes and large numbers of variables. In this paper, we proposed the Srrr-cluster method for cancer subtype discovery. Compared with the iCluster method, our method can identify more stable clusters. However, because Srrr-cluster is built on top of iCluster, it shares a major limitation: it requires a priori gene selection. In the future, we will explore how to solve this problem.