1 Introduction

Principal component analysis (PCA) is among the most commonly used statistical methods for exploratory analysis of multivariate data [e.g., 10]. By seeking a low-rank approximation to the data matrix, PCA allows us to reduce the dimensionality of the data, and oftentimes serves as a useful first step to capture the essential features in the data. In particular, PCA has been widely used in analyzing gene expression data collected for multiple time points or across different biological conditions [1, 35, 38]. While PCA is appropriate to analyze data matrices, data sometimes come in the format of higher order tensors, or multilinear arrays. In particular, our work here is motivated by characterizing the spatiotemporal gene expression patterns of the human brain based on gene expression profiles collected from multiple brain regions of both developing and adult post-mortem human brains.

The human brain is a sophisticated and complex organ that contains billions of cells with different morphologies, connectivity and functions [e.g., 11]. Different brain regions have specific compositions of cell types, expressing unique combinations of genes at different developmental periods. Recent advances in sequencing and micro-dissection technology have provided us with new and powerful tools to take a closer look at this complex system. Many studies have been conducted to date to collect spatiotemporal expression data, to identify spatial and temporal signatures of gene regulation in the brain, and to gain insights into biological processes of interest such as brain development, central nervous system formation, and the shaping of brain anatomical structure, among others [8, 12, 16, 24, 29, 30, 37].

The spatiotemporal expression data may be modeled by a third order multilinear array, or tensor, with one index for gene, one for region, and another for time. Because the classical PCA can only be applied to data matrices, previous analyses of such data often consider the spatial and temporal patterns separately. To characterize temporal patterns of gene expression, data from different regions are first pooled and treated as replicates, before applying PCA. Similarly, when extracting spatial patterns of gene expression, data from different time points are combined so that PCA can be applied. Such analyses have yielded useful insights into gene regulation in the spatiotemporal transcriptome [12, 19]. But the data pooling precludes us from understanding the heterogeneity in temporal dynamics across different regions of the brain, or the evolution of spatial gene regulation patterns over time. There is a clear demand for statistical methods that can more effectively utilize the tensor structure of spatiotemporal expression data.

To this end, we introduce in this article a higher order generalization of the classical PCA, hereafter referred to as tensor PCA, to better characterize spatial and temporal gene expression dynamics. As in the classical PCA, we seek the best low-rank orthogonal approximation to the data tensor. The orthogonality among the rank-one components is automatically satisfied by the classical PCA but is essential for our purpose. It not only ensures that the components can be interpreted in the same fashion as in the classical PCA, but also is necessary for the low-rank approximation to be well-defined. Unlike in the case of matrices, low-rank approximation to a higher order tensor without orthogonality is ill-posed, and the best approximation may not even exist [e.g., 4]. Even with orthogonality, however, low-rank approximation to a higher order tensor is in general NP-hard to compute [e.g., 9]. Heuristic or approximation algorithms are often adopted, and they often lead to suboptimal statistical performance [e.g., 25]. Achieving a balance between computational and statistical efficiency when dealing with higher order tensors has been an active area of research in recent years. For our purposes, we propose an efficient algorithm that combines tensor unfolding and power iteration to compute the principal components under the tensor PCA framework. We also show that our estimates are not only easy to compute but also attain the optimal rate of convergence under suitable conditions.

Numerical experiments further demonstrate the merits of our proposed method. We also applied the method to the spatiotemporal expression data from [12]. We found that the proposed tensor PCA approach can effectively reduce the dimensionality of the data while preserving the inherent structure among the genes. In particular, through clustering analysis, we show that tensor PCA reveals interesting relationships between gene functions and the spatiotemporal dynamics of gene regulation. To fix ideas, we focus on spatiotemporal expression data in this paper. Our methodology, however, is readily applicable to other settings where data come in the form of tensors.

The rest of the article is organized as follows. Section 2 introduces the proposed tensor PCA methodology. Section 3 reports the results from simulation studies. Section 4 presents an application of the proposed methodology to a spatiotemporal brain gene expression data set. Finally, we conclude with some remarks and discussion in Sect. 5. All proofs are provided in the supplementary materials.

2 Methodology

Denote by \(x_{gst}\) an appropriately normalized and transformed expression measurement for gene g, in region s, at time t, where \(g=1,\ldots , d_G\), \(s=1,\ldots , d_S\), and \(t=1,\ldots , d_T\), and \(d_G\), \(d_S\) and \(d_T\) are the number of genes, regions, and time points, respectively. In many applications, we may also have replicate measurements so that \(x_{gst}\) is a vector rather than a scalar. To fix ideas, we shall focus on the case where there is no replicate; if necessary, we can average over replicate measurements to convert \(x_{gst}\) from a vector to a scalar. Treatment of the more general situation is analogous albeit more cumbersome in notation.

2.1 From Classical PCA to Tensor PCA

As mentioned above, the classical PCA is often applied to estimate spatial and temporal patterns of gene regulation separately. Consider, for example, inferring the spatial patterns of gene regulation. Let

$$\begin{aligned} {\bar{x}}_{gs\cdot }=\frac{1}{d_T}\sum _{t=1}^{d_T} x_{gst}, \end{aligned}$$

be the averaged expression measurement for gene g in region s, and write \({\varvec {x}}_g:= ({\bar{x}}_{g1\cdot },\ldots , {\bar{x}}_{gd_S\cdot })^\top\). The classical PCA then extracts the leading principal components, which can be computed through the singular value decomposition of the \(d_G\times d_S\) data matrix \(({\varvec {x}}_1,\ldots ,{\varvec {x}}_{d_G})^\top\). Denote by \({\varvec {v}}_k:= (v_{k1},\ldots ,v_{kd_S})^\top\) the kth leading principal component and \({\varvec {u}}_k:= (u_{k1},\ldots ,u_{kd_G})^\top\) its normalized loadings, that is, with \(\ell _2\) norm \(\Vert {\varvec {u}}_k\Vert =1\). Then, after appropriate centering, the observed expression measurements can be written as

$$\begin{aligned} {\bar{x}}_{gs\cdot }=\sqrt{d_G}\sum _{k=1}^r\lambda _k u_{kg} v_{ks}+{\bar{\epsilon }}_{gs}, \end{aligned}$$
(1)

where \(\lambda _1\ge \lambda _2\ge \cdots \ge \lambda _r>0\) so that \(\sqrt{d_G}\lambda _k\) is the kth largest singular value of the data matrix \(({\bar{x}}_{gs\cdot })_{1\le g\le d_G, 1\le s\le d_S}\), and the idiosyncratic noise terms \({\bar{\epsilon }}_{gs}\) are iid centered normal random variables. Note that, in (1), the scaling factor \(\sqrt{d_G}\) ensures that \(\lambda _k^2\) (more precisely, \(\lambda _k^2+\text{var} ({\bar{\epsilon }}_{gs})\)) can also be understood as the kth largest eigenvalue of the covariance matrix of \(({\bar{x}}_{gs\cdot })_{1\le s\le d_S}\) when these are viewed as independent random vectors for \(g=1,\ldots , d_G\).
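For concreteness, this pooled spatial PCA takes a few lines of R; the following is a minimal sketch (variable names are ours), assuming `Xbar` is the centered \(d_G\times d_S\) matrix of time-averaged measurements \({\bar{x}}_{gs\cdot }\):

```r
# Pooled spatial PCA via the singular value decomposition, as in model (1).
sv <- svd(Xbar)
lambda_hat <- sv$d / sqrt(nrow(Xbar))  # eigenvalues, undoing the sqrt(d_G) scaling
v_hat <- sv$v                          # columns: spatial principal components
u_hat <- sv$u                          # columns: normalized gene loadings
```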

Obviously, because measurements from different time points are pooled, the principal components extracted this way can only be identified with spatial patterns averaged over all time points, and thus cannot capture spatial patterns that evolve over time. A similar problem arises when we pool data from different regions to extract principal components for temporal patterns. In order to model the spatial and temporal dynamics jointly, we now consider a generalization of PCA that specifically accounts for the tensor structure of the expression data.

The expression data \({\varvec {X}}= (x_{gst})_{1\le g\le d_G,1\le s\le d_S,1\le t\le d_T}\) can be conveniently viewed as a third order tensor of dimension \(d_G\times d_S\times d_T\). It is clear that the pooled data matrix can be expressed as

$$\begin{aligned} ({\varvec {x}}_1,\ldots ,{\varvec {x}}_{d_G})^\top ={\varvec {X}}\times _3 \left ( \frac{1}{d_T}\varvec{1}_{d_T}\right) , \end{aligned}$$

where \(\varvec{1}_d\) is a d dimensional vector of ones, and \(\times _j\) between a tensor and vector stands for multiplication along its jth index, that is,

$$\begin{aligned} ({\varvec {A}}\times _3 {\varvec {x}})_{ij}=\sum _k A_{ijk}x_k. \end{aligned}$$
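In code, mode-j multiplication amounts to unfolding the tensor with the jth index last and multiplying; a minimal base-R sketch follows (the helper name `ttv` is ours):

```r
# Mode-j multiplication of an array A with a vector x: contracts the jth index.
ttv <- function(A, x, mode) {
  d <- dim(A)
  perm <- c(setdiff(seq_along(d), mode), mode)  # move 'mode' to the last position
  M <- matrix(aperm(A, perm), ncol = d[mode])   # unfold so 'mode' runs over columns
  drop(array(M %*% x, dim = d[-mode]))          # contract, restore remaining dims
}
# Example: the pooled data matrix above is ttv(X, rep(1, dim(X)[3]) / dim(X)[3], 3)
```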

See, e.g., [14] for further discussions on tensor algebra. Instead of seeking a low-rank approximation to the pooled data matrix, we shall work directly with the data tensor \({\varvec {X}}\). More specifically, with slight abuse of notation, we shall consider the following low rank approximation to \({\varvec {X}}\):

$$\begin{aligned} {\varvec {X}}=\sqrt{d_G}\sum _{k=1}^r\lambda _k \left ( {\varvec {u}}_k\otimes {\varvec {v}}_k\otimes {\varvec {w}}_k\right) +{\varvec {E}}, \end{aligned}$$
(2)

where the eigenvalues \(\lambda _1\ge \cdots \ge \lambda _r>0\), the \({\varvec {u}}_k\)s, \({\varvec {v}}_k\)s and \({\varvec {w}}_k\)s are orthonormal vectors in \({\mathbb {R}}^{d_G}\), \({\mathbb {R}}^{d_S}\) and \({\mathbb {R}}^{d_T}\), respectively, and \({\varvec {E}}= (e_{gst})\) is the residual tensor consisting of independent idiosyncratic noise following a normal distribution \(N (0,\sigma ^2)\). Here \(\otimes\) stands for the outer product, so that

$$\begin{aligned} x_{gst}=\sqrt{d_G}\sum _{k=1}^r\lambda _k u_{kg} v_{ks}w_{kt}+e_{gst},\qquad 1\le g\le d_G, 1\le s\le d_S, 1\le t\le d_T. \end{aligned}$$

Conceptually, model (2) can be viewed as a natural multiway generalization of the model for the classical PCA. Similar to the classical PCA, such a tensor decomposition allows us to conveniently capture the spatial dynamics and temporal dynamics by \({\varvec {v}}_k\)s and \({\varvec {w}}_k\)s, respectively. The loading of each gene for a particular interaction of spatial and temporal dynamics is then represented by \({\varvec {u}}_k\)s.

2.2 Estimation for Tensor PCA

Clearly, any interpretation of the data based on the tensor PCA model (2) depends on our ability to estimate the principal components \({\varvec {v}}_k\)s and \({\varvec {w}}_k\)s from the expression data \({\varvec {X}}\). Naturally, we can consider estimating them via maximum likelihood, leading to the problem of computing the best rank-r approximation to the data tensor \({\varvec {X}}\). In the case of the usual PCA, such a task can be accomplished by applying SVD to the data matrix. For the tensor PCA model, this is a more delicate issue because low-rank approximation to a generic tensor can be hard to compute, at least in the worst case. To address this challenge, we introduce an approach that combines tensor unfolding and power iteration, and show that it estimates the tensor principal components in an efficient way, both computationally and statistically.

2.2.1 Tensor Unfolding

A commonly used heuristic to overcome this problem is tensor unfolding. In particular, in our case, we may unfold \({\varvec {X}}\) into a \(d_G\times (d_S\cdot d_T)\) matrix \({\mathcal {M}} ({\varvec {X}})\) by collapsing the second and third indices, that is,

$$\begin{aligned} {[}{\mathcal {M}} ({\varvec {X}})]_{i, (j-1)d_T+k}=X_{ijk},\qquad 1\le i\le d_G, 1\le j\le d_S, 1\le k\le d_T. \end{aligned}$$

It is clear that

$$\begin{aligned} {\mathcal {M}} ({\varvec {X}})=\sqrt{d_G}\sum _{k=1}^r \lambda _k {\varvec {u}}_k\otimes \text {vec} ({\varvec {v}}_k\otimes {\varvec {w}}_k)+{\mathcal {M}} ({\varvec {E}}), \end{aligned}$$

where \(\text {vec} (\cdot )\) vectorizes a matrix into a vector of appropriate dimension. This suggests that \(\{\text {vec} ({\varvec {v}}_k\otimes {\varvec {w}}_k): 1\le k\le r\}\) are the top right singular vectors of \({\mathbb {E}}[{\mathcal {M}} ({\varvec {X}})]\) and can therefore be estimated by applying singular value decomposition to \({\mathcal {M}} ({\varvec {X}})\). Denote by \(\sqrt{d_G}{\widehat{\lambda }}_k\) the kth leading singular value of \({\mathcal {M}} ({\varvec {X}})\), and \({\widehat{{\varvec {h}}}}_k\) its corresponding right singular vector. We can reshape \({\widehat{{\varvec {h}}}}_k\) into a \(d_S\times d_T\) matrix \(\text {vec}^{-1} ({\widehat{{\varvec {h}}}}_k)\), that is

$$\begin{aligned} {[}\text {vec}^{-1} ({\widehat{{\varvec {h}}}}_k)]_{ij}= ({\widehat{{\varvec {h}}}}_k)_{ (i-1)d_T+j},\qquad \forall 1\le i\le d_S, 1\le j\le d_T. \end{aligned}$$

An estimate of \({\varvec {v}}_k\) and \({\varvec {w}}_k\) can then be obtained as the leading left and right singular vectors, denoted by \({\widehat{{\varvec {v}}}}_k\) and \({\widehat{{\varvec {w}}}}_k\) respectively, of \(\text {vec}^{-1} ({\widehat{{\varvec {h}}}}_k)\). It turns out that this simple approach yields consistent estimates of the \(\lambda _k\)s, \({\varvec {v}}_k\)s and \({\varvec {w}}_k\)s. More specifically, we have

Theorem 1

There exists an absolute constant \(C>0\) such that for any simple eigenvalue \(\lambda _k\) (\(1\le k\le r\)) under the tensor PCA model (2), if the eigen-gap

$$\begin{aligned} g_k:=\min \left\{ \lambda _{k-1}^2-\lambda _k^2,\lambda _k^2-\lambda _{k+1}^2\right\} \ge C\left ( \sigma ^2+\sigma \lambda _1\right) \left ( d_Sd_T/d_G\right) ^{1/2}, \end{aligned}$$

with the convention that \(\lambda _0=\infty\) and \(\lambda _{r+1}=0\), then

$$\begin{aligned} \max \left\{ {\widehat{\lambda }}_k^2-\lambda _k^2, 1-|\langle {\widehat{{\varvec {v}}}}_k,{\varvec {v}}_k\rangle |, 1-|\langle {\widehat{{\varvec {w}}}}_k,{\varvec {w}}_k\rangle |\right\} \le C\left ( \sigma ^2+\sigma \lambda _1\right) g_k^{-1} (d_Sd_T/d_G)^{1/2}, \end{aligned}$$

with probability tending to one as \(d_G\rightarrow \infty\).

Theorem 1 indicates that the eigenvalue \(\lambda _k\) and its associated eigenvectors \({\varvec {v}}_k\) and \({\varvec {w}}_k\) can be estimated consistently whenever the eigen-gap

$$\begin{aligned} g_k\gg \sigma ^2 (d_Sd_T/d_G)^{1/2}. \end{aligned}$$

In the context of spatiotemporal expression data, the number of genes \(d_G\) is typically much larger than \(d_Sd_T\). Therefore, even if the eigen-gap is constant, the spatial and temporal principal components can still be consistently estimated.
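The unfolding estimator is straightforward to implement. The following base-R sketch (the function name is ours) follows the construction above, assuming `X` is the \(d_G\times d_S\times d_T\) data array:

```r
# Unfolding estimate of (lambda_k, v_k, w_k): SVD of the d_G x (d_S * d_T)
# unfolding, then a second SVD of the refolded k-th right singular vector.
tpca_unfold <- function(X, k) {
  d <- dim(X)                                      # c(d_G, d_S, d_T)
  M <- matrix(aperm(X, c(1, 3, 2)), nrow = d[1])   # M[g, (s-1)*d_T + t] = X[g,s,t]
  sv <- svd(M)
  H  <- matrix(sv$v[, k], nrow = d[3])             # refold h_k; d_T x d_S, the
  sv2 <- svd(H)                                    #   transpose of vec^{-1} above
  list(lambda = sv$d[k] / sqrt(d[1]),              # undo the sqrt(d_G) scaling
       v = sv2$v[, 1],                             # spatial component estimate
       w = sv2$u[, 1])                             # temporal component estimate
}
```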

2.2.2 Power Iteration

Although Theorem 1 shows that the eigenvalue and eigenvector estimates obtained via our tensor unfolding scheme are consistent under fairly general conditions, they can be further improved. Indeed, we can use them as the initial values for power iteration or alternating least squares to yield estimates that converge to the truth at faster rates.

Power iteration is perhaps the most commonly used algorithm for tensor decomposition [39]. We assume the noise standard deviation \(\sigma\) is known. In practice, when \(\sigma\) is unknown, one can estimate it from the sample variance of the residual tensor based on the initial estimate. Specifically, let \({\varvec {b}}^{[0]}\) and \({\varvec {c}}^{[0]}\) be initial values for \({\varvec {v}}_k\) and \({\varvec {w}}_k\), and let \({\varvec {a}}\), \({\varvec {b}}\) and \({\varvec {c}}\) be the estimates of \({\varvec {u}}_k\), \({\varvec {v}}_k\) and \({\varvec {w}}_k\), respectively. Then at the mth (\(m\ge 1\)) iteration, we update \({\varvec {a}}\), \({\varvec {b}}\) and \({\varvec {c}}\) as follows (a code sketch is given after the updates below):

  • Let \({\varvec {a}}^{[m]}={\varvec {a}}/\Vert {\varvec {a}}\Vert\) where

    $$\begin{aligned} {\varvec {a}}={\varvec {X}}\times _2{\varvec {b}}^{[m-1]}\times _3{\varvec {c}}^{[m-1]}; \end{aligned}$$
    (3)
  • Let \({\varvec {b}}^{[m]}={\varvec {b}}/\Vert {\varvec {b}}\Vert\) where

    $$\begin{aligned} {\varvec {b}}={\varvec {X}}\times _1{\varvec {a}}^{[m]}\times _3{\varvec {c}}^{[m-1]}-\sigma ^2{\varvec {b}}^{[m-1]}; \end{aligned}$$
    (4)
  • Let \({\varvec {c}}^{[m]}={\varvec {c}}/\Vert {\varvec {c}}\Vert\) where

    $$\begin{aligned} {\varvec {c}}={\varvec {X}}\times _1{\varvec {a}}^{[m]}\times _2{\varvec {b}}^{[m]}-\sigma ^2{\varvec {c}}^{[m-1]}. \end{aligned}$$
    (5)
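A minimal sketch of these updates in R, reusing the `ttv` helper from Sect. 2.1 (the function name and the fixed iteration count are ours; \(\sigma\) is assumed known or pre-estimated):

```r
# Modified power iteration for the k-th component, following eqs. (3)-(5).
# Mode indices in the comments refer to the original tensor; the remaining
# modes are renumbered after each contraction, which the ttv calls reflect.
tpca_power <- function(X, v0, w0, sigma, n_iter = 20) {
  b <- v0; cc <- w0
  for (m in seq_len(n_iter)) {
    a  <- ttv(ttv(X, b, 2), cc, 2)                  # X x_2 b x_3 c, eq. (3)
    a  <- a / sqrt(sum(a^2))
    b  <- ttv(ttv(X, a, 1), cc, 2) - sigma^2 * b    # X x_1 a x_3 c - sigma^2 b, eq. (4)
    b  <- b / sqrt(sum(b^2))
    cc <- ttv(ttv(X, a, 1), b, 1) - sigma^2 * cc    # X x_1 a x_2 b - sigma^2 c, eq. (5)
    cc <- cc / sqrt(sum(cc^2))
  }
  list(u = a, v = b, w = cc)
}
# Typical use: initialize from the unfolding estimate of Sect. 2.2.1, e.g.
# init <- tpca_unfold(X, k); fit <- tpca_power(X, init$v, init$w, sigma)
```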

The following theorem shows that the algorithm, after a certain number of iterations, yields estimates of the tensor principal components at an optimal convergence rate.

Theorem 2

Let \({\varvec {b}}^{[m]}\) and \({\varvec {c}}^{[m]}\) be the estimates of \({\varvec {v}}_k\) and \({\varvec {w}}_k\) from the mth modified power iteration with initial values \({\varvec {b}}^{[0]}={\widehat{{\varvec {v}}}}_k\) and \({\varvec {c}}^{[0]}={\widehat{{\varvec {w}}}}_k\) obtained by tensor unfolding as described before. Suppose that the conditions of Theorem 1 hold. Then there exist absolute constants \(C_1,C_2>0\) such that if

$$\begin{aligned} \lambda _k^2g_k\ge C_1 (\sigma ^2+\lambda _1\sigma )\lambda _1^2\sqrt{\frac{d_Sd_T}{d_G}}, \end{aligned}$$

then for any

$$\begin{aligned} m\ge -C_2\log \left ( \lambda _k^{-2} (\sigma ^2+\lambda _1\sigma )\sqrt{\frac{d_S+d_T}{d_G}}\right) , \end{aligned}$$

we have

$$\begin{aligned} \max \left\{ 1-|\langle {\varvec {b}}^{[m]},{\varvec {v}}_k\rangle |, 1-|\langle {\varvec {c}}^{[m]},{\varvec {w}}_k\rangle |\right\} =O_p\left ( \lambda _k^{-2} (\sigma ^2+\lambda _1\sigma )\sqrt{\frac{d_S+d_T}{d_G}}\right) ,\quad \text{as\ }d_G\rightarrow \infty . \end{aligned}$$

Note that Theorem 2 only requires that the number of genes \(d_G\) diverges, which is the most relevant setting for spatiotemporal expression data. If the singular values \(\lambda _1,\ldots ,\lambda _r\) are simple and finite, as is typically the case in practice, then Theorem 2 indicates that the spatial and temporal principal components can be estimated at the rate of convergence \(\sqrt{ (d_S+d_T)/d_G}\). This is to be compared with the unfolding estimates, which converge at the rate \(\sqrt{d_Sd_T/d_G}\).

It is also worth noting that, assuming the \(\lambda _k\)s and \(\sigma\) are finite, the rate of convergence given by Theorem 2 is optimal in the following sense. Suppose that \({\varvec {v}}_k\) is known in advance; it is not hard to see that \({\varvec {X}}\times _2{\varvec {v}}_k\) is then a sufficient statistic for \({\varvec {w}}_k\). Because \({\varvec {w}}_k\) is the usual principal component of \({\varvec {X}}\times _2{\varvec {v}}_k\), it follows from classical theory for principal components [see, e.g., 26] that the optimal rate of convergence for estimating \({\varvec {w}}_k\) is of the order \(\sqrt{d_T/d_G}\). Similarly, even if \({\varvec {w}}_k\) is known a priori, the optimal rate of convergence for estimating \({\varvec {v}}_k\) would be of the order \(\sqrt{d_S/d_G}\). Obviously, not knowing either \({\varvec {v}}_k\) or \({\varvec {w}}_k\) only makes their estimation more difficult. Therefore, the rate of convergence established in Theorem 2 is the best attainable.

A key difference between the power iteration described above and the usual one is that we subtract \(\sigma ^2{\varvec {b}}^{[m-1]}\) and \(\sigma ^2{\varvec {c}}^{[m-1]}\) when updating \({\varvec {b}}\) and \({\varvec {c}}\) at each iteration. This modification is motivated by a careful examination of the effect of the noise \({\varvec {E}}\) on the power iteration. Although not essential for the performance of the final estimate, this adjustment allows for faster convergence of the power iterations. A careful inspection of the proof of Theorem 2 suggests that, because of the consistency of the initial value, the results continue to hold even without this adjustment.

Our approach is developed for effectively modeling brain spatiotemporal gene expression data, which form a third order tensor. For tensors of higher order, our algorithm generalizes naturally. Suppose the tensor is \({\varvec {X}}= \lambda {\varvec {a}}\otimes {\varvec {b}}\otimes {\varvec {c}}\otimes {\varvec {d}}+ {\varvec {E}}\). For the tensor unfolding step, we can recursively apply the singular value decomposition to the unfolded tensor with the first dimension fixed. For example, we can first estimate \({\varvec {a}}\) and \({\varvec {b}}\otimes {\varvec {c}}\otimes {\varvec {d}}\) by applying SVD to the unfolded tensor with the last three dimensions flattened, and then estimate \({\varvec {b}}\), \({\varvec {c}}\), and \({\varvec {d}}\) by our tensor unfolding algorithm. For the power iteration, we first append \(\times _4{\varvec {d}}^{[m-1]}\) to the first term of Eqs. (3), (4), and (5), and then update \({\varvec {d}}\) following the pattern of Eq. (5):

$$\begin{aligned} {\varvec {d}}= {\varvec {X}}\times _1{\varvec {a}}^{[m]}\times _2{\varvec {b}}^{[m]}\times _3{\varvec {c}}^{[m]} - \sigma ^2{\varvec {d}}^{[m-1]}. \end{aligned}$$
(6)
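For instance, the update (6) can be written with the same `ttv` helper as before (a hypothetical sketch; `X4`, `a`, `b`, `cc`, and `d_old` denote the fourth-order data tensor and the current factor estimates):

```r
# Update for the fourth factor, eq. (6): contract the first three modes, then
# apply the same sigma^2 correction and normalization as in eqs. (3)-(5).
d_new <- ttv(ttv(ttv(X4, a, 1), b, 1), cc, 1) - sigma^2 * d_old
d_hat <- d_new / sqrt(sum(d_new^2))
```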

The same approach extends to even higher order tensors. The theoretical and numerical study of the algorithm for higher order tensors is beyond the scope of this paper.

3 Numerical Experiments

To demonstrate the merits of the tensor PCA method described in the previous section, we conducted several sets of simulations.

3.1 Convergence of Power Iteration

To gain further insights into the operating characteristics of the power iteration, we examine how the estimation error changes from iteration to iteration for 50 typical simulation runs with \(\lambda =4\) and \(d=200\) in Fig. 1. The estimation error evidently decreases quickly over the iterations, and the algorithm converges in only a few iterations. This has important practical implications, as computation is often a significant issue when dealing with tensor data.

Fig. 1 Estimation error as a function of the number of iterations for 50 typical simulated datasets with \(\lambda =4\) and \(d=200\)

3.2 Principal Components Estimation Accuracy

3.2.1 Synthetic Data Generation

We begin with a simple simulation setup designed to investigate the effect of dimensionality and signal strength on the accuracy of tensor estimation. In particular, we simulated data tensors from the following rank-one tensor PCA model:

$$\begin{aligned} {\varvec {X}}=\sqrt{d}\lambda {\varvec {u}}\otimes {\varvec {v}}\otimes {\varvec {w}}+{\varvec {E}}. \end{aligned}$$
(7)

To assess the effect of dimensionality, we consider cubic tensors in \({\mathbb {R}}^{d\times d\times d}\) where \(d=25, 50, 100\), and set \(\lambda =4\). The principal components \({\varvec {v}}\) and \({\varvec {w}}\), as well as the loadings \({\varvec {u}}\), were uniformly sampled from the unit sphere in \({\mathbb {R}}^{d}\). We recall that a uniform sample from the unit sphere in \({\mathbb {R}}^d\) can be obtained as \(Z/\Vert Z\Vert\) where \(Z\sim N (0,I_d)\). The noise tensor \({\varvec {E}}\) is a Gaussian ensemble whose entries are independent standard normal variables.
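A sketch of the data generation in R (a minimal example, with \(d=50\); names are ours):

```r
# One draw from the rank-one model (7).
set.seed(1)
d <- 50; lambda <- 4
rsphere <- function(d) { z <- rnorm(d); z / sqrt(sum(z^2)) }  # uniform on the sphere
u <- rsphere(d); v <- rsphere(d); w <- rsphere(d)
E <- array(rnorm(d^3), dim = c(d, d, d))                      # standard Gaussian noise
X <- sqrt(d) * lambda * outer(outer(u, v), w) + E             # signal plus noise
```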

3.2.2 Baseline Approaches and Metrics

For each simulated data tensor \({\varvec {X}}\), we compared our proposed approach (TPCA) with the following baseline approaches:

  • Tensor unfolding (UFD) The unfolding approach described in Sect. 2.2.1. This baseline serves to isolate the effect of the subsequent power iteration.

  • Power iteration (PI1, PI5, or PI10) Power iteration (described in Sect. 2.2.2) started from random initial values. We repeat the power iteration from 1, 5, or 10 different random starts and denote the variants by PI1, PI5, and PI10, respectively. This serves to assess the benefit of using the tensor unfolding estimate as the initial value.

We use \(2\cdot \max \{1-|\langle {\hat{{\varvec {v}}}}, {\varvec {v}}\rangle |, 1-|\langle {\hat{{\varvec {w}}}}, {\varvec {w}}\rangle |\}\) as the estimation error, which, for unit vectors after sign alignment, equals \(\max (\Vert {\hat{{\varvec {v}}}}-{\varvec {v}}\Vert ^2, \Vert {\hat{{\varvec {w}}}}-{\varvec {w}}\Vert ^2)\).
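In code, the error metric is simply (assuming unit-norm estimates):

```r
# 2 * max(1 - |<v_hat, v>|, 1 - |<w_hat, w>|); for unit vectors with
# nonnegative inner products this equals max(||v_hat - v||^2, ||w_hat - w||^2).
est_error <- function(v_hat, w_hat, v, w)
  2 * max(1 - abs(sum(v_hat * v)), 1 - abs(sum(w_hat * w)))
```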

3.2.3 Results

Each simulation setting was repeated 200 times; the resulting metrics are reported in Table 1. Compared with UFD, TPCA clearly improves the quality of the estimates, especially in high-dimensional settings. These observations agree with the theoretical analysis presented in Theorems 1 and 2. Compared with PI1, PI5, and PI10, TPCA achieves the best performance. As the dimension grows, power iteration from random starts requires more repetitions; in the simulations we report the smallest error among the repetitions, which is infeasible in real applications since the ground truth is unknown. PI1 performs worse than UFD, suggesting that pure power iteration from a random start cannot yield good results even compared with tensor unfolding alone. The improvement of TPCA is least pronounced in the easiest case, \(d=25\), where the PI5 estimate already appears to be quite accurate; in all other cases, TPCA significantly improves upon power iteration from random starts.

Table 1 Principal components estimation errors comparison for rank 1 tensor

3.3 Synthetic Gene Expression Data

Our development was motivated by the analysis of spatiotemporal expression data. To better assess the performance of our method in such a context, we now consider a simulation setting designed to mimic it. More specifically, we simulated a spatiotemporal gene expression data tensor with \(d_G=2000\) genes, \(d_S=10\) spatial regions, and \(d_T=13\) time points. We assume the following tensor PCA model of rank three:

$$\begin{aligned} {\varvec {X}}=\sqrt{d_G}\lambda \sum _{k=1}^3{\frac{4-k}{3}} \cdot {\varvec {u}}_k\otimes {\varvec {v}}_k\otimes {\varvec {w}}_k+{\varvec {E}}, \end{aligned}$$

where we fix \(\sigma =1\) and \(\lambda =3\). The eigenvectors \({\varvec {u}}_k\in {\mathbb {R}}^{d_G}\), \({\varvec {v}}_k\in {\mathbb {R}}^{d_S}\) and \({\varvec {w}}_k\in {\mathbb {R}}^{d_T}\) were uniformly sampled from the Grassmannian of conformable dimensions. This simulation setting allows us to appreciate the effects of the eigen-gap and eigenvalues, as well as of unequal dimensions, on the accuracy of our estimates.

Spatiotemporal gene expression data are usually heterogeneous: the noise variance may differ across genes, locations, and time periods. To study the effect of heterogeneity along the gene dimension, we let the noise standard deviation increase linearly, \(\sigma _i = i/d_G\) for \(i=1,\ldots ,d_G\). Heterogeneous noise along the spatial and temporal dimensions is generated analogously.
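As an illustration, gene-wise heteroskedastic noise of this form can be generated as follows (a sketch; the dimensions follow the setting above):

```r
# Noise whose standard deviation grows linearly along the gene dimension.
dG <- 2000; dS <- 10; dT <- 13
sig_g <- (1:dG) / dG
E_het <- array(rnorm(dG * dS * dT), dim = c(dG, dS, dT)) * sig_g
# '*' recycles sig_g along the first (gene) dimension in column-major order,
# so entry (g, s, t) has standard deviation g / dG.
```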

3.3.1 Principal Components Estimation Accuracy

We compare the proposed tensor PCA approach with the classical PCA approach for estimating each of the flattened spatiotemporal principal components. For each simulation we consider homogeneous noise as well as noise that is heterogeneous across the gene, spatial, and temporal dimensions. For each principal component, we use \(\Vert x-{\varvec {v}}\otimes {\varvec {w}}\Vert\) as the error metric, where \(x={\hat{{\varvec {v}}}}\otimes {\hat{{\varvec {w}}}}\) for TPCA and, for PCA, x is the corresponding reshaped right singular vector. The results reported in Table 2 confirm our theoretical findings and suggest the superior performance of the proposed approach over the classical PCA, regardless of the heterogeneity of the noise. It is worth noting that gene-wise heterogeneous noise has a smaller effect on the principal component estimation, whereas spatial and temporal heterogeneity make the estimation more challenging.

Table 2 Principal components estimation errors comparison for synthetic rank 3 noisy spatiotemporal gene expression tensor with different noise types

3.3.2 Signal Tensor Estimation Accuracy

We compared TPCA with the classical PCA, tensor unfolding (UFD), and the higher order orthogonal iteration of tensors (HOOI) [3] on signal tensor estimation. HOOI is a specific orthogonal Tucker decomposition algorithm that generalizes the matrix singular value decomposition; it is an iterative approach that computes the leading singular vectors for each mode while fixing the others. See Sheehan and Saad [33] for details of the algorithm. We simulated the data in the same way as described earlier in this section. For a signal tensor \({\varvec {T}}\) and estimated signal tensor \({\hat{{\varvec {T}}}}\), we compute the relative error \(\Vert {\hat{{\varvec {T}}}} - {\varvec {T}}\Vert _F / \Vert {\varvec {T}}\Vert _F\), where \(\Vert \cdot \Vert _F\) denotes the Frobenius norm of a tensor. The relative errors are reported in Table 3. TPCA again shows the best performance among all approaches.
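The relative error is computed as (a one-line sketch):

```r
# Relative error in Frobenius norm between estimated and true signal tensors.
rel_error <- function(T_hat, T_true)
  sqrt(sum((T_hat - T_true)^2)) / sqrt(sum(T_true^2))
```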

Table 3 Signal tensor estimation comparisons among the classical PCA, Higher Order Orthogonal Iteration of Tensors, tensor unfolding, and tensor PCA, in terms of relative errors averaged over 200 simulation runs

3.4 Clustering Based on Tensor PCA

Oftentimes in practice, PCA is not the final goal of data analysis. It is commonly used as an initial step to reduce the dimensionality before further analysis. For example, PCA based clustering is often performed when dealing with gene expression data; see, e.g., [38]. Similarly, our tensor PCA can serve the same purpose. To investigate the utility of our approach in this capacity, we conducted a set of simulation studies where, for each simulated dataset, we first estimated the loadings \({\varvec {u}}_k\)s and then applied clustering to the loadings. To fix ideas, we adopted the popular k-means technique for clustering, although other alternatives could also be employed.

Motivated by the dataset from [12], which we shall discuss in further details in the next section, we simulated a data tensor of size \({\mathbb {R}}^{1087\times 10\times 13}\) from the following model:

$$\begin{aligned} {\varvec {X}}=\sum _{k=1}^3\lambda _k {\varvec {u}}_k\otimes {\varvec {v}}_k\otimes {\varvec {w}}_k+\sigma ^2{\varvec {E}}. \end{aligned}$$
(8)

where \(\lambda _1=337.8\), \(\lambda _2=27.1\), \(\lambda _3=9.0\), and \(\sigma =0.2\). These values, along with the principal components \({\varvec {v}}_k\) and \({\varvec {w}}_k\), are based on estimates obtained by fitting a tensor PCA model to the data from [12]. The clusters, induced by the loadings \({\varvec {u}}_k\), were generated as follows. For a given number K of clusters, we first generated the cluster centroids \(C\in {\mathbb {R}}^{K\times 3}\) as the right singular vector matrix of a K by 3 Gaussian random matrix. We then assigned the 1087 observations to clusters and generated the observed tensor with \(\sigma =1,5,10,20\), representing different levels of signal-to-noise ratio.

For comparison purposes, we also considered using the classical PCA to reduce the dimensionality. For each method, we took the loadings from the first four directions and then applied k-means to infer the cluster memberships. We used the adjusted Rand index to measure clustering quality. The results for each method across a variety of settings, averaged over 200 runs, are reported in Table 4. They suggest that tensor PCA based clustering is superior to that based on the classical PCA.
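The clustering and evaluation step can be sketched as follows, assuming `U` is the matrix of estimated loadings and `true_labels` the simulated cluster memberships (the adjusted Rand index here is taken from the mclust package):

```r
# k-means on the estimated loadings, evaluated by the adjusted Rand index.
cl  <- kmeans(U, centers = 5, nstart = 25)$cluster
ari <- mclust::adjustedRandIndex(cl, true_labels)
```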

Table 4 Clustering performance comparison between the classical PCA and tensor PCA, in terms of Adjusted Rand index averaged over 200 simulation runs

4 Application to Human Brain Expression Data

We now turn to the spatiotemporal expression data from Kang et al. [12] that we alluded to earlier.

4.1 Dataset Description and Preprocessing

4.1.1 Dataset Description

Kang et al. [12] reported the generation and analysis of exon-level transcriptome and associated genotyping data from multiple brain regions and neocortical areas of developing and adult post-mortem human brains. The dataset was also analyzed by Liu et al. [22] for ultrahigh dimensional feature selection and by Lin et al. [21] for modeling spatiotemporal patterns with Markov random fields. It consists of spatiotemporal gene expression measurements from post-mortem human brains, each sampled at one time period across all neocortex regions, covering 11 areas and 15 time periods. The areas include orbitofrontal cortex (OFC), dorsolateral prefrontal cortex (DFC), ventral frontal cortex (VFC), primary motor cortex (M1C), primary somatosensory cortex (S1C), posterior inferior parietal cortex (IPC), primary auditory cortex (A1C), superior temporal cortex (STC), medial prefrontal cortex (MFC), inferior temporal cortex (ITC), and primary visual cortex (V1C). The time periods span from embryonic (period 1) to late adulthood (period 15); we refer readers to Table 5 for details. We exclude the first two time periods (periods 1 and 2) and one neocortex region (V1C) due to their high variation. For time periods with more than one brain, we aggregate over samples for each time and region combination. We refer readers to the papers above for further description of the dataset.

Table 5 Periods of human development and adulthood as defined by [12]: M postnatal months; PCW post-conceptional weeks; Y postnatal years

4.1.2 Dataset Preprocessing

Following [8], we selected genes with reproducible spatial patterns across individuals according to their correlations between samples, leading to a total of 1087 genes. To reduce individual variation, we averaged across subjects for each (gene, region, time period) combination. This yields a data tensor of size \(d_G=1087\), \(d_S=10\) and \(d_T=13\).

Before applying the tensor PCA, we centered the gene expression measurements by subtracting the mean expression level for each gene, because we are primarily interested in the spatial and temporal dynamics of the expression levels. Removing the mean level is, however, more subtle than in the classical PCA: we want to remove both the mean spatial effect and the mean temporal effect. More specifically, we applied tensor PCA to \({\tilde{{\varvec {X}}}}\in {\mathbb R}^{d_G\times d_S\times d_T}\) where

$$\begin{aligned} {\tilde{x}}_{gst}=x_{gst}-{\bar{x}}_{g\cdot t}-{\bar{x}}_{gs\cdot }+{\bar{x}}_{g\cdot \cdot } \end{aligned}$$

and \({\varvec {X}}\) is the original data tensor.
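In R, this two-way centering can be done directly (a sketch over the \(d_G\times d_S\times d_T\) array `X`):

```r
# Remove the mean spatial and temporal effects for each gene.
xbar_gt <- apply(X, c(1, 3), mean)   # average over regions:  x-bar_{g.t}
xbar_gs <- apply(X, c(1, 2), mean)   # average over time:     x-bar_{gs.}
xbar_g  <- apply(X, 1, mean)         # per-gene grand mean:   x-bar_{g..}
Xtilde <- X
for (s in seq_len(dim(X)[2]))
  for (t in seq_len(dim(X)[3]))
    Xtilde[, s, t] <- X[, s, t] - xbar_gt[, t] - xbar_gs[, s] + xbar_g
```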

4.2 Analysis Based on Tensor PCA

4.2.1 Choose the Number of Components

We first conduct the tensor decomposition using our proposed algorithm. As in the classical PCA, we can examine the scree plot to assess the contribution of each component in the tensor PCA model. The contribution from the principal components quickly tapers off (Fig. 2), and we choose the top three components accordingly. Note that choosing the number of components is trickier for clustering analysis; we use the scree plot here to fix ideas. For more discussion on how to choose the optimal number of components, we refer readers to Yeung and Ruzzo [38].

Fig. 2 Scree plot of the tensor PCA for the dataset from [12]

4.2.2 Biological Interpretations of the Spatial and Temporal Factors

The top three spatial and temporal principal components are shown in Fig. 3, and the top three spatial factors are mapped onto the brain neocortex regions in Fig. 4, where darker colors represent larger values. L1 to L8 denote the physical slice coordinates of the brains. The first factor increases from L1 and L2 to L4 and L5. The second factor achieves its maximum at M1C and S1C and decays with distance from these two regions. The third factor shows strong signals in MFC and ITC.

Fig. 3 Temporal and spatial factors of tensor PCA for the dataset from [12]

Fig. 4 Spatial factors on the locations of the neocortex

To better understand these three factors, we conducted gene set enrichment analysis based on Gene Ontology (http://geneontology.org) for each factor. We calculated the relative weight of factor i for each gene as \(|u_i|/{\sum _{j=1}^3 |u_j|}\), where \(u\in {\mathbb {R}}^3\) is the corresponding row of the gene loading matrix. For each factor, we formed the gene set from the genes in the top 15% of relative weights. The results are presented in Table 6. Factor 1 relates to anatomical structure development, consistent with its spatial gradient pattern and the decreasing magnitude of its temporal pattern. Factor 2 is enriched for sensory organ development, in agreement with its large magnitude in S1C; in addition, the regulation of anatomical structure morphogenesis term supports the smooth spatial pattern from S1C and M1C to MFC and ITC. Factor 3 is enriched in innervation, related to aging [2, 18], startle response, associated with ITC [32], and chemical synaptic transmission, related to aging [23].
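The gene-set construction can be expressed compactly (a sketch under our reading of the selection rule; `U` denotes the \(1087\times 3\) gene loading matrix):

```r
# Relative weight of each factor per gene, and the top 15% gene set per factor.
W <- abs(U) / rowSums(abs(U))
gene_sets <- lapply(1:3, function(j) which(W[, j] >= quantile(W[, j], 0.85)))
```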

Table 6 Gene enrichment analysis results on factors

To further examine the meaning of the spatial factors, we use the three spatial factors as the coordinates of each of the 10 locations in a 3D plot, shown in Fig. 5. Remarkably, the spatial patterns of these locations are fairly consistent with the physical locations of these neocortex regions in the brain.

Fig. 5 Loadings on the top three spatial factors for each of the ten neocortex regions

It is interesting to note from the temporal trajectories that the first two factors show clear signs of prenatal development (until Period 7), while the third factor exhibits increasing influence from young childhood (from Period 11). Factor 1 shows a spatial gradient effect in which the expression level tapers off from ITC to MFC or the other way around. Remarkably, the same effect was reported in [24], where it is explained by intrinsic signaling controlled partly by graded expression of transcription factors; some representative genes, such as FGFR3 and CBLN2, were found to be preserved in both human and mouse neocortex. Taking the temporal effect into consideration, factor 1 indicates that the gradient effect diminishes from early fetal (Period 3) to late fetal (Period 7) development, and almost vanishes after early infancy. The same effect was observed in [30], where areal transcriptional profiles become more synchronized during postnatal development. Factor 2 suggests the importance of prenatal development of M1C and S1C: both areas are well represented in the second factor while essentially absent from the other factors. This observation seems to agree with recent findings in neuroscience that activation patterns of extremely preterm infants' primary somatosensory cortex are predictive of future development outcomes; see, e.g., [27]. Factor 3 distinguishes middle adulthood (Period 14) and late adulthood (Period 15), with values in ITC and MFC that differ from the other eight regions. This effect was reported in [30], where MFC and ITC were found to have a much higher number of neocortical interareal differentially expressed (DEX) genes. In terms of aging, declining metabolism in MFC correlates with declining cognitive function [5,6,7, 28], and shrinkage of ITC increases with age [31]. Considering the three factors together, we can corroborate the temporal hourglass pattern observed in [30]: a large number of DEX genes exist before infancy (Period 8), areal differences almost vanish from infancy to adulthood (Period 14), and they reappear in late adulthood (Period 15).

4.2.3 Clustering Analysis

Finally, we used the factors estimated from our tensor PCA model as the basis for clustering. In particular, we applied k-means clustering with \(k=5\) clusters to the three dimensional factor loadings. The resulting cluster sizes are 156, 167, 332, 280, and 152, respectively. Gene set enrichment analysis based on Gene Ontology was performed for each group, with the results presented in Table 7.

Table 7 Gene enrichment analysis results. We applied the TPCA algorithm to reduce the spatiotemporal pattern of each gene to three dimensions

These results show a clear separation among different functional groups, further indicating that the spatiotemporal pattern of a gene informs its functionality. Moreover, enriched terms such as anatomical structure development and forebrain development are highly associated with the spatial areas of the neocortex, which again suggests that the tensor principal components are meaningful.

5 Conclusions

In this paper, we have introduced a generalization of the classical PCA that can be applied to data in the form of tensors. We also proposed an efficient algorithm to estimate the principal components using a novel combination of tensor unfolding and power iteration. Both theoretical analysis and numerical experiments point to the efficacy of our method. Although the methodology is generally applicable, our development was motivated by the analysis of spatiotemporal expression data, which in recent years have become commonplace in studying brain development among other biological processes. An application of our method to one such example further demonstrates its potential usefulness.

6 Software

Software in the form of an R package with complete documentation is available at https://github.com/TerenceLiu4444/tensorpca.