1 Introduction

Because the cost of multiomics measurements, including gene expression, promoter methylation, SNP, histone modification, and miRNA expression, continues to fall, an increasing number of experimental conditions can be considered. For example, if gene expression is measured for various tissues of patients, the data are better formatted not as a matrix but as a tensor, patients vs tissues vs genes. In this case, TD rather than PCA is the suitable technique to apply. On the other hand, in the previous chapter, we attempted various integrated analyses, e.g., of miRNA and mRNA expression, of mRNA expression and methylation, and of mRNA expression in two species. If genes or features are shared in the integrated analysis, generating a case I or II tensor and applying TD to it is a suitable treatment. In the following, we introduce some applications of TD based unsupervised FE to each of these cases.

2 PTSD Mediated Heart Diseases

The first example to be processed in tensor form is PTSD mediated heart disease. Although this disease has already been analyzed in the previous chapter (Sect. 6.4.1), the data set analyzed there includes only one tissue, the heart. Nonetheless, to understand how PTSD mediates heart disease, we need to know the gene expression of both heart and brain. Fortunately, such a data set exists. In this section, I would like to demonstrate the usefulness of TD based unsupervised FE applied to gene expression of multiple tissues, aiming to understand PTSD mediated heart disease, based upon a recent publication [24].

The data set analyzed is composed of the following samples (Table 7.1). It includes ten tissues under eight experimental conditions. This data set is formatted as a five-mode tensor, \(x_{ij_1j_2j_3j_4} \in \mathbb {R}^{43699 \times 2 \times 10 \times 2 \times 3 }\), of the \(i\)th probe, subjected to the \(j_1\)th treatment (\(j_1 = 1\): control, \(j_1 = 2\): treated [stress-exposed] samples), in the \(j_2\)th tissue [\(j_2 = 1\): amygdala (AY), \(j_2 = 2\): hippocampus (HC), \(j_2 = 3\): medial prefrontal cortex (MPFC), \(j_2 = 4\): septal nucleus (SE), \(j_2 = 5\): striatum (ST), \(j_2 = 6\): ventral striatum (VS), \(j_2 = 7\): blood, \(j_2 = 8\): heart, \(j_2 = 9\): hemibrain, \(j_2 = 10\): spleen], with the \(j_3\)th stress duration (\(j_3 = 1\): 10 days, \(j_3 = 2\): 5 days) and the \(j_4\)th rest period after application of stress (\(j_4 = 1\): 1.5 weeks, \(j_4 = 2\): 24 h, \(j_4 = 3\): 6 weeks). Zero values are assigned to missing observations (e.g., measurements at 6 weeks after a 5-day period of stress are not available).
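The formatting described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the number of probes is reduced to a toy value, the expression values are synthetic, and the list `observed` is hypothetical except for the one missing combination named in the text (5-day stress followed by 6 weeks of rest); the actual list of available samples is the one in Table 7.1.

```python
import numpy as np

# Toy number of probes; the chapter's tensor is 43699 x 2 x 10 x 2 x 3.
n_probes = 5
# Axes: probe i, treatment j1, tissue j2, stress duration j3, rest period j4.
shape = (n_probes, 2, 10, 2, 3)

# Start from zeros so that unobserved combinations stay zero, as in the text.
x = np.zeros(shape)

rng = np.random.default_rng(0)
# Hypothetical 0-based (j3, j4) pairs treated as observed. The pair (1, 2),
# i.e. 5-day stress followed by 6 weeks of rest, is deliberately left out.
observed = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)]

for j3, j4 in observed:
    # Synthetic values stand in for the real measurements.
    x[:, :, :, j3, j4] = rng.normal(size=(n_probes, 2, 10))

# The missing combination keeps its zero fill.
n_nonzero_missing = np.count_nonzero(x[:, :, :, 1, 2])
print(x.shape, n_nonzero_missing)
```

Filling with zeros keeps the tensor rectangular, which is what allows a single decomposition over all five modes.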

Table 7.1 Samples used in this study

HOSVD algorithm (Fig. 3.8) is applied to \(x_{ij_1j_2j_3j_4}\) as

$$\displaystyle \begin{aligned} x_{ij_1j_2j_3j_4} = \sum_{\ell_5=1}^{43699}\sum_{\ell_1=1}^{2}\sum_{\ell_2=1}^{10}\sum_{\ell_3=1}^{2}\sum_{\ell_4=1}^{3} G(\ell_1,\ell_2,\ell_3, \ell_4, \ell_5) u^{(j_1)}_{\ell_1j_1}u^{(j_2)}_{\ell_2j_2}u^{(j_3)}_{\ell_3j_3}u^{(j_4)}_{\ell_4j_4}u^{(i)}_{\ell_5i}\end{aligned} $$
(7.1)

where \(u^{(i)}_{\ell _5 i} \in \mathbb {R}^{43699 \times 43699}\), \(u^{(j_1)}_{\ell _1 j_1} \in \mathbb {R}^{2 \times 2}\), \(u^{(j_2)}_{\ell _2 j_2} \in \mathbb {R}^{10 \times 10}\), \(u^{(j_3)}_{\ell _3 j_3} \in \mathbb {R}^{2 \times 2}\), and \(u^{(j_4)}_{\ell _4 j_4} \in \mathbb {R}^{3 \times 3}\) are singular value vectors and \(G(\ell _1,\ell _2,\ell _3, \ell _4, \ell _5) \in \mathbb {R}^{2 \times 10 \times 2 \times 3 \times 43699}\) is a core tensor, whose modes are ordered as its indices \(\ell _1, \ldots , \ell _5\).
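To make Eq. (7.1) concrete, a minimal NumPy sketch of HOSVD is given below. It is not the implementation used in [24]; the function names `unfold`, `hosvd`, and `reconstruct` are ours, the tensor is a small random stand-in, and the core tensor here keeps the axis order of the input tensor (probe first) rather than the index order of \(G\) in Eq. (7.1). Each singular value vector is a column of the corresponding factor matrix.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: rows index the chosen mode, columns all other modes."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (core, factors) such that contracting core with factors gives x."""
    factors = []
    for mode in range(x.ndim):
        # Left singular vectors of each unfolding; column l is the l-th vector.
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Project the given mode onto its singular value vectors.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core tensor by every factor matrix along its own mode."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2, 10, 2, 3))  # toy stand-in for 43699 x 2 x 10 x 2 x 3
core, factors = hosvd(x)
print(core.shape, np.allclose(reconstruct(core, factors), x))
```

With `full_matrices=False`, a mode that is longer than the product of the other modes (as the probe mode is in the real data) yields fewer singular value vectors than its length, which is sufficient for an exact reconstruction.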

We need to specify which singular value vector attributed to genes, \(\mathbf {\mathit {u}}^{(i)}_{\ell _5}\), is used for gene selection. For this purpose, we investigate the other singular value vectors, \(\mathbf {\mathit {u}}^{(j_k)}_{\ell _k}, 1 \leq k \leq 4\). One of the important points is tissue specificity. What I would like to find is a set of genes expressed in common between heart and brain. Because \(1 \leq j_2 \leq 6\) and \(j_2 = 9\) correspond to brain and \(j_2 = 8\) corresponds to heart, we need to find a \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2}\) expressed in common for \(j_2 = 1, 2, \cdots , 6, 8, 9\). Figure 7.1 shows the singular value vectors, \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2}, 1 \leq \ell _2 \leq 10\). Although no \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2}\) fully satisfies this requirement, \(\mathbf {\mathit {u}}^{(j_2)}_4\) comes relatively close: its components are negative in common for \(j_2 = 1, 2, 8, 9\), which correspond to AY, HC, heart, and hemibrain. Because AY and HC are especially important in PTSD [14], it is promising that we can obtain a singular value vector expressed in common in AY, HC, and heart.

Fig. 7.1 Singular value vectors, \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2}, 1 \leq \ell _2 \leq 10\). Red horizontal broken lines show the baseline

The next important requirement is that control and stressed samples should be oppositely expressed, i.e., \(u^{(j_1)}_{\ell _1 1} = - u^{(j_1)}_{\ell _1 2}\). This requirement is easy to fulfill because either \(u^{(j_1)}_{\ell _1 1} = - u^{(j_1)}_{\ell _1 2}\) or \(u^{(j_1)}_{\ell _1 1} = u^{(j_1)}_{\ell _1 2}\) must be satisfied when there are only two classes and the mean is zero. Figure 7.2 shows the singular value vectors, \(\mathbf {\mathit {u}}^{(j_1)}_{\ell _1}, \ell _1=1,2\). As expected, \(\ell _1 = 2\) corresponds to the reversed sign between control and stressed samples.

Fig. 7.2 Singular value vectors, \(\mathbf {\mathit {u}}^{(j_1)}_{\ell _1}, \ell _1 =1,2\). Red horizontal broken lines show the baseline

Because there are no known pre-defined desirable properties for the experimental conditions, i.e., the stress and rest periods, we should find \(G(\ell _1 = 2, \ell _2 = 4, \ell _3, \ell _4, \ell _5)\) with larger absolute values. Table 7.2 shows the top-ranked \(G\) by absolute value. We find that \(\ell _5 = 1, 4, 11\) are associated with the \(G(2, 4, \ell _3, \ell _4, \ell _5)\) of larger absolute value. Thus we decided to attribute P-values to probes using \(\ell _5 = 1, 4, 11\), assuming a \(\chi ^2\) distribution, as

$$\displaystyle \begin{aligned} P_i = P_{\chi^2} \left [ > \sum_{\ell_5=1,4,11} \left ( \frac{u^{(i)}_{\ell_5 i}}{\sigma_{\ell_5}} \right)^2\right]. {} \end{aligned} $$
(7.2)

P-values are corrected by the BH criterion, and the 801 probes associated with adjusted P-values less than 0.01 are selected.
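The selection step of Eq. (7.2) followed by the BH correction can be sketched as below. Several points are assumptions on our side: the singular value vectors are synthetic with 20 planted outlier probes, \(\sigma _{\ell _5}\) is taken as the sample standard deviation of each vector, and the \(\chi ^2\) distribution is given three degrees of freedom because three components are summed. The helper names `chi2_sf_df3` and `bh_adjust` are ours.

```python
import math
import numpy as np

def chi2_sf_df3(x):
    """Upper tail P[chi^2 > x] for three degrees of freedom, in closed form."""
    x = np.asarray(x, dtype=float)
    erfc = np.array([math.erfc(math.sqrt(v / 2.0)) for v in x.ravel()])
    return erfc.reshape(x.shape) + np.sqrt(2.0 * x / np.pi) * np.exp(-x / 2.0)

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

rng = np.random.default_rng(1)
n_probes = 2000                       # toy size, not 43699
u = rng.normal(size=(n_probes, 12))   # column l holds u^{(i)}_{l+1}
selected = [0, 3, 10]                 # l5 = 1, 4, 11 in 0-based indexing
u[:20, selected] += 8.0               # plant 20 probes with large components

z = u[:, selected] / u[:, selected].std(axis=0)
statistic = (z ** 2).sum(axis=1)      # argument of P_chi2[> .] in Eq. (7.2)
p_values = chi2_sf_df3(statistic)
adjusted = bh_adjust(p_values)
chosen = np.flatnonzero(adjusted < 0.01)
print(len(chosen))
```

Because the standard deviation is computed over all probes, including the outliers, the null probes are slightly under-dispersed relative to a unit normal, which makes the test conservative in this toy setting.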

Table 7.2 Top-ranked \(G(\ell _1 = 2, \ell _2 = 4, \ell _3, \ell _4, \ell _5)\) with greater absolute values
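A ranking of the kind summarized in Table 7.2 can be obtained by fixing \(\ell _1 = 2\) and \(\ell _2 = 4\) and sorting the remaining core tensor entries by absolute value. The sketch below assumes a core tensor stored in the index order of \(G(\ell _1, \ell _2, \ell _3, \ell _4, \ell _5)\); the tensor is synthetic, with three large entries planted by hand so that the ranking has something to find, and does not reproduce the values of Table 7.2.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic core tensor ordered as G(l1, l2, l3, l4, l5); toy size in l5.
G = rng.normal(scale=0.1, size=(2, 10, 2, 3, 15))
# Planted large entries at l5 = 1, 4, 11 (1-based), for illustration only.
G[1, 3, 0, 0, 0] = 5.0
G[1, 3, 1, 1, 3] = -4.0
G[1, 3, 0, 2, 10] = 3.0

sub = G[1, 3]                                   # fix l1 = 2 and l2 = 4
order = np.argsort(-np.abs(sub), axis=None)[:3]  # three largest |G|
# Convert flat positions to 1-based (l3, l4, l5) triples.
top = [tuple(int(v) + 1 for v in np.unravel_index(k, sub.shape)) for k in order]
l5_used = sorted({t[2] for t in top})
print(top, l5_used)
```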

The first validation of the selected 801 probes is to check whether they are expressed distinctly between control and stressed samples, selectively in heart and brain. To confirm this, we apply a t test to the selected 801 probes between control and stressed samples for all combinations of tissue, rest period, and stress period. P-values are corrected by the BH criterion, and conditions associated with adjusted P-values less than 0.01 are considered to be those where the probes are expressed distinctly and significantly between control and stressed samples. Table 7.3 shows the results. The selected 801 probes are expressed distinctly between control and stressed samples, selectively in heart, HC, and AY (and also in spleen, because spleen is expressed oppositely to heart, HC, and AY, as shown in Fig. 7.1).

Table 7.3 Thirteen combinations of tissues and experimental conditions where the selected 801 probes are differentially expressed between stress-exposed and control samples

Here we would like to emphasize the difficulty of gene selection in this data set. As mentioned above, what we are aiming at is quite abstract, i.e., “genes expressed in common between brain and heart as well as distinctly between control and stressed samples.” Only as a result of the analysis do we realize that common expression among AY, HC, and heart is possible (through the investigation of \(\mathbf {\mathit {u}}^{(j_2)}_4\) in Fig. 7.1). Generally, it is impossible to know this combination in advance. When no clear target is given in advance, supervised methods cannot perform well, while unsupervised methods can.

In order to see how well conventional supervised methods perform, we test three methods: SAM, limma, and categorical regression analysis. The first method to be compared with TD based unsupervised FE is categorical regression analysis. For the data set shown in Table 7.1, the only possible way to apply categorical regression is to treat it as 80 classes (10 tissues vs four experimental conditions vs control and stressed samples). Although it would be better to consider pairs of control and stressed samples, this is impossible. One might think of taking ratios, but because the samples are not paired, i.e., there is no one-to-one correspondence, we cannot take ratios. Table 7.4 shows the result of categorical regression analysis. Because of the treatment as 80 classes, genes associated with any kind of distinction are detected (i.e., associated with significantly small adjusted P-values). As a result, almost all genes are judged as distinct between some combinations. This result is clearly not desirable for our purpose, “genes expressed in common between brain and heart distinctly between control and stressed samples,” because of its lack of specificity. To screen these genes, we would need some additional criterion, which TD based unsupervised FE does not require. Thus, TD based unsupervised FE is better suited to the present purpose than categorical regression.

Table 7.4 Results of gene selection based on categorical regression

Next, we apply SAM, assuming 80 classes, to the data set shown in Table 7.1. Table 7.5 shows the result of SAM. p0, which represents the contribution of the null hypothesis that no distinction exists among the 80 classes, is 1%. This means that almost all genes are distinctly expressed in some of these combinations. FDR plays the role of the adjusted P-value, and all genes are associated with FDR less than 0.01. This conclusion coincides with that of categorical regression; thus SAM is not useful for selecting “genes expressed in common between brain and heart distinctly between control and stressed samples,” either.

Table 7.5 Results by SAM

Finally, we apply limma to the data set shown in Table 7.1. Fortunately, limma enables us to select genes that are distinct between any pairs of control and stressed samples. Thus, we apply limma in two ways. One assumes 80 classes (case A in Table 7.6), and the other assumes 40 classes (case B in Table 7.6) composed of forty (10 tissues vs four experimental conditions) pairwise combinations between control and stressed samples. Possibly because of its advanced features, limma successfully avoids detecting genes that are merely distinct among some pair of the 80 classes (case A). Nevertheless, limma still detects too many positives in the 40 pairwise comparisons (case B). As expected, because of the lack of a well-defined screening criterion, the three supervised methods are of no use for finding “genes expressed in common between brain and heart as well as distinctly between control and stressed samples.” In conclusion, none of the three conventional supervised methods is as useful as TD based unsupervised FE for the present purpose.

Table 7.6 Results of gene selection based on limma

Although TD based unsupervised FE successfully identifies genes expressed distinctly between control and stressed samples in a tissue-specific manner (Table 7.3), it cannot be considered successful if the result is biologically useless. In order to evaluate the selected probes biologically, we try to identify the protein coding genes associated with these 801 probes. We find 457 genes (because of lack of space, we cannot list all 457 genes; the list is available as Additional file 5 of [24] for interested readers). We upload the 457 genes to DAVID. The result is quite promising. Table 7.7 shows the enriched KEGG pathways associated with adjusted P-values less than 0.05. They include four neurodegenerative diseases as well as one cardiac problem. Thus, these genes are quite suitable candidates for genes that cause PTSD mediated heart disease, as are those in Table 6.20, where PTSD mediated heart disease is investigated by PCA based unsupervised FE.

Table 7.7 KEGG pathway enrichment by the 457 genes identified by TD based unsupervised FE

3 Drug Discovery From Gene Expression

Drug discovery is a time-consuming and expensive process. It starts from preparing as many small molecules as possible and then tries to find one that is effective against the target disease by exhaustive search. The number of initially prepared molecules can be \(10^4\); testing this many compounds costs a huge amount of money and takes a long time. If we could reduce the number of initial candidate small molecules to one tenth, the time and cost required would be greatly reduced.

In this sense, so-called in silico drug discovery is developing with high expectations of fulfilling this requirement. In silico drug discovery aims to identify candidate small molecules without wet experiments. By making full use of recent computational power, including high-speed CPUs and huge storage that can hold massive information, as well as recently developed machine learning techniques, in silico drug discovery enables us to prepare a set of more promising candidate small molecules as drugs.

Traditionally, there are two main streams of in silico drug discovery: ligand-based drug design [1] (LBDD) and structure-based drug design [3] (SBDD). LBDD aims to identify new candidate drug compounds based upon their similarity to known drugs, and it comes in a huge variety of forms depending upon how similarity is defined. The advantages of LBDD are that it is more reliable, i.e., has a larger probability of finding true drug compounds, and that it requires smaller computational resources than SBDD. Its disadvantages are that it requires information on known drugs and that it fails to find new drug candidates that lack similarity to known drugs. In contrast, SBDD has the advantage that it can predict new candidate drugs without information on known drugs. One disadvantage of SBDD is that it requires massive computation, because it must execute docking simulations between drug candidate compounds and target proteins. Another is that it needs the tertiary structure of the protein to which individual candidate drug compounds must bind. Experimental measurement of protein tertiary structure is itself a difficult task. Although it has become much easier than before because of the invention of cryo-electron microscopy [10], it still costs much money and time. When no protein tertiary structure is available, it must be computationally predicted [6]. The prediction inevitably carries inaccuracy that affects the predicted binding affinity of small molecules.

In order to compensate for these disadvantages of LBDD and SBDD, a third option was recently proposed: drug design from gene expression [5]. Post-treatment gene expression can be used to screen candidate compounds for their ability to induce the target phenotype. This approach is very useful once post-treatment gene expression is available. In this section, we make use of TD based unsupervised FE to predict new drug targets by analyzing post-treatment gene expression [27].

Post-treatment gene expression is obtained from LINCS [20]. L1000, the expression profiling assay used by LINCS, is highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts. The gene expression profiles are available in GEO with GEO ID GSE70138. Table 7.8 summarizes the gene expression profiles. They include 13 cell lines, each treated with 100–300 compounds (denoted as “all compounds”). One problem of this data set is that it includes the expression profiles of only 978 genes, because it is measured by Luminex scanners. The gene expression profiles in individual cell lines are formatted as a tensor, \(x_{ijk} \in \mathbb {R}^{978 \times 6 \times K}\), where \(i\) denotes the gene (probe), \(j\) denotes the dose density of the drug compound, and \(k\) stands for the individual compound among the \(K\) compounds that correspond to “all compounds” in Table 7.8. The HOSVD algorithm (Fig. 3.8) is applied as

$$\displaystyle \begin{aligned} x_{ijk} = \sum_{\ell_1=1}^{978} \sum_{\ell_2=1}^{6} \sum_{\ell_3=1}^K G(\ell_1,\ell_2,\ell_3) u^{(i)}_{\ell_1 i} u^{(j)}_{\ell_2 j} u^{(k)}_{\ell_3 k}\end{aligned} $$
(7.3)

where \(\mathbf {\mathit {u}}^{(i)}_{\ell _1} \in \mathbb {R}^{978}\), \(\mathbf {\mathit {u}}^{(j)}_{\ell _2} \in \mathbb {R}^6\), \(\mathbf {\mathit {u}}^{(k)}_{\ell _3} \in \mathbb {R}^K\), are the singular value vectors, and \(G(\ell _1,\ell _2,\ell _3) \in \mathbb {R}^{978 \times 6 \times K}\) is a core tensor.

Table 7.8 The number of the inferred compounds and inferred genes associated with significant dose-dependent activity
Table 7.9 Compound–gene interactions presented in Table 7.8 that significantly overlap with interactions described in two data sets

The first step is to identify genes whose expression is altered by drug treatment. To do so, we try to identify which \(\mathbf {\mathit {u}}^{(j)}_{\ell _2}\) has monotonic dependence upon dose density. Figure 7.3 shows \(\mathbf {\mathit {u}}^{(j)}_{\ell _2}, 1 \leq \ell _2 \leq 3\) for the 13 cell lines listed in Table 7.8. It is obvious that \(\mathbf {\mathit {u}}^{(j)}_2\) shows almost linear dependence upon dose, independent of cell lines. The next task is to identify \(G(\ell _1, \ell _2, \ell _3)\) with larger absolute values in order to decide which \(\mathbf {\mathit {u}}^{(i)}_{\ell _1}\) and \(\mathbf {\mathit {u}}^{(k)}_{\ell _3}\) are used for selecting the combinations of genes and compounds that exhibit linear dose dependence. Because

$$\displaystyle \begin{aligned} G(\ell_1 \le 6, \ell_2 \le 6,\ell_3 \le 6 )= \frac{\sum_{\ell_1 \le 6, \ell_2 \le 6,\ell_3 \le 6 } G(\ell_1,\ell_2, \ell_3)^2}{\sum_{\ell_1,\ell_2,\ell_3} G(\ell_1,\ell_2, \ell_3)^2} {}\end{aligned} $$
(7.4)

exceeds 0.95 for almost all cell lines, it is decided to employ the \((\ell _1 \le 6, \ell _2 = 2, \ell _3 \le 6)\) components for FE. Nonetheless, in the case of PC3 cells, \((\ell _1 \le 8, \ell _2 = 2, \ell _3 \le 8)\) are used for FE as an exception, because the eighth component is found to have a non-negligible contribution in this cell line.
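The ratio in Eq. (7.4) is simply the fraction of the squared core tensor carried by the leading components. A minimal sketch, using a deterministic toy core tensor whose entries decay geometrically (our assumption, not the LINCS data) and a helper name `core_contribution` of our own choosing:

```python
import numpy as np

def core_contribution(G, n1, n2, n3):
    """Fraction of sum(G^2) carried by l1 <= n1, l2 <= n2, l3 <= n3 (Eq. 7.4)."""
    return float((G[:n1, :n2, :n3] ** 2).sum() / (G ** 2).sum())

# Toy core tensor: magnitudes halve with every step in l1 and l3.
decay1 = 0.5 ** np.arange(20)
decay3 = 0.5 ** np.arange(15)
G = decay1[:, None, None] * np.ones(6)[None, :, None] * decay3[None, None, :]

ratio = core_contribution(G, 6, 6, 6)
keep_six = ratio > 0.95   # the threshold quoted in the text
print(round(ratio, 4), keep_six)
```

When the ratio falls short for a given cell line, the same function can be re-evaluated with larger cut-offs, which is the situation described for PC3 cells.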

Fig. 7.3 Singular value vectors, \(\mathbf {\mathit {u}}^{(j)}_{\ell _2}, 1 \leq \ell _2 \leq 3\). Red horizontal broken lines indicate the baseline. Black: \(\ell _2 = 1\), red: \(\ell _2 = 2\), green: \(\ell _2 = 3\)
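In the text, the dose-dependent vector is identified by inspecting Fig. 7.3. As a rough numerical proxy (our assumption, not the procedure of [27]), one can compute the absolute correlation between each \(\mathbf {\mathit {u}}^{(j)}_{\ell _2}\) and the dose rank and check monotonicity. The three vectors below are hand-made toy shapes, not normalized and not the vectors of Fig. 7.3.

```python
import numpy as np

def abs_corr_with_dose(u, dose):
    """Absolute Pearson correlation between each column of u and the dose."""
    uc = u - u.mean(axis=0)
    dc = dose - dose.mean()
    return np.abs(uc.T @ dc) / (np.linalg.norm(uc, axis=0) * np.linalg.norm(dc))

dose_rank = np.arange(1.0, 7.0)   # six dose levels in increasing order (ranks)
u_j = np.column_stack([
    [0.41, 0.40, 0.42, 0.41, 0.40, 0.41],   # l2 = 1: nearly flat
    dose_rank - dose_rank.mean(),           # l2 = 2: linear in dose rank
    (dose_rank - dose_rank.mean()) ** 2,    # l2 = 3: U-shaped
])

corr = abs_corr_with_dose(u_j, dose_rank)
best = int(np.argmax(corr)) + 1             # 1-based l2
steps = np.diff(u_j[:, best - 1])
monotone = bool(np.all(steps > 0) or np.all(steps < 0))
print(best, monotone)
```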

To identify the genes and compounds associated with a significant dose-dependent activity, it is assumed that \(u^{(i)}_{\ell _1 i}, \ell _1 \le 6\), and \(u^{(k)}_{\ell _3 k}, \ell _3 \le 6\), follow independent normal distributions, and P-values are attributed to the \(i\)th gene and the \(k\)th compound using a \(\chi ^2\) distribution,

$$\displaystyle \begin{aligned} P_i = P_{\chi^2} \left[ > \sum_{\ell_1 \le 6} \left(\frac{u^{(i)}_{\ell_1 i}}{\sigma_{\ell_1}} \right)^2\right] \end{aligned} $$
(7.5)

and

$$\displaystyle \begin{aligned} P_k = P_{\chi^2} \left[ > \sum_{\ell_3 \le 6} \left(\frac{u^{(k)}_{\ell_3 k}}{\sigma_{\ell_3}} \right)^2\right] \end{aligned} $$
(7.6)

where \(\sigma _{\ell _1}\) and \(\sigma _{\ell _3}\) are the standard deviations of \(u^{(i)}_{\ell _1 i}\) and \(u^{(k)}_{\ell _3 k}\), respectively. For PC3 cells, \(\ell _1 \le 8\) and \(\ell _3 \le 8\) are used in the above equations. \(P_{\chi ^2} [>x]\) is the cumulative probability that the argument is greater than \(x\), assuming a \(\chi ^2\) distribution with eight degrees of freedom for the PC3 cell line and with six degrees of freedom for the other cell lines. \(P_i\) and \(P_k\) are adjusted by means of the BH criterion, and compounds and genes associated with adjusted P-values lower than 0.01 are selected as those associated with a significant dose-dependent cellular response. The numbers of selected genes and compounds are listed as “inferred genes” and “inferred compounds” in Table 7.8, respectively. The above process is illustrated in Fig. 7.4.
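As a worked note on Eqs. (7.5) and (7.6): for an even number of degrees of freedom \(n\), the upper tail of the \(\chi ^2\) distribution has a closed form. This is a standard property of the distribution rather than something specific to this analysis. For the six and eight degrees of freedom used here it reads

$$\displaystyle \begin{aligned} P_{\chi^2}[>x] = e^{-x/2}\sum_{m=0}^{n/2-1}\frac{(x/2)^m}{m!}, \qquad n=6:\ e^{-x/2}\left(1+\frac{x}{2}+\frac{x^2}{8}\right), \qquad n=8:\ e^{-x/2}\left(1+\frac{x}{2}+\frac{x^2}{8}+\frac{x^3}{48}\right), \end{aligned} $$

so that a larger sum of squared standardized components always yields a smaller \(P_i\) or \(P_k\).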

Fig. 7.4 Starting from the gene expression profile formatted as a tensor, \(x_{ijk}\), the singular value vectors, \(\mathbf {\mathit {u}}^{(i)}_{\ell _1}, \mathbf {\mathit {u}}^{(j)}_{\ell _2},\) and \(\mathbf {\mathit {u}}^{(k)}_{\ell _3}\), are obtained. After identifying \(\ell _2 = 2\) as associated with linear dose dependence (see Fig. 7.3), \(\ell _1 \leq 6\) and \(\ell _3 \leq 6\) are chosen for FE because of their larger contribution as defined in Eq. (7.4). Genes \(i\) and compounds \(k\) are selected using \(\mathbf {\mathit {u}}^{(i)}_{\ell _1}, \ell _1 \leq 6, \mathbf {\mathit {u}}^{(k)}_{\ell _3}, \ell _3 \leq 6\)

The next task is to identify the proteins to which the selected compounds bind. The “inferred genes” in Table 7.8 do not correspond to these proteins, because they are the genes whose mRNA expression is altered by drug treatment. Usually, the mRNA expression of the proteins to which the selected compounds bind is not itself altered by drug treatment. Thus we need to infer the proteins targeted by drug treatment. To do so, we need additional external information that lists the genes whose mRNA expression is altered by the perturbation of a gene. Then, if the “inferred genes” match the genes whose mRNA expression is altered by the perturbation of some gene, we infer the perturbed gene to be the target protein (Fig. 7.5). There can be multiple resources from which we can retrieve the list of genes whose mRNA expression is altered by a single gene perturbation. Here we employ Enrichr [11], which collects multiple data resources in order to perform various enrichment analyses. After uploading the “inferred genes” to Enrichr, we list the genes associated with adjusted P-values less than 0.01 in the category “Single gene Perturbations from GEO up.” Their number corresponds to the number of “predicted targets” in Table 7.8. This strategy is especially efficient for the LINCS data set, which includes the expression of only 978 genes. Employing the strategy in Fig. 7.5, we can identify target proteins not included in these 978 genes.

Fig. 7.5 After the drug (red hexagon) treatments, we can detect mRNAs with altered expression (filled cyan circles) along with those without altered expression (filled green circles). We have no information about the proteins (circled A, B, and C). The list of genes with altered expression can be compared with the genes whose expression is altered when gene A, B, or C is perturbed. Then, we can identify compounds that might bind to protein A, because the lists of genes whose mRNA expression is altered are common

Next we would like to evaluate whether our prediction is correct, i.e., whether the “inferred compounds” bind to the “predicted targets.” In principle, it is impossible to check the accuracy of our prediction without experiments. Thus, instead of executing experiments, we compare our prediction with known lists of target proteins of drug compounds. For this purpose, we employ two information resources, drug2gene.com [19] and DSigDB [33]. Table 7.9 shows the results of Fisher’s exact test, which evaluates the overlap between the “predicted targets” and the known target proteins of the “inferred compounds.” A P-value computed by Fisher’s exact test is regarded as significant if it is less than 0.05 (no correction for multiple comparisons is applied). In most cases, our prediction significantly overlaps with the known target proteins of the drug compounds. Thus, TD based unsupervised FE can be used for in silico drug discovery from gene expression.
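The overlap test can be sketched with the standard library alone. The version below is a one-sided Fisher’s exact test written as a hypergeometric tail sum; the function name `fisher_overlap`, the size of the gene universe, and the two gene sets are hypothetical placeholders, and whether the tests in Table 7.9 were one-sided or two-sided is not stated in the text.

```python
from math import comb

def fisher_overlap(n_universe, set_a, set_b):
    """One-sided Fisher's exact test for the overlap of two gene sets.

    Returns (overlap, odds_ratio, p_value); p_value is the hypergeometric
    probability of an overlap at least as large as the observed one.
    """
    a, b = set(set_a), set(set_b)
    k, n_a, n_b = len(a & b), len(a), len(b)
    tail = sum(comb(n_a, i) * comb(n_universe - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    p_value = tail / comb(n_universe, n_b)
    # 2 x 2 table: [[k, n_a - k], [n_b - k, n_universe - n_a - n_b + k]]
    off_diagonal = (n_a - k) * (n_b - k)
    if off_diagonal == 0:
        odds_ratio = float("inf")
    else:
        odds_ratio = k * (n_universe - n_a - n_b + k) / off_diagonal
    return k, odds_ratio, p_value

n_universe = 20000               # hypothetical number of genes considered
predicted = set(range(0, 50))    # hypothetical "predicted targets" (integer IDs)
known = set(range(40, 70))       # hypothetical known targets
overlap, odds_ratio, p_value = fisher_overlap(n_universe, predicted, known)
significant = p_value < 0.05     # threshold used in the text, uncorrected
print(overlap, odds_ratio, significant)
```

The choice of gene universe affects the P-value considerably, so it should be fixed before the comparison rather than tuned afterwards.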

It is also interesting that the “inferred compounds” largely overlap among cell lines. Because two to nine compounds are identified in each of the 13 cell lines, the total number of identified compounds could be several tens. Nevertheless, the number of compounds listed in Table 7.9 is as small as 19. In some sense, this might be evidence that our strategy is correct. It is reasonable that anti-cancer drugs are effective against multiple cancers, so a large overlap of “inferred compounds” between distinct cell lines makes sense. On the other hand, analyses based upon distinct gene expression profiles are unlikely to yield largely overlapping results without any biological reason. Therefore, the results shown in Table 7.9 are possibly trustworthy.

Although we employed single gene perturbation to infer target proteins from the list of genes whose expression is altered by drug treatment, any other database that describes gene interactions should be usable. As an alternative, we try “PPI Hub Proteins” in Enrichr instead of “Single gene Perturbations from GEO up.” The primary difference between the two is the number of genes included: “PPI Hub Proteins” includes only a few hundred genes, while “Single gene Perturbations from GEO up” includes a few thousand genes. This suggests that the results using “PPI Hub Proteins” might be less significant. Table 7.10 lists the results of Fisher’s exact test for the comparison between the targets predicted from “PPI Hub Proteins” and the drug2gene.com database. Contrary to this expectation, all cases have significant overlap with drug2gene.com. This supports our view that any kind of gene–gene interaction is usable together with TD based unsupervised FE for in silico drug discovery from gene expression.

Table 7.10 A significant overlap demonstrated between compound–target interactions presented in Table 7.8 and drug2gene.com

It might be useful to demonstrate how a more direct and simple approach fails. One possible simpler alternative is to apply linear regression

$$\displaystyle \begin{aligned} x_{ijk} = a_{ik} + b_{ik} D_j {} \end{aligned} $$
(7.7)

where \(D_j\) is the \(j\)th dose density and \(a_{ik}\) and \(b_{ik}\) are regression coefficients. Then we simply select the \(i\) and \(k\) associated with more significant P-values, as in the case of TD based unsupervised FE. In order to show that this cannot give us a reasonable set of \(i\)s and \(k\)s, we apply Eq. (7.7) to the A375 cell line ((13) in Tables 7.8, 7.9, and 7.10) as an example. After correcting the P-values that Eq. (7.7) gives by the BH criterion, we find that all compounds have adjusted P-values less than 0.01 with at least one of the genes, while all genes have adjusted P-values less than 0.01 with at least one of the compounds. Thus, by simply requiring “adjusted P-values less than 0.01” as in the case of TD based unsupervised FE, we cannot screen either genes or compounds. We can still try to select “top ranked” genes or compounds. In order to show that this cannot work well either, we apply two distinct criteria to select “top ranked” compounds as

  • Select the 10 top-ranked compounds having the largest numbers of genes associated with adjusted P-values less than 0.01.

  • Suppose \(P_{ik}\) is the P-value that Eq. (7.7) gives. Select the 10 top-ranked compounds having the smallest \(\sum _i \log P_{ik}\).

These two criteria rank compounds by how significantly they are correlated with genes through dose density, in some sense. The result is a bit disappointing (Table 7.11). Only three of the top 10 compounds are chosen in common. This suggests that it is not easy to select compounds in a robust way based simply upon the P-values that Eq. (7.7) gives. Thus, TD based unsupervised FE, which needs no criterion other than adjusted P-values, is a much better strategy than selection based upon the P-values that Eq. (7.7) gives.
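The two ranking criteria and their comparison can be sketched as follows. The matrix `P` is a synthetic stand-in for the P-values \(P_{ik}\) of Eq. (7.7) (uniform random numbers, not the A375 results), the number of compounds is a toy value, and for simplicity the same matrix is used for both criteria; the overlap obtained here therefore says nothing about Table 7.11.

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_compounds = 978, 100   # 978 genes as in the text; 100 is a toy value
# Synthetic stand-in for P_ik; rows are genes i, columns are compounds k.
P = rng.uniform(1e-6, 1.0, size=(n_genes, n_compounds))

# Criterion 1: number of genes with P-value below 0.01, per compound.
count_significant = (P < 0.01).sum(axis=0)
# Criterion 2: sum over genes of log P_ik, per compound (smaller is better).
sum_log_p = np.log(P).sum(axis=0)

top_by_count = set(np.argsort(-count_significant, kind="stable")[:10].tolist())
top_by_log_p = set(np.argsort(sum_log_p, kind="stable")[:10].tolist())
n_common = len(top_by_count & top_by_log_p)
print(n_common)
```

A small `n_common` indicates that the choice of ranking criterion, rather than the data, drives the selection.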

Table 7.11 Compounds selected by P-values that Eq. (7.7) gives, for A375 cell line ((13) in Tables 7.8, 7.9, and 7.10)

Before ending this section, I would like to mention briefly why the results of TD based unsupervised FE differ so much from those based upon linear regression, Eq. (7.7), even though both try to find the combinations of genes and compounds associated with dose dependence. As can be seen in Fig. 7.3, \(\mathbf {\mathit {u}}^{(j)}_2\), which is used for FE, is not a simple linear function of dose density. In spite of that, the dependence of \(\mathbf {\mathit {u}}^{(j)}_2\) upon dose density is quite universal, in other words, independent of cell lines. TD is the only method that can successfully identify this universal (cell line independent) functional form; there is no way to know it in advance. This cannot be achieved by any supervised method, because a supervised method cannot avoid assuming something that contradicts this universal functional form. Because of this superiority, TD based unsupervised FE can achieve the good performance shown in Tables 7.9 and 7.10.

4 Universality of miRNA Transfection

miRNA transfection is a popular method for finding miRNA target genes experimentally. Nevertheless, there is some doubt as to whether transfected miRNAs work similarly to endogenous miRNAs [9], because transfection causes various unexpected effects that are not seen when endogenous miRNAs are upregulated. Because the aim of miRNA transfection experiments is to find miRNA target genes, only genes downregulated by the transfection are searched for. Nevertheless, it is quite usual to find that many mRNAs are upregulated by transfection. These upregulated mRNAs are usually ignored, because they are not interpretable from the knowledge about conventional miRNA functions. On the other hand, Jin et al. [9] argued that miRNA transfection can cause non-specific changes in gene expression. To the best of my knowledge, there are no studies that try to identify these non-specific effects from a more positive point of view.

In this section, using TD based unsupervised FE, we aim to study how universal these non-specific gene expression alterations caused by miRNA transfection are. To do so, we collect multiple studies in which multiple miRNA transfection experiments are performed. In each study, we try to identify genes whose expression is altered in common over multiple miRNA transfection experiments. Then we check whether the genes identified in individual studies are common over multiple studies. If so, sequence-nonspecific off-target regulation of mRNA does really exist and might play some critical role in biology, too.

The identification of genes altered in common by sequence-nonspecific off-target regulation caused by miRNA transfection can be performed by TD based unsupervised FE as follows [26]. In the usual application of TD based unsupervised FE, singular value vectors associated with the desired sample dependence, e.g., the distinction between patients and healthy controls, are searched for in order to identify genes associated with such a dependence. In contrast, in the present application, we seek singular value vectors “not” associated with the distinction between transfected miRNAs, because the lack of dependence on the transfected miRNA might be evidence that the gene expression alteration caused by miRNA transfection in these genes is due to sequence-nonspecific off-target regulation, whatever its biological cause. Table 7.12 lists the 11 studies whose gene expression profiles are collected for the analysis in this study. They are clearly quite diverse: not only the cell lines used but also the transfected miRNAs differ from experiment to experiment. Both KO (knock out) and OE (over expression) are considered. Thus, if there are genes chosen in common among these eleven studies, this is quite likely caused by sequence-nonspecific off-target regulation.
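One simple way to look for a singular value vector that does “not” distinguish the transfected miRNAs is to pick the vector whose components are most nearly equal across miRNAs. The sketch below uses the coefficient of variation for this purpose; this measure, the helper name `most_constant_vector`, and the toy vectors are our assumptions and are not taken from [26].

```python
import numpy as np

def most_constant_vector(u):
    """Index of the column of u that varies least relative to its mean.

    Columns are singular value vectors attributed to transfected miRNAs.
    Returns (index, coefficient of variation of every column).
    """
    mean = u.mean(axis=0)
    spread = u.std(axis=0)
    # Guard against division by zero for columns with (near) zero mean.
    cv = spread / np.maximum(np.abs(mean), 1e-12)
    return int(np.argmin(cv)), cv

# Toy orthonormal vectors for three hypothetical transfected miRNAs.
u_mirna = np.column_stack([
    np.array([1.0, 1.0, 1.0]) / np.sqrt(3),    # same value for every miRNA
    np.array([1.0, -1.0, 0.0]) / np.sqrt(2),   # distinguishes miRNA 1 from 2
    np.array([1.0, 1.0, -2.0]) / np.sqrt(6),   # distinguishes miRNA 3
])
index, cv = most_constant_vector(u_mirna)
print(index + 1)   # 1-based index of the miRNA-independent vector
```

Genes would then be selected with the gene singular value vector paired with this miRNA-independent vector through the core tensor.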

Table 7.12 Eleven studies conducted for this analysis

Because of this diversity, not only TD based unsupervised FE but also PCA based unsupervised FE is used. If the numbers of samples used for the individual transfections in an experiment do not match one another, multiple experiments in which distinct miRNAs are transfected can hardly be formatted as a tensor. In these cases, PCA based unsupervised FE is employed instead. The individual data sets and how they are formatted as either matrices or tensors are discussed in some detail in the Appendix.

Table 7.13 shows the results. In spite of the heterogeneous data sets analyzed, they are highly consistent with one another. Thus, there might be some universal mechanisms that cause sequence-nonspecific off-target regulation.

Table 7.13 Fisher’s exact test for coincidence among 11 miRNA transfection studies for PCA or TD based unsupervised FE and t test

From the data science point of view, it is important to see whether other methods can derive sets of genes with the same degree of consistency among the 11 studies listed in Table 7.12. For the comparison, we select the t test. What we aim at is essentially to find genes expressed distinctly between control and transfected samples, and this kind of two-class comparison can also be done by the t test. In order to see whether the t test is inferior to TD and PCA based unsupervised FE, it is applied to the 11 studies. In this analysis, the samples in each study are divided into two classes: samples to which no miRNA (or a mock miRNA) was transfected and samples to which miRNAs were transfected. A two-sided t test is applied to each of the 11 studies, the obtained P-values are adjusted by the BH criterion, and probes associated with adjusted P-values less than 0.01 are selected (Table 7.14). The result is a little disappointing. For five out of the 11 studies, the t test cannot identify any differentially expressed genes. In the remaining studies, the numbers of selected genes vary from 35 to 11,060, in contrast to the numbers of genes selected by PCA or TD based unsupervised FE, which stay around \(\sim 10^2\) (Table 7.13). Such numbers are hardly trustworthy biologically, which possibly indicates a failure of the methodology.
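The adjustment by the BH criterion and the subsequent thresholding, which are used repeatedly in this chapter, can be sketched as follows. This is a plain Benjamini-Hochberg step-up implementation in standard Python; the function names are hypothetical, and the input P-values may come from any test (a t test here, or the \(\chi^2\) based P-values used later).

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values
    # never increase as the raw P-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def select(p_values, threshold=0.01):
    """Indices whose BH-adjusted P-value is below the threshold."""
    return [i for i, q in enumerate(bh_adjust(p_values)) if q < threshold]
```

With the threshold of 0.01 used in the text, `select` returns the indices of the probes that would be retained.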

Table 7.14 The number of genes selected by t test

In order to further demonstrate the inferiority of the t test to TD or PCA based unsupervised FE, we try to reproduce the results of PCA or TD based unsupervised FE in Table 7.13. Since the number of genes selected by the t test is often 0 (Table 7.14), in each experiment we instead select the top ranked genes with the smallest t test P-values, as many as were selected by PCA or TD based unsupervised FE, even when these P-values are not significant. The genes selected by the t test are clearly less coincident with one another than those selected by PCA or TD based unsupervised FE (Table 7.13), because the odds ratios are smaller and the P-values are larger. Thus, also in terms of coincidence among the 11 studies, the t test is inferior to TD or PCA based unsupervised FE.

Even though PCA or TD based unsupervised FE successfully identifies sets of genes that are highly coincident among the eleven heterogeneous studies, these genes are useless unless they are biologically reasonable. In order to assess the biological value of the selected genes, we show one evaluation here; many more evaluations were performed in my published paper [26], but showing all of them here would simply be tedious.

Table 7.15 shows the result of KEGG pathway enrichment analysis obtained by uploading the selected genes to Enrichr. Not only are there many significantly enriched pathways, but they are also highly coincident among the 11 studies. Thus, the coincidence of selected genes among the eleven studies shown in Table 7.13 is also biologically reasonable. In this sense, PCA or TD based unsupervised FE can identify biologically meaningful genes chosen in common among heterogeneous studies in which various miRNAs are transfected to various cell lines. The universal nature detected thus seems to have biological importance, too.

Table 7.15 In each of 11 studies, 20 top-ranked significant KEGG pathways whose associated genes significantly match some genes selected for each experiment are identified

5 One-Class Differential Expression Analysis for Multiomics Data Set

In general, there are two kinds of biological experiments, in vivo and in vitro. In vivo means experiments using living organisms, e.g., animals and plants. In vivo experiments are not very economical, because the whole body is consumed even when we are interested in only a specific tissue. For example, even if you are interested in liver disease, an in vivo experiment requires raising a whole animal; it would be more efficient if the liver alone could be cultivated separately. In vivo experiments also tend to be avoided recently from the ethical point of view, because they kill numerous animals. In vitro experiments can more or less address these problems. In vitro experiments make use of cell lines, which are immortalized cells often derived from cancer cells. Once a cell line is established, you can perform any kind of experiment in vitro with it. Because cell lines can be cultivated even in a dish, they are definitely cost effective and no animals are killed.

One possible problem of in vitro experiments is the lack of control samples. It is known that cell lines differ from the tissue cells from which they are established. Thus, treated and untreated cell lines are usually compared with each other. Characterizing immortalized cell lines themselves is not an easy task.

In this section, we propose a method that can characterize cancer cell lines from gene expression without comparison with anything else [22]. In this method, genes expressed in common over multiple cancer subtypes are searched for and are considered to represent the characteristic gene expression of cancer cell lines. For this purpose, TD based unsupervised FE, which was used in the previous section to identify genes expressed in common over multiple miRNA transfection studies, is employed again.

In addition to this, TD based unsupervised FE is used as a tool that integrates omics data. The data set used is downloaded from DBTSS [21], which is a database of transcriptional start sites (TSS), and includes RNA-seq, TSS-seq, and ChIP-seq (histone modification, H3K27ac). These are observed in 26 NSCLC subtype cell lines using HTS technology; DBTSS also stores various omics data sets measured on various cell lines and living organisms.

Before starting the analysis, we briefly explain the differences among TSS-seq, RNA-seq, and ChIP-seq. As its name says, TSS-seq sequences RNA transcribed from the region around the TSS. Thus, TSS-seq basically counts how many times transcription starts. RNA-seq, on the other hand, counts fragments taken from any part of the whole RNA. In this sense, RNA-seq counts the total amount of RNA transcribed. Generally, TSS-seq and RNA-seq are positively correlated, although no functional form relating the two is known, because the relation is affected by many factors; e.g., individual genes differ in length. If longer genes are transcribed more, the ratio of RNA-seq to TSS-seq becomes larger. In addition, individual genes have isoforms of different lengths, which are generated by a mechanism called alternative splicing. If more of the longer isoforms are transcribed from each gene, this also increases the RNA-seq/TSS-seq ratio. Although many details must be considered in order to relate RNA-seq to TSS-seq, one point is clear: TSS-seq and RNA-seq should be positively correlated. Thus, seeking genes associated with both larger TSS-seq counts and larger RNA-seq counts can reduce the possibility that genes are wrongly identified as being upregulated or downregulated, e.g., because of technical issues like misamplification.

ChIP-seq is a different technology that detects the part of DNA to which a protein binds. Although I do not explain the details of the relationship between DNA and the proteins that bind to it, basically DNA binding proteins can control the rate of transcription, and ChIP-seq can study this relationship. Histone modification is a more advanced feature. In order to suppress the self-entanglement of lengthy DNA, the long DNA string is wrapped around a protein core called a histone. Because tightly wrapped DNA is hardly transcribed, how tightly DNA is wrapped around histones can drastically affect the amount of transcription. The affinity between histone and DNA can in turn be affected by chemical modification of the histone. Among the various histone modifications, acetylation of the histone tail is supposed to enhance transcription by reducing the affinity between DNA and histone. As a result, considering histone modification (H3K27ac) together with RNA-seq and TSS-seq can further reduce the possibility of wrongly identified up/downregulated genes. In the following, we seek genes simultaneously associated with increased counts of TSS-seq, RNA-seq, and ChIP-seq measuring H3K27ac.

When formatting RNA-seq, TSS-seq, and ChIP-seq measurements as a tensor, the practical question is at which resolution to do so. In principle, this could be done at single-nucleotide resolution, but that results in a tensor too huge to fit in memory. It is therefore better to employ a coarse graining approach that averages over local chromosomal regions. The problem is how long the regions should be. If the region is too long, each region includes more than one (protein coding) gene, and increased or decreased counts within a region might reflect more than one gene. This results in low interpretability. On the other hand, if the region is too short, individual (protein coding) genes extend over multiple regions, which again results in low interpretability. Thus, there should be a more or less optimal region length. In this section, I try 25,000 nucleotides as the region length. Generally, the average length of a protein is \(\sim 10^2\) amino acids. Because one amino acid is coded by three nucleotides (a codon), the length of the region that codes an individual protein coding gene should be at most \(\sim 10^3\). The regions that code protein coding genes are typically composed of both exons and introns, which correspond to translated and non-translated regions, respectively. Thus, the length of the DNA region that codes an individual gene might be doubled, but it is still not expected to exceed \(\sim 10^3\) by much. In actuality, some literature reports that the average length of the DNA regions that code human protein coding genes is a little shorter than \(\sim 10^4\) [8]. Nevertheless, if the region over which TSS-seq, RNA-seq, and ChIP-seq count data are averaged is only as long as the expected length of the DNA region that codes an individual protein coding gene, boundaries between averaging regions might frequently fall in the middle of a gene. Thus, the averaging region should be a few times longer than the expected length of the DNA region that codes an individual protein coding gene. Based upon these considerations, a region of 25,000 nucleotides over which TSS-seq, RNA-seq, and ChIP-seq counts are averaged is proposed.
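The coarse graining step can be sketched as follows. This is only an illustration in standard Python: the input format (parallel lists of 0-based positions and counts for one chromosome) and the function name are assumptions, since the text does not specify how the count data are stored.

```python
from collections import defaultdict

REGION_LENGTH = 25_000  # nucleotides per region, as chosen in the text

def average_by_region(positions, counts, region_length=REGION_LENGTH):
    """Average per-position counts within fixed-length chromosomal regions.

    positions: 0-based coordinates on one chromosome (hypothetical format)
    counts:    read counts observed at those positions
    Returns {region_index: summed count / region_length}.
    """
    totals = defaultdict(float)
    for pos, count in zip(positions, counts):
        totals[pos // region_length] += count
    return {region: total / region_length for region, total in totals.items()}
```

Regions without any observed count are simply absent from the returned dictionary and can be treated as zero when the tensor is assembled.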

From the data set of type “human lung adenocarcinoma cell line 26 cell line” in the inhouse data category, RNA-seq, TSS-seq, and ChIP-seq data are used. Among the ChIP-seq data, only H3K27ac is used (H3K27ac means that lysine 27 (K27) of histone H3 is acetylated). Counts are averaged over chromosomal regions of 25,000 nucleotides each. Tensors are generated for each chromosome separately. Each tensor has the form \(x_{ijk} \in \mathbb {R}^{N \times 26 \times 3}\), where N is the total number of regions of 25,000 nucleotides within the chromosome, j stands for the 26 cell lines, and k stands for the counts of TSS-seq, RNA-seq, and ChIP-seq. The HOSVD algorithm (Fig. 3.8) is applied to \(x_{ijk}\) as

$$\displaystyle \begin{aligned} x_{ijk} = \sum_{\ell_1=1}^N \sum_{\ell_2=1}^{26} \sum_{\ell_3=1}^3 G(\ell_1,\ell_2,\ell_3) u^{(i)}_{\ell_1 i}u^{(j)}_{\ell_2 j}u^{(k)}_{\ell_3 k} \end{aligned} $$
(7.8)

where \(u^{(i)}_{\ell _1 i} \in \mathbb {R}^{N \times N}\), \(u^{(j)}_{\ell _2 j} \in \mathbb {R}^{26 \times 26}\), and \(u^{(k)}_{\ell _3 k} \in \mathbb {R}^{3 \times 3}\) are singular value matrices and \(G(\ell _1,\ell _2,\ell _3) \in \mathbb {R}^{N \times 26 \times 3}\) is a core tensor.
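A decomposition of the form of Eq. (7.8) can be sketched with the standard unfold-and-SVD construction of HOSVD. The code below assumes NumPy is available and is not a reproduction of the algorithm in Fig. 3.8; the function names are hypothetical. Singular value vectors are stored as columns, so `factors[1][:, 0]` plays the role of \(\mathbf {\mathit {u}}^{(j)}_1\) and `core[0, 0, 0]` that of G(1, 1, 1).

```python
import numpy as np

def hosvd(x):
    """Higher-order SVD of an array x with any number of modes.

    Returns (core, factors); factors[m] holds the singular value vectors
    of mode m as columns (economy size).
    """
    factors = []
    for mode in range(x.ndim):
        # Unfold: bring `mode` to the front and flatten the other modes.
        unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Contract mode `mode` of the core with the transposed factor.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core tensor back by every factor matrix."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x
```

Because economy-size SVD is used, the first factor has at most 26 × 3 = 78 columns when N exceeds 78, which is sufficient to reconstruct the tensor exactly.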

First, we need to find \(\mathbf {\mathit {u}}^{(j)}_{\ell _2}\) that is independent of the 26 cell lines and \(\mathbf {\mathit {u}}^{(k)}_{\ell _3}\) that is independent of RNA-seq, TSS-seq, and ChIP-seq. Figure 7.6 shows \(\mathbf {\mathit {u}}^{(j)}_1\). Excluding the X chromosome, it is highly independent of the 26 cell lines, so we decide to employ \(\ell_2 = 1\). Figure 7.7 shows \(\mathbf {\mathit {u}}^{(k)}_1\). It is highly independent of TSS-seq, RNA-seq, and ChIP-seq, so we decide to employ \(\ell_3 = 1\).

Fig. 7.6 \(\mathbf {\mathit {u}}^{(j)}_1\). The first row, from left to right, chromosomes 1, 2, 3; the second row, from left to right, chromosomes 4, 5, 6; and so on. The last row, from left to right, chromosomes 22, X, Y. The red broken line is the baseline

Fig. 7.7 \(\mathbf {\mathit {u}}^{(k)}_1\). The first row, from left to right, chromosomes 1, 2, 3; the second row, from left to right, chromosomes 4, 5, 6; and so on. The last row, from left to right, chromosomes 22, X, Y. The red broken line is the baseline

Then we try to find which \(G(\ell_1, 1, 1)\) has the largest absolute value and find that G(1, 1, 1) always has the largest absolute value, independent of chromosome. Thus, \(\mathbf {\mathit {u}}^{(i)}_1\) is used to attribute P-values to regions as

$$\displaystyle \begin{aligned} P_i = P_{\chi^2} \left[ > \left(\frac{u^{(i)}_{1 i}}{\sigma_1} \right)^2\right]. {} \end{aligned} $$
(7.9)

P-values are collected from the 24 chromosomes and are corrected by the BH criterion. Then 826 regions associated with adjusted P-values less than 0.01 are selected. This is very small compared with the total number of regions: because the total number of regions is about \(3 \times 10^9 / 2.5 \times 10^4 \sim 10^5\), where \(3 \times 10^9\) is the total length of the human genome and \(2.5 \times 10^4\) is the length of individual regions, 826 corresponds to as little as 0.8% of the regions. This is reasonable because only a few percent of the genome codes protein coding genes.
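Equation (7.9) can be evaluated without a statistics library, because the upper tail of the \(\chi^2\) distribution with one degree of freedom has a closed form in terms of the complementary error function. In the sketch below, \(\sigma_1\) is taken to be the standard deviation of the components of the singular value vector; this interpretation, like the function names, is an assumption of the illustration.

```python
from math import erfc, sqrt
from statistics import pstdev

def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))

def region_p_values(u):
    """P-value for each component of a singular value vector, as in Eq. (7.9).

    sigma is assumed to be the standard deviation of the components of u.
    """
    sigma = pstdev(u)
    return [chi2_sf_1dof((value / sigma) ** 2) for value in u]
```

The resulting P-values, pooled over chromosomes, would then be adjusted by the BH criterion before thresholding at 0.01.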

In order to validate these selected regions, we upload the 1741 Entrez genes associated with these 826 regions to DAVID. Entrez gene IDs are manually curated, unique integer identifiers of genes [12]. Table 7.16 lists the KEGG pathways enriched with adjusted P-values less than 0.05. At a glance, they do not look related to cancers. Nevertheless, some of them are cancer related terms. For example, the relationship between “antigen processing and presentation” and cancer is often discussed [4]. Parkinson’s disease is often reported to be related to lung cancer [30]. Although we do not fully discuss the relations between the detected KEGG pathway enrichment and NSCLC here, TD based unsupervised FE can evidently detect a set of genes that includes those related to NSCLC.

Table 7.16 KEGG pathway enrichment by the 1741 Entrez genes identified by TD based unsupervised FE

Although it would be better to evaluate the performance of TD based unsupervised FE by comparison with other methods, this is not easy because there are no control samples to compare with. Thus, as an alternative, we select regions based upon the ratio of the standard deviation to the mean over the 26 cell lines, because a smaller ratio suggests smaller variability among the 26 cell lines. For each of TSS-seq, RNA-seq, and ChIP-seq, we select the 5% of regions with the smallest ratio. Then the regions chosen in common among TSS-seq, RNA-seq, and ChIP-seq are collected; we find that 2041 Entrez genes are included in these regions. This number, 2041, is comparable with 1741, the number of Entrez genes selected by TD based unsupervised FE. Thus, uploading these genes to DAVID is a suitable test of whether TD based unsupervised FE is superior to this alternative method. We find that only two KEGG pathways, “Spliceosome” and “Ubiquitin mediated proteolysis”, are associated with adjusted P-values less than 0.05. This suggests that TD based unsupervised FE can identify a far more biologically reasonable set of genes than this alternative approach.
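The alternative selection described above can be sketched in standard Python as follows. The dictionary-based input format, the function names, and the treatment of regions with a non-positive mean (they are skipped) are assumptions of this illustration.

```python
from statistics import mean, pstdev

def low_variability_regions(profiles, fraction=0.05):
    """Regions in the lowest `fraction` by standard deviation / mean.

    profiles: {region_id: [value in each cell line]} for one assay.
    Regions with a non-positive mean are skipped (assumption of this sketch).
    """
    ratios = {
        region: pstdev(values) / mean(values)
        for region, values in profiles.items()
        if mean(values) > 0
    }
    n_keep = max(1, int(len(ratios) * fraction))
    ranked = sorted(ratios, key=ratios.get)
    return set(ranked[:n_keep])

def common_regions(*assays, fraction=0.05):
    """Regions selected in every assay (e.g., TSS-seq, RNA-seq, ChIP-seq)."""
    selected = [low_variability_regions(a, fraction) for a in assays]
    return set.intersection(*selected)
```

Calling `common_regions(tss, rna, chip)` with one dictionary per assay returns the regions whose ratio is among the smallest 5% in all three.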

6 General Examples of Case I and II Tensors

Before demonstrating individual cases using case I and case II tensor in detail, we demonstrate various cases briefly based upon the recent publication [23]. As shown in Table 5.3, matrices or low mode tensor can be combined to generate (higher mode) tensor. In this section, we demonstrate how the combinations shown in Table 5.3 work to select genes critical to the diseases or phenomena considered.

6.1 Integrated Analysis of mRNA and miRNA

Integrated analysis of mRNA and miRNA was also performed with PCA based unsupervised FE (Sect. 6.4), which was applied to mRNA and miRNA separately. The two obtained sets of PC loadings attributed to samples were then investigated to find those sharing a common nature between the two sets, after which the corresponding PC scores attributed to mRNA and miRNA were used for FE. In contrast, in the application of TD based unsupervised FE to the integrated analysis of mRNA and miRNA, the mRNA and miRNA expression profiles are integrated in advance.

The analyzed data set is composed of mRNA and miRNA profiles measured for multi-class breast cancer samples including normal breast tissues [7]. The mRNA and miRNA expression profiles are downloaded from GEO using GEO ID GSE28884. First, GSE28884_RAW.tar is downloaded and expanded. For mRNA, the 161 files whose names end with the string “c.txt.gz” are used. Each file is loaded into R by the read.csv command, and the second column, named “M”, is employed as the mRNA expression values. Probes not associated with Human Genome Organisation (HUGO) gene names are discarded, and 13,393 probes remain. For miRNA, the 161 files whose names end with the string “geo.txt.gz” are used; mRNA expression profiles of the corresponding samples are also used. Each file is loaded into R by the read.csv command, and the second column (“Count”) is summed over rows sharing the same value in the third column (“Annotation”). If the resulting total is less than 10, it is discarded and not used for further analysis.
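The summation of miRNA counts by annotation can be sketched as follows. Although the original processing was done in R, the illustration below uses standard Python; the function name and the representation of one file as (count, annotation) pairs are assumptions, and the miRNA names in the usage example are toy values.

```python
from collections import defaultdict

def summarize_mirna(rows, min_total=10):
    """Sum counts sharing the same annotation and drop low totals.

    rows: iterable of (count, annotation) pairs taken from the "Count"
          and "Annotation" columns of one sample file.
    """
    totals = defaultdict(int)
    for count, annotation in rows:
        totals[annotation] += count
    # Totals below min_total are discarded, as described in the text.
    return {name: total for name, total in totals.items() if total >= min_total}

# Toy rows (hypothetical values).
profile = summarize_mirna([(4, "miR-21"), (7, "miR-21"), (3, "let-7a")])
```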

Because the 161 samples are shared between miRNA and mRNA expression profiles, the multi-omics data corresponds to case I data (Table 5.3). TD based unsupervised FE is applied to the data set in order to identify disease critical genes and latent relations between miRNA and mRNA, whose expression profiles are \(x^{\mbox{ mRNA}}_{i_1j} \in \mathbb {R}^{13393 \times 161}\) and \(x^{\mbox{ miRNA}}_{i_2j} \in \mathbb {R}^{755 \times 161}\), respectively. They can be formatted as case I tensor as

$$\displaystyle \begin{aligned} x_{i_1i_2j} = x^{\mbox{ mRNA}}_{i_1j}x^{\mbox{ miRNA}}_{i_2j}. \end{aligned} $$
(7.10)
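Equation (7.10) is a product taken sample by sample, which can be sketched for small inputs with nested lists. The function name is hypothetical. Note that for the actual data the tensor has 13,393 × 755 × 161, i.e., roughly \(1.6 \times 10^9\), entries, so this literal construction is only meant to make the index structure explicit.

```python
def case_one_tensor(x_mrna, x_mirna):
    """Build x[i1][i2][j] = x_mrna[i1][j] * x_mirna[i2][j] (Eq. 7.10).

    Both inputs are lists of rows (features) over the same samples j.
    """
    n_samples = len(x_mrna[0])
    if any(len(row) != n_samples for row in x_mrna + x_mirna):
        raise ValueError("all rows must cover the same samples")
    return [
        [[a[j] * b[j] for j in range(n_samples)] for b in x_mirna]
        for a in x_mrna
    ]
```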

HOSVD (Fig. 3.8) is applied to \(x_{i_1i_2j}\) as

$$\displaystyle \begin{aligned} x_{i_1i_2j} = \sum_{\ell_1=1}^{13393} \sum_{\ell_2=1}^{755} \sum_{\ell_3=1}^{161} G(\ell_1,\ell_2,\ell_3) u^{(i_1)}_{\ell_1 i_1} u^{(i_2)}_{\ell_2 i_2} u^{(j)}_{\ell_3 j} {} \end{aligned} $$
(7.11)

where \(u^{(i_1)}_{\ell _1 i_1} \in \mathbb {R}^{13393 \times 13393}\),\(u^{(i_2)}_{\ell _2 i_2} \in \mathbb {R}^{755 \times 755}\) and \( u^{(j)}_{\ell _3 j} \in \mathbb {R}^{161 \times 161}\) are singular value matrices and \(G(\ell _1,\ell _2,\ell _3) \in \mathbb {R}^{13393 \times 755 \times 161}\) is a core tensor.

First we need to seek singular value vectors, \(\mathbf {\mathit {u}}^{(j)}_{\ell _3} \in \mathbb {R}^{161}\), with significant cancer subtype dependence. Figure 7.8 shows boxplots of \(\mathbf {\mathit {u}}^{(j)}_{\ell _3}, 1\leq \ell _3 \leq 5\); these singular value vectors clearly have significant class (cancer subtype) dependence. The next step is to find \(G(\ell_1, \ell_2, 1 \leq \ell_3 \leq 5)\) with larger absolute values. Table 7.17 shows the top ranked \(G(\ell_1, \ell_2, 1 \leq \ell_3 \leq 5)\)s; they clearly appear only for \(1 \leq \ell_1 \leq 5\) and \(1 \leq \ell_2 \leq 2\). Thus, P-values are attributed to \(i_1\) and \(i_2\) using \(u^{(i_1)}_{\ell _1i_1}, 1 \leq \ell _1 \leq 5\) and \(u^{(i_2)}_{\ell _2i_2}, 1 \leq \ell _2 \leq 2\), respectively, as

$$\displaystyle \begin{aligned} \begin{array}{rcl} P_{i_1} &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \sum_{\ell_1=1}^5 \left(\frac{u^{(i_1)}_{\ell_1 i_1}}{\sigma_{\ell_1}} \right)^2\right], \end{array} \end{aligned} $$
(7.12)
$$\displaystyle \begin{aligned} \begin{array}{rcl} P_{i_2} &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \sum_{\ell_2=1}^2 \left(\frac{u^{(i_2)}_{\ell_2 i_2}}{\sigma_{\ell_2}} \right)^2\right]. \end{array} \end{aligned} $$
(7.13)

Computed P-values are adjusted by the BH criterion; \(i_1\)s and \(i_2\)s associated with adjusted P-values less than 0.01 are selected. As a result, 426 mRNA probes and 7 miRNAs are selected.

Table 7.17 Top ranked 10 \(G(\ell_1, \ell_2, 1 \leq \ell_3 \leq 5)\)s with larger absolute values among \(1 \leq \ell_1, \ell_2, \ell_3 \leq 10\) in Eq. (7.11)

Fig. 7.8 Boxplot of \(\mathbf {\mathit {u}}^{(j)}_{\ell _3}, 1\leq \ell _3 \leq 5\) when HOSVD is applied as Eq. (7.11). P-values are computed by categorical regression. 1st: \(2.39 \times 10^{-5}\), 2nd: \(5.83 \times 10^{-14}\), 3rd: \(1.36 \times 10^{-24}\), 4th: \(2.58 \times 10^{-2}\), 5th: \(2.12 \times 10^{-5}\)

In order to evaluate the selected 426 mRNAs biologically, we upload them to DAVID and find numerous enrichments. Tables 7.18 and 7.19 show the results of GO term enrichment (adjusted P-values less than 0.05). BP (biological process) describes the biological processes in which a gene takes part, CC (cellular component) the location within the cell, and MF (molecular function) the function of the gene product as a molecule. Although we do not summarize all of them, most of them are reasonably related to cancers, e.g., immune related or cell surface enrichment. Thus TD based unsupervised FE is likely successful in identifying cancer related genes.

Table 7.18 GO BP enrichment by the 426 ensembl genes identified by TD based unsupervised FE
Table 7.19 GO CC and MF enrichment by the 426 ensembl genes identified by TD based unsupervised FE

In order to demonstrate superiority of type I tensor, we also employ type II tensor as

$$\displaystyle \begin{aligned} x_{i_1i_2} = \sum_j x_{i_1i_2j}. \end{aligned} $$
(7.14)
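Substituting Eq. (7.10) into Eq. (7.14) gives \(x_{i_1i_2} = \sum_j x^{\mbox{ mRNA}}_{i_1j}x^{\mbox{ miRNA}}_{i_2j}\), so the type II tensor can be obtained as a product of the two matrices without ever forming the three-mode tensor. A minimal sketch in standard Python, with a hypothetical function name:

```python
def type_two_matrix(x_a, x_b):
    """x[i1][i2] = sum_j x_a[i1][j] * x_b[i2][j].

    This is Eq. (7.14) after substituting Eq. (7.10); the three-mode
    tensor is never materialized.
    """
    return [[sum(p * q for p, q in zip(a, b)) for b in x_b] for a in x_a]
```

For the data considered here, the result is a 13,393 × 755 matrix, which is smaller than the type I tensor by a factor of 161.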

Applying SVD to \(x_{i_1i_2}\), we get singular value vectors \(u^{(i_1)}_{\ell _1 i_1} \in \mathbb {R}^{13393 \times 161}\) and \(u^{(i_2)}_{\ell _2 i_2} \in \mathbb {R}^{755 \times 161}\). In order to select the singular value vectors used for FE, we need to know their dependence upon classes (in this case, cancer subtypes). To do so, we need singular value vectors attributed to samples, which are computed as in Eqs. (5.12) and (5.13),

$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{j;i_1}_{\ell_1 j}&\displaystyle = &\displaystyle \sum_{i_1=1}^{13393} x^{\mbox{ mRNA}}_{i_1 j} u^{(i_1)}_{\ell_1 i_1} {} \end{array} \end{aligned} $$
(7.15)
$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{j;i_2}_{\ell_2 j}&\displaystyle = &\displaystyle \sum_{i_2=1}^{755} x^{\mbox{ miRNA}}_{i_2 j} u^{(i_2)}_{\ell_2 i_2} {} \end{array} \end{aligned} $$
(7.16)

Figure 7.9 shows boxplots of \(u^{j;i_1}_{\ell _1 j}\) and \(u^{j;i_2}_{\ell _2 j}\) for \(1 \leq \ell_1, \ell_2 \leq 5\). These singular value vectors clearly have significant class (cancer subtype) dependence.

Fig. 7.9 Boxplot of \(u^{j;i_1}_{\ell _1 j}\) (upper row) and \(u^{j;i_2}_{\ell _2 j}\) (lower row) for \(1 \leq \ell_1, \ell_2 \leq 5\) computed by Eqs. (7.15) and (7.16). P-values are computed by categorical regression. Upper, 1st: \(4.07 \times 10^{-11}\), 2nd: \(4.36 \times 10^{-22}\), 3rd: \(2.03 \times 10^{-23}\), 4th: \(4.14 \times 10^{-4}\), 5th: \(1.57 \times 10^{-4}\). Lower, 1st: \(3.36 \times 10^{-27}\), 2nd: \(3.91 \times 10^{-13}\), 3rd: \(7.39 \times 10^{-9}\), 4th: \(9.32 \times 10^{-5}\), 5th: \(2.82 \times 10^{-5}\)

Thus, P-values are attributed to \(i_1\) and \(i_2\) using \(u^{(i_1)}_{\ell _1 i_1}\) and \(u^{(i_2)}_{\ell _2 i_2}\) for \(1 \leq \ell_1, \ell_2 \leq 5\), respectively, as

$$\displaystyle \begin{aligned} \begin{array}{rcl} P_{i_1} &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \sum_{\ell_1=1}^5 \left(\frac{u^{(i_1)}_{\ell_1 i_1}}{\sigma_{\ell_1}} \right)^2\right], \end{array} \end{aligned} $$
(7.17)
$$\displaystyle \begin{aligned} \begin{array}{rcl} P_{i_2} &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \sum_{\ell_2=1}^5 \left(\frac{u^{(i_2)}_{\ell_2 i_2}}{\sigma_{\ell_2}} \right)^2\right]. \end{array} \end{aligned} $$
(7.18)

P-values are adjusted by the BH criterion, and \(i_1\) and \(i_2\) associated with adjusted P-values less than 0.01 are selected. As a result, 374 mRNA probes and 21 miRNAs are selected.

In order to validate the selected 374 mRNAs, we upload them to DAVID and again find numerous enrichments. Table 7.20 shows the results of GO term enrichment (adjusted P-values less than 0.05), as in Tables 7.18 and 7.19. Although the number of enriched terms is smaller than for the type I tensor, there are still many cancer related GO terms. Thus, the type II tensor approach is still biologically valid.

Table 7.20 GO BP, CC and MF enrichment by the 374 ensembl genes identified by TD based unsupervised FE for type II tensor

Finally, in order to emphasize the superiority of TD based unsupervised FE to conventional supervised methods, we apply categorical regression analysis to mRNAs expression,

$$\displaystyle \begin{aligned} x_{i_1j} = a_{i_1} + \sum_s b_{i_1 s} \delta_{j s} \end{aligned} $$
(7.19)

where \(a_{i_1}\) and \(b_{i_1 s}\) are the regression coefficients and \(\delta_{js}\) takes 1 when the jth sample belongs to the sth class and 0 otherwise. Because as many as 16,917 mRNA probes are associated with adjusted P-values less than 0.01 in the categorical regression analysis, which is too many, we instead upload the top ranked 500 mRNAs with the smallest P-values to DAVID. As a result, only one GO CC enrichment, cytoplasm, is detected with an adjusted P-value less than 0.05 (\(1.9 \times 10^{-3}\)). Although more advanced methods than categorical regression might achieve better performance, this drastic decrease in the number of enriched GO terms demonstrates the superiority of TD based unsupervised FE over the conventional supervised method. In this sense, TD based unsupervised FE is outstanding, no matter which of the type I or type II tensors is used.

6.2 Temporally Differentially Expressed Genes

Although the type I and type II tensor approaches achieved good performance in the integrated analysis of the multi-class multi-omics data set in the previous section, it is better to demonstrate yet another example in which TD based unsupervised FE achieves good performance. In this subsection, we try to identify genes whose temporal expression differs between two classes.

The first data set analyzed is the comparison of the NSCLC cell line H1975 with and without EGF treatment [2]. EGF (epidermal growth factor) is supposed to accelerate cell growth and is known to be frequently expressed in cancers. Thus, EGF treatment is expected to activate cancer cell lines. The data set is composed of two mRNA expression profiles, \(x^{\mbox{ control}}_{ij_1} \in \mathbb {R}^{39937 \times 13}\) and \(x^{\mbox{ EGF}}_{ij_2} \in \mathbb {R}^{39937 \times 15}\), which are the gene expression of cell lines without and with EGF treatment, respectively. \(j_1\) and \(j_2\) represent time points after the treatment (Table 7.21). Because they share genes, \(x^{\mbox{ control}}_{ij_1}\) and \(x^{\mbox{ EGF}}_{ij_2}\) can be converted to a case II type I tensor as

$$\displaystyle \begin{aligned} x_{ij_1j_2} = x^{\mbox{ control}}_{ij_1}x^{\mbox{ EGF}}_{ij_2}. {} \end{aligned} $$
(7.20)

HOSVD (Fig. 3.8) is applied to \(x_{ij_1j_2}\) as

$$\displaystyle \begin{aligned} x_{ij_1j_2} = \sum_{\ell_1=1}^{13} \sum_{\ell_2=1}^{15} \sum_{\ell_3=1}^{39937}G(\ell_1,\ell_2, \ell_3) u^{(j_1)}_{\ell_1j_1}u^{(j_2)}_{\ell_2 j_2}u^{(i)}_{\ell_3 i} {} \end{aligned} $$
(7.21)
Table 7.21 List of samples in EGF treatment experiments

First, we need to find singular value vectors \(\mathbf {\mathit {u}}^{(j_1)}_{\ell _1} \in \mathbb {R}^{13}\) and \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2} \in \mathbb {R}^{15}\) that exhibit distinct temporal expression between them. Figure 7.10 shows the time development of \(\mathbf {\mathit {u}}^{(j_1)}_{\ell _1}\) and \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2}\) for \(\ell_1 = \ell_2 = 1, 2\). Here, the components that share a time point are averaged within each of the vectors \(u^{(j_1)}_{\ell _1j_1}\) and \(u^{(j_2)}_{\ell _2j_2}\). Clearly, \(\mathbf {\mathit {u}}^{(j_1)}_1\) and \(\mathbf {\mathit {u}}^{(j_2)}_1\) do not exhibit any time dependence, while \(\mathbf {\mathit {u}}^{(j_1)}_2\) and \(\mathbf {\mathit {u}}^{(j_2)}_2\) do. Thus, there is a possibility that genes associated with \(\mathbf {\mathit {u}}^{(j_1)}_2\) and \(\mathbf {\mathit {u}}^{(j_2)}_2\) also exhibit a temporal difference between control and EGF treated cells.

Fig. 7.10 Singular value vectors, Eq. (7.21). (a) \(\mathbf {\mathit {u}}^{(j_1)}_1\) (black) and \(\mathbf {\mathit {u}}^{(j_2)}_1\) (red). (b) \(\mathbf {\mathit {u}}^{(j_1)}_2\) (black) and \(\mathbf {\mathit {u}}^{(j_2)}_2\) (red)

In order to select genes associated with \(\mathbf {\mathit {u}}^{(j_1)}_2\) and \(\mathbf {\mathit {u}}^{(j_2)}_2\), we need to find \(G(\ell_1, \ell_2, \ell_3)\) with \(\ell_1 = 2\) or \(\ell_2 = 2\) having larger absolute values; G(2, 1, 2) and G(1, 2, 2) have larger absolute values (Table 7.22). Thus we decide to use \(\mathbf {\mathit {u}}^{(i)}_2\) for FE. P-values are attributed to i as

$$\displaystyle \begin{aligned} P_i = P_{\chi^2} \left[ > \left(\frac{u^{(i)}_{2 i}}{\sigma_2} \right)^2\right]. {} \end{aligned} $$
(7.22)

P-values are corrected by BH criterion and genes associated with adjusted P-values less than 0.01 are selected. Then 552 mRNA probes are selected.

Table 7.22 Top ranked 10 \(G(\ell_1, \ell_2, \ell_3)\)s with larger absolute values in Eq. (7.21)

Next, we need to see if the selected 552 mRNA probes really exhibit temporal difference between control and EGF treated cells. For this purpose, we compute correlation coefficient between

$$\displaystyle \begin{aligned} \left (x^{\mbox{ control}}_{i1},\ldots,x^{\mbox{ control}}_{i13},x^{\mbox{ EGF}}_{i1},\ldots,x^{\mbox{ EGF}}_{i15} \right) {} \end{aligned} $$
(7.23)

and

$$\displaystyle \begin{aligned} \left (u^{(j_1)}_{2,1},\ldots,u^{(j_1)}_{2,13},u^{(j_2)}_{2,1},\ldots, u^{(j_2)}_{2,15}\right){} \end{aligned} $$
(7.24)

to see if the 552 selected genes are coincident with \(\mathbf {\mathit {u}}^{(j_1)}_2\) and \(\mathbf {\mathit {u}}^{(j_2)}_2\). Figure 7.11a shows the histogram of the correlation coefficients. Because there are two peaks at ±1, the gene expression of the selected 552 mRNA probes is clearly highly coincident with \(\mathbf {\mathit {u}}^{(j_1)}_2\) and \(\mathbf {\mathit {u}}^{(j_2)}_2\).

Fig. 7.11 (a) Histogram of correlation coefficients between Eqs. (7.23) and (7.24) for case II type I tensor, Eq. (7.20). (b) Boxplot of Eqs. (7.25) (black boxes filled with green) and (7.26) (red boxes filled with blue) for case II type I tensor, Eq. (7.20). P-values computed by t test: 0.5 h: \(2.83 \times 10^{-2}\), 1 h: \(6.81 \times 10^{-8}\), 2 h: \(5.63 \times 10^{-12}\), 4 h: \(3.5 \times 10^{-1}\), 6 h: \(4.83 \times 10^{-2}\), 24 h: \(5.0 \times 10^{-1}\), 48 h: \(1.70 \times 10^{-6}\)

Before comparing the 552 genes directly between control and EGF treated cells, we need to shift and scale individual gene expression profiles such that they have the same baseline and amplitude. To do so, we apply the following linear regression

$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{(j_1)}_{2j_1} &\displaystyle = &\displaystyle a_i x^{\mbox{ control}}_{ij_1} + b_i {} \end{array} \end{aligned} $$
(7.25)
$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{(j_2)}_{2j_2} &\displaystyle = &\displaystyle a_i x^{\mbox{ EGF}}_{ij_2} + b_i {} \end{array} \end{aligned} $$
(7.26)

where \(a_i\) and \(b_i\) are the regression coefficients. Because the regression coefficients are shared between control and EGF treated cells, this does not reduce the difference between the two. Then, we compare \(a_i x^{\mbox{ control}}_{ij_1} + b_i\) and \(a_i x^{\mbox{ EGF}}_{ij_2} + b_i\) of the selected 552 mRNA probes (Fig. 7.11b). Not all, but five out of the seven time points (all except 4 and 24 h after the EGF treatment) are associated with P-values less than 0.05. Thus, TD based unsupervised FE has the ability to select genes associated with temporal distinction.
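The shared shift and scale of Eqs. (7.25) and (7.26) can be sketched as a single least-squares fit over the concatenated control and EGF samples of one probe. The code uses standard Python only; the function names are hypothetical, and fitting both equations jointly by ordinary least squares is our reading of the text rather than a confirmed detail.

```python
def fit_shared_line(x_control, x_egf, u_control, u_egf):
    """Least-squares a, b such that u ~ a * x + b over both conditions at once.

    x_control, x_egf: expression of one probe in control / EGF samples
    u_control, u_egf: matching components of the singular value vectors
    """
    x = list(x_control) + list(x_egf)
    u = list(u_control) + list(u_egf)
    n = len(x)
    mean_x = sum(x) / n
    mean_u = sum(u) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxu = sum((v - mean_x) * (w - mean_u) for v, w in zip(x, u))
    a = sxu / sxx                 # fails for a constant expression profile
    b = mean_u - a * mean_x
    return a, b

def rescale(values, a, b):
    """Apply the shared shift and scale to an expression profile."""
    return [a * v + b for v in values]
```

Because the same a and b are applied to both conditions, any difference between the rescaled control and EGF profiles reflects a difference already present in the expression data.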

Next, we try to see if the type II tensor approach works as well. Because a case II tensor shares the features, whose number is generally much larger than the number of samples, the type II tensor, in which the shared dimension is summed up, has a much smaller number of components. The type II tensor is defined as

$$\displaystyle \begin{aligned} x_{j_1j_2} = \sum_{i=1}^{39937} x_{ij_1j_2}. {} \end{aligned} $$
(7.27)

where \(x_{ij_1j_2}\) is defined in Eq. (7.20). The number of components in \(x_{j_1j_2} \in \mathbb {R}^{13 \times 15}\) is 13 × 15 = 195, which is as small as 1/39937 of the number of components in \(x_{ij_1j_2} \in \mathbb {R}^{39937 \times 13 \times 15}\). Thus, if type II tensor approach works as well, it is very effective. SVD is applied to \(x_{j_1j_2}\) as

$$\displaystyle \begin{aligned} x_{j_1j_2} = \sum_\ell \lambda_\ell u^{(j_1)}_{\ell j_1}u^{(j_2)}_{\ell j_2} {} \end{aligned} $$
(7.28)
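A minimal NumPy sketch of Eqs. (7.27) and (7.28) follows. It assumes that the type I tensor of Eq. (7.20) has the product form \(x_{ij_1j_2} = x^{\mbox{ control}}_{ij_1} x^{\mbox{ EGF}}_{ij_2}\), like the other case II tensors in this chapter; under that assumption the sum over i reduces to a matrix product and the three-mode tensor never has to be stored. The function names are illustrative only.

```python
import numpy as np


def type2_matrix(x_control, x_egf):
    """Sum over the shared gene index i: returns a (J1, J2) matrix, cf. Eq. (7.27)."""
    # x_control: (genes, J1), x_egf: (genes, J2); assumed product-form case II tensor
    return x_control.T @ x_egf


def singular_value_vectors(x_type2):
    """SVD of the type II matrix, cf. Eq. (7.28); row ell of each output is the ell-th vector."""
    u, lam, vt = np.linalg.svd(x_type2, full_matrices=False)
    return u.T, lam, vt
```

With 13 and 15 time points, the matrix passed to the SVD has only 13 × 15 entries regardless of the number of probes.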

Figure 7.12 shows the \(\mathbf {\mathit {u}}^{(j_1)}_\ell \) and \(\mathbf {\mathit {u}}^{(j_2)}_\ell \) for \(\ell = 1, 2\). Basically, it looks similar to Fig. 7.10. Thus we decide to employ \(\ell = 2\) for FE. Then, singular value vectors attributed to i can be computed as Eq. (5.14),

$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{(i;j_1)}_{\ell i}&\displaystyle = &\displaystyle \sum_{j_1=1}^{13} x^{\mbox{ control}}_{i j_1} u^{(j_1)}_{\ell j_1} {} \end{array} \end{aligned} $$
(7.29)
$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{(i;j_2)}_{\ell i}&\displaystyle = &\displaystyle \sum_{j_2=1}^{15} x^{\mbox{ EGF}}_{i j_2} u^{(j_2)}_{\ell j_2} {} \end{array} \end{aligned} $$
(7.30)
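Equations (7.29) and (7.30) are plain matrix-vector products. The sketch below assumes that the expression matrices are stored with genes in rows and time points in columns; `gene_vectors` is a hypothetical helper name.

```python
import numpy as np


def gene_vectors(x_control, x_egf, u_j1, u_j2):
    """Project each expression matrix onto the matching time-point vector."""
    u_i_from_control = x_control @ u_j1   # Eq. (7.29): sum over j1
    u_i_from_egf = x_egf @ u_j2           # Eq. (7.30): sum over j2
    return u_i_from_control, u_i_from_egf
```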

Thus P-values are also attributed to i in two ways as

$$\displaystyle \begin{aligned} \begin{array}{rcl} P_i^{j_1} &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \left(\frac{u^{(i;j_1)}_{2 i}}{\sigma_2} \right)^2\right], \end{array} \end{aligned} $$
(7.31)
$$\displaystyle \begin{aligned} \begin{array}{rcl} P_i^{j_2} &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \left(\frac{u^{(i;j_2)}_{2 i}}{\sigma^{\prime}_2} \right)^2\right]. \end{array} \end{aligned} $$
(7.32)

P-values are corrected by BH criterion. mRNA probes associated with adjusted P-values less than 0.01 are selected. As a result, 482 and 487 mRNA probes are selected using \(P_i^{j_1}\) and \(P_i^{j_2}\), respectively, and 396 mRNA probes are chosen in common. Thus, the type II tensor approach gives largely coincident results for the two approximations of the singular value vectors attributed to i, Eqs. (7.29) and (7.30).
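The selection step (\(\chi^2\) P-values, BH correction, thresholding) can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the original implementation: \(\sigma\) is taken to be the population standard deviation of the vector, one degree of freedom is assumed, and the names `chi2_pvalues`, `bh_adjust`, and `select` are hypothetical.

```python
import math


def chi2_pvalues(values):
    """P-values P[chi^2 > (v / sigma)^2] with one degree of freedom."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sigma is the population standard deviation of the vector.
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # For one degree of freedom, P[chi^2 > z^2] = erfc(|z| / sqrt(2)).
    return [math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in values]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from the largest P-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select(values, threshold=0.01):
    """Indices whose BH-adjusted P-value is below the threshold."""
    adjusted = bh_adjust(chi2_pvalues(values))
    return {idx for idx, p in enumerate(adjusted) if p < threshold}
```

The probes chosen in common are then the intersection of the two index sets obtained from the two approximations, e.g. `select(u_from_control) & select(u_from_egf)`.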

Fig. 7.12

Singular value vectors, Eq. (7.28). (a) \(\mathbf {\mathit {u}}^{(j_1)}_1\) (black) and \(\mathbf {\mathit {u}}^{(j_2)}_1\) (red). (b) \(\mathbf {\mathit {u}}^{(j_1)}_2\) (black) and \(\mathbf {\mathit {u}}^{(j_2)}_2\)(red)

Next, we need to see if the 396 mRNA probes chosen in common really exhibit temporal difference between control and EGF treated cells, as in the case of the type I tensor approach. The correlation coefficient between Eqs. (7.23) and (7.24) is computed again to see the coincidence between gene expression and singular value vectors (Fig. 7.13a). The peaks at ± 1 are much sharper than those in Fig. 7.11a. This suggests that the type II tensor approach might be better than the type I tensor approach in spite of the smaller computational resources required.

Fig. 7.13

(a) Histogram of correlation coefficients between Eqs. (7.23) and (7.24) for case II type II tensor, Eq. (7.27). (b) Boxplot of Eqs. (7.25) (black boxes filled with green) and (7.26) (red boxes filled with blue) for case II type II tensor, Eq. (7.27). P-values computed by t test: 0.5 h: \(1.68 \times 10^{-2}\), 1 h: \(2.56 \times 10^{-5}\), 2 h: \(3.83 \times 10^{-7}\), 4 h: \(9.14 \times 10^{-2}\), 6 h: \(7.30 \times 10^{-4}\), 24 h: \(2.36 \times 10^{-2}\), 48 h: \(5.55 \times 10^{-38}\)

In order to confirm the superiority of type II tensor approach, we again apply linear regression Eqs. (7.25) and (7.26) replacing singular value vectors with those obtained by type II tensor (Fig. 7.13b). Because six among seven time points excluding 4 h after the EGF treatment are associated with P-values less than 0.05, type II tensor approach is superior to type I tensor approach.

Finally, in order to validate the 552 and 396 mRNA probes selected by the type I and II tensor approaches, respectively, we upload RefSeq mRNA IDs associated with these probes to DAVID. Table 7.23 lists the KEGG pathways identified by DAVID for the type I and II tensor approaches. Although the same five KEGG pathways are associated with adjusted P-values less than 0.05 in both approaches, P-values for the type II tensor approach are smaller than those for the type I tensor approach. Because P-values tend to be smaller when more genes are uploaded, the smaller P-values attributed to KEGG pathways by the type II tensor approach, where fewer genes are selected, suggest the superiority of the type II tensor approach from the biological point of view.

Table 7.23 KEGG pathways identified by DAVID for genes associated with 552 (upper numbers) and 396 (lower numbers) mRNA probes selected using type I, Eq. (7.20), and type II, Eq. (7.27), tensor approaches

Although the type II approach is better than the type I approach in this specific example, which approach performs better depends heavily on the data set analyzed; it is thus difficult to know in advance which is better.

7 Gene Expression and Methylation in Social Insects

As the first example of the application of case I tensor approach, we employ the multi-omics analysis of social insects. Social insects, e.g., ants and bees, are known to have castes where distinct phenotypes appear in spite of shared genome. Thus, it is interesting to know what drives differentiation between castes.

One possible scenario is the alteration of the epigenome [29], because the epigenome has plasticity that can mediate differentiation between castes. The most typical castes are queens and workers. Queens concentrate on reproduction, while workers serve to maintain the colony. In spite of their strict difference in phenotype, they are often known to be relatives. Thus, they share the genome to some extent while having distinct phenotypes. This suggests that the epigenome can play a role in caste differentiation.

In this section, we try to identify genes associated with differential expression and methylation between castes, especially queens and workers [25], because such genes are potential candidates that can mediate distinct phenotypes between castes. To this end, we employ TD based unsupervised FE, which can integrate multi-omics data sets. The data set analyzed [16] is composed of two insect species, bee (P. canadensis) and ant (D. quadriceps). Table 7.24 shows the number of samples available from GEO with GEO ID GSE59525. As can be seen, it is a typical large p small n data set.

Table 7.24 Number of samples in social insect study [16]

Because the amount of gene expression is measured by the unit of Reads Per Kilobase of exon per Million mapped reads (RPKM), it is used as it is. Because the gene expression profile of P. canadensis was log2-ratio converted, it is converted back to the original scale as \(2^x\), where x is the log-converted gene expression. On the other hand, we would like to employ case II tensor format (Table 5.3) where genes are shared. Thus we need to convert methylation profiles such that they are attributed to individual genes. To do so, assuming that \(m_{s_1}\) and \(m_{s_2}\) are methylation and nonmethylation values, respectively, at locus s, the relative methylation within the ith gene is defined as

$$\displaystyle \begin{aligned} \frac{\sum_{s\in i} m_{s_1}}{\sum_{s\in i} \left ( m_{s_1} +m_{s_2}\right) } \end{aligned} $$
(7.33)

where \(\sum_{s\in i}\) is taken over the bases s within the DNA sequence corresponding to the ith gene body. Methylation in the gene body, rather than in the promoter region, is summed up and attributed to genes because gene body methylation is believed to affect gene expression in insects [32]. Relative methylation profile is formatted as

$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ methyl, bee}}_{ik} &\displaystyle \in &\displaystyle \mathbb{R}^{N \times 7}, \end{array} \end{aligned} $$
(7.34)
$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ methyl, ant}}_{ik} &\displaystyle \in &\displaystyle \mathbb{R}^{N \times 7}, \end{array} \end{aligned} $$
(7.35)

where N is the number of genes. k = 1 corresponds to control samples. 2 ≤ k ≤ 4 and 5 ≤ k ≤ 7 correspond to queens and workers, respectively. On the other hand, mRNA expression is formatted as

$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ mRNA, bee}}_{ij} &\displaystyle \in &\displaystyle \mathbb{R}^{N \times 10} , \end{array} \end{aligned} $$
(7.36)
$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ mRNA, ant}}_{ij} &\displaystyle \in &\displaystyle \mathbb{R}^{N \times 13} . \end{array} \end{aligned} $$
(7.37)

where 1 ≤ j ≤ 4 and 5 ≤ j ≤ 10 for bee correspond to queens and workers, respectively, while 1 ≤ j ≤ 7 and 8 ≤ j ≤ 13 for ant correspond to queens and workers, respectively. Then case II tensor is generated as

$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ bee}}_{ijk} &\displaystyle = &\displaystyle x^{\mbox{ mRNA, bee}}_{ij}x^{\mbox{ methyl, bee}}_{ik} , \end{array} \end{aligned} $$
(7.38)
$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ ant}}_{ijk} &\displaystyle = &\displaystyle x^{\mbox{ mRNA, ant}}_{ij}x^{\mbox{ methyl, ant}}_{ik} , \end{array} \end{aligned} $$
(7.39)

where \(x^{\mbox{ bee}}_{ijk} \in \mathbb {R}^{N \times 10 \times 7}\) and \(x^{\mbox{ ant}}_{ijk} \in \mathbb {R}^{N \times 13 \times 7}\). HOSVD, Fig. 3.8, is applied to \(x^{\mbox{ bee}}_{ijk}\) and \(x^{\mbox{ ant}}_{ijk}\) as

$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ bee}}_{ijk} &\displaystyle = &\displaystyle \sum_{\ell_1=1}^N \sum_{\ell_2=1}^{10} \sum_{\ell_3=1}^7 G(\ell_1,\ell_2,\ell_3) u^{\mbox{ bee}(i)}_{\ell_1 i} u^{\mbox{ bee}(j)}_{\ell_2 j} u^{\mbox{ bee}(k)}_{\ell_3 k} \end{array} \end{aligned} $$
(7.40)
$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ ant}}_{ijk} &\displaystyle = &\displaystyle \sum_{\ell_1=1}^N \sum_{\ell_2=1}^{13} \sum_{\ell_3=1}^7 G(\ell_1,\ell_2,\ell_3) u^{\mbox{ ant}(i)}_{\ell_1 i} u^{\mbox{ ant}(j)}_{\ell_2 j} u^{\mbox{ ant}(k)}_{\ell_3 k} \end{array} \end{aligned} $$
(7.41)

where \(u^{\mbox{ bee}(i)}_{\ell _1 i} \in \mathbb {R}^{N \times N}\), \(u^{\mbox{ bee}(j)}_{\ell _2 j} \in \mathbb {R}^{10 \times 10}\), \(u^{\mbox{ bee}(k)}_{\ell _3 k} \in \mathbb {R}^{7 \times 7}\), \(u^{\mbox{ ant}(i)}_{\ell _1 i} \in \mathbb {R}^{N \times N}\), \(u^{\mbox{ ant}(j)}_{\ell _2 j} \in \mathbb {R}^{13 \times 13}\), and \(u^{\mbox{ ant}(k)}_{\ell _3 k} \in \mathbb {R}^{7 \times 7}\).
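The construction of the case II tensor, Eqs. (7.38) and (7.39), and its HOSVD, Eqs. (7.40) and (7.41), can be sketched with NumPy as below. This is a generic textbook-style HOSVD (SVD of each mode unfolding, then projection onto the factors), offered as an assumption about what the algorithm of Fig. 3.8 amounts to; the function names are hypothetical. Note that the columns of each returned factor matrix play the role of the singular value vectors indexed by \(\ell\).

```python
import numpy as np


def case2_tensor(x_mrna, x_methyl):
    """x[i, j, k] = x_mrna[i, j] * x_methyl[i, k]; genes (i) are the shared mode."""
    return np.einsum("ij,ik->ijk", x_mrna, x_methyl)


def unfold(tensor, mode):
    """Matricize the tensor so that the chosen mode indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def hosvd(tensor):
    """Return the core tensor G and one orthonormal factor matrix per mode."""
    factors = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)                 # column ell is the ell-th singular value vector
    core = tensor
    for mode, u in enumerate(factors):
        # Multiply the current mode by u^T, keeping the mode in its original position.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors
```

Using `full_matrices=False` keeps the gene-mode factor thin (at most J × K columns), which matters when N is in the thousands.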

Next, as usual, we need to find which singular value vectors are coincident with the distinction between queens and workers. Figures 7.14a and b and 7.15a and b show the singular value vectors associated with the highest distinction between queens and workers. Unfortunately, singular value vectors of methylation do not exhibit small enough P-values to be significant. Nevertheless, because selected genes might exhibit significantly distinct expression between queens and workers, we continue the procedure. We seek \(\ell_1\) for which \(G(\ell_1, 1, 3)\) for P. canadensis and \(G(\ell_1, 1, 5)\) for D. quadriceps have larger absolute values.

Fig. 7.14

Singular value vectors for P. canadensis. P-values are computed by t test between queens and workers. (a) \(\mathbf {\mathit {u}}^{ \mbox{ bee}(k)}_1, P=1.1 \times 10^{-1}\) (b) \(\mathbf {\mathit {u}}^{ \mbox{ bee}(j)}_3, P=1.65 \times 10^{-2}\) (c) \(\mathbf {\mathit {u}}^{ \mbox{ bee}(i)}_{\ell _1}, \ell _1=9,10\). Blue open circles are selected genes

Table 7.25 lists the top ranked Gs with larger absolute values. Then we decide that \(\mathbf {\mathit {u}}^{\mbox{ bee}(i)}_{\ell _1}, \ell _1=9,10\) and \(\mathbf {\mathit {u}}^{\mbox{ ant}(i)}_{11}\) are used for FE (Figs. 7.14c and 7.15c). P-values are attributed to ith gene as

$$\displaystyle \begin{aligned} P^{\mbox{ bee}}_i = P_{\chi^2} \left[ > \sum_{\ell_1=9}^{10} \left(\frac{u^{\mbox{ bee}(i)}_{\ell_1 i}}{\sigma_{\ell_1}} \right)^2\right], \end{aligned} $$
(7.42)

and

$$\displaystyle \begin{aligned} P^{\mbox{ ant}}_i = P_{\chi^2} \left[ > \left(\frac{u^{\mbox{ ant}(i)}_{11 i}}{\sigma_{11}} \right)^2\right], \end{aligned} $$
(7.43)

P-values are adjusted by BH criterion. Genes associated with adjusted P-values less than 0.01 are selected. As a result, 133 and 128 genes are selected for P. canadensis and D. quadriceps, respectively.
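Equation (7.42) sums two squared, standardized components, whereas Eq. (7.43) uses one. A minimal standard-library sketch is given below, assuming that the number of degrees of freedom of the \(\chi^2\) distribution equals the number of summed components and that each \(\sigma_{\ell_1}\) is the population standard deviation of the corresponding vector; both points, as well as the function names, are assumptions made for illustration.

```python
import math


def chi2_sf(x, dof):
    """Upper tail probability of the chi-squared distribution for dof = 1 or 2."""
    if dof == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if dof == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only 1 or 2 degrees of freedom are handled in this sketch")


def combined_pvalues(vectors):
    """vectors: list of singular value vectors (equal length), one per selected ell."""
    sigmas = []
    for vec in vectors:
        mean = sum(vec) / len(vec)
        sigmas.append(math.sqrt(sum((v - mean) ** 2 for v in vec) / len(vec)))
    pvalues = []
    for i in range(len(vectors[0])):
        # Sum of squared standardized components for gene i, cf. Eq. (7.42).
        stat = sum((vec[i] / s) ** 2 for vec, s in zip(vectors, sigmas))
        pvalues.append(chi2_sf(stat, dof=len(vectors)))
    return pvalues
```

Passing a single vector reproduces the one-component form of Eq. (7.43).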

Table 7.25 The top 10 core tensors, G, with large absolute values
Fig. 7.15

Singular value vectors for D. quadriceps. P-values are computed by t test between queens and workers. (a) \(\mathbf {\mathit {u}}^{ \mbox{ ant}(k)}_1, P=1.9 \times 10^{-1}\) (b) \(\mathbf {\mathit {u}}^{ \mbox{ ant}(j)}_5, P=1.25 \times 10^{-3}\) (c) \(\mathbf {\mathit {u}}^{ \mbox{ ant}(i)}_{11}\). Blue open circles are selected genes

The question is whether the selected genes are associated with distinct gene expression and methylation between queens and workers simultaneously. To check this, we apply three statistical tests to the 133 genes and the 128 genes between queens and workers (Table 7.26). Selected genes exhibit simultaneous distinct gene expression and methylation between queens and workers for P. canadensis, but not for D. quadriceps. Thus selected genes can be potential factors that can mediate caste differentiation for P. canadensis, but not for D. quadriceps. Although we are not sure whether the lack of detection for D. quadriceps is due to a biological reason or to a failure of our methodology, our purpose is at least achieved for P. canadensis. Clarifying this point requires further research.

Table 7.26 Statistical tests of the differences (between queens and workers) in gene expression and methylation

In order to see if conventional supervised methods can do this, we apply t test to gene expression and promoter methylation to find genes that exhibit significant distinction between queens and workers. As a result, only two genes, for distinct gene expression between queens and workers in D. quadriceps, are associated with adjusted P-values less than 0.01. This poor performance is because of the small number of samples. Thus, TD based unsupervised FE has the ability to find significant genes in a large p small n problem, for which the conventional supervised method fails.

Before closing this section, we would like to validate selected genes from the biological point of view. Because these two insects are not included in popular enrichment servers, e.g., DAVID or Enrichr, we instead download lists of GO terms, PCAN.v01.GO.tsv for P. canadensis and DQUA.v01.GO.tsv for D. quadriceps. Fisher’s exact test is performed in order to evaluate enrichment, and computed P-values are corrected by BH criterion. GO terms associated with adjusted P-values less than 0.05 are searched. There are three GO terms, Lipid transporter activity (GO:0005319), Lipid particle (GO:0005811), and Lipid transport (GO:0006869), enriched in the 133 genes selected for P. canadensis, while there are no GO terms enriched in the 128 genes selected for D. quadriceps. This might be reasonable because the 128 genes selected for D. quadriceps are not associated with distinct methylation between queens and workers (Table 7.26). In any case, the 133 genes selected for P. canadensis, which are simultaneously associated with distinct gene expression and methylation between queens and workers, are associated with the enrichment of a few GO terms. Thus, at least for P. canadensis, TD based unsupervised FE is useful also from the biological point of view.
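The enrichment of a single GO term among the selected genes can be evaluated by Fisher’s exact test as sketched below. The sketch assumes a one-sided test (over-representation only), which the text does not specify, and uses a hypothetical function name; it relies only on the Python standard library.

```python
from math import comb


def fisher_enrichment_p(n_selected_in_term, n_selected, n_term, n_total):
    """One-sided Fisher's exact test: P[X >= n_selected_in_term] under the
    hypergeometric distribution (n_selected genes drawn from n_total genes,
    n_term of which are annotated with the GO term)."""
    upper = min(n_selected, n_term)
    denom = comb(n_total, n_selected)
    numer = sum(
        comb(n_term, k) * comb(n_total - n_term, n_selected - k)
        for k in range(n_selected_in_term, upper + 1)
    )
    return numer / denom
```

The resulting P-values, one per GO term, would then be corrected by BH criterion before applying the 0.05 threshold.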

8 Drug Discovery From Gene Expression: II

In Sect. 7.3, we have already shown that TD based unsupervised FE successfully identifies compounds that affect gene expression in a dose-dependent manner, together with their target proteins, from gene expression profiles alone in a fully unsupervised manner. Nevertheless, it is strictly restricted to cancers because the gene expression profiles are measured in cancer cell lines. Identifying drug compounds that are effective against other diseases requires additional gene expression profiles of compound treatments in disease-specific systems, e.g., model animals or cell lines originating from the disease. Thus the approach of Sect. 7.3 is of quite limited applicability.

In this section, using case II tensor where genes are shared between two matrices or tensors, we try to identify drugs effective against diseases without measuring gene expression repeatedly for individual diseases. The study design is as follows (Fig. 7.16). \(x_{ij_1j_2}\) is the expression of the ith gene in animals treated by the \(j_1\)th compound at the \(j_2\)th time point after the treatment. \(x_{ij_3}\) is the human gene expression of the ith gene in the \(j_3\)th patient or healthy control. Case II tensor \(x_{ij_1j_2j_3}\) is generated as

$$\displaystyle \begin{aligned} x_{ij_1j_2j_3} = x_{ij_1j_2} x_{ij_3} \end{aligned} $$
(7.44)

HOSVD algorithm, Fig. 3.8, is applied to \(x_{ij_1j_2j_3}\) as

$$\displaystyle \begin{aligned} x_{ij_1j_2j_3} = \sum_{\ell_1=1}^{N_1} \sum_{\ell_2=1}^{N_2} \sum_{\ell_3=1}^{N_3} \sum_{\ell_4=1}^{N_4} G(\ell_1,\ell_2,\ell_3,\ell_4) u^{(j_1)}_{\ell_1 j_1} u^{(j_2)}_{\ell_2 j_2} u^{(j_3)}_{\ell_3 j_3} u^{(i)}_{\ell_4 i} \end{aligned} $$
(7.45)

Then, \(\mathbf {\mathit {u}}^{(j_2)}_{\ell _2}\) that exhibits time dependence and \(\mathbf {\mathit {u}}^{(j_3)}_{\ell _3}\) that exhibits distinction between healthy controls and patients are searched. After identifying \(\ell_2\) and \(\ell_3\), \(\ell_1\) and \(\ell_4\) associated with \(G(\ell_1, \ell_2, \ell_3, \ell_4)\) with larger absolute values are selected. Once \(\ell_1\) and \(\ell_4\) are selected, P-values are attributed to i and \(j_1\) as

$$\displaystyle \begin{aligned} P_i = P_{\chi^2} \left[ > \left(\frac{u_{\ell_4 i}}{\sigma_{\ell_4}} \right)^2\right], \end{aligned} $$
(7.46)

and

$$\displaystyle \begin{aligned} P_{j_1} = P_{\chi^2} \left[ > \left(\frac{u_{\ell_1 j_1}}{\sigma_{\ell_1}} \right)^2\right]. \end{aligned} $$
(7.47)

P-values are corrected by BH criterion, and i and \(j_1\) associated with adjusted P-values less than 0.01 (filled pink circles and filled light green circles surrounded by pink oval in Fig. 7.16) are selected. Target proteins are decided by comparison with external databases (as shown in Fig. 7.5). This process results in a set of candidate drug compounds and candidate target proteins. Figure 7.17 and Table 7.27 summarize the process up to the selection of singular value vectors attributed to genes and compounds. Six diseases are analyzed: heart failure, PTSD, acute lymphoblastic leukemia (ALL), diabetes, renal carcinoma, and cirrhosis. In some cases, the case II tensors have more than four modes because human gene expression profiles are represented not as matrices but as tensors.
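Two steps of this procedure lend themselves to a short NumPy sketch: building the case II tensor of Eq. (7.44) and, once \(\ell_2\) and \(\ell_3\) are fixed, ranking the pairs \((\ell_1, \ell_4)\) by \(|G(\ell_1, \ell_2, \ell_3, \ell_4)|\). The function names are hypothetical, indices are 0-based, and the core tensor is assumed to be stored with its axes in the order \((\ell_1, \ell_2, \ell_3, \ell_4)\) of Eq. (7.45).

```python
import numpy as np


def case2_tensor_drug(x_animal, x_human):
    """x[i, j1, j2, j3] = x_animal[i, j1, j2] * x_human[i, j3], cf. Eq. (7.44)."""
    return np.einsum("iab,ic->iabc", x_animal, x_human)


def top_core_entries(core, ell2, ell3, top=3):
    """Rank (ell1, ell4) pairs by |G(ell1, ell2, ell3, ell4)| for fixed ell2, ell3.

    core is assumed to have axes ordered as (ell1, ell2, ell3, ell4).
    Returns a list of (ell1, ell4, G value), largest absolute value first.
    """
    plane = core[:, ell2, ell3, :]
    flat_order = np.argsort(-np.abs(plane), axis=None)[:top]
    rows, cols = np.unravel_index(flat_order, plane.shape)
    return [(int(r), int(c), float(plane[r, c])) for r, c in zip(rows, cols)]
```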

Fig. 7.16

Integrated analysis of gene expression profile of drug treated animals, \(x_{ij_1j_2}\), and human gene expression profiles of patients and healthy controls, \(x_{ij_3}\). i: genes, \(j_1\): compounds, \(j_2\): time points after the treatment, \(j_3\): human samples

Fig. 7.17

Schematics that illustrate the procedure of TD-based unsupervised FE applied to the various disease and DrugMatrix data sets. SVV: singular value vector. Selected four time points (tps) are 1/4, 1, 3, and 5 days after treatment

Table 7.27 A summary of TDs and identification of various singular value vectors for identification of candidate drugs and genes used to find genes encoding drug target proteins

Gene expression profiles of model animals are downloaded from DrugMatrix [15], where rats are used as model animals and gene expression profiles of various tissues are measured. Corresponding human or rat disease expression profiles are downloaded from GEO. For heart failure, gene expression profiles of human failing hearts and of rat hearts treated by drugs are used. For PTSD, stressed mouse brain gene expression profiles and rat brain gene expression profiles treated by drugs are used. For ALL, bone marrow gene expression profiles of drug treated rats and of human ALL patients are used. For diabetes and renal carcinoma, drug treated rat kidney gene expression profiles are used, together with kidney gene expression profiles of human diabetes and renal carcinoma patients, respectively. For cirrhosis, drug treated rat liver gene expression profiles and human cirrhosis liver gene expression profiles are used. See appendix for more details.

After selecting genes and drugs, genes are uploaded to Enrichr for target protein identification. Genes enriched (adjusted P-values less than 0.01) in “Single gene perturbation GEO up” and “Single gene perturbation GEO down” are selected as target proteins. This process is similar to that illustrated in Fig. 7.5. Table 7.28 summarizes the number of identified genes, compounds, and target proteins.

Table 7.28 The number of genes, drugs, and target proteins identified by TD based unsupervised FE

In order to validate the predicted relationship between drugs and target proteins, we compare them with DINIES [31], which stores known protein–drug interactions. We upload drugs one by one to DINIES with parameters “chemogenomic approach” and “with learning on all DBs” and obtain lists of target proteins. They are merged into a single list of proteins because individual proteins can be targeted by multiple drugs. The obtained set of target proteins is compared with the predicted targets in Table 7.28. Here the proteins considered are limited to genes included in “Single_Gene_Perturbations_from_GEO_all_list” of Enrichr. Table 7.29 shows the results of evaluation by Fisher’s exact test and \(\chi^2\) test. Ten out of twelve are evaluated as significant (P-values less than 0.05) by either Fisher’s exact test or \(\chi^2\) test. This suggests that TD based unsupervised FE can be used for the prediction of target proteins and diseases of drugs only from gene expression profiles, in a fully unsupervised manner in the sense that it does not require any pre-knowledge about disease–drug or protein–drug interaction.
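The comparison between predicted and known target proteins amounts to a test on a 2 × 2 contingency table. The sketch below builds the table from three sets and applies the uncorrected \(\chi^2\) test with one degree of freedom, using only the Python standard library. The function names and the exact way the table is laid out are assumptions for illustration.

```python
import math


def contingency(predicted, known, universe):
    """2 x 2 counts (a, b, c, d) restricted to the universe of proteins considered."""
    universe = set(universe)
    predicted = set(predicted) & universe
    known = set(known) & universe
    a = len(predicted & known)            # predicted and known
    b = len(predicted - known)            # predicted only
    c = len(known - predicted)            # known only
    d = len(universe) - a - b - c         # neither
    return a, b, c, d


def chi2_test_2x2(a, b, c, d):
    """Uncorrected chi-squared statistic and P-value for a 2 x 2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    stat = n * (a * d - b * c) ** 2 / denom
    # One degree of freedom: P[chi^2 > stat] = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))
```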

Table 7.29 Fisher’s exact test (\(P_F\)) and the uncorrected \(\chi^2\) test (\(P_{\chi ^2}\)) of known drug target proteins regarding the inference of the present study

9 Integrated Analysis of miRNA Expression and Methylation

Unsupervised methods are often useful when applied to problems for which no pre-knowledge is available. For example, two kinds of omics data might be correlated for unknown reasons. To search for this kind of hidden (latent) relationship, an unsupervised method is critically useful. In this section, we propose the application of case I type II tensor to investigate the relationship between miRNA expression and methylation, between which no direct relationships are biologically expected.

Promoter methylation of genes targeted by miRNAs can of course affect expression of these genes. Nevertheless, there seem to be no biological reasons that promoter methylation of genes targeted by miRNAs affects the expression of these miRNAs themselves or vice versa. Thus, if we can find any correlations between these two, it might be a starting point of finding new biological points of view.

In this section, we make use of TCGA data set [28]. The data set we analyze is composed of eight normal ovarian tissue samples and 569 tumor samples. Our data set includes expression data on 723 miRNAs as well as promoter methylation profiles of 24,906 genes. They are formatted as matrices

$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ methyl}}_{ij} &\displaystyle \in &\displaystyle \mathbb{R}^{24906 \times 577} \end{array} \end{aligned} $$
(7.48)
$$\displaystyle \begin{aligned} \begin{array}{rcl} x^{\mbox{ miRNA}}_{kj} &\displaystyle \in &\displaystyle \mathbb{R}^{723 \times 577} \end{array} \end{aligned} $$
(7.49)

They are converted to case I tensor because they share samples as

$$\displaystyle \begin{aligned} x_{ijk} = x^{\mbox{ miRNA}}_{kj}x^{\mbox{ methyl}}_{ij} \end{aligned} $$
(7.50)

Usually, HOSVD, Fig. 3.8, is supposed to be applied to \(x_{ijk}\) as

$$\displaystyle \begin{aligned} x_{ijk} = \sum_{\ell_1=1}^{24906} \sum_{\ell_2=1}^{577} \sum_{\ell_3=1}^{723} G(\ell_1,\ell_2,\ell_3) u^{(i)}_{\ell_1 i} u^{(j)}_{\ell_2 j} u^{(k)}_{\ell_3 k}. \end{aligned} $$
(7.51)

Unfortunately, \(x_{ijk}\) is too huge to apply HOSVD directly. Thus, instead, we derive type II tensor as

$$\displaystyle \begin{aligned} x_{ik} = \sum_{j=1}^{577} x_{ijk}. \end{aligned} $$
(7.52)

Now it is a matrix. Thus we can apply PCA to it. Then we can have PC score \(\mathbf {\mathit {u}}_{\ell } \in \mathbb {R}^{723}\) attributed to miRNA and PC loading \(\mathbf {\mathit {v}}_{\ell } \in \mathbb {R}^{24906}\) attributed to methylation. The singular value vectors attributed to sample j are computed in two ways as Eq. (5.15)

$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{(j;k)}_{\ell j} &\displaystyle = &\displaystyle \sum_k u_{\ell k} x^{\mbox{ miRNA}}_{kj}, \end{array} \end{aligned} $$
(7.53)
$$\displaystyle \begin{aligned} \begin{array}{rcl} u^{(j;i)}_{\ell j} &\displaystyle = &\displaystyle \sum_i v_{\ell i} x^{\mbox{ methyl}}_{ij}. \end{array} \end{aligned} $$
(7.54)
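Because \(x_{ijk}\) in Eq. (7.50) is a product of two matrices, the sum over j in Eq. (7.52) is a single matrix product, so the full tensor (24906 × 577 × 723, about \(10^{10}\) elements) never has to be formed. The NumPy sketch below illustrates this together with Eqs. (7.53) and (7.54). As an assumption, a plain SVD without centering stands in for PCA, and the function names and 0-based component index are hypothetical.

```python
import numpy as np


def type2_from_case1(x_methyl, x_mirna):
    """x_ik = sum_j x_methyl[i, j] * x_mirna[k, j], cf. Eqs. (7.50) and (7.52)."""
    return x_methyl @ x_mirna.T


def sample_vectors(x_methyl, x_mirna, ell):
    """Return (u^(j;k), u^(j;i)) for component ell (0-based), cf. Eqs. (7.53), (7.54)."""
    x_ik = type2_from_case1(x_methyl, x_mirna)
    # Assumption: plain SVD (no centering) is used in place of PCA.
    left, _, right_t = np.linalg.svd(x_ik, full_matrices=False)
    v = left[:, ell]        # attributed to methylation (index i)
    u = right_t[ell, :]     # attributed to miRNA (index k)
    return x_mirna.T @ u, x_methyl.T @ v
```

Both returned vectors have one element per sample, so they can be compared with each other and between controls and patients.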

The first thing to check is if there are any \(\ell\) such that \(\mathbf {\mathit {u}}^{(j;k)}_\ell \in \mathbb {R}^{577}\) and \(\mathbf {\mathit {u}}^{(j;i)}_\ell \in \mathbb {R}^{577}\) satisfy the following requirements simultaneously:

  • \(\mathbf {\mathit {u}}^{(j;i)}_\ell \) and \(\mathbf {\mathit {u}}^{(j;k)}_\ell \) are significantly correlated.

  • \(\mathbf {\mathit {u}}^{(j;k)}_\ell \) is expressed distinctly between healthy controls (j ≤ 8) and patients (j > 8).

  • \(\mathbf {\mathit {u}}^{(j;i)}_\ell \) is expressed distinctly between healthy controls (j ≤ 8) and patients (j > 8).

In order to validate these requirements visually, we show scatterplots for \(1 \le \ell \le 9\) (Fig. 7.18). More or less, all nine scatterplots appear to satisfy the above requirements simultaneously. In order to select \(\mathbf {\mathit {u}}_{\ell }\) and \(\mathbf {\mathit {v}}_{\ell }\) used for miRNA and gene selection, respectively, we need to identify which \(\ell\) satisfies the above requirements best. To this end, we propose several measures. First, we select miRNAs and genes. P-values are attributed as

$$\displaystyle \begin{aligned} \begin{array}{rcl} P_k &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \left(\frac{u_{\ell k}}{\sigma_\ell} \right)^2\right], \end{array} \end{aligned} $$
(7.55)
$$\displaystyle \begin{aligned} \begin{array}{rcl} P_i &\displaystyle = &\displaystyle P_{\chi^2} \left[ > \left(\frac{v_{\ell i}}{{\sigma'}_\ell} \right)^2\right]. \end{array} \end{aligned} $$
(7.56)

P-values are adjusted by BH criterion, and i and k associated with adjusted P-values less than 0.01 are selected. Then we require that the selected genes and miRNAs satisfy requirements similar to the above:

  • Selected genes and miRNAs are significantly correlated.

  • Selected miRNAs are expressed distinctly between normal controls (j ≤ 8) and patients (j > 8).

  • Selected genes are methylated distinctly between normal controls (j ≤ 8) and patients (j > 8).

To evaluate these, we compute the following measures:

  (a) Correlation coefficient between \(\mathbf {\mathit {u}}^{(j;i)}_\ell \) and \(\mathbf {\mathit {u}}^{(j;k)}_\ell \).

  (b) P-value attributed to the above correlation coefficients.

  (c) P-values computed by t test that evaluates if \(\mathbf {\mathit {u}}^{(j;k)}_\ell \) is distinct between normal control (j ≤ 8) and patients (j > 8).

  (d) P-values computed by t test that evaluates if \(\mathbf {\mathit {u}}^{(j;i)}_\ell \) is distinct between normal control (j ≤ 8) and patients (j > 8).

  (e) Ratio of significantly correlated pairs of genes and miRNAs selected.

  (f) Ratio of miRNA associated with adjusted P-values computed by t test that evaluates if selected miRNAs are expressed distinctly between normal control (j ≤ 8) and patients (j > 8).

  (g) Ratio of genes associated with adjusted P-values computed by t test that evaluates if selected genes are methylated distinctly between normal control (j ≤ 8) and patients (j > 8).

  (h) The number of selected miRNAs.

  (i) The number of selected genes.

Here correlation is regarded as significant if the associated BH criterion adjusted P-values are less than 0.01 (see page 112 for how to compute P-values attributed to correlation coefficients). Table 7.30 shows the result. \(\ell = 3\) seems to be the best, because it is the best for measures (f) and (g) and the second best for measure (e); measures (e), (f), and (g) are important because they are direct evaluations of selected genes and miRNAs. Because the numbers of selected genes and miRNAs do not vary so much depending on \(\ell\), it is best to select \(\ell = 3\). Because more than 88% of genes and miRNAs and their pairs satisfy the desired requirements above (88% is the smallest ratio (percentage) among requirements from (e) to (g) in Table 7.30), TD based unsupervised FE can be considered to have the ability to select miRNAs and genes satisfying the desired requirements mentioned above.

Table 7.30 Measures that evaluate which \(\ell\) satisfies the desired requirements best
Fig. 7.18

Scatterplots of \(\mathbf {\mathit {u}}^{(j;k)}_\ell \) (horizontal) and \(\mathbf {\mathit {u}}^{(j;i)}_\ell \) (vertical) for \(1 \le \ell \le 9\). Red filled circles: eight normal controls (j ≤ 8), gray filled circles: ovarian cancer patients (j > 8)

In order to see if supervised methods can identify sets of genes and miRNAs satisfying the desired requirements, i.e., selected genes are methylated distinctly between healthy controls and patients, selected miRNAs are expressed distinctly between healthy controls and patients, and selected genes and miRNAs are significantly correlated, we apply t test to select genes methylated distinctly between healthy controls and patients and miRNAs expressed distinctly between healthy controls and patients. P-values are attributed to miRNAs and genes and adjusted by BH criterion. Then, 249 miRNAs and 19,395 genes associated with adjusted P-values less than 0.01 are selected. In order to see what fraction of the total 249 × 19,395 = 4,829,355 pairs is significantly correlated, we compute correlation coefficients between them and attribute P-values to these pairs (see page 112 for how to compute P-values attributed to correlation coefficients). P-values are corrected by BH criterion and 555,391 pairs are associated with adjusted P-values less than 0.01. Because this is as small as 11.5% of the 4,829,355 pairs, t test is inferior to TD based unsupervised FE in identifying genes and miRNAs satisfying the desired requirements.

This poor performance might be because too many genes and miRNAs are selected. P-values given by t test have a strong tendency to decrease when many samples are available. In this example, because as many as 577 samples are available, even genes and miRNAs associated with small distinctions are associated with small enough P-values. In order to avoid this difficulty, we reduce the number of genes and miRNAs selected by t test to match those selected by TD based unsupervised FE, by selecting the top ranked seven miRNAs and 284 methylation probes attributed to genes based upon P-values computed by t test. Then among 7 × 284 = 1967 pairs, as few as 50 pairs are associated with adjusted P-values less than 0.01 attributed to correlation coefficients. Thus, only 2.5% of the 1967 pairs are significantly correlated. The ratio therefore decreases rather than increases, contrary to the expectation.

It might be possible to select genes and miRNAs by first identifying significantly correlated pairs and then finding genes and miRNAs distinct between healthy controls and patients. To test this, correlation coefficients are computed among all pairs of genes and miRNAs. P-values are attributed to correlation coefficients (see page 112 for how to compute P-values attributed to correlation coefficients) and are corrected by BH criterion. Then among 24,906 × 723 = 18,007,038 pairs, 1,197,772 pairs are associated with adjusted P-values less than 0.01. Unfortunately, these pairs include all genes and miRNAs. Thus, starting from significantly correlated pairs is not an effective strategy. The poor performance of t test as well as correlation analysis demonstrates the difficulty of identifying genes and miRNAs satisfying the desired requirements, i.e., selected genes are methylated distinctly between healthy controls and patients, selected miRNAs are expressed distinctly between healthy controls and patients, and selected genes and miRNAs are significantly correlated, which is easily achieved by TD based unsupervised FE.

Before closing this section, the selected genes and miRNAs should also be evaluated biologically. First, 240 gene symbols associated with 284 probes are uploaded to DAVID (Table 7.31). At a glance, the enriched terms do not look deeply related to cancers; a detailed investigation, however, alters this impression. This data set is about ovarian cancer. Its most common subtype is surface epithelial-stromal tumor, which is known to be associated with keratinization [13]. Thus, the detection of keratinization as the most enriched term is reasonable, and the third most enriched term is also related to keratinization. Because the fifth one, epidermis development, is the parent term of keratinization, its detection is also understandable.

Table 7.31 GO BP enrichment by the 274 gene symbols identified by TD based unsupervised FE for ovarian cancer data from TCGA

Next, the selected seven miRNAs are uploaded to DIANA-mirpath for evaluation (Fig. 7.19). They are clearly enriched with various cancers. Thus, the selected seven miRNAs are likely to be related to cancers.

Fig. 7.19 Heatmap that summarizes the results of DIANA-mirpath for the selected seven miRNAs, with the “pathways union” option specified

In conclusion, the genes and miRNAs identified by TD based unsupervised FE are reasonable from the biological point of view as well.

10 Summary

Because TD based unsupervised FE was proposed more recently than PCA based unsupervised FE, the examples of its application introduced in this chapter are very limited. In spite of that, they still cover a wide range of the applications tried in the previous chapter using PCA based unsupervised FE: analysis of time course data sets, integrated analysis of multi-omics data sets, and identification of disease causing genes. In addition, TD based unsupervised FE has a new application target, e.g., in silico drug discovery.

The general procedure for applying TD based unsupervised FE is as follows. If no tensor is available, generate a case I or case II tensor of type I. Occasionally, it might be required to generate a type II tensor in order to reduce the required computational memory. If the generated type II tensor is a matrix, apply PCA. If not, apply HOSVD. If a type II tensor is employed, generate the missing singular value vectors by multiplying the original tensor by the obtained singular value vectors. Seek singular value vectors attributed to samples that are coincident with the desired property, e.g., distinction between controls and treated samples. Then, in order to select the singular value vectors attributed to features that are used for FE, the core tensor is investigated: the singular value vectors attributed to features that share core tensor elements with larger absolute values with the selected singular value vectors attributed to samples are selected. P-values are attributed to features using the selected singular value vectors attributed to features, assuming χ² distributions. P-values are corrected by BH criterion, and features associated with adjusted P-values less than 0.01 are selected.
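
The steps above can be sketched for the simplest situation, in which a three-mode tensor (features × samples × tissues) is already available and small enough for HOSVD to be applied directly, so that no type II tensor is needed. The sketch rests on several assumptions that are not part of the text: the toy tensor with ten planted features is synthetic, the sample singular value vector is chosen automatically by the difference between group means, a single feature singular value vector is used so that the χ² distribution has one degree of freedom, and the vector is scaled by its own standard deviation. NumPy is used for the decomposition.

```python
import math

import numpy as np


def unfold(x, mode):
    """Matricize tensor x so that the given mode indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def hosvd(x):
    """HOSVD of a three-mode tensor: core tensor and one matrix per mode."""
    u = [np.linalg.svd(unfold(x, m), full_matrices=False)[0] for m in range(3)]
    core = np.einsum("ijk,ia,jb,kc->abc", x, u[0], u[1], u[2])
    return core, u


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * p.size / np.arange(1, p.size + 1)
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(p.size)
    adj[order] = np.minimum(adj_sorted, 1.0)
    return adj


# Synthetic toy tensor: 200 features x 6 samples x 3 tissues (hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(scale=0.2, size=(200, 6, 3))
treated = np.array([False, False, False, True, True, True])
x[:10, treated, :] += 1.0   # the first ten features differ between groups
x[:10, ~treated, :] -= 1.0

core, (u_feat, u_samp, u_tis) = hosvd(x)

# Sample singular value vector most distinct between controls and treated.
contrast = np.abs(u_samp[treated].mean(axis=0) - u_samp[~treated].mean(axis=0))
l_samp = int(contrast.argmax())

# Feature singular value vector sharing the largest |core| element with it.
l_feat = int(np.abs(core[:, l_samp, :]).max(axis=1).argmax())

# P-values assuming a chi-squared distribution with one degree of freedom:
# P(chi2 > z**2) equals erfc(|z| / sqrt(2)).
u = u_feat[:, l_feat]
z = np.abs(u) / u.std()
p = np.array([math.erfc(v / math.sqrt(2.0)) for v in z])

selected = np.flatnonzero(bh_adjust(p) < 0.01)
print(l_samp, l_feat, selected)
```

In this toy setting the ten planted features are expected to be the ones that appear in `selected`. With real data, the choice of the sample singular value vector usually requires inspection rather than the automatic rule used here.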

This general procedure can be applied to a wide range of bioinformatics topics, depending upon which singular value vectors attributed to samples are selected. In this sense, TD based unsupervised FE is expected to be applicable to a wider range of biological problems than those treated in this chapter.