
7.1 Introduction

Tensors are multi-dimensional arrays and are commonly used for representing multi-dimensional data, such as sensor streams and social networks [9, 17]. Thanks to the widespread availability of multi-dimensional data, tensor decomposition operations (such as CP [10] and Tucker [31]) are increasingly being used to implement various data analysis tasks, ranging from anomaly detection [17] and correlation analysis [26] to pattern discovery [13] and clustering [22, 28, 32].

A critical challenge for tensor-based analysis is its computational complexity: decomposition can be a bottleneck in some applications [14, 21, 30]. Phan and Cichocki [23] proposed a methodology that partitions the tensor into smaller sub-tensors to deal with this issue: (a) partition the given tensor into blocks (or sub-tensors), (b) decompose each block independently, and then (c) iteratively combine these sub-tensor decompositions into a final decomposition for the input tensor. This process leads to two key observations:

  • Observation # 1: Our key observation in this chapter is that Step (c), which iteratively updates and stitches the sub-tensor decompositions obtained in Steps (a) and (b), is where the various decompositions interact with each other and where any inaccuracies in individual sub-tensor decompositions can propagate (through the update rules introduced in Sect. 7.2.4) to the decomposition of the complete tensor.

  • Observation # 2: We further observe that if we can quantify and capture how these sub-tensors interact and inaccuracies propagate, we can use this information to better allocate resources to tackle the accuracy–efficiency trade-off inherent in the decomposition process.

Based on these two observations, in this chapter, we introduce the notion of sub-tensor impact graphs (SIGs) (Sect. 7.3), which capture and represent how the decompositions of these sub-partitions impact each other and the overall tensor decomposition accuracy and present several complementary algorithms that leverage this novel concept to address various key challenges in tensor decomposition.

7.1.1 Contributions of This Chapter: Sub-Tensor Impact Graphs

While block-based tensor decomposition techniques [15, 23] provide potential opportunities to boost the accuracy/efficiency trade-off, this solution leaves several open questions, including (a) how to partition the tensor and (b) how to most effectively combine results from these partitions. In this chapter, we introduce the notion of sub-tensor impact graphs (SIGs), which quantify how the decompositions of these sub-partitions impact each other and the overall tensor decomposition accuracy and present four complementary algorithms that leverage this novel concept to address various key challenges in tensor decomposition, including personalization, noise, and dynamic data.

7.1.1.1 Challenge #1: Decomposition in the Presence of Dynamic Data

Firstly, we rely on sub-tensor impact graphs (SIGs) to tackle the performance challenges that dynamic data pose in tensor analytics: incremental tensor decomposition. Re-computation of the whole tensor decomposition with each update would incur high computational costs and large memory overheads, especially for applications where data evolves over time and the tensor-based analysis results need to be continuously maintained. In Sect. 7.4, we present a two-phase block-incremental CP-based tensor decomposition technique (BICP), which relies on sub-tensor impact graphs to prune unnecessary computation in the presence of incremental updates on the data [11].

7.1.1.2 Challenge #2: Dealing with Noisy Data

Next, in Sect. 7.5, we present a Noise Adaptive Tensor Decomposition (nTD) method that leverages sub-tensor impact graphs to deal with noisy data. nTD partitions the tensor into multiple sub-tensors and then decomposes each sub-tensor probabilistically through Bayesian factorization—the resulting decompositions are then recombined through an iterative refinement process to obtain the decomposition for the whole tensor. nTD leverages a resource allocation strategy that accounts for the impact of the noise density of one sub-tensor on the decomposition accuracies of the other sub-tensors, based on the underlying sub-tensor impact graph [19].

7.1.1.3 Challenge #3: Personalization of the Decomposition Process

Finally, we introduce a novel personalized tensor decomposition (PTD) mechanism for accounting for the user’s focus and interests during tensor decomposition (Sect. 7.6). We present alternative ways to account for the impact of the accuracy of one region of the tensor on the accuracies of the other regions of the tensor, each based on a different assumption about how the impact of inaccuracies propagates along the tensor. Given a model of impact, PTD (a) first partitions the input tensor in a way that reflects the user’s interest, (b) constructs a sub-tensor impact graph reflecting the tensor content and its partitions, and then (c) analyzes this sub-tensor impact graph (in the light of the user’s interest) to identify initial decomposition ranks for the sub-tensors in a way that will boost the final decomposition accuracies for the partitions of interest [18].

7.2 Background

7.2.1 Tensors

A tensor is a multi-dimensional array. More formally, an N-way or Nth-order tensor is an element of the tensor product of N vector spaces, each of which has its own coordinate system. A third-order tensor has three indices. A first-order tensor is a vector, a second-order tensor is a matrix, and tensors of order three or higher are called higher-order tensors. As in the case of matrices, the dimensions of the tensor array are referred to as its modes. For example, the tensor, \({\mathcal {X}}\ \in \mathbb {R}^{\textit {I}\times \textit {J}\times \textit {K}}\), shown in Fig. 7.1, is of third-order and has three modes: I columns (mode 1), J rows (mode 2), and K tubes (mode 3). Fibers are the higher-order analogue of matrix rows and columns. A fiber is defined by fixing every index but one. A matrix column is a mode-1 fiber and a matrix row is a mode-2 fiber. Slices are two-dimensional sections of a tensor, defined by fixing all but two indices [16].

Fig. 7.1
figure 1

A third-order (3-mode) tensor of dimensions, I × J × K
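To make these terms concrete, the following minimal sketch (Python/NumPy, with hypothetical dimensions and values) shows how fibers and slices of a third-order tensor can be obtained by fixing all but one, or all but two, of its indices.

```python
import numpy as np

# A small third-order tensor of (hypothetical) dimensions I x J x K.
I, J, K = 4, 5, 3
X = np.arange(I * J * K, dtype=float).reshape(I, J, K)

# Fibers: fix every index but one.
mode1_fiber = X[:, 2, 1]    # a column fiber (length I)
mode2_fiber = X[1, :, 1]    # a row fiber (length J)
mode3_fiber = X[1, 2, :]    # a tube fiber (length K)

# Slices: fix all but two indices.
frontal_slice = X[:, :, 0]  # an I x J two-dimensional section
lateral_slice = X[:, 2, :]  # an I x K two-dimensional section
```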

7.2.2 Tensor Decomposition

Tensor-based algorithms, most notably tensor decomposition, are increasingly important tools for analysis, including clustering, of high-dimensional data sets. Intuitively, the tensor decomposition process generalizes matrix decomposition-based data analysis and clustering (such as PCA [7] and SVD [5, 8]) to high-dimensional arrays (known as tensors) and rewrites the given tensor in the form of a set of factor matrices (one for each mode of the input tensor) and a core tensor (which, intuitively, describes the spectral structure of the given tensor). These factor matrices and core tensors can then be used for obtaining multi-modal clusters of the input data. The two most popular tensor decomposition algorithms are the Tucker [31] and the CANDECOMP/PARAFAC (CP) [10] decompositions. We next provide a brief description of these algorithms.

7.2.2.1 CP and Tucker Decompositions

The PARAFAC decomposition can be seen as a generalization of matrix factorizations to tensors [10]. PARAFAC decomposition is also known as CANDECOMP/PARAFAC (CP) decomposition. As shown in Fig. 7.2, given a tensor \( {\mathcal {X}}\), CP factorizes the tensor into F component matrices (where F is a user-supplied non-zero integer value also referred to as the rank of the decomposition). For the simplicity of the discussion, let us consider a 3-mode tensor \( {\mathcal {X}}\ \in \ \mathbb {R}^{\textit {I}\times \textit {J}\times \textit {K}}\). CP would decompose \( {\mathcal {X}}\) into three factor matrices A, B, and C, such that

$$\displaystyle \begin{aligned} {\mathcal{X}} \approx {\tilde{\mathcal{X}}} = \sum_{f=1}^{\textit{F}} a_{f} \circ b_{f} \circ c_{f}, \end{aligned}$$

where \(a_{f}\ \in \ \mathbb {R}^{\textit {I}}\), \(b_{f} \in \mathbb {R}^{\textit {J}}\), and \(c_{f} \in \mathbb {R}^{\textit {K}}\). The factor matrices A, B, C are the combinations of the rank-one component vectors into matrices, e.g., A = [ a1 a2 … aF ]. This is visualized in Fig. 7.2.

Fig. 7.2
figure 2

CP decomposition of a 3-mode tensor results in a diagonal core and three factors
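As an illustration of the formula above, the following sketch (Python/NumPy; the dimensions and the randomly generated factor matrices are hypothetical) reconstructs the rank-F approximation as a sum of F rank-one outer products.

```python
import numpy as np

# Hypothetical dimensions and target rank.
I, J, K, F = 4, 5, 6, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((I, F))   # factor matrix A = [a_1 ... a_F]
B = rng.standard_normal((J, F))   # factor matrix B = [b_1 ... b_F]
C = rng.standard_normal((K, F))   # factor matrix C = [c_1 ... c_F]

# X_tilde[i, j, k] = sum_f a_f[i] * b_f[j] * c_f[k]  (sum of F rank-one tensors)
X_tilde = np.einsum('if,jf,kf->ijk', A, B, C)
```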

Tucker decomposition generalizes the singular value decomposition (SVD) of matrices to higher-dimensional data (Fig. 7.3). Given a tensor \( {\mathcal {X}} \in \mathbb {R}^{\textit {I}\times \textit {J}\times \textit {K}}\), Tucker decomposition factorizes the tensor into factor matrices whose numbers of columns (which may differ across modes) are referred to as the ranks of the decomposition. Tucker decomposition would decompose \( {\mathcal {X}}\) into three matrices A, B, C and one dense core tensor G, such that

$$\displaystyle \begin{aligned} {\mathcal{X}} \approx {\tilde{\mathcal{X}}} = \mathbf{G} \times_{1} \mathbf{A} \times_{2} \mathbf{B} \times_{3} \mathbf{C} \equiv \sum_{p=1}^{\textit{P}} \sum_{q=1}^{\textit{Q}} \sum_{r=1}^{\textit{R}} g_{pqr} a_{p} \circ b_{q} \circ c_{r}, \end{aligned}$$

where \(\mathbf {A} \in \mathbb {R}^{\textit {I}\times \textit {P}}\), \(\mathbf {B} \in \mathbb {R}^{\textit {J}\times \textit {Q}}\), \(\mathbf {C} \in \mathbb {R}^{\textit {K}\times \textit {R}}\) are the factor matrices and can be treated as the principal components in each mode. The (dense) core tensor, \(\mathbf {G} \in \mathbb {R}^{\textit {P}\times \textit {Q}\times \textit {R}}\), indicates the strength of interactions among different components of the factor matrices.

Fig. 7.3
figure 3

Tucker decomposition of a three-mode tensor
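Similarly, the sketch below (Python/NumPy; dimensions, ranks, and values are hypothetical) reconstructs a Tucker approximation by applying the mode products G ×1 A ×2 B ×3 C, expressed here as a single einsum contraction.

```python
import numpy as np

# Hypothetical dimensions I x J x K and Tucker ranks P, Q, R.
I, J, K = 6, 7, 8
P, Q, R = 3, 4, 2
rng = np.random.default_rng(1)
G = rng.standard_normal((P, Q, R))   # dense core tensor
A = rng.standard_normal((I, P))
B = rng.standard_normal((J, Q))
C = rng.standard_normal((K, R))

# X_tilde = G x_1 A x_2 B x_3 C = sum_{p,q,r} g_{pqr} (a_p o b_q o c_r)
X_tilde = np.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)
```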

7.2.2.2 Accuracy of Tensor Decomposition

Note that, in general, unlike matrix decomposition (where each matrix has an exact decomposition), tensors may not have exact decompositions [16]. Therefore, many of the algorithms for decomposing tensors are based on an iterative process that tries to improve the approximation until a convergence condition is reached, such as the alternating least squares (ALS) method: in its most basic form, ALS estimates, at each iteration, one factor matrix while maintaining the other matrices fixed; this process is repeated for each factor matrix associated with the modes of the input tensor. Note that, due to the approximate nature of the tensor decomposition operation, given a decomposition [A, B, C] of \( {\mathcal {X}}\), the tensor \( {\tilde {\mathcal {X}}}\) that one would obtain by re-composing the tensor by combining the factor matrices A, B, and C is often different from the input tensor, \( {\mathcal {X}}\). The accuracy of the decomposition is often measured by considering the Frobenius norm of the difference tensor:

$$\displaystyle \begin{aligned}accuracy( {\mathcal{X}}, {\tilde{\mathcal{X}}}) = 1- error( {\mathcal{X}}, {\tilde{\mathcal{X}}}) = 1 - \left(\frac{\| {\tilde{\mathcal{X}}} - {\mathcal{X}}\|}{\| {\mathcal{X}}\|}\right). \end{aligned}$$
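A minimal sketch of this accuracy measure, assuming the input tensor and its re-composition are available as NumPy arrays, is given below.

```python
import numpy as np

def decomposition_accuracy(X, X_tilde):
    """accuracy(X, X_tilde) = 1 - ||X_tilde - X||_F / ||X||_F (Frobenius norms)."""
    error = np.linalg.norm(X_tilde - X) / np.linalg.norm(X)
    return 1.0 - error
```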

7.2.3 Tensor Decomposition and Clustering

As we mentioned earlier, intuitively, the tensor decomposition process generalizes matrix decomposition to high-dimensional arrays, and the resulting factor matrices and core tensors can then be used for obtaining multi-modal clusters of the input data. Indeed, tensor-based representations of data and tensor decompositions (especially the two widely used decompositions CP [10] and Tucker [31]) have proven to be effective in multi-aspect data analysis and clustering. For instance, [22] used tensor decomposition to cluster patients in a health-care setting based on their individual and health profile data, including age, medical history, and diagnostics: in particular, the authors created a patient information tensor and decomposed this tensor (by nonnegative low-rank approximation methods) to obtain semantic clusters that can be used to characterize patients’ records. Davidson et al. [6] applied tensor decomposition to fMRI data to help differentiate healthy individuals from those affected by Alzheimer’s disease. Cao et al. [3] used a similar tensor decomposition-based approach to cluster face images: the authors modeled a collection of faces as a tensor and applied a tensor-based principal component analysis for seeking face clusters. Wu et al. [32] leveraged CP decomposition (solved through stochastic gradient descent) to cluster heterogeneous information networks: each type of object in the network is represented as a different mode of the tensor. Sun et al. [28], on the other hand, have shown that Tucker decomposition can be used for subspace clustering, which simultaneously conducts dimensionality reduction and membership representation.

Algorithm 1 The outline of the block-based iterative improvement process

7.2.4 Block-Based Tensor Decomposition

One key challenge with tensor decomposition is its computational complexity: decomposition algorithms have high computational costs and, in particular, incur large memory overheads (also known as the intermediary data blow-up problem) and, thus, basic algorithms and naive implementations are not suitable for large problems. HaTen2 [12] focuses on sparse tensors and presents a scalable tensor decomposition suite of methods for Tucker and PARAFAC decompositions on the MapReduce framework. TensorDB [15] leverages a chunk-based framework to store and retrieve data, extends array operations to tensor operations, and introduces optimization schemes for in-database tensor decomposition.

One way to deal with this challenge is to partition the tensor and obtain the tensor decomposition leveraging these smaller partitions. Block-based decomposition techniques partition the given tensor into blocks or sub-tensors, initially decompose each block independently, and then iteratively combine these decompositions into a final decomposition. GridPARAFAC [23], for example, partitions the tensor into pieces, obtains decomposition for each piece (potentially in parallel), and stitches the partial decomposition results into a combined decomposition for the initial tensor through an iterative improvement process. Here, we provide an overview of the block-based tensor decomposition process.

Let us consider an N-mode tensor \( {\mathcal {X}} \in \mathbb {R}^{I_1\times I_2 \times \ldots \times I_N}\), partitioned into a set (or grid) of sub-tensors \( {\mathfrak X} = \{ {\mathcal {X}}_{\mathbf {k}}\; |\; {\mathbf {k}} \in {\mathcal {K}}\}\) where \( {\mathcal {K}}\) is the set of sub-tensor indexes. Without loss of generality, let us assume that \({\mathcal {K}}\) partitions the mode i into Ki equal partitions, i.e., \(|{\mathcal {K}}| = \prod _{i=1}^N K_i\). Let us also assume that we are given a target decomposition rank, F, for the tensor \( {\mathcal {X}}\). Let us further assume that each sub-tensor in \( {\mathfrak X}\) has already been decomposed with target rank F and let \( {\mathfrak U}^{(i)} = \{ {U}^{(i)}_{\mathbf {k}}\;|\; \mathbf {k} \in {\mathcal {K}}\}\) denote the set of F-rank sub-factors corresponding to the sub-tensors in \( {\mathfrak X}\) along mode i. In other words, for each \( {\mathcal {X}}_{\mathbf {k}}\), we have

$$\displaystyle \begin{aligned} \begin{aligned} {\mathcal{X}}_{\mathbf{k}} &\approx {I} \times_1 {U}^{(1)}_{{\mathbf{k}}} \times_2 {U}^{(2)}_{{\mathbf{k}}} \cdots \times_N {U}^{(N)}_{{\mathbf{k}}} , \end{aligned}{} \end{aligned} $$
(7.1)

where I is the N-mode F × F ×… × F identity tensor, where the diagonal entries are all 1s and the rest are all 0s. Given these, Phan and Cichocki [23] present an iterative improvement algorithm for composing these initial sub-factors into the full F-rank factors, A(i) (each one along one mode), for the input tensor, \( {\mathcal {X}}\). The outline of this block-based process is as follows: Let us partition each factor A(i) into Ki parts corresponding to the block boundaries along mode i:

$$\displaystyle \begin{aligned} {A}^{(i)} = [ {A}^{(i)T}_{(1)} {A}^{(i)T}_{(2)} ... {A}^{(i)T}_{(K_i)}]^{T}. \end{aligned}$$

Given this partitioning, each sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\), \({\mathbf {k}}= [k_1, \ldots ,k_i, \ldots , k_N] \in {\mathcal {K}}\) can be described in terms of these sub-factors:

$$\displaystyle \begin{aligned} \begin{aligned}{} {\mathcal{X}}_{\mathbf{k}} &\approx {I} \times_1 {A}^{(1)}_{(k_1)} \times_2 {A}^{(2)}_{(k_2)} \cdots \times_N {A}^{(N)}_{(k_N)} \end{aligned} \end{aligned} $$
(7.2)

Moreover, [23] shows that the current estimate of the sub-factor \( {A}^{(i)}_{(k_i)}\) can be revised using the following update rule (for more details on the update rules, please see [23]):

$$\displaystyle \begin{aligned} {A}^{(i)}_{(k_i)} \longleftarrow {T}^{(i)}_{(k_i)}\left( {S}^{(i)}_{(k_i)}\right)^{-1} \end{aligned} $$
(7.3)

where

$$\displaystyle \begin{aligned} {T}^{(i)}_{(k_i)} &= \sum_{\mathbf{l}\in \{[*,\ldots,*,k_i,*,\ldots,*]\}} {U}^{(i)}_{\mathbf{l}} \left( {P}_{\mathbf{l}} \oslash ( {U}_{\mathbf{l}}^{(i)T} {A}^{(i)}_{(k_i)})\right), \\ {S}^{(i)}_{(k_i)} &= \sum_{\mathbf{l}\in \{[*,\ldots,*,k_i,*,\ldots,*]\}} {Q}_{\mathbf{l}} \oslash \left( {A}_{(k_i)}^{(i)T} {A}^{(i)}_{(k_i)}\right) \end{aligned}$$

such that, given l = [l1, l2, …, lN], we have

  • \( {P}_{\mathbf {l}} = \operatorname *{\circledast }_{h=1}^N( {U}^{(h)T}_{\mathbf {l}} {A}^{(h)}_{(l_h)})\) and \( {Q}_{\mathbf {l}} = \operatorname *{\circledast }_{h=1}^N( {A}^{(h)T}_{(l_h)} {A}^{(h)}_{(l_h)})\).

Above, \(\circledast \) denotes the Hadamard product and ⊘ denotes element-wise division.

The block-based tensor decomposition process is outlined in pseudocode in Algorithm 1. Figure 7.4 provides a visual example of this process: The given input tensor \( {\mathcal {X}}\) is partitioned into two sub-tensors, \( {\mathcal {X}}_{1}\) and \( {\mathcal {X}}_{2}\). In the first stage, each sub-tensor is decomposed by CP, thus obtaining partial factors. The second stage combines these partial decomposed factors using iterative updates to derive the final factors (and the corresponding core) for tensor \( {\mathcal {X}}\).

Fig. 7.4
figure 4

Illustration of block-based tensor decomposition process
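As an illustration of Step (a) of this process, the following sketch (Python/NumPy; the function name and the choice of roughly equal partition boundaries are our assumptions) splits an N-mode tensor into a grid of sub-tensors indexed by k = (k1, …, kN).

```python
import numpy as np
from itertools import product

def partition_into_grid(X, parts_per_mode):
    """Split an N-mode array into a grid of sub-tensors.

    parts_per_mode : [K_1, ..., K_N], number of (roughly equal) partitions
                     along each mode.
    Returns a dict mapping a block index k = (k_1, ..., k_N) to the
    corresponding sub-tensor X_k.
    """
    # K_i + 1 cut points along each mode i
    cuts = [np.linspace(0, dim, k + 1, dtype=int)
            for dim, k in zip(X.shape, parts_per_mode)]
    blocks = {}
    for k in product(*(range(kk) for kk in parts_per_mode)):
        idx = tuple(slice(cuts[i][k[i]], cuts[i][k[i] + 1])
                    for i in range(X.ndim))
        blocks[k] = X[idx]
    return blocks

# e.g., a 3-mode tensor split into a 2 x 2 x 1 grid of four sub-tensors
X = np.random.default_rng(0).standard_normal((8, 6, 4))
sub_tensors = partition_into_grid(X, [2, 2, 1])
```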

7.3 Sub-Tensor Impact Graphs (SIGs) and Sub-Tensor Impact Scores

In this section, we formally introduce the concept of sub-tensor impact graph (SIG) that captures and represents the underlying structure of sub-tensors and helps efficiently calculate the impact of each sub-tensor on the decomposition accuracy of the overall tensor.

Let an N-mode tensor, \( {\mathcal {X}} \in \mathbb {R}^{I_1\times I_2 \times \ldots \times I_N}\), be partitioned into a grid, \( {\mathfrak {X}} = \{ {\mathcal {X}}_{\mathbf {k}}\; |\; {\mathbf {k}} \in {\mathcal {K}}\}\), of sub-tensors, such that

  • Ki indicates the number of partitions along mode-i,

  • the size of the jth partition along mode i is Ij,i (i.e., \(\sum _{j=1}^{K_i} I_{j,i}=I_i\)), and

  • \({\mathcal {K}} = \{ [k_{j_1}, \ldots ,k_{j_i}, \ldots , k_{j_N}]\; |\; 1\leq i \leq N,\; 1 \leq j_i \leq K_i\}\) is a set of sub-tensor indexes.

The number, \(\| {\mathfrak {X}}\|\), of partitions (and thus also the number, \(\|{\mathcal {K}}\|\), of partition indexes) is \(\prod _{i=1}^N K_i\).

Example 7.1

Figure 7.5 shows a 3-mode tensor, partitioned into 27 sub-tensors: 12 tensor-blocks (sub-tensors 1, 3, 7, 9, 10, 12, 16, 18, 19, 21, 25, 27), 12 slices (sub-tensors 2, 8, 11, 17, 20, 26, 4, 6, 13, 15, 22, 24), and three fibers (sub-tensors 5, 14, 23). The specific shapes of the partitions may correspond to the user’s requirements, such as the degree of importance or user focus.

Fig. 7.5
figure 5

A sample 3-mode tensor, partitioned into 27 heterogeneous sub-tensors

7.3.1 Accuracy Dependency Among Sub-Tensors

In Sect. 7.2.4, we presented the update rules that block-based tensor decomposition algorithms use for stitching the individual sub-tensor decompositions into a complete decomposition for the whole tensor. While the precise derivation of these update rules is not critical for our discussion (and is beyond the scope of this chapter), it is important to note that, as visualized in Fig. 7.6, each \( {A}^{(i)}_{(k_i)}\) is maintained incrementally by using, for all 1 ≤ j ≤ N, the current estimates for \( {A}^{(j)}_{(k_j)}\) and the decompositions in \( {\mathfrak U}^{(j)}\), i.e., the F-rank sub-factors of the sub-tensors in \( {\mathfrak X}\) along the different modes of the tensor. Moreover, and most importantly for the present discussion, this update rule for \( {A}^{(i)}_{(k_i)}\) supports the following observation: Given

$$\displaystyle \begin{aligned} {\mathcal{X}}_{\mathbf{k}} \approx {I} \times_1 {A}^{(1)}_{(k_1)} \times_2 {A}^{(2)}_{(k_2)} \cdots \times_N {A}^{(N)}_{(k_N)}, \end{aligned}$$

the final accuracy for the sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\), k = [k1, …, ki, …, kN], depends on the accuracies of sub-factors \( {A}^{(i)}_{(k_i)}\). Moreover, the accuracy of each of these, in turn, depends on the accuracies of the sub-factors of the contributing sub-tensors. More specifically, when updating \( {A}^{(i)}_{(k_i)}\), we need to compute

  • \( {T}^{(i)}_{(k_i)}\), which involves the use of \( {U}^{(i)}_{[*,\ldots ,*,k_i,*,\ldots ,*]}\) (i.e., the mode-i factors of \( {\mathcal {X}}_{[*,\ldots ,*,k_i,*,\ldots ,*]}\)), and

  • \( {P}_{[*,\ldots ,*,k_i,*,\ldots ,*]}\), which in turn uses \( {U}^{(h)}_{[*,\ldots ,*,k_i,*,\ldots ,*]}\) for 1 ≤ h ≤ N (i.e., all factors of \( {\mathcal {X}}_{[*,\ldots ,*,k_i,*,\ldots ,*]}\)).

Therefore, the final accuracy of \( {\mathcal {X}}_{\mathbf {k}}\) depends directly on the initial decomposition accuracies of the factor matrices \( {U}^{(h)}_{[*,\ldots ,*,k_i,*,\ldots ,*]}\), for 1 ≤ i, h ≤ N.

Fig. 7.6
figure 6

The block-based update rule maintains \( {A}^{(i)}_{(k_i)}\) incrementally by using the current estimates for \( {A}^{(j)}_{(k_j)}\) and the decompositions in \({\mathfrak U}^{(j)}\)

In other words, for each sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\), there is a set, \(direct\_impact( {\mathcal {X}}_{\mathbf {k}}) \subseteq {\mathfrak X}\), of sub-tensors that consists of those sub-tensors whose initial decomposition accuracies directly impact the final decomposition accuracy of \( {\mathcal {X}}_{\mathbf {k}}\). Moreover, as visualized in Fig. 7.7, \(direct\_impact( {\mathcal {X}}_{\mathbf {k}})\) consists of those sub-tensors that are aligned (i.e., share the same slices) with \( {\mathcal {X}}_{\mathbf {k}}\), along the different modes of the tensor.

Fig. 7.7
figure 7

The sub-tensors whose initial decomposition accuracies directly impact given sub-tensors are aligned (i.e., share the same slices) with that sub-tensor along the different modes of the tensor

7.3.2 Sub-Tensor Impact Graphs (SIGs)

Given the accuracy dependencies among the sub-tensors formalized above, we can define a sub-tensor impact graph (SIG):

Definition 7.1

Let an N-mode tensor, \( {\mathcal {X}} \in \mathbb {R}^{I_1\times I_2 \times \ldots \times I_N}\), be partitioned into a grid, \( {\mathfrak {X}} = \{ {\mathcal {X}}_{\mathbf {k}}\; |\; {\mathbf {k}} \in {\mathcal {K}}\}\), of sub-tensors. The corresponding sub-tensor impact graph (SIG) is a directed, weighted graph, G(V, E, w()), where

  • for each \( {\mathcal {X}}_{\mathbf {k}} \in {\mathcal {X}}\), there exists a corresponding vk ∈ V ,

  • for each \( {\mathcal {X}}_{\mathbf {l}} \in direct\_impact( {\mathcal {X}}_{\mathbf {k}})\), there exists a directed edge vl → vk in E, and

  • w() is an edge weight function, such that w(vl → vk) quantifies the direct accuracy impact of decomposition accuracy of \( {\mathcal {X}}_{\mathbf {l}}\) on \( {\mathcal {X}}_{\mathbf {k}}\). ♢

Intuitively, the sub-tensor impact graph represents how the decomposition accuracies of the given set of sub-tensors of an input tensor impact the overall combined decomposition accuracy. A key requirement, of course, is to define the edge weight function, w(), that quantifies the accuracy impacts of the sub-tensors that are related through update rules. In this section, we introduce three alternative strategies to account for the propagation of impacts within the tensor during the decomposition process.

7.3.2.1 Alt. #1: Uniform Edge Weights

The most straightforward way to set the weights of the edges in E is to assume that the propagation of the inaccuracies over the sub-tensor impact graph is uniform. In other words, in this case, for all e ∈ E, we set wuni(e) = 1.

7.3.2.2 Alt. #2: Surface of Interaction-Based Weights

While being simple, the uniform edge weight alternative may not properly account for the impact of the varying dimensions of the sub-tensors on the error propagation.

As we see in Fig. 7.5, in general, the neighbors of a given sub-tensor can be of varying shape and dimensions and we may need to account for this diversity in order to properly assess how inaccuracies propagate in the tensor. In particular, in this subsection, we argue that the surface of interaction between two sub-tensors \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal { X}}_{\mathbf {l}}\), defined as below, may need to be considered to account for impact propagation:

Definition 7.2 (Surface of Interaction)

Let \( {\mathcal {X}}\) be a tensor partitioned into a set (or grid) of sub-tensors \( {\mathfrak {X}} = \{ {\mathcal {X}}_{\mathbf {k}}\; |\; {\mathbf {k}} \in {\mathcal {K}}\}\). Let also \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal {X}}_{\mathbf {l}}\) be two sub-tensors in \( {\mathfrak {X}}\), such that

  • \(\mathbf {j} = [k_{j_1}, k_{j_2},\ldots , k_{j_N}]\) and

  • \(\mathbf {l} = [k_{l_1}, k_{l_2},\ldots , k_{l_N}]\).

We define the surface of interaction, \(surf( {\mathcal {X}}_{\mathbf {j}}, {\mathcal {X}}_{\mathbf {l}})\), between \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal {X}}_{\mathbf {l}}\) as follows:

$$\displaystyle \begin{aligned} surf( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{l}}) = \prod_{h\; s.t.\;j_h = l_h} I_{{j_h},h}. \end{aligned}$$

Here \(I_{j_h,h}\) is the size of the jhth partition along mode h.

Principle 1

Let G(V, E, w()) be a sub-tensor impact graph and let (vj → vl) ∈ E be an edge in the graph. The weight of this edge from vj to vl should reflect the area of the surface of interaction between the sub-tensors \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal {X}}_{\mathbf {l}}\).

Intuitively, this principle verbalizes the observation that impacts are likely to propagate more easily if two sub-tensors share large dimensions along the modes on which their partitions coincide. Under this principle, we can set the weight of the edge (vj → vl) ∈ E as follows:

$$\displaystyle \begin{aligned}w_{\mathrm{sur}}(v_{\mathbf{j}} \rightarrow v_{\mathbf{l}}) = \frac{surf( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{l}})} {\sum_{ (v_{\mathbf{j}} \rightarrow v_{\mathbf{m}}) \in E } surf( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{m}}) }. \end{aligned}$$
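The following sketch (Python/NumPy; the function names and the representation of block indexes and partition sizes are our assumptions) computes the surface of interaction of Definition 7.2 and the resulting normalized edge weights w_sur.

```python
import numpy as np

def surface_of_interaction(j, l, part_sizes):
    """surf(X_j, X_l): product of the partition sizes I_{j_h, h} over the
    modes h on which the two block indexes coincide (j_h == l_h)."""
    return float(np.prod([part_sizes[h][jh]
                          for h, (jh, lh) in enumerate(zip(j, l)) if jh == lh]))

def surface_edge_weights(edges, part_sizes):
    """w_sur(v_j -> v_l): surf(j, l) normalized over all out-edges of v_j."""
    out_total = {}
    for j, l in edges:
        out_total[j] = out_total.get(j, 0.0) + surface_of_interaction(j, l, part_sizes)
    return {(j, l): surface_of_interaction(j, l, part_sizes) / out_total[j]
            for j, l in edges}

# e.g., part_sizes[h][p] is the size of the p-th partition along mode h
part_sizes = [[4, 6], [3, 3]]
edges = [((0, 0), (0, 1)), ((0, 0), (1, 0))]
weights = surface_edge_weights(edges, part_sizes)   # {.: 4/7, .: 3/7}
```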

7.3.2.3 Alt. #3: Value Alignment-Based Edge Weights

Although surface of interaction-based edge weights can potentially account for the varying shapes and sizes of the sub-tensors of \( {\mathcal {X}}\), they fail to take into account how similar these sub-tensors are—more specifically, they ignore how the values within the sub-tensors are distributed and whether these distributions are aligned across them.

Intuitively, if the value distributions are aligned (or similar) along the modes that two sub-tensors share, then they are likely to have high impacts on each other’s decomposition during the decomposition process. If they are dissimilar, on the other hand, their impacts on each other will be minimal. Therefore, considering only the area of the surface of interaction may not be sufficient to properly account for the inaccuracy propagation within the tensor. More specifically, we need to measure the value alignment between sub-tensors as well:

Definition 7.3 (Value Alignment)

Let \( {\mathcal {X}}\) be a tensor partitioned into a set (or grid) of sub-tensors \( {\mathfrak {X}} = \{ {\mathcal {X}}_{\mathbf {k}}\; |\; {\mathbf {k}} \in {\mathcal {K}}\}\). Let also \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal {X}}_{\mathbf {l}}\) be two sub-tensors in \( {\mathfrak {X}}\), such that

  • \(\mathbf {j} = [k_{j_1}, k_{j_2},\ldots , k_{j_N}]\) and

  • \(\mathbf {l} = [k_{l_1}, k_{l_2},\ldots , k_{l_N}]\).

Let \(A = \{h\; | \; k_{j_h}=k_{l_h} \}\) be the set of modes along which the two sub-tensors are aligned and let R be the remaining modes. We define the value alignment, \(align( {\mathcal { X}}_{\mathbf {j}}, {\mathcal {X}}_{\mathbf {l}}, A)\), between \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal {X}}_{\mathbf {l}}\) as

$$\displaystyle \begin{aligned} align( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{l}},A) = cos(\mathbf{ {c}}_{\mathbf{j}}(A),\mathbf{ {c}}_{\mathbf{l}}(A)),\end{aligned} $$

where the vector cj(A) is constructed from the sub-tensor \( {\mathcal {X}}_{\mathbf {j}}\) as follows:

$$\displaystyle \begin{aligned}\mathbf{ {c}}_{\mathbf{j}}(A) = vectorize ( {\mathcal{M}}_{\mathbf{j}}(A))\end{aligned} $$

and the tensor \( {\mathcal {M}}_{\mathbf {j}}(A)\) is constructed from \( {\mathcal {X}}_{\mathbf {j}}\) by fixing the values along the modes in A: \(\forall 1 \leq i_h \leq I_{j_h,h}\),

$$\displaystyle \begin{aligned} {\mathcal{M}}_{\mathbf{j}}(A)[i_1, i_2, \ldots, i_{|A|}] = norm\left( {\mathcal{X}}_{\mathbf{j}}[\ldots, i_1, \ldots, i_2, \ldots, i_{|A|}, \ldots]\right), \end{aligned}$$

where norm() is the standard Frobenius norm and \( {\mathcal {X}}_{\mathbf {j}}[\ldots, i_1, \ldots, i_2, \ldots, i_{|A|}, \ldots]\) denotes the part of \( {\mathcal {X}}_{\mathbf {j}}\) where the modes in A take values i1, i2, through i|A|. ♢

Intuitively, cj(A) captures the value distribution of the tensor \( {\mathcal {X}}_{\mathbf {j}}\) along the modes in A.

Principle 2

Let G(V, E, w()) be a sub-tensor impact graph and let (vj → vl) ∈ E be an edge in the graph. The weight of this edge from vj to vl should reflect the structural alignment between the sub-tensors \( {\mathcal {X}}_{\mathbf {j}}\) and \( {\mathcal {X}}_{\mathbf {l}}\).

This principle verbalizes the observation that impacts are likely to propagate more easily if two given sub-tensors are structurally aligned along the modes on which their partitions coincide. As before, under this principle, we can set the edge weights of the edge (vj → vl) ∈ E in the sub-tensor impact graph as follows:

$$\displaystyle \begin{aligned}w_{\mathrm{align}}(v_{\mathbf{j}} \rightarrow v_{\mathbf{l}}) = \frac{align( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{l}})} {\sum_{ (v_{\mathbf{j}} \rightarrow v_{\mathbf{m}}) \in E } align( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{m}}) }. \end{aligned}$$
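A minimal sketch of the value alignment measure of Definition 7.3 is given below (Python/NumPy; it assumes the two sub-tensors have identical sizes along their shared modes and that the set A of shared modes has already been determined from the block indexes).

```python
import numpy as np

def value_alignment(X_j, X_l, shared_modes):
    """align(X_j, X_l, A): cosine similarity of c_j(A) and c_l(A), where c(A)
    collects the Frobenius norms of the parts of a sub-tensor obtained by
    fixing the index combinations along the modes in A (Definition 7.3)."""
    def c_vector(X, A):
        other = tuple(m for m in range(X.ndim) if m not in A)
        # M(A)[i_1, ..., i_|A|] = Frobenius norm of the corresponding part of X
        M = np.sqrt(np.sum(X ** 2, axis=other))
        return M.ravel()

    cj, cl = c_vector(X_j, shared_modes), c_vector(X_l, shared_modes)
    denom = np.linalg.norm(cj) * np.linalg.norm(cl)
    return float(cj @ cl / denom) if denom > 0 else 0.0
```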

7.3.2.4 Alt. #4: Combined Edge Weights

The surface of interaction-based edge weights account for the shapes of the sub-tensors, but do not account for their value alignments. In contrast, value alignment-based edge weights consider the structural similarities of the sub-tensors, but ignore how big the surfaces they share are.

Therefore, a potentially more effective alternative would be to combine these surface of interaction and value alignment-based edge weights into a single weight that takes into account both aspects of sub-tensor interaction:

$$\displaystyle \begin{aligned} w_{\mathrm{comb}}(v_{\mathbf{j}} \rightarrow v_{\mathbf{l}}) = \frac{comb( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{l}})} {\sum_{ (v_{\mathbf{j}} \rightarrow v_{\mathbf{m}}) \in E } comb( {\mathcal{X}}_{\mathbf{j}}, {\mathcal{X}}_{\mathbf{m}}) }, \end{aligned}$$

where \(comb( {\mathcal {Y}}, {\mathcal {Z}}) = align( {\mathcal {Y}}, {\mathcal {Z}}) \times surf( {\mathcal {Y}}, {\mathcal {Z}})\).

7.3.3 Sub-Tensor Impact Scores

While the edges on the sub-tensor impact graph, G, account for how (in)accuracies propagate during each individual application of the update rules, it is important to note that after several iterations of updates, indirect propagation of impacts also occur over the graph G:

  • during the first application of the update rule, impacts propagate among the sub-tensors that are immediate neighbors;

  • during the second application of the update rule, impacts reach from one sub-tensor to those sub-tensors that are 2-hop away;

  • during the mth application of the rule, impacts propagate to the m-hop neighbors of each sub-tensor.

In order to use the sub-tensor impact graph to assign resources, we therefore need to measure how impacts propagate within G over a large number of iterations of the alternating least squares (ALS) process.

For this purpose, we rely on a random-walk-based measure of node relatedness on the given graph. More specifically, we rely on personalized PageRank (PPR [2, 4]) to measure sub-tensor relatedness. Like all random-walk-based techniques, PPR encodes the structure of the graph in the form of a transition matrix of a stochastic process and complements this with a seed node set, S ⊆ V , which serves as the context in which scores are assigned: each node, vi, in the graph is associated with a score based on its position in the graph relative to this seed set (i.e., how many paths there are between vi and the seed set and how short these paths are). Intuitively, these seeds represent sub-tensors that are critical in the given application (e.g., high-update, high-noise, or high-user-relevance; see Sects. 7.4 through 7.6 for various applications).

Given the graph and the seeds, the PPR score p[i] of vi is obtained by solving the following equation:

$$\displaystyle \begin{aligned} \mathbf{p} = (1- \beta) {\mathbf{T}}_G \;\mathbf{p} + \beta \mathbf{s}, {}\end{aligned} $$

where TG denotes the transition matrix corresponding to the graph G (and the underlying edge weights) and s is a re-seeding vector such that if vi ∈ S, then \(\mathbf {s}[i] = \frac {1}{\|S\|}\) and s[i] = 0, otherwise. Intuitively, p is the stationary distribution of a random walk on G which follows graph edges (according to the transition probabilities TG) with probability (1 − β) and jumps to one of the seeds with probability β. Correspondingly, those nodes that are close to the seed nodes over a large number of paths obtain large scores, whereas those that are poorly connected to the nodes in S receive small PPR scores. We note that the iterative nature of the random-walk process underlying PPR fits well with how inaccuracies propagate during the iterative ALS process. Based on this observation, given a directed, weighted sub-tensor impact graph (SIG), G(V, E, w()), we construct a transition matrix, TG, and obtain the PPR score vector p by solving the above equation. The resulting sub-tensor impact scores are then used for assigning appropriate resources to the various sub-tensors as described in the next three sections.
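A simple power-iteration sketch of this PPR computation is shown below (Python/NumPy; the convergence threshold and the column-stochastic orientation of the transition matrix are our assumptions).

```python
import numpy as np

def personalized_pagerank(T, seeds, beta=0.15, max_iters=1000, tol=1e-10):
    """Solve p = (1 - beta) * T p + beta * s by power iteration.

    T     : column-stochastic transition matrix of the SIG
            (T[l, j] = probability of moving from node j to node l)
    seeds : indexes of the seed sub-tensors (e.g., updated or noisy blocks)
    """
    n = T.shape[0]
    s = np.zeros(n)
    s[list(seeds)] = 1.0 / len(seeds)      # re-seeding vector
    p = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        p_next = (1.0 - beta) * (T @ p) + beta * s
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p
```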

7.4 Application #1: Block-Incremental CP Decomposition (BICP) and Update Scheduling Based on Sub-Tensor Impact Scores

There are many applications in which data is evolving dynamically. Obviously, in such scenarios, re-computation of the whole tensor decomposition with each update will cause high computational costs. In this section, we present a block-incremental CP decomposition (BICP) scheme which leverages SIGs to efficiently conduct the iterative refinement process during the second phase of the block-based tensor decomposition process. Let us assume that we are given a tensor, \( {\mathcal {X}}\), along with its decomposition, and an update, Δ, on the tensor. BICP significantly reduces the computational cost of obtaining the decomposition of the updated tensor, while maintaining high accuracy, by relying on two complementary techniques:

  • Update-Sensitive Block Maintenance in First Phase: In its first phase of the process, instead of repeatedly conducting ALS on each sub-tensor, BICP only revises the decompositions of the sub-tensors that contain updated data. Moreover, when the update is small with respect to the block size, BICP relies on incremental factor tracking [20, 27] to avoid re-decomposition of the updated sub-tensor.

  • Update-Sensitive Refinement in the Second Phase: In its second phase, BICP leverages (automatically extracted) metadata about how decompositions of the sub-tensors impact each other’s decompositions and a block-centric iterative refinement to help achieve high efficiency and accuracy:

    • BICP limits the refinement process to only those blocks that are aligned with the updated block.

    • We employ the sub-tensor impact graph (SIG) to account for the refinement relationships among the sub-tensors; we further apply impact scores to reduce redundant work: we

      • identify sub-tensors that do not need to be refined and (probabilistically) prune them from further consideration, and/or

      • assign different ranks to different sub-tensors according to their impact scores: naturally, the larger the impact likelihood of a sub-tensor is, the larger the target rank BICP assigns to that sub-tensor.

      Intuitively, the above process enables BICP to assign appropriate levels of accuracy to sub-tensors in a way that reflects the distribution of the updates on the whole tensor. This ensures that the process is fast and accurate.

In this chapter, we focus on the SIG-based update sensitive refinement during the second phase of the block-based decomposition process.

7.4.1 Reducing Redundant Refinements

During the refinement process of Phase 2, those sub-tensors that have direct refinement relationships with the updated sub-tensors are critical to the refinement process. Our key observation is that if we could quantify how much an update on a sub-tensor impacts the sub-factors of other sub-tensors, then we could use this to optimize Phase 2. More specifically, given an update, Δ, on tensor \( {\mathcal {X}}\), BICP assigns an update sensitive impact score, \(I_{\varDelta }( {\mathcal {X}}_{\mathbf {k}})\), to each sub-tensor, \( {\mathcal {X}}_{\mathbf {k}}\), and leverages this impact score to regulate the refinement process and eliminate redundant work.

Intuitively, if two sub-tensors are similarly distributed along the modes that they share, then they are likely to have high impacts on each other’s decomposition; therefore, we use alternative #3, value alignment-based edge weights (introduced in Sect. 7.3.2), to assign the edge weights. To calculate an update sensitive impact score, we rely on personalized PageRank (introduced in Sect. 7.3.3) to measure sub-tensor relatedness. PPR encodes the structure of the graph in the form of a transition matrix of a stochastic process from which the significances of the nodes in the graph can be inferred. Here, we choose the updated sub-tensors as seed nodes and calculate PPR scores for all the other nodes as their impact scores.

  • Optimization Phase 2-I: Intuitively, if a sub-tensor has a low impact score, its decomposition is minimally affected given the update, Δ. Therefore, those sub-tensors with very low-impact factors can be completely ignored in the refinement process and their sub-factors can be left as they are without any refinement.

  • Optimization Phase 2-P: While optimization phase 2-I can potentially save a lot of redundant work, completely ignoring low-impact tensors may have a significant impact on accuracy. An alternative approach, with a less drastic impact than ignoring sub-tensors, is to associate a refinement probability to sub-tensors based on their impact scores. In particular, instead of completely ignoring those sub-tensors with low-impact factors, we assign them an update probability, 0 < prob_update < 1. Consequently, while the factors of sub-tensors with high impact scores are refined at every iteration of the refinement process, factors of sub-tensors with low-impact scores have lesser probabilities of refinement and, thus, do not get refined at every iteration of Phase 2.

  • Optimization Phase 2-R: A second alternative to completely ignoring the refinement process for low-impact sub-tensors is to assign different ranks to different sub-tensors according to their impact scores: naturally, the higher the target rank is, the more accurate the decomposition of the sub-tensor is. We achieve this by adjusting the decomposition rank, Fk of \( {\mathcal {X}}_{\mathbf {k}}\), as a function of the corresponding tensor’s update sensitive impact score:

    $$\displaystyle \begin{aligned} F_{\mathbf{k}} = \left\lceil F \times \frac{I_{\delta}( {\mathcal{X}}_{\mathbf{k}})}{max_{\mathbf{h}}\{I_{\delta}( {\mathcal{X}}_{\mathbf{h}})\}} \right\rceil. \end{aligned}$$

    Intuitively, this formula sets the decomposition rank of the sub-tensor with the highest impact score relative to the given update, Δ, to F; other sub-tensors are assigned progressively smaller ranks (potentially all the way down to 1) based on their impact scores (see the sketch following this list for an illustration). Once the new ranks are computed, we obtain new U(k) factors with partial ranks Fk for \({\mathcal {X}}_{\mathbf {k}}\) and refine these incrementally in Phase 2.

    Here, we consider two rank-based optimization strategies, phase 2-Ra and phase 2-Ri. In phase 2-Ra, we potentially adjust the decomposition rank for all relevant sub-tensors. In phase 2-Ri, however, we adjust ranks only for sub-tensors with high impact on the overall decomposition.
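To illustrate the rank-adjustment rule used by optimization phase 2-R above, the following sketch (Python/NumPy; the impact scores in the example are hypothetical) maps update-sensitive impact scores to per-sub-tensor decomposition ranks.

```python
import numpy as np

def assign_block_ranks(impact_scores, F):
    """Phase 2-R sketch: F_k = ceil(F * I_delta(X_k) / max_h I_delta(X_h)),
    where impact_scores holds the update-sensitive impact scores (e.g., PPR
    scores computed with the updated sub-tensors as seeds)."""
    scores = np.asarray(impact_scores, dtype=float)
    return np.ceil(F * scores / scores.max()).astype(int)

# e.g., five sub-tensors with hypothetical impact scores and target rank F = 10
ranks = assign_block_ranks([0.30, 0.05, 0.22, 0.01, 0.42], F=10)
# ranks -> [8, 2, 6, 1, 10]
```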

By extending the complexity formulation from [23], we can obtain the complexity of Phase 2 as \(\mathcal {O}((F\times \sum _{i=1}^{N}\frac {I_{i}}{K_{i}}+F^{2})\times \mathcal {T} \times H\times \left | \mathcal {D}\right |) \) where \(\mathcal {T}\) is the number of refinement iterations, H = (100 − L)% is the ratio of high impact sub-tensors maintained, \(\left | \mathcal {D}\right |\) is the number of sub-tensors that have direct impact on updated sub-tensors, Ii is the dimensionality of the tensor along mode i, and Ki is the number of partitions along that mode.

7.4.2 Evaluation

In this section, we report sample results that aim to assess the effectiveness of the proposed BICP approach in helping eliminate redundant refinements.

7.4.2.1 Setup

Data Sets

In these experiments, we used three data sets: Epinions [29], Ciao [29], and Enron [24]. The first two of these are comparable in terms of their sizes and semantics: they are both 5000 × 5000 × 27 tensors, with schema 〈user, item, category〉, and densities 1.089 × 10−6 and 1.06 × 10−6, respectively. The Enron email data set, on the other hand, has dimensions 5632 × 184 × 184, density 1.8 × 10−4, and schema 〈time, from, to〉.

Data Updates

We divided the tensor into 64 blocks (using 4 × 4 × 4 partitioning) and applied all the updates to four of these blocks. Once the blocks are selected, we randomly pick a slice in each selected block and update 10% of the fibers on this slice.

Alternative Strategies

We consider the following strategies to maintain the tensor decomposition: Firstly, we apply the basic two-phase block-centric decomposition strategy, i.e., we decompose all sub-tensors with CPALS in Phase 1 and we apply iterative refinement using all sub-tensors in Phase 2 (in the charts, we refer to this non-incremental decomposition approach as ORI). For Phase 1, we use a version of STA where we update fibers that are update-critical, i.e., with highest energy among all the affected fibers. For Phase 2, again, we have several alternatives: (a) applying Phase 2 without any impact score-based optimization (P2N), (b) ignoring L% of sub-tensors with the lowest impact scores (P2I), (c) reducing the decomposition rank of sub-tensors (P2Ra and P2Ri), or (d) using probabilistic refinements for sub-tensors with low impact scores (P2P). In these experiments, we choose L = 50% and, for P2P, we set the update probability to p = 0.1. In addition to the block-based BCIP and its optimizations, we also considered, as an efficient alternative, application of the incremental factor tracking process to the whole tensor as in STA [27]—in the charts, we refer to this alternative approach as Whole.

Evaluation Criteria

We use the measure reported in Sect. 7.2.2.2 to assess decomposition accuracy. We also report decomposition time for different settings. In these experiments, the target decomposition rank is set to F = 10. Unless otherwise specified, the maximum number of iterations in Phase 2 is set to 1000. Each experiment was run 100 times and averages are reported.

Hardware and Software

We used a quad-core Intel(R) Core(TM)i5-2400 CPU @ 3.10GHz machine with 8.00GB RAM. All codes were implemented in Matlab and run using Matlab 7.11.0 (2010b) and Tensor Toolbox Version 2.5 [1].

7.4.2.2 Discussion of the Results

Impact scores measure how different regions of the tensor impact other parts of the tensor during the alternating least squares (ALS) process. Therefore, we expect that, when we leverage the impact scores (computed in a way that accounts for the distribution of the data updates) to assign the decomposition ranks, we should be able to focus the decomposition work to better fit the dynamically evolving data. Figure 7.8 compares the execution times and accuracies of several approaches. Here, ORI indicates the non-incremental two-phase block-centric decomposition, whereas Whole indicates the application of factor tracking to the whole tensor. The other five techniques in the figure (P2N, P2I, P2Ri, P2Ra, P2P) all correspond to optimizations of the proposed BICP approach for Phase 2.

Fig. 7.8
figure 8

Comparison of (a) Execution times and (b) Decomposition accuracies under the default configuration: the proposed optimizations provide several orders of gain in execution time relative to ORI, while (unlike Whole) they match ORI’s accuracy

Firstly, this figure shows that the two social media data sets, Epinions and Ciao, with similar sizes and densities, show very similar execution time and accuracy patterns. The figure also shows that the Enron data set exhibits a roughly similar pattern, despite having a different size and density.

The key observation in Fig. 7.8 is that the SIG-based optimizations provide several orders of gain in execution time while matching the accuracy of non-optimized version almost perfectly (i.e., the optimizations come without accuracy penalties). In contrast, the alternative strategy, Whole, which incrementally maintains the factors of the whole tensor (as opposed to maintaining the factors of its blocks) also provides execution time gains, but sees a significant drop in its accuracy.

We note that P2P, which probabilistically updates low-impact sub-tensors rather than completely ignoring them, does not significantly improve accuracy. This is because the P2I approach already has an accuracy almost identical to P2N, i.e., ignoring low-impact tensors is a very safe and effective method to save redundant work. Therefore, also considering that, unless a large number of blocks are ignored, P2I is able to match the accuracy of P2N, we do not see a major need to use P2P to reduce the impact of ignored sub-tensors.

7.5 Application #2: Noise-Profile Adaptive Decomposition (nTD) and Sample Assignment Based on Sub-Tensor Impact Scores

Many tensor decomposition schemes are sensitive to noisy data, an inevitable problem in the real world that can lead to false conclusions. Recent research has shown that it is possible to avoid over-fitting by relying on probabilistic techniques that leverage Gibbs sampling-based Bayesian model learning [33]; however, (a) these assume that all the data and intermediary results can fit in the main memory, and (b) they treat the entire tensor uniformly, ignoring potential non-uniformities in the noise distribution. In this chapter, we present a Noise Adaptive Tensor Decomposition (nTD) method, which leverages a probabilistic two-phase decomposition strategy, complemented with sub-tensor impact graphs, to develop a sample assignment strategy that leverages available rough knowledge regarding where in the tensor noise might be more prevalent and, thus, best suits the noise distribution of the given tensor.

More specifically, nTD partitions the tensor into multiple sub-tensors and then decomposes each sub-tensor probabilistically through Bayesian factorization—the resulting decompositions are then recombined through an iterative refinement process to obtain the decomposition for the whole tensor. We refer to this as Grid-Based Probabilistic Tensor Decomposition (GPTD). nTD complements GPTD with a SIG-based resource allocation strategy that accounts for the impact of the noise density of one sub-tensor on the decomposition accuracies of the other sub-tensors. This provides several benefits: Firstly, the partitioning helps ensure that the memory footprint of the decomposition is kept low. Secondly, the probabilistic framework used in the first phase ensures that the decomposition is robust to the presence of noise in the sub-tensors. Thirdly, a priori knowledge about noise distribution among the sub-tensors is used to obtain a resource assignment strategy that best suits the noise profile of the given tensor.

Algorithm 2 Phase 1: Monte Carlo-based Bayesian decomposition of each sub-tensor

7.5.1 Grid-Based Probabilistic Tensor Decomposition (GPTD)

As a block-based algorithm, Grid-Based Probabilistic Tensor Decomposition (GPTD) partitions the given tensor into blocks, decomposes each block independently, and then iteratively combines these decompositions into a final decomposition. Differently from Algorithm 1, however, GPTD leverages Monte Carlo-based Bayesian decomposition of sub-tensors in its Phase 1 (see Algorithm 2) to better deal with the problem of over-fitting, which is a challenge especially when the data is noisy.

Intuitively, entries in the factor matrices are modeled as probabilistic variables and decomposition is posed as a maximization problem where these (latent) random variables fit the observed data. In the presence of noise in the data, the observed variables may also be modeled probabilistically: since the observations cannot be precisely described, they may be considered as samples from a probability distribution. In this section, following [25], in the presence of data uncertainty (due to noise), we describe the fit between the observed data and the predicted latent factor matrices, probabilistically, as follows:

$$\displaystyle \begin{aligned} \begin{aligned}{} {\mathcal{X}}^{}_{{\mathbf{k}(i_1,i_2,\ldots,i_N)}}\Big|{ {U}^{(1)}_{{\mathbf{k}}}},{ {U}^{(2)}_{{\mathbf{k}}}}\ldots,{ {U}^{(N)}_{{\mathbf{k}}}} \sim {\mathcal{N}}([{ {U}^{(1)}_{{\mathbf{k}(i_1)}}},{ {U}^{(2)}_{{\mathbf{k}(i_2)}}}\ldots,{ {U}^{(N)}_{{\mathbf{k}(i_N)}}}], \alpha^{-1}), \end{aligned} \end{aligned} $$
(7.4)

where the conditional distribution of \( {\mathcal {X}}^{ }_{{\mathbf {k}(i_1,i_2,\ldots ,i_N)}}\) given \({ {U}^{(j)}_{{\mathbf {k}}}}\) (1 ≤ j ≤ N) is a Gaussian distribution with mean \([{ {U}^{(1)}_{{\mathbf {k}(i_1)}}}\),\({ {U}^{(2)}_{{\mathbf {k}(i_2)}}}\), …,\({ {U}^{(N)}_{{\mathbf {k}(i_N)}}}]\) and the observation precision α. We also impose independent Gaussian priors on the modes:

$$\displaystyle \begin{aligned} \begin{aligned}{} { {U}^{(j)}_{{\mathbf{k}(i_j)}}} \sim {\mathcal{N}}(\mu_{ {U}^{(j)}_{{\mathbf{k}}}},\varLambda_{ {U}^{(j)}_{{\mathbf{k}}}}^{-1}) \quad i_j = 1...I_j \end{aligned} \end{aligned} $$
(7.5)

where Ij is the dimensionality of the jth mode. Given this, one can estimate the latent features \({ {U}^{(j)}_{{\mathbf {k}}}}\) by maximizing the logarithm of the posterior distribution, \( \log p({ {U}^{(1)}_{{\mathbf {k}}}},{ {U}^{(2)}_{{\mathbf {k}}}}\ldots ,{ {U}^{(N)}_{{\mathbf {k}}}}| {\mathcal {X}}_{\mathbf {k}})\).

One difficulty with the approach, however, is the tuning of the hyper-parameters of the model: α and \(\varTheta _{ {U}^{(j)}_{{\mathbf {k}}}} \equiv \{\mu _{ {U}^{(j)}_{{\mathbf {k}}}},\varLambda _{ {U}^{(j)}_{{\mathbf {k}}}}\}\) for 1 ≤ j ≤ N. [33] notes that one can avoid the difficulty underlying the estimation of these parameters through a fully Bayesian approach, complemented with a sampling-based Markov Chain Monte Carlo (MCMC) method to address the lack of the analytical solution.

7.5.2 Noise-Sensitive Sample Assignment

One crucial piece of information that the basic grid-based decomposition process fails to account for is potentially available knowledge about the distribution of the noise across the input tensor. As discussed earlier, a sub-tensor which is poorly decomposed due to noise may negatively impact decomposition accuracies also for other parts of the tensor. Consequently, it is important to allocate resources to prevent a few noisy sub-tensors from negatively impacting the overall accuracy.

We note that there is a direct relationship between the amount of noise a sub-tensor has and the number of Gibbs samples it requires for accurate decomposition. In fact, the numbers of Gibbs samples allocated to different sub-tensors \( {\mathcal {X}}_{\mathbf {k}}\) in Algorithm 2 do not need to be the same. As we have seen in Sect. 7.5.1, Phase 1 decomposition of each sub-tensor is independent from the others and, thus, the number of Gibbs samples of different sub-tensors can be different. In fact, more samples can provide better accuracy for noisy sub-tensors and this can be used to improve the overall decomposition accuracy for a given number of Gibbs samples. Consequently, given a set of sub-tensors, with different amounts of noise, uniform assignment of the number of samples, \(L = \left (\frac { L_{(\rm total)}}{|{\mathcal {K}}|}\right )\), where L(total) is the total number of samples for the whole tensor and \(|{\mathcal {K}}|\) is the number of sub-tensors, may not be the best choice. In this chapter, we rely on this key observation to help assign Gibbs samples to the various sub-tensors. On the other hand, the number of samples also directly impacts the cost of the probabilistic decomposition process. Therefore, the sample assignment process must be regulated carefully.

7.5.2.1 Naive Option: Noise Density-Based Sample Assignment

Intuitively, the number of samples a noisy sub-tensor, \( {\mathcal {X}}_{\mathbf {k}}\), is allocated should be proportional to the density, ndk, of noise it contains:

$$\displaystyle \begin{aligned} L( {\mathcal{X}}_{\mathbf{k}}) = \lceil \gamma \times nd_{\mathbf{k}} \rceil + L_{\min}, \end{aligned} $$
(7.6)

where \(L_{\min }\) is the minimum number of samples a (non-noisy) tensor of the given size would need for accurate decomposition and γ is a control parameter. Note that the value of γ is selected such that the total number of samples needed is equal to the number, L(total), of samples allocated for the whole tensor:

$$\displaystyle \begin{aligned} L_{(\rm total)} = \sum_{\mathbf{k} \in {\mathcal{K}}} L( {\mathcal{X}}_{\mathbf{k}}). \end{aligned} $$
(7.7)

7.5.2.2 SIG-Based Sample Assignment: S-Strategy

Equations (7.6) and (7.7), above, help allocate samples across sub-tensors based on their noise densities. However, as discussed earlier, inaccuracies in decomposition of one sub-tensor can propagate across the rest of the sub-tensors in Phase 2. Therefore, a better approach would be to consider how errors can propagate across sub-tensors when allocating samples. More specifically, if we could assign a significance score to each sub-tensor, \( {\mathcal {X}}_{\mathbf {k}}\), that takes into account not only its noise density, but also the position of the sub-tensor relative to other sub-tensors, we could use this information to better allocate the Gibbs samples to sub-tensors.

As discussed earlier in Sect. 7.3.3, the sub-tensor impact graph (SIG) of a given tensor can be used for assigning impact scores to each sub-tensor. This process, however, requires (in addition to the given SIG) a seed node set, S ⊆ V , which serves as the context in which scores are assigned: Given the SIG graph, G(V, E), and a set, S ⊆ G(V, E), of seed nodes, the score p[i] of a node vi ∈ G(V, E) is obtained by solving p = (1 − β)A p + β s, where A denotes the transition matrix, β is a parameter controlling the overall importance of the seeds, and s is a seeding vector.

Our intuition is that we can use the sub-tensors with noise as the seeds in the above process. The naive way to create this seeding vector is to set \(\mathbf {s}[i] = \frac {1}{\|S\|}\) if vi ∈ S, and s[i] = 0, otherwise. However, we note that we can do better: given the noise densities (nd) of the sub-tensors, we can create a seeding vector

$$\displaystyle \begin{aligned} \mathbf{s}[\mathbf{k}] = \frac{nd_{\mathbf{k}} } {\sum_{\mathbf{j} \in {\mathcal{K}}} nd_{\mathbf{j}} }, \end{aligned}$$

and then, once the sub-tensor impact scores ( p) are computed, we can associate a noise-sensitive significance score,

$$\displaystyle \begin{aligned}\eta_{\mathbf{k}} = \frac{\mathbf{p}[\mathbf{k}] - \operatorname*{min}_{\mathbf{j}\in{\mathcal{K}}}(\mathbf{p}[\mathbf{j}])} {\operatorname*{max}_{\mathbf{j}\in{\mathcal{K}}}(\mathbf{p}[\mathbf{j}]) - \operatorname*{min}_{\mathbf{j}\in{\mathcal{K}}}(\mathbf{p}[\mathbf{j}])}, \end{aligned}$$

to each sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\). Given this score, we can then rewrite Eq. (7.6) as

$$\displaystyle \begin{aligned} L( {\mathcal{X}}_{\mathbf{k}}) = \lceil \gamma \times \eta_{\mathbf{k}} \rceil + L_{\min}. \end{aligned} $$
(7.8)
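Putting these pieces together, the sketch below (Python, with the same caveats as the earlier sketch) iterates the score equation from Sect. 7.3.3 with the noise density-based seeding vector, min-max normalizes the resulting impact scores into the significance scores \(\eta _{\mathbf {k}}\), and then allocates samples via Eq. (7.8); the power-iteration solver, the iteration count, and the way γ is chosen are illustrative assumptions.

```python
import numpy as np

def sig_sample_assignment(A, nd, L_total, L_min, beta=0.15, iters=100):
    """Sketch of the s-strategy: SIG-based, noise-sensitive sample assignment.

    A  : column-stochastic transition matrix of the sub-tensor impact graph
    nd : 1-D array of noise densities, one entry per sub-tensor
    Returns an integer array with the number of Gibbs samples per sub-tensor.
    """
    nd = np.asarray(nd, dtype=float)
    s = nd / nd.sum()                       # noise density-based seeding vector
    p = np.full(len(s), 1.0 / len(s))
    for _ in range(iters):                  # solve p = (1 - beta) A p + beta s
        p = (1.0 - beta) * (A @ p) + beta * s

    # Noise-sensitive significance scores (min-max normalization of p)
    eta = (p - p.min()) / (p.max() - p.min() + 1e-12)

    def total(gamma):                       # total samples implied by Eq. (7.8)
        return int(np.sum(np.ceil(gamma * eta)) + L_min * len(eta))

    lo, hi = 0.0, 1.0                       # pick gamma to respect the budget
    while total(hi) < L_total:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if total(mid) <= L_total else (lo, mid)
    return (np.ceil(lo * eta) + L_min).astype(int)
```

In this sketch, sub-tensors that are noisy, or whose errors would propagate widely through the SIG, receive the larger shares of the uncommitted samples, while the remaining sub-tensors fall back to \(L_{\min }\).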

7.5.3 Evaluation

In this section, we report experiments that aim to assess the proposed noise-sensitive sample assignment strategy (s-strategy) by comparing the performance of nTD, which leverages this strategy, against GPTD with uniform sample assignment and other naive strategies.

7.5.3.1 Setup

Data Sets

In these experiments, we used one user-centered data set: Ciao [29]. The data is represented in the form of a 5000 × 5000 × 996 (density \(1.7 \times 10^{-6}\)) tensor, with the schema 〈user, item, time〉. In this data, the tensor cells contain rating values between 1 and 5 or (if the rating does not exist) a special “null” symbol.

Noise

In these experiments, a uniform, value-independent type of noise was introduced by modifying the existing ratings in the data set. More specifically, given a uniform noise profile and density, we selected a subset of the existing ratings (ignoring “null” values) and altered their values by selecting a completely new rating (which we refer to as value-independent noise).

Evaluation Criteria

We use the root mean square error (RMSE) as an inaccuracy measure to assess decomposition effectiveness. We also report the decomposition times. Unless otherwise stated, the execution time of the overall process is reported as if the sub-tensor decompositions in Phase 1 and Phase 2 are all executed serially, without leveraging any sub-tensor parallelism. Each experiment was run ten times with different random noise distributions, and averages are reported.

Hardware and Software

We used a quad-core Nehalem node with 12 GB of RAM. All code was run using Matlab R2015b. For conventional CP decomposition, we used MATLAB Tensor Toolbox Version 2.6 [1].

7.5.3.2 Discussion of the Results

We start the discussion of the results by studying the impact of the s-strategy for leveraging noise profiles.

Impact of Leveraging Noise Profiles

In Fig. 7.9, we compare the performance of nTD with noise-sensitive sample assignment (i.e., the s-strategy) against GPTD with uniform sample assignment and the two naive noise adaptations presented in Sects. 7.5.2.1 and 7.5.2.2, respectively. Note that in the scenario considered in this figure, we have 640 total Gibbs samples for 64 sub-tensors, providing on average 10 samples per sub-tensor. In these experiments, we set \(L_{\min }\) to 9 (i.e., very close to this average), thus requiring that 576 (= 64 × 9) samples are uniformly distributed across the sub-tensors; this leaves only 64 samples to be distributed adaptively across the sub-tensors based on their noise profiles and their relationships to other sub-tensors. As we see in this figure, the proposed nTD is able to leverage these 64 uncommitted samples to significantly reduce RMSE relative to GPTD with uniform sample assignment. Moreover, we also see that naive noise adaptations can actually hurt the overall accuracy: nTD-naive biases its sampling toward the noisy blocks and focuses on the centrality of the sub-tensors, and thus performs worse than the uniform assignment. Together, these results show that the proposed s-strategy is highly effective in leveraging rough knowledge about noise distributions to better allocate the Gibbs samples across the tensor.

Fig. 7.9

RMSE and execution time (without sub-tensor parallelism) for nTD with different numbers of noisy sub-tensors (4 × 4 × 4 grid; uniform noise; value-independent noise; noise density 10%; total num. of samples = 640; \(L_{ \min } = 9\); F = 10; max. num. of P2 iterations = 1000). (a) RMSE for Ciao. (b) Time for Ciao

In summary, the proposed sub-tensor impact graphs help allocate Gibbs samples in a way that takes into account how errors due to noise propagate across the whole tensor during the decomposition process.

7.6 Application #3: Personalized Tensor Decomposition (PTD) and Rank Assignment Based on Sub-Tensor Impact Scores

In many clustering applications, the user may have a focus of interest, i.e., a part of the data for which the user needs high accuracy; beyond this area of focus, accuracy may not be as critical. Relying on this observation, in this section, we present a personalized tensor decomposition (PTD) mechanism that accounts for the user’s focus. Intuitively, the personalized tensor decomposition (PTD) algorithm partitions the tensor into multiple regions and then assigns different ranks to different sub-tensors: naturally, the higher the target rank, the more accurate the decomposition of the sub-tensor. However, we note that preserving accuracy for the foci of interest, while relaxing accuracy requirements for the rest of the input tensor, is not a trivial task, especially because loss of accuracy in one region of the tensor may impact accuracies in other tensor regions. Therefore, PTD leverages sub-tensor impact graphs to account for the impact of the accuracy of one region of the tensor on the accuracies of the other regions, under different assumptions about how the impact of inaccuracies propagates along the tensor. In particular, PTD analyzes the sub-tensor impact graph (in the light of the user’s interest) to identify initial decomposition ranks for the sub-tensors in a way that will boost the final decomposition accuracies for the partitions of interest.

7.6.1 Problem Formulation

Let an N-mode tensor \( {\mathcal {X}} \in \mathbb {R}^{I_1\times I_2 \times \ldots \times I_N}\) be partitioned into a set (or grid) of sub-tensors \( {\mathfrak {X}} = \{ {\mathcal {X}}_{\mathbf {k}}\; |\; {\mathbf {k}} \in {\mathcal {K}}\}\), where \(K_i\) indicates the number of partitions along mode-i, the size of the jth partition along mode i is \(I_{j,i}\) (i.e., \(\sum _{j=1}^{K_i} I_{j,i}=I_i\)), and \({\mathcal {K}} = \{ [k_{j_1}, \ldots ,k_{j_i}, \ldots , k_{j_N}]\; |\; 1\leq i \leq N,\; 1 \leq j_i \leq K_i\}\) is the set of sub-tensor indexes. The number, \(\| {\mathfrak {X}}\|\), of partitions (and thus also the number, \(\|{\mathcal {K}}\|\), of partition indexes) is \(\prod _{i=1}^N K_i\). In addition, let \({\mathcal {K}}_{P} \subseteq {\mathcal {K}}\) be the set of sub-tensor indexes that indicate those sub-tensors for which the user requires higher accuracy. Note that, without loss of generality, we assume that the tensor \( {\mathcal {X}}\) is re-ordered before it is partitioned in such a way that the number, \(K_i\), of resulting partitions along each mode-i is minimal, i.e., along each mode, the entries of interest are clustered together to minimize the number of partitions.
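For illustration only, the following Python sketch materializes such a grid of sub-tensors from an N-mode tensor stored as a numpy array; the function name, the splits argument (the per-mode partition sizes \(I_{j,i}\)), and the 1-based index vectors are assumptions made for the example, not the chapter's implementation.

```python
import itertools
import numpy as np

def partition_tensor(X, splits):
    """Partition an N-mode tensor into a grid of sub-tensors (sketch).

    X      : N-dimensional numpy array
    splits : list of lists; splits[i] holds the partition sizes I_{j,i} along
             mode i and must sum to X.shape[i]
    Returns a dict mapping a 1-based index vector k = (k_1, ..., k_N) to the
    corresponding sub-tensor X_k.
    """
    bounds = []                              # start/end offsets along each mode
    for i, sizes in enumerate(splits):
        assert sum(sizes) == X.shape[i], "partition sizes must cover the mode"
        offsets = np.concatenate(([0], np.cumsum(sizes)))
        bounds.append([(offsets[j], offsets[j + 1]) for j in range(len(sizes))])

    sub_tensors = {}
    for combo in itertools.product(*[range(len(b)) for b in bounds]):
        slices = tuple(slice(*bounds[i][j]) for i, j in enumerate(combo))
        sub_tensors[tuple(j + 1 for j in combo)] = X[slices]
    return sub_tensors


# Hypothetical usage: a 4 x 6 x 4 tensor split into a 2 x 2 x 2 grid (8 sub-tensors).
X = np.random.rand(4, 6, 4)
grid = partition_tensor(X, [[2, 2], [3, 3], [2, 2]])
```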

Given the above, let us use \( {\mathcal {X}}_{P}\) as a shorthand to denote the cells of \( {\mathcal {X}}\) collectively covered by the sub-tensors indexed by \({\mathcal {K}}_{P}\). The goal of the personalized tensor decomposition (PTD) is to obtain a personalized (or preference sensitive) decomposition \( {\hat {\mathcal {X}}}\) of \( {\mathcal {X}}\) in such a way that

$$\displaystyle \begin{aligned}accuracy( {\mathcal{X}}_P, {\hat{\mathcal{X}}}_P) > accuracy( {\mathcal{X}}_P, {\tilde{\mathcal{X}}}_P),\end{aligned}$$

where \( {\hat {\mathcal {X}}}_P\) is the reconstruction of the user-selected region from the personalized decomposition \( {\hat {\mathcal {X}}}\) and \( {\tilde {\mathcal {X}}}_P\) is the reconstruction of the same region from a decomposition of \( {\mathcal {X}}\) that is insensitive to the user preference \({\mathcal {K}}_{P}\) (or, equivalently, one obtained with \({\mathcal {K}}_{P} = {\mathcal {K}}\)). Naturally, we also aim for the time to obtain the personalized decomposition \( {\hat {\mathcal {X}}}\) to be less than the time needed to obtain the preference-insensitive decomposition \( {\tilde {\mathcal {X}}}\), and for the personalization to minimally impact the rest of the tensor, i.e.,

$$\displaystyle \begin{aligned}accuracy( {\mathcal{X}}, {\hat{\mathcal{X}}}) \sim accuracy( {\mathcal{X}}, {\tilde{\mathcal{X}}}).\end{aligned}$$

7.6.2 Sub-Tensor Rank Flexibility

Remember from Sect. 7.2.4, where we presented the update rules that block-based tensor decomposition algorithms use for stitching the individual sub-tensor decompositions into a complete decomposition for the whole tensor, that (as visualized in Fig. 7.6) each \( {A}^{(i)}_{(k_i)}\) is maintained incrementally by using, for all 1 ≤ j ≤ N, the current estimates of \( {A}^{(j)}_{(k_j)}\) and the decompositions in \( {\mathfrak U}^{(j)}\), i.e., the F-rank sub-factors of the sub-tensors in \( {\mathfrak X}\) along the different modes of the tensor. A closer look at the update rule for \( {A}^{(i)}_{(k_i)}\) reveals the following observation:

  • Sub-tensor Rank Flexibility: One critical observation is that the above formulation does not require that all sub-tensors in \( {\mathfrak X}\) are decomposed with the same target rank F.

    In fact, as long as one sub-tensor is decomposed with the target rank F, all other sub-tensors can be decomposed with ranks less than F, and we can still obtain full F-rank factors, \(A^{(i)}\), for \( {\mathcal {X}}\).

Algorithm 3 Overview of the PTD-CP process

7.6.3 Rank Assignment for Personalized Tensor Decomposition

The PTD algorithm relies on this sub-tensor rank flexibility property to personalize the block-based (CP) decomposition process described earlier: unlike the basic block-based scheme discussed in the introduction, PTD improves the accuracy of the high-priority sub-tensors (indicated by \({\mathcal {K}}_P\)) by assigning them higher initial decomposition ranks than the rest of the partitions (Algorithm 3). The key difficulty, however, is that one cannot arbitrarily reduce the decomposition ranks of low-priority partitions, because the accuracy in one partition may impact the final decomposition accuracies of other tensor partitions. As outlined in Algorithm 3, the proposed PTD algorithm first constructs a sub-tensor impact graph, G, that accounts for the propagation of inaccuracies along the tensor during a block-based decomposition process. The PTD algorithm then leverages this graph to account for the impact of the initial decomposition inaccuracy of one sub-tensor on the final decomposition accuracy of \( {\mathcal {X}}_P\), i.e., the cells of \( {\mathcal {X}}\) collectively covered by the user’s declaration of interest (i.e., \({\mathcal {K}}_{P}\)).

Intuitively, the initial decomposition rank, \(F_{\mathbf {k}}\), of sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\) will need to reflect the impact of the initial decomposition of the sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\) on the final decomposition of the high-priority sub-tensors, \( {\mathcal {X}}_{\mathbf {\kappa }}\), \(\mathbf {\kappa } \in {\mathcal {K}}_{P}\). This implies that, when picking the decomposition ranks, \(F_{\mathbf {k}}\), we need to measure how inaccuracies propagate within G over a large number of iterations of the alternating least squares (ALS) process. For this purpose, we rely on the sub-tensor impact scores introduced in Sect. 7.3; more specifically, we compute the initial decomposition rank \(F_{\mathbf {k}}\) of \( {\mathcal {X}}_{\mathbf {k}}\) as

$$\displaystyle \begin{aligned} F_{\mathbf{k}} = \left\lceil F \times { \frac{\mathbf{s}[k]}{\max_h\{\mathbf{s}[h]\}}} \right\rceil, \end{aligned}$$

where \(\mathbf {s}[k]\) denotes the sub-tensor impact score of the sub-tensor \( {\mathcal {X}}_{\mathbf {k}}\) in G. Intuitively, this formula sets the initial decomposition rank of the sub-tensor with the highest sub-tensor impact score (i.e., the highest accuracy impact on the set of sub-tensors chosen by the user) to F, whereas the other sub-tensors are assigned progressively smaller ranks (potentially all the way down to 1) based on how far they are from the seed set in the sub-tensor impact graph, G.
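A minimal sketch of this rank-assignment step, assuming the sub-tensor impact scores \(\mathbf {s}[k]\) have already been computed on the SIG with the user's focus as the seed set, is given below; the function name is hypothetical, and the lower bound of 1 simply reflects that ranks may drop all the way down to 1.

```python
import math

def assign_ranks(impact_scores, F):
    """Personalized initial rank assignment (sketch of the formula above).

    impact_scores : dict mapping a sub-tensor index k to its impact score s[k]
    F             : target CP decomposition rank for the whole tensor
    Returns a dict k -> initial decomposition rank F_k.
    """
    s_max = max(impact_scores.values())
    return {k: max(1, math.ceil(F * s_k / s_max))
            for k, s_k in impact_scores.items()}


# Hypothetical usage with three sub-tensors and target rank F = 10.
scores = {(1, 1, 1): 0.92, (1, 1, 2): 0.40, (2, 1, 1): 0.05}
print(assign_ranks(scores, F=10))   # -> {(1, 1, 1): 10, (1, 1, 2): 5, (2, 1, 1): 1}
```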

7.6.4 Evaluation

In this section, we report sample results that aim to assess the effectiveness of the proposed personalized tensor decomposition (PTD) approach in helping preserve the tensor decomposition accuracy at parts of the tensor that are high-priority for the users.

7.6.4.1 Setup

Data Set

In the experiments reported in this chapter, we used the Ciao [29] data set, represented in the form of a 167 × 967 × 18 (density \(2.2 \times 10^{-4}\)) tensor, with the schema 〈user, item, category〉. In these experiments, we assume that the input tensor is partitioned into 8 (= 2 × 2 × 2) sub-tensors according to the scenarios shown in Table 7.1, and two randomly selected sub-tensors are marked as more important than the rest.

Table 7.1 Various tensor partitioning scenarios considered in the experiments: the percentages are the sizes of the partitions (relative to the overall size of the mode) along each mode

Decomposition Strategies

We considered five decomposition strategies: not personalized (NP), uniform edge weights (UNI), surface of interaction-based edge weights (SURF), value alignment-based edge weights (VAL), and combined edge weights (COMB). The target CP decomposition rank, F, is set to 10.

Evaluation Criteria

We use the measure reported in Sect. 7.2.2.2 to assess decomposition accuracy, and we also report execution times. In particular, we report accuracies both for the user’s area of focus and for the whole tensor.

Hardware and Software

We ran the experiments reported in this section on a quad-core Intel(R) Core(TM) i5-2400 CPU @ 3.10 GHz machine with 8 GB of RAM. All code was implemented in Matlab and run using Matlab R2015b. For CP decomposition, we used MATLAB Tensor Toolbox Version 2.6 [1].

7.6.4.2 Discussion of the Results

Sub-tensor impact scores help take into account how inaccuracies in the decomposition propagate into high- and low-priority regions. We therefore expect that allocating resources using sub-tensor impact graphs should provide better accuracies for high-priority regions. As we see in Fig. 7.10, as expected, the PTD algorithms boost accuracy for the high-priority partitions in the user’s focus, especially where the partitions are of heterogeneous sizes (as is likely to be the case in real situations). While, as would be expected, the PTD algorithms have some impact on the overall decomposition accuracy for the whole tensor, this is more than compensated for by the gains in accuracy in the high-priority areas. Moreover, the figure also shows that the gains in accuracy in the high-priority partitions within the user’s focus come with significant gains in execution times for the decomposition process.

Fig. 7.10

Experiment results with two partitions in focus. (a) Accuracy of the region of focus. (b) Whole tensor accuracy. (c) Decomposition time

The figure also establishes that, in terms of both accuracy and execution time gains, we can order the various edge weighting strategies as follows: UNI (least effective), VAL, SURF, and COMB (most effective). In other words, as we argued in Sect. 7.3.2.4, the most effective way to account for the propagation of inaccuracies is to combine the surface of interaction-based and value alignment-based edge weights into a single weight that accounts for both the shapes of the sub-tensors and their value alignments.

7.7 Conclusions

The computational complexity of tensor decomposition is a major bottleneck in many applications. Block-based tensor decomposition is employed to efficiently conduct tensor decomposition for large-scale data analysis. However, we need a smart strategy to account for the relationships among these sub-tensors (blocks). Therefore, we proposed the sub-tensor impact graph (SIG) to account for the propagation of impacts among the sub-tensors during the decomposition process. We then presented three applications of SIGs to efficiently address challenges in personalized tensor analysis, incremental tensor analysis, and noisy tensor analysis. Experimental results on real data sets show that SIGs can improve the performance of tensor analysis in these three applications in terms of both execution time and decomposition accuracy.

Finally, we would like to note that here we presented three distinct uses of sub-tensor impact graphs for three distinct challenges (dynamic data, noisy data, and personalization). In practice, there is no reason these approaches cannot be combined to tackle more complex scenarios, such as personalized clustering and analysis over dynamically evolving data sets. We leave the study of these more complex scenarios as future work.