1 Introduction

Brain diseases affect millions of people worldwide and pose significant challenges to healthcare systems. Functional magnetic resonance imaging (fMRI) is a key brain imaging technique for the diagnosis, monitoring, and treatment of brain diseases. Beyond healthcare, fMRI is also an indispensable tool in neuroscience studies [5].

Facing the challenging “large p (brain voxels), small n (samples)” problem in brain imaging, sparse learning models [9, 11] are attractive for fMRI data because they can reveal the direct dependency of a response (e.g., a diagnosis outcome or a brain state) on a small portion of the features, i.e., brain voxels. Recently, tensor-based sparse multilinear regression methods have emerged as a promising direction, where a tensor is a multidimensional array. For example, a 3D fMRI volume can be viewed as a 3D tensor, i.e., a third-order tensor.

Tensor-based sparse multilinear regression models relate a feature tensor to a univariate response via a coefficient tensor, generalizing Lasso-based models [4] to tensor data. Regularization that promotes sparsity and low rankness is likewise generalized to the coefficient tensor. For example, the regularized multilinear regression and selection (Remurs) model [12] incorporates a sparse regularization term, via an \(\ell _1\) norm, and a Tucker rank-minimization term [13], via a summation of the nuclear norms (SNN) of unfolded matrices, with application to task-based fMRI data. A Tubal Tensor Nuclear Norm (TNN) [14] has recently been proposed based on the tubal rank, which originates from the tensor singular value decomposition (t-SVD) [6]. In this work, we study sparse multilinear regression under the t-SVD framework for fMRI classification. The success of TNN has so far been limited to unsupervised learning settings such as completion/recovery and robust PCA [2]. To our knowledge, TNN has not yet been studied in a supervised setting, such as multilinear regression for predicting a response from a set of tensor-structured samples. Moreover, the targeted fMRI classification tasks face the challenge of small sample size (relative to the feature dimension).

Our contributions are twofold: (1) We propose a Sparse tubal-regularized multilinear regression (Sturm) method that incorporates TNN regularization and a sparsity regularization on the coefficient tensor; (2) We evaluate Sturm and related methods on both resting-state and task-based fMRI classification problems, instead of only one of them as in previous work [3].

2 Method

Notations. We use lowercase, bold lowercase, bold uppercase, and calligraphic uppercase letters to denote scalars, vectors, matrices, and tensors, respectively. A third-order tensor \(\mathcal {A}\in \mathbb {R}^{I_1\times I_2 \times I_3}\) is addressed by three indices \(\{i_n\}\), \(n=1, 2, 3\), where each \(i_n\) addresses the \(n\)th mode of \(\mathcal {A}\). For example, in a 3D fMRI volume, \(I_1\), \(I_2\), and \(I_3\) could indicate the sagittal, coronal, and axial dimensions, respectively.

2.1 Tubal Rank and Tubal Tensor Nuclear Norm (TNN)

Tubal rank is derived from the t-SVD as illustrated in Fig. 1, where \(\mathcal {U}\in \mathbb {R}^{I_1\times I_1\times I_3}\), \(\mathcal {V}\in \mathbb {R}^{I_2\times I_2\times I_3}\) are orthogonal tensors, and \(\mathcal {S}\in \mathbb {R}^{I_1\times I_2\times I_3}\) is an f-diagonal tensor. The number of nonzero singular tubes (along the diagonal direction) of \(\mathcal {S}\) is defined as the tubal rank.

Fig. 1. Illustration of the t-SVD \(\mathcal {A}= \mathcal {U}*\mathcal {S}*\mathcal {V}^{\top }\), where \(*\) denotes the tensor product [6], assuming \(I_1>I_2\). \(\mathcal {U}\) and \(\mathcal {V}\) are orthogonal tensors and \(\mathcal {S}\) is an f-diagonal tensor. The tubal rank is the number of nonzero singular tubes of \(\mathcal {S}\); in this example, the tubal rank is 6.

t-SVD can be computed efficiently via the discrete Fourier transform (DFT). Denote the Fourier-transformed tensor of \(\mathcal {A}\) as \(\mathcal {A}_\mathscr {F}= \mathtt{fft}(\mathcal {A},[],3)\), i.e., the DFT applied along the third mode. As a convex relaxation of the tubal rank, the TNN of \(\mathcal {A}\in \mathbb {R}^{I_1\times I_2\times I_3}\) is defined as \( \Vert \mathcal {A}\Vert _{TNN} = \frac{1}{I_3}\sum _{i_3=1}^{I_3} \Vert \mathbf {A}_\mathscr {F}^{(i_3)}\Vert _*,\) where \(\mathbf {A}_\mathscr {F}^{(i_3)}\) is the \(i_3\)th frontal slice of \(\mathcal {A}_\mathscr {F}\) and \(\Vert \cdot \Vert _* \) denotes the matrix nuclear norm. More about t-SVD and TNN can be found in [14].
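For concreteness, the following NumPy sketch (our illustration of the definition above, not the authors' implementation) computes the TNN by applying the DFT along the third mode and averaging the nuclear norms of the resulting frontal slices:

```python
import numpy as np

def tubal_tensor_nuclear_norm(A):
    """Sketch of the TNN of a third-order tensor A (I1 x I2 x I3):
    average of the nuclear norms of the frontal slices of fft(A) along mode 3."""
    A_f = np.fft.fft(A, axis=2)  # DFT along the third mode
    I3 = A.shape[2]
    tnn = 0.0
    for i3 in range(I3):
        # nuclear norm = sum of singular values of the i3-th frontal slice
        tnn += np.linalg.norm(A_f[:, :, i3], ord='nuc')
    return tnn / I3
```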

2.2 The Sturm Model

Regularized Multilinear Regression Model. This approach trains a model from M pairs of a feature tensor and its associated response label, \((\mathcal {X}_m \in \mathbb {R}^{I_1\times I_2\times I_3}, y_m)\), \(m = 1,...,M\), relating them via a coefficient tensor \(\mathcal {W}\in \mathbb {R}^{I_1\times I_2\times I_3}\) as

$$\begin{aligned} \min _{\mathcal {W}} \frac{1}{M} \sum _{m=1}^M L(\langle \mathcal {X}_m,\mathcal {W}\rangle , y_m) + \lambda \varOmega (\mathcal {W}), \end{aligned}$$
(1)

where \(L(\cdot )\) is a loss function, \(\varOmega (\cdot )\) is a regularization term, \(\lambda \) is a balancing hyperparameter, and \(\langle \mathcal {X},\mathcal {W}\rangle \) denotes the inner product (a.k.a. the scalar product) of two tensors of the same size:

$$\begin{aligned} \langle \mathcal {X}, \mathcal {W}\rangle :=\sum _{i_1}\sum _{i_2}\sum _{i_3} \mathcal {X}(i_1,i_2,i_3)\cdot \mathcal {W}(i_1,i_2,i_3). \end{aligned}$$
(2)
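As a minimal illustration (our own sketch under an assumed array layout, not code from the paper), the inner product in Eq. (2) reduces to an element-wise product followed by a full summation:

```python
import numpy as np

def tensor_inner_product(X, W):
    """<X, W> of Eq. (2): element-wise product summed over all entries."""
    return np.sum(X * W)

# The model's prediction for a sample X_m is then simply tensor_inner_product(X_m, W).
```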

The State-of-the-Art Model. The Remurs [12] model has been successfully applied to task-based fMRI data. It uses a conventional least-squares loss function and assumes \(\mathcal {W}\) to be both sparse and low-rank. The sparsity of \(\mathcal {W}\) is regularized by an \(\ell _1\) norm and the low rankness by the SNN. However, the SNN requires unfolding \(\mathcal {W}\) into matrices, which is susceptible to losing higher-order structural information. Moreover, it has been pointed out in [10] that the SNN is not a tight convex relaxation of its target rank.

A New Model. The limitation of the SNN motivates us to propose the Sparse tubal-regularized multilinear regression (Sturm) model, which replaces the SNN in Remurs with the TNN. This leads to the following objective function:

$$\begin{aligned} \min _{\mathcal {W}} \frac{1}{2} \sum _{m=1}^M (y_m-\langle \mathcal {X}_m,\mathcal {W}\rangle )^2 + \tau \Vert \mathcal {W}\Vert _{TNN} + \gamma \Vert \mathcal {W}\Vert _{1}, \end{aligned}$$
(3)

where \(\tau \) and \(\gamma \) are hyperparameters, and \(\Vert \mathcal {W}\Vert _{1}\) is the \(\ell _1\) norm of tensor \(\mathcal {W}\), defined as \(\Vert \mathcal {W}\Vert _1 = \sum _{i_1}\sum _{i_2}\sum _{i_3} \left| \mathcal {W}(i_1,i_2,i_3)\right| ,\) which is equivalent to the \(\ell _1\) norm of its vectorized representation \(\mathbf {w}\). Here, the TNN regularization term \(\Vert \mathcal {W}\Vert _{TNN}\) enforces low tubal rank in \(\mathcal {W}\).

Optimization Algorithm for Sturm. ADMM [1] is a standard solver for Problem (3). Thus, we derive an ADMM algorithm to optimize the Sturm objective function. We begin by introducing two auxiliary variables, \(\mathcal {A}\) and \(\mathcal {B}\), to disentangle the TNN and \(\ell _1\)-norm regularization terms:

$$\begin{aligned} \min _{\mathcal {A},\mathcal {B},\mathcal {W}} \frac{1}{2} \sum _{m=1}^M (y_m-\langle \mathcal {X}_m,\mathcal {A}\rangle )^2 + \tau \Vert \mathcal {B}\Vert _{TNN} + \gamma \Vert \mathcal {W}\Vert _{1}, \quad s.t.\ \mathcal {A}= \mathcal {W}\ \text {and } \mathcal {B}= \mathcal {W}. \end{aligned}$$
(4)

Then, we introduce two Lagrangian dual variables, \(\mathcal {P}\) (for the constraint on \(\mathcal {A}\)) and \(\mathcal {Q}\) (for the constraint on \(\mathcal {B}\)). With a penalty parameter \(\rho \), the augmented Lagrangian becomes

$$\begin{aligned} \begin{aligned}&L_{\rho }(\mathcal {A},\mathcal {B},\mathcal {W},\mathcal {P},\mathcal {Q}) = \frac{1}{2} \sum _{m=1}^M (y_m-\langle \mathcal {X}_m,\mathcal {A}\rangle )^2 + \tau \Vert \mathcal {B}\Vert _{TNN} + \gamma \Vert \mathcal {W}\Vert _{1}\\&+ \Big \langle \mathcal {P}, \mathcal {A}-\mathcal {W}\Big \rangle + \frac{\rho }{2}\Vert \mathcal {A}-\mathcal {W}\Vert _F^2 + \Big \langle \mathcal {Q}, \mathcal {B}-\mathcal {W}\Big \rangle + \frac{\rho }{2}\Vert \mathcal {B}-\mathcal {W}\Vert _F^2, \end{aligned} \end{aligned}$$
(5)

where \(\Vert \cdot \Vert _F\) is the Frobenius norm, defined as \(\Vert \mathcal {T}\Vert _F = \sqrt{\langle \mathcal {T}, \mathcal {T}\rangle }\) using Eq. (2). We further introduce two scaled dual variables, \(\mathcal {P'} = \frac{1}{\rho }\mathcal {P}\) and \(\mathcal {Q'} = \frac{1}{\rho }\mathcal {Q}\), for notational convenience. Next, we derive the updates from iteration k to \(k+1\) by minimizing over one variable with all the other variables fixed.

Updating \(\mathcal {A}^{k+1}\):

$$\begin{aligned} \mathcal {A}^{k+1} = \arg \min _{\mathcal {A}} \frac{1}{2} \sum _{m=1}^M (y_m-\langle \mathcal {X}_m,\mathcal {A}\rangle )^2 + \frac{\rho }{2} \Vert \mathcal {A}-\mathcal {W}^k + \mathcal {P'}^{k}\Vert _F^2. \end{aligned}$$
(6)

This can be rewritten as a linear-quadratic objective function by vectorizing all the tensors. Specifically, let \(\mathbf {a}=\mathtt{vec}(\mathcal {A})\), \(\mathbf {w}^k=\mathtt{vec}(\mathcal {W}^k)\), \(\mathbf {p'}^k=\mathtt{vec}(\mathcal {P'} ^k)\), \(\mathbf {y}=[y_1 \cdots y_M]^\top \), \(\mathbf {x}_m=\mathtt{vec}(\mathcal {X}_m)\), and \(\mathbf {X}=[\mathbf {x}_1 \cdots \mathbf {x}_M]^\top \). This yields an equivalent objective function with the following closed-form solution:

$$\begin{aligned} \mathbf {a}^{k+1} = (\mathbf {X}^\top \mathbf {X}+\rho \mathbf {I})^{-1}(\mathbf {X}^\top \mathbf {y}+ \rho (\mathbf {w}^k - \mathbf {p'}^{k})), \end{aligned}$$
(7)

where \(\mathbf {I}\) is an identity matrix. \(\mathcal {A}^{k+1}\) is obtained by folding (reshaping) \(\mathbf {a}^{k+1}\) into a third-order tensor, denoted as \(\mathcal {A}^{k+1} = \mathtt{tensor}_{3}(\mathbf {a}^{k+1})\).
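A minimal NumPy sketch of this update, assuming \(\mathbf {X}\) is the \(M\times I_1 I_2 I_3\) matrix of vectorized feature tensors (our illustration, not the authors' implementation), is given below. For full-resolution fMRI dimensions one would avoid forming the \(d\times d\) system explicitly, e.g., via the Woodbury identity, since \(M\ll d\).

```python
import numpy as np

def update_A(X, y, W, P_scaled, rho):
    """Sketch of the A-update in Eq. (7). X is M x d with d = I1*I2*I3;
    W and P_scaled are third-order tensors of the coefficient shape."""
    shape = W.shape
    w = W.reshape(-1)
    p = P_scaled.reshape(-1)
    d = w.size
    # Solve (X^T X + rho I) a = X^T y + rho (w - p); for large d, prefer the
    # Woodbury identity instead of forming the d x d matrix explicitly.
    lhs = X.T @ X + rho * np.eye(d)
    rhs = X.T @ y + rho * (w - p)
    a = np.linalg.solve(lhs, rhs)
    return a.reshape(shape)  # fold back into a third-order tensor
```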

Updating \(\mathcal {B}^{k+1}\):

$$\begin{aligned} \mathcal {B}^{k+1} = \arg \!\min _{\mathcal {B}} \tau \Vert \mathcal {B}\Vert _{TNN} + \frac{\rho }{2} \Vert \mathcal {B}\!-\!\mathcal {W}^k+\!\mathcal {Q'}^{k}\Vert _F^2 =\mathtt{prox}_{\frac{\tau }{\rho }\Vert \cdot \Vert _{TNN}} (\mathcal {W}^k\!-\!\mathcal {Q'}^{k}). \end{aligned}$$
(8)

The proximal operator for the TNN at tensor \(\mathcal {T}\) with parameter \(\mu \) is denoted by \(\mathtt{prox}_{\mu \Vert \cdot \Vert _{TNN}} (\mathcal {T})\) and defined as

\( \mathtt{prox}_{\mu \Vert \cdot \Vert _{TNN}} (\mathcal {T}) := \arg \min _{\mathcal {W}} \mu \Vert \mathcal {W}\Vert _{TNN} + \frac{1}{2} \Vert \mathcal {W}-\mathcal {T}\Vert _F^2\). More details of the TNN proximal operator can be found in [14].
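A sketch of this operator, assuming the TNN normalization of Sect. 2.1 (under which the per-slice threshold works out to \(\mu \); see [14] for the exact operator), applies singular value thresholding slice-wise in the Fourier domain:

```python
import numpy as np

def svt(M, mu):
    """Singular value thresholding of a (possibly complex) matrix."""
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - mu, 0.0)
    return (U * s) @ Vh

def prox_tnn(T, mu):
    """Sketch of the TNN proximal operator: threshold each frontal slice
    of fft(T) along the third mode, then transform back."""
    T_f = np.fft.fft(T, axis=2)
    W_f = np.empty_like(T_f)
    for i3 in range(T.shape[2]):
        W_f[:, :, i3] = svt(T_f[:, :, i3], mu)
    # The result is real (up to numerical precision) for real input T.
    return np.real(np.fft.ifft(W_f, axis=2))
```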

Updating \(\mathcal {W}^{k+1}\):

$$\begin{aligned} \mathcal {W}^{k+1} = \mathtt{prox}_{\frac{\gamma }{2\rho }\Vert \cdot \Vert _{1}} \left( \frac{\mathcal {A}^{k+1}+\mathcal {P'}^{k} + \mathcal {B}^{k+1}+\mathcal {Q'}^{k}}{2} \right) . \end{aligned}$$
(9)

It can be solved by applying the proximal operator of the \(\ell _1\) norm with parameter \(\frac{\gamma }{2\rho }\), which is simply element-wise soft-thresholding, i.e., \(\mathtt{prox}_{\mu \Vert \cdot \Vert _1}(\mathcal {T}) = \mathrm{sign}(\mathcal {T})\,(|\mathcal {T}|-\mu )_{+}\).
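A minimal sketch of this step (our illustration, not the authors' code), reusing the soft-thresholding formula above:

```python
import numpy as np

def soft_threshold(T, mu):
    """Element-wise soft-thresholding: sign(T) * max(|T| - mu, 0)."""
    return np.sign(T) * np.maximum(np.abs(T) - mu, 0.0)

def update_W(A, B, P_scaled, Q_scaled, gamma, rho):
    """Sketch of the W-update in Eq. (9): soft-threshold the average of the
    two shifted auxiliary variables with parameter gamma / (2 * rho)."""
    C = (A + P_scaled + B + Q_scaled) / 2.0
    return soft_threshold(C, gamma / (2.0 * rho))
```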

Updating the scaled dual variables \(\mathcal {P'}^{k+1}\) and \(\mathcal {Q'}^{k+1}\):

$$\begin{aligned} \mathcal {P'}^{k+1} = \mathcal {P'}^{k} + \mathcal {A}^{k+1}-\mathcal {W}^{k+1}, \text { and } \mathcal {Q'}^{k+1} = \mathcal {Q'}^{k} + \mathcal {B}^{k+1}-\mathcal {W}^{k+1}. \end{aligned}$$
(10)

Optimization Algorithm Analysis. Sturm is an instance of the ADMM algorithm applied to a linearly constrained convex optimization problem, and is therefore guaranteed to converge, requiring \(O(1/\epsilon )\) iterations to reach an \(\epsilon \)-accurate solution [1].
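Putting the updates together, one possible Sturm ADMM loop could look like the sketch below. It reuses the hypothetical helpers update_A, prox_tnn, and update_W from the earlier sketches, and the stopping rule is our own simple assumption rather than the paper's exact criterion.

```python
import numpy as np

def sturm_admm(X, y, shape, tau, gamma, rho=1.0, max_iter=200, tol=1e-6):
    """Illustrative Sturm solver: alternate the A-, B-, W-, and dual updates
    of Eqs. (7)-(10). X is M x (I1*I2*I3); `shape` = (I1, I2, I3)."""
    W = np.zeros(shape)
    A = np.zeros(shape); B = np.zeros(shape)
    P = np.zeros(shape); Q = np.zeros(shape)      # scaled dual variables
    for _ in range(max_iter):
        A = update_A(X, y, W, P, rho)             # Eq. (7)
        B = prox_tnn(W - Q, tau / rho)            # Eq. (8)
        W_new = update_W(A, B, P, Q, gamma, rho)  # Eq. (9)
        P = P + A - W_new                         # Eq. (10)
        Q = Q + B - W_new
        # Simple relative-change stopping rule (an assumption, not the paper's).
        if np.linalg.norm(W_new - W) <= tol * max(1.0, np.linalg.norm(W)):
            W = W_new
            break
        W = W_new
    return W
```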

3 Experiments and Discussion

3.1 Classification Problems and Datasets

Resting-State fMRI for Disease Diagnosis. We use two freely available datasets. Rest 1 – ABIDE\( _{NYU \& UM}\): the two largest subsets, NYU and UM, from the Autism Brain Imaging Data Exchange (ABIDE), which together consist of 101 patients with autism spectrum disorder (ASD) and 131 healthy control subjects. Rest 2 – ADHD-200\(_{NYU}\): the NYU subset from the Attention Deficit Hyperactivity Disorder (ADHD) 200 dataset, with 118 ADHD patients and 98 healthy controls. The raw 4D fMRI data is reduced to 3D by taking either the average or the amplitude of low-frequency fluctuation of the voxel values along the time dimension. We perform experiments on both and report the best results.

Task-Based fMRI for Neural Decoding. We consider four datasets from OpenfMRI in two binary classification problems. Task 1 – Balloon vs Mixed Gamble: two gamble-related datasets with 64 subjects in total; and Task 2 – Simon vs Flanker: two recognition-and-response-related tasks with 94 subjects in total. The OpenfMRI data is processed with a standard template following [8] to obtain the 3D statistical parametric maps (SPMs) for each brain condition.

3.2 Algorithms and Evaluation Settings

Algorithms. Sturm and Sturm + SVM (support vector machine) are compared against the following four algorithms and three additional algorithms combined with SVM.

  • SVM: a linear SVM is chosen for both speed and accuracy considerations.

  • Lasso [4]: a linear regression method with the \(\ell _1\) norm regularization.

  • Elastic Net (ENet) [15]: a linear regression method with \(\ell _1\) and \(\ell _2\) norm regularization.

  • Remurs [12]: a multilinear regression model with \(\ell _1\) norm and Tucker rank-based SNN regularization.

SVM, Lasso, and ENet take vectorized fMRI data as input, while Remurs directly takes 3D fMRI tensors as input. In addition, Lasso, ENet, Remurs, and Sturm can also be used for feature selection. Hence, we can use the voxels selected by each of these algorithms as input to an SVM for classification, i.e., Lasso + SVM, ENet + SVM, Remurs + SVM, and Sturm + SVM.

Model Hyper-parameter Tuning. For Sturm, we follow the Remurs default setting [12] to set \(\rho \) to 1 and use the same set \(\{10^{-3}, 5\times 10^{-3}, 10^{-2}, \dots , 5\times 10^2, 10^3\}\) for \(\tau \) and \(\gamma \), additionally scaling the first term in Eq. (3) by a factor \(\alpha = \sqrt{\max (I_1, I_2)\times I_3}\) to better balance the scales of the loss function and the regularization terms [7]. Ten-fold cross-validation is applied for tuning hyper-parameters in all the algorithms.

Image Resizing. To improve computational efficiency and alleviate the small-sample-size problem (and overfitting), the input 3D tensors are further resized by a factor \(\beta \) chosen from \(\{0.3, 0.5, 0.7\}\). The best accuracy over the three sizes is reported for each algorithm in this paper.

Feature Selection. In Lasso + SVM, ENet + SVM, Remurs + SVM, and Sturm + SVM, we rank the selected features by their associated absolute values in \(\mathcal {W}\) in descending order and feed the top \(\eta \%\) of the features to the SVM. We study five values of \(\eta \): \(\{1, 5, 10, 50, 100\}\), and report the best accuracy for each algorithm.

Evaluation Metric and Method. The classification accuracy is our primary evaluation metric, and we also examine the sparsity of the obtained solutions for all algorithms except SVM. The sparsity is calculated as the ratio of the number of zeros in the output coefficient tensor \(\mathcal {W}\) to its total size \(I_1\times I_2 \times I_3\). In general, higher sparsity implies better interpretability [4].

Table 1. Classification accuracy (mean ± standard deviation in %). Rest 1 and Rest 2 denote the two disease diagnosis problems on ABIDE\(_{NYU \& UM}\) and ADHD-200, respectively. Task 1 and Task 2 denote the two neural decoding problems on OpenfMRI datasets, Balloon vs Mixed Gamble and Simon vs Flanker, respectively. The best accuracy among all compared algorithms in each column is highlighted in bold and the second best is underlined.
Table 2. Sparsity (mean ± standard deviation) for respective results in Table 1 with the best and second best highlighted.

3.3 Result Discussion and Visualization

Classification Accuracy. Table 1 shows the classification accuracy of all algorithms. Over the two resting-state problems, Sturm + SVM achieves the highest accuracy of 65.45%, and Lasso is the second best (63.00%). On the two task-based problems, Sturm outperforms all other algorithms with an accuracy of 88.00%, and Lasso is again the second best (86.30%). One interesting observation is that, on the task-based classification problems, Sturm + SVM shows a significant drop in accuracy compared with Sturm alone, and Lasso + SVM, ENet + SVM, and Remurs + SVM all have lower accuracy than their counterparts without SVM. Overall, Sturm still achieves the highest accuracy.

Model Sparsity. Table 2 presents the corresponding sparsity values for all algorithms except SVM, which uses all features and therefore has zero sparsity. Over the two resting-state classification problems, Sturm + SVM and Sturm are the top two algorithms, with sparsity of 0.93 and 0.86, respectively. Notably, Lasso and ENet fail to select a sparse solution via cross-validation on Rest 2. Over the two task-based problems, both ENet and ENet + SVM have the best sparsity (0.96); however, both have much lower accuracy than Sturm and Remurs. Over all the problems, Sturm + SVM obtains the sparsest model with an overall sparsity of 0.82, while Sturm and Remurs have almost identical sparsity of 0.76.

Fig. 2. Weight maps on Task Simon vs Flanker. Sturm and Remurs select clear regions, which are easier for further analysis than those of Lasso and ENet.

Weight Map Visualization. Take Task Simon vs Flanker as an example. The voxels selected from the top 5% (4514 voxels, ranked by absolute weight) of the full-brain voxels are visualized on two brain slices in Fig. 2, using the best-performing cross-validated parameters of Lasso, ENet, Remurs, and Sturm. Figure 2 clearly shows that Sturm and Remurs select more spatially connected voxels than Lasso or ENet does, resulting in more robust and more easily interpreted models for further analysis. Both Sturm and Remurs choose voxels from highly consistent regions, while Sturm produces weight maps with slightly higher visual contrast than Remurs does.

4 Conclusion

We proposed Sturm, a multilinear regression model that combines tubal tensor nuclear norm (TNN) regularization with sparsity regularization on the coefficient tensor. Our experiments demonstrate the superior overall performance of Sturm (and, in some cases, Sturm + SVM) over competing methods in terms of accuracy, sparsity, and weight maps. Thus, it is promising to use TNN and tubal-rank regularization in supervised settings for fMRI analysis.