1 Introduction

The availability of large datasets in real-world applications poses significant challenges in optimization and machine learning. These massive datasets are often referred to as Big Data, as they consist of very large numbers of data samples as well as features. Feature selection plays a pivotal role in the analysis of such data, as it enables the extraction of salient information on which to base decisions. Because Big Data are very high-dimensional, this step reduces the likelihood of model overfitting and the computational complexity of decision models. Feature selection is the process of selecting a subset of the original features according to certain criteria [59]. Not only does feature selection reduce the dimensionality of the data, but it also increases the signal-to-noise ratio by removing irrelevant, redundant or noisy features, which in turn improves the performance of decision models in terms of prediction accuracy, result interpretability and computational run-time.

Feature selection is an optimization problem by nature. Its objective is to find the optimal subset of features that achieves the best performance on some criterion (e.g., prediction accuracy). If the number of original features is \(p\), the number of possible subsets is \(2^p-1\). Even if the number of features to be selected, \(k\), is known, there are still \(\binom{p}{k}\) subsets of features. Generally speaking, feature selection can be formulated as a mathematical program with \(p\) binary variables, each indicating whether a feature is selected. The criteria used to select the features may be modeled as an objective function as well as included as knapsack-type selection constraints. Thus, the feature selection problem is generally NP-hard and cannot be solved in polynomial time [1]. The problem becomes even harder when the number of features far exceeds the number of observations (data instances). Given that \(n\) is the number of observations, such a problem is often called the “\(n\ll p\)” problem.
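
To make this combinatorial explosion concrete, the following short Python sketch (illustrative only) counts the candidate subsets for a modest \(p\):

```python
from math import comb

p = 50                # a modest number of original features
print(2 ** p - 1)     # all non-empty subsets: 1125899906842623 (~1.1e15)
print(comb(p, 10))    # subsets of exactly k = 10 features: 10272278170 (~1.0e10)
```

Even for \(p=50\), exhaustive enumeration is hopeless, which motivates the optimization-based formulations considered in this paper.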

In this paper, we focus on an application of feature selection in neuroimaging. Feature selection is extremely important in neuroimaging because the features correspond to anatomical region(s), allowing inference about which brain structures are involved in cognitive processes. In addition, there are systematic sources of overfitting that need to be mitigated to allow for scientifically meaningful generalizability of classification models. Thus, selected features have real-world meaning and offer interpretability when reconstructing classification models. Multi-voxel pattern analysis (MVPA) of functional magnetic resonance imaging (fMRI) data will be the main case study in this paper. MVPA is used to study cognitive processes measured by fMRI by ascertaining where and how information is encoded in the brain. A main focus of MVPA is to classify or “decode” different cognitive states based on patterns of neural activity measured in a feature subset of image voxels. By the nature of the functional organization of the brain, only some fMRI voxels will be relevant for decoding. The remaining voxels will be uninformative for the particular cognitive task, with their signal variance for practical purposes reflecting noise. Using all the voxels in a classification model would lead to overfitting and result in poor generalization. Thus, feature selection is key to building an accurate and robust classification model. Because of the duration and economic constraints of fMRI acquisition, most fMRI studies include relatively few observations (e.g., \(n < 100\)). Meanwhile, the number of voxels (also referred to as features) is comparatively very large (e.g., \(p>10{,}000\)). Thus, MVPA is a classic “\(n\ll p\)” feature selection problem.

The fMRI signal is inherently multivariate, reflecting spatially distributed neural processing captured in the activity pattern across multiple voxels. Successful interrogation of cognitive representations requires joint assessment of this activity. While many feature selection algorithms have been proposed in the literature, certain approaches are better suited to fMRI. Here we focus on an embedded feature selection framework, which includes all features in an integrated feature selection and classification model. In such a framework, sparsity is enforced in a classification model that is trained to maximize classification accuracy while minimizing the number of selected features. This sparse optimization for regularization (or sparse regularization) is very important for MVPA because feature selection allows for functional localization of cognitive processes, with sparser feature selection providing more concise localization. In this paper we focus on logistic regression (LR) with sparse regularization as a supervised feature selection and classification framework. Our contribution is to introduce, employ and evaluate the embedded feature selection framework in the application of MVPA. The framework provides an alternative approach to selecting features while simultaneously performing classification. Logistic regression is used because its linear model offers better interpretability in cognitive neuroscience. Three types of regularization are employed: ridge, lasso and elastic net penalties.

The remainder of the paper is organized as follows. In Sect. 2, we provide the background of feature selection and more details of MVPA. In Sect. 3, we present the optimization formulation of logistic regression with various types of penalty. In Sect. 4, the details of our computational framework including solution approaches, cross-validation and parameter selection procedure are given. We present the datasets and the experimental results in Sect. 5. We conclude the study in Sect. 6.

2 Background

2.1 Feature selection

The curse of dimensionality poses challenges to learning algorithms when dealing with high-dimensional data, in which the number of features is large and only a few are informative. In such a situation, learning algorithms are likely to overfit classification models, and the learned models are less generalizable. Feature selection is a method to identify relevant features in order to improve classification accuracy and facilitate more stable and interpretable results [15, 30, 45]. Feature selection algorithms can be categorized as supervised, semi-supervised or unsupervised. Supervised feature selection algorithms [44, 46, 52, 53] use the statistical dependency between a feature and the class variable to determine the degree of feature relevance. In the absence of class labels, unsupervised feature selection algorithms evaluate the degree of feature relevance from data variance and separability [9, 22]. In situations where labeled data can be obtained but are very expensive, semi-supervised feature selection algorithms [55, 58] can use a small portion of labeled data as additional information to improve the performance of unsupervised feature selection algorithms.

A large number of feature selection algorithms have been developed, but most can be grouped into one of three models: filter, wrapper or embedded [59]. The filter model depends on the characteristics of the data alone, without involving learning (e.g., classification and regression) algorithms. Many feature selection algorithms in the filter model rely on certain metrics to rank or eliminate features. For instance, correlation [6, 54, 56], the t-test [49, 60] and mutual information (MI) [2, 39, 47, 50, 51] have been used to rank features or eliminate irrelevant ones. The wrapper model requires a learning algorithm to assess classification performance (e.g., prediction accuracy or subset cardinality) as the evaluation criterion for selecting features [3, 25, 43]. The embedded model integrates feature selection with the classification model in the training process, so that training performance and selected features are achieved simultaneously. Examples of embedded models include the decision tree C4.5 [42], \(L_{1}\)-norm SVM [32], and logistic regression with \(L_{1}\)-norm regularization [10, 12, 24, 48].

Logistic regression (LR) has been widely used as a classifier not only because of its performance, but also because of its interpretability and ease of implementation. However, LR without regularization often yields high-variance estimates of its coefficients, especially when there are many correlated features (variables). This issue can be mitigated using ridge (\(L_{2}\)-norm) regularization to shrink the size of the coefficients [23]. Nevertheless, almost all (if not all) of the coefficients remain non-zero, so this method does not possess the characteristic of feature selection. Moreover, the resulting coefficients tend to spread equally within a set of correlated features, yielding underestimated coefficients that can be over-penalized when performing feature selection by thresholding. The problem can be alleviated by imposing lasso (\(L_{1}\)-norm) regularization [10, 14, 48], which induces a sparse solution compared to the ridge penalty. However, this penalty tends to pick only a few features (if not only one) from a set of correlated features, yielding a very sparse solution that is often not robust in practice. The \(L_{q}\)-norm penalty was proposed to relieve this issue by generalizing the norm and choosing \(q\) between 1 and 2 to combine the effects of both ridge and lasso, as in [11]. However, the \(L_{q}\)-norm penalty in this range of \(q\) does not provide a sparse solution because the norm is differentiable at zero when \(q>1\). The elastic net penalty [61] was introduced as a linear combination of the ridge and lasso penalties, resulting in a compromise between the two. The lasso part of the elastic net penalty encourages a sparse solution, whereas the ridge part encourages spreading coefficients among a set of correlated features, resulting in theoretically more robust classification than lasso and explicit feature selection not available through ridge. More detailed explanations of each model are given in Sect. 3. Furthermore, LR is computationally promising, as its inference on large datasets can also be accomplished using stochastic gradient descent [29, 57], which can be parallelized in the MapReduce framework; this, however, is beyond the scope of this paper. Interested readers may refer to [4, 7, 26].

2.2 Multi-voxel pattern analysis (MVPA)

Conventional fMRI data analysis has relied on univariate statistical approaches to elucidate the neural basis of cognition. In such approaches, the response is assessed at each voxel in the brain independently. However, a growing body of evidence suggests that mental representations are more effectively studied by considering the joint activity of multiple voxels [19, 21, 37]. Thus, MVPA, adapted from machine learning and pattern recognition, has emerged as a new analysis framework for fMRI. MVPA is often used to perform cognitive state decoding, whereby cognitive representations are classified into discrete categories of stimulus conditions.

MVPA involves several computational steps: feature extraction, feature selection, and pattern classification. For fMRI data, features are conventionally operationalized as voxels. Feature extraction is a procedure to characterize the temporally-evolving response to a stimulus at a voxel, often with a summary value such as a regression coefficient. Feature selection is a procedure to identify and select the subset of voxels to use with the classifier. The voxel selection process is particularly important in cognitive neuroscience, where selected voxels implicate brain regions involved in cognitive processes. Pattern classification is a procedure to train a classification algorithm to create a prediction/classification model that best separates the stimulus categories represented in the multidimensional space defined by the selected features (voxels).

Figure 1 illustrates the feature extraction step for fMRI signals from the ventral temporal (VT) cortex as the region of interest (ROI). To characterize the blood-oxygen-level dependent (BOLD) response to a given stimulus condition (an indirect measure of the neural response), a general linear model (GLM) is applied, and coefficient parameters “beta” are estimated by fitting a GLM with a different predictor for each stimulus block or entity. Unless otherwise noted, in the studies presented here, the predictors were modeled with a boxcar convolved with a canonical hemodynamic response function (HRF) [41]. The HRF characterizes the temporally-evolving BOLD signal change in response to a briefly presented stimulus. In summary, each stimulus can be represented by a 3-dimensional volume matrix, with each entry in the matrix representing a real-valued beta coefficient of a voxel.
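
As a rough illustration of this beta-estimation step, the sketch below builds one boxcar predictor per stimulus block, convolves it with a simple double-gamma HRF, and fits the GLM by least squares. The HRF parameters, block onsets and dimensions are our own illustrative assumptions, not those of the original studies.

```python
import numpy as np
from scipy.stats import gamma

TR = 2.5                                   # repetition time in seconds (assumed)

def canonical_hrf(tr, duration=30.0):
    """A simple double-gamma HRF (SPM-like parameters, assumed here)."""
    t = np.arange(0, duration, tr)
    peak = gamma.pdf(t, 6)                 # positive peak around 6 s
    undershoot = gamma.pdf(t, 16) / 6.0    # post-stimulus undershoot
    h = peak - undershoot
    return h / h.max()

def block_predictor(n_scans, onset, dur, tr):
    """Boxcar for one stimulus block, convolved with the canonical HRF."""
    boxcar = np.zeros(n_scans)
    boxcar[int(onset / tr):int((onset + dur) / tr)] = 1.0
    return np.convolve(boxcar, canonical_hrf(tr))[:n_scans]

# Design matrix: one column per stimulus block plus an intercept.
n_scans = 120
onsets = [10.0, 60.0, 110.0, 160.0]        # hypothetical block onsets (s)
X = np.column_stack([block_predictor(n_scans, o, 24.0, TR) for o in onsets]
                    + [np.ones(n_scans)])

# For each voxel, a least-squares fit yields one beta per stimulus block.
y = np.random.randn(n_scans)               # stand-in for one voxel's time series
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
```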

Fig. 1
figure 1

An illustration of the canonical data matrix of the fMRI data used in the pattern classification system. Each experimental condition is induced by a visual stimulus (image) presented to a human subject in a short period of time, and is eventually transformed into each row \(i\) of the \(n\times p\) data matrix \(x\), whose class label is denoted by \(c_{i}\)

In practice, when performing feature selection and classification, it is more convenient to reorganize the volume matrix into a canonical 2-dimensional input data matrix (see Fig. 1). The data matrix is denoted by \(\mathbf {x}\) of dimension \(n\times p\), where \(n\) is the number of data instances/observations (the total number of presented stimuli) and \(p\) is the number of features (voxels) in the ROI. The entry \(x_{ ij }\) of the data matrix represents the real-valued beta coefficient of the \(i\)th data instance at the \(j\)th voxel. We denote the class label by \(c_{i}\in \{1,\ldots ,K\}\) (i.e., stimulus category), where \(K\) is the total number of stimulus categories. For each data instance \(i\), \(c_{i}\) is known precisely from the experimental design.
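
As a concrete sketch of this reorganization (with made-up dimensions), masking and flattening each beta volume with NumPy yields the \(n\times p\) data matrix directly:

```python
import numpy as np

n, dims = 96, (64, 64, 40)                 # stimuli and volume size (illustrative)
beta_volumes = np.random.randn(n, *dims)   # one 3-D beta volume per stimulus
roi_mask = np.zeros(dims, dtype=bool)      # boolean ROI mask
roi_mask[20:30, 25:40, 15:25] = True       # hypothetical ROI

# Boolean indexing keeps only ROI voxels, giving the n x p data matrix x.
x = beta_volumes[:, roi_mask]              # shape (n, p), p = roi_mask.sum()
c = np.random.randint(1, 9, size=n)        # class labels c_i in {1, ..., K}
```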

In our previous study [3], a new feature selection method based on the MI criterion, called maximum informativeness (MaxI), was developed. MI is widely used as a criterion to rank feature relevance [2, 47, 50, 51], starting from calculating the MI between each feature and the class label vector. MaxI prioritizes the voxels to be selected based on the informativeness of individual features with respect to the class labels, assessed by the value of MI (called the importance index). The notion of MaxI is to determine the best level of the importance index, rather than the best number of voxels to be selected. To optimize this level, a calibration procedure is carried out iteratively with a classification algorithm under leave-one-run-out cross-validation. In that study, SVM, LR, and Gaussian Naïve Bayes (GNB) models were used as classification algorithms.

One of the main drawbacks of MaxI is that it evaluates each feature on a univariate basis. That is, it does not consider the non-decomposable information of jointly working features involved in cognitive representations. In the literature, one way to capture jointly working features is the forward/backward selection algorithm [16], in which a feature is added to the selected set when its combination with the already selected features gives the best performance. This process continues until all features are included in the selected set or until a storage limit is reached. However, the approach requires \(O(p^{2})\) evaluations, which is intractable for large \(p\). Recently, an efficient approach based on submodular optimization has been proposed [27, 28, 31]. Although this approach provides theoretical guarantees on performance, it strictly requires the objective function of the classification model to be submodular.

3 Logistic regression with regularizations

Logistic regression is widely used as a classifier, together with feature selection, because of its performance and its simplicity of implementation. In this section, we present the formulation and characteristics of linear (binomial and multinomial) logistic regression with ridge, lasso, \(L_{q}\)-norm, and elastic net penalties, respectively.

3.1 Logistic regression

Let \(c\) denote the class variable and \(\fancyscript{C}=\{1,2\}\) denote the label set with two categories. A logistic regression model incorporates a linear function of the predictors \(x\) into the class-conditional probability. The average log-likelihood of the model is given by

$$\begin{aligned} J(\beta _{0},\beta )=\frac{1}{n}\sum _{i=1}^{n}\left\{ I_{1}(c_{i})\log Pr(c_{i}=1|x_{i})+I_{2}(c_{i})\log Pr(c_{i}=2|x_{i})\right\} \!, \end{aligned}$$
(1)

where \(n\) is the number of data instances, \(I_{k}(c_{i})\) is an indicator function returning 1 when \(c_{i}=k\) and 0 otherwise, and \(Pr(c_{i}=1|x) = \frac{1}{1+e^{-(\beta _{0}+x^{\top }\beta )}}\) and \(Pr(c_{i}=2|x) = 1-Pr(c_{i}=1|x) = \frac{e^{-(\beta _{0}+x^{\top }\beta )}}{1+e^{-(\beta _{0}+x^{\top }\beta )}}\) are the probability functions of the two class outcomes. The coefficients \((\beta _{0},\beta )\) can be computed (trained) by maximizing the objective function \(J(\beta _{0},\beta )\) with respect to \(\beta _{0}\) and \(\beta \):

$$\begin{aligned} \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}}J(\beta _{0},\beta ). \end{aligned}$$
(2)

It is noted that the learned coefficients \((\beta _{0},\beta )\) are not scale-invariant to the input \(x\), so it is often necessary to standardize the input \(x\) (e.g., z-score) before solving the maximization problem in Eq. (2).
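
For concreteness, a minimal NumPy sketch (our own code, not from any particular package) of the class-conditional probability and the average log-likelihood \(J\) in Eq. (1) follows; the z-score standardization noted above is included:

```python
import numpy as np

def class_probability(x, beta0, beta):
    """Pr(c_i = 1 | x_i) for each row of x, per the logistic model."""
    return 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))

def log_likelihood(beta0, beta, x, c):
    """J(beta0, beta): average log-likelihood over n instances; c in {1, 2}."""
    p1 = class_probability(x, beta0, beta)
    eps = 1e-12                              # guard against log(0)
    return np.mean(np.where(c == 1, np.log(p1 + eps), np.log(1 - p1 + eps)))

def zscore(x):
    """Standardize each feature before fitting, as noted above."""
    return (x - x.mean(axis=0)) / x.std(axis=0)
```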

3.2 Ridge penalty

When there are many correlated features (variables) in the linear model, the coefficients \((\beta _{0},\beta )\) of these correlated features may cancel each other out, and unbiased estimates may be associated with high variance. Such an issue can be alleviated by imposing a size constraint on the coefficients using the squared \(L_{2}\)-norm of \(\beta \), called the ridge penalty; the resulting estimator is denoted \(\hat{\beta }_{ ridge }\). The new maximization model with the size constraint is given by

$$\begin{aligned}&( LR + ridge ) \quad \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}} \quad J(\beta _{0},\beta ) \end{aligned}$$
(3)
$$\begin{aligned}&\text{ s.t. } \quad \sum _{j=1}^{p}\beta _{j}^{2}\le t, \end{aligned}$$
(4)

where \(t\) is the bound on the sum of squared coefficients. Note that \(\beta _{0}\) is excluded from the sum.

To solve this constrained optimization problem, we apply the Lagrange multiplier method to incorporate the constraint in Eq. (4) into the objective function in Eq. (3). The Lagrangian form is then given by

$$\begin{aligned} \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}}\left\{ J(\beta _{0},\beta )-\lambda \sum _{j=1}^{p}\beta _{j}^{2}\right\} \!, \end{aligned}$$
(5)

where \(\lambda \ge 0\) is the Lagrange multiplier, which represents a complexity parameter and in turn controls the amount of shrinkage: the larger the value of \(\lambda \), the greater the amount of shrinkage and hence the smaller the size of the coefficients. There is a one-to-one correspondence between the parameter \(\lambda \) in Eq. (5) and \(t\) in Eq. (4) [11].

It is worth noting that the maximization model with the ridge penalty does not have the characteristic of feature selection: even though the ridge penalty shrinks the coefficients toward zero, they all remain non-zero. The coefficients of correlated features tend to spread among them and underestimate the true importance of those features. A thresholding strategy can be used to eliminate features with coefficient values near zero. However, this strategy would degrade the performance of the classification model, as the weights of these features underestimate their joint contribution to the model. Theoretically, the lack of explicit thresholding should result in the classification model with the ridge penalty having the same number of features as the model without any regularization. In practice, however, coefficients with numerical values very close to zero (e.g., \(\beta <10^{-14}\)) are rounded to zero for numerical robustness, leading to an occasional reduction in the number of features.

3.3 Lasso penalty

The lasso penalty works similarly to the ridge penalty, except that the \(L_{1}\)-norm is used in the constraint on the coefficients. The optimization model of logistic regression with a lasso penalty is given by

$$\begin{aligned}&\displaystyle ( LR + Lasso ) \quad \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}}\quad J(\beta _{0},\beta ) \end{aligned}$$
(6)
$$\begin{aligned}&\displaystyle \text{ s.t. } \quad \sum _{j=1}^{p} |\beta _{j}|\le t. \end{aligned}$$
(7)

An equivalent Lagrangian form is given by

$$\begin{aligned} \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}}\left\{ J(\beta _{0}, \beta )-\lambda \sum _{j=1}^{p}\left| \beta _{j}\right| \right\} \!. \end{aligned}$$
(8)

The lasso (\(L_{1}\)-norm) penalty induces a sparse solution compared to the ridge (\(L_{2}\)-norm) penalty. Figure 2 displays a geometric example with two parameters \(\beta _{1}\) and \(\beta _{2}\). The residual sum of squares has elliptical contours centered at the least squares solution. The point where an elliptical contour first touches the constraint region is the solution to the optimization problem. The constraint region of the lasso has corners, and at each corner at least one of the coefficients is exactly zero; a first touch at a corner therefore yields a sparse solution. Furthermore, in a higher-dimensional space (\(p>2\)) there are more corners, and thus a higher chance that the first-touch point ends up at one of them. This is not true for ridge, whose circular constraint region has no corners, so the first-touch point can land anywhere with equal probability; a sparse solution under the ridge penalty is therefore rare when \(p\) is large.

Fig. 2
figure 2

The geometry of the lasso and ridge penalties in the space \((\beta _{1},\beta _{2})\). The constraint regions for the \(L_{2}\)-norm and \(L_{1}\)-norm are represented by the circular disk (orange) and the diamond disk (red), respectively. The residual sum of squares has elliptical contours (blue) centered at the least squares solution \(\hat{\beta }\). The dotted line and the outer-most thick solid line are the contours where the ridge solution and the lasso solution occur, respectively. The corners of the diamond suggest a sparse solution, which happens with greater probability under lasso regularization than under ridge regularization. (Color figure online)

It is important to note that the lasso penalty tends to pick only a few features (if not only one) from a set of correlated features, yielding a very sparse solution. In practice, such a solution may be less robust across validation folds. Therefore, a more general form, the \(L_{q}\)-norm penalty \(\lambda \sum \nolimits _{j=1}^{p}\left| \beta _{j}\right| ^{q}\), has been suggested. A value of \(q\) in the range \(q\in (1,2)\) yields a compromise between the ridge and lasso penalties. However, \(\left| \beta _{j}\right| ^{q}\) is differentiable at 0 for such \(q\), and thus does not share the lasso's ability to set some of the \(\beta _{j}\)'s exactly to zero. In other words, the \(L_{q}\)-norm does not provide a sparse solution when \(q\in (1,2)\).

3.4 Elastic net penalty

The lasso penalty may be too stringent in selecting among a set of strong but correlated features, whereas ridge regularization tends to shrink the coefficients of correlated features toward each other. The elastic net penalty was introduced as a compromise between the two. The combined optimization model is given by

$$\begin{aligned}&\displaystyle ( LR + Elastic \ net ) \quad \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}} \quad J(\beta _{0},\beta ) \end{aligned}$$
(9)
$$\begin{aligned}&\displaystyle \text{ s.t. } \quad \sum _{j=1}^{p}\left( \alpha \left| \beta _{j}\right| +(1-\alpha ) \beta _{j}^{2}\right) \le t, \end{aligned}$$
(10)

where \(\alpha \in [0,1]\) is a tradeoff parameter between the lasso and ridge penalties. Its equivalent Lagrangian form is given by

$$\begin{aligned} \max _{(\beta _{0},\beta )\in \fancyscript{R}^{p+1}}\left\{ J(\beta _{0},\beta )- \lambda \sum _{j=1}^{p}\left( \alpha \left| \beta _{j}\right| +(1-\alpha ) \beta _{j}^{2}\right) \right\} \!. \end{aligned}$$
(11)

The penalty term is a linear combination of the lasso and ridge penalties. The first (lasso) term encourages a sparse solution for \(\beta \), while the second (ridge) term encourages strongly correlated features to be averaged. Therefore, the elastic net provides both sparsity and selection of correlated features, although \(\alpha \) must be predetermined.
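
To illustrate the relative sparsity induced by the three penalties, the sketch below fits all of them with scikit-learn (our choice of library for illustration; its C parameter acts roughly as an inverse of \(\lambda \), and l1_ratio plays the role of \(\alpha \)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 1000))           # n << p, as in fMRI
y = (X[:, :5].sum(axis=1) > 0).astype(int)    # only 5 features are informative

configs = {
    "ridge":       dict(penalty="l2", solver="lbfgs"),
    "lasso":       dict(penalty="l1", solver="saga"),
    "elastic net": dict(penalty="elasticnet", solver="saga", l1_ratio=0.5),
}
for name, kw in configs.items():
    clf = LogisticRegression(C=0.1, max_iter=5000, **kw).fit(X, y)
    print(f"{name}: {np.count_nonzero(clf.coef_)} non-zero coefficients")
```

Ridge typically retains all 1000 coefficients, lasso only a handful, and elastic net an intermediate number, mirroring the discussion above.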

3.5 Multinomial logistic regression with elastic net penalty

For multi-class classification problems, a maximization model based on the penalized multinomial log-likelihood with the elastic net penalty is given by

$$\begin{aligned} ( MLR + Elastic \, net ) \quad \max _{(\beta _{l0},\beta _{l})_{1}^{K}\in \fancyscript{R}^{K(p+1)}}\left\{ J\left( (\beta _{l0},\beta _{l})_{1}^{K}\right) -\lambda \sum _{l=1}^{K}P_{\alpha }(\beta _{l})\right\} \!, \end{aligned}$$
(12)

where

$$\begin{aligned} J \left( (\beta _{l0},\beta _{l})_{1}^{K}\right) =\frac{1}{n}\sum _{i=1}^{n}\log Pr(c_{i}|x_{i}) \end{aligned}$$

and

$$\begin{aligned} P_{\alpha }(\beta _{l})=\sum _{j=1}^{p}\left( \alpha \left| \beta _{lj}\right| +(1-\alpha )\beta _{lj}^{2}\right) \!, \end{aligned}$$

where \(c\) is the class variable taking values from the label set \(\fancyscript{C}=\{1,\ldots ,K\}\) and \( Pr (c=l|x)=\frac{e^{-(\beta _{l0} +x^{\top }\beta _{l})}}{\sum \nolimits _{l'=1}^{K}e^{-(\beta _{l'0}+x^{\top }\beta _{l'})}}\) is the probability function over the multiple outcomes, adopted from [61].

4 Computational framework

In this section, we present a computational framework for feature selection in cognitive neuroscience datasets, where the number of data instances is much smaller than the number of features (i.e., \(n\ll p\)) due to acquisition-time limitations and human-factor practicalities. We specifically discuss the optimization of feature selection in prediction with the family of logistic regression classifiers. The coefficient of each feature represents its contribution to the model, and the features with non-zero coefficients become the features selected by the prediction model.

4.1 Optimization of the penalized logistic regression

Because a closed-form analytical solution is not available, we resort to numerical optimization for logistic regression. The penalized (binomial and multinomial) logistic regression is solved differently depending on the penalty type.

For conventional logistic regression (LR) without regularization and ridge-penalized LR (LR \(+\) ridge), optimal solutions are obtained using a trust-region algorithm, available in MATLAB's unconstrained optimization package. In particular, the algorithm takes the derivative of the Lagrangian in Eq. (5) with respect to each \(\beta _{j}\) for \(j\in \{0,1,\ldots ,p\}\), and a local optimum of the Lagrangian is obtained for each \(\lambda \). Because the objective function is concave in \((\beta _{0},\beta )\), any such local optimum is in fact the global optimum for the given \(\lambda \). When \(\lambda =0\), we obtain the solution for LR, while \(\lambda >0\) gives the solution for LR \(+\) ridge.
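
This step can be sketched in Python with SciPy (the paper's implementation is in MATLAB; the analogue below, with our own variable names, minimizes the negative of the Lagrangian in Eq. (5) using SciPy's trust-region Newton-CG solver):

```python
import numpy as np
from scipy.optimize import minimize

def fit_lr_ridge(x, c, lam):
    """Maximize J(beta0, beta) - lam * ||beta||^2 by minimizing its negative.
    Setting lam = 0 recovers plain LR; x is n x p, c takes values in {1, 2}."""
    n, p = x.shape
    y = (c == 1).astype(float)                 # 1 for class 1, 0 for class 2

    def neg_obj(w):
        beta0, beta = w[0], w[1:]
        z = beta0 + x @ beta
        # -J + lam * ||beta||^2, using the stable identity log(1 + e^{-z})
        return np.mean(np.logaddexp(0.0, -z) + (1.0 - y) * z) + lam * beta @ beta

    def grad(w):
        beta0, beta = w[0], w[1:]
        r = 1.0 / (1.0 + np.exp(-(beta0 + x @ beta))) - y     # Pr - y
        return np.concatenate(([r.mean()], x.T @ r / n + 2.0 * lam * beta))

    def hessp(w, v):                           # Hessian-vector product
        beta0, beta = w[0], w[1:]
        pr = 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))
        d = pr * (1.0 - pr)                    # diagonal logistic weights
        av = v[0] + x @ v[1:]
        return np.concatenate(([np.mean(d * av)],
                               x.T @ (d * av) / n + 2.0 * lam * v[1:]))

    res = minimize(neg_obj, np.zeros(p + 1), jac=grad, hessp=hessp,
                   method="trust-ncg")
    return res.x[0], res.x[1:]                 # beta0, beta
```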

Lasso-penalized logistic regression (LR \(+\) lasso) can be regarded as a special case of elastic net penalized logistic regression (LR \(+\) elastic net). We employ the cyclical coordinate descent (CCD) algorithm recently proposed by [12] to solve LR \(+\) elastic net, because it has computational advantages over the least angle regression (LAR) algorithm proposed in [10]. The algorithm computes solutions along a regularization path of \(\lambda \) values. For each value of \(\lambda \), CCD creates an outer loop cycling over the classes \(l\) and evaluates a partial quadratic approximation of the multinomial log-likelihood \(J((\beta _{l0},\beta _{l})_{1}^{K})\) about the current parameters \((\beta _{l0},\beta _{l})\). The quadratic approximation combined with the elastic net penalty becomes a penalized weighted least-squares problem, which can be solved using coordinate descent. In our study, the optimization of LR \(+\) elastic net and LR \(+\) lasso is implemented using the optimization package glmnet provided by [13], which employs a number of additional numerical techniques to stabilize CCD.

Recalling the tradeoff parameter \(\alpha \) in Eq. (11), the solution of LR \(+\) lasso is obtained when \(\alpha =1\), while the solution of LR \(+\) ridge is obtained when \(\alpha =0\). However, the computation is not numerically stable in the latter case, so the solution of LR \(+\) ridge has to be derived separately, as described above.
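
To give a flavor of the CCD inner loop, the sketch below applies the coordinate-wise soft-thresholding update to an elastic-net penalized least-squares problem. This is a simplified, unweighted variant of the glmnet update in [12], not the package's actual code; note the conventional \(\tfrac{1}{2}(1-\alpha )\) scaling of the ridge term assumed here.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(x, y, lam, alpha, n_sweeps=100):
    """Cyclical coordinate descent for (1/2n)||y - x beta||^2
    + lam * sum_j (alpha*|b_j| + 0.5*(1-alpha)*b_j^2); x assumed standardized."""
    n, p = x.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual excluding feature j (recomputed naively for clarity)
            r_j = y - x @ beta + x[:, j] * beta[j]
            z_j = x[:, j] @ r_j / n
            beta[j] = soft_threshold(z_j, lam * alpha) / (
                x[:, j] @ x[:, j] / n + lam * (1.0 - alpha))
    return beta
```

Each coordinate update has a closed form, which is what makes CCD fast along a \(\lambda \) path: the solution at one \(\lambda \) warm-starts the next.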

4.2 Cross validation and free parameters selection

In this study, we apply a leave-one-run-out cross-validation paradigm for optimizing/learning the model parameters, selecting the free parameters, and reporting the prediction accuracy. A dataset \(\mathbf { x}\) is divided into \(F\) mutually exclusive sections, or runs (shorthand for experiment runs); one run is marked as the testing dataset, and the remaining runs are divided into a training dataset and a validation dataset, denoted by \(\mathbf { x}_{ test }, \mathbf { x}_{ train }\) and \(\mathbf {x}_{ valid }\), respectively. In the cross-validation process, each run takes a turn as the testing run. For each such split, the training dataset is used to train the model parameters, the free parameters are selected based on the validation dataset, and finally the prediction accuracy is evaluated on the testing dataset. The final (prediction) accuracy is reported as the average accuracy across all runs.

It was suggested in [61] that the free parameter \(\alpha \) is problem-dependent and should be fixed, while model refinement can be done by adjusting \(\lambda \). In the current work, however, we treat both \(\alpha \) and \(\lambda \) as free parameters to be optimized so that the model is fully data-driven. The model parameters are essentially functions of the free parameters, \((\beta _{0}(\alpha ,\lambda ),\,\beta (\alpha ,\lambda ))\), and are largely determined by the choice of free parameters given to the model. Note that the selection criteria for the free parameters are subjective and problem-dependent.

Given a free parameter pair \((\alpha ,\lambda )\), the training dataset \(\mathbf { x}_{ train }\) is used to learn the model parameters \((\beta _{0},\beta )\) of a LR classifier. Validation accuracy is then reported from the LR classifier with the learned model parameters on the validation dataset \(\mathbf {x}_{ valid }\). The free parameter pair \((\alpha ^{*},\lambda ^{*})\) is optimized according to the validation accuracy as follows:

$$\begin{aligned} (\alpha ^{*},\lambda ^{*})=\arg \max _{(\alpha ,\lambda )\in \varOmega } accuracy (x_{ valid },y_{ valid };\beta _{0}(\alpha ,\lambda ),\,\beta (\alpha ,\lambda )), \end{aligned}$$

where \(\varOmega \) is a set of free parameter candidates defined by the user; \(\mathbf {x}_{ valid }\) and \(\mathbf { y}_{ valid }\) denote a data matrix and its corresponding class label vector in the validation dataset, respectively. Consequently, the optimal model parameters can be obtained from \(\beta _{0}^{*}=\beta _{0}(\alpha ^{*},\lambda ^{*})\) and \(\beta ^{*}=\beta (\alpha ^{*},\lambda ^{*})\) accordingly. We report the testing accuracy by applying the optimal model parameters \((\beta _{0}^{*},\beta ^{*})\) to the testing dataset \(\mathbf {x}_{ test }\).
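
Putting the pieces together, the cross-validation and free-parameter search can be sketched as follows; fit_model and accuracy are hypothetical placeholders (e.g., a glmnet fit and a 0/1 accuracy score), and a single validation run per split is assumed:

```python
import itertools
import numpy as np

def leave_one_run_out(folds, alphas, lambdas, fit_model, accuracy):
    """folds: list of per-run (X, y) pairs. For each held-out test run, pick
    (alpha*, lambda*) on the validation run, then score the test run."""
    test_accs = []
    for t, (x_test, y_test) in enumerate(folds):
        rest = [f for r, f in enumerate(folds) if r != t]
        (x_val, y_val), train = rest[0], rest[1:]     # one validation run
        x_tr = np.vstack([x for x, _ in train])
        y_tr = np.concatenate([y for _, y in train])

        # Grid search over the free parameter candidates Omega
        best = max(((accuracy(fit_model(x_tr, y_tr, a, l), x_val, y_val), a, l)
                    for a, l in itertools.product(alphas, lambdas)),
                   key=lambda r: r[0])
        _, a_star, l_star = best

        model = fit_model(x_tr, y_tr, a_star, l_star)
        test_accs.append(accuracy(model, x_test, y_test))
    return float(np.mean(test_accs))          # final accuracy averaged over runs
```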

5 Experimental results

In this paper, we evaluate the performance of feature selection via regularization methods on three datasets: (1) Haxby, (2) Lexical and (3) CMU. Summary information for each dataset can be found in Table 1, with more details provided in Sect. 5.2.

Table 1 Summary of datasets used in this paper, including regions of interest (ROI), the number of subjects (no. of subjects) and the number of voxels (no. of voxels) in each ROI

5.1 Implementation and evaluation

For each dataset, we applied linear logistic regression (LR) with four different types of penalty, as described in Sect. 4:

  1. LR \(+\) elastic net: uses a linear combination of both lasso and ridge regularization to find a compromise between sparsity and predictivity. That is, \(0<\alpha <1\) and \(\lambda >0\).

  2. LR \(+\) lasso: uses \(L_{1}\)-norm regularization to induce a sparse solution by setting a large portion of the \(\beta _{j}\)'s to zero. That is, \(\alpha =1\) and \(\lambda >0\).

  3. LR \(+\) ridge: uses \(L_{2}\)-norm regularization to shrink the coefficients by imposing a penalty on their size. The solution is not sparse, however, since the coefficients remain non-zero. That is, \(\alpha =0\) and \(\lambda >0\).

  4. LR \(+\) none: a control case in which no regularization is added to the objective function. That is, \(\lambda =0\) regardless of \(\alpha \).

Although the free parameter pair \((\alpha ,\lambda )\) is selected automatically with respect to the dataset, we ensure the robustness of the solution by imposing the search ranges \(\alpha \in \fancyscript{A}=\{0,0.1,0.2,\ldots ,1\}\) and \(\lambda \in \fancyscript{L}=\{0,0.001,0.002,\ldots ,1\}\). We emphasize that the ranges of \(\alpha \) and \(\lambda \) must cover all regularization methods we wish to benchmark: the ridge penalty \((\alpha =0,\lambda \in \fancyscript{L})\); the lasso penalty \((\alpha =1,\lambda \in \fancyscript{L})\); the elastic net penalty \((0<\alpha <1,\lambda \in \fancyscript{L})\); and \(\lambda =0\) (regardless of \(\alpha \)) for no penalization at all.

An additional criterion is used in our experiments as a tie-breaker when the validation accuracies of two free parameter pairs are approximately equal (within some tolerance). In such a case we prefer the sparser (i.e., more interpretable) solution, namely the one with fewer non-zero coefficients. Specifically, the solution with the larger \(\alpha \) or larger \(\lambda \) is preferred.
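
This tie-breaking rule can be written compactly (a sketch over candidate tuples collected during the grid search above; the tolerance value is illustrative):

```python
def pick_with_tiebreak(candidates, tol=0.005):
    """candidates: list of (val_acc, n_nonzero, alpha, lam) tuples.
    Among near-ties in validation accuracy, prefer the sparsest model,
    then the larger alpha, then the larger lam."""
    best_acc = max(c[0] for c in candidates)
    near_ties = [c for c in candidates if best_acc - c[0] <= tol]
    return min(near_ties, key=lambda c: (c[1], -c[2], -c[3]))
```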

For lasso and elastic net we use the MATLAB package glmnet from [12, 13]; for ridge and none we implemented our own MATLAB code, as discussed in Sect. 4.1. We adopt the cross-validation paradigm described in Sect. 4. For all datasets, we organized the training, validation, and testing folds according to the experiment run number mentioned in Sect. 4.2. This approach avoids positively biasing the results with within-run signal structure, but it also makes the classification problem more challenging due to differences in signal structure among the training, testing and validation folds. Details for each dataset are described in Sect. 5.2 and in Table 1. The performance of each classification model is evaluated by the prediction accuracy on the testing set, or testing accuracy for short. For each subject, the individual testing accuracy is calculated by averaging the testing accuracies across all runs. In each run, the individual testing accuracy is obtained at the optimal free parameters \((\alpha ^{*},\lambda ^{*})\) and the optimal model parameters \((\beta _{0}^{*},\beta ^{*})\) according to Sect. 4.2. Finally, we average the testing accuracy across all subjects in the experiment and report the average testing accuracy. The details of the datasets and the experiment settings are discussed in the following section.

5.2 Data and experiment setting on each dataset

5.2.1 Haxby dataset

The seminal work by [19] demonstrated the utility of pattern classification approaches in fMRI for investigating object category representation in ventral temporal cortex (VTC). The data have since been made publicly available and are widely used to benchmark the performance of pattern classification techniques [17–20, 38]. In the study, subjects viewed gray-scale images from eight different object categories (face, house, cat, bottle, scissor, shoe, chair, and ‘scrambled pictures’) as part of a one-back detection task. Exemplars from each category were presented in blocks of 24 s followed by 12 s of rest. Each object category was shown once per fMRI run, with 12 runs of fMRI data acquired per subject. The fMRI data were acquired on a 3T GE scanner and consisted of image volumes of \(64 \times 64 \times 40\) voxels acquired every 2.5 s.

Standard processing was performed on the fMRI data, including motion correction and linear de-trending. Data were then standardized (z-scored) by subtracting the mean and dividing by the standard deviation of the time series signal at each voxel. To characterize the fMRI response associated with each object category, beta coefficient parameters were estimated by fitting a general linear model (GLM). A different predictor was used to model each object block in each run, producing 96 different parameter estimates (12 parameters for each of the eight object categories) for each subject. We refer interested readers to [40, 41] for more details.

The original dataset contains 12 runs per subject, which is divided into 1 run for testing, 2 runs for validating and 9 runs for training denoted by (1:2:9). For this dataset, we focus on selecting features from two different initial regions of interest (ROI):

  1. Ventral temporal masks provided by the Haxby group (vtc): the masks were defined using combined anatomical and functional criteria [19]. The resultant ROI masks were relatively small, ranging from 307 to 675 voxels across subjects.

  2. All voxels in the whole brain (wb): across subjects, the number of voxels varied from 36,292 to 39,280.

The dataset information is summarized in Table 1. The experimental results can be found in Table 2.

Table 2 Summary of the results from Haxby dataset

When using the vtc mask, LR \(+\) ridge gives the best classification performance, followed by LR \(+\) elastic net, LR \(+\) none, and LR \(+\) lasso. These results illustrate that the ridge penalty performs very well when the feature subset is initially well-constrained. Although LR \(+\) elastic net is about 7 % poorer than LR \(+\) ridge, its selected feature subset is roughly half the size of the initial mask. LR without regularization (LR \(+\) none) is the baseline model, as it demonstrates the behavior of LR when the sizes of the coefficients \(\beta \) are not regularized. LR \(+\) lasso selects the fewest voxels, and hence the sparsest solution, of all the approaches. It also gives the poorest results here, perhaps because its solution is too sparse, especially as it is applied to a dataset with a small feature dimensionality.

It is also interesting that the gap between training accuracy and validation/testing accuracy is not small even though regularization is imposed in the classifier. This is because the dataset is partitioned into training, validation and testing sets based on the experiment run number. fMRI data from different runs have substantial run-related structured “noise” due to instrument variation and/or differences in subject factors (e.g., amount of head movement) [5, 33]. Therefore, the model learned by the classifier likely includes run-specific information that is present in the training set but absent from the validation/testing set; conversely, the classification model will not embody the run-related information of the testing/validation run. These run-specific effects contribute to the gap between training accuracy and validation/testing accuracy and reflect a reduction in the ability of the classifier to generalize to the class conditions (i.e., the scientifically meaningful information). While partitioning the data based on run reduces classification accuracy in the validation/testing runs, it ensures that accuracy is not positively biased by run effects. The ideal way to partition the data would be to ensure that each dataset contains at least a few examples from each run, so that run-specific information would be captured by the model. We note that the accuracy gap is smaller in the approaches with regularization than in those without, suggesting that regularization mitigates the undesirable effects of run-specific information.

In the case where \(p\) is very large compared to \(n\) (\(n\ll p\)), as in the wb mask, it is more obvious that the sparsity regularization approaches, elastic net and lasso, outperform those without sparsity enforcement (i.e., ridge and none). This is because irrelevant features are better suppressed by the approaches with sparsity regularization, which acts as an automatic feature selection step within the classifier.

5.2.2 Lexical dataset

The lexical fMRI data, denoted by Lexical, were acquired from seven subjects performing an object naming task. The subjects were scanned on a Siemens 3T TIM Trio Scanner while they produced names out loud in response to 104 color pictures of ‘animals’ or man-made manipulable objects (i.e., ‘tools’) across four runs. The pictures were presented in a rapid event-related design, with each pictured entity randomly repeated four times (using different examples) within a run. Different entities were presented in each run. Imaging data were analyzed using FMRIB's Improved Linear Model [54] with standard preprocessing approaches. Each stimulus entity was modeled separately to obtain individual coefficient estimates of the fMRI response per entity [36].

The dataset is used in a binary classification experiment of “animals” versus “tools”, denoted by Lexical-animtool. The class “animals” is obtained by combining all the observations whose entities belong to the animal category, such as ‘leopard’, ‘ant’, ‘duck’, ‘fish’, ‘turtle’, etc. The class “tools” is the combination of ‘paperclip’, ‘spatula’, ‘pliers’, ‘scissors’, etc. The four runs of data are divided into testing, validation and training sets in the format (1:1:2), with 13 observations per class per run, giving 104 observations in total. For testing, only category entities not used during training are evaluated. Consequently, the testing accuracy reflects the ability of the classification model to capture generalized category-level information rather than entity-level information.

The gap between the training accuracy and the validation/testing accuracy is not small, which is expected given the nature of this experiment: the classifier is meant to capture generalized category-level information, not entity-level information. Nevertheless, some entity-level information is captured by the classifier; in other words, the accuracy gap is partially attributable to entity-level information captured in each run. It is also worth noting that the accuracy gap is even larger when regularization is not imposed, underscoring the importance of regularization for producing scientifically meaningful results.

Instead of analyzing the whole brain data, we focus our attention on two ROI masks:

  1. Voxels initially selected based on a structural anatomical mask (i.e., posterior occipitotemporal cortex defined using Freesurfer's Desikan parcellation scheme [8]) in the ventral temporal cortex (vtc). This ROI mask is available for all seven subjects.

  2. The whole brain's gray-matter mask (wb), which aims to reveal all relevant features. This ROI mask was evaluated for only four subjects.

The dataset information is summarized in Table 1, and the experimental results can be found in Table 3. In the vtc mask, which is a small preselected ROI, LR \(+\) ridge is the best, followed by LR \(+\) elastic net, LR \(+\) lasso and LR \(+\) none. LR \(+\) elastic net and LR \(+\) ridge perform competitively, but LR \(+\) elastic net requires fewer features than ridge. In fact, the testing accuracy of the classification model with lasso regularization is not much lower than that of elastic net and ridge, even though the model is much sparser than either of them. LR \(+\) none performs the worst, well below all regularized approaches in this experiment.

Table 3 Summary of the results from Lexical dataset

When considering the case where \(p\) is large, as in the wb mask, the prediction accuracy of both sparsity regularization approaches, LR \(+\) elastic net and LR \(+\) lasso, clearly outperforms that of LR \(+\) ridge and LR \(+\) none. Again, sparse regularization is more advantageous when \(p\) is larger.

5.2.3 CMU dataset

The dataset was collected and used in [34] and is publicly available on the authors' supplemental website [35]. Since the dataset was originally collected by researchers from Carnegie Mellon University, we refer to it as CMU.

fMRI data were available from nine participants who viewed 60 different word-picture pairs, each presented six times, with the stimulus sequence randomly permuted on each presentation. Participants were asked to think about the properties of each item as they viewed it. Data were acquired on a Siemens Allegra 3.0T scanner with a 64 \(\times \) 64 acquisition matrix and 3.125 mm \(\times \) 3.125 mm \(\times \) 5 mm voxels. Data were corrected for motion and slice acquisition timing.

The dataset contains 12 image categories, with each category consisting of five entities each with six observations. The dataset is used in two classification experiments:

  1. Binary classification of “animals” versus “tools”, denoted by CMU-animtool. The class “animals” is obtained by combining the observations from two original categories in the CMU dataset, ‘animal’ and ‘insect’. The class “tools” is the combination of ‘tool’ and ‘furniture’. Thus, there are 120 observations in total.

  2. Multiclass classification of “animal”, “insect”, “tool” and “vegetable”, denoted by CMU-4class. The four classes are directly retrieved from the respective categories in the original dataset without modification. Thus, there are 120 observations in total.

Since there are six runs in total, we arrange the testing, validation and training sets in the format (1:1:4) in both experiments, yielding 10 and 5 observations per class per run in CMU-animtool and CMU-4class, respectively. Since the dataset was preprocessed and the ROI pre-selected, we adopt the original voxel set provided by [34] without modification. The feature size (number of voxels) of the nine subjects varies from 19,750 to 21,764. The dataset information is summarized in Table 1, and the experimental results can be found in Table 4. In both the binary and multiclass classification experiments, LR \(+\) elastic net gives the best testing accuracy, followed by LR \(+\) ridge, LR \(+\) lasso and LR \(+\) none. All regularization approaches yield testing accuracies above chance; however, we note that the accuracies drop significantly from binary to multiclass classification. This may be because the cognitive processes underlying those four categories are quite similar.

Table 4 Summary of the results from CMU dataset

6 Conclusion

In this paper, we presented a sparse optimization framework for regularizing pattern recognition models. The framework was applied to emerging cognitive neuroscience problems based on analyses of neuroimaging data. Logistic regression classifiers with a penalty (regularization) yielded better prediction accuracy than those without regularization. This was especially noticeable when the number of features \(p\) was large. The benefits of regularization were observed even when the features were initially well-constrained using anatomical and functional criteria. Under these initial conditions, the ridge penalty was sufficient for high classification accuracy and outperformed sparsity-enforcing regularization methods. We note that LR \(+\) ridge is not technically a feature selection method, as the ridge penalty does not eliminate features but rather shrinks their coefficients toward zero.

When the feature size \(p\) was larger (i.e., brain voxels were not restricted using anatomical and/or functional criteria), the advantages of sparsity-enforcing methods became apparent. In such cases, both the LR \(+\) lasso and LR \(+\) elastic net penalties resulted in classification models with higher prediction accuracy than models obtained using the LR \(+\) ridge penalty. These two regularization methods eliminate irrelevant and noisy features by setting their coefficients to zero. Thus, they embed a feature selection step in the training of the classification model and substantially reduce the number of model features. Of the two methods, the lasso penalty produced the sparsest solution. However, classification models obtained with the lasso penalty had lower prediction accuracy than those obtained with the elastic net penalty. This finding suggests that lasso regularization produced feature subsets that were too sparse, and hence less robust in their ability to generalize to the testing data. When the features were well defined initially, LR \(+\) elastic net performed competitively with LR \(+\) ridge. As the elastic net attempts to find the optimal compromise between lasso and ridge regularization, it retains the good prediction accuracy of the ridge penalty while still providing quite sparse solutions like the lasso. Therefore, when taking into account both prediction accuracy and conciseness in the number of selected features, the elastic net appears to be the more desirable regularization approach for fMRI applications.

In the methods described here, optimization of the classifier was achieved by incorporating a penalty term into the objective function. This optimization framework is extensible and allows for incorporation of additional domain specific constraints. In neuroimaging, functional and/or anatomical criteria, such as spatial contiguity and anatomical or functional connectivity, could also be included as constraints embedded in the training process of the classification model. Implementing such approaches could improve scientific interpretability of the results and is an exciting, but non-trivial, future research direction for optimization.