1 Introduction

Multiple regression analysis is a statistical process for estimating the relationship between explanatory and response variables. When some of the explanatory variables are highly correlated, the reliability of the analysis decreases because of the low quality of the resulting estimates. This problem is known as multicollinearity [4, 11, 12].

There are several approaches to avoiding the deleterious effects of multicollinearity, such as principal component regression [21, 28], partial least squares regression [39, 40], ridge regression [17], and subset selection [16, 30]. This paper focuses on subset selection, a commonly used method for eliminating multicollinearity. Conventionally in this method, explanatory variables are removed one at a time on the basis of indicators for detecting multicollinearity, such as the condition number of the correlation matrix and the variance inflation factor (VIF) [11]. A potential disadvantage of this iterative procedure, however, is that the best (e.g., in the least-squares sense) subset of variables is not necessarily found. More precisely, the iterative procedure may fail to provide an optimal solution to the following problem: find a subset of variables that minimizes the residual sum of squares under the constraint that multicollinearity, as measured by the VIF, is not detected when using that subset.

Multiple regression analysis has two primary purposes: prediction and description [18]. Subset selection is particularly beneficial for description purposes because it promotes better understanding of the causal relationships between explanatory and response variables. Various computational algorithms have been proposed for subset selection [10, 14, 23, 26], and many of them are categorized as heuristic algorithms. However, these heuristic algorithms are often unsuitable for description purposes because they can yield low-quality solutions that lead to incorrect conclusions about causality.

For this reason, we adopt a mixed integer optimization (MIO) approach to subset selection. This approach was first proposed in the 1970s [1], and it has recently received renewed attention due to advances in algorithms and hardware [9, 15, 24, 37, 38]. In contrast to heuristic algorithms, the MIO approach has the potential to provide the best subset of variables with respect to several criterion functions, including Mallows’ \(C_p\) [31], adjusted \(R^2\) [32], the discrete Dantzig selector [29], and some information criteria [22, 32]. Owing to its usefulness and good performance, MIO-based subset selection has been extended to applications such as logistic regression [8, 34], sequential logit models [35], support vector machines [27], cluster analysis [5], and classification trees [6].

To avoid multicollinearity in MIO-based subset selection, Bertsimas and King [7] suggested the use of a cutting plane algorithm, which iteratively adds valid inequalities to cut off sets of collinear variables. These valid inequalities can be strengthened by means of a local search algorithm [36]; however, the cutting plane algorithm must solve a series of MIO problems, each of which is NP-hard.

Meanwhile, Tamura et al. [36] devised a mixed integer semidefinite optimization (MISDO) formulation for subset selection to eliminate multicollinearity. In contrast with the cutting plane algorithm, this approach needs to solve only a single MISDO problem. In this MISDO formulation, however, only the condition number can be adopted as an indicator for detecting multicollinearity. Although VIF is better grounded in statistical theory [4, 11], to the best of our knowledge, none of the existing studies have developed a computationally tractable MIO formulation for eliminating multicollinearity on the basis of VIF.

The purpose of this paper is to devise mixed integer quadratic optimization (MIQO) formulations that can be used to select the best subset of explanatory variables subject to the upper bounds on the VIFs of selected variables. Our two MIQO formulations are based on two equivalent definitions of VIF. The effectiveness of our MIQO formulations is assessed through computational experiments using several datasets from the UCI Machine Learning Repository [25].

The main contributions of the present paper are as follows.

  • We obtain computationally tractable MIQO formulations for best subset selection under upper-bound constraints on VIF. Although the cutting plane algorithm was previously the only way to solve this subset selection problem exactly, we have successfully reformulated the problem as a convex MIQO problem, which can be handled by standard MIO software.

  • Our MIQO formulations are capable of verifying the optimality of the selected subset of variables. We verify through computational experiments that when the number of candidate explanatory variables is less than 30, our MIQO problems can be solved to optimality within a few tens of seconds.

  • The proposed MIQO formulations provide solutions of good quality even if the computation is terminated before verifying optimality. The computational results demonstrate that even when the number of candidate explanatory variables is more than 30, our MIQO formulations with some preprocessing can find, within a time limit of 10,000 s, better subsets of variables than those obtained using local search algorithms.

2 Multiple linear regression and variance inflation factor

Let us suppose that we are given n samples, \((y_i;x_{i1},x_{i2},\ldots ,x_{ip})\) for \(i = 1,2,\ldots ,n\). Here, \(y_i\) is a response variable and \(x_{ij}\) is the jth explanatory variable for each sample \(i=1,2,\ldots ,n\). The index set of all candidate explanatory variables is denoted by \(P := \{1,2,\ldots ,p\}\).

For simplicity of explanation, in Sects. 2 and 3 we assume that all explanatory and response variables are centered and scaled to unit length; that is,

$$\begin{aligned} \sum _{i=1}^n x_{ij} = \sum _{i=1}^n y_i = 0 \quad \text{ and } \quad \sum _{i=1}^n (x_{ij})^2 = \sum _{i=1}^n (y_i)^2 = 1 \end{aligned}$$
(1)

for all \(j \in P\). The multiple linear regression model is then formulated as follows:

$$\begin{aligned} {\varvec{y}} = {\varvec{X}} {\varvec{a}} + {\varvec{\varepsilon }}, \end{aligned}$$

where \({\varvec{y}} := (y_1,y_2,\ldots ,y_n)^{\top }\), \({\varvec{a}}:=(a_1,a_2,\ldots ,a_p)^{\top }\), \({\varvec{\varepsilon }} := (\varepsilon _1,\varepsilon _2,\ldots ,\varepsilon _n)^{\top }\), and

$$\begin{aligned} {\varvec{X}} := ({\varvec{x}}_1,{\varvec{x}}_2,\ldots ,{\varvec{x}}_p) = \begin{pmatrix} x_{11} &{} x_{12} &{} \cdots &{} x_{1p} \\ x_{21} &{} x_{22} &{} \cdots &{} x_{2p} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ x_{n1} &{} x_{n2} &{} \cdots &{} x_{np} \end{pmatrix}, \end{aligned}$$

with \({\varvec{x}}_j := (x_{1j},x_{2j},\ldots ,x_{nj})^{\top }\) for \(j \in P\).

Here, \({\varvec{a}}\) is a vector of regression coefficients to be estimated, and \({\varvec{\varepsilon }}\) is a vector composed of a prediction residual for each sample \(i=1,2,\ldots ,n\).

In what follows, we consider selecting a subset \(S \subseteq P\) of explanatory variables to mitigate the negative influence of multicollinearity on regression estimates. On account of assumption (1), the correlation matrix of selected variables is calculated as

$$\begin{aligned} {\varvec{R}}_S := (r_{j\ell })_{(j,\ell ) \in S \times S} = {\varvec{X}}_{S}^{\top } {\varvec{X}}_S, \end{aligned}$$

where \({\varvec{X}}_{S} := ({\varvec{x}}_j)_{j \in S}\) is the submatrix of \({\varvec{X}}\) corresponding to the set S.

The variance inflation factor, VIF, for detecting multicollinearity is defined for each \(\ell \in S\). Specifically, VIF of the \(\ell \)th explanatory variable is defined as the \(\ell \)th diagonal entry of the inverse of \({\varvec{R}}_S\), that is,

$$\begin{aligned} \mathrm{VIF}(\ell ,S) := [{\varvec{R}}_S^{-1}]_{\ell \ell }. \end{aligned}$$
(2)

When some of the selected variables are highly correlated, \({\varvec{R}}_S\) is close to singular, and the corresponding VIF value is very large. Hence, the following upper-bound constraints should be imposed on the set S:

$$\begin{aligned} \mathrm{VIF}(\ell ,S) \le \alpha \quad (\ell \in S), \end{aligned}$$
(3)

where \(\alpha \) is a user-defined parameter larger than one.

On the other hand, VIF has another easily interpretable definition. To describe this, we consider a linear regression model that explains the relationship between the \(\ell \)th explanatory variable and other variables in the set S,

$$\begin{aligned} {\varvec{x}}_{\ell } = {\varvec{X}}_{S\setminus \{\ell \}} {\varvec{a}}^{(\ell ,S)} + {\varvec{\varepsilon }}^{(\ell ,S)}, \end{aligned}$$
(4)

where \({\varvec{a}}^{(\ell ,S)} \in \mathbb {R}^{|S|-1}\) and \({\varvec{\varepsilon }}^{(\ell ,S)} \in \mathbb {R}^n\) are vectors of regression coefficients and residuals.

To estimate the regression coefficients, \({\varvec{a}}^{(\ell ,S)}\), the ordinary least squares (OLS) method minimizes the residual sum of squares (RSS),

$$\begin{aligned} \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{S\setminus \{\ell \}} {\varvec{a}}^{(\ell ,S)}\Vert ^2_2 = ({\varvec{x}}_{\ell } - {\varvec{X}}_{S\setminus \{\ell \}} {\varvec{a}}^{(\ell ,S)})^{\top } ({\varvec{x}}_{\ell } - {\varvec{X}}_{S\setminus \{\ell \}} {\varvec{a}}^{(\ell ,S)}). \end{aligned}$$
(5)

This is equivalent to solving the well-known normal equation:

$$\begin{aligned} {\varvec{X}}_{S\setminus \{\ell \}}^{\top } {\varvec{X}}_{S\setminus \{\ell \}} \hat{{\varvec{a}}}^{(\ell ,S)} = {\varvec{X}}_{S\setminus \{\ell \}}^{\top } {\varvec{x}}_{\ell }, \end{aligned}$$
(6)

where \(\hat{{\varvec{a}}}^{(\ell ,S)}\) is called the OLS estimator.

The goodness-of-fit of regression model (4) is measured by the coefficient of determination. Due to assumption (1), it is calculated based on the OLS estimator as follows:

$$\begin{aligned} R^2(\ell ,S) := 1 - \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{S\setminus \{\ell \}} \hat{{\varvec{a}}}^{(\ell ,S)}\Vert ^2_2. \end{aligned}$$

When \(R^2(\ell ,S)\) is close to one, the \(\ell \)th explanatory variable has a strong linear relationship with other variables in the set S. It is known that VIF of the \(\ell \)th explanatory variable can also be defined as follows [4, 12]:

$$\begin{aligned} \mathrm{VIF}(\ell ,S) := \frac{1}{1 - R^2(\ell ,S)} = \frac{1}{\Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{S\setminus \{\ell \}} \hat{{\varvec{a}}}^{(\ell ,S)}\Vert ^2_2}. \end{aligned}$$
(7)
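
To illustrate the equivalence of definitions (2) and (7), the following is a minimal Python/NumPy sketch that computes both quantities on synthetic data satisfying the scaling (1); the data and variable names are purely illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.standard_normal(50)  # make columns 0 and 3 nearly collinear
X -= X.mean(axis=0)                                 # centering as in (1)
X /= np.linalg.norm(X, axis=0)                      # unit-length columns as in (1)

S = [0, 1, 2, 3]
R_inv = np.linalg.inv(X[:, S].T @ X[:, S])          # inverse of the correlation matrix R_S

for pos, l in enumerate(S):
    vif_by_inverse = R_inv[pos, pos]                # definition (2)
    others = [j for j in S if j != l]
    a_hat, *_ = np.linalg.lstsq(X[:, others], X[:, l], rcond=None)
    rss = float(np.sum((X[:, l] - X[:, others] @ a_hat) ** 2))
    vif_by_r_squared = 1.0 / rss                    # definition (7), since R^2(l, S) = 1 - RSS here
    print(l, round(vif_by_inverse, 6), round(vif_by_r_squared, 6))  # the two values agree
```

On such data, the two nearly collinear columns should produce large VIF values, while the remaining columns stay close to one.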

Here we briefly explain an advantage of using VIF over the condition number as an indicator of multicollinearity. To eliminate multicollinearity, we select explanatory variables and construct a model in which VIF or the condition number does not exceed a specified upper bound; 10 and 225, respectively, are frequently used as upper bounds for VIF and the condition number [11]. From a modeling perspective, VIF allows us to directly control the upper bound on the degree of multicollinearity. For example, when we need to tighten the upper bound of \(R^2(\ell ,S)\) from 0.9 to 0.8, we change the upper bound of VIF\((\ell ,S)\) from \(1/(1-0.9)=10\) to \(1/(1-0.8)=5\) according to definition (7). Such direct control is not available when using the condition number.

3 Mixed integer quadratic optimization formulations

In this section, we consider minimizing RSS of a subset regression model under the upper-bound constraints (3) on VIFs. Let \({\varvec{z}}:=(z_1,z_2,\ldots ,z_p)^{\top }\) be a vector of 0–1 decision variables for subset selection. Accordingly, \(S({\varvec{z}}) := \{j \in P \mid z_j = 1\}\) is a selected subset of explanatory variables. The subset selection problem for eliminating multicollinearity based on VIF is posed as an MIO problem:

$$\begin{aligned} \mathop {\text{ minimize }}_{}&\quad \Vert {\varvec{y}} - {\varvec{X}} {\varvec{a}}\Vert ^2_2 \end{aligned}$$
(8)
$$\begin{aligned} \text{ subject } \text{ to }&\quad z_j = 0~~\Rightarrow ~~a_j = 0 \quad (j \in P), \end{aligned}$$
(9)
$$\begin{aligned}&\quad z_{\ell } = 1~~\Rightarrow ~~\mathrm{VIF}(\ell ,S({\varvec{z}})) \le \alpha \quad (\ell \in P), \end{aligned}$$
(10)
$$\begin{aligned}&\quad {\varvec{a}} \in \mathbb {R}^p,~{\varvec{z}} \in \{0,1\}^p. \end{aligned}$$
(11)

If \(z_j = 0\), then the jth explanatory variable is deleted from the regression model because its coefficient is set to zero by the logical implications (9). The VIF constraints (3) are imposed in the form of logical implications (10). It is known that these logical implications can be represented by using a big-M method or a special ordered set type 1 (SOS 1) constraint [2, 3].
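
As a brief illustration of these standard encodings, the following gurobipy sketch represents the single implication \(z_j = 0~\Rightarrow~a_j = 0\) in three ways: an indicator constraint, a big-M pair, and an SOS 1 constraint. The solver choice and the bound M are assumptions made only for this example.

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("implication-demo")
z = m.addVar(vtype=GRB.BINARY, name="z")
a = m.addVar(lb=-GRB.INFINITY, name="a")

# (a) Indicator constraint: a = 0 is enforced whenever z = 0.
m.addGenConstrIndicator(z, 0, a == 0.0)

# (b) Big-M: valid only if a bound |a| <= M is known in advance.
M = 100.0  # assumed a priori bound; problem dependent
m.addConstr(a <= M * z)
m.addConstr(-M * z <= a)

# (c) SOS type 1: at most one of {a, w} may be nonzero, where w = 1 - z,
#     so z = 0 forces w = 1 and hence a = 0.
w = m.addVar(lb=0.0, ub=1.0, name="w")
m.addConstr(w == 1 - z)
m.addSOS(GRB.SOS_TYPE1, [a, w], [1, 2])
```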

However, even apart from the logical implications, constraints (10) remain difficult to handle in an MIO formulation. In the following two subsections, we derive two tractable formulations of these VIF constraints.

3.1 Normal-equation-based formulation

We first propose an MIQO formulation based on the definition (7) of VIF. Let us introduce a vector of decision variables, \({\varvec{a}}^{(\ell )} := (a^{(\ell )}_j)_{j \in P \setminus \{\ell \}} \in \mathbb {R}^{p-1}~(\ell \in P)\). To convert the VIF constraints (10) into a set of linear constraints, we exploit the normal-equation-based constraints proposed by Tamura et al. [36]:

$$\begin{aligned}&z_j = 1~~\Rightarrow ~~{\varvec{x}}_j^{\top } {\varvec{X}}_{P \setminus \{\ell \}} {\varvec{a}}^{(\ell )} = {\varvec{x}}_j^{\top } {\varvec{x}}_{\ell } \quad (j \in P \setminus \{\ell \}), \end{aligned}$$
(12)
$$\begin{aligned}&z_j = 0~~\Rightarrow ~~a^{(\ell )}_j = 0 \quad (j \in P \setminus \{\ell \}). \end{aligned}$$
(13)

Theorem 1

Suppose that \(({\varvec{a}}^{(\ell )},{\varvec{z}}) \in \mathbb {R}^{p-1} \times \{0,1\}^p\) satisfies constraints (12)–(13). Then, we have

$$\begin{aligned} \mathrm{VIF}(\ell ,S({\varvec{z}})) = \frac{1}{\Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{P \setminus \{\ell \}} {\varvec{a}}^{(\ell )}\Vert ^2_2} \quad (\ell \in S({\varvec{z}})). \end{aligned}$$

Proof

Let s be the number of nonzero elements of \({\varvec{z}}\). Without loss of generality, we may assume that \(S({\varvec{z}}) = \{1,2,\ldots ,s\}\). According to \(S({\varvec{z}})\), we partition \({\varvec{a}}^{(\ell )}\) as

$$\begin{aligned} {\varvec{a}}^{(\ell )} = \begin{pmatrix} {\varvec{a}}^{(\ell )}_1 \\ {\varvec{a}}^{(\ell )}_2 \end{pmatrix}, \quad {\varvec{a}}^{(\ell )}_1 := (a^{(\ell )}_j)_{j \in S({\varvec{z}}) \setminus \{\ell \}}, \quad {\varvec{a}}^{(\ell )}_2 := (a^{(\ell )}_j)_{j \in P \setminus S({\varvec{z}})}, \end{aligned}$$

where \({\varvec{a}}^{(\ell )}_2 = {\varvec{0}}\) due to constraints (13). Therefore constraints (12) correspond to the normal equation (6) for \(S = S({\varvec{z}})\); that is,

$$\begin{aligned} {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}}^{\top } {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}} {\varvec{a}}^{(\ell )}_1 = {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}}^{\top } {\varvec{x}}_{\ell }. \end{aligned}$$
(14)

Since the OLS estimator provides the minimum value of RSS (5), it follows that

$$\begin{aligned} \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}} \hat{{\varvec{a}}}^{(\ell ,S({\varvec{z}}))}\Vert ^2_2 = \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}} {\varvec{a}}^{(\ell )}_1\Vert ^2_2 = \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{P \setminus \{\ell \}} {\varvec{a}}^{(\ell )}\Vert ^2_2. \end{aligned}$$

Hence, the definition (7) of VIF completes the proof. \(\square \)

The VIF constraints (10) can be rewritten by Theorem 1 as follows:

$$\begin{aligned} z_{\ell } \le \alpha \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{P \setminus \{\ell \}} {\varvec{a}}^{(\ell )}\Vert ^2_2 \quad (\ell \in P). \end{aligned}$$

However, these are reverse convex constraints, which are very difficult to handle in an MIO formulation. To resolve this reverse-convexity, we exploit the normal equation (14) as follows:

$$\begin{aligned}&\Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{P\setminus \{\ell \}} {\varvec{a}}^{(\ell )} \Vert ^2_2 \nonumber \\&\quad = {\varvec{x}}_{\ell }^{\top } {\varvec{x}}_{\ell } - 2{\varvec{x}}_{\ell }^{\top } {\varvec{X}}_{P\setminus \{\ell \}} {\varvec{a}}^{(\ell )} + ({\varvec{a}}^{(\ell )}_1)^{\top } {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}}^{\top } {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}} {\varvec{a}}^{(\ell )}_1 \nonumber \\&\quad ={\varvec{x}}_{\ell }^{\top } {\varvec{x}}_{\ell } - 2{\varvec{x}}_{\ell }^{\top } {\varvec{X}}_{P\setminus \{\ell \}} {\varvec{a}}^{(\ell )} + ({\varvec{a}}^{(\ell )}_1)^{\top } {\varvec{X}}_{S({\varvec{z}})\setminus \{\ell \}}^{\top } {\varvec{x}}_{\ell } \nonumber \\&\quad = {\varvec{x}}_{\ell }^{\top } {\varvec{x}}_{\ell } - {\varvec{x}}_{\ell }^{\top } {\varvec{X}}_{P\setminus \{\ell \}} {\varvec{a}}^{(\ell )}. \end{aligned}$$
(15)

Consequently, the subset selection problem (8)–(11) can be formulated as an MIQO problem, which we call the normal-equation-based formulation:

$$\begin{aligned} \mathop {\text{ minimize }}_{}&\quad \Vert {\varvec{y}} - {\varvec{X}} {\varvec{a}}\Vert ^2_2 \end{aligned}$$
(16)
$$\begin{aligned} \text{ subject } \text{ to }&\quad z_j = 0~~\Rightarrow ~~a_j = 0 \quad (j \in P), \end{aligned}$$
(17)
$$\begin{aligned}&\quad z_{\ell } \le \alpha ({\varvec{x}}_{\ell }^{\top } {\varvec{x}}_{\ell } - {\varvec{x}}_{\ell }^{\top } {\varvec{X}}_{P\setminus \{\ell \}} {\varvec{a}}^{(\ell )}) \quad (\ell \in P), \end{aligned}$$
(18)
$$\begin{aligned}&\quad z_j = 1~~\Rightarrow ~~{\varvec{x}}_j^{\top } {\varvec{X}}_{P \setminus \{\ell \}} {\varvec{a}}^{(\ell )} = {\varvec{x}}_j^{\top } {\varvec{x}}_{\ell } \quad (\ell \in P,~j \in P \setminus \{\ell \}), \end{aligned}$$
(19)
$$\begin{aligned}&\quad z_j = 0~~\Rightarrow ~~a^{(\ell )}_j = 0 \quad (\ell \in P,~j \in P \setminus \{\ell \}), \end{aligned}$$
(20)
$$\begin{aligned}&\quad {\varvec{a}} \in \mathbb {R}^p,\ {\varvec{z}} \in \{0,1\}^p, \ {\varvec{a}}^{(\ell )} \in \mathbb {R}^{p-1} \quad (\ell \in P). \end{aligned}$$
(21)
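
To make the formulation concrete, here is a minimal gurobipy sketch of problem (16)–(21). The paper's experiments used CPLEX with indicator constraints, so the solver, the function name solve_nef, and the auxiliary variables b (standing for \({\varvec{a}}^{(\ell )}\)) are assumptions introduced only for illustration; the columns of X are assumed to be centered and scaled as in (1).

```python
import gurobipy as gp
from gurobipy import GRB

def solve_nef(X, y, alpha=10.0, time_limit=10000):
    """Normal-equation-based formulation (16)-(21); X is an (n, p) NumPy
    array with centered, unit-length columns, y is an (n,) response."""
    n, p = X.shape
    G = (X.T @ X).tolist()            # Gram matrix; equals R_P under (1)

    m = gp.Model("nef")
    m.Params.TimeLimit = time_limit

    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    a = m.addVars(p, lb=-GRB.INFINITY, name="a")
    # b[l, j] plays the role of a^(l)_j for j != l; b[l, l] is unused.
    b = m.addVars(p, p, lb=-GRB.INFINITY, name="b")

    # Objective (16): ||y - X a||_2^2 as a convex quadratic expression.
    obj = gp.QuadExpr()
    for i in range(n):
        r = float(y[i]) - gp.quicksum(float(X[i, j]) * a[j] for j in range(p))
        obj += r * r
    m.setObjective(obj, GRB.MINIMIZE)

    for j in range(p):
        # (17): z_j = 0  =>  a_j = 0.
        m.addGenConstrIndicator(z[j], 0, a[j] == 0.0)

    for l in range(p):
        others = [j for j in range(p) if j != l]
        # (18): z_l <= alpha (x_l'x_l - x_l' X_{P\{l}} a^(l)).
        m.addConstr(z[l] <= alpha * (G[l][l]
                    - gp.quicksum(G[l][j] * b[l, j] for j in others)))
        for j in others:
            # (19): z_j = 1  =>  x_j' X_{P\{l}} a^(l) = x_j' x_l.
            m.addGenConstrIndicator(
                z[j], 1,
                gp.quicksum(G[j][k] * b[l, k] for k in others) == G[j][l])
            # (20): z_j = 0  =>  a^(l)_j = 0.
            m.addGenConstrIndicator(z[j], 0, b[l, j] == 0.0)

    m.optimize()
    if m.SolCount == 0:
        return []                     # no feasible solution found in time
    return [j for j in range(p) if z[j].X > 0.5]
```

Note that this sketch introduces on the order of \(p^2\) auxiliary continuous variables and indicator constraints, so the model size grows quickly with the number of candidate variables.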

3.2 Inverse-matrix-based formulation

We next propose another MIQO formulation based on the definition (2) of VIF. Let us introduce square matrices of decision variables, \({\varvec{Q}} := (q_{\ell j})_{(\ell ,j) \in P \times P}\) and \({\varvec{U}} := (u_{\ell j})_{(\ell ,j) \in P \times P}\). To compute the inverse of the correlation matrix \({\varvec{R}}_{S({\varvec{z}})}\), we make use of the following constraints:

$$\begin{aligned}&{\varvec{Q}} {\varvec{R}}_P + {\varvec{U}} = {\varvec{I}}_p, \end{aligned}$$
(22)
$$\begin{aligned}&z_j =1~~\Rightarrow ~~u_{\ell j} = 0 \quad (\ell \in P,~j \in P), \end{aligned}$$
(23)
$$\begin{aligned}&z_j = 0~~\Rightarrow ~~q_{\ell j} = q_{j \ell } = 0 \quad (\ell \in P,~j \in P), \end{aligned}$$
(24)

where \({\varvec{I}}_p\) is the identity matrix of size p. To aid understanding of constraints (22), we consider the case in which all candidate explanatory variables are selected (i.e., \(z_j = 1~(j \in P)\)). In this case, \({\varvec{U}}\) becomes the zero matrix because of constraints (23). Consequently, we have \({\varvec{Q}} {\varvec{R}}_P = {\varvec{I}}_p\); that is, \({\varvec{Q}}\) is the inverse of the correlation matrix of all candidate explanatory variables.

Theorem 2

Suppose that \(({\varvec{Q}},{\varvec{U}},{\varvec{z}}) \in \mathbb {R}^{p \times p} \times \mathbb {R}^{p \times p} \times \{0,1\}^p\) satisfies constraints (22)–(24). Then, we have \(q_{\ell \ell } = [{\varvec{R}}_{S({\varvec{z}})}^{-1}]_{\ell \ell }\) for \(\ell \in S({\varvec{z}})\), and \(q_{\ell \ell } = 0\) for \(\ell \not \in S({\varvec{z}})\).

Proof

Similarly to Theorem 1, we may assume without loss of generality that \(S({\varvec{z}}) = \{1,2,\ldots ,s\}\). We partition \({\varvec{Q}},{\varvec{R}}_P,{\varvec{U}}\) and \({\varvec{I}}_p\) according to \(S({\varvec{z}})\) and rewrite constraints (22) as follows:

$$\begin{aligned} \begin{pmatrix} {\varvec{Q}}_1 &{} {\varvec{Q}}_2 \\ {\varvec{Q}}_3 &{} {\varvec{Q}}_4 \end{pmatrix} \begin{pmatrix} {\varvec{R}}_1 &{} {\varvec{R}}_2 \\ {\varvec{R}}_3 &{} {\varvec{R}}_4 \end{pmatrix} + \begin{pmatrix} {\varvec{U}}_1 &{} {\varvec{U}}_2 \\ {\varvec{U}}_3 &{} {\varvec{U}}_4 \end{pmatrix} = \begin{pmatrix} {\varvec{I}}_s &{} {\varvec{O}} \\ {\varvec{O}}^{\top } &{} {\varvec{I}}_{p-s} \end{pmatrix}, \end{aligned}$$

where \({\varvec{Q}}_1,{\varvec{R}}_1,{\varvec{U}}_1 \in \mathbb {R}^{s \times s}\), \({\varvec{Q}}_2,{\varvec{R}}_2,{\varvec{U}}_2 \in \mathbb {R}^{s \times (p-s)}\), \({\varvec{Q}}_3,{\varvec{R}}_3,{\varvec{U}}_3 \in \mathbb {R}^{(p-s) \times s}\), \({\varvec{Q}}_4,{\varvec{R}}_4,{\varvec{U}}_4 \in \mathbb {R}^{(p-s) \times (p-s)}\), \({\varvec{I}}_s\) and \({\varvec{I}}_{p-s}\) are identity matrices of sizes s and \(p-s\), and \({\varvec{O}}\) is the zero matrix of size \(s \times (p-s)\). Note here that \({\varvec{Q}}_2,{\varvec{Q}}_3,{\varvec{Q}}_4, {\varvec{U}}_1\) and \({\varvec{U}}_3\) become zero matrices due to constraints (23)–(24). As a result, the above constraints are reduced to

$$\begin{aligned} {\varvec{Q}}_1 {\varvec{R}}_1 = {\varvec{I}}_s, \end{aligned}$$
(25)

while the other constraints are satisfied through the free decision variables in \({\varvec{U}}_2\) and \({\varvec{U}}_4\). Since \({\varvec{R}}_1 = {\varvec{R}}_{S({\varvec{z}})}\), it follows that \({\varvec{Q}}_1 = {\varvec{R}}_{S({\varvec{z}})}^{-1}\), which completes the proof. \(\square \)

Using the definition (2), the subset selection problem (8)–(11) is reformulated as an MIQO problem, which we call the inverse-matrix-based formulation:

$$\begin{aligned} \mathop {\text{ minimize }}_{}&\quad \Vert {\varvec{y}} - {\varvec{X}} {\varvec{a}}\Vert ^2_2 \end{aligned}$$
(26)
$$\begin{aligned} \text{ subject } \text{ to }&\quad z_j = 0~~\Rightarrow ~~a_j = 0 \quad (j \in P), \end{aligned}$$
(27)
$$\begin{aligned}&\quad q_{\ell \ell } \le \alpha \quad (\ell \in P), \end{aligned}$$
(28)
$$\begin{aligned}&\quad {\varvec{Q}} {\varvec{R}}_P + {\varvec{U}} = {\varvec{I}}_p, \end{aligned}$$
(29)
$$\begin{aligned}&\quad {\varvec{Q}} = {\varvec{Q}}^{\top }, \end{aligned}$$
(30)
$$\begin{aligned}&\quad z_j =1~~\Rightarrow ~~u_{\ell j} = 0 \quad (\ell \in P,~j \in P), \end{aligned}$$
(31)
$$\begin{aligned}&\quad z_j = 0~~\Rightarrow ~~q_{\ell j} = 0 \quad (\ell \in P,~j \in P), \end{aligned}$$
(32)
$$\begin{aligned}&\quad {\varvec{a}} \in \mathbb {R}^p,~{\varvec{Q}} \in \mathbb {R}^{p \times p},~{\varvec{U}} \in \mathbb {R}^{p \times p},~{\varvec{z}} \in \{0,1\}^p. \end{aligned}$$
(33)

Note that since the correlation matrix is symmetric, its inverse is also symmetric; thus, constraint (30) is redundant. However, we add this constraint explicitly because it improves computational speed.
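
Analogously, a minimal gurobipy sketch of the inverse-matrix-based formulation (26)–(33) is given below; again, the solver and the identifiers (solve_imf, Q, U) are illustrative assumptions, and the objective is the same RSS expression as in the previous sketch.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_imf(X, y, alpha=10.0, time_limit=10000):
    """Inverse-matrix-based formulation (26)-(33); X as in (1)."""
    n, p = X.shape
    R = (X.T @ X).tolist()            # correlation matrix R_P under (1)

    m = gp.Model("imf")
    m.Params.TimeLimit = time_limit

    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    a = m.addVars(p, lb=-GRB.INFINITY, name="a")
    Q = m.addVars(p, p, lb=-GRB.INFINITY, name="Q")
    U = m.addVars(p, p, lb=-GRB.INFINITY, name="U")

    # Objective (26): residual sum of squares, as in solve_nef above.
    obj = gp.QuadExpr()
    for i in range(n):
        r = float(y[i]) - gp.quicksum(float(X[i, j]) * a[j] for j in range(p))
        obj += r * r
    m.setObjective(obj, GRB.MINIMIZE)

    for j in range(p):
        m.addGenConstrIndicator(z[j], 0, a[j] == 0.0)          # (27)
        m.addConstr(Q[j, j] <= alpha)                          # (28)

    for l in range(p):
        for j in range(p):
            # (29): entry (l, j) of Q R_P + U = I_p.
            m.addConstr(gp.quicksum(Q[l, k] * R[k][j] for k in range(p))
                        + U[l, j] == (1.0 if l == j else 0.0))
            if l < j:
                m.addConstr(Q[l, j] == Q[j, l])                # (30)
            m.addGenConstrIndicator(z[j], 1, U[l, j] == 0.0)   # (31)
            m.addGenConstrIndicator(z[j], 0, Q[l, j] == 0.0)   # (32)

    m.optimize()
    if m.SolCount == 0:
        return []
    return [j for j in range(p) if z[j].X > 0.5]
```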

3.3 Preprocessing for faster computation

In this subsection, we propose four ideas, (i)–(iv) below, for speeding up the MIQO computation. In our preliminary experiments, however, preprocessing steps (iii) and (iv) did not shorten the computation time; hence, we evaluate the efficiency of preprocessing steps (i) and (ii) in the next section.

Preprocessing (i): Deleting redundant VIF constraints

The definition (7) of VIF implies that \(\mathrm{VIF}(\ell ,S) \le \mathrm{VIF}(\ell ,P)\) for all \(S \subseteq P\). Therefore, the VIF constraints for \(\ell \in P_0\) can be deleted, with

$$\begin{aligned} P_0 := \{\ell \in P \mid \mathrm{VIF}(\ell ,P) \le \alpha \}. \end{aligned}$$
(34)
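
A minimal NumPy sketch of this preprocessing step is given below; the function name and interface are illustrative only, and the columns of X are assumed to be scaled as in (1).

```python
import numpy as np

def redundant_vif_constraints(X, alpha):
    """Return the set P_0 of (34): indices l with VIF(l, P) <= alpha,
    whose VIF constraints can be dropped in advance."""
    R_inv = np.linalg.inv(X.T @ X)    # inverse correlation matrix under (1)
    vif_full = np.diag(R_inv)         # VIF(l, P) by definition (2)
    return [l for l in range(X.shape[1]) if vif_full[l] <= alpha]
```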

Preprocessing (ii): Adding cutting-plane-based constraints

This step works as follows. First, find subsets \(S_k \subseteq P~(k \in K)\) of collinear variables such that \(\mathrm{VIF}(\ell ,S_k) > \alpha \) for some \(\ell \in S_k\). Next, cut these subsets off by means of the following cutting-plane-based constraints [7, 36]:

$$\begin{aligned} \sum _{j \in S_k} z_j \le |S_k| - 1 \quad (k \in K). \end{aligned}$$
(35)
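
One plausible way to generate the subsets \(S_k\), in line with the forward-selection-based search described in Sect. 4, is sketched below in NumPy; the exact procedure used in the paper may differ, and the helper names are hypothetical.

```python
import numpy as np

def collinear_subsets(X, alpha):
    """For each l, grow regressors for x_l by forward selection on model (4)
    and record the first subset whose VIF(l, .) exceeds alpha."""
    n, p = X.shape

    def rss(cols, l):
        a_hat, *_ = np.linalg.lstsq(X[:, cols], X[:, l], rcond=None)
        return float(np.sum((X[:, l] - X[:, cols] @ a_hat) ** 2))

    cuts = []
    for l in range(p):
        chosen, candidates = [], [j for j in range(p) if j != l]
        while candidates:
            best = min(candidates, key=lambda j: rss(chosen + [j], l))
            chosen.append(best)
            candidates.remove(best)
            # By (7), VIF(l, chosen plus l) = 1/RSS > alpha  iff  RSS < 1/alpha.
            if rss(chosen, l) < 1.0 / alpha:
                cuts.append(sorted(chosen + [l]))   # a collinear subset S_k
                break
    return cuts
```

Each returned subset \(S_k\) then yields one constraint (35), which forbids selecting all of its variables simultaneously.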

Preprocessing (iii): Tightening constraints

Constraints (18) can be tightened by using the minimum RSS (5) of \(S = P\) as follows:

$$\begin{aligned} z_{\ell } + (1 - z_{\ell }) \alpha \Vert {\varvec{x}}_{\ell } - {\varvec{X}}_{P\setminus \{\ell \}} \hat{{\varvec{a}}}^{(\ell ,P)}\Vert ^2_2 \le \alpha ({\varvec{x}}_{\ell }^{\top } {\varvec{x}}_{\ell } - {\varvec{x}}_{\ell }^{\top } {\varvec{X}}_{P\setminus \{\ell \}} {\varvec{a}}^{(\ell )}) \quad (\ell \in P). \end{aligned}$$

Preprocessing (iv): Linearization of the objective function

The objective function can be linearized by applying the transformation (15) to \(\Vert {\varvec{y}} - {\varvec{X}} {\varvec{a}}\Vert ^2_2\), which changes the proposed MIQO formulations into mixed integer linear optimization formulations.

Table 1 List of instances

4 Computational results

This section evaluates the computational performance of our MIQO formulations for subset selection to eliminate multicollinearity as characterized by VIF.

We downloaded six datasets for regression analysis from the UCI Machine Learning Repository [25]. Table 1 lists the instances used for the computational experiments, where n and p are the numbers of samples and candidate explanatory variables, respectively. In the SolarFlareC instance, the production of C-class flares was employed as the response variable. Categorical variables were encoded into sets of dummy variables. Samples containing missing values were removed, and then redundant variables (i.e., those with the same value in all samples) were also removed.

We compare the computational performance of the following subset selection algorithms.

  1. FwS:

    Forward selection method: Starts with \(S = \emptyset \) and iteratively adds the jth variable (i.e., \(S \leftarrow S\cup \{j\}\)) that decreases RSS the most; this operation is repeated while the VIF constraints (3) remain satisfied. A minimal sketch of this heuristic is given after this list.

  2. BwE:

    Backward elimination method: Starts with \(S = P\) and iteratively eliminates the jth variable (i.e., \(S \leftarrow S\setminus \{j\}\)) whose removal increases RSS the least; this operation is repeated until the VIF constraints (3) are satisfied.

  3. CPA:

    Cutting plane algorithm [7, 36]: Solves the MIQO problem (8)–(9) and (11), and adds a valid inequality (cf. (35)) to the problem to cut off subsets of collinear variables; this operation is repeated until the VIF constraints (3) hold.

  4. CPA*:

    Cutting plane algorithm in which valid inequalities are strengthened by using a backward elimination method [36].

  5. NEF:

    Normal-equation-based MIQO formulation (16)–(21).

  6. NEF+:

    Normal-equation-based MIQO formulation (16)–(21) with the preprocessing (i) and (ii).

  7. IMF:

    Inverse-matrix-based MIQO formulation (26)–(33).

  8. IMF+:

    Inverse-matrix-based MIQO formulation (26)–(33) with the preprocessing (i) and (ii).
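
As referenced in item 1 above, the following NumPy sketch illustrates one way to implement the FwS baseline under the VIF constraints (3); BwE is the symmetric elimination variant. The stopping rule (adding only variables that keep every VIF within the bound) is one plausible reading of the description, and the function names are illustrative.

```python
import numpy as np

def max_vif(X, S):
    """Largest VIF among the variables in S, by definition (2)."""
    R_inv = np.linalg.inv(X[:, S].T @ X[:, S])
    return float(np.max(np.diag(R_inv)))

def rss(X, y, S):
    """Residual sum of squares of the regression of y on X_S."""
    if not S:
        return float(y @ y)
    a_hat, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    return float(np.sum((y - X[:, S] @ a_hat) ** 2))

def forward_selection(X, y, alpha):
    """FwS: greedily add the variable that decreases RSS the most,
    as long as every VIF(l, S) stays within the bound alpha."""
    p = X.shape[1]
    S = []
    while True:
        feasible = [j for j in range(p) if j not in S
                    and max_vif(X, S + [j]) <= alpha]
        if not feasible:
            return S
        S.append(min(feasible, key=lambda j: rss(X, y, S + [j])))
```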

These computations were performed on a Windows 7 PC with an Intel Core i7-4770 CPU (3.40 GHz) and 8 GB memory. The algorithms FwS and BwE were implemented in R 3.1.1 [33]; the algorithms CPA and CPA* were implemented in Python 2.7 with Gurobi Optimizer 7.5.2 [13], where the logical implications (9) were incorporated in the form of SOS 1 constraints, as in Tamura et al. [36]. The MIQO problems (i.e., NEF, NEF+, IMF, and IMF+) were solved using IBM ILOG CPLEX 12.6.3.0 [19] with eight threads. Here, the indicator constraints implemented in CPLEX were used to impose the logical implications (17), (19), (20), (27), (31), and (32).

A subset of explanatory variables often has collinearity problems when its VIF value is greater than 5 or 10 (see [20], p. 101). Thus, we tested two values (\(\alpha = 5\) and 10) of the upper bound on VIF to evaluate the effect on computational results.

The algorithms NEF+ and IMF+ involved preprocessing steps (i) and (ii), as explained in Sect. 3.3. Specifically, the VIF constraints for the set (34) were deleted in advance, and the cutting-plane-based constraints (35) were included. Here, the subsets \(S_k~(k \in K)\) of collinear variables were found by applying the algorithm FwS to the regression model (4) for each \(\ell \in P\).

Results of the preprocessing are summarized in Table 2 (\(\alpha = 5\)) and Table 3 (\(\alpha = 10\)), where \(|P_0|\) and |K| are the numbers of redundant VIF constraints and cutting-plane-based constraints, respectively. The column labeled “Time (s)” shows the computation time in seconds. We can see that preprocessing (i) required only a few seconds, but the computation time of preprocessing (ii) increased greatly with the number of candidate explanatory variables.

Table 2 Results of preprocessing (\(\alpha = 5\))
Table 3 Results of preprocessing (\(\alpha = 10\))

We evaluate the results of the subset selection algorithms on the basis of the following evaluation criteria.

  • VIF: must not exceed the specified upper bound. Note that if VIF were minimized as the objective function, the optimal solution would become trivial and meaningless, namely a subset consisting of only one explanatory variable. Therefore, VIF should not be treated as the objective function but should be kept small by means of the upper-bound constraints.

  • Computation time: lower is better. Several days of computation are rarely acceptable for subset selection because it needs to be performed repeatedly (e.g., for cross-validation) in many practical situations. In addition, we are faced with a growing number of samples and explanatory variables because of recent advances in information technology. As a result, a faster algorithm is required for subset selection.

  • \(R^2\) value: higher is better. The goodness-of-fit of regression model is quantified by the coefficient of determination, \(R^2\). Accordingly, an algorithm that produces a larger \(R^2\) value is desirable.

  • Implementation: simpler is better. If two algorithms offer a similar performance, an easily implementable one is highly preferred.

Table 4 Results of subset selection algorithms (\(\alpha = 5\))
Table 5 Results of subset selection algorithms (\(\alpha = 10\))

Tables 4 and 5 show the computational results of the subset selection algorithms for \(\alpha =5\) and 10, respectively. The column labeled “\(R^2\)” shows the value of the coefficient of determination of a subset regression model; the largest \(R^2\) values for each instance are indicated in bold. The column labeled “\(\mathrm VIF_{max}\)” shows the value of \(\max \{\mathrm{VIF}(\ell ,S) \mid \ell \in S\}\), and the column labeled “|S|” shows the number of selected explanatory variables. The computation of each subset selection algorithm was terminated if it did not finish by itself within 10,000 s. In these cases, the best feasible solution obtained within 10,000 s was taken as the result.

First, we can confirm that all solutions satisfy the constraint \(\mathrm{VIF_{max}} \le {\alpha } \), except in the cases where no feasible solution was found within 10,000 s. Note that as long as a solution satisfies the VIF constraint, the exact value of \(\mathrm VIF_{max}\) itself does not matter.

Next, we discuss the computation time and \(R^2\) values of each algorithm. Note that FwS and BwE are local search algorithms, and thus they complete the search process quickly but without certificates of optimality for the obtained solutions. Meanwhile, the MIO approaches require a substantial amount of computation time to verify the optimality of the obtained solutions. Nevertheless, most of the MIQO problems for the Servo, AutoMPG, and SolarFlareC instances were solved to optimality within a few tens of seconds. In these cases, the \(R^2\) values of our MIQO formulations were always the largest because the obtained solutions were verified to be optimal. We can also see that our preprocessing significantly reduced the time needed to solve the MIQO problems. For instance, in Table 4, IMF required a relatively long computation time for the AutoMPG and SolarFlareC instances, whereas the computation of IMF+ finished much earlier for the same instances. In the case of the BreastCancer instance, the computation was completed about 20 times faster by IMF+ than by IMF for both \(\alpha =5\) and 10.

The MIQO computations for the Automobile and Crime instances were terminated due to the time limit of 10,000 s; nevertheless, most of them found solutions of good quality within that limit. An exception is IMF, which failed to deliver a solution of good quality for the Crime instance; specifically, it found only the trivial feasible solution \(S = \emptyset \) within 10,000 s for both \(\alpha = 5\) and 10. This failure was caused by numerical instability in IMF, and it did not occur with IMF+.

The algorithm CPA took a long time to solve even small instances (e.g., AutoMPG and SolarFlareC), and it failed to find a feasible solution to the BreastCancer, Automobile, and Crime instances within the time limit of 10,000 s. The algorithm CPA* was much faster than CPA but still slower than IMF+, the fastest of the four MIQO formulations (NEF, NEF+, IMF, and IMF+).

Table 6 Number of times the best solutions were generated

Table 6 shows the number of times the best solution was generated by each algorithm. We can see that CPA* was competitive with our MIQO formulations in terms of solution quality. Here, we also take into account the simplicity of implementation of these algorithms. The implementation of CPA* is relatively complicated; MIQO problems combined with the backward elimination method must be solved repeatedly. On the other hand, our MIQO formulations enable us to pose the subset selection problem as a single MIQO problem, which can be handled using standard MIO software. From the point of view of implementation, our MIQO formulations have a crucial advantage over the cutting plane algorithm.

We conclude this section by comparing our MIQO formulations with the MISDO formulation devised by Tamura et al. [36] for subset selection under a condition number constraint. The MISDO formulation required 5563.34 s (\(\kappa = 100\)) and 336.14 s (\(\kappa = 225\)) for the AutoMPG instance, where \(\kappa \) is the upper bound on the condition number. These results suggest that our VIF-based MIQO formulations had a clear computational advantage over the condition-number-based MISDO formulation. In addition, Tamura et al. [36] could not solve many of the MISDO problems for larger instances because of numerical instability.

5 Conclusions

This paper dealt with the problem of selecting the best subset of explanatory variables under upper-bound constraints on VIFs of selected variables. For this problem, we presented two MIQO formulations (a normal-equation-based formulation and an inverse-matrix-based formulation), based on two equivalent definitions of VIF.

The main research contribution of this paper is a pair of computationally tractable MIQO formulations for eliminating multicollinearity by means of VIF. Previously, no tractable formulation of the VIF constraints was known; we have successfully rewritten them as sets of linear constraints and logical implications. As a result, the subset selection problem reduces to a single MIO problem, which can be handled using standard MIO software.

Our MIQO formulations must spend a certain amount of time to verify the optimality of the selected subset of variables. Nevertheless, it was demonstrated that when the number of candidate explanatory variables was less than 30, most of the MIQO problems were solved to optimality within a few tens of seconds. Our MIQO formulations also have the advantage of being able to find solutions of good quality in the early stages of the computation. Indeed, even when the number of candidate explanatory variables was more than 30, our MIQO formulations with preprocessing provided better solutions within the time limit than those obtained using local search algorithms. These results reveal that our method is particularly effective for small-to-medium-sized problems (e.g., \(p \le 100\)). We believe that our method has practical value because, in many cases, the number of candidate variables for regression analysis is less than one hundred.

A conventional way of avoiding multicollinearity is to remove explanatory variables one at a time according to the VIF value of each variable. Our MIQO formulations, however, have a clear advantage in terms of solution quality over such a heuristic algorithm. A solution of better quality leads to more reliable results for description purposes; hence, our MIQO formulations will be helpful in enhancing the reliability of regression analysis.

A future direction of study will be to extend our formulations to other regression and classification models. Multicollinearity has a harmful effect on most statistical models, so our formulations are expected to be useful for a variety of such models.