1 Introduction

Models are simplified representations of research phenomena that relate ideas and conclusions (Hjorth 1994). In statistics, formulating a model to answer a scientific question is usually the first step taken in an empirical study. Parametric and nonparametric models are the two major approaches to statistical modeling in machine learning. Parametric models depend on certain distributional assumptions; if those assumptions hold, they yield reliable inferences. Otherwise, nonparametric modeling is recommended.

Multivariate adaptive regression splines (MARS) is a nonparametric regression method (Friedman 1991; Hastie et al. 2001) that is widely used in biology, finance and engineering. The method has proved useful for handling complex data with nonlinear relationships among numerous variables. MARS builds models by running forward selection and backward elimination algorithms in succession. In the forward algorithm, a deliberately large model is fitted. Then, in the backward elimination, terms that do not contribute to the model are removed.

In recent years, many studies involving MARS modeling have been conducted. For example, Denison et al. (1998) provide a Bayesian algorithm for MARS. Moreover, Holmes and Denison (2003) use Bayesian MARS for classification. York et al. (2006) compare the power of least squares (LS) fitting to that of MARS with polynomials. Kriner (2007) uses the model for survival analysis. Deconinck et al. (2008) show that MARS is better at fitting nonlinearities, more robust to small changes in the data and easier to interpret than boosted regression trees. Zakeri et al. (2010) are the first in their research area to predict energy expenditure by using MARS. Lin et al. (2011) apply MARS to time series data. Lee and Wu (2012) study MARS applications in which it is used as a metamodel in the global sensitivity analysis of ordinary differential equation models. Ghasemi and Zolfonoun (2013) propose a new approach for MARS that uses principal component analysis for input selection and apply it to the determination of chemical amounts.

Owing to the power of the MARS method in modeling high-dimensional and voluminous data, several studies have been conducted to improve its capability. One of them is Conic MARS (CMARS), developed as an alternative to the backward elimination algorithm by using conic quadratic programming (CQP) (Yerlikaya 2008) and later improved to model nonlinearities better (Batmaz et al. 2010). Taylan et al. (2010) compare the performances of MARS and CMARS in classification. Later, its performance is rigorously evaluated and compared with that of MARS using various real-life and simulated data sets with different features (Weber et al. 2012). The results show that CMARS is superior to MARS in terms of accuracy, robustness and stability under different data features. Moreover, it performs better than MARS on noisy data. Nevertheless, CMARS produces models that are at least as complex as those of MARS.

CMARS has also been compared with several other methods such as classification and regression trees (CART) (Sezgin-Alp et al. 2011), infinite kernel learning (IKL) (Çelik 2010), and generalized additive models (GAMs) with CQP (Sezgin-Alp et al. 2011) for classification, and multiple linear regression (MLR) (Yerlikaya-Özkurt et al. 2014) and dynamic regression models (Özmen et al. 2011) for prediction. These studies reveal that the CMARS method performs as well as or better than the other methods considered. For detailed findings one can refer to a comprehensive review of CMARS (Yerlikaya-Özkurt et al. 2014).

A quick look at the literature shows that almost a decade has been devoted to the development and improvement of the CMARS method. All these studies lead to a powerful alternative to MARS with respect to several criteria including accuracy. Nevertheless, as stated above, the complexity of CMARS models does not compete with that of MARS. Therefore, in this study, we aim to reduce the complexity of CMARS models. In the usual parametric modeling, the statistical significance of the model parameters can be investigated by hypothesis testing or by constructing confidence intervals (CIs). Because no parametric assumptions are made regarding CMARS models, the methods of computational statistics (CS) are a plausible approach to take here.

CS is a relatively new branch of statistics that develops methodologies relying intensively on computers (Wegman 1988). Some examples include the bootstrap, CART, GAMs, nonparametric regression methods (Efron and Tibshirani 1991) and visualization techniques such as parallel coordinates, projection pursuit, and so on (Martinez and Martinez 2002). Advances in computer science have made all these methods feasible and popular, especially after the 1980s. In this study, the mathematical intractability arises from the lack of known distributions for the CMARS parameters. An empirical cumulative distribution function (CDF) is therefore fitted to each parameter by a CS method called bootstrap resampling, in which samples are drawn from the original sample with replacement (Hjorth 1994).

There are several applications of this technique for assessing the significance of parameters in a model. Efron (1988) applies the bootstrap to least absolute deviation (LAD) regression. Efron and Tibshirani (1993) employ residual resampling for a model based on least median of squares (LMS). Montgomery et al. (2006) apply residual bootstrapping to a nonlinear regression model, the Michaelis-Menten model. Fox (2002) uses random-X and fixed-X resampling for a robust regression technique based on an M-estimator with the Huber weight function. Also, Salibian-Barrera and Zamar (2002) apply bootstrapping to robust regression. Flachaire (2005) compares the pairs bootstrap with the wild bootstrap for heteroscedastic models. Austin (2008) combines the bootstrap with backward elimination, which improves estimation. Chernick (2008) uses vector resampling for a kind of nonlinear model used in aerospace engineering. Yetere-Kurşun and Batmaz (2010) compare regression methods employing different bootstrapping techniques.

In this study, to reduce the complexity of CMARS models without degrading their performance with respect to other measures, a new algorithm, called Bootstrapping CMARS (BCMARS), is developed by using three different bootstrapping regression methods, namely fixed-X resampling, random-X resampling and the wild bootstrap. Next, these algorithms are run on four data sets chosen to represent different sample sizes and scales. Then, the performances of the models developed are compared according to complexity, stability, accuracy, precision, robustness and computational efficiency.

This paper is organized as follows. In Sect. 2, MARS, CMARS, bootstrap regression and validation methods are described. The proposed approach, BCMARS, is explained in Sect. 3. In Sect. 4, applications and findings are presented. Results are discussed in Sect. 5. In the last section, conclusions and further studies are stated.

2 Methods

2.1 MARS

MARS, developed by Friedman (1991), is a nonparametric regression method in which no specific assumption is made regarding the relationship between the dependent and independent variables; it constructs a model that approximates the nonlinearity and handles the high dimensionality in the data. MARS models are built in two steps: forward and backward. In the forward step, the largest possible model is obtained. However, this large model leads to overfitting. Thus, a backward step is required to remove terms that do not contribute significantly to the model.

In general, a nonparametric regression model is defined as

$$\begin{aligned} y=f\left( {\varvec{\theta },\varvec{x}} \right) +\varepsilon , \end{aligned}$$
(1)

where \(\varvec{\theta }\) represents the unknown parameter vector; \(\varvec{x}\) is the independent variable vector; \(\varepsilon \) is the error term. In the model, \(f\left( {\varvec{\theta },\varvec{x}} \right) \) is the unknown relation function. In the MARS model, instead of the original predictor variables, transformed versions of them, called basis functions (BFs), are used to construct models; they are represented by the following equations

$$\begin{aligned} \left( {x-t} \right) _{+} =\left\{ {{\begin{array}{l} {x-t,\,\,if\,\,x>t,}\\ {0,\,\,\hbox {otherwise}.}\\ \end{array} }} \right. \quad \left( {t-x} \right) _{+} =\left\{ {{\begin{array}{l} {t-x,\,\,if\,\,x<t,} \\ {0,\,\,\hbox {otherwise}.}\\ \end{array} }} \right. \end{aligned}$$
(2)

Here, \(t\in \left\{ {x_{1,j} ,x_{2,j},\ldots ,x_{N,j} } \right\} \, (j=1,2,\ldots ,p)\) is called a knot value, and these two BFs are reflected pairs of each other. Note that \((\cdot )_{+}\) denotes the positive part of the component in (2). The multivariate spline BFs are formed as tensor products of univariate spline functions

$$\begin{aligned} B_{m} (\varvec{x}^{m})=\prod \limits _{k=1}^{K_{m} } {\left[ {s_{km} \left( {x_{km} -t_{km} } \right) } \right] _{+} } , \end{aligned}$$
(3)

where \(K_{m} \) represents the number of truncated functions in the \(m^{th}\) BF; \(x_{km} \) denotes the input variable corresponding to the \(k^{th}\) truncated linear function in the \(m^{th}\) BF; \(t_{km} \) is the corresponding knot value. Note that \(s_{km} \) takes the value +1 or -1. As a result, the MARS model is defined as

$$\begin{aligned} y=f\left( {\varvec{\theta },\varvec{x}} \right) +\varepsilon ={\theta }_{0} +\sum \limits _{m=1}^M {{\theta }_{m} B_{m} \left( {\varvec{x}^{m}} \right) } +\varepsilon , \end{aligned}$$
(4)

where each \(B_{m} \) is the \(m^{th}\) BF, and \(M\) represents the number of BFs in the final model. Given a choice for the \(B_{m} \), the coefficients (\(\theta _{m} )\) are estimated by minimizing the residual sum of squares (RSS) with the same method used in MLR, namely LS. The important point here is to determine the \(B_{m} \left( {\varvec{x}^{m}} \right) \). For this purpose, \(B_{0} \left( {\varvec{x}^{0}} \right) =1\) is taken as the starting function, and then, by considering all elements in the set of BFs as candidate functions, the one that yields the largest reduction in the RSS is added to the model. When the maximum number of terms (determined by the user) is reached, the forward step ends. After the largest model is obtained, the backward step starts to prevent overfitting. In this step, the term whose deletion causes the smallest increase in the RSS is removed first. This procedure leads to the best estimated model function, \(\hat{{f}}_{M} ,\) for each size (number of terms) \(M\). Cross validation (CV) is a possible technique for finding the optimal value of \(M\). However, generalized cross validation (GCV) is preferred by Friedman (1991) in his original work since it reduces the computational burden; it is defined as

$$\begin{aligned} GCV = \frac{1}{N}\frac{{\sum \nolimits _{{i = 1}}^{N} {\left( {y_{i}-\hat{f}_{M} \left( {\varvec{\theta },\varvec{x}_{i} } \right) } \right) ^{2} } }}{{\left( {1 - C(M)/N} \right) ^{2}}}, \end{aligned}$$
(5)

Here, the number of observations (i.e. the number of data points) is represented by \(N\); the numerator of (5) is the usual RSS; \(C(M)\) in the denominator represents the cost penalty measure of a model with \(M\) BFs. The final MARS model is the one that minimizes the GCV.
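As a compact illustration of (2), (3) and (5), the following sketch builds a tensor-product BF from truncated linear functions, fits its coefficient by LS, and evaluates the GCV of the resulting model; the data, knots and cost penalty value are hypothetical placeholders, not those of any data set used later in this study.

```python
import numpy as np

def hinge(x, t, s=+1):
    """Truncated linear function: (x - t)_+ for s = +1, (t - x)_+ for s = -1; cf. Eq. (2)."""
    return np.maximum(s * (x - t), 0.0)

def tensor_bf(X, terms):
    """Tensor-product basis function B_m(x) as in Eq. (3).
    `terms` is a list of (variable index, knot t_km, sign s_km) triples."""
    B = np.ones(X.shape[0])
    for j, t, s in terms:
        B *= hinge(X[:, j], t, s)
    return B

def gcv(y, y_hat, cost_M):
    """Generalized cross validation of Eq. (5); cost_M plays the role of C(M)."""
    N = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return (rss / N) / (1.0 - cost_M / N) ** 2

# Hypothetical usage: one BF, e.g. B_m(x) = (x_1 - 0.3)_+ (0.7 - x_2)_+, fitted by LS
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 2.0 * tensor_bf(X, [(0, 0.3, +1), (1, 0.7, -1)]) + 0.1 * rng.standard_normal(100)
D = np.column_stack([np.ones(100), tensor_bf(X, [(0, 0.3, +1), (1, 0.7, -1)])])
theta, *_ = np.linalg.lstsq(D, y, rcond=None)   # LS estimates of (theta_0, theta_m)
print(gcv(y, D @ theta, cost_M=3.0))            # C(M) = 3 is an arbitrary placeholder
```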

2.2 CMARS

CMARS, developed by Weber et al. (2012) and Yerlikaya (2008), is an alternative to the backward step of MARS. It uses the BFs generated by the forward step of MARS and applies CQP to prevent overfitting. For this purpose, a penalized RSS (PRSS) is constructed as the sum of two components, the RSS and a complexity measure, as follows

$$\begin{aligned} PRSS: = \sum \limits _{{i = 1}}^{N} \left( {y_{i} - \;f\left( {\varvec{\theta },\varvec{\tilde{x}}_{i} } \right) } \right) ^{2} + \sum \limits _{{m = 1}}^{{M_{{\max }} }} {\lambda _{m} \mathop {\mathop {\sum }\limits _{\left| \varvec{\alpha } \right| = 1}}\limits _{\varvec{\alpha } = (\alpha _{1} ,\alpha _{2} )^{T}}^{2}} {\mathop {\mathop {\sum }\limits _{r < s}}\limits _ {r,s \in V(m)} {\int \limits _{{Q^{m} }}} {\theta _{m}^{2} \left[ {D_{{r,s}}^{\varvec{\alpha }} B_{m} (\varvec{z}^{m} )} \right] ^{2} d\varvec{z}^{m} }},\nonumber \\ \end{aligned}$$
(6)

where \(\left( {\tilde{\varvec{{x}}}_{i} ,y_{i} } \right) \quad \left( {i=1,2,\ldots ,N} \right) \) represent our data points with \(p\)-dimensional predictor variable vectors \(\tilde{\varvec{{x}}}_{i} =\left( {\tilde{{x}}_{i1} ,\tilde{{x}}_{i2} ,\ldots ,\tilde{{x}}_{ip} } \right) ^{T}\left( {i=1,2,\ldots ,N} \right) \) and \(N\) response values \(\left( {y_{1} ,y_{2} ,\ldots ,y_{N} } \right) \). Furthermore, \(M_{\max } \) is the number of BFs reached at the end of the forward step of MARS, and \(V(m)=\left\{ {\kappa _{j}^{m} \vert j=1,2,\ldots ,K_{m} } \right\} \) is the variable set associated with the \(m^{th}\) BF. \(\varvec{z}^{m}=\left( {z_{m_{1} } ,z_{m_{2} } ,\ldots ,z_{m_{\kappa _{m} } } } \right) ^{T}\) represents the variables that contribute to the \(m^{th}\) BF. The \(\lambda _{m} \quad \left( {m=1,2,\ldots ,M_{\max } } \right) \) values are always nonnegative and serve as penalty parameters. Moreover, in Eq. (6), \(D_{r,s}^{\varvec{\alpha }} B_{m} (\varvec{z}^{{m}})=\frac{\partial ^{\left| {\varvec{\alpha }} \right| }B_{m} }{\partial ^{\alpha _{1}}z_{r}^{m} \partial ^{\alpha _{2} }z_{s}^{m} }(\varvec{z}^{{m}})\) is the partial derivative of the \(m^{th}\) BF, where \(\varvec{\alpha }=(\alpha _{1} ,\alpha _{2} ),\,\,\left| {\varvec{\alpha }} \right| =\alpha _{1} +\alpha _{2} ,\) and \(\alpha _{1} ,\alpha _{2} \in \left\{ {0,1} \right\} .\)

Here, the optimization approach adopted takes both accuracy and complexity into account. While accuracy is ensured by the RSS, complexity is measured by the second component of the PRSS in (6). The tradeoff between these two criteria is represented by the penalty parameters \(\lambda _{m} \left( {m=1,2,\ldots ,M_{\max } } \right) \).

Riemann sums are used to approximate the integrals in (6) in discretized form as follows (Weber et al. 2012; Yerlikaya 2008)

$$\begin{aligned}&\int \limits _{Q^{m}} {\theta _{m}^{2} \left[ {D_{r,s}^{\varvec{\alpha }} B_{m} (\varvec{z}^{{m}})} \right] ^{2}d\varvec{z}^{{m}}}\nonumber \\&\quad \approx \sum \limits _{(\sigma ^{j})_{j\in \left\{ {1,2,\ldots ,p} \right\} } \in \left\{ {0,1,2,\ldots ,N+1} \right\} ^{K_{m} }} {\theta _{m}^{2} \left[ {D_{r,s}^{\varvec{\alpha }} B_{m} \left( \tilde{{x}}_{l_{\sigma ^{{\kappa _{1}^{m}}}}^{{\kappa _{1}^{m} }}, {\kappa _{1}^{m} }},\ldots ,\tilde{{x}}_{l_{\sigma ^{{\kappa _{K_{m} }^{m} }}}^{{\kappa _{K_{m}}^{m} }} ,{\kappa _{K_{m} }^{m} }} \right) } \right] ^{2}}\nonumber \\&\quad \times \prod \limits _{j=1}^{K_{m} } {\left( {\tilde{{x}}_{l_{\sigma ^{{\kappa _{j}^{m} +1}}}^{{\kappa _{j}^{m} }} ,{\kappa _{j}^{m}}\,} -\tilde{{x}}_{l_{\sigma ^{{\kappa _{j}^{m} }}}^{{\kappa _{j}^{m}}} ,{\kappa _{j}^{m} }} } \right) }. \end{aligned}$$
(7)

As a result, PRSS is rearranged in the following form

$$\begin{aligned}&PRSS\approx \;\quad \mathop {\sum }\limits _{i=1}^N {\left( {y_{i} -\varvec{B}(\tilde{\varvec{d}}_{i} )\varvec{\theta }} \right) ^{2}}\nonumber \\&\quad +\sum \limits _{m=1}^{M_{\max } } \lambda _{m} \theta _{m}^{2} \sum \limits _{i=1}^{(N+1)^{K_{m} }} \left( {\mathop {\mathop {\sum }\limits _{\left| {\varvec{\alpha }} \right| =1}}\limits _{\varvec{\alpha }=(\alpha _{1} ,\alpha _{2} )^{T}}^2} \mathop {\mathop {\sum }\limits _{r<s}}\limits _{r,s\in V(m)} {\left[ {D_{r,s}^{\varvec{\alpha }} B_{m} \left( \tilde{{x}}_{l_{\sigma ^{{\kappa _{1}^{m} }}}^{{\kappa _{1}^{m} }} , {\kappa _{1}^{m} }},\ldots ,\tilde{{x}}_{l_{\sigma ^{{\kappa _{K_{m} }^{m} }}}^{{\kappa _{K_{m} }^{m}}} ,{\kappa _{K_{m} }^{m}}} \right) } \right] ^{2}} \right) \nonumber \\&\quad \times \mathop {\prod }\limits _{j=1}^{K_{m}}{\left( {\tilde{{x}}_{l_{\sigma ^{{\kappa _{j}^{m} +1}}}^{{\kappa _{j}^{m} }} ,{\kappa _{j}^{m} }} -\tilde{{x}}_{l_{\sigma ^{{\kappa _{j}^{m} }}}^{{\kappa _{j}^{m} }} ,{\kappa _{j}^{m} }} } \right) }. \end{aligned}$$
(8)

A short representation of PRSS is as follows

$$\begin{aligned}&PRSS\approx \left\| {\varvec{y}-\varvec{{B}}({\tilde{\varvec{{d}}}} )\varvec{\theta }} \right\| _{2}^{2} +\sum \limits _{m=1}^{M_{\max } } {\lambda _{m} \sum \limits _{i=1}^{(N+1)^{K_{m} }} {L_{im}^{2} \theta _{m}^{2} } ,} \end{aligned}$$
(9)

where \(\;\varvec{{B}}(\tilde{\varvec{{d}}})=\left( {1,B_{1} (\tilde{\varvec{{x}}}^{1}),\ldots ,B_{M} (\tilde{\varvec{{x}}}^{M}),B_{M+1} (\tilde{\varvec{{x}}}^{M+1}),\ldots ,B_{M_{\max } } (\tilde{\varvec{{x}}}^{M_{\max } })} \right) ^{T}\) is an \(\left( {N\times \left( {M_{\max } +1} \right) } \right) \) matrix with the point \(\tilde{\varvec{{d}}}:=(\tilde{\varvec{{x}}}^{1},\ldots ,\tilde{\varvec{{x}}}^{M}, \tilde{\varvec{{x}}}^{M+1},\ldots ,\tilde{\varvec{{x}}}^{M_{\max } })^{T}\), and \(\varvec{\theta }:=\left( {\theta _{0} ,\theta _{1} ,\ldots ,\theta _{M_{\max } } } \right) ^{\mathrm{T}}\); \(\left\| {\,\cdot \,} \right\| _{2} \) denotes the Euclidean norm of the argument. Here, the elements of \(\tilde{\varvec{{d}}}\), which are \(\tilde{\varvec{{x}}}^{1},\tilde{\varvec{{x}}}^{2},\ldots , \tilde{\varvec{{x}}}^{M_{\max } }\), represent the predictor data vectors used in the \(m^{th}\) BF \(\left( {m=1,2,\ldots ,M_{\max } } \right) \). On the other hand, the \(L_{im} \) are defined as

$$\begin{aligned} L_{{im}} = \left[ {\left( \mathop {\mathop {\sum \limits _{\left| \varvec{\alpha } \right| = 1}}\limits _{\varvec{\alpha } = (\alpha _{1},\alpha _{2})^{T}}^{2} {\mathop {\mathop {\sum }\limits _{r < s}}\limits _{r,s \in V(m)}} {\left[ {D_{{r,s}}^{\varvec{\alpha }} B_{m} (\hat{\varvec{x}}_{i}^{m} )} \right] ^{2} }} \right) \Delta \hat{\varvec{x}}_{i}^{m} } \right] ^{1/2}. \end{aligned}$$
(10)

Here, \(\hat{\varvec{x}}_i^m \left( {i=1,2,\ldots ,N} \right) \) are the canonical projections of the data points onto the input dimensions of the \(m^{{th}}\) BF, in the same increasing order; \(\Delta \hat{\varvec{x}}_i^m\) represents the differences based on the \(i^{\mathrm{th}}\) data vector \(\hat{\varvec{x}}_i^m\) (Weber et al. 2012; Yerlikaya 2008), as given in (11)

$$\begin{aligned} \hat{\varvec{x}}_i^m =\left( {\tilde{x}_{l_{\sigma ^{{\kappa _1^m }}}^{{\kappa _1^m }}, {\kappa _1^m } } ,\ldots ,\tilde{x}_{l_{\sigma ^{{\kappa _{K_m }^m }}}^{{\kappa _{K_m }^m}} ,{\kappa _{K_m }^m } } } \right) , \quad \Delta \hat{\varvec{x}}_i^m =\prod _{j=1}^{K_m } {\left( {\tilde{x}_{l_{\sigma ^{{\kappa _j^m +1}}}^{{\kappa _j^m }} ,^{\kappa _j^m } } -\tilde{x}_{l_{\sigma ^{{\kappa _j^m }}}^{{\kappa _j^m }} , {\kappa _j^m } } } \right) }. \end{aligned}$$
(11)

Through a uniform penalization, in other words, by taking the same \(\lambda \) value for each derivative term, PRSS can be turned into the Tikhonov regularization problem form given as follows (Aster et al. 2012)

$$\begin{aligned} PRSS \;\approx \;\;\left\| {{\varvec{y}}-{\varvec{B}}\left( {\tilde{{\varvec{d}}}} \right) \varvec{\theta } } \right\| _2^2 + \lambda \left\| {{\varvec{L\theta }}} \right\| _2^2. \end{aligned}$$
(12)
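For a fixed \(\lambda \), (12) is a standard Tikhonov (ridge-type) problem whose minimizer, provided the relevant matrix is nonsingular, is \(\varvec{\theta }=(\varvec{B}^{T}\varvec{B}+\lambda \varvec{L}^{T}\varvec{L})^{-1}\varvec{B}^{T}\varvec{y}\). A minimal numerical sketch of this closed form, given only for orientation and not the CQP route followed below, is:

```python
import numpy as np

def tikhonov_theta(B, y, L, lam):
    """Closed-form minimizer of ||y - B theta||_2^2 + lam * ||L theta||_2^2, cf. Eq. (12)."""
    return np.linalg.solve(B.T @ B + lam * (L.T @ L), B.T @ y)
```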

Problem (12) can be treated from the viewpoint of CQP, a technique used for continuous optimization, and handled by placing an appropriate bound, \(\tilde{M}\), on the complexity term as follows to obtain the optimal solution

$$\begin{aligned} \mathop {\min }\limits _{t,{\varvec{\theta }}} t, \hbox {such that} \;\;{\varvec{\chi }}= & {} \left( {{\begin{array}{ll} {\mathbf{0}_N }&{} {{{\varvec{B}}}(\tilde{{\varvec{d}}})} \\ 1&{} {\mathbf{0}_{M_{\max } +1}^T } \\ \end{array} }} \right) \left( {{\begin{array}{l} t \\ \varvec{\theta } \\ \end{array} }} \right) +\left( {{\begin{array}{l} {-{\varvec{y}}} \\ 0 \\ \end{array} }} \right) , \nonumber \\ \varvec{\eta }= & {} \left( {{\begin{array}{ll} {\mathbf{0}_{M_{\max } +1} }&{} {\varvec{L}} \\ 0&{} {{\varvec{0}}_{M_{\max } +1}^T } \\ \end{array} }} \right) \left( {{\begin{array}{l} t \\ \varvec{\theta } \\ \end{array} }} \right) +\left( {{\begin{array}{l} {\mathbf{0}_{M_{\max } +1} } \\ {\sqrt{\tilde{M}}} \\ \end{array} }} \right) ,\nonumber \\ {\varvec{\chi }}\in & {} L^{N+1}, \;{\varvec{\eta }} \in L^{M_{\max } +2}, \end{aligned}$$
(13)

where \(L^{N+1}\) and \(L^{M_{\max } +2}\) are the \((N+1)\)- and \((M_{\max } +2)\)-dimensional second-order cones, defined by

$$\begin{aligned} L^{N+1}=\left\{ {{\varvec{x}}=(x_1 ,x_2,\ldots ,x_{N+1} )^{T}\in \mathbb {R}^{N+1}|x_{N+1} \ge \sqrt{x_1^2 +x_2^2 + \cdots +x_N^2 }} \right\} (N\ge 1) , \end{aligned}$$

and

$$\begin{aligned} L^{M_{\max } +2}= & {} \left\{ {\varvec{x}}=(x_1 ,x_2,\ldots ,x_{M_{\max } +2} )^{T}\in \mathbb {R}^{M_{\max } +2}| \right. \nonumber \\&\quad x_{M_{\max } +2} \ge \left. \sqrt{x_1^2 +x_2^2 +\cdots +x_{M_{\max } +1}^2 } \right\} (M_{\max } >0). \end{aligned}$$
(14)
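For illustration only, the conic program (13) can be stated almost verbatim in a convex-optimization modeling language; the sketch below uses the Python package cvxpy as an assumed stand-in, whereas the study itself solves the CQP with the MOSEK software under MATLAB (see Sect. 4).

```python
import numpy as np
import cvxpy as cp

def cmars_cqp(B, y, L, M_bound):
    """Minimize t subject to ||B(d~)theta - y||_2 <= t and ||L theta||_2 <= sqrt(M~), cf. Eq. (13)."""
    theta = cp.Variable(B.shape[1])
    t = cp.Variable()
    constraints = [cp.norm(B @ theta - y, 2) <= t,
                   cp.norm(L @ theta, 2) <= np.sqrt(M_bound)]
    cp.Problem(cp.Minimize(t), constraints).solve()
    return theta.value
```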

In applications, we observe that the log–log scale plot of the two criteria, \(\left\| {{\varvec{L\theta }}} \right\| _2\) versus \(\;\left\| {\varvec{y}-{{\varvec{B}}}\left( {\tilde{{\varvec{d}}}} \right) {\varvec{\theta }} } \right\| _2\), has a particular “L” shape whose corner point provides the optimum value of \(\sqrt{\tilde{M}}\), and the proposed method provides a reliable solution to our problem. However, more efficient and robust algorithm(s) for locating the corner of an L-curve can be developed.

2.3 Bootstrap regression

2.3.1 Bootstrap resampling

The bootstrap is a resampling technique that draws samples from the original data set with replacement (Chernick 2008). It is a data-based simulation method useful for making inferences such as estimating standard errors and biases, constructing CIs, testing hypotheses, and so on. Implementation of the method is not difficult, but it depends heavily on computers. The bootstrap procedure is defined in Table 1.
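A minimal sketch of the resampling idea (not a transcription of Table 1) for estimating the standard error of an arbitrary statistic may look as follows; the statistic and the number of replications \(A\) are placeholders.

```python
import numpy as np

def bootstrap_se(data, statistic, A=1000, rng=None):
    """Draw A bootstrap samples with replacement and return the standard error of `statistic`."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    stats = np.empty(A)
    for a in range(A):
        idx = rng.integers(0, n, size=n)   # indices sampled with replacement
        stats[a] = statistic(data[idx])
    return stats.std(ddof=1)

# Hypothetical usage: bootstrap standard error of the sample mean
# se = bootstrap_se(np.asarray(x), np.mean)
```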

Efron and Tibshirani (1993) indicate that the bootstrap is applicable to any model, including nonlinear ones and models that use estimation techniques other than LS. According to them, bootstrap regression is applicable to nonparametric models as well as to parametric ones with no analytical solution.

Let \({\varvec{y}}={\varvec{X\theta }} +{\varvec{\varepsilon }}\) be the usual MLR model, where \({\varvec{X}}\) is the matrix containing the independent variables as its columns and \({\varvec{\theta }}\) is the vector of model parameters. The error term, \({\varvec{\varepsilon }}\), is normally distributed with zero mean and constant variance. If the assumptions regarding the model are satisfied, reliable inferences can be made. In cases such as a nonnormal error distribution or nonlinear model fitting, alternative approaches using the bootstrap are recommended (Freedman 1981; Hjorth 1994). The three bootstrap regression methods used in the study are described below.

Table 1 The bootstrap procedure

2.3.2 Fixed-X resampling (residual resampling)

In this method, the response values are considered to be random due to the error component. It is more advantageous when the independent variables are fixed (known), the data sets are small, and the models are adequate (Fox 2002). The step-by-step algorithm of the method is given in Table 2.

Table 2 The fixed-X resampling procedure
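As an illustration of the idea rather than a transcription of Table 2, a minimal sketch of residual resampling around a generic fitting routine is given below; `fit` and `predict` are hypothetical callables standing in for the regression method being bootstrapped.

```python
import numpy as np

def fixed_x_bootstrap(X, y, fit, predict, A=1000, rng=None):
    """Fixed-X resampling: keep X fixed, resample residuals, and refit A times."""
    rng = np.random.default_rng() if rng is None else rng
    model = fit(X, y)
    y_hat = predict(model, X)
    resid = y - y_hat
    refits = []
    for _ in range(A):
        e_star = rng.choice(resid, size=len(y), replace=True)  # bootstrapped residuals
        refits.append(fit(X, y_hat + e_star))                  # refit on synthetic responses
    return refits
```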

2.3.3 Random-X resampling (pairs bootstrap)

This technique can be used in cases of heteroscedasticity, lack of significant independent variables, or the need for a semiparametric or nonparametric modeling approach (Chernick 2008). The step-by-step algorithm of the method is given in Table 3.

Table 3 The random-X resampling procedure
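Similarly, a minimal sketch of pairs resampling, again with a hypothetical `fit` callable and not a transcription of Table 3, is:

```python
import numpy as np

def random_x_bootstrap(X, y, fit, A=1000, rng=None):
    """Random-X resampling: draw (x_i, y_i) pairs with replacement and refit A times."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    refits = []
    for _ in range(A):
        idx = rng.integers(0, n, size=n)      # pairs drawn with replacement
        refits.append(fit(X[idx], y[idx]))
    return refits
```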

2.3.4 Wild bootstrap

The wild bootstrap is a relatively new approach, compared to random-X resampling, proposed for handling heteroscedastic models (Flachaire 2005). Its algorithm is the same as that of fixed-X resampling given in Table 2, with the only change being in Step 2: the bootstrapped residuals (errors), \({\varvec{e}}^{*a} (a=1,\ldots ,A)\), are multiplied by values randomly assigned to be +1 or -1 with equal probability before they are attached to the fitted values.
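A minimal sketch of this variant, differing from the fixed-X sketch above only in how the residuals are perturbed, is:

```python
import numpy as np

def wild_bootstrap(X, y, fit, predict, A=1000, rng=None):
    """Wild bootstrap: each residual is multiplied by a +1/-1 sign drawn with equal probability."""
    rng = np.random.default_rng() if rng is None else rng
    y_hat = predict(fit(X, y), X)
    resid = y - y_hat
    refits = []
    for _ in range(A):
        signs = rng.choice([-1.0, 1.0], size=len(y))   # Rademacher weights
        refits.append(fit(X, y_hat + signs * resid))
    return refits
```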

2.4 Validation technique and performance criteria

In the comparison of models, the 3-fold CV technique is used (Martinez and Martinez 2002; Gentle 2009). In this technique, the data sets are randomly divided into three parts (folds). In each of the three attempts, two different folds (66.6 % of the observations) are combined to develop models while the remaining fold (33.3 % of the observations) is kept to test them. The combined part and the remaining fold are referred to as the training and test data sets, respectively.
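A minimal sketch of this validation scheme, with `fit` and `score` as hypothetical callables, is:

```python
import numpy as np

def three_fold_cv(X, y, fit, score, rng=None):
    """3-fold CV: split into three folds; train on two, test on the remaining one, three times."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(y)), 3)
    scores = []
    for k in range(3):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(3) if j != k])
        scores.append(score(fit(X[train], y[train]), X[test], y[test]))
    return scores
```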

The performances of the models developed are evaluated with respect to different criteria including accuracy, precision, complexity, stability, robustness and efficiency. The accuracy criterion is used to measure the predictive ability of the models, while the precision criterion is used to determine the amount of variation in the parameter estimates; less variable estimates indicate more precision. Accuracy is measured by the mean absolute error (MAE), the coefficient of determination \((\hbox {R}^{2})\) and the percentage of residuals within three standard deviations (PWI), whereas the precision of the parameter estimates is determined by their empirical CIs. Another criterion used in the comparisons is complexity, which is measured by the mean squared error (MSE). It is expected that, in general, the performance measures for the test data may not be as good as those for the training data. Besides, the stabilities of the accuracy and complexity measures obtained from the training and test data sets are also evaluated. The definitions of these measures, as well as bounds on them where applicable, are presented in the Appendix. Furthermore, the robustness of the measures with respect to different data sets is evaluated by considering the standard deviations of the measures. Moreover, to assess the computational efficiency of the models built, computational run times are utilized.
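The exact definitions and bounds of these measures are given in the Appendix; the sketch below shows only conventional textbook forms of the MAE, MSE, \(\hbox {R}^{2}\) and PWI for orientation and may differ in detail from the Appendix definitions.

```python
import numpy as np

def performance_measures(y, y_hat):
    """Conventional forms of the measures (see the Appendix for the exact definitions used)."""
    resid = y - y_hat
    return {
        "MAE": np.mean(np.abs(resid)),
        "MSE": np.mean(resid ** 2),
        "R2": 1.0 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2),
        "PWI": 100.0 * np.mean(np.abs(resid) <= 3.0 * np.std(resid, ddof=1)),
    }

# Stability is assumed here to compare a measure on training vs. test data (values near 1 are better),
# e.g. stability = test_value / train_value; the exact form is given in the Appendix.
```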

3 BCMARS: bootstrapping CMARS

As stated above, studies indicate that CMARS is a good alternative to the backward part of the MARS method. However, CMARS produces models at least as complex as the MARS models. To overcome this problem, owing to the lack of distributional assumptions in CMARS, we propose to use a CS method, the bootstrap, and develop the BCMARS algorithm. The steps of the algorithm given in Table 4 are followed to obtain three different BCMARS models, labeled BCMARS-F (using fixed-X resampling), BCMARS-R (using random-X resampling) and BCMARS-W (using the wild bootstrap) (Yazıcı 2011; Yazıcı et al. 2011).

Table 4 The BCMARS algorithm
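The authoritative steps are those of Table 4; the sketch below reflects only one plausible reading of the general idea, namely bootstrapping the CMARS coefficients and discarding BFs whose empirical CIs contain zero, and every callable in it is hypothetical.

```python
import numpy as np

def bcmars_sketch(bfs, y, fit_cmars, bootstrap_thetas, A=1000, alpha=0.05):
    """Illustrative BCMARS loop (one assumed reading of Table 4, not a transcription).

    bfs              : basis functions from the forward step of MARS
    fit_cmars        : fits CMARS on the given BFs and returns the final model
    bootstrap_thetas : returns an (A x #BFs) matrix of bootstrapped coefficient estimates
    """
    while True:
        theta_star = bootstrap_thetas(bfs, y, A)
        lo = np.percentile(theta_star, 100 * alpha / 2, axis=0)
        hi = np.percentile(theta_star, 100 * (1 - alpha / 2), axis=0)
        keep = [m for m in range(len(bfs)) if not (lo[m] <= 0.0 <= hi[m])]
        if len(keep) == len(bfs):            # all remaining BFs are significant
            break
        bfs = [bfs[m] for m in keep]         # drop BFs whose CI covers zero and iterate
    return fit_cmars(bfs, y)
```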

In Step 2 of Table 4, the optimal value of \(\sqrt{\tilde{M}}\) is the solution closest to the corner of the L-curve, which is the point of maximum curvature. To determine the corner point, \(\;\left\| {\varvec{y}-\varvec{B}\left( {\tilde{\varvec{d}}} \right) \varvec{\theta } } \right\| _2 \) versus \(\left\| \varvec{L\theta } \right\| _2 \) is plotted in log–log scale and its corner is located. This point tries to minimize both criteria, \(\;\left\| {\varvec{y}-\varvec{B}\left( {\tilde{{\varvec{d}}}} \right) \varvec{\theta }} \right\| _2\) and \(\left\| \varvec{L\theta } \right\| _2 \), in a balanced manner. We should note here that the corner point is data dependent. There can be many solutions to the CQP problem for different \(\sqrt{\tilde{M}}\) values, which may lead to different estimates. To illustrate, let us consider three representative points on the L-curve, P1, P2 and P3, as given in Fig. 1. While P1 and P3 minimize \(\left\| \varvec{L\theta } \right\| _2 \) and \(\;\left\| {\varvec{y}-\varvec{B}\left( {\tilde{{\varvec{d}}}}\right) \varvec{\theta }} \right\| _2\), respectively, P2, the corner of the L-curve, tries to minimize both simultaneously. Here, P1 represents the least complex and least accurate solution, whereas P3 represents the most complex and most accurate solution. On the other hand, P2 provides better prediction performance than the other points with respect to both the complexity and accuracy criteria (Weber et al. 2012).
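One simple heuristic for locating this corner numerically, assuming the two norms have been evaluated over a grid of candidate \(\sqrt{\tilde{M}}\) values, is to pick the point of maximum curvature in log–log scale; the sketch below illustrates this device and is not necessarily the exact procedure used in the study.

```python
import numpy as np

def lcurve_corner(res_norms, sol_norms):
    """Index of maximum curvature of the discrete L-curve in log-log scale.

    res_norms : values of ||y - B(d~) theta||_2 over a grid of candidate bounds
    sol_norms : corresponding values of ||L theta||_2
    """
    x, y = np.log(res_norms), np.log(sol_norms)
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    curvature = np.abs(dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5
    return int(np.argmax(curvature))     # e.g. the point P2 in Fig. 1
```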

Fig. 1

The curve of \(\left\| \varvec{L\theta } \right\| _2\) versus \(\left\| {\varvec{y}-\varvec{B}\left( {\tilde{\varvec{d}}} \right) \varvec{\theta }} \right\| _2\) in log–log scale

Table 5 Data sets used in comparisons

4 Application and findings

In order to evaluate and compare the performances of the models developed by using the MARS, CMARS and BCMARS methods, they are run on four different data sets so as to observe the effects of certain data characteristics, such as size (i.e. the number of observations, \(N\)) and scale (i.e. the number of independent variables, \(p\)), on the methods' performances. Note that the data sets are classified as small and medium subjectively. The data sets used in the comparisons are presented in Table 5.

While validating the models, 3-fold CV is used as described in Sect. 2.4. As a result, three models are developed and tested for each method applied to a data set. In the applications, the R package “Earth” (Milborrow 2009), MATLAB (2009) and the MOSEK optimization software (2011) run within MATLAB are utilized.

To construct the BCMARS models, the algorithms given in Sect. 3 are applied step by step by taking \(A\) in Tables 1, 2 and 3 as 1000. Then, the performance measures for each model are calculated. Moreover, the computational run times of the methods are recorded for comparison.

5 Results and discussion

In this section, we aim to compare the performances of the methods studied, namely MARS, CMARS, BCMARS-F, BCMARS-R and BCMARS-W, in general and according to different features of the data sets such as size and scale. In these comparisons, various criteria including accuracy, precision, stability, efficiency and robustness are considered.

5.1 Comparison with respect to overall performances

The means and standard deviations of the measures obtained from the four data sets are given in Table 6. These values are calculated for both the training and test data sets, in addition to the stability of the measures. Definitions of the measures and their bounds are given in the Appendix. In this table, for the training and test data, lower means for the MAE and MSE and higher means for the \(\hbox {R}^{2}\) and PWI measures indicate better performance. Besides, stability values close to one indicate better performance. On the other hand, smaller standard deviations imply robustness of the corresponding measure. The following conclusions can be drawn from this table:

  • BCMARS-F and BCMARS-R are the most accurate, robust and least complex for training and testing data sets, respectively.

  • BCMARS-R and BCMARS-W methods are the most stable, and BCMARS-R has the most robust stability.

5.2 Comparison with respect to sample size

Table 7 presents the performance measures of the methods studied with respect to two sample size categories: small and medium. Based on the results given in the table, the following conclusions can be reached.

Table 6 Overall performances (Mean\(\pm \)SD) of the methods
Table 7 Averages of performance measures with respect to different sample sizes
  • All methods perform better on small data sets than on medium-sized ones, for both training and testing data.

  • BCMARS-F and MARS perform the best for small training and testing data sets, respectively. Moreover, BCMARS-W competes with MARS in small testing data sets.

  • Among all, BCMARS-W method is the most stable one in small data sets.

  • BCMARS-F and BCMARS-W are the most stable methods in small size data when compared to medium size.

Note that “the best” here indicates better performance with respect to at least two measures out of four.

5.3 Comparisons with respect to scale

Table 8 presents the performance measures of the studied methods with respect to two scale types: small and medium. Based on the results given in the table, the following conclusions can be drawn:

  • For training data sets, medium scale produces better models for all of the methods. Moreover, BCMARS-F is the best performing one regardless of the scale.

  • For testing data sets, MARS, BCMARS-F and BCMARS-W perform equally well on both scales, while the medium scale gives the best results for the other methods studied.

  • MARS and BCMARS-W are the most stable methods for small scale data compared to medium scale; CMARS and BCMARS-R are the most stable methods for medium scale compared to small scale. BCMARS-F performs equally well on both scales.

  • MARS and BCMARS-W are more stable for small scale among all methods. BCMARS-R is more stable for medium scale data sets.

Table 8 Averages of performance measures with respect to different scale

5.4 Evaluation of the computational efficiencies

The elapsed times for each method applied on each data set are recorded during the runs on a computer with a Pentium® Dual-Core 2.80 GHz CPU and a 32-bit Windows® operating system (Table 9). Based on the results, the following conclusions can be stated:

  • Run times increase as sample size and scale increase, except for MARS.

  • The bootstrap methods run considerably longer than MARS and CMARS. The three bootstrap regression methods have almost the same computational efficiency on small-size, small-scale data sets. Their run times increase almost tenfold as the scale increases from small to medium.

  • BCMARS-R and BCMARS-W have similarly better efficiencies on medium-size, small-scale data sets. Their run times increase almost fivefold as the sample size increases in small-scale data sets.

  • BCMARS-F and BCMARS-W have similarly better efficiencies for medium-size, medium-scale data sets.

Table 9 Run times (in seconds) of methods with respect to size and scale of data sets

5.5 Evaluation of the precision of model parameters

In addition to the accuracy, complexity and stability measures of the models, the CIs computed using Eq. (17) given in the Appendix and two different standard deviations of the parameters, as described in Eq. (18) in the Appendix, are calculated after bootstrapping. These values are compared with those obtained from bootstrapping CMARS. For detailed results, one can refer to Yazıcı (2011). The shorter the CIs and the smaller the standard deviations, the more precise the parameter estimates. According to the results, the following conclusions can be drawn.

  • For the US (small size, medium scale) data, CMARS, BCMARS-F and BCMARS-R build the same models. Hence, the precision of their parameters is the same.

  • For all data sets except US, the CIs become narrower and the standard deviations of the parameters become smaller after bootstrapping CMARS, thus resulting in more precise parameter estimates.

  • In general, two different types of standard deviations obtained for all BCMARS methods are smaller than the ones obtained from CMARS.

6 Conclusion and further research

In this study, three different bootstrap methods are applied to a machine learning method, called CMARS, which is an improved version of the backward step of the well-known MARS method. Although CMARS outperforms MARS with respect to several criteria, it constructs models that are at least as complex as those of MARS (Weber et al. 2012). This study aims to reduce the complexity of CMARS models without degrading their performance. To achieve this aim, bootstrapping regression methods, namely fixed-X and random-X resampling and the wild bootstrap, are utilized in an iterative approach to determine whether the parameters contribute statistically to the developed CMARS model or not. The reason for using a computational method here is the lack of prior knowledge regarding the distributions of the model parameters.

The performances of the methods are empirically evaluated and compared with respect to several criteria (e.g. accuracy, complexity, stability, robustness, precision, computational efficiency) by using four data sets selected subjectively to represent the small and medium sample size and scale categories. All performance criteria are explained in the Appendix. In addition, to validate all the models developed, the three-fold CV approach is used.

Based on the comparisons, particularly the testing data and stability results presented in Sect. 5, one may conclude the following:

  • Overall, BCMARS-R is the best performing method.

  • Small-size (training and testing) data sets produce the best results for all methods; for small and medium size data, BCMARS-W and BCMARS-R outperform the others, respectively.

  • Medium scale produces the best results for CMARS and BCMARS-R when compared to the others, and BCMARS-R is the better performing one.

  • Bootstrapping methods give the most precise parameter estimates; however, they are computationally the least efficient.

In short, based on the above conclusions, it may be suggested that the BCMARS-R method leads to more accurate, more precise and less complex models, particularly for medium-size and medium-scale data. Nevertheless, it is the least efficient method in terms of run time for this type of data set.

In the future, the BCMARS methods will be applied to more data sets, ranging from small to large in size and scale, in order to examine more clearly the interactions that may exist between data size and scale.