1 Introduction

The generalized additive model (GAM) has gained popularity for addressing the curse of dimensionality in multivariate nonparametric regression with non-Gaussian responses. The GAM was developed by Hastie and Tibshirani (1990) to blend the generalized linear model with nonparametric additive regression; it stipulates that a data set \(\left\{ Y_{i},\mathbf {X} _{i}^{T}\right\} _{i=1}^{n}\) consists of iid copies of \(\left\{ Y,\mathbf {X} ^{T}\right\} \) satisfying

$$\begin{aligned} \mathop {\mathsf{E}}(Y|{\mathbf {X}})= & {} b^{\prime }\left\{ m\left( {\mathbf {X}} \right) \right\} ,\mathop {\mathrm{var}}(Y|{\mathbf {X}})=a\left( \phi \right) b^{\prime \prime }\left\{ m\left( {\mathbf {X}}\right) \right\} , \nonumber \\ m\left( {\mathbf {X}}\right)= & {} c+\sum \limits _{l=1}^{d}m_{l}(X_{l}), \end{aligned}$$
(1)
$$\begin{aligned} Y=b^{\prime }\left\{ m\left( {\mathbf {X}}\right) \right\} +\sigma \left( {\mathbf {X}}\right) \varepsilon ,\sigma \left( {\mathbf {X}}\right) =\left\{ \mathop {\mathrm{var}}(Y|{\mathbf {X}})\right\} ^{1/2} \end{aligned}$$

where the response Y is of a certain type, such as Bernoulli or Poisson, the vector \(\mathbf {X}=\left( X_{1},X_{2},\ldots ,X_{d}\right) ^{T} \) consists of the predictors, \(m_{l}(\cdot ),1\le l\le d\) are unknown smooth functions, the white noise \(\varepsilon \) satisfies \( \mathop {\mathsf{E}}\left( \varepsilon \vert \mathbf {X} \right) =0\) and \(\mathop {\mathsf{E}}\left( \varepsilon ^{2}\vert \mathbf {X} \right) =1\), c is an unknown constant, \(a\left( \phi \right) \) is a nuisance parameter that quantifies overdispersion, and the known inverse link function \(b^{\prime }\) satisfies \(b^{\prime }\in C^{2}\left( \mathbb {R}\right) \) and \(b^{\prime \prime }\left( \theta \right) >0\) for \(\theta \in \mathbb {R}\); see Assumption (A2) in the Appendix. In particular, if one takes the identity (trivial) link, model (1) becomes the common additive model; see Huang and Yang (2004).
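As a concrete illustration, with a Bernoulli response and the canonical logit link the functions b, b' and b'' have explicit forms. The following sketch (in Python; this specific family is our running example, not an additional model assumption) also shows why Assumption (A2) holds in this case: b'' is strictly positive everywhere.

```python
import math

def b(theta):
    # Cumulant function for the Bernoulli family with canonical (logit) link,
    # b(theta) = log(1 + e^theta), written stably for large |theta|
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def b_prime(theta):
    # Inverse link b'(theta) = e^theta / (1 + e^theta), so E(Y|X) = b'{m(X)}
    if theta >= 0:
        return 1.0 / (1.0 + math.exp(-theta))
    e = math.exp(theta)
    return e / (1.0 + e)

def b_double_prime(theta):
    # b''(theta) = p(1 - p) > 0 for all theta, as required by Assumption (A2);
    # for Bernoulli a(phi) = 1, so var(Y|X) = b''{m(X)}
    p = b_prime(theta)
    return p * (1.0 - p)
```

For Poisson responses one would instead take \(b(\theta )=e^{\theta }\), with \(b^{\prime }=b^{\prime \prime }=e^{\theta }>0\).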

The joint density \(f\left( \mathbf {x}\right) \) of \(\left( X_{1},\ldots ,X_{d}\right) \) is assumed to be continuous and

$$\begin{aligned} 0<c_{f}\le \inf \nolimits _{\mathbf {x}\in \left[ 0,1\right] ^{d}}f\left( \mathbf {x}\right) \le \sup \nolimits _{\mathbf {x}\in \left[ 0,1\right] ^{d}}f\left( \mathbf {x}\right) \le C_{f}<\infty , \end{aligned}$$

see Assumption (A4) in the Appendix. Furthermore, for each \(1\le l\le d\), the marginal density function \(f_{l}\left( x_{l}\right) \) of \(X_{l}\) has continuous derivatives on \(\left[ 0,1\right] \) and the same uniform bounds \( C_{f}\) and \(c_{f}\). There exists a \(\sigma \)-finite measure \(\lambda \) on \( \mathbb {R}\) such that the distribution of \(Y_{i}\) conditional on \(\mathbf {X}_{i}\) has a probability density function \(f_{Y\vert \mathbf {X}}\left( y;b^{\prime }\left\{ m\left( \mathbf {x}\right) \right\} \right) \) relative to \(\lambda \), whose support in y is a common set \(\varOmega \), and which is continuous in both \(y\in \varOmega \) and \(\mathbf {x}\in \left[ 0,1\right] ^{d}\).

It is often the case that in model (1) the probability density function of \(Y_{i}\) conditional on \(\mathbf {X}_{i}\) with respect to a fixed \( \sigma \)-finite measure forms an exponential family:

$$\begin{aligned} f\left( Y_{i}\vert \mathbf {X}_{i},\phi \right) =\exp \left[ \left\{ Y_{i}m\left( \mathbf {X}_{i}\right) -b\left\{ m\left( \mathbf {X} _{i}\right) \right\} \right\} /a\left( \phi \right) +h\left( Y_{i},\phi \right) \right] . \end{aligned}$$
(2)

Nonetheless, such an assumption is not necessary in this paper. Instead, we only stipulate that the conditional variance and conditional mean are linked by

$$\begin{aligned} \mathop {\mathrm{var}}\left( Y|\mathbf {X=x}\right) =a\left( \phi \right) b^{\prime \prime }\left[ \left( b^{\prime }\right) ^{-1}\left\{ \mathop {\mathsf{E}} \left( Y|\mathbf {X=x}\right) \right\} \right] . \end{aligned}$$

For identifiability, one needs

$$\begin{aligned} \mathop {\mathrm{E}}\left\{ m_{l}\left( X_{l}\right) \right\} =0,1\le l\le d \end{aligned}$$

which leads to the unique additive representation \(m\left( \mathbf {x}\right) =c+\sum \nolimits _{l=1}^{d}m_{l}\left( x_{l}\right) \). Without loss of generality, \(\mathbf {x}\) takes values in \(\chi =\left[ 0,1\right] ^{d}\).

Model (1) has numerous applications. In corporate credit rating, for instance, one is interested in modelling how the default or non-default of a given company depends on the additive effects of covariates from its financial statements, i.e., the response is \(Y=0,1\), with 1 indicating default and 0 indicating non-default, the predictors are selected from financial statements, and the link is the \(\mathop {\mathrm{logit}}\nolimits \) function \(\left( b^{\prime }\right) ^{-1}(x)=\log \left\{ x/\left( 1-x\right) \right\} \). Our method has been applied to 3472 companies in Japan within a 5-year default horizon (2005–2010), and it has been discovered that the current liabilities and the stock market returns of current, 3 months and 6 months prior to default are highly significant rating factors; the default impact of the selected factors is examined via the simultaneous confidence corridors (SCCs) in Fig. 1a–c. More details of this example are contained in Sect. 6.

Fig. 1 Plots of the rating factors in a–c: SBK estimators (thin), \(95~\%\) CIs (dashed) and \(95~\%\) SCCs (thick); plot of the CAPs defined in (24) in d: perfect (dashed), GAM (thick solid), GLM (thin solid), non-informative (dotted). a Current liability. b 3-months-earlier return. c 6-months-earlier return. d The CAP curves

The smooth functions \(\left\{ m_{l}(x_{l})\right\} _{l=1}^{d}\) in (1) can be estimated by, for instance, kernel methods in Linton and Härdle (1996), Linton (1997) and Yang et al. (2003), B-spline methods in Stone (1986) and Xue and Liang (2010), and two-stage methods in Horowitz and Mammen (2004). To make statistical inference on these functions individually and collectively, however, the proper tools are nonparametric simultaneous confidence corridors (SCCs) and consistent variable selection criteria, both of which are absent in the literature.

Nonparametric SCC methodology has become increasingly important in the statistical literature; see Xia (1998), Fan and Zhang (2000), Wu and Zhao (2007), Zhao and Wu (2008), Ma et al. (2012), Wang et al. (2014), Zheng et al. (2014), Gu et al. (2014), Cai and Yang (2015) and Gu and Yang (2015) for recent theoretical works on nonparametric SCCs. Capturing the global shape of the functions \(\left\{ m_{l}(x_{l})\right\} _{l=1}^{d}\) in GAM (1) by SCCs is of prime importance: a nonparametric component covered entirely within the SCC can be replaced by a parametric one, significantly decreasing the estimation variance; see He et al. (2002, 2005) for discussions. As far as we know, SCCs have not been established for the functions \(\left\{ m_{l}(x_{l})\right\} _{l=1}^{d}\) in GAM (1), due to the lack of estimators that fit into Gaussian process extreme value theory. Using the spline-backfitted kernel (SBK) smoothing of Liu et al. (2013), we extend the SCC works for univariate nonparametric regression in Bickel and Rosenblatt (1973) and Härdle (1989) to GAM. SBK smoothing has been well developed in Wang and Yang (2007), Wang and Yang (2009), Liu and Yang (2010) and Ma and Yang (2011) for the much simpler additive model (i.e., GAM with \(b^{\prime }\left( x\right) \equiv x\)), including the construction of SCCs, but ours is the first work on SCCs for GAM with a nonlinear link.

While variable selection for the nonparametric additive model has been investigated under different settings, see Wang et al. (2008), there is a lack of theoretically reliable variable selection approaches for GAM. To the best of our knowledge, only Zhang and Lin (2006) proposed a sound method, named “COSSO” for component selection and smoothing operator, using penalized likelihood for selecting components in nonparametric regression with exponential families, but it leaves the asymptotic distributions and variable selection consistency to be established. Instead, we tackle this issue by building a BIC-type criterion based on spline pre-smoothing (the first stage of the SBK), which is asymptotically consistent and easy to compute. Our work extends the BIC criterion for additive models (trivial link) in Huang and Yang (2004). Such an extension is challenging since a much more complicated quasi-likelihood is used in GAM with a possibly nonlinear link, instead of the log mean squared error for the trivial link; see the Appendix for details.

The rest of the paper is organized as follows. The SBK estimator and its oracle property are briefly described in Sect. 2. The asymptotic extreme value distribution of the SBK estimator is investigated in Sect. 3 and used to construct the SCCs of component functions. Section 4 introduces a BIC criterion in the GAM setting and provides results on consistent component selection as well as the implementation, followed by the Monte Carlo simulations in Sect. 5. Section 6 illustrates the application of our SCC and BIC methods to predict default of nearly 3,500 listed companies in Japan. Technical assumptions and proofs are presented in the Appendix.

2 Spline-backfitted kernel smoothing in GAM

In this section, we briefly describe the spline-backfitted kernel (SBK) estimator for GAM (1) and its oracle properties obtained in Liu et al. (2013). Let \(\left\{ \mathbf {X}_{i},Y_{i}\right\} _{i=1}^{n}\) be i.i.d. observations following model (1). Without loss of generality, one denotes \(\mathbf {x}_{\_1}=\left( x_{2},\ldots ,x_{d}\right) \) and \(m_{\_1}\left( \mathbf {x}_{\_1}\right) =c+\sum _{l=2}^{d}m_{l}\left( x_{l}\right) \) and estimates \(m_{1}\left( x_{1}\right) \).

As a benchmark of efficiency, we introduce the “oracle smoother” by treating the constant c and the last \(d-1\) components \(\left\{ m_{l}\left( x_{l}\right) \right\} _{l=2}^{d}\) as known; the only unknown component \(m_{1}\left( x_{1}\right) \) may then be estimated by the following procedure. Although the exponential family Eq. (2) does not necessarily hold, one still defines, as in Severini and Staniswalis (1994), for each \(x_{1}\in \left[ h,1-h\right] \) a local log-likelihood function \(\tilde{l}\left( a\right) =\tilde{l}\left( a,x_{1}\right) \) as

$$\begin{aligned} \tilde{l}\left( a,x_{1}\right) =n^{-1}\sum \limits _{i=1}^{n}\left[ Y_{i}\left\{ a+m_{\_1}\left( \mathbf {X}_{i,\_1}\right) \right\} -b\left\{ a+m_{\_1}\left( \mathbf {X}_{i,\_1}\right) \right\} \right] K_{h}\left( X_{i1}-x_{1}\right) , \nonumber \\ \end{aligned}$$
(3)

where \(a\in A\), a set whose interior contains \(m_{1}\left( \left[ 0,1\right] \right) \). The oracle smoother of \(m_{1}\left( x_{1}\right) \) is

$$\begin{aligned} \widetilde{m}_{\mathop {\mathrm{K}},1}\left( x_{1}\right) =\mathop {\mathrm{argmax}}\nolimits _{a\in A} \widetilde{l}\left( a,x_{1}\right) . \end{aligned}$$

Although \(\widetilde{m}_{\mathop {\mathrm{K}},1}\left( x_{1}\right) \) is not a statistic since c and \(\left\{ m_{l}\left( x_{l}\right) \right\} _{l=2}^{d} \) are actually unknown, its asymptotic properties serve as a benchmark for estimators of \(m_{1}\left( x_{1}\right) \) to achieve.

To define the SBK, we introduce the linear B spline basis for smoothing: \( b_{J}\left( x\right) =\left( 1-\vert x-\xi _{J}\vert /H\right) _{+},\) \(0\le J\le N+1\) where \(0=\xi _{0}<\xi _{1}<\cdots <\xi _{N}<\xi _{N+1}=1\) are a sequence of equally spaced points, called interior knots, on interval \(\left[ 0,1\right] \). Denote by \(H=\left( N+1\right) ^{-1}\) the width of each subinterval \(\left[ \xi _{J},\xi _{J+1}\right] ,0\le J\le N\) and the degenerate knots by \(\xi _{-1}=0,\xi _{N+2}=1\). The space of l-empirically centered linear spline functions on [0, 1] is

$$\begin{aligned} G_{n,l}^{0}=\left\{ g_{l}:g_{l}\left( x_{l}\right) \equiv \sum \limits _{J=0}^{N+1}\lambda _{J}b_{J}\left( x_{l}\right) ,\mathop {\mathrm{E}} \nolimits _{n}\left\{ g_{l}\left( X_{l}\right) \right\} =0\right\} ,\quad 1\le l\le d, \end{aligned}$$
(4)

with empirical expectation \(\mathop {\mathrm{E}}\nolimits _{n}\left\{ g_{l}\left( X_{l}\right) \right\} =n^{-1}\sum _{i=1}^{n}g_{l}\left( X_{li}\right) \). The space of additive spline functions on \(\chi =\left[ 0,1\right] ^{d}\) is

$$\begin{aligned} G_{n}^{0}=\left\{ g\left( \mathbf {x}\right) =c+\sum \limits _{l=1}^{d}g_{l}\left( x_{l}\right) ;c\in \mathbb {R},\quad \text { } g_{l}\in G_{n,l}^{0}\right\} . \end{aligned}$$
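A minimal numerical sketch of this basis and the empirical centering (Python/NumPy; the helper names are ours, for illustration only):

```python
import numpy as np

def linear_bspline_basis(x, N):
    """Evaluate the linear B-spline (hat function) basis b_J, 0 <= J <= N+1,
    with equally spaced interior knots xi_J = J * H, H = 1/(N+1), at x in [0,1]."""
    H = 1.0 / (N + 1)
    xi = np.arange(N + 2) * H                    # knots xi_0, ..., xi_{N+1}
    # b_J(x) = (1 - |x - xi_J| / H)_+
    return np.clip(1.0 - np.abs(x[:, None] - xi[None, :]) / H, 0.0, None)

def empirically_center(B):
    # Subtract column means so that E_n{ g_l(X_l) } = 0 for any fitted spline,
    # as required by the definition of G^0_{n,l} in (4)
    return B - B.mean(axis=0, keepdims=True)

x = np.random.default_rng(1).uniform(size=200)
B = linear_bspline_basis(x, N=5)                 # 200 x (N+2) design matrix
Bc = empirically_center(B)
```

The hat functions form a partition of unity, so each row of B sums to 1; after centering, each column of Bc has empirical mean 0.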

The SBK method is defined in two steps. One first pre-estimates the unknown functions \(\left\{ m_{l}\left( x_{l}\right) \right\} _{l=2}^{d}\) and constants c by linear spline smoothing. We define the log-likelihood function \(\widehat{L}\left( g\right) \) as

$$\begin{aligned} \widehat{L}\left( g\right) =n^{-1}\sum \limits _{i=1}^{n}\left[ Y_{i}g\left( \mathbf {X}_{i}\right) -b\left\{ g\left( \mathbf {X}_{i}\right) \right\} \right] ,\quad g\in G_{n}^{0}. \end{aligned}$$
(5)

According to Lemma 14 of Stone (1986), (5) has a unique maximizer with probability approaching 1. Therefore, the multivariate function \(m\left( \mathbf {x}\right) \) can be estimated by an additive spline function:

$$\begin{aligned} \widehat{m}\left( \mathbf {x}\right) =\mathop {\mathrm{argmax}}\nolimits _{g\in G_{n}^{0}}\widehat{L }\left( g\right) . \end{aligned}$$
(6)

The spline estimator is asymptotically consistent and can be computed efficiently via generalized linear model fitting. However, as stated in Wang and Yang (2007) and Liu et al. (2013), spline methods provide only convergence rates, not asymptotic distributions, so no measures of confidence can be assigned to the estimators. To overcome this problem, we adopt the SBK estimator, which combines the strengths of kernel smoothing and regression splines. One then rewrites \(\widehat{m}\left( \mathbf {x} \right) =\hat{c}+\sum _{l=1}^{d}\widehat{m}_{l}\left( x_{l}\right) \) for \( \widehat{c}\in \mathbb {R}\) and \(\widehat{m}_{l}\left( x_{l}\right) \in G_{n,l}^{0}\), and defines a univariate quasi-likelihood function similar to \( \widetilde{l}\left( a,x_{1}\right) \) in (3) as

$$\begin{aligned} \widehat{l}\left( a,x_{1}\right) =n^{-1}\sum \limits _{i=1}^{n}\left[ Y_{i}\left\{ a+\widehat{m}_{\_1}\left( \mathbf {X}_{i,\_1}\right) \right\} -b\left\{ a+\widehat{m}_{\_1}\left( \mathbf {X}_{i,\_1}\right) \right\} \right] K_{h}\left( X_{i1}-x_{1}\right) \end{aligned}$$

with \(\widehat{m}_{\_1}\left( \mathbf {x}_{\_1}\right) =\widehat{c} +\sum _{l=2}^{d}\widehat{m}_{l}\left( x_{l}\right) \) being the pilot spline estimator of \(m_{\_1}\left( \mathbf {x}_{\_1}\right) \). Consequently, the spline-backfitted kernel (SBK) estimator of \(m_{1}\left( x_{1}\right) \) is

$$\begin{aligned} \widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) =\mathop {\mathrm{argmax}}\nolimits _{a\in A} \widehat{l}\left( a,x_{1}\right) . \end{aligned}$$
(7)
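The two-step SBK procedure can be sketched as follows for a Bernoulli response with logit link (a simplified illustration, not the authors' implementation: the helper names, the Epanechnikov kernel and the Newton iterations standing in for the generic quasi-likelihood maximizations are all our choices):

```python
import numpy as np

def sigmoid(t):
    # Numerically stable logistic inverse link b'(t)
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def hat_basis(x, N):
    # Linear B-spline (hat) basis on [0,1] with N equally spaced interior knots
    H = 1.0 / (N + 1)
    xi = np.arange(N + 2) * H
    return np.clip(1.0 - np.abs(x[:, None] - xi[None, :]) / H, 0.0, None)

def pilot_spline_fit(X, Y, N, n_iter=30):
    """Step 1: maximize the spline quasi-likelihood (5) for a Bernoulli response
    by Newton-Raphson over an additive, empirically centered linear-spline design."""
    n, d = X.shape
    blocks = []
    for l in range(d):
        B = hat_basis(X[:, l], N)
        Bc = B - B.mean(axis=0)
        blocks.append(Bc[:, :-1])   # drop one column: centered hats are collinear
    D = np.hstack([np.ones((n, 1))] + blocks)
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = sigmoid(D @ beta)
        W = p * (1.0 - p) + 1e-10
        A = D.T @ (D * W[:, None]) + 1e-8 * np.eye(D.shape[1])  # tiny ridge
        beta += np.linalg.solve(A, D.T @ (Y - p))
    return D, beta, blocks

def sbk_component_1(X, Y, N, h, grid):
    """Step 2: the SBK estimator (7) of m_1. The pilot fit of m_{-1} is plugged
    into the kernel quasi-likelihood and maximized over the scalar a by Newton."""
    D, beta, blocks = pilot_spline_fit(X, Y, N)
    p1 = blocks[0].shape[1]
    m_minus1 = D @ beta - blocks[0] @ beta[1:1 + p1]  # c_hat + sum_{l>=2} m_hat_l
    est = np.empty(len(grid))
    for j, x1 in enumerate(grid):
        u = (X[:, 0] - x1) / h
        K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h
        a = 0.0
        for _ in range(30):
            p = sigmoid(a + m_minus1)
            a += np.sum(K * (Y - p)) / (np.sum(K * p * (1.0 - p)) + 1e-10)
        est[j] = a
    return est
```

On data generated from a logistic additive model, `sbk_component_1` recovers the shape of \(m_{1}\) on a grid of interior points; the estimators for the other components are obtained by permuting the roles of the coordinates.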

We now introduce some useful results and definitions from Liu et al. (2013): under Assumptions (A1)–(A7) in the Appendix, as \(n\rightarrow \infty \),

$$\begin{aligned}&\sup _{x_{1}\in [0,1]}\big \vert \widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) -\widetilde{m}_{\mathop {\mathrm{K}},1}\left( x_{1}\right) \big \vert = \mathcal {O}_{a.s.}\left( n^{-1/2}\log n\right) , \end{aligned}$$
(8)
$$\begin{aligned}&\widetilde{m}_{\mathop {\mathrm{K}},1}\left( x_{1}\right) -m_{1}\left( x_{1}\right) = \mathop {\mathrm{bias}}\nolimits _{1}\left( x_{1}\right) h^{2}/D_{1}\left( x_{1}\right) \nonumber \\&\quad +\,n^{-1}\sum \limits _{i=1}^{n}K_{h}\left( X_{i1}-x_{1}\right) \sigma \left( \mathbf {X}_{i}\right) \varepsilon _{i}/D_{1}\left( x_{1}\right) +r_{\mathop {\mathrm{ K}},1}\left( x_{1}\right) \end{aligned}$$
(9)

in which the higher order remainder \(r_{\mathop {\mathrm{K}},1}\left( x_{1}\right) \) satisfies

$$\begin{aligned} \sup _{x_{1}\in \left[ h,1-h\right] }\big \vert r_{\mathop {\mathrm{K}},1}\left( x_{1}\right) \big \vert =\mathcal {O}_{a.s.}\left( n^{-1/2}h^{1/2}\log n\right) . \end{aligned}$$
(10)

The scale function \(D_{1}\left( x_{1}\right) \) and bias function \(\mathop {\mathrm{ bias}}\nolimits _{1}\left( x_{1}\right) \) are defined in Liu et al. (2013) as:

$$\begin{aligned} \sigma _{b}^{2}\left( x_{1}\right)= & {} \mathop {\mathrm{E}}\left[ b^{\prime \prime }\left\{ m\left( \mathbf {X}\right) \right\} |X_{1}=x_{1}\right] ,\text { } \sigma ^{2}\left( x_{1}\right) =\mathop {\mathrm{E}}\left\{ \sigma ^{2}\left( \mathbf { X}\right) |X_{1}=x_{1}\right\} \nonumber \\ D_{1}\left( x_{1}\right)= & {} f_{1}\left( x_{1}\right) \sigma _{b}^{2}\left( x_{1}\right) ,v_{1}^{2}\left( x_{1}\right) =\left\| K\right\| _{2}^{2}f_{1}\left( x_{1}\right) \sigma ^{2}\left( x_{1}\right) . \end{aligned}$$
(11)
$$\begin{aligned} \mathop {\mathrm{bias}}\nolimits _{1}\left( x_{1}\right)= & {} \mu _{2}\left( K\right) \times \left\{ m_{1}^{\prime \prime }\left( x_{1}\right) D_{1}\left( x_{1}\right) +m_{1}^{\prime }\left( x_{1}\right) f\left( x_{1}\right) \sigma _{b}^{2}\left( x_{1}\right) ^{\prime }\right. \\&\left. -\left\{ m_{1}^{\prime }\left( x_{1}\right) \right\} ^{2}f\left( x_{1}\right) \mathop {\mathrm{E}}\left[ b^{\prime \prime \prime }\left\{ m\left( \mathbf {X}\right) \right\} |X_{1}=x_{1}\right] \right\} \end{aligned}$$

where \(\left\| K\right\| _{2}^{2}=\int K^{2}\left( u\right) du\), \(\mu _{2}\left( K\right) =\int K\left( u\right) u^{2}du\). Equations (8), (9) and (10) above lead to a simple decomposition of the estimation error \(\widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) -m_{1}\left( x_{1}\right) \):

$$\begin{aligned}&\sup _{x_{1}\in \left[ h,1-h\right] }\Big \vert \widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) -m_{1}\left( x_{1}\right) -n^{-1}\sum \limits _{i=1}^{n}K_{h}\left( X_{i1}-x_{1}\right) \sigma \left( \mathbf {X}_{i}\right) \varepsilon _{i}/D_{1}\left( x_{1}\right) \Big \vert \nonumber \\&\quad \quad \quad =\mathcal {O}_{a.s.}\left( n^{-1/2}h^{1/2}\log n+n^{-1/2}\log n+h^{2}\right) . \end{aligned}$$
(12)

The decomposition in (12) is fundamental for constructing the SCCs in Sect. 3; it follows from Theorems 1 and 4 of Liu et al. (2013), which were proved under weak dependence. The analogous Theorem 2 in Horowitz and Mammen (2004) for the two-stage estimator was established only for a fixed \(x_{1}\), not uniformly for \(x_{1}\) in the growing interval \(\left[ h,1-h\right] \), and exclusively for iid data, not dependent data; see the detailed discussion on page 621 of Liu et al. (2013).

3 GAM inference via simultaneous confidence corridor

In this section, we propose SCCs for GAM components based on the SBK smoothing, extending the works for univariate nonparametric function estimation in Bickel and Rosenblatt (1973) and Härdle (1989).

3.1 Main results

Denote \(a_{h}=\sqrt{-2\mathop {\mathrm{log}}h},C\left( K\right) =\left\| K^{\prime }\right\| _{2}^{2}\left\| K\right\| _{2}^{-2}\) and for any \(\alpha \in \left( 0,1\right) \), the quantile

$$\begin{aligned} Q_{h}(\alpha )=a_{h}+a_{h}^{-1}\left[ \mathop {\mathrm{log}}\left\{ \sqrt{C\left( K\right) }/\left( 2\pi \right) \right\} -\mathop {\mathrm{log}}\left\{ - \mathop {\mathrm{log}}\sqrt{1-\alpha }\right\} \right] . \end{aligned}$$
(13)

Also with \(D_{1}\left( x_{1}\right) \) and \(v_{1}^{2}\left( x_{1}\right) \) given in (11), we define

$$\begin{aligned} \sigma _{n}\left( x_{1}\right) =n^{-1/2}h^{-1/2}v_{1}\left( x_{1}\right) D_{1}^{-1}\left( x_{1}\right) . \end{aligned}$$
(14)

Theorem 1

Under Assumptions (A1)–(A7), as \(n\rightarrow \infty \)

$$\begin{aligned} \lim _{n\rightarrow \infty }\mathop {\mathrm{P}}\left\{ \sup \nolimits _{x_{1}\in [h,1-h]}\big \vert \widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) -m_{1}\left( x_{1}\right) \big \vert /\sigma _{n}\left( x_{1}\right) \le Q_{h}\left( \alpha \right) \right\} =1-\alpha . \end{aligned}$$

A \(100\left( 1-\alpha \right) \%\) simultaneous confidence corridor for \( m_{1}\left( x_{1}\right) \) is

$$\begin{aligned} \widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) \pm \sigma _{n}\left( x_{1}\right) Q_{h}\left( \alpha \right) . \end{aligned}$$
(15)

The above SCC for the component function \(m_{1}\left( x_{1}\right) \) resembles the SCCs in Bickel and Rosenblatt (1973) and Härdle (1989) for estimating an unknown univariate nonparametric function, although it is constructed in a multivariate nonparametric regression setting.
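The quantile (13) and the band (15) are explicit once \(\sigma _{n}\) is available. A minimal sketch (Python; the constant \(C(K)=3\) corresponds to the quartic kernel \(K(u)=\frac{15}{16}(1-u^{2})^{2}\) and is our assumption here, since the theorem keeps the kernel generic):

```python
import math

def scc_quantile(alpha, h, C_K=3.0):
    """Q_h(alpha) from (13), with a_h = sqrt(-2 log h) and
    C(K) = ||K'||_2^2 / ||K||_2^2 (equal to 3 for the quartic kernel)."""
    a_h = math.sqrt(-2.0 * math.log(h))
    return a_h + (math.log(math.sqrt(C_K) / (2.0 * math.pi))
                  - math.log(-math.log(math.sqrt(1.0 - alpha)))) / a_h

def scc_band(m_hat, sigma_n, alpha, h):
    """Lower/upper SCC limits (15): m_hat(x) -/+ sigma_n(x) * Q_h(alpha),
    evaluated on parallel lists of estimator and scale values."""
    q = scc_quantile(alpha, h)
    return ([m - s * q for m, s in zip(m_hat, sigma_n)],
            [m + s * q for m, s in zip(m_hat, sigma_n)])
```

Note that \(Q_{h}(\alpha )\) grows as either \(\alpha \) or h decreases, reflecting the slow \(\sqrt{-2\log h}\) extreme value rate.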

3.2 Implementation

To satisfy Assumption (A4), one could use the transformed \( U_{il}=F_{nl}\left( X_{il}\right) \) instead of \(X_{il}\) as predictors for each \(l=1,\ldots ,d\) and \(i=1,\ldots ,n,\) where \(F_{nl}\) is the empirical distribution function of \((X_{1l},\ldots ,X_{nl})\). We still use the symbol X instead of U to avoid introducing new notation, but the X variates have been so transformed in the simulation study and applications.
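This transform is a column-wise rank transform; a short sketch (Python/NumPy, with a hypothetical helper name):

```python
import numpy as np

def ecdf_transform(X):
    """Replace each predictor column by its empirical distribution function
    values U_il = F_nl(X_il); for continuous data each column becomes a
    permutation of {1/n, 2/n, ..., 1} (ties are broken arbitrarily)."""
    n, d = X.shape
    U = np.empty_like(X, dtype=float)
    for l in range(d):
        # argsort of argsort gives the 0-based rank of X_il within its column
        U[:, l] = (np.argsort(np.argsort(X[:, l])) + 1) / n
    return U
```

The transformed predictors then have (near) uniform marginals on a compact set, as Assumption (A4) requires.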

To construct the SCC for \(m_{1}\left( x_{1}\right) \) in (15), one needs to select the bandwidth h and the number of knots N to evaluate \(\widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) ,\) \(Q_{h}\left( \alpha \right) \) and \(\sigma _{n}\left( x_{1}\right) \) given in (7), (13) and (14).

Assumption (A6) requires that the bandwidth for SCCs differ from the mean square optimal bandwidth \(h_{\mathop {\mathrm{opt}}}\sim n^{-1/5}\) (minimizing AMISE) in Liu et al. (2013). This is because the two conflicting goals in SCC construction, coverage of the true curve and narrowness of the corridor, cannot be quantified by a single measure to minimize, such as the mean integrated squared error. We therefore take \(h=h_{ \mathop {\mathrm{opt}}}(\log n)^{-1/4}\) as a data-driven undersmoothing bandwidth for SCC construction to fulfill Assumption (A6), where \(h_{\mathop {\mathrm{opt}}}\) is computed as in Liu et al. (2013), pages 623–624. Recent articles on SCCs for time series, such as Wu and Zhao (2007) and Zhao and Wu (2008), have used similar undersmoothing bandwidths.
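The undersmoothing rule is a one-line adjustment of the MISE-optimal bandwidth, taken as given here:

```python
import math

def scc_bandwidth(h_opt, n):
    """Undersmoothing bandwidth h = h_opt * (log n)^(-1/4) for SCC construction,
    where h_opt is the MISE-optimal bandwidth (of order n^(-1/5)), computed
    separately as in Liu et al. (2013)."""
    return h_opt * math.log(n) ** (-0.25)
```

The extra \((\log n)^{-1/4}\) factor shrinks slowly, so h remains of order \(n^{-1/5}\) up to a logarithmic term while asymptotically removing the bias from the corridor.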

For a given l and a chosen bandwidth h, one can easily compute \(\widehat{m}_{ \mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) \) and \(Q_{h}\left( \alpha \right) \) as in (7), (13). To evaluate \(\sigma _{n}\left( x_{1}\right) \), one needs to estimate \(v_{1}\left( x_{1}\right) \) and \( D_{1}^{-1}\left( x_{1}\right) \) given in (11), i.e., to estimate \( f\left( x_{1}\right) ,\sigma _{b}^{2}\left( x_{1}\right) \) and \(\sigma ^{2}\left( x_{1}\right) \). The density function \(f\left( x_{1}\right) \) is estimated by \(\widehat{f}\left( x_{1}\right) =\) \(n^{-1}\sum _{i=1}^{n}K_{h_{ \mathop {\mathrm{ROT}}\nolimits }}\left( X_{i1}-x_{1}\right) \), where \(h_{\mathop {\mathrm{ROT}}\nolimits }\) is the rule-of-thumb bandwidth in equation (5.8), page 200 of Fan and Yao (2003), namely \(h_{\mathop {\mathrm{ROT}}\nolimits }=\left( 8\sqrt{\pi }/3\right) ^{1/5}\mu _{2}\left( K\right) ^{-2/5}\left\| K\right\| _{2}^{2/5}n^{-1/5}\hat{\sigma }\), in which \( \hat{\sigma }\) is the sample standard deviation of \(\left\{ X_{i1}\right\} _{i=1}^{n}\). We further illustrate the spline estimates of \(\sigma _{b}^{2}\left( x_{1}\right) \) and \(\sigma ^{2}\left( x_{1}\right) \) below.

One partitions the range of \(X_{i1}\) by knots \(\min _{i}X_{i1}=t_{1,0}<\cdots <t_{1,N+1}=\max _{i}X_{i1}\), where N, the number of spline interior knots, is taken to be

$$\begin{aligned} \max \left( 1,\min \left( \left\lfloor n^{1/4}\log n+1\right\rfloor ,\left\lfloor n/4d-1/d\right\rfloor -1\right) \right) , \end{aligned}$$
(16)

which satisfies Assumption (A7) in the Appendix. Then \(\sigma _{b}^{2}\left( x_{1}\right) \) can be estimated as \(\sum _{k=0}^{3}\widehat{a} _{1,k}x_{1}^{k}+\sum _{k=4}^{N+3}\widehat{a}_{1,k}\left( x_{1}-t_{1,k-3}\right) ^{3}_{+}\), where \(\left\{ \widehat{a}_{1,k}\right\} _{k=0}^{N+3}\) minimize

$$\begin{aligned} \sum \limits _{i=1}^{n}\left[ b^{\prime \prime }\left\{ \widehat{m}\left( \mathbf {X}_{i}\right) \right\} -\left\{ \sum \limits _{k=0}^{3}a_{1,k}X_{i1}^{k}+\sum \nolimits _{k=4}^{N+3}a_{1,k} \left( X_{i1}-t_{1,k-3}\right) ^{3}_{+}\right\} \right] ^{2}, \end{aligned}$$
(17)

and \(\sigma ^{2}\left( x_{1}\right) \) can be estimated as \(\sum _{k=0}^{3} \widehat{a}_{1,k}x_{1}^{k}+\sum _{k=4}^{N+3}\widehat{a}_{1,k}\left( x_{1}-t_{1,k-3}\right) ^{3}_{+}\), where \(\left\{ \widehat{a}_{1,k}\right\} _{k=0}^{N+3}\) minimize

$$\begin{aligned} \sum \limits _{i=1}^{n}\left[ \left[ Y_{i}-b^{\prime }\left\{ \widehat{m} \left( \mathbf {X}_{i}\right) \right\} \right] ^{2}-\left\{ \sum \limits _{k=0}^{3}a_{1,k}X_{i1}^{k}+\sum \limits _{k=4}^{N+3}a_{1,k}\left( X_{i1}-t_{1,k-3}\right) ^{3}_{+}\right\} \right] ^{2}. \end{aligned}$$
(18)
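Both (17) and (18) are ordinary least-squares fits on a truncated cubic power basis, differing only in the pseudo-responses; a sketch (Python/NumPy, helper names ours), together with the knot rule (16):

```python
import math
import numpy as np

def num_interior_knots(n, d):
    """The number of interior knots N in (16)."""
    return max(1, min(math.floor(n ** 0.25 * math.log(n) + 1),
                      math.floor(n / (4 * d) - 1 / d) - 1))

def truncated_cubic_design(x, knots):
    """Design matrix [1, x, x^2, x^3, (x - t_1)_+^3, ..., (x - t_N)_+^3]
    for the cubic spline least-squares fits in (17) and (18)."""
    cols = [x ** k for k in range(4)]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

def spline_ls_fit(x, z, knots):
    """Least-squares fit of responses z on the truncated cubic basis, returning
    a callable estimate. With z_i = b''{m_hat(X_i)} this yields sigma_b^2(x_1)
    as in (17); with z_i = [Y_i - b'{m_hat(X_i)}]^2 it yields sigma^2(x_1) as
    in (18)."""
    D = truncated_cubic_design(x, knots)
    coef, *_ = np.linalg.lstsq(D, z, rcond=None)
    return lambda x0: truncated_cubic_design(np.atleast_1d(x0), knots) @ coef
```

Since global cubics lie in the span of the basis, the fit reproduces any cubic trend exactly, which is a convenient sanity check of the implementation.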

The resulting estimate \(\hat{\sigma }_{n}\left( x_{1}\right) \) of \(\sigma _{n}\left( x_{1}\right) \), using (17) and (18), satisfies \(\sup _{x_{1}\in \left[ h,1-h\right] }\big \vert \hat{\sigma }_{n}\left( x_{1}\right) -\sigma _{n}\left( x_{1}\right) \big \vert =\mathcal {O}_{p}\left( n^{-\gamma }\right) \) for some \(\gamma >0\); see Liu et al. (2013), Sect. 5 for details. This consistency and Slutsky’s theorem ensure that

$$\begin{aligned} \mathop {\mathrm{P}}\left\{ \sup \nolimits _{x_{1}\in [h,1-h]}\big \vert \widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) -m_{1}\left( x_{1}\right) \big \vert /\hat{\sigma }_{n}\left( x_{1}\right) \le Q_{h}\left( \alpha \right) \right\} \rightarrow 1-\alpha \end{aligned}$$

as \(n\rightarrow \infty \), and therefore \(\widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) \pm \hat{\sigma }_{n}\left( x_{1}\right) Q_{h}\left( \alpha \right) \) is a \(100\left( 1-\alpha \right) \%\) simultaneous confidence corridor for \(m_{1}\left( x_{1}\right) \). The SCC constructions for the other components \(m_{2}\left( x_{2}\right) ,\ldots ,m_{d}\left( x_{d}\right) \) are similar. It is worthwhile to emphasize that, based on extensive simulation experiments, the estimators \(\widehat{m}_{\mathop {\mathrm{SBK}}\nolimits ,1}\left( x_{1}\right) ,\) \(\widehat{Q}_{h}\left( \alpha \right) ,\widehat{f}\left( x_{1}\right) \) and \(\hat{\sigma }_{n}\left( x_{1}\right) \) remain stable when h and N vary slightly.

4 Variable selection in GAM

In this section, we propose a Bayesian Information Criterion (BIC) for component function selection based on the spline smoothing in step one of the SBK estimation for GAM; an efficient implementation follows.

4.1 Main results

According to Stone (1985), p. 693, the space of l-centered square integrable functions on \(\left[ 0,1\right] \) is defined as

$$\begin{aligned} \mathcal {H}_{l}^{0}=\left\{ g:\mathop {\mathrm{E}}\left\{ g\left( X_{l}\right) \right\} =0,\mathop {\mathrm{E}}\left\{ g^{2}\left( X_{l}\right) \right\} <\infty \right\} ,\quad 1\le l\le d, \end{aligned}$$
(19)

and the model space \(\mathcal {M}\) is

$$\begin{aligned} \mathcal {M}=\left\{ g\left( \mathbf {x}\right) =c+\sum \limits _{l=1}^{d}g_{l}\left( x_{l}\right) ;c\in \mathbb {R} ,g_{l}\in \mathcal {H}_{l}^{0}, 1\le l\le d\right\} . \end{aligned}$$
(20)

To introduce the proposed BIC, let \(\left\{ 1,\ldots ,d\right\} \) denote the complete set of indices of the d predictor variables \(\left( X_{1},\ldots ,X_{d}\right) \). For each subset \(S\subset \left\{ 1,\ldots ,d\right\} \), define a corresponding model space \(\mathcal {M}_{S}\) as

$$\begin{aligned} \mathcal {M}_{S}=\left\{ g\left( \mathbf {x}\right) =c+\sum \limits _{l\in S}g_{l}\left( x_{l}\right) ;c\in \mathbb {R},g_{l}\in \mathcal {H} _{l}^{0},l\in S\right\} , \end{aligned}$$

with \(\mathcal {H}_{l}^{0}\) given in (19), and the space of the additive spline functions as

$$\begin{aligned} G_{n,S}^{0}=\left\{ g\left( \mathbf {x}\right) =c+\sum \limits _{l\in S}g_{l}\left( x_{l}\right) ;c\in \mathbb {R},g_{l}\in G_{n,l}^{0},l\in S\right\} , \end{aligned}$$

with \(G_{n,l}^{0}\) given in (4). Following Definition 1 of Huang and Yang (2004), the set \(S_{0}\) of significant variables is defined as the minimal set \(S\subset \left\{ 1,\ldots ,d\right\} \) such that \(m\in \mathcal {M}_{S}\). According to Lemma 1 of Huang and Yang (2004), the set \(S_{0}\) is uniquely defined. Standard theory of Hilbert space and subspace projection implies that the set \(S_{0}\) is also the minimal set \(S\subset \left\{ 1,\ldots ,d\right\} \) such that \(\mathop {\mathrm{E}}\{m\left( \mathbf {X}\right) -m_{S}\left( \mathbf {X}\right) \}^{2}=0\) in which the least squares projection of function m in \(\mathcal {M}_{S}\) is

$$\begin{aligned} m_{S}=\mathop {\mathrm{argmin}}_{g\in \mathcal {M}_{S}}\mathop {\mathrm{E}}\left\{ m\left( \mathbf {X}\right) -g\left( \mathbf {X}\right) \right\} ^{2}. \end{aligned}$$
(21)

To identify \(S_{0}\), one computes for an index set S the BIC as

$$\begin{aligned} \mathop {\mathrm{BIC}}\nolimits _{S}=-2\widehat{L}\left( \widehat{m}_{S}\right) +\frac{N_{S}}{n}\left( \log n\right) ^{3} \end{aligned}$$
(22)

where \(\widehat{L}\left( \cdot \right) \) is given in (5), \(\widehat{m}_{S}\left( \mathbf {x}\right) \in G_{n,S}^{0}\) is the pilot spline estimator as in (6), and \(N_{S}=1+\left( N+1\right) \#\left( S\right) \), with N the number of interior knots as defined in (16) and \(\#\left( S\right) \) the cardinality of S.

Our variable selection rule takes the subset \(\widehat{S}\subset \left\{ 1,\ldots ,d\right\} \) that minimizes BIC\(_{S}\).
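The criterion (22) is straightforward to evaluate once the maximized quasi-likelihood for a candidate subset is available (a sketch; `loglik_S` stands for the maximized value of (5) over \(G_{n,S}^{0}\), however it is computed):

```python
import math

def bic_value(loglik_S, n, N, card_S):
    """BIC_S in (22): -2 L_hat(m_hat_S) + (N_S / n) (log n)^3,
    where N_S = 1 + (N+1)|S| counts the free spline parameters of subset S."""
    N_S = 1 + (N + 1) * card_S
    return -2.0 * loglik_S + (N_S / n) * math.log(n) ** 3
```

The \((\log n)^{3}/n\) penalty rate is heavier than the classical \(\log n/n\), which is what drives the selection consistency of Theorem 2.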

Theorem 2

Under Assumptions (A1)–(A5), (A7), \(\lim _{n\rightarrow \infty }P\left( \widehat{S}=S_{0}\right) =1\).

According to Theorem 2, the variable selection rule based on the BIC in (22) is consistent. A nonparametric version of the BIC was first established in Huang and Yang (2004) for the additive autoregression model, and adapted to the additive coefficient model by Xue and Yang (2006) and to the single-index model by Wang and Yang (2009). Our proposed BIC differs from all of the above as it is based on quasi-likelihood rather than mean squared error, which makes the technical proof of consistency much more challenging. To the best of our knowledge, it is the first theoretically reliable information criterion in this setting.

4.2 Implementation

We have not implemented the BIC variable selection as an exhaustive search through all possible subsets. Instead, a forward stepwise procedure with BIC minimization as the criterion is used, since it is common that only a few variables among many are significant. We have also experimented with backward as well as forward–backward stepwise procedures, which yielded similar outcomes in the simulation examples.
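The forward search can be sketched generically as follows (Python; `bic` is any callable mapping a subset to its criterion value, e.g. a wrapper around the spline fit and (22)):

```python
def forward_stepwise(d, bic):
    """Greedy forward selection over variables 1..d: starting from the empty
    set, repeatedly add the variable whose inclusion lowers the criterion the
    most, and stop when no addition improves it."""
    S, current = frozenset(), bic(frozenset())
    while True:
        candidates = [(bic(S | {l}), l) for l in range(1, d + 1) if l not in S]
        if not candidates:
            break
        best, l_best = min(candidates)
        if best >= current:
            break
        S, current = S | {l_best}, best
    return S
```

At most \(d(d+1)/2\) subsets are evaluated instead of \(2^{d}\), which is what makes the procedure practical for \(d=50\).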

5 Simulation

This section studies, in simulated settings, the performance of the proposed procedures, including the computational cost of the SBK, the consistency of variable selection via BIC and the coverage frequency of the SCCs. The data are generated from

$$\begin{aligned} \mathop {\mathrm{P}}(Y=1|\mathbf {X=x})=b^{\prime }\left\{ c+\sum \limits _{l=1}^{d}m_{l}\left( X_{l}\right) \right\} ,b^{\prime }\left( x\right) =\frac{e^{x}}{1+e^{x}} \end{aligned}$$
(23)

with \(d=10,c=0,m_{3}\left( x\right) =\sin \left( 4\pi x\right) ,m_{4}\left( x\right) =m_{5}\left( x\right) =\sin \left( \pi x\right) ,\) \(m_{6}\left( x\right) =x,m_{7}\left( x\right) =e^{x}-(e-e^{-1})\) and \(m_{l}\left( x\right) =0\) for \(l=1,2,8,9,10\). The predictors are generated by

$$\begin{aligned} X_{il}=2\varPhi \left( Z_{il}\right) -1,\text { }\mathbf {Z}_{i}=\left( Z_{i1},\ldots ,Z_{id}\right) \sim \mathop {\mathrm{N}}\left( 0,\varSigma \right) ,\quad 1\le i\le n,\quad 1\le l\le d, \end{aligned}$$

where \(\varPhi \) is the standard normal c.d.f. and \(\varSigma =\left( 1-r\right) \mathbf {I}_{d\times d}+r\mathbf {1}_{d}\mathbf {1}_{d}^{T}\). The parameter r (\(0\le r<1\)) controls the correlation between the \(Z_{il},1\le l\le d\). To examine the computational advantage of BIC for large d, we have also included results for \(d=50\), with \(m_{3},\ldots ,m_{7}\) as above and all other component functions 0.
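The design (23) can be reproduced as follows (Python; a sketch of our simulation setup using a one-factor representation of the equicorrelated \(\varSigma \)):

```python
import math
import numpy as np

def generate_gam_data(n, d=10, r=0.5, seed=0):
    """Generate (X, Y) from (23): equicorrelated Gaussian Z with covariance
    Sigma = (1-r) I + r 1 1^T, predictors X_il = 2 Phi(Z_il) - 1 in [-1, 1],
    and a Bernoulli response with logit mean c + sum_l m_l(X_l), c = 0."""
    rng = np.random.default_rng(seed)
    # Z = sqrt(r) W0 + sqrt(1-r) W reproduces the equicorrelated covariance
    W0 = rng.standard_normal((n, 1))
    Z = math.sqrt(r) * W0 + math.sqrt(1.0 - r) * rng.standard_normal((n, d))
    Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    X = 2.0 * Phi(Z) - 1.0
    m = (np.sin(4 * np.pi * X[:, 2])                        # m_3
         + np.sin(np.pi * X[:, 3]) + np.sin(np.pi * X[:, 4])  # m_4, m_5
         + X[:, 5]                                          # m_6
         + np.exp(X[:, 6]) - (math.e - math.exp(-1.0)))     # m_7 as printed
    p = 1.0 / (1.0 + np.exp(-m))
    Y = (rng.uniform(size=n) < p).astype(float)
    return X, Y
```

One can verify on a simulated draw that the predictors lie in \((-1,1)\) and that the induced pairwise correlation of the X columns is close to its Gaussian-copula value, roughly \((6/\pi )\arcsin (r/2)\).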

COSSO is a penalized likelihood method proposed in Zhang and Lin (2006) for LASSO-type component selection and nonparametric regression in exponential families. In what follows, the performance of BIC and COSSO is first compared, followed by a computational comparison between the SBK and a kernel method in GAM; we end with a report on the SCC coverage frequency for the component functions (the frequency with which the SCC covers the entire curve over the domain). We have tried numbers of knots different from the one in (16), with similar results, so we conclude that the performance of BIC is rather insensitive to the number of knots.

Table 1 Simulation comparison of the proposed BIC method and COSSO with \(d=10, 50\)

Table 1 shows the simulation results from 500 replications, where the outcome is classified as correct fitting if \(\widehat{S} =S_{0}\); overfitting if \(S_{0}\subset \widehat{S}\); and underfitting if \(S_{0}\nsubseteq \widehat{S}\). Clearly, the performance of BIC in selecting the 5 significant variables \(m_{l}\left( X_{l}\right) ,l=3,\ldots ,7,\) is quite satisfactory. The selection accuracy becomes higher as the sample size increases and/or the correlation decreases; it is poorer at the higher dimension d (\(=\!50\)) but remains high at sample size \(n=1000\). The accuracy and computing time of COSSO are also listed for comparison (Platform: R; PC: Intel 3.1 GHz processor and 8 GB RAM). Table 1 shows that the BIC significantly outperforms the COSSO in both accuracy and computing time, and the advantage in computing time widens substantially for \(d=50\).
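The three-way classification of a selection outcome used in Table 1 can be stated as a short sketch (the function name `fit_outcome` is our own, for illustration only):

```python
def fit_outcome(S_hat, S_0=frozenset({3, 4, 5, 6, 7})):
    """Classify a selected index set S_hat against the true set S_0."""
    S_hat, S_0 = set(S_hat), set(S_0)
    if S_hat == S_0:
        return "correct fitting"
    if S_0 < S_hat:          # strict superset: all true terms kept, extras added
        return "overfitting"
    return "underfitting"    # S_0 is not a subset: a true term was missed
```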

In addition to the above comparison for model selection, we have also conducted a numerical comparison between COSSO and our proposed SBK estimation method in terms of probability prediction. The proposed SBK method has higher prediction accuracy in almost all cases; see Table 4 in the Supplement. No comparison regarding SCCs has been made against COSSO because it does not produce one.

The SCCs coverage frequency for \(m_{l}\left( x_{l}\right) ,l=1,\ldots ,7\) is reported in Table 2. Among the zero functions, we have omitted the results for \(m_{8},m_{9}\) and \(m_{10}\) because they are very similar to those for \(m_{1}\) and \(m_{2}\). The empirical coverage approaches the nominal confidence levels as n increases, and coverage is better when the correlation is lower. The coverage frequencies vary only slightly as d increases; these numerical results are omitted for brevity. We have also compared the coverage frequency of our SCC with the volume-of-tube (VOT) method in the same setup as simulation 1 of Wiesenfarth et al. (2012), which considered only the case of the trivial link function. The performance of our proposed SCC is quite similar to that of the VOT method of Wiesenfarth et al. (2012); see Table 3 in the Supplement.

Table 2 The \(95~\%\) SCCs coverage frequency for \(m_{l}\left( x\right) ,\) \( l=1,2,\ldots ,7\) from 2000 replications

The above studies clearly indicate the reliability of our methodology, namely the high selection accuracy of the BIC and the desired coverage frequency of the SCCs. This warrants their application to credit rating modelling in the following section.

6 Application

We now return to forecasting the default probabilities of listed companies in Japan. The data, taken from the Risk Management Institute, National University of Singapore, include the comprehensive financial statements and the credit events (default or bankruptcy) of 3583 Japanese firms from 2005 to 2010.

Berg (2007) found that the liability status was an important indicator of the creditworthiness of a company, while Bernhardsen (2001) and Ryser and Denzler (2009) proposed to consider the “leverage effect” expressed by financial statement ratios. Therefore, we have pooled the two viewpoints by considering \(X_{1}\): Current liability, \(X_{2}\): Current stock return, \(X_{3}\): Long-term borrow, \(X_{4}\): Short-term borrow, \(X_{5}\): Total asset, \(X_{6}\): Non-current liability, \(X_{7}\): 3 months earlier (stock) return, \(X_{8}\): 6 months earlier (stock) return, \(X_{9}\): Current ratio, \(X_{10}\): Net liability to shareholder equity, \(X_{11}\): Shareholder equity to total liability and equity, \(X_{12}\): TCE ratio, \(X_{13}\): Total debt to total asset, \(X_{14}\): Quick ratio.

Selecting the rating factors via the BIC given in (22), we have found that \(X_{1}\): Current liabilities, \(X_{7}\): 3 months earlier return, and \(X_{8}\): 6 months earlier return are significant. Similar rating covariates were also discovered in Shina and Moore (2003), Berg (2007) and Ryser and Denzler (2009). However, Berg (2007) selected 23 variables, which led to a non-parsimonious GAM. In contrast, Ryser and Denzler (2009) found that 3 financial ratios (capital turnover, long-term debt ratio, return on total capital) were significant based on the blockwise cross-validation (CV) method, which is nonetheless extremely time-consuming in comparison to the proposed BIC.

Figure 1a–c depicts the SBK estimator of each factor’s default impact curve over its domain, while a shoal of \(95~\%\) CIs and the \(95~\%\) SCCs present, respectively, the pointwise and global uncertainty of the whole curve. The SBK estimators indicate overall monotonicity of each rating factor, and the SCCs turn out to be fairly narrow, warranting the global nonlinearity of the factors’ curves, which reveals underlying nonlinear features in different segments of the domain.

As for model evaluation, the cumulative accuracy profile (CAP) is plotted in Fig. 1d. For any score function S, one defines its alarm rate \(F\left( s\right) =P\left( S\le s\right) \) and its hit rate \(F_{\mathop {\mathrm{D}}}\left( s\right) =P\left( S\le s\vert \mathop {\mathrm{D}}\right) \), where \(\mathop {\mathrm{D}}\) represents the conditioning event of “default”. One then defines the \(\mathop {\mathrm{CAP}}\) curve as

$$\begin{aligned} \mathop {\mathrm{CAP}}\left( u\right) =F_{\mathop {\mathrm{D}}}\left( F^{-1}\left( u\right) \right) ,\quad u\in \left( 0,1\right) , \end{aligned}$$
(24)

which is the percentage of default-infected obligors found among the first (according to their scores) \(100u~\%\) of all obligors. A satisfactory model’s CAP is expected to approach that of the perfect model (i.e., \(\mathop {\mathrm{CAP}}_{\mathop {\mathrm{P}}}\left( u\right) =\min \left( u/p,1\right) ,u\in \left( 0,1\right) \), where p is the unconditional default probability) and to always lie above the noninformative one. In contrast, a noninformative rating method with zero discriminatory power displays the diagonal line \(\mathop {\mathrm{CAP}}_{\mathop {\mathrm{N}}}\left( u\right) \equiv u,u\in \left( 0,1\right) \). See Engelmann et al. (2003) for details of the CAP.
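Empirically, with observed scores and default indicators, the CAP curve in (24) is evaluated on the grid \(u=k/n\): among the k obligors with the lowest (riskiest) scores, one records the fraction of all defaulters captured. A minimal Python sketch of this, our own illustration rather than the paper's implementation, assuming lower scores flag riskier obligors:

```python
def empirical_cap(scores, defaults):
    """Empirical CAP curve of (24): sort obligors by score, lowest
    (riskiest) first, and record the fraction of all defaulters found
    among the first k obligors, k = 1, ..., n."""
    pairs = sorted(zip(scores, defaults))  # ascending score
    n_def = sum(d for _, d in pairs)       # total number of defaulters
    cap, hits = [], 0
    for _, d in pairs:
        hits += d
        cap.append(hits / n_def)           # cap[k-1] approximates CAP(k/n)
    return cap
```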

The AR is the ratio of two areas \(a_{R}\) and \(a_{P}\). The area between the given CAP curve and the noninformative diagonal \(\mathop {\mathrm{CAP}}_{\mathop {\mathrm{N}} }\left( u\right) \equiv u\) is \(a_{R}\), whereas \(a_{P}\) is the area between the perfect CAP curve \(\mathop {\mathrm{CAP}}_{\mathop {\mathrm{P}}}\left( u\right) \) and the noninformative diagonal \(\mathop {\mathrm{CAP}}_{\mathop {\mathrm{N}}}\left( u\right) \). Thus,

$$\begin{aligned} \mathop {\mathrm{AR}}=\frac{a_{R}}{a_{P}}=\frac{2\int _{0}^{1}\mathop {\mathrm{CAP}}\left( u\right) \mathrm{d}u-1}{1-p}, \end{aligned}$$
(25)

where \(\mathop {\mathrm{CAP}}\left( u\right) \) is given in (24). The AR takes values in \(\left[ 0,1\right] \), with 0 corresponding to the noninformative scoring method and 1 to the perfect one; a higher AR indicates an overall higher discriminatory power of a method.
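A minimal sketch of computing the AR in (25) from data (our own illustration; the integral of the empirical CAP, taken as piecewise linear from the origin through the points \(\left( k/n,\mathop {\mathrm{CAP}}(k/n)\right) \), is approximated by the trapezoidal rule):

```python
def accuracy_ratio(scores, defaults):
    """AR of (25): AR = (2 * integral of CAP - 1) / (1 - p), with p the
    unconditional default rate and the integral approximated by
    trapezoids over the grid u = k/n."""
    pairs = sorted(zip(scores, defaults))     # riskiest (lowest score) first
    n = len(pairs)
    n_def = sum(d for _, d in pairs)
    p = n_def / n                             # unconditional default rate
    prev, hits, integral = 0.0, 0, 0.0
    for _, d in pairs:
        hits += d
        cap = hits / n_def                    # CAP(k/n)
        integral += (prev + cap) / (2.0 * n)  # trapezoid on [(k-1)/n, k/n]
        prev = cap
    return (2.0 * integral - 1.0) / (1.0 - p)
```

On this discrete grid a perfect ordering (all defaulters ranked first) yields AR exactly 1, and a fully reversed ordering yields \(-1\).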

Using both GAM and GLM fitted on the first 2000 companies to predict the default rate of the remaining 1583 companies, the accuracy ratio is \(97.56~\%\) for GAM, much higher than the \(89.76~\%\) for GLM. We have also applied the COSSO method to the same data, and the following error message appeared: “Error in solve.QP(GH$H, GH$H %*% old.theta - GH$G, t(Amat), bvec): matrix D in quadratic function is not positive definite!”, which once again illustrates the advantage of the proposed BIC procedure over the existing method.