1 Introduction

Mixtures of regression models are commonly used as “model-based clustering” methods to reveal the relationship among variables of interest when the population consists of several homogeneous subgroups. Such models are widely applied in econometrics, where they are also known as switching regression models (Wedel and DeSarbo 1993; Frühwirth-Schnatter 2001), and in other fields such as epidemiology (Green and Richardson 2002). Another broad application of finite mixtures of regressions is outlier detection and robust regression estimation (Young and Hunter 2010). Traditional mixtures of linear regression models require strong parametric assumptions: linear component regression functions, constant component variances, and constant component proportions. In machine learning, the fully parametric hierarchical mixtures of experts model (Jordan and Jacobs 1994) was proposed to allow the component proportions to depend on the covariates. Recently, many semiparametric and nonparametric mixture regression models have been proposed to relax the parametric assumptions of mixtures of regression models; see, for example, Young and Hunter (2010), Huang and Yao (2012), Cao and Yao (2012), Huang et al. (2013, 2014), Hu et al. (2017), and Xiang and Yao (2018), among others. Xiang et al. (2019) provided a good review of many semiparametric mixture regression models. However, most existing semiparametric or nonparametric mixture regressions can only handle low dimensional predictors due to the “curse of dimensionality”. It is therefore desirable to relax the parametric assumptions of traditional mixtures of regression models when the dimension of the predictors is high.

In this article, we propose a mixture of single-index models (MSIM) and a mixture of regression models with varying single-index proportions (MRSIP) to reduce the dimension of high dimensional predictors before modeling them nonparametrically. Many existing popular models can be considered as special cases of the proposed two models. Huang et al. (2013) proposed the nonparametric mixture of regression models

$$\begin{aligned} f(y|x,\pi , m,\sigma ^2)=\sum _{j=1}^k \pi _j(x)\phi (y|m_j(x),\sigma _j^2(x)), \end{aligned}$$

where \(\pi _j(x), m_j(x),\) and \(\sigma _j^2(x)\) are unknown smooth functions, and \(\phi (y|\mu ,\sigma ^2)\) is the normal density with mean \(\mu \) and variance \(\sigma ^2\). Their model can drastically reduce the modeling bias when the strong parametric assumptions of traditional mixtures of linear regression models do not hold. However, the above model is not applicable to high dimensional predictors because of the kernel estimation used for the nonparametric parts. To solve this problem, we propose a mixture of single-index models

$$\begin{aligned} f(y|x,{\varvec{\alpha }},\pi , m,\sigma ^2)=\sum _{j=1}^k\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\phi (y|m_j({\varvec{\alpha }}^\top {{\varvec{x}}}),\sigma _j^2({\varvec{\alpha }}^\top {{\varvec{x}}})), \end{aligned}$$
(1)

in which the single index \({\varvec{\alpha }}^\top {{\varvec{x}}}\) reduces the high dimensional nonparametric problem to a univariate nonparametric problem. When \(k=1\), model (1) reduces to a single-index model (Ichimura 1993; Härdle et al. 1993). If \({{\varvec{x}}}\) is a scalar, then model (1) reduces to the nonparametric mixture of regression models proposed by Huang et al. (2013). Zeng (2012) also applied the single-index idea to the component means and variances, but assumed that the component proportions do not depend on the predictor \({{\varvec{x}}}\); moreover, Zeng (2012) did not provide any theoretical properties of the proposed estimates.
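To make the structure of model (1) concrete, the following minimal Python sketch (our own illustration, not the authors' code; the function name and interface are hypothetical) evaluates the conditional density in (1) for user-supplied component functions and index vector.

```python
import numpy as np
from scipy.stats import norm

def msim_density(y, x, alpha, pi_funcs, m_funcs, sigma2_funcs):
    """Evaluate the MSIM density (1) at (y, x).

    pi_funcs, m_funcs, sigma2_funcs are lists of k univariate functions
    of the single index z = alpha^T x; the pi_funcs should sum to one.
    """
    z = np.dot(alpha, x)  # single index alpha^T x
    dens = 0.0
    for pi_j, m_j, s2_j in zip(pi_funcs, m_funcs, sigma2_funcs):
        dens += pi_j(z) * norm.pdf(y, loc=m_j(z), scale=np.sqrt(s2_j(z)))
    return dens
```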

Young and Hunter (2010) and Huang and Yao (2012) proposed a semiparametric mixture of regression models

$$\begin{aligned} f(y|x,\pi , {\varvec{\beta }},\sigma ^2)=\sum _{j=1}^k \pi _j({{\varvec{x}}})\phi (y|{{\varvec{x}}}^\top {\varvec{\beta }}_j,\sigma _j^2), \end{aligned}$$

where the \(\pi _j({{\varvec{x}}})\)’s are unknown smooth functions, to combine the nice properties of both nonparametric and traditional parametric mixture regression models. Their semiparametric mixture models let the component proportions depend on the covariates nonparametrically to reduce the modeling bias, while the component regression functions are still assumed to be linear for better model interpretation. However, their estimation procedures cannot be applied when the dimension of the predictors \({{\varvec{x}}}\) is high, due to the kernel estimation used for \(\pi _j({{\varvec{x}}})\). We propose a mixture of regression models with varying single-index proportions

$$\begin{aligned} f(y|x,{\varvec{\alpha }},\pi ,{\varvec{\beta }},\sigma ^2)=\sum _{j=1}^k \pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\phi (y|{{\varvec{x}}}^\top {\varvec{\beta }}_j,\sigma _j^2), \end{aligned}$$
(2)

which uses the idea of single index to model the nonparametric effect of predictors on component proportions, while allowing easy interpretation of linear component regression functions. When \(k=1\), model (2) reduces to the traditional linear regression model. If \({{\varvec{x}}}\) is a scalar, then model (2) reduces to the semiparametric mixture models considered by Young and Hunter (2010) and Huang and Yao (2012). Modeling component proportions nonparametrically can reduce the modeling bias and better cluster the data when the traditional parametric assumptions of component proportions do not hold (Young and Hunter 2010; Huang and Yao 2012).

We prove the identifiability results of the two models under some mild conditions. We propose a modified EM algorithm, which combines the ideas of backfitting algorithm, kernel estimation, and local likelihood, to estimate both the global parameters and the nonparametric functions. In addition, the asymptotic properties of the proposed estimation procedures are also investigated. Simulation studies are conducted to demonstrate the finite sample performance of the proposed models. Two real data applications reveal some new interesting findings.

The rest of the paper is organized as follows. In Sect. 2, we introduce the MSIM and study its identifiability. A one-step and a fully iterative backfitting estimate are proposed, and their asymptotic properties are studied. In Sect. 3, the MRSIP is proposed, and the identifiability result and the asymptotic properties of the proposed estimates are given. In Sects. 4 and 5, Monte Carlo studies and two real data examples are presented to demonstrate the finite sample performance of the two models. A discussion is given in Sect. 6, and we defer the technical conditions and proofs to the “Appendix”.

2 Mixture of single-index models

2.1 Model definition and identifiability

Assume that \(\{({{\varvec{x}}}_i,Y_i),i=1,\ldots ,n\}\) is a random sample from the population \(({{\varvec{x}}},Y)\), where \({{\varvec{x}}}\) is p-dimensional and Y is univariate. Let \(\mathscr {C}\) be a latent class variable with the discrete distribution \(P(\mathscr {C}=j|{{\varvec{x}}})=\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\) for \(j=1,\ldots ,k\). Conditional on \(\mathscr {C}=j\) and \({{\varvec{x}}}\), Y follows a normal distribution with mean \(m_j({\varvec{\alpha }}^\top {{\varvec{x}}})\) and variance \(\sigma _j^2({\varvec{\alpha }}^\top {{\varvec{x}}})\). Without observing \(\mathscr {C}\), the conditional distribution of Y given \({{\varvec{x}}}\) can be written as:

$$\begin{aligned} f(y|x,{\varvec{\alpha }},\pi , m,\sigma ^2)=\sum _{j=1}^k\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\phi (y|m_j({\varvec{\alpha }}^\top {{\varvec{x}}}),\sigma _j^2({\varvec{\alpha }}^\top {{\varvec{x}}})). \end{aligned}$$

The above model is the proposed mixture of single-index models. Throughout the paper, we assume that k is fixed, and refer to model (1) as a finite semiparametric mixture of regression models, since \(\pi _j(\cdot )\), \(m_j(\cdot )\) and \(\sigma _j^2(\cdot )\) are all nonparametric. In model (1), we use the same index \({\varvec{\alpha }}\) for all components, but our proposed estimation procedure and asymptotic results can be easily extended to the case where the components have different indices.

Compared to Huang et al. (2013), the appeal of the proposed MSIM is that, by using the index \({\varvec{\alpha }}^\top {{\varvec{x}}}\), the so-called “curse of dimensionality” in fitting multivariate nonparametric regression functions is avoided. The model has a dimension-reduction structure in the sense that, given an estimate of \({\varvec{\alpha }}\), denoted by \(\hat{{\varvec{\alpha }}}\), we can use the univariate \(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}\) as the covariate, so that model (1) simplifies to the nonparametric mixture regression model proposed by Huang et al. (2013). Therefore, model (1) is a reasonable compromise between fully parametric and fully nonparametric modeling.

Identifiability is a major concern for most mixture models. Some well known identifiability results for finite mixture models include the following: a mixture of univariate normals is identifiable up to relabeling (Titterington et al. 1985), and a finite mixture of regression models is identifiable up to relabeling provided that the covariates have a certain level of variability (Henning 2000). Wang et al. (2014) established some general identifiability results for many existing nonparametric or semiparametric mixture regression models. The following theorem establishes the identifiability result for model (1); its proof is given in the “Appendix”.

Theorem 1

Assume that

  1.

    \(\pi _j(z)\), \(m_j(z)\), and \(\sigma _j^2(z)\) are differentiable and not constant on the support of \({\varvec{\alpha }}^\top {{\varvec{x}}}\), \(j=1,\ldots ,k\);

  2.

    \({{\varvec{x}}}\) is a continuously distributed random vector that has a joint probability density function;

  3.

    The support of \({{\varvec{x}}}\) is not contained in any proper linear subspace of \(\mathbb {R}^p\);

  4.

    \(\Vert {\varvec{\alpha }}\Vert =1\) and the first nonzero element of \({\varvec{\alpha }}\) is positive;

  5.

    For any \(1\le i\ne j\le k\),

    $$\begin{aligned} \sum _{l=0}^{1} \Vert m^{(l)}_i(z)-m^{(l)}_j(z)\Vert ^2+ \sum _{l=0}^{1}\Vert \sigma ^{(l)}_i(z)-\sigma ^{(l)}_j(z)\Vert ^2\ne 0, \end{aligned}$$

    for any z, where \(g^{(l)}\) denotes the lth derivative of g, with \(g^{(0)}=g\).

Then, model (1) is identifiable.

2.2 Estimation procedure

In this subsection, we propose a one-step estimation procedure and a backfitting algorithm to estimate the nonparametric functions and the single index of the model (1).

Let \(\ell ^{*(1)}({\varvec{\pi }},{{\varvec{m}}},{{\varvec{\sigma }}}^2,{\varvec{\alpha }})\) be the log-likelihood of the collected data \(\{({{\varvec{x}}}_i,Y_i),i=1,\ldots ,n\}\) from the model (1). That is:

$$\begin{aligned} \ell ^{*(1)}({\varvec{\pi }},{{\varvec{m}}},{{\varvec{\sigma }}}^2,{\varvec{\alpha }})=\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}}_i)\phi (Y_i|m_j({\varvec{\alpha }}^\top {{\varvec{x}}}_i),\sigma _j^2({\varvec{\alpha }}^\top {{\varvec{x}}}_i))\right\} , \end{aligned}$$
(3)

where \(\phi (y|\mu ,\sigma ^2)\) is the normal density with mean \(\mu \) and variance \(\sigma ^2\), \({\varvec{\pi }}(\cdot )=\{\pi _1(\cdot ),\ldots ,\pi _{k-1}(\cdot )\}^\top \), \({{\varvec{m}}}(\cdot )=\{m_1(\cdot ),\ldots ,m_k(\cdot )\}^\top \), and \({{\varvec{\sigma }}}^2(\cdot )=\{\sigma _1^2(\cdot ),\ldots ,\sigma _k^2(\cdot )\}^\top \). Since \({\varvec{\pi }}(\cdot )\), \({{\varvec{m}}}(\cdot )\) and \({{\varvec{\sigma }}}^2(\cdot )\) consist of nonparametric functions, (3) cannot be maximized directly.

Note that for the model (1), the space spanned by the single index \({\varvec{\alpha }}\) is in fact the central mean subspace of \(Y|{{\varvec{x}}}\) (Cook and Li 2002) in the literature of sufficient dimension reduction. Therefore, we can employ existing sufficient dimension reduction methods to find an initial estimate of \({\varvec{\alpha }}\). Please see, for example, Li (1991), Li et al. (2005), Wang and Xia (2008), Luo et al. (2009), Wang and Yao (2012), Ma and Zhu (2012, 2013), Yao et al. (2019). In this article, we will simply employ sliced inverse regression (Li 1991) to obtain an initial estimate of \({\varvec{\alpha }}\), denoted by \(\tilde{{\varvec{\alpha }}}\).
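To illustrate how such an initial estimate can be computed, the following is a minimal Python sketch of SIR (Li 1991); the number of slices, the equal-frequency slicing scheme, and the function name are our own illustrative choices rather than part of the paper. The returned direction is normalized to satisfy the identifiability convention in Theorem 1.

```python
import numpy as np

def sir_first_direction(X, y, n_slices=10):
    """Sliced inverse regression: return a unit-norm estimate of alpha.

    X: (n, p) predictor matrix, y: (n,) response.
    """
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Standardize the predictors: Z = (X - mu) Sigma^{-1/2}
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mu) @ Sigma_inv_sqrt
    # Slice the response and average the standardized predictors per slice
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        zbar = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(zbar, zbar)
    # Leading eigenvector of the slice-mean covariance, mapped back
    w = np.linalg.eigh(M)[1][:, -1]
    alpha = Sigma_inv_sqrt @ w
    alpha /= np.linalg.norm(alpha)
    # Sign convention: first nonzero element positive (identifiability)
    if alpha[np.nonzero(alpha)[0][0]] < 0:
        alpha = -alpha
    return alpha
```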

Given the estimated single index \(\tilde{{\varvec{\alpha }}}\), the nonparametric functions \({\varvec{\pi }}(z)\), \({{\varvec{m}}}(z)\) and \({{\varvec{\sigma }}}^2(z)\) can then be estimated by maximizing the following local log-likelihood function:

$$\begin{aligned}&\ell ^{(1)}_1({\varvec{\pi }},{{\varvec{m}}},{{\varvec{\sigma }}}^2)\nonumber \\&\quad =\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\pi _j(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|m_j(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i),\sigma _j^2(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i))\right\} K_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z),\qquad \end{aligned}$$
(4)

where \(K_h(z)=\frac{1}{h}K(\frac{z}{h})\), \(K(\cdot )\) is a kernel density function, and h is a tuning parameter. Let \(\hat{{\varvec{\pi }}}(\cdot )\), \(\hat{{{\varvec{m}}}}(\cdot )\) and \(\hat{{{\varvec{\sigma }}}}^2(\cdot )\) be the estimates that maximize (4). The above estimates are the proposed one-step estimate.

We propose a modified EM-type algorithm to maximize \(\ell _1^{(1)}\). In practice, we usually want to evaluate unknown functions at a set of grid points, which in this case, requires us to maximize local log-likelihood functions at a set of grid points. If we simply employ the EM algorithm separately for each grid point, the labels in the EM algorithm may change at different grid points, and we may not be able to get smoothed estimated curves (Huang and Yao 2012). Therefore, we propose the following modified EM-type algorithm, which estimates the nonparametric functions simultaneously at a set of grid points, say \(\{u_t,t=1,\ldots ,N\}\), and provides a unified label for each observation across all grid points.

Algorithm 1

Modified EM-type algorithm to maximize (4) given the single index estimate \(\tilde{{\varvec{\alpha }}}\).

E-step:

Calculate the expectations of component labels based on estimates from lth iteration:

$$\begin{aligned} p_{ij}^{(l+1)}=\frac{\pi _j^{(l)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|m_j^{(l)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i),\sigma _j^{2(l)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i))}{\sum _{j=1}^k \pi _j^{(l)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|m_j^{(l)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i),\sigma _j^{2(l)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i))}, \end{aligned}$$
(5)

where \(i=1,\ldots ,n,j=1,\ldots ,k.\)

M-step:

Update the estimates

$$\begin{aligned}&\pi _j^{(l+1)}(z)=\frac{\sum _{i=1}^np_{ij}^{(l+1)}K_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}{\sum _{i=1}^nK_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}, \end{aligned}$$
(6)
$$\begin{aligned}&m_j^{(l+1)}(z)=\frac{\sum _{i=1}^np_{ij}^{(l+1)}Y_iK_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}{\sum _{i=1}^np_{ij}^{(l+1)}K_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}, \end{aligned}$$
(7)
$$\begin{aligned}&\sigma _j^{2(l+1)}(z)=\frac{\sum _{i=1}^np_{ij}^{(l+1)}(Y_i-m_j^{(l+1)}(z))^2K_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}{\sum _{i=1}^np_{ij}^{(l+1)}K_h(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}, \end{aligned}$$
(8)

for \(z\in \{u_t,t=1,\ldots ,N\}\) and \(j=1,\ldots ,k\). We then update \(\pi _j^{(l+1)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\), \(m_j^{(l+1)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\) and \(\sigma _j^{2(l+1)}(\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\), \(i=1,\ldots ,n\), by linearly interpolating \(\pi _j^{(l+1)}(u_t)\), \(m_j^{(l+1)}(u_t)\) and \(\sigma _j^{2(l+1)}(u_t)\), \(t=1,\ldots ,N\), respectively.

Note that in the M-step, the nonparametric functions are estimated simultaneously at a set of grid points, and therefore, the classification probabilities in the E-step can be estimated globally to avoid the label switching problem (Stephens 2000; Yao and Lindsay 2009). If the sample size n is not too large, one can also take all \(\{\tilde{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i, i=1,\ldots ,n\}\) as grid points for z in the M-step.
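For concreteness, one pass of Algorithm 1 (the E-step (5) followed by the kernel-weighted M-step (6)–(8) on a grid) could be coded as in the following sketch. This is our own illustration: the Gaussian kernel, the linear interpolation via np.interp, and the function name are assumptions, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def em_step_msim(z_obs, y, pi_grid, m_grid, s2_grid, grid, h):
    """One iteration of the modified EM in Algorithm 1.

    z_obs : (n,) observed index values alpha~^T x_i
    pi_grid, m_grid, s2_grid : (k, N) current estimates on `grid`
    (pi_grid rows sum to one at each grid point). Returns updated arrays.
    """
    k = m_grid.shape[0]
    # Interpolate the current curves to the observed index values
    pi_i = np.array([np.interp(z_obs, grid, pi_grid[j]) for j in range(k)])
    m_i  = np.array([np.interp(z_obs, grid, m_grid[j])  for j in range(k)])
    s2_i = np.array([np.interp(z_obs, grid, s2_grid[j]) for j in range(k)])
    # E-step (5): posterior component probabilities p_ij
    dens = pi_i * norm.pdf(y, loc=m_i, scale=np.sqrt(s2_i))     # (k, n)
    p = dens / dens.sum(axis=0, keepdims=True)
    # M-step (6)-(8): kernel-weighted updates at each grid point
    K = norm.pdf((z_obs[None, :] - grid[:, None]) / h) / h       # (N, n)
    pi_new = (K @ p.T).T / K.sum(axis=1)                         # (k, N)
    m_new = np.einsum('kn,Nn->kN', p * y, K) / np.einsum('kn,Nn->kN', p, K)
    # Variance update (8) uses the updated mean at each grid point
    resid2 = (y[None, None, :] - m_new[:, :, None]) ** 2         # (k, N, n)
    w = p[:, None, :] * K[None, :, :]                            # (k, N, n)
    s2_new = (w * resid2).sum(axis=2) / w.sum(axis=2)
    return pi_new, m_new, s2_new
```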

The initial estimate \(\tilde{{\varvec{\alpha }}}\) by SIR does not make use of the mixture information and thus is not efficient. Given the one-step estimates \(\hat{{\varvec{\pi }}}(\cdot )\), \(\hat{{{\varvec{m}}}}(\cdot )\) and \(\hat{{{\varvec{\sigma }}}}^2(\cdot )\), we can further improve the estimate of \({\varvec{\alpha }}\) by maximizing

$$\begin{aligned} \ell ^{(1)}_2({\varvec{\alpha }})=\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\hat{\pi }_j({\varvec{\alpha }}^\top {{\varvec{x}}}_i)\phi (Y_i|\hat{m}_j({\varvec{\alpha }}^\top {{\varvec{x}}}_i),\hat{\sigma }_j^2({\varvec{\alpha }}^\top {{\varvec{x}}}_i))\right\} , \end{aligned}$$
(9)

with respect to \({\varvec{\alpha }}\). The proposed fully iterative backfitting estimator of \({\varvec{\alpha }}\), denoted by \(\hat{{\varvec{\alpha }}}\), iterates the above two steps until convergence.

Algorithm 2

Fully iterative backfitting estimator (FIB)

Step 1:

Apply sliced inverse regression (SIR) to obtain an initial estimate of the single index parameter \({\varvec{\alpha }}\), denoted by \(\tilde{{\varvec{\alpha }}}\).

Step 2:

Given \(\tilde{{\varvec{\alpha }}}\), apply the modified EM-algorithm (5)–(8) to maximize \(\ell ^{(1)}_1\) in (4) to obtain the estimates \(\hat{{\varvec{\pi }}}(\cdot )\), \(\hat{{{\varvec{m}}}}(\cdot )\), and \(\hat{{{\varvec{\sigma }}}}^2(\cdot )\).

Step 3:

Given \(\hat{{\varvec{\pi }}}(\cdot )\), \(\hat{{{\varvec{m}}}}(\cdot )\), and \(\hat{{{\varvec{\sigma }}}}^2(\cdot )\) from Step 2, update the estimate of \({\varvec{\alpha }}\) by maximizing \(\ell ^{(1)}_2\) in (9); a short code sketch of this step is given after the algorithm.

Step 4:

Iterate Steps 2–3 until convergence.
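Step 3 above is a smooth low-dimensional optimization under the constraint \(\Vert {\varvec{\alpha }}\Vert =1\). A hedged Python sketch is given below; renormalizing inside the objective and using BFGS are our own illustrative choices, and pi_hat, m_hat, s2_hat stand for the fitted univariate curves from Step 2 (e.g., interpolants of the grid estimates).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def update_alpha(alpha0, X, y, pi_hat, m_hat, s2_hat):
    """Maximize (9) over alpha subject to ||alpha|| = 1 (a sketch)."""
    k = len(m_hat)

    def neg_loglik(a):
        a = a / np.linalg.norm(a)          # enforce the unit-norm constraint
        z = X @ a
        dens = sum(pi_hat[j](z) * norm.pdf(y, m_hat[j](z), np.sqrt(s2_hat[j](z)))
                   for j in range(k))
        return -np.sum(np.log(dens + 1e-300))

    res = minimize(neg_loglik, alpha0, method="BFGS")
    alpha = res.x / np.linalg.norm(res.x)
    if alpha[np.nonzero(alpha)[0][0]] < 0:  # sign convention for identifiability
        alpha = -alpha
    return alpha
```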

2.3 Asymptotic properties

The asymptotic properties of the proposed estimates are investigated below. Let \({\varvec{\theta }}(z)=({\varvec{\pi }}^\top (z),{{\varvec{m}}}^\top (z),({{\varvec{\sigma }}}^2)^\top (z))^\top \). Define

$$\begin{aligned} \ell ({\varvec{\theta }}(z),y)&=\log \sum _{j=1}^k\pi _j(z)\phi \{y|m_j(z),\sigma _j^2(z)\},\\ q_1(z)&=\frac{\partial \ell ({\varvec{\theta }}(z),y)}{\partial \theta },\\ q_2(z)&=\frac{\partial ^2\ell ({\varvec{\theta }}(z),y)}{\partial \theta \partial \theta ^\top },\\ \mathscr {I}^{(1)}_\theta (z)&=-E[q_2(Z)|Z=z],\\ \varLambda _1(u|z)&=E[q_1(z)|Z=u]. \end{aligned}$$

Under further conditions defined in the “Appendix”, the asymptotic properties of the one-step estimates \(\hat{{\varvec{\pi }}}(\cdot )\), \(\hat{{{\varvec{m}}}}(\cdot )\), and \(\hat{{{\varvec{\sigma }}}}^2(\cdot )\) are given in the following theorem.

Theorem 2

Assume that conditions (C1)–(C7) in the “Appendix” hold. Then, as \(n\rightarrow \infty \), \(h\rightarrow 0\) and \(nh\rightarrow \infty \), we have

$$\begin{aligned} \sqrt{nh}\{\hat{{\varvec{\theta }}}(z)-{\varvec{\theta }}(z)-\mathscr {B}_1(z)+o_p(h^2)\}\overset{D}{\rightarrow } N\{0,\nu _0f^{-1}(z)\mathscr {I}^{(1)-1}_\theta (z)\}, \end{aligned}$$
(10)

where

$$\begin{aligned} \mathscr {B}_1(z)=\mathscr {I}^{(1)-1}_\theta \left\{ \frac{f'(z)\varLambda _1^{'}(z|z)}{f(z)}+\frac{1}{2}\varLambda _1^{''}(z|z)\right\} \kappa _2h^2, \end{aligned}$$

with \(f(\cdot )\) the marginal density function of \({\varvec{\alpha }}^\top {{\varvec{x}}}\), \(\kappa _l=\int t^lK(t)dt\) and \(\nu _l=\int t^lK^2(t)dt\).
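For instance, if K is taken to be the standard Gaussian kernel, these constants are

$$\begin{aligned} \kappa _2=\int t^2K(t)dt=1\quad \hbox {and}\quad \nu _0=\int K^2(t)dt=\frac{1}{2\sqrt{\pi }}\approx 0.282. \end{aligned}$$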

Note that the asymptotic variance of \(\hat{{\varvec{\theta }}}(z)\) is the same as that given in Huang et al. (2013). Thus, the nonparametric functions can be estimated with the same accuracy as they would be if the single index \({\varvec{\alpha }}^\top {{\varvec{x}}}\) were known. This is expected, since the index \({\varvec{\alpha }}\) can be estimated at a root-n convergence rate, which is faster than the rate of \(\hat{{\varvec{\theta }}}(z)\). In addition, note that the one-step estimates of \({\varvec{\theta }}(z)\) have the same asymptotic variance (up to the first order) as the fully iterative backfitting algorithm, but with much less computation. Our simulation results in Sect. 4 further confirm this.

The next theorem gives the asymptotic results for the estimator \(\hat{{\varvec{\alpha }}}\) given by the fully iterative backfitting algorithm.

Theorem 3

Assume that conditions (C1)–(C8) in the “Appendix” hold. Then, as \(n\rightarrow \infty \), \(nh^4\rightarrow 0\), and \(nh^2/\log (1/h)\rightarrow \infty \),

$$\begin{aligned} \sqrt{n}(\hat{{\varvec{\alpha }}}-{\varvec{\alpha }})\overset{D}{\rightarrow } N(0,{{\varvec{Q}}}_1^{-1}), \end{aligned}$$
(11)

where

$$\begin{aligned} {{\varvec{Q}}}_1=E\left[ \{{{\varvec{x}}}{\varvec{\theta }}'(Z)\}q_2(Z)\{{{\varvec{x}}}{\varvec{\theta }}'(Z)\}^\top -{{\varvec{x}}}{\varvec{\theta }}'(Z)q_2(Z)\mathscr {I}^{(1)-1}_\theta (Z)E\{q_2(Z)[{{\varvec{x}}}{\varvec{\theta }}' (Z)]^\top |Z\}\right] . \end{aligned}$$

3 Mixtures of regression models with varying single-index proportions

3.1 Model definition and identifiability

The MRSIP assumes that \(P(\mathscr {C}=j|{{\varvec{x}}})=\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\) for \(j=1,\ldots ,k\), and conditional on \(\mathscr {C}=j\) and \({{\varvec{x}}}\), Y follows a normal distribution with mean \({{\varvec{x}}}^\top {\varvec{\beta }}_j\) and variance \(\sigma ^2_j\). That is,

$$\begin{aligned} f(y|x,{\varvec{\alpha }},\pi , {\varvec{\beta }},\sigma ^2)=\sum _{j=1}^k\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\phi (y|{{\varvec{x}}}^\top {\varvec{\beta }}_j,\sigma _j^2). \end{aligned}$$

Since the \(\pi _j(\cdot )\)’s are nonparametric, model (2) is also a finite semiparametric mixture of regression models. The linear component regression functions \({{\varvec{x}}}^\top {\varvec{\beta }}_j\) enjoy simple interpretation, while the nonparametric functions \(\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}})\) incorporate the effects of the predictors on the component proportions more flexibly to reduce the modeling bias. See Young and Hunter (2010) and Huang et al. (2013) for more information. We first establish the identifiability result of model (2) in the following theorem and defer its proof to the “Appendix”.

Theorem 4

Assume that

  1.

    \(\pi _j(z)>0\) are differentiable and not constant on the support of \({\varvec{\alpha }}^\top {{\varvec{x}}}\), \(j=1,\ldots ,k\);

  2.

    The components of \({{\varvec{x}}}\) are continuously distributed random variables that have a joint probability density function;

  3.

    The support of \({{\varvec{x}}}\) contains an open set in \(\mathbb {R}^p\) and is not contained in any proper linear subspace of \(\mathbb {R}^p\);

  4.

    \(\Vert {\varvec{\alpha }}\Vert =1\) and the first nonzero element of \({\varvec{\alpha }}\) is positive;

  5.

    \(({\varvec{\beta }}_j,\sigma _j^2)\), \(j=1,\ldots ,k\), are distinct pairs.

Then, model (2) is identifiable.

3.2 Estimation procedure

The log-likelihood of the collected data for model (2) is:

$$\begin{aligned} \ell ^{*(2)}({\varvec{\pi }},{{\varvec{\sigma }}}^2,{\varvec{\alpha }},{\varvec{\beta }})=\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\pi _j({\varvec{\alpha }}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top {\varvec{\beta }}_j,\sigma _j^2)\right\} , \end{aligned}$$
(12)

where \({\varvec{\pi }}(\cdot )=\{\pi _1(\cdot ),\ldots ,\pi _{k-1}(\cdot )\}^\top \), \({{\varvec{\sigma }}}^2=\{\sigma _1^2,\ldots ,\sigma _k^2\}^\top \), and \({\varvec{\beta }}=\{{\varvec{\beta }}_1,\ldots ,{\varvec{\beta }}_k\}^\top \). Since \({\varvec{\pi }}(\cdot )\) consists of nonparametric functions, (12) cannot be maximized directly. We propose a backfitting algorithm that iterates between estimating the parameters \(({\varvec{\alpha }},{\varvec{\beta }},{{\varvec{\sigma }}}^2)\) and the nonparametric functions \({\varvec{\pi }}(\cdot )\).

Given the estimates of \(({\varvec{\alpha }},{\varvec{\beta }},{{\varvec{\sigma }}}^2)\), say \((\hat{{\varvec{\alpha }}},\hat{{\varvec{\beta }}},\hat{{{\varvec{\sigma }}}}^2)\), then \({\varvec{\pi }}(\cdot )\) can be estimated locally by maximizing the following local log-likelihood function:

$$\begin{aligned} \ell ^{(2)}_1({\varvec{\pi }})=\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\pi _j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top \hat{{\varvec{\beta }}}_j,\hat{\sigma }^2_j)\right\} K_h(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z). \end{aligned}$$
(13)

Let \(\hat{{\varvec{\pi }}}(\cdot )\) be the estimate that maximizes (13). We can then further update the estimate of \(({\varvec{\alpha }},{\varvec{\beta }},{{\varvec{\sigma }}}^2)\) by maximizing

$$\begin{aligned} \ell ^{(2)}_2({\varvec{\alpha }},{\varvec{\beta }},{{\varvec{\sigma }}}^2)=\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\hat{\pi }_j({\varvec{\alpha }}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top {\varvec{\beta }}_j,\sigma ^2_j)\right\} . \end{aligned}$$
(14)

The backfitting algorithm, which iterates the above two steps, is summarized as follows.

Algorithm 3

Backfitting algorithm to estimate the model (2).

Step 1:

    Obtain an initial estimate of \(({\varvec{\alpha }},{\varvec{\beta }},{{\varvec{\sigma }}}^2)\).

Step 2:

    Given \((\hat{{\varvec{\alpha }}},\hat{{\varvec{\beta }}},\hat{{{\varvec{\sigma }}}}^2)\), use the following modified EM-type algorithm to maximize \(\ell ^{(2)}_1\) in (13).

    E-step: Calculate the expectations of component labels based on estimates from lth iteration:

    $$\begin{aligned} p_{ij}^{(l+1)}=\frac{\pi ^{(l)}_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top \hat{{\varvec{\beta }}}_j,\hat{\sigma }_j^2)}{\sum _{j=1}^k\pi ^{(l)}_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top \hat{{\varvec{\beta }}}_j,\hat{\sigma }_j^2)}, \end{aligned}$$
    (15)

    where \(i=1,\ldots ,n, j=1,\ldots ,k\).

    M-step: Update the estimate

    $$\begin{aligned} \pi _j^{(l+1)}(z)=\frac{\sum _{i=1}^np_{ij}^{(l+1)}K_h(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)}{\sum _{i=1}^nK_h(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i-z)} \end{aligned}$$
    (16)

    for \(z\in \{u_t,t=1,\ldots ,N\}\). We then update \(\pi _j^{(l+1)}(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\), \(i=1,\ldots ,n\), by linearly interpolating \(\pi _j^{(l+1)}(u_t)\), \(t=1,\ldots ,N\).

Step 3:

    Given \(\hat{{\varvec{\pi }}}(\cdot )\) from Step 2, update \((\hat{{\varvec{\alpha }}},\hat{{\varvec{\beta }}},\hat{{{\varvec{\sigma }}}}^2)\) by maximizing (14). We propose to iterate between updating \({\varvec{\alpha }}\) and \(({\varvec{\beta }}, {{\varvec{\sigma }}}^2)\).

Step 3.1:

    Given \(\hat{{\varvec{\alpha }}}\), update \(({\varvec{\beta }},{{\varvec{\sigma }}}^2)\).

    E-step: Calculate the classification probabilities:

    $$\begin{aligned} p_{ij}^{(l+1)}=\frac{\hat{\pi }_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top {\varvec{\beta }}^{(l)}_j,\sigma ^{2(l)}_j)}{\sum _{j=1}^k\hat{\pi }_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top {\varvec{\beta }}^{(l)}_j,\sigma ^{2(l)}_j)},\quad j=1,\ldots ,k. \end{aligned}$$
    (17)

    M-step: Update \({\varvec{\beta }}\) and \({{\varvec{\sigma }}}^2\):

    $$\begin{aligned} {\varvec{\beta }}_j^{(l+1)}&=({{\varvec{S}}}^\top {{\varvec{R}}}_j^{(l+1)}{{\varvec{S}}})^{-1}{{\varvec{S}}}^\top {{\varvec{R}}}_j^{(l+1)}{{\varvec{y}}}, \end{aligned}$$
    (18)
    $$\begin{aligned} \sigma _j^{2(l+1)}&=\frac{\sum _{i=1}^np_{ij}^{(l+1)}(Y_i-{{\varvec{x}}}_i^\top {\varvec{\beta }}_j^{(l+1)})^2}{\sum _{i=1}^np_{ij}^{(l+1)}}, \end{aligned}$$
    (19)

    where \(j=1,\ldots ,k\), \({{\varvec{R}}}_j^{(l+1)}=diag\{p_{1j}^{(l+1)},\ldots ,p_{nj}^{(l+1)}\}\), and \({{\varvec{S}}}=({{\varvec{x}}}_1,\ldots ,{{\varvec{x}}}_n)^\top \). A short code sketch of this weighted least-squares update is given after the algorithm.

Step 3.2:

    Given \((\hat{{\varvec{\beta }}},\hat{{{\varvec{\sigma }}}}^2)\), update \({\varvec{\alpha }}\) by maximizing the following log-likelihood

    $$\begin{aligned} \ell ^{(2)}_3({\varvec{\alpha }})=\sum _{i=1}^n\log \left\{ \sum _{j=1}^k\hat{\pi }_j({\varvec{\alpha }}^\top {{\varvec{x}}}_i)\phi (Y_i|{{\varvec{x}}}_i^\top \hat{{\varvec{\beta }}}_j,\hat{\sigma }^2_j)\right\} . \end{aligned}$$
Step 3.3:

    Iterate Steps 3.1–3.2 until convergence.

Step 4:

    Iterate Steps 2–3 until convergence.
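The M-step update (18)–(19) in Step 3.1 is a standard weighted least-squares computation; a brief numpy sketch (our own illustration under the notation above, with no numerical safeguards) is:

```python
import numpy as np

def mstep_beta_sigma(X1, y, P):
    """Weighted least-squares updates (18)-(19) for (beta_j, sigma_j^2).

    X1: (n, p+1) design matrix including the intercept column,
    P:  (n, k) classification probabilities p_ij from (17).
    """
    n, k = P.shape
    betas, sigma2 = [], []
    for j in range(k):
        w = P[:, j]
        WX = X1 * w[:, None]                            # R_j S
        beta_j = np.linalg.solve(X1.T @ WX, WX.T @ y)   # (S'R_j S)^{-1} S'R_j y
        resid = y - X1 @ beta_j
        sigma2.append(np.sum(w * resid ** 2) / np.sum(w))
        betas.append(beta_j)
    return np.array(betas), np.array(sigma2)
```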

There are many ways to obtain an initial estimate of \(({\varvec{\alpha }},{\varvec{\beta }},{{\varvec{\sigma }}}^2)\). In our numerical studies, we obtain an initial estimate of \(({\varvec{\beta }},{{\varvec{\sigma }}}^2)\) by fitting a traditional mixture of linear regression models. Using the resulting hard-clustering labels as the response variable, we then apply SIR to obtain an initial estimate of \({\varvec{\alpha }}\).

3.3 Asymptotic properties

Let \((\hat{{\varvec{\pi }}}(z),\hat{{\varvec{\alpha }}},\hat{{\varvec{\beta }}},\hat{{{\varvec{\sigma }}}}^2)\) be the resulting estimate of Algorithm 3. In this section, we investigate their asymptotic properties. Let \({{\varvec{\eta }}}=({\varvec{\beta }}^\top ,({{\varvec{\sigma }}}^2)^\top )^\top \) and \({{\varvec{\lambda }}}=({\varvec{\alpha }}^\top ,{{\varvec{\eta }}}^\top )^\top \). Define

$$\begin{aligned} \ell ({\varvec{\pi }}(z),{{\varvec{\lambda }}},{{\varvec{x}}},y)&=\log \sum _{j=1}^k\pi _j(z)\phi \{y|{{\varvec{x}}}^\top {\varvec{\beta }}_j,\sigma _j^2\},\\ q_{\pi }(z)&=\frac{\partial \ell ({\varvec{\pi }}(z),{{\varvec{\lambda }}},{{\varvec{x}}},y)}{\partial {\varvec{\pi }}},\\ q_{\pi \pi }(z)&=\frac{\partial ^2\ell ({\varvec{\pi }}(z),{{\varvec{\lambda }}},{{\varvec{x}}},y)}{\partial {\varvec{\pi }}\partial {\varvec{\pi }}^\top }. \end{aligned}$$

Similarly, define \(q_{\lambda }\), \(q_{\lambda \lambda }\), and \(q_{\pi \eta }\). Denote \(\mathscr {I}^{(2)}_\pi (z)=-E[q_{\pi \pi }(Z)|Z=z]\) and \(\varLambda _2(u|z)=E[q_{\pi }(z)|Z=u]\).

Under some regularity conditions, the asymptotic properties of \(\hat{{\varvec{\pi }}}(z)\) are given in the following theorem and its proof is given in the “Appendix”.

Theorem 5

Assume that conditions (C1)–(C4) and (C9)–(C11) in the “Appendix” hold. Then, as \(n\rightarrow \infty \), \(h\rightarrow 0\) and \(nh\rightarrow \infty \), we have

$$\begin{aligned} \sqrt{nh}\{\hat{{\varvec{\pi }}}(z)-{\varvec{\pi }}(z)-\mathscr {B}_2(z)+o_p(h^2)\}\overset{D}{\rightarrow } N\{0,\nu _0f^{-1}(z)\mathscr {I}^{(2)-1}_\pi (z)\}, \end{aligned}$$
(20)

where

$$\begin{aligned} \mathscr {B}_2(z)=\mathscr {I}_\pi ^{(2)-1}\left\{ \frac{f'(z)\varLambda _2'(z|z)}{f(z)}+\frac{1}{2}\varLambda _2''(z|z)\right\} \kappa _2h^2. \end{aligned}$$

The asymptotic property of the parametric estimate \(\hat{{{\varvec{\lambda }}}}\) is given in the following theorem.

Theorem 6

Assume that conditions (C1)–(C4) and (C9)–(C12) in the “Appendix” hold. Then, as \(n\rightarrow \infty \), \(nh^4\rightarrow 0\), and \(nh^2/\log (1/h)\rightarrow \infty \),

$$\begin{aligned} \sqrt{n}(\hat{{{\varvec{\lambda }}}}-{{\varvec{\lambda }}})\overset{D}{\rightarrow } N(0,{{\varvec{Q}}}_2^{-1}), \end{aligned}$$

where

$$\begin{aligned} {{\varvec{Q}}}_2=E\left[ q_{\pi \pi }(Z)\begin{pmatrix}{{\varvec{x}}}{\varvec{\pi }}'(Z)\\ \mathbf{I} \end{pmatrix}\left\{ \begin{pmatrix}{{\varvec{x}}}{\varvec{\pi }}'(Z)\\ \mathbf{I} \end{pmatrix} -\begin{pmatrix}\mathscr {I}_\pi ^{(2)-1}(Z)E\{q_{\pi \pi }(Z)({{\varvec{x}}}{\varvec{\pi }}'(Z))^\top |Z\}\\ \mathscr {I}_\pi ^{(2)-1}(Z)E\{q_{\pi \eta }(Z)|Z\}\end{pmatrix}\right\} ^\top \right] . \end{aligned}$$

4 Simulation studies

In this section, we conduct simulation studies to test the performance of the proposed models and estimation procedures.

The performance of the estimates is measured via the absolute bias (AB), the standard deviation (SD), and the root mean squared error (RMSE), defined as \((AB^2+SD^2)^{1/2}\). Specifically, the AB and SD of the mean functions \(m_j(\cdot )\) in model (1) are defined as:

$$\begin{aligned} AB=\left[ \frac{1}{N}\sum _{j=1}^k\sum _{t=1}^N\left\{ E\hat{m}_j(u_t)-m_j(u_t)\right\} ^2\right] ^{1/2}, \end{aligned}$$

and

$$\begin{aligned} SD=\left[ \frac{1}{N}\sum _{j=1}^k\sum _{t=1}^NE\left\{ \hat{m}_j(u_t)-E\hat{m}_j(u_t)\right\} ^2\right] ^{1/2}, \end{aligned}$$

where \(\hat{m}_j(\cdot )\) is an estimator of \(m_j(\cdot )\), and \(E\hat{m}_j(u_t)\) is estimated by averaging over the simulation replications. Similarly, we can define the AB and SD for the variance functions \(\sigma _j^2(\cdot )\) and the proportion functions \(\pi _j(\cdot )\). In the following numerical studies, we set \(N=100\).
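Given the Monte Carlo output, these summaries reduce to simple averages over replications and grid points; a short numpy sketch (the array layout and function name are our own choices) is:

```python
import numpy as np

def ab_sd_rmse(est, truth):
    """est: (R, k, N) estimates over R replications at N grid points,
    truth: (k, N) true curves. Returns (AB, SD, RMSE)."""
    R, _, N = est.shape
    mean_est = est.mean(axis=0)                                 # E m_hat(u_t)
    ab = np.sqrt(np.sum((mean_est - truth) ** 2) / N)
    sd = np.sqrt(np.sum((est - mean_est) ** 2) / (R * N))
    return ab, sd, np.sqrt(ab ** 2 + sd ** 2)
```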

In addition, a conditional bootstrap procedure is applied to estimate the standard errors of the estimates and to construct confidence intervals for the parameters.

Example 1

Generate data from the following two-component MSIM:

$$\begin{aligned} \pi _1(z)= & {} 0.5+0.3\sin (\pi z)\quad \hbox {and}\quad \pi _2(z)=1-\pi _1(z),\\ m_1(z)= & {} 6-\sin (2\pi z/\sqrt{3})\quad \hbox {and}\quad m_2(z)=\cos (\sqrt{3}\pi z)-1,\\ \sigma _1(z)= & {} 0.4+\sin (3\pi z)/15\quad \hbox {and}\quad \sigma _2(z)=0.3+\cos (1.3\pi z)/10, \end{aligned}$$

where \(z_i={\varvec{\alpha }}^\top {{\varvec{x}}}_i\), \({{\varvec{x}}}_i\) is an eight-dimensional random vector with independent uniform (0,1) components, and the direction parameter is \({\varvec{\alpha }}=(1,1,1,0,0,0,0,0)^\top /\sqrt{3}\). Simulations with sample sizes \(n=200\) and \(n=400\) are conducted over 500 repetitions. To estimate \({\varvec{\alpha }}\), we use sliced inverse regression (SIR) and the fully iterative backfitting estimate (MSIM). To estimate the nonparametric functions, we apply the one-step estimate (OS) and MSIM. For MSIM, we use both the true value (T) and the SIR estimate (S) as initial values.
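For reproducibility, a minimal sketch of how data of this form can be simulated is given below (our own code; the seed and function name are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_example1(n):
    """Simulate (X, y) from the two-component MSIM of Example 1."""
    alpha = np.concatenate([np.ones(3), np.zeros(5)]) / np.sqrt(3)
    X = rng.uniform(size=(n, 8))
    z = X @ alpha
    pi1 = 0.5 + 0.3 * np.sin(np.pi * z)
    m1 = 6 - np.sin(2 * np.pi * z / np.sqrt(3))
    m2 = np.cos(np.sqrt(3) * np.pi * z) - 1
    s1 = 0.4 + np.sin(3 * np.pi * z) / 15
    s2 = 0.3 + np.cos(1.3 * np.pi * z) / 10
    comp1 = rng.uniform(size=n) < pi1          # latent component labels
    y = rng.normal(np.where(comp1, m1, m2), np.where(comp1, s1, s2))
    return X, y
```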

First, a proper bandwidth for estimating \({\varvec{\pi }}(\cdot )\), \({{\varvec{m}}}(\cdot )\) and \({{\varvec{\sigma }}}^2(\cdot )\) is selected. Based on Theorem 2, one can calculate the theoretical optimal bandwidth by minimizing asymptotic mean squared errors. However, the theoretical optimal bandwidth depends on many unknown quantities, which are not easy to estimate in practice. In our examples, we propose to use the following cross-validation (CV) method to choose the bandwidth. Let \(\mathscr {D}\) be the full data set, and divide \(\mathscr {D}\) into a training set \(\mathscr {R}_l\) and a test set \(\mathscr {T}_l\). That is, \(\mathscr {R}_l\cup \mathscr {T}_l=\mathscr {D}\) for \(l=1,\ldots ,L\). We use the training set \(\mathscr {R}_l\) to obtain the estimates \(\{\hat{{\varvec{\pi }}}(\cdot ),\hat{{{\varvec{m}}}}(\cdot ),\hat{{{\varvec{\sigma }}}}^2(\cdot ),\hat{{\varvec{\alpha }}}\}\), then evaluate \({\varvec{\pi }}(\cdot )\), \({{\varvec{m}}}(\cdot )\) and \({{\varvec{\sigma }}}^2(\cdot )\) for the test data set \(\mathscr {T}_l\). For each \(({{\varvec{x}}}_t,y_t)\in \mathscr {T}_l\), calculate the classification probability as

$$\begin{aligned} \hat{p}_{tj}=\frac{\hat{\pi }_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t)\phi (y_t|\hat{m}_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t),\hat{\sigma }_j^2(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t))}{\sum _{j=1}^k\hat{\pi }_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t)\phi (y_t|\hat{m}_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t),\hat{\sigma }_j^2(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t))}, \end{aligned}$$
(21)

for \(j=1,\ldots ,k\). We consider the regular CV criterion, defined by

$$\begin{aligned} CV(h)=\sum _{l=1}^L\sum _{t\in \mathscr {T}_l}(y_t-\hat{y}_t)^2, \end{aligned}$$

where \(\hat{y}_t=\sum _{j=1}^k\hat{p}_{tj}\hat{m}_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}_t)\). We also implemented likelihood-based cross-validation to choose the bandwidth; the results are similar but require more computation.
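Schematically, the bandwidth selection can be organized as in the following Python sketch, where fit_msim and predict_mean are hypothetical placeholders for the estimation procedure of Sect. 2.2 and the prediction rule based on (21).

```python
import numpy as np

def cv_bandwidth(X, y, bandwidths, fit_msim, predict_mean, L=10, seed=0):
    """Select h by L-fold CV on squared prediction error (a sketch).

    fit_msim(X_train, y_train, h) -> fitted model;
    predict_mean(model, X_test, y_test) -> yhat using (21).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), L)
    cv = []
    for h in bandwidths:
        err = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
            model = fit_msim(X[train_idx], y[train_idx], h)
            yhat = predict_mean(model, X[test_idx], y[test_idx])
            err += np.sum((y[test_idx] - yhat) ** 2)
        cv.append(err)
    return bandwidths[int(np.argmin(cv))]
```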

Table 1 Simulation results for Example 1, \(n=200\). Absolute bias (AB), standard deviation (SD), and root mean squared error (RMSE) of global and local parameter estimates using the proposed methods

Throughout the simulation, \(L=10\) and the data are randomly partitioned. The procedure is repeated 30 times, and the average of the selected bandwidths is taken as the optimal bandwidth, denoted by \(\hat{h}\).

Tables 1 and 2 present the AB, SD, and RMSE of the estimates. We summarize our findings as follows. The AB, SD, and RMSE of the index parameter estimates from MSIM are consistently smaller than those of the OS estimates, which indicates that MSIM is superior to OS. This is reasonable, since MSIM makes use of the mixture information while SIR does not. Furthermore, MSIM(T) performs slightly better than MSIM(S), but the improvement is negligible, indicating that sliced inverse regression provides good initial values for our method. In addition, the results for the nonparametric function estimates show that MSIM works slightly better than OS. This is consistent with the theoretical results stated in Sect. 2.3.

In addition, to see how the selected bandwidth works, we also ran simulations with two other bandwidths, \(\hat{h}\times n^{-2/15}\) and \(1.5\hat{h}\), which correspond to under-smoothing and over-smoothing, respectively. The results are summarized in Tables 3 and 4. We can see that the proposed bandwidth selection procedure works reasonably well, since the corresponding AB, SD, and RMSE are often the smallest.

Next, we assess the accuracy of the standard error estimates and the confidence intervals constructed for \({\varvec{\alpha }}\) via a conditional bootstrap procedure. Given the covariate \({{\varvec{x}}}\), the response Y is generated from the estimated distribution

$$\begin{aligned} \sum _{j=1}^k\hat{\pi }_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}})\phi (y|\hat{m}_j(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}}),\hat{\sigma }_j^2(\hat{{\varvec{\alpha }}}^\top {{\varvec{x}}})). \end{aligned}$$

For simplicity of presentation, only the results from MSIM(S) are reported. We apply the proposed estimation procedure to each of the 200 bootstrap samples and then construct the confidence intervals. Table 5 summarizes the performance of the bootstrap procedure: the average and standard deviation of the 500 estimated standard errors, denoted by SE and STD, respectively, and the actual coverage probabilities of the constructed confidence intervals. From Table 5, we can see that the actual coverage probabilities are usually very close to the nominal levels, although they are conservative for some parameters.
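An outline of this conditional bootstrap in Python is sketched below; the fitted-model container and the refit_alpha routine are hypothetical interfaces standing in for the full MSIM estimation procedure.

```python
import numpy as np

def conditional_bootstrap_alpha(X, fitted, refit_alpha, B=200, seed=0):
    """Bootstrap standard errors for alpha by resampling Y | x from the
    fitted MSIM. `fitted` carries alpha_hat, pi_hat, m_hat, s2_hat;
    `refit_alpha(X, y)` re-runs the estimation and returns alpha_hat."""
    rng = np.random.default_rng(seed)
    z = X @ fitted.alpha_hat
    k = len(fitted.m_hat)
    pis = np.column_stack([fitted.pi_hat[j](z) for j in range(k)])
    means = np.column_stack([fitted.m_hat[j](z) for j in range(k)])
    sds = np.column_stack([np.sqrt(fitted.s2_hat[j](z)) for j in range(k)])
    rows = np.arange(len(z))
    boots = []
    for _ in range(B):
        # draw a component label for each i, then Y_i from that component
        labels = np.array([rng.choice(k, p=p / p.sum()) for p in pis])
        y_star = rng.normal(means[rows, labels], sds[rows, labels])
        boots.append(refit_alpha(X, y_star))
    return np.array(boots).std(axis=0)      # bootstrap standard errors
```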

Table 2 Simulation results for Example 1, \(n=400\). Absolute bias (AB), standard deviation (SD), and root mean squared error (RMSE) of global and local parameter estimates using the proposed methods
Table 3 Absolute bias (AB), standard deviation (SD), and root mean squared error (RMSE) of global and local parameter estimates using the proposed methods with various bandwidths, \(n=200\)
Table 4 Absolute bias (AB), standard deviation (SD), and root mean squared error (RMSE) of global and local parameter estimates using the proposed methods with various bandwidths, \(n=400\)
Table 5 Standard errors and coverage probabilities for the global index parameter \({\varvec{\alpha }}\) in Example 1

In the following, we explore the performance of MSIM in even higher dimensional settings. To be more specific, \({\varvec{\alpha }}\) is now set to be a 50-dimensional vector whose first three elements are \(1/\sqrt{3}\) and whose remaining elements are zero. \({{\varvec{x}}}_i\) and the nonparametric functions are generated in a similar manner as before. That is, we now explore the cases \(n=200, p=50\) and \(n=400, p=50\). The AB, SD, and RMSE for the nonparametric functions are reported in Table 6. For the global vector \({\varvec{\alpha }}\), we report the pooled AB, SD, and RMSE, which are defined by

$$\begin{aligned} AB_\alpha =\left[ \frac{1}{p}\sum _{t=1}^p\left( E\hat{\alpha }_t-\alpha _t\right) ^2\right] ^{1/2}, \quad \text {and}\quad SD_\alpha =\left[ \frac{1}{p}\sum _{t=1}^p E\left( \hat{\alpha }_t-E\hat{\alpha }_t\right) ^2\right] ^{1/2}. \end{aligned}$$

The results are based on MSIM(S). The results from the previous \(n=200, p=8\) and \(n=400, p=8\) cases are also reported for comparison. It can be seen that when we increase p to 50, the MSIM still provides reasonable, although slightly worse, estimates.

Table 6 Example 1 in higher dimensional case
Table 7 Simulation results for Example 2, \(n=200\)
Table 8 Simulation results for Example 2, \(n=400\)

Example 2

Next, we consider a two-component MRSIP:

$$\begin{aligned} \pi _1(z)=0.5-0.35\sin (\pi z)\quad \hbox {and}\quad \pi _2(z)=1-\pi _1(z),\\ m_1({{\varvec{x}}})=3+3x_2\quad \hbox {and}\quad m_2({{\varvec{x}}})=-3+2x_1+3x_3,\\ \sigma ^2_1=0.6\quad \hbox {and}\quad \sigma ^2_2=0.4, \end{aligned}$$

where \(m_1({{\varvec{x}}})\) and \(m_2({{\varvec{x}}})\) are the regression functions for the first and second components, respectively, and \({{\varvec{x}}}_i\) is an eight-dimensional random vector with independent uniform (0,1) components. Let \({\varvec{\beta }}_1=(3,0,3,0,0,0,0,0,0)^\top \), \({\varvec{\beta }}_2=(-3,2,0,3,0,0,0,0,0)^\top ,\) and \({\varvec{\alpha }}=(1,1,1,0,0,0,0,0)^\top /\sqrt{3}\). MRSIP with the true value (T) and SIR (S) as initial values is used to fit the data, and the results are compared to those of the traditional mixture of linear regression models (MLR). The bandwidth for MRSIP is chosen by cross-validation, similar to Example 1.

Tables 7 and 8 report the AB, SD, and RMSE of the estimates. From both tables, we can see that MRSIP performs comparably to MLR when the sample size is small, and outperforms MLR when the sample size is large. It is clear that MRSIP provides better estimates of the component proportions than MLR, since the constant-proportion assumption of MLR is violated. By reducing the modeling bias of the component proportions, MRSIP is able to better classify observations into the two components and thus provides better estimates of the component regression parameters. In addition, we can see that MRSIP(S) provides results similar to those of MRSIP(T), which demonstrates that SIR provides good initial values for MRSIP.

Next, the standard error estimates and the confidence intervals for \({\varvec{\beta }}_1, {\varvec{\beta }}_2, \sigma _1^2, \sigma _2^2\) and \({\varvec{\alpha }}\), obtained via the conditional bootstrap procedure introduced in Example 1, are reported in Table 9. The results are based on MRSIP(S). From Table 9, we can see that the actual coverage probabilities are usually close to the nominal coverage probabilities.

Table 9 Standard errors and coverage probabilities for global parameters in Example 2

We now illustrate the performance of the MRSIP model for a high dimensional setting when p is increased to 50. In this case, \({\varvec{\alpha }}=(1,1,1,0,\ldots ,0)^\top /\sqrt{3}\) is a 50 dimensional vector, \({\varvec{\beta }}_1=(3,0,3,0,\ldots ,0)^\top \) and \({\varvec{\beta }}_2=(-3,2,0,3,0,\ldots ,0)^\top \) are 51-dimensional vectors. \(\pi _j(\cdot )\) and \(\sigma _j^2\), \(j=1,2\) are still the same as before. The results are summarized in Table 10, where the AB and SD of \({\varvec{\beta }}\) are defined as

$$\begin{aligned}&AB_\beta =\left[ \frac{1}{p+1}\sum _{j=1}^k\sum _{t=1}^{p+1}\left( E\hat{\beta }_{jt}-\beta _{jt}\right) ^2\right] ^{1/2}, \quad \text {and }\\&SD_\beta =\left[ \frac{1}{p+1}\sum _{j=1}^k\sum _{t=1}^{p+1} E\left( \hat{\beta }_{jt}-E\hat{\beta }_{jt}\right) ^2\right] ^{1/2}. \end{aligned}$$

The results are based on MRSIP(S). From the table, we can see that, for a given sample size, increasing the number of predictors degrades the performance of MRSIP. However, if we increase the sample size to 800, the MRSIP still works very well even with 50 predictors.

Table 10 Example 2 in higher dimensional case

5 Real data examples

Example 1

(NBA data) We illustrate the proposed methodology by an analysis of “The effectiveness of National Basketball Association guards”. There are many ways to measure the (statistical) performance of guards in the NBA. Of interest is how the height of the player (Height), minutes per game (MPG), and free throw percentage (FTP) affect points per minute (PPM) (Chatterjee et al. 1995).

Fig. 1

NBA data: (a) estimated mean functions and a hard-clustering result; (b) mean squared prediction errors for five-fold CV, ten-fold CV, MCCV \(\text {d}=10\), and MCCV \(\text {d}=20\)

The data set contains descriptive statistics for all 105 guards for the 1992–1993 season. Since players playing very few minutes are quite different from those who play a sizable part of the season, we only consider players playing 10 or more minutes per game and appearing in 10 or more games. In addition, Michael Jordan is an outlier, so we also omit him from the analysis. These exclusions remove 10 players (Chatterjee et al. 1995). We divide each variable by its corresponding standard deviation, so that the variables have comparable numerical scales.

To evaluate the prediction performance of the proposed models and compare them to the linear regression model (Linear), the mixture of linear regression models (MLR), the nonparametric mixture of regression models (MNP, Huang et al. 2013), the mixture of regression models with varying mixing proportions (VaryPr1, Huang and Yao 2012), and the mixtures of regressions with predictor-dependent mixing proportions (VaryPr2, Young and Hunter 2010), we used d-fold cross-validation with \(d=5\), 10, and Monte Carlo cross-validation (MCCV) with \(d=10\), 20 (Shao 1993). In MCCV, the data were partitioned 500 times into disjoint training subsets (of size \(n-d\)) and test subsets (of size d). The mean squared prediction errors evaluated on the test data sets are reported as boxplots in Fig. 1b. Clearly, the MSIM has better prediction power than the rest of the models, followed by MNP. The average improvement rates of MSIM over MNP, for the four cross-validation methods, are 46.64%, 15.16%, 22.87%, and 25.22%, respectively.

Fig. 2

Corporations core competition data: mean squared prediction errors for five-fold CV, ten-fold CV, MCCV \(\text {d}=20\), and MCCV \(\text {d}=40\)

Next, we give a detailed analysis of this dataset using the MSIM. The optimal bandwidth selected by the CV procedure is 0.344. Figure 1a shows the estimated mean functions and the hard-clustering results, denoted by dots and squares, respectively. The 95% confidence intervals for the components of \(\hat{{\varvec{\alpha }}}\) are (0.134, 0.541), (0.715, 0.949), and (0.202, 0.679). Therefore, MPG is the most influential factor on PPM. This might be partly explained by the fact that coaches tend to let good players with higher PPM play more minutes per game (i.e., have higher MPG). The two groups of guards found by our new model might be explained by the difference between shooting guards and passing guards.

Example 2

(Corporations core competition data) We now analyze a corporations core competition dataset, which includes descriptive statistics and financial data of 196 listed manufacturing companies for the year 2017. Of interest is what determines the value of a corporation. The response variable is the market value Y, and the independent variables are general assets (\(X_1\)), sales revenue (\(X_2\)), sales revenue growth rate (\(X_3\)), income per capita (\(X_4\)), earning cash flow (\(X_5\)), inventory turnover ratio (\(X_6\)), accounts receivable turnover (\(X_7\)), earnings per share (\(X_8\)), return on equity (\(X_9\)), research and development expenditure (\(X_{10}\)), proportion of scientific research personnel (\(X_{11}\)), proportion of staff with undergraduate degrees (\(X_{12}\)), rate of production equipment updates (\(X_{13}\)), sales revenue within the industry (\(X_{14}\)), and market share (\(X_{15}\)). Each variable is scaled before further analysis.

Table 11 Standard errors and 95% confidence intervals for global parameters of MRSIP for corporations core competition data

Similar to the previous example, we compare the prediction performance of the proposed models to that of the five existing models. Both CV and MCCV are applied to this dataset, and the mean squared prediction errors are shown in Fig. 2. We can see that MRSIP and MLR are the two best models for this dataset. To better compare these two methods, we also compute the average improvement rates of MRSIP over MLR for the four cross-validation methods, which are 11.19%, 6.2%, 6.8%, and 27.10%, respectively. In conclusion, MRSIP is the best choice for this dataset, followed by the mixture of linear regressions.

Table 11 shows the standard errors and 95% confidence intervals for \(\alpha \), \(\beta _1\), and \(\beta _2\) under the MRSIP. It can be seen that \(X_1, X_2\), and \(X_4\) have significant effects on the mixing proportions. Note that there are variables, such as \(X_3\), that do not have significant effects on \(\alpha \) or the \(\beta \)’s; therefore, variable selection methods might help to further improve the accuracy and homogeneity of the model estimation. Hard clustering shows that most companies in the first component are innovative companies, which spend large amounts of money on recruiting high-end talent and developing new products, while the second component mostly contains more traditional and conservative companies.

6 Discussion

In this paper, we propose two finite semiparametric mixtures of regression models and provide modified EM-type algorithms to estimate them. We establish the identifiability results of the new models and investigate the asymptotic properties of the proposed estimation procedures. Throughout the article, we assume that the number of components is known and fixed; more research is needed on selecting the number of components for the proposed semiparametric mixture models. It will be interesting to know whether the recently proposed EM test (Chen and Li 2009; Li and Chen 2010) can be extended to the proposed semiparametric mixture models. In addition, it is also of interest to build a formal model selection procedure to compare different semiparametric mixture models. In the real data applications, we use cross-validation criteria to compare different models. When the models are nested, one might use the generalized likelihood ratio statistic proposed by Fan et al. (2001) to test the parametric assumptions of the semiparametric models. Furthermore, the assumption of a fixed dimension of predictors can be relaxed, and the proposed models can be extended to the case where the dimension of predictors p also diverges with the sample size n. This might be done by using the idea of penalized local likelihood if a sparsity assumption is imposed on the predictors. We employed the kernel regression method to estimate the nonparametric functions; one could also use local polynomial regression (such as local linear regression) to possibly reduce the bias and boundary effects of the estimators.