
1 Introduction

The Gaussian process (GP) is a powerful model for a wide range of applications in machine learning, including regression [1], classification [2], dimensionality reduction [3], and reinforcement learning [4]. Nevertheless, the GP model fails to describe multimodal datasets, and training a GP consumes \( O(N^{3}) \) time for N training samples [5, 6]. To address these problems, Tresp [7] proposed the Mixture of Gaussian Processes (MGP) in 2000, adapted from the mixture-of-experts framework. Since then, various MGP models have been proposed, and most of them are special cases of the mixture of experts in which each expert is a GP.

Another useful way of reducing the cost of training a GP is to adopt the sparse Gaussian process (SGP) model, which computes the GP likelihood with a pseudo dataset that is smaller than the training dataset [8]. To combine the advantages of MGP and SGP, the Mixture of Sparse Gaussian Processes (MSGP) has also been proposed, in which each component is an SGP instead of a GP so as to accelerate parameter learning [5, 6, 9–11].

For these MGP and MSGP models, there are three main training algorithms: Markov chain Monte Carlo (MCMC), variational Bayesian inference (VB), and the EM algorithm. MCMC approximates the posterior of the parameters by sampling several long Markov chains [12–19]; however, it is quite time-consuming, as pointed out in [9]. VB approximates the posterior of the parameters with factorized forms, which may deviate considerably from the true posterior [20–24].

Generally, the EM algorithm is an effective approach to learning mixture models. However, since the exact posterior of the latent variables and the Q function are intractable for MGP and MSGP models, several approximation strategies have been adopted. For example, the variational EM algorithm approximates the posterior in the E-step with variational inference [5, 9–11, 25], and thus has the same drawbacks as VB. In [26, 27], the Q function was decomposed via leave-one-out cross-validation (LOOCV), which takes a complicated form and involves considerable computation. Stachniss et al. [6] and Tresp [7] attempted to simplify the learning process with heuristic estimation in the M-step. However, the kernel parameters were predetermined without learning in [7], whereas the parameters were sampled from data-independent distributions in [6]; both learning processes were therefore insufficiently guided by the data.

Recently, hard-cut EM algorithms have also been adopted to improve learning efficiency; they assign each sample to the component with the maximum posterior in the E-step and learn the parameters of each component independently in the M-step [9, 28]. In the same way, Yang and Ma [26] adjusted their soft EM algorithm for MGP with the LOOCV probability decomposition into a hard-cut EM algorithm for MGP, which is referred to here as the LOOCV hard-cut EM algorithm for convenience.

All of these algorithms resort to approximation strategies because the MGP and MSGP models are usually very complex. Recently, Chen et al. [29] refined the MGP model into a simpler form and strictly derived a precise hard-cut EM algorithm for it. In this paper, we develop this approach by adding sparsity constraints, adjusting the refined MGP model into an MSGP model. As a result, parameter learning becomes more efficient, while the hard-cut EM algorithm can still be strictly derived. The experimental results demonstrate that our proposed model and algorithm are feasible and can outperform several typical regression algorithms.

2 The Mixture of Sparse Gaussian Processes

2.1 Gaussian Process (GP)

A GP model for regression is mathematically defined by

$$ \left\{ \begin{aligned} & F = \left[ {\begin{array}{*{20}c} {f_{1} } \\ {f_{2} } \\ \vdots \\ {f_{N} } \\ \end{array} } \right]\sim N\left( {\left[ {\begin{array}{*{20}c} {m(x_{1} )} \\ {m(x_{2} )} \\ \vdots \\ {m(x_{N} )} \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} {K(x_{1} ,x_{1} )} & {K(x_{1} ,x_{2} )} & \cdots & {K(x_{1} ,x_{N} )} \\ {K(x_{2} ,x_{1} )} & {K(x_{2} ,x_{2} )} & \cdots & {K(x_{2} ,x_{N} )} \\ \vdots & \vdots &\ddots & \vdots \\ {K(x_{N} ,x_{1} )} & {K(x_{N} ,x_{2} )} & \cdots & {K(x_{N} ,x_{N} )} \\ \end{array} } \right]} \right) \\ & y_{t} \sim N(f_{t} ,\sigma^{2} ) \\ \end{aligned} \right.,$$
(1)

where \( x_{t} \), \( f_{t} \) and \( y_{t} \) denote the input, latent response and output of a training sample, respectively, K(u, v) is a Mercer kernel function, and \( \sigma^{2} \) denotes the noise intensity. As in most cases, we adopt the zero mean function (m ≡ 0) and the most popular kernel function, the squared exponential (SE) kernel [30]:

$$ K(u,v) = l^{2} \exp \left[ { - \frac{1}{2}\sum\limits_{k = 1}^{d} {b_{k}^{2} (u_{k} - v_{k} )^{2} } } \right], $$
(2)

where d is the dimensionality of the inputs and each dimension has its own weight \( b_{k} \) to realize automatic feature selection.
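
To make the notation concrete, the SE kernel of Eq. (2) can be evaluated as in the following minimal NumPy sketch; the function name se_kernel and its argument names are ours, not from the cited references.

```python
import numpy as np

def se_kernel(U, V, l, b):
    """SE kernel of Eq. (2): K(u, v) = l^2 exp(-0.5 * sum_k b_k^2 (u_k - v_k)^2).

    U: (n, d) array of inputs, V: (m, d) array of inputs,
    l: output scale, b: (d,) per-dimension weights for feature selection.
    Returns the (n, m) kernel matrix [K(u_i, v_j)].
    """
    diff = U[:, None, :] - V[None, :, :]            # (n, m, d) pairwise differences
    sq = np.sum((b ** 2) * diff ** 2, axis=-1)      # weighted squared distances
    return l ** 2 * np.exp(-0.5 * sq)

# example: kernel matrix of 5 random 3-dimensional inputs
X = np.random.randn(5, 3)
K = se_kernel(X, X, l=1.0, b=np.ones(3))
```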

There are several ways of learning the hyper-parameters of a GP model, including maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, the surrogate predictive probability (GPP), and cross validation (CV) [31].

With the estimated hyper-parameters, the predictive distribution of the output at a test input x* can be obtained as follows:

$$ p(y^{*} |x^{*} )\sim N\left[ {K(x^{*} ,X)K(X,X)^{ - 1} y,K(x^{*} ,x^{*} ) - K(x^{*} ,X)K(X,X)^{ - 1} K(X,x^{*} )} \right], $$
(3)

where \( K(x^{*} ,X) = [K(x^{*} ,x_{j} )]_{1 \times N} \), \( K(X,x^{*} ) = [K(x_{i} ,x^{*} )]_{N \times 1} \) and \( K(X,X) = [K(x_{i} ,x_{j} )]_{N \times N} \) are kernel matrices [30].
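
Equation (3) can be implemented directly, as sketched below. The sketch reuses the se_kernel function defined above and, following the noise model of Eq. (1), adds the noise variance \( \sigma^{2} \) to the diagonal of K(X, X), which Eq. (3) leaves implicit; the helper name gp_predict is ours.

```python
import numpy as np

def gp_predict(X, y, Xstar, l, b, noise_var):
    """Predictive mean and variance of Eq. (3) at test inputs Xstar.

    X: (N, d) training inputs, y: (N,) training outputs,
    Xstar: (L, d) test inputs; se_kernel is the sketch from above.
    """
    K = se_kernel(X, X, l, b) + noise_var * np.eye(len(X))   # K(X, X) + sigma^2 I
    Ks = se_kernel(Xstar, X, l, b)                           # K(x*, X)
    Kss = se_kernel(Xstar, Xstar, l, b)                      # K(x*, x*)
    mean = Ks @ np.linalg.solve(K, y)                        # K(x*, X) K(X, X)^{-1} y
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)
```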

2.2 Mixture of Gaussian Processes (MGP)

MGP is a special mixture model in which each component is a Gaussian process, and these components are independent in most existing MGP models. Denote \( z_{t} = c \) iff \( (x_{t} ,y_{t} ) \) belongs to the c-th GP, and denote

$$ X_{c} = [x_{{t_{c} (1)}} ,x_{{t_{c} (2)}} , \cdots ,x_{{t_{c} (N_{c} )}} ]^{T} \;\;\;\;Y_{c} = [y_{{t_{c} (1)}} ,y_{{t_{c} (2)}} , \cdots ,y_{{t_{c} (N_{c} )}} ]^{T} $$

as the inputs and outputs of the c-th GP, respectively, where \( N_{c} \) is the number of samples in the c-th GP. Therefore, the output likelihood is mathematically given by

$$ Y_{c} \sim N\left[ {0,K(X_{c} ,X_{c} |\theta_{c} ) + \sigma_{c}^{2} I} \right];c = 1,2, \cdots ,C, $$
(4)

where the SE kernel given by Eq. (2) is parameterized by \( \theta_{c} = \{ l_{c} ,\sigma_{c} \}\,\cup\,\{ b_{ck} :k = 1,2, \cdots ,d\} \) for the c-th GP.

MGP has two main advantages over a single GP. First, MGP can capture the heterogeneity among different input regions by fitting them with different GPs. For example, the toy dataset used by several MGP models [12, 17, 26, 29] was generated from 4 continuous functions with Gaussian noise, as shown in Fig. 1, so it is better fitted by an MGP with four components than by a single GP. Second, the computational cost can be reduced by dividing the large kernel matrix of a single GP into the smaller kernel matrices of the components.

Fig. 1. Toy dataset

2.3 Sparse Gaussian Process (SGP)

Another good scaling technique for GPs is the sparse Gaussian process model. The main idea of SGP is to approximate the N training samples with M pseudo samples (\( M \ll N \)) [32]. Mathematically, we denote the inputs of the training data, test data and pseudo data by \( X = [x_{1} ,x_{2} , \cdots ,x_{N} ]^{T} \), \( X^{*} = [x_{1}^{*} ,x_{2}^{*} , \cdots ,x_{L}^{*} ]^{T} \) and \( U = [u_{1} ,u_{2} , \cdots ,u_{M} ]^{T} \), respectively, and their corresponding latent responses by \( F = [f_{1} ,f_{2} , \cdots ,f_{N} ]^{T} \), \( F^{*} = [f_{1}^{*} ,f_{2}^{*} , \cdots ,f_{L}^{*} ]^{T} \) and \( V = [v_{1} ,v_{2} , \cdots ,v_{M} ]^{T} \), respectively. Most sparse GP models can be mathematically defined by the following equations [32]:

$$ V\sim N[0,K(U,U)], $$
(5)
$$ p(F,F^{*} |V) = q(F|V)q(F^{*} |V), $$
(6)
$$ y_{t} |f_{t} \sim N(f_{t} ,\sigma^{2} )\;i.i.d., $$
(7)
$$ y_{t}^{*} |f_{t}^{*} \sim N(f_{t}^{*} ,\sigma^{2} )\;i.i.d., $$
(8)

where Eq. (5) means that the latent pseudo outputs V follow a GP prior, as in Eq. (1), and Eq. (6) gives the crucial assumption of SGP, i.e., that the latent responses of the training and test datasets are conditionally independent given V. In Eq. (6), q(F|V) and q(F*|V) are approximations of the following GP predictive distributions

$$ \begin{aligned} & p(F^{*} |V)\sim N\left[ {K(X^{*} ,U)K(U,U)^{ - 1} V,K(X^{*} ,X^{*} ) - K(X^{*} ,U)K(U,U)^{ - 1} K(U,X^{*} )} \right] \\ & p(F|V)\sim N\left[ {K(X,U)K(U,U)^{ - 1} V,K(X,X) - K(X,U)K(U,U)^{ - 1} K(U,X)} \right], \\ \end{aligned} $$
(9)

respectively, and the forms of q(F|V) and q(F*|V) determine the kind of SGP model, including the SoR, DTC, FITC and PITC models [32]. Among these, the Fully Independent Training Conditional (FITC) model proposed in [8] was highly recommended in [32] due to its richer covariance structure. Experimental results have shown a clear advantage of FITC over the other sparse GPs in predictive accuracy, whereas the increase in time consumption is trivial [8]. Therefore, we use FITC as the component model in our proposed mixture of sparse Gaussian processes (MSGP).

In FITC model, the conditional distribution of the latent test outputs is the exact GP predictive distribution:

$$ \begin{aligned} & q(F^{*} |V) = p(F^{*} |V) \\ & \sim N\left[ {K(X^{*} ,U)K(U,U)^{ - 1} V,K(X^{*} ,X^{*} ) - K(X^{*} ,U)K(U,U)^{ - 1} K(U,X^{*} )} \right], \\ \end{aligned} $$
(10)

whereas the latent training outputs are assumed to be conditionally independent given the latent pseudo output V, as suggested by the name “Fully Independent Training Conditional”, i.e.

$$ q(F|V)\sim N\left[ {K(X,U)K(U,U)^{ - 1} V,\Lambda } \right], $$
(11)

where \( \Lambda = diag\left[ {K(X,X) - K(X,U)K(U,U)^{ - 1} K(U,X)} \right] \) is the diagonal matrix that retains only the diagonal entries of the bracketed matrix.

It can be inferred that the marginal likelihood for the outputs is

$$ p(Y)\sim N\left[ {0,K(X,U)K(U,U)^{ - 1} K(U,X) +\Lambda + \sigma^{2} I} \right], $$
(12)

and the predictive distribution can be derived as follows.

$$ \begin{array}{ll} p(Y^{*} |Y)\sim N\left\{ {K(X^{*} ,U)Q^{ - 1} K(U,X)(\Lambda + \sigma^{2} I)^{ - 1} Y,} \right. \\ \left. {K(X^{*} ,X^{*} ) - K(X^{*} ,U)[K(U,U)^{ - 1} - Q^{ - 1} ]K(U,X^{*} ) + \sigma^{2} I} \right\}, \end{array} $$
(13)

where \( Q = K(U,U) + K(U,X)(\Lambda + \sigma^{2} I)^{ - 1} K(X,U) \).
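
For reference, Eqs. (11)–(13) can be written out as in the following sketch, which again assumes the se_kernel function from Sect. 2.1's sketch; the function names are ours, a small jitter is added to K(U, U) for numerical stability, and Cholesky-based implementations would be preferred in practice.

```python
import numpy as np

def fitc_lambda(X, U, l, b, jitter=1e-8):
    """Diagonal matrix Lambda of Eq. (11)."""
    Kxu = se_kernel(X, U, l, b)
    Kuu = se_kernel(U, U, l, b) + jitter * np.eye(len(U))
    q_diag = np.sum(Kxu * np.linalg.solve(Kuu, Kxu.T).T, axis=1)  # diag of K_XU K_UU^{-1} K_UX
    return np.diag(np.diag(se_kernel(X, X, l, b)) - q_diag)

def fitc_log_marginal(X, y, U, l, b, noise_var):
    """Log of the FITC marginal likelihood in Eq. (12)."""
    Kxu = se_kernel(X, U, l, b)
    Kuu = se_kernel(U, U, l, b) + 1e-8 * np.eye(len(U))
    Sigma = (Kxu @ np.linalg.solve(Kuu, Kxu.T)
             + fitc_lambda(X, U, l, b) + noise_var * np.eye(len(X)))
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + y @ np.linalg.solve(Sigma, y) + len(X) * np.log(2 * np.pi))

def fitc_predict(X, y, U, Xstar, l, b, noise_var):
    """Predictive mean and covariance of Eq. (13)."""
    Kxu = se_kernel(X, U, l, b)
    Kuu = se_kernel(U, U, l, b) + 1e-8 * np.eye(len(U))
    Ksu = se_kernel(Xstar, U, l, b)
    D = fitc_lambda(X, U, l, b) + noise_var * np.eye(len(X))     # Lambda + sigma^2 I
    Q = Kuu + Kxu.T @ np.linalg.solve(D, Kxu)                    # Q of Eq. (13)
    mean = Ksu @ np.linalg.solve(Q, Kxu.T @ np.linalg.solve(D, y))
    cov = (se_kernel(Xstar, Xstar, l, b)
           - Ksu @ (np.linalg.inv(Kuu) - np.linalg.inv(Q)) @ Ksu.T
           + noise_var * np.eye(len(Xstar)))
    return mean, cov
```

The log marginal likelihood above is the objective that is maximized over the pseudo inputs and hyper-parameters when a FITC component is trained.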

The crucial issue in training an SGP is to find appropriate pseudo inputs. One method is to greedily select a fixed number of training inputs one by one according to a certain criterion [33–36]. Snelson and Ghahramani [8] extended the range of pseudo inputs from the training set to the whole Euclidean space and learnt the pseudo inputs and the hyper-parameters together by maximizing the likelihood, which is more flexible than greedy selection and performs better in experiments.

2.4 The FITC Mixture Model

Since the MGP model proposed in [29] is refined and can be trained by a precise learning algorithm, we inherit its general framework for our MSGP model. The only difference is that an SGP is adopted for each component to speed up the training process.

As in [29], we adopt the following gating function and Gaussian inputs.

$$ \Pr \left( {z_{t} = c} \right) = \pi_{c} ;\quad c = 1\sim C,\;\;i.i.d.\;{\text{for}}\;t = 1\sim N, $$
(14)
$$ p(x_{t} |z_{t} = c)\sim N(\mu_{c} ,S_{c} );\quad c = 1\sim C,\;\;i.i.d.\;{\text{for}}\;t = 1\sim N. $$
(15)

Furthermore, for each component we use the FITC model, owing to the advantages discussed in Sect. 2.3, to describe the SGP input–output relation:

$$ p(Y_{c} |X_{c} ,U_{c} ,\theta_{c} )\sim N\left[ {0,K(X_{c} ,U_{c} |\theta_{c} )K(U_{c} ,U_{c} |\theta_{c} )^{ - 1} K(U_{c} ,X_{c} |\theta_{c} ) +\Lambda _{c} + \sigma_{c}^{2} I} \right], $$
(16)

where Uc denotes the pseudo inputs in the c-th component,

$$ \Lambda _{c} = diag\left[ {K(X_{c} ,X_{c} |\theta_{c} ) - K(X_{c} ,U_{c} |\theta_{c} )K(U_{c} ,U_{c} |\theta_{c} )^{ - 1} K(U_{c} ,X_{c} |\theta_{c} )} \right], $$
(17)

and \( X_{c} \), \( Y_{c} \), \( \theta_{c} \) follow the meanings in Sect. 2.2.

Our MSGP model is fully described by Eqs. (14)–(16), and it has fewer parameters than previous MSGP models [5, 6, 9–11] while still retaining the advantages of combining sparse GPs with a mixture model. For this refined model, we strictly derive its training algorithm, called the precise hard-cut EM algorithm.
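
As an illustration of the generative process defined by Eqs. (14)–(16), the following sketch draws one dataset from the model. It assumes the se_kernel and fitc_lambda sketches given earlier, and for simplicity it also samples the pseudo inputs \( U_{c} \) from the input Gaussians, which is our own illustrative choice rather than part of the model.

```python
import numpy as np

def sample_msgp(N, M, pi, mus, Ss, thetas, noise_vars, rng=None):
    """Draw (X, y, z) from the MSGP model of Eqs. (14)-(16).

    pi: (C,) mixing proportions; mus[c], Ss[c]: input Gaussian parameters;
    thetas[c] = (l_c, b_c): SE kernel parameters; noise_vars[c] = sigma_c^2;
    M: number of pseudo inputs per component.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    C = len(pi)
    z = rng.choice(C, size=N, p=pi)                                    # Eq. (14)
    X = np.stack([rng.multivariate_normal(mus[c], Ss[c]) for c in z])  # Eq. (15)
    y = np.zeros(N)
    for c in range(C):
        idx = np.where(z == c)[0]
        if len(idx) == 0:
            continue
        l, b = thetas[c]
        Uc = rng.multivariate_normal(mus[c], Ss[c], size=M)            # illustrative pseudo inputs
        Kxu = se_kernel(X[idx], Uc, l, b)
        Kuu = se_kernel(Uc, Uc, l, b) + 1e-8 * np.eye(M)
        cov = (Kxu @ np.linalg.solve(Kuu, Kxu.T)
               + fitc_lambda(X[idx], Uc, l, b)
               + noise_vars[c] * np.eye(len(idx)))                     # covariance of Eq. (16)
        y[idx] = rng.multivariate_normal(np.zeros(len(idx)), cov)      # Eq. (16)
    return X, y, z
```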

3 The Hard-Cut EM Algorithm for the FITC Mixtures

Since the outputs of both the MGP and MSGP models are dependent, the computational complexity of the Q function is exponential, involving a summation over all label configurations. To avoid such an expensive computation, a hard-cut EM algorithm can be adopted: the training samples are partitioned into the components with the maximum posterior in the E-step, so that the parameters of each component can be learnt separately via MLE in the M-step. Because our model is developed from the MGP model in [29], where the precise hard-cut EM algorithm was strictly derived and performed quite well in both prediction accuracy and learning efficiency, we adapt its general learning framework to train the FITC mixture.

It can easily be derived that the marginal output likelihood of each sample remains exactly the same as in [29], i.e.

$$ p(y_{t} |x_{t} ,z_{t} = c) = N\left[ {\left. {y_{t} } \right|0,K(x_{t} ,x_{t} |\theta_{c} ) + \sigma_{c}^{2} } \right] = N(y_{t} |0,l_{c}^{2} + \sigma_{c}^{2} ), $$
(18)

so the same E-step posterior as in [29] can be derived using Bayes' formula:

$$ \Pr (z_{t} = c|x_{t} ,y_{t} ) = \frac{{\pi_{c} N(x_{t} |\mu_{c} ,S_{c} )N(y_{t} |l_{c}^{2} + \sigma_{c}^{2} )}}{{\sum\limits_{k = 1}^{C} {\pi_{k} N(x_{t} |\mu_{k} ,S_{k} )N(y_{t} |l_{k}^{2} + \sigma_{k}^{2} )} }}. $$
(19)
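
The E-step posterior of Eq. (19) is a simple gating computation; a minimal sketch for a single sample is given below. The function name is ours, SciPy densities are used, and in practice the products would be computed in log-space for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def responsibilities(x, y, pi, mus, Ss, ls, sigmas):
    """Posterior Pr(z_t = c | x_t, y_t) of Eq. (19) for one sample (x, y).

    pi: (C,) mixing proportions; mus[c], Ss[c]: input Gaussian parameters;
    ls[c], sigmas[c]: kernel output scale l_c and noise standard deviation sigma_c.
    """
    C = len(pi)
    p = np.array([
        pi[c]
        * multivariate_normal.pdf(x, mean=mus[c], cov=Ss[c])                # N(x_t | mu_c, S_c)
        * norm.pdf(y, loc=0.0, scale=np.sqrt(ls[c] ** 2 + sigmas[c] ** 2))  # N(y_t | 0, l_c^2 + sigma_c^2), Eq. (18)
        for c in range(C)
    ])
    return p / p.sum()
```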

In the M-step, however, the GP likelihood used in [29] is replaced by the FITC likelihood given by Eq. (16), which requires less computation, and pseudo inputs are introduced into the learning of the hyper-parameters. With this adjustment of [29], we obtain the procedure of our proposed precise hard-cut EM algorithm as follows; a minimal code sketch of one iteration is given after the procedure.

  • Step 1 (Initialize the partition of the training data). Divide the vectors \( \left\{ {[x_{t}^{T} ,y_{t} ]} \right\}_{t = 1}^{N} \) into C clusters, and set \( z_{t} \) to the index of the cluster that contains the t-th sample.

  • Step 2 (M-step). Learn the parameters of each component by MLE:

    $$ \pi_{c} \leftarrow \frac{{N_{c} }}{N},\quad \mu_{c} \leftarrow \frac{1}{{N_{c} }}\sum\limits_{t = 1}^{N} {I(z_{t} = c)x_{t} } ,\quad S_{c} \leftarrow \frac{1}{{N_{c} }}\sum\limits_{t = 1}^{N} {I(z_{t} = c)(x_{t} - \mu_{c} )(x_{t} - \mu_{c} )^{T} } , $$
    (20)
    $$ (\hat{U}_{c} ,\hat{\theta }_{c} ) = \mathop {\arg \hbox{max} }\limits_{{U_{c} ,\theta_{c} }} \ln p(Y_{c} |X_{c} ,U_{c} ,\theta_{c} ), $$
    (21)

    where \( N_{c} \) is the number of samples in the c-th component; Eq. (20) is the same as the parameter update of a Gaussian mixture model, and Eq. (21) follows [8].

  • Step 3 (E-step). Repartition the samples into their components according to the maximum a posteriori criterion:

    $$ z_{t} \leftarrow \mathop {\arg \hbox{max} }\limits_{c} \;\Pr (z_{t} = c|x_{t} ,y_{t} ) = \mathop {\arg \hbox{max} }\limits_{c} [\pi_{c} N(x_{t} |\mu_{c} ,S_{c} )N(y_{t} |l_{c}^{2} + \sigma_{c}^{2} )] $$
    (22)
  • Step 4. Stop if the number of changed labels falls below η% of the number of training samples, where 0 ≤ η ≤ 10 is a threshold; otherwise, return to Step 2.
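
The following skeleton sketches one iteration of Steps 2–3 under our notation. Equation (20) is written out in closed form, while the FITC optimization of Eq. (21) is only indicated by a comment, since it requires a numerical maximization of the log likelihood of Eq. (12) (e.g., of the fitc_log_marginal sketch in Sect. 2.3 with a gradient-based optimizer); the parameter container and function names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def hard_cut_em_iteration(X, y, z, C, params):
    """One M-step (Eq. (20)) followed by one hard E-step (Eq. (22)).

    params is a dict with per-component arrays: 'pi', 'mu', 'S', 'l', 'sigma'
    (and, for the FITC part, 'U' and 'b', which Eq. (21) would update).
    Returns the new hard labels z.
    """
    N, d = X.shape
    for c in range(C):
        idx = np.where(z == c)[0]
        if len(idx) == 0:
            continue
        # M-step, Eq. (20): mixing proportion and input Gaussian of component c
        params['pi'][c] = len(idx) / N
        params['mu'][c] = X[idx].mean(axis=0)
        diff = X[idx] - params['mu'][c]
        params['S'][c] = diff.T @ diff / len(idx) + 1e-6 * np.eye(d)  # small ridge for stability
        # M-step, Eq. (21): maximize the FITC log likelihood of Eq. (12)
        # over (U_c, theta_c) with a numerical optimizer -- omitted in this sketch.
    # Hard E-step, Eq. (22): reassign each sample to its MAP component
    scores = np.zeros((N, C))
    for c in range(C):
        scores[:, c] = (params['pi'][c]
                        * multivariate_normal.pdf(X, mean=params['mu'][c], cov=params['S'][c])
                        * norm.pdf(y, loc=0.0,
                                   scale=np.sqrt(params['l'][c] ** 2 + params['sigma'][c] ** 2)))
    return np.argmax(scores, axis=1)
```

Iterating this function, with the FITC update of Eq. (21) filled in, and stopping once fewer than η% of the labels change reproduces Steps 2–4.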

After the learning process above, we have obtained the estimated parameters and the partition of the training data. Then, for a test input x*, we classify it into the z*-th component of the MSGP by maximizing the posterior as in [29]:

$$ z^{*} \leftarrow \mathop {\arg \hbox{max} }\limits_{c} \;\Pr (z^{*} = c|D,x^{*} ) = \mathop {\arg \hbox{max} }\limits_{c} [\pi_{c} N(x^{*} |\mu_{c} ,S_{c} )] $$
(23)

According to this classification, we predict the output at the test input via the z*-th component using Eq. (13).
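
Prediction for a test input therefore combines Eq. (23) with the FITC predictive distribution of Eq. (13); a minimal sketch, assuming the fitc_predict function from Sect. 2.3 and the params container above, is:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_test_input(xstar, params, data_by_component):
    """Route xstar to a component by Eq. (23), then predict by Eq. (13).

    data_by_component[c] = (X_c, y_c): training samples assigned to
    component c by the hard-cut EM algorithm.
    """
    C = len(params['pi'])
    gate = [params['pi'][c] * multivariate_normal.pdf(xstar, mean=params['mu'][c], cov=params['S'][c])
            for c in range(C)]
    zstar = int(np.argmax(gate))                                     # Eq. (23)
    Xc, yc = data_by_component[zstar]
    mean, cov = fitc_predict(Xc, yc, params['U'][zstar], xstar[None, :],
                             params['l'][zstar], params['b'][zstar],
                             params['sigma'][zstar] ** 2)            # Eq. (13)
    return mean[0], cov[0, 0]
```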

4 Experimental Results

4.1 On a Small Synthetic Dataset from the MGP Model

To evaluate the validity and feasibility of the FITC mixture model and its hard-cut EM algorithm, we generate a typical synthetic dataset from an MGP model with 3 components. There are 500 training samples and 100 test samples in each component, and each input has 3 features. We then compare our algorithm with the following typical baselines for regression:

  1. The FITC model and its MLE algorithm [8].

  2. The MGP model and its precise hard-cut EM algorithm [29].

  3. The MGP model and its LOOCV hard-cut EM algorithm [26].

  4. The mixture of sparse GPs model and its variational hard-cut EM algorithm [9].

  5. Linear regression [37].

  6. SVM regression with a Gaussian kernel [38].

Each of these algorithms is run on this synthetic dataset five times, on an Intel(R) Core(TM) i5 CPU with 4.00 GB of RAM, running Matlab R2014b code through Secure CRT. For each algorithm, the mean and standard deviation of the root mean squared error (RMSE) of prediction and of the training time are listed in Table 1. In addition, to make the predictions of FITC and its mixture model more accurate, we initialize the kernel parameters by training a GP model on 20 randomly selected training samples before the MLE learning process, as in [8].

Table 1. The mean and standard deviation of the predictive RMSEs and the training times of each algorithm on the typical synthetic dataset from the MGP model

It can be seen from Table 1 that, on this synthetic dataset, our proposed algorithm is much more accurate than the single FITC model, since the dataset is multimodal. Owing to the sparsity mechanism, our algorithm is much more efficient than the two EM algorithms for MGP models, while the loss of accuracy is trivial compared with the gain in speed. The variational hard-cut EM algorithm makes a coarse conditional-independence assumption on the posterior and is thus not as accurate as our proposed algorithm; it also takes longer to train. Traditional regression methods such as linear regression and SVM regression are too coarse for such a non-linear synthetic dataset.

Furthermore, our proposed algorithm correctly partitions all the training and test samples into their components in every run, which, together with the results in Table 1, shows that our algorithm is feasible and effective.

4.2 On a Large Synthetic Dataset from the FITC Mixture Model

Moreover, to test the advantage of the FITC mixture model and its hard-cut EM algorithm on a larger scale, we generate a typical synthetic dataset from a FITC mixture model with 100 components. Each component has 200 training samples and 200 test samples with 1-dimensional inputs. We then compare our algorithm with the baselines above; the computational environment and experimental details remain the same as before.

On such a large dataset (20000 training samples), some algorithms are prohibitively slow and fail to converge, namely the SVM, the precise hard-cut EM algorithm and the LOOCV hard-cut EM algorithm for MGP, so we omit them in this experiment. This also indicates that our proposed hard-cut EM algorithm is more efficient than the precise hard-cut EM algorithm of the MGP model, owing to the sparsity mechanism.

All the other algorithms are run on this synthetic dataset five times, and the mean and standard deviation of their predictive RMSEs and training times are listed in Table 2. Similarly, for the FITC model and our proposed FITC mixture model, we initialize the kernel parameters by training a GP model on 10 randomly selected training samples before the MLE learning process, as in [8].

Table 2. The mean and standard deviation of the predictive RMSEs and the training times of each algorithm on the typical synthetic dataset from the FITC mixture model

Table 2 indicates that, on this synthetic dataset with many more components, our proposed FITC mixture model has an even greater advantage over the FITC model in both predictive precision and learning efficiency. The conditional-independence assumption of the variational hard-cut EM algorithm, as well as the linearity assumption of linear regression, is unsuitable for such a highly non-linear dataset with dependent samples; therefore, these two algorithms have large prediction errors on this dataset.

4.3 On Kin40k Dataset

Finally, we compare these algorithms on a popular real dataset called kin40k, which is generated by a robot-arm simulator and contains 10000 training samples, 30000 test samples and 9 attributes [34]. The computational environment and implementation details remain the same as above. The mean and standard deviation of the predictive RMSEs and training times for each algorithm are listed in Table 3. Similarly, for the FITC and FITC mixture models, we initialize the kernel parameters by training a GP model on 500 randomly selected training samples before the MLE learning process.

Table 3. The mean and standard deviation of the predictive RMSEs and the training times of each algorithm on the kin40k dataset

As before, on this large dataset, the SVM, the precise hard-cut EM algorithm and the LOOCV hard-cut EM algorithm for MGP are prohibitively slow and fail to converge.

From Table 3, we observe that on the kin40k dataset our proposed algorithm is more precise than the other algorithms because it requires fewer approximations. A possible reason why our algorithm consumes so much time is that, on a real dataset that is not generated from a GP or MGP model, it needs many more iterations. A further improvement could therefore be made by preprocessing a real dataset so that it resembles a dataset simulated from an MGP. Linear regression is still rather coarse since the kin40k dataset is highly non-linear [34].

In general, our algorithm is practical on the real dataset, and its computational cost is acceptable with 10000 training samples.

5 Conclusion

We have refined the MSGP model and developed a hard-cut EM algorithm for its parameter learning. The general framework is adapted from [29], while the learning efficiency is significantly improved by the sparsity mechanism. The experimental results on both synthetic and real datasets demonstrate that the developed hard-cut EM algorithm for the MSGP model is precise and efficient, and generally outperforms conventional linear regression, SVM regression and several other learning algorithms for GP, MGP and MSGP models.