
1 Introduction

The Gaussian process (GP) is a powerful model for a wide range of applications in machine learning, including regression [1], classification [2], dimensionality reduction [3], and reinforcement learning [4]. Nevertheless, the GP model fails to describe multimodal datasets, and training a GP consumes \( O(N^{3}) \) time for N training samples [5, 6]. To address these problems, Tresp [7] proposed the Mixture of Gaussian Processes (MGP) in 2000, adapted from the mixture-of-experts framework. Since then, various MGP models have been proposed, and most of them are special cases of the mixture of experts in which each expert is a GP.

Another useful way of reducing the cost of training a GP is to adopt the sparse Gaussian process (SGP) model, which computes the GP likelihood with a pseudo dataset that is smaller than the training dataset [8]. To combine the advantages of MGP and SGP, the Mixture of Sparse Gaussian Processes (MSGP) has also been proposed, in which each component is an SGP instead of a GP so as to accelerate parameter learning [5, 6, 9–11].

For these MGP and MSGP models, there are three main training algorithms: Markov chain Monte Carlo (MCMC), variational Bayesian inference (VB), and the EM algorithm. MCMC approximates the posterior of the parameters by sampling several long Markov chains [12–19]; however, it is quite time-consuming, as pointed out in [9]. VB approximates the posterior of the parameters with factorized forms, which may deviate considerably from the true posterior [20–24].

Generally, the EM algorithm is an effective approach to learning mixture models. However, since the exact posterior of the latent variables and the Q function are intractable for MGP and MSGP models, several approximation strategies have been adopted. For example, the variational EM algorithm approximates the posterior in the E-step with variational inference [5, 9–11, 25], and thus has the same drawbacks as VB. In [26, 27], the Q function was decomposed via leave-one-out cross-validation (LOOCV), which takes a complicated form and involves considerable computation. Stachniss et al. [6] and Tresp [7] attempted to simplify the learning process with heuristic estimation in the M-step. However, the kernel parameters were predetermined without learning in [7], whereas the parameters were sampled from data-independent distributions in [6]; both learning processes were therefore insufficiently guided by the data.

Recently, hard-cut EM algorithms have also been adopted to improve learning efficiency; they assign each sample to the component with the maximum posterior in the E-step and learn the parameters of each component independently in the M-step [9, 28]. In the same way, Yang and Ma [26] adjusted their soft EM algorithm for MGP with the LOOCV probability decomposition into a hard-cut EM algorithm for MGP, which is referred to here as the LOOCV hard-cut EM algorithm for convenience.

All of these algorithms resort to approximation strategies because the MGP and MSGP models are usually very complex. Recently, Chen et al. [29] refined the MGP model into a simpler form and strictly derived a precise hard-cut EM algorithm for it. In this paper, we develop this approach by adding sparsity constraints, adjusting the refined MGP model into an MSGP model. As a result, parameter learning becomes more efficient, while the hard-cut EM algorithm can still be strictly derived. The experimental results demonstrate that our proposed model and algorithm are feasible and can outperform several typical regression algorithms.

2 The Mixture of Sparse Gaussian Processes

2.1 Gaussian Process (GP)

A GP model for regression is mathematically defined by

$$ \left\{ \begin{aligned} & F = \left[ {\begin{array}{*{20}c} {f_{1} } \\ {f_{2} } \\ \vdots \\ {f_{N} } \\ \end{array} } \right]\sim N\left( {\left[ {\begin{array}{*{20}c} {m(x_{1} )} \\ {m(x_{2} )} \\ \vdots \\ {m(x_{N} )} \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} {K(x_{1} ,x_{1} )} & {K(x_{1} ,x_{2} )} & \cdots & {K(x_{1} ,x_{N} )} \\ {K(x_{2} ,x_{1} )} & {K(x_{2} ,x_{2} )} & \cdots & {K(x_{2} ,x_{N} )} \\ \vdots & \vdots &\ddots & \vdots \\ {K(x_{N} ,x_{1} )} & {K(x_{N} ,x_{2} )} & \cdots & {K(x_{N} ,x_{N} )} \\ \end{array} } \right]} \right) \\ & y_{t} \sim N(f_{t} ,\sigma^{2} ) \\ \end{aligned} \right.,$$
(1)

where \( x_{t} \), \( f_{t} \) and \( y_{t} \) denote the input, latent response and output of a training sample, respectively, K(u, v) is a Mercer kernel function, and \( \sigma^{2} \) denotes the noise intensity. As in most cases, we adopt the zero mean function (m ≡ 0) and the most popular kernel function, the squared exponential (SE) kernel [30]:

$$ K(u,v) = l^{2} \exp \left[ { - \frac{1}{2}\sum\limits_{k = 1}^{d} {b_{k}^{2} (u_{k} - v_{k} )^{2} } } \right], $$
(2)

where d is the dimensionality of the inputs and each dimension has its own weight \( b_{k} \) to realize automatic feature selection.
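
To make the notation concrete, the SE kernel of Eq. (2) can be evaluated as in the following minimal NumPy sketch; the function name se_kernel and its argument names are ours, not from the cited references.

```python
import numpy as np

def se_kernel(U, V, l, b):
    """SE kernel of Eq. (2): K(u, v) = l^2 exp(-0.5 * sum_k b_k^2 (u_k - v_k)^2).

    U: (n, d) array of inputs, V: (m, d) array of inputs,
    l: output scale, b: (d,) per-dimension weights for feature selection.
    Returns the (n, m) kernel matrix [K(u_i, v_j)].
    """
    diff = U[:, None, :] - V[None, :, :]            # (n, m, d) pairwise differences
    sq = np.sum((b ** 2) * diff ** 2, axis=-1)      # weighted squared distances
    return l ** 2 * np.exp(-0.5 * sq)

# example: kernel matrix of 5 random 3-dimensional inputs
X = np.random.randn(5, 3)
K = se_kernel(X, X, l=1.0, b=np.ones(3))
```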

There are several ways of learning the hyper-parameters of a GP model, including maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, the surrogate predictive probability (GPP), and cross validation (CV) [31].

With the estimated hyper-parameters, the predictive distribution of the output at a test input x* can be obtained as follows:

$$ p(y^{*} |x^{*} )\sim N\left[ {K(x^{*} ,X)K(X,X)^{ - 1} y,K(x^{*} ,x^{*} ) - K(x^{*} ,X)K(X,X)^{ - 1} K(X,x^{*} )} \right], $$
(3)

where \( K(x^{*} ,X) = [K(x^{*} ,x_{j} )]_{1 \times N} \), \( K(X,x^{*} ) = [K(x_{i} ,x^{*} )]_{N \times 1} \) and \( K(X,X) = [K(x_{i} ,x_{j} )]_{N \times N} \) are kernel matrices [30].
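
Equation (3) can be implemented directly, as sketched below. The sketch reuses the se_kernel function defined above and, following the noise model of Eq. (1), adds the noise variance \( \sigma^{2} \) to the diagonal of K(X, X), which Eq. (3) leaves implicit; the helper name gp_predict is ours.

```python
import numpy as np

def gp_predict(X, y, Xstar, l, b, noise_var):
    """Predictive mean and variance of Eq. (3) at test inputs Xstar.

    X: (N, d) training inputs, y: (N,) training outputs,
    Xstar: (L, d) test inputs; se_kernel is the sketch from above.
    """
    K = se_kernel(X, X, l, b) + noise_var * np.eye(len(X))   # K(X, X) + sigma^2 I
    Ks = se_kernel(Xstar, X, l, b)                           # K(x*, X)
    Kss = se_kernel(Xstar, Xstar, l, b)                      # K(x*, x*)
    mean = Ks @ np.linalg.solve(K, y)                        # K(x*, X) K(X, X)^{-1} y
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)
```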

2.2 Mixture of Gaussian Processes (MGP)

MGP is a special mixture model in which each component is a Gaussian process, and these components are independent in most existing MGP models. Denote \( z_{t} = c \) iff \( (x_{t} ,y_{t} ) \) belongs to the c-th GP, and denote

$$ X_{c} = [x_{{t_{c} (1)}} ,x_{{t_{c} (2)}} , \cdots ,x_{{t_{c} (N_{c} )}} ]^{T} \;\;\;\;Y_{c} = [y_{{t_{c} (1)}} ,y_{{t_{c} (2)}} , \cdots ,y_{{t_{c} (N_{c} )}} ]^{T} $$

as the inputs and outputs of the c-th GP, respectively, where \( N_{c} \) is the number of samples in the c-th GP. Therefore, the output likelihood is mathematically given by

$$ Y_{c} \sim N\left[ {0,K(X_{c} ,X_{c} |\theta_{c} ) + \sigma_{c}^{2} I} \right];c = 1,2, \cdots ,C, $$
(4)

where the SE kernel given by Eq. (2) is parameterized by \( \theta_{c} = \{ l_{c} ,\sigma_{c} \}\,\cup\,\{ b_{ck} :k = 1,2, \cdots ,d\} \) for the c-th GP.

MGP has two main advantages over a single GP. First, MGP can capture the heterogeneity among different input regions by fitting them with different GPs. For example, the toy dataset used by several MGP models [12, 17, 26, 29] was generated from 4 continuous functions with Gaussian noise, as shown in Fig. 1, so it is better fitted by an MGP with four components than by a single GP. Second, the computational cost can be reduced by dividing the large kernel matrix of a single GP into the smaller kernel matrices of the components.

Fig. 1. Toy dataset

2.3 Sparse Gaussian Process (SGP)

Another good scaling technique for GPs is the sparse Gaussian process model. The main idea of SGP is to approximate the N training samples with M pseudo samples (\( M \ll N \)) [32]. Mathematically, we denote the inputs of the training data, test data and pseudo data by \( X = [x_{1} ,x_{2} , \cdots ,x_{N} ]^{T} \), \( X^{*} = [x_{1}^{*} ,x_{2}^{*} , \cdots ,x_{L}^{*} ]^{T} \) and \( U = [u_{1} ,u_{2} , \cdots ,u_{M} ]^{T} \), respectively, and their corresponding latent responses by \( F = [f_{1} ,f_{2} , \cdots ,f_{N} ]^{T} \), \( F^{*} = [f_{1}^{*} ,f_{2}^{*} , \cdots ,f_{L}^{*} ]^{T} \) and \( V = [v_{1} ,v_{2} , \cdots ,v_{M} ]^{T} \), respectively. Most sparse GP models can be mathematically defined by the following equations [32]:

$$ V\sim N[0,K(U,U)], $$
(5)
$$ p(F,F^{*} |V) = q(F|V)q(F^{*} |V), $$
(6)
$$ y_{t} |f_{t} \sim N(f_{t} ,\sigma^{2} )\;i.i.d., $$
(7)
$$ y_{t}^{*} |f_{t}^{*} \sim N(f_{t}^{*} ,\sigma^{2} )\;i.i.d., $$
(8)

where Eq. (5) means that the latent pseudo outputs V follow a GP prior, as in Eq. (1), and Eq. (6) gives the crucial assumption of SGP, i.e., that the latent responses of the training and test datasets are conditionally independent given V. In Eq. (6), q(F|V) and q(F*|V) are approximations of the following GP predictive distributions

$$ \begin{aligned} & p(F^{*} |V)\sim N\left[ {K(X^{*} ,U)K(U,U)^{ - 1} V,K(X^{*} ,X^{*} ) - K(X^{*} ,U)K(U,U)^{ - 1} K(U,X^{*} )} \right] \\ & p(F|V)\sim N\left[ {K(X,U)K(U,U)^{ - 1} V,K(X,X) - K(X,U)K(U,U)^{ - 1} K(U,X)} \right], \\ \end{aligned} $$
(9)

respectively, and the forms of q(F|V) and q(F*|V) determine the kind of SGP model, including the SoR, DTC, FITC and PITC models [32]. Among these, the Fully Independent Training Conditional (FITC) model proposed in [8] was highly recommended in [32] due to its richer covariance structure. Experimental results have shown a clear advantage of FITC over the other sparse GPs in predictive accuracy, whereas the increase in time consumption is trivial [8]. Therefore, we use FITC as the component model in our proposed mixture of sparse Gaussian processes (MSGP).

In FITC model, the conditional distribution of the latent test outputs is the exact GP predictive distribution:

$$ \begin{aligned} & q(F^{*} |V) = p(F^{*} |V) \\ & \sim N\left[ {K(X^{*} ,U)K(U,U)^{ - 1} V,K(X^{*} ,X^{*} ) - K(X^{*} ,U)K(U,U)^{ - 1} K(U,X^{*} )} \right], \\ \end{aligned} $$
(10)

whereas the latent training outputs are assumed to be conditionally independent given the latent pseudo output V, as suggested by the name “Fully Independent Training Conditional”, i.e.

$$ q(F|V)\sim N\left[ {K(X,U)K(U,U)^{ - 1} V,\Lambda } \right], $$
(11)

where \( \Lambda = diag\left[ {K(X,X) - K(X,U)K(U,U)^{ - 1} K(U,X)} \right] \) is the diagonal matrix that retains only the diagonal entries of the bracketed matrix.

It can be inferred that the marginal likelihood for the outputs is

$$ p(Y)\sim N\left[ {0,K(X,U)K(U,U)^{ - 1} K(U,X) +\Lambda + \sigma^{2} I} \right], $$
(12)

and the predictive distribution can be derived as follows.

$$ \begin{array}{ll} p(Y^{*} |Y)\sim N\left\{ {K(X^{*} ,U)Q^{ - 1} K(U,X)(\Lambda + \sigma^{2} I)^{ - 1} Y,} \right. \\ \left. {K(X^{*} ,X^{*} ) - K(X^{*} ,U)[K(U,U)^{ - 1} - Q^{ - 1} ]K(U,X^{*} ) + \sigma^{2} I} \right\}, \end{array} $$
(13)

where \( Q = K(U,U) + K(U,X)(\Lambda + \sigma^{2} I)^{ - 1} K(X,U) \).
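
For reference, Eqs. (11)–(13) can be written out as in the following sketch, which again assumes the se_kernel function from Sect. 2.1's sketch; the function names are ours, a small jitter is added to K(U, U) for numerical stability, and Cholesky-based implementations would be preferred in practice.

```python
import numpy as np

def fitc_lambda(X, U, l, b, jitter=1e-8):
    """Diagonal matrix Lambda of Eq. (11)."""
    Kxu = se_kernel(X, U, l, b)
    Kuu = se_kernel(U, U, l, b) + jitter * np.eye(len(U))
    q_diag = np.sum(Kxu * np.linalg.solve(Kuu, Kxu.T).T, axis=1)  # diag of K_XU K_UU^{-1} K_UX
    return np.diag(np.diag(se_kernel(X, X, l, b)) - q_diag)

def fitc_log_marginal(X, y, U, l, b, noise_var):
    """Log of the FITC marginal likelihood in Eq. (12)."""
    Kxu = se_kernel(X, U, l, b)
    Kuu = se_kernel(U, U, l, b) + 1e-8 * np.eye(len(U))
    Sigma = (Kxu @ np.linalg.solve(Kuu, Kxu.T)
             + fitc_lambda(X, U, l, b) + noise_var * np.eye(len(X)))
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (logdet + y @ np.linalg.solve(Sigma, y) + len(X) * np.log(2 * np.pi))

def fitc_predict(X, y, U, Xstar, l, b, noise_var):
    """Predictive mean and covariance of Eq. (13)."""
    Kxu = se_kernel(X, U, l, b)
    Kuu = se_kernel(U, U, l, b) + 1e-8 * np.eye(len(U))
    Ksu = se_kernel(Xstar, U, l, b)
    D = fitc_lambda(X, U, l, b) + noise_var * np.eye(len(X))     # Lambda + sigma^2 I
    Q = Kuu + Kxu.T @ np.linalg.solve(D, Kxu)                    # Q of Eq. (13)
    mean = Ksu @ np.linalg.solve(Q, Kxu.T @ np.linalg.solve(D, y))
    cov = (se_kernel(Xstar, Xstar, l, b)
           - Ksu @ (np.linalg.inv(Kuu) - np.linalg.inv(Q)) @ Ksu.T
           + noise_var * np.eye(len(Xstar)))
    return mean, cov
```

The log marginal likelihood above is the objective that is maximized over the pseudo inputs and hyper-parameters when a FITC component is trained.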

The crucial issue in training an SGP is to find appropriate pseudo inputs. One method is to greedily select a fixed number of training inputs one by one according to a certain criterion [33–36]. Snelson and Ghahramani [8] extended the range of pseudo inputs from the training set to the whole Euclidean space and learnt the pseudo inputs and the hyper-parameters together by maximizing the likelihood, which is more flexible than greedy selection and performs better in experiments.

2.4 The FITC Mixture Model

Since the MGP model proposed in [29] is refined and can be trained by a precise learning algorithm, we inherit its general framework for our MSGP model. The only difference is that an SGP is adopted for each component to speed up the training process.

As in [29], we adopt the following gating function and Gaussian inputs.

$$ \Pr \left( {z_{t} = c} \right) = \pi_{c} ;\quad c = 1\sim C,\;\;i.i.d.\;{\text{for}}\;t = 1\sim N, $$
(14)
$$ p(x_{t} |z_{t} = c)\sim N(\mu_{c} ,S_{c} );\quad c = 1\sim C,\;\;i.i.d.\;{\text{for}}\;t = 1\sim N. $$
(15)

Furthermore, for each component we use the FITC model, owing to the advantages discussed in Sect. 2.3, to describe the SGP input–output relation:

$$ p(Y_{c} |X_{c} ,U_{c} ,\theta_{c} )\sim N\left[ {0,K(X_{c} ,U_{c} |\theta_{c} )K(U_{c} ,U_{c} |\theta_{c} )^{ - 1} K(U_{c} ,X_{c} |\theta_{c} ) +\Lambda _{c} + \sigma_{c}^{2} I} \right], $$
(16)

where Uc denotes the pseudo inputs in the c-th component,

$$ \Lambda _{c} = diag\left[ {K(X_{c} ,X_{c} |\theta_{c} ) - K(X_{c} ,U_{c} |\theta_{c} )K(U_{c} ,U_{c} |\theta_{c} )^{ - 1} K(U_{c} ,X_{c} |\theta_{c} )} \right], $$
(17)

and \( X_{c} \), \( Y_{c} \), \( \theta_{c} \) follow the meanings in Sect. 2.2.

Our MSGP model is fully described by Eqs. (14)–(16), and it has fewer parameters than previous MSGP models [5, 6, 9–11] while still retaining the advantages of combining sparse GPs with a mixture model. For this refined model, we strictly derive its training algorithm, called the precise hard-cut EM algorithm.
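
As an illustration of the generative process defined by Eqs. (14)–(16), the following sketch draws one dataset from the model. It assumes the se_kernel and fitc_lambda sketches given earlier, and for simplicity it also samples the pseudo inputs \( U_{c} \) from the input Gaussians, which is our own illustrative choice rather than part of the model.

```python
import numpy as np

def sample_msgp(N, M, pi, mus, Ss, thetas, noise_vars, rng=None):
    """Draw (X, y, z) from the MSGP model of Eqs. (14)-(16).

    pi: (C,) mixing proportions; mus[c], Ss[c]: input Gaussian parameters;
    thetas[c] = (l_c, b_c): SE kernel parameters; noise_vars[c] = sigma_c^2;
    M: number of pseudo inputs per component.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    C = len(pi)
    z = rng.choice(C, size=N, p=pi)                                    # Eq. (14)
    X = np.stack([rng.multivariate_normal(mus[c], Ss[c]) for c in z])  # Eq. (15)
    y = np.zeros(N)
    for c in range(C):
        idx = np.where(z == c)[0]
        if len(idx) == 0:
            continue
        l, b = thetas[c]
        Uc = rng.multivariate_normal(mus[c], Ss[c], size=M)            # illustrative pseudo inputs
        Kxu = se_kernel(X[idx], Uc, l, b)
        Kuu = se_kernel(Uc, Uc, l, b) + 1e-8 * np.eye(M)
        cov = (Kxu @ np.linalg.solve(Kuu, Kxu.T)
               + fitc_lambda(X[idx], Uc, l, b)
               + noise_vars[c] * np.eye(len(idx)))                     # covariance of Eq. (16)
        y[idx] = rng.multivariate_normal(np.zeros(len(idx)), cov)      # Eq. (16)
    return X, y, z
```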

3 The Hard-Cut EM Algorithm for the FITC Mixtures

Since the outputs of both the MGP and MSGP models are dependent, the computational complexity of the Q function is exponential, involving a summation over all label configurations. To avoid such an expensive computation, a hard-cut EM algorithm can be adopted: the training samples are partitioned into the components with the maximum posterior in the E-step, so that the parameters of each component can be learnt separately via MLE in the M-step. Because our model is developed from the MGP model in [29], where the precise hard-cut EM algorithm was strictly derived and performed quite well in both prediction accuracy and learning efficiency, we adapt its general learning framework to train the FITC mixture.

It can easily be derived that the marginal output likelihood of each sample remains exactly the same as in [29], i.e.

$$ p(y_{t} |x_{t} ,z_{t} = c) = N\left[ {\left. {y_{t} } \right|0,K(x_{t} ,x_{t} |\theta_{c} ) + \sigma_{c}^{2} } \right] = N(y_{t} |0,l_{c}^{2} + \sigma_{c}^{2} ), $$
(18)

so the same E-step posterior as in [29] can be derived using Bayes' formula:

$$ \Pr (z_{t} = c|x_{t} ,y_{t} ) = \frac{{\pi_{c} N(x_{t} |\mu_{c} ,S_{c} )N(y_{t} |l_{c}^{2} + \sigma_{c}^{2} )}}{{\sum\limits_{k = 1}^{C} {\pi_{k} N(x_{t} |\mu_{k} ,S_{k} )N(y_{t} |l_{k}^{2} + \sigma_{k}^{2} )} }}. $$
(19)
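
The E-step posterior of Eq. (19) is a simple gating computation; a minimal sketch for a single sample is given below. The function name is ours, SciPy densities are used, and in practice the products would be computed in log-space for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def responsibilities(x, y, pi, mus, Ss, ls, sigmas):
    """Posterior Pr(z_t = c | x_t, y_t) of Eq. (19) for one sample (x, y).

    pi: (C,) mixing proportions; mus[c], Ss[c]: input Gaussian parameters;
    ls[c], sigmas[c]: kernel output scale l_c and noise standard deviation sigma_c.
    """
    C = len(pi)
    p = np.array([
        pi[c]
        * multivariate_normal.pdf(x, mean=mus[c], cov=Ss[c])                # N(x_t | mu_c, S_c)
        * norm.pdf(y, loc=0.0, scale=np.sqrt(ls[c] ** 2 + sigmas[c] ** 2))  # N(y_t | 0, l_c^2 + sigma_c^2), Eq. (18)
        for c in range(C)
    ])
    return p / p.sum()
```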

In the M-step, however, the GP likelihood used in [29] is replaced by the FITC likelihood given by Eq. (16), which requires less computation, and pseudo inputs are introduced into the learning of the hyper-parameters. With this adjustment of [29], we obtain the procedure of our proposed precise hard-cut EM algorithm as follows; a minimal code sketch of one iteration is given after the procedure.

  • Step 1 (Initialize the partition of the training data). Divide the vectors \( \left\{ {[x_{t}^{T} ,y_{t} ]} \right\}_{t = 1}^{N} \) into C clusters, and set \( z_{t} \) to the index of the cluster that contains the t-th sample.

  • Step 2 (M-step). Learn the parameters of each component by MLE:

    $$ \pi_{c} \leftarrow \frac{{N_{c} }}{N},\quad \mu_{c} \leftarrow \frac{1}{{N_{c} }}\sum\limits_{t = 1}^{N} {I(z_{t} = c)x_{t} } ,\quad S_{c} \leftarrow \frac{1}{{N_{c} }}\sum\limits_{t = 1}^{N} {I(z_{t} = c)(x_{t} - \mu_{c} )(x_{t} - \mu_{c} )^{T} } , $$
    (20)
    $$ (\hat{U}_{c} ,\hat{\theta }_{c} ) = \mathop {\arg \hbox{max} }\limits_{{U_{c} ,\theta_{c} }} \ln p(Y_{c} |X_{c} ,U_{c} ,\theta_{c} ), $$
    (21)

    where \( N_{c} \) is the number of samples in the c-th component; Eq. (20) is the same as the parameter update of a Gaussian mixture model, and Eq. (21) follows [8].

  • Step 3 (E-step). Repartition the samples into their components according to the maximum a posteriori criterion:

    $$ z_{t} \leftarrow \mathop {\arg \hbox{max} }\limits_{c} \;\Pr (z_{t} = c|x_{t} ,y_{t} ) = \mathop {\arg \hbox{max} }\limits_{c} [\pi_{c} N(x_{t} |\mu_{c} ,S_{c} )N(y_{t} |l_{c}^{2} + \sigma_{c}^{2} )] $$
    (22)
  • Step 4. Stop if the number of changed labels falls below η% of the number of training samples, where 0 ≤ η ≤ 10 is a threshold; otherwise, return to Step 2.
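
The following skeleton sketches one iteration of Steps 2–3 under our notation. Equation (20) is written out in closed form, while the FITC optimization of Eq. (21) is only indicated by a comment, since it requires a numerical maximization of the log likelihood of Eq. (12) (e.g., of the fitc_log_marginal sketch in Sect. 2.3 with a gradient-based optimizer); the parameter container and function names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def hard_cut_em_iteration(X, y, z, C, params):
    """One M-step (Eq. (20)) followed by one hard E-step (Eq. (22)).

    params is a dict with per-component arrays: 'pi', 'mu', 'S', 'l', 'sigma'
    (and, for the FITC part, 'U' and 'b', which Eq. (21) would update).
    Returns the new hard labels z.
    """
    N, d = X.shape
    for c in range(C):
        idx = np.where(z == c)[0]
        if len(idx) == 0:
            continue
        # M-step, Eq. (20): mixing proportion and input Gaussian of component c
        params['pi'][c] = len(idx) / N
        params['mu'][c] = X[idx].mean(axis=0)
        diff = X[idx] - params['mu'][c]
        params['S'][c] = diff.T @ diff / len(idx) + 1e-6 * np.eye(d)  # small ridge for stability
        # M-step, Eq. (21): maximize the FITC log likelihood of Eq. (12)
        # over (U_c, theta_c) with a numerical optimizer -- omitted in this sketch.
    # Hard E-step, Eq. (22): reassign each sample to its MAP component
    scores = np.zeros((N, C))
    for c in range(C):
        scores[:, c] = (params['pi'][c]
                        * multivariate_normal.pdf(X, mean=params['mu'][c], cov=params['S'][c])
                        * norm.pdf(y, loc=0.0,
                                   scale=np.sqrt(params['l'][c] ** 2 + params['sigma'][c] ** 2)))
    return np.argmax(scores, axis=1)
```

Iterating this function, with the FITC update of Eq. (21) filled in, and stopping once fewer than η% of the labels change reproduces Steps 2–4.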

After the learning process above, we have obtained the estimated parameters and the partition of the training data. Then, for a test input x*, we classify it into the z*-th component of the MSGP by maximizing the posterior as in [29]:

$$ z^{*} \leftarrow \mathop {\arg \hbox{max} }\limits_{c} \;\Pr (z^{*} = c|D,x^{*} ) = \mathop {\arg \hbox{max} }\limits_{c} [\pi_{c} N(x^{*} |\mu_{c} ,S_{c} )] $$
(23)

According to this classification, we predict the output at the test input via the z*-th component using Eq. (13).
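
Prediction for a test input therefore combines Eq. (23) with the FITC predictive distribution of Eq. (13); a minimal sketch, assuming the fitc_predict function from Sect. 2.3 and the params container above, is:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_test_input(xstar, params, data_by_component):
    """Route xstar to a component by Eq. (23), then predict by Eq. (13).

    data_by_component[c] = (X_c, y_c): training samples assigned to
    component c by the hard-cut EM algorithm.
    """
    C = len(params['pi'])
    gate = [params['pi'][c] * multivariate_normal.pdf(xstar, mean=params['mu'][c], cov=params['S'][c])
            for c in range(C)]
    zstar = int(np.argmax(gate))                                     # Eq. (23)
    Xc, yc = data_by_component[zstar]
    mean, cov = fitc_predict(Xc, yc, params['U'][zstar], xstar[None, :],
                             params['l'][zstar], params['b'][zstar],
                             params['sigma'][zstar] ** 2)            # Eq. (13)
    return mean[0], cov[0, 0]
```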

4 Experimental Results

4.1 On a Small Synthetic Dataset from the MGP Model

To evaluate the validity and feasibility of the FITC mixture model and its hard-cut EM algorithm, we generate a typical synthetic dataset from an MGP model with 3 components. There are 500 training samples and 100 test samples in each component, and each input has 3 features. We then compare our algorithm with the following typical baselines for regression:

  1. The FITC model and its MLE algorithm [8].

  2. The MGP model and its precise hard-cut EM algorithm [29].

  3. The MGP model and its LOOCV hard-cut EM algorithm [26].

  4. The mixture of sparse GPs model and its variational hard-cut EM algorithm [9].

  5. Linear regression [37].

  6. SVM regression with a Gaussian kernel [38].

Each of these algorithms is run on this synthetic dataset five times, on an Intel(R) Core(TM) i5 CPU with 4.00 GB of RAM, running Matlab R2014b code through Secure CRT. For each algorithm, the mean and standard deviation of the root mean squared error (RMSE) of prediction and of the training time are listed in Table 1. In addition, to make the predictions of FITC and its mixture model more accurate, we initialize the kernel parameters by training a GP model on 20 randomly selected training samples before the MLE learning process, as in [8].

Table 1. The mean and standard deviation of the predictive RMSEs and the training times of each algorithm on the typical synthetic dataset from the MGP model

It can be seen from Table 1 that, on this synthetic dataset, our proposed algorithm is much more accurate than the single FITC model, since the dataset is multimodal. Owing to the sparsity mechanism, our algorithm is much more efficient than the two EM algorithms for MGP models, while the loss of accuracy is trivial compared with the gain in speed. The variational hard-cut EM algorithm makes a coarse conditional-independence assumption on the posterior and is thus not as accurate as our proposed algorithm; it also takes longer to train. Traditional regression methods such as linear regression and SVM regression are too coarse for such a non-linear synthetic dataset.

Furthermore, our proposed algorithm correctly partitions all the training and test samples into their components in every run, which, together with the results in Table 1, shows that our algorithm is feasible and effective.

4.2 On a Large Synthetic Dataset from the FITC Mixture Model

Moreover, to test the advantage of the FITC mixture model and its hard-cut EM algorithm on a larger scale, we generate a typical synthetic dataset from a FITC mixture model with 100 components. Each component has 200 training samples and 200 test samples with 1-dimensional inputs. We then compare our algorithm with the baselines above; the computational environment and experimental details remain the same as before.

On such a large dataset (20000 training samples), some algorithms are prohibitively slow and fail to converge, namely the SVM, the precise hard-cut EM algorithm and the LOOCV hard-cut EM algorithm for MGP, so we omit them in this experiment. This also indicates that our proposed hard-cut EM algorithm is more efficient than the precise hard-cut EM algorithm of the MGP model, owing to the sparsity mechanism.

All the other algorithms are run on this synthetic dataset five times, and the mean and standard deviation of their predictive RMSEs and training times are listed in Table 2. Similarly, for the FITC model and our proposed FITC mixture model, we initialize the kernel parameters by training a GP model on 10 randomly selected training samples before the MLE learning process, as in [8].

Table 2. The mean and standard deviation of the predictive RMSEs and the training times of each algorithm on the typical synthetic dataset from the FITC mixture model

Table 2 indicates that, on this synthetic dataset with many more components, our proposed FITC mixture model has an even greater advantage over the FITC model in both predictive precision and learning efficiency. The conditional-independence assumption of the variational hard-cut EM algorithm, as well as the linearity assumption of linear regression, is unsuitable for such a highly non-linear dataset with dependent samples; therefore, these two algorithms have large prediction errors on this dataset.

4.3 On Kin40k Dataset

Finally, we compare these algorithms on a popular real dataset called kin40k, which is generated by a robot-arm simulator and contains 10000 training samples, 30000 test samples and 9 attributes [34]. The computational environment and implementation details remain the same as above. The mean and standard deviation of the predictive RMSEs and training times for each algorithm are listed in Table 3. Similarly, for the FITC and FITC mixture models, we initialize the kernel parameters by training a GP model on 500 randomly selected training samples before the MLE learning process.

Table 3. The mean and standard deviation of the predictive RMSEs and the training times of each algorithm on the kin40k dataset

As before, on this large dataset, the SVM, the precise hard-cut EM algorithm and the LOOCV hard-cut EM algorithm for MGP are prohibitively slow and fail to converge.

From Table 3, we observe that on the kin40k dataset our proposed algorithm is more precise than the other algorithms because it requires fewer approximations. A possible reason why our algorithm consumes so much time is that, on a real dataset that is not generated from a GP or MGP model, it needs many more iterations. A further improvement could therefore be made by preprocessing a real dataset so that it resembles a dataset simulated from an MGP. Linear regression is still rather coarse since the kin40k dataset is highly non-linear [34].

In general, our algorithm is practical on the real dataset, and its computational cost is acceptable with 10000 training samples.

5 Conclusion

We have refined the MSGP model and developed a hard-cut EM algorithm for its parameter learning. The general framework is adapted from [29], while the learning efficiency is significantly improved by the sparsity mechanism. The experimental results on both synthetic and real datasets demonstrate that the developed hard-cut EM algorithm for the MSGP model is precise and efficient, and generally outperforms conventional linear regression, SVM regression and several other learning algorithms for GP, MGP and MSGP models.