Abstract
The mixture of Gaussian processes (MGP) is a powerful generative model that is widely used in machine learning and data mining. However, when we learn this generative model on a given dataset, we must set the probability density function (pdf) of the input in advance, and it is generally set as a Gaussian distribution. For some actual data such as time series, however, this assumption is neither reasonable nor effective. In this paper, we propose a specialized pdf for the input of the MGP model: a piecewise-defined continuous function whose middle part takes the form of a uniform distribution, while the two side parts take the form of Gaussian distributions. This specialized pdf matches uniformly distributed inputs better than a Gaussian pdf, while its two Gaussian tails ensure that the iterations of the hard-cut EM algorithm for MGPs remain effective. Experiments on simulated and stock datasets demonstrate that the MGP model with these specialized pdfs leads to better time series prediction than general MGP models as well as other classical regression methods.
Keywords
- Gaussian distribution
- Mixture of Gaussian processes
- Hard-cut EM algorithm
- Probability density function
- Time series prediction
1 Introduction
The Gaussian process (GP) is a powerful model widely used in machine learning and data mining [1,2,3]. However, it has two main limitations. First, it cannot fit multi-modal datasets well, because the GP model employs a global scale parameter [4]. Second, its parameter learning consumes \(O(N^{3})\) computational time [5, 6], where N is the number of training samples. To overcome these difficulties, Tresp [4] proposed the mixture of Gaussian processes (MGP) in 2000, developed from the mixture of experts. Since then, many kinds of MGP models have been proposed, which can be classified into two main forms: generative models [7,8,9,10] and conditional models [4, 6, 11,12,13]. In comparison with the conditional model, the generative model has two main advantages: (1) missing features can easily be inferred from the outputs; (2) the influence of the inputs on the outputs is clearer [8]. Therefore, many scholars have studied the generative model [14,15,16,17,18,19,20].
However, when we learn the generative model on a given dataset, we must set the probability density function (pdf) of the input in advance. In general, it is set as a Gaussian distribution [14,15,16,17,18,19,20]. For some actual data such as time series, however, this assumption is neither reasonable nor effective. When we learn the MGP model on such data, we usually need to utilize the ARMA model [14,15,16,17,18,19,20,21] to transform the data and then feed the transformed data to the MGP model. However, this transformation can destroy the correlation among samples, which is very important for the MGP model. Figure 1 shows the eLoad data [14], in which the samples in three different colors (blue, black, and red) represent three temporally sequential segments. Figure 2 shows the transformed eLoad data, in which the three temporally sequential segments are mixed together and cannot be classified effectively. In this paper, we propose a specialized pdf for the input of the MGP model to solve this problem. As shown in Fig. 3, this pdf consists of three parts: the left and right side parts are Gaussian distributions, while the middle part is a uniform distribution. For training the MGP model, we use the hard-cut EM algorithm [17] as the basic learning framework for parameter estimation; in fact, the hard-cut EM algorithm obtains better results than several popular learning algorithms.
The rest of the paper is organized as follows. Section 2 introduces the GP and MGP models. We describe the specialized probability density function in Sect. 3. We further propose the learning algorithm for the MGP model of the specialized pdfs in Sect. 4. The experimental results are contained in Sect. 5. Finally, we make a brief conclusion in Sect. 6.
2 GP and MGP Models
2.1 GP Model
We mathematically define the GP model as follows:
$$\begin{aligned} \varvec{Y} \sim \mathcal {N}(\varvec{m}(X),\, \varvec{K}(X,X)) \end{aligned}$$(1)
where D = {X, Y} = {(\({\varvec{x}}_{i}\), \(y_{i}\)): i = 1, 2, ..., N}, \({\varvec{x}}_{i}\) denotes a d-dimensional input vector, and \(y_{i}\) is the corresponding output. m(X) and K(X,X) denote the mean vector and the covariance matrix, respectively. Without loss of generality, we assume m(X) = 0. There are many choices for the covariance function, such as the linear, Gaussian noise, and squared exponential functions. Here, we adopt the squared exponential (SE) covariance function [10]:
$$\begin{aligned} K({\varvec{x}}_{i},{\varvec{x}}_{j}) = \sigma _{f}^{2}\exp \left( -\frac{\Vert {\varvec{x}}_{i}-{\varvec{x}}_{j}\Vert ^{2}}{2\sigma _{l}^{2}}\right) + \sigma _{n}^{2}\delta _{ij} \end{aligned}$$(2)
where \(\varvec{\theta }\) = {\(\sigma _{f}^2\), \(\sigma _{l}^2\), \(\sigma _{n}^2\)} denotes the hyper-parameter vector and \(\delta _{ij}\) is the Kronecker delta. On the given sample dataset D, the log-likelihood function can be expressed as follows:
$$\begin{aligned} \log p(\varvec{Y}|X,\varvec{\theta }) = -\frac{1}{2}\varvec{Y}^{\mathrm {T}}\varvec{K}(X,X)^{-1}\varvec{Y} - \frac{1}{2}\log |\varvec{K}(X,X)| - \frac{N}{2}\log (2\pi ) \end{aligned}$$(3)
In order to obtain the estimate of the hyper-parameters \(\varvec{\theta }\), we perform the maximum likelihood estimation (MLE) procedure [10], that is, we get
$$\begin{aligned} \hat{\varvec{\theta }} = \mathop {\arg \max }\limits _{\varvec{\theta }}\ \log p(\varvec{Y}|X,\varvec{\theta }) \end{aligned}$$(4)
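As a concrete illustration of this MLE procedure, the SE covariance and the negative log-likelihood can be minimized numerically. This is a sketch with hypothetical function names, not the authors' implementation; the hyper-parameters are log-parameterized to keep them positive.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(X1, X2, sigma_f2, sigma_l2):
    # SE part of the covariance; the noise term is added on the diagonal below.
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma_f2 * np.exp(-d2 / (2.0 * sigma_l2))

def gp_neg_log_likelihood(log_theta, X, y):
    # Negative log-likelihood of the GP, computed via a Cholesky factorization.
    sf2, sl2, sn2 = np.exp(log_theta)
    N = len(y)
    K = se_kernel(X, X, sf2, sl2) + sn2 * np.eye(N)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * N * np.log(2.0 * np.pi)

def fit_gp(X, y):
    # MLE of (sigma_f^2, sigma_l^2, sigma_n^2) by numerical optimization.
    res = minimize(gp_neg_log_likelihood, np.zeros(3), args=(X, y),
                   method="L-BFGS-B", bounds=[(-8.0, 8.0)] * 3)
    return np.exp(res.x)
```

The bounds on the log-parameters are a practical safeguard against numerical overflow during the line search, not part of the model.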
2.2 MGP Model
Let C and N denote the number of GP components and the number of training samples in the MGP model, respectively. On the basis of the GP model, we define the MGP model by the following steps:
Step 1. Partition the samples among the GP components according to a multinomial distribution:
$$\begin{aligned} p(z_{n}=c) = \pi _{c}, \quad \sum \limits _{c=1}^{C}\pi _{c}=1 \end{aligned}$$(5)
where c = 1, ..., C, n = 1, ..., N, and \(z_{n}\) denotes the component indicator of the n-th sample.
Step 2. Accordingly, each input \({\varvec{x}}_{n}\) follows the distribution:
$$\begin{aligned} {\varvec{x}}_{n}\,|\,z_{n}=c \ \sim \ p({\varvec{x}}|\varvec{\psi }_{c}) \end{aligned}$$(6)
where {\(\varvec{\psi }_{c}: c=1, ..., C \)} is the parameter set. In general, p(\({\varvec{x}}|\varvec{\psi }_{c}\)) is a Gaussian distribution.
Step 3. Denote \({\varvec{I}}_{c}\) = {\(n \vert z_{n}=c\)}, \({\varvec{X}}_{c}\) = {\({\varvec{x}}_{n} \vert z_{n}=c\)}, \({{\varvec{Y}}_{c}}=\{ y_{n} \vert z_{n}=c \}\) (c = 1, ..., C, n = 1, ..., N) as the sample indexes, inputs, and outputs of the training samples in the c-th component, respectively. Given \({\varvec{X}}_{c}\), the corresponding c-th GP component can be mathematically defined as follows:
$$\begin{aligned} {\varvec{Y}}_{c}\,|\,{\varvec{X}}_{c} \ \sim \ \mathcal {N}(\varvec{0},\, \varvec{K}({\varvec{X}}_{c},{\varvec{X}}_{c})) \end{aligned}$$(7)
where K(\({\varvec{X}}_{c}\),\({\varvec{X}}_{c}\)) is given by Eq.(2) with the hyper-parameter \({\varvec{\theta }}_{c} =\{\sigma _{fc}^2, \sigma _{lc}^2,\sigma _{nc}^2\}\).
Based on Eqs. (5), (6) and (7), we mathematically define the MGP model. The log-likelihood function is derived as follows:
where \(\varvec{\varTheta }=\{\varvec{\theta }_{c}:c=1, ..., C \}\) and \(\varvec{\varPsi }= \{ \varvec{\psi }_{c}, {\pi }_{c}: c =1, ..., C \}\) denote the hyper-parameters and parameters of the MGP model, respectively.
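One form of this log-likelihood consistent with Eqs. (5)–(7) and with the complete-data version used in Sect. 4 is the following sketch, where \(\varvec{Z}\) collects the indicators \(z_{n}\) (the exact expression in the original equation may differ):

```latex
\log p(\varvec{Z}, X, \varvec{Y} \,|\, \varvec{\varTheta }, \varvec{\varPsi })
  = \sum _{c=1}^{C} \Big [ \sum _{n \in \varvec{I}_{c}}
      \big ( \log \pi _{c} + \log p(\varvec{x}_{n} \,|\, \varvec{\psi }_{c}) \big )
    + \log \mathcal {N}\big ( \varvec{Y}_{c} \,|\, \varvec{0},\,
        \varvec{K}(\varvec{X}_{c}, \varvec{X}_{c}) \big ) \Big ]
```

Each component contributes a mixing term, an input-density term, and a GP marginal term over its assigned outputs.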
3 Specialized Input Distribution and Its Learning Algorithm
For many real-world datasets, such as those in the UCI machine learning repository, a Gaussian distribution is not appropriate for the input. To solve this problem, we propose a specialized distribution for this situation.
3.1 Specialized PDF
This specialized distribution is a piecewise-defined continuous function consisting of three parts: the middle part is a uniform density, while both side parts are Gaussian densities, as shown in Fig. 3. We mathematically define the specialized distribution as follows:
where we redefine \(\varvec{\psi }\)\(=\){\(\lambda ,\lambda _{1},\lambda _ {2},\tau _1,\tau _2\),\(\varvec{a}\),\(\varvec{b}\)} as the parameter vector.
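Reading Fig. 3 together with the parameter vector \(\varvec{\psi }\), one continuous piecewise form is the following sketch (the precise roles assigned to \(\lambda _1\) and \(\lambda _2\) here are our assumption):

```latex
p(x \,|\, \varvec{\psi }) =
\begin{cases}
\lambda _{1}\, \mathcal {N}(x \,|\, a, \tau _{1}^{2}), & x < a, \\[2pt]
\lambda , & a \le x \le b, \\[2pt]
\lambda _{2}\, \mathcal {N}(x \,|\, b, \tau _{2}^{2}), & x > b.
\end{cases}
```

Under this reading, the middle mass is \(\lambda (b-a)\), the two tail masses are \(\lambda _1/2\) and \(\lambda _2/2\), and continuity at \(a\) and \(b\) forces \(\lambda _i/(\sqrt{2\pi }\,\tau _i)=\lambda \).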
3.2 Learning Algorithm for the Specialized PDF
In order to learn \(\varvec{\psi }\), we require that the input interval (\(\varvec{a}\), \(\varvec{b}\)) contain the samples with probability \(p_0\). Denote X and N as the training sample set and the number of training samples, respectively. We summarize the algorithm in the following steps:
Step 1. Learn a, b, and \(\lambda \):
where p(x < \(X_{\frac{N(1-p_{0})}{2}}\) \(| x \in \) \({\varvec{X}}\)) = \(\frac{(1-p_{0})}{2}\). In order to reduce the effect of misclassified (or outlier) points on the middle part, we estimate \(\varvec{a}\) and \(\varvec{b}\) as in Eq. (10).
Step 2. Estimate \(\lambda _{1}\), \(\lambda _{2}\), \(\tau _{1}\) and \(\tau _{2}\).
Denote \(p_{1}\) and \(p_{2}\) as the sample proportions in the left and right tails, respectively. The probability density function is continuous and integrable, and its integral is equal to 1. In other words:
According to the continuity of the probability density function, we only need to do some simple calculations to get {\(\lambda _{1},\lambda _ {2},\tau _{1},\tau _{2}\)}:
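Under one concrete reading of this pdf — a uniform density \(\lambda \) on (a, b) holding mass \(p_0\), with half-Gaussian tails of masses \(\lambda _1/2\) and \(\lambda _2/2\) whose peaks match \(\lambda \) at a and b (our assumption) — Steps 1–2 can be sketched as:

```python
import numpy as np

def fit_specialized_pdf(x, p0=0.9):
    # Step 1: a, b from empirical quantiles so that (a, b) holds mass p0,
    # which also limits the influence of outliers on the uniform middle part.
    x = np.sort(np.asarray(x, dtype=float))
    p1 = p2 = (1.0 - p0) / 2.0            # tail sample proportions
    a, b = np.quantile(x, p1), np.quantile(x, 1.0 - p2)
    lam = p0 / (b - a)                    # uniform height: middle mass = p0
    # Step 2: tail weights and widths from the unit-integral and continuity
    # constraints (each half-Gaussian tail carries mass lam_i / 2).
    lam1, lam2 = 2.0 * p1, 2.0 * p2
    tau1 = lam1 / (np.sqrt(2.0 * np.pi) * lam)
    tau2 = lam2 / (np.sqrt(2.0 * np.pi) * lam)
    return {"a": a, "b": b, "lam": lam, "lam1": lam1,
            "lam2": lam2, "tau1": tau1, "tau2": tau2}

def specialized_pdf(x, p):
    # Evaluate the piecewise density: Gaussian tails, uniform middle.
    x = np.asarray(x, dtype=float)
    left = p["lam1"] * np.exp(-(x - p["a"]) ** 2 / (2.0 * p["tau1"] ** 2)) \
           / (np.sqrt(2.0 * np.pi) * p["tau1"])
    right = p["lam2"] * np.exp(-(x - p["b"]) ** 2 / (2.0 * p["tau2"] ** 2)) \
            / (np.sqrt(2.0 * np.pi) * p["tau2"])
    return np.where(x < p["a"], left, np.where(x > p["b"], right, p["lam"]))
```

By construction the total mass is \(p_1 + p_0 + p_2 = 1\) and the density is continuous at a and b.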
4 The MGP Model of the Specialized PDFs and Its Learning Algorithm
We now consider the MGP model with these specialized pdfs. For the parameter learning of the MGP model, there are three main kinds of learning algorithms: MCMC methods [22, 23], variational Bayesian inference [24, 25], and the EM algorithm [5, 9, 11]. However, MCMC methods and variational Bayesian inference have their own limitations: the time complexity of MCMC methods is very high, and variational Bayesian inference may deviate considerably from the true objective function. The EM algorithm is an important and effective iterative algorithm for maximum likelihood or maximum a posteriori (MAP) estimation of the parameters of a mixture model. However, for such a complex MGP model, the posteriors of the latent variables and the Q-function are rather complicated. To overcome this difficulty, we implement the hard-cut EM algorithm [17], which makes certain approximations in the E-step, to learn the parameters.
Let \({\varvec{z}}_{nc}\) denote the latent variables, where \({\varvec{z}}_{nc}\) = 1 if the sample (\({\varvec{x}}_{n}\), \(y_{n}\)) belongs to the c-th GP component, and \({\varvec{z}}_{nc}\) = 0 otherwise. Therefore, we can obtain the log-likelihood function of the complete data from Eq. (8) as follows:
The main idea of the hard-cut EM algorithm can be expressed as the following steps:
E-step. Assign each sample to the corresponding GP component according to the maximum a posteriori (MAP) criterion:
$$\begin{aligned} \widehat{k}_{n} = \mathop {\arg \max }\limits _{c}\ \big \{\pi _{c}\, p({\varvec{x}}_{n}|\varvec{\psi }_{c})\, p(y_{n}|{\varvec{x}}_{n},\varvec{\theta }_{c})\big \} \end{aligned}$$(14)
that is, the latent variable \({\varvec{z}}_{\widehat{k}_{n}n}\) = 1.
M-step. With the known partition, we can estimate the parameters \(\varvec{\varPsi }\) and hyper-parameters \(\varvec{\varTheta }\) via the MLE procedure:
- (1) For learning the parameters \(\{\varvec{\psi }_{c}\}_{c}\), we perform the learning algorithm described in Sect. 3.
- (2) For estimating the hyper-parameters \(\varvec{\varTheta }\), we perform the MLE procedure on each c-th component to estimate \(\varvec{\theta }_{c}\), as shown in Eq. (4).
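The E-step/M-step loop can be sketched as follows. To keep the example short, this toy version (our simplification, not the paper's algorithm) uses a Gaussian input density per component, a Gaussian stand-in for the GP output term, and moment estimates in the M-step instead of the full MLE:

```python
import numpy as np

def hard_cut_em(x, y, C, iters=20, seed=0):
    # Toy hard-cut EM sketch for 1-d inputs and outputs.
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    # Simple initialization: partition the samples into C bins by input rank.
    z = np.argsort(np.argsort(x)) * C // len(x)
    for _ in range(iters):
        # M-step: per-component moment estimates (stand-in for the full MLE).
        params = []
        for c in range(C):
            idx = z == c
            if idx.sum() < 2:  # re-seed an emptied component with random samples
                idx = rng.random(len(x)) < 2.0 / len(x)
            params.append((idx.mean(),
                           x[idx].mean(), x[idx].std() + 1e-6,
                           y[idx].mean(), y[idx].std() + 1e-6))
        # E-step: hard MAP assignment of every sample (0/1 latent labels).
        logp = np.stack([np.log(pi + 1e-12)
                         - 0.5 * ((x - mx) / sx) ** 2 - np.log(sx)
                         - 0.5 * ((y - my) / sy) ** 2 - np.log(sy)
                         for pi, mx, sx, my, sy in params])
        z = np.argmax(logp, axis=0)
    return z, params
```

Because each sample is assigned to exactly one component, both steps stay cheap, which is the practical appeal of the hard-cut approximation.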
5 Experimental Results
In order to test the accuracy and effectiveness of the specialized pdf for the MGP model, we carry out several experiments on simulated and stock datasets. We employ the root mean squared error (RMSE) to measure the prediction accuracy, which is defined as follows:
$$\begin{aligned} \mathrm {RMSE} = \sqrt{\frac{1}{N}\sum \limits _{n=1}^{N}(\hat{y}_{n}-y_{n})^{2}} \end{aligned}$$(15)
where \(\hat{y}_{n}\) and \(y_{n}\) denote the predicted value and the true value, respectively. We also compare our algorithm with some classical machine learning methods, denoted as kernel, RBF, and SVM in the tables, and denote by ‘OURS’ our proposed model trained with the hard-cut EM algorithm.
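The RMSE criterion can be computed with a small helper (illustrative; the function name is ours):

```python
import numpy as np

def rmse(y_pred, y_true):
    # Root mean squared error over the N test outputs.
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```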
5.1 Simulation Experiments
In the simulation experiments, we generate three groups of synthetic datasets from MGP models containing 4, 6, and 8 GP components, respectively. The numbers of samples in the three groups are 2600, 3900, and 5000, respectively. Each group contains three datasets, which are identical except for the degree of overlap. Figure 4 shows the dataset with the smallest degree of overlap among the 4 GP components. On each dataset, we run each algorithm 100 times, randomly splitting the samples so that 1/3 are training samples and the other 2/3 are test samples. The RMSE of each algorithm is listed in Table 1, from which we can see that our proposed algorithm obtains better results. Figure 5 shows the specialized pdfs on the first-group dataset with the smallest overlapping degree. We can see that the specialized pdf decays at both ends of the data in the form of a Gaussian distribution, while over the middle of the data it is a uniform distribution. This shape is more consistent with uniformly distributed inputs than a Gaussian distribution. Meanwhile, the Gaussian decay at both ends ensures the effectiveness of the iterations of the hard-cut EM algorithm, so that the class labels of the samples can be updated according to the MAP criterion in each iteration. If we applied a uniform distribution only, the iterative steps of the hard-cut EM algorithm would be invalid, since samples falling outside a component's support would have zero density.
5.2 Prediction on Stock Data
In this section, we use the closing price data of three stocks from the Shanghai Stock Exchange, whose IDs are 300015, 002643, and 601058, respectively.
From Eq. (10), we know that the specialized pdf is closely related to the length of the middle interval. In order to examine the effect of different input lengths on the prediction accuracy of the algorithm, we apply some transformations to the input. Since the range of the outputs is too large, we use a linear function to compress the outputs into the same range as the synthetic data. In summary, we transform the datasets as follows:
- (i) Transform the input as follows:
$$\begin{aligned} {X}_{n} = \frac{n}{\delta } \end{aligned}$$(16)
where n = 1, ..., N, N is the number of samples, and \({\delta } \in \{101, 51, 23, 11, 7, 3, 1\}\).
- (ii) Compress the output linearly into the interval [−4.5, 4.5]:
$$\begin{aligned} \tilde{y} = \frac{9y}{M-m} - \frac{4.5(M+m)}{M-m} \end{aligned}$$(17)
where M and m denote the maximum value and the minimum value of the stock, respectively, so that y = m maps to −4.5 and y = M maps to 4.5.
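The two transformations above can be written directly (a sketch; the function names are ours, and the output map is the linear compression of [m, M] onto [−4.5, 4.5]):

```python
import numpy as np

def transform_inputs(N, delta):
    # Eq. (16): the n-th input becomes n / delta, n = 1, ..., N.
    return np.arange(1, N + 1) / delta

def transform_outputs(y):
    # Eq. (17): linear compression of the outputs onto [-4.5, 4.5],
    # mapping the minimum m to -4.5 and the maximum M to 4.5.
    y = np.asarray(y, dtype=float)
    M, m = y.max(), y.min()
    return 9.0 * y / (M - m) - 4.5 * (M + m) / (M - m)
```

Smaller values of \(\delta \) stretch the inputs over a longer interval, which is exactly the effect the experiments vary.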
Through the above transformations, each stock produces 7 datasets. On each of the 7 datasets of the three stocks, we repeat each regression algorithm 100 times, randomly extracting 1/3 of the samples for training and the other 2/3 for testing. The RMSE of each algorithm on the three transformed stock datasets is listed in Table 2, from which we can see that our proposed algorithm achieves better prediction accuracy than the other classical regression algorithms, and that it tends to obtain better results with smaller \(\delta \), although this is not absolute.
6 Conclusion
We have designed a specialized pdf for the input of the MGP model which consists of three parts: the left and right side parts take the form of Gaussian distributions, while the middle part takes the form of a uniform distribution. This specialized pdf combines the advantages of the Gaussian and uniform distributions: the Gaussian tails on the left and right ensure that the hard-cut EM algorithm performs effectively in each iteration, while the uniform middle part is more reasonable for time series data. Experiments conducted on three groups of synthetic datasets and on stock datasets demonstrate that the hard-cut EM algorithm for MGPs with the specialized pdfs obtains better prediction accuracy than the other classical regression algorithms, and that the specialized input pdf is more effective for time series data.
References
Rasmussen, C.E.: Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. University of Toronto (1999)
Williams, C.K.I., Barber, D.: Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1342–1351 (1998)
Rasmussen, C.E., Kuss, M.: Gaussian processes in reinforcement learning. In: NIPS, vol. 4, p. 1 (2003)
Tresp, V.: Mixtures of Gaussian processes. In: Advances in Neural Information Processing Systems, pp. 654–660 (2001)
Yuan, C., Neubauer, C.: Variational mixture of Gaussian process experts. In: Advances in Neural Information Processing Systems, pp. 1897–1904 (2009)
Stachniss, C., Plagemann, C., Lilienthal, A.J., et al.: Gas distribution modeling using sparse Gaussian process mixture models. In: Robotics: Science and Systems, vol. 3 (2008)
Yang, Y., Ma, J.: An efficient EM approach to parameter learning of the mixture of Gaussian processes. In: Liu, D., Zhang, H., Polycarpou, M., Alippi, C., He, H. (eds.) ISNN 2011. LNCS, vol. 6676, pp. 165–174. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21090-7_20
Meeds, E., Osindero, S.: An alternative infinite mixture of Gaussian process experts. In: Advances in Neural Information Processing Systems, pp. 883–890 (2006)
Sun, S., Xu, X.: Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction. IEEE Trans. Intell. Transp. Syst. 12(2), 466–475 (2011)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
Nguyen, T., Bonilla, E.: Fast allocation of Gaussian process experts. In: International Conference on Machine Learning, pp. 145–153 (2014)
Lázaro-Gredilla, M., Van Vaerenbergh, S., Lawrence, N.D.: Overlapping mixtures of Gaussian processes for the data association problem. Pattern Recogn. 45(4), 1386–1395 (2012)
Ross, J., Dy, J.: Nonparametric mixture of Gaussian processes with constraints. In: International Conference on Machine Learning, pp. 1346–1354 (2013)
Wu, D., Ma, J.: A two-layer mixture model of Gaussian process functional regressions and its MCMC EM algorithm. IEEE Trans. Neural Netw. Learn. Syst. (2018)
Wu, D., Chen, Z., Ma, J.: An MCMC based EM algorithm for mixtures of Gaussian processes. In: Hu, X., Xia, Y., Zhang, Y., Zhao, D. (eds.) ISNN 2015. LNCS, vol. 9377, pp. 327–334. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25393-0_36
Wu, D., Ma, J.: A DAEM algorithm for mixtures of Gaussian process functional regressions. In: Huang, D.-S., Han, K., Hussain, A. (eds.) ICIC 2016. LNCS (LNAI), vol. 9773, pp. 294–303. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42297-8_28
Chen, Z., Ma, J., Zhou, Y.: A precise hard-cut EM algorithm for mixtures of Gaussian processes. In: Huang, D.-S., Jo, K.-H., Wang, L. (eds.) ICIC 2014. LNCS (LNAI), vol. 8589, pp. 68–75. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09339-0_7
Chen, Z., Ma, J.: The hard-cut EM algorithm for mixture of sparse Gaussian processes. In: Huang, D.-S., Han, K. (eds.) ICIC 2015. LNCS (LNAI), vol. 9227, pp. 13–24. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22053-6_2
Zhao, L., Chen, Z., Ma, J.: An effective model selection criterion for mixtures of Gaussian processes. In: Hu, X., Xia, Y., Zhang, Y., Zhao, D. (eds.) ISNN 2015. LNCS, vol. 9377, pp. 345–354. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25393-0_38
Zhao, L., Ma, J.: A dynamic model selection algorithm for mixtures of Gaussian processes. In: 2016 IEEE 13th International Conference on Signal Processing (ICSP), pp. 1095–1099. IEEE (2016)
Liu, S., Ma, J.: Stock price prediction through the mixture of Gaussian processes via the precise hard-cut EM algorithm. In: Huang, D.-S., Han, K., Hussain, A. (eds.) ICIC 2016. LNCS (LNAI), vol. 9773, pp. 282–293. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42297-8_27
Shi, J.Q., Murray-Smith, R., Titterington, D.M.: Bayesian regression and classification using mixtures of Gaussian processes. Int. J. Adapt. Control. Signal Process. 17(2), 149–161 (2003)
Tayal, A., Poupart, P., Li, Y.: Hierarchical double Dirichlet process mixture of Gaussian processes. In: AAAI (2012)
Chatzis, S.P., Demiris, Y.: Nonparametric mixtures of Gaussian processes with power-law behavior. IEEE Trans. Neural Netw. Learn. Syst. 23(12), 1862–1871 (2012)
Kapoor, A., Ahn, H., Picard, R.W.: Mixture of Gaussian processes for combining multiple modalities. In: Oza, N.C., Polikar, R., Kittler, J., Roli, F. (eds.) MCS 2005. LNCS, vol. 3541, pp. 86–96. Springer, Heidelberg (2005). https://doi.org/10.1007/11494683_9
Acknowledgment
This work was supported by the National Science Foundation of China under Grant 61171138.
© 2018 IFIP International Federation for Information Processing
Zhao, L., Ma, J. (2018). A Specialized Probability Density Function for the Input of Mixture of Gaussian Processes. In: Shi, Z., Pennartz, C., Huang, T. (eds) Intelligence Science II. ICIS 2018. IFIP Advances in Information and Communication Technology, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-030-01313-4_8