
1 Introduction

The Gaussian process (GP) is a powerful model that is widely used in machine learning and data mining [1,2,3]. However, it has two main limitations. Firstly, it cannot fit multi-modal datasets well because the GP model employs a global scale parameter [4]. Secondly, its parameter learning consumes \(O(N^{3})\) computational time [5, 6], where N is the number of training samples. In order to overcome these difficulties, Tresp [4] proposed the mixture of Gaussian processes (MGP) in 2000, which was developed from the mixture of experts. Since then, many kinds of MGP models have been proposed, and they can be classified into two main forms: the generative model [7,8,9,10] and the conditional model [4, 6, 11,12,13]. In comparison with the conditional model, the generative model has two main advantages: (1) missing features can be easily inferred from the outputs; (2) the influence of the inputs on the outputs is clearer [8]. Therefore, many scholars have studied the generative model [14,15,16,17,18,19,20].

Fig. 1. The sketch of the eLoad data.

Fig. 2. The sketch of the transformed eLoad data.

However, when we learn the generative model on a given dataset, we must set the probability density function (pdf) of the input in advance. In general, it is set as a Gaussian distribution [14,15,16,17,18,19,20]. For some real-world data such as time series, however, this assumption is neither reasonable nor effective. When learning an MGP model on such data, we usually need to utilize the ARMA model [14,15,16,17,18,19,20,21] to transform the data, and then fit the MGP model on the transformed data. Unfortunately, this transformation can destroy the correlation between samples, which is very important for the MGP model. Figure 1 shows the eLoad data [14], in which the samples in three different colors (blue, black, and red) represent three temporally sequential segments, respectively. Figure 2 shows the transformed eLoad data, in which the three temporally sequential segments are mixed together and cannot be separated effectively. In this paper, we propose a specialized pdf for the input of the MGP model to solve this problem. As shown in Fig. 3, this pdf consists of three components: the left and right parts are Gaussian distributions, while the middle part is a uniform distribution. For training the MGP model, we use the hard-cut EM algorithm [17] as the basic learning framework for parameter estimation. In practice, the hard-cut EM algorithm obtains better results than several popular learning algorithms.

The rest of the paper is organized as follows. Section 2 introduces the GP and MGP models. We describe the specialized probability density function in Sect. 3. We further propose the learning algorithm for the MGP model with the specialized pdfs in Sect. 4. The experimental results are presented in Sect. 5. Finally, we make a brief conclusion in Sect. 6.

Fig. 3. The sketch of the specialized input distribution.

2 GP and MGP Models

2.1 GP Model

We mathematically define the GP model as follows:

$$\begin{aligned} {{\varvec{Y}}}\sim N(m({{\varvec{X}}}),K({{\varvec{X}}},{{\varvec{X}}})) \end{aligned}$$
(1)

where D = {X,Y} = {(\({\varvec{x}}_{i}\), \(y_{i}\)): i = 1,2,...,N}, \({\varvec{x}}_{i}\) denotes a d-dimensional input vector, and \(y_{i}\) is the corresponding output. m(X) and K(X,X) denote the mean vector and the covariance matrix, respectively. Without loss of generality, we assume m(X) = 0. There are many choices for the covariance function, such as the linear, Gaussian noise, and squared exponential functions. Here, we adopt the squared exponential (SE) covariance function [10]:

$$\begin{aligned} K({{\varvec{{x}}}}_{i},{{\varvec{{x}}}}_{j};{{\varvec{\theta }}})=\sigma _{f}^2exp(-\frac{\sigma _{l}^2}{2}\Vert {{\varvec{{x}}}}_{i}-{{\varvec{{x}}}}_{j}\Vert ^2)+\sigma _{n}^2I _{(i=j)} \end{aligned}$$
(2)

where \(\varvec{\theta }\) = {\(\sigma _{f}^2\),\(\sigma _{l}^2\),\(\sigma _{n}^2\)} denotes the hyper-parameter vector. On the given sample dataset D, the log-likelihood function can be expressed as follows:

$$\begin{aligned} \log p({{\varvec{Y}}}|{{\varvec{X}}},\varvec{\theta })=\log N ({{\varvec{Y}}}|{{\varvec{0}}},K({{\varvec{X}}},{{\varvec{X}}})) \end{aligned}$$
(3)

In order to obtain the estimation of parameters \(\varvec{\theta }\), we perform the maximum likelihood estimation (MLE) procedure [10], that is, we get

$$\begin{aligned} \hat{\varvec{\theta }}= {\mathop {argmax}\nolimits _{\varvec{\theta }}}\log N ({{\varvec{Y}}}|{{\varvec{0}}},K({{\varvec{X}}},{{\varvec{X}}})) \end{aligned}$$
(4)
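As a concrete illustration, the following Python sketch fits the hyper-parameters of Eq. (2) by numerically maximizing the log-likelihood in Eq. (3); it assumes NumPy and SciPy are available, and the function names (se_kernel, fit_gp) are our own rather than part of any reference implementation.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(X1, X2, sf2, sl2, sn2=0.0):
    """Squared exponential covariance of Eq. (2); sn2 is added on the diagonal when X1 is X2."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    K = sf2 * np.exp(-0.5 * sl2 * d2)
    if X1 is X2:
        K = K + sn2 * np.eye(X1.shape[0])
    return K

def neg_log_likelihood(log_theta, X, y):
    """Negative of Eq. (3); theta is parameterized on the log scale to keep it positive."""
    sf2, sl2, sn2 = np.exp(log_theta)
    K = se_kernel(X, X, sf2, sl2, sn2) + 1e-8 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

def fit_gp(X, y):
    """MLE of Eq. (4) by numerical optimization from a neutral starting point."""
    res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y), method="L-BFGS-B")
    return np.exp(res.x)  # (sigma_f^2, sigma_l^2, sigma_n^2)
```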

2.2 MGP Model

Denote C and N as the number of GP components and the number of training samples in the MGP model, respectively. On the basis of the GP model, we define the MGP model by the following steps:

Step 1. Partition the samples into the GP components according to a multinomial distribution:

$$\begin{aligned} p(z_{n}=c)=\pi _{c} \end{aligned}$$
(5)

where c = 1,...,C and n = 1,...,N.

Step 2. Accordingly, each input \({\varvec{x}}_{n}\) follows the distribution:

$$\begin{aligned} {{\varvec{x}}_{n}}\,|\, z_{n}= c\sim p({{\varvec{x}}}| \varvec{\psi }_{c}) \end{aligned}$$
(6)

where {\(\varvec{\psi }_{c}: c=1, ..., C \)} is the parameter set. In general, p(\({\varvec{x}}|\varvec{\psi }_{c}\)) is a Gaussian distribution.

Step 3. Denote \({\varvec{I}}_{c}\) = {\(n \vert z_{n}=c\)}, \({\varvec{X}}_{c}\) = {\({\varvec{x}}_{n} \vert z_{n}=c\)}, \({{\varvec{Y}}_{c}}=\{ y_{n} \vert z_{n}=c \}\) (c=1,...,C, n=1,...,N) as the sample indexes, inputs and outputs of the training samples in the c-th component, respectively. Given \({\varvec{X}}_{c}\), the corresponding c-th GP component can be mathematically defined as follows:

$$\begin{aligned} {{\varvec{Y}}_{c}}\sim {N}({{\varvec{0}}}, K({{\varvec{X}}_{c}},{{\varvec{X}}_{c}})) \end{aligned}$$
(7)

where K(\({\varvec{X}}_{c}\),\({\varvec{X}}_{c}\)) is given by Eq.(2) with the hyper-parameter \({\varvec{\theta }}_{c} =\{\sigma _{fc}^2, \sigma _{lc}^2,\sigma _{nc}^2\}\).

Eqs. (5), (6) and (7) together define the MGP model mathematically. Its log-likelihood function is derived as follows:

$$\begin{aligned} \log p ({{\varvec{Y}}}|{{\varvec{X}}}, \varvec{\varTheta }, \varvec{\varPsi }) =\sum _{c=1}^{C}\Big (\sum _{n\in {{\varvec{I}}_{c}}}\log \big ({\pi _{c}}\, {p}({{\varvec{x}}_{n}}|\varvec{\psi }_{c})\big ) +\log {p}({{\varvec{Y}}_{c}}|{{\varvec{X}}_{c}},{\varvec{\theta }}_{c})\Big ) \end{aligned}$$
(8)

where \(\varvec{\varTheta }=\{\varvec{\theta }_{c}:c=1, ..., C \}\) and \(\varvec{\varPsi }= \{ \varvec{\psi }_{c}, {\pi }_{c}: c =1, ..., C \}\) denote the hyper-parameters and parameters of the MGP model, respectively.
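To make the generative definition of Steps 1-3 concrete, the following sketch draws a synthetic dataset from an MGP with Gaussian inputs (the default assumption in Step 2). It is illustrative code of our own that reuses the se_kernel helper from the previous sketch; mus, Ss, and thetas hold the per-component input means, input covariances, and GP hyper-parameters.

```python
import numpy as np

def sample_mgp(N, pi, mus, Ss, thetas, seed=0):
    """Draw N samples from an MGP: Step 1 picks components (Eq. (5)), Step 2 draws the inputs
    (Eq. (6)), and Step 3 draws each component's outputs jointly from its GP prior (Eq. (7))."""
    rng = np.random.default_rng(seed)
    mus, Ss = np.asarray(mus), np.asarray(Ss)
    C = len(pi)
    z = rng.choice(C, size=N, p=pi)                                     # Step 1
    X, y = np.empty((N, mus.shape[1])), np.empty(N)
    for c in range(C):
        idx = np.where(z == c)[0]
        if idx.size == 0:
            continue
        X[idx] = rng.multivariate_normal(mus[c], Ss[c], size=idx.size)  # Step 2
        sf2, sl2, sn2 = thetas[c]
        K = se_kernel(X[idx], X[idx], sf2, sl2, sn2)                    # Eq. (2)
        y[idx] = rng.multivariate_normal(np.zeros(idx.size), K)         # Step 3
    return X, y, z
```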

3 Specialized Input Distribution and Its Learning Algorithm

For many real-world datasets, such as those in the UCI machine learning repository, a Gaussian distribution is not appropriate for modeling the input. In order to solve this problem, we propose a specialized distribution for this situation.

3.1 Specialized PDF

This specialized distribution is a piecewise-defined continuous density that consists of three parts: the middle part is a uniform density, while the two side parts are Gaussian densities, as shown in Fig. 3. We define the specialized distribution mathematically as follows:

$$\begin{aligned} P(\varvec{x};\varvec{\psi })={\left\{ \begin{array}{ll} \frac{\lambda _{1}}{\sqrt{2\pi }\tau _1}\exp \left( -\frac{(\varvec{x}-\varvec{a})^2}{2\tau _1^2}\right) &{}\varvec{x}<\varvec{a}\\ \lambda &{} \varvec{a} \le \varvec{x} \le \varvec{b} \\ \frac{\lambda _{2}}{\sqrt{2\pi }\tau _2}\exp \left( -\frac{(\varvec{x}-\varvec{b})^2}{2\tau _2^2}\right) &{}\varvec{x}>\varvec{b}\\ \end{array}\right. } \end{aligned}$$
(9)

where we redefine \(\varvec{\psi }\)\(=\){\(\lambda ,\lambda _{1},\lambda _ {2},\tau _1,\tau _2\),\(\varvec{a}\),\(\varvec{b}\)} as the parameter vector.
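For clarity, a minimal NumPy sketch of Eq. (9) for a one-dimensional input is given below; the argument names simply mirror \(\varvec{\psi }\) and are our own choice.

```python
import numpy as np

def specialized_pdf(x, lam, lam1, lam2, tau1, tau2, a, b):
    """Piecewise density of Eq. (9): Gaussian tails outside [a, b], uniform level lam inside."""
    x = np.asarray(x, dtype=float)
    left = lam1 / (np.sqrt(2 * np.pi) * tau1) * np.exp(-(x - a) ** 2 / (2 * tau1 ** 2))
    right = lam2 / (np.sqrt(2 * np.pi) * tau2) * np.exp(-(x - b) ** 2 / (2 * tau2 ** 2))
    return np.where(x < a, left, np.where(x > b, right, lam))
```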

3.2 Learning Algorithm for the Specialized PDF

In order to learn \(\varvec{\psi }\), we require that the input interval (\(\varvec{a}\),\(\varvec{b}\)) contain a fraction \(p_0\) of the samples. Denote X and N as the training sample set and the number of training samples, respectively. The algorithm framework can be summarized in the following steps:

Step 1. Learn a, b, and \(\lambda \):

$$\begin{aligned} {{\varvec{a}}}=X_{\frac{N(1-p_{0})}{2}};{{\varvec{b}}}=X_{\frac{N(1+p_{0})}{2}};\lambda =\frac{{p}_{0}}{({{\varvec{b}}}-{{\varvec{a}}})} \end{aligned}$$
(10)

where \(X_{\frac{N(1-p_{0})}{2}}\) denotes the corresponding empirical quantile of the training inputs, i.e., p(x< \(X_{\frac{N(1-p_{0})}{2}}\) \(| x \in \) \({\varvec{X}}\)) = \(\frac{(1-p_{0})}{2}\). Estimating \(\varvec{a}\) and \(\varvec{b}\) from these quantiles, as in Eq. (10), reduces the effect of misclassified (or outlier) points on the middle part.

Step 2. Estimate \(\lambda _{1}\), \(\lambda _{2}\), \(\tau _{1}\) and \(\tau _{2}\).

Denote \(p_{1}\) and \(p_{2}\) as the sample ratios on the left side and the right side, respectively. The probability density function is continuous and integrates to 1. In other words:

$$\begin{aligned} \int _{-\infty }^{\varvec{a}} {P(\varvec{x};\varvec{\psi })}d\varvec{x}={p}_{1},\quad \int _{\varvec{a}}^{\varvec{b}} {P(\varvec{x};\varvec{\psi })}d\varvec{x}={p}_{0},\quad \int _{\varvec{b}}^{+\infty } {P(\varvec{x};\varvec{\psi })}d\varvec{x}={p}_{2}; \qquad {p}_{0} +{p}_{1} + {p}_{2} =1 \end{aligned}$$
(11)

According to the continuity of the probability density function and the constraints in Eq. (11), we only need to do some simple calculations to obtain {\(\lambda _{1},\lambda _ {2},\tau _{1},\tau _{2}\)}:

$$\begin{aligned} \lambda _{1}=2 {p}_{1} ;\lambda _{2}=2 {p}_{2} ;\tau _{1}=\frac{\lambda _{1}}{\sqrt{2\pi }\lambda };\tau _{2}=\frac{\lambda _{2}}{\sqrt{2\pi }\lambda } \end{aligned}$$
(12)
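The two steps above reduce to taking empirical quantiles and then enforcing normalization and continuity. A sketch under these assumptions (one-dimensional inputs; the function name and the placeholder value p0 = 0.8 are ours) is:

```python
import numpy as np

def fit_specialized_pdf(x, p0=0.8):
    """Estimate psi = {lam, lam1, lam2, tau1, tau2, a, b} from samples x via Eqs. (10)-(12)."""
    x = np.asarray(x, dtype=float)
    a = np.quantile(x, (1 - p0) / 2)              # Step 1: Eq. (10)
    b = np.quantile(x, (1 + p0) / 2)
    lam = p0 / (b - a)
    p1, p2 = np.mean(x < a), np.mean(x > b)       # Step 2: sample ratios on the two sides
    lam1, lam2 = 2 * p1, 2 * p2                   # Eq. (12): each Gaussian tail integrates to p1, p2
    tau1 = lam1 / (np.sqrt(2 * np.pi) * lam)      # Eq. (12): continuity at a
    tau2 = lam2 / (np.sqrt(2 * np.pi) * lam)      # Eq. (12): continuity at b
    return dict(lam=lam, lam1=lam1, lam2=lam2, tau1=tau1, tau2=tau2, a=a, b=b)
```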

4 The MGP Model of the Specialized PDFs and Its Learning Algorithm

We now consider the MGP model with these specialized pdfs. For the parameter learning of the MGP model, there are three main kinds of learning algorithms: MCMC methods [22, 23], variational Bayesian inference [24, 25], and the EM algorithm [5, 9, 11]. However, the MCMC methods and variational Bayesian inference have their own limitations: the time complexity of the MCMC methods is very high, and variational Bayesian inference may deviate considerably from the true objective function. The EM algorithm is an important and effective iterative algorithm for maximum likelihood or maximum a posteriori (MAP) estimation of the parameters of a mixture model. However, for such a complex MGP model, the posteriors of the latent variables and the Q function are rather complicated. In order to overcome this difficulty, we adopt the hard-cut EM algorithm [17] to learn the parameters; it makes certain approximations in the E-step.

Let \({\varvec{z}}_{nc}\) denote the latent indicator variables, where \({\varvec{z}}_{nc}\) = 1 if the sample (\({\varvec{x}}_{n}\),\(y_{n}\)) belongs to the c-th GP component and \({\varvec{z}}_{nc}\) = 0 otherwise. Therefore, we can obtain the log-likelihood function of the complete data from Eq. (8) as follows:

$$\begin{aligned} {\begin{matrix} \log (p({{\varvec{Y}}},{{\varvec{Z}}}| {{\varvec{X}}}, \varvec{\varTheta }, \varvec{\varPsi } )) &{}=\sum _{c=1}^{C}(\sum _{n=1}^{N}( {\varvec{z}}_{nc} \log (\pi _{c} p({{\varvec{x}}_{n}}| \varvec{\psi } _{c})))\\ {} &{} +\log ( p ( {\varvec{Y}}_{c} | {\varvec{X}}_{c} , \varvec{\theta } _{c}))) \end{matrix}} \end{aligned}$$
(13)

The main idea of the hard-cut EM algorithm can be expressed in the following steps:

E-step. Assign the samples to the corresponding GP components according to the maximum a posteriori (MAP) criterion:

$$\begin{aligned} \widehat{k}_{n}={\mathop {argmax}\nolimits _{1 \le c \le C}} \{ \pi _{c} p({\varvec{x}}_{n}|\varvec{\psi } _{c}) p(y_{n}|\varvec{\theta }_{c}) \} \end{aligned}$$
(14)

that is, we set the latent variable \({\varvec{z}}_{n\widehat{k}_{n}}=1\).

M-step. With the known partition, we can estimate the parameters \(\varvec{\varPsi }\) and hyper-parameters \(\varvec{\varTheta }\) via the MLE procedure (a code sketch of one full iteration is given after the list below):

  (1) For learning the parameters \(\{\varvec{\psi }_{c}\}_{c=1}^{C}\), we perform the learning algorithm described in Sect. 3.2.

  (2) For estimating the hyper-parameters \(\varvec{\varTheta }\), we perform the MLE procedure of Eq. (4) on each c-th component to estimate \(\varvec{\theta }_{c}\).
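The sketch below shows one hard-cut EM iteration under our own simplifying assumptions (one-dimensional inputs, the specialized pdf of Sect. 3, and the fit_gp, fit_specialized_pdf, and specialized_pdf helpers sketched above); in particular, p(\(y_{n}|\varvec{\theta }_{c}\)) in Eq. (14) is approximated by the per-sample GP prior marginal \(N(0,\sigma _{fc}^2+\sigma _{nc}^2)\), so this is an illustration of Eqs. (13)-(14) rather than the exact reference procedure.

```python
import numpy as np
from scipy.stats import norm

def hard_cut_em_step(X, y, z, C):
    """One iteration: the M-step re-fits every component, the E-step reassigns samples by MAP."""
    N = len(y)
    # M-step: mixing proportions, input pdf parameters, and GP hyper-parameters per component
    pis, psis, thetas = [], [], []
    for c in range(C):
        idx = np.where(z == c)[0]
        pis.append(idx.size / N)
        psis.append(fit_specialized_pdf(X[idx, 0]))   # Sect. 3.2
        thetas.append(fit_gp(X[idx], y[idx]))         # MLE of Eq. (4) on the c-th component
    # E-step: MAP reassignment of every sample (Eq. (14)); p(y_n | theta_c) is replaced by the
    # simple GP prior marginal N(0, sigma_f^2 + sigma_n^2) to keep the sketch short.
    scores = np.zeros((N, C))
    for c in range(C):
        sf2, _, sn2 = thetas[c]
        p_x = specialized_pdf(X[:, 0], **psis[c])
        p_y = norm.pdf(y, loc=0.0, scale=np.sqrt(sf2 + sn2))
        scores[:, c] = pis[c] * p_x * p_y
    z_new = np.argmax(scores, axis=1)                 # new partition, i.e., z_{n k_hat_n} = 1
    return z_new, pis, psis, thetas
```

Iterating hard_cut_em_step until the partition z stops changing gives the full learning procedure.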

5 Experimental Results

In order to test the accuracy and effectiveness of the specialized pdf for the MGP model, we carry out several experiments on synthetic datasets and stock datasets. We employ the root mean squared error (RMSE) to measure the prediction accuracy, which is defined as follows:

$$\begin{aligned} RMSE=\sqrt{\frac{\sum _{n=1}^{N}({{\varvec{y}}_{n}}-\hat{{{\varvec{y}}_{n}}})^2}{N}} \end{aligned}$$
(15)

where \(\hat{{\varvec{y}}_{n}}\) and \({{\varvec{y}}_{n}}\) denote the predicted value and the true value, respectively. We also compare our algorithm with some classical machine learning methods (kernel, RBF, and SVM), and denote by ‘OURS’ our proposed model trained with the hard-cut EM algorithm.
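Eq. (15) is straightforward to compute; a short NumPy version (our own helper) is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of Eq. (15)."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```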

Fig. 4. The dataset with the least degree of overlap from the MGP with 4 components.

Fig. 5. The distribution of the probability density function of the input on each Gaussian process component.

5.1 Simulation Experiments

In the simulation experiments, we generate three groups of synthetic datasets from MGP models. These three MGP models contain 4, 6, and 8 GP components, respectively, and the numbers of samples in the three groups are 2600, 3900, and 5000, respectively. Each group contains three datasets, which are identical except for the degree of overlap between components. Figure 4 shows the dataset with the smallest degree of overlap generated from the MGP with 4 components. On each dataset, we run each algorithm 100 times, each time randomly extracting 1/3 of the samples for training and the other 2/3 for testing. The RMSE of each algorithm is listed in Table 1, from which we can see that our proposed algorithm obtains the best results. Figure 5 shows the specialized pdfs learned on the dataset of the first group with the smallest overlapping degree. We can see that the specialized pdf decays at both ends of the data in the form of Gaussian tails, while it is uniform over the middle of the data. This shape fits the nearly uniformly distributed inputs better than a Gaussian distribution. Moreover, the Gaussian attenuation at both ends ensures that the iterations of the hard-cut EM algorithm remain effective, so that the class labels of the samples can be updated according to the MAP criterion in each iteration. If we applied a uniform distribution only, the iterative steps of the hard-cut EM algorithm would be invalid, since a sample falling outside a component's interval would receive zero density and could not be assigned by the MAP criterion.

Table 1. The RMSEs of the four algorithms on the three groups of synthetic datasets.
Table 2. The RMSEs of the four algorithms on the three groups of the transformed stock datasets.

5.2 Prediction on Stock Data

In this section, we use the closing price data of three stocks from the Shanghai Stock Exchange, with stock IDs 300015, 002643, and 601058, respectively.

From Eq. (10), we can see that the specialized pdf is closely related to the length of the middle interval. In order to examine the effect of different input interval lengths on the prediction accuracy of the algorithm, we apply some transformations to the input. Moreover, since the range of the outputs is too large, we use a linear function to compress the outputs to the same range as the synthetic data. In summary, we transform the datasets as follows (a brief code sketch is given after the list):

  (i) Transform the input according to the following equation:

    $$\begin{aligned} {X}_{n} =\frac{n}{\delta } \end{aligned}$$
    (16)

    where n = 1,...,N, N is the number of samples, and \({\delta } \in \{101, 51, 23, 11, 7, 3, 1\}\).

  (ii) Compress the output linearly onto the interval [−4.5, 4.5]:

    $$\begin{aligned} \tilde{y}=\frac{9(y-m)}{M-m}-4.5 \end{aligned}$$
    (17)

where M and m denote the maximum and minimum values of the stock's closing price, respectively.
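As a sketch of these two transformations (our own code; delta is one of the values listed above and y holds the raw closing prices):

```python
import numpy as np

def transform_stock(y, delta):
    """Eq. (16): index-based inputs scaled by delta; Eq. (17): outputs compressed to [-4.5, 4.5]."""
    y = np.asarray(y, dtype=float)
    n = np.arange(1, len(y) + 1)
    X = (n / delta).reshape(-1, 1)           # Eq. (16)
    M, m = y.max(), y.min()
    y_tilde = 9.0 * (y - m) / (M - m) - 4.5  # Eq. (17)
    return X, y_tilde
```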

Through the above transformations, each stock produces 7 datasets. On each of these datasets, we repeat each regression algorithm 100 times, randomly extracting 1/3 of the samples for training and the other 2/3 for testing. The RMSEs of the algorithms on the three transformed stock datasets are listed in Table 2. From Table 2, we can see that our proposed algorithm achieves better prediction accuracy than the other classical regression algorithms, and it generally obtains better results with smaller \(\delta \), although this trend is not absolute.

6 Conclusion

We have designed a specialized pdf for the input of the MGP model which consists of three parts: the left and right side parts take the form of Gaussian distributions, while the middle part takes the form of a uniform distribution. This specialized pdf combines the advantages of the Gaussian distribution and the uniform distribution. That is, the Gaussian tails in the left and right side parts ensure that the hard-cut EM algorithm can perform effectively during each iteration, while the uniform distribution in the middle part is more reasonable for time series data. The experiments are conducted on three groups of synthetic datasets and on stock datasets. The experimental results demonstrate that the hard-cut EM algorithm for the MGPs with the specialized pdfs can obtain better prediction accuracy than the other classical regression algorithms, and that the specialized input pdf is particularly effective for time series data.