
1 Introduction

In recent years, Gaussian processes (GPs) have become a widely used technique in machine learning [1]. They offer a principled approach to statistical inference, provided that the correlations among the data can be specified a priori. GPs thus provide a Bayesian nonparametric approach to classification, regression and prediction, and have produced excellent results [2].

The prior of the inference embodies our expectations about the behavior of the data, and many forms of kernels have been proposed [1, 2]. For example, for quantities that vary smoothly in space or time, the Squared Exponential (SE) kernel is a suitable choice. For data with a longer correlation range, one may choose the Rational Quadratic (RQ) kernel. To capture the characteristics of several kernels at once, mixtures of kernels such as the Spectral Mixture (SM) kernel [2] have also been used.

The importance of an appropriate choice of kernel has been illustrated in GP time series prediction [2, 3]. For example, while the SE kernel is suitable for describing smoothly varying data, it performs poorly on time series with periodic variations. For time series with both long- and short-term trends, kernels such as RQ can capture the long-term trend but not the short-term one, whereas kernels such as SE behave in the opposite way. To improve performance, kernels composed of a spectral mixture [2] were proposed and were able to overcome these drawbacks.

In real applications of time series prediction in science and economics, the data are often very noisy, and the quasi-periodic components of the data may be masked by the noise. The prediction of housing prices considered in this work is a typical example. For such quasi-periodic time series, traditional kernels such as SE and RQ may not be able to make meaningful predictions, while sophisticated kernels such as SM can suffer from over-parameterization and overfitting.

Financial time series have been investigated with various methods in recent decades, such as fuzzy time series [4], Support Vector Machines [5, 6], and Artificial Neural Networks (ANN) [7], with the aim of maintaining a stable and profitable market environment. However, many of these methods are not applicable to real time series [4]: they are too complex, lack interpretability, and fail to capture the essential properties of the data.

Among the various markets, the housing market is perhaps the most appealing one, because it is closely tied to the livelihood of individuals. For example, the burst of the Japanese housing bubble in 1991 had a serious impact not only on the housing market but also on the overall Japanese economy [8], leading to a sudden decline in both investment and consumption. However, because of the stochastic nature of financial markets [9], it is extremely difficult to accurately predict future observations from current information. Consequently, many researchers seek indicators that could reveal the hidden states of financial markets. For example, to prevent depressions caused by housing bubble bursts, there have been many efforts to identify explosive behavior, regarded as bubble inflation, in housing price time series. Several statistical techniques have been used to confirm the existence of explosive behavior, such as unit root tests, the augmented Dickey-Fuller test, and cointegration tests [10]. One popular method, the PSY method, was introduced in [11] and succeeded in identifying multiple bubbles in historical housing data [12]. However, there is little evidence that it is practical for predicting future trends in housing markets.

In this paper, we extract useful statistical information from a real time series, namely the Hong Kong housing price, and propose a comprehensive kernel built from the prior knowledge obtained from the time series. Using this kernel, we can improve the predictive power of GP models. In Sect. 2 we first review the theoretical framework of GPs. In Sect. 3, we describe how the Fourier spectrum of the auto-covariance of a time series can be used to construct the High Order Periodic (HOPE) kernel for Gaussian processes. In Sect. 4, we describe the construction of the kernel from the Hong Kong housing price data. In Sects. 5 and 6, we present predictions on the housing price data and the yearly mean total sunspot number given by GP models using the HOPE kernel and other popular kernels. The paper is concluded in Sect. 7.

2 Gaussian Processes

Given a set of points \(X = \{x_1, x_2, \dots , x_t\}\), a Gaussian process [1] assumes that \(f(x_i)\), the true value generated at point \(x_i\), satisfies the joint Gaussian distribution

$$\begin{aligned} \varvec{f} = [f(x_1) \dots f(x_t)]^T \sim \mathcal {N}(\varvec{\mu }, \varvec{K}(X,X)), \end{aligned}$$
(1)

with mean vector \(\varvec{\mu } = [\mu (x_1), \mu (x_2), \dots , \mu (x_t)]^T\) and covariance matrix \(\varvec{K}\), where \(K_{ij} = k(\varvec{\theta },x_i, x_j)\) is the kernel parameterized by the hyperparameters \(\varvec{\theta }\), and \(\mathcal {N}\) denotes the Gaussian distribution. Generally, the true values \(f(x_i)\) cannot be known and only their noisy versions \(y_i = f(x_i) + \epsilon \) can be observed, where \(\epsilon \) is independent, identically distributed Gaussian noise with variance \(\sigma _n^2\). Based on the given observations and the prior distribution, the expected value and variance of the true values at a set of points \(X'\) can be inferred. Specifically, to estimate the true values \(\varvec{f}'\) at \(X'\), we have the joint distribution

$$\begin{aligned} \begin{bmatrix} \varvec{y} \\ \varvec{f}' \end{bmatrix} \sim \mathcal {N}\left( \begin{bmatrix} \varvec{\mu } \\ \varvec{\mu '} \end{bmatrix}, \begin{bmatrix} \varvec{K}(X, X) + \sigma _n^2 I & \varvec{K}(X, X') \\ \varvec{K}(X', X) & \varvec{K}(X', X') \end{bmatrix} \right) , \end{aligned}$$
(2)

where I is the identity matrix, and the posterior predictive distribution

$$\begin{aligned} p(\varvec{f}' |\varvec{y}, X, X')= \mathcal {N}(\varvec{m}, \varvec{C}), \end{aligned}$$
(3)

where \(\varvec{m}\) is the posterior mean vector and \(\varvec{C}\) is the posterior covariance matrix at the points \(X'\), which can be derived analytically as:

$$\begin{aligned} \varvec{m} = \varvec{\mu '} + \varvec{K}(X', X) (\varvec{K}(X,X)+\sigma _n^2 I)^{-1}(\varvec{y} - \varvec{\mu }), \end{aligned}$$
(4)
$$\begin{aligned} \varvec{C} = \varvec{K}(X', X') - \varvec{K}(X', X) (\varvec{K}(X,X) + \sigma _n^2 I)^{-1}\varvec{K}(X, X'). \end{aligned}$$
(5)

Therefore, specifying the kernel k, the noise variance \(\sigma _n^2\), and the mean function \(\mu \) fully determines the GP, from which the estimated mean values can be calculated. To tune the hyperparameters of the GP, we maximize the marginal log-likelihood \(\mathcal {L}\) with respect to the unknown hyperparameters in the covariance matrix and the mean function:

$$\begin{aligned} \mathcal {L} = \log p(\varvec{y} | \varvec{\theta }, X) = -\frac{1}{2}\left( \varvec{y} - \varvec{\mu }\right) ^T(\varvec{K}+\sigma _n^2 I)^{-1}\left( \varvec{y} - \varvec{\mu }\right) \\ - \frac{1}{2}\log \det {(\varvec{K} + \sigma _n^2 I)} - \frac{|X|}{2}\log {2 \pi }, \end{aligned}$$
(6)

where X is the set of observation points, \(\varvec{y}\) is the corresponding vector of observed values at X, and |X| is the number of observations. Given the hyperparameters, noise variance \(\sigma _n^2\), and mean function that maximize the log-likelihood \(\mathcal {L}\), we can then predict the expected value and standard deviation of the unknown true values at the points \(X'\) from the predictive distribution.
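
For concreteness, the following is a minimal NumPy sketch of Eqs. (4)-(6). It is illustrative only (the experiments in Sect. 5 use MATLAB's fitrgp.m), and it uses a Cholesky factorization in place of the explicit inverse for numerical stability; the function names are ours.

```python
import numpy as np

def gp_posterior(k, X, y, Xs, mu, mus, sigma_n):
    """Posterior mean and covariance (Eqs. 4-5).

    k(A, B) returns the covariance matrix between point sets A and B;
    mu, mus are the prior means at X and Xs.
    """
    Ky = k(X, X) + sigma_n**2 * np.eye(len(X))      # K(X,X) + sigma_n^2 I
    L = np.linalg.cholesky(Ky)                      # stable alternative to inverting Ky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))
    Ksx = k(Xs, X)                                  # K(X', X)
    m = mus + Ksx @ alpha                           # Eq. (4)
    V = np.linalg.solve(L, Ksx.T)
    C = k(Xs, Xs) - V.T @ V                         # Eq. (5)
    return m, C

def log_marginal_likelihood(k, X, y, mu, sigma_n):
    """Marginal log-likelihood (Eq. 6)."""
    Ky = k(X, X) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu))
    return (-0.5 * (y - mu) @ alpha                 # data-fit term
            - np.sum(np.log(np.diag(L)))            # -1/2 log det(Ky)
            - 0.5 * len(X) * np.log(2 * np.pi))     # normalization
```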

3 Construction of Kernels

In this section, we construct a new kernel for noisy realistic time series. Although the empirical correlation can be questionable as a reflection of the true underlying processes [2], we show that it is still possible to improve predictions on noisy quasi-periodic time series by using information extracted from statistical analysis, such as auto-covariance analysis and the Fourier transform.

The cross-covariance \(\gamma _{FG}(l)\) for two finite discrete real time series \(F = \{F_t: t \in T\}\) and \(G = \{G_t: t\in T\}\) is defined as

$$\begin{aligned} \gamma _{FG}(l) = \sum _{t} F_{t + l} G_{t}, \end{aligned}$$
(7)

where l is the lag of G relative to F. For the case \(F = G\), \(\gamma _{FG} = \gamma _{FF}\) is the auto-covariance of the time series F. For the housing price data we adopt the definition without subtracting the mean of the time series.

Given a time series F with N observations, the auto-covariance \(\gamma _{FF}(l)\) characterizes the similarity in the data up to lag \(N - 1\). To construct a kernel function describing the similarity between arbitrary data points, we transform the auto-covariance \(\gamma _{FF}(l)\) into the frequency domain:

$$\begin{aligned} X(f) = \sum _{l = 0}^{N-1}\gamma _{FF}(l) e^{-i 2 \pi f l/N}, \end{aligned}$$
(8)

and obtain the auto-covariance spectrum \(\frac{|X(f)|}{N}\).
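
As a concrete illustration, the following short NumPy sketch implements Eqs. (7)-(8) for a single series; the function names are ours.

```python
import numpy as np

def auto_covariance(F):
    """Eq. (7) with G = F and no mean subtraction, for lags 0..N-1."""
    N = len(F)
    return np.array([np.dot(F[l:], F[:N - l]) for l in range(N)])

def auto_cov_spectrum(F):
    """Eq. (8): single-sided amplitude spectrum |X(f)| / N."""
    gamma = auto_covariance(F)
    N = len(gamma)
    amp = np.abs(np.fft.fft(gamma)) / N   # the DFT of the auto-covariance
    freqs = np.arange(N) / N              # cycles per sampling interval
    half = N // 2 + 1                     # keep the single-sided part
    return freqs[:half], amp[:half]
```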

We then select the significant periodic components, those with sufficiently large amplitudes, to construct the kernel, and ignore the weaker components, since components with small amplitudes may be mere noise or redundancy and contribute little to the true correlations between the data. Thus we propose the one-dimensional High Order Periodic (HOPE) kernel of order n:

$$\begin{aligned} k_n(x_i, x_j) = \alpha e^{ - (x_i - x_j)^2 / (2\ell ^2)} \sum _{m = 1}^{n} A_m \cos \left( 2 \pi f_m \left| x_i - x_j\right| \right) , \end{aligned}$$
(9)

where \(n \in \mathbb {Z}^+\); \(\alpha \), \(\ell \), and \(f_{m}, m = 1, \dots , n\), are hyperparameters; and the weight \(A_m\) is the m-th largest amplitude (peak) of the components in the auto-covariance spectrum. The exponential factor \(e^{ - ( x_i - x_j )^2 / (2\ell ^2)}\) ensures that the kernel vanishes when the spatial (temporal) separation between two observation points approaches infinity: intuitively, two data points well separated in space (time) should be weakly correlated. The HOPE kernel can be considered an extension of the Spectral Mixture (SM) kernel [2] for noisy quasi-periodic data, in which a number of hyperparameters are determined directly from the statistical features of the data to alleviate over-parameterization and overfitting, as will be shown in Sects. 5 and 6. By optimizing the log-likelihood (Eq. 6) with respect to the remaining hyperparameters and the mean function \(\mu \), we obtain a trained GP regression model.
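
A direct NumPy sketch of Eq. (9) follows; selecting the n largest spectral peaks to fix the weights \(A_m\) (and to initialize the \(f_m\)) is done beforehand, e.g. from the output of the spectrum sketch above.

```python
import numpy as np

def hope_kernel(x1, x2, alpha, ell, freqs, amps):
    """HOPE kernel of order n = len(freqs) (Eq. 9).

    x1, x2: 1-D arrays of points; freqs: the frequencies f_m;
    amps: the fixed weights A_m read off from the auto-covariance spectrum.
    """
    tau = x1[:, None] - x2[None, :]                    # pairwise differences
    envelope = alpha * np.exp(-tau**2 / (2 * ell**2))  # decaying SE factor
    periodic = sum(A * np.cos(2 * np.pi * f * np.abs(tau))
                   for f, A in zip(freqs, amps))
    return envelope * periodic
```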

Fig. 1. (a) The monthly average real price index (MARPI) per square foot of Hong Kong housing and the prime lending rate (PLR). (b) The MARPI rate of change and the PLR rate of change.

4 Hong Kong Housing Price

The Hong Kong property transaction price data were obtained through negotiations from EPRC Limited, a wholly-owned subsidiary of the Hong Kong Economic Times that specializes in providing property information to market-related industries. The dataset covers transacted properties from 1992 to 2010 and amounts to 2,492,842 transaction records. Figure 1(a) shows the monthly average real price index (MARPI) (in HKD per square foot, normalized by the consumer price index) of Hong Kong housing transactions and the prime lending rate (PLR) in Hong Kong from Jan 1992 to Dec 2010 (228 data points in total). Since the MARPI data are very noisy, we pre-process and denoise them by studying the time series of the MARPI rate of change over 12 months (a year), defined as the slope of the linear regression line fitted to the past 12 months of MARPI data. The time series of the PLR rate of change over 12 months is calculated in the same way. These two time series are compared in Fig. 1(b).
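
This pre-processing step can be sketched as follows (a rolling least-squares slope over a 12-month window; the function name is ours):

```python
import numpy as np

def rate_of_change(series, window=12):
    """Slope of the linear regression line over the past `window` values."""
    t = np.arange(window)
    slopes = []
    for i in range(window, len(series) + 1):
        y = series[i - window:i]
        slope, _intercept = np.polyfit(t, y, 1)   # degree-1 least-squares fit
        slopes.append(slope)
    return np.array(slopes)
```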

Fig. 2. (a) Auto-covariance of the MARPI rate of change and the PLR rate of change. (b) Cross-correlation (normalized version of the cross-covariance) of the MARPI rate of change and the PLR rate of change.

As shown in Fig. 1(b), quasi-periodic features exist in both the MARPI rate of change and the PLR rate of change. The quasi-periodicity is exhibited even more clearly in the auto-covariance (Eq. 7) of the rate of change, shown in Fig. 2(a). The period of the rate of change observed from the auto-covariance is around 30 months, as confirmed by the auto-covariance spectrum in Fig. 3.

Fig. 3. Single-sided auto-covariance spectrum of (a) the MARPI rate of change and (b) the PLR rate of change. The components with the largest amplitude in (a) and (b) have periods \(1/ 0.0323 = 31.0\) months and \(1/ 0.0277 = 36.1\) months, respectively.

It is interesting to consider why the housing price time series is quasi-periodic. We found evidence that the housing price is correlated with the PLR, which happened to vary quasi-periodically during the period of study. While the underlying reason is beyond the scope of this study, it is natural to expect cycles in the world economy: when consumption heats up, the PLR is raised to prevent overheating, and when the economy is weak, the PLR is reduced to stimulate consumption. As shown in Figs. 1, 2 and 3, there is a close correspondence between the periods of the two time series. Furthermore, the cross-correlation (normalized version of the cross-covariance) is negative near zero lag, showing that housing prices drop (rise) when the mortgage rate rises (drops).
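
For reference, a sketch of the cross-correlation shown in Fig. 2(b) is given below. We assume the common convention of removing the means and normalizing by the zero-lag standard deviations, since the exact normalization is not essential to the argument.

```python
import numpy as np

def cross_correlation(F, G):
    """Normalized cross-covariance of F and G over all lags."""
    F = np.asarray(F, dtype=float) - np.mean(F)
    G = np.asarray(G, dtype=float) - np.mean(G)
    c = np.correlate(F, G, mode="full")          # raw cross-covariance
    lags = np.arange(-len(G) + 1, len(F))        # lag of G relative to F
    rho = c / (len(F) * np.std(F) * np.std(G))   # one common normalization
    return lags, rho
```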

5 Experiments

In this section, we show that the HOPE kernel can be applied to a real time series to improve the predictive power of the GP. For comparison, other popular kernels, namely the Spectral Mixture (SM) kernel with \(Q = 12\) components, the Squared Exponential (SE) kernel, and the Rational Quadratic (RQ) kernel (see Table 1 [1, 2]), are also applied to the data. Since our data set is small (around 200 data points), the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [13] with cubic interpolation is used to optimize the log-likelihood function. The fitrgp.m and fminunc.m functions from MATLAB [14] are used to implement the GP and the optimization. The mean function \(\mu \) of the GP model is set to a constant, \(\mu (x) = c\). To optimize the hyperparameters, we randomly sample 40 initializations for each model and repeat the optimization from each one; the final hyperparameters are those yielding the highest log-likelihood \(\mathcal {L}\). During the sampling, the order n of HOPE is sampled between 1 and 12.
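
The multi-start optimization can be summarized by the following sketch, a generic SciPy analogue of the MATLAB fitrgp.m/fminunc.m setup we actually use; the names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_hyperparameters(neg_log_likelihood, sample_init, n_restarts=40, seed=0):
    """Repeat BFGS from random initializations; keep the best optimum."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta0 = sample_init(rng)                # random hyperparameter draw
        res = minimize(neg_log_likelihood, theta0, method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best.x, -best.fun                     # hyperparameters, max log-likelihood
```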

Table 1. Expressions of one-dimensional popular kernels, where \(\tau = \left| {x_i - x_j}\right| \). (The table body was lost in extraction; the entries below are the standard forms from [1, 2].)

SE: \(k_{\text {SE}}(\tau ) = \sigma ^2 \exp \left( -\tau ^2/(2\ell ^2)\right) \)

RQ: \(k_{\text {RQ}}(\tau ) = \sigma ^2 \left( 1 + \tau ^2/(2\alpha \ell ^2)\right) ^{-\alpha }\)

SM: \(k_{\text {SM}}(\tau ) = \sum _{q=1}^{Q} w_q \exp \left( -2\pi ^2\tau ^2 v_q\right) \cos \left( 2\pi \tau \mu _q\right) \)

Fig. 4. Predictions on the MARPI rate of change using different kernels. The training data lie to the left of the vertical dashed line. (a) The first 152 observations as training inputs. (b) The first 171 observations as training inputs.

5.1 Predictions on MARPI Rate of Change

To illustrate the predictive power of the GP model using HOPE, we choose two subsets of the MARPI rate-of-change time series as training sets, one containing the first 152 observations and the other the first 171 observations, with the remaining data serving as the corresponding testing sets. The predictions using different kernels are shown in Fig. 4.

GP models with all kernels perform well in reconstructing the training inputs, as their predictions on the training data almost overlap with the real data. In future predictions, however, the models using SE and RQ have little predictive power: the estimated means outside the training set diverge quickly from the testing data and tend to fall back toward the constant mean c. The predictions using SM are also unsatisfactory, with large gaps between the estimated means and the real data. In contrast, the predictions given by the model using HOPE do not deviate dramatically from the testing data and capture the trend of the future data over a considerable number of steps. At the same time, Fig. 4 indicates that, in general, the errors grow as predictions extend further into the future.

Another observation is that, for the training sets with 152 (171) observations, the optimized log-likelihoods \(\mathcal {L}\) of HOPE and SM are 270.9 (311.7) and 300.4 (329.9), respectively. Although SM attains the higher log-likelihood, Fig. 4 shows that it performs worse in predicting the unobserved data. One likely reason is that SM, being possibly over-parameterized, overfits the training data, which suppresses the predictive power of the GP. This observation can be regarded as evidence that, compared to SM, HOPE effectively prevents the GP from overfitting and gives better predictions of future data.

We use another example to further illustrate the overfitting induced by SM. For one set of training inputs (the first 200 observations of the MARPI rate of change), the predictions given by two GP models using SM (\(Q = 12\)) are shown in Fig. 5. Compared to the model with log-likelihood \(\mathcal {L} = 419.5\), the model with the higher log-likelihood \(\mathcal {L} = 428.1\) performs markedly worse in predicting the future data.

Fig. 5. Predictions on the MARPI rate of change with the first 200 observations as training inputs (to the left of the vertical dashed line). \(\mathcal {L}\) is the log-likelihood of the model. Both models use SM with \(Q = 12\).

5.2 Performance and Stability

Since GPs can be quite sensitive to the initialization of the hyperparameters as well as to the training data, we further compare the prediction stability of the different kernels by measuring their mean squared error (MSE), following the time series prediction assessment procedure introduced by Hyndman [15]. Specifically, the experiments are repeated with different training sets, each containing one more observation than the previous one, and for each training set the MSE over an h-step horizon (that is, from the 1-step prediction up to the h-step prediction) is computed. Initially, the training set has 151 observations. The average MSE for each kernel is reported in Table 2. The smaller the average MSE, the more stable the model, since a small average MSE indicates that the model maintains relatively good performance as the training inputs change. For comparison, the variance of the whole data set is 0.0232. Table 2 shows that HOPE outperforms the other kernels in stability over several-step horizons, which further supports that HOPE better captures the future trends of the time series. It also shows that the predictive power fades when the GP model estimates data in the distant future, since the average MSE increases dramatically with the prediction horizon. The performance of SM is again relatively poor, which may be a consequence of overfitting.
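
The evaluation loop is sketched below (expanding training window, with squared errors accumulated over horizons 1 to h; `predict` stands for any of the trained GP models, and the function name is ours):

```python
import numpy as np

def expanding_window_mse(data, predict, h, start=151):
    """Average MSE over all forecast origins and horizons 1..h.

    predict(train, h) returns the 1- to h-step-ahead forecasts of a
    model trained on `train`.
    """
    sq_errors = []
    for t in range(start, len(data) - h + 1):
        forecast = predict(data[:t], h)              # refit, then forecast
        sq_errors.append((forecast - data[t:t + h])**2)
    return np.mean(sq_errors)
```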

Table 2. Average MSE using different kernels

6 Application to Sunspot Time Series

To further illustrate the power of HOPE, we apply it to another well-known time series, the yearly mean total sunspot number, which can be obtained from http://www.sidc.be/silso/datafiles. For comparison, the popular kernels in Table 1 (SM with \(Q = 20\)) are also applied. The data from 1700 to 1899 (190 data points in total) are used as training inputs, and the data from 1900 to 1941 (42 data points in total) form the testing set. The mean function of the GP is set to a constant, \(\mu (x) = c\). The BFGS algorithm with 100 random initializations of the hyperparameters is used to search for the optimal hyperparameters of each model, and the order n of HOPE is sampled between 1 and 20.

Fig. 6. Predictions on the yearly mean total sunspot number with the first 190 observations as training inputs (to the left of the vertical dashed line). (a) Predictions given by HOPE and SM with \(Q = 20\). (b) Predictions given by SE and RQ.

Predictions on the yearly mean total sunspot number given by the different kernels are illustrated in Fig. 6. As in the predictions on the MARPI rate of change, SM (\(\mathcal {L} = -857.5\)) attains a larger log-likelihood than HOPE (\(\mathcal {L} = -885.0\)). However, in predicting the testing data, SM (\(\text {MSE} = 2272.3\)) performs worse than HOPE (\(\text {MSE} = 1947.5\)), which again indicates overfitting by SM. The predictions given by both SE (\(\text {MSE} = 2588.4\)) and RQ (\(\text {MSE} = 2599.7\)) are also quite poor, as seen in Fig. 6(b), even though their log-likelihoods (\(\mathcal {L}_{\text {SE}} = -887.7\) and \(\mathcal {L}_{\text {RQ}} = -887.5\)) are comparable to that of HOPE.

7 Conclusion

In this paper, we have discussed the Gaussian process (GP), a powerful regression model for time series prediction. Using a new kernel (HOPE) based on information extracted from the auto-covariance of the training time series, we improved the predictive power of the GP model, which compares favourably with models using other popular kernels. Our work shows that the choice of kernel is an essential factor in the performance of a GP. Choosing a kernel from the popular options may not work for noisy quasi-periodic data, in which case extracting the kernel from the covariance functions provides an alternative.

The Gaussian process can be a powerful predictor in financial time series analysis. Since we have only used Gaussian processes to analyze a single time series, we believe that by incorporating multiple time series, such as the prime lending rate and the trading volume, Gaussian processes may capture more of the information underlying the series and give more accurate predictions. However, the model cannot predict values arbitrarily many time steps ahead, so we may further investigate the range over which the predicted values remain close to the real data. We may also focus on predicting a few steps ahead in order to assess the model's accuracy and stability in short-term prediction.