1 Introduction

A major research problem in the housing market is how to predict the future trend and fluctuation of housing prices (Ghysels et al. 2013; Guirguis et al. 2005). Only with such knowledge can policymakers react promptly and appropriately to dampen the booms and busts of the housing market (Claessens 2015). In the past few decades, China has experienced rapid growth of its housing market, and the skyrocketing housing prices in its major cities now concern both policymakers and residents (Jia et al. 2017).

Unlike western countries, which have free markets, China’s housing market has several unique characteristics. First, the urban land market was established in 1988, when the first land auction also sparked the housing market. Since then, local governments have intervened heavily in land leasing and made it an important source of extra-budgetary revenue (Huang and Du 2017). Therefore, besides being a traded product, land serves as an important tool to stimulate the local economy and contributes substantially to high housing prices. Second, housing prices have kept growing over an extremely long period. The commodification of housing in China began in the 1990s. In 1998, the central government abolished the welfare-oriented distribution of public housing, which stimulated the housing market for the first time. Since then, housing prices in China have kept soaring (Fang et al. 2015; Wei et al. 2016). Meanwhile, strong optimism exists in the housing market, because housing prices have risen so much compared with household income and the prices of other commodities. Wei et al. (2016) indicated that house ownership has become the major measure of one’s wealth. Moreover, since a house is deemed a necessity for marriage in China, it has become a vital factor in the competition for marriage partners, which further pushes housing prices up. Third, government intervention is an important element of China’s housing market. To cope with the booming market, the government has enacted various policies to keep the market stable (Jia et al. 2017). These policies include restricting home purchases, tightening mortgage rates and raising property taxes. However, all these policies have failed to keep housing prices from growing too fast in the long run. Zhou (2018) showed that high sentiment negatively impacts the effectiveness of tightening policies. In particular, even if prices are temporarily suppressed by a central policy, a retaliatory price rebound follows when the related local policy is published.

Because of these special characteristics, predicting the trend and fluctuation of future housing prices in China is necessary but extremely difficult. To find an appropriate solution, we first review several existing models of the housing market, as discussed below.

Past studies have shown that the time path of housing prices can be predicted, at least partially. In the existing literature, various methodologies have been adopted to forecast the future trend of housing prices. Regression approaches have been the most commonly used. For instance, Pain and Westaway (1997) developed an error correction model to estimate housing prices in the UK, and Malpezzi (1999) also specified an error correction model for house prices. Crawford and Fratantoni (2003) compared the performance of regime switching, autoregressive integrated moving average (ARIMA) and generalized autoregressive conditional heteroskedastic (GARCH) models in predicting housing prices; their results show that ARIMA models perform better in out-of-sample forecasting. Guirguis et al. (2005) introduced varying parameters into several existing autoregressive models and forecasted housing prices in the US. These models, however, impose strict requirements on the input datasets. If the sample size is small, or if the data are non-stationary, the model cannot be correctly established.

The grey model offers another solution for modeling housing prices. Yang and Xing (2006) utilized a Grey–Markov model to predict the housing price index and achieved a satisfactory result. Wang et al. (2013) used a grey system, which successfully predicted the slowing rise of the housing price index in China. However, although constructing a grey model needs only a few samples, it is unable to simulate the complex ups and downs of housing prices; therefore, it can only be used to predict a monotonically increasing or decreasing trend (Wu and Chen 2005).

In recent years, machine learning has developed fast because of its great advantages over traditional methods (Al-Janabi 2017, 2018; Al-Janabi and Al-Shourbaji 2017; Al-Janabi et al. 2015), and various related methods have been used to model housing prices. Because the factors that affect housing prices are high-dimensional and usually nonlinear, these machine learning methods are expected to achieve better results than the traditional methods above. The applications are twofold. On the one hand, many studies focus on predicting or evaluating the price of individual housing units in a city by using machine learning. For example, Park and Bae (2015) compared various machine learning methods for predicting housing prices in Fairfax County, Virginia, and demonstrated that the RIPPER algorithm outperforms other models. Selim (2009) used an artificial neural network (ANN) to examine the determinants of housing prices in Turkey. Compared with the traditional hedonic-based regression model, these machine learning approaches have been shown to provide more accurate results (Limsombunchai et al. 2004). On the other hand, the macro-level housing price index can also be modeled and predicted by these methods. Wang et al. (2014) adopted a support vector machine (SVM) to forecast the housing price index in Chongqing, China, using particle swarm optimization (PSO) to determine the parameters; the results show that PSO-SVM performs better than grid search and the genetic algorithm. Such an approach, which uses an optimization method for parameter tuning and feature selection, can usually improve the performance of the original method and has been widely adopted in various applications (Abualigah and Khader 2017; Abualigah et al. 2017a, b, 2018a, b).

To sum up, traditional methodologies have several limitations relating to their fundamental model assumptions and estimations. Moreover, although some attempts have been made in recent years to model the housing market by machine learning methods, few studies have utilized and compared various machine learning methods in modeling housing prices for a city or a country, especially for China, where government behavior and financial policy have a great impact on the housing market. To fill this research gap, we propose a model based on long short-term memory (LSTM) to predict the housing price index at city level in China. This method has not yet been used in previous studies, but is expected to achieve good results in forecasting housing prices because of its advantage in predicting time series with long time lags between important events. Moreover, to achieve better results, a modified genetic algorithm with multi-level probability crossover is adopted to implement feature selection and optimize the hyper-parameters of the model. The real housing price data and the related features of Shenzhen, China, from the year 2012 to the year 2017 are used to test the performance of the model. A comparison of the proposed model with BPNN, SVR and DELSTM shows that the proposed LSTM approach achieves the best results, with an RMSE of 41, an MAE of 40 and a MAPE of 0.06%.

2 Datasets

We choose the housing market in Shenzhen, China, as the study object. To predict the future trend and fluctuation of the price of newly built commercial housing in Shenzhen, eight features are selected based on two criteria. First, the feature has been shown to influence the future housing price. Second, monthly time series data for the feature can be obtained from available data sources.

These features can be divided into three dimensions. The first dimension concerns residential land. In China, local governments intervene in the land market by monopolizing the right to supply land and lease it to developers (Huang and Du 2017). Therefore, the newly released residential land supply (NewhouseS) can indirectly affect the supply side of the housing market. On the other hand, the floor area under construction (AreaCons) reflects the developers’ response to the market, driven by investment incentives and by residents’ demand for space (Zhou 2018), and thus can also be an indicator of the housing price.

The second dimension contains several basic economic features. The price of newly built commercial housing (PriceNewHouse) is our prediction target; since its historical time series can greatly affect the future housing price (Huang et al. 2008), it is also put into the model as a feature. Another indicator is the completed investment in fixed assets (CIFAssets), which can reveal the attitude of investors as well as the housing demand of ordinary residents. Therefore, CIFAssets works as an important indicator of the housing market by showing a picture of the demand side. Moreover, the increase in the Consumer Price Index (MIncreCPI) can influence the housing market in China, because when the CPI goes up fast, people are prone to invest in houses to maintain the value of their assets (Wei et al. 2016).

The third dimension concerns the state’s financial policies that influence the housing market. One of the main instruments of government intervention is the adjustment of the medium- and long-term loan interest rate (LInterestRate). A benign interest rate environment can lead to a boom of the housing market (Demary 2010), and a tight interest policy is usually adopted to suppress rocketing prices. Meanwhile, in China, all employees are required to contribute a proportion of their salaries to the housing provident fund (HPF) (Yeung and Howes 2006), and in return, the HPF loan for a house has a lower interest rate than a commercial loan. Therefore, the interest rate of HPF also affects the housing market and should be added to the model as a feature. Moreover, the monthly net increase in RMB loans (RMBloan) of the city can be a useful indicator, because it reflects people’s optimism toward the market. A fast increase in RMB loans reveals that people are eager to put their money into the housing market.

We extracted 664 records for these eight features, spanning the period from December 2010 to October 2017. For each individual feature, 83 monthly records are obtained from various data sources. The detailed information of these features is listed in Table 1.
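The record count above can be checked with a few lines of arithmetic. The following is only a sanity-check sketch; the helper name `months_inclusive` is ours and not part of the study.

```python
from datetime import date


def months_inclusive(start: date, end: date) -> int:
    """Number of calendar months from start to end, counting both endpoints."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


n_months = months_inclusive(date(2010, 12, 1), date(2017, 10, 1))
n_features = 8
n_records = n_months * n_features  # one record per feature per month
print(n_months, n_records)
```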

Table 1 List of features for the experiment

3 Algorithm

3.1 LSTM

Long short-term memory (LSTM) is a type of recurrent neural network (RNN), whose nodes contain self-loop connections (Evermann et al. 2017; Gensler et al. 2017; Hochreiter and Schmidhuber 1997).

Different from the classical RNN, the LSTM introduces the concept of a memory cell, which contains several structures that control the information flow: the input gate, the forget gate and the output gate. The classical RNN suffers from vanishing/exploding gradients and has difficulty learning long-term dependencies, while the LSTM can overcome these problems through its special structure.

Consider a multivariate time series \( x = \left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\} \), where \( x_{1} \) is a univariate time series \( \left\{ {x_{1,1} ,x_{1,2} , \ldots ,x_{1,T} } \right\} \) and T is the length of \( x_{1} \). The LSTM block can be represented by Eqs. (1)–(6). Figure 1 shows the structure of an LSTM block.

Fig. 1 An LSTM block

$$ i_{t} = \sigma \left( {W_{i} x_{t} + V_{i} h_{t - 1} + B_{i} } \right) $$
(1)
$$ f_{t} = \sigma \left( {W_{f} x_{t} + V_{f} h_{t - 1} + B_{f} } \right) $$
(2)
$$ o_{t} = \sigma \left( {W_{o} x_{t} + V_{o} h_{t - 1} + B_{o} } \right) $$
(3)
$$ m_{t} = { \tanh }\left( {W_{m} x_{t} + V_{m} h_{t - 1} + B_{m} } \right) $$
(4)
$$ c_{t} = i_{t} *m_{t} + f_{t} *c_{t - 1} $$
(5)
$$ h_{t} = o_{t} *\tanh \left( {c_{t} } \right) $$
(6)

where \( x_{t} \) is the multidimensional input at step t, \( \sigma \left( \cdot \right) \) is the sigmoid function and \( { \tanh }\left( \cdot \right) \) is the hyperbolic tangent function. \( i_{t} \), \( f_{t} \) and \( o_{t} \) are the input gate, forget gate and output gate, respectively, at step t; \( m_{t} \) is the candidate input to the memory cell; \( c_{t} \) is the memory cell state; and \( h_{t} \) is the hidden state at step t. W and V are the weight parameters, B is the bias parameter, and * denotes element-wise multiplication.
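To make Eqs. (1)–(6) concrete, a single LSTM step can be sketched in NumPy as below. This is a minimal illustration and not the Keras implementation used in the experiments; the dictionary layout of the parameters, the random initialization and the sizes (8 inputs, 5 hidden units, 4 time steps) are assumptions chosen only for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following Eqs. (1)-(6); params holds W, V and B per gate."""
    W, V, B = params["W"], params["V"], params["B"]
    i_t = sigmoid(W["i"] @ x_t + V["i"] @ h_prev + B["i"])  # Eq. (1), input gate
    f_t = sigmoid(W["f"] @ x_t + V["f"] @ h_prev + B["f"])  # Eq. (2), forget gate
    o_t = sigmoid(W["o"] @ x_t + V["o"] @ h_prev + B["o"])  # Eq. (3), output gate
    m_t = np.tanh(W["m"] @ x_t + V["m"] @ h_prev + B["m"])  # Eq. (4), candidate input
    c_t = i_t * m_t + f_t * c_prev                          # Eq. (5), cell state
    h_t = o_t * np.tanh(c_t)                                # Eq. (6), hidden state
    return h_t, c_t


def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases (an illustrative choice)."""
    gates = "ifom"
    return {
        "W": {g: rng.normal(scale=0.1, size=(n_hidden, n_in)) for g in gates},
        "V": {g: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for g in gates},
        "B": {g: np.zeros(n_hidden) for g in gates},
    }


rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 8, 5, 4
params = init_params(n_in, n_hidden, rng)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.random((n_steps, n_in)):  # synthetic inputs in [0, 1)
    h, c = lstm_step(x_t, h, c, params)
print(h)
```

Because the hidden state is the product of a sigmoid output and a tanh output, every component of `h` stays strictly inside (-1, 1), which is one reason the inputs and targets are usually scaled before training.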

3.2 GA

The genetic algorithm (GA) is a metaheuristic that simulates the evolution of species (Holland 1992) and is commonly used to generate solutions to optimization and search problems. It has simple operations and is very robust. In recent years, GA has been used widely in various applications such as production scheduling and control science. A classical GA mainly includes four steps: initialization, selection, crossover and mutation. Initialization randomly generates some feasible solutions to the actual problem. Selection is the process of choosing individuals in the population based on their fitness (Goldberg 1989). Generally, individuals with higher fitness are more likely to be retained, while individuals with lower fitness are more likely to be eliminated. Crossover exchanges parts of two individuals to generate new individuals, which may significantly improve the search performance. Mutation randomly rewrites part of an individual with some probability. It is intended to increase the diversity of the population and thus to reduce the probability of falling into a local optimum. By iterating these steps, GA can help us to obtain an optimal or near-optimal result.

3.3 GA-LSTM

The hyper-parameters can greatly affect the performance of LSTM. The number of units in each hidden layer (NUHL) is an important hyper-parameter, and determining it is crucial in the whole process. Meanwhile, feature selection can also greatly influence the performance of LSTM. Therefore, a modified GA with multi-level probability crossover is proposed to optimize the NUHL and perform the feature selection. The GA-LSTM procedure is described as follows:

Step 1 Data preprocessing. The data set is first normalized to [0, 1] by Eq. (7) and then divided into two subsets: the training set and the test set.

$$ x = \left( {x - x_{ \hbox{min} } } \right)/\left( {x_{ \hbox{max} } - x_{ \hbox{min} } } \right) $$
(7)
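Step 1 can be sketched as follows. This is a minimal illustration on synthetic numbers, not the real Shenzhen data: the 83 × 8 array shape matches the dataset size reported in Sect. 2 and the 90% training share follows Sect. 4.2, while the function names and the guard for constant columns are our assumptions.

```python
import numpy as np


def min_max_normalize(data):
    """Scale each column (feature) to [0, 1] as in Eq. (7)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (data - col_min) / span, col_min, col_max


def chronological_split(data, train_fraction):
    """Keep the time order: the first part is for training, the rest for testing."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


rng = np.random.default_rng(0)
raw = rng.uniform(10.0, 100.0, size=(83, 8))  # synthetic stand-in for 83 months x 8 features
scaled, col_min, col_max = min_max_normalize(raw)
train, test = chronological_split(scaled, 0.9)
print(train.shape, test.shape)
```

The returned `col_min` and `col_max` are kept so that forecasts can be mapped back to the original price scale before the errors are computed.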

Step 2 Initialization. The parameters of GA include the population size; the maximum generation, Gmax; the crossover factor, CF; the level of number of hidden layers (NHL), Ch; the level of feature selection, Cs; and the mutation factor, MF. The parameters of LSTM include the bounds of NUHL, the training number, the batch size and the look back. An individual is generated based on these settings. As shown in Fig. 2, an individual is generated with 4 hidden layers and 6 features. Each NUHL is generated randomly between the lower and upper bounds of NUHL. The feature selection part is randomly generated in {0, 1}, where 1 and 0 indicate that the feature is selected and not selected, respectively. The initial population is generated accordingly.

Fig. 2 An individual
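The encoding of an individual can be sketched as below, using 4 hidden layers and 6 features as in Fig. 2, together with the NUHL bounds [5, 20] and the population size of 20 from Sect. 4.1. Storing an individual as a dict with the keys `nuhl` and `features`, and forcing at least one selected feature, are our assumptions for illustration.

```python
import random


def random_individual(n_layers, n_features, nuhl_bounds, rng):
    """Build one individual: NUHL per hidden layer plus a 0/1 feature mask."""
    low, high = nuhl_bounds
    nuhl = [rng.randint(low, high) for _ in range(n_layers)]  # bounds are inclusive
    mask = [rng.randint(0, 1) for _ in range(n_features)]
    if not any(mask):
        # Assumption: an LSTM needs at least one input feature.
        mask[rng.randrange(n_features)] = 1
    return {"nuhl": nuhl, "features": mask}


rng = random.Random(0)
population = [random_individual(4, 6, (5, 20), rng) for _ in range(20)]
print(population[0])
```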

Step 3 Selection. The fitness is calculated first. The selection is then done based on roulette wheel method.
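A roulette wheel draw can be sketched as follows. The text does not state how the fitness is derived from the forecasting error, so the mapping fitness = 1/(1 + error) used here is an assumption, and the error values are hypothetical rather than results of the study.

```python
import random


def roulette_wheel_select(population, fitness, k, rng):
    """Draw k individuals with probability proportional to their fitness."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("fitness values must have a positive sum")
    chosen = []
    for _ in range(k):
        pick = rng.uniform(0.0, total)  # spin the wheel
        cumulative = 0.0
        for individual, fit in zip(population, fitness):
            cumulative += fit
            if pick < cumulative:
                chosen.append(individual)
                break
        else:
            chosen.append(population[-1])  # guard against floating-point rounding
    return chosen


# Hypothetical validation errors of four individuals (not results from the study).
errors = [0.5, 1.0, 2.0, 4.0]
fitness = [1.0 / (1.0 + e) for e in errors]  # assumption: lower error gives higher fitness
rng = random.Random(0)
parents = roulette_wheel_select(["A", "B", "C", "D"], fitness, k=4, rng=rng)
print(parents)
```

Individuals are drawn with replacement, so a fit individual may appear several times among the selected parents.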

Step 4 Crossover. First, a formula of multi-level probability is proposed to maintain the stability and enhance the diversity of the population, as shown in Eq. (8), where p is the initial probability. Based on this formula, two individuals are randomly selected. Several positions are then randomly selected for NUHL and for feature selection separately. If the crossover factor CF is met, the values at these positions are exchanged. This procedure is described in Algorithm 1. Moreover, Fig. 3 shows the crossover of two individuals with \( C_{h} = C_{s} = 1,2,3 \).

$$ LP\left( l \right) = \left\{ {\begin{array}{*{20}l} p \hfill & {l = 1} \hfill \\ {\left( {1 - p} \right) \times \left( {2/3} \right)} \hfill & {l = 2} \hfill \\ {\left( {1 - p} \right) \times \left( {1/3} \right) \times \left( {2/3} \right)} \hfill & {l = 3} \hfill \\ {\left( {1 - p} \right) \times \left( {1/3} \right)^{2} \times \left( {2/3} \right)} \hfill & {l = 4} \hfill \\ { \cdots } \hfill & {} \hfill \\ {\left( {1 - p} \right) \times \left( {1/3} \right)^{n - 3} \times \left( {2/3} \right)} \hfill & {l = n - 1} \hfill \\ {\left( {1 - p} \right) \times \left( {1/3} \right)^{n - 2} } \hfill & {l = n} \hfill \\ \end{array} } \right. $$
(8)
Fig. 3 Crossover

figure a
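The level probabilities of Eq. (8) and one possible reading of the crossover can be sketched as below. Algorithm 1 is not reproduced in the text, so several points are assumptions for illustration only: the level drawn from Eq. (8) is taken as the number of positions exchanged, the cap of 3 levels is applied to both parts of the individual, a single level is given probability 1, and individuals are stored as dicts with the keys `nuhl` and `features`.

```python
import random


def level_probabilities(p, n_levels):
    """Return [LP(1), ..., LP(n_levels)] following Eq. (8)."""
    if n_levels == 1:
        return [1.0]  # assumption: a single level takes all the probability mass
    probs = [p]
    for level in range(2, n_levels):
        probs.append((1 - p) * (1 / 3) ** (level - 2) * (2 / 3))
    probs.append((1 - p) * (1 / 3) ** (n_levels - 2))
    return probs


def crossover(parent_a, parent_b, p, cf, rng, max_level=3):
    """Exchange a randomly chosen number of positions in each part of two individuals."""
    child_a = {key: list(genes) for key, genes in parent_a.items()}
    child_b = {key: list(genes) for key, genes in parent_b.items()}
    if rng.random() >= cf:  # crossover factor not met: return unchanged copies
        return child_a, child_b
    for key in ("nuhl", "features"):
        n_levels = min(len(child_a[key]), max_level)
        weights = level_probabilities(p, n_levels)
        level = rng.choices(range(1, n_levels + 1), weights=weights)[0]
        for pos in rng.sample(range(len(child_a[key])), level):
            child_a[key][pos], child_b[key][pos] = child_b[key][pos], child_a[key][pos]
    return child_a, child_b


rng = random.Random(0)
parent_a = {"nuhl": [5, 6, 7, 8], "features": [1, 1, 1, 0, 0, 0]}
parent_b = {"nuhl": [15, 16, 17, 18], "features": [0, 0, 0, 1, 1, 1]}
child_a, child_b = crossover(parent_a, parent_b, p=0.6, cf=0.9, rng=rng)
print(level_probabilities(0.6, 3))
print(child_a, child_b)
```

For any number of levels n ≥ 2, the terms of Eq. (8) sum to one, since the geometric part adds up to (1 − p)(1 − (1/3)^{n−2}) and the last term contributes the remaining (1 − p)(1/3)^{n−2}. With p = 0.6, the single-position exchange is therefore the most likely outcome, while larger exchanges remain possible.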

Step 5 Mutation. For an individual, one position is selected for NUHL and one for feature selection. If the mutation factor MF is met, new values are generated randomly at these two positions. Figure 4 shows the mutation operation of two individuals.

Fig. 4 Mutation
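The mutation step can be sketched in the same spirit. Interpreting "reaching the MF" as a uniform random number falling below MF, and storing an individual as a dict with the keys `nuhl` and `features`, are assumptions made for this example.

```python
import random


def mutate(individual, nuhl_bounds, mf, rng):
    """Redraw one NUHL value and one feature bit when the mutation factor is met."""
    mutant = {
        "nuhl": list(individual["nuhl"]),
        "features": list(individual["features"]),
    }
    if rng.random() < mf:
        low, high = nuhl_bounds
        pos = rng.randrange(len(mutant["nuhl"]))
        mutant["nuhl"][pos] = rng.randint(low, high)  # new NUHL within the bounds
        pos = rng.randrange(len(mutant["features"]))
        mutant["features"][pos] = rng.randint(0, 1)   # new selection flag
    return mutant


rng = random.Random(0)
parent = {"nuhl": [5, 10, 15, 20], "features": [1, 0, 1, 0, 1, 0]}
child = mutate(parent, nuhl_bounds=(5, 20), mf=0.2, rng=rng)
print(child)
```

Because the new values are drawn at random, a mutation may occasionally reproduce the old value and leave the individual unchanged.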

Step 6 Calculate the fitness of the offspring population. If the iteration number reaches Gmax, return the optimal individual; otherwise, set G = G + 1 and return to Step 3.

Step 7 The LSTM with the optimal individual is tested by the test data.

A flowchart showing GA-LSTM is shown in Fig. 5.

Fig. 5 The flowchart of GA-LSTM

4 Experiment and discussion

To test the performance of the GA-LSTM in predicting housing prices, results are also obtained by long short-term memory with the differential evolution algorithm (DELSTM) (Peng et al. 2018), a back propagation neural network (BPNN) and support vector regression (SVR) for comparison. In this section, all algorithms are coded in Python (version 3.6). LSTM and BPNN are based on Keras (version 2.2.2), a Python deep learning library, with TensorFlow (version 1.11) as the backend. Meanwhile, SVR is based on scikit-learn (version 0.19.1). All experiments are conducted on a personal computer with an Intel® Core i7-6700HQ 2.6 GHz CPU, 8 GB RAM and the Windows 10 operating system.

4.1 Parameters setting

The values of the corresponding parameters have a significant influence on the performance of GA-LSTM, DELSTM, BPNN and SVR. For the proposed GA-LSTM, the parameters of GA are set as follows: the probability of crossover is 0.9, the probability of mutation is 0.2, p is 0.6, the level of NHL is \( C_{h} \le { \hbox{min} }\left\{ {{\text{number of NHL}},3} \right\} \), the level of feature selection is \( C_{s} \ge { \hbox{min} }\left\{ {{\text{number of features}},3} \right\} \), the population size is 20, and the iteration number is 10. Meanwhile, the parameters of LSTM are set as follows: the bound of NUHL is [5, 20], the training number is 100, the batch size is 5, and the number of hidden layers (NHL) is {1, 2, 3, 4, 5, 10, 15, 20}. For the DELSTM. For BPNN, nine combinations are tested: the NUHL is {5, 10, 20} and the training number is {250, 500, 1000}. For SVR, the default parameters in scikit-learn are used. In addition, the look back is set to 1 for the above algorithms.
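The look back setting determines how the monthly series is turned into supervised samples. A minimal sketch is given below, assuming a one-step-ahead target (the forecast horizon is not stated in the text) and using a synthetic 83 × 8 array in place of the real features; `make_windows` and `target_col` are hypothetical names.

```python
import numpy as np


def make_windows(series, target_col, look_back):
    """Pair the previous look_back rows with the next value of the target column."""
    series = np.asarray(series, dtype=float)
    inputs, targets = [], []
    for t in range(look_back, len(series)):
        inputs.append(series[t - look_back:t])   # shape (look_back, n_features)
        targets.append(series[t, target_col])    # assumption: one-step-ahead target
    return np.array(inputs), np.array(targets)


data = np.arange(83 * 8, dtype=float).reshape(83, 8)  # synthetic stand-in data
X, y = make_windows(data, target_col=0, look_back=1)
print(X.shape, y.shape)
```

The three-dimensional layout (samples, time steps, features) is the input shape expected by a Keras LSTM layer. With a look back of 1, each sample contains only the previous month, so one sample is lost at the start of the series.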

4.2 Results of the experiment

In the experiments, the first 90% of the dataset is used as the training data and the rest as the test data. Each feature in the dataset is normalized by Eq. (7). Meanwhile, the root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) are adopted to evaluate the performance of the algorithms.

$$ {\text{RMSE}} = \sqrt {\mathop \sum \limits_{t = 1}^{T} \left( {\bar{y}_{t} - y_{t} } \right)^{2} /T} $$
(9)
$$ {\text{MAE}} = \mathop \sum \limits_{t = 1}^{T} \left| {\bar{y}_{t} - y_{t} } \right|/T $$
(10)
$$ {\text{MAPE}} = \mathop \sum \limits_{t = 1}^{T} \left| {\left( {\bar{y}_{t} - y_{t} } \right)/y_{t} } \right|/T $$
(11)

T is the length of the test data; \( \bar{y}_{t} \) and \( y_{t} \) refer to the forecast value and the real value of the test data, respectively.
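The three error measures can be computed as sketched below, using their standard definitions, in which only the RMSE takes a square root. The sample values are hypothetical and serve only to show the arithmetic.

```python
import math


def rmse(forecast, actual):
    """Root mean square error."""
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))


def mae(forecast, actual):
    """Mean absolute error."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def mape(forecast, actual):
    """Mean absolute percentage error as a fraction; multiply by 100 for percent."""
    return sum(abs((f - a) / a) for f, a in zip(forecast, actual)) / len(actual)


# Hypothetical prices, not data from the study.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
print(rmse(forecast, actual), mae(forecast, actual), mape(forecast, actual))
```

In this example the absolute errors are 10, 10 and 0, so the MAE is 20/3, the MAPE is (0.10 + 0.05 + 0)/3 = 0.05, and the RMSE is the square root of 200/3. The RMSE is never smaller than the MAE, because squaring gives more weight to large errors.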

The results of the GA-LSTM, DELSTM, BPNN and SVR are presented in Tables 2, 3 and 4 and Fig. 6. Note that the best result of BPNN is chosen by testing the various combinations of parameters discussed in Sect. 4.1.

Table 2 Results of the GA-LSTM, DELSTM, BPNN and SVR
Table 3 Solutions of the GA-LSTM with different NHL
Table 4 Results of the proposed GA-LSTM and the basic GA-LSTM

By scrutinizing the result, the following conclusions can be obtained:

(a) As shown in Table 2 and Fig. 6, all four machine learning methods can establish housing price models with acceptable results. However, the GA-LSTM performs better than DELSTM, BPNN and SVR. For the RMSE, the best result of GA-LSTM is 41, while the results of DELSTM, BPNN and SVR are 80, 1702 and 1818, respectively. For the MAPE, the best result of GA-LSTM is 0.06%, while the best results of BPNN and SVR are 2.42% and 2.35%, respectively. GA-LSTM also achieves the best MAE. Moreover, when the NHL is less than 10, the results of GA-LSTM are always better than those of both BPNN and SVR.

Fig. 6 Fitting curves of the housing price with different NHL

(b) With increasing NHL, the performance of GA-LSTM first improves and then deteriorates. In the numerical examples, GA-LSTM gives the best RMSE, MAE and MAPE when the NHL equals 3.

(c) In GA-LSTM, only a few of the original eight features are selected to achieve the best result. As shown in Table 3, no more than 3 features are selected when the NHL is no more than 10.

(d) The proposed GA-LSTM performs better than the basic GA-LSTM (BGA-LSTM) with single-point crossover. As shown in Table 4, the MAPE of BGA-LSTM is much larger than that of GA-LSTM.

4.3 Discussion

The above results indicate that the proposed GA-LSTM approach can successfully predict the housing price of a city in China. Compared with traditional methods, this approach has several advantages. First, it can achieve a good result because of a better feature selection process. It is known that the housing price can be affected by many factors, and the establishment of traditional models is usually hampered by the selection of variables, since these variables are usually complex, inconsistent through time and not well integrated with one another. In comparison, the proposed method can automatically and dynamically select appropriate features by adopting a modified GA, without having to deal with the problems that traditional models always face. Moreover, a satisfactory result can be achieved with only a small number of features (eight in our study) and limited samples.

However, the proposed model has its limitations. First, the GA-LSTM approach can achieve a good result, but is very time-consuming. Therefore, the efficiency of the model needs to be improved. Second, when the dataset is small, the performance of the model is likely to be weakened. Third, the housing price is modeled for only one city in this study, and more cities should be included to test the applicability of this model. Fourth, only eight features concerning residential land, housing economics and loan interest are considered in the model. In China, policy is an important factor in the housing market, and a better result can be expected when more policy-related features are put into the model.

5 Conclusion

In this study, an LSTM incorporating a modified GA is proposed for predicting the future trend and fluctuation of housing prices of cities in China. Eight features that may influence the housing market are considered and used for training our model. Because China’s housing market has many unique characteristics and is largely affected by policy, the housing price in China is extremely hard to model and predict accurately. However, the results in this manuscript indicate that machine learning methods perform well in modeling the housing price of a city, even with limited features and data. In particular, the proposed GA-LSTM clearly outperforms DELSTM, BPNN, SVR and the basic GA-LSTM. Therefore, this GA-LSTM can be used as an effective tool to assist policymakers as well as investors in monitoring and forecasting the dynamics of the housing market.

In future work, a better housing price model may be obtained in two ways. First, the policies concerning the housing price can be classified and quantified to construct new features for the model. Second, the hyper-parameters can be further optimized by trying various algorithms.