1 Introduction

Revenue management is a component of operations management that focuses on pricing to increase the profits generated from a limited amount of supply chain assets (Dana 2008). Concepts of revenue management are applied successfully across many capacity-constrained service industries for nondurable and perishable products (Berk et al. 2009), such as the airline industry for flight ticket prices (Li and Tang 2012), the car rental industry for car rental prices (Geraghty and Johnson 1997) or the hotel industry for booking prices (Harewood 2006). On the other hand, only a few studies are interested in revenue management for durable and non-replenishable (in the short term) products such as real estate (Wen et al. 2016; Padhi et al. 2015).

Considered in most countries as the largest asset class, real estate plays a major role in social and economic systems. Real estate price fluctuations have direct impacts on the financial system due to banks’ central role as mortgage lenders and the frequent use of real estate as collateral (Koetter and Poghosyan 2010). However, acquiring real estate is a delicate operation that requires a precise and objective estimate of its value beforehand. Since buying a house is the largest financial transaction for most households (Pedersen et al. 2013), knowing the real value of a home is a major advantage: it allows the buyer not only to distinguish between good and bad deals but also to negotiate the price of the property effectively during the transaction. On the seller’s side, a precise estimate of the price of his/her home before it goes on sale reveals its exact market value. As a result, the seller can avoid any unnecessary risk of overestimating or underestimating the sale price. It should be noted that when the sale price is overestimated, it almost surely causes a delay in the sale, while underestimation generates an unnecessary loss of profit for the seller. Additionally, an accurate estimate of house value is essential for an investor willing to diversify his/her portfolio among housing securities and other possible investments (D’Amato et al. 2019). Therefore, it is crucial and greatly beneficial for both sellers and buyers to have tools that facilitate the estimation of real estate values.

Real estate prices are sometimes studied for rental price assessments (Gomes 2009; Gomes and Rangel 2009), but they are mostly analyzed for property price assessments with automated valuation models (Pagourtzi et al. 2003; d’Amato and Kauko 2017; Wang and Li 2019; Valier 2020). Automated valuation models (AVMs) are statistically-based models that use real estate information, such as property characteristics (e.g., age, number of rooms), comparable sales, or price trends, to provide a current estimate of the market value of a specific property. Generally, valuations are required, and they are often carried out by several different players in the marketplace, such as real estate agents, appraisers, assessors, mortgage lenders, brokers, property developers, investors, fund managers, market researchers, analysts, etc. The most commonly used approaches for automated valuation models are based on parametric and nonparametric regression techniques.

The parametric regressions used for automated valuation models are mostly based on hedonic regressions, such as multiple linear regression analysis (Narula et al. 2012). Due to the complexity and the nonlinearity of the real estate price estimation problem (Yu and Wu 2006; Kontrimas and Verikas 2011), various nonparametric regression methods, such as data envelopment analysis (Lins et al. 2005), fuzzy logic (Kuşan et al. 2010) or genetic algorithms (Morano et al. 2018), are also used. In general, machine learning methods are currently among the emerging nonparametric methods most used for automated valuation models (Viriato 2019; Valier 2020). Several studies are interested in empirically comparing the prediction accuracies of machine learning methods versus those of hedonic regression methods. Although a few studies show that some hedonic regressions can provide better results in some specific contexts (Doumpos et al. 2020), machine learning methods generally outperform hedonic regressions in many studies (Valier 2020; Mayer et al. 2018; Pérez-Rave et al. 2019). However, beyond their predictive capacities, these two approaches are often different according to their targets; hedonic regressions are explanatory, interpretable and less volatile models that can successfully address numerous economic, social, environmental and public policy issues, while machine learning models are very often less interpretable (“black box”) (Yacim and Boshoff 2018) and more volatile models (Mayer et al. 2018), but they provide more powerful predictive capacity than hedonic regressions (Din et al. 2001; McCluskey et al. 2013; Mayer et al. 2018). Machine learning models are attractive to all operators who evaluate, manage or trade real estate assets. Investors can use them to evaluate the possible investments or transactions for which they are a party. Similarly, valuation service providers can use them to offer reliable estimates to their clients. In this study, we are primarily interested in the prediction accuracies of the models (e.g., from the point of view of an investor); therefore, we focus on machine learning models. Because their relevance has already been demonstrated in different contexts in the real estate price estimation literature (Isakson 1988; Kontrimas and Verikas 2011; Huang 2019; Lam et al. 2009; Mullainathan and Spiess 2017; Čeh et al. 2018; Kok et al. 2017; McCluskey et al. 2014; Baldominos et al. 2018), we consider the following seven machine learning models in this study: artificial neural networks (multilayer perceptron), ensemble learning (random forest, gradient boosting, adaboost), support vector regression, k-nearest neighbors and linear regression.

The input explanatory variables always play a major role with regard to the relevance of automated parametric or nonparametric models. Several types of explanatory variables are commonly used to estimate the prices of properties, such as the following: physical characteristic variables (e.g., living area, number of rooms), accessibility variables (e.g., proximity to amenities such as schools), neighborhood socioeconomic variables (e.g., local unemployment rates) (Johnson 2003), and environmental variables (e.g., road noise or visibility impact) (Čeh et al. 2018). The heterogeneity of real estate features, combined with the uneven availability of these explanatory variables, makes the price estimation process laborious. However, there is a consensus in the literature on the prime importance of location/spatial variables (e.g., geographic coordinates, accessibility variables or neighborhood variables) when estimating real estate prices (Anselin 2013). Several empirical studies support this argument by handling spatial heterogeneity and spatial dependence (e.g., Basu and Thibodeau 1998; Bourassa et al. 2003; Bitter et al. 2007; Borst and McCluskey 2008; Helbich and Griffith 2016; Gröbel and Thomschke 2018; Doumpos et al. 2020). However, very few of these studies considered machine learning techniques, and none of them consistently evaluated and quantified the relevance of spatial/location attributes across a wide range of machine learning techniques. We tackle this latter point in this paper. Thus, our research question can be summarized as follows:

RQ What would be lost in terms of predictive power for a machine learning-based automated valuation model that fails to integrate location variables?

We study this research question by analyzing the French real estate market, which has so far received little attention. The French housing market is quite tight (Garcia and Alfandari 2018). Citizens invest in real estate to build wealth or to collect additional income. Considering the volume of rentals, the resulting tax savings and the reduced effort required to build savings, the rental real estate market is one of the few investment sectors in France that allows one to build up lasting wealth financed with credit without having an exceptional income. If real estate investment is successful these days, it is partially due to the mechanisms it offers investors for reducing income taxes. For example, the current Pinel law (Footnote 1) provides income tax reductions of up to 21% over 12 years for new real estate. In this context, increasing numbers of people are interested in quickly identifying good opportunities for investing in real estate in France. However, the French real estate market is widely heterogeneous, with several different metropolitan areas. For instance, Paris is the economic and political capital, and its real estate market is particularly tight. This is also the case for other metropolitan areas, such as Nice and Bordeaux, but for different reasons (touristic and bourgeois cities, respectively). To assess their predictive capacities for such different cities, we evaluate the machine learning models for the following nine major metropolitan French areas: Paris, Marseille, Lyon, Toulouse, Lille, Bordeaux, Montpellier, Nice, and Nantes.

Overall, the main contributions of this paper compared to the literature are as follows:

  • A global evaluation and a quantification of the relevance of location/spatial attributes for real estate price estimations using a wide range of machine learning methods are performed. To the best of our knowledge, this is the first study with this specific goal; at a fine-grained level, the location attributes in this study are derived from geocoding processing of the properties’ addresses.

  • The evaluations are performed with the same dataset, thus avoiding the bias that could appear when comparing different methods evaluated on different datasets, as in many literature reviews (e.g., Wang and Li 2019; Valier 2020).

  • The study focuses on the French real estate market, which has so far received little attention. We study 5 years (2015–2019) of real sales data from notarial acts containing 480 055 house and apartment transactions.

  • The machine learning models’ predictive powers are evaluated and compared at a high level of granularity using data from nine different and heterogeneous metropolitan areas.

When the summary results are compared at a high level of granularity, we obtain important differences regarding the models’ predictive powers (beyond 70% differences in precision in some cases) between cities with high standards of living (e.g., Paris, Bordeaux, Nice) and cities with medium standards of living (e.g., Toulouse, Lille, Montpellier). At a low level of granularity, we use geocoding to extract precise geographical location features and add them to the machine learning algorithms’ inputs. We obtain important improvements regarding the models’ forecasting powers (improvements beyond 50% for some forecasting error measures) compared to the models trained without these features. Regarding the machine learning methods, our results reveal that neural networks and the random forest particularly outperform the other methods when geocoding features are not accounted for, while the ensemble learning methods (random forest, adaboost and gradient boosting) perform well when geocoding features are considered.

The rest of this paper is structured as follows: the next section presents some related works, followed by a description and an exploration of the dataset and a presentation of the methods used in our experiments. The subsequent section presents our experiments and the results obtained with and without geocoding processing. The succeeding section provides a discussion of our results, as well as implications and limitations of the study and future research directions. Finally, the last section concludes the paper.

2 Related works

The importance of location in determining housing prices is widely recognized. The key econometric issues include spatial dependence and spatial heterogeneity (Anselin 2013). Spatial dependence exists because nearby properties often have similar structural features (they were often developed at the same time) and share locational amenities (Basu and Thibodeau 1998). Spatial heterogeneity focuses on whether the marginal prices of housing characteristics are constant throughout a metropolitan area or whether they change over space (Bitter et al. 2007). To improve traditional automated valuation models, location or spatial features are widely integrated in parametric and nonparametric methods for modeling spatial heterogeneity or spatial dependence. These methods can be classified into the following four groups: (1) market segmentation (or submarket) methods, (2) trend surface models and spatial expansion methods, (3) spatial regression methods and (4) machine learning methods with spatial attributes. Empirical studies commonly either use spatial methods in comparison with models without spatial features or compare spatial methods with one another.

2.1 Market segmentation (submarket) methods

Submarket or market segmentation methods (Bourassa et al. 1999, 2003, 2010; Goodman and Thibodeau 1998, 2003, 2007) are approaches for dealing with spatial heterogeneity by delineating the housing market into distinct submarkets. Submarkets can be defined as physical geographical areas or noncontiguous groups of dwellings with similar characteristics and/or hedonic prices. Estimates are either performed separately for each submarket or globally by adding spatial indicators, such as dummy variables for submarkets, and performing price estimates for the whole market. The aim is not necessarily to define relatively homogeneous submarkets consisting of substitutable dwellings but rather to segment the market in a way that allows for accurate estimates of house values. For example, (Bourassa et al. 2003) compared a set of spatial submarkets defined by real estate appraisers with a set of non-spatial submarkets created using factor and cluster analysis. They also considered the impacts of adjusting predictions by using the neighboring properties’ residuals. Using data for Auckland, New Zealand, they found that the most accurate predictions are obtained by using a citywide equation with spatial submarket dummy variables and by adjustment with neighboring residuals. The separate submarket equations performed slightly worse or better than the citywide equation, depending on whether the predictions were or were not adjusted for the neighboring residuals, respectively. (Goodman and Thibodeau 2003) compared the predictions for three submarkets with those of a market-wide model from Dallas. The submarket models were defined based on ZIP codes, census tracts, and a hierarchical method described in (Goodman and Thibodeau 1998). They concluded that each of the submarket definitions yielded significantly better results than those of the market-wide model, but none of the submarket definitions dominated the others. (Goodman and Thibodeau 2007) compared spatial submarkets consisting of adjacent census block groups with non-spatial submarkets constructed based on dwelling sizes and prices per square foot. Both submarket methods produced significantly better predictions than the results obtained from the market-wide model, although neither clearly dominated the other.

2.2 Trend surface models and spatial expansion methods

Trend surface models and spatial expansion methods integrate spatial attributes into traditional hedonic regression methods. The principle of a trend surface model is to use a regression function that estimates the property value at any location based on the two coordinates (latitude and longitude) of the location (Clapp 2003; Xu 2008; Orford 2017; Doumpos et al. 2020). The spatial expansion method allows house characteristics to vary over space in a traditional hedonic regression framework through the interaction of house characteristics with locational information (Thériault et al. 2003; Fik et al. 2003; Bitter et al. 2007). For example, (Thériault et al. 2003) used an expansion model that allows housing attributes to vary based on both accessibility and neighborhood attributes. In a study of Tucson, (Fik et al. 2003) specified a fully interactive expansion model employing a second-order polynomial expansion of housing attributes (properties’ geographical coordinates) and dummy variables representing submarkets. The interactions between the absolute location variables and structural attributes allowed the coefficients to vary over space. This model outperformed the stationary model, and its explanatory power was far superior. Several spatial interactive terms were significant, indicating the presence of spatial heterogeneity in the prices of these attributes.

2.3 Spatial regression methods

Spatial regression models have been developed to make estimations and predictions about space by explicitly modeling the spatial correlations among observations in different locations. For automated valuation models, the most commonly used spatial regressions include methods such as geographically weighted regressions (GWRs) (Bitter et al. 2007; Borst and McCluskey 2008; Lockwood and Rossini 2011; McCluskey et al. 2013; Bidanset et al. 2017) and simultaneous autoregressive (SAR) models or conditional autoregressive (CAR) models (Bourassa et al. 2007). The GWR method is a local modeling approach that explicitly allows parameter estimates to vary over space. Rather than specifying a single model to characterize the entire housing market, GWR estimates a separate model for each sale point and weights the observations by their distance to this point, thus allowing for unique marginal-price estimates at each location. This method is appealing because it mimics, to some extent, the “sales comparison” approach to valuation used by appraisers in that only sales within proximity to the subject property are considered, and price adjustments are made based on the differences in the characteristics within this subset of properties. (Bitter et al. 2007; Helbich and Griffith 2016) found that GWR outperforms many standard hedonic regressions and spatial expansion methods. (Borst and McCluskey 2008; McCluskey and Borst 2011) applied GWR successfully to identify the existence of housing submarkets. Their findings demonstrated an increase in predictive accuracy when using the GWR approach across three large urban areas in the USA. These findings seemingly indicated that the local variation explicitly addresses spatial dependency as a continuous function, which led to the analysis of the relationships between properties, depending on the distance from one to another. In the case of lattice models, such as the SAR and CAR models, locations are restricted to the discrete set of points represented by the data used to estimate the model. Using data for Auckland, New Zealand, (Bourassa et al. 2007) compared a simple hedonic regression model that included submarket dummy variables with geostatistical (similar to GWR) and lattice (CAR and SAR) models. They showed that the lattice methods performed poorly in comparison with the geostatistical approaches or even in comparison with a simple hedonic regression model that ignores spatial dependence; however, they did not use the neighboring properties' residuals or the spatial weight matrix to improve the prediction accuracy. Their best results were obtained by incorporating submarket variables into a geostatistical framework.

2.4 Machine learning methods with spatial attributes

The last group includes a few studies that integrated machine learning methods with spatial attributes and compared them with some of the previous methods or with non-spatial methods (McCluskey et al. 2013; Mayer et al. 2018; Čeh et al. 2018; Doumpos et al. 2020). (McCluskey et al. 2013) assessed and analyzed a number of geostatistical approaches relative to an artificial neural network (ANN) model and the traditional linear hedonic model. The findings demonstrated that ANNs can perform very well in terms of predictive power and, therefore, valuation accuracy, outperforming traditional multiple regression analysis and approaching the performances of spatially weighted regression approaches. The results of (Doumpos et al. 2020) demonstrated that linear regression models developed with a weighted spatial (local) scheme provide the best results, outperforming the machine learning approaches and models that do not consider spatial effects. However, the two machine learning approaches in their study (random forest and Gaussian process regression) provided the best results in a global setting but did not benefit much from implementation in a local context; this could be justified by the fact that, in a local context with only a few transactions, there are not enough data for machine learning techniques to train optimal models. This study also evaluated only two machine learning techniques. Other studies, such as (Mayer et al. 2018; Čeh et al. 2018), clearly demonstrated the relevance of machine learning techniques compared to those of some other methods in a spatial context. (Mayer et al. 2018) compared three variants of hedonic linear regressions with three machine learning techniques (random forest, gradient boost and artificial neural networks). Their results showed that machine learning techniques (gradient boost in particular) are more accurate than linear models in terms of prediction accuracy, even if linear models (robust regression, in particular) are less volatile. (Čeh et al. 2018) studied the predictive performance of the random forest machine learning technique in comparison with commonly used hedonic models based on multiple regressions for the prediction of apartment prices. Their outputs revealed that the random forest method obtained significantly better prediction results than those of the hedonic models.

We can clearly observe that all these studies that considered spatial heterogeneity or spatial dependence mostly integrated spatial features in traditional hedonic linear models (the first three groups), and only a few of them also evaluated machine learning techniques (the last group). When machine learning techniques are also evaluated, they universally tend to provide better results in terms of predictive power than hedonic models. This is the reason why we specifically focus on these techniques in this paper. However, this study differs from the literature in many aspects, as follows: (1) in a spatial/location context with geocoded location attributes, we evaluate a wider range of machine learning techniques that have already shown their relevance for automated valuation models in different contexts; (2) this evaluation is performed with the same dataset for each model, thus avoiding the bias that could appear when comparing different methods evaluated on different datasets; (3) we analyze the French real estate market, which has received little attention thus far; and (4) we compare the results at high and low location granularity levels by comparing, for instance, the models’ predictive powers on nine different and heterogeneous metropolitan areas in France.

3 Data and methods

3.1 Data

The raw dataset for this study is an open source dataset provided by the French government since April 2019 with an open license. This dataset, titled “Demands of land values”, is published and produced by the French general directorate of public finances (Footnote 2). It provides data on real estate transactions completed during the last five years in metropolitan territories and the DOM-TOM (French overseas departments and territories), except the Alsace-Moselle and Mayotte departments. The data are from notarial acts and cadastral information. The data files are updated every six months, in April and October. Each update removes and then replaces all previously published files. Data files (with the .csv extension) are provided on a yearly basis and are approximately 4 GB in size. In this paper, we study real estate transactions from the following 5 years: 2015, 2016, 2017, 2018 and the first three quarters of 2019. These transactions represent approximately 18 GB of data and contain almost all the real estate transactions for all French cities. However, given that the most important portion of the transactions takes place in the largest cities, we choose to restrict the study to the 10 largest French cities in terms of population, which are as follows: Paris, Marseille, Lyon, Toulouse, Nice, Nantes, Montpellier, Strasbourg, Bordeaux and Lille (Fig. 1). Due to political, economic and geographic factors, the real estate markets are very different in each of these cities. For example, the price per square meter is much higher in Paris (the French economic and political capital) than in other cities (regional cities). Our goal here is to go beyond global real estate estimation based on prices per square meter and provide precise and automatic estimations of real estate in each of these cities with the use of machine learning methods. As the city of Strasbourg is in the Alsace-Moselle department, transactions for this city are not provided in the dataset; therefore, our study focuses on the 9 other largest French cities.

3.1.1 Variables

For each transaction in the dataset, 43 variables are available. However, a significant number of these variables refer to technical data about notarial acts and are not relevant for our study. The variables that could be related to real estate price estimation are listed in Tables 1 and 2.

Table 1 List of variables used

The descriptive statistics of these variables for each city are provided in Table 2.

Table 2 Variable descriptive statistics per city

3.1.2 Repartition

Number of transactions per city, year and quarter

Figure 2 shows how the transactions are distributed per city, year and quarter. Figure 2A shows that the distribution of the numbers of transactions per city is generally consistent with the distribution of the populations in these cities (Fig. 1). However, we can notice that Toulouse and Bordeaux recorded many more transactions relative to their population, and this can be perceived as a good indicator for real estate development in these two cities. Figure 2B shows that there was a growth in the number of real estate transactions from 2015 to 2017, but this trend seems to have reversed since 2018. Figure 2C and D clearly show that more overall transactions are made in the last quarter of the year than in each of the other quarters.

Fig. 1
figure 1

Studied cities (except Strasbourg)

Fig. 2
figure 2

Transactions per city, year and quarter

Fig. 3
figure 3

Transactions per city and year

Figure 3 shows how the transactions are distributed per year for each city. As in the previous figure, the trends are almost the same for all cities. Only the city of Lille registers continuous and uninterrupted growth from 2015 to 2018. We cannot draw any conclusions for 2019, as the last quarter is not included in the data for this year.

Fig. 4
figure 4

Transactions per sale type, residence type and city

Number of transactions per city, sale type and residence type

The distributions of the transactions per sale type (nature of mutation) and residence type are provided in Fig. 4 below. Figure 4A shows that almost all the transactions were of the sale or sale before completion types. The adjudication, exchanges, land to build and expropriation types are marginal. This same behavior is also observed even when examining the repartition per city (Fig. 4C and D). Figure 4B shows that most transactions concerned apartments, followed by outbuildings, industrial locations and houses, which are also significant. Because we are only interested in real estate for residential properties, we only use the transactions for houses and apartments as residence types in our analysis. To remove any side effects due to the skewness of the distribution per sale type, we also only keep the transactions of the sales and sales before completion types in our analysis.

Fig. 5
figure 5

Price distribution per city and residence type

Price distribution per city

Since our target variable is the price of each piece of real estate, Fig. 5 below shows the price distributions per city. Figure 5A clearly shows that Paris is by far the most expensive city for real estate in France. Overall, the price distributions per city are relatively consistent with respect to the distributions of their populations (Fig. 1). However, we observe a gap with regard to Bordeaux and Nice, which appear to be particularly expensive compared to their sizes in terms of population. Conversely, Marseille is less expensive relative to its size in terms of population. We also observe that there are many outliers for all cities that have very high prices; these certainly represent luxury real estate. To avoid side effects, we remove these outliers in our analysis to keep only the most common real estate transactions, which represent the majority of the population. Figure 5B shows the price distributions per residence type (houses and apartments). The price distribution trends per city remain the same for apartments and houses. However, houses are obviously more expensive than apartments, except in Lille. The price difference between houses and apartments is also much more pronounced in Paris than in other cities.

Fig. 6
figure 6

Overall experimental process

3.2 Methods

Consisting of a set of well-established methods, machine learning provides algorithms for computers to discover knowledge and make decisions by first learning from given data. Machine learning techniques are becoming increasingly popular, even within the fields of production, operations management and manufacturing (Choi et al. 2018; Shin and Park 2000). In those domains, machine learning algorithms are routinely used to search for new patterns in data or to generate predictive models. Subsequently, such patterns are used to improve future operational decisions (Cohen 2018; Chen et al. 2020; Kusiak 2020). This success can be explained by different factors: the improvement of computational processing that makes it cheaper and more powerful than before; affordable data storage solutions; the availability of massive and diverse sources of information; an ever-increasing demand for data-driven decision making; and a need for automatization of the decision processes. Machine learning algorithms have good reputations in terms of predictive power (Wu 1997). Using very few assumptions regarding the input and output variables and applying complex mathematical calculations, they automatically produce models that are not only able to analyze large and complex datasets but also to produce fast and accurate results (Akyildirim et al. 2020). Machine learning methods are also increasingly used for automated valuation models. When comparing machine learning methods for automated valuation models, most existing studies show that several different methods can perform well depending on each context or dataset used (Valier 2020). Most of these methods include artificial neural networks (McCluskey et al. 2013; Yacim and Boshoff 2018; Abidoye et al. 2019); ensemble learning methods, such as random forest, gradient boosting and adaptive boosting (e.g., McCluskey et al. 2014; Čeh et al. 2018; Mullainathan and Spiess 2017; Kok et al. 2017; Mayer et al. 2018; Baldominos et al. 2018); k-nearest neighbors (e.g., Isakson 1988; Borde et al. 2017); and support vector regression (e.g., Lam et al. 2009; Kontrimas and Verikas 2011; Huang 2019). Thus, for our specific study, we compare all these methods but in the same context and with the same dataset. We also use the linear regression model, which can serve as the baseline model. In the next subsections, we present an overview of these selected techniques.

3.2.1 Artificial neural networks

Inspired by biological neural networks, artificial neural networks mimic the human neural network and are composed of artificial neurons that are also called nodes. The neurons are connected to each other through edges. The latter are responsible for the transmission of signals from one node to another. A signal that propagates through the network can be associated with a real number, and each node is associated with a threshold, above which the signal is assumed to be significant. Additionally, a weight is assigned to each edge that measures the importance of the considered connection. The node values and edge weights are combined to define the strength of the signal. The intuition behind neural networks is that many neurons can be joined together to carry out complex computations. The structure of a neural network can be described as a graph whose nodes are neurons and whose edges are links between the output of some neuron to the input of another neuron (Shalev-Shwartz and Ben-David 2014; Anthony and Bartlett 2009). The network is organized through the following three different types of layers: the input layer, which receives the external data; the hidden layer, which is also called the black box; and the output layer, which produces the result. To be more precise, each node receives signals from other nodes (approximated by numbers); to compute the output of a specific node, the incoming signals are combined with the weights of all the input’s edges and the node bias is adjusted using a transfer function. This process is applied to all nodes until the final estimated output is obtained. The final output is compared to the true value, and an observed error is computed. Then, the edge weights and node biases are adjusted through the network, and the output values are recomputed until a minimal error is obtained. Since an artificial neural network is a mathematical model with approximation functions, it has the advantage of being able to work with any data that can be made numeric. Artificial neural networks perform well with nonlinear data and large numbers of inputs. This type of model can be trained with any numbers of inputs and layers, and the predictions are fast. It is among the most powerful modeling devices in machine learning and is currently the preferred approach for addressing complex machine learning problems. Its flexibility draws from its ability to entwine many telescoping layers of nonlinear predictor interactions. However, this method is often said to be a black box with a computationally expensive and time-consuming training step. Additionally, despite its effective learning capability, a major drawback of an artificial neural network is the unreadability of the learned knowledge, i.e., the lack of an explanatory capability (Shigaki and Narazaki 1999).
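To make the node-level computation above concrete, the following minimal sketch (Python with numpy) performs one forward pass through a small two-layer network; the dimensions, weights and ReLU activation are illustrative assumptions and do not correspond to the configuration trained later in the paper.

```python
import numpy as np

def relu(z):
    # Transfer function: keeps only signals above the zero threshold
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """One forward pass: combine incoming signals with edge weights,
    add the node biases and apply the transfer function layer by layer."""
    hidden = relu(W1 @ x + b1)   # hidden layer ("black box")
    return W2 @ hidden + b2      # output layer: estimated price

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 standardized input features
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # 3 hidden nodes
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # single output node
print(forward(x, W1, b1, W2, b2))
```

During training, the error between this output and the observed price would be propagated back through the network to adjust the weights and biases, as described above.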

3.2.2 Random forest

The random forest algorithm is based on decision trees and can be applied for classification or regression exercises (Breiman 2001; Shalev-Shwartz and Ben-David 2014). Let us assume that we want to use a training set \(S=\left\{\left({x}_{1},{y}_{1}\right),\dots ,\left({x}_{N},{y}_{N}\right)\right\}\) to construct a predictor for the output variable \(y\) using the inputs in \(x\). The first step of the random forest algorithm involves selecting, with replacement, a random sample \({S}_{1}=\left\{\left({x}_{11},{y}_{11}\right),\dots ,\left({x}_{n1},{y}_{n1}\right)\right\}\) of \(n\) observations from \(S\). As a second step, from the sample \({S}_{1}\), we construct a decision tree \({T}_{1}\) with one additional randomness attribute, as follows: during the construction of each node, from the set of \(P\) attributes (or inputs), only \(p\) attributes are randomly selected and used to split the node based on the information gain or the variance reduction (in the case of regression trees). At the end of the process, we obtain a decision tree. The process is then repeated \(m\) times, leading to \(m\) decision trees \({T}_{1}, \dots , {T}_{m}\). Given an unseen observation of inputs \(x\), the prediction of the output \(y\) is obtained by averaging the predictions from all individual regression trees \({T}_{1}, \dots , {T}_{m}\). The random forest has the advantage of reducing the overfitting problem and the variance in the decision trees. Thus, there is an improvement in the accuracy of the algorithm. Unlike curve-based algorithms, the advantages of random forest are that it is invariant to monotonic transformations of the predictors; it naturally accommodates categorical and numerical data in the same model; it can approximate severe nonlinearities; and a tree of depth L can capture (L − 1)-way interactions (Gu et al. 2020). The flexibility of random forests is also their limitation; this method is less interpretable than an individual decision tree, has a high computational cost and uses a great deal of memory. Consequently, its predictions can be slow.
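The procedure described above (bootstrap samples, random attribute subsets at each split, averaging of the trees' predictions) is what scikit-learn's RandomForestRegressor implements; the synthetic data and hyperparameter values in this sketch are placeholders, not the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a prepared city dataset (17 encoded features).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 17))
y = rng.normal(loc=200_000, scale=50_000, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,      # m bootstrap trees T_1, ..., T_m
    max_features="sqrt",   # p attributes drawn at random for each split
    random_state=0,
)
forest.fit(X_train, y_train)

# The prediction for an unseen x is the average of the individual trees' outputs.
y_pred = forest.predict(X_test)
```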

3.2.3 Adaptive boosting and gradient boosting

Adaptive boosting (henceforth, adaboost) is a learning technique that aims to increase the efficiency of a given learning system (Freund and Schapire 1995). The theory behind boosting suggests that many weak learners may, as an ensemble, comprise a single strong learner with greater stability than that of a single complex tree. A decision tree is most often considered as the base estimator. It uses the notion of recursive partitioning: at each step, by searching for the best split across all predictors and all their values, the sample is partitioned into subsamples to create the most homogeneous subsamples in terms of the outcome. To generate the full‐grown tree, the concept of node impurity is used (Shmueli and Yahav 2018). In adaboost, the decision tree is trained in several successive stages on random samples formed by assigning significant weights to individuals who are difficult to classify. At each step, a classifier is produced. The final classifier is a linear combination of step classifiers weighted by coefficients related to their performances. Additionally, adaboost can be interpreted as an optimization algorithm on an exponential cost function. Gradient boosting is a generalized boosting technique since it allows for optimization with other differentiable loss functions. During a prediction exercise, once the models have been trained, adaboost and gradient boosting can achieve very good accuracy levels with modest memory and runtime requirements. They are designed to deal with complex and high‐dimensional data (Cui et al. 2018). Nevertheless, these methods suffer from difficulties in terms of their interpretability. Additionally, they perform poorly when the feature space has thousands of features with sparse values.
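As an illustration, both boosting variants are available in scikit-learn; the hyperparameter values below are arbitrary placeholders, and the synthetic training data simply mimic the shape of a prepared city dataset.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

# Placeholder training data (same shape as the prepared city datasets).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(750, 17))
y_train = rng.normal(loc=200_000, scale=50_000, size=750)

# Adaboost: successive shallow regression trees trained on reweighted samples,
# combined linearly with weights reflecting each stage's performance.
ada = AdaBoostRegressor(n_estimators=200, learning_rate=0.05,
                        random_state=0).fit(X_train, y_train)

# Gradient boosting: the same staged idea, but each new tree fits the gradient
# of a differentiable loss function (squared error by default).
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=3, random_state=0).fit(X_train, y_train)
```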

3.2.4 K-nearest neighbors

Based on local approximation, the k-nearest neighbors algorithm (henceforth, KNN) is a nonparametric machine learning algorithm that can be used for classification and regression (Cover and Hart 1967; Devroye et al. 1996). The intuition behind this technique is the following: let \(S=\left\{\left({x}_{1},{y}_{1}\right),\dots ,\left({x}_{N},{y}_{N}\right)\right\}\) be a sample of N observations, where \({x}_{i}\) is the set of attributes for individual \(i\) and \({y}_{i}\) is the outcome variable. Let us consider a new individual with coordinates \((x,y)\), whose attributes are known and stored in a vector \(x\). We are interested in predicting the value of the outcome variable \(y\). From the set of points \(\left({x}_{1},\dots ,{x}_{N},x\right)\), using a distance metric, this algorithm observes the k nearest neighbors of \(x\). Let us call these neighbors \(\left({x}_{(1)},\dots ,{x}_{(K)}\right)\). Depending on the nature of the output variable (categorical or numeric), \(y\) is approximated either by the mode or the average of \(\left({y}_{(1)},\dots ,{y}_{(K)}\right)\). In the regression case, the use of a weighted average can provide optimal results. The weight allocated to the output \({y}_{(k)}\) can be the inverse of the distance between \({x}_{(k)}\) and \(x\). This procedure is described under the assumption that the number of neighbors k to consider is known. However, this is often not the case. Nevertheless, this number can be approximated using the root mean square error (RMSE); the optimal value for k is the one that minimizes the RMSE. Since it does not derive any discriminative function from the training data, the KNN has the advantage of being much faster than other algorithms that require training. Because of the absence of a training step, new data can be added seamlessly without impacting the accuracy of the algorithm. This method is very easy to implement since only two parameters are required for its implementation, i.e., the value of k and the distance function. However, the KNN performs poorly in high-dimensional setups (with a large number of individuals or an important number of variables or dimensions). In that case, the performance of the algorithm can be degraded by the cost of computing the distance between a new point and the massive number of existing points.
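A minimal sketch of the distance-weighted KNN regression described above, including the selection of k by minimizing the RMSE; all values are illustrative, and a simple hold-out set stands in for the cross validation used later in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Placeholder data; the hold-out split plays the role of a validation fold.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 17))
y = rng.normal(loc=200_000, scale=50_000, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_rmse = None, np.inf
for k in (3, 5, 10, 20):
    # weights="distance": each neighbor's price is weighted by 1 / distance
    knn = KNeighborsRegressor(n_neighbors=k, weights="distance").fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, knn.predict(X_val)))
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse

print(f"k minimizing the RMSE: {best_k}")
```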

3.2.5 Linear regression

A linear regression model is used when we want to explain a dependent variable \(y\), which is also called the output, target or outcome variable, by a set of \(p\) attributes stored in the variables \(\left({x}_{1},\dots ,{x}_{p}\right)\), which are also called explanatory, input or independent variables (Stigler 1981). The following equation summarizes the link between the outcome and input variables:

$$y={\beta }_{0}+{\beta }_{1}{x}_{1}+\dots +{\beta }_{p}{x}_{p}+\varepsilon $$

where \(\varepsilon \) is an error term, assumed to follow a normal distribution with zero mean and constant variance, as follows:

$$\varepsilon \sim N(0;{\sigma }^{2})$$

The coefficients of this model are estimated using the minimization of the sum of the squared errors and are given by the following formula:

$$\widehat{\beta }={\left({X}^{\prime}X\right)}^{-1}{X}^{\prime}y$$

where \(\widehat{\beta }={\left({\widehat{\beta }}_{0},{\widehat{\beta }}_{1},\dots ,{\widehat{\beta }}_{p}\right)}^{\prime}\) and \(X=\left(1,{x}_{1},\dots ,{x}_{p}\right)\).

Given a set of input attributes \(\left({x}_{(1)},\dots ,{x}_{(p)}\right)\), the predicted output \(\widehat{y}\) is given by the following:

$$\widehat{y}={\widehat{\beta }}_{0}+{\widehat{\beta }}_{1}{x}_{(1)}+\dots +{\widehat{\beta }}_{p}{x}_{(p)}$$

Linear regression models have the advantage of being easy to implement and interpret, and they are also efficient to train. They tend to demand low computation costs. Hence, they are often used in large‐scale prediction tasks (Cui et al. 2018). Their major limitation is the linearity assumption between the outcome variable and the explanatory variables. In real applications, the data are rarely linearly separable. This method is very sensitive to outliers.
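The estimator and prediction formulas above can be reproduced in a few lines of numpy; the synthetic data and coefficient values below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
x = rng.normal(size=(n, p))
beta_true = np.array([5.0, 2.0, -1.0, 0.5])        # beta_0, ..., beta_p
y = beta_true[0] + x @ beta_true[1:] + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x])               # X = (1, x_1, ..., x_p)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'y

x_new = rng.normal(size=p)                         # unseen input attributes
y_hat = beta_hat[0] + x_new @ beta_hat[1:]         # predicted output
```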

3.2.6 Support vector machine (SVM)

The SVM is a linear supervised classifier. Using a hyperplane to separate the data, it is trained on in-sample items to learn to classify out-of-sample items solely based on the values they show for their features (Lolli et al. 2019). To find the frontier between the categories to be separated, an SVM uses a training sample made of points whose categories are known. The frontier is obtained by searching for the hyperplane that separates the training sample while maximizing the distance between the training points and this hyperplane (this is called maximizing the margin). The training points closest to the border are called support vectors. However, the training points may not be linearly separable, in which case there is no hyperplane capable of separating the data. In this situation, we search for a transformation of the initial data that allows separation. In general, the training values are projected into a large dimensional space, where it becomes possible to find a linear separator (Shalev-Shwartz and Ben-David 2014; Cortes and Vapnik 1995; Boser et al. 1992). When the output variable being predicted is continuous-valued, the classification concept of the SVM can be generalized to the regression case. This is called support vector regression (SVR). The goal of SVR is to find a function that presents a margin of tolerance \(\varepsilon \) from the target values while being as flat as possible, that is, to find the narrowest tube centered around the surface while minimizing the distance between the predicted and true outputs. Mathematically, the problem resolved by SVR during the training process is as follows:

$$\left\{\begin{array}{l}Min \frac{1}{2}{\Vert w\Vert }^{2}\\ s.t \left|{y}_{i}-\langle w,{x}_{i}\rangle -b\right|\le \varepsilon , \forall i\end{array}\right.$$

where \({y}_{i},{x}_{i}\), for \(i=1,\dots ,n\), are the output and input variables from the training set, respectively, \(\langle w,{x}_{i}\rangle +b\) is the predicted value to be compared to the target value \({y}_{i}\), and \(\varepsilon \) is a threshold such that all predictions must be within a range \(\varepsilon \) of the true values. In a case with a nonlinear SVM, the scalar product \(\langle w,{x}_{i}\rangle \) is replaced by a kernel function \(K(w,{x}_{i})\). Because of the kernel function, the SVM method is highly flexible. Assumptions about the functional form of the transformation are avoided, and there is good out-of-sample generalization when the kernel tuning parameters are appropriately chosen. Like other nonparametric techniques, the SVM method suffers from a lack of transparency in its results. Graphical visualizations can be used to facilitate the interpretation of the results.
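As an illustration of the SVR formulation above, the sketch below fits scikit-learn's SVR with an RBF kernel on placeholder data; the kernel choice and the C and epsilon values are assumptions for the example only.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for a prepared city dataset.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(750, 17))
y_train = rng.normal(loc=200_000, scale=50_000, size=750)

# epsilon defines the tolerance tube around the regression surface; the RBF
# kernel replaces the scalar product <w, x_i> with K(w, x_i).
svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=100.0, epsilon=1_000.0),  # epsilon on the scale of the prices
)
svr.fit(X_train, y_train)
```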

3.3 Performance evaluation

To assess the predictive performances of machine learning estimators for real estate price forecasting in major French cities, some evaluation metrics are needed. As is common in the literature (Botchkarev 2019), we rely on the following measures:

  • Q1: defines the first quartile of the prediction error distribution (the error value that is larger than 25% of all the prediction errors).

  • MedAE: represents the median absolute error (the error value that is larger than 50% of all the prediction errors).

  • Q3: defines the third quartile of the prediction error distribution (the error value that is larger than 75% of all the prediction errors).

  • MAE measures the mean absolute error; for a set of \({\varvec{n}}\) error terms \(\left\{{{\varvec{e}}}_{{\varvec{i}}}, {\varvec{i}}=1,\dots ,{\varvec{n}}\right\}\), the MAE is defined by the following:

    $${\varvec{M}}{\varvec{A}}{\varvec{E}}=\frac{\sum_{{\varvec{i}}=1}^{{\varvec{n}}}\left|{{\varvec{e}}}_{{\varvec{i}}}\right|}{{\varvec{n}}}$$
  • RMSE: quantifies the root mean square error; for a set of \({\varvec{n}}\) error terms \(\left\{{{\varvec{e}}}_{{\varvec{i}}}, {\varvec{i}}=1,\dots ,{\varvec{n}}\right\}\), the RMSE is defined by the following:

    $${\varvec{R}}{\varvec{M}}{\varvec{S}}{\varvec{E}}=\sqrt{\frac{\sum_{{\varvec{i}}=1}^{{\varvec{n}}}{\left|{{\varvec{e}}}_{{\varvec{i}}}\right|}^{2}}{{\varvec{n}}}}$$
  • MSLE: defines the mean squared logarithmic error; for a set of \({\varvec{n}}\) prices \(\left\{{{\varvec{y}}}_{{\varvec{i}}}, {\varvec{i}}=1,\dots ,{\varvec{n}}\right\}\) and a set of \({\varvec{n}}\) predicted price values \(\left\{{\widehat{{\varvec{y}}}}_{{\varvec{i}}}, {\varvec{i}}=1,\dots ,{\varvec{n}}\right\}\), the MSLE is defined by the following:

    $${\varvec{M}}{\varvec{S}}{\varvec{L}}{\varvec{E}}=\frac{1}{{\varvec{n}}}\sum_{{\varvec{i}}=1}^{{\varvec{n}}}{\left(\mathbf{log}\left({{\varvec{y}}}_{{\varvec{i}}}+1\right)-\mathbf{log}\left({\widehat{{\varvec{y}}}}_{{\varvec{i}}}+1\right)\right)}^{2}$$
  • R2: computed for the regression model; it represents the proportion of the variance of the dependent variable (output) that is explained by the independent variables (inputs).

For each evaluation metric, we are first interested in its values for the best performing city (considered as nonreference) and the worst performing city (considered as reference); second, we are interested in its values regarding the real estate price prediction information with geocoding (considered as nonreference) and without geocoding (considered as reference). For each case, these values are used to compute an improvement ratio, which is defined as follows:

$$ Improvement\,ratio = \frac{\text{metric value for the reference} - \text{metric value for the nonreference}}{\text{metric value for the reference}} $$
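The metrics and the improvement ratio defined above translate directly into code; the sketch below computes the quartile-based measures on absolute prediction errors, which is how the results are later reported (in euros).

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Evaluation metrics for one model: quartiles of absolute errors,
    MAE, RMSE, MSLE and R2."""
    e = y_true - y_pred
    abs_e = np.abs(e)
    q1, medae, q3 = np.percentile(abs_e, [25, 50, 75])
    return {
        "Q1": q1, "MedAE": medae, "Q3": q3,
        "MAE": abs_e.mean(),
        "RMSE": np.sqrt(np.mean(abs_e ** 2)),
        "MSLE": np.mean((np.log(y_true + 1) - np.log(y_pred + 1)) ** 2),
        "R2": 1 - np.sum(e ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

def improvement_ratio(metric_reference, metric_nonreference):
    # (reference - nonreference) / reference, as defined above
    return (metric_reference - metric_nonreference) / metric_reference
```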

4 Experiments

Our overall experimental process is described in Fig. 6.

The main steps are data preparation (with and without geocoding), model training with machine learning techniques and cross validation, and finally, selection of the best model, which will be used for the evaluations and interpretations. All these steps are described in the next sections.

4.1 Data preparation

It is now well known that the most important step in machine learning or predictive modeling is the data preparation step. In practice, it has been generally found that data cleaning and preparation account for approximately 80% of the total data engineering effort (Zhang et al. 2003). Data preparation comprises the techniques concerned with analyzing raw data to yield high-quality data and mainly includes the following processes: data collection, data integration, data transformation, data cleaning, data reduction, and data discretization. Data preparation is a fundamental step for many reasons. First, real-world data are impure, whereas high-performance mining systems require high-quality data, and accurate data yield high-quality patterns. Second, real-world data may be incomplete (e.g., missing attribute values, missing certain attributes of interest, or only aggregate data are available), noisy (e.g., containing errors or outliers), and inconsistent (containing discrepancies in codes or names), and these types of data can disguise useful patterns.

Data preparation also involves generating a dataset smaller than the original one, which can significantly improve the efficiency of data mining; this includes the following tasks:

  • Selecting relevant data: selecting attributes (filtering and wrapper methods), removing anomalies, or eliminating duplicate records.

  • Reducing data: sampling or instance selection.

Data preparation generates high-quality data, which lead to high-quality patterns. For example, we can:

  • Recover incomplete data: fill in missing values or reduce ambiguity.

  • Purify the data: correct errors or remove outliers (unusual or exceptional values).

  • Resolve data conflicts: use domain knowledge or expert decisions to settle discrepancies.

  • Add additional valuable data by data linkage.

In our case, we use almost all of these data preparation techniques for each experiment (without geocoding and with geocoding).

4.1.1 Data preparation without geocoding

In the experiments without geocoding, the data preparation step is summarized in Fig. 7.

The successive steps are as follows: attribute selection, inconsistency removal, outlier removal, filling in missing values, standardization and one-hot encoding.

The attribute selection step consists of selecting only data from the 9 cities in all the raw datasets. As stated in the Data section, the raw dataset contains 43 variables for each transaction. In this step, we also select only the valuable variables (the 10 variables shown in the figure) that are naturally related to the price of each transaction. Because we are only interested in real estate transactions for residential properties, we also only keep the transactions with the sales and sales before completion sale types for apartments and residential houses.

In the inconsistency removal step, we particularly remove all transactions with missing or bad values for the following key attributes: postal code (because we have a strong belief regarding the importance of house locations in this study), price (since it is our target dependent variable), living area and number of rooms (since they are naturally strong predictors for the price).

In the outlier removal step, for each city, we remove all transactions with outliers in their prices (Fig. 5A). To avoid side effects, we remove outliers in our analysis to keep only the most common real estate transactions that represent the majority of the population. The outlier price values are identified with a common method, which consists of using the interquartile range, i.e., all values above the third quartile Q3 plus one half the interquartile range.

The step of filling in missing values consists of replacing the missing values of the land area variable with zero. This is because this variable is usually missing for apartment transactions.

Because many algorithms (e.g., neural networks, support vector regressors) perform better and more efficiently with standardized variables than with nonstandardized variables, we perform a transformation for all the continuous variables, all of which are almost normally distributed (land area, living area, number of rooms, number of lots). Standardization typically means rescaling the data to have a mean of zero and a standard deviation of 1 (unit variance).

Finally, for all other discrete attributes (postal code, sale type, residence type), we perform the one-hot encoding transformation to convert them into Boolean dummy variables taking the value 0 or 1. For instance, we have 6 different postal codes in Toulouse, so the postal code variable for this city is replaced by 6 different dummy variables, with each of them taking the value 0 or 1 for each transaction. Many machine learning algorithms (e.g., neural networks, support vector regression or linear regressions) require this transformation for the effective handling of discrete attributes.

At the end of this step, for a city such as Toulouse, we end up with 17 independent variables (along with the dependent variable “price”) in the prepared dataset to be used as the input for the machine learning algorithms (Fig. 7).
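The preparation steps described above can be sketched with pandas and scikit-learn as follows; the column names are hypothetical placeholders for the variables of Table 1, and the outlier threshold reproduces the rule stated in the text.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical column names standing in for the variables of Table 1.
CONTINUOUS = ["land_area", "living_area", "n_rooms", "n_lots"]
DISCRETE = ["postal_code", "sale_type", "residence_type"]

def prepare_city(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Inconsistency removal: drop transactions missing the key attributes.
    df = df.dropna(subset=["postal_code", "price", "living_area", "n_rooms"])

    # Outlier removal on price: values above Q3 plus half the interquartile range.
    q1, q3 = df["price"].quantile([0.25, 0.75])
    df = df[df["price"] <= q3 + 0.5 * (q3 - q1)].copy()

    # Filling in missing values: land area is absent for apartments.
    df["land_area"] = df["land_area"].fillna(0)

    # Standardization: zero mean and unit variance for the continuous variables.
    df[CONTINUOUS] = StandardScaler().fit_transform(df[CONTINUOUS])

    # One-hot encoding of the discrete attributes into 0/1 dummy variables.
    return pd.get_dummies(df, columns=DISCRETE)
```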

4.1.2 Data preparation with geocoding

In our framework with geocoding, the data preparation step is detailed in Fig. 8.

This set of steps differs from the previous set by one additional step, i.e., the geocoding of each transaction to obtain the precise latitude and longitude of the piece of real estate. By adding the latitude and longitude of each transaction, we should be able to evaluate the relevance of the spatial/location features for improving the models’ real estate estimations. Since the raw data contain variables describing the address of each transaction (e.g., street number, repetition index, street type, postal code and city), we can use a geocoding service to obtain the geographical coordinates (latitude and longitude) of each property in the data. There are many existing geocoding services worldwide (e.g., Google, ArcGis, HERE) that are available for free, as paid services or, most often, billed according to daily usage (Singh 2017; Di Pietro and Rinnone 2017). However, in our case, with only French addresses, we can use the free geocoding service (Footnote 3) from the French government, which provides many APIs for geocoding in France. We use this service to retrieve the geographical coordinates of each transaction address, as shown in Fig. 9.

For each address provided, the geocoding service returns additional information, such as the latitude of the address, the longitude of the address, the resulting address variables (house number, street, postal code and city), and a probability score that gives us an idea of the accuracy of the result. Geocoding, in general, is a complex task that can sometimes provide inaccurate, erroneous or no results. When there is no match for the input address, all the result fields are empty. When a match is obtained, the additional information provided (in addition to the latitude and longitude) helps to eliminate potential errors. For example, in our case, we retain only transactions where the geocoding result provides the same street name and postal code as the input, as well as a result score probability greater than 60%. Overall, approximately 90% of the transactions are successfully geolocated by the service, and this can be considered a good ratio. Because we have a very high number of transactions to be geocoded, we use the batch service of the API, and we perform whole-file geocoding for each city. For each successful result, we only retain the latitude and longitude as additional variables to be used in the machine learning algorithms (Fig. 8). For example, for the city of Toulouse, we have 17 independent variables in the prepared dataset without geocoding (after one-hot encoding); thus, we have 19 independent variables in our prepared dataset with geocoding (after one-hot encoding), but we lose approximately 10% of the transaction data due to geocoding errors or non-matches during the geocoding exercise.
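A sketch of a single-address call is given below; it assumes the public French government geocoding API at api-adresse.data.gouv.fr and the response fields we believe it returns (coordinates, score, postal code), which may differ from the exact service version and batch CSV endpoint used in the study.

```python
import requests

# Assumed endpoint of the free French government geocoding service.
GEOCODING_URL = "https://api-adresse.data.gouv.fr/search/"

def geocode(address: str, min_score: float = 0.6):
    """Return (latitude, longitude, score) for an address, or None when there
    is no match or the match is not confident enough."""
    resp = requests.get(GEOCODING_URL, params={"q": address, "limit": 1}, timeout=10)
    resp.raise_for_status()
    features = resp.json().get("features", [])
    if not features:
        return None                                    # no match: result fields empty
    props = features[0]["properties"]
    if props.get("score", 0.0) < min_score:
        return None                                    # reject low-confidence results
    lon, lat = features[0]["geometry"]["coordinates"]  # GeoJSON order: (lon, lat)
    return lat, lon, props["score"]

# Hypothetical usage: geocode("1 rue de la Paix 75002 Paris")
```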

Fig. 7 Main steps of data preparation without geocoding

Fig. 8 Main steps of data preparation with geocoding

4.2 Training

The training process is performed in the same way with and without geocoding, as presented in Fig. 6. For each city, the prepared dataset is first divided into 75% for the training set (i.e., 33 475 transactions for the city of Toulouse) and 25% for the test set (i.e., 11 159 transactions for the city of Toulouse). Training is then performed with fivefold cross-validation over a set of hyperparameters for each machine learning algorithm, as presented in Table 3; a sketch of this protocol is given below.
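The following is a minimal sketch of this training protocol, assuming scikit-learn (the library is not named in the paper); the synthetic data and the hyperparameter grid are placeholders standing in for one city's prepared dataset and for the per-algorithm grids of Table 3.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Stand-in for the prepared dataset of one city (e.g., 17 features for Toulouse).
    X, y = make_regression(n_samples=1000, n_features=17, noise=10.0, random_state=0)

    # 75% training set / 25% test set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Fivefold cross-validation over a placeholder hyperparameter grid.
    search = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid={"n_estimators": [500, 1000, 2500], "max_depth": [8, 16, 32]},
        cv=5,
        scoring="neg_mean_absolute_error",
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    best_model = search.best_estimator_  # best model retained for the comparative analysis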

Table 3 Hyperparameters used for each machine learning algorithm
Fig. 9 Preview of the inputs and outputs of the French geocoding service

Fig. 10 Model evaluations in terms of metrics for the experiment without geocoding

For each city and across all fivefold cross-validation steps, this process gives us 180 different trained neural network models, 30 different trained random forest models, 90 different trained adaboost models, 180 different trained gradient boosting models, 120 different trained k-nearest neighbors models, 180 different trained support vector regression models and 20 different trained linear regression models. This training process provides a total of 800 trained models per city for each experiment, which corresponds to a total of 1600 models for both experiments (with and without geocoding) for each city. Overall, we have a total of 14 400 trained models for all 9 cities. For each city and each algorithm, we only select the best model for the experiment without geocoding and the best model for the experiment with geocoding for a comparative analysis.
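These counts follow directly from the sizes of the hyperparameter grids, since each grid combination is trained once per cross-validation fold; the per-algorithm combination counts below are inferred from the figures above (trained models divided by 5 folds), as Table 3 is not reproduced here, so they should be read as assumptions.

    # Back-of-the-envelope check of the quoted model counts.
    combos = {
        "neural network": 36, "random forest": 6, "adaboost": 18,
        "gradient boosting": 36, "k-nearest neighbors": 24,
        "support vector regression": 36, "linear regression": 4,
    }
    per_city_per_experiment = sum(5 * c for c in combos.values())  # 800 trained models
    per_city = 2 * per_city_per_experiment                         # 1600 (with and without geocoding)
    total = 9 * per_city                                           # 14 400 for the 9 cities
    print(per_city_per_experiment, per_city, total)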

4.3 Evaluation of results

Here, we present an evaluation of results for the experiment without geocoding and for the experiment with geocoding, as well as a comparison between these two results.

4.3.1 Results of the experiment without geocoding

Model performances without geocoding


Figure 10 shows the resulting metrics for each machine learning model used in this experiment. If we examine the three best predictors for each metric, we consistently obtain the random forest, neural network and k-nearest neighbors models, except for gradient boosting in the case of the Q1 metric. Overall, the neural network technique appears to be the best predictor among all the algorithms used. The hyperparameters of the neural network used to achieve these results are as follows (a configuration sketch follows the list):

  • 2 layers with 150 neurons in the first layer and 50 neurons in the second layer.

  • ReLU activation function.

  • Adam solver.

  • 1000 max iterations.

  • Learning rate of 0.1.
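One possible instantiation of this configuration, assuming a scikit-learn MLPRegressor (the library is an assumption, as it is not named in the paper); the reported learning rate of 0.1 is mapped to the learning_rate_init parameter.

    from sklearn.neural_network import MLPRegressor

    best_nn = MLPRegressor(
        hidden_layer_sizes=(150, 50),  # 2 hidden layers: 150 then 50 neurons
        activation="relu",
        solver="adam",
        max_iter=1000,
        learning_rate_init=0.1,
        random_state=42,
    )
    # best_nn.fit(X_train, y_train); predictions = best_nn.predict(X_test)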

Fig. 11 Metrics of the best model (the neural network model) per city for the experiment without geocoding

For the best predictors in general:

  • The Q1 of absolute errors is approximately 15,000 euros.

  • The median error is approximately 35,000 euros.

  • The mean absolute error is approximately 55,000 euros.

In terms of the R2 metric, approximately 61% of the price variance is explained by the input variables of our models. A sketch of how these metrics are computed from test-set predictions is given below.
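As a brief sketch, the reported metrics can be obtained from a model's test-set predictions as follows, using NumPy and scikit-learn as assumed tooling; the two arrays are toy placeholders for the actual test prices and predictions.

    import numpy as np
    from sklearn.metrics import mean_absolute_error, median_absolute_error, r2_score

    # Toy placeholders for the test-set prices and the model's predictions (in euros).
    y_test = np.array([200_000, 350_000, 150_000, 500_000])
    preds = np.array([210_000, 330_000, 160_000, 460_000])

    abs_errors = np.abs(y_test - preds)
    q1_error = np.percentile(abs_errors, 25)           # first quartile (Q1) of absolute errors
    med_error = median_absolute_error(y_test, preds)   # median absolute error (MedAE)
    mae = mean_absolute_error(y_test, preds)           # mean absolute error (MAE)
    r2 = r2_score(y_test, preds)                       # share of price variance explained (R2)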

Performances of the best models per city without geocoding

If we only consider our best model (the neural network model), the metrics per city are presented in Fig. 11. The performances of the models differ across cities. Except for the R2 metric, real estate price predictions are more accurate for cities with medium costs of living in terms of real estate prices (e.g., Toulouse, Montpellier, and Nantes) and less accurate for cities with high costs of living (e.g., Paris, Bordeaux, and Nice). The R2 metric shows that, even though the price variation is better explained for an expensive city such as Paris than for a less expensive city, the price forecasting precision remains low.

Fig. 12 Model evaluations in terms of metrics for the experiment with geocoding

The metric improvement ratios between the best-performing city and the worst-performing city are mostly over 60%, as shown in Table 4 below.

Table 4 Improvement ratios between the best-performing and worst-performing cities for the experiment without geocoding

If we compare the results of the best-performing city with the average results for all cities, we can notice the following:

  • The first quartile of the prediction error distribution is approximately 10,000 euros (compared to the average of 15,000 euros for all cities).

  • The median prediction error is approximately 23,000 euros (compared to the average of 35,000 euros for all cities).

  • The mean absolute error is approximately 36,000 euros (compared to the average of 55,000 euros for all cities).

  • The R2 (explained variance) is approximately 67% (compared to the 61% average for all cities).

In the next section, we present the results obtained with geocoding.

4.3.2 Results for the experiment with geocoding and improvement

Model performances with geocoding

Figure 12 shows the resulting metrics for each machine learning model used along with the geocoded variables. Relative to the experiment without geocoding, we can observe two major findings, as follows:

  • The ensemble learning algorithms (random forest, gradient boosting and adaboost) outperform all other algorithms for all metrics.

  • Real estate price predictions with geocoding are far better than predictions without geocoding in terms of all metrics.

Fig. 13 Metrics of the best model (random forest) per city for the experiment with geocoding

Table 5 presents the hyperparameters used during the evaluation of the ensemble learning algorithms; a configuration sketch is given after the table. For all these algorithms, the best max depth for the decision trees was 32, and the optimal number of estimators (decision trees) was 2500. Small learning rates lead to better results (0.05 for adaboost and 0.1 for gradient boosting) than large ones.

Table 5 Best hyperparameters for the ensemble learning algorithms
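One way to instantiate these reported configurations, assuming scikit-learn estimators (an assumption, as the library is not named); for adaboost, the max depth is applied to its decision-tree base learner.

    from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                                  RandomForestRegressor)
    from sklearn.tree import DecisionTreeRegressor

    random_forest = RandomForestRegressor(n_estimators=2500, max_depth=32, n_jobs=-1)
    gradient_boosting = GradientBoostingRegressor(n_estimators=2500, max_depth=32,
                                                  learning_rate=0.1)
    adaboost = AdaBoostRegressor(
        estimator=DecisionTreeRegressor(max_depth=32),  # "base_estimator" in older scikit-learn
        n_estimators=2500,
        learning_rate=0.05,
    )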

When we examine the accuracy of the models compared to the experiment without geocoding, Tables 6, 7 and 8 show the average improvements of the different metrics across all cities for each ensemble learning algorithm. Overall, for all metrics, we observe a mean improvement of 36.11% for adaboost, 31.13% for gradient boosting and 24.66% for random forest, clearly showing the relevance of integrating the geocoding step for real estate estimation across all the cities used in this experiment. If we consider, for instance, the adaboost algorithm, we obtain the following observations:

  • The first quartile of the forecasting errors is approximately 10,000 euros (compared to the average of 18,000 euros without geocoding).

  • The median error is approximately 26,000 euros (compared to the average of 42,000 euros without geocoding).

  • The mean absolute error is approximately 39,000 euros (compared to the average of 57,000 euros without geocoding).

  • The R2 (explained variance) is approximately 0.74 (compared to the average of 0.54 without geocoding).

Table 6 Best adaboost predictor for all cities
Table 7 Best gradient boosting predictor for all cities
Table 8 Best random forest predictor for all cities

Performances of the best models per city with geocoding

We consider one of our best ensemble learning models (random forest, for instance); the metrics obtained per city are presented in Fig. 13. The performances of the model differ slightly across cities. The results are more accurate for cities with medium costs of living in terms of real estate prices (e.g., Lille, Toulouse, Montpellier, and Nantes) and less accurate for cities with high costs of living (e.g., Paris, Bordeaux, and Nice).

The metric improvement ratios between the best-performing city and the worst-performing city are mostly over 70%, as shown in Table 9 below, except for the R2 metric (improvement ratio of 21%).

Table 9 Improvement ratios between the best-performing and worst-performing cities for the experiment with geocoding

If we compare the results of the best-performing city with the average results for all cities, we can notice the following:

  • The first quartile of the forecasting errors is approximately 5000 euros (compared to the 10,000 euros average for all cities).

  • The median error is close to 16,000 euros (compared to the 26,000 euros average for all cities).

  • The mean absolute error is close to 29,000 euros (compared to the 44,000 euros average for all cities).

  • The R2 (explained variance) is approximately 80% (compared to the 74% average for all cities).

When examining the model precision for one city, such as Lille, compared to the experiment without geocoding, Tables 10, 11 and 12 show the mean improvement for all metrics with the ensemble learning algorithms. Overall, we observe for all metrics a mean improvement of 40.85% for adaboost, 39.77% for gradient boosting and 31.7% for random forest, which also clearly shows the relevance of integrating the geocoding step for real estate estimation at the city level. If we consider, for instance, the adaboost algorithm for that city, we have the following:

  • The first quartile of the forecasting errors (25% of the predictions) is approximately 4000 euros (compared to the average of 9000 euros without geocoding, an improvement of approximately 52.36%).

  • The median error is approximately 16,000 euros (compared to the average of 26,000 euros without geocoding, an improvement of approximately 39.22%).

  • The mean absolute error is approximately 29,000 euros (compared to the average of 43,000 euros without geocoding, an improvement of approximately 33.14%).

  • The R2 (explained variance) is approximately 0.79 (compared to the average of 0.54 without geocoding, an improvement of approximately 46.3%).

Table 10 Best adaboost predictor for Lille
Table 11 Best gradient boosting predictor for Lille
Table 12 Best random forest predictor for Lille

5 Discussion and implications

5.1 Experimental discussion

With respect to our research question, the aim of this paper is to evaluate what would be lost in terms of predictive power by an automated valuation model that fails to integrate location variables. We designed an experiment that focuses on machine learning models evaluated on a complete dataset containing 5 years of historical real estate transactions in nine major French cities. We used geocoding to add precise geographic coordinates to the features used as inputs for each machine learning model. We built specific models for each city of the experiment, with and without the geographic coordinate features as model inputs, to compare the predictive powers of the models in both cases. The results clearly show that adding geographic coordinates to the list of input features leads to a significant increase in precision for the most popular model evaluation metrics (MedAE, Q1, Q3, R2, MAE, RMSE, and MSLE). More precisely, across all cities, the mean precision improvement reaches 36% on average over all metrics and up to 45% on average for some specific metrics with the best predictor models. For the models built for a single city, this precision improvement can reach 40% on average over all metrics (e.g., for Lille) and even 52% for specific metrics. At a high level of granularity, we also compared the differences in model precision across the nine cities used in the experiment. The results show that, for almost all metrics, the models were approximately 60% more precise for cities with medium costs of living (e.g., Toulouse, Lille, and Montpellier) than for cities with high costs of living (e.g., Paris, Bordeaux, and Nice). Moreover, this precision difference reaches 70% when considering models using geographical coordinates as input features. Finally, regarding the machine learning techniques used, our results reveal that neural networks and random forest particularly outperform the other methods when geographical coordinates are not accounted for, while the ensemble learning methods (random forest, adaboost and gradient boosting) perform best when geographical coordinates are considered.

Our results are in line with studies in the literature showing that including location attributes in automated valuation models improves prediction accuracy for techniques such as submarket methods, trend surface and spatial expansion methods, spatial regression methods, and machine learning methods with spatial attributes (Bourassa et al. 2003; Bitter et al. 2007; McCluskey et al. 2013; Čeh et al. 2018; Doumpos et al. 2020). Beyond this, given our research question and experiments, this study additionally provides an estimate of what would be lost in terms of predictive power by a model (specifically a machine learning model) that fails to integrate location attributes. The losses reach up to 52% for the best model predictors for a metropolitan city in our experiment. This figure conveys, more concretely than other measures, the high importance of location attributes for automated valuation models. At a high level of granularity, our results also provide a quantification of the relevance of using submarket methods (e.g., Bourassa et al. 2010; Goodman and Thibodeau 2007). In our case, we built different models for each city, and we observed model precision differences of up to 70% between medium-cost cities and high-cost cities. This result can be viewed as a difference in spatial dependence and spatial heterogeneity between these medium-cost and high-cost cities (Anselin 2013; Basu and Thibodeau 1998; Bitter et al. 2007). Finally, regarding the best machine learning methods, many studies in the literature have reported similar results, with ensemble learning algorithms as their best predictors (e.g., McCluskey et al. 2014; Čeh et al. 2018; Mullainathan and Spiess 2017; Kok et al. 2017; Mayer et al. 2018; Baldominos et al. 2018) or with artificial neural networks outperforming the other methods (McCluskey et al. 2013; Yacim and Boshoff 2018; Abidoye et al. 2019). Other studies report contrasting results, with k-nearest neighbors (e.g., Isakson 1988; Borde et al. 2017) or support vector regression as the best predictors (e.g., Lam et al. 2009; Kontrimas and Verikas 2011; Huang 2019). However, all these related experiments were carried out in different contexts and with different datasets, and they do not always consider all these algorithms in the same experiment. Our experiment overcomes these biases and can be viewed as a more reliable comparison between all these algorithms, since the same context and dataset are used throughout the experiment.

5.2 Implications

Our study has several research and practical implications.

Research implications


To the best of our knowledge, this is the first study focusing on evaluating and quantifying the impact of geographic locations on real estate price estimations. Many existing studies in the literature (described in Sect. 2) have already demonstrated the relevance of location features in real estate price estimations, but none of them provide metrics that precisely quantify the relevance of location features. Our research question in this study is thus quite new and can lead to many other similar empirical studies with machine learning methods, as well as with other automated valuation methods, such as submarket methods, trend surface methods, spatial expansion methods, and spatial regression methods.

In the operations management field, only a few studies are interested in revenue management for durable and non-replenishable products such as real estate (Wen et al. 2016; Padhi et al. 2015). This study could serve as a basis for assessing real estate prices for strategic revenue management under the uncertainty of real estate projects. For instance, it could help set the number of properties of each type and their prices, decisions for which revenue management is difficult to handle under uncertain customer demand, customer preferences, and volatile commodity prices (Padhi et al. 2015; Bogataj et al. 2016).

Practical implications


This study could have direct implications for real estate price estimation, particularly for the French market, which has so far received little attention in the automated valuation model and operations management literature. Our study is based on a reliable data source containing 5 years of historical real estate transactions from notarial acts. The practical implications of this study can be expressed in two respects.

First, the trained machine learning models could help anyone obtain a quick estimate of the value of a real estate property from a sale or purchasing perspective, and this also applies to real estate agencies and investors. As shown in our experiment, adding precise geographic location features considerably improves a given model's price estimations. For instance, for many cities we obtain median errors of approximately 15,000 euros and first-quartile errors of approximately 5,000 euros, which is promising as error margins for an automated estimator, especially considering that many important house characteristics are missing from the studied dataset (e.g., the age of the house, presence of a lift, presence of parking spaces, presence of a swimming pool, presence of terraces, presence of a garden, number of floors, community costs, etc.). This makes it possible to envisage highly relevant results once such characteristics are included.

Second, our study makes it easy to understand and compare the real estate markets of major French cities. For instance, we can clearly notice that the real estate prices in medium-cost cities, such as Lille, Toulouse, and Montpellier, can be estimated more precisely than those of more expensive cities, such as Paris, Bordeaux, and Nice. Such comparative information could provide a quality indicator when interpreting automated price estimations from different cities or when choosing only cities where price predictions are sufficiently precise to be exploited. All of this could provide valuable information for individuals, agencies or investors interested in the real estate market.

5.3 Limitations and directions for future research

The approach presented in this paper shows promising results but can be improved experimentally and conceptually in many ways.

Experimentally, the studied dataset does not contain many important house characteristics that are valuable in real estate estimations, such as the age of the asset, details of the asset composition (e.g., presence of parking spaces, lifts, gardens, etc.), community costs, etc. Adding such missing characteristics would naturally improve model accuracy. Linking the dataset with external data sources, such as online real estate ads or social media (Bekoulis et al. 2018), could help extract and add some of these missing characteristics to the experiment. In this study, we also chose to quantify the relevance of spatial attributes by adding the geographic coordinates of each transaction as feature variables for training the machine learning models. However, other studies have also successfully modeled location with other variables, such as accessibility variables (e.g., proximity to amenities such as schools), neighborhood socioeconomic variables (e.g., local unemployment rates), and environmental variables (e.g., road noise or visibility impact) (Čeh et al. 2018; Bourassa et al. 2010; Case et al. 2004). Another experimental improvement could be to quantify, with machine learning techniques, the relevance of these other location-related variables and their differences compared to the sole use of geographical coordinates. Additionally, rather than using geographic coordinates directly, one could first group transactions into small geographic tile features (McNeill and Hale 2017) of various sizes to capture geographical areas with different and flexible levels of granularity (e.g., low, intermediate or high). This latter approach would consider flexible, geographically-based submarkets (Bourassa et al. 1999) in the preparation steps before model training with machine learning techniques. From another point of view, we mainly focus on the predictive capacities of machine learning techniques in this study because they represent the main advantage of these techniques and can provide good estimates to many real estate actors, such as real estate agencies or investors. However, it could also be interesting to go beyond this limitation and practically quantify and compare the levels of volatility of these techniques (Mayer et al. 2018).

Conceptually, we think the approach presented in this paper could be complementary to many existing approaches for automated valuation models, particularly when integrating hedonic modeling and machine learning algorithms (Hu et al. 2019).

6 Conclusion

We presented an experiment on real estate price estimation using seven machine learning techniques and 5 years of historical real estate transactions in major French cities. We particularly focused on demonstrating and quantifying the relevance of location features in real estate estimations at both high and low levels of granularity, with one main objective being to provide an idea of what would be lost in terms of predictive power by an automated valuation model that fails to integrate location variables. From a practical point of view, this could also allow the training of more accurate real estate models that could help identify the best opportunities for marketplace players, such as real estate agencies or investors. For instance, at a high level of granularity, we clearly observed very important differences in the models' forecasting errors (sometimes with precision differences beyond 70%) between high-cost cities (e.g., Paris, Bordeaux, and Nice) and medium-cost cities (e.g., Toulouse, Lille, and Montpellier). This could imply that it is more relevant to train specific models for certain geographical submarkets (cities in this case) rather than global models including all cities. At a low level of granularity, we made use of geocoding to extract and add precise geographic location features to the machine learning algorithms' inputs. We observed important improvements in the models' forecasting power (sometimes greater than 50%) when adding these geographic location features compared with models trained without them. These results are promising and could provide data modeling alternatives using machine learning techniques in real estate price estimation procedures. Moreover, our approach could be complementary to many automated valuation models or revenue management methods and thus offers many perspectives for future research.