1 Introduction

This paper is part of a larger research project on the urban real-estate market, concerned with the analysis of its general structure and the monitoring of its evolution over the last few years. During this period the economic and property crisis, along with changes in the fiscal system, has deeply affected the readiness to invest, the competition between different types of assets, and the very idea of urban real-estate capital (Breuer and Nadler 2012; Gabrielli et al. 2015; Giuffrida et al. 2015). The analysis, applied to the housing market in Palermo (Italy), can provide an informational base for public investment decisions, the implementation of planning policy, and public/private negotiations, especially at a time when the map of urban values is about to be reconfigured by significant modifications of the public transport system (streetcars and subway).

Assuming the district as the minimum spatial unit, five hundred housing records were collected in ten urban districts corresponding to a wide, densely populated, and complex area of the city. To manage and organize the collected data, the study applies cluster analysis to provide different hypotheses for the articulation of the real estate market into submarkets that express the characteristics of the properties. In this way, the study intends to capture the fluid and mutable relationship between the objects (the properties, each with its own characteristics) and the hypothetical model that is meant to represent them. The concept of homogeneity is mostly related to the topographical demarcation of the district and to the peculiar characteristics of the properties belonging to the same submarket. This study reviews that concept, since it must serve as a single interpretative scheme for a systematic reading of the real estate phenomenon by means of data-mining models or big-data management (Case et al. 2004; Fik et al. 2003).

Cluster techniques, developed in other scientific fields, have been effectively used for mass appraisal in fiscal equalization (Nesticò et al. 2014) and in fair land planning (Giuffrida et al. 2014), where the direct or “phenomenological” approach (generally applied for case-by-case valuation) must be modified to represent the structural tendencies of the market and the way real estate values react to specific or area-wide transformations within the city (Chan et al. 2012; Gabrielli 2013; Hepşen and Vatansever 2012).

The study also offers the occasion for a few methodological remarks on the representation of the real estate market at both the property and the urban level. It also shows some of the major difficulties in developing a standardized informational support that enables systematic analysis of the observations and comparisons between different urban districts.

2 The Real-Estate Market Survey

This study analyzes an area of Palermo comprising ten districts whose historical, representative, and functional qualities vary with the time of their establishment and with the most recent urban transformations. The area is bounded by the Mediterranean coast on the East, Regione Siciliana Street on the South and West, and Mount Pellegrino on the North. It covers about 48 km² (30% of the municipal land) and is home to 55% of the population (about 370,000 people). The districts are as follows: Q1 Settecannoli-Brancaccio is a working-class suburb; Q2 Oreto-Stazione, Q5 Montegrappa-S. Rosalia, Q6 Cuba-Calatafimi, Q7 Zisa-Noce, and Q10 Malaspina-Palagonia are low- and medium-income districts located near the city center; Q3 Tribunali-Castellamare is a part of the historic center with a mix of social classes; Q8 Politeama and Q9 Libertà constitute the city “center”, which was built between the end of the 19th and the beginning of the 20th century and where high-income households live; and Q11 Resuttana-S. Lorenzo is a middle-class suburb (Fig. 1).

Fig. 1 Localization of the real estate data in the analyzed districts in Palermo

In 2014 the market survey collected data on 500 residential properties for sale in the abovementioned districts (Fig. 1). The data sample describes the houses by four types of characteristics (\(k_{e}\) location, \(k_{i}\) intrinsic, \(k_{t}\) technological, and \(k_{a}\) architectural) (Forte 1968), organized in 28 quantitative and qualitative attributes as shown in Table 1. The data sample also contains the asking prices and the prices per square meter and per room. Each attribute is expressed on a standard scale (from 1 to 5). The scores are first aggregated at the characteristic level, and the overall quality k* of each house is then obtained as the weighted average of its four characteristic scores.

Table 1 Characteristics and attributes of the houses

Because the coefficients of the four regressors estimated by multiple linear regression were generally inconsistent across districts, the weights for the score aggregation have been calculated from the mean of the most significant coefficients in each district, excluding the negative and the highest ones. Moreover, the weights \(\lambda_{j}\) of each district have been iteratively varied within the ranges shown below to obtain the maximum value of \(R^{2}\). Assuming \(\sum\nolimits_{j} \lambda_{j} = 1\), with \(j = e, i, t, a\), the ranges are: \(0.10 \le \lambda_{e} \le 0.30\); \(0.10 \le \lambda_{i} \le 0.30\); \(0.30 \le \lambda_{t} \le 0.50\); \(0.10 \le \lambda_{a} \le 0.30\).
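The iterative variation of the weights can be sketched as a grid search. This is only an illustration under stated assumptions: the paper does not report the step size, the search procedure, or the regression whose \(R^{2}\) is maximized, so the 0.05 step, the simple regression of price on k*, and the names `r_squared`, `best_weights`, and `RANGES` are all hypothetical.

```python
from itertools import product

# Admissible ranges of the weights, as stated in the text.
RANGES = {"e": (0.10, 0.30), "i": (0.10, 0.30), "t": (0.30, 0.50), "a": (0.10, 0.30)}


def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, nothing to explain
    return sxy * sxy / (sxx * syy)


def best_weights(scores, prices, step=0.05):
    """Search the weight grid and return (weights, R^2) with the highest R^2.

    scores: one dict per house with the keys "e", "i", "t", "a" (scores 1 to 5).
    prices: one price per house, for example in EUR per square meter.
    The step size is an assumption; the source does not report it.
    """
    def grid(lo, hi):
        count = round((hi - lo) / step)
        return [round(lo + k * step, 2) for k in range(count + 1)]

    best = (None, -1.0)
    for le, li, lt in product(grid(*RANGES["e"]), grid(*RANGES["i"]), grid(*RANGES["t"])):
        la = round(1.0 - le - li - lt, 2)  # the four weights must sum to 1
        if not (RANGES["a"][0] - 1e-9 <= la <= RANGES["a"][1] + 1e-9):
            continue
        w = {"e": le, "i": li, "t": lt, "a": la}
        # Overall quality k* of each house: weighted average of its four scores.
        k_star = [sum(w[j] * s[j] for j in w) for s in scores]
        r2 = r_squared(k_star, prices)
        if r2 > best[1]:
            best = (w, r2)
    return best
```

Calling `best_weights(scores, prices)` on the records of one district returns the admissible weight set with the highest \(R^{2}\) on this grid, together with that \(R^{2}\).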

A sample of the database for the Q3 district is shown in Table 2.

Table 2 Database for the Q3 district

3 Clustering Methodology

Especially when the real estate stock is widely heterogeneous, data-mining procedures may be applied to achieve a consistent articulation of the real estate market into submarkets that represent the similarities between objects described by a standardized set of shared characteristics. Clustering approaches can be classified into four types: hierarchical, non-hierarchical (partitions), grid-based, and model-based.

The k-means algorithm, which is a non-hierarchical method (Jardine and Sibson 1968), has been applied to the dataset described above. The output of this algorithm is the partition of the elements that optimizes a given objective function, and it is based on the assumption of distributing the elements of a sample over a predetermined number of groups (King 2014; Everitt et al. 2011; Kaufman and Rousseeuw 1990). The number of possible partitions \(p\) \(\left( p = 2^{n - 1} - 1 \right)\) can be reduced through the initial choice of the number of groups of the partition; consequently, the optimal partition can be sought among those partitions having the chosen number of groups, using a criterion that depends on the algorithm applied.

The k-means algorithm forms k groups by taking certain values as initial centroids and assigning each element to the group whose centroid is nearest (proximity is measured using the Euclidean metric). Once the first partition has been computed, the centers are recalculated, and the assignment step is repeated with the updated centers until convergence is reached, that is, until each element is assigned to the same group as in the previous partition. When this condition holds, the resulting partition is taken as the optimal one (Steinley 2003, 2006) (Table 3).

Table 3 Steps of the k-means algorithm
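The steps just described can be sketched in a few lines. This is a minimal illustration and not the SPSS implementation used in the study. It assumes the caller supplies the initial centroids, and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    """Return (labels, centroids) once the assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each element joins the group of its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda g: euclidean(p, centroids[g]))
            for p in points
        ]
        # Convergence: every element is in the same group as in the previous pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for g in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:  # an empty group keeps its previous centroid
                centroids[g] = mean_point(members)
    return labels, centroids
```

Each point may hold the scores of one property, for example the tuple of its four characteristic scores.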

The main problematic aspects of the iterative k-means procedure concern the choice of the initial centroids and of the number of groups g. The initial centroids are the starting point from which the search for the final partition begins. If there are no specific indications regarding them, an internal algorithm of the software (IBM SPSS) selects the centroids among the elements of the sample so that they are well spaced. Alternatively, the analysis can be performed many times, and the final partition will be the one most consistent with the prior knowledge of the domain that the dataset describes. Regarding the number of groups g, if it is not known a priori from the dataset, the procedure can be applied several times by varying g \(\left( g = 2, 3, \ldots \right)\) and choosing the value of g according to the Calinski-Harabasz (CH) index (Milligan and Cooper 1985; Yanchi et al. 2010). The CH index is calculated in the following way:

$$CH(g) = \frac{B(g)/(g - 1)}{W(g)/(n - g)},\quad B(g) = \sum\limits_{i = 1}^{g} d\left( \bar{x}_{i} ,\bar{x} \right),\quad W(g) = \sum\limits_{i = 1}^{g} \sum\limits_{j:\,x_{j} \in C_{i}} d\left( x_{j} ,\bar{x}_{i} \right) \tag{1}$$

where: B is the external deviance (between the groups); W is the internal deviance (within the groups); g is the number of groups; \(\bar{x}_{i}\) is the mean value of the observations belonging to the i-th cluster \(C_{i}\); \(\bar{x}\) is the mean value of the entire sample; \(x_{j}\) is the j-th observation; d is the Euclidean metric; and n is the number of observations. The higher the index, the better the partition, since the index is the ratio between the external variance and the internal variance of the partition.
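A small numerical sketch of the index may help. It assumes the commonly used form of the Calinski-Harabasz index, in which d enters as a squared Euclidean distance and each between-group term is weighted by the size of its cluster. Eq. (1) writes these terms more compactly, so that reading of d is an assumption here, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def calinski_harabasz(points, labels):
    """CH index of a partition: (B / (g - 1)) / (W / (n - g))."""
    n = len(points)
    groups = sorted(set(labels))
    g = len(groups)
    if g < 2 or g >= n:
        raise ValueError("the index needs 2 <= g < n")
    overall = mean_point(points)
    between = 0.0  # external deviance B
    within = 0.0   # internal deviance W
    for c in groups:
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = mean_point(members)
        between += len(members) * squared_distance(center, overall)
        within += sum(squared_distance(p, center) for p in members)
    if within == 0:
        return float("inf")  # every point coincides with its cluster mean
    return (between / (g - 1)) / (within / (n - g))
```

For the four one-dimensional observations 0, 2, 10, and 12 split into {0, 2} and {10, 12}, B = 100 and W = 4, so CH = (100/1)/(4/2) = 50. The less natural split {0, 10} and {2, 12} gives B = 4 and W = 100, hence CH = 0.08.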

4 Application of Cluster Analysis

The cluster analysis (k-means algorithm) is applied to the data sample with the number of clusters fixed in advance at 3, 4, and 5 (because of the limited variability of the overall quality in each district), leaving the software to choose the initial centroids.

Figure 2 shows the resulting values of the CH index and the number of clusters (best partitions) for which the CH index is maximized for each district:

Fig. 2 Values of the CH index per district and relations between overall quality k* (x-axis) and price €/m² (y-axis) of the houses

  • 3 clusters for the Q2, Q3, Q5, Q8, and Q9 districts;

  • 4 clusters for the Q7 district;

  • 5 clusters for the Q1, Q6, Q10, and Q11 districts.
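The selection of the best partition per district can be sketched end to end: run k-means for each candidate number of clusters and keep the one with the highest CH value. The block restates compact versions of k-means and of the CH index so that it runs on its own. The farthest-point seeding is only a stand-in for the "well spaced" initial centroids chosen by the software, whose internal procedure is not described in the text, and all names are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def spaced_seeds(points, g):
    """Farthest-point seeding (assumption: mimics 'well spaced' centroids)."""
    seeds = [points[0]]
    while len(seeds) < g:
        seeds.append(max(points, key=lambda p: min(dist2(p, s) for s in seeds)))
    return seeds


def k_means(points, g, max_iter=100):
    """Return the group label of each point once assignments stop changing."""
    centers = spaced_seeds(points, g)
    labels = None
    for _ in range(max_iter):
        new = [min(range(g), key=lambda c: dist2(p, centers[c])) for p in points]
        if new == labels:
            break
        labels = new
        for c in range(g):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its previous center
                centers[c] = centroid(members)
    return labels


def ch_index(points, labels):
    """CH index with squared distances and size-weighted between-group terms."""
    n, groups = len(points), sorted(set(labels))
    g = len(groups)
    if g < 2 or g >= n:
        raise ValueError("the index needs 2 <= g < n")
    overall = centroid(points)
    between = within = 0.0
    for c in groups:
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = centroid(members)
        between += len(members) * dist2(center, overall)
        within += sum(dist2(p, center) for p in members)
    return float("inf") if within == 0 else (between / (g - 1)) / (within / (n - g))


def best_partition(points, candidates=(3, 4, 5)):
    """Return (best g, {g: CH value}) over the candidate numbers of clusters."""
    scored = {g: ch_index(points, k_means(points, g)) for g in candidates}
    return max(scored, key=scored.get), scored
```

Applied to the records of one district, `best_partition` returns the candidate with the highest CH value together with the full set of values, which corresponds to the per-district comparison reported in Fig. 2.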

In general terms, the resulting clusters are sufficiently representative of the local housing market. The suburbs, such as the Q1 and Q11 districts, have a high degree of internal heterogeneity caused by varied land uses (residential, industrial, and a shopping center) and by varied states of maintenance of the buildings; this complexity is better expressed through a larger number of groups of properties, in this case 5 clusters. The central districts, such as Q3, Q8, and Q9, are instead quite homogeneous because their urban fabric dates from the same period and their building typologies are similar, so they can be described through 3 clusters only.

However, the relations between overall quality and prices in the scatter graphs (Fig. 2) show that significant differences may occur between districts having the same number of clusters. Comparing the Q1 and Q11 districts, for example, the price elasticity with respect to the overall quality is very low in the former and high in the latter. The low price elasticity may be explained by the fact that Q1 is a working-class suburb where the lack of public facilities prevents any price increase, even when the intrinsic and technological characteristics are of good quality. Likewise, among the districts with three clusters, the data points in the Q2 and Q8 districts lie quite close to the trend line, whereas in the Q9 district they are much more scattered, so that market prices differ greatly for the same overall quality.
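One way to put a number on the price elasticity mentioned above is the slope of a log-log regression of price on overall quality. The paper does not state how elasticity was assessed (it may have been read directly from the trend lines), so this estimator and the function name `price_elasticity` are assumptions made only for illustration.

```python
import math


def price_elasticity(quality, price):
    """Slope of ln(price) regressed on ln(quality).

    Interpreted as the percentage change in price associated with a
    one percent change in overall quality. All inputs must be positive.
    """
    lx = [math.log(q) for q in quality]
    ly = [math.log(p) for p in price]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    if sxx == 0:
        raise ValueError("quality must vary to estimate an elasticity")
    return sxy / sxx
```

A value near zero would correspond to a district such as Q1, where prices barely respond to quality, while a larger value would correspond to a district such as Q11.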

If the partition of the Q1 district (5 clusters) and the data set of the properties involved are examined in greater detail (Fig. 3), the following emerges:

Fig. 3 The partition in the Q1 district

  • clusters 1 and 2 represent two groups of similar properties, as all of them have the same values of \(k_{e}\) and \(k_{a}\), whereas the first group has a higher \(k_{t}\) and a lower \(k_{i}\) than the second group;

  • clusters 3 and 5 are also comparable, except for \(k_{i}\);

  • the properties in cluster 4 have the lowest prices and the worst characteristics of the district.

In the partition of the Q8 district into 3 clusters (Fig. 4; Table 4):

Fig. 4 The partition in the Q8 district

Table 4 Statistical results in the Q8 district

  • cluster 1 is very homogeneous: the characteristics of all its properties have the highest quality, and the corresponding prices are higher than the mean price;

  • in cluster 2, two situations occur: either the properties have a low score for each k, with \(k_{e}\) especially low because of their location in the blighted area of Borgo Vecchio, or they have a high location score \(k_{e}\) and low values of the other characteristics; in this latter case, prices rise because the market attributes to location a marginal price higher than those of the other features;

  • cluster 3 includes the properties with intermediate characteristics.

5 Conclusions

The results of the cluster analysis reveal that the housing market in each district has its own degree of complexity and its own relations between market prices and the clusters representing the housing characteristics. The best number of clusters, chosen on the basis of the Calinski-Harabasz index, expresses the varying internal heterogeneity of each district and reflects the urban complexity.

The relationship between asking price and characteristics can vary significantly within the same cluster, even when the quality of the characteristics is almost equivalent. This fact is indicative of the typical information asymmetry and opacity of the real estate market and, moreover, of the current uncertainty and instability of the social and economic system: the owners of real estate capital hold dissimilar expectations of capital gains or losses (plus-minus valorization) and translate them into different asking prices (Rizzo 1999).

Cluster analysis may be a useful tool to manage and analyze big data in order to describe, even if not exhaustively, the structure of the real estate market, because this approach can select homogeneous groups of properties, reduce the intrinsic complexity of urban property data, and build a knowledge system to support the implementation of urban policy.