1 Introduction

The decrease in water availability has posed several challenges to many regions globally, with some of them facing high water stress levels (Luo et al. 2015). Moreover, during the next decades, freshwater demand is predicted to increase by 55 %, mainly because of manufacturing, electricity, and residential use (OECD 2012). Therefore, policymakers and water service providers need to evaluate a range of water resource management policies to bridge the current (and expected) gap between water supply and demand, with pricing policies being one of these options (Pérez-Urdiales et al. 2014; Puri and Maas 2020).

Pricing policies are considered more effective and efficient than command and control policies such as rationing, information campaigns that incentivize conservation, or the introduction of new water-saving technologies (Arbués et al. 2003; Marzano et al. 2018; Olmstead and Stavins 2009; Pérez-Urdiales et al. 2014; Puri and Mass 2020; Roibás et al. 2007). However, pricing policies are only effective when water demand responds to water price changes. Thus, the knowledge of the water price elasticity is crucial from a policy perspective.

Estimating the water price elasticity implies considering several topics, including functional form, estimation strategy, sample size, nature of the data (panel, cross-sectional, or time series), sociodemographic and weather information to be included in the estimation, and aggregation level of the data being used (Arbués et al. 2003; Marzano et al. 2018; Sebri 2014; Worthington and Hoffman 2008). Unfortunately, most of the time the data aggregation level and the sample size are already defined by information availability. In this paper, we shed some light on the consequences of using different data aggregation levels and sample sizes on the water price elasticity estimates.

The empirical literature on residential water demand uses many types of data aggregation, depending on whether the aggregation is over consumption, sociodemographic variables, or both. We identify three dataset types: (1) aggregated data (AD), in which all the variables used are aggregated at some spatial level (i.e., neighborhood, district, municipality); (2) disaggregated data (DD), which uses household-level information for all the variables; and (3) semi-aggregated data (SAD), which combines household-level data on water consumption and prices with aggregated information on sociodemographic variables. Although the literature recognizes that higher disaggregation is better (Yoo et al. 2014), researchers end up working with some level of data aggregation because of several reasons. Among these are institutional arrangements, unavailability of disaggregated information, and the ease and low cost of gathering aggregated information, allowing for larger datasets.

Two research questions motivated us to conduct this study:

  1. 1)

    What are the consequences of using large SAD (disaggregated by water consumption and price; aggregated by sociodemographic variables) vis-à-vis small DD (water consumption, price, and sociodemographic variables)? We estimate the water price elasticity with these different sample types, resembling the tradeoff that researchers and policymakers face when deciding how to gather water consumption data.

  2. 2)

    What are the consequences of using different levels of data aggregation for the sociodemographic variables when the number of observations is fixed (i.e., a fixed panel of households’ water consumption)?

We address these questions through a discrete-continuous choice (DCC) model, which considers that consumers face an increasing block price (IBP) structure. We use a metropolitan area in south-central Chile as a case study. Our results show that the water price elasticities are statistically different when comparing a large SAD sample with a small DD sample. We also find that the water price elasticity estimates depend upon the level of aggregation of the data used and the sample size.

The evidence of the impact of using different levels of data aggregation on water price elasticity estimates is limited. The results of previous studies are not conclusive and mostly originate from meta-analyses. For instance, two of these studies, Dalhuisen et al. (2003) and Espey et al. (1997), found different effects for DD at the household-level—Espey et al. (1997) illustrated that the use of DD generates more inelastic demands, whereas Dalhuisen et al. (2003) demonstrated that it generates more elastic demands. Neither of these studies found the impact to be statistically significant. In another meta-analysis, Sebri (2014) contended that studies that use DD present more elastic demand than those based on AD and that this result is significant at a 95 % confidence level. Our study is the first attempt to evaluate this issue using primary data.

To the best of our knowledge, this is the first study to address the effect of aggregating sociodemographic variables on price elasticity estimates. We seek to determine whether it is worth the extra cost of acquiring household sociodemographic information to estimate the demand function parameters. For policymakers and researchers, understanding the impact of different data aggregation levels on price elasticity estimates is important because it can lead to more cost-efficient data collection and better estimation of price elasticity, thereby helping design better-informed water policies.

2 Brief Literature Review on Water Price Elasticity and Data Aggregation

Scholars face a tradeoff when choosing between different data aggregation levels. On the one hand, DD provides better information on sociodemographic variables directly associated with the individual’s behavior. However, the sample size might be small because of the cost of gathering this information. On the other hand, both AD and SAD allow us to have bigger samples size, but with a loss in the precision of the sociodemographic information, as we assume that the households sharing the same district (or another aggregation level) have identical sociodemographic variables.

In previous studies using AD, authors employed all the variables aggregated at the neighborhood, municipal, regional, district, or county level, either because of institutional or budget constraints (Martínez-Espiñeira 2002; Martínez-Espiñeira 2003; Mazzanti and Montini 2006; Nauges and Thomas 2000; Salazar and Pineda 2010; Schefter and David 1985; Schleich and Hillenbrand 2009; Yoo et al. 2014). For instance, Nauges and Thomas (2000) used AD at the municipal-level in France because price negotiations in that country depend on municipal characteristics and not on individual consumption levels (unlike other European and Latin American countries). Mazzanti and Montini (2006) analyzed residential water demand in Italy using municipal-level data because it was less expensive to collect. Yoo et al. (2014) estimated residential water demand in Phoenix, Arizona, using aggregated consumption data from 312 census tracts serviced by the Phoenix Water Services Department. Although the authors argued that using water consumption data at the household-level was better, this information was unavailable because of the high collection cost. For the sociodemographic variables, the authors used data from the 2000 and 2010 censuses. Recently, because of budget constraints, Acuña et al. (2019) used AD at the municipal-level to examine the role of climate variability in the convergence of water consumption in Chile.

One disadvantage of using AD to estimate the water demand is that it conceals the effects of individual heterogeneity, which can lead to biased estimations (Marzano et al. 2018). By contrast, DD allows us to control for individual characteristics (Arbués et al. 2003; Marzano et al. 2018; Worthington and Hoffman 2008), observe individual preferences within a population, and analyze variability in the variables that could explain consumers’ behavior (Salazar and Pineda 2010).

Using DD implies that all the information (water consumption, price, and sociodemographic characteristics) is at the household-level and is obtained directly through the water service provider whenever possible (Jiménez et al. 2017; Jones and Morris 1984; Maas et al. 2017; Olmstead et al. 2007; Pérez-Urdiales et al. 2014; Vásquez et al. 2017; Wichman et al. 2016). The studies using DD at the household-level are more expensive to conduct given the difficulties and costs involved in obtaining primary information for building large panels (Clavijo 2013; Hoyos and Artabe 2017; Jiménez et al. 2017; Klassert et al. 2018; Suárez-Varela 2020). Therefore, DD is an exception in the literature. For example, Suárez-Varela (2020) estimated the residential water demand in Spain during 2006–2012 employing a panel of annual data at the household-level from the Spanish Survey on Family Spending (Encuesta Española de Presupuestos Familiares). This survey compiles an exhaustive registry of household spending for 489 goods, such as drinking water, covering 24,000 representative Spanish households. Hoyos and Artabe (2017) also estimated the residential water demand using the same survey, focusing on the regional differences between climatic areas. Maas et al. (2017) estimated the water demand in Colorado, United States, using 6,759 observations for the period 2009–2014, combining motivational information (social and environmental) with observable household characteristics that influence water consumption and household-level billing data for the water public service. Vásquez et al. (2017) calculated the residential water demand in Colombia, employing DD, including climatic variables, at the household-level using a sample of 490 households with monthly information during 2001–2013.

Literature reviews and meta-analyses have demonstrated the lack of studies using DD (Arbués et al. 2003; Dalhuisen et al. 2003; Espey et al. 1997; Marzano et al. 2018; Sebri 2014; Worthington and Hoffman 2008). The most recent and comprehensive meta-analysis, Marzano et al. (2018), analyzes 124 studies, including 615 price elasticity estimations, among which 36 % use disaggregated information at the household-level. The authors found no significant statistical difference in elasticity estimations between studies using disaggregated consumption and price data and those using AD, whereas Sebri (2014) argued the opposite.

SAD studies combine household-level data on water consumption and prices with sociodemographic data that is aggregated at either census-district, neighborhood, or municipality level (Ghimire et al. 2016; Ghavidelfar et al. 2017; Ghavidelfar et al. 2018). This approach is easier to implement and is less expensive than using DD because information on consumption and prices can be obtained directly from the water utility companies (at the household-level), while aggregated sociodemographic data can be obtained from public sources (e.g., from censuses or national surveys). Adopting this approach enables building larger panels of information at a fraction of the cost of using a “pure” DD. For instance, Ghavidelfar et al. (2017) used large household data on water consumption and prices, with aggregated socioeconomic information at the census level for assessing individual houses’ water demand in New Zealand, while Ghavidelfar et al. (2018) employed the same approach for analyzing the water demand of high-rise apartments in New Zealand.

The possibility of using a panel is an important feature from an estimation perspective. Using panel data has grown in popularity over the last 20 years because it allows us to control for heterogeneity in households and over time. In Marzano et al.’s (2018) meta-sample, panel data are used in 64 % of the studies. Other alternatives such as cross-sectional data are not currently used as frequently because they do not allow variability in consumption over time. For instance, the cross-sectional approach is used only in 20 % of Marzano et al.’s (2018) meta-sample, whereas time series analysis accounts for 15 % of estimates in the same review.

Ideally, panels should use individual, high-frequency consumption data (daily or weekly) to isolate the effects of price changes from other determinants such as meteorological conditions that vary over time; however, panel data is often aggregated monthly or annually (Marzano et al. 2018). According to Marzano et al. (2018), most studies employ panel data aggregated by municipalities or neighborhoods and then analyze the price elasticity with spatially disaggregated information. Some studies estimate residential water demand with panel data using only AD (Mazzanti and Montini 2006; Salazar and Pineda 2010; Yoo et al. 2014), while others employ panels using SAD—with consumption and price variables at the household-level—and AD for the sociodemographic variables (Ghimire et al. 2016; Hewitt and Hanemann 1995; Kenney et al. 2008; Nieswiadomy and Molina 1989; Yates et al. 2013). However, panels presenting all variables disaggregated at the household-level, including the sociodemographic variables, are preferable (Maas et al. 2017; Olmstead et al. 2007; Suárez-Varela 2020; Vásquez et al. 2017).

3 Materials and Methods

3.1 Hypothesis Testing

We test two hypotheses: (1) Using a large SAD sample (disaggregated information on water consumption and prices and aggregated information on sociodemographic variables) provides water price elasticity estimates that are statistically different from those using a small sample of DD. (2) Successive levels of aggregation of sociodemographic variables affect the water price elasticity estimates when keeping the sample size fixed.

These hypotheses can be confirmed using a confidence interval comparison for the estimated values of price elasticity. The estimates are statistically different if the confidence intervals do not overlap. Schenker and Gentleman (2001) illustrate that this is not conclusive when the confidence intervals overlap. Therefore, they suggest a mean difference t-test. However, in paired samples, the covariance is different from zero and needs to be estimated.

3.2 Case Study and Data

In this study, we focus on the metropolitan area of Concepción, a city located in south-central Chile. Currently, the country is experiencing a ten-year-long drought (since 2010), with a decline in precipitation between 20 and 40 % (Garreaud et al. 2019). According to the Luo et al. (2015), Chile is one of the most vulnerable countries globally, with a high level of water stress (Banwell et al. 2020). Moreover, by 2040, the water demand is expected to increase by 10 %, with a 25 % growth in the residential sector (MOP 2017). This will put more stress on the national water system, requiring the implementation of water policies (i.e., pricing policy) aimed at promoting more efficient water use.

The metropolitan area of Concepción includes five municipalities (Concepción, Chiguayante, San Pedro de La Paz, Talcahuano, and Hualpén), which together have a total population of 684,842 (INE 2017). ESSBIO, a private utility company, provides drinking water to the entire area and according to current regulations, has an IBP scheme. This price system is used by nearly 74 % of sanitary companies in developing counties (Fuente et al. 2016). In this study, we considered price, sociodemographic, and climatic variables as the determinants of the residential water demand.

The comparison of the large SAD versus the small DD includes information on the water consumption and monthly prices for all households in the metropolitan area during 2007–2012. This information was provided by the water utility company ESSBIO. For the SAD case, the sociodemographic information (household income, number of inhabitants in the household, number of bathrooms) at the district level was obtained from the 2002 Population and Housing Census (Censo de Población y Vivienda de 2002) and the Survey of Socioeconomic Characteristics from the waves of 2009, 2011, and 2013. Climatic information associated with average temperatures and precipitation was obtained from meteorological stations in the metropolitan area using the interpolation process. The household’s sample considers 9,198,072 records on water consumptions and prices (monthly information for 127,751 families in the 5 municipalities of the metropolitan area over 6 years). We randomly selected observations from the full sample of households because of computational capacity limitations, finishing with a database of 890,104 observations (the equivalent of 12,777 families). We call this data Sample 1 and use it to build the large SAD panel.

The small DD panel was randomly extracted, to collect sociodemographic information, from Sample 1 and includes 506 households over 32 census districts. After data management, we obtained complete information for 443 households, which constitute the small panel of DD. This is our Sample 2.

The assessment of using successive levels of aggregation of sociodemographic variables was developed using Sample 2, based on which we constructed three panels. The first panel includes disaggregated information of all the variables (DD panel). The other two panels consider different levels of aggregation for the sociodemographic variables, district and municipal, whereas information on water and prices is disaggregated at the household-level. Consequently, we have three panels: household-level (DD), district level aggregation (small SAD at district level), and municipality level (small SAD at municipality level) of monthly information for 443 families over 6 years, thereby providing approximately 31,896 observations each. After the process of depuration, this subsample had 30,865 observations. Descriptive statistics are presented in Table 1.

Table 1 Data characteristics

Table 1 reports the variables’ descriptive statistics (dependent and independent) at different levels of aggregation of the sociodemographic variablesFootnote 1. As depicted, with Sample 2 (n = 30,865) at the household-level, the average water consumption in the Metropolitan area of Concepción is 15.7 m3/month for the period under study, with a marginal price in the first block of $ 741.17/m3 and the second block of $ 860.16/m3. Water bill represents a low fraction of family expenditure, approximately 2 %, although this figure is higher than the one reported for other developed countries at 0.5 % (Olmstead et al. 2007; Suárez-Varela 2020). At the communal level, the average income reaches $ 64,980.47 in block 1 and $ 73,754.32 in block 2. The data with large samples (N = 890,104) do not present a significant change compared to the small samples, and on average, differ by less than 1 %.

3.3 Water Demand and IBP: the DCC Model

The presence of non-linear price structures because of increasing or decreasing block tariffs generates some challenges, such as the presence of endogeneity, in estimating the water demand function (Taylor 1975). Non-linear price structures are widespread in the literature (more than 40 % of the studies reviewed by Dalhuisen et al. (2003) reveal multiple or non-linear tariffs) and have been gaining more recognition lately (Marzano et al. 2018; Pérez-Urdiales et al. 2014; Puri and Maas 2020; Suárez-Varela 2020).

Taylor (1975) shows that a non-linear price system generates an endogeneity problem as the individual chooses not only the quantity of water to consume but also the marginal price he/she will bear. Hewitt and Hanemann (1995) use the DCC model, which considers this endogeneity issue. They demonstrate how to estimate this model using the maximum likelihood approach, while Olmstead et al. (2007) propose an analytical expression to estimate the price elasticity under the semi-logarithmic functional form. Vásquez et al. (2017) recently extended the estimation and price elasticity calculation to other functional forms.

The econometric estimation requires maximizing an intricate likelihood function. For readers interested in details, we suggest studies by Hewitt and Hanemann (1995), Olmstead et al. (2007), Vásquez et al. (2017), and Yates et al. (2013). Notably, price elasticity differs from price coefficient because of the nonlinearity of the price function. In this case, the price elasticity represents the proportional change in the water demand for a 1 % change in the whole price structure, not just in one but all marginal prices simultaneously (see details in Olmstead et al. 2007). The price elasticity is not independent of the sociodemographic and climatic characteristics included in the model. Therefore, given that the level of aggregation of sociodemographic characteristics affects the coefficients associated with these variables, the price elasticity expression of the DCC model is also expected to be affectedFootnote 2.

4 Results

Table 2 presents the water demand estimates for the different data aggregation levels. As shown, all the estimated coefficients are statistically significant at a 95 % confidence level or higher.

Table 2 Water demand estimates

Our results demonstrate that the coefficients for the climatic variables are both positive and statistically significant. The coefficient of average temperature has a positive correlation with the residential water demand. This can be explained by discretionary water uses, such as watering grass and pool-use (Worthington and Hoffman 2008). Precipitation is positively correlated with water consumption, which seems to be an unexpected finding as greater precipitation should reduce the residential water demand (at least its discretional use). However, how temperature and precipitation affect people’s water consumption decisions lacks consensus (Puri and Maas 2020). For example, Vásquez et al. (2017) find a negative and significant relationship between water consumption and temperature and rainfall. Hoyos and Artabe (2017) find a positive relationship between temperature and negative consumption. Romano et al. (2014) illustrate that temperature has no influence on average water consumption, but rainfall has a significant and negative effect on overall water consumption. Maidment and Miaou (1986) demonstrate that the effect of rain on water demand is non-linear and decreases over time. Regarding the effect of climatic variables on price elasticity, Espey et al. (1997) and Marzano et al. (2018) contend that including rainfall leads to significantly lower estimates of elasticity. However, Sebri (2014) and Dalhuisen et al. (2003) find that rain is not significant. Regarding temperature, Marzano et al. (2018) demonstrate that the inclusion of temperature causes more elastic demands whereas Sebri (2014) shows that its inclusion leads to less elastic demands; Espey et al. (1997) and Dalhuisen et al. (2003) find no significant impact of temperature.

In general, the sign and value of the water price coefficients are consistent with those in the literature, in which water is a relatively inelastic good (Dalhuisen et al. 2003; Marzano et al. 2018; Sebri 2014). The income coefficient shows a positive relationship with water consumption, which is in line with theory and with the results of previous studies (Arbués et al. 2003; Marzano et al. 2018; Suárez-Varelas 2020). The number of inhabitants’ coefficient is positive and statistically significant for all the data aggregation levels. However, the number of bathrooms has different impacts, depending on the aggregation level used. For instance, for both the large and small SAD panels at the district level, the number of bathrooms in the household is statistically significant and positively correlated with water consumption, which is consistent with the extant literature (Worthington and Hoffman 2008).

Table 3 shows the water price elasticity estimates when faced with proportional changes in the price vector for each level of aggregation of the sociodemographic variables and sample size. The table also includes the respective standard errors calculated using the delta method with a 95 % confidence interval. All the estimated elasticities are significant at a 95 % confidence level, except for the elasticity at the municipal-level aggregation.

Table 3 Water price elasticity estimates

Figure 1 provides a graphical representation of the confidence intervals at 95 % from Table 3 and is used for testing our hypotheses. Based on our results, we can accept the first null hypothesis, implying that the water price elasticity from a large SAD panel is statistically different from that computed using a small DD panel. Further, our results also support the second null hypothesis. When comparing the water price elasticity for paired samples, we found statistical differences between the water price elasticities computed with different levels of aggregation on the sociodemographic variables.

Fig. 1
figure 1

Water price elasticities: Confident intervals

As illustrated in Fig. 1, using the large SAD panel delivers a more elastic water price elasticity estimate with a smaller variance than the small DD panel. By contrast, if the level of aggregation on the socioeconomic data is increased while keeping the sample size fixed, the elasticity increases. These changes in the absolute value of elasticity occur because the use of variables with higher aggregation levels conceals the effects of individual household heterogeneity.

From the comparison between the large SAD and the small DD panels, we can infer that the precision of estimation increases when using a higher number of observations, even when utilizing aggregated sociodemographic variables.

5 Discussion and Conclusions

The current and expected level of water scarcity in many regions of the world calls for an urgent change in water resource management to address these challenges. Knowledge of the different water demand parameters, such as water price elasticity, can allow us to develop more sustainable water policies.

The water price elasticities obtained are in line with those in the previous literature, in general (Dalhuisen et al. 2003; Espey et al. 1997; Marzano et al. 2018; Sebri 2014), and with those in the previous literature for developing countries, in particular (Nauges and Whittington 2010). Our results demonstrated that the residential water demand within an IBP system, while generally inelastic, is significantly different from zero. This indicates that pricing policies in the urban sector can reduce the demand for water at the household-level, acting as an efficient conservation method in times of growing water scarcity (Marzano et al. 2018; Olmstead and Stavins 2009).

The literature concerning the magnitude of the price elasticity estimates reveals that the DCC model tends to compute elasticities that are more elastic than those calculated using other methods (Dalhuisen et al. 2003; Marzano et al. 2018; Puri and Maas 2020; Sebri 2014; Suárez-Varela 2020). The authors argue that this is because DCC is a method that correctly estimates when the tariffs are in increasing blocks. In comparison with previous literature, our estimate for the large SAD sample is close to the average reported by Espey et al. (1997) of − 0.51 and slightly higher (in terms of absolute value) than that reported by the meta-analysis of Dalhuisen et al. (2003), Marzano et al. (2018), and Sebri (2014), with elasticities of − 0.4, − 0.37, − 0.41, respectively. This may be due to (1) IBP structure, (2) DCC estimation method, and (3) the use of panel data, which tend to show higher price elasticities (Puri and Maas 2020).

This study illustrated that the water price elasticities depend upon the level of aggregation of the data used. We found that the water price elasticities are statistically different when comparing a large SAD sample with a smaller fully DD sample. For the latter, increasing the level of aggregation of the sociodemographic variables also affects the price elasticity. Our estimates were based on the DCC model (Hewitt and Hanemann 1995; Klassert et al. 2018; Olmstead et al. 2007; Vásquez et al. 2017), which is appropriate for modeling water systems with IBP because it corrects the problem of endogeneity generated by the simultaneity of marginal price and quantity demanded.

The difference in price elasticity between various samples reveals an interesting result, that is, generating statistically significant estimations for residential water demand using the SAD sample is possible. This is the key when designing public policies, especially in developing countries, where accessing socioeconomic household-level information is a huge challenge (Nauges and Whittington 2010; Salazar and Pineda 2010). Studies with individual-level data are expensive to conduct given the costs involved in obtaining primary household information (Clavijo 2013;; Hoyos and Artabe 2017; Jiménez et al. 2017; Klassert et al. 2018; Suárez-Varela 2020).

Our estimations demonstrated that the water price elasticity computed not only using SAD but also with multiple observations is more elastic than that computed using a small DD. These findings shed some light on the impact of distinct levels of aggregation on price elasticity, which had remained inconclusive thus far (Dalhuisen et al. 2003; Espey et al. 1997; Marzano et al. 2018; Sebri 2014).

If scholars could freely decide the type of sample for computing the water price elasticity, the decision is evident: select the larger and more disaggregated sample available. This is the safest path to compute robust and reliable water price elasticity estimates. However, most of the time, the level of aggregation of the data, as well as the sample size, is more a constraint than the decision of the researcher. In this framework, our results imply greater ease of estimating the residential water demand, both in terms of data collection and budget. Employing semi-aggregated information to estimate residential water demand and avoiding the high cost of data collection from individual surveys is possible. However, obtaining the highest amount of information on individual prices and consumption possible through water providers should still be emphasized. Therefore, taking advantage of the smart metering technology, which is now implemented by many water utility companies, can create new possibilities for evaluating public policies using an efficient data collection process.