1 Introduction

The approach used to classify patterns based on large amounts of data is known as profiling (Wright 2009). Profiling can be applied in multiple domains and for a variety of purposes. For instance, in the electricity sector, network managers need to assess the type of demand to balance electricity between generation, transmission and distribution, to support their long-term planning (Espinoza et al. 2005). In the water sector, managers need to have reliable predictions of daily average consumption, peak factors and minimum night consumption for the operation of water distribution systems. Profiling water consumption is challenging given the nature and quality of available data (i.e., different sources and with different temporal and spatial resolutions), the numerous consumption drivers and the horizons and spatial scales involved (Cabral et al. 2014; Donkor et al. 2012).

Water consumption in network areas is mostly influenced by socio-demographic, billing and climate factors (Arbués et al. 2010; Browne et al. 2013; Parker and Wilby 2013). Higher consumption is typically associated with wealthier individuals living in newer and larger households with outdoor uses, such as irrigation and swimming pools (Beal and Stewart 2011). Households with more members tend to have lower per capita water consumption, and households with elderly residents tend to show more frugal consumption patterns (March et al. 2010). In terms of billing and pricing, water demand is price-inelastic, but the outdoor uses component is sensitive to price increases (Grafton et al. 2011; Tanverakul and Lee 2012). Regarding climate, temperature and precipitation also affect water uses (Polebitski and Palmer 2010).

Until now, demand profiling has been developed at the household level or at the city or region level (Al-Zahrani and Abo-Monasar 2015; Hollermann et al. 2010; Idowu et al. 2012; Polycarpou and Zachariadis 2013; Scheepers and Jacobs 2014), taking into consideration a limited number of influential factors. Hardly any research has been carried out at the census or network area level (Alcocer-Yamanaka and Tzatchkov 2012; Fontdecaba et al. 2012), which is important for many water planning decisions. Additionally, very few prediction models incorporate seasonal or daily consumption scenarios, despite the differences in consumption between winter and summer (Polebitski and Palmer 2010) or weekdays and weekends (Alvisi et al. 2007).

The main objective of this paper is to present a comprehensive approach for spatial and temporal demand profiling in network areas, focusing on domestic consumption. Spatial profiling focuses on estimating consumption variables and patterns in network areas without metering but with similar socio-demographic characteristics to the ones analyzed. Temporal profiling focuses on predicting daily and seasonal demand behaviours in a specific area. This approach was explored using extensive data collected from different network areas in the north and south regions of Portugal. High-resolution flow data (15-min) were collected over one year, which allowed different seasonal and daily scenarios to be identified.

The main contribution of this study is the comprehensive approach for consumption profiling that results in: (i) regression models for estimating design parameters and (ii) daily consumption patterns for different seasonal and daily scenarios. These allow accurate estimations of peaking factors, daily consumption patterns and minimum night consumption, essential for the network operation and management (e.g., water losses control, pumping cost minimization) and network planning, design and rehabilitation.

2 Methodology

The methodology for spatial and temporal consumption profiling involves a four-step procedure – data collection, data processing, data characterization, and spatial and temporal consumption profiling – described in the following sections.

2.1 Data Collection

Data collection can be divided into two stages: the first includes collecting data from different network areas and the second collecting census data provided by Statistics Institutes.

In the first stage, flow time series, billed consumption and infrastructure data are collected. Flow data readings from the utilities’ SCADA or telemetry systems should be collected for each metered area. The following criteria have been set to select network areas for the analysis:

  1. Boundaries of each network area are kept constant throughout the year.

  2. Network areas in which domestic consumption accounts for more than 80 % of annual billed consumption, to ensure that areas are mainly composed of residential clients.

  3. Number of service connections between 150 and 5000, which corresponds to an acceptable network size for operational management (Farley and Trow 2003; Jankovic-Nišic et al. 2004).

  4. Service connections with geographic reference. Infrastructure data is available in a geographical information system.

  5. High-resolution flow data with a 10 to 15-min time-step and a minimum one-year data record.

In the second stage, socio-demographic data at the smallest territorial division, the “census area”, should be collected. This division corresponds to homogeneous building and living zones, with ca. 300 households (INE 2012); a network area may include several census areas.

2.2 Data Processing

Flow time series need to be validated, normalized and cleaned. Data validation includes detecting and correcting outliers. Data normalization aims at obtaining data with a regular time step (15 min). Infrastructure and billing data should be standardized and organized to a common database.
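The normalization step can be illustrated with a minimal sketch that maps irregular readings onto a regular 15-min grid. The use of linear interpolation and the function name `regularize` are assumptions made for illustration; the text does not state which resampling method should be applied.

```python
from bisect import bisect_left

STEP_MIN = 15  # regular time step stated in the text

def regularize(times_min, flows, step=STEP_MIN):
    """Linearly interpolate irregular readings onto a regular grid.

    times_min: strictly increasing integer timestamps, in minutes.
    flows: flow reading at each timestamp.
    Linear interpolation is an assumption of this sketch.
    """
    # First grid point at or after the first reading (ceiling division).
    t = -(-times_min[0] // step) * step
    grid, values = [], []
    while t <= times_min[-1]:
        j = bisect_left(times_min, t)  # first reading at or after t
        if times_min[j] == t:
            v = flows[j]
        else:
            t0, t1 = times_min[j - 1], times_min[j]
            w = (t - t0) / (t1 - t0)
            v = flows[j - 1] + w * (flows[j] - flows[j - 1])
        grid.append(t)
        values.append(v)
        t += step
    return grid, values

# Hypothetical readings taken every 10 min, resampled to 15 min.
grid, values = regularize([0, 10, 20, 30], [0.0, 10.0, 20.0, 30.0])
```

Outlier detection and correction would precede this step and are not covered by the sketch.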

A geoprocessing tool should be used to relate infrastructure and billing data from network areas to socio-demographic data organized in census areas. This tool should convert socio-demographic data at the census area level to data at the network area level.
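As a minimal sketch of this conversion, the function below aggregates one census variable to the network area level using a client-weighted average, in line with the weighting by number of domestic clients reported in the case studies (Section 3). The exact formula used by the geoprocessing tool is not given in the text, so the weighted mean and the function name `to_network_area` are assumptions.

```python
def to_network_area(census_values, domestic_clients):
    """Client-weighted average of one census variable over a network area.

    census_values[k]: value of the variable (e.g. % of elderly) in census area k.
    domestic_clients[k]: domestic clients of the network area located in
    census area k (used as weights; assumed weighting scheme).
    """
    total = sum(domestic_clients)
    if total == 0:
        raise ValueError("network area has no domestic clients")
    return sum(v * w for v, w in zip(census_values, domestic_clients)) / total

# Hypothetical network area overlapping two census areas.
share_elderly = to_network_area([10.0, 30.0], [100, 300])  # 25.0
```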

2.3 Data Characterization

This step involves calculating all the variables for demand profiling. Regarding flow time series, consumption variables and daily patterns for different seasonal and daily scenarios should be considered.

Consumption variables should include peaking factors and average consumption (relevant for pipe design and rehabilitation) and minimum night consumption, and average consumption during minimum night consumption period (relevant for network operation and water losses control).

Daily consumption patterns can be obtained by averaging the flow measurements at each 15-min time step over the period of analysis. Dimensionless consumption patterns are obtained by dividing each instantaneous value by the respective daily average. This is useful to compare the daily behaviour of different network areas.
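The two operations described above can be sketched as follows, assuming a flow series that starts at 0h00, has a regular 15-min step and contains only complete days. The function names are hypothetical.

```python
SLOTS_PER_DAY = 96  # 24 h at a 15-min time step

def daily_pattern(flows):
    """Average flow at each 15-min slot over all complete days in `flows`."""
    n_days = len(flows) // SLOTS_PER_DAY
    if n_days == 0:
        raise ValueError("at least one complete day is required")
    return [
        sum(flows[d * SLOTS_PER_DAY + s] for d in range(n_days)) / n_days
        for s in range(SLOTS_PER_DAY)
    ]

def dimensionless(pattern):
    """Divide each value of a daily pattern by the daily average."""
    mean = sum(pattern) / len(pattern)
    return [v / mean for v in pattern]

# Hypothetical two-day series: low flow in the first half of each day.
day = [1.0] * 48 + [3.0] * 48
factors = dimensionless(daily_pattern(day * 2))
```

By construction, the dimensionless factors average to 1.0 over the day, which is what makes patterns from areas of different size comparable.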

Consumption scenarios can be obtained using hierarchical cluster analysis (Ward’s method and Euclidean distances). The recommended procedure is first to identify groups of months with similar behaviour and then, following the same approach, to identify groups of weekdays with similar behaviour within each seasonal scenario. Seasonal scenarios are related to changes in outdoor uses throughout the year (e.g., garden watering and swimming pool filling during the summer); daily scenarios are related to water use changes between working days and weekends.
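To illustrate the clustering step, the sketch below implements a naive agglomerative procedure with Ward’s criterion: at each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squared Euclidean distances. It is written for small inputs (e.g., 12 monthly patterns) and is not the implementation used in the study; in practice a statistical package would normally be used. The function name and the sample data are hypothetical.

```python
def ward_clusters(points, n_clusters):
    """Agglomerative clustering with Ward's criterion (Euclidean distances).

    Naive O(n^3) version for small inputs.
    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        na, nb = len(a[0]), len(b[0])
        d2 = sum((x - y) ** 2 for x, y in zip(a[1], b[1]))
        return na * nb / (na + nb) * d2

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        na, nb = len(a[0]), len(b[0])
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(a[1], b[1])]
        merged = (sorted(a[0] + b[0]), centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(c[0] for c in clusters)

# Hypothetical monthly average consumption for six months (one feature each).
months = [[1.0], [1.1], [0.9], [3.0], [3.2], [2.9]]
seasons = ward_clusters(months, 2)
```

The same function can be applied a second time within each seasonal group, with one point per weekday, to obtain the daily scenarios.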

Table 1 presents the consumption variables that should be considered and the respective scenarios (global, seasonal and daily).

Table 1 Analysed time series, consumption variables and scenarios considered

Infrastructure variables should include the main characteristics of the pipe network (i.e., material, diameter and installation year) and service connections (i.e., number of service connections and service connection pipe length).

Billing variables should characterize the domestic and major categories of non-domestic consumption (i.e., commerce-industry, collective and public).

Socio-demographic variables should include the four main census categories: building, dwelling, family and individual. The building category refers to building age and number of floors; the dwelling category indicates whether the household is used as primary residence, rented or vacant; the family category indicates family type and size; and the individual category refers to age, employment and education level. Table 2 shows the 37 infrastructure (8), billing (6) and socio-demographic (23) variables that should be considered for subsequent analyses.

Table 2 Socio-demographic, infrastructure and billing variables calculated in this study

2.4 Spatial and Temporal Consumption Profiling

Profiling involves setting consumption variables and patterns as dependent variables that will be explained by a combination of socio-demographic, billing and infrastructure independent variables.

A model should depend on the fewest possible independent variables (Vandekerckhove et al. 2014). Principal Components Analysis (PCA) should be applied to reduce the number of independent variables to the most significant ones (Jolliffe 2002). Principal Components (PCs) are new orthogonal (uncorrelated) variables given by linear combinations of the original ones that preserve the total variance. Mathematically, PCA is an eigendecomposition of the covariance (or correlation) matrix of the original variables. The Kaiser-Meyer-Olkin measure of sampling adequacy (KMO-test) should be used to avoid reducing the variables to an inadequate size (Kaiser 1970). Adequate samples are the ones with KMO values greater than 0.6 and a total explained variance for each category greater than 75 %. After the PCA, Multiple Linear Regression (MLR) should be carried out by setting consumption variables as dependent variables and key-variables as independent variables. For a data set \( {\left\{{y}_i,{x}_{i1},\dots, {x}_{ip}\right\}}_{i=1}^n \) of n statistical units, the MLR model is given by:

$$ {y}_i={\beta}_0+{\beta}_1{x}_{i1}+\dots +{\beta}_p{x}_{ip}+{\varepsilon}_i\kern4em i=1,\dots, n $$
(1)

in which:

\( {y}_i \): dependent or response variable for unit i;

\( {\beta}_p \): regression coefficient related to independent variable p;

\( {x}_{ip} \): independent variable p for unit i;

\( {\varepsilon}_i \): random error for unit i.

The regression coefficients β1...βp represent the increase (positive value) or decrease (negative value) in the expected value of the dependent variable associated with a unit increase in each independent variable. The expected value of the dependent variable is equal to β0 when the remaining regression coefficients are null. To evaluate the quality of the results, the standard errors of the estimated regression coefficients should be computed, as well as the adjusted coefficient of determination \( {\mathrm{R}}_{\mathrm{a}}^2 \). This last coefficient is called “adjusted” since it reflects the number of independent variables and the sample size. Additionally, the p-value of the overall F-test for the regression model should also be calculated.
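The quantities mentioned above can be computed with a short ordinary least squares sketch. It assumes NumPy is available and uses the standard definition of the adjusted coefficient, 1 − (1 − R²)(n − 1)/(n − p − 1); the function name `fit_mlr` and the sample data are hypothetical, and the overall F-test is left out.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares for y = b0 + b1*x1 + ... + bp*xp + e.

    Returns (beta, std_err, r2, r2_adj); beta[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalises additional independent variables.
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    sigma2 = ss_res / (n - p - 1)               # residual variance estimate
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return beta, std_err, r2, r2_adj

# Hypothetical data: 6 network areas, 2 key-variables.
X = [[0.2, 1.0], [1.1, 0.3], [0.4, 0.8], [1.5, 1.2], [2.0, 0.1], [0.9, 0.6]]
y = [101.0, 131.0, 108.0, 139.0, 160.0, 124.0]
beta, std_err, r2, r2_adj = fit_mlr(X, y)
```

Because the adjustment factor (n − 1)/(n − p − 1) is greater than 1 whenever p ≥ 1, the adjusted coefficient never exceeds R².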

For daily consumption patterns, a new cluster analysis with standardized variables needs to be carried out to group areas with similar patterns. A Decision Tree (DT) using the CART algorithm and Gini impurity (Breiman et al. 1984) should then be built to classify consumption patterns.
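The splitting criterion can be illustrated with a minimal sketch of a single CART-style split: the Gini impurity of a node is 1 − Σ p_k², and the chosen threshold is the one that minimises the size-weighted impurity of the two child nodes. This shows one split on one variable only, not a full tree; the function names and the sample values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Threshold on one variable minimising the weighted Gini impurity.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (threshold, weighted_impurity); threshold is None if no split
    improves on the impurity of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, score)
    return best

# Hypothetical values of one key-variable and the pattern type of each area.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["Type 3"] * 3 + ["Type 1"] * 3
threshold, impurity = best_split(values, labels)
```

A full tree is obtained by applying the same search recursively to each child node over all candidate variables.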

3 Case-Studies

The methodology was applied to network areas belonging to Portuguese WDS located in two regions: the north region, which includes the districts of Oporto (Por) and Braga (Bra), and the south region, which includes the districts of Lisbon (Lis) and Setúbal (Set). Each area was identified by an abbreviation code followed by the district name (e.g., ADE_Bra refers to an area in Braga district).

Billing and infrastructure data from 86 network areas were collected: 33 areas in the north region and 53 in the south region. Network length in the studied areas ranged between 4 and 95 km, clients ranged between 740 and 5200 and inhabitants ranged between 2300 and 9300.

Socio-demographic data referring to the last census in Portugal (2011) were obtained through the National Statistics Institute website (www.ine.pt). To convert data at the census area level to the network area level, a geoprocessing tool was used (Loureiro 2010; Mamade 2013). This conversion was carried out by weighting census areas according to the number of domestic clients. This weighting option proved to be more accurate than the original weighting method, which relied on the Thiessen polygons of each service connection. A cluster analysis was applied to highlight the main regional differences in terms of the socio-demographic, billing and infrastructure characteristics (Fig. 1).

Fig. 1 Cluster analysis revealing the main socio-demographic, billing and infrastructure characteristics of network areas (Group 1–2: North region; Group 3–5: South region)

Regarding billing variables, domestic billed consumption is higher in the north region (Groups 1–2), where families are larger (higher percentage of families with adolescents). Nevertheless, per capita consumption in the north ranges between 50 and 70 l/inh∙day, whereas in the south (Groups 3–5) it is considerably higher (100–140 l/inh∙day). This can be due to the higher economic mobility in the south region (higher percentage of workers employed in the tertiary sector), which is typically correlated with higher incomes and may lead to less conservation-oriented attitudes towards the use of water. This difference may also be related to the existence of households in the north region that are not connected to the WDS (e.g., households with private wells). The northern region is also characterized by lower temperatures (T) in the summer and much higher precipitation (P) than the south region. This explains the lower public consumption in the northern region.

Regarding socio-demographic and infrastructure variables, the north region has newer buildings, fewer asbestos cement pipes and a higher proportion of buildings with 1–2 floors, compared with the south region. Northern areas also exhibit a higher proportion of families with adolescents, while the southern areas have a larger elderly population.

Flow time series could not be obtained for all network areas due to insufficient historical data. A total of 17 network areas (5 in the north and 12 in the south) have been used for consumption profiling. Outliers were removed based on the concept of outlier region (Loureiro et al. 2015). For each time series, the minimum consumption during the night period (0h00–6h00) was identified and removed from the series, to ensure that neither consumption variables nor daily patterns were influenced by the level of water losses in the network area (Farley and Trow 2003).
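The removal of the minimum night consumption can be sketched as below. The sketch assumes a series that starts at 0h00 with a 15-min step and a single minimum taken over the whole record and subtracted from every value; the text does not state whether the minimum was computed per day or over the full series, so this is one possible reading. The function name is hypothetical.

```python
SLOTS_PER_HOUR = 4  # 15-min time step

def remove_min_night_flow(flows, start_h=0, end_h=6):
    """Subtract the minimum night flow from a series starting at 0h00.

    The night period is [start_h, end_h), here 0h00 to 6h00 as in the text.
    Returns (minimum night flow, series with that minimum subtracted).
    """
    slots_per_day = 24 * SLOTS_PER_HOUR
    first, last = start_h * SLOTS_PER_HOUR, end_h * SLOTS_PER_HOUR
    night = [q for k, q in enumerate(flows) if first <= k % slots_per_day < last]
    q_min = min(night)
    return q_min, [q - q_min for q in flows]

# Hypothetical one-day series (96 values).
flows = [5.0] * 8 + [2.0] * 4 + [6.0] * 12 + [10.0] * 72
q_min, cleaned = remove_min_night_flow(flows)
```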

4 Results from Spatial and Temporal Profiling

4.1 Consumption Scenarios

Daily and seasonal scenarios were identified using cluster analysis. Cluster analysis allowed the identification of two seasons: the winter and the summer seasons. Fig. 2a shows these seasonal scenarios for one network area, in which there is a seasonal average consumption increase and a significant behavioural change. Typically for all the analyzed areas, summer (S) scenarios occur from July to September, whereas winter (W) scenarios occur from November to February.

Fig. 2 Scenario exploration for FAR_Set demand patterns: a seasonal; b daily

The next step was to understand daily consumption behaviours. Results have shown that working days, Saturdays, and Sundays and bank holidays behave differently from one another, for both seasonal scenarios. Thus, for spatial and temporal profiling, three daily scenarios were analyzed for each seasonal scenario (Fig. 2b).

4.2 Data Reduction

The multiple variables obtained in each category (infrastructure, billing and customers and socio-demography) were reduced to a subset of independent variables using PCA. Since significant regional differences were identified, PCA was separately carried out for the north and south regions. A total of 33 and 53 areas in the north and south regions, respectively, were used in PCA. Table 3 summarizes all the key-variables considered for DT and MLR analysis and describes the structure of each principal component.

Table 3 Key-variables considered for MLR using network data from 53 areas in the south region and 33 areas from the north region

Concerning the socio-demographic category in the south region, the 1st component (PC1: Elderly families) is the most important, as it explains 58.2 % of total variance and shows that families with 1–2 elements and inactive workers or elderly are related (positive loadings), in opposition (negative loadings) to families with 3–4 elements and with adolescents. The 2nd component (PC2: Individuals Mobility) explains 30.7 % of total variance and shows that individuals with a higher education level (university graduates) and working in the tertiary sector (economic mobility) are related, in opposition to individuals with a lower education level. For the north region, PCA showed the same components; however, the Individuals Mobility component had greater importance, explaining 50.6 % of total variance, whereas the Elderly families component explained 26.0 %.

Regarding the infrastructure in the south region, the 1st component (PC1: Pipe material) explains 44.9 % of total variance and reflects pipe material, which is independent of pipe size (PC2: Pipe size), which explains 31.0 %. In contrast, for the north region, Pipe size is more important (explains 42.5 %) than Pipe material (explains only 25.3 %).

For billing variables in the south region, PCA was only applied to domestic billed consumption variables as these are independent from non-domestic ones. The only component obtained is PC1: Domestic billed consumption. The same results were obtained for the north region.

In summary, data reduction allowed reducing the 49 initial variables into 8 new variables (5 PCs and 3 variables). A good structure (with high explained variance and KMO) was obtained for both regions and important regional differences were observed.
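The explained-variance percentages reported in this section follow from the eigenvalues of the correlation matrix of each variable category. The sketch below, assuming NumPy is available, shows how such percentages and the number of components needed to reach the 75 % threshold stated in Section 2.4 can be obtained. The function names and the sample data are hypothetical, and the KMO test is not included.

```python
import numpy as np

def pca_correlation(X):
    """PCA as an eigendecomposition of the correlation matrix.

    X: (n_areas, n_variables) array.
    Returns (explained_ratio, loadings), sorted by decreasing eigenvalue.
    """
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)
    eigval, eigvec = np.linalg.eigh(R)      # eigh: for symmetric matrices
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()       # trace of R = number of variables
    # Loadings: correlations between original variables and components.
    loadings = eigvec * np.sqrt(np.clip(eigval, 0.0, None))
    return explained, loadings

def n_components_for(explained, target=0.75):
    """Smallest number of PCs whose cumulative explained variance >= target."""
    return int(np.searchsorted(np.cumsum(explained), target) + 1)

# Hypothetical category with three variables measured in four areas;
# the second variable is a multiple of the first.
X = [[1.0, 2.0, 1.0], [2.0, 4.0, -1.0], [3.0, 6.0, -1.0], [4.0, 8.0, 1.0]]
explained, loadings = pca_correlation(X)
k = n_components_for(explained)
```

The signs of the loadings identify which original variables move together (same sign) or in opposition (opposite signs) within a component.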

4.3 Regression Models

A correlation matrix was calculated to analyze which relations between consumption variables (Table 1) and key-variables (Table 3) ought to be explored. After analysing the most significant correlations, an MLR analysis was carried out separately for each region. The regressions obtained are presented in Table 4.

In the south region, domestic billed consumption is negatively influenced by the Elderly families component (\( {\widehat{\upbeta}}_2 \) = −25.8), showing that families with 3–4 elements with adolescents consume more water for domestic uses, which is coherent with the north region results. This variable also relates negatively to the Pipe size component (\( {\widehat{\upbeta}}_2 \) = −13.6), meaning that it increases with higher pipe diameters (above 110 mm).

Table 4 Profiling models obtained through MLR

In terms of the average consumption per inhabitant, two seasonal scenarios were analyzed: winter and summer. For both scenarios, the average consumption per inhabitant is higher for individuals with higher mobility (\( {\widehat{\upbeta}}_1 \) = 34.2) and monthly consumptions above 25 m3 (\( {\widehat{\upbeta}}_2 \) = −21.3). Consumption is higher in the summer (\( {\widehat{\upbeta}}_0 \) = 220.5) than in winter (\( {\widehat{\upbeta}}_0 \) = 172.1).

The daily peaking factor increases mostly with monthly consumptions above 25 m3 (\( {\widehat{\upbeta}}_2 \) = −0.10), plastic pipes (\( {\widehat{\upbeta}}_3 \) = 0.10) and families with 3–4 elements and adolescents (\( {\widehat{\upbeta}}_1 \) = −0.02).

The minimum night consumption is analyzed in the winter scenario, since the average flows are generally lower and leakage becomes more significant. This variable is mainly influenced by socio-demographic characteristics, since the infrastructure is recent and in good condition, with a low percentage of asbestos cement (AC) (< 30 %). Thus, the minimum night consumption increases with the Elderly families component (\( {\widehat{\upbeta}}_1 \) = 74.1) and the Individuals Mobility component (\( {\widehat{\upbeta}}_2 \) = 54.0), as well as with Commerce-industry consumption (\( {\widehat{\upbeta}}_3 \) = 11.1).

Domestic billed consumption in the north tends to be higher than in the south. This is explained by the family size: in the north, 66 % of the families have more than 3 elements, while in the south this represents 40 %. In the north region, domestic billed consumption relates positively to the Individuals Mobility component and negatively to the Elderly families component. Tertiary sector employees typically have higher incomes, leading to higher water consumption with less conservation-oriented attitudes (Beal and Stewart 2011). This regression also indicates that consumption relates positively to Pipe material, increasing in network areas where AC pipes predominate.

Results are encouraging and should be explored with a larger number of network areas.

4.4 Classification of Daily Consumption Patterns

A cluster analysis (CA) was first used to group the dimensionless patterns obtained in the different areas. This CA considered only the winter period, due to the more homogeneous consumption, and only the working days, due to the difference in consumption behaviour between working days and weekends. The instantaneous consumption was characterized by the median and the 10th and 90th percentiles of dimensionless consumption. A total of 18 areas (out of the initial 21) was used for CA.

Four types of daily consumption patterns were obtained (Fig. 3). Consumption is characterized at six periods: transition (6-7 h and 22-1 h), night (1-6 h), morning (7-10 h), lunch (10-15 h), afternoon (15-19 h) and dinner (19-22 h). Obtained patterns are:

  • Type 1: maximum value of the consumption in the morning (2.2), lower consumption at lunch and afternoon and a significant consumption at dinner (1.5);

  • Type 2: largest consumption during the day (morning, lunch, afternoon and dinner factor higher than 1.0) and morning and dinner peaks with identical values (1.5);

  • Type 3: higher consumption in the morning and lunch periods (1.5–1.7) and lower consumption in the dinner period (1.4);

  • Type 4: largest and identical consumption at lunch and dinner (1.6) and a significant consumption during the morning period (1.4).

Fig. 3 Daily consumption patterns for working days: a Type 1, b Type 2, c Type 3, and d Type 4

Types 1 and 2 correspond to the areas from the south region, where economic and individuals’ mobility is higher and individuals spend most of their time away from home. This fact justifies the morning and dinner peaks and the lower consumption during the day.

Types 3 and 4 correspond to the areas from the north region, where the consumption factors are higher throughout the day (morning, lunch, afternoon and dinner periods), more similar to consumption patterns at weekends. These areas present a lower percentage of active population (47 %, against 68 % in the south), allowing consumption throughout the day.

The decision tree used to classify the daily demand pattern on working days is presented in Fig. 4. The decision tree was constructed using only two variables to classify the consumption patterns: public billed consumption and individuals’ mobility.

Fig. 4 Decision tree to classify the daily demand pattern for working days

4.5 Validation

The regression models have been tested and validated using three new network areas. A relative error, given by the absolute difference between the real and the estimated value divided by the estimated value, was used. A network area in the north region (VIL_Bra) has been used to validate Model A (domestic billed consumption). Two network areas in the south region were selected to validate Model C (daily peaking factor) and Model D (per capita consumption in the summer). Table 5 presents the relative errors, showing that the models have a good prediction performance, particularly for the south region, since these models have a higher \( {\mathrm{R}}_{\mathrm{a}}^2 \).
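Written out, the relative error described above can be expressed as follows, where \( y \) denotes the real (observed) value and \( \widehat{y} \) the value estimated by the model. Both symbols and the name \( e \) are introduced here for illustration only and are not notation from the original study:

$$ e=\frac{\left|y-\widehat{y}\right|}{\widehat{y}} $$

For instance, with purely hypothetical values \( y=110 \) and \( \widehat{y}=100 \), the relative error is \( e=10/100=10\ \% \). Note that the denominator is the estimated value, as stated in the text, rather than the observed one.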

Table 5 Relative errors for the validation models

Pattern validation included two steps: application of the decision tree to classify two new sectors (ALF_Lis and QTE_Lis were classified as Type 1); and comparison with the median pattern of the assigned type. Fig. 5 shows that the real patterns are close to the median of the Type 1 pattern.

Fig. 5 Pattern validation using ALF_Lis and QTE_Lis

5 Concluding Remarks

This research aims at developing a comprehensive approach for spatial and temporal profiling of water consumption variables and patterns in WDS. The approach is applied to 86 network areas considering 49 initial socio-demographic, billing and infrastructure variables. Scenario exploration allowed the identification of seasonal (winter and summer) and daily scenarios (working days, Saturdays and Sundays).

Principal Components Analysis was carried out to define the most relevant socio-demographic, billing and infrastructure variables (reducing the number of variables to 8), followed by Multiple Linear Regression models to estimate average, peaking and minimum consumption. The most important components obtained were socio-demographics and indicate that family structures (i.e., families with elderly or adolescents), individuals’ mobility (i.e., people employed in the tertiary sector and university graduates) and public consumption (i.e., public spaces’ irrigation) are key-variables to profile water demand.

Four different consumption patterns have been identified, clearly showing that different daily consumption behaviours are mainly associated with different family structures (families with adolescents or elderly).

Results are encouraging and should be explored with a larger number of areas. This research considerably reduces the uncertainty in planning and operation of water distribution systems, thereby improving their efficiency and sustainability.