1 Introduction

Clustering is the process of dividing a set of given elements into groups (or communities) such that elements in the same group are somewhat similar to each other, while elements from different groups are dissimilar [1]. The purpose and, at the same time, the key strength of community detection is to identify the groups and their organization, by only using the topological information [2]. The grouping is based on similarity measures defined for the elements without relying on any a priori information on how the classification should be done [3]. A survey of the most popular clustering methods used in pattern recognition is provided in [4]. Cluster analysis has been applied to many areas: gene analysis [5]; natural language processing [6]; galaxy formation [7]; image segmentation [8]; global air transportation network [9]; logistics [10]; water distribution networks [11, 12].

In all cases, clustering is used for modeling and predicting the functioning, behavior and evolution of a set of objects, identifying substructures to improve the analysis of the system and the detection of similarities and anomalies. Due to the wide range of applications, many clustering algorithms have been proposed. Generally, clustering algorithms can be roughly divided into two main groups [13]: (a) hierarchical and (b) partitioning algorithms, according to the solution representation and the algorithmic approach for generating clusters.

The hierarchical algorithms form clusters through agglomerations (agglomerative algorithms) or divisions (divisive algorithms), creating a tree structure where supergroups (or subgroups) of clusters are gradually defined.

Partitioning algorithms split the n elements of a set into \(k\le n\) clusters by moving them between clusters until a stop criterion is verified. Each solution is assessed by a given objective function.

Generally, graphs are structures formed by a set of n vertices (nodes) and a set of m edges (links) connecting vertices. Water Distribution Networks (WDNs) can also be modeled as graphs, in particular as planar, spatially organized, weighted link-node graphs, in which pipes and valves correspond to links, while junctions, water sources and water demand points correspond to nodes [14]. WDNs belong to the class of networks strongly constrained by their geographical embedding [15], for which connections between distant nodes are unlikely because of physical constraints. Furthermore, WDNs can be considered complex networks [16], since they often consist of thousands of nodes and links and are strongly looped and topologically irregular, as they follow the shape of the road network of the city they serve. These characteristics make WDN management arduous, with many operational problems, such as water and energy losses [17].

In this regard, in recent years Water Network Partitioning (WNP) has become one of the most attractive and most studied strategies for improving WDN management, including leakage [18] and pressure management [19], water quality monitoring [20], and faster repair interventions [21]. This strategy is based on defining areas partially isolated from the rest of the network through the insertion of gate valves and flow meters along some pipes. However, the partitioning of a WDN, besides increasing energy consumption and economic investment, could deteriorate the hydraulic performance and the reliability of the system because of the closure of pipes. The target of a partitioning layout design is to balance the negative and positive aspects described above [21], ensuring that the minimum required nodal pressure is met, so as to satisfy the water demand of the users and preserve network resilience [21]. This is achieved by defining the proper number, shape and dimension of clusters, by minimizing the number of boundary pipes, and then by optimally locating gate valves and flow meters. WNP is usually carried out in two phases [22, 23]: (a) clustering, aimed at defining the shape and the dimension of the network subsets, balancing the number of nodes of each cluster and minimizing the number of edge-cuts; and (b) dividing, aimed at the physical partitioning of the network, by selecting the pipes along which flow meters or gate valves are to be inserted, minimizing the economic investment and the deterioration of hydraulic performance. The definition of a WNP is therefore a complex challenge for operators, depending on the design choices of both the clustering and the dividing phases.

The number of possible combinations of clusters and gate valves/flow meters grows enormously with the network dimension [24], making it necessary to resort to complex mathematical and computational algorithms.

This paper focuses on the first phase of the partitioning of water networks, i.e. the definition of the clusters, since a good subdivision of the network positively affects the subsequent phase of device insertion, in terms of both economic investment and hydraulic performance. In this respect, the aim of this work is to investigate two of the most widely adopted clustering algorithms, a graph partitioning algorithm with multi-level recursive bisection, implemented in the Metis software [25, 26], and a spectral clustering algorithm based on the normalized cut, NCut [27], in order to establish which of them works better for defining proper groupings (districts) of water distribution networks.

The comparison is made for a real water network which serves Parete, a city near Naples, Italy. The network graph was considered either un-weighted, or weighted with some of the major geometric/hydraulic characteristics, in order to take into account also different operator purposes. The network graph was clustered into \(k=2\) to \(k=10\) clusters with both the clustering algorithms and with the different chosen weights. Finally, some quality and hydraulic indices were used to compare the different obtained clustering layouts.

2 Clustering Algorithms

A Water Distribution Network can be modelled as a graph \(G=(V,E)\), where V is the set of n vertices \(v_{i}\) (or nodes) and E is the set of m edges \(e_{ij}\) (or links).

A k-way graph clustering problem consists in grouping the set of vertices V of G into k subsets \(P_1, P_2, \ldots, P_k\) such that:

  • \(\bigcup \limits _{c=1}^k P_c = V\) (the union of all clusters \(P_{c}\) must contain all the vertices \(v_{i}\));

  • \(P_c \cap P_t = \emptyset \) for \(c \ne t\) (each vertex can belong to only one cluster \(P_c\));

  • \(\emptyset \subset P_c \subset V\) (at least one vertex must belong to a cluster and a cluster cannot contain all vertices);

  • \(1<k<n\) (the number k of clusters must be greater than one and smaller than the number n of vertices).
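The four conditions above can be checked mechanically. The following is a minimal sketch in Python, assuming clusters are given as a list of vertex sets; the function name `is_valid_clustering` is hypothetical and is not part of Metis or NCut.

```python
def is_valid_clustering(vertices, clusters):
    """Check the four k-way clustering conditions on a list of vertex sets."""
    vertices = set(vertices)
    k, n = len(clusters), len(vertices)
    # 1 < k < n: more than one cluster, fewer clusters than vertices
    if not 1 < k < n:
        return False
    # every cluster is a non-empty proper subset of V
    if any(not c or not set(c) < vertices for c in clusters):
        return False
    # pairwise disjoint: cluster sizes add up to the union size only if no vertex is shared
    union = set().union(*map(set, clusters))
    disjoint = sum(len(set(c)) for c in clusters) == len(union)
    # the union of all clusters must cover V
    return disjoint and union == vertices
```

For example, `is_valid_clustering(range(5), [{0, 1}, {2, 3, 4}])` holds, while a layout that shares vertex 1 between two clusters, or that leaves a vertex unassigned, is rejected.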

The graphs are considered undirected and weighted. Link weights express the strength of the links between elements, in terms of proximity and/or similarity, and are indicated with a non-negative weight: \(\varepsilon _{ij} > 0\) if i and j are linked, \(\varepsilon _{ij} = 0\) otherwise. Node weights express some node characteristics (e.g. age, demand, etc.) and are indicated with a positive weight \(\varpi _{i}\), with \(i \in V\). The edge-cut of a clustering, denoted by \(N_{EC}\), is equal to the sum of the weights (\(\varepsilon _{ij}\)) of the edges (\(e_{ij}\)) whose incident vertices belong to different clusters:

$$\begin{aligned} N_{EC} = \sum \limits _{i\in P_c ,\, j\notin P_c } e_{ij} \quad \text {or} \quad \sum \limits _{i\in P_c ,\, j\notin P_c } \varepsilon _{ij} \end{aligned}$$
(1)
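As an illustration of Eq. (1), the edge-cut can be computed by summing the weights of the edges whose end nodes carry different cluster labels. This is a minimal sketch; the data layout (a dictionary of undirected edges and a dictionary of node labels) and the small ring network are assumptions made for the example.

```python
def edge_cut(edges, labels):
    """Sum the weights of edges whose end nodes lie in different clusters.

    edges  : dict mapping an undirected edge (i, j) to its weight eps_ij
    labels : dict mapping each node to its cluster index
    """
    return sum(w for (i, j), w in edges.items() if labels[i] != labels[j])


# Hypothetical 4-node ring 0-1-2-3-0 with unit weights (the un-weighted case)
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
labels = {0: 0, 1: 0, 2: 1, 3: 1}
print(edge_cut(edges, labels))  # edges (1, 2) and (3, 0) are cut -> 2.0
```

With unit weights the result is simply the number of boundary edges; with \(\varepsilon _{ij}\) set to a pipe property it becomes the weighted edge-cut.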

In this paper two algorithms were used to cluster the graph of the water distribution network of Parete: (a) Metis, based on a graph partitioning technique, and (b) NCut, based on spectral clustering.

(a) Metis. The graph partitioning algorithm is based on the Multi-Level Recursive Bisection (MLRB) implemented in the Metis software by Karypis and Kumar [25, 26, 28]. Metis belongs to the class of multi-level partitioning techniques. Graph clustering starts by constructing a sequence of successively smaller (coarser) graphs, and a bisection is applied to the coarsest graph. Subsequently, the bisection is projected back onto the finer graph of the next level. At each level, an iterative refinement algorithm is used to further improve the bisection.
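To give a feeling for the coarsening step, the sketch below performs one greedy heavy-edge matching pass, in which each unmatched node is merged with the unmatched neighbour connected by the heaviest edge. It is a much-simplified illustration and not the Metis implementation; the function name, the adjacency-dictionary layout and the fixed visiting order are assumptions.

```python
def heavy_edge_matching(adj):
    """One coarsening pass: map every node to the id of its coarse node.

    adj : dict node -> dict neighbour -> edge weight (undirected graph)
    """
    coarse = {}
    next_id = 0
    for u in adj:  # visiting order is fixed here; real codes usually randomize it
        if u in coarse:
            continue
        # unmatched neighbours of u, as (weight, node) pairs
        free = [(w, v) for v, w in adj[u].items() if v not in coarse and v != u]
        coarse[u] = next_id
        if free:
            _, v = max(free)  # the heaviest edge wins
            coarse[v] = next_id
        next_id += 1
    return coarse


# Hypothetical path 0-1-2-3 with edge weights 1, 5, 1
adj = {0: {1: 1}, 1: {0: 1, 2: 5}, 2: {1: 5, 3: 1}, 3: {2: 1}}
print(heavy_edge_matching(adj))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

Repeating such passes yields the sequence of coarser graphs on which the initial bisection is computed.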

The goal of Metis is to compute a k-way graph partitioning by minimizing \(N_{EC}\) (Eq. 1) under the constraint \(I_B \le 1+\varepsilon \), where \(\varepsilon \) is a small positive tolerance (not to be confused with the link weights \(\varepsilon _{ij}\)) and \(I_{B}\) is the balance index defined as follows:

$$\begin{aligned} I_B = \frac{k \cdot \max \left( {d_p } \right) }{\sum \limits _{i=1}^n \varpi _i } \end{aligned}$$
(2)

where \(\max \left( {d_p } \right) \) can be the maximum number of nodes or the maximum sum of the vertex weights among the k subsets obtained by the k-way partitioning algorithm.
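Equation (2) can be illustrated with a few lines of Python. This is a minimal sketch in which \(d_p\) is taken as the total node weight of cluster p; setting all node weights to one gives the node-count version. The function name and data layout are assumptions.

```python
def balance_index(node_weights, labels, k):
    """I_B = k * max_p(d_p) / sum_i(w_i), with d_p the total node weight of cluster p."""
    load = [0.0] * k
    for node, w in node_weights.items():
        load[labels[node]] += w
    return k * max(load) / sum(node_weights.values())


# Hypothetical 6-node example with unit node weights
weights = {i: 1.0 for i in range(6)}
even = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
skewed = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
print(balance_index(weights, even, 2))    # perfectly balanced layout
print(balance_index(weights, skewed, 2))  # 4 nodes vs 2 nodes
```

A perfectly balanced layout gives \(I_B = 1\); the more the largest cluster exceeds the average cluster weight, the larger \(I_B\) becomes.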

(b) NCut. The minimum-cut criterion for graph clustering refers to a class of techniques which divide a graph into subgraphs such that the number of boundary edges is minimized. To avoid the unnatural bias of the minimum-cut criterion towards splitting off small subgraphs, Shi and Malik [27] proposed the Normalized Cut (NCut), which computes the cut cost as a fraction of the total edge connections to all the nodes in the graph. A generalized eigenvector decomposition was used to speed up the computation. For this reason, graph clustering algorithms that rely on the eigenvector decomposition of a Laplacian matrix \(L_{n\times n}\) [29] are also called spectral clustering.

The Laplacian matrix is the difference \(L=D_{K}-A\) between the diagonal matrix \(D_{K}=\mathrm {diag}(K_{i})\) of the node connectivity degrees (where \(K_{i}\) is the degree of node \(v_{i}\)) and the adjacency matrix A (where elements \(a_{ij}=a_{ji}=1\) indicate that there is a link between nodes i and j, and \(a_{ij}=a_{ji}=0\) otherwise). The Normalized Cut exploits the properties of the normalized Laplacian matrix, defined as \(L_{rw}=D_{K}^{-1}L\). The NCut algorithm seeks to minimize the following relationship:

$$\begin{aligned} \frac{N_{EC} }{\mathop \sum \nolimits _{c=1}^k vol\left( {P_c } \right) } \end{aligned}$$
(3)

where \(vol\left( {P_c } \right) \) is the sum of the degrees or weighted degrees of all nodes that belong to the c-th cluster and \(N_{EC}\) is the edge-cut between the clusters. The minimum of Eq. (3) is achieved if all \(vol\left( {P_c } \right) \) coincide or, in other words, if \(vol\left( {P_1 } \right) \cong vol\left( {P_2 } \right) \cong \ldots \cong vol\left( {P_k } \right) \); in this way the NCut algorithm tries to obtain k balanced clusters.
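The quantities introduced for NCut can be sketched in plain Python. The first function builds \(L\) and \(L_{rw}\) from a symmetric weight (or adjacency) matrix; the second evaluates the normalized cut in the form commonly found in the spectral clustering literature, i.e. the sum over clusters of the cut weight divided by the cluster volume, which is an assumption here and may differ in form from Eq. (3). Function names and the nested-list layout are hypothetical, and the sketch assumes no isolated nodes, no self-loops and no empty clusters.

```python
def laplacians(W):
    """Return (L, L_rw) for a symmetric weight matrix W given as nested lists."""
    n = len(W)
    deg = [sum(row) for row in W]  # (weighted) node degrees K_i
    # L = D_K - W
    L = [[(deg[i] if i == j else 0.0) - W[i][j] for j in range(n)]
         for i in range(n)]
    # L_rw = D_K^{-1} L: divide every row by the degree of its node
    L_rw = [[L[i][j] / deg[i] for j in range(n)] for i in range(n)]
    return L, L_rw


def normalized_cut(W, labels, k):
    """Sum over clusters of cut(P_c, rest) / vol(P_c)."""
    n = len(W)
    vol = [0.0] * k
    cut = [0.0] * k
    for i in range(n):
        for j in range(n):
            vol[labels[i]] += W[i][j]       # contributes to the degree of node i
            if labels[i] != labels[j]:
                cut[labels[i]] += W[i][j]   # edge leaving the cluster of node i
    return sum(c / v for c, v in zip(cut, vol))


# Hypothetical un-weighted path 0-1-2-3
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(normalized_cut(W, [0, 0, 1, 1], 2))  # balanced split
print(normalized_cut(W, [0, 1, 1, 1], 2))  # single node split off
```

Both splits cut exactly one edge, but the balanced split has the lower normalized cut, which is the behaviour that distinguishes NCut from the plain minimum-cut criterion.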

It is worth highlighting that both algorithms were run with different weights on the graph edges. Specifically, taking into account the strength of the connections between nodes leads to different cluster layouts [30], which can meet different needs of the operators, offering specific solutions according to the peculiarities of the problem to be solved. When weights are introduced for the pipes, the adjacency matrix A is replaced by the weight matrix W in the calculation of the Laplacian matrix and hence of the corresponding spectrum. The performance of the two clustering techniques is evaluated as the weight changes.

Several metrics were used to compare the performance of the two adopted clustering techniques; specifically, two clustering quality indices (a) and two hydraulic indices (b) were evaluated for each cluster layout combination:

(a) clustering quality indices:

  • \(N_{EC}\): it represents the total number of boundary pipes;

  • \(I_{B}\): the balance index (as described above);

(b) hydraulic indices:

  • \(C_{EC}\): it represents the sum of the ratios d / l of all boundary pipes, and can be considered as a proxy of pipe conductance;

  • \(R_{EC}\): it represents the sum of the hydraulic resistances \(l/d^{5}\) of all boundary pipes.
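The boundary indices can be evaluated in a single pass over the pipes. The following is a minimal sketch, assuming each pipe is given as a tuple (i, j, d, l) with diameter d and length l in consistent units; the function name and the three-pipe example are hypothetical.

```python
def boundary_indices(pipes, labels):
    """Return (N_EC, C_EC, R_EC) computed over the boundary pipes of a clustering.

    pipes  : list of (i, j, d, l) tuples, with diameter d and length l
    labels : dict mapping each node to its cluster index
    """
    n_ec, c_ec, r_ec = 0, 0.0, 0.0
    for i, j, d, l in pipes:
        if labels[i] != labels[j]:  # pipe crosses a cluster boundary
            n_ec += 1
            c_ec += d / l           # conductance proxy
            r_ec += l / d ** 5      # hydraulic resistance proxy
    return n_ec, c_ec, r_ec


# Hypothetical three-pipe line with diameters and lengths in metres
pipes = [(0, 1, 0.1, 100.0), (1, 2, 0.2, 50.0), (2, 3, 0.1, 100.0)]
labels = {0: 0, 1: 0, 2: 1, 3: 1}
print(boundary_indices(pipes, labels))  # only pipe (1, 2) is a boundary pipe
```

Keeping the three sums together makes it easy to see how a single layout can score well on \(N_{EC}\) while scoring poorly on \(C_{EC}\) or \(R_{EC}\).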

3 Case Study

The two described clustering algorithms were compared for a real water distribution network, which serves the city of Parete, near Naples, Italy, with 10,800 inhabitants. The water network has two sources, \(m=282\) links and \(n=184\) nodes; its main topological characteristics are reported in Table 1.

Table 1. Topological characteristics of the WDN of Parete

From a topological point of view, in agreement with most large-scale real networks, it is a sparse network with a link density value \(q=0.017\). Furthermore, the average node degree \(K=3.05\) is small, since in a WDN the number of edges that can be connected to a node is limited by physical constraints [16, 17]. The small average path length \(APL=8.80\) shows that the graph has a cohesive and robust behaviour [27], an important aspect for an efficient water flow. Regarding the main spectral measurements, the spectral gap \(\Delta \lambda = 0.062\) and the algebraic connectivity \(\lambda _{2} = 0.021\) assume low values, showing that the graph can be easily decomposed into isolated parts (clusters) [31, 32].
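As a quick check of the reported sparsity, and assuming the usual definition of link density for an undirected simple graph (the ratio between the number of links and the number of possible node pairs, which is not stated explicitly in the text), the values \(m=282\) and \(n=184\) give:

$$\begin{aligned} q = \frac{2m}{n\left( {n-1} \right) } = \frac{2\cdot 282}{184\cdot 183} = \frac{564}{33672} \approx 0.017 \end{aligned}$$

That is, fewer than 2% of the possible node pairs are directly connected by a pipe.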

The results of the analysis for all weight/algorithm combinations are reported in Table 2 for Metis and in Table 3 for NCut. The graph of the network of Parete was subdivided into \(k=2\) to \(k=10\) clusters, analyzing the performance of the two clustering techniques as the number of clusters changed. The clustering phase was carried out considering the network graph un-weighted (\(\varepsilon _{ij}=1\)), as well as weighted with some of the major geometric/hydraulic characteristics: namely, the ratio between diameter and length of pipes (\(\varepsilon _{ij}=d/l\)) and the hydraulic resistance (\(\varepsilon _{ij}=l/d^{5}\)). In order to better compare the two algorithms, Metis was forced to balance the clusters in a similar way to NCut, by using the following node weights:

  • node degree \(\varpi _i =K_i =\mathop \sum \limits _{j=1}^n a_{ij} \), for the un-weighted link;

  • weighted node degree \(\varpi _i =K_{\varpi i} =\mathop \sum \limits _{j=1}^n d_{ij} /l_{ij} \), coupled with the link weight \(\varepsilon _{ij} =d_{ij} /l_{ij} \);

  • weighted node degree \(\varpi _i =K_{\varpi i} =\mathop \sum \limits _{j=1}^n l_{ij} /d_{ij}^5 \), coupled with the link weight \(\varepsilon _{ij} =l_{ij} /d_{ij}^5 \).
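The three node-weight choices above are all weighted degrees, i.e. sums of the weights of the links incident to each node. This can be sketched as follows, assuming pipes are given as (i, j, d, l) tuples; the function name and the `mode` strings are hypothetical labels introduced for the example.

```python
def node_weights(pipes, mode="unweighted"):
    """Weighted degree of each node: sum of the weights of its incident links."""
    link_weight = {
        "unweighted": lambda d, l: 1.0,   # eps_ij = 1
        "d/l": lambda d, l: d / l,        # eps_ij = d / l
        "l/d5": lambda d, l: l / d ** 5,  # eps_ij = l / d^5
    }[mode]
    w = {}
    for i, j, d, l in pipes:
        eps = link_weight(d, l)
        # an undirected link contributes to both of its end nodes
        w[i] = w.get(i, 0.0) + eps
        w[j] = w.get(j, 0.0) + eps
    return w


# Hypothetical triangle of pipes with diameters and lengths in metres
pipes = [(0, 1, 0.1, 100.0), (1, 2, 0.2, 50.0), (2, 0, 0.1, 200.0)]
print(node_weights(pipes))         # plain node degrees
print(node_weights(pipes, "d/l"))  # degrees weighted with d / l
```

Using node weights that match the link weights makes the quantity balanced by Metis play the same role as the cluster volume \(vol(P_c)\) used by NCut.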

Table 2. Performance Indices for Metis algorithm considering the graph unweighted, and weighted with d / l and \(l/d^{5}\), and for \(k=2\) to \(k=10\) clusters

First, the number of edge-cuts \(N_{EC}\) for all groupings from \(k=2\) to \(k=10\) clusters, considering the graph un-weighted and weighted with d / l and \(l/d^{5}\), is shown in Fig. 1. As expected, the number of boundaries increases as the number of clusters increases for both algorithms, although the increasing trends are clearer for NCut than for Metis.

Table 3. Performance Indices for the spectral NCut algorithm considering the graph unweighted and weighted with d / l and \(l/d^{5}\) and for \(k=2\) to \(k=10\) clusters

Fig. 1. Number of edge-cuts \(N_{EC}\) for Metis and NCut algorithms, considering the graph un-weighted and weighted with d / l and \(l/d^{5}\), and from \(k=2\) to \(k=10\) clusters.

In general, Metis provides clustering solutions with a higher number of boundaries than spectral NCut, and this difference between the two algorithms grows as the number of clusters increases for the un-weighted and d / l-weighted graphs. For the \(l/d^{5}\)-weighted graph this difference is evident for any value of k, reaching about 71% for \(k=4\).

Another important aspect is that the number of edge-cuts \(N_{EC}\), for both Metis and NCut, is generally lower if the un-weighted graph is considered, except for \(k=3\), 4 and 5. As the solutions with the smallest values of \(N_{EC}\) are likely to cause less hydraulic deterioration and to have a lower cost for device purchase and installation, they can be considered preferable. However, a small boundary set may present a narrow distribution of pipe diameters, which makes the subsequent dividing phase difficult, since for that phase it is better to have both small pipes (to be closed) and large pipes (to be left open) [30].

The trends of the Balance Index \(I_{B}\) for all clustering from \(k=2\) to \(k=10\), considering the graph un-weighted and weighted with d / l and \(l/d^{5}\), are shown in Fig. 2.

Fig. 2. Balance Index \(I_{B}\) for Metis and NCut algorithms, considering the graph un-weighted and weighted with d / l and \(l/d^{5}\), and from \(k=2\) to \(k=10\) clusters.

Except for NCut with the \(l/d^{5}\)-weighted graph, the value of \(I_{B}\) ranges from 1.00 to 1.60, indicating that all the layouts are satisfactorily balanced. As expected, for both algorithms, the clustering layouts are more balanced when the graph is considered un-weighted. Metis provides clustering solutions with lower values of \(I_{B}\) than NCut, so its clusters are more balanced. The least balanced layouts come from NCut when the graph is weighted with \(l/d^{5}\), and the difference between the two algorithms reaches its highest value (65%) for \(k=4\). The difference between the results of the two algorithms is due to the fact that in Metis \(I_{B}\) is a constraint (it searches the space of solutions for layouts that minimize the number of edge-cuts, or the sum of their weights, in compliance with the constraint on \(I_{B}\)). NCut, instead, provides solutions that simultaneously minimize the number of boundaries (or the sum of their weights) and balance the clusters.

Concerning the index \(C_{EC}\), it is evident from Fig. 3 that, for both algorithms and consistently with the minimized objective function, the lowest values are obtained when the graph is d / l-weighted. Conversely, the highest values are obtained with the \(l/d^{5}\)-weighted graph, as minimizing the resistance of the boundary leads to selecting the pipes with larger diameters as edge-cuts.

Fig. 3. The index \(C_{EC}\) for Metis and NCut algorithms, considering the graph un-weighted and weighted with d / l and \(l/d^{5}\), and from \(k=2\) to \(k=10\) clusters.

In Fig. 3, it is apparent that the clustering layouts obtained with Metis and NCut without weights show d / l values that are intermediate between the lower and higher values obtained for the weighted graphs. The lowest values are obtained with Metis for \(k=2\), 4, 5 and 6, and with NCut for \(k=3\), 7, 8, 9 and 10. Concerning the total resistance \(R_{EC}\), it is evident from Fig. 4 that, as expected, the lowest value is obtained with the weight \(l/d^{5}\).

Fig. 4. The index \(R_{EC}\) for Metis and NCut algorithms, considering the graph un-weighted and weighted with d / l and \(l/d^{5}\), and from \(k=2\) to \(k=10\) clusters.

Fig. 5. WDN of Parete divided into \(k=5\) clusters, for the un-weighted network graph, with (a) Metis clustering and (b) NCut clustering.

For \(R_{EC}\), the difference between the two algorithms reaches 93% for \(k=8\). The highest values are provided by both algorithms for the d / l-weighted graph, as in this case the pipes with smaller d / l ratios are selected as boundaries. The solutions with the lowest resistance are obtained with NCut, while the highest values are obtained with Metis. Also in this case, the clustering layouts obtained without weights show intermediate values compared to the weighted graphs, and the increasing trends are clear for NCut as a consequence of the trends of \(N_{EC}\).

Finally, Fig. 5 reports the clustering layout of the network of Parete with \(k=5\), comparing the results obtained with the two clustering algorithms for the un-weighted graph. The figure shows that the Metis and NCut algorithms provide different clusterings of the network, not only in terms of the indices \(N_{EC}\), \(I_{B}\), \(R_{EC}\) and \(C_{EC}\), but also in terms of the shape of each cluster and the positioning of the boundary pipes.

For the presented case study, and taking into account the indices described above, the NCut algorithm minimizes the number of edge-cuts (both weighted and un-weighted) better than Metis, providing solutions with lower resistance and, probably, with a lower cost for device purchase and installation (because \(N_{EC}\) is significantly smaller). Conversely, Metis provides solutions that are generally more balanced than those of NCut and, consequently, shows a greater ability to identify cluster layouts with a balanced number of nodes, water demand or other characteristics useful for the aims of WNP.

However, this preliminary study focused only on the first phase (clustering) of WNP. To fully compare the solutions found by Metis and NCut and to analyze their consequences on the hydraulic behavior of the clustered network, the second phase (dividing) should also be studied, in which the locations of gate valves and flow meters are selected, affecting the hydraulic performance of the network.

4 Conclusions

The paper presents a comparison between two clustering algorithms, Metis and NCut, applied to the problem of water network partitioning. The first grouping phase is of crucial importance for the hydraulic performance of the system, since it establishes the shape and the dimension of the clusters and the number and the typology of the boundaries between them. The comparison was developed for the case study of a real WDN, whose graph was considered either un-weighted or weighted with some of the major geometric/hydraulic characteristics of the pipes, aiming at establishing how the weight choice influences the clustering for both tested algorithms, and which of them provides the better grouping layout. To evaluate the simulation results, four indices were used. Simulations, obtained with the two algorithms (Metis and NCut), three pipe weights (no weight, d / l and \(l/d^{5}\)) and different numbers of clusters (from 2 to 10), confirm the effectiveness of both algorithms in providing good clustering layouts. In particular, Metis provides solutions that are more balanced than those of NCut in terms of the number of nodes in each cluster, but the latter provides solutions with smaller edge-cut sets, lower resistance of the boundary pipes and lower cost, apparently ensuring clustering layouts that are more convenient from both the hydraulic and the economic point of view.

However, these first results can be confirmed only by applying the procedure to other, larger networks and by also carrying out the dividing phase, which consists in the insertion of flow meters and gate valves. Further studies will also compare other clustering techniques (e.g., Infomap, Louvain, Label propagation) in order to find the best one for water distribution networks.