1 Introduction: background, motivation, and previous work

1.1 Background

Community detection in networks is a popular topic applied in various domains ranging from sociology to biology to computer science. A network is a set of objects, usually referred to as nodes, which are interconnected by pair-wise links. Typical examples are: a network of mutual friendship relations between a set of individuals; a network of enterprises related by mutual supplies; and a network of websites related by mutual visits. When nodes are additionally supplied with a set of features characterizing them, such a network is referred to as a feature-rich (Interdonato et al. 2019), or node-attributed (Bojchevski and Günnemann 2018; Xu et al. 2012), network. In a friendship network, features may characterize individuals' demographics, backgrounds, interests, etc.

A community is a group of relatively densely inter-connected nodes that are similar in the feature space too.

Figure 1a illustrates the concept of a feature-rich network as a formal structure, and Fig. 1b visualizes communities in a network.

Fig. 1

Communities in a feature-rich network: a visualizes the data structure, b presents a network whose nodes are partitioned in communities

A number of papers have been published proposing various approaches to identifying communities in feature-rich networks. A comprehensive, yet concise, review of methods for community detection in feature-rich networks can be found in Chunaev (2020). In this review, all the community detection methods are classified according to the stage of the process of finding communities at which the two data sources, network and features, are merged together. The merger may occur before the process begins (early fusion), within the process (simultaneous fusion), or after the process (late fusion).

1.2 Motivation

The subject of our interest belongs to the simultaneous fusion stage, at which both data sources, network links and feature values, are available for investigation. We are going to develop a mathematical model of the data that helps us detect communities in the network. Among mathematical data modeling approaches, we distinguish between theory-driven and data-driven approaches. Theory-driven approaches involve a model of the world leading to a probabilistic distribution whose parameters can be recovered from the data. In contrast, data-driven approaches involve no world models but rather focus on modeling the data as is.

Our data-driven model conventionally assumes that there is a hidden partition of the node set in nonoverlapping communities, supplied with hidden parameters: the average link intensities in the network and the community central points in the feature space. These are used at the “decoding” stage so that the residuals of the data modeling equations are minimized according to the least squares criterion. Such an approach is referred to as the data recovery approach in Mirkin (2012); in the neural network domain, it is referred to as an auto-encoder (Ng 2011). Unsupervised data analysis methods such as K-means clustering and principal component analysis naturally fall within this approach, as described in Mirkin (2012).

As usual, the least squares criterion leads to computationally hard problems which are tackled with various heuristics. In particular, we follow a strategy of sequentially extracting clusters one by one. This strategy naturally fits the additive structure of the least squares criterion. Previously, this strategy has been applied separately, either to network (entity-to-entity similarity) data only or to feature data only. Applied to similarity data, it was first described in English in Mirkin (1987) and experimentally validated in papers like Mirkin and Nascimento (2012). Similar constructions, for dissimilarity data, have been developed in Vichi (2008). Applied to feature space data, the strategy was experimentally validated in Chiang and Mirkin (2010), Amorim and Mirkin (2012).

Recently, we applied this approach to network data and feature space data combined (Shalileh and Mirkin 2020). This paper significantly extends that work toward a more comprehensive analysis of the network structure.

First of all, we introduce and test here two different modes of using the network link structure. Specifically, we now distinguish between summable and nonsummable modes. The former corresponds to the case in which all link weights are measured in the same scale, so that they are comparable and summable across the entire data table. In the nonsummable mode, each node's links are considered as measured in different scales.

This assumption points to a not uncommon data type emerging, for example, in some psychological experiments in which the entities are individuals or cognitive subsystems with different scales of individual judgments. Whenever a node's links are measured independently of the other nodes, there is a potential for the weights to be nonsummable. To give an example of nonsummable links, consider two sets of internet sites: sites in one set provide classical music education, and sites in the other sell goods. These two sets would usually differ much in at least two aspects. First, the numbers of visitors are of different orders: massive at the selling sites, and smaller by several orders of magnitude at the classical music education sites. Second, the time spent would, in general, differ much between the two sets: seconds at purchasing goods versus hours at listening to music. As we will see, taking the nonsummability phenomenon into account, even in the context of ordinary feature-rich networks, brings advantages in at least two aspects: the speed of computation and the quality of cluster recovery at some data types.

We apply the least squares approach to both cases, leading to two different versions of the method, and conduct a comprehensive set of experiments to validate and compare the performance of the newly proposed algorithms. Second, we expand both the list of real-world data sets and the list of algorithms under comparison. Specifically, we add to our collection of small-sized data sets a medium-sized data set with 3490 nodes, which may add substance to our claim that our method works well for both types of data sets. Also, we found a recent heuristic algorithm, EVA (Citraro and Rossetti 2020), with a publicly available code, so that we were able to add it to the comparison. We inserted a section discussing the complexity of our algorithms and provided a comparison of the timing taken by all the algorithms under consideration. It appears that the algorithm in the nonsummable mode works almost as fast as the fastest in our sample.

It is worth adding that our method:

  • Is data driven;

  • Admits either quantitative or categorical features or both;

  • Involves an explicit relative weighting of the two data sources in the fitting criterion;

  • Assumes that hidden communities are crisp and nonoverlapping;

  • Determines the number of communities automatically.

Our method finds communities one-by-one, which leads to a natural way for selecting the number of clusters. All procedures involved are finite and, thus, always convergent. Our experiments show that this approach is able to recover hidden clusters in feature-rich networks reasonably well. Moreover, it is competitive against existing state-of-the-art approaches.

The rest of the paper is organized as follows. Section 1.3 reviews the previous work. We describe our models and algorithms in Sect. 2. Section 3 is devoted to the setting of our experiments for validating the algorithms and comparing them with some state-of-the-art algorithms. It presents: (a) the competing algorithms; (b) the data sets, both real-world and artificially generated; (c) the criteria for assessing the quality of experiments. In Sects. 4 and 5 we describe the results of our experiments. We draw conclusions in Sect. 6.

1.3 Previous work

Within the simultaneous fusion approach (Chunaev 2020) literature, we consider three directions: (a) heuristics, (b) theory-driven modeling, and (c) data-driven modeling approaches.

We are going to briefly discuss these three in the remainder of this section after a mention of some classical clustering methods.

These classical methods include the normalized cut and related spectral clustering (Shi and Malik 2000), as well as the modularity-based method (Newman 2006; Dang and Viennet 2010). The Louvain algorithm (Blondel et al. 2008) detects communities by locally maximizing the modularity score. The most recent reviews of research on community detection in networks with no feature data include a comprehensive monograph (Doreian et al. 2020) as well as thought-provoking reviews (Javed et al. 2018; Hoffman et al. 2018). A review of the theory-driven approach in the analysis of networks (Goldenberg et al. 2010) should be mentioned too.

Among heuristic approaches, one can distinguish those in which criteria of the classical clustering algorithms are modified to account for the presence of two data sources. The paper by Ye et al. (2017) modifies the normalized cut criterion by adding the so-called unimodality compactness to reflect the homogeneity of attributes within the community. A modified modularity criterion and a corresponding method are developed in Sánchez et al. (2015). A modified Louvain method is proposed and tested in Combe et al. (2015). Another popular direction of development in this area is the so-called network embedding (see Chang et al. 2015; Cavallari et al. 2017; Sun et al. 2020). In these approaches, both the network and feature data are approximated in a low-dimensional Euclidean vector space.

The theory-driven approach involves both the maximum likelihood and Bayesian criteria for fitting probabilistic models. Many methods in this category involve stochastic block models (SBM), which have been successfully used for detecting communities in conventional networks. In Stanley et al. (2019), network structures are modeled with an SBM, while the continuous features are modeled with a Gaussian mixture model. The Blockmodel Entropy Significance Test (BESTest) (Peel et al. 2017) evaluates how relevant a metadata partition is to the network structure. The BESTest works by first dividing the network's nodes according to the feature labels and then computing the entropy of the SBM which best corresponds to that partition.

Methods in Xu et al. (2012), Newman and Clauset (2016), Bojchevski and Günnemann (2018) are based on Bayesian inference. In Yang et al. (2013), the authors proposed a clustering criterion to statistically model the interrelation between the network structure and the node attributes.

As to the data-driven modeling approach, research in this direction seems rather scarce. Some authors propose the so-called non-negative matrix factorization (NNMF), a technique to approximate the data via factorization of the data matrix into non-negative matrices of simpler structure. In the papers Wang et al. (2016), Cao et al. (2019), combined criteria for such an approximation, and methods for suboptimally solving them, are proposed. The criteria are based on the least-squares approach. However, in contrast to our line of thinking, these criteria involve some derived data rather than the original ones. A different approach is described in Akoglu et al. (2012). Here, the data are summarized as given; the quality, however, is scored according to the principle of minimum description length (MDL), so that the number of bits in the coding of the summary is minimized.

One may say that our approach combines aspects of the two approaches above: a straightforward modeling of the data as is, like in Akoglu et al. (2012), and a least-squares criterion, like in Wang et al. (2016), Cao et al. (2019).

2 Methodology

2.1 Data recovery model for community detection

Consider a network with features at the nodes, \(A=\{P, Y \}\), over an entity set I. Here I is a set of network nodes of cardinality \(|I|=N\); \(P=(p_{i j})\) is an \(N\times N\) matrix of mutual link weights between nodes \(i,j\in I\); and \(Y=(y_{{i v}})\) is an \(N\times V\) matrix of feature values, so that entry \(y_{i v}\) is the value of feature \(v=1,2,\ldots ,V\) at node \(i\in I\). This definition covers a wide range of networks, including, for example, a flat network in which inter-node links simply exist or not, but have no associated weights. Such a network can be represented by matrix P at which \(p_{i j}=1\) if a link between i and j exists, and \(p_{i j}=0\) if not.

To build a data-driven community model, let us specify the following notation.

A community, or cluster, \(S\subset I\) is represented by a binary \(N\times 1\) membership column vector, \(s=(s_i)\) in which \(s_i=1\) if \(i\in S\), and \(s_i=0\), otherwise (\(i=1,2,\ldots , N\)).

In the feature space, community S can be represented by a V-dimensional point \(c=(c_v)\), which is a standard to which all the community members relate.

At the network link data, there may be at least two possible assumptions:

  1. (a)

    AS: Summable weights

    This assumption means that the weights \(p_{ij}\) are comparable and summable across all the matrix P. In this case, there should be a single intensity weight \(\lambda\) to relate the weights measurement scale to S. Specifically, each within-community weight \(p_{ij}\), \(i,j\in S\), in this case, should be large and approximately equal to the intensity \(\lambda\). The between-community links, ideally, should be all zero.

  2. (b)

    AN: Nonsummable weights

    Under this assumption, weights \(p_{ij}\) in any column j are considered incomparable to weights \(p_{ij'}\) in any different column \(j'\ne j\), \(i,j\in I\). Therefore, at each column \(j\in I\) a specific intensity weight \(\lambda _j\) is assumed, so that, for any \(i\in S\) the link weights \(p_{ij}\) tend to be equal to \(\lambda _j\).

Extending these definitions to a partition, \(S=\{S_1, S_2,\ldots , S_K\}\), of I in K nonoverlapping parts/communities, S can be represented by a binary matrix \(s=(s_{ik})\) so that \(s_{ik}=1\) if \(i\in S_k\), and \(s_{ik}=0\), otherwise.

To relate any partition to the feature data, we assume that a standard point \(c_k=(c_{kv})\) is specified for each community \(S_k\), \(k=1,2,\ldots , K\), so that approximate equations hold:

$$\begin{aligned} y_{i v} = \sum _{k=1}^{K} c_{k v} s_{ik} + f_{iv}, i\in I, v\in V. \end{aligned}$$
(1)

Since communities \(S_k\) do not overlap, the sum in the equations plays a rather nominal role: for any \(i\in I\), \(y_{iv}\) is equal to \(c_{kv} +f_{iv}\) just for that k at which \(i\in S_k\). The value \(f_{iv}\) expresses the extent of approximation and should be made as small as possible.

To approximate the network part of the data, we assume either a total intensity weight \(\lambda _k\) for community \(S_k\), under the summability assumption AS, or column-dependent intensity weights \(\lambda _{kj}\), under the nonsummability assumption AN (\(k=1,2,\ldots K; j\in I\)). Then the following equations should hold:

$$\begin{aligned} p_{i j} = \sum _{k=1}^K \lambda _{k} s_{ik} s_{jk} + e_{i j}, i,j\in I, \end{aligned}$$
(2)

at the AS, and

$$\begin{aligned} p_{i j} = \sum _{k=1}^K \lambda _{kj} s_{ik} + e_{i j}, i,j\in I. \end{aligned}$$
(3)

at the AN.

Similarly, the sums in these equations are purely nominal. At the AS, they just express that \(p_{ij}=\lambda _k\) for all \(i,j\in S_k\) (\(k=1,2,\ldots , K\)) or \(p_{ij}=0\), otherwise, up to the residual \(e_{ij}\), of course. At the AN, \(p_{ij}=\lambda _{kj}\) for \(i\in S_k\) and any \(j\in I\), up to small residual \(e_{ij}\) again. One may consider that at the AN assumption, the columns \(j\in I\) play roles of features.

By using the least-squares approach, we formulate the problem of finding a hidden membership matrix \(s=(s_{ik})\), community centers \(c_k\), and intensity weights \(\lambda _k\) or \(\lambda _{kj}\), as that of minimizing the sum of squared residuals:

  • at AS assumption:

    $$\begin{aligned} \begin{aligned} F_{\mathrm{AS}}(\lambda _k, s_{k}, c_{k})&= \rho \sum _{k=1}^{K} \sum _{i, v} (y_{iv} - c_{k v} s_{i k})^2 \\&\quad + \xi \sum _{k=1}^{K} \sum _{i, j}( p_{i j} -\lambda _{k} s_{ik} s_{jk})^2 , \end{aligned} \end{aligned}$$
    (4)
  • at AN assumption:

    $$\begin{aligned} \begin{aligned} F_{\mathrm{AN}}(\lambda _k, s_{k}, c_{k})&= \rho \sum _{k=1}^{K} \sum _{i, v} (y_{iv} - c_{k v} s_{i k})^2 \\&\quad + \xi \sum _{k=1}^{K} \sum _{i , j}( p_{i j} -\lambda _{kj} s_{ik})^2 . \end{aligned} \end{aligned}$$
    (5)

The factors \(\rho\) and \(\xi\) in Eqs. (4) and (5) are expert-driven constants to balance the relative weights of the two sources of data, network links and feature values.

Since vectors \(s_k=(s_{ik})\) (\(k=1, 2,\ldots , K\)) correspond to a partition, they are mutually orthogonal. That means that for any specific i, \(s_{ik}\) is zero for all k’s except one: that one k for which \(S_k\) contains i. As a result, each of the sums over k in the models relates to a single summand, meaning that the operation of summation over k may be applied outside of the parentheses in Eqs. (4) and (5).

2.2 The iterative extraction approach

The problems of optimizing criteria (4) and (5) are computationally intensive and cannot be solved exactly in a reasonable time. Therefore, various heuristic strategies can be explored to solve them locally or approximately. We are going to exploit a doubly greedy approach of sequential extraction (Mirkin 2008). This approach can be applied here because the criteria to optimize are additive. According to this approach, parts \(S_k\) of the partition S are sought not simultaneously but one by one, sequentially, in a greedy manner. That is, a subset of I to serve as \(S_k\) at \(k=1\) is found to minimize the part of the criterion related to \(S_1\). Specifically, denote an individual community by \(T\subseteq I\); its membership vector by \(t=(t_i)\), so that \(t_i=1\) if \(i\in T\) and \(t_i=0\), otherwise; its center in the feature space, by c; and the corresponding intensity weight by \(\lambda\) (the index k has been removed). Then the extent of fit between the community and the data set, according to criteria (4) and (5), is

$$\begin{aligned} f_{\mathrm{AS}}(\lambda , c_{v}, t_{i})= & {} \rho \sum _{i, v} (y_{i v} - c_{v} t_{i})^2 \nonumber \\&+\xi \sum _{i,j}( p_{i j} -\lambda t_{i} t_{j})^2 \end{aligned}$$
(6)

at the assumption AS, or

$$\begin{aligned} f_{\mathrm{AN}}(\lambda _j, c_{v}, t_{i})= & {} \rho \sum _{i, v} (y_{i v} - c_{v} t_{i})^2 \nonumber \\&+\xi \sum _{i,j}( p_{i j} -\lambda _j t_{i})^2 \end{aligned}$$
(7)

at the assumption AN.

A T locally or approximately minimizing the corresponding criterion (6) or (7) is taken as the first part of partition S, \(S_1\). This \(S_1\) is removed from I, and the next part, \(S_2\), is sought in the same way over the residual entity set \(I\leftarrow I-S_1\). This continues until a pre-specified stopping criterion is reached, say, when the residual set I becomes empty.

Given the data matrices, consider a method, Ext(D), for extracting a subset \(T\subseteq D\) from any \(D\subseteq I\), together with some related quantitative characteristics \(\alpha\), so that \((T, \alpha )=\)Ext(D). Of course, P and Y remain the only data sources used in Ext. A greedy Sequential Extraction procedure SE can be formulated as follows:

SE algorithm

Input: set I and data matrices P and Y.

Output: partition \(S=\{S_1, S_2,\ldots , S_K\}\) of I in nonintersecting parts (communities) \(S_k\), as well as their characteristics \(\alpha _k\), \(k=1,2,\ldots , K\), where \(K>0\) is an integer determined as a result of running the algorithm.

Step 1 Set \(k=1\), \(D=I\).

Step 2 Apply \((T, \alpha )=Ext(D)\) and set \(S_k=T\), \(\alpha _k=\alpha\).

Step 3 Redefine \(D=D-S_k\). If \(D=\emptyset\) is true, set \(K=k\) and stop. Otherwise, define \(k=k+1\) and go to Step 2.
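To make the loop structure concrete, here is a minimal Python sketch of SE; the extraction routine ext and all identifiers are illustrative assumptions for this sketch, not the authors' released code.

```python
# A minimal sketch of the SE loop; ext(D, P, Y) -> (T, alpha) stands for
# any extraction routine returning a community T (a set of nodes) and its
# characteristics alpha, e.g., the FNAC algorithm described below.

def sequential_extraction(I, P, Y, ext):
    D = set(I)                    # Step 1: residual entity set
    communities, params = [], []
    while D:
        T, alpha = ext(D, P, Y)   # Step 2: extract one community from D
        communities.append(T)
        params.append(alpha)
        D -= T                    # Step 3: remove it and continue on the rest
    return communities, params    # K = len(communities) emerges automatically
```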

Within this greedy strategy, at its k-th step (\(k=1,2,\ldots , K\)), we use one more greedy procedure for obtaining a (locally) optimal part \(T=S_k\) and its quantitative characteristic \(\alpha _k\). According to this procedure, the set \(S_k\), with its quantitative characteristics \(c_k, \lambda _k\) at AS, or \(c_k, \lambda _{kj}\) at AN, is found not in one go, but by greedily adding elements of I to \(S_k\) one by one. The additive structure of criteria (6) and (7) above allows us to express them using contributions to the data scatter, which, to an extent, guides the process, as explained below. Besides its computational simplicity, the sequential extraction approach has some theoretical and practical advantages.

One of the theoretical advantages is a Pythagorean decomposition of the data scatter—this allows scoring the contribution of various elements of found solutions to the data scatter, which is helpful for interpretation (Mirkin 2012). Among practical advantages is the competitiveness of the approach regarding the quality of cluster recovery against other computational procedures (see, for example, experimental results of realizations of the doubly greedy strategy in different situations in Chiang and Mirkin (2010), Mirkin (2012), Nascimento et al. (2015)).

To apply this strategy here, denote the indicator vector of a community T by \(t=(t_i)\); its center in the feature space, by \(c=(c_v)\); and the corresponding intensity weights by \(\lambda\) and \(\lambda _j\) depending on the assumption, AS or AN, respectively (the index k is removed because it is not needed here).

Consider three individual items constituting squared error criteria (6) and (7):

  1. (a)

    The fit between the feature data and the community and its standard point:

    $$\begin{aligned} F_Y( c, t) = \sum _{i, v} (y_{i v} - c_{v} t_{i})^2 \end{aligned}$$
    (8)
  2. (b)

    The fit between the AS community model and network data:

    $$\begin{aligned} F_{\mathrm{PS}} (\lambda , t)=\sum _{i,j}( p_{i j} -\lambda t_{i} t_{j})^2 , \end{aligned}$$
    (9)
  3. (c)

    The fit between the AN community model and network data:

    $$\begin{aligned} F_{\mathrm{PN}} (\lambda , t)=\sum _{i,j}( p_{i j} -\lambda _j t_{i})^2 . \end{aligned}$$
    (10)

The total goodness of fit measure is either \(f_{\mathrm{AS}}=\rho F_Y+\xi F_{\mathrm{PS}}\) (in criterion (6)) or \(f_{\mathrm{AN}}=\rho F_Y+\xi F_{\mathrm{PN}}\) (in criterion (7)). Recall that \(\rho\) and \(\xi\) are weights to balance two data sources, the features and the links, respectively.

At a specified subset \(T\subseteq I\), to minimize criteria (6) and (7) regarding the quantitative characteristics \(c_v\), \(\lambda\), \(\lambda _j\), one may separately minimize individual parts (8) over \(c_v\), (9) over \(\lambda\), and (10) over \(\lambda _j\) because of the additive structure of criteria (6) and (7).

Since each of these three is quadratic regarding the respective numerical characteristic \(c_v\), \(\lambda\), \(\lambda _j\), the optimal solutions can be found from the first-order optimality conditions. Let us take the derivatives of \(F_Y\) with respect to \(c_v\), \(F_{\mathrm{PS}}\) with respect to \(\lambda\), and \(F_{\mathrm{PN}}\) with respect to \(\lambda _j\):

$$\begin{aligned}&\frac{\partial {F_Y}}{\partial {c_{v}}} = 2\sum _{i}(y_{iv} - c_{v}t_i) (-t_{i}), \end{aligned}$$
(11)
$$\begin{aligned}&\frac{\partial {F_{\mathrm{PS}}}}{\partial {\lambda }} = 2\sum _{i,j}( p_{ij} -\lambda t_{i} t_{j}) (- t_{i} t_{j}). \end{aligned}$$
(12)
$$\begin{aligned}&\frac{\partial {F}_{\mathrm{PN}}}{\partial {\lambda _j}} = 2\sum _{i}( p_{ij} -\lambda _j t_{i}) (- t_{i}). \end{aligned}$$
(13)

Equating each of these to zero yields, respectively, the equations:

$$\begin{aligned}&\sum _{i} y_{iv} t_{i} = c_{v} \sum _{i} t_{i}^2 , \end{aligned}$$
(14)
$$\begin{aligned}&\sum _{i,j} p_{ij} t_{i}t_{j} = \lambda \sum _{i} t_{i}^2 \sum _{j} t_{j}^2 , \end{aligned}$$
(15)

and

$$\begin{aligned} \sum _{i} p_{i j} t_{i} = \lambda _{j} \sum _{i} t_{i}^2. \end{aligned}$$
(16)

Since \(t_i\) is 1/0 binary, equality \(t_i^2=t_i\) holds. Thus, \(\sum _{i} t_{i}^2=\sum _{j} t_{j}^2=\sum _{i} t_{i}=|T|\). Therefore, these equations can be equivalently reformulated as follows:

$$\begin{aligned} c_{v}= & {} \frac{\sum _{i} y_{iv} t_{i}}{|T|}=\frac{\sum _{i\in T}y_{iv}}{|T|}, \end{aligned}$$
(17)
$$\begin{aligned} \lambda= & {} \frac{\sum _{i,j} p_{i j} t_{i} t_{j}}{|T|^2}= \frac{\sum _{i,j\in T}p_{i j}}{|T|^2}, \end{aligned}$$
(18)

and

$$\begin{aligned} \lambda _j = \frac{\sum _{i} p_{i j} t_{i}}{|T|}= \frac{\sum _{i\in T}p_{i j}}{|T|}. \end{aligned}$$
(19)

In other words, the optimal \(c_v\) and \(\lambda _j\) at AN must be central in T: they are within-cluster means of features v and network link columns j. Similarly, at AS, the optimal intensity value \(\lambda\) is equal to the mean within-cluster link value.
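In code, the optimal estimates (17)-(19) reduce to within-cluster means; here is a NumPy sketch with illustrative names (T is a list of node indices, Y the \(N\times V\) feature matrix, P the \(N\times N\) link matrix):

```python
import numpy as np

def community_parameters(T, Y, P):
    c = Y[T].mean(axis=0)          # Eq. (17): within-cluster feature means
    lam = P[np.ix_(T, T)].mean()   # Eq. (18): mean within-cluster link (AS)
    lam_j = P[T].mean(axis=0)      # Eq. (19): column-wise means over T (AN)
    return c, lam, lam_j
```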

Let us now reformulate criteria (8), (9), (10) by opening the parentheses and substituting the optimal values of \(c_v\), \(\lambda\), \(\lambda _j\) found above:

Criterion (8) yields:

$$\begin{aligned} F_Y( c, t)= & {} \sum _{i, v} (y_{i v} - c_{v} t_{i})^2=\sum _{i, v} (y_{i v} ^2 -2y_{iv} c_{v} t_{i}+c_v^2t_i)\\= & {} \sum _{i, v} y_{i v} ^2 -2\sum _v c_{v}\sum _i(y_{iv}t_i)+\sum _v c_v^2|T| \end{aligned}$$

Let us denote the data scatter of Y by \(Q(Y)= \sum _{i, v} y_{i v} ^2\) and take into account that \(\sum _i y_{iv}t_i = c_v|T|\) and \(\sum _{i} t_i=|T|\). Then the equation above can be rewritten as

$$\begin{aligned} F_Y( c, t) = Q(Y) - \sum _v c_v^2|T| \end{aligned}$$
(20)

Criterion (9) yields:

$$\begin{aligned} F_{\mathrm{PS}} (\lambda , t)= & {} \sum _{i,j}( p_{i j} -\lambda t_{i} t_{j})^2= \sum _{i,j} p_{ij}^2\\&-2\lambda \sum _{i,j} p_{ij}t_{i}t_{j}+ \lambda ^2 \sum _{i,j}t_{i} t_{j}. \end{aligned}$$

Let us denote the data scatter of P by \(Q(P)= \sum _{i, j} p_{ij} ^2\) and take into account that \(\sum _{i,j} p_{ij}t_{i} t_{j}=\lambda \sum _{i,j}t_{i} t_{j}\). Then the equation above can be rewritten as

$$\begin{aligned} F_{\mathrm{PS}} (\lambda , t)=Q(P) - \lambda ^2 |T|^2 \end{aligned}$$
(21)

Similarly, criterion (10) yields:

$$\begin{aligned} F_{\mathrm{PN}} (\lambda , t)= & {} \sum _{i,j}( p_{i j} -\lambda _j t_{i})^2= \sum _{i,j} p_{ij}^2 \\&-2 \sum _{i,j} p_{ij}t_{i} \lambda _{j}+ \sum _{j} \lambda _j^2 \sum _{i}t_{i}. \end{aligned}$$

Let us take into account that \(\sum _{i} p_{ij}t_{i}=\lambda _j\sum _{i}t_{i}\). Then the equation above can be rewritten as

$$\begin{aligned} F_{\mathrm{PN}} (\lambda , t)=Q(P) -\sum _j \lambda _j^2 |T|. \end{aligned}$$
(22)

Therefore, with the optimal values for \(c_v\), \(\lambda\), and \(\lambda _j\) determined by T in Eqs. (17), (18), and (19), respectively, criteria (6) and (7) can be equivalently reformulated as

$$\begin{aligned} f(\lambda , c_{v}, t_{i}) =\rho Q(Y) +\xi Q(P) - G \end{aligned}$$
(23)

where \(\lambda\) is either a scalar or vector, and

$$\begin{aligned} G(T)=G_s = \rho |T|\sum _{v} c_{v}^2 + \xi \lambda \sum _{i j} p_{i j} t_{i}t_{j} \end{aligned}$$
(24)

at the assumption AS, and

$$\begin{aligned} G(T) =G_n= |T|\left( \rho \sum _{v} c_{v}^2 + \xi \sum _{j} \lambda _j^2\right) \end{aligned}$$
(25)

at the assumption AN, where \(c_v\), \(\lambda\), and \(\lambda _j\) are determined by T according to Eqs. (17), (18), and (19), respectively.

Maximizing the criterion G(T) in Eqs. (24) and (25) is equivalent to minimizing the one-cluster least-squares criteria in Eqs. (6) and (7). Therefore, it makes sense to examine whether G(T) has a meaning of its own.

First of all, we can rewrite Eq. (23) as a Pythagorean decomposition of the combined data scatter \(\rho Q(Y) +\xi Q(P)\):

$$\begin{aligned} \rho Q(Y) +\xi Q(P) = G+f \end{aligned}$$
(26)

into two parts: the minimized squared residuals f and the remaining part G. This decomposition gives meaning to the value of G as the contribution of cluster T to the combined data scatter.

By looking at the formulas for G, we can see that its part related to the feature set, which is the same in both expressions for G(T), (24) and (25), requires maximization of both the cardinality |T| and the squared distance between c and 0, \(\sum _{v} c_{v}^2\). This means an optimal T should have as many elements as possible and, simultaneously, be as far away from 0 as possible in the feature space. Assuming that the feature data are pre-processed so that the origin is transferred to the center of gravity, the grand mean, the point whose components are the averages of the corresponding features, we may conclude that the cluster T should be both numerous and anomalous. The second item in each of the criteria, \(G_s\) (24) and \(G_n\) (25), has a similar meaning regarding the network data.

Hence, we refer to our local search algorithm for maximizing (24) or (25) as the Feature-rich Network Addition Clustering algorithm, FNAC. We use the endings s and n, as in FNACs and FNACn, if necessary, to point out which of criteria (24) and (25), respectively, is maximized. The algorithm finds a cluster T, its center c, and its intensity weight(s) \(\lambda\) (\(\lambda _j\)) by locally maximizing G(T) in the system of neighborhoods defined by the following condition: given a current T, its neighborhood consists of the subsets differing from T by just adding a single entity.

The algorithm starts from a random \(i\in I\), which serves as the seed of a singleton cluster \(T=\{i\}\). This triggers the execution of the base FNAC module. At any current T, this module computes the increment \(\Delta (j)=G(T+j)-G(T)\) for every element \(j\in I-T\) and selects the \(j^*\) at which \(\Delta (j)\) is maximum. If this maximum is positive, then \(j^*\) is added to T, and the module runs again from the updated T. Otherwise, the algorithm halts and outputs T, its center c, its link intensity \(\lambda\) (or intensities \(\lambda _j\)), and its contribution G to the combined data scatter. Then a last check, the seed relevance check, is performed: if the seed's removal increases the cluster contribution, the seed is removed from the cluster.
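The following Python sketch conveys our reading of the FNAC module under the AN assumption; it recomputes G(T) of Eq. (25) naively at every step for clarity, whereas an actual implementation would update the contribution incrementally. All identifiers are illustrative.

```python
import numpy as np

def G_n(T, Y, P, rho, xi):
    T = list(T)
    c = Y[T].mean(axis=0)                    # Eq. (17)
    lam = P[T].mean(axis=0)                  # Eq. (19)
    return len(T) * (rho * (c**2).sum() + xi * (lam**2).sum())  # Eq. (25)

def fnac_n(D, Y, P, rho=1.0, xi=1.0, rng=np.random.default_rng()):
    D = list(D)
    seed = rng.choice(D)                     # random start: a singleton cluster
    T = {seed}
    while True:
        rest = [j for j in D if j not in T]
        if not rest:
            break
        deltas = [G_n(T | {j}, Y, P, rho, xi) - G_n(T, Y, P, rho, xi)
                  for j in rest]
        best = int(np.argmax(deltas))
        if deltas[best] <= 0:                # no positive increment: halt
            break
        T.add(rest[best])
    # seed relevance check: drop the seed if that increases the contribution
    if len(T) > 1 and G_n(T - {seed}, Y, P, rho, xi) > G_n(T, Y, P, rho, xi):
        T.remove(seed)
    return T
```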

In its versions FNACs and FNACn, the FNAC algorithm serves as the core subroutine Ext in the community detection algorithm SE above. The algorithm SE involves an internal procedure, \((T, \alpha )=Ext(D)\), where \(D\subseteq I\). By using FNAC as the algorithm Ext to output the community T along with its parameters \(c_v\) and \(\lambda /\lambda _j\) constituting \(\alpha\), we obtain a combined algorithm, SEFNAC.

The source code of SEFNACs and SEFNACn and all other supplementary materials, including the real-world data sets, the synthetic data generator, etc., are publicly available at https://github.com/Sorooshi/SEFNACs_SEFNACn.

3 Setting of experiments for validation and comparison of the proposed methods

To set a computational experiment, one should specify its constituents:

  1. (1)

    A set of algorithms under comparison.

  2. (2)

    A set of data sets at which the algorithms are evaluated and/or compared.

  3. (3)

    A set of pre-processing methods which are applied to standardize or normalize the data sets.

  4. (4)

    A set of criteria for assessment of the experimental results.

We describe our settings in separate sections.

3.1 Algorithms under comparison

In addition to our algorithms, SEFNACs and SEFNACn, we take two popular algorithms of the model-based approach, CESNA (Yang et al. 2013) and SIAN (Newman and Clauset 2016), which have been extensively tested in computational experiments, as well as the recent heuristic algorithm EVA (Citraro and Rossetti 2020). We use the author-made codes of the algorithms, which are publicly available. We also tested the algorithm PAICAN from Bojchevski and Günnemann (2018). The results of this algorithm, unfortunately, were always less than satisfactory; therefore, we have excluded PAICAN from this paper.

Here are brief descriptions of the CESNA, SIAN, and EVA approaches.

CESNA (Yang et al. 2013) overview Given an undirected graph \(G(V, E)\) with a binary node attribute matrix X, where V is the set of vertices and E is the set of edges, the aim of CESNA is to detect C communities regarding both the graph structure and the node attributes. The authors define two generative models, one for the graph and the other for the attributes, and combine them together. For the graph structure, they use Eq. (27) to model the probability of an edge between two nodes u and v:

$$\begin{aligned} P_{u v}= & {} 1-{\mathrm{exp}}\left(- \sum _{c=1}^{C} F_{u c}F_{v c}\right) \nonumber \\ A_{u v}\sim & {} {\mathrm{Bernoulli}}(P_{u v}) \end{aligned}$$
(27)

where \(A \in \{0, 1\}^{N \times N}\) denotes the graph adjacency matrix. The unknown \(F_{u c}\) represents the membership of node u in community c, so that the edge probability grows with the inner product of the membership vectors \(F_{u}\) and \(F_{v}\). The presence or absence of an edge uv is governed by a Bernoulli distribution: the edge appears with probability \(P_{uv}\) and is absent with probability \(1-P_{uv}\).

A similar model (28) is defined for any binary attribute at the nodes:

$$\begin{aligned} \begin{aligned} Q_{u k}&= \frac{1}{1+{\mathrm{exp}}(-\sum _{c} W_{kc} . F_{u c})} \\ \ X_{u k}&\sim {\mathrm{Bernoulli}} (Q_{u k}) \end{aligned} \end{aligned}$$
(28)

here \(W_{kc}\) is a real-valued parameter of the logistic model relating community c to the k-th node attribute.

With the two models above, the problem is to infer values of latent variables F and W by maximizing the likelihood \(l(F,W) = {\mathrm{log}} P(G, X | F, W)\) of the observed data G, X. Here \(F=(F_{u c})\) is the node-to-community membership matrix and \(W=(W_{k c})\) is the real-valued logistic model parameter for attributes.

Assuming that these two sources of data are conditionally independent, the loglikelihood can be defined as \(\log P(G, X | F, W) = L_{G} + L_{X}\), where \(L_{G} = \log P(G|F)\) and \(L_{X} = \log P(X | F, W)\). To find F and W maximizing \(L_{G}\) and \(L_{X}\), which can be computed using Eqs. (27) and (28), the authors adopt a projected gradient ascent approach with backtracking line search (Boyd and Vandenberghe 2004).
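As an illustration, the two node-level models (27) and (28) amount to one line each; the function names below are ours, assumed for this sketch only:

```python
import numpy as np

def edge_probability(F_u, F_v):
    # Eq. (27): P_uv = 1 - exp(-<F_u, F_v>) for nonnegative memberships
    return 1.0 - np.exp(-float(F_u @ F_v))

def attribute_probability(F_u, W_k):
    # Eq. (28): logistic model Q_uk for the k-th binary attribute at node u
    return 1.0 / (1.0 + np.exp(-float(W_k @ F_u)))
```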

An author-supplied code for CESNA algorithm can be found at Leskovec and Sosič (2016).

SIAN (Newman and Clauset 2016) overview Consider a set of features \({\mathbf{x }}=\{ x_{u}\}\) at nodes \(u=1, 2, \ldots , n\) and a set of node degrees \({\mathbf{d }} = \{ d_{u}\}\). Assume, first, that each node u belongs to community s with a probability depending on \(x_{u}\), and denote all possible combinations of features and communities by \(\Gamma = (\gamma _{s x})\). Then the full prior probability of a community assignment is \(P({\mathbf{s }}| \Gamma , {\mathbf{x }})\). At the next stage, edges between nodes are formed independently at random, with the probability of an edge between nodes u and v being \(p_{u v}= d_{u} d_{v} \theta _{s_{u} s_{v}}\), where \(\theta _{s t}\) is a hyper-parameter.

The task is to fit the model to the observed data by using the maximum likelihood principle. To this end, a binary adjacency matrix \({\mathbf{A }}=(a_{u v})\) is assumed to be generated according to the following model:

$$\begin{aligned} P({\mathbf{A }}| \Theta , \Gamma , {\mathbf{x }})= & {} \sum _{s} P({\mathbf{A }}| \Theta , {\mathbf{s }}) \cdot P({\mathbf{s }}|\Gamma , {\mathbf{x }}) \nonumber \\= & {} \sum _{s} \prod _{u < v} p_{u v}^{a_{u v}} (1-p_{u v})^{1-a_{u v}} \prod _{u} \gamma _{s_{u}, x_{u}} \end{aligned}$$
(29)

Here \(\Theta\) is a \(k \times k\) matrix of elements \(\theta _{s t}\), and the sum is over all admissible node-to-community assignments. To maximize the function in (29) the authors use the expectation–maximization (EM) algorithm.

An author-supplied code for SIAN algorithm can be found at https://www.nature.com/articles/ncomms11863.

EVA (Citraro and Rossetti 2020) overview

Define a node-attributed graph as \({G} = (V, E, A)\), where V is the set of nodes, E is the set of links, and A is a set of nominal or ordinal attributes such that A(v), for \(v \in V\), identifies the set of labels (features) associated with node v. The aim is to discover clusters \(C = \{c_1, \ldots , c_n\}\) such that both a clustering criterion over the network links and a homogeneity criterion over the features within each community are maximized.

To this end, the authors of EVA (Citraro and Rossetti 2020) model the network links with the popular modularity criterion:

$$\begin{aligned} Q = \frac{1}{2m} \sum _{v,w} \left[ A_{v,w} - \frac{k_v k_w}{2m}\right] \delta (c_v, c_w) \end{aligned}$$
(30)

where m is the number of links, \(A_{v,w}\) is the entry of the adjacency matrix for \(v, w \in V\), and \(k_v, k_w\) are the degrees of nodes v and w, respectively. Further, \(\delta (c_v, c_w)\) is an indicator function taking value 1 when v and w belong to the same community, and 0 otherwise.

The authors model the features with a metric called purity. Concretely, for a given community \(c \in C\), its purity is the product, over the attributes, of the frequencies of their most frequent values:

$$\begin{aligned} P_c = \prod _{a \in A} \frac{\max \sum _{v \in c} a(v)}{|c|}, \end{aligned}$$
(31)

where A is the set of features, \(a \in A\) denotes a feature value, a(v) is an indicator function which is unity iff \(a \in A(v)\), and the maximum is taken over the values of each attribute. Purity ranges in [0,1], and it is maximized when all the nodes within a community share the same attribute value. The authors define the purity of a partition as the average of the community purities:

$$\begin{aligned} P = \frac{1}{|C|} \sum _{c \in C} P_c \end{aligned}$$
(32)

Finally, the authors linearly combine these two criteria as follows:

$$\begin{aligned} Z = \alpha P + (1-\alpha )Q \end{aligned}$$
(33)

where \(\alpha\) is a user-defined hyper-parameter to balance the importance of the two sources of data.
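A small sketch of our reading of the purity score (31)-(32); labels[v] is assumed to map node v to a dict of attribute values, and all identifiers are illustrative rather than EVA's actual API:

```python
from collections import Counter

def community_purity(nodes, labels, attributes):
    # Eq. (31): product over attributes of the share of the most frequent value
    purity = 1.0
    for a in attributes:
        counts = Counter(labels[v][a] for v in nodes)
        purity *= counts.most_common(1)[0][1] / len(nodes)
    return purity

def partition_purity(communities, labels, attributes):
    # Eq. (32): average of the community purities
    return (sum(community_purity(c, labels, attributes) for c in communities)
            / len(communities))
```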

To optimize Eq. (33) two modifications of the Louvain algorithm (Blondel et al. 2008) are adopted.

An author-supplied code for EVA algorithm can be found at https://github.com/GiulioRossetti/EVA.

3.2 Data sets

We use both real-world data sets and synthetic data sets. We describe them in the following subsections.

3.2.1 Real-world data sets

Two out of the three algorithms under comparison restrict the features to be categorical, unlike the proposed SEFNAC methods and EVA. Therefore, whenever a data set contains a quantitative feature, we convert that feature to a categorical version. A brief overview of the six real-world data sets under consideration can be found in Table 1.

Table 1 Real-world data sets under consideration

Let us describe them in turn.

Malaria data set

This data set is introduced in Larremore et al. (2013). The nodes are amino acid sequences, each containing six highly variable regions (HVR). Edges are drawn between sequences with similar HVR6 regions. In this data set, there are two nominal attributes of the nodes:

  1. (1)

    Cys Labels derived from the highly variable region HVR6 sequence

  2. (2)

    Cys-PoLV labels derived from the sequences adjacent to regions HVR 5 and 6

The Cys Labels are considered as the ground truth.

Lawyers data set

The Lawyers data set comes from a network study of corporate law partnership carried out in a Northeastern US corporate law firm in New England, referred to as SG&R, 1988–1991. It is introduced in Lazega (2001) and is available for download at https://www.stats.ox.ac.uk/~snijders/siena/Lazega_lawyers_data.htm. The study involves a friendship network between the lawyers. The features in this data set are:

  1. (1)

    Status (partner, associate),

  2. (2)

    Gender (man, woman),

  3. (3)

    Office location (Boston, Hartford, Providence),

  4. (4)

    Years with the firm,

  5. (5)

    Age,

  6. (6)

    Practice (litigation, corporate),

  7. (7)

    Law school (Harvard or Yale, UCon., Other)

Most features are nominal. Two features, “Years with the firm” and “Age,” are quantitative. Authors of the previous studies converted them to the nominal format, which is accepted in this work as well. The categories of “Years with the firm” are \(x\le 10\), \(10< x <20\), and \(x\ge 20\); the categories of “Age” are \(x\le 40\), \(40<x<50\), and \(x\ge 50\).

The combination of Office location and Status is considered as the ground truth (see Table 2).

Table 2 Features in Lawyers data set

World-trade data set The World-Trade data set contains data on trade between 80 countries in 1994 (see (De Nooy et al. 2004)). The link weights represent total imports by row countries from column countries, in $ 1000, for the class of commodities designated as “miscellaneous manufactures of metal” to represent high technology products or heavy manufacture. The weights for imports with values less than 1% of the country’s total imports are zeroed.

The node attributes are:

  1. (1)

    Continent (Africa, Asia, Europe, North America, Oceania, South America)

  2. (2)

    Structural World System Position (Core, Semi-Periphery, Periphery),

  3. (3)

    Gross Domestic Product per capita in $ (GDP p/c)

  4. (4)

    Structural World System Position [SWSP] in 1980 according to Smith and White (Core, Semi-Periphery, Periphery, N.A.I). N.A.I stands for “no available information”, used for countries whose information was not available due to various reasons such as war or dictatorship.

The Structural World System Position in 1980 according to De Nooy et al. (2004) is considered as the ground truth.

The GDP p/c feature is converted into a three-category nominal feature manually, according to the minima of its histogram. The categories are defined as follows: “Poor” category is for the GDP p/c less than \(\$ 4406.9\); “Mid-Range” category is for the GDP p/c greater than \(\$ 4406.9\) but not greater than \(\$ 21574.5\); and “Wealthy” category corresponds to the GDP p/c greater than \(\$ 21574.5\).

These features are reviewed in Table 3. Before applying SEFNAC, all attribute categories are converted into 0/1 dummy variables which are considered quantitative.

Table 3 Features in World Trade data set

Parliament data set

In the Parliament data set, introduced in Bojchevski and Günnemann (2018), nodes correspond to members of the French Parliament. An edge is drawn if the corresponding MPs sign a bill together. The features, as described by the authors, are the constituency of an MP and their political party. The latter is considered the ground truth (see Table 4).

Table 4 The Parliament data set

Consulting Organisational Social Network (COSN) data set

The Consulting Organisational Social Network (COSN) data set is introduced in Cross and Parker (2004). Nodes in this network correspond to employees in a consulting company. The (asymmetric) edges are formed in accordance with the replies to the question: “Please indicate how often you have turned to this person for information or advice on work-related topics in the past three months.” The answers are coded by 0 (I Do Not Know This Person), 1 (Never), 2 (Seldom), 3 (Sometimes), 4 (Often), and 5 (Very Often). The answer code serves as the weight of the corresponding edge.

Nodes in this network have the following attributes:

  1. (1)

    Organizational level (Research Assistant, Junior Consultant, Senior Consultant, Managing Consultant, Partner),

  2. (2)

    Gender (Male, Female),

  3. (3)

    Region (Europe, USA),

  4. (4)

    Location (Boston, London, Paris, Rome, Madrid, Oslo, Copenhagen).

The Region feature is considered as the ground truth. A description of the data is shown in Table 5.

Table 5 The Consulting Organisational Social Network (COSN) data set

SinaNet data set (Jia et al. 2017) This data set is a microblog user relationship network extracted from the Sina-microblog website, http://www.weibo.com. The authors first selected 100 VIP Sina-microblog users distributed over 10 major forums, including finance and economics, literature and arts, etc. Starting from these 100 users, they extracted their followers/followings and published microblogs. Using the depth-first search strategy, they extracted three layers of user relationships and obtained 8452 users, 147,653 user relationships, and 5.5 million microblogs in total. They merged all the microblogs a user published to characterize the user's interests. After removing silent users who published fewer than 5000 words, 3490 users and 30,282 relationships were left. Using word frequencies of a user's merged blogs to describe the user's interests would make the dimension of the feature space too high to be processed. Therefore, the authors use users' topic distributions over the ten forums, obtained with the LDA topic model, to describe users' interests. Thus, besides the follower/following relationships between pairs of users, there are ten-dimensional numerical attributes describing each user's interests (see Table 6).

Table 6 The SinaNet data set

3.2.2 Generating synthetic data sets

In this section, we describe how we generate synthetic feature-rich data sets with an innate cluster structure by separately generating:

  • Network;

  • Categorical features;

  • Quantitative features.

Each of these is put in a separate subsection.

Generating networks

First, the number of nodes, N, and the number of communities, K, are specified. Then the cardinalities of the communities are defined randomly and uniformly, subject to the constraints that no community has fewer than a pre-specified number of nodes (in our experiments, this is set to 30, so that probabilistic approaches are applicable) and that the total number of nodes in all the communities is equal to N. Using the random option for the cardinalities of the communities to be generated is motivated by the following reasons:

  • Real-world cluster structures in general show no specific regularity in cluster sizes: some observers report one or two big clusters among a multitude of very small ones; some claim a natural power law for cluster cardinalities; still others prefer relatively balanced cluster sizes, especially in human-made systems.

  • In our experience, the cardinalities of hidden clusters play no role in the quality of cluster recovery by conventional clustering algorithms.

  • This also reduces to zero the number of parameters to control.

We consider two settings for N: (a) \(N=200\), for a small size network, and (b) \(N=1000\), for a medium-size network. We postpone analysis of larger networks for another paper.

Given the community sizes, we populate the communities with nodes that are specified just by indices, in their natural order. Say, if two clusters with 60 and 40 elements, respectively, are to be defined, then indices from 1 through 60 go into cluster 1, and those from 61 to 100 go into cluster 2. Then we specify two probability values, p and q.

For every pair of nodes from the same community, an edge is drawn between them with the probability p, independently of other edges. Similarly, for every pair of nodes from different communities, an edge is drawn between them with the probability q, independently of other edges.
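A NumPy sketch of this generator, under the stated planted-partition scheme (all names are ours):

```python
import numpy as np

def generate_network(sizes, p, q, rng=np.random.default_rng()):
    """sizes: community cardinalities; p, q: within/between edge probabilities."""
    N = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)  # nodes in natural order
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)                       # p within, q between
    upper = np.triu(rng.random((N, N)) < prob, 1)     # independent edges, no loops
    A = upper | upper.T                               # symmetric adjacency
    return A.astype(int), labels
```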

Figure 2 illustrates similarity matrices for generated networks at \(p=0.7, 0.9\) and \(q=0.4, 0.6\). The upper pane in the figure visualizes a network with 200 nodes and five communities, whereas the lower pane presents 15 communities at 1000 nodes.

Fig. 2

Samples of synthetically generated network matrices (white pixels represent unities, and dark ones, zeros). The number of nodes is N and the number of communities is K. Values p, q are the probabilities of drawing edges within communities and between communities, respectively. Specifically: \(p=0.7, q=0.4, N=200, K=5\) at (a); \(p=0.9, q=0.6, N=200, K=5\) at (b); \(p=0.7, q=0.4, N=1000, K=15\) at (c); and \(p=0.9, q=0.6, N=1000, K=15\) at (d)

Generating quantitative features

To model quantitative features, we use conventional Gaussian distributions as within-cluster density functions. We apply the design proposed in Kovaleva and Mirkin (2015). Each cluster is generated from a Gaussian distribution whose covariance matrix is diagonal, with diagonal values that are random and uniform in the range [0.05, 0.1]; these values specify the cluster's spread. Each component of the cluster center is generated randomly and uniformly from the range \(\alpha [-1, +1]\). Here \(\alpha >0\) controls the cluster intermix: the smaller the \(\alpha\), the greater the chance that points from a cluster fall within the spreads of other clusters. Figure 3 illustrates examples of the generated data sets for \(\alpha =0.7\) and \(\alpha =0.9\). The upper pane in the figure visualizes a feature-rich network with 200 nodes and five communities, whereas the lower pane presents 15 communities at 1000 nodes.

In addition to cluster intermix, the possible presence of noise in the data is taken into account. Noise features are generated uniformly at random from the interval between the minimum and maximum values of the data and appended to the feature data. In this way, noise features amounting to 50% of the number of original features are inserted.
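A sketch of this generator under our reading of the description; in particular, we take the interval [0.05, 0.1] to bound the diagonal entries of the covariance matrix, and we draw the noise features from the overall data range:

```python
import numpy as np

def generate_quantitative(sizes, V, alpha, noise_share=0.5,
                          rng=np.random.default_rng()):
    blocks = []
    for n_k in sizes:
        center = rng.uniform(-alpha, alpha, size=V)          # cluster center
        stds = np.sqrt(rng.uniform(0.05, 0.10, size=V))      # diagonal covariance
        blocks.append(rng.normal(center, stds, size=(n_k, V)))
    Y = np.vstack(blocks)
    n_noise = int(noise_share * V)                           # 50% noise features
    noise = rng.uniform(Y.min(), Y.max(), size=(Y.shape[0], n_noise))
    return np.hstack([Y, noise])
```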

Generating categorical features

To model categorical features, the number of subcategories for each feature is randomly chosen from the set \(\{2, 3, \ldots , L\}\), where \(L=10\) for the small-size networks and \(L=15\) for the medium-size networks. Then, given the number of communities, K, and the numbers of entities, \(N_k\) (\(k=1,\ldots , K\)), the cluster centers are generated randomly, so that no two centers may coincide at more than 50% of the features.

Once a center of the k-th cluster, \(c_{k}=(c_{k v})\), is specified, the \(N_k\) entities of this cluster are generated as follows. Given a pre-specified threshold of intermix, \(\epsilon\), between 0 and 1, for every pair (i, v), \(i=1,\ldots, N_k\), \(v=1,\ldots, V\), a uniformly random real number r between 0 and 1 is generated. If \(r \le \epsilon\), the entry \(x_{iv}\) is set to be equal to \(c_{kv}\); otherwise, \(x_{iv}\) is taken randomly from the set of subcategories specified for feature v.

Consequently, all entities in the k-th cluster coincide with its center, up to rare errors, if \(\epsilon\) is large enough. The smaller the \(\epsilon\), the more diverse, and thus intermixed, are the generated entities.
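A sketch of the categorical generator just described (the 50%-center-coincidence constraint on the centers is omitted for brevity; all names are illustrative):

```python
import numpy as np

def generate_categorical(sizes, V, eps, L=10, rng=np.random.default_rng()):
    n_cats = rng.integers(2, L + 1, size=V)   # number of subcategories per feature
    X = []
    for n_k in sizes:
        center = rng.integers(0, n_cats)      # the cluster center c_k
        for _ in range(n_k):
            r = rng.random(V)
            random_vals = rng.integers(0, n_cats)
            # copy the center where r <= eps; take a random subcategory otherwise
            X.append(np.where(r <= eps, center, random_vals))
    return np.array(X)
```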

To generate a feature-rich network combining categorical and quantitative features, we divide the number of features in two approximately equal parts, one to consist of quantitative features, the other, of categorical features. Each part is filled in independently, according to the schemes described above.

Fig. 3

Samples of synthetically generated clusters at quantitative features; N is the number of nodes, K is the number of communities, and \(\alpha\) is the parameter of cluster intermix. The parameter values are: \(\alpha =0.9, N=200, K=5\) at (a); \(\alpha =0.7, N=200, K=5\) at (b); \(\alpha =0.9, N=1000, K=15\) at (c); and \(\alpha =0.7, N=1000, K=15\) at (d)

3.3 Data pre-processing

Results of our SEFNAC method depend on the way the data are standardized. Unfortunately, no theoretical foundations have been developed so far for the issue of data standardization. We describe here two popular standardization methods for feature data and two for network data.

For features, we consider the following standardization methods:

  1. (1)

    Z-scoring: each of the features is centered by subtraction of its mean from its values and then normalized by dividing over its standard deviation;

  2. (2)

    Range standardization: each of the features is centered by subtraction of its mean from all its values and then normalized by dividing over its range, that is the difference between its maximum and minimum.

For network data, the two following link normalization options are considered:

  1. (1)

    Modularity (Newman 2006): Given an \(N\times N\) similarity matrix \(P=(p_{ij})\), compute summary values \(p_{i+}=\sum _{j=1}^N p_{ij}\), \(p_{+j}=\sum _{i=1}^N p_{ij}\), \(p_{++}=\sum _{i, j=1}^N p_{ij}\) and define random interaction values \(r_{i j}=p_{i+}p_{+j}/p_{++}\). Change all \(p_{i j}\) for \(p_{i j} - r_{i j}\) by removing the random interactions.

  2. (2)

    Uniform shift (Mirkin 2012): Compute the mean \(\pi =\sum _{i,j=1}^N p_{i j}/N^2\); change all \(p_{i j}\) for \(p_{i j} - \pi\).

Each of the two normalizations leads to a similarity matrix in which the mean is zero.
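All four options are one-liners in NumPy; a sketch (Y is the feature matrix, P the link matrix; the function names are ours):

```python
import numpy as np

def z_score(Y):
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)

def range_standardize(Y):
    return (Y - Y.mean(axis=0)) / (Y.max(axis=0) - Y.min(axis=0))

def modularity_shift(P):
    # subtract the random interactions r_ij = p_i+ * p_+j / p_++
    return P - np.outer(P.sum(axis=1), P.sum(axis=0)) / P.sum()

def uniform_shift(P):
    return P - P.mean()   # subtract the grand mean pi
```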

3.4 Evaluation criteria

To compare results found by clustering algorithms, we use a popular metric of similarity between partitions: the Adjusted Rand Index (ARI) (Hubert and Arabie 1985). Also, we used the Normalized Mutual Information (NMI) index (Cover and Thomas 2012; Strehl and Ghosh 2002). The latter has shown results that are very similar to those observed with the ARI; therefore, we decided to omit NMI from the resulting tables.

To define ARI, the so-called contingency table is used.

Given two partitions of the entity set, \(S=\{S_1, S_2,\ldots , S_K\}\) and \(T=\{T_1, T_2,\ldots , T_L\}\), the contingency table is a two-way table whose rows correspond to parts \(S_k\) (\(k=1,2,\ldots , K\)) of S, and columns, to parts \(T_l\) (\(l=1,2,\ldots , L\)) of T, so that its (k, l)-th entry is \(n_{kl} = |S_{k} \cap T_{l}|\), with the so-called marginal row and marginal column defined by \(a_{k} = \sum _{l=1}^{L} n_{kl}=|S_k|\) and \(b_{l} = \sum _{k=1}^{K} n_{kl}=|T_{l}|\).

The Adjusted Rand Index is defined as:

$$\begin{aligned} {\mathrm{ARI}}(S,T)= & {} \frac{\sum _{k,l} \left( {\begin{array}{c}n_{kl}\\ 2\end{array}}\right) - \left[ \sum _{k} \left( {\begin{array}{c}a_{k}\\ 2\end{array}}\right) \sum _{l} \left( {\begin{array}{c}b_{l}\\ 2\end{array}}\right) \right] / \left( {\begin{array}{c}N\\ 2\end{array}}\right) }{ \frac{1}{2}\left[ \sum _{k} \left( {\begin{array}{c}a_{k}\\ 2\end{array}}\right) + \sum _{l} \left( {\begin{array}{c}b_{l}\\ 2\end{array}}\right) \right] - \left[ \sum _{k} \left( {\begin{array}{c}a_{k}\\ 2\end{array}}\right) \sum _{l} \left( {\begin{array}{c}b_{l}\\ 2\end{array}}\right) \right] / \left( {\begin{array}{c}N\\ 2\end{array}}\right) } \end{aligned}$$
(34)

The closer the value of ARI to unity, the better the match between the two partitions; ARI = 1.0 shows that \(S=T\). If one of the partitions consists of just one part, the set I itself, then ARI = 0. Cases at which ARI is negative may occur too, but these authors have observed them only at specially defined “dual” pairs of partitions (see Kovaleva and Mirkin 2015).
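In practice, ARI need not be coded from Eq. (34); for instance, scikit-learn provides it off the shelf (the toy labelings below are illustrative):

```python
from sklearn.metrics import adjusted_rand_score

s = [0, 0, 1, 1, 2, 2]             # partition S encoded as community labels
t = [0, 0, 1, 2, 2, 2]             # partition T
print(adjusted_rand_score(s, t))   # 1.0 would indicate S = T
```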

4 Experimental comparison of the methods under consideration

4.1 Comparison of the methods over real-world data sets

In this section we compare the performance of our proposed SEFNAC methods with state-of-the-art algorithms, EVA, SIAN, and CESNA, at the six real-world data sets described above in subsection (3.2). All the algorithms are run starting from random configurations ten times at each of the data sets; we report the average ARI values and corresponding standard deviations.

Results of both SEFNACs and SEFNACn depend on data standardization. We chose those pre-processing methods that lead, on average, to the larger ARI values. Table 7 explains the applied pre-processing methods.

Table 7 Standardization options chosen for SEFNAC methods on the real-world data sets
Table 8 Comparison of CESNA, SIAN, EVA, SEFNACs, SEFNACn on real-world data sets; average values and standard deviations of ARI are presented over 10 random initializations

The comparison of all the algorithms under consideration over the real-world data sets is recorded in Table 8. Unlike at the synthetic data sets, SIAN shows better performance on the real-world data sets: it is the winner for the Parliament and COSN data sets. SEFNACs wins the competition on the HVR and Lawyers data sets. Finally, SEFNACn wins the competition on the World-trade and SinaNet data sets.

4.2 Comparison of the methods over synthetic data sets with categorical features

Table 9 Comparison of CESNA, SIAN, SEFNACs, and SEFNACn over small-size synthetic data sets with categorical attributes

The comparison of the algorithms over the small-size networks with categorical features is reported in Table 9. One can clearly see that the results in the upper six rows drastically differ from those in the last two rows. The latter rows correspond to the most difficult combination of probabilities, \(p=0.7\) and \(q=0.6\). Indeed, in this case, the probability of between-community edges, \(q=0.6\), is almost as high as the probability of within-community edges, \(p=0.7\).

EVA performs poorly indeed. This poor performance may have two reasons. First, the modularity criterion, adopted to model the networks, usually assigns more weight to less connected nodes; recall that our generated networks are dense, and the nodes have significant between-community links, so that modularity loses its efficiency. The second reason is related to the purity of the features. More precisely, the authors assume that the features of a community's nodes are identical; in our opinion, this is a very optimistic assumption, which does not hold in our generated data sets because of the parameter \(\epsilon\) used to control the homogeneity of the features.

Unfortunately, SIAN also performs poorly, especially in the two most challenging cases, at which its ARI is almost zero. Such a performance might be caused by the networks’ sparsity assumption in Newman and Clauset (2016). Another reason might be convergence to, that is, getting stuck in, a local optimum.

The results by CESNA and SEFNACs are relatively similar in the upper six cases. In the two most complex cases, the results by CESNA fall dramatically, whereas those by SEFNACs remain relatively high, being the best out of the five algorithms under comparison.

It ought to be mentioned that, on average, SEFNACn’s performance is also acceptable. Furthermore, recalling that it executes faster than SEFNACs, we may consider it a preferable algorithm for larger data sets.

Table 10 Comparison of CESNA, SIAN, SEFNACs, and SEFNACn over medium-size synthetic data sets with categorical attributes

The results in Table 10 show that the performance of SEFNACs significantly improves when the size of the networks increases. CESNA also works well, obtaining decent results in 5 out of the 8 settings; however, in most cases, SEFNACs achieves higher ARI values. It is noteworthy that the SEFNAC methods determine the number of clusters automatically. SEFNACn also achieves good results at almost all the settings except for the last one.

The poor performance of SIAN at the medium-size networks could be explained by either (a) convergence issues or (b) issues in computing the prior probabilities, which might be inconsistent with the assumption of the sparsity of the networks in Newman and Clauset (2016).

Similarly to the small-size networks, we observe an abysmal performance by EVA. As mentioned earlier, this could be due to either the density of the generated networks or the nonhomogeneity of the features, which contradict the method’s assumptions regarding the network links and node features, respectively.

5 Experimental validation of SEFNAC methods

5.1 Choosing the data standardization options

In Sect. 3.3, we considered two options for feature data standardization: Z-scoring, further denoted as Z-score, and Range standardization, further denoted as Range. Similarly, we considered two options for network data standardization: Modularity transformation, referred to as Modular, and Uniform shift, further referred to as Uniform. This section aims to assign SEFNACs and SEFNACn a unique set of data standardization options for all the experiments with synthetic data sets.
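To make the four options concrete, the following minimal sketch implements them under illustrative assumptions: the feature data form an N-by-V matrix Y, the network is a symmetric N-by-N similarity matrix A, and the Uniform shift threshold is taken, for illustration only, to be the mean link value; the actual thresholds and implementation details are those described for the methods above.

```python
import numpy as np

def z_score(Y):
    """Z-score: center each feature, then divide by its standard deviation."""
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)

def range_standardize(Y):
    """Range: center each feature, then divide by its range."""
    return (Y - Y.mean(axis=0)) / (Y.max(axis=0) - Y.min(axis=0))

def uniform_shift(A, pi=None):
    """Uniform: subtract one constant threshold from all link values."""
    return A - (A.mean() if pi is None else pi)

def modularity_transform(A):
    """Modular: subtract 'random interaction' terms proportional to the
    products of row and column sums, so the subtrahend varies per node pair."""
    row, col = A.sum(axis=1, keepdims=True), A.sum(axis=0, keepdims=True)
    return A - row @ col / A.sum()
```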

To choose the most suitable combinations, we applied all possible combinations of the pre-processing techniques to the synthetic data sets with (a) only quantitative features, (b) only categorical features, and (c) mixed-scale features. For the sake of brevity, we report the results of the latter case only.

Table 11 presents cluster-recovery results of SEFNACs at synthetic small-size networks with combined quantitative and categorical features under different data standardization techniques. Table 12 gives a similar account for SEFNACn.

Table 11 The performance of SEFNACs on small-size networks with combined quantitative and categorical features at the nodes for different data standardization options

Table 11 shows that the Z-score-Uniform standardization gives an edge over the other standardizations for SEFNACs.

Table 12 The performance of SEFNACn on small-size networks with combined quantitative and categorical features at the nodes for different data standardization options

Regarding Table 12, one should conclude that the Z-score-Uniform standardization pair, on average, yields superior results for SEFNACn, too. This is especially clear at the worst combination of parameters, (\(p, q, \alpha /\epsilon\)) = (0.7, 0.6, 0.7). Comparing the two tables, one may reasonably conclude that SEFNACs works better than SEFNACn in this setting.

Summarizing all the computations on the aforementioned synthetic data sets (only quantitative feature scales, only categorical feature scales, combined feature scales), we conclude that, as a rule, we combine Z-scoring with the Uniform transformation. There are only two exceptions to this rule: (a) at quantitative features only, we combine Z-scoring with the Modularity standardization for SEFNACs; (b) at categorical features only, we combine the Range standardization with the Uniform transformation for SEFNACn.

5.2 Experimental results for SEFNAC algorithms at various feature scales

5.2.1 SEFNAC at synthetic networks with categorical features at the nodes

The following Tables 13 and 14 present the results of the SEFNACs and SEFNACn methods at small-size and medium-size synthetic networks, respectively, with categorical features only. The chosen data standardization options are described at the end of the previous section.

Table 13 The performance of SEFNACs and SEFNACn on small-size networks with categorical features at the nodes
Table 14 The performance of the SEFNACs and SEFNACn on medium-size networks with categorical features at the nodes

One can see that SEFNACs overwhelmingly wins at the medium-size feature-rich networks, showing outstanding results even in the worst parameter setting, \((p,q,\epsilon )=(0.7,0.6,0.7)\). The level at which this algorithm reproduces the numbers of communities looks impressive indeed according to Table 14. In contrast, SEFNACn leads to better results at the small-size data at almost all parameter settings except for the last two.

5.2.2 SEFNAC at synthetic networks with quantitative features at the nodes

The following Tables 15 and 16 present the results of the SEFNACs and SEFNACn methods at small-size and medium-size synthetic networks, respectively, with quantitative features only. The chosen data standardization options are described at the end of Sect. 5.1.

Table 15 The performance of the SEFNACs and SEFNACn on small-size networks with quantitative features at the nodes

As one can see in Table 15, SEFNACs is the winner; however, the results obtained by SEFNACn are good too.

Table 16 The performance of SEFNACs and SEFNACn on medium-size networks with quantitative features at the nodes

Considering the results presented in Table 16, SEFNACs is the sole winner.

Let us check whether the performance of the algorithms changes when noise features are inserted. The results are presented in Tables 17 and 18.

Table 17 The performance of the SEFNACs and SEFNACn on small-size networks with quantitative features at the nodes after 50% noise features are inserted
Table 18 The performance of the SEFNACs and SEFNACn on medium-size networks with quantitative features at the nodes with 50% noise features inserted

Here too, SEFNACs is the winner, although SEFNACn usually shows good results as well. The setting with noise features brings one more effect: we can see that raising q to 0.6 at \(p=0.9\) reduces the clusters’ reproducibility more than lowering p to 0.7 at the smaller q-rate of 0.3.

5.2.3 SEFNAC at synthetic data sets combining quantitative and categorical features

The following Tables 19 and 20 present the results of the SEFNACs and SEFNACn methods at small-size and medium-size synthetic networks, respectively, with combined quantitative and categorical features. The chosen data standardization options are described at the end of Sect. 5.1.

Table 19 The performance of the SEFNACs and SEFNACn on small-size networks with combined quantitative and categorical features at the nodes

Regarding the results in Table 19, we see that SEFNACs and SEFNACn each win at four different settings. This behavior shows that each of the two models can be more effective than the other in specific circumstances.

Table 20 The performance of the SEFNACs and SEFNACn on medium-size networks with the combination of quantitative and categorical features at the nodes

Table 20 shows the results of our methods at medium-size networks with the combination of quantitative and categorical features. SEFNACs is the winner; the results by SEFNACn are moderately acceptable.

The following two tables present the results of our methods at small-size and medium-size networks with combined quantitative and categorical features at the nodes when \(50\%\) noise features are added.

Table 21 The performance of the SEFNACs and SEFNACn on small-size networks with combined quantitative and categorical features at the nodes with 50% noise features inserted

Both SEFNACs and SEFNACn show results that are rather robust against the noise, according to Tables 21 and 22. In particular, the level of cluster recovery by SEFNACn, ARI = 0.950 at \((p, q, \alpha )\) = (0.7, 0.3, 0.7), is indeed impressive (Table 21).

Table 22 The performance of SEFNACs and SEFNACn on medium-size networks with combined quantitative and categorical features at the nodes when 50% noise features are involved

At medium-size networks with added noise, SEFNACs dominates rather clearly, although both methods lead to poor results in the last two rows. Also, one can see that SEFNACn is quite sensitive to raising q, the probability of between-community links, from 0.3 to 0.6: its performance then drops rather significantly.

Overall, it is noteworthy that SEFNACs is superior to SEFNACn at most of the settings, both at small-size and medium-size networks. This can probably be attributed to the mechanism of link data generation in our experiments: the probabilities p and q are constant across the entire data.

5.3 Computational complexity of SEFNAC algorithms

The SEFNAC method is compute-intensive: it computes and compares values of the criteria in Eqs. (24) and (25) while finding a node to be added to the current community. In the current implementation, criterion (25) is computed in a vectorized form, whereas criterion (24) requires a nested “for” loop, which takes longer. Let us denote the average time for computing the value of criterion (25) by \(t_n\), and that of criterion (24) by \(t_s\). Then the total execution time of SEFNACn (or SEFNACs) is proportional to \(t_n N^2\) (or \(t_s N^2\)). Indeed, to find a community, SEFNAC adds a number of nodes proportional to N, and selecting a node to be added at a step requires a number of tries proportional to N, too. We do not take into account the number of communities found, because it always remains small in our computations.
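The quadratic estimate can be read off a structural sketch of growing one cluster; in the sketch below, criterion_gain is a hypothetical stand-in for evaluating the increment of criterion (24) or (25) when a candidate node is added, and the seed selection and other details of the actual methods are omitted.

```python
def extract_cluster(nodes, criterion_gain):
    """Grow one cluster greedily: each step scans all remaining nodes
    (O(N) tries) for the best criterion increment and stops when no
    addition improves the criterion; growing up to O(N) nodes thus
    costs O(N^2) gain evaluations, matching the t_s*N^2 / t_n*N^2 estimates."""
    cluster, remaining = [], set(nodes)
    while remaining:
        best, best_gain = None, 0.0
        for v in remaining:            # N candidate tries per step
            gain = criterion_gain(cluster, v)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:               # no improving node: the cluster is done
            break
        cluster.append(best)
        remaining.remove(best)
    return cluster
```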

To check whether these timings go in line with the timings of the other algorithms, we took the common ground, synthetic networks with categorical features only, and ran all the algorithms at both the small-size and the medium-size data sets. The computing time should not depend much on the parameter setting, so we selected two out of our standard 8 settings: (\(p, q, \epsilon\)) = (0.9, 0.3, 0.9), at which the community structure is maximally sharp, and (\(p, q, \epsilon\)) = (0.7, 0.6, 0.7), at which the community structure is maximally blurred.

Although the computation speed depends on the computing system, so that only relative comparisons can be meaningful, the times reported in Table 23 have been achieved at a desktop computer (Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz, RAM: 64 GB, HD: 1 TB SSD) under the Ubuntu 18 operating system.

Table 23 The execution time of various methods on synthetic networks with categorical features at the nodes; for SEFNAC, the selected data pre-processing options apply

As one can see, the algorithms can be divided into two leagues, fast and slow. The fast league comprises CESNA, EVA, and SEFNACn; the slow league comprises SIAN and SEFNACs, although one may note that SEFNACs is, in fact, much faster than SIAN. Anyway, the important fact is that the computing times of the SEFNAC algorithms go in line with those of the rest.

6 Conclusion and future work

This paper proposes two similar methods, SEFNACs and SEFNACn, for community detection at feature-rich networks using the conventional data recovery approach. The methods differ in whether they assume the network data entries to be summable across the link table: SEFNACs does, SEFNACn does not. In this way, we distinguish between cases at which the similarity data scales are the same for all the network nodes and cases at which each node collects its linkage data independently. The methods are similar in that both find clusters one-by-one and apply the same strategy for finding an individual cluster, this time by adding entities to the cluster one at a time. The summability and nonsummability assumptions lead to differing results. The nonsummable version, SEFNACn, is much faster than the summable SEFNACs. Contrary to our intuition, the nonsummable version leads to the best results at two of the six real-world data sets and to the second-best results at three of the four remaining ones. It brings forth the best results at six out of eight small-size synthetic data sets (and the second-best results at the two remaining settings) in the only case at which all the five algorithms under comparison can meet: networks with categorical features at the nodes. SEFNACs dominates at the medium-size feature-rich networks.

There are several properties distinguishing our methods from many others.

Desirable properties: (a) the methods can work with mixed-scale features, both categorical and quantitative; (b) the methods apply to any link data format that can be expressed as an entity-to-entity similarity matrix; (c) the number of clusters/communities found is determined automatically; (d) the contributions of individual clusters to the combined data scatter are computed.

Less desirable properties: (e) the results depend on the data standardization, of both the network link data and the feature data; (f) the computations are slow (of the order of \(N^2\) in the worst-case scenario); (g) there is no advice regarding the constants balancing the relative contributions of the network and the features.

Nevertheless, our experiments show that our algorithms are competitive against state-of-the-art algorithms at small-size and medium-size data. The algorithms are relatively robust against noise and blurred structures. For instance, the probability of inter-community links can be as high as 0.6, meaning that the proportion of inter-community edges may be comparable with, or even greater than, that of within-community edges.

Our empirically found best data standardization options involve Z-scoring of the feature data and the Uniform shift transformation of the network data in almost all settings. The Uniform shift subtracts a constant threshold from the link values. In contrast, the popular Modularity transformation subtracts “random interaction” values, which may differ depending on the numbers of links at different nodes. This supports the view (Mirkin 2012) that at flat network data, the subtracted value should be flat, that is, constant, too. The reason why Z-scoring improves our results seems more obscure. Perhaps an explanation is as follows: Z-scoring, in contrast to the Range standardization, leads to different feature ranges. This may suit our method, since it starts from the most anomalous clusters.
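In formulas, with \(a_{i+}\), \(a_{+j}\), and \(a_{++}\) denoting the row, column, and total sums of the similarity matrix \(A=(a_{ij})\), and \(\pi\) the chosen constant threshold, the two transformations can be written, under the usual random-interactions definition of the Modularity transformation, as

$$\begin{aligned} a_{ij} \mapsto a_{ij} - \pi \quad \text {(Uniform shift)}, \qquad a_{ij} \mapsto a_{ij} - \frac{a_{i+}\, a_{+j}}{a_{++}} \quad \text {(Modularity)}. \end{aligned}$$

The Modularity subtrahend grows with the numbers of links at nodes i and j, which is exactly the node dependence referred to above.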

The properties of the methods mentioned above determine the main directions of our future work. First of all, we should raise the computational capacity of the method so that it applies to larger data sets. We think that a version resembling the K-Means method can be developed, defining the cluster center and the distances from it to the cluster elements according to our clustering criterion (5) and, in this way, taking into account both data sources. Such a development could lead us to an algorithm capable of handling tens of thousands of nodes rather than the thousands the current version can handle. Then we could think of a further acceleration of the computations.

We are going to look for a practically interesting real-world application. Having such an application may help us improve the speed of computations.

Another direction for research would be investigating the balancing weights between the network links and the feature values, so as to adjust the trade-off coefficients automatically.

Finally, reformulating our proposed methods in a theory-driven form seems another promising direction of future work.