1 Introduction

Over the last decade, there has been increasing interest in network research, with the focus shifting away from the analysis of single small graphs to the consideration of large-scale ones, called complex networks. Machine learning and data mining based on complex networks have attracted much attention, because such networks are ubiquitous in nature and in everyday life. Many data sets are naturally represented by networks, such as the Internet, the WWW, telecommunication networks, transportation networks, biological networks, and social networks. Many other kinds of data sets can be transformed into network representations. For example, a table-based relational database can be transformed into a network by simply connecting the k nearest neighbors of each data point. One of the main motivations of graph theory research is the ability to describe the topological structure of the original system. In the machine learning domain, it has been shown that the topological structure is quite useful for detecting clusters (classes) of various forms by data clustering (classification) algorithms with a unique distance measure (Karypis et al. 1999; Fortunato 2010).

One of the striking phenomena of complex networks is the presence of communities. The notion of communities in networks is straightforward: each community is defined as a subgraph whose nodes are densely connected within itself but sparsely connected with the rest of the network. Therefore, community detection in networks has turned out to be an important topic in data mining (Newman and Girvan 2004; Newman 2006; Duch and Arenas 2005; Reichardt and Bornholdt 2004; Danon et al. 2005). In graph theory, community detection corresponds to graph partitioning, which has been shown to be an NP-complete problem (Fortunato 2010). For this reason, considerable effort has been devoted to developing efficient approximate solutions (see Fortunato 2010 and references therein).

In practice, there are common cases where some nodes of a network belong to two or more communities at the same time. For example, in a social network of friendships, individuals often belong to several communities: their families, their colleagues, their classmates, etc. These nodes are often called overlapping nodes, and few community detection algorithms can deal with them. Therefore, uncovering the overlapping community structure of complex networks is still an open problem (Zhang et al. 2007a, b; Palla et al. 2005).

It is rare that we know nothing about a given data set. On the contrary, in many real-world data sets, we know the labels of some elements. For example, one certainly does not know the majority of the people from Brazil, but we usually know some of them, such as Pelé, a famous soccer player, or Ayrton Senna, who was a famous racing driver. Such labeled data, without a doubt, help to correctly determine the labels of the remaining elements. For this reason, we consider fuzzy community structure detection in a semi-supervised environment.

In this paper, we present a new community detection method, which uses competition and cooperation among particles walking in the network. It is inspired by the community detection method proposed by Quiles et al. (2008), in which only hard labels can be produced. That model features particles walking in the network and competing with each other in order to possess as many nodes as possible. Later, Breve et al. (2012) extended that model to perform semi-supervised learning, including not only competition among particles spreading different labels, but also cooperation among particles spreading the same class label. However, it also provides only hard labels. Both Quiles et al. (2008) and Breve et al. (2012) provide analyses of the time and storage complexity of their methods, showing that they have a lower order of computational complexity than other unsupervised and semi-supervised graph-based methods. While most semi-supervised graph-based methods have cubic complexity (\(O(n^3)\)) (Zhu 2005), the methods proposed by Quiles et al. (2008) and Breve et al. (2012) have only linear complexity (\(O(n)\)), thus they can be applied to larger data sets.

A preliminary work on determining overlapping nodes by particle competition was presented by Breve et al. (2009). In that work, partial knowledge of networks is not taken into consideration, thus only the competition mechanism between particles is implemented. In this paper, we extend that model to the semi-supervised learning case by introducing a cooperative mechanism through the concept of teams of particles. Particles in the same team cooperate with their teammates and compete against particles of other teams. We also transform the unsupervised learning mechanism into a semi-supervised one, in order to take advantage of the small portion of labeled samples that is usually available in many real data sets. The proposed model produces a fuzzy output (soft label) for each node of the network. Such continuous-valued output can be treated as the level of membership of each node to each community. Therefore, it is able to uncover the overlapping community structure of networks.

Another problem faced by machine learning algorithms is the presence of outliers or mislabeled samples in the training data. In practical applications, semi-supervised learning may entail risks because error propagation may be embedded in the normal label propagation. For example, in a medical diagnostic system, a classification mistake may have serious consequences for a person’s health. In the semi-supervised learning domain, this situation gets worse because a classification mistake may propagate to a sub-system or even the whole system, resulting in wrong diagnoses of other cases. The model proposed in this paper gives special consideration to label propagation safety by introducing a mechanism to identify outliers among the labeled data items, and consequently prevent error propagation coming from outliers.

It is worth noting that many graph-based semi-supervised learning methods have been developed (Chapelle et al. 2006). However, most of them are similar to each other (Zhu 2005), in the sense that they can be seen as regularization frameworks, differing only in the particular choice of the loss function and the regularizer (Blum and Chawla 2001; Zhu et al. 2003; Zhou et al. 2004; Belkin et al. 2004, 2005; Joachims 2003). Moreover, most of these methods spread the labels globally, i.e., at each iteration, all nodes propagate their labels to all other nodes according to the edge weights. The method proposed in this paper is essentially different because the nature-inspired particle movement, competition, and cooperation mechanisms allow it to spread the labels locally, i.e., at each algorithm step, only the nodes being visited by a particle are updated. Both the fuzzy output (overlapping community structure detection) and the outlier detection mechanisms are extracted naturally from the particle dynamics.

The rest of this paper is organized as follows: Sect. 2 describes the model in detail. Section 3 shows some experimental results from computer simulations, and in Sect. 4 we draw some conclusions.

2 Model description

In this section, we introduce the particle competition and cooperation algorithm. It takes an undirected and unweighted network as input. Therefore, if the data set is already in that form, it can be used directly. Otherwise, the input data set must first be transformed into an undirected and unweighted network. For each labeled data item, a corresponding particle is generated and put in the network. A group of particles having the same label is called a team. Each node in the network holds a vector of elements, which correspond to the domination level of each team of particles over that node. As the system runs, each particle uses a random-greedy rule to choose a neighbor to visit. At the chosen node, the domination level of the particle’s team increases, while the domination levels of the other teams decrease. Teams of particles act cooperatively, trying to dominate as many nodes as possible while preventing intrusion by other teams. We keep track of the visits to each node and, at the end of the iterations, we calculate the membership degrees of each node to each class using the information of the domination levels.

The input of the algorithm is a graph \(\mathbf{G} = (\mathbf{V}, \mathbf{E})\), where \(\mathbf{V} = \{v_1, v_2, \ldots, v_n\}\) is the set of nodes and \(\mathbf{E}\) is the set of edges \((v_i, v_j)\), which can also be represented by an adjacency matrix \(\mathbf{W}\):

$$ W_{ij} =\left\{\begin{array}{*{20}l} 1 & \quad \hbox{if } v_i \hbox{ and } v_j \hbox{ are neighbors} \\ 0 & \quad \hbox{otherwise} \end{array}\right., $$
(1)

so \(W_{ij}\) specifies whether there is an edge between the pair of nodes \(v_i\) and \(v_j\). The algorithm also requires a vector \(Y = \{y_1, y_2, \ldots, y_n\}\), where \(y_i\) takes the hard label of the node \(v_i\) if it is known, or 0 otherwise. The label set is defined as \(L = \{1, 2, \ldots, c\}\), where c is the number of classes/communities, so a number is assigned to each class/community and 0 is reserved for nodes whose label is unknown. If a labeled node is known to be an overlapping node, its hard label is set to the class/community in which it has the highest pertinence level. The goal of the algorithm is to provide a vector of membership degrees to each class for each of the nodes in the graph, no matter whether they are initially labeled or unlabeled.

If the input data set is not an undirected and unweighted network, it must first be transformed into one. For instance, if the input data set is vector-based, as in \(\mathbf{\chi} = \{x_1, x_2, \ldots, x_n\} \subset \mathcal{R}^{m}\), one may transform it into an undirected and unweighted graph by turning each element \(x_i\) into a node \(v_i\), and connecting it to its k nearest neighbors according to some distance measure, such as the Euclidean distance. In this case, the adjacency matrix \(\mathbf{W}\) may be built as follows:

$$ W_{ij} = \left\{\begin{array}{*{20}l} 1 & \quad \hbox{if } x_j \hbox{ is among the } k\hbox{-nearest neighbors} \\ & \quad \quad \hbox{of } x_i \hbox{ or vice-versa} \\ 0 & \quad \hbox{otherwise} \\ \end{array}\right., $$
(2)

where \(W_{ij}\) specifies whether there is an edge between the pair of nodes \(v_i\) and \(v_j\) corresponding to \(x_i\) and \(x_j\). Of course, one can also use faster methods to estimate the nearest neighbors, especially in larger data sets, where the prior graph construction may be more time consuming than the algorithm iterations, as the algorithm has a lower order of computational complexity than the prior graph construction step.
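To make this construction concrete, here is a minimal sketch in Python with NumPy (the function name and the brute-force neighbor search are illustrative choices, not part of the original method) that builds the adjacency matrix of Eq. (2) from a vector-based data set:

```python
import numpy as np

def knn_adjacency(X, k):
    """Sketch of Eq. (2): symmetric, unweighted k-NN adjacency matrix.

    X is an (n, m) array of vector-based samples; an edge is created if
    x_j is among the k nearest neighbors of x_i or vice-versa, using
    the Euclidean distance.
    """
    n = X.shape[0]
    # Brute-force pairwise squared Euclidean distances (fine for a
    # sketch; tree-based or approximate searches suit larger n better).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)             # a node is not its own neighbor
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1      # k nearest neighbors of x_i
    return np.maximum(W, W.T)                # symmetrize: "or vice-versa"
```

For larger data sets, the brute-force search above would be replaced by a tree-based or approximate neighbor search, in line with the remark above.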

For each labeled node \(v_i\) (i.e., \(y_i \neq 0\)) in the network, a particle \(\rho_i\) is generated and its initial position is at the node \(v_i\). Thus, there is a particle for each labeled node in the network. If \(v_i\) is the initial position of particle \(\rho_i\), we call it the home node of particle \(\rho_i\). At each iteration, each particle changes its position and registers the distance from its current position to its home node. Particles generated for nodes with the same class/community label form a team and cooperate with each other to compete against the other teams. Therefore, each team represents a class/community of the network.

Each particle \(\rho_j\) comes with two variables: \(\rho_{j}^{\omega}(t)\) and \(\rho_{j}^{d}(t)\). The first variable, \(\rho_{j}^{\omega}(t) \in [0,1]\), is the particle strength, which indicates how much the particle can affect a node’s domination levels at time t. The second variable is a distance table, i.e., a vector \(\rho_j^{d}(t) = \{\rho_{j}^{d_{1}}(t), \rho_{j}^{d_{2}}(t), \ldots, \rho_{j}^{d_n}(t)\}\), where each element \(\rho_j^{d_i}(t) \in [0, n-1]\) corresponds to the distance measured between the particle’s home node \(v_j\) and the node \(v_i\).


Each node \(v_i\) has two variables. The first variable is a vector \(\mathbf{v_i^\omega}(t) = \{v_i^{\omega_1}(t), v_i^{\omega_2}(t), \ldots, v_i^{\omega_c}(t)\}\) called the instantaneous domination levels, where each element \(v_i^{\omega_\ell}(t) \in [0, 1]\) corresponds to the level of domination of team ℓ over node \(v_i\). At each node, the sum of the domination levels is always constant, as follows:

$$ \sum_{\ell=1}^{c}v_i^{\omega_\ell}(t) = 1. $$
(3)

This relation is possible because particles increase the domination level of their own team at a node and, at the same time, decrease the other teams’ domination levels at that node. The second variable is the long-term domination levels, a vector \(\mathbf{v_i^\lambda}(t) = \{v_i^{\lambda_1}(t), v_i^{\lambda_2}(t), \ldots, v_i^{\lambda_c}(t)\}\), where each element \(v_i^{\lambda_\ell}(t) \in [0, \infty)\) represents the long-term domination level of team ℓ over node \(v_i\). Long-term domination levels can vary from zero to infinity and they never decrease.

Each node v i has an initial value of its instantaneous domination vector \(\mathbf{v_i^\omega}\) set as follows:

$$ v_i^{\omega_\ell}(0) =\left\{ \begin{array}{*{20}l} \frac{1}{c} & {\hbox{if }} y_i = 0\\ 1& {\hbox{if }} y_i = \ell \\ 0& {\hbox{otherwise}} \\\end{array}\right., $$
(4)

i.e., for each unlabeled node (\(y_i = 0\)), the domination levels of all particle teams are set to the same value \(\frac{1}{c}\), where c is the number of classes/communities (number of teams); and for each labeled node (\(y_i \neq 0\)), the domination level of the team corresponding to the node’s label is set to the highest value 1, while the domination levels of the other teams are set to the lowest value 0. On the other hand, in all nodes, all long-term domination levels \(v_i^{\lambda_\ell}(0)\) have their initial values set to zero, for all classes ℓ, no matter whether the corresponding data item is labeled or unlabeled.

Each particle has its initial position set at its corresponding home node, and its initial strength is set as follows:

$$ \rho_j^{\omega}(0)=1, $$
(5)

i.e., each particle starts with maximum strength.

Particles have limited knowledge of the network: they only know the distances from their home node to the nodes they have already visited. Distances are recalculated dynamically at each particle movement. Thus, the distance table of each particle is initialized as follows:

$$ \rho_j^{d_i}(t) = \left\{ \begin{array}{*{20}l} 0& \quad \hbox{if } i = j \\ n-1& \quad \hbox{if } i \neq j \\ \end{array}\right., $$
(6)

i.e., for each particle, the distance from its home node is set to zero, and all the other distances are assumed to be the largest possible value n − 1.
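As a minimal sketch of this setup (Eqs. 4–6), assuming the NumPy representation used earlier, labels in y given as integers 1..c with 0 meaning unlabeled, and illustrative variable names:

```python
import numpy as np

def initialize(y, c):
    """Sketch of the initial state of nodes and particles (Eqs. 4-6)."""
    y = np.asarray(y)
    n = len(y)
    omega = np.full((n, c), 1.0 / c)        # Eq. (4), unlabeled nodes
    labeled = np.flatnonzero(y)
    omega[labeled] = 0.0
    omega[labeled, y[labeled] - 1] = 1.0    # Eq. (4), labeled nodes
    lam = np.zeros((n, c))                  # long-term levels start at zero
    homes = labeled                         # one particle per labeled node
    strength = np.ones(len(homes))          # Eq. (5): maximum strength
    dist = np.full((len(homes), n), n - 1)  # Eq. (6): pessimistic distances
    dist[np.arange(len(homes)), homes] = 0  # ... except at the home node
    position = homes.copy()                 # particles start at home
    return omega, lam, homes, strength, dist, position
```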

At each iteration, each particle will select a neighbor to visit. There are two different kinds of movements a particle can use: random movement and greedy movement. During random movement, a particle randomly chooses any neighbor to visit without concern for domination levels or the distance from its home node. This movement is used for exploration and acquisition of new nodes. Meanwhile, in greedy movement, each particle prefers visiting nodes that are already dominated by its own team and that are closer to its home node. This movement is used for defense of both its own and its team’s territories. In order to achieve an equilibrium between exploratory and defensive behavior, both movements are applied. Therefore, at each iteration, each particle chooses the greedy movement with probability \(p_{\rm grd}\) and the random movement with probability \(1 - p_{\rm grd}\), with \(0 \leq p_{\rm grd} \leq 1\). Once the movement type is determined, the target neighbor node \(\rho_j^\tau(t)\) is chosen with the probabilities defined by Eq. (7) or Eq. (8), respectively.

In the random movement, particle \(\rho_j\) tries to move to a node \(v_i\) with the probabilities defined as:

$$p(v_{i}|\rho_{j}) = \frac{W_{qi}}{\sum_{\mu=1}^{n}{W_{q \mu}}},$$
(7)

where q is the index of the current node of particle \(\rho_j\), so \(W_{qi} = 1\) if there is an edge between the current node and node \(v_i\), and \(W_{qi} = 0\) otherwise.

In the greedy movement, the particle tries to move to a neighbor with probabilities defined according to its team’s domination level on that neighbor, \(v_i^{\omega_\ell}\), and the inverse of the distance \(\rho_j^{d_i}\) from that neighbor \(v_i\) to its home node \(v_j\), as follows:

$$ p(v_i|\rho_j) = \frac{W_{qi} \, v_i^{\omega_\ell} \, (\rho_j^{d_i}+1)^{-2}}{\sum_{\mu=1}^{n}{W_{q\mu} \, v_\mu^{\omega_\ell} \, (\rho_j^{d_\mu}+1)^{-2}}}. $$
(8)

Once more, q is the index of the current node of particle \(\rho_j\) and \(\ell = \rho_{j}^{f}\), where \(\rho_{j}^{f}\) is the class label of particle \(\rho_j\).
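A sketch of the movement selection (Eqs. 7 and 8) under the same illustrative representation follows; team labels are 0-based here, and a fall-back to the random rule is added as an extra assumption, since Eq. (8) may assign zero probability to every neighbor when none is dominated by the particle’s team:

```python
import numpy as np

def choose_target(W, omega, dist_j, q, team, p_grd, rng):
    """Pick the next node for a particle at node q (Eqs. 7 and 8)."""
    greedy = rng.random() < p_grd
    if greedy:
        # Eq. (8): weight neighbors by the team's domination level and
        # the inverse squared distance to the particle's home node.
        p = W[q] * omega[:, team] * (dist_j + 1.0) ** -2
        if p.sum() == 0.0:           # guard (assumption): no dominated
            p = W[q].astype(float)   # neighbor, behave like a random move
    else:
        p = W[q].astype(float)       # Eq. (7): uniform over neighbors
    return rng.choice(len(p), p=p / p.sum()), greedy
```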

Particles of different teams compete to own the network nodes. When a particle moves to a node, it increases the instantaneous domination level of its team at that node and, at the same time, decreases the instantaneous domination levels of the other teams at that same node. The exception are the labeled nodes, whose instantaneous domination levels are fixed. Thus, for each selected target node \(v_i\), the instantaneous domination level \(v_i^{\omega_\ell}(t)\) is updated as follows:

$$ v_i^{\omega_\ell}(t+1) = \left\{\begin{array}{*{20}l} \max\left\{0, v_i^{\omega_\ell}(t) - \frac{\Updelta_{v} \rho_j^{\omega}(t)}{c-1}\right\} & \quad \hbox{if } y_i = 0 \hbox{ and } \ell \neq \rho_j^f \\ v_i^{\omega_\ell}(t) + \sum_{q \neq \ell}\left[v_i^{\omega_q}(t) - v_i^{\omega_q}(t+1)\right] & \quad \hbox{if } y_i = 0 \hbox{ and } \ell = \rho_j^f \\ v_i^{\omega_\ell}(t) & \quad \hbox{if } y_i \neq 0 \end{array}\right., $$
(9)

where \(0 < \Updelta_v \leq 1\) is a parameter controlling the changing rate of the instantaneous domination levels and \(\rho_{j}^{f}\) represents the class label of particle \(\rho_j\). If \(\Updelta_v\) takes a low value, the node instantaneous domination levels change slowly, while if it takes a high value, they change quickly. Each particle \(\rho_j\) increases the instantaneous domination level of its team (\(v_i^{\omega_\ell}, \ell=\rho_j^f\)) at the node \(v_i\) when it visits it, while it decreases the instantaneous domination levels of the other teams at this same node (\(v_i^{\omega_\ell}, \ell\neq \rho_j^f\)), always respecting the conservation law defined by Eq. (3). The instantaneous domination levels of labeled nodes are always fixed, as defined by the third case of Eq. (9).
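In code, the update of Eq. (9) may be sketched as an in-place operation on the omega matrix from the initialization sketch (illustrative names; labeled nodes are left untouched, as in the third case):

```python
import numpy as np

def update_instantaneous(omega, y, i, team, strength_j, delta_v, c):
    """Sketch of Eq. (9) for a visit to node i by a particle of `team`."""
    if y[i] != 0:
        return  # third case: labeled nodes keep their levels fixed
    old = omega[i].copy()
    others = np.arange(c) != team
    # First case: other teams lose at most delta_v * strength / (c - 1).
    omega[i, others] = np.maximum(0.0, old[others]
                                  - delta_v * strength_j / (c - 1))
    # Second case: the visiting team absorbs exactly what the others
    # lost, so the conservation law of Eq. (3) keeps holding.
    omega[i, team] = old[team] + (old[others] - omega[i, others]).sum()
```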

Regarding the long-term domination levels: at each iteration, for each node \(v_i\) selected by the random movement, the long-term domination level \(v_i^{\lambda_\ell}(t)\) is updated as follows:

$$ v_i^{\lambda_\ell}(t+1) =v_i^{\lambda_\ell}(t) + \rho_j^{\omega}(t) $$
(10)

where ℓ is the class label of particle \(\rho_j\). Eq. (10) shows that the increment of the long-term domination level \(v_i^{\lambda_\ell}(t+1)\) is proportional to the current particle strength \(\rho_j^{\omega}(t)\). This is a desirable feature because a particle probably has a higher strength when it is arriving from its own neighborhood, and a lower strength when it is arriving from other teams’ neighborhoods. When the greedy movement is selected, long-term domination levels are not updated; otherwise, team domination would be amplified by greedy movement visits, which is not desirable as it would introduce bias in the fuzzy output.
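The long-term update (Eq. 10) is then a one-line accumulation, applied only when the movement was random (illustrative sketch):

```python
def update_long_term(lam, i, team, strength_j, was_random_move):
    """Sketch of Eq. (10): long-term levels grow by the visiting
    particle's strength, and only on random movements, so that greedy
    visits do not bias the fuzzy output toward the dominating team."""
    if was_random_move:
        lam[i, team] += strength_j
```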

Regarding particle strength: a particle gets stronger when it moves to a node dominated by its own team, and it gets weaker when it moves to a node dominated by another team. Thus, at each iteration t, a particle strength \(\rho_j^{\omega}(t)\) is updated as follows:

$$ \rho_j^{\omega}(t+1) = v_i^{\omega_\ell}(t+1), $$
(11)

where \(v_i\) is the target node and \(\ell = \rho_j^f\), i.e., ℓ is the class label of particle \(\rho_j\). Therefore, each particle \(\rho_j\) has its strength \(\rho_j^{\omega}\) set to the value of its team’s instantaneous domination level \(v_i^{\omega_\ell}\) at the node \(v_i\).

It is important to notice that when a particle moves, it may be accepted or rejected at the target node due to the competition mechanism. First, the particle modifies both the node’s instantaneous and long-term domination levels as explained, then it updates its own strength, and finally it is accepted at the new node only if the domination level of its team is higher than the others’; otherwise, a shock happens and the particle goes back to its previous node until the next iteration.
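The strength update (Eq. 11) and the acceptance rule can be sketched together. The paper does not state how ties among domination levels are resolved, so treating equality as acceptance is an assumption of this sketch:

```python
def settle(omega, position, j, i, team):
    """Sketch of Eq. (11) plus the acceptance rule for particle j
    arriving at target node i."""
    strength_j = omega[i, team]            # Eq. (11)
    if omega[i, team] >= omega[i].max():   # team dominates the node
        position[j] = i                    # particle is accepted
    # otherwise: a shock occurs and position[j] stays at the previous
    # node until the next iteration
    return strength_j
```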

The purpose of the distance table is to keep the particle aware of how far it is from its home node. This information is used in the greedy movement in order to keep the particle around its own neighborhood most of the time, preventing it from being vulnerable to attacks by other teams. The instantaneous domination levels, together with the distance information, also avoid situations where a particle would walk into enemies’ neighborhoods and lose all its strength. Each particle \(\rho_j\) updates its distance table \(\rho_j^{d_k}(t)\) at each iteration t as follows:

$$ \rho_j^{d_k}(t+1) = \left\{ \begin{array}{*{20}l} \rho_j^{d_i}(t) + 1 & {\hbox{if }\ } \rho_j^{d_i}(t) + 1 < \rho_j^{d_k}(t) \\ \rho_j^{d_k}(t)& \hbox{otherwise} \end{array}\right., $$
(12)

where \(\rho_j^{d_i}(t)\) and \(\rho_j^{d_k}(t)\) are the distances to its home node from the current node and the target node, respectively.

The distance calculation works as follows: we assume that the particles initially have limited knowledge of the network, i.e., they know how many nodes the network has, but not how the nodes are connected, so they assume all nodes can be reached in at most n − 1 steps (the largest possible distance). Every time a particle moves, it checks its distance table: if the distance recorded for the target node is greater than the distance of the current node plus 1, it is updated to the distance of the current node plus 1. This method has the advantage of reusing already known distances without recalculation.
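Eq. (12) then reduces to a single comparison per movement (illustrative sketch, with q the current node and i the target node):

```python
def update_distance(dist_j, q, i):
    """Sketch of Eq. (12): if reaching node i through the current node q
    is shorter than the best distance known so far, record it."""
    if dist_j[q] + 1 < dist_j[i]:
        dist_j[i] = dist_j[q] + 1
```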

At first glance, the nodes’ instantaneous domination levels \(\mathbf{v_i^\omega}(t)\) look like a natural choice for the nodes’ fuzzy (gradual) outputs, since they indicate the domination levels of each team (class) on each node, quantified in terms of continuous values in [0, 1]. However, instantaneous domination levels are very volatile under certain conditions. For instance, the team dominating a non-overlapping node after the last iteration usually owns that node for all or most of the iterations, but this may not happen for overlapping nodes, in which the dominating team changes frequently, and thus the instantaneous domination level of the last dominating team may not correspond to the team which has dominated that node for the longest time. In addition, due to the competition effect, the instantaneous domination level of the dominating team is largely amplified and does not correspond to the real overlapping level. That is why we use the long-term domination levels, which represent the temporally averaged domination level of each team at each node. In this case, when a team’s long-term domination level is increased, the long-term domination levels of the other teams are kept unchanged. In addition, there is no upper limit on long-term domination levels, i.e., they can vary from zero to infinity. At the end of the iterations, the fuzzy output is derived from the long-term domination levels. It is important to note that the long-term domination levels are adjusted only when a particle selects the random movement because, like the competition effect, the greedy movement amplifies the visiting advantage of the dominating team.

After the last iteration, the degrees of membership \(f_i^\ell \in [0, 1]\) corresponding to each node \(v_i\) are calculated using the long-term domination levels, as follows:

$$ f_i^\ell =\frac{v_i^{\lambda_\ell}(\infty)}{\sum_{q=1}^{c}v_i^{\lambda_q}(\infty)} $$
(13)

where \(f_i^\ell\) represents the final membership level of the node \(v_i\) to community ℓ.

Based on the membership degrees (fuzzy output), we define an overlapping measure in order to easily illustrate the application of the algorithm. The overlapping index \(o_i\) for a node \(v_i\) is defined as follows:

$$ o_i = \frac{f_i^{\ell^{**}}}{f_i^{\ell^{*}}}, $$
(14)

where \(\ell^* = \arg \max\nolimits_\ell f_i^\ell\), \(\ell^{**} = \arg \max\nolimits_{\ell, \ell \neq \ell^*} f_i^\ell\), and \(o_i \in [0, 1]\): \(o_i = 0\) means complete confidence that the node belongs to a single community, while \(o_i = 1\) means the node is completely undefined, being equally shared between two or more communities.

If needed, hard labels may be easily obtained through the following equation:

$$ y_i = \arg\max_\ell f_i^\ell, $$
(15)

i.e., the node is hard labeled after the class with the highest membership level. These hard labels may be very different from those obtained from instantaneous domination levels, such as in Breve et al. (2012). They are usually more accurate for classifying outliers and the nodes around outliers. They may also be used to accurately reclassify wrongly labeled nodes, as the instantaneous domination levels of labeled nodes are always fixed and, therefore, cannot be used for this purpose.
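All three outputs (Eqs. 13–15) can be computed in one pass over the long-term domination levels, as in the following sketch (it assumes every node received at least one random-movement visit, so that no row of the long-term matrix is all zeros):

```python
import numpy as np

def outputs(lam):
    """Sketch of the fuzzy output (Eq. 13), overlapping index (Eq. 14),
    and hard labels (Eq. 15) from the long-term domination levels."""
    f = lam / lam.sum(axis=1, keepdims=True)  # Eq. (13)
    top2 = np.sort(f, axis=1)[:, -2:]         # two largest memberships
    o = top2[:, 0] / top2[:, 1]               # Eq. (14): second / first
    y_hard = f.argmax(axis=1) + 1             # Eq. (15), 1-based labels
    return f, o, y_hard
```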

Overall, the proposed algorithm can be outlined as follows:

1. If the input data set is vector-based, build the undirected and unweighted network using Eq. (2).
2. Initialize the nodes’ instantaneous domination levels using Eq. (4) and set the long-term levels to zero; generate a particle for each labeled node, with initial strength given by Eq. (5) and distance table given by Eq. (6).
3. Repeat until convergence, for each particle: (a) select the random or greedy movement and choose a target node with the probabilities of Eq. (7) or Eq. (8); (b) update the target node’s instantaneous domination levels using Eq. (9) and, in the random movement, its long-term levels using Eq. (10); (c) update the particle strength (Eq. 11) and distance table (Eq. 12); (d) accept the particle at the target node if its team dominates it; otherwise, the particle returns to its previous node.
4. Compute the membership degrees using Eq. (13) and, if required, the overlapping indexes (Eq. 14) and hard labels (Eq. 15).
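The following self-contained sketch strings these steps together in Python/NumPy. All names, the fixed iteration count, and the tie handling in the acceptance rule are illustrative assumptions; the paper describes the dynamics, not a reference implementation:

```python
import numpy as np

def particle_competition_cooperation(W, y, c, p_grd=0.5, delta_v=0.1,
                                     n_iter=10000, seed=None):
    """Sketch of the full method; W is the adjacency matrix and y holds
    labels in 1..c (0 = unlabeled). Returns the fuzzy output (Eq. 13);
    it assumes the graph is connected and n_iter is large enough for
    every node to receive random-movement visits."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    omega = np.full((n, c), 1.0 / c)              # Eq. (4)
    labeled = np.flatnonzero(y)
    omega[labeled] = 0.0
    omega[labeled, y[labeled] - 1] = 1.0
    lam = np.zeros((n, c))
    teams = y[labeled] - 1                        # 0-based team labels
    pos = labeled.copy()                          # particles start at home
    strength = np.ones(len(labeled))              # Eq. (5)
    dist = np.full((len(labeled), n), n - 1)      # Eq. (6)
    dist[np.arange(len(labeled)), labeled] = 0
    for _ in range(n_iter):
        for j in range(len(labeled)):
            q, l = pos[j], teams[j]
            greedy = rng.random() < p_grd
            if greedy:                            # Eq. (8)
                p = W[q] * omega[:, l] * (dist[j] + 1.0) ** -2
                if p.sum() == 0.0:                # guard: no dominated
                    p = W[q].astype(float)        # neighbor at all
            else:                                 # Eq. (7)
                p = W[q].astype(float)
            i = rng.choice(n, p=p / p.sum())
            if y[i] == 0:                         # Eq. (9)
                old = omega[i].copy()
                others = np.arange(c) != l
                omega[i, others] = np.maximum(
                    0.0, old[others] - delta_v * strength[j] / (c - 1))
                omega[i, l] = old[l] + (old[others]
                                        - omega[i, others]).sum()
            if not greedy:                        # Eq. (10)
                lam[i, l] += strength[j]
            strength[j] = omega[i, l]             # Eq. (11)
            if dist[j, q] + 1 < dist[j, i]:       # Eq. (12)
                dist[j, i] = dist[j, q] + 1
            if omega[i, l] >= omega[i].max():     # acceptance rule
                pos[j] = i                        # else: shock, stay put
    return lam / lam.sum(axis=1, keepdims=True)   # Eq. (13)
```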

3 Computer simulations

In this section, we present some simulation results to evaluate the effectiveness of the proposed method. First, the proposed algorithm is applied to artificial data sets with increasing amount of overlapped nodes. Then, the robustness to incorrectly labeled nodes is demonstrated, including the reclassification of these nodes. Next, the proposed algorithm is applied to some real-world data sets, including both network-based and vector-based data sets. Finally, the algorithm is evaluated with the benchmark for undirected and unweighted networks with overlapping communities proposed by Lancichinetti and Fortunato (2009) (LFR benchmark), in order to make it easier to compare it to other methods.

For all vector-based data sets, the networks are built by using Eq. (2), with the parameter k being empirically set for each problem, i.e., a set of simulations is executed by varying k in the interval [0.01n, 0.1n], and the value leading to the best results is chosen. The Euclidean distance is used in all cases. The algorithm parameters are also empirically set to \(\Updelta_v=0.1\) and \(p_{\rm grd} = 0.5\) for all the experiments. All results shown in this section are the average of 50 to 1,000 executions.

Figure 1a–c shows the results of the proposed method applied to three banana-shaped data sets generated using the PRTools (Duin et al. 2007) function gendatb, with 1,000 elements each (500 per class) and different variance parameters s = {0.6, 0.8, 1.0}. For each data set, 50 elements (5 %) were randomly selected as the labeled ones. The sizes of the nodes in the plots are proportional to their respective overlapping indexes. We see that there are more overlapping nodes, and the overlapping levels are higher, as the classes get more mixed. This situation matches well the results we obtain through direct visual inspection. These experiments are repeated 100 times. The mean standard deviations of the membership levels are 0.0216, 0.0222, and 0.0598 for Fig. 1a–c, respectively. After assigning the hard labels, the average correct classification rates are 0.9945, 0.9923, and 0.9607, respectively; and the standard deviations are 0.0108, 0.0140, and 0.0164, respectively.

Fig. 1

Fuzzy classification of two banana-shaped classes generated with different variance parameters: a s = 0.6, b s = 0.8, c s = 1.0. Node sizes and colors represent their respective overlapping indexes detected by the proposed method

Figure 2a shows a data set with 4 classes with Gaussian distributions, generated by using the PRTools (Duin et al. 2007) function gendats with 1,000 elements (250 per class); 20 samples are labeled (5 per class), represented by the squares, triangles, lozenges, and stars. The algorithm is applied to the data set and the detected overlapping indexes are shown in Fig. 2b. We see that the nodes in the interior of each class are small, i.e., they are clearly non-overlapping nodes. Meanwhile, the nodes on the borders between classes have larger sizes, which represent their different levels of overlapping. These results are in agreement with our intuition. These experiments are repeated 100 times and the mean standard deviation of the membership levels is only 0.0037. For the hard labels, the average correct classification rate is 0.9546 and the standard deviation is 0.0012.

Fig. 2

Classification of normally distributed classes (Gaussian distribution). a Toy data set with 1,000 samples divided into four classes; 20 samples are labeled, 5 from each class (squares, triangles, lozenges, and stars). b Node sizes represent their respective overlapping index detected by the proposed method

Referring to Fig. 2a, we note that there is a triangle inside the area of the square class; it is clearly an outlier. However, it does not mix up the overlapping indexes of the nodes around it. This happens because a particle whose home node is an outlier has difficulty defending its neighborhood, since it may be far from its teammates and receives little or no help from them. A particle whose home node is an outlier may eventually abandon its home if its neighborhood is dominated by another team. In this case, it may migrate to the neighborhood of one of its nearby teammates. Although an outlier can eventually change a little the instantaneous domination levels (\(v_i^\omega(t)\)) of its neighbors, it has a very weak effect on the long-term domination levels (\(v_i^\lambda(t)\)) of these same neighbors. Thus, we can achieve good classification results even when the data sets have some outliers. Notice that the instantaneous domination levels are fixed for labeled nodes, but the long-term domination levels are not. Thus, through the long-term domination levels, a labeled node can be reclassified using Eq. (15) if it is an outlier.

In order to show these features, we perform simulations on the artificial data set presented in Fig. 3a. It has 2,000 elements distributed into two banana-shaped classes (1,000 elements per class); 100 of them (5 %) are labeled (circles and squares), but 10 of these labeled nodes have wrong labels, representing outliers. Figure 3b shows the classification by the proposed method. The hard labels are obtained through Eq. (15), i.e., each sample is simply classified into the class with the highest membership level. We can see that the wrongly labeled nodes do not affect the classification of their neighbors, and the outliers themselves are eventually reclassified into their respective proper classes. These experiments are repeated 100 times; the average correct classification rate is 0.9975 and the standard deviation is only 0.0001.

Fig. 3

Classification of data sets with some outliers: a artificial data set with some wrongly labeled nodes, b classification by the proposed method

Next, the proposed algorithm is applied to a network-based real-world data set: the Zachary’s Karate Club Network (Zachary 1977), which is already an undirected and unweighted network, so the prior graph construction step is not needed. The data set is presented to the algorithm with only two labeled nodes: 1 and 34, each one representing a different class. The results are shown in Fig. 4, and the overlapping index of each node is indicated by its size. Our visual inspection indicates that this is a good result as well. Notice that although the two labeled nodes exhibit some degree of overlapping, the algorithm still produced a good result, even detecting these overlapping degrees in the labeled nodes (notice their slightly larger sizes). This is also a desirable feature, since we do not need to choose a non-overlapping node to represent a class. The three most overlapping nodes detected by the proposed algorithm are nodes 9, 3, and 20, which matches the results obtained by Zhang et al. (2007b). Notice that these results are the mean of 1,000 executions, and the standard deviation of the continuous output is only 0.003, i.e., the method’s output is quite consistent. By applying Eq. (15), the hard labels are obtained and the algorithm achieves a perfect classification score (100 % correct classification of the 34 nodes) in all the 1,000 repetitions.

Fig. 4

The Karate Club Network. Node sizes and colors represent their respective overlapping index detected by the proposed method

As the next step, the proposed method is applied to two vector-based real-world data sets from the UCI Machine Learning Repository (Frank and Asuncion 2010): the Iris Data Set and the Wine Data Set. The Iris Data Set has 150 samples of 3 different types of Iris flower: Iris Setosa, Iris Versicolour, and Iris Virginica. The first class (Iris Setosa) is quite different from the other two, while the latter are more similar and hard to separate from each other. There are 50 samples in each class and 4 real-valued attributes for each sample. The network is built from the data using Eq. (2) with k = 5, which was empirically set. We randomly select 10 % of the samples to be presented to the algorithm with their respective labels, while the remaining ones are presented unlabeled. The degrees of membership of each sample to each class obtained by the proposed algorithm are presented in Table 1. A graph representation is shown in Fig. 5, in which the overlapping index of each node is indicated by its size. Notice that the linearly separable class becomes a disconnected subgraph, and the membership degrees of its samples are 1 for the respective class and zero for the others, as expected for clearly non-overlapping nodes. The other two classes are not linearly separable, as they have some degree of overlapping. The different overlapping degrees of all these nodes, and their respective pertinence to each of the classes, are detected by the proposed algorithm. These experiments are repeated 1,000 times and the mean standard deviation of the membership levels is 0.0803. Regarding the hard labels, the average correct classification rate is 0.9375 and the standard deviation is 0.0458.

Table 1 Degrees of membership from each sample to each class obtained by the proposed method for the Iris Data Set

Fig. 5

The Iris Data Set. Node sizes and colors represent their respective overlapping index detected by the proposed method

The Wine Data Set has 178 samples of 3 different types of wine. There are 59, 71, and 48 samples in classes 1, 2, and 3, respectively. Each sample has 13 integer and real-valued attributes. The network is also built from the data using Eq. (2) with k = 5. Once more, we randomly select 10 % of the samples to be presented to the algorithm with their respective labels, while the remaining ones are presented unlabeled. The degrees of membership of each sample to each class obtained by the proposed algorithm are presented in Table 2, and a graph representation is shown in Fig. 6, in which the overlapping index of each node is indicated by its size. By analyzing the results, we notice that there are some overlapping nodes between classes 1 and 2, and some more between classes 2 and 3. On the other hand, classes 1 and 3 are almost completely separated. These experiments are repeated 1,000 times and the mean standard deviation of the membership levels is 0.0794. Regarding the hard labels, the average correct classification rate is 0.9326 and the standard deviation is 0.0364.

Table 2 Degrees of membership from each sample to each class obtained by the proposed method for the Wine Data Set
Fig. 6

The Wine Data Set. Node sizes and colors represent their respective overlapping index detected by the proposed method

Finally, the proposed method is evaluated with the benchmark for undirected and unweighted networks with overlapping communities proposed by Lancichinetti and Fortunato (2009) (LFR benchmark). This benchmark uses, as the performance measure, the normalized mutual information between the planted and the recovered partition, in its generalized form for overlapping communities proposed by Lancichinetti et al. (2009). The benchmark does not use the overlap degrees, only hard labels. Therefore, we adapted the proposed method to produce a single hard label using Eq. (15) when the overlap index is low (\(o_i \leq 0.5\)), and two hard labels when the overlap index is high (\(o_i > 0.5\)). In the latter case, the first hard label is given by Eq. (15), and the second hard label (\(y^{**}_i\)) is given by:

$$ y^{**}_i =\arg\max_{\ell, \ell \neq y_i} f_i^\ell, $$
(16)

i.e., the node’s second hard label is assigned after the team with the second largest membership level on that node. In all cases, 10 % of the nodes are randomly chosen to be presented to the algorithm with their respective labels.
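As a sketch of this adaptation (illustrative names; f is the membership matrix of Eq. (13) and 0.5 is the threshold stated above):

```python
import numpy as np

def benchmark_labels(f, threshold=0.5):
    """Sketch of the benchmark output: each node gets the hard label of
    Eq. (15), plus the second label of Eq. (16) when the overlapping
    index of Eq. (14) exceeds the threshold. Labels are 1-based."""
    order = np.argsort(f, axis=1)
    first, second = order[:, -1], order[:, -2]
    rows = np.arange(len(f))
    o = f[rows, second] / f[rows, first]   # Eq. (14)
    return [[a + 1, b + 1] if oi > threshold else [a + 1]
            for a, b, oi in zip(first, second, o)]
```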

The proposed method’s benchmark output is shown in Figs. 7 and 8. Each data point in these figures is obtained by averaging 50 executions with different generated networks. The vertical bars indicate the standard deviation. These figures make it easier to compare the proposed method’s benchmark performance with that obtained by Lancichinetti and Fortunato (2009) for the Cfinder method, proposed by Palla et al. (2005), as we generate the test networks using exactly the same parameters they have used. The proposed method performed better than Cfinder when the communities are larger (Figs. 7c, d, 8c, d).

Fig. 7

Test of the proposed method on the benchmark for undirected and unweighted networks with overlapping communities (Lancichinetti and Fortunato 2009). The plot shows the variation of the normalized mutual information between the planted and the recovered partition, in its generalized form for overlapping communities (Lancichinetti et al. 2009), with the fraction of overlapping nodes. The error bars indicate the standard deviation. The networks have 1,000 nodes; the other parameters are \(\tau_1 = 2\), \(\tau_2 = 1\), \(\langle k \rangle = 20\), and \(k_{\max} = 50\). a \(S_{\min} = 10\), \(S_{\max} = 50\), \(\mu_t = 0.1\); b \(S_{\min} = 10\), \(S_{\max} = 50\), \(\mu_t = 0.3\); c \(S_{\min} = 20\), \(S_{\max} = 100\), \(\mu_t = 0.1\); d \(S_{\min} = 20\), \(S_{\max} = 100\), \(\mu_t = 0.3\)

Fig. 8

Test of the proposed method on the benchmark for undirected and unweighted networks with overlapping communities (Lancichinetti and Fortunato 2009). The plot shows the variation of the normalized mutual information between the planted and the recovered partition, in its generalized form for overlapping communities (Lancichinetti et al. 2009), with the fraction of overlapping nodes. The error bars indicate the standard deviation. The networks have 5,000 nodes; the other parameters are \(\tau_1 = 2\), \(\tau_2 = 1\), \(\langle k \rangle = 20\), and \(k_{\max} = 50\). a \(S_{\min} = 10\), \(S_{\max} = 50\), \(\mu_t = 0.1\); b \(S_{\min} = 10\), \(S_{\max} = 50\), \(\mu_t = 0.3\); c \(S_{\min} = 20\), \(S_{\max} = 100\), \(\mu_t = 0.1\); d \(S_{\min} = 20\), \(S_{\max} = 100\), \(\mu_t = 0.3\)

4 Conclusions

This paper presents a new graph-based semi-supervised learning method for uncovering the overlapping community structure of networks. The method combines cooperation and competition among particles in order to generate a fuzzy output (soft label) for each node in the network. The fuzzy output corresponds to the levels of membership of the nodes to each class. An overlapping measure is derived from this fuzzy output, and it can be considered as a confidence level on the output label. This mechanism has also been used to determine outliers in data sets. The fuzzy output and the outlier detection realized by our algorithm provide mechanisms to help stop error propagation during the semi-supervised learning process, thus avoiding label propagation risks to a certain level. The method is also able to reclassify incorrectly labeled data items.

Computer simulations were performed with both synthetic and real-world data sets, including vector-based data sets, network-based data sets, and an evaluation with the LFR benchmark. The results show that the proposed model is a promising method for the classification of data sets with overlapping structures and/or a considerable amount of outliers, as well as for detecting and quantifying the overlap of each node in the network.