1 Introduction

Many real-world networks exhibit community structures, e.g., neural networks in biology, social networks in the humanities, or interbank networks in economics, to cite but a few. The notion of community is often of paramount importance, since it allows us to unveil the hidden structure of the network. Communities are usually considered as groups of nodes that are strongly linked to each other.

Over the past few years, community detection has emerged as a cornerstone task in the area of network analysis, providing insight into the underlying structure and potential functions of networks (Girvan and Newman 2002; Newman 2003). The goal of community detection is to organize the different nodes of a graph into several groups or communities. This process is carried out in such a way that nodes belonging to the same community are very similar, while being different from the nodes belonging to other communities.

Many prominent researchers have focused on extracting disjoint communities that partition the set of nodes within a network (Blondel et al. 2008; Pons and Latapy 2006; Rosvall and Bergstrom 2008). More recently, researchers have observed that communities frequently overlap and have proposed algorithms for finding overlapping communities (Yong-Yeol et al. 2010; Coscia et al. 2012; Lancichinetti et al. 2010; Yang and Leskovec 2012; Jelassi et al. 2014). For instance, in a social network, individuals may belong to multiple strong social communities, corresponding to groups such as families, colleagues and friends. Any social network with node attributes can be represented by a bipartite graph, and the extracted bi-communities can be exploited later on, as in the case of author communities in a bipartite bibliographic network, which can be used for citation recommendation (Dai et al. 2018). Bipartite graphs have a particular coverage property, called the maximum matching, which consists in extracting a maximum set of pairwise disjoint links spanning the graph. Our approach is mainly based on this property. Indeed, we introduce a new approach for community detection in social networks using bipartite graphs and the maximum matching property. This contribution explores the properties of bipartite graphs for finding pertinent communities in terms of quality. In fact, we aggregate different criteria to determine these communities, such as stability (Mouakher and Yahia 2016; Mouakher and Ben Yahia 2019; Mouakher et al. 2019), modularity (Newman and Girvan 2004) and bond (Omiecinski 2003), and we rely on the overlapping measure to reduce the inter-community overlap.

The main thrust of this contribution stands in the use of the maximum matching. The latter provides a powerful means to reduce the search space of communities in large networks. Furthermore, it allows a straightforward distribution of the algorithm in order to fulfill scalability requirements. The contribution of the maximum matching is clearly shown by the results of the experimental section, where we study the performance of our approach versus those of the literature in terms of execution time and quality metrics.

The remainder of this paper is organized as follows: In the following section, we sketch the basic concepts of bipartite graph analysis and communities. In Sect. 3, various methods for community detection are categorized and scrutinized. Section 4 is devoted to a thorough description of the new contribution for community detection in bipartite graphs using the maximum matching. The benefits of the maximum matching are clearly shown by the results put in the penultimate section, which describes the complete experimental study and the obtained results. The last section recalls our contribution and sketches issues for future work.

2 Key notions

In this section, we briefly sketch the key notions used in the remainder of this paper. These notions cover the following concepts: bipartite graph, biclique, maximal biclique (Ben Yahia and Mephu Nguifo 2004), Galois connection (Hamrouni et al. 2008), community, pseudo-community (Mouakher and Yahia 2016) and maximum matching.

Definition 1

(Bipartite Graph) A simple graph is called bipartite and denoted by \({\mathcal {G}} = ({\mathcal {U}}, {\mathcal {V}}, {\mathcal {E}})\) if its vertex set can be partitioned into two disjoint subsets \({\mathcal {U}}\) and \({\mathcal {V}}\), where \({\mathcal {E}}\) is the set of edges. Note that every edge has the form \(e = (u,v)\) with \(u \in {\mathcal {U}}\) and \(v \in {\mathcal {V}}\) (Asratian et al. 1998), i.e., no two vertices both in \({\mathcal {U}}\), or both in \({\mathcal {V}}\), are adjacent.

Example 1

Figure 1 illustrates an example of a bipartite graph composed of 7 nodes and 8 edges.

Fig. 1 An example of a bipartite graph with \({\mathcal {U}} = \{1,2,3\}\) and \({\mathcal {V}} = \{4,5,6,7\}\)
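For concreteness, the running example can be rebuilt programmatically. The following sketch, using the networkx library, constructs the graph of Fig. 1; the edge list is reconstructed from the examples used throughout this section.

```python
import networkx as nx

# Bipartite graph of Fig. 1; edge list reconstructed from the running
# examples (e.g., psi(2) = {4,5} and psi(3) = {4,5,6,7}).
G = nx.Graph()
G.add_nodes_from([1, 2, 3], bipartite=0)       # primary set U
G.add_nodes_from([4, 5, 6, 7], bipartite=1)    # secondary set V
G.add_edges_from([(1, 4), (1, 5), (2, 4), (2, 5),
                  (3, 4), (3, 5), (3, 6), (3, 7)])

print(nx.is_bipartite(G))                         # True
print(G.number_of_nodes(), G.number_of_edges())   # 7 8
```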

Definition 2

(Biclique) Let \({\mathcal {G}}=({\mathcal {U}},{\mathcal {V}},{\mathcal {E}})\) denote a bipartite graph. A biclique \(C=({\mathcal {U}}',{\mathcal {V}}')\) is a subgraph of \({\mathcal {G}}\) induced by a pair of two disjoint subsets \({\mathcal {U}}' \subseteq {\mathcal {U}}, {\mathcal {V}}' \subseteq {\mathcal {V}}\), such that \(\forall u \in {\mathcal {U}}', v \in {\mathcal {V}}', (u,v) \in {\mathcal {E}}\).

Definition 3

(Maximal Biclique) A maximal biclique is a biclique that is not contained in any larger biclique. Given a bipartite graph \({\mathcal {G}} = ({\mathcal {U}},{\mathcal {V}},{\mathcal {E}})\), a biclique \((S_{x}, S_{y})\) is a maximal biclique of \({\mathcal {G}}\) if no proper superset of \((S_{x}, S_{y})\) is a biclique, i.e., there exists no biclique \((S_{x}', S_{y}')\) \(\ne\) \((S_{x}, S_{y})\) such that \(S_{x} \subseteq S_{x}'\) and \(S_{y} \subseteq S_{y}'\).

Example 2

An example of a maximal biclique is illustrated by Fig. 2.

Fig. 2 An example of a maximal biclique \(\langle 123, 45\rangle\)

Definition 4

(Galois connection) Let \({\mathcal {G}} = ({\mathcal {U}},{\mathcal {V}},{\mathcal {E}})\) be a bipartite graph. The application \(\psi\) is defined from the powerset of the first set of nodes (i.e., \(\mathcal{P}({\mathcal {U}})\)) to that of the second set (i.e., \(\mathcal{P}({\mathcal {V}})\)). It associates to U the set of nodes \(v \in {\mathcal {V}}\) that are adjacent to all nodes \(u \in U\):

$$\begin{aligned}&\psi : {\mathcal {P}}({\mathcal {U}})\rightarrow {\mathcal {P}}({\mathcal {V}})\\&\quad U\mapsto \psi (U)=\{v \in {\mathcal {V}}| \forall u \in U, (u, v) \in {{\mathcal {E}}}\} \end{aligned}$$

In a dual way, the application \(\phi\) is defined from \(\mathcal{P}({\mathcal {V}})\) to \(\mathcal{P}({\mathcal {U}})\). It associates to V the set of nodes \(u \in {\mathcal {U}}\) that are adjacent to all nodes \(v \in V\):

$$\begin{aligned}&\phi :{\mathcal {P}}({\mathcal {V}})\rightarrow {\mathcal {P}}({\mathcal {U}})\\&\quad V \mapsto \phi (V)=\{u \in {\mathcal {U}} | \forall v \in V, (u, v) \in {\mathcal {E}}\} \end{aligned}$$

The coupled applications (\(\psi\), \(\phi\)) form a Galois connection between the set of nodes in \({\mathcal {U}}\) and that of \({\mathcal {V}}\) (Barbut and Monjardet 1970).
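As an illustration, the two operators can be rendered directly over an edge-list representation of the graph; a minimal sketch, assuming the edge list of Fig. 1:

```python
def psi(U_sub, edges):
    """psi(U): nodes of V adjacent to every node of U_sub."""
    return set.intersection(*({v for (u, v) in edges if u == x} for x in U_sub))

def phi(V_sub, edges):
    """phi(V): nodes of U adjacent to every node of V_sub."""
    return set.intersection(*({u for (u, v) in edges if v == x} for x in V_sub))

edges = [(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6), (3, 7)]
print(psi({3}, edges))     # {4, 5, 6, 7}
print(phi({4, 5}, edges))  # {1, 2, 3}
```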

Definition 5

(Community) Informally, a community C is a subset of nodes of \({\mathcal {U}}\) and \({\mathcal {V}}\) which are connected to each other more than to other nodes of the network.

Let \({\mathcal {G}}= ({\mathcal {U}},{\mathcal {V}},{\mathcal {E}})\) be a bipartite graph. We define a community \(C=\langle A,B\rangle\) with A and B two subsets of nodes belonging respectively to \({\mathcal {U}}\) and \({\mathcal {V}}\). We denote by \(|E^{in}|\) the number of edges inside C and by \(|E^{out}|\) the number of edges leaving C. A and B are highly connected if the ratio between \(|E^{in}|\) and \(|E^{out}|\) is large.

Example 3

In Fig. 3, we can observe that the nodes {2,3} and {4,5,6,7} are highly connected in the graph illustrated by Fig. 1. \(|E^{in}|\) is equal to 6 and \(|E^{out}|\) is equal to 2. Thus, the ratio is equal to 3. Consequently, these nodes can form a community denoted by \(\langle 23, 4567\rangle\). However, this is not the case for \(\langle 12, 67\rangle\), whose ratio is equal to 0. In the same respect, the ratio of \(\langle 2, 4567\rangle\) is equal to \(\frac{2}{6}= 0.33\). The latter cannot form a community on its own, because the nodes {2} and {4,5,6,7} appear in a community with a larger ratio, namely \(\langle 23,4567\rangle\).

Fig. 3 An example of the community \(\langle 23,4567 \rangle\) from the graph illustrated by Fig. 1
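The intra/inter ratio of Example 3 can be checked in a few lines; a minimal sketch, again assuming the reconstructed edge list of Fig. 1:

```python
def in_out_ratio(A, B, edges):
    """|E_in| / |E_out| for a candidate community <A, B>."""
    e_in = [(u, v) for (u, v) in edges if u in A and v in B]
    e_out = [(u, v) for (u, v) in edges if (u in A) != (v in B)]
    return len(e_in) / len(e_out)

edges = [(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6), (3, 7)]
print(in_out_ratio({2, 3}, {4, 5, 6, 7}, edges))  # 6 / 2 = 3.0
print(in_out_ratio({2}, {4, 5, 6, 7}, edges))     # 2 / 6 = 0.33
```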

Definition 6

(Pseudo-Community) The pseudo-community associated to the couple (u, v), denoted PC\(_{(u,v)}\), is a sub-bipartite graph obtained by taking the Cartesian product of the maximal set of nodes of \({\mathcal {U}}\) adjacent to v and the maximal set of nodes of \({\mathcal {V}}\) adjacent to u (Mouakher and Yahia 2016).

Formally,

$$\begin{aligned} PC_{(u,v)}=\{(u',v') \in {\mathcal {E}} \mid u' \in \phi (v) \wedge v'\in \psi (u)\}. \end{aligned}$$

The strong point of this new notion of pseudo-community associated to the couple (u, v) is that all the communities which contain (u, v) and which maximize the intra-community relation can be determined from this sub-graph. We also define the density of a given pseudo-community \(PC_{(u,v)}\) as follows:

$$\begin{aligned} \textit{Density}(PC_{(u,v)})=\frac{\vert PC_{(u,v)}\vert }{\vert \psi (u)\vert \times \vert \phi (v)\vert } \end{aligned}$$

\(\vert PC_{(u,v)}\vert\) represents the cardinality of \(PC_{(u,v)}\), i.e., the number of its existing edges, whereas \(\vert \psi (u)\vert\) is equal to the number of edges incident to u and \(\vert \phi (v)\vert\) is equal to the number of edges incident to v.

Example 4

\(PC_{(2,5)} = \langle \{1,2,3\}, \{4,5\}\rangle\) is a pseudo-community associated to the couple (2, 5) in the bipartite graph shown by Fig. 1 and its corresponding density is equal to \(\frac{6}{2 \times 3} = 1\).
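A minimal sketch of the pseudo-community extraction and of its density, over the same edge-list representation:

```python
def pseudo_community(u, v, edges):
    """PC_(u,v): edges of E lying in phi(v) x psi(u)."""
    psi_u = {b for (a, b) in edges if a == u}   # neighbours of u in V
    phi_v = {a for (a, b) in edges if b == v}   # neighbours of v in U
    return [(a, b) for (a, b) in edges if a in phi_v and b in psi_u]

def pc_density(u, v, edges):
    """Density(PC_(u,v)) = |PC_(u,v)| / (|psi(u)| * |phi(v)|)."""
    psi_u = {b for (a, b) in edges if a == u}
    phi_v = {a for (a, b) in edges if b == v}
    return len(pseudo_community(u, v, edges)) / (len(psi_u) * len(phi_v))

edges = [(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6), (3, 7)]
print(pc_density(2, 5, edges))  # 6 / (2 * 3) = 1.0
```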

Definition 7

(Maximum Matching) A matching in a bipartite graph is a set of edges chosen in such a way that no two edges share an endpoint. A maximum matching (Gibbons 1985) is defined as a matching of maximum size (maximum number of edges): if any edge is added to a maximum matching, it is no longer a matching. There can be more than one maximum matching for a given bipartite graph (Mucha and Sankowski 2004).

Example 5

An example of a maximum matching is shown by Fig. 4. In this example, the set of edges {(1,5), (2,4), (3,6)} represents a maximum matching of the bipartite graph illustrated by Fig. 1.

Fig. 4 An example of a maximum matching of the bipartite graph illustrated by Fig. 1
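A maximum matching can be computed efficiently, e.g., with the Hopcroft-Karp algorithm, which runs in \(O(|E|\sqrt{|V|})\) time. A sketch with networkx follows; since maximum matchings are not unique, the returned matching may differ from the one shown in Fig. 4.

```python
import networkx as nx
from networkx.algorithms import bipartite

G = nx.Graph([(1, 4), (1, 5), (2, 4), (2, 5),
              (3, 4), (3, 5), (3, 6), (3, 7)])

# Hopcroft-Karp returns a dict mapping every matched node to its partner.
matching = bipartite.hopcroft_karp_matching(G, top_nodes=[1, 2, 3])
pairs = {(u, matching[u]) for u in [1, 2, 3] if u in matching}
print(pairs)  # a maximum matching of size 3, e.g. {(1, 5), (2, 4), (3, 6)}
```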

After introducing the key concepts, we now review how the community detection issue has been tackled by the research community.

3 Scrutiny of the related work

Community detection (González-Pardo et al. 2017) has been extensively studied within the context of unipartite graphs. Most of these algorithms rely on the modularity measure, as defined by Newman and Girvan (2004).

$$\begin{aligned} Q= \frac{1}{2m}\sum _{i,j} \left[ A_{ij} - \frac{k_{i} k_{j}}{2m}\right] \delta (c_{i},c_{j}) \end{aligned}$$
(1)

where \(A_{ij}\) is the value of the adjacency matrix between the vertices i and j, \(k_{i}\) is the sum of the weights of the edges adjacent to i, m is the number of edges of the graph, \(c_{i}\) indicates the class assigned to the node i, and \(\delta (c_{i},c_{j})\) is the Kronecker delta, which is 1 if \(c_{i}\) is equal to \(c_{j}\) and 0 otherwise.

$$\begin{aligned} \delta (c_{i},c_{j}) = {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } i \text { and } j \text { are in the same community,}\\ 0 &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
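Eq. (1) is implemented in standard graph libraries; a minimal sketch with networkx, where the two-block partition is purely illustrative:

```python
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.karate_club_graph()
# Illustrative two-block partition of the 34 nodes
partition = [set(range(17)), set(range(17, 34))]
print(modularity(G, partition))  # the quantity Q of Eq. (1)
```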

In the remainder of this section, we discuss community detection algorithms that are intended for bipartite networks. At a glance, the dedicated literature witnessed three main streams for addressing such a task: (1) modularity-based algorithms; (2) minimum description length algorithms; and (3) link partitioning algorithms. In the following, we sketch these approaches.

  1. Modularity-based algorithms

    Most approaches follow the modularity method proposed by Newman and Girvan (2004) to identify communities in bipartite networks. Due to the particular structure of this kind of network, modularity optimization requires some modifications. In this respect, Guimera et al. proposed a bipartite modularity defined as the cumulative deviation from the random expectation of the number of edges between vertex members of the same bipartite community (Guimerà et al. 2007). The main weakness of this definition is that it focuses on connectivity from the perspective of only one vertex type. In the same trend, Barber extended the definition of Newman's modularity in a unipartite network to make it appropriate for bipartite networks and introduced a bipartite modularity (Barber 2007). The latter relies on the assumption of a one-to-one correspondence between the communities of the two vertex types, i.e., the number of communities should be equal on both sides. A further weakness of Barber's bipartite modularity is that the number of communities has to be determined in advance; consequently, it is not practical in many real-life applications. Later, Murata's definition overcame the above limitations by not enforcing a one-to-one mapping between the communities of both sides (Murata 2009). Unlike previous proposals, his proposal handles the two types of nodes in a uniform framework.

    In this respect, Raghavan et al. proposed an algorithm for detecting communities using the technique of Label Propagation (LPA), which assigns unique labels to nodes and repeatedly updates the label of each vertex with the most frequent label among its neighbors until a termination condition is fulfilled (Raghavan et al. 2007). Later, Barber and Clark reformulated LPA as an optimization problem, addressed its drawbacks with additional constraints, and produced several variants of the LPA algorithm. LPAb is one of these variants and can be used to find modules in bipartite networks. The algorithm proceeds in two main stages, 'bottom-up' and 'top-down'. In the first, it tries to maximize the modularity node by node using the propagation of labels. Next, it tries to join modules together as long as doing so increases the network modularity. Subsequently, Liu and Murata introduced an improved version of LPAb, called LPAb+. The latter has been shown to be the most reliable of these algorithms, yielding the highest bipartite modularity (Murata 2009).

  2. Minimum description length algorithms

    A minimum description length greedy algorithm (MDL-greedy) has been proposed by Xu et al. for choosing a good modular structure in bipartite networks (Xu et al. 2010). MDL-greedy is a heuristic algorithm based on combination theory. It seeks to combine the communities obtained during the previous phase in order to find the optimal community structure at the current phase. It automatically determines the number of partitions and requires no user intervention.

  3. Link partitioning algorithms

    The idea of partitioning links instead of nodes to discover community structure has also been explored. A node in the original graph is called overlapping, whenever links connected to it are put in more than one cluster. Ahn et al. proposed an overlapping community detection algorithm called Link Community, LC, that uses the similarity of the edges to identify hierarchical communities of edges rather than communities of nodes (Yong-Yeol et al. 2010). Given a pair of links \(e_{ik}\) and \(e_{jk}\) incident on a node k, a similarity can be computed through the Jaccard index as follows:

    $$\begin{aligned} S(e_{ik}, e_{jk})=\frac{\vert N_{i} \cap N_{j}\vert }{\vert N_{i} \cup N_{j}\vert } \end{aligned}$$

    where \(N_{i}\) is the inclusive neighborhood of node i, i.e., node i together with its neighbors.
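A direct rendering of this edge similarity is sketched below, with \(N_{i}\) taken as the inclusive neighbourhood of i; the example edges are taken from the graph of Fig. 1.

```python
import networkx as nx

def link_similarity(G, e_ik, e_jk):
    """Jaccard similarity of two edges sharing their second endpoint k."""
    (i, k), (j, k2) = e_ik, e_jk
    assert k == k2, "the two edges must share an endpoint"
    n_i = set(G[i]) | {i}   # inclusive neighbourhood of i
    n_j = set(G[j]) | {j}
    return len(n_i & n_j) / len(n_i | n_j)

G = nx.Graph([(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6), (3, 7)])
print(link_similarity(G, (2, 4), (3, 4)))  # |{4,5}| / |{2,3,4,5,6,7}| = 1/3
```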

In the state of the art, the structure of the modules differs from one approach to another. In fact, the output communities are not necessarily fully connected; roughly speaking, their nodes are not necessarily strongly linked to each other. In addition, the network may or may not be entirely covered by these communities.

Table 1 A comparison between the studied bipartite community detection approaches

Table 1 summarizes the different outputs of the above scrutinized approaches. The first column represents the required input of each surveyed algorithm, which can differ from one algorithm to another. The second column describes the returned communities, which are not necessarily maximal bi-cliques and are not guaranteed to cover the whole graph. The penultimate column shows the objective function used by each method. Finally, the fourth column describes the dependency between the different algorithm tasks, to wit dependent or independent. This characteristic indicates whether it is possible to optimize the work and to treat these tasks in a distributed manner. All of these algorithms consist of dependent tasks, i.e., the graph cannot be split and the processing cannot be distributed.

The major criticism that can be addressed to these algorithms stands in their lack of scalability, i.e., large graphs cannot be processed. To palliate such a drawback, we introduce, through this paper, a new method that splits community detection processing into independent tasks.

4 The fast-bi community detection (FBCD) approach

This section introduces the new approach, called the FBCD algorithm, designed to identify the pertinent community structure in bipartite networks using the maximum matching. The latter ensures the coverage of the network's vertices by a set of critical edges, called maximum matching edges, through which the existence of each community can be determined. It is worth mentioning that all these edges are pairwise disjoint, i.e., no two of them share a node. This makes the distribution of the algorithm easier, since each element can be treated independently. At the same time, the search space is reduced, since the algorithm does not need to process all the existing edges.

The proposed algorithm relies on a heuristic based on quality score optimization. This score is determined by the aggregation of four different criteria according to user needs. The latter criteria are as follows:

  • Stability (Roth et al. 2007): the stability metric for a given community \(\langle A,B\rangle\), denoted by \(\sigma (\langle A,B\rangle )\), describes the proportion of subsets of nodes in A whose closure is exactly equal to B. This metric reflects the dependency of B on particular nodes of A.

    $$\begin{aligned} \sigma (\langle A,B\rangle )=\frac{|~\{X \subseteq A~|~\psi (X)=B\}~|}{2^{|A|}} \end{aligned}$$
    (2)

    The higher the stability is, the higher the quality of the community.

Example 6

Given the bipartite graph depicted in Fig. 1, the stability of the community \(\langle 23,4567\rangle\) is computed as follows: \(A = \{2,3\}, B=\{4,5,6,7\}\)

  • \(\psi (2) = \{4,5\}\)

  • \(\psi (3) = \{4,5,6,7\} = \hbox {B}\)

  • \(\psi (2,3) = \{4,5,6,7\} = \hbox {B}\)

  • \(\sigma (\langle 23,4567\rangle ) = \frac{2}{(2^{2})} = 0.5\)

  • Modularity (Newman and Girvan 2004): the modularity metric is defined as the difference between the actual fraction of edges within the community and the expected fraction of such edges in a randomized graph with the same number of nodes and the same degree sequence. A better community quality is assessed through a higher modularity.

$$\begin{aligned} \left( \frac{|E^{in}_{c}|}{|E|}\right) - \left( \frac{( 2|E^{in}_{c} |+|E^{out}_{c} |)}{2 \times |E|}\right) ^2 \end{aligned}$$
(3)

Example 7

Given the bipartite graph depicted in Fig. 1, the modularity of \(\langle 23,4567\rangle\) is computed as follows: \(|E^{in}| = 6\), \(|E^{out}| = 2\) and \(|E| = 8\). Thus, \(Mod = \frac{6}{8}-\left( \frac{2\times 6+2}{2\times 8}\right) ^2 = 0.75 - 0.766 = -0.016\).
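The computation of Eq. (3) is direct; a small helper reproducing Example 7:

```python
def community_modularity(e_in, e_out, e_total):
    """Eq. (3): |E_in|/|E| - ((2|E_in| + |E_out|) / (2|E|))^2."""
    return e_in / e_total - ((2 * e_in + e_out) / (2 * e_total)) ** 2

print(community_modularity(6, 2, 8))  # -0.015625, i.e., about -0.016
```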

  • Bond (Omiecinski 2003): the bond metric computes the ratio between the conjunctive support (the number of nodes adjacent to every node of A) and the disjunctive support (the number of nodes adjacent to at least one node of A). Thus, the bond measure of a community \(C = \langle A,B\rangle\) is defined as follows:

$$\begin{aligned} {\textit{Bond}}({\langle A,B\rangle }) = \frac{\displaystyle {\textit{Supp}}(\wedge {A})}{\displaystyle {\textit{Supp}}(\vee {A})} \end{aligned}$$
(4)

Example 8

Given the bipartite graph depicted in Fig. 1, the bond of \(\langle 23,4567\rangle\) is computed as follows:

  • Supp(\(\wedge {A}) = |\psi (2) \cap \psi (3)| = |\{4,5\}|=2\).

  • Supp(\(\vee {A}) = |\psi (2) \cup \psi (3)| = |\{4,5,6,7\}|=4\).

  • Bond( \(\langle 23,4567\rangle ) = \frac{2}{4} = 0.5\).
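The bond measure follows directly from \(\psi\); a minimal sketch over the edge-list representation used above:

```python
def bond(A, edges):
    """Bond(<A,B>) = |intersection of psi(a)| / |union of psi(a)|, a in A."""
    neigh = [{v for (u, v) in edges if u == a} for a in A]
    return len(set.intersection(*neigh)) / len(set.union(*neigh))

edges = [(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (3, 6), (3, 7)]
print(bond({2, 3}, edges))  # |{4,5}| / |{4,5,6,7}| = 0.5
```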

  • Overlapping: the overlapping metric is defined as the number of links of a community that are already present in the previously extracted communities. A smaller value of overlapping induces a better quality of the community.

Example 9

Given the bipartite graph depicted in Fig. 1, let us suppose that we have already returned the community \(\langle 12,45\rangle\) and that we are interested in computing the overlapping of the community \(\langle 23,4567\rangle\). The edges that already exist in the returned communities are \(\{(2,4); (2,5)\}\). So, the overlapping of \(\langle 23,4567\rangle\) is equal to \(|\{(2,4); (2,5)\}| = 2\).

After computing the different measures for each community per iteration, the algorithm computes an aggregated score for the different communities. In order to do that, the algorithm relies on the TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution) method, a multi-criteria decision analysis method developed by Hwang et al. (1993). In this method, two artificial alternatives are hypothesized:

  • Ideal alternative: the one which has the best attribute values.

  • Negative ideal alternative: the one which has the worst attribute values.

Let \(x_{ij}\) be the evaluation of the community i according to the measure j, and \(w_{j}\) the weight of measure j. The TOPSIS method operates in six steps, described as follows (a compact sketch in code is given after the list):

  • Step 1: Normalize the decision matrix

    $$\begin{aligned} r_{ij}=\frac{x_{ij}}{\sqrt{\sum _{i=1}^{m} x_{ij}^{2}}}, i=1\ldots m;j=1\ldots n \end{aligned}$$
  • Step 2: Construct the weighted normalized decision matrix by multiplying each rating by the corresponding attribute weight.

    $$\begin{aligned} v_{ij}=w_j r_{ij}, i=1\ldots m;j=1\ldots n \end{aligned}$$

    where \(w_j\) is the weight of criteria j.

  • Step 3: Determine the ideal solution \(A^{*}\) and the negative ideal solution \(A^{-}\).

    $$\begin{aligned} A^{*}&=\lbrace v^{*}_{1},v^{*}_{2},\ldots ,v^{*}_{j},\ldots ,v^{*}_{n} \rbrace \\&= \lbrace (\max _{i}\ v_{ij} \mid j \in J_1), (\min _{i}\ v_{ij} \mid j \in J_2) \mid i=1 \ldots m \rbrace .\\ A^{-}&=\lbrace v^{-}_{1},v^{-}_{2},\ldots ,v^{-}_{j},\ldots ,v^{-}_{n} \rbrace \\&= \lbrace (\min _{i}\ v_{ij} \mid j \in J_1),(\max _{i}\ v_{ij} \mid j \in J_2) \mid i=1 \ldots m \rbrace . \end{aligned}$$

    where \(J_1\) is the set of criteria to be maximized and \(J_2\) is the set of criteria to be minimized.

  • Step 4: Determine the separation from the ideal solution: \(S^{*}_{i}=\sqrt{\sum _{j=1}^{n}(v_{ij} - v^{*}_j)^2}\)

  • Step 5: Determine the separation from the negative ideal solution: \(S^{-}_{i}=\sqrt{\sum _{j=1}^{n}(v_{ij} - v^{-}_j)^2}\)

  • Step 6: Compute the relative closeness to the ideal solution: \(C^{*}_{i}=\frac{S^{-}_i}{(S^{*}_{i}+ S^{-}_{i})}\)
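As announced above, the six steps translate directly into a few lines of numpy; the sketch below is a generic TOPSIS implementation, and the candidate scores and uniform weights are purely illustrative.

```python
import numpy as np

def topsis(X, w, maximize):
    """Relative closeness C* of m alternatives (rows of X) on n criteria.
    maximize[j] is True for criteria in J1, False for those in J2."""
    R = X / np.sqrt((X ** 2).sum(axis=0))                      # Step 1
    V = w * R                                                  # Step 2
    a_star = np.where(maximize, V.max(axis=0), V.min(axis=0))  # Step 3: A*
    a_neg = np.where(maximize, V.min(axis=0), V.max(axis=0))   # Step 3: A-
    s_star = np.sqrt(((V - a_star) ** 2).sum(axis=1))          # Step 4
    s_neg = np.sqrt(((V - a_neg) ** 2).sum(axis=1))            # Step 5
    return s_neg / (s_star + s_neg)                            # Step 6

# Two candidate communities scored on (stability, modularity, bond, overlap);
# only the overlap criterion is to be minimized.
X = np.array([[0.50, 0.10, 0.50, 2.0],
              [0.25, 0.20, 0.30, 0.0]])
w = np.array([0.25, 0.25, 0.25, 0.25])
print(topsis(X, w, np.array([True, True, True, False])))
```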

In terms of performance, TOPSIS has been compared with a number of other multi-attribute methods and was found to perform almost as well as multiplicative additive weights and better than the analytic hierarchy process in matching a base prediction model (Zanakis et al. 1998).

4.1 Description of the proposed algorithm

Figure 5 sketches a diagram that describes the different steps of the FBCD algorithm. The pseudo-code of the FBCD algorithm is given by Algorithm 1.

Fig. 5 Diagram describing the different steps of the FBCD algorithm

The first step, illustrated by part (1) of Fig. 5, considers the set of maximum matching edges \(\mathcal {MM}\) and the initial network \({\mathcal {G}}\) as input of the algorithm. Then, the pseudo-community PC of each couple in \(\mathcal {MM}\) is extracted (line 4 in Algorithm 1). This step is needed to distribute the process in the following step, described by part (2) of Fig. 5, where a distributed call to Pseudo-Community-Detection is made to extract the communities for each element in \(\mathcal {MM}\). During the final step, explained by part (3) of Fig. 5, the algorithm reduces the set of already returned communities to a list of pertinent communities (line 7 of Algorithm 1). This list is the output of our algorithm (line 8 of Algorithm 1).

Algorithm 1 The pseudo-code of the FBCD algorithm

Algorithm 2 describes the Pseudo-Community-Detection step, which takes as input the bipartite graph \({\mathcal {G}}=({\mathcal {U}},{\mathcal {V}},{\mathcal {E}})\) associated to the current pseudo-community, as well as the quality metrics, i.e., stability, modularity, overlapping and bond. It outputs a community partition of the current pseudo-community PC. In a first step, \(\varphi\) is set to all couples (u, v) of \({\mathcal {E}}\) (line 1), and the community partition \(\mathcal {F}_{{\mathcal {G}}}\) is initialized to the empty set (line 2). Then, our algorithm invokes the \(\textsc {Get\_PseudoCommunity}\) function (line 4) in order to compute the pseudo-community of each couple belonging to \(\varphi\).

In the next step, the \(\textsc {Get\_Density}\) function (line 5) assesses the density \(\omega\) of each pseudo-community \(PC_{(u,v)}\). Afterwards, the couples in \(\varphi\) are sorted in decreasing order of density through the \(\textsc {Sort\_Elements}\) function (line 6). Then, for each element of \(\varphi\), the algorithm extracts a locally pertinent maximal bi-clique from its pseudo-community \(PC_{(u,v)}\).

The pertinent maximal bi-clique is obtained through the \(\textsc {Build\_Community}\) function (line 9). In the case where \(\omega (PC_{(u,v)})\) is equal to 1, the pseudo-community is itself a maximal bi-clique. Otherwise, our algorithm proceeds to the extraction of all the local maximal bi-cliques \({S}_{c}\) enclosed in \(PC_{(u,v)}\) by calling the \(\textsc {Get\_All\_Max\_BiCliques}\) function (line 10). It is worth citing that the extraction of the maximal bi-cliques is carried out by a slightly modified version of the very efficient Lcm algorithm (Uno et al. 2004). The choice of this algorithm is justified by the fact that it has a linear complexity in the number of closed attributes; moreover, it has been shown to be one of the best algorithms dedicated to such a task. After that, the \(\textsc {Get\_Metrics}\) procedure (line 12) is invoked in order to compute, for each maximal bi-clique in \({S}_{c}\), the associated metric values. The \(\textsc {Get\_Aggregation}\) function (line 13) is then invoked in order to compute the corresponding score of each maximal bi-clique. To do so, we use the multi-criteria aggregation method TOPSIS (Hwang and Yoon 1981). Depending on the value of the score measure, the algorithm elects the pertinent maximal bi-clique through the \(\textsc {Get\_Pertinent\_Community}\) function (line 14). During the current iteration, the chosen pertinent community is added to the list of pertinent communities \(\mathcal {F}_{{\mathcal {G}}}\) (line 15). Then, the couples included in the chosen community are removed from \(\varphi\) through the \(\textsc {Removed\_Links}\) function (line 16). The algorithm comes to an end whenever the \(\varphi\) list is exhausted and returns the final set of communities \(\mathcal {F}_{{\mathcal {G}}}\) (line 17).

Algorithm 2 The pseudo-code of the Pseudo-Community-Detection step

4.2 Complexity analysis of FBCD

The complexity of the FBCD algorithm depends on that of the Pseudo-Community-Detection part. The latter is assessed as follows:

Let \(n = |V_{1}|\) and \(m=|V_{2}|\) be, respectively, the number of vertices in each set of the bipartite graph.

  1. First part of Pseudo-Community-Detection: the complexity of this part depends on the complexity of the two functions Get_PseudoCommunity and Get_Density (lines 3–5).

  2. Second part of Pseudo-Community-Detection: we have chosen the QuickSort algorithm to sort the elements (u,v) of \(\varphi\), which has a complexity of \(O(n\log (n))\) (line 6).

  3. Third part of Pseudo-Community-Detection: the complexity of this part depends on the number of iterations of the "while" loop (line 7). In fact, the maximal number of iterations is estimated at \(t_{1} = Max(n, m)\) (lines 7–16). The process done by this part can also be split into three subparts:

    (a) lines 8–9: the complexity of this part is equal to that of the function Build_Community\((PC_{(u,v)})\). The latter is about O(n) in terms of the number of iterations (\(|\psi (u)| + |\phi (v)| - 1\)).

    (b) lines 10–14: the Get_All_Max_BiCliques function invokes the Lcm algorithm. The time complexity of the latter is theoretically bounded by a linear function in the number of frequent closed attributes (Uno et al. 2005). Indeed, it enumerates every frequent closed pattern to derive its closure, so its complexity is equal to \(O(n^2)\). Then, assessing the metrics used for selecting the pertinent community has a complexity of \(O(t_{2})\), where \(t_{2} = \hbox {Max}(|\psi (u)|, |\phi (v)|)\). After that, the TOPSIS method used to aggregate the metrics has a complexity of \(O(t_{2})\) per iteration. Thus, the complexity of this subpart is about \(O(t_{1}^{2})\).

    (c) lines 15–16: the final subpart of our algorithm is about \(O(|A|*|B|)\).

To sum up, the complexity of the FBCD algorithm is determined by summing the corresponding complexities of the three parts of Pseudo-Community-Detection, which are treated in a distributed fashion. Finally, the theoretical complexity of our algorithm is about \(n\log (n) + n\times [2n + n^{2}] + n^{2} = n^{3} + 3n^{2} + n\log (n) = O(n^{3})\).

5 Experimental results

This section presents a detailed study of the performance of the proposed algorithm. Thus, we briefly introduce the datasets and the real-world networks used during this study. Then, we discuss the obtained results.

5.1 Real-world networks and datasets

In our approach, we used 20 different datasets extracted from 3 well-known repositories: KONECT networks, Network Repository and the SNAP dataset. A brief summary of all the datasets is given in Table 2. This table contains, for each dataset, the following information: the repository where the dataset can be found (Source column), the application domain of the dataset, its name, and the characteristics defining the dataset. Note that for the 'Domain' column, we used the following acronyms: Social Networks (SN); Web Graphs (WG); Network Dataset (ND); Brain Network (BN); Collaboration Network (CN); Recommendation Network (RN) and Citation Network (CIT).

  1. KONECT networksFootnote 3: For testing our first contribution, we use four datasets from the KONECT databases: Southern Women (Davis et al. 2009), American Revolution, Corporate Leadership and South African Companies. The "Southern Women" networkFootnote 4, collected by Davis et al. (2009), shows the participation of 18 white women (who form the primary set U) in 14 social events (the secondary set V) over a nine-month period. The data was collected in the Southern United States of America in the 1930s. There is an edge for every woman who participates in an event; the first column contains the women, the second column the events. The "American Revolution" networkFootnote 5 contains the membership information of 136 people (forming the primary set U) in 5 organizations (the secondary set V) dating back to the time before the American Revolution. The list includes well-known people such as the American activist Paul Revere. Left nodes represent persons and right nodes represent organizations; an edge between a person and an organization shows that the person was a member of the organization. The "Corporate Leadership" networkFootnote 6 contains person-company leadership information between 20 companies (forming the primary set U) and 24 corporate directors (the secondary set V). The data was collected in 1962. Left nodes represent persons and right nodes represent companies; an edge between a person and a company shows that the person had a leadership position in that company. The "South African Companies" networkFootnote 7 contains the person-company shared leadership relations of "the five most representative companies" that are claimed to represent "the small inner ring of South African Finance". Left nodes represent persons (the primary set U) and right nodes represent companies (the secondary set V); an edge between a person and a company shows that the person had a leadership position in that company.

  2. SNAP networksFootnote 8: We choose two networks with ground-truth communities collected by SNAP [38]: Cit-HepPh and Cit-HepTh. The latter capture citation relations, and the ground-truth communities are paper-defined groups. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.

  3. Network RepositoryFootnote 9: A network repository is a logical and physical grouping of data from related but separate networks. In these cases, a repository is necessary to bring together the discrete data items and operate on them as one. We used different categories of networks from this repository, such as: Social Networks, Web Graphs, Network Datasets, Brain Networks, Collaboration Networks, Facebook Networks and Recommendation Networks.

Table 2 The datasets and the real-world networks used for experiments

5.2 Experiments on real-world networks

In the following, we start by presenting the quality metrics used to assess the performance of the introduced algorithm.

5.2.1 Quality metrics

In the following, we put the focus on the evaluation of the performance of the FBCD algorithm using various metrics, namely modularity (Murata 2009), conductance (Kannan et al. 2000) and density (Viard and Latapy 2014). These metrics describe how community-like the connectivity structure of a given set of nodes is. Indeed, they rely on the fact that communities are sets of nodes with many internal edges and few external ones. Thus, given a network \({\mathcal {G}}=({\mathcal {V}},{\mathcal {E}})\) and a community or set of nodes C, the number of nodes in the community is denoted by |C|, and \(|E^{in}_{C}|\) denotes the total number of edges in C for unweighted networks, or the total weight of these edges for weighted networks. In addition, we denote by \(|E^{out}_{C}|\) the total number of edges from the nodes in the community to the nodes outside C for unweighted networks, or the total weight of such edges for weighted networks. We review these metrics below.

  1. Modularity This metric was designed to assess the strength of the division of a network into communities. Indeed, networks with high modularity have dense connections between the nodes within communities but sparse connections between nodes in different communities. Table 3 compares the results of the FBCD algorithm with those of the surveyed algorithms of the literature; the results are obtained on four medium-sized graphs. As shown by Table 3, the worst results on average were obtained by the MDL-greedy algorithm. For the Corporate Leadership graph, the latter gives the same result as the FBCD algorithm, which is the highest obtained value for this graph. Even though the LPAb+ algorithm does not yield the highest result for any of the graphs, its results are better than those obtained by the MDL-greedy algorithm. On average, the highest value, equal to 0.14, is yielded by the FBCD algorithm, which outperforms all its competitors.

  2. Conductance The detected communities can also be assessed through the conductance metric. The latter is based on the density of communities and the number of links emerging from them. A community structure is expected to have a high number of links within it and a small number of outbound links. The conductance metric is based on the ratio between the number of outbound links \(E^{out}_{C}\) and the total number of links of a community C. If we consider a community C of a graph \(G = (V_{1}, V_{2}, E)\), with \(C = (V_{C}, E_{C})\) (\(V_{C}\) the set of vertices of C and \(E_{C}\) the set of edges of C), the conductance of this community is defined by \(\varPsi (C, G)=\frac{|E^{out}_{C}|}{2|E^{in}_{C} |+|E^{out}_{C}|}\). Considering a partition \(P = \{C_{1}, \ldots , C_{k}\}\) into k parts of disjoint nodes, the conductance of G is defined as follows: \(\varPsi _{G} = \frac{1}{k}\sum _{C=1}^{k} \varPsi (C, G) = \frac{1}{k}\sum _{C=1}^{k} \frac{|E^{out}_{C}|}{ 2|E^{in}_{C} |+|E^{out}_{C}|}\). The conductance values stand within the unit interval: the closer this value is to 0, the higher the community density. In this respect, Table 4 illustrates a comparison between our algorithm and its competitors. We note that the linkComm algorithm gives poor values for all of the considered datasets, even though it shows very good performance in terms of modularity. If we glance at the values for the American Revolution graph, the optimum is reached by all the other algorithms, that is to say, the three approaches yield the same detections; in this case, we can deduce the optimal detection for the American Revolution graph. Finally, according to Table 4, we find that our algorithm outperforms its competitors in terms of conductance.

  3. Density We start by discussing the intra-community density. The latter is defined as the number of existing edges over the number of edges that could exist in the community. Plainly speaking, it is the probability that two nodes chosen at random from the two sets \(V_{C1}\) and \(V_{C2}\) of the same community are linked together. Considering a community C of a graph \(G = (V_{1}, V_{2}, E)\), with \(C = \{V_{C1}, V_{C2}, E_{C}\}\) (\(V_{C1}\) (resp. \(V_{C2}\)) the set of vertices of C in \(V_{1}\) (resp. \(V_{2}\)) and \(E_{C}\) the set of edges of C), the density of this community is equal to \(MQ^{+}(C)=\frac{|E^{in}_{C}|}{|V_{C1}|*|V_{C2}|}\). Considering a partition \(P = \{C_{1}, \ldots , C_{k}\}\) into k parts of disjoint nodes, the density of G is defined as follows: \(MQ^{+}_{G} = \frac{1}{k}\sum _{C=1}^{k} MQ^{+}(C) = \frac{1}{k}\sum _{C=1}^{k} \frac{|E^{in}_{C}|}{|V_{C1}|*|V_{C2}|}\). The intra-community density takes a value between 0 and 1; a large value is better than a small one in terms of community quality assessment. The inter-community density is the probability that two different nodes, chosen at random in two different communities, are linked together. So, given a graph \(G = (V_{1}, V_{2}, E)\) and a partition \(P = \{C_{1}, \ldots , C_{k}\}\) into k communities, the inter-community density is defined by the ratio between the number of edges connecting vertices of communities \(C_{i}\) and \(C_{j}\) and the maximum possible number of such edges:

    $$\begin{aligned} MQ^{-}_{ci,cj}= & {} \frac{|(v_{i1},v_{j2}) ; v_{i1} \in V_{i1}, v_{j2} \in V_{j2}, (v_{i1},v_{j2}) \in E | + |(v_{j1},v_{i2}) ; v_{j1} \in V_{j1}, v_{i2} \in V_{i2}, (v_{j1},v_{i2}) \in E|}{|V_{i1}|*|V_{j2}| + |V_{j1}|*|V_{i2}|}.\\ MQ^{-}_{G}= & {} \frac{1}{k(k-1)/2}\sum \nolimits _{i=1}^{k-1} \sum \nolimits _{j=1}^{k} (MQ^{-}_{ci,cj}). \end{aligned}$$

    The inter-community density values stand within the unit interval. A weak value is preferable to a large one in terms of community quality assessment.

    The main thrust of the density metric is to compare the average density of communities with the density of edges between communities: \(MQ_{G} = MQ^{+}_{G} - MQ^{-}_{G}\). The quality of the output communities depends on a higher intra-community density \(MQ^{+}_{G}\) and a lower inter-community density \(MQ^{-}_{G}\). The density ranges between \(-1\) and 1. Table 5 shows the density of the obtained communities. In fact, our algorithm maximizes the intra-community density for all the considered graphs. Even though the inter-community density is not always minimized, an acceptable value is obtained on average, to wit 0.78. This result shows the coherence of the obtained communities versus those obtained by its competitors. A small code sketch of these partition-level scores is given after this list.
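As referenced above, the partition-level conductance and density scores reduce to simple averages over the communities; a minimal sketch, where each community is summarized by its edge and node counts (the numbers are illustrative):

```python
def conductance(e_in, e_out):
    """Conductance of one community: |E_out| / (2|E_in| + |E_out|)."""
    return e_out / (2 * e_in + e_out)

def intra_density(e_in, n1, n2):
    """MQ+ of one community <V_C1, V_C2>: |E_in| / (|V_C1| * |V_C2|)."""
    return e_in / (n1 * n2)

# Each community summarized as (E_in, E_out, |V_C1|, |V_C2|)
communities = [(6, 2, 2, 4), (3, 1, 1, 3)]
k = len(communities)
print(sum(conductance(ei, eo) for ei, eo, _, _ in communities) / k)         # Psi_G
print(sum(intra_density(ei, n1, n2) for ei, _, n1, n2 in communities) / k)  # MQ+_G
```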

Table 3 The quality of returned communities in terms of the modularity metric
Table 4 The quality of returned communities in terms of the conductance metric
Table 5 The quality of returned communities in terms of the density metric

5.2.2 Processing time

Finally, we analyze the performance of the proposed algorithm using the maximum matching. To do so, we present a comparative study using a set of sample networks and some of the algorithms studied in Sect. 3. According to Table 6, community detection with the competing algorithms is almost impossible on large datasets. Furthermore, Table 6 shows that our algorithm is more efficient than those of the literature in terms of execution time. Using the networks given by the KONECT datasets, the FBCD algorithm provides the best average execution time, equal to 0.01 s. Clearly, our algorithm outperforms MDL-greedy, LPAb+ and linkComm, which respectively require 1.197, 5.831 and 0.049 s.

Table 6 The quality of returned communities in terms of processing time in second

BCD is supposed to handle all the existing edges in the network, and we compare this baseline with our new algorithm. The results concerning the maximum matching are presented in Table 7. The FBCD column contains the results obtained by the algorithm using the maximum matching, whereas the penultimate column (BCD) corresponds to the execution of the algorithm without considering the maximum matching. According to Table 7, we find that the search space of the new FBCD algorithm is very limited compared to that of BCD; indeed, the latter explores all the edges, as do all the surveyed approaches. This characteristic holds for large networks as well as small ones. This fact is clearly stressed by the two large datasets socfb-Cal65 and socfb-Bingham82, whose maximum matchings contain only 131 and 111 edges, respectively.

Table 7 Analysis of the influence of the maximum matching for both of the BCD and the FBCD algorithms

6 Conclusion

In this article, we proposed a new paradigm of community structure for social network analysis and community detection in bipartite graphs. We presented a formal definition of the concept of 'community structure' and proposed a systematic algorithm, called FBCD, to discover these communities.

We conducted a comprehensive benchmarking study of approaches to community detection in social networks. Through these extensive experiments, we demonstrated that community structure exists in real-world networks of various domains, and that our proposed method significantly outperforms those of the literature, including modularity-based algorithms, minimum description length methods and link partitioning approaches. Avenues for future work are as follows:

  1. Consideration of other types of bipartite networks We are currently exploring the extraction of communities from other types of networks, e.g., directed, weighted or dynamic networks. In fact, we plan to apply our community detection algorithm as the basis for designing a new method to find the community structure in these new categories of networks.

  2. Scalability The considered datasets are not the most representative ones of the era of Big Data. It is also a compelling task to provide an implementation under the distributed framework Spark.