1 Introduction

In real-life applications, graphs are widely adopted to model the relationships among different entities [15, 16]. Mining cohesive subgraphs is one of the most fundamental problems in graph analysis [13, 18]. In the literature, different cohesive subgraph models have been developed, such as k-core [1], k-truss [22], clique [14], etc. In this paper, we utilize the k-core model to measure the engagement of users in a network. Generally, the k-core of a graph is the maximal subgraph in which each vertex has at least k neighbors. In real scenarios such as social networks, the users are associated with attributes (e.g., personal preferences) [4, 9]. In attribute graphs, a user is more likely to build contacts with neighbors who share common interests with them. For example, in social networks, friends with a common interest in sports have greater potential to join a sports community together.

Motivated by this, in this paper, we develop a new model named attribute k-core to better characterize a community. Given an attribute graph and a subset λ of attributes, the attribute k-core depending on λ is the maximal subgraph in which each vertex has at least k neighbors that each share at least one common attribute in λ with it. As shown in the case study (Figure 8) in the experiments, different selected attributes lead to different communities. To compute the attribute k-core, we can iteratively remove the vertices that violate the constraints. To find the critical attributes and characterize the main properties of the network, we propose and investigate the attribute k-core maximization problem. Specifically, given an attribute graph and a budget b, we aim to identify a set of b attributes that leads to the largest attribute k-core.

Example 1

Figure 1 shows a small network with 26 users and their corresponding attributes. Suppose k = 3 and b = 1. {u5, u6, ... , u9, u14, ... , u17} is the corresponding attribute 3-core induced by attribute a1. {u14, ... , u17, u18, ... , u21} is the corresponding attribute 3-core induced by attribute a2. {u18, u19, u20, u21} is the corresponding attribute 3-core induced by attribute a3. Hence, attribute a1 is the optimal result for budget 1.

Fig. 1

Motivating example (the table presents the attributes associated with each vertex)

In the literature, a lot of research has been conducted on attribute graphs for cohesive subgraph analysis, e.g., [4, 7, 9, 11]. However, most of these works focus on identifying the corresponding community for a given attribute query (e.g., [9, 11]) or finding the attribute associations in the graph (e.g., [7]). In addition, they usually require that each vertex contain the query attributes instead of enforcing the sharing of common attributes between neighbors. The attribute k-core maximization problem can find many applications. For video game design, a company can use the identified attributes as product features, especially for games that involve multi-player cooperation. Similarly, for party hosting or conference organization, the found attributes can serve as the party or conference theme, which has greater potential to attract more attendees and build a more harmonious communication atmosphere.

Challenges and Contributions

To the best of our knowledge, this is the first work to investigate the attribute k-core maximization problem. Although we can compute the attribute k-core in linear time, the attribute k-core maximization problem is NP-hard. Due to the large search space, in this paper, we employ a greedy strategy that iteratively selects the best attribute. However, the number of attributes in real-life networks is usually large, and in the greedy framework a major cost is computing the marginal gain of adding a new attribute. To scale to large networks, we propose a layer-based method and corresponding pruning strategies to reduce the exploration space. Furthermore, novel searching paradigms are developed to compute the marginal gain. Finally, we conduct experiments on 6 real networks to verify the advantages of the developed techniques.

Roadmap

The rest of the paper is organized as follows. In Section 2, we formally define the problem and prove its properties. The baseline method and optimized solution are presented in Sections 3 and 4, respectively. We report the experiment results in Section 5. Lastly, we review the related work in Section 6 and conclude the paper in Section 7.

2 Preliminaries

In this section, we first introduce some related concepts and formally define the investigated problem. Then, we prove the hardness of the problem and show its properties. Mathematical notations that are frequently used throughout this paper are summarized in Table 1.

Table 1 Summary of notations

2.1 Problem definition

We consider an undirected network G = (V, E, Λ) as an attribute graph, where V (resp. E) represents the set of vertices (resp. edges) in G and Λ = {a1, a2, ... , at} represents the set of all attributes in G. Each vertex u ∈ V is associated with a set of attributes, denoted by \({{\varLambda }}(u) \subseteq {{\varLambda }}\). n = |V| and m = |E| represent the number of vertices and edges in G, respectively. Given an attribute subset \(\lambda \subseteq {{\varLambda }}\), we use V(λ) to denote the set of vertices that contain at least one attribute in λ, i.e., ∀v ∈ V(λ), \({{\varLambda }}(v) \cap \lambda \neq \varnothing \). Given an attribute graph G, a subgraph S = (VS, ES, ΛS) is an induced subgraph of G if \(V_{S} \subseteq V\), ES = E ∩ (VS × VS) and \({{\varLambda }}_{S} \subseteq {{\varLambda }}\). Given a vertex u, we use N(u, S) to denote the set of its neighbors in S, and d(u, S) = |N(u, S)| to denote its degree in S.

Definition 1 (k-core)

Given a graph G and a positive integer k, an induced subgraph \(S \subseteq G\) is the k-core of G, denoted by Ck(G), if it satisfies the following constraints: i) d(u, S) ≥ k for each vertex u ∈ VS; ii) S is maximal, i.e., any supergraph \(S^{\prime } \supset S\) is not a k-core.

To compute the k-core, we can iteratively delete the vertices that violate the degree constraint, whose time complexity is \(\mathcal {O}(m)\) [1]. Based on the k-core model, we can measure the cohesiveness of a community. However, in real applications such as social networks, users with common hobbies are usually closer to each other than others, because it is easier for them to build contacts through the shared interests.
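To make the peeling procedure concrete, the following is a minimal C++ sketch of k-core computation, assuming an adjacency-list representation; the names (kCore, adj) are illustrative choices rather than the authors' implementation.

```cpp
// A minimal sketch of k-core computation by iterative peeling.
#include <vector>
#include <queue>

// Returns a flag per vertex: true if the vertex survives in the k-core.
std::vector<bool> kCore(const std::vector<std::vector<int>>& adj, int k) {
    int n = adj.size();
    std::vector<int> deg(n);
    std::vector<bool> alive(n, true);
    std::queue<int> q;                       // vertices violating the degree constraint
    for (int u = 0; u < n; ++u) {
        deg[u] = adj[u].size();
        if (deg[u] < k) q.push(u);
    }
    while (!q.empty()) {                     // peel until every survivor has degree >= k
        int u = q.front(); q.pop();
        if (!alive[u]) continue;
        alive[u] = false;
        for (int v : adj[u])
            if (alive[v] && --deg[v] == k - 1) q.push(v);  // v just dropped below k
    }
    return alive;
}
```

Each vertex is deleted at most once and each edge is scanned a constant number of times, which matches the \(\mathcal {O}(m)\) bound.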

To better model the communities in attribute graphs, we introduce the concept of resonant degree, and then present the formal definition of attribute k-core based on it.

Definition 2 (resonant degree)

Given a graph G and a subset of attributes \(\lambda \subseteq {{\varLambda }}\), let S be the subgraph induced by V(λ). For u ∈ VS, we use \(\widehat {N}(u,S,\lambda )\) to denote the resonant neighbor set of u in S depending on λ, such that each resonant neighbor of u shares at least one common attribute in λ with u, i.e., \(\forall v \in \widehat {N}(u,S,\lambda )\) satisfies \({{\varLambda }}(u) \cap {{\varLambda }}(v) \cap \lambda \neq \varnothing \). The resonant degree of u in S equals \(|\widehat {N}(u,S,\lambda )|\), denoted by \(\widehat {d}(u,S,\lambda )\).

Definition 3 (attribute k-core)

Given an attribute graph G and a subset of attributes \(\lambda \subseteq {{\varLambda }}\), an induced subgraph S is the attribute k-core of G depending on λ, denoted by \(\widehat {C}_{k}(G,\lambda )\), if it satisfies i) each vertex u ∈ VS has at least k resonant neighbors, i.e., \(\widehat {d}(u,S,\lambda ) \geq k\); and ii) S is maximal, i.e., any superset of it is not an attribute k-core.

For simplicity, when the context is clear, we use \(\widehat {N}(u,\lambda )\), \(\widehat {d}(u,\lambda )\) and \(\widehat {C}_{k}(\lambda )\) instead of \(\widehat {N}(u,S,\lambda )\), \(\widehat {d}(u,S,\lambda )\) and \(\widehat {C}_{k}(G,\lambda )\), respectively. In addition, we use degree and resonant degree interchangeably. We use \(|\widehat {C}_{k}(\lambda )|\) to denote the number of vertices in \(\widehat {C}_{k}(\lambda )\).
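Following Definitions 2 and 3, the attribute k-core can be computed by the same peeling idea applied to resonant degrees. Below is a hedged C++ sketch; the AttrGraph layout and the helper names are our own illustrative choices, not the paper's implementation.

```cpp
// A sketch of attribute k-core computation via peeling on resonant degrees.
#include <vector>
#include <queue>
#include <set>

struct AttrGraph {
    std::vector<std::vector<int>> adj;   // adjacency lists
    std::vector<std::set<int>> attrs;    // Lambda(u): attribute ids of each vertex
};

// True if u and v share at least one common attribute inside lambda.
bool resonant(const AttrGraph& g, int u, int v, const std::set<int>& lambda) {
    for (int a : lambda)
        if (g.attrs[u].count(a) && g.attrs[v].count(a)) return true;
    return false;
}

std::vector<bool> attributeKCore(const AttrGraph& g, int k, const std::set<int>& lambda) {
    int n = g.adj.size();
    std::vector<int> rdeg(n, 0);
    std::vector<bool> alive(n, false);
    for (int u = 0; u < n; ++u)              // start from V(lambda)
        for (int a : lambda)
            if (g.attrs[u].count(a)) { alive[u] = true; break; }
    std::queue<int> q;
    for (int u = 0; u < n; ++u) {
        if (!alive[u]) continue;
        for (int v : g.adj[u])
            if (alive[v] && resonant(g, u, v, lambda)) ++rdeg[u];
        if (rdeg[u] < k) q.push(u);
    }
    while (!q.empty()) {                     // peel violators, as in the plain k-core
        int u = q.front(); q.pop();
        if (!alive[u]) continue;
        alive[u] = false;
        for (int v : g.adj[u])
            if (alive[v] && resonant(g, u, v, lambda) && --rdeg[v] == k - 1)
                q.push(v);
    }
    return alive;
}
```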

Obviously, different selected attribute sets can lead to different attribute k-cores. In this paper, in order to analyze the properties of the given attribute graph, we aim to find an attribute set λ to maximize the size of the corresponding attribute k-core.

Problem Statement

Given an attribute graph G = (V, E, Λ) and two positive integers k and b, the attribute k-core maximization problem aims to identify a set λ of b attributes from Λ, such that the size of the corresponding attribute k-core is maximized, i.e.,

$$ \lambda^{*} = \arg\max_{\lambda \subseteq {{\varLambda}} \wedge |\lambda| = b} |\widehat{C}_{k}(\lambda)| $$

2.2 Problem properties

According to Theorems 1 and 2, the investigated problem is NP-hard, and the objective function is monotone but not submodular.

Theorem 1

Given an attribute graph G, the attribute k-core maximization problem is NP-hard for any k.

Proof

When k ≥ 1, we reduce the maximum coverage problem [6] to the attribute k-core maximization problem. In the maximum coverage problem, we are given several sets, which may share common elements, and a number b; we must select at most b sets such that the number of covered elements, i.e., the size of the union of the selected sets, is maximized. We consider an arbitrary instance of the maximum coverage problem with s sets T1, T2, …, Ts and t elements \(\{e_{1}, e_{2}, \dots , e_{t}\}={\bigcup }_{1 \leq i \leq s}T_{i}\). Then we construct a corresponding instance of the attribute k-core maximization problem in an attribute graph G as follows.

In this attribute graph G, the attribute set is A with s attributes \(\{a_{1}, a_{2}, \dots , a_{s}\}\), where each attribute ai corresponds to the set Ti for 1 ≤ i ≤ s. The set of vertices in G consists of two parts: M and P. M consists of k + 1 vertices in which every pair of vertices is adjacent, and the associated attribute set of every vertex in M is A. P consists of t parts P1, P2, …, Pt, where each part Pj (1 ≤ j ≤ t) corresponds to the element ej and consists of k vertices pj,1, pj,2, …, pj,k. For each Pj (1 ≤ j ≤ t), we add an edge between each pair of vertices such that each Pj is a k-clique. At this stage, the degree of each vertex in P is k − 1. Next, we add an edge from each vertex in P to one vertex in M to make sure the degree of each vertex in P is exactly k. Then, we add attributes for each vertex in P. If Ti contains ej, we add attribute ai to the associated attribute set of each vertex in Pj. Figure 2 is an example of the attribute graph G with k = 3 constructed from 4 sets and 4 elements.

Fig. 2

Example of NP-hard proof

With the construction, we ensure that (i) no vertex in M will be deleted, because the number of resonant neighbors of each vertex in M is no less than k no matter which attribute set we choose from A; (ii) the degree of every vertex in P is exactly k; (iii) Pj will be deleted unless we select a corresponding attribute; and (iv) all Pj have the same size for 1 ≤ j ≤ t. By doing this, the optimal solution of the attribute k-core maximization problem corresponds to the optimal solution of the maximum coverage problem. Since the maximum coverage problem is NP-hard, the attribute k-core maximization problem is NP-hard for any k ≥ 1. □

Theorem 2

The objective function \(f(\lambda )=|\widehat {C}_{k}(\lambda )|\) is monotonic but not submodular.

Proof

Monotonic. Suppose there are two attribute sets \(\lambda _{0} \subset \lambda _{0}^{\prime }\). The attribute k-core \(\widehat {C}_{k}(\lambda _{0})\) must be contained in \(\widehat {C}_{k}(\lambda _{0}^{\prime })\), because adding any attribute in \(\lambda _{0}^{\prime } \setminus \lambda _{0}\) cannot decrease the resonant degree of any vertex \(u \in \widehat {C}_{k}(\lambda _{0})\). Thus, \(f(\lambda _{0}) \leq f(\lambda _{0}^{\prime })\) and f is monotonic.

Non-submodular. The function f is submodular if f(λ1 ∪ λ2) + f(λ1 ∩ λ2) ≤ f(λ1) + f(λ2) for any attribute subsets. We prove the theorem by constructing a counterexample. As shown in Figure 1, for k = 3, suppose λ1 = {a1} and λ2 = {a2}. We have f(λ1) = 9, f(λ2) = 8, f(λ1 ∪ λ2) = 13 and f(λ1 ∩ λ2) = 0. The inequality does not hold. Thus, function f is non-submodular. □

[Algorithm 1 pseudocode]

3 Baseline algorithm

A naive solution for the problem is to enumerate all the possible attribute sets of size b and return the optimal result. The time complexity is \(\mathcal {O}(\binom {|{{\varLambda }}|}{b}m)\), which is not affordable for real-life networks. Considering the NP-hardness of the problem, we resort to a greedy heuristic that iteratively chooses the best attribute. Besides, by the monotonicity proved in Theorem 2, the size of the attribute k-core \(|\widehat {C}_{k}(\lambda )|\) grows with |λ|, which means \(\widehat {C}_{k}(\lambda )\) can only expand as new attributes are added into λ. Details are shown in the following theorem.

Theorem 3

Suppose λ and \(\lambda ^{\prime }\) are two attribute sets with \(\lambda \subseteq \lambda ^{\prime }\). If vertex u belongs to \(\widehat {C}_{k}(\lambda )\), u must be in \(\widehat {C}_{k}(\lambda ^{\prime })\).

Proof

We have \(\lambda \subseteq \lambda ^{\prime }\). Therefore, \(\widehat {C}_{k}(\lambda )\) must be contained in \(\widehat {C}_{k}(\lambda ^{\prime })\) according to Theorem 2. □

The baseline greedy method is shown in Algorithm 1. We first compute the k-core of G in Line 1, since any attribute k-core must be inside the k-core of G. Then, we initialize an empty set λ to store all the selected attributes (Line 2). In each iteration, we compute the corresponding attribute k-core and choose the best attribute whose addition enlarges the size of the attribute k-core most (Lines 4-5). The algorithm terminates when b attributes are selected. The time complexity is \(\mathcal {O}(bm|{{\varLambda }}|)\), which significantly reduces the computation cost compared with the naive approach.
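A sketch of Algorithm 1 on top of the attributeKCore routine above is given below; numAttrs and the set-based candidate enumeration are illustrative assumptions.

```cpp
// A sketch of the baseline greedy selection (Algorithm 1).
#include <algorithm>
#include <set>
#include <vector>

std::set<int> greedyBaseline(const AttrGraph& g, int k, int b, int numAttrs) {
    std::set<int> lambda;                          // selected attributes
    for (int round = 0; round < b; ++round) {
        int best = -1;
        long bestSize = -1;
        for (int a = 0; a < numAttrs; ++a) {       // try every unselected attribute
            if (lambda.count(a)) continue;
            std::set<int> cand = lambda;
            cand.insert(a);
            auto core = attributeKCore(g, k, cand);
            long sz = std::count(core.begin(), core.end(), true);
            if (sz > bestSize) { bestSize = sz; best = a; }
        }
        if (best < 0) break;
        lambda.insert(best);                       // keep the attribute with max gain
    }
    return lambda;
}
```

Each round recomputes the attribute k-core once per candidate attribute, which yields the \(\mathcal {O}(bm|{{\varLambda }}|)\) bound and motivates the incremental techniques of Section 4.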

4 Optimized solution

As shown in the experiment results, the baseline greedy method can greatly accelerate the search with competitive results. However, it still cannot scale to large networks. In this section, we first present a novel core-spread searching framework and then introduce some pruning techniques to reduce the candidate space.

4.1 Core-spread framework

As observed in the greedy framework, a major cost is to compute the marginal gain of adding an attribute to the currently selected attribute set. Therefore, we introduce the core-spread framework to facilitate the computation. Before presenting the details of the framework, we first apply the example in Figure 3 to clarify some notations and concepts. Suppose the currently selected attribute set is λ0 and the newly added attribute is a.

Fig. 3

Example of adding attribute a into λ0

The left (resp. right) dotted ellipse is V(λ0) (resp. V({a})), and the left (resp. right) solid ellipse represents \(\widehat {C}_{k}(\lambda _{0})\) (resp. \(\widehat {C}_{k}(\{a\})\)). Note that we use \(\mathcal {U}\mathcal {V} = V(\lambda _{0}) \cup V(\{a\})\) (resp. \(\mathcal {I}\mathcal {V} = V(\lambda _{0}) \cap V(\{a\})\)) to represent the union (resp. intersection) of V(λ0) and V({a}). Similarly, we use \(\mathcal {U}\mathcal {C} = \widehat {C}_{k}(\lambda _{0}) \cup \widehat {C}_{k}(\{a\})\) (resp. \(\mathcal {I}\mathcal {C} = \widehat {C}_{k}(\lambda _{0}) \cap \widehat {C}_{k}(\{a\})\)) to represent the union (resp. intersection) of \(\widehat {C}_{k}(\lambda _{0})\) and \(\widehat {C}_{k}(\{a\})\). When adding a new attribute a to λ0, vertices in \(\mathcal {I}\mathcal {V}\) may have their resonant degrees changed and may even affect other vertices, such that some vertices will join the updated attribute k-core, i.e., \(\widehat {C}_{k}(\lambda _{0} \cup \{a\})\). Hence, if we can efficiently calculate the updated degrees for promising vertices instead of computing the attribute k-core from scratch, we can save a lot of computation.

In our core-spread framework, each vertex in \(\mathcal {U}\mathcal {V}\) has three statuses. A vertex is explored if it has been checked. If the vertex satisfies the resonant degree constraint, its status is survived; otherwise it is marked as deleted. A deleted vertex will not be involved in the enlarged attribute k-core, while a currently survived vertex may still be deleted later after further verification. In this paper, we derive the upper bound d+(u) of u's degree to quickly filter unpromising vertices. Specifically, d+(u) is the number of vertices in the union of u's survived neighbors, u's unexplored neighbors and all of u's neighbors in \(\mathcal {U}\mathcal {C}\). We remove u if d+(u) < k. In the framework, we traverse from all vertices of \(\mathcal {I}\mathcal {V}\), which are called source vertices. With the addition of a new attribute, some source vertices will be preserved in the updated attribute k-core, which further helps some originally unqualified vertices remain in the final attribute k-core.

[Algorithm 2 pseudocode]

The core-spread framework

Algorithm 2 illustrates the pseudocode of the framework.

At first, for each vertex \(u \in \mathcal {U}\mathcal {V}\), the explored value e(u) and the survived value s(u) are both initialized as 0 (Lines 1-2). For each vertex u in \(\mathcal {U}\mathcal {C}\), we initialize its explored value and survived value as 1 and its upper bound degree as \(+\infty \) in Lines 3-4, because the vertices in \(\mathcal {U}\mathcal {C}\) must exist in the corresponding attribute k-core. Then, we put all vertices in \(\mathcal {I}\mathcal {V}\), i.e., the source vertices, into a queue \({\mathscr{H}}\) (Line 5), and we visit the vertices in \({\mathscr{H}}\) iteratively. For each processed vertex u, we set e(u) as 1 and compute its upper bound degree in Lines 8-9. If d+(u) ≥ k, its survived value is set as 1 (Lines 10-11) and its unexplored neighbors are pushed into \({\mathscr{H}}\) (Lines 12-14). Otherwise, u does not survive due to the violation of the degree constraint, and we invoke the Shrink algorithm to update the upper bound degrees of u's neighbors in Lines 15-17. Finally, we return all the survived vertices as the result in Line 18.
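The following C++ sketch condenses the traversal loop of Algorithm 2. It assumes the vertex sets \(\mathcal {U}\mathcal {C}\) and \(\mathcal {I}\mathcal {V}\) are precomputed and reuses AttrGraph and resonant from the earlier sketch; for clarity, it recomputes d+(u) on demand instead of maintaining decremented counters as the pseudocode does.

```cpp
// A condensed sketch of the core-spread traversal (Algorithm 2).
#include <deque>
#include <queue>
#include <set>
#include <vector>

struct SpreadState {
    std::vector<char> explored, survived, inUC;  // e(u), s(u), membership in UC
};

void shrink(const AttrGraph& g, int u, int k, SpreadState& st,
            const std::set<int>& lambda);        // Algorithm 3, sketched below

// d+(u): survived neighbors + unexplored neighbors + neighbors in UC.
int upperDegree(const AttrGraph& g, int u, const SpreadState& st,
                const std::set<int>& lambda) {
    int d = 0;
    for (int v : g.adj[u])
        if (resonant(g, u, v, lambda) &&
            (st.inUC[v] || st.survived[v] || !st.explored[v])) ++d;
    return d;
}

// st must be pre-initialized: e(u) = s(u) = 1 for UC vertices, 0 elsewhere.
std::vector<char> coreSpread(const AttrGraph& g, int k, const std::set<int>& lambda,
                             const std::vector<int>& IV, SpreadState st) {
    std::queue<int> H{std::deque<int>(IV.begin(), IV.end())};  // source vertices
    while (!H.empty()) {
        int u = H.front(); H.pop();
        if (st.explored[u]) continue;
        st.explored[u] = 1;
        if (upperDegree(g, u, st, lambda) >= k) {
            st.survived[u] = 1;
            for (int v : g.adj[u])               // spread to unexplored neighbors
                if (!st.explored[v] && resonant(g, u, v, lambda)) H.push(v);
        } else {
            shrink(g, u, k, st, lambda);         // cascade the deletion
        }
    }
    return st.survived;
}
```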

Shrink procedure

Algorithm 3 shows the details of the Shrink process for a checked vertex u, which aims to update the upper bound degrees of u's neighbors considering the removal of u. We first initialize a vertex set \(\mathcal {T}\) as \(\varnothing \) in Line 1. Then we update the upper bound degrees of u's survived neighbors and check their new values in Lines 2-5. Specifically, the upper bound degree of each such neighbor is reduced by 1 (Line 3). We put the vertices that violate the upper bound degree constraint into \(\mathcal {T}\) (Lines 4-5). For each vertex in \(\mathcal {T}\), we set its survived value as 0 and process it by recursively invoking the Shrink process (Lines 6-8).
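Completing the sketch above, a possible realization of the Shrink cascade is shown next; again, the degree bound is recomputed rather than decremented, trading the O(1) update of the pseudocode for brevity.

```cpp
// Shrink (Algorithm 3): after u is deleted, recheck its survived neighbors
// and recursively delete those whose bound drops below k.
void shrink(const AttrGraph& g, int u, int k, SpreadState& st,
            const std::set<int>& lambda) {
    st.survived[u] = 0;                          // u is deleted
    for (int v : g.adj[u]) {
        if (st.inUC[v] || !st.survived[v]) continue;
        if (resonant(g, u, v, lambda) && upperDegree(g, v, st, lambda) < k)
            shrink(g, v, k, st, lambda);         // v can no longer survive
    }
}
```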

[Algorithm 3 pseudocode]

Discussion

Based on the core-spread framework, we can speed up the computation of the marginal gain when adding a new attribute. However, it still has some drawbacks: i) the upper bound is not tight, and ii) the searching cost is large considering the large number of attributes. Therefore, in the following sections, an optimized upper bound and novel search paradigms are developed to further accelerate the processing.

4.2 Layer-based optimization

[Algorithm 4 pseudocode]

In the core-spread framework, we leverage the degree upper bound to filter the unpromising vertices. Apparently, a tighter and efficiently computable upper bound can accelerate the processing. In this subsection, we employ a layer structure to speed up the computation and derivation of the bound. Given a set λ of attributes, the layer structure Lλ organizes the vertices in V(λ) following the core decomposition procedure. That is, we iteratively peel the vertices that violate the attribute k-core constraint and store them in the same layer. The remaining vertices are processed in the same manner until all the vertices are maintained in the corresponding layers.

Layer construction

Different attribute sets lead to different vertex sets. We construct the layer structure for each candidate attribute ai and the currently selected attribute set λ0. The details of layer construction are shown in Algorithm 4. We initialize the layer number i = 1 and store all the vertices associated with attribute set λ into \(\mathcal {N}\) (Line 1). We process the vertices in \(\mathcal {N}\) iteratively. In each iteration, we store all the vertices that currently violate the degree constraint in the same layer \(L_{\lambda }^{i}\) (Line 3). Then, we remove \(L_{\lambda }^{i}\) from \(\mathcal {N}\). The iteration terminates when all the vertices in \(\mathcal {N}\) satisfy the degree constraint. The subgraph induced by the remaining vertices in \(\mathcal {N}\) is the corresponding attribute k-core \(\widehat {C}_{k}(\lambda )\), and we store these vertices in layer \(L_{\lambda }^{+\infty }\) (Line 6). Eventually, we return Lλ and \(\mathcal {N}\) in Line 7.

For a given vertex u, we use lλ(u) to denote its layer number in Lλ. The layer construction phase follows the procedure of k-core decomposition, whose time complexity is \(\mathcal {O}(m)\). Note that we need to construct the layer structure for each attribute, which is done in the first iteration of the greedy framework.
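A straightforward C++ sketch of Algorithm 4 is given below. It recomputes resonant degrees in every round for readability, so it is less efficient than the decremental \(\mathcal {O}(m)\) implementation described above; AttrGraph and resonant are reused from the earlier sketches.

```cpp
// A sketch of layer construction (Algorithm 4): repeatedly peel all current
// violators into the same layer; the survivors get layer "+infinity".
#include <climits>
#include <set>
#include <vector>

const int INF_LAYER = INT_MAX;   // stands for the +infinity layer

std::vector<int> buildLayers(const AttrGraph& g, int k, const std::set<int>& lambda) {
    int n = g.adj.size();
    std::vector<int> layer(n, 0);                // 0 = vertex not in V(lambda)
    std::vector<char> active(n, 0);
    for (int u = 0; u < n; ++u)
        for (int a : lambda)
            if (g.attrs[u].count(a)) { active[u] = 1; break; }
    for (int i = 1; ; ++i) {
        std::vector<int> peel;
        for (int u = 0; u < n; ++u) {
            if (!active[u]) continue;
            int rdeg = 0;
            for (int v : g.adj[u])
                if (active[v] && resonant(g, u, v, lambda)) ++rdeg;
            if (rdeg < k) peel.push_back(u);
        }
        if (peel.empty()) break;                 // all remaining vertices satisfy k
        for (int u : peel) { active[u] = 0; layer[u] = i; }
    }
    for (int u = 0; u < n; ++u)
        if (active[u]) layer[u] = INF_LAYER;     // the attribute k-core itself
    return layer;
}
```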

The following is an example of layer construction.

Example 2

Reconsider the graph in Figure 1. Suppose the currently selected attribute set is λ0 = {a1} and the candidate attribute is a = a2. Then, we have \(\widehat {C}_{3}(\lambda _{0}) =\) {v5,⋯ , v9, v14,⋯ , v17} and \(\widehat {C}_{3}(\{a\}) = \{v_{14}, \cdots , v_{21}\}\). The constructed layer structures for λ0 and a are shown in Figure 4. Note that, for the last layer (i.e., the \(+\infty \) layer), we only draw part of the vertices. Initially, for the vertices in V(λ0), the resonant degrees of v1, v2, v12, v13, v18 are less than 3. Thus, we put them in \(L_{\lambda _{0}}^{1}\) and remove them from V(λ0). Then, the resonant degrees of v3, v4, v10 become smaller than 3, and we have \(L_{\lambda _{0}}^{2} = \{v_{3}, v_{4}, v_{10}\}\). Moreover, we put v11 into the third layer, i.e., \(l_{\lambda _{0}}(v_{11})=3\). After deleting v11 from V(λ0), the remaining vertices are exactly those of \(\widehat {C}_{3}(\lambda _{0})\), and we put them into the \(+\infty \) layer. Similarly, we build the layer structure La for attribute a, which is shown in the right part of Figure 4. For all vertices in V({a}), the same method yields \({L_{a}^{1}} = \{v_{9}, \cdots , v_{12}, v_{25}\}\), \({L_{a}^{2}} = \{v_{13}, v_{22}, v_{24}, v_{26}\}\) and \({L_{a}^{3}} = \{v_{23}\}\).

Fig. 4

Example of the layer structure

Given the currently selected attribute set λ0 and the newly added attribute a, it is easy to see that many vertices can co-exist in both layer structures, i.e., \(L_{\lambda _{0}}\) and La. According to the layer construction procedure, the vertices in lower layers can provide degree support for the vertices in higher layers, and may even make them join the corresponding attribute k-core. Therefore, during the traversal, we need to consider the impact of the neighbors of the currently processed vertex with higher layer values. Furthermore, the vertices in lower layers cannot simply be ignored. Reconsider the example in Figure 4. For the two vertices v10 and v13, we have \(l_{\lambda _{0}}(v_{10}) > l_{\lambda _{0}}(v_{13})\) in \(L_{\lambda _{0}}\), but la(v10) < la(v13) in La. Therefore, they can affect each other. The details can be explained by the concept and theorem of active-path, which are shown as follows.

Definition 4 (active-path)

Given an attribute set λ0 and an attribute a, we say there is an active-path from a source vertex u to another vertex v if every two consecutive vertices x and y (which must be neighbors) along this path satisfy la(y) > la(x) or \(l_{\lambda _{0}}(y) > l_{\lambda _{0}}(x)\).

Theorem 4

For each vertex \(u \in \mathcal {U}\mathcal {V} \backslash \mathcal {U}\mathcal {C} \backslash \mathcal {I}\mathcal {V}\), it may join the updated attribute k-core only if there exists a vertex v in \(\mathcal {I}\mathcal {V}\) that can form an active-path to u (i.e., \(v \rightsquigarrow u\)).

Proof

Suppose the currently selected attribute set is λ0 and the added attribute is a. The vertices in \(\mathcal {U}\mathcal {V} \backslash \mathcal {U}\mathcal {C} \backslash \mathcal {I}\mathcal {V}\) can be divided into two partitions, \(V(\{a\}) \setminus V(\lambda _{0}) \setminus \mathcal {U}\mathcal {C}\) and \(V(\lambda _{0}) \setminus V(\{a\}) \setminus \mathcal {U}\mathcal {C}\). We first prove the claim for each vertex \(u \in V(\{a\}) \setminus V(\lambda _{0}) \setminus \mathcal {U}\mathcal {C}\). All the neighbors of u can be divided into two sets N1 and N2, where the layer value of each vertex in N1 is lower than la(u) and the layer value of each vertex in N2 is no smaller than la(u). Clearly, the size of N2 is less than k. If no vertex in \(\mathcal {I}\mathcal {V}\) can form an active-path to u, all the vertices in N1 have already been deleted when u is processed in the computation of \(\widehat {C}_{k}(\lambda _{0} \cup \{a\})\). Therefore, u will be deleted. When \(u \in V(\lambda _{0}) \setminus V(\{a\}) \setminus \mathcal {U}\mathcal {C}\), the proof is similar. Thus, the theorem holds. □

Recall that in the core-spread framework, we conduct the exploration from \(\mathcal {I}\mathcal {V}\) and traverse through the neighbors. To further reduce the visiting space, we partition it into three parts. Reconsider the example in Figure 3. For ease of illustration, we use \({\mathscr{L}}\) (resp. \(\mathcal {R}\)) to represent the region of \(V(\lambda _{0}) \cap \mathcal {U}\mathcal {C} \setminus \widehat {C}_{k}(\{a\})\) (resp. \(V(\{a\}) \cap \mathcal {U}\mathcal {C} \setminus \widehat {C}_{k}(\lambda _{0})\)). The partition details and the corresponding theorems are shown as follows.

  • Partition 1: The vertices in \(\mathcal {I}\mathcal {C}\);

  • Partition 2: The vertices in \(\mathcal {R} \cup {\mathscr{L}}\);

  • Partition 3: The vertices in \(\mathcal {I}\mathcal {V} \setminus (\mathcal {I}\mathcal {C} \cup \mathcal {R} \cup {\mathscr{L}})\).

Theorem 5

In the core-spread framework, we do not need to traverse from the vertices in \(\mathcal {I}\mathcal {C}\).

Proof

According to Algorithm 4, since the vertices in \(\mathcal {I}\mathcal {C}\) are in both \(\widehat {C}_{k}(\lambda _{0})\) and \(\widehat {C}_{k}(\{a\})\), for each \(u \in \mathcal {I}\mathcal {C}\), the values of \(l_{\lambda _{0}}(u)\) and la(u) are both \(+\infty \). Thus, based on Definition 4, no active-path can start from a vertex in \(\mathcal {I}\mathcal {C}\). Therefore, we do not need to traverse from the vertices in \(\mathcal {I}\mathcal {C}\). □

Theorem 6

In the core-spread framework, for each vertex u in \({\mathscr{L}}\), we only need to continue the traversal to its neighbors v in V({a}) with higher layer values, i.e., la(v) > la(u); for each vertex u in \(\mathcal {R}\), we only need to continue the traversal to its neighbors v in V(λ0) with higher layer values, i.e., \(l_{\lambda _{0}}(v) > l_{\lambda _{0}}(u)\).

Proof

The argument is the same as in the proof of Theorem 5. According to Algorithm 4, since the vertices in \({\mathscr{L}}\) are in \(\widehat {C}_{k}(\lambda _{0})\), the layer value of each vertex \(u \in {\mathscr{L}}\) is set as \(+\infty \), i.e., \(l_{\lambda _{0}}(u) = +\infty \). Hence, the layer values of u's neighbors in V(λ0) cannot be larger than that of u, which means there is no active-path from u to those neighbors. Besides, a neighbor v of u in V({a}) with no larger layer value, i.e., la(v) ≤ la(u), cannot extend an active-path from u to v either. Thus, we only need to continue the traversal to u's neighbors v in V({a}) with higher layer values, i.e., la(v) > la(u). The proof for the vertices in \(\mathcal {R}\) is similar to that for \({\mathscr{L}}\). □

Based on the above analysis, we can leverage the theorems to first mark the vertices that may contribute to the updated attribute k-core. Generally, we mark the vertices along the active-paths, and the unmarked vertices definitely cannot exist in the attribute k-core (i.e., Algorithm 5). Then, we compute the new degree upper bound of vertices based on the layer structure (i.e., Algorithm 6). The details of the two algorithms are shown as follows.

[Algorithm 5 pseudocode]

Mark Algorithm

In the Mark algorithm, we use m(u) to store u's mark value and return the number of marked vertices. It starts by initializing \({\mathscr{M}}\) with \(\mathcal {I}\mathcal {V} \backslash \mathcal {U}\mathcal {C}\) and markSize with the number of vertices in \(\mathcal {U}\mathcal {C}\) in Lines 1-2. We set m(u) = 0 for each vertex \(u \in \mathcal {U}\mathcal {V}\) (Lines 3-4) and m(u) = 1 for each vertex \(u \in \mathcal {U}\mathcal {C}\) (Lines 5-6). Then, we visit each vertex in \({\mathscr{L}}\) and check its neighbors in V({a}) (Lines 7-10). According to Theorem 6, we do not need to start the traversal from neighbors in V(λ0). For each processed vertex u, if a resonant neighbor of u is not marked, we compare the layer values of u and this neighbor. Note that we use la(u) to denote the layer value of u based on the attribute set {a}. Specifically, if u and its neighbor v satisfy la(v) > la(u), which means v's layer value is higher than u's, we enlarge \({\mathscr{M}}\) by pushing v and set m(v) = 1 (Lines 9-10). The same procedure as Lines 7-10 is conducted by replacing \({\mathscr{L}}\) with \(\mathcal {R}\) and {a} with λ0. We iteratively process the vertices in \({\mathscr{M}}\) in Lines 12-19.

For each processed vertex u, we visit its unmarked neighbors. If a neighbor is in V(λ0) and has a larger layer value than u in \(L_{\lambda _{0}}\), we push this neighbor into \({\mathscr{M}}\) and mark it. Similarly, we push a neighbor of u if it is in V({a}) and has a larger layer value than u in La. Finally, we return the result in Line 20.
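A condensed sketch of the Mark traversal is shown below. For brevity, the one-sided boundary seeding from \({\mathscr{L}}\) and \(\mathcal {R}\) (Lines 7-11) is folded into the sources argument, and the resonant-neighbor check is omitted; layer value 0 denotes a vertex outside the respective vertex set.

```cpp
// A sketch of the Mark algorithm (Algorithm 5): spread marks along
// active-paths, i.e., only toward strictly higher layer values.
#include <queue>
#include <vector>

long markVertices(const AttrGraph& g,
                  const std::vector<int>& layerA,   // layers w.r.t. {a}
                  const std::vector<int>& layerL0,  // layers w.r.t. lambda_0
                  const std::vector<int>& sources,  // IV \ UC plus boundary seeds
                  const std::vector<char>& inUC,
                  std::vector<char>& marked) {
    long markSize = 0;
    std::queue<int> M;
    for (int u = 0; u < (int)marked.size(); ++u)
        if (inUC[u]) { marked[u] = 1; ++markSize; }   // UC vertices always survive
    for (int u : sources)
        if (!marked[u]) { marked[u] = 1; ++markSize; M.push(u); }
    while (!M.empty()) {
        int u = M.front(); M.pop();
        for (int v : g.adj[u]) {
            if (marked[v]) continue;
            // An active-path may only continue to a strictly higher layer.
            if (layerA[v] > layerA[u] || layerL0[v] > layerL0[u]) {
                marked[v] = 1; ++markSize; M.push(v);
            }
        }
    }
    return markSize;   // also serves as UB(lambda_0, a) in Section 4.4
}
```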

[Algorithm 6 pseudocode]

Upper Degree Algorithm

Based on the layer structure and the Mark algorithm, we present the new upper bound degree nd+(u) for a vertex u. nd+(u) is the number of vertices in the union of u's survived neighbors, u's neighbors in \(\mathcal {U}\mathcal {C}\), and u's unexplored but marked neighbors. The details of computing the new upper bound degree for a chosen vertex u are shown in Algorithm 6. We initialize nd+(u) with 0 in Line 1 and process all neighbors of vertex u in Lines 2-6. A vertex u will be removed if nd+(u) < k, which can be verified by the following theorem.
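A sketch of the refined bound, reusing SpreadState from the core-spread sketch, could look as follows; compared with d+(u), the only change is that an unexplored neighbor counts only if it is marked.

```cpp
// A sketch of the refined bound nd+(u) (Algorithm 6).
int newUpperDegree(const AttrGraph& g, int u, const SpreadState& st,
                   const std::vector<char>& marked, const std::set<int>& lambda) {
    int d = 0;
    for (int v : g.adj[u]) {
        if (!resonant(g, u, v, lambda)) continue;
        // survived neighbors, neighbors in UC, and unexplored-but-marked neighbors
        if (st.inUC[v] || st.survived[v] || (!st.explored[v] && marked[v])) ++d;
    }
    return d;
}
```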

Theorem 7

A vertex \(u \in \mathcal {U}\mathcal {V} \setminus \mathcal {U}\mathcal {C}\) cannot join the updated attribute k-core if nd+(u) < k.

Proof

The neighbors of u can be divided into two categories, marked neighbors and unmarked neighbors. According to the properties of Algorithm 5, unmarked neighbors will not be added to the updated attribute k-core. All marked neighbors may be counted into the upper bound degree of u except for the marked but deleted ones, and the deleted neighbors will definitely not be added to the updated attribute k-core. Thus, nd+(u) is a correct upper bound of u's degree. If nd+(u) < k, u cannot be included in the attribute k-core. □

[Algorithm 7 pseudocode]

4.3 Compute attribute core

Based on the new upper bound degree, Algorithm 7 incrementally computes the attribute k-core when adding a new attribute. We first initialize two heaps \({\mathscr{H}}_{a}\) and \({\mathscr{H}}_{\lambda _{0}}\) in Line 1, which maintain the unexplored vertices of V({a}) and V(λ0), respectively. Then we set the explored value e(u) and the survived value s(u) of each vertex \(u \in \mathcal {U}\mathcal {V}\) as 0 in Lines 2-3. The vertices in \(\mathcal {U}\mathcal {C}\) belong to the attribute k-core, so we set their explored and survived values as 1 and their upper bound degrees as \(+\infty \) (Lines 4-5). For each vertex \(u \in {\mathscr{L}}\), we process its neighbors in Lines 7-9. If a neighbor v is not in \({\mathscr{H}}_{a} \cup {\mathscr{H}}_{\lambda _{0}} \cup \mathcal {U}\mathcal {C}\) and has a larger layer value than u, we push it into \({\mathscr{H}}_{a}\) (Lines 8-9). We do the same process as Lines 6-9 by replacing \({\mathscr{L}}\) with \(\mathcal {R}\) and {a} with λ0. Then, we push all the vertices in \(\mathcal {I}\mathcal {V} \setminus (\mathcal {I}\mathcal {C} \cup \mathcal {R} \cup {\mathscr{L}})\) (i.e., partition 3 discussed in Section 4.2) into \({\mathscr{H}}_{a}\) for processing (Line 11). We iteratively process each vertex in \({\mathscr{H}}_{a} \cup {\mathscr{H}}_{\lambda _{0}}\) in Lines 12-31. If \({\mathscr{H}}_{a}\) is not empty, for each vertex \(u \in {\mathscr{H}}_{a}\), we first set its explored value e(u) as 1 and compute its upper bound degree by invoking Upper Degree(u) (Lines 14-16). u survives if its upper bound degree is no less than k (Lines 17-18). The change of u's state will affect its neighbors, so we process its unexplored neighbors iteratively in Lines 19-26. If \(u \in \mathcal {I}\mathcal {V}\) and la(v) > la(u), we push v into \({\mathscr{H}}_{a}\) (Lines 20-22). Similarly, if \(u \in \mathcal {U}\mathcal {V}\) and \(l_{\lambda _{0}}(v) > l_{\lambda _{0}}(u)\), we push v into \({\mathscr{H}}_{\lambda _{0}}\) (Lines 23-24). If u ∈ V({a}), we do the same process as Lines 21-22. u is deleted if nd+(u) violates the degree constraint, and we invoke Shrink(u). If \({\mathscr{H}}_{\lambda _{0}}\) is not empty, we do the same process as Lines 14-29 with a and λ0 swapped. We stop the iteration when there is no vertex left in \({\mathscr{H}}_{a} \cup {\mathscr{H}}_{\lambda _{0}}\). Then we invoke the Shrink algorithm to process all marked but unexplored vertices, because these vertices must be deleted. Finally, we return all survived vertices as the updated attribute k-core in Line 33.

[Algorithm 8 pseudocode]

4.4 AKC algorithm

Suppose the currently selected attribute set is λ0 and the newly added attribute is a. We use UB(λ0, a) to denote the upper bound of the size of the updated attribute k-core, which can be obtained by the Mark algorithm, i.e., markSize. Specifically, no active-path contains an unmarked vertex, so the unmarked vertices cannot be added to the attribute k-core. Then, we have Theorem 8, which can further filter some unpromising attributes in the current iteration. The correctness of the theorem is easy to verify; thus, we omit the proof here.

Theorem 8

In each iteration, if UB(λ0, a) is smaller than the current best result, attribute a can be filtered directly.

By integrating all the proposed techniques, we come up with the optimized algorithm AKC, whose details are shown in Algorithm 8. We obtain the k-core of G in Line 1 and initialize λ0 as \(\varnothing \) to store the current best attribute set in Line 2. In Lines 3-4, we compute the layer structure and the corresponding attribute k-core of each attribute by invoking Algorithm 4. We put the best attribute, i.e., the one with the largest attribute k-core, into λ0 (Line 5). In each iteration, we initialize δ with \(-\infty \) to denote the size of the updated attribute k-core. For each unselected attribute a ∈ Λ ∖ λ0, we compute the number of marked vertices by Algorithm 5 as the upper bound of the attribute k-core size (Lines 8-9), since the unmarked vertices cannot be added into the attribute k-core. According to Theorem 8, we skip the current attribute a if UB(λ0, a) < δ (Line 10). Otherwise, we compute the updated attribute k-core by Algorithm 7 (Line 11). If the size of the current attribute k-core is larger than δ, we update δ and the best attribute a (Lines 12-14). The new current best attribute set λ0 and the layer information are updated in Lines 15 and 16. The algorithm terminates once b attributes are selected.
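Putting the pieces together, the following high-level C++ sketch shows how the Mark bound prunes candidate attributes inside the greedy loop; upperBoundByMark and incrementalAttrCore are hypothetical wrappers around Algorithms 5 and 7, and the layer maintenance between rounds is elided.

```cpp
// A high-level sketch of AKC (Algorithm 8).
#include <algorithm>
#include <set>
#include <vector>

long upperBoundByMark(const AttrGraph& g, int k,
                      const std::set<int>& lambda0, int a);        // wraps Algorithm 5
std::vector<char> incrementalAttrCore(const AttrGraph& g, int k,
                                      const std::set<int>& lambda0, int a); // Algorithm 7

std::set<int> akc(const AttrGraph& g, int k, int b, int numAttrs) {
    std::set<int> lambda0;
    for (int round = 0; round < b; ++round) {
        long delta = -1;
        int best = -1;
        for (int a = 0; a < numAttrs; ++a) {
            if (lambda0.count(a)) continue;
            long ub = upperBoundByMark(g, k, lambda0, a);  // markSize
            if (ub <= delta) continue;                     // Theorem 8: filter a
            auto core = incrementalAttrCore(g, k, lambda0, a);
            long sz = std::count(core.begin(), core.end(), (char)1);
            if (sz > delta) { delta = sz; best = a; }
        }
        if (best < 0) break;
        lambda0.insert(best);      // commit the best attribute of this round
        // (layer structures are rebuilt/updated here for the next round)
    }
    return lambda0;
}
```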

5 Experiments

In this section, we evaluate the effectiveness and efficiency of our proposed techniques on 6 real-world networks.

5.1 Experiment setup

Algorithms

To the best of our knowledge, there is no existing work for our problem. For effectiveness, we implement three algorithms (i.e., Random, TopV-Cover and TopC-Cover) that select attribute sets in different ways for comparison with our greedy strategy. We also implement and evaluate four algorithms (Baseline, BL-S, BL-SA and AKC) to verify the efficiency of the proposed techniques. A brief description of the employed algorithms is as follows.

  • Exact: the exact algorithm that enumerates all the combinations of attributes and returns the optimal result.

  • Random: algorithm that randomly chooses b attributes from all attributes.

  • TopV-Cover: algorithm that selects the top-b frequent attributes.

  • TopC-Cover: algorithm that selects the top-b attributes with the largest corresponding attribute k-core size.

  • Baseline: baseline greedy method, i.e., Algorithm 1.

  • BL-S: greedy method which adopts the framework in Algorithm 2 to compute the attribute k-core.

  • BL-SA: integrates BL-S with Theorems 5 and 6.

  • AKC: algorithm that integrates all the developed techniques, i.e., Algorithm 8.

Datasets

We conduct the experiments on 3 real-world attribute graphs and 3 semi-synthetic datasets. Table 2 presents the statistics of the datasets, where Λavg is the average number of attributes per vertex and davg is the average degree of the vertices. DBLP, Flickr and Yelp are real-world attribute graphs. DBLP is an author collaboration network with paper information as keywords. For each author, we select the 30 most frequent keywords from the titles of their published papers as their attributes. Flickr is an online user relationship network with personal interests. For each user, we choose the 30 most frequent tags of their associated photos as their attributes. The Yelp dataset is extracted from the Yelp website, which consists of businesses, reviews, and user data. The attribute set of each user consists of the categories of the 30 restaurants most frequently visited by that user. The other 3 datasets, i.e., Brightkite, Gowalla and Youtube, are real-world networks downloaded from SNAP. Since the vertices in these networks have no attributes, we first generate 400 attributes and then randomly select 20-30 of them as the attribute set of each vertex.

Table 2 Statistics of datasets

Parameters and workloads

We conduct the experiments by varying the degree constraint k and the budget b. For parameter b, we vary it from 2 to 10 with b = 6 as the default value. Due to the different properties of the networks, for parameter k, we vary it from 5 to 25 with k = 15 as the default value for the three real-world datasets, and from 6 to 10 with k = 8 as the default value for the three semi-synthetic datasets. We evaluate the effectiveness and efficiency of the algorithms by reporting the size of the identified attribute k-core and the response time, respectively. For each setting, we run the algorithm 10 times and report the average value. All the programs are implemented in C++. The experiments are performed on a machine with an Intel i5-9600KF 3.7GHz CPU and 64 GB memory.

5.2 Effectiveness evaluation

To evaluate the effectiveness of the proposed methods, we report the size of the returned attribute k-core. First, we conduct the experiments by comparing AKC with the other 3 heuristic methods. Then, we report the results of the comparison with the exact solution. Finally, we present the case studies on the DBLP dataset.

Effectiveness evaluation by varying k and b

Comparing AKC with Random, TopC-Cover and TopV-Cover, Figures 5 and 6 report the returned attribute k-core size by varying k and b, respectively. For Random, we report the average result of 500 independent tests. As we can see, AKC always outperforms the other algorithms, and the two cover-based methods are better than Random. The result size increases when b grows, because more attributes are selected. The core size decreases when k becomes larger, since vertices need larger degrees to stay engaged. The result size of Random is always very small due to the cardinality and distribution of Λ; Random may select a lot of unpromising attributes and lead to a smaller attribute k-core. Although the two cover-based methods select many promising attributes, they do not consider the correlation among attributes. Thus, they still cannot perform as well as AKC.

Fig. 5

Effectiveness evaluation by varying k

Fig. 6

Effectiveness evaluation by varying b

Comparison with the exact solution

To further evaluate the effectiveness of the proposed greedy framework, we report the results of comparing AKC with Exact. The experiments are conducted on two real-world datasets, i.e., DBLP and Flickr, and the results are shown in Figure 7. Due to the high computation cost of the exact solution, we only report the results for b = 2. The number on each bar denotes the corresponding running time. As shown, AKC achieves almost the same results as Exact. In addition, AKC is much faster than Exact. For example, on the DBLP dataset with k = 25, AKC finishes in 0.81s, while the exact solution takes 63945s. Thus, the greedy framework can greatly accelerate the search with competitive results.

Fig. 7

Compare with the exact solution

Case Studies

To demonstrate the properties of the investigated model, we conduct case studies on the DBLP dataset, whose results are shown in Figures 8 and 9. In Figure 8, with k = 20, the whole graph itself is a k-core. The red part is the attribute k-core depending on the attribute set {language, structure, parallel, system, matrix}, while the blue part is the attribute k-core formed by the attribute set {graph, queue, network, complexity, search}. The number of red vertices is 25, and the number of blue vertices is 43. Obviously, when different sets of attributes are selected, the corresponding attribute k-cores are different and reveal different natures of the network. Thus, it is necessary to investigate the properties of the attribute k-core. Figure 9 shows, for b = 5, the corresponding attribute k-core returned by AKC when k = 20 and k = 30, respectively. In Figure 9a, the selected attribute set is {scheme, community, effective, technology, access}. In Figure 9b, the selected attribute set is {hoc, community, effective, technology, access}. For the convenience of viewing, we only draw the main component of each result. As observed, with the increase of k, the theme of the community also varies.

Fig. 8

Case Studies on DBLP with k = 20

Fig. 9

Case Studies on DBLP with b = 5

5.3 Efficiency evaluation

To evaluate the efficiency of the proposed techniques, in this section, we conduct the experiments by comparing AKC with Baseline, BL-S and BL-SA.

Efficiency evaluation by varying k and b

Figures 10 and 11 present the results by varying k and b, respectively. As we can see, AKC is much faster than the other three algorithms under all the settings. When increasing k, the response time decreases in most cases, since the k-core size decreases correspondingly. With the increase of b, the response time increases, because we need to perform more iterations to select sufficient attributes. As shown, BL-S performs better than Baseline, because BL-S only conducts the computation for the vertices related to the currently accessed attribute set. Although, compared to BL-S, BL-SA only restricts the traversal starting from the vertices in \(\mathcal {I}\mathcal {V}\), it can skip many unnecessary explorations. Thus, BL-SA is faster than BL-S. As observed, the more techniques are equipped, the faster the algorithm runs, which verifies the advantages of the developed techniques.

Fig. 10

Efficiency evaluation by varying k

Fig. 11

Efficiency evaluation by varying b

Scalability evaluation

In Figure 12, we evaluate the scalability of the proposed methods. Specifically, we generate four subgraphs by randomly sampling 20-100% of the graph and report the response time. Obviously, as the graph size grows, the response time increases. As observed, AKC is always faster than the others and preserves good scalability.

Fig. 12

Scalability evaluation on all datasets

6 Related work

Cohesive subgraph detection is a fundamental problem in graph analysis. Different cohesive subgraph models have been proposed in the literature, such as k-core [1], k-truss [19], clique [14], etc. As a popular model, the k-core is widely adopted in many applications, e.g., community detection [21], influential community search [8], information propagation [2], etc. The k-core model was first introduced by Seidman [12] for simple graphs. With the advance of online social networks, users are often associated with multiple attributes. In [20], Zhou et al. investigate graph clustering on attribute graphs by considering both structural and attribute information. In [10], the authors investigate graph modeling with correlated attributes. [7] explores the attribute associations in graphs. In [17], Yang et al. investigate the community detection problem in attribute graphs. In [3] and [5], given a set of query attributes and query vertices, the authors investigate the community search problem in attribute graphs by leveraging the k-core and k-truss, respectively. In [9], a new community search model is further developed. As observed, most of the research in the literature focuses on community search for a set of query attributes instead of identifying the critical attributes in the network. In addition, these models usually require each vertex to contain one or all of the query attributes instead of requiring neighbors to share common attributes. To the best of our knowledge, we are the first to investigate the attribute k-core maximization problem.

7 Conclusion

Identifying critical attributes is of great importance for attribute graph analysis. In this paper, we conduct the first research on the attribute k-core maximization problem. Given an attribute graph, it aims to retrieve a set of b attributes that leads to the largest attribute k-core. Due to the NP-hardness of the problem, a greedy framework is proposed. Layer-based filtering methods and searching paradigms are developed to scale to large networks. Finally, experiments on real-life networks are conducted to demonstrate the effectiveness and efficiency of the proposed model and techniques.