1 Introduction

In the traditional graph-based sense, roles represent node-level connectivity patterns such as star-centers, star-edge nodes, near-cliques, or nodes that act as bridges to different regions of the graph. Intuitively, two nodes belong to the same role if they are "similar" in the sense of graph structure. Our proposed research broadens the framework for defining, discovering, and learning network roles by drastically increasing how much of the information embedded within rich graphs is exploited.

Recently, role discovery has become increasingly important for a variety of application and problem domains [5, 6, 9, 15, 19, 28] including descriptive network modeling [30], classification [14], anomaly detection [30], and exploratory analysis [29]; see [28] for other applications. Despite the importance of role discovery, existing work has focused only on discovering node roles (e.g., see [5, 7, 11, 23]). We posit that discovering the roles of edges may be fundamentally more important, and better able to capture, represent, and summarize the key behavioral roles in the network than existing methods limited to learning only the roles of nodes. For instance, a person with malicious intent may appear normal by maintaining the vast majority of relationships and communications with individuals that play normal roles in society. In this situation, techniques that reveal the role semantics of nodes would have difficulty detecting such malicious behavior since most edges are normal. However, modeling the roles (functional semantics, intent) of individual edges (relationships, communications) in the rich graph would improve our ability to identify, detect, and predict this type of malicious activity since we model it directly. Existing work also has other limitations that significantly reduce its practical utility in real-world networks; for example, it has relied mainly on simple degree and egonet features [14, 30] (see [28] for other possibilities). Instead, we leverage higher-order network motifs (induced subgraphs) of size \(k \in \{3, 4, \ldots \}\) computed using [1, 2], as well as other graph parameters such as the largest clique in a node (or edge) neighborhood, the triangle core number, and the neighborhood chromatic number, among other efficient and highly discriminative graph features. The main contributions are as follows:

  • Edge role discovery: This work introduces the problem of edge role discovery and proposes a computational framework for learning and modeling edge roles in both static and dynamic networks.

  • Higher-order role discovery models: We propose a general class of higher-order role models that leverage network motifs and higher-order network features for learning both node and edge roles. This work is also the first to use higher-order network motifs for role discovery in general.

  • Edge feature representation learning: We propose a novel deep graph representation learning framework that begins with higher-order network motifs and automatically learns deeper edge features.

  • Efficient and scalable: The proposed feature and role discovery methods are efficient (linear in the number of edges) for modeling large networks. In addition, all methods are parallelized and shown to scale to massive networks.

2 Related Work

Related research is categorized into the following parts: (1) role discovery, (2) higher-order network analysis, (3) graph representation learning, (4) sparse graph features, and (5) parallel role discovery.

Role Discovery: There has been considerable work on role discovery in general [5, 6, 9, 14, 15, 19, 28, 30]; see [28] for a recent survey. However, all existing approaches have focused on learning the roles of nodes in graphs. In contrast, this work introduces the problem of edge role discovery and presents a computational framework for learning and extracting edge roles from large networks. Additional key differences are as follows: (1) our approach uses higher-order graphlets for discovering more intuitive and meaningful roles, and (2) the proposed role methods are parallelized and thus able to scale to extremely large real-world networks. Moreover, our approach supports graphs that are directed/undirected/bipartite, attributed, typed/heterogeneous, and signed.

Higher-Order Network Analysis: Small induced subgraphs called graphlets (motifs) have recently been used for graph classification [36], link prediction [25], and visualization and exploratory analysis [1]. In contrast, this work focuses on using graphlets to learn and extract more useful and meaningful roles from large networks. Furthermore, previous feature-based role methods rely on simple degree and egonet-based features. Thus, another contribution of this work is the use of higher-order network motifs (based on small k-vertex subgraph patterns called graphlets) for role discovery of nodes and edges, a key and fundamental difference from existing work.

Graph Representation Learning: While much work has engineered features by hand (or manually selected them) for various ML applications, comparatively little work has addressed learning a set of useful features automatically. Our approach differs from previous work in four fundamental ways: (1) it learns important and useful edge features automatically, whereas existing approaches were designed for learning node features; (2) it is space-efficient, as it learns sparse features, and fast, with a time complexity that is linear in the number of edges; (3) it has an efficient parallel implementation with strong scaling results (Sect. 4) and is thus well-suited for large-scale networks; and (4) most graph representation learning methods have been used in SRL systems for classification [12], whereas we use the proposed approach for edge role discovery.

Sparse Graph Features: We also make a significant contribution in terms of space-efficient role discovery. In particular, this work proposes the first practical space-efficient approach for feature-based role discovery by learning sparse graph features automatically. In contrast, feature-based node role methods [14, 30] store hundreds or thousands of dense features in memory, which is impractical for any relatively large network; e.g., they require more than 2TB of memory for a 500M-node graph with 1,000 features (at just 4 bytes per value, \(500\text {M} \times 1{,}000 \times 4\) bytes is already 2TB).

Parallel Role Discovery: The existing role discovery methods are sequential, despite the practical importance of parallel role discovery algorithms that scale to massive real-world networks. This work is the first parallel role discovery approach. Furthermore, the proposed edge feature learning techniques are also parallelized and designed to be both efficient in terms of space and communication.

3 Framework

This section introduces edge role discovery along with higher-order edge role models and a computational framework for learning and extracting roles based on higher-order structures.

Extracting Higher-Order Graphlet Features: Given the graph \(G=(V,E)\), we first decompose G into its smaller subgraph components called graphlets (motifs). For this, we use parallel edge-centric graphlet decomposition methods such as [1] to compute a variety of graphlet edge features of size \(k \in \{3,4,\ldots \}\) (Algorithm 1, Line 2). Moreover, our approach can leverage directed, undirected, and weighted/typed graphlet counts (among other useful and discriminative graphlet edge statistics) using either exact or estimation methods. These graphlet features are then used to learn deeper higher-order edge features (see below for further details).
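To make the edge-centric computation concrete, below is a minimal sketch of the simplest such feature, the number of triangles containing each edge (the connected \(k=3\) graphlet), computed by neighborhood intersection. The edge-list input format is our own illustration; the parallel, \(k \ge 4\) counts used in this work come from the methods of [1].

```python
# A minimal sketch: per-edge triangle counts (the connected k = 3 graphlet)
# via neighborhood-set intersection. Illustration only; not the parallel
# edge-centric decomposition of [1].
from collections import defaultdict

def triangle_counts_per_edge(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # |N(u) & N(v)| = number of triangles containing edge (u, v)
    return {(u, v): len(adj[u] & adj[v]) for u, v in edges}

E = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
print(triangle_counts_per_edge(E))
# {(0, 1): 1, (0, 2): 1, (1, 2): 2, (1, 3): 1, (2, 3): 1}
```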

Algorithm 1
Table 1. Relational edge feature operators

Edge Feature Representation Learning Framework: This section presents our deep edge feature representation learning framework (Algorithm 1). Recall that our approach leverages the previous higher-order graphlet counts as a basis for learning deeper and more discriminative higher-order edge features (Line 2–3). Next, primitive edge features are computed in Line 4, including in/out/total/weighted edge egonet and edge degree features. After computing the initial feature layer \(\mathcal {F}_{1}\) (Line 2–4), redundant features are pruned (Line 5–20). The framework proceeds to learn a set of feature layers where each successive layer represents increasingly deeper higher-order edge features (Line 5–20), i.e., \(\mathcal {F}_{1}< \mathcal {F}_{2}< \cdots < \mathcal {F}_{\tau }\) such that if \(i<j\) then \(\mathcal {F}_{j}\) is said to be a deeper layer than \(\mathcal {F}_{i}\).

The feature layers \(\mathcal {F}_2, \mathcal {F}_{3}, \cdots , \mathcal {F}_{\tau }\) are learned as follows (Line 5–20): For each layer \(\mathcal {F}_{\tau }\), we first construct and search candidate features using the set of relational edge feature operators \(\mathrm{\varvec{\Phi }}\) (Line 7), which include mean, sum, product, \(\min \), \(\max \), variance, L1, L2, and parameterized relational kernels based on RBF and polynomial functions, among others; see Table 1 for a few examples. Next, we compute the similarity between all pairs of features and retain an edge only between features that are significantly correlated (Line 9–13): \(E_{F} = \{ (f_i, f_j) \in \mathcal {F} \times \mathcal {F} \; | \; \mathbb {K}(f_i, f_j) > \lambda \}\). This process results in a feature similarity graph where large edge weights indicate strong similarity/correlation between two features. The feature similarity graph \(\mathcal {G}_F\) from Line 9–13 is then used to prune all redundant edge features from \(\mathcal {F}_{\tau }\). Features are pruned by first partitioning the feature graph (Line 14) using connected components, though our approach is flexible and allows other possibilities (e.g., largest clique). Intuitively, each connected component is a set of redundant edge features since edges in \(\mathcal {G}_{F}\) represent strong dependencies between features. For each connected component \(\mathcal {C}_k \in \mathcal {C}\) (Line 15–17), we identify the earliest feature in \(\mathcal {C}_{k} = \{\ldots ,f_i,\ldots ,f_j,\ldots \}\) (Line 16) and remove all others from \(\mathcal {F}_{\tau }\) (Line 17). After pruning the feature layer \(\mathcal {F}_{\tau }\), Line 18 ensures the pruned features are removed from \(\varvec{\mathrm {X}}\) and updates the set of edge features learned thus far by setting \(\mathcal {F}\leftarrow \mathcal {F}\cup \mathcal {F}_{\tau }\). Line 19 increments \(\tau \) and sets \(\mathcal {F}_{\tau } \leftarrow \varnothing \). Finally, Line 20 checks for convergence; if the stopping criterion is not satisfied, the approach attempts to learn an additional feature layer (Line 5–20).
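A minimal sketch of the pruning step (Line 9–17), assuming Pearson correlation stands in for the kernel \(\mathbb {K}\), `lam` plays the role of \(\lambda \), and connected components are found via union-find:

```python
# A minimal sketch of feature pruning: build the feature similarity graph
# (an edge whenever similarity exceeds lam), partition it into connected
# components, and keep only the earliest feature of each component.
import numpy as np

def prune_features(X, lam=0.9):
    """X: (num_edges, num_features) matrix; returns indices of kept features."""
    f = X.shape[1]
    K = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature similarity
    parent = list(range(f))                    # union-find over features
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i
    for i in range(f):
        for j in range(i + 1, f):
            if K[i, j] > lam:                  # edge in the feature graph
                parent[find(i)] = find(j)
    # keep the earliest (smallest-index) feature of each connected component
    earliest = {}
    for i in range(f):
        earliest.setdefault(find(i), i)
    return sorted(earliest.values())
```

Constant features yield undefined (NaN) correlations here and are simply never merged; the actual framework supports arbitrary similarity kernels and partitioning schemes, as noted above.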

Learning Higher-Order Edge Roles: Let \(\varvec{\mathrm {X}}= \big [ x_{ij}\big ] \in \mathbb {R}^{m\times f}\) be an edge feature matrix with \(m\) rows representing edges and \(f\) columns representing higher-order graph features learned by our edge feature representation learning approach. Given \(\varvec{\mathrm {X}}\in \mathbb {R}^{m\times f}\), the edge role discovery optimization problem is to find \({\varvec{\mathrm {U}}}\in \mathbb {R}^{m\times r}\) and \({\varvec{\mathrm {V}}}\in \mathbb {R}^{f\times r}\) with \(r\ll \min (m,f)\) such that the product of the two lower-rank matrices \({\varvec{\mathrm {U}}}\) and \({\varvec{\mathrm {V}}}^{{T}}\) minimizes the divergence between \(\varvec{\mathrm {X}}\) and \(\varvec{\mathrm {X}}^{\prime }={\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{{T}}\). Intuitively, \({\varvec{\mathrm {U}}}\in \mathbb {R}^{m\times r}\) represents the latent role mixed-memberships of the edges, whereas \({\varvec{\mathrm {V}}}\in \mathbb {R}^{f\times r}\) represents the contributions of the features to each of the roles. Each row \(\mathbf {u}_{i}^{{T}} \in \mathbb {R}^{r}\) of \({\varvec{\mathrm {U}}}\) can be interpreted as a low-dimensional rank-\(r\) embedding of the \(i^{th}\) edge in \(\varvec{\mathrm {X}}\). Alternatively, each row \(\mathbf {v}_{j}^{{T}} \in \mathbb {R}^{r}\) of \({\varvec{\mathrm {V}}}\) represents an \(r\)-dimensional role embedding of the \(j^{th}\) feature in \(\varvec{\mathrm {X}}\) in the same rank-\(r\) space. Also, \(\mathbf {u}_k \in \mathbb {R}^{m}\) is the \(k^{th}\) column of \({\varvec{\mathrm {U}}}\), representing a "latent feature", and similarly \(\mathbf {v}_k \in \mathbb {R}^{f}\) is the \(k^{th}\) column of \({\varvec{\mathrm {V}}}\). For learning higher-order edge roles, we solve:

$$\begin{aligned} \mathop {\mathrm{arg \; min}}\limits _{({\varvec{\mathrm {U}}},{\varvec{\mathrm {V}}}) \in \mathcal {C}} \; \Bigl \{ \mathbb {D}_{\phi }(\varvec{\mathrm {X}} \Vert \, {\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{T}) + \mathcal {R}({\varvec{\mathrm {U}}},{\varvec{\mathrm {V}}}) \Bigr \} \end{aligned}$$
(1)

where \(\mathbb {D}_{\phi }(\varvec{\mathrm {X}} \Vert {\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{T})\) is an arbitrary Bregman divergence [10] between \(\varvec{\mathrm {X}}\) and \({\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^T\). Furthermore, the optimization problem in (1) imposes hard constraints \(\mathcal {C}\) on \({\varvec{\mathrm {U}}}\) and \({\varvec{\mathrm {V}}}\) such as non-negativity constraints \({\varvec{\mathrm {U}}},{\varvec{\mathrm {V}}}\ge 0\) and \(\mathcal {R}({\varvec{\mathrm {U}}},{\varvec{\mathrm {V}}})\) is a regularization penalty. In this work, we mainly focus on solving \(\mathbb {D}_{\phi }(\varvec{\mathrm {X}} \Vert {\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{T})\) under non-negativity constraints:

$$\begin{aligned} \mathop {\mathrm{arg \; min}}\limits _{{\varvec{\mathrm {U}}}\ge 0, {\varvec{\mathrm {V}}}\ge 0} \; \Bigl \{ \mathbb {D}_{\phi }(\varvec{\mathrm {X}} \Vert \, {\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{T}) + \mathcal {R}({\varvec{\mathrm {U}}},{\varvec{\mathrm {V}}}) \Bigr \} \end{aligned}$$
(2)

Given the edge feature matrix \(\varvec{\mathrm {X}}\in \mathbb {R}^{m\times f}\), the edge role discovery problem is to find \({\varvec{\mathrm {U}}}\in \mathbb {R}^{m\times r}\) and \({\varvec{\mathrm {V}}}\in \mathbb {R}^{f\times r}\) such that \(\varvec{\mathrm {X}}\approx \varvec{\mathrm {X}}^{\prime } = {\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{{T}}\). To measure the quality of our edge mixed membership model, we use Bregman divergences:

$$\begin{aligned} \sum _{ij} \mathbb {D}_{\phi }(x_{ij} \Vert x_{ij}^{\prime }) = \sum _{ij} \big (\phi (x_{ij}) - \phi (x_{ij}^{\prime }) - \ell (x_{ij}, x_{ij}^{\prime }) \big ) \end{aligned}$$
(3)

where \(\phi \) is a univariate smooth convex function and \(\ell (x_{ij}, x_{ij}^{\prime }) = \nabla \phi (x_{ij}^{\prime })(x_{ij} - x_{ij}^{\prime })\), where \(\nabla ^{p}\phi (x)\) denotes the \(p\)th-order derivative of \(\phi \) at x. Furthermore, let \(\varvec{\mathrm {X}}- \varvec{\mathrm {U}}\varvec{\mathrm {V}}^{T} = \varvec{\mathrm {X}}^{(k)} - \mathbf {u}_k \mathbf {v}_k^{T}\) denote the residual term in the approximation \(\varvec{\mathrm {X}}\approx \varvec{\mathrm {X}}^{\prime } = {\varvec{\mathrm {U}}}{\varvec{\mathrm {V}}}^{{T}}\), where \(\varvec{\mathrm {X}}^{(k)}\) is the k-residual matrix defined as:

$$\begin{aligned} \varvec{\mathrm {X}}^{(k)}&= \varvec{\mathrm {X}}- \sum \limits _{h\not =k} \mathbf {u}_{h} \mathbf {v}_{h}^{T} = \varvec{\mathrm {X}}- \varvec{\mathrm {U}}\varvec{\mathrm {V}}^{T} + \mathbf {u}_k \mathbf {v}_k^{T}, \quad \text {for } k=1,\dots ,r\end{aligned}$$
(4)
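For concreteness, instantiating (3) with the generating function \(\phi (x) = x\log x - x\) (so that \(\nabla \phi (x) = \log x\)) recovers the KL cost used later in (7):

$$\begin{aligned} \mathbb {D}_{\phi }(x \Vert x^{\prime }) = (x\log x - x) - (x^{\prime }\log x^{\prime } - x^{\prime }) - (\log x^{\prime })(x - x^{\prime }) = x \log \frac{x}{x^{\prime }} - x + x^{\prime } \end{aligned}$$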

We use a fast scalar block coordinate descent approach that easily generalizes to heterogeneous networks [32]. The approach treats a single element of \(\varvec{\mathrm {U}}\) or \(\varvec{\mathrm {V}}\) as a block in the block coordinate descent framework. Replacing \(\phi (y)\) with the corresponding expression from Table 2 gives rise to a fast algorithm for each Bregman divergence. Table 2 gives the updates for the Frobenius norm (Fro.), KL-divergence (KL), and Itakura-Saito divergence (IS). Note that Beta divergence and many others are also easily adapted to our higher-order edge role discovery framework.

Table 2. Role divergences and update rules
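As an illustration of this update style, the following is a minimal sketch of the Frobenius-norm case of (2) using rank-one (HALS-style) coordinate descent over the columns \(\mathbf {u}_k, \mathbf {v}_k\) with non-negativity constraints. The scalar variant used in this work updates one element at a time and supports the other divergences in Table 2; this sketch is not that implementation.

```python
# A minimal sketch of non-negative factorization for the Frobenius case of
# Eq. (2), cycling over the rank-one blocks (u_k, v_k) and using the
# k-residual matrix X^(k) of Eq. (4).
import numpy as np

def nmf_coordinate_descent(X, r, iters=100, eps=1e-10):
    m, f = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((m, r))
    V = rng.random((f, r))
    for _ in range(iters):
        R = X - U @ V.T                              # full residual
        for k in range(r):
            Rk = R + np.outer(U[:, k], V[:, k])      # k-residual matrix X^(k)
            U[:, k] = np.maximum(0, Rk @ V[:, k]) / (V[:, k] @ V[:, k] + eps)
            V[:, k] = np.maximum(0, Rk.T @ U[:, k]) / (U[:, k] @ U[:, k] + eps)
            R = Rk - np.outer(U[:, k], V[:, k])      # restore full residual
    return U, V
```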

Model Selection: In this section, we introduce an approach that automatically learns the appropriate role mixed-membership model. The approach is based on the Minimum Description Length (MDL) principle [13, 26], a practical formalization of Kolmogorov complexity [17]. More formally, we find the model \(M_{\star } = ({\varvec{\mathrm {V}}}_{r}, {\varvec{\mathrm {U}}}_{r})\) that leads to the best compression by solving:

$$\begin{aligned} M_{\star } \, = \, \mathop {\mathrm{arg \; min}}\limits _{M \in \mathcal {M}} \; \mathcal {L}_{}(M) \, + \, \mathcal {L}_{}(\varvec{\mathrm {X}}\,|\, M) \end{aligned}$$
(5)

where \(\mathcal {M}\) is the model space, \(M_{\star }\) is the model given by solving the above minimization problem, and \(\mathcal {L}_{}(M)\) is the number of bits required to encode M using code \(\varOmega \), which we refer to as the description length of M with respect to \(\varOmega \). Recall that MDL requires a lossless encoding. Therefore, to reconstruct \(\varvec{\mathrm {X}}\) exactly from \(M=({\varvec{\mathrm {U}}}_r,{\varvec{\mathrm {V}}}_r)\) we must explicitly encode the error \(\varvec{\mathrm {E}}\) such that \(\varvec{\mathrm {X}}= {\varvec{\mathrm {U}}}_r{\varvec{\mathrm {V}}}_r^{T}+\varvec{\mathrm {E}}\). Hence, the total compressed size of \(M=({\varvec{\mathrm {U}}}_r,{\varvec{\mathrm {V}}}_r)\) with \(M \in \mathcal {M}\) is simply \(\mathcal {L}(M) + \mathcal {L}(\varvec{\mathrm {E}})\). Given a role mixed-membership model with r roles \(M = ({\varvec{\mathrm {U}}}_r,{\varvec{\mathrm {V}}}_r) \in \mathcal {M}\), the description length decomposes into: (1) the bits required to describe the model, and (2) the cost of describing the approximation errors \(\varvec{\mathrm {X}}- \varvec{\mathrm {X}}_{r}\), where \(\varvec{\mathrm {X}}_{r} = \varvec{\mathrm {U}}_{r}\varvec{\mathrm {V}}_{r}^{T}\) is the rank-r approximation of \(\varvec{\mathrm {X}}\),

$$\begin{aligned} \varvec{\mathrm {U}}_{r} = \big [\, \mathbf {u}_1 \;\, \mathbf {u}_2 \;\, \cdots \;\, \mathbf {u}_r \, \big ] \in \mathbb {R}^{m\times r}, \;\; \text {and} \;\; \varvec{\mathrm {V}}_{r} = \big [\, \mathbf {v}_1 \;\, \mathbf {v}_2 \;\, \cdots \;\, \mathbf {v}_r \, \big ] \in \mathbb {R}^{f \times r} \end{aligned}$$
(6)

The model \(M_{\star }\) is the model \(M \in \mathcal {M}\) that minimizes the total description length: the model description cost \(\mathcal {L}(M)\) plus the cost of correcting the errors of the model. Let \(\left| \varvec{\mathrm {U}}\right| \) and \(\left| \varvec{\mathrm {V}}\right| \) denote the number of nonzeros in \(\varvec{\mathrm {U}}\) and \(\varvec{\mathrm {V}}\), respectively. The model description cost of M is then \(\kappa ( \left| \varvec{\mathrm {U}}\right| + \left| \varvec{\mathrm {V}}\right| )\), where \(\kappa \) is the bits per value. Similarly, if \(\varvec{\mathrm {U}}\) and \(\varvec{\mathrm {V}}\) are dense, the model description cost is simply \(\kappa r (m+ f)\), where \(m\) and f are the number of edges and features, respectively. Assuming errors are non-uniformly distributed, one possibility is to use the KL divergence (see Table 2) for the error description cost. The cost of correcting a single element in the approximation is \(\mathbb {D}_{\phi }(x \Vert x^{\prime }) = x \log \frac{x}{x^{\prime }} - x + x^{\prime }\) (assuming KL-divergence), and thus the total reconstruction cost is:

$$\begin{aligned} \mathbb {D}_{\phi }(\varvec{\mathrm {X}} \Vert \varvec{\mathrm {X}}^{\prime }) = \sum \limits _{ij} X_{ij} \log \frac{X_{ij}}{X^{\prime }_{ij}} - X_{ij} + X^{\prime }_{ij} \end{aligned}$$
(7)

where \(\varvec{\mathrm {X}}^{\prime } = \varvec{\mathrm {U}}\varvec{\mathrm {V}}^{T} \in \mathbb {R}^{m\times f}\). Other possibilities are given in Table 2. The above assumes a particular representation scheme for encoding the models and data. Recall that the optimal code assigns \(-\log _2 p_i\) bits to encode a message [34]. Lloyd-Max quantization [18, 22] with Huffman codes [16, 35] is used to compress the model and data [8, 24]. Notice that we require only the length of the description under the above encoding scheme, and thus we do not need to materialize the codes themselves. This leads to the improved model description cost \(\bar{\kappa } (\left| \varvec{\mathrm {U}}\right| + \left| \varvec{\mathrm {V}}\right| )\), where \(\bar{\kappa }\) is the mean number of bits required to encode each value. In general, the higher-order (edge) role discovery framework can easily leverage other model selection techniques such as AIC [4] and BIC [33].
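A minimal sketch of the resulting selection loop: sweep the number of roles r, score each candidate model by model bits plus KL error cost, and keep the minimizer of (5). Here `factorize` stands for any non-negative factorization routine (e.g., the coordinate-descent sketch above), and `kappa` is an assumed bits-per-nonzero constant.

```python
# A minimal sketch of MDL-based model selection (Eq. 5).
import numpy as np

def kl_error_cost(X, Xp, eps=1e-12):
    # KL reconstruction cost of Eq. (7); eps guards log(0) in this sketch
    X, Xp = X + eps, Xp + eps
    return float(np.sum(X * np.log(X / Xp) - X + Xp))

def mdl_select(X, factorize, r_max=30, kappa=16):
    best_score, best_model = np.inf, None
    for r in range(1, r_max + 1):
        U, V = factorize(X, r)
        model_cost = kappa * (np.count_nonzero(U) + np.count_nonzero(V))
        score = model_cost + kl_error_cost(X, U @ V.T)
        if score < best_score:
            best_score, best_model = score, (r, U, V)
    return best_model   # (r_star, U_star, V_star) minimizing Eq. (5)
```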

Fig. 1. The valley identifies the correct number of latent roles.

4 Experiments

This section investigates the effectiveness and scalability of the proposed edge role discovery framework (Sect. 3). All network data are available from NetworkRepository [27].

Higher-Order Model Selection: We now validate our model learning approach. Figure 1 demonstrates the effectiveness of our approach for automatically selecting the "best" model from the space of models expressed in the framework (Sect. 3). In particular, our approach finds the best model with \(r=18\) roles by minimizing the description length (in bits). As expected, the model description cost is inversely proportional to the error description cost. We also demonstrate the efficiency of our approach in Fig. 2. Furthermore, Fig. 4 demonstrates the impact on the learning time, the number of novel features discovered, and their sparsity as the tolerance (\(\varepsilon \)) and bin size (\(\alpha \)) vary.

Fig. 2. Runtime of our edge role model selection. The curve is the average over 50 experiments and the dotted lines represent three standard deviations. The result reported above is from a laptop with a single core.

Modeling Dynamic Networks: In this section, we investigate the Enron email communication networks using the higher-order dynamic edge role mixed-membership model. The Enron email data consists of 151 Enron employees who sent 50.5k emails to other Enron employees over a three-year period. The email communications span 05/11/1999 to 06/21/2002. For learning we use only the first year of emails. A dynamic network \(\{G_t\}^{T}_{t=1}\) is constructed from the remaining email communications (approximately two years), where each snapshot graph \(G_t\), \(t=1,\ldots ,T\), represents a month of communications. Interestingly, our higher-order dynamic node role mixed-membership model has 5 latent roles, whereas we learn 18 roles using the edge role model. Evolving edge and node mixed-memberships from the Enron email communication network are shown in Fig. 3. The edges and nodes visualized in Fig. 3 are selected using the difference entropy rank (defined below) and correspond to those with the largest difference entropy rank \(\mathbf {d}\). The first role in Fig. 3 represents inactivity (dark blue). These empirical results suggest that edge roles are superior to node roles in three fundamental ways: (1) edge roles reveal novel behavioral characteristics that are not captured by node role models; we posit that these novel behavioral roles are intrinsic to the edge semantics (which represent communications in Fig. 3); (2) roles learned on the edges represent behavioral characteristics at a much finer level of granularity than those learned on nodes; and (3) edge roles are better at modeling dynamic/temporal networks and avoid many of the unrealistic assumptions that lie at the heart of dynamic node role mixed-membership models.

Fig. 3. Temporal changes in the edge and node mixed-membership vectors. The horizontal axis of each subplot is time, whereas the vertical axis represents the components of each mixed-membership vector. Roles are represented by different colors. (Color figure online)

Fig. 4. Impact on the learning time, number of features, and their sparsity, as the tolerance \(\varepsilon \) (rows) and bin size \(\alpha \) (columns) vary.

Fig. 5. Edge and node roles for ca-netscience. Link color represents the edge role and node color indicates the corresponding node role. (Color figure online)

We define the difference entropy rank as \(\mathbf {d}= \max _{t} H(\mathbf {u}_{t}) - \min _{t} H(\mathbf {u}_{t})\), where \(H(\mathbf {u}_{t}) = - \mathbf {u}_t \cdot \log (\mathbf {u}_t)\) and \(\mathbf {u}_t\) is the r-dimensional mixed-membership vector for an edge (or node) at time t. Using the difference entropy rank, we are able to reveal important communications between key players involved in the Enron scandal, such as Kenneth Lay, Jeffrey Skilling, and Louise Kitchen. In particular, anomalous relationships between these individuals appear among the top anomalies from the difference entropy rank. Notice that when node roles are used for identifying dynamic anomalies in the graph, we are only provided with potentially malicious employees, whereas using edge roles naturally allows us not only to detect the key malicious individuals involved, but also the important relationships between them, which can be used for further analysis, among other possibilities. Additional results are omitted for brevity.
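A minimal sketch of the difference entropy rank, assuming the T mixed-membership vectors of a single edge (or node) are stacked into a (T, r) array:

```python
# A minimal sketch of the difference entropy rank d: the spread between the
# largest and smallest entropy of the mixed-membership vector across time.
import numpy as np

def difference_entropy_rank(U_t, eps=1e-12):
    """U_t: (T, r) array; row t is the mixed-membership vector u_t."""
    H = -np.sum(U_t * np.log(U_t + eps), axis=1)   # entropy H(u_t) per snapshot
    return float(H.max() - H.min())
```

Edges whose memberships swing between concentrated and mixed over time receive a large \(\mathbf {d}\) and surface near the top of the anomaly ranking.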

Exploratory Analysis: Figure 5 visualizes the node and edge roles learned for ca-netscience. While our higher-order edge role discovery method learns a stochastic r-dimensional vector for each edge (and/or node) representing its individual role memberships, Fig. 5 assigns a single role to each link and node, i.e., the role with maximum likelihood \(k_{\star } \leftarrow \mathrm{arg \; max}_{k}\ u_{ik}\). The higher-order edge and node roles in Fig. 5 are clearly meaningful. For instance, the red edge role represents a type of bridge relationship.

Table 3. Higher-order sparse graph feature learning for latent node and edge network modeling. Recall that f is the number of features, L is the number of layers, and \(\rho (\varvec{\mathrm {X}})\) is the sparsity of the feature matrix. Edge values are bold.

Sparse Graph Feature Learning: Recall that the proposed feature learning approach learns "sparse graph features" to improve learning and efficiency, especially space-efficiency. This section investigates the effectiveness of our sparse graph feature learning approach. Results are presented in Table 3. In all cases, our approach learns a highly compressed representation of the graph, requiring only a fraction of the space of current (node) approaches. Moreover, the density of the edge feature representations learned by our approach lies in [0.164, 0.318], and that of the node representations in [0.162, 0.334] (see \(\rho (\varvec{\mathrm {X}})\) and \(\rho (\varvec{\mathrm {Z}})\) in Table 3); these representations are up to 6x more space-efficient than those of other approaches.

Fig. 6. Relative improvement in label consistency (homophily), a known proxy for classification performance. In all cases, links predicted using edge roles improve the label consistency over both the initial graph and the links predicted using node roles.

Improving Classification via Link Prediction: This section demonstrates the effectiveness of edge roles for improving relational classification by predicting links between nodes in the graph. For consistency, we first construct node features from the edge role memberships using a set of relational operators (e.g., relational \(\mathtt {mean}\), \(\mathtt {sum}\), \(\mathtt {var}\), \(\mathtt {max}\), among others), as introduced in [31]. Thus, let us assume \(\mathbf {x}_i\) is a k-dimensional feature vector for node \(v_i \in V\). Given \(\mathbf {x}_i\) and \(\mathbf {x}_j\), and a positive semidefinite kernel function \(K\langle \cdot ,\cdot \rangle \), the relationship strength between \(v_i\) and \(v_j\) is defined as:

$$\begin{aligned}&\varvec{\mathrm {S}}= [S_{ij}],\; \forall i,j \quad {\textit{and}}&S_{ij} = {\left\{ \begin{array}{ll} K\langle \mathbf {x}_i,\mathbf {x}_j \rangle &{} \text {if } (v_i, v_j) \not \in E \wedge K\langle \mathbf {x}_i,\mathbf {x}_j \rangle >\epsilon \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(8)

where \(K\langle \mathbf {x}_i,\mathbf {x}_j \rangle \) represents the "closeness" between nodes \(v_i\) and \(v_j\) in the latent lower-dimensional subspace, \(\varvec{\mathrm {S}}\in \mathbb {R}^{n\times n}\) is the (implicit) "similarity" matrix (which can be thought of as the weighted adjacency matrix of a graph \(G^{\prime }\)), and \(S_{ij}\) is the relationship strength between nodes \(v_i\) and \(v_j\) such that \((v_i,v_j) \not \in E\), and 0 otherwise. Note \(\epsilon \) is a small scalar that controls sparsity. In this work, we use \(K\langle \mathbf {x}_i,\mathbf {x}_j \rangle = \mathrm {exp}(-{\Vert \mathbf {x}_i - \mathbf {x}_j\Vert ^{2}}/{2\sigma ^2})\). Given \(\varvec{\mathrm {S}}\), let \(G^{\prime } = (V, E^{\prime })\) denote the predicted latent graph where \(E^{\prime }\) is the set of k predicted links with the largest relationship strength weights. By definition \(|E|+|E^{\prime }| = m+ k\) and thus \(E \cap E^{\prime } = \varnothing \).
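A minimal sketch of (8), assuming an (n, d) node feature matrix X built from the edge role memberships; sigma, epsilon, and the number of predicted links k are illustrative parameters of this sketch:

```python
# A minimal sketch of Eq. (8): score all non-adjacent node pairs with an RBF
# kernel over node feature vectors and keep the k strongest as E'.
import numpy as np
from itertools import combinations

def predict_links(X, E, k=10, sigma=1.0, epsilon=1e-3):
    """X: (n, d) node features; E: iterable of existing edges; returns E'."""
    existing = {tuple(sorted(e)) for e in E}
    scores = []
    for i, j in combinations(range(len(X)), 2):
        if (i, j) in existing:
            continue                                   # score non-edges only
        s = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2 * sigma ** 2))
        if s > epsilon:                                # epsilon controls sparsity
            scores.append((s, i, j))
    scores.sort(reverse=True)
    return [(i, j) for _, i, j in scores[:k]]          # top-k strongest pairs
```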

For quantitative evaluation of the edge roles, we use a measure of homophily called label consistency [21]. Let \(\xi (v_i)\) be the class of \(v_i\); then the label consistency of G is defined as
$$\begin{aligned} \mathbb {L}(G) = \frac{1}{|E|} \sum \limits _{(v_i, v_j) \in E} \mathbb {L}(v_i, v_j) \end{aligned}$$
where \(\mathbb {L}(v_i, v_j) = 1\) if \(\xi (v_i) = \xi (v_j)\) and 0 otherwise. Hence, label consistency measures how often two connected nodes belong to the same class. It is a good proxy for classification performance since most existing statistical relational learning (SRL) [12] methods assume the labels of neighbors are highly correlated, i.e., the network exhibits high relational autocorrelation (or homophily) [12, 20]. To determine the effectiveness of edge roles for link prediction, we measure \(\mathbb {L}(G)\) and \(\mathbb {L}(G^{\prime })\). Notice that if the higher-order edge roles (and node roles, for that matter) are useful and effective, one would expect \(\mathbb {L}(G) < \mathbb {L}(G^{\prime })\); that is, the predicted links result in higher homophily among the connected nodes since the class labels of connected nodes in \(G^{\prime }\) are more consistent than in G. Results are provided in Fig. 6 for six different networks. In particular, Fig. 6 demonstrates the effectiveness of the higher-order edge roles (and node roles) for link prediction. In all cases, both the higher-order node and edge roles significantly outperform the baseline. Further, the edge role models always perform significantly better than the node role models.
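A minimal sketch of the label consistency computation, assuming a node-to-class mapping `labels`:

```python
# A minimal sketch of label consistency L(G): the fraction of edges whose
# endpoints share a class label.
def label_consistency(edges, labels):
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```

If the edge roles are informative, the predicted edge set should satisfy `label_consistency(E_prime, labels) > label_consistency(E, labels)`, i.e., \(\mathbb {L}(G) < \mathbb {L}(G^{\prime })\).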

Computational Complexity: Recall that \(m\) is the number of edges, \(f\) is the number of features, and \(r\) is the number of latent roles. The total computational complexity of the higher-order latent space model is \(\mathcal {O}\big ( f(mf+ mr) \big )\). This decomposes into the following main parts. Edge feature learning takes \(\mathcal {O}(f(m+mf))\). Model learning takes \(\mathcal {O}(mfr)\) in the worst case, which arises when \(\varvec{\mathrm {U}}\) and \(\varvec{\mathrm {V}}\) are completely dense. The quantization and Huffman coding terms are very small and therefore ignored. Role assignment using scalar element-wise coordinate descent has a worst-case complexity of \(\mathcal {O}(mfr)\) per iteration, which arises when \(\varvec{\mathrm {X}}\) is completely dense. We assume the initial graphlet features are computed using fast and accurate estimation methods; see [3].

Scalability: To evaluate the scalability of the parallel framework for modeling higher-order latent edge roles, we measure the speedup \(S_p = T_1 / T_p\), where \(T_1\) is the execution time of the sequential algorithm and \(T_p\) is the execution time of the parallel algorithm with p processing units. Overall, the methods show strong scaling (see Fig. 7). Similar results were observed for other networks. The experiments used a machine with 4 Intel Xeon E5-4627 v2 3.3 GHz CPUs.

Fig. 7. Strong parallel scaling is observed.

5 Conclusion

In this paper, we introduced the edge role discovery problem and presented a computational framework for learning and extracting edge roles from large networks. In addition, we proposed higher-order role discovery methods that leverage network motifs (including all motifs of size 3, 4, and larger) for learning more meaningful and discriminative roles. We also proposed a novel edge feature learning approach, which was used for our feature-based edge roles. Furthermore, all methods are space-efficient (by learning sparse features) and efficient with a runtime that is linear in the number of edges. Finally, the approach also supports graphs that are directed/undirected/bipartite, attributed, typed, and signed.