1 Introduction

Given a graph and a set of query nodes, we are interested in connecting these query nodes in a minimal but highly informative manner. Minimal in the sense that we look for a preferably small subgraph to which the query nodes belong. Informative in the sense that we aim to show the user a subgraph that is highly insightful to them, i.e., one that contains relationships between nodes that are unexpected and surprising to the user. In this paper we consider the case of connecting the query nodes through a subgraph that has a tree structure.

Fig. 1.

Tree connecting the three most recent KDD best paper award winners listed at the official ACM SIGKDD webpage (http://www.kdd.org/awards/sigkdd-best-research-paper-awards) that are also present in the Aminer ACM-Citation-network v8 (https://aminer.org/citation, [10]). The result of our algorithm with heuristic s-IR given no background knowledge about the graph. See Sect. 5 for more details.

An example: suppose we have a scientific paper citation network, where edges denote that one paper references another. Given a set of query papers, a directed tree containing these query papers is one possible way to represent interesting citation relationships between these papers. The root of the tree could represent a paper that was (perhaps indirectly) highly influential to all the papers in the query set. Connections between nodes are subjectively interesting if they are surprising. E.g., if a user knows certain papers are widely cited (have high degree), those papers would be less interesting to find in the connecting tree: the user already expects this connection to exist and hence does not learn much.

An example of an informative tree connecting three recent KDD best paper award winners, where no prior knowledge about degrees was assumed, is given in Fig. 1. Another example application would be to organize your bookmarks by constructing a tree where the bookmarks are the query nodes and the network is the WWW. With a prior containing the degrees of the nodes in the network, an informative tree would partition the bookmarks according to links that are surprising and hence specific to a sub-network (they have low degree). Our method finds such trees without doing community detection as an intermediate step.

The main question here is: what makes a certain tree interesting to a given user? We believe that the goal of Exploratory Data Mining (EDM) is to increase a user’s understanding of his or her data in an efficient way. However, we have to consider that every user is different. It is in this regard that the notion of subjective interestingness was formalised [9], most notably in the data mining framework FORSIED, on which we build [4, 5].

The FORSIED framework specifies in general terms how to model prior beliefs the user has about the data. Given a background model representing these prior beliefs, we may find patterns that are highly surprising to the particular user. Hence in our setting, a tree will generally be more interesting if it contains, according to the user’s beliefs, more unexpected relationships between the nodes.

This paper contributes the following:

  • We define the new problem of finding subjectively interesting trees connecting a set of query nodes in a network. (Sect. 2)

  • We show how to formalize a user’s knowledge that the graph has a ‘DAG’-like structure, for example because the nodes represent events in time. (Sect. 3)

  • We propose heuristics for mining the most interesting trees efficiently in the case of directed graphs. (Sect. 4)

  • We evaluate and compare the effectiveness of these heuristics on real data and study the utility of the resulting trees, showing that the results are truly and usefully dependent on the assumed prior beliefs of the user. (Sect. 5)

2 Subjectively Interesting Trees in Graphs

We denote a network (aka graph) G as \(G=(V,E)\), where V is the set of nodes (aka vertices) and \(E \subseteq V \times V\) is the edge set. We denote the adjacency matrix of a graph as A, where \(\mathbf A _{ij}=1\) iff there is an edge connecting node i to j, i.e., iff \((i,j)\in E\). The main focus of this paper will be on directed networks. However, our methods directly apply to undirected networks, when considered as a special case where \(\mathbf A \) is symmetric. We assume that the set of nodes V is fixed and known, and that the user is interested in the network’s connectivity, i.e., the edge set E, especially in relation to a set of so-called query nodes \(Q\subseteq V\).

2.1 Trees Connecting Query Nodes as Data Mining Patterns

The data mining process we consider is query-driven: the user provides a set of query nodes \(Q\subseteq V\) between which they suspect connections exist in the graph that might be of interest to them. In response to this query, the methods proposed in this paper will thus provide the user with a tree-structured sub-network connecting the query nodes. We consider trees because they are easy to interpret. We refer to the presence of a tree as a pattern found in the network.

Formally, a tree \(T=(V_T,E_T)\) is a network over the nodes \(V_T\subseteq V\) with edges \(E_T=\{e_1,\ldots ,e_{|V_T|-1}\}\subseteq V_T\times V_T\), where \(e_i(2)\ne e_j(2)\) for \(i\ne j\) (i.e., each node has at most one parent). The tree \(T=(V_T,E_T)\) is said to be present in the network \(G=(V,E)\) iff \(V_T\subseteq V\) and \(E_T\subseteq E\). The methods proposed below search for interesting trees \(T=(V_T,E_T)\) present in the network \(G=(V,E)\) with \(Q\subseteq V_T\).

Remark 1

The above description is a special type of tree: a rooted arborescence. This is a tree-structured directed sub-network with a unique directed path between the root and each of the leaves. The edges all point away from the root (out-arborescence), but by reversing all edge directions also in-arborescences can be considered. We will simply refer to the considered patterns as trees.

2.2 Subjective Interestingness

The FORSIED framework aims to quantify interestingness of a pattern in a subjective manner, dependent on prior beliefs the user holds about the data. To model the user’s belief state about the data, the framework proposes to use a so-called background distribution, which is a probability distribution P over the data space (in our setting, the set of all possible edge sets E). It was argued that a good choice for the background distribution is the maximum entropy distribution subject to the prior beliefs as constraints [4, 5].

The FORSIED framework then prefers patterns that achieve a trade-off between how much information the pattern conveys to the user (considering their belief state), versus the effort required of the user to assimilate the pattern. Specifically, De Bie [4] argued that the Subjective Interestingness (SI) of a pattern can be quantified as the ratio of the Information Content (IC) and the Description Length (DL) of a pattern. The IC is defined as the negative log probability of the pattern w.r.t. the background distribution P. The DL is quantified as the length of the code needed to communicate the pattern to the user.

The IC of a tree. The background distributions P for all prior belief types discussed in this paper have the property that P factorizes as a product of independent Bernoulli distributions, one for each possible edge \(e \in V\times V\). Hence the IC of a tree T with edges \(E_T\) decomposes as

$$\begin{aligned} \text {IC}(T) = -\text {log}(\displaystyle \prod _{e \in E_T} \text {Pr}(e)) = \displaystyle \sum _{e \in E_T}\text {IC}(e), \end{aligned}$$
(1)

where we defined the IC of an edge e to be \(\text {IC}(e) = -\text {log}(\text {Pr}(e))\).
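To make Eq. (1) concrete, here is a minimal Python sketch (not from the paper; the function names and the choice of base-2 logarithms, so that IC is measured in bits, are our own conventions) computing the IC of a tree from per-edge Bernoulli probabilities:

```python
import math

def edge_ic(prob):
    """IC of a single edge: -log of its probability under the background model."""
    return -math.log2(prob)

def tree_ic(tree_edges, edge_prob):
    """IC of a tree: by Eq. (1), the sum of the ICs of its edges.

    tree_edges: iterable of (parent, child) pairs
    edge_prob:  dict mapping (parent, child) -> Pr(e) under the background model
    """
    return sum(edge_ic(edge_prob[e]) for e in tree_edges)
```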

The DL of a tree. A tree can be described by first describing the set of nodes \(V_T\) and then the set of edges \(E_T\) over this set of nodes. To describe the set \(V_T\subseteq V\) efficiently, note that \(Q\subseteq V_T\) such that only \(V_T{\setminus } Q\) needs to be described. This can be done using a sequence of \(|V_T|-|Q|+1\) symbols from \(V{\setminus }Q\cup \{`\text {stop'}\}\), where the last one is a stop symbol. This results in a description length of \((|V_T|-|Q|+1)\log (|V|-|Q|+1)\) bits for \(V_T\). Given \(V_T\), \(E_T\) can be described by listing the parents of all nodes from within \(V_T\cup \{`\text {none'}\}\), where the ‘none’ symbol is used for the root. This requires \(|V_T|\text {log}(|V_T|+1)\) bits. Thus:

$$\begin{aligned} \text {DL}(T) = (|V_T|-|Q|+1)\log (|V|-|Q|+1)+|V_T|\log (|V_T|+1). \end{aligned}$$
(2)
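The DL of Eq. (2) and the SI ratio of Sect. 2.2 are equally direct to compute. A minimal sketch, again with base-2 logarithms as an assumed convention:

```python
import math

def tree_dl(num_tree_nodes, num_query_nodes, num_graph_nodes):
    """DL of a tree, following Eq. (2): cost of describing the nodes of V_T
    outside Q, plus the cost of listing each node's parent ('none' for the root)."""
    dl_nodes = (num_tree_nodes - num_query_nodes + 1) * \
        math.log2(num_graph_nodes - num_query_nodes + 1)
    dl_edges = num_tree_nodes * math.log2(num_tree_nodes + 1)
    return dl_nodes + dl_edges

def subjective_interestingness(ic, dl):
    """SI of a pattern: the ratio of its IC to its DL (Sect. 2.2)."""
    return ic / dl
```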

2.3 Finding Subjectively Interesting Trees

The methods presented in this paper aim to solve the following problem:

Problem 1

Given a graph \(G = (V,E)\) and set of query nodes \(Q \subseteq V\), we want to find a root \(r\in V\) and an out-arborescence rooted at r, such that the arborescence is maximally subjectively interesting. We additionally require that all leaf nodes are query nodes, and we constrain the height of the tree not to be larger than a user-defined parameter k.

Since the SI depends on the background distribution and thus on the user’s prior beliefs, the optimal solution to Problem 1 does as well. As stated in Remark 1, by transposing the adjacency matrix A, we can equivalently consider in-arborescences in exactly the same manner.

3 The Background Distribution to Model the User Beliefs

As mentioned, the background distribution is computed as the maximum entropy distribution subject to the prior beliefs as constraints. Here we discuss how this is done in detail for three types of prior beliefs: (1) on the overall edge density; (2) on the individual node degrees; and (3) for networks with nodes that correspond to timed events, on the tendency of nodes to be connected to nodes at a specified time difference (as well as generalizations thereof). These prior beliefs can be combined as well. Note that (1) and (2) were introduced before in [6].

3.1 Prior Beliefs on Overall Density, and on Individual Node Degrees

As shown in [6], given prior beliefs on the degrees of the nodes, the maximum entropy distribution factorizes as:

$$\begin{aligned} P(\mathbf A ) = \displaystyle \prod _{i,j} \frac{\exp ((\lambda _i^r +\lambda _j^c)\mathbf A _{ij})}{1+\exp (\lambda _i^r +\lambda _j^c)}, \end{aligned}$$

where \(\lambda _i^r\) and \(\lambda _j^c\) are parameters from the resulting optimization problem. [3] showed how these can be computed efficiently. For a prior belief on the overall graph density, every edge probability in the model equals the assumed density.
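Given fitted parameters, each factor above is a Bernoulli success probability of sigmoid form. A minimal sketch (assuming the \(\lambda ^r\) and \(\lambda ^c\) vectors have already been computed, e.g. by the method of [3]):

```python
import numpy as np

def edge_probabilities(lam_row, lam_col):
    """Edge probabilities under the degree prior: the factorized MaxEnt
    model gives Pr(A_ij = 1) = sigmoid(lambda_i^r + lambda_j^c)."""
    z = lam_row[:, None] + lam_col[None, :]   # (n, n) matrix of lambda sums
    return 1.0 / (1.0 + np.exp(-z))
```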

3.2 Prior Beliefs When Nodes Represent Timed Events

If the nodes in G correspond to events in time, we can partition the nodes into bins according to a time-based criterion. For example, if the nodes are scientific papers in a citation network, we can partition them by publication year. Given these bins, it is possible to express prior beliefs on the number of edges between two bins. This would allow one to express e.g. beliefs on how often papers from year x cite papers from year y. This is useful e.g. if one believes that papers cite recent papers more often than older ones.

We consider the case when our beliefs are in line with a stationarity property, i.e. when the beliefs regarding two bins are independent of the absolute value of the time-based criterion of these two bins, but rather only depend on the time difference. Given an adjacency matrix \(\mathbf A \), this amounts to expressing prior beliefs on the total number of ones in each of the block-diagonals of the resulting block matrix (formed by partitioning the elements into bins), see Fig. 2 for clarification.

Fig. 2.

A resulting block matrix with 3 bins \(b_1, b_2\) and \(b_3\). There are 5 block-diagonals \(D_k\) (indicated by the same fill). For each \(D_k\), we express prior beliefs on the sum of all elements in \(D_k\).
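As a small illustration of this indexing (the 0-based convention is our own), the following maps an element (i, j) of the block matrix to its block diagonal \(D_k\):

```python
def block_diagonal_index(bin_i, bin_j, num_bins):
    """Map an entry (i, j) to its block diagonal D_k.

    bin_i, bin_j: 0-based bin indices of nodes i and j. Entries with the
    same value of bin_i - bin_j lie on the same block diagonal; shifting
    by num_bins - 1 yields k in {0, ..., 2 * num_bins - 2}, i.e., one of
    the 2 * #bins - 1 diagonals.
    """
    return bin_i - bin_j + num_bins - 1
```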

We consider the problem of finding the maximum entropy distribution over the set of rectangular binary matrices \(\mathcal {A} = \{0,1\}^{n\times n}\), while constraining the expectation of the sum of the elements in each of the block diagonals, as well as each of the row and column sums. It is found by solving:

$$\begin{aligned} \displaystyle \mathop {\mathrm {arg\,max}}_{P(\mathbf A )}&- \displaystyle \sum _\mathbf{A \in \mathcal {A}} P(\mathbf A )\log P(\mathbf A ), \\ \text {s.t.}&\displaystyle \sum _\mathbf{A \in \mathcal {A}} P(\mathbf A ) \displaystyle \sum _{j=1}^{n} \mathbf A _{ij} = d_i^r, \quad \displaystyle \sum _\mathbf{A \in \mathcal {A}} P(\mathbf A ) \displaystyle \sum _{i=1}^{n} \mathbf A _{ij} = d_j^c, \\&\displaystyle \sum _\mathbf{A \in \mathcal {A}} P(\mathbf A ) \displaystyle \sum _{(i,j) \in D_k} \mathbf A _{ij} = B_k, \\&\displaystyle \sum _\mathbf{A \in \mathcal {A}} P(\mathbf A ) = 1, \end{aligned}$$

with \(i,j \in \{1,\ldots , n\}\) and \(k \in \{1, \ldots , 2\#\text {bins}-1\}\), and with \(d_i^r\) the expected sum of the i’th row, \(d_j^c\) the expected sum of the j’th column, and \(B_k\) the expected sum of the k’th block diagonal \(D_k\). The resulting maximum entropy distribution factorizes as a product of independent Bernoulli distributions, one for each random variable \(\mathbf A _{ij} \in \{0,1\}\):

$$\begin{aligned} P(\mathbf A ) = \displaystyle \prod _{i,j} \frac{\exp ((\lambda _i^r +\lambda _j^c + \alpha _k)\mathbf A _{ij})}{1+\exp (\lambda _i^r +\lambda _j^c + \alpha _k)}, \end{aligned}$$
(3)

where \(\lambda _i^r, \lambda _j^c\) and \(\alpha _k\) are the Lagrange multipliers for the corresponding row, column and block-diagonal constraints. These Lagrange multipliers are found by minimizing the Lagrange dual function, as given by:

$$\begin{aligned} L(\lambda ^r, \lambda ^c, \alpha ) = \displaystyle \sum _{i,j} \log (1+\exp (\lambda _i^r +\lambda _j^c + \alpha _k)) - \displaystyle \sum _{i} \lambda _i^r d_i^r - \displaystyle \sum _{j} \lambda _j^c d_j^c - \displaystyle \sum _{k} \alpha _k B_k. \end{aligned}$$

Standard methods for unconstrained convex optimization such as Newton’s method can be used to infer the optimal values. The number of variables to be optimized over is equal to \({2(\text {n}+\#\text {bins})-1}\), where \(1 \le \#\text {bins} \le \text {n}\). Using Newton’s method then requires solving a linear system of O(n) equations, with computational complexity \(O(n^3)\). For practical problems involving large networks, this quickly becomes infeasible. However, with a similar argument as in [3], we can dramatically reduce the number of variables. Observe that if \(d^r_k = d^r_l\) and k and l belong to the same bin, then we have \(L(\ldots ,\lambda ^r_k,\lambda ^r_l, \ldots ) = L(\ldots ,\lambda ^r_l,\lambda ^r_k, \ldots )\). The convexity of L implies \(\lambda ^r_k = \lambda ^r_l\) at the optimum. A similar argument holds for the \(\lambda ^c\) parameters. Thus the number of free variables per bin to be optimized over is bounded by the number of distinct row and column sums per bin.
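For illustration, here is a naive sketch that minimizes the (unreduced) Lagrange dual with an off-the-shelf quasi-Newton optimizer instead of Newton’s method; it is only practical for small n, which is exactly why the variable reduction above matters. The function names and the use of L-BFGS are our own choices, not the paper’s implementation:

```python
import numpy as np
from scipy.optimize import minimize

def fit_background_model(d_row, d_col, B, bins):
    """Minimize the Lagrange dual L(lambda^r, lambda^c, alpha) of Sect. 3.2.

    d_row, d_col: expected row/column sums (length-n arrays)
    B:            expected block-diagonal sums (length 2 * num_bins - 1)
    bins:         0-based bin index of each node (length-n int array)
    Returns the (n, n) matrix of edge probabilities.
    """
    n = len(d_row)
    num_bins = bins.max() + 1
    K = bins[:, None] - bins[None, :] + num_bins - 1  # block-diagonal index per entry

    def dual(theta):
        lr, lc, al = theta[:n], theta[n:2 * n], theta[2 * n:]
        Z = lr[:, None] + lc[None, :] + al[K]
        # L = sum of log(1 + exp(Z)) minus the linear constraint terms
        value = np.logaddexp(0.0, Z).sum() - lr @ d_row - lc @ d_col - al @ B
        P = 1.0 / (1.0 + np.exp(-Z))  # current expected values of A_ij
        grad = np.concatenate([
            P.sum(axis=1) - d_row,                        # cf. Eq. (4), ungrouped
            P.sum(axis=0) - d_col,                        # cf. Eq. (5), ungrouped
            np.bincount(K.ravel(), weights=P.ravel(),
                        minlength=2 * num_bins - 1) - B,  # cf. Eq. (6)
        ])
        return value, grad

    theta0 = np.zeros(2 * n + 2 * num_bins - 1)
    res = minimize(dual, theta0, jac=True, method="L-BFGS-B")
    lr, lc, al = res.x[:n], res.x[n:2 * n], res.x[2 * n:]
    return 1.0 / (1.0 + np.exp(-(lr[:, None] + lc[None, :] + al[K])))
```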

Let \(\widetilde{m}\) be the total number of free row variables, and \(\widetilde{n}\) be the total number of free column variables. The following Lemma provides an upper bound on \(\widetilde{m}+\widetilde{n}\) in terms of the number of non-zero elements of \(\mathbf A \) and the number of bins k:

Lemma 1

Let \(\mathbf A \) be a binary rectangular matrix and denote \(s = \sum _{i,j} \mathbf A _{ij}\). Then it holds that \(\widetilde{m}+\widetilde{n} \le 2\sqrt{2ks}\).

Proof

Let \(\widetilde{m}_i\) be the number of distinct row variables in the i-th bin and similarly for \(\widetilde{n}_i\) with \(i \in \{1,\ldots ,k\}\). Let \(s_i\) (\(s'_i\)) be the total number of ones in all the rows (columns) of the elements in bin i. Then the following inequalities hold [3]:

$$\begin{aligned} \widetilde{m}_i \le \sqrt{2s_i}, \quad \text { and }\quad \widetilde{n}_i \le \sqrt{2s'_i}. \end{aligned}$$

Hence \(\widetilde{m}+\widetilde{n} \le \sqrt{2}(\sqrt{s_1}+\ldots +\sqrt{s'_k})\). Clearly also \(\sum _i (s_i+s_i') = 2s\) and thus by Jensen’s inequality \(\sqrt{s_1}+\ldots +\sqrt{s'_k} \le 2\sqrt{ks}\), which proves the lemma.   \(\square \)

Denote \(\widetilde{\lambda }_{k,l}^r\) as the l-th unique row parameter in the k-th bin. Denote the corresponding row sum constraint as \(\widetilde{d}_{k,l}^r\), having \(\widetilde{m}^k_l\) occurrences in that bin; similarly for \(\widetilde{\lambda }_{k,l}^c\), \(\widetilde{d}_{k,l}^c\) and \(\widetilde{n}^k_l\). Denote \(\alpha _{kk'}\) as the \(\alpha \) parameter of the \(\mathbf A _{ij}\) elements with \(i \in \text {bin } k\) and \(j \in \text {bin } k'\). The reduced Lagrange dual function then becomes

$$\begin{aligned}&L(\widetilde{\lambda ^r}, \widetilde{\lambda ^c}, \alpha ) = \displaystyle \sum _{k} \displaystyle \sum _{k'} \displaystyle \sum _{l} \displaystyle \sum _{l'} \widetilde{m}^k_l \widetilde{n}^{k'}_{l'} \log (1+\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{kk'})) \\&- \displaystyle \sum _k \displaystyle \sum _l \widetilde{m}^k_l \widetilde{d}_{k,l}^r \widetilde{\lambda }_{k,l}^r - \displaystyle \sum _{k'} \displaystyle \sum _{l'} \widetilde{n}^{k'}_{l'} \widetilde{d}_{k',l'}^c \widetilde{\lambda }_{k',l'}^c - \displaystyle \sum _{m} \alpha _m B_m. \end{aligned}$$

The gradient is computed as

$$\begin{aligned}&\frac{\partial L}{\partial \widetilde{\lambda }_{k,l}^r} = \displaystyle \sum _{k'} \displaystyle \sum _{l'} \widetilde{m}^k_l \widetilde{n}^{k'}_{l'} \frac{\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{kk'})}{1+\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{kk'})} - \widetilde{m}^k_l \widetilde{d}_{k,l}^r,\end{aligned}$$
(4)
$$\begin{aligned}&\frac{\partial L}{\partial \widetilde{\lambda }_{k',l'}^c} = \displaystyle \sum _{k} \displaystyle \sum _{l} \widetilde{m}^k_l \widetilde{n}^{k'}_{l'} \frac{\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{kk'})}{1+\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{kk'})} - \widetilde{n}^{k'}_{l'} \widetilde{d}_{k',l'}^c,\end{aligned}$$
(5)
$$\begin{aligned}&\frac{\partial L}{\partial \alpha _k} = \displaystyle \sum _{D_k} \frac{\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{k})}{1+\exp (\widetilde{\lambda }_{k,l}^r +\widetilde{\lambda }_{k',l'}^c + \alpha _{k})} - B_k. \end{aligned}$$
(6)

A similar expression holds for the Hessian. In all cases (rows, columns and block diagonals) the corresponding gradient component is simply the difference between the expected number of ones and the corresponding constraint value. When applying Newton’s method to the reduced model, we need \(O(\widetilde{m}\widetilde{n})\) calculations to compute both the gradient and the Hessian. After that we need to solve a linear system with \(\widetilde{m}+\widetilde{n}+2k-1\) equations, with cubic complexity. By Lemma 1, this is \(O(\sqrt{k^3s^3}+k^3)\), making it very efficient in many real-life applications (sparse networks and a small number of bins).

Remark 2

Note that we are not limited to the case of stationarity, nor is it necessary that nodes correspond to timed events. Expressing prior beliefs on the density of any particular subset of edges is possible in a similar manner. We tackled this specific case because it directly applies to the data used in this paper.

4 Algorithms for Finding the Most Interesting Tree

The problem of finding a directed Steiner arborescence (spanning all the query nodes) with maximum SI is NP-hard in general, as can be seen from the case of constant edge weights (e.g., if the prior belief is the overall graph density). In this case the SI of a tree is a decreasing function of the number of nodes in the tree. Hence the problem is equivalent to the minimum Steiner arborescence problem with constant edge weights, which is NP-hard. For non-constant background models the objective is a trade-off between the IC and the DL of a tree. In most cases, we are looking for small trees with highly informative edges.

There are a number of algorithms that provide good approximation bounds for the directed Steiner problem [2, 7, 11], and this problem has also been studied recently in the data mining community, e.g., [1, 8]. However, Problem 1 is equivalent to the Steiner problem only in the case of a uniform background distribution, i.e., when the IC of the edges is constant and hence irrelevant. Moreover, we aim to solve a maximization problem, while Steiner tree problems aim to minimize the cost of the tree. For this reason we propose fast heuristics for large graphs that perform well on different kinds of background distributions.

A Python implementation of the algorithms and the experiments is available at http://www.interesting-patterns.net/forsied/sict/.

4.1 Proposed Heuristics

Our proposed methods all work in a similar way. We apply a preprocessing step, resulting in a set of candidate roots. Given a candidate root r, we build the tree by iteratively adding edges (parents) to the frontier, initialized as \(Q{\setminus }\{r\}\), until the frontier is empty. We exhaustively search over all candidate roots and select the best resulting tree. The heuristics differ in the way they select allowable edges. The outline of SteinerBestEdge is given in Algorithm 1.

Preprocessing. All of the proposed heuristics have two common preprocessing steps. First we find the common roots of the nodes in Q up to a certain level k, meaning we look for nodes r, s.t. \(\forall q \in Q: \text {SPL}(q, r) \le k\), with SPL\((\cdot )\) denoting the shortest path length. This can be done using a BFS expansion on the nodes in Q until the threshold level k is reached. Note that query nodes are also potential candidates for being the root, if they satisfy the above requirement.
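A minimal sketch of this first step, with the graph as a plain adjacency dict (direction convention: adj[u] lists the nodes u links to, so that \(\text {SPL}(q, r)\) is measured along edge direction; the function names are our own):

```python
from collections import deque

def nodes_within_k(adj, source, k):
    """All nodes r with SPL(source, r) <= k, found by BFS from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond level k
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def candidate_roots(adj, query_nodes, k):
    """Common roots: nodes r with SPL(q, r) <= k for every q in Q."""
    reachable = [nodes_within_k(adj, q, k) for q in query_nodes]
    return set.intersection(*reachable)
```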

Secondly, for each r we create a subgraph \(H \subset G\), consisting of all simple paths \(q\rightsquigarrow r\) with \(\text {SPL}(q,r)\le k\), for all \(q \in Q\). This can be done using a modified DFS. The number of simple paths can be large. However, we can prune the search space by only visiting nodes that we encountered in the BFS expansion, making the construction of H quite efficient for small k.
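A sketch of this second step under the same conventions, restricting the DFS to the nodes seen during the BFS expansion (the 'allowed' set):

```python
def simple_paths_to_root(adj, q, r, k, allowed):
    """Yield all simple paths q ~> r with at most k edges, visiting only
    nodes in 'allowed' (those encountered in the BFS expansion)."""
    stack = [(q, [q])]
    while stack:
        node, path = stack.pop()
        if node == r:
            yield path
            continue
        if len(path) > k:  # path already uses k edges; cannot be extended
            continue
        for nxt in adj.get(node, ()):
            if nxt in allowed and nxt not in path:
                stack.append((nxt, path + [nxt]))

def build_subgraph(adj, Q, r, k, allowed):
    """H: the union of the edges on all simple paths q ~> r, q in Q."""
    H = {}
    for q in Q:
        for path in simple_paths_to_root(adj, q, r, k, allowed):
            for u, v in zip(path, path[1:]):
                H.setdefault(u, set()).add(v)
    return H
```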

SteinerBestEdge. Given the subgraph H, we construct the arborescence working from the query nodes up to the root. We initialize the frontier as \(Q \setminus \{r\}\), and iteratively add the best feasible edge to a partial solution, denoted as Steiner, according to a greedy criterion. The greedy criterion is based on the ratio of the IC of that edge to the DL of the partial Steiner that would result from adding that edge. This heuristic prefers to pick edges from a parent node that is already in Steiner, yielding a more compressed tree and thus a smaller DL.

Algorithm 2 checks if an edge is feasible by propagating its potential influence to all the other nodes in H. The check can fail in two ways. First, the addition of an edge could yield a Steiner tree with height \(> k\), see Fig. 3 for an example. Secondly, the addition of an edge may lead to cycles in Steiner. Cycles are avoided by only considering edges (s, t) that do not potentially change \(\text {SPL}(t, r)\). If \(\text {SPL}(t, r)\) would change, the shortest path from s to r, given the current Steiner, is not along the edge (s, t), and hence for all \(f \in frontier\) we always have at least one feasible edge to pick (i.e., an edge that is part of a shortest path \(f \rightsquigarrow r\)). One way to select the best feasible edge is to first sort the edges according to the greedy criterion, then try the check from Algorithm 2 on this sorted list (starting with the best edge(s)) until the first success, and add the resulting edge to Steiner. Algorithm 2 will also return an updated shortest path function NewSP, containing all the changes in \(\text {SPL}(n, r)\) for \(n \in H\) due to the addition of that edge to Steiner. After performing the necessary updates on the SP function, and the frontier, parents and level sets, we continue to iterate until the frontier is empty.

Algorithm 1. SteinerBestEdge.
Algorithm 2. Edge feasibility check.
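To convey the flavor of the greedy loop, here is a much-simplified sketch of SteinerBestEdge. It keeps only a basic cycle check and omits the look-ahead feasibility check of Algorithm 2, so unlike the real algorithm it does not enforce the height constraint k; it also assumes H always offers some feasible edge (cf. Theorem 1). H_rev, edge_ic and tree_dl are assumed inputs: the reversed adjacency of H, per-edge ICs, and a DL function of the node count (e.g. Eq. (2) with |Q| and |V| held fixed):

```python
def creates_cycle(parent, child, parent_of):
    """True if child is already an ancestor of parent in the partial forest."""
    node = parent
    while node is not None:
        if node == child:
            return True
        node = parent_of.get(node)
    return False

def steiner_best_edge_sketch(H_rev, Q, r, edge_ic, tree_dl):
    """Simplified greedy loop: repeatedly attach the frontier edge with the
    best ratio of edge IC to the DL of the resulting partial solution."""
    parent_of = {}                   # child -> assigned parent
    nodes = set(Q) | {r}
    frontier = set(Q) - {r}
    while frontier:
        best = None
        for child in frontier:
            for parent in H_rev.get(child, ()):
                if creates_cycle(parent, child, parent_of):
                    continue
                score = edge_ic[(parent, child)] / tree_dl(len(nodes | {parent}))
                if best is None or score > best[0]:
                    best = (score, parent, child)
        _, parent, child = best      # assumes a feasible edge always exists
        parent_of[child] = parent
        nodes.add(parent)
        frontier.discard(child)
        if parent != r and parent not in parent_of:
            frontier.add(parent)     # parent still needs a connection towards r
    return [(p, c) for c, p in parent_of.items()]
```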
Fig. 3.

Example of why look-ahead is needed to ensure the returned tree has depth at most k. If \(k=2\), the only valid tree is \((Q_1,R)(Q_2,Q_1)\). Initially, the frontier is \(\{Q_1,Q_2\}\) and X is a candidate parent for \(Q_1\) because there is a path \(Q_1 \rightsquigarrow R\) of length at most 2. Yet, adding the dashed edge violates the shortest path constraint for \(Q_2\).

Fig. 4.

Example of why look-ahead is needed for sets of edges. For \(k=3\), neither of the two dashed edges violates the depth constraint on its own (each is part of a valid tree), but together they indirectly violate the shortest path constraint for \(Q_1\). Regardless of which parent is chosen for A, the path from \(Q_1\) to R has length 4.

SteinerBestIC. Instead of adding one edge at a time, this heuristic adds multiple edges at once. We look for the parent node that (potentially) adds the largest total information content of allowable edges to the current Steiner. However, given such a parent node, it is not always possible to add multiple edges, see Fig. 4. Instead we sort the edges coming from such a parent node according to their IC, and iteratively try to add the next best edge to Steiner.

SteinerBestIR. A natural extension of SteinerBestIC is to also take into account the DL of the partial Steiner solution, as we did in SteinerBestEdge. SteinerBestIR favors parent nodes that are already in Steiner, steering towards an even more compressed tree.

SteinerBestEdgeBestIR. Our last method simply picks the single best edge coming from the best parent, where the best parent is determined by the same criteria as in SteinerBestIR. In general this will pick a locally less optimal edge than SteinerBestEdge, but it will pick edges from a parent node that can potentially add much information content to the current Steiner solution.

Correctness of the solutions. The following theorem states that all four heuristics indeed result in a tree with height \(\le k\).

Theorem 1

Given a non-empty query set Q, a candidate root r and a height \(k \ge 1\), all four heuristics will return a tree with height \(\le k\).

Proof

We call a partial forest solution Steiner valid if for all leaf nodes \(l \in Steiner: \text {SPL}(l,r|Steiner) \le k\), where \(\text {SPL}(\cdot |Steiner)\) denotes a shortest path length given the partial Steiner solution. Note that the initial Steiner is valid, due to the way the subgraph H was constructed. It is always possible to go from one valid Steiner solution to another valid one, by selecting an edge (incident to a frontier node) that lies along a shortest path, given the current Steiner, from r to that frontier node. This results in an unchanged SPL for all other nodes (in particular the leaf nodes), and hence the solution remains a valid Steiner. If there are n frontier nodes, there are at least n such valid edges to pick from; hence all of the heuristics always have at least one valid edge available. The process of adding edges is finite, and will eventually result in an arborescence rooted at r with height \(\le k\).   \(\square \)

5 Experiments

In this section we empirically evaluate our proposed methods on real data. All experiments are based on the ACM-Citation-network v8, a scientific paper citation network. This (directed) network contains 2,381,688 papers and 10,476,564 citations. The oldest paper is a seminal paper of C.E. Shannon from 1938. The most recent papers are from 2016. We will use the acronyms s-E, s-IC, s-IR and s-EIR for SteinerBestEdge, SteinerBestIC, SteinerBestIR and SteinerBestEdgeBestIR, respectively. First, we evaluate and compare the performance of the heuristics.

Fig. 5.

The interestingness of the heuristics (relative to the optimal interestingness) versus query size. We also compare with the average interestingness over all trees. Note the decrease in performance of s-E for larger query sizes.

5.1 Comparing the Heuristics

To compare the performance of the heuristics we set up an experiment similar to [1]. We fitted the background model with prior beliefs on the degrees of the network. To generate a set of n query nodes we used a snowball-like sampling scheme, sketched below. We randomly selected an initial node in the graph. Then, we explore \(n'<n\) of its neighbors, each selected with probability s. For each of these nodes we continue to test \(n'\) of its neighbors, until we have n selected nodes. From this query set we randomly select a valid common root within a maximum distance k. To have a baseline, we find the arborescence with maximal SI using exhaustive search. To keep this comparison feasible, we only consider cases where the number of trees is <200,000. For query sizes 3 and 5 we generated 1000 query sets; for query size 7 we generated 250 sets. In all cases k was limited to 3, the beamwidth \(n'\) was set to 2, and the sampling rate ranged over \(s \in \{0.1, \ldots , 0.9\}\).
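A sketch of the snowball-like sampling described above (expansion order, tie-breaking and termination details are our own assumptions; the scheme may return fewer than n nodes if a neighborhood is exhausted):

```python
import random

def snowball_query(adj, n, n_prime, s, seed=0):
    """Snowball-like sampling of a query set of n nodes (Sect. 5.1):
    starting from a random node, test up to n_prime neighbors of each
    selected node, keeping each with probability s."""
    rng = random.Random(seed)
    selected = [rng.choice(list(adj))]
    to_expand = list(selected)
    while len(selected) < n and to_expand:
        node = to_expand.pop(0)
        neighbors = [v for v in adj.get(node, ()) if v not in selected]
        for v in rng.sample(neighbors, min(n_prime, len(neighbors))):
            if len(selected) < n and rng.random() < s:
                selected.append(v)
                to_expand.append(v)
    return selected
```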

Figure 5 shows a boxplot of the interestingness scores of the tree-building heuristics (relative to the optimal arborescence interestingness) versus query size. All four heuristics are clearly better strategies than randomly selecting an arborescence (the Avg. case). s-IR outperforms s-IC in all cases, which makes sense because s-IC has no regard for the DL of the tree. s-E performs comparatively worse for larger query sizes, and s-IR seems to be the best option for larger query sizes. This result is not definite; it could be due to the fact that we fixed the height at \(k =3\). However, while not reported here, we observed that s-IR is also the best option for larger query sizes and larger k. We also tested the effect of the sampling rate (not shown), which appeared to affect neither the SI of the resulting trees nor the ranking of the algorithms.

Figure 6 shows the average run time of our methods. The run times of the heuristics are all negligible compared to the time needed to find all simple paths from the root to the query nodes (see Sect. 4), which in all cases takes up more than 90% of the total running time. We conclude that with prior beliefs on the individual node degrees, s-IR seems to be the best option for larger queries, while for smaller queries s-E seems to give a more interesting tree.

Fig. 6.

Average run time of the heuristics. The main bottleneck is finding all the simple paths, which is included here. Hence, the run time differences are small.

5.2 The Effect of Different Prior Beliefs, and a Subjective Evaluation

Here we evaluate the outcome of our heuristics w.r.t. three different kinds of prior beliefs on the ACM citation network. The first prior belief is on the overall graph density. In this case, every edge has the same probability in the background model and hence the same information content. The optimal arborescence then is the smallest Steiner arborescence. The second set of prior beliefs is on the individual degree of each node. Since, for a citation network, the number of citations a paper has is easier to estimate than the number of references of that paper (without reading it), we only constrained the expected in-degree of each node. As a result, edges to highly cited nodes are more probable, and a tree will be more interesting if it is not only small, but also has a preference for less frequently cited papers. The final type of prior belief is on both the individual in-degree of each node and the dependency of citation probabilities on the difference in publication date.

In Sect. 3.2 we showed how to formalize prior beliefs on diagonal block sums. Here it is natural to group papers together according to their publication year (or per 2 years, 5 years\(,\ldots \)). In this way, it is possible to incorporate prior beliefs such as: “The number of papers from year X citing a paper from year \(X-3\) is high”. In general, an edge will have a high probability if the corresponding expected block diagonal sum is high, see Eqs. (3) and (6). Note that the citation network should (in theory) be a directed acyclic graph, since no paper can cite a paper with a higher publication year. Yet the data contains 66,772 (<1%) violating edges, which our method handles gracefully.

Common authors as external validation. In many scientific fields, self-citation is common practice. We expect the trees to reflect this to differing degrees, depending on the prior beliefs taken into account. To test this, we set up an experiment similar to that in Sect. 5.1. The queries are generated in the same way, but with a preference for queries that have some authors in common: if a paper has an author in common with the current query set, it is automatically chosen instead of being sampled with probability s. We generated 200 random queries for each query size in \(\{5, 7, 9\}\), with maximum height \(k=4\). For each query, we look at the tree generated by s-IR, computed for 4 different types of prior beliefs. Our measure is the total number of common authors per edge in the tree.

Table 1. Average number of common authors per edge in the tree from algorithm s-IR for different types of prior beliefs and query sizes. p-values for the Wilcoxon signed-rank test (pairwise comparison) of each type of prior with the prior on individual degrees, shown between brackets. The second column lists the time to fit the background model on the full data.

Table 1 shows the results for 4 types of prior beliefs. There is a substantial difference between the first and second prior. This makes sense because with a constant background model, s-IR is indifferent to the number of citations of papers. With the second prior, s-IR prefers nodes with fewer citations, penalizing highly cited papers. This means we also favor self-citations a bit more, since references to widely cited seminal papers are unlikely to have authors in common with the papers encountered in our experiment. Secondly, the self-citation rate differs between the priors on the time relations (priors 3 and 4) and prior 2. Most people stop publishing after their PhD, but during that time they will have some references to their own papers. Hence with a background model of type 3 or 4, s-IR will prefer citations between papers with a large difference in publication year, making self-citations less common.

Subjective evaluation. We queried three recent KDD best paper award winners that were present in the network; see Figs. 1, 7 and 8 for results under different prior beliefs. We used \(k=3\), resulting in 33 candidate roots. Notice the number of citations and the publication year of the root in each of the resulting trees, confirming our expectations of the influence of the prior beliefs on the SI of trees.

Fig. 7.

Like Fig. 1, but with the degree of each node as prior knowledge.

Fig. 8.

Like Fig. 1, but with degrees and time constraints as prior knowledge.

6 Discussion and Related Work

We studied the problem of finding interesting trees that connect a user-provided set of query nodes in a large network. This is useful, for example, to find papers that (indirectly) influenced a set of query papers based on citation data, to understand the structure of an organization from communication records, and in many other settings. We defined the problem of finding such trees as an optimization problem that seeks an optimal balance between the informativeness (the Information Content) and the conciseness (the Description Length) of a tree. Additionally, by encoding the prior beliefs of a user, we showed how to find results that are surprising and interesting to a specific user.

We introduced a general algorithmic strategy to construct such trees, along with four heuristics of varying complexity. We also introduced a tractable model to include prior knowledge about the density of sub-networks, and more specifically for the case where the nodes appear in time blocks and the probability of edges is expected to be a function of time. Finally, we evaluated the interestingness of the results in several experiments, both subjectively and using external criteria, and we empirically compared the quality and computational efficiency of the four heuristics.

The computational problem solved in this paper is related to the problem of constructing a minimal Steiner arborescence (aka directed Steiner tree). There is a long development of approximation algorithms, e.g., [2, 7, 11]. Faster special-purpose approximations have also been studied in the data mining community, e.g., for temporal networks [8]. The most related algorithmic results are those of Akoglu et al. [1], who study the problem of finding a good partitioning and connection structure within each part on undirected graphs for a given set of query nodes. Although their purpose is to explore an undirected graph, they map the problem to graph partitioning plus finding Steiner arborescences.

It should be noted that Problem 1 is not equivalent to the Steiner arborescence problem, because in general the subjective interestingness of a tree does not factorize as a sum over the edges. Hence, we do not expect any existing algorithm to solve this problem well.

We are currently working on applications in biology as well as social media.