1 Introduction

Networks are a canonical method to model complex multivariate interactions and have proven useful in the study of a variety of problems, from modeling social interaction and human mobility to predict epidemic spreading [6, 16] to modeling biochemical networks to predict the onset of diseases [8, 12]. This modeling approach allows for a shift from the traditional scientific focus on the (reductionist) study of things (e.g., animals or proteins), to the study of system-wide interactions among these things, such as friendships among animals, or bonding among proteins. In network science, typically, these multivariate interactions are represented as edges that connect variables as nodes in a graph. In addition, networks built to represent real-world complex systems often denote variable interaction with a weight that is proportional to the strength of interaction between nodes, such as a proximity (similarity) or a distance (dissimilarity). For instance, edge weights can represent the probability of interaction between genes [8], similarity between concepts in a knowledge space [10], or a measure of how much time two individuals spent together in close vicinity [9]. In their simplest form, edge weights are non-directed, meaning interactions between nodes are symmetric. This is especially the case when distance and shortest paths between nodes are relevant for analysis (e.g., inferring the likelihood that a person infects another in a population under epidemic spread), because distance measures are by default symmetric (in addition to being non-negative and anti-reflexive [26]).

Redundancy is considered a fundamental aspect in the evolution of complex systems [7]. Distinct aspects of the phenomenon have been shown to greatly contribute to our understanding of network dynamics, controllability, and robustness [13, 14, 24]. In particular, we have shown that most networks where edges represent distance (or dissimilarity) contain large amounts of topological redundancy in computing shortest paths, which can be identified through our algebraically-principled and parameter-free distance backbone [24]. Our method thus differs from other backbones by requiring no tuning parameters, null model comparisons, or Monte Carlo approximations. However, even though distance is typically considered to be symmetric [11], many real-world complex systems are best modeled by directed, weighted graphs. Indeed, asymmetric interactions have been shown to be important in a variety of domains, ranging from unreciprocated friendships [2], food-webs and host-parasite ecological networks [15], to designing smarter urban traffic and cities [1, 22].

Here, our main contribution is the extension of the distance backbone methodology to directed weighted graphs. Specifically, we build upon the concepts of transitive and distance closure for undirected weighted graphs [26] to identify a subgraph whose edges do not break a generalized triangle inequality and which are sufficient to compute all shortest directed paths. In other words, we obtain a directed distance backbone that preserves the distribution of shortest paths in directed weighted graphs. This in turn allows us to quantify both the structural redundancy of such networks and their robustness to random attacks. Preliminary results on real-world examples also show that directed graphs yield larger distance backbones than their undirected counterparts.

2 Closures in Complex Networks

In social networks, indirect associations are often exemplified as “the friend of my friend is also my friend”. These indirect associations can be described in a graph G(X), defined on the set of nodes X, in terms of the transitive and distance closures. Transitive closures assume edge weights to measure a similarity, while distance closures assume weights to be a dissimilarity between nodes [26]. The formalism for closures in weighted undirected networks was introduced in Simas et al. [24]. We review this mathematical construction in this section and, in Sect. 3, we relax the symmetry condition previously considered, showing that the formalism of closures in complex networks is applicable to both undirected and directed networks.

2.1 Transitive Closure

The strength of interactions between the nodes \(x_i \in X\) can be measured by a proximity graph, P(X). This is a reflexive network with edge weights \(p_{ij} \in [0, 1]\), a continuous range of values, with \(p_{ii} = 1\). Transitivity is computed via the composition of generalized, weighted logical operators. These are extensions of the binary logic operators, derived from probabilistic metric spaces and fuzzy logic, and are called triangular norms and conorms [17, 24, 26].

A triangular norm (t-norm) is a generalized logical conjunction given by the operation \(\wedge {:}\, [0, 1] \times [0,1] \rightarrow [0, 1]\). It satisfies the properties of commutativity (\(p\wedge q = q\wedge p\)), associativity (\(p\wedge (q\wedge w) = (p\wedge q)\wedge w\)), monotonicity (\(p \le w\) and \(q\le v\) imply \(p\wedge q \le w\wedge v\)), and has 1 as its identity element (\(p\wedge 1 = p\)). Similarly, a triangular conorm (t-conorm) is a generalized logical disjunction given by the operation \(\vee {:}\, [0, 1] \times [0,1] \rightarrow [0, 1]\). It is also commutative, associative, and monotonic, but has 0 as its identity element (\(p\vee 0 = p\)). Combining them gives us the compositions of P with itself as

$$\begin{aligned} P^{\eta } = P\circ P^{\eta -1} \iff p_{ij}^{(\eta )} = \underset{k}{\vee }\ \left( p_{ik} \wedge p_{kj}^{(\eta -1)}\right) , \end{aligned}$$
(1)

considering \(\eta \in \mathbb {Z}\), \(\eta \ge 2\), and \(P^1 = P\). This leads to the transitive closure of P(X) given by

$$\begin{aligned} P^T(X) = \displaystyle \bigcup _{\eta =1}^{\kappa } P^{\eta } \iff p_{ij}^T = p_{ij} \vee p_{ij}^{(2)} \vee \cdots \vee p_{ij}^{(\kappa -1)}\vee p_{ij}^{(\kappa )}. \end{aligned}$$
(2)

For general t-norms and t-conorms the closure is reached as \(\kappa \rightarrow \infty \). But with proximity graphs, as long as \(\wedge \equiv \min \), the closure \(P^T(X)\) converges for a finite \(\kappa \) no larger than the graph diameter [17, 26]. The adjacency matrix \(P^\eta (X)\) measures the proximity for paths of size \(\eta \), while the transitive closure \(P^T(X)\) accounts for the strongest proximity for paths up to size \(\kappa \).
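As a minimal sketch of Eqs. (1) and (2), the composition and closure can be written in plain Python. The function names `compose` and `transitive_closure` are illustrative and not part of the original work, and the defaults assume \(\wedge \equiv \min \) and \(\vee \equiv \max \); the early stop in the loop relies on that max-based union.

```python
def compose(P, Q, t_norm=min, t_conorm=max):
    """Generalized composition: (P o Q)_ij = t_conorm over k of t_norm(P_ik, Q_kj)."""
    n = len(P)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0  # 0 is the identity element of a t-conorm
            for k in range(n):
                acc = t_conorm(acc, t_norm(P[i][k], Q[k][j]))
            out[i][j] = acc
    return out


def transitive_closure(P, t_norm=min, t_conorm=max):
    """Union of the powers P^1, P^2, ... (Eq. 2), stopping once nothing changes."""
    n = len(P)
    closure = [row[:] for row in P]
    power = [row[:] for row in P]
    for _ in range(n):  # hard cap on the number of compositions
        power = compose(P, power, t_norm, t_conorm)
        updated = [
            [t_conorm(closure[i][j], power[i][j]) for j in range(n)]
            for i in range(n)
        ]
        if updated == closure:
            break
        closure = updated
    return closure


# Undirected toy proximity graph: nodes 0 and 2 share no direct edge.
P = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
PT = transitive_closure(P)
# The indirect path 0-1-2 gives nodes 0 and 2 proximity min(0.8, 0.5) = 0.5.
```

For a directed example such as `[[1.0, 0.9], [0.0, 1.0]]`, the entry from node 1 to node 0 stays at 0 in the closure, because node 0 has only outward connections.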

We say that a proximity graph is transitive with respect to the algebraic structure \(([0, 1], \vee , \wedge )\) if for every weighted edge \(p_{ij}\) in the graph we have:

$$\begin{aligned} p_{ij} \ge \underset{k}{\vee }(p_{ik} \wedge p_{kj}) \end{aligned}$$
(3)

for any node \(x_k \in X\). By construction, all edges of \(P^T(X)\) obey this generalized transitivity constraint, while only a subset of edges of P(X) typically do. In the context of the generalized transitivity criterion given by Eq. 3, fully transitive graphs denote a similarity multivariate relation, whereas graphs that break transitivity for at least one edge denote a proximity relation [17].

For connected, undirected graphs, this leads to a closure where \(p^T_{ij}>0\) for all \(x_i\) and \(x_j\) in X, i.e. a complete or fully connected graph. Unfortunately, this does not generalize for directed graphs, where there can be nodes that only have outwards connections, and therefore can never be reached from other nodes.

2.2 Distance Closure

In network science, we often need to compute shortest paths on graphs to infer the (direct and indirect) influence of variables on one another. This requires casting the network as a distance (or dissimilarity) graph, D(X), on the set of node variables X. These graphs have non-negative weights, i.e. adjacency matrix elements \(d_{ij} \in [0, \infty )\), and are anti-reflexive: \(d_{ii} = 0\). They are also isomorphic to proximity graphs [26] via a strictly monotonic decreasing map \(\varphi {:}\, [0, 1] \rightarrow [0, \infty )\) constrained by:

$$\begin{aligned} \underset{k}{f}\{ g( \varphi (p_{ik}), \varphi (p_{kj}) ) \} = \varphi (\underset{k}{\vee }(p_{ik} \wedge p_{kj})) \quad \forall x_i, x_j, x_k \in X, \end{aligned}$$
(4)

where f and g are operations isomorphic to \(\vee \) and \(\wedge \), respectively, in the sense that they are associative, commutative, and monotonic, with identity elements given by \(\varphi (0) \rightarrow \infty \) for f and \(\varphi (1) = 0\) for g. Due to this construction, g and f are named triangular distance norm (td-norm) and conorm (td-conorm), respectively [26].

Though an infinite number of maps satisfy the isomorphism, the simplest, which we use here unless otherwise noted, is the familiar distance function:

$$\begin{aligned} d_{ij} = \varphi (p_{ij}) = \frac{1}{p_{ij}} -1, \end{aligned}$$
(5)

which converts between proximity P(X) and distance D(X) graphs. In addition to being non-negative and anti-reflexive, distance measures are typically symmetric and, if transitive, are also known as metric [11].
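The map of Eq. (5) and its inverse can be sketched in a few lines of Python. The names `phi` and `phi_inverse` are illustrative; mapping \(p = 0\) to an infinite distance is an assumption that matches \(\varphi (0) \rightarrow \infty \). The last lines check Eq. (4) numerically for one pair of nodes with two intermediate nodes, assuming \(\wedge \equiv \min \), \(\vee \equiv \max \), \(g \equiv \max \), and \(f \equiv \min \).

```python
import math


def phi(p):
    """Eq. (5): d = 1/p - 1, with p = 0 mapped to an infinite distance."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proximity must lie in [0, 1]")
    return math.inf if p == 0.0 else 1.0 / p - 1.0


def phi_inverse(d):
    """Inverse map: p = 1 / (d + 1), with an infinite distance mapped to p = 0."""
    if d < 0.0:
        raise ValueError("distance must be non-negative")
    return 0.0 if math.isinf(d) else 1.0 / (d + 1.0)


# Two indirect paths i-k-j, one per intermediate node k.
p_ik = [0.8, 0.4]
p_kj = [0.5, 0.25]

# Left side of Eq. (4): f = min over k of g = max of the mapped weights.
lhs = min(max(phi(a), phi(b)) for a, b in zip(p_ik, p_kj))
# Right side of Eq. (4): map the max-min composition of the proximities.
rhs = phi(max(min(a, b) for a, b in zip(p_ik, p_kj)))
```

Because \(\varphi \) is strictly decreasing, the strongest proximity becomes the shortest distance, which is why `max` and `min` swap roles between the two sides.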

Equation (4) allows us to study transitivity of distance graphs by establishing an isomorphism with transitive closures of proximity graphs. Thus, the distance closure \(D^T(X)\) is obtained via compositions of f and g:

$$\begin{aligned} d_{ij}^{(\eta )} = \underset{k}{f}\ g\left( d_{ik}, d_{kj}^{(\eta -1)} \right) \quad \text {and} \quad d_{ij}^{T} = f \left( d_{ij}, d_{ij}^{(2)}, \ldots , d_{ij}^{(\kappa -1)}, d_{ij}^{(\kappa )}\right) , \end{aligned}$$
(6)

where, because of the isomorphism, \(\kappa \) is the same as for the transitive closure (Eq. 2). The adjacency matrix \(D^\eta (X)\) measures the shortest distance for paths including \(\eta \) connections, while the distance closure \(D^T(X)\) accounts for the shortest path length up to \(\kappa \) links. For distance graphs, the transitivity criterion is defined by each algebraic structure \(([0, \infty ), f, g)\):

$$\begin{aligned} d_{ij} \le \underset{k}{f}\ g( d_{ik}, d_{kj} ) \quad \forall x_i, x_j, x_k \in X. \end{aligned}$$
(7)

The distance closure \(D^T(X)\) is transitive by construction, but generally only a subset of the edges of D(X) obeys Eq. (7).

2.3 Shortest-Path, Metric and Ultrametric Closures

The general transitive and distance closures of Sects. 2.1 and 2.2 yield a number of well-known cases used in network science [24, 26]. When \(f \equiv \min \) (or \(\vee \equiv \max \) in proximity graphs), we have the large class of shortest-path closures, \(D^{T, g}(X)\), for any distance function g (or \(\wedge \) in proximity graphs), as the closure selects the minimum path with length given by g. This leads to a generalized triangle inequality [24] as a transitivity criterion:

$$\begin{aligned} d_{ij} \le g( d_{ik}, d_{kj} ) \quad \forall x_i, x_j, x_k \in X. \end{aligned}$$
(8)

For instance, when \(g \equiv +\), we obtain the familiar metric closure, \(D^{T, m}(X)\), where the length of a path is obtained by summing the distance edge weights. Similarly, when \(g \equiv \max \), we instead obtain the ultrametric closure, \(D^{T, u}(X)\), where the length of a path is given by the maximum distance weight in the path (the weakest link).
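As a worked illustration (not part of the original derivation), applying the inverse of the map in Eq. (5), \(\varphi ^{-1}(d) = 1/(d+1)\), to a two-edge path with proximities \(p_{ik}, p_{kj} > 0\) shows which conjunction each of these two path length measures corresponds to in the proximity graph:

$$\begin{aligned} g \equiv \max {:}\quad&\varphi ^{-1}\left( \max (d_{ik}, d_{kj})\right) = \min (p_{ik}, p_{kj}), \\ g \equiv + {:}\quad&\varphi ^{-1}\left( d_{ik} + d_{kj}\right) = \frac{1}{\frac{1}{p_{ik}} + \frac{1}{p_{kj}} - 1} = \frac{p_{ik}\, p_{kj}}{p_{ik} + p_{kj} - p_{ik}\, p_{kj}}. \end{aligned}$$

The first line follows because \(\varphi \) is strictly decreasing; the second from substituting \(d = 1/p - 1\) for each edge.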

Many other shortest-path distance closures (and thus different path length measures and transitivity criteria) can be usefully employed in network science [26]. Here we exemplify the approach with these two well-known cases because the metric closure is the most common way to compute shortest paths on weighted graphs, and the ultrametric closure is the lower bound of distance closures [24].
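The shortest-path closures above can be sketched with a Floyd-Warshall-style triple loop in which the path length measure g is a parameter. This is one possible way to compute them, chosen for brevity, and not necessarily the implementation used for the results reported here; the function names and the toy graph are hypothetical.

```python
import math

INF = math.inf


def shortest_path_closure(D, g):
    """Closure with f = min and path length measure g (Floyd-Warshall style)."""
    n = len(D)
    C = [row[:] for row in D]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = g(C[i][k], C[k][j])  # length of the path i -> k -> j
                if via_k < C[i][j]:
                    C[i][j] = via_k
    return C


def metric_length(a, b):
    """g = +: path length is the sum of edge weights."""
    return a + b


def ultrametric_length(a, b):
    """g = max: path length is the largest edge weight on the path."""
    return max(a, b)


# Toy directed distance graph with edges 0->1 (1), 0->2 (5), 1->2 (2), 2->0 (1).
# INF marks a missing edge; the diagonal is 0 (anti-reflexive).
D = [
    [0.0, 1.0, 5.0],
    [INF, 0.0, 2.0],
    [1.0, INF, 0.0],
]
D_metric = shortest_path_closure(D, metric_length)
D_ultra = shortest_path_closure(D, ultrametric_length)
```

In this toy graph the direct edge from node 0 to node 2 (weight 5) is replaced by the path through node 1 in both closures, with length 1 + 2 = 3 for the metric case and max(1, 2) = 2 for the ultrametric case.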

2.4 Distance Backbone Subgraph

The distance backbone \(B^g(X)\) of a distance graph D(X) is the invariant subgraph under a shortest-path distance closure \(D^{T, g}(X)\) with \(f \equiv \min \) and some g [24]. It is sufficient to compute all shortest paths in D(X) given a path length measure g. The distance backbone is invariant because its edges are the ones that obey the generalized triangle inequality (Eq. 8) and are thus called triangular edges. That is, the distance backbone is defined by edges that have the same weight in the shortest-path closure:

$$\begin{aligned} b^g_{ij} = {\left\{ \begin{array}{ll} d_{ij}, &{} \text {if } d_{ij} = d^{T,g}_{ij} \\ \infty , &{} \text {if } d_{ij} > d^{T,g}_{ij}\end{array}\right. }, \quad \forall x_i, x_j \in X, \end{aligned}$$
(9)

where \(d^{T,g}_{ij}\) are the adjacency matrix weights of the distance closure graph \(D^{T,g}(X)\). The edges that break the generalized triangle inequality are called semi-triangular and are not on the backbone, i.e. \(b^g_{ij} = \infty \). If (and only if) an edge between \(x_i\) and \(x_j\) is semi-triangular (i.e., not present on the backbone), there exists a shorter indirect path (i.e., which is present on the backbone) connecting them via some \(x_k\) [24].

The metric (\(g \equiv +\)) and ultrametric (\(g \equiv \max \)) backbones of distance graph D(X) are denoted by \(B^m(X)\) and \(B^u(X)\), respectively. Similarly, edges on these backbones are called metric and ultrametric, while those off are known as semi-metric and semi-ultrametric, respectively [24].
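A minimal sketch of Eq. (9), assuming the distance graph and its closure are given as dense matrices with infinity marking missing edges. The function name `distance_backbone` and the toy matrices are hypothetical; the closure shown was worked out by hand for this small example.

```python
import math

INF = math.inf


def distance_backbone(D, closure):
    """Eq. (9): keep d_ij where it equals the closure weight, else set it to infinity."""
    n = len(D)
    B = [[INF] * n for _ in range(n)]
    for i in range(n):
        B[i][i] = 0.0  # distance graphs are anti-reflexive
        for j in range(n):
            if i != j and D[i][j] < INF and D[i][j] == closure[i][j]:
                B[i][j] = D[i][j]
    return B


# Toy directed distance graph with edges 0->1 (1), 0->2 (5), 1->2 (2), 2->0 (1).
D = [
    [0.0, 1.0, 5.0],
    [INF, 0.0, 2.0],
    [1.0, INF, 0.0],
]
# Its metric closure (g = +), computed by hand for this example.
metric_closure = [
    [0.0, 1.0, 3.0],
    [3.0, 0.0, 2.0],
    [1.0, 2.0, 0.0],
]
B_m = distance_backbone(D, metric_closure)

# Semi-metric edges: present in D but absent from the metric backbone.
semi_metric = [
    (i, j)
    for i in range(3)
    for j in range(3)
    if i != j and D[i][j] < INF and B_m[i][j] == INF
]
```

Here the only semi-metric edge is the one from node 0 to node 2, since the indirect path through node 1 (length 3) is shorter than the direct weight of 5.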

3 Directed Distance Backbone

Here we extend the concept of distance backbone by relaxing the symmetry constraint of distance functions, thus considering distance graphs D(X) where, in general, \(d_{ij} \ne d_{ji}\), i.e. directed distance graphs. As summarized above, distance backbones exist when enforcing a generalized triangle inequality (Eq. 8) as a transitive closure criterion. This is the same as computing all shortest paths of D(X) using a measure of path length determined by g.

Computation of the all-pairs shortest path (APSP) problem for undirected weighted graphs with \(g \equiv +\) is straightforward using the Dijkstra algorithm [5] (though it can also be computed with the distance product directly via Eqs. (2) and (6) [26, 28]). Since all shortest-path distance closures are based on setting \(f \equiv \min \) in Eqs. (6) and (7), they can also be computed as an APSP problem by adjusting the chosen algorithm with a different path length measure for each g used, such as \(g \equiv \max \) for the ultrametric backbone [24].

We also know that the standard triangle inequality, Eq. (8) with \(g \equiv +\), is valid for directed distances [18]. This way, the APSP of directed distance graphs based on this transitivity criterion can also be computed via the Dijkstra algorithm [5] or the distance product [28]. Indeed, the methodology of closures in complex networks is found to be applicable to both undirected and directed weighted graphs. The latter is shown in the real-world examples of Sect. 4.
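The adjustment of Dijkstra's algorithm to a generic path length measure g on a directed graph can be sketched as follows. The adjacency format (a dictionary of out-neighbors), the function names, and the toy graph are assumptions made for illustration; the sketch also assumes g never shortens a path when an edge is appended, which holds for \(g \equiv +\) with non-negative weights and for \(g \equiv \max \).

```python
import heapq
import math


def dijkstra(adj, source, g):
    """Shortest directed path lengths from source, with path length measure g."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d_u, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        for v, w in adj.get(u, {}).items():  # only outgoing edges u -> v
            candidate = g(d_u, w)
            if candidate < dist.get(v, math.inf):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


def all_pairs(adj, nodes, g):
    """APSP as one single-source run per node."""
    return {s: dijkstra(adj, s, g) for s in nodes}


# Toy directed distance graph: a->b (1), a->c (5), b->c (2), c->a (1).
adj = {
    "a": {"b": 1.0, "c": 5.0},
    "b": {"c": 2.0},
    "c": {"a": 1.0},
}
metric = all_pairs(adj, ["a", "b", "c"], lambda x, y: x + y)
ultrametric = all_pairs(adj, ["a", "b", "c"], max)
```

The asymmetry is visible in the result: the metric distance from a to c is 3 (via b), while from c to a it is 1 (the direct edge).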

3.1 Redundancy and Robustness

The fraction of edges in the backbone

$$\begin{aligned} \tau ^g(X) = \frac{|{B^g(X)}|}{|{D(X)}|} = \frac{|{\{d_{ij}: d_{ij} = d_{ij}^{T, g} \}}|}{|{\{d_{ij}\}}|} \,\,\forall _{x_i, x_j \in X: i\ne j} \end{aligned}$$
(10)

measures the proportion of triangular (or topologically invariant) edges, while its complement \(\sigma ^g(X) = 1 - \tau ^g(X)\) quantifies the proportion of semi-triangular edges. The latter measures the structural redundancy of complex networks given a specific transitivity criterion (Eq. 8), that is, the fraction of edges that are redundant for shortest-path computation given the path length measure g chosen. Note that due to the introduction of directionality, \(\tau ^g\) must now be computed over all off-diagonal entries of the adjacency matrix, and not just over the upper or lower triangle as previously done for the undirected case [24].

If a network has a small backbone (small \(\tau ^g\)), most of its edges are semi-triangular and do not affect the shortest-path distribution. Random attacks would therefore most likely not hit the backbone itself, a robustness that can be inferred from the measure of topological redundancy \(\sigma ^g(X)\).
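A short sketch of Eq. (10) for directed graphs, counting over all off-diagonal entries rather than one triangle of the matrix. The function name `backbone_fraction` and the toy matrices are hypothetical, and the closure was worked out by hand for this example.

```python
import math

INF = math.inf


def backbone_fraction(D, closure):
    """Return (tau, sigma): the backbone fraction of Eq. (10) and its complement."""
    n = len(D)
    # Every ordered pair (i, j) with a finite weight is a directed edge.
    edges = [
        (i, j)
        for i in range(n)
        for j in range(n)
        if i != j and D[i][j] < INF
    ]
    if not edges:
        raise ValueError("the graph has no edges")
    on_backbone = sum(1 for i, j in edges if D[i][j] == closure[i][j])
    tau = on_backbone / len(edges)
    return tau, 1.0 - tau


# Toy directed distance graph with edges 0->1 (1), 0->2 (5), 1->2 (2), 2->0 (1).
D = [
    [0.0, 1.0, 5.0],
    [INF, 0.0, 2.0],
    [1.0, INF, 0.0],
]
# Its metric closure (g = +), computed by hand for this example.
metric_closure = [
    [0.0, 1.0, 3.0],
    [3.0, 0.0, 2.0],
    [1.0, 2.0, 0.0],
]
tau_m, sigma_m = backbone_fraction(D, metric_closure)
```

In this toy graph three of the four directed edges are metric, so a uniformly random edge removal hits the backbone with probability \(\tau ^m = 0.75\) and a redundant edge with probability \(\sigma ^m = 0.25\).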

4 Experimental Analysis

We now investigate the backbones of nine real-world networks pertaining to three distinct domains: biomedical, social, and man-made technological systems. Here we discuss in more detail the backbones of a giraffe social network [3], the U.S. airport transportation system [23, 24], and the bike-sharing system of the City of London [21]. Additional details for these and the remaining networks can be found in the accompanying digital supplemental material. Descriptive data for each directed weighted graph, and the sizes of their respective metric and ultrametric backbones, are shown in Table 1.

Table 1 Topological invariance of weighted directed graphs modeling real-world systems

4.1 Giraffe Socialization

Evidence suggests that giraffes have complex social structures, with females showing social preferences and adult giraffes forming friendships beyond mother-child interactions [3]. We analyze a network of social interactions among captive giraffes at the San Diego Zoo’s Wild Animal Park. The original observational study included 6 adult female Rothschild’s giraffes (Giraffa camelopardalis) housed in a single herd. In the study, they were observed 5 mornings a week for a total of 300 days, and the behavior of each subject was recorded for a 20-min focal sample in random order. Data on nearest neighbor and proximity (measured at 2 neck lengths) were collected at 1-min intervals for the focal subject. Affiliative social interactions involving the focal subject were recorded and included: approach, necking, head rub, bumping, social exam, muzzle, co-feed, and sentinel (details in [3]). In total, the study comprised 600 h of observation time and 2,748 affiliative interactions.

In the social network, directed edge weights represent the frequency with which giraffe \(x_i\) interacts with giraffe \(x_j\), as a measure of similarity \(p_{ij}\) (see Fig. 1a). This is a small, fully connected network with only 6 nodes and 30 directed edges (density \(\delta = 1.0\)). The metric backbone consists of 23 edges (\(\tau ^m=76.7\%\)) and the ultrametric backbone of only 9 edges (\(\tau ^u=30\%\)). Interestingly, the metric backbone completely removes the edge between giraffes Yanahmah and Chokolati, the two oldest giraffes in the herd. The metric backbone also keeps the mother-daughter relationships Yanahman-Ykeke and Chokolati-Chinde. In other words, and as previously noted for human contact networks [9], the backbone preserves the hierarchical structure of social networks.
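The reported values follow directly from the edge counts via Eq. (10), assuming the density of a directed graph with n nodes is its number of edges divided by \(n(n-1)\):

$$\begin{aligned} \delta = \frac{30}{6 \times 5} = 1.0, \qquad \tau ^m = \frac{23}{30} \approx 76.7\%, \qquad \tau ^u = \frac{9}{30} = 30\%. \end{aligned}$$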

Fig. 1 Giraffe socialization network in the San Diego Zoo [3]. (a) Directed distance graph; (b) metric backbone subgraph; and (c) ultrametric backbone subgraph. The original distance graph contains 30 edges, while the metric backbone contains 23 (\(76.7\%\)) and the ultrametric backbone only 9 (\(30\%\)) edges. Plotted with Gephi [4].

4.2 London Bike-Sharing Trips

The SARS-CoV-2 pandemic caused unprecedented shifts in urban mobility, with bike-share systems seeing a significant increase in demand in several major capitals [19, 27]. We analyze the City of London’s bike-sharing system, with data available through the Transport for London Open Data API and previously analyzed in Munoz-Mendez et al. [21]. The data contain records for each unique bicycle and its rental transactions, including timestamped information on the bike-sharing station where it was picked up and the one where it was returned, in a network of 770 stations throughout the city. A month’s worth of bike-sharing transactions is analyzed, from June to July 2014. Transactions that started or ended in a repair station, as well as stations with too few transactions, were discarded. This means we only included stations that accounted for 75% of all transactions (i.e., a minimum of 7 monthly trips per station), which in turn resulted in 726 bike-sharing stations and 948,339 bike-sharing transactions.

In this network a node represents a bike-sharing station, \(x_i\), and edges are weighted by the average trip duration between stations as a directed distance, \(d_{ij}\). This network has 725 nodes and 53,118 edges (density \(\delta =0.1\)). The metric and ultrametric backbones comprise \(\tau ^m = 59.53\%\) and \(\tau ^u = 2.75\%\), respectively, of the edges of the directed network. Along with the co-morbidity risk network, the bike-sharing network has one of the largest differences between the sizes of the metric and ultrametric backbones (\(\tau ^u / \tau ^m = 4.6\%\)). This means that a directed attack on the metric backbone will have a small impact on the ultrametric backbone and thus on the distribution of shortest paths [24]. In other words, the network of the bike-sharing system for the City of London is very robust to directed attacks, which here would correspond to the closure of bike-sharing stations or changes to the streets that cyclists use.
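As a quick check of the reported ratio from the two backbone sizes:

$$\begin{aligned} \frac{\tau ^u}{\tau ^m} = \frac{2.75}{59.53} \approx 0.046 = 4.6\%. \end{aligned}$$

That is, the ultrametric backbone holds fewer than 5% as many edges as the metric backbone.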

Fig. 2 Domestic nonstop segment of the U.S. airport transportation system [23]. (a) Undirected distance graph with its respective (b) metric, and (c) ultrametric backbone subgraphs [24]. (d) Directed distance graph with its respective (e) metric, and (f) ultrametric backbone subgraphs. The original directed (undirected) distance graph contains 18,906 (11,973) edges. Of those, \(27.59\%\) (\(16.14\%\)) are in the metric backbone, and \(18.99\%\) (\(8.98\%\)) in the ultrametric backbone. The difference in the number of edges between the undirected and directed representations comes from the fact that 5,040 (\(26.65\%\)) of all flights are only in one direction. Network plotted with Gephi [4].

4.3 U.S. Airport Transportation

This network is the domestic nonstop segment of the U.S. airport transportation system for the year 2006 (Fig. 2), retrieved from http://www.transtats.bts.gov. Each node is an airport, and edge weights are the normalized number of passengers traveling between two airport-nodes. This network was analyzed in Simas et al. [24] and is a reconstruction of the one used by Serrano et al. [23]. Unlike previous work, however, here we consider directionality in the flow of passengers, as 5,040 (approximately 27%) of all flights are only in one direction. In other words, flight routes may include stops in multiple airports from initial to final destination, and do not necessarily contain a direct return to the initial departure airport. Airports in American Samoa, Guam, the Northern Marianas, and the Trust Territories of the Pacific Islands have been removed from the analysis. This is a large but relatively sparse network with 1,075 nodes and 18,906 edges (density \(\delta = 1.64 \times 10^{-2}\)). The relative sizes of the metric and ultrametric backbones are \(\tau ^m = 27.59\%\) and \(\tau ^u = 18.99\%\), respectively (see Table 1).
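The density and the share of one-way connections can be recovered from the counts above, assuming the density of a directed graph with n nodes is its number of edges divided by \(n(n-1)\):

$$\begin{aligned} \delta = \frac{18{,}906}{1{,}075 \times 1{,}074} = \frac{18{,}906}{1{,}154{,}550} \approx 1.64 \times 10^{-2}, \qquad \frac{5{,}040}{18{,}906} \approx 0.267 \approx 27\%. \end{aligned}$$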

5 Discussion

Directionality and strength of interactions are relevant properties of real complex networks. The structure of such networks can be reduced in a principled manner, while preserving the entire distribution of shortest paths (for a given length measure g), with the computation of the distance backbone.

In the nine networks we analyzed, we found that the size of the metric backbone ranges from 27.59 to 99.6%, with three networks having metric backbones above 92% of the distance graph. Ultrametric backbones range from 2.17 to 99.5%, with two networks having ultrametric backbones above 95.8%. In contrast, for the undirected graphs studied in Simas et al. [24] the metric (ultrametric) backbones range from 1.75 to 83.59% (0.2–78.45%), which shows a substantial increase in the size of backbones due to directionality. A direct comparison can be made for the U.S. airports network. Its undirected representation has metric and ultrametric backbones of relative size \(\tau ^m = 16.14\%\) and \(\tau ^u = 8.98\%\), respectively [24]. Here, we found that the relative sizes of the metric and ultrametric backbones are \(\tau ^m = 27.59\%\) and \(\tau ^u = 18.99\%\), respectively (see Table 1). This increase is likely due to the fact that the closure for directed graphs does not lead to a complete graph, unlike what happens for connected undirected graphs. In other words, the many connections that exist in only one direction (approximately 27% in this case) can be necessary for shortest paths irrespective of their edge weight, which emphasizes the importance of directionality when studying real-world networks. The large difference between the sizes of backbones in directed and undirected graphs warrants future studies of the effect of directionality vis-à-vis various topological parameters.

The metric and ultrametric backbones of the networks we analyzed (Table 1) exemplify networks that are robust to random edge removal because they have small backbones (small \(\tau ^g\)), as is the case for the comorbidity risk and bicycle trips networks. On the other hand, the species-species interaction network and water pipes networks have a large \(\tau ^g\) and little redundancy. That is, the backbone is most of the network, suggesting that they mostly contain necessary interaction information, or were perhaps optimized to minimize the cost of implementing redundant edges, which makes them susceptible to random edge removal or failure. In the case of the water pipe network, little redundancy is expected because its distance weights represent an actual physical distance between nodes, which must conform to a naturally metric topology. Thus, it is an expected result that its metric backbone is almost the entire distance graph (\(99.6\%\)). This highlights the fact that semi-metric (and semi-triangular) behavior can only occur in high-dimensional spaces [24]. In contrast, the metric backbone of the passenger traffic between U.S. airports is only \(27.59\%\), making its shortest-path distribution very robust to random attacks, as the odds of randomly removing semi-metric edges are much higher than those of removing metric ones that contribute to the backbone. The precise impact on the shortest-path distribution for those networks requires the computation of edge distortion [24, 25] and is left for future work.

6 Conclusion

We introduced directionality to study shortest-path redundancy in weighted directed graphs via a novel directed distance backbone subgraph, the computation of which we showed to be feasible. This consideration brings improvement over other sparsification methods that consider only undirected networks [20, 24] or that treat incoming and outgoing edges independently [23]. We focused on the metric (where \(g \equiv +\)) and the ultrametric (where \(g \equiv \max \)) backbones, but the methodology is applicable to any length measure g, allowing other backbones to be considered in the future.

We applied the methodology to study redundancy of a variety of real-world weighted directed graphs modeling biomedical, social, and technological systems. The size of the metric (ultrametric) backbone ranges from 27 to 99% (2–99%), but is typically much smaller than the original distance graph. However, the directed backbones observed are larger than the undirected backbones previously reported, emphasizing the difference in shortest-path robustness for the two classes of graphs. The comparison using the same underlying U.S. airports network is particularly illuminating. We found that both the metric and the ultrametric backbones for the directed graph are larger than the ones for the undirected version, by \(71\%\) and \(112\%\), respectively. Thus, asymmetric airline seat capacity between cities (27% of all connections exist only in one direction) has a large impact on shortest paths between them. This exemplifies the importance of our contribution in the study of distance backbones for directed networks, which will lead to a study with additional networks in the future.
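The relative increases for the U.S. airports network can be approximated from the rounded backbone sizes reported above for the undirected (\(16.14\%\), \(8.98\%\)) and directed (\(27.59\%\), \(18.99\%\)) representations:

$$\begin{aligned} \frac{27.59}{16.14} - 1 \approx 0.71, \qquad \frac{18.99}{8.98} - 1 \approx 1.11. \end{aligned}$$

With these rounded inputs the ultrametric increase comes out at about 111%, close to the reported \(112\%\); the small gap is presumably due to rounding of the percentages.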

The methodology further allows us to infer the robustness of shortest-path distributions to random attack, via the relative size of the metric and ultrametric backbones. This can aid the design of more resilient social and technological systems or the identification of key evolutionary properties in biomedical systems. We are confident that the study of directed distance backbones can help the understanding and control of a variety of complex multivariate systems where both strength and directionality of interactions are key.