1 Introduction

This paper concerns classification problems in which each data point is a large network. In neuroscience, for instance, the brain can be represented by a structural connectome or a functional connectome, both of which are large graphs that model connections between brain regions. In ecology, an ecosystem is represented as a species interaction network. On such data, one may want to classify diseased versus healthy brains, or an ecosystem before and after an environmental shock. Existing approaches for graph classification can be divided broadly into three groups: (1) use of graph parameters such as edge density, degree distribution, or densities of motifs as features, (2) parametric models such as the stochastic k-block model [1], and (3) graph kernels [18] and graph embeddings [29]. Amongst these methods, motif counting is perhaps the least rigorously studied. Though intuitive, it is only feasible to compute for small motifs, and thus it is often seen as an ad hoc method with no quantitative performance guarantee.

1.1 Contributions

In this paper, we formalize the use of motifs to distinguish graphs using graphon theory, and give a tight, explicit quantitative bound for its performance in classification (cf. Theorem 1). Furthermore, we use well-known results from graph theory to relate the spectrum (eigenvalues) of the adjacency matrix one-to-one to cycle homomorphism densities, and give an analogous quantitative bound in terms of the spectrum (cf. Theorem 2). These results put motif counting on a firm theoretical footing, and justify the use of spectral graph kernels for counting a family of motifs. We apply our method to detect the autoimmune disease Lupus Erythematosus from diffusion tensor imaging (DTI) data, and obtain results competitive with previous approaches (cf. Sect. 4).

Another contribution of our paper is the first study of a general model for random weighted graphs, decorated graphons, in a machine learning context. The proof technique can be seen as a broad tool for tackling questions on generalisations of graphons. There are three key ingredients. The first is a generalization of the Counting Lemma [see 22, Theorem 10.24] from graphons to decorated graphons. It allows one to lower bound the cut metric by homomorphism densities of motifs, a key connection between motifs and graph limits. The second is Kantorovich duality [see 37, Theorem 5.10], which relates optimal couplings between measures to optimal transport over a class of functions, and which we use to relate spectra to homomorphism densities. Here, duality translates our problem into questions on function approximation, for which we use tools from approximation theory to obtain tight bounds. Finally, we use tools from concentration of measure to deal with sampling error and generalise known sample concentration bounds for graphons [see 6, Lemma 4.4].

Our method extends results for discrete edge weights to the continuous edge weight case. Graphs with continuous edge weights naturally arise in applications such as neuroscience, as demonstrated in our dataset. The current literature for methods on such graphs is limited [16, 26], as many graph algorithms rely on discrete labels [10, 34].

1.2 Related Literature

Graphons, an abbreviation of the words “graph” and “function”, are limits of large vertex-exchangeable graphs under the cut metric. For this reason, graphons and their generalizations are often used to model real-world networks [8, 12, 36]. They originally appeared in the literature on exchangeable random arrays [4] and were later rediscovered in graph limit theory and statistical physics [14, 22].

There is an extensive literature on the inference of graphons from one observation, i.e. one large but finite graph [3, 9, 21, 40]. This is distinct from our classification setup, where one observes multiple graphs drawn from several graphons. In our setting, the graphs might be of different sizes, and crucially, they are unlabelled: There is no a priori matching of the graph nodes. That is, if we think of the underlying graphon as an infinitely large random graph, then the graphs in our i.i.d. sample could be glimpses into entirely different neighborhoods of this graphon, and they are further corrupted by noise. A naïve approach would be to estimate one graphon for each graph, and either average over the graphs or over the graphons obtained. Unfortunately, our graphs and graphons are only defined up to relabelings of the nodes, and producing the optimal labels between a pair of graphs is NP-hard (it contains subgraph isomorphism). Thus, inference in our setting is not a mere “large sample” version of the graphon estimation problem, but an entirely different challenge.

A method closer to our setup is graph kernels for support-vector machines [18, 38]. The idea is to embed graphs in a high-dimensional Hilbert space and compute their inner products via a kernel function. This approach has successfully been used for graph classification [39]. Most kernels in use are transformations of homomorphism densities (motifs) into feature vectors for a class of graphs [cf. 41, Subsection 2.5]: [33] propose so-called graphlet counts as features. These can be interpreted as using induced homomorphism densities as features, which can be linearly related to homomorphism densities [22, (5.19)]. The random walk kernel from [18, p. 135 center] uses the homomorphism densities of all paths as features. Finally, [28, Prop. 5 and discussion thereafter] uses homomorphism densities of trees of height ≤ k as features.

However, as there are many motifs, this approach has the same problem as plain motif counting: In theory, performance bounds are difficult; in practice, one may need to make ad hoc choices. Due to the computational cost [18], in practice only small motifs of size up to 5 have been used for classification [33]. Other approaches choose a specific class of subgraphs, such as paths [18] or trees [34], for which homomorphism densities or linear combinations of them can be computed efficiently. In this light, our Theorem 2 is a theoretical argument for cycles, which can be counted efficiently via the graph spectrum.

1.3 Organization

We recall the essentials of graphon theory in Sect. 2. For an extensive reference, see [22]. Main results are in Sect. 3, followed by applications in Sect. 4. Our proofs can be found in the appendix.

2 Background

A graph G = (V, E) consists of a set of vertices V  and a set E of pairs of vertices, called edges. A label on a graph is a one-to-one embedding of its vertices into \(\mathbb {N}\). We say that a random labelled graph is vertex exchangeable if its distribution is invariant under relabelings.

A labelled graphon W is a symmetric function from [0, 1]2 to [0, 1]. A relabelling ϕ is an invertible, measure-preserving transformation of [0, 1]. An unlabelled graphon is a graphon up to relabeling. For simplicity, we write “a graphon W” to mean an unlabelled graphon equivalent to the labelled graphon W. Similarly, by a graph G we mean an unlabelled graph which, up to vertex permutation, equals the labelled graph G.

The cut metric between two graphons W, W′ is

$$\displaystyle \begin{aligned} \delta_\Box(W, W') = \inf_{\varphi, \phi}\ \sup_{S, T \subseteq [0,1]} \left\lvert \int_{S \times T} \left( W(\varphi(x), \varphi(y)) - W'(\phi(x), \phi(y)) \right) \mathrm d x \, \mathrm d y \right\rvert, \end{aligned}$$

where the infimum is taken over all relabelings φ of W and ϕ of W′, and the supremum is taken over all measurable subsets S and T of [0, 1]. That is, $\delta_\Box$ measures the largest discrepancy between the two graphons under the best possible relabeling. A major result of graphon theory is that the space of unlabelled graphons is compact and complete w.r.t. $\delta_\Box$. Furthermore, the limit of any sequence of finite graphs that converges in $\delta_\Box$ is a graphon [see 22, Theorem 11.21]. In this way, graphons are truly limits of large graphs.

A motif is an unlabelled graph. A graph homomorphism ϕ: F → G is a map from V (F) to V (G) that preserves edge adjacency, that is, if {u, v}∈ E(F), then {ϕ(u), ϕ(v)}∈ E(G). Often in applications, the count of a motif F in G is the number of different embeddings (subgraph isomorphisms) from F to G. However, homomorphisms have much nicer theoretical and computational properties [22, par. 2.1.2]. Thus, in our paper, “motif counting” means “computation of homomorphism densities”. The homomorphism density t(F, G) is the number of homomorphisms from F to G, divided by $|V(G)|^{|V(F)|}$, the number of mappings V (F) → V (G). Homomorphism densities extend naturally to graphons through integration with respect to the kernel W [22, subsec. 7.2]. That is, for a motif F with vertex set V (F) and edge set E(F),

$$\displaystyle \begin{aligned} t(F,W) = \int_{[0,1]^{V(F)}} \prod_{\{i,j\} \in E(F)} W(x_i,x_j)\, \prod_{i \in V(F)} \mathrm d x_i. \end{aligned}$$

The homomorphism density for a weighted graph G on k nodes is defined by viewing G as a step-function graphon, with each vertex of G identified with an interval of Lebesgue measure 1∕k. For a graph G and a graphon W, write t(•, G) and t(•, W) for the sequences of homomorphism densities t(F, G) and t(F, W), indexed by all finite graphs F.
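For illustration, the following minimal sketch (our own, not code from the paper; all names are ours) evaluates t(F, G) for a small motif in a weighted graph by brute-force enumeration of all vertex maps, and checks the result against the trace formula for the triangle.

```python
import itertools
import numpy as np

def hom_density(motif_edges, A):
    """Homomorphism density t(F, G) of a motif F (edge list on vertices 0..v-1)
    in a weighted graph G with adjacency matrix A: average the product of
    edge weights over all |V(G)|^v vertex maps."""
    k = A.shape[0]
    v = 1 + max(max(e) for e in motif_edges)        # number of motif vertices
    total = 0.0
    for phi in itertools.product(range(k), repeat=v):   # all maps V(F) -> V(G)
        prod = 1.0
        for (a, b) in motif_edges:
            prod *= A[phi[a], phi[b]]
        total += prod
    return total / k ** v

# Example: triangle density in a random weighted graph on 6 nodes.
rng = np.random.default_rng(0)
A = rng.uniform(size=(6, 6))
A = (A + A.T) / 2                                   # symmetric weights in [0, 1]
np.fill_diagonal(A, 0.0)
triangle = [(0, 1), (1, 2), (2, 0)]
print(hom_density(triangle, A))                          # brute force
print(np.trace(np.linalg.matrix_power(A, 3)) / 6 ** 3)   # same value via the trace
```

The brute-force version is exponential in the motif size, which is precisely the computational obstacle discussed above; for cycles, the trace formula avoids the enumeration.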

A finite graph G is uniquely defined by t(•, G). For graphons, homomorphism densities distinguish them as well as the cut metric, that is, $\delta_\Box(W, W') = 0$ iff t(•, W) = t(•, W′) [22, Theorem 11.3]. In other words, if one could compute the homomorphism densities of all motifs, then one could distinguish two convergent sequences of large graphs. Computationally this is not feasible, as $(t(F, W))_{F \text{ finite graph}}$ is an infinite sequence. However, this gives a sufficient condition for graphon inequality: If t(F, W) ≠ t(F, W′) for some motif F, then one can conclude that $\delta_\Box(W, W') > 0$. We give a quantitative version of this statement in the appendix, which plays an important part in our proof. Theorem 1 is an extension of this result that accounts for sampling error from estimating t(F, W) through the empirical distribution of graphs sampled from W.

2.1 Decorated Graphons

Classically, a graphon generates a random unweighted graph \({\mathbb {G}}(k, W)\) via uniform sampling of the nodes,

$$\displaystyle \begin{aligned} U_1, \dots, U_k &\stackrel{\text{iid}}{\sim} \mathrm{Unif}_{[0,1]}\\ ({\mathbb{G}}(k, W)_{ij} | U_1, \dots, U_k) &\stackrel{\text{iid}}{\sim} \mathrm{Bern}(W(U_i,U_j)), \forall i,j \in [k]. \end{aligned} $$
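A minimal sketch of this sampling procedure (our own illustration, not code from the paper):

```python
import numpy as np

def sample_graphon(k, W, seed=None):
    """Sample an unweighted graph from G(k, W): draw latent U_i ~ Unif[0,1]
    i.i.d., then include edge {i, j} with probability W(U_i, U_j)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=k)
    P = W(U[:, None], U[None, :])                  # matrix of edge probabilities
    A = (rng.uniform(size=(k, k)) < P).astype(float)
    A = np.triu(A, 1)                              # keep upper triangle only
    return A + A.T                                 # symmetrize, no self-loops

# Example with the graphon W(x, y) = 1 - max(x, y).
A = sample_graphon(100, lambda x, y: 1 - np.maximum(x, y), seed=0)
```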

Here, we extend this framework to decorated graphons, whose samples are random weighted graphs.

Definition 1

Let Π([0, 1]) be the set of probability measures on [0, 1]. A decorated graphon is a function \(\mathcal {W} \colon [0,1]^2 \to \Pi ([0,1])\).

For \(k \in \mathbb {N}\), the k-sample of a decorated graphon, \({\mathbb {G}}(k,\mathcal {W})\), is a distribution on weighted graphs on k nodes, generated by

$$\displaystyle \begin{aligned} U_1, \dots, U_k& \stackrel{\text{iid}}{\sim} \mathrm{Unif}_{[0,1]} \\ ({\mathbb{G}}(k, \mathcal{W})_{ij} | U_1, \dots, U_k)& \stackrel{\text{iid}}{\sim} \mathcal{W}(U_i,U_j), \forall i,j \in [k]. \end{aligned} $$
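As a concrete illustration of the k-sample \({\mathbb{G}}(k,\mathcal{W})\) (our own sketch), the snippet below samples a weighted graph from a decorated graphon whose decoration at (x, y) is a Beta distribution with mean W(x, y). The Beta family is purely an assumption made for illustration; any noise model with the prescribed means would do.

```python
import numpy as np

def sample_decorated_graphon(k, W, concentration=10.0, seed=None):
    """Draw a weighted graph on k nodes from a decorated graphon whose
    decoration at (x, y) is Beta with mean W(x, y) (an assumed noise model)."""
    rng = np.random.default_rng(seed)
    U = rng.uniform(size=k)                        # latent node positions
    A = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            m = W(U[i], U[j])                      # expectation graphon
            m = min(max(m, 1e-6), 1 - 1e-6)        # keep Beta parameters valid
            a, b = concentration * m, concentration * (1 - m)
            A[i, j] = A[j, i] = rng.beta(a, b)     # edge weight with mean m
    return A

# Example: a smooth expectation graphon W(x, y) = x * y.
A = sample_decorated_graphon(50, lambda x, y: x * y, seed=1)
```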

We can write every decorated graphon \(\mathcal {W}\) as \(\mathcal {W}_{W, \mu }\) with W(x, y) being the expectation of \(\mathcal {W}(x,y)\), and μ(x, y) being the centered measure corresponding to \(\mathcal {W} (x,y)\). This decomposition will be useful in formulating our main results, Theorems 1 and 2.

One important example of a decorated graphon is the noisy graphon, that is, a graphon perturbed by an error term whose distribution does not vary with the latent parameters: Given a graphon W : [0, 1]2 → [0, 1] and a centered noise measure ν ∈ Π([0, 1]), the ν-noisy graphon is the decorated graphon \(\mathcal {W}_{W,\mu }\) with constant μ(x, y) = ν, i.e. the same measure for all latent parameters.

As weighted graphs can be regarded as graphons, one can use the definition of homomorphism densities for graphons to define homomorphism densities of samples from a decorated graphon (which are then random variables). The k-sample from a decorated graphon is a distribution on weighted graphs, unlike that from a graphon, which is a distribution on unweighted (binary) graphs. The latter is a special case of a decorated graphon in which the measure at (x, y) is the centered two-point distribution taking values −W(x, y) and 1 − W(x, y) (with probabilities 1 − W(x, y) and W(x, y), respectively). Hence, our theorems generalise results for graphons.

2.2 Spectra and Wasserstein Distances

The spectrum λ(G) of a weighted graph G is the set of eigenvalues of its adjacency matrix, counting multiplicities. Similarly, the spectrum λ(W) of a graphon W is its set of eigenvalues when viewed as a symmetric operator [22, (7.18)]. It is convenient to view the spectrum λ(G) as a counting measure, that is, λ(G) =∑λδ λ, where the sum runs over all λ in the spectrum. All graphs considered in this paper have edge weights in [0, 1]; therefore, their spectra are supported in [−1, 1]. This space is equipped with the Wasserstein distance (a variant of the earth-mover's distance)

$$\displaystyle \begin{aligned}\mathcal{W}^1 (\mu, \nu) = \inf_{\gamma \in \Pi ([-1,1]^2)} \int_{(x,y) \in [-1,1]^2} \lvert x - y \rvert \mathrm d \gamma (x,y) \end{aligned} $$
(1)

for μ, ν ∈ Π([−1, 1]), where the first (second) marginal of γ should equal μ (ν). Analogously, equip the space of random measures Π( Π([−1, 1])) with the Wasserstein distance

$$\displaystyle \begin{aligned}\mathcal{W}^1(\bar{\mu}, \bar{\nu}) = \inf_{\gamma \in \Pi (\Pi([-1,1])^2)} \int_{{(\mu,\nu) \in \Pi([-1,1])^2}} \mathcal{W}^1 (\mu,\nu) \mathrm d \gamma (\mu,\nu). \end{aligned} $$
(2)

where again the first (second) marginal of γ should equal \(\bar \mu \) (\(\bar \nu \)).

Equation (2) says that one must find an optimal coupling of the two random measures and, for each coupled pair of realisations of the empirical spectrum, an optimal coupling of their eigenvalues as in (1). Equation (1) is a commonly used method for comparing point clouds, which is robust against outliers [25]. Equation (2) is a natural choice for comparing measures on a continuous space. Similar definitions have appeared in the stability analysis of features for topological data analysis [11].
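For concreteness, the following sketch (our own illustration) computes both distances for two samples of spectra. It assumes all graphs have the same number of nodes, treats each spectrum as a uniform measure over its eigenvalues (for equal node counts this differs from the counting-measure convention only by an overall factor), and divides the eigenvalues by the number of nodes so that they lie in [−1, 1], matching the graphon normalization. Since both outer empirical measures in (2) are uniform over n atoms, the optimal coupling reduces to an assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_spectra(lam1, lam2):
    """Wasserstein-1 distance (1) between two spectra with equally many
    eigenvalues, treated as uniform measures: sort and average differences."""
    return np.mean(np.abs(np.sort(lam1) - np.sort(lam2)))

def nested_w1(spectra1, spectra2):
    """Nested Wasserstein distance (2) between two equal-size samples of
    spectra. Both empirical measures are uniform over n atoms, so the optimal
    coupling is a permutation, found by linear assignment."""
    cost = np.array([[w1_spectra(s, t) for t in spectra2] for s in spectra1])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# Example with random symmetric matrices standing in for graphs on 20 nodes.
rng = np.random.default_rng(0)
def random_spectrum():
    A = rng.uniform(size=(20, 20)); A = (A + A.T) / 2
    return np.linalg.eigvalsh(A) / 20              # normalize into [-1, 1]

group1 = [random_spectrum() for _ in range(10)]
group2 = [random_spectrum() for _ in range(10)]
print(nested_w1(group1, group2))
```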

3 Graphons for Classification: Main Results

Consider a binary classification problem where, in each class, each data point is a finite, weighted, unlabelled graph. We assume that the graphs in each class are i.i.d. realizations of an underlying decorated graphon \(\mathcal {W} = \mathcal {W}_{W,\mu }\) resp. \(\mathcal {W}' = \mathcal {W}_{W',\mu '}\). Theorem 1 says that if the empirical homomorphism densities are sufficiently different in the two groups, then the underlying graphons W and W′ are different in the cut metric. Theorem 2 gives a similar bound, but replaces the empirical homomorphism densities with the empirical spectra. Note that we allow the decorated graphons to have different noise distributions and that the noise may depend on the latent parameters.

Here is the model in detail. Fix constants \(k, n \in {\mathbb N}\). Let \(\mathcal {W}_{W,\mu }\) and \(\mathcal {W}_{W',\mu '}\) be two decorated graphons. Let

$$\displaystyle \begin{aligned} G_1, \dots, G_n&\stackrel{\text{iid}}{\sim} {\mathbb{G}}(k, \mathcal{W}_{W,\mu})\\ G^{\prime}_1, \dots, G^{\prime}_n&\stackrel{\text{iid}}{\sim} {\mathbb{G}}(k, \mathcal{W}_{W',\mu'}) \end{aligned} $$

be weighted graphs on k nodes sampled from these decorated graphons. Denote by δ x the Dirac measure at x. For a motif graph F with e(F) edges, let

$$\displaystyle \begin{aligned} \bar{t}(F) \mathrel{\mathop:}= \frac{1}{n}\sum_{i=1}^n \delta_{t(F,G_i)} \end{aligned}$$

be the empirical measure of the homomorphism densities of F with respect to the data $(G_1, \dots, G_n)$, and analogously \(\bar t'(F)\) the empirical measure of the homomorphism densities of \((G_1^{\prime }, \ldots , G^{\prime }_n)\).
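As an illustration (ours, with toy data and the triangle motif C3, whose density is available via the trace formula), the quantity \(\mathcal{W}^1(\bar t, \bar t')\) appearing in Theorem 1 can be estimated as follows:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def triangle_density(A):
    """t(C_3, G) for a weighted graph with adjacency matrix A (trace formula)."""
    k = A.shape[0]
    return np.trace(np.linalg.matrix_power(A, 3)) / k ** 3

def empirical_w1(graphs1, graphs2):
    """Wasserstein-1 distance between the empirical measures of triangle
    densities in two samples of weighted graphs, i.e. W^1(t-bar, t-bar')."""
    t1 = [triangle_density(A) for A in graphs1]
    t2 = [triangle_density(A) for A in graphs2]
    return wasserstein_distance(t1, t2)

# Two toy samples: denser vs. sparser random weighted graphs on 30 nodes.
rng = np.random.default_rng(0)
def sample(mean, n=20, k=30):
    out = []
    for _ in range(n):
        A = rng.uniform(0, 2 * mean, size=(k, k)); A = (A + A.T) / 2
        np.fill_diagonal(A, 0.0)
        out.append(np.clip(A, 0, 1))
    return out

print(empirical_w1(sample(0.6), sample(0.3)))      # noticeably positive
```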

Theorem 1

There is an absolute constant c such that with probability

$$\displaystyle \begin{aligned}1-2\exp\left(-\frac{kn^{-\frac{2}{3}}}{2e(F)^2}\right)-2e^{-.09cn^{\frac{2}{3}}} \end{aligned}$$

for weighted graphs $G_i$, \(G_i^{\prime }\), i = 1, …, n, generated by the decorated graphons \(\mathcal {W}_{W,\mu }\) and \(\mathcal {W}_{W',\mu '}\),

$$\displaystyle \begin{aligned} \delta_\Box(W, W') \ge \frac{1}{e(F)}\left( \mathcal{W}^1\big(\bar t(F), \bar t'(F)\big) - 9\, n^{-\frac{1}{3}} \right). \end{aligned}$$
(3)

Note that the number of edges affects both the distance of the homomorphism densities \(\mathcal {W}^1 (\bar t, \bar t')\) and the constant e(F)−1 in front, making the effect of e(F) on the right-hand side of the bound difficult to analyze. Indeed, for any fixed \(v \in {\mathbb N}\), one can easily construct graphons for which the lower bound in Theorem 1 is attained as k, n → ∞ by a motif with v = e(F) edges. Note furthermore that the bound is given in terms of the expectation W of the decorated graphon, unperturbed by variations due to μ resp. μ′. Therefore, in the large-sample limit, motifs as features characterise exactly the expectation of the decorated graphon.

Our next result utilizes Theorem 1 and Kantorovich duality to give a bound on $\delta_\Box(W, W')$ with explicit dependence on v. Let \(\overline {\lambda }\), \(\overline {\lambda }'\) be the empirical random spectra in the decorated graphon model, that is, \(\overline {\lambda } = \frac {1}{n}\sum _{i=1}^n \delta_{\lambda (G_i)}\), \(\overline {\lambda }' = \frac {1}{n}\sum _{i=1}^n \delta_{\lambda (G^{\prime }_i)}\).

Theorem 2

There is an absolute constant c such that the following holds: Let \(v \in {\mathbb N}\). With probability \(1-2v\exp \left (-\frac {kn^{-\frac {2}{3}}}{2v^2}\right )-2ve^{-.09cn^{\frac {2}{3}}}\), for weighted graphs generated by decorated graphons \(\mathcal {W}_{W,\mu }\) and \(\mathcal {W}_{W',\mu '}\),

Through the parameter v, Theorem 2 defines a family of lower bounds for the cut distance between the underlying graphons. The choice of v depends on the values of n and on the Wasserstein distance of the empirical spectra. The parameter v can be thought of as a complexity control on the transformations of the eigenvalues used in the lower bound: If one restricts to differences in distribution of low-degree polynomials, the approximation with respect to the measure μ, i.e. the sampling of the edge weights, is good, implying a small additive error. In this case, however, the sampling of the nodes from the graphon, i.e. of the latent node features $U_i$, $i \in [k]$, has a large error, which is multiplicative. We refer to the appendix for further details.

Theorems 1 and 2 give a test for graphon equality. Namely, if \(\mathcal {W}^1(\bar \lambda ', \bar \lambda )\) is large, then the underlying graphons W and W′ of the two groups are far apart. This type of sufficient condition is analogous to the result of [11, Theorem 5.5] from topological data analysis. It should be stressed that this bound is purely nonparametric. In addition, we do not make any regularity assumption on either the graphon or the error distribution μ. The theorem is stable with respect to transformations of the graph: A bound analogous to Theorem 2 holds for the spectrum of the graph Laplacian and for the degree sequence, as we show in Sect. 8 of the appendix. In addition, having either k or n fixed is merely for ease of exposition; we give a statement with heterogeneous k and n in Sect. 9 of the appendix.

We conclude with a remark on computational complexity. The exact computation of all eigenvalues of a real symmetric matrix through state-of-the-art numerical linear algebra takes $O(n^3)$ time, where n is the number of rows of the matrix [13]. In these algorithms, the matrix is first transformed into tridiagonal form in $O(n^3)$ time; the k largest eigenvalues can then be computed in $O(kn^2)$ time.
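In practice one typically needs only a few extreme eigenvalues; a minimal sketch (ours) using a standard iterative sparse solver:

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def top_eigenvalues(A, m=10):
    """Return the m eigenvalues of largest magnitude of a symmetric weighted
    adjacency matrix A, using a Lanczos-type iterative solver."""
    vals = eigsh(A, k=m, which="LM", return_eigenvectors=False)
    return np.sort(vals)

rng = np.random.default_rng(0)
A = rng.uniform(size=(500, 500)); A = (A + A.T) / 2
print(top_eigenvalues(A, m=10))
```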

This is competitive with other graph kernels in the literature. The random walk kernel [18] has a runtime of $O(n^3)$; the subtree kernel from [28] enumerates all possible subtrees in both graphs and hence has, depending on the depth of the trees, doubly exponential runtime. The same holds for the graphlet kernels [32], which are computable in quadratic time for a constant bound on the size of the subgraphs taken into consideration, but also have a doubly exponential dependency on this parameter. Finally, the method of [16] has, in the sparse but not ultra-sparse regime (\(\lvert E(G) \rvert \in \Theta ( \lvert V(G) \rvert )\)) and without node labels (corresponding to the Kronecker node kernel), a runtime of Ω(n3) for graphs with small diameter (\(O(\sqrt {\lvert V(G) \rvert })\)) and a runtime of Ω(n4) for graphs with high diameter.

4 An Application: Classification of Lupus Erythematosus

Systemic Lupus Erythematosus (SLE) is an autoimmune disease of connective tissue. Between 25% and 70% of patients with SLE have neuropsychiatric symptoms (NPSLE) [15]. The relation of neuropsychiatric symptoms to other features of the disease is not completely understood. Machine learning techniques in combination with expert knowledge have successfully been applied in this field [20].

We analyse a data set consisting of weighted graphs. The data is extracted from diffusion tensor images of 56 individuals, 19 NPSLE, 19 SLE without neuropsychiatric symptoms and 18 human controls (HC) from the study [30]. The data was preprocessed to yield 6 weighted graphs on 1106 nodes for each individual. Each node in the graphs is a brain region of the hierarchical Talairach brain atlas by [35].

The edge weights are various scalar measures commonly used in DTI, averaged or integrated along all fibres from one brain region to another as in the pipeline depicted in Fig. 1. These scalar measures are the total number (of fibers between two regions), the total length (of all fibers between two regions), fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD) and radial diffusivity (RD) [cf 5].

Fig. 1

Preprocessing pipeline for weighted structural connectomes. A brain can be seen as a tensor field \(B: \mathbb {R}^3 \to \mathbb {R}^{3\times 3}\) of flows. The support of this tensor field is partitioned into regions A 1, …, A n, called brain regions. Fibers are parametrized curves from one region to another. Each scalar function \(F: \mathbb {R}^3 \to \mathbb {R}\) (such as axial diffusivity (AD) or fractional anisotropy (FA)) converts a brain into a weighted graph on n nodes, where the weight between regions i and j is F averaged or integrated over all fibers between these regions

The paper [20] used the same dataset [30] and considered two classification problems: HC vs NPSLE, and HC vs SLE. Using 20 brain fiber bundles selected from all over the brain (such as the fornix and the anterior thalamic radiation), they used manifold learning to track the values of AD, MD, RD and FA along fibers in the brain. Using nested cross-validation, they obtained an optimal discretisation of the bundles, and used average values on parts of the fibers as features for support-vector classification. They obtained an accuracy of 73% for HC vs. NPSLE and 76% for HC vs. SLE, cf. Table 1.

Table 1 Result comparison. Our spectral method performs comparably to [20], who used manifold learning and expert knowledge to obtain the feature vectors. Our method is computationally much simpler and promises to be a versatile tool for graph classification problems

To compare directly with [20], we consider the same classification problems. For each individual, we reduce the dimension of the graphs by averaging the weights of edges connecting nodes that lie in the same region on a coarser level of the Talairach brain atlas [35]. Inspired by Theorem 2, we compute the spectrum of the adjacency matrix, the spectrum of the graph Laplacian, and the degree sequence of the dimension-reduced graphs. We truncate each spectrum to the ten eigenvalues smallest and the ten largest in absolute value, and plot the eigenvalue distributions for the six graphs, normalized for comparison between the groups and graphs (see Fig. 2). The eigenvalues of the graphs corresponding to length and number of fibers show marked differences between HC and NPSLE. Thus, for the task HC vs NPSLE, we use the eigenvalues from these two graphs as features (a total of 40 features), while for the HC vs SLE task, we use all 120 eigenvalues from the six graphs. Using leave-one-out cross-validation with an $\ell_1$-penalty and a linear support-vector kernel, we arrive at classification rates of 78% for HC vs. NPSLE and 67.5% for HC vs. SLE, both for the graph Laplacian. In a permutation test as proposed in [27], we can reject the hypothesis that the results were obtained by pure chance at the 10% significance level. Table 1 summarises our results.
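A schematic reimplementation of this pipeline (our own sketch, with assumed data shapes and illustrative hyperparameters, not the exact code used for Table 1) could look as follows:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.svm import LinearSVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def laplacian_features(A, m=10):
    """m smallest and m largest eigenvalues of the graph Laplacian of a
    weighted adjacency matrix A."""
    L = np.diag(A.sum(axis=1)) - A
    w = eigh(L, eigvals_only=True)                 # ascending eigenvalues
    return np.concatenate([w[:m], w[-m:]])

def loo_accuracy(graphs, labels):
    """Leave-one-out accuracy of an l1-penalised linear SVM on spectral
    features; `graphs` is a list of adjacency matrices (data loading assumed)."""
    X = np.array([laplacian_features(A) for A in graphs])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)   # normalize features
    clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=10000)
    return cross_val_score(clf, X, np.asarray(labels), cv=LeaveOneOut()).mean()
```

The ℓ1 penalty performs implicit feature selection among the eigenvalues, mirroring the manual restriction to the length and fiber-count graphs described above.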

Fig. 2

Density of the first and last ten eigenvalues (normalised to zero mean and unit standard deviation) of the graph Laplacian for all six edge-weight measures. (a) Length. (b) Axial diffusivity. (c) Fractional anisotropy. (d) Number. (e) Radial diffusivity. (f) Mean diffusivity

5 Conclusion

In this paper, we provide estimates relating homomorphism densities and distributions of spectra to the cut metric without any assumptions on the graphon's structure. This allows for a non-conclusive test of graphon equality: If homomorphism densities or spectra are sufficiently different, then the underlying graphons are different as well. We study the decorated graphon model as a general model for random weighted graphs. We show that our graphon estimates also hold in this generalised setting and that known lemmas from graphon theory can be generalised. In a neuroscience application, we show that, despite its simplicity, our spectral classifier can yield competitive results. Our work opens up a number of interesting theoretical questions, such as restrictions to the stochastic k-block model.

6 Proof of Theorem 1

Theorem 1

There is an absolute constant c such that with probability

$$\displaystyle \begin{aligned}1-2\exp\left(-\frac{kn^{-\frac{2}{3}}}{2e(F)^2}\right)-2e^{-.09cn^{\frac{2}{3}}} \end{aligned}$$

for weighted graphs $G_i$, \(G_i^{\prime }\), i = 1, …, n, generated by the decorated graphons \(\mathcal {W}_{W,\mu }\) and \(\mathcal {W}_{W',\mu '}\),

$$\displaystyle \begin{aligned} \delta_\Box(W, W') \ge \frac{1}{e(F)}\left( \mathcal{W}^1\big(\bar t(F), \bar t'(F)\big) - 9\, n^{-\frac{1}{3}} \right). \end{aligned}$$
(3)

6.1 Auxiliary Results

The following result is a generalisation of [6, Lemma 4.4] to weighted graph limits.

Lemma 1

Let \(\mathcal {W}=\mathcal {W}_{W,\mu }\) be a decorated graphon, \(G \sim {\mathbb {G}}(k, \mathcal {W})\). Let F be an unweighted graph with v nodes. Then with probability at least \(1-2\exp \left (-\frac {k\varepsilon ^2}{2v^2}\right )\),

$$\displaystyle \begin{aligned} \left\lvert {t(F,G) - t(F,W)} \right\rvert < \varepsilon. {} \end{aligned} $$
(4)

Proof

We proceed in three steps. First, we give a different formulation of t(F, W) in terms of an expectation. Secondly, we show that this expectation is not too far from the expectation of t(F, G). Finally, we conclude by the method of bounded differences that concentration holds.

1.

    Let $t_{\text{inj}}(F, G)$ be the injective homomorphism density, which restricts the homomorphisms from F to G to those that map distinct vertices of F to distinct vertices of G [cf. 22, (5.12)]. Let \(G \sim {\mathbb {G}}(k, \mathcal {W})\) and let X be G's adjacency matrix. As a consequence of the exchangeability of X, it suffices in the computation of t inj to consider a single injection from V (F) to V (G) instead of the average over all of them. Without loss of generality, we may assume that V (F) = [v] and V (G) = [k]. Hence, for the identity injection [v]↪[k],

    $$\displaystyle \begin{aligned} \mathbb{E}[t_{\text{inj}}(F,X)] = \mathbb{E}\left[ \prod_{\{i,j\}\in E(F)} X_{ij}\right]. \end{aligned}$$

    Let $U_1, \dots, U_k$ be the latent parameters of the rows and columns used in sampling X from $\mathcal{W}$. Then

    $$\displaystyle \begin{aligned} \mathbb{E} \left[ { \prod_{\{i,j\}\in E(F)} X_{ij}} \right] &= \mathbb{E} \left[ {\mathbb{E}\left[ \prod_{\{i,j\}\in E(F)} X_{ij}\middle|U_1, \dots, U_k\right]} \right] \\ &=\mathbb{E} \left[ { \prod_{\{i,j\}\in E(F)} (W (U_i, U_j) + \mu(U_i, U_j))} \right] \end{aligned} $$

    We multiply out the last product and use that the terms $\mu(U_i, U_j)$ are independent and centered to see that all summands vanish except the one involving only terms from the expectation graphon, i.e.

    $$\displaystyle \begin{aligned} \mathbb{E} \left[ { \prod_{\{i,j\}\in E(F)} X_{ij}} \right] = \mathbb{E} \left[ { \prod_{\{i,j\}\in E(F)} W (U_i, U_j)} \right] = t(F, W) \end{aligned}$$
2.

    Note that the bound in the lemma is trivial for \(\varepsilon ^2 \le \ln 2 \frac {2k^2}{n} = 4\ln 2 \frac {k^2}{2n} \). Hence, in particular, we may assume \(\varepsilon \ge 4\ln 2 \frac {k^2}{2n}\).

    Furthermore, \( \left \lvert {t(F,X) -t(F,W )} \right \rvert \le \frac {1}{k} \binom {v}{2} + \left \lvert {t(F,X) -\mathbb {E}[t(F,X)] } \right \rvert \le \frac {v^2}{2k} + \left \lvert {t(F,X) -\mathbb {E}[t(F,X)] } \right \rvert \) by the first part and the bound on the difference of injective homomorphism density and homomorphism density [23, Lemma 2.1]. Hence

    $$\displaystyle \begin{aligned} \mathbb{P}\left[ \left\lvert {t(F,X_n) - t(F,\mathbb{E}\mathcal{W} )} \right\rvert \ge \varepsilon\right] &\le \mathbb{P}\left[ \left\lvert {t(F,X_n) -\mathbb{E}[t(F,X_n)]} \right\rvert \ge \varepsilon - \frac{1}{n} \binom{k}{2}\right]\\ & \le \mathbb{P}\left[ \left\lvert {t(F,X_n) -\mathbb{E}[t(F,X_n)]} \right\rvert \ge \varepsilon\left(1-\frac{1}{4 \ln 2}\right)\right]. \end{aligned} $$

    Set \(\varepsilon ' = \varepsilon \left (1-\frac {1}{4 \ln 2}\right )\). Let X be the adjacency matrix of \(G \sim {\mathbb {G}}(n,\mathcal {W})\), sampled with latent parameters U 1, …, U n. Define the function f that maps n vectors to $t(F, X_n)$, where the i-th vector consists of all values relevant to the i-th column of the array X n, that is $U_i, X_{1i}, \dots, X_{ni}$.

    We note that the random vectors $(U_i, X_{1i}, X_{2i}, \dots, X_{ni})$ are mutually independent for varying i. Claim:

    $$\displaystyle \begin{aligned} \left\lvert {f(a_1, \dots, a_n) - f(b_1, \dots, b_n)} \right\rvert \le \sum_{i=1}^n \frac{k}{n}1_{a_i \neq b_i} \end{aligned}$$

    If this claim is proved, then we have by McDiarmid’s inequality [24, (1.2) Lemma],

    $$\displaystyle \begin{aligned}\displaystyle \mathbb{P}[ \left\lvert {t(F, X_n) - t(F, \mathbb{E}\mathcal{W})} \right\rvert \ge \varepsilon'] \le 2 \exp \left(-\frac{2\varepsilon^{\prime2}}{n\left(\frac{k}{n}\right)^2} \right) = 2\exp\left(-\frac{2 n\varepsilon^{\prime 2}}{k^2}\right), \end{aligned} $$

    which implies the lemma by basic algebra.

    Let us now prove the claim. It suffices to consider a and b differing in exactly one coordinate, say the i-th. By the definition of the homomorphism density of a weighted graph, t(F, X) can be written as

    $$\displaystyle \begin{aligned} \int g(x_1, \dots, x_k) \mathrm d \mathrm{Unif}_{[n]}^{k}((x_i)_{i \in [k]}) \end{aligned}$$

    for \(g(x_1, \dots , x_k) = \prod _{\{\ell,m\} \in E(F)} X_{x_\ell x_m}\). We observe 0 ≤ g ≤ 1 (in the case of graphons, one has g ∈{0, 1}). It hence suffices to bound by \(\frac {k}{n}\) the probability that the integrand g depends on a i. This is the case only if $x_\ell = i$ for at least one $\ell \in [k]$. But the probability of this event is upper bounded by

    $$\displaystyle \begin{aligned} 1-\left(1-\frac{1}{n}\right)^k \le \frac{k}{n}, \end{aligned}$$

    by the Bernoulli inequality. This proves the claim and hence the lemma. □

Lemma 2 ([22, Lemma 10.23])

Let W, W′ be graphons and F a motif. Then

$$\displaystyle \begin{aligned} \left\lvert t(F, W) - t(F, W') \right\rvert \le e(F)\, \delta_\Box(W, W'). \end{aligned}$$

Lemma 3

Let $\mu \in \Pi([0, 1])$ and let $\mu_n$ be the empirical measure of n i.i.d. samples from μ. Then

$$\displaystyle \begin{aligned} \mathbb{E}[\mathcal{W}^1 (\mu,\mu_n)] \le 3.6462 n^{-\frac{1}{3}} \end{aligned}$$

The strategy of proof is to adapt the proof of [19, Theorem 1.1] to the 1-Wasserstein distance.

Proof

Let X ∼ μ, let Y ∼ N(0, σ2) be independent of X, and set μ σ = Law(X + Y ). Then for any ν ∈ Π([0, 1]), by standard facts about the normal distribution, \(\mathcal{W}^1(\nu , \nu ^\sigma )\le \mathbb {E}[\lvert Y \rvert ] = \sigma \sqrt {\frac {2}{\pi }}\). Hence, by the triangle inequality

$$\displaystyle \begin{aligned} \mathcal{W}^1 (\mu, \mu_n) \le 2 \sqrt{\frac{2}{\pi}} \sigma + \mathcal{W}^1(\mu^\sigma, \mu_n^\sigma). \end{aligned}$$

As the discrete metric dominates the absolute-value metric on [0, 1], \(\mathcal {W}^1(\mu ^\sigma , \mu _n^\sigma ) \le \| \mu ^\sigma - \mu _n^\sigma \|{ }_{\mathrm {TV}}\). Note that μ σ and \(\mu _n^\sigma \) have densities f σ and \(f_n^\sigma \), respectively. This means, as \(\| \mu ^\sigma -\mu _n^\sigma \|{ }_{\mathrm {TV}} = \int \lvert f^\sigma _n (x) - f^\sigma (x) \rvert \mathrm d x\),

$$\displaystyle \begin{aligned} \mathcal{W}^1(\mu^\sigma, \mu_n^\sigma) \le \int \lvert f^\sigma_n (x) - f^\sigma(x) \rvert \mathrm d x \le \sqrt{2\pi}\sqrt{\int ( \left\lvert {x} \right\rvert ^{2} +1) \lvert f^\sigma_n (x) - f^\sigma(x) \rvert^2 \mathrm d x }, \end{aligned}$$

where the last inequality is an application of [19, (2.2)]. Now observe that by the definitions of f σ and \(f_n^\sigma \), \(\mathbb {E}[\lvert f^\sigma _n (x) - f^\sigma (x) \rvert ^2] \le n^{-1}\int \phi _\sigma ^2(x-y) \mathrm d \mu (y)\), where ϕ σ is the density of N(0, σ2). Hence

$$\displaystyle \begin{aligned} \mathbb{E}[\mathcal{W}^1(\mu_n^\sigma,\mu^\sigma)]\le \sqrt{2\pi} n^{-\frac{1}{2}}\sqrt{\int ( \left\lvert {x} \right\rvert ^{2} +1) \int\phi_\sigma^2(x-y) \mathrm d \mu (y) \mathrm d x } \end{aligned}$$

By basic algebra, \(\phi _\sigma ^2(x)=\frac {1}{2\sigma }\pi ^{-\frac {1}{2}}\phi _{\frac {\sigma }{\sqrt {2}}}(x)\). This implies for Z ∼ N(0, 1) by a change of variables

$$\displaystyle \begin{aligned} \int ( \left\lvert {x} \right\rvert ^{2} +1) \int\phi_\sigma^2(x-y)\,\mathrm d\mu(y)\,\mathrm d x &\le \frac{1}{2\sigma\sqrt{\pi}}\left(1+2\left(\sigma^2\mathbb{E}[Z^2]+\int \left\lvert {y} \right\rvert ^2 \mathrm d \mu (y)\right)\right)\\ &\le \sigma^{-1}2^{-1}\pi^{-\frac{1}{2}}\left(1+2(\sigma^2+1)\right) \le\sigma^{-1}\pi^{-\frac{1}{2}}\frac{3}{2} \end{aligned} $$

Hence \(\mathbb { E}[\mathcal {W}^1(\mu _n^\sigma ,\mu ^\sigma )]\le \frac {3}{2}\sqrt {2}n^{-\frac {1}{2}}\sigma ^{-\frac {1}{2}}=\frac {3}{\sqrt {2}}n^{-\frac {1}{2}}\sigma ^{-\frac {1}{2}}\) and

$$\displaystyle \begin{aligned} \mathbb{E}[\mathcal{W}^1(\mu_n, \mu)]\le 2\sqrt{\frac{2}{\pi}}\sigma+\frac{3}{\sqrt{2}}n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}}. \end{aligned}$$

Choosing σ optimally by a first-order condition, one arrives at the lemma. □
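The rate in Lemma 3 is easy to probe numerically. The sketch below (ours) estimates \(\mathbb{E}[\mathcal{W}^1(\mu, \mu_n)]\) for μ = Unif[0, 1], for which the distance equals \(\int_0^1 \lvert F_n(t) - t\rvert\, \mathrm dt\) with F n the empirical CDF, and compares it with the bound of the lemma (the observed decay is in fact faster; the lemma only claims an upper bound).

```python
import numpy as np

def w1_to_uniform(sample, grid_size=10_000):
    """W^1 distance between Unif[0,1] and the empirical measure of `sample`:
    Riemann approximation of the integral of |F_n(t) - t| over [0, 1]."""
    t = np.linspace(0.0, 1.0, grid_size)
    F_n = np.searchsorted(np.sort(sample), t, side="right") / len(sample)
    return np.abs(F_n - t).mean()

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    est = np.mean([w1_to_uniform(rng.uniform(size=n)) for _ in range(50)])
    print(n, est, 3.6462 * n ** (-1 / 3))   # observed value vs. Lemma 3 bound
```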

Lemma 4 ([17, Theorem 2])

Let \(\mu \in \mathcal {P}(\mathbb {R})\) be such that for $X \sim \mu$, \(\ell =\mathbb { E}[e^{\gamma X^\alpha }]<\infty \) for some choice of γ and α. Then one has with probability at least \(1-e^{-cn\varepsilon ^2}\)

$$\displaystyle \begin{aligned} \mathcal{W}^1 (\mu_n, \mu) \le \varepsilon \end{aligned}$$

for any ε ∈ [0, 1] and c only depending on ℓ, γ and α.

6.2 Proof of Theorem 1

Proof (Proof of Theorem 1)

Let \(G \sim {\mathbb {G}}(k, \mathcal {W})\) and \(G' \sim {\mathbb {G}}(k, \mathcal {W}')\). By combining Lemmas 3 and 4, we get that with probability at least \(1-2e^{-.09cn^{\frac {2}{3}}}\),

$$\displaystyle \begin{aligned} \mathcal{W}^1(\bar t, \bar t') \le \mathcal{W}^1(t(F,G), t(F,G')) + 8n^{-\frac{1}{3}} \end{aligned}$$

In addition, by Lemma 1, with probability at least \(1-2\exp \left (-\frac {kn^{-\frac {2}{3}}}{2v^2}\right )-2e^{-.09cn^{\frac {2}{3}}}\) one also has

$$\displaystyle \begin{aligned} \mathcal{W}^1(t(F,G), t(F,G')) \le \left\lvert {t(F,W) - t(F,W')} \right\rvert + n^{-\frac{1}{3}} \end{aligned}$$

Upon application of Lemma 2 and rearranging, one arrives at the theorem. □

7 Proof of Theorem 2

Theorem 2

There is an absolute constant c such that the following holds: Let \(v \in {\mathbb N}\). With probability \(1-2v\exp \left (-\frac {kn^{-\frac {2}{3}}}{2v^2}\right )-2ve^{-.09cn^{\frac {2}{3}}}\), for weighted graphs generated by decorated graphons \(\mathcal {W}_{W,\mu }\) and \(\mathcal {W}_{W',\mu '}\),

7.1 Auxiliary Results

Lemma 5 ([7, (6.6)])

Let G be a weighted graph and λ its spectrum interpreted as a point measure. Let C k be the cycle of length k. Then

$$\displaystyle \begin{aligned} t(C_k, G)=\sum_{w\in \lambda}w^k. \end{aligned}$$
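A small numerical sanity check of this identity (our own illustration), using the normalization in which G is viewed as a step-function graphon, so that the relevant eigenvalues are those of the adjacency matrix divided by the number of nodes:

```python
import numpy as np

# Check of Lemma 5 for a random weighted graph on 15 nodes: the cycle
# homomorphism density equals the power sum of the normalized eigenvalues.
rng = np.random.default_rng(0)
n = 15
A = rng.uniform(size=(n, n)); A = (A + A.T) / 2
np.fill_diagonal(A, 0.0)

spectrum = np.linalg.eigvalsh(A) / n               # eigenvalues of the step graphon
for k in [3, 4, 5]:
    t_cycle = np.trace(np.linalg.matrix_power(A, k)) / n ** k   # t(C_k, G)
    print(k, t_cycle, np.sum(spectrum ** k))                    # agree up to rounding
```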

Lemma 6 (Corollary of [2, p. 200])

Let f be a 1-Lipschitz function on [−1, 1]. Then there is a polynomial p of degree v such that \(\|f-p \|{ }_\infty \le \frac {3}{\pi v}\).
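The following small experiment (ours) illustrates the rate in Lemma 6 by interpolating the 1-Lipschitz function f(x) = |x| at Chebyshev nodes. Interpolation is not the best approximation the lemma refers to, but for this function its error decays at a comparable O(1∕v) rate and stays below the stated bound.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

f = np.abs                                         # a 1-Lipschitz function on [-1, 1]
x = np.linspace(-1, 1, 5001)
for v in [5, 10, 20, 40]:
    nodes = np.cos((2 * np.arange(v + 1) + 1) * np.pi / (2 * (v + 1)))  # Chebyshev nodes
    coeffs = C.chebfit(nodes, f(nodes), v)         # degree-v interpolant
    err = np.max(np.abs(f(x) - C.chebval(x, coeffs)))
    print(v, err, 3 / (np.pi * v))                 # observed sup-error vs. Lemma 6 bound
```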

Lemma 7 ([31, Lemma 4.1])

Let \(\sum _{i=0}^v a_i x^{i}\) be a polynomial on [−1, 1] bounded by M. Then

$$\displaystyle \begin{aligned} \lvert a_i\rvert \le (4e)^v M. \end{aligned}$$

7.2 Proof of Theorem 2

Proof (Proof of Theorem 2)

Consider any coupling (λ, λ′) of \(\bar \lambda \) and \(\bar \lambda '\). One has by the definition of the Wasserstein distance \(\mathcal {W}^1_{\mathcal {W}^1}\) and Kantorovich duality

$$\displaystyle \begin{aligned} \mathcal{W}_{\mathcal{W}^1}^1(\bar \lambda', \bar\lambda) \le \mathbb{ E}\left[\mathcal{W}^1 (\lambda, \lambda')\right] = \mathbb{E}\left[\sup_{\mathrm{Lip} (f) \le 1}\int f(x) \mathrm d (\lambda - \lambda')\right]{} \end{aligned} $$
(5)

Fix any ω ∈ Ω. By Lemma 6 one can approximate Lipschitz functions by polynomials of bounded degree,

$$\displaystyle \begin{aligned} \sup_{\substack{f\colon [-1,1] \to \mathbb{R}\\\mathrm{Lip} (f) \le 1}}\int f(x) \mathrm d (\lambda - \lambda')(\omega) \le \sup_{\substack{\mathrm{deg} (f) \le v\\ \lvert f \rvert \le 2 }}\int f(x) \mathrm d (\lambda - \lambda')(\omega) + \frac{3}{\pi v}. \end{aligned}$$

Here, \(\lvert f \rvert \le 2\) can be assumed as f is defined on [−1, 1] and because of its 1-Lipschitz continuity.

Hence, by Lemma 7 and the triangle inequality

$$\displaystyle \begin{aligned} \sup_{\substack{\mathrm{deg} (f) \le v\\ \lvert f \rvert \le 2 }}\int f(x) \mathrm d (\lambda - \lambda') (\omega) &\le \sum_{i=1}^v 2(4e)^v \left\lvert \int x^i \mathrm d (\lambda - \lambda')\right\rvert (\omega)\\ &= \sum_{i=1}^v 2(4e)^v \left\lvert \sum_{w \in \lambda} w^i - \sum_{w' \in \lambda'} w^{\prime\, i} \right\rvert(\omega) \end{aligned} $$

Taking expectations, one gets

$$\displaystyle \begin{aligned} \mathcal{W}^1_{\mathcal{W}^1} (\bar \lambda, \bar \lambda' ) \le \frac{3}{\pi v} + \sum_{i=1}^v 2(4e)^v \mathbb{E} \left [\left\lvert \sum_{w \in \lambda} w^i - \sum_{w' \in \lambda'} w^i \right\rvert\right] \end{aligned}$$

for any coupling (λ, λ′) of \(\bar \lambda \) and \(\bar \lambda '\). Now consider a coupling (λ, λ′) of \(\bar \lambda \) and \(\bar \lambda '\) such that \(\bar t\), \(\bar t'\) (which are functions of λ, λ′ by Lemma 5) are optimally coupled. Then by the definition of \(\bar \lambda \), \(\bar \lambda '\), \(\bar t\) and \(\bar t'\),

$$\displaystyle \begin{aligned} \mathcal{W}^1 (\bar t_k, \bar t_k^{\prime}) = \mathbb{E}\left[ \left\lvert \sum_{w \in \lambda}w^k - \sum_{w' \in \lambda'}w^{\prime\, k}\right\rvert\right] \end{aligned}$$

where \(\bar t_i = \frac {1}{n} \sum _{j=1}^n\delta _{t(C_i, G_j)}\) and \(\bar t_i^{\prime } = \frac {1}{n} \sum _{j=1}^n\delta _{t(C_i, G_j^{\prime})}\). Hence,

(6)

The first equality follows by (5) and the second with probability at least \(1-2v\exp \left (\frac {kn^{-\frac {2}{3}}}{2v^2}\right )-2ve^{-.09cn^{\frac {2}{3}}}\) from Theorem 1. □

8 A Similar Bound for Degree Features

Let G be a graph and $(d_i)$ its degree sequence. Consider the point measure \(d = \sum _{i} \delta _{d_i}\) of degrees. Denote by \(\bar d\) resp. \(\bar d'\) the empirical measure of the degree point measures of $G_1, \dots, G_n$ resp. \(G_1^{\prime }, \dots , G_n^{\prime }\).

Proposition 1

Theorem 2 holds with the same guarantee with \(\bar \lambda \), \(\bar \lambda '\) replaced by \(\bar d\), \(\bar d'\).

Lemma 8

Let S v be the star graph on v nodes and G be a weighted graph. Then

$$\displaystyle \begin{aligned} t(S_v, G)=\sum_{w \in d}w^v \end{aligned}$$

The proof of Proposition 1 follows the same lines as that of Theorem 2, but uses Lemma 8 instead of Lemma 5.

9 Heterogeneous Sample Sizes

Our bounds from Theorems 1 and 2 can also be formulated in the more general setting of heterogeneous graph sizes. In the following, we give an extension in two directions. First, we allow for different numbers of observations $n_1$ and $n_2$. Secondly, we allow for random graph sizes k. Here is the more general model in detail: There is a measure \(\nu \in \Pi ({\mathbb N})\) such that \(G_1, \dots , G_{n_1}\) are sampled i.i.d. as

$$\displaystyle \begin{aligned} k_i &\sim \nu, & G_i &\sim {\mathbb{G}}(k_i,\mathcal{W}_{W, \mu});{} \end{aligned} $$
(7)

the sampling of \(G_1^{\prime }, \dots , G_{n_2}^{\prime }\) is analogous. Hence the samples G i are drawn from a mixture of the measures \({\mathbb {G}}(k,\mathcal {W}_{W, \mu })\), k ∼ ν. We define \(\bar t\), \(\bar t'\), \(\bar \lambda \) and \(\bar \lambda '\) by the same formulas as in the main text. Then the following results hold.

Corollary 1

There is an absolute constant c such that the following holds: Let \(n_1, n_2 \in {\mathbb N}\) and G i, i = 1, …, n 1, \(G_i^{\prime }, i=1, \dots , n_2\) sampled as in (7). Then with probability at least \(1-\exp \left (-\frac {kn_1^{-\frac {2}{3}}}{2e(F)^2}\right )-e^{-.09cn_1^{\frac {2}{3}}}-\exp \left (-\frac {kn_2^{-\frac {2}{3}}}{2e(F)^2}\right )-e^{-.09cn_2^{\frac {2}{3}}}\),

Corollary 2

In the setting of Corollary 1 and with the same absolute constant, the following holds: Let \(v \in {\mathbb N}\). With probability \(1-v\exp \left (-\frac {kn_1^{-\frac {2}{3}}}{2v^2}\right )-ve^{-.09cn_1^{\frac {2}{3}}}-v\exp \left (-\frac {kn_2^{-\frac {2}{3}}}{2v^2}\right )-ve^{-.09cn_2^{\frac {2}{3}}}\),

The proofs are very similar to those in the main text. For the different sample sizes $n_1$ and $n_2$, the concentration results, Lemmas 3 and 4, have to be applied separately with the corresponding values of n. For the random sizes k, we can choose a coupling that matches random graphs of similar sizes, leading to the expressions in the corollaries.