1 Introduction

The central assumption in machine learning is that observations are independently and identically distributed (i.i.d.) with respect to a fixed yet unknown probability distribution. Under this assumption, generalization error bounds have been proposed, shedding light on the learnability of models and guiding the design of advanced algorithms (Boser et al., 1992). However, in many real applications, the collected data can be dependent, and therefore the i.i.d. assumption does not hold. There have been extensive discussions in the community on why and how the data are dependent (Amini & Usunier, 2015; Dehling & Philipp, 2002).

Learning with interdependent data Establishing generalization theories under dependent settings has received a surge of interest in recent years (Kuznetsov & Mohri, 2017; Mohri & Rostamizadeh, 2008, 2009; Ralaivola et al., 2010). A major line of research in this direction models the data dependencies by various types of mixing models, such as \(\alpha\)-mixing (Rosenblatt, 1956), \(\beta\)-mixing (Volkonskii & Rozanov, 1959), \(\phi\)-mixing (Ibragimov, 1962), and \(\eta\)-mixing (Kontorovich, 2007). Mixing models have been used in statistical learning theory to establish generalization error bounds based on Rademacher complexity (Kuznetsov & Mohri, 2017; Mohri & Rostamizadeh, 2009, 2010) or algorithmic stability (He et al., 2016; Mohri & Rostamizadeh, 2008, 2010) via concentration results (Kontorovich & Ramanan, 2008) or the independent block technique (Yu, 1994). In these models, the mixing coefficients quantitatively measure the dependencies among data. Another line of work, referred to as decoupling, studies the behavior of complex systems by decomposing a set of dependent random variables into sets of independent variables and a set of dependent variables with vanishing moments (Peña & Giné, 1999); a random variable with vanishing moments has the property that its expected value converges to zero as the number of terms increases. The decoupling technique has been successfully applied in many areas of mathematics, statistics, and engineering.

Dependency graphs Although the results based upon the mixing model and decoupling with vanishing moments are fruitful, they face difficulties in practical applications, as it is usually hard to determine or estimate the quantitative dependencies among data points (such as the mixing coefficients or the vanishing moments) except under restrictive assumptions. On the other hand, determining whether two data points are dependent, or whether they exhibit a suitable dependency structure, is often much easier in practice. Thus in this paper, we focus on such a qualitative dependent setting. We use graphs as a natural tool to describe the dependencies among data and establish generalization theory under such graph-dependence. The dependency graph model we use has been widely utilized in many other fields, in particular in probability theory and statistics, where it is used to prove normal or Poisson approximation via Stein's approach, cumulants, and so on (see, for example, Janson, 1988, 1990). It is also heavily used in probabilistic combinatorics and statistical physics, for example in the Lovász local lemma (Erdős & Lovász, 1975) and Janson's inequality (Janson et al., 1988), along with many others.

Rademacher complexity We collect various concentration bounds under graph-dependence and utilize them to derive Rademacher and stability generalization bounds for learning from dependent data. The basic tool used to establish generalization theory is concentration inequalities. Standard concentration results for the i.i.d. case no longer apply to dependently distributed data, making the study a challenging task. Janson (2004) extended Hoeffding's inequality to sums of dependent random variables. This result bounds the probability that a sum of graph-dependent random variables deviates from its expected value, in terms of the fractional chromatic number of the dependency graph. Our first approach uses a similar idea: by dividing graph-dependent variables into sets of independent ones, we establish concentration bounds based on fractional colorings, and generalization bounds via fractional Rademacher complexity.

Algorithmic stability PAC-Bayes bounds for classification with non-i.i.d. data have also been obtained based on fractional colorings of graphs in Ralaivola et al. (2010). These results also hold for specific learning settings such as ranking and learning from stationary \(\beta\)-mixing distributions. Ralaivola and Amini (2015) established new concentration inequalities for fractionally sub-additive and fractionally self-bounding functions of dependent variables. Though fundamental and elegant, the above generalization bounds are algorithm-independent: they account for the complexity of the hypothesis space and the data distribution, but do not involve specific learning algorithms. To derive better generalization bounds, there is growing interest in developing algorithm-dependent generalization theories. This line of research relies heavily on the notion of algorithmic stability, whose key advantage is that the resulting bounds are tailored to specific learning algorithms and exploit their particular properties. Our second approach utilizes algorithmic stability to establish generalization bounds. Note that even under the i.i.d. assumption, Hoeffding-type concentration inequalities, which bound the deviation of a sample average from its expectation, are not strong enough to prove stability-based generalization. In contrast, McDiarmid's inequality characterizes the concentration of general Lipschitz functions of i.i.d. random variables and is hence used as the key tool for proving stability bounds. Therefore, to build an algorithmic stability theory for non-i.i.d. samples, we start with McDiarmid-type concentration bounds for graph-dependent random variables.

Table 1 lists some generalization results using Rademacher complexity and algorithmic stability for i.i.d., mixing, and graph-dependent settings, respectively.

Table 1 Rademacher complexity and stability generalization bounds for i.i.d., mixing, and graph-dependent settings

Paper organization In this survey, we begin by introducing different McDiarmid-type concentration inequalities for functions of graph-dependent random variables. We then utilize these concentration bounds to provide upper bounds on the generalization error for learning from graph-dependent data using Rademacher complexity and algorithmic stability. In the remainder, Sect. 2 introduces the notation and the framework. Section 3 establishes fractional Rademacher complexity and algorithmic stability bounds. Section 4 shows how the presented framework can be utilized to derive generalization bounds for learning from graph-dependent data in a variety of practical scenarios, including learning-to-rank, multi-class classification, and learning from m-dependent data. We finally conclude this work in Sect. 5 and provide some perspectives and directions for future work.

2 Notation and framework

Throughout this paper, for every positive integer n, let [n] denote the integer set \(\{1,2, \ldots , n\}\). Given two integers \(i < j\), let [i, j] denote the integer set \(\{i, i+1, \ldots , j-1, j\}\). Let \(\Omega _i\) be a Polish space for every \(i\in [n]\), \(\varvec{{\Omega }}= \prod _{i\in [n]} \Omega _i = \Omega _1 \times \ldots \times \Omega _n\) be the product space, \(\mathbb {R}\) be the set of real numbers, and \(\mathbb {R}_+\) be the set of non-negative real numbers. Let \(\Vert \cdot \Vert _p\) denote the standard \(\ell _p\)-norm of a vector. We use uppercase letters for random variables, lowercase letters for their realizations, and bold letters for vectors.

2.1 Graph-theoretic notation

We use standard graph-theoretic notation. All graphs considered are finite, undirected, and simple (no loops or multiple edges). A graph \(G = (V, E)\) consists of a set of vertices V, some pairs of which are connected by edges in E. Given a graph G, let V(G) be its vertex set and E(G) its edge set. The edge connecting a pair of distinct vertices u and v is denoted by \(\{ u, v \}\), which is assumed to be unordered. The number of edges incident to a vertex is the degree of that vertex, and we use \(\Delta (G)\) to denote the maximum degree of graph G.

2.1.1 Graph covering and partitioning

Formally, given a graph G, we introduce the following definitions.

  1. (a1)

    A family \(\{ S_{k} \}_{k}\) of subsets of V(G) is a vertex cover of G if \(\bigcup S_{k} = V(G)\).

  2. (a2)

    A vertex cover \(\{ S_{k} \}_{k}\) of G is a vertex partition of G if every vertex of G is in exactly one element of \(\{ S_{k} \}_{k}\).

  3. (a3)

    A family \(\{ ( S_{k}, w_{k} ) \}_{k}\) of pairs \(( S_{k}, w_{k} )\), where \(S_{k} \subseteq V(G)\) and \(w_{k} \in [0, 1]\), is a fractional vertex cover of G if \(\{ S_{k} \}_{k}\) is a vertex cover of G and \(\sum _{k: v \in S_{k}} w_{k} = 1\) for every \(v \in V(G)\).

  4. (a4)

    An independent set of G is a set of vertices of G, no two of which are adjacent in G. Let \(\mathcal I(G)\) denote the set of all independent sets of graph G.

  5. (a5)

    A fractional independent vertex cover \(\{ ( I_{k}, w_{k} ) \}_{k}\) of G is a fractional vertex cover such that \(I_{k} \in \mathcal I(G)\) for every k.

  6. (a6)

    A fractional coloring of a graph G is a mapping g from \(\mathcal I(G)\) to [0, 1] such that \(\sum _{I \in \mathcal I(G): v \in I} g(I) \geqslant 1\) for every vertex \(v \in V(G)\). The fractional chromatic number \(\chi _f(G)\) of G is the minimum of the value \(\sum _{I \in \mathcal I(G)} g(I)\) over fractional colorings of G. See Fig. 1 for an example. Note that the fractional chromatic number \(\chi _f(G)\) of graph G is the minimum of \(\sum _{k} w_{k}\) over all fractional independent vertex covers \(\{ ( I_{k}, w_{k} ) \}_{k}\) of G (see, for example, Janson, 2004).

  7. (a7)

    Let H be a graph and \(\{ H_x \subseteq V (G) \}_{ x \in V (H) }\) be a set of subsets of V(G) indexed by the vertices of H. Each set \(H_x\) is called a ‘bag’. The pair \((H, \{ H_x \}_{x \in V (H)})\) is an H-partition of G if:

Fig. 1 A fractional coloring of a cycle graph \(C_{5}\) of length 5 with patterns indicating different colors. The set of pairs \(\{ ( \{ i, (i+3)(\textrm{mod}\ 5) \}, 1/2 ) \}_{1 \leqslant i \leqslant 5}\) is a fractional vertex cover with the fractional chromatic number 5/2

  1. (i)

    \(\{ H_x \}_{x \in V (H)}\) is a vertex partition of G.

  2. (ii)

    Distinct u and v are adjacent in H if and only if there is an edge of G with one endpoint in \(H_u\) and the other endpoint in \(H_v\).

In graph theory, a vertex identification (also called vertex contraction) contracts a pair of vertices u and v of a graph, producing a graph in which the two vertices u and v are replaced with a single vertex t that is adjacent to the union of the vertices to which u and v were originally adjacent. Note that in vertex contraction, it does not matter whether u and v are connected by an edge; if they are, the edge is simply removed upon contraction, and this special case of vertex identification is called edge contraction.

Informally speaking, an H-partition of graph G is obtained from a proper partition of V(G) by identifying the vertices in each part, deleting loops, and replacing parallel edges with a single edge. H is also called the quotient graph of the graph G. For brevity, we say H is a partition of G. For more about partitions of graphs, see, for example, (Wood, 2009).

  1. (a8)

    A tree is a connected, acyclic graph, and a forest is a disjoint union of trees. For a given forest F, we denote the set of (vertex sets of) disjoint trees in forest F as \(\mathcal {T}(F)\).

  2. (a9)

    If forest F is a partition of graph G, then the pair \((F, \{ F_x \subseteq V(G) \}_{x \in V (F)})\) is a tree-partition of G. The set of all tree-partitions of graph G is denoted by \(\textsf {TP}(G)\). See Fig. 2 for an example.

Tree-partitions were independently introduced by Seese (1985) and Halin (1991), and have since been widely investigated (Wood, 2009). Essentially, a tree-partition of a graph is a proper partition of its vertex set into ‘bags’, such that identifying the vertices in each bag produces a forest.

Fig. 2 A tree-partition of graph G is \((H, \{ \{ 1, 2 \}, \{ 3, 4 \}, \{ 5, 6 \} \} )\), where H is a path on vertices \(\{ h_1, h_{2}, h_{3} \}\), which correspond to vertex sets \(\{ 1, 2 \}, \{ 3, 4 \}\), and \(\{ 5, 6 \}\), respectively

2.2 Probabilistic tools

Concentration inequalities are fundamental tools in statistical learning theory. They bound the deviation of a function of random variables from some value, usually its expectation. Among the most powerful ones is McDiarmid's inequality (McDiarmid, 1989), which establishes sharp concentration for multivariate functions that do not depend too much on any individual coordinate, specifically, when the function satisfies the \(\textbf{c}\)-Lipschitz condition with respect to a weighted Hamming distance (the bounded differences condition).

Let \(\textbf{1}_{\left\{ A\right\} }\) denote the indicator function for any event A, that is, \(\textbf{1}_{\left\{ A\right\} }=1\) if A occurs, otherwise, \(\textbf{1}_{\left\{ A\right\} } = 0\). We first introduce the definition of a Lipschitz function.

Definition 2.1

(\(\textbf{c}\)-Lipschitz) Given a vector \(\textbf{c}= (c_1, \ldots , c_n) \in \mathbb {R}_+^n\), a function \(f:\varvec{{\Omega }}\rightarrow \mathbb {R}\) is \(\textbf{c}\)-Lipschitz if for all \(\textbf{x}= (x_1, \ldots , x_n)\) and \(\textbf{x}^{\prime } = (x^{\prime }_1, \ldots , x^{\prime }_n)\in \varvec{{\Omega }}\), we have

$$\begin{aligned} |f(\textbf{x}) - f(\textbf{x}^{\prime }) |\leqslant \sum _{i = 1}^n c_i \textbf{1}_{\left\{ x_i \ne x^{\prime }_i \right\} }, \end{aligned}$$
(2.1)

where \(c_i\) is the i-th Lipschitz coefficient of f (with respect to the Hamming metric).

McDiarmid’s inequality is based on the following bound on the moment-generating function.

Lemma 2.2

(McDiarmid, 1989) Let \(\textbf{X}= (X_1, \ldots , X_n)\) be a vector of independent random variables taking values in \(\varvec{{\Omega }}\) and \(f:\varvec{{\Omega }}\rightarrow \mathbb {R}\) be \(\textbf{c}\)-Lipschitz. Then for any \(s>0\),

$$\begin{aligned} \mathop {{}\mathbb {E}}\left[ e^{ s( f(\textbf{X}) - \mathop {{}\mathbb {E}}{ f(\textbf{X})) } } \right] \leqslant \exp \left( \dfrac{s^{2}}{8 } \Vert \textbf{c}\Vert ^2_2 \right) . \end{aligned}$$

We can now state McDiarmid's inequality, which constitutes one of the pillars of our results. It states that a Lipschitz function of independent random variables concentrates around its expectation.

Theorem 2.3

(McDiarmid’s inequality 1989) Let \(f: \varvec{{\Omega }}\rightarrow \mathbb {R}\) be \(\textbf{c}\)-Lipschitz and \(\textbf{X}= (X_1, \ldots , X_n)\) be a vector of independent random variables that takes values in \(\varvec{{\Omega }}\). Then for every \(t>0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}( f(\textbf{X}) - \mathop {{}\mathbb {E}}{ f(\textbf{X}) } \geqslant t ) \leqslant \exp \left( - \frac{2t^2}{ \Vert \textbf{c}\Vert ^2_2} \right) . \end{aligned}$$
(2.2)

In the following, we extend McDiarmid’s inequality to the graph-dependent case, where the dependencies among random variables are characterized by a dependency graph. We first define the notion of dependency graphs, which is a widely used model in probability, statistics, and combinatorics, see Erdős and Lovász (1975), Janson et al. (1988), Chen (1978) and Baldi and Rinott (1989) for some classical results.

Given a graph \(G = (V, E)\), we say that random variables \(\{ X_i \}_{i \in V}\) are G-dependent if for any disjoint \(S, T \subset V\) such that S and T are non-adjacent in G (that is, no edge in E has one endpoint in S and the other in T), random variables \(\{ X_i \}_{i \in S}\) and \(\{ X_j \}_{j \in T}\) are independent. See Fig. 3 for an example. Formally, we define the dependency graphs in the following.

Definition 2.4

(Dependency graphs) An undirected graph G is called a dependency graph of a random vector \(\textbf{X}=(X_1,\ldots ,X_n)\) if

  1. (b1)

    \(V(G)=[n]\).

  2. (b2)

    For all disjoint \(I, J \subset [n]\), if I and J are not adjacent in G, then \(\{X_i\}_{i \in I}\) and \(\{X_j\}_{j \in J}\) are independent.

Fig. 3 A dependency graph G for random variables \(\{ X_i \}_{i \in [6]}\). Random variables \(\{ X_1, X_2 \}\) and \(\{ X_5, X_6 \}\) are independent, since the disjoint vertex sets \(\{ 1, 2 \}\) and \(\{ 5, 6 \}\) are not adjacent in G

The above definition of dependency graphs is a strong version; there are variants with weaker assumptions, such as the one used in the Lovász local lemma. Let \(K_n\) denote the complete graph on [n], that is, the graph in which every two vertices are adjacent. Then \(K_n\) is a dependency graph for any set of variables \(\{ X_i \}_{i \in [n]}\). Note that the dependency graph for a set of random variables is not necessarily unique, and the sparser ones are the more interesting ones.

For illustration, we introduce a widely studied random process that generates dependent data for which a dependency graph can be constructed naturally. Consider a data-generating procedure modeled by a spatial Poisson point process, that is, a Poisson point process on \(\mathbb {R}^2\); see Linderman and Adams (2014) and Kirichenko and Van Zanten (2015) for discussions of using this process to model data collection in various machine learning applications. The number of points in each finite region follows a Poisson distribution, and the numbers of points in disjoint regions are independent. Given a finite set \(\{ U_i \}_{i=1}^n\) of regions in \(\mathbb {R}^2\), let \(X_i\) be the number of points in region \(U_i\) for every \(i \in [n]\). Then the graph \(G\left( [n], \{ \{ i, j \}: U_i \cap U_j \ne \emptyset \} \right)\) is a dependency graph of the random variables \(\{ X_i \}_{i=1}^n\).
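
The following snippet is a minimal sketch of this construction, assuming (purely for illustration) that the regions \(U_i\) are axis-aligned rectangles given by their corner coordinates; an edge \(\{ i, j \}\) is added exactly when \(U_i \cap U_j \ne \emptyset\).

```python
def rectangles_intersect(a, b):
    """Each region is (x_min, y_min, x_max, y_max); two closed rectangles
    overlap exactly when they overlap in both coordinates."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def dependency_graph(regions):
    """Return the edge set of the dependency graph on vertices 0, ..., n-1."""
    n = len(regions)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rectangles_intersect(regions[i], regions[j])}

# Three hypothetical regions: only U_0 and U_1 overlap, so the count X_2 is
# independent of (X_0, X_1), as the edge set {(0, 1)} records.
regions = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
print(dependency_graph(regions))   # {(0, 1)}
```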

An important property of the dependency graph, in view of the definition of fractional independent vertex covers, is that if we have a fractional independent vertex cover \(\{ ( I_{k}, w_{k} ) \}_{k \in [K]}\) of G, then we may decompose the sum of interdependent variables into a weighted sum of sums of independent variables.

Lemma 2.5

(Janson, 2004, Lemma 3.1) Let G be a graph, and \(\{ ( I_{k}, w_{k} ) \}_{k \in [K]}\) be a fractional independent vertex cover of G. Let \(\{ u_{i} \}_{i \in V(G)}\) be any set of numbers. Then

$$\begin{aligned} \sum _{i \in V(G)} u_i = \sum _{i \in V(G)} \sum _{k=1}^K w_k \textbf{1}_{\left\{ i \in I_k \right\} } u_i = \sum _{k=1}^K w_k \sum _{i\in I_k} u_i, \end{aligned}$$
(2.3)

where each \(I_{k} \in \mathcal I(G)\) is an independent set. In particular, we have the following.

  • By setting \(u_{i} = 1\) for each \(i \in V(G)\), we have

    $$\begin{aligned} |V(G) |= \sum _{k=1}^K w_k |I_k |. \end{aligned}$$
    (2.4)
  • By letting \(\{ u_{i} \}_{i \in V(G)}\) be some G-dependent variables \(\{ X_{i} \}_{i \in V(G)}\), we see that (2.3) becomes a weighted sum of sums of independent random variables \(\{ X_i \}_{i\in I_k}\); see the sketch below for a numerical check.
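
The following snippet is a small numerical check of Lemma 2.5, assuming the fractional independent vertex cover of \(C_5\) from Fig. 1 (relabelled to vertices 0 to 4): it verifies the covering condition, the decomposition (2.3), and the identity (2.4).

```python
import random

n = 5
# Fractional independent vertex cover of the cycle C_5: the five "diagonal"
# pairs {i, i+2 (mod 5)}, each with weight 1/2, so Σ_k w_k = 5/2 = χ_f(C_5).
cover = [({i, (i + 2) % n}, 0.5) for i in range(n)]

# Every vertex is covered with total weight 1, as definition (a5) requires.
assert all(abs(sum(w for I, w in cover if v in I) - 1.0) < 1e-12 for v in range(n))

u = [random.gauss(0, 1) for _ in range(n)]                 # arbitrary numbers u_i
lhs = sum(u)                                               # Σ_{i in V(G)} u_i
rhs = sum(w * sum(u[i] for i in I) for I, w in cover)      # Σ_k w_k Σ_{i in I_k} u_i
assert abs(lhs - rhs) < 1e-9                               # identity (2.3)
assert abs(sum(w * len(I) for I, w in cover) - n) < 1e-12  # identity (2.4)
print("decomposition verified; sum of weights =", sum(w for _, w in cover))  # 2.5
```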

2.3 Concentration bounds for decomposable functions

Notice that McDiarmid’s inequality applies to independent random variables. Janson (2004) derived a Hoeffding-like inequality for graph-dependent random variables by decomposing the sum into sums of independent variables. Janson’s bound can be viewed as a McDiarmid-type inequality tailored to interdependent random variables in the special case where the function is a sum.

Theorem 2.6

(Janson’s concentration inequality, 2004) Let random vector \(\textbf{X}\) be G-dependent such that for every \(i \in V(G)\), random variable \(X_i\) takes values in a real interval of length \(c_i \geqslant 0\). Then, for every \(t > 0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}\left( \sum _{i \in V(G)} X_i - \mathop {{}\mathbb {E}}{ \sum _{i \in V(G)} X_i } \geqslant t \right) \leqslant \exp \left( - \frac{2t^2}{ \chi _f(G) \Vert \textbf{c}\Vert ^2_2} \right) , \end{aligned}$$
(2.5)

where \(\textbf{c}= (c_i )_{i \in V(G)}\) and \(\chi _f(G)\) is the fractional chromatic number of G.
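
As an illustration, the following Monte Carlo sketch (with an assumed toy process) builds path-dependent variables \(X_i = (Z_i + Z_{i+1})/2\) from i.i.d. \(Z_j\) uniform on [0, 1], so that each \(X_i\) lies in an interval of length \(c_i = 1\), the path graph on [n] is a valid dependency graph, and \(\chi _f(G) = 2\); it then compares the empirical deviation frequency of the sum with the right-hand side of (2.5).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 20000, 5.0
Z = rng.uniform(0, 1, size=(trials, n + 1))
X = (Z[:, :-1] + Z[:, 1:]) / 2                  # X_i depends only on Z_i, Z_{i+1}
S = X.sum(axis=1)                               # Σ_i X_i, with E[S] = n / 2
empirical = np.mean(S - n / 2 >= t)             # empirical deviation frequency
chi_f, c_sq = 2.0, n * 1.0 ** 2                 # χ_f(path) = 2, ||c||_2^2 = n
janson = np.exp(-2 * t ** 2 / (chi_f * c_sq))   # right-hand side of (2.5)
print(f"empirical {empirical:.4f} <= Janson bound {janson:.4f}")
```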

We extend this result and obtain similar concentration bounds for Lipschitz functions of graph-dependent random variables that satisfy the decomposability constraint given in Definition 2.7.

Definition 2.7

(Decomposable \(\textbf{c}\)-Lipschitz functions) Given a graph G on n vertices and a vector \(\textbf{c}= (c_i)_{i \in [n]} \in \mathbb {R}_+^n\), a function \(f:\varvec{{\Omega }}\rightarrow \mathbb {R}\) is decomposable \(\textbf{c}\)-Lipschitz with respect to graph G if for all \(\textbf{x}= (x_1, \ldots , x_n) \in \varvec{{\Omega }}\) and for all fractional independent vertex covers \(\{ ( I_{j}, w_{j} ) \}_{j}\) of G, there exist \((c_{i})_{i \in I_{j}}\)-Lipschitz functions \(\{ f_{j}: \varvec{{\Omega }}_{I_{j}} \rightarrow \mathbb {R}\}_{j}\) such that

$$\begin{aligned} f( \textbf{x}) = \sum _{ j } w_j f_{j}( \textbf{x}_{I_j} ), \end{aligned}$$
(2.6)

where for every set \(V \subseteq [n]\), we write \(\varvec{{\Omega }}_V:= \prod _{i \in V} \Omega _i\), and \(\textbf{x}_V:= \{ x_i \}_{i \in V}\).

Theorem 2.8

(Amini & Usunier, 2015; Usunier et al., 2005) Let function \(f: \varvec{{\Omega }}\rightarrow \mathbb {R}\) be decomposable \(\textbf{c}\)-Lipschitz, and \(\varvec{{\Omega }}\)-valued random vector \(\textbf{X}\) be G-dependent. Then for \(t > 0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}\left( f(\textbf{X}) - \mathop {{}\mathbb {E}}{ f(\textbf{X}) } \geqslant t \right) \leqslant \exp \left( -\frac{2t^2}{\chi _f(G) \Vert \textbf{c}\Vert ^2_2 } \right) . \end{aligned}$$
(2.7)

Remark 2.9

The chromatic number \(\chi (G)\) of a graph G is the smallest number of colors needed to color the vertices of G so that no two adjacent vertices share the same color. Let \(\Delta (G)\) denote the maximum degree of G. It is well known that \(\chi _f(G) \leqslant \chi (G) \leqslant \Delta (G)+1\) (see, for example, Bollobás 1998). Thus in our bound (2.7), we can substitute \(\chi _f(G)\) with \(\chi (G)\) or \(\Delta (G)+1\), which may be easier to estimate in practice.
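
As a minimal sketch of these surrogates, the snippet below runs a greedy proper coloring (largest degree first); the number of colors it uses is at least \(\chi (G)\) and never exceeds \(\Delta (G)+1\), and \(\chi (G)\) or \(\Delta (G)+1\) may replace \(\chi _f(G)\) in (2.7) at the cost of a looser constant. The cycle \(C_5\) is used as an assumed example.

```python
def greedy_coloring(adj):
    """adj: dict mapping each vertex to the set of its neighbours.
    Returns a proper coloring; the number of colors used is at most Δ(G) + 1."""
    color = {}
    for v in sorted(adj, key=lambda u: -len(adj[u])):       # largest degree first
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(len(adj)) if c not in used)
    return color

# Example: the cycle C_5, for which χ_f = 5/2, χ = 3, and Δ + 1 = 3.
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring = greedy_coloring(adj)
print("colors used:", len(set(coloring.values())),
      " Δ+1 =", max(len(nbrs) for nbrs in adj.values()) + 1)
```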

Proof of Theorem 2.8

Following the Cramér-Chernoff method (see, for example, Boucheron et al., 2013), we have for any \(s>0\) and \(t>0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}\left( f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X}) \geqslant t \right) \leqslant e^{-st} \mathop {{}\mathbb {E}}\left[ e^{ s ( f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X}) ) } \right] . \end{aligned}$$
(2.8)

Let \(\{ ( I_{j}, w_{j} ) \}_{j \in [J]}\) be a fractional independent vertex cover of the dependency graph G with

$$\begin{aligned} \sum _{j=1}^{J} w_j = \chi _f(G). \end{aligned}$$
(2.9)

Utilizing the decomposition property of the Lipschitz function (2.6), the moment-generating function on the right-hand side of (2.8) can be written as

$$\begin{aligned} \mathop {{}\mathbb {E}}\left[ e^{ s ( f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X}) ) } \right] = \mathop {{}\mathbb {E}}\left[ \exp \left( \sum _{j=1}^J s w_j ( f_j(I_j) - \mathop {{}\mathbb {E}}f_j(I_j) ) \right) \right] , \end{aligned}$$

where each \(f_j(I_j) = f_j( \textbf{X}_{I_j} )\) is some Lipschitz function of the independent variables \(\{ X_i \}_{i \in I_j}\).

Now, let \(\{p_1, \ldots , p_J\}\) be any set of J strictly positive reals that sum to 1. Since \(\sum _{j=1}^J w_j/\chi _f(G)=1\) by (2.9), using the convexity of the exponential function and Jensen’s inequality, we obtain that

$$\begin{aligned} \mathop {{}\mathbb {E}}\left[ e^{ s ( f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X}) ) } \right]&= \mathop {{}\mathbb {E}}\left[ \exp \left( \sum _{j = 1}^J p_j \dfrac{sw_j}{p_j} ( f_j(I_j) - \mathop {{}\mathbb {E}}f_j(I_j) ) \right) \right] \nonumber \\&\leqslant \mathop {{}\mathbb {E}}\left[ \sum _{j = 1}^J p_j \exp \left( \dfrac{sw_j}{p_j} ( f_j(I_j) - \mathop {{}\mathbb {E}}f_j(I_j) ) \right) \right] \nonumber \\&= \sum _{j = 1}^J p_j \mathop {{}\mathbb {E}}\left[ \exp \left( \dfrac{sw_j}{p_j} (f_j(I_j) - \mathop {{}\mathbb {E}}f_j(I_j) ) \right) \right] , \end{aligned}$$
(2.10)

where the last step is by the linearity of expectation. Note that each subset \(I_j\) in summation (2.10) is an independent set, and therefore corresponds to independent variables. Hence applying Lemma 2.2 to each expectation that appears in the above summation gives

$$\begin{aligned} \sum _{j = 1}^J p_j \mathop {{}\mathbb {E}}\left[ \exp \left( \dfrac{sw_j}{p_j} (f_j(I_j) - \mathop {{}\mathbb {E}}f_j(I_j) ) \right) \right] \leqslant \sum _{j = 1}^J p_j \exp \left( \dfrac{s^2 w_j^2}{8p_j^2} \sum _{i \in I_j} c_i ^2 \right) . \end{aligned}$$

By rearranging terms in the exponential of the right-hand side of the inequality above and setting

$$\begin{aligned} p_j = \dfrac{ w_j \sqrt{ \sum _{i \in I_j} c_i^2 } }{ \sum _{j = 1}^J \left( w_j \sqrt{ \sum _{i \in I_j} c_i^2 } \right) }, \end{aligned}$$

we have that

$$\begin{aligned} \sum _{j = 1}^J p_j \exp \left( \dfrac{s^2 w_j^2}{8p_j^2} \sum _{i \in I_j} c_i^2 \right)&= \sum _{j = 1}^J p_j \exp \left( \dfrac{s^2}{8} \left( \sum _{j = 1}^J w_j \sqrt{ \sum _{i \in I_j} c_i^2 }\right) ^2 \right) \\&= \exp \left( \dfrac{s^2}{8} \left( \sum _{j = 1}^J w_j \sqrt{ \sum _{i \in I_j} c_i^2 }\right) ^2 \right) , \end{aligned}$$

where the last equality is by recalling that the \(p_j\) sum to 1. By the Cauchy–Schwarz inequality,

$$\begin{aligned} \left( \sum _{j = 1}^J w_j \sqrt{ \sum _{i \in I_j} c_i^2 }\right) ^2&= \left( \sum _{j = 1}^J \sqrt{w_j } \sqrt{w_j \sum _{i \in I_j} c_i^2 }\right) ^2 \\&\leqslant \left( \sum _{j = 1}^J w_j \right) \left( \sum _{j = 1}^J w_j \sum _{i \in I_j} c_i^2 \right) = \chi _f(G) \sum _{i \in V(G)} c_i^2, \end{aligned}$$

where the last equality is due to decomposition (2.3) and equation (2.9). The proof is then completed by choosing \(s = 4t/ (\chi _f(G) \sum _{i \in V(G)} c_i^2)\) in (2.8). \(\square\)

2.4 Concentration bounds for general Lipschitz functions

We have demonstrated concentration results for functions with specific decomposable constraints. Moving forward, we extend our study to encompass more general Lipschitz functions. To begin with, we present concentration results for scenarios involving forest-dependence, wherein the dependency graphs are structured as forests. It is worth recalling that a forest is a disjoint union of trees.

Theorem 2.10

(Zhang et al., 2019; Zhang, 2022) Let function \(f: \varvec{{\Omega }}\rightarrow \mathbb {R}\) be \(\textbf{c}\)-Lipschitz, and let the \(\varvec{{\Omega }}\)-valued random vector \(\textbf{X}\) be G-dependent, where G is a disjoint union of trees \(\{ T_i \}_{i\in [k]}\). Then for \(t > 0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}\left( f(\textbf{X}) - \mathop {{}\mathbb {E}}{ f(\textbf{X}) } \geqslant t \right) \leqslant \exp \left( - \frac{2t^2}{ \sum ^k_{i=1} c_{\min ,i}^2 + \sum _{ \{i, j\} \in E(G) } ( c_i + c_j)^2 } \right) , \end{aligned}$$
(2.11)

where \(c_{\min ,i}:= \min \{ c_j: j \in V(T_i) \}\) for all \(i \in [k]\).

The proof of this theorem is by first properly ordering \(\{ X_{i} \}_{i \in V(G)}\) as \(( X_{i} )_{i \in [n]}\), and rewriting \(f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X})\) as a summation \(\sum _{i \in [n]} V_i\), where

$$\begin{aligned} V_i := \mathop {{}\mathbb {E}}[ f(\textbf{X}) |X_1, \ldots , X_i ] - \mathop {{}\mathbb {E}}[ f(\textbf{X}) |X_1, \ldots , X_{i-1} ]. \end{aligned}$$

In the proof, each tree \(T_{i}\) is rooted by choosing the vertex with the minimum Lipschitz coefficient \(\min \{ c_j: j \in V(T_i) \}\) in that tree as the root. It can be shown that for some suitable ordering, each \(V_i\) ranges in an interval of length at most \(c_i+c_j\), where j is the parent of i in the tree, or simply \(c_i\) (if i corresponds to a root vertex). The theorem then follows by applying the Chernoff-Cramér technique to \(\sum _{i = 1}^n V_i\). The detailed proof is a bit involved and can be found in Zhang (2022).

Remark 2.11

If random variables \((X_1, \ldots , X_n)\) are independent, then the empty graph \(\overline{K}_n = ([n], \emptyset )\) is a valid dependency graph for \(\{ X_i \}_{i\in [n]}\). In this case, inequality (2.11) reduces to McDiarmid’s inequality (2.2), since each vertex is treated as a tree.

If all Lipschitz coefficients take the same value c, then the denominator of the exponent in (2.11) becomes \(k c^2 + 4 (n-k) c^2 = (4n - 3k) c^2\), since the number of edges in the forest is \(n - k\). The denominator in Janson’s bound (2.5) is \(2nc^2\), since the fractional chromatic number of any tree is 2. Thus, whenever \(k > 2n/3\), bound (2.11) is tighter than Janson’s concentration inequality (2.5).
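
The following quick comparison (with assumed values of n and c) tabulates the two exponent denominators over forests with k trees on n vertices, showing the crossover at \(k = 2n/3\).

```python
n, c = 90, 1.0
for k in (10, 45, 75, 90):                   # k = n corresponds to independent variables
    forest = (4 * n - 3 * k) * c ** 2        # denominator in bound (2.11)
    janson = 2 * n * c ** 2                  # denominator in Janson's bound (2.5)
    winner = "forest bound tighter" if forest < janson else "Janson tighter"
    print(f"k = {k:2d}: {forest:6.0f} vs {janson:6.0f}  ({winner})")
```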

2.4.1 Concentration for general graphs

In this subsection, we consider the concentration of general Lipschitz functions of variables whose dependency graph may not be a forest. This is done by taking tree-partitions of the dependency graph via vertex identifications, and then applying the forest-dependence results obtained above.

Theorem 2.12

Let function \(f: \varvec{\Omega }\rightarrow \mathbb {R}\) be \(\textbf{c}\)-Lipschitz, and \(\varvec{\Omega }\)-valued random vector \(\textbf{X}\) be G-dependent. Then for any \(t>0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}( f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X}) \geqslant t ) \leqslant \exp \left( - \frac{2t^2}{ D(G, \textbf{c}) } \right) , \end{aligned}$$

where

$$\begin{aligned} D(G, \textbf{c}) := \min _{(F, \{ F_x \}_{x \in V (F)}) \in \textsf {TP}(G)} \left( \sum _{T \in \mathcal {T}(F)} \widetilde{c}_{\min , T}^2 + \sum _{\{ u, v \} \in E(F)} ( \widetilde{c}_u + \widetilde{c}_v)^2 \right) , \end{aligned}$$

with \(\widetilde{c}_u:= \sum _{i\in F_u} c_i\) for all \(u \in V(F)\) and \(\widetilde{c}_{\min , T}:= \min \{ \widetilde{c}_i: i \in V(T) \}\) for all \(T \in \mathcal T(F)\).

Proof

For every \(u \in V(F)\), we define a random vector \(\textbf{Y}_u= \{ X_i \}_{ i \in F_u}\), and treat each \(\textbf{Y}_u\) as a random variable. We then define a new random vector \(\textbf{Y}=(\textbf{Y}_u)_{u\in V(F)}\), and let \(g(\textbf{Y})=f(\textbf{X})\). It is easy to check that g is \(\widetilde{\textbf{c}}\)-Lipschitz by the triangle inequality, where \(\widetilde{\textbf{c}}=(\widetilde{c}_u)_{u\in V(F)}\). Hence the theorem immediately follows from Theorem 2.10. \(\square\)

It is useful to define the notion of forest complexity, which depends only on the graph, especially when the Lipschitz coefficients are of the same order.

Definition 2.13

(Forest complexity) The forest complexity of a graph G is defined by

$$\begin{aligned} \Lambda (G) := \min _{ (F, \{ F_x \}_{x \in V (F)}) \in \textsf {TP}(G) } \left( \sum _{T \in \mathcal {T}(F)} \min _{ u \in T } |F_u |^2 + \sum _{\{ u, v \} \in E(F)} |F_u \cup F_v|^2 \right) , \end{aligned}$$

where the minimization is over all tree-partitions of G.
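
The following helper is a minimal sketch (not an exact minimizer): it evaluates the bracketed quantity above for one candidate tree-partition, which therefore upper-bounds \(\Lambda (G)\). The bags below encode a tree-partition of the cycle \(C_6\) of the kind used later in Example 2.17 and Fig. 4.

```python
def tree_partition_value(bags, forest_edges, trees):
    """bags: dict bag-id -> set of vertices of G; forest_edges: edges of F;
    trees: list of sets of bag-ids, one per tree of the forest F."""
    min_term = sum(min(len(bags[u]) for u in T) ** 2 for T in trees)
    edge_term = sum(len(bags[u] | bags[v]) ** 2 for u, v in forest_edges)
    return min_term + edge_term

# C_6 on vertices 1..6; F is the path a-b-c-d with bags {1}, {2,6}, {3,5}, {4}.
bags = {"a": {1}, "b": {2, 6}, "c": {3, 5}, "d": {4}}
forest_edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(tree_partition_value(bags, forest_edges, trees=[{"a", "b", "c", "d"}]))
# prints 35 = 2*(1+2)^2 + (6/2 - 2)*(2+2)^2 + 1, an upper bound on Λ(C_6)
```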

Remark 2.14

The width of a tree-partition is the maximum number of vertices in a bag. The tree-partition-width \(\textsf {tpw}(G)\) of G is the minimum width over all tree-partitions of G. Let \(F \in \textsf {TP}(G)\) be a tree-partition of width \(\textsf {tpw}(G)\). Then

$$\begin{aligned} \Lambda (G) \leqslant |\mathcal {T}(F)|\textsf {tpw}(G)^2 + 4|E(F) |\textsf {tpw}(G)^2 = ( |V(F)|+ 3|E(F) |) \textsf {tpw}(G)^2, \end{aligned}$$

since the number of disjoint trees in a forest F equals \(|V(F)|- |E(F) |\). Upper bounds on the tree-partition-width \(\textsf {tpw}(G)\) can be obtained using the treewidth and the maximum degree of G; such bounds are beyond the scope of this paper, see Wood (2009) for more details.

If all Lipschitz coefficients take the same value, Theorem 2.12 simplifies as follows.

Corollary 2.15

Let function \(f: \varvec{{\Omega }}\rightarrow \mathbb {R}\) be \(\textbf{c}\)-Lipschitz with all coefficients equal to c, and \(\varvec{{\Omega }}\)-valued random vector \(\textbf{X}\) be G-dependent. Then for \(t > 0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}( f(\textbf{X}) - \mathop {{}\mathbb {E}}f(\textbf{X}) \geqslant t ) \leqslant \exp \left( - \frac{2t^2}{ \Lambda (G) c^2}\right) . \end{aligned}$$
(2.12)

Similar to the theorems derived above, Corollary 2.15 also gives an exponentially decaying bound on the probability of deviation. The rate of decay is determined by the Lipschitz coefficients of the function, and the forest complexity of the dependency graph. Intuitively, the closer the dependency graph is to a forest, the faster the deviation probability decays. This uncovers how the dependencies among random variables influence concentration.

2.4.2 Examples

Here we present several explicit examples to demonstrate and estimate the forest complexity when the random variables are structured according to a graph. All of these examples arise naturally from random processes defined over graph structures.

Example 2.16

(G is a tree) In this case, \(\Lambda (G) \leqslant |E(G)|(1+1)^2 + 1 = 4n - 3\). We get an upper bound of \(\Lambda (G)\) that is linear in the number of variables, which is comparable to Janson’s concentration inequality up to some constant factor (see (2.5) with \(\chi _f(G)=2\) and Remark 2.11).

Example 2.17

(G is a cycle \(C_n\)) If n is even, a tree-partition is illustrated in Fig. 4, where the resulting forest is a path F of length n/2 with each gray belt representing a ‘bag’. We will keep this convention for the rest of this paper.

By the illustrated tree-partition, \(\Lambda (G) \leqslant 2 \times (1+2)^2 + (n/2-2)(2+2)^2+1 = O(n)\). When n is odd, according to the tree-partition shown in Fig. 5, \(\Lambda (G)\leqslant (1+2)^2 + (\frac{n - 1}{2}-1)(2+2)^2 +1 = O(n)\). Since \(\chi _f \geqslant 2\) for cycles, our bound is again comparable to Janson’s concentration inequality (2.5) up to some constant multiplicative factor.

Fig. 4 A tree-partition of \(C_6\)

Fig. 5 A tree-partition of \(C_5\)

Example 2.18

(G is a grid) Suppose G is a two-dimensional \((m\times m)\)-grid. Then \(n=m^2\). Considering the tree-partition illustrated in Fig. 6, we have

$$\begin{aligned} \Lambda (G) \leqslant 1 + 2 \sum _{i=1}^m (2i - 1)^2 = \dfrac{2}{3} m(2m+1)(2m-1)+1 = O(m^3) = O(n^\frac{3}{2}). \end{aligned}$$
Fig. 6 A tree-partition of a \(4\times 4\) grid

3 Generalization for learning from graph-dependent data

We now apply the concentration bounds obtained above to derive generalization bounds for supervised learning from graph-dependent data. Let

$$\begin{aligned} \textbf{S}:= ((x_1, y_1), \ldots , (x_n, y_n)) \in (\mathcal {X} \times \mathcal {Y})^n \end{aligned}$$

be a G-dependent training sample of size n, where \(\mathcal{X}\) denotes the input space and \(\mathcal{Y}\) denotes the set of labels. Let \(\mathcal{D}\) be the underlying distribution of data on \(\mathcal{X} \times \mathcal Y\). Note that the sample \(\textbf{S}\) contains dependent data with the same marginal distribution \(\mathcal {D}\).

Further, we fix a non-negative loss function \(\ell : \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb {R}_+\). For any hypothesis \(f: \mathcal{X} \rightarrow \mathcal{Y}\), the empirical error on sample \(\textbf{S}\) is defined by

$$\begin{aligned} {\widehat{R}}_\textbf{S}(f):= \frac{1}{n} \sum _{i = 1}^n \ell (y_i, f ( x_i )). \end{aligned}$$

For learning from dependent data, the generalization error can be defined in various ways. We adopt the following widely-used one (Hang & Steinwart, 2014; Lozano et al., 2006; Meir, 2000; Steinwart & Christmann, 2009)

$$\begin{aligned} R(f):= \mathop {{}\mathbb {E}}_{(x, y)\sim \mathcal {D}} [ \ell (y, f(x)) ], \end{aligned}$$
(3.1)

which assumes that the test data is independent of the training sample.

3.1 Generalization bounds via fractional Rademacher complexity

Our first approach is based on Rademacher complexity (Bartlett & Mendelson, 2002). This approach can be extended to accommodate interdependent data by utilizing the decomposition into independent sets described in Sect. 2.1.1.

Definition 3.1

(Fractional Rademacher complexity, Usunier et al., 2005)

Let \(\{ ( I_{j}, w_{j} ) \}_{j}\) be a fractional independent vertex cover of a dependency graph G constructed over a training set \(\textbf{S}\) of size n, with \(\sum _j w_j = \chi _f(G)\). Let \(\mathcal {F}=\{ f:\mathcal{X} \rightarrow \mathcal {Y}\}\) be the hypothesis class. Then, the empirical fractional Rademacher complexity of \(\mathcal {F}\) given \(\textbf{S}\) is defined by

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F}) = \frac{1}{n} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sum _{j} w_j \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j} \sigma _i f ( x_i ) \right) \right] , \end{aligned}$$
(3.2)

where \(\varvec{\sigma }=(\sigma _i)_{1\leqslant i\leqslant n}\) denotes a vector of n independent Rademacher variables, that is, \(\mathbb {P}(\sigma _i=-1)=\mathbb {P}(\sigma _i=+1)=1/2\) for each \(i \in [n]\). Moreover, the fractional Rademacher complexity of \(\mathcal {F}\) is defined by

$$\begin{aligned} \mathfrak R^{\star }(\mathcal {F}) = \mathop {{}\mathbb {E}}_{\textbf{S}} \left[ \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F}) \right] . \end{aligned}$$

Remark 3.2

In the i.i.d. situation, the set of singleton vertices is a valid fractional independent vertex cover, and the fractional Rademacher complexity (3.2) simplifies to the original empirical Rademacher complexity (Bartlett & Mendelson, 2002) defined by

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}(\mathcal {F}) = \frac{1}{n} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in [n]} \sigma _i f ( x_i ) \right) \right] . \end{aligned}$$
(3.3)

Additionally, because the fractional Rademacher complexity is a weighted sum of empirical Rademacher complexities over independent blocks, estimates can be obtained by extending the standard properties of the empirical Rademacher complexity.
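
For concreteness, the following Monte Carlo sketch estimates the empirical fractional Rademacher complexity (3.2) for a finite hypothesis class; the dependency graph, its cover, and the prediction values below are illustrative placeholders (in general \(\mathcal {F}\) need not be finite).

```python
import numpy as np

def fractional_rademacher(preds, cover, n_draws=2000, seed=0):
    """preds: array of shape (|F|, n) holding f(x_i) for each hypothesis f;
    cover: list of (I_j, w_j) with I_j a list of sample indices."""
    rng = np.random.default_rng(seed)
    n = preds.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)          # Rademacher vector
        total += sum(w * np.max(preds[:, I] @ sigma[I])  # sup_f Σ_{i in I_j} σ_i f(x_i)
                     for I, w in cover)
    return total / (n_draws * n)

# Toy instance: n = 4 points whose dependency graph is the path 0-1-2-3, with
# cover ({0, 2}, 1) and ({1, 3}, 1), so Σ_j w_j = 2 = χ_f(path); |F| = 3.
preds = np.array([[0.1, -0.4, 0.3, 0.2],
                  [0.5, 0.5, -0.5, 0.5],
                  [-0.2, 0.1, 0.0, -0.3]])
cover = [([0, 2], 1.0), ([1, 3], 1.0)]
print(fractional_rademacher(preds, cover))
```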

In the following, we consider the class of linear functions with bounded-norm weight vectors, for which the empirical fractional Rademacher complexity can be bounded directly.

Theorem 3.3

Let \(\mathcal {F}=\{x\mapsto \langle \textbf{w},\phi (x)\rangle : \Vert \textbf{w}\Vert \leqslant B\}\) be a class of linear functions with bounded weights in a feature space such that \(\Vert \phi ( x )\Vert \leqslant \Gamma\) for all x. Then

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F}) \leqslant B \Gamma \sqrt{\frac{\chi _f(G)}{n}}. \end{aligned}$$
(3.4)

Proof

In view of the definition of the empirical fractional Rademacher complexity (3.2), by the linearity of expectation, we have

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F})&= \frac{1}{n} \sum _{j} w_j \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sup _{\Vert \textbf{w}\Vert \leqslant B} \left( \sum _{i \in I_j} \left\langle \textbf{w}, \sigma _i \phi ( x_i ) \right\rangle \right) \right] \\&\leqslant \frac{B}{n} \sum _{j} w_j \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left\Vert \sum _{i \in I_j} \sigma _i \phi ( x_i )\right\Vert \leqslant \frac{B}{n} \sum _{j} w_j \left( \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left\Vert \sum _{i \in I_j} \sigma _i \phi ( x_i )\right\Vert ^2\right) ^{1/2}, \end{aligned}$$

where the first inequality is by noting \(\Vert \textbf{w}\Vert \leqslant B\), and applying Cauchy–Schwarz inequality to the inner product, and the second inequality is by Jensen’s inequality.

As the Rademacher variables are independent, we have \(\mathop {{}\mathbb {E}}[\sigma _i\sigma _k]=\mathop {{}\mathbb {E}}[\sigma _i]\mathop {{}\mathbb {E}}[\sigma _k]=0\) for any distinct ik. Hence we have

$$\begin{aligned} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left\Vert \sum _{i \in I_j} \sigma _i \phi ( x_i )\right\Vert ^2 = \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sum _{i,k \in I_j} \sigma _i\sigma _k \langle \phi ( x_i ),\phi (x_k) \rangle \right] = \sum _{i \in I_j} \Vert \phi ( x_i )\Vert ^2, \end{aligned}$$

and therefore,

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F}) \leqslant \frac{B}{n} \sum _{j} w_j \left( \sum _{i \in I_j} \Vert \phi ( x_i )\Vert ^2\right) ^{1/2}. \end{aligned}$$

Since \(\Vert \phi ( x_i )\Vert \leqslant \Gamma\) in the feature space, we obtain

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F}) \leqslant \frac{B\Gamma }{n} \sum _{j} w_j \sqrt{|I_j |} = \frac{B\Gamma \chi _f(G)}{n} \sum _{j} \frac{w_j}{\chi _f(G)} \sqrt{|I_j |}. \end{aligned}$$

By noticing that \(\sum _{j} w_j / \chi _f(G) = 1\), using Jensen’s inequality for the square root function yields

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\mathcal {F}) \leqslant \frac{B\Gamma \chi _f(G)}{n} \sqrt{ \sum _{j} \frac{w_j}{\chi _f(G)} |I_j |} = \frac{B\Gamma \sqrt{\chi _f(G)}}{n}\sqrt{\sum _j w_j |I_j |}. \end{aligned}$$

The result follows then by noting \(\sum _j w_j |I_j |=n\) by (2.4). \(\square\)
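
As a quick numerical sanity check of (3.4), the sketch below (with synthetic, assumed features normalized so that \(\Gamma = 1\)) uses the closed form \(\sup _{\Vert \textbf{w}\Vert \leqslant B} \langle \textbf{w}, \textbf{v} \rangle = B \Vert \textbf{v} \Vert\) to estimate the left-hand side of (3.4) by Monte Carlo for a path dependency graph with \(\chi _f(G) = 2\).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, B = 8, 3, 1.0
phi = rng.normal(size=(n, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)           # normalize so Γ = 1
# Fractional independent vertex cover of the path on [n]: even and odd indices.
cover = [(list(range(0, n, 2)), 1.0), (list(range(1, n, 2)), 1.0)]

estimates = []
for _ in range(5000):
    sigma = rng.choice([-1.0, 1.0], size=n)
    # sup over the class is B * ||Σ_{i in I_j} σ_i φ(x_i)|| for each block I_j
    estimates.append(sum(w * B * np.linalg.norm(sigma[I] @ phi[I])
                         for I, w in cover) / n)
mc = np.mean(estimates)
bound = B * 1.0 * np.sqrt(2.0 / n)                          # B·Γ·sqrt(χ_f(G)/n)
print(f"Monte Carlo estimate {mc:.3f} <= Theorem 3.3 bound {bound:.3f}")
```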

Remark 3.4

Note that \(\phi\) could be the feature mapping corresponding to the last hidden layer of a neural network, or to a kernel function. In particular, under the assumptions of Theorem 3.3, let \(\phi\) be a feature mapping associated with a kernel K such that \(K(x, x) \leqslant \Gamma ^{2}\) for all x. Then the standard Rademacher complexity bound for kernel-based hypotheses (Mohri et al., 2018, Theorem 6.12) gives \(\widehat{\mathfrak R}_{\textbf{S}}(\mathcal {F}) \leqslant B\Gamma /\sqrt{n}\); in comparison, our bound (3.4) has an additional factor \(\sqrt{\chi _{f}(G)}\), which equals 1 in the i.i.d. case, as noted in Remark 3.2.

It is also worth noting that the fractional Rademacher complexity is defined with respect to a given fractional cover. Our analysis holds for any optimal fractional cover, though different choices of cover may yield different bound values; in practice, this influence is unlikely to be significant.

We now obtain generalization bounds using the fractional Rademacher complexity.

Theorem 3.5

(Amini & Usunier, 2015; Usunier et al., 2005) Let \(\textbf{S}\) be a sample of size n with dependency graph G, let \(\ell : \mathcal {Y} \times \widehat{\mathcal {Y}} \rightarrow [0, M]\) be a loss function, and let \(\mathcal {F}\) denote the hypothesis class. Then, for any \(\delta \in (0, 1)\), with probability at least \(1 - \delta\), we have, for all \(f \in \mathcal {F}\), that

$$\begin{aligned} R(f) \leqslant {\widehat{R}}_{\textbf{S}}(f) + 2 \mathfrak R^{\star }(\ell \circ \mathcal {F}) + M\sqrt{\dfrac{\chi _f(G)}{2n} \log \left( \dfrac{1}{\delta }\right) }, \end{aligned}$$
(3.5)

and

$$\begin{aligned} R(f) \leqslant {\widehat{R}}_{\textbf{S}}(f) + 2\widehat{\mathfrak R}_\textbf{S}^{\star }(\ell \circ \mathcal {F}) + 3M\sqrt{\dfrac{\chi _f(G)}{2n} \log \left( \dfrac{2}{\delta }\right) }, \end{aligned}$$
(3.6)

where \(\ell \circ \mathcal {F}=\{(x,y)\mapsto \ell (y,f(x)) ~|~ f \in \mathcal {F}\}\).

Proof

For any \(f\in \mathcal {F}\), \({\widehat{R}}_\textbf{S}(f)\) is an unbiased estimator of R(f), since the data points in the sample \(\textbf{S}\), though G-dependent, share the same marginal distribution. Hence, considering a G-dependent “ghost” sample \(\textbf{S}^{\prime } = ((x_1^{\prime }, y_1^{\prime }), \ldots , (x_n^{\prime }, y_n^{\prime }))\) that is independently generated from the same distribution as \(\textbf{S}\), we have

$$\begin{aligned} \sup _{f \in \mathcal {F}} (R(f)-{\widehat{R}}_\textbf{S}(f)) = \sup _{f \in \mathcal {F}} \left( \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }}{\widehat{R}}_{\textbf{S}^{\prime }}(f) -{\widehat{R}}_\textbf{S}(f) \right) = \sup _{f \in \mathcal {F}} \left( \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }}\left[ {\widehat{R}}_{\textbf{S}'}(f) -{\widehat{R}}_\textbf{S}(f)\right] \right) . \end{aligned}$$

Let \(\{ ( I_{j}, w_{j} ) \}_{j \in [J]}\) be a fractional independent vertex cover of the dependency graph G with \(\sum _j w_j = \chi _f(G)\). By Jensen’s inequality and the convexity of the supremum, we get

$$\begin{aligned}&\sup _{f \in \mathcal {F}} \left( \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }}\left[ {\widehat{R}}_{\textbf{S}^{\prime }}(f) -{\widehat{R}}_\textbf{S}(f)\right] \right) \leqslant \mathop {{}\mathbb {E}}_{\textbf{S}'} \left[ \sup _{f \in \mathcal {F}}({\widehat{R}}_{\textbf{S}^{\prime }}(f) -{\widehat{R}}_\textbf{S}(f)) \right] \\&\quad = \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }} \left[ \sup _{f \in \mathcal {F}} \left( \frac{1}{n} \sum _{i \in [n]}(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i )))\right) \right] \\&\quad = \frac{1}{n} \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{j=1}^J w_j\sum _{i \in I_j}(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i )))\right) \right] , \end{aligned}$$

where the second equality is due to the decomposition (2.3).

Then by the sub-additivity of the supremum, we have

$$\begin{aligned} \sup _{f \in \mathcal {F}} (R(f)-{\widehat{R}}_\textbf{S}(f)) \leqslant g(\textbf{S}), \end{aligned}$$

where \(g(\textbf{S})\) is defined by

$$\begin{aligned} g(\textbf{S}) = \frac{1}{n} \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }} \left[ \sum _{j=1}^J w_j \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i )))\right) \right] , \end{aligned}$$

and satisfies \(g(\textbf{S}) = \sum _{j} w_j g_{j} ( \textbf{S})\), where for each j,

$$\begin{aligned} g_j( \textbf{S}):=\frac{1}{n} \mathop {{}\mathbb {E}}_{\textbf{S}^{\prime }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i )))\right) \right] . \end{aligned}$$

Note that each function \(g_j\) depends only on the coordinates in \(I_j\) and changes by at most M/n when any single such coordinate is altered, so g satisfies (2.6) and is therefore a decomposable Lipschitz function. Then using Theorem 2.8, for all \(\delta \in (0,1)\), with probability at least \(1-\delta\), we have

$$\begin{aligned}&\sup _{f \in \mathcal {F}} (R(f)-{\widehat{R}}_\textbf{S}(f)) \leqslant \mathop {{}\mathbb {E}}_{\textbf{S}} [ g(\textbf{S}) ] + M \sqrt{\dfrac{\chi _f(G)}{2n}\log \left( \dfrac{1}{\delta }\right) } \\&\quad = \sum _{j=1}^J \frac{w_j}{n} \mathop {{}\mathbb {E}}_{\textbf{S}, \textbf{S}^{\prime }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i ))) \right) \right] +M\sqrt{\dfrac{\chi _f(G)}{2n}\log \left( \dfrac{1}{\delta }\right) }. \end{aligned}$$

Note that

$$\begin{aligned}&\mathop {{}\mathbb {E}}_{\textbf{S}, \textbf{S}^{\prime }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i ))) \right) \right] \\&\quad = \mathop {{}\mathbb {E}}_{\textbf{S}, \textbf{S}^{\prime }} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}\sigma _i(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i ))) \right) \right] , \end{aligned}$$

since the introduction of the Rademacher variables \(\varvec{\sigma }=(\sigma _i)_i\), uniformly taking values in \(\{-1,+1\}\), does not change the expectation. Indeed, for \(\sigma _i=+1\) the corresponding summand stays unaltered, and for \(\sigma _i=-1\) the corresponding summand reverses sign, which is the same as swapping \((x_i,y_i)\) and \((x^{\prime }_i,y^{\prime }_i)\) between \(\textbf{S}\) and \(\textbf{S}^{\prime }\). This swap has no effect on the overall expectation, since we take the expectation over \(\textbf{S}\) and \(\textbf{S}^{\prime }\), which are independent and identically distributed, and \(I_{j}\) is an independent set. Therefore, we have

$$\begin{aligned}&\sup _{f \in \mathcal {F}} (R(f)-{\widehat{R}}_\textbf{S}(f)) \nonumber \\&\quad \leqslant \sum _{j=1}^J \frac{w_j}{n} \mathop {{}\mathbb {E}}_{\textbf{S}, \textbf{S}^{\prime }} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}\sigma _i(\ell (y^{\prime }_i,f(x^{\prime }_i))-\ell (y_i,f ( x_i ))) \right) \right] + M\sqrt{\dfrac{\chi _f(G) }{2n} \log \left( \dfrac{1}{\delta }\right) } \nonumber \\&\quad \leqslant 2 \sum _{j=1}^J \frac{w_j}{n} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \mathop {{}\mathbb {E}}_{\textbf{S}} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}\sigma _i(\ell (y_i,f ( x_i ))) \right) \right] + M\sqrt{\dfrac{\chi _f(G) }{2n} \log \left( \dfrac{1}{\delta }\right) }, \end{aligned}$$
(3.7)

where the last step uses the sub-additivity of the supremum. Then in view of Definition 3.1 of \(\widehat{\mathfrak R}_{\textbf{S}}^{\star }\), we obtain

$$\begin{aligned} \sup _{f \in \mathcal {F}} (R(f)-{\widehat{R}}_\textbf{S}(f)) \leqslant 2\mathop {{}\mathbb {E}}_{\textbf{S}} \left[ \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\ell \circ \mathcal {F}) \right] +M\sqrt{\dfrac{\chi _f(G) }{2n}\log \left( \dfrac{1}{\delta }\right) }. \end{aligned}$$

Therefore the first bound (3.5) follows from the definition of the supremum, that is, for all \(f\in \mathcal {F}\),

$$\begin{aligned} R(f) - {\widehat{R}}_\textbf{S}(f) \leqslant \sup _{f \in \mathcal {F}} \left( R(f)-{\widehat{R}}_\textbf{S}(f)\right) . \end{aligned}$$

Note that

$$\begin{aligned} \widehat{\mathfrak R}_{\textbf{S}}^{\star }(\ell \circ \mathcal {F}) = \sum _{j} w_j \left( \frac{1}{n} \mathop {{}\mathbb {E}}_{\varvec{\sigma }} \left[ \sup _{f \in \mathcal {F}} \left( \sum _{i \in I_j}\sigma _i(\ell (y_i,f ( x_i ))) \right) \right] \right) \end{aligned}$$

satisfies the condition of Theorem 2.8 with bounded differences M/n, and therefore concentrates around its expectation \(\mathfrak R^{\star }(\ell \circ \mathcal {F})\). Using the union bound together with (3.5) then yields the second bound (3.6). \(\square\)

From Remark 3.2, Theorem 3.5 is a natural extension of the standard Rademacher generalization bounds for independently and identically distributed examples (see, for example, Mohri et al., 2018, Theorem 3.3), as in this case \(\chi _f(G)=1\).
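
To make the effect of the dependency structure concrete, the following snippet (with assumed values of n, M, and \(\delta\)) evaluates the confidence term of (3.6) for several values of \(\chi _f(G)\); the value \(\chi _f(G) = 1\) recovers the i.i.d. case.

```python
import math

def confidence_term(n, M, chi_f, delta):
    """Third term of bound (3.6): 3M * sqrt(χ_f(G) / (2n) * log(2/δ))."""
    return 3 * M * math.sqrt(chi_f / (2 * n) * math.log(2 / delta))

n, M, delta = 10_000, 1.0, 0.05
for chi_f in (1, 2, 5, 20):
    print(f"χ_f = {chi_f:2d}: confidence term = {confidence_term(n, M, chi_f, delta):.4f}")
```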

Remark 3.6

To use the symmetrization technique in Eq. (3.7), the variables involved in the same summation need to be independent. Consequently, when extending the concept of Rademacher complexity to scenarios involving interdependent variables, it becomes necessary to decompose the set of random variables into independent sets. In this context, a fractional independent vertex cover \(\{ ( I_{j}, w_{j} ) \}_{j}\) with \(\sum _{j} w_{j} = \chi _f(G)\) emerges as a pivotal tool for achieving an optimal decomposition, as \(\chi _f(G)\) is the minimum of \(\sum _{j} w_{j}\) over all fractional independent vertex covers.

3.2 Generalization bounds via algorithmic stability

This section establishes stability bounds for learning from graph-dependent data, using the concentration inequalities derived in the last section. Algorithmic stability has been used in the study of classification and regression to derive generalization bounds (Devroye & Wagner, 1979; Kearns & Ron, 1999; Kutin & Niyogi, 2002; Rogers & Wagner, 1978). A key advantage of stability bounds is that they are designed for specific learning algorithms, exploiting particular properties of the algorithms.

Since uniform stability was introduced in Bousquet and Elisseeff (2002), it has been among the most widely used notions of algorithmic stability. Given a training sample \(\textbf{S}\) of size n, for every \(i\in [n]\), removing the i-th element from \(\textbf{S}\) results in a sample of size \(n-1\), which is denoted by

$$\begin{aligned} \textbf{S}^{\setminus i} := ( (x_1, y_1), \ldots , (x_{i - 1}, y_{i - 1}), (x_{i + 1}, y_{i + 1}), \ldots , (x_n, y_n)). \end{aligned}$$

A learning algorithm \(\mathcal {A}\) is a function that maps the training set \(\textbf{S}\) onto a function \(f^{\mathcal {A}}_{\textbf{S}}: \mathcal{X} \rightarrow \mathcal{Y}\).

Definition 3.7

(Uniform stability, Bousquet & Elisseeff, 2002) Given an integer \(n > 0\), the learning algorithm \(\mathcal {A}\) is \(\beta _n\)-uniformly stable with respect to the loss function \(\ell\), if for any \(i\in [n]\), \(\textbf{S}\in (\mathcal{X} \times \mathcal {Y})^n\), and \((x, y) \in \mathcal{X} \times \mathcal{Y}\), it holds that

$$\begin{aligned} |\ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) - \ell (y, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x)) |\leqslant \beta _n. \end{aligned}$$
(3.8)

Intuitively, small perturbations of the training sample have little effect on the output of a stable learning algorithm.
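
As an illustration only, the following sketch empirically probes Definition 3.7 for ridge regression with the squared loss on synthetic data: it measures the largest leave-one-out change of the loss over a finite set of test points, which estimates the left-hand side of (3.8) on one particular sample rather than the worst-case constant \(\beta _n\).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression weights: (X^T X + λ I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 10.0
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
X_test, y_test = rng.normal(size=(20, d)), rng.normal(size=20)

w_full = ridge_fit(X, y, lam)
gap = 0.0
for i in range(n):                                    # drop the i-th training example
    w_i = ridge_fit(np.delete(X, i, axis=0), np.delete(y, i), lam)
    loss_full = (X_test @ w_full - y_test) ** 2       # squared loss of f_S on test points
    loss_i = (X_test @ w_i - y_test) ** 2             # squared loss of f_{S\i}
    gap = max(gap, float(np.max(np.abs(loss_full - loss_i))))
print("largest observed leave-one-out loss change:", round(gap, 4))
```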

Now, we begin our analysis by considering the difference between the empirical error and the generalization error of a learning algorithm \(f^{\mathcal {A}}_{\textbf{S}}\) trained over a G-dependent sample \(\textbf{S}\), formally defined by

$$\begin{aligned} \Phi _{\mathcal {A}}(\textbf{S}) := R(f^{\mathcal {A}}_{\textbf{S}}) - {\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}). \end{aligned}$$
(3.9)

The mapping \(\Phi _{\mathcal {A}}: (\mathcal{X} \times \mathcal{Y})^n \rightarrow \mathbb {R}\) will play a critical role in estimating \(R(f^{\mathcal {A}}_{\textbf{S}})\) via stability. We first bound the probability that \(\Phi _{\mathcal {A}}(\textbf{S})\) deviates from its expectation (Lemma 3.8), and then obtain an upper bound on the expected value of \(\Phi _{\mathcal {A}}(\textbf{S})\) (Lemma 3.10).

Lemma 3.8

Let \(\textbf{S}\) be a G-dependent sample of size n, and let \(\mathcal {A}\) be a \(\beta _n\)-uniformly stable learning algorithm. Suppose the loss function \(\ell\) is bounded by M. Then for any \(t>0\),

$$\begin{aligned} \mathop {{}\mathbb {P}}( \Phi _{\mathcal {A}}(\textbf{S}) - \mathop {{}\mathbb {E}}[\Phi _{\mathcal {A}}(\textbf{S})] \geqslant t ) \leqslant \exp \left( - \frac{2n^2t^2}{\Lambda (G) (4n\beta _n + M)^2} \right) . \end{aligned}$$

We prove the following lemma, which states that the Lipschitz coefficients of \(\Phi _{\mathcal {A}}(\cdot )\) are all bounded by \(4\beta _n + M/n\). Lemma 3.8 then follows from Lemma 3.9 and Corollary 2.15, since the Lipschitz coefficients all take the same value.

Lemma 3.9

Given a \(\beta _n\)-uniformly stable learning algorithm \(\mathcal {A}\), for any \(\textbf{S}, \textbf{S}' \in (\mathcal{X} \times \mathcal{Y})^n\) that differ only in one entry, we have

$$\begin{aligned} |\Phi _{\mathcal {A}}(\textbf{S}) - \Phi _{\mathcal {A}}(\textbf{S}') |\,\leqslant 4\beta _n + \frac{M}{n}. \end{aligned}$$

Proof

Lemma 3.9 was proved for the i.i.d. case by Bousquet and Elisseeff (2002); in fact, the proof remains valid in our dependent setting. Assume that \(\textbf{S}\) and \(\textbf{S}^{\prime }\) differ only in the i-th entry, and denote \(\textbf{S}^{\prime }\) as

$$\begin{aligned} \textbf{S}^i:= ( (x_1, y_1), \ldots , (x_{i - 1}, y_{i - 1}), (x_i^{\prime }, y_i^{\prime }), (x_{i + 1}, y_{i + 1}), \ldots , (x_n, y_n)), \end{aligned}$$

such that the marginal distribution of \((x_i^{\prime }, y_i^{\prime })\) is also \(\mathcal {D}\).

Notice that we do not require the data to be i.i.d.: the samples are dependent, but they share the same marginal probability distribution \(\mathcal {D}\). To begin with, we bound \(R (f^{\mathcal {A}}_{\textbf{S}}) - R (f^{\mathcal {A}}_{\textbf{S}^i})\) using the triangle inequality,

$$\begin{aligned} \big |R (f^{\mathcal {A}}_{\textbf{S}}) - R (f^{\mathcal {A}}_{\textbf{S}^i}) \big |&\leqslant \big |R (f^{\mathcal {A}}_{\textbf{S}}) - R ( f^{\mathcal {A}}_{\textbf{S}^{\setminus i}} ) \big |+ \big |R(f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}) - R (f^{\mathcal {A}}_{\textbf{S}^i}) \big |\\&= \big |\mathop {{}\mathbb {E}}_{\mathcal {D}} [\ell (y, f^{\mathcal {A}}_{\textbf{S}}(x))] - \mathop {{}\mathbb {E}}_{\mathcal {D}} [\ell (y, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x))] \big |\\&\quad + \big |\mathop {{}\mathbb {E}}_{\mathcal {D}} [\ell (y, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x))] - \mathop {{}\mathbb {E}}_{\mathcal {D}} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}^i}(x))] \big |\\&= \big |\mathop {{}\mathbb {E}}_{\mathcal {D}} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) - \ell (y, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x)) ] \big |\nonumber \\&\quad + \big |\mathop {{}\mathbb {E}}_{\mathcal {D}} [\ell (y, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x)) - \ell (y, f^{\mathcal {A}}_{\textbf{S}^i}(x))] \big |\leqslant 2\beta _n, \end{aligned}$$

where the last inequality is by the uniform stability defined by (3.8).

Then we bound \({\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) - {\widehat{R}}_{\textbf{S}^i} (f^{\mathcal {A}}_{\textbf{S}^i})\),

$$\begin{aligned}&n \big |{\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) - {\widehat{R}}_{\textbf{S}^i} (f^{\mathcal {A}}_{\textbf{S}^i}) \big |= \left|\sum _{ (x_j, y_j) \in \textbf{S}} \ell (y_j, f^{\mathcal {A}}_{\textbf{S}}(x_j)) - \sum _{ (x_j, y_j) \in \textbf{S}^i } \ell (y_j, f^{\mathcal {A}}_{\textbf{S}^i}(x_j)) \right|\\&\quad \leqslant \big |\ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i )) - \ell (y_i^\prime , f^{\mathcal {A}}_{\textbf{S}^i}(x_i^\prime )) \big |+ \sum _{j\ne i} \left|\ell (y_j, f^{\mathcal {A}}_{\textbf{S}}(x_j)) - \ell (y_j, f^{\mathcal {A}}_{\textbf{S}^i}(x_j)) \right|\\&\quad \leqslant \sum _{j\ne i} \left|\ell (y_j, f^{\mathcal {A}}_{\textbf{S}}(x_j)) - \ell (y_j, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x_j)) \right|+ \sum _{j\ne i} \left|\ell (y_j, f^{\mathcal {A}}_{\textbf{S}^{\setminus i}}(x_j)) - \ell (y_j, f^{\mathcal {A}}_{\textbf{S}^i}(x_j)) \right|\\&\qquad + \big |\ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i )) - \ell (y_i^\prime , f^{\mathcal {A}}_{\textbf{S}^i}(x_i^\prime )) \big |\,\leqslant 2n\beta _n + M, \end{aligned}$$

where the last inequality is by the uniform stability and the assumption that \(\ell\) is bounded by M.

Combining the above bounds, by the triangle inequality, we have that

$$\begin{aligned} \big |\Phi _{\mathcal {A}}(\textbf{S}) - \Phi _{\mathcal {A}}(\textbf{S}^i)\big |&= \big |( R (f^{\mathcal {A}}_{\textbf{S}}) - {\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) ) - ( R (f^{\mathcal {A}}_{\textbf{S}^i}) - {\widehat{R}}_{\textbf{S}^i} (f^{\mathcal {A}}_{\textbf{S}^i}) ) \big |\\&\leqslant \big |R (f^{\mathcal {A}}_{\textbf{S}}) - R (f^{\mathcal {A}}_{\textbf{S}^i}) \big |+ \big |{\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) - {\widehat{R}}_{\textbf{S}^i} (f^{\mathcal {A}}_{\textbf{S}^i}) \big |\leqslant 4\beta _n + \dfrac{M}{n}, \end{aligned}$$

which completes the proof. \(\square\)

We are now in a position to bound the expectation of \(\Phi _{\mathcal {A}}(\textbf{S})\).

Lemma 3.10

Let \(\textbf{S}\) be a G-dependent sample of size n. Suppose the maximum degree of G is \(\Delta = \Delta (G)\). Let \(\mathcal {A}\) be a \(\beta _i\)-uniformly stable learning algorithm for every \(i \in [n-\Delta , n]\), and \(\beta _{n, \Delta } = \max _{i \in [0,\Delta ]} \beta _{n-i}\). Then we have

$$\begin{aligned} \mathop {{}\mathbb {E}}[ \Phi _{\mathcal {A}}(\textbf{S}) ] \leqslant 2 \beta _{n, \Delta } (\Delta + 1). \end{aligned}$$

The proof of the lemma is based on iterative perturbations of the training sample \(\textbf{S}\), where each perturbation removes one data point from \(\textbf{S}\). The uniform stability of the algorithm guarantees that each perturbation changes the loss by at most \(\beta _{n,\Delta }\); at most \(\Delta +1\) removals are needed to eliminate the dependency between a data point and the rest of the sample, and applying them to both of the compared loss terms yields the factor \(2(\Delta +1)\).

We start with a technical lemma before the proof of Lemma 3.10.

Lemma 3.11

Under the same assumptions as in Lemma 3.10, we have

$$\begin{aligned} \max _{(x_i, y_i) \in \textbf{S}} \, \mathop {{}\mathbb {E}}_{(x, y), \textbf{S}} [\ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) - \ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i ))] \leqslant 2 \beta _{n, \Delta }(\Delta +1). \end{aligned}$$

Proof

For every \(i\in [n]\), let \(N_G(i)\) be the set of vertices adjacent to i in graph G, and suppose \(N_G^+(i) = N_G(i) \cup \{ i \} = \{j_1,\ldots ,j_{n_i}\}\) with \(j_{k-1}>j_k\). Define \(\textbf{S}^{(i,0)}=\textbf{S}\) and for every \(k \in [n_i]\), let \(\textbf{S}^{(i,k)}\) be obtained from \(\textbf{S}^{(i,k-1)}\) by removing the \(j_k\)-th entry. By the uniform stability of \(\mathcal {A}\), for any \((x,y)\in \mathcal X \times \mathcal Y\) and \(k\in [n_i]\), we have

$$\begin{aligned} \big |\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}(x))-\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}(x))\big |\leqslant \beta _{n, \Delta }. \end{aligned}$$

By a decomposition using a telescoping summation,

$$\begin{aligned} \ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) = \sum _{k=1}^{n_i} \left( \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}(x))-\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}(x)) \right) + \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x)). \end{aligned}$$

Similarly, we also get

$$\begin{aligned} \ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i )) = \sum _{k=1}^{n_i} \left( \ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}( x_i ))-\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}( x_i )) \right) + \ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )). \end{aligned}$$

Now we are ready to bound the difference

$$\begin{aligned}&\ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) - \ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i )) \\&\quad = \sum _{k=1}^{n_i} \left( (\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}(x))-\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}(x))) - (\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}( x_i ))-\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}( x_i ))) \right) \\&\qquad + \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x))-\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )) \\&\quad \leqslant \sum _{k=1}^{n_i} |\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}(x))-\ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}(x)) |\\&\qquad + \sum _{k=1}^{n_i} |\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,k)}}( x_i ))-\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,k-1)}}( x_i )) |+ \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x))-\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i ))\\&\quad \leqslant 2 n_i\beta _{n, \Delta } + \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x))-\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )). \end{aligned}$$

Therefore, by noting that \(n_i = |N_G^+(i) |\,\leqslant \Delta +1\) for all i, we have

$$\begin{aligned}&\mathop {{}\mathbb {E}}_{\textbf{S}, (x, y)} [\ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) - \ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i ))] \\&\quad \leqslant \mathop {{}\mathbb {E}}_{\textbf{S}, (x, y)} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x)) - \ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )) ] + 2 n_i \beta _{n, \Delta } \\&\quad \leqslant \mathop {{}\mathbb {E}}_{\textbf{S}, (x, y)} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x)) - \ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )) ] + 2 \beta _{n, \Delta }(\Delta +1) \\&\quad = \mathop {{}\mathbb {E}}_{\textbf{S}, (x, y)} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x)) ] - \mathop {{}\mathbb {E}}_\textbf{S}[\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )) ] + 2 \beta _{n, \Delta }(\Delta +1) \\&\quad = \mathop {{}\mathbb {E}}_{\textbf{S}^{(i,n_i)}, (x, y)} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}(x)) ] - \mathop {{}\mathbb {E}}_{\textbf{S}^{(i,n_i)},(x_i,y_i)} [\ell (y_i, f^{\mathcal {A}}_{\textbf{S}^{(i,n_i)}}( x_i )) ] + 2 \beta _{n, \Delta }(\Delta +1) \\&\quad = 2 \beta _{n, \Delta }(\Delta +1), \end{aligned}$$

where the last equality is because \((x_i, y_i)\) and \((x, y)\) are independent of \(\textbf{S}^{(i,n_i)}\), and have the same distribution. \(\square\)

Now we are ready to prove Lemma 3.10.

Proof of Lemma 3.10

From the definition of \(\Phi _{\mathcal {A}}(\textbf{S})\) in (3.9), we have

$$\begin{aligned} \mathop {{}\mathbb {E}}_{\textbf{S}} [ \Phi _{\mathcal {A}}(\textbf{S}) ]&= \mathop {{}\mathbb {E}}_{\textbf{S}} \left[ \mathop {{}\mathbb {E}}_{(x, y)} [\ell (y, f^{\mathcal {A}}_{\textbf{S}}(x))] - \frac{1}{n} \sum _{i = 1}^n \ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i )) \right] \\&= \frac{1}{n} \sum _{i = 1}^n \mathop {{}\mathbb {E}}_{\textbf{S}, (x, y)} [ \ell (y, f^{\mathcal {A}}_{\textbf{S}}(x)) - \ell (y_i, f^{\mathcal {A}}_{\textbf{S}}( x_i )) ] \leqslant 2 \beta _{n, \Delta }(\Delta +1), \end{aligned}$$

where the last inequality is by Lemma 3.11. \(\square\)

Combining Lemmas 3.8 and 3.10 gives the following theorem, which upper-bounds the generalization error of learning algorithms trained over G-dependent training sets of size n.

Theorem 3.12

Let \(\textbf{S}\) be a sample of size n with dependency graph G. Suppose the maximum degree of G is \(\Delta\). Assume that the learning algorithm \(\mathcal {A}\) is \(\beta _i\)-uniformly stable for all \(i \in [n - \Delta , n]\). Suppose the loss function \(\ell\) is bounded by M. Let \(\beta _{n, \Delta } = \max _{i \in [0,\Delta ]} \beta _{n-i}\). For any \(\delta \in (0, 1)\), with probability at least \(1 - \delta\), it holds that

$$\begin{aligned} R(f^{\mathcal {A}}_{\textbf{S}}) \leqslant {\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) + 2 \beta _{n, \Delta } (\Delta + 1) + \frac{4n\beta _n + M}{n} \sqrt{\dfrac{\Lambda (G)}{2} \log \left( \dfrac{1}{\delta }\right) }. \end{aligned}$$

Remark 3.13

It is well known that for many learning algorithms \(\beta _n=O(1/n)\) (see, for example, Bousquet and Elisseeff 2002). In this case, \(\beta _{n, \Delta } (\Delta + 1) \leqslant \beta _{n-\Delta } (\Delta + 1) = O(\frac{\Delta }{n - \Delta })\), which vanishes asymptotically if \(\Delta = o(n)\). The term \(O\left( \sqrt{\Lambda (G)}/n\right)\) also vanishes asymptotically if \(\Lambda (G)=o(n^2)\). We also observe that if the training data are i.i.d., Theorem 3.12 reduces to the standard stability bound of Bousquet and Elisseeff (2002) by setting \(\Delta = 0\), \(\beta _{n, \Delta } = \beta _n\), and \(\Lambda (G) = n\).
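To make the asymptotics in Remark 3.13 concrete, the following sketch evaluates the right-hand side of Theorem 3.12 under the illustrative assumption \(\beta _n = c/n\); the constants, the estimate of \(\Lambda (G)\), and the helper name stability_bound are our own placeholders, not part of the original analysis.

```python
import math

def stability_bound(emp_risk, n, max_deg, forest_complexity, M, delta, c=1.0):
    """Right-hand side of Theorem 3.12 under the assumed rate beta_k = c / k.

    emp_risk          : empirical risk of the learned hypothesis
    n                 : sample size
    max_deg           : maximum degree Delta of the dependency graph G
    forest_complexity : an upper estimate of Lambda(G)
    M                 : uniform bound on the loss
    delta             : confidence parameter in (0, 1)
    """
    beta = lambda k: c / k                               # assumed stability rate
    beta_n_delta = max(beta(n - i) for i in range(max_deg + 1))
    stability_term = 2 * beta_n_delta * (max_deg + 1)
    concentration_term = (4 * n * beta(n) + M) / n * math.sqrt(
        forest_complexity / 2 * math.log(1 / delta))
    return emp_risk + stability_term + concentration_term

# In the i.i.d. case (Delta = 0, Lambda(G) = n) this recovers the classical rate.
print(stability_bound(emp_risk=0.10, n=10_000, max_deg=0,
                      forest_complexity=10_000, M=1.0, delta=0.05))
```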

4 Applications

In this section, we present three practical applications related to learning with interdependent data, for which we use the methodology presented in the previous sections to derive generalization bounds.

4.1 Bipartite ranking

The goal of bipartite ranking is to assign higher scores to instances of the positive class than to those of the negative class (Agarwal & Niyogi, 2009; Freund et al., 2003). This framework corresponds to many applications in information retrieval, such as recommender systems (Sidana et al., 2021) and uplift modeling (Betlei et al., 2021).

It has attracted a lot of interest in recent years, since the empirical ranking error of a scoring function \(h:\mathcal {X}\rightarrow \mathbb {R}\) over a training set \(T:=\left( x_i,y_i\right) _{1\leqslant i\leqslant m}\) with \(y_i\in \{-1,+1\}\), defined by

$$\begin{aligned} \widehat{\mathcal {L}}_T(h)=\frac{1}{m_{-}m_{+}} \sum _{i:y_i=1}\sum _{j:y_j=-1}\textbf{1}_{\left\{ h( x_i )\leqslant h(x_j)\right\} }, \end{aligned}$$
(4.1)

is equal to one minus the Area Under the ROC Curve (AUC) of h (see, for example, Cortes and Mohri 2004), where \(m_-:=\sum _{i=1}^m \textbf{1}_{\left\{ y_i=-1\right\} }\) and \(m_+:=\sum _{i=1}^m \textbf{1}_{\left\{ y_i=1\right\} }\) are the number of negative and positive instances in the training set T respectively.
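The empirical ranking error (4.1) can be computed directly by enumerating positive–negative pairs. The sketch below, with an arbitrary toy scorer standing in for h, also checks the stated identity with one minus the AUC (exact whenever there are no ties among the scores).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bipartite_ranking_error(scores, labels):
    """Empirical ranking error (4.1): fraction of positive-negative pairs the
    scorer orders incorrectly (ties count as errors, matching 1{h(x_i) <= h(x_j)})."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    return np.mean(pos[:, None] <= neg[None, :])   # all m_+ * m_- comparisons

rng = np.random.default_rng(0)
labels = np.array([1] * 50 + [-1] * 150)
scores = rng.normal(size=200) + 0.5 * labels       # toy scorer h
print(bipartite_ranking_error(scores, labels), 1 - roc_auc_score(labels, scores))
```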

For two instances \((x,y), (x^{\prime },y^{\prime })\) in T from different classes, that is, with \(y\ne y^{\prime }\), consider the (unordered) pair of examples \(\{ (x,y), (x^{\prime },y^{\prime }) \}\) and the classifier of pairs f associated with a scoring function h, defined by

$$\begin{aligned} f(x,x^{\prime })=h(x)-h(x^{\prime }), \end{aligned}$$

With this notation, the bipartite ranking loss (4.1) of h over T can be rewritten as the classification error of the associated f over the pairs of instances from different classes,

$$\begin{aligned} \widehat{R}_{\textbf{S}}(f) = \widehat{\mathcal {L}}_T(h) = \frac{1}{n} \sum _{\{ (x, y), (x^{\prime }, y^{\prime }) \} \in \textbf{S}}\textbf{1}_{\left\{ z_{y,y^{\prime }}f(x,x^{\prime })\leqslant 0\right\} }, \end{aligned}$$
(4.2)

where \(n=m_{-}m_{+}\),

$$\begin{aligned} \textbf{S}:= \left\{ \{ (x, y), (x^{\prime },y^{\prime }) \} : (x, y) \in T, (x^{\prime },y^{\prime }) \in T, y\ne y^{\prime } \right\} \end{aligned}$$

is the set of n unordered pairs of examples from different classes in T, and

$$\begin{aligned} z_{y,y^{\prime }} :=2\textbf{1}_{\left\{ y-y^{\prime }>0\right\} }-1. \end{aligned}$$

Note that \(z_{1,-1} = 1\) and \(z_{-1,1} = -1\).

Let

$$\begin{aligned} T^+ := \{ ( x^+_i, 1): i\in [m_+]\} \quad \text{ and }\quad T^- := \{ (x^-_j, -1): j\in [m_-]\} \end{aligned}$$

be the sets of positive and negative instances of T respectively. Then \(T=T^+\cup T^-\). Without loss of generality, we assume that \(m_+ \leqslant m_-\), which corresponds to the usual situation in information retrieval, where there are fewer positive (relevant) instances than negative (irrelevant) ones.

In this case, an independent cover of the corresponding dependency graph of \(\textbf{S}\) is \(\{ ( I_k, 1 ) \}_{k\in \{1,\ldots ,m_-\}}\), where

$$\begin{aligned} I_k = \left\{ \left( x_i^+,x^-_{\sigma _{k,m_-}(i)}\right) : i\in [m_+]\right\} , \end{aligned}$$

with \(\sigma _{k,m_-}\) denoting the permutation that is defined by

$$\begin{aligned} \sigma _{k,m_-}(i)={\left\{ \begin{array}{ll}(k+i-1) (\textrm{mod}\, m_-), &{}\text {if}\; (k+i-1)(\textrm{mod}\ m_-)\ne 0 \\ m_-, &{} \text {otherwise.}\end{array}\right. } \end{aligned}$$
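A minimal sketch of this cover construction (1-based indices as in the text): the cyclic shift \(\sigma _{k,m_-}\) pairs each positive example with a distinct negative one, so every \(I_k\) is an independent set, and the \(m_-\) sets together list each positive–negative pair exactly once.

```python
def sigma(k, i, m_minus):
    """The permutation sigma_{k, m_-}(i) defined above (1-based indices)."""
    r = (k + i - 1) % m_minus
    return r if r != 0 else m_minus

def independent_covers(m_plus, m_minus):
    """I_k pairs positive example i with negative example sigma_{k, m_-}(i)."""
    return [{(i, sigma(k, i, m_minus)) for i in range(1, m_plus + 1)}
            for k in range(1, m_minus + 1)]

covers = independent_covers(m_plus=2, m_minus=3)   # the setting of Fig. 7
for I in covers:
    # no two pairs in I_k share a positive or a negative example
    assert len({p for p, _ in I}) == len({q for _, q in I}) == len(I)
all_pairs = [pair for I in covers for pair in I]
assert len(set(all_pairs)) == len(all_pairs) == 2 * 3   # each pair listed once
print(covers)
```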

Figure 7 illustrates the dependency graph of a bipartite ranking problem with \(m_+=2\) positive examples and \(m_-=3\) negative instances as well as its corresponding independent covers represented by dotted ellipsoids.

Fig. 7

The graph on the right is a dependency graph corresponding to a bipartite ranking problem with \(m_+=2\) positive examples \(T^+=\{ (x_1^+, 1), (x_2^+, 1) \}\); and \(m_-=3\) negative ones, \(T^-=\{ (x_1^-, -1), (x_2^-, -1), (x_3^-, -1)\}\). Each pair of examples from different classes corresponds to an edge of the complete bipartite graph \(K_{2, 3}\) on the left, and is represented by a vertex of the dependency graph on the right. Two pairs are adjacent in the dependency graph if they have an example in common. Fractional independent covers \(\{ (I_k, 1) \}_{1\leqslant k\leqslant 3}\) are shown by dotted ellipsoids

Remark 4.1

In bipartite ranking, the dependent pairs of instances correspond to the edges of a complete bipartite graph \(K_{m_+, m_-}\), since each pair consists of one positive and one negative instance; see Fig. 7 for an illustration.

Given a graph G, the line graph of G has the edges of G as its vertices, with two vertices adjacent if the corresponding edges of G share a vertex. The dependency graph of the set of pairs \(\textbf{S}\) is therefore the line graph of \(K_{m_+, m_-}\), known as the \(m_+ \times m_-\) Rook's graph, which is the Cartesian product of two complete graphs.
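One way to check this structure concretely is the following sketch using networkx (not part of the original analysis): build \(K_{m_+,m_-}\), take its line graph, and verify that it has \(m_+ m_-\) vertices and is \((m_+ + m_- - 2)\)-regular, as the Rook's graph should be.

```python
import networkx as nx

m_plus, m_minus = 2, 3
K = nx.complete_bipartite_graph(m_plus, m_minus)   # examples as vertices, pairs as edges
L = nx.line_graph(K)                               # dependency graph of the pairs

assert L.number_of_nodes() == m_plus * m_minus
# two pairs are adjacent iff they share an example, hence the graph is
# (m_+ - 1) + (m_- - 1) regular: the m_+ x m_- Rook's graph
assert all(deg == m_plus + m_minus - 2 for _, deg in L.degree())
print(sorted(L.nodes()))
```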

For bipartite ranking, it is easy to check that

$$\begin{aligned} \frac{\chi _f}{n} = \frac{\max (m_-,m_+)}{ m_-m_+ } = \frac{1}{ \min (m_-,m_+) }. \end{aligned}$$

Therefore, by Theorems 3.3 and 3.5, together with Ledoux and Talagrand's contraction lemma (Ledoux & Talagrand, 1991, Corollary 3.17, p. 78), which extends to fractional Rademacher complexities and gives \(\widehat{\mathfrak {R}}^\star _\textbf{S}({\ell \circ \mathcal {F}})=2\widehat{\mathfrak {R}}^\star _\textbf{S}({\mathcal {F}})\), we can bound the generalization error of bipartite ranking as follows.

Corollary 4.2

Let T be a training set composed of \(m_+\) positive instances and \(m_-\) negative ones, and let \(\textbf{S}\) be the set of unordered pairs of examples from different classes in T. Then for any function from \(\mathcal {F}=\{f:(x,x^{\prime })\mapsto \langle \textbf{w},\phi (x)-\phi (x^{\prime })\rangle : \Vert \textbf{w}\Vert \leqslant B\}\), where \(\phi\) is a feature mapping such that \(\Vert \phi (x)-\phi (x^{\prime })\Vert \leqslant \Gamma\) for all \((x,x^{\prime })\), and for any \(\delta \in (0,1)\), with probability at least \(1-\delta\), we have

$$\begin{aligned} R(f) \leqslant {\widehat{R}}_\textbf{S}(f) + \frac{4B\Gamma }{\sqrt{m}} + 3\sqrt{\dfrac{ 1}{2m} \log \left( \dfrac{2}{\delta } \right) }, \end{aligned}$$

where \(m = \min (m_-,m_+)\).
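As an illustration, the right-hand side of Corollary 4.2 can be evaluated numerically; the constants below (B, \(\Gamma\), the class sizes, and the empirical risk) are arbitrary placeholders.

```python
import math

def bipartite_ranking_bound(emp_risk, B, Gamma, m_plus, m_minus, delta):
    """Right-hand side of Corollary 4.2."""
    m = min(m_plus, m_minus)
    return (emp_risk
            + 4 * B * Gamma / math.sqrt(m)
            + 3 * math.sqrt(math.log(2 / delta) / (2 * m)))

# The rate is driven by the size of the minority class, not by the m_+ * m_- pairs.
print(bipartite_ranking_bound(emp_risk=0.12, B=1.0, Gamma=2.0,
                              m_plus=500, m_minus=4500, delta=0.05))
```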

4.2 Multi-class classification

We now address the problem of mono-label multi-class classification, where the output space is a discrete set of labels \(\mathcal {Y} = [K]\) with K classes. For the sake of presentation, we denote an element of \(\mathcal {X} \times \mathcal {Y}\) as \(x^y:= (x, y)\). For a class of predictor functions \(\mathcal {H} = \{h: \mathcal {X} \times \mathcal {Y} \rightarrow \mathbb {R}\}\), let \(\ell\) be the instantaneous loss of \(h\in \mathcal {H}\) on example \(x^y\) defined by

$$\begin{aligned} \ell (y,h(x^y))=\frac{1}{K-1} \sum _{y^{\prime }\in \mathcal {Y}\setminus \{y\}}\textbf{1}_{\left\{ h(x^y)\leqslant h(x^{y'})\right\} }. \end{aligned}$$

For any example x with true class y, this loss is the fraction of other classes \(y^{\prime }\ne y\) to which h assigns a score at least as high as to the true class. For a training set \(T=\left( x_i^{y_i} \right) _{1\leqslant i\leqslant m}\) of size m, the corresponding empirical error of a function \(h\in \mathcal {H}\) is

$$\begin{aligned} \widehat{\mathcal {L}}_T(h) = \frac{1}{m(K-1)}\sum _{i=1}^m\sum _{y^{\prime }\in \mathcal {Y}\setminus \{y_i\}}\textbf{1}_{\left\{ h(x_i^{y_i})\leqslant h(x_i^{y^{\prime }})\right\} }. \end{aligned}$$
(4.3)

Many multi-class classification algorithms like Adaboost.MR (Schapire & Singer, 1999) or the multiclass SVM (Weston & Watkins, 1998) aim to minimize a convex surrogate function of this loss.
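Given a score matrix with entries \(h(x_i^{k})\), the empirical loss (4.3) is a simple pairwise count; a minimal sketch follows, with labels encoded as \(0,\ldots ,K-1\) for convenience.

```python
import numpy as np

def multiclass_ranking_loss(H, y):
    """Empirical loss (4.3).

    H : (m, K) array with H[i, k] = h(x_i^k)
    y : (m,) array of true labels in {0, ..., K-1}
    """
    m, K = H.shape
    true_scores = H[np.arange(m), y][:, None]      # h(x_i^{y_i})
    violations = true_scores <= H                  # includes the true class itself
    # remove the m diagonal comparisons (true class against itself)
    return (violations.sum() - m) / (m * (K - 1))

rng = np.random.default_rng(0)
m, K = 100, 4
H = rng.normal(size=(m, K))
y = rng.integers(K, size=m)
print(multiclass_ranking_loss(H, y))   # close to 1/2 for random scores
```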

Similarly to the bipartite ranking case, we consider pairs \((x^y,x^{y'})\) with \(y'\in \mathcal {Y}{\setminus } \{y\}\), formed by an example labeled with its true class, \(x^y\), and the same example labeled with each of the other classes, \(x^{y'}\). The classifier of pairs f associated with a function \(h\in \mathcal {H}\) is defined by

$$\begin{aligned} f(x^y,x^{y^{\prime }})=h(x^y)-h(x^{y^{\prime }}). \end{aligned}$$

Then the empirical loss of a function h over T can be written as the classification error of the associated f,

$$\begin{aligned} \widehat{R}_{\textbf{S}}(f) = \widehat{\mathcal {L}}_T(h) = \frac{1}{n} \sum _{(x^y,x^{y^{\prime }})\in \textbf{S}}\textbf{1}_{\left\{ z_{y,y^{\prime }}f(x^y,x^{y^{\prime }})\leqslant 0\right\} }, \end{aligned}$$
(4.4)

where \(\textbf{S}=\{(x^y,x^{y^{\prime }}): x^y\in T,\ y^{\prime }\in \mathcal {Y}\setminus \{y\}\}\) is of size \(n=m(K-1)\), and \(z_{y,y^{\prime }}=2\textbf{1}_{\left\{ y>y^{\prime }\right\} }-1\). In this case, an independent cover of the corresponding dependency graph of \(\textbf{S}\) could be \(\{ ( I_k, 1 ) \}_{ k\in \{1,\ldots ,K-1\} }\), where

$$\begin{aligned} I_k=\left\{ \left( x_i^1,x_i^{k+1}\right) : i\in [m]\right\} , \end{aligned}$$

with the corresponding fractional chromatic number \(\chi _f=K-1\).

Fig. 8

The dependency graph for the multi-class classification problem with \(m=4\) examples and \(K=4\) classes is a vertex-disjoint union of 4 triangles. Fractional independent covers \(\{ (I_k, 1) \}_{1\leqslant k\leqslant 3}\) are shown by dotted ellipsoids

Figure 8 illustrates a dependency graph for the multi-class classification problem with \(m=4\) and \(K=4\) as well as the corresponding fractional independent covers represented by dotted ellipsoids.

Similarly to the bipartite ranking case, we obtain the following corollary from the previous results.

Corollary 4.3

Let T be a training set of K-label instances and of size m, and let \(\textbf{S}\) be the set of non-redundant pairs of examples from different classes in T. Then for any function from \(\mathcal {F}=\{f:(x,x^{\prime })\mapsto \langle \textbf{w},\phi (x)-\phi (x^{\prime })\rangle : \Vert \textbf{w}\Vert \leqslant B\}\), where \(\phi\) is a feature mapping such that \(\Vert \phi (x)-\phi (x^{\prime })\Vert \leqslant \Gamma\) for all \((x,x^{\prime })\), and for any \(\delta \in (0,1)\), with probability at least \(1-\delta\), we have

$$\begin{aligned} R(f) \leqslant {\widehat{R}}_\textbf{S}(f) + \frac{4B\Gamma }{\sqrt{m}}+ 3\sqrt{\dfrac{ 1}{2m} \log \left( \dfrac{2}{\delta }\right) }. \end{aligned}$$

Remark 4.4

The loss function (4.3) we considered is normalized by \(K-1\), which yields a result comparable to the binary classification case. For a margin-based loss such as \(\ell (y,h(x^y))=h(x^y)-\displaystyle \max _{y'\ne y}h(x^{y'})\), the Rademacher complexity term instead grows with the number of classes K.

4.3 Learning from m-dependent data

Here we consider learning from m-dependent data and give a practical learning scenario. Suppose that there are linearly aligned locations, for example, real estate along a street. Let \(y_i\) be the observation at location i, for example, the house price, and let \(x_i\) denote the random variable modeling the geographical effect at location i. Assume that the \(x_i\)'s are mutually independent and that each \(y_i\) is geographically influenced by a neighborhood of size at most \(2q+1\). The goal is to learn to predict y from a sample \(\{( ( x_{i-q},\ldots ,x_i,\ldots ,x_{i+q} ), y_i )\}_{i\in [n]}\), where n is the size of the sample. See Fig. 9 for an example; a generating sketch is given below.

Fig. 9

Each observation \(y_i\) is geographically determined by a set of variables \(\{ x_j \}_{i -2 \leqslant j \leqslant i + 2}\) of size 5. The sample \(\{( \{ x_j \}_{i -2 \leqslant j \leqslant i + 2}, y_i )\}_{i}\) is 4-dependent

This model accounts for the impact of neighboring locations on house prices. Similar scenarios are frequently considered in spatial econometrics and in moving-average processes in time series analysis; see Anselin (2013) for more examples.
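A minimal sketch of the data-generating process described above; the regression function (a window mean) and the noise level are illustrative assumptions. Two examples whose indices differ by more than 2q share no \(x_j\), so the resulting sample is 2q-dependent.

```python
import numpy as np

def make_windowed_sample(n, q, noise=0.1, seed=0):
    """Sample {((x_{i-q}, ..., x_{i+q}), y_i)}_{i in [n]} as in Fig. 9."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n + 2 * q)                          # independent location effects
    windows = np.stack([x[i:i + 2 * q + 1] for i in range(n)])
    y = windows.mean(axis=1) + noise * rng.normal(size=n)   # illustrative target
    return windows, y      # examples i, j with |i - j| > 2q are independent

windows, y = make_windowed_sample(n=1000, q=2)
print(windows.shape, y.shape)   # (1000, 5) (1000,)
```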

The above application is a special case of m-dependence. A sequence of random variables \(\{ X_i \}_{i = 1}^n\) is said to be f(n)-dependent if any two subsets of variables separated by a distance greater than f(n) are independent. This model was introduced by Hoeffding and Robbins (1948) and has been studied extensively (see, for example, Stein 1972; Chen 1975); it is the canonical application of results based on the dependency graph model. The special case \(f(n) = m\) gives the following m-dependent model.

Definition 4.5

(m -dependence, Hoeffding and Robbins 1948) A sequence of random variables \(\{ X_i \}_{i = 1}^n\) is m-dependent for some \(m \geqslant 1\) if \(\{ X_j \}_{j = 1}^i\) and \(\{ X_j \}_{j = i+m+1}^n\) are independent for all \(i > 0\).

Fig. 10

A tree-partition of the dependency graph for 2-dependent variables

Figure 10 illustrates a dependency graph G for a 2-dependent sequence \(\{ X_i \}_{i}\) and its tree-partition: an m-dependent sequence is divided into blocks of size m, and these blocks are mapped, in order, to the vertices of a path of length \(\lceil n/m \rceil - 1\). Since \(\lceil n/m \rceil - 1 \leqslant n/m\), this tree-partition shows that \(\Lambda (G) \leqslant \left( \lceil n/m \rceil - 1\right) ( m+m )^2 + m^2 \leqslant 4mn+m^2\).

Combining Theorem 3.12 and the above estimate of forest complexity gives the following.

Corollary 4.6

Let \(\textbf{S}\) be an m-dependent sample of size n. Assume that the learning algorithm \(\mathcal {A}\) is \(\beta _i\)-uniformly stable for any \(i \in [n - 2m, n]\). Suppose the loss function \(\ell\) is bounded by M. For any \(\delta \in (0, 1)\), with probability at least \(1 - \delta\), it holds that

$$\begin{aligned} R(f^{\mathcal {A}}_{\textbf{S}}) \leqslant {\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) + 2 \beta _{n, 2m} (2m + 1) + ( 4n\beta _n + M ) \sqrt{\dfrac{2 m }{ n } \left( 1 + \frac{m}{n} \right) \log \left( \dfrac{1}{\delta }\right) }. \end{aligned}$$

Choose any uniformly stable learning algorithm with \(\beta _n = O(1/n)\) from Bousquet and Elisseeff (2002), such as a regularization algorithm in an RKHS, and apply it to the house price prediction problem above. Then for any fixed q, with high probability, Corollary 4.6 gives that \(R(f^{\mathcal {A}}_{\textbf{S}}) \leqslant {\widehat{R}}_\textbf{S}(f^{\mathcal {A}}_{\textbf{S}}) + O\left( \sqrt{\dfrac{ 1}{ n } \log \left( \dfrac{1}{\delta }\right) }\right)\) for sufficiently large n, matching the stability bound of the i.i.d. case in Bousquet and Elisseeff (2002).
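Under the same illustrative rate \(\beta _n = c/n\), the right-hand side of Corollary 4.6 can be evaluated as follows; the constants are placeholders, and the dependence parameter m corresponds to 2q in the house price scenario.

```python
import math

def m_dependent_bound(emp_risk, n, m, M, delta, c=1.0):
    """Right-hand side of Corollary 4.6 under the assumed rate beta_k = c / k."""
    beta = lambda k: c / k
    beta_n_2m = max(beta(n - i) for i in range(2 * m + 1))   # beta_{n, 2m}
    stability_term = 2 * beta_n_2m * (2 * m + 1)
    concentration_term = (4 * n * beta(n) + M) * math.sqrt(
        2 * m / n * (1 + m / n) * math.log(1 / delta))
    return emp_risk + stability_term + concentration_term

for n in (10**3, 10**4, 10**5):    # the excess risk shrinks roughly like 1/sqrt(n)
    print(n, round(m_dependent_bound(emp_risk=0.0, n=n, m=4, M=1.0, delta=0.05), 4))
```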

5 Concluding remarks

In this survey, we presented various McDiarmid-type concentration inequalities for functions of graph-dependent random variables. These concentration bounds were then used to obtain generalization error bounds for learning from graph-dependent samples via fractional Rademacher complexity and algorithm stability.

We also included several practical applications of the methodology. Note that in our applications the sample contains dependent data sharing the same marginal distribution, but this is not required: the concentration inequalities are derived without this assumption and can therefore be applied to situations where the distribution changes over time.

The dependency graphs used in our applications exhibit certain structural regularities, which is why we obtain explicit and simple bounds. In other settings, meaningful bounds can still be obtained as long as suitable estimates of the fractional chromatic number or the forest complexity are available. We leave it to interested readers to investigate further applications.

There are various new directions that can be explored.

  1.

    For dependent data, there are other definitions of the generalization error, such as those specified in Kuznetsov and Mohri (2017) and Mohri and Rostamizadeh (2008, 2010). The connections between these and the one we used have been discussed in Mohri and Rostamizadeh (2008, 2010). It is a natural question whether our results can be adapted to these definitions.

  2.

    The dependency graph model we consider requires variables in disjoint non-adjacent subgraphs to be independent. There are some newly introduced dependency graph models such as weighted dependency graphs (Dousse & Féray, 2019; Féray, 2018), and the combination of mixing coefficients and dependency graphs (Lampert et al., 2018; Isaev et al., 2021). It would be interesting to use these new dependency graphs to obtain generalization bounds for learning under different dependent settings.

  3.

    Recently, there have been new breakthroughs establishing sharper stability bounds (Bousquet et al., 2020; Feldman & Vondrak, 2019). It would be interesting to follow these results and obtain sharper stability bounds for learning under graph-dependence.