1 Introduction

Operator semigroups are ubiquitous objects in pure and applied mathematics. It is well-known that many function spaces, such as the Besov spaces on \(\mathbb {R}^n\), can be characterized using, for example, the heat kernel [3, 31]. Recent work has generalized these results to function spaces on more general domains; for instance, see [2, 17]. More abstractly, a substantial amount of classical harmonic analysis in \(\mathbb {R}^n\) can be pushed through to the setting of a measure space equipped with a diffusion semigroup, as found in Stein’s book [29]. A limitation of the theory developed in that book is the absence of an explicit geometry; though the statements of maximal theorems, Littlewood–Paley theorems and interpolation theorems make sense in this general setting, basic notions such as Lipschitz functions cannot be defined, since these require a metric on the underlying measure space.

A natural way of introducing a geometry in such an abstract setting is to use the semigroup itself to define a distance. This is the approach taken in the theory of diffusion maps [6] and related work [15, 16]. If the kernel of the semigroup at time t is denoted by \(a_t(x,y)\), then the diffusion distance at time t is defined as \(||a_t(x,\cdot ) - a_t(y,\cdot ) ||_{L^2(d\mu )}\), for an appropriate measure \(\mu \). This conceptually meaningful distance has found wide application in machine learning, where the kernel \(a_t(x,y)\) is a power of an affinity matrix measuring the relationship between two points in a data set.
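In the machine-learning setting just mentioned, the construction can be sketched in a few lines of code. This is a toy illustration only: the sample points, the Gaussian affinity and its bandwidth are assumptions made here for concreteness, not part of the theory.

```python
import math

# Hypothetical one-dimensional data set; the Gaussian affinity and the
# bandwidth eps below are illustrative choices.
points = [0.0, 0.1, 0.2, 1.0]
eps = 0.1

# Affinity matrix, row-normalized so that each row of a1 is a probability
# vector (a Markov kernel at time t = 1).
W = [[math.exp(-(p - q) ** 2 / eps) for q in points] for p in points]
a1 = [[w / sum(row) for w in row] for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# The kernel at time t = 2 is a power of the affinity matrix, as in the text.
a2 = matmul(a1, a1)

def diffusion_distance(a, i, j):
    # L2 distance between the kernel rows a_t(x_i, .) and a_t(x_j, .),
    # here taken against counting measure on the sample.
    return math.sqrt(sum((a[i][k] - a[j][k]) ** 2 for k in range(len(a))))

d_near = diffusion_distance(a2, 0, 1)  # x = 0.0 vs. y = 0.1
d_far = diffusion_distance(a2, 0, 3)   # x = 0.0 vs. y = 1.0
```

As expected, nearby points have nearly identical kernel rows, so `d_near` is much smaller than `d_far`.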

The idea of defining distances using semigroups is the starting point for the present work. However, the ground distances \(D_\alpha (x,y)\) we introduce in Sect. 2 are not defined at a fixed scale, but incorporate all scales at once. Though the parameter \(\alpha > 0\) controls the weight placed at each scale, all scales are present. In a variety of examples, and for appropriate ranges of \(\alpha \), we will show that this distance is equivalent to a “snowflake” of the intrinsic distance \(\rho (x,y)\) on the underlying space; that is, the distance \(\rho (x,y)\) raised to a power less than 1 [18]. In the examples we consider, this power is a constant times \(\alpha \), and it follows that we must restrict \(\alpha \) to be less than 1.

Next, we consider the space \(\Lambda _{\alpha }\) of functions that are Lipschitz with respect to the distance \(D_\alpha (x,y)\). In the examples where \(D_\alpha (x,y)\) is a snowflake of an intrinsic distance \(\rho (x,y)\), Lipschitz functions are Hölder with respect to \(\rho (x,y)\). We will therefore call \(\Lambda _\alpha \) the Hölder–Lipschitz space. This space arises in many areas of application. For instance, in nonparametric statistics Hölder–Lipschitz functions arise naturally as a model for unknown functions corrupted by noise. Simple equivalent formulas for the Hölder–Lipschitz norm in the Euclidean setting, such as those derived from wavelet theory [23], have been used for signal recovery [11].

In Sect. 4, we give simple characterizations of the Hölder–Lipschitz norm using the semigroup itself. As in the Euclidean setting, where similar results are well-known [3, 29, 31], the fundamental observation is that the size of a function’s variation across scales is equivalent to the size of its variation in space.

In Sect. 5, we study the space \(\Lambda _\alpha ^*\) of measures that can be integrated against Hölder–Lipschitz functions—that is, the space dual to \(\Lambda _\alpha \). In particular, we give simple characterizations of the norm on this space. This is of interest in many applications, as the dual norm of the difference between two probability measures is equal to their Earth Mover’s Distance (EMD). We will recall the definition of EMD and prove a basic property of it in Sect. 6. EMD is a popular tool in machine learning that suffers from high computational cost. The equivalent metrics we develop provide, in many situations, a fast approximation to it.
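To make the EMD concrete before Sect. 6: in one dimension, with ground distance \(|x-y|\), the EMD between two probability vectors has a classical closed form as the integral of the absolute difference of their cumulative distribution functions. The sketch below uses this closed form; the sample distributions are illustrative.

```python
def emd_1d(xs, p, q):
    # Earth Mover's Distance between probability vectors p and q supported
    # on the sorted points xs, with ground distance |x - y|. In one
    # dimension this equals the integral of |F_p - F_q| between the CDFs.
    total, Fp, Fq = 0.0, 0.0, 0.0
    for i in range(len(xs) - 1):
        Fp += p[i]
        Fq += q[i]
        total += abs(Fp - Fq) * (xs[i + 1] - xs[i])
    return total

# Moving a unit of mass from 0 to 1 costs exactly 1.
cost = emd_1d([0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

In greater generality no such closed form is available, which is one reason fast equivalent metrics are of interest.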

We impose certain regularity conditions on the semigroup. These conditions are quite mild, and in Sect. 3 we show that they hold for a broad class of semigroups, such as those considered in [17]. Examples include heat kernels on closed Riemannian manifolds, heat kernels on certain fractals, and subordinated heat kernels in \(\mathbb {R}^n\) (including the Poisson kernel), as well as the non-symmetric example of shifted heat kernels on \(\mathbb {R}^n\). In addition, we will show that if the theory holds for some finite collection of semigroups on different spaces, then it holds for their product on the cross-product of these spaces. In all the examples we consider, the parameter \(\alpha \) defining the distance \(D_\alpha (x,y)\) must lie between 0 and 1, and sometimes must be bounded away from 1.

In Sects. 7 and 8, we generalize our results to the product of measure spaces, each equipped with its own semigroup. We will focus on the example of two spaces for concreteness, though all results hold for arbitrarily many. In this setting, the natural measure of a function’s regularity is not the supremum of its difference quotients, but rather of its mixed difference quotients. We derive equivalent formulas for the norms on the space \(\Lambda _{\alpha ,\beta }\) of mixed Hölder–Lipschitz functions and its dual \(\Lambda _{\alpha ,\beta }^*\), where \(\alpha \) and \(\beta \) are the parameters used to define the distances on the two spaces.

Product spaces arise naturally in applications. In signal processing, for example, the spectrogram of a signal is a function on the product of the time and Fourier domains. By assuming that the spectrogram lies in a Sobolev space with dominating mixed derivatives—akin to the mixed Hölder–Lipschitz space \(\Lambda _{\alpha ,\beta }\)—one can develop effective estimators for recovering a spectrogram corrupted by noise [24]. Furthermore, the norm dual to Hölder–Lipschitz is a robust distance between two spectrograms. The equivalent dual norms we derive in this paper provide distances for comparing measures on any database with a product structure where each axis has its own semigroup, and hence its own geometry.

1.1 Notation

By “\(A\lesssim B\)” or “\(B\gtrsim A\)” we mean inequalities up to positive constants; that is, there is a constant \(C>0\) such that \(A \le C\cdot B \). Similarly, by “\(A \simeq B\)” we mean there are constants \(c,C > 0\) such that \(c\cdot A \le B \le C\cdot A\). What is meant by C being a “constant” will be clear in each instance.

We will encounter a variety of norms and seminorms throughout the paper. We will use \(||\cdot ||\), augmented with appropriate subscripts and superscripts, to denote norms, while we will use capital letters and parentheses, e.g. \(V(\cdot )\), augmented with appropriate subscripts and superscripts, to denote seminorms.

2 Multiscale Diffusion Distance

Our setting throughout the paper will be a \(\sigma \)-finite measure space \(\mathcal {X}\). We will not need to refer explicitly to the \(\sigma \)-algebra or the measure. We suppose that \(\mathcal {X}\) is equipped with a family of kernels \(a_t(x,y)\), \(t>0\), each in \(L^1\). Defining the operators

$$\begin{aligned} A_tf(x) = \int _\mathcal {X} a_t(x,y)f(y)dy \end{aligned}$$

we assume the following conditions:

  1. (S)

    (The semigroup property)  For all \(t,s>0\), \(A_tA_s = A_{t+s}\). This property can be expressed in terms of the kernels \(a_t(x,y)\) as

    $$\begin{aligned} a_{t+s}(x,y) = \int _\mathcal {X} a_t(x,w)a_s(w,y)dw. \end{aligned}$$
  2. (C)

    (The conservation property)  If \(\mathbf {1}\) is the constant function 1 on \(\mathcal {X}\), then for all \(t>0\), \(A_t\mathbf {1} = \mathbf {1}\). This property can be expressed in terms of the kernels \(a_t(x,y)\) as

    $$\begin{aligned} \int _\mathcal {X} a_t(x,y)dy = 1. \end{aligned}$$
  3. (I)

    (The integrability property)  There is a constant \(C > 0\) such that for all \(t>0\) and \(x\in \mathcal {X}\),

    $$\begin{aligned} \int _\mathcal {X}|a_t(x,y)|dy \le C. \end{aligned}$$
  4. (R)

    (The regularity property)  There are constants \(C > 0\) and \(\alpha >0\) such that for every \(1 \ge s \ge t>0\) and every \(x\in \mathcal {X},\)

    $$\begin{aligned} \int _\mathcal {X} |a_t(x,y)| \cdot ||a_s(x,\cdot ) - a_s(y,\cdot ) ||_1 dy \le C\bigg (\frac{t}{s}\bigg )^\alpha . \end{aligned}$$
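A discrete analogue may help fix ideas: on a finite state space the integrals above become sums, and one can take \(a_t\) to be the \(t\)-th power of a row-stochastic matrix. The sketch below checks (S), (C) and (I) numerically for a 3-state chain; the matrix entries are arbitrary illustrative choices.

```python
# Illustrative row-stochastic matrix on a 3-point state space.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.4, 0.5]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kernel(t):
    # Discrete-time analogue of a_t: the t-th matrix power of P, t = 1, 2, ...
    M = P
    for _ in range(t - 1):
        M = matmul(M, P)
    return M

# (S): a_{2+3} = a_2 a_3;  (C): each row sums to 1;  (I): L1 row norms bounded.
lhs = kernel(5)
rhs = matmul(kernel(2), kernel(3))
row_sums = [sum(row) for row in kernel(4)]
l1_norms = [sum(abs(v) for v in row) for row in kernel(4)]
```

Since the entries here are non-negative, (C) already forces the \(L^1\) bound (I) with \(C=1\); for kernels that change sign, (I) is a genuine extra assumption.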

We will actually only require a slightly weaker version of condition (R), namely the same condition restricted to dyadic times \(t=2^{-k}\) and \(s = 2^{-l}\), with k and l non-negative integers. Later in this section we will give an alternate characterization of (R), namely condition (G) below, that reveals its geometric content with respect to the distance \(D_\alpha (x,y)\) defined by formula (1). In Sect. 3, we will show that condition (R) holds for a wide variety of spaces \(\mathcal {X}\) and kernels \(a_t(x,y)\). In Sect. 4 we will show that this condition also implies convergence to the identity for the class of Hölder–Lipschitz functions that we will define there. Note too that in every example we discuss \(\alpha \) will be strictly less than 1, and sometimes will be bounded away from 1.

In contrast to [2, 17], we do not assume the existence of a metric on the space \(\mathcal {X}\). Rather, we will use the kernels \(a_t(x,y)\) to define a metric from scratch. This approach is inspired by the papers [6, 16] and related work. In the theory of diffusion maps [6], for example, each time t defines a diffusion distance, namely the \(L^2\) distance between \(a_t(x,\cdot )\) and \(a_t(y,\cdot )\). Each such distance captures the geometry of the space at a particular scale. These distances also have the feature that they can be approximately embedded into a low-dimensional Euclidean space.

As in [16], our starting point is the \(L^1\) distance between kernels at each scale, not the \(L^2\) distance. We also consider a single distance that incorporates all scales at once, rather than a family of distances. Though the distance we define does not admit a Euclidean embedding (unlike the \(L^2\) diffusion distances), for the application areas we have in mind there will usually be no need to compute our distance explicitly for all pairs of points; see Sect. 6.

We will be concerned with dyadic scales \(t\in (0,1]\); that is, scales \(t = 2^{-k}, k\ge 0\). To this end, define

$$\begin{aligned} P_k = A_{2^{-k}} \end{aligned}$$

and

$$\begin{aligned} p_k(x,y) = a_{2^{-k}}(x,y). \end{aligned}$$

Also, we define

$$\begin{aligned} D_k(x,y) = ||p_k(x,\cdot ) - p_k(y,\cdot ) ||_1. \end{aligned}$$

Define the multiscale distance

$$\begin{aligned} D_\alpha (x,y) = \sum _{k\ge 0}2^{-k\alpha }D_k(x,y). \end{aligned}$$
(1)

Note that condition (I) guarantees not only that \(D_\alpha (x,y)\) is finite, but that it is uniformly bounded for all x and y.
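As a concrete illustration, formula (1) can be evaluated numerically for the Gauss–Weierstrass (heat) kernel on the real line. In the sketch below, the grid, the truncation window and the number of dyadic scales retained are numerical choices, not part of the definition.

```python
import math

def heat_kernel(x, y, t):
    # Heat kernel on the real line: a_t(x,y) = e^{-(x-y)^2/(4t)} / sqrt(4 pi t).
    return math.exp(-(x - y) ** 2 / (4 * t)) / math.sqrt(4 * math.pi * t)

# Discretization choices (illustrative): a uniform grid standing in for R.
h = 0.01
grid = [-3.0 + h * i for i in range(701)]  # covers [-3, 4]

def D_k(x, y, k):
    # L1 distance between kernel rows at time t = 2^{-k}, via a Riemann sum.
    t = 2.0 ** (-k)
    return sum(abs(heat_kernel(x, z, t) - heat_kernel(y, z, t)) for z in grid) * h

def D_alpha(x, y, alpha, K=10):
    # Formula (1), truncated to the scales k = 0, ..., K.
    return sum(2.0 ** (-k * alpha) * D_k(x, y, k) for k in range(K + 1))
```

Each \(D_k\) is at most 2 by the triangle inequality, which is the source of the uniform bound on \(D_\alpha \) noted above.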

In Sect. 3 we will compute the distance \(D_\alpha (x,y)\) for many examples of semigroups. Before doing so, however, it will be convenient to turn our attention to the regularity condition (R) we impose on the kernels \(a_t(x,y)\). We reformulate condition (R) in geometric terms, where the geometry is defined by the distance \(D_\alpha (x,y)\). To that end, define the geometric condition (G) by

  1. (G)

    (The geometric property) There are constants \(C>0\) and \(\alpha >0\) such that for all \(k\ge 0\) and \(x\in \mathcal {X}\),

    $$\begin{aligned} \int _\mathcal {X} |p_k(x,y)| \cdot D_\alpha (x,y) dy \le C 2^{-k\alpha }. \end{aligned}$$

We show that conditions (R) and (G) are essentially equivalent. The following lemma will be convenient.

Lemma 1

Suppose there are constants \(C>0\) and \(\alpha >0\) such that for all integers \(k \ge 0\) and all \(x\in \mathcal {X}\),

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)| \sum _{l=0}^k 2^{-l\alpha } ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 dy \le C 2^{-k\alpha }. \end{aligned}$$

Then (G) holds, for the same choice of \(\alpha \) and a possibly different constant C.

Proof

By the integrability condition (I) the integrals \(\int _\mathcal {X} |p_l(x,y)|dy \) are uniformly bounded. Therefore

$$\begin{aligned} \sum _{l\ge k+1} 2^{-l\alpha } ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 \lesssim 2^{-k\alpha } \end{aligned}$$

and so

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)| \sum _{l\ge k+1} 2^{-l\alpha } ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 dy \lesssim 2^{-k\alpha }\int _\mathcal {X} |p_k(x,y)|dy \lesssim 2^{-k\alpha }. \end{aligned}$$

Recalling that \(D_l(x,y) = ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1\), we can therefore write

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)|\cdot D_\alpha (x,y) dy= & {} \int _\mathcal {X} |p_k(x,y)| \sum _{l=0}^k 2^{-l\alpha } D_l(x,y) dy\\&+\, \int _\mathcal {X} |p_k(x,y)|\sum _{l=k+1}^\infty 2^{-l\alpha } D_l(x,y) dy\\\lesssim & {} 2^{-k\alpha } \end{aligned}$$

completing the proof. \(\square \)

Proposition 1

Suppose that (R) holds for some \(\alpha > 0\) and all dyadic times \(s=2^{-l}, t=2^{-k}\), where \(0\le l \le k\). Then (G) holds with exponent \(\alpha ^\prime \) in place of \(\alpha \), for any \(0 < \alpha ^\prime < \alpha \).

Proof

For all \(x\in \mathcal {X}\) and all \(0\le l\le k\), condition (R) implies

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)| 2^{-l\alpha ^\prime } ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 dy \lesssim 2^{-k\alpha } 2^{l(\alpha -\alpha ^\prime )}. \end{aligned}$$

Summing over \(l=0,\dots ,k\) gives

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)| \sum _{l=0}^k2^{-l\alpha ^\prime }||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 dy&\lesssim 2^{-k\alpha } \sum _{l=0}^k2^{l(\alpha -\alpha ^\prime )} \\&\lesssim 2^{-k\alpha } 2^{k(\alpha -\alpha ^\prime )} = 2^{-k\alpha ^\prime }. \end{aligned}$$

By Lemma 1, we are done. \(\square \)

Proposition 2

Suppose condition (G) holds for some \(\alpha >0\). Then (R) holds for all dyadic times \(s=2^{-l}, t=2^{-k}\), and for all \(0<\alpha ^\prime \le \alpha \). In other words, for all \(0<\alpha ^\prime \le \alpha \) there is a constant C such that for all \(0\le l\le k\) and \(x\in \mathcal {X}\),

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)| \cdot ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 dy \le C 2^{-(k-l)\alpha ^\prime }. \end{aligned}$$

Proof

Since \(2^{-l\alpha } ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 \le D_\alpha (x,y)\) for all \(l\ge 0\), we have

$$\begin{aligned} \int _\mathcal {X} |p_k(x,y)|\cdot ||p_l(x,\cdot ) - p_l(y,\cdot ) ||_1 dy \le 2^{l\alpha }\int _\mathcal {X} |p_k(x,y)| D_\alpha (x,y) dy \lesssim 2^{-(k-l)\alpha }. \end{aligned}$$

Since \(\alpha ^\prime \le \alpha \), the result follows. \(\square \)

We will find condition (G) to be a more useful statement of regularity than (R) going forward. Note too that, as stated earlier, to recover (G) we need only assume (R) for dyadic times s and t between 0 and 1.

3 Examples of Kernels Satisfying Our Conditions

In this section, we show that the conditions (S), (C), (I) and (R) that we impose on the kernels \(a_t(x,y)\) (and hence, by Propositions 1 and 2, condition (G) as well) hold for semigroups arising in many different settings. Specifically, we consider two very general conditions (conditions 1 and 2 formulated below) that are found in the paper [17]. These conditions assume the existence of another metric \(\rho (x,y)\) on \(\mathcal {X}\) and posit that the kernels \(a_t(x,y)\) exhibit a certain regularity with respect to \(\rho (x,y)\). We show that we can recover the four conditions (S), (C), (I) and (R) from conditions 1 and 2.

Furthermore, we also obtain an upper bound on the distance \(D_\alpha (x,y)\) given by Eq. (1) in terms of the distance \(\rho (x,y)\). More precisely, we show that \(D_\alpha (x,y)\) is bounded above by \(\rho (x,y)\) raised to a power less than 1. Such a distance—that is, a distance of the form \(\rho (x,y)^\delta \), where \(0<\delta < 1\)—is called a snowflake of \(\rho (x,y)\) [18]. By imposing even stronger regularity on \(a_t(x,y)\) in the form of condition 3 below (also found in [17]), we will show that the distance \(D_\alpha (x,y)\) is in fact equivalent to a thresholded snowflake of \(\rho (x,y)\). We will use our analysis to establish conditions (S), (C), (I) and (R) for many examples of semigroups and compute the distance \(D_\alpha (x,y)\) for these examples.

Throughout this section, we will always assume \(0<\alpha <1\); as we will see, it will often be necessary to impose an even tighter upper bound on \(\alpha \).

3.1 Hölder Continuous Kernels with Decay

Suppose there is a metric \(\rho (x,y)\) on \(\mathcal {X}\) and a measure \(\mu \) such that \(\mu (B(x,r)) \lesssim r^n\), where \(n>0\) is fixed. In addition to the conservation property (C) and the uniform \(L^1\) bound (I) that we already assume, the kernel \(a_t(x,y)\) is assumed to be symmetric, and the following two regularity conditions are imposed:

  1.

    An upper bound on the kernel: there is a non-negative, monotonic decreasing function \(\Phi :[0,\infty ) \rightarrow \mathbb {R}\) and a number \(\beta >0\) such that for any \(\gamma <\beta \),

    $$\begin{aligned} \int _1^\infty \tau ^{n+\gamma } \Phi (\tau ) \frac{d\tau }{\tau } < \infty \end{aligned}$$

    and

    $$\begin{aligned} |a_t(x,y)| \le \frac{1}{t^{n/\beta }} \Phi \bigg ( \frac{\rho (x,y)}{t^{1/\beta } }\bigg ). \end{aligned}$$
  2.

    The Hölder continuity estimate: there is some constant \(\Theta >0\) sufficiently small such that for all \(t\in (0,1]\), all x and y in \(\mathcal {X}\) with \(\rho (x,y) \le t^{1/\beta }\), and all \(u \in \mathcal {X}\),

    $$\begin{aligned} |a_t(x,u) - a_t(y,u)| \le \bigg (\frac{\rho (x,y)}{t^{1/\beta }}\bigg )^{\Theta } \frac{1}{t^{n/\beta }} \Phi \bigg (\frac{\rho (x,u)}{t^{1/\beta }}\bigg ). \end{aligned}$$

These conditions are found in [17]. As discussed there, examples of semigroups satisfying these estimates include the subordinated heat kernels in \(\mathbb {R}^n\), the heat kernel on certain Riemannian manifolds, the heat kernel on a variety of fractals such as the unbounded Sierpinski Gasket, and the heat kernel of the semigroup \(e^{-tL}\) for certain elliptic operators L on \(\mathbb {R}^n\).

We will show that if we assume conditions 1 and 2, then our geometric condition (G) is satisfied for all \(0 < \alpha < \min \{1,\Theta /\beta \}\). The first step in showing this is to prove that our distance \(D_\alpha (x,y)\) defined from the semigroup is bounded above by a power of the distance \(\rho (x,y)\).

Lemma 2

For any \(0\le \eta <1\), there is a finite constant \(C>0\) such that for every \(0<t\le 1\) and every \(x\in \mathcal {X}\),

$$\begin{aligned} \int _\mathcal {X} \rho (x,y)^{\beta \eta }\frac{1}{t^{n/\beta }} \Phi \bigg (\frac{\rho (x,y)}{t^{1/\beta }}\bigg ) dy \le Ct^{\eta }. \end{aligned}$$

Proof

Let \(V_k=B(x,2^{k+1}t^{1/\beta })\setminus B(x,2^kt^{1/\beta })\). The upper bound on the kernel from condition 1 yields the following inequality:

$$\begin{aligned}&\int _\mathcal {X} \rho (x,y)^{\beta \eta } \frac{1}{t^{n/\beta }} \Phi \bigg ( \frac{\rho (x,y)}{t^{1/\beta } }\bigg ) dy \\&\quad = \frac{1}{t^{n/\beta }}\bigg \{\int _{B(x,t^{1/\beta })} + \sum _{k=0}^\infty \int _{V_k} \bigg \}\rho (x,y)^{\beta \eta } \Phi \bigg ( \frac{\rho (x,y)}{t^{1/\beta } }\bigg ) dy \\&\quad \lesssim t^\eta t^{-n/\beta }\bigg \{\Phi (0) \mu (B(x,t^{1/\beta })) \\&\qquad +\, \sum _{k=0}^\infty 2^{k\eta \beta }\Phi (2^k) \mu (B(x,2^{k+1}t^{1/\beta })) \bigg \} \\&\quad \lesssim t^\eta t^{-n/\beta }\bigg \{ \Phi (0) t^{n/\beta }+ \sum _{k=0}^\infty \Phi (2^k) 2^{k(n+\eta \beta )} t^{n/\beta } \bigg \} \\&\quad \lesssim t^\eta \bigg \{ \Phi (0) + \int _1^\infty \tau ^{n+\eta \beta } \Phi (\tau ) \frac{d\tau }{\tau } \bigg \} \lesssim t^\eta . \end{aligned}$$

We used that \(\eta <1\) and condition 1 to conclude that the last integral is finite. This is the desired result. \(\square \)

Proposition 3

For every \(0<\alpha < \min \{1,\Theta /\beta \}\), there is a constant \(C>0\) such that \(D_\alpha (x,y) \le C \min \{1, \rho (x,y)^{\alpha \beta } \}\).

Proof

Since \(D_\alpha (x,y)\) is uniformly bounded, we need only consider the case when \(\rho (x,y) \le 1\). Condition 2 and Lemma 2 above with \(\eta =0\) imply that whenever \(\rho (x,y) \le t^{1/\beta }\),

$$\begin{aligned} ||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1 \le \bigg (\frac{\rho (x,y)}{t^{1/\beta }}\bigg )^{\Theta } \frac{1}{t^{n/\beta }} \int _\mathcal {X} \Phi \bigg (\frac{\rho (x,u)}{t^{1/\beta }}\bigg ) du \lesssim \bigg (\frac{\rho (x,y)}{t^{1/\beta }}\bigg )^{\Theta }. \end{aligned}$$

Consequently, if we define K so that \(2^{-K} \le \rho (x,y)^{\beta } < 2^{-K+1}\), then

$$\begin{aligned} D_\alpha (x,y)&\lesssim \rho (x,y)^\Theta \sum _{k=0}^K 2^{-k\alpha } 2^{k\Theta /\beta } + \sum _{k=K+1}^\infty 2^{-k\alpha } \lesssim \rho (x,y)^{\Theta } 2^{K(\Theta /\beta - \alpha )} + 2^{-K\alpha } \\&\lesssim \rho (x,y)^{\alpha \beta }. \end{aligned}$$

We used that \(\alpha < \Theta /\beta \) for the upper bound on the first sum. \(\square \)

With this upper bound on \(D_\alpha (x,y)\), it is now straightforward to show that our geometric condition (G) holds for a range of \(\alpha \).

Theorem 1

Under conditions 1 and 2, condition (G) holds for all \(0<\alpha <\min \{1,\Theta /\beta \}\).

Proof

From Proposition 3, we have the upper bound \(D_\alpha (x,y) \lesssim \rho (x,y)^{\alpha \beta }\). Consequently, taking \(\eta =\alpha \) in Lemma 2 yields

$$\begin{aligned} \int _\mathcal {X} |a_t(x,y)| D_\alpha (x,y) dy&\lesssim \int _\mathcal {X} \rho (x,y)^{\alpha \beta } \frac{1}{t^{n/\beta }} \Phi \bigg ( \frac{\rho (x,y)}{t^{1/\beta } }\bigg ) dy \lesssim t^\alpha \end{aligned}$$

which is the desired result. \(\square \)

3.2 The Distance \(D_\alpha (x,y)\) for Kernels with a Matching Lower Bound

In the previous section, under the continuity and decay conditions 1 and 2, we established conditions (G) and (R) from the upper bound \(D_\alpha (x,y) \lesssim \min \{1, \rho (x,y)^{\alpha \beta } \}\), valid for all \(0 < \alpha < \min \{1,\Theta /\beta \}\). We now formulate general conditions under which we can prove the corresponding lower bound, \(D_\alpha (x,y) \gtrsim \min \{1,\rho (x,y)^{\alpha \beta }\}\). We will then study several examples where both bounds hold.

Note that the lower bound is not necessary for the general results of our paper to hold; in particular, our primary concern was to prove Theorem 1, which establishes condition (G) for a large class of examples. We only prove lower bounds on \(D_\alpha (x,y)\) to show that it is equivalent to a snowflake of the “natural” distance \(\rho (x,y)\) in many cases.

In this section we will assume a stronger relation between the metric \(\rho (x,y)\) and the measure \(\mu \), namely the two-sided estimate \(\mu (B(x,r)) \simeq r^n\). We suppose too that, in addition to conditions 1 and 2 of the previous section, the following condition holds:

  3.

    A local lower bound on the kernel: there is a monotonic decreasing function \(\Psi : [0,\infty ) \rightarrow \mathbb {R}\) and a constant \(R>0\) such that for all \(t \in (0,1]\) and all \(x,y\in \mathcal {X}\) with \(\rho (x,y) < R\),

    $$\begin{aligned} |a_t(x,y)| \ge \frac{1}{t^{n/\beta }} \Psi \bigg ( \frac{\rho (x,y)}{t^{1/\beta }} \bigg ). \end{aligned}$$

We will show:

Proposition 4

Under conditions 1, 2 and 3, \(D_\alpha (x,y) \gtrsim \min \{1,\rho (x,y)^{\alpha \beta }\}\).

We will deduce the result from the following lemmas.

Lemma 3

There is a constant \(A>1\) and a constant \(\epsilon >0\) such that whenever \(x,y \in \mathcal {X}\) and \(t \in (0,1]\) satisfy \(At^{1/\beta } \le \rho (x,y) < R\), we have

$$\begin{aligned} ||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1 > \epsilon . \end{aligned}$$

Proof

Temporarily fix any \(A>1\) and suppose \(At^{1/\beta } \le \rho (x,y) < R\). Then for any \(u\in B(x,t^{1/\beta })\), the triangle inequality implies

$$\begin{aligned} \rho (y,u) \ge \rho (x,y) - \rho (x,u) \ge (A-1)t^{1/\beta }. \end{aligned}$$

From the monotonicity of \(\Phi \) it follows that \(\Phi ( {\rho (y,u)}/{t^{1/\beta }} ) \le \Phi (A-1)\). Consequently, using the upper and lower bounds on \(|a_t(x,y)|\) and the fact that \(\mu (B(x,r)) \simeq r^n\), we have

$$\begin{aligned}&||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1\\&\quad \ge \int _{B(x,t^{1/\beta })} |a_t(x,u)| du - \int _{B(x,t^{1/\beta })} |a_t(y,u)| du \\&\quad \ge \frac{1}{t^{n/\beta }} \int _{B(x,t^{1/\beta })} \Psi \bigg ( \frac{\rho (x,u)}{t^{1/\beta }} \bigg ) du - \frac{1}{t^{n/\beta }} \int _{B(x,t^{1/\beta })} \Phi \bigg ( \frac{\rho (y,u)}{t^{1/\beta }} \bigg ) du\\&\quad \ge \frac{1}{t^{n/\beta }} \int _{B(x,t^{1/\beta })} \Psi \bigg ( \frac{\rho (x,u)}{t^{1/\beta }} \bigg ) du - \frac{1}{t^{n/\beta }} \int _{B(x,t^{1/\beta })} \Phi (A-1) du\\&\quad \ge C_1 \Psi (1) - C_2 \Phi (A-1) \end{aligned}$$

for some constants \(C_1, C_2 >0\). Since \(\Phi \) is decreasing, by choosing A large enough we can guarantee \( \epsilon \equiv C_1 \Psi (1) - C_2 \Phi (A-1) > 0\), yielding the desired result. \(\square \)

Corollary 1

Let R be as in condition 3. Then for all \(\rho (x,y) < R\),

$$\begin{aligned} D_\alpha (x,y) \gtrsim \rho (x,y)^{\alpha \beta } . \end{aligned}$$

Proof

By the previous lemma, \(||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1 > \epsilon > 0\) whenever \(At^{1/\beta } \le \rho (x,y)\). Take L so that \(2^{-L} \le \rho (x,y)^\beta / A^\beta < 2^{-L+1}\). Then

$$\begin{aligned} D_\alpha (x,y) \ge \sum _{k=L}^\infty 2^{-k\alpha } ||p_k(x,\cdot ) - p_k(y,\cdot ) ||_1 \ge \epsilon \sum _{k=L}^\infty 2^{-k\alpha } \simeq 2^{-L\alpha } \simeq \rho (x,y)^{\alpha \beta }. \end{aligned}$$

\(\square \)

Lemma 4

Let R be as in condition 3. There are constants \(C > 0\) and \(\delta >0\) such that whenever \(\rho (x,y) \ge R\) and \(t^{1/\beta } < \delta R\),

$$\begin{aligned} ||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1 \ge C. \end{aligned}$$

Proof

Since \(\rho (x,y) \ge R\), the balls \(B(x,R/2)\) and \(B(y,R/2)\) are disjoint. Consequently

$$\begin{aligned} ||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1&\ge \int _{B(x,R/2)} |a_t(x,u)|du - \int _{B(x,R/2)} |a_t(y,u)|du \\&\ge \int _{B(x,R/2)} |a_t(x,u)|du - \int _{B(y,R/2)^c} |a_t(y,u)|du \\&\ge 1 - \int _{B(x,R/2)^c} |a_t(x,u)|du - \int _{B(y,R/2)^c} |a_t(y,u)|du. \end{aligned}$$

The result follows from Lemma 5 below. \(\square \)

Lemma 5

Fix any \(r>0\) and \(\epsilon >0\). Then there exists \(\delta >0\) sufficiently small so that whenever \(0<t^{1/\beta } <\delta r\),

$$\begin{aligned} \int _{B(x,r)^c} |a_t(x,u)|du \le \epsilon . \end{aligned}$$

Proof

Temporarily fix \(\delta >0\) and suppose \(0<t^{1/\beta } <\delta r\). Then if we let \(V_k = B(x,2^{k+1}r) \setminus B(x,2^kr)\), we have

$$\begin{aligned} \int _{B(x,r)^c} |a_t(x,u)|du&\le \frac{1}{t^{n/\beta }} \int _{B(x,r)^c} \Phi \bigg ( \frac{\rho (x,u)}{t^{1/\beta }} \bigg ) du = \frac{1}{t^{n/\beta }}\sum _{k\ge 0} \int _{V_k} \Phi \bigg ( \frac{\rho (x,u)}{t^{1/\beta }} \bigg ) du \\&\lesssim \frac{1}{t^{n/\beta }}\sum _{k\ge 0} \Phi \bigg ( \frac{2^k r}{t^{1/\beta }} \bigg ) (2^{k+1}r)^n \lesssim \frac{r^n}{t^{n/\beta }} \int _1^\infty \tau ^n \Phi \bigg (\frac{\tau r}{t^{1/\beta }}\bigg )\frac{d\tau }{\tau } \\&= \frac{r^n}{t^{n/\beta }} \int _{rt^{-1/\beta }}^\infty t^{n/\beta } r^{-n} s^n\Phi (s) \frac{ds}{s} \le \int _{\delta ^{-1}}^\infty s^n \Phi (s) \frac{ds}{s}. \end{aligned}$$

By taking \(\delta \) small enough, the integral can be made as small as desired, completing the proof. \(\square \)
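For the heat kernel on the real line (where \(\beta = 2\)), the tail integral in Lemma 5 can be computed in closed form: the mass of \(a_t(x,\cdot )\) outside \(B(x,r)\) is \(\mathrm {erfc}(r/(2\sqrt{t}))\), a standard Gaussian computation. The sketch below, with illustrative values of r and \(\delta \), shows this tail vanishing as \(\delta = t^{1/\beta }/r\) shrinks, as the lemma asserts.

```python
import math

def tail_mass(r, t):
    # Mass of the heat kernel a_t(x, .) on R outside the ball B(x, r):
    # the integral of e^{-(u-x)^2/(4t)} / sqrt(4 pi t) over |u - x| > r,
    # which equals erfc(r / (2 sqrt(t))).
    return math.erfc(r / (2.0 * math.sqrt(t)))

r = 1.0
# Here beta = 2, so t^{1/beta} = sqrt(t); shrink delta = sqrt(t) / r.
masses = [tail_mass(r, (delta * r) ** 2) for delta in (0.5, 0.25, 0.1)]
```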

Corollary 2

There is a constant \(B>0\) such that whenever \(\rho (x,y) \ge R\), \(D_\alpha (x,y) \ge B\).

Proof

Take C and \(\delta \) from Lemma 4. Let \(K=\lfloor \log _2 (1/(\delta ^\beta R^\beta )) \rfloor + 1\). Then \(2^{-K} < \delta ^\beta R^\beta \), so Lemma 4 applies at every scale \(k\ge K\), and

$$\begin{aligned} D_\alpha (x,y) \ge \sum _{k=K}^\infty 2^{-k\alpha } C \simeq C\delta ^{\alpha \beta } R^{\alpha \beta } >0 \end{aligned}$$

which completes the proof. \(\square \)

Corollaries 1 and 2 easily imply Proposition 4. Furthermore, Propositions 3 and 4 yield the following theorem:

Theorem 2

If all the conditions 1, 2 and 3 on \(a_t(x,y)\) hold, and if \(\mu (B(x,r)) \simeq r^n\), then for \(0<\alpha < \min \{1,\Theta /\beta \}\) the distance \(D_\alpha (x,y)\) is equivalent to the thresholded snowflake distance \(\min \{1, \rho (x,y)^{\alpha \beta }\}\).
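Theorem 2 can be checked numerically for the heat kernel on the real line, where \(\beta = 2\) and \(\Theta = 1\), so that we need \(\alpha < 1/2\). In this case \(||a_t(x,\cdot ) - a_t(y,\cdot ) ||_1 = 2\,\mathrm {erf}(|x-y|/(4\sqrt{t}))\) exactly (a standard Gaussian computation), so the sum defining \(D_\alpha \) can be evaluated directly. The choice \(\alpha = 0.4\), the range of separations and the truncation at 80 scales are illustrative.

```python
import math

def D_alpha(r, alpha, K=80):
    # For the heat kernel on R, ||a_t(x,.) - a_t(y,.)||_1 = 2 erf(r / (4 sqrt(t)))
    # with r = |x - y|; sum formula (1) over dyadic scales t = 2^{-k},
    # truncated at K scales (the tail is negligible).
    return sum(2.0 ** (-k * alpha) * 2.0 * math.erf(r * 2.0 ** (k / 2.0) / 4.0)
               for k in range(K))

alpha = 0.4  # must be below min{1, Theta/beta} = 1/2 here
# Theorem 2 predicts D_alpha comparable to rho(x,y)^(alpha*beta) = r^0.8
# for small separations r.
ratios = [D_alpha(r, alpha) / r ** (2 * alpha) for r in (0.02, 0.04, 0.08, 0.16)]
```

The ratios stay within a bounded band as r varies, consistent with the equivalence \(D_\alpha (x,y) \simeq \rho (x,y)^{\alpha \beta }\) at small separations.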

3.3 Heat Kernel on a Riemannian Manifold

We illustrate Theorems 1 and 2 on some selected examples. We consider first the case where \(\mathcal {X}\) is a closed (compact, without boundary) Riemannian manifold of dimension n, and \(a_t(x,y)\) is its heat kernel. This section may be of interest to the machine-learning community, as approximations to the heat kernel on data sampled from submanifolds of \(\mathbb {R}^n\) are widely used in the analysis of many data sets [1, 21].

Since \(\mathcal {X}\) is compact, \(\mu (B(x,r)) \simeq r^n\). Furthermore, the following two lemmas can be derived easily from the parametrix construction of the heat kernel given in [5, Chapter VI, Section 4]. Here, \(\rho (x,y)\) is the geodesic distance on the manifold.

Lemma 6

There are positive constants \(A, B\) such that

$$\begin{aligned} a_t(x,y) \le \frac{A}{t^{n/2}}e^{-B\rho (x,y)^2/t} \end{aligned}$$

for all \(t \in (0,1]\).

Lemma 7

There are positive constants \(C, D\) such that

$$\begin{aligned} \frac{C}{t^{n/2}}e^{-D\rho (x,y)^2/t} \le a_t(x,y) \end{aligned}$$

whenever \(t\in (0,1]\) and \(\rho (x,y)\) is sufficiently small.

Taking \(\beta = 2\), from Lemma 6 we see that the heat kernel \(a_t(x,y)\) satisfies condition 1 with \(\Phi (\tau ) \simeq e^{-B\tau ^2}\) and times \(0 < t \le 1\), which is the only range of t we made use of in Sect. 3.1; and from Lemma 7 the heat kernel satisfies condition 3 with \(\Psi (\tau ) \simeq e^{-D\tau ^2}\). To apply Theorem 2, it remains to show condition 2. We will deduce the continuity estimate from the following gradient bound:

Lemma 8

There are constants \(E,F > 0\) such that for all \(t\in (0,1]\) and for all x and y in \(\mathcal {X}\),

$$\begin{aligned} ||\nabla _x a_t(x,y) ||\le \frac{E}{\sqrt{t}}\frac{e^{-F\rho (x,y)^2/t}}{t^{n/2}} \end{aligned}$$

where \(\nabla _x\) denotes the gradient with respect to the first variable.

Proof

Using the asymptotic expansion of \(a_t(x,y)\) in [5, Chapter VI, Section 4], it is easy to show a Gaussian upper bound on the time derivative of \(a_t(x,y)\), namely

$$\begin{aligned} \bigg | \frac{\partial a_t}{\partial t}(x,y)\bigg | \le \frac{b}{t} \frac{e^{-c \rho (x,y)^2/t} }{t^{n/2}} \end{aligned}$$

for some positive constants \(b, c\). Since the curvature of \(\mathcal {X}\) is bounded (because \(\mathcal {X}\) is compact) we can apply Theorem 1.4 from [20], which states that there are constants \(A_1, A_2,A_3\) such that

$$\begin{aligned} ||\nabla _x a_t(x,y) ||^2&\le \bigg (A_1 + \frac{A_2}{t} \bigg ) a_t(x,y)^2 + A_3 a_t(x,y) \frac{\partial a_t}{\partial t}(x,y). \end{aligned}$$

For \(t\in (0,1]\), it follows from the Gaussian estimates on \(a_t(x,y)\) and \(\partial _t a_t(x,y)\) that

$$\begin{aligned} ||\nabla _x a_t(x,y) ||^2 \le \frac{\tilde{A}}{t}\frac{e^{-2B\rho (x,y)^2/t}}{t^n} + \frac{b}{t} \frac{e^{-c \rho (x,y)^2/t} }{t^{n/2}}\frac{A}{t^{n/2}}e^{-B\rho (x,y)^2/t} \lesssim \frac{1}{t} \frac{e^{-\tilde{B}\rho (x,y)^2/t}}{t^n} \end{aligned}$$

for sufficiently small \(\tilde{B} > 0\), from which the result follows. \(\square \)

We next prove a consequence of the mean value theorem that will be useful.

Lemma 9

If \(x,y \in \mathcal {X}\) are sufficiently close, then for any smooth function \(h:\mathcal {X} \rightarrow \mathbb {R}\) there is a point \(\tilde{x}\) lying on the minimal geodesic from x to y such that

$$\begin{aligned} |h(x) - h(y)| \le ||\nabla h(\tilde{x}) ||\rho (x,y). \end{aligned}$$

Proof

Suppose \(r \equiv \rho (x,y)\) is less than the injectivity radius of \(\mathcal {X}\) (which is positive, since \(\mathcal {X}\) is compact). Let \(\gamma (t)\) be the unit speed geodesic connecting x to y, so that \(\gamma (0)=x\) and \(\gamma (r)=y\). For details, see, for instance, [10, Chapter 13, Section 2].

Consider the function \(\hat{h}(t) = h(\gamma (t))\). Observe that \(\hat{h}(0)=h(x)\) and \(\hat{h}(r) = h(y)\). By the mean value theorem, there is some point \(t_1\) between 0 and r such that

$$\begin{aligned} \frac{h(y) - h(x)}{\rho (x,y)} = \frac{\hat{h}(r) - \hat{h}(0)}{r} = \hat{h}^\prime (t_1) = \frac{d}{dt}h(\gamma (t))\bigg |_{t=t_1} = \langle \nabla h(\gamma (t_1)),\gamma ^\prime (t_1)\rangle . \end{aligned}$$

Consequently, since \(\gamma \) has unit speed, the Cauchy–Schwarz inequality gives

$$\begin{aligned} |h(y) - h(x)| = |\langle \nabla h(\gamma (t_1)),\gamma ^\prime (t_1)\rangle | \rho (x,y) \le ||\nabla h(\gamma (t_1)) ||\rho (x,y). \end{aligned}$$

Consequently, if we let \(\tilde{x} = \gamma (t_1)\), then \(\tilde{x}\) lies on the minimal geodesic connecting x and y, and

$$\begin{aligned} |h(x) - h(y)| \le ||\nabla h(\tilde{x})||\rho (x,y). \end{aligned}$$

\(\square \)
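The following short numerical sketch (not part of the formal development, and using purely illustrative names) checks the conclusion of Lemma 9 on the simplest compact manifold, the unit circle, where for nearby points the geodesic distance is \(\rho (\theta _1,\theta _2) = |\theta _1 - \theta _2|\) and the gradient norm of \(h(\theta ) = \cos \theta \) is \(|\sin \theta |\).

```python
import numpy as np

# Illustrative check of Lemma 9 on the unit circle S^1 (a compact manifold).
rng = np.random.default_rng(0)
h = np.cos                                  # test function on the circle
grad_norm = lambda th: np.abs(np.sin(th))   # |h'| along the circle

violations = 0
for _ in range(1000):
    t1 = rng.uniform(0, 2 * np.pi)
    t2 = t1 + rng.uniform(-1.0, 1.0)        # well inside the injectivity radius (pi)
    rho = abs(t2 - t1)                      # geodesic distance for nearby points
    geod = np.linspace(t1, t2, 2001)        # sample the minimal geodesic
    bound = grad_norm(geod).max() * rho     # sup of the gradient along the geodesic
    if abs(h(t1) - h(t2)) > bound + 1e-6:
        violations += 1
```

In every trial the difference \(|h(x) - h(y)|\) is dominated by the supremum of \(||\nabla h ||\) on the connecting geodesic times \(\rho (x,y)\), as the lemma asserts.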

Corollary 3

There are positive constants G and H such that whenever \(t \in (0,1]\) and \(\rho (x,y) \le t^{1/2}\),

$$\begin{aligned} |a_t(x,u) - a_t(y,u)| \le G\frac{\rho (x,y)}{\sqrt{t}} \frac{e^{-H\rho (x,u)^2/t}}{t^{n/2}}. \end{aligned}$$

Proof

From Lemma 9 and the gradient estimate from Lemma 8, we have the bound

$$\begin{aligned} |a_t(x,u) - a_t(y,u)| \le \rho (x,y) \frac{E}{\sqrt{t}}\frac{e^{-F\rho (u,\tilde{x})^2/t}}{t^{n/2}} \end{aligned}$$

where \(\tilde{x}\) is some point on the minimal geodesic connecting x and y. Since \(\tilde{x}\) lies on this geodesic, \(\rho (x,\tilde{x}) \le \rho (x,y) \le t^{1/2}\). Consequently, we have

$$\begin{aligned} \rho (u,x)^2 \le 2\rho (u,\tilde{x})^2 + 2\rho (\tilde{x},x)^2 \le 2\rho (u,\tilde{x})^2 + 2t \end{aligned}$$

and so

$$\begin{aligned} |a_t(x,u) - a_t(y,u)|&\le \rho (x,y) \frac{E}{\sqrt{t}}\frac{e^{-F\rho (u,\tilde{x})^2/t}}{t^{n/2}} \le \rho (x,y) \frac{E}{\sqrt{t}}\frac{e^{-F(\rho (u,x)^2-2t)/2t}}{t^{n/2}} \\&\le \rho (x,y) \frac{Ee}{\sqrt{t}}\frac{e^{-(F/2)\rho (u,x)^2/t}}{t^{n/2}} \end{aligned}$$

which is the desired result. \(\square \)

Corollary 3 gives us condition 2. We can therefore apply Theorems 1 and 2 to conclude that for all \(0 < \alpha < 1/2\), \(D_\alpha (x,y) \simeq \rho (x,y)^{2\alpha }\), and condition (G) is satisfied.

We note that to establish (G), we only made use of the Gaussian upper bound from Lemma 6, the continuity estimate from Corollary 3, and the upper bound \(\mu (B(x,r)) \lesssim r^n\). Condition (G) will therefore hold for the heat kernel on any manifold where these estimates are true, and not just closed manifolds. For example, as discussed in [17], the Gaussian upper bounds and continuity estimates hold for the heat kernel on any geodesically complete Riemannian manifold with non-negative curvature.

3.4 Subordinated Heat Kernels with Shifts on \(\mathbb {R}^n\)

Next we consider the case in which \(a_t(x,y) = K_t(x-y)\), where \(K_t(u)\) is a radial kernel, i.e. \(K_t(x) = K_t(y)\) if \(|x|=|y|\), satisfying the following scaling property:

$$\begin{aligned} K_t(x) = t^{-n/\beta }K_1(t^{-1/\beta }x) \end{aligned}$$

where \(0<\beta \le 2\). For details on the construction of such kernels in one dimension, the reader can refer to the book [33]. These kernels are known as subordinated heat kernels on \(\mathbb {R}^n\), and can be expressed as an average of the Gaussian heat kernel at different scales. Concretely, when \(0<\beta <2\), \(K_t(x)\) is of the form

$$\begin{aligned} K_t(x) = \int _0^\infty \eta _t(s) g_s(x)ds \end{aligned}$$

where \(g_s\) is the Gaussian kernel at time s, and for each t the function \(\eta _t(s)\) is a probability density on \((0,\infty )\), known as the subordinator. In fact, \(\eta _t\) satisfies the identity

$$\begin{aligned} \exp (-t\lambda ^{\beta /2}) = \int _0^\infty \eta _t(s) e^{-s\lambda }ds \end{aligned}$$

for all \(\lambda \ge 0\), from which it easily follows that the Fourier transform of \(K_t\) is

$$\begin{aligned} \hat{K}_t(\xi ) = \exp (-t|\xi |^\beta ). \end{aligned}$$

For example, when \(\beta = 2\), \(K_t\) is equal to the Gaussian heat kernel, and when \(\beta =1\) it is equal to the Poisson kernel.
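As a numerical illustration (with illustrative names, not part of the formal development), one can check the Fourier transform identity \(\hat{K}_t(\xi ) = \exp (-t|\xi |^\beta )\) for the one-dimensional Poisson kernel \(K_t(x) = \frac{1}{\pi }\frac{t}{t^2+x^2}\), the subordinated heat kernel with \(\beta =1\):

```python
import numpy as np

# One-dimensional Poisson kernel: subordinated heat kernel with beta = 1.
def poisson(t, x):
    return t / (np.pi * (t * t + x * x))

dx = 0.01
x = np.arange(-400.0, 400.0, dx)   # wide grid: the Cauchy tail decays like 1/x^2
t = 1.0
errs = []
for xi in [0.0, 0.5, 1.0, 2.0]:
    # numerical Fourier transform  int K_t(x) e^{-i xi x} dx  (real part, by symmetry)
    ft = np.sum(poisson(t, x) * np.cos(xi * x)) * dx
    errs.append(abs(ft - np.exp(-t * abs(xi))))
max_err = max(errs)
```

The small residual error comes from truncating the slowly decaying Cauchy tails at \(|x| = 400\).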

It is shown in [17] that any subordinated heat kernel satisfies conditions 1, 2 and 3. More precisely,

$$\begin{aligned} a_t(x,y) \simeq \frac{1}{t^{n/\beta }}\bigg (1+\frac{|x-y|}{t^{1/\beta }} \bigg )^{-(n+\beta )} \end{aligned}$$

and

$$\begin{aligned} |a_t(x,u) - a_t(y,u)| \lesssim \frac{|x-y|}{t^{1/\beta }} \frac{1}{t^{n/\beta }} \bigg ( 1 + \frac{|x-y|}{t^{1/\beta }} \bigg )^{-(n+\beta )}. \end{aligned}$$

From Theorems 1 and 2, it follows immediately that whenever \(0<\alpha <1/\beta \), the distance \(D_\alpha (x,y)\) with respect to the kernel \(a_t(x,y)=K_t(x - y)\) is equivalent to \(\min \{1, |x-y|^{ \alpha \beta } \}\) and that condition (G) holds. Note that our use of the parameter \(\beta \) in the definition of the subordinated heat kernel coincides with its use in the conditions 1, 2 and 3.

This leads us to a family of examples of non-symmetric semigroups for which condition (G) holds, namely the subordinated heat kernels with shifts. Take \(\beta \in [1,2]\). Then for a fixed parameter \(\theta \in \mathbb {R}\), define

$$\begin{aligned} a_t(x,y) = t^{-n/\beta }K_1\left( t^{-1/\beta } ( x - \theta t -y )\right) . \end{aligned}$$

It is easy to check from the semigroup property for the non-shift case \(\theta =0\) that \(a_t(x,y)\) is also a semigroup. Furthermore, when \(0 < \alpha < 1/\beta \) we still have \(D_\alpha (x,y) \simeq \min \{1,|x-y|^{\alpha \beta }\}\). Therefore, we can verify condition (G) directly by writing

$$\begin{aligned} \int _{\mathbb {R}^n} a_t(x,y) D_\alpha (x,y) dy&\lesssim \int _{\mathbb {R}^n} t^{-n/\beta }K_1(t^{-1/\beta } ( x - \theta t -y )) \min \{1,|x-y|^{\alpha \beta }\}dy \\&= \int _{\mathbb {R}^n} t^{-n/\beta }K_1(t^{-1/\beta } ( x -y )) \min \{1,|x-y + \theta t |^{\alpha \beta }\}dy \\&\le \int _{\mathbb {R}^n}t^{-n/\beta }K_1(t^{-1/\beta } ( x -y )) \min \{1,|x-y|^{\alpha \beta }\} dy \\&\quad \,+ \int _{\mathbb {R}^n}t^{-n/\beta }K_1(t^{-1/\beta } ( x -y )) |\theta t|^{\alpha \beta } dy \\&\lesssim t^{\alpha } + t^{\alpha \beta }. \end{aligned}$$

The last line follows from condition (G) in the case \(\theta =0\). Since \(\beta \ge 1\), condition (G) is satisfied. Note that this range of \(\beta \) includes both the heat kernel \((\beta =2)\) and the Poisson kernel \((\beta =1)\).
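The semigroup property of the shifted kernel can be sketched numerically in the Gaussian case \(\beta =2\) (a minimal illustration with illustrative names, assuming the heat kernel normalization \(g_t(u) = (4\pi t)^{-1/2}e^{-u^2/4t}\)): composing \(a_s\) and \(a_t\) should reproduce \(a_{s+t}\), since the drift terms \(\theta s\) and \(\theta t\) add.

```python
import numpy as np

def g(t, u):
    # Gaussian heat kernel on R, normalized so that g_s * g_t = g_{s+t}
    return np.exp(-u * u / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)

def a(t, x, y, theta):
    # shifted kernel a_t(x, y) = g_t(x - theta*t - y)
    return g(t, x - theta * t - y)

theta, s, t = 0.7, 0.3, 0.5
x, z = 0.4, -0.2
dy = 0.005
ygrid = np.arange(-30.0, 30.0, dy)

# composition int a_s(x, y) a_t(y, z) dy versus a_{s+t}(x, z)
composed = np.sum(a(s, x, ygrid, theta) * a(t, ygrid, z, theta)) * dy
direct = a(s + t, x, z, theta)
err = abs(composed - direct)
```

The change of variables in the convolution shows directly that the shifts accumulate linearly in t, which is why \(a_t\) fails to be symmetric in x and y.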

3.5 Products of Kernels and Anisotropic Distances

Suppose that \(a_t(x_1,x_2)\) and \(b_t(y_1,y_2)\) are two semigroups on spaces \(\mathcal {X}\) and \(\mathcal {Y}\), respectively, for which the conditions (S), (C), (I) and (R) (and thus (G)) hold. We define their product by \(c_t((x_1,y_1),(x_2,y_2)) = a_t(x_1,x_2)\cdot b_t(y_1,y_2)\). It is easy to check that the kernel \(c_t\) defines a semigroup on \(\mathcal {X} \times \mathcal {Y}\), and that the three conditions (S), (C), and (I) all hold. We will check that (G) holds as well. Since we have three semigroups, we augment our notation to distinguish between the distances each one induces. Fixing the distance parameter \(\alpha \), we will write \(D_{\alpha }^{a}(x_1,x_2)\) for the distance induced by \(a_t\), and similarly for \(b_t\) and \(c_t\). We then have:

Proposition 5

The distance \(D_\alpha ^c((x_1,y_1),(x_2,y_2))\) on \(\mathcal {X} \times \mathcal {Y}\) is equivalent to \(D_\alpha ^a(x_1,x_2) + D_\alpha ^b(y_1,y_2)\).

Proof

This follows immediately from the following lemma. \(\square \)

Lemma 10

For every \(z_1 = (x_1,y_1), z_2=(x_2,y_2)\) in \(\mathcal {X} \times \mathcal {Y}\), we have

$$\begin{aligned} ||c_t(z_1,\cdot ) - c_t(z_2,\cdot ) ||_1 \simeq ||a_t(x_1,\cdot ) - a_t(x_2,\cdot ) ||_1 + ||b_t(y_1,\cdot ) - b_t(y_2,\cdot ) ||_1. \end{aligned}$$

Proof

First, we prove that

$$\begin{aligned} ||c_t(z_1,\cdot ) - c_t(z_2,\cdot ) ||_1 \lesssim ||a_t(x_1,\cdot ) - a_t(x_2,\cdot ) ||_1 + ||b_t(y_1,\cdot ) - b_t(y_2,\cdot ) ||_1. \end{aligned}$$

To see this, observe that if \(x_1 = x_2\) then

$$\begin{aligned} ||c_t((x_1,y_1),\cdot ) - c_t((x_1,y_2),\cdot ) ||_1&= \int _\mathcal {X}\int _\mathcal {Y} |a_t(x_1,x)| |b_t(y_1,y) - b_t(y_2,y)|dy dx\\&\lesssim ||b_t(y_1,\cdot ) - b_t(y_2,\cdot ) ||_1 \end{aligned}$$

where we have used condition (I) on the kernels \(a_t\). Similarly,

$$\begin{aligned} ||c_t((x_1,y_2),\cdot ) - c_t((x_2,y_2),\cdot ) ||_1 \lesssim ||a_t(x_1,\cdot ) - a_t(x_2,\cdot ) ||_1. \end{aligned}$$

We therefore have

$$\begin{aligned}&||c_t((x_1,y_1),\cdot ) - c_t((x_2,y_2),\cdot ) ||_1\\&\quad \le ||c_t((x_1,y_1),\cdot ) - c_t((x_1,y_2),\cdot ) ||_1\\&\qquad +\,||c_t((x_1,y_2),\cdot ) - c_t((x_2,y_2),\cdot ) ||_1 \\&\quad \lesssim ||a_t(x_1,\cdot ) - a_t(x_2,\cdot ) ||_1+ ||b_t(y_1,\cdot ) - b_t(y_2,\cdot ) ||_1, \end{aligned}$$

as desired.

For the other direction, observe that

$$\begin{aligned} ||c_t(z_1,\cdot ) - c_t(z_2,\cdot ) ||_1&= \int _\mathcal {X} \int _\mathcal {Y} | a_t(x_1,x)b_t(y_1,y) - a_t(x_2,x)b_t(y_2,y) | dydx \\&\ge \int _\mathcal {X} \bigg | \int _\mathcal {Y} [a_t(x_1,x)b_t(y_1,y) - a_t(x_2,x)b_t(y_2,y) ]dy \bigg | dx \\&= \int _\mathcal {X} \bigg |a_t(x_1,x) \int _\mathcal {Y} b_t(y_1,y)dy - a_t(x_2,x) \int _\mathcal {Y} b_t(y_2,y) dy \bigg | dx \\&= \int _\mathcal {X} |a_t(x_1,x) - a_t(x_2,x)| dx = ||a_t(x_1,\cdot ) - a_t(x_2,\cdot ) ||_1. \end{aligned}$$

where we have used condition (C) in the last equality. Similarly,

$$\begin{aligned} ||c_t(z_1,\cdot ) - c_t(z_2,\cdot ) ||_1 \ge ||b_t(y_1,\cdot ) - b_t(y_2,\cdot ) ||_1 \end{aligned}$$

from which it follows

$$\begin{aligned} ||c_t(z_1,\cdot ) - c_t(z_2,\cdot ) ||_1 \ge \frac{1}{2} (||a_t(x_1,\cdot ) - a_t(x_2,\cdot ) ||_1 + ||b_t(y_1,\cdot ) - b_t(y_2,\cdot ) ||_1) \end{aligned}$$

completing the proof. \(\square \)

From Proposition 5 we can easily deduce that condition (G) holds for \(c_t\) if it holds for \(a_t\) and \(b_t\).

Proposition 6

If condition (G) holds for \(a_t\) and \(b_t\), then it holds for their product \(c_t\) as well.

Proof

We have, using condition (I) for both \(a_t\) and \(b_t\),

$$\begin{aligned}&\int _{\mathcal {X} \times \mathcal {Y}} |c_t(z_1,z_2)| D_\alpha ^c(z_1,z_2) dz_2 \\&\quad \lesssim \int _\mathcal {X} \int _\mathcal {Y} |a_t(x_1,x_2)\cdot b_t(y_1,y_2)|(D_\alpha ^a(x_1,x_2) + D_\alpha ^b(y_1,y_2)) dx_2dy_2 \\&\quad \lesssim \int _\mathcal {X} |a_t(x_1,x_2)|D_\alpha ^a(x_1,x_2) dx_2 + \int _\mathcal {Y} |b_t(y_1,y_2)|D_\alpha ^b(y_1,y_2) dy_2 \\&\quad \lesssim t^\alpha \end{aligned}$$

which is the desired result. \(\square \)

Of course, these results hold for the product of any number of kernels, not just two, and the proofs are similar. A natural example is the product of subordinated heat kernels on \(\mathbb {R}^n\). Suppose that \(n=n_1 + \dots + n_l\) and that on each space \(\mathbb {R}^{n_i}\) we have a subordinated heat kernel \(a_t^{(i)}(x_i,y_i)\) with scaling parameter \(\beta _i\), as in Sect. 3.4. Then as long as \(0 < \alpha < \min \{1, 1/\beta _1,\dots ,1/\beta _l\}\), for \(x=(x_1,\dots ,x_l), y=(y_1,\dots , y_l)\), \(x_i,y_i\in \mathbb {R}^{n_i}\), the kernel

$$\begin{aligned} a_t(x,y) = \prod _{i=1}^l a_t^{(i)}(x_i,y_i) \end{aligned}$$

generates the distance

$$\begin{aligned} D_\alpha (x,y) \simeq \min \{ 1, | x-y | _{_\text {AN}} \} \end{aligned}$$

where

$$\begin{aligned} |x-y|_{_\text {AN}} = \sum _{i=1}^l |x_i-y_i|^{\alpha \beta _i} \end{aligned}$$

is an example of an anisotropic distance on \(\mathbb {R}^n\); see, for example, [9, 13, 27] for work on function spaces built from such distances.
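That \(|x-y|_{_\text {AN}}\) is indeed a distance rests on the elementary inequality \((a+b)^\gamma \le a^\gamma + b^\gamma \) for \(a,b\ge 0\) and \(0<\gamma \le 1\), applied with \(\gamma = \alpha \beta _i\) in each factor. The following sketch (illustrative names; each factor taken one-dimensional for simplicity) checks the triangle inequality on random triples:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.3
betas = np.array([1.0, 1.5, 2.0])   # one scaling exponent per factor R^{n_i}
gammas = alpha * betas              # each alpha * beta_i lies in (0, 1)

def d_an(x, y):
    # anisotropic distance: sum of per-factor snowflaked distances
    return sum(abs(xi - yi) ** g for xi, yi, g in zip(x, y, gammas))

violations = 0
for _ in range(2000):
    x, y, z = rng.normal(size=(3, 3))
    if d_an(x, z) > d_an(x, y) + d_an(y, z) + 1e-12:
        violations += 1
```

Note that \(\alpha = 0.3 < \min \{1, 1/\beta _1, 1/\beta _2, 1/\beta _3\} = 0.5\), as required above.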

4 The Space of Hölder–Lipschitz Functions

We now turn to characterizing functions that are Lipschitz with respect to the distance \(D_\alpha (x,y)\), for a fixed \(\alpha \in (0,1)\). We assume that \(\alpha \) is chosen so that condition (G) holds; in particular, by Proposition 1 if the kernel satisfies condition (R) for some \(\alpha ^\prime \), we take any \(0<\alpha <\alpha ^\prime \). As we have seen in Sect. 3, the distance \(D_\alpha (x,y)\) is, in many cases of interest, of the form \(\rho (x,y)^{\alpha \beta }\), where \(\rho (x,y)\) is some other distance on \(\mathcal {X}\) and \(0 < \alpha \beta < 1\). In classical analysis, functions that are Lipschitz with respect to a snowflake metric are called Hölder. We will therefore refer to the space of Lipschitz functions with respect to \(D_\alpha (x,y)\) as the Hölder–Lipschitz space.

More formally, for a function f on \(\mathcal {X}\) define its maximum variation seminorm as

$$\begin{aligned} V(f) = \sup _{x\ne y} \frac{|f(x)-f(y)|}{D_\alpha (x,y)}. \end{aligned}$$

We then define the Hölder–Lipschitz norm of a function f on \(\mathcal {X}\) to be

$$\begin{aligned} ||f ||_{\Lambda _\alpha } = \sup _x|f(x)| + V(f) \end{aligned}$$

and the Hölder–Lipschitz space \(\Lambda _\alpha \) to be the collection of all functions on \(\mathcal {X}\) for which this norm is finite.

We will define two alternate norms on \(\Lambda _\alpha \) and show that they are equivalent to \(||f ||_{\Lambda _\alpha }\). We define the difference operators

$$\begin{aligned} \Delta _k = P_{k+1} - P_k, \qquad \delta _k = I - P_k. \end{aligned}$$

We also define the seminorms

$$\begin{aligned} V^{(1)}(f) = \sup _{k\ge 0}\sup _x2^{k\alpha }|\Delta _kf(x)| \end{aligned}$$

and

$$\begin{aligned} V^{(2)}(f) = \sup _{k\ge 0}\sup _x2^{k\alpha }|\delta _kf(x)|. \end{aligned}$$

The alternate norms can now be defined as

$$\begin{aligned} ||f ||_{\Lambda _\alpha }^{(1)} = \sup _x|f(x)| + V^{(1)}(f) \end{aligned}$$

and

$$\begin{aligned} ||f ||_{\Lambda _\alpha }^{(2)} = \sup _x|f(x)| + V^{(2)}(f). \end{aligned}$$
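To make the seminorm \(V^{(2)}\) concrete, here is a numerical sketch (illustrative only) for the heat semigroup on \(\mathbb {R}\), with \(P_k = A_{2^{-k}}\) and the test function \(f(x) = |x|^{1/2}\). Since \(D_\alpha (x,y) \simeq |x-y|^{2\alpha }\) for the heat kernel (Sect. 3), f is Lipschitz with respect to \(D_\alpha \) exactly when \(\alpha = 1/4\), and indeed the quantities \(2^{k\alpha }|\delta _k f(0)|\) are essentially constant in k at that exponent:

```python
import numpy as np

alpha = 0.25                       # f below is Lipschitz for D_alpha when alpha = 1/4
f = lambda y: np.sqrt(np.abs(y))   # |y|^{1/2}: Lipschitz w.r.t. |x - y|^{2 alpha}

vals = []
for k in range(9):
    t = 2.0 ** (-k)                # P_k = A_{2^{-k}}, heat semigroup on R
    w = np.sqrt(t)
    step = w / 400
    y = np.arange(-8 * w, 8 * w, step)
    gt = np.exp(-y * y / (4 * t)) / np.sqrt(4 * np.pi * t)   # heat kernel at time t
    Pkf0 = np.sum(gt * f(y)) * step                          # P_k f(0)
    vals.append(2.0 ** (k * alpha) * abs(f(0.0) - Pkf0))     # 2^{k alpha} |delta_k f(0)|
spread = max(vals) / min(vals)     # close to 1: the sequence is essentially constant
```

By the scaling of the heat kernel, \(|\delta _k f(0)| = c\, 2^{-k/4}\) exactly, so the computed ratio deviates from 1 only by quadrature error.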

We immediately see the use of condition (R) and its equivalent condition (G) in the following result:

Proposition 7

For all \(f \in \Lambda _\alpha \), \(V^{(2)}(f) \lesssim V(f)\).

Proof

Take any \(k\ge 0\). Since \(p_k(x,\cdot )\) has integral 1 for every x, we have

$$\begin{aligned} |f(x) - P_kf(x)|&= \bigg |f(x) - \int _\mathcal {X} p_k(x,y)f(y)dy\bigg | = \bigg |\int _\mathcal {X} p_k(x,y)(f(x)-f(y)) dy\bigg | \\&\le V(f)\int _\mathcal {X} |p_k(x,y)|D_\alpha (x,y) dy\, \lesssim \, V(f) 2^{-k\alpha } \end{aligned}$$

from which the desired inequality follows trivially. \(\square \)

Corollary 4

For all \(f \in \Lambda _\alpha \), \(||f ||_{\Lambda _\alpha }^{(2)} \lesssim ||f ||_{\Lambda _\alpha }\).

Next, we make the following simple observation about uniform convergence to the identity:

Lemma 11

If \(||f ||_{\Lambda _\alpha }^{(2)}<\infty \), then \(P_kf\) converges to f uniformly as \(k \rightarrow \infty \).

Proof

This is clear from the definition of \(||f ||_{\Lambda _\alpha }^{(2)}\) (more specifically, the definition of \(V^{(2)}(f)\)). \(\square \)

Since \(||f ||_{\Lambda _\alpha }^{(2)} \lesssim ||f ||_{\Lambda _\alpha }\), it follows that:

Lemma 12

For all \(f \in \Lambda _{\alpha }\), \(P_kf\) converges to f uniformly as \(k \rightarrow \infty \).

We now prove:

Proposition 8

The seminorms \(V^{(1)}(f)\) and \(V^{(2)}(f)\) are equivalent for \(f \in \Lambda _\alpha \).

Proof

First, write f as a telescopic series:

$$\begin{aligned} f - P_0f = \sum _{l=0}^\infty [P_{l+1}f - P_{l}f] \end{aligned}$$

where the series converges uniformly by Lemma 12.

Similarly we can write \(P_kf\) as a telescopic series

$$\begin{aligned} P_kf - P_0f = \sum _{l=0}^{k-1} [P_{l+1}f - P_{l}f] \end{aligned}$$

and subtracting the two series gives:

$$\begin{aligned} |f - P_kf|&= \bigg |\sum _{l=0}^\infty (P_{l+1}f - P_{l}f) - \sum _{l=0}^{k-1} (P_{l+1}f - P_{l}f) \bigg | \\&= \bigg |\sum _{l=k}^\infty (P_{l+1}f - P_{l}f) \bigg | \le \sum _{l=k}^\infty |P_{l+1}f - P_{l}f| \\&\le V^{(1)}(f) \sum _{l=k}^\infty 2^{-l\alpha } = V^{(1)}(f) \frac{1}{1-2^{-\alpha }}2^{-k\alpha } \end{aligned}$$

and consequently

$$\begin{aligned} 2^{k\alpha } \sup _x| \delta _k f(x) | \le \frac{1}{1-2^{-\alpha }} V^{(1)}(f). \end{aligned}$$

Taking the supremum over all \(k \ge 0\) shows \(V^{(2)}(f) \lesssim V^{(1)}(f)\).

For the other direction, we simply observe

$$\begin{aligned} |P_kf - P_{k+1}f| \le |(P_k - I)f|+ |(P_{k+1} - I)f| \le 2V^{(2)}(f)2^{-k\alpha } \end{aligned}$$

implying

$$\begin{aligned} 2^{k\alpha } \sup _x| \Delta _k f(x) | \le 2 V^{(2)}(f). \end{aligned}$$

Taking the supremum over all \(k\ge 0\) gives the result. \(\square \)

Corollary 5

The norms \(||f ||_{\Lambda _\alpha }^{(1)}\) and \(||f ||_{\Lambda _\alpha }^{(2)}\) are equivalent on \(\Lambda _\alpha \).

Now we turn to proving the main result of this section, namely that \(||f ||_{\Lambda _\alpha }^{(1)}\) and \(||f ||_{\Lambda _\alpha }^{(2)}\) are equivalent to \(||f ||_{\Lambda _\alpha }\). The following simple observation will be useful:

Lemma 13

\((P_{k+1}+P_k)\Delta _k = \Delta _{k-1}.\)

Proof

This is a simple algebraic computation:

$$\begin{aligned} (P_{k+1}+P_k )\Delta _k&= (P_{k+1}+P_k )(P_{k+1} - P_k ) = P_{k+1}P_{k+1} - P_{k+1}P_k + P_kP_{k+1} - P_{k}P_{k} \\&= A_{2^{-(k+1)}}A_{2^{-(k+1)}} - A_{2^{-k}} A_{2^{-k}} = A_{2^{-(k+1)}+2^{-(k+1)}} - A_{2^{-k}+2^{-k}} \\&= A_{2^{-k}} - A_{2^{-(k-1)}} = P_k - P_{k-1} = \Delta _{k-1}. \end{aligned}$$

\(\square \)
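The identity of Lemma 13 only uses the semigroup law \(A_sA_t = A_{s+t}\), so it can be sketched in finite dimensions (an illustration with illustrative names, taking \(A_t = e^{-tL}\) for a small graph Laplacian L, computed by symmetric eigendecomposition):

```python
import numpy as np

# Path-graph Laplacian on 5 nodes; A_t = exp(-t L) is a symmetric semigroup.
n = 5
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
lam, V = np.linalg.eigh(L)

A = lambda t: V @ np.diag(np.exp(-t * lam)) @ V.T   # matrix exponential e^{-tL}
P = lambda k: A(2.0 ** (-k))                        # P_k = A_{2^{-k}}
Delta = lambda j: P(j + 1) - P(j)

k = 3
lhs = (P(k + 1) + P(k)) @ Delta(k)   # (P_{k+1} + P_k) Delta_k
rhs = Delta(k - 1)                   # Delta_{k-1}
err = np.abs(lhs - rhs).max()
```

The cross terms \(P_{k+1}P_k\) and \(P_kP_{k+1}\) cancel because the operators \(A_t\) commute, which is exactly the cancellation used in the proof above.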

Lemma 14

Suppose f is bounded. Then for all \(x, y \in \mathcal {X}\),

$$\begin{aligned} |P_kf(x) - P_kf(y)| \le \sup _{x^\prime }|f(x^\prime )| \cdot D_k(x,y) \end{aligned}$$

Proof

$$\begin{aligned} |P_kf(x) - P_kf(y)|&= \bigg | \int _\mathcal {X} p_k(x,u)f(u)du - \int _\mathcal {X} p_k(y,u)f(u)du \bigg | \\&= \bigg | \int _\mathcal {X} [p_k(x,u)-p_k(y,u)]f(u)du \bigg | \\&\le \sup _{x^\prime }|f(x^\prime )| \cdot ||p_k(x,\cdot ) - p_k(y,\cdot ) ||_1 \\&= \sup _{x^\prime }|f(x^\prime )| \cdot D_k(x,y). \end{aligned}$$

\(\square \)

Proposition 9

For all \(f \in \Lambda _\alpha \), \(||f ||_{\Lambda _\alpha } \lesssim ||f ||_{\Lambda _\alpha }^{(1)}.\)

Proof

Expand f in a telescopic series:

$$\begin{aligned} f - P_0f= \sum _{k=0}^\infty \big [P_{k+1}f - P_{k}f\big ] = \sum _{k=0}^\infty \Delta _kf = \sum _{k=0}^\infty \big [(P_{k+1} + P_{k+2})\Delta _{k+1}f\big ] \end{aligned}$$

where we have used Lemma 13. The series converges uniformly by Lemma 12.

For all \(x,y\in \mathcal {X}\),

$$\begin{aligned} |P_k\Delta _kf(x) - P_k\Delta _kf(y)|&= \bigg |\int _\mathcal {X} p_k(x,u)(\Delta _kf)(u)du - \int _\mathcal {X} p_k(y,u)(\Delta _kf)(u)du\bigg | \nonumber \\&= \bigg |\int _\mathcal {X} (p_k(x,u) - p_k(y,u))(\Delta _kf)(u) du \bigg | \nonumber \\&\le V^{(1)}(f) 2^{-k\alpha } D_k(x,y). \end{aligned}$$
(2)

Similarly,

$$\begin{aligned} |P_{k+1}\Delta _kf(x) - P_{k+1}\Delta _kf(y)| \le V^{(1)}(f) 2^{-k\alpha }D_{k+1}(x,y). \end{aligned}$$
(3)

For every \(x,y\in \mathcal {X}\)

$$\begin{aligned} f(x) - f(y)&= \sum _{k=0}^\infty \big [(P_{k+1} + P_{k+2})\Delta _{k+1}f\big ](x) - \sum _{k=0}^\infty \big [(P_{k+1} + P_{k+2})\Delta _{k+1}f\big ](y) \\&\quad +\, P_0f(x) - P_0f(y). \end{aligned}$$

From the inequalities (2) and (3) we get

$$\begin{aligned}&\bigg |\sum _{k=0}^\infty \big [(P_{k+1} + P_{k+2})\Delta _{k+1}f\big ](x) - \sum _{k=0}^\infty \big [(P_{k+1} + P_{k+2})\Delta _{k+1}f\big ](y)\bigg | \\&\quad \le \bigg | \sum _{k=0}^\infty (P_{k+1}\Delta _{k+1}f(x) - P_{k+1}\Delta _{k+1}f(y))\bigg | \\&\qquad +\, \bigg | \sum _{k=0}^\infty (P_{k+2}\Delta _{k+1}f(x) - P_{k+2}\Delta _{k+1}f(y))\bigg | \\&\quad \le \sum _{k=0}^\infty V^{(1)}(f)2^{-(k+1)\alpha }D_{k+1}(x,y) + \sum _{k=0}^\infty V^{(1)}(f)2^{-(k+1)\alpha }D_{k+2}(x,y) \\&\quad \le V^{(1)}(f)(1+2^\alpha )D_\alpha (x,y). \end{aligned}$$

By Lemma 14, we also know \(|P_0f(x) - P_0f(y)| \le \sup _{x^\prime }|f(x^\prime )| D_\alpha (x,y)\). Consequently, for every \(x,y\in \mathcal {X}\)

$$\begin{aligned} |f(x) - f(y)| \le (V^{(1)}(f)(1+2^\alpha ) + \sup _{x^\prime }|f(x^\prime )| )D_\alpha (x,y) \end{aligned}$$

and so

$$\begin{aligned} \sup _{x\ne y}\frac{|f(x) - f(y)|}{D_\alpha (x,y)} \le V^{(1)}(f)(1+2^\alpha ) + \sup _{x^\prime }|f(x^\prime )| . \end{aligned}$$

Therefore

$$\begin{aligned} ||f ||_{\Lambda _\alpha }&= \sup _{x}|f(x)| + \sup _{x\ne y}\frac{|f(x) - f(y)|}{D_\alpha (x,y)} \le V^{(1)}(f)(1+2^\alpha ) + 2 \sup _{x}|f(x)| \\&\le 3||f ||_{\Lambda _\alpha }^{(1)}. \end{aligned}$$

\(\square \)

Putting together Corollaries 4, 5 and Proposition 9, we have shown:

Theorem 3

The norms \(||f ||_{\Lambda _\alpha }\), \(||f ||_{\Lambda _\alpha }^{(1)}\) and \(||f ||_{\Lambda _\alpha }^{(2)}\) are equivalent on \(\Lambda _\alpha \).

4.1 The Necessity of Condition (G) for Positive Kernels

The only place in which we have made use of condition (G) so far—and indeed, the only place we will directly use it in the whole paper—was in establishing Proposition 7; in fact, assuming condition (G) made the proof of Proposition 7 almost tautological. While we showed in Sect. 3 that condition (G) holds for a great many semigroups encountered throughout mathematics, the reader may still wonder whether it is actually necessary to assume it, or whether an even weaker condition could be imposed on the semigroup.

In this subsection, we give a partial answer to this question. Specifically, we show that if the kernels \(p_k(x,y)\) are non-negative for all \(k \ge 0\) and for all \(x,y\in \mathcal {X}\), then condition (G) is equivalent to the result of Proposition 7. Consequently, in order for the Hölder–Lipschitz norm of a function to be equivalent in size to its scale variations, as defined by the norms \(||f ||_{\Lambda _\alpha }^{(1)}\) and \(||f ||_{\Lambda _\alpha }^{(2)}\), condition (G) must hold. In fact, this statement is true not only for the distance \(D_\alpha (x,y)\) but for any distance \(D(x,y)\) on \(\mathcal {X}\).

We now formalize this result:

Proposition 10

Suppose the kernels \(p_k(x,y)\) are non-negative for all \(k \ge 0\) and for all \(x,y\in \mathcal {X}\). Let \(D(x,y)\) be any distance on \(\mathcal {X}\) and \(\alpha \in (0,1)\). Suppose there is a constant \(C>0\) such that for any function f on \(\mathcal {X}\)

$$\begin{aligned} \sup _{k\ge 0}\sup _{x \in \mathcal {X}} 2^{k\alpha }|\delta _k f(x)| \le C \sup _{x \ne y} \frac{f(x) - f(y)}{D(x,y)}. \end{aligned}$$
(4)

Then condition (G) must hold, with \(D(x,y)\) in place of \(D_\alpha (x,y)\); that is,

$$\begin{aligned} \int _\mathcal {X} p_k(x,y) \cdot D(x,y) dy \le C 2^{-k\alpha }. \end{aligned}$$

Proof

Take any \(x_0 \in \mathcal {X}\), and consider the function \(f(x) = D(x_0,x)\). Then \(f(x_0) = 0\), so that

$$\begin{aligned} |\delta _k f(x_0)| = |f(x_0) - P_kf(x_0)| = \int _\mathcal {X} p_k(x_0,x) D(x_0,x) dx . \end{aligned}$$

Because

$$\begin{aligned} \sup _{x \ne y} \frac{f(x) - f(y)}{D(x,y)} = 1 \end{aligned}$$

assumption (4) gives

$$\begin{aligned} \int _\mathcal {X} p_k(x_0,y) \cdot D(x_0,y) dy = |\delta _k f(x_0)| \le C2^{-k\alpha } \end{aligned}$$

completing the proof. \(\square \)

5 The Space Dual to \(\Lambda _\alpha \)

We now turn to the space of \(L^1\) measures on \(\mathcal {X}\) (that is, measures with finite total variation; see, for instance, [14]). Since all Hölder–Lipschitz functions are in \(L^\infty \), we can view every such measure as a distribution acting on \(\Lambda _\alpha \). We will denote the action of such a distribution T on a function \(f \in \Lambda _\alpha \) by \(\langle f,T\rangle = \int _\mathcal {X} f(x) dT(x) \). The dual norm to the Hölder–Lipschitz space \(\Lambda _\alpha \) is defined as:

$$\begin{aligned} ||T ||_{\Lambda _\alpha ^*} = \sup _{||f ||_{\Lambda _\alpha }\le 1} \langle f,T \rangle . \end{aligned}$$

The space \(\Lambda _\alpha ^*\) is the space of \(L^1\) measures T equipped with the norm \(||T ||_{\Lambda _\alpha ^*}\). In Sect. 6, we will give a well-known interpretation of the dual norm of the difference of two probability measures.

In this section we will define two other norms on \(\Lambda _\alpha ^*\) that are more amenable to computation, and prove their equivalence to \(||T ||_{\Lambda _\alpha ^*}\). In what follows, for a linear operator A acting on \(\Lambda \) we denote by \(A^*\) the dual operator acting on \(\Lambda ^*\).

We define the seminorms

$$\begin{aligned} W^{(1)}(T) = \sum _{k \ge 0} 2^{-k\alpha } ||\Delta _k^* T ||_1 \end{aligned}$$

and

$$\begin{aligned} W^{(2)}(T) = \sum _{k \ge 0} 2^{-k\alpha } ||d_k^* T ||_1 \end{aligned}$$

where

$$\begin{aligned} d_k = P_k - P_0. \end{aligned}$$

Now we define the equivalent norms. The first is defined by

$$\begin{aligned} ||T ||_{\Lambda _\alpha ^*}^{(1)} = ||P_0^*T ||_1 + W^{(1)}(T) \end{aligned}$$

and the second is defined by

$$\begin{aligned} ||T ||_{\Lambda _\alpha ^*}^{(2)} = ||P_0 ^*T ||_1 + W^{(2)}(T). \end{aligned}$$

We show that all three norms \(||T ||_{\Lambda _\alpha ^*}, ||T ||_{\Lambda _\alpha ^*}^{(1)}\) and \(||T ||_{\Lambda _\alpha ^*}^{(2)}\) are equivalent on \(\Lambda _\alpha ^*\).

Proposition 11

The seminorms \(W^{(1)}(T)\) and \(W^{(2)}(T)\) are equivalent on \(\Lambda _\alpha ^*\).

Proof

We first show \(W^{(1)}(T) \le (1+2^{\alpha }) W^{(2)}(T)\).

$$\begin{aligned} W^{(1)}(T)&= \sum _{k=0}^\infty 2^{-k\alpha }||\Delta _k^*T ||_1 = \sum _{k=0}^\infty 2^{-k\alpha }||(P_k^*-P_{k+1}^*)T ||_1 \\&\le \sum _{k=0}^\infty 2^{-k\alpha } ||(P_k^*-P_0^*)T ||_1 + \sum _{k=0}^\infty 2^{-k\alpha }||(P_{k+1}^*-P_0^*)T ||_1 \\&\le (1+2^{\alpha }) W^{(2)}(T). \end{aligned}$$

For the other direction, we write \(d_k^*T\) as the telescopic sum

$$\begin{aligned} d_k^*T = P_k^*T - P_0^*T = \sum _{l=0}^{k-1}\big [P_{l+1}^*T - P_l^*T\big ] = \sum _{l=0}^{k-1}\Delta _l^*T. \end{aligned}$$

Then \(||d_k^* T ||_1 \le \sum _{l=0}^{k-1}||\Delta _l^*T ||_1\), and consequently Fubini’s theorem yields

$$\begin{aligned} W^{(2)}(T)&= \sum _{k=0}^\infty 2^{-k\alpha } ||d_k^* T ||_1 \le \sum _{k=0}^\infty 2^{-k\alpha }\sum _{l=0}^{k-1}||\Delta _l^*T ||_1 \\&= \sum _{l=0}^\infty ||\Delta _l^*T ||_1 \sum _{k\ge l+1}2^{-k\alpha } \\&= \sum _{l=0}^\infty ||\Delta _l^*T ||_1 \frac{2^{-(l+1)\alpha }}{1-2^{-\alpha }} = \frac{2^{-\alpha }}{1-2^{-\alpha }}W^{(1)}(T) \end{aligned}$$

completing the proof. \(\square \)
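The two inequalities of Proposition 11 can be sketched in the same finite-dimensional setting used earlier (an illustration with illustrative names: \(A_t = e^{-tL}\) for a small symmetric graph Laplacian, so that dual operators act on measure vectors by matrix transpose, with the infinite sums truncated at a large index):

```python
import numpy as np

n, alpha, K = 5, 0.5, 40
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1.0
lam, V = np.linalg.eigh(L)
A = lambda t: V @ np.diag(np.exp(-t * lam)) @ V.T   # semigroup e^{-tL}
P = lambda k: A(2.0 ** (-k))                        # P_k = A_{2^{-k}}

rng = np.random.default_rng(2)
T = rng.normal(size=n)              # a signed measure on the 5 points
l1 = lambda v: np.abs(v).sum()      # total variation norm

# truncated seminorms W^{(1)} and W^{(2)}; the tails decay geometrically
W1 = sum(2.0 ** (-k * alpha) * l1((P(k + 1) - P(k)).T @ T) for k in range(K))
W2 = sum(2.0 ** (-k * alpha) * l1((P(k) - P(0)).T @ T) for k in range(K))

c = 2.0 ** (-alpha)
ok = (W1 <= (1 + 1 / c) * W2 + 1e-3) and (W2 <= (c / (1 - c)) * W1 + 1e-3)
```

Both truncated sums satisfy the stated inequalities with the constants \(1+2^{\alpha }\) and \(2^{-\alpha }/(1-2^{-\alpha })\) from the proof.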

Corollary 6

The norms \(||T ||_{\Lambda _\alpha ^*}^{(1)}\) and \(||T ||_{\Lambda _\alpha ^*}^{(2)}\) are equivalent on \(\Lambda _\alpha ^*\).

Next we turn to the main result of this section, namely that \(||T ||_{\Lambda _\alpha ^*}^{(1)}\) and \(||T ||_{\Lambda _\alpha ^*}^{(2)}\) are equivalent to \(||T ||_{\Lambda _\alpha ^*}\).

Proposition 12

For all \(T \in \Lambda _\alpha ^*\), \(||T ||_{\Lambda _\alpha ^*} \lesssim ||T ||_{\Lambda _\alpha ^*}^{(2)}\).

Proof

Suppose f is any function with \(||f ||_{\Lambda _\alpha }\le 1\). Making use of Lemma 13 and the uniform convergence of \(P_kf\) to f as \(k\rightarrow \infty \) we can write

$$\begin{aligned} f - P_0f&= \sum _{j=0}^{\infty }\Delta _jf = \sum _{j=1}^{\infty }P_j\Delta _jf + \sum _{j=1}^{\infty }P_{j+1}\Delta _jf \\&= \sum _{j=1}^{\infty } (P_j - P_0) \Delta _jf + \sum _{j=1}^{\infty } (P_{j+1} - P_0) \Delta _jf + 2P_0(I - P_1)f . \end{aligned}$$

Therefore,

$$\begin{aligned} \langle f,T \rangle&= \sum _{j=1}^{\infty } \langle T, (P_j - P_0) \Delta _jf \rangle + \sum _{j=1}^{\infty } \langle T, (P_{j+1} - P_0) \Delta _jf \rangle \\&\quad +\, \langle T, (3P_0 - 2P_0P_1) f \rangle \\&= \sum _{j=1}^{\infty } \langle (P_j^* - P_0^*)T ,\Delta _j f\rangle + \sum _{j=1}^{\infty } \langle (P_{j+1}^* - P_0^*)T, \Delta _jf \rangle \\&\quad +\, \langle P_0^* T, (3I - 2P_1) f \rangle . \end{aligned}$$

Consequently, we have

$$\begin{aligned} |\langle f,T \rangle |&\le \sum _{j=1}^{\infty } ||(P_j^* - P_0^*)T ||_1 \sup _x|\Delta _jf(x)| + \sum _{j=1}^{\infty } ||(P_{j+1}^* - P_0^*)T ||_1 \sup _x|\Delta _jf(x)| \\&\quad +\, ||P_0^*T ||_1 \sup _x|(3I - 2P_1) f(x)| \\&\lesssim \sum _{j=1}^{\infty } 2^{-j\alpha } ||d_j^*T ||_1 + \sum _{j=1}^{\infty } 2^{-j\alpha }||d_{j+1}^*T ||_1 + ||P_0^*T ||_1 \sup _x|f(x)| \end{aligned}$$

where in the last inequality we have used the equivalence of \(||f ||_{\Lambda _\alpha }\) and \(||f ||_{\Lambda _\alpha }^{(1)}\) from Theorem 3 and the fact that \(\sup _{x}|P_kf(x)| \lesssim \sup _{x}|f(x)|\) (a trivial consequence of condition (I) on the kernel). Since \(\sup _x|f(x)|\le ||f ||_{\Lambda _\alpha }\le 1\), it follows immediately that

$$\begin{aligned} |\langle f, T\rangle | \lesssim \sum _{j=1}^{\infty } 2^{-j\alpha } ||d_j^* T ||_1 + ||P_0^* T ||_1 = ||T ||_{\Lambda _\alpha ^*}^{(2)}. \end{aligned}$$

Now take the supremum over all f with \(||f ||_{\Lambda _\alpha }\le 1\) to reach the desired conclusion. \(\square \)

Proposition 13

For all \(T \in \Lambda _\alpha ^*\), \(||T ||_{\Lambda _\alpha ^*}^{(2)} \lesssim ||T ||_{\Lambda _\alpha ^*}\).

Proof

Define the function f by

$$\begin{aligned} f(x)&= \sum _{k=1}^\infty 2^{-k\alpha } (P_k - P_0) {{\mathrm{sgn}}}\big [(P_k^* - P_0^*)T\big ](x) + P_0\big [{{\mathrm{sgn}}}(P_0^* T)\big ](x) \\&= \sum _{k=1}^\infty 2^{-k\alpha } P_k{{\mathrm{sgn}}}\big [(P_k^* - P_0^*)T\big ](x) + P_0 F(x) \end{aligned}$$

where

$$\begin{aligned} F(x) = {{\mathrm{sgn}}}(P_0^* T)(x) - \sum _{k=1}^\infty 2^{-k\alpha } {{\mathrm{sgn}}}\big [(P_k^* - P_0^*)T\big ] (x). \end{aligned}$$

Since

$$\begin{aligned} \sup _x|F(x)| \le 1 + \sum _{k=1}^\infty 2^{-k\alpha } = 1 + \frac{2^{-\alpha }}{1-2^{-\alpha }} \end{aligned}$$

it follows from Lemma 14 that

$$\begin{aligned} |P_0F(x) - P_0F(y)| \le \bigg (1 + \frac{2^{-\alpha }}{1-2^{-\alpha }}\bigg ) D_\alpha (x,y) \end{aligned}$$

for all \(x,y\in \mathcal {X}\). Furthermore, letting \(h_k = {{\mathrm{sgn}}}[(P_k^* - P_0^*)T]\), Lemma 14 also implies that \(|P_kh_k(x) - P_kh_k(y)| \le D_k(x,y)\), and consequently

$$\begin{aligned}&\bigg | \sum _{k=1}^\infty 2^{-k\alpha } P_k{{\mathrm{sgn}}}[(P_k^* - P_0^*)T](x) - \sum _{k=1}^\infty 2^{-k\alpha } P_k{{\mathrm{sgn}}}[(P_k^* - P_0^*)T](y)\bigg | \\&\quad \le \sum _{k=1}^\infty 2^{-k\alpha } |P_kh_k(x) - P_kh_k(y)| \\&\quad \le \sum _{k=1}^\infty 2^{-k\alpha } D_k(x,y) \le D_\alpha (x,y). \end{aligned}$$

We also have the estimate

$$\begin{aligned} ||f ||_\infty \lesssim \sum _{k=1}^\infty 2^{-k\alpha } + 1 \le 1 + \frac{2^{-\alpha +1}}{1-2^{-\alpha }}. \end{aligned}$$

It follows that \(||f ||_{\Lambda _\alpha } \le C(\alpha )\), where \(C(\alpha )\) is a constant depending only on \(\alpha \) (in particular, not on T or f). By the definition of f, we see

$$\begin{aligned} \langle f , T \rangle&= \sum _{k=1}^\infty 2^{-k\alpha } \langle (P_k - P_0) {{\mathrm{sgn}}}[(P_k^* - P_0^*)T] ,T\rangle + \langle P_0[{{\mathrm{sgn}}}(P_0^* T)] ,T\rangle \\&= \sum _{k=1}^\infty 2^{-k\alpha } \langle {{\mathrm{sgn}}}[(P_k^* - P_0^*)T],(P_k^*-P_0^*)T \rangle + \langle {{\mathrm{sgn}}}(P_0^* T),P_0^*T\rangle \\&= \sum _{k=1}^\infty 2^{-k\alpha } ||(P_k^* - P_0^*)T ||_1 + ||P_0^*T ||_1. \end{aligned}$$

Therefore,

$$\begin{aligned} ||T ||_{\Lambda _\alpha ^*} \ge C(\alpha )^{-1}\langle f,T\rangle = C(\alpha )^{-1}\bigg (\sum _{k=1}^\infty 2^{-k\alpha } ||(P_k^* - P_0^*)T ||_1 + ||P_0^*T ||_1\bigg ) = C(\alpha )^{-1}||T ||_{\Lambda _\alpha ^*}^{(2)} \end{aligned}$$

which is the desired result. \(\square \)

Putting together Corollary 6, Propositions 12 and 13 we have shown:

Theorem 4

The norms \(||T ||_{\Lambda _\alpha ^*}\), \(||T ||_{\Lambda _\alpha ^*}^{(1)}\) and \(||T ||_{\Lambda _\alpha ^*}^{(2)}\) are equivalent on \(\Lambda _\alpha ^*\).

6 Application to Earth Mover’s Distance

The dual norm \(||T ||_{\Lambda _\alpha ^*}\) has a natural interpretation when the distribution T is the difference of two probability measures P and Q. We will explain this in a more general setting. Suppose \(\Omega \) is any metric/measure space with metric \(\rho \). A measure \(\pi \) on \(\Omega \times \Omega \) satisfies the equality-of-marginals condition with respect to P and Q if

$$\begin{aligned} \pi (\Omega ,E)&= P(E) \nonumber \\ \pi (E,\Omega )&= Q(E) \end{aligned}$$
(EM)

for all measurable sets \(E \subset \Omega \). The Kantorovich–Rubinstein condition is the statement that:

$$\begin{aligned} \sup _{g:|g(x) - g(y)| \le \rho (x,y)}\bigg \{ \int _\Omega gdP - \int _\Omega gdQ \bigg \} = \inf _{\pi : \,(EM) \text { holds}} \int _{\Omega \times \Omega } \rho (x,y) d\pi (x,y). \end{aligned}$$
(KR)

The following theorem gives conditions under which (KR) is true:

Theorem 5

(Kantorovich–Rubinstein) Suppose \(\Omega \) is a measure space that is separable with respect to the metric \(\rho \). Let P and Q be two probability measures on \(\Omega \) such that the expected distance under P and Q from any point is finite. Then the equation (KR) is true.

Proof

The proof can be found in [12]. \(\square \)

The quantity on the right of (KR) is known as the Earth Mover’s Distance between P and Q, denoted \({{\mathrm{EMD}}}(P,Q)\). It has a physical interpretation. We view each measure \(\pi \) satisfying the equality-of-marginals condition (EM) with respect to P and Q as a transport between the measures Q and P; that is, for any two measurable sets \(A,B \subset \Omega \), \(\pi (A,B)\) is interpreted as the amount of mass moved from set A to set B. The equality-of-marginals condition (EM) guarantees that the transport rearranges the mass distribution described by Q to end up with the distribution described by P. If \(\rho (x,y)\) is the cost-per-mass of moving mass from location x to location y, then \({{\mathrm{EMD}}}(P,Q)\) is the minimal cost over all transports; in other words, it is the cheapest way of rearranging mass distributed like Q to get mass distributed like P.
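On a finite point set, the infimum on the right of (KR) is a small linear program in the transport plan \(\pi \), with the equality-of-marginals condition (EM) as the constraints. The following sketch is an outside illustration, not taken from the paper: the three-point line, the choice of marginals, and the use of `scipy.optimize.linprog` are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Three locations on a line; ground distance rho(i, j) = |x_i - x_j|.
x = np.array([0.0, 1.0, 2.0])
rho = np.abs(x[:, None] - x[None, :])

# Rearrange mass distributed like Q into mass distributed like P.
P = np.array([0.5, 0.5, 0.0])
Q = np.array([0.0, 0.5, 0.5])

n = len(x)
c = rho.ravel()                    # cost vector for pi_{ij}, flattened row-major

# Equality-of-marginals (EM): row sums of pi equal Q, column sums equal P.
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0          # sum_j pi_{ij} = Q_i
for j in range(n):
    A_eq[n + j, j::n] = 1.0                   # sum_i pi_{ij} = P_j
b_eq = np.concatenate([Q, P])

res = linprog(c, A_eq=A_eq, b_eq=b_eq)        # default bounds: pi_{ij} >= 0
print(res.fun)   # 1.0 (up to solver tolerance)
```

The optimal plan moves mass 0.5 from the second point to the first and 0.5 from the third to the second, so the minimal cost is 1; the dual side of (KR) confirms this with the Lipschitz function \(g = (1,0,-1)\).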

The quantity on the left of (KR) is the norm of \(T = P - Q\) in the space dual to the Lipschitz functions, except that we do not require the functions against which T is integrated to lie in \(L^\infty \). However, when the diameter of the space is finite, as it is for the distances \(D_\alpha \) we have defined, and \(\int dP = \int dQ\), we may assume that all Lipschitz functions integrated against \(P-Q\) are uniformly bounded, and the two definitions are easily seen to be equivalent; that is, the norm \(||P - Q ||_{\Lambda _\alpha ^*}\) is comparable in size to the left side of (KR).

Due to the way it exploits the geometry of the metric space on which the two probability distributions are defined, EMD has many desirable properties that make it a natural choice of metric for many problems in machine learning [22, 25, 26, 32]. We now describe one such property, which helps explain its robustness.

Suppose \(\Omega \) is a space with a metric \(\rho (x,y)\) and measure \(\mu \). Suppose \(\Omega \) is separable with respect to \(\rho \). Let p be a probability density on \(\Omega \) relative to \(\mu \); that is, p takes values in \([0,\infty )\) and \(\int _\Omega p d\mu = 1\). Let \(h:\Omega \rightarrow \Omega \) be a one-to-one, absolutely continuous (with respect to \(\mu \)) transformation satisfying

$$\begin{aligned} \rho (x,h(x)) \le \epsilon \end{aligned}$$
(5)

for all \(x\in \Omega \). Let \(\nu \) be the measure induced by the change-of-variable h, that is, \(\nu (S) = \mu (h(S))\) for measurable subsets \(S\subset \Omega \); and let \(\frac{d\nu }{d\mu }\) denote the Radon–Nikodym derivative of \(\nu \) with respect to \(\mu \). Then we define the probability density

$$\begin{aligned} q(x) = p(h(x))\frac{d\nu }{d\mu }(x) \end{aligned}$$

obtained from p by the change-of-variable h. We can think of q as a perturbation of p. The distance between p and q in some metrics may be quite large, even though the change-of-variable h is small. For example, if p is a narrow bump on \(\mathbb {R}\) of width less than \(\epsilon \), and h is a shift of size \(\epsilon \), then the supports of p and q will be disjoint and their \(L^1\) distance will be 2, the maximum possible. However, we now show that \({{\mathrm{EMD}}}(p,q)\) is no greater than the size of the perturbation h itself.
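This bump example is easy to reproduce numerically. The sketch below is an outside illustration: the grid, the bump width, and the one-dimensional identity \({{\mathrm{EMD}}}(p,q) = \int |F_p - F_q|\) between the cumulative distribution functions are assumptions not taken from the paper.

```python
import numpy as np

# Discretize [0, 1]; p is a bump of width 0.01, q = p shifted by eps = 0.1.
N = 1000
dx = 1.0 / N
grid = np.arange(N) * dx
eps = 0.1

p = np.where(grid < 0.01, 1.0, 0.0)
p /= p.sum() * dx                          # normalize to a probability density
q = np.roll(p, int(round(eps / dx)))       # the change of variable h(x) = x - eps

l1 = np.abs(p - q).sum() * dx              # supports are disjoint: L1 distance is 2
# On the real line, EMD(p, q) equals the L1 distance between the CDFs.
emd = np.abs(np.cumsum(p - q) * dx).sum() * dx
print(l1, emd)   # l1 saturates at 2; emd equals eps = 0.1 (up to rounding)
```

The \(L^1\) distance sees only that the supports are disjoint, while the Earth Mover's Distance sees that the mass moved only a distance \(\epsilon \), exactly as Proposition 14 below predicts.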

Proposition 14

Under the assumptions described above, \({{\mathrm{EMD}}}(p,q)\le \epsilon \).

Proof

We use the Kantorovich–Rubinstein Theorem (KR):

$$\begin{aligned} {{\mathrm{EMD}}}(p,q) = \sup \bigg \{\int _\Omega f(x)(p(x) - q(x))d\mu (x) : |f(x) - f(y)| \le \rho (x,y)\bigg \}. \end{aligned}$$

Take any f with \(|f(x) - f(y)| \le \rho (x,y)\) for all x and y, and observe that

$$\begin{aligned} \int _\Omega f(x)q(x)d\mu (x)&= \int _\Omega f(x)p(h(x))\frac{d\nu }{d\mu }(x)d\mu (x) \\&= \int _\Omega f(x)p(h(x))d\nu (x) \\&= \int _\Omega f(h^{-1}(y)) p(y)d\mu (y). \end{aligned}$$

Consequently,

$$\begin{aligned} \int _\Omega f(x)(p(x) - q(x))d\mu (x)&= \int _\Omega p(x)(f(x) - f(h^{-1}(x)))d\mu (x). \end{aligned}$$

By assumption (5) on h, we have \(\rho (x,h^{-1}(x))\le \epsilon \); hence, since f has Lipschitz constant 1, we have

$$\begin{aligned} |f(x) - f(h^{-1}(x))| \le \rho (x,h^{-1}(x)) \le \epsilon \end{aligned}$$

and therefore

$$\begin{aligned} \int _\Omega f(x)(p(x) - q(x))d\mu (x)= & {} \int _\Omega p(x)(f(x) - f(h^{-1}(x)))d\mu (x) \\\le & {} \epsilon \int _\Omega p(x)d\mu (x) = \epsilon \end{aligned}$$

since p is a probability density. Taking the supremum over all Lipschitz f gives \({{\mathrm{EMD}}}(p,q) \le \epsilon \), as desired. \(\square \)

In order to apply this theory to the setting of this paper, we need to check that condition (KR) holds when the space \(\mathcal {X}\) is given the metric \(D_\alpha (x,y)\). Since \(D_\alpha (x,y)\) is uniformly bounded (so the expected distance from any point under a probability measure is automatically finite), to apply the Kantorovich–Rubinstein Theorem (Theorem 5) it remains to check that the resulting metric space is separable. We will prove separability by using our assumption that \(\mathcal {X}\) is sigma-finite.

Lemma 15

Under the metric \(D_\alpha (x,y)\), balls in \(\mathcal {X}\) of positive radius have positive measure.

Proof

We deduce this from condition (G) as follows. Suppose there were some ball \(B(x,r)\), \(r>0\), with measure zero. Then

$$\begin{aligned} 1 = \int _\mathcal {X} p_k(x,y) dy \le \int _\mathcal {X} |p_k(x,y)| dy = \int _{B(x,r)^c} |p_k(x,y)| dy \end{aligned}$$

and consequently, since \(D_\alpha (x,y) \ge r\) for all \(y \in B(x,r)^c\),

$$\begin{aligned} r \le \int _{B(x,r)^c} |p_k(x,y)| D_\alpha (x,y)dy \le C 2^{-k\alpha }. \end{aligned}$$

Since \(r>0\), taking \(k\rightarrow \infty \) yields a contradiction. \(\square \)

Proposition 15

The metric \(D_\alpha (x,y)\) turns \(\mathcal {X}\) into a separable metric space; in particular, the Kantorovich–Rubinstein Theorem holds on \(\mathcal {X}\).

Proof

Since we assume \(\mathcal {X}\) is sigma-finite, we can write \(\mathcal {X}\) as a countable union of finite measure sets. Without loss of generality, we can therefore assume that \(\mathcal {X}\) itself has finite measure. Take any positive integer n. Use Zorn's Lemma to find a maximal collection of points \(\{x_i^{(n)}\}_{i\in \mathcal {I}_n}\) such that \(D_\alpha (x_i^{(n)},x_j^{(n)})\ge 1/n\) for \(i \ne j\), where \(\mathcal {I}_n\) is some index set. By maximality, every point in \(\mathcal {X}\) is within \(1/n\) of one of the points \(x_i^{(n)}\).

We will show that \(\mathcal {I}_n\) is countable. Observe that the balls \(B(x_i^{(n)},1/(2n))\) are pairwise disjoint and, by Lemma 15, have positive measure. Since \(\mathcal {X}\) has finite measure, there can only be finitely many balls whose measure lies in the interval \((2^{-k-1},2^{-k}]\), for each \(k\in \mathbb {Z}\). Since the measure of each ball must lie in one such interval, and there are countably many intervals, there are only countably many balls.

Consequently, \(\cup _{n=1}^\infty \{ x_i^{(n)}\}_{i\in \mathcal {I}_n}\) is a countable dense subset of \(\mathcal {X}\), and the proof is complete. \(\square \)

In our setting of the space \(\mathcal {X}\) with the semigroup \(a_t(x,y)\), the formulas for the norm \(||T ||_{\Lambda _\alpha ^*}\) from Sect. 5 provide an approximation to Earth Mover’s Distance. From Theorem 4, and the Kantorovich–Rubinstein Theorem, the Earth Mover’s Distance between two probability measures \(\mu \) and \(\nu \) is equivalent to the expressions

$$\begin{aligned} ||\mu - \nu ||_{\Lambda _\alpha ^*}^{(1)} = ||P_0^*(\mu - \nu ) ||_1 + \sum _{k \ge 0} 2^{-k\alpha } ||\Delta _k^* (\mu - \nu ) ||_1 \end{aligned}$$
(6)

and

$$\begin{aligned} ||\mu - \nu ||_{\Lambda _\alpha ^*}^{(2)} = ||P_0^*(\mu - \nu ) ||_1 + \sum _{k \ge 0} 2^{-k\alpha } ||d_k^* (\mu - \nu ) ||_1. \end{aligned}$$
(7)

In machine learning applications, these formulas can often be computed quickly, and thus provide a fast approximation to Earth Mover's Distance. We only give a sketch of how this works, waving our hands regarding the issues that arise when using discrete data. We take \(\mathcal {X}\) to be a collection of n data points, and the operators \(P_k\) to be dyadic powers of a Markov matrix M on the data, as in the theory of diffusion maps [6].

If the mixing time of the walk is bounded by a power of n, then the series found in formulas (6) and (7) can be well approximated by the first \(O(\log n)\) terms. For many operators encountered in practice, all their dyadic powers can be applied in time \(O(n \log ^kn)\); for instance, see [7, 8]. In this environment, the approximations to EMD given by (6) and (7) can be evaluated at cost \(O(n \log ^kn)\) as well. Note that a simplifying consideration in the case of the operators used for diffusion maps is that the Markov matrices M considered there are similar to a symmetric positive definite matrix (with maximum eigenvalue equal to 1), and so any algorithm that permits rapid application of all powers of such matrices will enable a fast approximation to EMD in our setting.
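As a concrete illustration of formula (6) in this discrete setting, the following sketch is one plausible implementation. Everything in it is an outside assumption: in particular the identification of \(P_k\) with the \(2^{K-k}\)-step transition matrix (so that \(P_0\) is the coarsest operator and the series is truncated at K terms), and the helper names `dyadic_operators` and `emd_approx` are hypothetical.

```python
import numpy as np

def dyadic_operators(M, K):
    """Return [P_0, ..., P_K] with P_k = M^(2^(K - k)), via repeated squaring.

    P_K = M is the finest scale; P_0 = M^(2^K) is the coarsest."""
    ops = [M]
    for _ in range(K):
        ops.append(ops[-1] @ ops[-1])   # each squaring doubles the number of steps
    return ops[::-1]

def emd_approx(T, ops, alpha):
    """Evaluate formula (6) truncated at K = len(ops) - 1 terms:
    ||P_0^* T||_1 + sum_k 2^(-k alpha) ||(P_{k+1} - P_k)^* T||_1."""
    total = np.abs(ops[0].T @ T).sum()
    for k in range(len(ops) - 1):
        total += 2.0 ** (-k * alpha) * np.abs((ops[k + 1] - ops[k]).T @ T).sum()
    return total

# Toy data: a random-walk Markov matrix on n points.
rng = np.random.default_rng(0)
n = 64
A = rng.random((n, n)); A = A + A.T           # symmetric affinities
M = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
ops = dyadic_operators(M, K=6)

T = np.zeros(n); T[0], T[1] = 1.0, -1.0       # mu - nu for two point masses
val = emd_approx(T, ops, alpha=0.5)
print(val)
```

Repeated squaring builds all K dyadic powers with K dense multiplications; fast-application schemes as in [7, 8] replace this step for large n. By Theorem 4 and the Kantorovich–Rubinstein Theorem, the returned quantity is equivalent, up to constants, to EMD with respect to the ground distance \(D_\alpha \).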

In more specialized cases similar formulas have been shown to approximate EMD as well. The work that most closely resembles this one is wavelet EMD [28]. Here, wavelets are used in place of the operators \(\Delta _k\) and \(d_k\). The applicability of this method is limited to \(\mathbb {R}^n\), where the ground distance is a snowflake of the Euclidean metric.

The reader can also refer to the papers by Charikar [4] and Indyk and Thaper [19]. Though the particulars are quite different than in the present work, the general spirit is the same; EMD can be approximated by a weighted sum of \(L^1\) norms of difference operators at different scales, whatever the notion of “scale” might mean for the geometry under consideration.

7 Mixed Hölder–Lipschitz Functions on Product Spaces

We now consider the setting where we have a product of spaces, each equipped with its own semigroup satisfying (S), (C), (I) and (R), so that the theory developed so far can be applied. For simplicity, we will consider only two spaces, which we will denote \(\mathcal {X}\) and \(\mathcal {Y}\), equipped with semigroups \(A_s\) and \(B_t\) having kernels \(a_s(x,x^\prime )\) and \(b_t(y,y^\prime )\), respectively. All the results and their proofs can be extended to arbitrarily many semigroups. We define the dyadic discretizations for times between 0 and 1

$$\begin{aligned} P_k = A_{2^{-k}}, \,\, p_k(x,x^\prime ) = a_{2^{-k}}(x,x^\prime ),\,\, k\ge 0 \end{aligned}$$

and

$$\begin{aligned} Q_l = B_{2^{-l}}, \,\, q_l(y,y^\prime ) = b_{2^{-l}}(y,y^\prime ),\,\,l\ge 0 \end{aligned}$$

and the distances

$$\begin{aligned} D_{\mathcal {X},k}(x,x^\prime ) = ||p_k(x,\cdot ) - p_k(x^\prime ,\cdot ) ||_1 \end{aligned}$$

and

$$\begin{aligned} D_{\mathcal {Y},l}(y,y^\prime ) = ||q_l(y,\cdot ) - q_l(y^\prime ,\cdot ) ||_1. \end{aligned}$$

For \(0<\alpha ,\beta <1\), such that the geometric condition (G) holds for \(P_k\) with respect to \(\alpha \) and (G) holds for \(Q_l\) with respect to \(\beta \), we define metrics on \(\mathcal {X}\) and \(\mathcal {Y}\) by

$$\begin{aligned} D_{\mathcal {X},\alpha }(x,x^\prime ) = \sum _{k\ge 0}2^{-k\alpha }D_{\mathcal {X},k}(x,x^\prime ) \end{aligned}$$

and

$$\begin{aligned} D_{\mathcal {Y},\beta }(y,y^\prime ) = \sum _{l\ge 0}2^{-l\beta }D_{\mathcal {Y},l}(y,y^\prime ). \end{aligned}$$

For brevity, we will let \(D_\mathcal {X} = D_{\mathcal {X},\alpha }\) and \(D_\mathcal {Y} = D_{\mathcal {Y},\beta }\).
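For finite data, these ground distances can be assembled directly from the discretized kernels. The sketch below is an outside illustration: it assumes the kernels are supplied as matrices with `kernels[k][x, y]` playing the role of \(p_k(x,y)\), and it truncates the series to the scales supplied.

```python
import numpy as np

def ground_distance(kernels, alpha):
    """D_alpha(x, x') = sum_k 2^(-k * alpha) * ||p_k(x, .) - p_k(x', .)||_1,
    truncated to the scales supplied in `kernels`."""
    n = kernels[0].shape[0]
    D = np.zeros((n, n))
    for k, P in enumerate(kernels):
        # Pairwise L1 distances between the rows of the scale-k kernel.
        Dk = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
        D += 2.0 ** (-k * alpha) * Dk
    return D

# Sanity check with the identity kernel at every scale: for x != x',
# ||p_k(x, .) - p_k(x', .)||_1 = 2, so D_alpha(x, x') = 2 * sum_k 2^(-k * alpha).
K, n, alpha = 8, 4, 0.5
D = ground_distance([np.eye(n)] * K, alpha)
print(D[0, 1])
```

The same routine applied to the kernels \(q_l\) with exponent \(\beta \) produces \(D_{\mathcal {Y},\beta }\).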

We will define a regularity norm and its dual on the product space \(\mathcal {X}\times \mathcal {Y}\). We first define the following quantities:

$$\begin{aligned} V_\mathcal {X}(f) = \sup _{y,x\ne x^\prime } \frac{f(x,y) - f(x^\prime ,y)}{D_\mathcal {X}(x,x^\prime )}, \end{aligned}$$
$$\begin{aligned} V_\mathcal {Y}(f) = \sup _{x,y\ne y^\prime } \frac{f(x,y) - f(x,y^\prime )}{D_\mathcal {Y}(y,y^\prime )}, \end{aligned}$$

and

$$\begin{aligned} M(f) = \sup _{x\ne x^\prime , y\ne y^\prime } \frac{f(x,y) - f(x,y^\prime ) - f(x^\prime ,y) + f(x^\prime ,y^\prime )}{D_\mathcal {X}(x,x^\prime )D_\mathcal {Y}(y,y^\prime )}. \end{aligned}$$

We then define the norm

$$\begin{aligned} ||f ||_{\Lambda _{\alpha ,\beta }} \equiv M(f) + V_\mathcal {X}(f) + V_\mathcal {Y}(f) + \sup _{x,y}|f(x,y)| \end{aligned}$$

and denote by \(\Lambda _{\alpha ,\beta }\) the space of all functions f such that \(||f ||_{\Lambda _{\alpha ,\beta }} < \infty \). We will call \(\Lambda _{\alpha ,\beta }\) the mixed Hölder–Lipschitz space, since the functions \(f \in \Lambda _{\alpha ,\beta }\) must have bounded mixed difference quotients. The space \(\Lambda _{\alpha ,\beta }\) is analogous to spaces of functions with dominating mixed derivatives on \(\mathbb {R}^n\); see [30].
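For a function sampled on a finite grid, the quantities \(V_\mathcal {X}\), \(V_\mathcal {Y}\) and M can be computed by brute force. The sketch below is an outside illustration: it takes absolute values of the difference quotients (equivalent to the signed suprema above, by exchanging the roles of the primed and unprimed points), and it uses \(|x - x^\prime |\) and \(|y - y^\prime |\) as stand-ins for the metrics \(D_\mathcal {X}\) and \(D_\mathcal {Y}\).

```python
import numpy as np

def mixed_norm(F, xs, ys):
    """sup|f| + V_X(f) + V_Y(f) + M(f) for F[i, j] = f(xs[i], ys[j]),
    with D_X(x, x') = |x - x'| and D_Y(y, y') = |y - y'| as ground metrics."""
    DX = np.abs(xs[:, None] - xs[None, :])
    DY = np.abs(ys[:, None] - ys[None, :])
    ix, jx = np.where(DX > 0)                      # pairs with x != x'
    iy, jy = np.where(DY > 0)                      # pairs with y != y'
    VX = np.max(np.abs(F[ix, :] - F[jx, :]) / DX[ix, jx][:, None])
    VY = np.max(np.abs(F[:, iy] - F[:, jy]) / DY[iy, jy][None, :])
    # Mixed second differences f(x,y) - f(x,y') - f(x',y) + f(x',y').
    mixed = np.abs(F[ix][:, iy] - F[ix][:, jy] - F[jx][:, iy] + F[jx][:, jy])
    M = np.max(mixed / (DX[ix, jx][:, None] * DY[iy, jy][None, :]))
    return np.max(np.abs(F)) + VX + VY + M

# f(x, y) = x * y: the mixed difference quotient is identically 1, and
# V_X = V_Y = sup|f| = 1 on the unit square, so the norm is 4.
xs = ys = np.linspace(0.0, 1.0, 5)
F = np.outer(xs, ys)
norm_val = mixed_norm(F, xs, ys)
print(norm_val)   # 4.0
```

The example \(f(x,y) = xy\) is the simplest function whose regularity is genuinely mixed: its mixed difference quotient is constant, so M(f) is saturated at every pair of points.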

Lemma 16

Taking

$$\begin{aligned} \tilde{V}_\mathcal {X}(f) = \sup _{y,x\ne x^\prime } \frac{(Q_0f)(x,y) - (Q_0f)(x^\prime ,y)}{D_\mathcal {X}(x,x^\prime )}, \end{aligned}$$
$$\begin{aligned} \tilde{V}_\mathcal {Y}(f) = \sup _{x,y\ne y^\prime } \frac{(P_0f)(x,y) - (P_0f)(x,y^\prime )}{D_\mathcal {Y}(y,y^\prime )} \end{aligned}$$

in place of, respectively, the seminorms \(V_\mathcal {X}(f)\) and \(V_\mathcal {Y}(f)\) in the definition of \(||f ||_{\Lambda _{\alpha ,\beta }}\) yields an equivalent norm.

Proof

From condition (I), it is immediate that \(\tilde{V}_\mathcal {X}(f) \lesssim V_\mathcal {X}(f) \) and \(\tilde{V}_\mathcal {Y}(f) \lesssim V_\mathcal {Y}(f) \). For the other inequality, we can control \(V_\mathcal {X}(f)\) by \(\tilde{V}_\mathcal {X}(f)\) and M(f), and control \(V_\mathcal {Y}(f)\) by \(\tilde{V}_\mathcal {Y}(f)\) and M(f). To see this, observe that

$$\begin{aligned}&| (Q_0f)(x,y) - (Q_0f)(x^\prime ,y) - f(x,y) + f(x^\prime ,y)| \\&\quad = \bigg | \int _\mathcal {Y}q_0(y,y^\prime )[f(x,y^\prime ) - f(x^\prime ,y^\prime ) - f(x,y) + f(x^\prime ,y)]dy^\prime \bigg | \\&\quad \le CD_\mathcal {X}(x,x^\prime ){{\mathrm{diam}}}(\mathcal {Y})M(f) \\&\quad \lesssim M(f)D_\mathcal {X}(x,x^\prime ) \end{aligned}$$

where we have used condition (I) in the second-to-last inequality. Consequently, \(V_\mathcal {X}(f) \lesssim \tilde{V}_\mathcal {X}(f) + M(f)\); similarly, \(V_\mathcal {Y}(f) \lesssim \tilde{V}_\mathcal {Y}(f) + M(f)\). It follows that replacing \(V_\mathcal {X}(f)\) and \(V_\mathcal {Y}(f)\) by, respectively, \(\tilde{V}_\mathcal {X}(f)\) and \(\tilde{V}_\mathcal {Y}(f)\) in the definition of \(||f ||_{\Lambda _{\alpha ,\beta }}\) yields an equivalent norm. \(\square \)

Of course, other minor variations in the definition of \(||f ||_{\Lambda _{\alpha ,\beta }}\) yielding equivalent norms are also possible. However, as in the case of a single space our primary goal is to give simpler characterizations of the norm \(||f ||_{\Lambda _{\alpha ,\beta }}\) involving the changes in the function’s averages across scales. More precisely, the equivalent norms will measure the variations across all pairs of scales. In Sect. 8, we will use these to give simple characterizations of the norm on the space dual to \(\Lambda _{\alpha ,\beta }\).

We define the difference operators

$$\begin{aligned} \Delta _{P,k} = P_{k+1} - P_k ,\,\,\,\,\Delta _{Q,l} = Q_{l+1} - Q_l \end{aligned}$$

as well as

$$\begin{aligned} \delta _{P,k} = I - P_k,\,\,\,\,\delta _{Q,l} = I - Q_l. \end{aligned}$$

We then define

$$\begin{aligned} V_\mathcal {X}^{(1)}(f) = \sup _{k\ge 0}\sup _{x,y}2^{k\alpha }|\Delta _{P,k}f(x,y)|,\,\,\,\,\,V_\mathcal {Y}^{(1)}(f) = \sup _{l\ge 0}\sup _{x,y}2^{l\beta }|\Delta _{Q,l}f(x,y)|, \end{aligned}$$

and

$$\begin{aligned} M^{(1)}(f) = \sup _{k\ge 0,l\ge 0}\sup _{x,y}2^{k\alpha +l\beta } |\Delta _{P,k}\Delta _{Q,l}f(x,y)|. \end{aligned}$$

Similarly, define

$$\begin{aligned} V_\mathcal {X}^{(2)}(f) = \sup _{k\ge 0}\sup _{x,y}2^{k\alpha }|\delta _{P,k}f(x,y)|,\,\,\,\,\,V_\mathcal {Y}^{(2)}(f) = \sup _{l\ge 0}\sup _{x,y}2^{l\beta }|\delta _{Q,l}f(x,y)|, \end{aligned}$$

and

$$\begin{aligned} M^{(2)}(f) = \sup _{k\ge 0,l\ge 0}\sup _{x,y}2^{k\alpha +l\beta }\ |\delta _{P,k}\delta _{Q,l}f(x,y)|. \end{aligned}$$

We can now define the equivalent regularity norms by

$$\begin{aligned} ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)} = M^{(1)}(f) + V_\mathcal {X}^{(1)}(f) + V_\mathcal {Y}^{(1)}(f) + \sup _{x,y}|f(x,y)| \end{aligned}$$

and

$$\begin{aligned} ||f ||_{\Lambda _{\alpha ,\beta }}^{(2)} = M^{(2)}(f) + V_\mathcal {X}^{(2)}(f) + V_\mathcal {Y}^{(2)}(f) + \sup _{x,y}|f(x,y)|. \end{aligned}$$

We first show that \(||f ||_{\Lambda _{\alpha ,\beta }}\) controls \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)}\). It will follow that on \(\Lambda _{\alpha ,\beta }\) we have uniform convergence of the semigroups and their products to the identity.

Proposition 16

For any function f, \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)} \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}\).

Proof

Showing that \(V_\mathcal {X}^{(2)}(f)\) and \(V_\mathcal {Y}^{(2)}(f)\) are controlled by, respectively, \(V_\mathcal {X}(f)\) and \(V_\mathcal {Y}(f)\) is an immediate consequence of the one-dimensional result, Proposition 7. To show \(M^{(2)}(f) \lesssim M(f)\), observe that Proposition 7 also gives

$$\begin{aligned} |2^{k\alpha }\delta _{P,k}2^{l\beta } \delta _{Q,l} f(x,y)|&\lesssim \sup _{x \ne x^\prime } \frac{2^{l\beta }\delta _{Q,l} f(x,y) - 2^{l\beta }\delta _{Q,l} f(x^\prime ,y)}{D_\mathcal {X}(x,x^\prime )} \\&= \sup _{x \ne x^\prime } \frac{2^{l\beta }\delta _{Q,l}[ f(x,\cdot ) - f(x^\prime ,\cdot )] (y)}{D_\mathcal {X}(x,x^\prime )}. \end{aligned}$$

Now apply Proposition 7 again to the function \(y \mapsto f(x,y) - f(x^\prime ,y)\) to obtain the bound

$$\begin{aligned} |2^{l\beta }\delta _{Q,l}[ f(x,\cdot ) - f(x^\prime ,\cdot )] (y)| \lesssim \sup _{y\ne y^\prime }\frac{f(x,y) - f(x^\prime ,y)-f(x,y^\prime )+ f(x^\prime ,y^\prime ) }{D_\mathcal {Y}(y,y^\prime )}. \end{aligned}$$

The result follows. \(\square \)

It is easy to see that if \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)} < \infty \), then

$$\begin{aligned} \lim _{k \rightarrow \infty , l \rightarrow \infty } P_k Q_l f = f \end{aligned}$$

uniformly, where the limits can be taken in either order or simultaneously. Since \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)} \lesssim ||f ||_{\Lambda _{\alpha ,\beta }} \), the same convergence applies for any \(f \in \Lambda _{\alpha ,\beta }\).

We will next show that \(||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\) and \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)}\) are equivalent, and then that \(||f ||_{\Lambda _{\alpha ,\beta }} \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\). To that end:

Lemma 17

The seminorms \(V_\mathcal {X}^{(1)}(f)\) and \(V_\mathcal {X}^{(2)}(f)\) are equivalent on \(\Lambda _{\alpha ,\beta }\), as are the seminorms \(V_\mathcal {Y}^{(1)}(f)\) and \(V_\mathcal {Y}^{(2)}(f)\).

Proof

This follows immediately from Proposition 8 for a single semigroup. \(\square \)

Lemma 18

The seminorms \(M^{(1)}(f)\) and \(M^{(2)}(f)\) are equivalent on \(\Lambda _{\alpha ,\beta }\).

Proof

From Proposition 8, we have

$$\begin{aligned} |2^{k\alpha } \Delta _{P,k} 2^{l\beta } \Delta _{Q,l} f(x,y)|&\lesssim \sup _{x^\prime }\sup _{k^\prime \ge 0} 2^{k^\prime \alpha }|\delta _{P,k^\prime } 2^{l\beta } \Delta _{Q,l} f(x^\prime ,y)| \\&= \sup _{x^\prime }\sup _{k^\prime \ge 0} 2^{l\beta } | \Delta _{Q,l} 2^{k^\prime \alpha }\delta _{P,k^\prime }f(x^\prime ,y)| \\&\lesssim \sup _{x^\prime }\sup _{k^\prime \ge 0} \sup _{y^\prime } \sup _{l^\prime \ge 0} 2^{l^\prime \beta } | \delta _{Q,l^\prime } 2^{k^\prime \alpha }\delta _{P,k^\prime }f(x^\prime ,y^\prime )| \end{aligned}$$

which proves \(M^{(1)}(f) \lesssim M^{(2)}(f)\). The other direction is proved similarly. \(\square \)

Combining Lemmas 17 and 18, we get:

Proposition 17

The norms \(||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\) and \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)}\) are equivalent on \(\Lambda _{\alpha ,\beta }\).

To finish proving that all three norms are equivalent, we will show that \(||f ||_{\Lambda _{\alpha ,\beta }}\lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\).

Proposition 18

For all \(f \in \Lambda _{\alpha ,\beta }\), \(||f ||_{\Lambda _{\alpha ,\beta }} \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\).

Proof

First, it is trivial to deduce \(V_\mathcal {X}(f) \lesssim V_\mathcal {X}^{(1)}(f) + \sup _{x,y}|f(x,y)|\) and \(V_\mathcal {Y}(f) \lesssim V_\mathcal {Y}^{(1)}(f)+ \sup _{x,y}|f(x,y)|\) from Proposition 9. Therefore, it remains to show \(M(f) \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\).

Fix any \(y,y^\prime \in \mathcal {Y}\) and define

$$\begin{aligned} g(x) = \frac{f(x,y) - f(x,y^\prime )}{D_\mathcal {Y}(y,y^\prime )}. \end{aligned}$$

From Proposition 9 again, we have that for all \(x \ne x^\prime \),

$$\begin{aligned} \frac{f(x,y)-f(x,y^\prime )-f(x^\prime ,y)+f(x^\prime ,y^\prime )}{D_\mathcal {X}(x,x^\prime )D_\mathcal {Y}(y,y^\prime )} \lesssim \sup _{k \ge 0} \sup _{x^{\prime \prime }} 2^{k\alpha } |\Delta _{P,k}g(x^{\prime \prime })| + \sup _{x^{\prime \prime }}|g(x^{\prime \prime })|. \end{aligned}$$

The supremum of g is bounded by

$$\begin{aligned} \sup _{x^{\prime \prime }}|g(x^{\prime \prime })| \le V_\mathcal {Y}(f) \lesssim V_\mathcal {Y}^{(1)}(f) + \sup _{x,y}|f(x,y)|. \end{aligned}$$

Furthermore, we have

$$\begin{aligned} 2^{k\alpha }|\Delta _{P,k}g(x^{\prime \prime })|&= 2^{k\alpha }\frac{|\Delta _{P,k}f(x^{\prime \prime },y) - \Delta _{P,k}f(x^{\prime \prime },y^\prime )|}{D_\mathcal {Y}(y,y^\prime )} \\&\lesssim 2^{k\alpha }\sup _{l\ge 0} \sup _{y^{\prime \prime }} 2^{l\beta }|\Delta _{Q,l}\Delta _{P,k}f(x^{\prime \prime },y^{\prime \prime })| + 2^{k\alpha } \sup _{y^{\prime \prime }} |\Delta _{P,k}f(x^{\prime \prime },y^{\prime \prime })|\\&\le M^{(1)}(f) + V_{\mathcal {X}}^{(1)}(f). \end{aligned}$$

It follows that \(M(f) \lesssim M^{(1)}(f) + V_\mathcal {Y}^{(1)}(f) + V_\mathcal {X}^{(1)}(f) + \sup _{x,y}|f(x,y)| = ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\), which completes the proof. \(\square \)

Combining Propositions 16, 17, and 18, we have shown:

Theorem 6

The norms \(||f ||_{\Lambda _{\alpha ,\beta }}\), \(||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}\) and \(||f ||_{\Lambda _{\alpha ,\beta }}^{(2)}\) are equivalent on \(\Lambda _{\alpha ,\beta }\).

7.1 Approximating Mixed Hölder–Lipschitz Functions

The reader will recall Proposition 7, which states that for a Hölder–Lipschitz function f on a single space \(\mathcal {X}\),

$$\begin{aligned} \sup _x |f(x) - P_Lf(x)| \le C V(f) 2^{-L\alpha } \end{aligned}$$

for some constant \(C > 0 \). In other words, Hölder–Lipschitz functions f are well-approximated by their averages under the semigroup. We will derive a similar result for mixed Hölder–Lipschitz functions. For any positive integer L, define the operator \(\mathcal {P}_L\) by

$$\begin{aligned} \mathcal {P}_Lf = \sum _{k,l:k+l\le L} \Delta _{P,k} \Delta _{Q,l} f + (Q_L - Q_0)P_0 f + (P_L - P_0)Q_0 f + P_0 Q_0 f. \end{aligned}$$

Loosely speaking, \(\mathcal {P}_L f\) captures the activity of the function f at the pairs of scales \((2^{-k},2^{-l})\) with \(2^{-(k+l)} \ge 2^{-L}\); in particular, the reciprocals \((2^{k},2^{l})\) lie in the hyperbolic cross \(\{(x,y) \in \mathbb {R}^2 : |xy| \le 2^{L} \}\). There is an extensive theory of hyperbolic cross approximations from classical analysis; for instance, see [30]. In our setting, we have the following result on uniformly approximating f by \(\mathcal {P}_Lf\):
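When the operators are discretized, \(\mathcal {P}_L\) is simply a sum of matrix products, with the \(P_k\) acting on the rows and the \(Q_l\) on the columns of the sampled function. The sketch below is an outside illustration (the operator lists, the grid function, and the helper name are assumptions); it writes the low-frequency completion terms as \((Q_L - Q_0)P_0 f\) and \((P_L - P_0)Q_0 f\), so that the remainder \(f - \mathcal {P}_Lf\) consists exactly of the tail sums estimated in the proof of Proposition 19.

```python
import numpy as np

def hyperbolic_cross(F, P, Q, L):
    """Apply P_L to F[i, j] = f(x_i, y_j), given operator lists P[0..L+1], Q[0..L+1].

    Keeps the pairs of scales with k + l <= L together with the completion
    terms (Q_L - Q_0) P_0 f, (P_L - P_0) Q_0 f and P_0 Q_0 f."""
    out = P[0] @ F @ Q[0].T
    out = out + P[0] @ F @ (Q[L] - Q[0]).T + (P[L] - P[0]) @ F @ Q[0].T
    for k in range(L + 1):
        for l in range(L + 1 - k):
            out = out + (P[k + 1] - P[k]) @ F @ (Q[l + 1] - Q[l]).T
    return out

# Sanity check: when every operator is the identity, every difference term
# vanishes and P_L f = P_0 Q_0 f = f exactly.
n, L = 5, 3
I = np.eye(n)
F = np.arange(n * n, dtype=float).reshape(n, n)
approx = hyperbolic_cross(F, [I] * (L + 2), [I] * (L + 2), L)
print(np.allclose(approx, F))   # True
```

The double loop visits \((L+1)(L+2)/2\) pairs of scales rather than the full \((L+1)^2\) grid, which is the computational point of the hyperbolic cross.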

Proposition 19

For any \(L \ge 0\), we have

$$\begin{aligned} \sup _{x,y}|\mathcal {P}_Lf(x,y) - f(x,y)| \le C ||f ||_{\Lambda _{\alpha ,\beta }} {\left\{ \begin{array}{ll} L 2^{-L\alpha }, &{}\text { if } \alpha = \beta \\ 2^{-L\min (\alpha ,\beta )}, &{}\text { if } \alpha \ne \beta \end{array}\right. } \end{aligned}$$

where \(C>0\) is a constant.

Proof

We can write f as

$$\begin{aligned} f&= \sum _{k,l \ge 0} \Delta _{P,k} \Delta _{Q,l} f + \sum _{l \ge 0} \Delta _{Q,l} P_0 f + \sum _{k \ge 0} \Delta _{P,k} Q_0 f + P_0 Q_0 f \\&= \sum _{k,l \ge 0} \Delta _{P,k} \Delta _{Q,l} f + \sum _{l \ge L} \Delta _{Q,l} P_0 f + (Q_L - Q_0)P_0 f + \sum _{k \ge L} \Delta _{P,k} Q_0 f \\&\quad +\, (P_L - P_0)Q_0 f + P_0 Q_0 f . \end{aligned}$$

Therefore,

$$\begin{aligned} f - \mathcal {P}_L f&= \sum _{k,l:k+l > L} \Delta _{P,k} \Delta _{Q,l} f + \sum _{l \ge L} \Delta _{Q,l} P_0 f + \sum _{k \ge L} \Delta _{P,k} Q_0 f. \end{aligned}$$

It is easy to see from condition (I) that \(V_{\mathcal {Y}}^{(1)}(P_{0}f) \lesssim V_{\mathcal {Y}}^{(1)}(f)\) and \(V_{\mathcal {X}}^{(1)}(Q_{0}f) \lesssim V_{\mathcal {X}}^{(1)}(f)\). Using this, we have

$$\begin{aligned} \bigg | \sum _{l \ge L} \Delta _{Q,l} P_0 f \bigg | \lesssim \sum _{l \ge L} V_{\mathcal {Y}}^{(1)}(P_0 f) 2^{-l\beta } \lesssim \sum _{l \ge L} V_{\mathcal {Y}}^{(1)}(f) 2^{-l\beta } \lesssim V_{\mathcal {Y}}^{(1)}(f) 2^{-L \beta } \end{aligned}$$
(8)

and similarly

$$\begin{aligned} \bigg | \sum _{k \ge L} \Delta _{P,k} Q_0 f \bigg | \lesssim V_{\mathcal {X}}^{(1)}(f) 2^{-L \alpha } . \end{aligned}$$
(9)

Finally, we control the mixed difference term:

$$\begin{aligned}&\bigg | \sum \limits _{k,l:k+l > L} \Delta _{P,k} \Delta _{Q,l} f \bigg | \nonumber \\&\quad \le \sum \limits _{k=0}^{L} \sum \limits _{l=L-k+1}^{\infty } | \Delta _{P,k} \Delta _{Q,l} f | + \sum \limits _{k=L+1}^{\infty } \sum \limits _{l=0}^{\infty } | \Delta _{P,k} \Delta _{Q,l} f | \nonumber \\&\quad \le M^{(1)}(f) \sum \limits _{k=0}^{L} 2^{-k\alpha } \sum \limits _{l=L-k+1}^{\infty } 2^{-l\beta } + M^{(1)}(f)\sum \limits _{k=L+1}^{\infty } 2^{-k\alpha } \sum \limits _{l=0}^{\infty } 2^{-l\beta } \nonumber \\&\quad \le \frac{M^{(1)}(f)}{1-2^{-\beta }} 2^{-(L+1)\beta } \sum \limits _{k=0}^{L} 2^{-k(\alpha -\beta )} + \frac{M^{(1)}(f)}{1-2^{-\beta }} \frac{1}{1-2^{-\alpha }}2^{-(L+1)\alpha }. \end{aligned}$$
(10)

Now, if \(\alpha = \beta \), then \(\sum _{k=0}^{L} 2^{-k(\alpha -\beta )} = L+1\), and (10) becomes

$$\begin{aligned} \bigg | \sum _{k,l:k+l > L} \Delta _{P,k} \Delta _{Q,l} f \bigg | \lesssim M^{(1)}(f)L2^{-L\alpha } \end{aligned}$$

which, combined with (8) and (9), gives the estimate \(\sup _{x,y}|\mathcal {P}_Lf(x,y) - f(x,y)| \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}L 2^{-L\alpha }\).

If \(\alpha < \beta \), then \(\sum _{k=0}^{L} 2^{-k(\alpha -\beta )} \simeq 2^{L(\beta - \alpha )}\) so that (10) becomes

$$\begin{aligned} \bigg | \sum _{k,l:k+l > L} \Delta _{P,k} \Delta _{Q,l} f \bigg | \lesssim M^{(1)}(f) (2^{-L\beta } 2^{L(\beta -\alpha )} + 2^{-L\alpha }) \lesssim M^{(1)}(f) 2^{-L\alpha } \end{aligned}$$

and combining this with (8) and (9) gives the estimate \(\sup _{x,y}|\mathcal {P}_Lf(x,y) - f(x,y)| \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}2^{-L\alpha }\).

Finally, if \(\alpha > \beta \), then \( \sum _{k=0}^{L} 2^{-k(\alpha -\beta )} \simeq 1\), and so from (10) we get:

$$\begin{aligned} \bigg | \sum _{k,l:k+l > L} \Delta _{P,k} \Delta _{Q,l} f \bigg | \lesssim M^{(1)}(f) ( 2^{-L\beta } + 2^{-L\alpha }) \lesssim M^{(1)}(f) 2^{-L\beta } \end{aligned}$$

from which the estimate \(\sup _{x,y}|\mathcal {P}_Lf(x,y) - f(x,y)| \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}^{(1)}2^{-L\beta }\) also follows. Since \(||f ||_{\Lambda _{\alpha ,\beta }}^{(1)} \lesssim ||f ||_{\Lambda _{\alpha ,\beta }}\) by Theorem 6, we are done. \(\square \)

8 The Space Dual to \(\Lambda _{\alpha ,\beta }\)

We now consider the space \(\Lambda _{\alpha ,\beta }^*\) of \(L^1\) measures dual to the space \(\Lambda _{\alpha ,\beta }\) of mixed Hölder–Lipschitz functions. We will derive two simpler norms that are equivalent to the canonical dual norm on \(\Lambda _{\alpha ,\beta }^*\), as we did for the case of a single semigroup in Sect. 5.

The dual norm of \(T \in \Lambda _{\alpha ,\beta }^*\) is given by

$$\begin{aligned} ||T ||_{\Lambda _{\alpha ,\beta }^*}= \sup _{||f ||_{\Lambda _{\alpha ,\beta }} \le 1} \langle f, T\rangle . \end{aligned}$$

Before defining the equivalent norms, we introduce some notation. Define

$$\begin{aligned} d_{P,k} = P_k - P_0, \,\, d_{Q,l} = Q_l - Q_0 \end{aligned}$$

and

$$\begin{aligned} W_\mathcal {X}^{(1)}(T) = \sum _{k \ge 0} 2^{-k\alpha } ||\Delta _{P,k}^* Q_0^* T ||_1, \,\, W_\mathcal {Y}^{(1)}(T) = \sum _{l \ge 0} 2^{-l\beta } ||\Delta _{Q,l}^* P_0^* T ||_1 \end{aligned}$$

and

$$\begin{aligned} W_\mathcal {X}^{(2)}(T) = \sum _{k \ge 0} 2^{-k\alpha } ||d_{P,k}^* Q_0^* T ||_1, \, \, W_\mathcal {Y}^{(2)}(T) = \sum _{l \ge 0} 2^{-l\beta } ||d_{Q,l}^* P_0^* T ||_1 \end{aligned}$$

as well as

$$\begin{aligned} N^{(1)}(T) = \sum _{k \ge 0, l \ge 0} 2^{-k\alpha } 2^{-l\beta } ||\Delta _{P,k}^* \Delta _{Q,l}^* T ||_1, \,\, N^{(2)}(T) = \sum _{k \ge 0, l \ge 0} 2^{-k\alpha } 2^{-l\beta } ||d_{P,k}^* d_{Q,l}^* T ||_1. \end{aligned}$$

With these definitions, we define the two norms we will show are equivalent to \(||T ||_{\Lambda _{\alpha ,\beta }^*}\). The first norm is defined by

$$\begin{aligned} ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(1)} = N^{(1)}(T)+ W_\mathcal {X}^{(1)}(T) + W_\mathcal {Y}^{(1)}(T) + ||P_0^* Q_0^* T ||_1 \end{aligned}$$

and the second is defined by

$$\begin{aligned} ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)} = N^{(2)}(T)+ W_\mathcal {X}^{(2)}(T) + W_\mathcal {Y}^{(2)}(T) + ||P_0^* Q_0^* T ||_1 . \end{aligned}$$

Lemma 19

The seminorms \(W_\mathcal {X}^{(1)}(T)\) and \(W_\mathcal {X}^{(2)}(T)\) are equivalent on \(\Lambda _{\alpha ,\beta }^*\), as are the seminorms \(W_\mathcal {Y}^{(1)}(T)\) and \(W_\mathcal {Y}^{(2)}(T)\).

Proof

This follows immediately from Proposition 11. \(\square \)

Lemma 20

The seminorms \(N^{(1)}(T)\) and \(N^{(2)}(T)\) are equivalent on \(\Lambda _{\alpha ,\beta }^*\).

Proof

We reduce the proof to the case of a single semigroup by applying Proposition 11 repeatedly. We have

$$\begin{aligned} N^{(1)}(T)&= \sum _{ l \ge 0} 2^{-l\beta } \sum _{k \ge 0} 2^{-k\alpha } ||\Delta _{P,k}^* \Delta _{Q,l}^* T ||_1 \\&= \int _\mathcal {Y}\sum _{ l \ge 0} 2^{-l\beta } \sum _{k \ge 0} 2^{-k\alpha } ||\Delta _{P,k}^* \Delta _{Q,l}^* T(\cdot ,y) ||_{L^1(\mathcal {X})}dy\\&\simeq \int _\mathcal {Y} \sum _{ l \ge 0} 2^{-l\beta } \sum _{k \ge 0} 2^{-k\alpha }||d_{P,k}^* \Delta _{Q,l}^* T(\cdot ,y) ||_{L^1(\mathcal {X})}dy \\&= \int _\mathcal {X} \sum _{k \ge 0} 2^{-k\alpha }\sum _{ l \ge 0} 2^{-l\beta } ||\Delta _{Q,l}^* d_{P,k}^*T(x,\cdot ) ||_{L^1(\mathcal {Y})}dx \\&\simeq \int _\mathcal {X} \sum _{k \ge 0} 2^{-k\alpha }\sum _{ l \ge 0} 2^{-l\beta } ||d_{Q,l}^* d_{P,k}^*T(x,\cdot ) ||_{L^1(\mathcal {Y})} dx \\&= \sum _{k \ge 0} 2^{-k\alpha }\sum _{ l \ge 0} 2^{-l\beta } ||d_{Q,l}^* d_{P,k}^*T ||_1= N^{(2)}(T). \end{aligned}$$

\(\square \)

Proposition 20

The norms \(||T ||_{\Lambda _{\alpha ,\beta }^*}^{(1)}\) and \(||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}\) are equivalent on \(\Lambda _{\alpha ,\beta }^*\).

Proof

This follows immediately from the preceding two lemmas. \(\square \)

We will now prove that \(||T ||_{\Lambda _{\alpha ,\beta }^*}\) and \(||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}\) are equivalent on \(\Lambda _{\alpha ,\beta }^*\). We will work formally; all manipulations can be easily justified by the fact that \(P_kQ_lf\) converges uniformly to f as \(k,l\rightarrow \infty \), whenever \(f \in \Lambda _{\alpha ,\beta }\). Take any function f with \(||f ||_{\Lambda _{\alpha ,\beta }} \le 1\). Write

$$\begin{aligned} f = \sum _{k \ge 0, l \ge 0} \Delta _{P,k} \Delta _{Q,l} f + \sum _{k \ge 0} \Delta _{P,k} Q_0 f + \sum _{l \ge 0} \Delta _{Q,l} P_0 f + P_0 Q_0 f. \end{aligned}$$
(11)

We want to show that \(|\langle f,T \rangle | \lesssim ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}\). We will deal with the inner product of f with each of the four terms on the right side of (11) separately. First, we have

$$\begin{aligned} |\langle P_0 Q_0 f, T \rangle | = |\langle f ,P_0^* Q_0^* T\rangle | \le ||P_0^* Q_0^* T ||_1 \end{aligned}$$
(12)

since \(||f ||_\infty \le 1\).

To control the inner product of T with \(\sum _{k \ge 0} \Delta _{P,k} Q_0 f\), first observe that, using Lemma 13, we get

$$\begin{aligned} \Delta _{P,k} Q_0= & {} \Delta _{P,k+1} (P_{k+2} + P_{k+1}) Q_0 \nonumber \\= & {} \Delta _{P,k+1} P_{k+2} Q_0 + \Delta _{P,k+1} P_{k+1} Q_0. \end{aligned}$$
(13)
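Identity (13) rests on the difference-of-squares factorization \(\Delta _{P,k} = \Delta _{P,k+1}(P_{k+2} + P_{k+1})\) of Lemma 13: at dyadic times, \(P_{k+1} - P_k = (P_{k+2})^2 - (P_{k+1})^2 = (P_{k+2} - P_{k+1})(P_{k+2} + P_{k+1})\). A minimal numerical sanity check, assuming \(P_k = e^{-2^{-k}A}\) for an illustrative symmetric generator A (the matrix and dimensions below are not from the paper):

```python
import numpy as np

# Illustrative symmetric positive semidefinite generator A (not from the paper).
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T
w, V = np.linalg.eigh(A)

def P(t):
    # Semigroup e^{-tA} via the spectral decomposition of A.
    return (V * np.exp(-t * w)) @ V.T

def P_k(k):
    # Dyadic discretization P_k = e^{-2^{-k} A}.
    return P(2.0 ** (-k))

def Delta(k):
    # Dyadic difference operator Delta_{P,k} = P_{k+1} - P_k.
    return P_k(k + 1) - P_k(k)

# Lemma 13's factorization: Delta_{P,k} = Delta_{P,k+1} (P_{k+2} + P_{k+1}).
k = 3
lhs = Delta(k)
rhs = Delta(k + 1) @ (P_k(k + 2) + P_k(k + 1))
assert np.allclose(lhs, rhs)

# Telescoping behind decomposition (11): the Delta_{P,k}, k < K, sum to P_K - P_0.
K = 6
tele = sum(Delta(k) for k in range(K))
assert np.allclose(tele, P_k(K) - P_k(0))
```

The factorization uses only the semigroup property, so any choice of symmetric generator exercises the same algebra.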

Now, we have the identity

$$\begin{aligned} \Delta _{P,k+1} P_{k+1} Q_0&= \Delta _{P,k+1} (P_{k+1} - P_0) Q_0 + \Delta _{P,k+1} P_0 Q_0 \\&= \Delta _{P,k+1} d_{P,k+1} Q_0 + \Delta _{P,k+1} P_0 Q_0, \end{aligned}$$

from which it follows:

$$\begin{aligned} | \langle \Delta _{P,k+1} P_{k+1} Q_0 f, T\rangle |\le & {} |\langle \Delta _{P,k+1} d_{P,k+1} Q_0 f,T\rangle | + | \langle \Delta _{P,k+1} P_0 Q_0f,T\rangle | \nonumber \\= & {} |\langle \Delta _{P,k+1} f,d_{P,k+1} ^*Q_0^*T\rangle | + | \langle \Delta _{P,k+1}f, P_0^* Q_0^*T\rangle | \nonumber \\\le & {} \sup _{x,y} |\Delta _{P,k+1} f(x,y)| \Big \{ ||d_{P,k+1}^*Q_0^*T ||_1 + ||P_0^* Q_0^* T ||_1 \Big \} \nonumber \\\lesssim & {} 2^{-k\alpha } ||d_{P,k+1}^*Q_0^*T ||_1 +2^{-k\alpha } ||P_0^* Q_0^* T ||_1. \end{aligned}$$
(14)

Similarly,

$$\begin{aligned} | \langle \Delta _{P,k+1} P_{k+2} Q_0 f, T\rangle |&\le 2^{-k\alpha } ||d_{P,k+2}^*Q_0^*T ||_1 +2^{-k\alpha } ||P_0^* Q_0^* T ||_1 . \end{aligned}$$
(15)

Combining equation (13) with the estimates (14) and (15) yields

$$\begin{aligned} |\langle \Delta _{P,k} Q_0 f, T \rangle | \lesssim 2^{-k\alpha } \Big \{ ||d_{P,k+1}^*Q_0^*T ||_1 + ||d_{P,k+2}^*Q_0^*T ||_1 + 2 ||P_0^* Q_0^* T ||_1 \Big \} \end{aligned}$$

and summing over \(k \ge 0\) then yields

$$\begin{aligned} \bigg |\bigg \langle \sum _{k \ge 0} \Delta _{P,k} Q_0 f, T\bigg \rangle \bigg |&\lesssim W_\mathcal {X}^{(2)}(T) + ||P_0^* Q_0^* T ||_1. \end{aligned}$$
(16)

A nearly identical proof shows that

$$\begin{aligned} \bigg |\bigg \langle \sum _{l \ge 0} \Delta _{Q,l} P_0 f, T\bigg \rangle \bigg |&\lesssim W_\mathcal {Y}^{(2)}(T) + ||P_0^* Q_0^* T ||_1. \end{aligned}$$
(17)

The only term left to control from (11) is the inner product of T with

$$\begin{aligned} \sum _{k \ge 0, l \ge 0} \Delta _{P,k} \Delta _{Q,l} f. \end{aligned}$$

Using the identity

$$\begin{aligned} \Delta _{P,k-1} \Delta _{Q,l-1}= & {} \Delta _{P,k} (P_{k+1} + P_{k}) \Delta _{Q,l} (Q_{l+1} + Q_{l}) \nonumber \\= & {} \Delta _{P,k} P_{k+1}\Delta _{Q,l} Q_{l+1} + \Delta _{P,k} P_{k}\Delta _{Q,l} Q_{l+1} \nonumber \\&+\, \Delta _{P,k} P_{k+1}\Delta _{Q,l} Q_{l}+\Delta _{P,k} P_{k}\Delta _{Q,l} Q_{l} \end{aligned}$$
(18)

it follows that we must control the inner product of T with each of the four terms on the right side of (18) (applied to f). The argument is the same for each, so we will show it only for \(\Delta _{P,k} P_{k} \Delta _{Q,l} Q_{l} f = \Delta _{P,k} \Delta _{Q,l} P_{k}Q_{l} f \).

From the easily-verified identity

$$\begin{aligned} P_k Q_l f = d_{P,k} d_{Q,l} f+ d_{P,k} Q_0 f+ d_{Q,l} P_0 f+ P_0 Q_0f \end{aligned}$$

it follows that

$$\begin{aligned} \Delta _{P,k} \Delta _{Q,l}P_k Q_l f= & {} \Delta _{P,k} \Delta _{Q,l} d_{P,k} d_{Q,l} f+ \Delta _{P,k} \Delta _{Q,l}d_{P,k} Q_0f \nonumber \\&+\, \Delta _{P,k} \Delta _{Q,l}d_{Q,l} P_0f +\Delta _{P,k} \Delta _{Q,l} P_0 Q_0f. \end{aligned}$$
(19)
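The "easily verified" identity used here is pure algebra: since \(d_{P,k} = P_k - P_0\) and \(d_{Q,l} = Q_l - Q_0\),

$$\begin{aligned} d_{P,k} d_{Q,l} + d_{P,k} Q_0 + d_{Q,l} P_0 + P_0 Q_0&= (P_k - P_0)(Q_l - Q_0) + (P_k - P_0) Q_0 \\&\quad +\, (Q_l - Q_0) P_0 + P_0 Q_0 = P_k Q_l, \end{aligned}$$

as the cross terms \(P_k Q_0\), \(P_0 Q_l\) and \(P_0 Q_0\) cancel in pairs.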

We will bound the inner product of T with the sum over \(k\ge 0\) and \(l\ge 0\) of each of the four terms in (19) separately. First, we have

$$\begin{aligned} |\langle \Delta _{P,k} \Delta _{Q,l} P_0 Q_0f, T \rangle |&= |\langle \Delta _{P,k} \Delta _{Q,l}f, P_0 ^*Q_0^* T \rangle | \le 2^{-k\alpha }2^{-l\beta }||P_0 ^*Q_0^* T ||_1 \end{aligned}$$

and summing over k and l gives the upper bound \(||P_0 ^*Q_0^* T ||_1\).

Next, observe that

$$\begin{aligned} |\langle \Delta _{P,k} \Delta _{Q,l}d_{Q,l} P_0f , T \rangle |&= |\langle \Delta _{P,k} \Delta _{Q,l}f ,d_{Q,l}^* P_0^* T \rangle | \le 2^{-k\alpha }2^{-l\beta }||d_{Q,l}^* P_0^* T ||_1 \end{aligned}$$

and summing over k and l gives the upper bound \(\sum _{l=0}^\infty 2^{-l\beta } ||d_{Q,l}^* P_0^* T ||_1 = W_\mathcal {Y}^{(2)}(T)\). Similarly, the inner product of T with \(\Delta _{P,k} \Delta _{Q,l}d_{P,k} Q_0f\) can be bounded above by \(W_\mathcal {X}^{(2)}(T)\).

Finally, we have the upper bound

$$\begin{aligned} |\langle \Delta _{P,k} \Delta _{Q,l} d_{P,k} d_{Q,l} f , T \rangle |&= | \langle \Delta _{P,k} \Delta _{Q,l} f ,d_{P,k}^* d_{Q,l}^* T \rangle | \le 2^{-k\alpha }2^{-l\beta } ||d_{P,k}^* d_{Q,l}^* T ||_1 \end{aligned}$$

and summing over k and l gives the upper bound \(\sum _{k,l} 2^{-k\alpha } 2^{-l\beta }||d_{P,k}^* d_{Q,l}^* T ||_1 = N^{(2)}(T) \). Putting the four bounds together and applying Eq. (19) yields

$$\begin{aligned} \bigg | \bigg \langle \sum _{k,l} \Delta _{P,k} \Delta _{Q,l} P_{k}Q_{l} f, T \bigg \rangle \bigg | \lesssim ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)} \end{aligned}$$

and the same estimate applied to each of the four terms on the right side of Eq. (18) gives

$$\begin{aligned} \bigg |\bigg \langle \sum _{k \ge 0, l \ge 0} \Delta _{P,k} \Delta _{Q,l} f ,T \bigg \rangle \bigg | \lesssim ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}. \end{aligned}$$
(20)

Combining the estimates (12), (16), (17) and (20) with (11), we complete the proof that \(||T ||_{\Lambda _{\alpha ,\beta }^*} \lesssim ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}\).

To prove the reverse inequality, as in the proof of Proposition 13 for a single semigroup, we define a function f such that \(||f ||_{\Lambda _{\alpha ,\beta }} \simeq 1 \) and whose inner product with T achieves the norm \(||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}\). It is easy to check that the function f defined by

$$\begin{aligned} f&= \sum _{k,l\ge 0} 2^{-k\alpha } 2^{-l\beta } d_{P,k} d_{Q,l} {{\mathrm{sgn}}}(d_{P,k}^* d_{Q,l}^* T) + \sum _{k\ge 0} 2^{-k\alpha } d_{P,k} Q_0{{\mathrm{sgn}}}(d_{P,k}^* Q_0^* T) \\&\quad +\, \sum _{l\ge 0} 2^{-l\beta } d_{Q,l} P_0{{\mathrm{sgn}}}(d_{Q,l}^* P_0^* T) + P_0 Q_0 {{\mathrm{sgn}}}(P_0^* Q_0^* T) \end{aligned}$$

satisfies the necessary conditions; in fact, each of the four terms defining f has mixed Hölder–Lipschitz norm bounded independently of T, and \(\langle f, T\rangle = ||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)}\). We have therefore shown:

Theorem 7

The norms \(||T ||_{\Lambda _{\alpha ,\beta }^*}\), \(||T ||_{\Lambda _{\alpha ,\beta }^*}^{(1)} \), and \(||T ||_{\Lambda _{\alpha ,\beta }^*}^{(2)} \) are equivalent on \(\Lambda _{\alpha ,\beta }^*\).