1 Introduction

If the Fourier transforms of two Borel probability measures on \(\mathbb {R}\) are equal, then the measures themselves are also equal. The celebrated Berry–Esseen smoothing inequality is a quantitative form of this fundamental fact of classical Fourier analysis. Given two Borel probability measures \(\nu _1\) and \(\nu _2\) on \(\mathbb {R}\), let \(F_j (x) = \nu _j ((-\infty , x])\), \(j=1,2\), and define

$$\begin{aligned} \delta _{\mathrm {unif}} (\nu _1, \nu _2) = \sup _{x \in \mathbb {R}} \left| F_1 (x)-F_2(x) \right| . \end{aligned}$$

Theorem A

(Berry–Esseen smoothing inequality) Assume that \(|F_2(x)-F_2(y)|\le K |x-y|\) for all \(x,y \in \mathbb {R}\) with some constant \(K>0\). Then for any real number \(T>0\),

$$\begin{aligned} \delta _{\mathrm {unif}} (\nu _1, \nu _2) \ll \frac{K}{T} + \int _{-T}^T \frac{|\widehat{\nu _1}(t) -\widehat{\nu _2}(t)|}{|t|} \, \mathrm {d}t \end{aligned}$$

with a universal implied constant.

In the terminology of probability theory, \(F_j (x)\) is the distribution function of \(\nu _j\); the Fourier transform \(\widehat{\nu _j}(t)= \int _{\mathbb {R}}e^{itx}\, \mathrm {d}\nu _j (x)\) is the characteristic function of \(\nu _j\); finally, \(\delta _{\mathrm {unif}}\) is the uniform metric (or Kolmogorov metric) on the set of probability distributions. For somewhat sharper forms of Theorem A see Petrov [28, Chapter 5.1]. Throughout the paper \(A \ll B\) and \(A=O(B)\) mean that \(A \le CB\) with some implied constant \(C>0\).

Similar smoothing inequalities are known for several other probability metrics on \(\mathbb {R}\); see Bobkov [4] for a survey. Some, but not all, require a smoothness assumption on one of the distributions; for instance, Theorem A is usually formulated under the assumption that \(F_2\) is differentiable and \(|F_2'(x)|\le K\). A common feature of such results is that the distance between \(\nu _1\) and \(\nu _2\) in some probability metric is bounded above by the sum of two terms depending on a free parameter \(T>0\): one term decays as T increases, and the other term depends on the Fourier transforms of \(\nu _1\) and \(\nu _2\) only on the interval \([-T,T]\).

Berry–Esseen type smoothing inequalities are known in other spaces as well. The first multidimensional version, an upper bound for the uniform metric on \(\mathbb {R}^d\) is due to von Bahr [36]. Niederreiter and Philipp proved an analogous result for two Borel probability measures \(\nu _1\) and \(\nu _2\) on the torus \(\mathbb {R}^d/\mathbb {Z}^d\). By identifying \(\mathbb {R}^d/\mathbb {Z}^d\) with the unit cube \([0,1)^d\), we can define the uniform metric as \(\delta _{\mathrm {unif}} (\nu _1, \nu _2) = \sup _{x \in [0,1]^d} |\nu _1 (B(x)) - \nu _2(B(x))|\), where \(B(x)=[0,x_1)\times \cdots \times [0,x_d)\). The Fourier transform is now \(\widehat{\nu _j}(m)=\int _{\mathbb {R}^d/\mathbb {Z}^d} e^{-2\pi i \langle m,x \rangle } \, \mathrm {d} \nu _j(x)\), \(m \in \mathbb {Z}^d\). Let \(\mu _{\mathbb {R}^d/\mathbb {Z}^d}\) be the normalized Haar measure, and let \(\Vert m \Vert _{\infty }=\max _{1 \le k \le d} |m_k|\).

Theorem B

(Niederreiter–Philipp [25]) Assume that \(\nu _2(B) \le K \mu _{\mathbb {R}^d/\mathbb {Z}^d}(B)\) for all axis parallel boxes \(B \subseteq [0,1)^d\) with some constant \(K>0\). Then for any real number \(M>0\),

$$\begin{aligned} \delta _{\mathrm {unif}} (\nu _1, \nu _2) \ll \frac{K}{M} + \sum _{\begin{array}{c} m \in \mathbb {Z}^d \\ 0<\Vert m \Vert _{\infty }<M \end{array}} \frac{|\widehat{\nu _1}(m)-\widehat{\nu _2}(m)|}{\prod _{k=1}^d \max \{|m_k|, 1\}} \end{aligned}$$

with an implied constant depending only on d.

The goal of this paper is to prove a Berry–Esseen type smoothing inequality in more general compact groups. In this more general setting only those probability metrics remain meaningful whose definition does not rely on concepts such as axis parallel boxes and distribution functions. One of the most important such metrics is the p-Wasserstein metric \(W_p\). Given a compact metric space \((X,\rho )\) and two Borel probability measures \(\nu _1\) and \(\nu _2\) on X, we define

$$\begin{aligned} W_p (\nu _1, \nu _2) = \inf _{\vartheta \in \mathrm {Coup}(\nu _1, \nu _2)} \int _{X \times X} \rho (x,y)^p \, \mathrm {d} \vartheta (x,y) \qquad (0<p \le 1), \end{aligned}$$

and

$$\begin{aligned} W_p (\nu _1, \nu _2) = \inf _{\vartheta \in \mathrm {Coup}(\nu _1, \nu _2)} \left( \int _{X \times X} \rho (x,y)^p \, \mathrm {d} \vartheta (x,y) \right) ^{1/p} \qquad (1<p<\infty ). \end{aligned}$$

Here \(\mathrm {Coup}(\nu _1, \nu _2)\) is the set of couplings of \(\nu _1\) and \(\nu _2\); that is, the set of Borel probability measures \(\vartheta \) on \(X \times X\) with marginals \(\vartheta (B \times X)=\nu _1 (B)\) and \(\vartheta (X \times B) = \nu _2 (B)\), \(B \subseteq X\) Borel. Recall that for any \(p>0\), \(W_p\) is a metric on the set of Borel probability measures on X, and it generates the topology of weak convergence. Observe also the general inequalities \(W_p(\nu _1, \nu _2) \le W_1 (\nu _1, \nu _2)^p\), \(0<p \le 1\) and \(W_1(\nu _1, \nu _2) \le W_p (\nu _1, \nu _2)\), \(1 \le p < \infty \). The Wasserstein metric originates in the theory of optimal transportation, see Villani [35].
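As a small numerical illustration of the definition of \(W_p\) and of the two general inequalities just stated, the following sketch computes \(W_p\) by brute force between two uniform measures on finitely many points of the circle \(\mathbb {R}/\mathbb {Z}\). It assumes that both measures are uniform on the same number of points, in which case an optimal coupling can be taken to be a permutation (Birkhoff's theorem); the function names and the sample points are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def circle_dist(x, y):
    """Geodesic distance on the circle R/Z, with points represented in [0, 1)."""
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)


def wasserstein(xs, ys, p):
    """W_p between the uniform measures on xs and on ys (equal sizes assumed).

    For uniform measures on n points each, some optimal coupling is a
    permutation, so an exhaustive search is exact for small n.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("both point sets must have the same size")
    cost = min(
        sum(circle_dist(xs[i], ys[j]) ** p for i, j in enumerate(perm)) / n
        for perm in permutations(range(n))
    )
    # No root for 0 < p <= 1, p-th root for p > 1, as in the two definitions.
    return cost if p <= 1 else cost ** (1.0 / p)


# Illustrative sample points (an assumption, not data from the paper).
xs = [0.05, 0.30, 0.55, 0.80]
ys = [0.10, 0.20, 0.60, 0.95]
w_half = wasserstein(xs, ys, 0.5)
w_one = wasserstein(xs, ys, 1.0)
w_two = wasserstein(xs, ys, 2.0)
print(w_half, w_one, w_two)
```

The exhaustive search is only feasible for a handful of points; it is meant as a check of the definitions, not as a practical algorithm.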

Respecting the philosophy of the Berry–Esseen inequality, we wish to find an upper bound to \(W_p (\nu _1, \nu _2)\) depending on the Fourier transform of \(\nu _1\) and \(\nu _2\) only up to a certain “level”. For this reason we chose to work with compact Lie groups, where the theory of highest weights provides a suitable framework to formalize the meaning of “level”. More precisely, our main result applies to any compact, connected Lie group G; classical examples include \(\mathbb {R}^d/\mathbb {Z}^d\), \(\mathrm {U}(d)\), \(\mathrm {SU}(d)\), \(\mathrm {SO}(d)\), \(\mathrm {Sp}(d)\) and \(\mathrm {Spin} (d)\). Let \(\widehat{G}\) denote the unitary dual, and let \(d_{\pi }\), \(\lambda _{\pi }\) and \(\kappa _{\pi }\) denote the dimension, the highest weight and the negative Laplace eigenvalue of the representation \(\pi \in \widehat{G}\), respectively. Further, let \(\Vert A \Vert _{\mathrm {HS}}=\sqrt{\mathrm {tr} (A^*A)}\) be the Hilbert–Schmidt norm of a matrix A. For a more formal setup we refer to Sect. 2.1. In this paper we prove the following Berry–Esseen type smoothing inequality for \(W_p\) on compact Lie groups.

Theorem 1

Let \(\nu _1\) and \(\nu _2\) be Borel probability measures on a compact, connected Lie group G. For any \(0<p \le 1\) and any real number \(M>0\),

$$\begin{aligned} W_p (\nu _1, \nu _2) \ll \frac{1}{M^p} + M^{1-p} \left( \sum _{\begin{array}{c} \pi \in \widehat{G} \\ 0< |\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }}{\kappa _{\pi }} \Vert \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}}^2 \right) ^{1/2} \end{aligned}$$
(1)

with an implied constant depending only on G.

The result holds without any smoothness assumption on \(\nu _1\) and \(\nu _2\). As the applications in Sect. 2.3 will show, the inequality is sharp up to a constant factor depending on G. In the simplest case of the torus \(G=\mathbb {R}^d/\mathbb {Z}^d\) equation (1) reads

$$\begin{aligned} W_p (\nu _1, \nu _2) \ll \frac{1}{M^p} + M^{1-p} \left( \sum _{\begin{array}{c} m \in \mathbb {Z}^d \\ 0<|m|<M \end{array}} \frac{|\widehat{\nu _1}(m) - \widehat{\nu _2} (m)|^2}{|m|^2} \right) ^{1/2} \end{aligned}$$
(2)

with an implied constant depending on d, where |m| is the Euclidean norm of m; this should be compared to Theorem B. For a detailed proof with explicit constants we refer to our earlier paper [8, Proposition 3] and to Bobkov and Ledoux [5, 6].
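To make (2) concrete, here is a minimal sketch that evaluates its right-hand side for \(d=1\), with \(\nu _1\) the uniform measure on a finite point set and \(\nu _2\) the Haar measure on \(\mathbb {R}/\mathbb {Z}\) (whose Fourier coefficients vanish at every \(m \ne 0\)). The implied constant is omitted, so the numbers are only meaningful up to a constant factor; the function names and the two point sets are hypothetical.

```python
import cmath
import math


def fourier_coeff(points, m):
    """Fourier coefficient at frequency m of the uniform measure on `points`."""
    return sum(cmath.exp(-2j * math.pi * m * a) for a in points) / len(points)


def torus_bound_rhs(points, M, p):
    """Right-hand side of (2) for d = 1 and nu_2 = Haar, implied constant omitted."""
    m_max = math.ceil(M) - 1  # largest integer m with m < M
    total = 0.0
    for m in range(1, m_max + 1):
        # The frequencies m and -m contribute equally (conjugate coefficients).
        total += 2.0 * abs(fourier_coeff(points, m)) ** 2 / m ** 2
    return M ** (-p) + M ** (1.0 - p) * math.sqrt(total)


N = 8
equispaced = [k / N for k in range(N)]    # coefficients vanish unless N divides m
clustered = [0.01 * k for k in range(N)]  # badly distributed points
rhs_equi = torus_bound_rhs(equispaced, M=N, p=1.0)
rhs_clus = torus_bound_rhs(clustered, M=N, p=1.0)
print(rhs_equi, rhs_clus)
```

For equispaced points the sum in (2) vanishes for \(M \le N\), so only the term \(1/M^p\) survives; the clustered set gives a visibly larger value.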

Our methods do not work when \(p>1\); the reason is that the proof is based on Kantorovich duality for \(W_p\). Recall that the Kantorovich duality theorem states that for any \(0<p\le 1\),

$$\begin{aligned} W_p (\nu _1, \nu _2 ) = \sup _{f \in \mathcal {F}_p} \left| \int _G f \, \mathrm {d}\nu _1 - \int _G f \, \mathrm {d}\nu _2 \right| , \end{aligned}$$

where, with \(\rho \) denoting the geodesic distance on G,

$$\begin{aligned} \mathcal {F}_p = \left\{ f:G \rightarrow \mathbb {R} \, : \, |f(x)-f(y)| \le \rho (x,y)^p \text { for all } x,y \in G \right\} \end{aligned}$$

is the set of p-Hölder functions on G, with Hölder constant 1. Theorem 1 thus estimates the difference of the integrals of f with respect to \(\nu _1\) and \(\nu _2\) uniformly in \(f \in \mathcal {F}_p\). From our methods it also follows that for any \(f \in \mathcal {F}_p\),

$$\begin{aligned} \left| \int _G f \, \mathrm {d}\nu _1 - \int _G f \, \mathrm {d}\nu _2 \right| \ll \frac{1}{M^p} + \sum _{\begin{array}{c} \pi \in \widehat{G}\\ 0<|\lambda _{\pi }|<M \end{array}} d_{\pi } \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}} \cdot \Vert \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}} \end{aligned}$$

with an implied constant depending only on G; see Proposition 5. Hence for a given \(f \in \mathcal {F}_p\) whose Fourier transform decays fast enough, the results of Theorem 1 can be improved. Fast Fourier decay follows e.g. from suitable smoothness assumptions on f, see Sugiura [32]. We mention that by prescribing a higher order modulus of continuity for f, the term \(1/M^p\) can also be improved. Note that Fourier decay rates play a role in classical Berry–Esseen type inequalities as well: the coefficient \(|t|^{-1}\) (resp. \(\prod _{k=1}^d \max \{ |m_k|,1 \}^{-1}\)) in Theorem A (resp. Theorem B) is explained by the fact that the Fourier transform of the indicator function of an interval (resp. axis parallel box) decays at this rate.
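The easy direction of the Kantorovich duality recalled above (every \(f \in \mathcal {F}_p\) gives a lower bound for \(W_p\)) can be checked numerically on the circle \(\mathbb {R}/\mathbb {Z}\). The sketch below is an illustration under stated assumptions: both measures are uniform on the same number of points, \(W_p\) is found by exhaustive search over permutations, and the test functions \(x \mapsto \rho (x,x_0)^p\) are one convenient choice of elements of \(\mathcal {F}_p\). All names are hypothetical.

```python
from itertools import permutations


def circle_dist(x, y):
    """Geodesic distance on R/Z."""
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)


def w_p(xs, ys, p):
    """Brute-force W_p (0 < p <= 1) between the uniform measures on xs and ys."""
    n = len(xs)
    return min(
        sum(circle_dist(xs[i], ys[j]) ** p for i, j in enumerate(perm)) / n
        for perm in permutations(range(n))
    )


def integral_gap(f, xs, ys):
    """|int f d(nu_1) - int f d(nu_2)| for the two uniform measures."""
    return abs(sum(map(f, xs)) / len(xs) - sum(map(f, ys)) / len(ys))


p = 0.5
xs = [0.05, 0.30, 0.55, 0.80]
ys = [0.10, 0.20, 0.60, 0.95]
# x -> dist(x, x0)^p lies in F_p, because t -> t^p is subadditive for 0 < p <= 1.
test_functions = [
    (lambda x, x0=x0: circle_dist(x, x0) ** p) for x0 in (0.0, 0.25, 0.5, 0.75)
]
gaps = [integral_gap(f, xs, ys) for f in test_functions]
distance = w_p(xs, ys, p)
print(max(gaps), distance)
```

Each gap is at most the computed distance; equality would require the supremum over all of \(\mathcal {F}_p\), which this finite family does not attempt to reach.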

The most straightforward application of Theorem 1 is estimating the rate of convergence of random walks in the \(W_p\) metric. Let \(\nu ^{*k}\) denote the k-fold convolution power of \(\nu \), and let \(\mu _G\) be the Haar measure on G. Recall that \(\nu ^{*k} \rightarrow \mu _G\) weakly as \(k \rightarrow \infty \) if and only if the support of \(\nu \) is contained neither in a proper closed subgroup, nor in a coset of a proper closed normal subgroup of G; see Stromberg [31]. Using a nonuniform spectral gap result of Varjú, we prove the following application of Theorem 1.

Corollary 2

Let \(\nu \) be a Borel probability measure on a compact, connected, semisimple Lie group G. If \(\nu ^{*k} \rightarrow \mu _G\) weakly as \(k \rightarrow \infty \), then

$$\begin{aligned} W_1 (\nu ^{*k}, \mu _G ) \ll e^{-ck^{1/3}}, \end{aligned}$$

where the constant \(c>0\) and the implied constant depend only on G and \(\nu \).

The condition of semisimplicity cannot be removed. The rate of convergence \(W_p \ll \exp (-pck^{1/3})\), \(0<p \le 1\) immediately follows. The main motivation came from our recent paper [8] on quantitative ergodic theorems for random walks. Given independent, identically distributed G-valued random variables \(\zeta _1, \zeta _2, \dots \) with distribution \(\nu \), we showed that for any \(f \in \mathcal {F}_p\) the sum \(\sum _{k=1}^N f(\zeta _1 \zeta _2 \cdots \zeta _k)\) satisfies the central limit theorem and the law of the iterated logarithm, provided that \(\sum _{k=1}^{\infty } W_p (\nu ^{*k},\mu _G) < \infty \). Corollary 2 thus provides a large class of examples of random walks with fast enough convergence in \(W_p\), and consequently to which our quantitative ergodic theorems apply. We do not know whether \(W_p (\nu ^{*k}, \mu _G) \ll e^{-ck^{1/3}}\) remains true for \(p>1\).

Another possible application is in uniform distribution theory, where the goal is finding finite sets \(\{ a_1, a_2, \dots , a_N \} \subset G\) which make the integration error \(|N^{-1} \sum _{k=1}^N f(a_k)-\int _G f \, \mathrm {d}\mu _G|\) small for a suitable class of test functions. Applying Theorem 1 to \(\nu _1=N^{-1} \sum _{k=1}^N \delta _{a_k}\) (where \(\delta _a\) is the Dirac measure concentrated at \(a \in G\)) and \(\nu _2=\mu _G\), we can quantitatively measure how well distributed a finite set is with respect to test functions \(f \in \mathcal {F}_p\). Note that in this case we have

$$\begin{aligned} \Vert \widehat{\nu _1} (\pi ) - \widehat{\nu _2} (\pi ) \Vert _{\mathrm {HS}}^2 = \frac{1}{N^2} \sum _{k, \ell =1}^N \chi _{\pi } (a_k^{-1}a_{\ell }) , \end{aligned}$$
(3)

where \(\chi _{\pi } (x)=\mathrm {tr} \, \pi (x)\) is the character of the representation \(\pi \in \widehat{G}\). Theorem 1 thus becomes an abstract version of the Erdős–Turán inequality, estimating the distance of a finite set from uniformity in terms of character sums. As an illustration, consider the classical construction of Lubotzky, Phillips and Sarnak [23, 24] of a finite point set in \(\mathrm {SO}(3)\) with optimal spectral gap. We will prove that this point set is also optimally close to the Haar measure in the \(W_p\) metric up to a constant factor; see Sect. 2.3.2 for a more precise formulation.
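Identity (3) can be verified numerically in a small noncommutative case. The sketch below takes \(G=\mathrm {SU}(2)\) and the defining two-dimensional representation \(\pi (x)=x\), draws a few random group elements, and compares both sides of (3). It relies on the standard fact that the Haar measure has zero Fourier transform at every nontrivial irreducible representation; the quaternion parametrization and all function names are assumptions made for this illustration.

```python
import math
import random


def su2(q):
    """Matrix in SU(2) attached to a unit quaternion q = (a, b, c, d)."""
    a, b, c, d = q
    return [[complex(a, b), complex(c, d)], [complex(-c, d), complex(a, -b)]]


def adjoint(A):
    """Conjugate transpose of a 2x2 matrix."""
    return [[A[j][i].conjugate() for j in range(2)] for i in range(2)]


def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def random_unit_quaternion(rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(t * t for t in v))
    return tuple(t / norm for t in v)


rng = random.Random(0)
N = 6
points = [su2(random_unit_quaternion(rng)) for _ in range(N)]

# Left-hand side of (3): squared Hilbert-Schmidt norm of the Fourier transform
# of the empirical measure, i.e. of (1/N) * sum_k pi(a_k)^*.
nu_hat = [[sum(adjoint(x)[i][j] for x in points) / N for j in range(2)] for i in range(2)]
lhs = sum(abs(nu_hat[i][j]) ** 2 for i in range(2) for j in range(2))

# Right-hand side of (3): chi(a_k^{-1} a_l) = tr(a_k^* a_l), since a_k is unitary.
rhs = 0.0
for x in points:
    for y in points:
        prod = matmul(adjoint(x), y)
        rhs += (prod[0][0] + prod[1][1]).real
rhs /= N ** 2
print(lhs, rhs)
```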

Corollary 3

The finite point set \(\{ a_1, a_2, \dots , a_N \} \subset \mathrm {SO}(3)\) with optimal spectral gap constructed in [23, 24] satisfies, for all \(0<p \le 1\),

$$\begin{aligned} W_p \left( \frac{1}{N} \sum _{k=1}^N \delta _{a_k} , \mu _{\mathrm {SO}(3)} \right) \ll N^{-p/3} \end{aligned}$$

with a universal implied constant.

Estimating \(W_p\) is more difficult in the \(p>1\) case, when Kantorovich duality is not available. Graham [21] gave an analogue of (2) for all \(p \ge 1\) on \(\mathbb {R}/\mathbb {Z}\) and on the interval [0, 1]. His arguments do not generalize to higher dimensions; see, however, [30] for connections between \(W_p\), \(p \ge 1\) and Fourier analysis on \(\mathbb {R}^d/\mathbb {Z}^d\). Fourier methods were used to estimate \(W_p\), \(p \ge 1\) in the optimal matching problem in [1, 5, 6, 14].

Brown and Steinerberger [13, 29] considered the more abstract setting of compact Riemannian manifolds, and estimated the distance in \(W_2\) from \(N^{-1}\sum _{k=1}^N \delta _{a_k}\) to the Riemannian volume in terms of character sums (3), and also in terms of the Green function of the Laplace–Beltrami operator. A similar Erdős–Turán inequality on compact Riemannian manifolds with respect to sufficiently nice test sets was proved by Colzani, Gigante and Travaglini [17]. Numerical results for certain finite point sets on the orthogonal group \(\mathrm {O}(d)\) and on Grassmannian manifolds were obtained by Pausinger [27].

For probability distributions on \(\mathbb {R}\), Esseen [19, 4, Corollary 8.3] used a smoothing inequality for \(W_1(\nu _1, \nu _2)\) to estimate the rate of convergence in \(W_1\) in the central limit theorem; our Theorem 1 and Corollary 2 are far reaching analogues of these classical results in the compact setting. In the \(p>1\) case the only known smoothing inequality on \(\mathbb {R}\) applies to \(W_2 (\nu _1, \nu _2)\) with \(\nu _2\) a Gaussian distribution [4, Theorem 11.1].

The discussion above can be generalized from \(\mathcal {F}_p\) to the class of functions with an arbitrarily prescribed modulus of continuity, and we will actually work out the details in this generality. In particular, our results apply to any given \(f \in C(G)\). The formal setup and notation are given in Sect. 2.1; we state the general form of Theorem 1 with explicit constants in Sect. 2.2; applications to random walks and to uniform distribution theory are discussed in more detail, and the proofs of Corollaries 2 and 3 are given, in Sect. 2.3. The proof of the main result, Theorem 4, is given in Sect. 3.

2 Results

2.1 Notation

For the general theory of compact Lie groups we refer to Bourbaki [9, Chapter 9]. Throughout the paper G denotes a compact, connected Lie group with identity element \(e \in G\) and Lie algebra \(\mathfrak {g}\). Let \(\exp : \mathfrak {g} \rightarrow G\) and \(n=\mathrm {dim} \, G\) denote the exponential map and the dimension of G as a real smooth manifold. Fix an Ad-invariant inner product \((\cdot , \cdot )\) on \(\mathfrak {g}\), and let \(|X|=\sqrt{(X,X)}\), \(X \in \mathfrak {g}\). This inner product defines a Riemannian metric on G; let \(\rho \) denote the corresponding geodesic metric on G. The Laplace–Beltrami operator on G is \(\Delta = \sum _{k=1}^n X_k X_k\) (as an element of the universal enveloping algebra of \(\mathfrak {g}\)), where \(X_1, \dots , X_n\) is an orthonormal basis of \(\mathfrak {g}\) with respect to \((\cdot , \cdot )\); this does not depend on the choice of the orthonormal basis.

Let r denote the rank of G, and fix a maximal torus T in G with Lie algebra \(\mathfrak {t}\). Let \(\mathfrak {t}^*=\mathrm {Hom}(\mathfrak {t},\mathbb {R})\) denote the dual vector space. The sets

$$\begin{aligned} \Gamma&= \left\{ X \in \mathfrak {t} \, : \, \exp (2 \pi X)=e \right\} , \\ \Gamma ^*&= \left\{ \lambda \in \mathfrak {t}^* \, : \, \lambda (X) \in \mathbb {Z} \text { for all } X \in \Gamma \right\} \end{aligned}$$

are dual lattices of full rank in \(\mathfrak {t}\) and \(\mathfrak {t}^*\), respectively. The inner product on \(\mathfrak {g}\) naturally defines an inner product on \(\mathfrak {t}^*\), which we also denote by \((\cdot , \cdot )\); we also write \(|\lambda |=\sqrt{(\lambda , \lambda )}\), \(\lambda \in \mathfrak {t}^*\). The inner product defines a normalized Lebesgue measure m on \(\mathfrak {t}\).

The weights will be considered elements of \(\Gamma ^*\); the character of T corresponding to \(\lambda \in \Gamma ^*\) is \(\exp (2 \pi X) \mapsto e^{2 \pi i \lambda (X)}\), \(X \in \mathfrak {t}\). Let R be the set of roots, and choose a set of positive roots \(R^+\); we have \(|R|=n-r\) and \(|R^+|=(n-r)/2\). Let \(\Gamma _{\mathrm {root}}^* \subseteq \Gamma ^*\) be the lattice spanned by the roots. Further, let

$$\begin{aligned} C^+=\left\{ \lambda \in \mathfrak {t}^* \, : \, (\lambda , \alpha ) \ge 0 \text { for all } \alpha \in R^+ \right\} \end{aligned}$$

be the dominant Weyl chamber; the set of dominant weights is thus \(\Gamma ^* \cap C^+\). The Weyl group of G with respect to T will be denoted by \(W(G,T)=N_G(T)/T\).

Let \(\widehat{G}\) be the unitary dual of G. For any \(\pi \in \widehat{G}\), let \(d_{\pi }\) and \(\lambda _{\pi }\) denote the dimension and the highest weight of \(\pi \). The map \(\pi \mapsto \lambda _{\pi }\) is a bijection from \(\widehat{G}\) to the set of dominant weights \(\Gamma ^* \cap C^+\). Let \(\kappa _{\pi } \ge 0\) denote the negative Laplace eigenvalue of \(\pi \); that is, \(\Delta \pi = -\kappa _{\pi } \pi \) where \(\Delta \) acts entrywise. Recall that

$$\begin{aligned} \kappa _{\pi } = |\lambda _{\pi }|^2 + 2 (\lambda _{\pi }, \rho ^+) \qquad \text {and} \qquad d_{\pi } = \frac{\prod _{\alpha \in R^+} (\lambda _{\pi }+ \rho ^+ , \alpha )}{\prod _{\alpha \in R^+}(\rho ^+, \alpha )}, \end{aligned}$$

where \(\rho ^+ = \sum _{\alpha \in R^+} \alpha /2\) is the half-sum of positive roots; in particular,

$$\begin{aligned} |\lambda _{\pi }|^2 \le \kappa _{\pi } \le |\lambda _{\pi }|^2 + O(|\lambda _{\pi }|) \qquad \text {and} \qquad d_{\pi } \ll |\lambda _{\pi }|^{(n-r)/2}. \end{aligned}$$
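The two formulas for \(\kappa _{\pi }\) and \(d_{\pi }\) are easy to evaluate by machine. The following sketch does so for the root system of \(\mathrm {SU}(3)\) (type \(A_2\)), realized in \(\mathbb {R}^2\) with roots scaled to length 1; this scaling is an assumption corresponding to one particular choice of the Ad-invariant inner product, and it affects \(\kappa _{\pi }\) but not \(d_{\pi }\). The helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return tuple(a + b for a, b in zip(u, v))


def half_sum(positive_roots):
    """rho^+ = half the sum of the positive roots."""
    dim = len(positive_roots[0])
    return tuple(sum(r[i] for r in positive_roots) / 2.0 for i in range(dim))


def casimir(lam, positive_roots):
    """kappa = |lambda|^2 + 2 (lambda, rho^+)."""
    rho = half_sum(positive_roots)
    return dot(lam, lam) + 2.0 * dot(lam, rho)


def weyl_dimension(lam, positive_roots):
    """Weyl dimension formula, rounded to the nearest integer."""
    rho = half_sum(positive_roots)
    num = math.prod(dot(add(lam, rho), a) for a in positive_roots)
    den = math.prod(dot(rho, a) for a in positive_roots)
    return round(num / den)


# Type A2 with roots of length 1 (assumed normalization).
s3 = math.sqrt(3.0)
alpha1, alpha2 = (1.0, 0.0), (-0.5, s3 / 2.0)
positive_roots = [alpha1, alpha2, add(alpha1, alpha2)]
omega1, omega2 = (0.5, 0.5 / s3), (0.0, 1.0 / s3)  # fundamental weights


def weight(a, b):
    """Dominant weight a*omega1 + b*omega2 with integers a, b >= 0."""
    return tuple(a * x + b * y for x, y in zip(omega1, omega2))


for a, b in [(0, 0), (1, 0), (1, 1), (2, 0)]:
    lam = weight(a, b)
    print((a, b), weyl_dimension(lam, positive_roots), casimir(lam, positive_roots))
```

The dimensions agree with the classical closed form \((a+1)(b+1)(a+b+2)/2\) for \(\mathrm {SU}(3)\), and \(|\lambda _{\pi }|^2 \le \kappa _{\pi }\) holds because \((\lambda _{\pi }, \rho ^+) \ge 0\) for dominant weights.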

Let \(\mu _G\) (resp. \(\mu _T\)) denote the normalized Haar measure on G (resp. T). The Fourier transform of a function \(f:G \rightarrow \mathbb {C}\) is \(\widehat{f}(\pi ) = \int _G f(x) \pi (x)^* \, \mathrm {d}\mu _G(x)\), \(\pi \in \widehat{G}\); that of a Borel probability measure \(\nu \) on G is \(\widehat{\nu }(\pi )=\int _G \pi (x)^* \, \mathrm {d}\nu (x)\), \(\pi \in \widehat{G}\). Here \(\pi (x)^*\) denotes the adjoint of \(\pi (x)\), and the integrals are taken entrywise.

Let \(g:[0,\infty ) \rightarrow [0,\infty )\) be a nondecreasing and subadditive function such that \(\lim _{t \rightarrow 0^+}g(t)=0\), and define

$$\begin{aligned} W_g (\nu _1, \nu _2) = \inf _{\vartheta \in \mathrm {Coup}(\nu _1, \nu _2)} \int _{G \times G} g(\rho (x,y)) \, \mathrm {d} \vartheta (x,y), \end{aligned}$$

where \(\mathrm {Coup}(\nu _1, \nu _2)\) is the set of couplings, as before. Letting

$$\begin{aligned} \mathcal {F}_g = \left\{ f:G \rightarrow \mathbb {R} \, : \, |f(x)-f(y)| \le g(\rho (x,y)) \text { for all } x,y \in G \right\} , \end{aligned}$$

the Kantorovich duality theorem states

$$\begin{aligned} W_g (\nu _1, \nu _2) = \sup _{f \in \mathcal {F}_g} \left| \int _G f \, \mathrm {d} \nu _1 - \int _G f \, \mathrm {d} \nu _2 \right| . \end{aligned}$$

Note that \(W_g\) is a metric on the set of Borel probability measures on G and it generates the topology of weak convergence, unless g is constant zero. In the special case \(g(t)=t^p\), \(0<p \le 1\) we write \(W_p\) (resp. \(\mathcal {F}_p\)) instead of \(W_g\) (resp. \(\mathcal {F}_g\)). We mention that given \(f \in C(G)\), the function

$$\begin{aligned} g_f(t)=\sup \{ |f(x)-f(y)| \, : \, x,y \in G, \, \rho (x,y) \le t \} \end{aligned}$$

is nondecreasing and subadditive, and \(\lim _{t \rightarrow 0^+} g_f(t)=0\); in fact, \(g_f\) is the smallest g for which \(f \in \mathcal {F}_g\).

Remark

Kantorovich duality is usually stated for \(g(t)=t\), i.e. for Lipschitz functions. To see the general case, note that \(g(\rho (x,y))\) is another metric on G generating the topology of G, unless g is constant zero; the subadditivity of g is needed for the triangle inequality. Kantorovich duality for Lipschitz functions in the \(g(\rho (x,y))\) metric thus implies Kantorovich duality for \(W_g\) as claimed. Further, since the usual 1-Wasserstein metric with respect to \(g(\rho (x,y))\) generates the topology of weak convergence, so does \(W_g\).

2.2 Berry–Esseen Inequality on Compact Lie Groups

Let \(\eta : \mathfrak {t} \rightarrow \mathbb {R}\) be a \(W(G,T)\)-invariant smooth function such that \(\eta (X) = 0\) for all \(|X| \ge 1\), and \(0 \le \eta (X) \le \eta (0)=1\) for all \(X \in \mathfrak {t}\). Since \(W(G,T)\) acts by orthogonal transformations on \(\mathfrak {t}\), \(W(G,T)\)-invariance can be ensured e.g. if \(\eta (X)\) depends only on |X|. For instance, we can use the “bump function”

$$\begin{aligned} \eta (X) = \left\{ \begin{array}{ll} \exp \left( - \frac{|X|^2}{1-|X|^2} \right) &{} \text {if } |X| <1, \\ 0 &{} \text {if } |X| \ge 1 . \end{array} \right. \end{aligned}$$

Let \(F: \mathfrak {t} \rightarrow \mathbb {C}\), \(F(X) = \int _{\mathfrak {t}} \eta (Y) e^{2 \pi i (X,Y)} \, \mathrm {d}m(Y)\); note that F(X) is a Schwartz function, thus |F(X)| decays at an arbitrary polynomial rate as \(|X| \rightarrow \infty \). The main result of the paper is the following Berry–Esseen type inequality.

Theorem 4

Let \(\nu _1\) and \(\nu _2\) be Borel probability measures on a compact, connected Lie group G, and let \(g: [0,\infty ) \rightarrow [0,\infty )\) be nondecreasing and subadditive such that \(\lim _{t \rightarrow 0^+}g(t)=0\). Let

$$\begin{aligned} \psi (t)= \frac{2}{|W(G,T)|} \int _{\mathfrak {t}} g \left( \frac{2 \pi |X|}{\lfloor t/(|2\rho ^+ |+a) \rfloor } \right) a^r |F(aX)| \bigg ( \prod _{\alpha \in R^+} |e^{2 \pi i \alpha (X)}-1|^2 \bigg ) \, \mathrm {d} m(X) \end{aligned}$$

where \(a=\min _{\lambda \in \Gamma _{\mathrm {root}}^* \setminus \{ 0 \}} |\lambda |/2=\min _{\alpha \in R} |\alpha |/2\) and \(\rho ^+ = \sum _{\alpha \in R^+} \alpha /2\), and let

$$\begin{aligned} \phi (t) = \inf _{0<c<2(\sqrt{n^2+n}-n)} \sqrt{\frac{n}{1-c-c^2/(4n)}} \cdot \frac{g \left( \frac{c}{nt} \right) }{\frac{c}{nt}} . \end{aligned}$$

Then for any real number \(M \ge |2\rho ^+|+a\),

$$\begin{aligned} W_g (\nu _1, \nu _2) \le \psi (M) + \phi (M) \left( \sum _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }}{\kappa _{\pi }} \Vert \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}}^2 \right) ^{1/2} . \end{aligned}$$

Observe that \(\psi (M) \ll g(1/M)\) and \(\phi (M) \ll Mg(1/M)\) with implied constants depending only on G. Theorem 1 thus follows from Theorem 4 with \(g(t)=t^p\), \(0<p\le 1\).
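The claim \(\phi (M) \ll Mg(1/M)\) can be made explicit for \(g(t)=t^p\): substituting into the definition gives \(\phi (M) = M^{1-p} \inf _c \sqrt{n/(1-c-c^2/(4n))} \, (n/c)^{1-p}\), so \(\phi (M)/M^{1-p}\) does not depend on M. The sketch below checks this numerically by approximating the infimum on a grid of values of c; the grid search and the function name are assumptions of this illustration, and the grid value is an upper approximation of the true infimum.

```python
import math


def phi(t, n, g, grid=2000):
    """Grid approximation (from above) of phi(t) from Theorem 4."""
    c_max = 2.0 * (math.sqrt(n * n + n) - n)
    best = math.inf
    for k in range(1, grid + 1):
        c = c_max * k / (grid + 1)  # stays strictly inside (0, c_max)
        s = c / (n * t)
        value = math.sqrt(n / (1.0 - c - c * c / (4.0 * n))) * g(s) / s
        best = min(best, value)
    return best


n, p = 3, 0.5
g = lambda s: s ** p
ratios = [phi(M, n, g) / M ** (1.0 - p) for M in (10.0, 100.0, 1000.0)]
print(ratios)

# For g(t) = t the quotient g(s)/s is 1 and the infimum is approached as c -> 0,
# so phi is close to sqrt(n).
phi_lipschitz = phi(10.0, n, lambda s: s)
print(phi_lipschitz, math.sqrt(n))
```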

2.3 Applications

2.3.1 Spectral Gaps and Random Walks

Given Borel probability measures \(\nu _1\) and \(\nu _2\) on G, let \(\nu _1 * \nu _2\) denote their convolution, and let \(\nu _1^*(B)=\nu _1(B^{-1})\), \(B \subseteq G\) Borel. If \(\zeta _1\) and \(\zeta _2\) are independent G-valued random variables with distribution \(\nu _1\) and \(\nu _2\), then \(\nu _1 * \nu _2\) (resp. \(\nu _1^*\)) is the distribution of \(\zeta _1 \zeta _2\) (resp. \(\zeta _1^{-1}\)).

Let \(L_0^2(G,\mu _G)\) be the orthogonal complement of the space of constant functions in \(L^2(G,\mu _G)\); that is, the set of all \(f \in L^2(G,\mu _G)\) with \(\int _G f \, \mathrm {d}\mu _G =0\). Given a Borel probability measure \(\nu \) on G, let \(T_{\nu }: L_0^2(G,\mu _G) \rightarrow L_0^2(G,\mu _G)\),

$$\begin{aligned} (T_{\nu }f)(x)= \int _G f(xy) \, \mathrm {d}\nu (y) \end{aligned}$$

be its associated Markov operator. Observe that \(T_{\nu _1*\nu _2}=T_{\nu _1} T_{\nu _2}\) and \(T_{\nu ^*}=T_{\nu }^*\); in particular, \(T_{\nu }\) is self-adjoint (resp. normal) if and only if \(\nu =\nu ^*\) (resp. \(\nu * \nu ^* = \nu ^* * \nu \)).

We start with a trivial estimate for \(W_p (\nu , \mu _G)\) in terms of \(T_{\nu }\). It is not difficult to see that

$$\begin{aligned} q_{\nu } := \Vert T_{\nu } \Vert _{\mathrm {op}} = \sup _{\begin{array}{c} \pi \in \widehat{G} \\ \pi \ne \pi _0 \end{array}} \Vert \widehat{\nu } (\pi ) \Vert _{\mathrm {op}}, \end{aligned}$$

where \(\pi _0 \in \widehat{G}\) denotes the trivial representation, and \(\Vert \cdot \Vert _{\mathrm {op}}\) is the operator norm. Let \(f \in \mathcal {F}_p\) with \(\int _G f \, \mathrm {d}\mu _G=0\) be arbitrary, and note that \(T_{\nu } f \in \mathcal {F}_p\). Since \(|T_{\nu } f| \ge |(T_{\nu } f)(e)|/2\) on the ball centered at e with radius \(r=(|(T_{\nu } f)(e)|/2)^{1/p}\), we have

$$\begin{aligned} \begin{aligned} \Vert T_{\nu } f \Vert _2^2&\ge \left( \frac{(T_{\nu }f)(e)}{2} \right) ^2 \mu _G \left( B(e,r) \right) \gg |(T_{\nu }f)(e)|^{2+n/p}, \\ \Vert T_{\nu }f \Vert _2^2&\le \Vert T_{\nu } \Vert _{\mathrm {op}}^2 \cdot \Vert f \Vert _2^2 \ll \Vert T_{\nu } \Vert _{\mathrm {op}}^2 . \end{aligned} \end{aligned}$$

Therefore \(|\int _G f \, \mathrm {d} \nu | = |(T_{\nu }f)(e)| \ll q_{\nu }^{2p/(n+2p)}\), and consequently

$$\begin{aligned} W_p (\nu , \mu _G ) \ll q_{\nu }^{2p/(n+2p)} . \end{aligned}$$
(4)

We now deduce a sharp improvement on the trivial estimate (4). Recall that \(\Vert A \Vert _{\mathrm {HS}} \le \sqrt{d_{\pi }} \Vert A \Vert _{\mathrm {op}}\) for any \(d_{\pi } \times d_{\pi }\) matrix A. With \(\nu _1=\nu \) and \(\nu _2=\mu _G\),

$$\begin{aligned} \begin{aligned} \sum _{\begin{array}{c} \pi \in \widehat{G}\\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }}{\kappa _{\pi }} \Vert \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}}^2&\le \sum _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }^2}{\kappa _{\pi }} \Vert \widehat{\nu }(\pi ) \Vert _{\mathrm {op}}^2 \\&\ll \sum _{\begin{array}{c} \pi \in \widehat{G}\\ 0<|\lambda _{\pi }|<M \end{array}} |\lambda _{\pi }|^{n-r-2} q_{\nu }^2 \\&\ll \left\{ \begin{array}{ll} q_{\nu }^2 &{} \text {if } n=1, \\ (\log (M+2)) q_{\nu }^2 &{} \text {if } n=2, \\ M^{n-2} q_{\nu }^2 &{} \text {if } n \ge 3 . \end{array} \right. \end{aligned} \end{aligned}$$
(5)

Optimizing the value of the free parameter \(M>0\), in dimension \(n \ge 3\) Theorem 1 thus gives that for any \(0<p \le 1\),

$$\begin{aligned} W_p (\nu , \mu _G ) \ll q_{\nu }^{2p/n} \end{aligned}$$
(6)

with an implied constant depending only on G. Using Theorem 4 instead, we get \(W_g(\nu , \mu _G) \ll g(q_{\nu }^{2/n})\). Similar estimates can be deduced in dimensions \(n=1\) and 2. Clearly \(q_{\nu } \le 1\), and \(q_{\nu _1*\nu _2} \le q_{\nu _1} q_{\nu _2}\); in particular, (6) gives the upper bound for the rate of convergence of random walks \(W_p (\nu _1 * \cdots * \nu _N, \mu _G) \ll \prod _{k=1}^N q_{\nu _k}^{2p/n}\).
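The optimization leading to (6) can be spelled out numerically. For \(n \ge 3\), Theorem 1 combined with (5) gives a bound of the shape \(M^{-p} + M^{1-p} (M^{n-2} q_{\nu }^2)^{1/2}\); the choice \(M=q_{\nu }^{-2/n}\) makes the two terms equal. The sketch below omits all implied constants and uses hypothetical function names and sample parameters, so it only illustrates the balancing step.

```python
def bound(M, q, n, p):
    """M^{-p} + M^{1-p} * (M^{n-2} q^2)^{1/2}, implied constants omitted."""
    return M ** (-p) + M ** (1.0 - p) * (M ** (n - 2) * q * q) ** 0.5


def balanced_M(q, n):
    """The choice M = q^{-2/n}, for which both terms equal q^{2p/n}."""
    return q ** (-2.0 / n)


q, n, p = 0.3, 3, 1.0  # illustrative values with 0 < q < 1 and n >= 3
M_star = balanced_M(q, n)
value = bound(M_star, q, n, p)
print(M_star, value, 2.0 * q ** (2.0 * p / n))

# No other M can beat the balanced choice by more than a factor of 2:
# the first term is decreasing in M and the second is increasing (n/2 > p).
grid_values = [bound(0.1 * k, q, n, p) for k in range(1, 500)]
print(min(grid_values))
```

Since one term decreases and the other increases in M, the balanced value is within a factor 2 of the true minimum, which is all that matters for an estimate with an unspecified implied constant.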

We say that \(\nu \) has a spectral gap if the spectral radius of \(T_{\nu }\) is strictly less than 1; note that this is a direct generalization of Cramér’s condition in classical probability theory. Assuming \(T_{\nu }\) is normal, having a spectral gap is equivalent to \(q_{\nu }<1\); for general \(T_{\nu }\), it is equivalent to \(q_{\nu ^{*m}}<1\) for some integer \(m \ge 1\). Deciding whether a given \(\nu \) has a spectral gap is a highly nontrivial problem. Generalizing results of Bourgain and Gamburd [10, 11] on \(\mathrm {SU}(2)\) and \(\mathrm {SU}(d)\), Benoist and Saxcé considered a Borel probability measure \(\nu \) on a compact, connected, simple Lie group G. They proved [3, Theorem 3.1] that if the support of \(\nu \) is not contained in any proper closed subgroup, and each element of the support (as a matrix) has algebraic entries, then \(\nu \) has a spectral gap. The same authors also conjectured that the assumption of algebraic matrix entries can be dropped.

Using (6) (or even just (4)), \(W_p (\nu ^{*k},\mu _G) \rightarrow 0\) exponentially fast as \(k\rightarrow \infty \) whenever \(\nu \) has a spectral gap. Corollary 2 is thus basically an unconditional (i.e. not assuming the conjecture of Benoist and Saxcé), weaker form of this fact. In contrast to the (semi)simple case, \(W_p(\nu ^{*k},\mu _G) \rightarrow 0\) polynomially fast for certain finitely supported measures \(\nu \) on the torus \(\mathbb {R}^d / \mathbb {Z}^d\) [8].

So far we have only discussed the relationship between \(W_p(\nu , \mu _G)\) and the spectral gap of \(\nu \). Theorem 1, however, provides a quantitative relationship between \(W_p (\nu , \mu _G)\) and the spectrum of the self-adjoint operator \(T_{\nu }^* T_{\nu }\) itself. Indeed, by the Peter–Weyl theorem \(L_0^2(G,\mu _G)=\oplus _{\pi \in \widehat{G}, \pi \ne \pi _0} V_{\pi }\), where \(V_{\pi }\) is the vector space spanned by the entries of \(\pi (x)\). Since \((T_{\nu }\pi )(x)=\pi (x) \widehat{\nu }(\pi )^*\), the action of \(T_{\nu }\) on \(V_{\pi }\) is determined by \(\widehat{\nu }(\pi )\); in particular, \(d_{\pi } \Vert \widehat{\nu }(\pi ) \Vert _{\mathrm {HS}}^2\) is simply the sum of all spectrum points of \(T_{\nu }^* T_{\nu }\) on \(V_{\pi }\). The proof of Corollary 2 is based on this quantitative relationship.

Proof of Corollary 2

Varjú [34, Theorem 6] proved that for any Borel probability measure \(\vartheta \) on G and any \(M>0\),

$$\begin{aligned} 1- \max _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }| \le M \end{array}} \Vert \widehat{\vartheta }(\pi ) \Vert _{\mathrm {op}} \ge c_0 \left( 1- \max _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }| \le M_0 \end{array}} \Vert \widehat{\vartheta }(\pi ) \Vert _{\mathrm {op}} \right) \frac{1}{\log ^A (M+2)} , \end{aligned}$$
(7)

where the constants \(c_0, M_0>0\) and \(1 \le A \le 2\) depend only on the group G; in fact, the exact value of A was also given. Since \(\nu ^{*k} \rightarrow \mu _G\) weakly, we have \(\widehat{\nu }(\pi )^k = \widehat{\nu ^{*k}} (\pi ) \rightarrow 0\) for all \(\pi \ne \pi _0\), and hence the spectral radius of \(\widehat{\nu }(\pi )\) is less than 1. It follows that for any \(\pi \in \widehat{G}\) with \(0<|\lambda _{\pi }| \le M_0\), we have \(\Vert \widehat{\nu } (\pi )^m \Vert _{\mathrm {op}}<1\) with some positive integer \(m=m(G,\nu )\); in particular,

$$\begin{aligned} b=b(G,\nu ) := c_0 \left( 1- \max _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }| \le M_0 \end{array}} \Vert \widehat{\nu }(\pi )^m \Vert _{\mathrm {op}} \right) >0. \end{aligned}$$

Applying (7) to \(\vartheta =\nu ^{*m}\), we get that for any positive integer k and any \(M>0\),

$$\begin{aligned} \begin{aligned} \max _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }| \le M \end{array}} \Vert \widehat{\nu ^{*k}} (\pi ) \Vert _{\mathrm {op}} \le \left( \max _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }| \le M \end{array}} \Vert \widehat{\nu } (\pi )^m \Vert _{\mathrm {op}} \right) ^{\lfloor k/m \rfloor }&\le \left( 1-\frac{b}{\log ^A (M+2)} \right) ^{(k-m)/m} \\&\le e^{-b(k-m)/(m\log ^A (M+2))} . \end{aligned} \end{aligned}$$

Hence

$$\begin{aligned} \begin{aligned} \sum _{\begin{array}{c} \pi \in \widehat{G}\\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }}{\kappa _{\pi }} \Vert \widehat{\nu ^{*k}} (\pi ) \Vert _{\mathrm {HS}}^2&\le \sum _{\begin{array}{c} \pi \in \widehat{G}\\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }^2}{\kappa _{\pi }} \Vert \widehat{\nu ^{*k}} (\pi ) \Vert _{\mathrm {op}}^2 \\&\ll \sum _{\begin{array}{c} \pi \in \widehat{G}\\ 0<|\lambda _{\pi }|<M \end{array}} |\lambda _{\pi }|^{n-r-2} e^{-b(k-m)/(m\log ^A (M+2))} \\&\ll M^n e^{-b(k-m)/(m\log ^A (M+2))}. \end{aligned} \end{aligned}$$

The first factor \(M^n\) can in fact be replaced by 1, \(\log (M+2)\), \(M^{n-2}\) in the cases \(n=1\), \(n=2\), \(n \ge 3\), respectively, but this will not play an important role. Theorem 1 thus gives that for any \(M>0\),

$$\begin{aligned} W_1 (\nu ^{*k}, \mu _G ) \ll \frac{1}{M} + M^{n/2} e^{-b(k-m)/(2m \log ^A (M+2))} . \end{aligned}$$

Choosing \(\log ^{A+1} M= b(k-m)/(2mn)\), we deduce

$$\begin{aligned} W_1 (\nu ^{*k}, \mu _G ) \ll k^{\frac{1}{A+1}} \exp \left( - \frac{n}{2} \left( \frac{b(k-m)}{2mn} \right) ^{\frac{1}{A+1}} \right) . \end{aligned}$$

In particular, \(W_1 (\nu ^{*k}, \mu _G ) \ll \exp \left( -ck^{\frac{1}{A+1}} \right) \) with any \(0<c<\frac{n}{2} \cdot (\frac{b}{2mn})^{\frac{1}{A+1}}\).\(\square \)

Remark

Using Theorem 4 instead of Theorem 1, we deduce the general form of the conclusion of Corollary 2 as \(W_g (\nu ^{*k}, \mu _G) \ll g(e^{-ck^{1/3}})\).

2.3.2 Uniform Distribution Theory

Next, we consider applications in uniform distribution theory. It is not difficult to see, e.g. directly from the definition of \(W_g\), that for any given nonempty finite set \(A \subset G\) and any g as in Sect. 2.1,

$$\begin{aligned} \inf _{\mathrm {supp} \, \nu \subseteq A} W_g (\nu , \mu _G) = \int _G g(\mathrm {dist} \, (A,x)) \, \mathrm {d} \mu _G (x) , \end{aligned}$$
(8)

where the infimum is over all probability measures \(\nu \) whose support is contained in A, and \(\mathrm {dist} \, (A, \cdot )\) denotes distance from the set A. Indeed, the infimum is attained when for any \(a \in A\), \(\nu (\{ a \})\) is the Haar measure of the Voronoi cell

$$\begin{aligned} \{ x \in G \, : \, \mathrm {dist} \, (A,x) = \rho (a,x) \} . \end{aligned}$$

In this case the optimal transport plan from \(\nu \) to \(\mu _G\) is to simply spread \(\nu (\{ a \})\) evenly over the given Voronoi cell. Recall that open balls \(B(x,r)\) in G of radius \(0<r<\mathrm {diam}\, G\) satisfy \(r^n \ll \mu _G (B(x,r)) \ll r^n\). A standard ball packing argument (using e.g. the “3r covering lemma” of Vitali) shows that the optimal distance from a probability measure supported on at most N points to the Haar measure is

$$\begin{aligned} g(N^{-1/n}) \ll \inf _{|\mathrm {supp} \, \nu | \le N} W_g (\nu , \mu _G ) \ll g(N^{-1/n}) \end{aligned}$$
(9)

with implied constants depending only on G. In particular, (8) and (9) hold for \(W_p\), \(0<p\le 1\). We mention that \(W_p\), \(1 \le p < \infty \) also satisfies the same estimates as \(W_1\). For a detailed proof in the case \(1\le p<\infty \) see Kloeckner [22]; the proof for \(0<p<1\) and for more general g is identical. We refer to the same paper for far-reaching generalizations (e.g. to more general measures on Riemannian manifolds).

Lubotzky, Phillips and Sarnak [23, 24] considered the problem of finding well distributed finite sets in \(\mathrm {SO}(3)\), and consequently, on the sphere \(S^2\). For any N such that \(2N-1\) is a prime congruent to 1 modulo 4, they constructed a symmetric set \(\{ a_1, a_2, \dots , a_{2N} \} \subset \mathrm {SO}(3)\) for which the probability measure \(\nu _N= (2N)^{-1} \sum _{k=1}^{2N} \delta _{a_k}\) satisfies \(q_{\nu _N} = \sqrt{2N-1}/N\); this spectral gap is in fact optimal among all symmetric sets of size 2N. Since \(\mathrm {SO}(3)\) has dimension \(n=3\), (6) yields that for any \(0<p \le 1\),

$$\begin{aligned} W_p (\nu _N, \mu _{\mathrm {SO}(3)}) \ll N^{-p/3} \end{aligned}$$

with a universal implied constant; by (9), this is optimal. This proves Corollary 3. Note that the trivial estimate (4) only yields \(W_p (\nu _N, \mu _{\mathrm {SO}(3)}) \ll N^{-p/(3+2p)}\). More generally, we have \(W_g (\nu _N, \mu _{\mathrm {SO}(3)}) \ll g(N^{-1/3})\).

Clozel [16] proved a similar optimal (up to a constant factor) spectral gap estimate in terms of the size of a finite set in \(\mathrm {U}(d)\). Less precise estimates on more general compact homogeneous spaces were obtained by Oh [26].

2.3.3 Empirical Measures

Finally, we address the sharpness of Theorems 1 and 4; we do so by deducing a simple estimate on the mean rate of convergence of empirical measures. Let \(\nu \) be an arbitrary Borel probability measure on G, and let \(\zeta _1, \zeta _2, \dots , \zeta _N\) be independent, identically distributed G-valued random variables with distribution \(\nu \). The probability measure \(\overline{\nu }_N:=N^{-1} \sum _{k=1}^N \delta _{\zeta _k}\) is called the corresponding empirical measure. Theorem 1 gives an estimate for \(W_p (\overline{\nu }_N, \nu )\) — a random variable! — as follows. Let \(E_{\pi } = \mathbb {E} \pi (\zeta _1 ) = \widehat{\nu }(\pi )^*\). With \(\nu _1 = \overline{\nu }_N\) and \(\nu _2 = \nu \) we then have

$$\begin{aligned} \widehat{\nu _1} (\pi ) - \widehat{\nu _2} (\pi ) = \frac{1}{N} \sum _{k=1}^N \left( \pi (\zeta _k ) -E_{\pi } \right) ^* , \end{aligned}$$

and by independence, the “variance” satisfies

$$\begin{aligned} \mathbb {E} \, \Vert \widehat{\nu _1} (\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}}^2 = \frac{1}{N^2} \sum _{k=1}^N \mathbb E\, \mathrm {tr} \left( \pi (\zeta _k)^* \pi (\zeta _k) - E_{\pi }^* E_{\pi } \right) \le \frac{d_{\pi }}{N} . \end{aligned}$$

In the last step we used that \(\pi (x)\) is unitary. Following the steps in (5), in dimension \(n \ge 3\) Theorem 1 gives that for any \(0<p \le 1\) and any \(M>0\),

$$\begin{aligned} \mathbb {E} W_p (\overline{\nu }_N, \nu ) \le \sqrt{\mathbb EW_p (\overline{\nu }_N, \nu )^2} \ll \frac{1}{M^p} + M^{1-p} \sqrt{\frac{M^{n-2}}{N}} . \end{aligned}$$

Optimizing the value of \(M>0\), we finally obtain that in dimension \(n \ge 3\), for any \(0<p \le 1\),

$$\begin{aligned} \mathbb {E} W_p (\overline{\nu }_N, \nu ) \ll N^{-p/n} \end{aligned}$$
(10)

with an implied constant depending only on G; more generally, \(\mathbb {E} W_g (\overline{\nu }_N, \nu ) \ll g(N^{-1/n})\). These are sharp by (9); in particular, Theorems 1 and 4 are also sharp up to a constant factor depending on G. Note that the only compact, connected Lie groups in dimension \(n=1\) and \(n=2\) are \(\mathbb {R}/\mathbb {Z}\) and \(\mathbb {R}^2/\mathbb {Z}^2\), and the sharpness of Theorem 1 on these groups follows from results in [8].
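As a sanity check on two steps of this argument, the following Python sketch (not part of the original proof; the function names are ours, and all implied constants are set to 1) does two things. First, it evaluates the right-hand side of the bound preceding (10) at \(M=N^{1/n}\), where both terms equal \(N^{-p/n}\). Second, it verifies the variance identity exactly in the simplest abelian case \(G=\mathbb {R}/\mathbb {Z}\), where \(d_{\pi }=1\) and the characters are \(x \mapsto e^{2 \pi i m x}\), by enumerating all N-tuples drawn from a finitely supported \(\nu \) (an assumption made only to keep the enumeration finite).

```python
import cmath
import itertools
import math


def smoothing_bound(M, N, n, p):
    """M^{-p} + M^{1-p} * sqrt(M^{n-2} / N), with implied constants set to 1."""
    return M ** (-p) + M ** (1 - p) * math.sqrt(M ** (n - 2) / N)


def exact_variance_on_circle(points, weights, m, N):
    """Return (E|hat(nu_N)(m) - hat(nu)(m)|^2, (1 - |hat(nu)(m)|^2) / N)
    for N i.i.d. samples from the finitely supported measure on R/Z that
    puts mass weights[i] at points[i]; the expectation is computed by
    enumerating all N-tuples of indices."""
    chars = [cmath.exp(2j * math.pi * m * x) for x in points]
    mean = sum(w * c for w, c in zip(weights, chars))
    total = 0.0
    for idx in itertools.product(range(len(points)), repeat=N):
        prob = math.prod(weights[i] for i in idx)
        deviation = sum(chars[i] - mean for i in idx) / N
        total += prob * abs(deviation) ** 2
    return total, (1 - abs(mean) ** 2) / N


if __name__ == "__main__":
    N, n, p = 10 ** 6, 3, 0.5
    M_star = N ** (1 / n)  # the balancing choice of M
    print(smoothing_bound(M_star, N, n, p), 2 * N ** (-p / n))
    print(exact_variance_on_circle([0.0, 0.25, 0.6], [0.5, 0.3, 0.2], 2, 3))
```

The choice \(M=N^{1/n}\) is not the exact minimizer of the right-hand side, but it is within a factor 2 of it, which is all that the estimate (10) requires.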

The rate of convergence of empirical measures in \(W_p\), \(p \ge 1\) on more general metric spaces was studied by Bach and Weed [2], and by Boissard and Le Gouic [7]. Instead of Fourier methods, they used a sequence of partitions of the metric space, each refining its predecessor, to construct transport plans. It follows e.g. from [7, Corollary 1.2] that our estimate (10) can be improved to \(\mathbb EW_p (\overline{\nu }_N, \nu ) \ll N^{-1/n}\) for all \(1 \le p < n/2\). We refer to [2] for improvements for measures \(\nu \) supported on sets of lower dimension than the ambient space.

3 Proof of Theorem 4

Proofs of Berry–Esseen type inequalities are usually based on smoothing with an approximate identity whose Fourier transform has bounded support. For instance, in the proof of Theorem A this Fourier transform is the “rooftop function” \(\max \{ 1-|t|/T, 0 \}\), supported on \([-T,T]\). The proof of Theorem B uses the discrete version \(\prod _{k=1}^d \max \{1-|m_k|/(M+1), 0 \}\), supported on \([-M,M]^d\); in the setting of the torus this is known as the Fejér kernel.

Our proof of Theorem 4 follows the same idea. We will choose a kernel \(K_M: G \rightarrow \mathbb {C}\) whose Fourier transform satisfies \(\widehat{K_M}(\pi )=0\) whenever \(|\lambda _{\pi }| \ge M\). Clearly,

$$\begin{aligned} \left| \int _G f \, \mathrm {d}\nu _1 - \int _G f \, \mathrm {d}\nu _2 \right| \le 2 \Vert f-f*K_M \Vert _{\infty } + \left| \int _G f*K_M \, \mathrm {d}\nu _1 - \int _G f*K_M \, \mathrm {d}\nu _2 \right| , \end{aligned}$$
(11)

where \(f*K_M\) denotes convolution. Our goal is to find an upper estimate of the right hand side which is uniform in \(f \in \mathcal {F}_g\); by Kantorovich duality, the same upper estimate will hold for \(W_g (\nu _1, \nu _2)\). A possible choice for \(K_M\) could be a Fejér-like kernel

$$\begin{aligned} \frac{1}{|B_M|} \bigg | \sum _{\begin{array}{c} \pi \in \widehat{G} \\ \lambda _{\pi } \in B_M \end{array}} \chi _{\pi } \bigg |^2 \end{aligned}$$

with some set \(B_M \subseteq \{ \lambda \in \Gamma ^* \cap C^+ \, : \, |\lambda |<M/2 \}\). For convergence properties of such Fejér kernels on compact Lie groups we refer to [12] and [33]. We mention that using these kernels it is possible to deduce upper bounds to \(W_p\), \(0<p<1\) sharp up to a constant factor depending on p, but in the case \(p=1\) we necessarily lose a logarithmic factor in M; the reason is that the Fejér kernel does not approximate Lipschitz functions optimally in the supremum norm. Fixing this shortcoming is easy on the torus; we simply need to use the normalized square of the Fejér kernel instead. By Jackson’s theorem we then have the optimal rate of approximation \(\Vert f-f*K_M \Vert _{\infty } \ll g(1/M)\), and a sharp Berry–Esseen smoothing inequality on the torus follows [8]. Similar modifications of the Fejér kernel are known to yield Jackson type theorems on certain classical groups, see Gong [20]; however, this approach seems not to have been worked out in full generality. An elegant proof of Jackson’s theorem on an arbitrary compact, connected Lie group was nevertheless found by Cartwright and Kucharski [15], and in this paper we will use their kernel.

For the sake of completeness, we include the construction of the kernel in Sect. 3.1; we carry out the smoothing procedure in Sect. 3.2; finally, we prove a decay estimate for the Fourier transform of f and finish the proof of Theorem 4 in Sect. 3.3.

3.1 Construction of the Kernel

Recall that the Weyl integral formula [9, p. 338] states that for any central function \(\varphi \in L^1(G,\mu _G)\) we have

$$\begin{aligned} \int _G \varphi \, \mathrm {d}\mu _G = \frac{1}{|W(G,T)|}\int _T \varphi \cdot \delta _G \, \mathrm {d}\mu _T , \end{aligned}$$

where the function \(\delta _G : T \rightarrow \mathbb {R}\) is defined as

$$\begin{aligned} \delta _G (\exp (2 \pi X))=\prod _{\alpha \in R} (e^{2 \pi i \alpha (X)}-1) = \prod _{\alpha \in R^+} \left| e^{2 \pi i \alpha (X)} -1 \right| ^2 , \qquad X \in \mathfrak {t}. \end{aligned}$$

In particular, \(\delta _G \ge 0\) and \(\int _T \delta _G \, \mathrm {d}\mu _T = |W(G,T)|\). Expanding the product in its definition, \(\delta _G\) is thus a \(W(G,T)\)-invariant trigonometric polynomial on T of the form

$$\begin{aligned} \delta _G (\exp (2 \pi X)) = \sum _{\lambda \in \Gamma _{\mathrm {root}}^*} c_{\lambda } e^{2 \pi i \lambda (X)} , \qquad X \in \mathfrak {t}. \end{aligned}$$
(12)

The constant term is \(c_0=|W(G,T)|\), and all coefficients satisfy \(|c_{\lambda }| \le |W(G,T)|\). Observe also that for any \(\varphi \in L^1 (T, \mu _T)\),

$$\begin{aligned} \int _T \varphi (t) \, \mathrm {d}\mu _T (t) = \int _{\mathfrak {t}/\Gamma } \varphi (\exp (2 \pi X)) \, \mathrm {d}\mu _{\mathfrak {t}/\Gamma }(X), \end{aligned}$$
(13)

where \(\mu _{\mathfrak {t}/\Gamma }\) is the normalized Haar measure on \(\mathfrak {t}/\Gamma \). Note that \(\mu _{\mathfrak {t}/\Gamma } = m/\mathrm {Vol} (\mathfrak {t}/\Gamma )\), where \(\mathrm {Vol} (\mathfrak {t}/\Gamma )\) is the Lebesgue measure of the fundamental domain of \(\Gamma \).

Following [15] with minor modifications, we now construct a kernel \(K_M\). Let \(\eta (X)\) and F(X) be as in Sect. 2.2, and recall the notation \(a=\min _{\lambda \in \Gamma _{\mathrm {root}}^* \backslash \{ 0 \}} |\lambda |/2\). Let \(M \ge |2 \rho ^+| +a\) be arbitrary, and set \(M_0=\lfloor M/(|2\rho ^+|+a) \rfloor \). Define \(P: T \rightarrow \mathbb {C}\) as

$$\begin{aligned} P (\exp (2 \pi X)) = \mathrm {Vol} (\mathfrak {t}/\Gamma ) (aM_0)^r \sum _{Y \in \Gamma } F(aM_0(X+Y)) , \qquad X \in \mathfrak {t} . \end{aligned}$$

Note that P is well-defined, smooth and \(W(G,T)\)-invariant. By (13), its Fourier coefficient with respect to \(\lambda \in \Gamma ^*\) (i.e. the character \(\exp (2 \pi X) \mapsto e^{2 \pi i \lambda (X)}\) on T) is

$$\begin{aligned} \widehat{P}(\lambda )&= \int _{\mathfrak {t}/\Gamma } P (\exp (2 \pi X)) e^{-2 \pi i \lambda (X)} \, \mathrm {d} \mu _{\mathfrak {t}/\Gamma } (X) \\&= \int _{\mathfrak {t}} (aM_0)^r F(aM_0 X) e^{-2 \pi i \lambda (X)} \, \mathrm {d} m (X) \\&= \eta (\lambda ^* / (aM_0)) , \end{aligned}$$

where \(\lambda ^*\) is the unique element in \(\mathfrak {t}\) with \(\lambda (X) = (\lambda ^*, X)\). By the construction of \(\eta \), \(\widehat{P}(\lambda )=0\) whenever \(|\lambda | \ge aM_0\); consequently, P is a \(W(G,T)\)-invariant trigonometric polynomial on T with degree \(<aM_0\). Observe also that

$$\begin{aligned} \frac{\delta _G (t^{M_0})}{\delta _G (t)} = \prod _{\alpha \in R} \frac{e^{2 \pi i M_0 \alpha (X)}-1}{e^{2 \pi i \alpha (X)}-1} \qquad (t=\exp (2 \pi X)) \end{aligned}$$

is a \(W(G,T)\)-invariant trigonometric polynomial on T of degree \(\le |2 \rho ^+| (M_0-1)\). Hence \(P(t) \delta _G (t^{M_0})/\delta _G (t)\) is a \(W(G,T)\)-invariant trigonometric polynomial on T of degree \(<aM_0+|2 \rho ^+| (M_0-1)<M\). It follows (see e.g. [15, Lemma 1]) that there exists a central trigonometric polynomial \(K_M\) on G of degree \(<M\), that is, a function \(K_M: G \rightarrow \mathbb {C}\) of the form \(K_M=\sum _{\pi \in \widehat{G}, |\lambda _{\pi }|<M} a_{\pi } \chi _{\pi }\), such that

$$\begin{aligned} K_M (t) = P(t) \frac{\delta _G (t^{M_0})}{\delta _G (t)} \qquad \text {for all } t \in T. \end{aligned}$$
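To illustrate why the quotient \(\delta _G (t^{M_0})/\delta _G (t)\) is a trigonometric polynomial, consider the rank one case \(R=\{ \alpha , -\alpha \}\) (an assumption made only for this illustration). Writing \(\theta = 2 \pi \alpha (X)\), the quotient equals \(|\sum _{j=0}^{M_0-1} e^{i j \theta }|^2 = \sum _{|j|<M_0} (M_0-|j|) e^{i j \theta }\), whose frequencies are \(j \alpha \) with \(|j| \le M_0-1\), in agreement with the degree bound \(|2 \rho ^+| (M_0-1)\). The following Python sketch, with function names of our own choosing, compares the two expressions numerically.

```python
import cmath


def weyl_ratio_rank_one(theta, M0):
    """delta_G(t^{M0}) / delta_G(t) for the root system R = {alpha, -alpha},
    expressed in the angle theta = 2*pi*alpha(X); needs e^{i*theta} != 1."""
    num = (cmath.exp(1j * M0 * theta) - 1) * (cmath.exp(-1j * M0 * theta) - 1)
    den = (cmath.exp(1j * theta) - 1) * (cmath.exp(-1j * theta) - 1)
    return num / den


def fejer_type_polynomial(theta, M0):
    """sum over |j| < M0 of (M0 - |j|) e^{i j theta}."""
    return sum((M0 - abs(j)) * cmath.exp(1j * j * theta)
               for j in range(-(M0 - 1), M0))


if __name__ == "__main__":
    for theta in (0.3, 1.7, 2.9):
        for M0 in (1, 2, 5):
            gap = abs(weyl_ratio_rank_one(theta, M0)
                      - fejer_type_polynomial(theta, M0))
            print(theta, M0, gap)
```

In this rank one case the quotient is \(M_0\) times the classical Fejér kernel; for general root systems it is a product of such factors over the positive roots.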

3.2 The Smoothing Procedure

First, we estimate the coefficients in \(K_M=\sum _{\pi \in \widehat{G}, |\lambda _{\pi }|<M} a_{\pi } \chi _{\pi }\). Let \(W(\pi )\) denote the set of weights of a representation \(\pi \in \widehat{G}\). Since there exists a unitary matrix U such that \(U \pi (\exp (2 \pi X)) U^* = \mathrm {diag} \, (e^{2 \pi i \mu (X)} \, : \, \mu \in W(\pi ))\), \(X \in \mathfrak {t}\), we have

$$\begin{aligned} \chi _{\pi } (\exp (2 \pi X)) = \sum _{\mu \in W(\pi )} e^{2 \pi i \mu (X)}, \qquad X \in \mathfrak {t}. \end{aligned}$$

Therefore

$$\begin{aligned} \begin{aligned} a_{\pi }&= \int _G K_M \overline{\chi _{\pi }} \, \mathrm {d} \mu _G \\&= \frac{1}{|W(G,T)|} \int _{\mathfrak {t}/\Gamma } P(\exp (2 \pi X)) \delta _G (\exp (2 \pi M_0 X)) \overline{\chi _{\pi } (\exp (2 \pi X))} \, \mathrm {d} \mu _{\mathfrak {t}/\Gamma } (X) \\&= \int _{\mathfrak {t}} (aM_0)^r F(aM_0 X) \bigg ( \sum _{\lambda \in \Gamma _{\mathrm {root}}^*} \frac{c_{\lambda }}{|W(G,T)|} e^{2 \pi i M_0 \lambda (X)} \bigg ) \bigg ( \sum _{\mu \in W(\pi )} e^{-2 \pi i \mu (X)} \bigg ) \, \mathrm {d}m(X) \\&= \sum _{\mu \in W(\pi )} \sum _{\lambda \in \Gamma _{\mathrm {root}}^*} \frac{c_{\lambda }}{|W(G,T)|} \eta (-\lambda ^* /a+\mu ^* /(aM_0)) , \end{aligned} \end{aligned}$$

where \(\lambda ^* \in \mathfrak {t}\) is such that \(\lambda (X) = (\lambda ^*, X)\). By the construction of \(\eta \), for any given \(\mu \in W(\pi )\) there is at most one \(\lambda \in \Gamma _{\mathrm {root}}^*\) for which \(\eta (-\lambda ^* /a+\mu ^* /(aM_0)) \ne 0\). Recalling that \(|c_{\lambda }| \le |W(G,T)|\) and \(0 \le \eta \le 1\), it follows that the coefficients of \(K_M\) satisfy \(|a_{\pi }| \le d_{\pi }\). For the trivial character the only nonvanishing term is \(\lambda =0\); in particular, \(\int _G K_M \, \mathrm {d} \mu _G =(c_0/|W(G,T)|) \eta (0)=1\).

Remark

In fact, the only nonvanishing term is \(\lambda =0\) whenever \(|\lambda _{\pi }| \le aM_0\). Assuming in addition that \(\eta (X)=1\) for all \(|X| \le 1/2\), we thus have \(a_{\pi }=d_{\pi }\) whenever \(|\lambda _{\pi }| \le aM_0/2\). In other words, \(f*K_M=f\) for any central trigonometric polynomial of the form \(f=\sum _{\pi \in \widehat{G}, |\lambda _{\pi }|\le aM_0/2} b_{\pi } \chi _{\pi }\). Thus \(K_M\) is an analogue of the de la Vallée Poussin kernel, although its construction is not based on the Fejér kernel.

Proposition 5

For any \(f \in \mathcal {F}_g\) and any real \(M \ge |2\rho ^+|+a\),

$$\begin{aligned} \left| \int _G f \, \mathrm {d} \nu _1 - \int _G f \, \mathrm {d} \nu _2 \right| \le \psi (M) + \sum _{\begin{array}{c} \pi \in \widehat{G} \\ |\lambda _{\pi }|<M \end{array}} d_{\pi } \Vert \widehat{f} (\pi ) \Vert _{\mathrm {HS}} \cdot \Vert \widehat{\nu _1} (\pi ) - \widehat{\nu _2} (\pi ) \Vert _{\mathrm {HS}} , \end{aligned}$$

where \(\psi \) is as in Theorem 4.

Proof

Recall (11). We first show that \(2 \Vert f-f*K_M \Vert _{\infty } \le \psi (M)\); with somewhat weaker constant factors this was proved in [15]. Since the geodesic metric \(\rho \) is translation invariant both from the left and from the right, from the Weyl integral formula, (13) and (12) we deduce

$$\begin{aligned} \Vert f&-f*K_M \Vert _{\infty } \\&= \sup _{x \in G} \left| \int _G \left( f(x)-f(xy^{-1}) \right) K_M(y) \, \mathrm {d}\mu _G (y) \right| \\&\le \int _G g(\rho (e,y)) |K_M(y)| \, \mathrm {d}\mu _G (y) \\&= \frac{1}{|W(G,T)|} \int _{\mathfrak {t}/\Gamma } g(\rho (e,\exp (2 \pi X))) |P(\exp (2 \pi X))| \delta _G( \exp (2 \pi M_0 X)) \, \mathrm {d} \mu _{\mathfrak {t}/\Gamma } (X) \\&\le \frac{1}{|W(G,T)|} \int _{\mathfrak {t}} g(2 \pi |X|) (aM_0)^r |F(aM_0X)| \bigg ( \prod _{\alpha \in R^+} |e^{2 \pi i M_0 \alpha (X)}-1|^2 \bigg ) \, \mathrm {d} m(X) \\&= \psi (M)/2. \end{aligned}$$

In the penultimate step we used that \(g(\rho (e, \exp (2 \pi X))) \le g(2 \pi |X|)\), as the exponential map is a geodesic of unit speed. Finally, using \(|a_{\pi }| \le d_{\pi }\) we get

$$\begin{aligned} \begin{aligned} \left| \int _G f*K_M \, \mathrm {d}\nu _1 - \int _G f*K_M \, \mathrm {d}\nu _2 \right|&\le \sum _{\begin{array}{c} \pi \in \widehat{G}\\ |\lambda _{\pi }|< M \end{array}} d_{\pi } \left| \int _G f*\chi _{\pi } \, \mathrm {d}\nu _1 - \int _G f*\chi _{\pi } \, \mathrm {d}\nu _2 \right| \\&= \sum _{\begin{array}{c} \pi \in \widehat{G}\\ |\lambda _{\pi }|<M \end{array}} d_{\pi } \left| \mathrm {tr} \left( \widehat{f}(\pi )^* \left( \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \right) \right) \right| \\&\le \sum _{\begin{array}{c} \pi \in \widehat{G}\\ |\lambda _{\pi }|<M \end{array}} d_{\pi } \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}} \cdot \left\| \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \right\| _{\mathrm {HS}} , \end{aligned} \end{aligned}$$

which proves the claim.\(\square \)

3.3 Decay of the Fourier Transform

We prove a decay estimate for the Fourier transform in somewhat greater generality than what we need, and then finish the proof of Theorem 4.

Proposition 6

Assume that \(f \in L^1 (G,\mu _G)\) satisfies

$$\begin{aligned} \left( \int _G |f(xh)-f(x)|^2 \, \mathrm {d}\mu _G (x) \right) ^{1/2} \le g(\rho (h,e)) \end{aligned}$$

for all \(h \in G\) with some nondecreasing function \(g: [0,\infty ) \rightarrow [0, \infty )\). Then for any real number \(M>0\),

$$\begin{aligned} \sum _{\begin{array}{c} \pi \in \widehat{G} \\ |\lambda _{\pi }| \le M \end{array}} d_{\pi } \kappa _{\pi } \Vert \widehat{f} (\pi ) \Vert _{\mathrm {HS}}^2 \le \inf _{0<c<2(\sqrt{n^2+n}-n)} \frac{n}{1-c-c^2/(4n)} \cdot \frac{g \left( \frac{c}{nM} \right) ^2}{\left( \frac{c}{nM} \right) ^2} . \end{aligned}$$

If \(g(t)=t^p\) with some \(0<p\le 1\), we can choose e.g. \(c=(\sqrt{17}-3)/2\) (this is optimal in the worst case \(p \rightarrow 0\), \(n=1\)) yielding

$$\begin{aligned} \sum _{\begin{array}{c} \pi \in \widehat{G}, |\lambda _{\pi }| \le M \end{array}} d_{\pi } \kappa _{\pi } \Vert \widehat{f} (\pi ) \Vert _{\mathrm {HS}}^2 \le 9 n^{3-2p} M^{2-2p}. \end{aligned}$$
(14)

In the special case \(p=1\) the factor 9 can be removed, since the optimal choice is then to let \(c \rightarrow 0\) (and \(M \rightarrow \infty \)). An estimate similar to (14) has recently been proved by Daher, Delgado and Ruzhansky [18], with an unspecified implied constant in the place of \(9n^{3-2p}\). Our main improvement is that this implied constant does not depend on f, a crucial feature in the study of the p-Wasserstein metric.
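For \(g(t)=t^p\) the right-hand side of Proposition 6 equals \(n^{3-2p} M^{2-2p} \cdot c^{2p-2}/(1-c-c^2/(4n))\), so the constant 9 in (14) amounts to the claim that the last factor is at most 9 at \(c=(\sqrt{17}-3)/2\). The following Python sketch (the function names are ours) checks this on a grid of values of n and p, confirms that the admissible range of c ends at the positive root of \(1-c-c^2/(4n)=0\), and checks that the chosen c is a local minimizer in the worst case \(p=0\), \(n=1\).

```python
import math


def constant_factor(c, n, p):
    """c^{2p-2} / (1 - c - c^2/(4n)), the factor multiplying n^{3-2p} M^{2-2p}."""
    return c ** (2 * p - 2) / (1 - c - c * c / (4 * n))


def admissible_upper_limit(n):
    """Positive root of 1 - c - c^2/(4n) = 0, namely 2(sqrt(n^2+n) - n)."""
    return 2 * (math.sqrt(n * n + n) - n)


# The choice of c made in the text.
C_STAR = (math.sqrt(17) - 3) / 2


if __name__ == "__main__":
    worst = max(constant_factor(C_STAR, n, j / 1000)
                for n in range(1, 51) for j in range(0, 1001))
    print(C_STAR, admissible_upper_limit(1), worst)
```

Since \(0<c<1\), the factor \(c^{2p-2}\) is largest as \(p \rightarrow 0\), and \(1-c-c^2/(4n)\) is smallest at \(n=1\); this is why the grid maximum is attained at the corner \(p=0\), \(n=1\).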

Proof of Proposition 6

We follow ideas in [18]. For the sake of simplicity, we shall think about \(\pi \in \widehat{G}\) as a \(d_{\pi } \times d_{\pi }\) unitary matrix-valued function on G. For any matrix \(A \in \mathbb {C}^{d_{\pi } \times d_{\pi }}\) let \(\Vert A \Vert _{\mathrm {op}}=\sup \{ |Av| \, : \, v \in \mathbb {C}^{d_{\pi }}, |v|=1 \}\) and \(\Vert A \Vert _{\mathrm {HS}}=\sqrt{\mathrm {tr}\, (A^*A)}\) denote the operator norm and the Hilbert–Schmidt norm, respectively. The operator norm is submultiplicative; further, for all \(A,B \in \mathbb {C}^{d_{\pi } \times d_{\pi }}\) we have \(\Vert AB \Vert _{\mathrm {HS}} \le \Vert A \Vert _{\mathrm {op}} \cdot \Vert B \Vert _{\mathrm {HS}}\), and the Cauchy–Schwarz inequality \(|\mathrm {tr}\, (A^*B)| \le \Vert A \Vert _{\mathrm {HS}} \cdot \Vert B \Vert _{\mathrm {HS}}\).

One readily verifies the identity

$$\begin{aligned} \left( \pi (h)-I_{d_{\pi }} \right) \widehat{f}(\pi ) = \int _G \left( f(xh)-f(x) \right) \pi (x)^* \, \mathrm {d}\mu _G (x) , \end{aligned}$$

where \(I_{d_{\pi }}\) denotes the \(d_{\pi } \times d_{\pi }\) identity matrix. By the Parseval formula and the assumption on f, for any \(h \in G\) we have

$$\begin{aligned} \begin{aligned} \sum _{\pi \in \widehat{G}} d_{\pi } \mathrm {tr} \left( \left( \pi (h)-I_{d_{\pi }} \right) ^* \left( \pi (h)-I_{d_{\pi }} \right) \widehat{f}(\pi ) \widehat{f}(\pi )^* \right)&= \int _G |f(xh)-f(x)|^2 \, \mathrm {d} \mu _G (x) \\&\le g(\rho (h,e))^2 . \end{aligned} \end{aligned}$$

Since the exponential map is a geodesic, we have \(\rho (\exp (uX),e) \le |uX|\) for all \(X \in \mathfrak {g}\) and \(u \in \mathbb {R}\). For any \(h=\exp (uX)\) the previous estimate thus yields

$$\begin{aligned} \sum _{\pi \in \widehat{G}} d_{\pi } \mathrm {tr} \left( \left( \pi (h)-I_{d_{\pi }} \right) ^* \left( \pi (h)-I_{d_{\pi }} \right) \widehat{f}(\pi ) \widehat{f}(\pi )^* \right) \le g(|uX|)^2. \end{aligned}$$
(15)

Next, we wish to find a lower estimate. For any \(X \in \mathfrak {g}\) let

$$\begin{aligned} \mathrm {d} \pi (X) = \frac{\mathrm {d}}{\mathrm {d}u} \pi (\exp (uX)) \mid _{u=0} \in \mathbb {C}^{d_{\pi } \times d_{\pi }} \end{aligned}$$

denote the derived representation of \(\pi \).

Lemma 1

(Taylor expansion of degree 1) For any \(X \in \mathfrak {g}\) and any \(u \in \mathbb {R}\),

$$\begin{aligned} \left\| \pi (\exp (uX)) - I_{d_{\pi }} - u \cdot \mathrm {d}\pi (X) \right\| _{\mathrm {op}} \le \frac{u^2}{2} \Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}}^2 . \end{aligned}$$

Proof of Lemma 1

We simply apply the usual Taylor formula to the matrix-valued function \(F(u)=\pi (\exp (uX))\). Since \(\pi \) is a homomorphism, we have \(F'(u)=\pi (\exp (uX)) \mathrm {d} \pi (X)\). First, note that for any \(u \in \mathbb {R}\),

$$\begin{aligned} \begin{aligned} \left\| \pi (\exp (uX)) - I_{d_{\pi }} \right\| _{\mathrm {op}}&= \left\| \int _0^u \pi (\exp (yX)) \mathrm {d} \pi (X) \, \mathrm {d}y \right\| _{\mathrm {op}} \\&\le \int _0^{|u|} \Vert \pi (\exp (yX)) \Vert _{\mathrm {op}} \cdot \Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}} \, \mathrm {d} y \\&= |u| \cdot \Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}} . \end{aligned} \end{aligned}$$

We used the fact that \(\pi (\exp (yX))\) is a unitary matrix and thus has operator norm 1. Therefore

$$\begin{aligned} \begin{aligned} \left\| \pi (\exp (uX)) - I_{d_{\pi }} - u \cdot \mathrm {d} \pi (X) \right\| _{\mathrm {op}}&= \left\| \int _0^u \left( \pi (\exp (yX)) - I_{d_{\pi }} \right) \mathrm {d} \pi (X) \, \mathrm {d}y \right\| _{\mathrm {op}} \\&\le \int _0^{|u|} \left\| \pi (\exp (yX)) - I_{d_{\pi }} \right\| _{\mathrm {op}} \cdot \Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}} \, \mathrm {d} y \\&\le \int _0^{|u|} |y| \cdot \Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}}^2 \, \mathrm {d}y \\&= \frac{u^2}{2} \Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}}^2 . \end{aligned} \end{aligned}$$

\(\square \)
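Since \(\mathrm {d} \pi (X)\) is skew-Hermitian, it is unitarily diagonalizable with purely imaginary eigenvalues \(i \theta _j\), and the operator norm is unitarily invariant. Lemma 1 is therefore equivalent to the scalar inequality \(|e^{it}-1-it| \le t^2/2\) applied with \(t=u \theta _j\). The following Python sketch, with hypothetical function names and an arbitrary list of eigenvalue parameters \(\theta _j\), checks both inequalities from the proof in this diagonalized form.

```python
import cmath


def first_order_remainder(t):
    """|e^{it} - 1|, to be compared with |t|."""
    return abs(cmath.exp(1j * t) - 1)


def second_order_remainder(t):
    """|e^{it} - 1 - it|, to be compared with t^2 / 2."""
    return abs(cmath.exp(1j * t) - 1 - 1j * t)


def lemma1_sides(thetas, u):
    """Both sides of Lemma 1 when d pi(X) = diag(i * theta_j).

    The operator norm of a diagonal matrix is the largest modulus of its
    diagonal entries, so each side is a maximum over the eigenvalues."""
    lhs = max(second_order_remainder(u * th) for th in thetas)
    rhs = 0.5 * u * u * max(abs(th) for th in thetas) ** 2
    return lhs, rhs


if __name__ == "__main__":
    for u in (-2.0, 0.5, 3.0):
        print(u, lemma1_sides([-1.5, 0.5, 2.0], u))
```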

Lemma 2

(Sugiura) For any \(X \in \mathfrak {g}\), we have \(\Vert \mathrm {d} \pi (X) \Vert _{\mathrm {op}} \le |\lambda _{\pi }| \cdot |X|\).

Proof of Lemma 2

In [32, Theorem 2] Sugiura stated and proved the estimate \(\Vert \mathrm {d} \pi (X) \Vert _{\mathrm {HS}} \le \sqrt{d_{\pi }} |\lambda _{\pi }| \cdot |X|\). His proof is based on the fact that with some \(d_{\pi } \times d_{\pi }\) unitary matrix U, we have \(U \mathrm {d} \pi (X) U^* = \mathrm {diag} \left( i \lambda (X) \, : \, \lambda \in W(\pi ) \right) \), where \(W(\pi )\) is the set of weights of \(\pi \). Further, we have \(|\lambda | \le |\lambda _{\pi }|\) for all \(\lambda \in W(\pi )\). Hence Sugiura’s proof in fact yields the slightly stronger claim of Lemma 2.\(\square \)

Lemma 3

Let \(X_1, \dots , X_n\) be an orthonormal basis in \(\mathfrak {g}\). For any \(u \in \mathbb {R}\), the points \(h_k=\exp (uX_k)\) satisfy

$$\begin{aligned} \sum _{k=1}^n \left( \pi (h_k)-I_{d_{\pi }} \right) ^* \left( \pi (h_k)-I_{d_{\pi }} \right) = u^2 \kappa _{\pi } I_{d_{\pi }} + E \end{aligned}$$

with some \(E\in \mathbb {C}^{d_{\pi } \times d_{\pi }}\), \(\Vert E \Vert _{\mathrm {op}} \le n |u|^3 |\lambda _{\pi }|^3 + n (u^4/4) |\lambda _{\pi }|^4\).

Proof of Lemma 3

By Lemma 1 we can write

$$\begin{aligned} \pi (h_k)-I_{d_{\pi }} = u \cdot \mathrm {d} \pi (X_k) + E_k \end{aligned}$$

with some error matrix \(E_k\) satisfying \(\Vert E_k \Vert _{\mathrm {op}} \le (u^2/2) \Vert \mathrm {d} \pi (X_k) \Vert _{\mathrm {op}}^2\). Therefore

$$\begin{aligned} \sum _{k=1}^n \left( \pi (h_k)-I_{d_{\pi }} \right) ^* \left( \pi (h_k)-I_{d_{\pi }} \right) = u^2 \sum _{k=1}^n \mathrm {d} \pi (X_k)^* \mathrm {d} \pi (X_k) +E \end{aligned}$$

where

$$\begin{aligned} \begin{aligned} \Vert E \Vert _{\mathrm {op}}&= \left\| \sum _{k=1}^n \left( u \cdot \mathrm {d} \pi (X_k)^* E_k + E_k^* u \cdot \mathrm {d} \pi (X_k) + E_k^*E_k \right) \right\| _{\mathrm {op}} \\&\le \sum _{k=1}^n \left( 2 |u| \cdot \Vert \mathrm {d} \pi (X_k) \Vert _{\mathrm {op}} \cdot \frac{u^2}{2} \Vert \mathrm {d} \pi (X_k) \Vert _{\mathrm {op}}^2 + \frac{u^4}{4} \Vert \mathrm {d} \pi (X_k) \Vert _{\mathrm {op}}^4 \right) . \end{aligned} \end{aligned}$$

By Lemma 2, the previous estimate yields \(\Vert E \Vert _{\mathrm {op}} \le n |u|^3 |\lambda _{\pi }|^3 + n (u^4/4) |\lambda _{\pi }|^4\). On the other hand, we have \(\mathrm {d} \pi (X)^* = -\mathrm {d} \pi (X)\), and by the definition of the Laplace–Beltrami operator,

$$\begin{aligned} \sum _{k=1}^n \mathrm {d} \pi (X_k)^* \mathrm {d} \pi (X_k) = - \sum _{k=1}^n \mathrm {d} \pi (X_k) \mathrm {d} \pi (X_k) = -(\Delta \pi )(e) = \kappa _{\pi } I_{d_{\pi }}.\qquad \qquad \qquad \quad \square \end{aligned}$$

We now finish the proof of Proposition 6. Recall that \(|\lambda _{\pi }|^2 \le \kappa _{\pi }\). From Lemma 3 we deduce that for any \(u \in \mathbb {R}\),

$$\begin{aligned} \mathrm {tr} \bigg ( \sum _{k=1}^n \left( \pi (h_k)-I_{d_{\pi }} \right) ^*&\left( \pi (h_k)-I_{d_{\pi }} \right) \widehat{f}(\pi ) \widehat{f}(\pi )^* \bigg ) \\&= \mathrm {tr} \left( u^2 \kappa _{\pi } \widehat{f}(\pi ) \widehat{f}(\pi )^* \right) + \mathrm {tr} \left( E \widehat{f}(\pi ) \widehat{f}(\pi )^* \right) \nonumber \\&\ge u^2 \kappa _{\pi } \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}}^2 - \Vert E \widehat{f}(\pi ) \Vert _{\mathrm {HS}} \cdot \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}} \nonumber \\&\ge u^2 \kappa _{\pi } \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}}^2 - \Vert E \Vert _{\mathrm {op}} \cdot \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}}^2 \nonumber \\&\ge \Vert \widehat{f} (\pi ) \Vert _{\mathrm {HS}}^2 \left( u^2 \kappa _{\pi } - n |u|^3 |\lambda _{\pi }|^3 - n \frac{u^4}{4} |\lambda _{\pi }|^4 \right) \nonumber \\&\ge \Vert \widehat{f} (\pi ) \Vert _{\mathrm {HS}}^2 \cdot u^2 \kappa _{\pi } \left( 1 - n |u| \cdot |\lambda _{\pi }| - n \frac{u^2}{4} |\lambda _{\pi }|^2 \right) . \nonumber \end{aligned}$$
(16)

Let \(M>0\) and \(0<c<2(\sqrt{n^2+n}-n)\) be arbitrary, and choose \(u=c/(nM)\). For any \(|\lambda _{\pi }| \le M\) we then have

$$\begin{aligned} 1 - n |u| \cdot |\lambda _{\pi }| - n \frac{u^2}{4} |\lambda _{\pi }|^2 \ge 1-c-\frac{c^2}{4n} >0, \end{aligned}$$

and thus (15) and (16) imply

$$\begin{aligned} \begin{aligned} n g \left( \frac{c}{nM} \right) ^2&\ge \sum _{\begin{array}{c} \pi \in \widehat{G} \\ |\lambda _{\pi }| \le M \end{array}} d_{\pi } \mathrm {tr} \left( \sum _{k=1}^n \left( \pi (h_k)-I_{d_{\pi }} \right) ^* \left( \pi (h_k)-I_{d_{\pi }} \right) \widehat{f}(\pi ) \widehat{f}(\pi )^* \right) \\&\ge \sum _{\begin{array}{c} \pi \in \widehat{G} \\ |\lambda _{\pi }| \le M \end{array}} d_{\pi } \Vert \widehat{f} (\pi ) \Vert _{\mathrm {HS}}^2 \left( \frac{c}{nM} \right) ^2 \kappa _{\pi } \left( 1-c-\frac{c^2}{4n} \right) . \end{aligned} \end{aligned}$$

Since \(0<c<2(\sqrt{n^2+n}-n)\) was arbitrary, the claim follows. \(\square \)

Proof of Theorem 4

From Propositions 5 and 6 and the Cauchy–Schwarz inequality we get that for any \(f \in \mathcal {F}_g\) and any real number \(M \ge |2 \rho ^+|+a\),

$$\begin{aligned} \begin{aligned} \bigg | \int _G&f \, \mathrm {d}\nu _1 - \int _G f \, \mathrm {d}\nu _2 \bigg | \\&\le \psi (M) + \left( \sum _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }|<M \end{array}} d_{\pi } \kappa _{\pi } \Vert \widehat{f}(\pi ) \Vert _{\mathrm {HS}}^2 \right) ^{1/2} \left( \sum _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }}{\kappa _{\pi }} \Vert \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}}^2 \right) ^{1/2} \\&\le \psi (M)+\phi (M) \left( \sum _{\begin{array}{c} \pi \in \widehat{G} \\ 0<|\lambda _{\pi }|<M \end{array}} \frac{d_{\pi }}{\kappa _{\pi }} \Vert \widehat{\nu _1}(\pi ) - \widehat{\nu _2}(\pi ) \Vert _{\mathrm {HS}}^2 \right) ^{1/2} . \end{aligned} \end{aligned}$$

By Kantorovich duality, the same upper bound holds for \(W_g (\nu _1, \nu _2)\).\(\square \)
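The Cauchy–Schwarz step above splits each weight as \(d_{\pi } = \sqrt{d_{\pi } \kappa _{\pi }} \cdot \sqrt{d_{\pi } / \kappa _{\pi }}\). The trivial representation may be dropped from the sum beforehand, since \(\widehat{\nu _1}(\pi _0) = \widehat{\nu _2}(\pi _0)\) for probability measures; this matters because \(\kappa _{\pi _0}=0\). A minimal Python sketch of the weighted inequality follows, in which the lists stand for arbitrary positive values of \(d_{\pi }\), \(\kappa _{\pi }\) and the two Hilbert–Schmidt norms (placeholders chosen at random, not data from any particular group).

```python
import math
import random


def cauchy_schwarz_sides(d, kappa, f_norms, nu_norms):
    """Return (sum d*f*v, sqrt(sum d*kappa*f^2) * sqrt(sum (d/kappa)*v^2))."""
    lhs = sum(di * fi * vi for di, fi, vi in zip(d, f_norms, nu_norms))
    first = math.sqrt(sum(di * ki * fi ** 2
                          for di, ki, fi in zip(d, kappa, f_norms)))
    second = math.sqrt(sum(di / ki * vi ** 2
                           for di, ki, vi in zip(d, kappa, nu_norms)))
    return lhs, first * second


if __name__ == "__main__":
    rng = random.Random(0)
    size = 20
    d = [rng.randint(1, 9) for _ in range(size)]
    kappa = [rng.uniform(0.5, 50.0) for _ in range(size)]  # strictly positive
    f_norms = [rng.random() for _ in range(size)]
    nu_norms = [rng.random() for _ in range(size)]
    print(cauchy_schwarz_sides(d, kappa, f_norms, nu_norms))
```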