1 Introduction

Learning from samples in learning theory and constructing models from observed data in system identification are, in some applications, two aspects of the same problem: inferring relationships between observed input and output quantities. An important feature of learning theory techniques [2, 17, 21] is that learning is formulated in a high-dimensional feature space, which is related to the data space by means of reproducing kernel Hilbert spaces. System identification builds mathematical models of the systems that generate the observed input-output data. It can be seen as an interface between real-world applications and the mathematics of control theory and model abstractions [10]. Recent research has encouraged applying statistical learning theory to system identification, as surveyed by L. Ljung in [9].

System identification is a challenging problem that calls for function approximation, regression, model fitting, and statistical methods [10]. It was formulated as a problem in statistical learning theory in [22]. In this paper we consider the system identification problem for linear, time-invariant, causal systems in the framework of [10, 22], where a transfer function is a convenient representation of a system. Estimating the transfer function of a linear system is a classical problem [10]. Nevertheless, it still attracts great attention, and new results continue to appear [1, 9, 12, 13]. Mathematically, the transfer function is a function of a complex variable, as described below.

A linear, discrete-time, time-invariant, causal system with input signal \(\{x(t)\}_{t\in \mathbb {Z}}\) and output signal \(\{y(t)\}_{t\in \mathbb {Z}}\) is described in terms of a sequence \(\{g_k\}_{k\in \mathbb {Z}}\) supported on \(\mathbb {N}\) by the discrete convolution

$$\begin{aligned} y(t)=\sum _{k=1}^\infty g_kx(t-k), \qquad t\in \mathbb {Z} . \end{aligned}$$

If we define a backward shift operator \(q^{-1}\) on \(\ell ^2(\mathbb {Z})\) by \(q^{-1}x(t)=q^{-1}(x)(t)=x(t-1)\), we have

$$\begin{aligned} y(t)=\sum _{k=1}^\infty g_kx(t-k) = \sum _{k=1}^\infty g_kq^{-k}x(t)= G(q)x(t), \end{aligned}$$

where \(q^{-k}\) denotes the operator \((q^{-1})^k\), and \(G(q)\) the operator \(\sum _{k=1}^\infty g_kq^{-k}\) (if it is well-defined). The corresponding function \(G\), defined (wherever the series converges) on \(\mathbb {C}\setminus \{0\}\) by

$$\begin{aligned} G(z)=\sum _{k=1}^\infty g_k z^{-k} \end{aligned}$$

is called the transfer function and \(G(q)\) the transfer operator of the system. The transfer function \(G\) (or “system”, or “filter”) is called stable if

$$\begin{aligned} \Vert g\Vert _{\ell ^1}:=\sum _{k=1}^\infty |g_k|<\infty . \end{aligned}$$

Under this stability condition, the operator \(G(q)\) is well-defined with \(\Vert G(q)\Vert \le \Vert g\Vert _{\ell ^1}\), and the Laurent series

$$\begin{aligned} G(z)=\sum _{k=1}^\infty g_kz^{-k} \end{aligned}$$

is continuous on and analytic outside the unit circle \(\mathbb {S}:=\{z\in {\mathbb {C}}: |z|=1\}\). Moreover \(G\) is in \(H_2\), the Hardy space of functions analytic outside \(\mathbb {S}\) with the norm

$$\begin{aligned} \Vert f\Vert _{H_2}=\left\{ \frac{1}{2\pi }\int _0^{2\pi }|f(e^{i\theta })|^2d\theta \right\} ^{\frac{1}{2}} \end{aligned}$$

finite.

In the system identification literature one of the most important cases is when the transfer function is rational. For a system with finite McMillan degree and simple poles, one can decompose \(G\) via a partial fraction expansion as

$$\begin{aligned} G(z) =\sum _{k=1}^Nc_k\frac{1}{z-a_k}. \end{aligned}$$
(1.1)

Hence, the following atomic set for linear systems

$$\begin{aligned} \mathcal {A} = \left\{ K_w (z)=\frac{1-|w|^2}{z-w}: w\in \mathbb {D}\right\} \end{aligned}$$

provides the building blocks, where \(\mathbb {D}:=\{z\in \mathbb {C}:|z|<1\}\) denotes the open unit disk. See the discussion in [14] for details.

The problem considered in this paper is frequency domain identification based on a set of frequency domain measurements. We study this problem under the assumption that the transfer function \(G_{\star }\) to be estimated has simple poles of magnitude at most \(\rho \) for some \(0<\rho <1\), meaning that \(G_{\star }\) can be expressed as

$$\begin{aligned} G_{\star }(z)=\sum _{k=1}^\infty \nu _k \frac{1-|w_k|^2}{z-w_k}, \end{aligned}$$
(1.2)

where \(w_k\in \mathbb {D}_\rho :=\{z\in {\mathbb {C}}: |z| \le \rho \}\), \(\nu _k \in {\mathbb {C}}\), and \(\sum _{k=1}^\infty |\nu _k|<\infty \). The frequency domain identification problem aims at approximating \(G_{\star }\) by means of observed and possibly noisy input-output data of the form

$$\begin{aligned} \big \{(e^{i\theta _1}, y_1), (e^{i\theta _2}, y_2), \cdots , (e^{i\theta _m}, y_m)\big \}\in (\mathbb {S}\times \mathbb {C})^m, \quad \hbox {with} \ y_k\approx G_{\star }(e^{i\theta _k}). \end{aligned}$$
(1.3)

We need to determine both the pole locations \(\{w_k\}\) of \(G_{\star }\) and the coefficients \(\{\nu _k\}\).

The main purpose of this paper is to estimate the transfer function by a learning theory approach. We view this as a least squares regression problem aiming at learning a target function \(f_{\star }\) defined on a compact metric space \(X\) from a sample of input-output pairs \(\mathbf{z} = \{(x_k, y_k)\}_{k=1}^m\). A common model in learning theory is to assume a Borel probability measure \(\mu \) on \(Z:= X \times Y\) with \(Y \subseteq {\mathbb {R}}\), which yields the conditional means \(f_\mu (x) = E[y|x]\) for \(x\in X\). The link between the target function \(f_{\star }\) and the sample \(\mathbf{z}\) is that \(f_{\star } = f_\mu \), the regression function given by the conditional means of \(\mu \), while \(\mathbf{z}\) is drawn independently from \(\mu \), so that \(y_k \approx f_{\star }(x_k)\).

There have been many learning algorithms for solving the least squares regression problem. A large family is given by kernel methods including support vector machines (for both regression and classification, with a general convex loss function). Such a learning algorithm is based on a kernel \(K: X \times X \rightarrow {\mathbb {R}}\) which is continuous, symmetric and positive semi-definite. The kernel generates a reproducing kernel Hilbert space (RKHS) \(({\mathcal H}_K, \Vert \cdot \Vert _K)\) and the learning algorithm for the least squares regression is given by a regularization scheme

$$\begin{aligned} f_{\mathbf{z}, \gamma } = \arg \min _{f\in {\mathcal H}_K} \left\{ \frac{1}{m}\sum _{k=1}^m|f(x_k)-y_k|^2+ \gamma \,\Vert f\Vert _{K}^2\right\} , \end{aligned}$$
(1.4)

where \(\gamma >0\) is a regularization parameter. The reproducing property of the RKHS, together with the squared Hilbert space norm used as the penalty, implies that \(f_{\mathbf{z}, \gamma } = \sum _{k=1}^m c^\mathbf{z}_k K(\cdot , x_k)\), with the coefficient vector (usually not sparse) satisfying a linear system of equations. Another family of learning algorithms based on a kernel \(K\) is given by coefficient-based regularization schemes. Here the output function takes the form \(f_{\mathbf{z}, \gamma , \Omega } = \sum _{k=1}^m c^\mathbf{z}_k K(\cdot , x_k)\) with the coefficient vector \(c^\mathbf{z} = (c^\mathbf{z}_k)_{k=1}^m\) given in terms of a penalty functional \(\Omega : {\mathbb {R}}^m \rightarrow {\mathbb {R}}_+\) by a regularization scheme

$$\begin{aligned} c^\mathbf{z} = \arg \min _{c \in {\mathbb {R}}^m} \left\{ \frac{1}{m}\sum _{k=1}^m\left| \sum _{j=1}^m c_j K(x_k, x_j)-y_k\right| ^2+ \gamma \, \Omega (c)\right\} . \end{aligned}$$
(1.5)

A typical example of the penalty functional is the \(\ell ^1\)-norm \(\Omega (c) = \sum _{k=1}^m |c_k|\). A major advantage of the coefficient-based regularization schemes is sparsity of the coefficient vector \(c^\mathbf{z}\) in (1.5). This has been observed in Lasso-type algorithms [20] in statistics and in the large literature of compressed sensing. Moreover, the regularization scheme (1.5) can be implemented by the LARS algorithm [3], which solves the more general optimization problem

$$\begin{aligned} \arg \min _{c \in {\mathbb {R}}^p} \left\{ \left\| {\mathcal M} c -y\right\| _{\ell ^2}^2+ \gamma \, \Vert c\Vert _{\ell ^1}\right\} , \end{aligned}$$
(1.6)

where \({\mathcal M}\) is an \(m \times p\) matrix with \(p \in {\mathbb {N}}\) and \(y\in {\mathbb {R}}^m\). Besides yielding possibly sparse solutions, the LARS algorithm produces the full piecewise linear solution path of (1.6), which plays an important role in cross-validation and similar methods for tuning the parameter \(\gamma \).
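To make the role of the solution path concrete, the following is a minimal sketch of solving (1.6) for real-valued data with scikit-learn's LARS implementation; the matrix, data, sparsity pattern and the \(\gamma \)-to-\(\alpha \) rescaling are illustrative assumptions and should be checked against the library version in use.

```python
# A minimal sketch of solving (1.6) with scikit-learn's LARS implementation.
# The alpha/gamma rescaling reflects scikit-learn's (1/(2m))||y - Mc||^2 + alpha*||c||_1
# convention; M, y and the sparsity pattern below are placeholders.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
m, p = 50, 200
M = rng.standard_normal((m, p))                  # design matrix (m x p)
c_true = np.zeros(p)
c_true[[3, 17, 40]] = [1.5, -2.0, 0.7]           # a sparse ground truth
y = M @ c_true + 0.05 * rng.standard_normal(m)

# Full piecewise linear solution path: coefs[:, j] solves the lasso at alphas[j].
alphas, active, coefs = lars_path(M, y, method="lasso")

gamma = 0.1                                      # regularization weight in (1.6)
alpha = gamma / (2 * m)                          # translate gamma to scikit-learn's scale
j = np.argmin(np.abs(alphas - alpha))            # pick the closest knot on the path
c_hat = coefs[:, j]
print("nonzero coefficients:", np.flatnonzero(c_hat))
```

The returned path of knots \(\alpha _j\) can then be scanned by cross-validation to tune \(\gamma \), as mentioned above.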

In the above learning theory model, \(\{x_k\}_{k=1}^m\) is a random sample drawn from the marginal distribution \(\mu _X\) of \(\mu \) on \(X\). This is a random design framework. In the literature of sampling theory, signal processing, and related fields, \(\{x_k\}_{k=1}^m\) is often taken to be deterministic and well spaced on \(X\). This corresponds to a fixed design framework. Essential differences between the two frameworks include the good numerical conditioning of the involved linear operators in the latter, and the improved approximation or prediction ability brought to the former by randomization, which in turn creates technical difficulties in the error analysis of learning algorithms [16].

Let us turn back to the problem of frequency domain identification with the input-output data (1.3). In the existing literature, it is often assumed that the measurements of the frequency response take the form

$$\begin{aligned} y_k= G_{\star }(e^{i\theta _k})+\eta _k, \quad k=1,2,\cdots , m, \end{aligned}$$
(1.7)

where \(\{e^{i\theta _k}\}_{k=1}^m\) is deterministic and \(\{\eta _k\}_{k=1}^m\) is a noise sequence consisting of independent and identically distributed random variables. In particular, in fixed design Gaussian regression models, \(\{e^{i\theta _1},e^{i\theta _2},\cdots ,e^{i\theta _m}\}\) are deterministic elements of \(\mathbb {S}\) (often equally spaced), and \(\eta _1, \cdots ,\eta _m\) are drawn independently from a Gaussian distribution \(\mathcal {N}(0, \sigma ^2)\) with \(\sigma ^2>0\). The function \(G_{\star }\) in (1.7) is the transfer function to be estimated. This is the fixed design framework studied in [14].

In this paper we take a random design framework and state frequency domain identification as a least squares regression problem in learning theory. We take the input space \(X\) to be the unit circle \({\mathbb S}\) and the output space to be \(Y={\mathbb {C}}\). Let \(\mu \) be a Borel probability measure on \({\mathbb S}\times {\mathbb {C}}\). The conditional distributions \(\mu (\cdot |z)\) with \(z\in \mathbb {S}\) define the regression function \(f_\mu \) as

$$\begin{aligned} f_\mu (z)=\int _{\mathbb {C}}y d \mu (y|z), \qquad z\in {\mathbb S}. \end{aligned}$$
(1.8)

Our analysis is based on the assumption that the transfer function is the regression function: \(G_{\star } = f_\mu \). A random sample \(\mathbf{{z}} =\{(z_k =e^{i\theta _k}, y_k)\}_{k=1}^m\in (\mathbb {S}\times {\mathbb {C}})^m\) is drawn independently from \(\mu \). This is the data in (1.3) or (1.7), and \(\{e^{i\theta _1},e^{i\theta _2},\cdots ,e^{i\theta _m}\}\) is now random on \(\mathbb {S}\). The randomization makes our study different from that in [14] and improves the approximation ability of the learning algorithm. The technical difficulty we then overcome in our error analysis is the possibly poor numerical conditioning of the involved matrices.
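For concreteness, the following is a minimal sketch of generating such a random-design sample (1.3): the pole locations, coefficients, sample size, noise level and the complex Gaussian noise model are illustrative assumptions, not specifications taken from the paper.

```python
# A minimal sketch of a random-design sample (1.3): theta_k uniform on [0, 2*pi)
# and y_k = G_star(e^{i theta_k}) + noise.  The poles, coefficients and noise
# model below are illustrative choices only.
import numpy as np

rng = np.random.default_rng(1)
rho = 0.9
w_true = np.array([0.5 + 0.3j, -0.6j, 0.2 - 0.7j])   # poles in D_rho (assumed)
nu_true = np.array([1.0, 0.5 - 0.5j, -0.8])          # summable coefficients (assumed)

def G_star(z):
    """Transfer function of the form (1.2) with finitely many atoms."""
    return sum(nu * (1 - abs(w) ** 2) / (z - w) for nu, w in zip(nu_true, w_true))

m, sigma = 200, 0.05
theta = rng.uniform(0.0, 2.0 * np.pi, m)
z = np.exp(1j * theta)                               # random sampling points on S
noise = sigma * (rng.standard_normal(m) + 1j * rng.standard_normal(m)) / np.sqrt(2)
y = np.array([G_star(zk) for zk in z]) + noise       # noisy frequency response data
```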

Once the sample \(\mathbf{{z}}\) is taken, the forms of the learning algorithms in [14] and in this paper (involving atomic norms, to be given below) are the same: both can be stated as a special case of (1.6) and can be solved efficiently by the LARS algorithm.

The coefficient-based learning algorithm considered here is also essentially different from the standard one in learning theory [15]: the two variables of the kernel here are defined on different domains, and the hypothesis space depends strongly on the nature of the model we consider. Moreover, locating the poles \(\{w_k\}\) of \(G_{\star }\) in (1.2), and not only the coefficients \(\{\nu _k\}\), is part of the learning task; the pole locations do not come automatically from the sample, which makes the setting substantially different.

2 Problem Formulation and Main Result

2.1 Nonsymmetric Kernel and Hypothesis Space

If the randomness of the sampling points \(\{z_k\}_{k=1}^m\) is ignored and the sample is fixed, the following problem formulation is the same as that in [14].

The kernel function \(K\) here is defined on \({\mathbb D} \times {\mathbb S}\) by

$$\begin{aligned} K(w, z) =\frac{1-|w|^2}{z-w}, \qquad w \in {\mathbb D}, z\in {\mathbb S}. \end{aligned}$$
(2.1)

It is a continuous function, but is not symmetric. For \(w \in {\mathbb D}\), \(K_w\) denotes the function on \({\mathbb S}\) given by \(K_w (z)=K(w, z) =\frac{1-|w|^2}{z-w}\). Such functions form an atomic set (or a dictionary)

$$\begin{aligned} \mathcal {A}=\left\{ K_w (z)=\frac{1-|w|^2}{z-w}: w\in \mathbb {D}\right\} . \end{aligned}$$

The atomic norm used in the algorithm is defined for expansions of the form \(f=\sum _{w\in \mathbb {D}_\rho } c_w K_{w}\), where the coefficient sequence \(\{c_w\}\) is absolutely summable.
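As a small illustration of the atoms, here is a minimal sketch; the chosen poles and coefficients are arbitrary, and the \(\ell ^1\) norm of one particular coefficient sequence only upper-bounds the atomic norm \(\Vert f\Vert _{\mathcal A}\) defined below, which is an infimum over all representations.

```python
# Minimal sketch of the atoms K_w and of a finite atomic combination.
import numpy as np

def K(w, z):
    """Nonsymmetric kernel K(w, z) = (1 - |w|^2) / (z - w), w in D, z on S."""
    return (1.0 - abs(w) ** 2) / (z - w)

ws = [0.3 + 0.4j, -0.5j]              # atom parameters in the open unit disk (arbitrary)
cs = [2.0, -1.0 + 0.5j]               # complex coefficients (arbitrary)

z = np.exp(1j * 0.7)                  # a point on the unit circle
f_z = sum(c * K(w, z) for c, w in zip(cs, ws))
l1_bound = sum(abs(c) for c in cs)    # upper bound for the atomic norm of f
print(f_z, l1_bound)
```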

Definition 2.1

For \(0< \rho < 1\), we define a function space \({\mathcal H}_{1,\rho }\) by

$$\begin{aligned} {\mathcal H}_{1,\rho } =\left\{ f: f =\sum _{k=1}^\infty \nu _k K_{w_k},\; w_k \in {\mathbb D}_\rho , \;\nu _k \in {\mathbb {C}}, \; \sum _{k=1}^\infty |\nu _k| < \infty \right\} \end{aligned}$$

with the atomic norm

$$\begin{aligned} \Vert f\Vert _{\mathcal A}=\inf \left\{ \sum _{k=1}^\infty |\nu _k|: f =\sum _{k=1}^\infty \nu _k K_{w_k}, w_k \in {\mathbb D}_\rho , \nu _k \in {\mathbb {C}}\right\} . \end{aligned}$$

For computational purposes and for introducing learning algorithms, we need an approximating space \({\mathcal H}_{1,\rho }^{(\epsilon )}\) of \({\mathcal H}_{1,\rho }\), which consists of linear combinations of a finite atomic set.

Definition 2.2

Let \(\epsilon >0\) and \({\mathbb D}_\rho ^{(\epsilon )}\) be a finite subset of \({\mathbb D}_\rho \). We define the hypothesis function space \({\mathcal H}_{1,\rho }^{(\epsilon )}\) as a subspace of \({\mathcal H}_{1, \rho }\) by

$$\begin{aligned} {\mathcal H}_{1,\rho }^{(\epsilon )} =\left\{ f: f=\sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}} \nu _w K_{w},\;\nu _w \in {\mathbb {C}}\right\} \end{aligned}$$

with the norm

$$\begin{aligned} \Vert f\Vert _{\mathcal A}=\inf \left\{ \sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}} |\nu _w|: f =\sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}} \nu _w K_{w}\right\} . \end{aligned}$$

The space \({\mathcal H}_{1,\rho }^{(\epsilon )}\) is the closed subspace of \({\mathcal H}_{1,\rho }\) generated by \(\{K_w:w\in {\mathbb D}_\rho ^{(\epsilon )}\}\). The learning algorithm is defined on the space \({\mathcal H}_{1,\rho }^{(\epsilon )}\), which plays the role of a hypothesis space in learning theory.

2.2 Learning Algorithm

Once the hypothesis space and its atomic norm are introduced, the learning algorithm can now be defined in terms of a sample \(\mathbf{z}\) as in [14].

Definition 2.3

Let \(\mathbf{z} =\{(z_k, y_k)\}_{k=1}^m \in ({\mathbb S} \times {\mathbb {C}})^m\), \(0< \rho < 1\), and \({\mathbb D}_\rho ^{(\epsilon )}\) be a finite subset of \({\mathbb D}_\rho \). The learning algorithm is defined as

$$\begin{aligned} f_{\mathbf{z}, \gamma }^{(\epsilon )} = \sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}} c^\mathbf{z}_w K_{w}, \end{aligned}$$
(2.2)

where the sequence \(c^\mathbf{z} = (c^\mathbf{z}_w)_{w \in {\mathbb D}_\rho ^{(\epsilon )}}\) is defined by the regularization scheme

$$\begin{aligned} c^{\mathbf{z}} =\mathop {\mathrm{arg}\min }_{c \in {\mathbb {C}}^{{\mathbb D}_\rho ^{(\epsilon )}}} \left\{ \frac{1}{m}\sum _{k=1}^{m}|f_c (z_k)-y_k|^{2}+\gamma \, \Vert c\Vert _{\ell ^1}\right\} . \end{aligned}$$
(2.3)

Here \(f_c = \sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}} c_w K_{w}\) for a sequence \(c \in {\mathbb {C}}^{{\mathbb D}_\rho ^{(\epsilon )}}\), \(\Vert c\Vert _{\ell ^1}=\sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}}|c_w|\), and \(\gamma >0\) is a regularization parameter.

The regularization scheme (2.3) in the above algorithm is a special case of the optimization problem (1.6) with the matrix \({\mathcal M} = \left( K(w, z_k)\right) _{k\in \{1, \ldots , m\}, w \in {\mathbb D}_\rho ^{(\epsilon )}}\) and regularization parameter \(m \gamma \) (with a slight modification to handle complex numbers), hence it can be computed efficiently by the LARS algorithm, no matter whether the sampling points \(\{z_k\}_{k=1}^m\) are deterministic or random.
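The paper computes \(c^\mathbf{z}\) with the LARS algorithm. Purely as a self-contained illustration of the optimization problem (2.3) with complex data, the following sketch uses proximal gradient descent (ISTA) with complex soft-thresholding instead; the dictionary `ws`, the data `(z, y)` and the iteration count are assumed inputs (for instance produced by the earlier sampling sketch and the grid sketch given after Definition 2.5 below), and this is not the solver analyzed in the paper.

```python
# A minimal ISTA sketch for (2.3): min_c (1/m)||M c - y||^2 + gamma * ||c||_1,
# where M_{k,w} = K(w, z_k) and c is complex.  Offered only as an illustration;
# the paper solves (2.3) with LARS.
import numpy as np

def solve_atomic_lasso(ws, z, y, gamma, n_iter=2000):
    ws, z, y = np.asarray(ws), np.asarray(z), np.asarray(y)
    m = len(z)
    M = (1.0 - np.abs(ws)[None, :] ** 2) / (z[:, None] - ws[None, :])   # M_{k,w} = K(w, z_k)
    step = m / (2.0 * np.linalg.norm(M, 2) ** 2)        # 1/L for the smooth part of (2.3)
    c = np.zeros(M.shape[1], dtype=complex)
    for _ in range(n_iter):
        grad = (2.0 / m) * (M.conj().T @ (M @ c - y))   # gradient of (1/m)||Mc - y||^2
        u = c - step * grad
        # complex soft-thresholding: shrink the modulus, keep the phase
        shrink = np.maximum(1.0 - step * gamma / np.maximum(np.abs(u), 1e-300), 0.0)
        c = shrink * u
    return c
```

The estimate is then \(f_{\mathbf{z}, \gamma }^{(\epsilon )} = \sum _{w} c^\mathbf{z}_w K_{w}\), and the sizable entries of \(c^\mathbf{z}\) point at approximate pole locations.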

Remark 2.4

The learning algorithm above is equivalent to the following atomic norm regularization scheme

$$\begin{aligned} f_{\mathbf{z}, \gamma }^{(\epsilon )} = \mathop {\mathrm{arg}\min }_{f\in {\mathcal H}_{1,\rho }^{(\epsilon )}} \left\{ \frac{1}{m}\sum _{k=1}^m|f(z_k)-y_k|^2+\gamma \,\Vert f\Vert _{\mathcal {A}}\right\} . \end{aligned}$$
(2.4)

2.3 Main Result

We state our main result on error analysis for the learning algorithm (2.2) in this subsection and postpone the proofs to the end of the paper. Our result is stated under the assumption that the subset \({\mathbb D}_\rho ^{(\epsilon )}\) of \({\mathbb D}_\rho \) is \(\epsilon \)-dense.

Definition 2.5

A point set \(\{w_1,\cdots , w_n\}\subset \mathbb {D}_\rho \) is said to be \(\epsilon \)-dense in \(\mathbb {D}_\rho \) if for every \(w \in \mathbb {D}_\rho \) there exists some \(1\le k\le n\) such that \(|w -w_k|\le \epsilon \).
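A minimal sketch of one way to build such an \(\epsilon \)-dense set is given below: take a square grid of spacing \(\epsilon \) and push candidates lying outside the closed disk back onto its boundary; since projection onto the convex set \(\mathbb {D}_\rho \) does not increase distances to points of \(\mathbb {D}_\rho \), \(\epsilon \)-density is preserved. The values of \(\rho \) and \(\epsilon \) below are arbitrary test choices.

```python
# One way to construct an epsilon-dense subset of the closed disk D_rho
# (Definition 2.5); nothing beyond the definition itself is assumed.
import numpy as np

def eps_dense_grid(rho, eps):
    ticks = np.arange(-rho - eps, rho + eps, eps)
    grid = (ticks[:, None] + 1j * ticks[None, :]).ravel()         # square grid, spacing eps
    grid = grid[np.abs(grid) <= rho + eps]                        # keep candidates near the disk
    outside = np.abs(grid) > rho
    grid[outside] = rho * grid[outside] / np.abs(grid[outside])   # project onto |w| = rho
    return np.unique(np.round(grid, 12))                          # drop (near-)duplicates

ws = eps_dense_grid(rho=0.9, eps=0.05)
print(len(ws))
```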

Throughout the paper we assume \(|y|\le M\) almost surely. Here we do not assume that the transfer function is strictly proper, as done in [14]. Instead, our more general assumption is that the transfer function \(G_{\star } = f_\mu \) belongs to \({\mathcal H}_{1, \rho }\) for some \(0<\rho <1\).

Theorem 2.6

Suppose \(f_\mu \in {\mathcal H}_{1, \rho }\) for some \(0<\rho <1\). Let \(0<\beta <1\), \(s\in \mathbb {N}\), and \(\gamma =m^{-\theta }\) with \(0< \theta < \frac{1}{2(1+\beta )}\). Then there exists a constant \(C_s>0\) depending on \(s\) such that for \(m\ge C_s^{\frac{2s}{\theta }}\), any \(\epsilon \)-dense subset \({\mathbb D}_\rho ^{(\epsilon )}\) of \({\mathbb D}_\rho \) with \(\epsilon =m^{-\frac{\theta }{2s}}\), and \(0 < \delta < 1\), with confidence \(1-\delta \), we have

$$\begin{aligned} \left\| f_{\mathbf{z},\gamma }^{(\epsilon )}-f_\mu \right\| ^2_{L^2_{\mu _{\mathbb {S}}}}\le C \left( \log \left( \frac{8}{\delta } \right) \right) ^{\frac{2 - 2\theta (1+\beta )}{1 - 2\theta (1+\beta )}+9} m^{-\theta }, \end{aligned}$$

where \(C\) is a constant independent of \(m\) and \(\delta \).

Remark 2.7

Since the parameter \(\beta >0\) can be taken arbitrarily small, the power index \(\theta \) of the learning rate \(O(m^{-\theta })\) can be made arbitrarily close to \(1/2\). Such a convergence rate is the best in the literature of coefficient-based regularization schemes with an \(\ell ^1\)-penalty [15].

2.4 Further Discussions

A special task of frequency domain identification is to locate the poles \(\{w_k\}\) of the transfer function \(G_{\star }\) in the expression (1.2), not only the coefficients \(\{\nu _k\}\). This may be carried out approximately with the help of the error bound stated in Theorem 2.6, at least when the system has finite McMillan degree and (1.1) is valid.

Our assumption that \(G_{\star } \in {\mathcal H}_{1, \rho }\) for some \(0<\rho <1\) implies that the poles of the transfer function lie in \(\mathbb {D}_\rho \). Together with the finiteness of the \(\ell ^1\) norm of the coefficient sequence \(\{\nu _k\}\), this ensures the stability of the system, so it is a reasonable condition for the study of system identification.

In this paper we study the problem of identification of time-invariant causal systems by a learning theory approach. When the system is not causal, i.e., the sequence \(\{g_k\}\) is not supported on \({\mathbb {N}}\), zeros of the transfer function need to be taken into consideration and the atomic set \({\mathcal A}\) should be expanded. This changes the nature of the atomic norm-based regularization scheme and the learning algorithm. When the system is moreover time-variant [8], identification of systems or linear operators [4, 5, 11] is even more challenging. It would be interesting to extend our results and methods to these more general settings.

Sparsity of solutions of the general optimization problem (1.6) has been extensively investigated in the literature of statistics and learning theory [29]. Conditions stated in terms of structures of the matrix \({\mathcal M}\) and sparsity of the coefficient vector of a target function with respect to feature sets (such as the irrepresentable conditions in [29]) are crucial. When the \(\epsilon \)-dense set \({\mathbb D}_\rho ^{(\epsilon )}\) is fixed and the sampling points \(\{e^{i\theta _k}\}_{k=1}^m\) are deterministic and well spaced (as assumed in [14]), one might use the existing results [29] to derive some sparsity of the solution vector \(c^\mathbf{z}\) in terms of the atomic functions \(\{K_{w}\}_{w \in {\mathbb D}_\rho ^{(\epsilon )}}\). However, how to deduce sparsity of the solution from sparsity conditions on the transfer function decompositions (1.1) or (1.2) in terms of poles is a complicated issue and deserves further study, especially when the sampling points \(\{e^{i\theta _k}\}_{k=1}^m\) are random and the matrix \({\mathcal M}\) is (with small probability) poorly conditioned in our random design framework.

3 Error Decomposition

Our analysis for the algorithm (2.2) is based on an error decomposition, a technique for the error analysis of data-dependent regularization schemes developed in [15, 23, 27, 28].

Define the generalization error \({\mathcal E}\) on the set of complex-valued measurable functions on \({\mathbb S}\) as

$$\begin{aligned} {\mathcal E}(f) = \int _{\mathbb S}\int _{\mathbb {C}}\left| f(z) - y\right| ^2 d \mu (y|z) d \mu _{\mathbb S}(z) \end{aligned}$$
(3.1)

and its empirical version associated with the sample \(\mathbf{z}\) as

$$\begin{aligned} {\mathcal E}_\mathbf{z}(f) = \frac{1}{m} \sum _{k=1}^m \left| f(z_k) - y_k\right| ^2. \end{aligned}$$
(3.2)

Then the learning algorithm (2.2) rewritten in the form (2.4) can be stated as

$$\begin{aligned} f_{\mathbf{z}, \gamma }^{(\epsilon )} =\arg \min _{f \in {\mathcal H}_{1,\rho }^{(\epsilon )}} \left\{ {\mathcal E}_\mathbf{z}(f)+\gamma \, \Vert f\Vert _{\mathcal A}\right\} . \end{aligned}$$
(3.3)

Our error decomposition technique is expressed by means of a regularizing function defined by

$$\begin{aligned} f_{\gamma }^{(\epsilon )}=\arg \min _{f \in {\mathcal H}_{1, \rho }^{(\epsilon )}} \left\{ {\mathcal E}(f)-\mathcal{E}(f_\mu )+\gamma \Vert f\Vert _{\mathcal A}\right\} . \end{aligned}$$
(3.4)

Proposition 3.1

Let \(\gamma >0\) and \(f_{\mathbf{z},\gamma }^{(\epsilon )}\) be given by (2.2). There holds

$$\begin{aligned} \mathcal{E} (f_{\mathbf{z},\gamma }^{(\epsilon )})-\mathcal{E}(f_\mu ) + \gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}} \le \mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma )+\mathcal{{D}}^{(\epsilon )}(\gamma ), \end{aligned}$$
(3.5)

where \(\mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma )\) is the sample error defined by

$$\begin{aligned} \mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma )=\mathcal{E} (f_{\mathbf{z},\gamma }^{(\epsilon )} )-\mathcal{E}_\mathbf{z} (f_{\mathbf{z},\gamma }^{(\epsilon )} ) +\mathcal{E}_\mathbf{z} (f_{\gamma }^{(\epsilon )})-\mathcal{E} (f_{\gamma }^{(\epsilon )} ) \end{aligned}$$

and \(\mathcal{D}^{(\epsilon )}(\gamma )\) is the regularization error defined by

$$\begin{aligned} \mathcal{D}^{(\epsilon )}(\gamma ) =\min _{f \in {\mathcal H}_{1,\rho }^{(\epsilon )}} \left\{ {\mathcal E}(f) -{\mathcal E}(f_\mu )+\gamma \Vert f\Vert _{\mathcal A}\right\} . \end{aligned}$$

Proof

The proof of Proposition 3.1 follows from expressing \(\mathcal{E}(f_{\mathbf{z},\gamma }^{(\epsilon )})-\mathcal{E}(f_\mu ) + \gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A}\) as

$$\begin{aligned}&\left\{ \mathcal{E}(f_{\mathbf{z},\gamma }^{(\epsilon )} )-\mathcal{E}_\mathbf{z} (f_{\mathbf{z},\gamma }^{(\epsilon )} )\right\} + \left\{ \left[ \mathcal{E}_\mathbf{z} (f_{\mathbf{z},\gamma }^{(\epsilon )} ) +\gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A}\right] -\left[ \mathcal{E}_\mathbf{z} (f_{\gamma }^{(\epsilon )}) +\gamma \Vert f_{\gamma }^{(\epsilon )} \Vert _{\mathcal A}\right] \right\} \\&+ \left\{ \mathcal{E}_\mathbf{z}(f_{\gamma }^{(\epsilon )}) - \mathcal{E} (f_{\gamma }^{(\epsilon )} )\right\} + \left\{ \mathcal{E} (f_{\gamma }^{(\epsilon )} ) - {\mathcal E}(f_\mu )+\gamma \Vert f_{\gamma }^{(\epsilon )} \Vert _{\mathcal A}\right\} \end{aligned}$$

and the inequality \(\mathcal{E}_\mathbf{z}(f_{\mathbf{z},\gamma }^{(\epsilon )} ) +\gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A} \le \mathcal{E}_\mathbf{z} (f_{\gamma }^{(\epsilon )} ) +\gamma \Vert f_{\gamma }^{(\epsilon )} \Vert _{\mathcal A}\), which follows from the equivalent definition (3.3) of the function \(f_{\mathbf{z},\gamma }^{(\epsilon )}\). \(\square \)

The regularization error will be estimated in the next section by means of a local polynomial reproduction formula, while the sample error will be bounded in Sect. 5, thanks to concentration inequalities and an iteration technique.

4 Estimating Regularization Error

The nonsymmetric kernel \(K\) in Algorithm (2.2) is the most distinctive feature of our error analysis. Estimating the induced regularization error is the main technical difficulty we need to overcome, compared with the analysis for real-valued kernels in [23, 28]. To this end, we need some properties of the fundamental functions \(K_w\) induced by the kernel \(K\).

Lemma 4.1

Let \(0<\rho <1\), \(K\) be given by (2.1) and \(K_w\) by \(K_w (z) =K(w, z)\) for \(w\in \mathbb {D}\) and \(z\in \mathbb {S}\). Then the following properties hold.

  1. (i)

    \(\Vert K_w\Vert _\infty \le 2\) for any \(w\in \mathbb {D}\).

  2. (ii)

    \(|K_w(z_1)-K_w(z_2)|\le \frac{1+\rho }{1-\rho }|z_1-z_2|\) for any \(w\in \mathbb {D}_\rho \) and \(z_1,z_2\in \mathbb {S}\).

  3. (iii)

    \(|K_{w_1}(z)-K_{w_2}(z)|\le \left( \frac{1+\rho }{1-\rho }\right) ^2|w_1-w_2|\) for any \(w_1,w_2\in \mathbb {D}_\rho \) and \(z\in \mathbb {S}\).

  4. (iv)

    \(\mathcal {H}_{1,\rho }\) can be regarded as a subset of \(C(\mathbb {S})\) with the inclusion map \(I : \mathcal {H}_{1,\rho } \rightarrow C(\mathbb {S})\) bounded as

    $$\begin{aligned} \Vert f\Vert _\infty \le 2 \Vert f\Vert _{\mathcal {A}}. \end{aligned}$$
    (4.1)

Proof

Part (i) follows from the inequalities \(|z-w|\ge |z|-|w|=1-|w|\) and

$$\begin{aligned} |K_w(z)|= \frac{(1-|w|)(1+|w|)}{|z -w|}\le 1+|w| \le 2. \end{aligned}$$

Part (ii) is seen since

$$\begin{aligned} |K_w(z_1)-K_w(z_2)|= & {} \left| \frac{1-|w|^2}{z_1-w}-\frac{1-|w|^2}{z_2-w}\right| = \left| \frac{(1-|w|^2)(z_2-z_1)}{(z_1-w)(z_2-w)}\right| \\\le & {} \frac{1+|w|}{1-|w|}|z_1-z_2| \le \frac{1+\rho }{1-\rho }|z_1-z_2|. \end{aligned}$$

In a similar way, for part (iii), we have for \(w_1,w_2\in \mathbb {D}_\rho \) and \(z\in \mathbb {S}\),

$$\begin{aligned} |K_{w_1}(z)-K_{w_2}(z)|= & {} \left| \frac{z(|w_2|^2-|w_1|^2)+(w_1-w_2) + w_1w_2(\bar{w_1}-\bar{w_2})}{(z-w_1)(z-w_2)}\right| \\\le & {} \frac{1+|w_1|+|w_2| + |w_1w_2|}{(1-|w_1|)(1-|w_2|)}|w_1-w_2|\\= & {} \frac{1+|w_1|}{1-|w_1|} \frac{1+|w_2|}{1-|w_2|}|w_1-w_2| \le \left( \frac{1+\rho }{1-\rho }\right) ^2 |w_1-w_2|. \end{aligned}$$

Part (iv) is a direct consequence of Part (i) and the definition of the norm \(\Vert f\Vert _{\mathcal {A}}\). The proof of Lemma 4.1 is complete. \(\square \)
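The bounds (i)-(iii) are elementary and can also be sanity-checked numerically; the following quick Monte-Carlo check uses arbitrary values of \(\rho \) and an arbitrary number of random draws.

```python
# A quick Monte-Carlo sanity check of the bounds (i)-(iii) of Lemma 4.1;
# rho and the number of random draws are arbitrary test choices.
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.8, 10**4

def K(w, z):
    return (1 - np.abs(w) ** 2) / (z - w)

def rand_disk(size):                      # random points in the closed disk D_rho
    return rho * np.sqrt(rng.uniform(0, 1, size)) * np.exp(1j * rng.uniform(0, 2 * np.pi, size))

w, w1, w2 = rand_disk(n), rand_disk(n), rand_disk(n)
z1, z2 = np.exp(1j * rng.uniform(0, 2 * np.pi, (2, n)))
L = (1 + rho) / (1 - rho)

assert np.all(np.abs(K(w, z1)) <= 2 + 1e-12)                                     # (i)
assert np.all(np.abs(K(w, z1) - K(w, z2)) <= L * np.abs(z1 - z2) + 1e-12)        # (ii)
assert np.all(np.abs(K(w1, z1) - K(w2, z1)) <= L**2 * np.abs(w1 - w2) + 1e-12)   # (iii)
```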

In this section we estimate the regularization error under the assumption that the regression function \(f_\mu \) lies in the space \({\mathcal H}_{1, \rho }\).

Our estimation is carried out by using the high-order smoothness of the functions \(K_w\). To this end, we need a local polynomial reproduction formula from the literature of multivariate approximation [7, 24, 25]. Note that \(\epsilon \)-dense sets can be described equivalently as subsets of \(\mathbb {R}^2\).

In the following we also denote by \(|u|\) the Euclidean \(\ell ^2\) norm of \(u=(x,y)\in X\subset \mathbb {R}^2\).

Definition 4.2

Let \(X =\{(x, y) \in {\mathbb {R}}^2: |x|^2 + |y|^2 \le \rho ^2\}\). A point set \(\{(x_1, y_1),\cdots , (x_n, y_n)\}\) in \(X\) is said to be \(\Delta \)-dense if for every \((x, y) \in X\) there exists some \(1\le k\le n\) such that \(|(x, y) -(x_k, y_k)| \le \Delta \).

Denote by \(\mathcal{P}^2_s\) the space of polynomials on \({\mathbb {R}}^2\) of degree at most \(s\) (here the coefficients are complex). The following lemma is a formula for local polynomial reproduction (see Theorem 3.10 in [24] which is also valid for polynomials with complex coefficients).

Lemma 4.3

There exists a constant \(C_s\) depending on \(s\in {\mathbb {N}}\) such that for any \(\Delta \)-dense point set \(\{(x_1, y_1),\cdots , (x_n, y_n)\}\) in \(X\) with \(\Delta \le \frac{1}{C_s}\) and every \(u\in X\), we can find real numbers \(b_j(u)\), \(1\le j\le n\), satisfying

  1. (i)

    \(\sum _{j=1}^{n}b_j(u)p(x_j, y_j)=p(u) \qquad \forall \, p\in \mathcal{P}_{s}^{2}\),

  2. (ii)

    \(\sum _{j=1}^{n}|b_j(u)|\le 2\),

  3. (iii)

    \(b_j(u)=0\) provided that \(|u-(x_j, y_j)|>C_{s}\Delta \).

Now we can estimate the regularization error \(\mathcal{D}^{(\epsilon )}(\gamma )\).

Proposition 4.4

Assume \(f_\mu \in {\mathcal H}_{1, \rho }\). Let \(s\in \mathbb {N}\). If \({\mathbb D}_\rho ^{(\epsilon )} =\{z_k =x_k + i y_k\}_{k=1}^n\) is a finite subset of \({\mathbb D}_\rho \) such that \(\{(x_1, y_1),\cdots , (x_n, y_n)\}\) is \(\epsilon \)-dense with \(\epsilon \le \frac{1}{C_s}\), then there exists a constant \(C_{s, \rho }\) depending on \(s\) and \(\rho \) such that

$$\begin{aligned} \mathcal{D}^{(\epsilon )}(\gamma ) \le C_{s, \rho } \Vert f_\mu \Vert _{\mathcal A} \left( \Vert f_\mu \Vert _{\mathcal A}\,\epsilon ^{2s} + \gamma \right) , \qquad \forall \, \gamma >0. \end{aligned}$$

Proof

The condition \(f_\mu \in {\mathcal H}_{1, \rho }\) tells us that for any \(\iota >0\), \(f_\mu \) can be written as

$$\begin{aligned} f_\mu =\sum _{k=1}^\infty \nu _k K_{w_k} \end{aligned}$$

with \(w_k \in {\mathbb D}_\rho \), \(\nu _k \in {\mathbb {C}}\), and

$$\begin{aligned} \Vert f_\mu \Vert _{\mathcal A} \le \sum _{k=1}^\infty |\nu _k| \le \Vert f_\mu \Vert _{\mathcal A} +\iota . \end{aligned}$$
(4.2)

Then there exists some \(N_{0}\in {\mathbb {N}}\) such that \(\sum _{k=N_{0}+1}^{\infty }|\nu _k| < \iota \). It follows from \(|z-w_k| \ge \big ||z| - |w_k|\big |\) that for each \(z\in {\mathbb S}\), there holds

$$\begin{aligned} \left| f_\mu (z) - \sum _{k=1}^{N_0} \nu _k K_{w_k} (z)\right| = \left| \sum _{N_0+1}^\infty \nu _k K_{w_k} (z)\right| \le \sum _{N_0+1}^\infty |\nu _k| \left| \frac{1-|w_k|^2}{1-|w_k|}\right| \le 2\iota . \end{aligned}$$
(4.3)

Let \(z\in {\mathbb S}\) and \(k\in \{1, \ldots , N_0\}\). Write \(w_k\) as \(w_k =v_k + i t_k\) and \(u_k=(v_k, t_k)\). Note that \(w_k\in \mathbb {C}\) and \(u_k\in X\subset \mathbb {R}^2\) but \(|w_k|=|u_k|\). Let \(u=(v,t)\in X\). Take \(\Delta = \epsilon \) and \(\{b_j(u)\}_{j=1}^n\) as in Lemma 4.3. By Lemma 4.3, we know for every polynomial \(p\in \mathcal{{P}}_{s}^{2}\) that

$$\begin{aligned} p(u)=\sum _{j\in I(u)}b_{j}(u)p(x_j, y_j)\quad \text{ with }\quad \sum _{j\in I(u)}|b_j(u)|\le 2, \end{aligned}$$

where \(I(u)=\{j\in \{1,\cdots , n\}: |(x_{j}, y_j)-u|\le C_{s}\epsilon \}\). Now by taking \(p_k\) to be the Taylor polynomial of the complex-valued function \(g(u)=\frac{1-|u|^2}{z-(v+it)}\) (of a variable \(u=(v,t)\in X\subset \mathbb {R}^2\)) at \(u_k\) of degree less than \(s\), we find

$$\begin{aligned} K_{w_k} (z)=\frac{1-|w_k|^2}{z-w_k}= \frac{1-|u_k|^2}{z-w_k}=p_{k}(u_k) \end{aligned}$$

and

$$\begin{aligned} \max _{j\in I(u_k)}|K_{x_j + i y_j}(z)-p_{k}(x_j, y_j)|= & {} \max _{j\in I(u_k)}|g(x_j,y_j)-p_k(x_j,y_j)|\\\le & {} \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s (X)}\max _{j\in I(u_k)}|(x_j, y_j) -u_k|^s\\\le & {} C_s^s \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s (X)}\epsilon ^{s}. \end{aligned}$$

It follows that

$$\begin{aligned}&|K_{w_k} (z)-\sum _{j\in I(u_k)}b_{j}(u_k)K_{x_j + i y_j}(z)| \\&\quad = \left| p_{k}(u_k) -\sum _{j\in I(u_k)}b_{j}(u_k)\Big (p_{k}(x_j, y_j) + K_{x_j + i y_j}(z) - p_{k}(x_j, y_j)\Big )\right| \\&\quad = \left| \sum _{j\in I(u_k)}b_{j}(u_k)\Big (K_{x_j + i y_j}(z) - p_{k}(x_j, y_j)\Big )\right| \\&\quad \le \sum _{j\in I(u_k)} |b_{j}(u_k)| C_s^s \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s (X)}\epsilon ^{s} \le 2 C_s^s \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s (X)}\epsilon ^{s}. \end{aligned}$$

The above bound holds for every \(z\in {\mathbb S}\) and \(k\in \{1, \ldots , N_0\}\). This in connection with (4.2) yields

$$\begin{aligned}&\int _{\mathbb S}\biggl |\sum _{k=1}^{N_0}\nu _k\Bigl (K_{w_k}(z)-\sum _{j\in I(u_k)}b_{j}(u_k)K_{x_j + i y_j}(z)\Bigl )\biggl |^2d \mu _{\mathbb S}(z)\\&\quad \le (\Vert f_\mu \Vert _{\mathcal A} +\iota )^2 4 C_s^{2s} \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s (X)}^2\epsilon ^{2s}. \end{aligned}$$

Since \(x_j + i y_j =z_j \in {\mathbb {D}}_\rho ^{(\epsilon )}\), we know that \(f= \sum _{k=1}^{N_0} \nu _k \sum _{j\in I(u_k)}b_{j}(u_k)K_{x_j + i y_j}\) lies in \({\mathcal H}_{1,\rho }^{(\epsilon )}\) and

$$\begin{aligned} \Vert f\Vert _{\mathcal A} \le \sum _{k=1}^{N_0} |\nu _k| \sum _{j\in I(u_k)}|b_{j}(u_k)| \le 2 \sum _{k=1}^{N_0} |\nu _k| \le 2 (\Vert f_\mu \Vert _{\mathcal A} +\iota ). \end{aligned}$$

The above bounds and (4.3) yield

$$\begin{aligned}&\int _{\mathbb S} \left| f(z) -f_\mu (z)\right| ^2 d \mu _{\mathbb S}(z) +\gamma \Vert f\Vert _{\mathcal A} \\&\quad \le 8\iota ^2 + (\Vert f_\mu \Vert _{\mathcal A} +\iota )^2 8 C_s^{2s} \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s(X)}^2 \epsilon ^{2s} + 2 (\Vert f_\mu \Vert _{\mathcal A} +\iota ) \gamma . \end{aligned}$$

It can be easily seen that for any complex-valued measurable function \(f\) on \({\mathbb S}\), there holds

$$\begin{aligned} {\mathcal E}(f) -{\mathcal E}(f_\mu )= \int _{\mathbb S} \left| f(z) - f_\mu (z)\right| ^2 d \mu _{\mathbb S}(z). \end{aligned}$$
(4.4)

Since \(\mathcal{D}^{(\epsilon )}(\gamma )\) is independent of \(\iota \), letting \(\iota \rightarrow 0\) in the above bound yields

$$\begin{aligned} \mathcal{D}^{(\epsilon )}(\gamma ) \le \Vert f_\mu \Vert _{\mathcal A}^2 8 C_s^{2s} \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s(X)}^2\epsilon ^{2s} + 2\gamma \Vert f_\mu \Vert _{\mathcal A} . \end{aligned}$$

For each fixed \(z\in {\mathbb S}\), the function \(u=(v,t) \mapsto \frac{1-|u|^2}{z-(v+it)}\) is \(C^\infty \) on \({\mathbb D}_\rho \). Since \(X =\{(x, y) \in {\mathbb {R}}^2: |x|^2 + |y|^2 \le \rho ^2\}\) is compact, the number \(C'_{s,\rho }:=\sup _{z\in {\mathbb S}} \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s(X)}^2\), which depends on \(s\) and \(\rho \), is finite for every \(s\in {\mathbb {N}}\). Then the desired statement is verified with \(C_{s,\rho }=8 C_s^{2s} C'_{s, \rho }+2\). \(\square \)

5 Estimating Sample Error in the Random Design

In this section we estimate the sample error in the random design setting. Recall the sample error \( \mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma )\), which can be expressed as

$$\begin{aligned} \mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma ) = \mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma ) + \mathcal{S}_2^{(\epsilon )}(\mathbf{z}, \gamma ), \end{aligned}$$

where

$$\begin{aligned} \mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma )= & {} \left\{ \mathcal{E}_{\mathbf {z}}(f_{\gamma }^{(\epsilon )})-\mathcal{E}_\mathbf{z}(f_{\mu })\right\} -\left\{ \mathcal{E}(f_{\gamma }^{(\epsilon )})-\mathcal{E}(f_{\mu })\right\} ,\\ \mathcal{S}_2^{(\epsilon )}(\mathbf{z}, \gamma )= & {} \left\{ \mathcal{E} (f_{\mathbf {z},\gamma }^{(\epsilon )})-\mathcal{E}(f_{\mu })\right\} - \left\{ \mathcal{E}_{\mathbf {z}}(f_{\mathbf {z},\gamma }^{(\epsilon )})-\mathcal{E}_{\mathbf {z}} (f_{\mu })\right\} . \end{aligned}$$

To estimate the quantity \(\mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma )\), we consider the random variable \(\xi \) on \((\mathbb {S}\times \mathbb {C}, \mu )\) given by

$$\begin{aligned} \xi (z,y) = |f_\gamma ^{(\epsilon )}(z)-y|^2-|f_\mu (z)-y|^2 \end{aligned}$$

and satisfying

$$\begin{aligned} \mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma ) = \frac{1}{m}\sum _{k=1}^m\xi (z_k,y_k)-\mathbb {E}(\xi ). \end{aligned}$$

Since \(|y|\le M\) almost surely, we have \(\Vert f_\mu \Vert _\infty \le M\). From the definition of \(f_{\gamma }^{(\epsilon )}\) given by (3.4) we deduce that

$$\begin{aligned} \Vert f_\gamma ^{(\epsilon )}\Vert _\infty \le 2 \Vert f_\gamma ^{(\epsilon )}\Vert _{\mathcal A}\le 2 \mathcal{D}^{(\epsilon )}(\gamma )/\gamma . \end{aligned}$$

Hence \(\xi \) is bounded by

$$\begin{aligned} M_{\epsilon ,\gamma }:=\left( \frac{2\mathcal{D}^{(\epsilon )}(\gamma )}{\gamma }+3M\right) ^2 \end{aligned}$$

and its variance is bounded by \(M_{\epsilon ,\gamma } \mathcal{D}^{(\epsilon )}(\gamma )\), since \(|\xi (z,y)|\le M_{\epsilon ,\gamma }^{1/2}\,|f_\gamma ^{(\epsilon )}(z)-f_\mu (z)|\) and, by (4.4), \(\mathbb {E}|f_\gamma ^{(\epsilon )}-f_\mu |^2 = \mathcal{E}(f_\gamma ^{(\epsilon )})-\mathcal{E}(f_\mu )\le \mathcal{D}^{(\epsilon )}(\gamma )\). Applying the one-sided Bernstein inequality [26] yields the following bound for \(\mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma )\).

Lemma 5.1

Let \(0 <\gamma \le 1\). For any \(0 < \delta < 1\), with confidence \(1-\frac{\delta }{4}\), it holds that

$$\begin{aligned} S_1^{(\epsilon )}(\mathbf{z},\gamma ) \le \frac{15\log \left( \frac{4}{\delta }\right) ({\mathcal {D}}^{(\epsilon )}(\gamma ))^2}{m\gamma ^2}+ \frac{33 M^2\log \left( \frac{4}{\delta }\right) }{m}+\mathcal{D}^{(\epsilon )}(\gamma ). \end{aligned}$$
(5.1)

It is more difficult to bound \(S_2^{(\epsilon )}(\mathbf{z},\gamma )\) because it involves the sample \(\mathbf {z}\) through \(f_{\mathbf{z},\gamma }^{(\epsilon )}\). We use a probability inequality that handles a class of functions in \({\mathcal {H}}_{1,\rho }^ {(\epsilon )}\). Such an inequality uses covering numbers of \({\mathcal {H}}_{1,\rho }^ {(\epsilon )}\) to describe its complexity. We first bound these covering numbers.

Definition 5.2

The covering number \(\mathcal {N}(X, r)\) of the metric space \(X\) is the minimal \(l \in \mathbb {N}\) such that there exist \(l\) balls in \(X\) with radius \(r\) covering \(X\).

In the function space \({\mathcal {H}}_{1,\rho }\), denote \(\mathcal {B}_R = \{f\in {\mathcal H}_{1,\rho } :\Vert f\Vert _{\mathcal {A}}\le R\}\). Recall the inclusion map \(I: \mathcal {H}_{1,\rho } \rightarrow C(\mathbb {S})\) defined in Lemma 4.1 (iv). We need the covering number \(\mathcal {N}(I(\mathcal {B}_1), \epsilon )\) of \(I(\mathcal {B}_1)\) as a subset of \(C(\mathbb {S})\).

For every \(\varepsilon > 0\) and \(R \ge M\), the following inequality, a uniform law of large numbers for a class of functions, can be derived as in [2, Proposition 8.15]:

$$\begin{aligned}&\mathop {\mathrm {Prob}}\left\{ \sup _{f\in \mathcal {B}_R}\frac{(\mathcal {E}(f)-\mathcal {E}(f_\mu ))-(\mathcal {E}_{\mathbf {z}}(f) - \mathcal {E}_{\mathbf {z}}(f_\mu ))}{\sqrt{\mathcal {E}(f)-\mathcal {E}(f_\mu )+\varepsilon }} \le \sqrt{\varepsilon }\right\} \nonumber \\&\qquad \quad \ge 1- \mathcal {N}\left( I(\mathcal {B}_1),\frac{\varepsilon }{25R^2}\right) \exp \left\{ -\frac{m\varepsilon }{54\cdot 25R^2}\right\} . \end{aligned}$$
(5.2)

With this inequality, as in [2], we have the following bound for \(S_2^{(\epsilon )}(\mathbf{z}, \gamma )\).

For \(R>0\), we define the subset \(\mathcal {W}(R)\) of \((\mathbb {S}\times \mathbb {C})^m\) as

$$\begin{aligned} \mathcal {W}(R) =\{\mathbf {z}\in (\mathbb {S}\times \mathbb {C})^m: \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\le R\}. \end{aligned}$$

Lemma 5.3

If \(0 < \gamma \le 1\) and \(R\ge 1\), then there exists a subset \(V_R\) of \((\mathbb {S}\times \mathbb {C})^m\) of measure at most \(\delta /4\) such that

$$\begin{aligned} S_2^{(\epsilon )}(\mathbf{z},\gamma ) \le \frac{1}{2} \left( \mathcal {E}(f_\mathbf{z,\gamma }^{(\epsilon )})-\mathcal {E}(f_\mu )\right) + 2 \varepsilon _{R,\delta , m},\quad \forall \, \mathbf{z}\in \mathcal {W}(R)\setminus V_R, \end{aligned}$$
(5.3)

where \(\varepsilon _{R,\delta , m}\) is the smallest positive number \(\varepsilon \) satisfying

$$\begin{aligned} \mathcal {N}\left( I(\mathcal {B}_1), \frac{\varepsilon }{25R}\right) \exp \left\{ -\frac{m\varepsilon }{54\cdot 25R^2}\right\} \le \frac{\delta }{4}. \end{aligned}$$

To derive concrete error bounds, we need to bound the covering number \(\mathcal {N}(I(\mathcal {B}_1), \varepsilon )\). We do so by applying methods from [2, 30, 31].

Lemma 5.4

For any \(0< \beta <1\), there exists some constant \(C_{\rho ,\beta }>0\) such that

$$\begin{aligned} \log \mathcal {N}(I(\mathcal {B}_1), r)\le C_{\rho ,\beta } \left( \frac{1}{r}\right) ^{\beta },\quad \forall \, r >0. \end{aligned}$$

Proof

Since \(0<\rho <1\), for any \(\ell \in \mathbb {N}\), the functions \(K_w\), \(w\in \mathbb {D}_\rho \), are uniformly bounded in \(C^\ell (\mathbb {S})\), the space of complex-valued functions having continuous partial derivatives up to order \(\ell \). We can prove that

$$\begin{aligned} \sup _{w\in \mathbb {D}_\rho }\Vert K_w\Vert _{C^{\ell }( \mathbb {S})}=:C_{\ell ,\rho }\le \frac{2^{\ell +2}\ell !}{(1-\rho )^\ell } <\infty . \end{aligned}$$

Indeed, by the Cauchy integral formula, we have

$$\begin{aligned} K_{w}^{(\ell )}(z)= & {} \frac{\ell !}{2\pi i}\oint _{|\xi -z|=(1-\rho )/2}\frac{K_w(\xi )}{(\xi -z)^{\ell +1}}d\xi \\= & {} \frac{\ell !}{2\pi i}\oint _{|\xi -z|=(1-\rho )/2}\frac{1-|w|^2}{(\xi -w)(\xi -z)^{\ell +1}}d\xi . \end{aligned}$$

Since \(w\in \mathbb {D}_\rho \) we know that \(|w-\xi |\ge (1-\rho )/2\). It follows that

$$\begin{aligned} |K_{w}^{(\ell )}(z)| \le \frac{2^{\ell +1}\ell !(1-\rho ^2)}{(1-\rho )(1-\rho )^\ell } \le \frac{2^{\ell +2}\ell !}{(1-\rho )^\ell }. \end{aligned}$$

Hence for any \(f =\sum _{k=1}^\infty \nu _k K_{w_k}\in \mathcal {H}_{1,\rho }\) we have

$$\begin{aligned} \Vert f\Vert _{C^\ell ( \mathbb {S})}\le \sum _{k=1}^\infty |\nu _k|\Vert K_{w_k}\Vert _{C^\ell ( \mathbb {S})}\le C_{\ell ,\rho }\sum _{k=1}^\infty |\nu _k|, \end{aligned}$$

which gives \(\Vert f\Vert _{C^\ell ( \mathbb {S})}\le C_{\ell ,\rho }\Vert f\Vert _{\mathcal {A}}\). Hence for any \(f =\sum _{k=1}^\infty \nu _k K_{w_k}\in \mathcal {B}_1\) with \(\sum _{k=1}^\infty |\nu _k|\le 1\) and \(w_k\in \mathbb {D}_\rho \) we have

$$\begin{aligned} \Vert f\Vert _{C^\ell ( \mathbb {S})}\le \sum _{k=1}^\infty |\nu _k|\Vert K_{w_k}\Vert _{C^\ell ( \mathbb {S})}\le C_{\ell ,\rho }. \end{aligned}$$

Thus, \(\mathcal {B}_1\) can be embedded into the ball of \(C^\ell ( \mathbb {S})\) with radius \(C_{\ell ,\rho }\) whose covering number can be bounded by

$$\begin{aligned} \exp \left\{ \left( C_{\ell }\frac{C_{\ell ,\rho }}{r}\right) ^{\beta }\right\} , \end{aligned}$$

where the positive constant \(C_\ell \) depends only on \(\ell \). See [2, Chapter 5] and references therein. Then our conclusion follows by taking \(\ell \ge 2/\beta \). \(\square \)
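Since the atoms have the closed-form \(z\)-derivatives \(K_w^{(\ell )}(z)=(-1)^{\ell }\,\ell !\,(1-|w|^2)/(z-w)^{\ell +1}\), the derivative bound used in the proof above can also be checked directly; the following quick numerical check uses arbitrary test values of \(\rho \) and \(\ell \).

```python
# Numerical check of sup_{w in D_rho, z in S} |K_w^{(l)}(z)| <= 2^{l+2} l! / (1 - rho)^l,
# using the closed-form derivative of K_w(z) = (1 - |w|^2)/(z - w) in z.
import math
import numpy as np

rho, ell = 0.7, 3
w = rho * np.exp(1j * np.linspace(0, 2 * np.pi, 200))    # worst case |w| = rho
z = np.exp(1j * np.linspace(0, 2 * np.pi, 400))
deriv = math.factorial(ell) * (1 - np.abs(w)[:, None] ** 2) / (z[None, :] - w[:, None]) ** (ell + 1)
bound = 2 ** (ell + 2) * math.factorial(ell) / (1 - rho) ** ell
assert np.abs(deriv).max() <= bound
print(np.abs(deriv).max(), bound)
```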

Combining Proposition 4.4, Lemmas 5.1 and 5.3 with the bounds in Lemma 5.4 and Proposition 3.1, we deduce the following conclusion.

Proposition 5.5

Let \(0<\gamma \le 1, R\ge 1, s\in \mathbb {N}, 0<\beta <1\) and \(\epsilon =\gamma ^{\frac{1}{2s}}\). If the conditions of Theorem 2.6 are satisfied and \(0<\delta <1\), then there is a subset \(V_R\) of \((\mathbb {S}\times \mathbb {C})^m\) with measure at most \(\delta \) such that

$$\begin{aligned}&\mathcal{E}(f_{\mathbf{z},\gamma }^{(\epsilon )})-\mathcal{{E}}(f_\mu )+\gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\\&\quad \le 2C' (\log (4/\delta )+1)m^{-\frac{1}{1+\beta }}R^2 + C_1\gamma +C_2\log (4/\delta )m^{-1},\quad \forall \, \mathbf{z}\in \mathcal {W}(R)\backslash V_R, \end{aligned}$$

where \(C'=25 \max \{108,(108 C_{\rho ,\beta })^{1/(1+\beta )}\}\), \(C_1=2C_{s,\rho } \Vert f_\mu \Vert _{\mathcal A} (\Vert f_\mu \Vert _{\mathcal A}+1)\) and \(C_2=30 C_1^2 + 66 M^2\).

The bound for the generalization error obtained by simply combining the above estimates is not tight enough. To get better error estimates, we shall apply an iteration technique [6, 15, 18, 19, 26] to improve the rough bound on \(\Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\). Since \(f_{\mathbf{z},\gamma }^{(\epsilon )}\) is a good approximation of \(f_{\gamma }^{(\epsilon )}\), one expects a bound on \(\Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\) much tighter than the trivial bound \(M^2/\gamma \).

Lemma 5.6

Under the assumption of Proposition 5.5, let \(0<\delta <1\) and \(\epsilon ^{2s}=\gamma =m^{-\theta }\) with \(0 < \theta < \frac{1}{2(1+\beta )}\) and \(m\ge C_s^{\frac{2s}{\theta }}\). Then with confidence \(1-\delta \), there holds

$$\begin{aligned} \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A}\le \tilde{C} (\log (4/\delta )+1)^{2^{{J}}} \end{aligned}$$
(5.4)

where \(\tilde{C}\) is a constant independent of \(\delta \) and \(m\), and

$$\begin{aligned} J=\max \left\{ 2, \log _2 \frac{1-\theta (1+\beta )}{1-2\theta (1+\beta )}\right\} . \end{aligned}$$

Proof

Based on Proposition 5.5 and the elementary inequality \(x+y \le \max \{2x, 2y\}\) for \(x, y \ge 0\), we know that

$$\begin{aligned} \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A}\le \max \{{a}_{m,\gamma }R^2, {b}_{m,\gamma }\},\quad \forall \, \mathbf{z}\in \mathcal {W}(R)\backslash V_R, \end{aligned}$$

where

$$\begin{aligned} {a}_{m,\gamma }= & {} 4C' (\log (4/\delta )+1)m^{-\frac{1}{1+\beta }}\gamma ^{-1},\\ {b}_{m,\gamma }= & {} 2C_1 + 2C_2\log (4/\delta )m^{-1}\gamma ^{-1}. \end{aligned}$$

It follows that

$$\begin{aligned} \mathcal {W}(R)\subset \mathcal {W}(\max \{{a}_{m,\gamma }R^2, {b}_{m,\gamma }\})\cup V_R. \end{aligned}$$
(5.5)

Consider a sequence \(\{R^{(j)}\}_{j\in \mathbb {N}}\), where \(R^{(0)}=\frac{M^2}{\gamma }\) and \(R^{(j)}=\max \{{a}_{m,\gamma }(R^{(j-1)})^2, {b}_{m,\gamma }\}\). Since \(|y|\le M\) almost surely, by taking \(f=0\) in (2.4), we see \(\Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\le \frac{M^2}{\gamma }\). So for \(J\in \mathbb {N}\),

$$\begin{aligned} (\mathbb {S}\times \mathbb {C})^m=\mathcal {W}(R^{(0)})\subset \mathcal {W}(R^{(1)})\cup V_{R^{(0)}} \subset \cdots \subset \mathcal {W}(R^{(J)})\cup \left( \bigcup _{j=0}^{J-1}V_{R^{(j)}}\right) . \end{aligned}$$

Hence the measure of \(\mathcal {W}(R^{(J)})\) is at least \(1-J\delta \). From the definition of the sequence \(\{R^{(j)}\}\), we see that \(R^{(J)}\) is

$$\begin{aligned}&\max \Big \{({a}_{m,\gamma })^{1+2+2^2+\cdots +2^{J-1}}(R^{(0)})^{2^{J}}, ({a}_{m,\gamma })^{1+2+2^2+\cdots +2^{J-2}}({b}_{m,\gamma })^{2^{J-1}},\cdots , {a}_{m,\gamma }{b}^2_{m,\gamma },{b}_{m,\gamma }\Big \}\\&\quad \le \max \Big \{A_{\delta }^{2^J-1}M^{2^{J+1}}\gamma ^{1-2^{J+1}}m^{-\frac{2^J-1}{1+\beta }}, {b}_{m,\gamma },{b}_{m,\gamma } \Big (A_\delta B_\delta \gamma ^{-1}m^{-\frac{1}{1+\beta }}\max \Big \{1,\gamma ^{-1}m^{-1}\}\Big )^{2^{J-1}-1}\Big \} \end{aligned}$$

where \(A_\delta =4C' (\log (4/\delta )+1)\) and \(B_\delta =2C_1+2C_2\log (4/\delta )\).

We let \(\gamma =m^{-\theta }\) with \(0<\theta < \frac{1}{2(1+\beta )}\) and we determine \(J\) to be the smallest positive integer satisfying

$$\begin{aligned} (2^{J+1}-1)\theta -\frac{2^J-1}{1+\beta } \le 0. \end{aligned}$$

With this choice of \(J\), we have with confidence at least \(1-J\delta \),

$$\begin{aligned} R^{(J)}\le \max \Big \{A_{\delta }^{2^J-1}M^{2^{J+1}}, B_\delta , B_\delta (A_\delta B_\delta )^{2^{J-1}-1}\Big \}. \end{aligned}$$

The desired result follows by scaling \(J\delta \) to \(\delta \). \(\square \)
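A tiny numerical illustration of the iteration \(R^{(j)}=\max \{a_{m,\gamma }(R^{(j-1)})^2, b_{m,\gamma }\}\) driving this proof is given below; all constants are placeholder values rather than the constants of the paper, and only serve to show how quickly the crude bound \(R^{(0)}=M^2/\gamma \) contracts once \(a_{m,\gamma }R^{(j-1)}<1\).

```python
# Placeholder illustration of the iteration in the proof of Lemma 5.6.
m, theta, beta = 10**5, 0.3, 0.1        # theta < 1/(2(1+beta)) as in Theorem 2.6
gamma = m ** (-theta)
M = 1.0
a = 8.0 * m ** (-1.0 / (1.0 + beta)) / gamma   # stands in for a_{m,gamma}
b = 2.5                                        # stands in for b_{m,gamma}

R = M ** 2 / gamma                             # crude starting bound R^(0)
for j in range(1, 6):
    R = max(a * R ** 2, b)
    print(j, R)                                # settles at b after a few steps
```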

6 Proof of Main Result

We are in a position to prove our main result.

Proof of Theorem 2.6

Let \(R\) be the right side of (5.4). From Lemma 5.6, we see that \(\mathcal {W}(R)\) has measure at least \(1-\delta \). Based on Proposition 5.5, there is a subset \(V_R\) of \((\mathbb {S}\times \mathbb {C})^m\) with measure at most \(\delta \) such that for all \(\mathbf{z}\in \mathcal {W}(R)\backslash V_R\),

$$\begin{aligned} \mathcal {E}(f_{\mathbf{z},\gamma }^{(\epsilon )})-\mathcal {E}(f_\mu )\le & {} 2C' (\log (4/\delta )+1)m^{-\frac{1}{1+\beta }}R^2 + C_1m^{-\theta } +C_2\log (4/\delta )m^{-1}\\\le & {} \left( 2C' \widetilde{C}^2\left( \log \frac{4}{\delta }+1\right) ^{2^{J+1}+1} +C_1+C_2 \right) m^{-\theta }. \end{aligned}$$

The desired result follows by scaling \(2\delta \) to \(\delta \). \(\square \)