Abstract
This paper proposes a learning theory approach to estimating transfer functions in system identification. A frequency domain identification problem is formulated as an atomic norm regularization scheme in a random design framework of learning theory. This formulation makes it possible to obtain sparsity and to provide finite sample estimates for learning the transfer function in a learning theory framework. Error analysis for the learning algorithm is carried out by applying a local polynomial reproduction formula, concentration inequalities and iteration techniques. The convergence rate obtained here is the best in the literature. It is hoped that the learning theory approach to the frequency domain identification problem will bring new ideas and lead to more interactions among the areas of system identification, learning theory and frequency analysis.
1 Introduction
Learning from samples in learning theory and constructing models from observed data in system identification are, in some applications, two aspects of the same problem: inferring relationships between observed input and output quantities. An important feature of learning theory techniques [2, 17, 21] is that learning is formulated in a high-dimensional feature space, which is connected to the data space by means of reproducing kernel Hilbert spaces. System identification builds mathematical models of the systems that generate the observed input-output data. It can be seen as an interface between real-world applications and the mathematics of control theory and model abstractions [10]. Recent research has encouraged applying statistical learning theory to system identification, as surveyed by L. Ljung in [9].
System identification is a challenging problem that demands function approximation, regression, model fitting and statistical methods [10]. It was formulated as a problem in statistical learning theory in [22]. In this paper we consider the system identification problem for linear, time-invariant, causal systems in the framework of [10, 22], where a transfer function is a convenient representation of a system. Estimation of the transfer function of a linear system is a classical problem [10]. Nevertheless, it still attracts great attention and new findings continue to be obtained [1, 9, 12, 13]. Mathematically the transfer function is a function of a complex variable, as described below.
A linear, discrete-time, time-invariant, causal system with input signal \(\{x(t)\}_{t\in \mathbb {Z}}\) and output signal \(\{y(t)\}_{t\in \mathbb {Z}}\) is described in terms of a sequence \(\{g_k\}_{k\in \mathbb {Z}}\) supported on \(\mathbb {N}\) by the discrete convolution
If we define a backward shift operator \(q^{-1}\) on \(\ell ^2(\mathbb {Z})\) by \(q^{-1}x(t)=q^{-1}(x)(t)=x(t-1)\), we have
where \(q^{-k}\) denotes the operator \((q^{-1})^k\), and \(G(q)\) the operator \(\sum _{k=1}^\infty g_kq^{-k}\) (if it is well-defined). The corresponding function \(G\) defined possibly on \(\mathbb {C}\setminus \{0\}\) by
is called the transfer function and \(G(q)\) the transfer operator of the system. The transfer function \(G\) (or “system”, or “filter”) is called stable if
Under this stability condition, the operator \(G(q)\) is well-defined with \(\Vert G(q)\Vert \le \Vert g\Vert _{\ell ^1}\), and the Laurent series
is continuous on and analytic outside the unit circle \(\mathbb {S}:=\{z\in {\mathbb {C}}: |z|=1\}\). Moreover \(G\) is in \(H_2\), the Hardy space of functions analytic outside \(\mathbb {S}\) with the norm
finite.
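In standard notation (a sketch consistent with the definitions above, following the conventions of [10]; the indexing matches the operator \(\sum_{k=1}^\infty g_k q^{-k}\)), these objects read:

```latex
% discrete convolution and transfer function
y(t) = \sum_{k=1}^{\infty} g_k\, x(t-k) = G(q)\, x(t), \qquad
G(z) = \sum_{k=1}^{\infty} g_k\, z^{-k},
% stability of the filter
\sum_{k=1}^{\infty} |g_k| < \infty, \qquad
% Hardy space H_2 norm on the unit circle
\|G\|_{H_2}^{2} = \frac{1}{2\pi}\int_{0}^{2\pi}
  \bigl|G(e^{i\theta})\bigr|^{2}\, d\theta .
```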
In the system identification literature one of the most important cases is when the transfer function is rational. For a system with finite McMillan degree, one can always decompose \(G\) via a partial fraction expansion as
Hence, the following atomic set for linear systems
gives building blocks where \(\mathbb {D}:=\{z\in \mathbb {C}:|z|<1\}\). See the discussion in [14] for details.
The problem considered in this paper is frequency domain identification based on a set of frequency domain measurements. We study this problem under the assumption that the transfer function \(G_{\star }\) to be estimated has simple poles of magnitude at most \(\rho \) for some \(0<\rho <1\), meaning that \(G_{\star }\) can be expressed as
where \(w_k\in \mathbb {D}_\rho :=\{z\in {\mathbb {C}}: |z| \le \rho \}\), \(\nu _k\in \mathbb {C}\), and \(\sum _{k=1}^\infty |\nu _k|<\infty \). The frequency domain identification problem aims at approximating \(G_{\star }\) by means of observed and possibly noisy input-output data of the form
We need to determine both the locations \(\{w_k\}\) of the poles of \(G_{\star }\) and the coefficients \(\{\nu _k\}\).
The main purpose of this paper is to estimate the transfer function by a learning theory approach. We view this as a least squares regression problem aiming at learning a target function \(f_{\star }\) defined on a compact metric space \(X\) from a sample of input-output pairs \(\mathbf{z} = \{(x_k, y_k)\}_{k=1}^m\). A common model in learning theory is to assume a Borel probability measure \(\mu \) on \(Z:= X \times Y\) with \(Y \subseteq {\mathbb {R}}\), which yields the conditional means \(f_\mu (x) = E[y|x]\) for \(x\in X\). The link between the target function \(f_{\star }\) and the sample \(\mathbf{z}\) is that \(f_{\star } = f_\mu \), the regression function given by the conditional means of \(\mu \), while \(\mathbf{z}\) is drawn independently from \(\mu \), so that \(y_k \approx f_{\star }(x_k)\).
There have been many learning algorithms for solving the least squares regression problem. A large family is given by kernel methods including support vector machines (for both regression and classification, with a general convex loss function). Such a learning algorithm is based on a kernel \(K: X \times X \rightarrow {\mathbb {R}}\) which is continuous, symmetric and positive semi-definite. The kernel generates a reproducing kernel Hilbert space (RKHS) \(({\mathcal H}_K, \Vert \cdot \Vert _K)\) and the learning algorithm for the least squares regression is given by a regularization scheme
where \(\gamma >0\) is a regularization parameter. The reproducing property of the RKHS together with the Hilbert space norm square as the penalty implies that \(f_{\mathbf{z}, \gamma } = \sum _{k=1}^m c^\mathbf{z}_k K(\cdot , x_k)\) with the coefficient vector (usually having no sparsity) satisfying a linear system of equations. Another family of learning algorithms based on a kernel \(K\) is given by coefficient-based regularization schemes. Here the output function takes the form \(f_{\mathbf{z}, \gamma , \Omega } = \sum _{k=1}^m c^\mathbf{z}_k K(\cdot , x_k)\) with the coefficient vector \(c^\mathbf{z} = (c^\mathbf{z}_k)_{k=1}^m\) given in terms of a penalty functional \(\Omega : {\mathbb {R}}^m \rightarrow {\mathbb {R}}_+\) by a regularization scheme
A typical example of the penalty functional is the \(\ell ^1\)-norm \(\Omega (c) = \sum _{k=1}^m |c_k|\). A major advantage of the coefficient-based regularization schemes is sparsity of the coefficient vector \(c^\mathbf{z}\) in (1.5). This has been observed in Lasso-type algorithms [20] in statistics and in the large literature of compressed sensing. Moreover, the regularization scheme (1.5) can be implemented by the LARS algorithm [3] for solving the more general optimization problem
where \({\mathcal M}\) is an \(m \times p\) matrix with \(p \in {\mathbb {N}}\) and \(y\in {\mathbb {R}}^m\). Besides possible sparsity, the LARS algorithm produces a full piecewise linear solution path of (1.6), which plays an important role in cross-validation or similar methods to tune the parameter \(\gamma \).
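Since (1.6) is an \(\ell^1\)-regularized least squares problem, its sparsity-inducing behavior can be illustrated in a few lines. The sketch below uses a plain proximal-gradient (ISTA) iteration on synthetic data rather than the LARS path algorithm of [3]; the matrix, data and parameter values are hypothetical:

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(M, y, gamma, n_iter=5000):
    """Minimize ||M c - y||_2^2 + gamma * ||c||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(M, 2) ** 2   # Lipschitz constant of the gradient
    c = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * M.T @ (M @ c - y)
        c = soft_threshold(c - grad / L, gamma / L)
    return c

# Hypothetical data: y is generated by two of ten columns plus small noise,
# so the l1 penalty should recover a sparse coefficient vector.
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 10))
c_true = np.zeros(10)
c_true[2], c_true[7] = 1.5, -2.0
y = M @ c_true + 0.01 * rng.standard_normal(50)
c_hat = lasso_ista(M, y, gamma=1.0)
support = np.nonzero(np.abs(c_hat) > 1e-3)[0]   # indices of active atoms
```

A single-\(\gamma\) solve like this already exhibits the sparsity of the minimizer; LARS additionally delivers the full piecewise linear path in \(\gamma\), which is what makes cross-validation cheap.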
In the above learning theory model, \(\{x_k\}_{k=1}^m\) is a random sample drawn from the marginal distribution \(\mu _X\) of \(\mu \) on \(X\). This is a random design framework. In the literature of sampling theory, signal processing, and related fields, \(\{x_k\}_{k=1}^m\) is often taken to be deterministic and well spaced on \(X\). This corresponds to a fixed design framework. Essential differences between the two frameworks include the good numerical conditioning of the involved linear operators in the latter, and the improved approximation or prediction ability of the former caused by randomization, which in turn leads to technical difficulties in the error analysis of learning algorithms [16].
Let us turn back to the problem of frequency domain identification with the input-output data (1.3). In the existing literature, it is often assumed that the measurements of the frequency response take the form
where \(\{e^{i\theta _k}\}_{k=1}^m\) is deterministic and \(\{\eta _k\}_{k=1}^m\) is a noise sequence consisting of independent and identically distributed random variables. In particular, in fixed design Gaussian regression models, \(\{e^{i\theta _1},e^{i\theta _2},\cdots ,e^{i\theta _m}\}\) are deterministic elements of \(\mathbb {S}\) (often equally spaced), and \(\eta _1,\cdots ,\eta _m\) are drawn independently from a Gaussian distribution \(\mathcal {N}(0, \sigma ^2)\) with \(\sigma ^2>0\). The function \(G_{\star }\) in (1.7) is the transfer function to be estimated. This is the fixed design framework studied in [14].
In this paper we take a random design framework and state frequency domain identification as a least squares regression problem in learning theory. We take the input space \(X\) to be the unit circle \({\mathbb S}\) and the output space to be \(Y={\mathbb {C}}\). Let \(\mu \) be a Borel probability measure on \({\mathbb S}\times {\mathbb {C}}\). The conditional distributions \(\mu (\cdot |z)\) with \(z\in \mathbb {S}\) define the regression function \(f_\mu \) as
Our analysis is based on the assumption that the transfer function is the regression function: \(G_{\star } = f_\mu \). A random sample \(\mathbf{{z}} =\{(z_k =e^{i\theta _k}, y_k)\}_{k=1}^m\in (\mathbb {S}\times {\mathbb {C}})^m\) is drawn independently from \(\mu \). This is the data in (1.3) or (1.7) and \(\{e^{i\theta _1},e^{i\theta _2},\cdots ,e^{i\theta _m}\}\) is now random on \(\mathbb {S}\). The randomization makes our study different from that in [14] and improves the approximation ability of the learning algorithm. The arising technical difficulty we overcome in our error analysis is the possibly poor numerical conditioning of the involved matrices.
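In this random design setting, data of the form (1.7) can be simulated as follows. The two-pole target \(G_\star\), the uniform distribution of the angles \(\theta_k\), and the noise level are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target with two simple poles inside D_rho (rho = 0.9),
# built from atoms of the form (1 - |w|^2)/(z - w).
poles = np.array([0.5 + 0.3j, -0.6j])
coefs = np.array([1.0, 0.5 + 0.2j])

def G_star(z):
    """Evaluate G_*(z) = sum_k nu_k (1 - |w_k|^2)/(z - w_k)."""
    return sum(c * (1 - abs(w) ** 2) / (z - w) for c, w in zip(coefs, poles))

m, sigma = 200, 0.05
theta = rng.uniform(0.0, 2.0 * np.pi, size=m)     # random design on S
z = np.exp(1j * theta)
noise = sigma * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
y = G_star(z) + noise                              # noisy frequency samples
```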
Once the sample \(\mathbf{{z}}\) is taken, the learning algorithms in [14] and in this paper (involving atomic norms, to be given below) take the same form; both can be stated as a special case of (1.6) and can be solved efficiently by the LARS algorithm.
The coefficient-based learning algorithm considered here is also essentially different from the standard one in learning theory [15]: the two variables of the kernel here are defined on different domains, and the hypothesis space depends strongly on the nature of the model we consider. Moreover, locating the poles \(\{w_k\}\) of \(G_{\star }\) in (1.2) is part of the learning task (not only estimating the coefficients \(\{\nu _k\}\)); the pole locations do not come automatically from the sample, which makes the problem essentially different.
2 Problem Formulation and Main Result
2.1 Nonsymmetric Kernel and Hypothesis Space
If the randomness of the sampling points \(\{z_k\}_{k=1}^m\) is ignored and the sample is fixed, the following problem formulation is the same as that in [14].
The kernel function \(K\) here is defined on \({\mathbb D} \times {\mathbb S}\) by
It is a continuous function, but is not symmetric. For \(w \in {\mathbb D}\), \(K_w\) denotes the function on \({\mathbb S}\) given by \(K_w (z)=K(w, z) =\frac{1-|w|^2}{z-w}\). Such functions form an atomic set (or a dictionary)
The atomic norm used in the algorithm is defined for expansions of the form \(f=\sum _{w\in \mathbb {D}_\rho } c_w K_{w}\), where the coefficient sequence \(\{c_w\}\) is absolutely summable.
Definition 2.1
For \(0< \rho < 1\), we define a function space \({\mathcal H}_{1,\rho }\) by
with the atomic norm
For computational purposes and for introducing learning algorithms, we need an approximating subspace \({\mathcal H}_{1,\rho }^{(\epsilon )}\) of \({\mathcal H}_{1,\rho }\) which consists of linear combinations of a finite atomic set.
Definition 2.2
Let \(\epsilon >0\) and \({\mathbb D}_\rho ^{(\epsilon )}\) be a finite subset of \({\mathbb D}_\rho \). We define the hypothesis function space \({\mathcal H}_{1,\rho }^{(\epsilon )}\) as a subspace of \({\mathcal H}_{1, \rho }\) by
with the norm
The space \({\mathcal H}_{1,\rho }^{(\epsilon )}\) is the closed subspace of \({\mathcal H}_{1,\rho }\) generated by \(\{K_w:w\in {\mathbb D}_\rho ^{(\epsilon )}\}\). The learning algorithm is defined on the space \({\mathcal H}_{1,\rho }^{(\epsilon )}\) which plays the role of a hypothesis space in learning theory.
2.2 Learning Algorithm
Once the hypothesis space and its atomic norm are introduced, the learning algorithm can now be defined in terms of a sample \(\mathbf{z}\) as in [14].
Definition 2.3
Let \(\mathbf{z} =\{(z_k, y_k)\}_{k=1}^m \in ({\mathbb S} \times {\mathbb {C}})^m\), \(0< \rho < 1\), and \({\mathbb D}_\rho ^{(\epsilon )}\) be a finite subset of \({\mathbb D}_\rho \). The learning algorithm is defined as
where the sequence \(c^\mathbf{z} = (c^\mathbf{z}_w)_{w \in {\mathbb D}_\rho ^{(\epsilon )}}\) is defined by the regularization scheme
Here \(f_c = \sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}} c_w K_{w}\) for a sequence \(c \in {\mathbb {C}}^{{\mathbb D}_\rho ^{(\epsilon )}}\), \(\Vert c\Vert _{\ell ^1}=\sum _{w \in {\mathbb D}_\rho ^{(\epsilon )}}|c_w|\), and \(\gamma >0\) is a regularization parameter.
The regularization scheme (2.3) in the above algorithm is a special case of the optimization problem (1.6) with the matrix \({\mathcal M} = \left( K(w, z_k)\right) _{k\in \{1, \ldots , m\}, w \in {\mathbb D}_\rho ^{(\epsilon )}}\) and regularization parameter \(m \gamma \) (with a slight modification for handling the complex numbers), hence can be computed efficiently by the LARS algorithm, no matter whether the sampling points \(\{z_k\}_{k=1}^m\) are deterministic or random.
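To make the matrix \({\mathcal M}\) concrete: its entries are the atom values \(K(w, z_k)\), and one way to realize the "slight modification for handling the complex numbers" is to stack real and imaginary parts so that a real solver applies. The sketch below assumes the coefficients \(c_w\) are restricted to be real; genuinely complex coefficients would instead require a group-lasso penalty coupling real and imaginary parts. The candidate grid and data are hypothetical, and the final solve is an unregularized least squares sanity check rather than (2.3) itself:

```python
import numpy as np

def atom_matrix(poles, z):
    """M[k, j] = K(w_j, z_k) = (1 - |w_j|^2) / (z_k - w_j), cf. (2.1)."""
    w = np.asarray(poles)[None, :]
    return (1 - np.abs(w) ** 2) / (z[:, None] - w)

def real_stack(M, y):
    """Stack real and imaginary parts so that a real least-squares / l1
    solver applies when the coefficients c are restricted to be real.
    (For complex c, |c_w| couples Re c_w and Im c_w, and a group-lasso
    penalty would be needed instead -- not shown here.)"""
    A = np.vstack([M.real, M.imag])
    b = np.concatenate([y.real, y.imag])
    return A, b

# Hypothetical grid of candidate poles and random samples on the circle.
rng = np.random.default_rng(2)
grid = np.array([0.4 + 0.2j, -0.3 + 0.5j, 0.1 - 0.6j])
z = np.exp(1j * rng.uniform(0, 2 * np.pi, 30))
y = atom_matrix([0.4 + 0.2j], z)[:, 0] * 2.0      # data from one atom, c = 2
A, b = real_stack(atom_matrix(grid, z), y)
c, *_ = np.linalg.lstsq(A, b, rcond=None)         # recovers c = (2, 0, 0)
```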
Remark 2.4
The learning algorithm above is equivalent to the following atomic norm regularization scheme
2.3 Main Result
We state our main result on error analysis for the learning algorithm (2.2) in this subsection and postpone the proofs to the end of the paper. Our result is stated under the assumption that the subset \({\mathbb D}_\rho ^{(\epsilon )}\) of \({\mathbb D}_\rho \) is \(\epsilon \)-dense.
Definition 2.5
A point set \(\{w_1,\cdots , w_n\}\subset \mathbb {D}_\rho \) is said to be \(\epsilon \)-dense in \(\mathbb {D}_\rho \) if for every \(w \in \mathbb {D}_\rho \) there exists some \(1\le k\le n\) such that \(|w -w_k|\le \epsilon \).
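One concrete \(\epsilon\)-dense set is obtained from a square grid of spacing \(\epsilon/\sqrt{2}\), radially projecting onto \(\mathbb{D}_\rho\) the grid points that fall outside the disk; this construction is just one possibility satisfying Definition 2.5, and the empirical density check at the end is for illustration only:

```python
import numpy as np

def eps_dense_grid(rho, eps):
    """Square grid of spacing eps/sqrt(2) over the bounding box of D_rho;
    grid points outside the closed disk are radially projected onto it.
    Every w with |w| <= rho is then within eps of some returned point:
    the nearest plane grid point is within eps/2, and projection moves
    an outside grid point by at most another eps/2."""
    h = eps / np.sqrt(2.0)
    ticks = np.arange(-rho - h, rho + 2 * h, h)
    X, Y = np.meshgrid(ticks, ticks)
    W = (X + 1j * Y).ravel()
    r = np.abs(W)
    W[r > rho] *= rho / r[r > rho]       # radial projection onto D_rho
    return np.unique(np.round(W, 12))    # deduplicate projected points

# Empirical check of eps-density on random points of D_rho.
rng = np.random.default_rng(3)
rho, eps = 0.9, 0.1
grid = eps_dense_grid(rho, eps)
w = rho * np.sqrt(rng.uniform(0, 1, 1000)) * np.exp(1j * rng.uniform(0, 2 * np.pi, 1000))
dist = np.abs(w[:, None] - grid[None, :]).min(axis=1)
```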
Throughout the paper we assume \(|y|\le M\) almost surely. Here we do not assume that the transfer function is strictly proper, as done in [14]. Instead, our more general assumption is that the transfer function \(G_{\star } = f_\mu \) belongs to \({\mathcal H}_{1, \rho }\) for some \(0<\rho <1\).
Theorem 2.6
Suppose \(f_\mu \in {\mathcal H}_{1, \rho }\) for some \(0<\rho <1\). Let \(0<\beta <1\), \(s\in \mathbb {N}\), and \(\gamma =m^{-\theta }\) with \(0< \theta < \frac{1}{2(1+\beta )}\). Then there exists a constant \(C_s>0\) depending on \(s\) such that for \(m\ge C_s^{\frac{2s}{\theta }}\), any \(\epsilon \)-dense subset \({\mathbb D}_\rho ^{(\epsilon )}\) of \({\mathbb D}_\rho \) with \(\epsilon =m^{-\frac{\theta }{2s}}\), and \(0 < \delta < 1\), with confidence \(1-\delta \), we have
where \(C\) is a constant independent of \(m\) or \(\delta \).
Remark 2.7
Since the parameter \(\beta >0\) can be arbitrarily small, the power index \(\theta \) of the learning rate of order \(O(m^{-\theta })\) becomes arbitrarily close to \(1/2\). Such a convergence rate is the best in the literature of coefficient-based regularization schemes with \(\ell ^1\)-penalty [15].
2.4 Further Discussions
A special task of frequency domain identification is to locate the poles \(\{w_k\}\) of the transfer function \(G_{\star }\) in the expression (1.2), not only the coefficients \(\{\nu _k\}\). This may be carried out approximately by the error bound stated in Theorem 2.6, at least when the system has a finite McMillan degree with (1.1) valid.
Our assumption that \(G_{\star } \in {\mathcal H}_{1, \rho }\) for some \(0<\rho <1\) implies that the poles of the transfer function lie in \(\mathbb {D}_\rho \). Hence the finiteness of the \(\ell ^1\) norm of the coefficient sequence \(\{\nu _k\}\) ensures the stability of the system, and is a reasonable condition for the study of system identification.
In this paper we study the problem of identification of time-invariant causal systems by a learning theory approach. When the system is not causal, i.e., the sequence \(\{g_k\}\) is not supported on \({\mathbb {N}}\), the zeros of the transfer function need to be taken into consideration and the atomic set \({\mathcal A}\) should be expanded. This changes the nature of the atomic norm-based regularization scheme and the learning algorithm. When the system is moreover time-variant [8], identification of systems or linear operators [4, 5, 11] is more challenging. It would be interesting to extend our results and methods to these more general settings.
Sparsity of the general optimization problem (1.6) has been extensively investigated in the literature of statistics and learning theory [29]. Conditions stated in terms of structures of the matrix \({\mathcal M}\) and sparsity of the coefficient vector of a target function with respect to feature sets (such as irrepresentable conditions in [29]) are crucial. When the \(\epsilon \)-dense set \({\mathbb D}_\rho ^{(\epsilon )}\) is fixed and when the sampling points \(\{e^{i\theta _k}\}_{k=1}^m\) are deterministic and well spaced (as assumed in [14]), one might use the existing results [29] to derive some sparsity of the solution vector \(c^\mathbf{z}\) in terms of the atomic functions \(\{K_{w}\}_{w \in {\mathbb D}_\rho ^{(\epsilon )}}\). However, how to get sparsity from some sparsity conditions for the transfer function decompositions (1.1) in terms of poles or (1.2) is a complicated issue and deserves further study, especially when the sampling points \(\{e^{i\theta _k}\}_{k=1}^m\) are random and the matrix \({\mathcal M}\) is poorly conditioned (with small probability) in our random design framework.
3 Error Decomposition
Our analysis of the algorithm (2.2) is based on an error decomposition, a technique for the error analysis of data-dependent regularization schemes developed in [15, 23, 27, 28].
Define the generalization error \({\mathcal E}\) on the set of complex-valued measurable functions on \({\mathbb S}\) as
and its empirical version associated with the sample \(\mathbf{z}\) as
Then the learning algorithm (2.2) rewritten in the form (2.4) can be stated as
Our error decomposition technique is expressed by means of a regularizing function defined by
Proposition 3.1
Let \(\gamma >0\) and \(f_{\mathbf{z},\gamma }^{(\epsilon )}\) be given by (2.2). There holds
where \(\mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma )\) is the sample error defined by
and \(\mathcal{D}^{(\epsilon )}(\gamma )\) is the regularization error defined by
Proof
The proof of Proposition 3.1 follows from expressing \(\mathcal{E}(f_{\mathbf{z},\gamma }^{(\epsilon )})-\mathcal{E}(f_\mu ) + \gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A}\) as
and the inequality \(\mathcal{E}_\mathbf{z}(f_{\mathbf{z},\gamma }^{(\epsilon )} ) +\gamma \Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal A} \le \mathcal{E}_\mathbf{z} (f_{\gamma }^{(\epsilon )} ) +\gamma \Vert f_{\gamma }^{(\epsilon )} \Vert _{\mathcal A}\) induced by the equivalent definition (3.3) of the function \(f_{\mathbf{z},\gamma }^{(\epsilon )}\). \(\square \)
The regularization error will be estimated in the next section by means of a local polynomial reproduction formula, while the sample error will be bounded in Sect. 5, thanks to concentration inequalities and an iteration technique.
4 Estimating Regularization Error
The nonsymmetric kernel \(K\) of Algorithm (2.2) is the most distinctive feature in our error analysis. Estimating the induced regularization error is the technical difficulty we need to overcome, compared to the analysis for real-valued kernels in [23, 28]. To this end, we need some properties of the fundamental functions \(K_w\) induced by the kernel \(K\).
Lemma 4.1
Let \(0<\rho <1\), \(K\) be given by (2.1) and \(K_w\) by \(K_w (z) =K(w, z)\) for \(w\in \mathbb {D}\) and \(z\in \mathbb {S}\). Then the following properties hold.
-
(i)
\(\Vert K_w\Vert _\infty \le 2\) for any \(w\in \mathbb {D}\).
-
(ii)
\(|K_w(z_1)-K_w(z_2)|\le \frac{1+\rho }{1-\rho }|z_1-z_2|\) for any \(w\in \mathbb {D}_\rho \) and \(z_1,z_2\in \mathbb {S}\).
-
(iii)
\(|K_{w_1}(z)-K_{w_2}(z)|\le \left( \frac{1+\rho }{1-\rho }\right) ^2|w_1-w_2|\) for any \(w_1,w_2\in \mathbb {D}_\rho \) and \(z\in \mathbb {S}\).
-
(iv)
\(\mathcal {H}_{1,\rho }\) can be regarded as a subset of \(C(\mathbb {S})\) with the inclusion map \(I : \mathcal {H}_{1,\rho } \rightarrow C(\mathbb {S})\) bounded as
$$\begin{aligned} \Vert f\Vert _\infty \le 2 \Vert f\Vert _{\mathcal {A}}. \end{aligned}$$(4.1)
Proof
Part (i) follows from the inequalities \(|z-w|\ge |z|-|w|=1-|w|\) and
Part (ii) is seen since
In a similar way, for part (iii), we have for \(w_1,w_2\in \mathbb {D}_\rho \) and \(z\in \mathbb {S}\),
Part (iv) is a direct consequence of Part (i) and the definition of the norm \(\Vert f\Vert _{\mathcal {A}}\). The proof of Lemma 4.1 is complete. \(\square \)
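The bounds (i) and (ii) of Lemma 4.1 are easy to check numerically. The sketch below samples arbitrary \(w\in\mathbb{D}_\rho\) and \(z\in\mathbb{S}\); the value \(\rho = 0.8\) and the sample sizes are arbitrary choices:

```python
import numpy as np

def K(w, z):
    """The nonsymmetric kernel K(w, z) = (1 - |w|^2)/(z - w) of (2.1)."""
    return (1 - np.abs(w) ** 2) / (z - w)

rng = np.random.default_rng(4)
rho, n = 0.8, 500
# Random w in D_rho (uniform in area) and z on the unit circle S.
w = rho * np.sqrt(rng.uniform(0, 1, n)) * np.exp(1j * rng.uniform(0, 2 * np.pi, n))
z1 = np.exp(1j * rng.uniform(0, 2 * np.pi, n))
z2 = np.exp(1j * rng.uniform(0, 2 * np.pi, n))

# (i): |K_w(z)| <= (1 - |w|^2)/(1 - |w|) = 1 + |w| <= 2 on the circle.
vals = np.abs(K(w[:, None], z1[None, :]))
assert vals.max() <= 2.0

# (ii): the Lipschitz constant of K_w on S is at most (1 + rho)/(1 - rho).
lip = np.abs(K(w, z1) - K(w, z2)) / np.abs(z1 - z2)
assert lip.max() <= (1 + rho) / (1 - rho) + 1e-9
```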
In this section we estimate the regularization error under the assumption that the regression function \(f_\mu \) lies in the space \({\mathcal H}_{1, \rho }\).
Our estimation is carried out by using the high-order smoothness of the kernel \(K_w\). To this end, we need a local polynomial reproduction formula from the literature of multivariate approximation [7, 24, 25]. Note that \(\epsilon \)-dense sets can be stated in an equivalent way as subsets of \(\mathbb {R}^2\).
In the following we also denote by \(|u|\) the Euclidean \(\ell ^2\) norm for \(u=(x,y)\in X\subset \mathbb {R}^2\).
Definition 4.2
Let \(X =\{(x, y) \in {\mathbb {R}}^2: |x|^2 + |y|^2 \le \rho ^2\}\). A point set \(\{(x_1, y_1),\cdots , (x_n, y_n)\}\) in \(X\) is said to be \(\Delta \)-dense if for every \((x, y) \in X\) there exists some \(1\le k\le n\) such that \(|(x, y) -(x_k, y_k)| \le \Delta \).
Denote by \(\mathcal{P}^2_s\) the space of polynomials on \({\mathbb {R}}^2\) of degree at most \(s\) (here the coefficients are complex). The following lemma is a formula for local polynomial reproduction (see Theorem 3.10 in [24] which is also valid for polynomials with complex coefficients).
Lemma 4.3
There exists a constant \(C_s\) depending on \(s\in {\mathbb {N}}\) such that for any \(\Delta \)-dense point set \(\{(x_1, y_1),\cdots , (x_n, y_n)\}\) in \(X\) with \(\Delta \le \frac{1}{C_s}\) and every \(u\in X\), we can find real numbers \(b_j(u)\), \(1\le j\le n\), satisfying
-
(i)
\(\sum _{j=1}^{n}b_j(u)p(x_j, y_j)=p(u) \qquad \forall \, p\in \mathcal{P}_{s}^{2}\),
-
(ii)
\(\sum _{j=1}^{n}|b_j(u)|\le 2\),
-
(iii)
\(b_j(u)=0\) provided that \(|u-(x_j, y_j)|>C_{s}\Delta \).
Now we can estimate the regularization error \(\mathcal{D}^{(\epsilon )}(\gamma )\).
Proposition 4.4
Assume \(f_\mu \in {\mathcal H}_{1, \rho }\). Let \(s\in \mathbb {N}\). If \({\mathbb D}_\rho ^{(\epsilon )} =\{z_k =x_k + i y_k\}_{k=1}^n\) is a finite subset of \({\mathbb D}_\rho \) such that \(\{(x_1, y_1),\cdots , (x_n, y_n)\}\) is \(\epsilon \)-dense with \(\epsilon \le \frac{1}{C_s}\), then there exists a constant \(C_{s, \rho }\) depending on \(s\) and \(\rho \) such that
Proof
The condition \(f_\mu \in {\mathcal H}_{1, \rho }\) tells us that for any \(\iota >0\), \(f_\mu \) can be written as
with \(w_k \in {\mathbb D}_\rho \), \(\nu _k \in {\mathbb {C}}\), and
Then there exists some \(N_{0}\in {\mathbb {N}}\) such that \(\sum _{k=N_{0}+1}^{\infty }|\nu _k| < \iota \). It follows from \(|z-w_k| \ge \big ||z| - |w_k|\big |\) that for each \(z\in {\mathbb S}\), there holds
Let \(z\in {\mathbb S}\) and \(k\in \{1, \ldots , N_0\}\). Write \(w_k\) as \(w_k =v_k + i t_k\) and \(u_k=(v_k, t_k)\). Note that \(w_k\in \mathbb {C}\) and \(u_k\in X\subset \mathbb {R}^2\) but \(|w_k|=|u_k|\). Let \(u=(v,t)\in X\). Take \(\Delta = \epsilon \) and \(\{b_j(u)\}_{j=1}^n\) as in Lemma 4.3. By Lemma 4.3, we know for every polynomial \(p\in \mathcal{{P}}_{s}^{2}\) that
where \(I(u)=\{j\in \{1,\cdots , n\}: |(x_{j}, y_j)-u|\le C_{s}\epsilon \}\). Now by taking \(p_k\) to be the Taylor polynomial of the complex-valued function \(g(u)=\frac{1-|u|^2}{z-(v+it)}\) (of the variable \(u=(v,t)\in X\subset \mathbb {R}^2\)) at \(u_k\) of degree less than \(s\), we find
and
It follows that
The above bound holds for every \(z\in {\mathbb S}\) and \(k\in \{1, \ldots , N_0\}\). This in connection with (4.2) yields
Since \(x_j + i y_j =z_j \in {\mathbb {D}}_\rho ^{(\epsilon )}\), we know that \(f= \sum _{k=1}^{N_0} \nu _k \sum _{j\in I(u_k)}b_{j}(u_k)K_{x_j + i y_j}\) lies in \({\mathcal H}_{1,\rho }^{(\epsilon )}\) and
The above bounds and (4.3) yield
It can be easily seen that for any complex-valued measurable function \(f\) on \({\mathbb S}\), there holds
Since \(\mathcal{D}^{(\epsilon )}(\gamma )\) is independent of \(\iota \), we have
The function \(\frac{1-|u|^2}{z-(v+it)}\) for \(z\in {\mathbb S}\) and \(u\in {\mathbb D}_\rho \) is \(C^\infty \). Since \(X =\{(x, y) \in {\mathbb {R}}^2: |x|^2 + |y|^2 \le \rho ^2\}\) is compact, the number \(C'_{s,\rho }:=\sup _{z\in {\mathbb S}} \left\| \frac{1-|u|^2}{z-(v+it)}\right\| _{C^s(X)}^2\) depending on \(s\) and \(\rho \) is finite for any \(s\in {\mathbb {N}}\). Then the desired statement is verified with \(C_{s,\rho }=8 C_s^{2s} C'_{s, \rho }+2\). \(\square \)
5 Estimating Sample Error in the Random Design
In this section we estimate the sample error in the random design setting. Recall the sample error \( \mathcal{S}^{(\epsilon )}(\mathbf{z}, \gamma )\) which can be expressed as
where
To estimate the quantity \(\mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma )\), we consider the random variable \(\xi \) on \((\mathbb {S}\times \mathbb {C}, \mu )\) given by
and satisfying
Since \(|y|\le M\) almost surely, we have \(\Vert f_\mu \Vert _\infty \le M\). From the definition of \(f_{\gamma }^{(\epsilon )}\) given by (3.4) we deduce that
Hence \(\xi \) is bounded by
and its variance is bounded by \(M_{\epsilon ,\gamma } \mathcal{D}^{(\epsilon )}(\gamma )\). Applying the one-sided Bernstein inequality [26] yields the following bound for \(\mathcal{S}_1^{(\epsilon )}(\mathbf{z}, \gamma )\).
Lemma 5.1
Let \(0 <\gamma \le 1\). For any \(0 < \delta < 1\), with confidence \(1-\frac{\delta }{4}\), it holds that
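The one-sided Bernstein inequality invoked here takes, in a standard form (for i.i.d. real random variables \(\xi_i\) with mean \(\mu_\xi\), variance \(\sigma^2\), and \(|\xi_i - \mu_\xi| \le B\) almost surely), the shape:

```latex
\mathrm{Prob}\left\{ \frac{1}{m}\sum_{i=1}^{m}\xi_i - \mu_\xi \ge t \right\}
\le \exp\left( -\frac{m t^{2}}{2\bigl(\sigma^{2} + \tfrac{1}{3} B t\bigr)} \right),
\qquad t > 0 .
```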
It is more difficult to bound \(S_2^{(\epsilon )}(\mathbf{z},\gamma )\) because it involves the sample \(\mathbf {z}\) through \(f_{\mathbf{z},\gamma }^{(\epsilon )}\). We use a probability inequality that handles a class of functions in \({\mathcal {H}}_{1,\rho }^ {(\epsilon )}\). Such an inequality uses covering numbers to describe the complexity of \({\mathcal {H}}_{1,\rho }^{(\epsilon )}\). We first bound these covering numbers.
Definition 5.2
The covering number \(\mathcal {N}(X, r)\) of the metric space \(X\) is the minimal \(l \in \mathbb {N}\) such that there exist \(l\) balls in \(X\) with radius \(r\) covering \(X\).
In the function space \({\mathcal {H}}_{1,\rho }\), denote \(\mathcal {B}_R = \{f\in {\mathcal H}_{1,\rho } :\Vert f\Vert _{\mathcal {A}}\le R\}\). Recall the inclusion map \(I: \mathcal {H}_{1,\rho } \rightarrow C(\mathbb {S})\) defined in Lemma 4.1 (iv). We need the covering number \(\mathcal {N}(I(\mathcal {B}_1), \epsilon )\) of \(I(\mathcal {B}_1)\) as a subset of \(C(\mathbb {S})\).
For every \(\varepsilon > 0\) and \(R \ge M\), the following inequality as a uniform law of large numbers for a class of functions can be easily seen as in [2, Proposition 8.15]
With this inequality, as in [2], we have the following bound for \(S_2^{(\epsilon )}(\mathbf{z}, \gamma )\).
For \(R>0\), we define the subset \(\mathcal {W}(R)\) of \((\mathbb {S}\times \mathbb {C})^m\) as
Lemma 5.3
If \(0 < \gamma \le 1\) and \(R\ge 1\), then there exists a subset \(V_R\) of \((\mathbb {S}\times \mathbb {C})^m\) of measure at most \(\delta /4\) such that
where \(\varepsilon _{R,\delta , m}\) is the smallest positive number \(\varepsilon \) satisfying
To derive concrete error bounds, we need to bound the covering number \(\mathcal {N}(I(\mathcal {B}_1), \varepsilon )\). We carry out such a bound by applying methods from [2, 30, 31].
Lemma 5.4
For any \(0< \beta <1\), there exists some constant \(C_{\rho ,\beta }>0\) such that
Proof
Since \(0<\rho <1\), for any \(\ell \in \mathbb {N}\), the functions \(K_w\), \(w\in \mathbb {D}_\rho \), are uniformly bounded in \(C^\ell (\mathbb {S})\), the space of complex-valued functions having continuous partial derivatives up to order \(\ell \). We can prove that
Indeed, by the Cauchy integral formula, we have
Since \(w\in \mathbb {D}_\rho \) we know that \(|w-\xi |\ge (1-\rho )/2\). It follows that
Hence for any \(f =\sum _{k=1}^\infty \nu _k K_{w_k}\in \mathcal {H}_{1,\rho }\) we have
which gives \(\Vert f\Vert _{C^\ell ( \mathbb {S})}\le C_{\ell ,\rho }\Vert f\Vert _{\mathcal {A}}\). Hence for any \(f =\sum _{k=1}^\infty \nu _k K_{w_k}\in \mathcal {B}_1\) with \(\sum _{k=1}^\infty |\nu _k|\le 1\) and \(w_k\in \mathbb {D}_\rho \) we have
Thus, \(\mathcal {B}_1\) can be embedded into the ball of \(C^\ell ( \mathbb {S})\) with radius \(C_{\ell ,\rho }\) whose covering number can be bounded by
where the positive constant \(C_\ell \) depends only on \(\ell \). See [2, Chapter 5] and references therein. Then our conclusion follows by taking \(\ell \ge 2/\beta \). \(\square \)
Combining Proposition 4.4, Lemmas 5.1 and 5.3 with the bounds in Lemma 5.4 and Proposition 3.1, we deduce the following conclusion.
Proposition 5.5
Let \(0<\gamma \le 1, R\ge 1, s\in \mathbb {N}, 0<\beta <1\) and \(\epsilon =\gamma ^{\frac{1}{2s}}\). If the conditions of Theorem 2.6 are satisfied and \(0<\delta <1\), then there is a subset \(V_R\) of \((\mathbb {S}\times \mathbb {C})^m\) with measure at most \(\delta \) such that
where \(C'=25 \max \{108,(108 C_{\rho ,\beta })^{1/(1+\beta )}\}\), \(C_1=2C_{s,\rho } \Vert f_\mu \Vert _{\mathcal A} (\Vert f_\mu \Vert _{\mathcal A}+1)\) and \(C_2=30 C_1^2 + 66 M^2\).
A bound for the generalization error is not tight enough when we only combine the above estimates. To get better error estimates, we shall apply an iteration technique [6, 15, 18, 19, 26] to improve the rough bound for \(f_{\mathbf{z},\gamma }^{(\epsilon )}\). Since \(f_{\mathbf{z},\gamma }^{(\epsilon )}\) is a good approximation of \(f_{\gamma }^{(\epsilon )}\), one would expect a bound for \(\Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\) tighter than \(M^2/\gamma \).
Lemma 5.6
Under the assumption of Proposition 5.5, let \(0<\delta <1\) and \(\epsilon ^{2s}=\gamma =m^{-\theta }\) with \(0 < \theta < \frac{1}{2(1+\beta )}\) and \(m\ge C_s^{\frac{2s}{\theta }}\). Then with confidence \(1-\delta \), there holds
where \(\tilde{C}\) is a constant independent of \(\delta \) or \(m\), and
Proof
Based on Proposition 5.5, we know that
where
It follows that
Consider a sequence \(\{R^{(j)}\}_{j\in \mathbb {N}}\), where \(R^{(0)}=\frac{M^2}{\gamma }\) and \(R^{(j)}=\max \{{a}_{m,\gamma }(R^{(j-1)})^2, {b}_{m,\gamma }\}\). Since \(|y|\le M\) almost surely, by taking \(f=0\) in (2.4), we see \(\Vert f_{\mathbf{z},\gamma }^{(\epsilon )}\Vert _{\mathcal {A}}\le \frac{M^2}{\gamma }\). So for \(J\in \mathbb {N}\),
Hence the measure of \(\mathcal {W}(R^{(J)})\) is at least \(1-J\delta \). From the definition of the sequence \(\{R^{(j)}\}\), we see that \(R^{(J)}\) is
where \(A_\delta =4C' (\log (4/\delta )+1)\) and \(B_\delta =2C_1+2C_2\log (4/\delta )\).
We let \(\gamma =m^{-\theta }\) with \(0<\theta < \frac{1}{2(1+\beta )}\) and we determine \(J\) to be the smallest positive integer satisfying
With this choice of \(J\), we have with confidence at least \(1-J\delta \),
The desired result follows by scaling \(J\delta \) to \(\delta \). \(\square \)
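The bound refinement in the proof above can be sketched numerically. The sketch below implements the stated recursion \(R^{(0)}=M^2/\gamma \), \(R^{(j)}=\max \{a_{m,\gamma }(R^{(j-1)})^2, b_{m,\gamma }\}\) and shows how a crude initial bound contracts to the fixed point \(b_{m,\gamma }\) after a few steps; the numerical values of \(a\), \(b\), and \(R^{(0)}\) are illustrative placeholders, not the paper's constants \(A_\delta \), \(B_\delta \).

```python
def iterate_bound(R0, a, b, max_iter=50, tol=1e-12):
    """Refine a crude norm bound R0 via the recursion
    R_j = max(a * R_{j-1}**2, b), as in the proof of Lemma 5.6.
    Returns the sequence of successive bounds."""
    seq = [R0]
    for _ in range(max_iter):
        R_next = max(a * seq[-1] ** 2, b)
        seq.append(R_next)
        if abs(R_next - seq[-2]) < tol:  # reached the fixed point
            break
    return seq

# Illustrative values: a plays the role of a_{m,gamma}, b of b_{m,gamma},
# and R0 of the rough bound M^2/gamma.  Placeholders, not derived constants.
seq = iterate_bound(R0=100.0, a=1e-4, b=2.0)
print(seq)  # → [100.0, 2.0, 2.0]
```

For the contraction to occur one needs \(a\,R^{(0)} < 1\) (here \(10^{-4}\cdot 100 = 10^{-2}\)), which mirrors the smallness condition on \(m\) and \(\gamma \) imposed in Lemma 5.6.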
6 Proof of Main Result
We are now in a position to prove our main result.
Proof of Theorem 2.6
Let \(R\) be the right side of (5.4). From Lemma 5.6, we see that \(\mathcal {W}(R)\) has measure at least \(1-\delta \). Based on Proposition 5.5, there is a subset \(V_R\) of \((\mathbb {S}\times \mathbb {C})^m\) with measure at most \(\delta \) such that for all \(\mathbf{z}\in \mathcal {W}(R)\backslash V_R\),
The desired result follows by scaling \(2\delta \) to \(\delta \). \(\square \)
References
Chen, T., Ohlsson, H., Ljung, L.: On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica 48, 1525–1535 (2012)
Cucker, F., Zhou, D.X.: Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge (2007)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32, 407–499 (2004)
Gröchenig, K.: Foundations of Time-Frequency Analysis. Birkhäuser, Boston (2001)
Heckel, R., Bölcskei, H.: Identification of sparse linear operators. IEEE Trans. Inform. Theory 59, 7985–8000 (2013)
Hu, T., Fan, J., Wu, Q., Zhou, D.X.: Regularization schemes for minimum error entropy principle. Anal. Appl. (2014). doi:10.1142/S0219530514500110
Jetter, K., Stöckler, J., Ward, J.D.: Error estimates for scattered data interpolation on spheres. Math. Comput. 68, 733–747 (1999)
Kailath, T.: Measurements on time-variant communication channels. IEEE Trans. Inform. Theory 8, 229–236 (1962)
Ljung, L.: Perspectives on system identification. Ann. Rev. Control 34, 1–12 (2010)
Ljung, L.: System Identification: Theory for the User, 2nd edn. Prentice-Hall, Englewood Cliffs (1999)
Pfander, G.E., Walnut, D.F.: Measurement of time-variant linear channels. IEEE Trans. Inform. Theory 52, 4808–4820 (2006)
Pillonetto, G., Minh, H.Q., Chiuso, A.: A new kernel-based approach for nonlinear system identification. IEEE Trans. Autom. Control 56, 2825–2840 (2011)
Pillonetto, G., Nicolao, G.D.: A new kernel-based approach for linear system identification. Automatica 46, 81–93 (2010)
Shah, P., Bhaskar, B.N., Tang, G., Recht, B.: Linear system identification via atomic norm regularization. In: Proceedings of the 51st IEEE Conference on Decision and Control, Maui, Hawaii, USA, pp. 6265–6270 (2012)
Shi, L., Feng, Y.L., Zhou, D.X.: Concentration estimates for learning with \(\ell ^1\)-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal. 31, 286–302 (2011)
Smale, S., Zhou, D.X.: Shannon sampling and function reconstruction from point values. Bull. Am. Math. Soc. 41, 279–305 (2004)
Smale, S., Zhou, D.X.: Learning theory estimates via integral operators and their approximations. Constr. Approx. 26, 153–172 (2007)
Steinwart, I., Scovel, C.: Fast rates for support vector machines using Gaussian kernels. Ann. Stat. 35, 575–607 (2007)
Sun, H.W., Wu, Q.: Indefinite kernel networks with dependent sampling. Anal. Appl. 11, 1350020 (2013)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
Vidyasagar, M., Karandikar, R.L.: A learning theory approach to system identification and stochastic adaptive control. J. Process Control 18, 421–430 (2008)
Wang, H.Y., Xiao, Q.W., Zhou, D.X.: An approximation theory approach to learning with \(\ell ^1\) regularization. J. Approx. Theory 167, 240–258 (2013)
Wendland, H.: Local polynomial reproduction and moving least squares approximation. IMA J. Numer. Anal. 21, 285–300 (2001)
Wendland, H.: Scattered Data Approximation. Cambridge Monographs on Applied and Computational Mathematics, vol. 17. Cambridge University Press, Cambridge (2005)
Wu, Q., Ying, Y., Zhou, D.X.: Learning rates of least-square regularized regression. Found. Comput. Math. 6, 171–192 (2006)
Wu, Q., Zhou, D.X.: Learning with sample dependent hypothesis spaces. Comput. Math. Appl. 56, 2896–2907 (2008)
Xiao, Q.W., Zhou, D.X.: Learning by nonsymmetric kernels with data dependent spaces and \(\ell ^1\)-regularizer. Taiwan. J. Math. 14, 1821–1836 (2010)
Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
Zhou, D.X.: Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory 49, 1743–1752 (2003)
Zhou, D.X.: Derivative reproducing properties for kernel methods in learning theory. J. Comput. Appl. Math. 220, 456–463 (2008)
Acknowledgments
The work described in this paper is supported partially by the Research Grants Council of Hong Kong [Project No. CityU 105011] and by National Natural Science Foundation of China under Grants 11371007 and 11461161006.
Communicated by Gitta Kutyniok.
Li, L., Zhou, DX. Learning Theory Approach to a System Identification Problem Involving Atomic Norm. J Fourier Anal Appl 21, 734–753 (2015). https://doi.org/10.1007/s00041-015-9389-y
Keywords
- Learning theory
- System identification
- Transfer function estimation
- Frequency domain identification
- Atomic norm regularization