1 Introduction

Today, we witness a worldwide triumphant march of deep neural networks, impacting not only various application fields, but also areas of mathematics such as inverse problems. Originally, neural networks were developed by McCulloch and Pitts [48] in 1943 to provide a theoretical framework for artificial intelligence. At that time, however, the limited amount of available data and the lack of sufficient computational power only allowed the training of shallow networks, that is, networks with only a few layers of neurons, which did not lead to the anticipated results. The current age of big data and significantly increased computer performance now make the application of deep learning algorithms feasible, leading to the successful training of very deep neural networks. For this reason, neural networks have seen an impressive comeback. Important applications in public life range from speech recognition systems on cell phones and self-driving cars to automatic diagnosis in healthcare. In science, deep learning methods have had a similarly strong impact in research areas such as quantum chemistry [61] and molecular dynamics [47], often making it possible to resolve problems that were previously deemed unreachable. This phenomenon manifests itself similarly in certain fields of mathematics, foremost in inverse problems [2, 10], but lately also, for instance, in the numerical analysis of partial differential equations [8].

Yet, most of the existing research related to deep learning is empirically driven and a profound and comprehensive mathematical foundation is still missing, in particular for the previously mentioned applications. This poses a significant challenge not only for mathematics itself, but in general for the “safe” applicability of deep neural networks [22].

A deep neural network in mathematical terms is a tuple

$$\begin{aligned} \Phi = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big ) \end{aligned}$$
(1.1)

consisting of affine-linear maps \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) (hence \(T_\ell (x) = A_\ell \, x + b_\ell \) for appropriate matrices \(A_\ell \) and vectors \(b_\ell \), often with a convolutional or Toeplitz structure) and of nonlinearities \(\alpha _{\ell }: {\mathbb {R}}^{N_{\ell }} \rightarrow {\mathbb {R}}^{N_{\ell }}\) that typically encompass componentwise rectification, possibly followed by a pooling operation.

The tuple in (1.1) encodes the architectural components of the neural network, where L denotes the number of layers of the network, while \(L-1\) is the number of hidden layers. The highly structured function \({\mathtt {R}}(\Phi )\) implemented by such a network \(\Phi \) is then defined by applying the different maps in an iterative (layer-wise) manner; precisely,

$$\begin{aligned}&{\mathtt {R}}(\Phi ) : {\mathbb {R}}^{N_0} \rightarrow {\mathbb {R}}^{N_L}, \quad \text {with} \quad {\mathtt {R}}(\Phi ) := \alpha _L \circ T_L \circ \cdots \circ \alpha _1 \circ T_1 . \end{aligned}$$

We call this function the realization of the deep neural network \(\Phi \). It is worth pointing out that most of the literature calls this function itself the neural network; one can, however—depending on the choice of the activation functions—imagine the same function being realized by different architectural components, so that it would not make sense, for instance, to speak of the number of layers of \({\mathtt {R}}(\Phi )\); this is only well defined when we talk about \(\Phi \) itself. The complexity of a neural network can be captured by various numbers such as the depth L, the number of hidden neurons \(N(\Phi ) = \sum _{\ell =1}^{L-1} N_{\ell }\), or the number of connections (also called the connectivity, or the number of weights) given by \(W(\Phi ) = \sum _{\ell = 1}^{L} \Vert A_\ell \Vert _{\ell ^0}\), where \(\Vert A_\ell \Vert _{\ell ^0}\) denotes the number of nonzero entries of the matrix \(A_\ell \).
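To make these objects concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the names `realize`, `depth`, `num_hidden_neurons`, and `num_connections` are ours) that evaluates the realization of a network given as a list of layers \((A_\ell , b_\ell , \alpha _\ell )\) and computes the complexity measures \(L\), \(N(\Phi )\), and \(W(\Phi )\) described above.

```python
import numpy as np

# A network Phi is represented as a list of layers (A_l, b_l, alpha_l),
# where T_l(x) = A_l @ x + b_l and alpha_l is the componentwise nonlinearity
# (the identity in the last layer).

def realize(phi, x):
    """Evaluate R(Phi)(x) = alpha_L . T_L . ... . alpha_1 . T_1 (x)."""
    for A, b, alpha in phi:
        x = alpha(A @ x + b)
    return x

def depth(phi):                  # L: number of affine layers
    return len(phi)

def num_hidden_neurons(phi):     # N(Phi) = N_1 + ... + N_{L-1}
    return sum(A.shape[0] for A, _, _ in phi[:-1])

def num_connections(phi):        # W(Phi) = number of nonzero entries of the A_l
    return sum(np.count_nonzero(A) for A, _, _ in phi)

# Example: ReLU network with d = 2 inputs, one hidden layer of 3 neurons, k = 1 output.
relu = lambda z: np.maximum(z, 0.0)
identity = lambda z: z
phi = [
    (np.array([[1.0, -1.0], [0.0, 2.0], [1.0, 0.0]]), np.zeros(3), relu),
    (np.array([[1.0, 1.0, -1.0]]), np.zeros(1), identity),
]
print(realize(phi, np.array([0.5, -1.0])))                          # [1.]
print(depth(phi), num_hidden_neurons(phi), num_connections(phi))    # 2 3 7
```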

From a mathematical perspective, the central task of a deep neural network is to approximate a function \(f : {\mathbb {R}}^{N_0} \rightarrow {\mathbb {R}}^{N_L}\), which, for instance, encodes a classification problem. Given a training data set \(\big ( x_i,f(x_i) \big )_{i=1}^m\), a loss function \({\mathcal {L}} : {\mathbb {R}}^{N_L} \times {\mathbb {R}}^{N_L} \rightarrow {\mathbb {R}}\), and a regularizer \({\mathcal {P}}\), which imposes, for instance, sparsity conditions on the weights of the neural network \(\Phi \), solving the optimization problem

$$\begin{aligned} \min _{\Phi } \sum _{i=1}^m {\mathcal {L}} \big ( {\mathtt {R}}(\Phi )(x_i),f(x_i) \big ) + \lambda {\mathcal {P}}(\Phi ), \end{aligned}$$
(1.2)

typically through a variant of stochastic gradient descent, yields a learned neural network \({\widehat{\Phi }}\). The objective is to achieve \({\mathtt {R}}({\widehat{\Phi }}) \approx f\), which is only possible if the function f can indeed be well approximated by (the realization of) a network with the prescribed architecture. Various theoretical results have already been published to establish the ability of neural networks—often with specific architectural constraints—to approximate functions from certain function classes; this is referred to as analyzing the expressivity of neural networks. However, the fundamental question of which function spaces are truly natural for deep neural networks has never been comprehensively addressed. Such an approach may open the door to a novel viewpoint and lead to a refined understanding of the expressive power of deep neural networks.
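As a small illustration of the objective in (1.2) (a sketch under our own assumptions: a squared loss and an \(\ell ^1\) penalty on the network weights; the function names are hypothetical), the quantity being minimized can be written as follows.

```python
import numpy as np

def regularized_empirical_risk(realization, weight_matrices, data, lam):
    """Objective of (1.2) with squared loss and an l1 sparsity penalty.

    realization:      callable x -> R(Phi)(x) for the current parameters
    weight_matrices:  list of the matrices A_l of the affine maps T_l
    data:             iterable of training pairs (x_i, f(x_i))
    lam:              regularization parameter lambda
    """
    loss = sum(np.sum((realization(x) - y) ** 2) for x, y in data)
    penalty = sum(np.sum(np.abs(A)) for A in weight_matrices)
    return loss + lam * penalty
```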

In this paper, we introduce approximation spaces associated with neural networks. This leads to an extensive theoretical framework for studying the expressivity of deep neural networks, allowing us also to address questions such as the impact of the depth and of the activation function, or of so-called (and widely used) skip connections on the approximation power of deep neural networks.

1.1 Expressivity of Deep Neural Networks

The first theoretical results concerning the expressivity of neural networks date back to the early 90s, at that time focusing on shallow networks, mainly in the context of the universal approximation theorem [16, 35, 36, 43]. The breakthrough result of the ImageNet competition in 2012 [38] and the ensuing worldwide success story of neural networks have brought renewed interest to the study of neural networks, now with an emphasis on deep networks. The surprising effectiveness of such networks in applications has motivated the study of the effect of depth on the expressivity of these networks. Questions related to the learning phase are of a different nature, focusing on aspects of statistical learning and optimization, and hence constitute a different research field.

Let us recall some of the key contributions in the area of expressivity, in order to put our results into perspective. The universal approximation theorems by Hornik [35] and Cybenko [16] can be counted as a first highlight, stating that neural networks with only one hidden layer can approximate continuous functions on compact sets arbitrarily well. Examples of further work in this early stage, hence focusing on networks with a single hidden layer, are approximation error bounds in terms of the number of neurons for functions with bounded first Fourier moments [5, 6], the failure of those networks to provide localized approximations [13], a fundamental lower bound on approximation rates [12, 18], and the approximation of smooth/analytic functions [50, 52]. Some of the early contributions already study networks with multiple hidden layers, such as [29] for approximating continuous functions, and [53] for approximating functions together with their derivatives. Also [13], which shows that in certain instances deep networks can perform better than single-hidden-layer networks, can be counted toward this line of research. For a survey of those early results, we refer to [24, 57].

More recent work focuses predominantly on the analysis of the effect of depth. Some examples—again without any claim of completeness—are [23], in which a function is constructed which cannot be expressed by a small two-layer network, but which is implemented by a three-layer network of low complexity, or [51] which considers so-called compositional functions, showing that such functions can be approximated by neural networks without suffering from the curse of dimensionality. A still different viewpoint is taken in [14, 15], which focus on a similar problem as [51] but attack it by utilizing results on tensor decompositions. Another line of research aims to study the approximation rate when approximating certain function classes by neural networks with growing complexity [9, 49, 55, 62, 68].

1.2 The Classical Notion of Approximation Spaces

In classical approximation theory, the notion of approximation spaces refers to (quasi)-normed spaces whose elements are characterized by a prescribed decay of a certain approximation error; see, for instance, [21]. In this introduction, we merely sketch the key construction and properties; we refer to Sect. 3 for more details.

Let X be a quasi-Banach space equipped with the quasi-norm \(\Vert \cdot \Vert _{X}\). Furthermore, here, as in the rest of the paper, let us denote by \({\mathbb {N}}= \{1,2,\ldots \}\) the set of natural numbers and write \({\mathbb {N}}_{0} = \{0\} \cup {\mathbb {N}}\), \({\mathbb {N}}_{\ge m} = \{n \in {\mathbb {N}}, n \ge m\}\). For a prescribed family \(\Sigma = (\Sigma _{n})_{n \in {\mathbb {N}}_{0}}\) of subsets \(\Sigma _{n} \subset X\), one aims to classify functions \(f \in X\) by the decay (as \(n \rightarrow \infty \)) of the error of best approximation by elements from \(\Sigma _{n}\), given by \(E(f,\Sigma _{n})_X :=\inf _{g \in \Sigma _{n}} \Vert f - g \Vert _{X}\). The desired rate of decay of this error is prescribed by a discrete weighted \(\ell ^q\)-norm, where the weight depends on the parameter \(\alpha > 0\). For \(q = \infty \), this leads to the class

$$\begin{aligned} A_\infty ^\alpha (X,\Sigma ) := \Big \{ f \in X \, : \, \sup _{n \ge 1} \,\, [n^{\alpha } \cdot E(f,\Sigma _{n-1})_{X}] < \infty \Big \} . \end{aligned}$$

Thus, intuitively speaking, this class consists of those elements of X for which the error of best approximation by elements of \(\Sigma _{n}\) decays at least like \({\mathcal {O}}(n^{-\alpha })\) for \(n \rightarrow \infty \). This general philosophy also holds for the more general classes \(A_q^\alpha (X,\Sigma )\), \(q > 0\).

If the initial family \(\Sigma \) of subsets of X satisfies some quite natural conditions—more precisely, \(\Sigma _0 = \{0\}\), each \(\Sigma _n\) is invariant to scaling, \(\Sigma _n \subset \Sigma _{n+1}\), and the union \(\bigcup _{n \in {\mathbb {N}}_0} \Sigma _n\) is dense in X, as well as the slightly more involved condition that \(\Sigma _n + \Sigma _n \subset \Sigma _{cn}\) for some fixed \(c \in {\mathbb {N}}\)—then an abundance of results is available for the approximation classes \(A_q^\alpha (X,\Sigma )\). In particular, \(A_q^\alpha (X,\Sigma )\) turns out to be a proper linear function space, equipped with a natural (quasi)-norm. Particular highlights of the theory are various embedding and interpolation results between the different approximation spaces.
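A simple worked example may help to fix ideas (our own illustration, using the weighted \(\ell ^q\)-sums that are made precise in Sect. 3): if \(E(f,\Sigma _{n})_{X} \le C \cdot n^{-\beta }\) for all \(n \ge 1\) and some \(\beta > \alpha \), then

$$\begin{aligned} \sum _{n \ge 2} \big [ n^{\alpha } \cdot E(f,\Sigma _{n-1})_{X} \big ]^{q} \cdot \frac{1}{n} \le C' \cdot \sum _{n \ge 2} n^{(\alpha - \beta ) q - 1} < \infty , \end{aligned}$$

while the term for \(n = 1\) equals \(\Vert f \Vert _{X}^{q} < \infty \) since \(\Sigma _0 = \{0\}\); hence f belongs to the class \(A_q^\alpha (X,\Sigma )\) for every \(q \in (0,\infty ]\) (for \(q = \infty \), this follows directly from the definition above).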

1.3 Our Contribution

We introduce a novel perspective on the study of expressivity of deep neural networks by introducing the associated approximation spaces and investigating their properties. This is in contrast with the usual approach of studying the approximation fidelity of neural networks on classical spaces. We utilize this new viewpoint for deriving novel results on, for instance, the impact of the choice of activation functions and the depth of the networks.

Given a so-called (nonlinear) activation function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), a classical setting is to consider nonlinearities \(\alpha _\ell \) in (1.1) corresponding to a componentwise application of the activation function for each hidden layer \(1 \le \ell < L\), and \(\alpha _{L}\) being the identity. We refer to networks of this form as strict \(\varrho \)-networks. To introduce a framework of sufficient flexibility, we also consider nonlinearities where for each component either \(\varrho \) or the identity is applied. We refer to such networks as generalized \(\varrho \)-networks; the realizations of such generalized networks include various function classes such as multilayer sparse linear transforms [41], networks with skip connections [54], ResNets [32, 67], or U-nets [58].

Let us now explain how we utilize this framework of approximation spaces. Our focus will be on approximation rates in terms of growing complexity of neural networks, which we primarily measure by their connectivity, since this connectivity is closely linked to the number of bytes needed to describe the network, and also to the number of floating point operations needed to apply the corresponding function to a given input. This is in line with recent results [9, 55, 68] which explicitly construct neural networks that reach an optimal approximation rate for very specific function classes, and in contrast to most of the existing literature focusing on complexity measured by the number of neurons. We also consider the approximation spaces for which the complexity of the networks is measured by the number of neurons.

In addition to letting the number of connections or neurons tend to infinity while keeping the depth of the networks fixed, we also allow the depth to evolve with the number of connections or neurons. To achieve this, we link both by a non-decreasing depth-growth function \({\mathscr {L}}: {\mathbb {N}}\rightarrow {\mathbb {N}}\cup \{ \infty \}\), where we allow the possibility of not restricting the number of layers when \({\mathscr {L}}(n) = \infty \). We then consider the function families \(\mathtt {W}_n(\Omega \rightarrow {\mathbb {R}}^{k},\varrho , {\mathscr {L}})\) (resp. \(\mathtt {N}_n(\Omega \rightarrow {\mathbb {R}}^{k},\varrho , {\mathscr {L}})\)) made of all restrictions to a given subset \(\Omega \subseteq {\mathbb {R}}^{d}\) of functions which can be represented by (generalized) \(\varrho \)-networks with input/output dimensions d and k, at most n nonzero connection weights (resp. at most n hidden neurons), and at most \({\mathscr {L}}(n)\) layers. Finally, given a space X of functions \(\Omega \rightarrow {\mathbb {R}}^{k}\), we will use the sets \(\Sigma _n = \mathtt {W}_n(\Omega \rightarrow {\mathbb {R}}^{k},\varrho , {\mathscr {L}})\) (resp. \(\Sigma _n = \mathtt {N}_n(\Omega \rightarrow {\mathbb {R}}^{k},\varrho , {\mathscr {L}})\)) to define the associated approximation spaces. Typical choices for X are

$$\begin{aligned} X := {\left\{ \begin{array}{ll} L_p(\Omega ; {\mathbb {R}}^{k}), &{}\quad \text {if}\ 0< p < \infty , \\ C_0(\Omega ; {\mathbb {R}}^{k}), &{}\quad \text {if}\ p = \infty , \end{array}\right. } \end{aligned}$$
(1.3)

with \(C_0(\Omega ; {\mathbb {R}}^{k})\) the space of uniformly continuous functions on \(\Omega \) that vanish at infinity, equipped with the supremum norm. For ease of notation, we will sometimes use abbreviated versions of these symbols when \(\Omega \), k, or X are clear from the context.

Let us now give a coarse overview of our main results, which we are able to derive with our choice of approximation spaces based on \(\mathtt {W}_n(\Omega \rightarrow {\mathbb {R}}^{k},\varrho , {\mathscr {L}})\) or \(\mathtt {N}_n(\Omega \rightarrow {\mathbb {R}}^{k},\varrho , {\mathscr {L}})\).

1.3.1 Core Properties of the Novel Approximation Spaces

We first prove that each of these two families \(\Sigma = (\Sigma _n)_{n \in {\mathbb {N}}_0}\) satisfies the necessary requirements for the associated approximation spaces—which we denote by \(\mathtt {W}^{\alpha }_{q}(X,\varrho , {\mathscr {L}})\) and \(\mathtt {N}^{\alpha }_{q}(X,\varrho , {\mathscr {L}})\), respectively—to be amenable to various results from approximation theory. Under certain conditions on \(\varrho \) and \({\mathscr {L}}\), Theorem 3.27 shows that these approximation spaces are even equipped with a convenient (quasi-)Banach space structure. The spaces \(\mathtt {W}^{\alpha }_{q}(X,\varrho , {\mathscr {L}})\) and \(\mathtt {N}^{\alpha }_{q}(X,\varrho , {\mathscr {L}})\) are nested (Lemma 3.9) and do not generally coincide (Lemma 3.10).

To prepare the ground for the analysis of the impact of depth, we then prove nestedness with respect to the depth growth function. In slightly more detail, we identify a partial order \(\preceq \) and an equivalence relation \(\sim \) on depth growth functions such that the following holds (Lemma 3.12 and Theorem 3.13):

  1. (1)

    If \({\mathscr {L}}_{1} \preceq {\mathscr {L}}_{2}\), then \(\mathtt {W}^{\alpha }_{q}(X,\varrho , {\mathscr {L}}_{1}) \hookrightarrow \mathtt {W}^{\alpha }_{q}(X,\varrho , {\mathscr {L}}_{2})\) for any \(\alpha \), q, X and \(\varrho \); and

  2. (2)

    if \({\mathscr {L}}_{1} \sim {\mathscr {L}}_{2}\), then \(\mathtt {W}^{\alpha }_{q}(X,\varrho , {\mathscr {L}}_{1}) = \mathtt {W}^{\alpha }_{q}(X,\varrho , {\mathscr {L}}_{2})\) for any \(\alpha \), q, X and \(\varrho \).

The same nestedness results hold for the spaces \(\mathtt {N}^{\alpha }_{q}(X,\varrho , {\mathscr {L}})\). Slightly surprising and already insightful might be that under mild conditions on the activation function \(\varrho \), the approximation classes for strict and generalized \(\varrho \)-networks are in fact identical, allowing us to conclude that their expressivities coincide (see Theorem 3.8).

1.3.2 Approximation Spaces Associated with ReLU Networks

The rectified linear unit (ReLU) and its powers of exponent \(r \in {\mathbb {N}}\)—in spline theory better known under the name of truncated powers [21, Chapter 5, Equation (1.1)]— are defined by

$$\begin{aligned} \varrho _r : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto (x_{+})^{r}, \end{aligned}$$

where \(x_{+} = \max \{ 0,x \}\), so that \(\varrho _{1}\) is the ReLU activation function. Considering these activation functions is motivated practically by the wide use of the ReLU [42], as well as theoretically by the existence [45, Theorem 4] of pathological activation functions giving rise to trivial—too rich—approximation spaces that coincide with the ambient space X for all \(\alpha ,q\). In contrast, the classes associated with \(\varrho _{r}\)-networks are non-trivial for \(p \in (0,\infty ]\) (Theorem 4.16). Moreover, strict and generalized \(\varrho _{r}\)-networks yield identical approximation classes for any subset \(\Omega \subseteq {\mathbb {R}}^d\) of nonzero measure (even unbounded), for any \(p \in (0,\infty ]\) (Theorem 4.2). Furthermore, for any \(r \in {\mathbb {N}}\), these approximation classes are (quasi-)Banach spaces (Theorem 4.2), as soon as

$$\begin{aligned} L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge {\left\{ \begin{array}{ll} 2, &{}\quad \text {if}\ \Omega \ \text {is bounded or }\ d=1, \\ 3, &{}\quad \text {otherwise.} \end{array}\right. }. \end{aligned}$$
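In code, these activation functions are straightforward (a small illustrative snippet, not taken from the paper):

```python
import numpy as np

def rho(x, r=1):
    """Powers of the ReLU: rho_r(x) = max(0, x)**r; r = 1 gives the ReLU itself."""
    return np.maximum(x, 0.0) ** r
```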

The expressivity of networks with more general activation functions can be related to that of \(\varrho _{r}\)-networks (see Theorem 4.7) in the following sense: If \(\varrho \) is continuous and piecewise polynomial of degree at most r, then its approximation spaces are contained in those of \(\varrho _{r}\)-networks. In particular, if \(\Omega \) is bounded or if \({\mathscr {L}}\) satisfies a certain growth condition, then for \(s,r \in {\mathbb {N}}\) such that \(1 \le s \le r\), the approximation spaces of \(\varrho _{s}\)-networks embed into those of \(\varrho _{r}\)-networks.

Also, if \(\varrho \) is a spline of degree r and not a polynomial, then its approximation spaces match those of \(\varrho _{r}\) on bounded \(\Omega \). In particular, on a bounded domain \(\Omega \), the spaces associated with the leaky ReLU [44], the parametric ReLU [33], the absolute value (as, e.g., in scattering transforms [46]), and the soft-thresholding activation function [30] are all identical to the spaces associated with the ReLU.

Studying the relation of approximation spaces of \(\varrho _{r}\)-networks for different r, we derive the following statement as a corollary (Corollary 4.14) of Theorem 4.7: Approximation spaces of \(\varrho _{2}\)-networks and \(\varrho _{r}\)-networks are equal for \(r \ge 2\) when \({\mathscr {L}}\) satisfies a certain growth condition, showing a saturation from degree 2 on. Given this growth condition, for any \(r \ge 2\), we obtain the following diagram:

1.3.3 Relation to Classical Function Spaces

Focusing still on ReLU networks, we show that ReLU networks of bounded depth approximate \(C_{c}^{3}(\Omega )\) functions at bounded rates (Theorem 4.17) in the sense that, for open \(\Omega \subset {\mathbb {R}}^d\) and \(L := \sup _{n} {\mathscr {L}}(n) < \infty \), we prove

As classical function spaces (e.g., Sobolev, Besov) intersect \(C^{3}_{c}(\Omega )\) non-trivially, they can only embed into or if the networks are somewhat deep (\(L \ge 1 + \alpha /2\) or \(\lfloor L/2 \rfloor \ge \alpha /2\), respectively), giving some insight about the impact of depth on the expressivity of neural networks.

We then study relations to the classical Besov spaces \({B^{s}_{\sigma ,\tau } (\Omega ) := B^s_{\tau }(L_\sigma (\Omega ;{\mathbb {R}}))}\). We establish both direct estimates—that is, embeddings of certain Besov spaces into approximation spaces of \(\varrho _{r}\)-networks—and inverse estimates—that is, embeddings of the approximation spaces into certain Besov spaces.

The main result in the regime of direct estimates is Theorem 5.5 showing that if \(\Omega \subset {\mathbb {R}}^d\) is a bounded Lipschitz domain, if \(r \ge 2\), and if \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n)\) satisfies \(L \ge 2 + 2 \lceil \log _2 d \rceil \), then

(1.4)

For large input dimensions d, however, the condition \(L \ge 2 + 2 \lceil \log _2 d \rceil \) is only satisfied for quite deep networks. For shallower networks with \(L \ge 3\), the embedding (1.4) still holds (for any \(r \in {\mathbb {N}}\)), but is only established for \(0< \alpha < \tfrac{\min \{ 1, p^{-1} \}}{d}\). Finally, in the case \(d = 1\), the embedding (1.4) is valid as soon as \(L \ge 2\) and \(r \ge 1\).

Regarding inverse estimates, we first establish limits on possible embeddings (Theorem 5.7). Precisely, for \(\Omega = (0,1)^{d}\) and any \(r \in {\mathbb {N}}\), \(\alpha ,s \in (0,\infty )\), and \(\sigma , \tau \in (0,\infty ]\) we have, with \(L := \sup _{n} {\mathscr {L}}(n) \ge 2\):

  • if \(\alpha < \lfloor L/2\rfloor \cdot \min \{ s, 2 \}\) then does not embed into \(B^{s}_{\sigma ,\tau } (\Omega )\);

  • if \(\alpha < (L-1) \cdot \min \{ s, 2 \}\) then does not embed into \(B^{s}_{\sigma ,\tau } (\Omega )\).

A particular consequence is that for unbounded depth \(L = \infty \), none of the spaces \(\mathtt {W}^{\alpha }_{q}(X,\varrho _r, {\mathscr {L}})\), \(\mathtt {N}^{\alpha }_{q}(X,\varrho _r, {\mathscr {L}})\) can embed into any Besov space of strictly positive smoothness \(s>0\).

For scalar input dimension \(d=1\), an embedding into a Besov space with the relation \(\alpha = \lfloor L/2 \rfloor \cdot s\) (respectively, \(\alpha = (L-1) \cdot s\)) is indeed achieved for \(X = L_{p}( (0,1) )\), \(0< p < \infty \), \(r \in {\mathbb {N}}\), (Theorem 5.13):

1.4 Expected Impact and Future Directions

We anticipate our results to have an impact in a number of areas that we now describe together with possible future directions:

  • Theory of Expressivity We introduce a general framework to study approximation properties of deep neural networks from an approximation space viewpoint. This opens the door to transferring various results from this part of approximation theory to deep neural networks. We believe that this conceptually new approach in the theory of expressivity will lead to further insight. One interesting topic for future investigation is, for instance, to derive a finer characterization of the spaces \(\mathtt {W}^{\alpha }_{q}(X,\varrho _r, {\mathscr {L}})\) and \(\mathtt {N}^{\alpha }_{q}(X,\varrho _r, {\mathscr {L}})\), for \(r \in \{1,2\}\) (with some assumptions on \({\mathscr {L}}\)).

    Our framework is amenable to various extensions; for example, the restriction to convolutional weights would allow a study of approximation spaces of convolutional neural networks.

  • Statistical Analysis of Deep Learning Approximation spaces characterize fundamental trade-offs between the complexity of a network architecture and its ability to approximate (with proper choices of parameter values) a given function f. In statistical learning, a related question is to characterize which generalization bounds (also known as excess risk guarantees) can be achieved when fitting network parameters using m independent training samples. Some “oracle inequalities” [60] of this type have been recently established for idealized training algorithms minimizing the empirical risk (1.2). Our framework in combination with existing results on the VC dimension of neural networks [7] is expected to shed new light on such generalization guarantees through a generic approach encompassing various types of constraints on the considered architecture.

  • Design of Deep Neural Networks—Architectural Guidelines Our results reveal how the expressive power of a network architecture may be impacted by choices such as the presence of certain types of skip connections or the selected activation functions. Thus, our results provide guidance on how a network architecture may be adapted without hurting its expressivity, in order to get additional degrees of freedom to ease the task of optimization-based learning algorithms and improve their performance. For instance, while we show that generalized and strict networks have (under mild assumptions on the activation function) the same expressivity, we have not yet considered so-called ResNet architectures. Yet, the empirical observation that a ResNet architecture makes it easier to train deep networks [32] calls for a better understanding of the relations between the corresponding approximation classes.

1.5 Outline

The paper is organized as follows.

Section 2 introduces our notation regarding neural networks and provides basic lemmata concerning the “calculus” of neural networks. The classical notion of approximation spaces is reviewed in Sect. 3, and therein also specialized to the setting of approximation spaces of networks, with a focus on approximation in \(L_p\) spaces. This is followed by Sect. 4, which concentrates on \(\varrho \)-networks with \(\varrho \) the so-called ReLU or one of its powers. Finally, Sect. 5 studies embeddings between \(\mathtt {W}^{\alpha }_{q}(X,\varrho _r, {\mathscr {L}})\) (resp. \(\mathtt {N}^{\alpha }_{q}(X,\varrho _r, {\mathscr {L}})\)) and classical Besov spaces, with X as in (1.3).

2 Neural Networks and Their Elementary Properties

In this section, we formally introduce the definition of neural networks used throughout this paper and discuss the elementary properties of the corresponding sets of functions.

2.1 Neural Networks and Their Main Characteristics

Definition 2.1

(Neural network) Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\). A (generalized) neural network with activation function \(\varrho \) (in short: a \(\varrho \)-network) is a tuple \(\big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )\), where each \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\) is an affine-linear map, \(\alpha _L = \mathrm {id}_{{\mathbb {R}}^{N_L}}\), and each function \(\alpha _\ell : {\mathbb {R}}^{N_\ell }\rightarrow {\mathbb {R}}^{N_\ell }\) for \(1 \le \ell < L\) is of the form \(\alpha _\ell = \bigotimes _{j = 1}^{N_\ell } \varrho _j^{(\ell )}\) for certain \(\varrho _j^{(\ell )} \in \{\mathrm {id}_{{\mathbb {R}}}, \varrho \}\). Here, we use the notation

$$\begin{aligned}&\bigotimes _{j=1}^n \theta _j : X_1 \times \cdots \times X_n \rightarrow Y_1 \times \cdots \times Y_n,\\&\quad (x_1,\dots ,x_n) \mapsto \big ( \theta _1 (x_1), \dots , \theta _n (x_n) \big ) \, \text { for } \, \theta _j : X_j \rightarrow Y_j . \blacktriangleleft \end{aligned}$$

Definition 2.2

A \(\varrho \)-network as above is called strict if \(\varrho _j^{(\ell )} = \varrho \) for all \(1 \le \ell < L\) and \(1 \le j \le N_\ell \). \(\blacktriangleleft \)

Definition 2.3

(Realization of a network) The realization \({\mathtt {R}}(\Phi )\) of a network \({\Phi = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )}\) as above is the function

$$\begin{aligned} {\mathtt {R}}(\Phi ) : {\mathbb {R}}^{N_0} \rightarrow {\mathbb {R}}^{N_L} , \quad \text {with} \quad {\mathtt {R}}(\Phi ) := \alpha _L \circ T_L \circ \cdots \circ \alpha _1 \circ T_1 . \blacktriangleleft \end{aligned}$$

The complexity of a neural network is characterized by several features.

Definition 2.4

(Depth, number of hidden neurons, number of connections) Consider a neural network \(\Phi = \big ( (T_1, \alpha _1), \dots , (T_L, \alpha _L) \big )\) with \(T_\ell : {\mathbb {R}}^{{N_{\ell - 1}}} \rightarrow {\mathbb {R}}^{N_\ell }\) for \(1 \le \ell \le L\).

  • The input-dimension of \(\Phi \) is \({d_{\mathrm {in}}}(\Phi ) := N_0 \in {\mathbb {N}}\), its output-dimension is \({d_{\mathrm {out}}}(\Phi ) := N_L \in {\mathbb {N}}\).

  • The depth of \(\Phi \) is \(L(\Phi ) := L \in {\mathbb {N}}\), corresponding to the number of (affine) layers of \(\Phi \).

    We remark that with these notations, the number of hidden layers is \(L-1\).

  • The number of hidden neurons of \(\Phi \) is \(N(\Phi ) := \sum _{\ell = 1}^{L-1} N_\ell \in {\mathbb {N}}_0\);

  • The number of connections (or number of weights) of \(\Phi \) is \(W(\Phi ) := \sum _{\ell = 1}^{L} \Vert T_\ell \Vert _{\ell ^0} \in {\mathbb {N}}_0\), with \({\Vert T\Vert _{\ell ^0} := \Vert A\Vert _{\ell ^{0}}}\) for an affine map \(T: x \mapsto A x + b\) with A some matrix and b some vector; here, \(\Vert \cdot \Vert _{\ell ^{0}}\) counts the number of nonzero entries in a vector or a matrix. \(\blacktriangleleft \)
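For illustration, consider a concrete instance of Definition 2.4 (numbers chosen by us): a network \(\Phi \) with \(N_0 = 2\), a single hidden layer with \(N_1 = 3\) neurons, output dimension \(N_2 = 1\), where \(A_1 \in {\mathbb {R}}^{3 \times 2}\) has 4 nonzero entries and \(A_2 \in {\mathbb {R}}^{1 \times 3}\) has 3 nonzero entries. Then

$$\begin{aligned} L(\Phi ) = 2, \qquad N(\Phi ) = N_1 = 3, \qquad W(\Phi ) = \Vert A_1 \Vert _{\ell ^{0}} + \Vert A_2 \Vert _{\ell ^{0}} = 4 + 3 = 7 , \end{aligned}$$

and the entries of the vectors \(b_1, b_2\) do not contribute to \(W(\Phi )\).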

Remark 2.5

If \(W(\Phi ) = 0\), then \({\mathtt {R}}(\Phi )\) is constant (but not necessarily zero), and if \(N(\Phi )=0\), then \({\mathtt {R}}(\Phi )\) is affine-linear (but not necessarily zero or constant). \(\blacklozenge \)

Unlike the notation used in [9, 55], which considers \(W_0 (\Phi ) := \sum _{\ell = 1}^L (\Vert A^{(\ell )}\Vert _{\ell ^0} + \Vert b^{(\ell )}\Vert _{\ell ^0})\) where \({T_\ell \, x = A^{(\ell )} x + b^{(\ell )}}\), Definition 2.4 only counts the nonzero entries of the linear part of each \(T_\ell \), so that \(W(\Phi ) \le W_{0}(\Phi )\). Yet, as shown with the following lemma, both definitions are in fact equivalent up to constant factors if one is only interested in the represented functions. The proof is in Appendix A.1.

Lemma 2.6

For any network \(\Phi \), there is a “compressed” network \({\widetilde{\Phi }}\) with \({\mathtt {R}}( {\widetilde{\Phi }} ) = {\mathtt {R}}(\Phi )\) such that \(L({\widetilde{\Phi }}) \le L(\Phi )\), \(N({\widetilde{\Phi }} \,) \le N(\Phi )\), and

$$\begin{aligned} W({\widetilde{\Phi }}) \le W_0 ({\widetilde{\Phi }} \,) \le {d_{\mathrm {out}}}(\Phi ) + 2 \cdot W(\Phi ) . \end{aligned}$$

The network \({\widetilde{\Phi }}\) can be chosen to be strict if \(\Phi \) is strict. \(\blacktriangleleft \)

Remark 2.7

The reason for distinguishing between a neural network and its associated realization is that for a given function \(f : {\mathbb {R}}^d\rightarrow {\mathbb {R}}^k\), there might be many different neural networks \(\Phi \) with \(f = {\mathtt {R}}(\Phi )\), so that talking about the number of layers, neurons, or weights of the function f is not well defined, whereas these notions certainly make sense for neural networks as defined above. A possible alternative would be to define, for example,

$$\begin{aligned} L(f) := \min \big \{ L(\Phi ) \, : \, \Phi \text { neural network with } {\mathtt {R}}(\Phi ) = f \big \}, \end{aligned}$$

and analogously for N(f) and W(f); but this has the considerable drawback that it is not clear whether there is a neural network \(\Phi \) that simultaneously satisfies, e.g., \(L(\Phi ) = L(f)\) and \(W(\Phi ) = W(f)\). Because of these issues, we prefer to properly distinguish between a neural network and its realization.\(\blacklozenge \)

Remark 2.8

Some of the conventions in the above definitions might appear unnecessarily complicated at first sight, but they have been chosen after careful thought. In particular:

  • Many neural network architectures used in practice use the same activation function for all neurons in a common layer. If this choice of activation function even stays the same across all layers—except for the last one—one obtains a strict neural network.

  • In applications, network architectures very similar to our “generalized” neural networks are used; examples include residual networks (also called “ResNets,” see [32, 67]), and networks with skip connections [54].

  • As expressed in Sect. 2.3, the class of realizations of generalized neural networks admits nice closure properties under linear combinations and compositions of functions. Similar closure properties do in general not hold for the class of strict networks.

  • The introduction of generalized networks will be justified in Sect. 3.3, where we show that if one is only interested in approximation theoretic properties of the respective function class, then—at least on bounded domains \(\Omega \subset {\mathbb {R}}^d\) for “generic” \(\varrho \), but also on unbounded domains for the ReLU activation function and its powers—generalized networks and strict networks have identical properties. \(\blacklozenge \)

2.2 Relations Between Depth, Number of Neurons, and Number of Connections

We now investigate the relationships between the quantities describing the complexity of a neural network \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_L, \alpha _L) \big )\) with \(T_\ell : {\mathbb {R}}^{N_{\ell - 1}} \rightarrow {\mathbb {R}}^{N_\ell }\).

Given the number of (hidden) neurons of the network, the other quantities can be bounded. Indeed, by definition we have \(N_\ell \ge 1\) for all \(1 \le \ell \le L-1\); therefore, the number of layers satisfies

$$\begin{aligned} L(\Phi ) = 1+\sum _{\ell = 1}^{L-1} 1 \le 1+ \sum _{\ell = 1}^{L-1} N_\ell = 1+ N(\Phi ) . \end{aligned}$$
(2.1)

Similarly, as \(\Vert T_{\ell }\Vert _{\ell ^{0}} \le N_{\ell -1} N_{\ell }\) for each \(1 \le \ell \le L\), we have

$$\begin{aligned} W(\Phi )= & {} \sum _{\ell = 1}^{L} \Vert T_{\ell }\Vert _{\ell ^{0}} \le \sum _{\ell = 1}^{L} N_{\ell -1} N_{\ell } \le \sum _{\ell '=0}^{L-1}\sum _{\ell =1}^{L} N_{\ell '}N_{\ell } \nonumber \\= & {} (d_{\mathrm {in}} (\Phi ) +N(\Phi )) (N(\Phi ) + d_{\mathrm {out}}(\Phi )), \end{aligned}$$
(2.2)

showing that \(W(\Phi ) = {\mathcal {O}}([N(\Phi )]^2 + d k)\) for fixed input and output dimensions d and k. When \(L(\Phi )=2\), we have in fact \( W(\Phi ) = \Vert T_{1}\Vert _{\ell ^{0}}+\Vert T_{2}\Vert _{\ell ^{0}}\le N_{0}N_{1}+N_{1}N_{2} = (N_{0}+N_{2})N_{1} = (d_{\mathrm {in}} (\Phi ) + d_{\mathrm {out}}(\Phi )) \cdot N(\Phi ) \).

In general, one cannot bound the number of layers or of hidden neurons by the number of nonzero weights, as one can build arbitrarily large networks with many “dead neurons.” Yet, such a bound is true if one is willing to switch to a potentially different network which has the same realization as the original network. To show this, we begin with the case of networks with zero connections.

Lemma 2.9

Let \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_L, \alpha _L) \big )\) be a neural network. If there exists some \(\ell \in \{1,\dots ,L\}\) such that \(\Vert T_{\ell }\Vert _{\ell ^{0}} = 0\), then \({\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}^{k}\) where \(k = d_{\mathrm {out}}(\Phi )\).\(\blacktriangleleft \)

Proof

As \(\Vert T_{\ell }\Vert _{\ell ^{0}} = 0\), the affine map \(T_{\ell }\) is a constant map \({\mathbb {R}}^{N_{\ell - 1}} \ni y \mapsto b^{(\ell )} \in {\mathbb {R}}^{N_{\ell }}\). Therefore, \(f_{\ell } = \alpha _{\ell } \circ T_{\ell }: {\mathbb {R}}^{N_{\ell -1}} \rightarrow {\mathbb {R}}^{N_{\ell }}\) is a constant map, so that also \({\mathtt {R}}(\Phi ) = f_{L} \circ \cdots \circ f_{\ell } \circ \cdots \circ f_{1}\) is constant. \(\square \)

Corollary 2.10

If \(W(\Phi ) < L(\Phi )\), then \({\mathtt {R}}(\Phi ) \equiv c\) for some \(c \in {\mathbb {R}}^{k}\) where \(k = d_{\mathrm {out}}(\Phi )\).\(\blacktriangleleft \)

Proof

Let \(\Phi = \big ( (T_1,\alpha _1), \dots , (T_L,\alpha _L) \big )\) and observe that if \(\sum _{\ell =1}^L \Vert T_{\ell }\Vert _{\ell ^{0}} = W(\Phi ) < L(\Phi ) = \sum _{\ell =1}^{L} 1\), then there must exist \(\ell \in \{1,\dots ,L\}\) such that \(\Vert T_{\ell }\Vert _{\ell ^{0}}=0\), so that we can apply Lemma 2.9. \(\square \)

Indeed, constant maps play a special role, as they are exactly the realizations of neural networks with no (nonzero) connections. Before formally stating this result, we introduce notation for families of neural networks of constrained complexity, which can have a variety of shapes, as illustrated in Fig. 1.

Definition 2.11

Consider \(L \in {\mathbb {N}}\cup \{\infty \}\), \(W,N \in {\mathbb {N}}_0 \cup \{\infty \}\), and \(\Omega \subseteq {\mathbb {R}}^{d}\) a non-empty set.

  • \({\mathcal {NN}}_{W,L,N}^{\varrho ,d,k}\) denotes the set of all generalized \(\varrho \)-networks \(\Phi \) with input dimension d, output dimension k, and with \(W(\Phi ) \le W\), \(L(\Phi ) \le L\), and \(N(\Phi ) \le N\).

  • \({\mathcal {SNN}}_{W,L,N}^{\varrho ,d,k}\) denotes the subset of networks \(\Phi \in {\mathcal {NN}}_{W,L,N}^{\varrho ,d,k}\) which are strict.

  • The class of all functions \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\) that can be represented by (generalized) \(\varrho \)-networks with at most W weights, L layers, and N neurons is

    $$\begin{aligned} {\mathtt {NN}}_{W,L,N}^{\varrho ,d,k} := \big \{ {\mathtt {R}}(\Phi ) \,:\, \Phi \in {\mathcal {NN}}_{W,L,N}^{\varrho ,d,k} \big \} . \end{aligned}$$

    The set of all restrictions of such functions to \(\Omega \) is denoted \({\mathtt {NN}}_{W,L,N}^{\varrho ,d,k}(\Omega )\).

  • Similarly,

    $$\begin{aligned} {\mathtt {SNN}}_{W,L,N}^{\varrho ,d,k} := \big \{ {\mathtt {R}}(\Phi ) \,:\, \Phi \in {\mathcal {SNN}}_{W,L,N}^{\varrho , d, k} \big \} . \end{aligned}$$

    The set of all restrictions of such functions to \(\Omega \) is denoted \({\mathtt {SNN}}_{W,L,N}^{\varrho ,d,k}(\Omega )\).

Finally, we define \({\mathtt {NN}}_{W,L}^{\varrho ,d,k} := {\mathtt {NN}}^{\varrho ,d,k}_{W,L,\infty }\) and \({\mathtt {NN}}_{W}^{\varrho ,d,k} := {\mathtt {NN}}^{\varrho ,d,k}_{W,\infty ,\infty }\), as well as \({{\mathtt {NN}}^{\varrho ,d,k} := {\mathtt {NN}}^{\varrho ,d,k}_{\infty ,\infty ,\infty }}\). We will use similar notations for \({\mathtt {SNN}}\), \({\mathcal {NN}}\), and \({\mathcal {SNN}}\). \(\blacktriangleleft \)

Remark 2.12

If the dimensions dk and/or the activation function \(\varrho \) are implied by the context, we will sometimes omit them from the notation.\(\blacklozenge \)

Fig. 1
figure 1

The considered network classes include a variety of networks such as: (top) shallow networks with a single hidden layer, where the number of neurons is of the same order as the number of possible connections; (middle) “narrow and deep” networks, e.g., with a single neuron per layer, where the same holds; (bottom) “truly” sparse networks that have much fewer nonzero weights than potential connections

Lemma 2.13

Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), and let \(d,k \in {\mathbb {N}}\), \(N \in {\mathbb {N}}_{0} \cup \{\infty \}\), and \(L \in {\mathbb {N}}\cup \{\infty \}\) be arbitrary. Then,

$$\begin{aligned} {\mathtt {NN}}_{0,L,N}^{\varrho ,d,k} = {\mathtt {SNN}}_{0,L,N}^{\varrho ,d,k} = {\mathtt {NN}}_{0,1,0}^{\varrho ,d,k} = {\mathtt {SNN}}_{0,1,0}^{\varrho ,d,k} = \{f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{k} \,\mid \, \exists c \in {\mathbb {R}}^{k}: f \equiv c\} . \blacktriangleleft \end{aligned}$$

Proof

If \(f \equiv c\) where \(c \in {\mathbb {R}}^{k}\), then the affine map \(T : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^{k}, x \mapsto c\) satisfies \(\Vert T \Vert _{\ell ^0} = 0\) and the (strict) network \(\Phi := \big ( (T, \mathrm {id}_{{\mathbb {R}}^{k}}) \big )\) satisfies \({\mathtt {R}}(\Phi ) \equiv c = f\), \(W(\Phi ) = 0\), \(N(\Phi ) = 0\) and \(L(\Phi ) = 1\). By Definition 2.11, we have \(\Phi \in {\mathcal {SNN}}_{0,1,0}^{\varrho ,d,k}\) whence \(f \in {\mathtt {SNN}}_{0,1,0}^{\varrho ,d,k}\). The inclusions \({\mathtt {SNN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathtt {NN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathtt {NN}}_{0,L,N}^{\varrho ,d,k}\) and \({\mathtt {SNN}}_{0,1,0}^{\varrho ,d,k} \subset {\mathtt {SNN}}_{0,L,N}^{\varrho ,d,k} \subset {\mathtt {NN}}_{0,L,N}^{\varrho ,d,k}\) are trivial by definition of these sets. If \(f \in {\mathtt {NN}}_{0,L,N}^{\varrho ,d,k}\), then there is \(\Phi \in {\mathcal {NN}}_{0,L,N}^{\varrho ,d,k}\) such that \(f = {\mathtt {R}}(\Phi )\). As \(W(\Phi ) = 0 < 1 \le L(\Phi )\), Corollary 2.10 yields \(f = {\mathtt {R}}(\Phi ) \equiv c\). \(\square \)

Our final result in this subsection shows that any realization of a network with at most \(W \ge 1\) connections can also be obtained by a network with at most W connections which additionally has at most W layers and at most W hidden neurons. The proof is postponed to Appendix A.2.

Lemma 2.14

Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), \(d,k \in {\mathbb {N}}\), \(L \in {\mathbb {N}}\cup \{\infty \}\), and \(W \in {\mathbb {N}}\) be arbitrary. Then, we have

$$\begin{aligned} {\mathtt {NN}}_{W,L,\infty }^{\varrho ,d,k} = {\mathtt {NN}}_{W, L,W}^{\varrho ,d,k} \subset {\mathtt {NN}}_{W,W,W}^{\varrho ,d,k} . \end{aligned}$$

The inclusion is an equality for \(L \ge W\). In particular, \({\mathtt {NN}}_{W}^{\varrho ,d,k} = {\mathtt {NN}}_{W,\infty ,W}^{\varrho ,d,k} = {\mathtt {NN}}_{W,W,W}^{\varrho ,d,k}\). The same claims are valid for strict networks, replacing the symbol \({\mathtt {NN}}\) by \({\mathtt {SNN}}\) everywhere.\(\blacktriangleleft \)

To summarize, for given input and output dimensions dk, when combining (2.2) with the above lemma, we obtain that for any network \(\Phi \), there exists a network \(\Psi \) with \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi )\) and \(L(\Psi ) \le L(\Phi )\), and such that

$$\begin{aligned} N(\Psi ) \le W(\Psi ) \le W(\Phi ) \le N^{2}(\Phi ) + (d+k) N(\Phi ) + dk. \end{aligned}$$
(2.3)

When \(L(\Phi ) = 2\), we have in fact \(N(\Psi ) \le W(\Psi ) \le W(\Phi ) \le (d+k) N(\Phi )\); see the discussion after (2.2).

Remark 2.15

(Connectivity, flops and bits.) A motivation for measuring a network’s complexity by its connectivity is that the number of connections is directly related to several practical quantities of interest such as the number of floating point operations needed to compute the output given the input, or the number of bits needed to store a (quantized) description of the network in a computer file. This is not the case for complexity measured in terms of the number of neurons.\(\blacklozenge \)

2.3 Calculus with Generalized Neural Networks

In this section, we show as a consequence of Lemma 2.14 that the class of realizations of generalized neural networks of a given complexity—as measured by the number of connections \(W(\Phi )\)—is closed under addition and composition, as long as one is willing to increase the complexity by a constant factor. To this end, we first show that one can increase the depth of generalized neural networks with controlled increase in the required complexity.

Lemma 2.16

Given \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\), \({d,k \in {\mathbb {N}}}\), \(c := \min \{d,k\}\), \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}\), and \(L_0 \in {\mathbb {N}}_{0}\), there exists \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\) such that \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi )\), \(L(\Psi ) = L(\Phi ) + L_0\), \(W(\Psi ) = W(\Phi ) + c L_0\), \(N(\Psi ) = N(\Phi ) + c L_0\).\(\blacktriangleleft \)

Fig. 2
figure 2

(left) Graphical convention for drawing neural networks; this convention is used everywhere except in Fig. 1. (right) Depth synchronization of Lemma 2.16; identity layers are added at the output if \(k < d\); in case of \(d < k\), they are added at the input

This fact appears without proof in [60, Section 5.1] under the name of depth synchronization for strict networks with the ReLU activation function, with \(c = d\). We refine it to \(c = \min \{d,k\}\) and give a proof for generalized networks with arbitrary activation function in Appendix A.3. The underlying proof idea is illustrated in Fig. 2.
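To make the construction tangible, here is one way to realize the statement for \(k \le d\) (our own formulation, consistent with the complexity counts of Lemma 2.16 and with Fig. 2; for \(d < k\), the identity layers are instead prepended at the input): append \(L_0\) layers whose affine map is the identity on \({\mathbb {R}}^{k}\) and whose nonlinearity applies the identity to every component,

$$\begin{aligned} \Psi := \big ( (T_1,\alpha _1), \dots , (T_L,\alpha _L), (\mathrm {id}_{{\mathbb {R}}^{k}}, \mathrm {id}_{{\mathbb {R}}^{k}}), \dots , (\mathrm {id}_{{\mathbb {R}}^{k}}, \mathrm {id}_{{\mathbb {R}}^{k}}) \big ) . \end{aligned}$$

Then \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi )\), and in total the appended layers add \(k L_0 = c L_0\) nonzero weights and \(k L_0 = c L_0\) hidden neurons, in accordance with \(W(\Psi ) = W(\Phi ) + c L_0\) and \(N(\Psi ) = N(\Phi ) + c L_0\).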

Fig. 3
figure 3

Illustration of the networks constructed in the proofs of Lemmas 2.17 and 2.18. (top) Implementation of Cartesian products; (middle) implementation of addition; (bottom) implementation of composition

A consequence of the depth synchronization property is that the class of generalized networks is closed under linear combinations and Cartesian products. The proof idea behind the following lemma, whose proof is in Appendix A.4, is illustrated in Fig. 3 (top and middle).

Lemma 2.17

Consider arbitrary \(d,k,n \in {\mathbb {N}}\), \(c \in {\mathbb {R}}\), \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\), and \(k_i \in {\mathbb {N}}\) for \(i \in \{1,\dots ,n\}\).

  1. (1)

    If \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}\), then \(c \cdot {\mathtt {R}}(\Phi ) = {\mathtt {R}}(\Psi )\) where \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\) satisfies \(W(\Psi ) \le W(\Phi )\) (with equality if \(c \ne 0\)), \(L(\Psi ) = L(\Phi )\), \(N(\Psi ) = N(\Phi )\). The same holds with \({\mathcal {SNN}}\) instead of \({\mathcal {NN}}\).

  2. (2)

    If \(\Phi _i \in {\mathcal {NN}}^{\varrho , d, k_i}\) for \(i \in \{1,\dots ,n\}\), then \(({\mathtt {R}}(\Phi _{1}),\ldots ,{\mathtt {R}}(\Phi _{n})) = {\mathtt {R}}(\Psi )\) with \(\Psi \in {\mathcal {NN}}^{\varrho ,d,K}\), where

    $$\begin{aligned} L(\Psi )= & {} \max _{i = 1,\dots ,n} L(\Phi _{i}), \quad W(\Psi ) \le \delta +\sum _{i=1}^{n} W(\Phi _{i}), \quad \\ N(\Psi )\le & {} \delta +\sum _{i=1}^{n} N(\Phi _{i}), \quad \text {and} \quad K:=\sum _{i=1}^{n} k_{i}, \end{aligned}$$

    with \(\delta := c \cdot \big (\max _{i=1,\dots ,n} L(\Phi _{i})-\min _{i} L(\Phi _{i})\big )\) and \(c := \min \{d, K-1 \}\).

  3. (3)

    If \(\Phi _1,\dots ,\Phi _n \in {\mathcal {NN}}^{\varrho , d, k}\), then \(\sum _{i=1}^n {\mathtt {R}}(\Phi _{i}) = {\mathtt {R}}(\Psi )\) with \(\Psi \in {\mathcal {NN}}^{\varrho ,d,k}\), where

    $$\begin{aligned} L(\Psi ) = \max _{i} L(\Phi _{i}), \quad W(\Psi ) \le \delta + \sum _{i=1}^{n} W(\Phi _{i}), \qquad \text {and} \quad N(\Psi ) \le \delta + \sum _{i=1}^{n} N(\Phi _{i}), \end{aligned}$$

    with \(\delta := c \left( \max _{i} L(\Phi _{i})-\min _{i} L(\Phi _{i})\right) \) and \(c := \min \{ d,k \}\). \(\blacktriangleleft \)

One can also control the complexity of certain networks resulting from compositions in an intuitive way. To state and prove this, we introduce a convenient notation: For a matrix \(A \in {\mathbb {R}}^{n \times d}\), we denote

$$\begin{aligned} \Vert A \Vert _{\ell ^{0,\infty }} := \max _{i \in \{1,\dots ,d\}} \Vert A \, e_i \Vert _{\ell ^0} \quad \text {and} \quad \Vert A \Vert _{\ell ^{0,\infty }_*} := \Vert A^T \Vert _{\ell ^{0,\infty }} = \max _{i \in \{1,\dots ,n\}} \Vert e_i^T A \Vert _{\ell ^0}, \end{aligned}$$
(2.4)

where \(e_1,\dots , e_n\) is the standard basis of \({\mathbb {R}}^n\). Likewise, for an affine-linear map \(T = A \bullet + b\), we denote \(\Vert T \Vert _{\ell ^{0,\infty }} := \Vert A \Vert _{\ell ^{0,\infty }}\) and \(\Vert T \Vert _{\ell ^{0,\infty }_{*}} := \Vert A \Vert _{\ell ^{0,\infty }_{*}}\).
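Concretely, these quantities are the maximal number of nonzero entries per column and per row of A, respectively (a small illustrative snippet, not from the paper):

```python
import numpy as np

def norm_l0_inf(A):          # max over columns i of ||A e_i||_0
    return int(np.count_nonzero(A, axis=0).max())

def norm_l0_inf_star(A):     # max over rows i of ||e_i^T A||_0
    return int(np.count_nonzero(A, axis=1).max())

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 3.0]])
print(norm_l0_inf(A), norm_l0_inf_star(A))  # 2 (third column), 2 (first row)
```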

Lemma 2.18

Consider arbitrary \(d,d_1,d_2,k,k_1 \in {\mathbb {N}}\) and \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\).

  1. (1)

    If \(\Phi \in {\mathcal {NN}}^{\varrho ,d,k}\) and \(P: {\mathbb {R}}^{d_{1}} \rightarrow {\mathbb {R}}^{d}\), \(Q:{\mathbb {R}}^{k} \rightarrow {\mathbb {R}}^{k_{1}}\) are two affine maps, then \(Q \circ {\mathtt {R}}(\Phi ) \circ P = {\mathtt {R}}(\Psi )\) where \(\Psi \in {\mathcal {NN}}^{\varrho ,d_{1},k_{1}}\) with \(L(\Psi )= L(\Phi )\), \(N(\Psi )= N(\Phi )\) and

    $$\begin{aligned} W(\Psi ) \le \Vert Q \Vert _{\ell ^{0,\infty }} \cdot W(\Phi ) \cdot \Vert P \Vert _{\ell ^{0,\infty }_{*}}. \end{aligned}$$

    The same holds with \({\mathcal {SNN}}\) instead of \({\mathcal {NN}}\).

  2. (2)

    If \(\Phi _1 \in {\mathcal {NN}}^{\varrho , d, d_1}\) and \(\Phi _2 \in {\mathcal {NN}}^{\varrho , d_1, d_2}\), then \({\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1) = {\mathtt {R}}(\Psi )\) where \(\Psi \in {\mathcal {NN}}^{\varrho , d, d_2}\) and

    $$\begin{aligned} W(\Psi )= & {} W(\Phi _{1})+W(\Phi _{2}), \quad L(\Psi ) = L(\Phi _1)+L(\Phi _2), \quad \\ N(\Psi )= & {} N(\Phi _1)+N(\Phi _2)+d_1. \end{aligned}$$
  3. (3)

    Under the assumptions of Part (2), there is also \(\Psi ' \in {\mathcal {NN}}^{\varrho , d, d_2}\) such that \({\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1) = {\mathtt {R}}(\Psi ')\) and

    $$\begin{aligned} W(\Psi ')\le & {} W(\Phi _{1}) + \max \{ N(\Phi _{1}),d \} \, W(\Phi _{2}), \quad \\ L(\Psi ')= & {} \! L(\Phi _1) \!+\! L(\Phi _2) \!-\! 1, \quad N(\Psi ') = N(\Phi _1) \!+\! N(\Phi _2). \end{aligned}$$

    In this case, the same holds for \({\mathcal {SNN}}\) instead of \({\mathcal {NN}}\). \(\blacktriangleleft \)

The proof idea of Lemma 2.18 is illustrated in Fig. 3 (bottom). The formal proof is in Appendix A.5. A direct consequence of Lemma 2.18-(1) that we will use in several places is that \({x \mapsto a_2 \, g(a_{1}x+b_{1})+b_{2} \in {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N}}\) whenever \(g \in {\mathtt {NN}}^{\varrho ,d,k}_{W,L,N}\), \(a_{1},a_{2} \in {\mathbb {R}}\), \(b_{1} \in {\mathbb {R}}^{d}\), \(b_{2} \in {\mathbb {R}}^{k}\).
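The idea behind Lemma 2.18-(2) can be sketched as follows (our own formulation of one construction consistent with the stated complexity counts; the formal proof is in Appendix A.5): since \(\alpha ^{(1)}_{L_1} = \mathrm {id}\) by Definition 2.1, simply concatenating the layers of \(\Phi _1 = \big ( (T^{(1)}_1,\alpha ^{(1)}_1),\dots ,(T^{(1)}_{L_1},\alpha ^{(1)}_{L_1}) \big )\) and \(\Phi _2\) yields

$$\begin{aligned} \Psi := \big ( (T^{(1)}_1,\alpha ^{(1)}_1),\dots ,(T^{(1)}_{L_1},\alpha ^{(1)}_{L_1}), (T^{(2)}_1,\alpha ^{(2)}_1),\dots ,(T^{(2)}_{L_2},\alpha ^{(2)}_{L_2}) \big ) \end{aligned}$$

with \({\mathtt {R}}(\Psi ) = {\mathtt {R}}(\Phi _2) \circ {\mathtt {R}}(\Phi _1)\); the \(d_1\) output neurons of \(\Phi _1\) become hidden neurons of \(\Psi \), which explains the term \(d_1\) in \(N(\Psi )\). For Part (3), one instead merges the last affine map of \(\Phi _1\) with the first affine map of \(\Phi _2\), saving one layer at the price of a possibly larger number of nonzero weights.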

Our next result shows that if \(\sigma \) can be expressed as the realization of a \(\varrho \)-network, then realizations of \(\sigma \)-networks can be re-expanded into realizations of \(\varrho \)-networks of controlled complexity.

Lemma 2.19

Consider two activation functions \(\varrho ,\sigma \) such that \(\sigma = {\mathtt {R}}(\Psi _{\sigma })\) for some \( \Psi _{\sigma } \in {\mathcal {NN}}^{\varrho ,1,1}_{w,\ell ,m} \) with \(L(\Psi _{\sigma }) = \ell \in {\mathbb {N}}\), \(w \in {\mathbb {N}}_{0}\), \(m \in {\mathbb {N}}\). Furthermore, assume that \(\sigma \not \equiv \mathrm {const}\).

Then, the following hold:

  1. (1)

    if \(\ell =2\), then for any W, N, L, d, k we have \( {\mathtt {NN}}_{W,L,N}^{\sigma ,d,k} \subset {\mathtt {NN}}_{Wm^{2},L,Nm}^{\varrho ,d,k} \);

  2. (2)

    for any \(\ell ,W,N,L,d,k\) we have \( {\mathtt {NN}}_{W,L,N}^{\sigma ,d,k} \subset {\mathtt {NN}}_{mW + w N, 1 + (L-1)\ell , N(1+m)}^{\varrho ,d,k}. \) \(\blacktriangleleft \)

The proof of Lemma 2.19 is in Appendix A.6. In the case when \(\sigma \) is simply an s-fold composition of \(\varrho \), we have the following improvement of Lemma 2.19.

Lemma 2.20

Let \(s \in {\mathbb {N}}\). Consider an activation function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), and let \(\sigma := \varrho \circ \cdots \circ \varrho \), where the composition has s “factors.”

We have

$$\begin{aligned} {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N} \subset {\mathtt {NN}}^{\varrho ,d,k}_{W+(s-1)N,1+s(L-1),sN} \quad \forall \, W, N \in {\mathbb {N}}_0 \cup \{\infty \} \text { and } L \in {\mathbb {N}}\cup \{\infty \}. \end{aligned}$$

The same holds for strict networks, replacing \({\mathtt {NN}}\) by \({\mathtt {SNN}}\) everywhere.\(\blacktriangleleft \)

The proof is in Appendix A.7. In our next result, we consider the case where \(\sigma \) cannot be exactly implemented by \(\varrho \)-networks, but only approximated arbitrarily well by such networks of uniformly bounded complexity.

Lemma 2.21

Consider two activation functions \(\varrho , \sigma : {\mathbb {R}}\rightarrow {\mathbb {R}}\). Assume that \(\sigma \) is continuous and that there are \(w,m \in {\mathbb {N}}_{0}\), \(\ell \in {\mathbb {N}}\) and a family \(\Psi _{h} \in {\mathcal {NN}}_{w,\ell ,m}^{\varrho ,1,1}\) parameterized by \(h \in {\mathbb {R}}\), with \(L(\Psi _{h}) = \ell \), such that \(\sigma _{h} := {\mathtt {R}}(\Psi _{h}) \xrightarrow [h \rightarrow 0]{} \sigma \) locally uniformly on \({\mathbb {R}}\). For any \(d,k \in {\mathbb {N}}\), \(W,N \in {\mathbb {N}}_{0}\), \(L \in {\mathbb {N}}\), we have

$$\begin{aligned} {\mathtt {NN}}^{\sigma ,d,k}_{W,L,N} \subset {\left\{ \begin{array}{ll} \overline{{\mathtt {NN}}^{\varrho ,d,k}_{Wm^{2},L,Nm}}, &{}\quad {\mathrm{if }} \ell =2; \\ \overline{{\mathtt {NN}}^{\varrho ,d,k}_{mW+wN,1+(L-1)\ell ,N(1+m)}}, &{}\quad \mathrm{for}\,\,\mathrm{any }\,\, \ell , \end{array}\right. } \end{aligned}$$
(2.5)

where the closure is with respect to locally uniform convergence.\(\blacktriangleleft \)

The proof is in Appendix A.8. In the next lemma, we establish a relation between the approximation capabilities of strict and generalized networks. The proof is given in Appendix A.9.

Lemma 2.22

Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be continuous and assume that \(\varrho \) is differentiable at some \(x_0 \in {\mathbb {R}}\) with \(\varrho ' (x_0) \ne 0\). For any \(d,k \in {\mathbb {N}}\), \(L \in {\mathbb {N}}\cup \{\infty \}\), \(N \in {\mathbb {N}}_0 \cup \{\infty \}\), and \(W \in {\mathbb {N}}_0\), we have

$$\begin{aligned} {\mathtt {NN}}_{W,L,N}^{\varrho ,d,k} \subset \overline{{\mathtt {SNN}}_{4W,L,2N}^{\varrho ,d,k}}, \end{aligned}$$

where the closure is with respect to locally uniform convergence.\(\blacktriangleleft \)

2.4 Networks with Activation Functions that can Represent the Identity

The convergence in Lemma 2.22 is only locally uniformly, which is not strong enough to ensure equality of the associated approximation spaces on unbounded domains. In this subsection, we introduce a certain condition on the activation functions which ensures that strict and generalized networks yield the same approximation spaces also on unbounded domains.

Definition 2.23

We say that a function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) can represent \(f: {\mathbb {R}}\rightarrow {\mathbb {R}}\) with n terms (where \(n \in {\mathbb {N}}\)) if \(f \in {\mathtt {SNN}}^{\varrho ,1,1}_{\infty ,2,n}\); that is, if there are \(a_i, b_i, c_i \in {\mathbb {R}}\) for \(i \in \{1,\dots ,n\}\), and some \(c \in {\mathbb {R}}\) satisfying

$$\begin{aligned} f(x) = c + \sum _{i=1}^n a_i \cdot \varrho (b_i \, x + c_i) \qquad \forall \, x \in {\mathbb {R}}. \end{aligned}$$

A particular case of interest is when \(\varrho \) can represent the identity \(\mathrm {id}: {\mathbb {R}}\rightarrow {\mathbb {R}}\) with n terms. \(\blacktriangleleft \)

As shown in Appendix A.10, primary examples are the ReLU activation function and its powers.

Lemma 2.24

For any \(r \in {\mathbb {N}}\), \(\varrho _r\) can represent any polynomial of degree \(\le r\) with \(2r + 2\) terms.\(\blacktriangleleft \)
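For concreteness, two elementary instances (stated here for illustration; they are easily verified case by case): the ReLU represents the identity with 2 terms, and \(\varrho _2\) represents it with 4 terms, since for all \(x \in {\mathbb {R}}\)

$$\begin{aligned} x = \varrho _1(x) - \varrho _1(-x) \qquad \text {and} \qquad x = \tfrac{1}{4} \big [ \varrho _2(x+1) - \varrho _2(x-1) - \varrho _2(1-x) + \varrho _2(-1-x) \big ] . \end{aligned}$$

Both expressions are of the form required in Definition 2.23 (with \(c = 0\)), in line with the bound \(2r+2\) from Lemma 2.24.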

Lemma 2.25

Assume that \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) can represent the identity with n terms. Let \(d,k \in {\mathbb {N}}\), \(W, N \in {\mathbb {N}}_0\), and \(L \in {\mathbb {N}}\cup \{\infty \}\) be arbitrary. Then, \( {\mathtt {NN}}_{W,L,N}^{\varrho ,d,k} \subset {\mathtt {SNN}}_{n^2 \cdot W, L, n \cdot N}^{\varrho ,d,k} \). \(\blacktriangleleft \)

The proof of Lemma 2.25 is in Appendix A.9. The next lemma is proved in Appendix A.11.

Lemma 2.26

If \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) can represent all polynomials of degree two with n terms, then:

  1. (1)

    For \(d \in {\mathbb {N}}_{\ge 2}\), the multiplication function \(M_{d}: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}, x \mapsto \prod _{i=1}^{d} x_{i}\) satisfies

    $$\begin{aligned} M_{d} \in {\mathtt {NN}}^{\varrho ,d,1}_{6n(2^{j}-1),2j,(2n+1)(2^{j}-1)-1} \quad \text {with}\quad j = \lceil \log _{2} d \rceil . \end{aligned}$$

    In particular, for \(d =2\) we have \(M_2 \in {\mathtt {SNN}}^{\varrho ,d,1}_{6n,2,2n}\).

  2. (2)

    For \(k \in {\mathbb {N}}\), the multiplication map \(m : {\mathbb {R}}\times {\mathbb {R}}^k \rightarrow {\mathbb {R}}^k, (x,y) \mapsto x \cdot y\) satisfies \( m \in {\mathtt {NN}}^{\varrho , 1+k, k}_{6kn, 2,2kn} \). \(\blacktriangleleft \)
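One natural route to Part (1) for \(d = 2\) (a sketch consistent with the stated complexity; it is not necessarily identical to the proof in Appendix A.11) combines the polarization identity with the representation of squares: if \(x \mapsto x^2\) is represented with n terms as in Definition 2.23, then

$$\begin{aligned} M_2(x,y) = x \, y = \tfrac{1}{4} \big [ (x+y)^{2} - (x-y)^{2} \big ] \end{aligned}$$

can be implemented by a strict two-layer network whose 2n hidden neurons evaluate the terms representing \((x+y)^2\) and \((x-y)^2\); each hidden neuron has at most two nonzero input weights and one output weight, giving at most 6n connections, in accordance with \(M_2 \in {\mathtt {SNN}}^{\varrho ,2,1}_{6n,2,2n}\).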

3 Neural Network Approximation Spaces

The overall goal of this paper is to study approximation spaces associated with the sequence of sets \(\Sigma _{n}\) of realizations of networks with at most n connections (resp. at most n neurons), \(n \in {\mathbb {N}}_{0}\), either for fixed network depth \(L \in {\mathbb {N}}\), or for unbounded depth \(L = \infty \), or even for varying depth \(L = {\mathscr {L}}(n)\).

In this section, we first formally introduce these approximation spaces, following the theory from [21, Chapter 7, Section 9], and then specialize these spaces to the context of neural networks. The next sections will be devoted to establishing embeddings between classical functions spaces and neural network approximation spaces, as well as nesting properties between such spaces.

3.1 Generic Tools from Approximation Theory

Consider a quasi-Banach space X equipped with the quasi-norm \(\Vert \cdot \Vert _{X}\), and let \(f \in X\). The error of best approximation of f from a non-empty set \(\Gamma \subset X\) is

$$\begin{aligned} E(f,\Gamma )_X := \inf _{g \in \Gamma } \Vert f - g \Vert _{X} \in [0,\infty ) . \end{aligned}$$
(3.1)

In the case where X is one of the spaces from Eq. (1.3) with \(\Omega \subseteq {\mathbb {R}}^{d}\) a set of nonzero measure, the corresponding approximation error will be denoted by \(E(f,\Gamma )_{p}\). As in [21, Chapter 7, Section 9], we consider an arbitrary family \(\Sigma = (\Sigma _{n})_{n \in {\mathbb {N}}_{0}}\) of subsets \(\Sigma _{n} \subset X\) and define for \(f \in X\), \(\alpha \in (0,\infty )\), and \(q \in (0,\infty ]\) the following quantity (which will turn out to be a quasi-norm under mild assumptions on the family \(\Sigma \)):

$$\begin{aligned} \Vert f \Vert _{A^{\alpha }_{q}(X,\Sigma )} := {\left\{ \begin{array}{ll} \Big ( \sum _{n=1}^{\infty } \big [ n^{\alpha } \cdot E(f,\Sigma _{n-1})_{X} \big ]^{q} \, \frac{1}{n} \Big )^{1/q}, &{}\quad 0< q < \infty , \\ \sup _{n \ge 1} \, \big [ n^{\alpha } \cdot E(f,\Sigma _{n-1})_{X} \big ], &{}\quad q = \infty . \end{array}\right. } \end{aligned}$$

As expected, the associated approximation class is simply

$$\begin{aligned} A^{\alpha }_{q}(X,\Sigma ) := \big \{ f \in X \, : \, \Vert f \Vert _{A^{\alpha }_{q}(X,\Sigma )} < \infty \big \} . \end{aligned}$$

For \(q = \infty \), this class is precisely the subset of elements \(f \in X\) such that \(E(f,\Sigma _{n})_X = {\mathcal {O}}(n^{-\alpha })\), and the classes associated with \(0<q<\infty \) correspond to subtle variants of this subset. If we assume that \(\Sigma _n \subset \Sigma _{n+1}\) for all \(n \in {\mathbb {N}}_0\), then the following “embeddings” can be derived directly from the definition; see [21, Chapter 7, Equation (9.2)]:

$$\begin{aligned} A_{q}^{\alpha }(X,\Sigma ) \hookrightarrow A_{q'}^{\alpha '}(X,\Sigma ) \qquad \text {if } \alpha ' < \alpha , \text { or if } \alpha ' = \alpha \text { and } q \le q' . \end{aligned}$$
(3.2)

Note that we do not yet know that the approximation classes are (quasi)-Banach spaces. Therefore, the notation \(X_{1}\hookrightarrow X_{2}\)—where for \(i \in \{1,2\}\) we consider the class \(X_{i} := \{x \in X: \Vert x\Vert _{X_{i}}<\infty \}\) associated with some “proto”-quasi-norm \(\Vert \cdot \Vert _{X_{i}}\)—simply means that \(X_{1} \subset X_{2}\) and \(\Vert \cdot \Vert _{X_{2}} \le C \cdot \Vert \cdot \Vert _{X_{1}}\), even though \(\Vert \cdot \Vert _{X_{i}}\) might not be proper (quasi)-norms and \(X_{i}\) might not be (quasi)-Banach spaces. When the classes are indeed (quasi)-Banach spaces (see below), this corresponds to the standard notion of a continuous embedding.

As a direct consequence of the definitions, we get the following result on the relation between approximation classes using different families of subsets.

Lemma 3.1

Let X be a quasi-Banach space, and let \(\Sigma = (\Sigma _n)_{n \in {\mathbb {N}}_0}\) and \(\Sigma ' = (\Sigma _n')_{n \in {\mathbb {N}}_0}\) be two families of subsets \(\Sigma _n, \Sigma _n' \subset X\) satisfying the following properties:

  1. (1)

    \(\Sigma _0 = \{0\} = \Sigma _0'\);

  2. (2)

    \(\Sigma _n \subset \Sigma _{n+1}\) and \(\Sigma _n' \subset \Sigma _{n+1}'\) for all \(n \in {\mathbb {N}}_0\); and

  3. (3)

    there are \(c \in {\mathbb {N}}\) and \(C > 0\) such that \(E(f,\Sigma _{c m})_{X} \le C \cdot E(f,\Sigma '_{m})_{X}\) for all \(f \in X, m \in {\mathbb {N}}\).

Then, \(A_{q}^{\alpha }(X,\Sigma ') \subset A_{q}^{\alpha }(X,\Sigma )\) holds for arbitrary \(q \in (0,\infty ]\) and \(\alpha > 0\). More precisely, there is a constant \(K = K(\alpha ,q,c,C) > 0\) satisfying

$$\begin{aligned} \Vert f \Vert _{A_{q}^{\alpha }(X,\Sigma )} \le K \cdot \Vert f \Vert _{A_{q}^{\alpha }(X,\Sigma ')} \qquad \forall \, f \in X . \blacktriangleleft \end{aligned}$$

Remark

One can alternatively assume that \(E(f, \Sigma _{c m})_X \le C \cdot E(f, \Sigma _m ')_X\) only holds for \(m \ge m_0 \in {\mathbb {N}}\). Indeed, if this is satisfied and if we set \(c' := m_0 \, c\), then we see for arbitrary \(m \in {\mathbb {N}}\) that \(m_0 m \ge m_0\), so that

$$\begin{aligned} E(f, \Sigma _{c' m})_X = E(f, \Sigma _{c \cdot m_0 \, m})_X \le C \cdot E(f, \Sigma _{m_0 m} ')_X \le C \cdot E(f, \Sigma _m ')_X . \end{aligned}$$

Here, the last step used that \(m_0 \, m \ge m\), so that \(\Sigma _m ' \subset \Sigma _{m_0 \, m}'\). \(\blacklozenge \)

The proof of Lemma 3.1 can be found in Appendix B.1.

In [21, Chapter 7, Section 9], the authors develop a general theory regarding approximation classes of this type. To apply this theory, we merely have to verify that \(\Sigma = (\Sigma _n)_{n \in {\mathbb {N}}_0}\) satisfies the following list of axioms, which is identical to [21, Chapter 7, Equation (5.2)]:

  1. (P1)

    \(\Sigma _0 = \{0\}\);

  2. (P2)

    \(\Sigma _n \subset \Sigma _{n+1}\) for all \(n \in {\mathbb {N}}_0\);

  3. (P3)

    \(a \cdot \Sigma _n = \Sigma _n\) for all \(a \in {\mathbb {R}}{\setminus } \{0\}\) and \(n \in {\mathbb {N}}_0\);

  4. (P4)

    There is a fixed constant \(c \in {\mathbb {N}}\) with \(\Sigma _n + \Sigma _n \subset \Sigma _{cn}\) for all \(n \in {\mathbb {N}}_0\);

  5. (P5)

    \(\Sigma _{\infty } := \bigcup _{j \in {\mathbb {N}}_0} \Sigma _j\) is dense in X;

  6. (P6)

    for any \(n \in {\mathbb {N}}_0\), each \(f \in X\) has a best approximation from \(\Sigma _n\).

As we will show in Theorem 3.27, Properties (P1)–(P5) hold in \(X = X_{p}^{k}(\Omega )\) for an appropriately defined family \(\Sigma \) related to neural networks of fixed or varying network depth \(L \in {\mathbb {N}}\cup \{\infty \}\).

Property (P6), however, can fail in this setting even for the simple case of the ReLU activation function; indeed, a combination of Lemmas 3.26 and 4.4 shows that ReLU networks of bounded complexity can approximate the discontinuous function \({\mathbb {1}}_{[a,b]}\) arbitrarily well. Yet, since realizations of ReLU networks are always continuous, \({\mathbb {1}}_{[a,b]}\) is not implemented exactly by such a network; hence, no best approximation exists. Fortunately, Property (P6) is not essential for the theory from [21] to be applicable: By the arguments given in [21, Chapter 7, discussion around Equation (9.2)] (see also [4, Proposition 3.8 and Theorem 3.12]), we get the following properties of the approximation classes, which then indeed turn out to be approximation spaces, that is, quasi-Banach spaces.

Proposition 3.2

If Properties (P1)–(P5) hold, then the classes \(A_{q}^{\alpha }(X,\Sigma )\) are quasi-Banach spaces satisfying the continuous embeddings (3.2) and \(A_{q}^{\alpha }(X,\Sigma ) \hookrightarrow X\). \(\blacktriangleleft \)

Remark

Note that \(\Vert \cdot \Vert _{A_{q}^{\alpha }(X,\Sigma )}\) is in general only a quasi-norm, even if X is a Banach space and \(q \in [1,\infty ]\). Only if one additionally knows that all the sets \(\Sigma _n\) are vector spaces [that is, one can choose \(c = 1\) in Property (P4)] can one be sure that \(\Vert \cdot \Vert _{A_{q}^{\alpha }(X,\Sigma )}\) is indeed a norm. \(\blacklozenge \)

Proof

Everything except for the completeness and the embedding \(A_{q}^{\alpha }(X,\Sigma ) \hookrightarrow X\) is shown in [21, Chapter 7, discussion around Equation (9.2)]; in particular, the embeddings (3.2) are established there. All other properties claimed in Proposition 3.2 follow by combining Remark 3.5, Proposition 3.8, and Theorem 3.12 in [4]. \(\square \)
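The failure of Property (P6) discussed before Proposition 3.2 can be made concrete in dimension one. The sketch below (Python; an added illustration, not taken from the paper) realizes a trapezoid by a fixed-size ReLU network with four hidden neurons; its \(L_p\) distance to \({\mathbb {1}}_{[0,1]}\) tends to zero as the slope parameter \(\varepsilon \) shrinks, while the indicator itself, being discontinuous, is not the realization of any ReLU network, so no best approximation from this set exists.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def trapezoid(x, eps):
    # Realization of a two-layer ReLU network with 4 hidden neurons:
    # 0 outside [0, 1], 1 on [eps, 1 - eps], linear in between.
    return (relu(x) - relu(x - eps) - relu(x - (1 - eps)) + relu(x - 1)) / eps

p = 2.0
xs = np.linspace(-0.5, 1.5, 200001)
dx = xs[1] - xs[0]
indicator = ((xs >= 0) & (xs <= 1)).astype(float)
for eps in (0.1, 0.01, 0.001):
    err = (np.sum(np.abs(trapezoid(xs, eps) - indicator) ** p) * dx) ** (1 / p)
    print(f"eps = {eps:7.3f}:  L_{p:.0f} error ≈ {err:.4f}   (upper bound (2*eps)^(1/p) = {(2 * eps) ** (1 / p):.4f})")
```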

3.2 Approximation Classes of Generalized Networks

We now specialize to the setting of neural networks and consider \(d,k \in {\mathbb {N}}\), an activation function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), and a non-empty set \(\Omega \subseteq {\mathbb {R}}^d\).

Our goal is to define a family of sets of (realizations of) \(\varrho \)-networks of “complexity” \(n \in {\mathbb {N}}_{0}\). The complexity will be measured in terms of the number of connections \(W \le n\) or the number of neurons \(N \le n\), possibly with a control on how the depth L evolves with n.

Definition 3.3

(Depth growth function) A depth growth function is a non-decreasing function

$$\begin{aligned} {\mathscr {L}}: {\mathbb {N}}\rightarrow {\mathbb {N}}\cup \{\infty \}, n \mapsto {\mathscr {L}}(n) . \blacktriangleleft \end{aligned}$$

Definition 3.4

(Approximation family, approximation spaces) Given an activation function \(\varrho \), a depth growth function \({\mathscr {L}}\), a subset \(\Omega \subseteq {\mathbb {R}}^d\), and a quasi-Banach space X whose elements are (equivalence classes of) functions \(f : \Omega \rightarrow {\mathbb {R}}^k\), we define , and

(3.3)
(3.4)

To highlight the role of the activation function \(\varrho \) and the depth growth function \({\mathscr {L}}\) in the definition of the corresponding approximation classes, we introduce the specific notation

(3.5)
(3.6)

The quantities \(\Vert \cdot \Vert _{W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})}\) and \(\Vert \cdot \Vert _{N_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})}\) are defined similarly. Notice that the input and output dimensions d, k as well as the set \(\Omega \) are implicitly described by the space X. Finally, if the depth growth function is constant (\({\mathscr {L}}\equiv L\) for some \(L \in {\mathbb {N}}\)), we write \(\mathtt {W}_n(X, \varrho , L)\), etc. \(\blacktriangleleft \)

Remark 3.5

By convention, \(\mathtt {W}_0(X,\varrho ,{\mathscr {L}}) = \mathtt {N}_0(X,\varrho ,{\mathscr {L}}) := \{0\}\), while \({\mathtt {NN}}_{0,L}^{\varrho ,d,k}\) is the set of constant functions \(f \equiv c\), where \(c \in {\mathbb {R}}^{k}\) is arbitrary (Lemma 2.13), and \({\mathtt {NN}}_{\infty ,L,0}^{\varrho ,d,k}\) is the set of affine functions.\(\blacklozenge \)

Remark 3.6

Lemma 2.14 shows that \({\mathtt {NN}}^{\varrho ,d,k}_{W,L} = {\mathtt {NN}}^{\varrho ,d,k}_{W,W}\) if \(L \ge W \ge 1\); hence, the approximation family associated with any depth growth function \({\mathscr {L}}\) is also generated by the modified depth growth function \({\mathscr {L}}' (n) := \min \{n, {\mathscr {L}}(n) \}\), which satisfies \({\mathscr {L}}' (n) \in \{1, \dots , n \}\) for all \(n \in {\mathbb {N}}\).

In light of Eq. (2.1), a similar observation holds for the neuron-based families \(\mathtt {N}_{n}(X,\varrho ,{\mathscr {L}})\) with \({\mathscr {L}}' (n) := \min \{n+1, {\mathscr {L}}(n) \}\).

It will be convenient, however, to explicitly specify unbounded depth as \({\mathscr {L}} \equiv +\infty \) rather than the equivalent form \({\mathscr {L}}(n) = n\) (resp. rather than \({\mathscr {L}}(n) = n+1\)).\(\blacklozenge \)

We will further discuss the role of the depth growth function in Sect. 3.5. Before that, we compare approximation with generalized and strict networks.

3.3 Approximation with Generalized Versus Strict Networks

In this subsection, we show that if one only considers the approximation theoretic properties of the resulting function classes, then—under extremely mild assumptions on the activation function \(\varrho \)—it does not matter whether we consider strict or generalized networks, at least on bounded domains \(\Omega \subset {\mathbb {R}}^d\). Here, instead of the approximating sets for generalized neural networks defined in (3.3)–(3.4) we wish to consider the corresponding sets for strict neural networks, given by , and

and the associated approximation classes that we denote by

Since generalized networks are at least as expressive as strict ones, these approximation classes embed into the corresponding classes for generalized networks, as we now formalize.

Proposition 3.7

Consider \(\varrho \) an activation function, \({\mathscr {L}}\) a depth growth function, and X a quasi-Banach space of (equivalence classes of) functions from a subset \(\Omega \subseteq {\mathbb {R}}^{d}\) to \({\mathbb {R}}^{k}\). For any \(\alpha >0\) and \(q \in (0,\infty ]\), we have and ; hence,

Proof

We give the proof for approximation spaces associated with connection complexity; the proof is similar for the case of neuron complexity. Obviously, for all \(n \in {\mathbb {N}}_{0}\), so that the approximation errors satisfy for all \(n \in {\mathbb {N}}_0\). This implies whence . \(\square \)

Under mild conditions on \(\varrho \), the converse holds on bounded domains when approximating in \(L_{p}\). This also holds on unbounded domains for activation functions that can represent the identity.

Theorem 3.8

(Approximation classes of strict vs. generalized networks) Consider \(d \in {\mathbb {N}}\), a measurable set \(\Omega \subseteq {\mathbb {R}}^{d}\) with nonzero measure, and \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) an activation function. Assume either that:

  • \(\Omega \) is bounded, \(\varrho \) is continuous, and \(\varrho \) is differentiable at some \(x_{0} \in {\mathbb {R}}\) with \(\varrho '(x_{0}) \ne 0\); or that

  • \(\varrho \) can represent the identity \(\mathrm {id}: {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto x\) with m terms for some \(m \in {\mathbb {N}}\).

Then, for any depth growth function \({\mathscr {L}}\), \(k \in {\mathbb {N}}\), \(\alpha > 0\), \(p,q \in (0,\infty ]\), with \(X := X_{p}^{k}(\Omega )\) as in Eq. (1.3), we have the identities

and there exists \(C < \infty \) such that

Before giving the proof, let us clarify the precise choice of (quasi)-norm for the vector-valued spaces from Eq. (1.3). For \(f = (f_{1},\ldots ,f_{k}): \Omega \rightarrow {\mathbb {R}}^{k}\) and \(0< p < \infty \), it is defined by \( \Vert f\Vert _{L_p(\Omega ;{\mathbb {R}}^k)}^{p} := \sum _{\ell =1}^{k} \Vert f_{\ell }\Vert _{L_{p}(\Omega ;{\mathbb {R}})}^{p} = \int _{\Omega } |f(x)|_{p}^{p} \, {\mathrm{d}}x \), where \(|u|_{p}^{p} := \sum _{\ell =1}^{k}|u_{\ell }|^{p}\) for each \(u \in {\mathbb {R}}^{k}\). For \(p=\infty \), we use the definition \(\Vert f\Vert _{\infty } := \max _{\ell = 1,\ldots ,k} \Vert f_{\ell }\Vert _{L_{\infty }(\Omega ;{\mathbb {R}})}\).
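As a quick sanity check of this convention, the following sketch (Python; an added illustration with an arbitrarily chosen toy function) evaluates the vector-valued quasi-norm on a uniform grid, both for finite p and for \(p = \infty \).

```python
import numpy as np

def vector_Lp_norm(F, xs, p):
    # F has shape (len(xs), k): samples of f = (f_1, ..., f_k) on the uniform grid xs.
    # Finite p: ||f||_p^p = int |f(x)|_p^p dx, approximated by a Riemann sum;
    # p = inf:  max over the components of their sup-norms, i.e. the largest entry of |F|.
    if np.isinf(p):
        return np.max(np.abs(F))
    dx = xs[1] - xs[0]
    return (np.sum(np.abs(F) ** p) * dx) ** (1.0 / p)

xs = np.linspace(0.0, 1.0, 100001)
F = np.stack([np.sin(np.pi * xs), xs ** 2], axis=1)      # toy f: [0,1] -> R^2
print(vector_Lp_norm(F, xs, 2.0))     # close to sqrt(1/2 + 1/5) ≈ 0.8367
print(vector_Lp_norm(F, xs, np.inf))  # equals 1.0, the sup-norm of sin(pi x)
```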

Proof

When \(\varrho \) can represent the identity with m terms, we rely on Lemma 2.25 and on the estimate \({\mathscr {L}}(n) \le {\mathscr {L}}(m^2 n)\) to obtain for any \(n \in {\mathbb {N}}\) that

and similarly , so that

We now establish similar results for the case where \(\Omega \) is bounded, \(\varrho \) is continuous, and \(\varrho \) is differentiable at some \(x_{0} \in {\mathbb {R}}\) with \(\varrho '(x_{0}) \ne 0\). We rely on Lemma 2.22. First, note by continuity of \(\varrho \) that any \(f \in {\mathtt {NN}}^{\varrho ,d,k} \supset {\mathtt {SNN}}^{\varrho ,d,k}\) is a continuous function \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^k\). Furthermore, since \(\Omega \) is bounded, \({\overline{\Omega }}\) is compact, so that \(f|_{{\overline{\Omega }}}\) is uniformly continuous and bounded. Clearly, this implies that \(f|_{\Omega }\) is uniformly continuous and bounded as well. Since , this implies

and similarly for . Since \(\Omega \subset {\mathbb {R}}^d\) is bounded, locally uniform convergence on \({\mathbb {R}}^d\) implies convergence in . Hence, for any \(n \in {\mathbb {N}}_0\), using that \({\mathscr {L}}(n) \le {\mathscr {L}}(4n)\), Lemma 2.22 yields

where the closure is taken with respect to the topology induced by . Similarly, we have

Now, for an arbitrary subset , observe by continuity of that

that is, if one is only interested in the distance of functions f to the set \(\Gamma \), then switching from \(\Gamma \) to its closure \({\overline{\Gamma }}\) (computed in ) does not change the resulting distance. Therefore,

In both settings (\(\varrho \) can represent the identity, or \(\Omega \) is bounded and \(\varrho \) differentiable at \(x_0\)), Lemma 3.1 shows and for some \(C \in (0,\infty )\). The conclusion follows using Proposition 3.7. \(\square \)

3.4 Connectivity Versus Number of Neurons

Lemma 3.9

Consider \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) an activation function, \({\mathscr {L}}\) a depth growth function, \(d,k \in {\mathbb {N}}\), \(p \in (0,\infty ]\) and a measurable \(\Omega \subseteq {\mathbb {R}}^d\) with nonzero measure. With \(X := X_{p}^{k}(\Omega )\), we have for any \(\alpha >0\) and \(q \in (0,\infty ]\)

and there exists \(c > 0\) such that

When \(L := \sup _{n} {\mathscr {L}}(n)=2\) (i.e., for shallow networks), the exponent \(\alpha /2\) can be replaced by \(\alpha \); that is, \(W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}}) = N_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) with equivalent norms.\(\blacktriangleleft \)

Remark

We will see in Lemma 3.10 that the two scales of spaces can indeed differ if, for instance, \(\varrho = \varrho _r\) is a power of the ReLU, if \(\Omega \) is bounded, and if \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n)\) satisfies \(3 \le L < \infty \). In general, however, one cannot expect the spaces to always be distinct. For instance, if \(\varrho \) is the activation function constructed in [45, Theorem 4], if \(L \ge 3\), and if \(\Omega \) is bounded, then both scales of spaces coincide with \(X_p(\Omega )\). \(\blacklozenge \)

Proof

We give the proof for generalized networks. By Lemma 2.14 and Eq. (2.3),

$$\begin{aligned} {\mathtt {NN}}^{\varrho ,d,k}_{n,{\mathscr {L}}(n),\infty } \subset {\mathtt {NN}}^{\varrho ,d,k}_{n,{\mathscr {L}}(n),n} \subset {\mathtt {NN}}^{\varrho ,d,k}_{\infty ,{\mathscr {L}}(n),n} \subset {\mathtt {NN}}^{\varrho ,d,k}_{n^{2}+(d+k)n+dk,{\mathscr {L}}(n),n} \subset {\mathtt {NN}}^{\varrho ,d,k}_{n^{2}+(d+k)n+dk,{\mathscr {L}}(n),\infty } \end{aligned}$$

for any \(n \in {\mathbb {N}}\). Hence, the approximation errors satisfy

(3.7)

By the first inequality in (3.7), and .

When \(L=2\), by the remark following Eq. (2.3) we get \({\mathtt {NN}}^{\varrho ,d,k}_{\infty ,{\mathscr {L}}(n),n} \subset {\mathtt {NN}}^{\varrho ,d,k}_{(d+k)n,{\mathscr {L}}(n),\infty }\); hence, so that Lemma 3.1 shows , with a corresponding (quasi)-norm estimate; hence, these spaces coincide with equivalent (quasi)-norms.

For the general case, observe that \(n^{2}+(d+k)n +dk \le (n+\gamma )^{2}\) with \(\gamma := \max \{ d,k \}\). Let us first consider the case \(q < \infty \). In this case, we note that if \((n+\gamma )^2 + 1 \le m \le (n+\gamma +1)^2\), then \(n^2 \le m \le (2\gamma +2)^2 \, n^2\), and thus, \(m^{\alpha q - 1} \lesssim n^{2 \alpha q - 2}\), where the implied constant only depends on \(\alpha , q\), and \(\gamma \). This implies

$$\begin{aligned} \sum _{m=(n+\gamma )^2 + 1}^{(n+\gamma +1)^2} m^{\alpha q - 1} \le C \cdot n^{2 \alpha q - 1} \qquad \forall \, n \in {\mathbb {N}}\end{aligned}$$

where \(C = C(\alpha ,q,\gamma ) < \infty \), since the sum has \(((n+\gamma )+1)^2 - (n+\gamma )^2 = 2 n+2\gamma + 1 \le 4n (2\gamma +1)\) many summands. By the second inequality in (3.7), we get for any \(n \in {\mathbb {N}}\)

It follows that

To conclude, we use that with \(C' = \sum _{m=1}^{(\gamma +1)^{2}} m^{\alpha q-1}\).

The proof for \(q=\infty \) is similar. The proof for strict networks follows along similar lines. \(\square \)
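The elementary counting estimate used above (the bound on \(\sum _{m=(n+\gamma )^2+1}^{(n+\gamma +1)^2} m^{\alpha q-1}\)) is easy to probe numerically; the sketch below (Python; an added illustration with arbitrarily chosen values of \(\alpha \), q and \(\gamma \)) confirms that the block sums stay below a fixed multiple of \(n^{2\alpha q-1}\) over the tested range.

```python
import numpy as np

alpha, q, gamma = 0.7, 1.5, 2          # arbitrary test parameters
ratios = []
for n in range(1, 2001):
    lo, hi = (n + gamma) ** 2 + 1, (n + gamma + 1) ** 2
    m = np.arange(lo, hi + 1, dtype=float)
    block_sum = np.sum(m ** (alpha * q - 1))
    ratios.append(block_sum / n ** (2 * alpha * q - 1))
print(f"max block_sum / n^(2*alpha*q - 1) for n <= 2000: {max(ratios):.4f}")
```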

The final result in this subsection shows that the inclusions in Lemma 3.9 are quite sharp.

Lemma 3.10

For \(r \in {\mathbb {N}}\), define \(\varrho _r : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto (x_+)^r\).

Let \(\Omega \subset {\mathbb {R}}^d\) be bounded and measurable with non-empty interior. Let \(L, L' \in {\mathbb {N}}_{\ge 2}\), let \(r_1, r_2 \in {\mathbb {N}}\), let \(p_1,p_2,q_1,q_2 \in (0,\infty ]\), and \(\alpha , \beta > 0\). Then, the following hold:

  1. (1)

    If , then \(L' - 1 \ge \tfrac{\beta }{\alpha } \cdot \lfloor L/2 \rfloor \).

  2. (2)

    If , then \(\lfloor L/2 \rfloor \ge \frac{\alpha }{\beta } \cdot (L' - 1)\).

In particular, if , then \(L = 2\).\(\blacktriangleleft \)

The proof of this result is given in Appendix E.

3.5 Role of the Depth Growth Function

In this subsection, we investigate the relation between approximation classes associated with different depth growth functions. First, we define a comparison rule between depth growth functions.

Definition 3.11

(Comparison between depth growth functions) The depth growth function \({\mathscr {L}}\) is dominated by the depth growth function \({\mathscr {L}}'\) (denoted \({\mathscr {L}} \preceq {\mathscr {L}}'\) or \({\mathscr {L}}' \succeq {\mathscr {L}}\)) if there are \(c,n_{0} \in {\mathbb {N}}\) such that

$$\begin{aligned} \forall \, n \ge n_{0}: \quad {\mathscr {L}}(n) \le {\mathscr {L}}'(cn) . \end{aligned}$$
(3.8)

Observe that \({\mathscr {L}} \le {\mathscr {L}}'\) implies \({\mathscr {L}} \preceq {\mathscr {L}}'\).

The two depth growth functions are equivalent (denoted \({\mathscr {L}} \sim {\mathscr {L}}'\)) if \({\mathscr {L}} \preceq {\mathscr {L}}'\) and \({\mathscr {L}} \succeq {\mathscr {L}}'\), that is to say if there exist \(c,n_{0} \in {\mathbb {N}}\) such that for each \(n \ge n_{0}\), \({\mathscr {L}}(n) \le {\mathscr {L}}'(c n)\) and \({\mathscr {L}}'(n) \le {\mathscr {L}}(c n)\). This defines an equivalence relation on the set of depth growth functions. \(\blacktriangleleft \)

Lemma 3.12

Consider two depth growth functions \({\mathscr {L}}\), \({\mathscr {L}}'\). If \({\mathscr {L}} \preceq {\mathscr {L}}'\), then for each \(\alpha > 0\) and \(q \in (0,\infty ]\), there is a constant \(C = C({\mathscr {L}},{\mathscr {L}}',\alpha ,q) \in [1,\infty )\) such that:

for each activation function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\), each (bounded or unbounded) set \(\Omega \subset {\mathbb {R}}^d\), and each quasi-Banach space X of (equivalence classes of) functions \(f: \Omega \rightarrow {\mathbb {R}}^{k}\).

The same holds with (resp. ) instead of (resp. ).

The constant C depends only on the constants \(c,n_{0} \in {\mathbb {N}}\) involved in (3.8) and on \(\alpha , q\).\(\blacktriangleleft \)

Proof

Let \(c, n_0 \in {\mathbb {N}}\) as in Eq. (3.8). For \(n \ge n_0\), we then have \({\mathscr {L}}(n) \le {\mathscr {L}}'(c n)\), and hence,

$$\begin{aligned}&{\mathtt {NN}}^{\varrho ,d,k}_{n,{\mathscr {L}}(n),\infty } \subset {\mathtt {NN}}^{\varrho ,d,k}_{n, {\mathscr {L}}'(c n),\infty } \subset {\mathtt {NN}}^{\varrho ,d,k}_{c n, {\mathscr {L}}'(c n),\infty } ,\\&{\mathtt {NN}}^{\varrho ,d,k}_{\infty ,{\mathscr {L}}(n),n} \subset {\mathtt {NN}}^{\varrho ,d,k}_{\infty , {\mathscr {L}}'(c n),n} \subset {\mathtt {NN}}^{\varrho ,d,k}_{\infty , {\mathscr {L}}'(c n),cn} , \end{aligned}$$

from which we easily get

$$\begin{aligned} E\big (f, \mathtt {W}_{c n} (X, \varrho , {\mathscr {L}}') \big )_X&\le E\big ( f, \mathtt {W}_{n} (X, \varrho , {\mathscr {L}}) \big )_X \qquad \forall n \ge n_0 ,\\ E\big (f, \mathtt {N}_{c n} (X, \varrho , {\mathscr {L}}') \big )_X&\le E\big ( f, \mathtt {N}_{n} (X, \varrho , {\mathscr {L}}) \big )_X \qquad \forall \, n \ge n_0 . \end{aligned}$$

Now, Lemma 3.1 and the associated remark complete the proof. Exactly the same proof works for strict networks; one just has to replace \({\mathtt {NN}}\) by \({\mathtt {SNN}}\) everywhere. \(\square \)

As a direct consequence of Lemma 3.12, we see that equivalent depth growth functions induce the same approximation spaces.

Theorem 3.13

If \({\mathscr {L}}, {\mathscr {L}}'\) are two depth-growth functions satisfying \({\mathscr {L}} \sim {\mathscr {L}}'\), then for any \(\alpha > 0\) and \(q \in (0,\infty ]\), there is a constant \(C \in [1,\infty )\) such that

for each activation function \(\varrho \), each \(\Omega \subset {\mathbb {R}}^d\), and each quasi-Banach space X of (equivalence classes of) functions \(f: \Omega \rightarrow {\mathbb {R}}^{k}\). The same holds with (resp. ) instead of (resp. ). The constant C depends only on the constants \(c,n_{0} \in {\mathbb {N}}\) in Definition 3.11 and on \(\alpha , q\). \(\blacktriangleleft \)

Theorem 3.13 shows in particular that if \(L := \sup _{n} {\mathscr {L}}(n) < \infty \), then with equivalent “proto-norms” (and similarly with instead of or with strict networks instead of generalized ones). Indeed, it is easy to see that \({\mathscr {L}} \sim {\mathscr {L}}'\) if \({\sup _{n} {\mathscr {L}}(n) = \sup _{n} {\mathscr {L}}'(n) = L < \infty }\).

Lemma 3.14

Consider \({\mathscr {L}}\) a depth growth function and \(\varepsilon > 0\).

  1. (1)

    if \({\mathscr {L}}+\varepsilon \preceq {\mathscr {L}}\) then \({\mathscr {L}}+b \sim {\mathscr {L}}\) for each \(b \ge 0\);

  2. (2)

    if \(e^{\varepsilon }{\mathscr {L}} \preceq {\mathscr {L}}\) then \(a{\mathscr {L}}+b \sim {\mathscr {L}}\) for each \(a \ge 1\), \(b \ge 1-a\). \(\blacktriangleleft \)

Proof

For the first claim, we first show by induction on \(k \in {\mathbb {N}}\) that \({\mathscr {L}}+k\varepsilon \preceq {\mathscr {L}}\). For \(k=1\), this holds by assumption. For the induction step, recall that \({\mathscr {L}}+k\varepsilon \preceq {\mathscr {L}}\) simply means that there are \(c, n_{0} \in {\mathbb {N}}\) such that \({\mathscr {L}}(n)+k\varepsilon \le {\mathscr {L}}(cn)\) for all \(n \in {\mathbb {N}}_{\ge n_{0}}\). Therefore, if \(n \ge n_{0}\), then \({\mathscr {L}}(n)+(k+1)\varepsilon \le {\mathscr {L}}(c n)+\varepsilon \le {\mathscr {L}}(c^{2}n)\) since \(n' = cn \ge n \ge n_{0}\). Now, note that if \({\mathscr {L}} \le {\mathscr {L}}'\), then also \({\mathscr {L}} \preceq {\mathscr {L}}'\). Therefore, given \(b \ge 0\) we choose \(k \in {\mathbb {N}}\) such that \(b \le k\varepsilon \) and get \({\mathscr {L}} \preceq {\mathscr {L}} + b \preceq {\mathscr {L}} + k \varepsilon \preceq {\mathscr {L}}\), so that all these depth-growth functions are equivalent.

For the second claim, a similar induction yields \(e^{k\varepsilon } {\mathscr {L}} \preceq {\mathscr {L}}\) for all \(k \in {\mathbb {N}}\). Now, given \(a \ge 1\) and \(b \ge 1-a\), we choose \(k \in {\mathbb {N}}\) such that \(a+b_+ \le e^{k\varepsilon }\), where \(b_+ = \max \{0, b\}\). There are now two cases: If \(b \ge 0\), then clearly \({\mathscr {L}} \le a {\mathscr {L}} \le a {\mathscr {L}} + b\). If otherwise \(b < 0\), then \(b {\mathscr {L}} \le b\), since \({\mathscr {L}} \ge 1\), and hence, \( {\mathscr {L}} = a {\mathscr {L}} + (1 - a) {\mathscr {L}} \le a {\mathscr {L}} + b {\mathscr {L}} \le a {\mathscr {L}} + b \). Therefore, we see in both cases that \( {\mathscr {L}} \le a {\mathscr {L}} + b_+ \le (a + b_+) {\mathscr {L}} \le e^{k \varepsilon } \, {\mathscr {L}} \preceq {\mathscr {L}} \). \(\square \)

The following two examples discuss elementary properties of poly-logarithmic and polynomial growth functions, respectively.

Example 3.15

Assume there are \(q \ge 1\), \(\alpha , \beta > 0\) such that \(|{\mathscr {L}}(n) - \alpha \log ^{q} n| \le \beta \) for all \(n \in {\mathbb {N}}\).

Choosing \(c \in {\mathbb {N}}\) such that \(\varepsilon := \alpha \log ^{q} c - 2\beta >0\), we have

$$\begin{aligned} {\mathscr {L}}(n)+\varepsilon\le & {} \alpha \log ^{q}n + \beta + \varepsilon = \alpha \log ^{q} n + \alpha \log ^{q} c -\beta \\\le & {} \alpha (\log c + \log n)^{q} -\beta = \alpha \log ^{q} (cn) -\beta \le {\mathscr {L}}(cn) \end{aligned}$$

for all \(n \in {\mathbb {N}}\); hence, \({\mathscr {L}}+\varepsilon \preceq {\mathscr {L}}\). Here, we used that \(x^q + y^q = \Vert (x,y)\Vert _{\ell ^q}^q \le \Vert (x,y)\Vert _{\ell ^1}^q = (x+y)^q\) for \(x,y \ge 0\).

By Lemma 3.14, we get \({\mathscr {L}} \sim {\mathscr {L}}+b\) for arbitrary \(b \ge 0\). Moreover, as \(\lfloor \alpha \log ^{q}n\rfloor \le \alpha \log ^{q} n \le {\mathscr {L}}(n)+\beta \), we have \(\max (1,\lfloor \alpha \log ^{q}(\cdot )\rfloor ) \preceq {\mathscr {L}}+\beta \sim {\mathscr {L}}\). Similarly, \({\mathscr {L}}(n) \le \lfloor \alpha \log ^{q}n\rfloor + \beta +1\); hence, \({\mathscr {L}} \sim \max (1,\lfloor \alpha \log ^{q} (\cdot ) \rfloor )\).

Example 3.16

Assume there are \(\gamma > 0\) and \(C \ge 1\) such that \(1/C \le {\mathscr {L}}(n)/n^{\gamma } \le C\) for all \(n \in {\mathbb {N}}\).

Choosing any integer \(c \ge (2C^2)^{1/\gamma }\), we have \(2C^{2}c^{-\gamma } \le 1\), and hence,

$$\begin{aligned} 2{\mathscr {L}}(n) \le 2C n^{\gamma } \le 2Cc^{-\gamma } (cn)^{\gamma } \le 2Cc^{-\gamma } C {\mathscr {L}}(cn) = 2C^{2}c^{-\gamma } {\mathscr {L}}(cn) \le {\mathscr {L}}(cn) \end{aligned}$$

for all \(n \in {\mathbb {N}}\); hence, \(2{\mathscr {L}} \preceq {\mathscr {L}}\). By Lemma 3.14, we get \({\mathscr {L}} \sim a{\mathscr {L}}+b\) for each \(a \ge 1, b \ge 1-a\). Moreover, we have \(\lceil n^{\gamma }\rceil \le n^{\gamma }+1 \le C{\mathscr {L}}(n)+1\) for all \(n \in {\mathbb {N}}\); hence, \(\lceil (\cdot )^{\gamma }\rceil \preceq C{\mathscr {L}}+1 \sim {\mathscr {L}}\). Similarly, \({\mathscr {L}}(n) \le C n^{\gamma } \le C \lceil n^{\gamma } \rceil \), and thus, \({\mathscr {L}} \sim \lceil (\cdot )^{\gamma }\rceil \).
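As a numerical sanity check of Example 3.16, the following sketch (Python; an added illustration) takes \({\mathscr {L}}(n) = \lceil n^{\gamma }\rceil \), for which the two-sided bound of the example holds with \(C = 2\), and verifies \(2{\mathscr {L}}(n) \le {\mathscr {L}}(cn)\) for an integer \(c \ge (2C^{2})^{1/\gamma }\) over a large range of n.

```python
import math

gamma, C = 0.5, 2.0                         # L(n) = ceil(n**gamma) satisfies 1/C <= L(n)/n**gamma <= C with C = 2
L = lambda n: math.ceil(n ** gamma)
c = math.ceil((2 * C ** 2) ** (1 / gamma))  # integer c with 2 * C**2 * c**(-gamma) <= 1, as in Example 3.16

bad = [n for n in range(1, 100001) if 2 * L(n) > L(c * n)]
print(f"c = {c}; counterexamples to 2*L(n) <= L(c*n) for n <= 1e5: {bad}")   # expected: []
```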

In the next sections, we conduct preliminary investigations on the role of the (finite or infinite) depth L in terms of the associated approximation spaces for \(\varrho _{r}\)-networks. A general understanding of the role of depth growth largely remains an open question. A very surprising result in this direction was recently obtained by Yarotsky [69].

Remark 3.17

It is not difficult to show that approximation classes defined on nested sets \(\Omega ' \subset \Omega \subset {\mathbb {R}}^{d}\) satisfy natural restriction properties. More precisely, the map

is well defined and bounded (meaning, ), and the same holds for the spaces \(N_q^\alpha \) instead of \(W_q^\alpha \).

Furthermore, the approximation classes of vector-valued functions \(f: \Omega \rightarrow {\mathbb {R}}^{k}\) are Cartesian products of real-valued function classes; that is,

is bijective and . Again, the same holds for the spaces \(N_q^\alpha \) instead of \(W_q^\alpha \). For the sake of brevity, we omit the easy proofs.\(\blacklozenge \)

3.6 Approximation Classes are Approximation Spaces

We now verify that the main axioms needed to apply Proposition 3.2 are satisfied. Properties (P1)–(P4) hold without any further assumptions:

Lemma 3.18

Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be arbitrary, and let \({\mathscr {L}}\) be a depth growth function. The sets \(\Sigma _{n}\) defined in (3.5)–(3.6) satisfy Properties (P1)–(P4), with \(c=2+\min \{d,k\}\) for Property (P4).\(\blacktriangleleft \)

Proof

We generically write \(\Sigma _n\) to indicate either \(\mathtt {W}_n(X,\varrho ,{\mathscr {L}})\) or \(\mathtt {N}_n(X,\varrho ,{\mathscr {L}})\).

Property (P1). We have \(\Sigma _0 = \{0\}\) by definition. For later use, let us also verify that \(0 \in \Sigma _n\) for \(n \in {\mathbb {N}}\). Indeed, Lemma 2.13 shows \(0 \in {\mathtt {NN}}^{\varrho ,d,k}_{0,1,0} \subset {\mathtt {NN}}^{\varrho ,d,k}_{n,L,m}\) for all \(n,m,L \in {\mathbb {N}}\cup \{\infty \}\), and hence, \(0 \in \Sigma _n\) for all \(n \in {\mathbb {N}}\).

Property (P2). The inclusions \({\mathtt {NN}}_{W, L,\infty }^{\varrho ,d,k} \subset {\mathtt {NN}}_{W+1, L',\infty }^{\varrho ,d,k}\) and \({\mathtt {NN}}_{\infty , L,N}^{\varrho ,d,k} \subset {\mathtt {NN}}_{\infty , L',N+1}^{\varrho ,d,k}\) for \(W,N \in {\mathbb {N}}_{0}\) and \(L, L' \in {\mathbb {N}}\cup \{\infty \}\) with \(L \le L'\) hold by the very definition of these sets. As \({\mathscr {L}}\) is non-decreasing (that is, \({\mathscr {L}}(n+1) \ge {\mathscr {L}}(n)\)), we thus get \(\Sigma _n \subset \Sigma _{n+1}\) for all \(n \in {\mathbb {N}}\). As seen in the proof of Property (P1), this also holds for \(n = 0\).

Property (P3). By Lemma 2.17-(1), if \(f \in {\mathtt {NN}}_{W, L,N}^{\varrho ,d,k}\), then \(a \cdot f \in {\mathtt {NN}}_{W, L, N}^{\varrho ,d,k}\) for any \(a \in {\mathbb {R}}\). Therefore, \(a \cdot \Sigma _n \subset \Sigma _n\) for each \(a \in {\mathbb {R}}\) and \(n \in {\mathbb {N}}\). The converse is proved similarly for \(a \ne 0\); hence, \(a \cdot \Sigma _n = \Sigma _n\) for each \(a \in {\mathbb {R}}{\setminus } \{0\}\) and \(n \in {\mathbb {N}}\). For \(n = 0\), this holds trivially.

Property (P4). The claim is trivial for \(n=0\). For \(n \in {\mathbb {N}}\), let \(f_{1},f_{2} \in \Sigma _n\) be arbitrary.

For the case of \(\Sigma _n = \mathtt {W}_n(X,\varrho ,{\mathscr {L}})\), let \(g_{1},g_{2} \in {\mathtt {NN}}_{n,{\mathscr {L}}(n),\infty }^{\varrho ,d,k}\) such that \(f_{i} = g_{i}|_{\Omega }\). Lemma 2.14 shows that \(g_{i} \in {\mathtt {NN}}_{n,L',\infty }^{\varrho ,d,k}\) with \(L' := \min \{{\mathscr {L}}(n), n\}\). By Lemma 2.17-(3), setting \(c_{0} := \min \{d,k\}\), and \(W' := 2n + c_{0} \cdot (L'-1) \le (2+c_{0})n\), we have \( g_{1}+g_{2} \in {\mathtt {NN}}_{W',L'}^{\varrho ,d,k} \subset {\mathtt {NN}}_{(2+c_{0})n,{\mathscr {L}}((2+c_{0})n)}^{\varrho ,d,k} \), where for the last inclusion we used that \(L' \le {\mathscr {L}}(n)\), that \({\mathscr {L}}\) is non-decreasing, and that \(n \le (2+c_{0})n\).

For the case of \(\Sigma _n = \mathtt {N}_n(X,\varrho ,{\mathscr {L}})\), consider similarly \(g_{1},g_{2} \in {\mathtt {NN}}_{\infty ,{\mathscr {L}}(n),n}^{\varrho ,d,k}\) such that \(f_{i} = g_{i}|_{\Omega }\). By (2.1), \(g_{i} \in {\mathtt {NN}}_{\infty ,L',n}^{\varrho ,d,k}\) with \(L' := \min \{{\mathscr {L}}(n), n+1\}\). By Lemma 2.17-(3) again, setting \(c_{0} := \min \{d,k\}\), and \(N' := 2n + c_{0} \cdot (L'-1) \le (2+c_{0})n\), we get \( g_{1}+g_{2} \in {\mathtt {NN}}_{\infty ,L',N'}^{\varrho ,d,k} \subset {\mathtt {NN}}_{\infty ,{\mathscr {L}}((2+c_{0})n),(2+c_{0})n}^{\varrho ,d,k} \).

By Definitions (3.3)–(3.4), this shows in all cases that \(f_{1} + f_{2} \in \Sigma _{(2+c_{0})n}\).

\(\square \)

We now focus on Property (P5), in the function space \(X := X_p^k (\Omega )\) with \(p \in (0,\infty ]\) and \(\Omega \subset {\mathbb {R}}^d\) a measurable set with nonzero measure. First, as proved in Appendix B.2, these spaces are indeed complete, and each \(f \in X_p^k (\Omega )\) can be extended to an element \({\widetilde{f}} \in X_p^k ({\mathbb {R}}^d)\).

Definition 3.19

(admissible domain) For brevity, in the rest of the paper we refer to \(\Omega \subseteq {\mathbb {R}}^{d}\) as an admissible domain if, and only if, it is Borel measurable with nonzero measure. \(\blacktriangleleft \)

Lemma 3.20

Consider \(\Omega \subseteq {\mathbb {R}}^{d}\) an admissible domain, \(k \in {\mathbb {N}}\), and \(C_{0}({\mathbb {R}}^{d};{\mathbb {R}}^{k})\) the space of continuous functions \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}^{k}\) that vanish at infinity.

For \(0<p<\infty \), we have ; likewise, . The spaces are quasi-Banach spaces.\(\blacktriangleleft \)

In light of definitions (3.3)–(3.4), we have

with \(L := \sup _{n} {\mathscr {L}}(n) \in {\mathbb {N}}\cup \{+\infty \}\). Properties (P3) and (P4) imply that \(\Sigma _{\infty }\) is a linear space. We study its density in X, dealing first with a few degenerate cases.

3.6.1 Degenerate Cases

Property (P5) can fail to hold for certain activation functions: When \(\varrho \) is a polynomial and \({\mathscr {L}}\) is bounded, the set \(\Sigma _{\infty }\) only contains polynomials of bounded degree; hence, for non-trivial \(\Omega \), \(\Sigma _{\infty }\) is not dense in X. Property (P5) fails again for networks with a single hidden layer (\(L=2\)) and certain domains such as \(\Omega = {\mathbb {R}}^{d}\). Indeed, the realization of any network in \({\mathtt {NN}}^{\varrho ,d,k}_{\infty ,2,\infty }\) is a finite linear combination of ridge functions \(x \mapsto \varrho (A_{i}x+b_{i})\). A ridge function is in \(L_{p}({\mathbb {R}}^{d})\) (\(p<\infty \)) only if it is zero. Moreover, one can check that if a linear combination of ridge functions belongs to \(L_{p}({\mathbb {R}}^{d})\) (\(1 \le p \le 2\)), then it vanishes; hence, \(\Sigma _{\infty }\) is again not dense in X.

3.6.2 Non-degenerate Cases

We now show that Property (P5) holds under proper assumptions on the activation function \(\varrho \), the depth growth function \({\mathscr {L}}\), and the domain \(\Omega \). The proof uses the celebrated universal approximation theorem for multilayer feedforward networks [43]. In light of the above observations, we introduce the following definition:

Definition 3.21

An activation function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is called non-degenerate if the following hold:

  1. (1)

    \(\varrho \) is Borel measurable;

  2. (2)

    \(\varrho \) is locally bounded; that is, \(\varrho \) is bounded on \([-R,R]\) for each \(R > 0\);

  3. (3)

    there is a closed null-set \(A \subset {\mathbb {R}}\) such that \(\varrho \) is continuous at every \(x_0 \in {\mathbb {R}}{\setminus } A\);

  4. (4)

    there does not exist a polynomial \(p : {\mathbb {R}}\rightarrow {\mathbb {R}}\) such that \(\varrho (x) = p(x)\) for almost all \(x \in {\mathbb {R}}\). \(\blacktriangleleft \)

Remark

A continuous activation function is non-degenerate if and only if it is not a polynomial. \(\blacklozenge \)

These are precisely the assumptions imposed on the activation function in [43], where the following version of the universal approximation theorem is shown:

Theorem 3.22

[43, Theorem 1] Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be a non-degenerate activation function, \(K \subset {\mathbb {R}}^d\) be compact, \(\varepsilon > 0\), and \(f : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be continuous. Then, there is \(N \in {\mathbb {N}}\) and suitable \(b_j, c_j \in {\mathbb {R}}\), \(w_j \in {\mathbb {R}}^d\), \(1 \le j \le N\) such that \(g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, x \mapsto \sum _{j=1}^N c_j \, \varrho (\langle w_j , x \rangle + b_j)\) satisfies \(\Vert f - g\Vert _{L_\infty (K)} \le \varepsilon \). \(\blacktriangleleft \)
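Theorem 3.22 is purely existential and gives no recipe for the weights. As a purely illustrative complement (not the constructive argument of [43]), the following sketch (Python with NumPy) fits only the outer coefficients \(c_j\) of a single-hidden-layer network with randomly drawn inner weights by least squares on a grid; with enough random neurons this typically yields a small uniform error on the chosen compact set.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = np.tanh                                   # a continuous, non-polynomial (hence non-degenerate) activation

# Target: a continuous function on the compact set K = [0, 1]^2.
f = lambda X: np.sin(2 * np.pi * X[:, 0]) * np.cos(np.pi * X[:, 1])

N = 400                                         # number of hidden neurons (random inner weights)
W = rng.normal(scale=3.0, size=(N, 2))
b = rng.uniform(-3.0, 3.0, size=N)

g1, g2 = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
X = np.column_stack([g1.ravel(), g2.ravel()])
H = rho(X @ W.T + b)                            # features rho(<w_j, x> + b_j), j = 1, ..., N
c, *_ = np.linalg.lstsq(H, f(X), rcond=None)    # fit the outer coefficients c_j only

print("sup-error of g(x) = sum_j c_j rho(<w_j,x> + b_j) on the grid:", np.max(np.abs(H @ c - f(X))))
```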

We prove in Appendix B.3 that Property (P5) holds under appropriate assumptions:

Theorem 3.23

(Density) Consider \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) a Borel measurable, locally bounded activation function, \({\mathscr {L}}\) a depth growth function, and \(p \in (0,\infty ]\). Set \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \in {\mathbb {N}}\cup \{+\infty \}\).

  1. (1)

    Let \(\Omega \subset {\mathbb {R}}^{d}\) be a bounded admissible domain and assume that \(L \ge 2\).

    1. (a)

      For \(p \in (0,\infty )\), we have ;

    2. (b)

      For \(p = \infty \), the same holds if \(\varrho \) is continuous;

    3. (c)

      For \(p \in (0,\infty )\), if \(\varrho \) is non-degenerate, then is dense in ;

    4. (d)

      For \(p=\infty \), the same holds if \(\varrho \) is non-degenerate and continuous.

  2. (2)

    Assume that the \(L_p\)-closure of contains a function \(g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) such that:

    1. (a)

      There is a non-increasing function \(\mu : [0,\infty ) \rightarrow [0,\infty )\) satisfying \(\int _{{\mathbb {R}}^d} \mu (|x|) \, \mathrm{d} x < \infty \) and furthermore \(|g(x)| \le \mu (|x|)\) for all \(x \in {\mathbb {R}}^d\).

    2. (b)

      \(\int _{{\mathbb {R}}^d} g(x) \, \mathrm{d} x \ne 0\); note that this integral is well defined, since \(\int _{{\mathbb {R}}^d} |g(x)| \, \mathrm{d} x \le \int _{{\mathbb {R}}^d} \mu (|x|) \, d x < \infty \).

    Then, is dense in for every admissible domain \(\Omega \subseteq {\mathbb {R}}^{d}\) and every \(k \in {\mathbb {N}}\).\(\blacktriangleleft \)

Remark

Claim (2) applies to any admissible domain, bounded or not. Furthermore, it should be noted that the first assumption (the existence of \(\mu \)) is always satisfied if g is bounded and has compact support. \(\blacklozenge \)

Corollary 3.24

Property (P5) holds for any bounded admissible domain \(\Omega \subset {\mathbb {R}}^d\) and \(p \in (0,\infty ]\) as soon as \(\sup _{n} {\mathscr {L}}(n) \ge 2\) and \(\varrho \) is continuous and not a polynomial.

Corollary 3.25

Property (P5) holds for any (even unbounded) admissible domain \(\Omega \subseteq {\mathbb {R}}^d\) and \(p \in (0,\infty ]\) as soon as \(L := \sup _{n} {\mathscr {L}}(n) \ge 2\) and as long as \(\varrho \) is continuous and such that \({\mathtt {NN}}_{\infty , L,\infty }^{\varrho ,d,1}\) contains a compactly supported, bounded, nonnegative function \(g \ne 0\).

In Sect. 4, we show that the assumptions of Corollary 3.25 indeed hold when \(\varrho \) is the ReLU or one of its powers, provided \(L \ge 3\) (or \(L \ge 2\) in input dimension \(d=1\)). This is a consequence of the following lemma, whose proof we defer to Appendix B.4.

Lemma 3.26

Consider \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) and \(W,N,L \in {\mathbb {N}}\). Assume there is \(\sigma \in {\mathtt {NN}}^{\varrho ,1,1}_{W,L,N}\) such that

$$\begin{aligned} \sigma (x) = {\left\{ \begin{array}{ll} 0, &{}\quad {\mathrm{if}}\; x \le 0 \\ 1, &{}\quad {\mathrm{if}}\; x \ge 1 \end{array}\right. } \qquad \text {and} \qquad 0 \le \sigma (x) \le 1 \quad \forall \, x \in {\mathbb {R}}. \end{aligned}$$
(3.9)

Then, the following hold:

  1. (1)

    For \(d \in {\mathbb {N}}\) and \(0<\varepsilon <\tfrac{1}{2}\), there is \(h \in {\mathtt {NN}}^{\varrho ,d,1}_{2dW(N+1), 2L - 1,(2d+1)N}\) with \(0 \le h \le 1\), \({{\text {supp}}}(h) \subset [0,1]^{d}\), and

    $$\begin{aligned} |h(x) - {\mathbb {1}}_{[0,1]^d} (x)| \le {\mathbb {1}}_{[0,1]^d {\setminus } [\varepsilon , 1-\varepsilon ]^d} (x) \quad \forall \, \, x \in {\mathbb {R}}^d . \end{aligned}$$
    (3.10)

    For input dimension \(d=1\), this holds for some \(h \in {\mathtt {NN}}_{2W,L,2N}^{\varrho ,1,1}\).

  2. (2)

    There is \(L' \le 2L-1\) (resp. \(L' \le L\) for input dimension \(d=1\)) such that for each hyper-rectangle \([a,b] := \prod _{i=1}^d [a_i, b_i]\) with \(d \in {\mathbb {N}}\) and \(-\infty< a_{i}< b_{i} < \infty \), each \(p \in (0,\infty )\), and each \(\varepsilon > 0\), there is a compactly supported, nonnegative function \(0 \le g \le 1\) such that \({{\text {supp}}}(g) \subset [a,b]\),

    $$\begin{aligned} \Vert g - {\mathbb {1}}_{[a,b]} \Vert _{L_p ({\mathbb {R}}^d)} < \varepsilon , \end{aligned}$$

    and \(g = {\mathtt {R}}(\Phi )\) for some \(\Phi \in {\mathcal {NN}}^{\varrho ,d,1}_{2dW(N+1), L',(2d+1)N}\) with \(L(\Phi ) = L'\). For input dimension \(d=1\), this holds for some \(\Phi \in {\mathcal {NN}}_{2W,L',2N}^{\varrho ,1,1}\) with \(L(\Phi ) = L'\). \(\blacktriangleleft \)

With the elements established so far, we immediately get the following theorem.

Theorem 3.27

Consider \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) an activation function, \({\mathscr {L}}\) a depth growth function, \(d \in {\mathbb {N}}\), \(p \in (0,\infty ]\) and \(\Omega \subseteq {\mathbb {R}}^d\) an admissible domain. Set \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \in {\mathbb {N}}\cup \{+\infty \}\). Assume that at least one of the following properties holds:

  1. (1)

    \(\varrho \) is continuous and not a polynomial, \(L \ge 2\), and \(\Omega \) is bounded;

  2. (2)

    \({\mathtt {NN}}_{\infty , L, \infty }^{\varrho ,d,1}\) contains some compactly supported, bounded, nonnegative \(g \ne 0\).

Then, for every \(k \in {\mathbb {N}}\), \(\alpha > 0\), \(q \in (0,\infty ]\), and with \(X := X_{p}^{k}(\Omega )\) as in Eq. (1.3), we have:

  • Properties (P1)–(P5) are satisfied for \(\Sigma = \big (\mathtt {W}_{n}(X,\varrho ,{\mathscr {L}})\big )_{n \in {\mathbb {N}}_{0}}\) (resp. for \(\Sigma = \big (\mathtt {N}_{n}(X,\varrho ,{\mathscr {L}})\big )_{n \in {\mathbb {N}}_{0}}\));

  • \(W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) and \(N_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) are (quasi)-Banach spaces. \(\blacktriangleleft \)

In particular, if \(\varrho \) is continuous and satisfies the assumptions of Lemma 3.26 for some \(L \in {\mathbb {N}}\) and if \(\sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge 2 L - 1\) (or \(\sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge L\) in case of \(d = 1\)), then the conclusions of Theorem 3.27 hold on any admissible domain.

3.7 Discussion and Perspectives

One could envision defining approximation classes where the sets \(\Sigma _{n}\) incorporate additional constraints besides \(L \le {\mathscr {L}}(n)\). For the theory to hold, one must, however, ensure either that: (a) the additional constraints are weak enough to ensure the approximation errors (and therefore the approximation spaces) are unchanged—cf. the discussion of strict versus generalized networks; or, more interestingly, that (b) the constraint gets sufficiently relaxed when n grows, to ensure compatibility with the additivity property.

As an example, constraints of potential interest include a lower (resp. upper) bound on the minimum width \(\min _{1 \le \ell \le L-1} N_{\ell }\) (resp. maximum width \(\max _{1 \le \ell \le L-1} N_{\ell }\)), since they impact the memory needed to compute “in place” the output of the network.

While network families with a fixed lower bound on their minimum width do satisfy the additivity Property (P4), this is no longer the case for families with a fixed upper bound on their maximum width. Consider now a complexity-dependent upper bound f(n) for the maximum width. Since “adding” two networks of a given width yields one with width at most doubled, the additivity property will be preserved provided that \(2f(n) \le f(cn)\) for some \(c \in {\mathbb {N}}\) and all \(n \in {\mathbb {N}}\). This can, e.g., be achieved with \(f(n) := \lfloor \alpha n\rfloor \), with the side effect that for \(n<1/\alpha \), the set \(\Sigma _{n}\) only contains affine functions.
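For the width bound \(f(n) = \lfloor \alpha n\rfloor \) mentioned above, the compatibility condition is immediate, since \(2\lfloor x\rfloor \le \lfloor 2x\rfloor \) for every \(x \ge 0\); the sketch below (Python; an added check with \(\alpha = 3/10\) in exact integer arithmetic) confirms it with \(c = 2\).

```python
# Width bound f(n) = floor(alpha * n) with alpha = 3/10, evaluated in exact integer arithmetic.
f = lambda n: (3 * n) // 10
c = 2                                  # c = 2 always works, because 2 * floor(x) <= floor(2 * x)

bad = [n for n in range(1, 10 ** 5 + 1) if 2 * f(n) > f(c * n)]
print("counterexamples to 2*f(n) <= f(c*n) for n <= 1e5:", bad)   # expected: []
```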

4 Approximation Spaces of the ReLU and Its Powers

The choice of activation function has a decisive influence on the approximation spaces \(W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) and \(N_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\). As evidence of this, consider the following result.

Theorem 4.1

[45, Theorem 4] There exists an analytic squashing function \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) such that: for any \(d \in {\mathbb {N}}\), any continuous function from \(\Omega = [0,1]^{d}\) to \({\mathbb {R}}\) can be approximated arbitrarily well in the uniform norm by a strict \(\varrho \)-network with \(L=3\) layers and \(W \le 21 d^2 + 15d + 3\) connections. \(\blacktriangleleft \)

Consider the pathological activation function \(\varrho \) from Theorem 4.1 and a depth growth function \({\mathscr {L}}\) satisfying \(L := \sup _{n} {\mathscr {L}}(n) \ge 2\). Since \(\varrho \) is continuous and not a polynomial, we can apply Theorem 3.27; hence, \(W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) and \(N_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) are well-defined quasi-Banach spaces for each bounded admissible domain \(\Omega \), \(p \in (0,\infty ]\) and \(X := X_{p}(\Omega )\). Yet, if \(L \ge 3\), there is \(n_{0}\) so that \({\mathscr {L}}(n) \ge 3\) for \(n \ge n_{0}\), and the set \(\mathtt {W}_{n}(X,\varrho ,{\mathscr {L}})\) is dense in X for any \(p \in (0,\infty ]\) provided that \(n \ge \max \{ n_{0}, 21 d^2 + 15 d + 3 \}\); hence, \(E(f, \mathtt {W}_{n}(X,\varrho ,{\mathscr {L}}))_{X} = 0\) for any \(f \in X\) and any such n, showing that \(W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}}) = X\) with equivalent (quasi)-norms.

The approximation spaces generated by pathological activation functions such as in Theorem 4.1 are so degenerate that they are uninteresting both from a practical perspective (computing a near best approximation with such an activation function is hopeless) and from a theoretical perspective. (The whole scale of approximation spaces collapses to X.)

Much more interesting is the study of approximation spaces generated by commonly used activation functions such as the ReLU \(\varrho _{1}\) or its powers \(\varrho _{r}\), \(r \in {\mathbb {N}}\). For any admissible domain, generalized and strict \(\varrho _{r}\)-networks indeed yield well-defined approximation spaces that coincide.

Theorem 4.2

(Approximation spaces of generalized and strict \(\varrho _{r}\)-networks) Let \(r \in {\mathbb {N}}\) and define \(\varrho _r : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto (x_+)^r\), where \(x_+ := \max \{0, x\}\). Consider \(X := X_{p}^{k}(\Omega )\) with \(p \in (0,\infty ]\), \(d,k \in {\mathbb {N}}\) and \(\Omega \subseteq {\mathbb {R}}^{d}\) an arbitrary admissible domain. Let \({\mathscr {L}}\) be any depth growth function.

  1. (1)

    For each \(\alpha > 0,q \in (0,\infty ]\), \(r \in {\mathbb {N}}\), we have

    and there is \(C < \infty \) such that

  2. (2)

    If the depth growth function \({\mathscr {L}}\) satisfies

    $$\begin{aligned} \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge {\left\{ \begin{array}{ll} 2, &{}\quad \text {if } \Omega \text { is bounded or } d=1 , \\ 3, &{}\quad \text {otherwise} , \end{array}\right. } \end{aligned}$$

    then, for each \(\alpha > 0\), \(q \in (0,\infty ]\), \(r \in {\mathbb {N}}\) and \(\varrho := \varrho _{r}\), the following hold:

    • Properties (P1)–(P5) are satisfied for \(\Sigma = \big (\mathtt {W}_{n}(X,\varrho ,{\mathscr {L}})\big )_{n \in {\mathbb {N}}_{0}}\) (resp. for \(\Sigma = \big (\mathtt {N}_{n}(X,\varrho ,{\mathscr {L}})\big )_{n \in {\mathbb {N}}_{0}}\));

    • \(W_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) and \(N_{q}^{\alpha }(X,\varrho ,{\mathscr {L}})\) are (quasi)-Banach spaces. \(\blacktriangleleft \)

Remark 4.3

For a bounded domain or when \(d=1\), the second claim holds for any depth growth function allowing at least one hidden layer. In the other cases, the restriction to at least two hidden layers is unavoidable (except for some exotic unbounded domains with vanishing mass at infinity) as the only realization of a \(\varrho _{r}\)-network of depth two that belongs to \(X_{p}^{k}(\Omega )\) is the zero network.\(\blacklozenge \)

Proof of Theorem 4.2

By Lemma 2.24, \(\varrho _{r}\) can represent the identity using \(2r + 2\) terms. By Theorem 3.8, this establishes the first claim. The second claim follows from Theorem 3.27, once we show that we can apply the latter. For bounded \(\Omega \), this is clear, since \(\varrho _r\) is continuous and not a polynomial, and hence non-degenerate. For general \(\Omega \), we relate \(\varrho _{r}\) to B-splines to establish the following lemma (which we prove below).

Lemma 4.4

For any \(r \in {\mathbb {N}}\), there is \(\sigma _{r} \in {\mathtt {SNN}}^{\varrho _{r},1,1}_{2(r+1),2,r+1}\) satisfying (3.9).\(\blacktriangleleft \)

Combined with Lemma 3.26, we obtain the existence of a compactly supported, continuous, nonnegative function \(g \ne 0\) with \(g \in {\mathtt {NN}}^{\varrho _{r},d,1}_{\infty ,3,\infty }\) (respectively, \(g \in {\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,\infty }\) for input dimension \(d=1\)). Hence, Theorem 3.27 is applicable. \(\square \)

Definition 4.5

(B-splines) For any function \(f: {\mathbb {R}}\rightarrow {\mathbb {R}}\), define \(\Delta f: x \mapsto f(x)-f(x-1)\). Let \(\varrho _{0} := {\mathbb {1}}_{[0,\infty )}\) denote the Heaviside function, and \(\beta _{+}^{(0)} := {\mathbb {1}}_{[0,1)} = \Delta \varrho _{0}\) the B-spline of degree 0. The B-spline of degree n is obtained by convolving \(\beta _{+}^{(0)}\) with itself \(n+1\) times:

$$\begin{aligned} \beta _{+}^{(n)} := \underbrace{\beta ^{(0)}_{+} \star \ldots \star \beta ^{(0)}_{+}}_{n+1\ \text {factors}}. \end{aligned}$$

For \(n \ge 0\), \(\beta _{+}^{(n)}\) is nonnegative and is zero except for \(x \in [0,n+1]\). We have \(\beta _{+}^{(n)} \in C^{n-1}_{c}({\mathbb {R}})\) for \(n \ge 1\). Indeed, this follows since \(\varrho _n \in C^{n-1}({\mathbb {R}})\), and since it is known (see [65, Equation (10)], noting that [65] uses centered B-splines) that the B-spline of degree n can be decomposed as

$$\begin{aligned} \beta _{+}^{(n)} = \frac{\Delta ^{n+1} \varrho _{n}}{n!} = \frac{1}{n!} \sum _{k=0}^{n+1} \left( {\begin{array}{c}n+1\\ k\end{array}}\right) (-1)^{k} \varrho _{n}(x-k). \blacktriangleleft \end{aligned}$$
(4.1)
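The decomposition (4.1) is easy to test numerically. The sketch below (Python; an added illustration) evaluates the right-hand side of (4.1) for \(n = 2\) and compares it with the quadratic B-spline obtained by repeated discrete convolution of \(\beta ^{(0)}_{+}\); the two agree up to the discretization error of the rectangle rule used for the convolution.

```python
import numpy as np
from math import comb, factorial

def rho(n, x):
    return np.maximum(x, 0.0) ** n                  # rho_n(x) = (x_+)^n

def bspline_via_eq_4_1(n, x):
    # beta_+^(n)(x) = (1/n!) * sum_{k=0}^{n+1} binom(n+1, k) (-1)^k rho_n(x - k), cf. Eq. (4.1)
    return sum(comb(n + 1, k) * (-1) ** k * rho(n, x - k) for k in range(n + 2)) / factorial(n)

n, h = 2, 1e-3
x = np.arange(0.0, n + 1.0 + h, h)
b0 = ((x >= 0) & (x < 1)).astype(float)             # beta_+^(0) = indicator of [0, 1)
ref = b0.copy()
for _ in range(n):                                  # n-fold convolution, i.e. n + 1 factors in total
    ref = np.convolve(ref, b0)[: len(x)] * h        # rectangle-rule approximation of the convolution

dev = np.max(np.abs(bspline_via_eq_4_1(n, x) - ref))
print(f"max deviation between Eq. (4.1) and the convolution definition (n = {n}): {dev:.1e}")
```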

Proof of Lemma 4.4

For \(n \ge 0\), \(\beta _{+}^{(n)}\) is nonnegative and is zero except for \(x \in [0,n+1]\). Its primitive

$$\begin{aligned} g_{n}(x) := \int _{0}^{x} \beta _{+}^{(n)} (t) \mathrm{d}t \end{aligned}$$

is thus non-decreasing, with \(g_{n}(x) = 0\) for \(x \le 0\) and \(g_{n}(x) = g_{n}(n+1)\) for \(x \ge n+1\). Since \(\beta _{+}^{(n)} \in C^{n-1}_{c}({\mathbb {R}})\) for \(n \ge 1\), we have \(g_{n} \in C^{n}({\mathbb {R}})\) for \(n \ge 1\). Furthermore, \(g_0 \in C^0 ({\mathbb {R}})\) since \(\beta ^{(0)}_+\) is bounded.

For \(r \ge 1\), the above facts imply that the function \(\sigma _{r}(x) := g_{r-1}(rx) / g_{r-1}(r)\) belongs to \(C^{r-1}({\mathbb {R}})\) and satisfies (3.9). To conclude, we now prove that \(\sigma _{r} \in {\mathtt {SNN}}^{\varrho _{r},1,1}_{2(r+1),2,r+1}\). For \(0 \le k \le n+1\), we have

$$\begin{aligned}&\int _{0}^{x} \varrho _{n}(t-k){\mathrm{d}}t\\&\quad = {\left\{ \begin{array}{ll} 0, &{} \quad {\mathrm{if} }\; x \le k \\ \int _{k}^{x} (t-k)^{n} {\mathrm{d}}t = \int _{0}^{x-k}t^{n}\mathrm{d}t = \tfrac{(x-k)^{n+1}}{n+1}, &{}\quad \text {otherwise} \end{array}\right. } \quad = \quad \frac{\varrho _{n+1}(x-k)}{n+1} . \end{aligned}$$

By (4.1), it follows that

$$\begin{aligned} g_{n}(x) = \frac{1}{(n+1)!} \sum _{k=0}^{n+1} \left( {\begin{array}{c}n+1\\ k\end{array}}\right) (-1)^{k} \varrho _{n+1}(x-k), \end{aligned}$$

and hence,

$$\begin{aligned} \sigma _{r}(x) = \frac{g_{r-1}(rx)}{g_{r-1}(r)} = \frac{1}{r!\ g_{r-1}(r)} \sum _{k=0}^{r} \left( {\begin{array}{c}r\\ k\end{array}}\right) (-1)^{k} \varrho _{r}(rx-k) . \end{aligned}$$

Setting \(\alpha _{1} := \varrho _{r} \otimes \ldots \otimes \varrho _{r}: {\mathbb {R}}^{r+1} \rightarrow {\mathbb {R}}^{r+1}\) as well as \(T_1: {\mathbb {R}}\rightarrow {\mathbb {R}}^{r+1}, x \mapsto (rx-k)_{k=0}^{r}\) and

$$\begin{aligned} T_{2}: {\mathbb {R}}^{r+1} \rightarrow {\mathbb {R}}, y = (y_{k})_{k=0}^{r} \mapsto \tfrac{1}{r!\ g_{r-1}(r)} \sum _{k=0}^{r} \left( {\begin{array}{c}r\\ k\end{array}}\right) (-1)^{k} y_{k} \end{aligned}$$

and \(\Phi := \big ( (T_{1},\alpha _{1}),(T_{2},\mathrm {id}_{{\mathbb {R}}}) \big )\), it is then easy to check that \(\sigma _{r} = {\mathtt {R}}(\Phi )\). Obviously \(L(\Phi )=2\), \(N(\Phi )=r+1\), and \(\Vert T_{i}\Vert _{\ell ^{0}} = r+1\) for \(i=1,2\); hence, as \(\Phi \) is strict, we have \(\Phi \in {\mathcal {SNN}}^{\varrho _{r},1,1}_{2(r+1),2,r+1}\). \(\square \)
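The closed form for \(\sigma _{r}\) derived in this proof can be evaluated directly. The following sketch (Python; an added illustration) implements \(g_{r-1}\) and \(\sigma _{r}\) from the formulas above and checks the defining properties (3.9) on a grid for \(r \in \{1,2,3\}\).

```python
import numpy as np
from math import comb, factorial

def rho(r, x):
    return np.maximum(x, 0.0) ** r                   # rho_r(x) = (x_+)^r

def g(r, y):
    # g_{r-1}(y) = (1/r!) * sum_{k=0}^{r} binom(r, k) (-1)^k rho_r(y - k), as derived above
    return sum(comb(r, k) * (-1) ** k * rho(r, y - k) for k in range(r + 1)) / factorial(r)

def sigma(r, x):
    # sigma_r(x) = g_{r-1}(r x) / g_{r-1}(r); by Lemma 4.4 this is the realization of a strict
    # rho_r-network with one hidden layer of r + 1 neurons and 2(r + 1) nonzero weights.
    return g(r, r * x) / g(r, r)

xs = np.linspace(-2.0, 3.0, 5001)
for r in (1, 2, 3):
    s = sigma(r, xs)
    ok = (np.all(s[xs <= 0] == 0.0)
          and np.allclose(s[xs >= 1], 1.0)
          and np.all((s >= -1e-9) & (s <= 1 + 1e-9)))
    print(f"r = {r}: properties (3.9) hold on the test grid: {ok}")
```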

4.1 Piecewise Polynomial Activation Functions Versus \(\varrho _{r}\)

In this subsection, we show that approximation spaces of \(\varrho _{r}\)-networks contain the approximation spaces of continuous piecewise polynomial activation functions and match those of (free-knot) spline activation functions.

Definition 4.6

Consider an interval \(I \subseteq {\mathbb {R}}\). A function \(f: I \rightarrow {\mathbb {R}}\) is piecewise polynomial if there are finitely many intervals \(I_i \subset I\) such that \(I = \bigcup _i I_i\) and \(f|_{I_i}\) is a polynomial. It is of degree at most \(r \in {\mathbb {N}}\) when each \(f|_{I_{i}}\) is of degree at most r, and with at most \(n \in {\mathbb {N}}\) pieces (or with at most \(n-1 \in {\mathbb {N}}_{0}\) breakpoints) when there are at most n such intervals. The set of piecewise polynomials of degree at most r with at most n pieces is denoted \({\mathtt {PPoly}}^{r}_{n}(I)\), and we set \({\mathtt {PPoly}}^{r}(I) := \cup _{n \in {\mathbb {N}}} {\mathtt {PPoly}}^{r}_{n}(I)\).

A function \(f \in {\mathtt {Spline}}^{r}_{n}(I) := {\mathtt {PPoly}}^{r}_{n}(I) \cap C^{r-1}(I)\) is called a free-knot spline of degree at most r with at most n pieces (or at most \(n-1\) breakpoints). We set \({\mathtt {Spline}}^{r}(I) := \cup _{n \in {\mathbb {N}}} {\mathtt {Spline}}^{r}_n(I)\). \(\blacktriangleleft \)

Theorem 4.7

Consider a depth growth function \({\mathscr {L}}\), an admissible domain \(\Omega \subset {\mathbb {R}}^d\), and let \(X := X_{p}^{k}(\Omega )\) with \(d,k \in {\mathbb {N}}\), \(p \in (0,\infty ]\). Let \(r \in {\mathbb {N}}\), set \(\varrho _r : {\mathbb {R}}\rightarrow {\mathbb {R}}, x \mapsto (x_+)^r\), and let \(\alpha >0\), \(q \in (0,\infty ]\).

  1. (1)

    If \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) is continuous and piecewise polynomial of degree at most r, then

    (4.2)

    Moreover, if \(\Omega \) is bounded, or if \(r=1\), or if \({\mathscr {L}}+1 \preceq {\mathscr {L}}\), then we further have

    (4.3)
  2. (2)

    If \(\varrho \in {\mathtt {Spline}}^{r}({\mathbb {R}})\) is not a polynomial and \(\Omega \) is bounded, then we have (with equivalent norms)

    (4.4)
  3. (3)

    For any \(s \in {\mathbb {N}}\) we have

    (4.5)

The same results hold with \(N_{q}^{\alpha }\) instead of \(W_{q}^{\alpha }\). \(\blacktriangleleft \)

Examples 3.15 and 3.16 provide important examples of depth growth functions \({\mathscr {L}}\) with \({\mathscr {L}}+1 \preceq {\mathscr {L}}\), so that (4.3) holds on any domain.

Remark 4.8

(Nestedness) For \(1 \le r' \le r\), the function \(\varrho := \varrho _{r'}\) is indeed a continuous piecewise polynomial with two pieces of degree at most r. Theorem 4.7 thus implies that if \(\Omega \) is bounded or \({\mathscr {L}} + 1 \preceq {\mathscr {L}}\), then and .

We will see in Corollary 4.14 that if \(2 {\mathscr {L}} \preceq {\mathscr {L}}\), then these embeddings are indeed equalities if \(2 \le r' \le r\).\(\blacklozenge \)

The main idea behind the proof of Theorem 4.7 is to combine Lemma 2.19 and its consequences with the following results proved in Appendices C.1 and C.2.

Lemma 4.9

Consider \(\varrho :{\mathbb {R}}\rightarrow {\mathbb {R}}\) a continuous piecewise polynomial function with at most \(n \in {\mathbb {N}}\) pieces of degree at most \(r \in {\mathbb {N}}\). With \(w := 2 \cdot (4^{r}-1) / 3\) and \(m := 2^{r} - 1\) we have

$$\begin{aligned} \varrho \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{4(r+1)+(n-1)w,2,2(r+1)+(n-1)m}}, \end{aligned}$$

where the closure is with respect to the topology of locally uniform convergence. For \(r=1\) (that is, when \(\varrho \) is continuous and piecewise affine with at most \(n \in {\mathbb {N}}\) pieces and \(\varrho _r = \varrho _1\)), we even have \({\varrho \in {\mathtt {SNN}}^{\varrho _{r},1,1}_{2(n+1),2,n+1}}\).\(\blacktriangleleft \)

Lemma 4.10

Consider \(r \in {\mathbb {N}}\) and \(\varrho \in {\mathtt {Spline}}^{r} ({\mathbb {R}})\). If \(\varrho \) is not a polynomial, then \(\varrho _r \in \overline{{\mathtt {NN}}^{\varrho ,1,1}_{5^{r}r!,2,3^{r}r!}}\), where the closure is with respect to locally uniform convergence.\(\blacktriangleleft \)

For bounded \(\Omega \), locally uniform convergence on \({\mathbb {R}}^{d}\) implies convergence in for all \(p \in (0,\infty ]\). To similarly “upgrade” locally uniform convergence to convergence in X on unbounded domains, we use the following localization lemma which is proved in Appendix C.3.

Lemma 4.11

Consider \(d,k \in {\mathbb {N}}\), \(r \in {\mathbb {N}}_{\ge 2}\). There is \(c = c(d,k,r) \in {\mathbb {N}}\) such that for any \(W,L,N \in {\mathbb {N}}\), \(g \in {\mathtt {NN}}^{\varrho _{r},d,k}_{W,L,N}\), \(R \ge 1,\delta > 0\), there is \(g_{R,\delta } \in {\mathtt {NN}}^{\varrho _{r},d,k}_{cW,\max \{ L+1, 3 \},cN}\), such that

$$\begin{aligned} |g_{R,\delta } ( x) - ({\mathbb {1}}_{[-R,R]^{d}} \cdot g)(x)| \le 2 \cdot |g(x)| \cdot {\mathbb {1}}_{[-R-\delta ,R+\delta ]^{d} {\setminus } [-R,R]^{d}} (x) \qquad \forall \, x \in {\mathbb {R}}^d . \end{aligned}$$
(4.6)

For \(d=1\), the same holds with \(\max \{ L+1,2 \}\) layers instead of \(\max \{ L+1,3 \}\).\(\blacktriangleleft \)

The following proposition describes how one can “upgrade” the locally uniform convergence to convergence in X, at the cost of slightly increasing the depth of the approximating networks.

Proposition 4.12

Consider \(\Omega \subset {\mathbb {R}}^d\) an admissible domain and with \(d,k \in {\mathbb {N}}\), \(p \in (0,\infty ]\). Assume \(\varrho \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,m}}\) where the closure is with respect to locally uniform convergence and \(r \in {\mathbb {N}}_{\ge 2}\), \(m \in {\mathbb {N}}\). For any \(W,N \in {\mathbb {N}}_{0} \cup \{\infty \}\), \(L \in {\mathbb {N}}\cup \{\infty \}\), we have, with closure in X,

$$\begin{aligned} {\mathtt {NN}}_{W,L,N}^{\varrho ,d, k}(\Omega ) \cap X \subset \overline{ {\mathtt {NN}}^{\varrho _r, d, k}_{cWm^{2}, \max \{ L+1,3 \},cNm}(\Omega ) \cap X }^{X} , \end{aligned}$$

where \(c = c(d,k, r) \in {\mathbb {N}}\) is as in Lemma 4.11. If \(d=1\), the same holds with \(\max \{ L+1,2 \}\) layers instead of \(\max \{ L+1,3 \}\). If \(\Omega \) is bounded, or if \(\varrho \in {\mathtt {NN}}^{\varrho _{r},1,1}_{\infty ,2,m}\) with \(r=1\), then the same holds with \(c=1\) and L layers instead of \(\max \{ L+1, 3 \}\) (resp. instead of \(\max \{ L+1, 2 \}\) when \(d=1\)). \(\blacktriangleleft \)

The proof is in Appendix C.4. We are now equipped to prove Theorem 4.7.

Proof of Theorem 4.7

We give the proof for the spaces \(W_{q}^{\alpha }\); minor adaptations yield the results for \(N_{q}^{\alpha }\).

For Claim (1), first note that Lemma 4.9 shows that there is some \(m \in {\mathbb {N}}\) satisfying \(\varrho \in \overline{{\mathtt {NN}}_{\infty ,2,m}^{\varrho _r,1,1}}\), where the closure is with respect to locally uniform convergence. Define \(\ell := 3\) if \(d \ge 2\) (resp. \(\ell := 2\) if \(d=1\)) and \(\widetilde{{\mathscr {L}}}:= \max \{ {\mathscr {L}}+1,\ell \}\) (resp. \(\widetilde{{\mathscr {L}}}:= {\mathscr {L}}\) when \(\Omega \) is bounded or \(r=1\)) and consider \(c \in {\mathbb {N}}\) as in Proposition 4.12. Thus, since \(\widetilde{{\mathscr {L}}}\) is non-decreasing, by Proposition 4.12 and Lemma 2.14, we have for all \(n \in {\mathbb {N}}\)

$$\begin{aligned} \mathtt {W}_n(X,\varrho ,{\mathscr {L}}) = {\mathtt {NN}}^{\varrho ,d,k}_{n,{\mathscr {L}}(n),\infty }(\Omega ) \cap X&\subset \overline{ {\mathtt {NN}}^{\varrho _r,d,k}_{cnm^{2},\widetilde{{\mathscr {L}}}(n),\infty }(\Omega ) \cap X }^{X} \\&\subset \overline{ {\mathtt {NN}}^{\varrho _r,d,k}_{cnm^{2},\widetilde{{\mathscr {L}}}(cnm^{2}),\infty }(\Omega ) \cap X }^{X}\\&= \overline{\mathtt {W}_{cm^{2}n}(X,\varrho _r, \widetilde{{\mathscr {L}}})}^{X}. \end{aligned}$$

Hence, for any \(f \in X\) and \(n \in {\mathbb {N}}\)

$$\begin{aligned} E(f,\mathtt {W}_n(X,\varrho ,{\mathscr {L}}))_{X} \ge E\big ( f,\mathtt {W}_{cm^{2}n}(X,\varrho _r,\widetilde{{\mathscr {L}}}) \big )_{X} . \end{aligned}$$

Thus, Lemma 3.1 yields (4.2). When \(\Omega \) is bounded or \(r=1\), as \(\widetilde{{\mathscr {L}}} = {\mathscr {L}}\), this yields (4.3). When \({\mathscr {L}}+1 \preceq {\mathscr {L}}\), as \(\widetilde{{\mathscr {L}}} \le \max \{ {\mathscr {L}}+1,\ell \} \le {\mathscr {L}}+\ell +1\), we have \(\widetilde{{\mathscr {L}}} \preceq {\mathscr {L}}+\ell +1 \preceq {\mathscr {L}}\) by Lemma 3.14, yielding again (4.3) by Lemma 3.12.

For Claim (2), if \(\Omega \) is bounded and \(\varrho \in {\mathtt {Spline}}^{r}({\mathbb {R}})\) is not a polynomial, combining Lemma 4.10 with Lemma 2.21, we similarly get the converse to  (4.3). This establishes (4.4).

We now prove Claim (3). Since \(\varrho _{r^s} = \varrho _r \circ \cdots \circ \varrho _r\) (where \(\varrho _r\) appears s times), Lemma 2.20 shows that \( {\mathtt {NN}}^{\varrho _{r^{s}},d,k}_{W,L,N} \subset {\mathtt {NN}}^{\varrho _r,d,k}_{W+(s-1)N,1+s(L-1),sN} \) for all \(W, L, N\). Combining this with Lemma 2.14, we obtain

$$\begin{aligned} {\mathtt {NN}}^{\varrho _{r^{s}},d,k}_{n,{\mathscr {L}}(n),\infty } \subset {\mathtt {NN}}^{\varrho _{r^{s}},d,k}_{n,{\mathscr {L}}(n),n} \subset {\mathtt {NN}}^{\varrho _r,d,k}_{sn,1+s({\mathscr {L}}(n)-1),sn} \subset {\mathtt {NN}}^{\varrho _r,d,k}_{sn,1+s({\mathscr {L}}(sn)-1),\infty } \quad \forall \, n \in {\mathbb {N}}. \end{aligned}$$

Therefore, we get for any \(f \in X\) and \(n \in {\mathbb {N}}\)

$$\begin{aligned} E(f,\mathtt {W}_n(X,\varrho _{r^{s}},{\mathscr {L}}))_{X} \ge E\big ( f,\mathtt {W}_{sn}(X,\varrho _{r},1+s({\mathscr {L}}-1)) \big )_{X} . \end{aligned}$$

Hence, we can finally apply Lemma 3.1 to obtain (4.5). \(\square \)
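The identity \(\varrho _{r^{s}} = \varrho _{r} \circ \cdots \circ \varrho _{r}\) used for Claim (3) is easy to check numerically. The following snippet is our own illustration; it assumes \(\varrho _r(x) = \max \{0,x\}^{r}\) and uses NumPy.

```python
import numpy as np

def rho(r):
    """The activation rho_r(x) = max(0, x)**r (powers of the ReLU)."""
    return lambda x: np.maximum(x, 0.0) ** r

x = np.linspace(-2.0, 2.0, 401)
r, s = 2, 3
y = x.copy()
for _ in range(s):                       # compose rho_r with itself s times
    y = rho(r)(y)
assert np.allclose(y, rho(r ** s)(x))    # rho_{r^s} = rho_r o ... o rho_r
```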

Remark 4.13

Inspecting the proofs, we see that if \(\varrho \in {\mathtt {Spline}}^{r}\) has exactly one breakpoint, then \(\varrho \in {\mathtt {NN}}^{\varrho _{r},1,1}_{w,2,m}\) and \(\varrho _{r} \in {\mathtt {NN}}^{\varrho ,1,1}_{w,2,m}\) for some \(w,m \in {\mathbb {N}}\). This is stronger than \(\varrho \in \overline{{\mathtt {NN}}^{\varrho _{r},1,1}_{w,2,m}}\) (resp. than \(\varrho _{r} \in \overline{{\mathtt {NN}}^{\varrho ,1,1}_{w,2,m}}\)) and implies (4.4) with equivalent norms even on unbounded domains. Examples include the leaky ReLU [44], the parametric ReLU [33], and the absolute value which is used in scattering transforms [46].

Another spline of degree one is soft-thresholding, \(\sigma (x) := x(1-\lambda /|x|)_{+}\), which appears in Iterative Shrinkage Thresholding Algorithms (ISTA) for \(\ell ^{1}\) sparse recovery in the context of linear inverse problems [28, Chap. 3] and has been used in the Learned ISTA (LISTA) method [30]. As \(\sigma \in {\mathtt {Spline}}^{1}\), using soft-thresholding as an activation function on bounded \(\Omega \) is exactly as expressive as using the ReLU.\(\blacklozenge \)
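The equivalences mentioned in this remark come down to explicit two-neuron identities. The following sketch (ours, not part of the paper; it assumes NumPy and standard definitions of the leaky ReLU and of soft-thresholding) checks that both activation functions are realizations of one-hidden-layer \(\varrho _1\)-networks, and that, conversely, the ReLU is a one-hidden-layer leaky-ReLU network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-3.0, 3.0, 601)

# Leaky ReLU (slope a on the negative axis) as a two-neuron rho_1-network,
# and the ReLU recovered from it (a spline with exactly one breakpoint,
# cf. the first part of the remark).
a = 0.1
leaky = lambda z: np.where(z >= 0, z, a * z)
assert np.allclose(leaky(x), relu(x) - a * relu(-x))
assert np.allclose(relu(x), (leaky(x) + a * leaky(-x)) / (1.0 - a ** 2))

# Soft-thresholding sigma(x) = x (1 - lam/|x|)_+ as a two-neuron rho_1-network.
lam = 0.7
soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
assert np.allclose(soft, relu(x - lam) - relu(-x - lam))
```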

4.2 Saturation Property of Approximation Spaces with Polynomial Depth Growth

For certain depth growth functions, the approximation spaces of \(\varrho _{r}\)-networks are independent of the choice of \(r \ge 2\).

Corollary 4.14

With the notations of Theorem 4.7, if \(2 {\mathscr {L}} \preceq {\mathscr {L}}\) then for every \(r \in {\mathbb {N}}_{\ge 2}\) we have

(4.7)
(4.8)

where the equality is with equivalent quasi-norms.

Example 4.15

By Example 3.16, for polynomially growing depth we do have \(2{\mathscr {L}} \preceq {\mathscr {L}}\). This includes the case \({\mathscr {L}}(n) = n+1\), which gives the same approximation spaces as \({\mathscr {L}} \equiv \infty \); see Remark 3.6.

In words, approximation spaces of \(\varrho _{r}\)-networks with appropriate depth growth have a saturation property: Increasing the degree r beyond \(r=2\) does not pay off in terms of the considered function spaces. Note, however, that the constants in the norm equivalence may still play a qualitative role in practice.

Proof

We prove (4.7); the proof of (4.8) is similar. By Lemma 3.14, since \(2{\mathscr {L}} \preceq {\mathscr {L}}\), we have \(a {\mathscr {L}} + b \sim {\mathscr {L}}\) for all \(a \ge 1\), \(b \ge 1-a\). In particular, \({\mathscr {L}} + 1 \preceq {\mathscr {L}}\); hence, (4.3) holds with \(\varrho = \varrho _{r'}\), \(r' \in {\mathbb {N}}\), \(1 \le r' \le r\). Combined with (4.5) and Lemma 3.12, since \(r \le 2^r\) for \(r \in {\mathbb {N}}\), we see

for all \(r \in {\mathbb {N}}_{\ge 2}\). In the middle, we used that \(1 + r ({\mathscr {L}} - 1) \preceq 1 + r {\mathscr {L}} \preceq (1 + r) {\mathscr {L}} \preceq {\mathscr {L}}\). \(\square \)

4.3 Piecewise Polynomial Activation Functions Yield Non-trivial Approximation Spaces

In light of the pathological example of Theorem 4.1, it is important to check that the approximation spaces and with \(\varrho = \varrho _{r}\), \(r \in {\mathbb {N}}\), are non-trivial: They are proper subspaces of . This is what we prove for any continuous and piecewise polynomial activation function \(\varrho \).

Theorem 4.16

Let \(\varrho : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be continuous and piecewise polynomial (with finitely many pieces), let \(\Omega \subset {\mathbb {R}}^d\) be measurable with non-empty interior, and let \(s > 0\).

Let \(p,q \in (0,\infty ]\), \(k \in {\mathbb {N}}\), \(\alpha \in (0,\infty )\), and . Finally, let \({\mathscr {L}}\) be a depth-growth function satisfying \(\sup _{n\in {\mathbb {N}}} {\mathscr {L}}(n) \ge 2\). Then, and . \(\blacktriangleleft \)

The proof is given at the end of Appendix E.

4.4 ReLU Networks of Bounded Depth have Limited Expressiveness

In this subsection, we show that approximation spaces of ReLU networks of bounded depth and high approximation rate \(\alpha \) are non-trivial in a very explicit sense: They fail to contain any nonzero function in \(C_{c}^{3}({\mathbb {R}}^{d})\). This quite general obstruction to the expressiveness of shallow ReLU networks, and to the embedding of “classical” function spaces into the approximation spaces of shallow ReLU networks, is obtained by translating [55, Theorem 4.5] into the language of approximation spaces.

Theorem 4.17

Let \(\Omega \subseteq {\mathbb {R}}^d\) be an open admissible domain, \(p,q \in (0,\infty ]\), , \(L \in {\mathbb {N}}\), and \(\alpha > 0\).

  • If , then \(\lfloor L/2 \rfloor \ge \alpha / 2\);

  • If , then \(L - 1 \ge \alpha /2\). \(\blacktriangleleft \)

Before we give a proof, we immediately highlight a consequence.

Corollary 4.18

Let Y be a function space such that \(C_c^3 (\Omega ) \cap Y \ne \{0\}\) where \(\Omega \subseteq {\mathbb {R}}^d\) is an open admissible domain. For \(p \in (0,\infty ]\), , \(L \in {\mathbb {N}}\), \(\alpha > 0\) and \(q \in (0,\infty ]\), we have

  • If , then \(\lfloor L/2 \rfloor \ge \alpha /2\);

  • If , then \(L-1 \ge \alpha /2\). \(\blacktriangleleft \)

Remark

All “classical” function spaces (Sobolev, Besov, or modulation spaces, ...) include \(C_c^\infty (\Omega )\); hence, this shows that none of these spaces embed into (resp. into ) for \(\alpha > 2L\). In other words, to achieve embeddings into approximation spaces of ReLU networks with a good approximation rate, one needs depth! \(\blacklozenge \)

Proof of Theorem 4.17

The claimed estimates are trivially satisfied in case of \(L = 1\); hence, we will assume \(L \ge 2\) in what follows.

Let \(f \in C_c^3 (\Omega )\) be not identically zero. We derive necessary criteria on L which have to be satisfied if or . By Eq. (3.2), we have and the same for ; thus, it suffices to consider the case \(q = \infty \).

Extending f by zero outside \(\Omega \), we can assume \(f \in C_c^3({\mathbb {R}}^d)\) with \({{\text {supp}}}f \subset \Omega \). We claim that there is \(x_0 \in {{\text {supp}}}(f) \subset \Omega \) with \({{\text {Hess}}}_f (x_0) \ne 0\), where \({{\text {Hess}}}_f\) denotes the Hessian of f. If this was false, we would have \({{\text {Hess}}}_f \equiv 0\) on all of \({\mathbb {R}}^d\), and hence, \(\nabla f \equiv v\) for some \(v \in {\mathbb {R}}^d\). This would imply \(f (x) = \langle v, x \rangle + b\) for all \(x \in {\mathbb {R}}^d\), with \(b = f(0)\). However, since \(f \equiv 0\) on the non-empty open set \({\mathbb {R}}^d {\setminus } {{\text {supp}}}(f)\), this would entail \(v = 0\), and then, \(f \equiv 0\), contradicting our choice of f.

Now, choose \(r > 0\) such that \(\Omega _0 := B_r (x_0) \subset \Omega \). Then, \(f|_{\Omega _0}\) is not an affine-linear function, so that [55, Proposition C.5] yields a constant \(C_1 = C_1(f,p) > 0\) satisfying

$$\begin{aligned} \Vert f - g \Vert _{L^p (\Omega _0)} \ge C_1 \cdot P^{-2} \quad \text {for each }P\text {-piecewise slice affine function }g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}.\nonumber \\ \end{aligned}$$
(4.9)

Here, a function \(g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is called P-piecewise slice affine if for arbitrary \(x_0, v \in {\mathbb {R}}^d\) the function \(g_{x_0, v} : {\mathbb {R}}\rightarrow {\mathbb {R}}, t \mapsto g(x_0 + t v)\) is piecewise affine-linear with at most P pieces; that is, \(g_{x_0, v} \in {\mathtt {PPoly}}_{P}^1 ({\mathbb {R}})\).

Now, Lemma 5.19 (which will be proved independently) shows that there is a constant \(K = K(L) \in {\mathbb {N}}\) such that

$$\begin{aligned} {\mathtt {NN}}_{W,L,\infty }^{\varrho _1, 1, 1} \subset {\mathtt {PPoly}}_{K \cdot W^{\lfloor L/2 \rfloor }}^1 ({\mathbb {R}}) \qquad \text {and} \qquad {\mathtt {NN}}_{\infty ,L,N}^{\varrho _1, 1, 1} \subset {\mathtt {PPoly}}_{K \cdot N^{L-1}}^1 ({\mathbb {R}}) \end{aligned}$$

for all \(N \in {\mathbb {N}}\). Furthermore, if \(g \in {\mathtt {NN}}_{W,L,N}^{\varrho _1,d,1}\), then Lemma 2.18 shows \(g_{x_0, v} \in {\mathtt {NN}}_{W,L,N}^{\varrho _1,1,1}\); here, we used that the affine map \(T : {\mathbb {R}}\rightarrow {\mathbb {R}}^d , t \mapsto x_0 + t v\) satisfies \(\Vert T \Vert _{\ell ^{0,\infty }_*} \le 1\). In combination, we see that each \(g \in {\mathtt {NN}}_{W,L,\infty }^{\varrho _1,d,1}\) is P-piecewise slice affine with \(P = K \cdot W^{\lfloor L/2 \rfloor }\), and each \(g \in {\mathtt {NN}}_{\infty ,L,N}^{\varrho _1,d,1}\) is P-piecewise slice affine with \(P = K \cdot N^{L-1}\).

Now, if , then there is a constant \(C_2 = C_2(f,\alpha ,p) > 0\) such that for each \(n \in {\mathbb {N}}\), there is \(g_n \in {\mathtt {NN}}_{n,L,\infty }^{\varrho _1,d,1}\) satisfying \(\Vert f - g_n \Vert _{L^p(\Omega _0)} \le \Vert f - g_n \Vert _{X} \le C_2 \cdot n^{-\alpha }\). Furthermore, since \(g_n\) is P-piecewise slice affine with \(P = K \cdot n^{\lfloor L/2 \rfloor }\), Eq. (4.9) shows that \( K^{-2} C_1 \cdot n^{-2 \lfloor L/2 \rfloor } \le \Vert f - g_n \Vert _{L^p(\Omega _0)} \le C_2 \cdot n^{-\alpha } \). Since this holds for all \(n \in {\mathbb {N}}\), we get \(\alpha - 2 \lfloor L/2 \rfloor \le 0\), as claimed.

The proof in case of is almost identical, and hence omitted. \(\square \)
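To get a feeling for the notion of P-piecewise slice affine functions used above, the following rough experiment (ours; the weights are random, the helper names are hypothetical, and NumPy is assumed) estimates the number of affine pieces of \(t \mapsto g(x_0 + t v)\) for the realization g of a small random ReLU network by detecting slope changes on a fine grid. Breakpoints falling strictly between grid points may be counted twice, so this is only an empirical illustration, not an exact computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_net(widths):
    """Random weights (A_l, b_l) for a fully connected ReLU network."""
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(widths[:-1], widths[1:])]

def realize(params, X):
    """Realization R(Phi) on a batch X of shape (num_points, d); the last layer is affine."""
    Y = X
    for i, (A, b) in enumerate(params):
        Y = Y @ A.T + b
        if i < len(params) - 1:
            Y = np.maximum(Y, 0.0)
    return Y[:, 0]

def count_slice_pieces(params, x0, v, T=8.0, M=20001, tol=1e-6):
    """Rough estimate of the number of affine pieces of t -> g(x0 + t v)."""
    t = np.linspace(-T, T, M)
    g = realize(params, x0 + t[:, None] * v)
    slopes = np.diff(g) / np.diff(t)
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > tol))

params = random_relu_net([2, 8, 8, 1])     # d = 2, L = 3 (two hidden layers)
x0, v = rng.standard_normal(2), rng.standard_normal(2)
print("estimated number of affine pieces along the slice:", count_slice_pieces(params, x0, v))
```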

Our next result shows that for networks of fixed depth, neural networks using the activation function \(\varrho _r\) with \(r \ge 2\) are strictly more expressive than ReLU networks—at least in the regime of very high approximation rates.

Corollary 4.19

Consider \(\Omega \subseteq {\mathbb {R}}^d\) an open admissible domain, \(p \in (0,\infty ]\), , \(L \in {\mathbb {N}}\). In case of \(d = 1\), assume that \(r \ge 4\) and \(L \ge 2\), or that \(r \in \{2,3\}\) and \(L \ge 3\). In case of \(d > 1\), assume instead that \(r \ge 4\) and \(L \ge 3\), or that \(r \in \{2,3\}\) and \(L \ge 5\). Then, the following hold:

Proof

We use Lemma 4.20 to get and , and we conclude using Corollary 4.18. \(\square \)

Lemma 4.20

Consider \(d,r,L \in {\mathbb {N}}\), \(\Omega \subset {\mathbb {R}}^d\) an open admissible domain, \(p \in (0,\infty ]\), . In case of \(d = 1\), assume that \(r \ge 4\) and \(L \ge 2\), or that \(r \in \{2,3\}\) and \(L \ge 3\). In case of \(d > 1\), assume instead that \(r \ge 4\) and \(L \ge 3\), or that \(r \in \{2,3\}\) and \(L \ge 5\).

Then, for each \(\alpha > 0\) and \(q \in (0,\infty ]\), we have \(\blacktriangleleft \)

Proof

Since \(\Omega \) is an admissible domain, it is non-empty. Being open, \(\Omega \) thus contains a hyper-rectangle \([a,b] := \prod _{i=1}^d [a_i,b_i] \subset \Omega \), where \(a_i < b_i\).

For \(r' \ge 2\), let \(\sigma _{r'} \in {\mathtt {SNN}}^{\varrho _{r'},1,1}_{2(r'+1),2,r'+1}\) be the function constructed in Lemma 4.4. As \(\sigma _{r'}\) satisfies (3.9), the function g built from \(\sigma _{r'}\) in Lemma 3.26-(2) for small enough \(\varepsilon \) is nonzero and satisfies \({{{\text {supp}}}(g) \subset [a,b] \subset \Omega }\) and \(g \in {\mathtt {NN}}_{\infty ,3,\infty }^{\varrho _{r'},d,1}\) (resp. \(g \in {\mathtt {NN}}_{\infty ,2,\infty }^{\varrho _{r'},d,1}\) when \(d=1\)). Note that if \(r' \ge 4\), then \(\varrho _{r'} \in C^{3}({\mathbb {R}})\); hence, \(g \in C^{3}_{c}({\mathbb {R}}^{d}) {\setminus } \{0\}\).

When \(r \ge 4\), set \(r':=r\) so that \(g \in {\mathtt {NN}}^{\varrho _{r},d,1}_{\infty ,3,\infty }\) (\(g \in {\mathtt {NN}}^{\varrho _{r},d,1}_{\infty ,2,\infty }\) when \(d=1\)). When \(r \in \{2,3\}\), set \(r' := r^{2} \ge 4\). As \(\varrho _{r'} = \varrho _{r} \circ \varrho _{r}\), Lemma 2.20 with \(s=2\) yields \(g \in {\mathtt {NN}}^{\varrho _{r},d,1}_{\infty ,5,\infty }\) (\(g \in {\mathtt {NN}}^{\varrho _{r},d,1}_{\infty ,3,\infty }\) for \(d=1\)).

It is not hard to see that, in each case, our assumptions regarding L imply for all sufficiently large n that \( g|_{\Omega } \in \mathtt {W}_{n}(X,\varrho _{r},L) \cap \mathtt {N}_{n}(X,\varrho _{r},L) \), and hence, . \(\square \)

5 Direct and Inverse Estimates with Besov Spaces

In this section, we characterize certain embeddings

  • of Besov spaces into and ; these are called direct estimates;

  • of and into Besov spaces; these are called inverse estimates.

Since the approximation classes for output dimension \(k > 1\) are k-fold Cartesian products of the classes for \(k=1\) (cf. Remark 3.17), we focus on scalar output dimension \(k=1\). We will use so-called Jackson inequalities and Bernstein inequalities, as well as the notion of real interpolation spaces. These concepts are recalled in Sect. 5.1, while Besov spaces and some of their properties are briefly recalled in Sect. 5.2 before we proceed to our main results.

5.1 Reminders on Interpolation Theory

Given two quasi-normed vector spaces \((Y_J, \Vert \cdot \Vert _{Y_J})\) and \((Y_B, \Vert \cdot \Vert _{Y_B})\) with \(Y_J \hookrightarrow X\) and \(Y_B \hookrightarrow X\) for a given quasi-normed linear space \((X, \Vert \cdot \Vert _X)\), we say that \(Y_J\) fulfills a Jackson inequality with exponent \(\gamma > 0\) with respect to the family \(\Sigma = (\Sigma _n)_{n \in {\mathbb {N}}_0}\), if there is a constant \(C_J > 0\) such that

$$\begin{aligned} E(f,\Sigma _{n})_{X} \le C_J \cdot n^{-\gamma } \cdot \Vert f \Vert _{Y_J} \qquad \forall \, f \in Y_J \text { and } n \in {\mathbb {N}}. \end{aligned}$$
(J)

We say that \(Y_B\) fulfills a Bernstein inequality with exponent \(\gamma > 0\) with respect to \(\Sigma = (\Sigma _n)_{n \in {\mathbb {N}}_0}\), if there is a constant \(C_B > 0\) such that

$$\begin{aligned} \Vert \varphi \Vert _{Y_B} \le C_B \cdot n^\gamma \cdot \Vert \varphi \Vert _X \qquad \forall \, n \in {\mathbb {N}}\text { and } \varphi \in \Sigma _n . \end{aligned}$$
(B)
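A classical example to keep in mind (standard approximation theory, not taken from the paper and not used in the sequel): take \(X = L_2({\mathbb {T}})\), let \(\Sigma _n\) be the trigonometric polynomials of degree at most n, and let \(Y_J = Y_B = H^{s}({\mathbb {T}})\). Truncating the Fourier series and using \((1+k^{2})^{s/2} \le 2^{s/2} n^{s}\) for \(|k| \le n\) give matching Jackson and Bernstein inequalities with exponent \(\gamma = s\):

$$\begin{aligned} E(f,\Sigma _n)_{L_2} \le n^{-s} \cdot \Vert f \Vert _{H^{s}} \qquad \text {and} \qquad \Vert \varphi \Vert _{H^{s}} \le 2^{s/2} \cdot n^{s} \cdot \Vert \varphi \Vert _{L_2} \qquad \forall \, \varphi \in \Sigma _n , \ n \in {\mathbb {N}}. \end{aligned}$$

Proposition 5.1 below then identifies the resulting approximation classes, for \(0< \alpha < s\) and \(0 < q \le \infty \), with the interpolation spaces \((L_2({\mathbb {T}}), H^{s}({\mathbb {T}}))_{\alpha /s, q}\), which are the Besov spaces \(B^{\alpha }_{2,q}({\mathbb {T}})\); Sects. 5.3-5.5 follow this pattern with spline dictionaries in place of trigonometric polynomials.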

As shown in the proof of [21, Chapter 7, Theorem 9.1], we have the following:

Proposition 5.1

Denote by \((X,Y)_{\theta , q}\) the real interpolation space obtained from X and Y, as defined, e.g., in [21, Chapter 6, Section 7]. Then, the following hold:

  • If \(Y_J \hookrightarrow X\) fulfills the Jackson inequality with exponent \(\gamma > 0\), then

  • If \(Y_B \hookrightarrow X\) fulfills the Bernstein inequality with exponent \(\gamma >0\), then

In particular, if the single space \(Y = Y_J = Y_B\) satisfies both inequalities with the same exponent \(\gamma \), then for all \(0< \alpha < \gamma \) and \(0 < q \le \infty \).

By [21, Chapter 7, Theorem 9.3], if \(\Sigma \) satisfies Properties (P1)–(P5), then for \(0 < \tau \le \infty \), \(0< \alpha < \infty \) the space \(Y := A^{\alpha }_{\tau }(X,\Sigma )\) satisfies matching Jackson and Bernstein inequalities with exponent \(\gamma :=\alpha \). The Bernstein inequality reads

$$\begin{aligned} \exists \, C = C (\alpha ,\tau ,X) > 0 \quad \forall \, n \in {\mathbb {N}}\text { and } \varphi \in \Sigma _n : \quad \Vert \varphi \Vert _{A^{\alpha }_{\tau }(X,\Sigma )} \le C \cdot n^\alpha \cdot \Vert \varphi \Vert _X .\nonumber \\ \end{aligned}$$
(5.1)

We will also use the following well-known property of (real) interpolation spaces (see [21, Chapter 6, Theorem 7.1]): For quasi-Banach spaces \(X_1, X_2\) and \(Y_1, Y_2\), assume that \(T : X_1 + X_2 \rightarrow Y_1 + Y_2\) is linear and such that \(T|_{X_i} : X_i \rightarrow Y_i\) is bounded for \(i \in \{1,2\}\). Then, \(T|_{(X_1, X_2)_{\theta ,q}} : (X_1, X_2)_{\theta ,q} \rightarrow (Y_1, Y_2)_{\theta ,q}\) is well defined and bounded for all \(\theta \in (0,1)\) and \(q \in (0,\infty ]\).

5.2 Reminders on Besov Spaces

We refer to [20, Section 2] for the definition of the Besov spaces \(B^{s}_{\sigma ,\tau } (\Omega ) := B^s_{\tau }(X_\sigma (\Omega ;{\mathbb {R}}))\) with \(\sigma , \tau \in (0,\infty ]\), \(s \in (0,\infty )\) and with a Lipschitz domain \(\Omega \subset {\mathbb {R}}^{d}\) (see [1, Definition 4.9] for the precise definition of these domains).

As shown in [19, Theorem 7.1], we have for all \(p, s \in (0,\infty )\) the embedding

$$\begin{aligned} B^{s}_{\sigma ,p}( (0,1)^d ) \hookrightarrow L_p( (0,1)^d ;{\mathbb {R}}), \quad \text {provided} \quad \sigma = (s/d + 1/p)^{-1}. \end{aligned}$$

Combined with the embedding \(B^{s}_{p,q}(\Omega ) \hookrightarrow B^s_{p,q'}(\Omega )\) for \(q \le q'\) (see [17, Displayed equation on Page 92]) and because of \(\sigma = (s/d + 1/p)^{-1} \le p\), we see that

$$\begin{aligned} B^{s}_{\sigma ,\sigma } ( (0,1)^d ) \hookrightarrow B^{s}_{\sigma ,p}( (0,1)^d ) \hookrightarrow L_p( (0,1)^d ;{\mathbb {R}}), \quad \text {provided} \quad \sigma = (s/d + 1/p)^{-1}. \nonumber \\ \end{aligned}$$
(5.2)

For the special case \(\Omega = (0,1) \subset {\mathbb {R}}\) and each fixed \(p \in (0,\infty )\), the sub-family of Besov spaces \(B^{s}_{\sigma ,\sigma }( (0,1) )\) with \(\sigma = (s/d+1/p)^{-1}\) satisfies

$$\begin{aligned} \big ( L_p( (0,1);{\mathbb {R}}), B^s_{\sigma , \sigma }( (0,1) ) \big )_{\theta ,q} = B^{\theta s}_{q,q}( (0,1) ), \quad \text {for}\ 0< \theta < 1, \text {where}\ q= (\theta s + 1/p)^{-1}.\nonumber \\ \end{aligned}$$
(5.3)

This is shown in [21, Chapter 12, Corollary 8.5].

Finally, from the definition of Besov spaces given in [20, Equation (2.2)] it is clear that

$$\begin{aligned} B_{p,q}^\alpha (\Omega ) \hookrightarrow B_{p,q}^\beta (\Omega ) \quad \text {if} \quad p,q \in (0,\infty ] \text { and } 0< \beta < \alpha . \end{aligned}$$
(5.4)

5.3 Direct Estimates

In this subsection, we investigate embeddings of Besov spaces into the approximation spaces where \(\Omega \subseteq {\mathbb {R}}^{d}\) is an admissible domain and with \(p \in (0,\infty ]\). For technical reasons, we further assume \(\Omega \) to be a bounded Lipschitz domain, such as \(\Omega =(0,1)^d\), see [1, Definition 4.9]. The main idea is to exploit known direct estimates for Besov spaces on such domains which give error bounds for the n-term approximations with B-spline-based wavelet systems, see [19].

For \(t \in {\mathbb {N}}_0\) and \(d \in {\mathbb {N}}\), the tensor product B-spline is \( \beta _d^{(t)} (x_1,\dots ,x_d) := \beta _{+}^{(t)}(x_1) \, \beta _{+}^{(t)}(x_2) \, \cdots \beta _{+}^{(t)}(x_d), \) where \(\beta _+^{(t)}\) is as introduced in Definition 4.5. Notice that \(\beta _d^{(0)} = {\mathbb {1}}_{[0,1)^{d}}\).

By Lemma 4.4, there is \({\sigma _{r}} \in {\mathtt {NN}}^{\varrho _{r},1,1}_{2(r+1),2,r+1}\) satisfying (3.9); hence, by Lemma 3.26 there is \(L \le 3\) such that for each \(\varepsilon > 0\), we can approximate \(\beta _d^{(0)}\) by \(g_\varepsilon = {\mathtt {R}}(\Phi _\varepsilon )\) with precision \(\Vert \beta _d^{(0)} - g_\varepsilon \Vert _{L_p({\mathbb {R}}^d)} < \varepsilon \), where \(L(\Phi _\varepsilon ) = L\) and \( \Phi _\varepsilon \in {\mathtt {NN}}^{\varrho _{r},d,1}_{w,3,m} \), for suitable \(w = w(d,r), m = m(d,r) \in {\mathbb {N}}\). Furthermore, if \(d = 1\), then Lemma 3.26 shows that the same holds for some \(\Phi _\varepsilon \in {\mathtt {NN}}^{\varrho _r,d,1}_{w,2,m}\).

For approximating \(\beta ^{(t)}_d\) (with \(t \in {\mathbb {N}}\)) instead of \(\beta _d^{(0)}\), we can actually do better. In fact, we prove in Appendix D.1 that one can implement \(\beta _d^{(t)}\) as a \(\varrho _{t}\)-network, provided that \(t \ge \min \{ d,2 \}\).

Lemma 5.2

Let \(d,t \in {\mathbb {N}}\) with \(t \ge \min \{ d, 2 \}\). Then, the tensor product B-spline

$$\begin{aligned} \beta _d^{(t)} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, \quad \beta _d^{(t)} (x) := \beta _{+}^{(t)}(x_1) \beta _{+}^{(t)}(x_2) \cdots \beta _{+}^{(t)} (x_d) \end{aligned}$$
(5.5)

satisfies \(\beta _d^{(t)} \in {\mathtt {NN}}^{\varrho _t,d,1}_{w,L,m}\) with \(L = 2 + 2 \lceil \log _{2}d\rceil \) and

$$\begin{aligned} {\left\{ \begin{array}{ll} w = 28 d(t+1) \;{\mathrm{and}}\; m = 13d(t+1), &{}\quad {\mathrm{if}}\; d > 1, \\ w = 2 (t+2) \;{\mathrm{and}}\; m = t+2 , &{} \quad {\mathrm{if}}\; d = 1. \end{array}\right. } \blacktriangleleft \end{aligned}$$
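For \(d = 1\), the representation behind this count is the classical finite-difference formula for the cardinal B-spline, \(\beta _{+}^{(t)}(x) = \frac{1}{t!} \sum _{k=0}^{t+1} (-1)^{k} \binom{t+1}{k} \varrho _{t}(x-k)\), which uses a single hidden layer of \(t+2\) \(\varrho _t\)-neurons, matching \(m = t+2\) above. The following sketch is our own check, not the construction of Appendix D.1; it assumes \(\varrho _t(x) = \max \{0,x\}^{t}\) and that \(\beta _+^{(t)}\) in Definition 4.5 is the degree-t cardinal B-spline with support \([0,t+1]\) (consistent with \(\beta _d^{(0)} = {\mathbb {1}}_{[0,1)^{d}}\)), and it tests the formula against the convolution recursion \(\beta _+^{(t)} = \beta _+^{(t-1)} * {\mathbb {1}}_{[0,1)}\) using NumPy.

```python
import numpy as np
from math import comb, factorial

def rho(x, t):
    """rho_t(x) = max(0, x)**t for t >= 1."""
    return np.maximum(x, 0.0) ** t

def bspline_via_rho(x, t):
    """beta_+^{(t)} as a single hidden layer of t+2 rho_t-neurons."""
    return sum((-1) ** k * comb(t + 1, k) * rho(x - k, t)
               for k in range(t + 2)) / factorial(t)

# Check against beta_+^{(t)}(x) = int_0^1 beta_+^{(t-1)}(x - u) du (midpoint rule).
t = 3
x = np.linspace(-1.0, t + 2.0, 200)
u = (np.arange(2000) + 0.5) / 2000.0
conv = bspline_via_rho(x[:, None] - u[None, :], t - 1).mean(axis=1)
assert np.allclose(conv, bspline_via_rho(x, t), atol=1e-5)
```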

In the following, we will consider n-term approximations with respect to the continuous wavelet-type system generated by \(\beta _d^{(t)}\). Precisely, for \(a > 0\) and \(b \in {\mathbb {R}}^d\), define \(\beta ^{(t)}_{a,b} := \beta _d^{(t)} (a \cdot + b)\). The continuous wavelet-type system generated by \(\beta _d^{(t)}\) is then \({\mathcal {D}}_d^{t} := \{ \beta ^{(t)}_{a,b} :a \in (0,\infty ), b \in {\mathbb {R}}^d \}\). For any \(t \in {\mathbb {N}}_{0}\), we define \(\Sigma _{0}({\mathcal {D}}_d^t) := \{0\}\), and the reservoir of all n-term expansions from \({\mathcal {D}}_d^{t}\), \(n \in {\mathbb {N}}\), is given by

$$\begin{aligned} \Sigma _n({\mathcal {D}}_d^{t}) := \bigg \{ g = \sum _{i=1}^n c_i g_i :c_i \in {\mathbb {R}}, g_i \in {\mathcal {D}}_d^t \bigg \}. \end{aligned}$$
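As a small, concrete illustration of n-term approximation from \({\mathcal {D}}_1^{1}\) (our own example, not from the paper; NumPy is assumed, and the function f, the level J and the helper names are arbitrary choices), the sketch below expands a univariate function in the hierarchical hat-function (Faber-Schauder) system, whose elements \(\Delta _1(2^{j}\cdot - k) = \beta _{+}^{(1)}(2^{j+1}\cdot - 2k)\) all belong to \({\mathcal {D}}_1^{1}\), and keeps the n coefficients that are largest in absolute value.

```python
import numpy as np

def hat(x):
    # Delta_1 = beta_+^{(1)}(2x): the triangular hat on [0, 1] with peak 1 at x = 1/2
    return np.maximum(0.0, 1.0 - np.abs(2.0 * x - 1.0))

def faber_schauder_coeffs(f, J):
    """Hierarchical hat coefficients of f on [0, 1] up to level J, assuming f(0) = f(1) = 0."""
    coeffs = []
    for j in range(J + 1):
        for k in range(2 ** j):
            left, mid, right = k / 2 ** j, (2 * k + 1) / 2 ** (j + 1), (k + 1) / 2 ** j
            coeffs.append((j, k, f(mid) - 0.5 * (f(left) + f(right))))
    return coeffs

def n_term_approx(coeffs, n, x):
    """Keep the n largest coefficients; each kept term is a dilated/shifted hat from D_1^1."""
    kept = sorted(coeffs, key=lambda jkc: abs(jkc[2]), reverse=True)[:n]
    return sum(c * hat(2 ** j * x - k) for j, k, c in kept)

f = lambda x: np.sin(np.pi * x) * np.sqrt(np.abs(x - 0.3))   # f(0) = f(1) = 0
x = np.linspace(0.0, 1.0, 4097)
coeffs = faber_schauder_coeffs(f, J=10)
for n in (10, 40, 160):
    err = np.max(np.abs(f(x) - n_term_approx(coeffs, n, x)))
    print(f"n = {n:4d}   sup-norm error: {err:.3e}")
```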

The following lemma relates \(\Sigma _n({\mathcal {D}}_d^{t})\) to \({\mathtt {NN}}^{\varrho _{r},d,1}_{cn,L,cn}\) for a suitably chosen constant \(c=c(d,r,t) \in {\mathbb {N}}\).

Lemma 5.3

Consider \(d \in {\mathbb {N}}\), \(t \in {\mathbb {N}}_{0}\), \(p \in (0,\infty ]\), .

  1. (1)

    If \(t=0\) and \(p<\infty \), then, with \(L := \min \{ d+1,3 \}\) and \(c = c(d,r) \in {\mathbb {N}}\), we have

    $$\begin{aligned} \Sigma _{n}({\mathcal {D}}_d^0) \subset \overline{{\mathtt {NN}}^{\varrho _{r},d,1}_{cn,L,cn} \cap X}^{X} \qquad \forall \, n,r \in {\mathbb {N}}. \end{aligned}$$
    (5.6)
  2. (2)

    If \(t \ge \min \{ d,2 \}\), then, with \(L := 2 + 2 \lceil \log _{2} d \rceil \), we have for any \(p \in (0,\infty ]\) that

    $$\begin{aligned} \Sigma _n({\mathcal {D}}_d^t) \subset {\mathtt {NN}}^{\varrho _t,d,1}_{cn,L,cn} \cap X \qquad \forall \, n \in {\mathbb {N}}, \end{aligned}$$
    (5.7)

    where \(c = c(d,t) \in {\mathbb {N}}\). \(\blacktriangleleft \)

Proof

Part (1): For \(t = 0\), \(r \in {\mathbb {N}}\), \(0< p < \infty \), we have already noticed before Lemma 5.2 that there exist \(w = w(d,r), m = m(d,r) \in {\mathbb {N}}\) such that \(\beta _d^{(0)} \in \overline{{\mathtt {NN}}^{\varrho _r,d,1}_{w,L,m} \cap X}^{X}\), where \(L = \min \{ d+1,3 \}\). Since \(\beta _{a,b}^{(0)} = \beta _d^{(0)} \circ P_{a,b}\) for the affine map \(P_{a,b} : {\mathbb {R}}^d \rightarrow {\mathbb {R}}^d, x \mapsto a x + b\) and since \(\Vert P_{a,b}\Vert _{\ell ^{0,\infty }_*} = 1\), Lemmas 2.17-(1) and 2.18-(1) yield \(\beta _{a,b}^{(0)} \in \overline{{\mathtt {NN}}^{\varrho _r,d,1}_{c,L,c} \cap X}^{X}\) with \(c := \max \{ w,m \}\). Thus, the claim follows from Parts (1) and (3) of Lemma 2.17.

Part (2): For \(t \ge \min \{ d,2 \}\), Lemma 5.2 shows that \(\beta _{a,b}^{(t)} \in {\mathtt {NN}}^{\varrho _t,d,1}_{c,L,c} \cap X\) with \(L = 2 + 2 \lceil \log _2 d \rceil \) and \(c := \max \{ w,m \}\) where \(w = w(d,t)\) and \(m = m(d,t)\) are as in Lemma 5.2. As before, we conclude using Parts (1) and (3) of Lemma 2.17. \(\square \)

Corollary 5.4

Consider \(d \in {\mathbb {N}}\), \(\Omega \subset {\mathbb {R}}^{d}\) an admissible domain, \(p \in (0, \infty ]\), , \({\mathscr {L}}\) a depth growth function, \(L := \sup _{n} {\mathscr {L}}(n) \in {\mathbb {N}}\cup \{\infty \}\). For \(t \in {\mathbb {N}}_{0}\), define \(\Sigma ({\mathcal {D}}_d^{t}) := ( \Sigma _{n}({\mathcal {D}}_d^t))_{n \in {\mathbb {N}}_{0}}\).

  1. (1)

    If \(L \ge \min \{ d+1, 3 \}\) and \(p < \infty \), then for any \(r \ge 1\)

  2. (2)

    If \(L \ge 2 + 2 \lceil \log _{2} d\rceil \) then for any \(r \ge \min \{ d,2 \}\), we have

Proof

For the proof of Part (1) let \(L_0 := \min \{d+1, 3\}\), while \(L_0 := 2 + 2 \lceil \log _2 d \rceil \) for the proof of Part (2). Since \(L \ge L_0\), there is \(n_{0} \in {\mathbb {N}}\) such that \({\mathscr {L}}(n) \ge L_0\) for all \(n \ge n_{0}\).

We first start with the proof of Part (2). By Lemma 5.3-(2), with \(t = r \ge \min \{ d,2 \}\), Eq. (5.7) holds for some \(c \in {\mathbb {N}}\). For \(n \ge n_{0}/c\), we have \(2 + 2 \lceil \log _2 d \rceil = L_0 \le {\mathscr {L}}(cn)\), whence

$$\begin{aligned} \Sigma _n({\mathcal {D}}_d^t) \subset \overline{\mathtt {W}_{cn}(X,\varrho _{r},{\mathscr {L}})}^{X}. \end{aligned}$$

Therefore, we see that

$$\begin{aligned} E(f,\Sigma _n({\mathcal {D}}_d^t))_{X} \ge E(f,\mathtt {W}_{cn}(X,\varrho _{r},{\mathscr {L}}))_{X} \quad \forall \, f \in X \text { and } n \ge \tfrac{n_0}{c}. \end{aligned}$$
(5.8)

For the proof of Part (1), the same reasoning with (5.6) instead of (5.7) yields (5.8) with \(t=0\) and any \(r \in {\mathbb {N}}\). For both parts, we conclude using Lemma 3.1 and the associated remark. \(\square \)

Theorem 5.5

Let \(\Omega \subset {\mathbb {R}}^{d}\) be a bounded Lipschitz domain of positive measure. For \(p \in (0,\infty ]\), define \(X_p (\Omega )\) as in Eq. (1.3). Let \({\mathscr {L}}\) be a depth growth function.

  1. (1)

    Suppose that \(d = 1\) and \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge 2\). Then, the following holds for each \(r \in {\mathbb {N}}\):

    (5.9)
  2. (2)

    Suppose that \(d > 1\) and \(L := \sup _{n \in {\mathbb {N}}} {\mathscr {L}}(n) \ge 3\), and let \(r \in {\mathbb {N}}\).

    Define \(r_0 := r\) if \(r \ge 2\) and \(L \ge 2 + 2 \lceil \log _2 d \rceil \), and \(r_0 := 0\) otherwise.

    Then,

    (5.10)

Remark 5.6

If \(\Omega \) is open, then each Besov space \(B_{p,q}^{s d}(\Omega )\) contains \(C_{c}^{3}(\Omega )\). Hence, by Corollary 4.18, the embeddings (5.9) or (5.10) with \(r = 1\) imply that \(\lfloor L/2 \rfloor \ge s / 2\). This is indeed the case, since these embeddings for \(r = 1\) are only established when \(L \ge 2\) and \(0< d s < 1 + \min \{ p^{-1},1 \} \le 2\), which implies \(s / 2 < 1/d \le 1 \le \lfloor L/2 \rfloor \).\(\blacklozenge \)

Proof of Theorem 5.5

See Appendix D.2. \(\square \)

5.4 Limits on Possible Inverse Estimates

For networks of finite depth \({\mathscr {L}} \equiv L < \infty \), there are limits on possible embeddings of (resp. of ) into Besov spaces.

Theorem 5.7

Consider \(\Omega = (0,1)^{d}\), \(p \in (0,\infty ]\), , \({\mathscr {L}}\) a depth growth function such that \(L := \sup _{n} {\mathscr {L}}(n) \in {\mathbb {N}}_{\ge 2} \cup \{\infty \}\) and \(r \in {\mathbb {N}}\). For \(\sigma , \tau , q \in (0,\infty ]\) and \(\alpha , s \in (0,\infty )\), the following claims hold:

  1. (1)

    If , then \( \alpha \ge \lfloor L/2\rfloor \cdot \min \{ s, 2 \} \).

  2. (2)

    If , then \( \alpha \ge (L-1) \cdot \min \{ s, 2 \} \). \(\blacktriangleleft \)

A direct consequence is that for networks of unbounded depth (\(L=\infty \)), none of the spaces , embed into any Besov space of strictly positive smoothness \(s>0\).

Remark 5.8

For \(L=2\), as \(\lfloor L/2\rfloor = L-1\), the two inequalities resulting from Theorem 5.7 match. This is natural as for \(L=2\) we know from Lemma 3.9 that . For \(L \ge 3\), the inequalities no longer match. Each inequality is in fact stronger than what would be achieved by simply combining the other one with Lemma 3.9. Note also that in contrast to the direct estimate (5.10) of Theorem 5.5 where the Besov spaces are of smoothness sd, here the dimension d does not appear. \(\blacklozenge \)

The proof of Theorem 5.7 employs a particular family of oscillating functions that have a long history [31] in the analysis of neural networks and of the benefits of depth [64].

Fig. 4: A plot of the function \(\Delta _j\) (for \(j = 3\))

Definition 5.9

(Sawtooth functions) Consider \(\beta _{+}^{(1)}\) the B-spline of degree one, and \(\Delta _{1} := \beta _{+}^{(1)}(2\cdot )\) the “hat” function supported on [0, 1]. For \(j \ge 1\), the univariate “sawtooth” function of order j,

$$\begin{aligned} \Delta _{j} = \sum _{k=0}^{2^{j-1}-1} \Delta _{1}(2^{j-1}\cdot -k), \end{aligned}$$
(5.11)

has support in [0, 1] and is made of \(2^{j-1}\) triangular “teeth” (see Fig. 4). The multivariate sawtooth function \(\Delta _{j,d}\) is defined as \(\Delta _{j,d}(x) := \Delta _{j}(x_{1})\) for \(x = (x_{1},\ldots ,x_{d}) \in {\mathbb {R}}^{d}\), \(j \in {\mathbb {N}}\). \(\blacktriangleleft \)
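The sum (5.11) can be checked against the equivalent compositional description \(\Delta _j = \Delta _1 \circ \cdots \circ \Delta _1\) (j factors), which is the form usually exploited to show that depth produces exponentially many pieces (cf. [64]). Below is a minimal NumPy check of this equivalence; it is our own illustration and assumes the normalization \(\Delta _1(x) = \max \{0, 1-|2x-1|\}\), i.e., the hat with peak value 1 at \(x = 1/2\).

```python
import numpy as np

def hat(x):
    # Delta_1 = beta_+^{(1)}(2x): triangular hat supported on [0, 1]
    return np.maximum(0.0, 1.0 - np.abs(2.0 * x - 1.0))

def sawtooth_sum(x, j):
    # Definition (5.11): 2^{j-1} shifted copies of the hat at scale 2^{j-1}
    return sum(hat(2 ** (j - 1) * x - k) for k in range(2 ** (j - 1)))

def sawtooth_composed(x, j):
    # Equivalent deep realization: Delta_j = Delta_1 o ... o Delta_1 (j times)
    y = x
    for _ in range(j):
        y = hat(y)
    return y

x = np.linspace(0.0, 1.0, 10001)
for j in (1, 2, 3, 5):
    assert np.allclose(sawtooth_sum(x, j), sawtooth_composed(x, j))
```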

An important property of \(\Delta _j\) is that it is a realization of a \(\varrho _1\)-network of specific complexity. The proof of this lemma is in Appendix D.3.

Lemma 5.10

Let \(L \in {\mathbb {N}}_{\ge 2}\) and define \(C_L := 4 \, L + 2^{L-1}\). Then,

$$\begin{aligned} \Delta _j \in {\mathtt {NN}}^{\varrho _1, 1, 1}_{\infty , L, C_L \cdot 2^{j / (L-1)}} \qquad \text {and} \qquad \Delta _j \in {\mathtt {NN}}^{\varrho _1, 1, 1}_{C_L \cdot 2^{j / \lfloor L/2 \rfloor }, L, \infty } \qquad \forall \, j \in {\mathbb {N}}. \blacktriangleleft \end{aligned}$$

Corollary 5.11

For \(L \in {\mathbb {N}}_{\ge 2}\), let \(C_L\) as in Lemma 5.10. Then,

$$\begin{aligned} \Delta _{j,d} \in {\mathtt {NN}}^{\varrho _1, d, 1}_{\infty , L, C_L \cdot 2^{j / (L-1)}} \qquad \text {and} \qquad \Delta _{j,d} \in {\mathtt {NN}}^{\varrho _1, d, 1}_{C_L \cdot 2^{j / \lfloor L/2 \rfloor }, L, \infty } \qquad \forall \, j \in {\mathbb {N}}. \blacktriangleleft \end{aligned}$$

Proof

We have \(\Delta _{j,d} = \Delta _j \circ T\) for the affine map \(T : {\mathbb {R}}^d \rightarrow {\mathbb {R}}, (x_1,\dots ,x_d) \mapsto x_1\), which satisfies \(\Vert T\Vert _{\ell ^{0,\infty }_*} = 1\). Now, the claim is a direct consequence of Lemmas 5.10 and 2.18-(1). \(\square \)

We further prove in Appendix D.4 that the Besov norm of \(\Delta _{j,d}\) grows exponentially with j:

Lemma 5.12

Let \(d \in {\mathbb {N}}\) and \(\Omega = (0,1)^d\). Let \(p,q \in (0,\infty ]\) and \(s \in (0, \infty )\) be arbitrary. Let \({s'} \in (0,2)\) with \({s'} \le s\). There is a constant \(c = c(d,p,q,s,{s'}) > 0\) such that

$$\begin{aligned} \Vert \Delta _{j,d}\Vert _{B^s_{p,q} (\Omega )} \ge c \cdot 2^{{s'} j} \qquad \forall \, j \in {\mathbb {N}}. \blacktriangleleft \end{aligned}$$

Given this lower bound on the Besov space norm of the sawtooth function \(\Delta _{j,d}\), we can now prove the limitations regarding possible inverse estimates that we announced above.

Proof of Theorem 5.7

We start with the proof for . Let us fix \(\ell \in {\mathbb {N}}\) with \(\ell \le \lfloor L/2 \rfloor \), and note that \(2\ell \le L\), so that there is some \(j_0 = j_0 (\ell , {\mathscr {L}}) \in {\mathbb {N}}\) such that \({\mathscr {L}}(2^j) \ge 2\ell \) for all \({j \ge j_0}\). Now, Corollary 5.11 (applied with \(2\ell \) instead of L) shows that \( \Delta _{\ell j, d} \in {\mathtt {NN}}^{\varrho _1, d, 1}_{C_{2\ell } 2^{(\ell j)/\ell }, 2\ell , \infty } \subset {\mathtt {NN}}^{\varrho _1, d, 1}_{C_{2\ell } 2^{j}, {\mathscr {L}}(C_{2\ell } 2^{j}), \infty } \) for all \(j \ge j_0\) and a suitable constant \(C_{2\ell } \in {\mathbb {N}}\). Therefore, the Bernstein inequality (5.1) yields a constant \(C = C(d,\alpha , q, p) > 0\) such that

Let \(s_0 := \min \{ 2, s \}\), let \(0< s' < s_0\) be arbitrary, and note as a consequence of Eq. (4.3) that

Here, we used that \(\Omega \) is bounded, so that Eq. (4.3) is applicable. Overall, as a consequence of this embedding and of Lemma 5.12, we obtain \(c = c(d,s',s,\sigma ,\tau ) > 0\) and \({C' = C'(\sigma ,\tau ,s,p,q,\alpha ,{\mathscr {L}},\Omega ) > 0}\) satisfying

for all \(j \ge j_0\). This implies \(s' \cdot \ell \le \alpha \). Since this holds for all \(s' \in (0,s_0)\) and all \(\ell \in {\mathbb {N}}\) with \(\ell \le \lfloor L/2 \rfloor \), we get \(\lfloor L/2 \rfloor \cdot s_0 \le \alpha \), as claimed.

Now, we prove the claim for . In this case, fix \(\ell \in {\mathbb {N}}\) with \(\ell +1 \le L\), and note that there is some \(j_0 \in {\mathbb {N}}\) satisfying \({\mathscr {L}}(2^j) \ge \ell + 1\) for all \(j \ge j_0\). Now, Corollary 5.11 (applied with \(\ell +1\) instead of L) yields a constant \(C_{\ell +1} \in {\mathbb {N}}\) such that \( \Delta _{\ell j, d} \in {\mathtt {NN}}^{\varrho _1, d, 1}_{\infty , \ell +1, C_{\ell +1} 2^{(\ell j)/((\ell +1) - 1)}} \subset {\mathtt {NN}}^{\varrho _1, d, 1}_{\infty , {\mathscr {L}}(C_{\ell +1} 2^{j}), C_{\ell +1} 2^{j}} \) for all \(j \ge j_0\). As above, the Bernstein inequality (5.1) therefore shows for all \(j \ge j_0\) and some constant \(C = C(d,\alpha ,q,p) < \infty \). Reasoning as above, we get that

for all \(j \ge j_0\) and \(0< s' < s_0 = \min \{2,s\}\). Therefore, \(s' \cdot \ell \le \alpha \). Since this holds for all \(s' \in (0,s_0)\) and all \(\ell \in {\mathbb {N}}\) with \(\ell + 1 \le L\), we get \(\alpha \ge s_0 \cdot (L-1)\), as claimed. \(\square \)

5.5 Univariate Inverse Estimates (\(d=1\))

The “no-go theorem” (Theorem 5.7) holds for \(\Omega = (0,1)^{d}\) in any dimension \(d \ge 1\), for any \(0 < p \le \infty \). In this subsection, we show in dimension \(d = 1\) that Theorem 5.7 is quite sharp. Precisely, we prove the following:

Theorem 5.13

Let \(X = L_p(\Omega )\) with \(\Omega =(0,1)\) and \(p \in (0,\infty )\), let \(r \in {\mathbb {N}}\), and let \({\mathscr {L}}\) be a depth growth function. Assume that \(L := \sup _{n} {\mathscr {L}}(n) < \infty \). Setting \(\nu := \lfloor L/2\rfloor \), the following statements hold:

  1. (1)

    For \(s \in (0,\infty )\), \(\alpha \in (0, \nu s)\) and \(q \in (0,\infty ]\), we have

  2. (2)

    For \(\alpha \in (0,\infty )\), we have

The same holds for instead of if we set \(\nu := L-1\). \(\blacktriangleleft \)

The proof involves a Bernstein inequality for piecewise polynomials by Petrushev [56], and new bounds on the number of pieces of piecewise polynomials implemented by \(\varrho _{r}\)-networks. Petrushev considers the (nonlinear) set \(\tilde{\mathtt {S}}(k,n)\) of all piecewise polynomials on (0, 1) of degree at most \(r=k-1\) (\(k \in {\mathbb {N}}\)) with at most \(n-1\) breakpoints in [0, 1]. In the language of Definition 4.6, \(\tilde{\mathtt {S}}(k,n) = {\mathtt {PPoly}}_n^r ( (0,1) )\) is the set of piecewise polynomials of degree at most \(r=k-1 \in {\mathbb {N}}_{0}\) with at most n pieces on (0, 1).

By [21, Chapter 12, Theorem 8.2] (see [56, Theorem 2.2] for the original proof), the following Bernstein-type inequality holds for each family \(\Sigma := (\tilde{\mathtt {S}}(k,n))_{n \in {\mathbb {N}}}\), \(k \in {\mathbb {N}}\):

Theorem 5.14

([56, Theorem 2.2]) Let \(\Omega = (0,1)\), \(p \in (0,\infty )\), \(r \in {\mathbb {N}}_{0}\), and \(s \in (0,r+1)\) be arbitrary, and set \(\sigma := (s + 1/p)^{-1}\). Then, there is a constant \(C < \infty \) such that we have

$$\begin{aligned} \Vert f \Vert _{B_{\sigma , \sigma }^{s}(\Omega )} \le C \cdot n^s \cdot \Vert f\Vert _{L_p(\Omega )} \qquad \forall \, n \in {\mathbb {N}}\text { and } f \in {\mathtt {PPoly}}_{n}^{r}(\Omega ). \blacktriangleleft \end{aligned}$$

Remark 5.15

Theorem 5.14 even holds for discontinuous piecewise polynomial functions, see [56, Theorem 2.2]. Hence, the Besov spaces in Theorem 5.13 also contain discontinuous functions. This is natural, as \(\varrho _{r}\)-networks with a bounded number of connections or neurons approximate indicator functions arbitrarily well (though with weight values going to infinity, see the proof of Lemma 3.26).\(\blacklozenge \)

When f is a realization of a \(\varrho _{r}\)-network of depth L, it is piecewise polynomial [64]. As there are \(L-1\) hidden layers, the polynomial pieces have degree at most \(r^{L-1}\); hence, \(f|_{(0,1)} \in {\mathtt {PPoly}}^{r^{L-1}}_{n} ( (0,1) )\) for large enough n. This motivates the following definition.

Definition 5.16

(Number of pieces) Define \(n_{r}(W,L,N)\) to be the optimal bound on the number of polynomial pieces for a \(\varrho _{r}\)-network with \(W \in {\mathbb {N}}_{0}\) connections, depth \(L \in {\mathbb {N}}\) and \(N \in {\mathbb {N}}_{0}\) neurons; that is,

$$\begin{aligned} n_{r}(W,L,N) := \min \left\{ n \in {\mathbb {N}}\quad :\quad \forall \, g \in {\mathtt {NN}}^{\varrho _{r},1,1}_{W,L,N} \,:\, g|_{(0,1)} \in {\mathtt {PPoly}}_{n}^{r^{L-1}}( (0,1) ) \right\} . \end{aligned}$$

Furthermore, let \(n_{r}(W,L,\infty ) := \sup _{N \in {\mathbb {N}}_{0}} n_{r}(W,L,N)\) and \(n_{r}(\infty ,L,N) := \sup _{W \in {\mathbb {N}}_{0}} n_{r}(W,L,N)\). \(\blacktriangleleft \)

Remark 5.17

The quantity \(n_r (W,L,N)\) is independent of the non-degenerate interval \(I \subset {\mathbb {R}}\) used in its definition. To see this, write \(n_r^{(I)} (W,L,N)\) for the analogue of \(n_r (W,L,N)\), but with (0, 1) replaced by a general non-degenerate interval \(I \subset {\mathbb {R}}\). First, note that \(n_r^{(I)} (W,L,N) \le n_r^{(J)} (W,L,N)\) if \(I \subset J\).

Next, note for \(g \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,L,N}\) and \(a \in (0,\infty )\), \(b \in {\mathbb {R}}\) that \(g_{a,b} := g (a \cdot + b) \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,L,N}\) as well (see Lemma 2.18) and that \(g|_I \in {\mathtt {PPoly}}_n^{r^{L-1}} (I)\) if and only if \(g_{a,b} |_{a^{-1} (I - b)} \in {\mathtt {PPoly}}_n^{r^{L-1}} (a^{-1} (I-b))\). Therefore, \(n_r^{(I)}(W,L,N) = n_r^{(a I + b)}(W,L,N)\) for all \(a \in (0,\infty )\) and \(b \in {\mathbb {R}}\).

Now, if \(J \subset {\mathbb {R}}\) is any non-degenerate interval, and if \(I \subset {\mathbb {R}}\) is a bounded interval, then \(a I + b \subset J\) for suitable \(a > 0\), \(b \in {\mathbb {R}}\). Hence, \(n_r^{(I)} = n_r^{(a I + b)} \le n_r^{(J)}\). In particular, this shows \(n_r^{(I)} = n_r^{(J)}\) for all bounded non-degenerate intervals \(I,J \subset {\mathbb {R}}\).

Finally, if \(g \in {\mathtt {NN}}^{\varrho _r,1,1}_{W,L,N}\) is arbitrary, then \(g \in {\mathtt {PPoly}}_n^{r^{L-1}}({\mathbb {R}})\) for some \(n \in {\mathbb {N}}\). Thus, there are \(a,b \in {\mathbb {R}}\), \(a < b\) such that \(g|_{(-\infty ,a+1)}\) and \(g|_{(b-1, \infty )}\) are polynomials of degree at most \(r^{L-1}\). Let \(k := n_r^{((a,b))} (W,L,N) = n_r^{((0,1))} (W,L,N)\), so that \(g|_{(a,b)} \in {\mathtt {PPoly}}_k^{r^{L-1}} ( (a,b) )\). Clearly, \(g \in {\mathtt {PPoly}}_{k}^{r^{L-1}} ({\mathbb {R}})\). Hence, \(n_r^{({\mathbb {R}})}(W,L,N) \le k = n_r^{((0,1))}(W,L,N)\).\(\blacklozenge \)

We now have the ingredients to establish the first main lemma behind the proof of Theorem 5.13.

Lemma 5.18

Let \(X = L_{p}(\Omega )\) with \(\Omega = (0,1)\) and \(p \in (0,\infty )\). Let \(r \in {\mathbb {N}}\) and \(\nu \in (0,\infty )\), and let \({\mathscr {L}}\) be a depth growth function such that \(L := \sup _{n} {\mathscr {L}}(n) < \infty \). Assume that

$$\begin{aligned} \sup _{W \in {\mathbb {N}}} W^{-\nu }\ n_{r}(W,L,\infty ) < \infty . \end{aligned}$$
(5.12)
  1. (1)

    For \(s \in (0,r+1)\), \(\alpha \in (0, \nu \cdot s)\), and \(q \in (0,\infty ]\), we have

    (5.13)
  2. (2)

    For \(\alpha \in (0, \nu (r+1))\), we have

    (5.14)

The same results hold with instead of if we assume instead that

$$\begin{aligned} \sup _{N \in {\mathbb {N}}} N^{-\nu }\ n_{r}(\infty ,L,N) < \infty . \blacktriangleleft \end{aligned}$$
(5.15)

Proof of Lemma 5.18

As \( {\mathtt {NN}}^{\varrho _{r},1,1}_{n,{\mathscr {L}}(n),\infty } \subset {\mathtt {NN}}^{\varrho _{r},1,1}_{n,L,\infty } \) for each \(n \in {\mathbb {N}}\), Theorem 5.14 and Eq. (5.12) yield a constant \(C < \infty \) such that

$$\begin{aligned} \Vert f \Vert _{B_{\sigma , \sigma }^s(\Omega )} \le C \cdot n^{\nu s} \cdot \Vert f \Vert _{L_p(\Omega )}, \quad \text {for all }\ n \in {\mathbb {N}}\text { and } f \in \mathtt {W}_{n}(X,\varrho _r,{\mathscr {L}}),\nonumber \\ \end{aligned}$$
(5.16)

where \(\sigma := (s+1/p)^{-1} = (s/d+1/p)^{-1}\) (recall \(d=1\)). By (5.2), we further get that \( Y_{B} := B_{\sigma , \sigma }^s(\Omega ) \hookrightarrow L_p(\Omega ) \), whence (5.16) is a valid Bernstein inequality for \(Y_{B}\) with exponent \(\gamma := s \cdot \nu > \alpha \). Proposition 5.1 with \(\theta := \alpha /\gamma =\alpha /(s\nu )\) and \(0 < q \le \infty \) yields (5.13).

When \(0< \alpha < \nu (r+1)\), there is \(s \in (0, r+1)\) such that \(0< \alpha < \nu \cdot s\); hence, (5.13) holds for any \(0 <q \le \infty \). By (5.3), we see for \(\theta := \frac{\alpha }{s \nu } \in (0,1)\) and \(q := (\theta s + 1/p)^{-1} = (\alpha /\nu + 1/p)^{-1}\) that the right-hand side of (5.13) is simply \(B_{q, q}^{\theta s} (\Omega ) = B_{q,q}^{\alpha /\nu }(\Omega )\).

The proof for follows the same steps. \(\square \)

Theorem 5.13 is a corollary of Lemma 5.18 once we establish (5.12) [resp. (5.15)]. The smaller \(\nu \), the better, as it yields a larger value for \(\alpha /\nu \), and hence a smoother (smaller) Besov space in (5.14).

Lemma 5.19

Consider \(L\in {\mathbb {N}}_{\ge 2}\), \(r \in {\mathbb {N}}\).

  • Property (5.12) holds if and only if \(\nu \ge \lfloor L/2\rfloor \);

  • Property (5.15) holds if and only if \(\nu \ge L-1\). \(\blacktriangleleft \)

Proof

If (5.12) holds with some exponent \(\nu \), then Lemma 5.18-(2) with \({\mathscr {L}} \equiv L\), arbitrary \(p \in (0,\infty )\), \(\alpha :=\nu \) and \(q:=(\alpha /\nu +1/p)^{-1}\) yields with \(\Omega := (0,1)\). If we set \(s := 1\), then \(\min \{s,2\} = s = 1\). Hence, Theorem 5.7 implies \(\nu = \alpha \ge \lfloor L/2\rfloor \). The same argument shows that if (5.15) holds with some exponent \(\nu \), then \(\nu \ge L-1\). For the converse results, it is clearly sufficient to establish (5.12) with \(\nu = \lfloor L/2\rfloor \) and (5.15) with \(\nu = L-1\). The proofs are in Appendix D.5. \(\square \)

Proof of Theorem 5.13

We only prove Part (1) for the spaces \(W_q^{\alpha }\). The proof for the \(N_q^{\alpha }\) spaces and that of Part (2) are similar.

Let \(s \in (0,\infty )\) be arbitrary, and choose \(r' \in {\mathbb {N}}\) such that \(r \le r'\) and \(s \in (0,r'+1)\). Combining Lemmas 5.18 and  5.19, we get . Since \(\Omega \) is bounded, Theorem 4.7 shows that . By combining the two embeddings, we get the claim. \(\square \)