1 Introduction

A basic question in a rigorous study of neural networks is a precise characterization of the class of functions representable by neural networks with a certain activation function. The question is of fundamental importance because neural network functions are a popular hypothesis class in machine learning and artificial intelligence. Every aspect of learning using neural networks benefits from a better understanding of the function class: from the statistical aspect of understanding the bias introduced in the learning procedure by using a particular neural hypothesis class, to the algorithmic question of training, i.e., finding the “best" function in the class that extrapolates the given sample of data points.

It may seem that the universal approximation theorems for neural networks render this question less relevant, especially since these results apply to a broad class of activation functions [2, 10, 23]. We wish to argue otherwise. Knowledge of the finer structure of the function class obtained by using a particular activation function can be exploited advantageously. For example, the choice of a certain activation function may lead to much smaller networks that achieve the same bias compared to the hypothesis class given by another activation function, even though the universal approximation theorems guarantee that asymptotically both activation functions achieve arbitrarily small bias. As another example, one can design targeted training algorithms for neural networks with a particular activation function if the structure of the function class is better understood, as opposed to using a generic algorithm like some variant of (stochastic) gradient descent. This has recently led to globally optimal empirical risk minimization algorithms for rectified linear units (ReLU) neural networks with specific architecture [3, 7, 11] that are very different in nature from conventional approaches like (stochastic) gradient descent; see also [13, 15,16,17,18,19, 26].

Recent results of this nature have been obtained with ReLU neural networks. Any neural network with ReLU activations clearly gives a piecewise linear function. Conversely, any piecewise linear function \({\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) can be exactly represented with at most \(\lceil \log _{2}(n+1) \rceil \) hidden layers [3], thus characterizing the function class representable using ReLU activations. However, it remains an open question whether \(\lceil \log _{2}(n+1) \rceil \) hidden layers are indeed needed. It is conceivable that all piecewise linear functions can be represented by 2 or 3 hidden layers. It is believed that this is not the case and that there is a strict hierarchy starting from 1 hidden layer, all the way to \(\lceil \log _{2}(n+1) \rceil \) hidden layers. It is known that there are functions representable using 2 hidden layers that cannot be represented with a single hidden layer, but even the 2 versus 3 hidden layer question remains open. Some partial progress on this question can be found in [21].

In this paper, we study the class of functions representable using threshold activations (also known as the Heaviside activation, unit step activation, and McCulloch-Pitts neurons). It is easy to see that any function represented by such a neural network is a piecewise constant function. We show that any piecewise constant function can be represented by such a neural network, and surprisingly – contrary to what is believed to be true for ReLU activations – there is always a neural network with at most 2 hidden layers that does the job. We also establish that there are functions that cannot be represented by a single hidden layer and thus one cannot do better than 2 hidden layers in general. Our constructions also show that the size of the neural network is polynomial in the number of “pieces” of the function, in the case of fixed input dimension, similar to recent results for ReLU activations which give a polynomial size network [21]. However, the degree of the polynomial in our results is linear in the dimension, compared to a quadratic dependence in the ReLU results. We also have tighter lower bounds on the size, compared to known results for ReLU networks. Moreover, we show that if we are allowed to ignore zero measure sets, the size bounds are only quadratic in the number of “pieces”, even for varying input dimension. Finally, we use these insights to design an algorithm that solves the empirical risk minimization (training) problem for these neural networks to global optimality and whose running time is polynomial in the size of the data sample, assuming the input dimension and the network architecture are fixed. To the best of our knowledge, this is the first globally optimal training algorithm for any family of neural networks that works for arbitrary architectures, has computational complexity polynomial in the number of data points, and does not involve a discretization of the parameter space or the input space.

Linear threshold activations only retain the sign from the input (after applying an affine linear function). We now show a way to reintegrate additional input information to enhance the expressivity of the linear threshold neural networks, while maintaining similar network sizes. For this purpose, we introduce a novel type of neural network named shortcut linear threshold neural networks. These networks are distinguished by a shortcut connection that performs a linear transform on the input, and takes the inner product with the output of the last hidden layer (see Fig. 1). This structure enables a shift from piecewise constant to piecewise linear functions, without necessitating a change in the network’s size. The novelty resides in the coupling of the network’s input with the output of the last hidden layer through an inner product operation. This could be a potentially new way to design and apply neural networks which does not seem to have been explored in the literature before, to the best of our knowledge.

Fig. 1 Illustration of a \(\mathbb R^{w_0} \rightarrow \mathbb R\) shortcut linear threshold neural network with k hidden layers, where \(x \in \mathbb R^{w_0}\) is the input to the network and \(x^{(k)} \in \mathbb R^{w_k}\) is the output of the k-th hidden layer

These shortcut linear threshold networks can represent piecewise linear functions that are possibly discontinuous. This is a strict superset of the family of functions representable by ReLU neural networks, which give continuous piecewise linear functions. Nevertheless, we show that the model complexity of this new network is not significantly larger than that of ReLU networks, and these new neural networks can be trained provably to global optimality using the ERM algorithm we develop for linear threshold activations. For a more comprehensive discussion on this topic, we direct readers to Sect. 6.2. These results provide some evidence that shortcut linear threshold networks could be a superior class compared to ReLU networks. While the results we present are all theoretical in nature, we believe they provide reasonable motivation to explore the potential of this new class of neural networks in applications. We leave this avenue open for future work.

2 Formal statement of results

We first introduce necessary definitions and notation, followed by a precise statement of our results.

2.1 Definitions and notations

2.1.1 Polyhedral theory

Definition 1

A polyhedral complex \({\mathcal {P}}\) is a collection of polyhedra having the following properties:

(A) For every \(P, P' \in {\mathcal {P}}\), \(P \cap P' \) is a common face of P and \(P'\).

(B) Every face of a polyhedron in \({\mathcal {P}}\) belongs to \({\mathcal {P}}\).

We denote by \(\dim (P)\) the dimension of a polyhedron and by \(\mathring{P}\) the relative interior of P. \(|{\mathcal {P}}|\) will denote the number of polyhedra in a polyhedral complex \({\mathcal {P}}\) and is called the size of \({\mathcal {P}}\). The set of full-dimensional polyhedra in \({\mathcal {P}}\) is denoted by \({{\,\textrm{full}\,}}({\mathcal {P}})\), and thus, \(|{{\,\textrm{full}\,}}({\mathcal {P}})|\) corresponds to the number of full-dimensional polyhedra in \({\mathcal {P}}\).

Definition 2

(Piecewise linear and piecewise constant functions) We say that a function \(f: \mathbb R^n \rightarrow \mathbb R\) is piecewise linear if there exists a finite polyhedral complex that covers \(\mathbb R^n\) such that f is affine linear in the relative interior of each polyhedron in the complex. If each of the affine functions is a constant function, i.e., f is constant in the relative interior of each polyhedron, then we call such a function piecewise constant. We use PWL\(_n\) and \(\text {PWC}_n\) to denote the class of all piecewise linear functions and piecewise constant functions (respectively) from \(\mathbb R^n\) to \(\mathbb R\); thus, \(\text {PWC}_n\subseteq \text {PWL}_n\). We will also use CPWL\(_n\) to denote the subclass of piecewise linear functions from \(\mathbb R^n\) to \(\mathbb R\) that are also continuous.

Definition 3

Let \(f \in PWL _n\). We say that \(x \in {\mathbb {R}}^{n}\) is a regular point for f if there exists \(\epsilon > 0\) such that f is affine linear on the ball centered at x with radius \(\epsilon \). A point that is not regular is called a breakpoint for f.

Note that there may be multiple polyhedral complexes that correspond to a given piecewise linear or constant function, with possibly different sizes. For example, the indicator function of the nonnegative orthant \(\mathbb R^n_+\) is a piecewise constant function but there are many different ways to break up the complement of the nonnegative orthant into polyhedral regions. This motivates the following definitions.

Definition 4

We say that a polyhedral complex \(\mathcal {P}\) is compatible with a piecewise linear or constant function f if f is linear or constant (respectively) in the relative interior of every polyhedron in \(\mathcal {P}\). Moreover, for any function \(f\in \text {PWL}_n\), \(\text {comp}(f)\) refers to the set of all polyhedral complexes compatible with f. We denote by \(\mathcal {P}^*_f\) a polyhedral complex with the smallest cardinality in \(\text {comp}(f)\), that is, \(\mathcal {P}^*_f \in \arg \min _{\mathcal {P} \in \text {comp}(f)}|\mathcal {P}|\). Consequently, the number of pieces of f, denoted by |f|, is defined as \(|{{\,\textrm{full}\,}}(\mathcal {P}^*_f)|\), which corresponds to the number of full-dimensional polyhedra in \(\mathcal {P}^*_f\).

2.1.2 Neural network terminology

Definition 5

(Neural networks (NN)) Fix an activation function \(\sigma : \mathbb R\rightarrow \mathbb R\). For any number of hidden layers \(k \in {\mathbb {N}}\), input and output dimensions \(w_0\), \(w_{k+1} \in {\mathbb {N}}\), a \({\mathbb {R}}^{w_0} \rightarrow {\mathbb {R}}^{w_{k+1}}\) neural network (NN) with \(\sigma \) activation is given by specifying a sequence of k natural numbers \(w_1, w_2, \cdots , w_k\) representing widths of the hidden layers and a set of \(k+1\) affine transformations \(T_i: {\mathbb {R}}^{w_{i-1}} \rightarrow {\mathbb {R}}^{w_{i}}\), \(i=1, \ldots , k+1\). Such a NN is called a \((k+1)\)-layer NN, and is said to have k hidden layers. The function \(f: {\mathbb {R}}^{w_0} \rightarrow {\mathbb {R}}^{w_{k+1}}\) computed or represented by this NN is:

$$\begin{aligned} f= T_{k+1}\circ \sigma \circ T_{k} \circ \cdots \circ T_2 \circ \sigma \circ T_1. \end{aligned}$$

If \(T_i\) is represented by the matrix \(A^i \in \mathbb R^{w_i\times w_{i-1}}\) and vector \(b^i \in \mathbb R^{w_i}\), i.e., \(T_i(x) = A^ix + b^i\), then the weights of neuron \(j \in \{1, \ldots , w_{i}\}\) in the i-th hidden layer are given by the entries of the j-th row of \(A^i\) and the bias of this neuron is given by the j-th coordinate of \(b^i\). The set of all weights and biases of all neurons is called the set of learning parameters of the NN, and the size of the NN, or the number of neurons in the NN, is \(w_1 + \cdots + w_k \).

Definition 6

The threshold activation function is a map from \(\mathbb R\) to \(\{0,1\}\) given by the indicator of the positive reals, i.e., \(x > 0\). By extending this to apply coordinatewise, we get a function \(\sigma : \mathbb R^d \rightarrow \{0,1\}^d\) for any \(d \ge 1\), i.e., \(\sigma (x)_i\) is 1 if and only if \(x_i > 0\) for \(i=1, \ldots , d\). For any subset \(X \subseteq \mathbb R^n\), \(\mathbb {1}_X\) will denote its indicator function, i.e., \(\mathbb {1}_X(y) = 1\) if \(y \in X\) and 0 otherwise.

Definition 7

(Threshold and ReLU activations) Linear threshold neural networks are those NNs where \(\sigma \) is the threshold activation function defined above. For natural numbers n, k and a tuple \(w = (w_1, \ldots , w_k)\), we use \(LT ^{w}_{n}(k)\) to denote the family of all possible linear threshold NNs with input dimension \(w_0 = n\), k hidden layers with widths \(w_1, \ldots , w_k\) and output dimension \(w_{k+1} = 1\). \(LT _{n}(k):= \bigcup _{w=(w_1, \ldots , w_k)} LT ^{w}_{n}(k)\) will denote the family of all linear threshold activation neural networks with k hidden layers.

ReLU neural networks are those NNs where \(\sigma (x) = \max \{0,x\}\), which is called the Rectified Linear Unit (ReLU) activation function. Analogous to the notation for linear threshold functions, we introduce ReLU\(_n^w(k)\) and ReLU\(_n(k)\) for ReLU neural networks.
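For concreteness, the following minimal NumPy sketch (our own illustration, not part of the formal development) evaluates the function \(f= T_{k+1}\circ \sigma \circ T_{k} \circ \cdots \circ T_2 \circ \sigma \circ T_1\) of Definition 5 with the threshold activation of Definition 6; the two-hidden-layer network and its weights are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def threshold(z):
    # Coordinatewise threshold activation (Definition 6): 1 if z_i > 0, else 0.
    return (z > 0).astype(float)

def lt_nn_forward(x, layers):
    """Evaluate f = T_{k+1} o sigma o T_k o ... o sigma o T_1 (Definition 5).

    `layers` is a list of (A, b) pairs for T_1, ..., T_{k+1}; the threshold
    activation is applied after every affine map except the last one.
    """
    z = np.asarray(x, dtype=float)
    for A, b in layers[:-1]:
        z = threshold(A @ z + b)
    A, b = layers[-1]
    return A @ z + b

# A small hypothetical network in LT_2(2) with widths w = (2, 1) and output dimension 1.
layers = [
    (np.eye(2), np.zeros(2)),                    # T_1: neuron i outputs 1_{x_i > 0}
    (np.array([[1.0, 1.0]]), np.array([-1.5])),  # T_2: fires iff both first-layer neurons fire
    (np.array([[1.0]]), np.array([0.0])),        # T_3: output layer
]

# This network computes the indicator of the open positive orthant of R^2.
print(lt_nn_forward([0.5, 0.7], layers))   # [1.]
print(lt_nn_forward([0.5, -0.1], layers))  # [0.]
```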

We now define these networks formally. Recall that in a linear threshold NN with \(k \in {\mathbb {N}}\) hidden layers and input \(x \in \mathbb R^{w_0}\), the output is a linear combination \(\langle b , x^{(k)} \rangle \) of the outputs of the neurons in the final hidden layer, where \(x^{(k)} = [\mathbb {1}_{X_1}(x),\dots ,\mathbb {1}_{X_{w_k}}(x)]^\top \) is the output of the k-th hidden layer and \(b \in \mathbb R^{w_k}\) is a constant vector. In a shortcut linear threshold NN, this constant vector is replaced by an affine linear transformation of the original input x, which serves as the vector of coefficients in the final linear combination.

Definition 8

(Shortcut linear threshold NNs) We define shortcut linear threshold NNs as a type of linear threshold NNs with a shortcut connection. More formally, consider a linear threshold NN with \(k \in {\mathbb {N}}\) hidden layers, an input vector \(x \in \mathbb R^{w_0}\), and an output of the k-th hidden layer, denoted as \(x^{(k)} \in \mathbb R^{w_k}\). For a shortcut linear threshold NN based on this linear threshold NN, the output is defined as \(\langle A^\top x+b , x^{(k)} \rangle \), where \(A \in \mathbb R^{w_0\times w_k}\) and \(b \in \mathbb R^{w_k}\). It is worth noting that choosing \(A=0\) recovers linear threshold NNs as defined in Definition 7.

Analogous to linear threshold and ReLU NNs, we use SLT\(_n^w(k)\) and SLT\(_n(k)\) to denote the class of functions represented by shortcut linear threshold NNs with \(w_0 = n\), \(w_{k+1}=1\), k hidden layers, and \(w=(w_1, \ldots , w_k)\) representing the widths of the hidden layers.
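To illustrate Definition 8, here is a minimal sketch (with hypothetical weights of our own choosing) that forms the shortcut output \(\langle A^\top x+b , x^{(k)} \rangle \) from the last-hidden-layer activations; setting A to the zero matrix recovers an ordinary linear threshold NN output.

```python
import numpy as np

def threshold(z):
    # Coordinatewise threshold activation (Definition 6).
    return (z > 0).astype(float)

def slt_nn_forward(x, hidden_layers, A, b):
    """Shortcut linear threshold NN output <A^T x + b, x^(k)> (Definition 8).

    `hidden_layers` lists the affine maps of the k hidden layers;
    A has shape (w_0, w_k) and b has shape (w_k,).
    """
    x = np.asarray(x, dtype=float)
    xk = x
    for W, c in hidden_layers:
        xk = threshold(W @ xk + c)       # output of the current hidden layer
    return float((A.T @ x + b) @ xk)     # inner product with the last hidden layer

# Hypothetical example with w_0 = 2 and a single hidden layer of width 2:
hidden = [(np.eye(2), np.zeros(2))]       # x^(1) = (1_{x_1 > 0}, 1_{x_2 > 0})
A = np.eye(2)                             # shortcut coefficients
b = np.zeros(2)

# The output is x_1*1_{x_1>0} + x_2*1_{x_2>0}, a piecewise linear function of x.
print(slt_nn_forward([0.5, -2.0], hidden, A, b))  # 0.5
```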

Note that the novelty in the above definition is in how the output is derived from the output of the final hidden layer and the original input. To the best of our knowledge, such a modification in the definition of neural representations of functions is new. We present some results that we feel give evidence for the usefulness of this definition.

To better present our results regarding the size bounds of neural networks that are capable of computing a specific function, we introduce the notation \({\mathcal {N}}(H,f)\) to represent the set of all neural networks in H that can compute a given function f. For instance, \({\mathcal {N}}(LT _{n}(k),f)\) represents the set of all linear threshold neural networks with an input dimension of n and k hidden layers that can compute the function f. We further use \({\mathcal {N}}_\mu (H,f)\) to denote the set of all neural networks in H that can compute a function g satisfying \(\mu (\{x:f(x) \ne g(x)\}) = 0\), where \(\mu (\cdot )\) denotes the Lebesgue measure. The size of a neural network N is denoted by |N|.

2.2 Our contributions

2.2.1 Results for linear threshold NNs

Any function expressed by a linear threshold neural network is a piecewise constant function (i.e. \(LT _{n}(k)\subseteq \text {PWC}_{n}\) for all natural numbers k), because a composition of piecewise constant functions is piecewise constant. In this work we show that linear threshold neural networks with 2 hidden layers can compute any piecewise constant function, i.e. \(\text {LT}_n(2) = \text {PWC}_{n} \). We also prove that this is optimal, in the sense that a single hidden layer does not suffice to represent all piecewise constant functions. More formally,

Theorem 1

\(LT _1(1) = {{\,\textrm{PWC}\,}}_1\), and for all natural numbers \(n, k \ge 2\),

$$\begin{aligned} LT _n(1) \subsetneq LT _n(2) = LT _{n}(k)= {{\,\textrm{PWC}\,}}_n. \end{aligned}$$

Equivalently, any piecewise constant function \(f: {\mathbb {R}}^{n}\rightarrow {\mathbb {R}}\) can be computed by a linear threshold NN with at most 2 hidden layers. Moreover,

$$\begin{aligned} {\min _{N \in {\mathcal {N}}(LT _n(2),f)} |N| \le 3 |{\mathcal {P}}^*_f|.} \end{aligned}$$

Next, we show that the bound on the size of the neural network in Theorem 1 is in a sense best possible, up to constant factors.

Proposition 1

There exists a family of functions \(f_n \in {{\,\textrm{PWC}\,}}_n\) such that

$$\begin{aligned} {\min _{N \in {\mathcal {N}}( LT _n(2), f_n)} |N| \ge |{\mathcal {P}}_{f_n}^*|,\ \forall n \in {\mathbb {N}}_+.} \end{aligned}$$

Notwithstanding Proposition 1, there still remains the possibility that the construction we give to prove the equalities in Theorem 1 is suboptimal in terms of the size of the networks it produces. Going beyond our specific construction, it is a priori possible that there are families of piecewise constant functions that can be represented with polynomial size circuits if one uses more than 2 hidden layers, while any linear threshold NN with 2 hidden layers has superpolynomial size. Such results have been established for NNs involving ReLU and other activation functions; see, e.g., [3, 9, 12, 30] for a representative sample.

Building upon our previous discussion and results that also consider polyhedra of lower dimensions, it is important to highlight that the main focus in practical applications is full-dimensional polyhedra. This arises from the fact that data points drawn from any distribution that is absolutely continuous with respect to the Lebesgue measure lie in the lower-dimensional polyhedra with probability zero. Furthermore, this approach provides an equitable basis for comparing against size results for ReLU NNs, which represent continuous functions whose values on lower-dimensional polyhedra are determined by the values on full-dimensional polyhedra. Thus, we formulate the ensuing theorem on the smallest linear threshold NN size expressing a given function, placing our focus solely on full-dimensional polyhedra.

Theorem 2

Let \(f\in \) \({{\,\textrm{PWC}\,}}_n\) with \(p \ge 2\) pieces. Then,

$$\begin{aligned} \min _{N \in {\mathcal {N}}_\mu (LT _n(2),f)} |N| \le \frac{p(p+1)}{2}+1. \end{aligned}$$

The applicability and significance of this theorem will become more evident in Sect. 2.2.3, specifically in the context of our proposed network structure. When expressing a given continuous piecewise linear function, this theorem provides an upper bound for the minimum neural network size that is significantly smaller compared to existing ReLU NN results. However, Theorem 2 ignores sets of measure zero to derive an upper bound quadratic in p. If we seek to precisely express a given function \(f \in {{\,\textrm{PWC}\,}}_n\) with p pieces, the bound increases to \({\mathcal {O}}(p^{n+1})\) as shown in the following proposition.

Proposition 2

Let \(f\in \) \({{\,\textrm{PWC}\,}}_n\) with \(p \ge n+1\) pieces. Then,

$$\begin{aligned} \min _{N \in {\mathcal {N}}(LT _n(2),f)} |N| \le 3 \left( \frac{ep}{n+1} \right) ^{n+1}. \end{aligned}$$

2.2.2 Algorithm for exact empirical risk minimization

In addition to our structural results, we present a new algorithm to perform exact empirical risk minimization (ERM) for linear threshold neural networks with fixed architecture, i.e., fixed k and \(w=(w_1, \ldots , w_k)\). Given D data points \((x_i, y_i) \in {\mathbb {R}}^{n} \times {\mathbb {R}}, \, i=1, \cdots , D\), the ERM problem with hypothesis class \(LT ^{w}_{n}(k)\) is

$$\begin{aligned} \min _{f\in LT ^{w}_{n}(k)} \; \frac{1}{D} \sum _{i=1}^{D} \ell (f(x_i), y_i) {,} \end{aligned}$$
(1)

where \(\ell \) is a convex loss function.

Theorem 3

For natural numbers n, k and a tuple \(w = (w_1, \ldots , w_k)\), there exists an algorithm that computes the global optimum of (1), up to \(\epsilon \)-accuracy, with running time \(O(D^{w_1 n}\cdot 2^{\sum _{i=1}^{k-1}w_i^{2} w_{i+1}}\cdot poly (D, w_1, \ldots , w_k, {\log \left( \frac{1}{\epsilon }\right) }))\). If \(\ell \) is the absolute value difference, then the global optimum can be computed exactly.

Thus, the algorithm is polynomial in the size of the data sample, if n, k, \(w_1\), \(\ldots \), \(w_k\) are considered fixed constants.

2.2.3 Shortcut linear threshold NNs

Theorem 4

ReLU\(_n{(\lceil \log _2(n+1) \rceil )}=CPWL _n\) \(\subsetneq \) PWL\(_n=SLT _n(2)\).

Moreover, for a given function \(f \in {{\,\textrm{PWL}\,}}_n\), we can derive the same upper bound on the size of shortcut linear threshold neural networks as in Proposition 2 and Theorem 2, using analogous arguments. Furthermore, for representing continuous piecewise linear functions, the continuity of the function allows us to modify our construction to eliminate the need to ignore sets of zero measure while asserting an upper bound quadratic in p. This particular insight contributes to the subsequent corollary of Theorem 2.

Corollary 1

Let \(f\in \) \({{\,\textrm{CPWL}\,}}_n\) with \(p \ge 2\) pieces. Then,

$$\begin{aligned} \min _{N \in {\mathcal {N}}(SLT _n(2),f)} |N| \le p^2+1. \end{aligned}$$

We can compare Corollary 1 with known bounds on the sizes of ReLU NNs. For a fixed CPWL\(_n\) function f with p pieces, Theorem 2.1 in [3] establishes that a ReLU NN needs no more than \(\lceil \log _2(n+1) \rceil \) hidden layers to compute f. In contrast, our SLT\(_n\) construction described in Theorem 4 requires only 2 hidden layers. Additionally, Theorem 4.4 in [21] states that a ReLU NN with \(\lceil \log _2(n+1) \rceil \) hidden layers computing f has width bounded by \({\mathcal {O}}(p^{2n^2+3n+1})\). In comparison, our Corollary 1 implies a significantly tighter bound on the size of the SLT\(_n\) network computing the same function f, namely \({\mathcal {O}}(p^{2})\).

It is also straightforward to modify the ERM algorithm presented in Theorem 3 to apply to shortcut linear threshold NNs with the same architecture. We direct readers to Sect. 5.2 for details.

The rest of the article is organized as follows. Section 3 collects some general structural results on neural networks that use threshold activations and introduces some concepts used throughout the paper. Section 4 gives the proofs of the results stated above for linear threshold NNs. Section 5 provides the proofs of the results involving shortcut linear threshold NNs discussed above. Section 6 closes the paper with a discussion of some open questions.

3 Preliminary results for threshold activations

In this section, we will collect some structural results for neural networks with linear threshold activations. These will be useful for the proofs of our main results.

Definition 9

Let \(m \ge 1\) be any natural number. We say that a collection \({\mathcal {A}}\) of subsets of \(\{1, \ldots , m\}\) is linearly separable if there exist \(\alpha _1, \ldots , \alpha _m, \beta \in \mathbb R\) such that any subset \(A \subseteq \{1, \ldots , m\}\) is in \({\mathcal {A}}\) if and only if \(\sum _{s\in A} \alpha _s + \beta > 0\). \({\mathcal {L}}_m\) refers to the set of all linearly separable collections of subsets of \(\{1, \ldots , m\}\).

Remark 1

We note that given a collection \({\mathcal {A}}\) of subsets of \(\{1, \cdots , m\}\) one can test if \({\mathcal {A}}\) is linearly separable by checking if the optimum value of the following linear program is strictly positive:

$$\begin{aligned} \max _{t\in {\mathbb {R}},\, \alpha \in {\mathbb {R}}^{m},\, \beta \in {\mathbb {R}}} \;\;&t \\ \text {s.t.} \;\;&\sum _{s\in A} \alpha _s + \beta \ge t \quad \forall A \in {\mathcal {A}}, \qquad \sum _{s\in A} \alpha _s + \beta \le 0 \quad \forall A \notin {\mathcal {A}}. \end{aligned}$$

In Algorithm 1 below, we will enumerate through all possible collections in \({\mathcal {L}}_m\) (for different values of m). We assume this has been done a priori using the linear programs above and this enumeration can be done in time \(|{\mathcal {L}}_m|\) during the execution of Algorithm 1.

Example 1

In \({\mathbb {R}}^{2}\), \({\mathcal {A}}=\{\emptyset , \{1\}, \{2\}\}\) is linearly separable, but \(\{\emptyset , \{1,2\}\}\) is not linearly separable because the set of inequalities \(\beta > 0\), \(\alpha _1 + \alpha _2 + \beta > 0\), \(\alpha _1 + \beta \le 0 \) and \(\alpha _2 + \beta \le 0\) have no solution. Two examples of linearly separable collections in \({\mathbb {R}}^{3}\) are given in Fig. 2.

Fig. 2 Two linearly separable collections of subsets of \(\{1, 2, 3\}\) in \(\mathbb {R}^{3}\). The subsets of \(\{1, 2, 3\}\) are represented by the vertices of \(\{0,1\}^3\). The blue hyperplanes represent a possible separation of the corresponding vertices, giving two different linearly separable collections
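As a concrete version of the test in Remark 1, the sketch below sets up the linear program with scipy and reproduces Example 1 (subsets are 0-indexed in the code). Since the LP as stated is unbounded whenever the collection is separable, we add the harmless normalization \(t \le 1\); the helper name is ours, not from the paper.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(collection, m):
    """Test whether a collection of subsets of {0, ..., m-1} is linearly
    separable (Definition 9) by solving the LP of Remark 1.
    `collection` is a set of frozensets."""
    all_subsets = [frozenset(c) for r in range(m + 1)
                   for c in combinations(range(m), r)]
    # Variables: [t, alpha_1, ..., alpha_m, beta]; maximize t <=> minimize -t.
    cost = np.zeros(m + 2)
    cost[0] = -1.0
    A_ub, b_ub = [], []
    for A in all_subsets:
        row = np.zeros(m + 2)
        for s in A:
            row[1 + s] = 1.0
        row[m + 1] = 1.0                      # coefficient of beta
        if A in collection:                    # sum_{s in A} alpha_s + beta >= t
            row = -row
            row[0] = 1.0
            A_ub.append(row); b_ub.append(0.0)
        else:                                  # sum_{s in A} alpha_s + beta <= 0
            A_ub.append(row); b_ub.append(0.0)
    bounds = [(None, 1.0)] + [(None, None)] * (m + 1)  # cap t to keep the LP bounded
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.status == 0 and -res.fun > 1e-9

# Example 1 (indices shifted down by one): {emptyset, {1}, {2}} vs. {emptyset, {1, 2}}.
print(is_linearly_separable({frozenset(), frozenset({0}), frozenset({1})}, 2))  # True
print(is_linearly_separable({frozenset(), frozenset({0, 1})}, 2))               # False
```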

Lemma 1

Let \(k\ge 2\) and \(w = (w_1, \ldots , w_k)\), and consider a NN of LT\(_n^{w}(k)\) or \({{\,\textrm{SLT}\,}}_n^w(k)\). The output of each neuron in this neural network is the indicator function of a specific subset of \(\mathbb R^n\). Suppose we fix the weights of the neural network up to the \((k-1)\)-st hidden layer. This fixes the sets \(Y_1, \ldots , Y_{w_{k-1}} \subseteq \mathbb R^n\) computed by the \(w_{k-1}\) neurons in this layer.

Then the output of a neuron in the k-th layer is the indicator function of \(X\subseteq \mathbb R^n\) (by adjusting the weights and bias of this neuron) if and only if there exists a linearly separable collection \({\mathcal {A}}\) of subsets of \(\{1, \ldots , w_{k-1}\}\) such that:

$$\begin{aligned} X = \bigcup _{A \in {\mathcal {A}}} \left[ \, \left( \, \bigcap _{s \in A} Y_s \, \right) \, \cap \, \left( \, \bigcap _{s \notin A} Y_s^{c} \, \right) \, \right] {.} \end{aligned}$$

Proof

Let \(\alpha \in {\mathbb {R}}^{w_{k-1}}\), \(\beta \in {\mathbb {R}}\) be the weights and bias of the neuron in the k-th layer. By definition, the set represented by this neuron is

$$\begin{aligned} S_{\alpha , \beta } = \{x \in {\mathbb {R}}^{n}: \alpha _1 \mathbb {1}_{ Y_1}(x) + \cdots + \alpha _{w_{k-1}}\mathbb {1}_{ Y_{w_{k-1}}}(x) + \beta > 0\}{.} \end{aligned}$$

We can suppose without loss of generality that \(S_{\alpha , \beta }\) is nonempty, otherwise the output of the neuron is always 0 and the property is true by taking \({\mathcal {A}}=\emptyset \). Therefore, we define the collection \({\mathcal {A}} := \{ A \subseteq \{1, \cdots , w_{k-1}\}: \sum _{i \in A} \alpha _i + \beta > 0\} \). Since \(S_{\alpha , \beta }\) is nonempty, \({\mathcal {A}}\) is nonempty, and by definition, is a linearly separable collection. Now consider the set:

$$\begin{aligned} O:=\bigcup _{A \in {\mathcal {A}}} \left[ \, \left( \, \bigcap _{s \in A} Y_s \, \right) \, \cap \, \left( \, \bigcap _{s \notin A} Y_s^{c} \, \right) \, \right] {.} \end{aligned}$$

We now prove that \(O=S_{\alpha , \beta }\). We first show \(O \subseteq S_{\alpha , \beta }\). If \(O=\emptyset \), this is trivial. Otherwise, let \(x \in O\). Then there exists \(A \in {\mathcal {A}}, \; A \subseteq \{1, \cdots , w_{k-1}\}\) such that \( x \in (\, \bigcap _{s \in A} Y_s \,) \, \cap \, (\, \bigcap _{s \notin A} Y_s^{c}\, ) \), and by definition of \({\mathcal {A}}\)

$$\begin{aligned} \sum _{s \in A} \alpha _s \mathbb {1}_{ Y_s}(x) + \beta > 0{.} \end{aligned}$$

The previous inequality implies that:

$$\begin{aligned} \sum _{i=1}^{w_{k-1}} \alpha _i \mathbb {1}_{ Y_i}(x) + \beta&= \sum _{s \in A} \alpha _s \mathbb {1}_{ Y_s}(x) + \sum _{s \notin A} \alpha _s \mathbb {1}_{ Y_s}(x) + \beta \\&= \sum _{s \in A} \alpha _s \mathbb {1}_{ Y_s}(x) + \beta > 0{.} \end{aligned}$$

The second equality holds because \(x \in Y_s \) if and only if \( s \in A\), so the terms with \(s \notin A\) vanish. This means that \(x \in S_{\alpha , \beta }\), hence \(O \subseteq S_{\alpha , \beta }\).

To show the reverse inclusion, let \(x \in S_{\alpha , \beta }\). Then \( \alpha _1 \mathbb {1}_{ Y_1}(x) + \cdots + \alpha _{w_{k-1}}\mathbb {1}_{ Y_{w_{k-1}}}(x) + \beta > 0\). Let \(A := \{s: x \in Y_s\}\). Then:

$$\begin{aligned} \sum _{s \in A } \alpha _s + \beta = \sum _{i=1}^{w_{k-1}} \alpha _i \mathbb {1}_{ Y_i}(x) + \beta > 0 {,} \end{aligned}$$

hence \(A\in {\mathcal {A}}\), and by construction \(x \in (\, \bigcap _{s \in A} Y_s ) \, \cap \, (\, \bigcap _{s \notin A} Y_s^{c} \,) \), hence \(x\in O\), and \(S_{\alpha , \beta } \subseteq O\).

Conversely, let \({\mathcal {A}}\) be a linearly separable collection of subsets of \(\{1, \cdots , w_{k-1}\}\). By definition there exist \(\alpha \in {\mathbb {R}}^{w_{k-1}}\) and \(\beta \in {\mathbb {R}}\) such that \(A \in {\mathcal {A}} \) if and only if \( \sum _{s \in A } \alpha _s + \beta > 0 \). \(\alpha \) and \(\beta \) can be chosen as the weights and bias of the neuron in the k-th hidden layer and its output is the function \(\mathbb {1}_{\{x\in \mathbb R^n\;:\;\sum _{i=1}^{w_{k-1}}\alpha _i \mathbb {1}_{ Y_i}(x) + \beta > 0\}}\). \(\square \)

The following corollary of Lemma 1 indicates that the set of breakpoints cannot grow as we proceed through the hidden layers.

Corollary 2

Let \(k\ge 2\) and \(w = (w_1, \ldots , w_k)\), and consider a NN of \(LT ^{w}_{n}(k)\) or \({{\,\textrm{SLT}\,}}_n^w(k)\). Then the breakpoints of the output of any neuron in the j-th layer are included in the union of the breakpoints of the neurons of the \( (j-1)\)-st layer, where \(2 \le j \le k\).

Proof

Let \(\mathbb {1}_{Y_1},\dots ,\mathbb {1}_{Y_{w_{j-1}}}\) be the functions computed by neurons in the \((j-1)\)-st layer, where \(2 \le j \le k\). Consider any neuron in the j-th layer, suppose it computes \(\mathbb {1}_X\). By Lemma 1, there exists a linearly separable collection \({\mathcal {A}}\) of subsets of \(\{1,\ldots ,w_{j-1}\}\) such that

$$\begin{aligned} X = \bigcup _{A \in {\mathcal {A}}}\left[ \left( \bigcap _{s \in A}Y_s \right) \cap \left( \bigcap _{s \notin A}Y_s^c \right) \right] , \end{aligned}$$

then we have

$$\begin{aligned} \partial X&\subseteq \bigcup _{A \in {\mathcal {A}}} \partial \left[ \left( \bigcap _{s \in A}Y_s \right) \cap \left( \bigcap _{s \notin A}Y_s^c \right) \right] \\&\subseteq \bigcup _{A \in {\mathcal {A}}} \left[ \partial \left( \bigcap _{s \in A}Y_s \right) \cup \partial \left( \bigcap _{s \notin A}Y_s^c \right) \right] \\&\subseteq \bigcup _{A \in {\mathcal {A}}} \left[ \left( \bigcup _{s \in A}\partial Y_s \right) \cup \left( \bigcup _{s \notin A} \partial Y_s^c \right) \right] \\&= \bigcup _{A \in {\mathcal {A}}} \left[ \left( \bigcup _{s \in A}\partial Y_s \right) \cup \left( \bigcup _{s \notin A} \partial Y_s \right) \right] \\&= \bigcup _{i=1}^{w_{j-1}} \partial Y_i. \end{aligned}$$

For any nonempty set \(A \subseteq \mathbb R^n\), the set of breakpoints of \(\mathbb {1}_A\) is \(\partial A\), so the above inclusion ends the proof. \(\square \)

Definition 10

For any single neuron with a linear threshold activation with d inputs, the output is the indicator of an open halfspace, i.e., \(\mathbb {1}_{\{x\in \mathbb R^d : \langle a, x \rangle + b > 0\}}\) for some \(a \in \mathbb R^d\) and \(b\in \mathbb R\). We say that \(\{x\in \mathbb R^d : \langle a, x \rangle + b = 0\}\) is the hyperplane associated with this neuron.

4 Main proofs (Linear Threshold DNNs)

Proof of Theorem 1

Proposition 3

\({LT _1(1)} = PWC _1\), i.e., linear threshold neural networks with a single hidden layer can compute any piecewise constant function \({\mathbb {R}}\rightarrow {\mathbb {R}}\). Furthermore, for any \(f \in PWC _1 \),

$$\begin{aligned} {\min _{N \in {\mathcal {N}}(LT _1(1),f)}|N| \le 3|{\mathcal {P}}_f^*|+1.} \end{aligned}$$

Proof

Let \(f: {\mathbb {R}}\rightarrow {\mathbb {R}}\) be a piecewise constant function. By definition, the union of the polyhedra in \({\mathcal {P}}_f^*\) is \(\mathbb R\) and f is constant on the relative interior of each of these polyhedra. In \({\mathbb {R}}\), nonempty polyhedra are either single points, or intervals of the form [a, b], \(]-\infty , a]\), \([a, +\infty [\) with \(a \le b \in {\mathbb {R}}\), or \({\mathbb {R}}\) itself. We first show that we can compute the indicator function of the interior of each of these intervals with a small number of neurons. The interior of \([a, +\infty [\), \(]-\infty , b]\) or \({\mathbb {R}}\) can be computed by a single neuron (e.g., \(x \mapsto \mathbb {1}_{\{x - a> 0\}}\) for \([a, +\infty [\), and \(x \mapsto \mathbb {1}_{\{0\cdot x + 1 > 0\}}\) for \({\mathbb {R}}\)). The remaining cases (singletons and polyhedra of the form [a, b]) require a more elaborate construction. To compute the function \(\mathbb {1}_{\{x \in ]a,b[ \}} \), it suffices to implement a Dirac function, since \(\mathbb {1}_{\{x \in ]a,b[\}} = \mathbb {1}_{\{ x< b \}} - \mathbb {1}_{\{ x < a\}} - \delta _a(x) \), where \(\delta _{a}\) is the Dirac function at \(a \in {\mathbb {R}}\), i.e., \(\delta _a: {\mathbb {R}} \rightarrow {\mathbb {R}}, \; x \mapsto \mathbb {1}_{\{ x = a\}}\). \(\delta _a\) can be computed by a linear combination of three neurons, since \(g_a: {\mathbb {R}} \rightarrow {\mathbb {R}}, \; x \mapsto \mathbb {1}_{ {\mathbb {R}} } - (\mathbb {1}_{\{ x < a\}} + \mathbb {1}_{\{x> a \} }) \) is equal to \(\delta _a\). Using a linear combination of these indicator functions (one for the relative interior of each polyhedron in \({\mathcal {P}}_f^*\)), we can compute f exactly. To see that \(3|{\mathcal {P}}_f^*|+1\) neurons suffice, note that \(\mathbb {1}_{{\mathbb {R}}}\) is computed with one shared neuron, and our construction then uses at most 3 additional neurons per polyhedron. \(\square \)
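The construction above is easy to check numerically. The sketch below (with an arbitrarily chosen interval) builds \(\mathbb {1}_{]a,b[}\) from single-hidden-layer threshold neurons via the decomposition \(\mathbb {1}_{]a,b[} = \mathbb {1}_{\{x<b\}} - \mathbb {1}_{\{x<a\}} - \delta _a\) with \(\delta _a = \mathbb {1}_{\mathbb R} - \mathbb {1}_{\{x<a\}} - \mathbb {1}_{\{x>a\}}\) used in the proof.

```python
import numpy as np

def neuron(x, w, c):
    # A single threshold neuron on R: the indicator of {w*x + c > 0}.
    return (w * x + c > 0).astype(float)

a, b = 1.0, 3.0
x = np.concatenate([np.linspace(-1.0, 5.0, 601), [a, b]])  # grid including the breakpoints

ind_lt_b = neuron(x, -1.0, b)    # 1_{x < b}
ind_lt_a = neuron(x, -1.0, a)    # 1_{x < a}
ind_gt_a = neuron(x, 1.0, -a)    # 1_{x > a}
ind_R    = neuron(x, 0.0, 1.0)   # 1_R (the constant function 1)

dirac_a = ind_R - ind_lt_a - ind_gt_a       # delta_a
f       = ind_lt_b - ind_lt_a - dirac_a     # should equal 1_{]a, b[}

assert np.array_equal(f, ((x > a) & (x < b)).astype(float))
```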

We next show that starting with two dimensions, linear threshold NNs with a single hidden layer cannot compute every possible piecewise constant function.

Proposition 4

Let \(C_2:=\{(x_1, x_2)\in {\mathbb {R}}^{2} \; | \; 0 \le x_1, x_2 \le 1\}\). Then \(\mathbb {1}_{C_2}\) cannot be represented by any linear threshold neural network with one hidden layer.

Proof

Consider any piecewise constant function on \(\mathbb R^2\) represented by a single hidden layer neural network. Let \(g:= x \mapsto \sum _{i=1}^{m}\alpha _i \mathbb {1}_{\{x \in \mathbb R^2\;:\; \langle a_i, x \rangle + b_i > 0\}}\), with \(\alpha _1, \cdots , \alpha _m \in {\mathbb {R}}\), \(a_1, \ldots , a_m \in \mathbb R^2\) and \(b_1, \ldots , b_m \in \mathbb R\), be a single hidden layer neural network of smallest size representing this function. This implies that if \(i\ne j\) then the halfspace \(\{x \in \mathbb R^2: \langle a_i, x\rangle + b_i > 0\}\) is different from the halfspace \(\{x \in \mathbb R^2: \langle a_j, x\rangle + b_j > 0\}\); otherwise, we could replace the two corresponding neurons with a single neuron with output weight \(\alpha _i + \alpha _j\) and obtain a smaller neural network. This implies that the set of breakpoints of g is a union of lines in \(\mathbb R^2\). However, the set of breakpoints of \(\mathbb {1}_{C_2}\) is formed by the sides of the unit square, which is a union of line segments of finite length. This shows that \(\mathbb {1}_{C_2}\) cannot be represented by a single hidden layer linear threshold NN. \(\square \)

In the following two lemmas, we assemble the last pieces needed for the proof of Theorem 1, which states that 2 hidden layers suffice to represent any piecewise constant function in \(PWC _n\).

Lemma 2

Let P be a polyhedron in \({\mathbb {R}}^{n}\) given by the intersection of m halfspaces. Then, \(\mathbb {1}_P \in LT _n(2)\) and

$$\begin{aligned} {\min _{N \in {\mathcal {N}}(LT _n(2),\mathbb {1}_P)} |N| \le m+1.} \end{aligned}$$

Proof

Let P be a polyhedron, i.e. \(P= \{x \in {\mathbb {R}}^{n} \;| \; A x \le b \}\) with \(A =(a_1, \cdots , a_m)^{\top }\in {\mathbb {R}}^{m\times n}\) and \(b=(b_1, \cdots , b_m) \in {\mathbb {R}}^{m}\). Let us consider the m neurons \( (\phi _i: x \mapsto \mathbb {1}_{\{x \in \mathbb R^n:\;\langle a_i, x \rangle > b_i\}} )_{1\le i \le m}\), and \(\phi : x \mapsto \sum _{i} \phi _i (x)\). Then for all \(x \in {\mathbb {R}}^{n}\), \(\phi (x) < 1\) if and only if \(x \in P\). Now, defining \(\psi : y \mapsto \mathbb {1}_{\{y \in \mathbb R:\; y < 1\}} \) yields \(\psi \circ \phi = \mathbb {1}_{P}\). \(\psi \) can obviously be computed with a neuron. Therefore, one can compute \(\mathbb {1}_P\) with m neurons in the first hidden layer and one neuron in the second, which provides a construction of \(\mathbb {1}_P\) with \(m+1\) neurons in total, and proves the result. \(\square \)
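A quick numerical check of this construction, applied to the unit square \(C_2\) of Proposition 4 written as \(Ax \le b\) with four inequalities: the first hidden layer counts the violated inequalities, and a single neuron in the second hidden layer tests whether this count is less than 1.

```python
import numpy as np

# The unit square {x : Ax <= b}: -x_1 <= 0, x_1 <= 1, -x_2 <= 0, x_2 <= 1.
A = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]])
b = np.array([0.0, 1.0, 0.0, 1.0])

def indicator_P(x):
    # First hidden layer: phi_i(x) = 1_{<a_i, x> > b_i}, one neuron per inequality.
    phi = (A @ x > b).astype(float)
    # Second hidden layer: psi(y) = 1_{y < 1} = 1_{1 - y > 0} applied to y = sum_i phi_i(x).
    return float(1.0 - phi.sum() > 0)

print(indicator_P(np.array([0.5, 0.5])))  # 1.0 (interior point)
print(indicator_P(np.array([1.0, 0.3])))  # 1.0 (boundary point, still in P)
print(indicator_P(np.array([1.2, 0.3])))  # 0.0 (outside P)
```

This sketch uses \(m+1 = 5\) neurons in total, matching the bound in Lemma 2.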

Lemma 3

Let P be a polyhedron in \({\mathbb {R}}^{n}\). Then \(\mathbb {1}_{\mathring{P}}\) can be computed with a two hidden layer linear threshold NN, using the indicator of P and the indicators of its faces.

Proof

Let P be a polyhedron. First, we always have \(\mathbb {1}_{\mathring{P}} = \mathbb {1}_{P} - \mathbb {1}_{ \text {Union of facets of P}} \). Therefore it is sufficient to prove that we can implement \(\mathbb {1}_{\text {Union of facets of P}}\) for any P. Using the inclusion exclusion principle on indicator functions, suppose that the facets of P are \(f_1, \cdots , f_l\), then:

$$\begin{aligned} \mathbb {1}_{\bigcup _{j=1}^{l}f_j} = \sum _{j=1}^{l} (-1)^{j+1} \sum _{1 \le i_1< \cdots < i_j \le l } \mathbb {1}_{f_{i_1} \cap \cdots \cap f_{i_j} }{.} \end{aligned}$$

It should be noted that for any \(j \in \{1, \cdots , l\}, \; F = f_{i_1} \cap \cdots \cap f_{i_j}\) is either empty or a face of P, hence a polyhedron of dimension at most \(\dim (P)-1 \). Therefore, using Lemma 2, we can implement \(\mathbb {1}_F\) with a two-hidden-layer neural network with at most \(m+1\) neurons, where m is the number of halfspaces in an inequality description of F. If s is the number of faces of P, then there are at most s polyhedra to compute. \(\square \)

Proof of Theorem 1

By definition, in order to represent \(f\in \text {PWC}_n\) it suffices to compute the indicator function of the relative interior of each polyhedron in a smallest compatible polyhedral complex \({\mathcal {P}}_f^*\). This can be achieved with just two hidden layers using Lemma 3. This establishes the equalities in the statement of the theorem. The strict containment \(\text {LT}_n(1) \subsetneq \text {LT}_n(2)\) is given by Proposition 4.

Let m be the total number of halfspaces used in an inequality description of all the polyhedra in the polyhedral complex \({\mathcal {P}}_f^*\). Since all faces are included in the polyhedral complex, there exists an inequality description with \(m \le 2|{\mathcal {P}}_f^*|\). The factor 2 appears because for each facet of a full-dimensional polyhedron in \({\mathcal {P}}_f^*\), one may need both directions of the inequality that describes this facet. Then the construction in the proofs of Lemmas 2 and 3 shows that one needs at most \(m \le 2|{\mathcal {P}}_f^*|\) neurons in the first hidden layer and at most \(|{\mathcal {P}}_f^*|\) neurons in the second hidden layer.\(\square \)

Proof of Proposition 1

Proof of Proposition 1

Consider the sets \(P_1 := \{x \in {\mathbb {R}}^{n}: x_1 \le 0 \}\), \(P_i := \{x \in {\mathbb {R}}^{n}: (i-2) < x_1 \le i-1\}\) for \(i \in \{2, \cdots , m-1\}\), and \(P_m := \{x \in {\mathbb {R}}^{n}: x_1 > m-2 \}\). Note that \(\bigcup _{i=1}^{m}P_i={\mathbb {R}}^{n}\). Let \(f \in \text {PWC}_{n}\) be such that \(f(x) = i\) for all \(x \in P_i\), where \(i \in \{1,\dots ,m\}\). It is easy to see that \({\mathcal {P}}_f^* = \{P_1,\dots ,P_m\}\), and that the set of breakpoints of f is a union of \(m-1\) hyperplanes with empty pairwise intersections.

By Theorem 1, \(f \in LT _n(2)\), and according to Corollary 2, any neural network \(N \in {\mathcal {N}}(LT _n(2),f)\) must have these hyperplanes associated with neurons in the first hidden layer, necessitating at least \(m-1\) neurons in this layer. Taking into account the neurons in the subsequent layer, we establish that \(|N|\ge m = |{\mathcal {P}}_f^*|\). \(\square \)

The proofs of Proposition 2 and Theorem 2 rely on some facts from polyhedral geometry, which are incorporated into the subsequent lemma.

Lemma 4

Let \({\mathcal {P}}\) be a finite polyhedral complex in \(\mathbb R^n\) with \({{\,\textrm{full}\,}}({\mathcal {P}}) = \{P_1,\dots ,P_m\}\), where \(m \in {\mathbb N}_+\). If the union of all polyhedra in \({\mathcal {P}}\) equals \(\mathbb R^n\), then the following statements are all true.

1. \(\bigcup _{i=1}^m P_i = \mathbb R^n\).

2. For any k dimensional polyhedron \(F \in {\mathcal {P}}\) with \(0 \le k \le n\), there exist \(n-k+1\) distinct full-dimensional polyhedra in the complex whose common intersection equals F.

3. \(m \le \vert {\mathcal {P}} \vert < \left( \frac{e m}{n+1} \right) ^{n+1},\) where \(e\approx 2.71828\) is Euler's number.

Proof

Suppose \(\bigcup _{i=1}^m P_i \ne \mathbb R^n\), and consider \(x \in \mathbb R^n\backslash \left( \bigcup _{i=1}^m P_i \right) \), then there exists some \(\varepsilon > 0\) such that \({\mathcal {B}}(x,\varepsilon ) \subseteq \mathbb R^n\backslash \left( \bigcup _{i=1}^m P_i \right) \) since \(\bigcup _{i=1}^m P_i\) is closed as it is a finite union of polyhedra. This leads to a contradiction since \({\mathcal {P}}\) covers \(\mathbb R^n\) but a finite union of polyhedra with dimension at most \(n-1\) cannot cover \({\mathcal {B}}(x,\varepsilon )\). This proves part 1.

For part 2., we first observe that \(F \in {\mathcal {P}}\) if and only if F is a face of some full-dimensional polyhedron in \({\mathcal {P}}\). One direction follows from the definition of a polyhedral complex. For the other direction, consider any \(F \in {\mathcal {P}}\). Using part 1.,

$$\begin{aligned} F = \mathbb R^n \cap F = \left( \bigcup _{i=1}^m P_i\right) \cap F = \bigcup _{i=1}^m \left( P_i \cap F\right) . \end{aligned}$$

By definition of a polyhedral complex, \(P_i \cap F\) is a face of F, \(\forall i \in \{1, \ldots , m\}\). The above equality thus implies that F is a finite union of some faces of F. This implies that one of these faces must be F itself, i.e., there exists \(i \in \{1, \ldots , m\}\) such that \(P_i \cap F = F\). Also, by definition \(F = P_i \cap F\) is a face of \(P_i\), which proves that F is a face of some full-dimensional polyhedron in \({\mathcal {P}}\).

Next consider any k dimensional polyhedron \(F \in {\mathcal {P}}\). By the argument above, there exists \(i \in \{1, \ldots , m\}\) such that F is a face of \(P_i\). When \(k=n\), the result is trivial with \(F = P_i\). We now show the result for \(k=n-1\). Let \(\langle a, x \rangle \le b\) be a facet defining inequality for \(P_i\) corresponding to F. Let \(x_0\) be a point in the relative interior of F. Consider the sequence \(x_0 + \frac{1}{t}a\), \(t \in {\mathbb N}_+\), and observe that no point in this sequence is contained in \(P_i\). Since this is an infinite sequence, there must exist \(j\in \{1, \ldots , m\}\) with \(j\ne i\) such that \(P_j\) contains infinitely many points from this sequence. Taking limits over this subsequence and using the fact that \(P_j\) is closed, we obtain that \(x_0 \in P_j\). Thus, \(x_0 \in P_i \cap P_j\) and \(P_i \cap P_j\) is a common face of \(P_i\) and \(P_j\). However, since \(x_0\) is in the relative interior of the facet F, this common face must be F. Thus we are done for the case \(k=n-1\). For any \(k \le n-2\), the face F must be the intersection of \(n-k\) distinct facets of \(P_i\). By the argument above, each of these \(n-k\) facets is given by the intersection of \(P_i\) with another full-dimensional polyhedron in the complex. Since these are distinct facets, the corresponding full-dimensional polyhedra must be distinct. Including \(P_i\), the intersection of these \(n-k+1\) polyhedra equals the intersection of these \(n-k\) facets of \(P_i\), which is precisely F. This finishes the proof of part 2.

The first inequality of 3. follows from the fact that \(P_1, \ldots , P_m \in {\mathcal {P}}\). From 2., every polyhedron in the complex of dimension k must be the intersection of \(n-k+1\) distinct full-dimensional polyhedra. Therefore, \({m \atopwithdelims ()n-k+1}\) gives an upper bound for the number of all the k dimensional polyhedra in \({\mathcal {P}}\). Now we can give an upper bound for \(\vert {\mathcal {P}} \vert \):

$$\begin{aligned} \vert {\mathcal {P}} \vert \le \sum _{k=0}^n { m \atopwithdelims ()n-k+1} < \left( \frac{e m}{n+1} \right) ^{n+1}, \end{aligned}$$

where the second inequality comes from using Stirling’s approximation. \(\square \)
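For completeness, one standard way to obtain the second inequality (assuming \(m \ge n+1\), as in the setting of Proposition 2) is via the bound \(\sum _{j=0}^{d} {m \atopwithdelims ()j} \le \left( \frac{em}{d} \right) ^{d}\) for \(1 \le d \le m\), which follows from

$$\begin{aligned} \sum _{j=0}^{d} {m \atopwithdelims ()j} \le \left( \frac{m}{d} \right) ^{d} \sum _{j=0}^{d} {m \atopwithdelims ()j} \left( \frac{d}{m} \right) ^{j} \le \left( \frac{m}{d} \right) ^{d} \left( 1+\frac{d}{m} \right) ^{m} \le \left( \frac{em}{d} \right) ^{d}. \end{aligned}$$

Applying this with \(d = n+1\) gives \(\sum _{k=0}^n { m \atopwithdelims ()n-k+1} = \sum _{j=1}^{n+1} {m \atopwithdelims ()j} < \sum _{j=0}^{n+1} {m \atopwithdelims ()j} \le \left( \frac{em}{n+1} \right) ^{n+1}\).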

Proofs of Proposition 2 and Theorem 2

Proof of Proposition 2

By definition, the smallest polyhedral complex compatible with f, \({\mathcal {P}}_f^*\), has p full-dimensional polyhedra. Then the construction in Theorem 1 implies f can be computed by a linear threshold NN with size at most \(3 \vert {\mathcal {P}}_f^* \vert \le 3\left( \frac{e p}{n+1} \right) ^{n+1}\), where the inequality holds by part 3 of Lemma 4. \(\square \)

Proof of Theorem 2

Let \({{\,\textrm{full}\,}}({\mathcal {P}}_f^*) = \{P_1,\dots ,P_p\}\). Define \(\alpha _i\) as the value of f within the interior of \(P_i\) for \(i \in \{1,\dots ,p\}\). Part 2 of Lemma 4 implies that there are at most \({p \atopwithdelims ()2}\) polyhedra of dimension \(n-1\) in \({\mathcal {P}}_f^*\). Consequently, the first hidden layer requires no more than \({p \atopwithdelims ()2}\) neurons, one associated with each of the corresponding hyperplanes, along with an additional neuron computing \(\mathbb {1}_{\mathbb R^n}\) that is used to reverse the halfspaces. In the second hidden layer, we employ p neurons to compute the functions \(\mathbb {1}_{{\tilde{P}}_1},\dots ,\mathbb {1}_{{\tilde{P}}_p}\), satisfying \(\mu (\{x:\mathbb {1}_{{\tilde{P}}_i}(x) \ne \mathbb {1}_{P_i}(x)\}) = 0\) for \(i \in \{1, \ldots , p\}\), and the corresponding weights after the second hidden layer are set to \(\alpha _1,\dots ,\alpha _p\). This construction yields a linear threshold NN of size no more than \({p \atopwithdelims ()2}+1+p=\frac{p(p+1)}{2}+1\), computing a function that agrees with f on the interior of each \(P_i\), and thus equals f almost everywhere. \(\square \)

Proof of Theorem 3

Consider a neural network with k hidden layers and widths \(w=(w_1, \ldots , w_k)\) that implements a function in \(LT ^{w}_{n}(k)\). The output of any neuron on these data points is in \(\{0,1\}\) and thus each neuron can be thought of as picking out a subset of the set \(X:= \{x_1, \ldots , x_D\}\). Lemma 1 provides a way to enumerate these subsets of X in a systematic manner.

Definition 11

For any finite subset \(F \subseteq \mathbb R^n\), a subset \(F'\) of F is said to be linearly separable if there exists \(a\in \mathbb R^n\), \(b\in \mathbb R\) such that \(F' = \{x\in F: \langle a, x \rangle + b > 0\}\).

The following is a well-known result in combinatorial geometry [27].

Theorem 5

For any finite subset \(F \subseteq \mathbb R^n\), there are at most \(2{|F|\atopwithdelims ()n}\) linearly separable subsets.

By considering the natural mapping between subsets of \(\{1, \ldots , m\}\) and \(\{0,1\}^m\), we also obtain the following corollary.

Corollary 3

For any \(m \ge 1\), there are at most \(2{2^{m}\atopwithdelims ()m}\) linearly separable collections of subsets of \(\{1, \ldots , m\}\). In other words, \(|{\mathcal {L}}_m| \le 2{2^{m}\atopwithdelims ()m}\).

Algorithm 1 Algorithm to solve (1) for linear threshold NNs with n inputs, k hidden layers and widths \(w = (w_1, \ldots , w_k)\).

Proof of Theorem 3

Algorithm 1 solves (1). The correctness comes from the observation that a recursive application of Lemma 1 shows that the sets \(Y^k_1, \ldots , Y^k_{w_k}\) computed by the algorithm, intersected with X are all possible subsets of X computed by the neurons in the last hidden layer. The \(\gamma _1, \ldots , \gamma _{w_k}\) are simply the weights of the last layer that combine the indicator functions of these subsets to yield the function value of the neural network on each data point. The convex minimization problem in line 13 finds the optimal \(\gamma _j\) values, for this particular choice of subsets \(Y^k_1, \ldots , Y^k_{w_k}\). Selecting the minimum over all these choices solves the problem.

The outermost for loop iterates at most \(O(D^{w_1 n}\cdot 2^{\sum _{i=1}^{k-1}{w_i^{2} w_{i+1}}})\) times using Theorem 5 and Corollary 3. The computation of the \(\delta _{ij}\) values in Step 14 can be done in time \(poly (D, w_1, \ldots , w_k)\). The convex minimization problem in \(w_k\) variables and D terms in the sum can be solved in \(poly (D,w_k)\) time. Putting these together gives the overall running time. \(\square \)
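To make the convex subproblem concrete in the case where \(\ell \) is the absolute value difference, the minimization over \(\gamma \) for a fixed 0/1 matrix \(\delta \) is a linear program. The sketch below (function name and data are ours, purely illustrative) solves \(\min _{\gamma } \sum _{i=1}^{D} \vert \sum _{j=1}^{w_k} \gamma _j \delta _{ij} - y_i \vert \) with scipy.

```python
import numpy as np
from scipy.optimize import linprog

def fit_last_layer_l1(delta, y):
    """Solve min_gamma sum_i |sum_j gamma_j * delta_ij - y_i| as an LP.

    delta: (D, w_k) 0/1 matrix of last-hidden-layer outputs on the data points,
    y: (D,) targets.  Variables are [gamma; u] with u_i an upper bound on the
    i-th absolute residual."""
    D, wk = delta.shape
    cost = np.concatenate([np.zeros(wk), np.ones(D)])
    # u_i >= delta_i . gamma - y_i   and   u_i >= -(delta_i . gamma - y_i)
    A_ub = np.block([[delta, -np.eye(D)], [-delta, -np.eye(D)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * wk + [(0, None)] * D
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:wk], res.fun   # optimal gamma and the (unnormalized) l1 loss

# Hypothetical instance: D = 4 data points, w_k = 2 neurons in the last hidden layer.
delta = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
y = np.array([2.0, 5.0, 3.0, 0.0])
gamma, loss = fit_last_layer_l1(delta, y)
print(gamma, loss)   # approximately gamma = [2, 3] and loss = 0 for this instance
```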

We now show that the exponential dependence on the dimension n in Theorem 3 is actually necessary unless P=NP. We consider the version of (1) with a single neuron and show that it is NP-hard via a direct reduction.

Theorem 6

(NP-hardness). The One-Node-Linear-Threshold problem, i.e., Problem (1) with \(k=1\) and \(w_1 = 1\), is NP-hard when the dimension n is considered part of the input. This implies in particular that Problem (1) is NP-hard when n is part of the input.

Proof

We here use a result of [22, Theorem 3.1], which showed that the following decision problem is NP-complete. 

MinDis(Halfspaces): Given disjoint sets of positive and negative examples in \({\mathbb {Z}}^{n}\) and a bound \(k\ge 1\), does there exist a separating hyperplane which leads to at most k misclassifications?

MinDis(Halfspaces) is a special case of (1) with a single neuron: given data points \(x_1, \cdots , x_D \in {\mathbb {R}}^{n}\) and \(y_1, \cdots , y_D \in \{0,1\}\), compute \(\alpha \in {\mathbb {R}}^{n}, \beta \in {\mathbb {R}}\) that minimize \(\frac{1}{D} \sum _{i=1}^{D} (\mathbb {1}_{\{\langle \alpha , x_i\rangle + \beta > 0\}} - y_i)^{2}\). \(\square \)

5 Shortcut linear threshold NNs

5.1 Representability of shortcut linear threshold NNs

Proof of Theorem 4

Arora et al. [3] proved that \({{\,\textrm{ReLU}\,}}_n(\lceil \log _2(n+1) \rceil ) = {{\,\textrm{CPWL}\,}}_n\), and it is clear that \({{\,\textrm{CPWL}\,}}_n\subseteq {{\,\textrm{PWL}\,}}_n\) and \({{\,\textrm{SLT}\,}}_n(2)\subseteq {{\,\textrm{PWL}\,}}_n\), so it remains to prove that \({{\,\textrm{PWL}\,}}_n \subseteq {{\,\textrm{SLT}\,}}_n(2)\). Let \(f \in {{\,\textrm{PWL}\,}}_n\) be an arbitrary piecewise linear function, and let \({\mathcal {P}}_f^* = \{P_1,\dots ,P_m\}\). By definition, \(f(x) = \sum _{i=1}^{m}(a_i^\top x + c_i)\mathbb {1}_{\mathring{P}_i}(x)\) for some \(a_i \in \mathbb R^n,\ c_i \in \mathbb R\), where \(i \in \{1,\dots ,m\}\). By the proof of Lemma 3, we are able to compute \(\mathbb {1}_{\mathring{P_i}}\) by a linear combination of the outputs of some neurons in the second hidden layer. In other words, let \(x^{(2)}= [\mathbb {1}_{X_1^1}(x),\dots ,\mathbb {1}_{X_{\ell _1}^{1}}(x),\dots ,\mathbb {1}_{X^{m}_{\ell _{m}}}(x)]^\top \) be the output of the second hidden layer such that for every \(i\in \{1, \ldots , m\}\), we have \(\mathbb {1}_{\mathring{P_i}}(x) = \sum _{j=1}^{\ell _i} \alpha _j^{(i)} \mathbb {1}_{X_j^i}(x)\), where \(\ell _i \in {\mathbb N}_+,\ \alpha _j^{(i)} \in \mathbb R\), and \(\mathbb {1}_{X^i_j}\) are computed by the individual neurons in the second hidden layer. Note that the number of neurons in the second hidden layer is \(w_2 = \sum _{i=1}^{m}\ell _i\).

Now consider introducing a shortcut connection with

$$\begin{aligned} A = [\alpha _{1}^{(1)}a_1,\dots ,\alpha _{\ell _1}^{(1)}a_1,\dots ,\alpha _{\ell _m}^{(m)}a_m] \in \mathbb R^{n \times w_2} \end{aligned}$$

and \(b = [\alpha _1^{(1)} c_1,\dots ,\alpha _{\ell _1}^{(1)}c_1,\dots ,\alpha _{\ell _m}^{(m)}c_m]^\top \in \mathbb R^{w_2}\), then the output of this shortcut NN is given by:

$$\begin{aligned} \langle A^{\top } x + b,x^{(2)}\rangle&= \left\langle \begin{bmatrix}\alpha _{1}^{(1)}(a_1^\top x+c_1)\\ \vdots \\ \alpha _{\ell _1}^{(1)}(a_1^\top x+c_1)\\ \vdots \\ \alpha _{\ell _m}^{(m)}(a_m^\top x+c_m)\end{bmatrix}, \begin{bmatrix}\mathbb {1}_{X_1^1}(x)\\ \vdots \\ \mathbb {1}_{X_{\ell _1}^{1}}(x)\\ \vdots \\ \mathbb {1}_{X^{m}_{\ell _{m}}}(x)\end{bmatrix} \right\rangle \\&= \sum _{i=1}^m \sum _{j=1}^{l_i}(a_i^\top x+c_i) \alpha _j^{(i)} \mathbb {1}_{X^i_j}(x)\\&= \sum _{i=1}^m (a_i^\top x+c_i) \sum _{j=1}^{l_i} \alpha _j^{(i)} \mathbb {1}_{X^i_j}(x)\\&= \sum _{i=1}^{m} \left( a_i^\top x + c_i \right) \cdot \mathbb {1}_{\mathring{P}_{i}}(x) \\&= f(x). \end{aligned}$$

This establishes that \({{\,\textrm{PWL}\,}}_n \subseteq {{\,\textrm{SLT}\,}}_n(2)\), completing the proof. \(\square \)

5.2 Adapting the ERM algorithm for shortcut linear threshold NNs

We now consider solving the ERM problem for a \(\mathbb R^n \rightarrow \mathbb R\) shortcut LT NN with k hidden layers and \(w = (w_1,\dots ,w_k)\). Comparing with Algorithm 1, the only difference between shortcut linear threshold NNs and linear threshold NNs lies in the shortcut connection in the former, which only changes how the final output is formed from the last hidden layer. All layers other than the output layer behave identically in the two networks. Consequently, Algorithm 1 carries over unchanged except for the steps involving the output layer. Hence, in the global optimization algorithm, the only difference arises in Step 15:

$$\begin{aligned} \text {temp} = \min _{\gamma \in {\mathbb {R}}^{w_k},a_j\in \mathbb R^{n}\;\forall j\in [w_k]} \;\;\sum _{i=1}^D \; \ell \left( \sum _{j=1}^{w_k} (a_j^\top x_i + \gamma _j) \delta _{ij}, y_i \right) , \end{aligned}$$

which can be solved in poly\((D,(n+1)w_k)\) time.
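The modified subproblem is again convex in the stacked variables \((a_j, \gamma _j)\): the prediction \(\sum _{j=1}^{w_k} (a_j^\top x_i + \gamma _j) \delta _{ij}\) is linear in these variables, with feature vector \(\delta _{ij}\cdot [x_i; 1]\) for neuron j. The sketch below (names and data are ours) illustrates this for the squared loss, where the subproblem reduces to ordinary least squares; for the absolute value loss one would instead solve a linear program as in the sketch following the proof of Theorem 3.

```python
import numpy as np

def fit_shortcut_output_layer(X, delta, y):
    """Least-squares fit of the shortcut output layer (squared loss).

    X: (D, n) inputs, delta: (D, w_k) 0/1 last-hidden-layer outputs, y: (D,) targets.
    The prediction sum_j (a_j^T x_i + gamma_j) * delta_ij is linear in (a_j, gamma_j)
    with features delta_ij * [x_i, 1], so the subproblem is a least-squares problem."""
    D, n = X.shape
    _, wk = delta.shape
    X1 = np.hstack([X, np.ones((D, 1))])                      # [x_i, 1]
    Phi = np.hstack([delta[:, [j]] * X1 for j in range(wk)])  # (D, w_k*(n+1)) design matrix
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta.reshape(wk, n + 1)                           # row j = [a_j, gamma_j]

# Hypothetical data with n = 2, w_k = 2, D = 6.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
delta = (rng.random((6, 2)) > 0.5).astype(float)
y = rng.normal(size=6)
print(fit_shortcut_output_layer(X, delta, y))
```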

6 Discussions and open questions

6.1 Linear threshold NNs

Results from Boolean circuit complexity can be used to show that our general construction in Theorem 1 may produce 2 hidden layer networks that are exponentially larger than networks that use 3 hidden layers.

Example 2

Consider the piecewise constant function \(p_n(x): \mathbb R^n \rightarrow \mathbb R\) defined as

$$\begin{aligned} p_n(x) = \sigma \left( \prod _{i=1}^n x_i\right) {,} \end{aligned}$$

where \(\sigma \) is the threshold activation function. \(p_n\) has \(\Omega (2^n)\) pieces implying that the construction from Theorem 1 has size \(\Omega (2^n)\). However, \(p_n\) can be represented by a linear threshold NN with 3 hidden layers and \({\mathcal {O}}(n)\) size.

Example 3

(Braid arrangement [31]) Consider a \(\mathbb R^n \rightarrow \mathbb R\) function

$$\begin{aligned} B_n(x) = \sigma \left( \prod _{1\le i < j \le n} (x_j-x_i) \right) {,} \end{aligned}$$

where \(\sigma \) is the threshold activation function. \(B_n\) has \(\Omega (n!)\) pieces implying that the construction from Theorem 1 has size \(\Omega (n!)\). However, \(B_n\) can be represented by a linear threshold NN with 3 hidden layers and \({\mathcal {O}}(n^2)\) size.

The parity function \(\text {par}: \{0,1\}^n \rightarrow \{0,1\}\) is defined as \(\text {par}(x) = \sum _{i=1}^n x_i \; (\text {mod} \;\;2)\). It is well-known that the parity function can be implemented by a linear threshold NN with 2 hidden layers and \({\mathcal {O}}(n)\) nodes, when restricted to 0/1 inputs [28, 29]. Observe that \(p_n(x) = \text {par}(\sigma (x))\) where we apply the threshold activation \(\sigma \) componentwise on \(x\in \mathbb R^n\). This proves that \(p_n\) can be computed by a linear threshold NN with 3 hidden layers and \({\mathcal {O}}(n)\) size. Each orthant of \(\mathbb R^n\) is a piece of \(p_n\) since any adjacent orthant has a different value. Similarly, if we define \(\text {diff}(x) \in \mathbb R^{n(n-1)/2}\) by \(\text {diff}(x)_{ij} = x_i - x_j\) for \(1 \le i<j \le n\), then \(B_n(x) = \text {par}(\sigma (\text {diff}(x)))\). The fact that \(B_n\) has n! pieces comes from results on the so-called Braid arrangement [31].

Lower bounds for the number of wires used in a linear threshold NN have also been studied in the Boolean circuit complexity literature [24, 25, 29]. The number of wires is the number of connections between neurons when the neural network is viewed as a directed acyclic graph. This amounts to the number of nonzero entries of the matrices involved in the affine transformations in Definition 5. Tight bounds that are superlinear but subquadratic in n are known for the wire complexity of any constant-depth linear threshold NN implementing the parity function. These results also imply that there is an \({\mathcal {O}}(\log \log n)\) depth NN that implements the parity function with \({\mathcal {O}}(n)\) wires. See [24, 29] for details. These constructions can be used to implement \(p_n\) and \(B_n\).
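As a small concrete reading of this measure (a sketch under our own explicit weight-matrix representation, which is an assumption and not the notation of Definition 5), the wire count is simply the total number of nonzero weight-matrix entries:

```python
import numpy as np

def wire_count(weight_matrices):
    """Number of wires of a layered network: total number of nonzero entries
    across the layer-to-layer weight matrices (bias vectors are not counted,
    following the sentence above, which refers only to the matrices)."""
    return int(sum(np.count_nonzero(W) for W in weight_matrices))

# Example: a sparse 4-2-1 network uses 5 wires, whereas a dense one would use 10.
W1 = np.array([[1., 0., -1., 0.],
               [0., 2.,  0., 0.]])   # hidden-layer weights (2 x 4)
W2 = np.array([[1., -1.]])           # output weights (1 x 2)
print(wire_count([W1, W2]))          # -> 5
```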

It is not clear whether the functions in Example 2 and Example 3 can be implemented by networks with 2 hidden layers and polynomial size, or whether there exist superpolynomial lower bounds on the size of such networks. In the first case, we would know that our construction in this paper is suboptimal. In the second case, we would have a gap result for 2 versus 3 hidden layers.

6.2 Shortcut linear threshold NNs

Shortcut linear threshold NNs (SLT NNs) may bear a superficial resemblance to residual neural networks (ResNets), mainly due to the shortcut (skip) connections present in both architectures. However, ResNets, a significant advance in deep learning pioneered by He et al. [20], use skip connections to add the input of a block of layers directly to its output, a design aimed at combating issues such as the vanishing gradient problem in deep networks. In contrast, SLT NNs apply the shortcut differently: the output is the inner product of an affine transformation of the input with the final hidden layer. This distinction marks a shift from ResNet's engineering motivation to a more theoretical one: the shortcut augments the representability of the network from piecewise constant to piecewise linear functions.

In machine learning, the VC-dimension, named after Vapnik and Chervonenkis, measures the capacity, or complexity, of a hypothesis class and characterizes the sample complexity needed to learn over it, making it a fundamental tool in learning theory. As part of our motivation for this novel type of shortcut connection, we compare the complexity of SLT NNs and ReLU NNs using this measure, which helps gauge the ability of the networks to learn and generalize from data. A fundamental result on the VC-dimension of parametrized classes of functions [2, Theorem 8.4] can be used to show that the VC-dimension of an SLT NN with n inputs and k hidden layers is \({\mathcal {O}}\left( (W+nw_k)^2\right) \), where W is the number of learnable parameters excluding the shortcut connection, \(w_k\) is the number of neurons in the last hidden layer, and \(nw_k\) counts the additional parameters of the linear transformation A in the shortcut connection (the parameters of the shift b are already included in W, since they are present in the linear threshold NN without the shortcut). For a ReLU NN with a similar architecture and no shortcut connection, the VC-dimension is \(\Theta \left( Wk \log W \right) \) (see [4]). The discrepancy between the VC-dimensions of comparable architectures is therefore not dramatic, and the ability of shortcut linear threshold networks to represent discontinuous piecewise linear functions can potentially give them a competitive edge over ReLU NNs.
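Purely as an illustration of how the two bounds scale, the sketch below plugs a sample architecture into both expressions. The parameter count is our own illustrative assumption (all weights and biases of the non-shortcut affine maps); the constants hidden in the asymptotic notation are ignored, and both expressions are polynomial in the parameter count, which is the sense in which the gap is not dramatic.

```python
import math

def non_shortcut_param_count(n, widths):
    # Illustrative assumption for W: weights and biases of the hidden-layer
    # affine maps of a threshold network with input dimension n and hidden
    # widths `widths` (the paper's exact definition of W may differ).
    dims = [n] + list(widths)
    return sum((dims[i] + 1) * dims[i + 1] for i in range(len(widths)))

n, widths = 10, [20, 20, 20]
k, w_k = len(widths), widths[-1]
W = non_shortcut_param_count(n, widths)

slt_scale = (W + n * w_k) ** 2        # O((W + n*w_k)^2) bound for the SLT NN
relu_scale = W * k * math.log(W)      # Theta(W*k*log W) for a comparable ReLU NN
print(f"W = {W}, SLT scale ~ {slt_scale:.2e}, ReLU scale ~ {relu_scale:.2e}")
```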

Furthermore, we can construct a globally optimal ERM algorithm for shortcut linear threshold NNs across all architectures, an accomplishment not yet attained for ReLU networks beyond specific restricted structures [3, 8, 11, 14, 15, 17,18,19].

6.3 Complexity of neural network training

There has been a recent strand of work on the computational complexity of training neural networks provably to global optimality. It has been known for decades that training neural networks with classical activation functions is computationally hard, and recently this insight has been extended to ReLU activations as well; see [1, 5, 11, 14, 16]. On the positive side, fixed-parameter tractable algorithms and approximation algorithms have been designed [3, 8, 14, 15, 17, 18]. However, these algorithms are restricted to architectures with a single hidden layer, or to architectures very close to a single hidden layer. As mentioned in the Introduction, our training algorithm for linear threshold NNs and shortcut linear threshold NNs works for any architecture and has running time polynomial in the number of data points, assuming the size of the network and the data dimension are constants. To the best of our knowledge, a fixed-parameter tractable training algorithm with global optimality guarantees for general architectures has not appeared in the literature, except for an interesting study by Bienstock, Munoz and Pokutta [6]. They formulate the training problem as a linear program that solves the problem to \(\epsilon \)-accuracy in time linear in the number of data points and polynomial in \(\frac{1}{\epsilon }\), assuming the input dimension and the network architecture are fixed. This is done via a discretization of the neural network parameter space and the input space. For a general convex loss, our algorithm also has to be content with \(\epsilon \)-approximate solutions, since this is the best one can do when minimizing general convex functions. However, our running time is polynomial in \(\log (1/\epsilon )\), in contrast to \(\frac{1}{\epsilon }\). Moreover, for certain loss functions such as the \(\ell _1\) or \(\ell _\infty \) loss, our algorithm is exact, because the convex optimization problem becomes a linear program, whereas the algorithm in [6] still relies on discretizations, leading to an approximation. On the other hand, the linear dependence of the algorithm in [6] on the number of data points is much better than that of our algorithm. It should be noted that their analysis does not formally apply to the linear threshold activation function, since \(x\mapsto {\mathbb {1}}_{\{x>0\}}\) is not Lipschitz continuous, which is an assumption needed in their work.

6.4 Open questions

On the structural side, we need a better understanding of the depth versus size tradeoff for (shortcut) linear threshold NNs. In particular, can certain functions be represented with 3 or more hidden layers using exponentially fewer neurons than are needed with 2 hidden layers? For instance, in the case of ReLU activations, there exist functions for which going from 2 to 3 hidden layers brings an exponential (in the dimension n) reduction in the size of the neural network [12]. We think it is an interesting open question to determine whether such families of functions exist for linear threshold networks.

We also suspect that one does not need to go beyond 3 hidden layers to improve on the size bounds, if one is prepared to ignore sets of measure zero. This conjecture is based on our empirical observations with these neural networks.

Conjecture 1

For every natural number \(n\in {\mathbb N}\), there exists \(C(n) \in \mathbb R_+\) such that for any \(f \in {{\,\textrm{PWC}\,}}_n\),

$$\begin{aligned} \min _{N \in {\mathcal {N}}_\mu (LT _n(3),f)}|N| \le C(n) \cdot \min _{k \in {\mathbb N}_+}\min _{N_k \in {\mathcal {N}}_\mu (LT _n(k),f)} |N_k| . \end{aligned}$$

A similar conjecture can be formulated for the representation of continuous piecewise linear functions by shortcut linear threshold NNs.

Conjecture 2

For every natural number \(n\in {\mathbb N}\), there exists \(C(n) \in \mathbb R_+\) such that for any \(f \in {{\,\textrm{CPWL}\,}}_n\),

$$\begin{aligned} \min _{N \in {\mathcal {N}}(SLT _n(3),f)}|N| \le C(n) \cdot \min _{k \in {\mathbb N}_+}\min _{N_k \in {\mathcal {N}}(SLT _n(k),f)} |N_k| . \end{aligned}$$

On the algorithmic side, we solve the empirical risk minimization problem to global optimality with running time polynomial in the size of the data sample, assuming that the input dimension and the architecture size are fixed constants. The running time is exponential in these parameters (see Theorem 3). While the exponential dependence on the input dimension cannot be avoided unless \(P=NP\) (see Theorem 6), another interesting open question is whether the exponential dependence on the architectural parameters is really needed, or whether an algorithm can be designed whose complexity is polynomial in both the size of the data sample and the architecture parameters. A similar question is also open in the case of ReLU neural networks [3].