1 Introduction

Deep learning has proven immensely powerful at solving a number of important predictive tasks arising in areas such as image classification, speech recognition, machine translation, and robotics and control [32, 48]. The workhorse model in deep learning is the feedforward network \(\texttt {NN}{}:{\mathbb {R}}^{m_0} \rightarrow {\mathbb {R}}^{m_s}\) that maps a (possibly very high-dimensional) input \(x^0 \in {\mathbb {R}}^{m_0}\) to an output \(x^s = \texttt {NN}{}(x^0)\in {\mathbb {R}}^{m_s}\). A feedforward network with s layers can be recursively described as

$$\begin{aligned} x^i_j = \texttt {NL}^{i,j}(w^{i,j} \cdot x^{i-1} + b^{i,j}) \quad { \forall i \in [\![s ]\!],\; j \in [\![m_i ]\!]} \end{aligned}$$
(1)

where \([\![n ]\!]\mathop {=}\limits ^{\text {def}}\{1,\ldots ,n\}\) and \(m_i\) is the number of neurons in layer i, so that each neuron in layer i takes as input the output \(x^{i-1} \in {\mathbb {R}}^{m_{i-1}}\) of the previous layer (with the input \(x^0 \in {\mathbb {R}}^{m_0}\) considered to be the 0-th layer). Furthermore, for each \(i \in [\![s ]\!]\) and \(j \in [\![m_i ]\!]\), \(\texttt {NL}^{i,j}:{\mathbb {R}}\rightarrow {\mathbb {R}}\) is some simple nonlinear activation function that is fixed before training, and \(w^{i,j}\) and \(b^{i,j}\) are the weights and bias of an affine function that is learned during the training procedure. In its simplest and most common form, the activation function is the rectified linear unit (ReLU), defined as \(\texttt {ReLU}{}(v) = \max \{0,v\}\).
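To make the recursion (1) concrete, the following is a minimal NumPy sketch of a forward pass through such a network, with the ReLU activation applied at every layer; the weights and biases are illustrative placeholders rather than learned values.

```python
# Forward pass implementing the recursion (1) with ReLU activations at every layer.
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def feedforward(x0, weights, biases):
    """weights[i] has shape (m_i, m_{i-1}); biases[i] has shape (m_i,)."""
    x = np.asarray(x0, dtype=float)
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)  # x^i_j = ReLU(w^{i,j} . x^{i-1} + b^{i,j})
    return x

# Example: a two-layer network NN: R^3 -> R^2 with placeholder weights.
W1, b1 = np.ones((4, 3)), np.zeros(4)
W2, b2 = np.full((2, 4), 0.5), np.zeros(2)
print(feedforward([1.0, -2.0, 0.5], [W1, W2], [b1, b2]))
```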

Many standard nonlinearities \(\texttt {NL}{}\), such as the ReLU, are piecewise linear: that is, there exists a partition \(\{S^i \subseteq D\}_{i=1}^d\) of the domain and affine functions \(\{f^i\}_{i=1}^d\) such that \(\texttt {NL}{}(x) = f^i(x)\) for all \(x \in S^i\). If all nonlinearities describing \(\texttt {NN}{}\) are piecewise linear, then the entire network \(\texttt {NN}{}\) is piecewise linear as well, and conversely, any continuous piecewise linear function can be represented by a ReLU neural network [5]. Beyond the ReLU, we also investigate activation functions that can be expressed as the maximum of finitely many affine functions, a class which includes certain common operations such as max pooling and reduce max.

There are numerous contexts in which one may want to solve an optimization problem containing a trained neural network such as \(\texttt {NN}{}\). For example, this paradigm arises in deep reinforcement learning problems with high dimensional action spaces and where any of the cost-to-go function, immediate cost, or the state transition functions are learned by a neural network [6, 23, 56, 62, 64, 80]. More generally, this approach may be used to approximately optimize learned functions that are difficult to model explicitly [34, 52, 53]. Alternatively, there has been significant recent interest in verifying the robustness of trained neural networks deployed in systems like self-driving cars that are incredibly sensitive to unexpected behavior from the machine learning model [20, 60, 68]. Relatedly, a string of recent work has used optimization over neural networks trained for visual perception tasks to generate new images which are “most representative” for a given class [59], are “dreamlike” [57], or adhere to a particular artistic style via neural style transfer [30].

1.1 MIP formulation preliminaries

In this work, we study mixed-integer programming (MIP) approaches for optimization problems containing trained neural networks. In contrast to heuristic or local search methods often deployed for the applications mentioned above, MIP offers a framework for producing provably optimal solutions. This is particularly important, for example, in verification problems, where rigorous dual bounds can guarantee robustness in a way that purely primal methods cannot.

Let \(f:D\subseteq {\mathbb {R}}^\eta \rightarrow {\mathbb {R}}\) be an \(\eta \)-variate affine function with domain D, \(\texttt {NL}{}:{\mathbb {R}}\rightarrow {\mathbb {R}}\) be a univariate nonlinear activation function, and \(\circ \) be the standard function composition operator so that \((\texttt {NL}{} \circ f)(x) = \texttt {NL}{}(f(x))\). Functions of the form \((\texttt {NL}{} \circ f)(x)\) precisely describe an individual neuron, with \(\eta \) inputs and one output. A standard way to model a function \(g:D\subseteq {\mathbb {R}}^\eta \rightarrow {\mathbb {R}}\) using MIP is to construct a formulation of its graph defined by \({\text {gr}}(g; D) \mathop {=}\limits ^{\text {def}}\left\{ \left( x,y\right) \in {\mathbb {R}}^\eta \times {\mathbb {R}}\,|\, x \in D, \, y=g(x) \right\} \). We therefore focus on constructing MIP formulations for the graph of individual neurons:

$$\begin{aligned} {\text {gr}}(\texttt {NL}{} \circ f; D) \mathop {=}\limits ^{\text {def}}\left\{ \left( x,{y}\right) {\in {\mathbb {R}}^\eta \times {\mathbb {R}}}\, |\, x \in D,\, {y=(\texttt {NL}{} \circ f) (x)} \right\} , \end{aligned}$$
(2)

because, as we detail at the end of this section, we can readily produce a MIP formulation for the entire network as the composition of formulations for each individual neuron.

Definition 1

Throughout, we will notationally use the convention that \(x \in {\mathbb {R}}^\eta \) and \(y \in {\mathbb {R}}\) are respectively the input and output variables of the function \(g:D\subseteq {\mathbb {R}}^\eta \rightarrow {\mathbb {R}}\) we are modeling. In addition, \(v \in {\mathbb {R}}^h\) and \(z \in {\mathbb {R}}^q\) are respectively any auxiliary continuous and binary variables used to construct a formulation of \({\text {gr}}(g; D)\). The orthogonal projection operator \({\text {Proj}}\) will be subscripted by the variables to project onto, e.g. \({\text {Proj}}_{x,y}(R) = \left\{ (x,y) \,|\, \exists v, z \text { s.t. } (x,y,v,z) \in R\right\} \) is the orthogonal projection of R onto the “original space” of x and y variables. We denote by \(\text {ext}(Q)\) the set of extreme points of a polyhedron Q.

Take \(S\mathop {=}\limits ^{\text {def}}{\text {gr}}(g; D) \subseteq {\mathbb {R}}^{\eta +1}\) to be the set we want to model, and a polyhedron \(Q \subset {\mathbb {R}}^{\eta + 1 + {h} + q}\). Then:

  • A (valid) mixed-integer programming (MIP) formulation of S consists of the linear constraints on \((x,y,v,z)\in {\mathbb {R}}^{\eta + 1 + {h} + q}\) which define a polyhedron Q, combined with the integrality constraints \(z \in \{0,1\}^q\), such that

    $$\begin{aligned} S = {\text {Proj}}_{x,y}\left( Q \cap ({\mathbb {R}}^{\eta + 1 + {h}} \times \{0,1\}^q)\right) . \end{aligned}$$

    We refer to Q as the linear programming (LP) relaxation of the formulation.

  • A MIP formulation is sharp if \({\text {Proj}}_{x,y}(Q) = {\text {Conv}}(S)\).

  • A MIP formulation is hereditarily sharp if fixing any subset of binary variables z to 0 or 1 preserves sharpness.

  • A MIP formulation is ideal (or perfect) if \({\text {ext}}(Q) \subseteq {\mathbb {R}}^{\eta + 1 + {h}} \times \{0,1\}^q\).

  • The separation problem for a family of inequalities is, given a point, to either find an inequality in the family that the point violates, or certify that the point satisfies every inequality in the family.

  • An inequality is valid for the formulation if every point in the set of integer feasible points \(\underline{Q} = \left\{ (x,y,v,z) \in Q\,|\, z \in \{0,1\}^q\right\} \) satisfies the inequality. Moreover, a valid inequality is facet-defining if the dimension of the points in \(\underline{Q}\) satisfying the inequality at equality is exactly one less than the dimension of \(\underline{Q}\) itself.

Note that ideal formulations are sharp, but the converse is not necessarily the case [72, Proposition 2.4]. In this sense, ideal formulations offer the tightest possible relaxation, and the integrality property in Definition 1 tends to lead to superior computational performance. Furthermore, note that a hereditarily sharp formulation is a formulation which retains its sharpness at every node in a branch-and-bound tree, and as such is potentially superior to a formulation which only guarantees sharpness at the root node [41, 42]. Additionally, it is important to keep in mind that modern MIP solvers will typically require an explicit, finite list of inequalities defining Q.

Finally, for each \(i \in [\![s ]\!]\) and \(j \in [\![m_i ]\!]\) let \(Q_{i,j}\subseteq {\mathbb {R}}^{m_{i-1} + 1 + h_{i,j} + q_{i,j}}\) be a polyhedron such that \({\text {gr}}(\texttt {NL}^{i,j}{} \circ f^{i,j}; D_{i,j}) = {\text {Proj}}_{x,y}\left( Q_{i,j} \cap ({\mathbb {R}}^{m_{i-1} + 1 + h_{i,j}} \times \{0,1\}^{ q_{i,j}})\right) \), where each \(D_{i,j}\) is the domain of neuron \(\texttt {NL}^{i,j}\) and \(f^{i,j}(x) = w^{i,j} \cdot x + b^{i,j}\) is the learned affine function in network (1). Then, a formulation for the complete network (1) is given by

$$\begin{aligned} \left( x^{i-1},x^i_j,v^{i,j},z^{i,j}\right) \in \left( Q_{i,j} \cap ({\mathbb {R}}^{m_{i-1} + 1 + h_{i,j}} \times \{0,1\}^{ q_{i,j}})\right) \quad \forall i \in [\![s ]\!],\; j \in [\![m_i ]\!]. \end{aligned}$$

Properties of the individual-neuron formulations \(Q_{i,j}\) (e.g. being ideal) are not necessarily preserved for the combined formulation of the complete network. However, for similarly sized formulations, combining stronger formulations (e.g. ideal instead of sharp) usually leads to complete formulations that are more computationally effective [72]. Hence, our focus for all theoretical properties of the formulations will be restricted to individual-neuron formulations.

1.2 Our contributions

We highlight the contributions of this work as follows.

  1. Generic recipes for building strong MIP formulations of the maximum of d affine functions for any bounded polyhedral input domain.

    • [Propositions 3 and 4] We derive both primal and dual characterizations for ideal formulations via the Cayley embedding.

    • [Propositions 5 and 6] We relax the Cayley embedding in a particular way, and use this to derive simpler primal and dual characterizations for (hereditarily) sharp formulations.

    • We discuss how to separate both dual characterizations via subgradient descent.

  2. Simplifications for common special cases.

    • [Corollary 1] We show the equivalence of the ideal and sharp characterizations when \(d=2\) (i.e. the maximum of two affine functions).

    • [Proposition 7] We show that, if the input domain is a product of simplices, the separation problem of the sharp formulation can be reframed as a series of transportation problems.

    • [Corollaries 2 and 3, and Proposition 9] When the input domain is a product of simplices, and either (1) \(d=2\), or (2) each simplex is two-dimensional, we provide an explicit, finite description for the sharp formulation. Furthermore, none of these inequalities are redundant.

  3. Application of these results to construct MIP formulations for neural network nonlinearities.

    • [Propositions 12 and 13] We derive an explicit ideal formulation for the ReLU nonlinearity over a box input domain, the most common case. Separation over this ideal formulation can be performed in time linear in the input dimension.

    • [Corollary 5] We derive an explicit ideal formulation for the ReLU nonlinearity where some (or all) of the input domain is one-hot encoded categorical or discrete data. Again, the separation can be performed efficiently, and none of the inequalities are redundant.

    • [Propositions 10 and 11] We present an explicit hereditarily sharp formulation for the maximum of d affine functions over a box input domain, and provide an efficient separation routine. Moreover, a subset of the defining constraints serve as a tightened big-M formulation.

    • [Proposition 14] We produce similar results for more exotic neurons: the leaky ReLU and the clipped ReLU (see the online supplement).

  4. Computational experiments on verification problems arising from image classification networks trained on the MNIST digit data set.

    • We observe that our new formulations, along with just a few rounds of separation over our families of cutting planes, can improve the solve time of Gurobi on verification problems by orders of magnitude.

Our contributions are depicted in Fig. 1. It serves as a roadmap of the paper.

Fig. 1

A roadmap of our results. The single arrows depict dependencies between our technical results, while the double arrows depict the way in which we use these results to establish our applied formulations in the paper. Notationally, d is the number of pieces of the piecewise linear function, \([L,U]^\eta \) is an \(\eta \)-dimensional box, and \((\varDelta ^p)^\tau \) refers to a product of \(\tau \) simplices, each with p components

1.3 Relevant prior work

In recent years a number of authors have used MIP formulations to model trained neural networks [19, 22, 24, 29, 40, 47, 54, 64, 66, 67, 70, 80, 81], mostly applying big-M formulation techniques to ReLU-based networks. When applied to a single neuron of the form (2), these big-M formulations will not be ideal or offer an exact convex relaxation; see Example 1 for an illustration. Additionally, a stream of literature in the deep learning community has studied convex relaxations [12, 26, 27, 61, 63], primarily for verification tasks. Moreover, some authors have investigated how to use convex relaxations within the training procedure in the hopes of producing neural networks with a priori robustness guarantees [25, 78, 79].

The usage of mathematical programming in deep learning, specifically for training predictive models, has also been investigated in [15]. Beyond mathematical programming and convex relaxations, a number of authors have investigated other algorithmic techniques for modeling trained neural networks in optimization problems, drawing primarily from the satisfiability, constraint programming, and global optimization communities [10, 11, 43, 51, 65]. Throughout this stream of the literature, there is discussion of the performance for specific subclasses of neural network models, including binarized [44] or input convex [3] neural networks. Improving our formulations for specific subclasses of neural networks would constitute interesting future work.

Outside of applications in machine learning, the formulations presented in this paper also have connections with existing structures studied in the MIP and constraint programming communities, like indicator variables, on/off constraints, and convex envelopes [7, 13, 17, 37, 38, 69]. Finally, this paper has connections with distributionally robust optimization, in that the primal–dual pair of sharp formulations presented can be viewed as equivalent to the discrete sup–sup duality result [21] for the marginal distribution model [35, 58, 77].

1.4 Starting assumptions and notation

We use the following notational conventions throughout the paper.

  • The nonnegative orthant: \({\mathbb {R}}_{\geqslant 0} \mathop {=}\limits ^{\text {def}}\left\{ x \in {\mathbb {R}}\,|\, x \geqslant 0\right\} \).

  • The n-dimensional simplex: \(\varDelta ^{{n}} \mathop {=}\limits ^{\text {def}}\left\{ x \in {\mathbb {R}}^{{n}}_{\geqslant 0} \,|\, \sum _{i=1}^{{n}} x_i = 1\right\} \).

  • The set of integers from 1 to n: \([\![n ]\!]= \{1, \ldots , n\}\).

  • “Big-M” coefficients: \(M^+(f;D) \mathop {=}\limits ^{\text {def}}\max _{x \in D} f(x)\) and \(M^-(f;D) {\mathop {=}\limits ^{\text {def}}} \min _{x \in D} f(x)\).

  • The dilation of a set: if \(z \in {\mathbb {R}}_{\geqslant 0}\) and \(D \subseteq {\mathbb {R}}^\eta \), then \(z \cdot D \mathop {=}\limits ^{\text {def}}\left\{ z x \,|\, x \in D\right\} \).

Furthermore, we will make the following simplifying assumptions.

Assumption 1

The input domain D is a bounded polyhedron.

While a bounded input domain assumption will make the formulations and analysis considerably more difficult than the unbounded setting (see [7] for a similar phenomenon), it ensures that standard MIP representability conditions are satisfied (e.g. [72, Sect. 11]). Furthermore, variable bounds are natural for many applications (for example in verification problems), and are absolutely essential for ensuring reasonable dual bounds.

Assumption 2

Each neuron is irreducible: for any \(k \in [\![d ]\!]\), there exists some \(x \in D\) where \(f^k(x) > f^\ell (x)\) for each \(\ell \ne k\).

Observe that if a neuron is not irreducible, this means that it is unnecessarily complex, and one or more of the affine functions can be completely removed. Moreover, the assumption can be verified in polynomial time via d LPs by checking if \(\max _{x,\varDelta } \left\{ \varDelta \,\big |\, x \in D,\ \varDelta \leqslant f^k(x) - f^\ell (x)\ \forall \ell \ne k\right\} > 0\) for each \(k \in [\![d ]\!]\). In the special case where \(d = 2\) (e.g. ReLU) and D is a box, this can be done in linear time. Finally, if the assumption does not hold, it will not affect the validity of the formulations or cuts derived in this work, though certain results pertaining to non-redundancy or facet-defining properties may no longer hold.
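The following is a small sketch of this check, assuming SciPy's linprog and a domain given in inequality form \(D = \{x \,|\, Ax \leqslant c\}\); the strict-positivity tolerance is an illustrative choice.

```python
# Check Assumption 2 (irreducibility) for piece k by solving the LP described above.
import numpy as np
from scipy.optimize import linprog

def is_irreducible(k, A, c, w, b, tol=1e-9):
    """Solve max { Delta | x in D, Delta <= f^k(x) - f^l(x) for all l != k }
       over D = {x | A x <= c}, and test whether the optimum is > 0."""
    A, c, w, b = (np.asarray(v, float) for v in (A, c, w, b))
    m, eta = A.shape
    obj = np.zeros(eta + 1)
    obj[-1] = -1.0  # variables are (x, Delta); linprog minimizes, so minimize -Delta
    rows = [np.hstack([A, np.zeros((m, 1))])]  # A x <= c (Delta does not appear)
    rhs = [c]
    for l in range(len(w)):
        if l == k:
            continue
        # Delta + (w^l - w^k)·x <= b^k - b^l
        rows.append(np.hstack([w[l] - w[k], [1.0]])[None, :])
        rhs.append([b[k] - b[l]])
    res = linprog(obj, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(None, None)] * (eta + 1), method="highs")
    return res.status == 0 and -res.fun > tol
```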

2 Motivating example: the ReLU nonlinearity over a box domain

Since the work of Glorot et al. [31], the ReLU neuron has become the workhorse of deep learning models. Despite its simple structure, neural networks formed by ReLU neurons are easy to train and reason about, and can model any continuous piecewise linear function [5]. Moreover, they can approximate any continuous, nonlinear function to arbitrary precision with a bounded width [36].

Accordingly, the general problem of maximizing (or minimizing) the output of a trained ReLU network is NP-hard [43]. Nonetheless, in this section we present the strongest possible MIP formulations for a single ReLU neuron without continuous auxiliary variables. As discussed at the end of Sect. 1.1 and corroborated in the computational study in Sect. 6, this leads to faster solve times for the entire ReLU network in practice.

2.1 A big-M formulation

To start, we will consider the ReLU in the simplest possible setting: where the input is univariate. Take the two-dimensional set \({\text {gr}}(\texttt {ReLU}{}; [l,u])\), where \([l,u]\) is some interval in \({\mathbb {R}}\) containing zero. It is straightforward to construct an ideal formulation for this univariate ReLU.

Proposition 1

An ideal formulation for \({\text {gr}}(\texttt {ReLU}{} ; [l,u])\) is:

$$\begin{aligned} y&\geqslant x \end{aligned}$$
(3a)
$$\begin{aligned} y&\leqslant x - l(1-z) \end{aligned}$$
(3b)
$$\begin{aligned} y&\leqslant uz \end{aligned}$$
(3c)
$$\begin{aligned} (x,y,z)&\in {\mathbb {R}}\times {\mathbb {R}}_{\geqslant 0} \times [0,1] \end{aligned}$$
(3d)
$$\begin{aligned} z&\in \{0,1\}. \end{aligned}$$
(3e)

Proof

Follows from inspection, or as a special case of Proposition 12 to be presented in Sect. 5.2. \(\square \)

A more realistic setting would have an \(\eta \)-variate ReLU nonlinearity whose input is some \(\eta \)-variate affine function \(f : [L,U] \rightarrow {\mathbb {R}}\) where \(L,U\in {\mathbb {R}}^\eta \) and \([L,U]\mathop {=}\limits ^{\text {def}}\left\{ x \in {\mathbb {R}}^\eta \,\big |\, L_i \leqslant x_i \leqslant U_i\ \forall i\in [\![\eta ]\!]\right\} \) (i.e. the \(\eta \)-variate ReLU nonlinearity given by \(\texttt {ReLU}{} \circ f\)). The box-input [LU] corresponds to known (finite) bounds on each component, which can typically be efficiently computed via interval arithmetic or other standard methods.
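For instance, over the box \([L,U]\) these bounds can be computed coordinate-wise; a small NumPy sketch:

```python
# Interval-arithmetic computation of M^-(f;[L,U]) and M^+(f;[L,U]) for f(x) = w·x + b.
import numpy as np

def big_m_bounds(w, b, L, U):
    w, L, U = (np.asarray(v, float) for v in (w, L, U))
    m_plus = b + np.sum(np.where(w >= 0, w * U, w * L))   # each term maximized at a box corner
    m_minus = b + np.sum(np.where(w >= 0, w * L, w * U))  # each term minimized at a box corner
    return float(m_minus), float(m_plus)

print(big_m_bounds([1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0]))  # (-1.5, 0.5), matching Example 1 below
```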

Observe that we can model the graph of the \(\eta \)-variate ReLU neuron as a simple composition of the graph of a univariate ReLU activation function and an \(\eta \)-variate affine function:

$$\begin{aligned} {{\text {gr}}\left( \texttt {ReLU}{} \circ f;[L,U]\right) } = \left\{ (x,y){\in {\mathbb {R}}^{\eta +1}} \left| \begin{array}{c} (f(x),y) \in {\text {gr}}\left( \texttt {ReLU}{}; [{m^-,m^+}]\right) \\ L \leqslant x \leqslant U \end{array}\right. \right\} ,\nonumber \\ \end{aligned}$$
(4)

where \(m^-\mathop {=}\limits ^{\text {def}}M^-(f; [L,U])\) and \(m^+\mathop {=}\limits ^{\text {def}}M^+(f; [L,U])\). Using formulation (3) as a submodel, we can write a formulation for the ReLU over a box domain as:

$$\begin{aligned} y&\geqslant f(x) \end{aligned}$$
(5a)
$$\begin{aligned} y&\leqslant f(x) - M^-(f;[L,U]) \cdot (1-z) \end{aligned}$$
(5b)
$$\begin{aligned} y&\leqslant M^+(f;[L,U]) \cdot z \end{aligned}$$
(5c)
$$\begin{aligned} (x,y,z)&\in [L,U] \times {\mathbb {R}}_{\geqslant 0} \times [0,1] \end{aligned}$$
(5d)
$$\begin{aligned} z&\in \{0,1\}. \end{aligned}$$
(5e)

This is the approach taken recently in the bevy of papers referenced in Sect. 1.3. Unfortunately, after the composition with the affine function f over a box input domain, this formulation is no longer sharp.
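For concreteness, the following is a minimal sketch of how formulation (5) can be constructed for a single neuron, assuming Gurobi's Python interface (gurobipy) is available; the variable names are illustrative, and the instance data matches Example 1 below.

```python
# Big-M formulation (5) for one ReLU neuron y = max{0, w·x + b} over the box [L, U].
import gurobipy as gp
from gurobipy import GRB

def add_big_m_relu(model, x, w, b, L, U):
    """Add variables (y, z) and constraints (5a)-(5c) for one ReLU neuron."""
    m_minus = b + sum(w[i] * (L[i] if w[i] >= 0 else U[i]) for i in range(len(w)))
    m_plus = b + sum(w[i] * (U[i] if w[i] >= 0 else L[i]) for i in range(len(w)))
    y = model.addVar(lb=0.0, name="y")            # output variable, y >= 0
    z = model.addVar(vtype=GRB.BINARY, name="z")  # indicator for the "active" piece
    f = gp.quicksum(w[i] * x[i] for i in range(len(w))) + b
    model.addConstr(y >= f)                          # (5a)
    model.addConstr(y <= f - m_minus * (1 - z))      # (5b)
    model.addConstr(y <= m_plus * z)                 # (5c)
    return y, z

model = gp.Model()
L, U = [0.0, 0.0], [1.0, 1.0]
x = model.addVars(2, lb=L, ub=U, name="x")
y, z = add_big_m_relu(model, x, [1.0, 1.0], -1.5, L, U)
model.setObjective(y, GRB.MAXIMIZE)
model.optimize()
```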

Example 1

If \(f(x) = x_1 + x_2 - 1.5\), formulation (5) for \({\text {gr}}(\texttt {ReLU}{} \circ f; [0,1]^2)\) is

$$\begin{aligned} y&\geqslant x_1 + x_2 - 1.5 \end{aligned}$$
(6a)
$$\begin{aligned} y&\leqslant x_1 + x_2 - 1.5 + 1.5(1-z) \end{aligned}$$
(6b)
$$\begin{aligned} y&\leqslant 0.5z \end{aligned}$$
(6c)
$$\begin{aligned} (x,y,z)&\in [0,1]^2\times {\mathbb {R}}_{\geqslant 0} \times [0,1] \end{aligned}$$
(6d)
$$\begin{aligned} z&\in \{0,1\}. \end{aligned}$$
(6e)

The point \(({\hat{x}},{\hat{y}},{\hat{z}}) = ((1,0),0.25,0.5)\) is feasible for the LP relaxation (6a)–(6d). However, observe that the inequality \(y \leqslant 0.5x_2\) is valid for \({\text {gr}}(\texttt {ReLU}{} \circ f; [0,1]^2)\), but is violated by \(({\hat{x}},{\hat{y}})\). Therefore, the formulation does not offer an exact convex relaxation (and, hence, is not ideal). See Fig. 2 for an illustration: on the left, the tightest possible convex relaxation, and on the right, the big-M formulation projected to (x, y)-space.

Fig. 2

For \(f(x) = x_1 + x_2 - 1.5\): (Left) \({\text {Conv}}({\text {gr}}(\texttt {ReLU}{} \circ f; [0,1]^2))\), and (Right) the projection of the big-M formulation (5) to (xy)-space, where we mark the point \((x,y) = ((1,0),0.25)\) that is not in the convex hull, but is valid for the projection of the big-M LP relaxation (6a)–(6d)
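A short numerical check of the claims in Example 1:

```python
# The point (x, y, z) = ((1, 0), 0.25, 0.5) satisfies the LP relaxation (6a)-(6d)
# but violates the valid inequality y <= 0.5*x2.
x1, x2, y, z = 1.0, 0.0, 0.25, 0.5
f = x1 + x2 - 1.5
assert y >= f                          # (6a)
assert y <= f + 1.5 * (1 - z)          # (6b)
assert y <= 0.5 * z                    # (6c)
assert 0 <= x1 <= 1 and 0 <= x2 <= 1 and y >= 0 and 0 <= z <= 1  # (6d)
assert not (y <= 0.5 * x2)             # the valid inequality y <= 0.5*x2 is violated
```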

Moreover, the integrality gap of (5) can be arbitrarily bad, even in fixed dimension \(\eta \).

Example 2

Fix \(\gamma \in {\mathbb {R}}_{\geqslant 0}\) and even \(\eta \in {\mathbb {N}}\). Take the affine function \(f(x) = \sum _{i=1}^\eta x_i\), the input domain \([L,U] = [-\gamma ,\gamma ]^\eta \), and \({\hat{x}} = \gamma \cdot (1,-1,\ldots ,1,-1)\) as a scaled vector of alternating positive and negative ones. We can check that \(({\hat{x}},{\hat{y}},{\hat{z}}) = ({\hat{x}},\frac{1}{2}\gamma \eta ,\frac{1}{2})\) is feasible for the LP relaxation of the big-M formulation (5). Additionally, \(f({\hat{x}}) = 0\), and for any \({\tilde{y}}\) such that \(({\hat{x}},{\tilde{y}}) \in {\text {Conv}}({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U]))\), then \({\tilde{y}} = 0\) necessarily. Therefore, there exists a fixed point \({\hat{x}}\) in the input domain where the tightest possible convex relaxation (for example, from a sharp formulation) is exact, but the big-M formulation deviates from this value by at least \(\frac{1}{2}\gamma \eta \).

Intuitively, this example suggests that the big-M formulation can be particularly weak around the boundary of the input domain, as it cares only about the value f(x) of the affine function, and not the particular input value x.

2.2 An ideal extended formulation

It is possible to produce an ideal extended formulation for the ReLU neuron by introducing a modest number of auxiliary continuous variables:

$$\begin{aligned} (x,y)&= (x^0,y^0) + (x^1,y^1) \end{aligned}$$
(7a)
$$\begin{aligned} y^0&= 0 \geqslant w \cdot x^0 + b(1-z) \end{aligned}$$
(7b)
$$\begin{aligned} y^1&= w \cdot x^1 + bz \geqslant 0 \end{aligned}$$
(7c)
$$\begin{aligned} L(1-z)&\leqslant x^0 \leqslant U(1-z) \end{aligned}$$
(7d)
$$\begin{aligned} Lz&\leqslant x^1 \leqslant Uz \end{aligned}$$
(7e)
$$\begin{aligned} z&\in \{0,1\}, \end{aligned}$$
(7f)

This is the standard “multiple choice” formulation for piecewise linear functions [75], which can also be derived from techniques due to Balas [8, 9].
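A minimal sketch of formulation (7) for a single ReLU neuron, again assuming gurobipy is available; here \(y^0\) is identically zero and is not represented explicitly, and the variable names are illustrative.

```python
# "Multiple choice" extended formulation (7) for one ReLU neuron over the box [L, U].
import gurobipy as gp
from gurobipy import GRB

def add_extended_relu(model, x, w, b, L, U):
    n = len(w)
    x0 = model.addVars(n, lb=-GRB.INFINITY, name="x0")  # copy of x active when z = 0
    x1 = model.addVars(n, lb=-GRB.INFINITY, name="x1")  # copy of x active when z = 1
    y = model.addVar(lb=0.0, name="y")                   # y = y^1 >= 0, since y^0 = 0
    z = model.addVar(vtype=GRB.BINARY, name="z")
    model.addConstrs((x[i] == x0[i] + x1[i] for i in range(n)))                       # (7a)
    model.addConstr(gp.quicksum(w[i] * x0[i] for i in range(n)) + b * (1 - z) <= 0)   # (7b)
    model.addConstr(y == gp.quicksum(w[i] * x1[i] for i in range(n)) + b * z)         # (7c)
    model.addConstrs((L[i] * (1 - z) <= x0[i] for i in range(n)))                     # (7d)
    model.addConstrs((x0[i] <= U[i] * (1 - z) for i in range(n)))
    model.addConstrs((L[i] * z <= x1[i] for i in range(n)))                           # (7e)
    model.addConstrs((x1[i] <= U[i] * z for i in range(n)))
    return y, z

model = gp.Model()
L, U = [0.0, 0.0], [1.0, 1.0]
x = model.addVars(2, lb=L, ub=U, name="x")
y, z = add_extended_relu(model, x, [1.0, 1.0], -1.5, L, U)
```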

Although the multiple choice formulation offers the tightest possible convex relaxation for a single neuron, it requires a copy \(x^0\) of the input variables [the copy \(x^1\) can be eliminated using Eq. (7a)]. This means that when this formulation is applied to every neuron in the network to formulate \(\texttt {NN}{}\), the total number of continuous variables required is \(m_0 + \sum _{i=1}^{s} (m_{i-1}+1)m_{i}\), where \(m_i\) is the number of neurons in layer i. In contrast, the big-M formulation requires only \(m_0 + \sum _{i=1}^s m_i\) continuous variables to formulate the entire network. As we will see in Sect. 6, the quadratic growth in size of the extended formulation can quickly become burdensome. Additionally, a folklore observation in the MIP community is that multiple choice formulations tend to not perform as well as expected in simplex-based branch-and-bound algorithms, likely due to degeneracy introduced by the block structure of the formulation [74].
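A quick check of these variable counts on an illustrative (hypothetical) architecture:

```python
# Continuous-variable counts for the big-M and extended (multiple choice) formulations.
def continuous_var_counts(m):
    """m = [m_0, m_1, ..., m_s]: input dimension followed by the layer widths."""
    big_m = m[0] + sum(m[1:])
    extended = m[0] + sum((m[i - 1] + 1) * m[i] for i in range(1, len(m)))
    return big_m, extended

print(continuous_var_counts([784, 100, 100, 10]))  # (994, 90394) for an MNIST-sized network
```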

2.3 An ideal MIP formulation without auxiliary continuous variables

In this work, our most broadly useful contribution is the derivation of an ideal MIP formulation for the ReLU nonlinearity over a box domain that is non-extended; that is, it does not require additional auxiliary variables as in formulation (7). We informally summarize this main result as follows.

Informal version of main result: the ReLU nonlinearity composed with an affine function over a box input domain admits an explicit, ideal MIP formulation that requires no auxiliary continuous variables; moreover, separation over this formulation can be performed in time linear in the input dimension.

We defer the formal statement and proof to Sect. 5.2, after we have derived the requisite machinery. This is also the main result of the extended abstract version of this work [4], where it is derived through alternative means.

3 Our general machinery: formulations for the maximum of d affine functions

We will state our main structural results in the following generic setting. Take the maximum operator \(\texttt {Max}{}(v_1,\ldots ,v_d) = \max _{i=1}^d v_i\) over d scalar inputs. We study the composition of this nonlinearity with d affine functions \(f^i : D \rightarrow {\mathbb {R}}\) with \(f^i(x) = w^i \cdot x + b^i\), all sharing some bounded polyhedral domain D:

$$\begin{aligned} S_{{\text {max}}}\mathop {=}\limits ^{\text {def}}{\text {gr}}(\texttt {Max}{} \circ (f^1,\ldots ,f^d); D) \equiv \left\{ (x,y) \in D \times {\mathbb {R}}\,\left| \, y = \max _{i=1}^d f^i(x)\right. \right\} . \end{aligned}$$

This setting subsumes the ReLU over box input domain presented in Sect. 2 as a special case with \(d=2\), \(f^2(x) = 0\), and \(D = [L,U]\). It also covers a number of other settings arising in modern deep learning, either by making D more complex (e.g. one-hot encodings for categorical features), or by increasing d (e.g. max pooling neurons often used in convolutional networks for image classification tasks [18], or in maxout networks [33]).

In this section, we present structural results that characterize the Cayley embedding [39, 73, 74, 76] of \(S_{{\text {max}}}\). Take the set

$$\begin{aligned} S_{{\text {cayley}}}\mathop {=}\limits ^{\text {def}}\bigcup _{k=1}^d \left\{ (x,f^k(x),\mathbf{e}^k) \,\left| \, x \in D_{|k}\right. \right\} , \end{aligned}$$

where \(\mathbf{e}^k\) is the unit vector where the k-th element is 1, and for each \(k \in [\![d ]\!]\),

$$\begin{aligned} D_{|k}&\mathop {=}\limits ^{\text {def}}\left\{ x \in D \left| k \in \arg \max _{\ell =1}^d f^\ell (x)\right. \right\} \equiv \left\{ x \in D \left| w^k \cdot x + b^k \geqslant w^\ell \cdot x + b^\ell \ \forall \ell \ne k \right. \right\} \end{aligned}$$

is the portion of the input domain D where \(f^k\) attains the maximum. The Cayley embedding of \(S_{{\text {max}}}\) is the convex hull of this set: \(R_{{\text {cayley}}}\mathop {=}\limits ^{\text {def}}{\text {Conv}}(S_{{\text {cayley}}})\).

The following straightforward observation follows directly from the definition.

Observation 1

The set \(R_{{\text {cayley}}}\) is a bounded polyhedron, and an ideal formulation for \(S_{{\text {max}}}\) is given by the system \(\left\{ (x,y,z) \in R_{{\text {cayley}}}\, \big |\, z \in \{0,1\}^d\right\} \).

Therefore, if we can produce an explicit inequality description for \(R_{{\text {cayley}}}\), we immediately have an ideal MIP formulation for \(S_{{\text {max}}}\). Indeed, we have already seen an extended representation in (7) when \(d=2\), which we now state in the more general setting (projecting out the superfluous copies of y).

Proposition 2

An ideal MIP formulation for \(S_{{\text {max}}}\) is:

$$\begin{aligned} (x,y)&= \sum _{k=1}^d ({\widetilde{x}}^k, w^k \cdot {\widetilde{x}}^k + b^kz_k) \end{aligned}$$
(8a)
$$\begin{aligned} {\widetilde{x}}^k&\in z_k \cdot D_{|k}&\quad&\forall k \in [\![d ]\!] \end{aligned}$$
(8b)
$$\begin{aligned} z&\in \varDelta ^d \end{aligned}$$
(8c)
$$\begin{aligned} z&\in \{0,1\}^d. \end{aligned}$$
(8d)

Denote its LP relaxation by \(R_{{\text {extended}}}= \left\{ (x,y,z,{\widetilde{x}}^1,\ldots ,{\widetilde{x}}^d) \,\big |\, (8a-8c)\right\} \). Then \({\text {Proj}}_{x,y,z}(R_{{\text {extended}}}) = R_{{\text {cayley}}}\).

Proof

Follows directly from results in [8, 9]. \(\square \)

Although this formulation is ideal and polynomially sized, it can exhibit poor practical performance, as noted in Sect. 2.2 and corroborated in the computational experiments in Sect. 6. Observe that, by definition, constraint (8b) for a given \(k \in [\![d ]\!]\) is equivalent to the set of constraints \({\widetilde{x}}^k \in z_k \cdot D\) and \(w^k \cdot {\widetilde{x}}^k + b^kz_k \geqslant w^\ell \cdot {\widetilde{x}}^k + b^\ell z_k\) for each \(\ell \ne k\).

3.1 A recipe for constructing ideal formulations

Our goal in this section is to derive generic tools that allow us to build ideal formulations for \(S_{{\text {max}}}\) via the Cayley embedding.

3.1.1 A primal characterization

Our first structural result provides a characterization for the Cayley embedding. Although it is not an explicit polyhedral characterization, we will subsequently see how it can be massaged into a more practically amenable form.

Take the system

$$\begin{aligned} y&\leqslant {\overline{g}}(x,z) \end{aligned}$$
(9a)
$$\begin{aligned} y&\geqslant \underline{g}(x,z) \end{aligned}$$
(9b)
$$\begin{aligned} (x,y,z)&\in D \times {\mathbb {R}} \times \varDelta ^d, \end{aligned}$$
(9c)

where

$$\begin{aligned} {\overline{g}}(x,z) \mathop {=}\limits ^{\text {def}}\max _{{\widetilde{x}}^1, \ldots , {\widetilde{x}}^d}\left\{ \sum _{k=1}^d w^k \cdot {\widetilde{x}}^k + b^k z_k \left| \begin{array}{cc} x = \sum _k {\widetilde{x}}^k &{} \\ {\widetilde{x}}^k \in z_k \cdot D_{|k} &{}\ \forall k \in [\![d ]\!]\end{array}\right. \right\} \\ \underline{g}(x,z) \mathop {=}\limits ^{\text {def}}\min _{{\widetilde{x}}^1, \ldots , {\widetilde{x}}^d}\left\{ \sum _{k=1}^d w^k \cdot {\widetilde{x}}^k + b^k z_k \left| \begin{array}{cc} x = \sum _k {\widetilde{x}}^k &{} \\ {\widetilde{x}}^k \in z_k \cdot D_{|k} &{}\ \forall k \in [\![d ]\!]\end{array}\right. \right\} , \end{aligned}$$

and define the set \(R_{{\text {ideal}}}\mathop {=}\limits ^{\text {def}}\left\{ (x,y,z) \big | (9)\right\} \). Note that \(R_{{\text {ideal}}}\) can be thought of as (8) with the \({\widetilde{x}}^k\) variables implicitly projected out.

Proposition 3

The set \(R_{{\text {ideal}}}\) is polyhedral, and \(R_{{\text {ideal}}}= R_{{\text {cayley}}}\).

Proof

By Proposition 2, it suffices to show that \(R_{{\text {ideal}}}= {\text {Proj}}_{x,y,z}(R_{{\text {extended}}})\). We start by observing that \({\overline{g}}\) (respectively \(\underline{g}\)) is concave (resp. convex) in its inputs, as it is the value function of a linear program (cf. [14, Theorem 5.1]). Therefore, the set of points satisfying (9) is convex.

Let \(({\hat{x}}, {\hat{y}}, {\hat{z}})\) be an extreme point of \(R_{{\text {ideal}}}\), which exists as \(R_{{\text {ideal}}}\) is convex. Then it must satisfy either (9a) or (9b) at equality, as otherwise it is a convex combination of \(({\hat{x}}, {\hat{y}} - \epsilon , {\hat{z}})\) and \(({\hat{x}}, {\hat{y}} + \epsilon , {\hat{z}})\) for some \(\epsilon > 0\). Take \({\widetilde{x}}^1, \ldots , {\widetilde{x}}^d\) that optimizes \({\overline{g}}(x,z)\) or \(\underline{g}(x,z)\), depending on which constraint is satisfied at equality. Then \(({\hat{x}}, {\hat{y}}, {\hat{z}}, {\widetilde{x}}^1, \ldots , {\widetilde{x}}^d) \in R_{{\text {extended}}}\). In other words, \(\text {ext}(R_{{\text {ideal}}}) \subseteq {\text {Proj}}_{x,y,z}(R_{{\text {extended}}})\), and thus \(R_{{\text {ideal}}}\subseteq {\text {Proj}}_{x,y,z}(R_{{\text {extended}}})\) by convexity.

Conversely, let \(({\hat{x}},{\hat{y}},{\hat{z}},{\widetilde{x}}^1,\ldots ,{\widetilde{x}}^d) \in R_{{\text {extended}}}\). Then \({\hat{y}} = \sum _{k=1}^d (w^k \cdot {\widetilde{x}}^k + b^k{\hat{z}}_k) \leqslant {\overline{g}}({\hat{x}}, {\hat{z}})\), as \((\{{\widetilde{x}}^k\}_{k=1}^d)\) is feasible for the optimization problem in \({\overline{g}}(x, z)\). Likewise, \({\hat{y}} \geqslant \underline{g}({\hat{x}}, {\hat{z}})\), and (9c) is trivially satisfied. Therefore, \(({\hat{x}}, {\hat{y}}, {\hat{z}}) \in R_{{\text {ideal}}}\).

Polyhedrality of \(R_{{\text {ideal}}}\) then follows as \(R_{{\text {cayley}}}\) is itself a polyhedron. \(\square \)

Note that the input domain constraint \(x \in D\) is implied by the constraints (9a)–(9b), and therefore does not need to be explicitly included in this description. However, we include it here for clarity. Moreover, observe that \({\overline{g}}\) is a function from \(D \times \varDelta ^d\) to \({\mathbb {R}} \cup \{-\infty \}\), since the optimization problem may be infeasible, but is always bounded from above since D is bounded. Likewise, \(\underline{g}\) is a function from \(D \times \varDelta ^d\) to \({\mathbb {R}} \cup \{+\infty \}\).

3.1.2 A dual characterization

From Proposition 3, we can derive a more useful characterization by applying Lagrangian duality to the LPs describing the envelopes \({\overline{g}}(x,z)\) and \(\underline{g}(x,z)\).

Proposition 4

The Cayley embedding \(R_{{\text {cayley}}}\) is equal to all (xyz) satisfying

$$\begin{aligned} y \leqslant {\overline{\alpha }} \cdot x + \sum _{k=1}^d \left( \max _{x^k \in D_{|k}}\{(w^k - {\overline{\alpha }}) \cdot x^k\} + b^k\right) z_k&\quad \forall {\overline{\alpha }} \in {\mathbb {R}}^\eta \end{aligned}$$
(10a)
$$\begin{aligned} y \geqslant \underline{\alpha } \cdot x + \sum _{k=1}^d \left( \min _{x^k \in D_{|k}}\{(w^k - \underline{\alpha }) \cdot x^k\} + b^k\right) z_k&\quad \forall \underline{\alpha } \in {\mathbb {R}}^\eta \end{aligned}$$
(10b)
$$\begin{aligned} (x,y,z) \in D \times {\mathbb {R}} \times \varDelta ^d.&\end{aligned}$$
(10c)

Proof

Consider the upper bound inequalities for y in Proposition 3, that is, (9a). Now apply the change of variables \(x^k \leftarrow \frac{{\widetilde{x}}^k}{z_k}\) for each \(k \in [\![d ]\!]\) and, for all (xz), take the Lagrangian dual of the optimization problem in \({\overline{g}}(x,z)\) with respect to the constraint \(x = \sum _k x^k z_k\). Note that the duality gap is zero since the problem is an LP. We then obtain that

$$\begin{aligned} {\overline{g}}(x,z)&= \min _{{\overline{\alpha }}} \max _{x^k \in D_{|k}} \sum _{k=1}^d \left( w^k \cdot x^k + b^k\right) z_k + {\overline{\alpha }} \cdot \left( x - \sum _{k=1}^d x^k z_k\right) \\&= \min _{{\overline{\alpha }}}\ {\overline{\alpha }} \cdot x + \sum _{k=1}^d \left( \max _{x^k \in D_{|k}}\{(w^k - {\overline{\alpha }}) \cdot x^k\} + b^k\right) z_k. \end{aligned}$$

In other words, we can equivalently express (9a) via the family of inequalities (10a). The same can be done with (9b), yielding the set of inequalities (10b). Therefore, \(\left\{ (x,y,z) \,\big |\, (10)\right\} = {\text {Proj}}_{x,y,z}(R_{{\text {extended}}}) = R_{{\text {cayley}}}\). \(\square \)

This gives an exact outer description for the Cayley embedding in terms of an infinite number of linear inequalities. Despite this, the formulation enjoys a simple, interpretable form: we can view the inequalities as choosing coefficients on x and individually tightening the coefficients on z according to explicitly described LPs. Similar to the relationship between (8) and (9), (10) can be seen as a simplification of the standard cut generation LP for unions of polyhedra [8, 9]. In later sections, we will see that this decoupling is helpful to simplify (a variant of) this formulation for special cases.

Separating a point \(({\hat{x}}, {\hat{y}}, {\hat{z}})\) over (10a) can be done by evaluating \({\overline{g}}({\hat{x}},{\hat{z}})\) in the form (10a) (and in the analogous form of \(\underline{g}({\hat{x}},{\hat{z}})\) for (10b)). As typically done when using Lagrangian relaxation, this optimization problem can be solved via a subgradient or bundle method, where each subgradient can be computed by solving the inner LPs in (10a) over \(x^k \in D_{|k}\) for each \(k \in [\![d ]\!]\). Observe that any feasible solution \({\overline{\alpha }}\) for the optimization problem in (10a) yields a valid inequality. However, this optimization problem is unbounded when \(({\hat{x}}, {\hat{z}}) \notin {\text {Proj}}_{x,z}(R_{{\text {cayley}}})\) (i.e. when the primal form of \({\overline{g}}({\hat{x}}, {\hat{z}})\) is infeasible). In other words, as illustrated in Fig. 3, \({\overline{g}}\) is an extended real valued function such that \({\overline{g}}({x}, {z})\in {\mathbb {R}}\cup \{-\infty \}\), so care must be taken to avoid numerical instabilities when separating a point \(({\hat{x}}, {\hat{z}})\) where \({\overline{g}}({\hat{x}}, {\hat{z}})=-\infty \).
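The following sketch illustrates this separation procedure, assuming SciPy's linprog, a domain given in inequality form \(D = \{x \,|\, Ax \leqslant c\}\), and an illustrative fixed step size and iteration limit (a bundle method or a more careful step-size rule would be used in practice).

```python
# Subgradient-based separation over the family (10a).
import numpy as np
from scipy.optimize import linprog

def max_over_D_k(k, obj, A, c, w, b):
    """Solve max { obj·x : x in D_{|k} }, where D_{|k} adds (w^l - w^k)·x <= b^k - b^l."""
    extra_A = np.array([w[l] - w[k] for l in range(len(w)) if l != k])
    extra_b = np.array([b[k] - b[l] for l in range(len(w)) if l != k])
    res = linprog(-obj, A_ub=np.vstack([A, extra_A]), b_ub=np.concatenate([c, extra_b]),
                  bounds=[(None, None)] * len(obj), method="highs")
    return -res.fun, res.x  # D_{|k} is nonempty under Assumption 2

def separate_10a(x_hat, y_hat, z_hat, A, c, w, b, steps=50, step=0.1, tol=1e-6):
    """Return (alpha, constants) of a violated inequality (10a), or None if none is found."""
    A, c, w, b = (np.asarray(v, float) for v in (A, c, w, b))
    x_hat, z_hat = np.asarray(x_hat, float), np.asarray(z_hat, float)
    d, alpha = len(w), np.zeros(len(x_hat))
    for _ in range(steps):
        vals, argmaxes = zip(*(max_over_D_k(k, w[k] - alpha, A, c, w, b) for k in range(d)))
        rhs = alpha @ x_hat + sum(z_hat[k] * (vals[k] + b[k]) for k in range(d))
        if y_hat > rhs + tol:
            return alpha, [vals[k] + b[k] for k in range(d)]  # y <= alpha·x + sum_k const_k z_k is violated
        # subgradient step on alpha (the right-hand side of (10a) is convex in alpha)
        alpha = alpha - step * (x_hat - sum(z_hat[k] * argmaxes[k] for k in range(d)))
    return None  # no violated inequality found within the iteration limit
```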

3.2 A recipe for constructing hereditarily sharp formulations

Although Proposition 4 gives a separation-based way to optimize over \(S_{{\text {max}}}\), there are two potential downsides to this approach. First, it does not give us an explicit, finite description for a MIP formulation that we can directly pass to a MIP solver. Second, the separation problem requires optimizing over \(D_{|k}\), which may be substantially more complicated than optimizing over D (for example, if D is a box).

Therefore, in this section we set our sights slightly lower and present a similar technique to derive sharp MIP formulations for \(S_{{\text {max}}}\). Furthermore, we will see that our formulations trivially satisfy the hereditary sharpness property. In the coming sections, we will see how we can deploy these results in a practical manner, and study settings in which the simpler sharp formulation will also, in fact, be ideal.

3.2.1 A primal characterization

Consider the system

$$\begin{aligned} y&\leqslant {\overline{h}}(x, z) \end{aligned}$$
(11a)
$$\begin{aligned} y&\geqslant w^k \cdot x + b^k\quad \forall k \in [\![d ]\!] \end{aligned}$$
(11b)
$$\begin{aligned} (x,y,z)&\in D \times {\mathbb {R}} \times \varDelta ^d, \end{aligned}$$
(11c)

where

$$\begin{aligned} {\overline{h}}(x, z) \mathop {=}\limits ^{\text {def}}\max _{{\widetilde{x}}^1, \ldots , {\widetilde{x}}^d}\left\{ \sum _{k=1}^d (w^k \cdot {\widetilde{x}}^k + b^k z_k) \,\left| \, \begin{array}{cc} x = \sum _k {\widetilde{x}}^k &{} \\ {\widetilde{x}}^k \in z_k \cdot D &{}\ \forall k \in [\![d ]\!]\end{array}\right. \right\} . \end{aligned}$$

Take the set \(R_{{\text {sharp}}}\mathop {=}\limits ^{\text {def}}\left\{ (x,y,z) \,\big |\, (11) \right\} \).

It is worth dwelling on the differences between the systems (9) and (11). First, we have completely replaced the constraint (9b) with d explicit linear inequalities (11b). Second, when replacing \({\overline{g}}\) with \({\overline{h}}\) we have replaced the inner maximization over \(D_{|k}\) with an inner maximization over D (modulo constant scaling factors). As we will see in Sect. 5.1, this is particularly advantageous when D is trivial to optimize over (for example, a simplex or a box), allowing us to write these constraints in closed form, whereas optimizing over \(D_{|k}\) may be substantially more difficult (i.e. requiring an LP solve).

Furthermore, we will show that while (11) is not ideal in general, it is hereditarily sharp, and so in general may offer a strictly stronger relaxation than a standard sharp formulation. In particular, the formulation may be stronger than a sharp formulation constructed by composing a big-M formulation, along with an exact convex relaxation in the (x, y)-space produced, for example, by studying the upper concave envelope of the function \(\texttt {Max}\circ (f^1,\ldots ,f^d)\).

Proposition 5

The system describing \(\left\{ (x,y,z) \in R_{{\text {sharp}}}\,\big |\, z \in \{0,1\}^d\right\} \) is a hereditarily sharp MIP formulation of \(S_{{\text {max}}}\).

Proof

For the result, we must show four properties: polyhedrality of \(R_{{\text {sharp}}}\), validity of the formulation whose LP relaxation is \(R_{{\text {sharp}}}\), sharpness, and then hereditary sharpness. We proceed in that order.

To show polyhedrality, consider a fixed value \(({\hat{x}},{\hat{y}},{\hat{z}})\) feasible for (11), and presume that we express the domain via the linear inequality constraints \(D = \left\{ x \,\big |\, Ax \leqslant c\right\} \). First, observe that due to (11a) and (11b), \({\overline{h}}({\hat{x}},{\hat{z}})\) is bounded from below. Now, using LP duality, we may rewrite

$$\begin{aligned} {\overline{h}}({\hat{x}},{\hat{z}})&= \max _{{\widetilde{x}}^1, \ldots , {\widetilde{x}}^d}\left\{ \sum _{k=1}^d w^k \cdot {\widetilde{x}}^k \left| \begin{array}{cc} {\hat{x}} = \sum _k {\widetilde{x}}^k &{} \\ A {\widetilde{x}}^k \leqslant {\hat{z}}_k c &{}\ \forall k \in [\![d ]\!]\end{array}\right. \right\} + \sum _{k=1}^d b^k {\hat{z}}_k \\&= \min _{(\alpha , \beta ^1, \ldots , \beta ^k) \in R} \left\{ \alpha \cdot {\hat{x}} + \sum _{k=1}^d c \cdot \beta ^k {\hat{z}}_k\right\} + \sum _{k=1}^d b^k{\hat{z}}_k, \end{aligned}$$

where R is a polyhedron that is independent of \({\hat{x}}\) and \({\hat{z}}\). Therefore, as (a) the above optimization problem is linear with \({\hat{x}}\) and \({\hat{z}}\) fixed, and (b) \({\overline{h}}({\hat{x}},{\hat{z}})\) is bounded from below, we may replace R with \({\text {ext}}(R)\) in the above optimization problem. In other words, \({\overline{h}}({\hat{x}},{\hat{z}})\) is equal to the minimum of a finite number of alternatives which are affine in \({\hat{x}}\) and \({\hat{z}}\). Therefore, \({\overline{h}}\) is a concave continuous piecewise linear function, and so \(R_{{\text {sharp}}}\) is polyhedral.

To show validity, we must have that \({\text {Proj}}_{x,y}\left( R_{{\text {sharp}}}\cap ({\mathbb {R}}^{\eta } \times {\mathbb {R}}\times \{0,1\}^d)\right) = S_{{\text {max}}}\). Observe that if \(({\hat{x}},{\hat{y}},{\hat{z}}) \in R_{{\text {sharp}}}\cap ({\mathbb {R}}^{\eta } \times {\mathbb {R}}\times \{0,1\}^d)\), then \({\hat{z}} = \mathbf{e}^\ell \) for some \(\ell \in [\![d ]\!]\), and

$$\begin{aligned} {\overline{h}}({\hat{x}},{\hat{z}})&= \max _{{\widetilde{x}}^1, \ldots , {\widetilde{x}}^d}\left\{ \sum _{k=1}^d w^k \cdot {\widetilde{x}}^k + b^\ell \left| \begin{array}{cc} {\hat{x}} = \sum _k {\widetilde{x}}^k &{} \\ {\widetilde{x}}^\ell \in D &{} \\ {\widetilde{x}}^k = \mathbf{0}^\eta &{}\forall k \ne \ell \end{array}\right. \right\} = w^\ell {\hat{x}} + b^\ell , \end{aligned}$$

where the first equality follows as \({\widetilde{x}}^k = \mathbf{0}^\eta \) for each \(k \ne \ell \) (recall that if D is bounded, then \(0 \cdot D = \left\{ x \in {\mathbb {R}}^\eta \,\big |\, Ax \leqslant 0\right\} = \{\mathbf{0}^\eta \}\)). Along with (11b), this implies that \({\hat{y}} = w^\ell \cdot {\hat{x}} + b^\ell \), and that \({\hat{y}} \geqslant w^k \cdot {\hat{x}} + b^k\) for each \(k \ne \ell \), giving the result.

To show sharpness, we must prove that \({\text {Proj}}_{x,y}(R_{{\text {sharp}}}) = {\text {Conv}}(S_{{\text {max}}})\). First, recall from Proposition 2 that \({\text {Conv}}(S_{{\text {max}}}) = {\text {Proj}}_{x,y}(R_{{\text {extended}}})\); thus, we state our proof in terms of \(R_{{\text {extended}}}\). We first show that \({\text {Proj}}_{x,y}(R_{{\text {extended}}}) \subseteq {\text {Proj}}_{x,y}(R_{{\text {sharp}}})\). Take \(({\hat{x}}, {\hat{y}}, {\hat{z}}, \{{\hat{x}}^k\}_{k=1}^d) \in R_{{\text {extended}}}\). Then \({\hat{y}} = \sum _{k=1}^d (w^k \cdot {\hat{x}}^k + b^k{\hat{z}}_k) \leqslant {\overline{h}}({\hat{x}}, {\hat{z}})\), as \((\{{\hat{x}}^k\}_{k=1}^d)\) is feasible for the optimization problem in \({\overline{h}}(x, z)\). It also holds that \({\hat{y}} \geqslant w^k \cdot {\hat{x}} + b^k\) for all \(k \in [\![d ]\!]\) and \({\hat{x}} \in D\) directly from the definition of \(S_{{\text {max}}}\), giving the result.

Next, we show that \({\text {Proj}}_{x,y}(R_{{\text {sharp}}}) \subseteq {\text {Proj}}_{x,y}(R_{{\text {extended}}})\). This proof is similar to the proof of Proposition 3, except that we choose z that simplifies the constraints. It suffices to show that \(\text {ext}({\text {Proj}}_{x,y}(R_{{\text {sharp}}})) \subseteq {\text {Proj}}_{x,y}(R_{{\text {extended}}})\). Let \(({\hat{x}}, {\hat{y}}) \in \text {ext}({\text {Proj}}_{x,y}(R_{{\text {sharp}}}))\). Define \({\overline{h}}(x) \mathop {=}\limits ^{\text {def}}\max _{z} \left\{ {\overline{h}}(x, z) \,\big |\, z \in \varDelta ^d\right\} \). Then either \(({\hat{x}}, {\hat{y}})\) satisfies \({\hat{y}} = {\overline{h}}({\hat{x}})\), or it satisfies (11b) at equality for some \(k \in [\![d ]\!]\), since otherwise \(({\hat{x}}, {\hat{y}})\) is a convex combination of the points \(({\hat{x}}, {\hat{y}} - \epsilon )\) and \(({\hat{x}}, {\hat{y}} + \epsilon )\) feasible for \({\text {Proj}}_{x,y}(R_{{\text {sharp}}})\) for some \(\epsilon > 0\). We show that in either case, \(({\hat{x}}, {\hat{y}}) \in {\text {Proj}}_{x,y}(R_{{\text {extended}}})\).

Case 1 Suppose that for some \(j \in [\![d ]\!]\), \(({\hat{x}}, {\hat{y}})\) satisfies the corresponding inequality in (11b) at equality; that is, \({\hat{y}} = w^j \cdot {\hat{x}} + b^j\). Then the point \(({\hat{x}}, {\hat{y}}, {\mathbf {e}}_j, \{{\hat{x}}^k\}_{k=1}^d) \in R_{{\text {extended}}}\), where \({\hat{x}}^j = {\hat{x}}\) and \({\hat{x}}^\ell = \mathbf{0}^\eta \) if \(\ell \ne j\). Hence, \(({\hat{x}}, {\hat{y}}) \in {\text {Proj}}_{x,y}(R_{{\text {extended}}})\).

Case 2 Suppose \(({\hat{x}}, {\hat{y}})\) satisfies \({\hat{y}} = {\overline{h}}({\hat{x}})\). Let \({\hat{z}}\) be an optimal solution for the optimization problem defining \({\overline{h}}({\hat{x}})\), and \(\{{\hat{x}}^k\}_{k=1}^d\) be an optimal solution for \({\overline{h}}({\hat{x}},{\hat{z}})\). By design, \(({\hat{x}}, {\hat{y}}, {\hat{z}}, \{{\hat{x}}^k\}_{k=1}^d)\) satisfies all constraints in \(R_{{\text {extended}}}\), except potentially constraint (8b).

We show that constraint (8b) is satisfied as well. Suppose not for contradiction; that is, \(w^k \cdot {\hat{x}}^k + b^k {\hat{z}}_k < w^\ell \cdot {\hat{x}}^k + b^\ell {\hat{z}}_k\) for some pair \(k, \ell \in [\![d ]\!], \ell \ne k\). Consider the solution \((\{{\overline{x}}^k\}_{k=1}^d, {\overline{z}})\) identical to \((\{{\hat{x}}^k\}_{k=1}^d, {\hat{z}})\) except that \({\overline{z}}_k = 0\), \({\overline{x}}^k = \mathbf{0}^\eta \), \({\overline{z}}_\ell = {\hat{z}}_\ell + {\hat{z}}_k\), and \({\overline{x}}^\ell = {\hat{x}}^\ell + {\hat{x}}^k\). By inspection, this solution is feasible for \({\overline{h}}({\hat{x}})\). The objective value changes by \(- (w^k \cdot {\hat{x}}^k + b^k {\hat{z}}_k) + (w^\ell \cdot {\hat{x}}^k + b^\ell {\hat{z}}_k) > 0\), contradicting the optimality of \((\{{\hat{x}}^k\}_{k=1}^d, {\hat{z}})\). Therefore, \(({\hat{x}}, {\hat{y}}, {\hat{z}}, \{{\hat{x}}^k\}_{k=1}^d) \in R_{{\text {extended}}}\), and thus \(({\hat{x}}, {\hat{y}}) \in {\text {Proj}}_{x,y}(R_{{\text {extended}}})\).

Finally, we observe that hereditary sharpness follows from the definition of \({\overline{h}}\). In particular, fixing any \(z_k = 0\) implies that \({\widetilde{x}}^k = \mathbf{0}^\eta \) in the maximization problem defining \({\overline{h}}\). In other words, the variables \({\widetilde{x}}^k\) and \(z_k\) drop completely from the maximization problem defining \({\overline{h}}\), meaning that it is equal to the corresponding version of \({\overline{h}}\) with the function k completely dropped as input. Additionally, if any \(z_k = 1\), then \(z_\ell = 0\) for each \(\ell \ne k\) since \(z \in \varDelta ^d\), and hence \({\widetilde{x}}^\ell = \mathbf{0}^\eta \). In this case, \({\overline{h}}(x,z) = w^k \cdot x + b^k\), which gives the result. \(\square \)

3.2.2 A dual characterization

We can apply a duality-based approach to produce an (albeit infinite) linear inequality description for the set \(R_{{\text {sharp}}}\), analogous to Sect. 3.1.2.

Proposition 6

The set \(R_{{\text {sharp}}}\) is equal to all (xyz) such that

$$\begin{aligned}&y \leqslant \alpha \cdot x + \sum _{k=1}^d \left( \max _{x^k \in D}\{(w^k - \alpha ) \cdot x^k\} + b^k\right) z_k\quad \forall \alpha \in {\mathbb {R}}^\eta \end{aligned}$$
(12a)
$$\begin{aligned}&y \geqslant w^k \cdot x + b^k\quad \forall k \in [\![d ]\!] \end{aligned}$$
(12b)
$$\begin{aligned}&(x,y,z) \in D \times {\mathbb {R}} \times \varDelta ^d. \end{aligned}$$
(12c)

Proof

Follows in an analogous manner as in Proposition 4. \(\square \)
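As an illustration of the computational advantage of replacing \(D_{|k}\) with D: when D is a box \([L,U]\), the inner maximization in (12a) decomposes coordinate-wise, so for any fixed \(\alpha \) the corresponding inequality can be written down explicitly. A small NumPy sketch (function names are illustrative):

```python
# Closed-form constants of the inequality (12a) over a box domain [L, U], for a fixed alpha.
import numpy as np

def sharp_inequality_constants(alpha, w, b, L, U):
    """Return constants kappa_k so that (12a) for this alpha reads y <= alpha·x + sum_k kappa_k z_k."""
    alpha, L, U = (np.asarray(v, float) for v in (alpha, L, U))
    constants = []
    for wk, bk in zip(w, b):
        r = np.asarray(wk, float) - alpha
        # max_{x in [L, U]} r·x is attained coordinate-wise at L_i or U_i
        constants.append(float(np.sum(np.maximum(r * L, r * U))) + bk)
    return constants
```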

Fig. 3

Examples of the functions \({\overline{g}}(x,z)\), \({\overline{h}}(x,z)\), and \(\underline{g}(x,z)\) defined in (9) and (11) with some fixed value for z. Here, g(x) is the “maximum” function of interest, defined below each subfigure. Panels (a, c) depict the case \(d = 2\) (maximum of two affine functions), while (b, d) illustrate the case \(d = 3\), each pair emphasizing upper and lower bounds, respectively. Note that \({\overline{g}}(x,z)\) and \({\overline{h}}(x,z)\) coincide in (a) for \(x \in [1,2.5]\), and in (b) for \(x \in [2,2.5]\). The thick solid lines represent \({\overline{g}}(x,z)\) in (a, b) and \(\underline{g}(x,z)\) in (c, d), whereas the dashed lines correspond to \({\overline{h}}(x,z)\). The thin solid lines represent \(S_{{\text {max}}}\), and the shaded region is the slice of \(R_{{\text {cayley}}}\) with \(z={\hat{z}}\).

Figure 3 depicts slices of the functions \({\overline{g}}(x,z)\), \(\underline{g}(x,z)\), and \({\overline{h}}(x,z)\), created by fixing some value of z and varying x. Observe that \({\overline{g}}(x,z)\) can be viewed as the largest value for y such that (x, y) can be written as a convex combination of points in the graph using convex multipliers z. Likewise, \(\underline{g}(x,z)\) can be interpreted as the minimum value for y. In \({\overline{h}}(x,z)\), we relax \(D_{|k}\) to D, and thus we can interpret it similarly to \({\overline{g}}(x,z)\), except that we may take convex combinations of points constructed by evaluating the affine functions at any point in the domain, not only those where the given function attains the maximum. Figure 3b shows that, in general, \({\overline{h}}(x,z)\) can be strictly looser than \({\overline{g}}(x,z)\) for \((x,z) \in {\text {Proj}}_{x,z}(R_{{\text {cayley}}})\). A similar situation occurs for the lower envelopes as illustrated by Fig. 3d. However, we prove in the next section that this does not occur for \(d = 2\), along with other desirable properties in special cases.

4 Simplifications to our machinery under common special cases

In this section we study how our dual characterizations in Propositions 4 and 6 simplify under common special cases of the number of affine functions d and the structure of the input domain D.

4.1 Simplifications when \(d=2\)

When we consider taking the maximum of only two affine functions (i.e. \(d=2\)), we can prove that \(R_{{\text {sharp}}}\) is, in fact, ideal.

We start by returning to Proposition 3 and show that it can be greatly simplified when \(d = 2\). We first show that \(\underline{g}(x,z)\) can be replaced by the maximum of the affine functions, as illustrated in Fig. 3c, although this is not possible for \(d > 2\), as seen in Fig. 3d.

Lemma 1

If \(d=2\), then at any values of (xz) where \({\overline{g}}(x,z)\ge \underline{g}(x,z)\) (i.e. there exists a y such that \((x,y,z) \in R_{{\text {cayley}}}\)), we have

$$\begin{aligned} \underline{g}(x,z) = \max \{w^1 \cdot x + b^1, w^2 \cdot x + b^2\}. \end{aligned}$$

Proof

For convenience, we work with the change of variables \(x^k \leftarrow \frac{{\widetilde{x}}^k}{z_k}\) for each \(k \in [\![2 ]\!]\). Suppose without loss of generality that \(x\in D_{|2}\), and consider any feasible solution \((x^1,x^2)\) to the optimization problem for \(\underline{g}(x,z)\); this problem is feasible by the assumption that \((x,z) \in {\text {Proj}}_{x,z}(R_{{\text {cayley}}})\). We will show that \((w^1\cdot x^1+b^1)z_1+(w^2\cdot x^2+b^2)z_2\ge w^2\cdot x+b^2\). We assume that \(z_2>0\), since otherwise \(x=x^1\in D_{|1}\cap D_{|2}\) and the result is immediate.

Since \(x=x^1z_1+x^2z_2\) and \(z \in \varDelta ^2\), the line segment joining \(x^1\) to \(x^2\) contains x. Furthermore, since \(x^1\in D_{|1}\) and \(x^2\in D_{|2}\), this line segment also intersects the hyperplane \(\left\{ {\hat{x}}\in {\mathbb {R}}^\eta \big | w^1 \cdot {\hat{x}}+b^1=w^2 \cdot {\hat{x}}+b^2\right\} \). Let \({\hat{x}}^1\) denote this point of intersection, and let \({\hat{z}}^1\in \varDelta ^2\) be such that \({\hat{x}}^1=x^1{\hat{z}}^1_1+x^2{\hat{z}}^1_2\). Since \(x\in D_{|2}\), we know that \({\hat{x}}^1\) is closer to \(x^1\) than x, i.e. \({\hat{z}}^1_1 \ge z_1\). Moreover, take the point \({\hat{x}}^2\) on this line segment such that \(x={\hat{x}}^1z_1+{\hat{x}}^2z_2\), where \({\hat{x}}^2=x^1{\hat{z}}^2_1+x^2{\hat{z}}^2_2\) for some \({\hat{z}}^2 \in \varDelta ^2\). We have \({\hat{z}}^2_1 \le z_1\) since \({\hat{x}}^2\) is further away from \(x^1\) than x. Note that \({\hat{x}}^1\in D_{|1}\cap D_{|2}\) while \({\hat{x}}^2\in D_{|2}\), and thus \(({\hat{x}}^1, {\hat{x}}^2)\) is feasible.

It can be computed that \({\hat{z}}^2_1 = z_1\cdot \frac{{\hat{z}}^1_2}{z_2}\), which implies that \(z_1=z_1({\hat{z}}^1_1+{\hat{z}}^1_2)=z_1{\hat{z}}^1_1+z_2{\hat{z}}^2_1\) and \(z_2=z_2({\hat{z}}^2_1+{\hat{z}}^2_2)=z_1{\hat{z}}^1_2+z_2{\hat{z}}^2_2\). Using these two identities, we obtain

$$\begin{aligned}&(w^1\cdot x^1+b^1)z_1+(w^2\cdot x^2+b^2)z_2\\&\quad = \ (w^1\cdot x^1+b^1)(z_1{\hat{z}}^1_1+z_2{\hat{z}}^2_1)+(w^2\cdot x^2+b^2)(z_1{\hat{z}}^1_2+z_2{\hat{z}}^2_2) \\&\quad = \ ((w^1\cdot x^1+b^1){\hat{z}}^1_1+(w^2\cdot x^2+b^2){\hat{z}}^1_2)z_1 \\&\qquad + ((w^1\cdot x^1+b^1){\hat{z}}^2_1+(w^2\cdot x^2+b^2){\hat{z}}^2_2)z_2 \\&\quad = \ (f(x^1){\hat{z}}^1_1+f(x^2){\hat{z}}^1_2)z_1+(f(x^1){\hat{z}}^2_1+f(x^2){\hat{z}}^2_2)z_2, \end{aligned}$$

where we let \(f({\hat{x}})\) denote the function \(\max \{w^1\cdot {\hat{x}}+b^1,w^2\cdot {\hat{x}}+b^2\}\), recalling that \(x^1\in D_{|1}\) and \(x^2\in D_{|2}\). Since \(f({\hat{x}})\) is convex, by Jensen’s inequality the preceding expression is at least \(f(x^1{\hat{z}}^1_1+x^2{\hat{z}}^1_2)z_1+f(x^1{\hat{z}}^2_1+x^2{\hat{z}}^2_2)z_2\). The preceding expression equals \((w^2\cdot {\hat{x}}^1+b^2)z_1+(w^2\cdot {\hat{x}}^2+b^2)z_2\) by the definitions of \({\hat{x}}^1\) and \({\hat{x}}^2\), and the fact that they both lie in \(D_{|2}\). Recalling the equation \(x={\hat{x}}^1z_1+{\hat{x}}^2z_2\) used to select \({\hat{x}}^2\) completes the proof. \(\square \)

Moreover, we show that when \(d=2\) we can replace \({\overline{g}}\) with \({\overline{h}}\), as illustrated in Fig. 3a. This property may not hold when \(d > 2\) as shown in Fig. 3b.

Lemma 2

If \(d = 2\), then at any values of (xz) where \({\overline{g}}(x,z)\ge \underline{g}(x,z)\) (i.e. there exists a y such that \((x,y,z) \in R_{{\text {cayley}}}\)), we have

$$\begin{aligned} {\overline{g}}(x,z) = \max _{{\widetilde{x}}^1, {\widetilde{x}}^2}\left\{ w^1 \cdot {\widetilde{x}}^1 + b^1 z_1 + w^2 \cdot {\widetilde{x}}^2 + b^2 z_2 \left| \begin{array}{l} x = {\widetilde{x}}^1 + {\widetilde{x}}^2 \\ {\widetilde{x}}^1 \in z_1 \cdot D \\ {\widetilde{x}}^2 \in z_2 \cdot D \end{array}\right. \right\} . \end{aligned}$$
(13)

Proof

We show that despite expanding the feasible set of the optimization problem in (13) by replacing \(D_{|k}\) by D, its optimal value is no greater when (xz) is such that \({\overline{g}}(x,z)\ge \underline{g}(x,z)\). It suffices to show without loss of generality that \(w^1 \cdot {\widetilde{x}}^1 + b^1 z_1 \geqslant w^2 \cdot {\widetilde{x}}^1 + b^2 z_1\) holds for any optimal \({\widetilde{x}}^1, {\widetilde{x}}^2\). By the assumption on (xz) and Lemma 1, we have \({\overline{g}}(x,z) \geqslant w^2 \cdot x + b^2\), which implies the existence of some optimal \({\widetilde{x}}^1, {\widetilde{x}}^2\). That is, we have \(w^2 \cdot (x - {\widetilde{x}}^1) + b^2 z_2 + w^1 \cdot {\widetilde{x}}^1 + b^1 z_1 \geqslant w^2 \cdot x + b^2\), which is equivalent to \(w^1 \cdot {\widetilde{x}}^1 + b^1 z_1 \geqslant w^2 \cdot {\widetilde{x}}^1 + b^2 z_1\). \(\square \)

After observing that these simplifications are identical to those presented in Proposition 6, we obtain the following corollary promised at the beginning of the section.

Corollary 1

When \(d=2\), \(\left\{ (x,y,z) \in R_{{\text {sharp}}}\,\big |\, z \in \{0,1\}^d\right\} \) is an ideal MIP formulation of \(S_{{\text {max}}}\).

Proof

Lemmas 1 and 2 imply that \(R_{{\text {sharp}}}= R_{{\text {ideal}}}\), while Proposition 3 implies that \(R_{{\text {ideal}}}= R_{{\text {cayley}}}\), completing the chain and giving the result. \(\square \)

In later sections, we will study conditions under which we can produce an explicit inequality description for \(R_{{\text {sharp}}}\).

4.2 Simplifications when D is the product of simplices

In this section, we consider another important special case: when the input domain is the Cartesian product of simplices. Indeed, the box domain case introduced in Sect. 2 can be viewed as a product of two-dimensional simplices, and we will also see in Sect. 5.3 that this structure naturally arises in machine learning settings with categorical or discrete features.

When D is the product of simplices, we can derive a finite representation for the set (12) [i.e. a finite representation for the infinite family of linear inequalities (12a)] through an elegant connection with the transportation problem. To do so, we introduce the following notation.

Definition 2

Suppose the input domain is \(D = \prod _{i=1}^\tau \varDelta ^{p_i}\), with \(p_1+\cdots +p_{\tau }=\eta \). For notational simplicity, we re-organize the indices of x and refer to its entries via \(x_{i,j}\), where \(i\in [\![\tau ]\!]\) is the simplex index, and \(j\in [\![p_i]\!]\) refers to the coordinate within simplex i. The domain for x is then

$$\begin{aligned} D=\left\{ ((x_{i,j})_{j=1}^{p_i})_{i=1}^{\tau } \,\big |\, (x_{i,j})_{j=1}^{p_i} \in \varDelta ^{p_i} \quad \forall i \in [\![\tau ]\!]\right\} , \end{aligned}$$
(14)

where the rows of x correspond to each simplex. Correspondingly, we re-index the weights of the affine functions so that for each \(k\in [\![d]\!]\), we have \(f^k(x)=\sum _{i=1}^{\tau }\sum _{j=1}^{p_i}w^k_{i,j}x_{i,j}+b^k\).

Using the notation from Definition 2, constraints (12a) can be written as

$$\begin{aligned} y\leqslant \sum _{i=1}^{\tau }\sum _{j=1}^{p_i}\alpha _{i,j}x_{i,j} + \sum _{k=1}^d \left( \max _{x^k \in D}\sum _{i=1}^{\tau }\sum _{j=1}^{p_i}(w^k_{i,j} - \alpha _{i,j})x^k_{i,j} + b^k\right) z_k\quad \forall \alpha \in {\mathbb {R}}^{\eta }. \end{aligned}$$

Since D is a product of simplices, the maximization over \(x^k\in D\) appearing in the right-hand side above is separable over each simplex \(i \in [\![\tau ]\!]\). Moreover, for each simplex i, the maximum value of \(\sum _{j=1}^{p_i}(w^k_{i,j} - \alpha _{i,j})x^k_{i,j}\), subject to the constraint \(x^k \in D\), is attained at a vertex of \(\varDelta ^{p_i}\), i.e. when \(x^k_{i,j}=1\) for some \(j\in [\![p_i]\!]\). Therefore, the family of constraints (12a) is equivalent to

$$\begin{aligned} y&\leqslant \min _{\alpha }\left( \sum _{i=1}^{\tau }\sum _{j=1}^{p_i}\alpha _{i,j}x_{i,j} + \sum _{k=1}^d \left( \sum _{i=1}^{\tau }\max _{j=1}^{p_i}(w^k_{i,j} - \alpha _{i,j}) + b^k\right) z_k\right) \nonumber \\&=\sum _{i=1}^{\tau }\min _{\alpha _{i,1},\ldots ,\alpha _{i,p_i}}\left( \sum _{j=1}^{p_i}\alpha _{i,j}x_{i,j} + \sum _{k=1}^dz_k\cdot \max _{j=1}^{p_i}(w^k_{i,j} - \alpha _{i,j})\right) +\sum _{k=1}^db^kz_k. \end{aligned}$$
(15)

We show that the minimization problem in (15), for any i, is equivalent to a transportation problem defined as follows.

Definition 3

For any values \(x \in \varDelta ^p\) and \(z \in \varDelta ^d\), and arbitrary weights \(w^k_j\in {\mathbb {R}}\) for all \(j\in [\![p]\!]\) and \(k\in [\![d ]\!]\), define the max-weight transportation problem to be

$$\begin{aligned} {\mathsf {Transport}}(x,z;w^1,\ldots ,w^d)\mathop {=}\limits ^{\text {def}}\max _{\beta \geqslant 0}\left\{ \sum _{j=1}^{p}\sum _{k=1}^{d}w^k_j\beta ^k_j\;\left| \;\sum _{k=1}^{d}\beta ^k_j =x_j \quad \forall j\in [\![p]\!],\quad \sum _{j=1}^{p}\beta ^k_j =z_k \quad \forall k\in [\![d]\!]\right. \right\} . \end{aligned}$$

In the transportation problem, since \(\sum _jx_j=1=\sum _kz_k\), it follows that \(\beta ^k_j \in [0,1]\), and so this value can be interpreted as the percent of total flow “shipped” between j and k. The relation to (15) is now established through LP duality.
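To make the definition concrete, the following sketch evaluates \({\mathsf {Transport}}\) directly as a linear program. It is purely illustrative: the function name, the dense matrix encoding, and the use of scipy are our own choices and are not part of the formulations developed here.

```python
# A minimal sketch: solve the max-weight transportation problem of Definition 3 as an LP.
import numpy as np
from scipy.optimize import linprog

def transport(x, z, W):
    """Compute Transport(x, z; w^1, ..., w^d) where W[k, j] = w^k_j."""
    d, p = W.shape
    # Decision variables beta[k, j] >= 0, flattened row-major (index k * p + j).
    c = -W.reshape(-1)              # linprog minimizes, so negate to maximize
    A_eq, b_eq = [], []
    for j in range(p):              # sum_k beta[k, j] = x_j
        row = np.zeros(d * p)
        row[j::p] = 1.0
        A_eq.append(row); b_eq.append(x[j])
    for k in range(d):              # sum_j beta[k, j] = z_k
        row = np.zeros(d * p)
        row[k * p:(k + 1) * p] = 1.0
        A_eq.append(row); b_eq.append(z[k])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (d * p), method="highs")
    assert res.success
    return -res.fun
```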

Proposition 7

For any fixed \(x \in \varDelta ^{p}\) and \(z \in \varDelta ^d\),

$$\begin{aligned} \min _{\alpha } \left( \sum _{j=1}^{p}\alpha _{j}x_{j} + \sum _{k=1}^dz_k\cdot \max _{j=1}^p(w^k_{j} - \alpha _{j}) \right) = {\mathsf {Transport}}(x,z;w^1,\ldots ,w^d). \end{aligned}$$
(16)

Therefore, when D is a product of simplices, the constraints (12a) can be replaced in (12) with the single inequality

$$\begin{aligned} y\le \sum _{i=1}^{\tau }{\mathsf {Transport}}\left( (x_{i,j})_{j=1}^{p_i},z;(w^1_{i,j})_{j=1}^{p_i},\ldots ,(w^d_{i,j})_{j=1}^{p_i}\right) +\sum _{k=1}^db^kz_k. \end{aligned}$$
(17)

Proof

By using a variable \(\gamma _{k}\) to model the value of \(\max _{j=1}^{p}(w^k_{j} - \alpha _{j})\) for each \(k\in [\![d]\!]\), the minimization problem on the LHS of (16) is equivalent to

$$\begin{aligned} \min _{\alpha ,\gamma }\left\{ \sum _{j=1}^{p}\alpha _{j}x_{j}+\sum _{k=1}^d\gamma _{k}z_k \;\left| \; \gamma _{k} \ge w^k_{j}-\alpha _{j} \quad \forall k\in [\![d]\!],j\in [\![p]\!]\right. \right\} \end{aligned}$$

which is a minimization LP with free variables \(\alpha _j\) and \(\gamma _k\). Its LP dual has a nonnegative variable \(\beta ^k_j\) for each constraint \(\gamma _{k} \ge w^k_{j}-\alpha _{j}\), equality constraints \(\sum _{k=1}^d\beta ^k_j=x_j\) for all \(j\in [\![p]\!]\) (from the free \(\alpha _j\)) and \(\sum _{j=1}^p\beta ^k_j=z_k\) for all \(k\in [\![d]\!]\) (from the free \(\gamma _k\)), and objective \(\max \sum _{j=1}^p\sum _{k=1}^dw^k_j\beta ^k_j\); this is precisely the max-weight transportation problem of Definition 3, proving Eq. (16). The inequality (17) then arises by substituting Eq. (16) into (15), for every simplex \(i=1,\ldots ,\tau \). \(\square \)

4.3 Simplifications when both \(d=2\) and D is the product of simplices

Proposition 7 shows that, when the input domain is a product of simplices, the tightest upper bound on y can be computed through a series of transportation problems. We now leverage the fact that if either side of the transportation problem from Definition 3 (i.e. p or d) has only two entities, then it reduces to a simpler fractional knapsack problem. Later, this will allow us to represent (15), in either of the cases \(d=2\) or \(p_1=\cdots =p_{\tau }=2\), using an explicit finite family of linear inequalities in x and z which has a greedy linear-time separation oracle.

Proposition 8

Given data \(w^1, w^2 \in {\mathbb {R}}^p\), take \({\widetilde{w}}_j = w^1_j-w^2_j\) for all \(j\in [\![p]\!]\), and suppose the indices have been sorted so that \({\widetilde{w}}_1\le \cdots \le {\widetilde{w}}_p\). Then

$$\begin{aligned} {\mathsf {Transport}}(x,z;w^1,w^2)=\min _{J=1}^{p}\left( {\widetilde{w}}_Jz_1+\sum _{j=J+1}^{p}({\widetilde{w}}_j-{\widetilde{w}}_J)x_j\right) +\sum _{j=1}^{p}w^2_jx_j. \end{aligned}$$
(18)

Moreover, a \(J \in [\![p ]\!]\) that attains the minimum in the right-hand side of (18) can be found in \({\mathcal {O}}(p)\) time.

Proof

When \(d=2\), the transportation problem becomes

$$\begin{aligned} \max _{\beta ^1,\beta ^2 \geqslant 0}\left\{ \sum _{j=1}^p(w^1_j\beta ^1_j+w^2_j\beta ^2_j)\;\left| \;\beta ^1_j+\beta ^2_j =x_j \quad \forall j\in [\![p]\!],\quad \sum _{j=1}^p\beta ^1_j =z_1\right. \right\} \end{aligned}$$
(19)

where the constraint \(\sum _{j=1}^p\beta ^2_j=z_2\) is implied because \(\sum _{j=1}^p\beta ^2_j=\sum _{j=1}^p(x_j-\beta ^1_j)=1-\sum _{j=1}^p\beta ^1_j=1-z_1=z_2\). Substituting \(\beta ^2_j=x_j-\beta ^1_j\) for all \(j \in [\![p ]\!]\) and then omitting the superscript “1”, (19) becomes

$$\begin{aligned} \max _{\beta }\left\{ \sum _{j=1}^p(w^1_j-w^2_j)\beta _j+\sum _{j=1}^pw^2_jx_j\;\left| \;\sum _{j=1}^p\beta _j =z_1,\quad 0\le \beta _j \le x_j \quad \forall j\in [\![p]\!]\right. \right\} \end{aligned}$$

which is a fractional knapsack problem that can be solved greedily.

An optimal solution to the fractional knapsack LP above, assuming the sorting \(w^1_1-w^2_1\le \cdots \le w^1_p-w^2_p\), can be computed greedily. Let \(J\in [\![p]\!]\) be the maximum index at which \(\sum _{j=J}^px_j\ge z_1\). We set \(\beta _j = 0\) for all \(j < J\), \(\beta _J = z_1 - \sum _{j=J+1}^p x_j\), and \(\beta _j = x_j\) for each \(j > J\). The optimal cost is

$$\begin{aligned} \sum _{j=J+1}^p(w^1_j-w^2_j)x_j+(w^1_J-w^2_J)\left( z_1-\sum _{j=J+1}^px_j\right) +\sum _{j=1}^pw^2_jx_j, \end{aligned}$$

which yields the desired expression after substituting \({\widetilde{w}}_j=w^1_j-w^2_j\) for all \(j\ge J\). Moreover, the index J above can be found in \({\mathcal {O}}(p)\) time by storing a running total for \(\sum _{j=J}^p x_j\), completing the proof. \(\square \)

Observe that the \({\mathcal {O}}(p)\) runtime in Proposition 8 is not immediate: a naïve implementation that recomputes the inner sum from scratch for each candidate J would run in \({\mathcal {O}}(p^2)\) time, since each inner sum takes linear time to evaluate.
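The greedy evaluation behind Proposition 8 is short enough to state in code. The sketch below assumes the indices are already sorted so that \(w^1_j - w^2_j\) is nondecreasing in j; all identifiers are ours.

```python
def transport_two_functions(x, z1, w1, w2):
    """Evaluate Transport(x, z; w^1, w^2) via the right-hand side of (18) in O(p)."""
    p = len(x)
    wt = [w1[j] - w2[j] for j in range(p)]      # sorted nondecreasing by assumption
    base = sum(w2[j] * x[j] for j in range(p))  # constant term sum_j w^2_j x_j
    # Scan J = p, ..., 1 (0-based below), maintaining suffix = sum_{j>J} x_j and
    # weighted = sum_{j>J} wt[j] * x[j]; keep the smallest bracketed value in (18).
    suffix, weighted = 0.0, 0.0
    best = float("inf")
    for J in reversed(range(p)):
        value = wt[J] * z1 + (weighted - wt[J] * suffix)
        best = min(best, value)
        suffix += x[J]
        weighted += wt[J] * x[J]
    return best + base
```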

Combining Propositions 7 and 8 immediately yields the following result.

Corollary 2

Suppose that \(d=2\) and that D is a product of simplices. Let \(z = z_1 \equiv 1-z_2\). For each simplex \(i \in [\![\tau ]\!]\), take \({\widetilde{w}}_{i,j} = w^1_{i,j}-w^2_{i,j}\) for all \(j=1,\ldots ,p_i\) and relabel the indices so that \({\widetilde{w}}_{i,1}\le \cdots \le {\widetilde{w}}_{i,p_i}\). Then, in the context of (12), the upper-bound constraints (12a) are equivalent to

$$\begin{aligned}&y\le \sum _{i=1}^{\tau }\left( {\widetilde{w}}_{i,J(i)}z{+}\sum _{j=J(i){+}1}^{p_i}({\widetilde{w}}_{i,j}{-}{\widetilde{w}}_{i,J(i)})x_{i,j}{+}\sum _{j=1}^{p_i}w^2_{i,j}x_{i,j}\right) +(b^1-b^2)z+b^2 \nonumber \\&\quad \quad \quad \forall \text { mappings } J:[\![\tau ]\!]\rightarrow {\mathbb {Z}}\text { with }J(i)\in [\![p_i]\!]\ \forall i\in [\![\tau ]\!]. \end{aligned}$$
(20)

Moreover, given any point \((x,y,z)\in D\times {\mathbb {R}}\times [0,1]\), feasibility can be verified or a most violated constraint can be found in \({\mathcal {O}}(p_1+\cdots +p_{\tau })\) time.

Corollary 2 gives an explicit finite family of linear inequalities equivalent to (12). Moreover, we have already shown in Corollary 1 that \(R_{{\text {sharp}}}\) yields an ideal formulation when \(d=2\). Hence, we have an ideal nonextended formulation whose exponentially many inequalities can be separated in \({\mathcal {O}}(p_1+\cdots +p_{\tau })\) time, after an initial sorting step requiring \({\mathcal {O}}(p_1\log p_1+\cdots +p_{\tau }\log p_{\tau })\) time. Note that this sorting can be avoided entirely: we may instead solve each fractional knapsack problem arising in the separation as a weighted median problem, in \({\mathcal {O}}(p_1+\cdots +p_{\tau })\) total time [46, Chapter 17.1].

We can also show that none of the constraints in (20) are redundant.

Proposition 9

Consider the polyhedron P defined as the intersection of all halfspaces corresponding to the inequalities (20). Consider some arbitrary mapping \(J : [\![\tau ]\!]\rightarrow {\mathbb {Z}}\) with \(J(i) \in [\![p_i ]\!]\) for each \(i \in [\![\tau ]\!]\). Then the inequality in (20) for J is irredundant with respect to P. That is, removing the halfspace corresponding to mapping J (note that this halfspace could also correspond to other mappings) from the description of P will strictly enlarge the feasible set.

Proof

Fix a mapping J. Consider the feasible points with \(z=1/2\), and \(x_{i,j}= \mathbb {1}[j = J(i)]\) for each i and j. At such points, the constraint corresponding to an arbitrary mapping \(J'\) in (20) is

$$\begin{aligned} y&\le \sum _{i=1}^{\tau }\left( \frac{{\widetilde{w}}_{i,J'(i)}}{2}+\sum _{j=J'(i)+1}^{p_i}({\widetilde{w}}_{i,j}-{\widetilde{w}}_{i,J'(i)})\mathbb {1}[j=J(i)]+\sum _{j=1}^{p_i}w^2_{i,j}\mathbb {1}[j=J(i)]\right) \\&\quad +\frac{b^1+b^2}{2} \\&=\sum _{i=1}^{\tau }\left( \frac{{\widetilde{w}}_{i,J'(i)}}{2}+({\widetilde{w}}_{i,J(i)}-{\widetilde{w}}_{i,J'(i)})\mathbb {1}[J'(i)<J(i)]\right) +\sum _{i=1}^{\tau }w^2_{i,J(i)}+\frac{b^1+b^2}{2}. \end{aligned}$$

For any simplex i, recall that the indices are sorted so that \({\widetilde{w}}_{i,1}\le \cdots \le {\widetilde{w}}_{i,p_i}\). Thus, if \(J'(i)\ge J(i)\), then the expression inside the outer parentheses equals \(\frac{{\widetilde{w}}_{i,J'(i)}}{2} \geqslant \frac{{\widetilde{w}}_{i,J(i)}}{2}\). On the other hand, if \(J'(i)<J(i)\), then the expression inside the outer parentheses can be re-written as \(\frac{{\widetilde{w}}_{i,J(i)}}{2}+\frac{{\widetilde{w}}_{i,J(i)}-{\widetilde{w}}_{i,J'(i)}}{2} \geqslant \frac{{\widetilde{w}}_{i,J(i)}}{2}\). Therefore, setting \(J'(i)=J(i)\) for every simplex i achieves the tightest upper bound in (20), which simplifies to

$$\begin{aligned} y\le \sum _{i=1}^{\tau }\frac{{\widetilde{w}}_{i,J(i)}}{2}+\sum _{i=1}^{\tau }w^2_{i,J(i)}+\frac{b^1+b^2}{2}. \end{aligned}$$
(21)

Now, suppose that the same upper bound on y is achieved by a mapping \(J'\) such that \(J'(i)\ne J(i)\) on a simplex i. By the argument above, regardless of whether \(J'(i)>J(i)\) or \(J'(i)<J(i)\), the expression inside the outer parentheses can equal \(\frac{{\widetilde{w}}_{i,J(i)}}{2}\) only if \({\widetilde{w}}_{i,J'(i)}={\widetilde{w}}_{i,J(i)}\). In this case, inspecting the term inside the summation in (20) for mappings J and \(J'\), we observe that regardless of the values of x or z,

$$\begin{aligned} {\widetilde{w}}_{i,J'(i)}z+\sum _{j=J'(i)+1}^{p_i}({\widetilde{w}}_{i,j}-{\widetilde{w}}_{i,J'(i)})x_{i,j}={\widetilde{w}}_{i,J(i)}z+\sum _{j=J(i)+1}^{p_i}({\widetilde{w}}_{i,j}-{\widetilde{w}}_{i,J(i)})x_{i,j}. \end{aligned}$$

Therefore, for a mapping \(J'\) to achieve the tightest upper bound (21), it must be the case that \({\widetilde{w}}_{i,J'(i)}={\widetilde{w}}_{i,J(i)}\) on every simplex i, and so \(J'\) and J necessarily correspond to the same half-space in \(D\times {\mathbb {R}}\times [0,1]\), completing the proof. \(\square \)

To close this section, we consider the setting where every simplex is 2-dimensional, i.e. \(p_1=\cdots =p_{\tau }=2\), but the number of affine functions d can be arbitrary. Note that by contrast, the previous results (Propositions 8 and 9, Corollary 2) held in the setting where \(d=2\) but \(p_1,\ldots ,p_{\tau }\) were arbitrary. We exploit the symmetry of the transportation problem to immediately obtain the following analogous results, all wrapped up into Corollary 3.

Corollary 3

Given data \(w^1, \ldots , w^d \in {\mathbb {R}}^2\), take \({\widetilde{w}}^k = w^k_1 - w^k_2\) for all \(k\in [\![d]\!]\), and suppose the indices have been sorted so that \({\widetilde{w}}^1\le \cdots \le {\widetilde{w}}^d\). Then

$$\begin{aligned} {\mathsf {Transport}}(x,z;w^1,\ldots ,w^d)=\min _{K=1}^d\left( {\widetilde{w}}^Kx_1+\sum _{k=1}^{K}w^k_2z_k+\sum _{k=K+1}^{d}(w^k_1-{\widetilde{w}}^K)z_k\right) . \end{aligned}$$
(22)

Moreover, a \(K \in [\![d ]\!]\) that attains the minimum in the right-hand side of (22) can be found in \({\mathcal {O}}(d)\) time.

Therefore, suppose that D is a product of \(\tau \) simplices of dimensions \(p_1=\cdots =p_{\tau }=2\). For each simplex \(i\in [\![\tau ]\!]\), take \({\widetilde{w}}^k_i=w^k_{i,1}-w^k_{i,2}\) for all \(k=1,\ldots ,d\) and relabel the indices so that \({\widetilde{w}}^1_i\le \cdots \le {\widetilde{w}}^d_i\). Then, in the context of (12), the upper-bound constraints (12a) are equivalent to

$$\begin{aligned} y&\le \sum _{i=1}^{\tau }\left( {\widetilde{w}}^{K(i)}_ix_{i,1}+\sum _{k=1}^{K(i)}w^k_{i,2}z_k+\sum _{k=K(i)+1}^{d}(w^k_{i,1}-{\widetilde{w}}^{K(i)}_i)z_k\right) +\sum _{k=1}^db^kz_k \nonumber \\&\qquad \quad \forall \text { mappings } K:[\![\tau ]\!]\rightarrow [\![d]\!]. \end{aligned}$$
(23)

Furthermore, none of the constraints in (23) are redundant.

Corollary 3 will be particularly useful in Sect. 5.1, where it will allow us to derive sharp formulations for the maximum of d affine functions over a box input domain, in analogy to (20).

5 Applications of our machinery

We are now prepared to return to the concrete goal of this paper: building strong MIP formulations for nonlinearities used in modern neural networks.

5.1 A hereditarily sharp formulation for Max on box domains

We can now present a hereditarily sharp formulation, with exponentially-many constraints that can be efficiently separated, for the maximum of \(d > 2\) affine functions over a shared box input domain.

Proposition 10

For each \(k, \ell \in [\![d ]\!]\), take

$$\begin{aligned} N^{\ell ,k}&= \sum _{i=1}^\eta \max \{(w^k_i-w^\ell _i)L_i,\ (w^k_i-w^\ell _i)U_i\}. \end{aligned}$$

A valid MIP formulation for \({\text {gr}}(\texttt {Max}{} \circ (f^1,\ldots ,f^d); [L,U])\) is

$$\begin{aligned}&y \leqslant w^\ell \cdot x + \sum _{k=1}^d \left( N^{\ell ,k} + b^k\right) z_k \quad \forall \ell \in [\![d ]\!] \end{aligned}$$
(24a)
$$\begin{aligned}&y \geqslant w^k \cdot x + b^k \quad \forall k \in [\![d ]\!] \end{aligned}$$
(24b)
$$\begin{aligned}&(x,y,z) \in [L,U] \times {\mathbb {R}} \times \varDelta ^d \end{aligned}$$
(24c)
$$\begin{aligned}&z \in \{0,1\}^d. \end{aligned}$$
(24d)

Moreover, a hereditarily sharp formulation is given by (24), along with the constraints

$$\begin{aligned}&y \leqslant \sum _{i=1}^{\eta }\left( w^{I(i)}_ix_i+\sum _{k=1}^d\max \{(w^{k}_i-w^{I(i)}_i)L_i,(w^{k}_i-w^{I(i)}_i)U_i\}z_k\right) +\sum _{k=1}^db^kz_k\nonumber \\&\qquad \quad \quad \forall \text { mappings } I : [\![\eta ]\!]\rightarrow [\![d ]\!] \end{aligned}$$
(25)

Furthermore, none of the constraints (24b) and (25) are redundant.

Proof

The result directly follows from Corollary 3 if we make a change of variables to transform the box domain \([L,U]\) to the domain \((\varDelta ^2)^{\eta }\), where the x-coordinates are given by \(x_{i,1},x_{i,2}\ge 0\) over \(i\in [\![\eta ]\!]\), with \(x_{i,1}+x_{i,2}=1\) for each simplex i. For each i, let \(w^k_{i,1}=w^k_iU_i\), \(w^k_{i,2}=w^k_iL_i\), and let \(\sigma _i:[\![d]\!]\rightarrow [\![d]\!]\) be a permutation such that \(w^{\sigma _i(1)}_{i,1}-w^{\sigma _i(1)}_{i,2}\le \cdots \le w^{\sigma _i(d)}_{i,1}-w^{\sigma _i(d)}_{i,2}\). Fix a mapping \(I:[\![\eta ]\!]\rightarrow [\![d]\!]\) for (23). We make the change of variables \(\xi _i \leftarrow (U_i-L_i)x_{i,1}+L_i\) and rewrite (23) from Corollary 3 to take the form (25), as follows:

$$\begin{aligned} y&\leqslant \sum _{i=1}^{\eta }\Bigg (w^{\sigma _i(I(i))}_i(U_i-L_i)x_{i,1}+\sum _{k=1}^{I(i)}w^{\sigma _i(k)}_iL_iz_k+\sum _{k=I(i)+1}^{d}(w^{\sigma _i(k)}_iU_i-w^{\sigma _i(I(i))}_i(U_i-L_i))z_k\Bigg )\nonumber \\&\quad +\sum _{k=1}^db^kz_k,\nonumber \\&=\sum _{i=1}^{\eta }\left( w^{\sigma _i(I(i))}_i\xi _i+\sum _{k=1}^{I(i)}(w^{\sigma _i(k)}_i-w^{\sigma _i(I(i))}_i)L_iz_k+\sum _{k=I(i)+1}^{d}(w^{\sigma _i(k)}_i-w^{\sigma _i(I(i))}_i)U_iz_k\right) +\sum _{k=1}^db^kz_k \nonumber \\&=\sum _{i=1}^{\eta }\left( w^{\sigma _i(I(i))}_i\xi _i+\sum _{k=1}^d\max \{(w^{\sigma _i(k)}_i-w^{\sigma _i(I(i))}_i)L_i,(w^{\sigma _i(k)}_i-w^{\sigma _i(I(i))}_i)U_i\}z_k\right) +\sum _{k=1}^db^kz_k \nonumber \\&=\sum _{i=1}^{\eta }\left( w^{\sigma _i(I(i))}_i\xi _i+\sum _{k=1}^d\max \{(w^{k}_i-w^{\sigma _i(I(i))}_i)L_i,(w^{k}_i-w^{\sigma _i(I(i))}_i)U_i\}z_k\right) +\sum _{k=1}^db^kz_k, \end{aligned}$$
(26)

where the first equality holds because \(\xi _i-(U_i-L_i)x_{i,1}=L_i\) for each i and \(\sum _{k=1}^d z_k=1\), the second equality holds because \((w^{\sigma _i(k)}_i-w^{\sigma _i(I(i))}_i)L_i\ge (w^{\sigma _i(k)}_i-w^{\sigma _i(I(i))}_i)U_i\) if \(k\le I(i)\) (and a reverse argument can be made if \(k>I(i)\)), and the third equality holds as each \(\sigma _i:[\![d]\!]\rightarrow [\![d]\!]\) is a bijection. Now, since (26) is taken over all mappings \(I:[\![\eta ]\!]\rightarrow [\![d]\!]\), it is equivalent to replace \(\sigma _i(I(i))\) with I(i) in (26). This yields (25) over \(\xi \) instead of x.

The lower bound in the context of Corollary 3 can be expressed as \(y \geqslant \sum _{i=1}^{\eta }(w^k_{i,1}x_{i,1}+w^k_{i,2}x_{i,2}) + b^k,\ \forall k \in [\![d ]\!]\). To transform it to (24b) over \(\xi \), observe that \(w^k_{i,1}x_{i,1}+w^k_{i,2}x_{i,2}\) can be re-written as \(w^k_i(U_ix_{i,1}+L_i(1-x_{i,1}))=w^k_i\xi _i\).

Finally, note that this transformation of variables is a bijection since each \(x_{i,1}\) was allowed to range over [0, 1], and thus each \(\xi _i\) is allowed to range over \([L_i,U_i]\). Hence the new formulation over \((\xi ,y,z)\in [L,U]\times {\mathbb {R}}\times \varDelta ^d\) is also sharp, yielding the desired result.

The irredundancy of (25) also follows from Corollary 3. For the irredundancy of (24b), fix k and take a point \(({\hat{x}}, {\hat{y}}, {\hat{z}})\) where \(f^k({\hat{x}}) > f^\ell ({\hat{x}})\) for all \(\ell \ne k\), which exists by Assumption 2. Since the inequalities (24b) are the only ones bounding y from below, the point \(({\hat{x}}, {\hat{y}} - \epsilon , {\hat{z}})\) for sufficiently small \(\epsilon > 0\) satisfies every constraint except the inequality in (24b) corresponding to k, and therefore that inequality is not redundant. \(\square \)

Observe that the constraints (24a) are a special case of (25) for those constant mappings \(I(i) = \ell \) for each \(i \in [\![\eta ]\!]\). We note in passing that this big-M formulation (24) may be substantially stronger than existing formulations appearing in the literature. For example, the tightest version of the formulation of Tjeng et al. [70] is equivalent to (24) with the coefficients \(N^{\ell ,k}\) replaced with

$$\begin{aligned} N^{\ell ,k} = b^\ell - b^k + \max \limits _{t \ne \ell } \left( \sum _{i=1}^\eta \left( \max \{w^t_iL_i,w^t_iU_i\} - \min \{w^\ell _iL_i,w^\ell _iU_i\}\right) \right) . \end{aligned}$$

Note in particular that the inner maximization and minimization are completely decoupled, and that the outer maximization in the definition of \(N^{\ell ,k}\) is completely independent of k.

In “Appendix A”, we generalize (24) to provide a valid formulation for arbitrary polytope domains. We next emphasize that the hereditarily sharp formulation from Proposition 10 is particularly strong when \(d = 2\).

Corollary 4

The formulation given by (24) and (25) is ideal when \(d = 2\). Moreover, the constraints in the families (25) and (24b) are facet-defining.

Proof

Idealness follows directly from Corollary 1. Since the constraints (25) and (24b) are irredundant and the formulation is ideal, they must either be facet-defining or describe an implied equality. Given that the equality \(\sum _{k=1}^d z_k = 1\) appears in (24c)–(24d), it suffices to observe that the polyhedron defined by (24) and (25) has dimension \(\eta + d\), which holds under Assumption 2. \(\square \)

We can compute a most-violated inequality from the family (25) efficiently.

Proposition 11

Consider the family of inequalities (25). Take some point \(({\hat{x}},{\hat{y}},{\hat{z}}) \in [L,U] \times {\mathbb {R}}\times \varDelta ^d\). If any constraint in the family is violated at the given point, a most-violated constraint can be constructed by selecting \({\hat{I}} : [\![\eta ]\!]\rightarrow [\![d ]\!]\) such that

$$\begin{aligned} {\hat{I}}(i) \in \arg \min _{\ell \in [\![d]\!]} \left( w^{\ell }_i{\hat{x}}_i+\sum _{k=1}^d\max \{(w^{k}_i-w^{\ell }_i)L_i,(w^{k}_i-w^{\ell }_i)U_i\}{\hat{z}}_k \right) \end{aligned}$$
(27)

for each \(i \in [\![\eta ]\!]\). Moreover, if the weights \(w_i^k\) are sorted on k for each \(i \in [\![\eta ]\!]\), this can be done in \({\mathcal {O}}(\eta d)\) time.

Proof

Follows directly from Corollary 3, which says that the minimization problem (27) can be solved in \({\mathcal {O}}(d)\) time for any \(i\in [\![\eta ]\!]\). \(\square \)

Note that naïvely, the minimization problem (27) would take \({\mathcal {O}}(d^2)\) time for each i, because one has to check every \(\ell \in [\![d]\!]\), and for each \(\ell \) sum over \(k\in [\![d]\!]\). However, if we instead pre-sort the weights \(w^1_i,\ldots ,w^d_i\) for every \(i\in [\![\eta ]\!]\) in \({\mathcal {O}}(\eta d \log d)\) time, we can use Corollary 3 to perform the separation efficiently via a linear search. We note, however, that this pre-sorting step can potentially be obviated by solving each fractional knapsack problem appearing in the separation as a weighted median problem, which can be solved in \({\mathcal {O}}(d)\) time.
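For concreteness, the following sketch carries out the separation of Proposition 11 in its simplest form, evaluating (27) by brute force (so it runs in \({\mathcal {O}}(\eta d^2)\) rather than the \({\mathcal {O}}(\eta d)\) achievable with pre-sorting). The data layout and all identifiers are our own.

```python
def most_violated_cut(W, b, L, U, x_hat, y_hat, z_hat):
    """Return (I_hat, rhs) for a most-violated inequality (25), or None if none is violated.

    W[k][i] = w^k_i, b[k] = b^k, [L, U] is the box domain.
    """
    d, eta = len(W), len(L)
    I_hat, rhs = [], sum(b[k] * z_hat[k] for k in range(d))
    for i in range(eta):
        best_val, best_l = float("inf"), 0
        for l in range(d):          # brute-force evaluation of (27) for coordinate i
            val = W[l][i] * x_hat[i] + sum(
                max((W[k][i] - W[l][i]) * L[i], (W[k][i] - W[l][i]) * U[i]) * z_hat[k]
                for k in range(d))
            if val < best_val:
                best_val, best_l = val, l
        I_hat.append(best_l)
        rhs += best_val
    return (I_hat, rhs) if y_hat > rhs else None
```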

5.2 The ReLU over a box domain

We can now present the results promised in Theorem 2.3. In particular, we derive a non-extended ideal formulation for the ReLU nonlinearity, stated only in terms of the original variables (xy) and the single additional binary variable z. Put another way, it is the strongest possible tightening that can be applied to the big-M formulation (5), and so matches the strength of the multiple choice formulation without the growth in the number of variables remarked upon in Sect. 2.2. Notationally, for each \(i \in [\![\eta ]\!]\) take

$$\begin{aligned} \breve{L}_i = {\left\{ \begin{array}{ll} L_i &{} \text { if } w_i \geqslant 0 \\ U_i &{} \text { if } w_i< 0 \end{array}\right. } \quad \text { and } \quad \breve{U}_i = {\left\{ \begin{array}{ll} U_i &{} \text { if } w_i \geqslant 0 \\ L_i &{} \text { if } w_i < 0 \end{array}\right. }. \end{aligned}$$

Proposition 12

Take some affine function \(f(x) = w \cdot x + b\) over input domain \(D = [L,U]\). The following is an ideal MIP formulation for \({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U])\):

$$\begin{aligned} y&\leqslant \sum _{i \in I} w_i(x_i - \breve{L}_i(1-z)) + \left( b + \sum _{i \not \in I} w_i\breve{U}_i\right) z \quad \forall I \subseteq [\![\eta ]\!]\end{aligned}$$
(28a)
$$\begin{aligned} y&\geqslant w \cdot x + b \end{aligned}$$
(28b)
$$\begin{aligned} (x,y,z)&\in [L,U] \times {\mathbb {R}}_{\geqslant 0} \times [0,1] \end{aligned}$$
(28c)
$$\begin{aligned} z&\in \{0,1\}. \end{aligned}$$
(28d)

Furthermore, each inequality in (28a) and (28b) is facet-defining.

Proof

This result is a special case of Corollary 4. Observe that (28) is equivalent to (24) and (25) with \(w^1_i=w_i\) and \(w^2_i=0\) for all \(i \in [\![\eta ]\!]\), \(b^1=b\), \(b^2=0\), \(z_1=z\), and \(z_2=1-z\). The constraints (28a) are found by setting \(I=\left\{ i \in [\![\eta ]\!]\,\big |\, {\hat{I}}(i)=1\right\} \) for each mapping \({\hat{I}}\) in (25). \(\square \)

Formulation (28) has a number of constraints exponential in the input dimension \(\eta \), so it will not be useful directly as a MIP formulation. However, it is straightforward to separate the exponential family (28a) efficiently.

Proposition 13

Take \(({\hat{x}},{\hat{y}},{\hat{z}}) \in [L,U] \times {\mathbb {R}}_{\geqslant 0} \times [0,1]\), along with the set

$$\begin{aligned} {\hat{I}} = \left\{ i \in [\![\eta ]\!]\,\big |\, w_i{\hat{x}}_i < w_i\left( \breve{L}_i(1-{\hat{z}}) + \breve{U}_i{\hat{z}}\right) \right\} . \end{aligned}$$

If

$$\begin{aligned} {\hat{y}} > b{\hat{z}} + \sum _{i \in {\hat{I}}} w_i\left( {\hat{x}}_i - \breve{L}_i(1-{\hat{z}})\right) + \sum _{i \not \in {\hat{I}}} w_i\breve{U}_i{\hat{z}}, \end{aligned}$$

then the constraint in (28a) corresponding to \({\hat{I}}\) is the most violated in the family. Otherwise, no inequality in the family is violated at \(({\hat{x}},{\hat{y}},{\hat{z}})\).

Proof

Follows as a special case of Proposition 11. \(\square \)

Note that (5b) and (5c) correspond to (28a) with \(I = [\![\eta ]\!]\) and \(I = \emptyset \), respectively. All this suggests an iterative approach to formulating ReLU neurons over box domains: start with the big-M formulation (5), and use Proposition 13 to separate strengthening inequalities from (28a) as they are needed.
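As a concrete illustration of this iterative approach, the sketch below performs the separation step of Proposition 13 for a single ReLU neuron. The function and argument names are ours, and the tolerance is an implementation detail rather than part of the formulation.

```python
def separate_relu_cut(w, b, L_brv, U_brv, x_hat, y_hat, z_hat, tol=1e-6):
    """Return the index set of a most-violated inequality (28a), or None if all hold.

    w, b describe f(x) = w.x + b; L_brv, U_brv are the sign-adjusted bounds
    L-breve and U-breve defined before Proposition 12.
    """
    eta = len(w)
    # \hat{I} from Proposition 13: coordinates whose "z-weighted" bound is tighter.
    I_hat = {i for i in range(eta)
             if w[i] * x_hat[i] < w[i] * (L_brv[i] * (1 - z_hat) + U_brv[i] * z_hat)}
    rhs = b * z_hat
    rhs += sum(w[i] * (x_hat[i] - L_brv[i] * (1 - z_hat)) for i in I_hat)
    rhs += sum(w[i] * U_brv[i] * z_hat for i in range(eta) if i not in I_hat)
    return I_hat if y_hat > rhs + tol else None
```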

5.3 The ReLU with one-hot encodings

Although box domains are a natural choice for many applications, it is often the case that some (or all) of the inputs to the first layer of a neural network will be constrained to lie in a product of simplices. The one-hot encoding is a standard technique used in the machine learning community to preprocess discrete or categorical data into a format more amenable for learning (see, for example, [16, Chapter 2.2]). More formally, if input x is constrained to take categorical values \(x \in C = \{c^1,\ldots ,c^t\}\), the one-hot transformation encodes this as \({\widetilde{x}}\in \{0,1\}^{t}\), where \({\widetilde{x}}_i = 1\) if and only if \(x = c^i\). In other words, the input is constrained such that \({\widetilde{x}}\in {\underline{\varDelta }}^\eta \mathop {=}\limits ^{\text {def}}\varDelta ^\eta \cap \{0,1\}^\eta \).

It is straightforward to construct a small ideal formulation for \({\text {gr}}(\texttt {ReLU}{} \circ f; {\underline{\varDelta }}^\eta ) = \left\{ (x,\sum _{i=1}^\eta \max \{0, w_i + b\}x_i) \,\big |\, x \in {\underline{\varDelta }}^\eta \right\} \). However, it is typically the case that multiple features will be present in the input, meaning that the input domain would consist of the product of (potentially many) simplices. For example, neural networks have proven well-suited for predicting the propensity for a given DNA sequence to bind with a given protein [2, 83], where the network input consists of a sequence of n base pairs, each of which can take 4 possible values. In this context, the input domain would be \(\prod _{i=1}^n {\underline{\varDelta }}^4\).
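As a small illustration of the encoding just described, with the four DNA bases as the categorical values (the helper below is ours and purely illustrative):

```python
# One-hot encoding of a single categorical feature with four possible values.
CATEGORIES = ["A", "C", "G", "T"]

def one_hot(value):
    return [1 if value == c else 0 for c in CATEGORIES]

print(one_hot("G"))  # [0, 0, 1, 0] -- a vertex of the simplex
```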

In this section, we restate the general results developed above, specialized for the standard case of the ReLU nonlinearity.

Corollary 5

Presume that the input domain \(D = \varDelta ^{p_1} \times \cdots \times \varDelta ^{p_\tau }\) is a product of \(\tau \) simplices, and that \(f(x) = \sum _{i=1}^\tau \sum _{j=1}^{p_i} w_{i,j} x_{i,j} + b\) is an affine function. Presume that, for each \(i \in [\![\tau ]\!]\), the weights are sorted such that \(w_{i,1} \leqslant \cdots \leqslant w_{i,p_i}\). Then an ideal formulation for \({\text {gr}}(\texttt {ReLU}{} \circ f; D)\) is:

$$\begin{aligned}&y \geqslant w \cdot x + b \end{aligned}$$
(29a)
$$\begin{aligned}&y \leqslant \sum _{i=1}^\tau \left( w_{i,J(i)} z + \sum _{j=J(i)+1}^{p_i} (w_{i,j} - w_{i,J(i)})x_{i,j} \right) + bz \nonumber \\&\qquad \quad \forall \text { mappings }J:[\![\tau ]\!]\rightarrow {\mathbb {Z}}\text { with }J(i)\in [\![p_i]\!]\ \forall i\in [\![\tau ]\!]\end{aligned}$$
(29b)
$$\begin{aligned}&(x,y,z) \in D \times {\mathbb {R}}_{\geqslant 0} \times \{0,1\}. \end{aligned}$$
(29c)

Moreover, a most-violated constraint from the family (29b), if one exists, can be identified in \({\mathcal {O}}(p_1+\cdots +p_\tau )\) time. Finally, none of the constraints from (29b) are redundant.

Proof

Follows directly from applying Corollary 2 to the set \(R_{{\text {sharp}}}\). By Corollary 1, this set actually leads to an ideal formulation, because we are taking the maximum of only two functions (with one of them being zero). The statement about non-redundancy follows from Proposition 9. \(\square \)

5.4 The leaky ReLU over a box domain

A slightly more exotic variant of the ReLU is the leaky ReLU, defined as \(\texttt {Leaky}{}(v;\alpha ) = \max \{\alpha v, v\}\) for some constant \(0< \alpha < 1\). Instead of fixing any negative input to zero, the leaky ReLU scales it by a (typically small) constant \(\alpha \). This has been empirically observed to help avoid the “vanishing gradient” problem during the training of certain networks [55, 82]. We present results for the leaky ReLU analogous to those for the ReLU: an ideal MIP formulation with an efficient separation routine for the constraints.

Proposition 14

Take some affine function \(f(x) = w \cdot x + b\) over input domain \(D = [L,U]\). The following is a valid formulation for \({\text {gr}}(\texttt {Leaky}{} \circ f; [L,U])\):

$$\begin{aligned} y&\geqslant f(x) \end{aligned}$$
(30a)
$$\begin{aligned} y&\geqslant \alpha f(x) \end{aligned}$$
(30b)
$$\begin{aligned} y&\leqslant f(x) - (1-\alpha ) \cdot M^-(f; [L,U]) \cdot (1-z) \end{aligned}$$
(30c)
$$\begin{aligned} y&\leqslant \alpha f(x) - (\alpha -1) \cdot M^+(f; [L,U]) \cdot z \end{aligned}$$
(30d)
$$\begin{aligned} (x,y,z)&\in [L,U] \times {\mathbb {R}}\times [0,1] \end{aligned}$$
(30e)
$$\begin{aligned} z&\in \{0,1\}. \end{aligned}$$
(30f)

Moreover, an ideal formulation is given by (30), along with the constraints

$$\begin{aligned} y \leqslant&\left( \sum _{i \in I} w_i(x_i - \breve{L}_i(1-z)) + \left( b + \sum _{i \not \in I} w_i\breve{U}_i\right) z\right) \nonumber \\&+ \alpha \left( \sum _{i \not \in I} w_i(x_i - \breve{U}_iz) + \left( b + \sum _{i \in I} w_i\breve{L}_i\right) (1-z)\right) \quad \forall I \subseteq [\![\eta ]\!]. \end{aligned}$$
(31)

Additionally, the most violated inequality from the family (31) can be separated in \({\mathcal {O}}(\eta )\) time. Finally, each inequality in (30a)–(30d) and (31) is facet-defining.

Proof

Follows as a special case of Corollary 4. \(\square \)

6 Computational experiments

To conclude this work, we perform a preliminary computational study of our approaches for ReLU-based networks. We focus on verifying image classification networks trained on the canonical MNIST digit data set [49]. We train a neural network \(f : [0,1]^{28 \times 28} \rightarrow {\mathbb {R}}^{10}\), where each of the 10 outputs corresponds to the logit for one of the digits from 0 to 9. Given a training image \({\tilde{x}} \in [0,1]^{28 \times 28}\), our goal is to prove that there does not exist a perturbation of \({\tilde{x}}\) such that the neural network f produces a wildly different classification result. If \(f({\tilde{x}})_i = \max _{j=1}^{10} f({\tilde{x}})_j\), then \({\tilde{x}}\) is placed in class i. Consider an input image with known label i. To evaluate robustness around \({\tilde{x}}\) with respect to class j, select some small \(\epsilon > 0\) and solve the problem \(\max _{a : ||a||_\infty \leqslant \epsilon }\, f({\tilde{x}} + a)_j - f({\tilde{x}} + a)_i\). If the optimal solution (or a dual bound thereof) is less than zero, this verifies that our network is robust around \({\tilde{x}}\) as we cannot produce a small perturbation that will flip the classification from i to j. We note that in the literature there are a number of variants of the verification problem presented above, produced by selecting a different objective function [50, 81] or constraint set [28]. We use a model similar to that of Dvijotham et al. [25, 26].
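To make the verification problem concrete, the following sketch builds the corresponding MIP for a much-simplified network with a single dense ReLU layer (weights W1, b1, W2, b2) rather than the convolutional architecture used in our experiments, encoding each ReLU with the big-M constraints of formulation (5) and interval arithmetic for the big-M coefficients. It assumes gurobipy, and all identifiers are ours.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def verify(W1, b1, W2, b2, x_tilde, eps, true_i, target_j):
    """Maximize f(x~ + a)_target_j - f(x~ + a)_true_i over ||a||_inf <= eps."""
    n_hid, n_in = W1.shape
    L = np.clip(x_tilde - eps, 0.0, 1.0)  # perturbation ball intersected with [0, 1]
    U = np.clip(x_tilde + eps, 0.0, 1.0)

    m = gp.Model("verify")
    x = m.addVars(n_in, lb=L.tolist(), ub=U.tolist(), name="x")
    y = m.addVars(n_hid, lb=0.0, name="y")            # post-ReLU activations
    z = m.addVars(n_hid, vtype=GRB.BINARY, name="z")  # ReLU on/off indicators

    for k in range(n_hid):
        pre = gp.quicksum(W1[k, i] * x[i] for i in range(n_in)) + b1[k]
        # Interval bounds on the pre-activation over the box [L, U].
        lo = sum(min(W1[k, i] * L[i], W1[k, i] * U[i]) for i in range(n_in)) + b1[k]
        hi = sum(max(W1[k, i] * L[i], W1[k, i] * U[i]) for i in range(n_in)) + b1[k]
        m.addConstr(y[k] >= pre)
        m.addConstr(y[k] <= pre - lo * (1 - z[k]))  # ties y to the pre-activation when z = 1
        m.addConstr(y[k] <= hi * z[k])              # forces y = 0 when z = 0

    logit = [gp.quicksum(W2[c, k] * y[k] for k in range(n_hid)) + b2[c]
             for c in range(W2.shape[0])]
    m.setObjective(logit[target_j] - logit[true_i], GRB.MAXIMIZE)
    m.optimize()
    return m.ObjVal  # the network is robust w.r.t. target_j if this value is < 0
```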

We train two models, each using the same architecture, with and without L1 regularization. The architecture starts with a convolutional layer with ReLU activation functions (4 filters, kernel size of 4, stride of 2), which has 676 ReLUs, then a linear convolutional layer (with the same parameters but without ReLUs), feeding into a dense layer of 16 ReLU neurons, and then a dense linear layer with one output per digit representing the logits. Finally, we have a softmax layer that is only enabled during training time to normalize the logits and output probabilities. Such a network is smaller than typically used for image classification tasks, but is nonetheless capable of achieving near-perfect out of sample accuracy on the MNIST data set, and of presenting us with challenging optimization problems. We generate 100 instances for each network by randomly selecting images \({\tilde{x}}\) with true label i from the test data, along with a random target adversarial class \(j \ne i\).
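For reference, the architecture just described can be written in a few lines of tf.keras. This is a sketch: the padding choice, data layout, and compile settings (matching the standard-training setup of Sect. 6.1) are our assumptions, and the softmax is realized only through the from_logits option of the loss, so the model itself outputs logits.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, kernel_size=4, strides=2, activation="relu",
                           input_shape=(28, 28, 1)),    # 13 x 13 x 4 = 676 ReLUs
    tf.keras.layers.Conv2D(4, kernel_size=4, strides=2),  # linear convolution
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(10),                           # one logit per digit
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```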

As a general rule of thumb, larger networks will be capable of achieving higher accuracy, but will lead to larger optimization problems which are more difficult to solve. However, even with a fixed network architecture, there can still be dramatic variability in the difficulty of optimizing over different parameter realizations. We refer the interested reader to Ryu et al. [62] for an example of this phenomenon in reinforcement learning. Moreover, in the scope of this work we make no attempt to utilize recent techniques that train the networks to be verifiable [25, 78, 79, 81].

For all experiments, we use the Gurobi v7.5.2 solver, running with a single thread on a machine with 128 GB of RAM and 32 CPUs at 2.30 GHz. We use a time limit of 30 minutes (1800 s) for each run. We perform our experiments using the tf.opt package for optimization over trained neural networks; tf.opt is under active development at Google, with the intention to open source the project in the future. The big-M + (28a) method is the big-M formulation (5) paired with separation over the exponential family (28a), and with Gurobi’s cutting plane generation turned off. Similarly, the big-M and the extended methods are the big-M formulation (5) and the extended formulation (7) respectively, with default Gurobi settings. Finally, the big-M + no cuts method turns off Gurobi’s cutting plane generation without separating over (28a). As a preprocessing step, if we can infer that a neuron is linear based on its bounds (e.g. a nonnegative lower bound or nonpositive upper bound on the input affine function of a ReLU), we transform it into a linear neuron, thus ensuring Assumption 2.

Table 1 Shifted geometric mean for time and optimality gap taken over 100 instances (shift of 10 and 1, respectively)

6.1 Network with standard training

We start with a model trained with a standard procedure, using the Adam algorithm [45], running for 15 epochs with a learning rate of \(10^{-3}\). The model attains 97.2% test accuracy. We select a perturbation ball radius of \(\epsilon = 0.1\). We report the results in Table 1a and in Fig. 4a. The big-M + (28a) method solves instances 7 times faster on average than the big-M formulation. Indeed, for 79 out of 100 instances the big-M method does not prove optimality after 30 minutes, and it is never the fastest choice (the “win” column). Moreover, the big-M + no cuts method times out on every instance, implying that using some cuts is important. The extended method is roughly 5 times slower than the big-M + (28a) method, but only exceeds the time limit on 19 instances, and so is substantially more reliable than the big-M method for a network of this size.

Fig. 4 Number of instances solved within a given amount of time. Curves to the upper left are better, with more instances solved in less time

6.2 ReLU network with L1 regularization

The optimization problems studied in Sect. 6.1 are surprisingly difficult given the relatively small size of the networks involved. This can largely be attributed to the fact that the weights describing the neural network are almost completely dense. To remedy this, we train a second model, using the same network architecture, but with L1 regularization as suggested by Xiao et al. [81]. We again set a radius of \(\epsilon =0.1\), and train for 100 epochs with a learning rate of \(5 \times 10^{-4}\) and a regularization parameter of \(10^{-4}\). The network achieves a test accuracy of 98.0%. With regularization, 4% of the weights in the network are zero, compared to 0.3% in the previous network. Moreover, with regularization we can infer from variable bounds that on average 73.1% of the ReLUs are linear within the perturbation box of radius \(\epsilon \), enabling us to eliminate the corresponding binary variables. In the first network, we can do so only for 27.8% of the ReLUs.

We report the corresponding results in Table 1b and Fig. 4b. While the extended approach does not seem to be substantially affected by the network sparsity, the big-M-based approaches are able to exploit it to solve more instances, more quickly. The big-M approach is able to solve 70 of 100 instances to optimality within the time limit, though the mean solve time is still quite large. In contrast, the big-M + (28a) approach fully exploits the sparsity in the model, solving each instance strictly faster than each of the other approaches. Indeed, the approach is able to solve 69 instances in less than 10 s, and solves all instances in under 120 s.