
1 Introduction

Deep learning has proven immensely powerful at solving a number of important predictive tasks arising in areas such as image classification, speech recognition, machine translation, and robotics and control [27, 35]. The workhorse model in deep learning is the feedforward network \(\texttt {NN}{}:\mathbb {R}^{m_0}\rightarrow \mathbb {R}^{m_s}\) with rectified linear unit (ReLU) activation functions, for which \(\texttt {NN}{}(x^0) = x^s\) is defined through

$$\begin{aligned} x^i_j = \texttt {ReLU}{}(w^{i,j} \cdot x^{i-1} + b^{i,j}) \end{aligned}$$
(1)

for each layer \(i \in \llbracket s \rrbracket \) and each \(j \in \llbracket m_i \rrbracket \). Note that the input \(x^0 \in \mathbb {R}^{m_0}\) may be high-dimensional and that the output \(x^s \in \mathbb {R}^{m_s}\) may be multivariate. In this recursive description, \(\texttt {ReLU}{}(v) :=\max \{0,v\}\) is the ReLU activation function, and \(w^{i,j}\) and \(b^{i,j}\) are the weights and bias of an affine function, learned during the training procedure. Each equation in (1) corresponds to a single neuron in the network. Networks with specialized linear transformations such as convolutional layers can be reduced to this model after training, without loss of generality.
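To make the recursion (1) concrete, the following NumPy sketch evaluates such a network; the layer sizes and random weights are purely illustrative placeholders, not a trained model.

```python
import numpy as np

def relu(v):
    # Componentwise rectified linear unit: ReLU(v) = max(0, v).
    return np.maximum(0.0, v)

def nn_forward(x0, weights, biases):
    """Evaluate NN(x0) = x^s following the recursion (1).

    weights[i-1] is an (m_i x m_{i-1}) matrix whose j-th row is w^{i,j}, and
    biases[i-1] collects the b^{i,j}. Every layer is ReLU-activated, as in (1);
    in practice the final layer is often left affine instead.
    """
    x = np.asarray(x0, dtype=float)
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

# Toy usage with random (untrained) weights, purely for illustration.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # m_0, m_1, m_2, m_3
Ws = [rng.standard_normal((n, m)) for m, n in zip(sizes, sizes[1:])]
bs = [rng.standard_normal(n) for n in sizes[1:]]
print(nn_forward(rng.uniform(size=sizes[0]), Ws, bs))
```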

Fig. 1. The convex relaxation for a ReLU neuron using: (Left) existing MIP formulations, and (Right) the formulations presented in this paper.

There are numerous contexts in which one may want to solve an optimization problem containing a trained neural network such as \(\texttt {NN}{}\). For example, such problems arise in deep reinforcement learning problems with high dimensional action spaces and where any of the cost-to-go function, immediate cost, or the state transition functions are learned by a neural network [3, 19, 40, 44, 55]. Alternatively, there has been significant recent interest in verifying the robustness of trained neural networks deployed in systems like self-driving cars that are incredibly sensitive to unexpected behavior from the machine learning model [15, 43, 48]. Relatedly, a string of recent work has used optimization over neural networks trained for visual perception tasks to generate new images which are “most representative” for a given class [42], are “dreamlike” [41], or adhere to a particular artistic style via neural style transfer [26].

1.1 MIP Formulation Preliminaries

In this work, we study mixed-integer programming (MIP) approaches for optimization problems containing trained neural networks. In contrast to heuristic or local search methods often deployed for the applications mentioned above, MIP offers a framework for producing provably optimal solutions. This is of particular interest in the verification problem, where rigorous dual bounds can guarantee robustness in a way that purely primal methods cannot.

We focus on constructing MIP formulations for the graph of ReLU neurons:

$$\begin{aligned} {\text {gr}}(\texttt {ReLU}{} \circ f; [L,U]) :=\left\{ \big (x, (\texttt {ReLU}{} \circ f)(x)\big ) \,:\, x \in [L,U] \right\} , \end{aligned}$$
(2)

where \(\circ \) is the standard function composition operator \((g \circ f)(x) = g(f(x))\). This substructure consists of a single ReLU activation function that takes as input an affine function \(f(x) = w \cdot x + b\) over an \(\eta \)-dimensional box-constrained input domain \([L,U]\). The nonlinearity is handled by introducing an auxiliary binary variable z to indicate whether \((\texttt {ReLU}{} \circ f) (x) = 0\) or \((\texttt {ReLU}{} \circ f) (x) = f(x)\) for a given value of x. We focus on this particular substructure because a MIP formulation for the entire network can be readily produced by composing the formulations for each individual neuron.

A MIP formulation is ideal if the extreme points of its linear programming (LP) relaxation are integral. Ideal formulations are highly desirable from a computational perspective, and offer the strongest possible convex relaxation for the set being formulated [50].

Our main contribution is an ideal formulation for a single ReLU neuron with no auxiliary continuous variables and an exponential number of inequality constraints. We show that each of these exponentially-many constraints is facet-defining under very mild conditions. We also provide a simple linear-time separation routine to generate the most violated inequality from the exponential family. This formulation is derived by constructing an ideal extended formulation that uses \(\eta \) auxiliary continuous variables and projecting them out. We evaluate our methods computationally on verification problems for image classification networks trained on the MNIST digit dataset, where we observe that separating over these exponentially-many inequalities solves smaller instances roughly 7 times faster than Gurobi’s default cut generation, and (nearly) matches the dual bounds on larger instances in orders of magnitude less time.

1.2 Relevant Prior Work

In recent years a number of authors have used MIP formulations to model trained neural networks [14, 16, 20, 25, 32, 38, 44, 46, 47, 49, 55, 56], mostly applying big-M formulation techniques to ReLU-based networks. When applied to a single neuron of the form (2), these big-M formulations will not be ideal or offer an exact convex relaxation; see Example 1 for an illustration. Additionally, a stream of literature in the deep learning community has studied convex relaxations in the original space of input/output variables x and y (or a dual representation thereof), primarily for verification tasks [9, 22, 23]. It has been shown that these convex relaxations are equivalent to those provided by the standard big-M MIP formulation, after projecting out the auxiliary binary variables (e.g. [46]). Moreover, some authors have investigated how to use convex relaxations within the training procedure in the hopes of producing neural networks with a priori robustness guarantees [21, 53, 54].

Beyond MIP and convex relaxations, a number of authors have investigated other algorithmic techniques for modeling trained neural networks in optimization problems, drawing primarily from the satisfiability, constraint programming, and global optimization communities [7, 8, 33, 37, 45]. Another intriguing direction studies restrictions to the space of models that may make the optimization problem over the network inputs simpler: for example, the classes of binarized [34] or input convex [1] neural networks.

Broadly, our work fits into a growing body of research in prescriptive analytics and specifically the “predict, then optimize” framework, which considers how to embed trained machine learning models into optimization problems [11, 12, 17, 18, 24, 28, 39]. Additionally, the formulations presented below have connections with existing structures studied in the MIP and constraint programming community like indicator variables and on/off constraints [4, 10, 13, 29, 30].

1.3 Starting Assumptions and Notation

We will assume that \(-\infty< L_i< U_i < \infty \) for each input component i. While a bounded input domain makes the formulations and analysis considerably more complicated than in the unbounded setting (see [4] for a similar phenomenon), it ensures that standard MIP representability conditions are satisfied (e.g. [50, Sect. 11]). Furthermore, variable bounds are natural in many applications (for example, verification problems), and are absolutely essential for ensuring reasonable dual bounds.

Define \(\breve{L}, \breve{U} \in \mathbb {R}^\eta \) such that, for each \(i \in \llbracket \eta \rrbracket \),

$$\begin{aligned} (\breve{L}_i, \breve{U}_i) :={\left\{ \begin{array}{ll} (L_i, U_i) &{} \text {if } w_i \geqslant 0 \\ (U_i, L_i) &{} \text {if } w_i < 0. \end{array}\right. } \end{aligned}$$

This definition implies that \(w_i\breve{L}_i \leqslant w_i\breve{U}_i\) for each i, which simplifies the handling of negative weights \(w_i < 0\). Take the values \(M^+(f) :=\max _{x \in [L,U]} f(x) = b + \sum _{i=1}^\eta w_i\breve{U}_i\) and \(M^-(f) :=\min _{x \in [L,U]} f(x) = b + \sum _{i=1}^\eta w_i\breve{L}_i\). Define \({\text {supp}}(w) :=\{ i \in \llbracket \eta \rrbracket \,:\, w_i \ne 0 \}\). Finally, take \(\mathbb {R}_{\geqslant 0} :=\{ v \in \mathbb {R} \,:\, v \geqslant 0 \}\) as the nonnegative orthant.

We say that strict activity holds for a given ReLU neuron \({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U])\) if \(M^-(f)< 0 < M^+(f)\), or in other words, if \({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U])\) is not equal to either \({\text {gr}}(0; [L,U])\) or \({\text {gr}}(f; [L,U])\). We assume for the remainder that strict activity holds for each ReLU neuron. This assumption is not onerous, as otherwise, the nonlinearity can be replaced by an affine function (either 0 or \(w \cdot x + b\)). Moreover, strict activity can be verified or disproven in time linear in \(\eta \).
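To make this notation concrete, the short Python sketch below computes \(\breve{L}\), \(\breve{U}\), \(M^-(f)\), \(M^+(f)\), and \({\text {supp}}(w)\) for a single neuron, and checks strict activity in time linear in \(\eta \); the helper names are ours and the data is a small toy neuron.

```python
import numpy as np

def neuron_data(w, b, L, U):
    """Compute breve-L, breve-U, M^-(f), M^+(f), and supp(w) for f(x) = w.x + b."""
    w, L, U = (np.asarray(a, dtype=float) for a in (w, L, U))
    L_breve = np.where(w >= 0, L, U)  # minimizes w_i * x_i over [L_i, U_i]
    U_breve = np.where(w >= 0, U, L)  # maximizes w_i * x_i over [L_i, U_i]
    M_minus = b + float(w @ L_breve)  # minimum of f over [L, U]
    M_plus = b + float(w @ U_breve)   # maximum of f over [L, U]
    support = np.flatnonzero(w)       # supp(w) = {i : w_i != 0}
    return L_breve, U_breve, M_minus, M_plus, support

def strictly_active(w, b, L, U):
    # Strict activity holds iff M^-(f) < 0 < M^+(f); computed in O(eta) time.
    _, _, M_minus, M_plus, _ = neuron_data(w, b, L, U)
    return M_minus < 0 < M_plus

# Toy neuron (the one used in Example 1 of Sect. 2.1): f(x) = x_1 + x_2 - 1.5 on [0,1]^2.
print(neuron_data([1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0]))
print(strictly_active([1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0]))  # True
```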

2 The ReLU Neuron

The ReLU is the workhorse of deep learning models: it is easy to reason about, introduces little computational overhead, and despite its simple structure is nonetheless capable of articulating complex nonlinear relationships.

2.1 A Big-M Formulation

A standard big-M formulation for \({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U])\) is:

$$\begin{aligned} y&\geqslant f(x) \end{aligned}$$
(3a)
$$\begin{aligned} y&\leqslant f(x) - M^-(f) \cdot (1-z) \end{aligned}$$
(3b)
$$\begin{aligned} y&\leqslant M^+(f) \cdot z \end{aligned}$$
(3c)
$$\begin{aligned} (x,y,z)&\in [L,U] \times \mathbb {R}_{\geqslant 0} \times \{0,1\}. \end{aligned}$$
(3d)

This is the formulation used recently in the bevy of papers referenced in Sect. 1.2. Unfortunately, this formulation is not necessarily ideal, as illustrated by the following example.

Example 1

If \(f(x) = x_1 + x_2 - 1.5\), formulation (3a)–(3d) for \({\text {gr}}(\texttt {ReLU}{} \circ f; [0,1]^2)\) is

$$\begin{aligned} y&\geqslant x_1 + x_2 - 1.5 \end{aligned}$$
(4a)
$$\begin{aligned} y&\leqslant x_1 + x_2 - 1.5 + 1.5(1-z) \end{aligned}$$
(4b)
$$\begin{aligned} y&\leqslant 0.5z \end{aligned}$$
(4c)
$$\begin{aligned} (x,y,z)&\in [0,1]^2\times \mathbb {R}_{\geqslant 0} \times [0,1] \end{aligned}$$
(4d)
$$\begin{aligned} z&\in \{0,1\}. \end{aligned}$$
(4e)

The point \((\hat{x},\hat{y},\hat{z}) = ((1,0),0.25,0.5)\) is feasible for the LP relaxation (4a)–(4d); however, \((\hat{x},\hat{y}) \equiv ((1,0),0.25)\) is not in \({\text {Conv}}({\text {gr}}(\texttt {ReLU}{} \circ f; [0,1]^2))\), and so the formulation does not offer an exact convex relaxation (and, hence, is not ideal). See Fig. 1 for an illustration: on the left, the big-M relaxation projected onto \((x,y)\)-space, and on the right, the tightest possible convex relaxation.

The integrality gap of (3a)–(3d) can be arbitrarily bad, even in fixed dimension \(\eta \).

Example 2

Fix \(\gamma > 0\) and even \(\eta \in \mathbb {N}\). Take the affine function \(f(x) = \sum _{i=1}^\eta x_i\), the input domain \([L,U] = [-\gamma ,\gamma ]^\eta \), and the point \(\hat{x} = \gamma \cdot (1,-1,\cdots ,1,-1)\), a scaled vector of alternating ones and negative ones. We can check that \((\hat{x},\hat{y},\hat{z}) = (\hat{x},\frac{1}{2}\gamma \eta ,\frac{1}{2})\) is feasible for the LP relaxation of the big-M formulation (3a)–(3d). Additionally, \(f(\hat{x}) = 0\), and for any \(\tilde{y}\) such that \((\hat{x},\tilde{y}) \in {\text {Conv}}({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U]))\), necessarily \(\tilde{y} = 0\). Therefore, there exists a point \(\hat{x}\) in the input domain where the tightest possible convex relaxation (for example, from an ideal formulation) is exact, but the big-M relaxation deviates from it by at least \(\frac{1}{2}\gamma \eta \).

Intuitively, this example suggests that the big-M formulation is particularly weak around the boundary of the input domain, as it cares only about the value f(x) of the affine function, and not the particular input value x.
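For concreteness, a minimal gurobipy sketch of the big-M formulation (3a)–(3d) for a single neuron is given below; the helper name add_big_m_neuron and the toy usage are our own illustration (using the bound notation of Sect. 1.3), not the tf.opt implementation used in Sect. 3.

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def add_big_m_neuron(model, x, w, b, L, U):
    """Add the big-M formulation (3a)-(3d) of y = ReLU(w.x + b) to `model`.

    `x` is a list of existing Gurobi variables with bounds [L, U]; returns (y, z).
    """
    w, L, U = (np.asarray(a, dtype=float) for a in (w, L, U))
    L_breve = np.where(w >= 0, L, U)
    U_breve = np.where(w >= 0, U, L)
    M_minus = b + float(w @ L_breve)  # minimum of f over the box
    M_plus = b + float(w @ U_breve)   # maximum of f over the box
    f = gp.quicksum(w[i] * x[i] for i in range(len(x))) + b
    y = model.addVar(lb=0.0, name="y")
    z = model.addVar(vtype=GRB.BINARY, name="z")
    model.addConstr(y >= f)                      # (3a)
    model.addConstr(y <= f - M_minus * (1 - z))  # (3b)
    model.addConstr(y <= M_plus * z)             # (3c)
    return y, z

# Example 1 data: maximize y over the formulation for f(x) = x_1 + x_2 - 1.5 on [0,1]^2.
m = gp.Model()
x = [m.addVar(lb=0.0, ub=1.0, name=f"x{i}") for i in range(2)]
y, z = add_big_m_neuron(m, x, [1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0])
m.setObjective(y, GRB.MAXIMIZE)
m.optimize()
```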

2.2 An Ideal Extended Formulation

It is possible to produce an ideal extended formulation for the ReLU neuron by introducing auxiliary continuous variables. The “multiple choice” formulation is

$$\begin{aligned} (x,y)&= (x^0,y^0) + (x^1,y^1) \end{aligned}$$
(5a)
$$\begin{aligned} y^0&= 0 \geqslant w \cdot x^0 + b(1-z) \end{aligned}$$
(5b)
$$\begin{aligned} y^1&= w \cdot x^1 + bz \geqslant 0 \end{aligned}$$
(5c)
$$\begin{aligned} L(1-z)&\leqslant x^0 \leqslant U(1-z) \end{aligned}$$
(5d)
$$\begin{aligned} Lz&\leqslant x^1 \leqslant Uz \end{aligned}$$
(5e)
$$\begin{aligned} z&\in \{0,1\}, \end{aligned}$$
(5f)

This is an ideal extended formulation for piecewise linear functions [52]. It can alternatively be derived from techniques introduced by Balas [5, 6]. Although the multiple choice formulation offers the tightest possible convex relaxation for a single neuron, it requires a copy \(x^0\) of the input variables (note that it is straightforward to use Eq. (5a) to eliminate the second copy \(x^1\)). This means that when the multiple choice formulation is applied to every neuron in the network to formulate \(\texttt {NN}{}\), the total number of continuous variables required is \(m_0 + \sum _{i=1}^{s} (m_{i-1}+1)m_{i}\) (using the notation of (1), where \(m_i\) is the number of neurons in layer i). In contrast, the big-M formulation requires only \(m_0 + \sum _{i=1}^s m_i\) continuous variables to formulate the entire network. As we will see in Sect. 3.2, this quadratic growth in size can quickly become burdensome. Additionally, a folklore observation in the MIP community is that multiple choice formulations tend not to perform as well as expected in simplex-based branch-and-bound algorithms, likely due to degeneracy introduced by the block structure [51].
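For comparison, a gurobipy sketch of the multiple choice formulation (5a)–(5f) follows, with \(x^0\) eliminated through (5a) as \(x^0 = x - x^1\); the function name and layout are ours and are meant only as an illustration.

```python
import gurobipy as gp
from gurobipy import GRB

def add_multiple_choice_neuron(model, x, w, b, L, U):
    """Add the extended formulation (5a)-(5f) of y = ReLU(w.x + b) to `model`.

    Keeps the input copy x1 and eliminates x0 via (5a) as x0 = x - x1.
    Returns (y, z, x1).
    """
    n = len(x)
    y = model.addVar(lb=0.0, name="y")
    z = model.addVar(vtype=GRB.BINARY, name="z")
    x1 = [model.addVar(lb=-GRB.INFINITY, name=f"x1_{i}") for i in range(n)]
    f0 = gp.quicksum(w[i] * (x[i] - x1[i]) for i in range(n)) + b * (1 - z)
    f1 = gp.quicksum(w[i] * x1[i] for i in range(n)) + b * z
    model.addConstr(f0 <= 0)  # (5b): y0 = 0 >= w.x0 + b(1-z)
    model.addConstr(y == f1)  # (5c): y = y1 = w.x1 + bz; y >= 0 via y's lower bound
    for i in range(n):
        model.addConstr(L[i] * (1 - z) <= x[i] - x1[i])  # (5d), lower
        model.addConstr(x[i] - x1[i] <= U[i] * (1 - z))  # (5d), upper
        model.addConstr(L[i] * z <= x1[i])               # (5e), lower
        model.addConstr(x1[i] <= U[i] * z)               # (5e), upper
    return y, z, x1
```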

2.3 An Ideal Non-extended Formulation

We now present a non-extended ideal formulation for the ReLU neuron, stated only in terms of the original variables (x, y) and the single binary variable z. Put another way, it is the strongest possible tightening that can be applied to the big-M formulation (3a)–(3d), and so it matches the strength of the multiple choice formulation without the additional continuous variables.

Proposition 1

Take some affine function \(f(x) = w \cdot x + b\) over the input domain \([L,U]\). The following is an ideal formulation for \({\text {gr}}(\texttt {ReLU}{} \circ f; [L,U])\):

$$\begin{aligned} y&\geqslant w \cdot x + b \end{aligned}$$
(6a)
$$\begin{aligned} y&\leqslant \sum _{i \in I} w_i(x_i - \breve{L}_i(1-z)) + \left( b + \sum _{i \not \in I} w_i\breve{U}_i\right) z \quad \forall I \subseteq {\text {supp}}(w) \end{aligned}$$
(6b)
$$\begin{aligned} (x,y,z)&\in [L,U] \times \mathbb {R}_{\geqslant 0} \times \{0,1\} \end{aligned}$$
(6c)

Proof

See Appendix A.1.    \(\square \)

Furthermore, each of the exponentially-many inequalities in (6b) is necessary.

Proposition 2

Each inequality in (6b) is facet-defining.

Proof

See Appendix A.2.    \(\square \)

Proposition 2 requires the strict activity assumption introduced in Sect. 1.3. Under the same condition, it is also possible to show that (6a) is facet-defining, but we omit the proof in this extended abstract for brevity. As a result of this and Proposition 2, the formulation (6a)–(6c) is minimal (modulo variable bounds).

The proof of Proposition 2 offers a geometric interpretation of the facets induced by (6b). Each facet is a convex combination of two faces: an \((\eta - |I|)\)-dimensional face consisting of all feasible points with \(z = 0\) and \(x_i = \breve{L}_i\) for all \(i \in \llbracket \eta \rrbracket \setminus I\), and an |I|-dimensional face consisting of all feasible points with \(z = 1\) and \(x_i = \breve{U}_i\) for all \(i \in I\).
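To make this concrete, the brute-force sketch below enumerates the family (6b) for the data of Example 1 (where \({\text {supp}}(w)\) has only two elements) and checks each inequality at the fractional big-M point \(((1,0),0.25,0.5)\); in the paper's indexing, the inequality with \(I = \{2\}\) cuts that point off. The enumeration is purely illustrative, since Proposition 3 below renders it unnecessary.

```python
import itertools
import numpy as np

def rhs_6b(I, x, z, w, b, L_breve, U_breve, support):
    # Right-hand side of the inequality in (6b) indexed by the subset I.
    I = set(I)
    total = b * z
    for i in support:
        if i in I:
            total += w[i] * (x[i] - L_breve[i] * (1 - z))
        else:
            total += w[i] * U_breve[i] * z
    return total

# Example 1 data (0-based indices) and the fractional big-M point ((1,0), 0.25, 0.5).
w, b = np.array([1.0, 1.0]), -1.5
L_breve, U_breve = np.array([0.0, 0.0]), np.array([1.0, 1.0])
support = [0, 1]
x_hat, y_hat, z_hat = np.array([1.0, 0.0]), 0.25, 0.5

for r in range(len(support) + 1):
    for I in itertools.combinations(support, r):
        rhs = rhs_6b(I, x_hat, z_hat, w, b, L_breve, U_breve, support)
        status = "violated" if y_hat > rhs + 1e-9 else "satisfied"
        print(f"I = {set(I) if I else 'empty'}: y <= {rhs:.3f} ({status})")
# I = {1} here (the paper's I = {2}) gives y <= -0.25, which cuts off y_hat = 0.25.
```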

It is also possible to separate over the family (6b) in time linear in \(\eta \).

Proposition 3

Take a point \((\hat{x},\hat{y},\hat{z}) \in [L,U] \times \mathbb {R}_{\geqslant 0} \times [0,1]\), along with the set

$$\begin{aligned} \hat{I} = \left\{ i \in {\text {supp}}(w) \,:\, w_i\hat{x}_i < w_i\left( \breve{L}_i(1-\hat{z}) + \breve{U}_i\hat{z}\right) \right\} . \end{aligned}$$

If any constraint in the family (6b) is violated at \((\hat{x},\hat{y},\hat{z})\), then the one corresponding to \(\hat{I}\) is the most violated.

Proof

Follows from inspecting the family (6b): each has the same left-hand-side, and so to maximize violation, it suffices to select the subset I that minimizes the right-hand-side. This can be performed in a separable manner, independently for each component \(i \in {\text {supp}}(w)\), giving the result.    \(\square \)
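A direct implementation of this separation routine might look as follows; it is a sketch with our own function names, computing \(\hat{I}\) and the corresponding right-hand side in a single \(O(\eta )\) pass and returning the most violated inequality, if any.

```python
import numpy as np

def separate_6b(x_hat, y_hat, z_hat, w, b, L_breve, U_breve, tol=1e-6):
    """Most-violated-inequality separation over the family (6b) (Proposition 3).

    Returns (I_hat, rhs) if the inequality indexed by I_hat is violated by more
    than `tol` at (x_hat, y_hat, z_hat), and None otherwise. Runs in O(eta).
    """
    x_hat, w, L_breve, U_breve = (np.asarray(a, dtype=float)
                                  for a in (x_hat, w, L_breve, U_breve))
    support = np.flatnonzero(w)
    # Include i in I_hat exactly when its "in-I" term is smaller than its "not-in-I" term.
    in_I = w * x_hat < w * (L_breve * (1 - z_hat) + U_breve * z_hat)
    I_hat = [int(i) for i in support if in_I[i]]
    rhs = b * z_hat
    for i in support:
        if in_I[i]:
            rhs += w[i] * (x_hat[i] - L_breve[i] * (1 - z_hat))
        else:
            rhs += w[i] * U_breve[i] * z_hat
    return (I_hat, rhs) if y_hat > rhs + tol else None

# The fractional point of Example 1 yields I_hat = [1] and the cut y <= -0.25.
print(separate_6b([1.0, 0.0], 0.25, 0.5, [1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0]))
```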

Observe that the inequalities (3b) and (3c) are equivalent to those in (6b) with \(I = {\text {supp}}(w)\) and \(I = \varnothing \), respectively (modulo components i with \(w_i=0\)). This suggests an iterative scheme to produce strong relaxations for ReLU neurons: start with the big-M formulation (3a)–(3d), and use Proposition 3 to separate strengthening inequalities from the exponential family (6b) as they are needed. We evaluate this approach in the following computational study.
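One possible way to realize this scheme in Gurobi is via a user-cut callback, sketched below; it assumes the hypothetical add_big_m_neuron and separate_6b helpers from the earlier sketches and stores the per-neuron data on the model. This illustrates the mechanism only and is not the tf.opt implementation used in Sect. 3.

```python
import gurobipy as gp
from gurobipy import GRB

def relu_cut_callback(model, where):
    """User-cut callback separating inequalities (6b) at fractional LP nodes.

    Assumes model._neurons holds, for each neuron, a tuple
    (x_vars, y_var, z_var, w, b, L_breve, U_breve) recorded when the big-M
    formulation was added.
    """
    if where != GRB.Callback.MIPNODE:
        return
    if model.cbGet(GRB.Callback.MIPNODE_STATUS) != GRB.OPTIMAL:
        return
    for x_vars, y_var, z_var, w, b, L_breve, U_breve in model._neurons:
        x_hat = model.cbGetNodeRel(x_vars)
        y_hat = model.cbGetNodeRel(y_var)
        z_hat = model.cbGetNodeRel(z_var)
        cut = separate_6b(x_hat, y_hat, z_hat, w, b, L_breve, U_breve)
        if cut is None:
            continue
        I_hat, _ = cut
        not_I = [i for i in range(len(w)) if w[i] != 0 and i not in set(I_hat)]
        rhs = (gp.quicksum(w[i] * (x_vars[i] - L_breve[i] * (1 - z_var)) for i in I_hat)
               + (b + sum(w[i] * U_breve[i] for i in not_I)) * z_var)
        model.cbCut(y_var <= rhs)  # add the most violated inequality from (6b)

# Usage sketch: m.Params.PreCrush = 1; m._neurons = [...]; m.optimize(relu_cut_callback)
```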

3 Computational Experiments

To conclude the work, we study the strength of the ideal formulations presented in Sect. 2 for individual ReLU neurons. We study the verification problem on image classification networks trained on the canonical MNIST digit dataset [36]. We train a neural network \(f : [0,1]^{28 \times 28} \rightarrow \mathbb {R}^{10}\), where the 10 outputs correspond to the logits for each of the digits from 0 to 9. Given a labeled image \(\tilde{x} \in [0,1]^{28 \times 28}\), our goal is to prove or disprove the existence of a perturbation of \(\tilde{x}\) such that the neural network f produces a wildly different classification result. If \(f(\tilde{x})_i = \max _{j=1}^{10} f(\tilde{x})_j\), then image \(\tilde{x}\) is placed in class i. To evaluate robustness around \(\tilde{x}\) with respect to class j, we can solve the following optimization problem for some small constant \(\epsilon > 0\):

$$\begin{aligned} \max _{x \in [0,1]^{28\times 28} \,:\, \Vert x - \tilde{x}\Vert _\infty \leqslant \epsilon }\ f(x)_j - f(x)_i. \end{aligned}$$

If the optimal solution (or a valid dual bound thereof) is less than zero, this verifies that our network is robust around \(\tilde{x}\) in the sense that we cannot produce a small perturbation that will flip the classification from i to j.
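As a small illustration of how the perturbation ball yields the box domain assumed by the neuron formulations above, the sketch below (with placeholder names) computes the input bounds \([L,U]\) for a verification instance by intersecting the \(\epsilon \)-ball around \(\tilde{x}\) with the pixel range [0, 1].

```python
import numpy as np

def verification_input_bounds(x_tilde, epsilon):
    # Box [L, U] around the image x_tilde: infinity-norm ball of radius epsilon,
    # intersected with the pixel range [0, 1]. These bounds are what the
    # per-neuron formulations above use as the input domain.
    x_tilde = np.asarray(x_tilde, dtype=float)
    L = np.clip(x_tilde - epsilon, 0.0, 1.0)
    U = np.clip(x_tilde + epsilon, 0.0, 1.0)
    return L, U

# The objective is then to maximize f(x)_j - f(x)_i over x in [L, U] subject to
# the network constraints; a nonpositive optimal value certifies robustness at x_tilde.
```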

We train a smaller and a larger model, each with two convolutional layers with ReLU activation functions, feeding into a dense layer of ReLU neurons, and then a final dense linear layer. TensorFlow pseudocode specifying the two network architectures is included in Fig. 2. We generate 100 instances for each network by randomly selecting images \(\tilde{x}\) with true label i from the test data, along with a random target adversarial class \(j \ne i\). Note that we make no attempts to utilize recent techniques that train the networks to be verifiable [21, 53, 54, 56].

Fig. 2. TensorFlow pseudocode specifying the two network architectures used.

For all experiments, we use the Gurobi v7.5.2 solver, running with a single thread on a machine with 128 GB of RAM and 32 CPUs at 2.30 GHz. We use a time limit of 30 min (1800 s) for each run. We perform our experiments using the tf.opt package for optimization over trained neural networks; tf.opt is under active development at Google, with the intention to open source the project in the future. Below, the big-M + (6b) method is the big-M formulation (3a)–(3d) paired with separation over the exponential family (6b), and with Gurobi’s cutting plane generation turned off. Similarly, the big-M and the extended methods are the big-M formulation (3a)–(3d) and the extended formulation (5a)–(5f), respectively, with default Gurobi settings. Finally, the big-M + no cuts method turns off Gurobi’s cutting plane generation without adding separation over (6b).

Table 1. Results for smaller network. Shifted geometric mean for time and optimality gap taken over 100 instances (shift of 10 and 1, respectively). The “win” column is the number of (solved) instances on which the method is the fastest.
Fig. 3. Number of small network instances solved within a given amount of time. Curves to the upper left are better, with more instances solved in less time.

3.1 Small ReLU Network

We start with a smaller ReLU network whose architecture is depicted in TensorFlow pseudocode in Fig. 2a. The model attains 97.2% test accuracy. We select a perturbation ball radius of \(\epsilon = 0.1\). We report the results in Table 1 and in Fig. 3. The big-M + (6b) method solves instances 7 times faster on average than the big-M method. Indeed, for 79 out of 100 instances the big-M method does not prove optimality after 30 min, and it is never the fastest choice (the “win” column). Moreover, the big-M + no cuts method times out on every instance, implying that using some cuts is important. The extended method is roughly 5 times slower than the big-M + (6b) method, but only exceeds the time limit on 19 instances, and so is substantially more reliable than the big-M method for a network of this size. From this, we conclude that the additional strength offered by the ideal formulations (5a)–(5f) and (6a)–(6c) can yield substantial computational improvement over the big-M formulation (3a)–(3d).

3.2 Larger ReLU Network

Now we turn to the larger ReLU network described in Fig. 2b. The trained model attains \(98.5\%\) test accuracy. We select a perturbation ball radius of \(\epsilon = 10/256\). For these larger networks, we eschew solving the problems to optimality and focus on the quality of the dual bound available at the root node. As Gurobi does not reliably produce feasible primal solutions for these larger instances, we turn off primal heuristics and compare the approaches based on the “verification gap”, which measures how far the dual bound is from proving robustness (i.e. an objective value of 0). To evaluate the quality of a dual bound, we measure the “improvement percentage” \(\frac{\texttt {big\_M\_bound} - \texttt {other\_bound}}{\texttt {big\_M\_bound}}\), where our baseline for comparison, big_M_bound, is the bound from the big-M + no cuts method, and other_bound is the dual bound being compared.

Table 2. Results at the root node for larger network. Shifted geometric mean of bound, time, and improvement over 100 instances (shift of 10).

We report aggregated results over 100 instances in Table 2. First, we are unable to solve even the LP relaxation of the extended method on any of the instances in the allotted 30 min, due to the quadratic growth in size. In contrast, the LP relaxation of the big-M + no cuts method can be solved very quickly. The big-M + (6b) method strengthens this LP bound by more than \(15\%\) on average, and only takes roughly \(2.5\times \) as long. This is not only because the separation runs very quickly, but also for a technical reason: when Gurobi’s cutting planes are disabled, the callback separating over (6b) is only called a small number of times, as determined by Gurobi’s internal cut selection procedure. Therefore, this 15% improvement is the result of only a small number of separation rounds, not an exhaustive iterative procedure (i.e. Gurobi terminates the cut loop well before all violated inequalities have been separated).

We may compare these results against the big-M method, which provides a modestly better bound (roughly 18% improvement), but requires almost two orders of magnitude more time to produce it. For another comparison, the big-M + 15 s timeout method imposes a 15 s time limit on Gurobi, a tighter upper bound on the maximum time used by the big-M + (6b) method. In this short amount of time, Gurobi is not able to improve the bound substantially, achieving less than 4% improvement. This suggests that the inequalities (6b) are not trivial to infer by generic cutting plane methods, and that it takes Gurobi many rounds of cut generation to achieve the same bound improvement we obtain by restricting ourselves to the cuts in (6b).