1 Introduction

Continuous piecewise linear (CPWL) programming refers to optimization problems with CPWL objective functions and CPWL constraints. Intuitively, the CPWL programming technique is useful for continuous optimization. On the one hand, a CPWL program can be transformed into a series of linear problems, for which very efficient algorithms exist. On the other hand, any continuous function can be approximated by a CPWL function to arbitrary precision, so that any continuous nonlinear program can be solved approximately via CPWL programming.

In the 1960s, Beale applied CPWL approximation and CPWL programming to minimize a separable nonlinear function; see [1, 2] for details. Since then, CPWL programming has attracted the attention of researchers. For example, problems without explicit formulation have been discussed in [3] and a local optimality condition has been given there, which is applicable to a special class of CPWL functions. In order to analyze complicated CPWL problems, it is necessary to represent a CPWL function explicitly. For example, by expressing one-dimensional CPWL functions through their breakpoints, a generalized simplex algorithm for minimizing separable CPWL functions has been proposed in [4–6]. Similarly, a series of formulations for CPWL functions have been given in [7–10], and so on. These models have been summarized and insightfully compared in [11]. Another kind of method for modeling CPWL functions, called compact representation, is motivated by circuit analysis. Models of this kind can be found in [12–17]. Compact representation methods have been used in identification, control, and other applications, but research on CPWL programming in compact representation is rare. In this paper, we first compare the above two classes of models and then study CPWL programming in compact representation.

In the sequel, we first review some properties of CPWL functions and introduce representation methods in Sect. 2, where the comparison among these representations is discussed. Section 3 proves the existence of exact penalty. This property allows us to concentrate on unconstrained CPWL programming. For the unconstrained case, a local optimality condition is proved and an algorithm is established in Sect. 4. Then, we use numerical experiments to illustrate the obtained theoretical results and evaluate the proposed algorithm in Sect. 5. Finally, a brief conclusion is given in Sect. 6.

2 Continuous Piecewise Linear Functions and Representations

Generally, a CPWL programming problem takes the following form:

$$ \mathrm{min} f_0(x) \quad \mathrm{s.t.}\quad f_j(x) \leq0,\quad 1 \leq j \leq J,$$
(1)

where \(f_j(x), 0 \leq j \leq J\), are continuous piecewise linear functions, and the domain of the problem is denoted by Ω.

Basically, a CPWL function is a continuous function that equals one of finitely many distinct affine functions at any point of its domain. Let f(x) be a CPWL function on a convex set Ω. We have

$$ f(x) \in \bigl\{l_1(x), l_2(x), \ldots, l_N(x) \bigr\},\quad \forall x \in\varOmega,$$
(2)

where \(l_i(x), 1 \leq i \leq N\), are finitely many distinct affine functions. From (2), the subregions can be defined as \(\varOmega_i = \{x : f(x) = l_i(x), x \in \varOmega\}\) and the domain is partitioned into N subregions, i.e.,

$$\varOmega= \bigcup_{i = 1}^N\varOmega_i \quad\mathrm{and}\quad \stackrel{\circ}{\varOmega}_{i} \cap\stackrel{\circ}{\varOmega}_{j}= \emptyset, \quad\forall i \neq j,$$

where \(\stackrel{\circ}{\varOmega}_{i}\) denotes the interior of \(\varOmega_i\). The continuity of f(x) requires \(l_i(x) = l_j(x), \forall x \in \varOmega_i \cap \varOmega_j, \forall i, j\). Furthermore, it has been proved in [18] that if Ω is a polyhedron, then each subregion \(\varOmega_i\) is either a polyhedron or a union of polyhedra. Thus, without any loss of generality, we assume that Ω and \(\varOmega_i, \forall i\), are polyhedra, and we refer to \(\varOmega_i\) as a linear subregion. Then, one can represent a CPWL function as

$$f(x) = l_i(x), \quad\forall x \in\varOmega_i.$$

In the above expression, f(x) is defined by affine functions on the subregions. Hence we call it the piecewise representation. If (1) is given in piecewise representation, then the optimum can be obtained by solving the problems in the linear subregions one by one. However, it is not easy to construct CPWL functions in piecewise representation to describe complicated systems, since it is very hard to satisfy the continuity condition when adjusting the parameters. Therefore, it is more interesting to consider CPWL programming expressed by the explicit formulations given in Sects. 2.1 and 2.2.

2.1 Vertex Representation

It is popular to use the vertices of \(\varOmega_i\) and some binary variables or special ordered sets to represent a CPWL function. A typical method is the convex combination model, which has been discussed insightfully in [7, 9, 19, 20]. This formulation can represent any separable CPWL function \(f(x) = \sum_{j=1}^{n} f^{j}(x(j))\), where \(f^j(x(j))\) is a one-dimensional piecewise linear function, which is affine on the segment \([d_{j}^{k}, d_{j}^{k+1}]\) for any k=0,1,…,K−1. If the breakpoints (the vertices in one dimension) \(d_{j}^{k}\) and their function values are all known, then we can write

$$ x(j) = \sum_{k=0}^Kd_j^k \lambda_j^k,\quad \forall j\quad\mathrm{and}\quad f(x) = \sum_{j=1}^n\sum_{k=0}^K f^j\bigl(d_j^k\bigr) \lambda_j^k,$$
(3)

where \(\sum_{k=0}^{K} \lambda_{j}^{k} = 1\), \(\lambda_{j}^{k} \geq0\), and each \(\{ \lambda_{j}^{k} \}\) must be a special ordered set of type 2 (SOS2), which has been discussed in [2] and defined in [21]. For a given j, the set \(\{\lambda_{j}^{k}\}\) being an SOS2 means that: (i) all the variables are nonnegative; (ii) at most two variables can be positive; (iii) if two are positive, then their subscripts k must be adjacent. SOS2 is actually a kind of logical constraint. Hence, when the functions in CPWL programming (1) are represented by a convex combination model, (1) can be formulated as a mixed-integer linear program (MILP).
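As a concrete illustration of (3), the following sketch (plain NumPy; the breakpoint data are made up for illustration) evaluates a separable CPWL function by activating, for each coordinate, only the two adjacent weights \(\lambda_j^k\) that an SOS2 constraint would allow; in an actual MILP these weights are decision variables rather than quantities computed in advance.

```python
import numpy as np

def eval_separable_cpwl(x, breakpoints, values):
    """Evaluate f(x) = sum_j f^j(x(j)) via the convex combination model (3).

    breakpoints[j] : sorted breakpoints d_j^0, ..., d_j^K of coordinate j
    values[j]      : function values f^j(d_j^k) at those breakpoints
    For each coordinate, the SOS2-feasible weights lambda_j^k are the two
    interpolation weights of the segment containing x(j); all others are zero.
    """
    total = 0.0
    for xj, d, v in zip(x, breakpoints, values):
        k = np.searchsorted(d, xj) - 1          # segment [d^k, d^{k+1}] containing x(j)
        k = min(max(k, 0), len(d) - 2)
        lam = (xj - d[k]) / (d[k + 1] - d[k])   # weight on d^{k+1}; (1 - lam) on d^k
        total += (1.0 - lam) * v[k] + lam * v[k + 1]
    return total

# toy data encoding f(x) = |x(1)| + |x(2)| (assumed example)
breakpoints = [np.array([-1.0, 0.0, 1.0]), np.array([-1.0, 0.0, 1.0])]
values = [np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])]
print(eval_separable_cpwl([0.3, -0.5], breakpoints, values))  # 0.8
```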

For comparison, another formulation, named the incremental model, has been proposed in [9]. This model uses another special constraint called SOSX. The original forms of the convex combination model and the incremental model are restricted to separable CPWL functions. When extending them to higher-dimensional spaces, the breakpoints become the vertices of the linear subregions. If all the vertices of \(\varOmega_i\) are known, then nonseparable CPWL functions can be represented by several models, most of which are locally ideal. For details, one can see [11], where the computational complexity and the model sizes are compared. Since the formulations in [11] are built from vertices, we call them vertex representations of CPWL functions.

In some particular problems, such as the merge-in-transit problem, the transportation problem, and the network flow problem, the objective functions can be represented by one of the vertex representations; see [8, 20] for details. Then, the corresponding problem can be posed as a MILP and solved. We also want to solve nonlinear problems approximately via CPWL programming. Thus, there have been some works on nonlinear interpolation or fitting using vertex representations. Recently, a method has been proposed in [22] to interpolate nonlinear functions in 2 or 3 dimensions. This method partitions the domain into simplices and uses the convex combination of the vertices to obtain a CPWL function. Similarly, a method has been discussed in [23] for approximating two-dimensional nonlinear functions by CPWL functions constructed from a simplicial partition, whose structure can be optimized.

2.2 Compact Representation

In order to describe piecewise linear resistors, the first compact representation for CPWL functions was proposed in [12]. The main property of compact representation models is that CPWL functions are represented by closed-form formulations and the continuity on the boundaries of the linear subregions holds naturally. Basically, these representations can be written as linear combinations of basis functions. Take the hinging hyperplanes (HH) model proposed in [13] as an example. The HH has the following form:

$$ f(x) = \sum_{m=1}^M H_m(x),\qquad H_m(x) = \beta_m \max \bigl\{ 0, l_m(x) \bigr\},\quad \beta_m = \pm 1,$$
(4)

where \(H_m(x)\) are the basis functions, the number of which is M, and \(l_m(x)\) are affine functions. Since the composition of continuous functions is still continuous, the continuity of (4) holds naturally and no additional constraints, such as SOS2, are needed. The HH can be used to approximate nonlinear systems and its parameters can be identified from observed data by the algorithms in [24, 25]. These algorithms are effective and the HH has been applied widely; see, e.g., [26–29].
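To make the form of (4) concrete, here is a minimal evaluation sketch in NumPy, assuming the hinge form \(H_m(x) = \beta_m \max\{0, a_m^T x + b_m\}\); the instance data are invented for illustration only.

```python
import numpy as np

def eval_hh(x, beta, A, b):
    """Evaluate an HH model f(x) = sum_m beta_m * max{0, a_m^T x + b_m}.

    beta : (M,) array of signs +1/-1
    A    : (M, n) array whose rows are the vectors a_m
    b    : (M,) array of offsets b_m
    """
    hinges = np.maximum(0.0, A @ np.asarray(x, dtype=float) + b)  # hinge terms max{0, l_m(x)}
    return float(beta @ hinges)

# small assumed instance in R^2
beta = np.array([1.0, -1.0, 1.0])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, -0.5, 0.2])
print(eval_hh([0.4, 0.7], beta, A, b))  # 0.4 - 0.2 + 1.3 = 1.5
```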

Another popular compact representation is the high level canonical piecewise linear representation (HL-CPWL), which has been given in [14] and discussed in [30]. The HL-CPWL can approximate high-dimensional functions and has been applied in [31–33], and so on. The HL-CPWL is based on simplicial partition: the basis functions are constructed from the boundaries of the simplices and the linear coefficients of the basis functions are calculated by the least-squares method. Adjusting the coefficients amounts to configuring the values at the vertices of the simplices. This is quite different from the methods used in [22, 23], where the function values at the vertices are fixed and must be given.

Although the HH and the HL-CPWL perform well in approximating some nonlinear functions, they cannot represent all CPWL functions in 2 or higher dimensions. The lack of universal representation capability essentially affects the identification performance, which has been analyzed and explained in [18]. Without this capability, we may need a large number of basis functions to achieve satisfactory approximation precision. Moreover, only when a model has such a capability are the corresponding theoretical results applicable to all CPWL functions. Therefore, researchers pursue CPWL models with universal representation capability. From (4), it is not hard to see that the boundaries of the subregions of the HH must be entire hyperplanes (lines, in two dimensions) throughout the whole space. Hence, the structure of the HH is not flexible enough. This is the reason why the HH cannot represent all CPWL functions. Naturally, the representation capability can be extended by adding affine functions to generalize the basis function, i.e., letting the basis function be \(B_{m}(x) = \max \{ l^{m}_{1}(x), l^{m}_{2}(x), l^{m}_{3} (x),\ldots \}\). The model using such basis functions is called generalized hinging hyperplanes (GHH). It has been proved in [15] that in n-dimensional space, using at most n+1 affine functions in \(B_m(x)\), any CPWL function can be represented. That is to say, for an arbitrary CPWL function \(f(x): {\mathbb R}^n \rightarrow {\mathbb R}\), there are M basis functions \(B_m(x)\) and coefficients \(\beta_m\) such that

$$ f(x) = \sum_{m=1}^M\beta_m B_m(x) = \sum_{m=1}^M\beta_m \max \bigl\{ l^m_1(x),l^m_2(x), \ldots, l^m_{n+1} (x)\bigr\},$$
(5)

where \(\beta_m = \pm 1\), \(B_m(x)\) are basis functions, and \(l^{m}_{i}(x), 1 \leq i \leq n + 1\), are affine functions. The identification algorithm for the GHH has been given in [34]. Since the GHH has universal representation capability, we can analyze CPWL programming by using the GHH to represent CPWL functions.
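The following is a minimal sketch of evaluating a GHH model (5), where each basis function is the maximum of at most n+1 affine functions; the two-dimensional instance is an assumption made purely for illustration.

```python
import numpy as np

def eval_ghh(x, beta, bases):
    """Evaluate a GHH model (5): f(x) = sum_m beta_m * max_i {a_mi^T x + b_mi}.

    bases[m] = (A_m, c_m), where A_m has shape (k_m, n) and c_m has shape (k_m,),
    so that B_m(x) is the maximum of the k_m affine functions a_mi^T x + b_mi.
    """
    x = np.asarray(x, dtype=float)
    return float(sum(bm * np.max(Am @ x + cm) for bm, (Am, cm) in zip(beta, bases)))

# assumed toy instance in R^2 with two basis functions
bases = [
    (np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]), np.zeros(3)),   # B_1(x) = max{x1, x2, 0}
    (np.array([[1.0, 1.0], [0.0, 0.0]]), np.array([-1.0, 0.0])),     # B_2(x) = max{x1 + x2 - 1, 0}
]
beta = [1.0, -1.0]
print(eval_ghh([0.8, 0.3], beta, bases))  # 0.8 - 0.1 = 0.7
```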

2.3 Comparison of Vertex and Compact Representation

Vertex representation and compact representation provide two ways to express CPWL functions. Although in theory both methods can represent all CPWL functions, the transformation between them is not a tractable problem except in special cases, e.g., for separable functions. Vertex representation (3) can be expressed by the HH as follows:

$$f(x) = \sum_{j=1}^n f^j\bigl(x(j)\bigr) = \sum_{j=1}^n \sum _{k=1}^{K} a_j^k\max \bigl\{ 0, x(j) - d_j^{k-1} \bigr\},$$

where

$$a_j^k = \left \{ \begin{array}{l@{\quad}l}\frac{f^j(d_j^1) - f^j(d_j^0) }{d_j^1 - d_j^0}, & k = 1,\\ [8pt]\frac{f^j(d_j^k) - f^j(d_j^{k-1}) }{d_j^k - d_j^{k-1}} - \sum_{l=1}^{k-1} a_j^l, & k = 2, 3, \ldots, K.\end{array} \right .$$
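The coefficients \(a_j^k\) above can be computed directly from the breakpoint data. The sketch below (assumed one-dimensional data) implements the recursion; it also adds the constant \(f^j(d_j^0)\), so that the hinge expansion reproduces the breakpoint values even when \(f^j(d_j^0) \neq 0\).

```python
import numpy as np

def hinge_coefficients(d, f):
    """Coefficients a^k such that f(t) = f(d^0) + sum_k a^k * max{0, t - d^{k-1}}
    reproduces a 1-D CPWL function with breakpoints d^0 < ... < d^K and values f(d^k)."""
    slopes = np.diff(f) / np.diff(d)      # slope on each segment [d^{k-1}, d^k]
    a = np.empty_like(slopes)
    a[0] = slopes[0]
    a[1:] = np.diff(slopes)               # a^k = slope_k - slope_{k-1} for k >= 2
    return a

def eval_hinges(t, d, f, a):
    return f[0] + float(np.sum(a * np.maximum(0.0, t - d[:-1])))

d = np.array([0.0, 1.0, 2.0, 3.0])        # assumed breakpoints
f = np.array([0.0, 1.0, 0.5, 2.0])        # assumed values at the breakpoints
a = hinge_coefficients(d, f)
print(eval_hinges(2.5, d, f, a))          # 1.25, matching linear interpolation
```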

Meanwhile, when the HH (4) is separable, i.e., the basis functions take the form \(H_m(x) = \beta_m \max\{0, a_m x(j) - b_m\}\), we can calculate all the breakpoints, such as \(d_{j}^{k} =\frac{b_{m}}{a_{m}}\), and sort them to obtain the corresponding convex combination model. However, for nonseparable functions, the transformation is not easy. On the one hand, when a CPWL function is given in compact representation, to obtain the equivalent vertex representation we should know all the vertices and the topological relations, but the number of vertices may be huge. Consider (4) as an example. Each basis function \(\max\{0, l_m(x)\}\) defines a boundary of linear subregions, i.e., \(l_m(x) = 0\), and in n-dimensional space a point is determined by n such hyperplanes. Thus, the function represented by (4) may have up to \(\binom{M}{n}\) vertices, and the transformation to any vertex representation is very time-consuming when M and n are large. On the other hand, transforming a CPWL function in vertex representation into any compact representation is very hard as well and no practical method has been established. Therefore, CPWL programming in both vertex and compact representations is worthy of study. In practice, one should choose a suitable technique according to the specific formulation of (1).

Using a CPWL function to approximate the nonlinear objective function and then minimizing the obtained surrogate model is the major motivation of the research on CPWL programming. In particular, when the concerned relationship is unclear or too complicated to be optimized directly, we need to perform approximation before applying optimization techniques. For this purpose, some methods for approximation using vertex representations have been proposed in [22, 23]. However, for high-dimensional problems, it may be better to consider compact representation models.

The significant advantage of vertex representation is that the corresponding CPWL program is a MILP and can be solved by mature techniques. From the equivalence of vertex representation and compact representation, one can expect that a CPWL program given in compact representation can be written as a MILP as well. Consider the following linearly constrained problem, whose objective function takes the HH form:

$$ \mathrm{min}\ f(x) = l_0(x) + \sum _{m=1}^M \beta_m \max \bigl\{0, l_m(x) \bigr\} \quad \mathrm{s.t.}\quad Ax \leq b.$$
(6)

By introducing M binary variables \(y_1, y_2, \ldots, y_M\), the above problem can be equivalently transformed into the following MILP:

$$\begin{array}{ll}\mathrm{min} & l_0(x) + \sum_{m=1}^M \beta_m z_m\\[4pt]\mathrm{s.t.} & Ax \leq b,\\[2pt]& z_m \geq l_m(x),\quad z_m \geq 0,\\[2pt]& z_m \leq l_m(x) + U ( 1 - y_m ),\quad z_m \leq U y_m,\\[2pt]& y_m \in \{0, 1\},\quad m = 1, 2, \ldots, M,\end{array}$$
(7)

where U is positive and large enough. Notice that although (7) provides an easy and convenient way to obtain the equivalent MILP for (6), it is not a good formulation, since a MILP with a large positive U may have a weak LP relaxation. Meanwhile, transforming (6) into a vertex representation and then getting the equivalent MILP is not good either, because the transformation is difficult. A better formulation could be obtained via further study. In this paper, we simply use (7) to calculate the optima in the numerical examples.
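A sketch of assembling the big-M formulation (7) is given below. The matrices can be handed to any MILP solver (CPLEX is used in the experiments of this paper; here the call goes to the HiGHS solver behind scipy.optimize.milp instead). The tiny instance, the value of U, and the box on x are assumptions made for illustration only.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def hh_to_bigM_milp(a0, beta, Am, bm, A, b, U=1e3, x_box=(-1e2, 1e2)):
    """Assemble the big-M MILP (7) for problem (6):
        min  a0^T x + sum_m beta_m * z_m
        s.t. A x <= b,  z_m = max{0, a_m^T x + b_m}  (enforced via binaries y_m).
    Variable ordering: [x (n), z (M), y (M)]."""
    n, M = len(a0), len(beta)
    c = np.concatenate([a0, beta, np.zeros(M)])          # constant term of l_0 is dropped
    rows, rhs = [], []
    for m in range(M):
        ez = np.eye(M)[m]
        # z_m >= a_m^T x + b_m   <=>   a_m^T x - z_m <= -b_m
        rows.append(np.concatenate([Am[m], -ez, np.zeros(M)])); rhs.append(-bm[m])
        # z_m <= a_m^T x + b_m + U * (1 - y_m)
        rows.append(np.concatenate([-Am[m], ez, U * ez]));      rhs.append(bm[m] + U)
        # z_m <= U * y_m
        rows.append(np.concatenate([np.zeros(n), ez, -U * ez])); rhs.append(0.0)
    for i in range(A.shape[0]):                           # original constraints A x <= b
        rows.append(np.concatenate([A[i], np.zeros(2 * M)])); rhs.append(b[i])
    lb = np.concatenate([np.full(n, x_box[0]), np.zeros(M), np.zeros(M)])
    ub = np.concatenate([np.full(n, x_box[1]), np.full(M, np.inf), np.ones(M)])
    integrality = np.concatenate([np.zeros(n + M), np.ones(M)])   # only y is integer
    return c, LinearConstraint(np.vstack(rows), -np.inf, rhs), Bounds(lb, ub), integrality

# tiny assumed instance in R^2 with feasible set [0,1]^2
a0 = np.array([0.5, -1.0]); beta = np.array([1.0, -1.0])
Am = np.array([[1.0, 1.0], [1.0, -1.0]]); bm = np.array([-0.5, 0.0])
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
c, cons, bounds, integrality = hh_to_bigM_milp(a0, beta, Am, bm, A, b)
res = milp(c, constraints=cons, bounds=bounds, integrality=integrality)
print(res.x[:2], res.fun)
```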

A CPWL program can be transformed into a MILP, but the equivalent MILP has been proved to be NP-hard in [20], so we cannot expect to get the global optimum of large scale problems in acceptable time. In some applications, there is a strict requirement on computing time. For example, in [31, 35, 36], compact representations have been used to identify some chemical processes and a model predictive control method has been established, which involves finite horizon open-loop optimization problems with CPWL (or continuous piecewise quadratic, depending on the norm) objective functions. It is required that the corresponding program be solved within a short time (about 30 seconds) in order to obtain the control signal in time. In another application, the power consumption of a chiller plant is described as a CPWL function and the operating point is optimized online via CPWL programming; see [37] for details. For such problems, it is not practical to pursue the global optimum and we should consider local optimality conditions. A closely related work has been given in [3], which is applicable only to a special class of CPWL functions. In this paper, a local optimality condition that is applicable to any CPWL program will be given by utilizing the GHH, which is capable of representing all CPWL functions.

3 Exact Penalty

First, we discuss the optimization problem with a CPWL objective function and CPWL constraints. According to the properties of CPWL functions, the functions in (1) equal one of the following vector functions at any point in Ω, that is:

$$ \left [ \begin{array}{c}f_0(x) \\f_1(x) \\\vdots\\f_J(x)\end{array}\right ] \in \left \{ \left [ \begin{array}{c}p_{10}(x) \\p_{11}(x) \\\vdots\\p_{1J}(x)\end{array} \right ], \left [ \begin{array}{c}p_{20}(x) \\p_{21}(x) \\\vdots\\p_{2J}(x)\end{array} \right ], \ldots, \left [\begin{array}{c}p_{T0}(x) \\p_{T1}(x) \\\vdots\\p_{TJ}(x)\end{array} \right ] \right \}, \quad\forall x \in \varOmega,$$
(8)

where \(p_{ij}(x), \forall i, j\), are affine functions, and the domain Ω is partitioned into the following T subregions:

$$ \varOmega_i = \left \{x : \left [ \begin{array}{c}f_0(x) \\[2pt]f_1(x) \\[2pt]\vdots\\f_J(x)\end{array} \right ] = \left [ \begin{array}{c}p_{i0}(x) \\[2pt]p_{i1}(x) \\[2pt]\vdots\\p_{iJ}(x)\end{array} \right ] = \left [ \begin{array}{c}a^T_{i0} x + b_{i0}\\[2pt]a^T_{i1} x + b_{i1} \\[2pt]\vdots\\a^T_{iJ} x + b_{iJ}\end{array} \right ] \right \}, \quad1\leq i \leq T.$$
(9)

Applying an \(l_1\)-norm penalty function, we can approach problem (1) by minimizing the following unconstrained function:

$$ F_\lambda(x) = f_0(x) +\lambda\sum_{j=1}^J \max\bigl\{f_j (x), 0 \bigr\},$$
(10)

for a succession of increasing positive values of the penalty parameter λ. \(F_\lambda(x)\) is obviously a CPWL function, and it can be expressed by the following piecewise representation:

$$F_\lambda(x) = a^T_{i0} x + b_{i0} + \lambda \sum_{j=1}^J \max \bigl\{ a^T_{ij} x + b_{ij}, 0 \bigr\},\quad \forall x \in \varOmega_i.$$

Then, each \(\varOmega_i\) is divided into at most \(2^J\) linear subregions \(\varOmega_{il}\) by the hyperplanes \(a^{T}_{ij} x + b_{ij} = 0, j = 1,2,\dots, J\). In general, the number of linear subregions is much less than \(2^J\), since many hyperplanes \(a^{T}_{ij} x + b_{ij}= 0\) may not intersect \(\varOmega_i\). In every subregion \(\varOmega_{il}\), \(F_\lambda(x)\) is an affine function, i.e.,

$$F_\lambda(x) = a^T_{i0} x + b_{i0} +\lambda\sum_{j=1}^J \bigl(c^T_{ilj} x + d_{ilj} \bigr),\quad \forall x \in \varOmega_{il},$$

where

$$\bigl( c_{ilj}, d_{ilj} \bigr) = \left \{ \begin{array}{l@{\quad}l} ( a_{ij}, b_{ij} ), & \mathrm{if}\ a^T_{ij} x + b_{ij} \geq 0,\ \forall x \in \varOmega_{il},\\[4pt] ( 0, 0 ), & \mathrm{otherwise}.\end{array} \right .$$

As assumed, the \(\varOmega_i\) are polyhedra; thus the \(\varOmega_{il}\) are polyhedra as well and can be determined by some linear inequalities. Hence, \(\varOmega_{il}\) can be written as \(\varOmega_{il} = \{x : P_{il} x \leq q_{il}\}\). Let \(x \in \varOmega\) be a given point and \(\varPsi(x) = \{\varOmega_{il} : x \in \varOmega_{il}\}\) stand for the set of subregions which contain x. Then, \(\varPsi(x)\) is not empty and the number of its elements is finite.

A CPWL function is continuous in its domain but not differentiable at some points. Therefore, the concept of the directional derivative is needed in CPWL analysis. The directional derivative of a continuous function f at \(x \in {\mathbb R}^n\) in the direction d is defined as

$$\nabla^d f(x) = \lim_{t \to 0^+} \frac{f(x + td) - f(x) }{t}.$$
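For a CPWL function, \(f(x + td)\) is affine in t for all sufficiently small t > 0, so a one-sided difference with a small step already returns the exact directional derivative up to round-off. A tiny sketch with an assumed toy function:

```python
def directional_derivative(f, x, d, t=1e-7):
    """One-sided difference; exact for CPWL f once t is smaller than the
    distance from x to the next subregion boundary along d."""
    return (f([xi + t * di for xi, di in zip(x, d)]) - f(x)) / t

f = lambda x: max(abs(x[0]), abs(x[1]))          # a simple CPWL function
print(directional_derivative(f, [0.0, 0.0], [1.0, 0.0]))    # ~ 1.0
print(directional_derivative(f, [0.0, 0.0], [-1.0, 0.0]))   # ~ 1.0
```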

Using the directional derivative, we can prove the following theorem, from which the existence of an exact penalty for (1) will follow.

Theorem 3.1

For the CPWL function (10), there exists a \(\bar{\lambda}\) such that for all \(\lambda > \bar{\lambda}\), the functions \(F_\lambda(x)\) have the same local minima.

Proof

\(\hat{x}\) is a local minimum of \(F_\lambda(x)\) if and only if \(\nabla^{d} F_{\lambda}(\hat{x} ) \geq0, \forall d \in {\mathbb {R}}^{n}\). We use \(D_{il}(\hat{x})\) to represent the set of directions along which the ray from \(\hat{x}\) stays in \(\varOmega_{il}\), i.e., \(D_{il}(\hat{x}) =\{ d: \exists\bar{t}> 0, \forall0 \leq t \leq\bar{t}, \hat{x} + t d \in\varOmega_{il} \}\). Then,

$$\nabla^d F_\lambda(\hat{x}) = \Biggl( a^T_{i 0}+ \lambda \sum_{j=1}^Jc^T_{ilj} \Biggr) d,\quad \forall d \in D_{il}(\hat{x}).$$

For a given direction d, there is at least one \(\varOmega_{il} \in \varPsi( \hat{x}) \) such that \(d \in D_{il}(\hat{x})\). Conversely, in a given subregion \(\varOmega_{il} \in\varPsi( \hat{x})\), there is an \(\tilde{x}\) which is different from \(\hat{x}\) and belongs to \(\varOmega_{il}\); hence we can obtain a direction \(d = \tilde{x} - \hat{x} \in D_{il}(\hat{x})\). Therefore, the condition \(\nabla^{d} F_{\lambda}(\hat{x} ) \geq0,\forall d \in {\mathbb {R}}^{n} \) is equivalent to

$$ \Biggl( a^T_{i 0} +\lambda\sum_{j=1}^J c^T_{ilj}\Biggr) d \geq 0,\quad \forall d \in D_{il}(\hat{x}), \ \varOmega_{il} \in\varPsi( \hat{x} ).$$
(11)

\(\varOmega_{il}\) is determined by \(\varOmega_{il} = \{x : P_{il} x \leq q_{il}\}\). We denote the kth row of \(P_{il}\) by \(P_{il}(k)\) and the kth component of \(q_{il}\) by \(q_{il}(k)\). Then, \(\varGamma_{il}(\hat{x}) = \{ k: P_{il}(k) \hat{x} = q_{il}( k ) \}\) stands for the set of indices of the active constraints at \(\hat{x}\). It is easy to see that \(D_{il}(\hat{x}) = \{ d: P_{il}(k) d \leq0, \forall k \in \varGamma_{il}(\hat{x}) \}\), which is convex. According to the Minkowski–Weyl theorem, \(D_{il}(\hat{x})\) can be expressed as \(D_{il}(\hat{x}) = \{d: d = \sum_{m}\theta_{m} \nu_{m}, \forall \theta_{m} \geq0\}\), where the number of \(\nu_m\) is finite, and (11) is equivalent to

$$ \Biggl( a^T_{i 0}+ \lambda\sum_{j=1}^Jc^T_{ilj} \Biggr) \sum_m\theta_m \nu_{m} \geq0,\quad \forall\theta_m\geq0,\ \forall \varOmega_{il} \in\varPsi( \hat{x} ).$$

Therefore, \(\hat{x}\) is locally minimal if and only if for any \(\varOmega_{il} \in\varPsi( \hat{x} )\),

$$ \Biggl(a^T_{i 0} + \lambda\sum_{j=1}^Jc^T_{ilj} \Biggr) \nu_{m} \geq0,\quad \forall \nu_m.$$
(12)

If \(\sum_{j=1}^{J} c^{T}_{ilj} \nu_{m} = 0, \forall m\), then whether (12) holds or not is independent of λ. Otherwise, we set

$$\lambda_{il}^*(\hat{x}) = \max_{m: \sum_{j=1}^J c^T_{ilj} \nu_{m}\neq0 } \biggl\{ -\frac{a^T_{i 0} \nu_{m} }{\sum_{j=1}^J c^T_{ilj}\nu_{m}} \biggr\}.$$

Then, λ has no effect on the validity of inequality (12) as long as \(\lambda > \lambda_{il}^{*}(\hat{x})\), because all \(\lambda> \lambda_{il}^{*}(\hat{x})\) can ensure that

$$\Biggl( a^T_{i 0} + \lambda \sum _{j=1}^J c^T_{ilj} \Biggr)\nu_{m} > 0\quad \Leftrightarrow\quad \sum _{j=1}^J c^T_{ilj}\nu_{m} > 0$$

for any m satisfying \(\sum_{j=1}^{J} c^{T}_{ilj} \nu_{m} \neq0 \). This property means that \(\hat{x}\) is a local optimum of F λ (x) for all \(\lambda > \max_{\varOmega_{il} \in\varPsi(\hat{x})}\{ \lambda^{*}_{il}(\hat{x})\}\) or it is not a local optimum for any \(\lambda > \max_{\varOmega_{il} \in\varPsi(\hat{x})}\{ \lambda^{*}_{il}(\hat{x})\}\).

From the above we can see that if the value of \(\sup_{\hat{x}}\max_{\varOmega_{il} \in\varPsi(\hat{x})} \{\lambda^{*}_{il}(\hat{x}) \}\) is finite, then there exists \(\bar{\lambda}\) ensuring that for all \(\lambda > \bar{\lambda}\), the functions \(F_\lambda(x)\) have the same local minima. One can verify that

$$\sup_{\hat{x}} \max_{\varOmega_{il} \in\varPsi(\hat{x})} \bigl\{ \lambda^*_{il}(\hat{x})\bigr\} = \max_{\varOmega_{il}} \sup_{\hat{x} \in \varOmega_{il}} \bigl\{ \lambda^*_{il}(\hat{x}) \bigr\}.$$

When \(\varOmega_{il}\) is given, \(\nu_m\) is directly related to \(D_{il}(\hat{x})\), which is determined by \(\varGamma_{il}(\hat{x})\). The number of different values of \(\varGamma_{il}(\hat{x})\) is finite, and so is the number of different groups of \(\nu_m\). Moreover, there are finitely many \(\varOmega_{il}\); thus the number of different values of \(\lambda_{il}^{*}(\hat{x})\) is finite and the maximal value can be achieved, i.e.,

$$\bar{\lambda}= \max_{\varOmega_{il}} \max_{\hat{x} \in\varOmega_{il}} \bigl\{\lambda^*_{il}(\hat{x}) \bigr\}.$$

Therefore, all the functions \(F_\lambda(x)\) have the same local minima as long as \(\lambda > \bar{\lambda}\). □

If the original problem is unbounded, then (10) is also unbounded. Otherwise, we denote a global optimum of (1) by \(x^*\). If the minimal value of \(F_\lambda(x)\) is unbounded for all λ>0, then we can add the following constraints to (1),

$$ \mathrm{LB} \leq x \leq\mathrm{UB},$$
(13)

where the vector LB (UB) is the lower (upper) bound for x. With these constraints, \(x^*\) is still the optimum as long as \(\mathrm{LB} \leq x^* \leq \mathrm{UB}\). Obviously, for the problem with constraints (13), the penalty function \(F_\lambda(x)\) has a lower bound for sufficiently large but finite λ. Moreover, we can design LB and UB satisfying \(\mathrm{LB} \leq x^* \leq \mathrm{UB}\) easily, even when \(x^*\) is not exactly known. Thus, without any loss of generality, we can assume that if an optimum of (1) exists, then there is a \(\lambda_0\) such that a global optimum of \(F_\lambda(x), \forall \lambda > \lambda_0\), is achievable. Under this assumption, the existence of an exact penalty is guaranteed by the following theorem.

Theorem 3.2

If \(x^*\) is globally optimal for (1), then there exists a \(\tilde{\lambda}\) such that for any \(\lambda> \tilde{\lambda}\), \(x^*\) is a global minimum of \(F_\lambda(x)\).

Proof

Define \(L_\lambda\) to be the set of the local minima of \(F_\lambda(x)\), i.e., \(L_\lambda = \{x : x\ \mathrm{is\ a\ local\ minimum\ of}\ F_\lambda(x)\}\). From Theorem 3.1 it can be concluded that \(L_\lambda\) remains unchanged as long as \(\lambda> \bar{\lambda}\), and we denote \(L = L_{\lambda}, \forall\lambda > \bar{\lambda}\). As assumed, for any \(\lambda > \lambda_0\), the minimal value of \(F_\lambda(x)\) is bounded and achievable. Therefore, its global minimum must be locally minimal. Then, for any \(\lambda> \max\{ \bar{\lambda}, \lambda_{0} \}\), \(x^*\) is a global minimum of \(F_\lambda(x)\) if and only if \(F_\lambda(x^*) \leq F_\lambda(x), \forall x \in L\).

For any \(x \in L\), if \(f_j(x) \leq 0, \forall j = 1,2,\ldots,J\), then \(F_\lambda(x^*) \leq F_\lambda(x)\) holds obviously. Otherwise, we can set

$$\lambda_x = \frac{f_0(x^*) - f_0(x)}{ \sum_{j = 1}^J \max\{ 0,f_j(x)\} }.$$

One can verify that \(F_\lambda(x^*) \leq F_\lambda(x), \forall \lambda \geq \lambda_x\). Hence, for any \(x \in L\), there is \(F_\lambda(x^*) \leq F_\lambda(x), \forall \lambda \geq \sup_{x \in L} \lambda_x\).

For any \(x \in L\), there exists an \(\varOmega_{il}\) such that \(x \in \varOmega_{il}\). Since \(F_\lambda(x)\) is affine in \(\varOmega_{il}\) and x is a local optimum of \(F_\lambda(x)\), x must be an optimum of the corresponding linear program in \(\varOmega_{il}\). The number of \(\varOmega_{il}\) is finite, and so is the number of different values of \(F_\lambda(x), \forall x \in L\). That means when \(x \in L\), the number of different values of \(f_0(x)\) and that of \(\sum_{j=1}^{J} \max\{ 0, f_{j}(x)\}\) are both finite. Therefore, \(\sup_{x \in L} \lambda_x\) can be achieved and one can expect to get an optimum of (1) with a finite exact penalty factor by choosing \(\lambda> \tilde{\lambda}\), where \(\tilde{\lambda}= \max \{ \bar{\lambda}, \lambda_{0}, \max_{x \in L} \lambda_{x} \}\). □
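To illustrate Theorems 3.1 and 3.2 numerically, the following sketch minimizes \(F_\lambda\) on a grid for a small problem invented for this purpose (it is not one of the paper's examples): for every λ above the exact-penalty threshold the minimizer coincides with the constrained optimum, while below the threshold it does not.

```python
import itertools
import numpy as np

# toy CPWL problem: min |x1 - 2| + |x2|  s.t.  x1 - 1 <= 0  (constrained optimum: (1, 0))
f0 = lambda x: abs(x[0] - 2.0) + abs(x[1])
f1 = lambda x: x[0] - 1.0

def penalty_minimizer(lam, grid=np.linspace(-1.0, 3.0, 401)):
    """Brute-force minimization of F_lambda on a grid, for illustration only."""
    F = lambda x: f0(x) + lam * max(0.0, f1(x))
    return min(itertools.product(grid, grid), key=F)

for lam in [0.5, 1.5, 10.0]:
    print(lam, penalty_minimizer(lam))
# lam = 0.5 (below the threshold 1) yields (2, 0); every larger lam yields (1, 0)
```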

4 Local Optimality Condition

In Sect. 3, the existence of an exact penalty is proved. This property allows us to concentrate on the following unconstrained continuous piecewise linear program:

$$\mathrm{min}\ f(x) = f_0(x) + \lambda\sum ^J_{j=1} \max\bigl\{ 0, f_j(x) \bigr\},$$

where \(f_j(x), \forall j\), are CPWL functions. Since sums, maxima, and compositions of CPWL functions are still continuous and piecewise linear, f(x) is a CPWL function and there exists a GHH formulation to represent it. Hence, the above problem can be represented in the following closed form:

$$ \mathrm{min} f(x) = \sum _{m=1}^M \beta_m B_m(x) =\sum_{m=1}^M \beta_m \max \bigl \{ l^m_1(x), l^m_2(x), \ldots,l^m_{n+1} (x) \bigr\},$$
(14)

where \(\beta_m = \pm 1\), \(B_m(x)\) are the basis functions, and \(l^{m}_{i}(x)= a^{T}_{mi} x + b_{mi}\) are affine functions.

When \(\hat{x}\) is given, \(B_{m}(\hat{x} )\) is determined and we can define \(\mathcal{I}_m(\hat{x}) = \{ i : l^m_i(\hat{x}) = B_m(\hat{x}) \}\). Since only the functions \(l^m_i(x), i \in \mathcal{I}_m(\hat{x})\), can affect the value of \(B_m(x)\) in a neighborhood of \(\hat{x}\), we call \(\mathcal{I}_m(\hat{x})\) the active index set of \(B_m(x)\) at \(\hat{x}\). Then, the sufficient and necessary condition for \(\hat{x}\) being a local minimum of (14) is given by the following theorem.

Theorem 4.1

\(\hat{x}\) is a local minimum of (14) if and only if for any \(I=[I(1),I(2),\ldots,I(M)]^T\) with \(I(m) \in \mathcal{I}_m(\hat{x})\), there exist scalars \(\theta_{mi} \geq 0\) which make the following equality hold:

$$ \sum_{m=1}^M \beta_m a_{mI(m)} = \sum_{m=1}^M \sum_{i \in \mathcal{I}_m(\hat{x})} \theta_{mi} \bigl( a_{mI(m)} - a_{mi} \bigr).$$
(15)

Proof

According to the definition of \(\mathcal{I}_m(\hat{x})\), one can find a neighborhood of \(\hat{x}\), denoted by \(\mathcal{N}_m(\hat{x})\), satisfying

$$B_m(x) = \max_{i \in \mathcal{I}_m(\hat{x})} l^m_i(x),\quad \forall x \in \mathcal{N}_m(\hat{x}).$$

Define \(\mathcal{N}(\hat{x}) = \bigcap_{m=1}^M \mathcal{N}_m(\hat{x})\). The radius of \(\mathcal{N}(\hat{x})\), denoted by r, is positive, and in \(\mathcal{N}(\hat{x})\), (14) is equal to the following function:

$$ f(x) = \sum_{m=1}^M \beta_m \max_{i \in \mathcal{I}_m(\hat{x})} l^m_i(x).$$
(16)

Using the vector I, we define the function \(f_I(x)\) as

$$f_I(x) = \sum_{m=1}^M\beta_m \bigl( a^T_{mI(m)} x + b_{mI(m)}\bigr),$$

and the corresponding polyhedron as

$$\varPhi_I = \bigl\{ x : l^m_{I(m)}(x) \geq l^m_i(x),\ \forall i \in \mathcal{I}_m(\hat{x}),\ 1 \leq m \leq M \bigr\}.$$

It is easy to see that \(f(x) = f_I(x), \forall x \in \varPhi_I \cap \mathcal{N}(\hat{x})\). Next we show that \(\hat{x}\) is a local minimum of f(x) if and only if

$$ f_I( \hat{x} ) \leq f_I( x ),\quad \forall x \in\varPhi_I,\ \forall I,$$
(17)

where \(I=[I(1),I(2),\ldots,I(M)]^T\), \(I(m) \in \mathcal{I}_m(\hat{x})\).

We first prove the necessity by supposing \(\exists I_{0}, \exists\tilde{x} \in\varPhi_{I_{0}}\) and \(f_{I_{0}}( \hat{x}) > f_{I_{0}} (\tilde{x})\). Since \(\varPhi_{I_{0}}\) is a polyhedron and \(f_{I_{0}} (x)\) is affine, we know that for any t∈(0,1], \(x_{t} =\hat{x} + t ( \tilde{x} -\hat{x})\) satisfies \(x_{t} \in\varPhi_{I_{0}}\) and \(f_{I_{0}} (\hat{x}) > f_{I_{0}} (x_{t}) \). For any δ∈(0,r], we can find \(0 < t_{0} \leq\min\{ 1, \delta/ \|\tilde{x} - \hat{x}\| \}\); then, \(\| x_{t_{0}} - \hat{x} \| \leq \delta\leq r\) and \(f_{I_{0}}(\hat{x}) > f_{I_{0}}(x_{t_{0}})\). That means both \(\hat{x}\) and \(x_{t_{0}}\) belong to \(\mathcal{N}(\hat{x})\). Therefore, \(f(\hat{x} ) = f_{I_{0}} (\hat{x}) > f_{I_{0}}(x_{t_{0}}) = f (x_{t_{0}}) \), i.e., \(\hat{x}\) is not locally minimal.

On the other hand, for any \(x \in \mathcal{N}(\hat{x})\), one can find an \(I_x\) satisfying \(x \in \varPhi_{I_x}\) and \(f(x) = f_{I_{x}}(x)\). It is noted that \(\hat{x} \in \bigcap\varPhi_{I}\) and \(f(\hat{x} ) =f_{I_{x}} (\hat{x}) \). According to (17), \(f(\hat{x}) \leq f(x)\), i.e., \(\hat{x}\) is locally minimal. Hence, (17) is sufficient for \(\hat{x}\) being locally minimal.

From the above discussion, we know that (17) is the sufficient and necessary condition for \(\hat{x}\) being locally minimal. Consider a given I satisfying \(I(m) \in \mathcal{I}_m(\hat{x})\), \(\forall m\). (17) means

$$ \sum_{m=1}^M\beta_m \bigl( a^T_{mI(m)} \hat{x} +b_{mI(m)} \bigr) \leq \sum_{m=1}^M\beta_m \bigl( a^T_{mI(m)} x + b_{mI(m)}\bigr)$$
(18)

is valid for all \(x \in \varPhi_I\), where \(\varPhi_I\) is determined by

$$ \varPhi_I = \bigl\{ x : a^T_{mI(m)} x + b_{mI(m)} \geq a^T_{mi} x + b_{mi},\ \forall i \in \mathcal{I}_m(\hat{x}),\ 1 \leq m \leq M \bigr\}.$$
(19)

Since the following relationship holds:

$$a^T_{mI(m)} \hat{x} + b_{mI(m)} = a^T_{mi} \hat{x} + b_{mi},\quad \forall i \in \mathcal{I}_m(\hat{x}),\ 1 \leq m \leq M,$$

(19) is equal to

$$\varPhi_I = \bigl\{ x : ( a_{mI(m)} - a_{mi} )^T ( x - \hat{x} ) \geq 0,\ \forall i \in \mathcal{I}_m(\hat{x}),\ 1 \leq m \leq M \bigr\}.$$
Meanwhile, (18) can be written as

$$\sum_{m=1}^M \beta_ma^T_{mI(m)} ( x - \hat{x} ) \geq0.$$

Therefore, (17) holds if and only if for any I there is no feasible point for the following inequalities:

$$ \left \{ \begin{array}{l} \sum_{m=1}^M \beta_m a^T_{mI(m)} ( x - \hat{x} ) < 0,\\[6pt] ( a_{mI(m)} - a_{mi} )^T ( x - \hat{x} ) \geq 0,\quad \forall i \in \mathcal{I}_m(\hat{x}),\ 1 \leq m \leq M. \end{array} \right .$$
(20)

According to Farkas’ lemma, (20) has no feasible point if and only if there are nonnegative scalars \(\theta_{mi} \geq 0\) ensuring

$$\sum_{m=1}^M \beta_m a_{mI(m)} = \sum_{m=1}^M \sum_{i \in \mathcal{I}_m(\hat{x})} \theta_{mi} \bigl( a_{mI(m)} - a_{mi} \bigr).$$
(21)

From the above discussion, we know that \(\hat{x}\) is a local minimum if and only if (21) is valid for arbitrary I. The sufficient and necessary condition for \(\hat{x}\) being locally optimal is proved. □
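For a fixed index vector I, checking condition (15)/(21) amounts to deciding whether a linear system has a nonnegative solution θ, which can be done with a feasibility LP. A minimal sketch using scipy.optimize.linprog is shown below; the data are assumed, and in practice the check is repeated for every admissible I.

```python
import numpy as np
from scipy.optimize import linprog

def farkas_certificate_exists(lhs, columns):
    """Decide whether lhs = sum_k theta_k * columns[:, k] admits theta >= 0.
    By Farkas' lemma this is exactly the infeasibility of system (20),
    i.e. the absence of a descent direction for the given index vector I."""
    res = linprog(np.zeros(columns.shape[1]), A_eq=columns, b_eq=lhs)
    return res.status == 0            # status 0: a feasible theta was found

# assumed data: lhs = sum_m beta_m a_{mI(m)}; each column is one a_{mI(m)} - a_{mi}
lhs = np.array([1.0, 1.0])
columns = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
print(farkas_certificate_exists(lhs, columns))   # True, with theta = (1, 1)
```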

An important advantage of CPWL programming is that it can be transformed into a series of linear problems in subregions, called sub-LPs in this paper. If \(\hat{x}\) is not locally minimal, then the objective value can be strictly decreased through a sub-LP. According to Theorem 4.1, when \(\hat{x}\) is not a local minimum of (14), there is at least one \(I_0\) for which (15) does not hold for any \(\theta_{mi} \geq 0\). Then, we can construct the following sub-LP:

$$ \mathrm{min}\ f_{I_0}(x) = \sum_{m=1}^M \beta_m \bigl( a^T_{mI_0(m)} x + b_{mI_0(m)} \bigr) \quad \mathrm{s.t.}\quad x \in \varPhi_{I_0}.$$
(22)

From the proof of Theorem 4.1, one can easily verify that the objective value can be strictly decreased from \(f(\hat{x})\) by solving the above sub-LP. Following this idea, we give an algorithm for (14). This algorithm uses sub-LPs to decrease the objective function from an initial point and hence is named the Descent Algorithm using LP Technique (DALPT), as shown in Algorithm 1.

Algorithm 1 Descent Algorithm using LP Technique (DALPT)
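Since Algorithm 1 is presented as a figure, the following is only a sketch of DALPT reconstructed from the surrounding text, for a GHH objective (14): an LP-based Farkas check implements the condition of Theorem 4.1, and a violating index vector triggers the sub-LP (22). The box bound added to the sub-LP and the numerical tolerance are implementation choices, not part of the paper.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def active_sets(x, bases, tol=1e-9):
    """Active index sets: indices attaining the maximum in each basis function B_m."""
    acts = []
    for Am, cm in bases:
        v = Am @ x + cm
        acts.append(np.flatnonzero(v >= v.max() - tol))
    return acts

def is_local_min(x, beta, bases):
    """Theorem 4.1: x is a local minimum iff, for every admissible index
    vector I, equality (15) has a nonnegative solution theta."""
    acts = active_sets(x, bases)
    for I in itertools.product(*acts):
        lhs = sum(bm * Am[i] for bm, (Am, _), i in zip(beta, bases, I))
        cols = [Am[i] - Am[j]
                for (Am, _), i, act in zip(bases, I, acts) for j in act]
        res = linprog(np.zeros(len(cols)), A_eq=np.array(cols).T, b_eq=lhs)
        if res.status != 0:
            return False, I           # no Farkas certificate: a descent sub-LP exists
    return True, None

def sub_lp(x, I, beta, bases, box=10.0):
    """Sub-LP (22): minimize f_I over the subregion on which every B_m equals its
    I(m)-th affine piece (the box keeps the LP bounded -- an implementation choice)."""
    c = sum(bm * Am[i] for bm, (Am, _), i in zip(beta, bases, I))
    A_ub, b_ub = [], []
    for (Am, cm), i in zip(bases, I):
        for j in range(len(cm)):      # enforce l^m_{I(m)}(x) >= l^m_j(x) for all j
            A_ub.append(Am[j] - Am[i]); b_ub.append(cm[i] - cm[j])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-box, box)] * len(x))
    return res.x

def dalpt(x0, beta, bases, max_iter=100):
    """Descent Algorithm using LP Technique: solve sub-LPs until Theorem 4.1 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        local, I = is_local_min(x, beta, bases)
        if local:
            return x
        x = sub_lp(x, I, beta, bases)
    return x

# assumed toy instance: f(x) = max{x1, -x1, x2}
bases = [(np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]), np.zeros(3))]
beta = [1.0]
print(dalpt([0.5, -2.0], beta, bases))   # ends at a point with x1 = 0 and objective value 0
```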

Since the number of distinct linear subregions is finite, DALPT converges to a local optimum after finitely many sub-LPs. Obviously, in DALPT, to verify the local optimality of \(\hat{x}\), at most \(\prod_{m=1}^M |\mathcal{I}_m(\hat{x})|\) linear equation systems of the form (15) should be solved, where \(|\mathcal{I}_m(\hat{x})|\) denotes the number of elements in \(\mathcal{I}_m(\hat{x})\). According to the definition of the active index set \(\mathcal{I}_m(\hat{x})\), one can see that \(\hat{x}\) must satisfy \(\sum_{m=1}^M (|\mathcal{I}_m(\hat{x})| - 1)\) linear equations. Therefore, when these linear equations are linearly independent, we have \(\sum_{m=1}^M (|\mathcal{I}_m(\hat{x})| - 1) \leq n\) and \(\prod_{m=1}^M |\mathcal{I}_m(\hat{x})| \leq 2^n\). Although \(2^n\) may be very large, this bound is acceptable in regular cases. To show the effectiveness of DALPT, we report the computing time in the numerical experiments. The basic idea of DALPT, i.e., solving an LP in one subregion and then choosing another subregion to test, is the same as that of the algorithm in [37] and the generalized simplex algorithm in [4–6]. The algorithm in [37] deals with a minimization problem of adaptive hinging hyperplanes (AHH), a nested CPWL representation proposed in [17], by solving sub-LPs repeatedly within the allowed time. In that algorithm, the local optimality of the result cannot be guaranteed, since the local optimality condition for the AHH has not been derived. To make the algorithm more effective, the local optimality condition is considered in both DALPT and the generalized simplex algorithm, the latter of which is established for minimizing a separable CPWL function in vertex representation. As mentioned in Sect. 2.3, transforming a nonseparable compact expression into a vertex representation is not easy, so we cannot simply extend the generalized simplex algorithm to solve (14).

According to Theorem 4.1, if the objective function can be represented by the HH, i.e.,

$$ f(x) = \sum_{m=1}^M \beta_m \max \bigl\{ 0, a^T_m x + b_m \bigr\},\quad \beta_m = \pm 1,$$
(23)

then the optimality condition is given by the following corollary. Using this corollary, the local optimality of (23) can be verified by solving one linear equation.

Corollary 4.1

For function (23), if \(\{a_{m}\}_{m: a^{T}_{m} \hat{x} + b_{m} = 0 }\) are linearly independent, then the sufficient and necessary condition for \(\hat{x}\) being locally minimal is: for any m satisfying \(a^{T}_{m} \hat{x} + b_{m} = 0\), (i) \(\beta_m = 1\); (ii) there are scalars \(\eta_m\) such that

$$\sum_{m : a^T_m \hat{x} + b_m > 0} \beta_m a_m + \sum_{m : a^T_m \hat{x} + b_m = 0} \eta_m a_m = 0,\quad 0 \leq \eta_m \leq 1.$$
Conn and Mongeau have given a local optimality condition for CPWL programming (see Theorem 2 in [3]) and that condition is equivalent to Corollary 4.1. Inspired by this equivalence, one can see that Conn’s condition in effect treats a CPWL objective function that can be represented by the HH model in a neighborhood of the point of interest. Since the HH cannot represent all CPWL functions, there are many functions that Conn’s condition cannot handle. To show this fact, we compare two simple functions, both taken from [3]. The two functions are shown below and illustrated by Fig. 1:

(24)
(25)
Fig. 1 Decomposable and non-decomposable functions

Conn’s condition is proved via decomposing a CPWL function by ridges. In [3], a ridge of a piecewise linear function f is a specified hyperplane “containing a relative neighborhood where the derivative of f is not defined.” One can verify that the ridges of \(f_1\) are the two lines x(1)=0 and x(2)=0, and the ridges of \(f_2\) are x(1)=0, x(2)=0, and x(1)=x(2). Using the optimality condition in [3], one can analyze the local optimality of a CPWL function with known ridges by decomposing the function into a smooth part and a non-smooth part. However, \(f_2\) is non-decomposable at \([0,0]^T\) and the reason is rigorously explained in [3]. In compact representation, the two functions can be written as follows:

One can see that the main difference between \(f_1\) and \(f_2\) is that \(f_1\) can be represented by the HH model but \(f_2\) cannot. We call the part of a ridge that actually has a relative neighborhood in which the derivative is not defined the active part of the ridge. The active parts of the ridges form the boundaries of the linear subregions. Consider the function \(f_2\): the active parts of its ridges are

which are rays. This is the reason why \(f_2\) is non-decomposable. In fact, when a CPWL function can be represented by the HH, it is decomposable at any point of its domain. From another point of view, if some boundaries of the linear subregions are not lines but segments or rays, then the CPWL function is non-decomposable at the endpoints of such segments or rays. In this case, Conn’s condition does not apply, whereas Theorem 4.1 is applicable to any CPWL program.

5 Numerical Examples

Some properties of CPWL programming have been discussed above and an algorithm has been proposed. Two examples are considered below. The first example is a problem with a nonseparable CPWL objective function and CPWL constraints; it is used to illustrate some of the theoretical results. In the second example, we use DALPT to minimize a CPWL function under linear constraints. The results are then compared with the global optimum calculated by CPLEX.

Example 5.1

Consider the following CPWL programming:

The plots of the objective function and the feasible domain are shown in Fig. 2(a) and (b). It is noted that \(\max\{0, x(2) - \max\{x(1), -x(1)\}\} = \max\{\max\{x(1), -x(1)\}, x(2)\} - \max\{x(1), -x(1)\}\). We can transform Example 5.1 into an unconstrained problem using the penalty parameter λ as follows:

$$\mathrm{min}\ F_\lambda(x) = \sum_{m=1}^9\beta_m B_m(x),$$

where \(\beta_2 = -1\), \(\beta_9 = -1\), \(\beta_m = 1\), m = 1, 3, 4, …, 8, and

Fig. 2 Features of Example 5.1

To show the local optimality condition, we let \(\hat{x} = [0,0]^{T}\) and determine its active index sets. Considering \(I = [2,2,1,1,1,1,1,3,1]^T\), Eq. (15) becomes

Obviously, the above equation can be satisfied by nonnegative \(\theta_{mi}\) when λ≥2. Following a similar analysis, we know that the condition of Theorem 4.1 is met, i.e., \([0,0]^T\) is locally optimal for \(F_\lambda(x)\), when λ≥2. For the special case of λ=5, the penalized function is illustrated in Fig. 2(c) and 2(d). According to Theorem 3.1, one can see that \(\hat{x} = [0,0]^{T}\) is also a local optimum of Example 5.1.

Example 5.2

Consider the HH minimization problem below:

$$ \mathrm{min}\ f(x) = \sum_{m=1}^M \beta_m \max \bigl\{ 0, l_m(x) \bigr\} \quad \mathrm{s.t.}\quad x \in [0,1]^n.$$
The objective function takes the HH form with \(l_{m}(x) = a^{T}_{m} x + b_{m}\). We consider the cases [n, M] = [3, 30], [3, 100], [5, 30], [5, 100] separately. The parameters \(a_m\) are selected from \([-1,1]^n\) following the uniform distribution. Then, we randomly generate \(b_m\) so that each hyperplane \(l_m(x) = 0, m = 1,2,\ldots,M\), intersects the hypercube \([0,1]^n\) (otherwise, the basis function \(\max\{0, l_m(x)\}\) reduces to an affine function on the feasible domain). \(\beta_m\) takes values from {−1, 1} with equal probability.
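A sketch of the instance generation described above; sampling \(b_m\) from the range of \(-a_m^T x\) over the unit cube is one possible way to meet the stated intersection requirement, not necessarily the exact procedure used in the paper.

```python
import numpy as np

def random_hh_instance(n, M, rng=np.random.default_rng(0)):
    """Random HH objective for Example 5.2: a_m uniform on [-1,1]^n, b_m such that
    the hyperplane a_m^T x + b_m = 0 meets [0,1]^n, beta_m = +-1 with equal probability."""
    A = rng.uniform(-1.0, 1.0, size=(M, n))
    lo = np.minimum(A, 0.0).sum(axis=1)      # min of a_m^T x over the unit cube
    hi = np.maximum(A, 0.0).sum(axis=1)      # max of a_m^T x over the unit cube
    b = -rng.uniform(lo, hi)                 # then a_m^T x + b_m = 0 intersects the cube
    beta = rng.choice([-1.0, 1.0], size=M)
    return A, b, beta

A, b, beta = random_hh_instance(n=3, M=30)
```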

DALPT is established for unconstrained problems; meanwhile, it can be used directly for linearly constrained problems. We apply DALPT to Example 5.2. Since the result of DALPT may differ for different initial points, we run DALPT 10 times from random initial points and choose the best result as the optimized result. Meanwhile, we can transform this problem into a MILP like (7) and solve it by CPLEX, whose result is globally optimal and can be used to evaluate the performance of DALPT. For each pair of M and n, 10 instances are generated, and Table 1 shows the experimental results, including the average and standard deviation of the computing time of DALPT, the average computing time of CPLEX, and the percentage of instances in which the global optimum is achieved within the 10 runs of DALPT.

Table 1 Performance of DALPT in Example 5.2

6 Conclusions

Continuous piecewise linear functions represented by compact models have been widely applied in circuit analysis and nonlinear system identification. Therefore, continuous piecewise linear programming in a compact model is an important problem worthy of study. In this paper, we prove the existence of an exact penalty and give a local optimality condition for nonseparable CPWL programming. Based on the obtained optimality condition, a descent algorithm is established. Then, we illustrate the theoretical results and evaluate the proposed algorithm in a numerical study. This paper gives some preliminary discussions on CPWL programming in compact models. Further study on efficient algorithms will make CPWL programming in compact representation a promising tool in nonlinear analysis and optimization.