1 INTRODUCTION

Neural networks are an important and effective tool for solving various machine learning problems. In a number of applied tasks solved with neural networks, the network output must be constrained, i.e., the predictions must satisfy specified constraints. Examples of problems that require constraints on network output values are constrained optimization problems, models generating images or parts of images in a given region, classification problems with constraints on class probabilities, etc. The most common approach to accounting for such constraints is to add penalty terms to the loss function that penalize constraint violations. However, this does not guarantee that the constraints are satisfied for all training and new examples, since constraint violations are only penalized, not eliminated. Therefore, this approach leads to so-called soft constraints [1]. Another approach is to modify the neural network so that it predicts strictly within the constrained output space. In this case the constraints are hard in the sense that they are satisfied for any input example during training and inference [2].

Although many applications require hard constraints, few models currently implement them. Most models rely on soft constraints because of their simple implementation [3]. A method taking into account hard conic constraints of the form \(Ax \leqslant 0\) was proposed in [2]. The corresponding model maps predictions into the feasible region using a specific set of rays. A serious limitation of this method is the need to search for the corresponding rays. If this approach is applied beyond conic constraints, it becomes necessary to find all the vertices of the set, and the number of vertices can be extremely large. A general approach to solving optimization problems with constraints is presented in [4]. It aims to incorporate (potentially nonconvex) equality and inequality constraints into deep learning-based optimization algorithms. Its performance largely depends on the training process and the chosen model architecture. A neural network architecture with hard constraints on output data using an additional output layer is proposed in [5]. However, the method considers only linear constraints and, more importantly, its implementation requires knowledge of all vertices of the constraint polyhedron, the number of which can be huge.

An interesting approach to solving quadratic optimization problems with linear constraints was proposed in [6]. A similar approach, which embeds an optimization layer into a neural network, is proposed in [7]. Unlike [6], this approach extends to arbitrary convex loss functions, including their parameters. According to [7], the operation of projecting an output point onto a bounded domain can be implemented using a differentiable optimization layer, which ensures that the output of the neural network satisfies the constraints. However, these approaches require solving a convex optimization problem for each forward pass of the neural network, which significantly complicates their implementation. Another method for solving an optimization problem with linear constraints is presented in [8]. It should be noted that the method may require significant computational resources and time to solve complex optimization problems. Moreover, it solves optimization problems with linear constraints only. Several approaches to solving constrained optimization problems were proposed in [9–11]. An analysis of these approaches can be found in review articles [12, 13].

As far as we know, there is currently no approach that allows one to implement and train neural networks whose output satisfies linear and quadratic constraints without solving an optimization problem at each forward pass of the neural network. Therefore, we present a new computationally simple method that imposes hard linear and quadratic constraints on the output values of a neural network. The main idea of the method is to map the vector of latent parameters of the neural network to a point that is guaranteed to lie inside the admissible set defined by the constraints. The mapping is implemented by a special additional layer of the neural network. The proposed method can be easily generalized to the case where constraints are imposed not only on the output data of the network, but also on the input data. Another feature of the approach is that projection of a point onto the admissible set can be implemented in a straightforward way within this framework. An important feature of the proposed method is its computational simplicity. For example, the computational complexity of the forward pass of the neural network layer implementing the method is \(O\left( {nm} \right)\) in the case of linear constraints and \(O({{n}^{2}}m)\) in the case of quadratic constraints, where n is the number of variables and m is the number of constraints.

The method proposed can be used in various applied problems. First of all, this is the solution of optimization problems with arbitrary differentiable loss functions and with linear or quadratic constraints. The method can be used to implement generative models with constraints. It can be used when constraints are imposed on specific subsets of points. There are many other applications where the input and output of neural networks must be constrained.

The contribution of the work can be summarized as follows:

1. A new computationally simple method is proposed that imposes hard linear and quadratic constraints on the output values of a neural network.

2. The implementation of the method with various types of constraints is considered, including linear and quadratic constraints, equality constraints, and constraints imposed on input and output data.

3. Various modifications of the proposed method are studied, including a model for obtaining solutions on the boundaries of the feasible set and projection models.

4. Numerical experiments illustrating the proposed method are presented. In particular, various optimization problems and classification problems are considered.

The software that implements the method can be found on the website: https://github.com/andruekonst/ConstraiNet/.

2 PROBLEM STATEMENT AND PROPOSED METHOD

Let \(z \in {{\mathbb{R}}^{d}}\) be the input vector of a neural network and \(x \in {{\mathbb{R}}^{n}}\) be the output vector (prediction) of this network. The neural network is considered a function \({{f}_{\theta }}:{{\mathbb{R}}^{d}} \to {{\mathbb{R}}^{n}}\) such that \(x = {{f}_{\theta }}\left( z \right)\), where \(\theta \in \Theta \) is the vector of trained parameters.

Let \(\Omega \subset {{\mathbb{R}}^{n}}\) be a convex feasible set of output data formed by a set of constraints in the form of \(m\) inequalities:

$$\Omega = \left\{ {x\,{\text{|}}\,{{h}_{i}}\left( x \right) \leqslant 0,\;i = 1,...,m} \right\},$$
(1)

where each function \({{h}_{i}}\left( x \right)\) is convex, so that each constraint \({{h}_{i}}\left( x \right) \leqslant 0\) defines a convex set.

Our goal is to build a neural network with constraints on the output data. In other words, it is necessary to build a model \(x = {{f}_{\theta }}\left( z \right):{{\mathbb{R}}^{d}} \to \Omega \) and impose constraints on x so that \(x \in \Omega \) for any \(z \in {{\mathbb{R}}^{d}}\), i.e.,

$$\forall z \in {{\mathbb{R}}^{d}}\left( {{{f}_{\theta }}\left( z \right) \in \Omega } \right).$$

2.1 Neural Network Layer with Constraints on Output Data

Let p be a fixed point inside the convex domain \(\Omega \), i.e., \(p \in \Omega \). Any point x from the set \(\Omega \) can be represented as

$$x = p + \alpha \cdot r,$$

where \(\alpha \geqslant 0\) is a scale parameter and \(r \in {{\mathbb{R}}^{n}}\) is a vector (ray from point p).

On the other hand, for any p and r, there is an upper bound \({{\bar {\alpha }}_{{p,r}}}\) of the parameter α, defined as

$${{\bar {\alpha }}_{{p,r}}} = {\text{max}}\left\{ {\alpha \geqslant 0\,{\text{|}}\,p + \alpha \cdot r \in \Omega } \right\}.$$

Moreover, the segment \(\left[ {p;p + {{{\bar {\alpha }}}_{{p,r}}} \cdot r} \right]\) belongs to \(\Omega \), because \(\Omega \) is a convex set. Determining the bound \({{\bar {\alpha }}_{{p,r}}}\) amounts to finding the point at which the ray \(r\) intersects one of the constraints.

We define a neural network layer that maps the ray \(r\) and scale s to a point in \(\Omega \) as

$${{g}_{p}}\left( {r,s} \right) = p + {{\alpha }_{{p,r}}}\left( s \right) \cdot r,$$

where \({{\alpha }_{{p,r}}}\left( s \right)\) is a function of the layer parameter s and \({{\bar {\alpha }}_{{p,r}}}\), defined as

$${{\alpha }_{{p,r}}}\left( s \right) = \sigma \left( s \right) \cdot {{\bar {\alpha }}_{{p,r}}},$$

\(\sigma \left( s \right):\mathbb{R} \to \left[ {0,1} \right]\) is a sigmoid function, i.e., a smooth monotonic function.

Such a layer is guaranteed to satisfy the following constraints:

$$\forall r \in {{\mathbb{R}}^{n}},\quad s \in \mathbb{R}\left( {{{g}_{p}}\left( {r,s} \right) \in \Omega } \right),$$

since the segment \(\left[ {p,p + {{{\bar {\alpha }}}_{{p,r}}} \cdot r} \right]\) belongs to \(\Omega \).

The mapping of the ray r and scalar \(s\) to a point inside the domain \(\Omega \) is schematically shown in Fig. 1. For the ray r issuing from the point p, the intersection \(p + {{\bar {\alpha }}_{{p,r}}} \cdot r\) with the domain’s boundary is found and, after scaling, the point g is obtained.

Fig. 1. Mapping \({{g}_{p}}\left( {r,s} \right)\).

For a system of constraints, it suffices to find the upper bound \({{\bar {\alpha }}_{{p,r}}}\) satisfying each constraint. Let \(\bar {\alpha }_{{p,r}}^{{\left( i \right)}}\) be the upper bound of the parameter \(\alpha \), corresponding to the ith constraint \(\left( {{{h}_{i}}\left( x \right) \leqslant 0} \right)\) of system (1). Then, the upper bound of the entire system is specified so that \([p,p + {{\bar {\alpha }}_{{p,r}}} \cdot r] \subseteq [p,p + \bar {\alpha }_{{p,r}}^{{\left( i \right)}} \cdot r]\), i.e.,

$${{\bar {\alpha }}_{{p,r}}} = {\text{min}}\{ \bar {\alpha }_{{p,r}}^{{\left( i \right)}}\} _{{i = 1}}^{m}.$$
(2)

The search for the upper bound \({{\bar {\alpha }}_{{p,r}}}\) under linear constraints is schematically shown in Fig. 2.

Fig. 2. Illustration of the search for the upper bound \({{\bar {\alpha }}_{{p,r}}}\) over the intersected constraints.

The computational complexity of the forward pass of the described neural network layer is directly proportional to the number of constraints and the computational complexity of finding the intersection with one constraint.

Proposition 1. Any vector \(x \in \Omega \) can be represented using a layer \({{g}_{p}}\left( {r,s} \right)\). For any input \(\left( {r,s} \right)\), the output of the layer \({{g}_{p}}\left( {r,s} \right)\) belongs to the set \(\Omega \).

Proof:

1. Any output vector \({{g}_{p}}(r,s)\) satisfies the constraints; i.e., for all \(r \in {{\mathbb{R}}^{n}}\) and \(s \in \mathbb{R}\), the condition \({{g}_{p}}(r,s) \in \Omega \) holds, because \({{\alpha }_{{p,r}}}(s) \leqslant \bar {\alpha }_{{p,r}}^{{(i)}}\) and \([p,p + \bar {\alpha }_{{p,r}}^{{(i)}} \cdot r] \subset \Omega \). Hence, \({{g}_{p}}(r,s) \in [p,p + {{\alpha }_{{p,r}}}(s) \cdot r] \subset \Omega \).

2. Any point \(x \in \Omega \) can be represented using the layer \({{g}_{p}}\left( {r,s} \right)\). Indeed, let \(r = x - p\). Since \(x \in \Omega \), we have \({{\bar {\alpha }}_{{p,r}}} \geqslant 1\), so there exists \(s\) (with \(s \to + \infty \) in the limiting case \({{\bar {\alpha }}_{{p,r}}} = 1\)) such that \(\sigma \left( s \right) \cdot {{\bar {\alpha }}_{{p,r}}} = 1\). Then, \({{g}_{p}}\left( {x - p,s} \right) = p + 1 \cdot \left( {x - p} \right) = x\), which was to be proved.

To obtain the model \({{f}_{\theta }}\left( z \right):{{\mathbb{R}}^{d}} \to \Omega \), it suffices to feed the outputs \({{r}_{\theta }}\left( z \right)\) and \({{s}_{\theta }}\left( z \right)\) of the main neural network into the layer \({{g}_{p}}\left( {r,s} \right)\):

$${{f}_{\theta }}\left( z \right) = g\left( {{{r}_{\theta }}\left( z \right),{{s}_{\theta }}\left( z \right)} \right).$$

This aggregate model also forms a neural network, which can be trained by a backpropagation algorithm.
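To make the construction concrete, the following minimal PyTorch sketch shows how such a layer could be implemented. The class name and the helper max_alpha are our own illustrative names (not the API of the accompanying ConstraiNet code); max_alpha(p, r) is assumed to return \({{\bar {\alpha }}_{{p,r}}}\) for a batch of rays and to be finite, which holds when \(\Omega \) is bounded along every ray.

```python
import torch
import torch.nn as nn

class ConstraintOutputLayer(nn.Module):
    """Maps (r, s) to p + sigmoid(s) * alpha_bar_{p,r} * r, which always lies in Omega."""

    def __init__(self, p, max_alpha):
        super().__init__()
        # p: an interior point of Omega; max_alpha(p, r): upper bound alpha_bar_{p,r}.
        self.register_buffer('p', p)
        self.max_alpha = max_alpha

    def forward(self, r, s):
        alpha_bar = self.max_alpha(self.p, r)                 # (batch,)
        alpha = torch.sigmoid(s.reshape(-1)) * alpha_bar      # sigma(s) * alpha_bar
        return self.p.unsqueeze(0) + alpha.unsqueeze(-1) * r  # (batch, n), inside Omega
```

If some rays never leave \(\Omega \) (i.e., \({{\bar {\alpha }}_{{p,r}}} = + \infty \)), the bound would have to be clamped to a large constant before the multiplication.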

2.2 Linear Constraints

In the case of linear constraints, \({{\bar {\alpha }}_{{p,r}}}\) is determined by the intersection of a ray from a point p in the direction r with the set of constraints. Consider an intersection with one linear constraint of the form \(a_{i}^{{\text{T}}}x \leqslant {{b}_{i}}.\) The upper bound of the parameter \(\alpha \) is determined by the solution of the system

$$\left\{ {\begin{array}{*{20}{l}} {x = p + \alpha \cdot r} \\ {a_{i}^{{\text{T}}}x = {{b}_{i}}} \\ {\alpha \geqslant 0.} \end{array}} \right.$$
(3)

Therefore, if a solution exists, then the following is true:

$$a_{i}^{{\text{T}}}p + \alpha \cdot a_{i}^{{\text{T}}}r = {{b}_{i}},$$
$$\bar {\alpha }_{{p,r}}^{{\left( i \right)}} = \frac{{{{b}_{i}} - a_{i}^{{\text{T}}}p}}{{a_{i}^{{\text{T}}}r}}.$$

If \(a_{i}^{{\text{T}}}r = 0\) or \(\bar {\alpha }_{{p,r}}^{{\left( i \right)}} < 0\), then (3) has no solution and \(\bar {\alpha }_{{p,r}}^{{\left( i \right)}}\) can be set to \( + \infty \). If a system of inequalities is given, then the upper bound is determined using (2).

In the case of linear constraints, the computational complexity of a direct pass through the neural network layer is \(O\left( {nm} \right)\).
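As an illustration, a vectorized bound computation for a system \(Ax \leqslant b\) might be sketched as follows; the function name and signature are ours (a hedged sketch, not the reference implementation), and p is assumed to be strictly feasible, so that \(b - Ap > 0\).

```python
import torch

def max_alpha_linear(p, r, A, b, eps=1e-12):
    """Upper bound alpha_bar_{p,r} for the constraints A x <= b (batch of rays r).

    Cost is O(n m) per ray: one matrix-vector product plus a minimum over constraints.
    """
    slack = b - A @ p                      # (m,)  positive slack at the interior point
    rate = r @ A.T                         # (batch, m)  a_i^T r for each ray/constraint
    alpha_i = slack / torch.clamp(rate, min=eps)
    # Constraints with a_i^T r <= 0 are never reached along the ray: bound is +inf.
    alpha_i = torch.where(rate > eps, alpha_i, torch.full_like(alpha_i, float('inf')))
    return alpha_i.min(dim=-1).values      # (batch,)  minimum over constraints, Eq. (2)
```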

2.3 Quadratic Constraints

Let the ith quadratic constraint be given as

$$\frac{1}{2}{{x}^{{\text{T}}}}{{P}^{{\left( i \right)}}}x + q_{i}^{{\text{T}}}x \leqslant {{b}_{i}},$$

where \({{P}^{{\left( i \right)}}}\) is a positive semidefinite matrix. Then, the intersection of the ray with the constraint is given by the equation

$$\frac{1}{2}{{\left( {p + \alpha \cdot r} \right)}^{{\text{T}}}}{{P}^{{\left( i \right)}}}\left( {p + \alpha \cdot r} \right) + q_{i}^{{\text{T}}}\left( {p + \alpha \cdot r} \right) = {{b}_{i}}.$$

Depending on the coefficient of \({{\alpha }^{2}}\), two cases are possible:

1. If \({{r}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r = 0\), then the equation is linear and has a solution

$$\alpha = - \frac{{q_{i}^{{\text{T}}}p + \frac{1}{2}{{p}^{{\text{T}}}}{{P}^{{\left( i \right)}}}p - {{b}_{i}}}}{{{{p}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r + q_{i}^{{\text{T}}}r}}.$$

2. If \({{r}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r > 0\), then there are two solutions, but we consider only the larger positive one, corresponding to the movement in the direction of the ray. This solution is

$$\alpha = \frac{{ - ({{p}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r + q_{i}^{{\text{T}}}r) + \sqrt {D{\text{/}}4} }}{{{{r}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r}},$$
$$\begin{gathered} \frac{D}{4} = {{({{p}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r + q_{i}^{{\text{T}}}r)}^{2}} \\ \, - ({{r}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r) \cdot (2q_{i}^{{\text{T}}}p + {{p}^{{\text{T}}}}{{P}^{{\left( i \right)}}}p - 2{{b}_{i}}), \\ \end{gathered} $$

since the denominator is positive.

The case of \({{r}^{{\text{T}}}}{{P}^{{\left( i \right)}}}r < 0\) is impossible, since the matrix is positive semidefinite. Otherwise, the constraint would specify a nonconvex set.

If \(\alpha \geqslant 0\), then \(\bar {\alpha }_{{p,r}}^{{\left( i \right)}} = \alpha \). Otherwise, if the ray does not intersect the constraint, then \(\bar {\alpha }_{{p,r}}^{{\left( i \right)}} = + \infty \). For a system of quadratic constraints, the upper bound has the form (2).

In the case of quadratic constraints, the computational complexity of the forward pass of the neural network layer is \(O({{n}^{2}}m)\).
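A hedged sketch of the corresponding bound computation follows; the function name is ours, P, q, and b describe a single constraint, p is assumed strictly feasible, and the minimum over constraints from (2) is taken outside this routine.

```python
import torch

def max_alpha_quadratic(p, r, P, q, b, eps=1e-12):
    """Upper bound alpha_bar_{p,r}^{(i)} for one constraint 0.5 x^T P x + q^T x <= b."""
    a2 = torch.einsum('bi,ij,bj->b', r, P, r)        # r^T P r            (batch,)
    a1 = r @ (P @ p) + r @ q                         # p^T P r + q^T r    (batch,)
    a0 = 0.5 * p @ (P @ p) + q @ p - b               # scalar, < 0 for strictly feasible p
    # Quadratic case: larger root of 0.5 * a2 * alpha^2 + a1 * alpha + a0 = 0.
    disc = torch.clamp(a1 ** 2 - 2.0 * a2 * a0, min=0.0)        # this is D/4
    alpha_quad = (-a1 + torch.sqrt(disc)) / torch.clamp(a2, min=eps)
    # Degenerate case r^T P r = 0: linear equation, a positive root only if a1 > 0.
    alpha_lin = -a0 / torch.where(a1.abs() > eps, a1, torch.full_like(a1, eps))
    alpha_lin = torch.where(a1 > eps, alpha_lin, torch.full_like(a1, float('inf')))
    alpha = torch.where(a2 > eps, alpha_quad, alpha_lin)
    # Negative roots mean the ray never crosses the constraint: bound is +inf.
    return torch.where(alpha >= 0, alpha, torch.full_like(alpha, float('inf')))
```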

2.4 Equality Constraints

Consider the case when the domain \(\Omega \) is specified by a system of linear equalities and inequalities of the form

$$x \in \Omega \Leftrightarrow \left\{ {\begin{array}{*{20}{l}} {Ax \leqslant b} \\ {Qx = p.} \end{array}} \right.$$
(4)

In this case, the problem can be reduced to (1), i.e., to a system of inequalities. To do this, we find and fix a vector u satisfying the system \(Qu = p\). If the system has no solutions, then \(\Omega \) is empty. If there is only one solution, then \(\Omega \) consists of at most one point. Otherwise, there are infinitely many solutions and it suffices to choose any one of them, e.g., by solving the least squares problem:

$${\text{||}}Qu - p{\text{|}}{{{\text{|}}}^{2}} \to {\text{min}}.$$

Then we can find a matrix R whose columns form a basis of the kernel of Q; i.e., R satisfies the condition

$$\forall w\left( {QRw = 0} \right).$$

The matrix R can be obtained using the SVD decomposition of the matrix \(Q \in {{\mathbb{R}}^{{\mu \times n}}}\):

$$Q = US{{V}^{{\text{T}}}},$$

where \(U \in {{\mathbb{R}}^{{\mu \times \mu }}}\) and \(V \in {{\mathbb{R}}^{{n \times n}}}\) are orthogonal matrices and \(S \in {{\mathbb{R}}^{{\mu \times n}}}\) is a matrix with non-zero values only on the diagonal, sorted in descending order.

Then, the matrix R is given as

$$R = \left( {{{v}_{{\rho + 1}}}, \ldots ,{{v}_{n}}} \right),$$

where \(\rho \) is the number of nonzero diagonal elements of the matrix S, \(\delta = n - \rho \), and \({{v}_{{\rho + 1}}}, \ldots ,{{v}_{n}}\) are the last \(\delta \) columns of the matrix V, which correspond to zero singular values and span the kernel of Q.

Hence,

$$\forall w \in {{\mathbb{R}}^{\delta }}\left( {Q\left( {Rw + u} \right) = p} \right).$$

The new system of constraints on the vector \(w\) is defined as

$$A\left( {Rw + u} \right) \leqslant b,$$

or, in canonical form,

$$Bw \leqslant t,$$
(5)

where \(B = AR\) and \(t = b - Au\).

Thus, w is the vector of variables for the new system (5). For any w, a vector x satisfying the original system (4) can be reconstructed in the form \(x = Rw + u\). As a result, the solution model is given as

$${{f}_{\theta }}\left( z \right) = R{{\tilde {f}}_{\theta }}\left( z \right) + u,$$
(6)

where \({{\tilde {f}}_{\theta }}\left( z \right)\) is the model for constraints (5).
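As an illustration, a hedged sketch of this reduction is given below (the function name is ours; the particular solution u is obtained via the pseudoinverse and the kernel basis via the SVD, and we write p_eq for the right-hand side of the equality system to avoid clashing with the interior point p).

```python
import torch

def reduce_equality_constraints(A, b, Q, p_eq, tol=1e-10):
    """Reduce {A x <= b, Q x = p_eq} to inequality-only constraints B w <= t.

    Any feasible x is recovered as x = R w + u, where the columns of R span ker(Q).
    """
    u = torch.linalg.pinv(Q) @ p_eq          # least-squares solution of Q u = p_eq
    _, S, Vh = torch.linalg.svd(Q, full_matrices=True)
    rank = int((S > tol).sum())
    R = Vh[rank:].T                          # (n, n - rank): kernel basis of Q
    B = A @ R                                # reduced constraint matrix, Eq. (5)
    t = b - A @ u                            # reduced right-hand side
    return R, u, B, t
```

The reduced pair (B, t) is then handled by the inequality-only layer, and the final output is recovered as \(x = R{{\tilde {f}}_{\theta }}\left( z \right) + u\), as in (6).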

In a more general case, if an arbitrary convex set \(\Omega \) is given as the intersection of convex inequality constraints (1) and one additional equality constraint:

$$x \in \Omega \Leftrightarrow \left\{ {\begin{array}{*{20}{l}} {{{h}_{i}}\left( x \right) \leqslant 0} \\ {Qx = p,} \end{array}} \right.$$

then we can use a change of variables to obtain new (possibly nonlinear) constraints:

$$x \in \Omega \Leftrightarrow \left\{ {\begin{array}{*{20}{l}} {{{h}_{i}}\left( {Rw + u} \right) \leqslant 0} \\ {x = Rw + u,} \end{array}} \right. \Leftrightarrow \left\{ {\begin{array}{*{20}{l}} {\tilde {h}_{i}^{{\left( {R,u} \right)}}\left( w \right) \leqslant 0} \\ {x = Rw + u.} \end{array}} \right.$$

Thus, a model \({{\tilde {f}}_{\theta }}\left( z \right)\) can be built to generate solutions satisfying the nonlinear constraints \(\tilde {h}_{i}^{{\left( {R,u} \right)}}\left( {{{{\tilde {f}}}_{\theta }}\left( z \right)} \right) \leqslant 0\), and the solutions for x are then obtained through (6).

2.5 Constraints on Input and Output Data

In practice, it may be necessary not only to impose constraints on the output vector \({{f}_{\theta }}\left( z \right)\), but also to impose joint constraints involving the input vector. Let there be given a convex region of constraints on the input vector z and the output vector \({{f}_{\theta }}\left( z \right)\):

$$\Lambda \subset {{\mathbb{R}}^{n}} \times {{\mathbb{R}}^{d}},$$

i.e., for any z, the model \({{f}_{\theta }}\left( z \right)\) should satisfy the condition

$$y = \left[ {\begin{array}{*{20}{c}} {{{f}_{\theta }}\left( z \right)} \\ z \end{array}} \right] \in \Lambda .$$

Here, y is the concatenation of vectors \({{f}_{\theta }}\left( z \right)\) and \(z\). If the domain is specified as the intersection of convex constraints,

$$\Lambda = \left\{ {y\,{\text{|}}\,{{\Gamma }_{i}}\left( y \right) \leqslant 0,\;i = 1, \ldots ,m} \right\},$$

then, for fixed z, we can construct by substitution a new system of constraints only on the output vector \({{f}_{\theta }}\left( z \right)\):

$$G\left( z \right) = \left\{ {x\,{\text{|}}\,{{\gamma }_{i}}\left( {x;z} \right) \leqslant 0,\;i = 1,...,m} \right\},$$

where \({{\gamma }_{i}}\left( {x;z} \right)\) are obtained by the substitution of z into \({{\Gamma }_{i}}\).

Here, \({{\gamma }_{i}}\) depends on z as on fixed parameters and the only variable is x. For example, if \({{\Gamma }_{i}}\) is a linear function, then, after substituting the parameters, \({{\gamma }_{i}}\left( {x;z} \right) \leqslant 0\) will be a new linear constraint on x or be satisfied automatically. If \({{\Gamma }_{i}}\) is a quadratic function, then the constraint on x will either be quadratic, linear, or satisfied automatically.
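For the linear case, a minimal sketch of this substitution is given below (names are ours; the joint constraint is written as \({{A}_{x}}x + {{A}_{z}}z \leqslant b\)).

```python
import torch

def substitute_input(A_x, A_z, b, z):
    """Fix the input z in the joint linear constraints A_x x + A_z z <= b.

    Returns the pair (A, b') describing G(z) = {x | A_x x <= b - A_z z}.
    """
    return A_x, b - A_z @ z
```

The resulting pair can be passed directly to a linear bound computation such as the one sketched in Section 2.2.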

New dynamic constraints on output data depending on the input vector \(z\) satisfy the condition

$${{f}_{\theta }}\left( z \right) \in G\left( z \right),$$

provided that the input vector \(z\) is from the feasible set:

$$z \in \left\{ {z\,{\text{|}}\,\exists x:\left[ {\begin{array}{*{20}{c}} x \\ z \end{array}} \right] \in \Lambda } \right\}.$$

Such constraints may change with \(z\) and can be implemented in a neural network provided that a fixed point \(p \in G\left( z \right)\) is available.

2.6 Projection Model

Note that, using the proposed neural network layer with constraints on the output data, a projection model can be constructed that maps points into \(\Omega \) and has the property of idempotency, i.e.,

$$\forall x \in {{\mathbb{R}}^{n}}\left( {{{f}_{\theta }}\left( {{{f}_{\theta }}\left( x \right)} \right) = {{f}_{\theta }}\left( x \right)} \right).$$

This model can be implemented in two ways:

1. The first way is to train the model \({{f}_{\theta }}\left( z \right) = g({{r}_{\theta }}(z),{{s}_{\theta }}(z))\) to approximate the identity transformation on \(\Omega \) by minimizing a functional that penalizes the distance between the image and the preimage, e.g., for an orthogonal projection in the \({{L}_{p}}\) norm:

$$\mathcal{L} = \frac{1}{N}\mathop \sum \limits_{i = 1}^N {\text{||}}{{f}_{\theta }}\left( {{{x}_{i}}} \right) - {{x}_{i}}{\text{|}}{{{\text{|}}}_{p}}.$$
(7)

This method can be used when it is necessary to build projection models for complex metrics, e.g., specified by neural networks.

2. The second way is the central projection, which is obtained without optimizing the model by specifying the ray \({{r}_{\theta }}\left( z \right): = z - p\). In this case, the scale is specified explicitly, without the sigmoid, as \({{\alpha }_{{p,r}}}\left( s \right) = {\text{min}}\left\{ {1,{{{\bar {\alpha }}}_{{p,r}}}} \right\}\). Then,

$$\begin{gathered} {{{\tilde {g}}}_{p}}\left( {r\left( x \right)} \right) = p + {\text{min}}\left\{ {1,{{{\bar {\alpha }}}_{{p,\left( {x - p} \right)}}}} \right\} \cdot \left( {x - p} \right) \\ \, = \left\{ \begin{gathered} x,\quad {{{\bar {\alpha }}}_{{p,\left( {x - p} \right)}}} \geqslant 1 \hfill \\ \left( {1 - {{{\bar {\alpha }}}_{{p,\left( {x - p} \right)}}}} \right)p + {{{\bar {\alpha }}}_{{p,\left( {x - p} \right)}}}x,\quad {\text{otherwise}}{\text{.}} \hfill \\ \end{gathered} \right. \\ \end{gathered} $$
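A minimal sketch of the central projection (our names; reusing a bound routine such as the max_alpha_linear sketch above) is:

```python
import torch

def central_projection(x, p, max_alpha):
    """Central projection of x onto Omega through the interior point p.

    Points already inside Omega are returned unchanged; points outside are mapped
    to the boundary along the ray from p through x, so the map is idempotent.
    """
    r = x - p                                      # (batch, n)
    alpha = torch.clamp(max_alpha(p, r), max=1.0)  # min{1, alpha_bar_{p, x - p}}
    return p + alpha.unsqueeze(-1) * r
```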

Examples of the implementation of the projection model for quadratic constraints are shown in Fig. 3. For each of the options, a vector field is given, where the beginning of each arrow corresponds to the preimage and the end, to its projection onto a set defined by 5 constraints. On the left is the result of a neural network whose parameters minimize (7); it contains artifacts corresponding to regions with large approximation errors. On the right is the result of a neural network without trained parameters implementing the central projection, where there are no errors.

Fig. 3. Examples of projection models for quadratic constraints.

2.7 Constructing Solutions at the Boundary

The method developed can also be applied to construct solutions on the boundary set \(\partial \Omega \), which is generally nonconvex. To do this, we set \(\sigma \left( s \right) = 1\). Then,

$$g\left( r \right) = p + {{\bar {\alpha }}_{{p,r}}} \cdot r \in \partial \Omega .$$

Hence, g(r) is on the boundary for \(r \ne 0\).

It is noteworthy that this approach allows one to construct a mapping onto a nonconvex connected union of convex domains. In contrast, any method that relies on a convex combination of basis vectors, with weights obtained via the softmax operation, can construct points only inside the domain rather than on its boundary. For example, consider the problem of projecting points onto the boundary of a convex set:

$$\begin{array}{*{20}{c}} {{\text{min||}}z - {{f}_{\theta }}\left( z \right){\text{|}}{{{\text{|}}}_{p}}} \\ {{\text{under constraints}}\;{{f}_{\theta }}\left( z \right) \in \partial \Omega .} \end{array}$$
(8)

Examples of projection onto a domain specified by a set of linear constraints for the \({{L}_{1}}\) and \({{L}_{2}}\) norms are shown in Fig. 4. To solve each of the problems, a neural network containing 5 layers of size 100 was trained by minimizing (8); its set of output values is \(\partial \Omega \). It can be seen from Fig. 4 that the generated points lie on the boundary of the constraints.

Fig. 4. Examples of approximation of projection onto a boundary using the L1 and L2 norms.

Fig. 5. Comparison of the running time of the proposed neural network and projection using CVXPYLayers.

Fig. 6. Optimization trajectories for the Rosenbrock [15] and Bird [16] functions.

3 EXPERIMENTS

3.1 Optimization Problems

One of the applications of the proposed approach is solving optimization problems with constraints. For each particular problem, the vector of input parameters \(~{{z}_{i}}\) is optimized in order to minimize the loss function \(~{{l}_{i}}\left( {{{f}_{\theta }}\left( {{{z}_{i}}} \right)} \right)\). For testing, 50 sets of problems and constraints were generated for output dimensions of 2, 5, and 10, with 50, 100, and 200 constraints separately for the cases of linear and quadratic constraints, so that the domain \(\Omega \) was bounded.

For linear constraints, we generated \(m\) vectors \({{a}_{1}}, \ldots ,{{a}_{m}} \sim \mathcal{N}\left( {0,I} \right)\) defining both points belonging to the hyperplanes and the normals to these hyperplanes. Then, \({{b}_{i}} = a_{i}^{{\text{T}}}{{a}_{i}}\) and \(\Omega = \{ x\,{\text{|}}\,a_{i}^{{\text{T}}}x \leqslant {{b}_{i}},\;i = 1,...,m\} \). For quadratic constraints, positive semidefinite matrices \({{P}^{{\left( i \right)}}}\) and vectors \({{q}_{i}} \sim \mathcal{N}\left( {0,I} \right)\), \(i = 1, \ldots ,m\), were generated. The constraints were then shifted so that \({{b}_{i}} > 0\). As a result, we obtain a system of quadratic constraints \(\Omega = \{ x\,{\text{|}}\,{{x}^{{\text{T}}}}{{P}^{{(i)}}}x + q_{i}^{{\text{T}}}x \leqslant {{b}_{i}},\;i = 1, \ldots ,m\} .\)

The relative error for comparison of models is calculated from the value of the loss function of the obtained solution, \({{l}_{i}}\left( {{{f}_{\theta }}\left( {{{z}_{i}}} \right)} \right)\), relative to the loss function of the reference solution, \({{l}_{i}}(x_{i}^{*})\):

$$RE = \frac{{{\text{max}}\left\{ {0,{{l}_{i}}\left( {{{f}_{\theta }}\left( {{{z}_{i}}} \right)} \right) - {{l}_{i}}(x_{i}^{*})} \right\}}}{{{\text{|}}{{l}_{i}}(x_{i}^{*}){\text{|}}}} \cdot 100\% .$$

Reference solutions were obtained using the algorithm OSQP [14], designed exclusively for solving linear and quadratic problems.

Table 1 shows the average relative errors for optimization problems with linear (LL) and quadratic (QL) loss functions, as well as linear (LC) and quadratic (QC) constraints. It can be seen from the table that the method allows one to optimize the input parameters r and \(s\) of the proposed layer and that the layer does not degrade the gradient flow through the entire neural network.

Table 1. RE for problems with various loss functions and constraints

For comparison with the proposed neural network, the CVXPYLayers library [7] was chosen, which allows one to specify differentiable optimization layers within a neural network. Figure 5 shows a comparison of the optimization execution time with the same hyperparameters, with the CVXPYLayers run stopped 5 minutes after the start of optimization even if it had not converged. As can be seen from the figure, the proposed algorithm incurs significantly less computational overhead.

The structure of the neural network makes it possible to implement algorithms for solving arbitrary problems of both convex and nonconvex optimization. Figure 6 shows optimization trajectories for the Rosenbrock function [15] with quadratic constraints:

$${{\mathcal{L}}_{{{\text{Ros}}}}}\left( x \right) = {{(1 - {{x}_{1}})}^{2}} + 100{{({{x}_{2}} - x_{1}^{2})}^{2}},$$
$${{\Omega }_{{{\text{Ros}}}}} = \left\{ {x\,{\text{|}}\,x_{1}^{2} + x_{2}^{2} \leqslant 2} \right\},$$

and the Bird function [16], which has 4 local minima in \({{\Omega }_{{{\text{Bird}}}}}~\), two of which lie on the boundary of \({{\Omega }_{{{\text{Bird}}}}}~\):

$$\begin{gathered} {{\mathcal{L}}_{{{\text{Bird}}}}}\left( x \right) = \sin \left( {{{x}_{2}}} \right){{e}^{{{{{\left( {1 - \cos \left( {{{x}_{1}}} \right)} \right)}}^{2}}}}} \\ \, + \cos \left( {{{x}_{1}}} \right){{e}^{{{{{\left( {1 - \sin \left( {{{x}_{2}}} \right)} \right)}}^{2}}}}} + {{\left( {{{x}_{1}} - {{x}_{2}}} \right)}^{2}}, \\ \end{gathered} $$
$${{\Omega }_{{{\text{Bird}}}}} = \{ x\,{\text{|}}\,{{\left( {{{x}_{1}} + 5} \right)}^{2}} + {{\left( {{{x}_{2}} + 5} \right)}^{2}} < 25\} .$$

The function \({{\mathcal{L}}_{{{\text{Ros}}}}}\left( x \right)\) has a global minimum at the point (1, 1). The boundaries of \({{\Omega }_{{{\text{Ros}}}}}\) and \({{\Omega }_{{{\text{Bird}}}}}~\) are shown as large black circles. 2000 iterations of the Adam algorithm were used with a learning rate of 0.1; 9 points on a uniform mesh from –0.75 to 0.75 were selected as starting points, for which the optimization trajectories are shown in Fig. 6, where the end point is indicated by a white star.
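As an illustration of this use of the layer, the following hedged sketch optimizes the layer inputs (r, s) for the constrained Rosenbrock problem with Adam, reusing the hypothetical max_alpha_quadratic routine from the Section 2.3 sketch; the disk \(x_{1}^{2} + x_{2}^{2} \leqslant 2\) is written as \(\tfrac{1}{2}{{x}^{{\text{T}}}}Px \leqslant 2\) with \(P = 2I\), and the hyperparameters follow the text (2000 Adam iterations, learning rate 0.1).

```python
import torch

# Constraint x1^2 + x2^2 <= 2 in the form 0.5 x^T P x + q^T x <= b with P = 2 I.
P = 2.0 * torch.eye(2)
q = torch.zeros(2)
b = torch.tensor(2.0)
p0 = torch.zeros(2)                                 # interior point of Omega_Ros

def rosenbrock(x):
    return (1.0 - x[:, 0]) ** 2 + 100.0 * (x[:, 1] - x[:, 0] ** 2) ** 2

# Optimize the layer inputs (r, s) directly, as in Section 3.1.
r = torch.randn(1, 2, requires_grad=True)
s = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([r, s], lr=0.1)

for _ in range(2000):
    opt.zero_grad()
    alpha_bar = max_alpha_quadratic(p0, r, P, q, b)             # Section 2.3 sketch
    x = p0 + (torch.sigmoid(s) * alpha_bar).unsqueeze(-1) * r   # always inside Omega_Ros
    rosenbrock(x).sum().backward()
    opt.step()
```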

3.2 Example of Classification

To illustrate the capabilities of the proposed method, we considered a classification problem on the classic “Iris” dataset with constraints on the upper bounds \({{\bar {p}}_{i}}\) of each class probability \({{x}_{i}}\):

$${{f}_{\theta }}\left( z \right) \in \left\{ {x\,{\text{|}}\,0 \leqslant {{x}_{i}} \leqslant {{{\bar {p}}}_{i}},\;{{1}^{{\text{T}}}}x = 1} \right\}.$$
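This constraint set combines box inequalities with one linear equality, so it can be assembled and reduced with the equality-elimination sketch from Section 2.4 (all names below, including reduce_equality_constraints, are our own illustrative ones):

```python
import torch

n = 3                                            # number of classes in Iris
p_bar = torch.full((n,), 0.75)                   # upper bounds on class probabilities

# Inequalities: -x <= 0 and x <= p_bar; equality: 1^T x = 1.
A = torch.cat([-torch.eye(n), torch.eye(n)], dim=0)
b = torch.cat([torch.zeros(n), p_bar])
Q = torch.ones(1, n)
p_eq = torch.ones(1)

# Reduce to B w <= t on the latent vector w; the classifier head then outputs
# probabilities as R @ f_tilde(z) + u, which satisfy all constraints by construction.
R, u, B, t = reduce_equality_constraints(A, b, Q, p_eq)
```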

These constraints can play a balancing role in learning, reducing the influence of already correctly classified points on the loss function. We chose this classic dataset, containing 3 classes and 150 examples, to illustrate how a neural network learns under these constraints and to make the results easy to visualize: three classes allow the results to be shown on a single probability simplex. In this example, we set the upper bounds \({{\bar {p}}_{i}} = 0.75\) for i = 1, 2, 3. Figure 7 shows unit simplexes and points that are the outputs of a neural network trained for 100, 500, and 1000 epochs. The neural network consists of 5 layers of size 64. The colors of the dots indicate the corresponding classes (Setosa, Versicolor, Virginica). It can be seen from Fig. 7 that the constraints take effect not only when the output points are very close to them, but throughout the entire training of the network.

Fig. 7. Class probabilities for different numbers (100, 500, 1000) of training epochs.

4 CONCLUSIONS

A method has been presented that imposes hard constraints on the output values of a neural network. The method is extremely simple from a computational point of view, is implemented by an additional layer of a neural network, and allows one to impose a large class of constraints: linear and quadratic inequalities, linear equalities, and joint constraints on inputs and outputs. The method can be directly extended to any convex constraints. It allows one to approximate orthogonal projections onto a convex set or generate vectors on a boundary set.

A disadvantage of the method is that it does not allow working with purely conic constraints. In the future, the proposed method can be modified to solve problems with such constraints. In addition, it is interesting to extend the ideas of the proposed method to the case of nonconvex constraints.

An important direction is also the consideration of various machine learning applications that require taking into account constraints, e.g., physics-informed neural networks [17]. Each application can be regarded as a separate problem for further study.